(3) A check is performed to determine if the address is local to the issuing processor. (4) If not, a read request transaction is performed to deliver the request to the designated processor, which (5) accesses the specified address, and (6) returns the value in a reply transaction to the original node, which (7) completes the memory access.
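The numbered steps above can be read as a simple request/reply handler pair. The fragment below is an illustrative sketch only; the node-id helpers and net_* transaction primitives are hypothetical names, not the interface of any particular machine.

```c
/* Illustrative sketch of the read request/reply protocol.
 * All net_* primitives and node-id helpers are hypothetical. */
#include <stdint.h>

typedef struct { int dest_node; int src_node; uint64_t addr; uint64_t value; } read_msg_t;

extern int        my_node(void);
extern int        home_node(uint64_t global_addr);    /* which node hosts this address */
extern uint64_t   local_read(uint64_t global_addr);   /* access local memory           */
extern void       net_send(int node, read_msg_t *m);  /* one-way network transaction   */
extern read_msg_t net_wait_reply(void);               /* block until the reply arrives */

uint64_t shared_read(uint64_t global_addr)
{
    /* (3) check whether the address is local to the issuing processor */
    if (home_node(global_addr) == my_node())
        return local_read(global_addr);

    /* (4) read request transaction to the designated processor */
    read_msg_t req = { home_node(global_addr), my_node(), global_addr, 0 };
    net_send(req.dest_node, &req);

    /* (7) the access completes when the reply returns */
    return net_wait_reply().value;
}

/* Executed at the home node when a read request arrives. */
void handle_read_request(read_msg_t *req)
{
    read_msg_t reply = { req->src_node, my_node(), req->addr,
                         local_read(req->addr) };   /* (5) access the address */
    net_send(req->src_node, &reply);                /* (6) reply transaction  */
}
```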
arise with many different communication abstractions, so they will be addressed in more detail after looking at the corresponding protocols for message passing abstractions. When supporting a shared address space abstraction, we need to ask whether it is coherent and what memory consistency model it supports. In this chapter we consider designs that do not replicate data automatically through caches; all of the next chapter is devoted to that topic. Thus, each remote read or write goes to the node hosting the address and accesses the location there. Coherence is therefore met by the natural serialization involved in going through the network and accessing memory. One important subtlety is that the accesses from remote nodes need to be coherent with accesses from the local node. Thus, if shared data is cached locally, processing the remote reference needs to be cache coherent within the node. Achieving sequential consistency in scalable machines is more challenging than in bus-based designs because the interconnect does not serialize memory accesses to locations on different nodes. Furthermore, since the latencies of network transactions tend to be large, we are tempted to try to hide this cost whenever possible. In particular, it is very tempting to issue multiple write transactions without waiting for the completion acknowledgments to come back in between. To see how this can undermine the consistency model, consider our familiar flag-based code fragment executing on a multiprocessor with physically distributed memory but no caches. The variables A and flag are allocated in two different processing nodes, as shown in Figure 7-9(a). Because of delays in the network, the process spinning on flag (P3 in the figure) may observe the stores to A and flag in the reverse of the order in which they were generated. Ensuring point-to-point ordering among packets between each
pair of nodes does not remedy the situation, because multiple pairs of nodes may be involved. A situation with a possible reordering due to the use of different paths within the network is illustrated in Figure 7-9(b). Overcoming this problem is one of the reasons why writes need to be acknowledged. A correct implementation of this construct will wait for the write of A to complete before issuing the write of flag. By using the completion transactions for the writes and the read response, it is straightforward to meet the sufficient conditions for sequential consistency. The deeper design question is how to meet these conditions while minimizing the amount of waiting, by determining as early as possible that a write has been committed and will appear to all processors as if it had been performed.
P1: A=1; flag=1;
P3: while (flag==0); print A;
[Figure 7-9 artwork omitted: part (a) shows the store to A delayed in the interconnection network; part (b) shows the two stores taking different paths, with the path from P1 to P2 congested.]
Figure 7-9 Possible reordering of memory references for shared flags
The network is assumed to preserve point-to-point order. The processors have no caches, as shown in part (a). The variable A is assumed to be allocated out of P2's memory, while the variable flag is assumed to be allocated from P3's memory. It is assumed that the processors do not stall on a store instruction, as is true for most uniprocessors. It is easy to see that if there is network congestion along the links from P1 to P2, P3 can get the updated value of flag from P1 and then read the stale value of A (A=0) from P2. This situation can easily occur, as indicated in part (b), where messages are always routed first in the X dimension and then in the Y dimension of a 2D mesh.
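To make the fix described above concrete, the fragment below sketches how the sufficient conditions can be met in software: the store to A becomes an explicit remote write that is acknowledged before the store to flag is issued. The remote_write and wait_for_write_ack primitives are hypothetical, standing in for whatever write and completion transactions the machine provides.

```c
/* Sketch only: hypothetical remote-write primitives with explicit acknowledgment. */
extern void remote_write(int node, unsigned long addr, int value); /* issues a write transaction */
extern void wait_for_write_ack(void);                              /* blocks until ack received  */

void producer(int node_A, unsigned long addr_A, int node_flag, unsigned long addr_flag)
{
    remote_write(node_A, addr_A, 1);        /* A = 1                                    */
    wait_for_write_ack();                   /* wait until the write of A has performed  */
    remote_write(node_flag, addr_flag, 1);  /* only now issue flag = 1                  */
    wait_for_write_ack();
}
```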
7.3.3 Message Passing
A send/receive pair in the message passing model is conceptually a one-way transfer from a source area specified by the source user process to a destination address specified by the destination user process. In addition, it embodies a pairwise synchronization event between the two processes. In Chapter 2, we noted the important semantic variations on the basic message passing abstractions, such as synchronous and asynchronous message sends. User-level message passing models are implemented in terms of primitive network transactions, and the different synchronization semantics have quite different implementations, i.e., different network transaction protocols. In most early large-scale machines, these protocols were buried within the vendor kernel and library software. In more modern machines the primitive transactions are exposed to allow a wider set of programming models to be supported.
This chapter uses the concepts and terminology associated with MPI. Recall that MPI distinguishes when a call to a send or receive function returns from when the message operation completes. A synchronous send completes once the matching receive has executed, the source data buffer can be reused, and the data is guaranteed to arrive in the destination receive buffer. A buffered send completes as soon as the source data buffer can be reused, independent of whether the matching receive has been issued; the data may have been transmitted or it may be buffered somewhere in the system.1 Buffered send completion is asynchronous with respect to the receiver process. A receive completes when the message data is present in the receive destination buffer. A blocking function, send or receive, returns only after the message operation completes. A non-blocking function returns immediately, regardless of message completion, and additional calls to a probe function are used to detect completion. The protocols are concerned only with message operation and completion, regardless of whether the functions are blocking. To understand the mapping from user message passing operations to machine network transaction primitives, let us first consider synchronous messages. The only way for the processor hosting the source process to know whether the matching receive has executed is for that information to be conveyed by an explicit transaction. Thus, the synchronous message operation can be realized with a three-phase protocol of network transactions, as shown in Figure 7-10. This protocol is for a sender-initiated transfer. The send operation causes a "ready-to-send" to be transmitted to the destination, carrying the source process and tag information. The sender then waits until a corresponding "ready-to-receive" has arrived. The remote action is to check a local table to determine if a matching receive has been performed. If not, the ready-to-send information is recorded in the table to await the matching receive. If a matching receive is found, a "ready-to-receive" response transaction is generated. The receive operation checks the same table. If a matching send is not recorded there, the receive is recorded, including the destination data address. If a matching send is present, the receive generates a "ready-to-receive" transaction. When a "ready-to-receive" arrives at the sender, it can initiate the data transfer. Assuming the network is reliable, the send operation can complete once all the data has been transmitted. The receive will complete once it all arrives. Note that with this protocol both the source and destination nodes know the local addresses for the source and destination buffers at the time the actual data transfer occurs. The "ready" transactions are small, fixed-format packets, whereas the data is a variable-length transfer. In many message passing systems the matching rule associated with a synchronous message is quite restrictive and the receive specifies the sending process explicitly. This allows for an alternative "receiver-initiated" protocol in which the match table is maintained at the sender and only two network transactions are required (receive-ready and data transfer).
1. The standard MPI send mode is a combination of buffered and synchronous that gives the implementation substantial freedom and the programmer few guarantees. The implementation is free to buffer the data, but cannot be assumed to do so. Thus, when the send completes the send buffer can be reused, but it cannot be assumed that the receiver has reached the point of the receive call. Nor can it be assumed that send buffering will break the deadlock associated with two nodes sending to each other and then posting their receives. Non-blocking sends can be used to avoid the deadlock, even with synchronous sends. The ready-mode send is a stronger variant of synchronous mode, in which it is an error if the receive has not already been posted. Since the only way to obtain knowledge of the state of non-local processes is through the exchange of messages, an explicit message event would need to be used to indicate readiness. The race condition between posting the receive and transmitting the synchronization message is very similar to the flags example in the shared address space case.
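To make these distinctions concrete, the fragment below uses MPI's C binding to contrast a synchronous send with a non-blocking send, and shows why the head-to-head exchange mentioned in the footnote deadlocks with synchronous sends but completes with non-blocking ones. It is an illustrative sketch, not a tuned exchange routine; the function name and two-rank setup are invented for the example.

```c
#include <mpi.h>

/* Pairwise exchange between ranks 0 and 1. */
void exchange(int rank, double *out, double *in, int n)
{
    int peer = 1 - rank;

    /* Deadlocks: both ranks block in MPI_Ssend and neither reaches MPI_Recv,
     * because a synchronous send completes only when the matching receive runs.
     *   MPI_Ssend(out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
     *   MPI_Recv (in,  n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
     */

    /* Safe: the non-blocking send returns immediately, so both ranks reach the
     * receive; send completion is checked afterwards with MPI_Wait. */
    MPI_Request req;
    MPI_Isend(out, n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &req);
    MPI_Recv (in,  n, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    MPI_Wait (&req, MPI_STATUS_IGNORE);
}
```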
[Figure 7-10 artwork omitted: the source executes send(Pdest, local VA, len) — (1) initiate send, (2) address translation, (4) send-ready request; the destination executes recv(Psrc, local VA, len), performs the tag check (5, assume success), and returns a rcv-ready reply (6); the source then performs the bulk data transfer (7) from source VA to destination VA or ID.]
Figure 7-10 Synchronous Message Passing Protocol
The figure illustrates the three-way handshake involved in realizing a synchronous send/receive pair in terms of primitive network transactions.
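The handshake in Figure 7-10 can be expressed as a pair of handlers over a match table. The sketch below uses hypothetical primitives (match_table_lookup, send_transaction, node_of, and so on) purely to make the control flow explicit; it is not the interface of any particular machine.

```c
/* Sketch of the destination-side logic for the synchronous protocol (Figure 7-10).
 * The match table, node_of(), and send_transaction() are hypothetical primitives. */
#include <stddef.h>

typedef struct { int src_proc; int tag; void *local_va; int len; } entry_t;

enum { SEND_READY, RECV_READY, DATA_XFER };

extern entry_t *match_table_lookup(int proc, int tag);   /* NULL if nothing recorded */
extern void     match_table_insert(entry_t e);
extern int      node_of(int proc);
extern void     send_transaction(int node, int kind, entry_t e);

/* (5) a send-ready transaction arrives from the source */
void on_send_ready(int src_node, entry_t rts)
{
    entry_t *recv = match_table_lookup(rts.src_proc, rts.tag);
    if (recv == NULL)
        match_table_insert(rts);                        /* record; await the receive    */
    else
        send_transaction(src_node, RECV_READY, *recv);  /* (6) reply: ready to receive  */
}

/* the local process posts a receive */
void on_receive_posted(entry_t recv)
{
    entry_t *rts = match_table_lookup(recv.src_proc, recv.tag);
    if (rts == NULL)
        match_table_insert(recv);                       /* record receive and dest VA   */
    else
        send_transaction(node_of(recv.src_proc), RECV_READY, recv);
}
```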
The buffered send is naively implemented with an optimistic single-phase protocol, as suggested in Figure 7-11. The send operation transfers the source data in a single large transaction with an envelope containing the information used in matching, e.g., source process and tag, as well as length information. The destination strips off the envelope and examines its internal table to determine if a matching receive has been posted. If so, it can deliver the data at the specified receive address. If no matching receive has been posted, the destination allocates storage for the entire message and receives it into the temporary buffer. When the matching receive is posted later, the message is copied to the desired destination area and the buffer is freed. This simple protocol presents a family of problems. First, the proper destination of the message data cannot be determined until after examining the process and tag information and consulting the match table. These are fairly expensive operations, typically performed in software. Meanwhile, the message data is streaming in from the network at a high rate. One approach is to always receive the data into a temporary input buffer and then copy it to its proper destination. Of course, this introduces a store-and-forward delay and, possibly, consumes a fair amount of processing resources. The second problem with this optimistic approach is analogous to the input buffer problem discussed for a shared address space abstraction, since there is no ready-to-receive handshaking before the data is transferred. In fact, the problem is amplified in several respects. First, the transfers are larger, so the total volume of storage needed at the destination is potentially quite large.
Second, the amount of buffer storage depends on the program behavior; it is not just a result of the rate mismatch between multiple senders and one receiver, much less a timing mismatch where the data happens to arrive just before the receiver is ready. Several processes may choose to send many messages each to a single process, which happens not to receive them until much later. Conceptually, the asynchronous message passing model assumes an unbounded amount of storage outside the usual program data structures, where messages are stored until the corresponding receives are posted and performed. Furthermore, the blocking asynchronous send must be allowed to complete to avoid deadlock in simple communication patterns, such as a pairwise exchange. The program needs to continue executing past the send to reach a point where a receive is posted. Our optimistic protocol does not distinguish transient receive-side buffering from the prolonged accumulation of data in the message passing layer.
[Figure 7-11 artwork omitted: the source executes send(Pdest, local VA, len) — (1) initiate send, (2) address translation, (4) send data; the destination checks for a posted receive (assume fail), allocates a data buffer, and completes the tag match when recv(Ps, local VA, len) is posted later.]
Figure 7-11 Asynchronous (Optimistic) Message Passing Protocol
The figure illustrates a naive single-phase optimistic protocol for asynchronous message passing, in which the source simply delivers the data to the destination without concern for whether the destination has storage to hold it.
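The destination-side decision in this optimistic protocol is sketched below; the receive table, unexpected-message queue, and copy routine are hypothetical stand-ins for kernel or library mechanisms.

```c
/* Sketch: handling an optimistically sent message at the destination (Figure 7-11). */
#include <stdlib.h>
#include <string.h>

typedef struct { int src_proc; int tag; int len; char payload[]; } msg_t;

extern void *posted_receive_addr(int src_proc, int tag);  /* NULL if no receive posted yet    */
extern void  enqueue_unexpected(msg_t *copy);             /* hold until the receive is posted */

void on_data_xfer(msg_t *m)
{
    void *dest = posted_receive_addr(m->src_proc, m->tag);
    if (dest != NULL) {
        memcpy(dest, m->payload, m->len);            /* matching receive posted: deliver */
    } else {
        msg_t *copy = malloc(sizeof(*m) + m->len);   /* buffer the entire message        */
        memcpy(copy, m, sizeof(*m) + m->len);
        enqueue_unexpected(copy);                    /* copied again when receive posts  */
    }
}
```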
More robust message passing systems use a three-phase protocol for long transfers. The send issues a ready-to-send with the envelope, but keeps the data buffered at the sender until the destination can accept it. The destination issues a ready-to-receive either when it has sufficient buffer space or when a matching receive has been executed, so the transfer can take place to the correct destination area. Note that in this case the source and destination addresses are known at both sides of the transfer before the actual data transfer takes place. For short messages, where the handshake would dominate the actual data transfer cost, a simple credit scheme can be used. Each process sets aside a certain amount of space for processes that might send short messages to it. When a short message is sent, the sender deducts the destination "credit" locally, until it receives notification that the short message has been received. In this way, a short message can usually be launched without waiting a round-trip delay for the handshake. The completion acknowledgment is used later to determine when additional short messages can be sent without the handshake.
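The credit scheme for short messages can be sketched as follows. The constants, counters, and transaction primitives here are hypothetical; in practice the credit count would be sized to the buffer space reserved for this sender at each destination.

```c
/* Sketch of sender-side credit accounting for short messages (hypothetical primitives). */
#define MAX_NODES        1024
#define CREDITS_PER_DEST 8             /* buffer slots reserved for us at each destination */

extern void send_short_transaction(int dest, const void *msg, int len);
extern void poll_network(void);        /* services incoming transactions, including credit returns */

static int credits[MAX_NODES];

void credits_init(void)
{
    for (int i = 0; i < MAX_NODES; i++)
        credits[i] = CREDITS_PER_DEST;
}

void short_send(int dest, const void *msg, int len)
{
    while (credits[dest] == 0)         /* allotment exhausted: wait for acknowledgments */
        poll_network();
    credits[dest]--;                   /* deduct the destination credit locally         */
    send_short_transaction(dest, msg, len);
}

/* Called when the destination acknowledges that a short-message buffer was freed. */
void on_credit_return(int dest)
{
    credits[dest]++;
}
```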
[Figure 7-12 artwork omitted: the source executes send(Pd, local VA, len) — (1) initiate send, (2) address translation on Pd, (4) send-ready request — and returns to compute; the destination finds no posted receive (5) and records the send-ready; when recv(Ps, local VA, len) is later posted and the tag check succeeds, it sends a rcv-ready request (6), and the bulk data reply (7) then moves the data from source VA to destination VA or ID.]
Figure 7-12 Asynchronous Conservative Message Passing Protocol
The figure illustrates a one-plus-two-phase conservative protocol for asynchronous message passing. The data is held at the source until the matching receive is executed, making the destination address known before the data is delivered.
As with the shared address space, there is a design issue concerning whether the source and destination addresses are physical or virtual. The virtual-to-physical mapping at each end can be performed as part of the send and receive calls, allowing the communication assist to exchange physical addresses, which can be used for the DMA transfers. Of course, the pages must stay resident during the transfer for the data to stream into memory. However, the residency is limited in time, from just before the handshake until the transfer completes. Also, very long transfers can be segmented at the source so that few resident pages are involved. Alternatively, temporary buffers can be kept resident and the processor relied upon to copy the data from and to the source and destination areas. The resolution of this issue depends heavily on the capability of the communication assist, as discussed below. In summary, the send/receive message passing abstraction is logically a one-way transfer, where the source and destination addresses are specified independently by the two participating processes and an arbitrary amount of data can be sent before any is received. The realization of this class of abstractions in terms of primitive network transactions typically requires a three-phase protocol to manage the buffer space on the ends, although an optimistic single-phase protocol can be employed safely to a limited degree.
7.3.4 Common challenges
The inherent challenge in realizing programming models in large-scale systems is that each processor has only limited knowledge of the state of the system, a very large number of network transactions can be in progress simultaneously, and the source and destination of a transaction are decoupled. Each node must infer the relevant state of the system from its own point-to-point events. In this context, it is possible, even likely, for a collection of sources to seriously overload a destination before any of them observes the problem. Moreover, because the latencies involved are inherently large, we are tempted to use optimistic protocols and large transfers. Both of these increase the potential over-commitment of the destination. Furthermore, the protocols used to realize programming models often require multiple network transactions for an operation. All of these issues must be considered to ensure that forward progress is made in the absence of global arbitration. The issues are very similar to those encountered in bus-based designs, but the solutions cannot rely on the constraints of the bus, namely a small number of processors, a limited number of outstanding transactions, and total global ordering.
Input buffer overflow
Consider the problem of contention for the input buffer on a remote node. To keep the discussion simple, assume for the moment fixed-format transfers on a completely reliable network. The management of the input buffer is simple: a queue will suffice. Each incoming transaction is placed in the next free slot in the queue. However, it is possible for a large number of processors to make a request to the same module at once. If this module has a fixed input buffer capacity, on a large system it may become overcommitted. This situation is similar to the contention for the limited buffering within the network, and it can be handled in a similar fashion. One solution is to make the input buffer large and reserve a portion of it for each source. The source must constrain its own demand when it has exhausted its allotment at the destination. The question then arises, how does the source determine that space is again available for it? There must be some flow control information transmitted back from the destination to the sender. This is a design issue that can be resolved in a number of ways: each transaction can be acknowledged, or the acknowledgment can be coupled with the protocol at a higher level; for example, the reply indicates completion of processing the request. An alternative approach, common in reliable networks, is for the destination to simply refuse to accept incoming transactions when its input buffer is full. Of course, the data has no place to go, so it remains stuck in the network for a period of time. The network switch feeding the full buffer will be in a situation where it cannot deliver packets as fast as they are arriving. Given that it also has finite buffering, it will eventually refuse to accept packets on its inputs. This phenomenon is called back-pressure. If the overload on the destination is sustained long enough, the backlog will build in a tree all the way back to the sources. At this point, the sources feel the back-pressure from the overloaded destination and are forced to slow down, such that the sum of the rates from all the sources sending to the destination is no more than what the destination can receive. One might worry that the system will fail to function in such situations.
Generally, networks are built so that they are deadlock-free, which means that messages will make forward progress as long as messages are removed from the network at the destinations [DaSe87]. (We will see in Chapter 10 that it takes care in the network design to ensure deadlock freedom. The situations that need to be avoided are similar to traffic gridlock, where all the cars in an intersection are preventing the others from moving out of the way.) So, there will be forward progress. The problem is that with the network so backed up, messages not headed for the overloaded destination will get stuck in traffic as well.
Thus, the latency of all communication increases dramatically with the onset of this backlog. Even if there is a large backlog, however, messages will eventually move forward: since switches have only finite buffering, the back-pressure eventually reaches all the way back to the sources, which are forced to slow down. This situation establishes an interesting "contract" between the nodes and the network. From the source point of view, if the network accepts a transaction, it is guaranteed that the transaction will eventually be delivered to the destination. However, a transaction may not be accepted for an arbitrarily long period of time, and during that time the source must continue to accept incoming transactions. Alternatively, the network may be constructed so that the destination can inform the source of the state of its input buffer. This is typically done by reserving a special acknowledgment path in the reverse direction. When the destination accepts a transaction it explicitly ACKs the source; if it discards the transaction it can deliver a negative acknowledgment, informing the source to try again later. Local area networks, such as Ethernet, FDDI, and ATM, take more austere measures and simply drop the transaction whenever there is no space to buffer it. The source relies on time-outs to decide that the transaction may have been dropped and tries again.
Fetch deadlock
The input buffer problem takes on an extra twist in the context of the request/response protocols that are intrinsic to a shared address space and present in message passing implementations. In a reliable network, when a processor attempts to initiate a request transaction, the network may refuse to accept it, as a result of contention for the destination and/or contention within the network. In order to keep the network deadlock-free, the source is required to continue accepting transactions, even while it cannot initiate its own. However, an incoming transaction may be a request, which will generate a response. The response cannot be initiated because the network is full. A common solution to this fetch deadlock problem is to provide two logically independent communication networks for requests and responses. This may be realized as two physical networks, or as separate virtual channels within a single network with separate output and input buffering. While stalled attempting to send a request, it is necessary to continue accepting responses. However, responses can be completed without initiating further transactions. Thus, response transactions will eventually make progress. This implies that incoming requests can eventually be serviced, which implies that stalled requests will eventually make progress. An alternative solution is to ensure that input buffer space is always available at the destination when a transaction is initiated, by limiting the number of outstanding transactions. In a request/response protocol it is straightforward to limit the number of requests any processor has outstanding; a counter is maintained and each response decrements the counter, allowing a new request to be issued.
Standard blocking reads and writes are realized by simply waiting for the response before completing the current request. Nonetheless, with P processors and a limit of k outstanding requests per processor, it is possible for all kP requests to be directed to the same module. There needs to be space for the kP outstanding requests that might be headed for a single destination, plus space for responses to the requests issued by the destination node.
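The counting discipline just described can be sketched as follows; the limit K, the service routine, and the transaction primitives are hypothetical, and the input buffer at each destination is assumed to be sized for K requests from every possible requestor plus its own responses.

```c
/* Sketch: bounding outstanding requests so input buffer space always exists.
 * K and the net_* / service_* routines are hypothetical. */
#define K 4                                   /* max outstanding requests per processor */

static int outstanding = 0;

extern int  net_try_send_request(int node, const void *req);  /* 0 if the network refuses */
extern void service_incoming(void);  /* sinks responses, services or NAKs requests        */

void issue_request(int node, const void *req)
{
    while (outstanding == K)          /* credit exhausted: keep draining the network   */
        service_incoming();
    outstanding++;
    while (!net_try_send_request(node, req))
        service_incoming();           /* stay deadlock-free while the network is full  */
}

/* Called when a response to one of our requests arrives. */
void on_response(void)
{
    outstanding--;                    /* frees a slot, allowing a new request */
}
```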
Clearly, the available input buffering ultimately limits the scalability of the system. The request transaction is guaranteed to make progress because the network can always sink transactions into available input buffer space at the destination. The fetch deadlock problem arises when a node attempts to generate a request and its outstanding credit is exhausted. It must service incoming transactions in order to receive its own responses, which enable generation of additional requests. Incoming requests can be serviced because it is guaranteed that the requestor reserved input buffer space for the response. Thus, forward progress is ensured even if the node merely queues, and otherwise ignores, incoming transactions while attempting to deliver a response. Finally, we could adopt the approach we followed for split-transaction busses and NAK the transaction if the input buffer is full. Of course, the NAK may be arbitrarily delayed. Here we assume that the network reliably delivers transactions and NAKs, but the destination node may elect to drop them in order to free up input buffer space. Responses never need to be NAKed, because they will be sunk at the destination node, which is the source of the corresponding request and can be assumed to have set aside input buffering for the response. While stalled attempting to initiate a request, we need to accept and sink responses and accept and NAK requests. We can assume that input buffer space is available at the destination of the NAK, because it simply uses the space reserved for the intended response. As long as each node provides some input buffer space for requests, we can ensure that eventually some request succeeds and the system does not livelock. Additional precautions are required to minimize the probability of starvation.
7.3.5 Communication architecture design space
In the remainder of this chapter we examine the spectrum of important design points for large-scale distributed-memory machines. Recall that our generic large-scale architecture consists of a fairly standard node architecture augmented with a hardware communication assist, as suggested by Figure 7-13. The key design issue is the extent to which the information in a network transaction is interpreted directly by the communication assist, without involvement of the node processor. In order to interpret the incoming information, its format must be specified, just as the format of an instruction set must be defined before one can construct an interpreter for it, i.e., a processor. The formatting of the transaction must be performed in part by the source assist, along with address translation, destination routing, and media arbitration. Thus, the processing performed by the source communication assist in generating the network transaction and that performed at the destination together realize the semantics of the lowest-level hardware communication primitives presented to the node architecture. Any additional processing required to realize the desired programming model is performed by the node processor(s), either at user or system level. Establishing a position on the nature of the processing performed in the two communication assists involved in a network transaction has far-reaching implications for the remainder of the design, including how input buffering is performed, how protection is enforced, how many times data is copied within the node, how addresses are translated, and so on.
The minimal interpretation of the incoming transaction is not to interpret it at all. It is viewed as a raw physical bit stream and is simply deposited in memory or registers. More specific interpretation provides user-level messages, a global virtual address space, or even a global physical address space. In the following sections we examine each of these in turn, and present important machines that embody the respective design points as case studies.
[Figure 7-13 artwork omitted: several nodes, each consisting of a processor (P), memory (M), and communication assist (CA), attached to a scalable network carrying messages.]
Figure 7-13 Processing of a Network Transaction in the Generic Large-Scale Architecture
A network transaction is a one-way transfer of information from an output buffer at the source to an input buffer at the destination, causing some kind of action to occur at the destination, the occurrence of which is not directly visible at the source. The source communication assist formats the transaction and causes it to be routed through the network. The destination communication assist must interpret the transaction and cause the appropriate actions to take place. The nature of this interpretation is a critical design aspect of scalable multiprocessors.
7.4 Physical DMA
This section considers designs where no interpretation is placed on the information within a network transaction. This approach is representative of most early message passing machines, including the nCUBE/10, the nCUBE/2, the Intel iPSC, iPSC/2, iPSC/860, and Delta, and the Ametek and IBM SP-1. Most LAN interfaces also follow this approach. The hardware can be very simple and the user communication abstraction can be very general, but the typical processing costs are large.
Node to Network Interface
The hardware essentially consists of support for physical DMA (direct memory access), as suggested by Figure 7-14. A DMA device or channel typically has associated with it address and length registers, status bits (e.g., transmit ready, receive ready), and interrupt enables. Either the device is memory mapped or privileged instructions are provided to access the registers. Addresses are physical1, so the network transaction is transferred from a contiguous region of memory. Injecting the message into the network requires a trap to the kernel. Typically, the data will be copied into a kernel area so that the envelope, including the route and other information, can be constructed. Portions of the envelope, such as the error detection bits, may be generated by the communication assist. The kernel selects the appropriate outgoing channel, sets the channel address to the physical address of the message, and sets the count. The DMA engine will push the message into the network. When transmission completes, the output channel ready flag is set
1. One exception to this is the SBus used in Sun workstations and servers, which provides virtual DMA, allowing I/O devices to operate on virtual, rather than physical, addresses.
and an interrupt is generated, unless it is masked. The message will work its way through the network to the destination, at which point the DMA at the input channel of the destination node must be started to allow the message to continue moving through the network and into the node. (If there is delay in starting the input channel or if the message collides in the network with another using the same link, typically the message just sits in the network.) Generally, the input channel address register is loaded in advance with the base address where the data is to be deposited. The DMA will transfer words from the network into memory as they arrive. The end-of-message causes the input ready status bit to be set and an interrupt to be generated, unless masked. In order to avoid deadlock, the input channels must be activated to receive messages and drain the network, even if there are output messages that need to be sent on busy output channels. Sending typically requires a trap to the operating system. Privileged software can then provide the source address translation, translate the logical destination node to a physical route, arbitrate for the physical media, and access the physical device. The key property of this approach is that the destination processor initiates a DMA transfer from the network into a region of memory, and the next incoming network transaction is blindly deposited in the specified region of memory. The system sets up the inbound DMA on the destination side. When it opens up an input channel, it cannot determine whether the next message will be a user message or a system message. It will be received blindly into the predefined physical region. Message arrival will typically cause an interrupt, so privileged software can inspect the message and either process it or deliver it to the appropriate user process. System software on the node processor interprets the network transaction and (hopefully) provides a clean abstraction to the user.
[Figure 7-14 artwork omitted: source and destination nodes, each with a processor (P), memory (M), and DMA channels with address, length, command, and ready registers plus status/interrupt lines; the network transaction carries the destination and data.]
Figure 7-14 Hardware support in the communication assist for blind, physical DMA
The minimal interpretation is blind, physical DMA, which allows the destination communication assist merely to deposit the transaction data into storage, whereupon it will be interpreted by the processor. Since the type of transaction is not determined in advance, kernel buffer storage is used, and processing the transaction will usually involve a context switch and one or more copies.
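As an illustration of the kind of interface the kernel sees, the fragment below sketches a memory-mapped output DMA channel and the kernel-side send path; the register layout, mapped address, and envelope format are invented, not those of any particular machine.

```c
/* Illustrative memory-mapped DMA output channel; layout and constants are invented. */
#include <stdint.h>
#include <string.h>

typedef struct {
    volatile uint32_t addr;      /* physical address of the outgoing transaction */
    volatile uint32_t len;       /* length in bytes                              */
    volatile uint32_t status;    /* e.g., TX_READY bit                           */
    volatile uint32_t intr_en;   /* interrupt enable                             */
} dma_chan_t;

#define TX_READY  0x1
#define DMA_OUT   ((dma_chan_t *)0xFFFF0000)   /* hypothetical mapped address */

static char kernel_buf[4096];

/* Kernel-level send: entered via a trap from the user process. */
void kernel_send(int route, const void *user_data, uint32_t len)
{
    /* Build the envelope (route, length, error check) in a kernel area,
     * then copy in the user data so the DMA sees one contiguous region. */
    memcpy(kernel_buf, &route, sizeof(route));
    memcpy(kernel_buf + sizeof(route), user_data, len);

    while (!(DMA_OUT->status & TX_READY))
        ;                                             /* wait for the output channel        */
    DMA_OUT->addr = (uint32_t)(uintptr_t)kernel_buf;  /* a physical address in a real kernel */
    DMA_OUT->len  = sizeof(route) + len;              /* setting the count launches the DMA  */
}
```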
One potential way to reduce the communication overhead is to allow user-level access to the DMA device. If the DMA device is memory mapped, as most are, this is a matter of setting up the user virtual address space to include the necessary region of the I/O space. However, with this approach the protection domain and the level of resource sharing are quite crude. The current user gets the whole machine. If the user misuses the network, there may be no way for the operating system to intervene, other than rebooting the machine. This approach has certainly been
employed in experimental settings, but it is not very robust and tends to make the parallel machine into a very expensive personal computer.
Implementing Communication Abstractions
Since the hardware assist in these machines is relatively primitive, the key question becomes how to deliver the newly received network transaction to the user process in a robust, protected fashion. This is where the linkage between programming model and communication primitives occurs. The most common approach is to support the message passing abstraction directly in the kernel. An interrupt is taken on arrival of a network transaction. The process identifier and tag in the network transaction are parsed and a protocol action is taken along the lines specified by Figure 7-10 or Figure 7-12. For example, if a matching receive has been posted the data can be copied directly into the user memory space. If not, the kernel provides buffering or allocates storage in the destination user process to buffer the message until a matching receive is performed. Alternatively, the user process can preallocate communication buffer space and inform the kernel where it wants to receive messages. Some message passing layers allow the receiver to operate directly out of the buffer, rather than receiving the data into its address space. It is also possible for the kernel software to provide the user-level abstraction of a global virtual address space. In that case, read and write requests are issued either directly through a system call or by trapping on a load or store to a logically remote page. The kernel on the source issues the request and handles the response. The kernel on the destination extracts the user process, command, and destination virtual address from the network transaction and performs the read or write operation, along the lines of Figure 7-8, issuing a response. Of course, the overhead associated with such an implementation of the shared address abstraction is quite large, especially for word-at-a-time operation. Greater efficiency can be gained through bulk data transfers, which make the approach competitive with message passing. There have been many software shared memory systems built along these lines, mostly on clusters of workstations, but the thrust of these efforts is on automatic replication to reduce the amount of communication. They are described in the next chapter. Other linkages between the kernel and user are possible. For example, the kernel could provide the abstraction of a user-level input queue and simply append each incoming message to the appropriate queue, following some well-defined policy on queue overflow.
7.4.1 A Case Study: nCUBE/2
A representative example of the physical DMA style of machine is the nCUBE/2. The network interface is organized as illustrated by Figure 7-15, where each output DMA channel drives an output port, or link, and each input DMA channel is associated with an input port. This machine is an example of a direct network, in which data is forwarded from its source to its destination through intermediate nodes. The router forwards network transactions from input ports to output ports. The network interface inspects the envelope of messages arriving on each input port and determines whether the message is destined for the local node. If so, the input DMA is activated to drain the message into memory. Otherwise, the router forwards the message to the proper output port.1 Link-by-link flow control ensures that delivery is reliable.
User programs are assigned to contiguous subcubes, and the routing is such that the links within a subcube are used only within that subcube, so with space-shared use of the machine, user programs cannot interfere with one another.
[Figure 7-15 artwork omitted: a router connects multiple input ports and output ports; each port has a DMA channel with address and length registers on the memory bus, which also connects the memory and the processor.]
Figure 7-15 Network interface organization of the nCUBE/2
Multiple DMA channels drive network transactions directly from memory into the network or from the network into memory. The inbound channels deposit data into memory at a location determined by the processor, independent of the contents. To avoid copying, the machine allows multiple message segments to be transferred as a single unit through the network. A more typical approach is for the processor to provide the communication assist with a queue of inbound DMA descriptors, each containing the address and length of a memory buffer. When a network transaction arrives, a descriptor is popped from the queue and the data is deposited in the associated memory buffer.
A peculiarity of the nCUBE/2 is that no count register is associated with the input channels, so the kernels on the source and destination nodes must ensure that incoming messages never overrun the input buffers in memory. To assist in interpreting network transactions and delivering the user data into the desired region of memory without copies, it is possible to send a series of message segments as a single, contiguous transfer. At the destination the input DMA will stop at each logical segment. Thus, the destination can take the interrupt and inspect the first segment in order to determine where to direct the remainder, for example by performing the lookup in the receive table. However, this facility is costly, since a start-DMA and interrupt (or busy-wait) is required for each segment. The best strategy at the kernel level is to keep an input buffer associated with every incoming channel, while output buffers are being injected into the output ports [vEi*92]. Typically each message will contain a header, allowing the kernel to dispatch on the message type and take appropriate action to handle it, such as performing the tag match and copying the message into the user data area. The most efficient and most carefully documented communication abstraction on this platform is user-level active messages, which essentially provides the basic network transaction at user level [vEi*92]. It requires that the first word of the user message contain the address of the user routine that will handle the message.
1. This routing step is the primary point where the interconnection network topology, an n-cube, is bound into the design of the node. As we will discuss in Chapter 10, the output port is given by the position of the first bit that differs in the local node address and the message destination address.
The message arrival causes an interrupt, so the kernel performs a return-from-interrupt to the message handler, with the interrupted user address on the stack. With this approach, a message can be delivered into the network in 13 µs (16 instructions, including 18 memory references, costing 160 cycles) and extracted from the network in 15 µs (18 instructions, including 26 memory references, costing 300 cycles). Comparing this with the 150 µs start-up of the vendor "vertex" message passing library reflects the gap between the hardware primitives and the user-level operations in the message passing programming model. The vertex message passing layer uses an optimistic one-way protocol, but matching and buffer management are required.
Typical LAN Interfaces
The simple DMA controllers of the nCUBE/2 are typical of parallel machines and qualitatively different from what is typically found in DMA controllers for peripheral devices and local area networks. Notice that each DMA channel is capable of a single, contiguous transfer. A short instruction sequence sets the channel address and channel limit for the next input or output operation. Traditional DMA controllers provide the ability to chain together a large number of transfers. To initiate an output DMA, a DMA descriptor is chained onto the output DMA queue. The peripheral controller polls this queue, issuing DMA operations and informing the processor as they complete. Most LAN controllers, including the Ethernet LANCE, Motorola DMAC, Sun ATM adapters, and many others, provide a queue of transmit descriptors and a queue of receive descriptors. (There is also a free list of each kind of descriptor. Typically, the queue and its free list are combined into a single ring.) The kernel builds the output message in memory and sets up a transmit descriptor with its address and length, as well as some control information. In some controllers a single message can be described by a sequence of descriptors, so the controller can gather the envelope and the data from memory. Typically, the controller has a single port into the network, so it pushes the message onto the wire. For Ethernets and rings, each of the controllers inspects the message as it comes by, so a destination address is specified on the transaction, rather than a route. The inbound side is more interesting. Each receive descriptor has a destination buffer address. When a message arrives, a buffer descriptor is popped off the queue and a DMA transfer is initiated to load the message data into memory. If no receive descriptor is available, the message is dropped and higher-level protocols must retry (just as if the message had been garbled in transit). Most devices have configurable interrupt logic, so an interrupt can be generated on every arrival, after so many bytes, or after a message has waited for too long. The operating system driver manages these input and output queues. The number of instructions required to set up even a small transfer is quite large with such devices, partly because of the formatting of the descriptors and the handshaking with the controller.
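The descriptor-queue style of interface described above can be sketched as a pair of rings in host memory. The field names, ring size, and ownership convention below are generic illustrations, not the layout of any specific controller.

```c
/* Generic sketch of LAN-style transmit/receive descriptor rings (invented layout). */
#include <stdint.h>

#define RING_SIZE 64
#define OWN_NIC   0x8000     /* set when the controller owns the descriptor */

typedef struct {
    volatile uint32_t buf_addr;   /* physical address of the message buffer      */
    volatile uint16_t len;        /* buffer length (filled in by NIC on receive) */
    volatile uint16_t flags;      /* ownership and control bits                  */
} descriptor_t;

static descriptor_t tx_ring[RING_SIZE], rx_ring[RING_SIZE];
static int tx_head, rx_head;

/* Driver queues an outgoing message: the controller polls the ring and sends it. */
int post_transmit(uint32_t buf_phys, uint16_t len)
{
    descriptor_t *d = &tx_ring[tx_head];
    if (d->flags & OWN_NIC)
        return -1;                     /* ring full: controller still owns this slot */
    d->buf_addr = buf_phys;
    d->len      = len;
    d->flags    = OWN_NIC;             /* hand the descriptor to the controller      */
    tx_head = (tx_head + 1) % RING_SIZE;
    return 0;
}

/* Driver replenishes a receive descriptor; if none is owned by the controller
 * when a message arrives, the message is dropped and higher levels must retry. */
void post_receive(uint32_t buf_phys, uint16_t len)
{
    descriptor_t *d = &rx_ring[rx_head];
    d->buf_addr = buf_phys;
    d->len      = len;
    d->flags    = OWN_NIC;
    rx_head = (rx_head + 1) % RING_SIZE;
}
```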
7.5 User-level Access
The most basic level of hardware interpretation of the incoming network transaction distinguishes user messages from system messages and delivers user messages to the user program without operating system intervention. Each network transaction carries a user/system flag, which is examined by the communication assist as the message arrives. In addition, it should be possible to inject a user message into the network at user level; the communication assist automatically
inserts the user flag as it generates the transaction. In effect, this design point provides a user-level network port, which can be written and read without system intervention.
Node to Network Interface
A typical organization for a parallel machine supporting user-level network access is shown in Figure 7-16. A region of the address space is mapped to the network input and output ports, as well as the status register, as indicated in Figure 7-17. The processor can generate a network transaction by writing the destination node number and the data into the output port. The communication assist performs a protection check, translates the logical destination node number into a physical address or route, and arbitrates for the medium. It also inserts the message type and any error checking information. Upon arrival, a system message will cause an interrupt so the system can extract it from the network, whereas a user message can sit in the input queue until the user process reads it from the network, popping the queue. If the network backs up, attempts to write messages into the network will fail, and the user process will need to continue to extract messages from the network to make forward progress. Since current microprocessors do not support user-level interrupts, an interrupting user message is treated by the NI as a system message and the system rapidly transfers control to a user-level handler.
[Figure 7-16 artwork omitted: source and destination nodes, each with a processor (P) and memory (M); the network transaction carries a user/system flag, destination, and data, and status/interrupt lines connect the assist to the processor.]
Figure 7-16 Hardware support in the communication assist for user-level network ports
The network transaction is distinguished as either system or user. The communication assist provides network input and output FIFOs accessible to the user or system. It marks user messages as they are sent and checks the transaction type as they are received. User messages may be retained in the user input FIFO until extracted by the user application. System transactions cause an interrupt, so that they may be handled in a privileged manner by the system. In the absence of user-level interrupt support, interrupting user transactions are treated as special system transactions.
One implication of this design point is that the communication primitives allow a portion of the process state to be in the network, having left the source but not arrived at the destination. Thus, if the collection of user processes forming a parallel program is time-sliced out, the collection of in-flight messages for the program needs to be swapped as well. They will be reinserted into the network or the destination input queues when the program is resumed.
[Figure 7-17 artwork omitted: two processors, each with its registers and program counter, and a region of each virtual address space mapped to the net output port, net input port, and status register.]
Figure 7-17 Typical User-level Architecture with network ports
In addition to the storage presented by the instruction set architecture and memory space, a region of the user's virtual address space provides access to the network output port, input port, and status register. Network transactions are initiated and received by writing and reading the ports, plus checking the status register.
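Access to such memory-mapped ports looks roughly like the fragment below; the port addresses, status bits, and message format are invented for illustration and do not describe any particular machine.

```c
/* Sketch of user-level access to memory-mapped network ports (invented layout). */
#include <stdint.h>

#define NET_OUT    ((volatile uint32_t *)0x40000000)  /* hypothetical mapped output port */
#define NET_IN     ((volatile uint32_t *)0x40001000)  /* hypothetical mapped input port  */
#define NET_STATUS ((volatile uint32_t *)0x40002000)
#define OUT_OK     0x1     /* last injection accepted              */
#define IN_AVAIL   0x2     /* a user message waits in the input fifo */

/* Inject a small message: write the destination and data words to the output port. */
int user_send(uint32_t dest_node, const uint32_t *data, int nwords)
{
    NET_OUT[0] = dest_node;                 /* assist adds the user flag, route, checks */
    for (int i = 0; i < nwords; i++)
        NET_OUT[1 + i] = data[i];
    return (*NET_STATUS & OUT_OK) != 0;     /* if refused, the caller must retry while  */
}                                           /* continuing to drain the input port       */

/* Pop the next user message from the input fifo, if one is present. */
int user_recv(uint32_t *data, int nwords)
{
    if (!(*NET_STATUS & IN_AVAIL))
        return 0;
    for (int i = 0; i < nwords; i++)
        data[i] = NET_IN[i];                /* reading pops the queue */
    return 1;
}
```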
7.5.1 Case Study: Thinking Machines CM-5
The first commercial machine to seriously support a user-level network was the Thinking Machines Corp. CM-5, introduced in late 1991. The communication assist is contained in a network interface chip (NI) attached to the memory bus as if it were an additional memory controller, as illustrated in Figure 7-16. The NI provides an input and output FIFO for each of two data networks and a "control network," which is specialized for global operations, such as barrier, broadcast, reduce, and scan. The functionality of the communication assist is made available to the processor by mapping the network input and output ports, as well as the status registers, into the address space, as shown in Figure 7-17. The kernel can access all of the FIFOs and registers, whereas a user process can only access the user FIFOs and status. In either case, communication operations are initiated and completed by reading and writing the communication assist registers using conventional loads and stores. In addition, the communication assist can raise an interrupt. In the CM-5, each network transaction contains a small tag, and the communication assist maintains a small table to indicate which tags should raise an interrupt. All system tags raise an interrupt. In the CM-5, it is possible to write a five-word message into the network in 1.5 µs (50 cycles) and read one out in 1.6 µs. In addition, the latency across an unloaded network varies from 3 µs for neighboring nodes to 5 µs across a thousand-node machine. An interrupt vectored to user level costs roughly 10 µs. The user-level handler may process several messages, if they arrive in rapid succession. The time to transfer a message into or out of the network interface is dominated by the time spent on the memory bus, since these operations are performed as uncached writes and reads. If the message data starts out in registers, it can be written to the network interface as a sequence of buffered stores. However, this must be followed by a load to check if the message was accepted, at which point the write latency is experienced as the write buffer is retired to the NI. If the message data originates in memory and is to be deposited in memory, rather than registers, it is interesting to evaluate whether DMA should be used to transfer data to and from the NI. The critical resource is the memory bus. When using conventional memory operations to access the
user-level network port, each word of message data is first loaded into a register and then stored into the NI or memory. If the data is uncachable, each data word in the network transaction involves four bus transactions. If the memory data is cachable, the transfers between the processor and memory are performed as cache block transfers or are avoided if the data is in the cache. However, the NI stores and loads remain. With DMA transfers, the data is moved only once across the memory bus on each end of the network transaction, using burst-mode transfers. However, the DMA descriptor must still be written to the NI. The performance advantages of DMA are lost altogether if the message data cannot reside in cachable regions of memory, so the DMA transfers performed by the NI must be coherent with respect to the processor cache. Thus, it is critical that the node memory architecture support coherent caching. On the receiving side, the DMA must be initiated by the NI based on information in the network transaction and state internal to the NI; otherwise we are again faced with the problems associated with blind physical DMA. This leads us to place additional interpretation on the network transaction, in order to have the communication assist extract address fields. We consider this approach further in Section 7.6 below. The two data networks in the CM-5 provide a simple solution to the fetch deadlock problem; one network can be used for requests and one for responses [Lei*92]. When blocking on a request, the node continues to accept incoming replies and requests, which may generate outgoing replies. When blocked on sending a reply, only incoming replies are accepted from the network. Eventually the reply will succeed, allowing the request to proceed. Alternatively, buffering can be provided at each node, with some additional end-to-end flow control to ensure that the buffers do not overflow. Should a user program be interrupted when it is partway through popping a message from the input queue, the system will extract the remainder of the message and push it back into the front of the input queue before resuming the program.
7.5.2 User Level Handlers
Several experimental architectures have investigated a tighter integration of the user-level network port with the processor, including the Manchester Dataflow Machine [Gur*85], Sigma-1 [Shi*84], iWARP [Bor90], Monsoon [PaCu90], EM-4 [Sak*91], and the J-Machine [Dal93]. The key difference is that the network input and output ports are processor registers, as suggested by Figure 7-18, rather than special regions of memory. This substantially changes the engineering of the node, since the communication assist is essentially a processor function unit. The latency of each of the operations is reduced substantially, since data is moved in and out of the network with register-to-register instructions. The bandwidth demands on the memory bus are reduced, and the design of the communication support is divorced from the design of the memory system. However, the processor is involved in every network transaction. Large data transfers consume processor cycles and are likely to pollute the processor cache. Interestingly, the experimental machines have arrived at a similar design point from vastly different approaches. The iWARP machine [Bor*90], developed jointly by CMU and Intel, binds two registers in the main register file to the head of the network input and output ports. The processor may access the message on a word-by-word basis as it streams in from the network.
Alternatively, a message can be spooled into memory by a DMA controller. The processor specifies which message it desires to access via the port registers by specifying the message tag, much as in a traditional receive call. Other messages are spooled into memory by the DMA controller, using an input buffer queue to specify the destination address.
[Figure 7-18 artwork omitted: source and destination nodes (P, M); the network transaction carries an address, user/system flag, destination, and data, delivered to ports accessible directly by the processor.]
Figure 7-18 Hardware support in the communication assist for user-level handlers
The basic level of support required for user-level handlers is that the communication assist can determine that the network transaction is destined for a user process and make it directly available to that process. This means either that each process has a logical set of FIFOs or that a single set of FIFOs is time-shared among user processes.
The extra hardware mechanism to direct one incoming and one outgoing message through the register file was motivated by systolic algorithms, where a stream of data is pumped through processors in a highly regular pipeline, doing a small amount of computation as the stream flows through. By interpreting the tag, or virtual channel in iWARP terms, in the network interface, the memory-based message flows are not hindered by the register-based message flow. By contrast, in a CM-5 style of design all messages are interleaved through a single input buffer. The *T machine [NPA93], proposed by MIT and Motorola, offered a more general-purpose architecture for user-level message handling. It extended the Motorola 88110 RISC microprocessor to include a network function unit containing a set of registers much like the floating-point unit. A multiword outgoing message is composed in a set of output registers and a special instruction causes the network function unit to send it out. There are several of these output register sets forming a queue, and the send advances the queue, exposing the next available set to the user. Function unit status bits indicate whether an output set is available; these can be used directly in branch instructions. There are also several input message register sets, so when a message arrives it is loaded into an input register set and a status bit is set or an interrupt is generated. Additional hardware support was provided to allow the processor to dispatch rapidly to the address specified by the first word of the message. The *T design drew heavily on previous efforts supporting message-driven execution and dataflow architectures, especially the J-Machine, Monsoon, and EM-4. These earlier designs employed rather unusual processor architectures, so the communication assist is not clearly articulated. The J-Machine provides two execution contexts, each with a program counter and small register set. The "system" execution context has priority over the "user" context. The instruction set includes a segmented memory model, with one segment being a special on-chip message input queue. There is also a message output port for each context. The first word of the network transaction is specified as being the address of the handler for the message. Whenever the user context is idle and a message is present in the input queue, the head of the message is automatically loaded into the program counter and an address register is set up to reference the rest of the message. The handler must extract the message from the input buffer before suspending or completing. Arrival of a system-level message preempts the user context and initiates a system handler.
In Monsoon, a network transaction is of fixed format and specified to contain a handler address, a data frame address, and a 64-bit data value. The processor supports a large queue of such small messages. The basic instruction scheduling mechanism and the message handling mechanism are deeply integrated. In each instruction fetch cycle, a message is popped from the queue and the instruction specified by the first word of the message is executed. Instructions are in a 1+x address format and specify an offset relative to the frame address where a second operand is located. Each frame location contains presence bits, so if the location is empty the data word of the message is stored in the specified location (like a store-accumulator instruction). If the location is not empty, its value is fetched, an operation is performed on the two operands, and one or more messages carrying the result are generated, either for the local queue or a queue across the network. In earlier, more traditional dataflow machines, the network transaction carried an instruction address and a tag, which was used in an associative match to locate the second operand, rather than simple frame-relative addressing. Later hybrid machines [NiAr89,GrHo90,Cul*91,Sak91] execute a sequence of instructions for each message dequeue-and-match operation.
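The presence-bit matching step at the heart of this scheme can be pictured with the following sketch; the types and the dispatch convention are illustrative, not Monsoon's actual encoding.

#include <stdint.h>
#include <stdbool.h>

/* Schematic Monsoon-style token processing: each frame slot carries a
 * presence bit; the first operand to arrive waits, the second fires the
 * instruction named by the message. */
typedef struct { bool present; uint64_t value; } frame_slot_t;

typedef struct {
    void (*instr)(uint64_t a, uint64_t b);  /* handler address carried in the message */
    frame_slot_t *frame;                    /* activation frame on this node           */
    uint32_t      offset;                   /* 1+x addressing: operand slot in frame   */
    uint64_t      data;                     /* 64-bit operand carried by the token     */
} token_t;

static void process_token(token_t t)
{
    frame_slot_t *slot = &t.frame[t.offset];
    if (!slot->present) {               /* first operand: store and wait   */
        slot->value   = t.data;
        slot->present = true;
    } else {                            /* match: fire, emptying the slot  */
        slot->present = false;
        t.instr(slot->value, t.data);   /* handler may enqueue result tokens
                                           locally or across the network   */
    }
}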
7.6 Dedicated Message Processing

A third important design style for large-scale distributed-memory machines seeks to allow sophisticated processing of the network transaction using dedicated hardware resources, without binding the interpretation in the hardware design. The interpretation is performed by software on a dedicated communications processor (or message processor) which operates directly on the network interface. With this capability, it is natural to consider off-loading the protocol processing associated with the message passing abstraction to the communications processor. It can perform the buffering, matching, copying, and acknowledgment operations. It is also reasonable to support a global address space where the communication processor performs the remote read operation on behalf of the requesting node. The communications processors can cooperate to provide a general capability to move data from one region of the global address space to another. The communication processor can provide synchronization operations and even combinations of data movement and synchronization, such as write data and set a flag, or enqueue data. Below we look at the basic organizational properties of machines of this class to understand the key design issues. We will look in detail at two machines as case studies, the Intel Paragon and the Meiko CS-2.

A generic organization for this style of design is shown in Figure 7-19, where the compute processor (P) and communication processor (mP) are symmetric and both reside on the memory bus. This essentially starts with a bus-based SMP of Chapter 4 as the node and extends it with a primitive network interface similar to that described in the previous two sections. One of the processors in the SMP node is specialized in software to function as a dedicated message processor. An alternative organization is to have the communication processor embedded into the network interface, as shown in Figure 7-20. These two organizations have different latency, bandwidth, and cost trade-offs, which we examine a little later. Conceptually, they are very similar. The communication processor typically executes at system privilege level, relieving the machine designer from addressing the issues associated with a user-level network interface discussed above. The two processors communicate via shared memory, which typically takes the form of a command queue and response area, so the change in privilege level comes essentially for free as part of the hand-off. Since the design assumes there is a system-level processor responsible for managing network transactions, these designs generally allow word-by-word access to the NI FIFOs, as well as DMA, so the communication processor can inspect portions of the message and decide what actions to take.
Figure 7-19 Machine organization for dedicated message processing with a symmetric processor. Each node has a processor, symmetric with the main processor on a shared memory bus, that is dedicated to initiating and handling network transactions. Being dedicated, it can always run in system mode, so transferring data through memory implicitly crosses the protection boundary. The message processor can provide any additional protection checks of the contents of the transactions.
The communication processor can poll the network and the command queues to move the communication process along.
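A minimal sketch of this shared-memory hand-off is shown below, assuming a simple flag-per-slot queue layout; the field names and sizes are illustrative, not those of any particular machine. The compute processor fills an entry and sets the flag last, so the message processor never sees a half-written command; the message processor clears the flag once it has built and issued the corresponding network transaction.

#include <stdint.h>

#define CMDQ_SLOTS 64

typedef struct {
    volatile uint32_t full;      /* flag: set by P, cleared by mP        */
    uint32_t          op;        /* e.g. SEND, RECV_POST                 */
    uint32_t          dest;      /* destination node                     */
    uint32_t          len;       /* payload length in bytes              */
    void             *buf;       /* user buffer (copied or pinned by mP) */
} cmd_t;

typedef struct {
    cmd_t    slot[CMDQ_SLOTS];
    uint32_t head;               /* advanced only by P  */
    uint32_t tail;               /* advanced only by mP */
} cmdq_t;

/* Compute-processor side: post a command, writing the flag last. */
static int post_cmd(cmdq_t *q, uint32_t op, uint32_t dest, void *buf, uint32_t len)
{
    cmd_t *c = &q->slot[q->head % CMDQ_SLOTS];
    if (c->full)
        return -1;               /* queue full: caller retries or backs off */
    c->op = op; c->dest = dest; c->buf = buf; c->len = len;
    __sync_synchronize();        /* make the body visible before the flag   */
    c->full = 1;
    q->head++;
    return 0;
}

/* Message-processor side: one turn of its polling loop over this queue. */
static void service_cmdq(cmdq_t *q)
{
    cmd_t *c = &q->slot[q->tail % CMDQ_SLOTS];
    if (!c->full)
        return;
    /* ... build a network transaction from *c and hand it to the NI ... */
    c->full = 0;                 /* frees the slot; this flag traffic is the
                                    source of the coherence costs discussed
                                    later in this section */
    q->tail++;
}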
Figure 7-20 Machine organization for dedicated message processing with an embedded processor. The communication assist consists of a dedicated, programmable message processor embedded in the network interface. It has a direct path to the network that does not utilize the memory bus shared by the main processor.
The message processor provides the computation processor with a very clean abstraction of the network interface. All the details of the physical network operation are hidden, such as the hardware input/output buffers, the status registers, and the representation of routes. A message can be sent by simply writing it, or a pointer to it, into shared memory. Control information is exchanged between the processors using familiar shared-memory synchronization primitives, such as flags and locks. Incoming messages can be delivered directly into memory by the message processor, along with notification to the compute processor via shared variables.
With a well-designed user-level abstraction, the data can be deposited directly into the user address space. A simple low-level abstraction provides each user process in a parallel program with a logical input queue and output queue. In this case the flow of information in a network transaction is as shown in Figure 7-21.
Figure 7-21 Flow of a Network Transaction with a symmetric message processor. Each network transaction flows through memory, or at least across the memory bus in a cache-to-cache transfer between the main processor and message processor. It crosses the memory bus again between the message processor and network interface.
These benefits are not without costs. Since communication between the compute processor and message processor is via shared memory within the node, communication performance is strongly influenced by the efficiency of the cache coherency protocol. A review of Chapter 4 reveals that these protocols are primarily designed to avoid unnecessary communication when two processors are operating on mostly distinct portions of a shared data structure. The shared communication queues are a very different situation. The producer writes an entry and sets a flag. The consumer must see the flag update, read the data, and clear the flag. Eventually, the producer will see the cleared flag and rewrite the entry. All the data must be moved from the producer to the consumer with minimal latency. We will see below that inefficiencies in traditional coherency protocols make this latency significant. For example, before the producer can write a new entry, the copy of the old one in the consumer's cache must be invalidated. One might imagine that an update protocol, or even uncached writes, might avoid this situation, but then a bus transaction will occur for every word, rather than every cache block.

A second problem is that the function performed by the message processor is "concurrency intensive." It handles requests from the compute processor, messages arriving from the network, and messages going out into the network, all at once. By folding all these events into a single sequential dispatch loop, they can only be handled one at a time. This can seriously impair the bandwidth capability of the hardware.

Finally, the ability of the message processor to deliver messages directly into memory does not completely eliminate the possibility of fetch deadlock at the user level, although it can ensure that the physical resources are not stalled when an application deadlocks. A user application may need to provide some additional level of flow control.
7.6.1 Case Study: Intel Paragon

To make these general issues concrete, in this section we examine how they arise in an important machine of this ilk, the Intel Paragon, first shipped in 1992. Each node is a shared memory multiprocessor with two or more 50 MHz i860XP processors, a Network Interface Chip (NIC), and 16 or 32 MB of memory, connected by a 64-bit, 400 MB/sec, cache-coherent memory bus, as shown in Figure 7-22. In addition, two DMA engines (one for sending and the other for receiving) are provided to burst data between memory and the network. DMA transfers operate within the cache coherency protocol and are throttled by the network interface before buffers are over- or under-run. One of the processors is designated as a message processor to handle network transactions and message passing protocols, while the other is used as a compute processor for general computing. 'I/O nodes' are formed by adding I/O daughter cards for SCSI, Ethernet, and HiPPI connections.
Figure 7-22 Intel Paragon machine organization. Each node of the machine includes a dedicated message processor, identical to the compute processor, on a cache-coherent memory bus. The message processor has a simple, system-level interface to a fast, reliable network, and two cache-coherent DMA engines that respect the flow control of the NIC.
The i860XP processor is a 50 MHz RISC processor that supports dual-instruction issue (one integer and one floating-point), dual-operation instructions (e.g., floating-point multiply-and-add), and graphics operations. It is rated at 42 MIPS for integer operations and has a peak performance of 75 MFLOPS for double-precision operations. The paging unit of the i860XP implements protected, paged, virtual memory using two four-way set-associative translation look-aside buffers (TLBs). One TLB supports 4 KB pages using two-level page tables and has 64 entries; the other supports 4 MB pages using one-level page tables and has 16 entries. Each i860XP has a 16 KB instruction cache and a 16 KB data cache, which are 4-way set-associative with 32-byte cache blocks [i860]. The i860XP uses write-back caching normally, but it can also be configured to use write-through and write-once policies under software or hardware control. The write buffers can
hold two successive stores to prevent stalling on write misses. The cache controllers of the i860XP implement a variant of the MESI (Modified, Exclusive, Shared, Invalid) cache consistency protocol discussed in Chapter 4. The external bus interface also supports a 3-stage address pipeline (i.e., 3 outstanding bus cycles) and burst mode with transfer lengths of 2 or 4 at 400 MB/sec.

The Network Interface Chip (NIC) connects the 64-bit synchronous memory bus to 16-bit asynchronous (self-timed) network links. A 2 KB transmit FIFO (tx) and a 2 KB receive FIFO (rx) are used to provide rate matching between the node and a full-duplex 175 MB/sec network link. The head of the rx FIFO and the tail of the tx FIFO are accessible to the node as memory-mapped NIC I/O registers. In addition, a status register contains flags that are set when the FIFO is full, empty, almost full, or almost empty and when an end-of-packet marker is present. The NIC can optionally generate an interrupt when each flag is set. Reads and writes to the NIC FIFOs are uncached and must be done one double-word (64 bits) at a time. The first word of a message must contain the route (X-Y displacements in a 2D mesh), but the hardware does not impose any other restriction on the message format. In particular, it does not distinguish between system and user messages. In addition, the NIC also performs parity and CRC checks to maintain end-to-end data integrity.

Two DMA engines, one for sending and the other for receiving, can transfer a contiguous block of data between main memory and the NIC at 400 MB/sec. The memory region is specified as a physical address, aligned on a 32-byte boundary, with a length between 64 bytes and 16 KB (one DRAM page) in multiples of 32 bytes (a cache block). During a DMA transfer, the DMA engine snoops on the processor caches to ensure consistency. Hardware flow control prevents the DMA from overflowing or underflowing the NIC FIFOs. If the output buffer is full, the send DMA will pause and free the bus. Similarly, the receive DMA pauses when the input buffer is empty. The bus arbiter gives priority to the DMA engines over the processors. A DMA transfer is started by storing an address-length pair to a memory-mapped DMA register using the stio instruction. Upon completion, the DMA engine sets a flag in a status register and optionally generates an interrupt.

With this hardware configuration, a small message of just less than two cache blocks (seven words) can be transferred from registers in one compute processor to registers in another compute processor waiting on the transfer in just over 10 µs (500 cycles). This time breaks down almost equally between the three processor-to-processor transfers: compute processor to message processor across the bus, message processor to message processor across the network, and message processor to compute processor across the bus on the remote node. It may seem surprising that transfers between two processors on a cache-coherent memory bus would have the same latency as transfers between two message processors through the network, especially since the transfers between the processor and the network interface involve a transfer across the same bus. Let's look at this situation in a little more detail. An i860 processor can write a cache block from registers using two quad-word store instructions. Suppose that part of the block is used as a full/empty flag.
In the typical case, the last operation on the block was a store by the consumer to clear the flag; with the Paragon MESI protocol this writes through to memory, invalidates the producer's block, and leaves the consumer's block in the exclusive state. The producer's load of the flag, which finds the flag clear, misses in the producer's cache, reads the block from memory, and downgrades the consumer's block to the shared state. The first store writes through and invalidates the consumer's block, but it leaves the producer's block in the shared state, since sharing was detected when the write was performed. The second store also writes through, but since there is no sharer it leaves
the producer's block in the exclusive state. The consumer eventually reads the flag, misses, and brings in the entire line. Thus, four bus transactions are required for a single cache block transfer. (By having an additional flag that allows the producer to check that several blocks are empty, this can be reduced to three bus transactions. This is left as an exercise.)

Data is written to the network interface as a sequence of uncached double-word stores. These are all pipelined through the write buffer, although they do involve multiple bus transfers. Before writing the data, the message processor needs to check that there is room in the output buffer to hold it. Rather than pay the cost of an uncached read, it checks a bit in the processor status word corresponding to a masked 'output buffer empty' interrupt. On the receiving end, the communication processor reads a similar 'input non-empty' status bit and then reads the message data as a series of uncached loads. The actual NI-to-NI transfer takes only about 250 ns plus 40 ns per hop in an unloaded network.

For a bulk memory-to-memory transfer, there is additional work for the communication processor to start the send DMA from the user source region. This requires about 2 µs (100 cycles). The DMA transfer bursts at 400 MB/s into the network output buffer of 2048 bytes. When the output buffer is full, the DMA engine backs off until a number of cache blocks are drained into the network. On the receiving end, the communication processor detects the presence of an incoming message, reads the first few words containing the destination memory address, and starts the receive DMA to drain the remainder of the message into memory. For a large transfer, the send DMA engine will still be moving data out of memory, the receive DMA engine will be moving data into memory, and a portion of the transfer will occupy the buffers and network links in between. At this point, the message moves forward at the 175 MB/s of the network links. The send and receive DMA engines periodically kick in and move data to or from network buffers in 400 MB/s bursts.

A review of the requirements on the communication processor shows that it is responding to a large number of independent events on which it must take action. These include the current user program writing a message into the shared queue, the kernel on the compute processor writing a message into a similar "system queue," the network delivering a message into the NI input buffer, the NI output buffer going empty as a result of the network accepting a message, the send DMA engine completing, and the receive DMA engine completing. The bandwidth of the communication processor is determined by the time it takes to detect and dispatch on these various events. While handling any one of the events, all the others are effectively locked out. Additional hardware, such as the DMA engines, is introduced to minimize the work in handling any particular event and to allow data to flow from the source storage area (registers or memory) to the destination storage area in a fully pipelined fashion. However, the communication rate (messages per second) is still limited by the sequential dispatch loop in the communication processor. Also, the software on the communication processor that keeps data flowing, avoids deadlock, and avoids starving the network is rather tricky. Logically it involves a number of independent cooperating threads, but these are folded into a single sequential dispatch loop which keeps track of the state of each of the partial operations. This concurrency problem is addressed by our next case study.
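The shape of that sequential dispatch loop, with one placeholder predicate per event source named above, might be sketched as follows; the helper functions are hypothetical.

/* Hypothetical predicates and handlers for the six event sources listed in
 * the text; declared here only to make the loop's structure concrete. */
int  user_cmd_pending(void);   void handle_user_cmd(void);
int  kernel_cmd_pending(void); void handle_kernel_cmd(void);
int  ni_input_nonempty(void);  void handle_incoming(void);
int  ni_output_has_room(void); void push_outgoing(void);
int  send_dma_done(void);      void finish_send_dma(void);
int  recv_dma_done(void);      void finish_recv_dma(void);

/* Illustrative shape of the message processor's dispatch loop. */
void message_processor_loop(void)
{
    for (;;) {
        if (user_cmd_pending())    handle_user_cmd();     /* user queue in memory   */
        if (kernel_cmd_pending())  handle_kernel_cmd();   /* system queue in memory */
        if (ni_input_nonempty())   handle_incoming();     /* NIC rx FIFO            */
        if (ni_output_has_room())  push_outgoing();       /* NIC tx FIFO drained    */
        if (send_dma_done())       finish_send_dma();
        if (recv_dma_done())       finish_recv_dma();
        /* While any one handler runs, the other five sources wait; this
         * serialization is what bounds the message rate. */
    }
}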
The basic architecture of the Paragon is employed in the ASCI Red machine, the first machine to sustain a teraFLOPS (one trillion floating-point operations per second). When fully configured, this machine will contain 4500 nodes with dual 200 MHz Pentium Pro processors and 64 MB of memory. It uses an upgraded version of the Paragon network with 400 MB/s links, still in a grid topology. The machine will be spread over 85 cabinets, occupying about 1600 sq. ft. and drawing 800 KW of power. Forty of the nodes will provide I/O access to large RAID storage systems, an additional 32 nodes will provide operating system services to a lightweight kernel operating on the individual nodes, and 16 nodes will provide "hot" spares.
Many cluster designs employ SMP nodes as the basic building block, with a scalable high-performance LAN or SAN. This approach admits the option of dedicating a processor to message processing or of having that responsibility taken on by processors as demanded by message traffic. One key difference is that networks such as that used in the Paragon will back up and stop all communication progress, including system messages, unless the inbound transactions are serviced by the nodes. Thus, dedicating a processor to message handling provides a more robust design. (In special cases, such as attempting to set the record on the Linpack benchmark, even the Paragon and ASCI Red are run with both processors doing user computation.) Clusters usually rely on other mechanisms to keep communication flowing, such as dedicating processing within the network interface card, as we discuss below.

7.6.2 Case Study: Meiko CS-2

The Meiko CS-2 provides a representative concrete design with an asymmetric message processor that is closely integrated with the network interface and has a dedicated path to the network. The node architecture is essentially that of a SparcStation 10, with two standard superscalar Sparc modules on the MBUS, each with L1 cache on-chip and L2 cache on the module. Ethernet, SBUS, and SCSI connections are also accessible over the MBUS through a bus adapter to provide I/O. (A high-performance variant of the node architecture includes two Fujitsu µVP vector units sharing a three-ported memory system. The third port is the MBUS, which hosts the two compute processors and the communications module, as in the basic node.) The communications module functions as either another processor module or a memory module on the MBUS, depending on its operation. The network links provide 50 MB/s of bandwidth in each direction.

This machine takes a unique position on how the network transaction is interpreted and on how concurrency is supported in communication processing. A network transaction on the Meiko CS-2 is a code sequence transferred across the network and executed directly by the remote communications processor. The network is circuit switched, so a channel is established and held open for the duration of the network transaction execution. The channel closes with an ACK if the channel was established and the transaction executed to completion successfully. A NAK is returned if the connection is not established, there is a CRC error, the remote execution times out, or a conditional operation fails. The control flow for network transactions is straight-line code with conditional abort, but no branching. A typical cause of a time-out is a page fault at the remote end. The sorts of operations that can be included in a network transaction include: read, write, or read-modify-write of remote memory, setting events, simple tests, DMA transfers, and simple reply transactions. Thus, the format of the information in a network transaction is fairly extensive. It consists of a context identifier, a start symbol, a sequence of operations in a concrete format, and an end symbol. A transaction is between 40 and 320 bytes long. We will return to the operations supported by network transactions in more detail after looking at the machine organization.
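A rough picture of such a transaction, encoded as a short straight-line program for the remote communication processor, is sketched below. The opcodes, field names, and the bound of eight operations are assumptions chosen for illustration, not the Meiko wire format.

#include <stdint.h>

/* Illustrative encoding of a CS-2-style network transaction: a short,
 * branch-free program executed by the remote communication processor. */
typedef enum {
    OP_WRITE_WORD,   /* write a word at a remote virtual address              */
    OP_READ_WORD,    /* read a word and return it in the reply                */
    OP_TEST_ABORT,   /* conditional: abort the transaction (NAK) if test fails */
    OP_SET_EVENT,    /* update an event structure, possibly raising an interrupt */
    OP_START_DMA     /* hand a DMA descriptor to the remote DMA engine        */
} txn_opcode_t;

typedef struct {
    txn_opcode_t op;
    uint64_t     addr;      /* remote virtual address, translated remotely */
    uint64_t     value;
} txn_op_t;

typedef struct {
    uint32_t context_id;    /* capability shared by the communicating processes */
    uint16_t n_ops;         /* straight-line sequence, no branches              */
    txn_op_t ops[8];        /* bounded so the whole transaction stays within
                               the 40-320 byte range described above           */
} net_txn_t;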
Based on our discussion above, it makes sense to consider decomposing the communication processor into several independent processors, as indicated by Figure 7-23. A command processor (Pcmd) waits for communication commands to be issued on behalf of the user or system and carries them out. Since it resides as a device on the memory bus, it can respond directly to reads and writes of addresses for which it is responsible, rather than polling a shared memory location as a conventional processor would. It carries out its work by pushing route information and data into the output processor (Pout) or by moving data from memory to the output processor. (Since the output processor acts on the bus as a memory device, each store to its internal buffers can trigger it to take action.)
It may require the assistance of a device responsible for virtual-to-physical (V->P) address translation. It also provides whatever protection checks are required for user-to-user communication. The output processor monitors the status of the network output FIFO and delivers network transactions into the network. An input processor (Pin) waits for the arrival of a network transaction and executes it. This may involve delivering data into memory, posting a command to an event processor (Pevent) to signal completion of the transaction, or posting a reply operation to a reply processor (Preply), which operates very much like the output processor.
Figure 7-23 Meiko CS-2 Conceptual Structure with multiple specialized communication processors. Each of the individual aspects of generating and processing network transactions is associated with an independently operating hardware function unit.
The Meiko CS-2 provides essentially these independent functions, although they operate as time-multiplexed threads on a single microprogrammed processor, called the Elan [HoMc93]. This makes the communication between the logical processors very simple and provides a clean conceptual structure, but it does not actually keep all the information flows progressing smoothly. The functional organization of the Elan is depicted in Figure 7-24. A command is issued by the compute processor to the command processor via an exchange instruction, which swaps the value of a register and a memory location. The memory location is mapped to the head of the command processor's input queue. The value returned to the processor indicates whether the enqueue command was successful or whether the queue was already full. The value given to the command processor contains a command type and a virtual address. The command processor basically supports three commands: start DMA, in which case the address points to a DMA descriptor; set event, in which case the address refers to a simple event data structure; or start thread, in which case the address specifies the first instruction in the thread. The DMA processor reads data from memory and generates a sequence of network transactions to cause the data to be stored into the remote node. The command processor also performs event operations, which involve updating a small event data structure and possibly raising an interrupt for the main processor in order to wake a sleeping thread. The start-thread command is conveyed to a simple RISC thread processor which executes an arbitrary code sequence to construct and issue network transactions. Network transactions are interpreted by the input processor, which may cause threads to execute, DMA to start, replies to be issued, or events to be set. The reply is simply a set-event operation with an optional write of three words of data.
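Issuing a command to the Elan might therefore look roughly like the following sketch, in which the memory-mapped queue-head address, the command encoding, and the success convention are all assumptions rather than the actual interface.

#include <stdint.h>

/* Hypothetical memory-mapped head of the command processor's input queue. */
#define ELAN_CMDQ_HEAD  ((uint64_t *)0x50000000u)
#define CMD_START_DMA   1ull
#define CMD_SET_EVENT   2ull
#define CMD_RUN_THREAD  3ull

/* Atomically exchange the command with the queue head; the value handed
 * back by the swap indicates whether the enqueue succeeded or the queue
 * was already full (here, nonzero is assumed to mean "full"). */
static int elan_issue(uint64_t type, void *vaddr)
{
    uint64_t cmd = (type << 56) |
                   ((uint64_t)(uintptr_t)vaddr & 0x00ffffffffffffffull);
    uint64_t old = __atomic_exchange_n(ELAN_CMDQ_HEAD, cmd, __ATOMIC_SEQ_CST);
    return old == 0 ? 0 : -1;
}

/* Example: kick off a DMA whose descriptor the user built in its own space. */
static int start_dma(void *dma_descriptor)
{
    return elan_issue(CMD_START_DMA, dma_descriptor);
}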
Figure 7-24 Meiko CS-2 machine organization
To make the machine operation more concrete, let us consider a few simple operations. Suppose a user process wants to write data into the address space of another process of the same parallel program. Protection is provided by a capability for a communication context which both processes share. The source compute processor builds a DMA descriptor and issues a start-DMA command. Its DMA processor reads the descriptor and transfers the data as a sequence of blocks, each of which involves loading up to 32 bytes from memory and forming a write_block network transaction. The input processor on the remote node will receive and execute a series of write_block transactions, each containing a user virtual memory address and the data to be written at that address.

Reading a block of data from a remote address space is somewhat more involved. A thread is started on the local communication processor which issues a start-DMA transaction. The input processor on the remote node passes the start-DMA and its descriptor to the DMA processor on the remote node, which reads data from memory and returns it as a sequence of write_block transactions. To detect completion, these can be augmented with set-event operations.

In order to support direct user-to-user transfers, the communication processor on the Meiko CS-2 contains its own page table. The operating system on the main processor keeps this consistent with the normal page tables. If the communication processor experiences a page fault, an interrupt is generated at the main processor so that the operating system there can fault in the page and update the page tables.
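The source side of such a user-to-user remote write might be sketched as follows, reusing the hypothetical elan_issue() helper from the previous sketch; the descriptor layout is an assumption, not the Elan's actual format.

#include <stdint.h>
#include <string.h>

/* Illustrative DMA descriptor for a user-to-user remote write. */
typedef struct {
    uint32_t context_id;     /* capability shared by source and destination */
    uint32_t dest_node;
    uint64_t src_vaddr;      /* local user virtual address                  */
    uint64_t dst_vaddr;      /* destination user virtual address            */
    uint32_t length;         /* bytes; carved into <=32-byte write_blocks   */
    uint64_t done_event;     /* event set on completion for synchronization */
} dma_desc_t;

int elan_issue(uint64_t type, void *vaddr);   /* from the previous sketch */
#define CMD_START_DMA 1ull

/* Build the descriptor, then hand it to the communication processor, which
 * streams write_block transactions until the region has been transferred. */
static int remote_write(dma_desc_t *d, uint32_t ctx, uint32_t node,
                        void *src, uint64_t dst, uint32_t len, uint64_t ev)
{
    memset(d, 0, sizeof *d);
    d->context_id = ctx;
    d->dest_node  = node;
    d->src_vaddr  = (uint64_t)(uintptr_t)src;
    d->dst_vaddr  = dst;
    d->length     = len;
    d->done_event = ev;
    return elan_issue(CMD_START_DMA, d);
}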
The major shortcoming of this design is that the thread processor is quite slow and is non-preemptively scheduled. This makes it very difficult to off-load any but the most trivial processing to the thread processor. Also, the set of operations provided by network transactions is not powerful enough to construct an effective remote enqueue operation with a single network transaction.
7.7 Shared Physical Address Space

In this section we examine a fourth major design style for the communication architecture of scalable multiprocessors: a shared physical address space. This builds directly upon the modest-scale shared memory machines and provides the same communication primitives: loads, stores, and atomic operations on shared memory locations. Many machines have been developed to extend this approach to large-scale systems, including CM*, C.mmp, the NYU Ultracomputer, the BBN Butterfly, the IBM RP3, the Denelcor HEP-1, the BBN TC2000, and the Cray T3D. Most of the early designs employed a dancehall organization, with the interconnect between the memory and the processors, whereas most of the later designs use a distributed memory organization. The communication assist translates bus transactions into network transactions. The network transactions are very specific, since they describe only a predefined set of memory operations, and they are interpreted directly by the communication assist at the remote node.
Figure 7-25 Shared physical address space machine organization. In scalable shared physical address machines, network transactions are initiated as a result of conventional memory instructions. They have a fixed set of formats, interpreted directly in the communication assist hardware. The operations are request/response, and most systems provide two distinct networks to avoid fetch deadlock. The communication architecture must assume the role of a pseudo-memory unit on the issuing node and a pseudo-processor on the servicing node. The remote memory operation is accepted by the pseudo-memory unit on the issuing node, which carries out request-response transactions with the remote node. The source bus transaction is held open for the duration of the request network transaction, remote memory access, and response network transaction. Thus, the communication assist must be able to access the local memory on the remote node even while the processor on that node is stalled in the middle of its own memory operation.
A generic machine organization for a large scale distributed shared physical address machine is shown in Figure 7-25. The communication assist is best viewed as forming a pseudo-memory
module and a pseudo-processor, integrated into the processor-memory connection. Consider, for example, a load instruction executed by the processor on one node. The on-chip memory management unit translates the virtual address into a global physical address, which is presented to the memory system. If this physical address is local to the issuing node, the memory simply responds with the contents of the desired location. If not, the communication assist must act like a memory module while it accesses the remote location. The pseudo-memory controller accepts the read transaction on the memory bus, extracts the node number from the global physical address, and issues a network transaction to the remote node to access the desired location. Note that at this point the load instruction is stalled between the address phase and the data phase of the memory operation. The remote communication assist receives the network transaction, reads the desired location, and issues a response transaction to the original node. The remote communication assist appears as a pseudo-processor to the memory system on its node when it issues this proxy read to the memory system.

An important point to note is that when the pseudo-processor attempts to access memory on behalf of a remote node, the main processor there may be stalled in the middle of its own remote load instruction. A simple memory bus with a single outstanding operation is inadequate for the task. Either there must be two independent paths into memory or the bus must support split-phase operation with unordered completion. Eventually the response transaction will arrive at the originating pseudo-memory controller. It will complete the memory read operation just as if it were a (slow) memory module.

A key issue, which we will examine deeply in the next chapter, is the cachability of the shared memory locations. In most modern microprocessors, the cachability of an address is determined by a field in the page table entry for the containing page, which is extracted from the TLB when the location is accessed. In this discussion it is important to distinguish two orthogonal concepts. An address may be either private to a process or shared among processes, and it may be either physically local to a processor or physically remote. Clearly, addresses that are private to a process and physically local to the processor on which that process executes should be cached. This requires no special hardware support. Private data that is physically remote can also be cached, although this requires that the communication assists support cache block transactions, rather than just single words. No processor change is required to cache remote blocks, since remote memory accesses appear identical to local memory accesses, only slower. If physically local and logically shared data is cached locally, then accesses on behalf of remote nodes performed by the pseudo-processor must be cache coherent. If local, shared data is cached only in write-through mode, this only requires that the pseudo-processor can invalidate cached data when it performs writes to memory on behalf of remote nodes. To cache shared data in write-back mode, the pseudo-processor needs to be able to cause data to be flushed out of the cache. The most natural solution is to integrate the pseudo-processor on a cache-coherent memory bus, but the bus must also be split-phase, with some number of outstanding transactions reserved for the pseudo-processor. The final option is to cache shared, remote data.
The hardware support for accessing, transferring, and placing a remote block in the local cache is completely covered by the above options. The new issue is keeping the possibly many copies of the block in various caches coherent. We must also deal with the consistency model for such a distributed shared memory with replication. These issues require substantial design consideration, and we devote all of the next chapter to addressing them. It is clearly attractive from a performance viewpoint to cache shared remote data that is mostly accessed locally.
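The division of labor between the pseudo-memory unit and the pseudo-processor for a remote load can be summarized in the following sketch; the transaction structures and helper routines are purely illustrative.

#include <stdint.h>

typedef struct { uint32_t src_node; uint64_t addr; } read_req_t;
typedef struct { uint64_t addr; uint64_t data; } read_rsp_t;

/* Hypothetical helpers standing in for the assist hardware and network. */
uint32_t   node_of(uint64_t paddr);        /* high bits of the global physical address */
uint64_t   local_read(uint64_t paddr);     /* proxy access to this node's memory        */
void       send_request(uint32_t node, read_req_t r);
void       send_response(uint32_t node, read_rsp_t r);
read_rsp_t wait_response(uint64_t addr);   /* bus held open (or split-phase) meanwhile  */
uint32_t   MY_NODE;

/* Pseudo-memory side: accepted from the local bus when the address is remote. */
uint64_t pseudo_memory_load(uint64_t global_paddr)
{
    if (node_of(global_paddr) == MY_NODE)
        return local_read(global_paddr);
    send_request(node_of(global_paddr),
                 (read_req_t){ .src_node = MY_NODE, .addr = global_paddr });
    return wait_response(global_paddr).data;   /* completes the stalled load */
}

/* Pseudo-processor side: services requests arriving from the network,
 * coherently with the local processor's own cached copies. */
void pseudo_processor_service(read_req_t req)
{
    read_rsp_t rsp = { .addr = req.addr, .data = local_read(req.addr) };
    send_response(req.src_node, rsp);
}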
7.7.1 Case Study: Cray T3D

The Cray T3D provides a concrete example of an aggressive shared global physical address design using a modern microprocessor. The design follows the basic outline of Figure 7-13, with a pseudo-memory controller and pseudo-processor providing remote memory access via a scalable network supporting independent delivery of request and response transactions. There are seven specific network transaction formats, which are interpreted directly in hardware. However, the design extends the basic shared physical address approach in several significant ways. The T3D system is intended to scale to 2048 nodes, each with a 150 MHz dual-issue DEC Alpha 21064 microprocessor and up to 64 MB of memory, as illustrated in Figure 7-26. The DEC Alpha architecture is intended to be used as a building block for parallel architectures [Alp92], and several aspects of the 21064 strongly influenced the T3D design. We first describe salient aspects of the microprocessor itself and the local memory system. Then we discuss the assists constructed around the basic processor to provide a shared physical address space, latency tolerance, block transfer, synchronization, and fast message passing. The Cray designers sometimes refer to this as the "shell" of support circuitry around a conventional microprocessor that embodies the parallel processing capability.
Figure 7-26 Cray T3D machine organization. Each node contains an elaborate communication assist, which includes more than the pseudo-memory and pseudo-processor functions required for a shared physical address space. A set of external segment registers (the DTB) is provided to extend the machine's limited physical address space. A prefetch queue supports latency hiding through explicit read-ahead. A message queue supports events associated with message passing models. A collection of DMA units provides block transfer capability, and special pointwise and global synchronization operations are supported. The machine is organized as a 3D torus, as the name might suggest.
The Alpha 21064 has 8 KB on-chip instruction and data caches, as well as support for an external L2 cache. In the T3D design, the L2 cache is eliminated in order to reduce the time to access main memory. The processor stalls for the duration of a cache miss, so reducing the miss latency
directly increases the delivered bandwidth. The Cray design is biased toward access patterns typical of vector codes, which scan through large regions of memory. The measured access time of a load that misses to memory is 155 ns (23 cycles) on the T3D, compared to 300 ns (45 cycles) on a DEC Alpha workstation at the same clock rate with a 512 KB L2 cache [Arp*95]. (The Cray T3D access time increases to 255 ns if the access is off-page within the DRAM.) These measurements are very useful in calibrating the performance of the global memory accesses, discussed below.

The Alpha 21064 provides a 43-bit virtual address space in accordance with the Alpha architecture; however, the physical address space is only 32 bits in size. Since the virtual-to-physical translation occurs on-chip, only the physical address is presented to the memory system and communication assist. A fully populated system of 2048 nodes with 64 MB each would require 37 bits of global physical address space. To enlarge the physical address space of each node, the T3D provides an external register set, called the DTB Annex, which uses 5 bits of the physical address to select a register containing a 21-bit node number; this is concatenated with a 27-bit local physical address to form a full global physical address.1 The Annex registers also contain an additional field specifying the type of access, e.g., cached or uncached. Annex register 0 always refers to the local node. The Alpha load-lock and store-conditional instructions are used in the T3D to read and write the Annex registers. Updating an Annex register takes 23 cycles, just like an off-chip memory access, and can be followed immediately by a load or store instruction that uses the Annex register.

A read or write of a location in the global address space is accomplished by a short sequence of instructions. First, the processor-number part of the global virtual address is extracted and stored into an Annex register. A temporary virtual address is constructed so that the upper bits specify this Annex register and the lower bits the address on that node. Then a load or store instruction is issued on the temporary virtual address. The load operation takes 610 ns (91 cycles), not including the Annex setup and address manipulation. (This number increases by 100 ns (15 cycles) if the remote DRAM access is off-page. Also, if a cache block is brought over, it increases to 785-885 ns.) Remember that the virtual-to-physical translation occurs in the processor issuing the load. The page tables are set up so that the Annex register number is simply carried through from the virtual address to the physical address. So that the resulting physical address will make sense on the remote node, the physical placement of all processes in a parallel program is identical. (Paging is not supported.) In addition, care is exercised to ensure that all processes in a parallel program have extended their heaps to the same length.

The Alpha 21064 provides only non-blocking stores. Execution is allowed to proceed after a store instruction without waiting for the store to complete. Writes are buffered by a write buffer, which is four deep; in each entry, up to 32 bytes of write data can be merged. Several store instructions can be outstanding. A 'memory barrier' instruction is provided to ensure that writes have completed before further execution commences. The Alpha non-blocking store allows remote stores to be overlapped, so high bandwidth can be achieved in writes to remote memory locations.
Remote writes of up to a full cache block can issue from the write buffer every 250 ns, providing up to 120 MB/s of transfer bandwidth from the local cache to remote memory.
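Putting the pieces together, a remote read through the DTB Annex might be sketched as follows; the address layout and annex_write() are stand-ins for the actual load-lock/store-conditional update sequence and encodings.

#include <stdint.h>

#define LOCAL_BITS 27            /* 27-bit local physical address; the 5 bits
                                    above it select an Annex register */

/* Hypothetical helper: takes roughly 23 cycles, like an off-chip access. */
void annex_write(int reg, uint32_t pe, int cached);

static inline uint64_t remote_word(int annex_reg, uint32_t pe, uint64_t local_off)
{
    /* 1. Point the chosen Annex register at the target processing element. */
    annex_write(annex_reg, pe, /*cached=*/0);

    /* 2. Build a temporary address whose upper bits name that Annex register
     *    and whose lower bits give the offset within the remote node. */
    volatile uint64_t *p =
        (volatile uint64_t *)(((uint64_t)annex_reg << LOCAL_BITS) | local_off);

    /* 3. An ordinary load now flows through the pseudo-memory unit to the
     *    remote node and back (roughly 610 ns uncached, on-page). */
    return *p;
}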
1. This situation is not altogether unusual. The C.mmp employed a similar trick to overcome the limited addressing capability of the LSI-11 building block. The problems that arose from this led to a famous quote attributed variously to Gordon Bell or Bill Wulf that the only flaw in an architecture that is hard to overcome is too small an address space.
A single blocking remote write involves a sequence to issue the store, push the store out of the write buffer with a memory barrier operation, and then wait on a completion flag provided by the network interface. This requires 900 ns, plus the Annex setup and address arithmetic.

The Alpha provides a special "prefetch" instruction, intended to encourage the memory system to move important data closer to the processor. This is used in the T3D to hide the remote read latency. An off-chip prefetch queue of 16 words is provided. A prefetch causes a word to be read from memory and deposited into the queue. Reading from the queue pops the word at its head. The prefetch issue instruction is treated like a store and takes only a few cycles. The pop operation takes the 23 cycles typical of an off-chip access. If eight words are prefetched and then popped, the network latency is completely hidden and the effective latency of each is less than 300 ns.

The T3D also provides a bulk transfer engine, which can move blocks or regularly strided data between the local node and a remote node in either direction. Reading from a remote node, the block transfer bandwidth peaks at 140 MB/s, and writing to a remote node it peaks at 90 MB/s. However, use of the block transfer engine requires a kernel trap to provide the virtual-to-physical translation. Thus, the prefetch queue provides better performance for transfers up to 64 KB of data, and non-blocking stores are faster for any length. The primary advantage of the bulk transfer engine is the ability to overlap communication and computation. This capability is limited to some extent, since the processor and the bulk transfer engine compete for the same memory bandwidth.

The T3D communication assist also provides special support for synchronization. First, there is a dedicated network to support global-OR and global-AND operations, especially for barriers. This allows processors to raise a flag indicating that they have reached the barrier, continue executing, and then wait for all to enter before leaving. Each node also has a set of external synchronization registers to support atomic-swap and fetch&inc. There is also a user-level message queue, which will cause a message either to be enqueued or a thread to be invoked on a remote node. Unfortunately, either of these involves a remote kernel trap, so the two operations take 25 µs and 70 µs, respectively. In comparison, building a queue in memory using the fetch&inc operation allows a four-word message to be enqueued in 3 µs and dequeued in 1.5 µs.

7.7.2 Cray T3E

The Cray T3E [Sco96] follow-on to the Cray T3D provides an illuminating snapshot of the trade-offs in large-scale system design. The two driving forces in the design were the need to provide a more powerful, more contemporary processor in the node and to simplify the "shell." The Cray T3D has many complicated mechanisms for supporting similar functions, each with unique advantages and disadvantages. The T3E uses the 300 MHz, quad-issue Alpha 21164 processor with a sizable (96 KB) second-level on-chip cache. Since the L2 cache is on chip, eliminating it is not an option as on the T3D. However, the T3E forgoes the board-level tertiary cache typically found in Alpha 21164-based workstations. The various remote access mechanisms are unified into a single external register concept. In addition, remote memory accesses are performed using virtual addresses that are translated to physical addresses by the remote communication assist.
A user process has access to a set of 512 E-registers of 64 bits each. The processor can read and write the contents of the E-registers using conventional load and store instructions to a special region of the memory space. Operations are also provided to get data from global memory into an
E-register, to put data from an E-register to global memory, and to perform atomic read-modify-writes between E-registers and global memory. Loading remote data into an E-register involves three steps. First, the processor portion of the global virtual address is constructed in an E-address register. Second, the get command is issued via a store to a special region of memory. A field of the address used in the command store specifies the get operation and another field specifies the destination data E-register. The command-store data specifies an offset to be used relative to the address E-register. The command store has the side effect of causing the remote read to be performed and the destination E-register to be loaded with the data. Finally, the data is read into the processor via a load from the data E-register. The process for a remote put is similar, except that the store data is placed in the data E-register, which is specified in the put command-store. This approach of causing loads and stores to data registers as a side effect of operations on address registers goes all the way back to the CDC 6600 [Tho64], although it seems to have been largely forgotten in the meantime.

The utility of the prefetch queue is provided by the E-registers by associating a full/empty bit with each E-register. A series of gets can be issued, and each one sets the associated destination E-register to empty. When a get completes, the register is set to full. If the processor attempts to load from an empty E-register, the memory operation is stalled until the get completes. The utility of the block transfer engine is provided by allowing vectors of four or eight words to be transferred through E-registers in a single operation. This has the added advantage of providing a means for efficient gather operations.

The improvements in the T3E greatly simplify code generation for the machine and offer several performance advantages; however, these are not at all uniform. The computational performance is significantly higher on the T3E due to the faster processor and larger on-chip cache. However, the remote read latency is more than twice that of the T3D, increasing from 600 ns to roughly 1500 ns. The increase is due to the level 2 cache miss penalty and the remote address translation. The remote write latency is essentially the same. The prefetch cost is improved by roughly a factor of two, obtaining a rate of one word read every 130 ns. Each of the memory modules can service a read every 67 ns. Non-blocking writes have essentially the same performance on the two machines. The block transfer capability of the T3E is far superior to that of the T3D. A bandwidth of greater than 300 MB/s is obtained without the large startup cost of the block transfer engine. The bulk write bandwidth is greater than 300 MB/s, three times that of the T3D.

7.7.3 Summary

There is a wide variation in the degree of hardware interpretation of network transactions in modern large-scale parallel machines. These variations result in a wide range in the overhead experienced by the compute processor in performing communication operations, as well as in the latency added to that of the actual network by the communication assist. By restricting the set of transactions, specializing the communication assist to the task of interpreting these transactions, and tightly integrating the communication assist with the memory system of the node, the overhead and latency can be reduced substantially.
The hardware specialization can also provide the concurrency needed within the assist to handle several simultaneous streams of events with high bandwidth.
7.8 Clusters and Networks of Workstations

Along with the use of commodity microprocessors, memory devices, and even workstation operating systems in modern large-scale parallel machines, scalable communication networks similar to those used in parallel machines have become available for use in a limited local area network setting. This naturally raises the question of to what extent networks of workstations (NOWs) and parallel machines are converging. Before entertaining this question, a little background is in order.

Traditionally, collections of complete computers with a dedicated interconnect, often called clusters, have been used to serve multiprogramming workloads and to provide improved availability [Kro*86,Pfi95]. In multiprogramming clusters, a single front-end machine usually acts as an intermediary between a collection of compute servers and a large number of users at terminals or remote machines. The front-end tracks the load on the cluster nodes and schedules tasks onto the most lightly loaded nodes. Typically, all the machines in the cluster are set up to function identically; they have the same instruction set, the same operating systems, and the same file system access. In older systems, such as the Vax VMS cluster [Kro*86], this was achieved by connecting each of the machines to a common set of disks. More recently, this single system image is usually achieved by mounting common file systems over the network. By sharing a pool of functionally equivalent machines, better utilization can be achieved on a large number of independent jobs.

Availability clusters seek to minimize downtime of large critical systems, such as important online databases and transaction processing systems. Structurally they have much in common with multiprogramming clusters. A very common scenario is to use a pair of SMPs running identical copies of a database system with a shared set of disks. Should the primary system fail due to a hardware or software problem, operation rapidly "fails over" to the secondary system. The actual interconnect that provides the shared disk capability can be dual access to the disks or some kind of dedicated network.

A third major influence on clusters has been the rise of popular public domain software, such as Condor [Lit*88] and PVM [Gei93], that allows users to farm jobs over a collection of machines or to run a parallel program on a number of machines connected by an arbitrary local-area or even wide-area network. Although the communication performance is quite limited (typical latencies are a millisecond or more for even small transfers, and the aggregate bandwidth is often less than 1 MB/s), these tools provide an inexpensive vehicle for a class of problems with a very high ratio of computation to communication.

The technology breakthrough that presents the potential for clusters to take on an important role in large-scale parallel computing is a scalable, low-latency interconnect, similar in quality to that available in parallel machines, but deployed like a local-area network. Several potential candidate networks have evolved from three basic directions. Local area networks have traditionally been either a shared bus (e.g., Ethernet) or a ring (e.g., Token Ring and FDDI) with fixed aggregate bandwidth, or a dedicated point-to-point connection (e.g., HiPPI).
In order to provide scalable bandwidth to support a large number of fast machines, there has been a strong push toward switch-based local area networks (e.g., HiPPI switches, FDDI switches [LuPo97], and Fibre Channel). Probably the most significant development is the widespread adoption of the ATM (asynchronous transfer mode) standard, developed by the telecommunications industry, as a switched LAN. Several companies offer ATM switches (routers in the terminology of Section 7.2) with up to sixteen ports at 155 Mb/s (19.4 MB/s) link bandwidth. These can be cascaded to form a larger
network. Under ATM, a variable-length message is transferred on a pre-assigned route, called a "virtual circuit," as a sequence of 53-byte cells (48 bytes of data and 5 bytes of routing information). We will look at these networking technologies in more detail in Chapter 8. In terms of the model developed in Section 7.2, current ATM switches typically have a routing delay of about 10 µs in an unloaded network, although some are much higher. A second major standardization effort is represented by SCI (Scalable Coherent Interface), which includes a physical layer standard and a particular distributed cache coherency strategy. There has also been a strong trend to evolve the proprietary networks used within MPP systems into a form that can be used to connect a large number of independent workstations or PCs over a sizable area. Examples of this include Tnet [Hor95] (now ServerNet) from Tandem Corp. and Myrinet [Bod*95]. The Myrinet switch provides eight ports at 160 MB/s each, which can be cascaded in regular and irregular topologies to form a large network. It transfers variable-length packets with a routing delay of about 350 ns per hop. Link-level flow control is used to avoid dropping packets in the presence of contention.

As with more tightly integrated parallel machines, the hardware primitives in emerging NOWs and clusters remain an open issue and subject to much debate. Conventional TCP/IP communication abstractions over these advanced networks exhibit large overheads (a millisecond or more) [Kee*95], in many cases larger than those of common Ethernet. A very fast processor is required to move even 20 MB/s using TCP/IP. However, the bandwidth does scale with the number of processors, at least if there is little contention. Several more efficient communication abstractions have been proposed, including Active Messages [And*95,vEi*92,vEi*95] and Reflective Memory [Gil96,GiKa97]. Active Messages provide user-level network transactions, as discussed above. Reflective Memory allows writes to special regions of memory to appear as writes into regions on remote processors; there is no ability to read remote data, however. Supporting a true shared physical address space in the presence of potential unreliability, i.e., node failures and network failures, remains an open question.

An intermediate strategy is to view the logical network connecting a collection of communicating processes as a fully connected group of queues. Each process has a communication endpoint consisting of a send queue, a receive queue, and a certain amount of state information, such as whether notifications should be delivered on message arrival. Each process can deliver a message to any of the receive queues by depositing it in its send queue with an appropriate destination identifier. At the time of writing, this approach is being standardized by an industry consortium, led by Intel, Microsoft, and Compaq, as the Virtual Interface Architecture, based on several research efforts, including the Berkeley NOW, Cornell U-Net, Illinois Fast Messages (FM), and Princeton SHRIMP projects.

The hardware support for the communication assists and the interpretation of the network transactions within clusters and NOWs spans almost the entire range of design points discussed in Section 7.3.5. However, since the network plugs into existing machines, rather than being integrated into the system at the board or chip level, it typically must interface at an I/O bus rather than at the memory bus or closer to the processor.
In this area too there is considerable innovation. Several relatively fast I/O busses have been developed which maintain cache coherency, the most notable being PCI. Experimental efforts have integrated the network through the graphics bus[Mar*93]. An important technological force further driving the advancement of clusters is the availability of relatively inexpensive SMP building blocks. For example, clustering a few tens of Pentium Pro “quad pack” commodity servers yields a fairly large scale parallel machine with very little effort. At the high end, most of the very large machines are being constructed as a highly optimized cluster
of the vendor's largest commercially available SMP node. For example, in the 1997-8 window of machines purchased by the Department of Energy as part of the Accelerated Strategic Computing Initiative, the Intel machine is built as 4536 dual-processor Pentium Pros. The IBM machine is to be 512 4-way PowerPC 604s, upgraded to 8-way PowerPC 630s. The SGI/Cray machine is initially sixteen 32-way Origins interconnected with many HIPPI 6400 links, eventually to be integrated into larger cache coherent units, as described in the next chapter. 7.8.1 Case Study: Myrinet SBus Lanai A representative example of an emerging NOW is illustrated by Figure 7-27. A collection of UltraSparc workstations is integrated using a Myrinet scalable network via an intelligent network interface card (NIC). Let us start with the basic hardware operation and work upwards. The network illustrates what is coming to be known as a system area network (SAN), as opposed to a tightly packaged parallel machine network or widely dispersed local area network (LAN). The links are parallel copper twisted pairs (18 bits wide) and can be a few tens of feet long, depending on link speed and cable type. The communication assist follows the dedicated message processor approach, similar to the Meiko CS-2 and IBM SP2. The NIC contains a small embedded “Lanai” processor to control message flow between the host and the network. A key difference in the Myricom design is that the NIC contains a sizable amount of SRAM storage. All message data is staged through NIC memory between the host and the network. This memory is also used for Lanai instruction and data storage. There are three DMA engines on the NIC, one for network input, one for network output, and one for transfers between the host and the NIC memory. The host processor can read and write NIC memory using conventional loads and stores to properly mapped regions of the address space, i.e., through programmed I/O. The NIC processor uses DMA operations to access host memory. The kernel establishes regions of host memory that are accessible to the NIC. For short transfers it is most efficient for the host to move the data directly into and out of the NIC, whereas for long transfers it is better for the host to write addresses into the NIC memory and for the NIC to pick up these addresses and use them to set up DMA transfers. The Lanai processor can read and write the network FIFOs or it can drive them by DMA operations from or to NIC memory. The “firmware” program executing within the Lanai primarily manages the flow of data by orchestrating DMA transfers in response to commands written to it by the host and packet arrivals from the network. Typically, a command is written into NIC memory, where it is picked up by the NIC processor. The NIC transfers data, as required, from the host and pushes it into the network. The Myricom network uses source-based routing, so the header of the packet includes a simple routing directive for each network switch along the path to the destination. The destination NIC receives the packet into NIC memory. It can then inspect the information in the transaction and process it as desired to support the communication abstraction. The NIC is implemented as four basic components: a bus interface FPGA, a link interface, SRAM, and the Lanai chip, which contains the processor, DMA engines and link FIFOs. The link interface converts from on-board CMOS signals to long-line differential signalling over twisted pairs.
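As a rough illustration of the host side of this flow, the sketch below stages a short message and a command word into NIC SRAM with programmed I/O and shows the firmware noticing the command. The slot layout, the CMD_SEND encoding, and the polling discipline are assumptions made up for the example, not Myricom's actual host interface or firmware.

#include <stdint.h>
#include <string.h>

#define CMD_NONE 0u
#define CMD_SEND 1u

/* Hypothetical layout of one host-visible command slot in NIC SRAM.
 * The volatile qualifier matters: these are uncached device stores. */
typedef struct {
    volatile uint32_t cmd;        /* written last; firmware polls this */
    volatile uint32_t dest_node;  /* translated to a source route by firmware */
    volatile uint32_t length;
    volatile uint8_t  payload[256];
} nic_cmd_slot;

/* Host side: stage a small message into NIC memory via programmed I/O.
 * The caller guarantees len <= sizeof slot->payload. */
static void host_send_small(nic_cmd_slot *slot, uint32_t dest,
                            const void *buf, uint32_t len)
{
    memcpy((void *)slot->payload, buf, len);   /* data first */
    slot->dest_node = dest;
    slot->length    = len;
    slot->cmd       = CMD_SEND;                /* command word written last */
}

/* Firmware side (runs on the Lanai): poll for a command, then start the
 * DMA engine that moves the packet from SRAM into the network output FIFO. */
static void firmware_poll(nic_cmd_slot *slot)
{
    if (slot->cmd == CMD_SEND) {
        /* ...prepend the source route, start the send DMA from SRAM... */
        slot->cmd = CMD_NONE;                  /* mark the slot free again */
    }
}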
A critical aspect of the design is the bandwidth to the NIC memory. The three DMA engines and the processor share an internal bus, implemented within the Lanai chip. The network DMA engines can demand 320 MB/s, while the host DMA can demand short bursts of 100 MB/s on an SBus or long bursts of 133 MB/s on a PCI bus. The design goal for the firmware is to keep all three DMA engines active simultaneously; however, this is a little tricky because once it starts the
[Figure 7-27 diagram: a Myricom network of 8-port wormhole switches with 160 MB/s bidirectional links; each node attaches a Myricom Lanai NIC (37.5 MHz processor, SRAM, three DMA units, link interface, and bus interface) to the 25 MHz S-Bus of an UltraSparc host with UPA, L2 cache, and memory.]
Figure 7-27 NOW organization using Myrinet and dedicated message processing with an embedded processor Although the nodes of a cluster are complete conventional computers, a sophisticated communication assist can be provided within the interface to a scalable, low latency network. Typically, the network interface is attached to a conventional I/O bus, but increasingly vendors are providing means of tighter integration with the node architecture. In many cases, the communication assist provides dedicated processing of network transactions.
DMA engines, its available bandwidth and hence its execution rate are reduced considerably by competition for the SRAM. Typically, the NIC memory is logically divided into a collection of functionally distinct regions, including instruction storage, internal data structures, message queues, and data transfer buffers. Each page of the NIC memory space is independently mapped by the host virtual memory system. Thus, only the kernel can access the NIC processor code and data space. The remaining communication space can be partitioned into several disjoint communication regions. By controlling the mapping of these communication regions, several user processes can have communication endpoints that are resident on the NIC, each containing message queues and an associated data transfer area. In addition, a collection of host memory frames can be mapped into the IO space accessible from the NIC. Thus, several user processes can have the ability to write messages into the NIC and read messages directly from the NIC, or to write message descriptors containing pointers to data that is to be DMA transferred through the card. These communication end points can be managed much like conventional virtual memory, so writing into an endpoint
causes it to be made resident on the NIC. The NIC firmware is responsible for multiplexing messages from the collection of resident endpoints onto the actual network wire. It detects when a message has been written into a send queue and forms a packet by translating the user's destination address into a route through the network to the destination node and an identifier of the endpoint on that node. In addition, it places a source identifier in the header, which can be checked at the destination. The NIC firmware inspects the header for each incoming packet. If it is destined for a resident endpoint, then it can be deposited directly in the associated receive buffer and, optionally, for a bulk data transfer, if the destination region is mapped, the data can be DMA transferred into host memory. If these conditions are not met, or if the message is corrupted or the protection check is violated, the packet can be NACKed to the source. The driver that manages the mapping of endpoints and data buffer spaces is notified so that the situation can be remedied before the message is successfully retried. 7.8.2 Case Study: PCI Memory Channel A second important representative cluster communication assist design is the Memory Channel[Gil96] developed by Digital Equipment Corporation, based on the Encore Reflective Memory and research efforts in virtual memory mapped communication[Blu*94,Dub*96]. This approach seeks to provide a limited form of a shared physical address space without needing to fully integrate the pseudo memory device and pseudo processor into the memory system of the node. It also preserves some of the autonomy and independent failure characteristics of clusters. As with other clusters, the communication assist is contained in a network interface card that is inserted into a conventional node on an extension bus, in this case the PCI (Peripheral Component Interconnect) I/O bus. The basic idea behind reflective memory is to establish a connection between a region of the address space of one process, a ‘transmit region’, and a ‘receive region’ in another, as indicated by Figure 7-28. Data written to a transmit region by the source is “reflected” into the receive region of the destination. Usually, a collection of processes will have a fully connected set of transmit-receive region pairs. The transmit regions on a node are allocated from a portion of the physical address space that is mapped to the NIC. The receive regions are locked down in memory, via special kernel calls, and the NIC is configured so that it can DMA transfer data into them. In addition, the source and destination processes must establish a connection between the source transmit region and the destination receive region. Typically, this is done by associating a key with the connection and binding each region to the key. The regions are an integral number of pages. The DEC Memory Channel is a PCI-based NIC, typically placed in an Alpha-based SMP, described by Figure 7-29[Gil96]. In addition to the usual transmit and receive FIFOs, it contains a page control table (PCT), a receive DMA engine, and transmit and receive controllers. A block of data is written to the transmit region with a sequence of stores. The Alpha write buffer will attempt to merge updates to a cache block, so the send controller will typically see a cache-block write operation.
The upper portion of the address given to the controller is the frame number, which is used to index into the PCT to obtain a descriptor for the associated receive region, i.e., the destination node (or a route to the node), the receive frame number, and associated control bits. This information, along with source information, is placed into the header of the packet that is delivered to the destination node through the network. The receive controller extracts the receive frame number and uses it to index into the PCT. After checking the packet integrity and verifying the source, the data is DMA transferred into memory.
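A rough C sketch of the transmit-side lookup just described: the store address is split into a frame number and an offset, the frame number indexes the page control table, and the resulting entry supplies the header fields. The structure layouts, field names, and page size are illustrative assumptions, not the actual adapter design.

#include <stdint.h>

#define PAGE_SHIFT 13                 /* 8 KB pages, for illustration */
#define PAGE_MASK  ((1u << PAGE_SHIFT) - 1)

/* One page control table entry for a transmit page. */
typedef struct {
    uint16_t dest_node;     /* or a route to it */
    uint16_t recv_frame;    /* frame number in the receiver's PCT */
    uint16_t flags;         /* e.g., multicast, interrupt-on-arrival */
} pct_entry;

typedef struct {
    uint16_t dest_node;
    uint16_t recv_frame;
    uint16_t offset;        /* offset of the write within the page */
    uint16_t src_node;
    uint16_t flags;
} packet_header;

/* Build the header for a write of one cache block to a transmit page.
 * The frame number is assumed to be within the bounds of the PCT. */
static packet_header make_header(const pct_entry *pct, uint32_t addr,
                                 uint16_t src_node)
{
    uint32_t  frame  = addr >> PAGE_SHIFT;
    uint32_t  offset = addr & PAGE_MASK;
    pct_entry e      = pct[frame];

    packet_header h = {
        .dest_node  = e.dest_node,
        .recv_frame = e.recv_frame,
        .offset     = (uint16_t)offset,
        .src_node   = src_node,
        .flags      = e.flags,
    };
    return h;
}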
[Figure 7-28 diagram: the virtual address spaces of processes on nodes i, j, and k, showing send regions (S0-S3) mapped through each node's physical address space to the NIC's I/O space and receive regions (R0-R3) pinned in host memory, with connections pairing send and receive regions across nodes.]
Figure 7-28 Typical Reflective Memory address space organization Send regions of the processes on a node are mapped to regions in the physical address space associated with the NIC. Receive regions are pinned in memory and mappings within the NIC are established so that it can DMA the associated data to host memory. Here, node i has two communicating processes, one of which has established reflective memory connections with processes on two other nodes. Writes to a send region generate memory transactions against the NIC. It accepts the write data, builds a packet with a header identifying the receive page and offset, and routes the packet to the destination node. The NIC, upon accepting a packet from the network, inspects the header and DMA transfers the data into the corresponding receive page. Optionally, packet arrival may generate an interrupt. In general, the user process scans receive regions for relevant updates. To support message passing, the receive pages contain message queues.
The receive regions can be cachable host memory, since the transfers across the memory bus are cache coherent. If required, an interrupt is raised upon write completion. As with a shared physical address space, this approach allows data to be transferred between nodes by simply storing the data from the source and loading it at the destination. However, the use of the address space is much more restrictive, since data that is to be transferred to a particular node must be placed in a specific send page and receive page. This is quite different from the scenario where any process can write to any shared address and any process can read that address. Typically, shared data structures are not placed in the communication regions. Instead, the regions are used as dedicated message buffers. Data is read from a logically shared data structure and transmitted to a process that requires it through a logical memory channel. Thus, the communication abstraction is really one of memory-based message passing. There is no mechanism to read a remote location; a process can only read the data that has been written to it.
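To make this memory-based message passing concrete, here is a minimal sketch in C of a one-way transfer through a transmit/receive region pair used as a message buffer. It assumes the kernel has already mapped the transmit region onto the NIC and pinned the receive region (those pointers are simply passed in), and it relies on the FIFO ordering of reflected writes noted below, so the word count written last doubles as the arrival flag; the region layout is invented for the example.

#include <stdint.h>
#include <stddef.h>

/* Word 0 of each region is used as a message-arrival count/flag and the
 * rest as the payload buffer. tx points at the mapped transmit region,
 * rx at the corresponding pinned receive region on the other node. */

void send_block(volatile uint32_t *tx, const uint32_t *data, size_t nwords)
{
    for (size_t i = 0; i < nwords; i++)
        tx[i + 1] = data[i];          /* payload first */
    tx[0] = (uint32_t)nwords;         /* count written last: acts as the flag */
}

size_t recv_block(volatile uint32_t *rx, uint32_t *out)
{
    while (rx[0] == 0)
        ;                             /* spin until the reflected count arrives */
    size_t nwords = rx[0];
    for (size_t i = 0; i < nwords; i++)
        out[i] = rx[i + 1];
    rx[0] = 0;                        /* only the sender's writes are reflected,
                                         so this reset is purely local */
    return nwords;
}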
[Figure 7-29 diagram: AlphaServer SMP nodes (Alpha processor and cache, memory, bus interface) attached through a 33 MHz PCI bus to the MEMORY CHANNEL adapter, which contains the page control table (PCT), transmit controller, receive controller, receive DMA engine, and link interface onto the shared 100 MB/s MEMORY CHANNEL interconnect.]
Figure 7-29 DEC Memory Channel Hardware Organization A MEMORY CHANNEL cluster consists of a collection of AlphaServer SMPs with PCI MEMORY CHANNEL adapters. The adapter contains a page control table which maps local transmit regions to remote receive regions and local receive regions to locked down physical page frames. The transmit controller (tx ctr) accepts PCI write transactions, constructs a packet header using the store address and PCT entry contents, and deposits a packet containing the write data into the transmit FIFO. The receive controller (rx ctrl) checks the CRC and parses the header of an incoming packet before initiating a DMA transfer of the packet data into host memory. Optionally, an interrupt may be generated. In the initial offering, the memory channel interconnect is a shared 100 MB/s bus, but this is intended to be replaced by a switch. Each of the nodes is typically a sizable SMP AlphaServer.
To better support the “write by one, read by any” aspect of shared memory, the DEC memory channel allows transmit regions to multicast to a group of receive regions, including a loop-back region on the source node. To ease construction of distributed data structures, in particular a distributed lock manager, the FIFO order is preserved across operations from a transmit region. Since reflective memory builds upon, rather than extends, the node memory system, the write to a transmit region finishes as soon as the NIC accepts the data. To determine that the write has actually occurred, the host checks a status register in the NIC. The raw MEMORY CHANNEL interface obtains a one-way communication latency of 2.9 µs and a transfer bandwidth of 64 MB/s between two 300 MHz DEC AlphaServers[Law*96]. The one-way latency for a small MPI message is about 7 µs, and a maximum bandwidth of 61 MB/s is achieved on long messages. Using the TruCluster MEMORY CHANNEL software[Car*96], acquiring and releasing an uncontended spinlock take approximately 130 µs and 120 µs, respectively. The Princeton SHRIMP designs[Blu*94,Dub*96] extend the reflective memory model to better support message passing. They allow the receive offset to be determined by a register at the destination, so successive packets will be queued into the receive region.
In addition, a collection of writes can be performed to a transmit region and then a segment of the region transmitted to the receive region.
7.9 Comparison of Communication Performance Now that we have seen the design spectrum of modern scalable distributed memory machines, we can solidify our understanding of the impact of these design trade-offs in terms of the communication performance that is delivered to applications. In this section we examine communication performance through microbenchmarks at three levels. The first set of microbenchmarks uses a non-standard communication abstraction that closely approximates the basic network transaction on a user-to-user basis. The second uses the standard MPI message passing abstraction, and the third a shared address space. In making this comparison, we can see the effects of the different organizations and of the protocols used to realize the communication abstractions. 7.9.1 Network Transaction Performance Many factors interact to determine the end-to-end latency of the individual network transaction, as well as the abstraction built on top of it. When we measure the time per communication operation we observe the cumulative effect of these interactions. In general, the measured time will be larger than what one would obtain by adding up the time through each of the individual hardware components. As architects, we may want to know how each of the components impacts performance; however, what matters to programs is the cumulative effect, including the subtle interactions which inevitably slow things down. Thus, in this section we take an empirical approach to determining the communication performance of several of our case study machines. We use as a basis for the study a simple user-level communication abstraction, called Active Messages, which closely approximates the basic network transaction. We want to measure not only the total message time, but the portion of this time that is overhead in our data transfer equation (Equation 1.5) and the portion that is due to occupancy and delay. Active Messages constitute request and response transactions in a form which is essentially a restricted remote procedure call. A request consists of the destination processor address, an identifier for the message handler on that processor, and a small number of data words in the source processor registers which are passed as arguments to the handler. An optimized instruction sequence issues the message into the network via the communication assist. On the destination processor an optimized instruction sequence extracts the message from the network and invokes the handler on the message data to perform a simple action and issue a response, which identifies a response handler on the original source processor. To avoid fetch deadlock without arbitrary buffering, a node services response handlers whenever attempting to issue an Active Message. If the network is backed up and the message cannot be issued, the node will continue to drain incoming responses. While issuing a request, it must also service incoming requests. Notification of incoming messages may be provided through interrupts, through signalling a thread, through explicitly servicing the network with a null message event, or through polling. The microbenchmark that uses Active Messages is a simple echo test, where the remote processor is continually servicing the network and issuing replies. This eliminates the timing variations that would be observed if the processor were busy doing other work when the request arrived.
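A minimal sketch in C of such an echo test over a generic active message layer. The primitives am_request, am_reply, am_poll, and now_us are placeholder declarations standing in for whatever the particular platform actually provides; the timing logic is the point of the sketch.

#include <stdio.h>

/* Placeholder active message primitives; real names and signatures differ
 * from platform to platform. */
typedef void (*am_handler)(int src, int arg);
void am_request(int dest, am_handler h, int arg);  /* inject a request    */
void am_reply(int src, am_handler h, int arg);     /* reply from a handler */
void am_poll(void);                                /* service the network */
double now_us(void);                               /* local clock, in us  */

static volatile int replies;

static void reply_handler(int src, int arg)   { (void)src; (void)arg; replies++; }
static void request_handler(int src, int arg) { am_reply(src, reply_handler, arg); }

/* Echo test: time ROUNDS request/response pairs from node 0 to node 1 and
 * report half the average round trip as the one-way message time. */
#define ROUNDS 10000

void echo_test(void)
{
    double start = now_us();
    for (int i = 0; i < ROUNDS; i++) {
        replies = 0;
        am_request(1, request_handler, i);
        while (replies == 0)
            am_poll();                 /* drain incoming traffic while waiting */
    }
    printf("one-way time: %.2f us\n", (now_us() - start) / (2.0 * ROUNDS));
}

A pipelined variant that keeps several requests outstanding before waiting would measure the gap (g) rather than the one-way latency.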
Also, since our focus is on the node-to-network interface, we pick processors that are adjacent in the physical network. All measurements are performed by the source processor, since many of these machines do not have a global synchronous clock and the “time skew” between processors can easily exceed the scale of an individual message time. To obtain the end-to-end message time, the round-trip time for a request-response is divided by two. However, this one-way message time has three distinct portions, as illustrated by Figure 7-30. When a processor injects a message, it is occupied for a number of cycles as it interfaces with the communication assist. We call this the send overhead, as it is time spent that cannot be used for useful computation. Similarly, the destination processor spends a number of cycles extracting or otherwise dealing with the message, called the receive overhead. The portion of the total message cost that is not covered by overhead is the communication latency. In terms of the communication cost expression developed in Chapter 3, it includes the portions of the transit latency, bandwidth and assist occupancy components that are not overlapped with send or receive overhead. It can potentially be masked by other useful work or by processing additional messages, as we will discuss in detail in Chapter 8. From the processor's viewpoint, it cannot distinguish time spent in the communication assists from that spent in the actual links and switches of the interconnect. In fact, the communication assist may begin pushing the message into the network while the source processor is still checking status, but this work will be masked by the overhead. In our microbenchmark, we seek to determine the length of the three portions.
[Figure 7-30 diagram: timeline of a message across the source processor, its communication assist, the network, the destination's communication assist, and the destination processor, marking the send overhead Os, the receive overhead Or, the observed network latency, and the total communication latency.]
Figure 7-30 Breakdown of message time into source overhead, network time, and destination overhead A graphical depiction of the machine operation associated with our basic data transfer model for communication operations. The source processor spends Os cycles injecting the message to the communication assist, during which time it can perform no other useful work, and similarly the destination experiences Or overhead in extracting the message. To actually transfer the information involves the communication assists and network interfaces, as well as the links and switches of the network. As seen from the processor, these subcomponents are indistinguishable. It experiences the portion of the transfer time that is not covered by its own overhead as latency, which can be overlapped with other useful work. The processor also experiences the maximum message rate, which we represent in terms of the minimum average time between messages, or gap.
The left hand graph in Figure 7-31 shows a comparison of one-way Active Message time for the Thinking Machines CM-5 (Section 7.5.1), the Intel Paragon (Section 7.6.1), the Meiko CS-2 (Section 7.6.2), and a cluster of workstations (Section 7.8.1). The bars show the total one-way latency of a small (five word) message divided up into three segments indicating the processing overhead on the sending side (Os), the processing overhead on the receiving side (Or), and the
remaining communication latency (L). The bars on the right (g) show the time per message for a pipelined sequence of request-response operations. For example, on the Paragon an individual message has an end-to-end latency of about 10 µs, but a burst of messages can go at about 7.5 µs per message, or a rate of 1/(7.5 µs) ≈ 133 thousand messages per second. Let us examine each of
these components in more detail. The send overhead is nearly uniform over the four designs; however, the factors determining this component are different on each system. On the CM-5, this time is dominated by uncached writes of the data plus the uncached read of the NI status to determine if the message was accepted. The status also indicates whether an incoming message has arrived, which is convenient, since the node must receive messages even while it is unable to send. Unfortunately, two status reads are required for the two networks. (A later model of the machine, the CM-500, provided more effective status information and substantially reduced the overhead.) On the Paragon, the send overhead is determined by the time to write the message into the memory buffer shared by the compute processor and message processor. The compute processor is able to write the message and set the ‘message present’ flag in the buffer entry with two quad word stores, unique to the i860, so it is surprising that the overhead is so high. The reason has to do with the inefficiency of the bus-based cache coherence protocol within the node for such a producer-consumer situation. The compute processor has to pull the cache blocks away from the message processor before refilling them. In the Meiko CS-2, the message is built in cached memory and then a pointer to the message (and command) is enqueued in the NI with a single swap instruction. This instruction is provided in the Sparc instruction set to support synchronization operations. In this context, it provides a way of passing information to the NI and getting status back indicating whether the operation was successful. Unfortunately, the exchange operation is considerably slower than the basic uncached memory operations. In the NOW system, the send overhead is due to a sequence of uncached double word stores and an uncached load across the I/O bus to NI memory. Surprisingly, these have the same effective cost as the more tightly integrated designs. An important point revealed by this comparison is that the cost of uncached operations, misses, and synchronization instructions, generally considered to be infrequent events and therefore a low priority for architectural optimization, is critical to communication performance. The time spent in the cache controllers before they allow the transaction to be delivered to the next level of the storage hierarchy dominates even the bus protocol. The comparison of the receive overheads shows that cache-to-cache transfer from the network processor to the compute processor is more costly than uncached reads from the NI on the memory bus of the CM-5 and CS-2. However, the NOW system is subject to the greater cost of uncached reads over the I/O bus. The latency component of the network transaction time is the portion of the one-way transfer that is not covered by either send or receive overhead. Several facets of the machine contribute to this component, including the processing time on the communication assist or in the network interface, the time to channel the message onto a link, and the delay through the network. Different facets are dominant in our study machines. The CM-5 links operate at 20 MB/s (4 bits wide at 40 MHz). Thus, the occupancy of a single wire to transfer a message with 40 bytes of data (the payload) plus an envelope containing route information, CRC, and message type is nearly 2.5 µs.
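As a check on that figure, the wire occupancy follows directly from the link rate. A back-of-the-envelope version, assuming an envelope of roughly 10 bytes on top of the 40-byte payload:

t_wire ≈ (40 B + 10 B) / (20 MB/s) = 50 B / (20 B per µs) = 2.5 µs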
With the fat-tree network, there are many links (typically N/4) crossing the top of the network, so the aggregate network bandwidth is seldom a bottleneck. Each router adds a delay of roughly 200 ns, and there are at most 2 log_4 N hops. The network interface occupancy is essentially the same as the wire occupancy, as it is a very simple device that spools the packet onto or off the wire.
[Figure 7-31 bar chart: one-way Active Message time in microseconds, broken into send overhead (Os), receive overhead (Or), and latency (L), together with the message gap (g), for the CM-5, Paragon, Meiko CS2, NOW Ultra, and T3D.]
Figure 7-31 Performance comparison at the Network Transaction Level via Active Messages
In the Paragon, the latency is dominated by processing in the message processor at the source and destination. With 175 MB/s links, the occupancy of a link is only about 300 ns. The routing delay per hop is also quite small; however, in a large machine the total delay may be substantially larger than the link occupancy, as the number of hops can be 2√N. Also, the network bisection can potentially become a bottleneck, since only √N links cross the middle of the machine. However, the dominant factor is the assist (message processor) occupancy in writing the message across the memory bus into the NI at the source and reading the message from the destination. These steps account for 4 µs of the latency. Eliminating the message processors in the communication path on the Paragon decreases the latency substantially. Surprisingly, it does not increase the overhead much; writing and reading the message to and from the NI have essentially the same cost as the cache to cache transfers. However, since the NI does not provide sufficient interpretation to enforce protection and to ensure that messages move forward through the network adequately, avoiding the message processors is not a viable option in practice. The Meiko has a very large latency component. The network accounts for a small fraction of this, as it provides 40 MB/s links and a fat-tree topology with ample aggregate bandwidth. The message processor is closely coupled with the network, on a single chip. The latency is almost entirely due to accessing system memory from the message processor. Recall that the compute processor provides the message processor with a pointer to the message. The message processor performs DMA operations to pull the message out of memory. At the destination it writes it into a shared queue. An unusual property of this machine is that a circuit through the network is held open through the network transaction and an acknowledgment is provided to the source. This mechanism is used to convey to the source message processor whether it successfully obtained a
lock on the destination's incoming message queue. Thus, even though the latency component is large, it is difficult to hide it through pipelining communication operations, since source and destination message processors are occupied for the entire duration. The latency in the NOW system is distributed fairly evenly over the facets of the communication system. The link occupancy is small with 160 MB/s links and the routing delay is modest, with 350 ns per hop in roughly a fat tree topology. The data is deposited into the NI by the host and accessed directly from the NI. Time is spent on both ends of the transaction to manipulate message queues and to perform DMA operations between NI memory and the network. Thus, the two NIs and the network each contribute about a third of the 4 µs latency. The Cray T3D provides hardware support for user level messaging in the form of a per processor message queue. The capability in the DEC Alpha to extend the instruction set with privileged subroutines, called PAL code, is used. A four-word message is composed in registers and a PAL call issued to send the message. It is placed in a user level message queue at the destination processor, the destination processor is interrupted, and control is returned either to the user application thread, which can poll the queue, or to a specific message handler thread. The send overhead to inject the message is only 0.8 µs; however, the interrupt has an overhead of 25 µs and the switch to the message handler has a cost of 33 µs[Arp*95]. The message queue can be inserted into without interrupts or a thread switch by using the fetch&increment registers provided for atomic operations. The fetch&increment to advance the queue pointer and writing the message take 1.5 µs, whereas dispatching on the message and reading the data, via uncached reads, take 2.9 µs. The Paragon, Meiko, and NOW machines employ a complete operating system on each node. The systems cooperate using messages to provide a single system image and to run parallel programs on collections of nodes. The message processor is instrumental in multiplexing communication from many user processes onto a shared network and demultiplexing incoming network transactions to the correct destination processes. It also provides flow control and error detection. The MPP systems rely on the physical integrity of a single box to provide highly reliable operation. When a user process fails, the other processes in its parallel program are aborted. When a node crashes, its partition of the machine is rebooted. The more loosely integrated NOW must contend with individual node failures that are a result of hardware errors, software errors, or physical disconnection. As in the MPPs, the operating systems cooperate to control the processes that form a parallel program. When a node fails, the system reconfigures to continue without it.
On machines supporting a shared physical address, this memory request service is provided directly in hardware. Figure 7-32 shows performance of a read of a remote location for the case study machines[Kri*96]. The bars on the left show the total read time, broken down into the overhead associated with issuing the read and the latency of the remaining communication and remote pro-
cessing. The bars on the right show the minimum time between reads in the steady state. For the CM-5, there is no opportunity for optimization, since the remote processor must handle the network transaction. In the Paragon, the remote message processor performs the read request and replies. The remote processing time is significant, because the message processor must read the message from the NI, service it, and write a response to the NI. Moreover, the message processor must perform a protection check and the virtual to physical translation. On the Paragon with OSF1/AD, the message processor runs in kernel mode and operates on physical addresses; thus, it performs the page table lookup for the requested address, which may not be for the currently running process, in software. The Meiko CS-2 provides read and write network transactions, where the source-to-destination circuit in the network is held open until the read response or write acknowledgment is returned. The remote processing is dedicated and uses a hardware page table to perform the virtual-to-physical translation on the remote node. Thus, the read latency is considerably less than a pair of messages, but still substantial. If the remote message processor performs the read operation, the latency increases by an additional 8 µs. The NOW system achieves a small performance advantage by avoiding use of the remote processor. As in the Paragon, the message processor must perform the protection check and address translation, which are quite slow in the embedded message processor. In addition, accessing remote memory from the network interface involves a DMA operation and is costly. The major advantage of dedicated processing of remote memory operations in all of these systems is that the performance of the operation does not depend on the remote compute processor servicing incoming requests from the network.
[Figure 7-32 bar chart: remote read time in microseconds, broken into issue overhead and latency, together with the gap between successive reads, for the CM-5, Paragon, Meiko CS2, NOW Ultra, and T3D.]
Figure 7-32 Performance Comparison on Shared Address Read For the five study platforms, the bars on the left show the total time to perform a remote read operation and isolate the portion of that time that involves the processor in issuing and completing the read. The remainder can be overlapped with useful work, as is discussed in depth in Chapter 11. The bars on the right show the minimum time between successive reads, which is the reciprocal of the maximum rate.
The T3D shows an order of magnitude improvement available through dedicated hardware support for a shared address space. Given that the reads and writes are implemented through add-on hardware, there is a cost of 400 ns to issue the operation. The transmission of the request, remote service, and transmission of the reply take an additional 450 ns. For a sequence of reads, performance can be improved further by utilizing the hardware prefetch queue. Issuing the prefetch instruction and the pop from the queue takes about 200 ns. The latency is amortized over a sequence of such prefetches, and is almost completely hidden if 8 or more are issued. 7.9.3 Message Passing Operations Let us also take a look at measured message passing performance of several large scale machines. As discussed earlier, the most common performance model for message passing operations is a linear model for the overall time to send n bytes, given by

T(n) = T0 + n/B.    (EQ 7.2)

The start-up cost, T0, is logically the time to send zero bytes and B is the asymptotic bandwidth. The delivered data bandwidth is simply BW(n) = n / T(n). Equivalently, the transfer time can be characterized by two parameters, r∞ and n1/2, which are the asymptotic bandwidth and the transfer size at which half of this bandwidth is obtained, i.e., the half-power point.
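A small sketch in C that simply evaluates this model may make the parameters concrete. The numbers plugged in below are of roughly the Paragon's scale (T0 = 30 µs, B = 175 MB/s); for the ideal linear model the half-power point works out to n1/2 = T0 * B, although measured values can differ, since real transfer-time curves are not perfectly linear.

#include <stdio.h>

/* Linear message passing cost model: T(n) = T0 + n/B.
 * T0 in microseconds, B in MB/s (which is bytes per microsecond). */
static double msg_time_us(double n_bytes, double T0_us, double B) {
    return T0_us + n_bytes / B;
}

static double delivered_bw(double n_bytes, double T0_us, double B) {
    return n_bytes / msg_time_us(n_bytes, T0_us, B);   /* MB/s */
}

int main(void) {
    double T0 = 30.0, B = 175.0;   /* Paragon-like parameters */
    /* Under the linear model, delivered bandwidth reaches B/2 at n1/2 = T0*B. */
    double n_half = T0 * B;
    printf("n1/2 = %.0f bytes, BW(n1/2) = %.1f MB/s\n",
           n_half, delivered_bw(n_half, T0, B));
    return 0;
}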
The start-up cost reflects the time to carry out the protocol, as well as whatever buffer management and matching is required to set up the data transfer. The asymptotic bandwidth reflects the rate at which data can be pumped through the system from end to end. While this model is easy to understand and useful as a programming guide, it presents a couple of methodological difficulties for architectural evaluation. As with network transactions, the total message time is difficult to measure, unless a global clock is available, since the send is performed on one processor and the receive on another. This problem is commonly avoided by measuring the time for an echo test – one processor sends the data and then waits to receive it back. However, this approach only yields a reliable measurement if the receive is posted before the message arrives, for otherwise it is measuring the time for the remote node to get around to issuing the receive. If the receive is preposted, then the test does not measure the full time of the receive and it does not reflect the costs of buffering data, since the match succeeds. Finally, the measured times are not linear in the message size, so fitting a line to the data yields a parameter that has little to do with the actual startup cost. Usually there is a flat region for small values of n, so the start-up cost obtained through the fit will be smaller than the time for a zero byte message, perhaps even negative. These methodological concerns are not a problem for older machines, which had very large start-up costs and simple, software based message passing implementations. Table 7-1 shows the start-up cost and asymptotic bandwidth reported for commercial message passing libraries on several important large parallel machines over a period of time[Don96]. (Also shown in the table are two small-scale SMPs, discussed in the previous chapter.) While the start-up cost has dropped by an order of magnitude in less than a decade, this improvement essentially tracks improvements in cycle time, as indicated by the middle column of the table.
As illustrated by the right-hand columns, improvements in start-up cost have failed to keep pace with improvements in either floating-point performance or in bandwidth. Notice that the start-up costs are on an entirely different scale from the hardware latency of the communication network, which is typically a few microseconds to a fraction of a microsecond. The start-up cost is dominated by processing overhead on each message transaction: the protocol and the processing required to provide the synchronization semantics of the message passing abstraction.
Table 7-1 Message passing start-up costs and asymptotic bandwidths
Machine               Year   T0 (µs)   max BW (MB/s)   Cycles per T0   MFLOPS per proc   fp ops per T0      n1/2
iPSC/2                  88       700             2.7            5600                 1             700      1400
nCUBE/2                 90       160               2            3000                 2             300       300
iPSC/860                91       160               2            6400                40            6400       320
CM5                     92        95               8            3135                20            1900       760
SP-1                    93        50              25            2500               100            5000      1250
Meiko CS2               93        83              43            7470          24 (97)a     1992 (8148)      3560
Paragon                 94        30             175            1500                50            1500      7240
SP-2                    94        35              40            3960               200            7000      2400
Cray T3D (PVM)          94        21              27            3150                94            1974      1502
NOW                     96        16              38            2672               180            2880      4200
SGI Power Challenge     95        10              64             900               308            3080       800
Sun E5000               96        11             160            1760               180            1980      2100
a. Using Fujitsu vector units
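The derived columns of Table 7-1 are simple products of the raw figures; a sketch of the arithmetic is below, assuming the cycle counts were computed from the node clock rate (the 50 MHz Paragon row is consistent with this). The n1/2 column itself is a measured quantity and need not equal T0 times the asymptotic bandwidth.

/* Derived metrics in Table 7-1 (illustrative arithmetic only). */
static double cycles_per_T0(double T0_us, double clock_MHz) {
    return T0_us * clock_MHz;         /* e.g., Paragon: 30 us x 50 MHz = 1500 */
}

static double fp_ops_per_T0(double T0_us, double mflops_per_proc) {
    return T0_us * mflops_per_proc;   /* e.g., Paragon: 30 us x 50 MFLOPS = 1500 */
}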
Figure 7-33 shows the measured one way communication time, on the echo test, for several machines as a function of message size. In this log-log plot, the difference in start-up costs is apparent, as is the non-linearity for small messages. The bandwidth is given by the slope of the lines. This is more clearly seen from plotting the equivalent BW(n) in Figure 7-34. A caveat must be made in interpreting the bandwidth data for message passing as well, since the pairwise bandwidth only reflects the data rate on a point-to-point basis. We know, for example, that for the bus-based machines this is not likely to be sustained if many pairs of nodes are communicating. We will see in Chapter 7 that some networks can sustain much higher aggregate bandwidths than others. In particular, the Paragon data is optimistic for many communication patterns involving multiple pairs of nodes, whereas the other large scale systems can sustain the pairwise bandwidth for most patterns. 7.9.4 Application Level Performance The end-to-end effects of all aspects of the computer system come together to determine the performance obtained at the application level. This level of analysis is most useful to the end user, and is usually the basis for procurement decisions. It is also the ultimate basis for evaluating architectural trade-offs, but this requires mapping cumulative performance effects down to root
[Figure 7-33 log-log plot: one-way message time (µs) versus message size for the iPSC/860, IBM SP2, Meiko CS2, Paragon/Sunmos, Cray T3D, NOW, SGI Challenge, and Sun E5000.]
Figure 7-33 Time for Message Passing Operation vs. Message Size Message passing implementations exhibit nearly an order of magnitude spread in start-up cost and the cost is constant for a range of small message sizes, introducing a substantial nonlinearity in the time per message. The time is nearly linear for a range of large messages, where the slope of the curve gives the bandwidth.
causes in the machine and in the program. Typically, this involves profiling the application to isolate the portions where most time is spent and extracting the usage characteristics of the application to determine its requirements. This section briefly compares performance on our study machines for two of the NAS parallel benchmarks in the NPB2 suite[NPB]. The LU benchmark solves a finite difference discretization of the 3-D compressible Navier-Stokes equations used for modeling fluid dynamics through a block-lower-triangular block-upper-triangular approximate factorization of the difference scheme. The LU factored form is cast as a relaxation, and is solved by Symmetric Successive Over Relaxation (SSOR). The LU benchmark is based on the NAS reference implementation from 1991 using the Intel message passing library, NX[Bar*93]. It requires a power-of-two number of processors. A 2-D partitioning of the 3-D data grid onto processors is obtained by halving the grid repeatedly in the first two dimensions, alternating between x and y, until all processors are assigned, resulting in vertical pencil-like grid partitions. Each pencil can be thought of as a stack of horizontal tiles. The ordering of point-based operations constituting the SSOR procedure proceeds on diagonals which progressively sweep from one corner on a given z plane to the opposite corner of the same z plane, thereupon proceeding to the next z plane. This constitutes a diagonal pipelining method and is called a “wavefront” method by its authors[Bar*93]. The software pipeline spends relatively lit-
[Figure 7-34 plot: delivered bandwidth (MB/s) versus message size for the iPSC/860, IBM SP2, Meiko CS2, Paragon/Sunmos, Cray T3D, NOW, SGI Challenge, and Sun E5000.]
Figure 7-34 Bandwidth vs. Message Size Scalable machines generally provide a few tens of MB/s per node on large message transfers, although the Paragon design shows that this can be pushed into the hundreds of MB/s. The bandwidth is limited primarily by the ability of the DMA support to move data through the memory system, including the internal busses. Small scale shared memory designs exhibit a point-to-point bandwidth dictated by out-of-cache memory copies when few simultaneous transfers are on-going.
tle time filling and emptying and is perfectly load-balanced. Communication of partition boundary data occurs after completion of computation on all diagonals that contact an adjacent partition. It results in a relatively large number of small communications of 5 words each. Still, the total communication volume is small compared to computation expense, making this parallel LU scheme relatively efficient. Cache block reuse in the relaxation sections is high. Figure 7-35 shows the speedups obtained on sparse LU with the Class A input (200 iterations on a 64x64x64 grid) for the IBM SP-2 (wide nodes), Cray T3D, and UltraSparc cluster (NOW). The speedup is normalized to the performance on four processors, shown in the right-hand portion of the figure, because this is the smallest number of nodes for which the problem can be run on the T3D. We see that Scalability is generally good to beyond a hundred processors, but both the T3D and NOW scale considerably better than the SP2. This is consistent with the ratio of the processor performance to the performance of small message transfers. The BT algorithm solves three sets of uncoupled systems of equations, first in the x, then in the y, and finally in the z direction. These systems are block tridiagonal with 5x5 blocks and are solved using a multi-partition scheme[Bru88]. The multi-partition approach provides good load balance and uses coarse grained communication. Each processor is responsible for several disjoint subblocks of points (``cells'') of the grid. The cells are arranged such that for each direction of the
[Figure 7-35 charts: speedup on LU-A versus processor count for the T3D, SP2, and NOW (left), and MFLOP/S on LU-A at 4 processors for each system (right).]
Figure 7-35 Application Performance on Sparse LU NAS Parallel Benchmark Scalability of the IBM SP2, Cray T3D, and an UltraSparc cluster on the sparse LU benchmark is shown normalized to the performance obtained on four processors. The base performance of the three systems is shown on the right.
[Figure 7-36 charts: speedup on BT versus processor count for the T3D, SP2, and NOW (left), and BT MFLOPS at 25 processors for each system (right).]
Figure 7-36 Application Performance on BT NAS Parallel Benchmark (in F77 and MPI) Scalability of the IBM SP2, Cray T3D, and an UltraSparc cluster on the BT benchmark is shown normalized to the performance obtained on 25 processors. The base performance of the three systems is shown on the right.
line solve phase the cells belonging to a certain processor will be evenly distributed along the direction of solution. This allows each processor to perform useful work throughout a line solve, instead of being forced to wait for the partial solution to a line from another processor before beginning work. Additionally, the information from a cell is not sent to the next processor until all sections of linear equation systems handled in this cell have been solved. Therefore the granularity of communications is kept large and fewer messages are sent. The BT codes require a
square number of processors. These codes have been written so that if a given parallel platform only permits a power-of-two number of processors to be assigned to a job, then unneeded processors are deemed inactive and are ignored during computation, but are counted when determining Mflop/s rates. Figure 7-36 shows the scalability of the IBM SP2, Cray T3D, and the UltraSparc NOW on the Class A problem of the BT benchmark. Here the speedup is normalized to the performance on 25 processors, shown in the right-hand portion of the figure, because that is the smallest T3D configuration for which performance data is available. The scalability of all three platforms is good, with the SP-2 still lagging somewhat, but having much higher per node performance.
Table 7-2 Communication characteristics for one iteration of BT on the class A problem over a range of processor counts
                   4 processors               16 processors              36 processors              64 processors
               bin (KB)  msgs     KB      bin (KB)  msgs     KB      bin (KB)  msgs      KB      bin (KB)   msgs      KB
               43.5+       12    513      11.5+      144  1,652      4.5+       540   2,505      3+        1,344   4,266
               81.5+       24  1,916      61+         96  5,472      29+        540  15,425      19+       1,344  25,266
               261+        12  3,062      69+        144  9,738      45+        216   9,545      35.5+       384  13,406
Total KB                       5,490                     17,133                      27,475                       42,938
Time per iter                 5.43 s                     1.46 s                      0.67 s                       0.38 s
ave MB/s                         1.0                       11.5                        40.0                        110.3
ave MB/s/P                      0.25                       0.72                        1.11                         1.72
To understand why the scalability is less than perfect, we need to look more closely at the characteristics of the application, in particular at the communication characteristics. To focus our attention on the dominant portion of the application we will study one iteration of the main outermost loop. Typically, the application performs a few hundred of these iterations, after an initialization phase. Rather than employ the simulation methodology of the previous chapters to examine communication characteristics, we collect this information by running the program using an instrumented version of the MPI message passing library. One useful characterization is the histogram of messages by size. For each message that is sent, a counter associated with its size bin is incremented. The result, summed over all processors, is shown in the top portion of Table 7-2 for a fixed problem size and processors scaled from four to 64. For each configuration, the non-zero bins are indicated along with the number of messages of that size and an estimate of the total amount of data transferred in those messages. We see that this application essentially sends three sizes of messages, but these sizes and frequencies vary strongly with the number of processors. Both of these properties are typical of message passing programs. Well-tuned programs tend to be highly structured and use communication sparingly. In addition, since the problem size is held fixed, as the number of processors increases the portion of the problem that each processor is responsible for decreases. Since the data transfer size is determined by how the program is coded, rather than by machine parameters, it can change dramatically with configuration. Whereas on small configurations the program sends a few very large messages, on a large configuration it sends many relatively small messages. The total volume of communication increases by almost a
factor of eight when the number of processors increases by a factor of 16. This increase is certainly one factor affecting the speedup curve. In addition, the machine with higher start-up cost is affected more strongly by the decrease in message size. It is important to observe that with a fixed problem size scaling rule, the workload on each processor changes with scale. Here we see it for communication, but it also changes cache behavior and other factors. The bottom portion of Table 7-2 gives an indication of the average communication requirement for this application. Taking the total communication volume and dividing by the time per iteration (see footnote 1) on the Ultrasparc cluster, we get the average delivered message data bandwidth. Indeed the communication rate scales substantially, increasing by a factor of more than one hundred over an increase in machine size of 16. Dividing further by the number of processors, we see that the average per processor bandwidth is significant, but not extremely large. It is in the same general ballpark as the rates we observed for shared address space applications on simulations of SMPs in Section 5.5. However, we must be extremely careful about making design decisions based on average communication requirements because communication tends to be very bursty. Often substantial periods of computation are punctuated by intense communication. For the BT application we can get an idea of the temporal communication behavior by taking snapshots of the message histogram at regular intervals. The result is shown in Figure 7-37 for several iterations on one of the 64 processors executing the program. For each sample interval, the bar shows the size of the largest message sent in that interval. For this application the communication profile is similar on all processors because it follows a bulk synchronous style with all processors alternating between local computation and phases of communication. The three sizes of messages are clearly visible, repeating in a regular pattern with the two smaller sizes more common than the larger one. Overall there is substantially more white space than dark, so the average communication bandwidth is more of an indication of the ratio of communication to computation than it is the rate of communication during a communication phase.
7.10 Synchronization Scalability is a primary concern in the combination of software and hardware that implements synchronization operations in large scale distributed memory machines. With a message passing programming model, mutual exclusion is a given, since each process has exclusive access to its local address space. Point-to-point events are implicit in every message operation. The more interesting case is orchestrating global or group synchronization from point-to-point messages. An important issue here is balance: the communication pattern used to achieve the synchronization should be balanced among nodes, in which case high message rates and efficient synchronization can be realized. In the extreme, we should avoid having all processes communicate with or wait for one process at a given time. Machine designers and implementors of message passing layers attempt to maximize the message rate in such circumstances, but only the program can relieve the load imbalance. Other issues for global synchronization are similar to those for a shared address space, discussed below.
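As an illustration of a balanced pattern, here is a sketch of a dissemination-style barrier built from point-to-point MPI messages; every process sends and receives exactly one message per round, so no single node becomes a hot spot. This is only one of several balanced schemes (tree- and butterfly-based barriers are equally common), not the method of any particular machine discussed here.

#include <mpi.h>

/* Dissemination barrier over MPI point-to-point messages. In round k,
 * process p signals (p + 2^k) mod P and waits for a signal from
 * (p - 2^k + P) mod P; after ceil(log2 P) rounds every process is known
 * to have arrived. */
void msg_barrier(MPI_Comm comm)
{
    int rank, size;
    MPI_Comm_rank(comm, &rank);
    MPI_Comm_size(comm, &size);

    for (int dist = 1; dist < size; dist <<= 1) {
        int to    = (rank + dist) % size;
        int from  = (rank - dist + size) % size;
        int token = 0, dummy;
        MPI_Sendrecv(&token, 1, MPI_INT, to,   0,
                     &dummy, 1, MPI_INT, from, 0,
                     comm, MPI_STATUS_IGNORE);
    }
}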
1. The execution time is the only data in the table that is a property of the specific platform. All of the other communication characteristics are a property of the program and are the same when measured on any platform.
[Figure 7-37 plot: largest message size (kilobytes) sent in each sample interval over roughly 2500 milliseconds of execution.]
Figure 7-37 Message Profile over time for BT-A on 64 processors The program is executed with an instrumented MPI library that samples the communication histogram at regular intervals. The graph shows the largest message sent from one particular processor during each sample interval. It is clear that communication is regular and bursty.
In a shared address space, the issues for mutual exclusion and point-to-point events are essentially the same as those discussed in the previous chapter. As in small-scale shared memory machines, the trend in scalable machines is to build user-level synchronization operations (like locks and barriers) in software on top of basic atomic exchange primitives. Two major differences, however, may affect the choice of algorithms. First, the interconnection network is not centralized but has many parallel paths. On one hand, this means that disjoint sets of processors can coordinate with one another in parallel on entirely disjoint paths; on the other hand, it can complicate the implementation of synchronization primitives. Second, physically distributed memory may make it important to allocate synchronization variables appropriately among memories. The importance of this depends on whether the machine caches nonlocal shared data or not, and is clearly greater for machines that don't, such as the ones described in this chapter. This section covers new algorithms for locks and barriers appropriate for machines with physically distributed memory and interconnect, starting from the algorithms discussed for centralized-memory machines in Chapter 5 (Section 5.6). Once we have studied scalable cache-coherent systems in the next chapter, we will compare the performance implications of the different algorithms for those machines with what we saw for machines with centralized memory in Chapter 5, and also examine the new issues raised for implementing atomic primitives. Let us begin with algorithms for locks.
7.10.1 Algorithms for Locks
In Section 5.6, we discussed the basic test&set lock, the test&set lock with back-off, the test-and-test&set lock, the ticket lock, and the array-based lock. Each successively went a step further in reducing bus traffic and improving fairness, but often at a cost in overhead. For example, the ticket lock allowed only one process to issue a test&set when a lock was released, but all processors were notified of the release through an invalidation and a subsequent read miss to determine who should issue the test&set. The array-based lock fixed this problem by having each process wait on a different location, and the releasing process notify only one process of the release by writing the corresponding location. However, the array-based lock has two potential problems for scalable machines with physically distributed memory. First, each lock requires space proportional to the number of processors. Second, and more important for machines that do not cache remote data, there is no way to know ahead of time which location a process will spin on, since this is determined at runtime through a fetch&increment operation. This makes it impossible to allocate the synchronization variables in such a way that the variable a process spins on is always in its local memory (in fact, all of the locks in Chapter 5 have this problem). On a distributed-memory machine without coherent caches, such as the Cray T3D and T3E, this is a big problem, since processes will spin on remote locations, causing inordinate amounts of traffic and contention. Fortunately, there is a software lock algorithm that both reduces the space requirements and ensures that all spinning will be on locally allocated variables. We describe this algorithm next.

Software Queuing Lock
This lock is a software implementation of a lock originally proposed for all-hardware implementation by the Wisconsin Multicube project [Goo89]. The idea is to have a distributed linked list, or queue, of waiters on the lock. The head node in the list represents the process that holds the lock. Every other node is a process that is waiting on the lock, and is allocated in that process's local memory. A node points to the process (node) that tried to acquire the lock just after it. There is also a tail pointer, which points to the last node in the queue, i.e. the last node to have tried to acquire the lock. Let us look pictorially at how the queue changes as processes acquire and release the lock, and then examine the code for the acquire and release methods.

Assume that the lock in Figure 7-38 is initially free. When process A tries to acquire the lock, it gets it and the queue looks as shown in Figure 7-38(a). In step (b), process B tries to acquire the lock, so it is put on the queue and the tail pointer now points to it. Process C is treated similarly when it tries to acquire the lock in step (c). B and C are now spinning on local flags associated with their queue nodes while A holds the lock. In step (d), process A releases the lock. It then "wakes up" the next process, B, in the queue by writing the flag associated with B's node, and leaves the queue. B now holds the lock and is at the head of the queue. The tail pointer does not change. In step (e), B releases the lock similarly, passing it on to C. There are no other waiting processes, so C is at both the head and tail of the queue. If C releases the lock before another process tries to acquire it, then the lock pointer will be NULL and the lock will be free again.
In this way, processes are granted the lock in FIFO order with respect to the order in which they tried to acquire it; that order will be defined precisely below. The code for the acquire and release methods is shown in Figure 7-39. In terms of the primitives needed, the key is to ensure that changes to the tail pointer are atomic.
Figure 7-38 States of the queue for a lock as processes try to acquire it and as processes release it. (Parts (a) through (e) of the figure show the queue nodes for processes A, B, and C and the tail pointer L after each of the steps described in the text.)
In the acquire method, the acquiring process wants to change the lock pointer to point to its node. It does this using an atomic fetch&store operation, which takes two operands: it returns the current value of the first operand (here the current tail pointer), and then sets it to the value of the second operand, returning only when it succeeds. The order in which the atomic fetch&store operations of different processes succeed defines the order in which they acquire the lock. In the release method, we want to atomically check if the process doing the release is the last one in the queue, and if so set the lock pointer to NULL. We can do this using an atomic compare&swap operation, which takes three operands: it compares the first two (here the tail pointer and the node pointer of the releasing process), and if they are equal it sets the first (the tail pointer) to the third operand (here NULL) and returns TRUE; if they are not equal it does nothing and returns FALSE. The setting of the lock pointer to NULL must be atomic with the comparison, since otherwise another process could slip in between and add itself to the queue, in which case setting the lock pointer to NULL would be the wrong thing to do. Recall from Chapter 5 that a compare&swap is difficult to implement as a single machine instruction, since it requires three operands in a memory instruction (the functionality can, however, be implemented using load-locked and store-conditional instructions). It is possible to implement this queuing lock without a compare&swap, using only a fetch&store, but the implementation is a lot more complicated (it allows the queue to be broken and then repairs it) and it loses the FIFO property of lock granting [MiSc96]. It should be clear that the software queueing lock needs only as much space per lock as the number of processes waiting on or participating in the lock, not space proportional to the number of processes in the program. It is the lock of choice for machines that support a shared address space with distributed memory but without coherent caching [Ala97].
struct node {
   struct node *next;
   int locked;
} *mynode, *prev_node;

shared struct node *Lock;

lock (Lock, mynode) {
   mynode->next = NULL;                      /* make me last on the queue */
   prev_node = fetch&store(Lock, mynode);    /* Lock currently points to the previous tail of the
                                                queue; atomically set prev_node to the Lock pointer
                                                and set Lock to point to my node, so I am last */
   if (prev_node != NULL) {                  /* if by the time I get on the queue I am not the only
                                                one, some other process on the queue still holds
                                                the lock */
      mynode->locked = TRUE;                 /* Lock is held by another process */
      prev_node->next = mynode;              /* connect me to the queue */
      while (mynode->locked) {};             /* busy-wait till I am granted the lock */
   }
}

unlock (Lock, mynode) {
   if (mynode->next == NULL) {               /* no one to release, it seems */
      if (compare&swap(Lock, mynode, NULL))  /* really no one to release, i.e. Lock points to me:
                                                set Lock to NULL and return */
         return;
      while (mynode->next == NULL) {};       /* if I get here, someone just got on the queue and
                                                made my compare&swap fail, so I wait till they set
                                                my next pointer before granting them the lock */
   }
   mynode->next->locked = FALSE;             /* someone to release; release them */
}

Figure 7-39 Algorithms for the software queueing lock.
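As noted above, the fetch&store and compare&swap primitives assumed by Figure 7-39 can be synthesized from load-locked and store-conditional instructions. The sketch below is illustrative C-style pseudocode: ll() and sc() stand for the machine's load-locked and store-conditional operations and are not real library routines.

    /* Hypothetical wrappers for the machine's LL/SC instructions. */
    extern void *ll(void **addr);              /* load-locked                        */
    extern int   sc(void **addr, void *val);   /* store-conditional: 1 on success    */

    /* fetch&store: atomically set *addr to new_val and return the old value. */
    void *fetch_and_store(void **addr, void *new_val)
    {
        void *old;
        do {
            old = ll(addr);
        } while (!sc(addr, new_val));
        return old;
    }

    /* compare&swap: if *addr == expected, atomically set it to new_val and
       return TRUE; otherwise leave it unchanged and return FALSE. */
    int compare_and_swap(void **addr, void *expected, void *new_val)
    {
        do {
            if (ll(addr) != expected)
                return 0;                      /* FALSE: no change made */
        } while (!sc(addr, new_val));
        return 1;                              /* TRUE: swap succeeded  */
    }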
7.10.2 Algorithms for Barriers
In both message passing and shared address space models, global events like barriers are a key concern. A question of considerable debate is whether special hardware support is needed for global operations, or whether sophisticated software algorithms built upon point-to-point operations are sufficient. The CM-5 represented one end of the spectrum, with a special “control” network
providing barriers, reductions, broadcasts and other global operations over a subtree of the machine. The Cray T3D provided hardware support for barriers, but the T3E does not. Since it is easy to construct barriers that spin only on local variables, most scalable machines provide no special support for barriers at all but build them in software libraries using atomic exchange primitives or LL/SC. Implementing these primitives becomes more interesting when they have to interact with coherent caching as well, so we will postpone the discussion of implementation till the next chapter.

In the centralized barrier used on bus-based machines, all processors used the same lock to increment the same counter when they signalled their arrival, and all waited on the same flag variable until they were released. On a large machine, the need for all processors to access the same lock and read and write the same variables can lead to a lot of traffic and contention. Again, this is particularly true of machines that are not cache-coherent, where the variable quickly becomes a hot spot as several processors spin on it without caching it. It is possible to implement the arrival and departure in a more distributed way, in which not all processes have to access the same variable or lock. The coordination of arrival or release can be performed in phases or rounds, with subsets of processes coordinating with one another in each round, such that after a few rounds all processes are synchronized. The coordination of different subsets can proceed in parallel with no serialization needed across them. In a bus-based machine, distributing the necessary coordination actions wouldn't matter much, since the bus serializes all actions that require communication anyway; however, it can be very important in machines with distributed memory and interconnect, where different subsets can coordinate in different parts of the network. Let us examine a few such distributed barrier algorithms.

Software Combining Trees
A simple distributed way to coordinate the arrival or release is through a tree structure (see Figure 7-40), just as we suggested for avoiding hot-spots in Chapter 3.
Figure 7-40 Replacing a flat arrival structure for a barrier by an arrival tree (here of degree 2). (In the flat structure all processors contend for a single counter; in the tree-structured version there is little contention at any one node.)
An arrival tree is a tree that processors use to signal their arrival at a barrier. It replaces the single lock and counter of the centralized barrier by a tree of counters. The tree may be of any chosen degree or branching factor, say k. In the simplest case, each leaf of the tree is a process that participates in the barrier. When a process arrives at the barrier, it signals its arrival by performing a fetch&increment on the counter associated with its parent. It then checks the value returned by the fetch&increment to see if it was the last of its siblings to arrive. If not, its work for the arrival is done and it simply waits for the release. If so, it considers itself chosen to represent its siblings at the next level of
the tree, and so does a fetch&increment on the counter at that level. In this way, each tree node sends only a single representative process up to the next higher level in the tree when all the processes represented by that node's children have arrived. For a tree of degree k, it takes log_k p levels and hence that many steps to complete the arrival notification of p processes. If subtrees of processes are placed in different parts of the network, and if the counter variables at the tree nodes are distributed appropriately across memories, fetch&increment operations on nodes that do not have an ancestor-descendent relationship need not be serialized at all. A similar tree structure can be used for the release as well, so all processors don't busy-wait on the same flag. That is, the last process to arrive at the barrier sets the release flag associated with the root of the tree, on which only k-1 processes are busy-waiting. Each of the k processes then sets a release flag at the next level of the tree on which k-1 other processes are waiting, and so on down the tree until all processes are released. The critical path length of the barrier in terms of the number of dependent or serialized operations (e.g. network transactions) is thus O(log_k p), as opposed to O(p) for the centralized barrier or O(p) for any barrier on a centralized bus. The code for a simple combining tree barrier is shown in Figure 7-41. While this tree barrier distributes traffic in the interconnect, for machines that do not cache remote shared data it has the same problem as the simple lock: the variables that processors spin on are not necessarily allocated in their local memory. Multiple processors spin on the same variable, and which processors reach the higher levels of the tree and spin on the variables there depends on the order in which processors reach the barrier and perform their fetch&increment instructions, which is impossible to predict. This leads to a lot of network traffic while spinning.

Tree Barriers with Local Spinning
There are two ways to ensure that a processor spins on a local variable. One is to predetermine which processor moves up from a node to its parent in the tree based on process identifier and the number of processes participating in the barrier. In this case, a binary tree makes local spinning easy, since the flag to spin on can be allocated in the local memory of the spinning processor rather than the one that goes up to the parent level. In fact, in this case it is possible to perform the barrier without any atomic operations like fetch&increment, but with only simple reads and writes, as follows. For arrival, one process arriving at each node simply spins on an arrival flag associated with that node. The other process associated with that node simply writes the flag when it arrives. The process whose role was to spin now simply spins on the release flag associated with that node, while the other process now proceeds up to the parent node. Such a static binary tree barrier has been called a "tournament barrier" in the literature, since one process can be thought of as dropping out of the tournament at each step in the arrival tree. As an exercise, think about how you might modify this scheme to handle the case where the number of participating processes is not a power of two, and to use a non-binary tree. The other way to ensure local spinning is to use p-node trees to implement a barrier among p processes, where each tree node (leaf or internal) is assigned to a unique process.
The arrival and wakeup trees can be the same, or they can be maintained as different trees with different branching factors. Each internal node (process) in the tree maintains an array of arrival flags, with one entry per child, allocated in that node’s local memory. When a process arrives at the barrier, if its tree node is not a leaf then it first checks its arrival flag array and waits till all its children have signalled their arrival by setting the corresponding array entries. Then, it sets its entry in its parent’s (remote) arrival flag array and busy-waits on the release flag associated with its tree node in
the wakeup tree. When the root process has arrived and all its arrival flag array entries are set, all processes have arrived. The root then sets the (remote) release flags of all its children in the wakeup tree; these processes break out of their busy-wait loop and set the release flags of their children, and so on down the tree until all processes are released. The code for this barrier is shown in Figure 7-42, assuming an arrival tree of branching factor 4 and a wakeup tree of branching factor 2. In general, choosing branching factors in tree-based barriers is largely a trade-off between contention and critical path length counted in network transactions. Either of these types of barriers may work well for scalable machines without coherent caching.

struct tree_node {
   int count = 0;                     /* counter, initialized to 0 */
   int local_sense;                   /* release flag implementing sense reversal */
   struct tree_node *parent;
};

struct tree_node tree[P];             /* each element (node) allocated in a different memory */

private int sense = 1;
private struct tree_node *myleaf;     /* pointer to this process's leaf in the tree */

barrier () {
   barrier_helper(myleaf);
   sense = !(sense);                  /* reverse sense for next barrier call */
}

barrier_helper(struct tree_node *mynode) {
   if (fetch&increment(mynode->count) == k-1) {     /* last to reach this node */
      if (mynode->parent != NULL)
         barrier_helper(mynode->parent);            /* go up to parent node */
      mynode->count = 0;                            /* set up for next time */
      mynode->local_sense = !(mynode->local_sense); /* release waiters at this node */
   }
   while (sense != mynode->local_sense) {};         /* busy-wait */
}

Figure 7-41 A software combining barrier algorithm, with sense-reversal.
struct tree_node {
   struct tree_node *parent;
   int parent_sense = 0;
   int *wkup_child_flags[2];          /* pointers to the parent_sense flags of the two
                                         children in the wakeup tree */
   int child_ready[4];                /* flags for children in the arrival tree */
   int child_exists[4];
};

/* nodes are numbered from 0 to P-1 level-by-level starting from the root */
struct tree_node tree[P];             /* each element (node) allocated in a different memory */

private int sense = 1, myid;
private me = tree[myid];

barrier() {
   while (me.child_ready is not all TRUE) {};       /* busy-wait for my arrival-tree children */
   set me.child_ready to me.child_exists;           /* re-initialize for next barrier call */
   if (myid != 0) {                                 /* set parent's child_ready flag, and wait
                                                       for release */
      tree[(myid-1)/4].child_ready[(myid-1) mod 4] = TRUE;
      while (me.parent_sense != sense) {};
   }
   *me.wkup_child_flags[0] = *me.wkup_child_flags[1] = sense;  /* release my wakeup-tree children */
   sense = !sense;
}

Figure 7-42 A combining tree barrier that spins on local variables only.

Parallel Prefix
In many parallel applications a point of global synchronization is associated with combining information that has been computed by many processors and distributing a result based on the combination. Parallel prefix operations are an important, widely applicable generalization of reductions and broadcasts [Ble93]. Given some associative binary operator ⊕, we want to compute S_i = x_i ⊕ x_(i-1) ⊕ … ⊕ x_0 for i = 0, …, P-1. A canonical example is a running sum, but several other operators are useful. The carry-look-ahead operator from adder design is actually a special case of a parallel prefix circuit. The surprising fact about parallel prefix operations is that they can be performed as quickly as a reduction followed by a broadcast, with a simple pass up a binary tree and back down. Figure 7-43 shows the upward sweep, in which each node applies the operator to the pair of values it receives from its children and passes the result to its parent, just as with a binary reduction. (The value that is transmitted is indicated by the range of indices next to each arc; this is the subsequence over which the operator is applied to get that value.) In addition, each node holds onto the value it received from its least significant child (rightmost in the figure). Figure 7-44 shows the downward sweep. Each node waits until it receives a value from its parent. It passes this value along unchanged to its rightmost child. It combines this value with the value that was held over from the upward pass and passes the result to its left child. The nodes along the right edge of the tree are special, because they do not need to receive anything from their parent.
This parallel prefix tree can be implemented either in hardware or in software. The basic algorithm is important to understand.
Figure 7-43 Upward sweep of the parallel prefix operation. Each node receives two elements from its children, combines them and passes the result to its parent, and holds onto the element from the least significant (right) child. (The arcs in the figure are labeled with the range of indices, e.g. 7-0 or B-8, over which the operator has been applied to produce the transmitted value.)
Figure 7-44 Downward sweep of the parallel prefix operation. When a node receives an element from above, it passes the data down to its right child, combines it with its stored element, and passes the result to its left child. Nodes along the rightmost branch need nothing from above.
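A minimal software sketch of the two sweeps follows, with integer addition standing in for the general operator ⊕ and the data held in a single array of power-of-two length. It is a sequential illustration of the tree algorithm, not a message-passing implementation; in a parallel version, each level of the inner loops would be performed concurrently by different processors.

    /* Tree-based scan: up-sweep then down-sweep over an array whose length n
       is a power of two.  The up-sweep plays the role of Figure 7-43 and the
       down-sweep the role of Figure 7-44.  The result left in the array is the
       exclusive prefix sum; the inclusive form S_i in the text is obtained by
       combining the original x_i back in. */
    void tree_prefix_sum(int *a, int n)
    {
        /* Up-sweep: each tree node combines its two children and holds the result. */
        for (int d = 1; d < n; d <<= 1)
            for (int i = 2*d - 1; i < n; i += 2*d)
                a[i] += a[i - d];

        /* Down-sweep: each node passes its value unchanged to one child, and the
           combination of that value with the held child element to the other. */
        a[n - 1] = 0;
        for (int d = n >> 1; d >= 1; d >>= 1)
            for (int i = 2*d - 1; i < n; i += 2*d) {
                int t    = a[i - d];
                a[i - d] = a[i];
                a[i]    += t;
            }
    }

For example, applied to the array {1, 2, 3, 4} this produces {0, 1, 3, 6}, the running sum of all earlier elements.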
All-to-all Personalized Communication
All-to-all personalized communication is when each process has a distinct set of data to transmit to every other process. The canonical example of this is a transpose operation, where, say, each process owns a set of rows of a matrix and needs to transmit a sequence of elements in each row to every other processor. Another important example is remapping a data structure between blocked and cyclic layouts. Many other permutations of this form are widely used in practice. Quite a bit of work has been done on implementing all-to-all personalized communication operations efficiently, i.e., with no contention internal to the network, on specific network topologies. If the network is highly scalable, then the internal communication flows within the network become secondary, but regardless of the quality of the network, contention at the endpoints of the network, i.e., the processors or memories, is critical. A simple, widely used scheme is to schedule the sequence of communication events so that P rounds of disjoint pairwise exchanges are performed. In round i, process p transmits the data it has for process q = p ⊗ i, obtained as the exclusive-OR of the binary representations of p and i. Since exclusive-OR is its own inverse, p = q ⊗ i, and the round is, indeed, an exchange.
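A sketch of this exchange schedule, written against MPI for concreteness (the buffer layout and the function name are assumptions of this sketch, not taken from any particular library):

    #include <mpi.h>
    #include <string.h>

    /* All-to-all personalized communication by pairwise exchange.
       sendbuf and recvbuf each hold P blocks of 'bytes' bytes, one per peer
       (an assumed layout).  Round i pairs process p with q = p XOR i; since
       q XOR i = p, each round is a set of disjoint exchanges.  P is assumed
       to be a power of two. */
    void alltoall_exchange(char *sendbuf, char *recvbuf, int bytes, MPI_Comm comm)
    {
        int p, P;
        MPI_Comm_rank(comm, &p);
        MPI_Comm_size(comm, &P);
        MPI_Status status;

        /* Keep the block destined for myself. */
        memcpy(recvbuf + p * bytes, sendbuf + p * bytes, bytes);

        for (int i = 1; i < P; i++) {
            int q = p ^ i;   /* partner for this round */
            MPI_Sendrecv(sendbuf + q * bytes, bytes, MPI_BYTE, q, i,
                         recvbuf + q * bytes, bytes, MPI_BYTE, q, i,
                         comm, &status);
        }
    }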
7.11 Concluding Remarks
We have seen in this chapter that most modern large scale machines are constructed from general purpose nodes, each with a complete local memory hierarchy, augmented by a communication assist interfacing to a scalable network. However, there is a wide range of design options for the communication assist. The design is influenced very strongly by where the communication assist interfaces to the node architecture: at the processor, at the cache controller, at the memory bus, or at the I/O bus. It is also strongly influenced by the target communication architecture and programming model. Programming models are implemented on large scale machines in terms of protocols constructed out of primitive network transactions. The challenges in implementing such protocols are that a large number of transactions can be outstanding simultaneously and there is no global arbitration. Essentially any programming model can be implemented on any of the available primitives at the hardware/software boundary, and many of the correctness and scalability issues are the same; however, the performance characteristics of the available hardware primitives are quite different. The performance ultimately influences how the machine is viewed by programmers.

Modern large scale parallel machine designs are rooted heavily in the technological revolution of the mid 80s, the single-chip microprocessor, which was coined the "killer micro" as the MPP machines began to take over the high performance market from traditional vector supercomputers. However, these machines pioneered a technological revolution of their own: the single-chip scalable network switch. Like the microprocessor, this technology has grown beyond its original intended use, and a wide class of scalable system area networks are emerging, including switched gigabit (perhaps multi-gigabit) ethernets. As a result, large scale parallel machine design has split somewhat into two branches. Machines oriented largely around message passing concepts and explicit get/put access to remote memory are being overtaken by clusters, because of their extreme low cost, ease of engineering, and ability to track technology. The other branch is machines that deeply integrate the network into the memory system to provide cache-coherent access to a global physical address space with automatic replication. In other words, machines that look to the programmer like those of the previous chapter, but are built like the machines in this
chapter. Of course, the challenge is advancing the cache coherence mechanisms in a manner that provides scalable bandwidth and low latency. This is the subject of the next chapter.
7.12 References
[Ala97] [Alp92] [And*92b]
[And*94] [Arn*89]
[Arp*95] [AtSe88] [BaP93] [Bar*93]
[Bar*94] [Ble93] [Blu*94] [Bod*95] [BoRo89] [Bor*90]
[BBN89] [Bru88]
Alain Kägi, Doug Burger, and James R. Goodman, Efficient Synchronization: Let Them Eat QOLB, Proc. of 24th International Symposium on Computer Architecture, June, 1997. Alpha Architecture Handbook, Digital Equipment Corporation, 1992. T. Anderson, S. Owicki, J. Saxe and C. Thacker, “High Speed Switch Scheduling for Local Area Networks”, Proc. Fifth Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS V), October 1992. Anderson, T.E.; Culler, D.E.; Patterson, D. A case for NOW (Networks of Workstations), IEEE Micro, Feb. 1995, vol.15, (no.1):54-6 E.A. Arnould, F.J. Bitz, Eric Cooper, H.T. Kung, R.D. Sansom, and P.A. Steenkiste, “The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers”, Proc. Third Conference on Architectural Support for Programming Languages and Operating Systems (ASPLOS III), April 1989. A. Arpaci, D. E. Culler, A. Krishnamurthy, S. Steinberg, and K. Yelick. Empirical Evaluation of the Cray-T3D: A Compiler Perspective. To appear in Proc. of ISCA, June 1995. William C. Athas and Charles L. Seitz, Multicomputers: Message-Passing Concurrent Computers, IEEE Computer, Aug. 1988, pp. 9-24. D. Banks and M. Prudence, “A High Performance Network Architecture for a PA-RISC Workstation”, IEEE J. on Selected Areas in Communication, 11(2), Feb. 93. Barszcz, E.; Fatoohi, R.; Venkatakrishnan, V.; and Weeratunga, S.: ``Solution of Regular, Sparse Triangular Linear Systems on Vector and Distributed-Memory Multiprocessors'' Technical Report NAS RNR-93-007, NASA Ames Research Center, Moffett Field, CA, 94035-1000, April 1993. E. Barton, J. Crownie, and M. McLaren. Message Passing on the Meiko CS-2. Parallel Computing, 20(4):497-507, Apr. 1994. Guy Blelloch, Prefix Sums and Their Applications. In Synthesis of Parallel Algorithms ed John Reif, Morgan Kaufmann, 1993, pp 35-60. M. A. Blumrich, C. Dubnicki, E. W. Felten, K. Li, M.R. Mesarina, “Two Virtual Memory Mapped Network Interface Designs”, Hot Interconnects II Symposium, Aug. 1994. N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su, Myrinet: A Gigabit-per-Second Local Area Network, IEEE Micro, Feb. 1995, vol.15, (no.1), pp. 29-38. Luc Bomans and Dirk Roose, Benchmarking the iPSC/2 Hypercube Multiprocessor. Concurrency: Practivce and Experience, 1(1):3-18, Sep. 1989. S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, and J. Webb. Supporting systolic and memory communication in iwarp.In Proc. 17th Intl. Symposium on Computer Architecture, pages 70-81. ACM, May 1990. Revised version appears as technical report CMU-CS-90-197. BBN Advanced Computers Inc., TC2000 Technical Product Summary, 1989. Bruno, J; Cappello, P.R.: Implementing the Beam and Warming Method on the Hypercube. Proceedings of 3rd Conference on Hypercube Concurrent Computers and Applications, Pasadena,
CA, Jan 19-20, 1988.
[BCS93]
Edoardo Biagioni, Eric Cooper, and Robert Sansom, “Designing a Practical ATM LAN”, IEEE Network, March 1993. [Car*96] W. Cardoza, F. Glover, and W. Snaman Jr., “Design of a TruCluster Multicomputer System for the Digital UNIX Environment”, Digital Technical Journal, vol. 8, no. 1, pp. 5-17, 1996. [Cul*91] D. E. Culler, A. Sah, K. E. Schauser, T. von Eicken, and J. Wawrzynek, Fine-grain Parallelism with Minimal Hardware Support, Proc. of the 4th Int’l Symposium on Arch. Support for Programming Languages and Systems (ASPLOS), pp. 164-175, Apr. 1991. [DaSe87] W. Dally and C. Seitz, Deadlock-Free Message Routing in Multiprocessor Interconnections Networks, IEEE-TOC, vol c-36, no. 5, may 1987. [Dal90] William J. Dally, Performance Analysis of k-ary n-cube Interconnection Networks, IEEE-TOC, V. 39, No. 6, June 1990, pp. 775-85. [Dal93] Dally, W.J.; Keen, J.S.; Noakes, M.D., The J-Machine architecture and evaluation, Digest of Papers. COMPCON Spring ‘93, San Francisco, CA, USA, 22-26 Feb. 1993, p. 183-8. [Dub96] C. Dubnicki, L. Iftode, E. W. Felten, K. Li, “Software Support for Virtual Memory-Mapped Communication”, 10th International Parallel Processing Symposium, April 1996. [Dun88] T. H. Dunigan, Performance of a Second Generation Hypercube, Technical Report ORNL/TM10881, Oak Ridge Nat. Lab., Nov. 1988. [Fox*88]G. Fox, M. Johnson, G. Lyzenga, S. Otto, J. Salmon, D. Walker, Solving Problems on Concurrent Processors, vol 1, Prentice-Hall, 1988. [Gei93] A. Geist et al, PVM 3.0 User’s Guide and REference Manual, Oak Ridge NAtional Laboratory, Oak Ridge, Tenn. Tech. Rep. ORNL/TM-12187, Feb. 1993. [Gil96] Richard Gillett, Memory Channel Network for PCI, IEEE Micro, pp. 12-19, vol 16, No. 1., Feb. 1996. [Gika97] Richard Gillett and Richard Kaufmann, Using Memory Channel Network, IEEE Micro, Vol. 17, No. 1, January / February 1997. [Goo89] James. R. Goodman, Mary. K. Vernon, Philip. J. Woest., Set of Efficient Synchronization Primitives for a Large-Scale Shared-Memory Multiprocessor, Proceedings of the Third International Conference on Architectural Support for Programming Languages and Operating Systems, April, 1989. [Got*83] A. Gottlieb, R. Grishman, C. P. Kruskal, K. P. McAuliffe, L. Rudolph, and M. Snir. The NYU Ultracomputer - Designing an MIMD Shared Memory Parallel Computer, IEEE Trans. on Computers, c-32(2):175-89, Feb. 1983. [Gro92] W. Groscup, The Intel Paragon XP/S supercomputer. Proceedings of the Fifth ECMWF Workshop on the Use of Parallel Processors in Meteorology. Nov 1992, pp. 262--273. [GrHo90] Grafe, V.G.; Hoch, J.E., The Epsilon-2 hybrid dataflow architecture, COMPCON Spring ‘90, San Francisco, CA, USA, 26 Feb.-2 March 1990, p. 88-93. [Gur*85]J. R. Gurd, C. C. Kerkham, and I. Watson, The Manchester Prototype Dataflow Computer, Communications of the ACM, 28(1):34-52, Jan. 1985. [Hay*94] K. Hayashi, et al. AP1000+: Architectural Support of PUT/GET Interface for Parallelizing Compiler. Proc. of ASPLOS VI, October 1994. [HoMc93] Mark Homewood and Moray McLaren, Meiko CS-2 Interconnect Elan - Elite Design, Hot Interconnects, Aug. 1993. [Hor95] R. Horst, TNet: A Reliable System Area Network, IEEE Micro, Feb. 1995, vol.15, (no.1), pp. 3745.
[i860]
Intel Corporation. i750, i860, i960 Processors and Related Products, 1994.
[Kee*95]
Kimberly K. Keeton, Thomas E. Anderson, David A. Patterson, LogP Quantified: The Case for Low-Overhead Local Area Networks. Hot Interconnects III: A Symposium on High Performance Interconnects , 1995. R. E. Kessler and J. L. Schwarzmeier, Cray T3D: a new dimension for Cray Research, Proc. of Papers. COMPCON Spring '93, pp 176-82, San Francisco, CA, Feb. 1993. R. Kent Koeninger, Mark Furtney, and Martin Walker. A Shared Memory MPP from Cray Research, Digital Technical Journal 6(2):8-21, Spring 1994. A. Krishnamurthy, K. Schauser, C. Scheiman, R. Wang, D. Culler, and K. Yelick, “Evaluation of Architectural Support for Global Address-Based Communication in Large-Scale Parallel Machines”, Proc. of ASPLOS 96. Kronenberg, Levy, Strecker “Vax Clusters: A closely-coupled distributed system,” ACM Transactions on Computer Systems, May 1986 v 4(3):130-146. Kung, H.T, et al, Network-based multicomputers: an emerging parallel architecture, Proceedings Supercomputing ‘91, Albuquerque, NM, USA, 18-22 Nov. 1991 p. 664-73 J. V. Lawton, J. J. Brosnan, M. P. Doyle, S.D. O Riodain, T. G. Reddin, “Building a High-performance Message-passing System for MEMORY CHANNEL Clusters”, Digital Technical Journal, Vol. 8 No. 2, pp. 96-116, 1996. C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmau, M. A. St. Pierre, D. S. Wells, M. C. Wong, S. Yang, R. Zak. The Network Architecture of the CM-5. Symposium on Parallel and Distributed Algorithms '92" Jun 1992, pp. 272-285. M. Litzkow, M. Livny, and M. W. Mutka, ``Condor - A Hunter of Idle Workstations,'' Proceedings of the 8th International Conference of Distributed Computing Systems, pp. 104-111, June, 1988. Jefferey Lukowsky and Stephen Polit, IP Packet Switching on the GIGAswitch/FDDI System, On-line as http://www.networks.digital.com:80/dr/techart/gsfip-mn.html. Martin, R. “HPAM: An Active Message Layer of a Network of Workstations”, presented at Hot Interconnects II, Aug. 1994. John Mellor-Crummey and Michael Scott. Algorithms for Scalable Synchronization on Shared Memory Mutiprocessors. ACM Transactions on Computer Systems, vol. 9, no. 1, pp. 21-65, February 1991. M. Michael and M. Scott, Simple, Fast, and Practical Non-Blocking and Blocking Concurrent Queue Algorithms'', Proceedings of the 15th Annual ACM Symposium on Principles of Distributed Computing, Philadelphia, PA, pp 267-276, May 1996. MPI Forum, “MPI: A Message Passing Interface”, Supercomputing 93, 1993. (Updated spec at http://www.mcs.anl.gov/mpi/) R. S. Nikhil and Arvind, Can Dataflow Subsume von Neuman Computing?, Proc. of 16th Annual International Symp. on Computer Architecture, pp. 262-72, May 1989 R. Nikhil, G. Papadopoulos, and Arvind, *T: A Multithreaded Massively Parallel Architecture, Proc. of ISCA 93, pp. 156-167, May 1993. NAS Parallel Benchmarks, http://science.nas.nasa.gov/Software/NPB/. Papadopoulos, G.M. and Culler, D.E. Monsoon: an explicit token-store architecture, Proceedings. The 17th Annual International Symposium on Computer Architecture, Seattle, WA, USA, 28-31 May 1990, p. 82-91. Paul Pierce and Greg Regnier, The Paragon Implementation of the NX Message Passing Interface.
[KeSc93] [Koe*94] [Kri*96]
[Kro*86] [Kun*91] [Law*96]
[Lei*92]
[Lit*88] [LuPo97] [Mar*93] [MCS91]
[MiSc96]
[MPI93] [NiAr89] [NPA93] [NPB] [PaCu90]
[PiRe94]
Proc. of the Scalable High-Performance Computing Conference, pp. 184-90, May 1994.
[Pfi*85]
[Pfi95] [Rat85] [Sak*91]
[Sco96]
[Shi*84] [SP2]
G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliff, E. A. Melton, V. A. Norton, and J. Weiss, The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture, International Conference on Parallel Processing, 1985. G.F. Pfister, In Search of Clusters – the coming battle in lowly parallel computing, Prentice Hall, 1995. J. Ratner, Concurrent Processing: a new direction in scientific computing, Conf. Proc. of the 1985 Natinal Computing Conference, p. 835, 1985. Sakai, S.; Kodama, Y.; Yamaguchi, Y., Prototype implementation of a highly parallel dataflow machine EM4. Proceedings. The Fifth International Parallel Processing Symposium, Anaheim, CA, USA, 30 April-2 May, p. 278-86. Scott, S., Synchronization and Communication in the T3E Multiprocesor, Proc. of the 7th International Conference on Architectural Suport for Programming Languages and Operating Systems, Cambridge, MA, pp.. 26-36, Oct. 1996. T. Shimada, K. Hiraki, and K. Nishida, An Architecture of a Data Flow Machine and its Evaluation, Proc. of Compcon 84, pp. 486-90, IEEE, 1984. C. B. Stunkel et al. The SP2 Communication Subsystem. On-line as http://
ibm.tc.cornell.edu/ibm/pps/doc/css/css.ps [ScSc95] [Sei84] [Sei85] [Swa*77a] [Swa*77b]
[Tho64]
[TrS91] [TrDu92] [vEi*92]
[vEi*95] [WoHi95]
9/10/97
K. E. Schauser and C. J. Scheiman. Experience with Active Messages on the Meiko CS-2. IPPS ‘95. C. L. Seitz, Concurrent VLSI Architectures, IEEE Transactions on Computers, 33(12):1247-65, Dec. 1984 Charles L. Seitz, The Cosmic Cube, Communications of the ACM, 28(1):22-33, Jan 1985. R. J. Swan, S. H. Fuller, and D. P. Siewiorek, CM* - a modular, multi-microprocessor, AFIPS Conference Proceedings, vol 46, National Computer Conference, 1977, pp. 637-44. R. J. Swan, A. Bechtolsheim, K-W Lai, and J. K. Ousterhout, The Implementation of the CM* multi-microprocessor, AFIPS Conference Proceedings, vol 46, National Computer Conference, 1977, pp. 645-55. Thornton, James E., Parallel Operation in the Control Data 6600, AFIPS proc. of the Fall Joint Computer Conference, pt. 2, vol. 26, pp. 33-40, 1964. (Reprinted in Computer Structures: Princiiples and Examples, Siework, Bell, and Newell, McGray Hill, 1982). C. Traw and J. Smith, “A High-Performance Host Interface for ATM Networks,” Proc. ACM SIGCOMM Conference, pp. 317-325, Sept. 1991. R. Traylor and D. Dunning. Routing Chip Set for Intel Paragon Parallel Supercomputer. Proc. of Hot Chips ‘92 Symposium, August 1992. von Eicken, T.; Culler, D.E.; Goldstein, S.C.; Schauser, K.E., Active messages: a mechanism for integrated communication and computation, Proc. of 19th Annual International Symposium on Computer Architecture, old Coast, Qld., Australia, 19-21 May 1992, pp. 256-66 T. von Eicken, A. Basu, and Vineet Buch, Low-Latency Communication Over ATM Using Active Messages, IEEE Micro, Feb. 1995, vol.15, (no.1), pp. 46-53. D. A. Wood and M. D. Hill, Cost-effective Parallel Computing, IEEE Computer, Feb. 1995, pp. 69-72.
DRAFT: Parallel Computer Architecture
509
Scalable Multiprocessors
7.13 Exercises

7.1 A radix-2 FFT over n complex numbers is implemented as a sequence of log n completely parallel steps, requiring 5n log n floating point operations while reading and writing each element of data log n times. Calculate the communication to computation ratio on a dancehall design, where all processors access memory through the network, as in Figure 7-3. What communication bandwidth would the network need to sustain for the machine to deliver 250p MFLOPS on p processors?

7.2 If the data in Exercise 7.1 is spread over the memories in a NUMA design using either a cyclic or block distribution, then log (n/p) of the steps will access data in the local memory, assuming both n and p are powers of two, and in the remaining log p steps half of the reads and half of the writes will be local. (The choice of layout determines which steps are local and which are remote, but the ratio stays the same.) Calculate the communication to computation ratio on a distributed memory design, where each processor has a local memory, as in Figure 7-2. What communication bandwidth would the network need to sustain for the machine to deliver 250p MFLOPS on p processors? How does this compare with Exercise 7.1?

7.3 If the programmer pays attention to the layout, the FFT can be implemented so that a single, global transpose operation is required, in which n - n/p data elements are transmitted across the network. All of the log n steps are performed on local memory. Calculate the communication to computation ratio on a distributed memory design. What communication bandwidth would the network need to sustain for the machine to deliver 250p MFLOPS on p processors? How does this compare with Exercise 7.2?

7.4 Reconsider Example 7-1 where the number of hops for an n-node configuration is √n. How does the average transfer time increase with the number of nodes? What about ∛n?

7.5 Formalize the cost scaling for the designs in Exercise 7.4.

7.6 Consider a machine as described in Example 7-1, where the number of links occupied by each transfer is log n. In the absence of contention for individual links, how many transfers can occur simultaneously?

7.7 Reconsider Example 7-1 where the network is a simple ring. The average distance between two nodes on a ring of n nodes is n/2. How does the average transfer time increase with the number of nodes? Assuming each link can be occupied by at most one transfer at a time, how many such transfers can take place simultaneously?

7.8 For a machine as described in Example 7-1, suppose that a broadcast from a node to all the other nodes uses n links. How would you expect the number of simultaneous broadcasts to scale with the number of nodes?

7.9 Suppose a 16-way SMP lists at $10,000 plus $2,000 per node, where each node contains a fast processor and 128 MB of memory. How much does the cost increase when doubling the capacity of the system from 4 to 8 processors? From 8 to 16 processors?

7.10 Prove the statement from Section 7.2.3 that parallel computing is more cost-effective whenever Speedup(p) > Costup(p).

7.11 Assume a bus transaction with n bytes of payload occupies the bus for 4 + n/8 cycles, up to a limit of 64 bytes. Draw a graph comparing the bus utilization for programmed I/O and DMA for sending various sized messages, where the message data is in memory, not in registers. For DMA include the extra work to inform the communication assist of the DMA address and
length. Consider both the case where the data is in cache and where it is not. What assumptions do you need to make about reading the status registers?

7.12 The Intel Paragon has an output buffer of size 2KB which can be filled from memory at a rate of 400 MB/s and drained into the network at a rate of 175 MB/s. The buffer is drained into the network while it is being filled, but if the buffer gets full the DMA device will stall. In designing your message layer you decide to fragment long messages into DMA bursts that are as long as possible without stalling behind the output buffer. Clearly these can be at least 2KB in size, but in fact they can be longer. Calculate the apparent size of the output buffer driving a burst into an empty network given these rates of flow.

7.13 Based on a rough estimate from Figure 7-33, which of the machines will have a negative T0 if a linear model is fit to the communication time data?

7.14 Use the message frequency data presented in Table 7-2 to estimate the time each processor would spend in communication on an iteration of BT for the machines described in Table 7-1.

7.15 Table 7-3 describes the communication characteristics of sparse LU.
• How do the message size characteristics differ from those of BT? What does this say about the application?
• How does the message frequency differ?
• Estimate the time each processor would spend in communication on an iteration of LU for the machines described in Table 7-1.

Table 7-3 Communication characteristics for one iteration of LU on the class A problem over a range of processor counts
               4 processors           16 processors          32 processors          64 processors
bins (KB)   msgs      KB    bins (KB)   msgs      KB    bins (KB)   msgs      KB    bins (KB)    msgs      KB
1+           496     605    0.5-1      2,976   2,180    0-0.5      2,976     727    0-0.5      13,888   3,391
163.5+         8   1,279    81.5+         48   3,382    0.5-1      3,472   2,543    81.5          224   8,914
                                                        40.5+         48   1,910    81.5+          56   4,471
Total KB
Time per iter       2.4s                       0.58                        0.34                         0.19
ave MB/s            0.77                       10.1                        27.7                         63.2
ave MB/s/P          0.19                       0.63                        0.86                         0.99
CHAPTER 8
Directory-based Cache Coherence
Morgan Kaufmann is pleased to present material from a preliminary draft of Parallel Computer Architecture; the material is (c) Copyright 1996 Morgan Kaufmann Publishers. This material may not be used or distributed for any commercial purpose without the express written consent of Morgan Kaufmann Publishers. Please note that this material is a draft of forthcoming publication, and as such neither Morgan Kaufmann nor the authors can be held liable for changes or alterations in the final edition.
8.1 Introduction
A very important part of the evolution of parallel architecture puts together the concepts of the previous two chapters. The idea is to provide scalable machines supporting a shared physical address space, but with automatic replication of data in local caches. In Chapter 7 we studied how to build scalable machines by distributing the memory among the nodes. Nodes were connected together using a scalable network, a sophisticated communication assist formed the node-to-network interface as shown in Figure 8-1, and communication was performed through point-to-point network transactions. At the final point in our design spectrum, the communication assist provided a shared physical address space. On a cache miss, it was transparent to the cache and the processor whether the miss was serviced from the local physical memory or some remote memory; the latter just takes longer. The natural tendency of the cache in this design is to replicate the data accessed by the processor, regardless of whether the data are shared or from where they come. However, this again raises the issue of coherence. To avoid the coherence problem
and simplify memory consistency, the designs of Chapter 7 disabled the hardware caching of logically shared but physically remote data. This implied that the programmer had to be particularly careful about data distribution across physical memories, and replication and coherence were left to software. The question we turn to in this chapter is how to provide a shared address space programming model with transparent, coherent replication of shared data in the processor caches and an intuitive memory consistency model, just as in the small scale shared memory machines of Chapter 5, but without the benefit of a globally snoopable broadcast medium. Not only must the hardware latency and bandwidth scale well, as in the machines of the previous chapter, but so must the protocols used for coherence, at least up to the scales of practical interest.
Figure 8-1 A generic scalable multiprocessor. (Each node consists of a processor and cache, memory, and a communication assist, connected to the other nodes by a scalable interconnection network.)
This chapter will focus on full hardware support for this programming model, so in terms of the layers of abstraction it is supported directly at the hardware-software interface. Other programming models can be implemented in software on the systems discussed in this chapter (see Figure 8-2).

Figure 8-2 Layers of abstraction for systems discussed in this chapter. (Message passing and shared address space programming models are mapped, through compilation or library and operating systems support, onto the communication abstraction at the user/system boundary, which is in turn implemented by the communication hardware and the physical communication medium below the hardware/software boundary.)
The most popular approach to scalable cache coherence is based on the concept of a directory. Since the state of a block in the caches can no longer be determined implicitly, by placing a request on a shared bus and having it be snooped, the idea is to maintain this state explicitly in a place (a directory) where requests can go and look it up. Consider a simple example. Imagine that each block of main memory has associated with it, at the main memory, a record of the caches containing a copy of the block and the state therein. This record is called the directory entry for that block (see Figure 8-3).
Figure 8-3 A scalable multiprocessor with directories. (Each node's main memory is augmented with a directory that records, for each locally allocated block, which caches hold a copy.)

As in bus-based systems, there may be many caches with a clean, readable block, but if the block is writable, possibly dirty, in one cache then only that cache may have a valid copy. When a node incurs a cache miss, it first communicates with the directory entry for the block using network transactions. Since the directory entry is collocated with the main memory for the block, its location can be determined from the address of the block. From the directory, the node determines where the valid copies (if any) are and what further actions to take. It then communicates with the copies as necessary using additional network transactions. The resulting changes to the state of cached blocks are also communicated to the directory entry for the block through network transactions, so the directory stays up to date. On a read miss, the directory indicates from which node the data may be obtained, as shown in Figure 8-4(a). On a write miss, the directory identifies the copies of the block, and invalidation or update network transactions may be sent to these copies (Figure 8-4(b)). (Recall that a write to a block in shared state is also considered a write miss.) Since invalidations or updates are sent to multiple copies through potentially disjoint paths in the network, determining the completion of a write now requires that all copies reply to invalidations with explicit acknowledgment transactions; we cannot assume completion when the read-exclusive or update request obtains access to the interconnect, as we did on a shared bus. Requests, replies, invalidations, updates and acknowledgments across nodes are all network transactions like those of the previous chapter; only here the end-point processing at the destination of the transaction (invalidating blocks, retrieving and replying with data) is typically done by the communication assist rather than the main processor. Since directory schemes rely on point-to-point network transactions, they can be used with any interconnection network.

Important questions for directories include the actual form in which the directory information is stored, and how correct, efficient protocols may be designed around these representations. A different approach from directories is to extend the broadcast and snooping mechanism, using a hierarchy of broadcast media like buses or rings. This is conceptually attractive because it builds larger systems hierarchically out of existing small-scale mechanisms. However, it does not apply to general networks such as meshes and cubes, and we will see that it has problems with latency and bandwidth, so it has not become very popular. One approach that is popular is a two-level protocol. Each node of the machine is itself a multiprocessor, and the caches within the node are kept coherent by one coherence protocol called the inner protocol. Coherence across nodes is maintained by another, potentially different, protocol called the outer protocol.
Figure 8-4 Basic operation of a simple directory. Two example operations are shown. On the left (a) is a read miss to a block that is currently held in dirty state by a node that is not the requestor or the node that holds the directory information. A read miss to a block that is clean in main memory (i.e. at the directory node) is simpler: the main memory simply replies to the requestor and the miss is satisfied in a single request-reply pair of transactions. On the right (b) is a write miss to a block that is currently in shared state in two other nodes' caches (the two sharers). The big rectangles are the nodes, and the arcs (with boxed labels) are network transactions. The numbers 1, 2, etc. next to a transaction show the serialization of transactions. Different letters next to the same number indicate that the transactions can be overlapped. (In (a) the transactions are: 1, read request to the directory; 2, reply with the owner's identity; 3, read request to the owner; 4a, data reply to the requestor; 4b, revision message to the directory. In (b) they are: 1, read-exclusive request to the directory; 2, reply with the sharers' identity; 3a and 3b, invalidation requests to the sharers; 4a and 4b, invalidation acknowledgments.)
To the outer protocol, each multiprocessor node looks like a single cache, and coherence within the node is the responsibility of the inner protocol. Usually, an adapter or a shared tertiary cache is used to represent a node to the outer protocol. The most common organization is for the outer protocol to be a directory protocol and the inner one to be a snoopy protocol [LoC96, LLJ+93, ClA96, WGH+97]. However, other combinations such as snoopy-snoopy [FBR93], directory-directory [Con93], and even snoopy-directory may be used. Putting together smaller-scale machines to build larger machines in a two-level hierarchy is an attractive engineering option: it amortizes fixed per-node costs, may take advantage of packaging hierarchies, and may satisfy much of the interprocessor communication within a node, leaving only a small amount to require more costly communication across nodes. The major focus of this chapter will be on directory protocols across nodes, regardless of whether the node is a uniprocessor or a multiprocessor or what coherence method it uses. However, the interactions among two-level protocols will be discussed. As we examine the organizational structure of the directory, the protocols involved in supporting coherence and consistency, and the requirements placed on the communication assist, we will find another rich and interesting design space.

The next section provides a framework for understanding different approaches for providing coherent replication in a shared address space, including snooping, directories and hierarchical snooping. Section 8.3 introduces a simple directory representation and a protocol corresponding to it, and then provides an overview of the alternatives for organizing directory information and protocols.
516
DRAFT: Parallel Computer Architecture
9/3/97
Scalable Cache Coherence
Figure 8-5 Some possible organizations for two-level cache-coherent systems: (a) snooping-snooping, (b) snooping-directory, (c) directory-directory, and (d) directory-snooping. Each node visible at the outer level is itself a multiprocessor. B1 is a first-level bus, and B2 is a second-level bus. The label snooping-directory, for example, means that a snooping protocol is used to maintain coherence within a multiprocessor node, and a directory protocol is used to maintain coherence across nodes.
This is followed by a quantitative assessment of some issues and architectural tradeoffs for directory protocols in Section 8.4. The chapter then covers the issues and techniques involved in actually designing correct, efficient protocols. Section 8.5 discusses the major new challenges for scalable, directory-based systems introduced by multiple copies of data without a serializing interconnect. The next two sections delve deeply into the two most popular types of directory-based protocols, discussing various design alternatives and using two commercial architectures as case studies: the Origin2000 from Silicon Graphics, Inc. and the NUMA-Q from Sequent Computer Corporation. Synchronization for directory-based multiprocessors is discussed in Section 8.9, and the implications for parallel software in Section 8.10. Section 8.11 covers some advanced topics, including the approaches of hierarchically extending the snooping and directory approaches for scalable coherence.
8.2 Scalable Cache Coherence
Before we discuss specific protocols, this section briefly lays out the major organizational alternatives for providing coherent replication in the extended memory hierarchy, and introduces the basic mechanisms that any approach to coherence must provide, whether it be based on snoop-
ing, hierarchical snooping or directories. Different approaches and protocols discussed in the rest of the chapter will refer to the mechanisms described here. On a machine with physically distributed memory, nonlocal data may be replicated either only in the processors caches or in the local main memory as well. This chapter assumes that data are replicated only in the caches, and that they are kept coherent in hardware at the granularity of cache blocks just as in bus-based machines. Since main memory is physically distributed and has nonuniform access costs to a processor, architectures of this type are often called cache-coherent, nonuniform memory access architectures or CC-NUMA. Alternatives for replicating in main memory will be discussed in the next chapter, which will discuss other hardware-software tradeoffs in scalable shared address space systems as well. Any approach to coherence, including snoopy coherence, must provide certain critical mechanisms. First, a block can be in each cache in one of a number of states, potentially in different states in different caches. The protocol must provide these cache states, the state transition diagram according to which the blocks in different caches independently change states, and the set of actions associated with the state transition diagram. Directory-based protocols also have a directory state for each block, which is the state of the block at the directory. The protocol may be invalidation-based, update-based, or hybrid, and the stable cache states themselves are very often the same (MESI, MSI) regardless of whether the system is based on snooping or directories. The tradeoffs in the choices of states are very similar to those discussed in Chapter 5, and will not be revisited in this chapter. Conceptually, the cache state of a memory block is a vector containing its state in every cache in the system (even when it is not present in a cache, it may be thought of as being present but in the invalid state, as discussed in Chapter 5). The same state transition diagram governs the copies in different caches, though the current state of the block at any given time may be different in different caches. The state changes for a block in different caches are coordinated through transactions on the interconnect, whether bus transactions or more general network transactions. Given a protocol at cache state transition level, a coherent system must provide mechanisms for managing the protocol. First, a mechanism is needed to determine when (on which operations) to invoke the protocol. This is done in the same way on most systems: through an access fault (cache miss) detection mechanism. The protocol is invoked if the processor makes an access that its cache cannot satisfy by itself; for example, an access to a block that is not in the cache, or a write access to a block that is present but in shared state. However, even when they use the same set of cache states, transitions and access fault mechanisms, approaches to cache coherence differ substantially in the mechanisms they provide for the following three important functions that may need to be performed when an access fault occurs: (a) Finding out enough information about the state of the location (cache block) in other caches to determine what action to take, (b) Locating those other copies, if needed, for example to invalidate them, and (c) Communicating with the other copies, e.g. obtaining data from or invalidating/updating them. 
In snooping protocols, all of the above functions were performed by the broadcast and snooping mechanism. The processor puts a "search" request on the bus, containing the address of the block, and other cache controllers snoop and respond. A broadcast method can be used in distributed machines as well; the assist at the node incurring the miss can broadcast messages to all other nodes, and their assists can examine the incoming request and respond as appropriate.
However, broadcast does not scale, since it generates a large amount of traffic (p network transactions on every miss on a p-node machine). Scalable approaches include hierarchical snooping and directory-based approaches. In a hierarchical snooping approach, the interconnection network is not a single broadcast bus (or ring) but a tree of buses. The processors are in bus-based snoopy multiprocessors at the leaves of the tree. Parent buses are connected to children by interfaces that snoop the buses on both sides and propagate relevant transactions upward or downward in the hierarchy. Main memory may be centralized at the root or distributed among the leaves. In this case, all of the above functions are performed by the hierarchical extension of the broadcast and snooping mechanism: A processor puts a "search" request on its bus as before, and it is propagated up and down the hierarchy as necessary based on snoop results. Hierarchical snoopy systems are discussed further in Section 8.11.2.
In the simple directory approach described above, information about the state of blocks in other caches is found by looking up the directory through network transactions. The location of the copies is also found from the directory, and the copies are communicated with using point-to-point network transactions in an arbitrary interconnection network, without resorting to broadcast. Important questions for directory protocols are how the directory information is actually organized, and how protocols might be structured around these organizations using network transactions. The directory organization influences exactly how the protocol addresses the three issues above. An overview of directory organizations and protocols is provided in the next section.
8.3 Overview of Directory-Based Approaches

This section begins by describing a simple directory scheme and how it might operate using cache states, directory states, and network transactions. It then discusses the issues in scaling directories to large numbers of nodes, provides a classification of scalable directory organizations, and discusses the basics of protocols associated with these organizations. The following definitions will be useful throughout our discussion of directory protocols. For a given cache or memory block:
• The home node is the node in whose main memory the block is allocated.
• The dirty node is the node that has a copy of the block in its cache in modified (dirty) state. Note that the home node and the dirty node for a block may be the same.
• The exclusive node is the node that has a copy of the block in its cache in an exclusive state, either dirty or exclusive-clean as the case may be (recall from Chapter 5 that exclusive-clean means this is the only valid cached copy and that the block in main memory is up to date). Thus, the dirty node is also the exclusive node.
• The local node, or requesting node, is the node containing the processor that issues a request for the block.
• The owner node is the node that currently holds the valid copy of a block and must supply the data when needed; this is either the home node, when the block is not in dirty state in a cache, or the dirty node.
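The home node and the locally versus remotely allocated distinction used below can be computed directly from the block address. The sketch that follows assumes the home is taken from the high-order bits of the physical address; real machines interleave memory across nodes in various ways, so the mapping and names here are illustrative only.

```c
#include <stdbool.h>
#include <stdint.h>

#define NODE_BITS 6   /* assumed: 64 nodes; home taken from the high-order address bits */

/* Home node: a fixed function of the block's physical address. */
int home_node(uint64_t block_addr)
{
    return (int)(block_addr >> (64 - NODE_BITS));
}

/* Locally vs. remotely allocated: is the issuing processor's node the home? */
bool is_locally_allocated(uint64_t block_addr, int my_node)
{
    return home_node(block_addr) == my_node;
}
```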
Blocks whose home is local to the issuing processor are called locally allocated (or sometimes simply local) blocks, while all others are called remotely allocated (or remote). Let us begin with the basic operation of directory-based protocols using a very simple directory organization.

8.3.1 Operation of a Simple Directory Scheme

As discussed earlier, a natural way to organize a directory is to maintain the directory information for a block together with the block in main memory, i.e. at the home node for the block. A simple organization for the directory information for a block is as a bit vector. The bit vector contains p presence bits and one or more state bits for a p-node machine (see Figure 8-6).

Figure 8-6 Directory information for a distributed-memory multiprocessor. [Figure omitted: nodes P1 through Pn, each with a processor, a cache, and a portion of main memory with its directory, are connected by the interconnection network; a directory entry consists of presence bits and a dirty bit.]

Each presence bit represents a particular node (uniprocessor or multiprocessor), and specifies whether or not that node has a cached copy of the block. Let us assume for simplicity that there is only one state bit, called the dirty bit, which indicates if the block is dirty in one of the node caches. Of course, if the dirty bit is ON, then only one node (the dirty node) should be caching that block and only that node's presence bit should be ON. With this structure, a read miss can easily determine from looking up the directory which node, if any, has a dirty copy of the block or if the block is valid in main memory, and a write miss can determine which nodes are the sharers that must be invalidated. In addition to this directory state, state is also associated with blocks in the caches themselves, as discussed in Section 8.2, just as it was in snoopy protocols. In fact, the directory information for a block is simply main memory's view of the state of that block in different caches. The directory does not necessarily need to know the exact state (e.g. MESI) in each cache, but only enough information to determine what actions to take. The number of states at the directory is therefore typically smaller than the number of cache states. In fact, since the directory and the caches communicate through a distributed interconnect, there will be periods of time when a directory's knowledge of a cache state is incorrect since the cache state has been modified but notice of the modification has not reached the directory. During this time the directory may send a message to the cache based on its old (no longer valid) knowledge.
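A minimal representation of the directory entry just described (p presence bits plus a dirty bit) might look like the sketch below. The fixed 64-node limit and the helper names are assumptions made purely for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

#define NUM_NODES 64                 /* assumed machine size for this sketch */

typedef struct {
    uint64_t presence;               /* bit i ON <=> node i has a cached copy */
    bool     dirty;                  /* ON => exactly one presence bit is ON  */
} DirEntry;                          /* one entry per main-memory block       */

static inline bool presence_bit(const DirEntry *e, int node)
{
    return (e->presence >> node) & 1ULL;
}

static inline void set_presence(DirEntry *e, int node)
{
    e->presence |= 1ULL << node;
}

static inline void clear_entry(DirEntry *e)
{
    e->presence = 0;
    e->dirty = false;
}

/* When the dirty bit is ON, the single node whose presence bit is ON is the
 * owner (dirty) node; -1 indicates an inconsistent entry. */
int dirty_node(const DirEntry *e)
{
    for (int n = 0; n < NUM_NODES; n++)
        if (presence_bit(e, n))
            return n;
    return -1;
}
```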
The race conditions caused by this distribution of state make directory protocols interesting, and we will see how they are handled using transient states or other means in Sections 8.5 through 8.7.

To see how a read miss and write miss might interact with this bit vector directory organization, consider a protocol with three stable cache states (MSI) for simplicity, a single level of cache per processor, and a single processor per node. On a read or a write miss at node i (including an upgrade from shared state), the communication assist or controller at the local node i looks up the address of the memory block to determine if the home is local or remote. If it is remote, a network transaction is generated to the home node for the block. There, the directory entry for the block is looked up, and the controller at the home may treat the miss as follows, using network transactions similar to those that were shown in Figure 8-4:
• If the dirty bit is OFF, then the controller obtains the block from main memory, supplies it to the requestor in a reply network transaction, and turns the ith presence bit, presence[i], ON.
• If the dirty bit is ON, then the controller replies to the requestor with the identity of the node whose presence bit is ON, i.e. the owner or dirty node. The requestor then sends a request network transaction to that owner node. At the owner, the cache changes its state to shared, and supplies the block both to the requesting node, which stores the block in its cache in shared state, and to main memory at the home node. At memory, the dirty bit is turned OFF, and presence[i] is turned ON.

A write miss by processor i goes to memory and is handled as follows:
• If the dirty bit is OFF, then main memory has a clean copy of the data. Invalidation request transactions are sent to all nodes j for which presence[j] is ON. The directory controller waits for invalidation acknowledgment transactions from these processors (indicating that the write has completed with respect to them) and then supplies the block to node i where it is placed in the cache in dirty state. The directory entry is cleared, leaving only presence[i] and the dirty bit ON. If the request is an upgrade instead of a read-exclusive, an acknowledgment is returned to the requestor instead of the data itself.
• If the dirty bit is ON, then the block is first recalled from the dirty node (whose presence bit is ON), using network transactions. That cache changes its state to invalid, and then the block is supplied to the requesting processor which places the block in its cache in dirty state. The directory entry is cleared, leaving only presence[i] and the dirty bit ON.

On a writeback of a dirty block by node i, the dirty data being replaced are written back to main memory, and the directory is updated to turn OFF the dirty bit and presence[i]. As in bus-based machines, writebacks cause interesting race conditions which will be discussed later in the context of real protocols. Finally, if a block in shared state is replaced from a cache, a message may be sent to the directory to turn off the corresponding presence bit so an invalidation is not sent to this node the next time the block is written. This message is called a replacement hint; whether it is sent or not does not affect the correctness of the protocol or the execution. A schematic sketch of this home-node handling appears below.

A directory scheme similar to this one was introduced as early as 1978 [CeF78] for use in systems with a few processors and a centralized main memory. It was used in the S-1 multiprocessor project at Lawrence Livermore National Laboratories [WiC80]. However, directory schemes in one form or another were in use even before this. The earliest scheme was used in IBM mainframes, which had a few processors connected to a centralized memory through a high-bandwidth switch rather than a bus. With no broadcast medium to snoop on, a duplicate copy of the cache tags for each processor was maintained at the main memory, and it served as the directory.
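Returning to the read-miss and write-miss handling just described, the following sketch shows how the home node's controller might process incoming requests against a bit-vector entry. It is a schematic, not any actual machine's controller: the message-sending helpers stand in for network transactions (stubbed here so the code runs), the directory is updated eagerly, and all the race conditions discussed later in the chapter are ignored.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_NODES 64

typedef struct { uint64_t presence; bool dirty; } DirEntry;   /* as sketched earlier */

/* Stand-ins for network transactions; a real assist would inject packets. */
static void send_data_reply(int dest, uint64_t a)    { printf("data reply -> node %d (block %llx)\n", dest, (unsigned long long)a); }
static void send_owner_identity(int dest, int owner) { printf("owner id %d -> node %d\n", owner, dest); }
static void send_invalidate(int dest, uint64_t a)    { printf("invalidate -> node %d (block %llx)\n", dest, (unsigned long long)a); }
static void wait_for_invalidate_acks(int n)          { printf("wait for %d acks\n", n); }
static void recall_block_from(int owner, uint64_t a) { printf("recall block %llx from node %d\n", (unsigned long long)a, owner); }

static int single_sharer(const DirEntry *e)
{
    for (int n = 0; n < NUM_NODES; n++)
        if ((e->presence >> n) & 1ULL) return n;
    return -1;
}

/* Read miss from node i arriving at the home. */
void home_handle_read_miss(DirEntry *e, uint64_t addr, int i)
{
    if (!e->dirty) {
        send_data_reply(i, addr);              /* memory has a valid copy */
        e->presence |= 1ULL << i;
    } else {
        /* Requestor is told who the owner is; it fetches the block from the
         * owner, which also writes it back to memory.  The directory update is
         * done eagerly here; a real protocol defers it until the writeback arrives. */
        send_owner_identity(i, single_sharer(e));
        e->dirty = false;
        e->presence |= 1ULL << i;
    }
}

/* Write miss (or upgrade) from node i arriving at the home. */
void home_handle_write_miss(DirEntry *e, uint64_t addr, int i)
{
    if (!e->dirty) {
        int acks = 0;
        for (int n = 0; n < NUM_NODES; n++) {
            if (n != i && ((e->presence >> n) & 1ULL)) {
                send_invalidate(n, addr);      /* one transaction per sharer */
                acks++;
            }
        }
        wait_for_invalidate_acks(acks);
        send_data_reply(i, addr);              /* just an ack if this is an upgrade */
    } else {
        recall_block_from(single_sharer(e), addr);  /* dirty copy flushed and invalidated */
        send_data_reply(i, addr);
    }
    e->presence = 1ULL << i;                   /* only the writer remains a sharer */
    e->dirty = true;
}

int main(void)
{
    DirEntry e = { .presence = 0, .dirty = false };
    home_handle_read_miss(&e, 0x1000, 3);      /* node 3 reads: clean at home        */
    home_handle_write_miss(&e, 0x1000, 5);     /* node 5 writes: invalidate node 3   */
    return 0;
}
```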
Requests coming to the memory looked up all the tags to determine the states of the block in the different caches [Tan76, Tuc86].

The value of directories is that they keep track of which nodes have copies of a block, eliminating the need for broadcast. This is clearly very valuable on read misses, since a request for a block will either be satisfied at the main memory or the directory will tell it exactly where to go to retrieve the exclusive copy. On write misses, the value of directories over the simpler broadcast approach is greatest if the number of sharers of the block (to which invalidations or updates must be sent) is usually small and does not scale up quickly with the number of processing nodes. We might already expect this to be true from our understanding of parallel applications. For example, in a near-neighbor grid computation usually two and at most four processes should share a block at a partition boundary, regardless of the grid size or the number of processors. Even when a location is actively read and written by all processes in the application, the number of sharers to be invalidated at a write depends upon the temporal interleaving of reads and writes by processors. A common example is migratory data, which are read and written by one processor, then read and written by another processor, and so on (for example, a global sum into which processes accumulate their values). Although all processors read and write the location, on a write only one other processor (the previous writer) has a valid copy since all others were invalidated at the previous write. Empirical measurements of program behavior show that the number of valid copies on most writes to shared data is indeed very small the vast majority of the time, that this number does not grow quickly with the number of processes in the application, and that the frequency of writes to widely shared data is very low. Such data for our parallel applications will be presented and analyzed in light of application characteristics in Section 8.4.1. As we will see, these facts also help us understand how to organize directories cost-effectively. Finally, even if all processors running the application have to be invalidated on most writes, directories are still valuable for writes if the application does not run on all nodes of the multiprocessor.

8.3.2 Scaling

The main goal of using directory protocols is to allow cache coherence to scale beyond the number of processors that may be sustained by a bus. Scalability issues for directory protocols include both performance and the storage overhead of directory information. Given a system with distributed memory and interconnect, the major scaling issues for performance are how the latency and bandwidth demands presented to the system by the protocol scale with the number of processors used. The bandwidth demands are governed by the number of network transactions generated per miss (times the frequency of misses), and latency by the number of these transactions that are in the critical path of the miss. These performance issues are affected both by the directory organization itself and by how well the flow of network transactions is optimized in the protocol itself (given a directory organization), as we shall see. Storage, however, is affected only by how the directory information is organized.
For the simple bit vector organization, the number of presence bits needed scales linearly with both the number of processing nodes (p bits per memory block) and the amount of main memory (one bit vector per memory block), leading to a potentially large storage overhead for the directory. With a 64-byte block size and 64 processors, the directory storage overhead is 64 bits divided by 64 bytes or 12.5 percent, which is not so bad. With 256 processors and the same block size, the overhead is 50 percent, and with 1024 processors it is 200 percent! The directory overhead does not scale well, but it may be acceptable if the number of nodes visible to the directory at the target machine scale is not very large.
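The overhead figures quoted above follow directly from the ratio of directory bits to data bits per block; a small helper makes the arithmetic explicit (the extra state bits, here just the dirty bit, are ignored as in the text).

```c
#include <stdio.h>

/* Full-bit-vector directory overhead: P presence bits per block, against
 * 8 * block_bytes bits of data per block. */
double full_bitvector_overhead(int num_nodes, int block_bytes)
{
    return (double)num_nodes / (8.0 * block_bytes);
}

int main(void)
{
    /* Reproduces the figures in the text for a 64-byte block: */
    printf("%.1f%%\n", 100 * full_bitvector_overhead(64,   64));   /* 12.5%  */
    printf("%.1f%%\n", 100 * full_bitvector_overhead(256,  64));   /* 50.0%  */
    printf("%.1f%%\n", 100 * full_bitvector_overhead(1024, 64));   /* 200.0% */
    return 0;
}
```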
Fortunately, there are many other ways to organize directory information that improve the scalability of directory storage. The different organizations naturally lead to different high-level protocols with different ways of addressing the protocol issues (a) through (c) and different performance characteristics. The rest of this section lays out the space of directory organizations and briefly describes how individual read and write misses might be handled in straightforward request-reply protocols that use them. As before, the discussion will pretend that there are no other cache misses in progress at the time, hence no race conditions, so the directory and the caches are always encountered as being in stable states. Deeper protocol issues (such as performance optimizations in structuring protocols, handling race conditions caused by the interactions of multiple concurrent memory operations, and dealing with correctness issues such as ordering, deadlock, livelock and starvation) will be discussed later when we plunge into the depths of directory protocols in Sections 8.5 through 8.7.

8.3.3 Alternatives for Organizing Directories

At the highest level, directory schemes are differentiated by the mechanisms they provide for the three functions of coherence protocols discussed in Section 8.2. Since communication with cached copies is always done through network transactions, the real differentiation is in the first two functions: finding the source of the directory information upon a miss, and determining the locations of the relevant copies. There are two major classes of alternatives for finding the source of the directory information for a block:
• Flat directory schemes, and
• Hierarchical directory schemes.

The simple directory scheme described above is a flat scheme. Flat schemes are so named because the source of the directory information for a block is in a fixed place, usually at the home that is determined from the address of the block; on a miss, a request is sent directly to the home node to look up the directory in a single network transaction. In hierarchical schemes, memory is again distributed with the processors, but the directory information for each block is logically organized as a hierarchical data structure (a tree). The processing nodes, each with its portion of memory, are at the leaves of the tree. The internal nodes of the tree are simply hierarchically maintained directory information for the block: a node keeps track of whether each of its children has a copy of a block. Upon a miss, the directory information for the block is found by traversing up the hierarchy level by level through network transactions until a directory node is reached that indicates its subtree has the block in the appropriate state. Thus, a processor that misses simply sends a "search" message up to its parent, and so on, rather than directly to the home node for the block with a single network transaction. The directory tree for a block is logical, not necessarily physical, and can be embedded in any general interconnection network. In fact, in practice every processing node in the system is not only a leaf node for the blocks it contains but also stores directory information as an internal tree node for other blocks.

In the hierarchical case, the information about locations of copies is also maintained through the hierarchy itself; copies are found and communicated with by traversing up and down the hierarchy guided by directory information. In flat schemes, how the information is stored varies a lot. At the highest level, flat schemes can be divided into two classes:
• memory-based schemes, in which all the directory information about all cached copies is stored at the home node of the block. The basic bit vector scheme described above is memory-based. The locations of all copies are discovered at the home, and they can be communicated with directly through point-to-point messages.
• cache-based schemes, in which the information about cached copies is not all contained at the home but is distributed among the copies themselves. The home simply contains a pointer to one cached copy of the block. Each cached copy then contains a pointer to (or the identity of) the node that has the next cached copy of the block, in a distributed linked list organization. The locations of copies are therefore determined by traversing this list via network transactions.

Figure 8-7 summarizes the taxonomy. Hierarchical directories have some potential advantages. For example, a read miss to a block whose home is far away in the interconnection network topology might be satisfied closer to the issuing processor if another copy is found nearby in the topology as the request traverses up and down the hierarchy, instead of having to go all the way to the home. Also, requests from different nodes can potentially be combined at a common ancestor in the hierarchy, and only one request sent on from there. These advantages depend on how well the logical hierarchy matches the underlying physical network topology. However, instead of only a few point-to-point network transactions needed to satisfy a miss in many flat schemes, the number of network transactions needed to traverse up and down the hierarchy can be much larger, and each transaction along the way needs to look up the directory information at its destination node. As a result, the latency and bandwidth requirements of hierarchical directory schemes tend to be much larger, and these organizations are not popular on modern systems. Hierarchical directories will not be discussed much in this chapter, but will be described briefly together with hierarchical snooping approaches in Section 8.11.2. The rest of this section will examine flat directory schemes, both memory-based and cache-based, looking at directory organizations, storage overhead, the structure of protocols, and the impact on performance characteristics.
Figure 8-7 Alternatives for storing directory information. [Figure omitted: a taxonomy of directory storage schemes. The first split is by how the source of the directory information is found: flat, centralized, or hierarchical (a hierarchy of caches that guarantee the inclusion property, each parent keeping track of exactly which of its immediate children have a copy of the block; example: Data Diffusion Machine). Flat schemes split further by how copies are located: memory-based, with the directory information collocated with the memory module that is home for that memory block (examples: Stanford DASH/FLASH, MIT Alewife, SGI Origin, HAL), or cache-based, where the caches holding a copy of the memory block form a linked list and memory holds a pointer to the head of the list (examples: IEEE SCI, Sequent NUMA-Q).]
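Before turning to flat schemes, the fragment below sketches the hierarchical lookup described above: a miss walks up the logical tree, one network transaction and directory lookup per level, until some ancestor's directory indicates that its subtree can satisfy the request. The tree representation and function names are illustrative assumptions; real hierarchical designs embed this logical tree in a general network and keep per-child state at each level.

```c
#include <stdbool.h>
#include <stddef.h>

typedef struct TreeDir {
    struct TreeDir *parent;                 /* NULL at the root */
    /* Returns true if this directory node's subtree holds the block in a state
     * sufficient to satisfy the request (e.g. any copy for a read, the owning
     * copy for a write). */
    bool (*subtree_can_satisfy)(const struct TreeDir *self,
                                unsigned long block_addr, bool is_write);
} TreeDir;

/* Walk up from the requesting leaf until an ancestor can satisfy the request.
 * The request is then propagated back down within that subtree (not shown). */
const TreeDir *hierarchical_lookup(const TreeDir *leaf,
                                   unsigned long block_addr, bool is_write)
{
    const TreeDir *node = leaf;
    while (node != NULL) {
        if (node->subtree_can_satisfy(node, block_addr, is_write))
            return node;                    /* found within this subtree        */
        node = node->parent;                /* "search" message sent to parent  */
    }
    return NULL;                            /* root reached: only memory has it */
}
```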
Flat, Memory-Based Directory Schemes

The bit vector organization above, called a full bit vector organization, is the most straightforward way to store directory information in a flat, memory-based scheme. The style of protocol that results has already been discussed. Consider its basic performance characteristics on writes. Since it preserves information about sharers precisely and at the home, the number of network transactions per invalidating write grows only with the number of sharers. Since the identity of all sharers is available at the home, invalidations sent to them can be overlapped or even sent in parallel; the number of fully serialized network transactions in the critical path is thus not proportional to the number of sharers, reducing latency. The main disadvantage, as discussed earlier, is storage overhead.

There are two ways to reduce the overhead of full bit vector schemes. The first is to increase the cache block size. The second is to put multiple processors, rather than just one, in a node that is visible to the directory protocol; that is, to use a two-level protocol. The Stanford DASH machine uses a full bit vector scheme, and its nodes are 4-processor bus-based multiprocessors. These two methods actually make full-bit-vector directories quite attractive for even fairly large machines. For example, for a 256-processor machine using 4-processor nodes and 128-byte cache blocks, the directory memory overhead is only 6.25%. As small-scale SMPs become increasingly attractive building blocks, this storage problem may not be severe. However, these methods reduce the overhead by only a small constant factor each. The total storage overhead is still proportional to P*M, where P is the number of processing nodes and M is the number of total memory blocks in the machine (M = P*m, where m is the number of blocks per local memory), and would become intolerable in very large machines.

Directory storage overhead can be reduced further by addressing each of the orthogonal factors in the P*M expression. We can reduce the width of each directory entry by not letting it grow proportionally to P. Or we can reduce the total number of directory entries, or directory height, by not having an entry per memory block. Directory width is reduced by using limited pointer directories, which are motivated by the earlier observation that most of the time only a few caches have a copy of a block when the block is written. Limited pointer schemes therefore do not store yes or no information for all nodes, but rather simply maintain a fixed number of pointers, say i, each pointing to a node that currently caches a copy of the block [ASH+88]. Each pointer takes log P bits of storage for P nodes, but the number of pointers used is small. For example, for a machine with 1024 nodes, each pointer needs 10 bits, so even having 100 pointers uses less storage than a full bit-vector scheme. In practice, five or fewer pointers seem to suffice. Of course, these schemes need some kind of backup or overflow strategy for the situation when more than i copies are cached, since they can keep track of only i copies precisely. One strategy may be to resort to broadcasting invalidations to all nodes when there are more than i copies. Many other strategies have been developed to avoid broadcast even in these cases. Different limited pointer schemes differ primarily in their overflow strategies and in the number of pointers they use.
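A minimal sketch of a limited-pointer entry with the broadcast overflow strategy mentioned above might look as follows; the parameter choices (4 pointers, 10-bit node IDs) and the helper names are assumptions for illustration, not those of any particular published scheme.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdio.h>

#define NUM_POINTERS 4          /* i: pointers kept per entry (assumed)       */
#define NUM_NODES    1024       /* so each pointer needs log2(1024) = 10 bits */

typedef struct {
    uint16_t ptr[NUM_POINTERS]; /* node IDs of known sharers                  */
    uint8_t  count;             /* how many pointers are valid                */
    bool     overflow;          /* more than NUM_POINTERS sharers exist       */
    bool     dirty;
} LimitedPtrEntry;

/* Stand-ins for the network transactions (stubbed so the sketch compiles). */
static void send_invalidate(int node, uint64_t addr)
{
    printf("invalidate block %llx at node %d\n", (unsigned long long)addr, node);
}
static void broadcast_invalidate(uint64_t addr)
{
    printf("broadcast invalidate for block %llx\n", (unsigned long long)addr);
}

/* Record a new sharer; fall back to the overflow state when the pointers run out. */
void add_sharer(LimitedPtrEntry *e, int node)
{
    for (int k = 0; k < e->count; k++)
        if (e->ptr[k] == node)
            return;                        /* already recorded            */
    if (e->count < NUM_POINTERS)
        e->ptr[e->count++] = (uint16_t)node;
    else
        e->overflow = true;                /* precise information is lost */
}

/* On an invalidating write: point-to-point invalidations while the entry is
 * precise; broadcast (one simple overflow strategy) once it has overflowed. */
void invalidate_sharers(LimitedPtrEntry *e, uint64_t addr, int writer)
{
    if (e->overflow) {
        broadcast_invalidate(addr);
    } else {
        for (int k = 0; k < e->count; k++)
            if (e->ptr[k] != writer)
                send_invalidate(e->ptr[k], addr);
    }
    e->count = 1;
    e->ptr[0] = (uint16_t)writer;          /* only the writer remains */
    e->overflow = false;
    e->dirty = true;
}
```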
Directory height can be reduced by organizing the directory as a cache, taking advantage of the fact that, since the total amount of cache in the machine is much smaller than the total amount of memory, only a very small fraction of the memory blocks will actually be replicated in caches at a given time; most of the directory entries will therefore be unused anyway [GWM90, OKN90]. The
Advanced Topics section of this chapter (Section 8.11) will discuss the techniques to reduce directory width and height in more detail. Regardless of these optimizations, the basic approach to finding copies and communicating with them (protocol issues (b) and (c)) is the same for the different flat, memory-based schemes. The identities of the sharers are maintained at the home, and at least when there is no overflow the copies are communicated with by sending point-to-point transactions to each.

Flat, Cache-Based Directory Schemes

In flat, cache-based schemes there is still a home main memory for the block; however, the directory entry at the home does not contain the identities of all sharers, but only a pointer to the first sharer in the list plus a few state bits. This pointer is called the head pointer for the block. The remaining nodes caching that block are joined together in a distributed, doubly linked list, using additional pointers that are associated with each cache line in a node (see Figure 8-8). That is, a cache that contains a copy of the block also contains pointers to the next and previous caches that have a copy, called the forward and backward pointers, respectively.

On a read miss, the requesting node sends a network transaction to the home memory to find out the identity of the head node of the linked list, if any, for that block. If the head pointer is null (no current sharers) the home replies with the data. If the head pointer is not null, then the requestor must be added to the list of sharers. The home replies to the requestor with the head pointer. The requestor then sends a message to the head node, asking to be inserted at the head of the list and hence become the new head node. The net effect is that the head pointer at the home now points to the requestor, the forward pointer of the requestor's own cache entry points to the old head node (which is now the second node in the linked list), and the backward pointer of the old head node points to the requestor. The data for the block is provided by the home if it has the latest copy, or otherwise by the head node, which always has the latest copy (is the owner).

On a write miss, the writer again obtains the identity of the head node, if any, from the home. It then inserts itself into the list as the head node as before (if the writer was already in the list as a sharer and is now performing an upgrade, it is deleted from its current position in the list and inserted as the new head). Following this, the rest of the distributed linked list is traversed node-to-node via network transactions to find and invalidate successive copies of the block. If a block that is written is shared by three nodes A, B and C, the home only knows about A so the writer sends an invalidation message to it; the identity of the next sharer B can only be known once A is reached, and so on. Acknowledgments for these invalidations are sent to the writer. Once again, if the data for the block is needed it is provided by either the home or the head node as appropriate. The number of messages per invalidating write (the bandwidth demand) is proportional to the number of sharers as in the memory-based schemes, but now so is the number of messages in the critical path, i.e. the latency. Each of these serialized messages also invokes the communication assist, increasing latency further. In fact, even a read miss to a clean block involves the controllers of multiple nodes to insert the node in the linked list.
Writebacks or other replacements from the cache also require that the node delete itself from the sharing list, which means communicating with the nodes before and after it in the list. This is necessary because the new block that replaces the old one will need the forward and backward pointer slots of the cache entry for its own sharing list. An example cache-based protocol will be described in more depth in Section 8.7.
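The list manipulation at the heart of these schemes is ordinary doubly linked list insertion at the head and deletion from the middle; the sketch below shows the pointer updates in a single address space, purely to make the bookkeeping concrete. In a real cache-based protocol such as SCI each pointer update is a network transaction between nodes, with the coordination and race handling discussed in Section 8.7; the structure and function names here are illustrative assumptions.

```c
#include <stddef.h>

/* Per-node sharing state for one block: the forward/backward pointers kept
 * alongside the cache line. */
typedef struct Sharer {
    int            node_id;
    struct Sharer *forward;    /* next sharer (toward the tail)     */
    struct Sharer *backward;   /* previous sharer (toward the head) */
} Sharer;

/* Read miss: the requestor becomes the new head of the list.  'head' plays
 * the role of the head pointer stored at the home. */
void insert_at_head(Sharer **head, Sharer *requestor)
{
    requestor->forward  = *head;        /* requestor now points at the old head   */
    requestor->backward = NULL;
    if (*head != NULL)
        (*head)->backward = requestor;  /* old head's backward pointer updated    */
    *head = requestor;                  /* home's head pointer names the requestor */
}

/* Replacement (e.g. a writeback): the node unlinks itself, which requires
 * talking to both its neighbours in the distributed list. */
void delete_self(Sharer **head, Sharer *self)
{
    if (self->backward != NULL)
        self->backward->forward = self->forward;
    else
        *head = self->forward;          /* self was the head; home must be updated */
    if (self->forward != NULL)
        self->forward->backward = self->backward;
    self->forward = self->backward = NULL;
}
```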
Figure 8-8 The doubly linked list distributed directory representation in the IEEE Scalable Coherent Interface (SCI) standard. [Figure omitted: the home main memory holds the head pointer, and the caches of Node 0, Node 1, and Node 2 are linked by forward and backward pointers.] A cache line contains not only data and state for the block, but also forward and backward pointers for the distributed linked list.
To counter the latency disadvantage, cache-based schemes have some important advantages over memory-based schemes. First, the directory overhead is not so large to begin with. Every block in main memory has only a single head pointer. The number of forward and backward pointers is proportional to the number of cache blocks in the machine, which is much smaller than the number of memory blocks. The second advantage is that a linked list records the order in which accesses were made to memory for the block, thus making it easier to provide fairness and avoid livelock in a protocol (most memory-based schemes do not keep track of request order in their sharing vectors or lists, as we will see). Third, the work to be done by controllers in sending invalidations is not centralized at the home but rather distributed among sharers, thus perhaps spreading out controller occupancy and reducing the corresponding bandwidth demands placed at the home.

Manipulating insertion in and deletion from distributed linked lists can lead to complex protocol implementations. For example, deleting a node from a sharing list requires careful coordination and mutual exclusion with processors ahead and behind it in the linked list, since those processors may also be trying to replace the same block concurrently. These complexity issues have been greatly alleviated by the formalization and publication of a standard for a cache-based directory organization and protocol: the IEEE 1596-1992 Scalable Coherent Interface (SCI) standard [Gus92]. The standard includes a full specification and C code for the protocol. Several commercial machines have started using this protocol (e.g. Sequent NUMA-Q [LoC96], Convex Exemplar [Con93, TSS+96], and Data General [ClA96]), and variants that use alternative list representations (singly linked lists instead of the doubly linked lists in SCI) have also been explored [ThD90]. We shall examine the SCI protocol itself in more detail in Section 8.7, and defer detailed discussion of advantages and disadvantages until then.
Summary of Directory Organization Alternatives

To summarize, there are clearly many different ways to organize how directories store the cache state of memory blocks. Simple bit vector representations work well for machines that have a moderate number of nodes visible to the directory protocol. For larger machines, there are many alternatives available to reduce the memory overhead. The organization chosen does, however, affect the complexity of the coherence protocol and the performance of the directory scheme against various sharing patterns. Hierarchical directories have not been popular on real machines, while machines with flat memory-based and cache-based (linked list) directories have been built and used for some years now.

This section has described the basic organizational alternatives for directory protocols and how the protocols work at a high level. The next section quantitatively assesses the appropriateness of directory-based protocols for scalable multiprocessors, as well as some important protocol and architectural tradeoffs at this basic level. After this, the chapter will dive deeply into examining how memory-based and cache-based directory protocols might actually be realized, and how they might address the various performance and correctness issues that arise.
8.4 Assessing Directory Protocols and Tradeoffs

As in Chapter 5, this section uses a simulator of a directory-based protocol to examine some relevant characteristics of applications that can inform architectural tradeoffs, but that cannot be measured on real machines. As mentioned earlier, issues such as three-state versus four-state or invalidate versus update protocols that were discussed in Chapter 5 will not be revisited here. (For update protocols, an additional disadvantage in scalable machines is that useless updates incur a separate network transaction for each destination, rather than a single bus transaction that is snooped by all caches. Also, we will see later in the chapter that besides the performance tradeoffs, update-based protocols make it much more difficult to preserve the desired memory consistency model in directory-based systems.) In particular, this section quantifies the distribution of invalidation patterns in directory protocols that we referred to earlier, examines how the distribution of traffic between local and remote changes as the number of processors is increased for a fixed problem size, and revisits the impact of cache block size on traffic, which is now divided into local, remote and overhead. In all cases, the experiments assume a memory-based flat directory protocol with MESI cache states, much like that of the Origin2000.

8.4.1 Data Sharing Patterns for Directory Schemes

It was claimed earlier that the number of invalidations that need to be sent out on a write is usually small, which makes directories particularly valuable. This section quantifies that claim for our parallel application case studies, develops a framework for categorizing data structures in terms of sharing patterns and understanding how they scale, and explains the behavior of the application case studies in light of this framework.

Sharing Patterns for Application Case Studies

In an invalidation-based protocol, two aspects of an application's data sharing patterns are important to understand in the context of directory schemes: (i) the frequency with which processors
issue writes that may require invalidating other copies (i.e. writes to data that are not dirty in the writer's cache, or invalidating writes), called the invalidation frequency, and (ii) the distribution of the number of invalidations (sharers) needed upon these writes, called the invalidation size distribution. Directory schemes are particularly advantageous if the invalidation size distribution has small measures of average, and as long as the frequency is significant enough that using broadcast all the time would indeed be a performance problem.

Figure 8-9 Invalidation patterns with default data sets and 64 processors. [Figure omitted: histograms for LU, Radix, Ocean, Barnes-Hut, Raytrace, and Radiosity.] The x-axis shows the invalidation size (number of active sharers) upon an invalidating write, and the y-axis shows the frequency of invalidating writes for each x-axis value. The invalidation size distribution was measured by simulating a full bit vector directory representation and recording for every invalidating write how many presence bits (other than the writer's) are set when the write occurs. Frequency is measured as invalidating writes with that many invalidations per 1000 instructions issued by the processor. The data are averaged over all the 64 processors used.

Figure 8-9 shows the invalidation size distributions for our parallel application case studies running on 64-node systems for the default problem sizes presented in Chapter 4. Infinite per-processor caches are used in these simulations, to capture inherent sharing patterns. With finite caches, replacement hints sent to the directory may turn off presence bits and reduce the number of invalidations sent on writes in some cases (though not the traffic, since the replacement hint has to be sent). It is clear that the measures of average of the invalidation size distribution are small, indicating that directories are likely to be very useful in containing traffic and that it is not necessary for the directory to maintain a presence bit per processor in a flat memory-based scheme. The non-zero
frequencies at very large processor counts are usually due to synchronization variables, where many processors spin on a variable and one processor writes it, invalidating them all.

We are interested in not just the results for a given problem size and number of processors, but also in how they scale. The communication to computation ratios discussed in Chapter 4 give us a good idea about how the frequency of invalidating writes should scale. For the size distributions, we can appeal to our understanding of the applications, which can also help explain the basic results observed in Figure 8-9 as we will see shortly.

A Framework for Sharing Patterns

More generally, we can group sharing patterns for data structures into commonly encountered categories, and use this categorization to reason about the resulting invalidation patterns and how they might scale. Data access patterns can be categorized in many ways: predictable versus unpredictable, regular versus irregular, coarse-grained versus fine-grained (or contiguous versus non-contiguous in the address space), near-neighbor versus long-range, etc. For the current purpose, the relevant categories are read-only, producer-consumer, migratory, and irregular read-write.

Read-only. Read-only data structures are never written once they have been initialized. There are no invalidating writes, so these data are a non-issue for directories. Examples include program code and the scene data in the Raytrace application.

Producer-consumer. A processor produces (writes) a data item, then one or more processors consume (read) it, then a processor produces it again, and so on. Flag-based synchronization is an example, as is the near-neighbor sharing in an iterative grid computation. The producer may be the same process every time, or it may change; for example, in a branch-and-bound algorithm the bound may be written by different processes as they find improved bounds. The invalidation size is determined by how many consumers there have been each time the producer writes the value. We can have situations with one consumer, all processes being consumers, or a few processes being consumers. These situations may have different frequencies and scaling properties, although for most applications studied so far either the size does not scale quickly with the number of processors or the frequency has been found to be low.1

Migratory. Migratory data bounce around or migrate from one processor to another, being written (and usually read) by each processor to which they bounce. An example is a global sum, into which different processes add their partial sums. Each time a processor writes the variable only the previous writer has a copy (since it invalidated the previous "owner" when it did its write), so only a single invalidation is generated upon a write regardless of the number of processors used.

Irregular Read-Write. This corresponds to irregular or unpredictable read and write access patterns to data by different processes. A simple example is a distributed task queue system.
1. Examples of the first situation are the non-corner border elements in a near-neighbor regular grid partition and the key permutations in Radix. They lead to an invalidation size of one, which does not increase with the number of processors or the problem size. Examples of all processes being consumers are a global energy variable that is read by all processes during a time-step of a physical simulation and then written by one at the end, or a synchronization variable on which all processes spin. While the invalidation size here is large, equal to the number of processors, such writes fortunately tend to happen very infrequently in real applications. Finally, examples of a few processes being consumers are the corner elements of a grid partition, or the flags used for tree-based synchronization. This leads to an invalidation size of a few, which may or may not scale with the number of processors (it doesn't in these two examples).
Processes will probe (read) the head pointer of a task queue when they are looking for work to steal, and this head pointer will be written when a task is added at the head. These and other irregular patterns usually tend to lead to wide-ranging invalidation size distributions, but in most observed applications the frequency concentration tends to be very much toward the small end of the spectrum (see the Radiosity example in Figure 8-9). A similar categorization can be found in [GuW92].

Applying the Framework to the Application Case Studies

Let us now look briefly at each of the applications in Figure 8-9 to interpret the results in light of the sharing patterns as classified above, and to understand how the size distributions might scale. In the LU factorization program, when a block is written it has previously been read only by the same processor that is doing the writing (the process to which it is assigned). This means that no other processor should have a cached copy, and zero invalidations should be sent. Once it is written, it is read by several other processes and no longer written further. The reason that we see one invalidation being sent in the figure is that the matrix is initialized by a single process, so particularly with infinite caches that process has a copy of the entire matrix in its cache and will be invalidated the first time another processor does a write to a block. An insignificant number of invalidating writes invalidate all processes, which is due to some global variables and not the main matrix data structure. As for scalability, increasing the problem size or the number of processors does not change the invalidation size distribution much, except for the global variables. Of course, they do change the frequencies.

In the Radix sorting kernel, invalidations are sent in two producer-consumer situations. In the permutation phase, the word or block written has been read since the last write only by the process to which that key is assigned, so at most a single invalidation is sent out. The same key position may be written by different processes in different outer loop iterations of the sort; however, in each iteration there is only one reader of a key so even this infrequent case generates only two invalidations (one to the reader and one to the previous writer). The other situation is the histogram accumulation, which is done in a tree-structured fashion and usually leads to a small number of invalidations at a time. These invalidations to multiple sharers are clearly very infrequent. Here too, increasing the problem size does not change the invalidation size in either phase, while increasing the number of processors increases the sizes, but only in some infrequent parts of the histogram accumulation phase. The dominant pattern by far remains zero or one invalidations.

The nearest-neighbor, producer-consumer communication pattern on a regular grid in Ocean leads to most of the invalidations being to zero or one processes (at the borders of a partition). At partition corners, more frequently encountered in the multigrid equation solver, two or three sharers may need to be invalidated. This does not grow with problem size or number of processors. At the highest levels of the multigrid hierarchy, the border elements of a few processors' partitions might fall on the same cache block, causing four or five sharers to be invalidated.
There are also some global accumulator variables, which display a migratory sharing pattern (one invalidation), and a couple of very infrequently used one-producer, all-consumer global variables.

The dominant scene data in Raytrace are read-only. The major read-write data are the image and the task queues. Each word in the image is written only once by one processor per frame. This leads to either zero invalidations if the same processor writes a given image pixel in consecutive
frames (as is usually the case), or one invalidation if different processors do, as might be the case when tasks are stolen. The task queues themselves lead to the irregular read-write access patterns discussed earlier, with a wide-ranging distribution that is dominated in frequency by the low end (hence the very small non-zeros all along the x-axis in this case). Here too there are some infrequently written one-producer, all-consumer global variables.

In the Barnes-Hut application, the important data are the particle and cell positions, the pointers used to link up the tree, and some global variables used as energy values. The position data are of producer-consumer type. A given particle's position is usually read by one or a few processors during the force-calculation (tree-traversal) phase. The positions (centers of mass) of the cells are read by many processes, the number increasing toward the root, which is read by all. These data thus cause a fairly wide range of invalidation sizes when they are written by their assigned processor after force-calculation. The root and upper-level cells are responsible for invalidations being sent to all processors, but their frequency is quite small. The tree pointers are quite similar in their behavior. The first write to a pointer in the tree-building phase invalidates the caches of the processors that read it in the previous force-calculation phase; subsequent writes invalidate those processors that read that pointer during tree-building, which happens irregularly but mostly involves a small number of processors. As the number of processors is increased, the invalidation size distribution tends to shift to the right as more processors tend to read a given item, but quite slowly, and the dominant invalidations are still to a small number of processors. The reverse effect (also slow) is observed when the number of particles is increased.

Finally, the Radiosity application has very irregular access patterns to many different types of data, including data that describe the scene (patches and elements) and the task queues. The resulting invalidation patterns, shown in Figure 8-9, show a wide distribution, but with the greatest frequency by far concentrated toward zero to two invalidations. Many of the accesses to the scene data also behave in a migratory way, and there are a few migratory counters and a couple of global variables that are one-producer, all-consumer.

The empirical data and categorization framework indicate that in most cases the invalidation size distribution is dominated by small numbers of invalidations. The common use of parallel machines as multiprogrammed compute servers for sequential or small-way parallel applications further limits the number of sharers. Sharing patterns that cause large numbers of invalidations are empirically found to be very infrequent at runtime. One exception is highly contended synchronization variables, which are usually handled specially by software or hardware as we shall see. In addition to confirming the value of directories, this also suggests ways to reduce directory storage, as we shall see next.

8.4.2 Local versus Remote Traffic

For a given number of processors and machine organization, the fraction of traffic that is local or remote depends on the problem size. However, it is instructive to examine how the traffic and its distribution change with the number of processors even when the problem size is held fixed (i.e. under problem-constrained (PC) scaling).
Figure 8-10 shows the results for the default problem sizes, breaking down the remote traffic into various categories such as sharing (true or false), capacity, cold-start, writeback and overhead. Overhead includes the fixed header per cache block sent across the network, as well as the traffic associated with protocol transactions like invalidations and acknowledgments that do not carry any data. This component of traffic is different from that on a bus-based machine: each individual point-to-point invalidation consumes traffic, and acknowledgments
place traffic on the interconnect too. Here also, traffic is shown in bytes per FLOP or bytes per instruction for different applications.

Figure 8-10 Traffic versus number of processors. [Figure omitted: per-application bar charts breaking traffic into local data, remote overhead, remote shared, remote capacity, remote cold, remote writeback, and true shared data, for 1 to 64 processors.] The overhead per block transferred across the network is 8 bytes. No overhead is considered for local accesses. The graphs on the left assume 1MB per-processor caches, and the ones on the right 64KB. The caches have a 64 byte block size and are 4-way set associative.

We can see that both local traffic and capacity-related remote traffic tend to decrease when the number of processors increases, due both to the decrease in per-processor working sets and to the decrease in cold misses that are satisfied locally instead of remotely. However, sharing-related traffic increases as expected. In applications with small working sets, like Barnes-Hut, LU, and Radiosity, the fraction of capacity-related traffic is very small, at least beyond a couple of processors. In irregular applications like Barnes-Hut and Raytrace, most of the capacity-related traffic is remote, all the more so as the number of processors increases, since data cannot be distributed at page grain so that they are satisfied locally. However, in cases like Ocean, the capacity-related traffic is substantial even with the large cache, and is almost entirely local when pages are placed properly. With round-robin placement of shared pages, we would have seen most of the local capacity misses turn to remote ones.

When we use smaller caches to capture the realistic scenario of working sets not fitting in the cache in Ocean and Raytrace, we see that capacity traffic becomes much larger. In Ocean, most of this is still local, and the trend for remote traffic looks similar. Poor distribution of pages would have swamped the network with traffic, but with proper distribution network (remote) traffic is quite low. However, in Raytrace the capacity-related traffic is mostly remote, and the fact that it dominates changes the slope of the curve of total remote traffic: remote traffic still increases with the number of processors, but much more slowly since the working set size and hence capacity miss rate does not depend that much on the number of processors.

When a miss is satisfied remotely, whether it is satisfied at the home or needs another message to obtain it from a dirty node depends both on whether it is a sharing miss or a capacity/conflict/cold miss, as well as on the size of the cache. In a small cache, dirty data may be replaced and written back, so a sharing miss by another processor may be satisfied at the home rather than at the (previously) dirty node. For applications that allow data to be placed in the memory of the node to which they are assigned, it is often the case that only that node writes the data, so even if the data are found dirty they are found so at the home itself. The extent to which this is true depends on both the application and whether the data are indeed distributed properly.
8.4.3 Cache Block Size Effects

The effects of block size on cache miss rates and bus traffic were assessed in Chapter 5, at least up to 16 processors. For miss rates, the trends beyond 16 processors extend quite naturally, except for threshold effects in the interaction of problem size, number of processors and block size as discussed in Chapter 4. This section examines the impact of block size on the components of local and remote traffic.
Figure 8-11 Traffic versus cache block size. [Figure omitted: per-application bar charts breaking traffic into local data, remote overhead, remote shared, remote capacity, remote cold, remote writeback, and true shared data, for block sizes from 8 to 256 bytes.] The overhead per block transferred over the network is 8 bytes. No overhead is considered for local access. The graph on the left shows the data for 1MB per processor caches, and that on the right for 64KB caches. All caches are 4-way set associative.
Figure 8-11 shows how traffic scales with block size for 32-processor executions of the applications with 1MB caches. Consider Barnes-Hut. The overall traffic increases slowly until about a 64 byte block size, and more rapidly thereafter primarily due to false sharing. However, the magnitude is small. Since the overhead per block moved through the network is fixed (as is the cost of invalidations and acknowledgments), the overhead component tends to shrink with increasing block size to the extent that there is spatial locality (i.e. larger blocks reduce the number of blocks transferred). LU has perfect spatial locality, so the data traffic remains fixed as block size increases. Overhead is reduced, so overall traffic in fact shrinks with increasing block size. In Raytrace, the remote capacity traffic has poor spatial locality, so it grows quickly with block size. In these programs, the true sharing traffic has poor spatial locality too, as is the case in Ocean at column-oriented partition borders (spatial locality even on remote data is good at row-oriented borders). Finally, the graph for Radix clearly shows the impact on remote traffic for false sharing when it occurs past the threshold block size (here about 128 or 256 bytes).
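To make the overhead argument concrete, the following small C program implements a back-of-the-envelope model of traffic versus block size. The 8-byte per-block network overhead is taken from the caption of Figure 8-11; the base miss rate and the block size up to which spatial locality is assumed perfect are illustrative parameters, not measurements from these applications.

/* Back-of-the-envelope model of the overhead effect discussed above.
 * Assumptions (not taken from the text): a miss rate of 1% of instructions
 * at an 8-byte block, and misses that halve with each doubling of block
 * size while spatial locality lasts (up to spatial_limit bytes), then stay
 * flat. The 8-byte per-block overhead is from the Figure 8-11 caption. */
#include <stdio.h>

int main(void) {
    const double base_miss_rate = 0.01;  /* misses per instruction at B = 8 */
    const double overhead = 8.0;         /* bytes of header per block moved */
    const int spatial_limit = 64;        /* bytes over which locality is perfect */

    printf("block  miss/instr  data B/instr  ovhd B/instr   total\n");
    for (int B = 8; B <= 256; B *= 2) {
        int useful = (B < spatial_limit) ? B : spatial_limit;
        double miss_rate = base_miss_rate * 8.0 / useful;
        double data = miss_rate * B;          /* block transfers */
        double ovhd = miss_rate * overhead;   /* fixed header per transfer */
        printf("%5d   %9.5f   %11.5f   %11.5f   %6.4f\n",
               B, miss_rate, data, ovhd, data + ovhd);
    }
    return 0;
}

While spatial locality holds, the data component stays flat and the fixed per-block overhead is amortized over fewer, larger transfers, so total traffic falls; once the block size exceeds the locality of the program, data traffic grows with block size, which is the behavior seen for Raytrace and Radix above.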
8.5 Design Challenges for Directory Protocols

Actually designing a correct, efficient directory protocol involves issues that are more complex and subtle than the simple organizational choices we have discussed so far, just as actually designing a bus-based protocol was more complex than choosing the number of states and drawing the state transition diagram for stable states. We had to deal with the non-atomicity of state transitions, split transaction busses, serialization and ordering issues, deadlock, livelock and starvation. Having understood the basics of directories, we are now ready to dive deeply into these issues for them as well. This section discusses the major new protocol-level design challenges that arise in implementing directory protocols correctly and with high performance, and identifies general techniques for addressing these challenges.
The next two sections will discuss case studies of memory-based and cache-based protocols in detail, describing the exact states and directory structures they use, and will examine the choices they make among the various techniques for addressing design challenges. They will also examine how some of the necessary functionality is implemented in hardware, and illustrate how the actual choices made are influenced by performance, correctness, simplicity, cost, and also limitations of the underlying processing node used. As with the systems discussed in the previous two chapters, the design challenges for scalable coherence protocols are to provide high performance while preserving correctness, and to contain the complexity that results. Let us look at performance and correctness in turn, focusing on issues that were not already addressed for bus-based or non-caching systems. Since performance optimizations tend to increase concurrency and complicate correctness, let us examine them first.

8.5.1 Performance

The network transactions on which cache coherence protocols are built differ from those used in explicit message passing in two ways. First, they are automatically generated by the system, in particular by the communication assists or controllers in accordance with the protocol. Second, they are individually small, each carrying either a request, an acknowledgment, or a cache block of data plus some control bits. However, the basic performance model for network transactions developed in earlier chapters applies here as well. A typical network transaction incurs some overhead on the processor at its source (traversing the cache hierarchy on the way out and back in), some work or occupancy on the communication assists at its end-points (typically looking up state, generating requests, or intervening in the cache), and some delay in the network due to transit latency, network bandwidth and contention. Typically, the processor itself is not involved at the home, the dirty node, or the sharers, but only at the requestor. A protocol does not have much leverage on the basic costs of a single network transaction—transit latency, network bandwidth, assist occupancy, and processor overhead—but it can determine the number and structure of the network transactions needed to realize memory operations like reads and writes under different circumstances. In general, there are three classes of techniques for improving performance: (i) protocol optimizations, (ii) high-level machine organization, and (iii) hardware specialization to reduce latency and occupancy and increase bandwidth. The first two assume a fixed set of cost parameters for the communication architecture, and are discussed in this section. The impact of varying these basic parameters will be examined in Section 8.8, after we understand the protocols in more detail.

Protocol Optimizations

At the protocol level, the two major performance goals are: (i) to reduce the number of network transactions generated per memory operation, which reduces the bandwidth demands placed on the network and the communication assists; and (ii) to reduce the number of actions, especially network transactions, that are on the critical path of the processor, thus reducing uncontended latency. The latter can be done by overlapping the transactions needed for a memory operation as much as possible.
To some extent, protocols can also help reduce the end point assist occupancy per transaction—especially when the assists are programmable—which reduces both uncontended latency as well as end-point contention. The traffic, latency and occupancy characteristics should not scale up quickly with the number of processing nodes used, and should perform gracefully under pathological conditions like hot-spots.
As we have seen, the manner in which directory information is stored determines some basic order statistics on the number of network transactions in the critical path of a memory operation. For example, a memory-based protocol can issue invalidations in an overlapped manner from the home, while in a cache-based protocol the distributed list must be walked by network transactions to learn the identities of the sharers. However, even within a class of protocols there are many ways to improve performance. Consider read operations in a memory-based protocol. The strict request-reply option described earlier is shown in Figure 8-12(a). The home replies to the requestor with a message containing the identity of the owner node.
[Figure 8-12 diagrams: (a) strict request-reply, (b) intervention forwarding, (c) reply forwarding, each showing the numbered network transactions among nodes L, H, and R.]
Figure 8-12 Reducing latency in a flat, memory-based protocol through forwarding. The case shown is of a read request to a block in exclusive state. L represents the local or requesting node, H is the home for the block, and R is the remote owner node that has the exclusive copy of the block.
The requestor then sends a request to the owner, which replies to it with the data (the owner also sends a "revision" message to the home, which updates memory with the data and sets the directory state to be shared). There are four network transactions in the critical path for the read operation, and five transactions in all. One way to reduce these numbers is intervention forwarding. In this case, the home does not reply to the requestor but simply forwards the request as an intervention transaction to the owner, asking it to retrieve the block from its cache. An intervention is just like a request, but is issued in reaction to a request and is directed at a cache rather than a memory (it is similar to an invalidation, but also seeks data from the cache). The owner then replies to the home with the data or an acknowledgment (if the block is clean-exclusive rather than modified)—at which time the home updates its directory state and replies to the requestor with the data (Figure 8-12(b)). Intervention forwarding reduces the total number of transactions needed to four, but all four are still in the critical path. A more aggressive method is reply forwarding (Figure 8-12(c)). Here too, the home forwards the intervention message to the owner node, but the intervention contains the identity of the requestor and the owner replies directly to the requestor itself. The owner also sends a revision message to the home so the memory and directory can be updated. This keeps the number of transactions at four, but reduces the number in the critical path to three (request→intervention→reply-to-requestor) and constitutes a three-message miss.
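The transaction counts just described can be tallied in a few lines of C. The per-hop cost charged to each critical-path transaction is an assumed, illustrative figure used only to show the relative latency of the three schemes; it is not a measured machine parameter.

/* Transaction tallies for a read miss to a remotely dirty block under the
 * three schemes of Figure 8-12, as described in the text. */
#include <stdio.h>

struct scheme { const char *name; int total; int critical; };

int main(void) {
    const int hop_cost = 100;   /* assumed one-way transaction cost, in cycles */
    struct scheme s[] = {
        /* L->H request, H->L reply, L->R request, R->L response, R->H revise */
        { "strict request-reply",    5, 4 },
        /* L->H request, H->R intervention, R->H response, H->L reply */
        { "intervention forwarding", 4, 4 },
        /* L->H request, H->R intervention, R->L response (+ R->H revise off path) */
        { "reply forwarding",        4, 3 },
    };
    for (int i = 0; i < 3; i++)
        printf("%-24s  total=%d  critical path=%d  ~latency=%d cycles\n",
               s[i].name, s[i].total, s[i].critical, s[i].critical * hop_cost);
    return 0;
}

The tally simply restates the point of Figure 8-12: reply forwarding keeps the total at four transactions while removing one from the critical path.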
Notice that with either intervention forwarding or reply forwarding, the protocol is no longer strictly request-reply (a request to the home generates a request to the owner node, which generates a reply). This can complicate deadlock avoidance, as we shall see later. Besides being only intermediate in its latency and traffic characteristics, intervention forwarding has the disadvantage that outstanding intervention requests are kept track of at the home rather than at the requestor, since responses to the interventions are sent to the home. Since requests that cause interventions may come from any of the processors, the home node must keep track of up to k*P interventions at a time, where k is the number of outstanding requests allowed per processor. A requestor, on the other hand, would only have to keep track of at most k outstanding interventions. Since reply forwarding also has better performance characteristics, most systems prefer to use it. Similar forwarding techniques can be used to reduce latency in cache-based schemes at the cost of strict request-reply simplicity, as shown in Figure 8-13. In addition to forwarding, other protocol techniques to reduce latency include overlapping transactions and activities by performing them speculatively. For example, when a request arrives at the home, the controller can read the data from memory in parallel with the directory lookup, in the hope that in most cases the block will indeed be clean at the home. If the directory lookup indicates that the block is dirty in some cache then the memory access is wasted and must be ignored. Finally, protocols may also automatically detect and adjust themselves at runtime to interact better with common sharing patterns to which the standard invalidation-based protocol is not ideally suited (see Exercise 8.4).
[Figure 8-13 diagrams: (a), (b), and (c) show the three invalidation sequences among the home H and the sharers S1, S2, S3, with numbered invalidation and acknowledgment transactions.]
Figure 8-13 Reducing latency in a flat, cache-based protocol. The scenario is of invalidations being sent from the home H to the sharers Si on a write operation. In the strict request-reply case (a), every node includes in its acknowledgment (reply) the identity of the next sharer on the list, and the home then sends that sharer an invalidation. The total number of transactions in the invalidation sequence is 2s, where s is the number of sharers, and all are in the critical path. In (b), each invalidated node forwards the invalidation to the next sharer and in parallel sends an acknowledgment to the home. The total number of transactions is still 2s, but only s+1 are in the critical path. In (c), only the last sharer on the list sends a single acknowledgment telling the home that the sequence is done. The total number of transactions is s+1. (b) and (c) are not strict request-reply.
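The counts given in the caption can be tabulated as a function of the number of sharers s. The critical-path count shown here for scheme (c) is inferred from the structure of the figure (a chain of s invalidations followed by one acknowledgment); it is not stated explicitly in the text.

/* Total and critical-path transaction counts for the three invalidation
 * sequences of Figure 8-13, for s sharers. */
#include <stdio.h>

int main(void) {
    printf("   s   (a) total/crit   (b) total/crit   (c) total/crit\n");
    for (int s = 1; s <= 8; s *= 2) {
        int a_total = 2 * s, a_crit = 2 * s;   /* home invalidates one sharer at a time */
        int b_total = 2 * s, b_crit = s + 1;   /* invals forwarded, acks to home in parallel */
        int c_total = s + 1, c_crit = s + 1;   /* single ack from the last sharer */
        printf("%4d   %7d/%-4d      %7d/%-4d      %7d/%-4d\n",
               s, a_total, a_crit, b_total, b_crit, c_total, c_crit);
    }
    return 0;
}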
High-level Machine Organization Machine organization can interact with the protocol to help improve performance as well. For example, the use of large tertiary caches can reduce the number of protocol transactions due to artifactual communication. For a fixed total number of processors, using multiprocessor rather than uniprocessor nodes may be useful as discussed in the introduction to this chapter: The directory protocol operates only across nodes, and a different coherence protocol, for example a snoopy protocol or another directory protocol, may be used to keep caches coherent within a node. Such an organization with multiprocessor—especially bus-based—nodes in a two-level hierarchy is very common, and has interesting advantages as well as disadvantages. The potential advantages are in both cost and performance. On the cost side, certain fixed pernode costs may be amortized among the processors within a node, and it is possible to use existing SMPs which may themselves be commodity parts. On the performance side, advantages may arise from sharing characteristics that can reduce the number of accesses that must involve the directory protocol and generate network transactions across nodes. If one processor brings a block of data into its cache, another processor in the same node may be able to satisfy its miss to that block (for the same or a different word) more quickly through the local protocol, using cache-to-cache sharing, especially if the block is allocated remotely. Requests may also be combined: If one processor has a request outstanding to the directory protocol for a block, another processor’s request within the same SMP can be combined with and obtain the data from the first processor’s response, reducing latency, network traffic and potential hot-spot contention. These advantages are similar to those of full hierarchical approaches mentioned earlier. Within an SMP, processors may even share a cache at some level of the hierarchy, in which case the tradeoffs discussed in Chapter 6 apply. Finally, cost and performance characteristics may be improved by using a hierarchy of packaging technologies appropriately. Of course, the extent to which the two-level sharing hierarchy can be exploited depends on the sharing patterns of applications, how well processes are mapped to processors in the hierarchy, and the cost difference between communicating within a node and across nodes. For example, applications that have wide but physically localized read-only sharing in a phase of computation, like the Barnes-Hut galaxy simulation, can benefit significantly from cache-to-cache sharing if the miss rates are high to begin with. Applications that exhibit nearest-neighbor sharing (like Ocean) can also have most of their accesses satisfied within a multiprocessor node. However, while some processes may have their all accesses satisfied within their node, others will have accesses along at least one border satisfied remotely, so there will be load imbalances and the benefits of the hierarchy will be diminished. In all-to-all communication patterns, the savings in inherent communication are more modest. Instead of communicating with p-1 remote processors in a p-processor system, a processor now communicates with k-1 local processors and p-k remote ones (where k is the number of processors within a node), a savings of at most (p-k)/(p-1). Finally, with several processes sharing a main memory unit, it may also be easier to distribute data appropriately among processors at page granularity. 
Some of these tradeoffs and application characteristics are explored quantitatively in [Web93, ENS+95]. Of our two case-study machines, the NUMA-Q uses four-processor, bus-based, cache-coherent SMPs as the nodes. The SGI Origin takes an interesting position: two processors share a bus and memory (and a board) to amortize cost, but they are not kept coherent by a snoopy protocol on the bus; rather, a single directory protocol keeps all caches in the machines coherent.
Compared to using uniprocessor nodes, the major potential disadvantage of using multiprocessor nodes is bandwidth. When processors share a bus, an assist or a network interface, they amortize its cost but compete for its bandwidth. If their bandwidth demands are not reduced much by locality, the resulting contention can hurt performance. The solution is to increase the throughput of these resources as processors are added to the node, but this compromises the cost advantages. Sharing a bus within a node has some particular disadvantages. First, if the bus has to accommodate several processors it becomes longer and is not likely to be contained in a single board or other packaging unit. Both these effects slow the bus down, increasing the latency to both local and remote data. Second, if the bus supports snoopy coherence within the node, a request that must be satisfied remotely typically has to wait for local snoop results to be reported before it is sent out to the network, causing unnecessary delays. Third, with a snoopy bus at the remote node too, many references that do go remote will require snoops and data transfers on the local bus as well as the remote bus, increasing latency and reducing bandwidth. Nonetheless, several directory-based systems use snoopy-coherent multiprocessors as their individual nodes [LLJ+93, LoC96, ClA96, WGH+97].

8.5.2 Correctness

Correctness considerations can be divided into three classes. First, the protocol must ensure that the relevant blocks are invalidated/updated and retrieved as needed, and that the necessary state transitions occur. We can assume this happens in all cases, and not discuss it much further. Second, the serialization and ordering relationships defined by coherence and the consistency model must be preserved. Third, the protocol and implementation must be free from deadlock, livelock, and ideally starvation too. Several aspects of scalable protocols and systems complicate the latter two sets of issues beyond what we have seen for bus-based cache-coherent machines or scalable non-coherent machines. There are two basic problems. First, we now have multiple cached copies of a block but no single agent that sees all relevant transactions and can serialize them. Second, with many processors there may be a large number of requests directed toward a single node, accentuating the input buffer problem discussed in the previous chapter. These problems are aggravated by the high latencies in the system, which push us to exploit the protocol optimizations discussed above; these allow more transactions to be in progress simultaneously, further complicating correctness. This subsection describes the new issues that arise in each case (write serialization, sequential consistency, deadlock, livelock and starvation) and lists the major types of solutions that are commonly employed. Some specific solutions used in the case study protocols will be discussed in more detail in subsequent sections.

Serialization to a Location for Coherence

Recall the write serialization clause of coherence. Not only must a given processor be able to construct a serial order out of all the operations to a given location—at least out of all write operations and its own read operations—but all processors must see the writes to a given location as having happened in the same order.
One mechanism we need for serialization is an entity that sees the necessary memory operations to a given location from different processors (the operations that are not contained entirely within a processing node) and determines their serialization. In a bus-based system, operations from different processors are serialized by the order in which their requests appear on the bus (in fact all accesses, to any location, are serialized this way). In a distributed system that does not cache shared data, the consistent serializer for a location is the main memory that is the home of a loca-
tion. For example, the order in which writes become visible is the order in which they reach the memory, and which write’s value a read sees is determined by when that read reaches the memory. In a distributed system with coherent caching, the home memory is again a likely candidate for the entity that determines serialization, at least in a flat directory scheme, since all relevant operations first come to the home. If the home could satisfy all requests itself, then it could simply process them one by one in FIFO order of arrival and determine serialization. However, with multiple copies visibility of an operation to the home does not imply visibility to all processors, and it is easy to construct scenarios where processors may see operations to a location appear to be serialized in different orders than that in which the requests reached the home, and in different orders from each other. As a simple example of serialization problems, consider a network that does not preserve pointto-point order of transactions between end-points. If two write requests for shared data arrive at the home in one order, in an update-based protocol (for example) the updates they generate may arrive at the copies in different orders. As another example, suppose a block is in modified state in a dirty node and two nodes issues read-exclusive requests for it in an invalidation protocol. In a strict request-reply protocol, the home will provide the requestors with the identity of the dirty node, and they will send requests to it. However, with different requestors, even in a network that preserves point-to-point order there is no guarantee that the requests will reach the dirty node in the same order as they reached the home. Which entity provides the globally consistent serialization in this case, and how is this orchestrated with multiple operations for this block simultaneously in flight and potentially needing service from different nodes? Several types of solutions can be used to ensure serialization to a location. Most of them use additional directory states called busy states (sometimes called pending states as well). A block being in busy state at the directory indicates that a previous request that came to the home for that block is still in progress and has not been completed. When a request comes to the home and finds the directory state to be busy, serialization may be provided by one of the following mechanisms. Buffer at the home. The request may be buffered at the home as a pending request until the previous request that is in progress for the block has completed, regardless of whether the previous request was forwarded to a dirty node or a strict request-reply protocol was used (the home should of course process requests for other blocks in the meantime). This ensures that requests will be serviced in FIFO order of arrival at the home, but it reduces concurrency. The buffering must be done regardless of whether a strict request-reply protocol is used or the previous request was forwarded to a dirty node, and this method requires that the home be notified when a write has completed. More importantly, it raises the danger of the input buffer at the home overflowing, since it buffers pending requests for all blocks for which it is the home. One strategy in this case is to let the input buffer overflow into main memory, thus providing effectively infinite buffering as long as there is enough main memory, and thus avoid potential deadlock problems. 
This scheme is used in the MIT Alewife prototype [ABC+95]. Buffer at requestors. Pending requests may be buffered not at the home but at the requestors themselves, by constructing a distributed linked list of pending requests. This is clearly a natural extension of a cache-based approach which already has the support for distributed linked lists, and is used in the SCI protocol [Gus92,IEE93]. Now the number of pending requests that a node may need to keep track of is small and determined only by it. NACK and retry. An incoming request may be NACKed by the home (i.e. negative acknowledgment sent to the requestor) rather than buffered when the directory state indicates that a previous request for the block is still in progress. The request will be retried later, and will be
serialized in the order in which it is actually accepted by the directory (attempts that are NACKed do not enter in the serialization order). This is the approach used in the Origin2000 [LaL97]. Forward to the dirty node: If a request is still outstanding at the home because it has been forwarded to a dirty node, subsequent requests for that block are not buffered or NACKed. Rather, they are forwarded to the dirty node as well, which determines their serialization. The order of serialization is thus determined by the home node when the block is clean at the home, and by the order in which requests reach the dirty node when the block is dirty. If the block leaves the dirty state before a forwarded request reaches it (for example due to a writeback or a previous forwarded request), the request may be NACKed by the dirty node and retried. It will be serialized at the home or a dirty node when the retry is successful. This approach was used in the Stanford DASH protocol [LLG+90,LLJ+93]. Unfortunately, with multiple copies in a distributed network simply identifying a serializing entity is not enough. The problem is that the home or serializing agent may know when its involvement with a request is done, but this does not mean that the request has completed with respect to other nodes. Some remaining transactions for the next request may reach other nodes and complete with respect to them before some transactions for the previous request. We will see concrete examples and solutions in our case study protocols. Essentially, we will learn that in addition to a global serializing entity for a block individual nodes (e.g. requestors) should also preserve a local serialization with respect to each block; for example, they should not apply an incoming transaction to a block while they still have a transaction outstanding for that block. Sequential Consistency: Serialization across Locations Recall the two most interesting components of preserving the sufficient conditions for satisfying sequential consistency: detecting write completion (needed to preserve program order), and ensuring write atomicity. In bus-based machine, we saw that the restricted nature of the interconnect allows the requestor to detect write completion early: the write can be acknowledged to the processor as soon as it obtains access to the bus, without needing to wait for it to actually invalidate or update other processors (Chapter 6). By providing a path through which all transactions pass and ensuring FIFO ordering among writes beyond that, the bus also makes write atomicity quite natural to ensure. In a machine that has a general distributed network but does not cache shared data, detecting the completion of a write requires an explicit acknowledgment from the memory that holds the location (Chapter 7). In fact, the acknowledgment can be generated early, once we know the write has reached that node and been inserted in a FIFO queue to memory; at this point, it is clear that all subsequent reads will no longer see the old value and we can assume the write has completed. Write atomicity falls out naturally: a write is visible only when it updates main memory, and at that point it is visible to all processors. Write Completion. As mentioned in the introduction to this chapter, with multiple copies in a distributed network it is difficult to assume write completion before the invalidations or updates have actually reached all the nodes. 
A write cannot simply be acknowledged by the home as soon as it reaches there and assumed to have effectively completed. The reason is that a subsequent write Y in program order may be issued by the processor after receiving such an acknowledgment for a previous write X, but Y may become visible to another processor before X, thus violating SC. This may happen because the transactions corresponding to Y take a different path through the network, or because the network does not provide point-to-point order. Completion can only be assumed once explicit acknowledgments are received from all copies. Of course, a node with a copy can
generate the acknowledgment as soon as it receives the invalidation—before it is actually applied to the caches—as long as it guarantees the appropriate ordering within its cache hierarchy (just as in Chapter 6). To satisfy the sufficient conditions for SC, a processor waits after issuing a write until all acknowledgments for that write have been received, and only then proceeds past the write to a subsequent memory operation. Write Atomicity. Write atomicity is similarly difficult when there are multiple copies with no shared path to them. To see this, Figure 8-14 shows how an example from Chapter 5 (Figure 5-11) that relies on write atomicity can be violated.
[Figure 8-14 diagram: P1, P2, and P3, each with a cache and memory, connected by the interconnection network; P1 executes A=1, P2 executes while (A==0); B=1;, and P3 executes while (B==0); print A;, with the transaction carrying A=1 to P3 delayed in the network.]
Figure 8-14 Violation of write atomicity in a scalable system with caches. Assume that the network preserves point-to-point order, and every cache starts out with copies of A and B initialized to 0. Under SC, we expect P3 to print 1 as the value of A. However, P2 sees the new value of A, and jumps out of its while loop to write B even before it knows whether the previous write of A by P1 has become visible to P3. This write of B becomes visible to P3 before the write of A by P1, because the latter was delayed in a congested part of the network that the other transactions did not have to go through at all. Thus, P3 reads the new value of B but the old value of A.
The constraints of sequential consistency have to be satisfied by orchestrating network transactions appropriately. A common solution for write atomicity in an invalidation-based scheme is for the current owner of a block (the main-memory module or the processor holding the dirty copy in its cache) to provide the appearance of atomicity, by not allowing access to the new value by any process until all invalidation acknowledgments for the write have returned. Thus, no processor can see the new value until it is visible to all processors. Maintaining the appearance of atomicity is much more difficult for update-based protocols, since the data are sent to the readers and hence accessible immediately. Ensuring that readers do not read the value until it is visible to all of them requires a two-phase interaction. In the first phase the copies of that memory block are updated in all processors' caches, but those processors are prohibited from accessing the new value. In the second phase, after the first phase is known to have completed through acknowledgments as above, those processors are sent messages that allow them to use the new value. This makes update protocols even less attractive for scalable directory-based machines than for bus-based machines.
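A minimal sketch of the two rules described above for an invalidation-based protocol is shown below: the writer does not proceed past a write until all invalidation acknowledgments have arrived, and the owner does not serve the new value to any reader until that same count reaches zero. The structure and field names are illustrative; they are not those of any real controller.

/* Acknowledgment counting for write completion and the appearance of
 * write atomicity, as described in the text. */
#include <stdbool.h>
#include <stdio.h>

struct pending_write {
    int acks_expected;   /* number of sharers invalidated, reported by the home */
    int acks_received;   /* incremented as acknowledgments reach the requestor */
    bool value_visible;  /* may the new value be served to other readers? */
};

static void ack_arrived(struct pending_write *w) {
    if (++w->acks_received == w->acks_expected)
        w->value_visible = true;             /* write has completed everywhere */
}

static bool may_proceed_past_write(const struct pending_write *w) {
    return w->acks_received == w->acks_expected;   /* program order under SC */
}

static bool may_service_read(const struct pending_write *w) {
    return w->value_visible;                 /* preserves write atomicity */
}

int main(void) {
    struct pending_write w = { .acks_expected = 3 };
    for (int i = 0; i < 3; i++) {
        printf("acks so far %d: proceed=%d serve-read=%d\n",
               w.acks_received, may_proceed_past_write(&w), may_service_read(&w));
        ack_arrived(&w);
    }
    printf("acks so far %d: proceed=%d serve-read=%d\n",
           w.acks_received, may_proceed_past_write(&w), may_service_read(&w));
    return 0;
}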
Deadlock In Chapter 7 we discussed an important source of potential deadlock in request-reply protocols such as those of a shared address space: the filling up of a finite input buffer. Three solutions were proposed for deadlock: Provide enough buffer space, either by buffering requests at the requestors using distributed linked lists, or by providing enough input buffer space (in hardware or main memory) for the maximum number of possible incoming transactions. Among cache-coherent machines, this approach is used in the MIT Alewife and the Hal S1. Use NACKs. Provide separate request and reply networks, whether physical or virtual (see Chapter 10), to prevent backups in the potentially poorly behaved request network from blocking the progress of well-behaved reply transactions. Two separate networks would suffice in a protocol that is strictly request-reply; i.e. in which all transactions can be separated into requests and replies such that a request transaction generates only a reply (or nothing), and a reply generates no further transactions. However, we have seen that in the interest of performance many practical coherence protocols use forwarding and are not always strictly request-reply, breaking the deadlock-avoidance assumption. In general, we need as many networks (physical or virtual) as the longest chain of different transaction types needed to complete a given operation. However, using multiple networks is expensive and many of them will be under-utilized. In addition to the approaches that provide enough buffering, two different approaches can be taken to dealing with deadlock in protocols that are not strict request-reply. Both rely on detecting situations when deadlock appears possible, and resorting to a different mechanism to avoid deadlock in these cases. That mechanism may be NACKs, or reverting to a strict request-reply protocol. The potential deadlock situations may be detected in many ways. In the Stanford DASH machine, a node conservatively assumes that deadlock may be about to happen when both its input request and output request buffers fill up beyond a threshold, and the request at the head of the input request queue is one that may need to generate further requests like interventions or invalidations (and is hence capable of causing deadlock). At this point, the node takes such requests off the input queue one by one and sends NACK messages back for them to the requestors. It does this until the request at the head is no longer one that can generate further requests, or until it finds that the output request queue is no longer full. The NACK’ed requestors will retry their requests later. An alternative detection strategy is when the output request buffer is full and has not had a transaction removed from it for T cycles. In the other approach, potential deadlock situations are detected in much the same way. However, instead of sending a NACK to the requestor, the node sends it a reply asking it to send the intervention or invalidation requests directly itself; i.e. it dynamically backs off to a strict requestreply protocol, compromising performance temporarily but not allowing deadlock cycles. The advantage of this approach is that NACKing is a statistical rather than robust solution to such problems, since requests may have to be retried several times in bad situations. This can lead to increased network traffic and increased latency to the time the operation completes. 
Dynamic back-off also has advantages related to livelock, as we shall see next, and is used in the Origin2000 protocol.
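The two reactions to a potential deadlock can be sketched as a small decision routine. The detection condition loosely follows the DASH heuristic described above (both request queues nearly full and a request at the head that would itself generate further requests); the threshold, the field names, and the enumeration are assumptions made for illustration only.

/* Sketch of deadlock detection and the two possible reactions: NACK the
 * request (DASH-style) or dynamically back off to a strict request-reply
 * protocol (Origin-style). */
#include <stdbool.h>
#include <stdio.h>

enum reaction { PROCESS_NORMALLY, NACK_REQUEST, BACK_OFF_TO_REQUEST_REPLY };

struct node_state {
    int in_req_queue_len, out_req_queue_len, queue_capacity;
    bool head_generates_requests;   /* e.g. needs interventions or invalidations */
    bool prefer_back_off;           /* choose back-off rather than NACKs */
};

static enum reaction react(const struct node_state *n) {
    int threshold = (3 * n->queue_capacity) / 4;    /* assumed high-water mark */
    bool possible_deadlock = n->in_req_queue_len  >= threshold &&
                             n->out_req_queue_len >= threshold &&
                             n->head_generates_requests;
    if (!possible_deadlock)
        return PROCESS_NORMALLY;
    return n->prefer_back_off ? BACK_OFF_TO_REQUEST_REPLY : NACK_REQUEST;
}

int main(void) {
    struct node_state dash_like   = { 15, 15, 16, true, false };
    struct node_state origin_like = { 15, 15, 16, true, true  };
    printf("DASH-like reaction:   %d\n", react(&dash_like));     /* 1 = NACK_REQUEST */
    printf("Origin-like reaction: %d\n", react(&origin_like));   /* 2 = BACK_OFF_TO_REQUEST_REPLY */
    return 0;
}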
Livelock Livelock and starvation are taken care of naturally in protocols that provide enough FIFO buffering of requests, whether centralized or through distributed linked lists. In other cases, the classic livelock problem of multiple processors trying to write a block at the same time is taken care of by letting the first request to get to the home go through but NACKing all the others. NACKs are useful mechanisms for resolving race conditions without livelock. However, when used to ease input buffering problems and avoid deadlock, as in the DASH solution above, they succeed in avoiding deadlock but have the potential to cause livelock. For example, when the node that detects a possible deadlock situation NACKs some requests, it is possible that all those requests are retried immediately. With extreme pathology, the same situation could repeat itself continually and livelock could result. (The actual DASH implementation steps around this problem by using a large enough request input buffer, since both the number of nodes and the number of possible outstanding requests per node are small. However, this is not a robust solution for larger, more aggressive machines that cannot provide enough buffer space.) The alternative solution to deadlock, of switching to a strict request-reply protocol in potential deadlock situations, does not have this problem. It guarantees forward progress and also removes the request-request dependence at the home once and for all. Starvation The actual occurrence of starvation is usually very unlikely in well-designed protocols; however, it is not ruled out as a possibility. The fairest solution to starvation is to buffer all requests in FIFO order; however, this can have performance disadvantages, and for protocols that do not do this avoiding starvation can be difficult to guarantee. Solutions that use NACKs and retries are often more susceptible to starvation. Starvation is most likely when many processors repeatedly compete for a resource. Some may keep succeeding, while one or more may be very unlucky in their timing and may always get NACKed. A protocol could decide to do nothing about starvation, and rely on the variability of delays in the system to not allow such an indefinitely repeating pathological situation to occur. The DASH machine uses this solution, and times out with a bus error if the situation persists beyond a threshold time. Or a random delay can be inserted between retries to further reduce the small probability of starvation. Finally, requests may be assigned priorities based on the number of times they have been NACK’ed, a technique that is used in the Origin2000 protocol. Having understood the basic directory organizations and high-level protocols, as well as the key performance and correctness issues in a general context, we are now ready to dive into actual case studies of memory-based and cache-based protocols. We will see what protocol states and activities look like in actual realizations, how directory protocols interact with and are influenced by the underlying processing nodes, what real scalable cache-coherent machines look like, and how actual protocols trade off performance with the complexity of maintaining correctness and of debugging or validating the protocol.
8.6 Memory-based Directory Protocols: The SGI Origin System We discuss flat, memory-based directory protocols first, using the SGI Origin multiprocessor series as a case study. At least for moderate scale systems, this machine uses essentially a full bit vector directory representation. A similar directory representation but slightly different protocol was also used in the Stanford DASH multiprocessor [LLG+90], a research prototype which was the first distributed memory machine to incorporate directory-based coherence. We shall follow a similar discussion template for both this and the next case study (the SCI protocol as used in the Sequent NUMA-Q). We begin with the basic coherence protocol, including the directory structure, the directory and cache states, how operations such as reads, writes and writebacks are handled, and the performance enhancements used. Then, we briefly discuss the position taken on the major correctness issues, followed by some prominent protocol extensions for extra functionality. Having discussed the protocol, we describe the rest of the machine as a multiprocessor and how the coherence machinery fits into it. This includes the processing node, the interconnection network, the input/output, and any interesting interactions between the directory protocol and the underlying node. We end with some important implementation issues—illustrating how it all works and the important data and control pathways—the basic performance characteristics (latency, occupancy, bandwidth) of the protocol, and the resulting application performance for our sample applications. 8.6.1 Cache Coherence Protocol The Origin system is composed of a number of processing nodes connected by a switch-based interconnection network (see Figure 8-15). Every processing node contains two processors, each with first- and second-level caches, a fraction of the total main memory on the machine, an I/O interface, and a single-chip communication assist or coherence controller called the Hub which implements the coherence protocol. The Hub is integrated into the memory system. It sees all cache misses issued by the processors in that node, whether they are to be satisfied locally or remotely, receives transactions coming in from the network (in fact, the Hub implements the network interface as well), and is capable of retrieving data from the local processor caches. In terms of the performance issues discussed in Section 8.5.1, at the protocol level the Origin2000 uses reply forwarding as well as speculative memory operations in parallel with directory lookup at the home. At the machine organization level, the decision in Origin to have two processors per node is driven mostly by cost: Several other components on a node (the Hub, the system bus, etc.) are shared between the processors, thus amortizing their cost while hopefully still providing substantial bandwidth per processor. The Origin designers believed that the latency disadvantages of a snoopy bus discussed earlier outweighed its advantages (see Section 8.5.1), and chose not to maintain snoopy coherence between the two processors within a node. Rather, the bus is simply a shared physical link that is multiplexed between the two processors in a node. This sacrifices the potential advantage of cache to cache sharing within the node, but helps reduce latency. With a Hub shared between two processors, combining of requests to the network (not the protocol) could nonetheless have been supported, but is not due to the additional implementation cost. 
When discussing the protocol in this section, let us assume for simplicity that each node contains only one processor, its cache hierarchy, a Hub, and main memory. We will see the impact of using two processors per node on the directory structure and protocol later.
[Figure 8-15 diagram: each node contains two processors, each with a 1-4 MB L2 cache, sharing a SysAD bus to the Hub; the Hub connects to the directory and 1-4 GB of main memory, to the Xbow I/O crossbar, and through a router to the interconnection network.]

Figure 8-15 Block diagram of the Silicon Graphics Origin multiprocessor.
Other than the use of reply forwarding, the most interesting aspects of the Origin protocol are its use of busy states and NACKs to resolve race conditions and provide serialization to a location, and the way in which it handles race conditions caused by writebacks. However, to illustrate how a complete protocol works in light of these races, as well as the performance enhancement techniques used in different cases, we will describe how it deals with simple read and write requests as well.

Directory Structure and Protocol States

The directory information for a memory block is maintained at the home node for that block. How the directory structure deals with the fact that each node contains two processors and how the directory organization changes with machine size are interesting aspects of the machine, and we shall discuss these after the protocol itself. For now, we can assume a full bit vector approach like in the simple protocol of Section 8.3. In the caches, the protocol uses the states of the MESI protocol, with the same meanings as used in Chapter 5. At the directory, a block may be recorded as being in one of seven states. Three of these are basic states: (i) Unowned, or no copies in the system, (ii) Shared, i.e. zero or more read-only copies, their whereabouts indicated by the presence vector, and (iii) Exclusive, or one read-write copy in the system, indicated by the presence vector. An exclusive directory state means the block might be in either dirty or clean-exclusive state in the cache (i.e. the M or E states of the
MESI protocol). Three other states are busy states. As discussed earlier, these imply that the home has received a previous request for that block, but was not able to complete that operation itself (e.g. the block may have been dirty in a cache in another node); transactions to complete the request are still in progress in the system, so the directory at the home is not yet ready to handle a new request for that block. The three busy states correspond to three different types of requests that might still be thus in progress: a read, a read-exclusive or write, and an uncached read (discussed later). The seventh state is a poison state, which is used to implement a lazy TLB shootdown method for migrating pages among memories (Section 8.6.4). Given these states, let us see how the coherence protocol handles read requests, write requests and writeback requests from a node. The protocol does not rely on any order preservation among transactions in the network, not even point-to-point order among transactions between the same nodes. Handling Read Requests Suppose a processor issues a read that misses in its cache hierarchy. The address of the miss is examined by the local Hub to determine the home node, and a read request transaction is sent to the home node to look up the directory entry. If the home is local, it is looked up by the local Hub itself. At the home, the data for the block is accessed speculatively in parallel with looking up the directory entry. The directory entry lookup, which completes a cycle earlier than the speculative data access, may indicate that the memory block is in one of several different states—shared, unowned, busy, or exclusive—and different actions are taken in each case. Shared or unowned: This means that main memory at the home has the latest copy of the data (so the speculative access was successful). If the state is shared, the bit corresponding to the requestor is set in the directory presence vector; if it is unowned, the directory state is set to exclusive. The home then simply sends the data for the block back to the requestor in a reply transaction. These cases satisfy a strict request-reply protocol. Of course, if the home node is the same as the requesting node, then no network transactions or messages are generated and it is a locally satisfied miss. Busy: This means that the home should not handle the request at this time since a previous request has been forwarded from the home and is still in progress. The requestor is sent a negative acknowledge (NACK) message, asking it to try again later. A NACK is categorized as a reply, but like an acknowledgment it does not carry data. Exclusive: This is the most interesting case. If the home is not the owner of the block, the valid data for the block must be obtained from the owner and must find their way to the requestor as well as to the home (since the state will change to shared). The Origin protocol uses reply forwarding to improve performance over a strict request-reply approach: The request is forwarded to the owner, which replies directly to the requestor, sending a revision message to the home. If the home itself is the owner, then the home can simply reply to the requestor, change the directory state to shared, and set the requestor’s bit in the presence vector. In fact, this case is handled just as if the owner were not the home, except that the “messages” between home and owner do not translate to network transactions. 
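The directory entry and the dispatch on its state for an incoming read request can be condensed into a sketch like the one below. The seven-state breakdown follows the text, but the 64-bit presence vector, the particular busy state chosen, and the message strings are illustrative simplifications rather than the actual Origin data structures; the busy-state and presence-bit bookkeeping is elaborated in the next few paragraphs.

/* Sketch of a directory entry and of the home's dispatch on a read request. */
#include <stdint.h>
#include <stdio.h>

enum dir_state { UNOWNED, SHARED, EXCLUSIVE,
                 BUSY_SHARED, BUSY_EXCLUSIVE, BUSY_UNCACHED, POISON };

struct dir_entry {
    enum dir_state state;
    uint64_t presence;           /* one bit per node: a full bit vector */
};

/* Action taken by the home Hub when a read request from node req arrives. */
static const char *home_handle_read(struct dir_entry *e, int req) {
    switch (e->state) {
    case SHARED:
        e->presence |= 1ull << req;
        return "reply with data from memory";
    case UNOWNED:
        e->state = EXCLUSIVE;
        e->presence = 1ull << req;
        return "exclusive reply with data from memory";
    case EXCLUSIVE: {
        uint64_t owner_bit = e->presence;    /* exactly one bit set: the owner */
        (void)owner_bit;                     /* the intervention goes to this node */
        e->state = BUSY_SHARED;              /* a busy state; exact encoding is illustrative */
        e->presence = 1ull << req;           /* set requestor's bit, clear the owner's */
        return "speculative reply to requestor; forward intervention to owner";
    }
    default:   /* busy states; the poison state (TLB shootdown) is not modeled here */
        return "NACK; requestor retries later";
    }
}

int main(void) {
    struct dir_entry e = { UNOWNED, 0 };
    printf("read from node 3: %s\n", home_handle_read(&e, 3));
    printf("read from node 5: %s\n", home_handle_read(&e, 5));
    printf("read from node 7: %s\n", home_handle_read(&e, 7));
    return 0;
}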
Let us look in a little more detail at what really happens when a read request arrives at the home and finds the state exclusive. (This and several other cases are also shown pictorially in Figure 8-16.) As already mentioned, the main memory block is accessed speculatively in parallel with the directory. When the directory state is discovered to be exclusive, it is set to busy-exclusive state. This is a general principle used in the coherence protocol to prevent race conditions and provide serialization of requests at the home. The state is set to busy whenever a request is received that
cannot be satisfied by the main memory at the home. This ensures that subsequent requests for that block that arrive at the home will be NACK’ed until they can be guaranteed to see the consistent final state, rather than having to be buffered at the home. For example, in the current situation we cannot set the directory state to shared yet since memory does not yet have an up to date copy, and we do not want to leave it as exclusive since then a subsequent request might chase the same exclusive copy of the block as the current request, requiring that serialization be determined by the exclusive node rather than by the home (see earlier discussion of serialization options). Having set the directory entry to busy state, the presence vector is changed to set the requestor’s bit and unset the current owner’s. Why this is done at this time will become clear when we examine writeback requests. Now we see an interesting aspect of the protocol: Even though the directory state is exclusive, the home optimistically assumes that the block will be in clean-exclusive rather than dirty state in the owner’s cache (so the block in main memory is valid), and sends the speculatively accessed memory block as a speculative reply to the requestor. At the same time, the home forwards the intervention request to the owner. The owner checks the state in its cache, and performs one of the following actions:
• If the block is dirty, it sends a response with the data directly to the requestor and also a revision message containing the data to the home. At the requestor, the response overwrites the speculative transfer that was sent by the home. The revision message with data sent to the home is called a sharing writeback, since it writes the data back from the owning cache to main memory and tells it to set the block to shared state.
• If the block is clean-exclusive, the response to the requestor and the revision message to the home do not contain data since both already have the latest copy (the requestor has it via the speculative reply). The response to the requestor is simply a completion acknowledgment, and the revision message is called a downgrade since it simply asks the home to downgrade the state of the block to shared.
In either case, when the home receives the revision message it changes the state from busy to shared. You may have noticed that the use of speculative replies does not have any significant performance advantage in this case, since the requestor has to wait to know the real state at the exclusive node anyway before it can use the data. In fact, a simpler alternative to this scheme would be to simply assume that the block is dirty at the owner, not send a speculative reply, and always have the owner send back a response with the data regardless of whether it has the block in dirty or clean-exclusive state. Why then does Origin use speculative replies in this case? There are two reasons, which illustrate both how a protocol is influenced by the use of existing processors and their quirks as well as how different protocol optimizations influence each other. First, the R10000 processor it uses happens to not return data when it receives an intervention to a clean-exclusive (rather than dirty) block. Second, speculative replies enable a different optimization used in the Origin protocol, which is to allow a processor to simply drop a clean-exclusive block when it is replaced from the cache, rather than write it back to main memory, since main memory will in any case send a speculative reply when needed (see Section 8.6.4).

Handling Write Requests

As we saw in Chapter 5, write "misses" that invoke the protocol may generate either read-exclusive requests, which request both data and ownership, or upgrade requests which request only ownership since the requestor's data is valid. In either case, the request goes to the home, where
the directory state is looked up to determine what actions to take. If the state at the directory is anything but unowned (or busy) the copies in other caches must be invalidated, and to preserve the ordering model invalidations must be explicitly acknowledged. As in the read case, a strict request-reply protocol, intervention forwarding, or reply forwarding can be used (see Exercise 8.2), and Origin chooses reply forwarding to reduce latency: The home updates the directory state and sends the invalidations directly; it also includes the identity of the requestor in the invalidations so they are acknowledged directly back to the requestor itself. The actual handling of the read-exclusive and upgrade requests depends on the state of the directory entry when the request arrives; i.e. whether it is unowned, shared, exclusive or busy.
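The reply-forwarding invalidation step just described can be sketched as a small routine at the home: every sharer other than the requestor is sent an invalidation carrying the requestor's identity, and the requestor is told how many acknowledgments to expect. The vector width and the output format are illustrative.

/* Sketch of the home fanning out invalidations under reply forwarding. */
#include <stdint.h>
#include <stdio.h>

/* Send an invalidation to every node whose presence bit is set, except the
 * requestor itself. Returns the number of acknowledgments the requestor
 * must collect before its write is complete. */
static int send_invalidations(uint64_t presence, int requestor) {
    int acks_expected = 0;
    for (int node = 0; node < 64; node++) {
        if (((presence >> node) & 1) && node != requestor) {
            printf("  inval -> node %d (ack goes directly to node %d)\n",
                   node, requestor);
            acks_expected++;
        }
    }
    return acks_expected;
}

int main(void) {
    uint64_t presence = (1ull << 2) | (1ull << 5) | (1ull << 9);  /* sharers 2, 5, 9 */
    int requestor = 5;                                            /* upgrading writer */
    int acks = send_invalidations(presence, requestor);
    printf("exclusive reply / upgrade ack to node %d, %d invalidations pending\n",
           requestor, acks);
    return 0;
}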
[Figure 8-16 diagrams, showing transactions among the local/requesting node L, the home H, and a remote node R: (a) a read or read-exclusive request to a block in shared or unowned state at the directory (an exclusive reply is sent even in the read case if the block is unowned, so it may be loaded in E rather than S state in the cache); (b) a read or read-exclusive request to a block in exclusive directory state (the intervention is of type shared or exclusive, respectively, the latter causing invalidation as well, and the revision message is a sharing writeback or an ownership transfer, respectively); (c) a read or read-exclusive request to a directory in busy state, or an upgrade request to a directory in busy, unowned or exclusive state (the latter two being inappropriate), which is NACKed; (d) a read-exclusive or upgrade request to a directory in shared state, with an exclusive reply or upgrade acknowledgment to the requestor, invalidation requests to the sharers, and invalidation acknowledgments returned directly to the requestor; (e) a writeback request to a directory in exclusive state; (f) a writeback request to a directory in busy state, where the writeback crosses the intervention for the other request that made the directory busy.]
Figure 8-16 Pictorial depiction of protocol actions in response to requests in the Origin multiprocessor. The case or cases under consideration in each case are shown below the diagram, in terms of the type of request and the state of the directory entry when the request arrives at the home. The messages or types of transactions are listed next to each arc. Since the same diagram represents different request type and directory state combinations, different message types are listed on each arc, separated by slashes.
Unowned: If the request is an upgrade, the state being unowned means that the block has been replaced from the requestor’s cache, and the directory notified, since it sent the upgrade request (this is possible since the Origin protocol does not assume point-to-point network order). An upgrade is no longer the appropriate request, so it is NACKed. The write operation will be retried, presumably as a read-exclusive. If the request is a read-exclusive, the directory state is changed to dirty and the requestor’s presence bit is set. The home replies with the data from memory. Shared: The block must be invalidated in the caches that have copies, if any. The Hub at the home first makes a list of sharers that are to be sent invalidations, using the presence vector. It then sets the directory state to exclusive and sets the presence bit for the requestor. This ensures that subsequent requests for the block will be forwarded to the requestor. If the request was a read-exclusive, the home next sends a reply to the requestor (called an “exclusive reply with invalidations pending”) which also contains the number of sharers from whom to expect invalidation acknowledgments. If the request was an upgrade, the home sends an “upgrade acknowledgment with invalidations pending” to the requestor, which is similar but does not carry the data for the block. In either case, the home next sends invalidation requests to all the sharers, which in turn send acknowledgments to the requestor (not the home). The requestor waits for all acknowledgments to come in before it “closes” or completes the operation. Now, if a new request for the block comes to the home, it will see the directory state as exclusive, and will be forwarded as an intervention to the current requestor. This current requestor will not handle the intervention immediately, but will buffer it until it has received all acknowledgments for its own request and closed the operation. Exclusive: If the request is an upgrade, then an exclusive directory state means another write has beaten this request to the home and changed the value of the block in the meantime. An upgrade is no longer the appropriate request, and is NACK’ed. For read-exclusive requests, the following actions are taken. As with reads, the home sets the directory to a busy state, sets the presence bit of the requestor, and sends a speculative reply to it. An invalidation request is sent to the owner, containing the identity of the requestor (if the home is the owner, this is just an intervention in the local cache and not a network transaction). If the owner has the block in dirty state, it sends a “transfer of ownership” revision message to the home (no data) and a response with the data to the requestor. This response overrides the speculative reply that the requestor receives from the home. If the owner has the block in clean-exclusive state, it relies on the speculative reply from the home and simply sends an acknowledgment to the requestor and a “transfer of ownership” revision to the home. Busy: the request is NACK’ed as in the read case, and must try again. Handling Writeback Requests and Replacements When a node replaces a block that is dirty in its cache, it generates a writeback request. This request carries data, and is replied to with an acknowledgment by the home. 
The directory cannot be in the unowned or shared state when a writeback request arrives, because the writeback requestor has a dirty copy. (A read request cannot change the directory state to shared or unowned in the interval between the generation of the writeback and its arrival at the home, since such a read request would have been forwarded to the very node that is performing the writeback, and the directory state would have been set to busy.) Let us see what happens when the writeback request reaches the home for the two possible directory states: exclusive and busy.

Exclusive: The directory state transitions from exclusive to unowned (since the only cached copy has been replaced from the writer’s cache), and an acknowledgment is returned.
Busy: This is an interesting race condition. The directory state can only be busy because an intervention for the block (due to a request from another node Y, say) has been forwarded to the very node X that is doing the writeback. The intervention and writeback have crossed each other. Now we are in a funny situation. The other operation from Y is already in progress and cannot be undone. We cannot let the writeback be dropped, or we would lose the only valid copy of the block. Nor can we NACK the writeback and retry it after the operation from Y completes, since then Y’s cache will have a valid copy while a different dirty copy is being written back to memory from X’s cache! The protocol solves this problem by essentially combining the two operations. The writeback that finds the directory state busy changes the state to either shared (if the state was busy-shared, i.e. the request from Y was for a read copy) or exclusive (if it was busy-exclusive). The data returned in the writeback is then forwarded by the home to the requestor Y. This serves as the response to Y, instead of the response it would have received directly from X if there were no writeback. When X receives an intervention for the block, it simply ignores it (see Exercise 8.5). The directory also sends a writeback-acknowledgment to X. Node Y’s operation is complete when it receives the response, and the writeback is complete when X receives the writeback-acknowledgment. We will see an exception to this treatment in a more complex case when we discuss the serialization of operations. In general, writebacks introduce many subtle situations into coherence protocols.

If the block being replaced from a cache is in shared state, we mentioned in Section 8.3 that the node may or may not choose to send a replacement hint message back to the home, asking the home to clear its presence bit in the directory. Replacement hints avoid the next invalidation to that block, but they incur assist occupancy and do not reduce traffic. In fact, if the block is not written again by a different node then the replacement hint is a waste. The Origin protocol chooses not to use replacement hints.

In all, the number of transaction types for coherent memory operations in the Origin protocol is 9 requests, 6 invalidations and interventions, and 39 replies. For non-coherent operations such as uncached memory operations, I/O operations and special synchronization support, the number of transactions is 19 requests and 14 replies (no invalidations or interventions).

8.6.2 Dealing with Correctness Issues

So far we have seen what happens at different nodes upon read and write misses, and how some race conditions—for example between requests and writebacks—are resolved. Let us now take a different cut through the Origin protocol, examining how it addresses the correctness issues discussed in Section 8.5.2 and what features the machine provides to deal with errors that occur.

Serialization to a Location for Coherence

The entity designated to serialize cache misses from different processors is the home. Serialization is provided not by buffering requests at the home until previous ones have completed, nor by forwarding them to the owner node even when the directory is in a busy state, but by NACKing requests from the home when the state is busy and causing them to be retried. Serialization is determined by the order in which the home accepts the requests—i.e. satisfies them itself or forwards them—not the order in which they first arrive at the home.
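Putting the pieces of this subsection together, the following C sketch (our own simplification, with invented names; reply generation, forwarding details, and the exception noted above for a writeback from the very node that set the busy state are omitted) shows the home's acceptance decision. Ordinary requests that find the directory busy are NACKed and retried, so the order of acceptance defines the serialization order, while a writeback that finds the directory busy is combined with the operation already in flight rather than NACKed.

    #include <stdbool.h>

    typedef enum {
        DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE,
        DIR_BUSY_SHARED, DIR_BUSY_EXCLUSIVE
    } dir_state_t;

    typedef struct {
        dir_state_t state;
        int         pending_requestor;  /* node whose request set the busy state */
    } dir_entry_t;

    /* Assumed helpers (not shown in this sketch). */
    void send_nack(int node);
    void forward_data_as_reply(int requestor, const void *data);
    void send_writeback_ack(int node);
    void write_block_to_memory(const void *data);

    /* Ordinary read/read-exclusive: the home accepts the request only if the
     * directory is not busy; otherwise the requestor is NACKed and retries.  */
    bool accept_request(dir_entry_t *d, int requestor)
    {
        if (d->state == DIR_BUSY_SHARED || d->state == DIR_BUSY_EXCLUSIVE) {
            send_nack(requestor);
            return false;
        }
        return true;   /* caller goes on to service or forward the request */
    }

    /* Writeback from node x: cannot simply be NACKed when busy, since it may
     * carry the only valid copy; instead it is combined with the operation in
     * flight.  (Whether memory is updated in every case is simplified here.) */
    void handle_writeback(dir_entry_t *d, int x, const void *data)
    {
        write_block_to_memory(data);
        if (d->state == DIR_BUSY_SHARED || d->state == DIR_BUSY_EXCLUSIVE) {
            /* The pending request from node Y crossed this writeback; the
             * writeback data serves as Y's response.                        */
            forward_data_as_reply(d->pending_requestor, data);
            d->state = (d->state == DIR_BUSY_SHARED) ? DIR_SHARED : DIR_EXCLUSIVE;
        } else {
            /* Normal case: directory was exclusive; block becomes unowned.  */
            d->state = DIR_UNOWNED;
        }
        send_writeback_ack(x);
    }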
The earlier general discussion of serialization suggested that more was needed for coherence on a given location than simply a global serializing entity, since the serializing entity does not have
full knowledge of when transactions related to a given operation are completed at all the nodes involved. Having understood a protocol in some depth, we are now ready to examine some concrete examples of this second problem [Len92] and see how it might be addressed.
Example 8-1
Consider the following simple piece of code.

    P1                  P2
    rd A   (i)          wr A
    BARRIER             BARRIER
    rd A   (ii)

The write of A may happen either before the first read of A or after it, but should be serializable with respect to it. The second read of A should in any case return the value written by P2. However, it is quite possible for the effect of the write to get lost if we are not careful. Show how this might happen in a protocol like the Origin’s, and discuss possible solutions.
Answer
Figure 8-17 shows how the problem can occur, with the text in the figure explaining the transactions, the sequence of events, and the problem:
1. P1 issues read request to home node for A.
2. P2 issues read-excl. request to home (for the write of A). Home (serializer) won’t process it until it is done with read.
3. Home receives 1, and in response sends reply to P1 (and sets directory presence bit). Home now thinks read is complete (there are no acknowledgments for a read reply). Unfortunately, the reply does not get to P1 right away.
4. In response to 2, home sends invalidate to P1; it reaches P1 before transaction 3 (there is no point-to-point order assumed in Origin, and in any case the invalidate is a request and 3 is a reply).
5. P1 receives and applies invalidate, sends ack to home.
6. Home sends data reply to P2 corresponding to request 2. Finally, the read reply (3) reaches P1, and overwrites the invalidated block. When P1 reads A after the barrier, it reads this old value rather than seeing an invalid block and fetching the new value. The effect of the write by P2 is lost as far as P1 is concerned.
Figure 8-17 Example showing how a write can be lost even though home thinks it is doing things in order. Transactions associated with the first read operation are shown with dotted lines, and those associated with the write operation are shown in solid lines. Three solid bars going through a transaction indicates that it is delayed in the network.
There are two possible solutions. An unattractive one is to have read replies themselves be acknowledged explicitly, and let the home go on to the next request only after it receives this acknowledgment. This causes the same buffering and deadlock problems as before,
further violates the request-reply nature of the protocol, and leads to long delays. The more likely solution is to ensure that a node that has a request outstanding for a block, such as P1, does not allow access by another request, such as the invalidation, to that block in its cache until its outstanding request completes. P1 may buffer the invalidation request, and apply it only after the read reply is received and completed. Or P1 can apply the invalidation even before the reply is received, and then consider the read reply invalid (a NACK) when it returns and retry it. Origin uses the former solution, while the latter is used in DASH. The buffering needed in Origin is only for replies to requests that this processor generates, so it is small and does not cause deadlock problems.
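As a rough illustration of the Origin-style solution just described, the following C sketch (our own simplification, not the actual Hub logic; field names such as read_outstanding and inval_deferred are invented) buffers an incoming invalidation for a block that has a read outstanding, and applies it only after the read reply has been received and used. Exactly when the invalidation acknowledgment is sent is also simplified here.

    #include <stdbool.h>
    #include <stdint.h>

    typedef uint64_t addr_t;

    /* Per-block bookkeeping at the requesting node (greatly simplified). */
    typedef struct {
        addr_t addr;
        bool   read_outstanding;   /* a read request for this block is in flight     */
        bool   inval_deferred;     /* an invalidation arrived while it was in flight */
        int    ack_dest;           /* node to which the deferred ack must be sent    */
    } block_state_t;

    /* Assumed helpers (not shown): cache fill/invalidate and network sends. */
    void cache_fill(addr_t addr, const void *data);
    void cache_invalidate(addr_t addr);
    void send_inval_ack(int dest, addr_t addr);

    /* Incoming invalidation: if a read is outstanding for the block, buffer
     * (defer) the invalidation instead of applying it immediately.          */
    void on_invalidation(block_state_t *b, int ack_dest)
    {
        if (b->read_outstanding) {
            b->inval_deferred = true;
            b->ack_dest       = ack_dest;
        } else {
            cache_invalidate(b->addr);
            send_inval_ack(ack_dest, b->addr);
        }
    }

    /* Incoming read reply: complete the read first, then apply any deferred
     * invalidation so that stale data cannot linger in the cache.           */
    void on_read_reply(block_state_t *b, const void *data)
    {
        cache_fill(b->addr, data);        /* the processor's load completes here */
        b->read_outstanding = false;
        if (b->inval_deferred) {
            cache_invalidate(b->addr);
            send_inval_ack(b->ack_dest, b->addr);
            b->inval_deferred = false;
        }
    }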
Example 8-2
Not only the requestor but also the home may have to disallow new operations from being applied to a block before previous ones have completed as far as the home is concerned. Otherwise, directory information may be corrupted. Show an example illustrating this need and discuss solutions.
Answer
This example is more subtle and is shown in Figure 8-18. The node issuing the write request detects completion of the write (as far as its involvement is concerned) through acknowledgments before processing another request for the block. The problem is that the home does not wait for its part of the write operation—which includes waiting for the revision message and directory update—to complete before it allows another access (here the writeback) to be applied to the block. The sequence of events in the figure is as follows; the initial condition is that the block is in dirty state in P1’s cache.
1. P2 sends read-exclusive request to home.
2. Home forwards request to P1 (dirty node).
3. P1 sends data reply to P2 (3a), and “ownership-transfer” revise message to home to change owner to P2 (3b).
4. P2, having received its reply, considers write complete. It proceeds, but incurs a replacement of the just-dirtied block, causing it to be written back in transaction 4. This writeback is received by the home before the ownership transfer request from P1 (even point-to-point network order wouldn’t help), and the block is written into memory. Then, when the revise message arrives at the home, the directory is made to point to P2 as having the dirty copy. But this is untrue, and our protocol is corrupted.
Figure 8-18 Example showing how directory information can be corrupted if a node does not wait for a previous request to be responded to before it allows a new access to the same block.
The Origin protocol solves this problem through its busy state: the directory will be in busy-exclusive state when the writeback arrives before the revision message. When the directory detects that the writeback is coming from the same node that put the directory into busy-exclusive state, the writeback is NACKed and must be retried. (Recall from the discussion of handling writebacks that the writeback was treated
differently if the request that set the state to busy came from a different node than the one doing the writeback; in that case the writeback was not NACKed but sent on as the reply to the requestor.)

These examples illustrate the importance of another general requirement that nodes must locally fulfil for proper serialization, beyond the existence of a serializing entity: any node, not just the serializing entity, should not apply a transaction corresponding to a new memory operation to a block until a previously outstanding memory operation on that block is complete as far as that node’s involvement is concerned.

Preserving the Memory Consistency Model

The dynamically scheduled R10000 processor allows independent memory operations to issue out of program order, allowing multiple operations to be outstanding at a time and obtaining some overlap among them. However, it ensures that operations complete in program order, and in fact that writes leave the processor environment and become visible to the memory system in program order with respect to other operations, thus preserving sequential consistency (Chapter 11 will discuss the mechanisms further). The processor does not satisfy the sufficient conditions for sequential consistency spelled out in Chapter 5, in that it does not wait to issue the next operation till the previous one completes, but it satisfies the model itself.1 Since the processor guarantees SC, the extended memory hierarchy can perform any reorderings to different locations that it desires without violating SC.

However, there is one implementation consideration that is important in maintaining SC, which is due to the Origin protocol’s interactions with the processor. Recall from Figure 8-16(d) what happens on a write request (read-exclusive or upgrade) to a block in shared state. The requestor receives two types of replies: (i) an exclusive reply from the home, discussed earlier, whose role really is to indicate that the write has been serialized at memory with respect to other operations for the block, and (ii) invalidation acknowledgments indicating that the other copies have been invalidated and the write has completed. The microprocessor, however, expects only a single reply to its write request, as in a uniprocessor system, so these different replies have to be dealt with by the requesting Hub. To ensure sequential consistency, the Hub must pass the reply on to the processor—allowing it to declare completion—only when both the exclusive reply and the invalidation acknowledgments have been received. It must not pass on the reply simply when the exclusive reply has been received, since that would allow the processor to complete later accesses to other locations even before all invalidations for this one have been acknowledged, violating sequential consistency. We will see in Section 9.2 that such violations are useful when more relaxed memory consistency models than sequential consistency are used. Write atomicity is provided as discussed earlier: a node does not allow any incoming accesses to a block for which invalidations are outstanding until the acknowledgments for those invalidations have returned.
1. This is true for accesses that are under the control of the coherence protocol. The processor also supports memory operations that are not visible to the coherence protocol, called non-coherent memory operations, for which the system does not guarantee any ordering: it is the user’s responsibility to insert synchronization to preserve a desired ordering in these cases.
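The following sketch (our own illustration, not the actual Hub design; the structure and field names are invented) shows the kind of bookkeeping the requesting Hub needs in order to pass a single write completion to the processor only after both the exclusive reply and all invalidation acknowledgments have arrived.

    #include <stdbool.h>

    /* Per outstanding write (read-exclusive or upgrade) at the requesting Hub. */
    typedef struct {
        bool exclusive_reply_seen;  /* reply from the home has arrived             */
        int  acks_expected;         /* sharer count carried in the exclusive reply */
        int  acks_received;         /* invalidation acks that have come back       */
    } write_status_t;

    /* Assumed helper (not shown): hand the single reply the processor expects. */
    void deliver_write_completion_to_processor(void);

    static void maybe_complete(write_status_t *w)
    {
        /* For SC, completion is reported only when the write has been serialized
         * at the home (exclusive reply) AND all copies are invalidated (all acks). */
        if (w->exclusive_reply_seen && w->acks_received == w->acks_expected)
            deliver_write_completion_to_processor();
    }

    void on_exclusive_reply(write_status_t *w, int sharers_to_invalidate)
    {
        w->exclusive_reply_seen = true;
        w->acks_expected = sharers_to_invalidate;
        maybe_complete(w);
    }

    void on_invalidation_ack(write_status_t *w)
    {
        w->acks_received++;
        maybe_complete(w);
    }

Under a more relaxed consistency model, the completion could be reported to the processor earlier, which is the kind of optimization taken up in Section 9.2.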
Deadlock, Livelock and Starvation

The Origin uses finite input buffers and a protocol that is not strict request-reply. To avoid deadlock, it uses the technique of reverting to a strict request-reply protocol when it detects a high-contention situation that may cause deadlock, as discussed in Section 8.5.2. Since NACKs are not used to alleviate contention, livelock is avoided in these situations too. The classic livelock problem due to multiple processors trying to write a block at the same time is avoided by using the busy states and NACKs: the first of these requests to get to the home sets the state to busy and makes forward progress, while the others are NACKed and must retry.

In general, the philosophy of the Origin protocol is two-fold: (i) to be “memory-less”, i.e. every node reacts to incoming events using only local state and no history of previous events; and (ii) to not have an operation hold globally shared resources while it is requesting other resources. The latter leads to the choice of NACKing rather than buffering for a busy resource, and helps prevent deadlock. These decisions greatly simplify the hardware, yet provide high performance in most cases. However, since NACKs are used rather than FIFO ordering, there is still the problem of starvation. This is addressed by associating priorities with requests. The priority is a function of the number of times the request has been NACKed, so starvation should be avoided.1

Error Handling

Despite a correct protocol, hardware and software errors can occur at runtime. These can corrupt memory or write data to different locations than expected (e.g. if the address on which to perform a write becomes corrupted). The Origin system provides many standard mechanisms to handle hardware errors on components. All caches and memories are protected by error correction codes (ECC), and all router and I/O links are protected by cyclic redundancy checks (CRC) and a hardware link-level protocol that automatically detects and retries failures.

In addition, the system provides mechanisms to contain failures within the part of the machine in which the program that caused the failure is running. Access protection rights are provided on both memory and I/O devices, preventing unauthorized nodes from making modifications. These access rights also allow the operating system to be structured into cells or partitions. A cell is a number of nodes, configured at boot time. If an application runs within a cell, it may be disallowed from writing memory or I/O outside that cell. If the application fails and corrupts memory or I/O, it can only affect other applications or the system running within that cell, and cannot harm code running in other cells. Thus, a cell is the unit of fault containment in the system.
1. The priority mechanism works as follows. The directory entry for a block has a “current” priority associated with it. Incoming transactions that will not cause the directory state to become busy are always serviced. Other transactions are only potentially serviced if their priority is greater than or equal to the current directory priority. If such a transaction is NACK’ed (e.g. because the directory is in busy state when it arrives), the current priority of the directory is set to be equal to that of the NACK’ed request. This ensures that the directory will no longer service another request of lower priority until this one is serviced upon retry. To prevent a monotonic increase and “topping out” of the directory entry priority, it is reset to zero whenever a request of priority greater than or equal to it is serviced.
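A minimal sketch of this priority scheme, under our own naming and with many details simplified (in particular, exactly how the directory priority is updated on a NACK differs in the real hardware):

    #include <stdbool.h>

    /* Simplified per-block directory entry, tracking only the anti-starvation
     * priority described in the footnote above (field names are invented).    */
    typedef struct {
        int  current_priority;   /* priority the next serviceable request must meet */
        bool busy;               /* directory is in a busy (pending) state           */
    } dir_entry_t;

    /* A request's priority grows with the number of times it has been NACKed.  */
    static int request_priority(int times_nacked) { return times_nacked; }

    /* Returns true if the request is serviced, false if it is NACKed.
     * would_set_busy is true for requests whose servicing moves the directory
     * into a busy state; other transactions are always serviced.               */
    bool consider_request(dir_entry_t *d, int times_nacked, bool would_set_busy)
    {
        int prio = request_priority(times_nacked);

        if (!would_set_busy)
            return true;                        /* always serviced */

        if (!d->busy && prio >= d->current_priority) {
            d->current_priority = 0;            /* reset so it does not ratchet upward forever */
            d->busy = true;
            return true;                        /* serviced */
        }

        /* NACKed: record the request's priority so that, until it is serviced
         * on retry, the directory will not accept a lower-priority request.    */
        if (prio > d->current_priority)
            d->current_priority = prio;
        return false;
    }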
8.6.3 Details of Directory Structure

While we have assumed a full bit-vector directory organization so far for simplicity, the actual structure is a little more complex, for two reasons: first, to deal with the fact that there are two processors per node, and second, to allow the directory structure to scale to more than 64 nodes. Bits in the bit vector usually correspond to nodes, not individual processors. However, there are in fact three possible formats or interpretations of the directory bits.

If a block is in an exclusive state in a processor cache, then the rest of the directory entry is not a bit vector with one bit on, but rather contains an explicit pointer to that specific processor, not node. This means that interventions forwarded from the home are targeted to a specific processor. Otherwise, for example if the block is cached in shared state, the directory entry is interpreted as a bit vector. There are two directory entry sizes: 16-bit and 64-bit (in the 16-bit case, the directory entry is stored in the same DRAM as the main memory, while in the 64-bit case the rest of the bits are in an extended directory memory module that is looked up in parallel). Even though the processors within a node are not cache-coherent, the unit of visibility to the directory in this format is a node or Hub, not a processor as in the case where the block is in an exclusive state. If an invalidation is sent to a Hub, unlike an intervention it is broadcast to both processors in the node over the SysAD bus. The 16-bit vector therefore supports up to 32 processors, and the 64-bit vector supports up to 128 processors.

For larger systems, the interpretation of the bits changes to the third format. In a p-node system, each bit now corresponds to p/64 nodes. The bit is set when any one (or more) of the nodes in that group of p/64 has a copy of the block. If a bit is set when a write happens, then invalidations are broadcast to all the p/64 nodes represented by that bit. For example, with 1024 processors (512 nodes) each bit corresponds to 8 nodes. This is called a coarse vector representation, and we shall see it again when we discuss overflow strategies for directory representations as an advanced topic in Section 8.11. In fact, the system dynamically chooses between the bit vector and coarse vector representations on a large machine: if all the nodes sharing the block are within the same 64-node “octant” of the machine, a bit-vector representation is used; otherwise a coarse vector is used.
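To make the shared-state directory formats concrete, here is a small sketch (our own illustration; names are invented, the exclusive-pointer format is omitted, and whether a coarse-vector bit covers a contiguous group of nodes or nodes with the same index modulo 64 is an implementation detail, assumed contiguous here) of how a 64-bit vector can be interpreted either as one bit per node or as a coarse vector with one bit per group of p/64 nodes when deciding where to send invalidations.

    #include <stdint.h>
    #include <stdio.h>

    #define VECTOR_BITS 64

    /* Send an invalidation to every node that bit i of a 64-bit presence vector
     * may cover.  With at most 64 nodes each bit names exactly one node; with
     * more nodes, bit i covers a group of (nodes/64) nodes, so invalidations
     * are broadcast to the whole group.                                        */
    void invalidate_sharers(uint64_t presence, int nodes)
    {
        int group = (nodes <= VECTOR_BITS) ? 1 : nodes / VECTOR_BITS;

        for (int bit = 0; bit < VECTOR_BITS; bit++) {
            if (!(presence >> bit & 1))
                continue;
            for (int k = 0; k < group; k++) {
                int node = bit * group + k;
                if (node < nodes)
                    printf("send invalidation to node %d\n", node);
            }
        }
    }

    int main(void)
    {
        /* 512-node system: each bit covers 8 nodes, so a single set bit causes
         * invalidations to be broadcast to all 8 nodes in that group.          */
        invalidate_sharers(1ULL << 3, 512);   /* nodes 24..31 */
        return 0;
    }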
8.6.4 Protocol Extensions

In addition to protocol optimizations such as reply forwarding and speculative memory accesses, the Origin protocol provides some extensions to support special operations and activities that interact with the protocol. These include input/output and DMA operations, page migration, and synchronization.

Support for Input/Output and DMA Operations

To support reads by a DMA device, the protocol provides “uncached read shared” requests. Such a request returns to the DMA device a snapshot of a coherent copy of the data, but that copy is then no longer kept coherent by the protocol. It is used primarily by the I/O system and the block transfer engine, and as such is intended for use by the operating system. For writes from a DMA device, the protocol provides “write-invalidate” requests. A write-invalidate simply blasts the new value into memory, overwriting the previous value. It also invalidates all existing cached copies of the block in the system, thus returning the directory entry to unowned state. From a protocol perspective, it behaves much like a read-exclusive request, except that it modifies the block in memory and leaves the directory in unowned state.
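A rough sketch of what the home might do for such a DMA write-invalidate, under our own simplified directory representation (states and helper names are invented, and the real transaction sequence is more involved):

    #include <stdint.h>
    #include <string.h>

    typedef enum { DIR_UNOWNED, DIR_SHARED, DIR_EXCLUSIVE, DIR_BUSY } dir_state_t;

    #define BLOCK_BYTES 128

    typedef struct {
        dir_state_t state;
        uint64_t    presence;            /* sharer bit vector (simplified) */
        uint8_t     data[BLOCK_BYTES];   /* the memory copy of the block   */
    } dir_entry_t;

    /* Assumed helpers (not shown). */
    void invalidate_node(int node);      /* invalidation or intervention   */
    void ack_dma_write(void);

    /* DMA write-invalidate: overwrite memory with the new value, invalidate
     * all cached copies, and leave the block unowned -- much like a
     * read-exclusive, except memory ends up with the only valid copy.       */
    void dma_write_invalidate(dir_entry_t *d, const uint8_t new_data[BLOCK_BYTES])
    {
        for (int node = 0; node < 64; node++)
            if (d->presence >> node & 1)
                invalidate_node(node);   /* a dirty owner simply discards its
                                            copy, since memory is overwritten */
        memcpy(d->data, new_data, BLOCK_BYTES);
        d->presence = 0;
        d->state    = DIR_UNOWNED;
        ack_dma_write();
    }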
Support for Automatic Page Migration

As we discussed in Chapter 3, on a machine with physically distributed memory it is often important for performance to allocate pages of data intelligently across physical memories, so that most capacity, conflict and cold misses are satisfied locally. Despite the very aggressive communication architecture in the Origin (much more so than in other existing machines), the latency of an access satisfied by remote memory is at least 2-3 times that of a local access even without contention. The appropriate distribution of pages among memories might change dynamically at runtime, either because a parallel application’s access patterns change or because the operating system decides to migrate an application process from one processor to another for better resource management across multiprogrammed applications. It is therefore useful for the system to detect the need for moving pages at runtime, and migrate them automatically to where they are needed.

Origin provides a miss counter per node at the directory for every page, to help determine when most of the misses to a page are coming from a nonlocal processor so that the page should be migrated. When a request comes in for a page, the miss counter for that node is incremented and compared with the miss counter for the home node. If it exceeds the latter by more than a threshold, then the page can be migrated to that remote node. (Sixty-four counters are provided per page, and in a system with more than 64 nodes eight nodes share a counter.)

Unfortunately, page migration is typically very expensive, which often annuls the advantage of doing the migration. The major reason for the high cost is not so much moving the page (which with the block transfer engine takes about 25-30 microseconds for a 16KB page) as changing the virtual-to-physical mappings in the TLBs of all processors that have referenced the page. Migrating a page keeps the virtual address the same but changes the physical address, so the old mappings in the page tables of processes are now invalid. As page table entries are changed, it is important that the cached versions of those entries in the TLBs of processors be invalidated. In fact, all processors must be sent a TLB invalidate message, since we don’t know which ones have a mapping for the page cached in their TLB; the processors are interrupted, and the invalidating processor has to wait for the last among them to respond before it can update the page table entry and continue. This process typically takes over a hundred microseconds, in addition to the cost of moving the page itself.

To reduce this cost, Origin uses a distributed, lazy TLB invalidation mechanism supported by its seventh directory state, the poisoned state. The idea is to not invalidate all TLB entries when the page is moved, but rather to invalidate a processor’s TLB entry only when that processor next references the page. Not only is the 100 or more microseconds of time removed from the single migrating processor’s critical path, but TLB entries end up being invalidated only for those processors that actually subsequently reference the page. Let’s see how this works. To migrate a page, the block transfer engine reads cache blocks from the source page location using special “uncached read exclusive” requests.
This request type returns the latest coherent copy of the data and invalidates any existing cached copies (like a regular read-exclusive request), but also causes main memory to be updated with the latest version of the block and puts the directory in the poisoned state. The migration itself thus takes only the time to do this block transfer. When a processor next tries to access a block from the old page, it will miss in the cache, and will find the block in poisoned state at the directory. At that time, the poisoned state will cause the processor to see a bus error. The special OS handler for this bus error invalidates the processor’s TLB entry, so that it obtains the new mapping from the page table when it retries the access. Of course, the old physical page must be reclaimed by the system at some point, to avoid wasting storage. Once the
block copy has completed, the OS invalidates one TLB entry per scheduler quantum, so that after some fixed amount of time the old page can be moved on to the free list.

Support for Synchronization

Origin provides two types of support for synchronization. First, the load-linked/store-conditional instructions of the R10000 processor are available to compose synchronization operations, as we saw in the previous two chapters. Second, for situations in which many processors contend to update a location, such as a global counter or a barrier, uncached fetch&op primitives are provided. These fetch&op operations are performed at the main memory, so the block is not replicated in the caches and successive nodes trying to update the location do not have to retrieve it from the previous writer’s cache. The cacheable LL/SC is better when the same node tends to repeatedly update the shared variable, and the uncached fetch&op is better when different nodes tend to update in an interleaved or contended way.

The protocol discussion above has provided us with a fairly complete picture of how a flat memory-based directory protocol is implemented out of network transactions and state transitions, just as a bus-based protocol was implemented out of bus transactions and state transitions. Let us now turn our attention to the actual hardware of the Origin2000 machine that implements this protocol. We first provide an overview of the system hardware organization. Then, we examine more deeply how the Hub controller is actually implemented. Finally, we will discuss the performance of the machine. Readers interested only in the protocol can skip the rest of this section, especially the Hub implementation which is quite detailed, without loss of continuity.

8.6.5 Overview of Origin2000 Hardware

As discussed earlier, each node of the Origin2000 contains two MIPS R10000 processors connected by a system bus, a fraction of the main memory on the machine (1-4 GB per node), the Hub which is the combined communication/coherence controller and network interface, and an I/O interface called the Xbow. All but the Xbow are on a single 16”-by-11” printed circuit board. Each processor in a node has its own separate L1 and L2 caches, with the L2 cache configurable from 1 to 4 MB with a cache block size of 128 bytes and 2-way set associativity. The directory state information is stored in the same DRAM modules as main memory or in separate modules, depending on the size of the machine, and looked up in parallel with memory accesses. There is one directory entry per main memory block. Memory is interleaved from 4 ways to 32 ways, depending on the number of modules plugged in (4-way interleaving at 4KB granularity within a module, and up to 32-way at 512MB granularity across modules).

The system has up to 512 such nodes, i.e. up to 1024 R10000 processors. With a 195 MHz R10000 processor, the peak performance per processor is 390 MFLOPS or 780 MIPS (four instructions per cycle), leading to an aggregate peak performance of almost 400 GFLOPS in the machine. The peak bandwidth of the SysAD (system address and data) bus that connects the two processors is 780 MB/s, as is that of the Hub’s connection to memory. Memory bandwidth itself for data is about 670MB/s. The Hub connections to the off-board network router chip and the Xbow are 1.56 GB/sec each, using the same link technology. A more detailed picture of the node board is shown in Figure 8-19. The Hub chip is the heart of the machine.
It sits on the system bus of the node, and connects together the processors, local memory, network interface and Xbow, which communicate with one another through it. All cache misses, whether to local or remote memory, go through it, and it implements the coherence protocol. It is a highly integrated, 500K gate standard-cell design in
Figure 8-19 A node board on the Origin multiprocessor. SC stands for secondary cache and BC for memory bank controller.
0.5µ CMOS technology. It contains outstanding transaction buffers for each of its two processors (each processor itself allows four outstanding requests), a pair of block transfer engines that support block memory copy and fill operations at full system bus bandwidth, and interfaces to the network, the SysAD bus, the memory/directory, and the I/O subsystem. It also implements the at-memory, uncached fetch&op instructions and the page migration support discussed earlier.

The interconnection network has a hypercube topology for machines with up to 64 processors, but a different topology called a fat cube beyond that. This topology will be discussed in Chapter 10. The network links have very high bandwidth (1.56GB/s total per link in the two directions) and low latency (41ns pin to pin through a router), and can use flexible cabling up to 3 feet long for the links. Each link supports four virtual channels. Virtual channels are described in Chapter 10; for now, we can think of the machine as having four distinct networks such that each has about one-fourth of the link bandwidth. One of these virtual channels is reserved for request network transactions and one for replies. The other two can be used for congestion relief and high-priority transactions, thereby violating point-to-point order, or can be reserved for I/O as is usually done.

The Xbow chip connects the Hub to I/O interfaces. It is itself implemented as a crossbar with eight ports. Typically, two nodes (Hubs) might be connected to one Xbow, and through it to six external I/O cards as shown in Figure 8-20. The Xbow is quite similar to the SPIDER router chip, but with simpler buffering and arbitration that allow eight ports to fit on the chip rather than six. It provides two virtual channels per physical channel, and supports wormhole or pipelined routing of packets. The arbiter also supports the reservation of bandwidth for certain devices, to support real-time needs like video I/O. High-performance I/O cards like graphics connect directly to the Xbow ports, but most other ports are connected through a bridge to an I/O bus that allows multiple cards to plug in. Any processor can reference any physical I/O device in the machine, either through uncached references to a special I/O address space or through coherent DMA operations. An I/O device can DMA to and from any memory in the system, not just that on a
node to which it is directly connected through the Xbow. Communication between the processor and the appropriate Xbow is handled transparently by the Hubs and routers. Thus, like memory, I/O is also globally accessible but physically distributed, so locality in I/O distribution is also a performance rather than a correctness issue.
Figure 8-20 Typical Origin I/O configuration shared by two nodes.
8.6.6 Hub Implementation

The communication assist, the Hub, must have certain basic abilities to implement the protocol. It must be able to observe all cache misses, synchronization events and uncached operations, keep track of outgoing requests while moving on to other outgoing and incoming transactions, guarantee the sinking of replies coming in from the network, invalidate cache blocks, and intervene in the caches to retrieve data. It must also coordinate the activities and dependences of all the different types of transactions that flow through it from different components, and implement the necessary pathways and control. This subsection briefly describes the major components of the Hub controller used in the Origin2000, and points out some of the salient features it provides to implement the protocol. Further details of the actual data and control pathways through the Hub, as well as the mechanisms used to actually control the interactions among messages, are also very interesting for truly understanding how scalable cache coherence is implemented, and can be read elsewhere [Sin97].

Being the heart of the communication architecture and coherence protocol, the Hub is a complex chip. It is divided into four major interfaces, one for each type of external entity that it connects together: the processor interface or PI, the memory/directory interface or MI, the network interface or NI, and the I/O interface (see Figure 8-21). These interfaces communicate with one another through an on-chip crossbar switch. Each interface is divided into a few major structures, including FIFO queues to buffer messages to/from other interfaces and to/from external entities. A key property of the interface design is for each interface to shield its external entity from the details of the other interfaces and entities, and vice versa. For example, the PI hides the processors from the rest of the world, so any other interface must only know the behavior of the PI and not
Figure 8-21 Layout of the Hub chip. The crossbar at the center connects the buffers of the four different interfaces. Clockwise from the bottom left, the BTEs are the block transfer engines. The top left corner is the I/O interface (the SSD and SSR interface and translate signals from the I/O ports). Next is the network interface (NI), including the routing tables. The bottom right is the memory/directory (MD) interface, and at the bottom is the processor interface (PI) with its request tracking buffers (CRB and IRB).
of the processors themselves. Let us discuss the structures of the PI, MI, and NI briefly, as well as some examples of the shielding provided by the interfaces.

The Processor Interface (PI)

The PI has the most complex control mechanisms of the interfaces, since it keeps track of outstanding protocol requests and replies. The PI interfaces with the memory buses of the two R10000 processors on one side, and with incoming and outgoing FIFO queues connecting it to each of the other Hub interfaces on the other side (Figure 8-21). Each physical FIFO is logically separated into independent request and reply “virtual FIFOs” by separate logic and staging buffers. In addition, the PI contains three pairs of coherence control buffers that keep track of outstanding transactions, control the flow of messages through the PI, and implement the interactions among messages dictated by the protocol. They do not, however, hold the messages themselves. There are two Read Request Buffers (RRB) that track outstanding read requests from each processor, two Write Request Buffers (WRB) that track outstanding write requests, and two Intervention Request Buffers (IRB) that track incoming invalidation and intervention requests. Access to the three sets of buffers is through a single bus, so all messages contend for access to them. A message that is recorded in one type of buffer may also need to look up another type to check for outstanding conflicting accesses or interventions to the same address (as per the coherence protocol). For example, an outgoing read request performs an associative lookup in the WRB to
see if a writeback to the same address is pending as well. If there is a conflicting WRB entry, an outbound request is not placed in the outgoing request FIFO; rather, a bit is set in the RRB entry to indicate that the read request should be re-issued when the WRB entry is released (i.e. when the writeback is acknowledged or is cancelled by an incoming invalidation as per the protocol). Buffers are also looked up to close an outstanding transaction in them when a completion reply comes in from either the processors or another interface. Since the order of completion is not deterministic, a new transaction must go into any available slot, so the buffers are implemented as fully associative rather than FIFO buffers (the queues that hold the actual messages are FIFO). The buffer lookups determine whether an external request should be generated to either the processor or the other interfaces.

The PI provides some good examples of the shielding provided by interfaces. If the processor provides data as a reply to an incoming intervention, it is the logic in the PI’s outgoing FIFO that expands the reply into two replies, one to the home as a sharing writeback revision message and one to the requestor. The processor itself does not have to be modified for this purpose. Another example is in the mechanisms used to keep track of and match incoming and outgoing requests and responses. All requests passing through the PI are given request numbers, and responses carry these request numbers as well. However, the processor itself does not know about these request numbers, and it is the PI’s job to ensure that when it passes on incoming requests (interventions or invalidations) to the processor, it can match the processor’s replies to the outstanding interventions/invalidations without the processor having to deal with request numbers.

The Memory/Directory Interface (MI)

The MI also has FIFOs between it and the Hub crossbar. The FIFO from the Hub to the MI separates out headers from data, so that the header of the next message can be examined by the directory while the current one is being serviced, allowing writes to be pipelined and performed at peak memory bandwidth. The MI also contains a directory interface, a memory interface and a controller. The directory interface contains the logic and tables that implement the coherence protocol as well as the logic that generates outgoing message headers; the memory interface contains the logic that generates outgoing message data. Both the memory and directory RAMs have their own address and data buses. Some messages, like revision messages coming to the home, may not access the memory but only the directory.

On a read request, the read is issued to memory speculatively, simultaneously with starting the directory operation. The directory state is available a cycle before the memory data, and the controller uses this (plus the message type and initiator) to look up the directory protocol table. This hard-wired table directs it to the action to be taken and the message to send. The directory block sends the latter information to the memory block, where the message headers are assembled and inserted into the outgoing FIFO, together with the data returning from memory. The directory lookup itself is a read-modify-write operation. For this, the system provides support for partial writes of memory blocks, and a one-entry merge buffer to hold the bytes that are to be written between the time they are read from memory and written back.
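The protocol-table idea can be pictured roughly as follows (purely illustrative; the real hard-wired table is far larger, also factors in the initiator, and uses different names). The directory state and the incoming message type index a table whose entry names the action to take and the next directory state.

    /* Minimal sketch of a directory protocol table lookup (illustrative only;
     * unlisted combinations are omitted for brevity).                          */

    typedef enum { ST_UNOWNED, ST_SHARED, ST_EXCLUSIVE, ST_BUSY, NUM_STATES } dstate_t;
    typedef enum { MSG_READ, MSG_READ_EXCL, MSG_WRITEBACK, NUM_MSGS } mtype_t;

    typedef enum { ACT_REPLY_DATA, ACT_FORWARD_TO_OWNER, ACT_NACK, ACT_ACK } action_t;

    typedef struct {
        action_t action;      /* what the home does                    */
        dstate_t next_state;  /* directory state after the transaction */
    } table_entry_t;

    static const table_entry_t protocol_table[NUM_STATES][NUM_MSGS] = {
        [ST_UNOWNED][MSG_READ]        = { ACT_REPLY_DATA,       ST_SHARED    },
        [ST_UNOWNED][MSG_READ_EXCL]   = { ACT_REPLY_DATA,       ST_EXCLUSIVE },
        [ST_SHARED][MSG_READ]         = { ACT_REPLY_DATA,       ST_SHARED    },
        [ST_SHARED][MSG_READ_EXCL]    = { ACT_REPLY_DATA,       ST_EXCLUSIVE }, /* plus invalidations */
        [ST_EXCLUSIVE][MSG_READ]      = { ACT_FORWARD_TO_OWNER, ST_BUSY      },
        [ST_EXCLUSIVE][MSG_READ_EXCL] = { ACT_FORWARD_TO_OWNER, ST_BUSY      },
        [ST_EXCLUSIVE][MSG_WRITEBACK] = { ACT_ACK,              ST_UNOWNED   },
        [ST_BUSY][MSG_READ]           = { ACT_NACK,             ST_BUSY      },
        [ST_BUSY][MSG_READ_EXCL]      = { ACT_NACK,             ST_BUSY      },
    };

    table_entry_t lookup(dstate_t state, mtype_t msg)
    {
        return protocol_table[state][msg];
    }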
Finally, to speed up the at-memory fetch&op accesses provided for synchronization, the MI contains a four-entry LRU fetch&op cache to hold the recent fetch&op addresses and avoid a memory or directory access. This reduces the best-case serialization time at memory for fetch&op to 41ns, about four 100MHz Hub cycles.
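The fetch&op cache can be pictured roughly as follows (an illustrative sketch only, with invented names; in particular, whether the updated value is written through to memory immediately or only on eviction, and how uncached reads interact with the cache, are simplified here).

    #include <stdint.h>
    #include <stdbool.h>

    #define FOP_ENTRIES 4

    typedef struct {
        bool     valid;
        uint64_t addr;
        uint64_t value;     /* current value of the fetch&op variable */
        unsigned last_used; /* for LRU replacement                    */
    } fop_entry_t;

    typedef struct {
        fop_entry_t entry[FOP_ENTRIES];
        unsigned    clock;
    } fop_cache_t;

    /* Assumed helpers (not shown): read/write the variable in main memory. */
    uint64_t memory_read(uint64_t addr);
    void     memory_write(uint64_t addr, uint64_t value);

    /* Uncached fetch&add performed at the memory: hitting in the small cache
     * avoids a memory/directory access and shortens the serialization time.  */
    uint64_t fetch_and_add(fop_cache_t *c, uint64_t addr, uint64_t inc)
    {
        fop_entry_t *e = NULL, *lru = &c->entry[0];

        for (int i = 0; i < FOP_ENTRIES; i++) {
            if (c->entry[i].valid && c->entry[i].addr == addr) { e = &c->entry[i]; break; }
            if (c->entry[i].last_used < lru->last_used) lru = &c->entry[i];
        }
        if (e == NULL) {                              /* miss: evict the LRU entry */
            if (lru->valid)
                memory_write(lru->addr, lru->value);  /* write back the old value  */
            e = lru;
            e->valid = true;
            e->addr  = addr;
            e->value = memory_read(addr);
        }
        e->last_used = ++c->clock;

        uint64_t old = e->value;                      /* fetch ...  */
        e->value = old + inc;                         /* ... and op */
        return old;
    }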
The Network Interface (NI)

The NI interfaces the Hub crossbar to the network router for that node. The router and the Hub internals use different data transport formats, protocols and speeds (100MHz in the Hub versus 400MHz in the router), so one major function of the NI is to translate between the two. Toward the router side, the NI implements a flow control mechanism to avoid network congestion [Sin97]. The FIFOs between the Hub and the network also implement separate virtual FIFOs for requests and replies. The outgoing FIFO has an invalidation destination generator which takes the bit vector of nodes to be invalidated and generates individual messages for them, a routing table that pre-determines the routing decisions based on source and destination nodes, and virtual channel selection logic.

8.6.7 Performance Characteristics

The stated hardware bandwidths and occupancies of the system are as follows, assuming a 195MHz processor. Bandwidth on the SysAD bus, shared by the two processors in a node, is 780MB/s, as is the bandwidth between the Hub and local memory, and the bandwidth between a node and the network in each direction. The data bandwidth of the local memory itself is about 670 MB/s. The occupancy of the Hub at the home for a transaction on a cache block is about 20 Hub cycles (about 40 processor cycles), though it varies between 18 and 30 Hub cycles depending on whether successive directory pages accessed are in the same bank of the directory SDRAM and on the exact pattern of successive accesses (reads or read-exclusives followed by another of the same, or followed by a writeback or an uncached operation).

The latencies of references depend on many factors, such as the type of reference, whether the home is local or not, where and in what state the data are currently cached, and how much contention there is for resources along the way. The latencies can be measured using microbenchmarks. As we did for the Challenge machine in Chapter 6, let us examine microbenchmark results for latency and bandwidth first, followed by the performance and scaling of our six applications.

Characterization with Microbenchmarks

Unlike the MIPS R4400 processor used in the SGI Challenge, the Origin’s MIPS R10000 processor is dynamically scheduled and does not stall on a read miss. This makes it more difficult to measure read latency. We cannot, for example, measure the unloaded latency of a read miss by simply executing the microbenchmark from Chapter 4 that reads the elements of an array with stride greater than the cache block size. Since the misses are to independent locations, subsequent misses will simply be overlapped with one another and will not see their full latency. Instead, this microbenchmark will give us a measure of the throughput that the system can provide on successive read misses issued from a processor. The throughput is the inverse of the “latency” measured by this microbenchmark, which we can call the pipelined latency.

To measure the full latency, we need to ensure that subsequent operations are dependent on each other. To do this, we can use a microbenchmark that chases pointers down a linked list: the address of the next read is not available to the processor until the previous read completes, so the reads cannot be overlapped. However, it turns out this is a little pessimistic in determining the unloaded read latency. The reason is that the processor implements critical word restart; i.e.
it can use the value returned by a read as soon as that word is returned to the processor, without waiting for the rest of the cache block to be loaded in the caches. With the pointer chasing microbenchmark, the next read will be issued before the previous block has been loaded, and will contend
with the loading of that block for cache access. The latency obtained from this microbenchmark, which includes this contention, can be called the back-to-back latency. Avoiding this contention between successive accesses requires that we put some computation between the read misses that depends on the data being read but that does not access the caches between two misses. The goal is to have this computation overlap the time that it takes for the rest of the cache block to load into the caches after a read miss, so the next read miss will not have to stall on cache access. The time for this overlap computation must of course be subtracted from the elapsed time of the microbenchmark to measure the true unloaded read miss latency assuming critical word restart. We can call this the restart latency [HLK97].

Table 8-1 shows the restart and back-to-back latencies measured on the Origin2000, with only one processor executing the microbenchmark but the data that are accessed being distributed among the memories of different numbers of processors. The back-to-back latency is usually about 13 bus cycles (133 ns) larger than the restart latency, because the L2 cache block size (128B) is 12 double words longer than the L1 cache block size (32B) and there is one cycle for bus turnaround.

Table 8-1 Back-to-back and restart latencies for different system sizes. The first column shows where in the extended memory hierarchy the misses are satisfied. For the 8P case, for example, the misses are satisfied in the furthest node away from the requestor in a system of 8 processors. Given the Origin2000 topology, with 8 processors this means traversing through two network routers.

    Memory level         Network Routers Traversed   Back-to-Back (ns)   Restart (ns)
    L1 cache                         0                       5.5              5.5
    L2 cache                         0                      56.9             56.9
    local memory                     0                       472              329
    4P remote memory                 1                       690              564
    8P remote memory                 2                       890              759
    16P remote memory                3                       991              862
Table 8-2 lists the back-to-back latencies for different initial states of the block being referenced [HLK97]. Recall that the owner node is home when the block is in unowned or shared state at the directory, and the node that has a cached copy when the block is in exclusive state. The restart latency for the case where both the home and the owner are the local node (i.e. if owned the block is owned by the other processor in the same node) is 338 ns for the unowned state, 656 ns for the clean-exclusive state, and 892 ns for the modified state.

Table 8-2 Back-to-back latencies (in ns) for different initial states of the block. The first column indicates whether the home of the block is local or not, the second indicates whether the current owner is local or not, and the last three columns give the latencies for the block being in different states.

    Home      Owner     Unowned   Clean-Exclusive   Modified
    local     local        472          707            1036
    remote    local        704          930            1272
    local     remote       472          930            1159
    remote    remote       704          917            1097
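The pointer-chasing measurement described above can be written roughly as follows (a simplified sketch of the general technique, not the code actually used in [HLK97]; the array size, stride, and timing mechanism are illustrative choices). Each element holds the index of the next element to visit, with the links laid out at a stride larger than the cache block size so that successive reads miss; because each load depends on the previous one, the misses cannot be overlapped.

    /* Pointer-chasing read-latency microbenchmark (illustrative sketch). */
    #define _POSIX_C_SOURCE 199309L
    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    #define NELEMS   (1 << 22)   /* array much larger than the L2 cache           */
    #define STRIDE   32          /* elements; 32 * 8B = 256B > 128B L2 block size  */
    #define NITERS   1000000

    int main(void)
    {
        long *next = malloc(NELEMS * sizeof(long));
        if (!next) return 1;

        /* Link the elements with a large stride, wrapping around, so that each
         * successive access touches a different cache block.                    */
        for (long i = 0; i < NELEMS; i++)
            next[i] = (i + STRIDE) % NELEMS;

        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);

        long p = 0;
        for (long i = 0; i < NITERS; i++)
            p = next[p];          /* each load depends on the previous one */

        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
        /* Print p so the compiler cannot optimize the dependent loop away. */
        printf("average back-to-back read latency: %.1f ns (p=%ld)\n",
               ns / NITERS, p);
        free(next);
        return 0;
    }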
Application Speedups

Figure 8-22 shows the speedups for the six parallel applications on a 32-processor Origin2000, using two problem sizes for each.
Figure 8-22 Speedups for the parallel applications on the Origin2000. Two problem sizes are shown for each application (Ocean: n=1026 and n=514; Galaxy: 512K and 16K bodies; Radix: 4M and 1M keys; LU: n=2048 and n=1024; Raytrace: balls and car; Radiosity: largeroom and room). The Radix sorting program does not scale well, and the Radiosity application is limited by the available input problem sizes. The other applications speed up quite well when reasonably large problem sizes are used.
We see that most of the applications speed up well, especially once the problem size is large enough. The dependence on problem size is particularly stark in applications like Ocean and Raytrace. The exceptions to good speedup at this scale are Radiosity and particularly Radix. In the case of Radiosity, it is because even the larger problem is relatively small for a machine of this size and power. We can expect to see better speedups for larger data sets. For Radix, the problem is the highly scattered, bursty pattern of writes in the permutation phase. These writes are mostly to locations that are allocated remotely, and the flood of requests, invalidations, acknowledgments and replies that they generate causes tremendous contention and hot-spotting at Hubs and memories. Running larger problems doesn’t alleviate the situation, since the communication to computation ratio is essentially independent of problem size; in fact, it worsens the situation once a processor’s partition of the keys does not fit in its cache, at which point writeback transactions are also thrown into the mix. For applications like Radix and an FFT (not shown) that exhibit all-to-all bursty communication, the fact that two processors share a Hub and two Hubs share a router also causes contention at these resources, despite their high bandwidths. For these applications, the machine would perform better if it had only a single processor per Hub and per router. However, the sharing of resources does save cost, and does not get in the way of most of the other applications [JS97].

Scaling

Figure 8-23 shows the scaling behavior for the Barnes-Hut galaxy simulation on the Origin2000. The results are quite similar to those on the SGI Challenge in Chapter 6, although extended to more processors. For applications like Ocean, in which an important working set is proportional to the data set size per processor, machines like the Origin2000 display an interesting effect due to scaling when we start from a problem size where the working set does not fit in the cache on a uniprocessor. Under PC and TC scaling, the data set size per processor diminishes with increasing number of processors. Thus, although the communication to computation ratio increases, we
observe superlinear speedups once the working set starts to fit in the cache, since the performance within each node becomes much better when the working set fits in the cache. Under MC scaling, communication to computation ratio does not change, but neither does the working set size per processor. As a result, although the demands on the communication architecture scale more favorably than under TC or PC scaling, speedups are not so good because the beneficial effect on node performance of the working set suddenly fitting in the cache is no longer observed.
Figure 8-23 Scaling of speedups and number of bodies under different scaling models for Barnes-Hut on the Origin2000. As with the results for bus-based machines in Chapter 6, the speedups are very good under all scaling models and the number of bodies that can be simulated grows much more slowly under realistic TC scaling than under MC or naive TC scaling.
8.7 Cache-based Directory Protocols: The Sequent NUMA-Q

The flat, cache-based directory protocol we describe is the IEEE Standard Scalable Coherent Interface (SCI) protocol [Gus92]. As a case study of this protocol, we examine the NUMA-Q machine from Sequent Computer Systems, Inc., a machine targeted toward commercial workloads such as databases and transaction processing [CL96]. This machine relies heavily on third-party commodity hardware, using stock SMPs as the processing nodes, stock I/O links, and the Vitesse “data pump” network interface to move data between the node and the network. The only customization is in the board used to implement the SCI directory protocol. A similar directory protocol is also used in the Convex Exemplar series of machines [Con93,TSS+96], which like the SGI Origin are targeted more toward scientific computing.

NUMA-Q is a collection of homogeneous processing nodes interconnected by high-speed links in a ring interconnection (Figure 8-24). Each processing node is an inexpensive Intel Quad bus-based multiprocessor with four Intel Pentium Pro microprocessors, which illustrates the use of high-volume SMPs as building blocks for larger systems. Systems from Data General [ClA96] and from HAL Computer Systems [WGH+97] also use Pentium Pro Quads as their processing nodes, the former also using an SCI protocol similar to NUMA-Q across Quads and the latter using a memory-based protocol inspired by the Stanford DASH protocol. (In the Convex Exemplar series, the individual nodes connected by the SCI protocol are small directory-based multiprocessors kept coherent by a different directory protocol.) We described the Quad SMP node in Chapter 1 (see Figure 1-17 on page 48) and so shall not discuss it much here.
Figure 8-24 Block diagram of the Sequent NUMA-Q multiprocessor.
The IQ-Link board in each Quad plugs into the Quad memory bus, and takes the place of the Hub in the SGI Origin. In addition to the data path from the Quad bus to the network and the directory logic and storage, it also contains an (expandable) 32MB, 4-way set-associative remote cache for blocks that are fetched from remote memory. This remote cache is the interface of the Quad to the directory protocol. It is the only cache in the Quad that is visible to the SCI directory protocol; the individual processor caches are not, and are kept coherent with the remote cache through the snoopy bus protocol within the Quad. Inclusion is also preserved between the remote cache and the processor caches within the node through the snoopy coherence mechanisms, so if a block is replaced from the remote cache it must be invalidated in the processor caches, and if a block is placed in modified state in a processor cache then the state in the remote cache must reflect this. The cache block size of this remote cache is 64 bytes, which is therefore the granularity of both communication and coherence across Quads.

8.7.1 Cache Coherence Protocol

Two interacting coherence protocols are used in the Sequent NUMA-Q machine. The protocol used within a Quad to keep the processor caches and the remote cache coherent is a standard bus-based Illinois (MESI) protocol. Across Quads, the machine uses the SCI cache-based, linked-list directory protocol. Since the directory protocol sees only the per-Quad remote caches, it is for the most part oblivious to how many processors there are within a node and even to the snoopy bus protocol itself. This section focuses on the SCI directory protocol across remote caches, and ignores the multiprocessor nature of the Quad nodes. Interactions with the snoopy protocol in the Quad will be discussed later.
Directory Structure

The directory structure of SCI is the flat, cache-based distributed doubly-linked list scheme that was described in Section and illustrated in Figure 8-8 on page 527. There is a linked list of sharers per block, and the pointer to the head of this list is stored in the main memory that is the home of the corresponding memory block. An entry in the list corresponds to a remote cache in a Quad. It is stored in Synchronous DRAM memory in the IQ-Link board of that Quad, together with the forward and backward pointers for the list.

Figure 8-25 shows a simplified representation of a list. The first element or node is called the head of the list, and the last node the tail. The head node has both read and write permission on its cached block, while the other nodes have only read permission (except in a special-case extension called pairwise sharing that we shall discuss later). The pointer in a node that points to its neighbor in the direction toward the tail of the list is called the forward or downstream pointer, and the other is called the backward or upstream pointer. Let us see how the cross-node SCI coherence protocol uses this directory representation.
Figure 8-25 An SCI sharing list.
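To make the representation concrete, the following is a minimal sketch in C of the per-block state this scheme implies: a directory entry at the home and a list entry in each Quad's remote cache. The field widths quoted in the comments follow the NUMA-Q implementation described later in this section; plain ints are used here for clarity, and all type and field names are hypothetical, not Sequent's.

```c
#define NO_NODE (-1)            /* null value for list and head pointers */

enum dir_state { DIR_HOME, DIR_FRESH, DIR_GONE };   /* 2 bits in the implementation */

typedef struct {
    enum dir_state state;       /* HOME, FRESH, or GONE */
    int head_node;              /* Quad at the head of the sharing list, or NO_NODE */
} sci_dir_entry_t;              /* the implementation adds a 6-bit head pointer */

enum cache_state {
    CACHE_INVALID, CACHE_PENDING,
    ONLY_FRESH, ONLY_DIRTY,
    HEAD_FRESH, HEAD_DIRTY,
    MID_VALID, TAIL_VALID       /* ...the standard defines twenty-nine stable states */
};

typedef struct {
    unsigned tag;               /* remote-cache tag (13 bits in the implementation) */
    enum cache_state state;     /* which list element this is, plus the block's state */
    int forward;                /* downstream neighbor (toward the tail), or NO_NODE */
    int backward;               /* upstream neighbor (toward the head/home), or NO_NODE */
} sci_cache_entry_t;            /* 7 bits of state and two 6-bit pointers per block */
```

The key design point this reflects is that the sharing list itself lives in the remote caches, distributed across the sharers, while the home keeps only the head pointer and a small amount of state.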
States

A block in main memory can be in one of three directory states whose names are defined by the SCI protocol:

Home: no remote cache (Quad) in the system contains a copy of the block (of course, a processor cache in the home Quad itself may have a copy, since this is not visible to the SCI coherence protocol but is managed by the bus protocol within the Quad).

Fresh: one or more remote caches may have a read-only copy, and the copy in memory is valid. This is like the shared state in Origin.

Gone: another remote cache contains a writable (exclusive-clean or dirty) copy. There is no valid copy on the local node. This is like the exclusive directory state in Origin.

Consider the cache states for blocks in a remote cache. While the processor caches within a Quad use the standard Illinois or MESI protocol, the SCI scheme that governs the remote caches has a large number of possible cache states. In fact, seven bits are used to represent the state of a block in a remote cache. The standard describes twenty-nine stable states and many pending or transient states. Each stable state can be thought of as having two parts, which is reflected in the naming structure of the states. The first part describes which element of the sharing list for that block the cache entry is. This may be ONLY (for a single-element list), HEAD, TAIL, or MID (neither the head nor the tail of a multiple-element list). The second part describes the actual state of the cached block. This includes states like

Dirty: modified and writable

Clean: unmodified (same as memory) but writable

Fresh: data may be read, but not written until memory is informed
Copy: unmodified and readable

and several others. A full description can be found in the SCI Standard document [IEE93]. We shall encounter some of these states as we go along.

The SCI standard defines three primitive operations that can be performed on a list. Memory operations such as read misses, write misses, and cache replacements or writebacks are implemented through these primitive operations, which will be described further in our discussion:

List construction: adding a new node (sharer) to the head of a list;

Rollout: removing a node from a sharing list, which requires that the node communicate with its upstream and downstream neighbors, informing them of their new neighbors so they can update their pointers; and

Purging (invalidation): the node at the head may purge or invalidate all other nodes, resulting in a single-element list. Only the head node can issue a purge.

The SCI standard also describes three "levels" of increasingly sophisticated protocols. There is a minimal protocol that does not permit even read sharing; that is, only one node at a time can have a cached copy of a block. Then there is a typical protocol, which is what most systems are expected to implement. This has provisions for read sharing (multiple copies) and efficient access to data that are in FRESH state in memory, as well as options for efficient DMA transfers and robust recovery from errors. Finally, there is the full protocol, which implements all of the options defined by the standard, including optimizations for pairwise sharing between only two nodes and queue-on-lock-bit (QOLB) synchronization (see later). The NUMA-Q system implements the typical set, and this is the one we shall discuss. Let us see how these states and the three primitives are used to handle different types of memory operations: read requests, write requests, and replacements (including writebacks). In each case, a request is first sent to the home node for the block, whose identity is determined from the address of the block. We will initially ignore the fact that there are multiple processors per Quad, and examine the interactions with the Quad protocol later.

Handling Read Requests

Suppose the read request needs to be propagated off-Quad. We can now think of this node's remote cache as the requesting cache as far as the SCI protocol is concerned. The requesting cache first allocates an entry for the block if necessary, and sets the state of the block to a pending (busy) state; in this state, it will not respond to other requests for that block that come to it. It then begins a list-construction operation to add itself to the sharing list, by sending a request to the home node. When the home receives the request, its block may be in one of three states: HOME, FRESH, or GONE.

HOME: There are no cached copies and the copy in memory is valid. On receiving this request, the home updates its state for the block to FRESH, and sets its head pointer to point to the requesting node. The home then replies to the requestor with the data; upon receipt, the requestor updates its state from PENDING to ONLY_FRESH. All actions at nodes in response to a given transaction are atomic (the processing for one is completed before the next one is handled), and a strict request-reply protocol is followed in all cases.

FRESH: There is a sharing list, but the copy at the home is valid. The home changes its head pointer to point to the requesting cache instead of the previous head of the list. It then sends back a transaction to the requestor containing the data and also a pointer to the previous head.
On receipt, the requestor moves to a different pending state and sends a transaction to that previous head asking to be attached as the new head of the list. The previous head reacts to this message by changing its state from HEAD_FRESH to MID_VALID or from ONLY_FRESH to TAIL_VALID as the case may be, updating its backward pointer to point to the requestor, and sending an acknowledgment to the requestor. When the requestor receives this acknowledgment, it sets its forward pointer to point to the previous head and changes its state from the pending state to HEAD_FRESH. The sequence of transactions and actions is shown in Figure 8-26, for the case where the previous head is in state HEAD_FRESH when the request comes to it.

Figure 8-26 Messages and state transitions in a read miss to a block that is in the FRESH state at home, with one node on the sharing list. Solid lines are the pointers in the sharing list, while dotted lines represent network transactions. Null pointers are not shown. (Panels: (a) initial state plus first round trip; (b) after the first round trip; (c) second round trip; (d) final state.)

GONE: The cache at the head of the sharing list has an exclusive (clean or modified) copy of the block. In this case the memory does not reply with the data, but simply stays in the GONE state and sends a pointer to the previous head back to the requestor. The requestor goes to a new pending state and sends a request to the previous head asking for the data and to be attached at the head of the list. The previous head changes its state from HEAD_DIRTY to MID_VALID or from ONLY_DIRTY to TAIL_VALID, or whatever is appropriate, sets its backward pointer to point to the requestor, and returns the data to the requestor (the data may have to be retrieved from one of the processor caches in the previous head node). The requestor then updates its copy, sets its state to HEAD_DIRTY, and sets its forward pointer to point to the previous head, as always in a single atomic action.
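The home's part in all three cases is small and never blocks: it updates its state and head pointer atomically and replies at once. The following is a hedged sketch in C of that logic, reusing the hypothetical types from the earlier sketch; reply_data() and reply_pointer_only() stand in for the network transactions and are not real SCI or NUMA-Q interfaces.

```c
/* Hedged sketch of the home node's handling of an incoming read request. */
extern void reply_data(int dest, const void *block, int old_head);
extern void reply_pointer_only(int dest, int old_head);

void home_handle_read_request(sci_dir_entry_t *dir, int requestor,
                              const void *memory_copy)
{
    int old_head = dir->head_node;

    switch (dir->state) {
    case DIR_HOME:                    /* no remote copies: data comes from memory */
        dir->state = DIR_FRESH;
        dir->head_node = requestor;
        reply_data(requestor, memory_copy, NO_NODE);
        break;
    case DIR_FRESH:                   /* memory still valid, sharing list exists */
        dir->head_node = requestor;   /* requestor will prepend itself to old_head */
        reply_data(requestor, memory_copy, old_head);
        break;
    case DIR_GONE:                    /* head of the list has an exclusive copy */
        dir->head_node = requestor;   /* data must be fetched from the old head */
        reply_pointer_only(requestor, old_head);
        break;
    }
    /* In all three cases the home acts atomically and replies at once:
     * strict request-reply, with no busy state held at the directory. */
}
```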
Note that even though the reference was a read, the head of the sharing list is in HEAD_DIRTY state. This does not have the standard meaning of dirty that we are familiar with, namely that the node can write the data without having to invalidate anybody. It means that the node can indeed write the data without communicating with the home (and even before sending out the invalidations), but it must invalidate the other nodes in the sharing list since they are in valid state. It is possible to fetch a block in HEAD_DIRTY state even when the directory state is not GONE, for example when that node is expected to write the block soon afterward. In this case, if the directory state is FRESH, the memory returns the data to the requestor, together with a pointer to the old head of the sharing list, and then puts itself in GONE state. The requestor then prepends itself to the sharing list by sending a request to the old head, and puts itself in the HEAD_DIRTY state. The old head changes its state from HEAD_FRESH to MID_VALID or from ONLY_FRESH to TAIL_VALID as appropriate, and other nodes on the sharing list remain unchanged.

In the above cases where a requestor is directed to the old head, it is possible that the old head (let us call it A) is in a pending state when the request from the new requestor (B) reaches it, since it may have some transaction outstanding on that block. This is dealt with by extending the sharing list backward into a (still distributed) pending list. That is, node B will indeed be physically attached to the head of the list, but in a pending state, waiting to truly become the head. If another node C now makes a request to the home, it will be forwarded to node B and will also attach itself to the pending list (the home will now point to C, so subsequent requests will be directed there). At any time, we call the "true head" (here A) the head of the sharing list, the part of the list before the true head the pending list, and the latest element to have joined the pending list (here C) the pending head (see Figure 8-27). When A leaves the pending state and completes its operation, it will pass on the "true head" status to B, which will in turn pass it on to C when its request is completed.

Figure 8-27 Pending lists in the SCI protocol.

The alternative to using a pending list would have been to NACK requests that come to a cache that is in pending state and have the requestor retry until the previous head's pending state changes. We shall return to these issues and choices when we discuss correctness issues in Section 8.7.2. Note also that there is no pending or busy state at the directory, which always simply takes atomic actions to change its state and head pointer and returns the previous state/pointer information to the requestor, a point we will revisit there.
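Before turning to writes, the requestor's side of a read miss (list construction) can be summarized in a hedged sketch. It is written with blocking request/reply calls for clarity, whereas the real protocol is transaction-driven; the reply fields and helper functions are hypothetical, and the types come from the earlier sketches.

```c
/* Hedged sketch of the requestor side of a read miss (list construction). */
struct sci_reply { int old_head; int had_data; const void *data; };

extern struct sci_reply send_read_request(int home, int self);
extern struct sci_reply send_attach_request(int old_head, int self);
extern void install_data(sci_cache_entry_t *line, const void *data);

void requestor_read_miss(sci_cache_entry_t *line, int self, int home)
{
    line->state = CACHE_PENDING;          /* ignore other requests for this block */
    struct sci_reply r = send_read_request(home, self);

    if (r.old_head == NO_NODE) {          /* directory was HOME */
        install_data(line, r.data);
        line->forward = NO_NODE;
        line->state = ONLY_FRESH;
        return;
    }
    /* Directory was FRESH or GONE: prepend ourselves to the old head. If the
     * old head is itself pending, this request simply adds us to the tail of
     * the distributed pending list, and we wait to become the true head. */
    struct sci_reply r2 = send_attach_request(r.old_head, self);
    install_data(line, r.had_data ? r.data : r2.data);  /* from memory or old head */
    line->forward = r.old_head;           /* the old head becomes our downstream */
    line->state = r.had_data ? HEAD_FRESH : HEAD_DIRTY;
}
```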
Handling Write Requests

The head node of a sharing list is assumed to always have the latest copy of the block (unless it is in a pending state). Thus, only the head node is allowed to write a block and issue invalidations. On a write miss, there are three cases. The first case is when the writer is already at the head of the list but does not have the sole modified copy. In this case, it first ensures that it is in the appropriate state, by communicating with the home if necessary (and in the process ensuring that the home block is in, or transitions to, the GONE state). It then modifies the data locally and invalidates the rest of the nodes in the sharing list. This case is elaborated on in the next two paragraphs. The second case is when the writer is not in the sharing list at all. In this case, the writer must first allocate space for and obtain a copy of the block, then add itself to the head of the list using the list-construction operation, and then perform the above steps to complete the write. The third case is when the writer is in the sharing list but not at the head. In this case, it must remove itself from the list (rollout), then add itself to the head (list construction), and finally perform the above steps. We shall discuss rollout in the context of replacement, where it is also needed, and we have already seen list construction.

Let us focus on the case where the writing node is already at the head of the list. If the block is in HEAD_DIRTY state in the writer's cache, it is modified right away (since the directory must be in GONE state) and then the writing node purges the rest of the sharing list. This is done in a serialized request-reply manner: an invalidation request is sent to the next node in the sharing list, which rolls itself out from the list and sends back to the head a pointer to the next node in the list. The head then sends this node a similar request, and so on until all entries are purged, i.e. until the reply to the head contains a null pointer (see also Figure 8-28). The writer, or head node, stays in a special pending state while the purging is in progress. During this time, new attempts to add to the sharing list are delayed in a pending list as usual.

Figure 8-28 Purging a sharing list from a HEAD_DIRTY node in SCI. (Panels: (a) initial state plus first round trip; (b) state after the first round trip, plus the second round trip; (c) final state.)

The latency of purging a sharing list is a few round trips (invalidation request, acknowledgment, and the rollout transactions) plus the associated actions per sharing-list entry, so it is important that sharing lists not get too long. It is possible to reduce the number of network transactions in the critical path by having each node pass on the invalidation request to the next node rather than return the next node's identity to the writer. This is not part of the SCI standard, since it complicates protocol-level recovery from errors by distributing the state of the invalidation progress, but practical systems may be tempted to take advantage of this short-cut.
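The serialized purge just described can be sketched as a simple loop from the writer's perspective. This is a hedged illustration only: send_invalidate() is a hypothetical blocking primitive whose reply carries the purged node's old forward pointer, not an interface defined by SCI, and the real protocol is handler-driven rather than blocking.

```c
/* Hedged sketch of the serialized purge of a sharing list by the head/writer. */
extern int send_invalidate(int target, int writer);  /* returns next node, or NO_NODE */

void purge_sharing_list(sci_cache_entry_t *line, int self)
{
    line->state = CACHE_PENDING;          /* new sharers are delayed meanwhile */
    int next = line->forward;             /* first node after the head */

    while (next != NO_NODE) {
        /* Each target rolls itself out and tells the writer who follows it,
         * so the cost grows linearly with the length of the sharing list. */
        next = send_invalidate(next, self);
    }
    line->forward = NO_NODE;              /* back to a single-element list */
    line->state = ONLY_DIRTY;             /* the writer holds the only, dirty copy */
}
```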
If the writer is the head of the sharing list but has the block in HEAD_FRESH state, then the state must be changed to HEAD_DIRTY before the block can be modified and the rest of the entries purged. The writer goes into a pending state and sends a request to the home; the home changes from FRESH to GONE state and replies, and then the writer goes into a different pending state and purges the rest of the list as above. It may be that when the request reaches the home, the home is no longer in FRESH state but points to a newly queued node that got there in the meantime and has been directed to the writer. When the home looks up its state, it detects this situation and sends the writer a corresponding reply that is like a NACK. When the writer receives this reply, based on its local pending state it deletes itself from the sharing list (how it does this given that a request is coming at it is discussed in the next subsection) and tries to reattach as the head in HEAD_DIRTY or ONLY_DIRTY state by sending the appropriate new request to the home. Thus, it is not a retry in the sense of trying the same request again, but a suitably modified request that reflects the new state of the writer and the home. The last case for a write is the one in which the writer has the block in ONLY_DIRTY state; in this case it can modify the block without generating any network transactions.

Handling Writeback and Replacement Requests

A node that is in a sharing list for a block may need to delete itself from the list, either because it must become the head in order to perform a write operation, or because the block must be replaced to make room for another block in the cache for capacity or conflict reasons, or because it is being invalidated. Even if the block is in shared state and does not have to be written back, the space in the cache (and the pointer storage) will now be used for another block and its list pointers, so to preserve a correct representation the block being replaced must be removed from its sharing list. These replacements and list removals use the rollout operation.

Consider the general case of a node trying to roll out from the middle of a sharing list. The node first sets itself to a special pending state, then sends a request each to its upstream and downstream neighbors asking them to update their forward and backward pointers, respectively, to skip that node. The pending state is needed since there is nothing to prevent two adjacent nodes in a sharing list from trying to roll themselves out at the same time, which can lead to a race condition in the updating of pointers. Even with the pending state, if two adjacent nodes do try to roll out at the same time, they may set themselves to pending state simultaneously and send messages to each other. This can cause deadlock, since neither will respond while it is in a pending state. A simple priority convention is used to avoid such deadlock: the node closer to the tail of the list has priority and is deleted first. The neighbors do not have to change their states, and the operation is completed by setting the state of the replaced cache entry to invalid when both neighbors have replied. If the element being replaced is the last element of a two-element list, then the head of the list may change its state from HEAD_DIRTY or HEAD_FRESH to ONLY_DIRTY or ONLY_FRESH as appropriate.

If the entry to be rolled out is the head of the list, then the entry may be in dirty state (a writeback) or in fresh state (a replacement). Here too, the same set of transactions is used whether it is a writeback or merely a replacement. The head puts itself in a special rollout pending state and first sends a transaction to its downstream neighbor. This causes the latter to set its backward pointer to the home memory and change its state appropriately (e.g. from TAIL_VALID or MID_VALID to HEAD_DIRTY, or from MID_FRESH to HEAD_FRESH). When the replacing node receives a reply, it sends a transaction to the home, which updates its pointer to point to the new head but need not change its state.
The home replies to the replacer, which is now out of the list and sets its state to INVALID. Of course, if the head is the only node in the list, then it needs to communicate only with memory, which will set its state to HOME.
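A hedged sketch of the rollout of a node from the middle of a sharing list, following the description above, is shown below. The tie-breaking rule for two adjacent nodes rolling out at the same time (the node closer to the tail wins) is omitted for brevity, and the message primitives are hypothetical and shown as blocking calls.

```c
/* Hedged sketch of rollout from the middle of a sharing list. */
extern void ask_set_forward(int neighbor, int new_forward);   /* to upstream node   */
extern void ask_set_backward(int neighbor, int new_backward); /* to downstream node */

void rollout_from_middle(sci_cache_entry_t *line)
{
    line->state = CACHE_PENDING;                     /* guard against pointer races  */
    ask_set_forward(line->backward, line->forward);  /* upstream now skips this node */
    ask_set_backward(line->forward, line->backward); /* downstream now skips it too  */
    /* ...once both neighbors have replied (they need not change their states)... */
    line->forward = NO_NODE;
    line->backward = NO_NODE;
    line->state = CACHE_INVALID;                     /* the entry can now be reused  */
}
```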
In a distributed protocol like SCI, it is possible that a node sends a request to another node, but when it arrives the local state at the recipient is not compatible with the incoming request (another transaction may have changed the state in the interim). We saw one example in the previous subsection, with a write to a block in HEAD_FRESH state in the cache, and the above situation provides another. For one thing, by the time the message from the head that is trying to replace itself (let us call it the replacer) gets to the home, the home may have set its head pointer to point to a different node X, from which it has received a request for the block in the interim. In general, whenever a transaction comes in, the recipient looks up its local state and the incoming request type; if it detects a mismatch, it does not perform the operation that the request solicits but issues a reply that is a lot like a NACK (plus perhaps some other information). The requestor will then check its local state again and take an appropriate action. In this specific case, the home detects that the incoming transaction type requires that the requestor be the current head, which is not true, so it NACKs the request. The replacer keeps retrying the request to the home and being NACKed. At some point, the request from X that was directed to the replacer will reach the replacer, asking to be prepended to the list. The replacer will look up its (pending) state and reply to that request, telling X to instead go to the downstream neighbor (the real head, since the replacer is rolling out of the list). The replacer is now off the list and in a different pending state, waiting to go to INVALID state, which it will do when the next NACK from the home reaches it. Thus, the protocol does include NACKs, though not in the traditional sense of asking requests to retry when a node or resource is busy. NACKs are just to indicate inappropriate requests and facilitate changes of state at the requestor; a request that is NACKed will never succeed in its original form, but may cause a new type of request to be generated which may succeed.

Finally, a performance question when a block needs to be written back upon a miss is whether the miss should be satisfied first or the block should be written back first. In discussing bus-based protocols, we saw that most often the miss is serviced first, and the block to be written back is put in a writeback buffer (which must also be snooped). In NUMA-Q, the simplifying decision is made to service the writeback (rollout) first, and only then satisfy the miss. While this slows down misses that cause writebacks, the complexity of the buffering solution is greater here than in bus-based systems. More importantly, the replacements we are concerned with here are from the remote cache, which is large enough (tens of megabytes) that replacements are likely to be very infrequent.

8.7.2 Dealing with Correctness Issues

A major emphasis in the SCI standard is providing well-defined, uniform mechanisms for preserving serialization, resolving race conditions, and avoiding deadlock, livelock and starvation. The standard takes a stronger position on these issues, particularly starvation and fairness, than many other coherence protocols. We have mentioned earlier that most of the correctness considerations are satisfied by the use of distributed lists of sharers as well as pending requests, but let us look at how this works in one further level of detail.

Serialization of Operations to a Given Location

In the SCI protocol too, the home node is the entity that determines the order in which cache misses to a block are serialized. However, here the order is that in which the requests first arrive at the home, and the mechanism used for ensuring this order is very different.
There is no busy state at the home. Generally, the home accepts every request that comes to it, either satisfying it wholly by itself or directing it to the node that it sees as the current head of the sharing list (this may in fact be the true head of the sharing list or just the head of the pending list for the line). Before it directs the request to another node, it first updates its head pointer to point to the current requestor. The next request for the block from any node will see the updated state and pointer—
i.e. to the previous requestor—even though the operation corresponding to the previous request is not globally complete. This ensures that the home does not direct two conflicting requests for a block to the same node at the same time, avoiding race conditions. As we have seen, if a request cannot be satisfied at the previous head node to which it was directed—i.e. if that node is in pending state—the requestor will attach itself to the distributed pending list for that block and await its turn as long as necessary (see Figure 8-27 on page 571). Nodes in the pending list obtain access to the block in FIFO order, ensuring that the order in which they complete is the same as that in which they first reached the home. In some cases, the home may NACK requests. However, unlike in the earlier protocols, we have seen that those requests will never succeed in their current form, so they do not count in the serialization. They may be modified to new, different requests that will succeed, and in that case those new requests will be serialized in the order in which they first reach the home.

Memory Consistency Model

While the SCI standard defines a coherence protocol and a transport layer, including a network interface design, it does not specify many other aspects, such as details of the physical implementation. One of the aspects that it does not prescribe is the memory consistency model. Such matters are left to the system implementor. NUMA-Q does not satisfy sequential consistency, but rather a more relaxed memory consistency model called processor consistency that we shall discuss in Section 9.2, since this is the consistency model supported by the underlying microprocessor.

Deadlock, Livelock and Starvation

The fact that a distributed pending list is used to hold waiting requests at the requestors themselves, rather than a hardware queue shared at the home node by all blocks allocated in it, implies that there is no danger of input buffers filling up and hence no deadlock problem at the protocol level. Since requests are not NACKed from the home (unless they are inappropriate and must be altered) but simply join the pending list and always make progress, livelock does not occur. The list mechanism also ensures that requests are handled in FIFO order as they first come to the home, thus preventing starvation. The total number of pending lists that a node can be a part of is the number of requests it can have outstanding, and the storage for the pending lists is already available in the cache entries (replacement of a pending entry is not allowed; the processor stalls on the access that is causing this replacement until the entry is no longer pending). Queuing and buffering issues at the lower, transport level are left up to the implementation; the SCI standard does not take a position on them, though most implementations, including NUMA-Q, use separate request and reply queues on each of the incoming and outgoing paths.

Error Handling

The SCI standard provides some options in the "typical" protocol to recover from errors at the protocol level. NUMA-Q does not implement these, but rather assumes that the hardware links are reliable. Standard ECC and CRC checks are provided to detect and recover from hardware errors in the memory and network links. Robustness to errors at the protocol level often comes at the cost of performance.
For example, we mentioned earlier that SCI’s decision to have the writer send all the invalidations one by one serialized by replies simplifies error recovery (the writer
knows how many invalidations have been completed when an error occurs) but compromises performance. NUMA-Q retains this feature, but it and other systems may not retain it in future releases.

8.7.3 Protocol Extensions

While the SCI protocol is fair and quite robust to errors, many types of operations can generate many serialized network transactions and therefore become quite expensive. A read miss requires two network transactions with the home, and at least two with the head node if there is one, and perhaps more with the head node if it is in a pending state. A replacement requires a rollout, which again is four network transactions. But the most troublesome operation from a scalability viewpoint is invalidation on a write. As we have seen, the cost of the invalidation scales linearly with the number of nodes on the sharing list, with a fairly large constant (more than a round-trip time), which means that writes to widely shared data do not scale well at all. Other performance bottlenecks arise when many requests arrive for the same block and must be added to a pending list; in general, the latency of misses tends to be larger in SCI than in memory-based protocols.
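To make the scaling concern concrete, a back-of-the-envelope model of the serialized purge cost might look like the sketch below. All parameters and the example numbers are hypothetical inputs chosen for illustration, not NUMA-Q measurements.

```c
/* Rough illustrative cost model for a write that must purge an SCI sharing
 * list, following the discussion above. */
double sci_purge_latency_ns(int sharers_to_invalidate,
                            double round_trip_ns,       /* writer <-> sharer        */
                            double per_entry_action_ns) /* rollout and state update */
{
    /* Entries are invalidated one at a time, serialized by replies, so the
     * cost grows linearly with the sharing-list length. */
    return sharers_to_invalidate * (round_trip_ns + per_entry_action_ns);
}

/* Example (made-up numbers): 16 sharers at a 2000ns round trip plus 500ns of
 * per-entry work gives 16 * 2500ns = 40 microseconds for a single write,
 * which is why writes to widely shared data scale poorly in the base protocol. */
```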
Extensions have been proposed to SCI to deal with widely shared data, building a combining hierarchy through the bridges or switches that connect rings. Some extensions require changes to the basic protocol and hardware structures, while others are plug-in compatible with the basic SCI protocol and only require new implementations of the bridges. The complexity of the extensions tends to reduce performance for low degrees of sharing. They are not finalized in the standard, are quite complex, and are beyond the scope of this discussion. More information can be found in [IEE95, KaG96, Kax96]. One extension that is included in the standard specializes the protocol for the case where only two nodes share a cache block and ping-pong ownership of it back and forth between them by both writing it repeatedly. This is described in the SCI protocol document [IEE93].

Unlike Origin, NUMA-Q does not provide hardware or OS support for automatic page migration. With the very large remote caches, capacity misses that need to go off-node are less of an issue. However, proper page placement can still be useful when a processor writes and has to obtain ownership of data: if nobody else has a copy (for example, the interior portion of a processor's partition in the equation solver kernel or in Ocean), then obtaining ownership generates no network traffic if the page is allocated locally, whereas it does if the page is remote. The NUMA-Q position is that data migration in main memory is the responsibility of user-level software. Similarly, little hardware support is provided for synchronization beyond simple atomic exchange primitives like test&set. An interesting aspect of NUMA-Q in terms of coherence protocols is that it has two of them: a bus-based protocol within the Quad and an SCI directory protocol across Quads, isolated from the Quad protocol by the remote cache. The interactions between these two protocols are best understood after we look more closely at the organization of the IQ-Link.

8.7.4 Overview of NUMA-Q Hardware

Within a Quad, the second-level caches currently shipped in NUMA-Q systems are 512KB and 4-way set associative with a 32-byte block size. The Quad bus is a 532 MB/s split-transaction, in-order bus, with limited facilities for the out-of-order responses needed by a two-level coherence scheme. A Quad also contains up to 4GB of globally addressable main memory; two 32-bit wide, 133MB/s Peripheral Component Interface (PCI) buses, connected to the Quad bus by PCI bridges, to which I/O devices and a management and diagnostic controller can attach; and the
IQ-Link board that plugs into the memory bus and includes the communication assist and the network interface. In addition to the directory information for locally allocated data and the tags for remotely allocated but locally cached data, which it keeps on both the bus side and the directory side, the IQ-Link board consists of four major functional blocks, as shown in Figure 8-29: the bus interface controller, the SCI link interface controller, the DataPump, and the RAM arrays. We introduce these briefly here; they will be described further when we discuss the implementation of the IQ-Link in Section 8.7.6.
Figure 8-29 Functional block diagram of the NUMA-Q IQ-Link board. The remote cache data is implemented in Synchronous DRAM (SDRAM). The bus-side tags and directory are implemented in Static RAM (SRAM), while the network side tags and directory can afford to be slower and are therefore implemented in SDRAM.
The Orion bus interface controller (OBIC) provides the interface to the shared Quad bus, managing the remote cache data arrays and the bus snooping and requesting logic. It acts both as a pseudo-memory controller that translates accesses to nonlocal data and as a pseudo-processor that puts incoming transactions from the network onto the bus. The DataPump, a gallium arsenide chip built by Vitesse Semiconductor Corporation, provides the link- and packet-level transport protocol of the SCI standard. It provides an interface to a ring interconnect, pulling off packets that are destined for its Quad and letting other packets go by. The SCI link interface controller (SCLIC) interfaces to the DataPump and the OBIC, as well as to the interrupt controller and the directory tags. Its main function is to manage the SCI coherence protocol, using one or more programmable protocol engines.
For the interconnection across Quads, the SCI standard defines both a transport layer and a cache coherence protocol. The transport layer defines a functional specification for a node-to-network interface, and a network topology that consists of rings made out of point-to-point links. In particular, it defines a 1GB/sec ring interconnect and the transactions that can be generated on it. The NUMA-Q system is initially a single-ring topology of up to eight Quads as shown in Figure 8-24, and we shall assume this topology in discussing the protocol (though it does not really matter for the basic protocol). Cables from the Quads connect to the ports of a ring, which is contained in a single box called the Lash. Larger systems will include multiple eight-Quad systems connected together with local area networks. In general, the SCI standard envisions that, due to the high latency of rings, larger systems will be built out of multiple rings interconnected by switches. The transport layer itself will be discussed in Chapter 10. Particularly since the machine is targeted toward database and transaction processing workloads, I/O is an important focus of the design. As in Origin, I/O is globally addressable, so any processor can write to or read from any I/O device, not just those attached to the local Quad. This is very convenient for commercial applications, which are often not structured so that a processor need only access its local disks. I/O devices are connected to the two PCI buses that attach through PCI bridges to the Quad bus. Each PCI bus is clocked at half the speed of the memory bus and is half as wide, yielding roughly one quarter the bandwidth. One way for a processor to access I/O devices on other Quads is through the SCI rings, whether through the cache coherence protocol or through uncached writes, just as Origin does through its Hubs and network. However, bandwidth is a precious resource on a ring network. Since commercial applications are often not structured to exploit locality in their I/O operations, I/O transfers can occupy substantial bandwidth, interfering with memory accesses. NUMA-Q therefore provides a separate communication substrate through the PCI buses for inter-Quad I/O transfers. A Fibre Channel link connects to a PCI bus on each node. The links are connected to the shared disks in the system through either point-to-point connections, an arbitrated Fibre Channel loop, or a Fibre Channel switch, depending on the scale of the processing and I/O systems (Figure 8-30). Fibre Channel talks to the disks through a bridge that converts the Fibre Channel data format to the SCSI format that the disks accept, at 60MB/s sustained. I/O to any disk in the system usually takes a path through the local PCI bus and the Fibre Channel switch; however, if this path fails for some reason, the operating system can cause I/O transfers to go through the SCI ring to another Quad, and through its PCI bus and Fibre Channel link to the disk. Processors perform I/O through reads and writes to PCI devices; there is no DMA support in the system. Fibre Channel may also be used to connect multiple NUMA-Q systems in a loosely coupled fashion, and to have multiple systems share disks. Finally, a management and diagnostic controller connects to a PCI bus on each Quad; these controllers are linked together and to a system console through a private local area network like Ethernet for system maintenance and diagnosis.
8.7.5 Protocol Interactions with SMP Node

The earlier discussion of the SCI protocol ignored the multiprocessor nature of the Quad node and the bus-based protocol within it that keeps the processor caches and the remote cache coherent. Now that we understand the hardware structure of the node and the IQ-Link, let us examine the interactions of the two protocols, the requirements that the interacting protocols place upon the Quad, and some particular problems raised by the use of an off-the-shelf SMP as a node.
Figure 8-30 I/O Subsystem of the Sequent NUMA-Q.
A read request illustrates some of the interactions. A read miss in a processor's second-level cache first appears on the Quad bus. In addition to being snooped by the other processor caches, it is also snooped by the OBIC. The OBIC looks up the remote cache as well as the directory state bits for locally allocated blocks, to see if the read can be satisfied within the Quad or must be propagated off-node. In the former case, main memory or one of the other caches satisfies the read and the appropriate MESI state changes occur (snoop results are reported after a fixed number, four, of bus cycles; if a controller cannot finish its snoop within this time, it asserts a stall signal for another two bus cycles, after which memory checks for the snoop result again; this continues until all snoop results are available). The Quad bus implements in-order responses to requests. However, if the OBIC detects that the request must be propagated off-node, then it must intervene. It does this by asserting a delayed reply signal, telling the bus to violate its in-order response property and proceed with other transactions, and that it will take responsibility for replying to this request. This would not have been necessary if the Quad bus implemented out-of-order replies. The OBIC then passes on the request to the SCLIC to engage the directory protocol. When the reply comes back, it will be passed from the SCLIC back to the OBIC, which will place it on the bus and complete the deferred transaction. Note that when extending any bus-based system to be the node of a larger cache-coherent machine, it is essential that the bus be split-transaction. Otherwise the bus will be held up for the entire duration of a remote transaction, not allowing even local misses to complete and not allowing incoming network transactions to be serviced (potentially causing deadlock). Writes take a similar path out of and back into a Quad. The state of the block in the remote cache, snooped by the OBIC, indicates whether the block is owned by the local Quad or must be propagated to the home through the SCLIC. Putting the node at the head of the sharing list and invalidating other nodes, if necessary, is taken care of by the SCLIC. When the SCLIC is done, it places a response on the Quad bus, which completes the operation. An interesting situation arises due to a limitation of the Quad itself. Consider a read or write miss to a locally allocated block that is cached remotely in a modified state. When the reply returns and is placed on the bus as a deferred reply, it should update the main memory. However, the Quad memory was not
implemented for deferred requests and replies, and does not update itself on seeing a deferred reply. When a deferred reply is passed down through the OBIC, the OBIC must also ensure that it updates the memory through a special action before it gives up the bus. Another limitation of the Quad is its bus protocol. If two processors in a Quad issue read-exclusive requests back to back, and the first one propagates to the SCLIC, we would like to be able to have the second one be buffered and accept the reply from the first in the appropriate state. However, the Quad bus will NACK the second request, which will then have to retry until the first one returns and it succeeds. Finally, consider serialization. Since serialization at the SCI protocol level is done at the home, incoming transactions at the home have to be serialized not only with respect to one another but also with respect to accesses by the processors in the home Quad. For example, suppose a block is in the HOME state at the home. At the SCI protocol level, this means that no remote cache in the system (which must be on some other node) has a valid copy of the block. However, this does not mean that no processor cache in the home node has a copy of the block. In fact, the directory will be in HOME state even if one of the processor caches has a dirty copy of the block. A request coming in for a locally allocated block must therefore be broadcast on the Quad bus as well, and cannot be handled entirely by the SCLIC and OBIC. Similarly, an incoming request that makes the directory state change from HOME or FRESH to GONE must be put on the Quad bus so that the copies in the processor caches can be invalidated. Since both incoming requests and local misses to data at the home all appear on the Quad bus, it is natural to let this bus be the actual serializing agent at the home. Similarly, there are serialization issues to be addressed in a requesting Quad for accesses to remotely allocated blocks. Activities within a Quad relating to remotely allocated blocks are serialized at the local SCLIC. Thus, if there are requests from local processors for a cached block and also requests coming in to the local node from the SCI interconnect for the same block, these requests are serialized at the local SCLIC. Other interactions with the node protocol will be discussed when we have better understood the implementation of the IQ-Link board components.

8.7.6 IQ-Link Implementation

Unlike the single-chip Hub in Origin, the CMOS SCLIC directory controller, the CMOS OBIC bus interface controller and the GaAs Data Pump are separate chips. The IQ-Link board also contains some SRAM and SDRAM chips for tags, state and remote cache data (see Figure 8-29 on page 577). In contrast, the HAL S1 system [WGH+97], discussed later, integrates the entire coherence machinery on a single chip, which plugs right into the Quad bus. The remote cache data are directly accessible by the OBIC. Two sets of tags are used to reduce communication between the SCLIC and the OBIC: the network-side tags for access by the SCLIC and the bus-side tags for access by the OBIC. The same is true for the directory state for locally allocated blocks. The bus-side tags and directory contain only the information that is needed for the bus snooping, and are implemented in SRAM so they can be looked up at bus speed. The network-side tags and state need more information and can be slower, so they are implemented in synchronous DRAM (SDRAM).
Thus, the bus-side local directory SRAM contains only the two bits of directory state per 64-byte block (HOME, FRESH and GONE), while the network-side directory contains the 6-bit SCI head pointer as well. The bus-side remote cache tags have only four bits of state, and do not contain the SCI forward and backward list pointers. They actually keep track of 14 states, some of which are transient states that ensure forward progress within the Quad (e.g. keeping track of blocks that are being rolled out, or of the particular bus agent that has an outstanding retry and so must get priority for that block). The network-side tags, which are part
of the directory protocol, do all this, so they contain 7 bits of state plus two 6-bit pointers per block (as well as the 13-bit cache tags themselves).

Unlike the hard-wired protocol tables in Origin, the SCLIC coherence controller in NUMA-Q is programmable. This means the protocol can be written in software or firmware rather than hard-wired into a finite state machine. Every protocol-invoking operation from a processor, as well as every incoming transaction from the network, invokes a software "handler" or task that runs on the protocol engine. These software handlers, written in microcode, may manipulate directory state, put interventions on the Quad bus, generate network transactions, and so on. The SCLIC has multiple register sets to support 12 read/write/invalidate transactions and one interrupt transaction concurrently (the SCLIC provides a bridge for routing standard Quad interrupts between Quads; some available extra bits are used to include the destination Quad number when generating an interrupt, so the SCLIC can use the standard Quad interrupt interface). A programmable engine is chosen so that the protocol can be debugged in software and corrected by simply downloading new protocol code, so that there is flexibility to experiment with or change protocols even after the machine is built and bottlenecks discovered, and so that code can be inserted into the handlers to monitor chosen events for performance debugging. Of course, a programmable protocol engine has higher occupancy per transaction than a hard-wired one, so there is a performance cost associated with this decision. Attempts are therefore made to reduce this performance impact: the protocol processor has a simple three-stage pipeline, and issues up to two instructions (a branch and another instruction) every cycle. It uses a cache to hold recently used directory state and tag information rather than accessing the directory RAMs every time, and is specialized to support the kinds of bit-field manipulation operations that are commonly needed in directory protocols, as well as useful instructions to speed up handler dispatch and management like "queue on buffer full" and "branch on queue space available". A somewhat different programmable protocol engine is used in the Stanford FLASH multiprocessor [KOH+94]. A sketch of this handler-dispatch structure appears below.

Each Pentium Pro processor can have up to four requests outstanding. The Quad bus can have eight requests outstanding at a time, like the SGI Challenge bus, and replies come in order. The OBIC can have four requests outstanding (to the SCLIC), and can buffer two incoming transactions to the Quad bus at a time. Thus, if more than four requests on the Quad bus need to go off-Quad, the OBIC will cause the bus to stall. This scenario is not very likely, though. The SCLIC can have up to eight requests outstanding and can buffer four incoming requests at a time. A simplified picture of the SCLIC is shown in Figure 8-31. Finally, the Data Pump request and reply buffers are each two entries deep outgoing to the network and four entries deep incoming. All request and reply buffers, whether incoming or outgoing, are physically separate in this implementation.

Figure 8-31 Simplified block diagram of the SCLIC chip. (Blocks: protocol processor, protocol cache, one-entry buffers to and from the Data Pump, and the link to the OBIC.)
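The dispatch structure of a programmable protocol engine of the kind described above can be pictured roughly as follows. This is a hedged, generic sketch with hypothetical queue, transaction, and handler names; the real SCLIC runs microcoded handlers with the specialized bit-field and queue-management instructions mentioned earlier.

```c
/* Generic sketch of a programmable protocol engine's dispatch loop, in the
 * spirit of the SCLIC. All names here are hypothetical. */
#include <stddef.h>

typedef struct transaction transaction_t;
typedef void (*handler_fn)(transaction_t *);

extern transaction_t *dequeue_incoming(void);        /* from bus- or network-side queues */
extern int transaction_type(const transaction_t *);  /* e.g. read request, invalidate, ack */
extern handler_fn handler_table[];                   /* one handler per transaction type */

void protocol_engine_loop(void)
{
    for (;;) {
        transaction_t *t = dequeue_incoming();        /* wait for the next transaction */
        if (t == NULL)
            continue;
        /* Each handler may look up or update directory state and tags, place an
         * intervention on the Quad bus, or generate outgoing network transactions;
         * it runs to completion before the next transaction is dispatched. */
        handler_table[transaction_type(t)](t);
    }
}
```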
All three components of the IQ-Link board also provide performance counters to enable nonintrusive measurement of various events and statistics. There are three 40-bit memory-mapped counters in the SCLIC and four in the OBIC. Each can be set in software to count any of a large number of events, such as protocol engine utilization, memory and bus utilization, queue occupancies, the occurrence of SCI command types, and the occurrence of transaction types on the P6 bus. The counters can be read by software at any time, or can be programmed to generate interrupts when they cross a predefined threshold value. The Pentium Pro processor module itself provides a number of performance counters to count first- and second-level cache misses, as well as the frequencies of request types and the occupancies of internal resources. Together with the programmable handlers, these counters can provide a wealth of information about the behavior of the machine when running workloads.

8.7.7 Performance Characteristics

The memory bus within a Quad has a peak bandwidth of 532 MB/s, and the SCI ring interconnect can transfer 500MB/s in each direction. The IQ-Link board can transfer data between these two at about 30MB/s in each direction (note that only a small fraction of the transactions appearing on the Quad bus or on the SCI ring are expected to be relevant to the other). The latency for a local read miss satisfied in main memory (or the remote cache) is expected to be about 250ns under ideal conditions. The latency for a read satisfied in remote memory in a two-Quad system is expected to be about 3.1 µs, a ratio of about 12 to 1. The latency through the Data Pump network interface for the first 18 bits of a transaction is 16ns, and then 2ns for every 18 bits thereafter. In the network itself, it takes about 26ns for the first bit to get from a Quad into the Lash box that implements the ring and out to the Data Pump of the next Quad along the ring.

The designers of the NUMA-Q have performed several experiments with microbenchmarks and with synthetic workloads that resemble database and transaction processing workloads. To obtain a flavor for the microbenchmark performance capabilities of the machine, how latencies vary under load, and the characteristics of these workloads, let us take a brief look at the results. Back-to-back read misses are found to take 600ns each and obtain a transfer bandwidth to the processor of 290 MB/s; back-to-back write misses take 525ns and sustain 185MB/s; and inbound writes from I/O devices to the local memory take 360ns at 88MB/s bandwidth, including bus arbitration. Table 8-3 shows the latencies and characteristics "under load" as seen in actual workloads running on eight Quads. The first two rows are for microbenchmarks designed to have all Quads issuing read misses that are satisfied in remote memory. The third row is for a synthetic, on-line transaction processing benchmark, the fourth for a workload that looks somewhat like the Transaction Processing Performance Council's TPC-C benchmark, and the last is for a decision support application
that issues very few remote read/write references since it is written using explicit message passing.

Table 8-3 Characteristics of microbenchmarks and workloads running on an 8-quad NUMA-Q.

                        Latency of L2 misses                Percentage of L2 misses satisfied in
Workload                All      Remotely   SCLIC           Local    Other local  Local           Remote
                                 satisfied  utilization     memory   cache        "remote cache"  node
Remote Read Misses      7580ns   8170ns     93%             1%       0%           2%              97%
Remote Write Misses     8750ns   9460ns     94%             1%       0%           2%              97%
OLTP-like               925ns    4700ns     62%             66%      20%          5%              9%
TPC-C like              875ns    6170ns     68%             73%      4%           16%             7%
Decision-support like   445ns    3120ns     9%              54.5%    40%          4%              1.5%
Remote latencies are significantly higher than the ideal for all but the message-passing decision support workload. In general, the SCI ring and protocol have higher latencies than those of more distributed networks and memory-based protocols (for example, requests that arrive late in a pending list have to wait a while to be satisfied). However, it turns out that, at least in these transaction-processing-like workloads, much of the time in a remote access is spent passing through the IQ-Link board itself, and not in the bus or in the SCI protocol. Figure 8-32 shows the breakdown of average remote latency into three components for two workloads on an 8-quad system. Clearly, the path to improved remote access performance, both under load and not under load, is to make the IQ-Link board more efficient.

Figure 8-32 Components of average remote miss latency in two workloads on an 8-quad NUMA-Q. (a) TPC-C: IQ-Link board time 65%, bus time 25%, SCI time 10%. (b) TPC-D (decision support): IQ-Link board time 61%, bus time 21%, SCI time 18%.

Opportunities being considered by the designers include redesigning the SCLIC, perhaps using two instruction sequencers instead of one, and optimizing the OBIC, with the hope of reducing the remote access latency to about 2.5 µs under heavy load in the next generation.

8.7.8 Comparison Case Study: The HAL S1 Multiprocessor

The S1 multiprocessor from HAL Computer Systems is an interesting combination of some features of the NUMA-Q and the Origin2000. Like NUMA-Q, it also uses Pentium Pro Quads as the processing nodes; however, it uses a memory-based directory protocol like that of the Origin2000 across Quads rather than the cache-based SCI protocol. Also, to reduce latency and assist occupancy, it integrates the coherence machinery more tightly with the node than the NUMA-Q does.
Instead of using separate chips for the directory protocol controller (SCLIC), bus interface controller (OBIC) and network interface (Data Pump), the S1 integrates the entire communication assist and the network interface into a single chip called the Mesh Coherence Unit (MCU). Since the memory-based protocol does not require the use of forward and backward pointers with each cache entry, there is no need for a Quad-level remote data cache except to reduce capacity misses, and the S1 does not use a remote data cache. The directory information is maintained in separate SRAM chips, but the directory storage needed is greatly reduced by maintaining information only for blocks that are in fact cached remotely, organizing the directory itself as a cache, as will be discussed in Section 8.11.1. The MCU also contains a DMA engine to support explicit message passing and to allow transfers of large chunks of data through block data transfer even in the cache-coherent case. Message passing can be implemented either through the DMA engine (preferred for large messages) or through the transfer mechanism used for cache blocks (preferred for small messages). The MCU is hard-wired instead of programmable, which reduces its occupancy for protocol processing and hence improves its performance under contention. The MCU also has substantial hardware support for performance monitoring. The only other custom chip used is the network router, which is a six-ported crossbar with 1.9 million transistors, optimized for speed. The network is clocked at 200MHz. The latency through a single router is 42 ns, and the usable per-link bandwidth is 1.6GB/sec in each direction, similar to that of the Origin2000 network. The S1 interconnect implementation scales to 32 nodes (128 processors). A major goal of integrating all the assist functionality into a single chip in S1 was to reduce remote access latency and increase remote bandwidth. From the designers' simulated measurements, the best-case unloaded latency for a read miss that is satisfied in local memory is 240ns, for a read miss to a block that is clean at a remote home is 1065 ns, and for a read miss to a block that is dirty in a third node is 1365 ns. The remote-to-local latency ratio is about 4-5, which is a little worse than on the SGI Origin2000 but better than on the NUMA-Q. The bandwidths achieved in copying a single 4KB page are 105 MB/s from local memory to local memory through processor reads and writes (limited primarily by the Quad memory controller, which has to handle both the reads and writes of memory), about 70 MB/s between local memory and a remote memory (in either direction) when done through processor reads and writes, and about 270 MB/s in either direction between local and remote memory when done through the DMA engines in the MCUs. The case of remote transfers through processor reads and writes is limited primarily by the limit on the number of outstanding memory operations from a processor. The DMA case is also better because it requires only one bus transaction at the initiating end for each memory block, rather than two split-transaction pairs in the case of processor reads and writes (one for the read and one for the write). At least in the absence of contention, the local Quad bus becomes a bandwidth bottleneck much before the interconnection network.
Having understood memory-based and cache-based protocols in some depth, let us now briefly examine some key interactions with the cost parameters of the communication architecture in determining performance.
8.8 Performance Parameters and Protocol Performance

Of the four major performance parameters in a communication architecture—overhead on the main processor, occupancy of the communication assist, network transit latency, and network bandwidth—the first is quite small on cache-coherent machines (unlike on message-passing systems, where it often dominates).
In the best case, the portion that we can call processor overhead, which cannot be hidden from the processor through overlap, is just the cost of issuing the memory operation. In the worst case, it is the cost of traversing the processor's cache hierarchy and reaching the assist. All other protocol processing actions are off-loaded to the communication assist (e.g. the Hub or the IQ-Link). As we have seen, the communication assist has many roles in protocol processing, including generating a request, looking up the directory state, and sending out and receiving invalidations and acknowledgments. The occupancy of the assist for processing a transaction not only contributes to the uncontended latency of that transaction, but can also cause contention at the assist and hence increase the cost of other transactions. This is especially true in cache-coherent machines because there is a large number of small transactions (both data-carrying transactions and others like requests, invalidations and acknowledgments), so the occupancy is incurred very frequently and not amortized very well. In fact, because messages are small, assist occupancy very often dominates the node-to-network data transfer bandwidth as the key bottleneck to throughput at the end points [HHS+95]. It is therefore very important to keep assist occupancy small. At the protocol level, it is important not to keep the assist tied up by an outstanding transaction while other transactions are available for it to process, and to reduce the amount of processing needed from the assist per transaction. For example, if the home forwards a request to a dirty node, the home assist should not be held up until the dirty node replies (which would dramatically increase its effective occupancy), but should go on to service the next transaction and deal with the reply when it comes. At the design level, it is important to specialize the assist enough that its occupancy per transaction is low.

Impact of Network Latency and Assist Occupancy

Figure 8-33 shows the impact of assist occupancy and network latency on performance, assuming an efficient memory-based directory protocol similar to the one we shall discuss for the SGI Origin2000. In the absence of contention, assist occupancy behaves just like network transit latency or any other component of cost in a transaction's path: increasing occupancy by x cycles would have the same impact as keeping occupancy constant but increasing transit latency by x cycles. Since the X-axis is the total uncontended round-trip latency for a remote read miss (including the cost of transit latency and assist occupancies incurred along the way), if there were no contention induced by increasing occupancy then all the curves for different values of occupancy would be identical. In fact, they are not, and the separation of the curves indicates the impact of the contention induced by increasing assist occupancy. The smallest value of occupancy (o) in the graphs is intended to represent that of an aggressive hard-wired controller such as the one used in the Origin2000. The least aggressive one represents placing a slow general-purpose processor on the memory bus to play the role of communication assist. The most aggressive transit latencies used represent modern high-end multiprocessor interconnects, while the least aggressive ones are closer to using system area networks like asynchronous transfer mode (ATM). We can see that for an aggressive occupancy, the latency curves take the expected 1/l shape.
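To make the cost components above concrete, the following is a minimal sketch, not taken from the text, of an uncontended read miss cost and a crude model of the contention induced at the home assist. The parameter values and the simple 1/(1 - utilization) single-server approximation are assumptions chosen purely for illustration.

#include <stdio.h>

/* Uncontended remote read miss cost: processor overhead, occupancy at the
 * requesting and home assists, two network traversals, and the memory
 * access at the home.  All values are in processor cycles (assumed). */
double uncontended_read_miss(double overhead, double occupancy,
                             double transit, double mem_access)
{
    return overhead + 2.0 * occupancy + 2.0 * transit + mem_access;
}

/* Crude contention model: treat the home assist as a single server with
 * utilization u = arrival_rate * occupancy, so its effective occupancy
 * grows as o / (1 - u).  This is only meant to show why added occupancy
 * hurts more than an equal amount of added transit latency. */
double loaded_read_miss(double overhead, double occupancy,
                        double transit, double mem_access,
                        double arrival_rate)
{
    double u = arrival_rate * occupancy;
    double occ_eff = (u < 1.0) ? occupancy / (1.0 - u) : 1e9;
    return overhead + 2.0 * occ_eff + 2.0 * transit + mem_access;
}

int main(void)
{
    /* occupancies of 7 and 56 cycles, roughly the O1 and O8 curves */
    printf("O1 unloaded: %.0f  loaded: %.0f cycles\n",
           uncontended_read_miss(10, 7, 50, 60),
           loaded_read_miss(10, 7, 50, 60, 0.01));
    printf("O8 unloaded: %.0f  loaded: %.0f cycles\n",
           uncontended_read_miss(10, 56, 50, 60),
           loaded_read_miss(10, 56, 50, 60, 0.01));
    return 0;
}

Under these assumed parameters, the higher occupancy roughly quadruples the queueing component while contributing only modestly to the unloaded latency, which is the qualitative effect the curves in Figure 8-33 display.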
The contention induced by assist occupancy has a major impact on performance for applications that stress communication throughput, especially those in which communication is bursty, and particularly with the low-latency networks used in multiprocessors. For higher occupancies, the curves almost flatten with increasing latency. The problem is most severe for applications with bursty communication, such as sorting and FFTs, since the rate of communication relative to computation during the phase that performs communication does not change much with problem size, so larger problem sizes do not help alleviate the contention during that phase.
Assist occupancy is a less severe problem for applications in which communication events are separated by computation and whose communication bandwidth demands are small (e.g. Barnes-Hut in the figure). When latency tolerance techniques are used (see Chapter 11), bandwidth is stressed even further, so the impact of assist occupancy is much greater even at higher transit latencies [HHS+95]. These data show that it is very important to keep assist occupancy low in machines that communicate and maintain coherence at a fine granularity such as that of cache blocks. The impact of contention due to assist occupancy tends to increase with the number of processors used to solve a given problem, since the communication-to-computation ratio tends to increase.
(a) FFT    (b) Ocean
Figure 8-33 Impact of assist occupancy and network latency on the performance of memory-based cache coherence protocols. The Y-axis is the parallel efficiency, which is the speedup over a sequential execution divided by the number of processors used (1 is ideal speedup). The X-axis is the uncontended round-trip latency of a read miss that is satisfied in main memory at the home, including all components of cost (occupancy, transit latency, time in buffers, and network bandwidth). Each curve is for a different value of assist occupancy (o), while along a curve the only parameter that varies is the network transit latency (l). The lowest occupancy assumed is 7 processor cycles, which is labelled O1. O2 corresponds to twice that occupancy (14 processor cycles), and so on. All other costs, such as the time to propagate through the cache hierarchy and through buffers, and the node-to-network bandwidth, are held constant. The graphs are for simulated 64-processor executions. The main conclusion is that the contention induced by assist occupancy is very important to performance, especially in low-latency networks.
Effects of Assist Occupancy on Protocol Tradeoffs

The occupancy of the assist has an impact not only on the performance of a given protocol but also on the tradeoffs among protocols. We have seen that cache-based protocols can have higher latency on write operations than memory-based protocols, since the transactions needed to invalidate sharers are serialized. The SCI cache-based protocol also tends to have more protocol processing to do on a given memory operation than a memory-based protocol, so the effective occupancy of its assist tends to be significantly higher. Combined with the higher latency on writes, this would tend to cause memory-based protocols to perform better, and the difference between the performance of the protocols will become greater as assist occupancy increases. On the other hand, the protocol processing occupancy for a given memory operation in SCI is distributed over more nodes and assists, so depending on the communication patterns of the application it may experience less contention at a given assist. For example, when hot-spotting becomes a problem due to bursty irregular communication in memory-based protocols (as in Radix sorting), it may be somewhat alleviated in SCI. SCI may also be more resilient to contention caused by data distribution problems or by less optimized programs.
How these tradeoffs actually play out in practice will depend on the characteristics of real programs and machines, although overall we might expect memory-based protocols to perform better in the most optimized cases.

Improving Performance Parameters in Hardware

There are many ways to use more aggressive, specialized hardware to improve performance characteristics such as latency, occupancy and bandwidth. Some notable techniques include the following. First, an SRAM directory cache may be placed close to the assist to reduce directory lookup cost, as is done in the Stanford FLASH multiprocessor [KOH+94]. Second, a single bit of SRAM can be maintained per memory block at the home, to keep track of whether or not the block is in clean state in the local memory. If it is, then on a read miss to a locally allocated block there is no need to invoke the communication assist at all. Third, if the assist occupancy is high, the assist can be pipelined into stages of protocol processing (e.g. decoding a request, looking up the directory, generating a response), or its occupancy can be overlapped with other actions. The former technique reduces contention but not the uncontended latency of memory operations, while the latter does the reverse. Critical-path latency for an operation can also be reduced by having the assist generate and send out a response or a forwarded request even before all the cleanup it needs to do is done. This concludes our discussion of directory-based cache coherence protocols. Let us now look at synchronization on scalable cache-coherent machines.
8.9 Synchronization

Software algorithms for synchronization on scalable shared address space systems using atomic exchange instructions or LL-SC have already been discussed in Chapter 7 (Section 7.10). Recall that the major focus of these algorithms, compared to those for bus-based machines, was to exploit the parallelism of paths in the interconnect and to ensure that processors spin on local rather than nonlocal variables. The same algorithms are applicable to scalable cache-coherent machines. However, there are two differences. On one hand, the performance implications of spinning on remotely allocated variables are likely to be much less significant, since a processor caches the variable and then spins on it locally until it is invalidated. Having processors spin on different variables rather than the same one, as achieved in the tree barrier, is of course useful so that not all processors rush out to the same home memory when the variable is written and invalidated, reducing contention. However, there is only one (very unlikely) reason that it may actually be very important to performance that the variable a processor spins on is allocated locally: if the cache is unified and direct-mapped and the instructions for the spin loop conflict with the variable itself, in which case the resulting conflict misses can at least be satisfied locally. A more minor benefit of good placement is converting the misses that occur after invalidation from 3-hop misses into 2-hop misses. On the other hand, implementing atomic primitives and LL-SC is more interesting when they interact with a coherence protocol. This section examines these two aspects, first comparing the performance of the different synchronization algorithms for locks and barriers that were described in Chapters 5 and 7 on the SGI Origin2000, and then discussing some new implementation issues for atomic primitives beyond those already encountered on bus-based machines.
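As a small illustration of why spinning on a cached copy already keeps the spin local, here is a minimal sketch, not from the text, of per-processor release flags of the kind the tree barrier uses. NPROCS, the flag layout, and the use of volatile are assumptions.

#define NPROCS 64

/* One release flag per waiting processor, as in the tree barrier mentioned
 * above.  In practice each flag would be padded to its own cache block to
 * avoid false sharing; that padding is omitted here for brevity. */
volatile int release_flag[NPROCS];

void wait_for_release(int myid)
{
    /* The first read misses and caches the flag; subsequent iterations hit
     * in the local cache until the releasing write invalidates the copy, so
     * the spin generates no network traffic even if the flag's home memory
     * is remote. */
    while (release_flag[myid] == 0)
        ;
    release_flag[myid] = 0;        /* reset for the next barrier episode */
}

void release_waiter(int id)
{
    release_flag[id] = 1;          /* invalidates id's cached copy; its next
                                      read misses and observes the new value */
}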
8.9.1 Performance of Synchronization Algorithms

The experiments we use to illustrate synchronization performance are the same as those we used on the bus-based SGI Challenge in Section 5.6, again using LL-SC as the primitive for constructing atomic operations. The delays used are the same in processor cycles, and therefore different in actual microseconds. The results for the lock algorithms described in Chapters 5 and 7 are shown in Figure 8-34 for 16-processor executions. Here again, we use three different sets of values for the delays within and after the critical section for which processors repeatedly contend.
Time (microsec) vs. number of processors for the LL-SC, LL-SC with exponential backoff, ticket, ticket with proportional backoff, array-based, and queuing locks: (a) null (c=0, d=0); (b) critical-section (c=3.64 µs, d=0); (c) delay (c=3.64 µs, d=1.29 µs).
Figure 8-34 Performance of locks on the SGI Origin2000, for three different scenarios.
Here too, until we use delays between critical sections, the simple locks behave unfairly and yield higher throughput. Exponential backoff helps the LL-SC lock when there is a null critical section, since this is the case where there is significant contention to be alleviated. With null critical sections the ticket lock itself scales quite poorly, as before, but scales very well when proportional backoff is used. The array-based lock also scales very well. With coherent caches, the better placement of lock variables in main memory afforded by the software queuing lock is not particularly useful, and in fact the queuing lock incurs contention on its compare&swap operations (implemented with LL-SC) and scales worse than the array lock. If we force the simple locks to behave fairly, they behave much like the ticket lock without proportional backoff. If we use a non-null critical section and a delay between lock accesses (Figure 8-34(c)), all locks behave fairly. Now the simple LL-SC locks no longer have their unfair advantage, and their scaling disadvantage shows through. The array-based lock, the queuing lock, and the ticket lock with proportional backoff all scale very well. The better data placement of the queuing lock does not matter, but neither is the contention any worse for it now. Overall, we can conclude that the array-based lock and the ticket lock perform well and robustly for scalable cache-coherent machines, at least when implemented with LL-SC, and that the simple LL-SC lock with exponential backoff performs best when there is no delay between an unlock and the next lock, due to repeated unfair successful acquisition by a processor whose cache already holds the lock. The most sophisticated lock, the queuing lock, is unnecessary but also performs well when there are delays between unlock and lock.
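The following is a minimal sketch, not from the text, of the ticket lock with proportional backoff discussed above. C11 atomics stand in for a fetch&increment built from LL-SC, and the backoff constant is an arbitrary assumption.

#include <stdatomic.h>

#define BACKOFF_UNIT 100       /* assumed cycles per waiter ahead of us */

typedef struct {
    atomic_uint next_ticket;   /* incremented once by each arriving acquirer    */
    atomic_uint now_serving;   /* ticket number currently allowed into the lock */
} ticket_lock_t;

static void delay(unsigned n)
{
    for (volatile unsigned i = 0; i < n; i++)
        ;                      /* crude spin delay */
}

void ticket_acquire(ticket_lock_t *l)
{
    unsigned my = atomic_fetch_add(&l->next_ticket, 1);
    for (;;) {
        unsigned serving = atomic_load(&l->now_serving);
        if (serving == my)
            return;
        /* Proportional backoff: wait in proportion to the number of waiters
         * ahead of us, so now_serving is re-read (and missed on) roughly once
         * per lock transfer instead of continuously. */
        delay(BACKOFF_UNIT * (my - serving));
    }
}

void ticket_release(ticket_lock_t *l)
{
    /* Only the lock holder updates now_serving, so a plain increment of the
     * loaded value suffices here. */
    atomic_store(&l->now_serving, atomic_load(&l->now_serving) + 1u);
}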
8.9.2 Supporting Atomic Primitives

Consider implementing atomic exchange (read-modify-write) primitives like test&set performed on a memory location. What matters here is that a conflicting write to that location by another processor must occur either before the read component of the read-modify-write operation or after its write component. As we discussed for bus-based machines in Chapter 5 (Section 5.6.3), the read component may be allowed to complete as soon as the write component is serialized with respect to other writes, as long as we ensure that no incoming invalidations are applied to the block until the read has completed. If the read-modify-write is implemented at the processor (cacheable primitives), this means that the read can complete once the write has obtained ownership, and even before invalidation acknowledgments have returned. Atomic operations can also be implemented at the memory, but it is easier to do this if we disallow the block from being cached in dirty state by any processor. Then, all writes go to memory, and the read-modify-write can be serialized with respect to other writes as soon as it gets to memory. Memory can send a response to the read component in parallel with sending out invalidations corresponding to the write component. Implementing LL-SC requires all the same considerations to avoid livelock as it did for bus-based machines, plus one further complication. Recall that an SC should not send out invalidations or updates if it fails, since otherwise two processors may keep invalidating or updating each other and failing, causing livelock. To detect failure, the requesting processor needs to determine if some other processor's write to the block has happened before the SC. In a bus-based system, the cache controller can do this by checking upon an SC whether the cache no longer has a valid copy of the block, or whether there are incoming invalidations/updates for the block that have already appeared on the bus. The latter detection, which basically checks the serialization order, cannot be done locally by the cache controller with a distributed interconnect. In an invalidation-based protocol, if the block is still in valid state in the cache then the read-exclusive request corresponding to the SC goes to the directory at the home. There, it checks to see if the requestor is still on the sharing list. If it isn't, then the directory knows that another conflicting write has been serialized before the SC, so it does not send out invalidations corresponding to the SC and the SC fails. In an update protocol, this is more difficult, since even if another write has been serialized before the SC the SC requestor will still be on the sharing list. One solution [Gha95] is to again use a two-phase protocol. When the SC reaches the directory, it locks down the entry for that block so no other requests can access it. Then, the directory sends a message back to the SC requestor, which upon receipt checks to see if the lock flag for the LL-SC has been cleared (by an update that arrived between the time the SC request was sent out and now). If so, the SC has failed, and a message is sent back to the directory to this effect (and to unlock the directory entry). If not, then as long as there is point-to-point order in the network we can conclude that no write beat the SC to the directory, so the SC should succeed. The requestor sends this acknowledgment back to the directory, which unlocks the directory entry and sends out the updates corresponding to the SC, and the SC succeeds.
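To connect this to the software side, here is a minimal sketch, not from the text, of how test&set and a simple lock are typically built from LL-SC. load_linked and store_conditional are hypothetical stand-ins for the processor's LL and SC instructions.

/* Hypothetical stand-ins for the LL and SC instructions; SC returns nonzero
 * on success and, as discussed above, must not send invalidations or updates
 * when it fails. */
extern int load_linked(volatile int *addr);
extern int store_conditional(volatile int *addr, int value);

/* test&set built from LL-SC: returns the old value (0 means we got the lock). */
int test_and_set(volatile int *lock)
{
    int old;
    do {
        old = load_linked(lock);
        if (old != 0)
            return old;                      /* held: give up without an SC, so
                                                no ownership request is issued */
    } while (!store_conditional(lock, 1));   /* retry if another write intervened */
    return old;
}

void lock_acquire(volatile int *lock)
{
    while (test_and_set(lock) != 0)
        while (*lock != 0)
            ;                                /* spin on the cached copy */
}

void lock_release(volatile int *lock)
{
    *lock = 0;
}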
8.10 Implications for Parallel Software

What distinguishes the coherent shared address space systems described in this chapter from those described in Chapters 5 and 6 is that they all have physically distributed rather than centralized main memory. Distributed memory is at once an opportunity to improve performance through data locality and a burden on software to exploit this locality. As we saw in Chapter 3, in
the cache-coherent machines with physically distributed memory (or CC-NUMA machines) discussed in this chapter, parallel programs may need to be aware of physically distributed memory, particularly when their important working sets don't fit in the cache. Artifactual communication occurs when data are not allocated in the memory of a node that incurs capacity, conflict or cold misses on them (it can also occur due to false sharing or fragmentation of communication, which occur at cache block granularity). Consider also a multiprogrammed workload, consisting even of sequential programs, in which application processes are migrated among processing nodes for load balancing. Migrating a process will turn what should be local misses into remote misses, unless the system moves all the migrated process's data to the new node's main memory as well. In the CC-NUMA machines discussed in this chapter, main memory management is typically done at the fairly large granularity of pages. The large granularity can make it difficult to distribute shared data structures appropriately, since data that should be allocated on two different nodes may fall on the same unit of allocation. The operating system may migrate pages to the nodes that incur cache misses on them most often, using information obtained from hardware counters, or the runtime system of a programming language may do this based on user-supplied hints or compiler analysis. More commonly today, the programmer may direct the operating system to place pages in the memories closest to particular processes. This may be as simple as providing these directives to the system (such as "place the pages in this range of virtual addresses in process X's local memory"), or it may involve padding and aligning data structures to page boundaries so they can be placed properly, or it may even require that data structures be organized differently to allow such placement at page granularity. We saw examples of this in using four-dimensional instead of two-dimensional arrays in the equation solver kernel, in Ocean and in LU. Simple, regular cases like these may also be handled by sophisticated compilers. In Barnes-Hut, on the other hand, proper placement would require a significant reorganization of data structures as well as code. Instead of having a single linear array of particles (or cells), each process would have an array or list of its own assigned particles that it could allocate in its local memory, and between time-steps particles that were reassigned would be moved from one array/list to another. However, as we have seen, data placement is not very useful for this application, and may even hurt performance due to its high costs. It is important that we understand the costs and potential benefits of data migration. Similar issues hold for software-controlled replication, and the next chapter will discuss alternative approaches to coherent replication and migration in main memory. One of the most difficult and insidious problems for a programmer to deal with in a coherent shared address space is contention. Not only is data traffic implicit and often unpredictable, but contention can also be caused by "invisible" protocol transactions such as the ownership requests mentioned above, invalidations and acknowledgments, which a programmer is not inclined to think about at all.
All of these types of transactions occupy the protocol processing portion of the communication assist, so it is very important to keep the occupancy of the assist per transaction very low to contain end-point contention. Invisible protocol messages and contention make performance problems like false sharing all the more important for a programmer to avoid, particularly when they cause a lot of protocol transactions to be directed toward the same node. For example, we are often tempted to structure some kinds of data as an array with one entry per process. If the entries are smaller than a page, several of them will fall on the same page. If these array entries are not padded to avoid false sharing, or if they incur conflict misses in the cache, all the misses and traffic will be directed at the home of that page, causing a lot of contention. If we do structure data this way, then in a distributed-memory machine it is advantageous not only to structure multiple such arrays as an array of records, as we did to avoid false sharing in Chapter 5, but also to pad and align the records to a page and place the pages in the appropriate local memories.
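A minimal sketch, not from the text, of the padding and alignment just described; PAGE_SIZE, the field names, and the (system-specific) placement step are assumptions.

#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE 4096        /* assumed page size */
#define NPROCS      64

typedef struct {
    long   hits;
    long   misses;
    double partial_sum;
    char   pad[PAGE_SIZE - 2 * sizeof(long) - sizeof(double)];
} per_proc_t;                 /* exactly one page per process */

per_proc_t *alloc_per_proc_array(void)
{
    void *p = NULL;
    /* Page alignment keeps each record on its own page, so the misses and
     * coherence traffic of different processes have different homes. */
    if (posix_memalign(&p, PAGE_SIZE, NPROCS * sizeof(per_proc_t)) != 0)
        return NULL;
    memset(p, 0, NPROCS * sizeof(per_proc_t));
    /* A first-touch policy, or an explicit system-specific placement
     * directive (not shown), would then put page i in the local memory of
     * the process that uses record i. */
    return (per_proc_t *)p;
}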
An interesting example of how contention can lead to different orchestration strategies in message passing and shared address space systems is provided by a high-performance parallel Fast Fourier Transform. Conceptually, the computation is structured in phases: phases of local computation are separated by phases of communication, which involve the transposition of a matrix. A process reads columns from a source matrix and writes them into its assigned rows of a destination matrix, and then performs local computation on its assigned rows of the destination matrix. In a message passing system, it is important to coalesce data into large messages, so it is necessary for performance to structure the communication this way (as a separate phase apart from computation). However, in a cache-coherent shared address space there are two differences. First, transfers are always done at cache block granularity. Second, each fine-grained transfer involves invalidations and acknowledgments (each local block that a process writes is likely to be in shared state in the cache of another processor from a previous phase, so it must be invalidated), which cause contention at the coherence controller. It may therefore be preferable to perform the communication on demand at fine grain while the computation is in progress, thus staggering out the communication and easing the contention on the controller. A process that computes using a row of the destination matrix can read the words of the corresponding source matrix column from a remote node on demand while it is computing, without having to do a separate transpose phase. Finally, synchronization can be expensive in scalable systems, so programs should make a special effort to reduce the frequency of high-contention locks or global barrier synchronization.
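The two orchestrations can be contrasted with a minimal sketch, not from the text; the matrix names, sizes, and compute_row routine are assumptions.

#define N 1024

/* Assumed shared matrices and per-row computation. */
extern double src[N][N], dst[N][N];
extern void compute_row(double *row, int n);

/* (1) Separate transpose phase: all remote reads, and the writes that
 * invalidate remotely cached copies, happen in one burst before any
 * computation, concentrating traffic at the coherence controllers. */
void transpose_then_compute(int first_row, int last_row)
{
    for (int r = first_row; r <= last_row; r++)
        for (int c = 0; c < N; c++)
            dst[r][c] = src[c][r];
    for (int r = first_row; r <= last_row; r++)
        compute_row(dst[r], N);
}

/* (2) On-demand version: each assigned row's source column is read just
 * before that row is computed, staggering the communication (and the
 * associated invalidations and acknowledgments) across the whole phase. */
void compute_on_demand(int first_row, int last_row)
{
    double row[N];
    for (int r = first_row; r <= last_row; r++) {
        for (int c = 0; c < N; c++)
            row[c] = src[c][r];      /* fine-grained remote reads, on demand */
        compute_row(row, N);
        for (int c = 0; c < N; c++)
            dst[r][c] = row[c];
    }
}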
8.11 Advanced Topics

Before concluding the chapter, this section covers two additional topics. The first is the actual techniques used to reduce directory storage overhead in flat, memory-based schemes, as promised in Section 8.3.3. The second is techniques for hierarchical coherence, both snooping- and directory-based.

8.11.1 Reducing Directory Storage Overhead

In the discussion of flat, memory-based directories in Section 8.3.3, it was stated that the size or width of a directory entry can be reduced by using a limited number of pointers rather than a full bit-vector, and that doing this requires some overflow mechanism when the number of copies of the block exceeds the number of available pointers. Based on the empirical data about sharing patterns, the number of pointers likely to be provided in limited pointer directories is very small, so it is important that the overflow mechanism be efficient. This section first discusses some possible overflow methods for limited pointer directories. It then examines techniques to reduce directory "height", by not having a directory entry for every memory block in the system but rather organizing the directory as a cache. The overflow methods include schemes called broadcast, no broadcast, coarse vector, software overflow, and dynamic pointers.

Broadcast (DiriB)

The overflow strategy in the DiriB scheme [ASH+88] is to set a broadcast bit in the directory entry when the number of available pointers (indicated by the subscript i in DiriB) is exceeded. When that block is written again, invalidation messages are sent to all nodes, regardless of whether or not they were caching the block. It is not semantically incorrect, but simply wasteful, to send an invalidation message to a processor not caching the block. Network bandwidth may be wasted, and latency stalls may be increased if the processor performing the write must wait for acknowledgments before proceeding. However, the method is very simple.
No broadcast (DiriNB)

The DiriNB scheme [ASH+88] avoids broadcast by never allowing the number of valid copies of a block to exceed i. Whenever the number of sharers is i and another node requests a read-shared copy of the block, the protocol invalidates the copy in one of the existing sharers and frees up that pointer in the directory entry for the new requestor. A major drawback of this scheme is that it does not deal well with data that are read by many processors during a period, even though the data do not have to be invalidated. For example, a piece of code that is running on all the processors, or a small table of precomputed values shared by all processors, will not be able to reside in all the processors' caches at the same time, and a continual stream of misses will be generated. While special provisions can be made for blocks containing code (for example, their consistency may be managed by software instead of hardware), it is not clear how to handle widely shared read-mostly data well in this scheme.

Coarse vector (DiriCVr)

The DiriCVr scheme [GWM90] also uses i pointers in its initial representation, but overflows to a coarse bit vector scheme like the one used by the Origin2000. In this representation, each bit of the directory entry is used to point to a unique group of nodes in the machine (the subscript r in DiriCVr indicates the size of the group), and that bit is turned ON whenever any node in that group is caching the block. When a processor writes the block, all nodes in the groups whose bits are turned ON are sent an invalidation message, regardless of whether they have actually accessed or are caching the block.
(a) No overflow: overflow bit 0, two 4-bit pointers. (b) Overflow: overflow bit 1, 8-bit coarse vector over nodes P0 to P15.
Figure 8-35 The change in representation in going from limited pointer representation to coarse vector representation on overflow. Upon overflow, the two 4-bit pointers (for a 16-node system) are viewed as an eight-bit coarse vector, each bit corresponding to a group of two nodes. The overflow bit is also set, so the nature of the representation can be easily determined. The dotted lines in (b) indicate the correspondence between bits and node groups.
As an example, consider a 256-node machine for which we store eight pointers in the directory entry. Since each pointer needs to be 8 bits wide, there are 64 bits available for the coarse vector on overflow. Thus we can implement a Dir8CV4 scheme, with each coarse-vector bit pointing to a group of four nodes. A single bit per entry keeps track of whether the current representation is the normal limited pointers or a coarse vector.
As shown in Figure 8-36, an advantage of a scheme like DiriCVr (and several of the following) over DiriB and DiriNB is that its behavior is more robust to different sharing patterns.
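A minimal sketch, not from the text, of a directory entry that starts with limited pointers and overflows to a coarse vector, using the 16-node, two-pointer, two-nodes-per-bit organization of Figure 8-35; field widths and function names are assumptions.

#include <stdint.h>

#define NODES        16
#define NUM_PTRS      2            /* i: pointers before overflow        */
#define GROUP_SIZE    2            /* r: nodes represented by each bit   */
#define VECTOR_BITS  (NODES / GROUP_SIZE)

typedef struct {
    unsigned overflow : 1;         /* 0: limited pointers, 1: coarse vector */
    unsigned nptrs    : 2;         /* pointers in use when not overflowed   */
    uint8_t  ptr[NUM_PTRS];        /* node ids of the recorded sharers      */
    uint8_t  coarse;               /* one bit per group of GROUP_SIZE nodes */
} dir_entry_t;

/* Record a new sharer (assumed not already recorded), switching to the
 * coarse vector representation on overflow. */
void add_sharer(dir_entry_t *e, int node)
{
    if (!e->overflow) {
        if (e->nptrs < NUM_PTRS) {
            e->ptr[e->nptrs++] = (uint8_t)node;
            return;
        }
        /* Overflow: fold the existing pointers into the coarse vector. */
        e->coarse = 0;
        for (int k = 0; k < NUM_PTRS; k++)
            e->coarse |= (uint8_t)(1u << (e->ptr[k] / GROUP_SIZE));
        e->overflow = 1;
    }
    e->coarse |= (uint8_t)(1u << (node / GROUP_SIZE));
}

/* On a write, return a bitmask of nodes to invalidate.  The imprecision of
 * the coarse vector shows up here: whole groups are targeted whether or not
 * every node in the group is actually caching the block. */
uint32_t invalidation_targets(const dir_entry_t *e)
{
    uint32_t mask = 0;
    if (!e->overflow) {
        for (unsigned k = 0; k < e->nptrs; k++)
            mask |= 1u << e->ptr[k];
    } else {
        for (int b = 0; b < VECTOR_BITS; b++)
            if (e->coarse & (1u << b))
                for (int j = 0; j < GROUP_SIZE; j++)
                    mask |= 1u << (b * GROUP_SIZE + j);
    }
    return mask;
}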
Figure 8-36 Comparison of the invalidation traffic generated by the Dir4B, Dir4NB, and Dir4CV4 schemes with that of the full-bit-vector scheme.
The results are taken from [Web93], so the simulation parameters are different from those used in this book. The number of processors (one per node) is 64. The data for the LocusRoute application, which has data that are written with some frequency and read by many nodes, show the potential pitfalls of the DiriB scheme. Cholesky and Barnes-Hut, which have data that are shared by large numbers of processors (e.g., nodes close to the root of the tree in Barnes-Hut), show the potential pitfalls of the DiriNB scheme. The DiriCVr scheme is found to be reasonably robust.
Software Overflow (DiriSW)

This scheme is different from the previous ones in that it does not throw away the precise caching status of a block when overflow occurs. Rather, the current i pointers and a pointer to the new sharer are saved into a special portion of the node's local main memory by software. This frees up space for new pointers, so i new sharers can be handled by hardware before software must be invoked to store pointers away into memory again. The overflow also causes an overflow bit to be set in hardware. This bit ensures that when a subsequent write is encountered, the pointers that were stored away in memory will be read out and invalidation messages sent to those nodes as well. In the absence of a very sophisticated (programmable) communication assist, the overflow situations (both when pointers must be stored into memory and when they must be read out and invalidations sent) are handled by software running on the main processor, so the processor must be interrupted upon these events. The advantages of this scheme are that precise information is kept about sharers even upon overflow, so there is no extra traffic generated, and the overflow handling is managed by software. The major overhead is the cost of the interrupts and software processing. This disadvantage takes three forms: (i) the processor at the home of the block spends time handling the interrupt instead of performing the user's computation, (ii) the overhead of handling these requests is large, so the home can become a contention bottleneck if many requests arrive at about the same time, and (iii) the requesting processor may stall longer, due to the higher latency of the requests that cause an interrupt as well as the increased contention.1
Example 8-3
Software overflow for limited pointer directories was used in the MIT Alewife research prototype [ABC+95] and called the “LimitLESS” scheme [ALK+91]. The
Alewife machine is designed to scale to 512 processors. Each directory entry is 64 bits wide. It contains five 9-bit pointers to record remote processors caching the block, and one dedicated bit to indicate whether the local processor is also caching the block (thus saving 8 bits when this is true). Overflow pointers are stored in a hash table in the main memory. The main processor in Alewife has hardware support for multithreading (see Chapter 11), with support for fast handling of traps upon overflow. Nonetheless, while the latency of a request that causes 5 invalidations and can be handled in hardware is only 84 cycles, a request requiring 6 invalidations and hence software intervention takes 707 cycles.

Dynamic pointers (DiriDP)

The DiriDP scheme [SiH91] is a variation of the DiriSW scheme. In this scheme, in addition to the i hardware pointers, each directory entry contains a pointer into a special portion of the local node's main memory. This special memory has a free list associated with it, from which pointer structures can be dynamically allocated to processors as needed. The key difference from DiriSW is that all linked-list manipulation is done by a special-purpose protocol hardware/processor rather than by the general-purpose processor of the local node. As a result, the overhead of manipulating the linked lists is small. Because it also contains a pointer to memory, the number of hardware pointers i used in this scheme is typically very small (1 or 2). The DiriDP scheme is being used as the default directory organization for the Stanford FLASH multiprocessor [KOH+94]. Among these many alternative schemes for maintaining directory information in a memory-based protocol, it is quite clear that the DiriB and DiriNB schemes are not very robust to different sharing patterns. However, the actual performance (and cost/performance) tradeoffs among the others are not very well understood for real applications on large-scale machines. The general consensus seems to be that full bit vectors are appropriate for machines with a moderate number of processing nodes, while several among the last few schemes above ought to suffice for larger systems. In addition to reducing directory entry width, an orthogonal way to reduce directory memory overhead is to reduce the total number of directory entries used by not using one per memory block [GWM90, OKN90]; that is, to go after the M term in the P*M expression for directory memory overhead. Since the two methods of reducing overhead are orthogonal, they can be traded off against each other: reducing the number of entries allows us to make entries wider (use more hardware pointers) without increasing cost, and vice versa. The observation that motivates the use of fewer directory entries is that the total amount of cache memory is much less than the total main memory in the machine. This means that only a very small fraction of the memory blocks will be cached at a given time. For example, each processing node may have a 1 Mbyte cache and 64 Mbytes of main memory associated with it. If there were one directory entry per memory block, then across the whole machine 63/64, or 98.5%, of the directory entries would correspond to memory blocks that are not cached anywhere in the machine. That is an awful lot of directory entries lying idle with no bits turned ON, especially if replacement hints are used. This waste of memory can be avoided by organizing the directory as a cache, and allocating the entries in it to memory blocks dynamically, just as cache blocks are allocated to memory blocks containing program data.
1. It is actually possible to reply to the requestor before the trap is handled, and thus not affect the latency seen by it. However, that simply means that the next processor’s request is delayed and that processor may experience a stall.
In fact, if the number of entries in this directory cache is small enough, it may enable us to use fast SRAMs instead of slower DRAMs for directories, thus reducing the access time to directory information. This access time is in the critical path that determines the latency seen by the processor for many types of memory references. Such a directory organization is called a sparse directory, for obvious reasons. While a sparse directory operates quite like a regular processor cache, there are some significant differences. First, there is no need for a backing store for this cache: when an entry is replaced, if any node's bits in it are turned ON then we can simply send invalidations or flush messages to those nodes. Second, there is only one directory entry per cache entry, so spatial locality is not an issue. Third, a sparse directory handles references from potentially all processors, whereas a processor cache is accessed only by the processor(s) attached to it. And finally, the reference stream the sparse directory sees is heavily filtered, consisting of only those references that were not satisfied in the processor caches. For a sparse directory not to become a bottleneck, it is essential that it be large enough and have enough associativity to avoid incurring too many replacements of actively accessed blocks. Some experiments and analysis studying the sizing of the directory cache can be found in [Web93].

8.11.2 Hierarchical Coherence

In the introduction to this chapter, it was mentioned that one way to build scalable coherent machines is to hierarchically extend the snoopy coherence protocols based on buses and rings that were discussed in Chapters 5 and 6, respectively. We have also been introduced to hierarchical directory schemes in this chapter. This section describes hierarchical approaches to coherence a little further. While hierarchical ring-based snooping has been used in commercial systems (e.g. in the Kendall Square Research KSR-1 [FBR93]) as well as research prototypes (e.g. the University of Toronto's Hector system [VSL+91, FVS92]), and hierarchical directories have been studied in academic research, these approaches have not gained much favor. Nonetheless, building large systems hierarchically out of smaller ones is an attractive abstraction, and it is useful to understand the basic techniques.

Hierarchical Snooping

The issues in hierarchical snooping are similar for buses and rings, so we study them mainly through the former. A bus hierarchy is a tree of buses. The leaves are bus-based multiprocessors that contain the processors. The buses that constitute the internal nodes don't contain processors, but are used for interconnection and coherence control: they allow transactions to be snooped and propagated up and down the hierarchy as necessary. Hierarchical machines can be built with main memory centralized at the root or distributed among the leaf multiprocessors (see Figure 8-37). While a centralized main memory may simplify programming, distributed memory has advantages both in bandwidth and potentially in performance (note, however, that if data are not distributed such that most cache misses are satisfied locally, remote data may actually be further away than the root in the worst case, potentially leading to worse performance).
Also, with distributed memory, a leaf in the hierarchy is a complete bus-based multiprocessor, which is already a commodity product with cost advantages. Let us focus on hierarchies with distributed memory, leaving centralized memory hierarchies to the exercises.
The processor caches within a leaf node (multiprocessor) are kept coherent by any of the snoopy protocols discussed in Chapter 5.
(a) Centralized memory: processor/cache pairs on B1 buses, a coherence monitor per node, and main memory on the B2 bus. (b) Distributed memory: one node with P0 and P1 and another with P2 and P3, each with caches, a coherence monitor, and local main memory on its B1 bus, connected by the B2 bus; location A is cached in shared state by P0, P2, and P3.
Figure 8-37 Hierarchical bus-based multiprocessors, shown with a two-level hierarchy.

In a simple, two-level hierarchy, we connect several of these bus-based systems together using another bus (B2) that has the main memory attached to it (the extension to multilevel hierarchies is straightforward). What we need is a coherence monitor associated with each B1 bus that monitors (snoops) the transactions on both buses and decides which transactions on its B1 bus should be forwarded to the B2 bus and which ones that appear on the B2 bus should be forwarded to its B1 bus. This device acts as a filter, forwarding only the necessary transactions in both directions, and thus reduces the bandwidth demands on the buses. In a system with distributed memory, the coherence monitor for a node has to worry about two types of data for which transactions may appear on either the B1 or B2 bus: data that are allocated nonlocally but cached by some processor in the local node, and data that are allocated locally but cached remotely. To watch for the former on the B2 bus, a remote access cache per node can be used, as in the Sequent NUMA-Q. This cache maintains inclusion (see Section 6.4.1) with regard to remote data cached in any of the processor caches on that node, including a dirty-but-stale bit indicating when a processor cache in the node has the block dirty. Thus, it has enough information to determine which transactions are relevant in each direction and pass them along. For locally allocated data, bus transactions can be handled entirely by the local memory or caches, except when the data are cached by processors in other (remote) nodes. For these data, there is no need to keep the data themselves in the coherence monitor, since either the valid data are already available locally or they are in modified state remotely; in fact, we would not want to, since the amount of such data may be as large as the local memory. However, the monitor keeps state information for locally allocated blocks that are cached remotely, and snoops the local B1 bus so relevant transactions for these data can be forwarded to the B2 bus if necessary. Let's call this part of the coherence monitor the local state monitor. Finally, the coherence monitor also watches the B2 bus for transactions to its local addresses, and passes them on to the local B1 bus unless the local state monitor says they are cached remotely in a modified state. Both the remote cache and the local state part of the monitor are looked up on B1 and B2 bus transactions. Consider the coherence protocol issues (a) through (c) that were raised in Section 8.2. (a) Information about the state in other nodes (subtrees) is implicitly available, to a sufficient extent, in the local coherence monitor (remote cache and local state monitor); (b) if this information indicates a need to find other copies beyond the local node, the request is broadcast on the next bus (and so on hierarchically in deeper hierarchies), and other relevant monitors will respond; and (c) communication with the other copies is performed simultaneously as part of finding them through the hierarchical broadcasts on buses.
More specifically, let us consider the path of a read miss, assuming a shared physical address space. A BusRd request appears on the local B1 bus. If the remote access cache, the local memory or another local cache has a valid copy of the block, it will supply the data. Otherwise, either the remote cache or the local state monitor will know to pass the request on to the B2 bus. When the request appears on B2, the coherence monitors of other nodes will snoop it. If a node's local state monitor determines that a valid copy of the data exists in the local node, it will pass the request on to the B1 bus, wait for the reply, and put it back on the B2 bus. If a node's remote cache contains the data, then if it has it in shared state it will simply reply to the request itself; if in dirty state it will reply and also broadcast a read on the local bus to have the dirty processor cache downgrade the block to shared; and if dirty-but-stale, it will simply broadcast the request on the local bus and reply with the result. In the last case, the local cache that has the data dirty will change its state from dirty to shared and put the data on the B1 bus. The remote cache will accept the data reply from the B1 bus, change its state from dirty-but-stale to shared, and pass the reply on to the B2 bus. When the data response appears on B2, the requestor's coherence monitor picks it up, installs it and changes state in the remote cache if appropriate, and places it on its local B1 bus. (If the block has to be installed in the remote cache, it may replace some other block, which will trigger a flush/invalidation request on that B1 bus to ensure the inclusion property.) Finally, the requesting cache picks up the response to its BusRd request from the B1 bus and stores it in shared state. For writes, consider the specific situation shown in Figure 8-37(b), with P0 in the left-hand node issuing a write to location A, which is allocated in the memory of a third node (not shown). Since P0's own cache has the data only in shared state, an ownership request (BusUpgr) is issued on the local B1 bus. As a result, the copy of A in P1's cache is invalidated. Since the block is not available in the remote cache in dirty-but-stale state (which would have been incorrect since P1 had it in shared state), the monitor passes the BusUpgr request to bus B2, to invalidate any other copies, and at the same time updates its own state for the block to dirty-but-stale. In another node, P2 and P3 have the block in their caches in shared state. Because of the inclusion property, their associated remote cache is also guaranteed to have the block in shared state. This remote cache passes the BusUpgr request from B2 onto its local B1 bus and invalidates its own copy. When the request appears on the B1 bus, the copies of A in P2's and P3's caches are invalidated. If there is another node on the B2 bus whose processors are not caching block A, the upgrade request will not be passed onto its B1 bus. Now suppose another processor in the left-hand node, say P1, issues a store to location A. This request will be satisfied within the local node, with P0's cache supplying the data and the remote cache retaining the data in dirty-but-stale state, and no transaction will be passed onto the B2 bus.
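The filtering role of the coherence monitor can be summarized in a minimal sketch, not from the text; the lookup functions and state encoding are assumptions standing in for the remote access cache and local state monitor described above.

/* Minimal sketch (assumed interfaces): should a BusRd snooped on the
 * local B1 bus be forwarded up to B2? */
typedef enum { RC_INVALID, RC_SHARED, RC_DIRTY_BUT_STALE } rc_state_t;

extern int        is_local_address(unsigned long addr);     /* allocated in this node?      */
extern rc_state_t remote_cache_state(unsigned long addr);   /* remote data cached locally   */
extern int        dirty_outside_node(unsigned long addr);   /* local data modified remotely */

int forward_busrd_to_b2(unsigned long addr)
{
    if (is_local_address(addr)) {
        /* Local memory (or a local cache) can supply the block unless the
         * local state monitor records it as modified in some other node. */
        return dirty_outside_node(addr);
    } else {
        /* For remote addresses, a shared or dirty-but-stale entry in the
         * remote access cache means the block can be supplied within the
         * node (possibly after a local broadcast); otherwise go up to B2. */
        return remote_cache_state(addr) == RC_INVALID;
    }
}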
The implementation requirements on the processor caches and cache controllers remain unchanged from those discussed in Chapter 6. However, there are some constraints on the remote access cache. It should be larger than the sum of the processor caches and quite associative, to maintain inclusion without excessive replacements. It should also be lockup-free, i.e., able to handle multiple requests from the local node at a time while some are still outstanding (more on this in Chapter 11). Finally, whenever a block is replaced from the remote cache, an invalidate or flush request must be issued on the B1 bus, depending on the state of the replaced block (shared or dirty-but-stale, respectively). Minimizing the access time of the remote cache is less critical than increasing its hit rate, since it is not in the critical path that determines the clock rate of the processor: the remote caches are therefore more likely to be built out of DRAM than SRAM.
The remote cache controller must also deal with non-atomicity issues in requesting and acquiring buses that are similar to the ones we discussed in Chapter 6. Finally, what about write serialization and determining store completion? From our earlier discussion of how these work on a single bus in Chapter 6, it should be clear that serialization among requests from different nodes will be determined by the order in which those requests appear on the closest bus to the root on which they both appear. For writes that are satisfied entirely within the same leaf node, the order in which they may be seen by other processors— within or without that leaf—is their serialization order provided by the local B1 bus. Similarly, for writes that are satisfied entirely within the same subtree, the order in which they are seen by other processors—within or without that subtree—is the serialization order determined by the root bus of that subtree. It is easy to see this if we view each bus hanging off a shared bus as a processor, and recursively use the same reasoning that we did with a single bus in Chapter 5. Similarly, for store completion, a processor cannot assume its store has committed until it appears on the closest bus to the root on which it will appear. An acknowledgment cannot be generated until that time, and even then the appropriate orders must be preserved between this acknowledgment and other transactions on the way back to the processor (see Exercise 8.11). The request reaching this bus is the point of commitment for the store: once the acknowledgment of this event is sent back, the invalidations themselves no longer need to be acknowledged as they make their way down toward the processor caches, as long as the appropriate orders are maintained along this path (just as with multi-level cache hierarchies in Chapter 6).
Example 8-4
Encore GigaMax

One of the earliest machines that used the approach of hierarchical snoopy buses with distributed memory was the GigaMax [Wil87, WWS+89] from Encore Corporation. The system consisted of up to eight Encore Multimax machines (each a regular snoopy bus-based multiprocessor) connected together by fiber-optic links to a ninth global bus, forming a two-level hierarchy. Figure 8-38 shows a block diagram. Each node is augmented with a Uniform Interconnection Card (UIC) and a Uniform Cluster (Node) Cache (UCC) card. The UCC is the remote access cache, and the UIC is the local state monitor. The monitoring of the global bus is done differently in the GigaMax due to its particular organization. Nodes are connected to the global bus through a fiber-optic link, so while the local node cache (UCC) caches remote data, it does not snoop the global bus directly. Rather, every node also has a corresponding UIC on the global bus, which monitors global bus transactions for remote memory blocks that are cached in this local node. It then passes on the relevant requests to the local bus. If the UCC indeed sat directly on the global bus as well, the UIC on the global bus would not be necessary. The reason the GigaMax uses fiber-optic links, rather than a single UIC sitting on both buses, is that high-speed buses are usually short: the Nanobus used in the Encore Multimax and GigaMax is 1 foot long (light travels 1 foot in a nanosecond, hence the name Nanobus). Since each node is at least 1 foot wide, and the global bus is also 1 foot wide, flexible cabling is needed to hook these together. With fiber, links can be made quite long without affecting their transmission capabilities.

The extension of snoopy cache coherence to hierarchies of rings is much like the extension to hierarchies of buses with distributed memory. Figure 8-39 shows a block diagram. The local rings and the associated processors constitute nodes, and these are connected by one or more global rings.
Each node: Motorola 88K processors with caches and 8-way interleaved memory on a local Nano Bus (64-bit data, 32-bit address, split-transaction, 80 ns cycles); a UCC holding tag and data RAMs for remote data cached locally (two 16 MB banks, 4-way associative); and a UIC holding tag RAM only for local data cached remotely. Each node connects over a bit-serial fiber-optic link (4 bytes every 80 ns) to a UIC on the Global Nano Bus (64-bit data, 32-bit address, split-transaction, 80 ns cycles) holding tag RAM only for remote data cached locally.
Figure 8-38 Block diagram for the Encore Gigamax multiprocessor.
Local rings of processors (attached through processor-to-ring interfaces), connected to one or more global rings through inter-ring interfaces.
Figure 8-39 Block diagram for a hierarchical ring-based multiprocessor.

The coherence monitor takes the form of an inter-ring interface, serving the same roles as the coherence monitor (remote cache and local state monitor) in a bus hierarchy.

Hierarchical Directory Schemes

Hierarchy can also be used to construct directory schemes. Here, unlike in flat directory schemes, the source of the directory information is not found by going to a fixed node. The locations of copies are found neither at a fixed home node nor by traversing a distributed list pointed to by that home, and invalidation messages are not sent directly to the nodes with copies. Rather, all these activities are performed by sending messages up and down a hierarchy (tree) built upon the nodes, with the only direct communication (network transactions) being between parents and children in the tree.
At first blush, the organization of hierarchical directories is much like that of hierarchical snooping. Consider the example shown in Figure 8-40. The processing nodes are at the leaves of the tree, and main memory is distributed along with the processing nodes.
Processing nodes at the leaves. Each level-1 directory tracks which of its children processing nodes have a copy of a memory block, and which local memory blocks are cached outside its subtree; inclusion is maintained between the processor caches and the level-1 directory. The level-2 directory tracks which of its children level-1 directories have a copy of a memory block, and which local memory blocks are cached outside its subtree; inclusion is maintained between the level-1 directories and the level-2 directory.
Figure 8-40 Organization of hierarchical directories.
Every block has a home memory (leaf) in which it is allocated, but this does not mean that the directory information is maintained or rooted there. The internal nodes of the tree are not processing nodes, but only hold directory information. Each such directory node keeps track of all memory blocks that are being cached or recorded by its immediate subtrees. It uses a presence vector per block to tell which of its subtrees have copies of the block, and a bit to tell whether one of them has it dirty. It also records information about local memory blocks (i.e., blocks allocated in the local memory of one of its children) that are being cached by processing nodes outside its subtree. As with hierarchical snooping, this information is used to decide when requests originating within the subtree should be propagated further up the hierarchy. Since the amount of directory information to be maintained by a directory node that is close to the root can become very large, the directory information is usually organized as a cache to reduce its size, and maintains the inclusion property with respect to its children's caches/directories. This requires that on a replacement from a directory cache at a certain level of the tree, the replaced block be flushed out of all of its descendent directories in the tree as well. Similarly, replacement of the information about a block allocated within that subtree requires that copies of the block in nodes outside the subtree be invalidated or flushed. A read miss from a node flows up the hierarchy until either (i) a directory indicates that its subtree has a copy (clean or dirty) of the memory block being requested, or (ii) the request reaches the directory that is the first common ancestor of the requesting node and the home node for that block, and that directory indicates that the block is not dirty outside that subtree. The request then flows down the hierarchy to the appropriate processing node to pick up the data. The data reply follows the same path back, updating the directories on its way. If the block was dirty, a copy of the block also finds its way to the home node. A write miss in the cache flows up the hierarchy until it reaches a directory whose subtree contains the current owner of the requested memory block. The owner is either the home node, if the block is clean, or a dirty cache. The request travels down to the owner to pick up the data, and the requesting node becomes the new owner. If the block was previously in clean state, invalidations are propagated through the hierarchy to all nodes caching that memory block. Finally, all directories involved in the above transaction are updated to reflect the new owner and the invalidated copies.
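A minimal sketch, not from the text, of how a read miss climbs the directory hierarchy according to rules (i) and (ii) above; the node structure and lookup helpers are assumptions.

typedef struct dir_node {
    struct dir_node *parent;
} dir_node_t;

/* Assumed directory lookups and message helpers. */
extern int  subtree_has_copy(dir_node_t *d, unsigned long block);       /* clean or dirty */
extern int  is_common_ancestor_of_home(dir_node_t *d, unsigned long block);
extern int  block_dirty_outside(dir_node_t *d, unsigned long block);
extern void send_down_to_owner(dir_node_t *d, unsigned long block, int requester);

/* Propagate a read miss from a leaf's level-1 directory upward until it can
 * be serviced, then route it down toward the current owner.  Directories are
 * updated when the reply retraces this path (not shown). */
void hierarchical_read_miss(dir_node_t *level1, unsigned long block, int requester)
{
    dir_node_t *d = level1;
    for (;;) {
        if (subtree_has_copy(d, block)) {
            /* (i) some descendant cache or memory can supply the block */
            send_down_to_owner(d, block, requester);
            return;
        }
        if (is_common_ancestor_of_home(d, block) &&
            !block_dirty_outside(d, block)) {
            /* (ii) the home is below us and no node outside holds it dirty */
            send_down_to_owner(d, block, requester);
            return;
        }
        d = d->parent;   /* keep climbing toward the root */
    }
}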
In hierarchical snoopy schemes, the interconnection network is physically hierarchical to permit the snooping. With point to point communication, hierarchical directories do not need to rely on physically hierarchical interconnects. The hierarchy discussed above is a logical hierarchy, or a hierarchical data structure. It can be implemented on a network that is physically hierarchical (that is, an actual tree network with directory caches at the internal nodes and processing nodes at the leaves) or on a non-hierarchical network such as a mesh with the hierarchical directory embedded in this general network. In fact, there is a separate hierarchical directory structure for every block that is cached. Thus, the same physical node in a general network can be a leaf (processing) node for some blocks and an internal (directory) node for others (see Figure 8-41 below).

Finally, the storage overhead of the hierarchical directory has attractive scaling properties. It is the cost of the directory caches at each level. The number of entries in the directory goes up as we go further up the hierarchy toward the root (to maintain inclusion without excessive replacements), but the number of directories becomes smaller. As a result, the total directory memory needed for all directories at any given level of the hierarchy is typically about the same. The directory storage needed is not proportional to the size of main memory, but rather to that of the caches in the processing nodes, which is attractive. The overall directory memory overhead is proportional to (C x log_b P) / (M x B), where C is the cache size per processing node at the leaf, M is the main memory per node, B is the memory block size in bits, b is the branching factor of the hierarchy, and P is the number of processing nodes at the leaves (so log_b P is the number of levels in the tree). More information about hierarchical directory schemes can be found in the literature [Sco91, Wal92, Hag92, Joe95].

Performance Implications of Hierarchical Coherence

Hierarchical protocols, whether snoopy or directory, have some potential performance advantages. One is the combining of requests for a block as they go up and down the hierarchy. If a processing node is waiting for a memory block to arrive from the home node, another processing node that requests the same block can observe at their common ancestor directory that the block has already been requested. It can then wait at the intermediate directory and grab the block when it comes back, rather than send a duplicate request to the home node. This combining of transactions can reduce traffic and hence contention. The sending of invalidations and gathering of invalidation acknowledgments can also be done hierarchically through the tree structure. The other advantage is that upon a miss, if a nearby node in the hierarchy has a cached copy of the block, then the block can be obtained from that nearby node (cache to cache sharing) rather than having to go to the home which may be much further away in the network topology. This can reduce transit latency as well as contention at the home. Of course, this second advantage depends on how well locality in the hierarchy maps to locality in the underlying physical network, as well as how well the sharing patterns of the application match the hierarchy.

While locality in the tree network can reduce transit latency, particularly for very large machines, the overall latency and bandwidth characteristics are usually not advantageous for hierarchical schemes. Consider hierarchical snoopy schemes first. With busses there is a bus transaction and snooping latency at every bus along the way. With rings, traversing rings at every level of the hierarchy further increases latency to potentially very high levels. For example, the uncontended latency to access a location on a remote ring in a fully populated Kendall Square Research KSR1 machine [FBR93] was higher than 25 microseconds [SGC93], and a COMA organization was
used to reduce ring remote capacity misses. The commercial systems that have used hierarchical snooping have tended to use quite shallow hierarchies (the largest KSR machine was a two-level ring hierarchy, with up to 32 processors per ring). The fact that there are several processors per node also implies that the bandwidth between a node and its parent or child must be large enough to sustain their combined demands. The processors within a node will compete not only for bus or link bandwidth but also for the occupancy, buffers, and request tracking mechanisms of the node-to-network interface. To alleviate link bandwidth limitations near the root of the hierarchy, multiple buses or rings can be used closer to the root; however, bandwidth scalability in practical hierarchical systems remains quite limited.

For hierarchical directories, the latency problem is that the number of network transactions up and down the hierarchy to satisfy a request tends to be larger than in a flat memory-based scheme. Even though these transactions may be more localized, each one is a full-fledged network transaction which also requires either looking up or modifying the directory at its (intermediate) destination node. This increased end-point overhead at the nodes along the critical path tends to far outweigh any reduction in total transit latency on the network links themselves, especially given the characteristics of modern networks. Although some pipelining can be used—for example the data reply can be forwarded on toward the requesting node while a directory node is being updated—in practice the latencies can still become quite large compared to machines with no hierarchy [Hag92, Joe95]. Using a large branching factor in the hierarchy can alleviate the latency problem, but it increases contention. The root of the directory hierarchy can also become a bandwidth bottleneck, for both link bandwidth and directory lookup bandwidth. Multiple links may be used closer to the root—particularly appropriate for physically hierarchical networks [LAD+92]—and the directory cache may be interleaved among them. Alternatively, a multirooted directory hierarchy may be embedded in a non-hierarchical, scalable point-to-point interconnect [Wal92, Sco91, ScG93]. Figure 8-41 shows a possible organization.
Figure 8-41 A multi-rooted hierarchical directory embedded in an arbitrary network. A 16-node hierarchy is shown, with the processing nodes P0 through P15 as the leaves and the same physical nodes reused as internal (directory) nodes. For the blocks in the portion of main memory that is located at a processing node, that node itself is the root of the (logical) directory tree. Thus for P processing nodes, there are P directory trees. The figure shows only two of these: the directory tree for P0's memory and the directory tree for P3's memory (P12, for example, is also an internal node in P0's directory tree, and the three circles in the highlighted rectangle all represent the same processing node, P0). In addition to being the root for its local memory's directory tree, a processing node is also an internal node in the directory trees for the other processing nodes. The address of a memory block implicitly specifies a particular directory tree, and guides the physical traversals to get from parents to children and vice versa in this directory tree.

Like hierarchical
directory schemes themselves, however, these techniques have only been in the realm of research so far.
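To illustrate how a block's address can implicitly define its directory tree in such a multi-rooted embedding, the sketch below renames node identifiers relative to the block's home and uses a heap-style embedding to find a node's parent; the constants and function names are invented for illustration and do not describe any of the cited designs.

    /* Illustrative sketch only: one way to embed P logical directory trees
     * (one rooted at each block's home) in a flat machine of P nodes. */
    #define P      16         /* number of processing nodes          */
    #define BRANCH  4         /* branching factor of the hierarchy   */

    static int home_of(unsigned long block) { return (int)(block % P); }

    /* Rename node ids so that the home node is id 0 in this block's tree. */
    static int rel(int node, int home)   { return (node - home + P) % P; }
    static int unrel(int r, int home)    { return (r + home) % P; }

    /* Parent of `node` in the directory tree rooted at block's home, using a
     * heap-style embedding (parent of i is (i-1)/BRANCH).  Returns -1 if the
     * node is itself the root of this block's tree. */
    int dir_parent(unsigned long block, int node)
    {
        int home = home_of(block);
        int r = rel(node, home);
        if (r == 0)
            return -1;                        /* this node is the root (home) */
        return unrel((r - 1) / BRANCH, home);
    }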
8.12 Concluding Remarks

Scalable systems that support a coherent shared address space are an important part of the multiprocessing landscape. Hardware support for cache coherence is becoming increasingly popular for both technical and commercial workloads. Most of these systems use directory-based protocols, whether memory-based or cache-based. They are found to perform well, at least at the moderate scales at which they have been built, and to afford significant ease of programming compared to explicit message passing for many applications. While supporting cache coherence in hardware has a significant design cost, it is alleviated by increasing experience, the appearance of standards, and the fact that microprocessors themselves provide support for cache coherence. With the microprocessor coherence protocol understood, the coherence architecture can be developed even before the microprocessor is ready, so there is not so much of a lag between the two. Commercial multiprocessors today typically use the latest microprocessors available at the time they ship.

Some interesting open questions for coherent shared address space systems include whether their performance on real applications will indeed scale to large processor counts (and whether significant changes to current protocols will be needed for this), whether the appropriate node for a scalable system will be a small scale multiprocessor or a uniprocessor, the extent to which commodity communication architectures will be successful in supporting this abstraction efficiently, and the success with which a communication assist can be designed that supports the most appropriate mechanisms for both cache coherence and explicit message passing.
8.13 References

[ABC+95] Anant Agarwal et al. The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22nd International Symposium on Computer Architecture, pp. 2-13, June 1995.
[ALK+91] Anant Agarwal, Ben-Hong Lim, David Kranz, and John Kubiatowicz. LimitLESS Directories: A Scalable Cache Coherence Scheme. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 224-234, April 1991.
[ASH+88] Anant Agarwal, Richard Simoni, John Hennessy, and Mark Horowitz. An Evaluation of Directory Schemes for Cache Coherence. In Proceedings of the 15th International Symposium on Computer Architecture, pp. 280-289, June 1988.
[CeF78] L. Censier and P. Feautrier. A New Solution to Cache Coherence Problems in Multiprocessor Systems. IEEE Transactions on Computers, C-27(12):1112-1118, December 1978.
[ClA96] R. Clark and K. Alnes. An SCI Chipset and Adapter. In Symposium Record, Hot Interconnects IV, pp. 221-235, August 1996. Further information available from Data General Corporation.
[Con93] Convex Computer Corporation. Exemplar Architecture. Convex Computer Corporation, Richardson, Texas, November 1993.
[ENS+95] Andrew Erlichson, Basem Nayfeh, Jaswinder Pal Singh, and Oyekunle Olukotun. The Benefits of Clustering in Cache-coherent Multiprocessors: An Application-driven Investigation. In Proceedings of Supercomputing '95, November 1995.
[FBR93] Steve Frank, Henry Burkhardt III, and James Rothnie. The KSR1: Bridging the Gap Between Shared Memory and MPPs. In Proceedings of COMPCON, pp. 285-294, Spring 1993.
[FCS+93] Jean-Marc Frailong, et al. The Next Generation SPARC Multiprocessing System Architecture. In Proceedings of COMPCON, pp. 475-480, Spring 1993.
[FVS92] K. Farkas, Z. Vranesic, and M. Stumm. Cache Consistency in Hierarchical Ring-Based Multiprocessors. In Proceedings of Supercomputing '92, November 1992.
[HHS+95] Chris Holt, Mark Heinrich, Jaswinder Pal Singh, Edward Rothberg, and John L. Hennessy. The Effects of Latency, Occupancy and Bandwidth in Distributed Shared Memory Multiprocessors. Technical Report no. CSL-TR-95-660, Computer Systems Laboratory, Stanford University, January 1995.
[Gus92] David Gustavson. The Scalable Coherence Interface and Related Standards Projects. IEEE Micro, 12(1), pp. 10-22, February 1992.
[GuW92] Anoop Gupta and Wolf-Dietrich Weber. Cache Invalidation Patterns in Shared-Memory Multiprocessors. IEEE Transactions on Computers, 41(7):794-810, July 1992.
[GWM90] Anoop Gupta, Wolf-Dietrich Weber, and Todd Mowry. Reducing Memory and Traffic Requirements for Scalable Directory-Based Cache-Coherence Schemes. In Proceedings of the International Conference on Parallel Processing, volume I, pp. 312-321, August 1990.
[Hag92] Erik Hagersten. Toward Scalable Cache Only Memory Architectures. Ph.D. Dissertation, Swedish Institute of Computer Science, Sweden, October 1992.
[HLK97] Cristina Hristea, Daniel Lenoski, and John Keen. Measuring Memory Hierarchy Performance of Cache Coherent Multiprocessors Using Micro Benchmarks. In Proceedings of SC97, 1997.
[IEE93] IEEE Computer Society. IEEE Standard for Scalable Coherent Interface (SCI). IEEE Standard 1596-1992, New York, August 1993.
[IEE95] IEEE Standard for Cache Optimization for Large Numbers of Processors using the Scalable Coherent Interface (SCI), Draft 0.35, September 1995.
[Joe95] Truman Joe. COMA-F: A Non-Hierarchical Cache Only Memory Architecture. Ph.D. thesis, Computer Systems Laboratory, Stanford University, March 1995.
[Kax96] Stefanos Kaxiras. Kiloprocessor Extensions to SCI. In Proceedings of the Tenth International Parallel Processing Symposium, 1996.
[KaG96] Stefanos Kaxiras and James Goodman. The GLOW Cache Coherence Protocol Extensions for Widely Shared Data. In Proceedings of the International Conference on Supercomputing, pp. 35-43, 1996.
[KOH+94] Jeff Kuskin, Dave Ofelt, Mark Heinrich, John Heinlein, Rich Simoni, Kourosh Gharachorloo, John Chapin, Dave Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pp. 302-313, April 1994.
[LRV96] James R. Larus, Brad Richards, and Guhan Viswanathan. Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language. In Gregory V. Wilson and Paul Lu, eds., Parallel Programming Using C++, MIT Press, 1996.
[LaL97] James P. Laudon and Daniel Lenoski. The SGI Origin: A ccNUMA Highly Scalable Server. In Proceedings of the 24th International Symposium on Computer Architecture, 1997.
[LAD+92] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong, S. Yang, and R. Zak. The Network Architecture of the CM-5. In Symposium on Parallel and Distributed Algorithms, pp. 272-285, June 1992.
[Len92] Daniel Lenoski. The Stanford DASH Multiprocessor. Ph.D. Thesis, Stanford University, 1992.
[LLG+90] Daniel Lenoski, et al. The Directory-Based Cache Coherence Protocol for the DASH Multiprocessor. In Proceedings of the 17th International Symposium on Computer Architecture, pp. 148-159, May 1990.
[LLJ+93] Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41-61, January 1993.
[LoC96] Tom Lovett and Russell Clapp. STiNG: A CC-NUMA Computer System for the Commercial Marketplace. In Proceedings of the 23rd International Symposium on Computer Architecture, pp. 308-317, 1996.
[OKN90] B. O'Krafka and A. Newton. An Empirical Evaluation of Two Memory-Efficient Directory Methods. In Proceedings of the 17th International Symposium on Computer Architecture, pp. 138-147, May 1990.
[RSG93] Edward Rothberg, Jaswinder Pal Singh, and Anoop Gupta. Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors. In Proceedings of the 20th International Symposium on Computer Architecture, pp. 14-25, May 1993.
[SGC93] [Sco91] [ScG93] [SiH91] [SFG+93] [Sin97] [Tan76] [ThD90] [THH+96] [Tuc86] [VSL+91] [Wal92]
[Web93] Wolf-Dietrich Weber. Scalable Directories for Cache-coherent Shared-memory Multiprocessors. Ph.D. Thesis, Computer Systems Laboratory, Stanford University, January 1993. Also available as Stanford University Technical Report no. CSL-TR-93-557.
[WGH+97] Wolf-Dietrich Weber, Stephen Gold, Pat Helland, Takeshi Shimizu, Thomas Wicki, and Winifrid Wilcke. The Mercury Interconnect Architecture: A Cost-effective Infrastructure for High-performance Servers. In Proceedings of the 24th International Symposium on Computer Architecture, 1997.
[WeS94] Shlomo Weiss and James Smith. Power and PowerPC. Morgan Kaufmann Publishers, 1994.
8.14 Exercises

8.1 Short answer questions
a. What are the inefficiencies and efficiencies in emulating message passing on a cache coherent machine compared to the kinds of machines discussed in the previous chapter?
b. For which of the case study parallel applications used in this book do you expect there to be a substantial advantage to using multiprocessor rather than uniprocessor nodes (assuming the same total number of processors)? For which do you think there might be disadvantages in some machines, and under what circumstances? Are there any special benefits that the Illinois coherence scheme offers for organizations with multiprocessor nodes?
c. How might your answer to the previous question differ with increasing scale of the machine? That is, how do you expect the performance benefits of using fixed-size multiprocessor nodes to change as the machine size is increased to hundreds of processors?

8.2 High-level memory-based protocol tradeoffs.
a. Given a 512 processor system, in which each node visible to the directory has 8 processors and 1 GB of main memory, and a cache block size of 64 bytes, what is the directory memory overhead for: (i) a full bit vector scheme, and (ii) DiriB, with i = 3?
b. In a full bit vector scheme, why is the volume of data traffic guaranteed to be much greater than that of coherence traffic (i.e., invalidation and invalidation-acknowledgment messages)?
c. The chapter provided diagrams showing the network transactions for strict request-reply, intervention forwarding and reply forwarding for read operations in a flat memory-based protocol like that of the SGI Origin. Do the same for write operations.
d. The Origin protocol assumed that acknowledgments for invalidations are gathered at the requestor. An alternative is to have the acknowledgments be sent back to the home (from where the invalidation requests come) and have the home send a single acknowledgment back to the requestor. This solution is used in the Stanford FLASH multiprocessor. What are the performance and complexity tradeoffs between these two choices?
e. Under what conditions are replacement hints useful and under what conditions are they harmful? In which of our parallel applications do you think they would be useful or harmful to overall performance, and why?
f. Draw the network transaction diagrams (like those in Figure 8-16 on page 549) for an uncached read shared request, an uncached read exclusive request, and a write-invalidate request in the Origin protocol. State one example of a use of each.

8.3 High-level issues in cache-based protocols.
a. Instead of the doubly linked list used in the SCI protocol, it is possible to use a singly linked list. What is the advantage? Describe what modifications would need to be made to the following operations if a singly linked list were used: (i) replacement of a cache block that is in a sharing list, and (ii) a write to a cache block that is in a sharing list. Qualitatively discuss the effects this might have on large scale multiprocessor performance.
b. How might you reduce the latency of writes that cause invalidations in the SCI protocol? Draw the network transactions. What are the major tradeoffs?

8.4 Adaptive protocols.
a. When a variable exhibits migratory sharing, a processor that reads the variable will be the next one to write it. What kinds of protocol optimizations could you use to reduce traffic and latency in this case, and how would you detect the situation dynamically? Describe a scheme or two in some detail.
b. Another pattern that might be detected dynamically is a producer-consumer pattern, in which one processor repeatedly writes (produces) a variable and another processor repeatedly reads it. Is the standard MESI invalidation-based protocol well-suited to this? Why or why not? What enhancements or protocol might be better, and what are the savings in latency or traffic? How would you dynamically detect and employ the enhancements or variations?

8.5 Correctness issues in protocols
a. Why is write atomicity more difficult to provide with update protocols in directory-based systems? How would you solve the problem? Does the same difficulty exist in a bus-based system?
b. Consider the following program fragment running on a cache-coherent multiprocessor, assuming all values to be 0 initially:

    P1      P2      P3      P4
    A=1     u=A     w=A     A=2
            v=A     x=A
There is only one shared variable (A). Suppose that a writer magically knows where the cached copies are, and sends updates to them directly without consulting a directory node.
Construct a situation in which write atomicity may be violated, assuming an update-based protocol. (i) Show the violation of SC that occurs in the results. (ii) Can you produce a case where coherence is violated as well? How would you solve these problems? (iii) Can you construct the same problems for an invalidation-based protocol? (iv) Could you do it for update protocols on a bus? c. In handling writebacks in the Origin protocol, we said that when the node doing the writeback receives an intervention it ignores it. Given a network that does not preserve point to point order, what situations do we have to be careful of? How do we detect that this intervention should be dropped? Would there be a problem with a network that preserved point-to-point order? d. Can the serialization problems discussed for Origin in this chapter arise even with a strict request-reply protocol, and do the same guidelines apply? Show example situations, including the examples discussed in that section. e. Consider the serialization of writes in NUMA-Q, given the two-level hierarchical coherence protocol. If a node has the block dirty in its remote cache, how might writes from other nodes that come to it get serialized with respect to writes from processors in this node. What transactions would have to be generated to ensure the serialization? 8.6 Eager versus delayed replies to processor. a. In the Origin protocol, with invalidation acknowledgments coming to the home rather than the requestor, it would be possible for other read and write requests to come to the requestor that has outstanding invalidations before the acknowledgments for those invalidations come in. (i) Is this possible or a problem with delayed exclusive replies? With eager exclusive replies? If it is a problem, what is the simplest way to solve it? (ii) Suppose you did indeed want to allow the requestor with invalidations outstanding to process incoming read and write requests. Consider write requests first, and construct an example where this can lead to problems. (Hint: consider the case where P1 writes to a location A, P2 writes to location B which is on the same cache block as A and then writes a flag, and P3 spins on the flag and then reads the value of location B.) How would you allow the incoming write to be handled but still maintain correctness? Is it easier if invalidation acknowledgments are collected at the home or at the requestor? (iii) Now answer the same questions above for incoming read requests. (iv) What if you were using an update-based protocol. Now what complexities arise in allowing an incoming request to be processed for a block that has updates outstanding from a previous write, and how might you solve them? (v) Overall, would you choose to allow incoming requests to be processed while invalidations or updates are outstanding, or deny them? b. Suppose a block needs to be written back while there are invalidations pending for it. Can this lead to problems, or is it safe? If it is problematic, how might you address the problem? c. Are eager exclusive replies useful with an underlying SC model? Are they at all useful if the processor itself provides out of order completion of memory operations, unlike the MIPS R10000?
d. Do the tradeoffs between collecting acknowledgments at the home and at the requestor change if eager exclusive replies are used instead of delayed exclusive replies? 8.7 Implementation Issues. e. In the Origin implementation, incoming request messages to the memory/directory interface are given priority over incoming replies, unless there is a danger of replies being starved. Why do you think this choice of giving priorities to requests was made? Describe some alternatives for how you might you detect when to invert the priority. What would be the danger with replies being starved? 8.8 4. Page replication and migration. a. Why is it necessary to flush TLBs when doing migration or replication of pages? b. For a CC-NUMA multiprocessor with software reloaded TLBs, suppose a page needs to be migrated. Which one of the following TLB flushing schemes would you pick and why? (Hint: The selection should be based on the following two criteria: Cost of doing the actual TLB flush, and difficulty of tracking necessary information to implement the scheme.) (i) Only TLBs that currently have an entry for a page (ii) Only TLBs that have loaded an entry for a page since the last flush (iii) All TLBs in the system. c. For a simple two processor CC-NUMA system, the traces of cache misses for 3 virtual pages X, Y, Z from the two processors P0 and P1 are shown. Time goes from left to right. "R" is a read miss and "W" is a write miss. There are two memories M0 and M1, local to P0 and P1 respectively. A local miss costs 1 unit and a remote miss costs 4 units. Assume that read misses and write misses cost the same. Page X: P0: RRRR R R RRRRR RRR P1: R R R R R RRRR RR Page Y: P0: no accesses P1: RR WW RRRR RWRWRW WWWR Page Z: P0: R W RW R R RRWRWRWRW P1: WR RW RW W W R (i) How would you place pages X, Y, and Z, assuming complete knowledge of the entire trace? (ii) Assume the pages were initially placed in M0, and that a migration or replication can be done at zero cost. You have prior knowledge of the entire trace. You can do one migration, or one replication, or nothing for each page at the beginning of the trace. What action would be appropriate for each of the pages? (iii) Same as (ii) , but a migration or replication costs 10 units. Also give the final memory access cost for each page. (iv) Same as (iii), but a migration or replication costs 60 units. (v) Same as (iv), but the cache-miss trace for each page is the shown trace repeated 10 times. (You still can only do one migration or replication at the beginning of the entire trace)
8.9 Synchronization: a. Full-empty bits, introduced in Section 5.6, provide hardware support for fine-grained synchronization. What are the advantages and disadvantages of full-empty bits, and why do you think they are not used in modern systems? b. With an invalidation based protocol, lock transfers take more network transactions than necessary. An alternative to cached locks is to use uncached locks, where the lock variable stays in main memory and is always accessed directly there. (i) Write pseudocode for a simple lock and a ticket lock using uncached operations? (ii) What are the advantages and disadvantages relative to using cached locks? Which would you deploy in a production system? c. Since high-contention and low-contention situations are best served by different lock algorithms, one strategy that has been proposed is to have a library of synchronization algorithms and provide hardware support to switch between them “reactively” at runtime based on observed access patterns to the synchronization variable [LiA9??]. (i) Which locks would you provide in your library? (ii) Assuming a memory-based directory protocol, design simple hardware support and a policy for switching between locks at runtime. (iii) Describe an example where this support might be particularly useful. (iv) What are the potential disadvantages? 8.10 Software Implications and Performance Issues a. You are performing an architectural study using four applications: Ocean, LU, an FFT that performs local calculations on rows separated by a matrix transposition, and BarnesHut. For each application, answer the above questions assuming a CC-NUMA system: (i) what modifications or enhancements in data structuring or layout would you use to ensure good interactions with the extended memory hierarchy? (ii) what are the interactions with cache size and granularities of allocation, coherence and communication that you would be particularly careful to represent or not represent? b. Consider the example of transposing a matrix of data in parallel, as is used in computations such as high-performance Fast Fourier Transforms. Figure 8-42 shows the transpose pictorially. Every processor tranposes one “patch” of its assigned rows to every other processor, including one to itself. Before the transpose, a processor has read and written its assigned rows of the source matrix of the transpose, and after the transpose it reads and writes its assigned rows of the destination matrix. The rows assigned to a processor in both the source and destination matrix are allocated in its local memory. There are two ways to perform the transpose: a processor can read the local elements from its rows of the source matrix and write them to the associated elements of the destination matrix, whether they are local or remote, as shown in the figure; or a processor can write the local rows of the destination matrix and read the associated elements of the source matrix whether they are local or remote. (i) Given an invalidation-based directory protocol, which method do you think will perform better and why? (iii) How do you expect the answer to (i) to change if you assume an update-based directory protocol?
Figure 8-42 Sender-initiated matrix transposition. The source and destination matrices are partitioned among processes in groups of contiguous rows (labeled "Owned by Proc 0" through "Owned by Proc 3" in the figure). Each process divides its set of n/p rows into p patches of size n/p * n/p. Consider process P2 as a representative example: one patch assigned to it ends up in the assigned set of rows of every other process, and it transposes one patch (third from left, in this case) locally. In one matrix a long cache block follows the traversal order, while in the other it crosses the traversal order.
Consider this implementation of a matrix transpose which you plan to run on 8 processors. Each processor has one level of cache, which is fully associative, 8 KB, with 128 byte lines. (Note: AT and A are not the same matrix.)

    Transpose(double **A, double **AT)
    {
        int i, j, mynum;
        GETPID(mynum);
        for (i = mynum*nrows/p; i < ((mynum+1)*(nrows/p)); i++) {
            for (j = 0; j < 1024; j++) {
                AT[i][j] = A[j][i];
            }
        }
    }

The input data set is a 1024-by-1024 matrix of double precision floating point numbers, decomposed so that each processor is responsible for generating a contiguous block of rows in the transposed matrix AT. Ignoring the contention problem caused by all processors first going to processor 0, what is the major performance problem with this code? What technique would you use to solve it? Restructure the code to alleviate the performance problem as much as possible. Write the entire restructured loop.

8.11 Bus Hierarchies.
a. Consider a hierarchical bus-based system with a centralized memory at the root of the hierarhcy rather than distributed memory. What would be the main differences in how reads and writes are satisfied? Briefly describe in terms of the path taken by reads and writes. b. Could you have a hierarchical bus based system with centralized memory (say) without pursuing the inclusion property between the L2 (node) cache and the L1 caches? If so, what complications would it cause? c. Given the hierarchical bus-based multiprocessor shown in Figure 8-37, design a statetransition diagram for maintaining coherence for the L2 cache. Assume that the L1 caches follow the Illinois protocol. Precisely define the cache block states for L2, the available bus-transaction types for the B1 and B2 buses, other actions possible, and the state-transition diagram. Note you have considerable freedom in choosing cache block states and other parameters; there is no one fixed right answer. Justify your decisions. d. Given the ending state of caches in examples for Figure 8-37 on page 596, give some more memory references, and ask for detailed description of coherence actions that would need to occur. e. [To ensure SC in a two-level hierarchical bus design, is it okay to return an acknowledgment when the invalidation request reaches the B2 bus? If yes, what constraints are imposed on the design/implementation of L1 caches and L2 caches and the orders preserved among transactions? If not, why not. Would it be okay if the hierarchy had more than two levels? f. Suppose two processors in two different nodes of a hierarchical bus-based machine issue an upgrade for a block at the same time. Trace their paths through the system, discussing all state changes and when they must happen, as well as what precautions prevent deadlock and prevent both processors from gaining ownership. g. An optimization in distributed-memory bus-based hierarchies is that if another processor’s cache on the local bus can supply the data, we do not have to go to the global bus and remote node. What are the tradeoffs of supporting this optimization in ring-based hierarchies? (Answer: May need to go around the local ring an extra time to get to the inter-ring interface.) 8.12 Consider design of a single chip in year 2005, when you have 100 million transistors. You decide to put four processors on the chip. How would you design the interface to build a modular component that can be used in scalable systems: (i) global-memory organization (e.g., the chip has multiple CPUs each with local cache, a shared L2 cache, and a single snoopy cache bus interface), or (ii) distributed memory organization (e.g., the chip has multiple CPUs each with local cache on an internal snoopy bus, a separate memory bus interface that comes of this internal snoopy bus, and another separate snoopy cache bus interface). Discuss the tradeoffs.??? not very well formed??? 8.13 Hierarchical directories. a. What branching would you choose in a machine with a hierarchical directory? Highlight the major tradeoffs. What techniques might you use to alleviate the performance tradeoffs? Be as specific in your description as possible. b. Is it possible to implement hierarchical directories without maintaining inclusion in the directory caches. Design a protocol that does that and discuss the advantages and disadvantages.
CHAPTER 9
Hardware-Software Tradeoffs
Morgan Kaufmann is pleased to present material from a preliminary draft of Parallel Computer Architecture; the material is (c) Copyright 1996 Morgan Kaufmann Publishers. This material may not be used or distributed for any commercial purpose without the express written consent of Morgan Kaufmann Publishers. Please note that this material is a draft of forthcoming publication, and as such neither Morgan Kaufmann nor the authors can be held liable for changes or alterations in the final edition.
9.1 Introduction

This chapter focuses on addressing some potential limitations of the directory-based, cache-coherent systems discussed in the previous chapter, and the hardware-software tradeoffs that this engenders. The primary limitations of those systems, in both performance and cost, are the following.
• High waiting time at memory operations. Sequential consistency (SC) was assumed to be the memory consistency model of choice, and the chapter discussed how systems might satisfy the sufficient conditions for SC. To do this, a processor would have to wait for the previous memory operation to complete before issuing the next one. This has even greater impact on performance in scalable systems than in bus-based systems, since communication latencies are longer and there are more network transactions in the critical path. Worse still, it is very limiting for compilers, which cannot reorder memory operations to shared data at all if the programmer assumes sequential consistency.
• Limited capacity for replication. Communicated data are automatically replicated only in the processor cache, not in local main memory. This can lead to capacity misses and artifactual communication when working sets are large and include nonlocal data or when there are a lot of conflict misses. • High design and implementation cost. The communication assist contains hardware that is specialized for supporting cache-coherence and tightly integrated into the processing node. Protocols are also complex, and getting them right in hardware takes substantial design time. By cost here we mean the cost of hardware and of system design time. Recall from Chapter 3 that there is also a programming cost associated with achieving good performance, and approaches that reduce system cost can often increase this cost dramatically. The chapter will focus on these three limitations, the first two of which have primarily to do with performance and the third primarily with cost. The approaches that have been developed to address them are still controversial to varying extents, but aspects of them are finding their way into mainstream parallel architecture. There are other limitations as well, including the addressability limitations of a shared physical address space—as discussed for the Cray T3D in Chapter 7—and the fact that a single protocol is hard-wired into the machine. However, solutions to these are often incorporated in solutions to the others, and they will be discussed as advanced topics. The problem of waiting too long at memory operations can be addressed in two ways in hardware. First, the implementation can be designed to not satisfy the sufficient conditions for SC, which modern non-blocking processors are not inclined to do anyway, but to satisfy the SC model itself. That is, a processor needn’t wait for the previous operation to complete before issuing the next one; however, the system ensures that operations do not complete or become visible to other processors out of program order. Second, the memory consistency model can itself be relaxed so program orders do not have to be maintained so strictly. Relaxing the consistency model changes the semantics of the shared address space, and has implications for both hardware and software. It requires more care from the programmer in writing correct programs, but enables the hardware to overlap and reorder operations further. It also allows the compiler to reorder memory operations within a process before they are even presented to hardware, as optimizing compilers are wont to do. Relaxed memory consistency models will be discussed in Section 9.2. The problem of limited capacity for replication can be addressed by automatically caching data in main memory as well, not just in the processor caches, and by keeping these data coherent. Unlike in hardware caches, replication and coherence in main memory can be performed at a variety of granularities—for example a cache block, a page, or a user-defined object—and can be managed either directly by hardware or through software. This provides a very rich space of protocols, implementations and cost-performance tradeoffs. An approach directed primarily at improving performance is to manage the local main memory as a hardware cache, providing replication and coherence at cache block granularity there as well. This approach is called cacheonly memory architecture or COMA, and will be discussed in Section 9.3. 
It relieves software from worrying about capacity misses and initial distribution of data across main memories, while still providing coherence at fine granularity and hence avoiding false-sharing. However, it is hardware-intensive, and requires per-block tags to be maintained in main memory as well. Finally, there are many approaches to reduce hardware cost. One approach is to integrate the communication assist and network less tightly into the processing node, increasing communication latency and occupancy. Another is to provide automatic replication and coherence in soft-
ware rather than hardware. The latter approach provides replication and coherence in main memory, and can operate at a variety of granularities. It enables the use of off-the-shelf commodity parts for the nodes and interconnect, reducing hardware cost but pushing much more of the burden of achieving good performance on to the programmer. These approaches to reduce hardware cost are discussed in Section 9.4.

The three issues are closely related. For example, cost is strongly related to the manner in which replication and coherence are managed in main memory: The coarser the granularities used, the more likely that the protocol can be synthesized through software layers, including the operating system, thus reducing hardware costs (see Figure 9-1). Cost and granularity are also related to the memory consistency model: Lower-cost solutions and larger granularities benefit more from relaxing the memory consistency model, and implementing the protocol in software makes it easier to fully exploit the relaxation of the semantics.

Figure 9-1 Layers of the communication architecture for systems discussed in this chapter. From top to bottom: a shared address space programming model; compilation or library; operating systems support; communication hardware; and the physical communication medium, with the communication abstraction (user/system boundary) and the hardware/software boundary marked between the layers.

To summarize and relate the alternatives, Section 9.5 constructs a framework for the different types of systems based on the granularities at which they allocate data in the replication store, keep data coherent and communicate data. This leads naturally to an approach that strives to achieve a good compromise between the high-cost COMA approach and the low-cost all-software approach. This approach, called Simple-COMA, is discussed in Section 9.5 as well. The implications of the systems discussed in this chapter for parallel software, beyond those discussed in the previous chapters, are explored in Section 9.6. Finally, Section 9.7 covers some advanced topics, including the techniques to address the limitations of a shared physical address space and a fixed coherence protocol.
9.2 Relaxed Memory Consistency Models

Recall from Chapter 5 that the memory consistency model for a shared address space specifies the constraints on the order in which memory operations (to the same or different locations) can appear to execute with respect to one another, enabling programmers to reason about the behavior and correctness of their programs. In fact, any system layer that supports a shared address space naming model has a memory consistency model: the programming model or programmer's interface, the communication abstraction or user-system interface, and the hardware-software interface. Software that interacts with that layer must be aware of its memory consistency model. We shall focus mostly on the consistency model as seen by the programmer—i.e. at the interface
between the programmer and the rest of the system composed of the compiler, OS and hardware—since that is the one with which programmers reason1. For example, a processor may preserve all program orders presented to it among memory operations, but if the compiler has already reordered operations then programmers can no longer reason with the simple model exported by the hardware. Other than programmers (including system programmers), the consistency model at the programmer’s interface has implications for programming languages, compilers and hardware as well. To the compiler and hardware, it indicates the constraints within which they can reorder accesses from a process and the orders that they cannot appear to violate, thus telling them what tricks they can play. Programming languages must provide mechanisms to introduce constraints if necessary, as we shall see. In general, the fewer the reorderings of memory accesses from a process we allow the system to perform, the more intuitive a programming model we provide to the programmer, but the more we constrain performance optimizations. The goal of a memory consistency model is to impose ordering constraints that strike a good balance between programming complexity and performance. The model should also be portable; that is, the specification should be implementable on many platforms, so that the same program can run on all these platforms and preserve the same semantics. The sequential consistency model that we have assumed so far provides an intuitive semantics to the programmer—program order within each process and a consistent interleaving across processes—and can be implemented by satisfying its sufficient conditions. However, its drawback is that it by preserving a strict order among accesses it restricts many of the performance optimizations that modern uniprocessor compilers and microprocessors employ. With the high cost of memory access, computer systems achieve higher performance by reordering or overlapping the servicing of multiple memory or communication operations from a processor. Preserving the sufficient conditions for SC clearly does not allow for much reordering or overlap. With SC at the programmer’s interface, the compiler cannot reorder memory accesses even if they are to different locations, thus disallowing critical performance optimizations such as code motion, commonsubexpression elimination, software pipelining and even register allocation.
Example 9-1
Show how register allocation can lead to a violation of SC even if the hardware satisfies SC.
Answer
Consider the code fragment shown in Figure 9-2(a). After register allocation, the code produced by the compiler and seen by hardware might look like that in Figure 9-2(b). The result (u,v) = (0,0) is disallowed under SC hardware in (a), but not only can but will be produced by SC hardware in (b). In effect, register allocation reorders the write of A and the read of B on P1 and reorders the write of B and the read of A on P2. A uniprocessor compiler might easily perform these
optimizations in each process: They are valid for sequential programs since the reordered accesses are to different locations.

Figure 9-2 Example showing how register allocation by the compiler can violate SC.
(a) Before register allocation:     P1: B=0; A=1; u=B          P2: A=0; B=1; v=A
(b) After register allocation:      P1: r1=0; A=1; u=r1; B=r1   P2: r2=0; B=1; v=r2; A=r2
The code in (a) is the original code with which the programmer reasons under SC. r1, r2 are registers.

1. The term programmer here refers to the entity that is responsible for generating the parallel program. If the human programmer writes a sequential program that is automatically parallelized by system software, then it is the system software that has to deal with the memory consistency model; the programmer simply assumes sequential semantics as on a uniprocessor.
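As a concrete illustration of how a programmer might forbid exactly this compiler transformation in C, the fragment below declares the shared variables volatile so the compiler may not keep them in registers across the accesses; this is only a sketch of the compiler-side issue (volatile by itself says nothing about hardware reordering), and it is not code from the book.

    /* Illustrative sketch: with A and B declared volatile, the compiler must
     * perform every load and store to memory and may not apply the register
     * allocation shown in Figure 9-2(b).  Only the compiler's reordering is
     * addressed here; hardware ordering is a separate issue discussed in the
     * text. */
    volatile int A, B;

    void p1_code(int *u)        /* P1's fragment from Figure 9-2(a) */
    {
        B = 0;
        A = 1;
        *u = B;                 /* must reload B from memory */
    }

    void p2_code(int *v)        /* P2's fragment from Figure 9-2(a) */
    {
        A = 0;
        B = 1;
        *v = A;                 /* must reload A from memory */
    }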
SC at the programmer’s interface implies SC at the hardware-software interface; if the sufficient conditions for SC are met, a processor waits for an access to complete or at least commit before issuing the next one, so most of the latency suffered by memory references is directly seen by processors as stall time. While a processor may continue executing non-memory instructions while a single outstanding memory reference is being serviced, the expected benefit from such overlap is tiny (even without instruction-level parallelism on average every third instruction is a memory reference [HeP95]). We need to do something about this performance problem. One approach we can take is to preserve sequential consistency at the programmer’s interface but hide the long latency stalls from the processor. We can do this in several ways, divided into two categories [GGH91]. The techniques and their performance implications will be discussed further in Chapter 11; here, we simply provide a flavor for them. In the first category, the system still preserves the sufficient conditions for SC. The compiler does not reorder memory operations. Latency tolerance techniques such as prefetching of data or multithreading (see Chapter 11) are used to overlap accesses with one another or with computation—thus hiding much of their latency from the processor—but the actual read and write operations are not issued before previous ones complete and the operations do not become visible out of program order. In the second category, the system preserves SC but not the sufficient conditions at the programmer’s interface. The compiler can reorder operations as long as it can guarantee that sequential consistency will not be violated in the results. Compiler algorithms have been developed for this [ShS88,KrY94], but they are expensive and their analysis is currently quite conservative. At the hardware level, memory operations are issued and executed out of program order, but are guaranteed to become visible to other processors in program order. This approach is well suited to dynamically scheduled processors that use an instruction lookahead buffer to find independent instructions to issue. The instructions are inserted in the lookahead buffer in program order, they are chosen from the instruction lookahead buffer and executed out of order, but they are guaranteed to retire from the lookahead buffer in program order. Operations may even issue and execute out of order past an unresolved branch in the lookahead buffer based on branch prediction— called speculative execution—but since the branch will be resolved and retire before them they will not become visible to the register file or external memory system before the branch is resolved. If the branch was mis-predicted, the effects of those operations will never become visi-
ble. The technique called speculative reads goes a little further; here, the values returned by reads are used even before they are known to be correct; later checks determine if they were incorrect, and if so the computation is rolled back to reissue the read. Note that it is not possible to speculate with stores in this manner because once a store is made visible to other processors it is extremely difficult to roll back and recover: a store's value should not be made visible to other processors or the external memory system environment until all previous references have correctly completed. Some or all of these techniques are supported by many modern microprocessors, such as the MIPS R10000, the HP PA-8000, and the Intel Pentium Pro. However, they do require substantial hardware resources and complexity, their success at hiding multiprocessor latencies is not yet clear (see Chapter 11), and not all processors support them. Also, these techniques work for processors, but they do not help compilers perform the reorderings of memory operations that are critical for their optimizations.

A completely different way to overcome the performance limitations imposed by SC is to change the memory consistency model itself; that is, to not guarantee such strong ordering constraints to the programmer, but still retain semantics that are intuitive enough to a programmer. By relaxing the ordering constraints, these relaxed consistency models allow the compiler to reorder accesses before presenting them to the hardware, at least to some extent. At the hardware level, they allow multiple memory accesses from the same process to be outstanding at a time and even to complete or become visible out of order, thus allowing much of the latency to be overlapped and hidden from the processor. The intuition behind relaxed models is that SC is usually too conservative, in that many of the orders it preserves are not really needed to satisfy a programmer's intuition in most situations. Much more detailed treatments of relaxed memory consistency models can be found in [Adv93, Gha95].

Consider the simple example shown in Figure 9-3.

Figure 9-3 Intuition behind relaxed memory consistency models.
(a) Orderings maintained by sequential consistency:
    P1: A = 1;  B = 1;  flag = 1;          P2: while (flag == 0);  u = A;  v = B;
    (each operation is ordered before the next one in program order)
(b) Orderings necessary for correct program semantics:
    P1: A = 1 and B = 1 need only complete before flag = 1;
    P2: the repeated reads ... = flag are ordered, and u = A and v = B need only follow the read that observes flag = 1.
The arrows in the figure indicate the orderings maintained. The left side of the figure (part (a)) shows the orderings maintained by the sequential consistency model. The right side of the figure (part (b)) shows the orderings that are necessary for "correct / intuitive" semantics.

On the left are the orderings that will be maintained by an SC implementation. On the right are the orderings that are necessary for intuitively correct program semantics. The latter are far fewer. For example, writes to variables A and B by P1 can be reordered without affecting the results observed by the program; all we must ensure is
that both of them complete before variable flag is set to 1. Similarly, reads of variables A and B can be reordered at P2 once flag has been observed to change to the value 1.¹ Even with these reorderings, the results look just like those of an SC execution. On the other hand, while the accesses to flag are also simple variable accesses, a model that allowed them to be reordered with respect to A and B at either process would compromise the intuitive semantics and SC results.

It would be wonderful if system software or hardware could automatically detect which program orders are critical to maintaining SC semantics and allow the others to be violated for higher performance [ShS88]. However, the problem is undecidable for general programs, and inexact solutions are often too conservative to be very useful. There are three parts to a complete solution for a relaxed consistency model: the system specification, the programmer's interface, and a translation mechanism.

The system specification. This is a clear specification of two things: first, what program orders among memory operations are guaranteed to be preserved, in an observable sense, by the system, including whether write atomicity is maintained; and second, if not all program orders are preserved by default, what mechanisms the system provides for a programmer to enforce the desired orderings explicitly. As should be clear by now, the compiler and the hardware each have a system specification, but we focus on the specification that the two together, the system as a whole, present to the programmer. For a processor architecture, the specification it exports governs the reorderings that it allows and the order-preserving primitives it provides, and it is often called the processor's memory model.

The programmer's interface. The system specification is itself a consistency model. A programmer may use it to reason about correctness and insert the appropriate order-preserving mechanisms. However, this is a very low-level interface for a programmer: parallel programming is difficult enough without having to think about reorderings and write atomicity! The specific reorderings and order-enforcing mechanisms supported also differ across systems, compromising portability. Ideally, a programmer would assume SC semantics, and the system components would automatically determine what program orders they may alter without violating this illusion; however, as noted above, this is too difficult to compute. What a programmer therefore wants is a methodology for writing "safe" programs. This is a contract: if the program follows certain high-level rules or provides enough program annotations (such as telling the system that flag in Figure 9-3 is in fact used as a synchronization variable), then any system on which the program runs will always guarantee a sequentially consistent execution, regardless of the default reorderings in the system specifications it supports. The programmer's responsibility is to follow the rules and provide the annotations (which hopefully does not involve reasoning at the level of potential reorderings); the system's responsibility is to maintain the illusion of sequential consistency. Typically, this means using the annotations as constraints on compiler reorderings, and also translating the annotations into the right order-preserving mechanisms for the processor at the right places. The implication for programming languages is that they should support the necessary annotations.

The translation mechanism. This translates the programmer's annotations to the interface exported by the system specification.
1. Actually, it is possible to further weaken the requirements for correct execution. For example, it is not necessary for the stores to A and B to complete before the store to flag is done; it is only necessary that they be complete by the time processor P2 observes that the value of flag has changed to 1. It turns out that such relaxed models are extremely difficult to implement in hardware, so we do not discuss them here. In software, some of these relaxed models make sense, and we will discuss them in Section 9.3.
We organize our discussion of relaxed consistency models around these three pieces. We first examine different low-level specifications exported by systems, and particularly by microprocessors, organizing them by the types of reorderings they allow. Section 9.2.2 then discusses the programmer's interface or contract and how the programmer might provide the necessary annotations, and Section 9.2.3 briefly discusses translation mechanisms. A detailed treatment of implementation complexity and performance benefits is postponed until we discuss latency tolerance mechanisms in Chapter 11.

9.2.1 System Specifications

Several different reordering specifications have been proposed by microprocessor vendors and by researchers, each with its own mechanisms for enforcing orders. These include total store ordering (TSO) [SFC91, Spa91], partial store ordering (PSO) [SFC91, Spa91], and relaxed memory ordering (RMO) [WeG94] from the SUN Sparc V8 and V9 specifications; processor consistency (PC), described in [Goo89, GLL+90] and used in the Intel Pentium processors; weak ordering (WO) [DSB86, DuS90]; release consistency (RC) [GLL+90]; and the Digital Alpha [Sit92] and IBM/Motorola PowerPC [MSS+94] models. Of course, a particular implementation of a processor may not support all the reorderings that its system specification allows. The system specification defines the semantic interface for that architecture, i.e., what orderings the programmer must assume might happen; the implementation determines what reorderings actually happen and how much performance can actually be gained.

Let us discuss some of these specifications or consistency models, using the relaxations in program order that they allow as our primary axis for grouping models together [Gha95]. The first set of models only allows a read to bypass (complete before) an earlier incomplete write in program order; that is, it allows write->read reordering (TSO, PC). The next set also allows writes to bypass previous writes, i.e., write->write reordering (PSO). The final set allows reads or writes to bypass previous reads as well; that is, it allows all reorderings among read and write accesses (WO, RC, RMO, Alpha, PowerPC). Note that a read-modify-write operation is treated as being both a read and a write, so it is reordered with respect to another operation only if both a read and a write can be reordered with respect to that operation. Of course, in all cases we assume basic cache coherence (write propagation and write serialization) and that uniprocessor data and control dependences are maintained within each process.

The specifications discussed have in most cases been motivated by and defined for the processor architectures themselves, i.e., the hardware interface. They are applicable to compilers as well, but since sophisticated compiler optimizations require the ability to reorder all types of accesses, most compilers have not supported as wide a variety of ordering models. In fact, at the programmer's interface, all but the last set of models have their utility limited by the fact that they do not allow many important compiler optimizations. The focus in this discussion is mostly on the models themselves; implementation issues and performance benefits will be discussed in Chapter 11.

Relaxing the Write-to-Read Program Order

The main motivation for this class of models is to allow the hardware to hide the latency of write operations that miss in the first-level cache.
While the write miss is still in the write buffer and not yet visible to other processors, the processor can issue and complete reads that hit in its cache or even a single read that misses in its cache. The benefits of hiding write latency can be substantial, as we shall see in Chapter 11, and most processors can take advantage of this relaxation.
How well do these models preserve the programmer's intuition even without any special operations? The answer is: quite well, for the most part. For example, spinning on a flag for event synchronization works without modification (Figure 9-4(a)). The reason is that the TSO and PC models preserve the ordering of writes, so the write of the flag is not visible until all previous writes have completed in the system. For this reason, most early multiprocessors supported one of these two models, including the Sequent Balance, Encore Multimax, VAX-8800, SparcCenter1000/2000, SGI 4D/240, SGI Challenge, and even the Pentium Pro Quad, and it has been relatively easy to port even complex programs such as operating systems to these machines.

Of course, the semantics of these models is not SC, so there are situations where the differences show through and SC-based intuition is violated. Figure 9-4 shows four code examples, the first three of which we have seen earlier, where we assume that all variables start out with the value 0.

    (a)  P1: A = 1;              P2: while (Flag == 0);
             Flag = 1;               print A;

    (b)  P1: A = 1;              P2: print B;
             B = 1;                  print A;

    (c)  P1: A = 1;              P2: while (A == 0);      P3: while (B == 0);
                                     B = 1;                   print A;

    (d)  P1: A = 1;      (i)     P2: B = 1;      (iii)
             print B;    (ii)        print A;    (iv)

Figure 9-4 Example code sequences repeated to compare TSO, PC and SC.
Both TSO and PC provide the same results as SC for code segments (a) and (b); PC can violate SC semantics for segment (c) (TSO still provides SC semantics); and both TSO and PC can violate SC semantics for segment (d).

In segment (b), SC guarantees that if B is printed as 1 then A too will be printed as 1, since the writes of A and B by P1 cannot be reordered. For the same reason, TSO and PC have the same semantics here. For segment (c), only TSO offers SC semantics and prevents A from being printed as 0; PC does not, because PC does not guarantee write atomicity. Finally, for segment (d), under SC no interleaving of the operations can result in 0 being printed for both A and B. To see why, consider that program order implies the precedence relationships (i) -> (ii) and (iii) -> (iv) in the interleaved total order. If B=0 is observed, it implies (ii) -> (iii), which therefore implies (i) -> (iv). But (i) -> (iv) implies A will be printed as 1. Similarly, a result of A=0 implies B=1. A popular software-only mutual exclusion algorithm called Dekker's algorithm, used in the absence of hardware support for atomic read-modify-write operations [TW97], relies on the property that A and B will not both be read as 0 in this case. SC provides this property, further contributing to its reputation as an intuitive consistency model. Neither TSO nor PC guarantees it, since they allow the read operation corresponding to the print to complete before previous writes are visible.
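To make this concrete, the following C11 sketch (our illustration, not from the original text; it assumes C11 <threads.h> is available, and the thread names are hypothetical) expresses segment (d) of Figure 9-4. Under a model that relaxes the write->read program order, both r1 and r2 can end up 0; uncommenting the two fences, which play the role of the order-enforcing operations discussed next, makes that outcome impossible.

    #include <stdatomic.h>
    #include <stdio.h>
    #include <threads.h>

    atomic_int A, B;                 /* both start at 0 */
    int r1, r2;

    int p1(void *arg) {              /* P1: A = 1; print B; */
        (void)arg;
        atomic_store_explicit(&A, 1, memory_order_relaxed);      /* (i)   */
        /* atomic_thread_fence(memory_order_seq_cst); */         /* W->R  */
        r1 = atomic_load_explicit(&B, memory_order_relaxed);     /* (ii)  */
        return 0;
    }

    int p2(void *arg) {              /* P2: B = 1; print A; */
        (void)arg;
        atomic_store_explicit(&B, 1, memory_order_relaxed);      /* (iii) */
        /* atomic_thread_fence(memory_order_seq_cst); */         /* W->R  */
        r2 = atomic_load_explicit(&A, memory_order_relaxed);     /* (iv)  */
        return 0;
    }

    int main(void) {
        thrd_t t1, t2;
        thrd_create(&t1, p1, NULL);
        thrd_create(&t2, p2, NULL);
        thrd_join(t1, NULL);
        thrd_join(t2, NULL);
        /* Without the fences, r1 == 0 && r2 == 0 is a legal outcome under
           TSO- or PC-like reordering; under SC it is impossible. */
        printf("r1=%d r2=%d\n", r1, r2);
        return 0;
    }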
To ensure SC semantics when desired (e.g., to port a program written under SC assumptions to a TSO system), we need mechanisms to enforce two types of extra orderings: (i) ensuring that a read does not complete before an earlier write (this applies to both TSO and PC), and (ii) ensuring write atomicity for a read operation (this applies only to PC). For the former, different processor architectures provide somewhat different solutions. For example, the SUN Sparc V9 specification [WeG94] provides memory barrier (MEMBAR) or fence instructions of different flavors that can ensure any desired ordering. Here, we would insert a MEMBAR with the write-to-read flavor before the read; this flavor prevents a read that follows it in program order from issuing before a write that precedes it has completed. On architectures that do not have memory barriers, it is possible to achieve the same effect by substituting an atomic read-modify-write operation or sequence for the original read. A read-modify-write is treated as being both a read and a write, so it cannot be reordered with respect to previous writes in these models. Of course, the value written by the read-modify-write must be the same as the value read, to preserve correctness. Replacing a read with a read-modify-write also guarantees write atomicity at that read on machines supporting the PC model. The details of why this works are subtle, and the interested reader can find them in the literature [AGG+93].

Relaxing the Write-to-Read and Write-to-Write Program Orders

Allowing writes as well to bypass previous outstanding writes to different locations allows multiple writes to be merged in the write buffer before updating main memory, and thus allows multiple write misses to be overlapped and to become visible out of order. The motivation is to further reduce the impact of write latency on processor stall time, and to improve communication efficiency between processors by making new data values visible to other processors sooner. SUN Sparc's PSO model [SFC91, Spa91] is the only model in this category. Like TSO, it guarantees write atomicity. Unfortunately, reordering of writes can violate our intuitive SC semantics quite substantially. Even the use of ordinary variables as flags for event synchronization (Figure 9-4(a)) is no longer guaranteed to work: the write of flag may become visible to other processors before the write of A. This model must therefore demonstrate a substantial performance benefit to be attractive. The only additional instruction we need beyond TSO is one that enforces write-to-write ordering within a process's program order. In SUN Sparc V9, this can be achieved by using a MEMBAR instruction with the write-to-write flavor turned on (the earlier Sparc V8 provided a special instruction called store-barrier, or STBAR, for this purpose). For example, to achieve the intuitive semantics we would insert such an instruction between the writes of A and flag.
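As a hedged illustration of where that write-to-write barrier goes (our sketch; the variable names follow the running example), the release fence below plays the role of the write-to-write MEMBAR or STBAR between the writes of A and flag. Note that portable C11 also needs acquire ordering on the reader's side to constrain the compiler, whereas a TSO- or PSO-class machine would not itself reorder the reads.

    #include <stdatomic.h>

    int A;                        /* ordinary data                        */
    atomic_int flag;              /* flag used for event synchronization  */

    void producer(void) {
        A = 1;
        /* Plays the role of the write-to-write MEMBAR (Sparc V8 STBAR):
           keeps the store to A from becoming visible after the store to flag. */
        atomic_thread_fence(memory_order_release);
        atomic_store_explicit(&flag, 1, memory_order_relaxed);
    }

    void consumer(void) {
        while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
            ;                     /* spin until the flag is set           */
        atomic_thread_fence(memory_order_acquire);   /* for the compiler  */
        int a = A;                /* guaranteed to observe A == 1         */
        (void)a;
    }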
Relaxing All Program Orders: Weak Ordering and Release Consistency

In this final class of specifications, no program orders are guaranteed by default (other than data and control dependences within a process, of course). The benefit is that multiple read requests can also be outstanding at the same time, can be bypassed by later writes in program order, and can themselves complete out of order, allowing us to hide read latency as well. These models are particularly useful for dynamically scheduled processors, which can in fact proceed past read misses to other memory references. They are also the only ones that allow many of the reorderings (and even the elimination) of accesses that are performed by compiler optimizations. Given the importance of these compiler optimizations for node performance, as well as their transparency to the programmer, these may in fact be the only reasonable high-performance models for multiprocessors unless compiler analysis of potential violations of consistency makes dramatic advances.

Prominent models in this group are weak ordering (WO) [DSB86, DuS90], release consistency (RC) [GLL+90], Digital Alpha [Sit92], Sparc V9 relaxed memory ordering (RMO) [WeG94], and IBM PowerPC [MSS+94, CSB93]. WO is the seminal model, RC is an extension of WO supported by the Stanford DASH prototype [LLJ+93], and the last three are supported in commercial
architectures. Let us discuss them individually and see how they deal with the problem of providing intuitive semantics despite all the reordering; for instance, how they handle the flag example.

Weak Ordering: The motivation behind the weak ordering model (also known as the weak consistency model) is quite simple. Most parallel programs use synchronization operations to coordinate accesses to data when this is necessary. Between synchronization operations, they do not rely on the order of accesses being preserved. Two examples are shown in Figure 9-5. The left segment uses a lock/unlock pair to delineate a critical section inside which the head of a linked list is updated. The right segment uses flags to control access to variables participating in a producer-consumer interaction (e.g., A and D are produced by P1 and consumed by P2). The key in the flag example is to think of the accesses to the flag variables as synchronization operations, since that is indeed the purpose they serve. If we do this, then in both situations the intuitive semantics are not violated by any read or write reorderings that happen between synchronization accesses (i.e., within the critical section in segment (a) and within the four statements after the while loop in segment (b)). Based on these observations, weak ordering relaxes all program orders by default, and guarantees that orderings will be maintained only at synchronization operations that can be identified as such by the system. Further orderings can be enforced by adding synchronization operations or by labeling some memory operations as synchronization. How appropriate operations are identified as synchronization will be discussed soon.

    (a)  P1, P2, ..., Pn:
             ...
             Lock(TaskQ)
             newTask->next = Head;
             if (Head != NULL)
                 Head->prev = newTask;
             Head = newTask;
             UnLock(TaskQ)
             ...

    (b)  P1:                              P2:
         TOP: while (flag2 == 0);         TOP: while (flag1 == 0);
              A = 1;                           x = A;
              u = B;                           y = D;
              v = C;                           B = 3;
              D = B * C;                       C = D / B;
              flag2 = 0;                       flag1 = 0;
              flag1 = 1;                       flag2 = 1;
              goto TOP;                        goto TOP;

Figure 9-5 Use of synchronization operations to coordinate access to ordinary shared data variables.
The synchronization may be through explicit lock and unlock operations (e.g., implemented using atomic test-&-set instructions) or through flag variables used for producer-consumer sharing.
The left side of Figure 9-6 illustrates the reorderings allowed by weak ordering. Each block with a set of reads/writes represents a contiguous run of non-synchronization memory operations from a processor. Synchronization operations are shown separately. Sufficient conditions to ensure a WO system are as follows. Before a synchronization operation is issued, the processor waits for all previous operations in program order (both reads and writes) to have completed. Similarly, memory accesses that follow the synchronization operation are not issued until the synchronization operation completes. Read, write and read-modify-write operations that are not labeled as synchronization can be arbitrarily reordered between synchronization operations. Especially when synchronization operations are infrequent, as in many parallel programs, WO typically provides considerable reordering freedom to the hardware and compiler.
    Weak Ordering:                              Release Consistency:

      block 1: read/write ... read/write         block 1: read/write ... read/write
      sync                                        acquire (read)
      block 2: read/write ... read/write         block 2: read/write ... read/write
      sync                                        release (write)
      block 3: read/write ... read/write         block 3: read/write ... read/write

Figure 9-6 Comparison of the weak ordering and release consistency models.
The operations in block 1 precede the first synchronization operation, which is an acquire, in program order. Block 2 is between the two synchronization operations, and block 3 follows the second synchronization operation, which is a release.
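The contrast in Figure 9-6 can be sketched with C11 fences and labeled operations (a hedged illustration with hypothetical names, not a complete synchronization library): a WO synchronization operation is fenced on both sides, while an RC acquire only holds back later accesses and an RC release only waits for earlier ones.

    #include <stdatomic.h>

    atomic_int sync_var;

    /* Weak ordering: a synchronization operation waits for everything
       before it (block 1) and holds back everything after it (block 2). */
    int wo_sync_read(void) {
        atomic_thread_fence(memory_order_seq_cst);
        int v = atomic_load_explicit(&sync_var, memory_order_seq_cst);
        atomic_thread_fence(memory_order_seq_cst);
        return v;
    }

    /* Release consistency: an acquire need not wait for earlier accesses
       (block 1); it only delays the accesses that follow it. */
    int rc_acquire(void) {
        return atomic_load_explicit(&sync_var, memory_order_acquire);
    }

    /* A release waits for earlier accesses to complete but does not delay
       the accesses that follow it (block 3). */
    void rc_release(int v) {
        atomic_store_explicit(&sync_var, v, memory_order_release);
    }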
Release Consistency: Release consistency observes that weak ordering does not go far enough. It extends the weak ordering model by distinguishing among the types of synchronization operations and exploiting their associated semantics. In particular, it divides synchronization operations into acquires and releases. An acquire is a read operation (it can also be a read-modify-write) that is performed to gain access to a set of variables; examples include the Lock(TaskQ) operation in part (a) of Figure 9-5 and the accesses to the flag variables within the while conditions in part (b). A release is a write operation (or a read-modify-write) that grants permission to another processor to gain access to some variables; examples include the UnLock(TaskQ) operation in part (a) of Figure 9-5 and the statements setting the flag variables to 1 in part (b).¹

The separation into acquire and release operations can be used to further relax ordering constraints, as shown in Figure 9-6. The purpose of an acquire is to delay memory accesses that follow the acquire in program order until the acquire succeeds. It has nothing to do with accesses that precede it in program order (the accesses in the block labeled 1), so there is no reason to wait for those accesses to complete before the acquire can be issued; that is, the acquire itself can be reordered with respect to previous accesses. Similarly, the purpose of a release operation is to grant access to the new values of data modified before it in program order. It has nothing to do with accesses that follow it (the accesses in the block labeled 3), so these need not be delayed until the release has completed. However, we must wait for the accesses in block 1 to complete before the release is visible to other processors (since we do not know exactly which variables are associated with the release, or which variables the release "protects"²), and similarly we must wait for the acquire
1. Exercise: In Figure 9-5, are the statements setting the flag variables to 0 synchronization accesses?
2. It is possible that the release also grants access to variables beyond those controlled by the preceding acquire. The exact association of variables with synchronization accesses is very difficult to exploit at the hardware level. Software implementations of relaxed models, however, do exploit such optimizations, as we shall see in Section 9.4.
to complete before the operations in block 3 can be performed. Thus, the sufficient conditions for providing an RC interface are as follows: before an operation labeled as a release is issued, the processor waits until all previous operations in program order have completed; and operations that follow an acquire operation in program order are not issued until that acquire operation completes. These are sufficient conditions, and we will see how we might be more aggressive when we discuss alternative approaches to a shared address space that rely on relaxed consistency models for performance.

Digital Alpha, Sparc V9 RMO, and IBM PowerPC memory models: The WO and RC models are specified in terms of using labeled synchronization operations to enforce orders, but they do not take a position on the exact operations (instructions) that must be used. The memory models of some commercial microprocessors provide no ordering guarantees by default, but provide specific hardware instructions, called memory barriers or fences, that can be used to enforce orderings. To implement WO or RC on these microprocessors, operations that the WO or RC program labels as synchronizations (or acquires or releases) cause the compiler to insert the appropriate special instructions, or the programmer can insert these instructions directly.

The Alpha architecture [Sit92], for example, supports two kinds of fence instructions: (i) the memory barrier (MB) and (ii) the write memory barrier (WMB). The MB fence is like a synchronization operation in WO: it waits for all previously issued memory accesses to complete before any new accesses are issued. The WMB fence imposes program order only between writes (it is like the STBAR in PSO). Thus, a read issued after a WMB can still bypass a write access issued before the WMB, but a write access issued after the WMB cannot. The Sparc V9 relaxed memory order (RMO) [WeG94] provides a fence instruction with four flavor bits associated with it. Each bit indicates a particular type of ordering to be enforced between previous and following load/store operations (the four possibilities are read-to-read, read-to-write, write-to-read, and write-to-write orderings). Any combination of these bits can be set, offering a variety of ordering choices. Finally, the IBM PowerPC model [MSS+94, CSB93] provides only a single fence instruction, called SYNC, which is equivalent to Alpha's MB fence. It also differs from the Alpha and RMO models in that writes are not atomic, as in the processor consistency model. The model envisioned is WO, to be synthesized by putting SYNC instructions before and after every synchronization operation. You will see how different models can be synthesized with these primitives in Exercise 9.3.

Table 9-1 Characteristics of various system-centric relaxed models. A "yes" in the appropriate column indicates that those orders can be violated by that system-centric model; a "-" indicates that they cannot.

    Model     W-to-R    W-to-W    R-to-R/W   Read Other's   Read Own      Operations
              reorder   reorder   reorder    Write Early    Write Early   for Ordering
    SC        -         -         -          -              yes           -
    TSO       yes       -         -          -              yes           MEMBAR, RMW
    PC        yes       -         -          yes            yes           MEMBAR, RMW
    PSO       yes       yes       -          -              yes           STBAR, RMW
    WO        yes       yes       yes        -              yes           SYNC
    RC        yes       yes       yes        yes            yes           REL, ACQ, RMW
    RMO       yes       yes       yes        -              yes           various MEMBARs
    Alpha     yes       yes       yes        -              yes           MB, WMB
    PowerPC   yes       yes       yes        yes            yes           SYNC
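Tying the table's "Operations for Ordering" column to code, a lock/unlock pair such as the one used in Figure 9-5(a) might be built as follows (a hedged C11 sketch with hypothetical names): the lock is the acquire (a read-modify-write), the unlock is the release, and on an Alpha- or RMO-style machine the same effect would be obtained by placing MB or suitably flavored MEMBAR instructions around an ordinary test-and-set.

    #include <stdatomic.h>

    typedef struct { atomic_flag f; } spinlock_t;

    spinlock_t task_q_lock = { ATOMIC_FLAG_INIT };

    void lock(spinlock_t *l) {
        /* Acquire: a read-modify-write; accesses after the lock are held
           back until it succeeds. */
        while (atomic_flag_test_and_set_explicit(&l->f, memory_order_acquire))
            ;                                     /* spin */
    }

    void unlock(spinlock_t *l) {
        /* Release: waits for all earlier accesses to become visible before
           the lock appears free to other processors. */
        atomic_flag_clear_explicit(&l->f, memory_order_release);
    }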
The prominent specifications discussed above are summarized in Table 9-1.¹ They have different performance implications and require different kinds of annotations to ensure orderings. It is worth noting again that although several of the models allow some reordering of operations, if program order is defined as seen by the programmer, then only the models that allow both read and write operations to be reordered within sections of code (WO, RC, Alpha, RMO, and PowerPC) allow the flexibility needed by many important compiler optimizations. This may change if substantial improvements are made in compiler analysis to determine what reorderings are possible under a given consistency model. The portability problem of these specifications should also be clear: for example, a program with enough memory barriers to work "correctly" (produce intuitive or sequentially consistent executions) on a TSO system will not necessarily work "correctly" when run on an RMO system; it will need more special operations. Let us therefore examine higher-level interfaces that are convenient for programmers and portable across systems, in each case safely exploiting the performance benefits and reorderings that the system affords.

9.2.2 The Programming Interface

The programming interfaces discussed here are inspired by the WO and RC models, in that they assume that program orders do not have to be maintained at all between synchronization operations. The idea is for the program to ensure that all synchronization operations are explicitly labeled or identified as such. The compiler or runtime library translates these synchronization operations into the appropriate order-preserving operations (memory barriers or fences) called for by the system specification. Then, the system (compiler plus hardware) guarantees sequentially consistent executions even though it may reorder operations between synchronization operations in any way it desires (without violating dependences to a location within a process). This allows the compiler sufficient flexibility between synchronization points for the reorderings it desires, and also allows the processor to perform as many reorderings as permitted by its memory model: if SC executions are guaranteed even with the weaker models that allow all reorderings, they will surely be guaranteed on systems that allow fewer reorderings. The consistency model presented at the programmer's interface should be at least as weak as that at the hardware interface, but it need not be the same.

Programs that label all synchronization events are called synchronized programs. Formal models for specifying synchronized programs have been developed, namely the data-race-free models influenced by weak ordering [AdH90a] and the properly-labeled model [GAG+92] influenced by release consistency [GLL+90]. Interested readers can obtain more details from these references (the differences between them are small). The basic question for the programmer is how to decide which operations to label as synchronization.
1. The relaxation “read own write early” relates to both program order and write atomicity. The processor is allowed to read its own previous write before the write is serialized with respect to other writes to the same location. A common hardware optimization that relies on this relaxation is the processor reading the value of a variable from its own write buffer. This relaxation applies to almost all models without violating their semantics. It even applies to SC, as long as other program order and atomicity requirements are maintained.
This is of course already done in the majority of cases, when explicit, system-specified primitives such as locks and barriers are used. These are usually also easy to distinguish as acquire or release, for memory models such as RC that can take advantage of this distinction; for example, a lock is an acquire and an unlock is a release, and a barrier contains both, since arrival at a barrier is a release (it indicates completion of previous accesses) while leaving it is an acquire (it obtains permission for the new set of accesses). The real question is how to determine which memory operations on ordinary variables (such as our flag variables) should be labeled as synchronization. Often, programmers can identify these easily, since they know when they are using this event-synchronization idiom. The following definitions describe a more general method when all else fails:

Conflicting operations: Two memory operations from different processes are said to conflict if they access the same memory location and at least one of them is a write.

Competing operations: These are a subset of the conflicting operations. Two conflicting memory operations (from different processes) are said to be competing if it is possible for them to appear next to each other in a sequentially consistent total order (execution); that is, to appear one immediately following the other in such an order, with no intervening memory operations on shared data between them.

Synchronized program: A parallel program is synchronized if all competing memory operations have been labeled as synchronization (perhaps differentiated into acquire and release by labeling the read operations as acquires and the write operations as releases).

The fact that "competing" means competing under any possible SC interleaving is an important aspect of the programming interface. Even though the system uses a relaxed consistency model, the reasoning about where annotations are needed can itself be done while assuming an intuitive, SC execution model, shielding the programmer from reasoning directly in terms of reorderings. Of course, if the compiler could automatically determine which operations are conflicting or competing, then the programmer's task would be made a lot simpler. However, since the known analysis techniques are expensive and/or conservative [ShS88, KrY94], the job is almost always left to the programmer.

    P1                              P2
    A = 1;                          ... = flag;      (P2 spinning on flag)
      | po                            | po
    B = 1;                          ... = flag;      (P2 finally reads flag = 1)
      | po                            | po
    flag = 1;                       u = A;
                                      | po
                                    v = B;

    (co arc: from P1's "flag = 1" to the read by P2 that finally observes flag = 1)

Figure 9-7 An example code sequence using spinning on a flag to synchronize producer-consumer behavior.
The arcs labeled "po" show program order, and the arc labeled "co" shows conflict order. Notice that between the write of A by P1 and the read of A by P2 there is a chain of accesses formed by po and co arcs; this is true in all executions of the program. Such chains, which contain at least one po arc, are not present between the accesses to the variable flag (there is only a co arc).

Consider the example in Figure 9-7. The accesses to the variable flag are competing operations
by the above definition. What this really means is that on a multiprocessor they may execute simultaneously on their respective processors, and we have no guarantee about which executes first. Thus, they are also said to constitute data races. In contrast, the accesses to variable A (and
to B) by P1 and P2 are conflicting operations, but they are necessarily separated in any SC interleaving by an intervening write to the variable flag by P1 and a corresponding read of flag by P2. Thus, they are not competing accesses and do not have to be labeled as synchronization.

To be a little more formal, given a particular SC execution order, the conflict order (co) is the order in which the conflicting operations occur (note that in a given total or execution order, one must occur before the other). In addition, we have the program order (po) for each process. Figure 9-7 also shows the program orders and the conflict order for a sample execution of our code fragment. Two accesses are non-competing if under all possible SC executions (interleavings) there always exists a chain of other references between them such that at least one link in the chain is formed by a program-order rather than a conflict-order arc. The complete formalism can be found in the literature [Gha95].

Of course, the definition of synchronized programs allows a programmer to conservatively label more operations than necessary as synchronization, without compromising correctness. In the extreme, labeling all memory operations as synchronization always yields a synchronized program. This extreme will of course deny us the performance benefits that the system might otherwise provide by reordering non-synchronization operations, and on most systems it will yield much worse performance than straightforward SC implementations due to the overhead of the order-preserving instructions that are inserted. The goal is to label only competing operations as synchronization.

In specific circumstances, we may decide not to label some competing accesses as synchronization, because we want to allow data races in the program. We are then no longer guaranteed SC semantics, but we may know through application knowledge that the competing operations are not being used as synchronization and that we do not need such strong ordering guarantees in certain sections of code. An example is the use of the asynchronous rather than the red-black equation solver in Chapter 2. There is no synchronization between barriers (sweeps), so within a sweep the read and write accesses to the border elements of a partition are competing accesses. The program will not satisfy SC semantics on a system that allows access reorderings, but this is acceptable since the solver repeats the sweeps until convergence: even if the processes sometimes read old values in an iteration and sometimes new ones (unpredictably), they will ultimately read updated values and make progress toward convergence. If we had labeled the competing accesses, we would have compromised reordering and performance.

The last issue related to the programming interface is how the above labels are to be specified by the programmer. In many cases, this is quite stylized and already present in the programming language. For example, some parallel programming languages (e.g., High Performance Fortran [HPF93]) allow parallelism to be expressed only in stylized ways from which it is trivial to extract the relevant information. For example, in FORALL loops (loops in which all iterations are independent), only the implicit barrier at the end of the loop needs to be labeled as synchronization: the FORALL specifies that there are no data races within the loop body.
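As a hedged illustration of a synchronized program (ours, with hypothetical names), the flag idiom can be written so that the only competing accesses, those to flag, carry a label; here the C11 _Atomic qualifier plays the role of the declaration-level synchronization attribute discussed below, and everything else is left for the compiler and hardware to reorder freely between the labeled points.

    #include <stdatomic.h>

    int A, B;                   /* ordinary, non-competing data           */
    _Atomic int flag;           /* labeled as a synchronization variable  */

    void producer(void) {
        A = 1;
        B = 3;
        flag = 1;               /* plain assignment to an _Atomic object
                                   defaults to sequential consistency     */
    }

    void consumer(void) {
        while (flag == 0)       /* labeled read: acts as the acquire      */
            ;
        int u = A, v = B;       /* guaranteed to observe the new values   */
        (void)u; (void)v;
    }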
In more general programming models, if programmers use a library of synchronization primitives such as LOCK, UNLOCK, and BARRIER, then the code that implements these primitives can be labeled by the designer of the library; the programmer needn't do anything special. Finally, if the application programmer wants to add further labels at memory operations, for example at flag variable accesses or to preserve some other orders as in the examples of Figure 9-4, we need programming language or library support. A programming language could provide an attribute for variable declarations indicating that all references to that variable are synchronization accesses, or there could be annotations at the statement level, indicating that a particular
access is to be labeled as a synchronization operation. This tells the compiler to constrain its reordering across those points, and the compiler in turn translates these references into the appropriate order-preserving mechanisms for the processor.

9.2.3 Translation Mechanisms

For most microprocessors, translating labels into order-preserving mechanisms amounts to inserting a suitable memory barrier instruction before and/or after each operation labeled as a synchronization (or acquire or release). It would save instructions to have flavor bits associated with individual loads and stores themselves, indicating what orderings to enforce and avoiding the extra instructions, but since these operations are usually infrequent, this is not the direction that most microprocessors have taken so far.

9.2.4 Consistency Models in Real Systems

With the large growth in sales of multiprocessors, modern microprocessors are designed so that they can be seamlessly integrated into these machines. As a result, microprocessor vendors expend substantial effort defining and precisely specifying the memory model presented at the hardware-software interface. While sequential consistency remains the best model for reasoning about the correctness and semantics of programs, for performance reasons many vendors allow orders to be relaxed. Some vendors, like Silicon Graphics (in the MIPS R10000), continue to support SC by allowing out-of-order issue and execution of operations, but not out-of-order completion or visibility. This allows memory operations to be overlapped by the dynamically scheduled processor, but forces them to complete in order. The Intel Pentium family supports a processor consistency model, so reads can complete before previous writes in program order, and many microprocessors from SUN Microsystems support TSO, which allows the same reorderings. Many other vendors have moved to models that allow all orders to be relaxed (e.g., Digital Alpha and IBM PowerPC) and provide memory barriers to enforce orderings where necessary.

At the hardware interface, multiprocessors usually follow the consistency model exported by the microprocessors they use, since this is the easiest thing to do. For example, we saw that the NUMA-Q system exports the processor consistency model of its Pentium Pro processors at this interface. In particular, on a write the ownership and/or data are obtained before the invalidations begin. The processor is allowed to complete its write and go on as soon as the ownership/data is received, and the SCLIC takes care of the invalidation/acknowledgment sequence. It is also possible for the communication assist to alter the model, within limits, but this comes at the cost of design complexity. We have seen an example in the Origin2000, where preserving SC requires that the assist (Hub) reply to the processor on a write only once the exclusive reply and all invalidation acknowledgments have been received (called delayed exclusive replies). The dynamically scheduled processor can then retire its write from the instruction lookahead buffer and allow subsequent operations to retire and complete as well. If the Hub replies as soon as the exclusive reply is received and before the invalidation acknowledgments are received (called eager exclusive replies), then the write will retire and subsequent operations will complete before the write has actually completed, and the consistency model is more relaxed. Essentially, the Hub fools the processor about the completion of the write.
Having the assist cheat the processor can enhance performance but increases complexity, as in the case of eager exclusive replies. The handling of invalidation acknowledgments must now be
done asynchronously by the Hub, through tracking buffers in its processor interface, in accordance with the desired relaxed consistency model. There are also questions about whether subsequent accesses to a block from other processors, forwarded to this processor from the home, should be serviced while invalidations for a write on that block are still outstanding (see Exercise 9.2), and about what happens if the processor has to write that block back to memory due to replacement while invalidations are still outstanding in the Hub. In the latter case, either the writeback has to be buffered by the Hub and delayed until all invalidations are acknowledged, or the protocol must be extended so that a later access to the written-back block is not satisfied by the home until the acknowledgments have been received by the requestor. The extra complexity and the perceived lack of substantial performance improvement led the Origin designers to persist with a sequential consistency model, and to reply to the processor only when all acknowledgments have been received.

On the compiler side, the picture for memory consistency models is not so well defined, complicating matters for programmers. It does not do a programmer much good for the processor to support sequential consistency or processor consistency if the compiler reorders accesses as it pleases before they even get to the processor (as uniprocessor compilers do). Microprocessor memory models are defined at the hardware interface; they tend to be concerned with program order as presented to the processor, and they assume that a separate arrangement will be made with the compiler. As we have discussed, exporting intermediate models such as TSO, processor consistency, and PSO up to the programmer's interface does not allow the compiler enough flexibility for reordering. We might assume that the compiler will most often not reorder the operations that would violate the consistency model, e.g., because most compiler reorderings of memory operations tend to focus on loops, but this is a very dangerous assumption, and sometimes the orders we rely upon do occur in loops. To really use these models at the programmer's interface, uniprocessor compilers would have to be modified to follow these reordering specifications, compromising performance significantly. An alternative approach, mentioned earlier, is to use compiler analysis to determine when operations can be reordered without compromising the consistency model exposed to the user [ShS88, KrY94]. However, the compiler analysis must be able to detect conflicting operations from different processes, which is expensive and currently rather conservative.

More relaxed models like Alpha, RMO, PowerPC, WO, RC, and synchronized programs can be used at the programmer's interface because they allow the compiler the flexibility it needs. In these cases, the mechanisms used to communicate ordering constraints must be heeded not only by the processor but also by the compiler. Compilers for multiprocessors are beginning to do so (see also Section 9.6). Underneath a relaxed model at the programmer's interface, the processor interface can use the same or a stronger ordering model, as we saw in the context of synchronized programs. However, there is now significant motivation to use relaxed models even at the processor interface, to realize the performance potential.
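To see why the compiler side matters, consider the flag spin loop again. With flag declared as an ordinary variable, a uniprocessor compiler is free to keep it in a register and hoist the load out of the loop, so the wait can become an infinite loop no matter how strong the processor's memory model is; a label (sketched here, by assumption, with a C11 atomic declaration) removes that freedom.

    #include <stdatomic.h>

    int flag_plain;                  /* ordinary declaration               */
    _Atomic int flag_labeled;        /* labeled as synchronization         */

    void spin_plain(void) {
        /* A legal uniprocessor optimization may turn this into
           "if (flag_plain == 0) for (;;);" by hoisting the load. */
        while (flag_plain == 0)
            ;
    }

    void spin_labeled(void) {
        /* The label forces the compiler to re-load the variable and to
           respect the ordering constraints at this point. */
        while (atomic_load_explicit(&flag_labeled, memory_order_acquire) == 0)
            ;
    }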
As we move now to discussing alternative approaches to supporting a shared address space with coherent replication of data, we shall see that relaxed consistency models can be critical to performance when we want to support coherence at larger granularities than cache blocks. We will also see that the consistency model can be relaxed even beyond release consistency (can you think how?).
9.3 Overcoming Capacity Limitations

The systems that we have discussed so far provide automatic replication and coherence in hardware only in the processor caches. A processor cache replicates remotely allocated data directly
upon reference, without it being replicated in the local main memory first. Access to main memory has nonuniform cost depending on whether the memory is local or remote, so these systems are traditionally called cache-coherent non-uniform memory access (CC-NUMA) architectures. On a cache miss, the assist determines from the physical address whether to look up local memory and directory state or to send the request directly to a remote home node. The granularity of communication, of coherence, and of allocation in the replication store (the cache) is a cache block. As discussed earlier, a problem with these systems is that the capacity for local replication is limited to the hardware cache. If a block is replaced from the cache, it must be fetched from remote memory if it is needed again, incurring artifactual communication. The goal of the systems discussed in this section is to overcome the replication capacity problem while still providing coherence in hardware and at the fine granularity of cache blocks for efficiency.

9.3.1 Tertiary Caches

There are several ways to achieve this goal. One is to use a large but slower remote access cache, as in the Sequent NUMA-Q and Convex Exemplar [Con93, TSS+96]. This may be needed for functionality anyway if the nodes of the machine are themselves small-scale multiprocessors (to keep track of remotely allocated blocks that are currently in the local processor caches), and it can simply be made larger for performance. It will then also hold replicated remote blocks that have been replaced from the local processor caches. In NUMA-Q, this DRAM remote cache is at least 32 MB, while the sum of the four lowest-level processor caches in a node is only 2 MB. A similar method, sometimes called the tertiary cache approach, is to take a fixed portion of the local main memory and manage it like a remote cache. These approaches replicate data in main memory at fine grain, but they do not automatically migrate, or change the home of, a block to the node that incurs cache misses most often on that block. Space is always allocated for the block in the regular part of main memory at the original home. Thus, if data were not distributed appropriately in main memory by the application, then in the tertiary cache approach, even if only one processor ever accesses a given memory block (so there is no need for multiple copies or replication), the system may end up wasting half the available main memory: there are two copies of each block, but only one is ever used. Also, a statically established tertiary cache is wasteful if its replication capacity is not needed for performance.

The other approach to increasing replication capacity, called cache-only memory architecture or COMA, is to treat all of local memory as a hardware-controlled cache. This approach, which achieves both replication and migration, is discussed in more depth next. In all these cases, replication and coherence are managed at a fine granularity, though this does not necessarily have to be the same as the block size in the processor caches.

9.3.2 Cache-only Memory Architectures (COMA)

In COMA machines, every fine-grained memory block in the entire main memory has a hardware tag associated with it. There is no fixed node where space is always guaranteed to be allocated for a memory block. Rather, data dynamically migrate to, or are replicated at, the main memories of the nodes that access or "attract" them; these main memories organized as caches are therefore called attraction memories.
When a remote block is accessed, it is replicated in the attraction memory as well as brought into the cache, and it is kept coherent in both places by hardware. Since a data block may reside in any attraction memory and may move transparently from one to another, the location of data is decoupled from its physical address, and another addressing mechanism must be used. Migration of a block is obtained through replacement in the attraction
memory: if block x originally resides in node A's main (attraction) memory, then when node B reads it, B obtains a copy (replication); if the copy in A's memory is later replaced by another block that A references, then the only copy of that block left in the system is in B's attraction memory. Thus, we do not have the problem of wasted original copies that we potentially had with the tertiary cache approach. Automatic data migration also has substantial advantages for multiprogrammed workloads in which the operating system may decide to migrate processes among nodes at any time.

Hardware-Software Tradeoffs

The COMA approach introduces clear hardware-software tradeoffs (as do the other approaches). By overcoming the cache capacity limitations of the pure CC-NUMA approach, the goal is to free parallel software from worrying about data distribution in main memory. The programmer can view the machine as if it had a centralized main memory, and worry only about inherent communication and false sharing (of course, cold misses may still be satisfied remotely). While this makes the task of software writers much easier, COMA machines require a lot more hardware support than pure CC-NUMA machines to implement main memory as a hardware cache. This includes per-block tags and state in main memory, as well as the necessary comparators. There is also the extra memory overhead needed for replication in the attraction memories, which we shall discuss shortly. Finally, the coherence protocol for attraction memories is more complicated than the one we saw for processor caches. There are two reasons for this, both having to do with the fact that data move dynamically to where they are referenced and do not have a fixed "home" to back them up. First, the location of the data must be determined upon an attraction memory miss, since it is no longer bound to the physical address. Second, with no space necessarily reserved for the block at a home, it is important to ensure that the last or only copy of a block is not lost from the system by being replaced from its attraction memory. This extra complexity is not a problem in the tertiary cache approach.

Performance Tradeoffs

Performance has its own interesting set of tradeoffs. While the number of remote accesses due to artifactual communication is reduced, COMA machines tend to increase the latency of accesses that do need to be satisfied remotely, including cold, true sharing, and false sharing misses. The reason is that even a cache miss to nonlocal memory first needs to look up the local attraction memory, to see if it holds a local copy of the block. The attraction memory access is also a little more expensive than a standard DRAM access, because the attraction memory is usually implemented to be set-associative, so there may be a tag selection in the critical path. In terms of performance, then, COMA is most likely to be beneficial for applications that have high capacity miss rates in the processor cache (large working sets) to data that are not allocated locally to begin with, and most harmful to applications whose performance is dominated by communication misses. The advantages are also greatest when access patterns are unpredictable or spatially interleaved from different processes at fine grain, so that data placement or replication/migration at page granularity in software would be difficult.
For example, COMA machines are likely to be more advantageous when a two-dimensional array representation is used for a near-neighbor grid computation than when a four-dimensional array representation is used, because in the latter case appropriate data placement at page granularity is not difficult in software. The higher cost of communication may in fact make them perform worse than pure CC-NUMA when four-dimensional arrays are used with data placement. Figure 9-8 summarizes the tradeoffs in
terms of application characteristics.

Figure 9-8 Performance tradeoffs between COMA and non-COMA architectures, by application characteristics:
  • Very low miss rate: the two should perform similarly (e.g., dense linear algebra, N-body).
  • Mostly capacity misses: COMA has a potential advantage due to automatic replication/migration.
  • Mostly coherence misses: COMA may be worse due to the added latency (e.g., small problem sizes for many applications such as FFT and Ocean, and applications with a lot of false sharing).
  • Coarse-grained data access: replication and particularly migration/initial placement may be easy at page granularity even without COMA support (e.g., many regular scientific applications such as Ocean and FFT).
  • Fine-grained data access: COMA is advantageous (e.g., applications with large working sets but irregular data structures and access patterns, such as Raytrace with large data sets).

Let us briefly look at some design options for COMA protocols, and at how they might solve the architectural problems of finding the data on a miss and not losing the last copy of a block.

Design Options: Flat versus Hierarchical Approaches

COMA machines can be built with hierarchical or with flat directory schemes, or even with hierarchical snooping (see Section 8.11.2). Hierarchical directory-based COMA was used in the Data Diffusion Machine prototype [HLH92, Hag92], and hierarchical snooping COMA was used in commercial systems from Kendall Square Research [FBR93]. In these hierarchical COMA schemes, data are found on a miss by traversing the hierarchy, just as with regular hierarchical directories in non-COMA machines. The difference is that while in non-COMA machines there is a fixed home node for a memory block in a processing node, here there is not. When a reference misses in the local attraction memory, it proceeds up the hierarchy until a node is found that indicates the presence of the block in its subtree in the appropriate state. The request then proceeds down the hierarchy to the appropriate processing node (which is at a leaf), guided by directory lookups or snooping at each node along the way.

In flat COMA schemes, there is still no home for a memory block in the sense of a reserved location for the data, but there is a fixed home where just the directory information can be found
[SJG92, Joe95]. This fixed home is determined from either the physical address or a global identifier obtained from the physical address, so the directory information is also decoupled from the actual data. A miss in the local attraction memory goes to the directory to look up the state and the copies, and the directory may be maintained in either a memory-based or a cache-based way. The tradeoffs for hierarchical versus flat directories are very similar to those without COMA (see Section 8.11.2).

Let us see how hierarchical and flat schemes can solve the last-copy replacement problem. If the block being replaced is in shared state, then as long as we are certain that there is another copy of the block in the system, we can safely discard the replaced block. But for a block that is in exclusive state or is the last copy in the system, we must ensure that it finds a place in some other attraction memory and is not thrown away. In the hierarchical case, for a block in shared state we simply go up the hierarchy until we find a node that indicates that a copy of the block exists somewhere in its subtree. Then we can discard the replaced block, as long as we have updated the state information along the way. For a block in exclusive state, we go up the hierarchy until we find a node that has a block in invalid or shared state somewhere in its subtree, which this block can replace. If the replaceable block found is in shared rather than invalid state, then its replacement in turn requires the previous procedure.

In a flat COMA, more machinery is required for the last-copy problem. One mechanism is to label one copy of each memory block as the master copy, and to ensure that the master copy is not dropped upon replacement. A new cache state called master is added in the attraction memory. When data are initially allocated, every block is a master copy. Later, a master copy is either an exclusive copy or one of the shared copies. When a shared copy of a block that is not the master copy is replaced, it can be safely dropped (we may, if we like, send a replacement hint to the directory entry at the home). If a master copy is replaced, a replacement message is sent to the home. The home then chooses another node to send this master copy to, in the hope of finding room, and sets that node to be the master. If all available blocks in that set of the attraction memory at the destination are themselves masters, then the request is sent back to the home, which tries another node, and so on. Otherwise, one of the replaceable blocks in the set is replaced and discarded (some optimizations are discussed in [JoH94]).

Regardless of whether a hierarchical or a flat COMA organization is used, the initial data set of the application should not fill the entire main (attraction) memory, to ensure that there is space in the system for a last copy to find a new residence. To find replaceable blocks, the attraction memories should also be quite highly associative. Not having enough extra (initially unallocated) memory can cause performance problems for several reasons. First, the less the room for replication in the attraction memories, the less effective the COMA nature of the machine is in satisfying cache capacity misses locally. Second, little room for replication implies that replicated blocks are more likely to be replaced to make room for replaced last copies.
And third, the traffic generated by replaced last copies can become substantial when the amount of extra memory is small, which can cause a lot of contention in the system. How much memory should be set aside for replication, and how much associativity is needed, should be determined empirically [JoH94].

Summary: Path of a Read Operation

Consider a flat COMA scheme. A virtual address is first translated to a physical address by the memory management unit. This may cause a page fault and a new mapping to be established, as in a uniprocessor, though the actual data for the page are not loaded into memory. The physical
address is used to look up the cache hierarchy. If it hits, the reference is satisfied. If not, then it must look up the local attraction memory. Some bits from the physical address are used to find the relevant set in the attraction memory, and the tag store maintained by the hardware is used to check for a tag match. If the block is found in an appropriate state, the reference is satisfied. If not, then a remote request must be generated, and the request is sent to the home determined from the physical address. The directory at the home determines where to forward the request, and the owner node uses the address as an index into its own attraction memory to find and return the data. The directory protocol ensures that states are maintained correctly as usual.
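A hedged C rendering of the read path just described is shown below; the sizes, structures, and helper functions are hypothetical stand-ins (a real machine performs these steps in the assist and memory-system hardware).

    #include <stdbool.h>
    #include <stdint.h>
    #include <string.h>

    #define AM_SETS (1u << 10)       /* attraction-memory sets (assumed)   */
    #define AM_WAYS 4                /* associativity (assumed)            */
    #define BLOCK   128              /* coherence block size in bytes      */

    enum { AM_INVALID = 0, AM_SHARED, AM_MASTER, AM_EXCLUSIVE };

    typedef struct { uint64_t tag; int state; uint8_t data[BLOCK]; } am_line_t;
    static am_line_t attraction_mem[AM_SETS][AM_WAYS];

    /* Stubs standing in for the processor caches and the directory protocol. */
    static bool cache_lookup(uint64_t paddr, void *dst) { (void)paddr; (void)dst; return false; }
    static int  home_node(uint64_t paddr) { return (int)(paddr >> 30) & 0x3f; }
    static void request_from_home(int home, uint64_t paddr, void *dst) {
        (void)home; (void)paddr; memset(dst, 0, BLOCK);  /* protocol fills this in */
    }

    void coma_read(uint64_t paddr, void *dst)
    {
        if (cache_lookup(paddr, dst))                /* 1. hit in the cache hierarchy */
            return;

        uint64_t block = paddr / BLOCK;              /* 2. index the local attraction */
        uint64_t set   = block % AM_SETS;            /*    memory and compare tags,   */
        uint64_t tag   = block / AM_SETS;            /*    as in a set-associative cache */
        for (int w = 0; w < AM_WAYS; w++) {
            am_line_t *l = &attraction_mem[set][w];
            if (l->state != AM_INVALID && l->tag == tag) {
                memcpy(dst, l->data, BLOCK);         /* 3. attraction-memory hit      */
                return;
            }
        }

        /* 4. Miss: the request goes to the directory at the home node
              (derived from the physical address), which forwards it to the
              owner; the reply is then installed in the local attraction
              memory and caches. */
        request_from_home(home_node(paddr), paddr, dst);
    }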
9.4 Reducing Hardware Cost

The last of the major issues discussed in this chapter is hardware cost. Reducing cost often implies moving some functionality from specialized hardware to software running on existing or commodity hardware. In this case, the functionality in question is the management of replication and coherence. Since it is much easier for software to control replication and coherence in main memory rather than in the hardware cache, the low-cost approaches tend to provide replication and coherence in main memory as well. The differences from COMA, then, are the higher overhead of communication and, potentially, the granularity at which replication and coherence are managed.

Consider the hardware cost of a pure CC-NUMA approach. The portion of the communication architecture that is on a node can be divided into four parts: the part of the assist that checks for access control violations, the per-block tags that it uses for this purpose, the part that does the actual protocol processing (including intervening in the processor cache), and the network interface itself. To keep data coherent at cache block granularity, the access control part needs to see every load or store to shared data that misses in the cache, so that it can take the necessary protocol action. Thus, for fine-grained access control to be done in hardware, the assist must at least be able to snoop on the local memory system (as well as issue requests to it in response to incoming requests from the network).

For coherence to be managed efficiently, each of the other functional components of the assist can benefit greatly from hardware specialization and integration. Determining access faults (i.e., whether or not to invoke the protocol) quickly requires that the per-block tags of main memory be located close to this part of the assist. The speed with which protocol actions can be invoked and with which the assist can intervene in the processor cache increases as the assist is integrated closer to the cache. Performing protocol operations quickly demands that the assist be either hard-wired or, if programmable (as in the Sequent NUMA-Q), specialized for the types of operations that protocols perform most often (for example, bit-field extractions and manipulations). Finally, moving small pieces of data quickly between the assist and the network interface asks that the network interface be tightly integrated with the assist. Thus, for highest performance we would like these four parts of the communication architecture to be tightly integrated together, with as few bus crossings as possible needed to communicate among them, and we would like this whole communication architecture to be specialized and tightly integrated into the memory system.

Early cache-coherent machines such as the KSR-1 [FBR93] accomplished this by integrating a hard-wired assist into the cache controller and integrating the network interface tightly into the assist. However, modern processors tend to have even their second-level cache controllers on the processor chip, so it is increasingly difficult to integrate the assist into this controller once the
processor is built. The SGI Origin therefore integrates its hard-wired Hub into the memory controller (and the network interface tightly with the Hub), and the Stanford Flash integrates its specialized programmable protocol engine into the memory controller. These approaches use specialized, tightly integrated hardware support for cache coherence, and do not leverage inexpensive commodity parts for the communication architecture. They are therefore expensive, more so in design and implementation time than in the amount of actual hardware needed.

Research efforts are attempting to lower this cost, with several different approaches. One approach is to perform access control in specialized hardware, but delegate much of the other activity to software and commodity hardware. Other approaches perform access control in software as well, and are designed to provide a coherent shared address space abstraction on commodity nodes and networks with no specialized hardware support. The software approaches provide access control either by instrumenting the program code, or by leveraging the existing virtual memory support to perform access control at page granularity, or by exporting an object-based programming interface and providing access control in a runtime layer at the granularity of user-defined objects. Let us discuss each of these approaches, which are currently all at the research stage. More time will be spent on the so-called page-based shared virtual memory approach, because it changes the granularity at which data are allocated, communicated and kept coherent while still preserving the same programming interface as hardware-coherent systems, and because it relies on relaxed memory consistency for performance and requires substantially different protocols.

9.4.1 Hardware Access Control with a Decoupled Assist

While specialized hardware support is used for fine-grained access control in this approach, some or all of the other aspects (protocol processing, tags, and network interface) are decoupled from one another. They can then either use commodity parts attached to less intrusive parts of the node like the I/O bus, or use no extra hardware beyond that on the uniprocessor system. For example, the per-block tags and state can be kept in specialized fast memory or in regular DRAM, and protocol processing can be done in software either on a separate, inexpensive general-purpose processor or even on the main processor itself. The network interface usually has some specialized support for fine-grained communication in all these cases, since otherwise the end-point overheads would dominate. Some possible combinations for how the various functions might be integrated are shown in Figure 9-9. The problem with the decoupled hardware approach, of course, is that it increases the latency of communication, since the interaction of the different components with each other and with the node is slower (e.g. it may involve several bus crossings). More critically, the effective occupancy of the decoupled communication assist is much larger than that of a specialized, integrated assist, which can hurt performance substantially for many applications as we saw in the previous chapter (Section 8.8).

9.4.2 Access Control through Code Instrumentation

It is also possible to use no additional hardware support at all, but perform all the functions needed for fine-grained replication and coherence in main memory in software.
The trickiest part of this is fine-grained access control in main memory, for which a standard uniprocessor does not provide support. To accomplish this in software, individual read and write operations can be
(a) AC, NI, and PP integrated together and into memory controller (SGI Origin)
(b) AC, NI, and PP integrated together on memory bus (Wisc. Typhoon design)
(c) AC and NI integrated together and on memory bus; separate commodity PP also on bus (Wisc. Typhoon-1 design)
(d) Separate AC, commodity NI and commodity PP on memory bus (Wisc. Typhoon-0)
Figure 9-9 Some alternatives for reducing cost over the highly integrated solution. AC is the access control facility, NI is the network interface, and PP is the protocol processing facility (whether hardwired finite state machine or programmable). The highly integrated solution is shown in (a). The more the parts are separated out, the more expensive bus crossings are needed for them to communicate with one another to process a transaction. The alternatives we have chosen are the Typhoon-0 research prototype from the University of Wisconsin and two other proposals. In these designs, the commodity PP in (b) and (c) is a complete processor like the main CPU, with its own cache system.
instrumented in software to look up per-block tag and state data structures maintained in main memory [SFL+94,SGT96]. To the extent that cache misses can be predicted, only the reads and writes that miss in the processor cache hierarchy need to be thus instrumented. The necessary protocol processing can be performed on the main processor or on whatever form of communication assist is provided. In fact, such software instrumentation allows us to provide access control and coherence at any granularity chosen by the programmer and system, even different granularities for different data structures. Software instrumentation incurs a run-time cost, since it inserts extra instructions into the code to perform the necessary checks. The approaches that have been developed use several tricks to reduce the number of checks and lookups needed [SGT96], so the cost of access control in this approach may well be competitive with that in the decoupled hardware approach.
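To make the idea concrete, the following is a minimal sketch of the kind of check such instrumentation inserts around shared accesses; the block size, shared-region layout, and all names (block_state, miss_handler, load_shared, store_shared) are illustrative assumptions rather than the actual Blizzard-S or Shasta code.

/* Sketch of the inserted access-control check; everything here is illustrative. */
#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define BLOCK_SHIFT  6                                    /* 64-byte coherence blocks (assumed)  */
#define SHARED_BYTES (1 << 20)                            /* size of the shared region (assumed) */

enum bstate { INVALID, READONLY, WRITABLE };

static uint32_t shared_mem[SHARED_BYTES / 4];             /* the shared region itself            */
static uint8_t  block_state[SHARED_BYTES >> BLOCK_SHIFT]; /* per-block state table in memory     */

/* Software protocol handler: would send a request, install the block, update state. */
static void miss_handler(void *addr, int is_write) {
    printf("access fault at %p (%s)\n", addr, is_write ? "write" : "read");
}

/* The instrumentation tool rewrites each shared load into something like this. */
static inline uint32_t load_shared(uint32_t *addr) {
    size_t block = ((uintptr_t)addr - (uintptr_t)shared_mem) >> BLOCK_SHIFT;
    if (block_state[block] == INVALID)                    /* inserted access-control check       */
        miss_handler(addr, 0);
    return *addr;                                         /* the original load                   */
}

/* Stores are checked against the WRITABLE state in the same way. */
static inline void store_shared(uint32_t *addr, uint32_t val) {
    size_t block = ((uintptr_t)addr - (uintptr_t)shared_mem) >> BLOCK_SHIFT;
    if (block_state[block] != WRITABLE)
        miss_handler(addr, 1);
    *addr = val;
}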
Protocol processing in software on the main processor also has a significant cost, and the network interface and interconnect in these systems are usually commodity-based and hence less efficient than in tightly-coupled multiprocessors.

9.4.3 Page-based Access Control: Shared Virtual Memory

Another approach to providing access control and coherence with no additional hardware support is to leverage the virtual memory support provided by the memory management units of microprocessors and the operating system. Memory management units already perform access control in main memory at the granularity of pages, for example to detect page faults, and manage main memory as a fully associative cache on the virtual address space. By embedding a coherence protocol in the page fault handlers, we can provide replication and coherence at page granularity [LiH89], and manage the main memories of the nodes as coherent, fully associative caches on a shared virtual address space. Access control now requires no special tags, and the assist need not even see every cache miss. Data enter the local cache only when the corresponding page is already present in local memory. As in the previous two approaches, the processor caches do not have to be kept coherent by hardware, since when a page is invalidated the TLB will not let the processor access its blocks in the cache. This approach is called page-based shared virtual memory, or SVM for short. Since the cost can be amortized over a whole page of data, protocol processing is often done on the main processor itself, and we can more easily do without special hardware support for communication in the network interface. Thus, there is less need for hardware assistance beyond that available on a standard uniprocessor system.

A very simple form of shared virtual memory coherence is illustrated in Figure 9-10, following an invalidation protocol very similar to those in pure CC-NUMA. A few aspects are worthy of note. First, since the memory management units of different processors manage their main memories independently, the physical address of the page in P1’s local memory may be completely different from that of the copy in P0’s local memory, even though the pages have the same (shared) virtual address. Second, a page fault handler that implements the protocol must know or determine from where to obtain the up-to-date copy of a page that it has missed on, or what pages to invalidate, before it can set the page’s access rights as appropriate and return control to the application process. A directory mechanism can be used for this—every page may have a home, determined by its virtual address, and a directory entry maintained at the home—though high-performance SVM protocols tend to be more complex as we shall see.
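As a concrete illustration, here is a hedged sketch of the requesting side of such a protocol, in the spirit of Figure 9-10 and IVY [LiH89]; the helper routines (alloc_frame, map_page, send_page_request, and so on) are stand-ins for the node’s VM system and message layer, not a real interface.

/* Requesting-node side of a simple SVM read-fault handler; all helpers are stubs. */
#include <stdlib.h>

#define PAGE_SIZE 4096
#define NNODES    16

enum prot { NO_ACCESS, READ_ONLY, READ_WRITE };

static void *alloc_frame(void)                                { return malloc(PAGE_SIZE); }
static void  map_page(int vpage, void *frame, int prot)       { (void)vpage; (void)frame; (void)prot; }
static void  send_page_request(int home, int vpage, int prot) { (void)home; (void)vpage; (void)prot; }
static void  wait_for_page_reply(int vpage, void *frame)      { (void)vpage; (void)frame; }

/* The home of a shared virtual page is fixed by its page number (its virtual address). */
static int home_of(int vpage) { return vpage % NNODES; }

/* Invoked from the page fault handler on a read to an unmapped shared page. */
void svm_read_fault(int vpage) {
    void *frame = alloc_frame();                 /* local physical frame, chosen by the OS    */
    send_page_request(home_of(vpage), vpage, READ_ONLY);
    /* The home consults its directory entry for the page and either replies with its
       copy or forwards the request to the current owner, which replies directly.      */
    wait_for_page_reply(vpage, frame);
    map_page(vpage, frame, READ_ONLY);           /* subsequent reads hit; a write will fault  */
}

/* A write fault is handled similarly but asks for ownership; the home then has the
   read-only copies recorded in its directory entry invalidated (unmapped). */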
The problems with page-based shared virtual memory are (i) the high overheads of protocol invocation and processing and (ii) the large granularity of coherence and communication. The former is expensive because most of the work is done in software on a general-purpose uniprocessor. Page faults take time to cause an interrupt and to switch into the operating system and invoke a handler; the protocol processing itself is done in software, and the messages sent to other processors use the underlying message passing mechanisms that are expensive. On a representative SVM system in 1997, the round-trip cost of satisfying a remote page fault ranges from a few hundred microseconds with aggressive system software support to over a millisecond. This should be compared with less than a microsecond needed for a read miss on aggressive hardware-coherent systems.
(Figure layers: the shared virtual address space of pages, the SVM library, and the local memories of nodes P0 and P1; panels (a) and (b) show successive states, with references r1, r2, w3 and r4 triggering requests and an invalidation.)
Figure 9-10 Illustration of Simple Shared Virtual Memory. At the beginning, no node has a copy of the stippled shared virtual page with which we are concerned. Events occur in the order 1, 2, 3, 4. Read 1 incurs a page fault during address translation and fetches a copy of the page to P0 (presumably from disk). Read 2, shown in the same frame, incurs a page fault and fetches a read-only copy to P1 (from P0). This is the same virtual page, but is at two different physical addresses in the two memories. Write 3 incurs a page fault (write to a read-only page), and the SVM library in the page fault handlers determines that P1 has a copy and causes it to be invalidated. P0 now obtains read-write access to the page, which is like the modified or dirty state. When Read 4 by P1 tries to read a location on the invalid page, it incurs a page fault and fetches a new copy from P0 through the SVM library.
In addition, since protocol processing is typically done on the main processor (due to the goal of using commodity uniprocessor nodes with no additional hardware support), even incoming requests interrupt the processor, pollute the cache, and slow down the currently running application thread, which may have nothing to do with that request.

The large granularity of communication and coherence is problematic for two reasons. First, if spatial locality is not very good it causes a lot of fragmentation in communication (only a word is needed but a whole page is fetched). Second, it can easily lead to false sharing, which causes these expensive protocol operations to be invoked frequently. Under a sequential consistency model, invalidations are propagated and performed as soon as the corresponding write is detected, so pages may be frequently ping-ponged back and forth among processors due to either true or false sharing (Figure 9-11 shows an example). This is extremely undesirable in SVM systems due to the high cost of the operations, and it is very important that the effects of false sharing be alleviated.

Using Relaxed Memory Consistency

The solution is to exploit a relaxed memory consistency model such as release consistency, since this allows coherence actions such as invalidations or updates, collectively called write notices, to be postponed until the next synchronization point (writes do not have to become visible until then). Let us continue to assume an invalidation-based protocol. Figure 9-12 shows the same example as Figure 9-11 above: The writes to x by processor P1 will not generate invalidations to the copy of the page at P0 until the barrier is reached, so the effects of false sharing will be greatly mitigated and none of the reads of y by P0 before the barrier will incur page faults.
Figure 9-11 Problem with sequential consistency for SVM. Process P0 repeatedly reads variable y, while process P1 repeatedly writes variable x which happens to fall on the same page as y. Since P1 cannot proceed until invalidations are propagated and acknowledged under SC, the invalidations are propagated immediately and substantial (and very expensive) communication ensues due to this false sharing.
Figure 9-12 Reducing SVM communication due to false sharing with a relaxed consistency model. There is no communication at reads and writes until a synchronization event, at which point invalidations are propagated to make pages coherent. Only the first access to an invalidated page after a synchronization point generates a page fault and hence a request.
Of course, if P0 accesses y after the barrier it will incur a page fault due to false sharing, since the page has now been invalidated. There is a significant difference from how relaxed consistency is typically used in hardware-coherent machines. There, it is used to avoid stalling the processor to wait for acknowledgments (completion), but the invalidations are usually propagated and applied as soon as possible, since this is the natural thing for hardware to do. Although release consistency does not guarantee that the effects of the writes will be seen until the synchronization, in fact they usually will be. The effects of false sharing cache blocks are therefore not reduced much even within a period with no synchronization, nor is the number of network transactions or messages transferred; the goal is mostly to hide latency from the processor. In the SVM case, the system takes the contract literally: Invalidations are actually not propagated until the synchronization points. Of course, this makes it critical for correctness that all synchronization points be clearly labeled and communicated to the system.¹

When should invalidations (or write notices) be propagated from the writer to other copies, and when should they be applied? One possibility is to propagate them when the writer issues a release operation.
(a) Eager Release Consistency
(b) Lazy Release Consistency
Figure 9-13 Eager versus lazy implementations of release consistency. The former performs consistency actions (invalidation) at a release point, while the latter performs them at an acquire. Variables x and y are on the same page. The reduction in communication can be substantial, particularly for SVM.
At a release, invalidations for each page that the processor wrote since its previous release are propagated to all processors that have copies of the page. If we wait for the invalidations to complete, this satisfies the sufficient conditions for RC that were presented earlier. However, even this is earlier than necessary: Under release consistency, a given process does not really need to see the write notice until it itself does an acquire. Propagating and applying write notices to all copies at release points, called eager release consistency (ERC) [CBZ91,CBZ95], is conservative because it does not know when the next acquire by another processor will occur. As shown in Figure 9-13(a), it can send expensive invalidation messages to more processors than necessary (P2 need not have been invalidated by P0), it requires separate messages for invalidations and lock acquisitions, and it may invalidate processors earlier than necessary, thus causing false sharing (see the false sharing between variables x and y, as a result of which the page is fetched twice by P1 and P2, once at the read of y and once at the write of x).
1. In fact, a similar approach can be used to reduce the effects of false sharing in hardware as well. Invalidations may be buffered in hardware at the requestor, and sent out only at a release or a synchronization (depending on whether release consistency or weak ordering is being followed) or when the buffer becomes full. Or they may be buffered at the destination, and only applied at the next acquire point by that destination. This approach has been called delayed consistency [DWB+91] since it delays the propagation of invalidations. As long as processors do not see the invalidations, they continue to use their copies without any coherence actions, alleviating false sharing effects.
The extra messages problem is even more significant when an update-based protocol is used. These issues and alternatives are discussed further in Exercise 9.6.

The best known SVM systems tend to use a lazier form of release consistency, called lazy release consistency or LRC, which propagates and applies invalidations to a given processor not at the release that follows those writes but only at the next acquire by that processor [KCZ92]. On an acquire, the processor obtains the write notices corresponding to all previous release operations that occurred between its previous acquire operation and its current acquire operation. Identifying which are the release operations that occurred before the current acquire is an interesting question. One way to view this is to assume that this means all releases that would have to appear before this acquire in any sequentially consistent¹ ordering of the synchronization operations that preserves dependences. Another way of putting this is that there are two types of partial orders imposed on synchronization references: program order within each process, and dependence order among acquires and releases to the same synchronization variable. Accesses to a synchronization variable form a chain of successful acquires and releases. When an acquire comes to a releasing process P, the synchronization operations that occurred before that release are those that occur before it in the transitive closure of these program orders and dependence orders. These synchronization operations are said to have occurred before it in a causal sense. Figure 9-14 shows an example. By further postponing coherence actions, LRC alleviates the three problems associated with ERC; for example, if false sharing on a page occurs before the acquire but not after it, its ill effects will not be seen (there is no page fault on the read of y in Figure 9-13(b)). On the other hand, some of the work to be done is shifted from release point to acquire point, and LRC is significantly more complex to implement than ERC as we shall see. However, LRC has been found to usually perform better, and is currently the method of choice.

The relationship between these software protocols and consistency models developed for hardware-coherent systems is interesting. First, the software protocols do not satisfy the requirements of coherence that were discussed in Chapter 5, since writes are not automatically guaranteed to be propagated unless the appropriate synchronization is present. Making writes visible only through synchronization operations also makes write serialization more difficult to guarantee—since different processes may see the same writes through different synchronization chains and hence in different orders—and most software systems do not provide this guarantee. The difference between hardware implementations of release consistency and ERC is in when writes are propagated; however, by propagating write notices only at acquires LRC may differ from release consistency even in whether writes are propagated, and hence may allow results that are not permitted under release consistency. LRC is therefore a different consistency model than release consistency, while ERC is simply a different implementation of RC, and it requires greater programming care. However, if a program is properly labeled then it is guaranteed to run “correctly” under both RC and LRC.
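In implementations of LRC such as TreadMarks, this causal ordering is commonly tracked with per-process intervals, vector timestamps, and write notices; the following is a hedged sketch of that bookkeeping, with all names and layout chosen for illustration rather than taken from any actual system.

/* Hedged sketch of LRC bookkeeping: intervals with vector timestamps and write notices. */
#include <stdint.h>

#define NPROCS 16

struct write_notice {                    /* "this page was modified in this interval"     */
    int                  vpage;
    struct write_notice *next;
};

struct interval {                        /* one synchronization interval of one process   */
    uint32_t             vc[NPROCS];     /* vector timestamp at the start of the interval */
    struct write_notice *notices;
    struct interval     *next;
};

/* At an acquire, the last releaser sends every interval (with its write notices) that the
   acquirer has not yet seen according to the acquirer's current vector timestamp.        */
static int unseen(const uint32_t acquirer_vc[NPROCS], const struct interval *iv, int creator) {
    return iv->vc[creator] > acquirer_vc[creator];
}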
1. Or even processor consistent, since RC allows acquires (reads) to bypass previous releases (writes).
Figure 9-14 The causal order among synchronization events (and hence data accesses). The figure shows what synchronization operations are before the acquire A1 (by process P3) or the release R1 by process P2 that enables it. The dotted horizontal lines are time-steps, increasing downward, indicating a possible interleaving in time. The bold arrows show dependence orders along an acquire-release chain, while the dotted arrows show the program orders that are part of the causal order. The A2, R2 and A4, R4 pairs that are untouched by arrows do not “happen before” the acquire of interest in causal order, so the accesses after the acquire are not guaranteed to see the accesses between them.
Example 9-2
Design an example in which LRC produces a different result than release consistency. How would you avoid the problem?
Answer
Consider the code fragment below, assuming the pointer ptr is initialized to NULL.

P1:
lock L1;
ptr = non_null_ptr_val;
unlock L1;

P2:
while (ptr == NULL) {};
lock L1;
a = ptr;
unlock L1;
Under RC and ERC, the new non-null pointer value is guaranteed to propagate to P2 before the unlock (release) by P1 is complete, so P2 will see the new value and jump out of the loop as expected. Under LRC, P2 will not see the write by P1 until it performs its lock (acquire operation); it will therefore enter the while loop, never
see the write by P1, and hence never exit the while loop. The solution is to put the appropriate acquire synchronization before the reads in the while loop, or to label the reads of ptr appropriately as acquire synchronizations.

The fact that coherence information is propagated only at synchronization operations that are recognized by the software SVM layer has an interesting, related implication. It may be difficult to run existing application binaries on relaxed SVM systems, even if those binaries were compiled for systems that support a very relaxed consistency model and they are properly labeled. The reason is that the labels are compiled down to the specific fence instructions used by the commercial microprocessor, and those fence instructions may not be visible to the software SVM layer unless the binaries are instrumented to make them so. Of course, if the source code with the labels is available, then the labels can be translated to primitives recognized by the SVM layer.

Multiple Writer Protocols

Relaxing the consistency model works very well in mitigating the effects of false sharing when only one of the sharers writes the page in the interval between two synchronization points, as in our previous examples. However, it does not in itself solve the multiple-writer problem. Consider the revised example in Figure 9-15.
Figure 9-15 The multiple-writer problem. At the barrier, two different processors have written the same page independently, and their modifications need to be merged.
Now P0 and P1 both modify the same page between the same two barriers. If we follow a single-writer protocol, as in hardware cache coherence schemes, then each of the writers must obtain ownership of the page before writing it, leading to ping-ponging communication even between the synchronization points and compromising the goal of using relaxed consistency in SVM. In addition to relaxed consistency, we need a “multiple-writer” protocol. This is a protocol that allows each processor writing a page between synchronization points to modify its own copy locally, allowing the copies to become inconsistent, and makes the copies consistent only at the next synchronization point, as needed by the consistency model and implementation (lazy or eager). Let us look briefly at some multiple writer mechanisms that can be used with either eager or lazy release consistency.

The first method is used in the TreadMarks SVM system from Rice University [KCD+94]. The idea is quite simple. To capture the modifications to a shared page, it is initially write-protected. At the first write after a synchronization point, a protection violation occurs. At this point, the system makes a copy of the page (called a twin) in software, and then unprotects the actual page so further writes can happen without protection violations. Later, at the next release or incoming
acquire (for ERC or LRC), the twin and the current copy are compared to create a “diff”, which is simply a compact encoded representation of the differences between the two. The diff therefore captures the modifications that processor has made to the page in that synchronization interval. When a processor incurs a page fault, it must obtain the diffs for that page from other processors that have created them, and merge them into its copy of the page. As with write notices, there are several alternatives for when we might compute the diffs and when we might propagate them to other processors with copies of the page (see Exercise 9.6). If diffs are propagated eagerly at a release, they and the corresponding write notices can be freed immediately and the storage reused. In a lazy implementation, diffs and write notices may be kept at the creator until they are requested. In that case, they must be retained until it is clear that no other processor needs them. Since the amount of storage needed by these diffs and write notices can become very large, garbage collection becomes necessary. This garbage collection algorithm is quite complex and expensive, since when it is invoked each page may have uncollected diffs distributed among many nodes [KCD+94].

An alternative multiple writer method gets around the garbage collection problem for diffs while still implementing LRC, and makes a different set of performance tradeoffs [ISL96b,ZIL96]. The idea here is neither to maintain the diffs at the writer until they are requested nor to propagate them to all the copies at a release. Instead, every page has a home memory, just like in flat hardware cache coherence schemes, and the diffs are propagated to the home at a release. The releasing processor can then free the storage for the diffs as soon as it has sent them to the home. The arriving diffs are merged into the home copy of the page (and the diff storage freed there too), which is therefore kept up to date. A processor performing an acquire obtains write notices for pages just as before. However, when it has a subsequent page fault on one of those pages, it does not obtain diffs from all previous writers but rather fetches the whole page from the home. Besides its storage and scalability advantages, this scheme has the advantage that on a page fault only one round-trip message is required to fetch the data, whereas in the previous scheme diffs had to be obtained from all the previous (“multiple”) writers. Also, a processor never incurs a page fault for a page for which it is the home. The disadvantages are that whole pages are fetched rather than diffs (though this can be traded off with storage and protocol processing overhead by storing the diffs at the home and not applying them there, and then fetching diffs from the home), and that the distribution of pages among homes becomes important for performance despite replication in main memory. Which scheme performs better will depend on how the application sharing patterns manifest themselves at page granularity, and on the performance characteristics of the communication architecture.
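The twin-and-diff mechanism itself is small; the following is a hedged sketch of how twin creation, diff encoding and diff application might look, with the page size, word-granularity encoding and all function names chosen purely for illustration (the actual TreadMarks code differs).

/* Hedged sketch of twin/diff handling for a multiple-writer protocol. */
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define PAGE_SIZE      4096
#define WORDS_PER_PAGE (PAGE_SIZE / sizeof(uint32_t))

typedef struct {            /* one modified word: offset within the page plus its new value */
    uint32_t offset;
    uint32_t value;
} diff_entry_t;

/* First write fault in an interval: copy the page into a twin; the real page would then
   be unprotected (e.g. via mprotect) so that further writes proceed at full speed. */
uint32_t *make_twin(const uint32_t *page) {
    uint32_t *twin = malloc(PAGE_SIZE);
    memcpy(twin, page, PAGE_SIZE);
    return twin;
}

/* At a release (ERC) or when servicing an acquire/page request (LRC): encode the words
   that changed. The caller provides a diff buffer large enough for a whole page. */
size_t compute_diff(const uint32_t *twin, const uint32_t *page, diff_entry_t *diff) {
    size_t n = 0;
    for (size_t i = 0; i < WORDS_PER_PAGE; i++) {
        if (twin[i] != page[i]) {
            diff[n].offset = (uint32_t)(i * sizeof(uint32_t));
            diff[n].value  = page[i];
            n++;
        }
    }
    return n;               /* number of modified words; the twin can now be freed */
}

/* A faulting reader merges the diffs of previous writers into its stale copy. */
void apply_diff(uint32_t *page, const diff_entry_t *diff, size_t n) {
    for (size_t i = 0; i < n; i++)
        page[diff[i].offset / sizeof(uint32_t)] = diff[i].value;
}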
Alternative Methods for Propagating Writes

Diff processing—twin creation, diff computation, and diff application—incurs significant overhead, requires substantial additional storage, and can also pollute the first-level processor cache, replacing useful application data. Some recent systems provide hardware support for fine-grained communication in the network interface—particularly fine-grained propagation of writes to remote memories—which can be used to accelerate these home-based, multiple-writer SVM protocols and avoid diffs altogether. As discussed in Chapter 7, the idea of hardware propagation of writes originated in the PRAM [LS88] and PLUS [BR90] systems, and modern examples include the network interface of the SHRIMP multicomputer prototype at Princeton University [BLA+94] and the Memory Channel from Digital Equipment Corporation [GCP96].
For example, the SHRIMP network interface allows unidirectional mappings to be established between a pair of pages on different nodes, so that the writes performed to the source page are automatically snooped and propagated in hardware to the destination page on the other node. To detect these writes, a portion of the “assist” snoops the memory bus. However, it does not have to arbitrate for the memory bus, and the rest of it sits on the I/O bus, so the hardware is not very intrusive in the node. Of course, this “automatic update” mechanism requires that the cache be write-through, or that writes to remotely mapped pages somehow do appear on the memory bus at or before a release. This can be used to construct a home-based multiple writer scheme without diffs as follows [IDF+96,ISL96a]. Whenever a processor asks for a copy of the page, the system establishes a mapping from that copy to the home version of the page (see Figure 9-16).
(a) The automatic update mechanism
(b) Supporting multiple writers using automatic update
Figure 9-16 Using an automatic update mechanism to solve the multiple writer problem. The variables x and y fall on the same page, which has node P2 as its home. If P0 (or P1) were the home, it would not need to propagate automatic updates and would not incur page faults on that page (only the other node would).
Then, writes issued by the non-home processor to the page are automatically propagated by hardware to the home copy, which is thus always kept up to date. The outgoing link that carries modifications to their homes is flushed at a release point to ensure that all updates will reach the home (point-to-point network order is assumed). If a processor receives write notices for a page on an acquire (coherence actions are managed at synchronization points exactly as before), then when it faults on that page it will fetch the entire page from the home.

The DEC Memory Channel interface provides a similar mechanism, but it does not snoop the memory bus. Instead, special I/O writes have to be performed to mapped pages and these writes are automatically propagated to the mapped counterparts in main memory at the other end. Compared to the all-software home-based scheme, these hardware “automatic update” schemes have the advantage of not requiring the creation, processing and application of diffs. On the other hand, they increase data traffic, both on the memory bus when shared pages are cached write-through and in the network: Instead of communicating only the final diff of a page at a release, every write to a page whose home is remote is propagated to the home in fine-grained packets even between synchronization points, so multiple writes to the same location will all be propagated. The other disadvantage, of course, is that it uses hardware support, no matter how nonintrusive.

Even without this hardware support, twinning and diffing are not the only solution to the multiple-writer problem in an all-software scheme. What we basically need to convey is what has
changed on that page during that interval. An alternative mechanism is to use software dirty bits. In this case, a bit is maintained for every word (or block) in main memory that keeps track of whether that block has been modified. This bit is set whenever that word is written, and the words whose bits are set at a release or acquire point constitute the diff equivalent that needs to be communicated. The problem is that the dirty bits must be set and reset in separate instructions from the stores themselves—since microprocessors have no support for this—so the program must be instrumented to add an instruction to set the corresponding bit at every shared store. The instrumentation is much like that needed for the all-software fine-grained coherence schemes mentioned earlier, and similar optimizations can be used to eliminate redundant setting of dirty bits. The advantage of this approach is the elimination of twinning and diff creation; the disadvantage is the extra instructions executed with high frequency.

The discussion so far has focused on the functionality of different degrees of laziness, but has not addressed implementation. How do we ensure that the necessary write notices to satisfy the partial orders of causality get to the right places at the right time, and how do we reduce the number of write notices transferred? There is a range of methods and mechanisms for implementing release consistent protocols of different forms (single versus multiple writer, acquire- versus release-based, degree of laziness actually implemented). The mechanisms, what forms of laziness each can support, and their tradeoffs are interesting, and will be discussed in the Advanced Topics in Section 9.7.2.

Summary: Path of a Read Operation

To summarize the behavior of an SVM system, let us look at the path of a read. We examine the behavior of a home-based system since it is simpler. A read reference first undergoes address translation from virtual to physical address in the processor’s memory management unit. If a local page mapping is found, the cache hierarchy is looked up with the physical address and it behaves just like a regular uniprocessor operation (the cache lookup may of course be done in parallel for a virtually indexed cache). If a local page mapping is not found, a page fault occurs and then the page is mapped in (either from disk or from another node), providing a physical address. If the operating system indicates that the page is not currently mapped on any other node, then it is mapped in from disk with read/write permission, otherwise it is obtained from another node with read-only permission. Now the cache is looked up. If inclusion is preserved between the local memory and the cache, the reference will miss and the block is loaded in from local main memory where the page is now mapped. Note that inclusion means that a page must be flushed from the cache (or the state of the cache blocks changed) when it is invalidated or downgraded from read/write to read-only mode in main memory. If a write reference is made to a read-only page, then a page fault is also incurred and ownership of the page obtained before the reference can be satisfied by the cache hierarchy.

Performance Implications

Lazy release consistency and multiple writer protocols improve the performance of SVM systems dramatically compared to sequentially consistent implementations. However, there are still many performance problems compared with machines that manage coherence in hardware at cache block granularity.
The problems of false sharing and either extra communication or protocol processing overhead do not disappear with relaxed models, and the page faults that remain are still expensive to satisfy. The high cost of communication, and the contention induced by higher end-point processing overhead, magnify imbalances in communication volume and
hence time among processors. Another problem in SVM systems is that synchronization is performed in software through explicit software messages, and is very expensive. This is exacerbated by the fact that page misses often occur within critical sections, artificially dilating them and hence greatly increasing the serialization at critical sections. The result is that while applications with coarse-grained data sharing patterns (little false sharing and communication fragmentation) perform quite well on SVM systems, applications with finer-grained sharing patterns do not migrate well from hardware cache-coherent systems to SVM systems. The scalability of SVM systems is also undetermined, both in performance and in the ability to run large problems, since the storage overheads of auxiliary data structures grow with the number of processors. All in all, it is still unclear whether fine-grained applications that run well on hardware-coherent machines can be restructured to run efficiently on SVM systems as well, and whether or not such systems are viable for a wide range of applications. Research is being done to understand the performance issues and bottlenecks [DKC+93, ISL96a, KHS+97, JSS97], as well as the value of adding some hardware support for fine-grained communication while still maintaining coherence in software at page granularity to strike a good balance between hardware and software support [KoS96, ISL96a, BIS97]. With the dramatically increasing popularity of low-cost SMPs, there is also a lot of research in extending SVM protocols to build coherent shared address space machines as a two-level hierarchy: hardware coherence within the SMP nodes, and software SVM coherence across SMP nodes [ENC+96, SDH+97, SBI+97]. The goal of the outer SVM protocol is to be invoked as infrequently as possible, and only when cross-node coherence is needed.

9.4.4 Access Control through Language and Compiler Support

Language and compiler support can also be enlisted to support coherent replication. One approach is to program in terms of data objects or “regions” of data, and have the runtime system that manages these objects provide access control and coherent replication at the granularity of objects or regions. This “shared object space” programming model motivates the use of even more relaxed memory consistency models, as we shall see next. We shall also briefly discuss compiler-based coherence and approaches that provide a shared address space in software but do not provide automatic replication and coherence.

Object-based Coherence

The release consistency model takes advantage of the when dimension of memory consistency. That is, it tells us by when it is necessary for writes by one process to be performed with respect to another process. This allows successively lazy implementations to be developed, as we have discussed, which delay the performing of writes as long as possible. However, even with release consistency, if the synchronization in the program requires that process P1’s writes be performed or become visible at process P2 by a certain point, then this means that all of P1’s writes to all the data it wrote must become visible (invalidations or updates conveyed), even if P2 does not need to see all the data. More relaxed consistency models take into account the what dimension, by propagating invalidations or updates only for those data that the process acquiring the synchronization may need.
The when dimension in release consistency is specified by the programmer through the synchronization inserted in the program. The question is how to specify the what dimension. It is possible to associate with a synchronization event the set of pages that must be made consistent with respect to that event. However, this is very awkward for a programmer to do.
P1:
write A
acquire (L)
write B
release (L)

P2:
acquire (L)
read B
release (L)
read C

Figure 9-17 Why release consistency is conservative. Suppose A and B are on different pages. Since P1 wrote A and B before releasing L (even though A was not written inside the critical section), invalidations for both A and B will be propagated to P2 and applied to the copies of A and B there. However, P2 does not need to see the write to A, but only that to B. Suppose now P2 reads another variable C that resides on the same page as A. Since the page containing A has been invalidated, P2 will incur a page miss on its access to C due to false sharing. If we could somehow associate the page containing B with the lock L, then only the invalidation of B would need to be propagated on the acquire of L by P2, and the false sharing miss would be saved.
Region- or object-based approaches provide a better solution. The programmer breaks up the data into logical objects or regions (regions are arbitrary, user-specified ranges of virtual addresses that are treated like objects, but do not require object-oriented programming). A runtime library then maintains consistency at the granularity of these regions or objects, rather than leaving it entirely to the operating system to do at the granularity of pages. The disadvantages of this approach are the additional programming burden of specifying regions or objects appropriately, and the need for a sophisticated runtime system between the application and the OS. The major advantages are: (i) the use of logical objects as coherence units can help reduce false sharing and fragmentation to begin with, and (ii) they provide a handle on specifying data logically, and can be used to relax the consistency using the what dimension. For example, in the entry consistency model [BZS93], the programmer associates a set of data (regions or objects) with every synchronization variable such as a lock or barrier (these associations or bindings can be changed at runtime, with some cost). At synchronization events, only those objects or regions that are associated with that synchronization variable are guaranteed to be made consistent. Write notices for other modified data do not have to be propagated or applied. An alternative is to specify the binding for each synchronization event rather than variable. If no bindings are specified for a synchronization variable, the default release consistency model is used. However, the bindings are not hints: If they are specified, they must be complete and correct, otherwise the program will obtain the wrong answer. The need for explicit, correct bindings imposes a substantial burden on the programmer, and sufficient performance benefits have not yet been demonstrated to make this worthwhile. The Jade language achieves a similar effect, though by specifying data usage in a different way [RSL93]. Finally, attempts have been made to exploit the association between synchronization and data implicitly in a page-based shared virtual memory approach, using a model called scope consistency [ISL96b].
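As a rough illustration of what such a binding interface might look like (a hedged sketch only; ec_bind, ec_acquire and ec_release are invented stand-ins and not the actual interface of the entry consistency system cited above):

/* Hedged sketch of an entry-consistency style binding interface. */
#include <stdio.h>

typedef struct { int id; } ec_lock_t;

/* Stubs: a real runtime would record the binding and, at an acquire, make exactly the
   bound data consistent, propagating write notices only for it. */
static void ec_bind(ec_lock_t *l, void *data, unsigned long size) { (void)l; (void)data; (void)size; }
static void ec_acquire(ec_lock_t *l) { (void)l; }
static void ec_release(ec_lock_t *l) { (void)l; }

static int       task_queue[1024];
static ec_lock_t queue_lock = { 1 };

int main(void) {
    ec_bind(&queue_lock, task_queue, sizeof(task_queue));  /* binding may be changed at runtime */
    ec_acquire(&queue_lock);        /* only task_queue is guaranteed consistent here            */
    task_queue[0] = 42;
    ec_release(&queue_lock);        /* write notices need cover only the bound data             */
    printf("%d\n", task_queue[0]);
    return 0;
}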
Compiler-based Coherence

Research has been done on having the compiler keep caches coherent in a shared address space, using some additional hardware support in the processor system. These approaches rely on the compiler (or programmer) to identify parallel loops. A simple approach to coherence is to insert a barrier at the end of every parallel loop, and flush the caches between loops. However, this does not allow any data locality to be exploited in caches across loops, even for data that are not actively shared at all. More sophisticated approaches have been proposed that require support for selective invalidations initiated by the processor and fairly sophisticated hardware support to keep track of version numbers for cache blocks. We shall not describe these here, but the interested reader can refer to the literature [CV90]. Other than non-standard hardware and compiler support for coherence, the major problem with these approaches is that they rely on the automatic parallelization of sequential programs by the compiler, which is simply not there yet for realistic programs.

Shared Address Space without Coherent Replication

Systems in this category support a shared address space abstraction through the language and compiler, but without automatic replication and coherence, just like the Cray T3D and T3E did in hardware. One example is data parallel languages (see Chapter 2) like High Performance Fortran. The distributions of data specified by the user, together with the owner computes rule, are used by the compiler or runtime system to translate off-node memory references to explicit send/receive messages, to make messages larger, or to align data for better spatial locality. Replication and coherence are usually up to the user, which compromises ease of programming, or system software may try to replicate in main memory automatically. Efforts similar to HPF are being made with languages based on C and C++ as well [BBG+93,LRV96].

A more flexible language- and compiler-based approach is taken by the Split-C language [CDG+93]. Here, the user explicitly specifies arrays as being local or global (shared), and for global arrays specifies how they should be laid out among physical memories. Computation may be assigned independently of the data layout, and references to global arrays are converted into messages by the compiler or runtime system based on the layout. The decoupling of computation assignment from data distribution makes the language much more flexible than an owner computes rule for load balancing irregular programs, but it still does not provide automatic support for replication and coherence, which can be difficult to manage. Of course, all these software systems can be easily ported to hardware-coherent shared address space machines, in which case the shared address space, replication and coherence are implicitly provided.
9.5 Putting it All Together: A Taxonomy and Simple-COMA

The approaches to managing replication and coherence in the extended memory hierarchy discussed in this chapter have had different goals: improving performance by replicating in main memory, in the case of COMA, and reducing cost, in the case of SVM and the other systems of the previous section. However, it is useful to examine the management of replication and coherence in a unified framework, since this leads to the design of alternative systems that can pick and choose aspects of existing ones. A useful framework is one that distinguishes the approaches along two closely related axes: the granularities at which they allocate data in the replication store, keep data coherent and communicate data between nodes, and the degree to which they
utilize additional hardware support in the communication assist beyond that available in uniprocessor systems. The two axes are related because some functions are either not possible at fine granularity without additional hardware support (e.g. allocation of data in main memory) or not possible with high performance. The framework applies whether replication is done only in the cache (as in CC-NUMA) or in main memory as well, though we shall focus on replication in main memory for simplicity. Figure 9-18 depicts the overall framework and places different types of systems in it.
Allocation                  Coherence (access control)   Communication    Representative systems
Block (memory/cache) [H]    Block [H]                     Block [H]        COMA, pure CC-NUMA
Page (memory) [S]           Block [H]                     Block [H]        Typhoon, Simple COMA
Page (memory) [S]           Block [S]                     Block [H/S]      Blizzard-S, Shasta
Page (memory) [S]           Page [S]                      Block* [H*]      SHRIMP, Cashmere
Page (memory) [S]           Page [S]                      Page [S]         Ivy, TreadMarks
Figure 9-18 Granularities of allocation, coherence and communication in coherent shared address space systems. The granularities specified are for the replication store in which data are first replicated automatically when they are brought into a node. This level is main memory in all the systems, except for pure CC-NUMA where it is the cache. Below each leaf in the taxonomy are listed some representative systems of that type, and next to each node is a letter indicating whether the support for that function is usually provided in hardware (H) or software (S). The asterisk next to block and H in one case means that not all communication is performed at fine granularity. Citations for these systems include: COMA [HLH+92, SJG92, FBR93], Typhoon [RLW94], Simple COMA [SWC+95], Blizzard-S [SFL+94], Shasta [LGT96], SHRIMP [BLA+94,IDF+96], Cashmere [KoS96], Ivy [LiH89], TreadMarks [KCD+94].
We divide granularities into “page” and “block” (cache block), since these are the most common in systems that do not require a stylized programming model such as objects. However, “block” includes other fine granularities such as individual words, and “page” also covers coarse-grained objects or regions of memory that may be managed by a runtime system. On the left side of the figure are COMA systems, which allocate space for data in main memory at the granularity of cache blocks, using additional hardware support for tags etc. Given replication at cache block granularity, it makes sense to keep data coherent at a granularity at least this fine as well, and also to communicate at fine granularity.¹ On the right side of the figure are systems that allocate and manage space in main memory at page granularity, with no extra hardware needed for this function.
1. This does not mean that fine-grained allocation necessarily implies fine-grained coherence or communication. For example, it is possible to exploit communication at coarse grain even when allocation and coherence are at fine grain, to gain the benefits of large data transfers. However, the situations discussed are indeed the common case, and we shall focus on these.
Such systems may provide access control and hence coherence at page granularity as well, as in SVM, in which case they may either provide or not provide support for fine-grained communication as we have seen. Or they may provide access control and coherence at block granularity as well, using either software instrumentation and per-block tags and state or hardware support for these functions. Systems that support coherence at fine granularity typically provide some form of hardware support for efficient fine-grained communication as well.

9.5.1 Putting it All Together: Simple-COMA and Stache

While COMA and SVM both replicate in main memory, and each addresses some but not all of the limitations of CC-NUMA, they are at two ends of the spectrum in the above taxonomy. While COMA is a hardware-intensive solution and maintains fine granularities, issues in managing main memory such as the last copy problem are challenging for hardware. SVM leaves these complex memory management problems to system software—which also allows main memory to be a fully associative cache through the OS virtual to physical mappings—but its performance may suffer due to the large granularities and software overheads of coherence. The framework in Figure 9-18 leads to interesting ways to combine the cost and hardware simplicity of SVM with the performance advantages and ease of programming of COMA; namely the Simple-COMA [SWC+95] and Stache [RLW94] approaches shown in the middle of the figure. These approaches divide the task of coherent replication in main memory into two parts: (i) memory management (address translation, allocation and replacement), and (ii) coherence, including access control and communication. Like COMA, they provide coherence at fine granularity with specialized hardware support for high performance, but like SVM (and unlike COMA) they leave memory management to the operating system at page granularity.

Let us begin with Simple COMA. The major appeal of Simple COMA relative to COMA is design simplicity. Performing allocation and memory management through the virtual memory system instead of hardware simplifies the hardware protocol and allows fully associative management of the “attraction memory” with arbitrary replacement policies. To provide coherence or access control at fine grain in hardware, each page in a node’s memory is divided into coherence blocks of any chosen size (say a cache block), and state information is maintained in hardware for each of these units. Unlike in COMA there is no need for tags since the presence check is done at page level. The state of the block is checked in parallel with the memory access for that unit when a miss occurs in the hardware cache. Thus, there are two levels of access control: for the page, under operating system control, and if that succeeds then for the block, under hardware control.

Consider the performance tradeoffs relative to COMA. Simple COMA reduces the latency in the path to access the local main memory (which is hopefully the frequent case among cache misses). There is no need for the hardware tag comparison and selection used in COMA, or in fact for the local/remote address check that is needed in pure CC-NUMA machines to determine whether to look up the local memory. On the other hand, every cache miss looks up the local memory (cache), so like in COMA the path of a nonlocal access is longer.
Since a shared page can reside in different physical addresses in different memories, controlled independently by the node operating systems, unlike in COMA we cannot simply send a block’s physical address across the network on a miss and use it at the other end. This means that we must support a shared virtual address space. However, the virtual address issued by the processor is no longer available by the time the attraction memory miss is detected. This requires that the physical address, which is available, be “reverse translated” to a virtual address or other globally
consistent identifier; this identifier is sent across the network and is translated back to a physical address by the other node. This process incurs some added latency, and will be discussed further in Section 9.7.

Another drawback of Simple COMA compared to COMA is that although communication and coherence are at fine granularity, allocation is at page grain. This can lead to fragmentation in main memory when the access patterns of an application do not match page granularity well. For example, if a processor accesses only one word of a remote page, only that coherence block will be communicated but space will be allocated in the local main memory for the whole page. Similarly, if only one word is brought in for an unallocated page, it may have to replace an entire page of useful data (fortunately, the replacement is fully associative and under the control of software, which can make sophisticated choices). In contrast, COMA systems typically allocate space for only that coherence block. Simple COMA is therefore more sensitive to spatial locality than COMA.

An approach similar to Simple COMA is taken in the Stache design proposed for the Typhoon system [RLW94] and implemented in the Typhoon-zero research prototype [RPW96]. Stache uses the tertiary cache approach discussed earlier for replication in main memory, but with allocation managed at page level in software and coherence at fine grain in hardware. Other differences from Simple COMA are that the assist in the Typhoon systems is programmable, and that physical addresses are reverse translated to virtual addresses rather than to other global identifiers (to enable user-level software protocol handlers; see Section 9.7). Designs have also been proposed to combine the benefits of CC-NUMA and Simple COMA [FaW97].

Summary: Path of a Read Reference

Consider the path of a read in Simple COMA. The virtual address is first translated to a physical address by the processor’s memory management unit. If a page fault occurs, space must be allocated for a new page, though data for the page are not loaded in. The virtual memory system decides which page to replace, if any, and establishes the new mapping. To preserve inclusion, data for the replaced page must be flushed or invalidated from the cache. All blocks on the page are set to invalid. The physical address is then used to look up the cache hierarchy. If it hits (it will not if a page fault occurred), the reference is satisfied. If not, then it looks up the local attraction memory, where by now the locations are guaranteed to correspond to that page. If the block of interest is in a valid state, the reference completes. If not, the physical address is “reverse translated” to a global identifier, which plays the role of a virtual address and is sent across the network, guided by the directory coherence protocol. The remote node translates this global identifier to a local physical address, uses this physical address to find the block in its memory hierarchy, and sends the block back to the requestor. The data is then loaded into the local attraction memory and cache, and delivered to the processor.

To summarize the approaches we have discussed, Figure 9-19 summarizes the node structure of the approaches that use hardware support to preserve coherence at fine granularity.
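The path just summarized can be condensed into the following hedged sketch; the two-level check and the reverse translation are the essential points, while the sizes, tables and helper routines (tlb_translate, reverse_translate, remote_fetch, and so on) are invented for illustration.

/* Condensed sketch of the Simple COMA read path: page-level access control through the
   VM system, a block-level state check in hardware, and reverse translation on a miss. */
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

#define BLOCK   64
#define NBLOCKS (1 << 20)

enum bstate { B_INVALID, B_VALID };

static uint8_t block_state[NBLOCKS];                    /* per-block state kept by the assist  */

/* Stubs for the MMU, the OS page fault handler, the cache, and the protocol engine. */
static bool     tlb_translate(uint64_t va, uint64_t *pa) { *pa = va; return true; }
static void     page_fault(uint64_t va)        { (void)va; /* OS maps a page; its blocks start invalid   */ }
static bool     cache_lookup(uint64_t pa)      { (void)pa; return false; }
static uint64_t reverse_translate(uint64_t pa) { return pa; /* RTLB: physical address -> global identifier */ }
static void     remote_fetch(uint64_t gid)     { (void)gid; /* directory protocol fetches the block        */ }

void simple_coma_read(uint64_t va) {
    uint64_t pa;
    if (!tlb_translate(va, &pa)) {               /* page-level access control (OS, page grain)    */
        page_fault(va);
        tlb_translate(va, &pa);
    }
    if (cache_lookup(pa))                        /* ordinary cache hit                             */
        return;
    size_t b = (pa / BLOCK) % NBLOCKS;
    if (block_state[b] == B_VALID)               /* block-level access control (hardware)          */
        return;                                  /* satisfied from the local attraction memory     */
    remote_fetch(reverse_translate(pa));         /* send a global id; home/owner reply with block  */
    block_state[b] = B_VALID;
}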
9.6 Implications for Parallel Software

Let us examine the implications for parallel software of all the approaches discussed in this chapter, beyond those discussed already in earlier chapters.
[Figure 9-19: node organizations for (a) “Pure” CC-NUMA, (b) COMA, (c) Simple COMA, and (d) Tertiary Cache (e.g. Stache). Each node has a processor, L1 and L2 caches, main memory, and a network assist; the COMA variants organize some or all of main memory as a globally coherent attraction memory or a tertiary-cache portion with associated tags and state, and the Simple COMA and Stache assists include a reverse TLB (RTLB).]
Figure 9-19 Logical structure of a typical node in four approaches to a coherent shared address space. A vertical path through the node in the figures traces the path of a read miss that must be satisfied remotely (going through main memory means that main memory must be looked up, though this may be done in parallel with issuing the remote request if speculative lookups are used).
Relaxed memory consistency models require that parallel programs label the desired conflicting accesses as synchronization points. The labeling is usually quite stylized, for example identifying variables that a process spins on in a while loop before proceeding, but sometimes orders must be preserved even without spin-waiting, as in some of the examples in Figure 9-4. A programming language may provide support to label some variables or accesses as synchronization, which the compiler then translates to the appropriate order-preserving instructions. Current programming languages do not provide integrated support for such labeling, but rely on the programmer inserting special instructions or calls to a synchronization library. Labels can also be used by the compiler itself, to restrict its own reorderings of accesses to shared memory.
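As a concrete illustration, here is one way the familiar flag-based fragment might be labeled by hand in C. The FENCE macro is a stand-in for whatever order-preserving instruction or library call the target system provides; mapping it to GCC's __sync_synchronize() full barrier is just one possible realization, and the use of volatile here is an illustrative assumption rather than a complete labeling scheme.

#include <stdio.h>

/* Stand-in for the system's order-preserving operation (assumption). */
#define FENCE() __sync_synchronize()

volatile int A = 0;
volatile int flag = 0;

void producer(void) {           /* e.g. run on P1 */
    A = 1;
    FENCE();                    /* make the write of A visible first     */
    flag = 1;                   /* release-style synchronization write   */
}

void consumer(void) {           /* e.g. run on P2 */
    while (flag == 0)
        ;                       /* acquire-style synchronization read    */
    FENCE();                    /* do not let the read of A be satisfied
                                   before the spin on flag completes     */
    printf("A = %d\n", A);
}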
One mechanism is to declare some (or all) shared variables to be of a “volatile” type, which means that those variables will not be allocated in registers and will not be reordered with respect to the accesses around them. Some new compilers also recognize explicit synchronization calls and disallow reorderings across them (perhaps distinguishing between acquire and release calls), or obey orders with respect to special order-preserving instructions that the program may insert.

The automatic, fine-grained replication and migration provided by COMA machines is designed to allow the programmer to ignore the distributed nature of main memory in the machine. It is very useful when capacity or conflict misses dominate. Experience with parallel applications indicates that because of the sizes of working sets and of caches in modern systems, the migration feature may be more broadly useful than the replication and coherence, although systems with small caches or applications with large, unstructured working sets can benefit from the latter as well. Fine-grained, automatic migration is particularly useful when the data structures in the program cannot easily be distributed appropriately at page granularity, so page migration techniques such as those provided by the Origin2000 may not be so successful; for example, when data that should be allocated on two different nodes fall on the same page (see Section 8.10). Data migration is also very useful in multiprogrammed workloads, consisting even of sequential programs, in which application processes are migrated among processing nodes by the operating system for load balancing. Migrating a process will turn what should be local misses into remote misses, unless the system moves all the migrated process’s data to the new node’s main memory as well. However, this case is likely to be handled quite well by software page migration.

In general, while COMA systems suffer from higher communication latencies than CC-NUMA systems, they may allow a wider class of applications to perform well with less programming effort. However, explicit migration and proper data placement can still be moderately useful even on these systems. This is because ownership requests on writes still go to the directory, which may be remote and does not migrate with the data, and thus can cause extra traffic and contention even if only a single processor ever writes a page.

Commodity-oriented systems put more pressure on software not only to reduce communication volume, but also to orchestrate data access and communication carefully, since the costs and/or granularities of communication are much larger. Consider shared virtual memory, which performs communication and coherence at page granularity. The actual memory access and data sharing patterns interact with this granularity to produce an induced sharing pattern at page granularity, which is the pattern relevant to the system [ISL96a]. Beyond a high communication-to-computation ratio, performance is adversely affected when the induced pattern involves write sharing, and hence multiple writers of the same page, or fine-grained access patterns in which references from different processes are interleaved at a fine granularity. This makes it important to try to structure data accesses so that accesses from different processes tend not to be interleaved at a fine granularity in the address space.
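One simple structuring technique is to pad and align data written by different processes so that no page is written by more than one process. The fragment below is a minimal sketch in C; PAGE_SIZE, NPROCS, the per_proc_t layout and the use of posix_memalign are assumptions chosen for illustration, not requirements of any particular SVM system.

#include <stdlib.h>

#define PAGE_SIZE 4096          /* assumed page size */
#define NPROCS    16

/* Per-process partial results, padded to a whole page each.  Without the
   padding, counters of different processes could share a page, and a
   page-grained protocol would induce write-sharing even though the
   program itself has none. */
typedef struct {
    double sum;
    long   count;
    char   pad[PAGE_SIZE - sizeof(double) - sizeof(long)];
} per_proc_t;

static per_proc_t *partial;

int setup_partials(void) {
    /* Page-aligned allocation so that element i begins on its own page. */
    return posix_memalign((void **)&partial, PAGE_SIZE,
                          NPROCS * sizeof(per_proc_t));
}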
The high cost of synchronization and the dilation of critical sections make it important to reduce the use of synchronization in programming for SVM systems. Finally, the high cost of communication and synchronization may make it more difficult to use task stealing successfully for dynamic load balancing in SVM systems. The remote accesses and synchronization needed for stealing may be so expensive that there is little work left to steal by the time stealing is successful. It is therefore much more important to have a well-balanced initial assignment of tasks in a task-stealing computation in an SVM system than in a hardware-coherent system. In similar ways, the importance of programming and algorithmic optimizations depends on the communication costs and granularities of the system at hand.
9.7 Advanced Topics Before we conclude this chapter, let us discuss two other limitations of the traditional CCNUMA approach, and examine (as promised) the mechanisms and techniques for taking advantage of relaxed memory consistency models in software, for example for shared virtual memory protocols. 9.7.1 Flexibility and Address Constraints in CC-NUMA Systems The two other limitations of traditional CC-NUMA systems mentioned in the introduction to this chapter were the fact that a single coherence protocol is hard-wired into the machine and the potential limitations of addressability in a shared physical address space. Let us discuss each in turn. Providing Flexibility One size never really fits all. It will always be possible to find workloads that would be better served by a different protocol than the one hard-wired into a given machine. For example, while we have seen that invalidation-based protocols overall have advantages over update-based protocols for cache-coherent systems, update-based protocols are advantageous for purely producerconsumer sharing patterns. For a one-word producer-consumer interaction, in an invalidationbased protocol the producer will generate an invalidation which will be acknowledged, and then the consumer will issue a read miss which the producer will satisfy, leading to four network transactions; an update-based protocol will need only one transaction in this case. As another example, if large amounts of pre-determinable data are to be communicated from one node to another, then it may be useful to transfer them in a single explicit large message than one cache block at a time through load and store misses. A single protocol does not even necessarily best fit all phases or all data structures of a single application, or even the same data structure in different phases of the application. If the performance advantages of using different protocols in different situations are substantial, it may be useful to support multiple protocols in the communication architecture. This is particularly likely when the penalty for mismatched protocols is very high, as in commodity-based systems with less efficient communication architectures. Protocols can be altered—or mixed and matched—by making the protocol processing part of the communication assist programmable rather than hard-wired, thus implementing the protocol in software rather than hardware. This is clearly quite natural in page-based coherence protocols where the protocol is in the software page fault handlers. For cache-block coherence it turns out that the requirements placed on a programmable controller by different protocols are usually very similar. On the control side, they all need quick dispatch to a protocol handler based on a transaction type, and support for efficient bit-field manipulation in tags. On the data side, they need high-bandwidth, low-overhead pipelined movement of data through the controller and network interface. The Sequent NUMA-Q discussed earlier provides a pipelined, specialized programmable coherence controller, as does the Stanford FLASH design [KOH+94, HKO+94]. Note that the controller being programmable doesn’t alter the need for specialized hardware support for coherence; those issues remain the same as with a fixed protocol. The protocol code that runs on the controllers in these fine-grained coherent machines operates in privileged mode so that it can use the physical addresses that it sees on the bus directly. Some
researchers also advocate that users be allowed to write their own protocol handlers in user-level software so they can customize protocols to exactly the needs of individual applications [FLR+94]. While this can be advantageous, particularly on machines with less efficient communication architectures, it introduces several complications. One cause of complications is similar to the issue discussed earlier for Simple COMA systems, though now for a different reason. Since the address translation is already done by the processor’s memory management unit by the time a cache miss is detected, the assist sees only physical addresses on the bus. However, to maintain protection, user-level protocol software cannot be allowed access to these physical addresses. So if protocol software is to run on the assist at user-level, the physical addresses must be reverse translated back to virtual addresses before this software can use them. Such reverse translation requires further hardware support and increases latency, and it is complicated to implement. Also, since protocols have many subtle complexities related to correctness and deadlock and are difficult to debug, it is not clear how desirable it is to allow users to write protocols that may deadlock a shared machine. Overcoming Physical Address Space Limitations In a shared physical address space, the assist making a request sends the physical address of the location (or cache line) across the network, to be interpreted by the assist at the other end. As discussed for Simple COMA, this has the advantage that the assist at the other end does not have to translate addresses before accessing physical memory. However, a problem arises when the physical addresses generated by the processor may not have enough bits to serve as global addresses for the entire shared address space, as we saw for the Cray T3D in Chapter 7 (the Alpha 21064 processor emitted only 32 bits of physical address, insufficient to address the 128 GB or 37 bits of physical memory in a 2048-processor machine). In the T3D, segmentation through the annex registers was used to extend the address space, potentially introducing delays into the critical paths of memory accesses. The alternative is to not have a shared physical address space, but send virtual addresses across the network which will be retranslated to physical addresses at the other end. SVM systems as well as Simple COMA systems use this approach of implementing a shared virtual rather than physical address space. One advantage of this approach is that now only the virtual addresses need to be large enough to index the entire shared (virtual) address space; physical addresses need only be large enough to address a given processor’s main memory, which they certainly ought to be. A second advantage, seen earlier, is that each node manages its own address space and translations so more flexible allocation and replacement policies can be used in main memory. However, this approach does require address translation at both ends of each communication. 9.7.2 Implementing Relaxed Memory Consistency in Software The discussion of shared virtual memory schemes that exploit release consistency showed that they can be either single-writer or multiple-writer, and can propagate coherence information at either release or acquire operations (eager and lazy coherence, respectively). 
As mentioned in that discussion, a range of techniques can be used to implement these schemes, successively adding complexity, enabling new schemes, and making the propagation of write notices lazier. The techniques are too complex and require too many structures to implement for hardware cache coherence, and the problems they alleviate are much less severe in that case due to the fine gran-
ularity of coherence, but they are quite well suited to software implementation. This section examines the techniques and their tradeoffs. The basic questions for coherence protocols that were raised in Section 8.2 are answered somewhat differently in SVM schemes. To begin with, the protocol is invoked not only at data access faults as in the earlier hardware cache coherence schemes, but both at access faults (to find the necessary data) as well as at synchronization points (to communicate coherence information in write notices). The next two questions—finding the source of coherence information and determining with which processors to communicate—depend on whether release-based or acquirebased coherence is used. In the former, we need to send write notices to all valid copies at a release, so we need a mechanism to keep track of the copies. In the latter, the acquirer communicates with only the last releaser of the synchronization variable and pulls all the necessary write notices from there to only itself, so there is no need to explicitly keep track of all copies. The last standard question is communication with the necessary nodes (or copies), which is done with point-to-point messages. Since coherence information is not propagated at individual write faults but only at synchronization events, a new question arises, which is how to determine for which pages write notices should be sent. In release-based schemes, at every release write notices are sent to all currently valid copies. This means that a node has only to send out write notices for writes that it performed since its previous release. All previous write notices in a causal sense have already been sent to all relevant copies, at the corresponding previous releases. In acquire-based methods, we must ensure that causally necessary write notices, which may have been produced by many different nodes, will be seen even though the acquirer goes only to the previous releaser to obtain them. The releaser cannot simply send the acquirer the write notices it has produced since its last release, or even the write notices it has produced ever. In both release- and acquire-based cases, several mechanisms are available to reduce the number of write notices communicated and applied. These include version numbers and time-stamps, and we shall see them as we go along. To understand the issues more clearly, let us first examine how we might implement single-writer release consistency using both release-based and acquire-based approaches. Then, we will do the same thing for multiple-writer protocols. Single Writer with Coherence at Release The simplest way to maintain coherence is to send write notices at every release to all sharers of the pages that the releasing processor wrote since its last release. In a single-writer protocol, the copies can be kept track of by making the current owner of a page (the one with write privileges) maintain the current sharing list, and transferring the list at ownership changes (when another node writes the page). At a release, a node sends write notices for all pages it has written to the nodes indicated on its sharing lists (see Exercise 9.6). There are two performance problems with this scheme. First, since ownership transfer does not cause copies to be invalidated (read-only copies may co-exist with one writable copy), a previous owner and the new owner may very well have the same nodes on their sharing lists. 
When both of them reach their releases (whether of the same synchronization variable or different ones) they will both send invalidations to some of the same pages, so a page may receive multiple (unnecessary) invalidations. This problem can be solved by using a single designated place to keep track of which copies of a page have already been invalidated, e.g. by using directories to keep track of
sharing lists instead of maintaining them at the owners. The directory is looked up before write notices are sent, and invalidated copies are recorded so multiple invalidations won’t be sent to the same copy.
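The following sketch, in the same illustrative C style as the earlier fragments, shows how a release might drive invalidations through such a directory. The dir_entry layout, the bitmap representation of the sharing list, and the message helpers are assumptions made for the sketch; a real implementation would perform the directory manipulation at (or via messages to) the page's directory node rather than through a local pointer.

#define MAX_NODES 64

struct dir_entry {                    /* kept at the page's directory node */
    unsigned long sharers;            /* bitmap of nodes with valid copies */
};

extern struct dir_entry *directory_lookup(unsigned long page);
extern void send_write_notice(int node, unsigned long page);
extern void wait_for_acks(void);

/* At a release: for every page written since the previous release, have
   the directory invalidate the other copies, recording each one so that
   the same copy is never invalidated twice. */
void release_written_pages(int me, const unsigned long *dirty, int ndirty) {
    for (int i = 0; i < ndirty; i++) {
        struct dir_entry *d = directory_lookup(dirty[i]);
        for (int n = 0; n < MAX_NODES; n++) {
            if (n != me && (d->sharers & (1UL << n))) {
                send_write_notice(n, dirty[i]);
                d->sharers &= ~(1UL << n);    /* copy now invalidated */
            }
        }
    }
    wait_for_acks();                  /* before the release completes */
}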
Figure 9-20 Invalidation of a more recent copy in a simple single-writer protocol.
P2 and P3 have read page x earlier, and are therefore on P1’s sharing list. Processor P2 then steals ownership from P1, and P3 from P2, but P1’s release happens after all this. At the release, P1 sends invalidations to both P2 and P3. P2 applies the invalidation even though its copy is more recent than P1’s, since it does not know this, while P3 does not apply the invalidation since it is the current owner.
The second problem is that a release may invalidate a more recent copy of the page (but not the most recent), which it needn’t have done, as illustrated in Figure 9-20. This can be solved by associating a version number with each copy of a page. A node increments its version number for a page whenever it obtains ownership of that page from another node. Without directories, a processor will send write notices to all sharers (since it doesn’t know their version numbers) together with its version number for that page, but only the receivers that have smaller version numbers will actually invalidate their pages. With directories, we can reduce the write notice traffic as well, not just the number of invalidations applied, by maintaining the version numbers at the directory entry for the page and only sending write notices to copies that have lower version numbers than the releaser. Both ownership and write notice requests come to the directory, so this is easy to manage. Using directories and version numbers together solves both the above problems. However, this is still a release-based scheme, so invalidations may be sent out and applied earlier than necessary for the consistency model, causing unnecessary page faults.

Single Writer with Coherence at Acquire

A simple way to use the fact that coherence activity is not needed until the acquire is to still have the releaser send out write notices to all copies, but to do this not at the release but only when the next acquire request comes in. This delays the sending of write notices; however, the acquire must wait until the releaser has sent out the write notices and acknowledgments have been received, which slows down the acquire operation. The best bet for such fundamentally release-based approaches would be to send write notices out at the release, and wait for acknowledgments only before responding to the next incoming acquire request. This allows the propagation of write notices to be overlapped with the computation done between the release and the next incoming acquire. Consider now the lazier, pull-based method of propagating coherence operations at an acquire, from the releaser to only that acquirer. The acquirer sends a request to the last releaser (or current
holder) of the synchronization variable, whose identity it can obtain from a designated manager node for that variable. This is the only place from which the acquirer obtains information, and it must see all writes that have happened before it in the causal order. With no additional support, the releaser must send to the acquirer all write notices that the releaser has either produced so far or received from others (at least since the last time it sent the acquirer write notices, if it keeps track of that). It cannot send only those that it has itself produced since the last release, since it has no idea how many of the necessary write notices the acquirer has already seen through previous acquires from other processors. The acquirer must keep those write notices to pass on to the next acquirer. Carrying around an entire history of write notices is obviously not a good idea. Version numbers, incremented at changes of ownership as before, can help reduce the number of invalidations applied if they are communicated along with the write notices, but they do not help reduce the number of write notices sent. The acquirer cannot communicate version numbers to the releaser for this, since it has no idea what pages the releaser wants to send it write notices for. Directories with version numbers don’t help, since the releaser would have to send the directory the history of write notices. In fact, what the acquirer wants is the write notices corresponding to all releases that precede it causally and that it hasn’t already obtained through its previous acquires. Keeping information at page level doesn’t help, since neither acquirer nor releaser knows which pages the other has seen and neither wants to send all the information it knows.

The solution is to establish a system of virtual time. Conceptually, every node keeps track of the virtual time period up to which it has seen write notices from each other node. An acquire sends the previous releaser this time vector; the releaser compares it with its own, and sends the acquirer write notices corresponding to time periods the releaser has seen but the acquirer hasn’t. Since the partial orders to be satisfied for causality are based on synchronization events, associating time increments with synchronization events lets us represent these partial orders explicitly. More precisely, the execution of every process is divided into a number of intervals, a new one beginning (and a local interval counter for that process incrementing) whenever the process successfully executes a release or an acquire. The intervals of different processes are partially ordered by the desired precedence relationships discussed earlier: (i) intervals on a single process are totally ordered by the program order, and (ii) an interval on process P precedes an interval on process Q if its release precedes Q’s acquire for that interval in a chain of release-acquire operations on the same variable. However, interval numbers are maintained and incremented locally per process, so the partial order (ii) does not mean that the acquiring process’s interval number will be larger than the releasing process’s interval number. What it does mean, and what we are trying to ensure, is that if a releaser has seen write notices for interval 8 from process X, then the next acquirer should also have seen at least interval 8 from process X before it is allowed to complete its acquire.
To keep track of what intervals a process has seen and hence preserve the partial orders, every process maintains a vector timestamp for each of its intervals [KCD+94]. Let VPi be the vector timestamp for interval i on process P. The number of elements in the vector VPi equals the number of processes. The entry for process P itself in VPi is equal to i. For any other process Q, the entry denotes the most recent interval of process Q that precedes interval i on process P in the partial orders. Thus, the vector timestamp indicates the most recent interval from every other process that this process ought to have already received and applied write notices for (through a previous acquire) by the time it enters interval i.
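A minimal C rendering of these timestamps is shown below, in the spirit of the structures described for TreadMarks [KCD+94]; the fixed-size array and the helper names are simplifying assumptions for the sketch.

#define MAX_PROCS 64

/* Vector timestamp for one interval of a process.  vt[q] is the most
   recent interval of process q whose write notices this process must
   already have received and applied when it enters the interval that
   this timestamp describes. */
typedef struct {
    int vt[MAX_PROCS];
} vtime_t;

/* A new interval begins whenever the process successfully executes an
   acquire or a release. */
void begin_interval(vtime_t *cur, int me) {
    cur->vt[me] += 1;
}

/* At an acquire, merge the releaser's timestamp into the acquirer's new
   one: the pairwise maximum records everything the acquirer has now
   seen (write notices for the missing intervals travel with the reply,
   as described in the following paragraphs). */
void merge_timestamps(vtime_t *mine, const vtime_t *releaser) {
    for (int q = 0; q < MAX_PROCS; q++)
        if (releaser->vt[q] > mine->vt[q])
            mine->vt[q] = releaser->vt[q];
}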
On an acquire, a process P needs to obtain from the last releaser R write notices pertaining to intervals (from any processor) that R has seen before its release but the acquirer has not seen through a previous acquire. This is enough to ensure causality: any other intervals that P should have seen from other processes, it would have seen through previous acquires in its program order. P therefore sends its current vector timestamp to R, telling it which are the latest intervals from other processes it has seen before this acquire. R compares P’s incoming vector time-stamp with its own, entry by entry, and piggybacks on its reply to P write notices for all intervals that are in R’s current timestamp but not in P’s (this is conservative, since R’s current time-stamp may include more than R had seen at the time of the relevant release). Since P has now received these write notices, it sets its new vector timestamp, for the interval starting with the current acquire, to the pairwise maximum of R’s vector timestamp and its own previous one. P is now up to date in the partial orders. Of course, this means that unlike in a release-based case, where write notices could be discarded by the releaser once they had been sent to all copies, in the acquire-based case a process must retain the write notices that it has either produced or received until it is certain that no later acquirer will need them. This can lead to significant storage overhead, and may require garbage collection techniques to keep the storage in check.

Multiple Writer with Release-based Coherence

With multiple writers, the type of coherence scheme we use depends on how the propagation of data (e.g. diffs) is managed, i.e. whether diffs are maintained at the multiple writers or propagated at a release to a fixed home. In either case, with release-based schemes a process need only send write notices for the writes it has done since the previous release to all copies, and wait for acknowledgments for them at the next incoming acquire. However, since there is no single owner (writer) of a page at a given time, we cannot rely on an owner having an up-to-date copy of the sharing list. We need a mechanism like a directory to keep track of copies. The next question is how a process finds the necessary data (diffs) when it incurs a page fault after an invalidation. If a home-based protocol is used to manage multiple writers, then this is easy: the page or diffs can be found at the home (the release must wait until the diffs reach the home before it completes). If diffs are maintained at the writers in a distributed form, the faulting process must know not only from where to obtain the diffs but also in what order they should be applied. This is because diffs for the same data may have been produced either in different intervals in the same process, or in intervals on different processes but in the same causal chain of acquires and releases; when they arrive at a processor, they must be applied in accordance with the partial orders needed for causality. The locations of the diffs may be determined from the incoming write notices. However, the order of application is difficult to determine without vector time-stamps. This is why, when homes are not used for data, simple directory and release-based multiple writer coherence schemes use updates rather than invalidations as the write notices: the diffs themselves are sent to the sharers at the release.
Since a release waits for acknowledgments before completing, there is no ordering problem in applying diffs: It is not possible for diffs for a page that correspond to different releases that are part of the same causal order to reach a processor at the same time. In other words, the diffs are guaranteed to reach processors exactly according to the desired partial order. This type of mechanism was used in the Munin system [CBZ91]. Version numbers are not useful with release-based multiple writer schemes. We cannot update version numbers at ownership changes, since there are none, but only at release points. And then version numbers don’t help us save write notices: Since the releaser has just obtained the latest version number for the page at the release itself, there is no question of another node having a
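Multiple-writer schemes typically detect what a node wrote by comparing each dirty page against a "twin" saved at the first write after the page became writable, and encoding the differences as a diff. The fragment below is a schematic version in C; the word-granularity (offset, value) encoding and the fixed page size are illustrative assumptions, not the actual formats used by Munin or TreadMarks.

#define PAGE_WORDS 1024            /* assumed page size in words */

/* Build a diff: one (offset, value) pair per word that differs from the
   twin.  A release-based update protocol would send such diffs to all
   sharers (or to the page's home) before the release completes. */
int make_diff(const long *page, const long *twin, int *off, long *val) {
    int n = 0;
    for (int w = 0; w < PAGE_WORDS; w++) {
        if (page[w] != twin[w]) {
            off[n] = w;
            val[n] = page[w];
            n++;
        }
    }
    return n;                      /* number of modified words */
}

/* Apply a received diff to the local copy of the page. */
void apply_diff(long *page, const int *off, const long *val, int n) {
    for (int i = 0; i < n; i++)
        page[off[i]] = val[i];
}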
more recent version number for that page. Also, since a releaser only has to send the write notices (or diffs) it has produced since its previous release, with an update-based protocol there is no need for vector time-stamps. If diffs were not sent out at a release but retained at the releaser, time-stamps would have been needed to ensure that diffs are applied in the correct order, as discussed above. Multiple Writer with Acquire-based Coherence The coherence issues and mechanisms here are very similar to the single-writer acquire-based schemes (the main difference is in how the data are managed). With no special support, the acquirer must obtain from the last releaser all write notices that it has received or produced since their last interaction, so large histories must be maintained and communicated. Version numbers can help reduce the number of invalidations applied, but not the number transferred (with homebased schemes, the version number can be incremented every time a diff gets to the home, but the releaser must wait to receive this number before it satisfies the next incoming acquire; without homes, a separate version manager must be designated anyway for a page, which causes complications). The best method is to use vector time-stamps as described earlier. The vector time-steps manage coherence for home-based schemes; for non home-based schemes, they also manage the obtaining and application of diffs in the right order (the order dictated by the time-stamps that came with the write notices). To implement LRC, then, a processor has to maintain a lot of auxiliary data structures. These include:
• For every page in its local memory an array, indexed by process, each entry being a list of write notices or diffs received from that process.
• A separate array, indexed by process, each entry being a pointer to a list of interval records. These represent the intervals of that processor for which the current processor has already received write notices. An interval record points to the corresponding list of write notices, and each write notice points to its interval record.
• A free pool for creating diffs.

Since these data structures may have to be kept around for a while, determined by the precedence order established at runtime, they can limit the sizes of problems that can be run and the scalability of the approach. Home-based approaches help reduce the storage for diffs: diffs do not have to be retained at the releaser past the release, and the first array does not have to maintain lists of diffs received. Details of these mechanisms can be found in [KCD+94, ZIL97].
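A rough C sketch of these structures is shown below; the fixed-size arrays, field names, and the statically sized diff pool are simplifications assumed for illustration (an actual implementation such as TreadMarks manages these dynamically and more compactly).

#define MAX_PROCS      64
#define MAX_PAGES      (1 << 14)
#define DIFF_POOL_SIZE (4 << 20)    /* bytes set aside for diffs */

struct write_notice;

/* One interval of some process, as known to this node. */
struct interval_rec {
    int                  proc;      /* whose interval                  */
    int                  id;        /* its interval number             */
    struct write_notice *notices;   /* write notices of that interval  */
    struct interval_rec *next;
};

/* A write notice: a page modified in a given interval. */
struct write_notice {
    unsigned long        page;
    struct interval_rec *interval;  /* back pointer to its interval    */
    struct write_notice *next;      /* next notice for the same page   */
};

/* Per page, an array indexed by process of the write notices (or diffs)
   received from that process. */
static struct write_notice *page_notices[MAX_PAGES][MAX_PROCS];

/* Per process, the list of interval records already received. */
static struct interval_rec *intervals[MAX_PROCS];

/* Free pool from which diffs are carved. */
static char diff_pool[DIFF_POOL_SIZE];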
9.8 Concluding Remarks

The alternative approaches to supporting coherent replication in a shared address space discussed in this chapter raise many interesting sets of hardware-software tradeoffs. Relaxed memory models increase the burden on application software to label programs correctly, but allow compilers to perform their optimizations and hardware to exploit more low-level concurrency. COMA systems require greater hardware complexity but simplify the job of the programmer. Their effect on performance depends greatly on the characteristics of the application (i.e. whether sharing misses or capacity misses dominate inter-node communication). Finally, commodity-based approaches
reduce system cost and provide a better incremental procurement model, but they often require substantially greater programming care to achieve good performance. The approaches are also still controversial, and the tradeoffs have not shaken out. Relaxed memory models are very useful for compilers, but we will see in Chapter 11 that several modern processors are electing to implement sequential consistency at the hardware-software interface, with increasingly sophisticated alternative techniques to obtain overlap and hide latency. Contracts between the programmer and the system have also not been very well integrated into current programming languages and compilers. Full hardware support for COMA was implemented in the KSR-1 [FBR93] but is not very popular in systems being built today because of its high cost. However, approaches similar to Simple COMA are beginning to find acceptance in commercial products. All-software approaches like SVM and fine-grained software access control have been demonstrated to achieve good performance at relatively small scale for some classes of applications. Because they are very easy to build and deploy, they are likely to be used on clusters of workstations and SMPs in several environments. However, the gap between these and all-hardware systems is still quite large in programmability as well as in performance on a wide range of applications, and their scalability has not yet been demonstrated. The commodity-based approaches are still in the research stages, and it remains to be seen if they will become viable competitors to hardware cache-coherent machines for a large enough set of applications. It may also happen that the commoditization of hardware coherence assists and methods to integrate them into memory systems will make these all-software approaches more marginal, for use largely in those environments that do not wish to purchase parallel machines but rather use clusters of existing machines as shared address space multiprocessors when they are otherwise idle.

In the short term, economic reasons might nonetheless provide all-software solutions with another likely role. Since vendors are likely to build tightly-coupled systems with only a few tens or a hundred nodes, connecting these hardware-coherent systems together with a software coherence layer may be the most viable way to construct very large machines that still support the coherent shared address space programming model. While supporting a coherent shared address space on scalable systems with physically distributed memory is well established as a desirable way to build systems, only time will tell how these alternative approaches will shake out.
9.9 References

[Adv93] Sarita V. Adve. Designing Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. Thesis, University of Wisconsin-Madison, 1993. Available as Computer Sciences Technical Report #1198, University of Wisconsin-Madison, December 1993.
[AdH90a] Sarita Adve and Mark Hill. Weak Ordering: A New Definition. In Proceedings of the 17th International Symposium on Computer Architecture, pp. 2-14, May 1990.
[AdH90b] Sarita Adve and Mark Hill. Implementing Sequential Consistency in Cache-Based Systems. In Proceedings of the 1990 International Conference on Parallel Processing, pp. 47-50, August 1990.
[AdH93]
[AGG+93] Computer Sciences, University of Wisconsin - Madison, December 1993. Also available as Stanford University Technical Report CSL-TR-93-595.
[AnL93] Jennifer Anderson and Monica Lam. Global Optimizations for Parallelism and Locality on Scalable Parallel Machines. In Proceedings of the SIGPLAN’93 Conference on Programming Language Design and Implementation, June 1993.
[ENC+96] Andrew Erlichson, Neal Nuckolls, Greg Chesson and John Hennessy. SoftFLASH: Analyzing the Performance of Clustered Distributed Virtual Shared Memory. In Proceedings of the 7th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 210-220, October 1996.
[FLR+94] Babak Falsafi, Alvin R. Lebeck, Steven K. Reinhardt, Ioannis Schoinas, Mark D. Hill, James R. Larus, Anne Rogers, and David A. Wood. Application-Specific Protocols for User-Level Shared Memory. In Proceedings of Supercomputing’94, November 1994.
[FaW97] Babak Falsafi and David A. Wood. Reactive NUMA: A Design for Unifying S-COMA and CC-NUMA. In Proceedings of the 24th International Symposium on Computer Architecture, June 1997.
[FBR93] Steve Frank, Henry Burkhardt III, and James Rothnie. The KSR1: Bridging the Gap Between Shared Memory and MPPs. In Proceedings of COMPCON, pp. 285-294, Spring 1993.
[FCS+93] Jean-Marc Frailong, et al. The Next Generation SPARC Multiprocessing System Architecture. In Proceedings of COMPCON, pp. 475-480, Spring 1993.
[GCP96] R. Gillett, M. Collins and D. Pimm. Overview of Network Memory Channel for PCI. In Proceedings of the IEEE Spring COMPCON ’96, February 1996.
[GLL+90] Kourosh Gharachorloo, et al. Memory Consistency and Event Ordering in Scalable Shared-Memory Multiprocessors. In Proceedings of the 17th International Symposium on Computer Architecture, pp. 15-26, May 1990.
[GGH91] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Two Techniques to Enhance the Performance of Memory Consistency Models. In Proceedings of the 1991 International Conference on Parallel Processing, pp. I:355-364, August 1991.
[GAG+92] Kourosh Gharachorloo, et al. Programming for Different Memory Consistency Models. Journal of Parallel and Distributed Computing, 15(4):399-407, August 1992.
[Gha95] Kourosh Gharachorloo. Memory Consistency Models for Shared-Memory Multiprocessors. Ph.D. Dissertation, Computer Systems Laboratory, Stanford University, December 1995.
[Goo89] James R. Goodman. Cache Consistency and Sequential Consistency. Technical Report No. 1006, Computer Science Department, University of Wisconsin-Madison, February 1989.
[Hag92] Erik Hagersten. Toward Scalable Cache Only Memory Architectures. Ph.D. Dissertation, Swedish Institute of Computer Science, Sweden, October 1992.
[HLH92] E. Hagersten, A. Landin, and S. Haridi. DDM—A Cache Only Memory Architecture. IEEE Computer, vol. 25, no. 9, pp. 44-54, September 1992.
[HeP95] John Hennessy and David Patterson. Computer Architecture: A Quantitative Approach. Second Edition, Morgan Kaufmann Publishers, 1995.
[HKO+94] Mark Heinrich, Jeff Kuskin, Dave Ofelt, John Heinlein, Joel Baxter, J.P. Singh, Rich Simoni, Kourosh Gharachorloo, Dave Nakahira, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Performance Impact of Flexibility on the Stanford FLASH Multiprocessor. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 274-285, October 1994.
[HPF93] High Performance Fortran Forum. High Performance Fortran Language Specification. Scientific Programming, vol. 2, pp. 1-270, 1993.
[IDF+96] L. Iftode, C. Dubnicki, E. W. Felten and Kai Li. Improving Release-Consistent Shared Virtual Memory using Automatic Update. In Proceedings of the Second Symposium on High Performance Computer Architecture, February 1996.
[ISL96a] Liviu Iftode, Jaswinder Pal Singh and Kai Li. Understanding Application Performance on Shared Virtual Memory Systems. In Proceedings of the 23rd International Symposium on Computer Architecture, April 1996.
[ISL96b] Liviu Iftode, Jaswinder Pal Singh and Kai Li. Scope Consistency: A Bridge Between Release Consistency and Entry Consistency. In Proceedings of the Symposium on Parallel Algorithms and Architectures, June 1996.
[JSS97] Dongming Jiang, Hongzhang Shan and Jaswinder Pal Singh. Application Restructuring and Performance Portability on Shared Virtual Memory and Hardware-Coherent Multiprocessors. In Proceedings of the Sixth ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, pp. 217-229, June 1997.
[Joe95] Truman Joe. COMA-F: A Non-Hierarchical Cache Only Memory Architecture. Ph.D. Thesis, Computer Systems Laboratory, Stanford University, March 1995.
[JoH94] Truman Joe and John L. Hennessy. Evaluating the Memory Overhead Required for COMA Architectures. In Proceedings of the 21st International Symposium on Computer Architecture, pp. 82-93, April 1994.
[KCD+94] Peter Keleher, Alan L. Cox, Sandhya Dwarkadas and Willy Zwaenepoel. TreadMarks: Distributed Shared Memory on Standard Workstations and Operating Systems. In Proceedings of the Winter USENIX Conference, pp. 15-132, January 1994.
[KCZ92] Peter Keleher, Alan L. Cox and Willy Zwaenepoel. Lazy Consistency for Software Distributed Shared Memory. In Proceedings of the 19th International Symposium on Computer Architecture, pp. 13-21, May 1992.
[KHS+97] Leonidas Kontothanassis, Galen Hunt, Robert Stets, Nikolaos Hardavellas, Michal Cierniak, Srinivasan Parthasarathy, Wagner Meira, Sandhya Dwarkadas and Michael Scott. VM-Based Shared Memory on Low-Latency, Remote-Memory-Access Networks. In Proceedings of the 24th International Symposium on Computer Architecture, June 1997.
[KoS96] Leonidas I. Kontothanassis and Michael L. Scott. Using Memory-Mapped Network Interfaces to Improve the Performance of Distributed Shared Memory. In Proceedings of the Second Symposium on High Performance Computer Architecture, pp. ??, February 1996.
[KOH+94] Jeff Kuskin, Dave Ofelt, Mark Heinrich, John Heinlein, Rich Simoni, Kourosh Gharachorloo, John Chapin, Dave Nakahira, Joel Baxter, Mark Horowitz, Anoop Gupta, Mendel Rosenblum, and John Hennessy. The Stanford FLASH Multiprocessor. In Proceedings of the 21st International Symposium on Computer Architecture, pp. 302-313, April 1994.
[LiH89] Kai Li and Paul Hudak. Memory Coherence in Shared Virtual Memory Systems. ACM Transactions on Computer Systems, vol. 7, no. 4, pp. 321-359, November 1989.
[LS88] Richard Lipton and Jonathan Sandberg. PRAM: A Scalable Shared Memory. Technical Report No. CS-TR-180-88, Computer Science Department, Princeton University, September 1988.
[LRV96] James R. Larus, Brad Richards and Guhan Viswanathan. Parallel Programming in C**: A Large-Grain Data-Parallel Programming Language. In Gregory V. Wilson and Paul Lu, eds., Parallel Programming Using C++, MIT Press, 1996.
[LLJ+93] Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41-61, January 1993.
[MSS+94] Cathy May, Ed Silha, Rick Simpson, and Hank Warren, editors. The PowerPC Architecture: A Specification for a New Family of RISC Processors. Morgan Kaufmann Publishers, Inc., 1994.
[RLW94] Steven K. Reinhardt, James R. Larus and David A. Wood. Tempest and Typhoon: User-Level Shared Memory. In Proceedings of the 21st International Symposium on Computer Architecture, pp. 325-337, 1994.
[RPW96] Steven K. Reinhardt, Robert W. Pfile and David A. Wood. Decoupled Hardware Support for Distributed Shared Memory. In Proceedings of the 23rd International Symposium on Computer Architecture, 1996.
[RSL93] Martin Rinard, Daniel Scales and Monica Lam. Jade: A High-Level, Machine-Independent Language for Parallel Programming. IEEE Computer, June 1993.
[RSG93] Edward Rothberg, Jaswinder Pal Singh, and Anoop Gupta. Working Sets, Cache Sizes, and Node Granularity Issues for Large-Scale Multiprocessors. In Proceedings of the 20th International Symposium on Computer Architecture, pp. 14-25, May 1993.
[SBI+97]
[SWC+95]
[SGT96]
[SFC91]
[SFL+94]
[ShS88]
[Sit92]
[SJG92]
[SDH+97]
[Spa91]
[TSS+96]
[WeG94]
[WeS94]
[ZIL96]
9.10 Exercises 9.1 General a. Why are update-based coherence schemes relatively incompatible with the sequential consistency memory model in a directory-based machine? b. The Intel Paragon machine discussed in Chapter 7 has two processing elements per node, one of which always executes in kernel mode to support communication. Could this processor be used effectively as a programmable communication assist to support cache coherence in hardware, like the programmable assist of the Stanford FLASH or the Sequent NUMA-Q? 9.2 Eager versus delayed replies to processor. a. In the Origin protocol, with invalidation acknowledgments coming to the home, it would be possible for other read and write requests to come to the requestor that has outstanding invalidations before the acknowledgments for those invalidations come in. (i) Is this possible or a problem with delayed exclusive replies? With eager exclusive replies? If it is a problem, what is the simplest way to solve it? (ii) Suppose you did indeed want to allow the requestor with invalidations outstanding to process incoming read and write requests. Consider write requests first, and construct an example where this can lead to problems. (Hint: consider the case where P1 writes to a location A, P2 writes to location B which is on the same cache block as A and then writes a flag, and P3 spins on the flag and then reads the value of location B.) How would you allow the incoming write to be handled but still maintain correctness? Is it easier if invalidation acknowledgments are collected at the home or at the requestor? (iii) Now answer the same questions above for incoming read requests. (iv) What if you were using an update-based protocol. Now what complexities arise in allowing an incoming request to be processed for a block that has updates outstanding from a previous write, and how might you solve them? (v) Overall, would you choose to allow incoming requests to be processed while invalidations or updates are outstanding, or deny them? b. Suppose a block needs to be written back while there are invalidations pending for it. Can this lead to problems, or is it safe? If it is problematic, how might you address the problem? c. Are eager exclusive replies useful with an underlying SC model? Are they at all useful if the processor itself provides out of order completion of memory operations, unlike the MIPS R10000? d. Do the tradeoffs between collecting acknowledgments at the home and at the requestor change if eager exclusive replies are used instead of delayed exclusive replies? 9.3 Relaxed Memory Consistency Models a. If the compiler reorders accesses according to WO, and the processor’s memory model is SC, what is the consistency model at the programmer’s interface? What if the compiler is SC and does not reorder memory operations, but the processor implements RMO? b. In addition to reordering memory operations (reads and writes) as discussed in the example in the chapter, register allocation by a compiler can also eliminate memory accesses entirely. Consider the example code fragment shown below. Show how register allocation
can violate SC. Can a uniprocessor compiler do this? How would you prevent it in the compiler you normally use?

P1                      P2
A = 1                   while (flag == 0);
flag = 1                u = A
c. Consider all the system specifications discussed in the chapter. Order them in order of weakness, i.e. draw arcs between models such that an arc from model A to model B indicates that A is stronger than B, i.e. that is any execution that is correct under A is also correct under B, but not necessarily vice versa. d. Which of PC and TSO is better suited to update-based directory protocols and why? e. Can you describe a more relaxed system specification than Release Consistency, without explicitly associating data with synchronization as in Entry Consistency? Does it require additional programming care beyond RC? f. Can you describe a looser set of sufficient conditions for WO? For RC? 9.4 More Relaxed Consistency: Imposing orders through fences. a. Using the fence operations provided by the DEC Alpha and Sparc RMO consistency specifications, how would you ensure sufficient conditions for each of the RC, WO, PSO, TSO, and SC models? Which ones do you expect to be efficient and which ones not, and why? b. A write-fence operation stalls subsequent write operations until all of the processor’s pending write operations have completed. A full-fence stalls the processor until all of its pending operations have completed. (i) Insert the minimum number of fence instructions into the following code to make it sequentially consistent. Don’t use a full-fence when a write-fence will suffice. ACQUIRE LOCK1 LOAD A STORE A RELEASE LOCK1 LOAD B STORE B ACQUIRE LOCK1 LOAD C STORE C RELEASE LOCK1 (ii) Repeat part (i) to guarantee release consistency. c. The IBM-370 consistency model is much like TSO, except that it does not allow a read to return the value written by a previous write in program order until that write has completed. Given the following code segments, what combination of values for (u,v,w,x) are not allowed by SC. In each case, do the IBM-370, TSO, PSO, PC, RC and WO models preserve SC semantics without any ordering instructions or labels, or do they not? If not, insert the necessary fence operations to make them conform to SC. Assume that all variables had the value 0 before this code fragment was reached.
(i)
P1                      P2
A = 1                   B = 1
u = A                   v = B
w = B                   x = A

(ii)
P1                      P2
A = 1                   B = 1
C = 1                   C = 2
u = C                   v = C
w = B                   x = A
d. Consider a two-level coherence protocol with snooping-based SMPs connected by a memory-based directory protocol, uisng release consistency. While invalidation acknowledgments are still pending for a write to a memory block, is it okay to supply the data to another processor in (i) the same SMP node, or (ii) a different SMP node? Justify your answers and state any assumptions. 9.5 More Relaxed Consistency: Labeling and Performance a. Can a program that is not properly labeled run correctly on a system that supports release consistency? If so, how, and if not then why not? b. Why are there four flavor bits for memory barriers in the SUN Sparc V9 specification. Why not just two bits, one to wait for all previous writes to complete, and another for previous reads to complete? c. To communicate labeling information to the hardware (i.e. that a memory operations is labeled as an acquire or a release), there are two options: One is to associate the label with the address of the location, the other with the specific operation in the code. What are the tradeoffs between the two? [ d. Two processors P1 and P2 are executing the following code fragments under the sequential consistency (SC) and release consistency (RC) models. Assume an architecture where both read and write misses take 100 cycles to complete. However, you can assume that accesses that are allowed to be overlapped under the consistency model are indeed overlapped. Grabbing a free lock or unlocking a lock takes 100 cycles, and no overlap is possible with lock/unlock operations from the same processor. Assume all the variables and locks are initially uncached and all locks are unlocked. All memory locations are initialized to 0. All memory locations are distinct and map to different indices in the caches.
P1                      P2
LOCK(L1)                LOCK(L1)
A = 1                   x = A
B = 2                   y = B
UNLOCK(L1)              x1 = A
                        UNLOCK(L1)
                        x2 = B
(i) What are the possible outcomes for x and y under SC? What are the possible outcomes of x and y under RC?
(ii) Assume P1 gets the lock first. After how much time will P2 complete under SC? Under RC?
e. Given the following code fragment, we want to compute its execution time under various memory consistency models. Assume a processor architecture with an arbitrarily deep write buffer. All instructions take 1 cycle ignoring memory system effects. Both read and write misses take 100 cycles to complete (i.e. globally perform). Locks are cacheable and loads are non-blocking. Assume all the variables and locks are initially uncached, and all locks are unlocked. Further assume that once a line is brought into the cache it does not get invalidated for the duration of the code’s execution.

LOAD A
STORE B
LOCK (L1)
STORE C
LOAD D
UNLOCK (L1)
LOAD E
STORE F

All memory locations referenced above are distinct; furthermore they all map to different cache lines.
(i) If sequential consistency is maintained, how many cycles will it take to execute this code?
(ii) Repeat part (i) for weak ordering.
(iii) Repeat part (i) for release consistency.
f. The following code is executed on an aggressive dynamically scheduled processor. The processor can have multiple outstanding instructions, the hardware allows for multiple outstanding misses, and the write buffer can hide store latencies (of course, the above features may only be used if allowed by the memory consistency model).

Processor 1:
sendSpecial(int value) {
  A = 1;
  LOCK(L);
  C = D*3;
  E = F*10;
  G = value;
  READY = 1;
  UNLOCK(L);
}

Processor 2:
receiveSpecial() {
  LOCK(L);
  if (READY) {
    D = C+1;
    F = E*G;
  }
  UNLOCK(L);
}
Assume that locks are non-cachable and are acquired either 50 cycles from an issued request or 20 cycles from the time a release completes (whichever is later). A release takes 50 cycles to complete. Read hits take one cycle to complete and writes take one cycle to put into the write buffer. Read misses to shared variables take 50 cycles to complete. Writes take 50 cycles to complete. The write buffer on the processors is sufficiently large that it never fills completely. Only count the latencies of reads and writes to shared variables (those listed in capitals) and the locks. All shared variables are initially uncached with a value of 0. Assume that Processor 1 obtains the lock first. (i) Under SC, how many cycles will it take from the time processor 1 enters sendSpecial() to the time that processor 2 leaves receiveSpecial()? Make sure that you justify your answer. Also make sure to note the issue and completion time of each synchronization event. (ii) How many cycles will it take to return from receiveSpecial() under Release Consistency? 9.6 Shared Virtual Memory a. In eager release consistency diffs are created and also propagated at release time, while in one form of lazy release consistency they are created at release time but propagated only at acquire time, i.e. when the acquire synchronization request comes to the processor.: (i) Describe some other possibilities for when diffs might be created, propagated, and applied in all-software lazy release consistency (think of release time, acquire time, or access fault time). What is the laziest scheme you can design? (ii) Construct examples where each lazier scheme leads to less diffs being created/communicated, and/or fewer page faults. (iii) What complications does each lazier scheme cause in implementation, and which would you choose to implement? b. Delaying the propagation of invalidations until a release point or even until the next acquire point (as in lazy release consistency) can be done in hardware-coherent systems as well. Why is LRC not used in hardware-coherent systems? Would delaying invalidations until a release be advantageous? c. Suppose you had a co-processor to perform the creation and application of diffs in an allsoftware SVM system, and therefore did not have to perform this activity on the main processor. Considering eager release consistency and the lazy variants you designed above, comment on the extent to which protocol processing activity can be overlapped with computation on the main processor. Draw time-lines to show what can be done on the main processor and on the co-processor. Do you expect the savings in performance to be sub-
stantial? What do you think would be the major benefit and implementation complexity of having all protocol processing and management be performed on the co-processor? d. Why is garbage collection more important and more complex in lazy release consistency than in eager release consistency? What about in home-based lazy release consistency? Design a scheme for periodic garbage collection (when and how), and discuss the complications. 9.7 Fine-grained Software Shared Memory a. In systems like Blizzard-S or Shasta that instrument read and write operations in software to provide fine-grained access control, a key performance goal is to reduce the overhead of instrumentation. Describe some techniques that you might use to do this. To what extent do you think the techniques can be automated in a compiler or instrumentation tool? b. When messages (e.g. page requests or lock requests) arrive at a node in a software shared memory system, whether fine-grained or coarse grained, there are two major ways to handle them in the absence of a programmable communication assist. One is to interrupt the main processor, and the other is to have the main processor poll for messages. (i) What are the major tradeoffs between the two methods? (ii) Which do you expect to perform better for page-based shared virtual memory and why? For fine-grained software shared memory? What application characteristics would most influence your decision? (iii) What new issues arise and how might the tradeoffs change if each node is an SMP rather than a uniprocessor? (iv) How do you think you would organize message handling with SMP nodes? 9.8 Shared Virtual Memory and Memory Consistency Models a. List all the tradeoffs you can think of between LRC based on diffs (not home based) versus based on automatic update. Which do you think would perform better? What about home-based LRC based on diffs versus based on automatic update? b. Is a properly labeled program guaranteed to run correctly under LRC? Under ERC? Under RC? Under Scope Consistency? Is a program that runs correctly under ERC guaranteed to run correctly under LRC? Is it guaranteed to be properly labeled (i.e. can a program that is not properly labeled run correctly under ERC? Is a program that runs correctly under LRC guaranteed to run correctly under ERC? Under Scope Consistency? c. Consider a single-writer release-based protocol. On a release, does a node need to find the up-to-date sharing list for each page you’ve modified since the last release from the current owner, or just send write notices to nodes on your version of the sharing list for each such page? Explain why. d. Consider version numbers without directories. Does this avoid the problem of sending multiple invalidates to the same copy. Explain why, or give a counter-example. e. Construct an example where Scope Consistency transparently delivers performance improvement over regular home-based LRC, and one where it needs additional programming complexity for correctness; i.e. a program that satisfied release consistency but would not work correctly under scope consistency. 9.9 Alternative Approaches. Trace the path of a write reference in (i) a pure CC-NUMA, (ii) a flat COMA, (iii) an SVM with automatic update, (iv) an SVM protocol without automatic update, and (v) a simple COMA. 9.10 Software Implications and Performance Issues 9/4/97
a. You are performing an architectural study using four applications: Ocean, LU, an FFT that uses a matrix transposition between local calculations on rows, and Barnes-Hut. For each application, answer the above questions assuming a page-grained SVM system (these questions were asked for a CC-NUMA system in the previous chapter):
(i) what modifications or enhancements in data structuring or layout would you use to ensure good interactions with the extended memory hierarchy?
(ii) what are the interactions with cache size and granularities of allocation, coherence and communication that you would be particularly careful to represent or not represent? What new ones become important in SVM systems that were not so important in CC-NUMA?
(iii) are the interactions with cache size as important in SVM as they are in CC-NUMA? If they are of different importance, say why.
9.11 Consider the FFT calculation with a matrix transpose described in the exercises at the end of the previous chapter. Suppose you are running this program on a page-based SVM system using an all-software, home-based multiple-writer protocol.
a. Would you rather use the method in which a processor reads locally allocated data and writes remotely allocated data, or the one in which the processor reads remote data and writes local data?
b. Now suppose you have hardware support for automatic update propagation to further speed up your home-based protocol. How does this change the tradeoff, if at all?
c. What protocol optimizations can you think of that would substantially increase the performance of one scheme or the other?
CHAPTER 10
Interconnection Network Design
Morgan Kaufmann is pleased to present material from a preliminary draft of Parallel Computer Architecture; the material is (c) Copyright 1996 Morgan Kaufmann Publishers. This material may not be used or distributed for any commercial purpose without the express written consent of Morgan Kaufmann Publishers. Please note that this material is a draft of forthcoming publication, and as such neither Morgan Kaufmann nor the authors can be held liable for changes or alterations in the final edition.
10.1 Introduction
We have seen throughout this book that scalable high-performance interconnection networks lie at the core of parallel computer architecture. Our generic parallel machine has basically three components: the processor-memory nodes, the node-to-network interface, and the network that holds it all together. The previous chapters give a general understanding of the requirements placed on the interconnection network of a parallel machine; this chapter examines the design of high-performance interconnection networks for parallel computers in depth. These networks share basic concepts and terminology with local area networks (LAN) and wide area networks (WAN), which may be familiar to many readers, but the design trade-offs are quite different because of the dramatic difference in time scale. Parallel computer networks are a rich and interesting topic because they have so many facets, but this richness also makes the topic difficult to understand in an overall sense. For example, parallel
computer networks are generally wired together in a regular pattern. The topological structure of these networks has elegant mathematical properties and there are deep relationships between these topologies and the fundamental communication patterns of important parallel algorithms. However, pseudo-random wiring patterns have a different set of nice mathematical properties and tend to have more uniform performance, without really good or really bad communication patterns. There is a wide range of interesting trade-offs to examine at this abstract level and a huge volume of research papers focuses completely on this aspect of network design. On the other hand, passing information between two independent asynchronous devices across an electrical or optical link presents a host of subtle engineering issues. These are the kinds of issues that give rise to major standardization efforts. From yet a third point of view, the interactions between multiple flows of information competing for communications resources have subtle performance effects which are influenced by a host of factors. The performance modeling of networks is another huge area of theoretical and practical research. Real network designs address issues at each of these levels.
The goal of this chapter is to provide a holistic understanding of the many facets of parallel computer networks, so the reader may see the large, diverse network design space within the larger problem of parallel machine design as driven by application demands. As with all other aspects of design, network design involves understanding trade-offs and making compromises so that the solution is near-optimal in a global sense, rather than optimized for a particular component of interest. The performance impact of the many interacting facets can be quite subtle. Moreover, there is not a clear consensus in the field on the appropriate cost model for networks, since trade-offs can be made between very different technologies; for example, bandwidth of the links may be traded against complexity of the switches. It is also very difficult to establish a well-defined workload against which to assess network designs, since program requirements are influenced by every other level of the system design before being presented to the network. This is the kind of situation that commonly gives rise to distinct design “camps” and rather heated debates, which often neglect to bring out the differences in base assumptions. In the course of developing the concepts and terms of computer networks, this chapter points out how the choice of cost model and workload leads to various important design points, which reflect key technological assumptions.
Previous chapters have illuminated the key driving factors on network design. The communication-to-computation ratio of the program places a requirement on the data bandwidth the network must deliver if the processors are to sustain a given computational rate. However, this load varies considerably between programs, the flow of information may be physically localized or dispersed, and it may be bursty in time or fairly uniform. In addition, the waiting time of the program is strongly affected by the latency of the network, and the time spent waiting affects the bandwidth requirement.
We have seen that different programming models tend to communicate at different granularities, which impacts the size of data transfers seen by the network, and that they use different protocols at the network transaction level to realize the higher level programming model.
This chapter begins with a set of basic definitions and concepts that underlie all networks. Simple models of communication latency and bandwidth are developed to reveal the core differences in network design styles. Then the key components which are assembled to form networks are described concretely. Section 10.3 explains the rich space of interconnection topologies in a common framework. Section 10.4 ties the trade-offs back to cost, latency, and bandwidth under basic workload assumptions. Section 10.5 explains the various ways that messages are routed within the topology of the network in a manner that avoids deadlock and describes the further impact of routing on communication performance. Section 10.6 dives deeper into the hardware organization
of the switches that form the basic building block of networks in order to provide a more precise understanding of the inherent cost of various options and the mechanics underlying the more abstract network concepts. Section 10.7 explores the alternative approaches to flow control within a network. With this grounding in place, Section 10.8 brings together the entire range of issues in a collection of case studies and examines the transition of parallel computer network technology into other network regimes, including the emerging System Area Networks.
10.1.1 Basic definitions
The job of an interconnection network in a parallel machine is to transfer information from any source node to any desired destination node, in support of the network transactions that are used to realize the programming model. It should accomplish this task with as small a latency as possible, and it should allow a large number of such transfers to take place concurrently. In addition, it should be inexpensive relative to the cost of the rest of the machine. The expanded diagram for our Generic Large-Scale Parallel Architecture in Figure 10-1 illustrates the structure of an interconnection network in a parallel machine. The communication assist on the source node initiates network transactions by pushing information through the network interface (NI). These transactions are handled by the communication assist, processor, or memory controller on the destination node, depending on the communication abstraction that is supported.
Figure 10-1 Generic Parallel Machine Interconnection Network The communication assist initiates network transactions on behalf of the processor or memory controller, through a network interface that causes information to be transmitted across a sequence of links and switches to a remote node where the network transaction takes place.
The network is composed of links and switches, which provide a means to route the information from the source node to the destination node. A link is essentially a bundle of wires or a fiber that carries an analog signal. For information to flow along a link, a transmitter converts digital information at one end into an analog signal that is driven down the link and converted back into
digital symbols by the receiver at the other end. The physical protocol for converting between streams of digital symbols and an analog signal forms the lowest layer of the network design. The transmitter, link, and receiver collectively form a channel for digital information flow between switches (or NIs) attached to the link. The link level protocol segments the stream of symbols crossing a channel into larger logical units, called packets or messages, that are interpreted by the switches in order to steer each packet or message arriving on an input channel to the appropriate output channel. Processing nodes communicate across a sequence of links and switches. The node level protocol embeds commands for the remote communication assist within the packets or messages exchanged between the nodes to accomplish network transactions.
Formally, a parallel machine interconnection network is a graph, where the vertices, V, are processing hosts or switch elements connected by communication channels, C ⊆ V × V. A channel is a physical link between host or switch elements and a buffer to hold data as it is being transferred. It has a width, w, and a signalling rate, f = 1/τ, which together determine the channel
bandwidth, b = w ⋅ f. The amount of data transferred across a link in a cycle¹ is called a physical unit, or phit. Switches connect a fixed number of input channels to a fixed number of output channels; this number is called the switch degree. Hosts typically connect to a single switch, but can be multiply connected with separate channels. Messages are transferred through the network from a source host node to a destination host along a path, or route, comprised of a sequence of channels and switches. A useful analogy to keep in mind is a roadway system composed of streets and intersections. Each street has a speed limit and a number of lanes, determining its peak bandwidth. It may be either unidirectional (one-way) or bidirectional. Intersections allow travelers to switch among a fixed number of streets. In each trip a collection of people travel from a source location along a route to a destination. There are many potential routes they may use, many modes of transportation, and many different ways of dealing with traffic encountered en route. A very large number of such “trips” may be in progress in a city concurrently, and their respective paths may cross or share segments of the route. A network is characterized by its topology, routing algorithm, switching strategy, and flow control mechanism.
1. Since many networks operate asynchronously, rather than being controlled by a single global clock, the notion of a network “cycle” is not as widely used as in dealing with processors. One could equivalently define the network cycle time as the time to transmit the smallest physical unit of information, a phit. For parallel architectures, it is convenient to think about the processor cycle time and the network cycle time in common terms. Indeed, the two technological regimes are becoming more similar with time.
• The topology is the physical interconnection structure of the network graph; this may be regular, as with a two-dimensional grid (typical of many metropolitan centers), or it may be irregular. Most parallel machines employ highly regular networks. A distinction is often made between direct and indirect networks; direct networks have a host node connected to each switch, whereas indirect networks have hosts connected only to a specific subset of the switches, which form the edges of the network. Many machines employ hybrid strategies, so the more critical distinction is between the two types of nodes: hosts generate and remove traffic, whereas switches only move traffic along. An important property of a topology is the diameter of a network, which is the maximum shortest path between any two nodes.
• The routing algorithm determines which routes messages may follow through the network graph. The routing algorithm restricts the set of possible paths to a smaller set of legal paths. There are many different routing algorithms, providing different guarantees and offering different performance trade-offs. For example, continuing the traffic analogy, a city might eliminate gridlock by legislating that cars must travel east-west before making a single turn north or south toward their destination, rather than being allowed to zig-zag across town. We will see that this does, indeed, eliminate deadlock, but it limits a driver’s ability to avoid traffic en route. In a parallel machine, we are only concerned with routes from a host to a host. (A small sketch of this dimension-ordered style of routing appears after this list.)
• The switching strategy determines how the data in a message traverses its route. There are basically two switching strategies. In circuit switching the path from the source to the destination is established and reserved until the message is transferred over the circuit. (This strategy is like reserving a parade route; it is good for moving a lot of people through, but advance planning is required and it tends to be unpleasant for any traffic that might cross or share a portion of the reserved route, even when the parade is not in sight. It is also the strategy used in phone systems, which establish a circuit through possibly many switches for each call.) The alternative is packet switching, in which the message is broken into a sequence of packets. A packet contains routing and sequencing information, as well as data. Packets are individually routed from the source to the destination. (The analogy to traveling as small groups in individual cars is obvious.) Packet switching typically allows better utilization of network resources because links and buffers are only occupied while a packet is traversing them.
• The flow control mechanism determines when the message, or portions of it, move along its route. In particular, flow control is necessary whenever two or more messages attempt to use the same network resource, e.g., a channel, at the same time. One of the traffic flows could be stalled in place, shunted into buffers, detoured to an alternate route, or simply discarded. Each of these options places specific requirements on the design of the switch and also influences other aspects of the communication subsystem. (Discarding traffic is clearly unacceptable in our traffic analogy.) The minimum unit of information that can be transferred across a link and either accepted or rejected is called a flow control unit, or flit. It may be as small as a phit or as large as a packet or message. To illustrate the difference between a phit and a flit, consider the nCUBE case study in Section 7.2.4. The links are a single bit wide so a phit is one bit. However, a switch accepts incoming messages in chunks of 36 bits (32 bits of data plus four parity bits). It only allows the next 36 bits to come in when it has a buffer to hold it, so the flit is 36 bits. In other machines, such as the T3D, the phit and flit are the same.
The diameter of a network is the maximum distance between two nodes in the graph. The routing distance between a pair of nodes is the number of links traversed en route; this is at least as large as the shortest path between the nodes and may be larger. The average distance is simply the average of the routing distance over all pairs of nodes; this is also the expected distance between a random pair of nodes.
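For concreteness, the following sketch (our own illustrative Python, not code from any machine) implements the dimension-ordered "X then Y" routing rule mentioned in the routing-algorithm item above for a k × k mesh, and uses it to compute routing distances and to estimate the average distance by sampling random node pairs.

    import random

    def dimension_order_route(src, dst):
        """Route first in X, then in Y, as in the east-west-then-north-south
        rule described above. Nodes are (x, y) coordinates in a 2D mesh."""
        x, y = src
        path = []
        while x != dst[0]:                    # travel in the X dimension first
            x += 1 if dst[0] > x else -1
            path.append((x, y))
        while y != dst[1]:                    # then travel in the Y dimension
            y += 1 if dst[1] > y else -1
            path.append((x, y))
        return path                           # one entry per channel traversed

    def routing_distance(src, dst):
        return len(dimension_order_route(src, dst))

    def average_distance(k, samples=10000):
        """Estimate the expected routing distance between random node pairs."""
        total = 0
        for _ in range(samples):
            src = (random.randrange(k), random.randrange(k))
            dst = (random.randrange(k), random.randrange(k))
            total += routing_distance(src, dst)
        return total / samples

    if __name__ == "__main__":
        k = 8                                             # an 8 x 8 mesh of 64 nodes
        print(routing_distance((0, 0), (k - 1, k - 1)))   # the diameter: 2*(k-1) = 14
        print(average_distance(k))                        # about k/3 hops per dimension, ~2k/3 total

Because every route travels the full X distance before any Y hop, dimension-order routing always follows a shortest path here; the distinction between routing distance and shortest path matters only for routing algorithms that take detours.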
In a direct network routes must be provided between every pair of switches, whereas in an indirect network it is only required that routes are provided between hosts. A network is partitioned if a set of links or switches is removed such that some nodes are no longer connected by routes. Most of the discussion in this chapter centers on packets, because packet switching is used in most modern parallel machine networks. Where specific important properties of circuit switching
arise, they are pointed out. A packet is a self-delimiting sequence of digital symbols and logically consists of three parts, illustrated in Figure 10-2: a header, a data payload, and a trailer. Usually the routing and control information comes in the header at the front of the packet so that the switches and network interface can determine what to do with the packet as it arrives. Typically, the error code is in the trailer, so it can be generated as the message spools out onto the link. The header may also have a separate error checking code.
The two basic mechanisms for building abstractions in the context of networks are encapsulation and fragmentation. Encapsulation involves carrying higher level protocol information in an uninterpreted form within the message format of a given level. Fragmentation involves splitting the higher level protocol information into a sequence of messages at a given level. While these basic mechanisms are present in any network, in parallel computer networks the layers of abstraction tend to be much shallower than in, say, the internet, and designed to fit together very efficiently. To make these notions concrete, observe that the header and trailer of a packet form an “envelope” that encapsulates the data payload. Information associated with the node-level protocol is contained within this payload. For example, a read request is typically conveyed to a remote memory controller in a single packet and the cache line response is a single packet. The memory controllers are not concerned with the actual route followed by the packet, or the format of the header and trailer. At the same time, the network is not concerned with the format of the remote read request within the packet payload. A large bulk data transfer would typically not be carried out as a single packet; instead, it would be fragmented into several packets. Each would need to contain information to indicate where its data should be deposited or which fragment in the sequence it represents. In addition, a single packet is fragmented at the link level into a series of symbols, which are transmitted in order across the link, so there is no need for sequencing information. This situation, in which higher level information is carried within an envelope that is interpreted by the lower level protocol, or multiple such envelopes, occurs at every level of network design.
Figure 10-2 Typical Packet Format. A packet forms the logical unit of information that is steered along a route by switches. It is comprised of three parts: a header, a data payload, and a trailer, transmitted as a sequence of digital symbols over a channel. The header and trailer are interpreted by the switches as the packet progresses along the route, but the payload is not. The node level protocol is carried in the payload.
10.1.2 Basic communication performance
There is much to understand about each of the four major aspects of network design, but before going into these aspects in detail it is useful to have a general understanding of how they interact to determine the performance and functionality of the overall communication subsystem. Building on the brief discussion of networks in Chapter 5, let us look at performance from the latency and bandwidth perspectives.
Latency
To establish a basic performance model for understanding networks, we may expand the model for the communication time from Chapter 1. The time to transfer n bytes of information from its source to its destination has four components, as follows:

Time(n)_{S-D} = Overhead + Routing Delay + Channel Occupancy + Contention Delay
(EQ 10.1)
The overhead associated with getting the message into and out of the network on the ends of the actual transmission has been discussed extensively in dealing with the node-to-network interface in previous chapters. We have seen machines that are designed to move cache line sized chunks and others that are optimized for large message DMA transfers. In previous chapters, the routing delay and channel occupancy are effectively lumped together as the unloaded latency of the network for typical message sizes, and contention has been largely ignored. These other components are the focus of this chapter.
The channel occupancy provides a convenient lower bound on the communication latency, independent of where the message is going or what else is happening in the network. As we look at network design in more depth, we see below that the occupancy of each link is influenced by the channel width, the signalling rate, and the amount of control information, which is in turn influenced by the topology and routing algorithm. Whereas previous chapters were concerned with the channel occupancy seen from “outside” the network, the time to transfer the message across the bottleneck channel in the route, the view from within the network is that there is a channel occupancy associated with each step along the route. The communication assist is occupied for a period of time accepting the communication request from the processor or memory controller and spooling a packet into the network. Each channel the packet crosses en route is occupied for a period of time by the packet, as is the destination communication assist. For example, an issue that we need to be aware of is the efficiency of the packet encoding. The packet envelope increases the occupancy because header and trailer symbols are added by the
source and stripped off by the destination. Thus, the occupancy of a channel is (n + n_E)/b, where n_E is the size of the envelope and b is the raw bandwidth of the channel. This issue is addressed in the “outside view” by specifying the effective bandwidth of the link, derated from the raw bandwidth by n/(n + n_E), at least for fixed-sized packets. However, within the network packet efficiency
remains a design issue. The effect is more pronounced with small packets, but it also depends on how routing is performed.
The routing delay is seen from outside the network as the time to move a given symbol, say the first byte of the message, from the source to the destination. Viewed from within the network, each step along the route incurs a routing delay that accumulates into the delay observed from the outside. The routing delay is a function of the number of channels on the route, called the routing distance, h, and the delay, ∆, incurred at each switch as part of selecting the correct output port. (It is convenient to view the node-to-network interface as contributing to the routing delay like a switch.) The routing distance depends on the network topology, the routing algorithm, and the particular pair of source and destination nodes. The overall delay is strongly affected by switching and routing strategies.
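The occupancy and derating relations above are easy to tabulate. The following Python sketch (ours; the link parameters are invented for illustration) shows how much of the raw link bandwidth is lost to the packet envelope for a few payload sizes.

    def channel_occupancy(n, n_e, b):
        """Time a channel is occupied by one packet: (n + n_e) / b."""
        return (n + n_e) / b

    def effective_bandwidth(n, n_e, b):
        """Raw bandwidth derated by the packet density n / (n + n_e)."""
        return b * n / (n + n_e)

    b = 400e6        # raw channel bandwidth in bytes/s (illustrative value)
    n_e = 8          # envelope (header + trailer) size in bytes (illustrative value)
    for n in (8, 64, 1024):
        print(n, channel_occupancy(n, n_e, b), effective_bandwidth(n, n_e, b) / 1e6)

As the printout shows, an 8-byte envelope halves the delivered bandwidth of an 8-byte payload but costs under one percent for a kilobyte payload, which is why packet efficiency matters most for fine-grained communication.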
With packet-switched, store-and-forward routing, the entire packet is received by a switch before it is forwarded on the next link, as illustrated in Figure 10-3a. This strategy is used in most wide-area networks and was used in several early parallel computers. The unloaded network latency for an n byte packet, including envelope, with store-and-forward routing is

T_sf(n, h) = h(n/b + ∆).
(EQ 10.2)
Equation 10.2 would suggest that the network topology is paramount in determining network latency, since the topology fundamentally determines the routing distance, h. In fact, the story is more complicated. First, consider the switching strategy. With circuit switching one expects a delay proportional to h to establish the circuit, configure each of the switches along the route, and inform the source that the route is established. After this time, the data should move along the circuit in time n/b plus an additional small delay proportional to h. Thus, the unloaded latency in units of the network cycle time, τ, for an n byte message traveling distance h in a circuit-switched network is:

T_cs(n, h) = n/b + h∆.
(EQ 10.3)
In Equation 10.3, the routing delay is an additive term, independent of the size of the message. Thus, as the message length increases, the routing distance, and hence the topology, becomes an insignificant fraction of the unloaded communication latency. Circuit switching is traditionally used in telecommunications networks, since the call set-up is short compared to the duration of the call. It is used in a minority of parallel computer networks, including the Meiko CS-2. One important difference is that in parallel machines the circuit is established by routing the message through the network and holding open the route as a circuit. The more traditional approach is to compute the route on the side and configure the switches, then transmit information on the circuit.
It is also possible to retain packet switching and yet reduce the unloaded latency from that of naive store-and-forward routing. The key concern with Equation 10.2 is that the delay is the product of the routing distance and the occupancy for the full message. However, a long message can be fragmented into several small packets, which flow through the network in a pipelined fashion. In this case, the unloaded latency is

T_sf(n, h, n_p) = (n − n_p)/b + h(n_p/b + ∆),
(EQ 10.4)
where n_p is the size of the fragments. The effective routing delay is proportional to the packet size, rather than the message size. This is basically the approach adopted (in software) for traditional data communication networks, such as the internet. In parallel computer networks, the idea of pipelining the routing and communication is carried much further. Most parallel machines use packet switching with cut-through routing, in which the switch makes its routing decision after inspecting only the first few phits of the header and allows the remainder of the packet to “cut through” from the input channel to the output channel, as indicated by Figure 10-3b. (If we may return to our vehicle analogy, cut-through routing is like what happens when a train encounters a switch in the track. The first car is directed to the proper
output track and the remainder follow along behind. By contrast, store-and-forward routing is like what happens at the station, where all the cars in the train must have arrived before the first can proceed toward the next station.) For cut-through routing, the unloaded latency has the same form as the circuit-switched case,

T_ct(n, h) = n/b + h∆,
(EQ 10.5)
although the routing coefficients may differ since the mechanics of the process are rather different. Observe that with cut-through routing a single message may occupy the entire route from the source to the destination, much like circuit switching. The head of the message establishes the route as it moves toward its destination and the route clears as the tail moves through.
Figure 10-3 Store-and-forward vs. cut-through routing. A four flit packet traverses three hops from source to destination under store-and-forward and cut-through routing. Cut-through achieves lower latency by making the routing decision (grey arrow) on the first flit and pipelining the packet through a sequence of switches. Store-and-forward accumulates the entire packet before routing it toward the destination.
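The unloaded-latency expressions of Equations 10.2 through 10.5 can be compared directly. The sketch below (ours; the bandwidth, per-switch delay, and fragment size are illustrative values, not those of any machine) evaluates store-and-forward, fragmented store-and-forward, and cut-through latency for a range of message sizes and routing distances.

    def t_store_and_forward(n, h, b, delta):
        """EQ 10.2: the whole packet is received before being forwarded."""
        return h * (n / b + delta)

    def t_fragmented(n, h, b, delta, n_p):
        """EQ 10.4: the message is split into n_p-byte packets that pipeline."""
        return (n - n_p) / b + h * (n_p / b + delta)

    def t_cut_through(n, h, b, delta):
        """EQ 10.5 (same form as circuit switching, EQ 10.3): only the header
        pays the per-hop delay; the rest of the packet cuts through."""
        return n / b + h * delta

    b, delta = 1.0, 4.0     # bytes per cycle and per-switch delay, in cycles (made up)
    for n in (16, 256, 4096):
        for h in (2, 8):
            print(n, h,
                  t_store_and_forward(n, h, b, delta),
                  t_fragmented(n, h, b, delta, n_p=16),
                  t_cut_through(n, h, b, delta))

Running this shows the point made in the text: for long messages the store-and-forward term h·n/b dominates and grows with distance, whereas under cut-through the distance contributes only the additive h·∆ term.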
The discussion of communication latency above addresses a message flowing from source to destination without running into traffic along the way. In this unloaded case, the network can be viewed simply as a pipeline, with a start-up cost, pipeline depth, and time per stage. The different switching and routing strategies change the effective structure of the pipeline, whereas the topology, link bandwidth, and fragmentation determine the depth and time per stage. Of course, the reason that networks are so interesting is that they are not a simple pipeline, but rather an interwoven fabric of portions of many pipelines. The whole motivation for using a network, rather than a bus, is to allow multiple data transfers to occur simultaneously. This means that one message flow may collide with others and contend for resources. Fundamentally, the network must provide a mechanism for dealing with contention. The behavior under contention depends on
several facets of the network design: the topology, the switching strategy, and the routing algorithm, but the bottom line is that at any given time a channel can only be occupied by one message. If two messages attempt to use the same channel at once, one must be deferred. Typically, each switch provides some means of arbitration for the output channels. Thus, a switch will select one of the incoming packets contending for each output and the others will be deferred in some manner. An overarching design issue for networks is how contention is handled. This issue recurs throughout the chapter, so let’s first just consider what it means for latency.
Clearly, contention increases the communication latency experienced by the nodes. Exactly how it increases depends on the mechanisms used for dealing with contention within the network, which in turn differ depending on the basic network design strategy. For example, with packet switching, contention may be experienced at each switch. Using store-and-forward routing, if multiple packets that are buffered in the switch need to use the same output channel, one will be selected and the others are blocked in buffers until they are selected. Thus, contention adds queueing delays to the basic routing delay. With circuit switching, the effect of contention arises when trying to establish a circuit; typically, a routing probe is extended toward the destination and if it encounters a reserved channel, it is retracted. The network interface retries establishing the circuit after some delay. Thus, the start-up cost in gaining access to the network increases under contention, but once the circuit is established the transmission proceeds at full speed for the entire message. With cut-through packet switching, two packet blocking options are available. The virtual cut-through approach is to spool the blocked incoming packet into a buffer, so the behavior under contention degrades to that of store-and-forward routing. The worm-hole approach buffers only a few phits in the switch and leaves the tail of the message in place along the route. The blocked portion of the message is very much like a circuit held open through a portion of the network.
Switches have limited buffering for packets, so under sustained contention the buffers in a switch may fill up, regardless of routing strategy. What happens to incoming packets if there is no buffer space to hold them in the switch? In traditional data communication networks the links are long and there is little feedback between the two ends of the channel, so the typical approach is to discard the packet. Thus, under contention the network becomes highly unreliable and sophisticated protocols (e.g., TCP/IP) are used at the nodes to adapt the requested communication load to what the network can deliver without a high loss rate. Discarding on buffer overrun is also used with most ATM switches, even if they are employed to build closely integrated clusters. Like the wide area case, the source receives no indication that its packet was dropped, so it must rely on some kind of timeout mechanism to deduce that a problem has occurred. In most parallel computer networks, a packet headed for a full buffer is typically blocked in place, rather than discarded; this requires a handshake between the output port and input port across the link, i.e., link-level flow control.
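Such link-level flow control is often realized with credits: the sender tracks how many flit buffers remain free at the receiver and blocks, rather than discards, when none remain. The class below is a minimal Python illustration of that idea, our own simplification and not the mechanism of any specific machine.

    from collections import deque

    class CreditedLink:
        """Sender-side view of a link whose receiver has 'credits' flit buffers.
        send() blocks the flit in place (returns False) when no credit remains;
        the receiver returns a credit each time it drains a flit downstream."""
        def __init__(self, credits):
            self.credits = credits
            self.receiver_buffer = deque()

        def send(self, flit):
            if self.credits == 0:
                return False                 # block in place; nothing is discarded
            self.credits -= 1
            self.receiver_buffer.append(flit)
            return True

        def drain(self):
            if self.receiver_buffer:
                flit = self.receiver_buffer.popleft()
                self.credits += 1            # a credit flows back to the sender
                return flit
            return None

    link = CreditedLink(credits=2)
    print([link.send(f) for f in ("f0", "f1", "f2")])    # third send is blocked
    link.drain()
    print(link.send("f2"))                               # succeeds once a credit returns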
Under sustained congestion, traffic ‘backs up’ from the point in the network where contention for resources occurs towards the sources that are driving traffic into that point. Eventually, the sources experience back-pressure from the network (when it refuses to accept packets), which causes the flow of data into the network to slow down to that which can move through the bottleneck.1 Increasing the amount of buffering within the network allows contention to persist longer without causing back-pressure at the source, but it also increases the potential queueing delays within the network when contention does occur.
One of the concerns that arises in networks with message blocking is that the ‘back up’ can impact traffic that is not headed for the highly contended output. Suppose that the network traffic
favors one of the destinations, say because it holds an important global variable that is widely used. This is often called a “hot spot.” If the total amount of traffic destined for that output exceeds the bandwidth of the output, this traffic will back up within the network. If this condition persists, the backlog will propagate backward through the tree of channels directed at this destination, which is called tree saturation [PfNo85]. Any traffic that crosses the tree will also be delayed. In a worm-hole routed network, an interesting alternative to blocking the message is to discard it, but to inform the source of the collision. For example, in the BBN Butterfly the source held onto the tail of the message until the head reached the destination, and if a collision occurred en route, the worm was retracted all the way back to the source [ReTh86].
We can see from this brief discussion that all aspects of network design – link bandwidth, topology, switching strategy, routing algorithm, and flow control – combine to determine the latency per message. It should also be clear that there is a relationship between latency and bandwidth. If the communication bandwidth demanded by the program is low compared to the available network bandwidth, there will be few collisions, buffers will tend to be empty, and the latency will stay low, especially with cut-through routing. As the bandwidth demand increases, latency will increase due to contention. An important point that is often overlooked is that parallel computer networks are effectively a closed system. The load placed on the network depends on the rate at which processing nodes request communication, which in turn depends on how fast the network delivers this communication. Program performance is affected most strongly by latency when some kind of dependence is involved; the program must wait until a read completes or until a message is received to continue. While it waits, the load placed on the network drops and the latency decreases. This situation is very different from that of file transfers across the country contending with unrelated traffic. In a parallel machine, a program largely contends with itself for communication resources. If the machine is used in a multiprogrammed fashion, parallel programs may also contend with one another, but the request rate of each will be reduced as the service rate of the network is reduced due to the contention. Since low latency communication is critical to parallel program performance, the emphasis in this chapter is on cut-through packet-switched networks.
Bandwidth
Network bandwidth is critical to parallel program performance in part because higher bandwidth decreases occupancy, in part because higher bandwidth reduces the likelihood of contention, and in part because phases of a program may push a large volume of data around without waiting for individual data items to be transmitted along the way. Since networks behave something like pipelines it is possible to deliver high bandwidth even when the latency is large.
1. This situation is exactly like multiple lanes of traffic converging on a narrow tunnel or bridge. When the traffic flow is less than the flow-rate of the tunnel, there is almost no delay, but when the in-bound flow exceeds this bandwidth, traffic backs up. When the traffic jam fills the available storage capacity of the roadway, the aggregate traffic moves forward slowly enough that the aggregate bandwidth is equal to that of the tunnel. Indeed, upon reaching the mouth of the tunnel traffic accelerates, as the Bernoulli effect would require.
It is useful to look at bandwidth from two points of view: the “global” aggregate bandwidth available to all the nodes through the network and the “local” individual bandwidth available to a node. If the total communication volume of a program is M bytes and the aggregate communication bandwidth of the network is B bytes per second, then clearly the communication time is at least M/B seconds. On the other hand, if all of the communication is to or from a single node, this
estimate is far too optimistic; the communication time would be determined by the bandwidth through that single node.
Let us look first at the bandwidth available to a single node and see how it may be influenced by network design choices. We have seen that the effective local bandwidth is reduced from the raw
link bandwidth by the density of the packet, to b·n/(n + n_E). Furthermore, if the switch blocks the packet for the routing delay of ∆ cycles while it makes its routing decision, then the effective local bandwidth is further derated to b·n/(n + n_E + ∆w), since ∆w is the opportunity to transmit data
that is lost while the link is blocked. Thus, network design issues such as the packet format and the routing algorithm will influence the bandwidth seen by even a single node. If multiple nodes are communicating at once and contention arises, the perceived local bandwidth will drop further (and the latency will rise). This happens in any network if multiple nodes send messages to the same node, but it may occur within the interior of the network as well. The choice of network topology and routing algorithm affect the likelihood of contention within the network.
If many of the nodes are communicating at once, it is useful to focus on the global bandwidth that the network can support, rather than the bandwidth available to each individual node. First, we should sharpen the concept of the aggregate communication bandwidth of a network. The most common notion of aggregate bandwidth is the bisection bandwidth of the network, which is the sum of the bandwidths of the minimum set of channels which, if removed, partition the network into two equal sets of nodes. This is a valuable concept, because if the communication pattern is completely uniform, half of the messages are expected to cross the bisection in each direction. We will see below that the bisection bandwidth per node varies dramatically in different network topologies. However, bisection bandwidth is not entirely satisfactory as a metric of aggregate network bandwidth, because communication is not necessarily distributed uniformly over the entire machine. If communication is localized, rather than uniform, bisection bandwidth will give a pessimistic estimate of communication time. An alternative notion of global bandwidth that caters to localized communication patterns would be the sum of the bandwidth of the links from the nodes to the network. The concern with this notion of global bandwidth is that the internal structure of the network may not support it. Clearly, the available aggregate bandwidth of the network depends on the communication pattern; in particular, it depends on how far the packets travel, so we should look at this relationship more closely.
The total bandwidth of all the channels (or links) in the network is the number of channels times the bandwidth per channel, i.e., Cb bytes per second, Cw bits per cycle, or C phits per cycle. If each of N hosts issues a packet every M cycles with an average routing distance of h, a packet occupies, on average, h channels for l = n/w cycles. Thus, the total load on the network is Nhl/M
phits per cycle. The average link utilization is at least
ρ = Nhl / (MC),
(EQ 10.6)
which obviously must be less than one. One way of looking at this is that the number of links per node, C/N, reflects the communication bandwidth (flits per cycle per node) available, on average,
to each node. This bandwidth is consumed in direct proportion to the routing distance and the message size. The number of links per node is a static property of the topology. The average routing distance is determined by the topology, the routing algorithm, the program communication pattern, and the mapping of the program onto the machine. Good communication locality may yield a small h, while random communication will travel the average distance, and really bad patterns may traverse the full diameter. The message size is determined by the program behavior and the communication abstraction. In general, the aggregate communication requirement in Equation 10.6 says that as the machine is scaled up, the network resources per node must scale with the increase in expected latency.
In practice, several factors limit the channel utilization, ρ, to well below unity. The load may not be perfectly balanced over all the links. Even if it is balanced, the routing algorithm may prevent all the links from being used for the particular communication pattern employed in the program. Even if all the links are usable and the load is balanced over the duration, stochastic variations in the load and contention for low-level resources may arise. The result of these factors is that networks have a saturation point, representing the total channel bandwidth they can usefully deliver. As illustrated in Figure 10-4, if the bandwidth demand placed on the network by the processors is moderate, the latency remains low and the delivered bandwidth increases with increased demand. However, at some point, demanding more bandwidth only increases the contention for resources and the latency increases dramatically. The network is essentially moving as much traffic as it can, so additional requests just get queued up in the buffers. One attempts to design parallel machines so that the network stays out of saturation, either by providing ample communication bandwidth or by limiting the demands placed by the processors.
A word of caution is in order regarding the dramatic increase in latency illustrated in Figure 10-4 as the network load approaches saturation. The behavior illustrated in the figure is typical of all queueing systems (and networks) under the assumption that the load placed on the system is independent of the response time. The sources keep pushing messages into the system faster than it can service them, so a queue of arbitrary length builds up somewhere and the latency grows with the length of this queue. In other words, this simple analysis assumes an open system. In reality, parallel machines are closed systems. There is only a limited amount of buffering in the network and, usually, only a limited amount of communication buffering in the network interfaces. Thus, if these “queues” fill up, the sources will slow down to the service rate, since there is no place to put the next packet until one is removed. The flow-control mechanisms affect this coupling between source and sink. Moreover, dependences within the parallel programs inherently embed some degree of end-to-end flow control, because a processor must receive remote information before it can do additional work which depends on the information and generate additional communication traffic. Nonetheless, it is important to recognize that a shared resource, such as a network link, is not expected to be 100% utilized.
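The utilization bound of Equation 10.6 takes only a few lines of arithmetic to check. The sketch below (ours; the machine and workload numbers are invented) computes ρ for a hypothetical configuration and flags the case where the offered load could not possibly be carried, i.e., where the network would be driven into saturation.

    def link_utilization(N, C, h, l, M):
        """EQ 10.6 as reconstructed above: rho = N*h*l / (M*C), the average
        fraction of each channel-cycle consumed when N hosts each inject an
        l-phit packet every M cycles over an average routing distance of h."""
        return (N * h * l) / (M * C)

    N, C = 64, 256            # 64 hosts and 256 unidirectional channels (illustrative)
    h, l = 6.0, 20            # average hops and packet length in phits (illustrative)
    for M in (20, 100, 500):
        rho = link_utilization(N, C, h, l, M)
        print(M, round(rho, 2), "saturated" if rho >= 1 else "feasible")

In the M = 20 case the computed ρ exceeds one, so regardless of routing quality the sources would be throttled by back-pressure well before reaching that injection rate.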
This brief performance modeling of parallel machine networks shows that the latency and bandwidth of real networks depend on all aspects of the network design, which we will examine in some detail in the remainder of the chapter. Performance modeling of networks itself is a rich
area with a voluminous literature base, and the interested reader should consult the references at the end of this chapter as a starting point. In addition, it is important to note that performance is not the only driving factor in network designs. Cost and fault tolerance are two other critical criteria. For example, the wiring complexity of the network is a critical issue in several large scale machines. As these issues depend quite strongly on specifics of the design and the technology employed, we will discuss them along with examining the design alternatives.
Figure 10-4 Typical Network Saturation Behavior. Networks can provide low latency when the requested bandwidth is well below that which can be delivered. In this regime, the delivered bandwidth scales linearly with that requested. However, at some point the network saturates and additional load causes the latency to increase sharply without yielding additional delivered bandwidth.
10.2 Organizational Structure
This section outlines the basic organizational structure of a parallel computer network. It is useful to think of this issue in the more familiar terms of the organization of the processor and the applicable engineering constraints. We normally think of the processor as being composed of datapath, control, and memory interface, including perhaps the on-chip portions of the cache hierarchy. The datapath is further broken down into ALU, register file, pipeline latches, and so
forth. The control logic is built up from examining the data transfers that take place in the datapath. Local connections within the datapath are short and scale well with improvements in VLSI technology, whereas control wires and busses are long and become slower relative to gates as chip density increases. A very similar notion of decomposition and assembly applies to the network. Scalable interconnection networks are composed of three basic components: links, switches, and network interfaces. A basic understanding of these components, their performance characteristics, and inherent costs is essential for evaluating network design alternatives. The set of operations is quite limited, since they fundamentally move packets toward their intended destination.
10.2.1 Links
A link is a cable of one or more electrical wires or optical fibers with a connector at each end attached to a switch or network interface port. It allows an analog signal to be transmitted from one end, received at the other, and sampled to obtain the original digital information stream. In practice, there is tremendous variation in the electrical and physical engineering of links; however, their essential logical properties can be characterized along three independent dimensions: length, width, and clocking.
• A “short” link is one in which only a single logical value can be on the link at any time; a “long” link is viewed as a transmission line, where a series of logical values propagate along the link at a fraction of the speed of light (1-2 feet per ns, depending on the specific medium).
• A “thin” link is one in which data, control, and timing information are multiplexed onto each wire, such as on a single serial link. A “wide” link is one which can simultaneously transmit data and control information. In either case, network links are typically narrow compared to internal processor datapaths, say 4 to 16 data bits.
• The clocking may be “synchronous” or “asynchronous.” In the synchronous case, the source and destination operate on the same global clock, so data is sampled at the receiving end according to the common clock. In the asynchronous case, the source encodes its clock in some manner within the analog signal that is transmitted and the destination recovers the source clock from the signal and transfers the information into its own clock domain.
1. The RC delay of a wire increases with the square of the length, so for a fixed amount of signal
drive the network cycle time is strongly affected by length. However, if the driver strength is increased using an ideal driver tree, the time to drive the load of a longer wire only increases logarithmically. (If τ inv is the propagation delay of a basic gate, then the effective propagation delay of a “short” wire of length l grows as t s = K τ inv log l .) 9/4/97
DRAFT: Parallel Computer Architecture
689
Interconnection Network Design
trol bits identify the phit type (00 No info, 01 Routing tag, 10 packet, 11 last). The routing tag phit and the last-packet phit provide packet framing. In a long wire or optical fiber the signal propagates along the link from source to destination. For a long link, the delay is clearly linear in the length of the wire. The signalling rate is determined by the time to correctly sample the signal at the receiver, so the length is limited by the signal decay along the link. If there is more than one wire in the link, the signalling rate and wire length are also limited by the signal skew across the wires. A correctly sampled analog signal can be viewed as a stream of digital symbols (phits) delivered from source to destination over time. Logical values on each wire may be conveyed by voltage levels or voltage transitions. Typically, the encoding of digital symbols is chosen so that it is easy to identify common failures, such as stuck-at faults and open connections, and easy to maintain clocking. Within the stream of symbols individual packets must be identified. Thus, part of the signaling convention of a link is its framing, which identifies the start and end of each packet. In a wide link, distinct control lines may identify the head and tail phits. In a narrow link, special control symbols are inserted in the stream to provide framing. In an asynchronous serial link, the clock must be extracted from the incoming analog signal as well; this is typically done with a unique synchronization burst.[PeDa96]. The Cray T3E network provides a convenient contrast to the T3D. It uses a wide, asynchronous link design. The link is 14 bits wide in each direction, operating at 375 MHz. Each “bit” is conveyed by a low voltage differential signal (LVDS) with a nominal swing of 600 mV on a pair of wires, i.e., the receiver senses the difference in the two wires, rather than the voltage relative to ground. The clock is sent along with the data. The maximum transmission distance is approximately one meter, but even at this length there will be multiple bits on the wire at a time. A flit contains five phits, so the switches operate at 75MHz on 70 bit quantities containing one 64 bit word plus control information. Flow control information is carried on data packets and idle symbols on the link in the reverse direction. The sequence of flits is framed into single-word and 8word read and write request packets, message packets, and other special packets. The maximum data bandwidth of a link is 500 MB/s. The encoding of the packet within the frame is interpreted by the nodes attached to the link. Typically, the envelope is interpreted by the switch to do routing and error checking. The payload is delivered uninterpreted to the destination host, at which point further layers or internal envelopes are interpreted and peeled away. However, the destination node may need to inform the source whether it was able to hold the data. This requires some kind of node-to-node information that is distinct from the actual communication, for example, an acknowledgment in the reverse direction. With wide links, control lines may run in both directions to provide this information. Thin links are almost always bidirectional, so that special flow-control signals can be inserted into the stream in the reverse direction.1 The Scalable Coherent Interface (SCI) defines both a long, wide copper link and a long, narrow fiber link. The links are unidirectional and nodes are always organized into rings. 
The copper link comprises 18 pairs of wires using differential ECL signaling on both edges of a 250 MHz clock.
1. This view of flow-control as inherent to the link is quite different from the view in more traditional networking applications, where flow control is realized on top of the link-level protocol by special packets.
It carries 16 bits of data, clock, and a flag bit. The fiber link is serial and operates at 1.25 Gb/s. Packets are a sequence of 16-bit phits, with the header consisting of a destination node number phit and a command phit. The trailer consists of a 32-bit CRC. The flag bit provides packet framing, by distinguishing idle symbols from packet flits. At least one idle phit occurs between successive packets.
Many evaluations of networks treat links as having a fixed cost. Common sense would suggest that the cost increases with the length of the link and its width. This is actually a point of considerable debate within the field, because the relative quality of different networks depends on the cost model that is used in the evaluation. Much of the cost is in the connectors and the labor involved in attaching them, so there is a substantial fixed cost. The connector cost increases with width, whereas the wire cost increases with width and length. In many cases, the key constraint is the cross-sectional area of a bundle of links, say at the bisection; this increases with width.
10.2.2 Switches
A switch consists of a set of input ports, a set of output ports, an internal “cross-bar” connecting each input to every output, internal buffering, and control logic to effect the input-output connection at each point in time, as illustrated in Figure 10-5. Usually, the number of input ports is equal to the number of output ports, which is called the degree of the switch.1 Each output port includes a transmitter to drive the link. Each input port includes a matching receiver. The input port includes a synchronizer in most designs to align the incoming data with the local clock domain of the switch. This is essentially a FIFO, so it is natural to provide some degree of buffering with each input port. There may also be buffering associated with the outputs or shared buffering for the switch as a whole. The interconnect within the switch may be an actual cross-bar, a set of multiplexors, a bus operating at a multiple of the link bandwidth, or an internal memory. The complexity of the control logic depends on the routing and scheduling algorithm, as we will discuss below. At the very least, it must be possible to determine the output port required by each incoming packet, and to arbitrate among input ports that need to connect to the same output port.
Many evaluations of networks treat the switch degree as its cost. This is clearly a major factor, but again there is room for debate. The cost of some parts of the switch is linear in the degree, e.g., the transmitters, receivers, and port buffers. However, the internal interconnect cost may increase with the square of the degree. The amount of internal buffering and the complexity of the routing logic also increase more than linearly with degree. With recent VLSI switches, the dominant constraint tends to be the number of pins, which is proportional to the number of ports times the width of each port.
1. As with most rules, there are exceptions. For example, the BBN Monarch design used two distinct kinds of switches with unequal numbers of input and output ports [Ret*90]. Switches that routed packets to output ports based on routing information in the header could have more outputs than inputs. An alternative device, called a concentrator, routed packets to any of its output ports, which were fewer in number than its input ports and all connected to the same node.
Figure 10-5 Basic Switch Organization A switch consists of a set of input ports (each with a receiver and input buffer), a set of output ports (each with an output buffer and transmitter), an internal “cross-bar” connecting each input to every output, and control logic for routing and scheduling to effect the input-output connection at each point in time.
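To make the cost discussion concrete, the following rough sketch tallies how the pieces of a switch scale with degree d and link width w. The coefficients and the quadratic cross-bar term are illustrative assumptions, not measured part costs.

    # Rough switch cost/pin scaling sketch -- illustrative constants only.
    def switch_cost(d, w, c_port=1.0, c_xpoint=0.02):
        """Port logic (transmitter, receiver, buffer) scales roughly linearly with
        degree d and width w; the internal cross-bar has d*d cross-points, each w wide."""
        ports = 2 * d * w * c_port          # d input ports + d output ports
        crossbar = d * d * w * c_xpoint     # cross-points grow quadratically in degree
        return ports + crossbar

    def pin_count(d, w):
        """Pins are proportional to ports times width (ignoring clock/control pins)."""
        return 2 * d * w

    for d in (4, 8, 16, 32):
        print(f"degree {d:2}: cost ~{switch_cost(d, 8):7.1f}, pins {pin_count(d, 8)}")

The sketch merely makes the text's point quantitative: port cost dominates at small degree, while the cross-point and pin terms grow much faster as degree increases.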
10.2.3 Network Interfaces The network interface (NI) contains one or more input/output ports to source packets into and sink packets from the network, under direction of the communication assist which connects it to the processing node, as we have seen in previous chapters. The network interface, or host node, behaves quite differently from a switch node and may be connected via special links. The NI formats the packets and constructs the routing and control information. It may have substantial input and output buffering compared to a switch. It may perform end-to-end error checking and flow control. Clearly, its cost is influenced by its storage capacity, processing complexity, and number of ports.
10.3 Interconnection Topologies Now that we understand the basic factors determining the performance and the cost of networks, we can examine each of the major dimensions of the design space in relation to these factors. This section covers the set of important interconnection topologies. Each topology is really a class of networks scaling with the number of host nodes, N , so we want to understand the key characteristics of each class as a function of N . In practice, the topological properties, such as distance, are not entirely independent of the physical properties, such as length and width, because some topologies fundamentally require longer wires when packed into a physical volume, so it is important to understand both aspects.
10.3.1 Fully connected network A fully-connected network consists of a single switch, which connects all inputs to all outputs. The diameter is one and the degree is N. The loss of the switch wipes out the whole network; the loss of a link removes only one node. One such network is simply a bus, and this provides a useful reference point to describe the basic characteristics. It has the nice property that the cost scales as O(N). Unfortunately, only one data transmission can occur on a bus at once, so the total bandwidth is O(1), as is the bisection. In fact, the bandwidth scaling is worse than O(1), because the clock rate of a bus decreases with the number of ports due to RC delays. (An Ethernet is really a bit-serial, distributed bus; it just operates at a low enough frequency that a large number of physical connections are possible.) Another fully connected network is a cross-bar. It provides O(N) bandwidth, but the cost of the interconnect is proportional to the number of cross-points, or O(N^2). In either case, a fully connected network is not scalable in practice. This is not to say such networks are not important. Individual switches are fully connected and provide the basic building block for larger networks. A key metric of technological advance in networks is the degree of a cost-effective switch. With increasing VLSI chip density, the number of nodes that can be fully connected by a cost-effective switch is increasing. 10.3.2 Linear arrays and rings The simplest network is a linear array of nodes numbered consecutively 0, …, N − 1 and connected by bidirectional links. The diameter is N − 1, the average distance is roughly (2/3)N, and removal of a single link partitions the network, so the bisection width is one. Routing in such a network is trivial, since there is exactly one route between any pair of nodes. To describe the route from node A to node B, let us define R = B − A to be the relative address of B from A. This signed, log N bit number is the number of links to cross to get from A to B, with the positive direction being away from node 0. Since there is a unique route between a pair of nodes, the network clearly provides no fault tolerance. The network consists of N − 1 links and can easily be laid out in O(N) space using only short wires. Any contiguous segment of nodes provides a subnetwork of the same topology as the full network. A ring, or torus of N nodes, can be formed by simply connecting the two ends of an array. With unidirectional links, the diameter is N − 1, the average distance is N/2, the bisection width is one link, and there is one route between any pair of nodes. The relative address of B from A is (B − A) mod N. With bidirectional links, the diameter is N/2, the average distance is N/3, the degree of each node is two, and the bisection is two. There are two routes (two relative addresses) between each pair of nodes, so the network can function with degraded performance in the presence of a single faulty link. The network is easily laid out in O(N) space using only short wires, as indicated in Figure 10-6, by simply folding the ring. The network can be partitioned into smaller subnetworks; however, the subnetworks are linear arrays rather than rings. Although these one-dimensional networks are not scalable in any practical sense, they are an important building block, conceptually and in practice. The simple routing and low hardware complexity of rings have made them very popular for local area interconnects, including FDDI, Fibre Channel Arbitrated Loop, and the Scalable Coherent Interface (SCI). Since they can be laid out with very short wires, it is possible to make the links very wide. For example, the KSR-1 used a
32-node ring as a building block which was 128 bits wide. SCI obtains its bandwidth by using 16-bit links.
Figure 10-6 Linear and Ring Topologies The linear array and the torus (shown both directly and arranged to use short wires by folding) are easily laid out to use uniformly short wires. The distance and cost grow as O(N), whereas the aggregate bandwidth is only O(1).
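As a quick illustration of the routing rule just described, the sketch below computes the relative address on a ring and, for bidirectional links, picks the shorter of the two directions. The function name and node numbering 0..N−1 are assumptions for illustration.

    def ring_route(a, b, n, bidirectional=True):
        """Return (direction, hops) from node a to node b on an n-node ring."""
        forward = (b - a) % n            # relative address on a unidirectional ring
        if not bidirectional or forward <= n - forward:
            return ('+', forward)        # travel forward hops in the positive direction
        return ('-', n - forward)        # otherwise the other way around is shorter

    print(ring_route(1, 6, 8))                        # ('-', 3)
    print(ring_route(1, 6, 8, bidirectional=False))   # ('+', 5)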
10.3.3 Multidimensional meshes and tori Rings and arrays generalize naturally to higher dimensions, including 2D ‘grids’ and 3D ‘cubes’, with or without end-around connections. A d-dimensional array consists of N = k_{d−1} × … × k_0 nodes, each identified by its d-vector of coordinates (i_{d−1}, …, i_0), where 0 ≤ i_j ≤ k_j − 1 for 0 ≤ j ≤ d − 1. Figure 10-7 shows the common cases of two and three dimensions. For simplicity, we will assume the length along each dimension is the same, so N = k^d (that is, k = N^(1/d) and d = log_k N). This is called a d-dimensional k-ary array. (In practice, engineering constraints often result in non-uniform dimensions, but the theory is easily extended to handle that case.) Each node is a switch addressed by a d-vector of radix-k coordinates and is connected to the nodes that differ by one in precisely one coordinate. The node degree varies between d and 2d, inclusive, with nodes in the middle having full degree and the corners having minimal degree. For a d-dimensional k-ary torus, the edges wrap around from each face, so every node has degree 2d (in the bidirectional case) and is connected to the nodes differing by one (mod k) in each dimension. We will refer to arrays and tori collectively as meshes. The d-dimensional k-ary unidirectional torus is a very important class of networks, often called a k-ary d-cube, employed widely in modern parallel machines. These networks are usually employed as direct networks, so an additional switch degree is required to support the bidirectional host connection from each switch. They are generally viewed as having low degree, two or three, so the network scales by increasing k along some or all of the dimensions. To define the routes from a node A to a node B, let R = (b_{d−1} − a_{d−1}, …, b_0 − a_0) be the relative address of B from A. A route must cross r_i = b_i − a_i links in each dimension i in the appropriate direction. The simplest approach is to traverse the dimensions in order, so for i = 0…d − 1 travel r_i hops in the i-th dimension. This corresponds to traveling between two locations in a metropolitan grid by driving in, say, the east-west direction, turning once, and driving in the north-south direction. Of course, we can reach the same destination by first travelling north/south and
then east/west, or by zig-zagging anywhere between these routes. In general, we may view the source and destination points as corners of a subarray and follow any path between the corners that reduces the relative address at each hop.
Figure 10-7 Grid, Torus, and Cube Topologies Grids, tori, and cubes are special cases of k-ary d-cube networks, which are constructed with k nodes in each of d dimensions. Low dimensional networks pack well in physical space with short wires.
The diameter of the network is d(k − 1). The average distance is simply the average distance in each dimension, roughly (2/3)dk. If k is even, the bisection of a d-dimensional k-ary array is k^(d−1) bidirectional links. This is obtained by simply cutting the array in the middle with a (hyper)plane perpendicular to one of the dimensions. (If k is odd, the bisection may be a little bit larger.) For a unidirectional torus, the relative address and routing generalize along each dimension just as for a ring. All nodes have degree d (plus the host degree) and k^(d−1) links cross the middle in each direction. It is clear that a 2-dimensional mesh can be laid out in O(N) space in a plane with short wires, and a 3-dimensional mesh in O(N) volume in free space. In practice, engineering factors come into play. It is not really practical to build a huge 2D structure, and mechanical issues arise in how 3D is utilized, as illustrated by the following example.
Example 10-1
Using a direct 2D array topology, such as in the Intel Paragon where a single cabinet holds 64 processors forming a 4 wide by 16 high array of nodes (each node containing a message processor and 1 to 4 compute processors), how might you configure cabinets to construct a large machine with only short wires?
Answer
Although there are many possible approaches, the one used by Intel is illustrated in Figure 1-24 of Chapter 1. The cabinets stand on the floor, and large configurations are formed by attaching these cabinets side by side, forming a 16 × k array. The largest configuration was an 1824-node machine at Sandia National Laboratory
configured as a 16 × 114 array. The bisection bandwidth is determined by the 16 links that cross between cabinets. Other machines have found alternative strategies for dealing with real-world packaging restrictions. The MIT J-machine is a 3D torus, where each board comprises an 8x16 torus in the first two dimensions. Larger machines are constructed by stacking these boards next to one another, with board-to-board connections providing the links in the third dimension. The Intel ASCI Red machine, with 4536 compute nodes, is constructed as 85 cabinets. It allows long wires to run between cabinets in each dimension. In general, with higher dimensional meshes several logical dimensions are embedded in each physical dimension using longer wires. Figure 10-8 shows a 6x3x2 array and a 4-dimensional 3-ary array embedded in a plane. It is clear that for a given physical dimension, the average wire length and the number of wires increase with the number of logical dimensions.
Figure 10-8 Embeddings of many logical dimensions in two physical dimensions A higher dimensional k-ary d-cube can be laid out in 2D by replicating a 2D slice and then connecting the slices across the remaining dimensions. The wiring complexity of these higher dimensional networks can be easily seen in the figure.
10.3.4 Trees In meshes, the diameter and average distance increase with the d-th root of N. There are many other topologies where the routing distance grows only logarithmically. The simplest of these is a tree. A binary tree has degree three. Typically, trees are employed as indirect networks with hosts as the leaves, so for N leaves the diameter is 2 log N. (Such a topology could be used as a direct network of N = k log k nodes.) In the indirect case, we may treat the binary address of each node as a d = log N bit vector specifying a path from the root of the tree – the high-order bit indicates whether the node is below the left or right child of the root, and so on down the levels of the tree.
The levels of the tree correspond directly to the “dimension” of the network. One way to route from node A to node B would be to go all the way up to the root and then follow the path down specified by the address of B. Of course, we really only need to go up to the first common parent of the two nodes before heading down. Let R = B ⊗ A, the bit-wise xor of the node addresses, be the relative address of B from A, and let i be the position of the most significant 1 in R (counting from 1 at the low-order end). The route from A to B is simply i hops up, followed by the path down specified by the low-order i bits of B. Formally, a complete indirect binary tree is a network of 2N − 1 nodes organized as d = log_2 N levels. Host nodes occupy level 0 and are identified by a d-bit level address A = a_{d−1}, …, a_0. A switch node is identified by its level, i, and its d − i bit level address, A(i) = a_{d−1}, …, a_i. A switch node [i, A(i)] is connected to a parent, [i + 1, A(i + 1)], and two children, [i − 1, A(i) || 0] and [i − 1, A(i) || 1], where the vertical bars indicate bit-wise concatenation. There is a unique route between any pair of nodes, going up to the least common ancestor and back down, so there is no fault tolerance. The average distance is almost as large as the diameter, and the tree partitions naturally into subtrees. One virtue of the tree is the ease of supporting broadcast or multicast operations. Clearly, by increasing the branching factor of the tree the routing distance is reduced. In a k-ary tree, each node has k children, the height of the tree is d = log_k N, and the address of a host is specified by a d-vector of radix-k coordinates describing the path down from the root. One potential problem with trees is that they seem to require long wires. After all, when one draws a tree in a plane, the lines near the root appear to grow exponentially in length with the number of levels, as illustrated in the top portion of Figure 10-9, resulting in an O(N log N) layout with O(N) long wires. This is really a matter of how you look at it. The same 16-node tree is laid out compactly in two dimensions using a recursive “H-tree” pattern, which allows an O(N) layout with wires only O(√N) long [BhLe84]. One can imagine using the H-tree pattern with multiple nodes on a chip, among nodes (or subtrees) on a board, or between cabinets on a floor, while the linear layout might be used between boards. The more serious problem with the tree is its bisection: removing a single link near the root bisects the network. It has been observed that computer scientists have a funny notion of trees; real trees get thicker toward the trunk. An even better analogy is your circulatory system, where the heart forms the root and the cells the leaves. Blood cells are routed up the veins to the root and down the arteries to the cells. There is essentially constant bandwidth across each level, so that blood flows evenly. This idea has been adopted in an interesting variant of tree networks, called fat-trees, where the upward link to the parent has twice the bandwidth of the child links. Of course, packets don’t behave quite like blood cells, so there are some issues to be sorted out on how exactly to wire this up. These will fall out easily from butterflies. 10.3.5 Butterflies The constriction at the root of the tree can be avoided if there are “lots of roots.” This is provided by an important logarithmic network, called a butterfly. (The butterfly topology arises in many settings in the literature.
It is the inherent communication pattern on an element-by-element level of the FFT, the Batcher odd-even merge sort, and other important parallel algorithms. It is isomorphic to topologies in the networking literature including the Omega and SW-Banyan net-
works, and closely related to the shuffle-exchange network and the hypercube, which we discuss later.) Given 2x2 switches, the basic building block of the butterfly is obtained by simply crossing one of each pair of edges, as illustrated in the top of Figure 10-10. This is a tool for correcting one bit of the relative address: going straight leaves the bit the same, crossing flips the bit. These 2x2 butterflies are composed into a network of N = 2^d nodes in log_2 N levels by systematically changing the cross edges. A 16-node butterfly is illustrated in the bottom portion of Figure 10-10. This configuration shows an indirect network with unidirectional links going upward, so hosts deliver packets into level 0 and receive packets from level d. Each level corrects one additional bit of the relative address. Each node at level d forms the root of a tree with all the hosts as leaves, and from each host there is a tree of routes reaching every node at level d. A d-dimensional indirect butterfly has N = 2^d host nodes and d·2^(d−1) switch nodes of degree 2, organized as d levels of N/2 nodes each. A switch node at level i, [i, A], has its outputs connected to nodes [i + 1, A] and [i + 1, A ⊗ 2^i]. To route from A to B, compute the relative address R = A ⊗ B and at level i use the “straight” edge if r_i is 0 and the cross edge otherwise. The diameter is log N; in fact, all routes are log N long. The bisection is N/2. (A slightly different formulation, with only one host connected to each edge switch, has bisection N but twice as many switches at each level and one additional level.)
Figure 10-9 Binary Trees Trees provide a simple network with logarithmic depth. They can be laid out efficiently in 2D by using an H-tree configuration. Routing is simple and they contain trees as subnetworks. However, the bisection bandwidth is only O(1).
Figure 10-10 Butterfly The basic 2x2 butterfly building block (top) and a 16-node butterfly (bottom). Butterflies are a logarithmic depth network constructed by composing 2x2 blocks which correct one bit in the relative address. The network can be viewed as a tree with multiple roots.
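The per-level correction rule of the butterfly is easy to express directly; the sketch below (the function name is invented for illustration) lists the output chosen at each level when routing a packet through a d-dimensional indirect butterfly.

    def butterfly_route(a, b, d):
        """Output ('straight' or 'cross') chosen at each of the d levels from host a to host b."""
        r = a ^ b                                   # relative address
        return ['cross' if (r >> i) & 1 else 'straight' for i in range(d)]

    # 16-node butterfly (d = 4): routing from host 3 to host 13 corrects bits 1, 2, and 3.
    print(butterfly_route(3, 13, 4))   # ['straight', 'cross', 'cross', 'cross']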
A d-dimensional k-ary butterfly is obtained using switches of degree k, a power of 2. The address of a node is then viewed as a d-vector of radix-k coordinates, so each level corrects log k bits of the relative address. In this case, there are log_k N levels. In effect, this fuses adjacent levels into a higher radix butterfly. There is exactly one route from each host output to each host input, so there is no inherent fault tolerance in the basic topology. However, unlike the 1D mesh or the tree, where a broken link partitions the network, in the butterfly there is the potential for fault tolerance. For example, the route from A to B may be broken, but there is a path from A to another node C and from C to B. There are many proposals for making the butterfly fault tolerant by adding just a few extra links. One simple approach is to add an extra level to the butterfly so there are two routes to every destination from every source. This approach was used in the BBN TC2000. The butterfly appears to be a qualitatively more scalable network than meshes and trees, because each packet crosses log N links and there are N log N links in the network, so on average it should be possible for all the nodes to send messages anywhere all at once. By contrast, on a 2D torus or tree there are only two links per node, so nodes can only send messages a long distance infrequently and very few nodes are close. A similar argument can be made in terms of bisection. For a random permutation of data among the N nodes, N/2 messages are expected to cross the bisection in each direction. The butterfly has N/2 links across the bisection, whereas the d-dimensional
mesh has only N^((d−1)/d) links and the tree only one. Thus, as the machine scales, a given node in a butterfly can send every other message to a node on the other side of the machine, whereas a node in a 2D mesh can send only every (√N)/2-th message, and the tree only every N-th message, to the other side. There are two potential problems with this analysis, however. The first is cost. In a tree or d-dimensional mesh, the cost of the network is a fixed fraction of the cost of the machine: for each host node there is one switch and d links. In the butterfly, the cost of the network per node increases with the number of nodes, since for each host node there are log N switches and links. Thus, neither scales perfectly. The real question comes down to what a switch costs relative to a processor and what fraction of the overall cost of the machine we are willing to invest in the network in order to be able to deliver a certain communication performance. If the switch and link are 10% of the cost of the node, then on a thousand-processor machine the network will be only one half of the total cost with a butterfly. On the other hand, if the switch is equal in cost to a node, we are unlikely to consider more than a low dimensional network. Conversely, if we reduce the dimension of the network, we may be able to invest more in each switch. The second problem is that even though there are enough links in the butterfly to support the bandwidth-distance product of a random permutation, the topology of the butterfly will not allow an arbitrary permutation of N messages among the N nodes to be routed without conflict. A path from an input to an output blocks the paths of many other input-output pairs, because there are shared edges. In fact, even going through the butterfly twice, there are permutations that cannot be routed without conflicts. However, if two butterflies are laid back-to-back, so that a message goes through one in the forward direction and the other in the reverse direction, then for any permutation there is a choice of intermediate positions that allows a conflict-free routing of the permutation. This back-to-back butterfly is called a Benes network [Ben65, Lei92] and has been extensively studied because of its elegant theoretical properties. It is often seen as having little practical significance because it is costly to compute the intermediate positions and the permutation has to be known in advance. On the other hand, there is another interesting theoretical result that says that on a butterfly any permutation can be routed with very few conflicts, with high probability, by first sending every message to a random intermediate node and then routing it to the desired destination [Lei92]. These two results come together in a very nice practical way in the fat-tree network. A d-dimensional k-ary fat-tree is formed by taking a d-dimensional k-ary Benes network and folding it back on itself at the high-order dimension, as illustrated in Figure 10-11. The collection of N/2 switches at level i is viewed as 2^(d−i) “fat nodes” of 2^(i−1) switches. The edges of the forward-going butterfly go up the tree toward the roots and the edges of the reverse butterfly go down toward the leaves. To route from A to B, pick a random node C in the least common ancestor fat node of A and B; take the unique tree route from A to C and the unique tree route back down from C to B. Let i be the highest dimension of difference between A and B; then there are 2^i root nodes to choose from, so the longer the routing distance, the more the traffic can be distributed. This topology clearly has a great deal of fault tolerance, it has the bisection of the butterfly, it has the partitioning properties of the tree, and it allows essentially all permutations to be routed with very little contention. It is used in the Connection Machine CM-5 and the Meiko CS-2. In
the CM-5 the randomization on the upward path is done dynamically by the switches; in the CS-2 the source node chooses the ancestor. A particularly important practical property of butterflies and fat-trees is that the nodes have fixed degree, independent of the size of the network. This allows networks of any size to be constructed with the same switches. As is indicated in Figure 10-11, the physical wiring complexity of the higher levels of a fat-tree or any butterfly-like network becomes critical, as there are a large number of long wires that connect to different places.
Figure 10-11 Benes network and fat-tree A 16-node Benes network (unidirectional) and a 16-node 2-ary fat-tree (bidirectional). A Benes network is constructed essentially by connecting two butterflies back-to-back. It has the interesting property that it can route any permutation in a conflict-free fashion, given the opportunity to compute the routes off-line. Two forward-going butterflies do not have this property. The fat-tree is obtained by folding the second half of the Benes network back on itself and fusing the two directions, so it is possible to turn around at each level.
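The fat-tree routing rule just described lends itself to a short sketch: go up to a randomly chosen switch in the least-common-ancestor fat node, then deterministically back down. This is only a sketch under the 2-ary addressing above; the function name and its return encoding are invented for illustration.

    import random

    def fat_tree_route(a, b, d):
        """Route from leaf a to leaf b in a d-level 2-ary fat-tree.
        Returns (hops up, random switch index within the LCA fat node, down-path bits of b)."""
        r = a ^ b
        i = r.bit_length()                        # level of the least common ancestor fat node
        c = random.randrange(1 << max(i - 1, 0))  # that fat node contains 2**(i-1) switches
        down = [(b >> level) & 1 for level in reversed(range(i))]
        return i, c, down

    random.seed(0)
    print(fat_tree_route(0b0011, 0b1100, 4))   # LCA at the root level: 4 hops up, then 4 down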
10.3.6 Hypercubes It may seem that the straight edges in the butterfly are a waste, since they simply carry a packet forward to the next level in the same column. Consider what happens if we collapse all the switches in a column into a single switch of degree log N. This brings us full circle: the result is exactly a d-dimensional 2-ary torus! It is often called a hypercube or binary n-cube. Each of the N = 2^d nodes is connected to the d nodes that differ by exactly one bit in address. The relative address R(A, B) = A ⊗ B specifies the dimensions that must be crossed to go from A to B. Clearly, the length of the route is equal to the number of ones in the relative address. The dimensions can be corrected in any order (corresponding to the different ways of getting between opposite corners of the subcube); the butterfly routing corresponds exactly to dimension-order routing, called e-cube routing in the hypercube literature. Fat-tree routing corresponds to picking a random node in the subcube defined by the high-order bit of the relative address, sending the packet “up” to the random node and back “down” to the destination. Observe that the fat-tree uses distinct sets of links for the two directions, so to get the same properties we need a pair of bidirectional links between nodes in the hypercube.
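Dimension-order (e-cube) routing in the hypercube then reduces to scanning the relative address from one end; a minimal sketch (the function name is invented):

    def ecube_route(a, b):
        """Dimensions crossed, in order, when routing from node a to node b in a hypercube."""
        r = a ^ b                       # relative address: 1 bits mark dimensions to correct
        dims, i = [], 0
        while r:
            if r & 1:
                dims.append(i)          # correct dimension i (lowest remaining differing bit)
            r >>= 1
            i += 1
        return dims

    print(ecube_route(0b0101, 0b1100))   # [0, 3]: two hops, one per differing bit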
The hypercube is an important topology that has received tremendous attention in the theoretical literature. For example, lower dimensional meshes can be embedded one-to-one in the hypercube by choosing an appropriate labeling of the nodes. Recall that a Gray code sequence orders the numbers from 0 to 2^d − 1 so that adjacent numbers differ by one bit. This shows how to embed a 1D mesh into a d-cube, and it can be extended to any number of dimensions (see Exercises). Clearly, butterflies, shuffle-exchange networks, and the like embed easily. Thus, there is a body of literature on “hypercube algorithms” which subsumes the mesh and butterfly algorithms. (Interestingly, a d-cube does not quite embed a d − 1 level tree, because there is one extra node in the d-cube.) Practically speaking, the hypercube was used by many of the early large scale parallel machines, including the Cal Tech research prototypes [Seit85], the first three Intel iPSC generations [Rat85], and three generations of nCUBE machines. Later large scale machines, including the Intel Delta, Intel Paragon, and Cray T3D, have used low dimensional meshes. One of the reasons for the shift is that in practice the hypercube topology forces the designer to use switches of a degree that supports the largest possible configuration; ports are wasted in smaller configurations. The k-ary d-cube approach provides the practical scalability of allowing arbitrary-sized configurations to be constructed with a given set of components, i.e., with switches of fixed degree. Nonetheless, this begs the question of what the degree should be. The general trend in network design in parallel machines is toward switches that can be wired in an arbitrary topology. This is what we see, for example, in the IBM SP-2, SGI Origin, Myricom network, and most ATM switches. The designer may choose to adopt a particular regular topology, or may wire together configurations of different sizes differently. At any point in time, technology factors such as pin-out and chip area limit the largest potential degree. Still, it is a key question: how large should the degree be?
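As an aside, the Gray-code labeling mentioned above is easy to generate; the sketch below produces the standard binary-reflected Gray code, so consecutive positions of a 1D mesh map to hypercube nodes that differ in exactly one bit (shown only as an illustration of the embedding).

    def gray_code(d):
        """Return the 2**d node labels of a d-cube in an order where neighbors differ by one bit."""
        return [i ^ (i >> 1) for i in range(1 << d)]

    labels = gray_code(3)
    print([format(x, '03b') for x in labels])
    # ['000', '001', '011', '010', '110', '111', '101', '100'] -- each step flips a single bit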
10.4 Evaluating Design Trade-offs in Network Topology The k-ary d-cube provides a convenient framework for evaluating design alternatives for direct networks. The design question can be posed in two ways. Given a choice of dimension, the design of the switch is determined and we can ask how the machine scales. Alternatively, for the machine scale of interest, i.e., N = k^d, we may ask what the best dimensionality is under the salient cost constraints. We have the 2D torus at one extreme, the hypercube at the other, and a spectrum of networks in between. As with most aspects of architecture, the key to evaluating trade-offs is to define the cost model and performance model and then to optimize the design accordingly. Network topology has been a point of lively debate over the history of parallel architectures, to a large extent because different positions make sense under different cost models, and the technology keeps changing. Once the dimensionality (or degree) of the switch is determined, the space of candidate networks is relatively constrained, so the question being addressed is how large a degree is worth working toward. Let’s collect what we know about this class of networks in one place. The total number of switches is N, regardless of degree; however, the switch degree is d, so the total number of links is C = Nd and there are 2wd pins per node. The average routing distance is d(k − 1)/2, the
diameter is d(k − 1), and k^(d−1) = N/k links cross the bisection in each direction (for even k). Thus, there are 2Nw/k wires crossing the middle of the network.
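These quantities are easy to tabulate; the sketch below is a straightforward transcription of the expressions above (the function name is invented), and is used informally in the comparisons that follow.

    def kary_dcube_stats(k, d, w=1):
        """Basic quantities for an N = k**d node k-ary d-cube with channel width w."""
        N = k ** d
        return {
            'nodes': N,
            'links': N * d,                     # C = Nd
            'pins_per_node': 2 * w * d,         # 2wd
            'avg_distance': d * (k - 1) / 2,    # d(k-1)/2
            'diameter': d * (k - 1),
            'bisection_links': k ** (d - 1),    # N/k links in each direction (even k)
            'bisection_wires': 2 * N * w // k,  # 2Nw/k
        }

    print(kary_dcube_stats(32, 2, w=16))   # a 1024-node 2D torus with 16-bit links
    print(kary_dcube_stats(2, 10, w=16))   # a 1024-node hypercube with 16-bit links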
If our primary concern is the routing distance, then we are inclined to maximize the dimension and build a hypercube. This would be the case with store-and-forward routing, assuming that the degree of the switch and the number of links were not a significant cost factor. In addition, we get to enjoy its elegant mathematical properties. Not surprisingly, this was the topology of choice for most of the first generation of large-scale parallel machines. However, with cut-through routing and a more realistic hardware cost model, the choice is much less clear. If the number of links or the switch degree is the dominant cost, we are inclined to minimize the dimension and build a mesh. For the evaluation to make sense, we want to compare the performance of design alternatives with roughly equal cost. Different assumptions about what aspects of the system are costly lead to very different conclusions. The assumed communication pattern influences the decision too. If we look at the worst case traffic pattern for each network, we will prefer high dimensional networks, where essentially all the paths are short. If we look at patterns where each node is communicating with only one or two near neighbors, we will prefer low dimensional networks, since only a few of the dimensions are actually used. 10.4.1 Unloaded Latency Figure 10-12 shows the increase in unloaded latency under our model of cut-through routing for 2, 3, and 4-cubes, as well as binary d-cubes (k = 2), as the machine size is scaled up. It assumes unit routing delay per stage and shows message sizes of 40 and 140 flits. These reflect two common message sizes, for w = 1 byte. The bottom line shows the portion of the latency resulting from channel occupancy. As we should expect, for smaller messages (or larger routing delay per stage) the scaling of the low dimension networks is worse, because a message experiences more routing steps on average. However, in making comparisons across the curves in this figure, we are tacitly assuming that the difference in degree is not a significant component of the system cost. Also, one-cycle routing is very aggressive; more typical values for high performance switches are 4-8 network cycles (see Table 10-1). On the other hand, larger message sizes are also common. To focus our attention on the dimensionality of the network as a design issue, we can fix the cost model and the number of nodes that reflects our design point and examine the performance characteristics of networks with fixed cost for a range of d. Figure 10-13 shows the unloaded latency for short messages as a function of the dimensionality for four machine sizes. For large machines, the routing delays in low dimensionality networks dominate. For higher dimensionality, the latency approaches the channel time. This “equal number of nodes” cost model has been widely used to support the view that low dimensional networks do not scale well. It is not surprising that higher dimensional networks are superior under the cost model of Figure 10-13, since the added switch degree, number of channels, and channel length come essentially for free. The high dimension networks have a much larger number of wires and pins, and bigger switches, than the low dimension networks. For the rightmost end of the graph in Figure 10-13, the network design is quite impractical. The cost of the network is a significant fraction of the cost of a large scale parallel machine, so it makes sense to compare equal cost
Figure 10-12 Unloaded Latency Scaling of networks for various dimensions with fixed link width The two panels plot average latency for n = 40 and n = 140 against machine size (up to 10,000 nodes) for d = 2, 3, and 4 and for binary cubes (k = 2). The n/w line shows the channel occupancy component of message transmission, i.e., the time for the bits to cross a single channel, which is independent of network topology. The curves show the additional latency due to routing.
Figure 10-13 Unloaded Latency for k-ary d-cubes with Equal Node Count (n = 40B, Δ = 2) as a function of the degree Curves are shown for 256, 1024, 16384, and 1048576 nodes. With the link width and routing delay fixed, the unloaded latency for large networks rises sharply at low dimensions due to routing distance.
designs under appropriate technological assumptions. As chip size and density improve, the switch internals tend to become a less significant cost factor, whereas pins and wires remain critical. In the limit, the physical volume of the wires presents a fundamental limit on the amount of
interconnection that is possible. So let’s compare these networks under assumptions of equal wiring complexity. One sensible comparison is to keep the total number of wires per node constant, i.e., to fix the number of pins 2dw. Let’s take as our baseline a 2-cube with channel width w = 32, so there are a total of 128 wires per node. With more dimensions there are more channels, and they must each be thinner; in particular, w_d = 64/d. So, in an 8-cube the links are only 8 bits wide. Assuming 40 and 140 byte messages, a uniform routing delay of 2 cycles per hop, and a uniform cycle time, the unloaded latency under equal pin scaling is shown in Figure 10-14. This figure shows a very different story. As a result of narrower channels, the channel time becomes greater with increasing dimension; this offsets the reduction in routing delay stemming from the smaller routing distance. The very large configurations still experience large routing delays at low dimensions, regardless of the channel width, but all the configurations have an optimum unloaded latency at modest dimension.
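The equal-pin comparison can be reproduced directly from the cut-through latency model T = n/w + h·Δ used here; the short sketch below assumes the 128-pins-per-node budget from the text, unit cycle time, and the average routing distance d(k − 1)/2, and shows why an intermediate dimension minimizes unloaded latency.

    def unloaded_latency(n_bytes, N, d, delta=2, pins=128):
        """Cut-through unloaded latency under an equal-pin-count scaling rule."""
        k = N ** (1.0 / d)                   # nodes per dimension (N = k**d)
        w = pins / (2.0 * d)                 # channel width in bits: 2*d*w = pins
        channel_time = 8.0 * n_bytes / w     # n/w, in network cycles
        routing_delay = delta * d * (k - 1) / 2.0
        return channel_time + routing_delay

    for d in (2, 3, 5, 10):
        print(d, round(unloaded_latency(40, 1024, d), 1))
    # latency falls and then rises as narrower channels overtake the shorter routes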
Figure 10-14 Unloaded Latency for k-ary d-cubes with Equal Pin Count (n = 40B and n = 140B, Δ = 2) The two panels show average latency against dimension for 256, 1024, 16K, and 1M nodes. With equal pin count, higher dimension implies narrower channels, so the optimal design point balances the routing delay, which increases with lower dimension, against channel time, which increases with higher dimension.
If the design is not limited by pin count, the critical aspect of the wiring complexity is likely to be the number of wires that cross through the middle of the machine. If the machine is viewed as laid out in a plane, the physical bisection width grows only with the square root of the area, and in three-space it only grows as the two-thirds power of the volume. Even if the network has a high logical dimension, it must be embedded in a small number of physical dimensions, so the designer must contend with the cross-sectional area of the wires crossing the midplane. We can focus on this aspect of the cost by comparing designs with equal number of wires crossing the bisection. At one extreme, the hypercube has N such links. Let us assume these have unit
size. A 2D torus has only 2√N links crossing the bisection, so each link could be √N/2 times the width of that used in the hypercube. By the equal bisection criterion, we should compare a 1024-node hypercube with bit-serial links with a torus of the same size using 32-bit links. In general, the d-dimensional mesh with the same bisection width as the N-node hypercube has links of width w_d = N^(1/d)/2 = k/2. Assuming cut-through routing, the average latency of an n-byte packet to a random destination on an unloaded network is as follows.

T(n, N, d) = n/w_d + Δ·d·(k − 1)/2 = n/(k/2) + Δ·d·(k − 1)/2 = 2n/N^(1/d) + Δ·d·(N^(1/d) − 1)/2

(EQ 10.7)

Thus, increasing the dimension tends to decrease the routing delay but increase the channel time, as with equal pin count scaling. (The reader can verify that the minimum latency is achieved when the two terms are essentially equal.) Figure 10-15 shows the average latency for 40-byte messages, assuming Δ = 2, as a function of the dimension for a range of machine sizes. As the dimension increases from d = 2, the routing delay drops rapidly, whereas the channel time increases steadily throughout as the links get thinner. (The d = 2 point is not shown for N = 1M nodes because it is rather ridiculous; the links are 512 bits wide and the average number of hops is 1023.) For machines up to a few thousand nodes, 3·∛N/2 and log N are very close, so the impact of the additional channels on channel width becomes the dominant effect. If large messages are considered, the routing component becomes even less significant. For large machines, the low dimensional meshes under this scaling rule become impractical because the links become very wide. Thus far we have concerned ourselves with the wiring cross-sectional area, but we have not worried about the wire length. If a d-cube is embedded in a plane, i.e., d/2 dimensions are embedded in each physical dimension, such that the distance between the centers of the nodes is fixed, then each additional pair of dimensions increases the length of the longest wire by a factor of k. Thus, the length of the longest wire in a d-cube is k^(d/2 − 1) times that in the 2-cube. Accounting for increased wire length further strengthens the argument for a modest number of dimensions. There are three ways this accounting might be done. If one assumes that multiple bits are pipelined on the wire, then the increased length effectively increases the routing delay. If the wires are not pipelined, then the cycle time of the network is increased as a result of the long time to drive the wire. If the drivers are scaled up perfectly to meet the additional capacitive load, the increase is logarithmic in the wire length; if the signal must propagate down the wire like a transmission line, the increase is linear in the length. The embedding of the d-dimensional network into few physical dimensions introduces second-order effects that may enhance the benefit of low dimensions. If a high dimension network is embedded systematically into a plane, the wire density tends to be highest near the bisection and low near the perimeter. A 2D mesh has uniform wire density throughout, so it makes better use of the area that it occupies.
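EQ 10.7 is simple to evaluate; the following sketch is a direct transcription under the equal-bisection assumption w_d = k/2, with the packet length measured in unit-width flits, and reproduces the qualitative shape of Figure 10-15.

    def latency_equal_bisection(n, N, d, delta=2):
        """Unloaded cut-through latency (EQ 10.7) under equal-bisection scaling.
        n is the packet length in unit-width flits; w_d = k/2 relative to unit-wide links."""
        k = N ** (1.0 / d)
        w = k / 2.0
        return n / w + delta * d * (k - 1) / 2.0

    for d in (2, 3, 4, 6, 10):
        print(d, round(latency_equal_bisection(40, 1024, d), 1))
    # the channel term grows as the routing term shrinks; the two balance at a modest dimension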
Figure 10-15 Unloaded Latency for k-ary d-cubes with Equal Bisection Width (n = 40B, Δ = 2) Curves show average latency against dimension for 256, 1024, 16K, and 1M nodes. The balance between routing delay and channel time shifts even more in favor of low degree networks with an equal bisection width scaling rule.
We should also look at the trade-offs in network design from a bandwidth viewpoint. The key factor influencing latency with equal wire complexity scaling is the increased channel bandwidth at low dimension. Wider channels are beneficial if most of the traffic arises from or is delivered to one or a few nodes. If traffic is localized, so each node communicates with just a few neighbors, only a few of the dimensions are utilized and again the higher link bandwidth dominates. If a large number of nodes are communicating throughout the machine, then we need to model the effects of contention on the observed latency and see where the network saturates. Before leaving the examination of trade-offs for latency in the unloaded case, we note that the evaluation is rather sensitive to the relative time to cross a wire and to cross a switch. If the routing delay per switch is 20 times that of the wire, the picture is very different, as shown in Figure 10-16. 10.4.2 Latency under load In order to analyze the behavior of a network under load, we need to capture the effects of traffic congestion on all the other traffic that is moving through the network. These effects can be subtle and far reaching. Notice when you are next driving down a loaded freeway, for example, that where a pair of freeways merge and then split the traffic congestion is worse than even a series of on-ramps merging into a single freeway. There is far more driver-to-driver interaction in the interchange and at some level of traffic load the whole thing just seems to stop. Networks behave in a similar fashion, but there are many more interchanges. (Fortunately, rubber-necking at a roadside
Figure 10-16 Unloaded Latency for k-ary d-cubes with Equal Pin Count and larger routing delay (n = 140B, Δ = 20) Curves show average latency against dimension for 256, 1024, 16K, and 1M nodes.
When the time to cross a switch is significantly larger than the time to cross a wire, as is common in practice, higher degree switches become much more attractive.
disturbance is less of an issue). In order to evaluate these effects in a family of topologies, we must take a position on the traffic pattern, the routing algorithm, the flow-control strategy, and a number of detailed aspects of the internal design of the switch. One can then either develop a queuing model for the system or build a simulator for the proposed set of designs. The trick, as in most other aspects of computer design, is to develop models that are simple enough to provide intuition at the appropriate level of design, yet accurate enough to offer useful guidance as the design refinement progresses. We will use a closed-form model of contention delays, due to Agarwal [Aga91], for k-ary d-cubes under random traffic with dimension-order cut-through routing and unbounded internal buffers, so flow control and deadlock issues do not arise. The model predictions correlate well with simulation results for networks meeting the same assumptions. This model is based on earlier work by Kruskal and Snir modeling the performance of indirect (Banyan) networks [KrSn83]. Without going through the derivation, the main result is that we can model the latency for random communication of messages of size n on a k-ary d-cube with channel width w, at a load corresponding to an aggregate channel utilization of ρ, by:

T(n, k, d, w, ρ) = n/w + h_ave · (Δ + W(n, k, d, w, ρ))

W(n, k, d, w, ρ) = (n/w) · (ρ/(1 − ρ)) · ((h_ave − 1)/h_ave^2) · (1 + 1/d)

(EQ 10.8)

where h_ave = d(k − 1)/2.
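A direct transcription of EQ 10.8 makes it easy to reproduce curves like those in Figure 10-17; the function name is invented, and Δ is the per-hop routing delay.

    def contended_latency(n, k, d, w, rho, delta=2):
        """Agarwal-style contention model (EQ 10.8) for a k-ary d-cube under random traffic."""
        h_ave = d * (k - 1) / 2.0
        waiting = (n / w) * (rho / (1.0 - rho)) * ((h_ave - 1.0) / h_ave ** 2) * (1.0 + 1.0 / d)
        return n / w + h_ave * (delta + waiting)

    # 1024-node 32-ary 2-cube vs. 1000-node 10-ary 3-cube, 40-phit messages, equal channel width
    for rho in (0.1, 0.4, 0.7):
        print(rho, round(contended_latency(40, 32, 2, 1, rho), 1),
                   round(contended_latency(40, 10, 3, 1, rho), 1))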
Using this model, we can compare the latency under load for networks of low or high dimension of various sizes, as we did for unloaded latency. Figure 10-17 shows the predicted latency on a 1000-node 10-ary 3-cube and a 1024-node 32-ary 2-cube as a function of the requested aggregate channel utilization, assuming equal channel width, for relatively small message sizes of 4, 8, 16, and 40 phits. We can see from the right-hand end of the curves that the two networks saturate at roughly the same channel utilization; however, this saturation point decreases rapidly with the message size. The left end of the curves indicates the unloaded latency. The higher degree switch enjoys a lower base routing delay with the same channel time, since there are fewer hops and equal channel widths. As the load increases, this difference becomes less significant. Notice how large the contended latency is compared to the unloaded latency. Clearly, in order to deliver low latency communication to the user program, it is important that the machine be designed so that the network does not go into saturation easily, either by providing excess network bandwidth (“headroom”) or by conditioning the processor load.
Figure 10-17 Latency with Contention vs. load for a 32-ary 2-cube and a 10-ary 3-cube with routing delay 2 Curves show message sizes of 4, 8, 16, and 40 phits for each network. At low channel utilization the higher dimensional network has significantly lower latency with equal channel width, but as the utilization increases they converge toward the same saturation point.
The data in Figure 10-17 raises a basic tradeoff in network design. How large a packet should the network be designed for? The data shows clearly that networks move small packets more efficiently than large ones. However, smaller packets have worse packet efficiency, due to the routing and control information contained in each one, and require more network interface events for the same amount of data transfer.
We must be careful about the conclusions we draw from Figure 10-17 regarding the choice of network dimension. The curves in the figure show how efficiently each network utilizes the set of channels it has available to it. The observation is that both use their channels with roughly equal effectiveness. However, the higher dimensional network has a much greater available bandwidth per node; it has 1.5 times as many channels per node and each message uses fewer channels. In a k-ary d-cube the available flits per cycle under random communication is Nd/(d(k − 1)/2), or 2/(k − 1) flits per cycle per node (2w/(k − 1) bits per cycle). The 3-cube in our example has almost four times as much available bandwidth at the same channel utilization, assuming equal channel width. Thus, if we look at latency against delivered bandwidth, the picture looks rather different, as shown in Figure 10-18. The 2-cube starts out with a higher base latency and saturates before the 3-cube begins to feel the load.
Figure 10-18 Latency vs. flits per cycle with Contention for a 32-ary 2-cube and a 10-ary 3-cube, routing delay 2, message size n = 8 phits.
This comparison brings us back to the question of an appropriate equal cost comparison. As an exercise, the reader can investigate the curves for equal pin-out and equal bisection comparisons. Widening the channels shifts the base latency down by reducing channel time, increases the total bandwidth available, and reduces the waiting time at each switch, since each packet is serviced faster. Thus, the results are quite sensitive to the cost model. There are interesting observations that arise when the width is scaled against dimension. For example, using the equal bisection rule, the capacity per node is C(N, d) = w_d · 2/(k − 1) ≈ 1. The aggregate capacity for random traffic is essentially independent of dimension! Each host can
expect to drive, on average, a fraction of a bit per network clock period. This observation yields a new perspective on low dimension networks. Generally, the concern is that each of several nodes must route messages a considerable distance along one dimension. Thus, each node must send packets infrequently. Under the fixed bisection width assumption, with low dimension the channel becomes a shared resource pool for several nodes, whereas high dimension networks partition the bandwidth resource for traffic in each of several dimensions. When a node uses a channel, in the low dimension case it uses it for a shorter amount of time. In most systems, pooling results in better utilization than partitioning. In current machines, the area occupied by a node is much larger than the cross-section of the wires and the scale is generally limited to a few thousand nodes. In this regime, the bisection of the machine is often realized by bundles of cables. This does represent a significant engineering challenge, but it is not a fundamental limit on the machine design. There is a tendency to use wider links and faster signalling in topologies with short wires, as illustrated by Table 10-1, but it is not as dramatic as the equal bisection scaling rule would suggest.
10.5 Routing The routing algorithm of a network determines which of the possible paths from source to destination are used as routes and how the route followed by each particular packet is determined. We have seen, for example, that in a k -ary d -cube the set of shortest routes is completely described by the relative address of the source and destination, which specifies the number of links that need to be crossed in each dimension. Dimension order routing restricts the set of legal paths so that there is exactly one route from each source to each destination, the one obtained by first traveling the correct distance in the high-order dimension, then the next dimension and so on. This section describes the different classes of routing algorithms that are used in modern machines and the key properties of good routing algorithms, such as producing a set of deadlock-free routes, maintaining low latency, spreading load evenly, and tolerating faults. 10.5.1 Routing Mechanisms Let’s start with the nuts and bolts. Recall, the basic operation of a switch is to monitor the packets arriving at its inputs and for each input packet select an output port on which to send it out. Thus, a routing algorithm is a function R : N × N → C , which at each switch maps the destination node, n d , to the next channel on the route. There are basically three mechanisms that high-speed switches use to determine the output channel from information in the packet header: arithmetic, source-based port select, and table look-up. All of these mechanisms are drastically simpler than the kind of general routing computations performed in traditional LAN and WAN routers. In parallel computer networks, the switch needs to be able to make the routing decision for all its inputs every cycle, so the mechanism needs to be simple and fast.
Simple arithmetic operations are sufficient to select the output port in most regular topologies. For example, in a 2D mesh each packet will typically carry the signed distance to travel in each dimension, [Δx, Δy], in the packet header. The routing operation at switch ij is given by:

Direction       Condition
West (−x)       Δx < 0
East (+x)       Δx > 0
South (−y)      Δx = 0, Δy < 0
North (+y)      Δx = 0, Δy > 0
Processor       Δx = 0, Δy = 0
To accomplish this kind of routing, the switch needs to test the address in the header and decrement or increment one routing field. For a binary cube, the switch computes the position of the first bit that differs between the destination and the local node address, or the first non-zero bit if the packet carries the relative address of the destination. This kind of mechanism is used in the Intel and nCUBE hypercubes, the Paragon, the Cal Tech Torus Routing Chip [Sesu93], and the J-machine, among others. A more general approach is source-based routing, in which the source builds a header consisting of the output port number for each switch along the route, p_0, p_1, …, p_{h−1}. Each switch simply strips off the port number from the front of the message and sends the message out on the specified channel. This allows a very simple switch, with little control state and without even arithmetic units, to support sophisticated routing functions on arbitrary topologies. All of the intelligence is in the host nodes. It has the disadvantage that the header tends to be large, and usually of variable size. If the switch degree is d and routes of length h are permitted, the header may need to carry h log d routing bits. This approach is used in the MIT Parc and Arctic routers, the Meiko CS-2, and Myrinet. A third approach, which is general purpose and allows for a small fixed-size header, is table-driven routing, in which each switch contains a routing table, R, and the packet header contains a routing field, i, so the output port is determined by indexing into the table by the routing field, o = R[i]. This is used, for example, in HPPI and ATM switches. Generally, the table entry also gives the routing field for the next step in the route, (o, i′) = R[i], to allow more flexibility in the table configuration. The disadvantage of this approach is that the switch must contain a sizable amount of routing state, and it requires additional switch-specific messages or some other mechanism for establishing the contents of the routing table. Furthermore, fairly large tables are required to simulate even simple routing algorithms. This approach is better suited to LAN and WAN traffic, where only a few of the possible routes among the collection of nodes are used at a time and these are mostly long-lasting connections. By contrast, in a parallel machine there is often traffic among all the nodes. Traditional networking routers contain a full processor, which can inspect the incoming message, perform an arbitrary calculation to select the output port, and build a packet containing the message data for that output. This kind of approach is often employed in routers and (sometimes) bridges, which connect completely different networks (e.g., those that route between Ethernet, FDDI, and ATM) or at least different data link layers. It really does not make sense at the time scale of the communication within a high performance parallel machine.
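The arithmetic mechanism in the table above amounts to a few comparisons and one increment or decrement per hop; a minimal sketch follows (the port names and the function are invented for illustration).

    def route_2d_mesh(dx, dy):
        """Select an output port from the signed header offsets [dx, dy] and update the header."""
        if dx < 0:
            return 'west', (dx + 1, dy)     # one hop in -x, decrement the remaining distance
        if dx > 0:
            return 'east', (dx - 1, dy)
        if dy < 0:
            return 'south', (dx, dy + 1)
        if dy > 0:
            return 'north', (dx, dy - 1)
        return 'processor', (0, 0)          # arrived: deliver to the local host

    header = (-2, 1)
    while True:
        port, header = route_2d_mesh(*header)
        print(port)
        if port == 'processor':
            break
    # west, west, north, processor -- dimension-order (x then y) routing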
10.5.2 Deterministic Routing

A routing algorithm is deterministic (or nonadaptive) if the route taken by a message is determined solely by its source and destination, and not by other traffic in the network. For example, dimension order routing is deterministic; the packet will follow its path regardless of whether a link along the way is blocked. Adaptive routing algorithms allow the route for a packet to be influenced by traffic it meets along the way. For example, in a mesh the route could zig-zag toward its destination if links along the dimension order path were blocked or faulty. In a fat tree, the upward path toward the common ancestor can steer away from blocked links, rather than following a random path determined when the message was injected into the network. If a routing algorithm only selects shortest paths toward the destination, it is minimal; otherwise it is nonminimal. Allowing multiple routes between each source and destination is clearly required for fault tolerance, and it also provides a way to spread load over more links. These virtues are enjoyed by source-based and table-driven routing, but the choice is made when the packet is injected. Adaptive routing delays the choice until the packet is actually moving through the switch, which clearly makes the switch more complex, but has the potential of obtaining better link utilization. We will concentrate first on deterministic routing algorithms and develop an understanding of some of the most popular routing algorithms and the techniques for proving them deadlock free, before investigating adaptive routing.

10.5.3 Deadlock Freedom

In our discussions of network latency and bandwidth we have tacitly assumed that messages make forward progress, so it is meaningful to talk about performance. In this section, we show how one goes about proving that a network is deadlock free. Recall, deadlock occurs when a packet waits for an event that cannot occur, for example when no message can advance toward its destination because the queues of the message system are full and each is waiting for another to make resources available. This can be distinguished from indefinite postponement, which occurs when a packet waits for an event that can occur, but never does, and from livelock, which occurs when the routing of a packet never leads to its destination. Indefinite postponement is primarily a question of fairness, and livelock can only occur with adaptive nonminimal routing. Being free from deadlock is a basic property of well-designed networks that must be addressed from the very beginning. Deadlock can occur in a variety of situations. A "head-on" deadlock may occur when two nodes attempt to send to each other and each begins sending before either receives. It is clear that if they both attempt to complete sending before receiving, neither will make forward progress. We saw this situation at the user message passing layer using synchronous send and receive, and we saw it at the node-to-network interface layer. Within the network it could potentially occur with half-duplex channels, or if the switch controller were not able to transmit and receive simultaneously on a bidirectional channel. We should think of the channel as a shared resource, which is acquired incrementally, first at the sending end and then at the receiver. In each case, the solution is to ensure that nodes can continue to receive while being unable to send.
One could make this example more elaborate by putting more switches and links between the two nodes, but it is clear that a reliable network can only be deadlock free if the nodes are able to remove packets from the network even when they are unable to send packets. (Alternatively, we might recover from the deadlock by eventually detecting a timeout and aborting one or more of the packets, effectively
preempting the claim on the shared resource. This does raise the possibility of indefinite postponement, which we will address later.) In this head-on situation there is no routing involved; instead the problem is due to constraints imposed by the switch design. A more interesting case of deadlock occurs when there are multiple messages competing for resources within the network, as in the routing deadlock illustrated in Figure 10-19. Here we have several messages moving through the network, where each message consists of several phits. We should view each channel in the network as having associated with it a certain amount of buffer resources; these may be input buffers at the channel destination, output buffers at the channel source, or both. In our example, each message is attempting to turn to the left and all of the packet buffers associated with the four channels are full. No message will release a packet buffer until after it has acquired a new packet buffer into which it can move. One could make this example more elaborate by separating the switches involved by additional switches and channels, but it is clear that the channel resources are allocated within the network on a distributed basis as a result of messages being routed through. They are acquired incrementally and cannot be preempted without losing packets. This routing deadlock can occur with store-and-forward routing, or even more easily with cut-through routing, because each packet stretches over several phit buffers. Only the header flits of a packet carry routing information, so once the head of a message is spooled forward on a channel, all of the remaining flits of the message must spool along on the same channel. Thus, a single packet may hold onto channel resources across several switches. While we might hope to interleave packets with store-and-forward routing, the phits of a packet must be kept together. The essential point in these examples is that there are resources logically associated with channels and that messages introduce dependences between these resources as they move through the network. The basic technique for proving a network deadlock free is to articulate the dependences that can arise between channels as a result of messages moving through the network and to show that there are no cycles in the overall channel dependence graph; hence there are no traffic patterns that can lead to deadlock. The most common way of doing this is to number the channel resources such that all routes follow a monotonically increasing (or decreasing) sequence, hence no dependence cycles can arise. For a butterfly this is trivial, because the network itself is acyclic. It is also simple for trees and fat trees, as long as the upward and downward channels are independent. For networks with cycles in the channel graph, the situation is more interesting. To illustrate the basic technique for showing a routing algorithm to be deadlock free, let us show that ∆x, ∆y routing on a k-ary 2D array is deadlock free. To prove this, view each bidirectional channel as a pair of unidirectional channels, numbered independently. Assign each positive-x channel ⟨i, y⟩ → ⟨i+1, y⟩ the number i, and similarly number the negative-x channels starting from 0 at the most positive edge. Assign the positive-y channel ⟨x, j⟩ → ⟨x, j+1⟩ the number N + j, and similarly number the negative-y edges starting from the most positive edge. This numbering is illustrated in Figure 10-20.
Any route consisting of a sequence of consecutive edges in one x direction, a 90 degree turn, and a sequence of consecutive edges in one y direction is strictly increasing. The channel dependence graph has a node for every unidirectional link in the network and there is an edge from node A to node B if it is possible for a packet to traverse channel A and then channel B. All edges in the channel dependence graph go from lower numbered nodes to higher numbered ones, so there are no cycles in the channel dependence graph.
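The channel numbering argument can also be checked mechanically. The following sketch enumerates every ∆x, ∆y route on a 4-ary 2-array and verifies that the channel numbers it uses are strictly increasing; the offset N used for the y channels only needs to exceed every x channel number, and the particular value chosen here is an illustrative assumption.

    /* Sketch: check that every dimension-order (x then y) route on a k-ary
       2-array follows strictly increasing channel numbers under the
       numbering described above. */
    #include <stdio.h>

    #define K 4      /* 4-ary 2-array, as in Figure 10-20 */
    #define N K      /* y-channel offset; must exceed every x channel number */

    static int chan_x(int x, int dir) { return dir > 0 ? x : (K - 1) - x; }
    static int chan_y(int y, int dir) { return dir > 0 ? N + y : N + (K - 1) - y; }

    int main(void)
    {
        for (int sx = 0; sx < K; sx++)
        for (int sy = 0; sy < K; sy++)
        for (int tx = 0; tx < K; tx++)
        for (int ty = 0; ty < K; ty++) {
            int last = -1, x = sx, y = sy;
            while (x != tx) {                      /* travel in x first */
                int dir = (tx > x) ? 1 : -1;
                int c = chan_x(x, dir);
                if (c <= last) { printf("route not increasing\n"); return 1; }
                last = c;
                x += dir;
            }
            while (y != ty) {                      /* then travel in y */
                int dir = (ty > y) ? 1 : -1;
                int c = chan_y(y, dir);
                if (c <= last) { printf("route not increasing\n"); return 1; }
                last = c;
                y += dir;
            }
        }
        printf("all dimension-order routes follow strictly increasing channel numbers\n");
        return 0;
    }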
Figure 10-19 Examples of Network Routing Deadlock. Each of four switches has four input and output ports. Four packets have each acquired an input port and an output buffer, and all are attempting to acquire the next output buffer as they turn left. None will relinquish its output buffer until it moves forward, so none can progress.
This proof easily generalizes to any number of dimensions and, since a binary d-cube consists of a pair of unidirectional channels in each dimension, also shows that e-cube routing is deadlock free on a hypercube. Observe, however, that the proof does not apply to k-ary d-cubes in general, because the channel number decreases at the wrap-around edges. Indeed, it is not hard to show that for k > 4, dimension order routing will introduce a dependence cycle on a unidirectional torus, even with d = 1. Notice that the deadlock-free routing proof applies even if there is only a single flit of buffering on each channel, and that the potential for deadlock exists in a k-ary d-cube even with multiple packet buffers and store-and-forward routing, since a single message may fill up all the packet buffers along its route. However, if the use of channel resources is restricted, it is possible to break the deadlock. For example, consider the case of a unidirectional torus with multiple packet buffers per channel and store-and-forward routing. Suppose that one of the packet buffers associated with each channel is reserved for messages destined for nodes with a larger number than their source, i.e., packets that do not use wrap-around channels. This means that it will always be possible for "forward going" messages to make progress. Although wrap-around messages may be postponed, the network does not deadlock. This solution is typical of the family of techniques for making store-and-forward packet-switched networks deadlock free; there is a concept of a structured buffer pool, and the routing algorithm restricts the assignment of buffers to packets to break dependence cycles. This solution is not sufficient for wormhole routing, since it tacitly assumes that packets of different messages can be interleaved as they move forward. Observe that deadlock-free routing does not mean the system is deadlock free. The network is deadlock free only as long as it can be drained into the NIs, even when the NIs are unable to send.
Figure 10-20 Channel ordering in the network graph and corresponding channel dependence graph. To show that the routing algorithm is deadlock free, it is sufficient to demonstrate that the channel dependence graph has no cycles.

If two-phase protocols are employed, we need to ensure that fetch-deadlock is avoided. This means either providing two logically independent networks, or ensuring that the two phases are decoupled through NI buffers. Of course, the program may still have deadlocks, such as circular waits on locks or head-on collisions using synchronous message passing. We have worked our way down from the top, showing how to make each of these layers deadlock free, as long as the next layer is deadlock free. Given a network topology and a set of resources per channel, there are two basic approaches for constructing a deadlock-free routing algorithm: restrict the paths that packets may follow or restrict how resources are allocated. This observation raises a number of interesting questions. Is there a general technique for producing deadlock-free routes with wormhole routing on an arbitrary topology? Can such routes be adaptive? Is there some minimum amount of channel resources required?

10.5.4 Virtual Channels

The basic technique for making networks with wormhole routing deadlock free is to provide multiple buffers with each physical channel and to split these buffers into a group of virtual channels. Going back to our basic cost model for networks, this does not increase the number of links in the network, nor the number of switches. In fact, it does not even increase the size of the cross-bar internal to each switch, since only one flit at a time moves through the switch for each output channel. As indicated in Figure 10-21, it does require additional selectors and multiplexors
within the switch to allow the links and the cross-bar to be shared among multiple virtual channels per physical channel.
Figure 10-21 Multiple Virtual Channels in a basic switch. Each physical channel is shared by multiple virtual channels. The input ports of the switch split the incoming virtual channels into separate buffers; however, these are multiplexed through the switch to avoid expanding the cross-bar.
Virtual channels are used to avoid deadlock by systematically breaking cycles in the channel dependence graph. Consider, for example, the four-way routing deadlock cycle of Figure 10-19. Suppose we have two virtual channels per physical channel, and messages at a node numbered higher than their destination are routed on the high channels, while messages at a node numbered less than their destination are routed on the low channels. As illustrated in Figure 10-22, the dependence cycle is broken. Applying this approach to the k-ary d-cube, treat the channel labeling as a radix d + 1 + k number of the form ivx, where i is the dimension, x is the coordinate of the source node of the channel in dimension i, and v is the virtual channel number. In each dimension, if the destination node has a smaller coordinate than the source node in that dimension, i.e., the message must use a wrap-around edge, use the v = 1 virtual channel in that dimension. Otherwise, use the v = 0 channel. The reader may verify that dimension order routing is deadlock free with this assignment of virtual channels. Similar techniques can be employed with other popular topologies [DaSe87]. Notice that with virtual channels we need to view the routing algorithm as a function R : C × N → C, because the virtual channel selected for the output depends on which channel the packet came in on.
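A sketch of this selection rule is given below; the label encoding and function names are illustrative assumptions, and the rule is exactly the one stated above: use virtual channel 1 in a dimension only when the wrap-around edge must be crossed.

    /* Sketch of virtual channel selection for dimension-order routing on a
       k-ary d-cube with unidirectional channels.  Names are illustrative. */

    /* Within one dimension: a destination coordinate smaller than the source
       coordinate forces use of the wrap-around edge, hence channel 1. */
    int select_virtual_channel(int src_coord, int dst_coord)
    {
        return (dst_coord < src_coord) ? 1 : 0;
    }

    /* Channel label of the form ivx: dimension i is most significant, then
       the virtual channel v, then the source coordinate x of the channel. */
    int channel_label(int dim, int vc, int coord, int k)
    {
        return (dim * 2 + vc) * k + coord;
    }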
DRAFT: Parallel Computer Architecture
717
Interconnection Network Design
Figure 10-22 Breaking Deadlock Cycles with Virtual Channels Each physical channel is broken into two virtual channels, call them lo and hi. The virtual channel “parity” of the input port is used for the output, except on turns north to west, which make a transition from lo to hi.
10.5.5 Up*-Down* Routing

Are virtual channels required for deadlock-free wormhole routing on an arbitrary topology? No. If we assume that all the channels are bidirectional, there is a simple algorithm for deriving deadlock-free routes for an arbitrary topology. Not surprisingly, it restricts the set of legal routes. The general strategy is similar to routing in a tree, where routes go up the tree away from the source and then down to the destination. We assume the network consists of a collection of switches, some of which have one or more hosts attached to them. Given the network graph, we want to number the switches so that the numbers increase as we get farther away from the hosts. One approach is to partition the network into levels, with the host nodes at level zero and each switch at the level corresponding to its maximum distance from a host. Assign numbers starting with level 0 and increasing with level. This numbering is easily accomplished with a breadth-first search of the graph. (Alternatively, construct a spanning tree of the graph with hosts at the leaves and numbers increasing toward the root.) It is clear that for any source host, any destination host can be reached by a path consisting of a sequence of up channels (toward higher numbered nodes), a single turn, and a series of down channels. Moreover, the set of routes following such up*-down* paths is deadlock free. The network graph may have cycles, but the channel dependence graph under up*-down* routing does not. The up channels form a DAG and the down channels form a DAG. The up channels depend only on lower numbered up channels, and the down channels depend only on up channels and higher numbered down channels. This style of routing was developed for Autonet [And*92], which was intended to be self-configuring. Each of the switches contained a processor, which could run a distributed algorithm to determine the topology of the network and find a unique spanning tree. Each of the hosts would compute up*-down* routes as a restricted shortest-paths problem. Then routing tables in the switches could be established. The breadth-first search version is used by Atomic [Fel94] and Myrinet [Bod*95], where the switches are passive and the hosts determine the topology by probing the network. Each host runs the BFS and shortest-paths algorithms to determine its set of source-based routes to the other nodes. A key challenge in automatic mapping of networks, especially with simple switches that only move messages through without any special processing, is determining when two distinct routes through the network lead to the same switch [Mai*97]. One solution is to try returning to the source using routes from previously known switches; another is detecting when there exist identical paths from two supposedly distinct switches to the same host.
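The legality condition for an up*-down* route is easy to state in code: once a route has taken a down channel, it may never take an up channel again. The sketch below assumes the switch levels have already been assigned by a breadth-first search from the hosts; in a real system such as Autonet, ties between equal levels are broken by a unique switch identifier, which is omitted here.

    /* Sketch of an up*-down* legality check.  level[] holds the BFS level of
       each switch (hosts at level 0); route[] is the sequence of switches
       visited.  Equal levels are treated as "down" here; a real system would
       break ties with a unique switch id. */
    #include <stdbool.h>

    bool is_up_down_route(const int level[], const int route[], int nhops)
    {
        bool gone_down = false;
        for (int h = 0; h + 1 < nhops; h++) {
            bool up = level[route[h + 1]] > level[route[h]];
            if (up && gone_down)
                return false;          /* a down-to-up transition: illegal */
            if (!up)
                gone_down = true;
        }
        return true;
    }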
10.5.6 Turn-model Routing

We have seen that a deadlock-free routing algorithm can be constructed by restricting the set of routes within a network or by providing buffering with each channel that is used in a structured fashion. How much do we need to restrict the routes? Is there a minimal set of restrictions, or a minimal combination of routing restrictions and buffers? An important development in this direction, which has not yet appeared in commercial parallel machines, is turn-model routing. Consider, for example, a 2D array. There are eight possible turns, which form two simple cycles, as shown in Figure 10-23. (The figure illustrates cycles appearing in the network involving multiple messages. The reader should see that there is a corresponding cycle in the channel dependence graph.) Dimension order routing prevents the use of four of the eight turns: when traveling in ±x it is legal to turn into ±y, but once a packet is traveling in ±y it can make no further turns. The illegal turns are indicated by grey lines in the figure. Intuitively, it should be possible to prevent cycles by eliminating only one turn in each cycle.
Figure 10-23 Turn Restrictions of ∆x, ∆y routing. Dimension order routing on a 2D array prohibits the use of four of the eight possible turns, thereby breaking both of the simple dependence cycles. A deadlock-free routing algorithm can be obtained by prohibiting only two turns.
Of the 16 different ways to prohibit two turns in a 2D array, 12 prevent deadlock. These consist of the three unique algorithms shown in Figure 10-24 and rotations of these. The west-first algorithm is so named because no turn is allowed into the -x direction; therefore, if a packet needs to travel in this direction it must do so before making any turns. Similarly, in north-last there is no way to turn out of the +y direction, so the route must make all its other adjustments before heading in this direction. Finally, negative-first prohibits turns from a positive direction into a negative direction, so the route must go as far negative as it needs to before heading in either positive direction. Each of these turn-model algorithms allows very complex, even non-minimal routes. For example, Figure 10-25 shows some of the routes that might be taken under west-first routing. The elongated rectangles indicate blockages or broken links that might cause such a set of routes to be used.
Figure 10-24 Minimal Turn Model Routing in 2D: west-first, north-last, and negative-first. Only two of the eight possible turns need be prohibited in order to obtain a deadlock-free routing algorithm. Legal turns are shown for each of the three algorithms.
It should be clear that minimal turn models allow a great deal of flexibility in route selection. There are many legal paths between pairs of nodes.
Figure 10-25 Examples of legal west-first routes in an 8x8 array. Substantial routing adaptation is obtained with turn-model routing, thus providing the ability to route around faults in a deadlock-free manner.
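To make the west-first restriction concrete, the sketch below checks a candidate route, expressed as a sequence of hop directions, against the rule that all westward hops must come before any other hop; the direction encoding is an illustrative assumption.

    /* Sketch of the west-first turn restriction: any hop to the west (-x)
       must occur before the packet has traveled in any other direction. */
    #include <stdbool.h>

    typedef enum { DIR_WEST, DIR_EAST, DIR_NORTH, DIR_SOUTH } dir_t;

    bool is_legal_west_first(const dir_t hops[], int nhops)
    {
        bool turned = false;                 /* left the initial westward run? */
        for (int i = 0; i < nhops; i++) {
            if (hops[i] != DIR_WEST)
                turned = true;
            else if (turned)
                return false;                /* a turn back into -x: prohibited */
        }
        return true;
    }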
The turn-model approach can be combined with virtual channels and it can be applied in any topology. (In some networks, such as unidirectional d-cubes, virtual channels are still required.) The basic method is (1) partition the channels into sets according to the direction they route packets (excluding wraparound edges), (2) identify the potential cycles formed by “turns” between directions, and (3) prohibit one turn in each abstract cycle, being careful to break all the complex cycles as well. Finally, wraparound edges can be incorporated, as long as they do not introduce cycles. If virtual channels are present, treat each set of channels as a distinct virtual direction. Up*-down* is essentially a turn-model algorithm, with the assumption of bidirectional channels, using only two directions. Indeed, reviewing the up*-down* algorithm, there may be many shortest paths that conform to the up*-down* restriction, and there are certainly many non-minimal up*-down* routes in most networks. Virtual channels allow the routing restrictions to be loosened even further. 10.5.7 Adaptive Routing The fundamental advantage of loosening the routing restrictions is that it allows there to be multiple legal paths between pairs of nodes. This is essential for fault tolerance. If the routing algorithm allows only one path, failure of a single link will effectively leave the network disconnected. With multipath routing, it may be possible to steer around the fault. In addition, it allows traffic to be spread more broadly over available channels and thereby improves the utilization of the network. When a vehicle is parked in the middle of the street, it is often nice to have the option of driving around the block. Simple deterministic routing algorithms can introduce tremendous contention within the network, even though communication load is spread evenly over independent destinations. For example, Figure 10-26 shows a simple case where four packets are traveling to distinct destinations in a 2D mesh, but under dimension order routing they are all forced to travel through the same link. The communication is completely serialized through the bottleneck, while links for other shortest paths are unused. A multipath routing algorithm could use alternative channels, as indicated in the right hand portion of the figure. For any network topology there exist bad permutations[GoKr84], but simple deterministic routing makes these bad permutations much easier to run across. The particular example in Figure 10-26 has important practical significance. A common global communication pattern is a transpose. On a 2D mesh with dimension order routing, all the packets in a row must go through a single switch before filling in the column. Multipath routing can be incorporated as an extension of any of the basic switch mechanisms. With source based routing, the source simply chooses among the legal routes and builds the header accordingly. No change to the switch is required. With table driven routing, this can be accomplished by setting up table entries for multiple paths. For arithmetic routing additional control information would need to be incorporated in the header and interpreted by the switch. Adaptive routing is a form of multipath routing where the choice of routes is made dynamically by the switch in response to traffic encountered en route. Formally, an adaptive routing function is a mapping of the form: R A : C × N × Σ → C , where Σ represents the switch state. 
In particular, if one of the desired outputs is blocked or has failed, the switch may choose to send the packet on an alternative channel. Minimal adaptive routing will only route packets along shortest paths to their destination, i.e., every hop must reduce the distance to the destination. An adaptive algorithm that allows all shortest paths to be used is fully adaptive; otherwise it is partially adaptive.
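A minimal adaptive routing function of this form can be sketched as follows; the representation of the switch state Σ as a per-output blocked flag is an assumption for illustration.

    /* Sketch of a minimal adaptive selection function R_A(c, n, state):
       among the output ports that reduce the distance to the destination,
       prefer one that is not currently blocked by flow control. */
    typedef struct {
        int blocked[5];        /* per-output: nonzero if the link says stop */
    } switch_state_t;

    /* candidates[] holds the output ports on shortest paths (1 or 2 in a mesh) */
    int route_minimal_adaptive(const int candidates[], int ncand,
                               const switch_state_t *s)
    {
        for (int i = 0; i < ncand; i++)
            if (!s->blocked[candidates[i]])
                return candidates[i];        /* first unblocked productive port */
        return candidates[0];                /* all blocked: wait on one of them */
    }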
Figure 10-26 Routing path conflicts under deterministic dimension-order routing. Several messages from distinct sources to distinct destinations contend for resources under dimension order routing, whereas an adaptive routing scheme may be able to use disjoint paths.
An interesting extreme case of non-minimal adaptive routing is what is called "hot potato" routing. In this scheme the switch never buffers packets. If more than one packet is destined for the same output channel, the switch sends one toward its destination and misroutes the rest onto other channels. Adaptive routing is not widely used in current parallel machines, although it has been studied extensively in the literature [NgSe89, LiHa91], especially through the Chaos router [KoSn91]. The nCUBE/3 is to provide minimal adaptive routing in a hypercube. The network proposed for the Tera machine is to use hot potato routing, with 128-bit packets delivered in one 3 ns cycle. Although adaptive routing has clear advantages, it is not without its disadvantages. Clearly it adds to the complexity of the switch, which can only make the switch slower. The reduction in bandwidth can outweigh the gains of the more sophisticated routing; a simple deterministic network in its operating regime is likely to outperform a clever adaptive network in saturation. In non-uniform networks, such as a d-dimensional array, adaptivity hurts performance on uniform random traffic. Stochastic variations in the load introduce temporary blocking in any network. For switches at the boundary of the array, this will tend to propel packets toward the center. As a result, contention forms in the middle of the array that is not present under deterministic routing. Adaptive routing can cause problems with certain kinds of non-uniform traffic as well. With adaptive routing, packets destined for a hot spot will bounce off the saturated path and clog up the rest of the network, not just the tree to the destination, until all traffic is affected. In addition, when the hot spot clears, it will take even longer for the network saturation to drain, because more packets destined for the problem node are in the network. Non-minimal adaptive routing performs poorly as the network reaches saturation, because packets traverse extra links and hence consume more bandwidth. The throughput of the network tends to drop off as load is increased, rather than flattening at the saturation point, as suggested by Figure 10-17. Recently there have been a number of proposals for low-cost partially and fully adaptive routing schemes that use a combination of a limited number of virtual channels and restrictions on the set of turns [ChKi92, ScJa95]. It appears that most of the advantages of adaptive routing, including fault tolerance and channel utilization, can be obtained with a very limited degree of adaptability.
10.6 Switch Design

Ultimately, the design of a network boils down to the design of the switch and how the switches are wired together. The degree of the switch, its internal routing mechanisms, and its internal buffering determine what topologies can be supported and what routing algorithms can be implemented. Now that we understand these higher level network design issues, let us return to the switch design in more detail. Like any other hardware component of a computer system, a network switch comprises a datapath, control, and storage. This basic structure was illustrated at the beginning of the chapter in Figure 10-5. Throughout most of the history of parallel computing, switches were built from a large number of low-integration components, occupying a board or a rack. Since the mid 1980s, most parallel computer networks have been built around single-chip VLSI switches, exactly the same technology as the microprocessor. (This transition is taking place in LANs a decade later.) Thus, switch design is tied to the same technological trends discussed in Chapter 1: decreasing feature size, increasing area, and increasing pin count. We should view modern switch design from a VLSI perspective.

10.6.1 Ports

The total number of pins is essentially the total number of input and output ports times the channel width. Since the perimeter of the chip grows slowly compared to its area, switches tend to be pin limited. This pushes designers toward narrow, high-frequency channels. Very high speed serial links are especially attractive, because they use the fewest pins and eliminate problems with skew across the bit lines in a channel. However, with serial links the clock and all control signals must be encoded within the framing of the serial bit stream. With parallel links, one of the wires is essentially a clock for the data on the others. Flow control is realized using an additional wire, providing a ready/acknowledge handshake.

10.6.2 Internal Datapath

The datapath is the connectivity between each of a set of input ports, i.e., input latches, buffers, or FIFOs, and every output port. This is generally referred to as the internal cross-bar, although it can be realized in many different ways. A non-blocking cross-bar is one in which each input port can be connected to a distinct output in any permutation simultaneously. Logically, for an n × n switch the non-blocking cross-bar is nothing more than an n-way multiplexor associated with each destination, as shown in Figure 10-27. The multiplexor may be implemented in a variety of different ways, depending on the underlying technology. For example, in VLSI it is typically realized as a single bus with n tristate drivers, also shown in Figure 10-27. In this case, the control path provides n selects per output. It is clear that the hardware complexity of the cross-bar is determined by the wires. There are nw data wires in each direction, requiring (nw)^2 area. There are also n^2 control wires, which add to this significantly. How do we expect switches to track improvements in VLSI technology? Assume that the area of the cross-bar stays constant, but the feature size decreases.
The ideal VLSI scaling law says that if the feature size is reduced by a factor of s (including the gate thickness) and the voltage level is reduced by the same factor, then the speed of the transistors improves by a factor of 1/s, the propagation delay of wires connecting neighboring transistors improves by a factor of 1/s, and the total number of transistors per unit area increases by 1/s^2 with the same power density. For switches, this means that the wires get thinner and closer together, so the degree of the switch can increase by a factor of 1/s.
Figure 10-27 Cross-bar Implementations. The cross-bar internal to a switch can be implemented as a collection of multiplexors, a grid of tristate drivers, or via a conventional static RAM that time multiplexes across the ports.
Notice that the switch degree improves as only the square root of the improvement in logic density. The bad news is that these wires run the entire length of the cross-bar, hence the length of the wires stays constant. The wires get thinner, so they have more resistance. The capacitance is reduced, and the net effect is that the propagation delay is unchanged [Bak90]. In other words, ideal scaling gives us an improvement in switch degree for the same area, but no improvement in speed. The speed does improve (by a factor of 1/s) if the voltage level is held constant, but then the power increases with 1/s^2. Increases in chip area will allow larger degree, but the wire lengths increase and the propagation delays will increase. There is some degree of confusion over the term cross-bar. In the traditional switching literature, multistage interconnection networks that have a single controller are sometimes called cross-bars, even though the datapath through the switch is organized as an interconnection of small cross-bars. In many cases these are connected in a butterfly topology, called a Banyan network in the switching literature. A banyan network is non-blocking if its inputs are sorted, so some non-blocking cross-bars are built as a Batcher sorting network in front of a banyan network. An approach that has many aspects in common with Benes networks is to employ a variant of the butterfly, called a delta network, and use two of these networks in series. The first serves to randomize packet positions relative to the input ports and the second routes to the output port. This is used, for example, in some commercial ATM switches [Tu888]. In VLSI switches, it is usually more effective to actually build a non-blocking cross-bar, since it is simple, fast, and regular. The key limit is pin count anyway.
It is clear that VLSI switches will continue to advance with the underlying technology, although the growth rate is likely to be slower than the rate of improvement in storage and logic. The hardware complexity of the cross-bar can be reduced if we give up the non-blocking property and limit the number of inputs that can be connected to outputs at once. In the extreme, this reduces to a bus with n drivers and n output selects. However, the most serious issue in practice turns out to be the length of the wires and the number of pins, so reducing the internal bandwidth of the individual switches provides little savings and a significant loss in network performance.

10.6.3 Channel Buffers

The organization of the buffer storage within the switch has a significant impact on the switch performance. Traditional routers and switches tend to have large SRAM or DRAM buffers external to the switch fabric, whereas in VLSI switches the buffering is internal to the switch and comes out of the same silicon budget as the datapath and the control section. There are four basic options: no buffering (just input and output latches), buffers on the inputs, buffers on the outputs, or a centralized shared buffer pool. A few flits of buffering on the input and output channels decouples the switches on either end of the link and tends to provide a significant improvement in performance. As chip size and density increase, more buffering is available and the network designer has more options, but the buffer real estate still comes at a premium and its organization is important. Like so many other aspects of network design, the issue is not just how effectively the buffer resources are utilized, but how the buffering affects the utilization of other components of the network. Intuitively, one might expect sharing of the switch storage resources to be harder to implement, but to allow better utilization of these resources than partitioning the storage among ports. Shared pools are indeed harder to implement, because all of the communication ports need to access the pool simultaneously, requiring a very high bandwidth memory. More surprisingly, sharing the buffer pool on demand can hurt network utilization, because a single congested output port can "hog" most of the buffer pool and thereby prevent other traffic from moving through the switch. Given the trends in VLSI, it makes sense to keep the design simple and throw a little more independent resources at the problem.

Input buffering

An attractive approach is to provide independent FIFO buffers with each input port, as illustrated in Figure 10-28. Each buffer needs to be able to accept a flit every cycle and deliver one flit to an output, so the internal bandwidth of the switch is easily matched to the flow of data coming in. The operation of the switch is relatively simple; it monitors the head of each input FIFO, computes the desired output port of each, and schedules packets to move through the cross-bar accordingly. Typically, there is routing logic associated with each input port to determine the desired output. This is trivial for source-based routing; it requires an arithmetic unit per input for algorithmic routing, and, typically, a routing table per input with table-driven routing. With cut-through routing, the decision logic does not make an independent choice every cycle, but only every packet.
Thus, the routing logic is essentially a finite state machine, which spools all the flits of a packet to the same output channel before making a new routing decision at the packet boundary[SeSu93].
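The per-input routing logic for cut-through or wormhole switching therefore looks roughly like the state machine sketched below; the flit encoding and the placeholder routing computation are assumptions for illustration.

    /* Sketch of the per-input routing FSM for cut-through routing: the output
       port is chosen only on a head flit, and body/tail flits are spooled to
       the same output.  Flit encoding is an illustrative assumption. */
    typedef enum { FLIT_HEAD, FLIT_BODY, FLIT_TAIL } flit_kind_t;

    typedef struct {
        flit_kind_t kind;
        int dest;                 /* meaningful only in head flits */
    } flit_t;

    typedef struct {
        int mid_packet;           /* between a head flit and its tail flit? */
        int out_port;             /* decision made on the head flit */
    } input_route_fsm_t;

    /* Placeholder routing computation; a real switch would use one of the
       mechanisms of Section 10.5.1 (arithmetic, source-based, or table-driven). */
    static int route_lookup(int dest) { return dest % 4; }

    int next_output(input_route_fsm_t *fsm, const flit_t *f)
    {
        if (!fsm->mid_packet) {               /* head flit: new routing decision */
            fsm->out_port = route_lookup(f->dest);
            fsm->mid_packet = 1;
        }
        if (f->kind == FLIT_TAIL)             /* packet boundary reached */
            fsm->mid_packet = 0;
        return fsm->out_port;
    }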
Figure 10-28 Input Buffered Switch A FIFO is provided at each of the input ports, but the controller can only inspect and service the packets at the heads of the input FIFOs.
One problem with the simple input-buffered approach is the occurrence of "head of line" (HOL) blocking. Suppose that two ports have packets destined for the same output port. One of them will be scheduled onto the output and the other will be blocked. The packet just behind the blocked packet may be destined for one of the unused outputs (there are guaranteed to be unused outputs), but it will not be able to move forward. This head-of-line blocking problem is familiar in our vehicular traffic analogy; it corresponds to having only one lane approaching an intersection. If the car ahead is blocked attempting to turn, there is no way to proceed down the empty street ahead. We can easily estimate the effect of head-of-line blocking on channel utilization. If we have two input ports and randomly pick an output for each, the first succeeds and the second has a 50-50 chance of picking the unused output. Thus, the expected number of packets per cycle moving through the switch is 1.5 and, hence, the expected utilization of each output is 75%. Generalizing this, if E(n, k) is the expected number of output ports covered by k random inputs to an n-port switch, then E(n, k+1) = E(n, k) + (n - E(n, k))/n.
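The recurrence is easy to evaluate; the short program below computes E(n, n)/n for a few switch sizes and reproduces the roughly 65% single-cycle utilization quoted next.

    /* Compute E(n,k+1) = E(n,k) + (n - E(n,k))/n, the expected number of
       distinct output ports covered by k random requests, and print the
       resulting single-cycle output utilization E(n,n)/n. */
    #include <stdio.h>

    int main(void)
    {
        int sizes[] = { 2, 4, 8, 16, 32 };
        for (int i = 0; i < 5; i++) {
            int n = sizes[i];
            double e = 0.0;                 /* E(n, 0) = 0 */
            for (int k = 0; k < n; k++)
                e = e + (n - e) / n;
            printf("n = %2d  utilization = %.2f\n", n, e / n);
        }
        return 0;
    }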
Computing this recurrence up to k = n for various switch sizes reveals that the expected output channel utilization for a single cycle of a fully loaded switch quickly drops to about 65%. Queueing theory analysis shows that the expected utilization in steady state with input queueing is 59% [Kar*87]. The impact of HOL blocking can be more significant than this simple probabilistic analysis indicates. Within a switch, there may be bursts of traffic for one output followed by bursts for another, and so on. Even though the traffic is evenly distributed, given a large enough window, each burst results in blocking on all inputs [Li88]. Even if there is no contention for an output within a switch, the packet at the head of an input buffer may be destined for an output that is blocked due to congestion elsewhere in the network. Still the packets behind it cannot move forward. In a wormhole-routed network, the entire worm will be stuck in place, effectively consuming link bandwidth without going anywhere.
A more flexible organization of buffer resources might allow packets to slide ahead of packets that are blocked.

Output buffering

The basic enhancement we need to make to the switch is to provide a way for it to consider multiple packets at each input as candidates for advancement to the output port. A natural option is to expand the input FIFOs to provide an independent buffer for each output port, so packets sort themselves by destination upon arrival, as indicated by Figure 10-29. (This is the kind of switch assumed by the conventional delay analysis in Section 10.4; the analysis is simplified because the switch does not introduce additional contention effects internally.) With a steady stream of traffic on the inputs, the outputs can be driven at essentially 100%. However, the advantages of such a design are not without a cost; additional buffer storage and internal interconnect are required.1 Along with the sorting stage and the wider multiplexors, this may increase the switch cycle time or increase its routing delay. It is a matter of perspective whether the buffers in Figure 10-29 are associated with the input or the output ports. If viewed as output port buffers, the key property is that each output port has enough internal bandwidth to receive a packet from every input port in one cycle. This could be obtained with a single output FIFO, but it would have to run at an internal clock rate n times that of the input ports.

Virtual channel buffering

Virtual channels suggest an alternative way of organizing the internal buffers of the switch. Recall, a set of virtual channels provides transmission of multiple independent packets across a single physical link. As illustrated in Figure 10-29, to support virtual channels the flow across the link is split upon arrival at an input port into distinct channel buffers. (These are multiplexed together again, either before or after the cross-bar, onto the output ports.) If one of the virtual channel buffers is blocked, it is natural to consider advancing the other virtual channels toward the outputs. Again, the switch has the opportunity to select among multiple packets at each input for advance to the output ports; however, in this case the choice is among different virtual channels rather than different output ports. It is possible that all the virtual channels will need to route to the same output port, but the expected coverage of the outputs is much better. In a probabilistic analysis, we can ask how many distinct output ports are expected to be covered by vn requests for n ports, where v is the number of virtual channels. Simulation studies show that large (256 to 1024 node) 2-ary butterflies using wormhole routing with moderate buffering (16 flits per channel) saturate at a channel utilization of about 25% under random traffic. If the same sixteen flits of buffering per channel are distributed over a larger number of virtual channels, the saturation bandwidth increases substantially. It exceeds 40% with just two virtual channels (8-flit buffers) and is nearly 80% at sixteen channels with single-flit buffers [Dal90a]. While this study keeps the total buffering per channel fixed, it does not really keep the cost constant.
1. The MIT Parc switch [Joe94] provides the capability of the n×n buffered switch but avoids the storage and interconnect penalty. It is designed for fixed-size packets. The set of buffers at each input forms a pool, and each output has a list of pointers to packets destined for it. The timing constraint in this design is the ability to push n pointers per cycle into the output port buffer, rather than n packets per cycle.
Figure 10-29 Switch Design to avoid HOL blocking. Packets are sorted by output port at each input port so that the controller can select a packet for an output port if any input has a packet destined for that output.
Notice that a routing decision needs to be computed for each virtual channel, rather than each physical channel, if packets from any of the channels are to be considered for advancement to output ports. By now the reader is probably jumping a step ahead to additional cost/performance trade-offs that might be considered. For example, if the cross-bar in Figure 10-27 has an input per virtual channel, then multiple packets can advance from a single input at once. This increases the probability of each packet advancing and hence the channel utilization. The cross-bar increases in size in only one dimension and the multiplexors are eliminated, since each output port is logically a vn-way mux.

10.6.4 Output Scheduling

We have seen routing mechanisms that determine the desired output port for each input packet, datapaths that provide a connection from the input ports to the outputs, and buffering strategies that allow multiple packets per input port to be considered as candidates for advancement to the output port. A key missing component in the switch design is the scheduling algorithm, which selects which of the candidate packets advance in each cycle. Given a selection, the remainder of the switch control asserts the control points in the cross-bar/muxes and buffers/latches to effect the register transfer from each selected input to the associated output.
728
DRAFT: Parallel Computer Architecture
9/4/97
Switch Design
As with the other aspects of switch design, there is a spectrum of solutions varying from simple to complex. A simple approach is to view the scheduling problem as n independent arbitration problems, one for each output port. Each candidate input buffer has a request line to each output port and a grant line from each port, as indicated by Figure 10-30. (The figure shows four candidate input buffers driving three output ports to indicate that the routing logic and arbitration input are on a per-input-buffer basis, rather than per input port.) The routing logic computes the desired output port and asserts the request line for the selected output port. The output port scheduling logic arbitrates among the requests, selects one, and asserts the corresponding grant signal. Specifically, with the cross-bar using tri-state drivers in Figure 10-27b, output port j enables input buffer i by asserting control enable e_ij. The input buffer logic advances its FIFO as a result of one of the grant lines being asserted.
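A sketch of this request/grant structure is given below, using a rotating (round-robin) priority at each output so that no input is starved; the data layout and port count are illustrative assumptions, and the arbitration policies themselves are discussed next.

    /* Sketch of per-output arbitration for the request/grant structure of
       Figure 10-30: each output picks one asserted request, scanning inputs
       in rotating (round-robin) order. */
    #define NPORTS 4

    typedef struct {
        int request[NPORTS][NPORTS];   /* request[i][o]: input i wants output o */
        int grant[NPORTS][NPORTS];     /* grant[o][i]: output o grants input i  */
        int last[NPORTS];              /* last input granted by each output     */
    } sched_t;

    void schedule_outputs(sched_t *s)
    {
        for (int o = 0; o < NPORTS; o++) {
            for (int i = 0; i < NPORTS; i++)
                s->grant[o][i] = 0;
            /* scan inputs starting just past the last one granted */
            for (int j = 1; j <= NPORTS; j++) {
                int i = (s->last[o] + j) % NPORTS;
                if (s->request[i][o]) {
                    s->grant[o][i] = 1;
                    s->last[o] = i;
                    break;
                }
            }
        }
    }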
Figure 10-30 Control Structure for Output Scheduling. Associated with each input buffer is routing logic to determine the output port, a request line, and a grant line per output. Each output has selection logic to arbitrate among the asserted requests and assert one grant, causing a flit to advance from the input buffer to the output port.
An additional design question in the switch is the arbitration algorithm used for scheduling flits onto the output. Options include a static priority scheme, random, round-robin, and oldest-first (or deadline) scheduling. Each of these has different performance characteristics and implementation complexity. Clearly static priority is the simplest; it is simply a priority encoder. However, in a large network it can cause indefinite postponement. In general, scheduling algorithms that provide fair service to the inputs perform better. Round-robin requires extra state to change the order of priority in each cycle. Oldest-first tends to have the same average latency as random assignment, but significantly reduces the variance in latencies [Dal90a]. One way to implement oldest-first scheduling is to use a control FIFO of input port numbers at each output port. When an input buffer requests an output, the request is enqueued. The oldest request, at the head of the FIFO, is granted. It is useful to consider the implementation of various routing algorithms and topologies in terms of Figure 10-30. For example, in a direct d-cube, there are d + 1 inputs, numbered i_0, …, i_d, and d + 1 outputs, numbered o_1, …, o_{d+1}, with the host connected to input i_0 and output o_{d+1}. The straight path i_j → o_j corresponds to routing in the same dimension; other paths are a change in dimension. With dimension order routing, packets can only increase in dimension as they cross the switch. Thus, the full complement of request/grant logic is not required.
Input j need only request outputs j, …, d + 1, and output j need only grant inputs 0, …, j. Obvious static priority schemes assign priority in increasing or decreasing numerical order. If we further specialize the switch design to dimension order routing and use a stacked 2x2 cross-bar design (see Exercises), then each input requests only two outputs and each output grants only two inputs. The scheduling issue amounts to preferring traffic continuing in a dimension vs. traffic turning onto the next higher dimension. What are the implementation requirements of adaptive routing? First, the routing logic for an input must compute multiple candidate outputs according to the specific rules of the algorithm, e.g., turn restrictions or plane restrictions. For partially adaptive routing, this may be only a couple of candidates. Each output receives several requests and can grant one. (Or if the output is blocked, it may not grant any.) The tricky point is that an input may be selected by more than one output. It will need to choose one, but what happens to the other output? Should it iterate on its arbitration and select another input, or go idle? This problem can be formalized as one of on-line bipartite matching [Kar*90]. The requests define a bipartite graph with inputs on one side and outputs on the other. The grants (one per output) define a matching of input-output pairs within the request graph. The maximum matching would allow the largest number of inputs to advance in a cycle, which ought to give the highest channel utilization. Viewing the problem in this way, one can construct logic that approximates fast parallel matching algorithms [And*92]. The basic idea is to form a tentative matching using a simple greedy algorithm, such as random selection among requests at each output followed by selection of grants at each input. Then, for each unselected output, try to make an improvement to the tentative matching. In practice, the improvement diminishes after a couple of iterations. Clearly, this is another case of sophistication vs. speed. If the scheduling algorithm increases the switch cycle time or routing delay, it may be better to accept a little extra blocking and get the job done faster. This maximum matching problem applies to the case where multiple virtual channels are multiplexed through each input of the cross-bar, even with deterministic routing. (Indeed, the technique was proposed for the AN2 ATM switch to address the situation of scheduling cells from several "virtual circuits".1) Each input port has multiple buffers that it can schedule onto its cross-bar input, and these may be destined for different outputs. The selection of the outputs determines which virtual channel to advance. If the cross-bar is widened, rather than multiplexing the inputs, the matching problem vanishes and each output can make a simple independent arbitration.

10.6.5 Stacked Dimension Switches

Many aspects of switch design are simplified if there are only two inputs and two outputs, including the control, arbitration, and data paths. Several designs, including the Torus Routing Chip [SeSu93], the J-Machine, and the Cray T3D, have used a simple 2x2 building block and stacked these to construct switches of higher dimension, as illustrated in Figure 10-31. If we have in mind a d-cube, traffic continuing in a given dimension passes straight through the switch in that dimension, whereas if it needs to turn into another dimension it is routed vertically through the switch. Notice that this adds a hop for all but the lowest dimension.
This same technique yields a topology called cube-connected cycles when applied to a hypercube.
1. Virtual circuits should not be confused with virtual channels. The former is a technique for associating routing resources with an entire source-to-destination route. The latter is a strategy for structuring the buffering associated with each link.
Figure 10-31 Dimension-stacked switch. Traffic continuing in a dimension passes straight through one 2x2 switch, whereas when it turns to route in another dimension it is routed vertically up the stack.
10.7 Flow Control

In this section we consider in detail what happens when multiple flows of data in the network attempt to use the same shared network resources at the same time. Some action must be taken to control these flows. If no data is to be lost, some of the flows must be blocked while others proceed. The problem of flow control arises in all networks and at many levels, but it is qualitatively different in parallel computer networks than in local and wide area networks. In parallel computers, network traffic needs to be delivered about as reliably as traffic across a bus, and there are a very large number of concurrent flows on very small time scales. No other networking regime has such stringent demands. We will look briefly at some of these differences and then examine in detail how flow control is addressed at the link level and end-to-end in parallel machines.

10.7.1 Parallel Computer Networks vs. LANs and WANs

To build intuition for the unique flow-control requirements of parallel machine networks, let us take a little digression and examine the role of flow control in the networks we deal with every day for file transfer and the like. We will look at three examples: ethernet-style collision-based arbitration, FDDI-style global arbitration, and unarbitrated wide-area networks. In an ethernet, the entire network is effectively a single shared wire, like a bus only longer. (The aggregate bandwidth is equal to the link bandwidth.) However, unlike a bus there is no explicit arbitration unit. A host attempts to send a packet by first checking that the network appears to be quiet and then (optimistically) driving its packet onto the wire. All nodes watch the wire, including the hosts attempting to send. If there is only one packet on the wire, every host will "see" it and the host specified as the destination will pick it up.
If there is a collision, every host will detect the garbled signal, including the multiple senders. The minimum time that a host may drive a packet, i.e., the minimum channel time, is about 50 µs; this is to allow time for all hosts to detect collisions. The flow-control aspect is how the retry is handled. Each sender backs off for a random amount of time and then retries. With each repeated collision, the retry interval from which the random delay is chosen is increased. The collision detection is performed within the network interface hardware, and the retry is handled by the lowest level of the ethernet driver. If there is no success after a large number of tries, the ethernet driver gives up and drops the packet. However, the message operation originated from some higher-level communication software layer which has its own delivery contract. For example, the TCP/IP layer will detect the delivery failure via a timeout and engage its own adaptive retry mechanism, just as it does for wide-area connections, which we discuss below. The UDP layer will ignore the delivery failure, leaving it to the user application to detect the event and retry. The basic concept of the ethernet rests on the assumption that the wire is very fast compared to the communication capability of the hosts. (This assumption was reasonable in the mid 70s when ethernet was developed.) A great deal of study has been given to the properties of collision-based media access control, but basically as the network reaches saturation, the delivered bandwidth drops precipitously. Ring-based LANs, such as token ring and FDDI, use a distributed form of global arbitration to control access to the shared medium. A special arbitration token circulates the ring when there is an empty slot. A host that desires to send a packet waits until the token comes around, grabs it, and drives the packet onto the ring. After the packet is sent, the arbitration token is returned to the ring. In effect, flow control is performed at the hosts on every packet as part of gaining access to the ring. Even if the ring is idle, a host must wait for the token to traverse half the ring, on average. (This is why the unloaded latency for FDDI is generally higher than that of ethernet.) However, under high load the full link capacity can be used. Again, the basic assumption underlying this global arbitration scheme is that the network operates on a much smaller timescale than the communication operations of the hosts. In the wide-area case, each TCP connection (and each UDP packet) follows a path through a series of switches, bridges, and routers across media of varying speeds between source and destination. Since the internet is a graph, rather than a simple linear structure, at any point there may be a set of incoming flows with a collective bandwidth greater than the outgoing link. Traffic will back up as a result. Wide-area routers provide a substantial amount of buffering to absorb stochastic variations in the flows, but if the contention persists these buffers will eventually fill up. In this case, most routers will just drop packets. Wide-area links may be stretches of fiber many miles long, so when a packet is driven onto a link it is hard to know if it will find buffer space available when it arrives. Furthermore, the flows of data through a switch are usually unrelated transfers. They are not, in general, a collection of flows from within a single parallel program, which imposes a degree of inherent end-to-end flow control by waiting for data it needs before continuing.
The TCP layer provides end-to-end flow control and adapts dynamically to the perceived characteristics of the route occupied by a connection. It assumes that packet loss (detected via timeout) is a result of contention at some intermediate point, so when it experiences a loss it sharply decreases the send rate (by reducing the size of its burst window). It slowly increases this rate, i.e., the window size, as data is transferred successfully (detected via acknowledgments from destination to source), until it once again experiences a loss. Thus, each flow is controlled at the source governed by the occurrence of time-outs and acknowledgments.
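The window adjustment described above can be captured in a few lines. The sketch below is a bare-bones additive-increase/multiplicative-decrease governor in C; real TCP adds slow start, fast retransmit, and other refinements, and the structure and constants here are assumptions for illustration.

```c
/* Simplified congestion-window governor in the spirit of TCP:
 * shrink sharply on a timeout (presumed contention), grow slowly on success. */
struct cwnd {
    double window;      /* current burst window, in packets          */
    double max_window;  /* cap, e.g. the receiver's advertised space */
};

void on_ack(struct cwnd *c)        /* data delivered successfully */
{
    c->window += 1.0 / c->window;  /* additive increase: ~1 packet per window of acks */
    if (c->window > c->max_window)
        c->window = c->max_window;
}

void on_timeout(struct cwnd *c)    /* loss detected via timeout */
{
    c->window /= 2.0;              /* multiplicative decrease */
    if (c->window < 1.0)
        c->window = 1.0;           /* always allow at least one outstanding packet */
}
```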
Of course, the wide-area case operates on a timescale of fractions of a second, whereas parallel machine networks operate on a scale of nanoseconds. Thus, one should not expect the techniques to carry over directly. Interestingly, the TCP flow-control mechanisms do work well in the context of collision-based arbitration, such as ethernet. (Partly this is because the software overhead tends to give the network a chance to clear.) However, with the emergence of high-speed, switched local and wide area networks, especially ATM, the flow-control problem has taken on much more of the characteristics of the parallel machine case. At the time of writing, most commercial ATM switches provide a sizable amount of buffering per link (typically 64 to 128 cells per link) but drop cells when this is exceeded. Each cell is 53 bytes, so at the 155 Mb/s OC-3 rate a cell transfer time is 2.7 µs. Buffers fill very rapidly compared to typical LAN/WAN end-to-end times. (The TCP mechanisms prove to be quite ineffective in ATM settings when contending with more aggressive protocols, such as UDP.) Currently, the ATM standardization efforts are hotly debating link-level flow and rate control measures. This development can be expected to draw heavily on the experiences from parallel machines.

10.7.2 Link-level flow control

Essentially all parallel machine interconnection networks provide link-level flow control. The basic problem is illustrated in Figure 10-32. Data is to be transferred from an output port of one node across a link to an input port of a node operating autonomously. The storage may be a simple latch, a FIFO, or buffer memory. The link may be short or long, wide or thin, synchronous or asynchronous. The key point is that, as a result of circumstances at the destination node, storage at the input port may not be available to accept the transfer, so the data must be retained at the source until the destination is ready. This may cause the buffers at the source to fill, and it in turn may exert "back pressure" on its sources.
Figure 10-32 Link level flow-control As a result of circumstances at the destination node, the storage at the input port may not be available to accept the data transfer, so the data must be retained at the source until the destination is ready
The implementation of link-level flow control differs depending on the design of the link, but the main idea is the same. The destination node provides feedback to the source indicating whether it is able to receive additional data on the link. The source holds onto the data until the destination indicates that it is able. Before we examine how this feedback is incorporated into the switch operation, we look at how the flow control is implemented on different kinds of links.
With short-wide links, the transfer across the link is essentially like a register transfer within a machine, extended with a couple of control signals. One may view the source and destination registers as being extended with a full/empty bit, as illustrated in Figure 10-33. If the source is full and the destination is empty, the transfer occurs, the destination becomes full, and the source becomes empty (unless it is refilled from its source). With synchronous operation (e.g., in the Cray T3D, IBM SP-2, TMC CM-5, and MIT J-Machine), the flow control determines whether a transfer occurs for the clock cycle. It is easy to see how this is realized with edge-triggered or multi-phase level-sensitive designs. If the switches operate asynchronously, the behavior is much like a register transfer in a self-timed design. The source asserts the request signal when it is full and ready to transfer, the destination uses this signal to accept the value (when the input port is available), and it asserts ack when it has accepted the data. With short-thin links the behavior is similar, except that a series of phits is transferred for each req/ack handshake.
Figure 10-33 Simple Link level handshake The source asserts its request when it has a flit to transmit, the destination acknowledges the receipt of a flit when it is ready to accept the next one. Until that time, the source repeatedly transmits the flit.
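For the synchronous short-wide case, the per-cycle decision reduces to a test of the two full/empty bits. The fragment below is a behavioral sketch in C standing in for the register-transfer logic; the names are illustrative and not taken from any particular machine.

```c
/* One clock cycle of a short-wide synchronous link: move a flit from the
 * source register to the destination register only when the source is full
 * and the destination is empty. */
struct link_reg { unsigned data; int full; };

void link_cycle(struct link_reg *src, struct link_reg *dst)
{
    if (src->full && !dst->full) {
        dst->data = src->data;   /* register-to-register transfer               */
        dst->full = 1;
        src->full = 0;           /* the source may be refilled by its own source */
    }
    /* otherwise the flit is simply held at the source for this cycle */
}
```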
The req/ack handshake can be viewed as the transfer of a single token or credit between the source and destination. When the destination frees the input buffer, it passes the token to the source (i.e., increases its credit). The source uses this credit when it sends the next flit and must wait until its account is refilled. For long links, this credit scheme is expanded so that the entire pipeline associated with the link propagation delay can be filled. Suppose that the link is sufficiently long that several flits are in transit simultaneously. As indicated by Figure 10-34, it will also take several cycles for acks to propagate in the reverse direction, so a number of acks (credits) may be in transit as well. The obvious credit-based flow control is for the source to keep account of the available slots in the destination input buffer. The counter is initialized to the buffer size. It is decremented when a flit is sent, and the output is blocked if the counter reaches zero. When the destination removes a flit from the input buffer, it returns a credit to the source, which increments the counter. The input buffer will never overflow; there is always room to drain the link into the buffer. This approach is most attractive with wide links, which have dedicated control lines for the reverse ack. For thin links, which multiplex ack symbols onto the opposite-going channel, the ack-per-flit overhead can be reduced by transferring bigger chunks of credit. However, there is still the problem that the approach is not very robust to loss of credit tokens. Ideally, when the flows are moving forward smoothly there is no need for flow control. The flow-control mechanism should be a governor which gently nudges the input rate to match the output rate. The propagation delays on the links give the system momentum. An alternative approach to link-level flow control is to view the destination input buffer as a staging tank with a low water mark and a high water mark. When the fill level drops below the low mark, a GO symbol is sent to the source, and when it goes over the high mark a STOP symbol is generated.
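The credit counter described above for long links can be sketched as follows: the counter starts at the destination buffer size, is decremented on each flit sent, blocks the output at zero, and is replenished as credits return. The structure and the drive_flit hook are hypothetical.

```c
/* Credit-based flow control at the sending side of a link. */
struct credit_link {
    int credits;   /* slots known to be free in the receiver's input buffer */
};

void credit_init(struct credit_link *l, int receiver_buffer_flits)
{
    l->credits = receiver_buffer_flits;   /* start with the whole buffer as credit */
}

/* Try to send one flit; returns 0 on success, -1 if the output is blocked. */
int credit_send(struct credit_link *l, int (*drive_flit)(unsigned), unsigned flit)
{
    if (l->credits == 0)
        return -1;        /* no room guaranteed at the receiver: block */
    l->credits--;
    return drive_flit(flit);
}

/* Called when the receiver signals that it has drained n flits (or a chunk). */
void credit_return(struct credit_link *l, int n)
{
    l->credits += n;
}
```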
Figure 10-34 Transient Flits and Acks with long links. With longer links the flow control scheme needs to allow more slack in order to keep the link full. Several flits can be driven onto the wire before waiting for acks to come back.
There needs to be enough room below the low mark to withstand a full round-trip delay (the GO propagating to the source, being processed, and the first of a stream of flits propagating to the destination). Also, there must be enough headroom above the high mark to absorb the round trip's worth of flits that may be in flight. A nice property of this approach is that redundant GO symbols may be sent anywhere below the high mark and STOP symbols may be sent anywhere above the low mark with no harmful effect. The fraction of the link bandwidth used by flow-control symbols can be reduced by increasing the amount of storage between the low and high marks in the input buffer. This approach is used, for example, in the Caltech routers [SeSu93] and the Myrinet commercial follow-on [Bod*95].
Figure 10-35 Slack Buffer Operation When the fill level drops below the low mark, a GO symbol is sent to the source, and when it goes over the high mark a STOP symbol is generated.
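The receiver side of this slack-buffer scheme amounts to tracking the fill level and emitting GO or STOP symbols as it crosses the marks. The sketch below is illustrative; an implementation must size the region below the low mark and the headroom above the high mark to cover a full round trip of flits, as discussed above.

```c
/* Receiver-side watermark flow control for a slack buffer. */
struct slack_buf {
    int fill;        /* flits currently buffered         */
    int low_mark;    /* below this, tell the source GO   */
    int high_mark;   /* above this, tell the source STOP */
};

enum fc_symbol { FC_NONE, FC_GO, FC_STOP };

/* Called whenever flits arrive or depart; returns the symbol (if any)
 * to send back to the source.  Redundant GO/STOP symbols are harmless. */
enum fc_symbol slack_update(struct slack_buf *b, int delta_flits)
{
    b->fill += delta_flits;
    if (b->fill > b->high_mark)
        return FC_STOP;   /* headroom above the mark absorbs in-flight flits  */
    if (b->fill < b->low_mark)
        return FC_GO;     /* room below the mark covers the GO round trip     */
    return FC_NONE;
}
```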
It is worth noting that link-level flow control is used on host-switch links as well as switch-switch links. In fact, it is generally carried over to the processor-NI interface as well. However, the techniques may vary for these different kinds of interfaces. For example, in the Intel Paragon the (175 MB/s) network links are all short with very small flit buffers. However, the NIs have large input and output FIFOs.
The communication assist includes a pair of DMA engines that can burst at 300 MB/s between memory and the network FIFOs. It is essential that the output (input) buffer not hit full (empty) in the middle of a burst, because holding the bus transaction would hurt performance and potentially lead to deadlock. Thus, the burst is matched to the size of the middle region of the buffer, and the high/low marks are set to cover the control turn-around between the NI and the DMA controller.

10.7.3 End-to-end flow control

Link-level flow control exerts a certain amount of end-to-end control, because if congestion persists, buffers will fill up and the flow will be controlled all the way back to the source host nodes. This is generally called back-pressure. For example, if k nodes are sending data to a single destination, they must eventually all slow to an average bandwidth of 1/k-th of the output bandwidth. If the switch scheduling is fair and all the routes through the network are symmetric, back-pressure will be enough to effect this. The problem is that by the time the sources feel the back-pressure and regulate their output flow, all the buffers in the tree from the hot spot to the sources are full. This will cause all traffic that crosses this tree to be slowed as well.

Hot-spots

This problem received quite a bit of attention as the technology reached a point where machines of several hundred to a thousand processors became reasonable [PfNo85]. If a thousand processors deliver on average any more than 0.1% of their traffic to any one destination, that destination becomes saturated. If this situation persists, as it might if frequent accesses are made to an important shared variable, a saturation tree forms in front of this destination that will eventually reach all the way to the sources. At this point, all the remaining traffic in the system is severely impeded. The problem is particularly pernicious in butterflies, since there is only one route to each destination and there is a great deal of sharing among the routes to any one destination. Adaptive routing proves to make the hot-spot problem even worse, because traffic destined for the hot spot is sure to run into contention and be directed to alternate routes. Eventually the entire network clogs up. Large network buffers do not solve the problem; they just delay the onset. The time for the hot spot to clear is proportional to the total amount of hot-spot traffic buffered in the network, so adaptivity and large buffers increase the time required for the hot spot to clear after the load is removed. Various mechanisms have been developed to mitigate the causes of hot spots, such as all nodes incrementing a shared variable to perform a reduction or synchronization event. We examine these below; however, they only address situations where the problematic traffic is logically related. The more fundamental problem is that link-level flow control is like stopping on the freeway: once the traffic jam forms, you are stuck. The better solution is not to get on the freeway at such times. With fewer packets inside the network, the normal mechanisms can do a better job of getting traffic to its destinations.

Global communication operations

Problems with simple back-pressure have been observed with completely balanced communication patterns, such as each node sending k packets to every other node.
This occurs in many situations, including transposing a global matrix, converting between blocked and cyclic layouts, or the decimation step of an FFT [BrKu94, Dus*96]. Even if the topology is robust enough to avoid
serious internal bottlenecks on these operations (this is true of a fat-tree, but not of a low-degree dimension-order mesh [Lei92]), a temporary backlog can have a cascading effect. When one destination falls behind in receiving packets from the network, a backlog begins to form. If priority is given to draining the network, this node will fall behind in sending to the other nodes. They in turn will send more than they receive, and the backlog will tend to grow. Simple end-to-end protocols in the global communication routines have been shown to mitigate this problem in practice. For example, a node may wait after sending a certain amount of data until it has also received this amount, or it may wait for chunks of its data to be acknowledged (a sketch of such throttling appears below). These precautions keep the processors more closely in step and introduce small gaps in the traffic flow, which decouple the processor-to-processor interaction through the network. (Interestingly, this technique is employed with the metering lights on the Oakland Bay Bridge. Periodically, a wavefront of cars is injected onto the bridge, separated by small gaps. This reduces the stochastic temporary blocking and avoids cascading blockage.)

Admission control

With shallow, cut-through networks the latency is low below saturation. Indeed, in most modern parallel machine networks a single small message, or perhaps a few messages, will occupy an entire path from source to destination. If the remote network interface is not ready to accept the message, it is better to keep it within the source NI rather than blocking traffic within the network. One approach to achieving this is to perform NI-to-NI credit-based flow control. One study examining such techniques for a range of networks [CaGo95] indicates that allowing a single outstanding message per pair of NIs gives good throughput and maintains low latency.
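The send/receive balancing described above for global communication operations can be expressed as a simple guard in the exchange loop. The sketch below uses hypothetical send_packet and poll_receive primitives and lets a node run at most slack packets ahead of what it has received, which keeps the processors loosely in step during an all-to-all exchange. It assumes the peers make similar progress, so the wait loops eventually terminate.

```c
/* End-to-end throttling for a balanced all-to-all exchange:
 * never get more than `slack` packets ahead of what we have received. */
void all_to_all_throttled(int my_id, int nprocs, int packets_per_dest, int slack,
                          void (*send_packet)(int dest),
                          int  (*poll_receive)(void))   /* returns #packets drained */
{
    int sent = 0, received = 0;
    int total_incoming = packets_per_dest * (nprocs - 1);

    for (int k = 0; k < packets_per_dest; k++) {
        for (int d = 1; d < nprocs; d++) {
            int dest = (my_id + d) % nprocs;   /* stagger destinations to avoid hot spots */
            while (sent - received >= slack)   /* wait: stay roughly in step with peers   */
                received += poll_receive();
            send_packet(dest);
            sent++;
        }
    }
    while (received < total_incoming)          /* drain the rest of our incoming traffic  */
        received += poll_receive();
}
```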
10.8 Case Studies

Networks are a fascinating area of study from a practical design and engineering viewpoint because they do one simple operation – move information from one place to another – and yet there is a huge space of design alternatives. Although the most apparent characteristics of a network are its topology, link bandwidth, switching strategy, and routing algorithm, several more characteristics must be specified to completely describe the design. These include the cycle time, link width, switch buffer capacity and allocation strategy, routing mechanism, switch output selection algorithm, and the flow-control mechanism. Each component of the design can be understood and optimized in isolation, but they all interact in interesting ways to determine the global network performance on any particular traffic pattern, in the context of the node architecture and the dependencies embedded in the programs running on the platform. In this section, we summarize a set of concrete network design points in important commercial and research parallel architectures. Using the framework established above in this chapter, we systematically describe the key design parameters.

10.8.1 Cray T3D Network

The Cray T3D network consists of a 3-dimensional bidirectional torus of up to 1024 switch nodes, each connected directly to a pair of processors1, with a data rate of 300 MB/s per channel. Each node is on a single board, along with two processors and memory. There are up to 128 nodes (256 processors) per cabinet; larger configurations are constructed by cabling together cabinets.
Dimension-order, cut-through, packet-switched routing is used. The design of the network is very strongly influenced by the higher-level design of the rest of the system. Logically independent request and response networks are supported, with two virtual channels each to avoid deadlock, over a single set of physical links. Packets are variable length in multiples of 16 bits, as illustrated by Figure 10-36, and network transactions include various reads and writes, plus the ability to start a remote block-transfer engine (BLT), which is basically a DMA device. The first phit always contains the route, followed by the destination address and the packet type or command code. The remainder of the payload depends on the packet type, consisting of relative addresses and data. All of the packet is parity protected, except the routing phit. If the route is corrupted, the packet will be misrouted and the error will be detected at the destination, because the destination address will not match the value in the packet.1
(The figure shows the formats for read requests – no-cache, cached, prefetch, and fetch&inc – read responses, write requests in their processor and BLT variants, write responses, and BLT read requests.)
Figure 10-36 Cray T3D Packet Formats All packets consist of a series of 16 bit phits, with the first three being the route and tag, the destination PE number, and the command. Parsing and processing of the rest of the packet is determined by the tag and command.
T3D links are short, wide, and synchronous. Each unidirectional link is 16 data and 8 control bits wide, operating under a single 150 MHz clock. Flits and phits are 16 bits. Two of the control bits identify the phit type (00 no info, 01 routing tag, 10 packet, 11 last). The routing tag phit and the last-packet phit provide packet framing. Two additional control bits identify the virtual channel (req-hi, req-lo, resp-hi, resp-lo).
1. Special I/O Gateway nodes attached to the boundary of the cube include two processors each attached to a 2-dimensional switch node connected into the X and Z dimensions. We will leave that aspect of the design and return to it in the I/O chapter. 1. This approach to error detection reveals a subtle aspect of networks. If the route is corrupted, there is a very small probability that it will take a wrong turn at just the wrong time and collide with another packet in a manner that causes deadlock within the network, in which case the end node error detection will not be engaged.
The remaining four control lines are acknowledgments in the opposite direction, one per virtual channel. Thus, flow control is performed per virtual channel, and phits can be interleaved between virtual channels on a cycle-by-cycle basis. The switch is constructed as three independent dimension routers, in six 10k gate arrays. There is a modest amount of buffering in each switch (8 16-bit parcels for each of four virtual channels in each of three dimensions), so packets compress into the switches when blocked. There is enough buffering in a switch to store small packets. The input port determines the desired output port by a simple arithmetic operation: the routing distance is decremented, and if the result is non-zero the packet continues in its current direction in the current dimension; otherwise it is routed into the next dimension. Each output port uses a rotating priority among input ports requesting that output. For each input port, there is a rotating priority among virtual channels requesting that output. The network interface contains eight packet buffers, two per virtual channel. The entire packet is buffered in the source NI before being transmitted into the network. It is buffered in the destination NI before being delivered to the processor or memory system. In addition to the main data communication network, separate tree-like networks for logical-AND (Barrier) and logical-OR (Eureka) are provided.

The presence of bidirectional links provides two possible options in each dimension. A table lookup is performed in the source network interface to select the (deterministic) route consisting of the direction and distance in each of the three dimensions. An individual program occupies a partition of the machine consisting of a logically contiguous sub-array of arbitrary shape (under operating system configuration). Shift-and-mask logic within the communication assist maps the partition-relative virtual node address into a machine-wide logical <X,Y,Z> coordinate address. The machine can be configured with spare nodes, which can be brought in to replace a failed node. The <X,Y,Z> address is used as an index for the route lookup, so the NI routing tables provide the final level of translation, identifying the physical node by its < ± x, ∆x, ± y, ∆y, ± z, ∆z > route from the source. This routing lookup also identifies which of the four virtual channels is used. To avoid deadlock within either the request or response (virtual) network, the high channel is used for packets that cross the dateline and the low channel otherwise.

10.8.2 IBM SP-1, SP-2 Network

The network used in the IBM SP-1 and SP-2 [AbAy94, Stu*94] parallel machines is in some ways more versatile than that in the Cray T3D, but of lower performance and without support for two-phase, request-response operation. It is packet switched, with cut-through, source-based routing and no virtual channels. The switch has eight bidirectional 40 MB/s ports and can support a wide variety of topologies. However, in the SP machines a collection of switches is packaged on a board as a 4-ary 2-dimensional butterfly, with sixteen internal connections to hosts in the same rack as the switch board and sixteen external connections to other racks, as illustrated in Figure 10-37. The rack-level topology varies from machine to machine, but is typically a variant of a butterfly. Figure 1-23 of Chapter 1 shows the large IBM SP-2 configuration at the Maui High Performance Computing Center.
Individual cabinets have the first level of routing at the bottom, connecting the 16 internal nodes to 16 external links. Additional cabinets provide connectivity between collections of cabinets. The wiring for these additional levels is located beneath the machine room floor. Since the physical topology is not fixed in hardware, the network interface inserts the route for each outgoing message via a table lookup.
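A source NI that inserts routes this way needs only a per-destination route table, filled in when the topology is configured. The sketch below is an illustrative model of such a lookup, not the SP hardware; the sizes and names are assumptions.

```c
#include <string.h>

#define MAX_NODES      512
#define MAX_ROUTE_LEN  8     /* assumed bound on routing bytes per packet */

/* Per-destination route table kept in the network interface. */
struct route_entry {
    unsigned char bytes[MAX_ROUTE_LEN];
    int           nbytes;
};

struct route_table {
    struct route_entry entry[MAX_NODES];
};

/* Build a packet header: a length byte, then the routing bytes for `dest`.
 * Returns the number of header bytes written, or -1 on error. */
int ni_build_header(const struct route_table *rt, int dest,
                    int payload_len, unsigned char *out)
{
    if (dest < 0 || dest >= MAX_NODES)
        return -1;
    const struct route_entry *e = &rt->entry[dest];
    int total = 1 + e->nbytes + payload_len;
    if (total > 255)
        return -1;                       /* SP packets are at most 255 bytes */
    out[0] = (unsigned char)total;       /* first byte is the packet length  */
    memcpy(out + 1, e->bytes, e->nbytes);
    return 1 + e->nbytes;
}
```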
Figure 10-37 SP Switch Packaging The collection of switches within a rack provides a bidirectional connection between 16 internal ports and 16 external ports.
Packets consist of a sequence of up to 255 bytes; the first byte is the packet length, followed by one or more routing bytes and then the data bytes. Each routing byte contains two 3-bit output specifiers, with an additional selector bit. The links are synchronous, wide, and long. A single 40 MHz clock is distributed to all the switches, with each link tuned so that its delay is an integral number of cycles. (Interboard signals are 100K ECL differential pairs. On-board clock trees are also 100K ECL.) The links consist of ten wires: eight data bits, a framing "tag" control bit, and a reverse flow-control bit. Thus, the phit is one byte. The tag bit identifies the length and routing phits. The flit is two bytes; two cycles are used to signal the availability of two bytes of storage in the receiver buffer. At any time, a stream of data/tag flits can be propagating down the link, while a stream of credit tokens propagates in the other direction. The switch provides 31 bytes of FIFO buffering on each input port, allowing links to be 16 phits long. In addition, there are 7 bytes of FIFO buffering on each output and a shared central queue holding 128 8-byte chunks. As illustrated in Figure 10-38, the switch contains both an unbuffered byte-serial cross-bar and a 128 x 64-bit dual-port RAM as interconnect between input and output ports. After two bytes of the packet have arrived in the input port, the input port control logic can request the desired output. If this output is available, the packet cuts through via the cross-bar to the output port, with a minimum routing delay of 5 cycles per switch. If the output port is not available, the packet fills into the input FIFO. If the output port remains blocked, the packet is spooled into the central queue in 8-byte "chunks." Since the central queue accepts one 8-byte input and one 8-byte output per cycle, its bandwidth matches that of the eight byte-serial input and output ports of the switch. Internally, the central queue is organized as eight FIFO linked lists, one per output port, using an auxiliary 128x7-bit RAM for the links. One 8-byte chunk is reserved for each output port. Thus, the switch operates in byte-serial mode when the load is low, but when contention forms it time-multiplexes 8-byte chunks through the central queue, with the inputs acting as deserializers and the outputs as serializers.
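The central queue organization described above, a shared chunk RAM carved into one linked-list FIFO per output port with a small auxiliary link RAM, can be modeled as follows. The sizes echo the description in the text (128 chunks of 8 bytes, 8 ports), but the code is an illustrative behavioral model, not the Vulcan design; the per-port reserved chunk is not modeled.

```c
#include <stdint.h>

#define NCHUNKS  128   /* 8-byte chunks in the shared central queue RAM */
#define NPORTS   8

/* Shared buffer organized as one FIFO linked list per output port,
 * plus a free list, using an auxiliary link RAM. */
struct central_queue {
    uint64_t data[NCHUNKS];   /* chunk storage (the 128 x 64-bit RAM)            */
    int8_t   next[NCHUNKS];   /* auxiliary link RAM: next chunk index, -1 = end  */
    int8_t   head[NPORTS], tail[NPORTS];  /* per-output FIFO pointers            */
    int8_t   free_head;                   /* head of the free-chunk list         */
};

void cq_init(struct central_queue *q)
{
    for (int p = 0; p < NPORTS; p++) q->head[p] = q->tail[p] = -1;
    for (int i = 0; i < NCHUNKS - 1; i++) q->next[i] = (int8_t)(i + 1);
    q->next[NCHUNKS - 1] = -1;
    q->free_head = 0;
}

/* Enqueue one chunk for a given output port; returns -1 if the RAM is full. */
int cq_enqueue(struct central_queue *q, int port, uint64_t chunk)
{
    int i = q->free_head;
    if (i < 0) return -1;                 /* no free chunk: the input must stall */
    q->free_head = q->next[i];
    q->data[i] = chunk;
    q->next[i] = -1;
    if (q->tail[port] < 0) q->head[port] = (int8_t)i;       /* list was empty */
    else                   q->next[q->tail[port]] = (int8_t)i;
    q->tail[port] = (int8_t)i;
    return 0;
}

/* Dequeue one chunk destined for the given output port; returns -1 if empty. */
int cq_dequeue(struct central_queue *q, int port, uint64_t *chunk)
{
    int i = q->head[port];
    if (i < 0) return -1;
    *chunk = q->data[i];
    q->head[port] = q->next[i];
    if (q->head[port] < 0) q->tail[port] = -1;
    q->next[i] = q->free_head;            /* return the chunk to the free list */
    q->free_head = (int8_t)i;
    return 0;
}
```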
(The figure shows input ports with deserializers, CRC checking, route control, and input FIFOs; an 8x8 byte-serial cross-bar with its arbiter; the 64-bit-wide, 128-entry central queue RAM with input and output arbiters; and output ports with FIFOs, serializers, CRC generation, and flow control.)
Figure 10-38 IBM SP (Vulcan) switch design The IBM SP switch uses a cross-bar for unbuffered packets passed directly from input port to output port, but all buffered packets are shunted into a common memory pool.
Each output port arbitrates among requests on an LRU basis, with chunks in the central queue having priority over bytes in input FIFOs. Output ports are served by the central queue in LRU order. The central queue gives priority to inputs with chunks destined for unblocked output ports. The SP network has three unusual aspects. First, since the operation is globally synchronous, instead of including CRC information in the envelope of each packet, time is divided into 64-cycle "frames." The last two phits of each frame carry the CRC. The input port checks the CRC and the output port generates it (after stripping off used routing phits). Second, the switch is a single chip, and every switch chip is shadowed by an identical switch. The pins are bidirectional I/O pins, so one of the chips merely checks the operation of the other. (This will detect switch errors, but not link errors.) Finally, the switch supports a circuit-switched "service mode" for various diagnostic purposes. The network is drained free of packets before changing modes.

10.8.3 Scalable Coherent Interface

The Scalable Coherent Interface (SCI) provides a well-specified case study in high-performance interconnects, because it emerged through a standards process, rather than as a proprietary design or academic proposal. The standard was a long time in coming, but it has gained popularity as implementations have gotten underway. It has been adopted by several vendors, although in many cases only a portion of the specification is followed. Essentially the full SCI specification is used in the interconnect of the HP/Convex Exemplar and in the Sequent NUMA-Q.
The Cray SCX I/O network is based heavily on SCI. A key element of the SCI design is that it is built around the concept of unidirectional rings, called ringlets, rather than bidirectional links. Ringlets are connected together by switches to form large networks. The specification defines three layers: a physical layer, a packet layer, and a transaction layer. The physical layer is specified in two 1 GB/s forms, as explained in Example . Packets consist of a sequence of 16-bit units, much like the Cray T3D packets. As illustrated in Figure 10-39, all packets consist of a TargetId, Command, and SourceId, followed by zero or more command-specific units, and finally a 32-bit CRC. One unusual aspect is that the source and target Ids are arbitrary 16-bit node addresses. The packet does not contain a route. In this sense, the design is like a bus: the source puts the target address on the interconnect, and the interconnect determines how to get the information to the right place. Within a ring this is simple, because the packet circulates the ring and the target extracts the packet. In the general case, switches use table-driven routing to move packets from ring to ring.
Figure 10-39 SCI Packet formats SCI operations involve a pair of transactions, a request and a response. Owing to its ring-based underpinnings, each transaction involves conveying a packet from the source to the target and an echo (going the rest of the way around the ring) from the target back to the source. All packets are a sequence of 16-bit units, with the first three being the destination node, command, and source node, and the final one being the CRC. Requests contain a 6-byte address, an optional extension, and optional data. Responses have a similar format, but the address bytes carry status information. Both kinds of echo packets contain only the minimal four units. The command unit identifies the packet type through the phase and ech fields. It also describes the operation to be performed (request cmd) or matched (trans id). The remaining fields and the control unit address lower-level issues of how packet queuing and retry are handled at the packet layer.

An SCI transaction, such as a read or write, consists of two network transactions (request and response), each of which has two phases on each ringlet. Let's take this step by step. The source node issues a request packet for a target. The request packet circulates on the source ringlet. If the target is on the same ring as the source, the target extracts the request from the ring and replaces it with an echo packet, which continues around the ring back to the source, whereupon it is removed from the ring. The echo packet serves to inform the source that the original packet was either accepted by the target or rejected, in which case the echo contains a NAK. It may be rejected
either because of a buffer-full condition or because the packet was corrupted. The source maintains a timer on outstanding packets, so it can detect if the echo gets corrupted. If the target is not on the source ring, a switch node on the ring serves as a proxy for the target. It accepts the packet and provides the echo, once it has successfully buffered the packet. Thus, the echo only tells the source that the packet successfully left the ringlet. The switch will then initiate the packet onto another ring along the route to the target. Upon receiving a request, the target node will initiate a response transaction. It too will have a packet phase and an echo phase on each ringlet on the route back to the original source. The request echo packet informs the source of the transaction id assigned to the target; this is used to match the eventual response, much as on a split-phase bus.

Rather than having a clean envelope for each of the layers, in SCI the layers blur together a bit in the packet format, where several fields control queue management and retry mechanisms. The control tpr (transaction priority), command mpr (maximum ring priority), and command spr (send priority) together determine one of four priority levels for the packet. The transaction priority is initially set by the requestor, but the actual send priority is established by nodes along the route based on what other blocked transactions they have in their queues. The phase and busy fields are used as part of the flow-control negotiation between the source and target nodes.

In the Sequent NUMA-Q, the 18-bit wide SCI ring is driven directly by the DataPump in a Quad at 1 GB/s node-to-network bandwidth. The transport layer follows a strict request-reply protocol. When the DataPump puts a packet on the ring, it keeps a copy of the packet in its outgoing buffer until an acknowledgment (called an echo) is returned. When a DataPump removes an incoming packet from the ring, it replaces it with a positive echo that is returned to the sender along the ring. If a DataPump detects a packet destined for it but does not have space to remove that packet from the ring, it sends a negative echo (like a NACK), which causes the sender to retry its transmission (still holding the space in its outgoing buffer). Since the latency of communication on a ring increases linearly with the number of nodes on it, large high-performance SCI systems are expected to be built out of smaller rings interconnected in arbitrary network topologies. For example, a performance study indicates that a single ring can effectively support about 4-8 high-performance processing nodes [SVG92]. The interconnections among rings are expected to be either multiple small bridges, each connecting together a small number of rings, or centralized switches each connecting a large number of rings.

10.8.4 SGI Origin Network

The SGI Origin network is based on a flexible switch, called SPIDER, that supports six pairs of unidirectional links, each pair providing over 1.56 GB/s of total bandwidth in the two directions. Two nodes (four processors) are connected to each router, so there are four pairs of links to connect to other routers. This building block is configured in a family of topologies related to hypercubes, as illustrated in Figure 10-40. The links are flexible (long, wide) cables that can be up to 3 meters long. The peak bisection bandwidth of a maximum configuration is over 16 GB/s.
Messages are pipelined through the network, and the latency through the router itself is 41 ns from pin to pin. Routing is table driven, so as part of the network initialization the routing tables are set up in each switch. This allows routing to be programmable, so it is possible to support a range of configurations, to have a partially populated network (i.e., not all ports are used), and to route around faulty components. Availability is also aided by the fact that the routers are separately powered, and they provide per-link error protection with hardware retry upon error. The switch provides separate virtual channels for requests and replies, and supports 256 levels of message priority, with packet aging.
(The figure panels show 2-, 4-, 8-, 16-, 32-, and 64-node configurations.)
(e) 64-node Figure 10-40 Network topology and router connections in the Origin multiprocessor. Figures (d) and (e) show only the routers and omit the two nodes connected to each for simplicity. Figure (e) also shows only a few of the fat-cube connections across hypercubes; The routers that connect 16-node subcubes are called meta-routers. For a 512-node (1024-processor) configuration, each meta-router itself would be replaced by a 5-d router hypercube.
10.8.5 Myricom Network

As a final case study, we briefly examine the Myricom network [Bod*95] used in several cluster systems. The communication assist within the network interface card was described in Section 7.8. Here we are concerned with the switch that is used to construct a scalable interconnection network. Perhaps the most interesting aspect of its design is its simplicity. The basic building block is a switch with eight bidirectional ports of 160 MB/s each. The physical link is a 9-bit wide, long wire. It can be up to 25 m long, and 23 phits can be in transit simultaneously. The encoding allows flow-control, framing, and 'gap' (or idle) symbols to be interleaved with data symbols. Symbols are constantly transmitted between switches across the links, so the switch can determine which of its ports are connected. A packet is simply a sequence of routing bytes followed by a sequence of payload bytes and a CRC byte. The end of a packet is delimited by the presence of a gap symbol. The start of a packet is indicated by the presence of a non-gap symbol. The operation of the switch is extremely simple: it removes the first byte from an incoming packet, computes the output port by adding the contents of the route byte to the input port number, and spools the rest of the packet to that port. The switch uses wormhole routing with a small amount of buffering for each input and each output. The routing delay through a switch is less than 500 ns.
The switches can be wired together in an arbitrary topology. It is the responsibility of the host communication software to construct routes that are valid for the physical interconnection of the network. If a packet attempts to exit a switch on an invalid or unconnected port, it is discarded. When all the routing bytes are used up, the packet should have arrived at a NIC; the first byte of the message has a bit to indicate that it is not a routing byte, and the remaining bits indicate the packet type. All higher-level packet formatting and transactions are realized by the NIC and the host; the interconnection network only moves the bits.
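The per-switch behavior just described, stripping the leading route byte, adding it to the arriving port number, and spooling the remainder, fits in a few lines. This is a behavioral sketch with hypothetical helper functions, not Myricom's implementation; in particular, treating the route byte as a signed offset relative to the input port is an assumption.

```c
/* Behavioral sketch of a source-routed switch in the style described above. */
#define NUM_SWITCH_PORTS 8

struct packet {
    unsigned char *bytes;   /* route bytes, then payload, then a CRC byte */
    int            len;
};

/* Forward one packet arriving on `in_port`.  Returns the output port used,
 * or -1 if the packet was discarded (invalid or unconnected port). */
int switch_forward(int in_port, struct packet *p,
                   int  (*port_connected)(int port),
                   void (*spool)(int out_port, const unsigned char *bytes, int len))
{
    signed char offset = (signed char)p->bytes[0];   /* leading route byte */
    int out_port = in_port + offset;                 /* relative routing   */

    if (out_port < 0 || out_port >= NUM_SWITCH_PORTS || !port_connected(out_port))
        return -1;                                   /* drop the packet    */

    /* Strip the consumed route byte and spool the rest to the output port. */
    spool(out_port, p->bytes + 1, p->len - 1);
    return out_port;
}
```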
10.9 Concluding Remarks

Parallel computer networks present a rich and diverse design space that brings together several levels of design. The physical-layer and link-level issues represent some of the most interesting electrical engineering aspects of computer design. In a constant effort to keep pace with the increasing speed of processors, link rates are constantly improving. Currently we are seeing multigigabit rates on copper pairs, and parallel fiber technologies offer the possibility of multigigabyte rates per link in the near future. One of the major issues at these rates is dealing with errors. If the bit error rate of the physical medium is on the order of 10^-19 per bit and data is being transmitted at a rate of 10^9 bytes per second, then an error is likely to occur on a link roughly every 10 minutes. With thousands of links in the machine, errors occur every second. These must be detected and corrected rapidly.

The switch-to-switch layer of design also offers a rich set of trade-offs, including how aspects of the problem are pushed down into the physical layer and how they are pushed up into the packet layer. For example, flow control can be built into the exchange of digital symbols across a link, or it can be addressed at the next layer in terms of packets that cross the link, or even one layer higher, in terms of messages sent from end to end. There is a huge space of alternatives for the design of the switch itself, and these are again driven by engineering constraints from below and design requirements from above. Even the basic topology, switching strategy, and routing algorithm reflect a compromise of engineering requirements and design requirements from below and above. We have seen, for example, how a basic question like the degree of the switch depends heavily on the cost model associated with the target technology and the characteristics of the communication pattern in the target workload. Finally, as with many other aspects of architecture, there is significant room for debate as to how much of the higher-level semantics of the node-to-node communication abstraction should be embedded in the hardware design of the network itself.

The entire area of network design for parallel computers is bound to be exciting for many years to come. In particular, there is a strong flow of ideas and technology between parallel computer networks and advancing, scalable networks for local area and system area communication.
10.10 References

[AbAy94] B. Abali and C. Aykanat, Routing Algorithms for IBM SP1, Lecture Notes in Computer Science, vol. 853, Springer-Verlag, 1994, pp. 161-175.
[Aga91] A. Agarwal, Limit on Interconnection Performance, IEEE Transactions on Parallel and Distributed Systems, vol. 2, no. 4, Oct. 1991, pp. 398-412.
[And*92] T. E. Anderson, S. S. Owicki, J. P. Saxe, and C. P. Thacker, High Speed Switch Scheduling for Local Area Networks, Proc. of ASPLOS V, Oct. 1992, pp. 98-110.
[Arn*89] E. A. Arnold, F. J. Bitz, E. C. Cooper, H. T. Kung, R. D. Sansom, and P. Steenkiste, The Design of Nectar: A Network Backplane for Heterogeneous Multicomputers, Proc. of ASPLOS III, Apr. 1989, pp. 205-216.
[Bak90] H. B. Bakoglu, Circuits, Interconnection, and Packaging for VLSI, Addison-Wesley, 1990.
[Bak*95] W. E. Baker, R. W. Horst, D. P. Sonnier, and W. J. Watson, A Flexible ServerNet-based Fault-Tolerant Architecture, 25th Annual Symposium on Fault-Tolerant Computing, Jun. 1995.
[Ben65] V. Benes, Mathematical Theory of Connecting Networks and Telephone Traffic, Academic Press, New York, NY, 1965.
[BhLe84] S. Bhatt and C. L. Leiserson, How to Assemble Tree Machines, Advances in Computing Research, vol. 2, JAI Press, 1984, pp. 95-114.
[Bod*95] N. Boden, D. Cohen, R. Felderman, A. Kulawik, C. Seitz, J. Seizovic, and W. Su, Myrinet: A Gigabit-per-Second Local Area Network, IEEE Micro, vol. 15, no. 1, Feb. 1995, pp. 29-38.
[Bor*90] S. Borkar, R. Cohn, G. Cox, T. Gross, H. T. Kung, M. Lam, M. Levine, B. Moore, W. Moore, C. Peterson, J. Susman, J. Sutton, J. Urbanski, and J. Webb, Supporting Systolic and Memory Communication in iWarp, Proc. of the 17th Intl. Symposium on Computer Architecture, May 1990, pp. 70-81. A revised version appears as technical report CMU-CS-90-197.
[Bre*94] E. A. Brewer, F. T. Chong, and F. T. Leighton, Scalable Expanders: Exploiting Hierarchical Random Wiring, Proc. of the 1994 Symp. on the Theory of Computing, Montreal, Canada, May 1994.
[BrKu94] E. A. Brewer and B. C. Kuszmaul, How to Get Good Performance from the CM-5 Data Network, Proc. of the 1994 International Parallel Processing Symposium, Cancun, Mexico, April 1994.
[CaGo95] T. Callahan and S. C. Goldstein, NIFDY: A Low Overhead, High Throughput Network Interface, Proc. of the 22nd Annual Symp. on Computer Architecture, June 1995.
[ChKi92] A. A. Chien and J. H. Kim, Planar-Adaptive Routing: Low-cost Adaptive Networks for Multiprocessors, Proc. of ISCA, 1992.
[Coh*93] D. Cohen, G. Finn, R. Felderman, and A. DeSchon, ATOMIC: A Low-Cost, Very High-Speed, Local Communication Architecture, Proc. of the 1993 Int. Conf. on Parallel Processing, 1993.
[Dal90a] W. J. Dally, Virtual-Channel Flow Control, Proc. of ISCA 1990, Seattle, WA.
[Dal90b] W. J. Dally, Performance Analysis of k-ary n-cube Interconnection Networks, IEEE Transactions on Computers, vol. 39, no. 6, June 1990, pp. 775-785.
[DaSe87] W. Dally and C. Seitz, Deadlock-Free Message Routing in Multiprocessor Interconnection Networks, IEEE Transactions on Computers, vol. C-36, no. 5, May 1987.
[Dus*96] A. C. Dusseau, D. E. Culler, K. E. Schauser, and R. P. Martin, Fast Parallel Sorting Under LogP: Experience with the CM-5, IEEE Transactions on Parallel and Distributed Systems, 7(8):791-805, August 1996.
[Fel94] R. Felderman, et al., Atomic: A High Speed Local Communication Architecture, Journal of High Speed Networks, vol. 3, no. 1, 1994, pp. 1-29.
[GlNi92] C. J. Glass and L. M. Ni, The Turn Model for Adaptive Routing, Proc. of ISCA 1992, pp. 278-287.
[GrLe89] R. I. Greenberg and C. E. Leiserson, Randomized Routing on Fat-Trees, Advances in Computing Research, vol. 5, JAI Press, 1989, pp. 345-374.
[Gro92] W. Groscup, The Intel Paragon XP/S Supercomputer, Proc. of the Fifth ECMWF Workshop on the Use of Parallel Processors in Meteorology, Nov. 1992, pp. 262-273.
[Gun81] K. D. Gunther, Prevention of Deadlocks in Packet-Switched Data Transport Systems, IEEE Transactions on Communication, vol. COM-29, no. 4, Apr. 1981, pp. 512-524.
[HiTu93] W. D. Hillis and L. W. Tucker, The CM-5 Connection Machine: A Scalable Supercomputer, Communications of the ACM, vol. 36, no. 11, Nov. 1993, pp. 31-40.
[HoMc93] M. Homewood and M. McLaren, Meiko CS-2 Interconnect Elan - Elite Design, Hot Interconnects, Aug. 1993.
[Hor95] R. Horst, TNet: A Reliable System Area Network, IEEE Micro, vol. 15, no. 1, Feb. 1995, pp. 37-45.
[Joe94] C. F. Joerg, Design and Implementation of a Packet Switched Routing Chip, Technical Report MIT/LCS/TR-482, MIT Laboratory for Computer Science, Aug. 1994.
[JoBo91] C. F. Joerg and A. Boughton, The Monsoon Interconnection Network, Proc. of ICCD, Oct. 1991.
[Kar*87] M. Karol, M. Hluchyj, and S. Morgan, Input versus Output Queueing on a Space Division Packet Switch, IEEE Transactions on Communications, vol. 35, no. 12, Dec. 1987, pp. 1347-1356.
[Kar*90] R. Karp, U. Vazirani, and V. Vazirani, An Optimal Algorithm for On-line Bipartite Matching, Proc. of the 22nd ACM Symposium on the Theory of Computing, May 1990, pp. 352-358.
[KeSc93] R. E. Kessler and J. L. Schwarzmeier, Cray T3D: A New Dimension for Cray Research, Proc. of COMPCON Spring '93, San Francisco, CA, Feb. 1993, pp. 176-182.
[KeKl79] P. Kermani and L. Kleinrock, Virtual Cut-through: A New Computer Communication Switching Technique, Computer Networks, vol. 3, Sep. 1979, pp. 267-286.
[Koe*94] R. K. Koeninger, M. Furtney, and M. Walker, A Shared Memory MPP from Cray Research, Digital Technical Journal, 6(2):8-21, Spring 1994.
[KoSn] S. Konstantinidou and L. Snyder, Chaos Router: Architecture and Performance, Proc. of the 18th Annual Symp. on Computer Architecture, 1991, pp. 212-223.
[KrSn83] C. P. Kruskal and M. Snir, The Performance of Multistage Interconnection Networks for Multiprocessors, IEEE Transactions on Computers, vol. C-32, no. 12, Dec. 1983, pp. 1091-1098.
[Kun*91] H. T. Kung, et al., Network-based Multicomputers: An Emerging Parallel Architecture, Proc. of Supercomputing '91, Albuquerque, NM, Nov. 1991, pp. 664-673.
[Leis85] C. E. Leiserson, Fat-trees: Universal Networks for Hardware-Efficient Supercomputing, IEEE Transactions on Computers, vol. C-34, no. 10, Oct. 1985, pp. 892-901.
[Lei92] F. T. Leighton, Introduction to Parallel Algorithms and Architectures, Morgan Kaufmann Publishers, San Francisco, CA, 1992.
[Lei*92] C. E. Leiserson, Z. S. Abuhamdeh, D. C. Douglas, C. R. Feynman, M. N. Ganmukhi, J. V. Hill, W. D. Hillis, B. C. Kuszmaul, M. A. St. Pierre, D. S. Wells, M. C. Wong, S. Yang, and R. Zak, The Network Architecture of the CM-5, Symposium on Parallel and Distributed Algorithms '92, Jun. 1992, pp. 272-285.
[Li88] S.-Y. Li, Theory of Periodic Contention and Its Application to Packet Switching, Proc. of INFOCOM '88, Mar. 1988, pp. 320-325.
[LiHa91] D. Linder and J. Harden, An Adaptive Fault Tolerant Wormhole Strategy for k-ary n-cubes, IEEE Transactions on Computers, C-40(1):2-12, Jan. 1991.
[Mai*97] A. Mainwaring, B. Chun, S. Schleimer, and D. Wilkerson, System Area Network Mapping, Proc. of the 9th Annual ACM Symp. on Parallel Algorithms and Architectures, Newport, RI, June 1997, pp. 116-126.
[NuDa92] P. Nuth and W. J. Dally, The J-Machine Network, Proc. of the International Conference on Computer Design: VLSI in Computers and Processors, Oct. 1992.
[NgSe89] J. Ngai and C. Seitz, A Framework for Adaptive Routing in Multicomputer Networks, Proc. of the 1989 Symp. on Parallel Algorithms and Architectures, 1989.
[PeDa96] L. Peterson and B. Davie, Computer Networks, Morgan Kaufmann Publishers, 1996.
[Pfi*85] G. F. Pfister, W. C. Brantley, D. A. George, S. L. Harvey, W. J. Kleinfelder, K. P. McAuliffe, E. A. Melton, V. A. Norton, and J. Weiss, The IBM Research Parallel Processor Prototype (RP3): Introduction and Architecture, International Conference on Parallel Processing, 1985.
[PfNo85] G. F. Pfister and V. A. Norton, "Hot Spot" Contention and Combining in Multistage Interconnection Networks, IEEE Transactions on Computers, vol. C-34, no. 10, Oct. 1985.
[Rat85] J. Ratner, Concurrent Processing: A New Direction in Scientific Computing, Conf. Proc. of the 1985 National Computing Conference, 1985, p. 835.
[Ret*90] R. Rettberg, W. Crowther, P. Carvey, and R. Tomlinson, The Monarch Parallel Processor Hardware Design, IEEE Computer, April 1990, pp. 18-30.
[ReTh96] R. Rettberg and R. Thomas, Contention is no Obstacle to Shared-Memory Multiprocessing, Communications of the ACM, 29:12, Dec. 1986, pp. 1202-1212.
[ScGo94] S. L. Scott and J. R. Goodman, The Impact of Pipelined Channels on k-ary n-Cube Networks, IEEE Transactions on Parallel and Distributed Systems, vol. 5, no. 1, Jan. 1994, pp. 2-16.
[Sch*90] M. D. Schroeder, A. D. Birrell, M. Burrows, H. Murray, R. M. Needham, T. L. Rodeheffer, E. H. Satterthwaite, and C. P. Thacker, Autonet: A High-speed, Self-configuring Local Area Network Using Point-to-point Links, IEEE Journal on Selected Areas in Communications, vol. 9, no. 8, Oct. 1991, pp. 1318-1335.
[ScJa95] L. Schwiebert and D. N. Jayasimha, A Universal Proof Technique for Deadlock-Free Routing in Interconnection Networks, SPAA, Jul. 1995, pp. 175-184.
[Sei85] C. L. Seitz, The Cosmic Cube, Communications of the ACM, 28(1):22-33, Jan. 1985.
[SeSu93] C. L. Seitz and W.-K. Su, A Family of Routing and Communication Chips Based on Mosaic, Proc. of the Univ. of Washington Symposium on Integrated Systems, MIT Press, pp. 320-337.
[SP2] C. B. Stunkel, et al., The SP2 Communication Subsystem, on-line as http://ibm.tc.cornell.edu/ibm/pps/doc/css/css.ps.
[Stu*94] C. B. Stunkel, D. G. Shea, D. G. Grice, P. H. Hochschild, and M. Tsao, The SP-1 High Performance Switch, Proc. of the Scalable High Performance Computing Conference, Knoxville, TN, May 1994, pp. 150-157.
[SVG92] S. Scott, M. Vernon, and J. Goodman, Performance of the SCI Ring, Proc. of the 19th International Symposium on Computer Architecture, May 1992, pp. 403-414.
[TaFr88] Y. Tamir and G. L. Frazier, High-performance Multi-queue Buffers for VLSI Communication Switches, 15th Annual International Symposium on Computer Architecture, 1988, pp. 343-354.
[TrDu92] R. Traylor and D. Dunning, Routing Chip Set for Intel Paragon Parallel Supercomputer, Proc. of the Hot Chips '92 Symposium, August 1992.
[Tur88] J. S. Turner, Design of a Broadcast Packet Switching Network, IEEE Transactions on Communication, vol. 36, no. 6, June 1988, pp. 734-743.
[vEi*92] T. von Eicken, D. E. Culler, S. C. Goldstein, and K. E. Schauser, Active Messages: A Mechanism for Integrated Communication and Computation, Proc. of the 19th Annual International Symposium on Computer Architecture, Gold Coast, Qld., Australia, May 1992, pp. 256-266.
10.11 Exercises

10.1 Consider a packet format with 10 bytes of routing and control information and 6 bytes of CRC and other trailer information. The payload contains a 64-byte cache block, along with 8 bytes of command and address information. If the raw link bandwidth is 500 MB/s, what is the effective data bandwidth on cache block transfers using this format? How would this change with 32-byte cache blocks? 128-byte blocks? 4 KB page transfers?

10.2 Suppose the links are 1 byte wide and operate at 300 MHz in a network where the average routing distance between nodes is log_4 P for P nodes. Compare the unloaded latency for 80-byte packets under store-and-forward and cut-through routing, assuming 4 cycles of delay per hop to make the routing decision and P ranging from 16 to 1024 nodes.

10.3 Perform the comparison in Exercise 10.2 for 32 KB transfers fragmented into 1 KB packets.

10.4 Find an optimal fragmentation strategy for an 8 KB page-sized message under the following assumptions: there is a fixed overhead of o cycles per fragment, there is a store-and-forward delay through the source and the destination NI, packets cut through the network with a routing delay of R cycles, and the limiting data transfer rate is b bytes per cycle.

10.5 Suppose an n × n matrix of double precision numbers is laid out over P nodes by rows, and in order to compute on the columns you desire to transpose it to have a column layout. How much data will cross the bisection during this operation?

10.6 Consider a 2D torus direct network of N nodes with a link bandwidth of b bytes per second. Calculate the bisection bandwidth and the average routing distance. Compare the estimate of aggregate communication bandwidth using Equation 10.6 with the estimate based on bisection. In addition, suppose that every node communicates only with nodes 2 hops away. Then what bandwidth is available? What if each node communicates only with nodes in its row?

10.7 Show how an N-node torus is embedded in an N-node hypercube such that neighbors in the torus, i.e., nodes at distance one, are neighbors in the hypercube. (Hint: observe what happens when the addresses are ordered in a Gray-code sequence.) Generalize this embedding for higher-dimensional meshes.

10.8 Perform the analysis of Equation 10.6 for a hypercube network. In the last step, treat the row as being defined by the Gray-code mapping of the grid into the hypercube.

10.9 Calculate the average distance for a linear array of N nodes (assume N is even) and for a 2D mesh and 2D torus of N nodes.

10.10 A more accurate estimate of the communication time can be obtained by determining the average number of channels traversed per packet, h_ave, for the workload of interest and the particular network topology. The effective aggregate bandwidth is at most (C / h_ave) B. Suppose the orchestration of a program treats the processors as a p × p logical mesh where each node communicates n bytes of data with its eight neighbors in the four directions and on the diagonal. Give an estimate of the communication time on a grid and a hypercube.

10.11 Show how an N-node tree can be embedded in an N-node hypercube by stretching one of the tree edges across two hypercube edges.

10.12 Use a spreadsheet to construct the comparison in Figure 10-12 for three distinct design points: 10-cycle routing delays, 1024-byte messages, and 16-bit wide links. What conclusions can you draw?
10.13 Verify that the minimum latency is achieved under equal pin scaling in Figure 10-14 when the routing delay is equal to the channel time.
10.14 Derive a formula for the dimension that achieves the minimum latency under equal bisection scaling, based on Equation 10.7.

10.15 Under the equal bisection scaling rule, how wide are the links of a 2D mesh with 1 million nodes?

10.16 Compare the behavior under load of 2D and 3D cubes of roughly equal size, as in Figure 10-17, for equal pin and equal bisection scaling.

10.17 Specify the boolean logic used to compute the routing function in the switch with ∆x, ∆y routing on a 2D mesh.

10.18 If ∆x, ∆y routing is used, what effect does decrementing the count in the header have on the CRC in the trailer? How can this issue be addressed in the switch design?

10.19 Specify the boolean logic used to compute the routing function in the switch for dimension-order routing on a hypercube.

10.20 Prove that e-cube routing in a hypercube is deadlock free.

10.21 Construct the table-based equivalent of [∆x, ∆y] routing in a 2D mesh.

10.22 Show that permitting the pair of turns not allowed in Figure 10-24 leaves complex cycles. (Hint: they look like a figure 8.)

10.23 Show that with two virtual channels, arbitrary routing in a bidirectional network can be made deadlock free.

10.24 Pick any flow control scheme in the chapter and compute the fraction of bandwidth used by flow-control symbols.

10.25 Revise the latency estimates of Figure 10-14 for stacked dimension switches.

10.26 Table 10-1 provides the topologies and an estimate of the basic communication performance characteristics for several important designs. For each, calculate the network latency for 40-byte and 160-byte messages on 16 and 1024 node configurations. What fraction of this is routing delay and what fraction is occupancy?
Table 10-1 Link width and routing delay for various parallel machine networks

Machine         Topology    Cycle Time (ns)   Channel Width (bits)   Routing Delay (cycles)   FLIT (data bits)
nCUBE/2         Hypercube   25                1                      40                       32
TMC CM-5        Fat tree    25                4                      10                       4
IBM SP-2        Banyan      25                8
Intel Paragon   2D Mesh     11.5              16
Meiko CS-2      Fat tree    20                8
Cray T3D        3D Torus    6.67              16
DASH            Torus       30                16
J-Machine       3D Mesh     31                8
Monsoon         Butterfly   20                16
CHAPTER 11
Latency Tolerance
Morgan Kaufmann is pleased to present material from a preliminary draft of Parallel Computer Architecture; the material is (c) Copyright 1997 Morgan Kaufmann Publishers. This material may not be used or distributed for any commercial purpose without the express written consent of Morgan Kaufmann Publishers. Please note that this material is a draft of forthcoming publication, and as such neither Morgan Kaufmann nor the authors can be held liable for changes or alterations in the final edition.
11.1 Introduction

In Chapter 1, we saw that while the speed of microprocessors increases by more than a factor of ten per decade, the speed of commodity memories (DRAMs) only doubles, i.e., access time is halved. Thus, the latency of memory access in terms of processor clock cycles grows by a factor of six in ten years! Multiprocessors greatly exacerbate the problem. In bus-based systems, the imposition of a high-bandwidth bus between processor and memory tends to increase the latency of obtaining data from memory: Recall that in both the SGI Challenge and the Sun Enterprise about a third of the read miss penalty was spent in the bus protocol. When memory is physically distributed, the latency of the network and the network interface is added to that of accessing the local memory on the node. Caches help reduce the frequency of high-latency accesses, but they are not a panacea: They do not reduce inherent communication, and programs have significant miss rates from other sources as well. Latency usually grows with the size of the machine, since more nodes implies more communication relative to computation, more hops in the network for general communication, and likely more contention. The goal of the protocols developed in the previous chapters has been to reduce the frequency of long-latency events and the bandwidth demands imposed on the communication media while providing a convenient programming model. The primary goal of hardware design has been to reduce the latency of data access while maintaining high, scalable bandwidth. Usually, we can improve bandwidth by throwing hardware at the problem, but latency is a more fundamental limitation. With careful design one can try not to exceed the inherent latency of the technology too much, but data access nonetheless takes time. So far, we have seen three ways to reduce the
latency of data access in a multiprocessor system, the first two the responsibility of the system and the third of the application.
• Reduce the access time to each level of the storage hierarchy. This requires careful attention to detail in making each step in the access path efficient. The processor-to-cache interface can be made very tight. The cache controller needs to act quickly on a miss to reduce the penalty in going to the next level. The network interface can be closely coupled with the node and designed to format, deliver, and handle network transactions quickly. The network itself can be designed to reduce routing delays, transfer time, and congestion delays. All of these efforts reduce the cost when access to a given level of the hierarchy is required. Nonetheless, the costs add up.

• Reduce the likelihood that data accesses will incur high latency: This is the basic job of automatic replication such as in caches, which take advantage of spatial and temporal locality in the program access pattern to keep the most important data close to the processor that accesses it. We can make replication more effective by tailoring the machine structure, for example, by providing a substantial amount of storage in each node.

• Reduce the frequency of potential high-latency events in the application. This involves decomposing and assigning computation to processors to reduce inherent communication, and structuring access patterns to increase spatial and temporal locality.

These system and application efforts to reduce latency help greatly, but often they do not suffice. There are also other potentially high-latency events than data access and communication, such as synchronization. This chapter discusses another approach to dealing with the latency that remains:
• Tolerate the remaining latency. That is, hide the latency from the processor’s critical path by overlapping it with computation or other high-latency events. The processor is allowed to perform other useful work or even data access/communication while the high-latency event is in progress. The key to latency tolerance is in fact parallelism, since the overlapped activities must be independent of one another. The basic idea is very simple, and we use it all the time in our daily lives: If you are waiting for one thing, e.g. a load of clothes to finish in the washing machine, do something else, e.g. run an errand, while you wait (fortunately, data accesses do not get stolen if you don’t keep an eye on them). Latency tolerance cuts across all of the issues we have discussed in the book, and has implications for hardware as well as software, so it serves a useful role in ‘putting it all together’. As we shall see, the success of latency tolerance techniques depends on both the characteristics of the application as well as on the efficiency of the mechanisms provided and the communication architecture. Latency tolerance is already familiar to us from multiprogramming in uniprocessors. Recall what happens on a disk access, which is a truly long latency event. Since the access takes a long time, the processor does not stall waiting for it to complete. Rather, the operating system blocks the process that made the disk access and switches in another one, typically from another application, thus overlapping the latency of the access with useful work done on the other process. The blocked process is resumed later, hopefully after that access completes. This is latency tolerance: While the disk access itself does not complete any more quickly, and in this case the process that made the disk access sees its entire latency or even more since it may be blocked for even longer, the underlying resource (e.g. the processor) is not stalled and accomplishes other useful work in the meantime. While switching from one process to another via the operating system takes a lot
of instructions, the latency of disk access is high enough that this is very worthwhile. In this multiprogramming example, we do not succeed in reducing the execution time of any one process— in fact we might increase it—but we improve the system’s throughput and utilization. More overall work gets done and more processes complete per unit time. The latency tolerance that is our primary focus in this chapter is different from the above example in two important ways. First, we shall focus mostly on trying to overlap the latency with work from the same application; i.e. our goal will indeed be to use latency tolerance to reduce the execution time of a given application. Second, we are trying to tolerate not disk latencies but those of the memory and communication systems. These latencies are much smaller, and the events that cause them are usually not visible to the operating system. Thus, the time-consuming switching of applications or processes via the operating system is not a viable solution. Before we go further, let us define a few terms that will be used throughout the chapter, including latency itself. The latency of a memory access or communication operation includes all components of the time from issue by the processor till completion. For communication, this includes the processor overhead, assist occupancy, transit latency, bandwidth-related costs, and contention. The latency may be for one-way transfer of communication or round-trip transfers, which will usually be clear from the context, and it may include the cost of protocol transactions like invalidations and acknowledgments in addition to the cost of data transfer. In addition to local memory latency and communication latency, there are two other types of latency for which a processor may stall. Synchronization latency is the time from when a processor issues a synchronization operation (e.g. lock or barrier) to the time that it gets past that operation; this includes accessing the synchronization variable as well as the time spent waiting for an event that it depends on to occur. Finally, instruction latency is the time from when an instruction is issued to the time that it completes in the processor pipeline, assuming no memory, communication or synchronization latency. Much of instruction latency is already hidden by pipelining, but some may remain due to long instructions (e.g. floating point divides) or bubbles in the pipeline. Different latency tolerance techniques are capable of tolerating some subset of these different types of latencies. Our primary focus will be on tolerating communication latencies, whether for explicit or implicit communication, but some of the techniques discussed are applicable to local memory, synchronization and instruction latencies, and hence to uniprocessors as well. A communication from one node to another that is triggered by a single user operation is called a message, regardless of its size. For example, a send in the explicit message-passing abstraction constitutes a message, as does the transfer of a cache block triggered by a cache miss that is not satisfied locally in a shared address space (if the miss is satisfied locally its latency is called memory latency, if remotely then communication latency). Finally, an important aspect of communication for the applicability of latency tolerance techniques is whether it is initiated by the sender (source or producer) or receiver (destination or consumer) of the data. 
Communication is said to be sender-initiated if the operation that causes the data to be transferred is initiated by the process that has produced or currently holds the data, without solicitation from the receiver; for example, an unsolicited send operation in message passing. It is said to be receiver-initiated if the data transfer is caused or solicited by an operation issued by the process that needs the data, for example a read miss to nonlocal data in a shared address space. Sender- and receiver-initiated communication will be discussed in more detail later in the context of specific programming models. Section 11.2 identifies the problems that result from memory and communication latency as seen from different perspectives, and provides an overview of the four major types of approaches to latency tolerance: block data transfer, precommunication, proceeding past an outstanding
communication event in the same thread, and multithreading or finding independent work to overlap in other threads of execution. These approaches are manifested as different specific techniques in different communication abstractions. This section also discusses the basic system and application requirements that apply to any latency tolerance technique, and the potential benefits and fundamental limitations of exploiting latency tolerance in real systems. Having covered the basic principles, the rest of the chapter examines how the four approaches are applied in the two major communication abstractions. Section 11.3 briefly discusses how they may be used in an explicit message-passing communication abstraction. This is followed by a detailed discussion of latency tolerance techniques in the context of a shared address space abstraction. The discussion for a shared address space is more detailed since latency is likely to be a more significant bottleneck when communication is performed through individual loads and stores than through flexibly sized transfers. Also, latency tolerance exposes interesting interactions with the architectural support already provided for a shared address space, and many of the techniques for hiding read and write latency in a shared address space are applicable to uniprocessors as well. Section 11.4 provides an overview of latency tolerance in this abstraction, and the next four sections each focus on one of the four approaches, describing the implementation requirements, the performance benefits, the tradeoffs and synergies among techniques, and the implications for hardware and software. One of the requirements across all techniques we will discuss is that caches be nonblocking or lockup-free, so Section 11.9 discusses techniques for implementing lockup-free caches. Finally, Section 11.10 concludes the chapter.
11.2 Overview of Latency Tolerance To begin our discussion of tolerating communication latency, let us look at a very simple example computation that will serve as an example throughout the chapter. The example is a simple producer-consumer computation. A process PA computes and writes the n elements of an array A, and another process PB reads them. Each process performs some unrelated computation during the loop in which it writes or reads the data in A, and A is allocated in the local memory of the processor that PA runs on. With no latency tolerance, the process generating the communication would simply issue a word of communication at a time—explicitly or implicitly as per the communication abstraction—and would wait till the communication message completes before doing anything else. This will be referred to as the baseline communication structure. The computation might look like that shown in Figure 11-1(a) with explicit message passing, and like that shown in Figure 11-1(b) with implicit, read-write communication in a shared address space. In the former, it is the send operation issued by PA that generates the actual communication of the data, while in the latter it is the read of A[i] by PB.1 We assume that the read stalls the processor until it completes, and that a synchronous send is used (as was described in Section 2.4.6). The resulting time-lines for the processes that initiate the communi-
1. The examples are in fact not exactly symmetric in their baseline communication structure, since in the message passing version the communication happens after each array entry is produced, while in the shared address space version it happens after the entire array has been produced. However, implementing the fine-grained synchronization necessary for the shared address space version would require synchronization for each array entry, and the communication needed for this synchronization would complicate the discussion. The asymmetry does not affect the discussion of latency tolerance.
(a) Message Passing

Process PA:
    for i←0 to n-1 do
        compute A[i];
        write A[i];
        send (A[i] to PB);
        compute f(B[i]); /* unrelated */
    end for

Process PB:
    for i←1 to n-1 do
        receive (myA[i]);
        use myA[i];
        compute g(C[i]);
    end for

(b) Shared Address Space

Process PA:
    for i←0 to n-1 do
        compute A[i];
        write A[i];
        compute f(B[i]); /* unrelated */
    end for
    flag ← 1;

Process PB:
    while flag = 0 {};
    for i←1 to n-1 do
        read A[i];
        use A[i];
        compute g(C[i]);
    end for
Figure 11-1 Pseudocode for the example computation. Pseudocode is shown for both explicit message passing and with implicit read-write communication in a shared address space, with no latency hiding in either case. The operations that generate the data transfer are the send in (a) and the read of A[i] in (b).
cation are shown in Figure 11-2. A process spends most of its time stalled waiting for communication.

[Figure 11-2: time in the processor's critical path for process PA in message passing and for process PB in a shared address space; iterations i, i+1, and i+2 are separated by the round-trip communication time (send + acknowledgment in message passing, read request + reply in the shared address space).]
Figure 11-2 Time-lines for the processes that initiate communication, with no latency hiding. The black part of the time-line is local processing time (which in fact includes the time spent stalled to access local data), and the blank part is time spent stalled on communication. ca, cb, cc are the times taken to compute an A[i] and to perform the unrelated computations f(B[i]) and g(C[i]), respectively.
9/4/97
DRAFT: Parallel Computer Architecture
755
Latency Tolerance
11.2.1 Latency Tolerance and the Communication Pipeline

Approaches to latency tolerance are best understood by looking at the resources in the machine and how they are utilized. From the viewpoint of a processor, the communication architecture from one node to another can be viewed as a pipeline. The stages of the pipeline clearly include the network interfaces at the source and destination, as well as the network links and switches along the way (see Figure 11-3).

Figure 11-3 The network viewed as a pipeline for a communication sent from one processor to another. The stages of the pipeline include the network interface (NI) and the hops between successive switches (S) on the route between the two processors (P).

There may also be stages in the communication assist, the local
memory/cache system, and even the main processor, depending on how the architecture manages communication. It is important to recall that the end-point overhead incurred on the processor itself cannot be hidden from the processor, though all the other components potentially can. Systems with high end-point overhead per message therefore have a difficult time tolerating latency. Unless otherwise mentioned, this overview will ignore the processor overhead; the focus at the end-points will be the assist occupancy incurred in processing, since this has a chance of being hidden too. We will also assume that this message initiation and reception cost is incurred as a fixed cost once per message (i.e. it is not proportional to message size). Figure 11-4 shows the utilization problem in the baseline communication structure: Either the processor or the communication architecture is busy at a given time, and in the communication pipeline only one stage is busy at a time as the single word being transmitted makes its way from source to destination.1 The goal in latency tolerance is therefore to overlap the use of these resources as much as possible. From a processor’s perspective, there are three major types of overlap that can be exploited. The first is the communication pipeline between two nodes, which allows us to pipeline the transmission of multiple words through the network resources along the way, just like instruction pipelining overlaps the use of different resources (instruction fetch unit, register files, execute unit, etc.) in the processor. These words may be from the same message, if messages are larger than a single word, or from different messages. The second is the overlap in different portions of the network (including destination nodes) exploited by having communication messages outstanding with different nodes at once. In both cases, the communication of one word is overlapped with that of other words, so we call this overlapping communication with communication. The third major kind of overlap is that of computation with communication; i.e.
1. For simplicity, we ignore the fact that the width of the network link is often less than a word, so even a single word may occupy multiple stages of the network part of the pipeline.
[Figure 11-4: for each of process PA in message passing and process PB in a shared address space, the time in the processor's critical path is broken down across rows for the processor (P), the communication assist (CA), and the network hops (N: 1, 2, 3, 4), with end-point overheads o and network latency l separating iterations i, i+1, and i+2.]
Figure 11-4 Time-lines for the message passing and shared address space programs with no latency hiding. Time for a process is divided into that spent on the processor (P), including the local memory system, on the communication assist (CA), and in the network (N). The numbers under the network category indicate different hops or links in the network on the path of the message. The end-point overheads o in the message passing case are shown to be larger than in the shared address space, which we assume to be hardware-supported. l is the network latency, which is the time spent that does not involve the processing node itself.
a processor can continue to do useful local work while a communication operation it has initiated is in progress.

11.2.2 Approaches

There are four major approaches to actually exploiting the above types of overlap in hardware resources and thus tolerating latency. The first, which is called block data transfer, is to make individual messages larger so they communicate more than a word and can be pipelined through the network. The other three approaches are precommunication, proceeding past communication in the same thread, and multithreading. These are complementary to block data transfer, and their goal is to overlap the latency of a message, large or small, with computation or with other messages. Each of the four approaches can be used in both the shared address space and message passing abstractions, although the specific techniques and the support required are different in the two cases. Let us briefly familiarize ourselves with the four approaches.
Block data transfer Making messages larger has several advantages. First, it allows us to exploit the communication pipeline between the two nodes, thus overlapping communication with communication (of different words). The processor sees the latency of the first word in the message, but subsequent words arrive every network cycle or so, limited only by the rate of the pipeline. Second, it serves the important purpose of amortizing the per-message overhead at the end-points over the large amount of data being sent, rather than incurring these overheads once per word (while message overhead may have a component that is proportional to the length of the message, there is invariably a large fixed cost as well). Third, depending on how packets are structured, this may also amortize the per-packet routing and header information. And finally, a large message requires only a single acknowledgment rather than one per word; this can mean a large reduction in latency if the processor waits for acknowledgments before proceeding past communication events, as well as a large reduction in traffic, end-point overhead and possible contention at the other end. Many of these advantages are similar to those obtained by using long cache blocks to transfer data from memory to cache in a uniprocessor, but on a different scale. Figure 11-5 shows the effect on the time-line of the example computation bunching up all the communication in a single large message in the explicit message passing case, while still using synchronous messages and without overlapping computation. For simplicity of illustration, the network stages have been collapsed into one (pipelined) stage. Time in processor’s critical path
[Figure 11-5: time-line on the P, CA, and N rows showing the three iterations' computation (3*ca), followed by a single data communication and its acknowledgment, followed by the unrelated computation (3*cc). Key: local computation (ca, cb, cc), end-point overhead, network latency, bandwidth-related time.]
Figure 11-5 Effect of making messages larger on the time-line for the message passing program. The figure assumes that three iterations of the loop are executed. The sender process (PA) first computes the three values A[0..2] to be sent (the computation ca), then sends them in one message, and then performs all three pieces of unrelated computation f(B[0..2]), i.e. the computation cb. The overhead of the data communication is amortized, and only a single acknowledgment is needed. A bandwidth-related cost is added to the data communication, for the time taken to get the remaining words into the destination after the first word arrives, but this is tiny compared to the savings in latency and overhead. This is not required for the acknowledgment, which is small.
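To see the amortization argument in numbers, the short C sketch below evaluates a rough cost model for sending n words either as n single-word messages or as one n-word block. The parameter values (per-message overhead o, first-word latency l, and pipeline rate B) are illustrative assumptions rather than measurements of any machine discussed in this book.

#include <stdio.h>

/* Rough cost model, under assumed parameters, for communicating n words either
 * as n single-word synchronous messages or as one n-word block:
 *   o = fixed end-point overhead per message (cycles)
 *   l = network latency until the first word arrives (cycles)
 *   B = pipeline rate once the transfer is streaming (words/cycle)
 * The values below are illustrative assumptions only.
 */
int main(void) {
    const double o = 200.0, l = 50.0, B = 0.5;
    const int n = 64;

    /* n separate messages, each paying overhead and first-word latency, no overlap */
    double many_small = n * (o + l + 1.0 / B);
    /* one large message: one overhead, one first-word latency, then pipelined words */
    double one_block  = o + l + n / B;

    printf("n single-word messages: %8.0f cycles\n", many_small);
    printf("one %d-word block:      %8.0f cycles\n", n, one_block);
    return 0;
}

With these assumed values the single block costs a few hundred cycles while the n single-word messages cost over sixteen thousand, almost entirely because the per-message overhead is paid once rather than n times.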
While making messages larger helps keep the pipeline between two nodes busy, it does not in itself keep the processor or communication assist busy while a message is in progress or keep other portion of the network busy. The other three approaches can address these problems as well. They are complementary to block data transfer, in that they are applicable whether messages are large or small. Let us examine them one by one. Precommunication This means generating the communication operation before the point where it naturally appears in the program, so that the operations are completed before data are actually needed1. This can be done either in software, by inserting a precommunication operation earlier in the code, or by having hardware detect the possibility of precommunication and generate it without an explicit software instruction. The operations that actually use the data typically remain where they are. Of course, the precommunication transaction itself should not cause the processor to stall until it completes. Many forms of precommunication require that long-latency events be predictable, so that hardware or software can anticipate them and issue them early. Consider the interactions with sender- and receiver-initiated communication. Sender-initiated communication is naturally initiated soon after the sender produces the data. This has two implications. On one hand, the data may reach the receiver before actually needed, resulting in a form of precommunication for free for the receiver. On the other hand, it may be difficult for the sender to pull up the communication any earlier, making it difficult for the sender itself to hide the latency it sees through precommunication. Actual precommunication, by generating the communication operation earlier, is therefore more common in receiver-initiated communication. Here, communication is otherwise naturally initiated when the data are needed, which may be long after they have been produced. Proceeding past communication in the same process/thread The idea here is to let the communication operations be generated where they naturally occur in the program, but allow the processor to proceed past them and find other independent computation or communication that would come later in the same process to overlap with them. Thus, while precommunication causes the communication to be overlapped with other instructions from that thread that are before the point where the communication-generating operation appears in the original program, this technique causes it to be overlapped with instructions from later in the process. As we might imagine, this latency tolerance method is usually easier to use effectively with sender-initiated communication, since the work following the communication often does not depend on the communication. In receiver-initiated communication, a receiver naturally tries to access data only just before they are actually needed, so there is not much independent work to be found in between the communication and the use. It is, of course, possible to delay the actual use of the data by trying to push it further down in the instruction stream relative to other independent instructions, and compilers and processor hardware can exploit some overlap in this way. In either case, either hardware or software must check that the data transfer has completed before executing an instruction that depends on it.
1. Recall that several of the techniques, including this one, are applicable to hiding local memory access latency as well, even though we are speaking in terms of communication. We can think of that as communication with the memory system.
Multithreading The idea here is similar to the previous case, except that the independent work is found not in the same process or thread but by switching to another thread that is mapped to run on the same processor (as discussed in Chapter 2, process and thread are used interchangeably in this book to mean a locus of execution with its own program counter and stack). This makes the latency of receiver-initiated communication easier to hide than the previous case and so this method lends itself easily to hiding latency from either a sender or a receiver. Multithreading implies that a given processor must have multiple threads that are concurrently executable or “ready” to have their next instruction executed, so it can switch from one thread to another when a long latency event is encountered. The multiple threads may be from the same parallel program, or from completely different programs as in our earlier multiprogramming example. A multithreaded program is usually no different from an ordinary parallel program; it is simply decomposed and assigned among P processors, where P is usually a multiple of the actual number of physical processors p, and P/p threads are mapped to the same physical processor. 11.2.3 Fundamental Requirements, Benefits and Limitations Most of the chapter will focus on specific techniques and how they are exploited in the major programming models. We will pick this topic up again in Section 11.3. But first, it is useful to abstract away from specific techniques and understand the fundamental requirements, benefits and limitations of latency tolerance, regardless of the technique used. The basic analysis of benefits and limitations allows us to understand the extent of the performance improvements we can expect, and can be done based only on the goal of overlapping the use of resources and the occupancies of these resources. Requirements Tolerating latency has some fundamental requirements, regardless of approach or technique. These include extra parallelism, increased bandwidth, and in many cases more sophisticated hardware and protocols. Extra parallelism, or slackness. Since the overlapped activities (computation or communication) must be independent of one another, the parallelism in the application must be greater than the number of processors used. The additional parallelism may be explicit in the form of additional threads, as in multithreading, but even in the other cases it must exist within a thread. Even two words being communicated in parallel from a process implies a lack of total serialization between them and hence extra parallelism. Increased bandwidth. While tolerating latency hopefully reduces the execution time during which the communication is performed, it does not reduce the amount of communication performed. The same communication performed in less time means a higher rate of communication per unit time, and hence a larger bandwidth requirement imposed on the communication architecture. In fact, if the bandwidth requirements increase much beyond the bandwidth afforded by the machine, the resulting resource contention may slow down other unrelated transactions and may even cause latency tolerance to hurt performance rather than help it.
More sophisticated hardware and protocols. Especially if we want to exploit other forms of overlap than making messages larger, we have the following additional requirements of the hardware or the protocols used:
• the processor must be allowed to proceed past a long-latency operation before the operation completes; i.e. it should not stall upon a long-latency operation.

• a processor must be allowed to have multiple outstanding long-latency operations (e.g. send or read misses), if they are to be overlapped with one another.

These requirements imply that all latency tolerance techniques have significant costs. We should therefore try to use algorithmic techniques to reduce the frequency of high-latency events before relying on techniques to tolerate latency. The fewer the long-latency events, the less aggressively we need to hide latency.

Potential Benefits

A simple analysis can give us good upper bounds on the performance benefits we might expect from latency tolerance, thus establishing realistic expectations. Let us focus on tolerating the latency of communication, and assume that the latency of local references is not hidden. Suppose a process has the following profile when running without any latency tolerance: It spends Tc cycles computing locally, Tov cycles processing message overhead on the processor, Tocc (occupancy) cycles on the communication assist, and Tl cycles waiting for message transmission in the network. If we can assume that other resources can be perfectly overlapped with the activity on the main processor, then the potential speedup can be determined by a simple application of Amdahl's Law. The processor must be occupied for Tc+Tov cycles; the maximum latency that we can hide from the processor is Tl+Tocc (assuming Tc+Tov > Tl+Tocc), so the maximum speedup due to latency hiding is (Tc+Tov+Tocc+Tl) / (Tc+Tov), or 1 + (Tocc+Tl)/(Tc+Tov). This limit is an upper bound,
since it assumes perfect overlap of resources and that there is no extra cost imposed by latency tolerance. However, it gives us a useful perspective. For example, if the process originally spends at least as much time computing locally as stalled on the communication system, then the maximum speedup that can be obtained from tolerating communication latency is a factor of 2 even if there is no overhead incurred on the main processor. And if the original communication stall time is overlapped only with processor activity, not with other communication, then the maximum speedup is also a factor of 2 regardless of how much latency there was to hide. This is illustrated in Figure 11-6. How much latency can actually be hidden depends on many factors involving the application and the architecture. Relevant application characteristics include the structure of the communication and how much other work is available to be overlapped with it. Architectural issues are the kind and costs of overlap allowed: how much of the end-point processing is performed on the main processor versus on the assist; can communication be overlapped with computation, other communication, or both; how many messages involving a given processor may be outstanding (in progress) at a time; to what extent can overhead processing can be overlapped with data transmission in the network for the same message; and what are the occupancies of the assist and the stages of the network pipeline.
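The bound can be evaluated directly. The following C sketch plugs an assumed execution profile into the expression above; the cycle counts are invented for illustration and are not taken from any measurement in the text.

#include <stdio.h>

/* Upper bound on speedup from latency tolerance, following the Amdahl's Law
 * argument above. Assumes Tc + Tov >= Tocc + Tl, so processor time is the
 * limiting term. The cycle counts are illustrative assumptions.
 */
int main(void) {
    const double Tc   = 600.0;  /* local computation on the processor         */
    const double Tov  = 100.0;  /* message overhead incurred on the processor */
    const double Tocc = 150.0;  /* occupancy on the communication assist      */
    const double Tl   = 450.0;  /* waiting on network transmission            */

    double bound = (Tc + Tov + Tocc + Tl) / (Tc + Tov);
    printf("maximum speedup from latency hiding = %.2f\n", bound);
    return 0;
}

Here the local work (Tc + Tov = 700 cycles) slightly exceeds the communication cost (Tocc + Tl = 600 cycles), so even perfect overlap cannot do better than a factor of about 1.86.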
[Figure 11-6: four scenarios comparing time without and with latency hiding: (a) Comp = Comm; (b) Comp > Comm; (c) Comp < Comm, overlap only with computation; (d) Comp < Comm, overlap with communication as well.]
Figure 11-6 Bounds on the benefits from latency tolerance. Each figure shows a different scenario. The gray portions represent computation time and the white portions communication time. The bar on the left is the time breakdown without latency hiding, while the bars on the right show the situation with latency hiding. When computation time (Comp.) equals communication time (Comm.), the upper bound on speedup is 2. When computation exceeds communication, the upper bound is less than 2, since we are limited by computation time (b). The same is true of the case where communication exceeds computation, but can be overlapped only with computation (c). The way to obtain a better speedup than a factor of two is to have communication time originally exceed computation time, but let communication be overlapped with communication as well (d).
Figure 11-7 illustrates the effects on the time-line for a few different kinds of message structuring and overlap. The figure is merely illustrative. For example, it assumes that overhead per message is quite high relative to transit latency, and does not consider contention anywhere. Under these assumptions, larger messages are often more attractive than many small overlapped messages for two reasons: first, they amortize overhead; and second, with small messages the pipeline rate might be limited by the end-point processing of each message rather than the network link speed. Other assumptions may lead to different results. Exercise b looks more quantitatively at an example of overlapping communication only with other communication. Some components of communication latency are clearly easier to hide than others. For example, overhead incurred on the processor cannot be hidden from that processor, and latency incurred off-node, either in the network or at the other end, is generally easier to hide by overlapping with other messages than occupancy incurred on the assist or elsewhere within the node. Let us examine some of the key limitations that may prevent us from achieving the upper bounds on latency tolerance discussed above.

Limitations

The major limitations that keep the speedup from latency tolerance below the above optimistic limits can be divided into three classes: application limitations, limitations of the communication architecture, and processor limitations.

Application limitations: The amount of independent computation time that is available to overlap with the latency may be limited, so all the latency may not be hidden and the processor will have to stall for some of the time. Even if the program has enough work and extra parallelism, the structure of the program may make it difficult for the system or programmer to identify the concurrent operations and orchestrate the overlap, as we shall see when we discuss specific latency tolerance mechanisms.
[Figure 11-7: time-lines on the P, CA, and N rows for six cases: (a) baseline communication, no latency tolerance; (b) single large message, no overlap with computation; (c) single large message, overlap with computation; (d) word-size overlapped messages, no overlap with computation; (e) word-size overlapped messages, overlap with computation; (f) word-size overlapped messages, overlap with computation, but only one outstanding message (no overlap with communication).]
Figure 11-7 Time-lines for different forms of latency tolerance. 'P' indicates time spent on the main processor, 'CA' on the local communication assist, and 'N' nonlocal (in the network or on other nodes). Stipple patterns correspond to components of time as per the legend in Figure 11-4.
Communication architecture limitations: Limitations of the communication architecture may be of two types: first, there may be restrictions on the number of messages or the number of words that can be outstanding from a node at a time, and second, the performance parameters of the communication architecture may limit the latency that can be hidden.

Number of outstanding messages or words. With only one message outstanding, computation can be overlapped with both assist processing and network transmission. However, assist processing may or may not be overlapped with network transmission for that message, and the network pipeline itself can be kept busy only if the message is large. With multiple outstanding messages, all three of processor time, assist occupancy and network transmission can be overlapped even without any incoming messages. The network pipeline itself can be kept well utilized even when messages are very small, since several may be in progress simultaneously. Thus, for messages of a given size, more latency can be tolerated if each node allows multiple messages to be outstanding at a time. If L is the cost per message to be hidden, and r cycles of independent computation are available to be overlapped with each message, in the best case we need L/r messages to be outstanding for maximal latency hiding. More outstanding messages do not help, since whatever latency could be hidden already has been. Similarly, the number of words outstanding is important. If k words can be outstanding from a processor at a time, from one or multiple messages, then in the best case the network transit time as seen by the processor can be reduced by almost a factor of k. More precisely, for one-way messages to the same destination it is reduced from k*l cycles to l + k/B, where l is the network transit time for a bit and B is the bandwidth or rate of the communication pipeline (network and perhaps assist) in words per cycle.

Assuming that enough messages and words are allowed to be outstanding, the performance parameters of the communication architecture can become limitations as well. Let us examine some of these, assuming that the per-message latency of communication we want to hide is L cycles. For simplicity, we shall consider one-way data messages without acknowledgments.

Overhead. The message-processing overhead incurred on the processor cannot be hidden.

Assist occupancy: The occupancy of the assist can be hidden by overlapping with computation. Whether assist occupancy can be overlapped with other communication (in the same message or in different messages) depends on whether the assist is internally pipelined and whether assist processing can be overlapped with network transmission of the message data. An end-point message processing time of o cycles per message (overhead or occupancy) establishes an upper bound on the frequency with which we can issue messages to the network (at best every o cycles). If we assume that on average a processor receives as many messages as it sends, and that the overhead of sending and receiving is the same, then the time spent in end-point processing per message sent is 2o, so the largest number of messages that we can have outstanding from a processor in a period of L cycles is L/(2o). If 2o is greater than the r cycles of computation we can overlap with each message, then we cannot have the L/r messages outstanding that we would need to hide all the latency.

Point-to-point Bandwidth: Even if overhead is not an issue, the rate of initiating words into the network may be limited by the slowest link in the entire network pipeline from source to destination; i.e. the stages of the network itself, the node-to-network interface, or even the assist occupancy or processor overhead incurred per word. Just as an end-point overhead of o cycles between messages (or words) limits the number of outstanding messages (or words) in an L cycle period to L/o, a pipeline stage time of s cycles in the network limits the number of outstanding words to L/s.
L words to --s- .
Network Capacity: Finally, the number of messages a processor can have outstanding is also limited by the total bandwidth or data carrying capacity of the network, since different processors compete for this finite capacity. If each link is a word wide, a message of M words traveling h hops in the network may occupy up to M*h links in the network at a given time. If each processor has k such messages outstanding, then each processor may require up to M*h*k links at a given time. However, if there are D links in the network in all and all processors are transmitting in this way, then each processor on average can only occupy D/p links at a given time. Thus, the number D p
D p×M×h
of outstanding messages per processor k is limited by the relation M × h × k < ---- , or k < ------------------------ . Processor limitations: Spatial locality through long cache lines already allows more than one word of communication to be outstanding at a time. To hide latency beyond this in a shared address space with communication only through cache misses, a processor and its cache subsystem must allow multiple cache misses to be outstanding simultaneously. This is costly, as we shall see, so processor-cache systems have relatively small limits on this number of outstanding misses (and these include local misses that don’t cause communication). Explicit communication has more flexibility in this regard, though the system may limit the number of messages that can be outstanding at a time. It is clear from the above discussion that the communication architecture efficient (high bandwidth, low end point overhead or occupancy) is very important for tolerating latency effectively. We have seen the properties of some real machines in terms of network bandwidth, node-to-network bandwidth, and end-point overhead, as well as where the end-point overhead is incurred, in the last few chapters. From the data for overhead, communication latency, and gap we can compute two useful numbers in understanding the potential for latency hiding. The first is the ratio L/ o for the limit imposed by communication performance on the number of messages that can be outstanding at a time in a period of L cycles, assuming that other processors are not sending messages to this one at the same time. The second is the inverse of the gap, which is the rate at which messages can be pipelined into the network. If L/o is not large enough to overlap the outstanding messages with computation, then the only way to hide latency beyond this is to make messages larger. The values of L/o for remote reads in a shared address space and for explicit message
passing using message passing libraries can be computed from Figures 7-31 and 7-32, and from the data in Chapter 8 as shown in Table 11-1. Table 11-1 Number of messages outstanding at a time as limited by L/o ratio, and rate at which messages can be issued to the network. L here is the total cost of a message, including overhead. For a remote read, o is the overhead on the initiating processor, which cannot be hidden. For message passing, we assume symmetry in messages and therefore count o to be the sum of the send and receive overhead. (*) For the machines that have hardware support for a shared address space, there is very little time incurred on the main processor on a remote read, so most of the latency can be hidden if the processor and controller allow enough outstanding references. The gap here is determined by the occupancy of the communication assist.
Machine          Network Transaction             Remote Read
                 L/2o        Msgs/msec           L/o        Msgs/msec
TMC CM-5         2.12        250                 1.75       161
Intel Paragon    2.72        131                 5.63       133
Meiko CS-2       3.28        74                  8.35       63.3
NOW-Ultra        1.79        172                 1.77       127
Cray T3D         1.00        345                 1.13       2500*
SGI Origin                                       *          5000*
For the message passing systems, it is clear that the number of messages that can be outstanding to overlap the cost of another message is quite small, due to the high per-message overhead. In the message-passing machines (e.g. nCube/2, CM-5, Paragon), hiding latency with small messages is clearly limited by the end-point processing overheads. Making messages larger is therefore likely to be more successful at hiding latency than trying to overlap multiple small messages. The hardware supported shared address space machines are limited much less by performance parameters of the communication architecture in their ability to hide the latency of small messages. Here, the major limitations tend to be the number of outstanding requests supported by the processor or cache system, and the fact that a given memory operation may generate several protocol transactions (messages), stressing assist occupancy. Having understood the basics, we are now ready to look at individual techniques in the context of the two major communication abstractions. Since the treatment of latency tolerance for explicit message passing communication is simpler, the next section discusses all four general techniques in its context.
11.3 Latency Tolerance in Explicit Message Passing

To understand which latency tolerance techniques are most useful and how they might be employed, it is useful to first consider the structure of communication in an abstraction; in particular, how sender- and receiver-initiated communication are orchestrated, and whether messages are of fixed (small or large) or variable size.

11.3.1 Structure of Communication

The actual transfer of data in explicit message-passing is typically sender-initiated, using a send operation. Recall that a receive operation does not in itself cause data to be communicated, but rather copies data from an incoming buffer into the application address space. Receiver-initiated communication is performed by issuing a request message (itself a send) to the process that is
the source of the data. That process then sends the data back via another send.1 The baseline example in Figure 11-1 uses synchronous sends and receives, with sender-initiated communication. A synchronous send operation has a communication latency equal to the time it takes to communicate all the data in the message to the destination, plus the time for receive processing, plus the time for an acknowledgment to be returned. The latency of a synchronous receive operation is its processing overhead, including copying the data into the application, plus the additional latency if the data have not yet arrived. We would like to hide these latencies, including overheads if possible, at both ends. Let us continue to assume that the overheads are incurred on a communication assist and not on the main processor, and see how the four classes of latency tolerance techniques might be applied in classical sender-initiated send-receive message passing. 11.3.2 Block Data Transfer With the high end-point overheads of message passing, making messages large is important both for amortizing overhead and to enable a communication pipeline whose rate is not limited by the end-point overhead of message-processing. These two benefits can be obtained even with synchronous messages. However, if we also want to overlap the communication with computation or other messages, then we must use one or more of the techniques below (instead of or in addition to making messages larger). We have already seen how block transfer is implemented in message-passing machines in Chapter 7. The other three classes of techniques, described below, try to overlap messages with computation or other messages, and must therefore use asynchronous messages so the processor does not stall waiting for each message to complete before proceeding. While they can be used with either small or large (block transfer) messages, and are in this sense complementary to block transfer, we shall illustrate them with the small messages that are in our baseline communication structure. 11.3.3 Precommunication In our example of Figure 11-1, we can actually pull the sends up earlier than they actually occur, as shown in Figure 11-8. We would also like to pull the receives on PB up early enough before the data are needed by PB, so that we can overlap the receive overhead incurred on the assist with other work on the main processor. Figure 11-8 shows a possible precommunicating version of the message passing program. The loop on PA is split in two. All the sends are pulled up and combined in one loop, and the computations of the function f(B[]) are postponed to a separate loop. The sends are asynchronous, so the processor does not stall at them. Here, as in all cases, the use of asynchronous communication operations means that probes must be used as appropriate to ensure the data have been transferred or received. Is this a good idea? The advantage is that messages are sent to the receiver as early as possible; the disadvantages are that no useful work is done on the sender between messages, and the pres-
1. In other abstractions built on message passing machines and discussed in Chapter 7, like remote procedure call or active messages, the receiver sends a request message, and a handler that processes the request at the data source sends the data back without involving the application process there. However, we shall focus on classical send-receive message passing, which dominates at the programming model layer.
sure on the buffers at the receiver is higher since messages may arrive much before they are needed. Whether the net result is beneficial depends on the overhead incurred per message, the number of messages that a process is allowed to have outstanding at a time, and how much later than the sends the receives occur anyway. For example, if only one or two messages may be outstanding, then we are better off interspersing computation (of f(B[])) between asynchronous sends, thus building a pipeline of communication and computation in software instead of waiting for those two messages to complete before doing anything else. The same is true if the overhead incurred on the assist is high, so we can keep the processor busy while the assist is incurring this high per-message overhead. The ideal case would be if we could build a balanced software pipeline in which we pull sends up but do just the right amount of computation between message sends.

Process PA:
    for i←0 to n-1 do
        compute A[i];
        write A[i];
        a_send (A[i] to proc. PB);
    end for
    for i←0 to n-1 do
        compute f(B[i]);
    end for

Process PB:
    a_receive (myA[0] from PA);
    for i←0 to n-2 do
        a_receive (myA[i+1] from PA);
        while (!recv_probe(myA[i])) {};
        use myA[i];
        compute g(C[i]);
    end for
    while (!recv_probe(myA[n-1])) {};
    use myA[n-1];
    compute g(C[n-1]);
Figure 11-8 Hiding latency through precommunication in the message-passing example. In the pseudocode, a_send and a_receive are asynchronous send and receive operations, respectively.
Now consider the receiver PB. To hide the latency of the receives through precommunication, we try to pull them up earlier in the code and we use asynchronous receives. The a_receive call simply posts the specification of the receive in a place where the assist can see it, and allows the processor to proceed. When the data come in, the assist is notified and it moves them to the application data structures. The application must check that the data have arrived (using the recv_probe() call) before it can use them reliably. If the data have already arrived when the a_receive is posted (pre-issued), then what we can hope to hide by pre-issuing is the overhead of the receive processing; otherwise, we might hide both transit time and receive overhead. Receive overhead is usually larger than send overhead, as we saw in Chapter 7, so if many asynchronous receives are to be issued (or pre-issued) it is usually beneficial to intersperse computation between them rather than issue them back to back. One way to do this is to build a software pipeline of communication and computation, as shown in Figure 11-8. Each iteration of the loop issues an a_receive for the data needed for the next iteration, and then does the processing for the current iteration (using a recv_probe to ensure that the message for the current iteration has arrived before using it). The hope is that by the time the next iteration is reached, the message for
which the a_receive was posted in the current iteration will have arrived and incurred its overhead on the assist, and the data will be ready to use. If the receive overhead is high relative to the computation per iteration, the a_receive can be issued several iterations ahead instead of just one. The software pipeline has three parts. What was described above is the steady-state loop. In addition to this loop, some work must be done to start the pipeline and some to wind it down. In this example, the prologue must post a receive for the data needed in the first iteration of the steady-state loop, and the epilogue must process the code for the last iteration of the original loop (this last iteration is left out of the steady-state loop since we do not want to post an asynchronous receive for a nonexistent next iteration). A similar software pipeline strategy could be used if communication were truly receiver-initiated, for example if there were no send but the a_receive sent a request to the node running PA and a handler there reacted to the request by supplying the data. In that case, the precommunication strategy is known as prefetching the data, since the receive (or get) truly causes the data to be fetched across the network. We shall see prefetching in more detail in the context of a shared address space in Section 11.7, since there the communication is very often truly receiver-initiated.

11.3.4 Proceeding Past Communication and Multithreading

Now suppose we do not want to pull communication up in the code but leave it where it is. One way to hide latency in this case is to simply make the communication messages asynchronous, and proceed past them to either computation or other asynchronous communication messages. We can continue doing this until we come to a point where we depend on a communication message completing or run into a limit on the number of messages we may have outstanding at a time. Of course, as with prefetching, the use of asynchronous receives means that we must add probes or synchronization to ensure that data have reached their destinations before we try to use them (or on the sender side that the application data we were sending has been copied or sent out before we reuse the corresponding storage). To exploit multithreading, as soon as a process issues a send or receive operation, it suspends itself and another ready-to-run process or thread from the application is scheduled to run on the processor. If this thread issues a send or receive, then it too is suspended and another thread switched in. The hope is that by the time we run out of threads and the first thread is rescheduled, the communication operation that thread issued will have completed. The switching and management of threads can be handled by the message passing library using calls to the operating system to change the program counter and perform other protected thread-management functions. For example, the implementation of the send primitive might automatically cause a thread switch after initiating the send (of course, if there is no other ready thread on that processing node, then the same thread will be switched back in). Switching a thread requires that we save the processor state needed to restart it, which would otherwise be wiped out by the new thread. This includes the processor registers, the program counter, the stack pointer, and various processor status words. The state must be restored when the thread is switched back in.
Saving and restoring state in software is expensive, and can undermine the benefits obtained from multithreading. Some message-passing architectures therefore provide hardware support for multithreading, for example by providing multiple sets of registers and program counters in hardware. Note that architectures that provide asynchronous message handlers that are separate from the application process but run on the main processor are essentially multithreaded between the application process and the handler.
This form of multithreading between applications and message handlers can be supported in software, though some research architectures also provide hardware support to make it more efficient (e.g. the message-driven processor of the J-Machine [DFK+92,NWD93]). We shall discuss these issues of hardware support in more detail when we discuss multithreading in a shared address space in Section 11.8.
11.4 Latency Tolerance in a Shared Address Space

The rest of this chapter focuses on latency tolerance in the context of a hardware-supported shared address space. This discussion will be a lot more detailed for several reasons. First, the existing hardware support for communication brings the techniques and requirements much closer to the architecture and the hardware-software interface. Second, the implicit nature of long-latency events (such as communication) makes it almost necessary that much of the latency tolerance be orchestrated by the system rather than the user program. Third, the granularity of communication and the efficiency of the underlying fine-grained communication mechanisms require that the techniques be hardware-supported if they are to be effective. And finally, since much of the latency being hidden is that of reads (read latency), writes (write latency) and perhaps instructions—not explicit communication operations like sends and receives—most of the techniques are applicable to the wide market of uniprocessors as well. In fact, since we don't know ahead of time which read and write operations will generate communication, latency hiding is treated in much the same way for local accesses and communication in shared address space multiprocessors. The difference is in the magnitude of the latencies to be hidden, and potentially in the interactions with a cache coherence protocol.

As with message passing, let us begin by briefly examining the structure of communication in this abstraction before we look at the actual techniques. Much of our discussion will apply to cache-coherent systems as well as those that do not cache shared data, but some of it, including the experimental results presented, will be specific to cache-coherent systems. For the most part, we assume that the shared address space (and cache coherence) is supported in hardware, and the default communication and coherence are at the small granularity of individual words or cache blocks. The experimental results presented in the following sections will be taken from the literature. These results use several of the same applications that we have used earlier in this book, but they typically use different versions with different communication to computation ratios and other behavioral characteristics, and they do not always follow the methodology outlined in Chapter 4. The system parameters used also vary, so the results are presented purely for illustrative purposes, and cannot be compared even among different latency hiding techniques in this chapter.

11.4.1 Structure of Communication

The baseline communication is through reads and writes in a shared address space, and is called read-write communication for convenience. Receiver-initiated communication is performed with read operations that result in data from another processor's memory or cache being accessed. This is a natural extension of data access in the uniprocessor programming model: obtain words of memory when you need to use them.
If there is no caching of shared data, sender-initiated communication may be performed through writes to data that are allocated in remote memories.1 With cache coherence, the effect of writes is more complex: Whether writes lead to sender- or receiver-initiated communication depends on the cache coherence protocol. For example, suppose processor PA writes a word that is allocated in PB's memory. In the most common case of an invalidation-based cache coherence protocol with writeback caches, the write will only generate a read-exclusive or upgrade request and perhaps some invalidations, and it may bring data to itself if it does not already have the valid block in its cache, but it will not actually cause the newly written data to be transferred to PB. While requests and invalidations constitute communication too, the actual communication of the new value from PA to PB will be generated by the later read of the data by PB. In this sense, it is receiver-initiated. Or the data transfer may be generated by an asynchronous replacement of the data from PA's cache, which will cause it to go back to its "home" in PB's memory. In an update protocol, on the other hand, the write itself will communicate the data from PA to PB if PB had a cached copy.

Whether receiver-initiated or sender-initiated, the communication in a hardware-supported read-write shared address space is naturally fine-grained, which makes latency tolerance particularly important. The different approaches to latency tolerance are better suited to different types and magnitudes of latency, and have achieved different levels of acceptance in commercial products. Other than block data transfer, the three other approaches, which try to overlap load and store accesses with computation or other accesses, require support in the microprocessor architecture itself. Let us examine these approaches in some detail in the next four sections.
11.5 Block Data Transfer in a Shared Address Space

In a shared address space, the coalescing of data and the initiation of block transfers can be done either explicitly in the user program or transparently by the system, in hardware or software. For example, the prevalent use of long cache blocks on modern systems can itself be seen as a means of transparent small block transfers in cache-block sized messages, used effectively when a program has good spatial locality on communicated data. Relaxed memory consistency models further allow us to buffer words or cache blocks and send them in coalesced messages only at synchronization points, a fact utilized particularly by software shared address space systems as seen in Chapter 9. However, let us focus here on explicit initiation of block transfers.

11.5.1 Techniques and Mechanisms

Explicit block transfers are initiated by issuing a command similar to a send in the user program, as shown for the simple example in Figure 11-9. The send command is interpreted by the communication assist, which transfers the data in a pipelined manner from the source node to the destination. At the destination, the communication assist pulls the data words in from the network interface and stores them in the specified locations. The path is shown in Figure 11-10.
1. An interesting case is the one in which a processor writes a word that is allocated in a different processor’s memory, and a third processor reads the word. In this case, we have two communication events, one “sender-initiated” and one “receiver-initiated”.
Process PA:
    for i←0 to n-1 do
        A[i] ← ...;
    end for
    send (A[0..n-1] to tmp[0..n-1]);
    flag ← 1;
    for i←0 to n-1 do
        compute f(B[i]);
    end for

Process PB:
    while (!flag) {};   /*spin-wait*/
    for i←1 to n-1 do
        use tmp[i];
        compute g(C[i]);
    end for
Figure 11-9 Using block transfer in a shared address space for the example of Figure 11-1. The array A is allocated in processor PA’s local memory, and tmp in processor PB’s local memory.
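A C-like rendering of the pattern in Figure 11-9 might look as follows; block_send() and the placement of A, tmp, and flag are illustrative assumptions, not the primitives of any particular machine:

    #define N 1024

    double A[N];               /* allocated in PA's local memory                */
    double tmp[N];             /* allocated in PB's local memory                */
    volatile int flag = 0;     /* allocated in PB's local memory                */

    /* Hypothetical one-sided primitive: the source assist pipelines the blocks
     * of src directly into dst on the destination node; no receive is posted. */
    void block_send(const void *src, void *dst, int nbytes);

    void producer_PA(void) {
        for (int i = 0; i < N; i++)
            A[i] = (double)i;                    /* produce the boundary data    */
        block_send(A, tmp, N * sizeof(double));  /* one large transfer           */
        flag = 1;     /* synchronization: must not become visible before the     */
                      /* transfer completes; could be piggybacked on the message */
    }

    void consumer_PB(void) {
        double sum = 0.0;
        while (flag == 0)
            ;                                    /* spin until the data arrive   */
        for (int i = 0; i < N; i++)
            sum += tmp[i];                       /* use tmp[i]                   */
    }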
There are two major differences from send-receive message passing, both of which arise from the fact that the sending process can directly specify the program data structures (virtual addresses) where the data are to be placed at the destination, since these locations are in the shared address space. On one hand, this removes the need for receive operations in the programming model, since the incoming message specifies where the data should be put in the program address space. It may also remove the need for system buffering in main memory at the destination, if the destination assist is available to put the data directly from the network interface into the program data structures in memory. On the other hand, another form of synchronization (like spinning on a flag or blocking) must be used to ensure that data arrive before they are used by the destination process, but not before they are welcome. It is also possible to use receiver-initiated block transfer, in which case the request for the transfer is issued by the receiver and transmitted to the sender's communication assist.

The communication assist on a node that performs the block transfer could be the same one that processes coherence protocol transactions, or a separate DMA engine dedicated to block transfer. The assist or DMA engine can be designed with varying degrees of aggressiveness in its functionality; for example, it may allow or disallow transfers of data that are not contiguous in memory at the source or destination ends, such as transfers with uniform stride or more general scatter-gather operations discussed in Chapter 2. The block transfer may leverage off the support already provided for efficient transfer of cache blocks, or it may be a completely separate mechanism. Particularly since block transfers may need to interact with the coherence protocol in a coherent shared address space, we shall assume that the block transfer uses pipelined transfers of cache blocks; i.e. the block transfer engine is a part of the source node that reads cache blocks out of memory and transfers them into the network. Figure 11-10 shows the steps in a possible implementation of block transfer in a cache-coherent shared address space.

11.5.2 Policy Issues and Tradeoffs

Two policy issues are of particular interest in block transfer: those that arise from interactions with the underlying shared address space and cache coherence protocol, and the question of where the block transferred data should be placed in the destination node.
[Figure 11-10 sketch: a source node and a destination node, each with a processor, memory, communication assist, and network interface. The steps are: (1) source processor notifies its assist; (2) source assist reads blocks from memory; (3) source assist sends blocks through the network interface; (4) destination assist stores blocks into memory; (5) destination processor checks status.]
Figure 11-10 Path of a block transfer from a source node to a destination node in a shared address space machine. The transfer is done in terms of cache blocks, since that is the granularity of communication for which the machine is optimized.
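A minimal sketch of what the source assist does for steps 2 and 3, one cache block at a time (the block size and helper functions are assumptions for illustration; a real assist would overlap the read of the next block with the send of the previous one rather than alternating them):

    #define BLOCK 64   /* assumed cache block size in bytes */

    /* Hypothetical assist operations, one block each. */
    void assist_read_block(unsigned long local_addr, void *blk);              /* step 2 */
    void assist_send_block(int dest_node, unsigned long dest_addr,
                           const void *blk);                                  /* step 3 */

    void source_assist_transfer(int dest_node, unsigned long src_addr,
                                unsigned long dest_addr, int nbytes) {
        unsigned char blk[BLOCK];
        for (int off = 0; off < nbytes; off += BLOCK) {
            assist_read_block(src_addr + off, blk);             /* read block from memory   */
            assist_send_block(dest_node, dest_addr + off, blk); /* push block into network  */
        }
        /* The destination assist stores each arriving block into memory (step 4),
         * and the destination processor later checks a status flag (step 5). */
    }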
Interactions with the Cache Coherent Shared Address Space

The first interesting interaction is with the shared address space itself, regardless of whether it includes automatic coherent caching or not. The shared data that a processor "sends" in a block transfer may not be allocated in its local memory, but in another processing node's memory or even scattered among other nodes' memories. The main policy choices here are to (a) disallow such transfers, (b) have the initiating node retrieve the data from the other memories and forward them one cache block at a time in a pipelined manner, or (c) have the initiating node send messages to the owning nodes' assists asking them to initiate and handle the relevant transfers.

The second interaction is specific to a coherent shared address space, since now the same data structures may be communicated using two different protocols: block transfer and cache coherence. Regardless of which main memory the data are allocated in, they may be cached in (i) the sender's cache in dirty state, (ii) the receiver's cache in shared state, or (iii) another processor's cache (including the receiver's) in dirty state. The first two cases create what is called the local coherence problem, i.e. ensuring that the data sent are the latest values for those variables on the sending node, and that copies of those data on the receiving node are left in coherent state after the transfer. The third case is called the global coherence problem, i.e. ensuring that the values transferred are the latest values for those variables anywhere in the system (according to the consistency model), and that the data involved in the transfer are left in coherent state in the whole system. Once again, the choices are to provide no guarantees in any of these three cases, to provide only local but not global coherence, or to provide full global coherence. Each successive step makes the programmer's job easier, but imposes more requirements on the communication assist. This is because the assist is now responsible for checking the state of every block being transferred, retrieving data from the appropriate caches, and invalidating data in the appropriate caches. The data being transferred may not be properly aligned with cache blocks, which makes interactions with the block-level coherence protocol even more complicated, as does the fact that the directory information for the blocks being transferred may not be present at the sending node. More details on implementing block transfer in cache coherent machines can be found in the literature [KuA93,HGD+94,HBG+97].
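As a sketch of what these choices ask of the assist, here is an illustrative per-block check; the state names and helper functions are invented and do not correspond to any specific machine:

    typedef enum { INVALID, SHARED, DIRTY } block_state;

    /* Hypothetical hooks into the node's caches and directory. */
    block_state local_cache_state(unsigned long block_addr);
    void        writeback_from_local_cache(unsigned long block_addr);   /* flush dirty copy */
    int         dirty_elsewhere(unsigned long block_addr);              /* needs directory  */
    void        fetch_latest_copy(unsigned long block_addr);            /* remote retrieval */
    void        transfer_block(unsigned long block_addr);               /* into the pipeline */

    void send_block_coherently(unsigned long block_addr, int want_global) {
        /* Local coherence: make sure memory holds the latest local value. */
        if (local_cache_state(block_addr) == DIRTY)
            writeback_from_local_cache(block_addr);

        /* Global coherence (optional and much more expensive): the latest value
         * may live in another node's cache, e.g. because of false sharing. */
        if (want_global && dirty_elsewhere(block_addr))
            fetch_latest_copy(block_addr);

        transfer_block(block_addr);
    }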
Programs that use explicit message passing are simpler in this regard: they require local coherence but not global coherence, since any data that they can access—and hence transfer—are allocated in their private address space and cannot be cached by any other processors. For block transfer to maintain even local coherence, the assist has some work to do for every cache block and becomes an integral part of the transfer pipeline. The occupancy of the assist can therefore easily become the bottleneck in transfer bandwidth, affecting even those blocks for which no interaction with caches is required. Global coherence may require the assist to send network transactions around before sending each block, reducing bandwidth dramatically. It is therefore important that a block transfer system have a "pay for use" property; i.e. providing coherent transfers should not significantly hurt the performance of transfers that do not need coherence, particularly if the latter are a common case. For example, if block transfer is used in the system only to support explicit message passing programs, and not to accelerate coarse-grained communication in an otherwise read-write shared address space program, then it may make sense to provide only local coherence but not global coherence. But in the latter case, global coherence may in fact be important if we want to provide true integration of block transfer with a coherent read-write shared address space, and not make the programming model restrictive.

As a simple example of why global coherence may be needed, consider false sharing. The data that processor P1 wants to send P2 may be allocated in P1's local memory and produced by P1, but another processor P3 may have written other words on those cache blocks more recently. The cache blocks will therefore be in invalid state in P1's cache, and the latest copies must be retrieved from P3 in order to be sent to P2. False sharing is only an example: It is not difficult to see that in a true shared address space program the words that P1 sends to P2 might themselves have been written most recently by another processor.

Where to Place the Transferred Data?

The other interesting policy issue is whether the data that are block transferred should be placed in the main memory of the destination node, in the cache, or in both. Since the destination node will read the data, having the data come into the cache is useful. However, it has some disadvantages. First, it requires intervening in the processor cache, which is expensive on modern systems and may prevent the processor from accessing the cache at the same time. Second, unless the block transferred data are used soon they may be replaced before they are used, so we should transfer data into main memory as well. And third and most dangerous, the transferred data might replace other useful data from the cache, perhaps even data that are currently in the processor's active working set. For these reasons, large block transfers into a small first-level cache are likely not to be a good idea, while transfers into a large or second-level cache may be more useful.

11.5.3 Performance Benefits

In a shared address space, there are some new advantages to using large explicit transfers compared to communicating implicitly in cache-block sized messages. However, some of the advantages discussed earlier are compromised and there are disadvantages as well. Let us first discuss the advantages and disadvantages qualitatively, and then look at some performance data on real programs from the literature and some analytical observations.
Potential Advantages and Disadvantages

The major sources of performance advantage are outlined below. The first two were discussed in the context of message passing, so we only point out the differences here.
• Amortized per-message overhead. This advantage may be less important in a hardware-supported shared address space, since the end-point overhead is already quite small. In fact, explicit, flexibly sized transfers have higher end-point overhead than small fixed-size transfers, as discussed in Chapter 7, so the per-communication overhead is likely to be substantially larger for block transfer than for read-write communication. This turns out to be a major stumbling block for block transfer on modern cache-coherent machines, which currently use the facility more for operating system operations like page migration than for application activity. However, in less efficient communication architectures, such as those that use commodity parts, the overhead per message can be large even for cache block communication and the amortization due to block transfer can be very important.
• Pipelined transfer of large chunks of data.
• Less wasted bandwidth: In general, the larger the message, the less the relative overhead of headers and routing information, and the higher the payload (the fraction of data transferred that is useful to the end application). This advantage may be lost when the block transfer is built on top of the existing cache line transfer mechanism, in which case the header information may nonetheless be sent once per cache block. When used properly, block transfer can also reduce the number of protocol messages (e.g. invalidations, acknowledgments) that may have otherwise been sent around in an invalidation-based protocol.
• Replication of transferred data in main memory at the destination, since the block transfer is usually done into main memory. This reduces the number of capacity misses at the destination that have to be satisfied remotely, as in COMA machines. However, without a COMA architecture it implies that the user must manage the coherence and replacement of the data that have been block transferred and hence replicated in main memory.
• Bundling of synchronization with data transfer. Synchronization notifications can be piggybacked on the same message that transfers data, rather than having to send separate messages for data and synchronization. This reduces the number of messages needed, though the absence of an explicit receive operation implies that functionally synchronization still has to be managed separately from data transfer (i.e. there is no send-receive match that provides synchronization), just as in asynchronous send-receive message passing.

The sources of potential performance disadvantage are:
• Higher overhead per transfer, as discussed in the first point above.
• Increased contention. Long messages tend to incur higher contention, both at the end points and in the network. This is because longer messages tend to occupy more resources in the network at a time, and because the latency tolerance they provide places greater bandwidth demands on the communication architecture.
• Extra work. Programs may have to do extra work to organize themselves so communication can be performed in large transfers (if this can be done effectively at all). This extra work may turn out to have a higher cost than the benefits achieved from block transfer (see Section 3.7 in Chapter 3).
The following simple example illustrates the performance improvements that can be obtained by using block transfers rather than reads and writes, particularly from reduced overhead and pipelined data transfer.
Example 11-1
Suppose we want to communicate 4KB of data from a source node to a destination node in a cache-coherent shared address space machine with a cache block size of 64 bytes. Assume that the data are contiguous in memory, so spatial locality is exploited perfectly (Exercise 11.4 discusses the issues that arise when spatial locality is not good). It takes the source assist 40 processor cycles to read a cache block out of memory, 50 cycles to push a cache block out through the network interface, and the same numbers of cycles for the complementary operations at the receiver. Assume that the local read miss latency is 60 cycles, the remote read miss latency 180 cycles, and the time to start up a block transfer 200 cycles. What is the potential performance advantage of using block transfer rather than communication through cache misses, assuming that the processor blocks on memory operations until they complete?
Answer
The cost of getting the data through read misses, as seen by the processor, is 180*(4096/64) = 11520 cycles. With block transfer, the rate of the transfer pipeline is limited by max(40,50,50) or 50 cycles per block. This brings the data to the local memory at the destination, from where they will be read by the processor through local misses, each of which costs 60 cycles. Thus, the cost of getting the data to the destination processor with block transfer is 200 + (4096/64)*(50+60) or 7240 cycles. Using block transfer therefore gives us a speedup of 11520/7240 or 1.6.

Another way to achieve block transfers is with vector operations, e.g. a vector read from remote memory. In this case, a single instruction causes the data to appear in the (vector) registers, so individual loads and stores are not required even locally and there is a savings in instruction bandwidth. Vector registers are typically managed by software, which has the disadvantages of aliasing and tying up register names but the advantage of not suffering from cache conflicts. However, many high-performance systems today do not include vector operations, and we shall focus on block transfers that still need individual load and store operations to access and use the data.

Performance Benefits and Limitations in Real Programs

Whether block transfer can be used effectively in real programs depends on both program and system characteristics. Consider a simple example amenable to block transfer, a near-neighbor grid computation, to see how the performance issues play out. Let us ignore the complications discussed above by assuming that the transfers are done from the source main memory to the destination main memory and that the data are not cached anywhere. We assume a four-dimensional array representation of the two-dimensional grid, and that a processor's partition of the grid is allocated in its local memory. Instead of communicating elements at partition boundaries through individual reads and writes, a process can simply send a single message containing all the appropriate boundary elements to its neighbor, as shown in Figure 11-11. The communication to computation ratio is proportional to √p/n.
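Returning to Example 11-1 for a moment, the arithmetic can be reproduced with a few lines of C:

    #include <stdio.h>

    int main(void) {
        int bytes = 4096, block = 64;
        int blocks = bytes / block;                  /* 64 cache blocks                   */

        int remote_miss = 180, local_miss = 60;      /* cycles                            */
        int startup = 200;
        int stage = 50;      /* max(40, 50, 50): slowest pipeline stage per block         */

        int read_miss_cycles  = blocks * remote_miss;                    /* 64*180 = 11520 */
        int block_xfer_cycles = startup + blocks * (stage + local_miss); /* 7240           */

        printf("read misses: %d cycles\n", read_miss_cycles);
        printf("block transfer: %d cycles, speedup %.2f\n",
               block_xfer_cycles, (double)read_miss_cycles / block_xfer_cycles);
        return 0;
    }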
[Figure 11-11 sketch: an n-by-n grid partitioned among 16 processors, P0 through P15, in a 4-by-4 arrangement; each partition is n/√p elements on a side.]
Figure 11-11 The use of block transfer in a near-neighbor grid program.
A process can send an entire boundary subrow or subcolumn of its partition to the process that owns the neighboring partition in a single large message. The size of each message is n/√p elements.
Since block transfer is intended to improve communication performance, this in itself would tend to increase the relative effectiveness of block transfer as the number of processors increases. However, the size of an individual block transfer is n/√p elements and therefore
decreases with p. Smaller transfers make the additional overhead of initiating a block transfer relatively more significant, reducing the effectiveness of block transfer. These two factors combine to yield a sweet-spot in the number of processors for which block transfer is most effective for a given grid size, yielding a performance improvement curve for block transfer that might look like that in Figure 11-12. The sweet-spot moves to larger numbers of processors as the grid size increases.
[Figure 11-12 sketch: normalized execution time versus number of processors, dipping below 1.0 at a sweet spot.]
Figure 11-12 Relative performance improvement with block transfer. The figure shows the execution time using block transfer normalized to that using loads and stores. The sweet spot occurs when communication is high enough that block transfer matters, and the transfers are also large enough that overhead doesn’t dominate.
Figure 11-13 illustrates this effect for a real program, namely a Fast Fourier Transform (FFT), on a simulated architecture that models the Stanford FLASH multiprocessor and is quite close to the SGI Origin2000 (though much more aggressive in its block transfer capabilities and performance).
[Figure 11-13 charts: normalized execution time versus number of processors, with one curve per cache block size (B = 32, 64, and 128 bytes), shown for two problem sizes.]
Figure 11-13 The benefits of block data transfer in a Fast Fourier Transform program.
The architectural model and parameters used resemble those of the Stanford FLASH multiprocessor [KOH+94], and are similar to those of the SGI Origin2000. Each curve is for a different cache block size. For each curve, the Y axis shows the execution time normalized to that of an execution without block transfer using the same cache block size. Thus, the different curves are normalized differently, and different points for the same X-axis value are not normalized to the same number. The closer a point is to the value 1 (the horizontal line), the less the improvement obtained by using block transfer over regular read-write communication for that cache block size.
We can see that the relative benefit of block transfer over ordinary cache-coherent communication diminishes with increasing cache block size, since the excellent spatial locality in this program causes long cache blocks to themselves behave like little block transfers. For the largest block size of 128 bytes, which is in fact the block size used in the SGI Origin2000, the benefits of using block transfer on such an efficient cache-coherent machine are small, even though the block transfer engine simulated is very optimistic. The sweet-spot effect is clearly visible, and the sweet spot moves with increasing problem size. Results for Ocean, a more complete, regular application whose communication is a more complex variant of the nearest-neighbor communication, are shown in Figure 11-14.
[Figure 11-14 charts: normalized execution time versus number of processors, one curve per cache block size (B = 32, 64, and 128 bytes); (a) 130-by-130 point grids, (b) 258-by-258 point grids.]
Figure 11-14 The benefits of block transfer in the Ocean application. The platform and data interpretation are exactly the same as in Figure 11-13.
Block transfers at row-oriented partition boundaries transfer contiguous data, but this communication also has very good spatial locality; at column-oriented partition boundaries spatial locality in communicated data is poor, but it is also difficult to exploit block transfer well. The relative benefits of block transfer therefore don't depend so
greatly on cache block size. Overall, the communication to computation ratio in Ocean is much smaller than in FFT. While the communication to computation ratio increases at higher levels of the grid hierarchy in the multigrid solver—as processor partitions become smaller—block transfers also become smaller at these levels and are less effective at amortizing overhead. The result is that block transfer is a lot less helpful for this application. Quantitative data for other applications can be found in [WSH94].

As we saw in the Barnes-Hut galaxy simulation in Chapter 3, irregular computations can also be structured to communicate in coarse-grained messages, but this often requires major changes in the algorithm and its orchestration. Often, additional work has to be done to gather the data together for large messages and orient the parallel algorithm that way. Whether this is useful depends on how the extra computation costs compare with the benefits obtained from block transfer. In addition to data needed by the computation directly, block transfer may also be useful for some aspects of parallelism management. For example, in a task stealing scenario, if the task descriptors for a stolen task are large they can be sent using a block transfer, as can the associated data, and the necessary synchronization for task queues can be piggybacked on these transfers.

One situation in which block transfer will clearly be beneficial is when the overhead to initiate remote read-write accesses is very high, unlike in the architecture simulated above; for example, when the shared address space is implemented in software. To see the effects of other parameters of the communication architecture, such as network transit time and point-to-point bandwidth, let us look at the potential benefits more analytically. In general, the time to communicate a large block of data through remote read misses is (# of read misses * remote read miss time), and the time to get the data to the destination processor through block transfer is (start-up overhead + # of cache blocks transferred * pipeline stage time per line + # of read misses * local read miss time). The simplest case to analyze is a large enough contiguous chunk of data, so that we can ignore the block transfer start-up cost and also assume perfect spatial locality. In this case, the speedup due to block transfer is limited by the ratio:

                       remote read miss time
    ----------------------------------------------------------        (EQ 11.1)
    block transfer pipeline stage time + local read miss time
where the block transfer pipeline stage time is the maximum of the time spent in any of the stages of the pipeline shown in Figure 11-10. A longer network transit time implies that the remote read miss time is increased. If the network bandwidth doesn't decrease with increasing latency, then the other terms are unaffected and longer latencies favor block transfer. Alternatively, network bandwidth may decrease proportionally to the increase in latency, e.g. if latency is increased by decreasing the clock speed of the network. In this case, at some point the rate at which data can be pushed into the network becomes smaller than the rate at which the assist can retrieve data from main memory. Up to this point, the memory access time dominates the block transfer pipeline stage time and increases in network latency favor block transfer. After this point the numerator and denominator in the ratio increase at the same rate, so the relative advantage of block transfer is unchanged. Finally, suppose latency and overhead stay fixed but bandwidth changes (e.g. more wires are added to each network link). We might think that load-based communication is latency-bound while block transfer is bandwidth-bound, so increasing bandwidth should favor block transfer.
However, past a point, increasing bandwidth means that the pipeline stage time becomes dominated by the memory access time, as above, so increasing bandwidth further does not improve the performance of block transfer. Reducing bandwidth increases the block transfer pipeline stage time proportionally, once it starts to dominate the end-point effects, and the extent to which this hurts block transfer depends on how it impacts the end-to-end read miss latency of a cache block. Thus, if the other variables are kept constant in each case, block transfer is helped by increased per-message overhead, helped by increased latency, hurt by decreased latency, helped by increased bandwidth up to a point, and hurt by decreased bandwidth.

In summary, then, the performance improvement obtained from block transfer over read-write cache coherent communication depends on the following factors:
• What fraction of the execution time is spent in communication that is amenable to block transfer.
• How much extra work must be done to structure this communication in block transfers.
• The problem size and number of processors, which affect the first point above as well as the sizes of the transfers and thus how well they amortize block transfer overhead.
• The overhead itself, and the different latencies and bandwidths in the system.
• The spatial locality obtained with read-write communication and how it interacts with the granularity of data transfer (cache block size, see Exercise 11.4).
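To get a feel for how the bound in EQ 11.1 responds to the overhead, latency, and bandwidth parameters discussed above, a few lines of C can evaluate it under different assumptions (the parameter values below are purely illustrative):

    #include <stdio.h>

    /* Upper bound on block-transfer speedup from EQ 11.1:
     *   remote read miss time / (pipeline stage time + local read miss time) */
    double bt_speedup_bound(double remote_miss, double stage, double local_miss) {
        return remote_miss / (stage + local_miss);
    }

    int main(void) {
        printf("%.2f\n", bt_speedup_bound(180, 50, 60));   /* Example 11-1 values          */
        printf("%.2f\n", bt_speedup_bound(400, 50, 60));   /* longer network latency       */
        printf("%.2f\n", bt_speedup_bound(400, 100, 60));  /* lower bandwidth: slower stage */
        return 0;
    }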
If we can really make all messages large enough, then the latency of communication may not be the problem anymore; bandwidth may become a more significant constraint. But if not, we still need to hide the latency of individual data accesses by overlapping them with computation or with other accesses. The next three sections examine the techniques for doing this in a shared address space. As mentioned earlier, the focus is mostly on the case where communication is performed through reads and writes at the granularity of individual cache blocks or words. However, we shall also discuss how the techniques might be used in conjunction with block transfer. Let us begin with techniques to move past long-latency accesses in the same thread of computation.
11.6 Proceeding Past Long-latency Events in a Shared Address Space

A processor can proceed past a memory operation to other instructions if the memory operation is made non-blocking. For writes, this is usually quite easy to implement: The write is put in a write buffer, and the processor goes on while the buffer takes care of issuing the write to the memory system and tracking its completion as necessary. A similar approach can be taken with reads if the processor supports non-blocking reads. However, due to the nature of programs, this requires greater support from software and/or hardware than simply a buffer and the ability to check dependences. The difference is that unlike a write, a read is usually followed very soon by an instruction that needs the value returned by the read (called a dependent instruction). This means that even if the processor proceeds past the read, it will stall very soon since it will reach a dependent instruction. If the read misses in the cache, very little of the miss latency is likely to be hidden in this manner. Hiding read latency effectively, using non-blocking reads, requires that we "look ahead" in the instruction stream past the dependent instruction to find other instructions that are independent of the read. This requires support in the compiler (instruction scheduler) or the hardware or both. Hiding write latency does not usually require instruction lookahead: We can allow the processor to stall when it encounters a dependent instruction, since it is not likely to encounter one soon.
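As a schematic illustration (not taken from any particular compiler), the kind of scheduling that separates loads from their dependent uses might look like this:

    /* Each load is used immediately, so a non-blocking read stalls almost at once. */
    double dot_unscheduled(const double *a, const double *b, int n) {
        double s = 0.0;
        for (int i = 0; i < n; i++)
            s += a[i] * b[i];
        return s;
    }

    /* Several independent loads are issued before any dependent use, so their
     * miss latencies can overlap with each other and with the multiplies/adds. */
    double dot_scheduled(const double *a, const double *b, int n) {
        double s = 0.0;
        for (int i = 0; i + 1 < n; i += 2) {
            double t0 = a[i],     u0 = b[i];       /* issue loads first            */
            double t1 = a[i + 1], u1 = b[i + 1];
            s += t0 * u0;                          /* dependent uses come later    */
            s += t1 * u1;
        }
        if (n & 1)
            s += a[n - 1] * b[n - 1];              /* leftover element             */
        return s;
    }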
Proceeding past operations before they complete can provide benefits from both buffering and pipelining. Buffering of memory operations allows the processor to proceed with other activity, to delay the propagation and application of requests like invalidations, and to merge operations to the same word or cache block in the buffer before sending them out further into the memory system. Pipelining means that multiple memory operations (cache misses) are issued into the extended memory hierarchy at a time, so their latencies are overlapped and they can complete earlier. These advantages are true for uniprocessors as well as multiprocessors.

Whether or not proceeding past the issuing of memory operations violates sequential consistency in multiprocessors depends on whether the operations are allowed to perform with respect to (become visible to) other processors out of program order. By requiring that memory operations from the same thread should not appear to perform with respect to other processors out of program order, sequential consistency (SC) restricts—but by no means eliminates—the amount of buffering and overlap that can be exploited. More relaxed consistency models allow greater overlap, including allowing some types of memory operations to even complete out of order as discussed in Chapter 9. The extent of overlap possible is thus determined by both the mechanisms provided and the consistency model: Relaxed models may not be useful if the mechanisms needed to support their reorderings (for example, write buffers, compiler support or dynamic scheduling) are not provided, and the success of some types of aggressive implementations for overlap may be restricted by the consistency model.

Since our focus here is latency tolerance, the discussion is organized around increasingly aggressive mechanisms needed for overlapping the latency of read and write operations. Each of these mechanisms is naturally associated with a particular class of consistency models—the class that allows those operations to even complete out of order—but can also be exploited in more limited ways with stricter consistency models, by allowing overlap but not out of order completion. Since hiding read latency requires more elaborate support—albeit provided in many modern microprocessors—to simplify the discussion let us begin by examining mechanisms to hide only write latency. Reads are initially assumed to be blocking, so the processor cannot proceed past them; later, hiding read latency will be examined as well. In each case, we examine the sources of performance gains and the implementation complexity, assuming the most aggressive consistency model needed to fully exploit that overlap, and in some cases also look at the extent to which sequential consistency can exploit the overlap mechanisms. The focus is always on hardware cache-coherent systems, though many of the issues apply quite naturally to systems that don't cache shared data or are software coherent as well.

11.6.1 Proceeding Past Writes

Starting with a simple, statically-scheduled processor with blocking reads, to proceed past write misses the only support we need in the processor is a buffer, for example a write buffer. Writes, such as those that miss in the first-level cache, are simply placed in the buffer and the processor proceeds to other work in the same thread. The processor stalls upon a write only if the write buffer is full.
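A toy model of such a buffer is sketched below. It is purely illustrative (real buffers also track per-byte valid bits, ordering, and completion), and the merging into arbitrary entries that it shows is only legal under the more relaxed consistency models discussed later in this section:

    #include <string.h>

    #define WB_ENTRIES 16
    #define BLOCK      64

    typedef struct {
        int           valid;
        unsigned long block_addr;          /* block-aligned address          */
        unsigned char data[BLOCK];         /* merged write data              */
    } wb_entry;

    typedef struct { wb_entry e[WB_ENTRIES]; int head, tail, count; } write_buffer;

    /* Returns 0 if the buffer is full (the processor would stall), 1 otherwise. */
    int wb_write(write_buffer *wb, unsigned long addr, unsigned char byte) {
        unsigned long blk = addr & ~(unsigned long)(BLOCK - 1);

        /* Merge (coalesce) into an existing entry for this cache block, if any. */
        for (int i = 0; i < wb->count; i++) {
            wb_entry *e = &wb->e[(wb->head + i) % WB_ENTRIES];
            if (e->valid && e->block_addr == blk) {
                e->data[addr - blk] = byte;
                return 1;
            }
        }
        if (wb->count == WB_ENTRIES)
            return 0;                             /* full: stall the processor */

        wb_entry *e = &wb->e[wb->tail];           /* allocate a new entry      */
        e->valid = 1;
        e->block_addr = blk;
        memset(e->data, 0, BLOCK);                /* ignores partial-block state */
        e->data[addr - blk] = byte;
        wb->tail = (wb->tail + 1) % WB_ENTRIES;
        wb->count++;
        return 1;
    }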
The write buffer is responsible for controlling the visibility and completion of writes relative to other operations from that processor, so the processor’s execution units do not have to worry about this. Similarly, the rest of the extended memory hierarchy can reorder transactions (at least to different locations) as it pleases, since the machinery close to the processor takes care of preserving the necessary orders. Consider overlap with reads. For correct uniprocessor operation, a
read may be allowed to bypass the write buffer and be issued to the memory system, as long as there is not a write to the same location pending in the write buffer. If there is, the value from the write buffer may be forwarded to the read even before the write completes. The latter mechanism will certainly allow reads to complete out of order with respect to earlier writes in program order, thus violating SC, unless the write buffer is flushed upon a match and the read reissued to the memory system thereafter. Reads bypassing the write buffer will also violate SC, unless the read is not allowed to complete (bind its value) before those writes. Thus, SC can take advantage of write buffers, but the benefits are limited. We know from Chapter 9 that allowing reads to complete before previous writes leads to models like processor consistency (PC) and total store ordering (TSO). Now consider the overlap among writes themselves. Multiple writes may be placed in the write buffer, and the order in which they are made visible or they complete is also determined by the buffer. If writes are allowed to complete out of order, a great deal of flexibility and overlap among writes can be exploited by the buffer. First, a write to a location or cache block that is currently anywhere in the buffer can be merged (coalesced) into that cache block and only a single ownership request sent out for it by the buffer, thus reducing traffic as well. Especially if the merging is not into the last entry of the buffer, this clearly leads to writes becoming visible out of program order. Second, buffered writes can be retired from the buffer as soon as possible, before they complete, making it possible for other writes behind them to get through. This allows the ownership requests of the writes to be pipelined through the extended memory hierarchy. Allowing writes to complete out of order requires partial store order (PSO) or a more relaxed consistency model. Stricter models like SC, TSO and PC essentially require that merging into the write buffer not be allowed except into the last block in the buffer, and even then in restricted circumstances since other processors must not be able to see the writes to different words in the block in a different order. Retiring writes early to let others pass through is possible under the stricter models as long as the order of visibility and completion among these writes is preserved in the rest of the system. This is relatively easy in a bus-based machine, but very difficult in a distributed-memory machine with different home memories and independent network paths. On the latter systems, guaranteeing program order among writes, as needed for SC, TSO and PC, essentially requires that a write not be retired from the head of the (FIFO) buffer until it has committed with respect to all processors. Overall, a strict model like SC can utilize write buffers to overlap write latency with reads and other writes, but only to a limited extent. Under relaxed models, when exactly write operations are sent out to the extended memory hierarchy and made visible to other processors depends on implementation and performance considerations as well. For example, under a relaxed model invalidations may be sent out as soon as the writes are able to get through the write buffer, or the writes may be delayed in the buffer until the next synchronization point. The latter may allow substantial merging of writes, as well as reduction of invalidations and misses due to false sharing. 
However, it implies more traffic in bursts at synchronization points, rather than pipelined traffic throughout the computation.

Performance Impact

Simulation results are available in the literature [GGH91a,Gha95] that measure the benefits of allowing the processor to proceed past writes on the parallel applications in the original SPLASH application suite [SWG92]. They do not try to maintain SC, but exploit the maximum overlap
that is possible without too much extra hardware. They are separated into techniques that allow the program order between a write and a following read (write->read order) to be violated, satisfying TSO or PC, and those that also allow the order between two writes (write->write order) to be violated, satisfying PSO. The latter is also essentially the best that more relaxed models like RMO, WO or RC can accomplish given that reads are blocking. Let us look at these results in the rest of this subsection.

Figure 11-15 shows the reduction in execution time, divided into different components, for two representative applications when the write->read and write->write orders are allowed to be violated. The applications are older versions of the Ocean and Barnes-Hut applications, running with small problem sizes on 16-processor, simulated, directory-based cache-coherent systems. The baseline for comparison is the straightforward implementation of sequential consistency, in which the processor issuing a read or write reference stalls until that reference completes (measurements for using the write buffer but preserving SC—by stalling a read until the write buffer is empty—showed only a very small performance improvement over stalling the processor on all writes themselves). Details of the simulated system, which assumes considerably less aggressive processors than modern systems, and an analysis of the results can be found in [GGH91a,Gha95]; here, we simply present a few of the results to illustrate the effects.
[Figure 11-15 chart: for each of Ocean and Barnes-Hut, three bars (Base, W-R, W-W) show execution time normalized to Base = 100, broken into Busy, Read, Write, and Synch components.]
Figure 11-15 Performance benefits from proceeding past writes by taking advantage of relaxed consistency models.
The results assume a statically scheduled processor model. For each application, Base is the case with no latency hiding, W-R indicates that reads can bypass writes (i.e. the write->read order can be violated), and W-W indicates that both writes and reads can bypass writes. The execution time for the Base case is normalized to 100 units. The system simulated is a cache-coherent multiprocessor using a flat, memory-based directory protocol much like that of the Stanford DASH multiprocessor [LLJ+92]. The simulator assumes a write-through first-level cache and a write-back second-level cache, with a deep, 16-entry write buffer between the two. The processor is single-issue and statically scheduled, with no special scheduling in the compiler targeted toward latency tolerance. The write buffer is aggressive, with read bypassing of it and read forwarding from it enabled. To preserve the write-write order (initially), writes are not merged in the write buffer, and the write at the head of the FIFO write buffer is not retired until it has been completed. The data access parameters assumed for a read miss are one cycle for an L1 hit, 15 cycles for an L2 hit, 29 cycles if satisfied in local memory, 101 cycles for a two-message remote miss, and 132 cycles for a three-message remote miss. Write latencies are somewhat smaller in each case. The system therefore assumes a much slower processor relative to the memory system than modern systems.
The figure shows that simply allowing write->read reordering is usually enough to hide most of the write latency from the processor (see the first two bars for each application). All the applications in the study had their write latency almost entirely hidden by this deep write buffer, except Ocean, which has the highest frequency of write misses to begin with. There are also some interesting secondary effects. One of these is that while we might not expect the read stall time to change much from the base SC case, in some cases it increases slightly.
This is because the additional bandwidth demands of hiding write latency contend with read misses, making them cost more. This is a relatively small effect with these simple processors, though it may be more significant with modern, dynamically scheduled processors that also hide read latency. The other, beneficial effect is that synchronization wait time is sometimes also reduced. This happens for two reasons. First, some of the original load imbalance is caused by imbalances in memory stall time; as stall time is reduced, imbalances are reduced as well. Second, if a write is performed inside a critical section and its latency is hidden, then the critical section completes more quickly and the lock protecting it can be passed more quickly to another processor, which therefore incurs less wait time for that lock.

The third bar for each application in the figure shows the results for the case when the write-write order is allowed to be violated as well. Now write merging is enabled to any block in the same 16-entry write buffer, and a write is allowed to retire from the write buffer as soon as it reaches the head, even before it is completed. Thus, pipelining of writes through the memory and interconnect is given priority over buffering them for more time, which might have enabled greater merging in the write buffer and reduction of false sharing misses by delaying invalidations. Of course, having multiple writes outstanding in the memory system requires that the caches allow multiple outstanding misses, i.e. that they are lockup-free. The implementation of lockup-free caches will be discussed in Section 11.9. The figure shows that the write-write overlap hides whatever write latency remained with only write-read overlap, even in Ocean. Since writes retire from the write buffer at a faster rate, the write buffer does not fill up and stall the processor as easily. Synchronization wait time is reduced further in some cases, since a sequence of writes before a release completes more quickly, causing faster transfer of synchronization between processors. In most applications, however, write-write overlap doesn't help too much since almost all the write latency was already hidden by allowing reads to bypass previous writes. This is especially true because the processor model assumes that reads are blocking, so write-write overlap cannot be exploited past read operations. For reordering by hardware on cache-coherent systems, differences between models like weak ordering and release consistency—as well as subtle differences among models that allow the same reorderings, such as TSO which preserves write atomicity and processor consistency which does not—do not seem to affect performance substantially.

Since performance is highly dependent on implementation and on problem and cache sizes, the study examined the impact of using less aggressive write buffers and second-level cache architectures, as well as of scaling down the cache size to capture the case where an important working set does not fit in the cache (see Chapter 4). For the first part, four different combinations were examined: an L2 cache that blocks on a miss versus a lockup-free L2 cache, and a write buffer that allows reads to bypass it if they are not to an address in the buffer and one that does not. The results indicate that having a lockup-free cache is very important to obtaining good performance improvements, as we might expect.
As for bypassable write-buffers, they are very important for the system that allows write->read reordering, particularly when the L2 cache is lockup-free, but less important for the system that allows write->write reordering as well. The reason is that in the former case writes retire from the buffer only after completion, so it is quite likely that when a read misses in the first-level cache it will find a write still in the write buffer, and stalling the read till the write buffer empties will hurt performance; in the latter case, writes retire much faster so the likelihood of finding a write in the buffer to bypass is smaller. For the same reason, it is easier to use smaller write buffers effectively when write->write reordering is allowed so writes are allowed to retire early. Without lockup-free caches, the ability for reads to bypass writes in the buffer is much less advantageous.
[Figure 11-16 charts: (a) Effects of Cache Size: Base and W-W bars for Ocean and Barnes-Hut with 64K/256K, 8K/64K, and 2K/8K first/second-level caches; (b) Effects of Cache Block Size for Ocean: Base and W-W bars with 16B and 64B blocks. Bars show normalized execution time broken into Busy, Read, Write, and Synch components.]
Figure 11-16 Effects of cache size and cache block size on the benefits of proceeding past writes. The block size assumed in the graph for varying cache size is the default 16B used in the study. The cache sizes assumed in the graph for varying block size are the default 64KB first level and 256KB second level caches.
The results with smaller L1/L2 caches are shown in Figure 11-16, as are results for varying the cache block size. Consider cache size first. With smaller caches, write latency is still hidden effectively by allowing write->read and write->write reordering. (As discussed in Chapter 4, for Barnes-Hut the smallest caches do not represent a realistic scenario.) Somewhat unexpectedly, the impact of hiding write latency is often larger with larger caches, even though the write latency to be hidden is smaller. This is because larger caches are more effective in reducing read latency than write latency, so the latter is relatively more important. All the above results assumed a cache block size of only 16 bytes, which is quite unrealistic today. Larger cache blocks tend to reduce the miss rates for these applications and hence reduce the impact of these reorderings on execution time a little, as seen for Ocean in Figure 11-16(b).

11.6.2 Proceeding Past Reads

As discussed earlier, to hide read latency effectively we need not only non-blocking reads but also a mechanism to look ahead beyond dependent instructions. The later independent instructions may be found by the compiler or by hardware. In either case, implementation is complicated by the fact that the dependent instructions may be followed soon by branches. Predicting future paths through the code to find useful instructions to overlap requires effective branch prediction, as well as speculative execution of instructions past predicted branches. And given speculative execution, we need hardware support to cancel the effects of speculatively executed instructions upon detecting misprediction or poor speculation. The trend in the microprocessor industry today is toward increasingly sophisticated processors that provide all these features in hardware. For example, they are included by processor families such as the Intel Pentium Pro [Int96], the Silicon Graphics R10000 [MIP96], the SUN Ultrasparc [LKB91], and the HP-PA8000 [Hun96]. The reason is that latency hiding and overlapped
operation of the memory system and functional units are increasingly important even in uniprocessors. These mechanisms have a high design and hardware cost, but since they are already present in the microprocessors they can be used to hide latency in multiprocessors as well. And once they are present, they can be used to hide both write latency and read latency.

Dynamic Scheduling and Speculative Execution

To understand how memory latency can be hidden by using dynamic scheduling and speculative execution, and especially how the techniques interact with memory consistency models, let us briefly recall from uniprocessor architecture how the methods work. More details can be found in the literature, including [HeP95]. Dynamic scheduling means that instructions are decoded in the order presented by the compiler, but they are executed by their functional units in the order in which their operands become available at runtime (thus potentially in a different order than program order). One way to orchestrate this is through reservation stations and Tomasulo's algorithm [HeP95]. This does not necessarily mean that they will become visible or complete out of program order, as we will see. Speculative execution means allowing the processor to look at and schedule for execution instructions that are not necessarily going to be useful to the program's execution, for example instructions past a future uncertain branch instruction. The functional units can execute these instructions assuming that they will indeed be useful, but the results are not committed to registers or made visible to the rest of the system until the validity of the speculation has been determined (e.g. the branch has been resolved).

The key mechanism used for speculative execution is an instruction lookahead or reorder buffer. As instructions are decoded by the decode unit, whether in-line or speculatively past predicted branches, they are placed in the reorder buffer, which therefore holds instructions in program order. Among these, any instructions that are not dependent on a yet incomplete instruction can be chosen from the reorder buffer for execution. It is quite possible that independent long-latency instructions (e.g. reads) that are before them in the buffer have not yet been executed or completed, so in this way the processor proceeds past incomplete memory operations and other instructions. The reorder buffer implements register renaming, so instructions in the buffer become ready for execution as soon as their operands are produced by other instructions, not necessarily waiting till the operands are available in the register file. The results of an instruction are maintained with it in the reorder buffer, and not yet committed to the register file. An instruction is retired from the reorder buffer only when it reaches the head of the reorder buffer, i.e. in program order. It is at this point that the result of a read may be put in the register and the value produced by a write is free to be made visible to the memory system. Retiring in program (decode) order makes it easy to perform speculation: if a branch is found to be mispredicted, then no instruction that was after it in the buffer (hence in program order) can have left the reorder buffer and committed its effects: no read has updated a register, and no write has become visible to the memory system.
All instructions after the branch are then invalidated from the reorder buffer and the reservation stations, and decoding resumes from the correct (resolved) branch target. In-order retirement also makes it easy to implement precise exceptions, so the processor can restart quickly after an exception or interrupt. However, in-order retirement does mean that if a read miss reaches the head of the buffer before its data value has come back, the reorder buffer (though not necessarily the processor) stalls, since later instructions cannot be retired. This FIFO behavior implies that the extent of overlap and latency tolerance may be limited by the size of the reorder buffer.
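To make the retirement mechanism concrete, here is a minimal C sketch of in-order retirement from a reorder buffer. It is only an illustration of the bookkeeping described above, not any particular processor's design; the names (RobEntry, Rob, retire_step) are invented, and register renaming and the actual commit of results are elided.

    #include <stdbool.h>
    #include <stddef.h>

    #define ROB_SIZE 64

    typedef struct {
        bool valid;         /* entry holds a decoded instruction              */
        bool executed;      /* result has been produced (value bound)         */
        bool is_branch;
        bool mispredicted;  /* set when the branch resolves the wrong way     */
    } RobEntry;

    typedef struct {
        RobEntry entry[ROB_SIZE];
        size_t   head, tail;   /* head = oldest instruction, in program order */
    } Rob;

    /* Try to retire the instruction at the head; returns true if one retired. */
    static bool retire_step(Rob *rob)
    {
        RobEntry *e = &rob->entry[rob->head];
        if (!e->valid || !e->executed)
            return false;  /* head incomplete: retirement stalls (FIFO order) */
        if (e->is_branch && e->mispredicted) {
            /* Squash all younger entries: none of them has retired, so no
               register or memory state needs to be undone.                   */
            for (size_t i = 0; i < ROB_SIZE; i++)
                if (i != rob->head)
                    rob->entry[i].valid = false;
            rob->tail = (rob->head + 1) % ROB_SIZE;  /* refill from correct target */
        }
        /* Commit: update the register file or release the store (not shown). */
        e->valid = false;
        rob->head = (rob->head + 1) % ROB_SIZE;
        return true;
    }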
Hiding Latency under Sequential and Release Consistency The difference between retiring and completion is important. A read completes when its value is bound, which may be before it reaches the head of the reorder buffer and can retire (modifying the processor register). On the other hand, a write is not complete when it reaches the head of the buffer and retires to the memory system; it completes only when it has actually become visible to all processors. Understand this difference helps us understand how out-of-order, speculative processors can hide latency under both SC and more relaxed models like release consistency (RC). Under RC, a write may indeed be retired from the buffer before it completes, allowing operations behind it to retire more quickly. A read may be issued and complete anytime once it is in the reorder buffer, unless there is an acquire operation before it that has not completed. Under SC, a write is retired from the buffer only when it has completed (or at least committed with respect to all processors, see Chapter 6), so it may hold up the reorder buffer for longer once it reaches the head and is sent out to the memory system. And while a read may still be issued to the memory system and complete before it reaches the head of the buffer, it is not issued before all previous memory operations in the reorder buffer have completed. Thus, SC again exploits less overlap than RC. Additional Techniques to Enhance Latency Tolerance Additional techniques—like a form of hardware prefetching, speculative reads, and write buffers— can be used with these dynamically scheduled processors to hide latency further under SC as well as RC [KGH91b], as mentioned in Chapter 9 (Section 9.2). In the hardware prefetching technique, hardware issues prefetches for any memory operations that are in the reorder buffer but are not yet permitted by the consistency model to actually issue to the memory system. For example, the processor might prefetch a read that is preceded by another incomplete memory operation in SC, or by an incomplete acquire operation in RC, thus overlapping them. For writes, prefetching allows the data and ownership to be brought to the cache before the write actually gets to the head of the buffer and can be issued to the memory system. These prefetches are nonbinding, which means that the data are brought into the cache, not into a register or the reorder buffer, and are still visible to the coherence protocol, so they do not violate the consistency model. (Prefetching will be discussed in more detail in Section 11.7.) Hardware prefetching is not sufficient when the address to be prefetched is not known, and determining the address requires the completion of a long-latency operation. For example, if a read miss to an array entry A[I] is preceded by a read miss to I, then the two cannot be overlapped because the processor does not know the address of A[I] until the read of I completes. The best we could do is prefetch I, and then issue the read of A[I] before the actual read of I completes. However, suppose now there is a cache miss to a different location X between the reads of I and A[I]. Even if we prefetched I, under sequential consistency we cannot execute and complete the read of A[I] before the read of X has completed, so we have to wait for the long-latency miss to X. To increase the overlap, the second technique that a processor can use is speculative read operations. 
These read operations use the results of previous operations (often prefetches) that have not yet completed as the addresses to read from, speculating that the value used as the address will not change between now and the time when all previous operations have completed. In the above example, we can use the value prefetched or read for I and issue, execute and complete the read of A[I] before the read of X completes.
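Written out as code, the scenario above looks like the following hypothetical fragment (I, X and A simply stand for shared locations whose reads miss in the cache):

    /* Hypothetical fragment restating the example above in code form. */
    extern volatile int I;     /* index, read first (cache miss)              */
    extern volatile int X;     /* unrelated location, read next (cache miss)  */
    extern volatile int A[];   /* array entry read last (cache miss)          */

    int example(void)
    {
        int i = I;     /* hardware prefetching can fetch this miss early      */
        int x = X;     /* long-latency miss in between                        */
        int a = A[i];  /* address unknown until I returns; under SC it also
                          may not complete before the read of X completes     */
        return a + x;
    }
    /* A speculative read uses the value prefetched or read for I to issue,
       execute and complete the read of A[i] before X completes, speculating
       that I does not change in the meantime; if I's block is invalidated,
       updated or replaced first, the speculative read and the instructions
       after it are rolled back, as discussed next.                           */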
What we need to be sure of is that we used the correct value for I and hence read the correct value for A[I], i.e. that they are not modified between the time of the speculative read and the time when it is possible to perform the actual read according to the consistency model. For this reason, speculative reads are loaded into not the registers themselves but a buffer called the speculative read buffer. This buffer “watches” the cache and hence is apprised of cache actions to those blocks. If an invalidation, update or even cache replacement occurs on a block whose address is in the speculative read buffer before the speculatively executed read has reached the front of the reorder buffer and retired, then the speculative read and all instructions following it in the reorder buffer must be cancelled and the execution rolled back just as on a mispredicted branch. Such hardware prefetching and speculative read support are present in processors such as the Pentium Pro [Int96], the MIPS R10000 [MIP96] and the HP-PA8000 [Hun96]. Note that under SC, every read issued before previous memory operations have completed is a speculative read and goes into the speculative read buffer, whereas under RC reads are speculative (and have to watch the cache) only if they are issued before a previous acquire has completed. A final optimization that is used to increase overlap is the use of a write buffer even in a dynamically scheduled processor. Instead of writes having to wait at the head of the reorder buffer until they complete, holding up the reorder buffer, they are removed from the reorder buffer and placed in the write buffer. The write buffer allows them to become visible to the extended memory hierarchy and keeps track of their completion as required by the consistency model. Write buffers are clearly useful with relaxed models like RC and PC, in which reads are allowed to bypass writes: The reads will now reach the head of the write buffer and retire more quickly. Under SC, we can put writes in the write buffer, but a read stalls when it reaches the head of the reorder buffer anyway until the write completes. Thus, it may be that much of the latency that the store would have seen is not hidden but is instead seen by the next read to reach the head of the reorder buffer. Performance Impact Simulation studies in the literature have examined the extent to which these techniques can hide latency under different consistency models and different combinations of techniques. The studies do not use sophisticated compiler technology to schedule instructions in a way that can obtain increased benefits from dynamic scheduling and speculative execution, for example by placing more independent instructions close after a memory operation or miss so smaller reorder buffers will suffice. A study assuming an RC model finds that a substantial portion of read latency can indeed be hidden using a dynamically scheduled processor with speculative execution (let’s call this a DS processor), and that the amount of read latency that can be hidden increases with the size of the reorder buffer even up to buffers as large as 64 to 128 entries (the longer the latency to be hidden, the greater the lookahead needed and hence the larger the reorder buffer) [KGH92]. A more detailed study has compared the performance of the SC and RC consistency models with aggressive, four-issue DS processors [PRA+96]. It has also examined the benefits obtained individually from hardware prefetching and speculative loads in each case. 
Because a write takes longer to complete even on an L1 hit when the L1 cache is write-through rather than write-back, the study examined both types of L1 caches, always assuming a write-back L2 cache. While the study uses very small problem sizes and scaled down caches, the results shed light on the interactions between mechanisms and models. The most interesting question is whether with all these optimizations in DS processors, RC still buys substantial performance gains over SC. If not, the programming burden of relaxed consistency models may not be justified with modern DS processors (relaxed models are still important
at the programmer’s interface to allow compiler optimizations, but the contract and the programming burden may be lighter if it is only the compiler that may reorder operations). The results of the study, shown for two programs in Figures 11-17 and 11-18, indicate that RC is still beneficial, even though the gap is closed substantially. The figures show the results for SC without any of the more sophisticated optimizations (hardware prefetching, speculative reads, and write buffering), and then with those applied cumulatively one after the other. Results are also shown for processor consistency (PC) and for RC. These cases always assume write buffering, and are shown both without the other two optimizations and then with those applied cumulatively.
[Figure 11-17 chart omitted: stacked bars of execution time, broken into busy, pipeline stall, read, write, and synchronization components, for the SC, PC, and RC variants described in the caption below, with write-through and write-back first-level caches.]
Figure 11-17 Performance of an FFT kernel with different consistency models, assuming a dynamically scheduled processor with speculative execution. The bars on the left assume write-through first-level caches, and the bars on the right assume write-back first-level caches. Second-level caches are always write-back. SC, PC and RC are sequential, processor, and release consistency, respectively. For SC, there are two sets of bars: the first assuming no write buffer and the second including a write buffer. For each set of three bars, the first bar excludes hardware prefetching and speculative loads, the second bar (PF) includes the use of hardware prefetching, and the third bar (SR) includes the use of speculative read operations as well. The processor model assumed resembles the MIPS R10000 [MIP96]. The processor is clocked at 300MHz and capable of issuing 4 instructions per cycle. It used a reorder buffer of size 64 entries, a merging write buffer with 8 entries, a 4KB direct-mapped first-level cache and a 64 KB, 4-way set associative second level cache. Small caches are chosen since the data sets are small, and they may exaggerate the effect of latency hiding. More detailed parameters can be found in [PRA+96].
[Figure 11-18 chart omitted: stacked bars of execution time for Radix, broken into the same components and for the same consistency-model variants as in Figure 11-17.]
Figure 11-18 Performance of Radix with different consistency models, assuming a dynamically scheduled processor with speculative execution. The system assumptions and organization of the figure are the same as in Figure 11-17.
Clearly, the results show that RC has substantial advantages over SC even with a DS processor when hardware prefetching and speculative reads are not used. The main reason is that RC is able to hide write latency much more successfully than SC, as with simple non-DS processors, since it allows writes to be retired faster and allows later accesses like reads to be issued and complete
before previous writes. The improvement due to RC is particularly large with write-through caches, since under SC a write that reaches the head of the buffer issues to the memory system but has to wait until the write performs in the second-level cache, holding up later operations, while under RC it can retire when it issues and before it completes. While read latency is hidden with some success even under SC, RC allows for much earlier issue, completion and hence retirement of reads. The effects of hardware prefetching and speculative reads are much greater for SC than for RC, as expected from the discussion above (since in RC operations are already allowed to issue and complete out of order, the techniques only help those operations that follow closely after acquires). However, a significant gap still remains compared to RC, especially in write latency. Write buffering is not very useful under SC for the reason discussed above. The figure also confirms that write latency is a much more significant problem with write-through first-level caches under SC than with write-back caches, but is hidden equally well under RC with both types of caches. The difference in write latency between write-through and write-back caches under SC is much larger for FFT than for Radix. This is because while in FFT the writes are to locally allocated data and hit in the write-back cache for this problem size, in Radix they are to nonlocal data and miss in both cases. Overall, PC is in between SC and RC, with hardware prefetching helping but speculative reads not helping as much as in SC.
The graphs also appear to reveal an anomaly: the processor busy time is different in the different schemes, even for the same application, problem size and memory system configuration. The reason is an interesting methodological point for superscalar processors: since several (here four) instructions can be issued every cycle, how do we decide whether to attribute a cycle to busy time or to a particular type of stall time? The decision made in most studies in the literature, and in these results, is to attribute a cycle to busy time only if all instruction slots issue in that cycle. If not, then the cycle is attributed to whatever form of stall is seen by the first instruction (starting from the head of the reorder buffer) that should have issued in that cycle but did not. Since this pattern changes with consistency models and implementations, the busy time is not the same across schemes.
While the main results have already been discussed, it is interesting to examine the interactions with hardware prefetching and speculative reads in a little more detail, particularly in how they interact with application characteristics. The benefits from this form of hardware prefetching are limited by several effects. First, prefetches are issued only for operations that are in the reorder buffer, so they cannot be issued very far in advance. Prefetching works much more successfully when a number of operations that will otherwise miss are bunched up together in the code, so they can all be issued together from the reorder buffer. This happens naturally in the matrix transposition phase of an FFT, so the gains from prefetching are substantial, but it can be aided in other programs by appropriate scheduling of operations by the compiler. Another limitation is the one that motivated speculative loads in the first place: the address to be prefetched may not be known, and the read that returns it may be blocked by other operations.
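The kind of dependence involved is easy to see in code. The fragment below is only a hypothetical sketch in the spirit of a radix-style permutation step, not the actual Radix code used in the study; the point is that the address of the store depends on a value that must itself be read first.

    /* Hypothetical sketch of a permutation step with an indirect store. */
    void permute(const int *key, const int *rank, int *dst, int n)
    {
        for (int i = 0; i < n; i++) {
            int r = rank[key[i]];   /* read miss: rank/histogram entry        */
            dst[r] = key[i];        /* miss whose address is unknown until the
                                       read of r returns, so it cannot be
                                       prefetched far in advance              */
        }
    }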
This effect is encountered in the Radix sorting application, shown in Figure 11-18. Radix also encounters misses close to each other in time, but the misses (both write misses to the output array in the permutation phase and read misses in building the global histograms) are to array entries indexed by histogram values that have to be read as well. Speculative loads help the performance of Radix tremendously, as expected, improving both read and write miss stall times in this case. An interesting effect in both programs is that the processor busy time is reduced as well. This is because the ability to consume the values of reads speculatively makes a lot more otherwise dependent instructions available to execute, greatly increasing
the utilization of the 4-way superscalar processor. We see that speculative loads help reduce read latency in FFT as well (albeit less), even though it does not have indirect array accesses. The reason is that conflict misses in the cache are reduced due to greater combining of accesses to an outstanding cache block (accesses to blocks that conflict in the cache and would otherwise have caused conflict misses are overlapped and hence combined in the mechanism used to keep track of outstanding misses). This illustrates that sometimes important observed effects are not directly due to the feature being studied. While speculative reads, and speculative execution in general, can be hurt by mis-speculation and consequent rollbacks, this hardly occurs at all in these programs, which take quite predictable and straightforward paths through the code. The results may be different for programs with highly unpredictable control flow and access patterns. (Rollbacks are even fewer under RC, where there is less need for speculative loads.) 11.6.3 Implementation Issues To proceed past writes, the only support we need in the processor is a write buffer and nonblocking writes. Increasing sophistication in the write buffer (merging to arbitrary blocks, early retirement before obtaining ownership) allows both reads and writes to effectively bypass previous writes. To proceed effectively past reads, we need not just nonblocking reads but also support for instruction lookahead and speculative execution. At the memory system level, supporting multiple outstanding references (particularly cache misses) requires that caches be lockup-free. Lockup-free caches also find it easier to support multiple outstanding write misses (they are simply buffered in a write buffer) than to support multiple outstanding read misses: For reads they also need to keep track of where the returned data is to be placed back in the processor environment (see the discussion of lockup-free cache design in Section 11.9). In general, once this machinery close to the processor takes care of tracking memory operations and preserving the necessary orders of the consistency model, the rest of the memory and interconnection system can reorder transactions related to different locations as it pleases. In previous chapters we have discussed implementation issues for sequential consistency on both bus-based and scalable multiprocessors. Let us briefly look here at what other mechanisms we need to support relaxed consistency models; in particular, what additional support is needed close to the processor to constrain reorderings and satisfy the consistency model, beyond the uniprocessor mechanisms and lockup-free caches. Recall from Chapter 9 that the types of reorderings allowed by the processor have affected the development of consistency models; for example, models like processor consistency, TSO and PSO were attractive when processors did not support non-blocking reads. However, compiler optimizations call for more relaxed models at the programmer’s interface regardless of what the processor supports, and particularly with more sophisticated processors today the major choice seems to be between SC and models that allow all forms of reordering between reads and writes between synchronizations (weak ordering, release consistency, SUN RMO, DEC Alpha, IBM PowerPC). The basic mechanism we need is to keep track of the completion of memory operations. But this was needed in SC as well, even for the case of a single outstanding memory operation. 
Thus, we need explicit acknowledgment transactions for invalidations/updates that are sent out, and a mechanism to determine when all necessary acknowledgments have been received. For SC, we need to determine when the acknowledgments for each individual write have arrived; for RC, we need only determine at a release point that all acknowledgments for previous writes have arrived. For different types of memory barriers or fences, we need to ensure that all relevant operations before the barrier or fence have completed. As was discussed for SC in the context of the SGI
Origin2000 case study in Chapter 8, a processor usually expects only a single response to its write requests, so coordinating the receipt of data replies and acknowledgments is usually left to an external interface. For example, to implement a memory barrier or a release, we need a mechanism to determine when all previously issued writes and/or reads have completed. Conceptually, each processor (interface) needs a counter that keeps track of the number of uncompleted reads and writes issued by that processor. The memory barrier or release operation can then check whether or not the appropriate counters are zero before allowing further memory references to be issued by that processor. For some models like processor consistency, in which writes are non-atomic, implementing memory barrier or fence instructions can become quite complicated (see [GLL+90, Gha95] for details).
Knowledge of FIFO orderings in the system buffers can allow us to optimize the implementation of some types of fences or barriers. An example is the write-write memory barrier (also called a store barrier) used to enforce write-to-write order in the PSO model when using a write buffer that otherwise allows writes to be reordered. On encountering such a MEMBAR, it is not necessary to stall the processor immediately to wait for previous writes to complete. The MEMBAR instruction can simply be inserted into the write buffer (like any other write), and the processor can continue on and even execute subsequent write operations. These writes are inserted behind the MEMBAR in the write buffer. When the MEMBAR reaches the head of the write buffer, the write buffer (not necessarily the processor itself) stalls, waiting for all previous stores to complete before allowing any later stores to become visible to the external memory system. A similar technique can be used with reorder buffers in speculative processors. Of course, one mechanism that is needed in the relaxed models is support for the order-preserving instructions themselves: memory barriers, fences, and instructions labeled as synchronization, acquire or release.
Finally, for most of the relaxed models we need mechanisms to ensure write atomicity. This, too, was needed for SC, and we simply need to ensure that the contents of a memory block for which invalidation acknowledgments are pending are not given to any new processor requesting them until the acknowledgments are received. Thus, given the support to proceed past writes and/or reads provided by the processor and caches, most of the mechanisms needed for relaxed models are already needed to satisfy the sufficient conditions for SC itself.
While the implementation requirements are quite simple, they inevitably lead to tradeoffs. For example, for counting acknowledgments the location of the counters can be quite critical to performance. Two reasonable options are: (i) close to the memory controller, and (ii) close to the first-level cache controller. In the former case, updates to the counters upon returning acknowledgments do not need to propagate up the cache hierarchy to the processor chip; however, the time to execute a memory barrier itself becomes quite large because we have to traverse the hierarchy and cross the memory bus to query the counters. The latter option has the reverse effect, with greater overhead for updating counter values but lower overhead for fence instructions.
The correct decision can be made on the basis of the frequency with which fences will be used relative to the occurrence of cache misses. For example, if we use fences to implement a strict model like SC on a machine which does not preserve orders (like the DEC Alpha), we expect fences to be very frequent and the option of placing the counters near the memory controller to perform poorly.1
1. There are many other options too. For example, we might have a special signal pin on the processor chip that is connected to the external counter and indicates whether or not the counter is zero (to implement the memory barrier we only need to know whether the counter is zero, not its actual value). However, this can get quite tricky. For example, how does one ensure that there are no memory accesses issued before the memory barrier that are still propagating from the processor to the memory and that the counter does not yet even know about? When querying the external counter directly this can be taken care of by FIFO ordering of transactions to and from the memory; with a shortcut pin, it becomes difficult.
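To make the completion-counter mechanism described above concrete, here is a minimal C sketch of per-processor counters and a memory-barrier check. It is only an idealized model of the bookkeeping (the names CompletionCounters, note_write_issued and membar_wait are invented); a real implementation lives in the processor's external interface and, as discussed above, may defer the check until a MEMBAR reaches the head of the write buffer so that the write buffer rather than the processor stalls.

    #include <stdatomic.h>

    /* One set of counters per processor (node interface). */
    typedef struct {
        atomic_int reads_pending;   /* reads issued but not yet completed
                                       (read-side bookkeeping analogous, omitted) */
        atomic_int writes_pending;  /* writes whose acknowledgments are still due */
    } CompletionCounters;

    static void note_write_issued(CompletionCounters *c) {
        atomic_fetch_add(&c->writes_pending, 1);
    }

    /* Called by the external interface when the data reply and the last
       invalidation acknowledgment for a write have arrived.                  */
    static void note_write_complete(CompletionCounters *c) {
        atomic_fetch_sub(&c->writes_pending, 1);
    }

    /* Memory barrier / release: wait until all previously issued writes (and,
       for a full barrier, reads) have completed. Under RC this check is needed
       only at a release; under SC it is in effect applied per operation.      */
    static void membar_wait(CompletionCounters *c, int wait_for_reads) {
        while (atomic_load(&c->writes_pending) != 0 ||
               (wait_for_reads && atomic_load(&c->reads_pending) != 0))
            ;  /* spin; hardware would simply hold up further issue instead    */
    }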
11.6.4 Summary The extent to which latency can be tolerated by proceeding past reads and writes in a multiprocessor depends on both the aggressiveness of the implementation and the memory consistency model. Tolerating write latency is relatively easy in cache-coherent multiprocessors, as long as the memory consistency model is relaxed. Both read and write latency can be hidden in sophisticated modern processors, but still not entirely. The instruction lookahead window (reorder buffer) sizes needed to hide read latency can be substantial, and grow with the latency to be hidden. Fortunately, latency is increasingly hidden as window size increases, rather than needing a very large threshold size before any significant latency hiding takes place. In fact, with these sophisticated mechanisms that are widely available in modern multiprocessors, read latency can be hidden quite well even under sequential consistency. In general, conservative design choices—such as blocking reads or blocking caches—make preserving orders easier, but at the cost of performance. For example, delaying writes until all previous instructions are complete in dynamically scheduled processors to avoid rolling back on writes (e.g. for precise exceptions) makes it easier to preserve SC, but it makes write latency more difficult to hide under SC. Compiler scheduling of instructions can help dynamically scheduled processors hide read latency even better. Sophisticated compiler scheduling can also help statically scheduled processors (without hardware lookahead) to hide read latency by increasing the distance between a read and its dependent instructions. The implementation requirements for relaxed consistency models are not large, beyond the requirements already needed (and provided) to hide write and read latency in uniprocessors. The approach of hiding latency by proceeding past operations has two significant drawbacks. The first is that it may very well require relaxing the consistency model to be very effective, especially with simple statically scheduled processors but also with dynamically scheduled ones. This places a greater burden on the programmer to label synchronization (competing) operations or insert memory barriers, although relaxing the consistency model is to a large extent needed to allow compilers to perform many of their optimizations anyway. The second drawback is the difficulty of hiding read latency effectively with processors that are not dynamically scheduled, and the resource requirements of hiding multiprocessor read latencies with dynamically scheduled processors. In these situations, other methods like precommunication and multithreading might be more successful at hiding read latency.
11.7 Precommunication in a Shared Address Space
Precommunication, especially prefetching, is a technique that has already been widely adopted in commercial microprocessors, and its importance is likely to increase in the future. To understand
the techniques for pre-communication, let us first consider a shared address space with no caching of shared data, so all data accesses go to the relevant main memory. This will introduce some basic concepts, after which we shall examine prefetching in a cache-coherent shared address space, including performance benefits and implementation issues.
11.7.1 Shared Address Space With No Caching of Shared Data
In this case, receiver-initiated communication is triggered by reads of nonlocally allocated data, and sender-initiated communication by writes to nonlocally allocated data. In our example code in Figure 11-1 on page 755, the communication is sender-initiated if we assume that the array A is allocated in the local memory of process PB, not PA. As with the sends in the message-passing case, the most precommunication we can do is perform all the writes to A before we compute any of the f(B[]), by splitting into two loops (see the left-hand side of Figure 11-19). By making the writes nonblocking, some of the write latency may be overlapped with the f(B[]) computations; however, much of the write latency would probably have been hidden anyway if writes were made nonblocking and left where they were. The more important effect is to have the writes and the overall computation on PA complete earlier, hence allowing the reader, PB, to emerge from its while loop more quickly.
If the array A is allocated in the local memory of PA, the communication is receiver-initiated. The writes by the producer PA are now local, and the reads by the consumer PB are remote. Precommunication in this case means prefetching the elements of A before they are actually needed, just as we issued a_receive’s before they were needed in the message-passing case. The difference is that the prefetch is not just posted locally, like the receive, but rather causes a prefetch request to be sent across the network to the remote node (PA), where the controller responds to the request by actually transferring the data back. There are many ways to implement prefetching, as we shall see. One is to issue a special prefetch instruction, and build a software pipeline as in the message-passing case. The shared address space code with prefetching is shown in Figure 11-19.
Process PA:
    for i ← 0 to n-1 do
        compute A[i];
        write A[i];
    end for
    flag ← 1;
    for i ← 0 to n-1 do
        compute f(B[i]);
    end for

Process PB:
    while flag = 0 {};
    prefetch(A[0]);
    for i ← 0 to n-2 do
        prefetch(A[i+1]);
        read A[i] from prefetch_buffer;
        use A[i];
        compute g(C[i]);
    end for
    read A[n-1] from prefetch_buffer;
    use A[n-1];
    compute g(C[n-1]);

Figure 11-19 Prefetching in the shared address space example.
The software pipeline has a prologue that issues a prefetch for the first iteration, a steady-state period of n-1 iterations in which a prefetch is issued for the next iteration and the current iteration is executed, and an epilogue consisting of the work for the last iteration. Note that a prefetch instruction does not replace the actual read of the data item, and that the prefetch instruction itself must be nonblocking (must not stall the processor) if it is to achieve its goal of hiding latency through overlap. Since shared data are not cached in this case, the prefetched data are brought into a special hardware structure called a prefetch buffer. When the word is actually read into a register in the next iteration, it is read from the head of the prefetch buffer rather than from memory. If the latency to hide were much larger than the time to compute a single loop iteration, we would prefetch several iterations ahead and there would potentially be several words in the prefetch buffer at a time. However, if we can assume that the network delivers messages from one point to another in the order in which they were transmitted from the source, and that all the prefetches from a given processor in this loop are directed to the same source, then we can still simply read from the head of the prefetch buffer every time. The Cray T3D multiprocessor and the DEC Alpha processor it uses provide this form of prefetching.
Even if data cannot be prefetched early enough that they have arrived by the time the actual reference is made, prefetching may have several benefits. First, if the actual reference finds that the address it is accessing has an outstanding prefetch associated with it, then it can simply wait for the remaining time until the prefetched data returns. Also, depending on the number of prefetches allowed to be outstanding at a time, prefetches can be issued back to back to overlap their latencies. This can provide the benefits of pipelined data movement, although the pipeline rate may be limited by the overhead of issuing prefetches.
11.7.2 Cache-coherent Shared Address Space
In fact, precommunication is much more interesting in a cache-coherent shared address space: shared nonlocal data may now be precommunicated directly into a processor’s cache rather than a special buffer, and precommunication interacts with the cache coherence protocol. We shall therefore discuss the techniques in more detail in this context. Consider an invalidation-based coherence protocol. A read miss fetches the data locally from wherever it is. A write that generates a read-exclusive fetches both data and ownership (by informing the home, perhaps invalidating other caches, and receiving acknowledgments), and a write that generates an upgrade fetches only ownership. All of these “fetches” have latency associated with them, so all are candidates for prefetching (we can prefetch data or ownership or both).
Update-based coherence protocols generate sender-initiated communication, and like other sender-initiated communication techniques they provide a form of precommunication from the viewpoint of the destinations of the updates. While update protocols are not very prevalent, techniques to selectively update copies can be used for sender-initiated precommunication even with an underlying invalidation-based protocol. One possibility is to insert software instructions that generate updates only on certain writes, to contain the unnecessary traffic due to useless updates; hybrid update-invalidate protocols are another. Some such techniques will be discussed later.
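Returning to the invalidation-based case, the difference between prefetching data and prefetching ownership can be made concrete with a small sketch. The intrinsics prefetch_shared() and prefetch_exclusive() below are hypothetical stand-ins for whatever read and read-exclusive (or prefetch-exclusive) instructions a given machine provides; both are assumed to be nonbinding hints.

    /* Hypothetical nonbinding prefetch intrinsics. */
    void prefetch_shared(const void *addr);   /* fetch data for a later read          */
    void prefetch_exclusive(void *addr);      /* fetch data and ownership for a write */

    void scale(double *dst, const double *src, double a, int n, int d)
    {
        /* d is the prefetch distance in iterations (scheduling is discussed
           later in this section). */
        for (int i = 0; i < n; i++) {
            if (i + d < n) {
                prefetch_shared(&src[i + d]);     /* will only be read         */
                prefetch_exclusive(&dst[i + d]);  /* will be written: bring
                                                     ownership too, so the later
                                                     read-exclusive or upgrade
                                                     is not on the critical path */
            }
            dst[i] = a * src[i];
        }
    }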
Prefetching Concepts
There are two broad categories of prefetching applicable to multiprocessors and uniprocessors: hardware-controlled and software-controlled prefetching. In hardware-controlled prefetching, no special instructions are added to the code: special hardware is used to try to predict future accesses from observed behavior, and to prefetch data based on these predictions. In software-controlled prefetching, the decisions of what and when to prefetch are made by the programmer or compiler (hopefully the compiler!) by static analysis of the code, and prefetch instructions are inserted in the code based on this analysis. Tradeoffs between hardware- and software-controlled prefetching will be discussed later in this section. Prefetching can also be combined with block data transfer in block prefetch (receiver-initiated prefetch of a large block of data) and block put (sender-initiated) techniques. Let us begin by discussing some important concepts in prefetching in a cache-coherent shared address space.
A key issue that dictates how early a prefetch can be issued in a multiprocessor, in both software- and hardware-controlled prefetching, is whether prefetches are binding or non-binding. A binding prefetch means that the value of the prefetched datum is bound at the time of the prefetch; i.e. when the process later reads the variable through a regular read, it will see the value that the variable had when it was prefetched, even if the value has been modified by another processor and become visible to the reader’s cache in between the prefetch and the actual read. The prefetch we discussed in the non-cache-coherent case (prefetching into a prefetch buffer, see Figure 11-19) is typically a binding prefetch, as is prefetching directly into processor registers. A non-binding prefetch means that the value brought by a prefetch instruction remains subject to modification or invalidation until the actual operation that needs the data is executed. For example, in a cache-coherent system, the prefetched data are still visible to the coherence protocol; prefetched data are brought into the cache rather than into a register or prefetch buffer (which are typically not under control of the coherence protocol), and a modification by another processor that occurs between the time of the prefetch and the time of the use will update or invalidate the prefetched block according to the protocol. This means that non-binding prefetches can be issued at any time in a parallel program without affecting program semantics. Binding prefetches, on the other hand, affect program semantics, so we have to be careful about when we issue them. Since non-binding prefetches can be generated as early as desired, they have performance advantages as well.
The other important concepts have to do with the questions of what data to prefetch, and when to initiate prefetches. Determining what to prefetch is called analysis, and determining when to initiate prefetches is called scheduling. The concepts that we need to understand for these two aspects of prefetching are those of prefetches being possible, obtaining good coverage, being unnecessary, and being effective [Mow94]. Prefetching a given reference is considered possible only if the address of the reference can be determined ahead of time. For example, if the address can be computed only just before the word is referenced, then it may not be possible to prefetch.
This is an important consideration for applications with irregular and dynamic data structures implemented using pointers, such as linked lists and trees. The coverage of a prefetching technique is the percentage of the original cache misses (without any prefetching) that the technique is able to prefetch. “Able to prefetch” can be defined in different ways; for example, able to prefetch early enough that the actual reference hits and all the
latency is hidden, or able to issue a prefetch earlier than just before the reference itself so that at least a part of the latency is hidden. Achieving high coverage is not the only goal, however. In addition to not issuing prefetches for data that will not be accessed by the processor, we should not issue prefetches for data that are already in the cache to which data are prefetched, since these prefetches will just consume overhead and cache access bandwidth (interfering with regular accesses) without doing anything useful. More is not necessarily better for prefetching. Both these types of prefetches are called unnecessary prefetches. Avoiding them requires that we analyze the data locality in the application’s access patterns and how it interacts with the cache size and organization, so that prefetches are issued only for data that are not likely to be present in the cache. Finally, timing and luck play important roles in prefetching. A prefetch may be possible and not unnecessary, but it may be initiated too late to hide the latency from the actual reference. Or it may be initiated too early, so it arrives in the cache but is then either replaced (due to capacity or mapping conflicts) or invalidated before the actual reference. Thus, a prefetch should be effective: early enough to hide the latency, and late enough so the chances of replacement or invalidation are small. As important as what and when to prefetch, then, are the questions of what not to prefetch and when not to issue prefetches. The goal of prefetching analysis is to maximize coverage while minimizing unnecessary prefetches, and the goal of scheduling to maximize effectiveness. Let us now consider hardware-controlled and software-controlled prefetching in some detail, and see how successfully they address these important aspects. Hardware-controlled Prefetching The goal in hardware-controlled prefetching is to provide hardware that can detect patterns in data accesses at runtime. Hardware-controlled prefetching assumes that accesses in the near future will follow these detected patterns. Under this assumption, the cache blocks containing those data can be prefetched and brought into the processor cache so the later accesses may hit in the cache. The following discussion assumes nonbinding prefetches. Both analysis and scheduling are the responsibility of hardware, with no special software support, and both are performed dynamically as the program executes. Analysis and scheduling are also very closely coupled, since the prefetch for a cache block is initiated as soon as it is determined that the block should be prefetched: It is difficult for hardware to make separate decisions about these. Besides coverage, effectiveness and minimizing unnecessary prefetches, the hardware has two other important goals:
• it should be simple, inexpensive, and not take up too much area
• as far as possible, it should not be in the critical path of the processor cycle time
Many simple hardware prefetching schemes have been proposed. At the most basic level, the use of long cache blocks itself is a form of hardware prefetching, exploited well by programs with good spatial locality. Essentially no analysis is used to restrict unnecessary prefetches, the coverage depends on the degree of spatial locality in the program, and the effectiveness depends on how much time elapses between when the processor accesses the first word (which issues the cache miss) and when it accesses the other words on the block. For example, if a process simply traverses a large array with unit stride, the coverage of the prefetching with long cache blocks
will be quite good (75% for a cache block of four words) and there will be no unnecessary prefetches, but the effectiveness is likely not to be great since the prefetches are issued too late. One-block lookahead (OBL) schemes extend the idea of long cache blocks: a reference to cache block i may trigger a prefetch of block i+1. There are several variants for the analysis and scheduling that may be used in this technique; for example, prefetching i+1 whenever a reference to i is detected, only if a cache miss to i is detected, or when i is referenced for the first time after it is prefetched. Extensions may include prefetching several blocks ahead (e.g. blocks i+1, i+2, and i+3) instead of just one [DDS95], an adaptation of the idea of stream buffers in uniprocessors, where several subsequent blocks are prefetched into a separate buffer rather than the cache when a miss occurs [Jou90]. Such techniques are useful when accesses are mostly unit-stride (the stride is the gap between data addresses accessed by consecutive memory operations).
The next step is to detect and prefetch accesses with non-unit or large stride. A simple way to do this is to keep the address of the previously accessed data item for a given instruction (i.e. a given program counter value) in a history table indexed by program counter (PC). When the same instruction is issued again (e.g. in the next iteration of a loop), if it is found in the table, the stride is computed as the difference between the current data address and the one in the history table for that instruction, and a prefetch is issued for the data address ahead of the current one by that stride [FuP91,FPJ92]. The history table is thus managed much like a branch history table. This scheme is likely to work well when the stride of access is fixed but is not one; however, most other prefetching schemes that we shall discuss, hardware or software, are likely to work well in this case.
The schemes so far find ways to detect simple regular patterns, but do not guard against unnecessary prefetches when references do not follow a regular pattern. For example, if the same stride is not maintained between three successive accesses by the same instruction, then the previous scheme will prefetch useless data that the processor does not need. The traffic generated by these unnecessary prefetches can be very detrimental to performance, particularly in multiprocessors, since it consumes valuable resources, causing contention and getting in the way of useful regular accesses. In more sophisticated hardware prefetching schemes, the history table stores not only the data address accessed the last time by an instruction but also the stride between the previous two addresses accessed by it [BaC91,ChB92]. If the next data address accessed by that instruction is separated from the previous address by the same stride, then a regular-stride pattern is detected and a prefetch may be issued. If not, then a break in the pattern is detected and a prefetch is not issued, thus reducing unnecessary prefetches. In addition to the address and stride, the table entry also contains some state bits that keep track of whether the accesses by this instruction have recently been in a regular-stride pattern, have been in an irregular pattern, or are transitioning into or out of a regular-stride pattern. A set of simple rules is used to determine, based on the stride match and the current state, whether or not to potentially issue a prefetch.
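A minimal sketch of one entry of such a per-instruction stride table follows. It is in the spirit of the schemes cited above [BaC91,ChB92], but the particular states, fields and transition rules here are invented for illustration.

    typedef enum { IRREGULAR, TRANSIENT, STEADY } StrideState;

    typedef struct {
        unsigned long pc;         /* instruction (program counter) this entry tracks */
        unsigned long last_addr;  /* data address of the previous access             */
        long          stride;     /* difference between the previous two addresses   */
        StrideState   state;
    } StrideEntry;

    /* Returns the address to prefetch, or 0 if no prefetch should be issued.
       (The cache is still consulted afterwards to filter blocks already present.) */
    static unsigned long update_entry(StrideEntry *e, unsigned long addr)
    {
        long new_stride = (long)(addr - e->last_addr);
        int  match      = (new_stride == e->stride);

        /* Simplified rules: two matching strides in a row mean a steady
           pattern; a mismatch drops back toward irregular and suppresses
           prefetching, reducing unnecessary prefetches.                        */
        if (match)
            e->state = (e->state == IRREGULAR) ? TRANSIENT : STEADY;
        else
            e->state = (e->state == STEADY) ? TRANSIENT : IRREGULAR;

        e->stride    = new_stride;
        e->last_addr = addr;

        return (e->state == STEADY) ? addr + (unsigned long)e->stride : 0;
    }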
If the result is to potentially prefetch, the cache is looked up to see if the block is already there, and if not then the prefetch is issued. While this scheme improves the analysis, it does not yet do a great job of scheduling prefetches. In a loop, it will prefetch only one loop iteration ahead. If the amount of work to do in a loop iteration is small compared to the latency that needs to be hidden, this will not be sufficient to tolerate the latency. The prefetched data will arrive too late and the processor will have to stall. On the other hand, if the work in a loop iteration is too much, the prefetched data will arrive too early
and may replace other useful blocks before being used. The goal of scheduling is to achieve just-in-time prefetching, that is, to have a prefetch issued about l cycles before the instruction that needs the data, where l is the latency we want to hide, so that a prefetched datum arrives just before the actual instruction that needs it is executed and the prefetch is likely to be effective (see Section 11.7.1). This means that the prefetch should be issued about l/b loop iterations (or times the
instruction is encountered) ahead of the actual instruction that needs the data, where b is the predicted execution time of an iteration of the loop body. One way to implement such scheduling in hardware is to use a “lookahead program counter” (LA-PC) that tries to remain l cycles ahead of the actual current PC, and that is used to access the history table and generate prefetches instead of (but in conjunction with) the actual PC. The LAPC starts out a single cycle ahead of the regular PC, but keeps being incremented even when the processor (and PC) stalls on a cache miss, thus letting it get ahead of the PC. The LA-PC also looks up the branch history table, so the branch prediction mechanism can be used to modify it when necessary and try to keep it on the right track. When a mispredicted branch is detected, the LA-PC is set back to being equal to PC+1. A limit register controls how much farther the LA-PC can get ahead of the PC. The LA-PC stalls when this limit is exceeded, or when the buffer of outstanding references is full (i.e. it cannot issue any prefetches). Both the LA-PC and the PC look up the prefetch history table every cycle (they are likely to access different entries). The lookup by the PC updates the “previous address” and the state fields in accordance with the rules, but does not generate prefetches. A new “times” field is added to the history table entries, which keeps track of the number of iterations (or encounters of the same instruction) that the LA-PC is ahead of the PC. The lookup by the LA-PC increments this field for the entry that it hits (if any), and generates an address for potential prefetching according to the rules. The address generated is the “times” field times the stride stored in the entry, plus the previous address field. The times field is decremented when the PC encounters that instruction in its lookup of the prefetch history table. More details on these hardware schemes can be found in [ChB92, Sin97]. Prefetches in these hardware schemes are treated as hints, since they are nonbinding, so actual cache misses get priority over them. Also, if a prefetch raises an exception (for example, a page fault or other violation), the prefetch is simply dropped rather than handling the exception. More elaborate hardware-controlled prefetching schemes have been proposed to try to prefetch references with irregular stride, requiring even more hardware support [ZhT95]. However, even the techniques described above have not found their way into microprocessors or multiprocessors. Instead, several microprocessors are preferring to provide prefetch instructions for use by software-controlled prefetching schemes. Let us examine software-controlled prefetching, before we examine its relative advantages and disadvantages with respect to hardware-controlled schemes. Software-controlled Prefetching In software-controlled prefetching, the analysis of what to prefetch and the scheduling of when to issue prefetches are not done dynamically by hardware, but rather statically by software. The compiler (or programmer) inserts special prefetch instructions into the code at points where it deems appropriate. As we saw in Figure 11-19, this may require restructuring the loops in the program to some extent as well. The hardware support needed is the provision of these nonblocking instructions, a cache that allows multiple outstanding accesses, and some mechanism
for keeping track of outstanding accesses (this is often combined with the write buffer in the form of an outstanding transactions buffer). The latter two mechanisms are in fact required for all forms of latency tolerance in a read-write based system. Let us first consider software-controlled prefetching from the viewpoint of a given processor trying to hide latency in its reference stream, without complications due to interactions with other processors. This problem is equivalent to prefetching in uniprocessors, except with a wider range of latencies. Then, we shall discuss the complications introduced by multiprocessing. Prefetching With a Single Processor Consider a simple loop, such as our example from Figure 11-1. A simple approach would be to always issue a prefetch instruction one iteration ahead on array references within loops. This would lead to a software pipeline like the one in Figure 11-19 on page 794, with two differences: The data are brought into the cache rather than into a prefetch buffer, and the later load of the data will be from the address of the data rather than from the prefetch buffer (i.e. read(A[i]) and use(A[i])). This can easily be extended to approximate “just-in-time” prefetching above by issuing prefetches multiple iterations ahead as discussed above. You will be asked to write the code for just-in-time prefetching for a loop like this in Exercise 11.9. To minimize unnecessary prefetches, it is important to analyze and predict the temporal and spatial locality in the program as well as trying to predict the addresses of future references. For example, blocked matrix factorization reuses the data in the current block many times in the cache, so it does not make sense to prefetch references to the block even if they can be predicted early. In the software case, unnecessary prefetches have an additional disadvantage beyond useless traffic and cache bandwidth (even when traffic is not generated): They introduce useless prefetch instructions in the code, which add execution overhead. The prefetch instructions are often placed within conditional expressions, and with irregular data structures extra instructions are often needed to compute the address to be prefetched, both of which increase the instruction overhead. How easy is it to analyze what references to prefetch in software? In particular, can a compiler do it, or must it be left to the programmer? The answer depends on the program’s reference patterns. References are most predictable when they correspond to traversing an array in some regular way. For example, a simple case to predict is when an array is referenced inside a loop or loop nest, and the array index is an affine function of the loop indices in the loop nest; i.e. a linear combination of loop index values. The following code shows an example: for i <- 1 to n for j <- 1 to n sum = sum + A[3i+5j+7]; end for end for. Given knowledge of the amount of latency we wish to hide for a reference, we can try to issue the prefetch that many iterations ahead in the relevant loop. Also, the amount of array data accessed and the spatial locality in the traversal are easy to predict, which makes locality analysis easy. The major complication in analyzing data locality is predicting cache misses due to mapping conflicts.
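One possible shape of such a software pipeline for a simple affine loop is sketched below (this is only an illustration, not the solution to Exercise 11.9). The prefetch() intrinsic is a hypothetical stand-in for the machine's nonbinding prefetch instruction, and the prefetch distance and block size are assumed values.

    #define PREFETCH_AHEAD  8   /* assumed distance: roughly latency / work per iteration */
    #define WORDS_PER_BLOCK 4   /* assumed cache block size, in array elements            */

    void prefetch(const void *addr);    /* hypothetical nonbinding prefetch instruction   */

    double sum_array(const double *A, int n)
    {
        double sum = 0.0;
        /* prologue: cover the first PREFETCH_AHEAD iterations */
        for (int i = 0; i < n && i < PREFETCH_AHEAD; i += WORDS_PER_BLOCK)
            prefetch(&A[i]);
        /* steady state: prefetch PREFETCH_AHEAD iterations ahead of the use,
           one prefetch per cache block to limit instruction overhead          */
        for (int i = 0; i < n; i++) {
            if (i % WORDS_PER_BLOCK == 0 && i + PREFETCH_AHEAD < n)
                prefetch(&A[i + PREFETCH_AHEAD]);
            sum += A[i];   /* ordinary load; no epilogue is needed because
                              iterations near the end simply issue no prefetches */
        }
        return sum;
    }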
A slightly more difficult class of references to analyze is indirect array references. e.g. for i <- 1 to n sum = sum + A[index[i]]; end for. While we can easily predict the values of i and hence the elements of the index array that will be accessed, we cannot predict the value in index[i] and hence the elements of A that we shall access. What we must do in order to predict the accesses to A is to first prefetch index[i] well enough in advance, and then use the prefetched value to determine the element of array A to prefetch. The latter requires additional instructions to be inserted. Consider scheduling. If the number of iterations that we would normally prefetch ahead is k, we should prefetch index[i] 2k iterations ahead so it returns k iterations before we need A[index[i]], at which point we can use the value of index[i] to prefetch A[index[i]] (see Exercise a). Analyzing temporal and spatial locality in these cases is more difficult than predicting addresses. It is impossible to perform accurate analysis statically, since we do not know ahead of time how many different locations of A will be accessed (different entries in the index array may have the same value) or what the spatial relationships among the references to A are. Our choices are therefore to either prefetch all references A[index[i]] or none at all, to obtain profile information about access patterns gathered at runtime and use it to make decisions, or to use higher-level programmer knowledge. Compiler technology has advanced to the point where it can handle the above types of array references in loops quite well [Mow94], within the constraints described above. Locality analysis [WoL91] is first used to predict when array references are expected to miss in a given cache (typically the first-level cache). This results in a prefetch predicate, which can be thought of as a conditional expression inside which a prefetch should be issued for a given iteration. Scheduling based on latency is then used to decide how many iterations ahead to issue the prefetches. Since the compiler may not be able to determine which level of the extended memory hierarchy the miss will be satisfied in, it may be conservative and assume the worst-case latency. Predicting conflict misses is particularly difficult. Locality analysis may tell us that a word should still be in the cache so a prefetch should not be issued, but the word may be replaced due to conflict misses and therefore would benefit from a prefetch. A possible approach is to assume that a small-associativity cache of C bytes effectively behaves like a fully-associative cache that is smaller by some factor, but this is not reliable. Multiprogramming also throws off the predictability of misses across process switches, since one process might pollute the cache state of another that is prefetching, although the time-scales are such that locality analysis often doesn’t assume state to stay in the cache that long anyway. Despite these problems, limited experiments have shown quite good potential for success with compiler-generated prefetching when most of the cache misses are to affine or indirect array references in loops [Mow94], as we shall see in Section 11.7.3. These include programs on regular grids or dense matrices, as well as sparse matrix computations (in which the sparse matrices use indirect array references, but most of the data are often stored in dense form anyway for efficiency). 
These experiments are performed through simulation, since real machines are only just beginning to provide effective underlying hardware support for software-controlled prefetching. Accesses that are truly difficult for a compiler to predict are those that involve pointers or recursive data structures (such as linked lists and trees). Unlike array indexing, traversing these data structures requires dereferencing pointers along the way; the address contained in the pointer
within a list or tree element is not known until that element is reached in the traversal, so it cannot be easily prefetched. Predicting locality for such data structures is also very difficult, since their traversals are usually highly irregular in memory. Prefetching in these cases currently must be done by the programmer, exploiting higher-level semantic knowledge of the program and its data structures as shown in the following example. Compiler analysis for prefetching recursive data structures is currently the subject of research [LuM96], and will also be helped by progress in alias analysis for pointers.
Example 11-2
Consider the tree traversal to compute the force on a particle in the Barnes-Hut application described in Chapter 3. The traversal is repeated for each particle assigned to the process, and consecutive particles reuse much of the tree data, which is likely to stay in the cache across particles. How should prefetches be inserted in this tree traversal code?
Answer
The traversal of the octree proceeds in a depth-first manner. However, if it is determined that a tree cell needs to be opened then all its eight children will be examined, as long as they are present in the tree. Thus, we can insert prefetches for all the children of a cell as soon as we determine that the cell will be opened (or we can speculatively issue prefetches as soon as we touch a cell and can therefore dereference the pointers to its children). Since we expect the working set to fit in the cache (which a compiler is highly unlikely to be able to determine), to avoid unnecessary prefetches we likely need to prefetch a cell only the first time that we access it (i.e. for the first particle that accesses it), not for subsequent particles. Cache conflicts may occur which cause unpredictable misses, but there is likely little we can do about that statically. Note that we need to do some work to generate prefetches (determine if this is the first time we are visiting the cell, access and dereference the child pointers etc.), so the overhead of a prefetch is likely to be several instructions. If the overhead is incurred a lot more often than successful prefetches are generated, it may outweigh the benefits of prefetching. One problem with this scheme is that we may not be prefetching early enough when memory latencies are high. Since we prefetch all the children at once, in most cases the depth-first work done for the first child (or two) should be enough to hide the latency of the rest of the children, but this may not be the case. The only way to improve this is to speculatively prefetch multiple levels down the tree when we encounter a cell, hoping that we will touch all the cells we prefetch and they will still be in the cache when we reach them. Other applications that use linked lists in unstructured ways may be even more difficult for a compiler or even programmer to prefetch successfully. Finally, sometimes a compiler may fail to determine what to prefetch because other aspects of the compiler technology are not up to the task. For example, the key references in a loop may be within a procedure call, and the compiler’s interprocedural analysis may not be sophisticated enough to detect that the references in the procedure are indeed affine functions of the loop indices. If the called procedure is in a different file, the compiler would have to perform interprocedural analysis or inlining across files. These sorts of situations are easy for the programmer, who has higher-level semantic knowledge of what is being accessed.
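The following sketch shows one way a programmer might place the prefetches described in this answer. The cell structure, the per-cell visited flag, and cell_must_be_opened() are illustrative assumptions rather than the actual Barnes-Hut code; only the placement of the prefetches follows the reasoning above.

#define NCHILD 8

typedef struct cell {
    struct cell *child[NCHILD];
    int          visited;        /* assumed per-traversal flag            */
    /* ... mass, center of mass, and other summary data ...               */
} cell_t;

/* Stands in for the Barnes-Hut cell-opening criterion.                   */
extern int cell_must_be_opened(const cell_t *c, const double pos[3]);

void traverse(cell_t *c, const double pos[3])
{
    if (c == NULL)
        return;
    if (!cell_must_be_opened(c, pos)) {
        /* ... use the cell's summary data directly ...                   */
        return;
    }
    if (!c->visited) {           /* prefetch children only on first visit */
        c->visited = 1;
        for (int k = 0; k < NCHILD; k++)
            if (c->child[k] != NULL)
                __builtin_prefetch(c->child[k], 0, 3);
    }
    for (int k = 0; k < NCHILD; k++)     /* depth-first over the children */
        traverse(c->child[k], pos);
}

The first-visit test and the pointer dereferences are the per-prefetch instruction overhead mentioned above; whether they pay off depends on how often they actually generate useful prefetches.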
Additional Complications in Multiprocessors
Two additional issues that we must consider when prefetching in parallel programs are prefetching communication misses, and prefetching with ownership. Both arise from the fact that other processors might also be accessing and modifying the data that a process references. In an invalidation-based cache-coherent multiprocessor, data may be removed from a processor’s cache—and misses therefore incurred—not only due to replacements but also due to invalidations. A similar locality analysis is needed for prefetching in light of communication misses: We should not prefetch data so early that they might be invalidated in the cache before they are used, and we should ideally recognize when data might have been invalidated so that we can prefetch them again before actually using them. Of course, because we are using nonbinding prefetching, these are all performance issues rather than correctness issues. We could insert prefetches into a process’s reference stream as if there were no other processes in the program, and the program would still run correctly. It is difficult for the compiler to predict incoming invalidations and perform this analysis, because the communication in the application cannot be easily deduced from an explicitly parallel program in a shared address space. The one case where the compiler has a good chance is when the compiler itself parallelized the program, and therefore knows all the necessary interactions of when different processes read and modify data. But even then, dynamic task assignment and false sharing of data greatly complicate the analysis. A programmer has the semantic information about interprocess communication, so it is easier for the programmer to insert and schedule prefetches as necessary in the presence of invalidations. The one kind of information that a compiler does have is that conveyed by explicit synchronization statements in the parallel program. Since synchronization usually implies that data are being shared (for example, in a “properly labeled” program, see Chapter 9, the modification of a datum by one process and its use by another process is separated by a synchronization), the compiler analysis can assume that communication is taking place and that all the shared data in the cache have been invalidated whenever it sees a synchronization event. Of course, this is very conservative, and it may lead to a lot of unnecessary prefetches especially when synchronization is frequent and little data is actually invalidated between synchronization events. It would be nice if a synchronization event conveyed some information about which data might be modified, but this is usually not the case. As a second enhancement, since a processor often wants to fetch a cache block with exclusive ownership (i.e. in exclusive mode, in preparation for a write), or to simply obtain exclusivity by invalidating other cached copies, it makes sense to prefetch data with exclusivity or simply to prefetch exclusivity. This can have two benefits when used judiciously. First, it reduces the latency of the actual write operations that follow, since the write does not have to invalidate other cached copies and wait to obtain ownership (that was already done by the prefetch). Whether or not this in itself has an impact on performance depends on whether write latency is already hidden by other methods like using a relaxed consistency model.
The second advantage is in the common case where a process first reads a variable and then shortly thereafter writes it. Prefetching with ownership even before the read in this case halves the traffic, as seen in Figure 11-20, and hence improves the performance of other references as well by reducing contention and bandwidth needs. Quantitative benefits of prefetching in exclusive mode are discussed in [Mow94].
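A small sketch of the read-then-write case, assuming that the second argument of the GCC-style __builtin_prefetch() intrinsic (1, meaning prefetch for a write) maps to an exclusive-mode prefetch on the target machine; the distances below are illustrative.

void scale_in_place(double *A, int n, double factor)
{
    const int AHEAD = 64;   /* assumed prefetch distance, in elements     */
    const int LINE  = 16;   /* assumed doubles per cache line             */
    for (int i = 0; i < n; i++) {
        if ((i % LINE) == 0 && i + AHEAD < n)
            __builtin_prefetch(&A[i + AHEAD], 1, 3);  /* 1 = for a write  */
        A[i] = A[i] * factor;  /* read then write of the same block: one
                                  exclusive fetch replaces a shared fetch
                                  followed by an ownership request        */
    }
}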
[Figure 11-20 timeline: with no prefetching, busy time is interrupted first by a read miss (A) and later by a write miss (A); with prefetching in exclusive mode, a single exclusive prefetch is overlapped with busy time and both read(A) and write(A) then hit.]
Figure 11-20 Benefits from prefetching with ownership. Suppose the latest copy of A is not available locally to begin with, but is present in other caches. Normal hardware cache coherence would fetch the corresponding block in shared state for the read, and then communicate again to obtain ownership upon the write. With prefetching, if we recognize this read-write pattern we can issue a single prefetch with ownership before the read itself, and not have to incur any further communication at the write. By the time the write occurs, the block is already there in exclusive state.
Hardware-controlled versus Software-controlled Prefetching Having seen how hardware-controlled and software-controlled prefetching work, let us discuss some of their relative advantages and disadvantages. The most important advantages of hardware-controlled prefetching are (i) it does not require any software support from the programmer or compiler, (ii) it does not require recompiling code (which may be very important in practice when the source code is not available), and (iii) it does not incur any instruction overhead or code expansion since no extra instructions are issued by the processor. On the other hand, its most obvious disadvantages are that it requires substantial hardware overhead and the prefetching “algorithms” are hard-wired into the machine. However, there are many other differences and tradeoffs, having to do with coverage, minimizing unnecessary prefetches and maximizing effectiveness. Let us examine these tradeoffs, focusing on compiler-generated rather than programmer-generated prefetches in the software case. Coverage: The hardware and software schemes take very different approaches to analyzing what to prefetch. Software schemes can examine all the references in the code, while hardware simply observes a window of dynamic access patterns and predicts future references based on current patterns. In this sense, software schemes have greater potential for achieving coverage of complex access patterns. Software may be limited in its analysis (e.g. by the limitations of pointer and interprocedural analysis) while hardware may be limited by the cost of maintaining sophisticated history and the accuracy of needed techniques like branch prediction. The compiler (or even programmer) also cannot react to some forms of truly dynamic information that are not predicted by its static analysis, such as the occurrence of replacements due to cache conflicts, while hardware can. Progress is being made in improving the coverage of both approaches in prefetching more types of access patterns [ZhT95, LuM96], but the costs in the hardware case appear high. It is possible to use some runtime feedback from hardware (for example conflict miss pat-
terns) to improve software prefetching coverage, but there has not been much progress in this direction. Reducing unnecessary prefetches: Hardware prefetching is driven by increasing coverage, and does not perform any locality analysis to reduce unnecessary prefetches. The lack of locality or reuse analysis may generate prefetches for data that are already in the cache, and the limitations of the analysis may generate prefetches to data that the processor does not access. As we have seen, these prefetches waste cache access bandwidth even if they find the data in the cache and are not propagated past it. If they do actually prefetch the data, they waste memory or network bandwidth as well, and can displace useful data from the processor’s working set in the cache. Especially on a bus-based machine, wasting too much interconnect bandwidth on prefetches has at least the potential to saturate the bus and reduce instead of enhancing performance [Tue93]. Maximizing effectiveness: In software prefetching, scheduling is based on prediction. However, it is often difficult to predict how long a prefetch will take to complete, for example where in the extended memory hierarchy it will be satisfied and how much contention it will encounter. Hardware can in theory automatically adapt to how far ahead to prefetch at runtime, since it lets the lookahead-PC get only as far ahead as it needs to. But hiding long latencies becomes difficult due to branch prediction, and every mispredicted branch causes the lookahead-PC to be reset leading to a few ineffective prefetches until it gets far enough ahead of the PC again. Thus, both the software and hardware schemes have potential problems with effectiveness or just-in-time prefetching. Hardware prefetching is used in dynamically scheduled microprocessors to prefetch data for operations that are waiting in the reorder buffer but cannot yet be issued, as discussed earlier. However, in that case, hardware does not have to detect patterns and analyze what to prefetch. So far, the on-chip hardware support needed for more general hardware analysis and prefetching of non-unit stride accesses has not been considered worthwhile. On the other hand, microprocessors are increasingly providing prefetching instructions to be used by software (even in uniprocessor systems). Compiler technology is progressing as well. We shall therefore focus on software-controlled prefetching in our quantitative evaluations of prefetching in the next section. After this evaluation, we shall examine more closely some of the architectural support needed by and tradeoffs encountered in the implementation of software-controlled prefetching. Policy Issues and Tradeoffs Two major issues that a prefetching scheme must take a position on are when to drop prefetches that are issued, i.e. not propagate them any further, and where to place the prefetched data. Since they are hints, prefetches can be dropped by hardware when their costs seem to outweigh their benefits. Two interesting questions are whether to drop a prefetch when the prefetch instruction incurs a TLB miss in address translation, and whether to drop it when the buffer of outstanding prefetches or transactions is already full. TLB misses are expensive, but if they would occur anyway at the actual reference to the address then it is probably worthwhile to incur the latency at prefetch time rather than at reference time. 
However, the prefetch may be to an invalid address and would not be accessed later, but this is not known until the translation is complete. One possible solution is to guess: to have two flavors of prefetches, one that gets dropped on TLB misses and the other that does not, and insert the appropriate one based on the expected nature of the TLB miss if any. Whether dropping prefetches when the buffer that tracks outstanding accesses is
full is a good idea depends on the extent to which we expect contention for the buffer with ordinary reads and writes, and the extent to which we expect prefetches to be useless. If both these extents are high, we would be inclined to drop prefetches in this case. Prefetched data can be placed in the primary cache, in the secondary cache, or in a separate prefetch buffer. A separate buffer has the advantage that prefetched data do not conflict with useful data in a cache, but it requires additional hardware and must be kept coherent if prefetches are to be non-binding. While ordinary misses replicate data at every level of the cache hierarchy, on prefetches we may not want to bring the data very close to the processor since they may replace useful data in the small cache and may compete with processor loads and stores for cache bandwidth. However, the closer to the processor we prefetch, the more latency we hide if the prefetches are successful. The correct answer depends on the size of the cache and the important working sets, the latencies to obtain data from different levels, and the time for which the primary cache is tied up when prefetched data are being put in it, among other factors (see Exercise f). Results in the literature suggest that the benefits of prefetching into the primary cache usually outweigh the drawbacks [Mow94].
Sender-initiated Precommunication
In addition to hardware update-based protocols, support for software-controlled “update”, “deliver”, or “producer prefetch” instructions has been implemented and proposed. An example is the “poststore” instruction in the KSR-1 multiprocessor from Kendall Square Research, which pushes the contents of the whole cache block into the caches that currently contain a (presumably old) copy of the block; other instructions push only the specified word into other caches that contain a copy of the block. A reasonable place to insert these update instructions is at the last write to a shared cache block before a release synchronization operation, since it is likely that those data will be needed by consumers. The destination nodes of the updates are the sharers in the directory entry. As with update protocols, the assumption is that past sharing patterns are a good predictor of future behavior. (Alternatively, the destinations may be specified in software by the instruction itself, or the data may be pushed only to the home main memory rather than other caches, i.e. a write-through rather than an update, which hides some but not all of the latency from the destination processors.) These software-based update techniques have some of the same problems as hardware update protocols, but to a lesser extent since not every write generates a bus transaction. As with update protocols, competitive hybrid schemes are also possible [Oha96, GrS96]. Compared to prefetching, the software-controlled sender-initiated communication has the advantage that communication happens just when the data are produced. However, it has several disadvantages. First, the data may be communicated too early, and may be replaced from the consumer’s cache before use, particularly if they are placed in the primary cache. Second, unlike prefetching this scheme precommunicates only communication (coherence) misses, not capacity or conflict misses.
Third, while a consumer knows what data it will reference, a producer may deliver unnecessary data into processor caches if past sharing patterns are not a perfect predictor of future patterns, which they often aren’t, or may even deliver the same data value multiple times. Fourth, a prefetch checks the cache and is dropped if the data are found in the cache, reducing unnecessary network traffic. The software update performs no such checks and can increase traffic and contention, though it reduces traffic when it is successful since it deposits the data in the right places without requiring multiple protocol transactions. And finally, since the receiver does not control how many precommunicated messages it receives, buffer overflow and
the resulting complications at the receiver may easily occur. The wisdom from simulation results so far is that prefetching schemes work better than deliver or update schemes for most applications [Oha96], though the two can complement each other if both are provided [AHA+97]. Finally, both prefetch and deliver can be extended with the capability to transfer larger blocks of data (e.g. multiple cache blocks, a whole object, or an arbitrarily defined range of addresses) rather than a single cache block. These are called block prefetch and block put mechanisms (block put differs from block transfer in that the data are deposited in the cache and not in main memory). The issues here are similar to those encountered by prefetch and software update except for differences due to their size. For example, it may not be a good idea to prefetch or deliver a large block to the primary cache. Since software-controlled prefetching is generally more popular than hardware-controlled prefetching or sender-initiated precommunication, let us briefly examine some results from the literature that illustrate the performance improvements it can achieve.
11.7.3 Performance Benefits
Performance results from prefetching so far have mostly been examined through simulation. To illustrate the potential, let us examine results from hand-inserted prefetches in some of the example applications used in this book [WSH95]. Hand-inserted prefetches are used since they can be more aggressive than the best available compiler algorithms. Results from state-of-the-art compiler algorithms will also be discussed.
Benefits with Single-issue, Statically Scheduled Processors
Section 11.5.3 illustrated the performance benefits for block transfer on the FFT and Ocean applications for a particular simulated platform. Let us first look at how prefetching performs for the same programs and platform. This experiment focuses on prefetching only remote accesses (misses that cause communication). Figure 11-21 shows the results for FFT. Comparing the prefetching results with the original non-prefetched results, we see in the first graph that prefetching remote data helps performance substantially in this predictable application. As with block transfer, the benefits are less for large cache blocks than for small ones, since there is a lot of spatial locality in the applications so large cache blocks already provide a prefetching effect by themselves. The second graph compares the performance of block transfer with that of the prefetched version, and shows that the results are quite similar: prefetching is able to deliver most of the benefits of even the very aggressive block transfer in this application, as long as enough prefetches are allowed to be outstanding at a time (this experiment allows a total of 16 simultaneous outstanding operations from a processor). Figure 11-22 shows the same results for the Ocean application. Like block transfer, prefetching helps less here since less time is spent in communication and since not all of the prefetched data are useful (due to poor spatial locality on communicated data along column-oriented partition boundaries). These results are only for prefetching remote accesses. Prefetching is often much more successful on local accesses (like uniprocessor prefetching). For example, in the nearest-neighbor grid computations in the Ocean application, with barriers between sweeps it is difficult to issue prefetches well enough ahead of time for boundary elements from a neighbor partition.
This is another reason for the benefits from prefetching remote data not being very large. However, a process can very easily issue prefetches early enough for grid points within its assigned partition,
[Figure 11-21: two graphs of normalized execution time versus number of processors (1 through 7), each with curves for cache block sizes B = 128, 64, and 32 plus the normalized baseline; panel (a) shows prefetching relative to the original version, panel (b) shows block transfer relative to prefetching.]
Figure 11-21 Performance benefits of prefetching remote data in a Fast Fourier Transform. The graph on the left shows the performance of the prefetched version relative to that of the original version. The graph can be interpreted just as described in Figure 11-13 on page 778: Each curve shows the execution time of the prefetched version relative to that of the non-prefetched version for the same cache block size (for each number of processors). Graph (b) shows the performance of the version with block transfer (but no prefetching), described earlier, relative to the version with prefetching (but no block transfer) rather than relative to the original version. It enables us to compare the benefits from block transfer with those from prefetching remote data.
[Figure 11-22: two graphs of normalized execution time versus number of processors (1, 2, 4, 8, 16, 32, 64), each with curves for cache block sizes B = 128, 64, and 32 plus the normalized baseline; panel (a) shows prefetching relative to the original version, panel (b) shows block transfer relative to prefetching.]
Figure 11-22 Performance benefits of prefetching remote data in Ocean. The graphs are interpreted in exactly the same way as in Figure 11-21.
which are not touched by any other process. Results from a state-of-the-art compiler algorithm show that the compiler can be quite successful in prefetching regular computations on dense arrays, where the access patterns are very predictable [Mow94]. These results, shown for two applications in Figure 11-23, include both local and remote accesses for 16-processor executions. The problems in these cases typically are being able to prefetch far enough ahead of time (for example when the misses occur at the beginning of a loop nest or just after a synchronization point), and being able to analyze and predict conflict misses.
[Figure 11-23: bar charts of normalized execution time, broken into prefetch memory overhead, synchronization, data, and busy components, for three L1/L2 cache-size combinations (2K/8K, 8K/64K, 64K/256K); each combination has bars for no prefetching (N) and selective prefetching (S), and the 8K/64K combination also has a bar for indiscriminate prefetching (I); panel (a) Ocean, panel (b) LU.]
Figure 11-23 Performance benefits from compiler-generated prefetching. The results are for an older version of the Ocean simulation program (which partitions in chunks of complete rows and hence has a higher inherent communication to computation ratio) and for an unblocked and hence lower-performance dense LU factorization. There are three sets of bars, for different combinations of L1/L2 cache sizes. The bars for each combination are the execution times for no prefetching (N) and selective prefetching(S). For the intermediate combination (8K L1 cache and 64K L2 cache), results are also shown for the case where prefetches are issued indiscriminately (I), without locality analysis. All execution times are normalized to the time without prefetching for the 8K/64K cache size combination. The processor, memory, and communication architecture parameters are chosen to approximate those of the Stanford DASH multiprocessor, which is a relatively old multiprocessor, and can be found in [Mow94]. Latencies on modern systems are much larger relative to processor speed than on DASH. We can see that prefetching helps performance, and that the choice of cache sizes makes a substantial difference to the impact of prefetching. The increase in “busy” time with prefetching (especially indiscriminate prefetching) is due to the fact that prefetch instructions are included in busy time.
Some success has also been achieved on sparse array or matrix computations that use indirect addressing, but more irregular, pointer-based applications have not seen much success through compiler-generated prefetching. For example, the compiler algorithm is not able to successfully prefetch the data in the tree traversals in the Barnes-Hut application, both because temporal locality on these accesses is high but difficult to analyze, and because it is difficult to dereference the pointers and issue the prefetches well enough ahead of time. Another problem is that many applications with such irregular access patterns exhibit a high degree of temporal locality; this locality is very difficult to analyze, so even when the prefetching analysis is successful it can lead to a lot of unnecessary prefetches. Programmers can often do a better job in these cases, as discussed earlier, and runtime profile data may be useful to identify the data accesses that generate the most misses. For the cases where prefetching is successful overall, compiler algorithms that use locality analysis to reduce unnecessary prefetch instructions have been found to reduce the number of prefetches issued substantially without losing much in coverage, and hence to perform much better (see Figure 11-24). Prefetching with ownership is found to hide write latency substantially in an architecture in which the processor implements sequential consistency by stalling on writes, but is less important when write latency is already being hidden through a relaxed memory consistency model. It does reduce traffic substantially in any case. Quantitative evaluations in the literature [Mow94, ChB94] also show that as long as caches are reasonably large, cache interference effects due to prefetching are negligible. They also illustrate that by being more selective, software prefetching indeed tends to induce less unnecessary
[Figure 11-24: two bar charts comparing indiscriminate and selective prefetching for Ocean, LU, MP3D, Cholesky, and Locus; one chart shows the percentage of unnecessary prefetches, the other shows coverage (%).]
Figure 11-24 The benefits of selective prefetching through locality analysis. The fraction of prefetches that are unnecessary is reduced substantially, while the coverage is not compromised. MP3D is an application for simulating rarefied hydrodynamics, Cholesky is a sparse matrix factorization kernel, and LocusRoute is a wire-routing application from VLSI CAD. MP3D and Cholesky use indirect array accesses, and LocusRoute uses pointers to implement linked lists so is more difficult for coverage.
extra traffic and cache conflict misses than hardware prefetching. However, the overhead due to extra instructions and associated address calculations can sometimes be substantial in software schemes. Benefits with Multiple-issue, Dynamically Scheduled Processors While the above results were obtained through simulations that assumed simple, statically scheduled processors, the effectiveness of software-controlled prefetching has also been measured (through simulation) on multiple-issue, dynamically scheduled processors [LuM96, BeF96a, BeF96b] and compared with the effectiveness on simple, statically scheduled processors [RPA+97]. Despite the latency tolerance already provided by dynamically scheduled processors (including some through hardware prefetching of operations in the reorder buffer, as discussed earlier), software-controlled prefetching is found to be effective in further reducing execution time. The reduction in memory stall time is somewhat smaller in dynamically scheduled processors than it is in statically scheduled processors. However, since memory stall time is a greater fraction of execution time in the former than in the latter (dynamically scheduled superscalar processors reduce instruction processing time much more effectively than they can reduce memory stall time), the percentage improvement in overall execution time due to prefetching is often comparable in the two cases. There are two main reasons for prefetching being less effective in reducing memory stall time with dynamically scheduled superscalar processors than with simple processors. First, since dynamically scheduled superscalar processors increase the rate of instruction processing quite rapidly, there is not so much computation time to overlap with prefetches and prefetches often end up being late. The second reason is that dynamically scheduled processors tend to cause more resource contention for processor resources (outstanding request tables, functional units, tracking buffers, etc.) that are encountered by a memory operation even before the L1 cache. This is because they cause more memory operations to be outstanding at a time, and they do not block on read misses. Prefetching tends to further increase the contention for these resources, thus increasing the latency of non-prefetch accesses. Since this latency occurs before the L1 cache, it is not hidden effectively by prefetching. Resource contention of this sort is also the reason that
simply issuing prefetches earlier does not solve the late prefetches problem: Early prefetches not only are often wasted, but also tie up these processor resources for an even longer time since they tend to keep more prefetches outstanding at a time. The study in [RPA+97] was unable to improve performance significantly by varying how early prefetches are issued. One advantage of dynamically scheduled superscalar processors compared to single-issue statically scheduled processors from the viewpoint of prefetching is that the instruction overhead of prefetching is usually much smaller, since the prefetch instructions can occupy empty slots in the processor and are hence overlapped with other instructions. Comparison with Relaxed Memory Consistency Studies that compare prefetching with relaxed consistency models have found that on statically scheduled processors with blocking reads, the two techniques are quite complementary [GHG+91]. Relaxed models tend to greatly reduce write stall time but do not do much for read stall time in these blocking-read processors, while prefetching helps to reduce read stall time. Even after adding prefetching, however, there is a substantial difference in performance between sequential and relaxed consistency, since prefetching is not able to hide write latency as effectively as relaxed consistency. On dynamically scheduled processors with nonblocking reads, relaxed models are helpful in reducing read stall time as well as write stall time. Prefetching also helps reduce both, so it is interesting to examine whether starting from sequential consistency performance is helped more by prefetching or by using a relaxed consistency model. Even when all optimizations to improve the performance of sequential consistency are applied (like hardware prefetching, speculative reads, and write-buffering), it is found to be more advantageous to use a relaxed model without prefetching than to use a sequentially consistent model with prefetching on dynamically scheduled processors [RPA+97]. The reason is that although prefetching can help reduce read stall time somewhat better than relaxed consistency, it does not help sequential consistency hide write latency nearly as well as can be done with relaxed consistency. 11.7.4 Implementation Issues In addition to extra bandwidth and lockup-free caches that allow multiple outstanding accesses (prefetches and ordinary misses), there are two types of architectural support needed for software-controlled prefetching: instruction set enhancements, and keeping track of prefetches. Let us look at each briefly. Instruction Set Enhancements Prefetch instructions differ from normal load instructions in several ways. First, there is no destination register with non-binding prefetches. Other differences are that prefetches are non-blocking and non-excepting instructions. The non-blocking property allows the processor to proceed past them. Since prefetches are hints and since no instructions depend on the data they return, unlike nonblocking loads they do not need support for dynamic scheduling, instruction lookahead and speculative execution to be effective. Prefetches do not generate exceptions when the address being prefetched is invalid; rather, they are simply ignored since they are only hints. This allows us to be aggressive in generating prefetches without analysis for exceptions (e.g. prefetching past array bounds in a software pipeline, or prefetching a data-dependent address before the dependence is resolved).
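As a concrete illustration of the non-excepting property, the sketch below prefetches past the end of the array without a bounds check; at the instruction level such a prefetch is simply dropped rather than faulting. The GCC/Clang __builtin_prefetch() builtin is assumed, and strictly portable C would still clamp the index, so this is an illustration of the hardware behavior rather than a style recommendation.

void copy_with_prefetch(double *dst, const double *src, int n)
{
    const int AHEAD = 32;   /* assumed prefetch distance, in elements      */
    for (int i = 0; i < n; i++) {
        /* No bounds check: &src[i + AHEAD] may point past the array in the
         * last iterations. A non-binding, non-excepting prefetch to such
         * an address is ignored by the hardware instead of trapping.      */
        __builtin_prefetch(&src[i + AHEAD], 0, 3);
        dst[i] = src[i];
    }
}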
Prefetches can be implemented either as a special flavor of a load instruction, using some extra bits in the instruction, or using an unused opcode. Prefetches themselves can have several flavors, such as prefetching with or without ownership, prefetching multiple cache blocks rather than one, prefetching data in uncached mode rather than into the cache (in which case the prefetches are binding since the data are no longer visible to the coherence protocol), and flavors that determine whether prefetches should be dropped under certain circumstances. The issues involved in choosing an instruction format will be discussed in Exercise d.
Keeping Track of Outstanding Prefetches
To support multiple outstanding reads or writes, a processor must maintain state for each outstanding transaction: in the case of a write, the data written must be merged with the rest of the cache block when it returns, while in the case of a read the register being loaded into must be kept track of, and interlocks with this register handled. Since prefetches are just hints, it is not necessary to maintain processor state for outstanding prefetches, as long as the cache can receive prefetch responses from other nodes while the local processor might be issuing more requests. All we really need is lockup-free caches. Nonetheless, having a record of outstanding prefetches in the processor may be useful for performance reasons. First, it allows the processor to proceed past a prefetch even when the prefetch cannot be serviced immediately by the lockup-free cache, by buffering the prefetch request till the cache can accept it. Second, it allows subsequent prefetches to the same cache block to be merged by simply dropping them. Finally, a later load miss to a cache block that has an outstanding prefetch can be merged with that prefetch and simply wait for it to return, thus partially hiding the latency of the miss. This is particularly important when the prefetch cannot be issued far enough before the actual miss: If there were no record of outstanding prefetches, the load miss would simply be issued and would incur its entire latency. The outstanding prefetch buffer may simply be merged with an existing outstanding access buffer (such as the processor’s write buffer), or separate structures may be maintained. The disadvantage of a combined outstanding transaction buffer is that prefetches may be delayed behind writes or vice versa; the advantage is that a single structure may be simpler to build and manage. In terms of performance, the difference between the two alternatives diminishes as more entries are used in the buffer(s), and may be significant for small buffers.
11.7.5 Summary
To summarize our discussion of precommunication in a cache-coherent shared address space, the most popular method so far is for microprocessors to provide support for prefetch instructions to be used for software-controlled prefetching, whether inserted by a compiler or a programmer. The same instructions are used for either uniprocessor or multiprocessor systems. In fact, prefetching turns out to be quite successful in competing with block data transfer even in cases where the latter technique works well, even though prefetching involves the processor on every cache block transfer. For general computations, prefetching has been found to be quite successful in hiding latency in predictable applications with relatively regular data access patterns, and successful compiler algorithms have been developed for this case.
Programmer-inserted prefetching still tends to outperform compiler-generated prefetching, since the programmer has knowledge of access patterns across computations that enables earlier or better scheduling of prefetches. However, prefetching irregular computations, particularly those that use pointers heavily, has a long way to go. Hardware prefetching is used in limited forms, as in dynamically scheduled processors; while it has important advantages in not requiring that programs be re-compiled, its future
in microprocessors is not clear. Support for sender-initiated precommunication instructions is also not as popular as support for prefetching.
11.8 Multithreading in a Shared Address Space
Hardware-supported multithreading is perhaps the most versatile technique for hiding different types of latency in different circumstances. It has the following conceptual advantages over other approaches:
• it requires no special software analysis or support (other than having more explicit threads or processes in the parallel program than the number of processors).
• because it is invoked dynamically, it can handle unpredictable situations like cache conflicts etc. just as well as predictable ones.
• while the previous techniques are targeted at hiding memory access latency, multithreading can potentially hide the latency of any long-latency event just as easily, as long as the event can be detected at runtime. This includes synchronization and instruction latency as well.
• like prefetching, it does not change the memory consistency model since it does not reorder accesses within a thread.
Despite these potential advantages, multithreading is currently the least popular latency-tolerating technique in commercial systems, for two reasons. First, it requires substantial changes to the microprocessor architecture. Second, its utility has so far not been adequately proven for uniprocessor or desktop systems, which constitute the vast majority of the marketplace. We shall see why in the course of this discussion. However, with latencies becoming increasingly longer relative to processor speeds, with more sophisticated microprocessors that already provide mechanisms that can be extended for multithreading, and with new multithreading techniques being developed to combine multithreading with instruction-level parallelism (by exploiting instruction-level parallelism both within and across threads, as discussed later in Section 11.8.5), this trend may change in the future. Let us begin with the simple form of multithreading that we considered in message passing, in which instructions are executed from one thread until that thread encounters a long-latency event, at which point it is switched out and another thread switched in. The state of a thread—which includes the processor registers, the program counter (which holds the address of the next instruction to be executed in that process), the stack pointer, and some per-process parts of the processor status word (e.g. the condition codes)—is called the context of that thread, so multithreading is also called multiple-context processing. If the latency that we are trying to tolerate is large enough, then we can save the state to memory in software when the thread is switched out and then load it back into the appropriate registers when the thread is switched back in (note that restoring the processor status word requires operating system involvement). This is how multithreading is typically orchestrated on message-passing machines, so a standard single-threaded microprocessor can be used in that case. In a hardware-supported shared address space, and even more so on a uniprocessor, the latencies we are trying to tolerate are not that high. The overhead of saving and restoring state in software may be too high to be worthwhile, and we are likely to require hardware support. The cost of a context switch may also involve flushing or squashing instructions already in the processor pipeline, as we shall see. Let us examine this relationship between switch overhead and latency a little more quantitatively.
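Before turning to the quantitative relationship, the per-thread state just enumerated can be written down as a C structure for concreteness. The field names and sizes below are illustrative assumptions; a multithreaded processor replicates this state in hardware register files rather than saving it to memory.

#include <stdint.h>

#define NUM_GPRS 32

typedef struct hw_context {
    uint64_t gpr[NUM_GPRS];   /* general-purpose registers                 */
    uint64_t pc;              /* program counter: next instruction to run  */
    uint64_t sp;              /* stack pointer                             */
    uint64_t psw;             /* per-thread parts of the processor status
                                 word (e.g. condition codes)               */
    int      ready;           /* not currently stalled on a long-latency
                                 event                                     */
} hw_context_t;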
Consider processor utilization, i.e. the fraction of time that a processor spends executing useful instructions rather than stalled for some reason. The time a thread spends executing before it encounters a long-latency event is called the busy time. The total amount of time spent switching among threads is called the switching time. If no other thread is ready, the processor is stalled until one becomes ready or the long-latency event it stalled on completes. The total amount of time spent stalled for any reason is called the idle time. The utilization of the processor can then be expressed as:

    utilization = busy / (busy + switching + idle)        (EQ 11.2)
It is clearly important to keep the switching cost low. Even if we are able to tolerate all the latency through multithreading, thus removing idle time completely, utilization and hence performance is limited by the time spent context switching. (Of course, processor utilization is used here just for illustrative purposes; as discussed in Chapter 4 it is not in itself a very good performance metric). 11.8.1 Techniques and Mechanisms For processors that issue instructions from only a single thread in a given cycle, hardware-supported multithreading falls broadly into two categories, determined by the decision about when to switch threads. The approach assumed so far—in message passing, in multiprogramming to tolerate disk latency, and at the beginning of this section—is to let a thread run until it encounters a long-latency event (e.g. a cache miss, a synchronization event, or a high-latency instruction such as a divide), and then switch to another ready thread. This has been called the blocked approach, since a context switch happens only when a thread is blocked or stalled for some reason. Among shared address space systems, it is used in the MIT Alewife prototype [ABC+95]. The other major hardware-supported approach is to simply switch threads every processor cycle, effectively interleaving the processor resource among a pool of ready threads at a one-cycle granularity. When a thread encounters a long-latency event, it is marked as not being ready, and does not participate in the switching game until that event completes and the thread joins the ready pool again. Quite naturally, this is called the interleaved approach. Let us examine both approaches in some detail, looking both at their qualitative features and tradeoffs as well as their quantitative evaluation and implementation details. After covering blocked and interleaved multithreading for processors that issue instructions from only a single thread in a cycle, we will examine the integration of multithreading with instruction-level (superscalar) parallelism, which has the potential to overcome the limitations of the traditional approaches (see Section 11.8.5). Let us begin with the blocked approach. Blocked Multithreading The hardware support for blocked multithreading usually involves maintaining multiple hardware register files and program counters, for use by different threads. An active thread, or a context, is a thread that is currently assigned one of these hardware copies. The number of active threads may be smaller than the number of ready threads (threads that are not stalled but are ready to run), and is limited by the number of hardware copies of the resources. Let us first take a high-level look at the relationship among the latency to be tolerated, the thread switching overhead and the number of active threads, by analyzing an average-case scenario.
Suppose a processor can support N active threads at a time (N-way multithreading). And suppose that each thread operates by repeating the following cycle:
• execute useful instructions without stalling for R cycles (R is called the run length)
• encounter a high-latency event and switch to another thread.
Suppose that the latency we are trying to tolerate each time is L cycles, and the overhead of a thread switch is C cycles. Given a fixed set of values for R, L and C, a graph of processor utilization versus the number of threads N will look like that shown in Figure 11-25.
[Figure 11-25: processor utilization (0 to 1.0) versus number of threads (1 through 7); utilization rises linearly and then flattens.]
Figure 11-25 Processor Utilization versus number of threads in multithreading. The figure shows the two regimes of operation: the linear regime, and saturation. It assumes R=40, L=200 and C=10 cycles.
We can see that there are two distinct regimes of operation. The utilization increases linearly with the number of threads up to a threshold, at which point it saturates. Let us see why. Initially, increasing the number of threads allows more useful work to be done in the interval L that a thread is stalled, and latency continues to be hidden. Once N is sufficiently large, by the time we cycle through all the other threads—each with its run length of R cycles and switch cost of C cycles—and return to the original thread we may have tolerated all L cycles of latency. Beyond this, there is no benefit to having more threads since the latency is already hidden. The value of N for which this saturation occurs is given by (N-1)R + NC = L, or

    N_sat = (R + L) / (R + C)

Beyond this point, the processor is always either busy executing a run or incurring switch overhead, so the utilization according to Equation 11.2 is:

    u_sat = R / (R + C) = 1 / (1 + C/R)        (EQ 11.3)
If N is not large enough relative to L, then the runs of all N-1 other threads will complete before the latency L passes. A processor therefore does useful work for R+(N-1)*R cycles out of every R+L and is idle for the rest of the time. That is, it is busy for R+(N-1)*R or NR cycles, switching for N*C cycles and idle for L-[(N-1)R+N*C] cycles, i.e. busy for NR out of every R+L cycles, leading to a utilization of

    u_lin = NR / (R + L) = N / (1 + L/R)        (EQ 11.4)
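As a concrete check of these formulas, the following sketch evaluates them for the parameter values assumed in Figure 11-25 (R = 40, L = 200, C = 10 cycles).

#include <stdio.h>

/* Processor utilization as a function of the number of threads N, per
 * equations 11.3 and 11.4.                                              */
static double utilization(double N, double R, double L, double C)
{
    double n_sat = (R + L) / (R + C);       /* saturation point           */
    if (N >= n_sat)
        return R / (R + C);                 /* EQ 11.3: saturated regime  */
    return N * R / (R + L);                 /* EQ 11.4: linear regime     */
}

int main(void)
{
    for (int n = 1; n <= 7; n++)
        printf("N=%d  utilization=%.2f\n", n, utilization(n, 40, 200, 10));
    /* N_sat = 240/50 = 4.8, so utilization climbs linearly (0.17, 0.33,
     * 0.50, 0.67) and then saturates at R/(R+C) = 0.8 for N >= 5, which
     * matches the shape of the curve in Figure 11-25.                    */
    return 0;
}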
This analysis is clearly simplistic, since it uses an average run length of R cycles and ignores the burstiness of misses and other long-latency events. Average-case analysis may lead us to assume that fewer threads suffice than are necessary to handle the bursty situations where latency tolerance may be most crucial. However, the analysis suffices to make the key points. Since the best utilization we can get with any number of threads, u_sat, decreases with increasing switch cost C (see Equation 11.3), it is very important that we keep switch cost low. Switch cost also affects the types of latency that we can hide; for example, pipeline latencies or the latencies of misses that are satisfied in the second-level cache may be difficult to hide unless the switch cost is very low. Switch cost can be kept low if we provide hardware support for several active threads; that is, have a hardware copy of the register file and other processor resources needed to hold thread state for enough threads to hide the latency. Given the appropriate hardware logic, then, we can simply switch from one set of hardware state to another upon a context switch. This is the approach taken in most hardware multithreading proposals. Typically, a large register file is either statically divided into as many equally sized register frames as the active threads supported (called a segmented register file), or the register file may be managed as a sort of dynamically managed cache that holds the registers of active contexts (see Section 11.8.3). While replicating context state in hardware can bring the cost of this aspect of switching among active threads down to a single cycle—it’s like changing a pointer instead of copying a large data structure—there is another time cost for context switching that we have not discussed so far. This cost arises from the use of pipelining in instruction execution. When a long-latency event occurs, we want to switch the current thread out. Suppose the long-latency event is a cache miss. The cache access is made only in the data fetch stage of the instruction pipeline, which is quite late in the pipeline. Typically, the hit/miss result is only known in the writeback stage, which is at the end of the pipeline. This means that by the time we know that a cache miss has occurred and the thread should be switched out, several other instructions from that thread (potentially k, where k is the pipeline depth) have already been fetched and are in the pipeline (see Figure 11-26). We are faced with three possibilities:
[Figure 11-26: a seven-stage pipeline (IF1, IF2, RF, EX, DF1, DF2, WB) holding instructions Ai through Ai+6 from the running thread A.]
Figure 11-26 Impact of late miss detection in a pipeline. Thread A is the current thread running on the processor. A cache miss occurring on instruction Ai from this thread is only detected after the second data fetch stage (DF2) of the pipeline, i.e. in the writeback (WB) cycle of Ai’s traversal through the pipeline. At this point, the following six instructions from thread A (Ai+1 through Ai+6) are already in the different stages of the assumed seven-stage pipeline. If all the instructions are squashed when the miss is detected (dashed x’s), we lose at least seven cycles of work.
• Allow these subsequent instructions to complete, and start to fetch instructions from the new thread at the same time.
• Allow the instructions to complete before starting to fetch instructions from the new thread.
• Squash the instructions from the pipeline and then start fetching from the new thread.
The first case is complex to implement, for two reasons. First, instructions from different threads will be in the pipeline at the same time. This means that the standard uniprocessor pipeline registers and dependence resolution mechanisms (interlocks) do not suffice. Every instruction must be tagged with its context as it proceeds through the pipeline, and/or multiple sets of pipeline registers must be used to distinguish results from different contexts. In addition to the increase in area (the pipeline registers are a substantial component of the data path) and the fact that the processor has to be changed much more substantially, the use of multiple pipeline registers at the pipeline stages means that the registers must be multiplexed onto the latches for the next stage, which may increase the processor cycle time. Since part of the motivation for the blocked scheme is its design simplicity and ability to use commodity processors with as little design effort and modification as possible, this may not be a very appropriate choice. The second problem with this choice is that the instructions in the pipeline following the one that incurs the cache miss may stall the pipeline because they may depend on the data returned by the miss. The second case avoids having instructions from multiple threads simultaneously in the pipeline. However, it does not do anything about the second problem above. The third case avoids both problems: Instructions from the new thread do not interact with those from the old one since the latter are squashed, and because the old instructions are squashed, they cannot miss and hold up the pipeline. This third choice is the simplest to implement, since the standard uniprocessor pipeline suffices and already has the ability to squash instructions, and is the favored choice for the blocked scheme. How does a context switch get triggered in the blocked approach to hide the different kinds of latency? On a cache miss, the switch can be triggered by the detection of the miss in hardware. For synchronization latency, we can simply ensure that an explicit context switch instruction follows every synchronization event that is expected to incur latency (or even all synchronization events). Since the synchronization event may be satisfied by another thread on the same processor, an explicit switch is necessary in this case to avoid deadlock. Long pipeline stalls can also be handled by inserting a switch instruction following a long-latency instruction such as a divide. Finally, short pipeline stalls like data hazards are likely to be very difficult to hide with the blocked approach. To summarize, the blocked approach has the following positive qualities:
• relatively low implementation cost (as we shall see in more detail later)
• good single-thread performance: If only a single thread is used per processor, there are no context switches and this scheme performs just like a standard uniprocessor would.
The disadvantage is that the context switch overhead is high: approximately the depth of the pipeline, even when registers and other processor state do not have to be saved to or restored from memory. This overhead limits the types of latencies that can be hidden. Let us examine the performance impact through an example, taken from [LGH94].
Example 11-3
Suppose four threads, A, B, C, and D, run on a processor. The threads have the following activity: A issues two instructions, with the second instruction incurring a cache miss, then issues four more.
B issues one instruction, followed by a two-cycle pipeline dependency, followed by two more instructions, the last of which incurs a cache miss, followed by two more. C issues four instructions, with the fourth instruction incurring a cache miss, followed by three more. D issues six instructions, with the sixth instruction causing a cache miss, followed by one more. Show how successive pipeline slots are either occupied by threads or wasted in a blocked multithreaded execution. Assume a simple four-deep pipeline and hence four-cycle context-switch time, and a cache miss latency of ten cycles (small numbers are used here for ease of illustration). Thread A
[Figure 11-27 artwork: timelines for threads A-D; the legend distinguishes busy cycles from each thread, context switch overhead, and idle (stall) cycles, and abbreviates each four-cycle context switch; memory and pipeline latencies are marked.]
Figure 11-27 Latency tolerance in the blocked multithreading approach. The top part of the figure shows how the four active threads on a processor would behave if each was the only thread running on the processor. For example, the first thread issues one instruction, then issues a second one which incurs a cache miss, then issues some more. The bottom part shows how the processor switches among threads when they incur a cache miss. In all cases, the instruction shown in each slot is the one that is (or would be) in the last, WB stage of the pipeline in that cycle. Thus, if the second instruction from the first thread had not missed, it would have been in the WB stage in the second cycle shown at the bottom. However, since it misses, that cycle is wasted and the three other instructions that have already been fetched into the 4-deep pipeline (from the same thread) have to be squashed in order to switch contexts. Note that the two-cycle pipeline latency does not generate a thread switch, since the switch overhead of four cycles is larger than this latency.
Answer
The solution is shown in Figure 11-27, assuming that threads are chosen round-robin starting from thread A. We can see that while most of the memory latency is hidden, this is at the cost of context switch overhead. Assuming the pipeline is in steady state at the beginning of this sequence, we can count cycles starting from the time the first instruction reaches the WB stage (i.e. the first cycle shown for the multithreaded execution in the bottom part of the figure). Of the 51 cycles taken in the multithreaded execution, 21 are useful busy cycles, two are pipeline stalls, there are no idle cycles stalled on memory, and 28 are context switch cycles, leading to a processor utilization of (21/51) * 100, or only 41%, despite the extremely low cache miss penalty assumed.
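To see in general how the switch cost caps the benefit of the blocked scheme, the following C sketch evaluates a simple steady-state model (a back-of-the-envelope approximation introduced here, not the cycle-exact accounting of the figure): a thread runs for R busy cycles between long-latency events of L cycles, and every switch costs C cycles, so with enough threads the best achievable utilization is R/(R+C). The parameter values are illustrative only, chosen in the spirit of numbers used later in this section.

/*
 * Steady-state utilization model for blocked multithreading.
 * R = busy cycles a thread runs between long-latency events
 * L = latency of the event (e.g., a cache miss), in cycles
 * C = context switch cost, roughly the pipeline depth for the blocked scheme
 * N = number of hardware contexts (threads)
 *
 * This is a textbook-style approximation: it ignores pipeline stalls,
 * startup/drain effects, and contention, so it will not reproduce the
 * cycle-exact accounting of Figure 11-27.
 */
#include <stdio.h>

static double utilization(double R, double L, double C, int N)
{
    /* If the other N-1 threads can cover the L-cycle latency, the
     * processor saturates and only the switch overhead is lost. */
    if ((N - 1) * (R + C) >= L)
        return R / (R + C);
    /* Otherwise some cycles are spent idle waiting on memory. */
    return (N * R) / (R + C + L);
}

int main(void)
{
    /* Hypothetical parameters: 30 busy cycles between misses,
     * 120-cycle miss latency, 4-cycle switch cost (a 4-deep pipeline). */
    double R = 30, L = 120, C = 4;
    for (int N = 1; N <= 8; N++)
        printf("N=%d  utilization=%.0f%%\n", N, 100 * utilization(R, L, C, N));
    return 0;
}

With these assumed numbers, utilization saturates at R/(R+C), about 88%, once roughly five threads are available; the switch cost C is what keeps it from reaching 100%.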
Interleaved Multithreading

In the interleaved approach, in every processor clock cycle a new instruction is chosen from a different thread that is ready and active (i.e. assigned a hardware context), so threads are switched every cycle rather than only on long-latency events. When a thread incurs a long-latency event, it is simply disabled, or removed from the pool of ready threads, until that event completes and the thread is labeled ready again. The key advantage of the interleaved scheme is that there is no context switch overhead. No event needs to be detected in order to trigger a context switch, since a switch happens every cycle anyway. Segmented or replicated register files are used here as well to avoid the need to save and restore registers. Thus, if there are enough concurrent threads, in the best case all latency will be hidden without any switch cost, and every cycle of the processor will perform useful work. An example of this ideal scenario is shown in Figure 11-28,
[Figure 11-28 artwork: timelines for six threads (T1-T6) with their memory and pipeline latencies interleaved so that every processor cycle is busy.]
Figure 11-28 Latency tolerance in an ideal setting of the interleaved multithreading approach. The top part of the figure shows how the six active threads on a processor would behave if each was the only thread running on the processor. The bottom part shows how the processor switches among threads round-robin every cycle, leaving out those threads whose last instruction caused a stall that has not been satisfied yet. For example, the first (solid) thread is not chosen again until its high-latency memory reference is satisfied and its turn comes again in the round-robin scheme.
where we assume six active threads for illustration. The typical disadvantage of the interleaved approach is the higher hardware cost and complexity, though the specifics of this and other potential disadvantages depend on the particular type of interleaved approach used. Interleaved schemes have undergone a fair amount of evolution. The early schemes severely restricted the number and type of instructions from a given thread that could be in the processor's pipeline at a time. This reduced the need for hardware interlocks and simplified processor
design, but had severe implications for the performance of single-threaded programs. More recent interleaved schemes greatly reduce these restrictions (blocked schemes of course do not have them, since the same thread runs until it encounters a long-latency event). In practice, another distinguishing feature among interleaved schemes is whether they use caches to reduce latency before trying to hide it. Machines built so far using the interleaved technique do not use caches at all, but rely completely on latency tolerance through multithreading [Smi81, ACC+90]. More recent proposals advocate the use of interleaved techniques and full pipeline interlocks with caches [LGH94] (recall that blocked multithreaded systems use caching since they want to keep a thread running efficiently for as long as possible). Let us look at some interesting points in this development.

The Basic Interleaved Scheme

The first interleaved multithreading scheme was that used in the Denelcor HEP (Heterogeneous Element Processor) multiprocessor, developed between 1978 and 1985 [Smi81]. Each processor had up to 128 active contexts, 64 user-level and 64 privileged, though only about 50 of them were available to the user. The large number of active contexts was needed even though the memory latency was small (about 20-40 cycles without contention), since the machine had no caches and so the latency was incurred on every memory reference (memory modules were all on the other side of a multistage interconnection network, but the processor had an additional direct connection to one of these modules which it could consider its "local" module). The 128 active contexts were supported by replicating the register file and other critical state 128 times. The pipeline on the HEP was 8 cycles deep. The pipeline supported interlocks among non-memory instructions, but did not allow more than one memory, branch or divide operation to be in the pipeline at a given time. This meant that several threads had to be active on each processor at a time just to utilize the pipeline effectively, even without any memory or other stalls. For parallel programs, for which HEP was designed, this meant that the degree of explicit concurrency in the program had to be much larger than the number of processors, especially in the absence of caches to reduce latency, greatly restricting the range of applications that would perform well.

Better Use of the Pipeline

Recognizing the drawbacks of poor single-thread performance and the very large number of threads needed when only a single memory operation per thread is allowed in the pipeline at a time, systems descended from the HEP have alleviated this restriction. These systems include the Horizon [KuS88] and the Tera [ACC+90] multiprocessors. They still do not use caches to reduce latency, relying completely on latency tolerance for all memory references. The first of these designs, the Horizon, was never actually built. The design allows multiple memory operations from a thread to be in the pipeline, but does this without hardware pipeline interlocks even for non-memory instructions. Rather, the analysis of dependences is left to the compiler. The idea is quite simple. Based on this compiler analysis, every instruction is tagged with a three-bit "lookahead" field, which specifies the number of immediately following instructions in that thread that are independent of that instruction. Suppose the lookahead field for an instruction is five.
This means that the next five instructions (memory or otherwise) are independent of the current instruction, and so can be in the pipeline with the current instruction even though there are no hardware interlocks to resolve dependences. Thus, when a long-latency event is encountered, the thread does not immediately become unready but can issue five more instructions before it becomes unready. The maximum value of the lookahead field is seven, so if more than seven following instructions are independent, the machine cannot know this and will prohibit more than seven from entering the pipeline until the current one leaves.
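To make the lookahead computation concrete, the following C sketch shows one way a compiler might derive the field for straight-line code. The instruction representation (register def/use bit masks) is hypothetical and for illustration only; a real compiler would also need memory disambiguation, which is ignored here.

/*
 * Sketch of how a compiler might compute a Horizon/Tera-style "lookahead"
 * field: for each instruction, count how many of the immediately following
 * instructions are independent of it, capped at the 3-bit maximum of 7.
 */
#include <stdbool.h>

#define MAX_LOOKAHEAD 7

typedef struct {
    unsigned defs;   /* bit i set if the instruction writes register i */
    unsigned uses;   /* bit i set if the instruction reads register i  */
} Instr;

/* True if b depends on a: it reads a's result, overwrites a's result,
 * or writes a register that a reads. */
static bool depends(const Instr *a, const Instr *b)
{
    return (a->defs & b->uses) || (a->defs & b->defs) || (a->uses & b->defs);
}

/* Lookahead for instruction i within a straight-line block of n instructions. */
static int lookahead(const Instr *block, int n, int i)
{
    int count = 0;
    for (int j = i + 1; j < n && count < MAX_LOOKAHEAD; j++) {
        if (depends(&block[i], &block[j]))
            break;          /* the first dependent instruction stops the count */
        count++;
    }
    return count;
}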
Each "instruction" or cycle in Horizon can in fact issue up to three operations, one of which may be a memory operation. This means that twenty-one independent operations must be found to achieve the maximum lookahead, so sophisticated instruction scheduling by the compiler is clearly very important in this machine. It also means that for a typical program it is very likely that a memory operation is issued every instruction or two. Since the maximum lookahead size is larger than the average distance between memory operations, allowing multiple memory operations in the pipe is very important to hiding latency. However, in the absence of caches every memory operation is a long-latency event, now occurring almost once per instruction, so a large number of ready threads is still needed to hide latency. In particular, single-thread performance (the performance of programs that are not multithreaded) is not helped much by a small amount of lookahead without caches. The small number of lookahead bits provided is influenced by the high premium on bits in the instruction word, and particularly by the register pressure introduced by having three results per instruction times the number of lookahead instructions; more lookahead and greater register pressure might have been counter-productive for instruction scheduling. Effective use of large lookahead also requires more sophistication in the compiler.

The Tera architecture is the latest in this series of interleaved multithreaded architectures without caches. It is actually being built by Tera Computer Company. Tera manages instruction dependences differently than Horizon and HEP: it provides hardware interlocks for instructions that don't access memory (like HEP), and Horizon-like lookahead for memory instructions. Let us see how this works. The Tera machine separates operations into memory operations, "arithmetic" operations, and control operations (e.g. branches). The unusual, custom-designed processor can issue three operations per instruction, much like Horizon, either one from each category or two arithmetic and one memory operation. Arithmetic and control operations go into one pipeline, which unlike Horizon has hardware interlocks to allow multiple operations from the same thread. The pipeline is very deep, and there is a sizeable minimum issue delay between consecutive instructions from a thread (about 16 cycles, with the pipeline being deeper than this), even if there are no dependences between consecutive instructions. Thus, while more than one instruction can be in this pipeline at the same time, several interleaved threads (say 16) are required to hide the instruction latency. Even without memory references, a single thread would at best complete one instruction every 16 cycles. Memory operations pose a bigger problem, since they incur substantially larger latency. Although Tera uses very aggressive memory and network technology, the average latency of a memory reference without contention is about 70 cycles (a processor cycle is 2.8 ns). Tera therefore uses compiler-generated lookahead fields for memory operations. Every instruction that includes a memory operation (called a memory instruction) has a 3-bit lookahead field which tells how many immediately following instructions (memory or otherwise) are independent of that memory operation.
Those instructions do not have to be independent of one another (those dependences are taken care of by the hardware interlocks in that pipeline), just of that memory operation. The thread can then issue that many instructions past the memory instruction before it has to render itself unready. This change in the meaning of lookahead makes it easier for the compiler to schedule instructions to have larger lookahead values, and also eases register pressure. To make this concrete, let us look at an example.
Example 11-4
Suppose that the minimum issue delay between consecutive instructions from a thread in Tera is 16, and the average memory latency to hide is 70 cycles. What is the smallest number of threads that would be needed to hide latency completely in the best case, and how much lookahead would we need per memory instruction in this case?
Answer
A minimum issue delay of 16 means that we need about 16 threads to keep the pipeline full. If 16 threads are to suffice to hide 70 cycles of memory latency, and since memory operations occur almost once per instruction in each thread, we would like each of the sixteen threads to issue about four independent instructions (roughly 70/16) before it is made unready after a memory operation, i.e. to have a lookahead of about 4. Since latencies are in fact often larger than the average uncontended 70 cycles, a higher lookahead of at most seven (three bits) is provided. Longer latencies would ask for larger lookahead values and sophisticated compilers. So would a desire for fewer threads, though this would require reducing the minimum issue delay in the non-memory pipeline as well.

While it seems from the above that supporting a few tens of active threads should be enough, Tera, like HEP, supports 128 active threads in hardware. For this, it replicates all processor state (program counter, status word and registers) 128 times, resulting in a total of 4096 64-bit general registers (32 per thread) and 1024 branch target registers (8 per thread). The large number of threads is supported for several reasons, and reflects the fundamental reliance of the machine on latency tolerance rather than reduction. First, some instructions may not have much lookahead at all, particularly read instructions and particularly when there are three operations per instruction. Second, with most memory references going into the network they may incur contention, in which case the latency to be tolerated may be much longer than 70 cycles. Third, the goal is not only to hide instruction and memory latency but also to tolerate synchronization wait time, which is caused by load imbalance or contention for critical resources, and which is usually much larger than remote memory reference latency.

The designers of the Tera system and its predecessors take a much more radical approach to the redesign of the multiprocessor building block than advocated by the other latency tolerating techniques we have seen so far. The belief here is that the commodity microprocessor building block, with its reliance on caches and support for only one or a small number of hardware contexts, is inappropriate for general-purpose multiprocessing. Because of the high latencies and physically distributed memory in modern convergence architectures, the use of relatively "latency-intolerant" commodity microprocessors as the building block implies that much attention must be paid to data locality in both the caches and in data distribution. This makes the task of programming for performance too complicated, since compilers have not yet succeeded in managing locality automatically in any but the simplest cases and their potential is unclear. The Tera approach argues that the only way to make multiprocessing truly general purpose is to take this burden of locality management off software and place it on the architecture in the form of much greater latency tolerance support in the processor. Unlike the approach we have followed so far (reduce latency first, then hide the rest), this approach does not pay much attention to reducing latency at all; the processor is redesigned with a primary focus on tolerating the latency of fine-grained accesses through interleaved multithreading. If this technique is successful, the programmer's view of the machine can indeed be a PRAM (i.e.
memory references cost a single cycle regardless of where they are satisfied, see Chapter 3) and the programmer can concentrate on concurrency rather than latency management. Of course, this approach sacrifices the tremendous
leverage obtained by using commodity microprocessors and caches, and faces head-on the challenge of the enormous effort that must be invested in the design of a modern high-performance processor. It is also likely to result in poor single-thread performance, which means that even uniprocessor applications must be heavily multithreaded to achieve good performance.

Full Single-thread Pipeline Support and Caching

While the HEP and the blocked multithreading approaches are in some ways at opposite ends of a multithreading spectrum, both have several limitations. The Tera approach improves on the basic HEP approach, but it still does not provide full single-thread support and requires many concurrent threads for good utilization. Since it does not use caches, every memory operation is a long-latency operation, which increases the number of threads needed and the difficulty of hiding the latency. This also means that every memory reference consumes memory and perhaps communication bandwidth, so the machine must provide tremendous bandwidth as well. The blocked multithreading approach, on the other hand, requires less modification to a commodity microprocessor. It utilizes caches and does not switch threads on cache hits, thus providing good single-thread performance and needing a smaller number of threads. However, it has high context switch overhead and cannot hide short latencies. The high switch overhead also makes it less suited to tolerating the not-so-large latencies on uniprocessors than the larger latencies on multiprocessors. It is therefore difficult to justify either of these schemes for uniprocessors and hence for the high-volume marketplace.

It is possible to use an interleaved approach with both caching and full single-thread pipeline support, thus requiring a smaller number of threads to hide memory latency, providing better support for uniprocessors, and incurring lower overheads. One such scheme has been studied in detail, and has in fact been termed the "interleaved" scheme [LGH94]. We shall describe it next. From the HEP and Tera approaches, this interleaved approach takes the idea of maintaining a set of active threads, each with its own set of registers and status words, and having the processor select an instruction from one of the ready threads every cycle. The selection is usually simple, such as round-robin among the ready threads. A thread that incurs a long-latency event makes itself unready until the event completes, as before. The difference is that the pipeline is a standard microprocessor pipeline, and has full bypassing and forwarding support so that instructions from the same thread can be issued in consecutive cycles; there is no minimum issue delay as in Tera. In the best case, with no pipeline bubbles due to dependences, a k-deep pipeline may contain k instructions from the same thread. In addition, the use of caches to reduce latency implies that most memory operations are not long-latency events; a given thread is ready a much larger fraction of the time, so the number of threads needed to hide latency is kept small. For example, if each thread incurs a cache miss every 30 cycles, and a miss takes 120 cycles to complete, then only five threads (the one that misses and four others) are needed to achieve full processor utilization. The overhead in this interleaved scheme arises from the same source as in the blocked scheme, which is the other scheme that uses caches.
A cache miss is detected late in the pipeline, and if there are only a few threads then the thread that incurs the miss may have fetched other instructions into the pipeline. Unlike in the Tera, where the compiler guarantees through lookahead that any instruction entering the pipeline after a memory instruction is independent of the memory instruction, we must do something about these instructions. For the same reason as in the blocked scheme, the proposed approach chooses to squash these instructions, i.e. to mark them as not
being allowed to modify any processor state. Note that due to the cycle-by-cycle interleaving there are instructions from other threads interleaved in the pipeline, so unlike the blocked scheme not all instructions need to be squashed, only those from the thread that incurred the miss. The cost of making a thread unready is therefore typically much smaller than that of a context switch in the blocked scheme, in which case all instructions in the pipeline must be squashed. In fact, if enough ready threads are supported and available, which requires a larger degree of state replication and is advocated by this approach, then there may not be multiple instructions in the pipeline from the missing thread and no instructions will need to be squashed. The comparison with the blocked approach is shown in Figure 11-29.
[Figure 11-29 artwork: the seven pipeline stages IF1 IF2 RF EX DF1 DF2 WB; in the blocked case they hold Ai through Ai+6, all from thread A, while in the interleaved case they hold instructions from threads A, B, and C.]
Figure 11-29 “Switch cost” in the interleaved scheme with full single-thread support. The figure shows the assumed seven-stage pipeline, followed by the impact of late miss detection in the blocked scheme (taken from Figure 11-26) in which all instructions in the pipeline are from thread A and have to be squashed, followed by the situation for the interleaved scheme. In the interleaved scheme, instructions from three different threads are in the pipeline, and only those from thread A need to be squashed. The “switch cost” or overhead incurred on a miss is three cycles rather than seven in this case.
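The selective squash that Figure 11-29 illustrates can be sketched as follows in C. The pipeline is modeled simply as an array of per-stage context identifiers; the stage count and names are illustrative, not a description of any real design.

/*
 * Selective squash on a miss in the interleaved scheme (Figure 11-29).
 * Each pipeline stage holds the context id of its instruction, or -1 if
 * the stage is empty.
 */
#define STAGES 7   /* IF1 IF2 RF EX DF1 DF2 WB */

/* Squash every in-flight instruction belonging to the missing context and
 * return the number squashed. In the blocked scheme all STAGES entries
 * would belong to one thread, so the whole pipeline would be squashed;
 * here only the missing thread's instructions are lost. */
static int squash_on_miss(int pipeline[STAGES], int missing_ctx)
{
    int squashed = 0;
    for (int s = 0; s < STAGES; s++) {
        if (pipeline[s] == missing_ctx) {
            pipeline[s] = -1;   /* mark as a bubble: may not modify state */
            squashed++;
        }
    }
    return squashed;
}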
The result of the lower cost of making a thread unready is that shorter latencies such as local memory access latencies can be tolerated more easily than in the blocked case, making this interleaved scheme more appropriate for uniprocessors. Very short latencies that cannot be hidden by the blocked scheme, such as those of short pipeline hazards, are usually hidden naturally by the interleaving of contexts without even needing to make a context unready. The effect of the differences on the simple four-thread, four-deep pipeline example that we have been using so far is illustrated in Figure 11-30. In this simple example, assuming the pipeline is in steady state at the beginning of this sequence, the processor utilization is 21 cycles out of the 30 cycles taken in all, or 70% (compared to 41% for the blocked scheme example). While this example is contrived and uses unrealistic parameters for ease of graphical illustration, the fact remains that on modern superscalar processors that issue a memory operation in almost every cycle, the context switch overhead of switching on every cache miss may become quite expensive. The disadvantage of this scheme compared to the blocked scheme is greater implementation complexity. The blocked scheme and this last interleaved scheme (henceforth called the interleaved scheme) start with simple, commodity processors with full pipeline interlocks and caches, and modify them to make them multithreaded. As stated earlier, even if they are used with superscalar processors, they only issue instructions from within a single thread in a given cycle. A more sophisticated multithreading approach exists for superscalar processors, but for simplicity let us examine the performance and implementation issues for these simpler, more directly comparable approaches first.
[Figure 11-30 artwork: timelines for threads A-D; the legend distinguishes busy cycles from each thread, context "switch" overhead due to squashing instructions, and idle (stall) cycles; the lower timeline shows instructions from the four threads interleaved cycle by cycle, with memory and pipeline latencies marked.]
Figure 11-30 Latency tolerance in the interleaved scheme. The top part of the figure shows how the four active threads on a processor would behave if each was the only thread running on the processor. The bottom part shows how the processor switches among threads. Again, the instruction in a slot (cycle) is the one that retires (or would retire) from the pipeline in that cycle. In the first four cycles shown, an instruction from each thread retires. In the fifth cycle, A would have retired but discovers that it has missed and needs to become unready. The three other instructions in the pipeline at that time are from the three other threads, so there is no switch cost except for the one cycle lost to the instruction that missed. When B's next instruction reaches the WB stage (cycle 9 in the figure) it detects a miss and has to become unready. At this time, since A is already unready, one instruction each from C and D has entered the pipeline, as has one more from B (this one is now in its IF stage, and would have retired in cycle 12). Thus, the instruction from B that misses wastes a cycle, and one instruction from B has to be squashed. Similarly, C's instruction that would have retired in cycle 13 misses and causes another instruction from C to be squashed, and so on.
11.8.2 Performance Benefits

Studies have shown that both the blocked scheme and the interleaved scheme with full pipeline interlocks and caching can hide read and write latency quite effectively [LGH94, KCA91]. The number of active contexts needed is found to be quite small, usually in the vicinity of four to eight, although this may change as the latencies become longer relative to processor cycle time. Let us examine some results from the literature for parallel programs [Lau94]. The architectural model is again a cache-coherent multiprocessor with 16 processors using a memory-based protocol. The processor model is single issue and modeled after the MIPS R4000 for the integer pipeline and the DEC Alpha 21064 for the floating-point pipeline. The cache hierarchy used is a small (64 KB) single-level cache, and the latencies for different types of accesses are modeled after the Stanford DASH prototype [LLJ+92]. Overall, both the blocked and interleaved schemes were found to improve performance substantially. Of the seven applications studied, the speedup from multithreading ranged from 2.0 to nearly 3.5 for three applications, from 1.2 to 1.6 for three
others, and was negligible for the last application because it had very little extra parallelism to begin with (see Figure 11-31). The interleaved scheme was found to outperform the blocked scheme, as
[Figure 11-31 artwork: bar chart of speedup over one thread per processor (0 to 3) for blocked and interleaved multithreading with 4 and 8 contexts, for Barnes, Water, Ocean, Locus, and PTHOR.]
Figure 11-31 Speedups for blocked and interleaved multithreading. The bars show speedups for different numbers of contexts (4 and 8) relative to a single-context per processor execution for seven applications. All results are for sixteen processor executions. Some of the applications were introduced in Figure 11-24 on page 810. Water is a molecular dynamics simulation of water molecules in liquid state. The simulation model assumes a single-level, 64KB, direct-mapped writeback cache per processor. The memory system latencies assumed are 1 cycle for a hit in the cache, 24-45 cycles with a uniform distribution for a miss satisfied in local memory, 75-135 cycles for a miss satisfied at a remote home, and 96-156 cycles for a miss satisfied in a dirty node that is not the home. These latencies are clearly low by modern standards. The R4000-like integer unit has a seven-cycle pipeline, and the floating point unit has a 9-stage pipeline (5 execute stages). The divide instruction has a 61 cycle latency, and unlike other functional units the divide unit is not pipelined. Both schemes switch contexts on a divide instruction. The blocked scheme uses an explicit switch instruction on synchronization events and divides, which has a cost of 3 cycles (less than a full 7-cycle context switch, because the decision to switch is known after the decode stage rather than at the writeback stage). The interleaved scheme uses a backoff instruction in these cases, which has a cost of 1-3 cycles depending on how many instructions need to be squashed as a result.
expected from the discussion above, with a geometric mean speedup of 2.75 compared to 1.9. The advantages of the interleaved scheme are found to be greatest for applications that incur a lot of latency due to short pipeline stalls, such as those for result dependences in floating point add, subtract and multiply instructions. Longer pipeline latencies, such as the tens of cycles of latency of divide operations, can be tolerated quite well by both schemes, though the interleaved scheme still performs better due to its lower switch cost. The advantages of the interleaved scheme are retained even when the organizational and performance parameters of the extended memory hierarchy are changed (for example, longer latencies and multilevel caches). They are in fact likely to be even greater with modern processors that issue multiple operations per cycle, since the frequency of cache misses and hence context switches is likely to increase. A potential disadvantage of multithreading is that multiple threads of execution share the same cache, TLB and branch prediction unit, raising the possibility of negative interference between them (e.g. mapping conflicts across threads in a low-associativity cache). These negative effects have been found to be quite small in published studies. Figure 11-32 shows more detailed breakdowns of execution time, averaged over all processors, for two applications that illustrate interesting effects. With a single context, Barnes shows
[Figure 11-32 artwork: normalized execution time (0 to 140) for Barnes and PTHOR with 1, 2, 4, and 8 contexts, under (a) the blocked scheme and (b) the interleaved scheme, broken into busy time, short and long instruction stalls, memory, synchronization, and context switch time.]
Figure 11-32 Execution time breakdowns for two applications under multithreading. Busy time is the time spent executing application instructions; pipeline stall time is the time spent stalled due to pipeline dependences; memory time and synchronization time are the time spent stalled on the memory system and at synchronization events, respectively. Finally, context switch time is the time spent in context switch overhead.
significant memory stall time due to the single-level direct-mapped cache used (as well as the small problem size of only 4K bodies). The use of more contexts per processor is able to hide most of the memory latency, and the lower switch cost of the interleaved scheme is clear from the figure. The other major form of latency in Barnes (and in Water) is pipeline stalls due to long-latency floating point instructions, particularly divides. The interleaved scheme is able to hide this latency more effectively than the blocked scheme. However, both start to taper off in their ability to hide divide latency at more than four contexts. This is because the divide unit is not pipelined, so it quickly becomes a resource bottleneck when divides from different contexts compete. PTHOR is an example of an application in which the use of more contexts does not help very much, and even hurts as more contexts are used. Memory latency is hidden quite well as soon as we go to two contexts, but the major bottleneck is synchronization latency. The application simply does not have enough extra parallelism (slackness) to exploit multiple contexts effectively. Note that busy time increases with the number of contexts in PTHOR. This is because the application uses a set of distributed task queues, and more time is spent maintaining these queues as the number of threads increases. These extra instructions cause more cache misses as well. Prefetching and multithreading can hide similar types of memory access latency, and one may be better than the other in certain circumstances. The performance benefits of using the two together are not well understood. For example, the use of multithreading may cause constructive or destructive interference in the cache among threads, which is very difficult to predict and so makes the analysis of what to prefetch more difficult. The next two subsections discuss some detailed implementation issues for the blocked and interleaved schemes, examining the additional implementation complexity needed to implement each scheme beyond that needed for a commodity microprocessor. Readers not interested in implementation can skip to Section 11.8.5 to see a more sophisticated multithreading scheme for superscalar processors. Let us begin with the blocked scheme, which is simpler to implement.
11.8.3 Implementation Issues for the Blocked Scheme

Both the blocked and interleaved schemes have two kinds of requirements: state replication and control. State replication essentially involves replicating the register file, program counter and relevant portions of the processor status word once per active context, as discussed earlier. For control, logic and registers are needed to manage switching between contexts, making contexts ready and unready, etc. The PC unit is the portion of the processor that requires significant changes, so we discuss it separately as well.

State Replication

Let us look at the register file, the processor status word, and the program counter unit separately. Simple ways to give every active context its own set of registers are to replicate the register file or to use a larger, statically segmented register file (see Figure 11-33). While this allows registers to
Figure 11-33 A segmented register file for a multithreaded processor. The file is divided into four frames, assuming that four contexts can be active at a given time. The register values for each active context remain in that context’s register frame across context switches, and each context’s frame is managed by the compiler as if it were itself a register file. A register in a frame is accessed through the current frame pointer and a register offset within the frame, so the compiler need not be aware of which particular frame a context is in (which is determined at run time). Switching to a different active context only requires that the current frame pointer be changed.
be accessed quickly, it may not use silicon area efficiently. For example, since only one context runs at a time until it encounters a long-latency event, only one register file is actively being used for some time while the others are idle. At the very least, we would like to share the read and write ports across register files, since these ports often take up a substantial portion of the silicon area of the files. In addition, some contexts might require more or fewer registers than others, and the relative needs may change dynamically. Thus, allowing the contexts to share a large register file dynamically according to need may provide better register utilization than splitting the registers statically into equally-sized register files. This results in a cache-like structure, indexed by context identifier and register offset, with the potential disadvantage that the register file is larger and so has a higher access time. Several proposals have been made to improve register file efficiency with respect to one or more of the above issues [NuD95,Lau94,Omo94,Smi85], but substantial replication is needed in all cases. The MIT Alewife machine uses a modified Sun SPARC processor, and uses its register windows mechanism to provide a replicated register file.
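As a concrete illustration of the segmented organization in Figure 11-33, the following C sketch shows how a register reference can be formed from a frame pointer and an offset. The sizes and names are illustrative only, not taken from any particular processor.

/*
 * Statically segmented register file in the style of Figure 11-33:
 * 4 hardware contexts, 32 architectural registers per context.
 */
#define NUM_CONTEXTS   4
#define REGS_PER_FRAME 32

static long regfile[NUM_CONTEXTS * REGS_PER_FRAME];  /* one physical file */

/* The only per-context state needed to locate a register is the frame
 * pointer; a context switch just changes current_frame. */
static int current_frame = 0;

static long *reg(int offset)                 /* architectural register r<offset> */
{
    return &regfile[current_frame * REGS_PER_FRAME + offset];
}

static void switch_context(int next_context) /* no saving or restoring of registers */
{
    current_frame = next_context;
}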
A modern "processor status word" is actually several registers; only some parts of it (such as floating point status/control, etc.) contain process-specific state rather than global machine state, and only these parts need to be replicated. In addition, multithreading introduces a new global status word called the context status word (CSW). This contains an identifier that specifies which context is currently running, a bit that says whether context switching is enabled (we have seen that it may be disabled while exceptions are handled), and a bit vector that tells us which of the currently active contexts (the ones whose state is maintained in hardware) are ready to execute. Finally, TLB control registers need to be modified to support different address space identifiers from the different contexts, and to allow a single TLB entry to be used for a page that is shared among contexts.

PC Unit

Every active context must also have its program counter (PC) available in hardware. Processors that support exceptions efficiently provide a mechanism to do this with minimal replication, since in many ways exceptions behave like context switches. Here's how this works. In addition to the PC chain, which holds the PCs for the instructions that are in the different stages of the pipeline, a register called the exception PC (EPC) is provided in such processors. The EPC is fed by the PC chain, and contains the address of the last instruction retired from the pipeline. When an exception occurs, the loading of the EPC is stopped with the faulting instruction, all the incomplete instructions in the pipeline are squashed, and the exception handler address is put in the PC. When the exception returns, the EPC is loaded into the PC so the faulting instruction is reexecuted. This is exactly the functionality we need for multiple contexts. We simply need to replicate the EPC register, providing one per active context. The EPC for a context serves to handle both exceptions and context switches. When a given context is operating, the PC chain feeds into its EPC while the EPCs of other contexts simply retain their values. On an exception, the current context's EPC behaves exactly as it did in the single-threaded case. On a context switch, the current context's EPC stops being loaded at the faulting (long-latency) instruction and the incomplete instructions in the pipeline are squashed (as for an exception). The PC is loaded from the EPC of the selected next context, which then starts executing, and so on. The only drawback of this scheme is that now an exception cannot take a context switch (since the value loaded into the EPC by the context that incurred the exception would be lost), so context switches must be disabled when an exception occurs and re-enabled when the exception returns. However, the PCs can still be managed through software saving and restoring even in this case.

Control

The key functions of the control logic in a blocked implementation are to detect when to switch contexts, to choose the context to switch to, and to orchestrate and perform the switch. Let us discuss each briefly. A context switch in the blocked scheme may be triggered by three events: (i) a cache miss, (ii) an explicit context switch instruction, used for synchronization events and very long-latency instructions, and (iii) a timeout. The timeout is used to ensure that a single context does not run too long, or spin forever waiting on a flag that another thread on the same processor would otherwise never get to write.
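Putting the CSW fields and these trigger conditions together, a minimal sketch of the switch-decision and round-robin selection logic might look like the following C fragment. The field widths and signal names are hypothetical, and a real implementation would of course be hardware logic rather than software.

/*
 * Sketch of the context status word (CSW) and the blocked-scheme switch
 * decision. The 8-bit ready vector limits this sketch to 8 contexts.
 */
#include <stdint.h>
#include <stdbool.h>

typedef struct {
    uint8_t running_ctx;    /* identifier of the currently running context */
    bool    switch_enable;  /* cleared while an exception is being handled */
    uint8_t ready_vector;   /* bit i set if active context i is ready      */
} CSW;

/* A switch is taken on a cache miss, an explicit switch instruction, or a
 * timeout, but only if switching is enabled and another context is ready. */
static bool should_switch(const CSW *csw, bool cache_miss,
                          bool explicit_switch, bool timeout)
{
    uint8_t others = csw->ready_vector & (uint8_t)~(1u << csw->running_ctx);
    return (cache_miss || explicit_switch || timeout)
           && csw->switch_enable && others != 0;
}

/* Round-robin selection of the next ready context, as suggested in the text. */
static int next_context(const CSW *csw, int num_contexts)
{
    for (int i = 1; i <= num_contexts; i++) {
        int c = (csw->running_ctx + i) % num_contexts;
        if (csw->ready_vector & (1u << c))
            return c;
    }
    return csw->running_ctx;   /* no other ready context */
}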
The decision to switch on a cache miss can be based on three signals: the cache miss notification, of course; a bit that says that context switching is enabled; and a signal that states that there is a context ready to run. A simple way to implement an explicit context switch instruction is to have it behave as if the next instruction generated a cache miss (i.e. to raise the cache miss signal or generate another signal that has the same effect on the context switch logic); this will cause the context to switch, and to be restarted from that following instruction.
Finally, the timeout signal can be generated via a resettable threshold counter. Of the many policies that can be used to select the next context upon a switch, in practice simply switching to the next active context in a round-robin fashion, without concern for special relationships among contexts or the history of the contexts' executions, seems to work quite well. The signals this requires are the current context identifier, the vector of ready contexts, and the signal that detected the need to switch. Lastly, orchestrating a context switch in the blocked scheme requires the following actions. They are required to complete by different stages of the processor pipeline, so the control logic must enable the corresponding signals in the appropriate time windows.
• Save the address of the first uncompleted instruction from the current thread.
• Squash all incomplete instructions in the pipeline.
• Start executing from the (saved) PC of the selected context.
• Load the appropriate address space identifier in the TLB bound registers.
• Load various control/status registers from the (saved) processor status words of the new context, including the floating point control/status register and the context identifier.
• Switch the register file control to the register file for the new context, if applicable.

In summary, the major costs of implementing blocked context switching come from replicating and managing the register file, which increases both area and register file access time. If the latter is in the critical path of the processor cycle time, it may cause the pipeline depth to be increased to maintain a high clock rate, which can increase the time lost on branch mispredictions. The replication needed in the PC unit is small and easy to manage, focused mainly on replicating a single register for the exception PC. The number of process- or context-specific registers in the processor status words is also small, as is the number of signals needed to manage context switch control. Let us examine the interleaved scheme.

11.8.4 Implementation Issues for the Interleaved Scheme

A key reason that the blocked scheme is relatively easy to implement is that most of the time the processor behaves like a single-threaded processor, and invokes additional complexity and processor state changes only at context switches. The interleaved scheme needs a little more support, since it switches among threads every cycle. This potentially requires the processor state to be changed every cycle, and implies that the instruction issue unit must be capable of issuing from multiple active streams. A mechanism is also needed to make contexts active and inactive, and feed the active/inactive status into the instruction unit every cycle. Let us again look at the state replication and control needs separately.

State Replication

The register file must be replicated or managed dynamically, but the pressure on fast access to different parts of the entire register file is greater due to the fact that successive cycles may access the registers of different contexts. We cannot rely on the more gradually changing access patterns of the blocked scheme (the Tera processor uses a banked or interleaved register file, and a thread may be rendered unready due to a busy register bank as well). The parts of the process status
word that must be replicated are similar to those in the blocked scheme, though again the processor must be able to switch status words every cycle.

PC Unit

The greatest changes are to the PC unit. Instructions from different threads are in the pipeline at the same time, and the processor must be able to issue instructions from different threads every cycle. Contexts become active and inactive dynamically, and this status must feed into the PC unit to prevent instructions being selected from an inactive context. The processor pipeline is also impacted, since to implement bypassing and forwarding correctly the processor's PC chain must now carry a context identifier for each pipeline stage. In the PC unit itself, new mechanisms are needed for handling context availability and for squashing instructions, for keeping track of the next instruction to issue, for handling branches, and for handling exceptions. Let us examine some of these issues briefly. A fuller treatment can be found in the literature [Lau94].

Consider context availability. Contexts become unavailable due to either cache misses or explicit backoff instructions that make the context unavailable for a specified number of cycles. The backoff instructions are issued, for example, at synchronization events. The issuing of further instructions from that context is stopped by clearing a "context available" signal. To squash the instructions already in the pipeline from that context, we must broadcast a squash signal as well as the context identifier to all stages, since we don't know which stages contain instructions from that context. On a cache miss, the address of the instruction that caused the miss is loaded into the EPC. Once the cache miss is satisfied and the context becomes available again, the PC bus is loaded from the EPC when that context is selected next. Explicit backoff instructions are handled similarly, except that we do not want the context to resume from the backoff instruction itself but rather from the instruction that follows it. To orchestrate this, a next bit can be included in the EPC.

There are three sources that can determine the next instruction to be executed from a given thread: the next sequential instruction, the predicted branch from the branch target buffer (BTB), and the computed branch if the prediction is detected to be wrong. When only a single context is in the pipeline at a time, the appropriate next instruction address can be driven onto the PC bus from the "next PC" (NPC) register as soon as it is determined. In the interleaved case, however, in the cycle when the next instruction address for a given context is determined and ready to be put on the PC bus, that context may not be the one scheduled for that cycle. Further, since the NPC for a context may be determined in different pipeline stages for different instructions (for example, it is determined much later for a mispredicted branch than for a correctly predicted branch or a non-branch instruction), different contexts could produce their NPC value during the same cycle. Thus, the NPC value for each context must be held in a holding register until it is time to execute the next instruction from that context, at which point it will be driven onto the PC bus (see Figure 11-34). Branches too require some additional mechanisms. First, the context identifier must be broadcast to the pipeline stages when squashing instructions past a mispredicted branch, but this is the same functionality needed when making contexts unavailable.
Second, by the time the actual branch target is computed, the predicted instruction that was fetched speculatively could be anywhere in the pipeline or may not even have been issued yet (since other contexts will have intervened unpredictably). To find this instruction address and determine the correctness of the branch, it may be necessary for branch instructions to carry along with them their predicted address as they proceed along the pipeline stages. One way to do this is to provide a predicted PC register
[Figure 11-34 artwork: the next sequential instruction, predicted branch address, and computed branch address drive the PC bus; (a) the blocked approach drives the PC bus directly, while (b) the interleaved approach with two contexts adds per-context holding registers (Next PC 0, Next PC 1).]
Figure 11-34 Driving the PC bus in the blocked and interleaved multithreading approaches, with two contexts. If more contexts were used, the interleaved scheme would require more replication, while the blocked scheme would not.
chain that runs along parallel to the PC chain, and is loaded and checked as the branch reaches the appropriate pipeline stages.

A final area of enhancement in the PC unit over a conventional processor is exception handling. Consider what happens when an exception occurs in one context. One choice is to have that context be rendered unready to make way for the exception handler, and let that handler be interleaved with the other user contexts (the Tera takes an approach similar to this). In this case, another user thread may also take an exception while the first exception handler is running, so the exception handlers must be able to cope with multiple concurrent handler executions. Another option is to render all the contexts unready when an exception occurs in any context, squash all the instructions in the pipeline, and reenable all contexts when the exception handler returns. This can cause a loss of performance if exceptions are frequent. It also means that the exception PCs (EPCs) of all active contexts need to be loaded with the first uncompleted instruction address from their respective threads when an exception occurs. This is more complicated than in the blocked case, where only the single EPC of the currently running and excepting context needs to be saved (see Exercise a).

Control

Two interesting issues related to control outside of the PC unit are tracking context availability information and feeding it to the PC unit, and choosing and switching to the next context every cycle. The context available signal is modified on a cache miss, when the miss returns, and on backoff instructions and their expiration. Availability status for cache misses can be tracked by maintaining pending miss registers per context, which are loaded upon a miss and checked upon miss return to reenable the appropriate context. For explicit backoff instructions, we can maintain a counter per context, initialized to the backoff value when the backoff instruction is encountered (and the context availability signal cleared). The counter is decremented every cycle until it reaches zero, at which point the availability signal for that context is set again. Backoff instructions can be used to tolerate instruction latency as well, but with the interleaving of contexts it may be difficult to choose a good number of backoff cycles. This is further complicated by the fact that the compiler may rearrange instructions transparently. Backoff values are implementation-specific, and may have to be changed for subsequent generations of processors.
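The availability tracking just described can be sketched as follows; the structure sizes and function names are illustrative and not taken from any particular design.

/*
 * Per-context availability for the interleaved scheme: a pending-miss flag
 * and a backoff counter per context, updated as described in the text.
 */
#include <stdbool.h>

#define NCTX 8

static bool miss_pending[NCTX];
static int  backoff_left[NCTX];

static bool context_available(int c)
{
    return !miss_pending[c] && backoff_left[c] == 0;
}

static void on_cache_miss(int c)      { miss_pending[c] = true;  }
static void on_miss_return(int c)     { miss_pending[c] = false; }
static void on_backoff(int c, int n)  { backoff_left[c] = n;     }

/* Called once per processor cycle to expire backoffs. */
static void tick(void)
{
    for (int c = 0; c < NCTX; c++)
        if (backoff_left[c] > 0)
            backoff_left[c]--;
}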
Fortunately, short instruction latencies are often handled naturally by the interleaving of other contexts without any backoff instructions, as we saw in Figure 11-30. Robust solutions for long instruction latencies may require more complex hardware support such as scoreboarding. As for choosing the next context, a reasonable approach once again is to choose round-robin, qualified by context availability.

11.8.5 Integrating Multithreading with Multiple-Issue Processors

So far, our discussion of multithreading has been orthogonal to the number of operations issued per cycle. We saw that the Tera system issues three operations per cycle, but the packing of operations into wider instructions is done by the compiler, and the hardware chooses a three-operation instruction from a single thread in every cycle. A single thread usually does not have enough instruction-level parallelism to fill all the available slots in every cycle, especially if processors begin to issue many more operations per instruction. With many threads available anyway, an alternative is to let available operations from different threads be scheduled in the same cycle, thus filling instruction slots more effectively. This approach has been called simultaneous multithreading, and there have been many proposals for it [HKN+92, TEL95]. Another way of looking at it is that it is like interleaved multithreading, but operations from the different available threads compete for the functional units in every cycle. Simultaneous multithreading seeks to overcome the drawbacks of both multiple-instruction issue and traditional (interleaved) multithreading by exploiting the simultaneous availability of operations from multiple threads. Simple multiple-issue processors suffer from two inefficiencies. First, not all slots in a given cycle are filled, due to limited ability to find instruction-level parallelism. Second, many cycles have nothing scheduled due to long-latency instructions. Simple multithreading addresses the second problem but not the first, while simultaneous multithreading tries to address both (see Figure 11-35).
[Figure 11-35 artwork: grids of issue slots (instruction width by cycles) for (a) a single-threaded, (b) an interleaved multithreaded, and (c) a simultaneous multithreaded processor.]
Figure 11-35 Simultaneous Multithreading. The potential improvements are illustrated for both simple interleaved multithreading and simultaneous multithreading for a four-issue processor.
Choosing operations from different threads to schedule in the same cycle may be difficult for a compiler, but many of the mechanisms for it are already present in dynamically scheduled microprocessors. The instruction fetch unit must be extended to fetch operations from different hardware contexts in every cycle, but once operations from different threads are fetched and placed in the reorder buffer, the issue logic can choose operations from this buffer regardless of which thread they are from. Studies of single-threaded dynamically scheduled processors have shown that the causes of empty cycles or empty slots are quite well distributed among instruction latencies, cache misses, TLB misses and load delay slots, with the first two often being particularly important. The variety of both the sources of wasted time and of their latencies indicates that fine-grained multithreading may be a good solution.

In addition to the issues for interleaved multithreaded processors, several new issues arise in implementing simultaneously multithreaded processors [TEE+96]. First, in addition to the fundamental property of issuing instructions flexibly from different threads, the instruction fetch unit must be designed in a way that fetches instructions effectively from multiple active contexts as well. The more flexibility is allowed in fetching (compared to, say, fetching from only one context in a cycle, or fetching at most two operations from each thread in a cycle), the greater the complexity of the fetch logic and instruction cache design. However, more flexibility reduces the frequency of fetch slots being left empty because the thread that is allowed to fetch into those slots does not have enough ready instructions. Interesting questions also arise in choosing which context or contexts to fetch instructions from in the next cycle. We could choose contexts in a fixed order of priority (say, try to fill from context 0 first, then fill the rest from context 1, and so on), or we could choose based on execution characteristics of the contexts (for example, give priority to the context that has the fewest instructions currently in the fetch unit or the reorder buffer, or the context that has the fewest outstanding cache misses). Finally, we have to decide which operations to choose from the reorder buffer among the ones that are ready in each cycle. The standard practice in dynamically scheduled processors is to choose the oldest operation that is ready, but other choices may be more appropriate based on the thread the operations are from and how it is behaving.

There is little or no performance data on simultaneous multithreading in the context of multiprocessors. For uniprocessors, a performance study examines the potential performance benefits as well as the impact of some of the tradeoffs discussed above [TEE+96]. That study finds, for example, that speculative execution is less important with simultaneous multithreading than with single-threaded dynamically scheduled processors, because there are more available threads and hence non-speculative instructions to choose from. Overall, as data access and synchronization latencies become larger relative to processor speed, multithreading promises to become increasingly successful in hiding latency. Whether or not it will actually be incorporated in microprocessors depends on a host of other factors, such as what other latency tolerance techniques are employed (such as prefetching, dynamic scheduling, and relaxed consistency models) and how multithreading interacts with them.
Since multithreading requires extra explicit threads and significant replication of state, an interesting alternative is to place multiple simpler processors on a chip and to place the multiple threads on different processors. How this compares with multithreading (e.g. simultaneous multithreading) in cost and performance is not yet well understood, either for desktop systems or as a node for a larger multiprocessor.
This concludes our discussion of latency tolerance in a shared address space. Table 11-2 summarizes and compares some key features of the four major techniques used.

Table 11-2 Key properties of the four latency hiding techniques in a shared address space.

Property                      | Relaxed Models                                                            | Prefetching    | Multithreading                            | Block Data Transfer
Types of latency hidden       | Write (blocking-read processors); Read (dynamically scheduled processors) | Write, Read    | Write, Read, Synchronization, Instruction | Write, Read
Requirements of software      | Labeling synch. operations                                                | Predictability | Explicit extra concurrency                | Identifying and orchestrating block transfers
Hardware support              | Little                                                                    | Little         | Substantial                               | Not in processor, but block transfer engine in memory system
Support in commercial micros? | Yes                                                                       | Yes            | Currently no                              | Not needed

The techniques can be and often are combined. For example, processors with blocking reads can use relaxed consistency models to hide write latency and prefetching or multithreading to hide read latency. And we have seen that dynamically scheduled processors can benefit from all of prefetching, relaxed consistency models and multithreading individually. How these different techniques interact in dynamically scheduled processors, and in fact how well prefetching might complement multithreading even in blocking-read processors, is likely to be better understood in the future.
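To make one such combination concrete, the sketch below (an illustration only; it assumes a compiler that provides the GCC-style __builtin_prefetch intrinsic, a prefetch distance PDIST large enough to cover the read miss latency, and non-overlapping arrays) shows how a blocking-read processor might use software prefetching to hide read latency while its writes simply sit in the write buffer, as a relaxed consistency model allows, rather than being waited on:

    #include <stddef.h>

    #define PDIST 8   /* prefetch distance in iterations; assumed to cover the read miss latency */

    void scale(double *dst, const double *src, size_t n, double k)
    {
        for (size_t i = 0; i < n; i++) {
            /* Issue a read prefetch PDIST iterations ahead so the later load hits. */
            if (i + PDIST < n)
                __builtin_prefetch(&src[i + PDIST], 0 /* read */, 1);
            /* The store can be buffered and retired later; under a relaxed model
             * the processor does not stall waiting for it to complete. */
            dst[i] = k * src[i];
        }
    }

On a machine that also supports read-exclusive prefetches, the stores to dst could be covered in the same way so that ownership is obtained before the write reaches the cache.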
11.9 Lockup-free Cache Design

Throughout this chapter, we have seen that in addition to the support needed in the processor—and the additional bandwidth and low occupancies needed in the memory and communication systems—several latency tolerance techniques in a shared address space require that the cache allow multiple outstanding misses if they are to be effective. Before we conclude this chapter, let us examine how such a lockup-free cache might be designed. There are several key design questions for a cache subsystem that allows multiple outstanding misses:
• How many and what kinds of misses can be outstanding at the same time? Two distinct points in design complexity are: (i) a single read and multiple writes, and (ii) multiple reads and writes. Like the processor, it is easier for the cache to support multiple outstanding writes than reads.
• How do we keep track of the outstanding misses? For reads, we need to keep track of (i) the address of the word requested, (ii) the type of read request (i.e. read, read-exclusive, or prefetch, and single-word or double-word read), (iii) the place to return the data when it comes into the cache (e.g. to which register within a processor, or to which processor if multiple processors are sharing a cache), and (iv) the current status of the outstanding request. For writes, we do not need to track where to return the data, but the new data being written must be merged with the data block returned by the next level of the memory hierarchy. A key issue here is whether to store most of this information within the cache blocks themselves or whether to
have a separate set of transaction buffers. Of course, while fulfilling the above requirements, we need to ensure that the design is deadlock/livelock free.
• How do we deal with conflicts among multiple outstanding references to the same memory block? What kinds of conflicting misses to a block should we allow and disallow (e.g., by stalling the processor)? For example, should we allow writes to words within a block to which a read miss is outstanding?
• How do we deal with conflicts between multiple outstanding requests that map to the same block in the cache, even though they refer to different memory blocks?
To illustrate the options, let us examine two different designs. They differ primarily in where they store the information that keeps track of outstanding misses. The first design uses a separate set of transaction buffers for tracking requests. The second design, to the extent possible, keeps track of outstanding requests in the cache blocks themselves.
The first design is a somewhat simplified version of that used in Control Data Corporation’s Cyber 835 mainframe, introduced in 1979 [Kro81]. It adds a number of miss state holding registers (MSHRs) to the cache, together with some associated logic. Each MSHR handles one or more misses to a single memory block. This design allows considerable flexibility in the kinds of requests that can be simultaneously outstanding to a block, so a significant amount of state is stored in each MSHR, as shown in Table 11-3.

Table 11-3 MSHR State Entries and their Roles.

State                                   | Description                                       | Purpose
Cache block pointer                     | Pointer to cache block allocated to this request  | Which cache line within a set in set-associative cache?
Request address                         | Address of pending request                        | Allows merging of requests
Unit identification tags (one per word) | Identifies requesting processor unit              | Where to return this word?
Send-to-cpu status (one per word)       | Valid bits for unit identification tags           | Is unit id tag valid?
Partial write codes (one per word)      | Bit vector for tracking partial writes to a word  | Which words returning from memory to overwrite?
Valid indicator                         | Valid MSHR contents                               | Are contents valid?
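For illustration only, the per-MSHR state of Table 11-3 might be rendered in C roughly as below. The block size, the field types, and the collapsing of the per-word partial write codes into a single flag per word are simplifying assumptions, not details of the original design:

    #define WORDS_PER_BLOCK 8   /* assumed words per cache block */

    struct mshr {
        int      valid;                       /* valid indicator: are the contents valid?      */
        unsigned block_addr;                  /* request address, used to merge later requests */
        int      cache_way;                   /* cache block pointer: which line in the set    */
        unsigned char unit_id[WORDS_PER_BLOCK];       /* which unit/register gets word i       */
        unsigned char send_to_cpu[WORDS_PER_BLOCK];   /* is unit_id[i] valid?                  */
        unsigned char partial_write[WORDS_PER_BLOCK]; /* word i written while pending; do not  */
                                                      /* overwrite it with the returning data  */
    };

An incoming access would search all MSHRs associatively on block_addr; a hit means the new request is merged into the matching entry rather than allocating a new one.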
The MSHRs are accessed in parallel to the regular cache. If the access hits in the cache, the normal cache hit actions take place. If the access misses in the cache, the actions depend on the content of the MSHRs:
• If no MSHR is allocated for that block, a new one is allocated and initialized. If none is free, or if all cache lines within a set have pending requests, then the processor stalls. If the allocated block contains dirty data, a writeback is initiated. If the processor request is a write, the data is written at the proper offset into the cache block, and the corresponding partial write code bits are set in the MSHR. A request to fetch the block from the main memory subsystem (e.g., BusRd, BusRdX) is also initiated.
• If an MSHR is already allocated for the block, the new request is merged with the previously pending requests for the same block. For example, a new write request can be merged by writing the data into the allocated cache block and by setting the corresponding partial write bits. A read request to a word that has completely been written in the cache (by earlier writes) can simply read the data from the cache memory. A read request to a word that has not been requested before is handled by setting the proper unit identification tags. If it is to a word that
has already been requested, then either a new MSHR must be allocated (since there is only one unit identification tag per word) or the processor must be stalled. Since a write does not need a unit identification tag, a write request for a word to which a read is already pending is handled easily: the data returned by main memory can simply be forwarded to the processor. Of course, the first such write to a block will have to generate a request that asks for exclusive ownership. Finally, when the data for the block returns to the cache, the cache block pointer in the MSHR indicates where to put the contents. The partial write codes are used to avoid writing stale data into the cache, and the send-to-cpu bits and unit identification tags are used to forward replies to waiting functional units. This design requires an associative lookup of MSHRs, but allows the cache to have all kinds of memory references outstanding at the same time. Fortunately, since the MSHRs do the complex task of merging requests and tearing apart replies for a given block, to the extended memory hierarchy below it appears simply as if there are multiple requests to distinct blocks coming from the processor. The coherence protocols we have discussed in the previous chapters are already designed to handle multiple outstanding requests to different blocks without deadlock.
The alternative to this design is to store most of the relevant state in the cache blocks themselves, and not use separate MSHRs. In addition to the standard MESI states for a writeback cache, we add three transient or pending states: invalid-pending (IP), shared-pending (SP), and exclusive-pending (EP), corresponding to what the state of the block was when the write was issued. In each of these three states, the cache tag is valid and the cache block is awaiting data from the memory system. Each cache block also has subblock write bits (SWBs), a bit vector with one bit per word. In the IP state, all SWBs must be OFF and all words are considered invalid. In both the EP and SP states, the bits that are turned ON indicate the words in the block that have been written by the processor since the block was requested from memory, and which the data returned from memory should not overwrite. However, in the EP state, words for which the bits are OFF are considered invalid, while in the SP state, those words are considered valid (not stale). Finally, there is a set of pending-read registers; these contain the address and type of pending read requests. The key benefit of keeping this extra state information with each cache block is that no additional storage is needed to keep track of pending write requests. On a write that does not find the block in modified state, the block simply goes into the appropriate pending state (depending on what state it was in before), initiates the appropriate bus transaction, and sets the SWB bits to indicate which words the current write has modified so the subsequent merge will happen correctly. Writes that find the block already in a pending state only require that the word be written into the line and the corresponding SWB set (this is true except if the state was invalid-pending—i.e. a request for a shared block is outstanding—in which case the state is changed to EP and a read-exclusive request is generated). Reads may use the pending-read registers. If a read finds the desired word in a valid state in the block (including a pending state with the SWB on), then it simply returns it.
Otherwise, it is placed in a pending-read register, which keeps track of it. Of course, if the block accessed on a read or write is not in the cache (no tag match) then a writeback may be generated, the block is set to the invalid-pending state, all SWBs are turned off (except for the word being written, if it is a write) and the appropriate transaction is placed on the bus. If the tag does not match and the existing block is already in a pending state, then the processor stalls. Finally, when a response to an outstanding request arrives, the corresponding cache block is updated except for the words that have their SWBs on. The cache block moves out of the pending state. All pending-read registers are checked to see if any were waiting for this cache block; if so, data is returned for those requests and those pending-read registers are freed. Details of the actual state changes, actions and race conditions can be found in [Lau94]. One key observation that makes race conditions relatively easy to deal with is that even though words are written into cache blocks before ownership is obtained for the block, those words are not visible to requests from other processors until ownership is obtained.
Overall, the two designs described above are not that different conceptually. The latter solution keeps the state for writes in the cache blocks and does not use separate tracking registers for writes. This reduces the number of pending registers needed and the complexity of the associative lookups, but it is more memory intensive, since extra state is stored with all lines in the cache even though only a few of them will have an outstanding request at a time. The correctness interactions with the rest of the protocol are similar and modest in the two cases.
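As a rough sketch of the refill step in the second design (the state names, the single-word write granularity, and the choice of final state are assumptions made for this illustration), the subblock write bits select which words of the returning block are kept rather than overwritten:

    enum line_state { INVALID, SHARED, EXCLUSIVE, MODIFIED,
                      INV_PENDING, SH_PENDING, EX_PENDING };

    #define WORDS_PER_BLOCK 8

    struct cache_line {
        enum line_state state;
        unsigned tag;
        unsigned swb;                      /* subblock write bits, one per word */
        unsigned data[WORDS_PER_BLOCK];
    };

    /* Called when the memory system returns the data for an outstanding request. */
    void refill(struct cache_line *line, const unsigned reply[WORDS_PER_BLOCK])
    {
        for (int w = 0; w < WORDS_PER_BLOCK; w++)
            if (!(line->swb & (1u << w)))      /* keep words written while pending */
                line->data[w] = reply[w];

        /* Leave the pending state: a block that was awaiting ownership (SP or EP,
         * so it holds locally written words) becomes MODIFIED; a block that was
         * only awaiting read data (IP) becomes SHARED. */
        line->state = (line->state == INV_PENDING) ? SHARED : MODIFIED;
        line->swb = 0;

        /* The pending-read registers would be searched here and any reads waiting
         * on this block satisfied; that step and the race conditions are omitted
         * (see [Lau94]). */
    }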
11.10 Concluding Remarks

With the increasing gap between processor speeds and memory access and communication times, latency tolerance is likely to be critical in future multiprocessors. Many latency tolerance techniques have been proposed and are used, and each has its own relative advantages and disadvantages. They all rely on excess concurrency in the application program beyond the number of processors used, and they all tend to increase the bandwidth demands placed on the communication architecture. This greater stress makes it all the more important that the different performance aspects of the communication architecture (the processor overhead, the assist occupancy and the network bandwidth) be both efficient and well balanced.
For cache-coherent multiprocessors, latency tolerance techniques are supported in hardware by both the processor and the memory system, leading to a rich space of design alternatives. Most of these hardware-supported latency tolerance techniques are also applicable to uniprocessors, and in fact their commercial success depends on their viability in the high-volume uniprocessor market, where the latencies to be hidden are smaller. Techniques like dynamic scheduling, relaxed memory consistency models and prefetching are commonly encountered in microprocessor architectures today. The most general latency hiding technique, multithreading, is not yet popular commercially, although recent directions in integrating multithreading with dynamically scheduled superscalar processors appear promising.
Despite the rich space of issues in hardware support, much of the latency tolerance problem today is a software problem. To what extent can a compiler automate prefetching so a user does not have to worry about it? And if not, then how can the user naturally convey information about what and when to prefetch to the compiler? If block transfer is indeed useful on cache-coherent machines, how will users program to this mixed model of implicit communication through reads and writes and explicit transfers? Relaxed consistency models carry with them the software problem of specifying the appropriate constraints on reordering. And will programs be decomposed and assigned with enough extra explicit parallelism that multithreading will be successful? Automating and simplifying the software support required for latency tolerance is a task that is far from fully accomplished. In fact, how latency tolerance techniques will play out in the future and what software support they will use is an interesting open question in parallel architecture.
11.11 References

[AHA+97] Hazim A. Abdel-Shafi, Jonathan Hall, Sarita V. Adve and Vikram S. Adve. An Evaluation of Fine-Grain Producer-Initiated Communication in Cache-Coherent Multiprocessors. In Proceedings of the Third Symposium on High Performance Computer Architecture, February 1997.
[ALK+91] Agarwal, A., Lim, B.-H., Kranz, D. and Kubiatowicz, J. APRIL: A Processor Architecture for Multiprocessing. In Proceedings of the 17th Annual International Symposium on Computer Architecture, June 1990, pp. 104-114.
[ABC+95] Anant Agarwal, Ricardo Bianchini, David Chaiken, Kirk L. Johnson, David Kranz, John Kubiatowicz, Beng-Hong Lim, Kenneth Mackenzie and Donald Yeung. The MIT Alewife Machine: Architecture and Performance. In Proceedings of the 22nd Annual International Symposium on Computer Architecture, pp. 2-13, June 1995.
[ACC+90] Alverson, R., Callahan, D., Cummings, D., Koblenz, B., Porterfield, A., and Smith, B. The Tera Computer System. In Proceedings of the 1990 International Conference on Supercomputing, June 1990, pp. 1-6.
[BaC91] Jean-Loup Baer and Tien-Fu Chen. An Efficient On-chip Preloading Scheme to Reduce Data Access Penalty. In Proceedings of Supercomputing’91, pp. 176-186, November 1991.
[BeF96a] Bennett, J.E. and Flynn, M.J. Latency Tolerance for Dynamic Processors. Technical Report no. CSL-TR-96-687, Computer Systems Laboratory, Stanford University, 1996.
[BeF96b] Bennett, J.E. and Flynn, M.J. Reducing Cache Miss Rates Using Prediction Caches. Technical Report no. CSL-TR-96-707, Computer Systems Laboratory, Stanford University, 1996.
[ChB92] Tien-Fu Chen and Jean-Loup Baer. Reducing Memory Latency via Non-Blocking and Prefetching Caches. In Proceedings of the Fifth International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 51-61, 1992.
[ChB94] Tien-Fu Chen and Jean-Loup Baer. A Performance Study of Software and Hardware Data Prefetching Schemes. In Proceedings of the 21st Annual Symposium on Computer Architecture, pp. 223-232, April 1994.
[DDS95] Fredrik Dahlgren, Michel Dubois, and Per Stenstrom. Sequential Hardware Prefetching in Shared-Memory Multiprocessors. IEEE Transactions on Parallel and Distributed Systems, vol. 6, no. 7, July 1995.
[DFK+92] William J. Dally, J.A. Stuart Fiske, John S. Keen, Richard A. Lethin, Michael D. Noakes and Peter R. Nuth. The Message Driven Processor: A Multicomputer Processing Node with Efficient Mechanisms. IEEE Micro, April 1992, pp. 23-39.
[FuP91] John W.C. Fu and Janak H. Patel. Data Prefetching in Multiprocessor Vector Cache Memories. In Proceedings of the 18th Annual Symposium on Computer Architecture, pp. 54-63, May 1991.
[FPJ92] John W.C. Fu, Janak H. Patel and Bob L. Janssens. Stride Directed Prefetching in Scalar Processors. In Proceedings of the 25th Annual International Symposium on Microarchitecture, pp. 102-110, December 1992.
[GGH91a] Kourosh Gharachorloo, Anoop Gupta, and John Hennessy. Performance Evaluation of Memory Consistency Models for Shared-Memory Multiprocessors. In Proceedings of the 4th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 245-257, April 1991.
[GGH91b] Gharachorloo, K., Gupta, A., and Hennessy, J. Two Techniques to Enhance the Performance of Memory Consistency Models. In Proceedings of the International Conference on Parallel Processing, 1991, pp. I355-I364.
[GGH92] Gharachorloo, K., Gupta, A. and Hennessy, J. Hiding Memory Latency Using Dynamic Scheduling in Shared-Memory Multiprocessors. In Proceedings of the 19th International Symposium on Computer Architecture, 1992, pp. 22-33.
[GrS96] Hakan Grahn and Per Stenstrom. Evaluation of a Competitive-Update Cache Coherence Protocol with Migratory Data Detection. Journal of Parallel and Distributed Computing, vol. 39, no. 2, pp. 168-180, December 1996.
[GHG+91] Anoop Gupta, John Hennessy, Kourosh Gharachorloo, Todd Mowry and Wolf-Dietrich Weber. Comparative Evaluation of Latency Reducing and Tolerating Techniques. In Proceedings of the 18th International Symposium on Computer Architecture, pp. 254-263, May 1991.
[HGD+94] John Heinlein, Kourosh Gharachorloo, Scott Dresser, and Anoop Gupta. Integration of Message Passing and Shared Memory in the Stanford FLASH Multiprocessor. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 38-50, October 1994.
[HBG+97] John Heinlein, Robert P. Bosch, Jr., Kourosh Gharachorloo, Mendel Rosenblum, and Anoop Gupta. Coherent Block Data Transfer in the FLASH Multiprocessor. In Proceedings of the 11th International Parallel Processing Symposium, April 1997.
[HeP95] Hennessy, J.L. and Patterson, D.A. Computer Architecture: A Quantitative Approach. Second Edition, Morgan Kaufmann, 1995.
[HKN+92] Hiroaki Hirata, Kozo Kimura, Satoshi Nagamine, Yoshiyuki Mochizuki, Akio Nishimura, Yoshimori Nakase and Teiji Nishizawa. An Elementary Processor Architecture with Simultaneous Instruction Issuing from Multiple Threads. In Proceedings of the 19th International Symposium on Computer Architecture, pp. 136-145, May 1992.
[Hun96] Hunt, D. Advanced Features of the 64-bit PA-8000. Hewlett Packard Corporation, 1996.
[Int96] Intel Corporation. Pentium(r) Pro Family Developer’s Manual.
[Jor85] Jordan, H.F. HEP Architecture, Programming, and Performance. In Parallel MIMD Computation: The HEP Supercomputer and its Applications, J.S. Kowalik, ed., MIT Press, 1985, page 8.
[Jou90] Norman P. Jouppi. Improving Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers. In Proceedings of the 17th Annual Symposium on Computer Architecture, pp. 364-373, June 1990.
[Kro81] D. Kroft. Lockup-free Instruction Fetch/Prefetch Cache Organization. In Proceedings of the 8th International Symposium on Computer Architecture, pp. 81-87, May 1981.
[KuA93] John Kubiatowicz and Anant Agarwal. The Anatomy of a Message in the Alewife Multiprocessor. In Proceedings of the International Conference on Supercomputing, pp. 195-206, July 1993.
[KuS88] Kuehn, J.T. and Smith, B.J. The Horizon Supercomputing System: Architecture and Software. In Proceedings of Supercomputing’88, November 1988, pp. 28-34.
[KCA91] Kiyoshi Kurihara, David Chaiken, and Anant Agarwal. Latency Tolerance through Multithreading in Large-Scale Multiprocessors. In Proceedings of the International Symposium on Shared Memory Multiprocessing, pp. 91-101, April 1991.
[Lau94] Laudon, J. Architectural and Implementation Tradeoffs in Multiple-context Processors. Ph.D. Thesis, Stanford University, Stanford, California, 1994. Also published as Technical Report CSL-TR-94-634, Computer Systems Laboratory, Stanford University, May 1994.
[LGH94] Laudon, J., Gupta, A. and Horowitz, M. Architectural and Implementation Tradeoffs in the Design of Multiple-Context Processors. In Multithreaded Computer Architecture: A Summary of the State of the Art, edited by R.A. Iannucci, Kluwer Academic Publishers, 1994, pp. 167-200.
[LKB91] R.L. Lee, A.Y. Kwok, and F.A. Briggs. The Floating Point Performance of a Superscalar SPARC Processor. In Proceedings of the 4th Symposium on Architectural Support for Programming Languages and Operating Systems, April 1991.
[LLJ+92] Daniel Lenoski, James Laudon, Truman Joe, David Nakahira, Luis Stevens, Anoop Gupta, and John Hennessy. The DASH Prototype: Logic Overhead and Performance. IEEE Transactions on Parallel and Distributed Systems, 4(1):41-61, January 1993.
[LEE+97] J.L. Lo, S.J. Eggers, J.S. Emer, H.M. Levy, R.L. Stamm, and D.M. Tullsen. Converting Thread-Level Parallelism into Instruction-Level Parallelism via Simultaneous Multithreading. ACM Transactions on Computer Systems, vol. xx, no. xx, August 1997.
[LuM96] Luk, C.-K. and Mowry, T.C. Compiler-based Prefetching for Recursive Data Structures. In Proceedings of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), October 1996.
[MIP96] MIPS Technologies, Inc. R10000 Microprocessor User’s Manual, Version 1.1, January 1996.
[Mow94] Todd C. Mowry. Tolerating Latency through Software-Controlled Data Prefetching. Ph.D. Thesis, Computer Systems Laboratory, Stanford University, 1994. Also Technical Report no. CSL-TR-94-628, Computer Systems Laboratory, Stanford University, June 1994.
[NWD93] Michael D. Noakes, Deborah A. Wallach and William J. Dally. The J-Machine Multicomputer: An Architectural Evaluation. In Proceedings of the 20th International Symposium on Computer Architecture, pp. 224-235, May 1993.
[NuD95] Nuth, Peter R. and Dally, William J. The Named-State Register File: Implementation and Performance. In Proceedings of the First International Symposium on High-Performance Computer Architecture, January 1995.
[Oha96] Moriyoshi Ohara. Producer-Oriented versus Consumer-Oriented Prefetching: a Comparison and Analysis of Parallel Application Programs. Ph.D. Thesis, Computer Systems Laboratory, Stanford University, 1996. Available as Stanford University Technical Report No. CSL-TR-96-695, June 1996.
[Omo94] Amos R. Omondi. Ideas for the Design of Multithreaded Pipelines. In Multithreaded Computer Architecture: A Summary of the State of the Art, Robert Iannucci, Ed., Kluwer Academic Publishers, 1994. See also Amos R. Omondi, Design of a High Performance Instruction Pipeline, Computer Systems Science and Engineering, vol. 6, no. 1, January 1991, pp. 13-29.
[PRA+96] Pai, V.S., Ranganathan, P., Adve, S. and Harton, T. An Evaluation of Memory Consistency Models for Shared-Memory Systems with ILP Processors. In Proceedings of the Seventh Symposium on Architectural Support for Programming Languages and Operating Systems (ASPLOS-VII), October 1996.
[RPA+97] Ranganathan, P., Pai, V.S., Abdel-Shafi, H. and Adve, S.V. The Interaction of Software Prefetching with ILP Processors in Shared-Memory Systems. In Proceedings of the 24th International Symposium on Computer Architecture, June 1997.
[SWG92] Singh, J.P., Weber, W-D., and Gupta, A. SPLASH: The Stanford ParalleL Applications for Shared Memory. Computer Architecture News, vol. 20, no. 1, pp. 5-44, March 1992.
[Sin97] Jaswinder Pal Singh. Some Aspects of Controlling Scheduling in Hardware Control Prefetching. Technical Report no. xxx, Computer Science Department, Princeton University. Also available on the World Wide Web home page of Parallel Computer Architecture by Culler and Singh, published by Morgan Kaufmann Publishers.
[Smi81] Smith, B.J. Architecture and Applications of the HEP Multiprocessor Computer System. In Proceedings of SPIE: Real-Time Signal Processing IV, Vol. 298, August 1981, pp. 241-248.
[Smi85] Smith, B.J. The Architecture of HEP. In Parallel MIMD Computation: The HEP Supercomputer and its Applications, edited by J.S. Kowalik, MIT Press, 1985, pp. 41-55.
[Tom67] Tomasulo, R.M. An Efficient Algorithm for Exploiting Multiple Arithmetic Units. IBM Journal of Research and Development, 11:1, January 1967, pp. 25-33.
[TuE93] Dean M. Tullsen and Susan J. Eggers. Limitations of Cache Prefetching on a Bus-Based Multiprocessor. In Proceedings of the 20th Annual Symposium on Computer Architecture, pp. 278-288, May 1993.
[TEL95] Dean M. Tullsen, Susan J. Eggers and Henry M. Levy. Simultaneous Multithreading: Maximizing On-Chip Parallelism. In Proceedings of the 22nd Annual Symposium on Computer Architecture, pp. 392-403, June 1995.
[TEE+96] D.M. Tullsen, S.J. Eggers, J.S. Emer, H.M. Levy, J.L. Lo and R.L. Stamm. Exploiting Choice: Instruction Fetch and Issue on an Implementable Simultaneous Multithreading Processor. In Proceedings of the 23rd International Symposium on Computer Architecture, May 1996.
[WoL91] Michael E. Wolf and Monica S. Lam. A Data Locality Optimizing Algorithm. In Proceedings of the ACM SIGPLAN’91 Conference on Programming Language Design and Implementation, pp. 30-44, June 1991.
[WSH94] Steven Cameron Woo, Jaswinder Pal Singh, and John L. Hennessy. The Performance Advantages of Integrating Block Data Transfer in Cache-Coherent Multiprocessors. In Proceedings of the 6th International Conference on Architectural Support for Programming Languages and Operating Systems, pp. 219-229, San Jose, CA, October 1994.
[ZhT95] Zheng Zhang and Josep Torrellas. Speeding up Irregular Applications in Shared-Memory Multiprocessors: Memory Binding and Group Prefetching. In Proceedings of the 22nd Annual Symposium on Computer Architecture, pp. 188-199, May 1995.
11.12 Exercises

11.1 General.
a. Why is latency reduction generally a better idea than latency tolerance?
b. Suppose a processor communicates k words in m messages of equal size, the assist occupancy for processing a message is o, and there is no overhead on the processor. What is the best case latency as seen by the processor if only communication, not computation, can be overlapped with communication? First, assume that acknowledgments are free, i.e. are propagated instantaneously and don’t incur overhead; then include acknowledgments. Draw timelines, and state any important assumptions.
c. You have learned about a variety of different techniques to tolerate and hide latency in shared-memory multiprocessors. These techniques include blocking, prefetching, multiple-context processors, and relaxed consistency models. For each of the following scenarios, discuss why each technique will or will not be an effective means of reducing/hiding latency. List any assumptions that you make.
(i) A complex graph algorithm with abundant concurrency using linked pointer structures.
(ii) A parallel sorting algorithm where communication is producer initiated and is achieved through long latency write operations. Receiver-initiated communication is not possible.
(iii) An iterative system-of-equations solver. The inner loop consists of a matrix-matrix multiply. Assume both matrices are huge and don’t fit into the cache.
11.2 You are charged with implementing message passing on a new parallel supercomputer. The architecture of the machine is still unsettled, and your boss says the decision of whether to
provide hardware cache coherence will depend on the message passing performance of the two systems under consideration, since you want to be able to run message passing applications from the previous-generation architecture.
In the system without cache coherence (the “D” system), the engineers on your team tell you that message passing should be implemented as successive transfers of 1KB. To avoid problems with buffering at the receiver side, you’re required to acknowledge each individual 1KB transfer before the transmission of the next one can begin (so, only one block can be in flight at a time). Each 1KB line requires 200 cycles of setup time, after which it begins flowing into the network. This overhead accounts for the time to determine where to read the buffer from in memory and to set up the DMA engine which performs the transfer. Assume that from the time the 1KB chunk reaches the destination, it takes 20 cycles for the destination node to generate a response, and it takes 50 cycles on the sending node to accept the ACK and proceed to the next 1KB transfer.
In the system with cache coherence (the “CC” system), messages are sent as a series of 128 byte cache line transfers. In this case, however, acknowledgments only need to be sent at the end of every 4KB page. Here, each transfer requires 50 cycles of setup time, during which time the line can be extracted from the cache, if necessary, to maintain cache coherence. This line is then injected into the network, and only when the line is completely injected into the network can processing on the next line begin.
The following are the system parameters: clock cycle = 10 ns (100 MHz), network latency = 30 cycles, network bandwidth = 400 MB/s. State any other assumptions that you make.
a. What is the latency (until the last byte of the message is received at the destination) and sustainable bandwidth for a 4KB message in the D system?
b. What is the corresponding latency and bandwidth in the CC system?
c. A designer on the team shows you that you can easily change the CC system so that the processing for the next line occurs while the previous one is being injected into the network. Calculate the 4KB message latency for the CC system with this modification.
11.3 Consider the example of transposing a matrix of data in parallel, as is used in computations such as high-performance Fast Fourier Transforms. Figure 11-36 shows the transpose pictorially. Every processor transposes one “patch” of its assigned rows to every other processor, including one to itself. Performing the transpose through reads and writes was discussed in the exercises at the end of Chapter 8. Since it is completely predictable which data a processor has to send to which other processors, a processor can send an entire patch at a time in a single message rather than communicate the patches through individual read or write cache misses.
a. What would you expect the curve of block transfer performance relative to read-write performance to look like?
b. What special features would the block transfer engine benefit most from?
c. Write pseudocode for the block transfer version of the code.
d. Suppose you wanted to use block data transfer in the Raytrace application. For what purposes would you use it, and how? Do you think you would gain significant performance benefits?
11.4 An interesting performance issue in block data transfer in a cache-coherent shared address space has to do with the impact of long cache blocks.
Assume that the data movement in the block transfer leverages the cache coherence mechanisms.
Figure 11-36 Sender-initiated matrix transposition. (Figure: source and destination matrices, with patches owned by Proc 0 through Proc 3; one case shows a long cache block that crosses the traversal order and one a long cache block that follows it.) The source and destination matrices are partitioned among processes in groups of contiguous rows. Each process divides its set of n/p rows into p patches of size n/p * n/p. Consider process P2 as a representative example: It sends one patch to every other process, and transposes one patch (third from left, in this case) locally. Every patch may be transferred as a single block transfer message, rather than through individual remote writes or remote reads (if receiver-initiated).
Consider the simple equation
solver on a regular grid, with its near neighbor communication. Suppose the n-by-n grid is partitioned into square subblocks among p processors.
a. Compared to the results shown for FFT in the chapter, how would you expect the curves for this application to differ when each row or column boundary is sent directly in a single block transfer, and why?
b. How might you structure the block transfer to only send useful data, and how would you expect performance in this case to compare with the previous one?
c. What parameters of the architecture would most affect the tradeoffs in b.?
d. If you indeed wanted to use block transfer, what deeper changes might you make to the parallel program?
11.5 Memory Consistency Models
a. To maintain all program orders as in SC, we can optimize in the following way. In our baseline implementation, the processor stalls immediately upon a write until the write completes. The alternative is to place the write in the write buffer without stalling the processor. To preserve the program order among writes, what the write buffer does is retire a write (i.e. pass it further along the memory hierarchy and make it visible to other processors) only after the write is complete. The order from writes to reads is maintained by flushing the write buffer upon a read miss.
(i) What overlap does this optimization provide?
(ii) Would you expect it to yield a large performance improvement? Why or why not?
b. If the second-level cache is blocking in the performance study of proceeding past writes with blocking reads, is there any advantage to allowing write->write reordering over not allowing it? If so, under what conditions?
c. Write buffers can allow several optimizations, such as buffering itself, merging writes to the same cache line that is in the buffer, and forwarding values from the buffer to read operations that find a match in the buffer. What constraints must be imposed on these optimizations to maintain program order under SC? Is there a danger with the merging optimization for the processor consistency model?
11.6 Given a set of code sequences and an architecture latency model, calculate the completion time under various memory consistency models. Plenty of questions in cs315a exams are of this sort.
11.7 Prefetching
a. How early can you issue a binding prefetch in a cache-coherent system?
b. We have talked about the advantages of nonbinding prefetches in multiprocessors. Do nonbinding prefetches have any advantages in uniprocessors as well? Assume that the alternative is to prefetch directly into the register file, not into a prefetch buffer.
c. Sometimes a predicate must be evaluated to determine whether or not to issue a prefetch. This introduces a conditional (if) expression around the prefetch inside the loop containing the prefetches. Construct an example, with pseudocode, and describe what the performance problem is. How would you fix the problem?
d. Describe situations in which a producer-initiated deliver operation might be more appropriate than prefetching or an update protocol. Would you implement a deliver instruction if you were designing a machine?
11.8 Prefetching Irregular Accesses
a. Consider the loop:
    for i <- 1 to 200
        sum = sum + A[index[i]];
    end for
Write a version of the code with non-binding software-controlled prefetches inserted. Include the prologue, the steady state of the software pipeline, and the epilogue. Assume the memory latency is such that it takes five iterations of the loop for a data item to return, and that data are prefetched into the primary cache.
b. When prefetching indirect references such as in the example above, extra instructions are needed for address generation. One possibility is to save the computed address in a register at the time of the prefetch and then reuse it later for the load. Are there any problems with or disadvantages to this?
c. What if an exception occurs when prefetching down multiple levels of indirection in accesses? What complications are caused and how might they be addressed?
d. Describe some hardware mechanisms (at a high level) that might be able to prefetch irregular accesses, such as records or lists.
11.9 More prefetching
a. Show how the following loop would be rewritten with prefetching so as to hide latency:
    for i=0 to 128 {
        for j=1 to 32
            A[i,j] = B[i] * C[j]
    }
Try to reduce overhead by prefetching only those references that you expect to miss in the cache. Assume that a read prefetch is expressed as “PREFETCH(&variable)”, and it fetches the entire cache line in which variable resides in shared mode. A read-exclusive
prefetch operation, expressed as “RE_PREFETCH(&variable)”, fetches the line in “exclusive” mode. The machine has a cache miss latency of 32 cycles. Explicitly prefetch each needed variable (don’t take into account the cache block size). Assume that the cache is large (so you don’t have to worry about interference), but not so large that you can prefetch everything at the start. In other words, we are looking for just-in-time prefetching. The matrix A[i,j] is stored in memory with A[i,j] and A[i,j+1] contiguous. Assume that the main computation of the loop takes 8 cycles to complete.
b. Construct an example in which a prefetching compiler’s decision to assume everything in the cache is invalidated when it sees a synchronization operation is very conservative, and show how a programmer can do better. It might be useful to think of the case study applications used in the book.
c. One alternative to prefetching is to use nonblocking load operations, and issue these operations significantly before the data are needed for computation. What are the tradeoffs between prefetching and using nonblocking loads in this way?
d. There are many options for the format that a prefetch instruction can take. For example, some architectures allow a load instruction to have multiple flavors, one of which can be reserved for a prefetch. Or, in architectures that reserve a register to always have the value zero (e.g. the MIPS and SPARC architectures), a load with that register as the destination can be interpreted as a prefetch, since such a load does not change the contents of the register. A third option is to have a separate prefetch instruction in the instruction set, with a different opcode than a load.
(i) Which do you think is the best alternative and why?
(ii) What addressing mode would you use for a prefetch instruction and why?
e. Suppose we issue prefetches when we expect the corresponding references to miss in the primary (first-level) cache. A question that arises is which levels of the memory hierarchy beyond the first-level cache should we probe to see if the prefetch can be satisfied there. Since the compiler algorithm usually schedules prefetches by conservatively assuming that the latency to be hidden is the largest latency (uncontended, say) in the machine, one possibility is to not even check intermediate levels of the cache hierarchy but always get the data from main memory or from another processor’s cache if the block is dirty. What are the problems with this method, and which one do you think is most important?
f. We have discussed some qualitative tradeoffs between prefetching into the primary cache and prefetching only into the second-level cache. Using only the following parameters, construct an analytical expression for the circumstances under which prefetching into the primary cache is beneficial. What about the analysis, or what it leaves out, strikes you as making it most difficult to rely on in practice? pt is the number of prefetches that bring data into the primary cache early enough, pd is the number of cases in which the prefetched data are displaced from the cache before they are used, pc is the number of cache conflicts in which prefetches replace useful data, pf is the number of prefetch fills (i.e. the number of times a prefetch tries to put data into the primary cache), ls is the access latency to the second-level cache (beyond that to the primary cache), and lf is the average number of cycles that a prefetch fill stalls the processor.
After generating the full-blown general expression, your goal is to find a condition on lf that makes prefetching into the first-level cache worthwhile. To do this, you can make the following simplifying assumptions: ps = pt - pd, and pc = pd.
11.10 Multithreading. Consider a “blocked” context-switching processor (i.e. a processor that switches contexts only on long-latency events). Assume that there are arbitrarily many threads available and clearly state any other assumptions that you make in answering the following questions. The threads of a given application have been analyzed to show the following execution profile:
40% of cycles are spent on instruction execution (busy cycles)
30% of cycles are spent stalled on L1 cache misses that hit in the L2 cache (10 cycle miss penalty)
30% of cycles are spent stalled on L2 cache misses (30 cycle miss penalty)
a. What will be the busy time if the context switch latency (cost) is 5 cycles?
b. What is the maximum context switch latency that will ensure that busy time is greater than or equal to 50%?
11.11 Blocked multithreading
a. In blocked, multiple-context processors with caches, a context switch occurs whenever a reference misses in the cache. The blocking context at this point goes into “stalled” state, and it remains there until the requested data arrives back at the cache. At that point it returns to “active” state, and it will be allowed to run when the active contexts ahead of it block. When an active context first starts to run, it reissues the reference it had blocked on. In the scheme just described, the interaction between the multiple contexts can potentially lead to deadlock. Concretely describe an example where none of the contexts make forward progress.
b. What do you think would happen to the idealized curve for processor utilization versus degree of multithreading in Figure 11-25 if cache misses were taken into account? Draw this more realistic curve on the same figure as the idealized curve.
c. Write the logic equation that decides whether to generate a context switch in the blocked scheme, given the following input signals: Cache Miss, MissSwitchEnable (enable switching on a cache miss), CE (signal that allows the processor to enable context switching), OneCount(CValid) (number of ready contexts), ES (explicit context switch instruction), and TO (timeout). Write another equation to decide when the processor should stall rather than switch.
d. In discussing the implementation of the PC unit for the blocked multithreading approach, we said that the use of the exception PC for exceptions as well as context switches meant that context switching must be disabled upon an exception. Does this indicate that the kernel cannot use the hardware-provided multithreading at all? If so, why? If not, how would you arrange for the kernel to use the multiple hardware contexts?
11.12 Interleaved Multithreading
a. Why is exception handling more complex in the interleaved scheme than in the blocked scheme? [Hint: think about how the exception PC (EPC) is used.] How would you handle the issues that arise?
b. How do you think the Tera processor might do lookahead across branches? The processor provides JUMP_OFTEN and JUMP_SELDOM branch operations. Why do you think it does this?
c. Consider a simple, HEP-like multithreaded machine with no caches. Assume that the average memory latency is 100 clock cycles. Each context has blocking loads and the machine enforces sequential consistency.
(i) Given that 20% of a typical workload’s instructions are loads and 10% are stores, how many active contexts are needed to hide the latency of the memory operations?
(ii) How many contexts would be required if the machine supported release consistency (still with blocking loads)? State any assumptions that you make.
(iii) How many contexts would be needed for parts (i) and (ii) if we assumed a blocked multiple-context processor instead of the cycle-by-cycle interleaved HEP processor? Assume cache hit rates of 90% for both loads and stores.
(iv) For part (iii), what is the peak processor utilization assuming a context switch overhead of 10 cycles?
11.13 Studies of applications have shown that combining release consistency and prefetching always results in better performance than when either technique is used alone. This is not the case when multiple-context and prefetching techniques are combined; the combined performance can sometimes be worse. Explain the latter observation, using an example situation to illustrate.
CHAPTER 12
Future Directions
Morgan Kaufmann is pleased to present material from a preliminary draft of Parallel Computer Architecture; the material is (c) Copyright 1998 Morgan Kaufmann Publishers. This material may not be used or distributed for any commercial purpose without the express written consent of Morgan Kaufmann Publishers. Please note that this material is a draft of forthcoming publication, and as such neither Morgan Kaufmann nor the authors can be held liable for changes or alterations in the final edition
In the course of writing this book the single factor that stood out most among the many interesting facets of parallel computer architecture was the tremendous pace of change. Critically important designs became “old news” as they were replaced by newer designs. Major open questions were answered, while new ones took their place. Start-up companies left the marketplace as established companies made bold strides into parallel computing and powerful competitors joined forces. The first teraflops performance was achieved, and workshops had already been formed to understand how to accelerate progress towards petaflops. The movie industry produced its first full-length computer-animated motion picture on a large cluster, and for the first time a parallel chess program defeated a grand master. Meanwhile, multiprocessors emerged in huge volume with the Intel Pentium Pro and its glueless cache coherence memory bus. Parallel algorithms were put to work to improve uniprocessor performance by better utilizing the storage hierarchy. Networking technology, memory technology, and even processor design were all thrown “up for grabs” as we began looking seriously at what to do with a billion transistors on a chip.
Looking forward to the future of parallel computer architecture, the one prediction that can be made with certainty is continued change. The incredible pace of change makes parallel computer architecture an exciting field to study and in which to conduct research. One needs to continually revisit basic questions, such as: what are the proper building blocks for parallel machines? What are the essential requirements on the processor design, the communication assist and how it integrates with the processor, the memory and the interconnect? Will these continue to utilize commodity desktop components, or will there be a new divergence as parallel computing matures and the great volume of computers shifts into everyday appliances? The pace of change makes for rich opportunities in industry, but also for great challenges as one tries to stay ahead. Although it is impossible to predict precisely where the field will go, this final chapter seeks to outline some of the key areas of development in parallel computer architecture and the related technologies. Whatever directions the market takes and whatever technological breakthroughs occur, the fundamental issues addressed throughout this book will still apply. The realization of parallel programming models will still rest upon the support for naming, ordering, and synchronization. Designers will still battle with latency, bandwidth, and cost. The core techniques for addressing these issues will remain valid; however, the way that they are employed will surely change as the critical coefficients of performance, cost, capacity, and scale continue to change. New algorithms will be invented, changing the fundamental application workload requirements, but the basic analysis techniques will remain. Given the approach taken throughout the book, it only makes sense to structure the discussion of potential future directions around hardware and software. For each, we need to ask which trends are likely to continue, thus providing a basis for evolutionary development, and which are likely to stop abruptly, either because a fundamental limit is struck or because a breakthrough changes the direction. Section 12.1 examines trends in technology and architecture, while Section 12.2 looks at how changing software requirements may influence the direction of system design and considers how the application base is likely to broaden and change.
12.1 Technology and Architecture

Technological forces shaping the future of parallel computer architecture can be placed into three categories: evolutionary forces, as indicated by past and current trends, fundamental limits that wall off further progress along a trend, and breakthroughs that create a discontinuity and establish new trends. Of course, only time will tell how these actually play out. This section examines all three scenarios and the architectural changes that might arise. To help sharpen the discussion, let us consider two questions. At the high end, how will the next factor of 1000 increase in performance be achieved? At the more moderate scale, how will cost effective parallel systems evolve? In 1997 computer systems form a parallelism pyramid roughly as in Figure 12-1. Overall shipments of uniprocessor PCs, workstations, and servers are on the order of tens of millions. The 2-4 processor end of the parallel computer market is on the scale of 100K to 1M units. These are almost exclusively servers, with some growth toward the desktop. This segment of the market grew at a moderate pace throughout the 80s and early 90s and then shot up with the introduction of Intel-based SMPs manufactured by leading PC vendors, as well as by the traditional workstation and server vendors, which are pushing down to expand volume. The next level is occupied by machines of 5 to 30 processors. These are exclusively high-end servers. The volume is in the tens of thousands of units and has been growing steadily; this segment dominates the
high-end server market, including the Enterprise market, which used to be called the mainframe market. At the scale of several tens to a hundred processors, the volume is on the order of a thousand systems. These tend to be dedicated engines supporting massive data bases, large scientific applications, or major engineering investigations, such as oil exploration, structural modeling, or fluid dynamics. Volume shrinks rapidly beyond a hundred processors, with on the order of tens of systems at the thousand processor scale. Machines at the very top end have been on the scale of a thousand to two thousand processors since 1990. In 1996-7 this stepped up toward ten thousand processors. The most visible machines at the very top end are dedicated to advanced scientific computing, including the U.S. Department of Energy “ASCI” teraflops machines and the Hitachi SR2201 funded by the Japanese Ministry of Technology and Industry.
Figure 12-1 Market Pyramid for Parallel Computers. (Figure: a pyramid ranging from tens of machines with thousands of processors at the top, through thousands of machines with hundreds of processors, tens to hundreds of thousands of multiprocessors with tens of processors, and hundreds of thousands to millions of small-scale multiprocessors, down to tens to hundreds of millions of uniprocessor PCs, workstations, and devices at the base.)
12.1.1 Evolutionary Scenario

If current technology trends hold, parallel computer architecture can be expected to follow an evolutionary path, in which economic and market forces play a crucial role. Let’s expand this evolutionary forecast; then we can consider how the advance of the field may diverge from this path. Currently, we see processor performance increasing by about a factor of 100 per decade (or 200 per decade if the basis is Linpack or SpecFP). DRAM capacity also increases by about a factor of 100 per decade (quadrupling every three years). Thus, current trends would suggest that the basic balance of computing performance to storage capacity (MFlops/MB) of the nodes in parallel machines could remain roughly constant. This ratio varies considerably in current machines, depending on application target and cost point, but under the evolutionary scenario the family of options would be expected to continue into the future, with dramatic increases in both capacity and performance. Simply riding the commodity growth curve, we could look toward achieving petaflops scale performance by 2010, or perhaps a couple of years earlier if the scale of parallelism is increased, but such systems would cost in excess of 100 million dollars to construct. It is less clear what communication performance these machines will provide, for reasons that are discussed below. To achieve this scale of performance a lot earlier in a general purpose system would
involve an investment and a scale of engineering that is probably not practical, although special purpose designs may push up the time frame for a limited set of applications.
To understand what architectural directions may be adopted, the VLSI technology trends underlying the performance and capacity trends of the components are important. Microprocessor clock rates are increasing by a factor of 10-15 per decade, while transistors per microprocessor increase by more than a factor of 30 per decade. DRAM cycle times, on the other hand, improve much more slowly, roughly a factor of two per decade. Thus, the gap between processor speed and memory speed is likely to continue to widen. In order to stay on the processor performance growth trend, the increase in the ratio of memory access time to processor cycle time will require that processors employ better latency avoidance and latency tolerance techniques. Also, the increase in processor instruction rates, due to the combination of cycle time and parallelism, will demand that the bandwidth delivered by the memory increase.[1] Both of these factors, the need for latency avoidance and for high memory bandwidth, as well as the increase in on-chip storage capacity, will cause the storage hierarchy to continue to become deeper and more complex. Recall that caches reduce the sensitivity to memory latency by avoiding memory references, and they reduce the bandwidth required of the lower level of the storage hierarchy. These two factors will also cause the degree of dynamic scheduling of instruction level parallelism to increase. Latency tolerance fundamentally involves allowing a large number of instructions, including several memory operations, to be in progress concurrently. Providing the memory system with multiple operations to work on at a time allows pipelining and interleaving to be used to increase bandwidth. Thus, the VLSI technology trends are likely to encourage the design of processors that are both more insulated from the memory system and more flexible, so that they can adapt to the behavior of the memory system. This bodes well for parallel computer architecture, because the processor component is likely to become increasingly robust to infrequent long latency operations.
Unfortunately, neither caches nor dynamic instruction scheduling reduce the actual latency on an operation that crosses the processor chip boundary. Historically, each level of cache added to the storage hierarchy increases the cost of access to memory. (For example, we saw that the Cray T3D and T3E designers eliminated a level of cache in the workstation design to decrease the memory latency, and the presence of a second level on-chip cache in the T3E increased the communication latency.) This phenomenon of increasing latency with hierarchy depth is natural because designers rely on the hit being the frequent case; increasing the hit rate and reducing the hit cost do more for processor performance than decreasing the miss cost. The trend toward deeper hierarchies presents a problem for parallel architecture, since communication, by its very nature, involves crossing out of the lowest level of the memory hierarchy on the node. The miss penalty contributes to the communication overhead, regardless of whether the communication abstraction is shared address access or messages. Good architecture and clever programming can reduce the unnecessary communication to a minimum, but still each algorithm has some level of inherent communication.
[1] Often in observing the widening processor-memory speed gap, comparisons are made between processor rates and memory access times by taking the reciprocal of the access time. Comparing throughput with latency in this way makes the gap appear artificially wider.
1. Often in observing the widening processor-memory speed gap, comparisons are made between processor rates and memory access times by taking the reciprocal of the access time. Comparing a throughput with a latency in this way makes the gap appear artificially wide.

It is an open question whether designers will be able to achieve the low miss penalties required for efficient communication while attempting to maximize processor performance through deep hierarchies. There are some positive indications in this direction. Many
scientific applications have relatively poor cache behavior because they sweep through large data sets. Beginning with the IBM Power2 architecture and continuing in the SGI Power Challenge and Sun Ultrasparc, attention has been paid to improving the out of cache memory bandwidth, at least for sequential access. These efforts proved very valuable for database applications. So, we may be able to look forward to node memory structures that can sustain high bandwidth, even in the evolutionary scenario. There are strong indications that multithreading will be utilized in future processor generations to hide the latency of local memory access. By introducing logical, thread-level parallelism on a single processor, this direction further reduces the cost of the transition from one processor to multiple processors, thus making small scale SMPs even more attractive on a broad scale. It also establishes an architectural direction that may yield much greater latency tolerance in the long term.

Link and switch bandwidths are increasing, although this phenomenon does not have the smooth evolution of CMOS under improving lithography and fabrication techniques. Links tend to advance through discrete technological changes. For example, copper links have transitioned through a series of driver circuits: unterminated lines carrying a single bit at a time were replaced by terminated lines with multiple bits pipelined on the wire, and these may be replaced by active equalization techniques. At the same time, links have gotten wider as connector technology has improved, allowing finer pitched, better matched connections, and cable manufacturing has advanced, providing better control over signal skew. For several years, it has seemed that fiber would soon take over as the technology of choice for high speed links. However, the cost of transceivers and connectors has impeded its progress. This may change in the future, as the efficient LED arrays that have been available in Gallium Arsenide (GaAs) technologies become effective in CMOS. The real driver of cost reduction, of course, is volume. The arrival of gigabit ethernet, which uses the FiberChannel physical link, may finally drive the volume of fiber transceivers up enough to cause a dramatic cost reduction. In addition, high quality parallel fiber has been demonstrated. Thus, flexible high performance links with small physical cross section may provide an excellent link technology for some time to come.

The bandwidths that are being required even for a uniprocessor design, and the number of simultaneous outstanding memory transactions needed to obtain this bandwidth, are stretching the limits of what can be achieved on a shared bus. Many system designs have already streamlined the bus design by requiring all components to transfer entire cache lines. Adapters for I/O devices are constructed with one-block caches that support the cache coherency protocol. Thus, essentially all systems will be constructed as SMPs, even if only one processor is attached. Increasingly, the bus is being replaced by a switch and the snooping protocols are being replaced by directories. For example, the HP/Convex Exemplar uses a cross-bar, rather than a bus, with the PA8000 processor, and the Sun Ultrasparc UPA has a switched interconnect within the 2-4 processor node, although these nodes are connected by a packet switched bus in the Enterprise 6000. The IBM PowerPC-based G30 uses a switch for the data path, but still uses a shared bus for address and snoop.
Where busses are still used, they are packet-switched (split-phase). Thus, even in the evolutionary scenario, we can expect to see high performance networks integrated ever more deeply into high volume designs. This trend makes the transition from high volume, moderate scale parallel systems to large scale, moderate volume parallel systems more attractive, because less new technology is required. Higher speed networks are a dominant concern for current I/O subsystems as well. There has been a great deal of attention paid to improved I/O support, with PCI replacing traditional vendor
I/O busses. There is a very strong desire to support faster local area networks, such as gigabit ethernet, OC12 ATM (622 Mb/s), SCI, FiberChannel, and P1394.2. A standard PCI bus can provide roughly one Gb/s of bandwidth. Extended PCI with 64-bit, 66 MHz operation exists and promises to become more widespread in the future, offering multi-gigabit performance on commodity machines. Several vendors are looking at ways of providing direct memory bus access for high performance interconnects or distributed shared memory extensions. These trends will ensure that small scale SMPs continue to be very attractive, and that clusters and more tightly packaged collections of commodity nodes will remain a viable option for the large scale. It is very likely that these designs will continue to improve as high-speed network interfaces become more mature. We are already seeing a trend toward better integration of NICs with the cache coherence protocols, so that control registers can be cached and DMA can be performed directly on user level data structures. For many reasons, large scale designs are likely to use SMP nodes, so clusters of SMPs are likely to be a very important vehicle for parallel computing. With the recent introduction of the CC-NUMA based designs, such as the HP/Convex SPP, the SGI Origin, and especially the PentiumPro-based machines, large scale cache-coherent designs look increasingly attractive. The core question is whether a truly composable SMP-based node will emerge, so that large clusters of SMPs can essentially be snapped together as easily as one adds memory or I/O devices to a single node.

12.1.2 Hitting a Wall

So, if current trends hold, the evolution of parallel computer architecture looks bright. Why might this not happen? Might we hit a wall instead? There are three basic possibilities: a latency wall, an overhead wall, and a cost or power wall. The latency wall is fundamentally the speed of light, or rather the propagation speed of electrical signals. We will soon see processors operating at clock rates in excess of 1 GHz, or a clock period of less than one nanosecond. Signals travel about a foot per nanosecond. In the evolutionary view, the physical size of the node does not get much smaller; it gets faster with more storage, but it is still several chips with connectors and PC board traces. Indeed, from 1987 to 1997, the footprint of a basic processor and memory module did not shrink much. There was an improvement from two nodes per board to about four when first level caches moved on-chip and DRAM chips were turned on their side as SIMMs, but most designs maintained one level of cache off-chip. Even if the off-chip caches are eliminated (as in the Cray T3D and T3E), the processor chip consumes more and more power with each generation and a substantial amount of surface area is needed to dissipate the heat. Thus, thousand processor machines will still be meters across, not inches.

While the latency wall is real, there are several reasons why it will probably not impede the practical evolution of parallel architectures in the foreseeable future. One reason is that latency tolerance techniques at the processor level are quite effective on the scale of tens of cycles. There have been studies suggesting that caches are losing their effectiveness even on uniprocessors due to memory access latency[Bur*96]. However, these studies assume memory operations that are not pipelined and a processor typical of mid-90s designs.
Other studies suggest that if memory operations are pipelined and if the processor is allowed to issue several instructions from a large instruction window, then branch prediction accuracy is a far more significant limit on performance than latency[JoPa97]. With perfect prediction, such an aggressive design can tolerate memory access times in the neighborhood of 100 cycles for many applications. Multithreading techniques provide an alternative source of instruction level parallelism that can be used to hide
latency, even with imperfect branch prediction. But such latency tolerance techniques fundamentally demand bandwidth, and bandwidth comes at a cost. The cost arises either through higher signalling rates, more wires, more pins, more real estate, or some combination of these. Also, the degree of pipelining that a component can support is limited by its occupancy. To hide latency requires careful attention to the occupancy of every stage along the path of access or communication. Where the occupancy cannot be reduced, interleaving techniques must be used to reduce the effective occupancy.

Following the evolutionary path, speed of light effects are likely to be dominated by bandwidth effects on latency. Currently, a single cache-line sized transfer is several hundred bits in length and, since links are relatively narrow, a single network transaction reaches entirely across the machine with fast cut-through routing. As links get wider, the effective length of a network transaction, i.e., the number of phits, will shrink, but there is quite a bit of room for growth before it takes more than a couple of concurrent transactions per processor to cover the physical latency. Moreover, cache block sizes are increasing just to amortize the cost of a DRAM access, so the length of a network transaction, and hence the number of outstanding transactions required to hide latency, may be nearly constant as machines evolve. Explicit message sizes are likely to follow a similar trend, since processors tend to be inefficient in manipulating objects smaller than a cache block.

Much of the communication latency today is in the network interface, in particular the store-and-forward delay at the source and at the destination, rather than in the network itself. The network interface latency is likely to be reduced as designs mature and as it becomes a larger fraction of the network latency. Consider, for example, what would be required to cut through one or both of the network interfaces. On the source side, there is no difficulty in translating the destination to a route and spooling the message onto the wire as it becomes available from the processor. However, the processor may not be able to provide the data into the NI as fast as the NI spools data into the network, so as part of the link protocol it may be necessary to hold back the message (transferring idle phits on the wire). This machinery is already built into most switches. Machines such as the Intel Paragon, Meiko CS-2, and Cray T3D provide flow-control all the way through the NI and back to the memory system in order to perform large block transfers without a store-and-forward delay. Alternatively, it may be possible to design the communication assist such that once a small message, e.g., a cache line, starts onto the wire, it is completely transferred without delay. Avoiding the store-and-forward delay on the destination is a bit more challenging, because, in general, it is not possible to determine that the data in the message is good until the data has been received and checked. If it is spooled directly into memory, junk may be deposited. The key observation is that it is much more important that the address be correct than that the data contents be correct, because we do not want to spool data into the wrong place in memory. A separate checksum can be provided on the header. The header is checked before the message is spooled into the destination node.
For a large transfer, there is typically a completion event, so that data can be spooled into memory and checked before being marked as arrived. Note that this does mean that the communication abstraction should not allow applications to poll data values within the bulk transfer to detect completion. For small transfers, a variety of tricks can be played to move the data into the cache speculatively. Basically, a line is allocated in the cache and the data is transferred, but if it does not checksum correctly, the valid bit on the line is never set. Thus, there will need to be greater attention paid to communication events in the design of the communication assist and memory system, but it is possible to streamline network transactions much more than the current state of the art to reduce latency.
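The ordering described above (trust nothing until the header checks, and do not signal completion until the payload checks) can be sketched as follows. This is a minimal illustration, not any particular machine's protocol; the packet layout, the use of CRC-32, and the byte-addressed memory are assumptions made only to keep the sketch concrete.

import zlib

# Hypothetical packet: (header, header_crc, payload, payload_crc).
# The essential ordering: the header (which carries the destination address) is
# verified before anything is deposited; the payload is verified before the
# transfer is marked complete (e.g., before a valid bit or flag is set).
def deliver(packet, memory):
    header, header_crc, payload, payload_crc = packet
    if zlib.crc32(header) != header_crc:
        return "drop: corrupt header, nothing deposited"
    addr = int.from_bytes(header[:4], "little")      # assumed header format
    memory[addr:addr + len(payload)] = payload       # cut-through deposit as data streams in
    if zlib.crc32(payload) != payload_crc:
        return "deposited speculatively, but completion withheld: corrupt payload"
    return "complete: data deposited and marked arrived"

memory = bytearray(1024)
payload = b"one cache line of data"
header = (128).to_bytes(4, "little")                 # deposit at (hypothetical) address 128
packet = (header, zlib.crc32(header), payload, zlib.crc32(payload))
print(deliver(packet, memory))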
The primary reason that parallel computers will not hit a fundamental latency wall is that overall communication latency will continue to be dominated by overhead. The latency will be there, but it will still be a modest fraction of the actual communication time. The reason for this lies deep in the current industrial design process. With one or more levels of cache on the processor chip, an off-chip cache, and then the memory system, the designer of a cache controller for a given level of the memory hierarchy is given a problem which has a fast side toward the processor and a slow side toward the memory. The design goal is to minimize the expression

AveMemAccess(S) = HitTime × HitRate_S + (1 − HitRate_S) × MissTime
(EQ 12.1)
for a typical address stream, S, delivered to the cache on the processor side. This design goal presents an inherent trade-off because improvements in any one component generally come at the cost of worsening the others. Thus, along each direction in the design space the optimal design point is a compromise between extremes. The HitTime is generally fixed by the target rate of the fast component. This establishes a limit against which the rest of the design is optimized, i.e., the designer will do whatever is required to keep the HitTime within this limit. For example, one can consider cache organizational improvements to improve HitRate, such as higher associativity, as long as it can be accomplished within the desired HitTime[Prz*88].

The critical aspect for parallel architecture concerns the MissTime. How hard is the designer likely to work to drive down the MissTime? The usual rule of thumb is to make the two additive components roughly equal. This guarantees that the design is within a factor of two of optimal and tends to be robust and good in practice. The key point is that since miss rates are small for a uniprocessor, the MissTime will be a large multiple of the HitTime. For first level caches with greater than 95% hit rates, it will be twenty times the HitTime, and for lower-level caches it will still be an order of magnitude. A substantial fraction of the miss time is occupied by the transfer to the lower level of the storage hierarchy, and small additions to this have only a modest effect on uniprocessor performance. The cache designer will utilize this small degree of freedom in many useful ways. For example, cache line sizes can be increased to improve the HitRate, at the cost of a longer miss time.

In addition, each level of the storage hierarchy adds to the cost of the data transfer because another interface must be crossed. In order to modularize the design, interfaces tend to decouple the operations on either side. There is some cost to the handshake between caches on-chip. There is a larger cost in the interface between an on-chip cache and an off-chip cache, and a much larger cost to the more elaborate protocol required across the memory bus. In addition, for communication there is the protocol associated with the network itself. The accumulation of these effects is why the actual communication latency tends to be many times the speed of light bound. The natural response of the designer responsible for dealing with the communication aspects of a design is invariably to increase the minimum data transfer size, e.g., increasing the cache line size or the smallest message fragment. This shifts the critical time from latency to occupancy. If each transfer is large enough to amortize the overhead, the additional speed of light latency is again a modest addition. Wherever the design is partitioned into multiple levels of storage hierarchy, with the emphasis placed on optimizing each level relative to the processor-side reference stream, the natural tendency of the designers will result in a multiplication of overhead with each level of the hierarchy between the processor and the communication assist. In order to get close to the speed of light latency limit, a very different design methodology will need to be established for processor
design, cache design, and memory design. One of the architectural trends that may bring about this change is the use of extensive out-of-order execution or multithreading to hide latency, even in uniprocessor systems. These techniques change the cache designer's goal. Instead of minimizing the sum of the two components in Equation 12.1, the goal is essentially to minimize the maximum, as in Equation 12.2.

max(HitTime × HitRate_S, (1 − HitRate_S) × MissTime)
(EQ 12.2)
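A small numeric sketch of the two design goals may help; the one-cycle hit time and 95% hit rate below are illustrative assumptions, chosen to match the rule-of-thumb discussion above.

# Illustrative comparison of the two cache-design goals (assumed cycle counts).
def avg_access_time(hit_time, hit_rate, miss_time):
    # EQ 12.1: the conventional goal is to minimize this weighted sum.
    return hit_time * hit_rate + (1 - hit_rate) * miss_time

def max_component(hit_time, hit_rate, miss_time):
    # EQ 12.2: with latency tolerance, the goal is closer to minimizing the max.
    return max(hit_time * hit_rate, (1 - hit_rate) * miss_time)

hit_time, hit_rate = 1.0, 0.95                        # 1-cycle hit, 95% hit rate (assumed)
# Rule of thumb: balance the two additive components of EQ 12.1.
balanced_miss = hit_time * hit_rate / (1 - hit_rate)  # roughly twenty times the hit time
print(f"balanced MissTime  = {balanced_miss:.0f} cycles")
print(f"EQ 12.1 at balance = {avg_access_time(hit_time, hit_rate, balanced_miss):.2f} cycles")
print(f"EQ 12.2 at balance = {max_component(hit_time, hit_rate, balanced_miss):.2f} cycles")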
When there is a miss, the processor does not wait for it to be serviced; it continues executing and issues more requests, many of which, hopefully, hit. In the meantime, the cache is busy servicing the miss. Hopefully, the miss will be complete by the time another miss is generated or the processor runs out of things it can do without the miss completing. The miss needs to be detected and dispatched for service without many cycles of processor overhead, even though it will take some time to process it. In effect, the miss needs to be handed off for processing essentially within the HitTime budget. Moreover, it may be necessary to sustain multiple outstanding requests to keep the processor busy, as fully explained in Chapter 11. The MissTime may be too large for a one-to-one balance in the components of Equation 12.2 to be met, either because of latency or occupancy effects. Also, misses and communication events tend to cluster, so the interval between operations that need servicing is frequently much less than the average.

"Little's Law" suggests that there is another potential wall, a cost wall. If the total latency that needs to be hidden is L and the rate of long latency requests is ρ, then the number of outstanding requests per processor when the latency is hidden is ρL, or greater when clustering is considered. With this number of communication events in flight, the total bandwidth delivered by the network with P processors needs to be PρL(P), where L(P) reflects the increase in latency with machine size. This requirement establishes a lower bound on the cost of the network. To deliver this bandwidth, the aggregate bandwidth of the network itself will need to be much higher, as discussed in Chapter 10, since there will be bursts, collisions, and so on. Thus, to stay on the evolutionary path, latency tolerance will need to be considered in many aspects of the system design, and network technology will need to improve in bandwidth and in cost.

12.1.3 Potential Breakthroughs

We have seen so far a rosy evolutionary path for the advancement of parallel architecture, with some dark clouds that might hinder this advance. Is there also a silver lining? Are there aspects of the technological trends that may create new possibilities for parallel computer design? The answer is certainly in the affirmative, but the specific directions are uncertain at best. While it is possible that dramatic technological changes, such as quantum devices, free space optical interconnects, molecular computing, or nanomechanical devices, are around the corner, there appears to be substantial room left in the advance of conventional CMOS VLSI devices[Patt95]. The simple fact of continued increase in the level of integration is likely to bring about a revolution in parallel computer design. From an academic viewpoint, it is easy to underestimate the importance of packaging thresholds in the process of continued integration, but history shows that these factors are dramatic indeed. The general effect of the thresholds of integration is illustrated in Figure 12-2, which shows two qualitative trends. The straight line reflects the steady increase in the level of systems integration
with time. Overlaid on this is a curve depicting the amount of innovation in system design. A given design regime tends to be stable for a considerable period, in spite of technological advance, but when the level of integration crosses a critical threshold many new design options are enabled and there is a design renaissance. The figure shows two of the epochs in the history of computer architecture. Recall, during the late 70's and early 80's computer system design followed a stable evolutionary path with clear segments: minicomputers, dominated by the DEC VAX in engineering and academic markets; mainframes, dominated by IBM in the commercial markets; and vector supercomputers, dominated by Cray Research in the scientific market. The minicomputer had burst on the scene as a result of an earlier technology threshold, where MSI and LSI components, especially semiconductor memories, permitted the design of complex systems with relatively little engineering effort. In particular, this level of integration permitted the use of microprogramming techniques to support a large virtual address space and complex instruction set. The vector supercomputer niche reflected the end of a transition. Its exquisite ECL circuit design, coupled with semiconductor memory in a clean load-store register based architecture, wiped out the earlier, more exotic parallel machine designs. These three major segments evolved in a predictable, evolutionary fashion, each within its market segment, while the microprocessor marched forward from 4 bits, to 8 bits, to 16 bits, bringing forward the personal computer and the graphics workstation.
[Figure 12-2 plot: level of integration rising steadily with time, overlaid with bursts of design opportunity and innovation at key thresholds (the DEC/IBM MSI era, the killer-micro era, and the coming single-chip desktop-system era).]
Figure 12-2 Tides of Change The history of computing oscillates between periods of stable development and rapid innovation. Enabling technology generally does not hit "all of a sudden," it arrives and evolves. The "revolutions" occur when technology crosses a key threshold. One example is the arrival of the 32-bit microprocessor in the mid-80s, which broke the stable hold of the less integrated minicomputers and mainframes and enabled a renaissance in computer design, including low-level parallelism on-chip and high-level parallelism through multiprocessor designs. In the late 90's this transition is fully mature and microprocessor-based desktop and server technology dominates all segments of the market. There is tremendous convergence in parallel machine design. However, the level of integration continues to rise and soon the single-chip computer will be as natural as the single-board computer of the 80's. The question is what new renaissance of design this will enable.
In the mid-80s, the microprocessor reached a critical threshold where a full 32-bit processor fit on a chip. Suddenly the entire picture changed. A complete computer of significant performance
and capacity fit on a board. Several such boards could be put in a system. Suddenly, bus-based cache-coherent SMPs appeared from many small companies, including Synapse, Encore, Flex, and Sequent. Relatively large message passing systems appeared from Intel, nCUBE, Ametek, Inmos, and Thinking Machines Corp. At the same time, several minisupercomputer vendors appeared, including Multiflow, FPS, Culler Scientific, Convex, Scientific Computing System, and Cydrome. Several of these companies failed as the new plateau was established. The workstation and later the personal computer absorbed the technical computing aspect of the minicomputer market. SMP servers took over the larger scale data centers, transaction processing, and engineering analysis, eliminating the minisuper. The vector supercomputers gave way to massively parallel microprocessor-based systems.

Since that time, the evolution of designs has again stabilized. The understanding of cache coherence techniques has advanced, allowing shared address support at increasingly large scale. The transition of scalable, low latency networks from MPPs to conventional LAN or computer room environments has allowed casually engineered clusters of PCs, workstations, or SMPs to deliver substantial performance at very low cost, essentially as a personal supercomputer. Several very large machines are constructed as clusters of shared memory machines of various sizes. The convergence observed throughout this book is clearly in progress, and the basic design question facing parallel machine designers is "how will the commodity components be integrated?" not what components will be used.

Meanwhile, the level of integration in microprocessors and memories is fast approaching a new critical threshold, where a complete computer fits on a chip, not a board. Soon after the turn of the century the gigabit DRAM chip will arrive. Microprocessors are on the way to 100 million transistors by the turn of the century. This new threshold is likely to bring about a new design renaissance as profound as that of the 32-bit microprocessor of the mid-80s, the semiconductor memory of the mid-70s, and the IC of the 60's. Basically, the strong differentiation between processor chips and memory chips will break down, and most chips will have both processing logic and memory.

It is easy to enumerate many reasons why the processor-and-memory level of integration will take place and is likely to enable dramatic change in computer design, especially parallel computer design. Several research projects are investigating aspects of this new design space, under a variety of acronyms (PIM, IRAM, C-RAM, etc.). To avoid confusion, we will give processor-and-memory yet another acronym, PAM. Only history will reveal which old architectural ideas will gain new life and which completely new ideas will arrive. Let's look at some of the technological factors leading toward new design options. One clear factor is that microprocessor chips are mostly memory. It is SRAM memory used for caches, but memory nonetheless. Table 12-1 shows the fraction of the transistors and die area used for caches and memory interface, including store buffers, etc., for four recent microprocessors from two vendors [Pat*97]. The actual processor is a small and diminishing component of the microprocessor chip, even though processors are getting quite complicated. This trend is
made even more clear by Figure 12-3, which shows the fraction of the transistors devoted to caches in several microprocessors over the past decade[Burg97].

Table 12-1 Fraction of Microprocessor Chips Devoted to Memory
Year | Microprocessor | On-Chip Cache Size | Total Transistors | Fraction Devoted to Memory | Die Area (mm^2) | Fraction Devoted to Memory
1993 | Intel Pentium | I: 8 KB, D: 8 KB | 3.1 M | 32% | ~300 | 32%
1995 | Intel Pentium Pro | I: 8 KB, D: 8 KB, L2: 512 KB | P: 5.5 M + L2: 31 M | P: 18%, L2: 100% (total: 88%) | P: 242 + L2: 282 | P: 23%, L2: 100% (total: 64%)
1994 | Digital Alpha 21164 | I: 8 KB, D: 8 KB, L2: 96 KB | 9.3 M | 77% | 298 | 37%
1996 | Digital StrongArm SA-110 | I: 16 KB, D: 16 KB | 2.1 M | 95% | 50 | 61%
The vast majority of the real estate, and an even larger fraction of the transistors, are used for data storage, organized as multiple levels of on-chip caches. This investment in on-chip storage is necessary because of the time to access off-chip memory, i.e., the latency of the chip interface, off-chip caches, memory bus, memory controller, and the actual DRAM. For many applications, the best way to improve performance is to increase the amount of on-chip storage.

One clear opportunity this technological trend presents is putting multiple processors on-chip. Since the processor is only a small fraction of the chip real estate, the potential peak performance can be increased dramatically at a small incremental cost. The argument for this approach is further strengthened by the diminishing returns in performance for processor complexity. For example, the real estate devoted to register ports, instruction prefetch window, hazard detection, and bypassing each increases more than linearly with the number of instructions issued per cycle, while the performance improves little beyond 4-way issue superscalar. Thus, for the same area, multiple processors of a less aggressive design can be employed[Olu*96]. This motivates reexamination of sharing issues that have evolved along with technology on SMPs since the mid-80s. Most of the early machines shared an off-chip first level cache, then there was room for separate caches, then L1 caches moved on-chip, and L2 caches were sometimes shared and sometimes not. Many of the basic trade-offs remain the same in the board-level and chip-level multiprocessors: sharing caches closer to the processor allows for finer grained data sharing and eliminates further levels of coherence support, but increases access time due to the interconnect on the fast side of the shared cache. Sharing at any level presents the possibility of positive or negative interference, depending on the application usage pattern. However, the board level designs were largely determined by the particular properties of the available components. With multiple processors on-chip, all the design options can be considered within the same homogeneous medium. Also, the specific trade-offs are different because of the different costs and performance characteristics.
[Figure 12-3 plot: percentage of transistors devoted to on-chip cache versus year of introduction (1985-2000) for commercial microprocessors from the x86, 68K, PowerPC, Alpha, and PA-RISC families.]
Figure 12-3 Fraction of Transistors on Microprocessors Devoted to Caches. Since caches migrated on-chip in the mid 80's, the fraction of the transistors on commercial microprocessors that are devoted to caches has risen steadily. Although the processors are complex and exploit substantial instruction level parallelism, only by providing a great deal of local storage and exploiting locality can their bandwidth requirements be satisfied at reasonable latency.
Given the broad emergence of inexpensive SMPs, especially with "glueless" cache coherence support, multiple processors on a chip are a natural path of evolution for multiprocessors. A somewhat more radical change is suggested by the observation that the large volume of storage next to the processor could be DRAM storage, rather than SRAM. Traditionally, SRAM uses manufacturing techniques intended for processors and DRAM uses quite different techniques. The engineering requirements for microprocessors and DRAMs are traditionally very different. Microprocessor fabrication is intended to provide high clock rates and ample connectivity for datapaths and control using multiple layers of metal, whereas DRAM fabrication focuses on density and yield at minimal cost. The packages are very different. Microprocessors use expensive packages with many pins for bandwidth and materials designed to dissipate large amounts of heat. DRAM packages have few pins, low cost, and are suited to the low-power characteristics of DRAM circuits. However, these differences are diminishing. DRAM fabrication processes have become better suited for processor implementations, with two or three levels of metal, and better logic speed[Sau*96].

The drive toward integrating logic into the DRAM is driven partly by necessity and partly by opportunity. The immense increase in capacity (4x every three years) has required that the internal
organization of the DRAM and its interface change. Early designs consisted of a single, square array of bits. The address was presented in two pieces, so a row could be read from the array and then a column selected. As the capacity increased, it was necessary to place several smaller arrays on the chip and to provide an interconnect between the many arrays and the pins. Also, with limited pins and a need to increase the bandwidth, there was a desire to run part of the DRAM chip at a higher rate. Many modern DRAM designs, including Synchronous DRAM, Enhanced DRAM, and RAMBUS, make effective use of the row buffers within the DRAM chip and provide high bandwidth transfers between the row buffers and the pins. These approaches required that DRAM processes be capable of supporting logic as well. At the same time, there were many opportunities to incorporate new logic functions into the DRAM, especially for graphics support in video RAMs. For example, 3D-RAM places logic for z-buffer operations directly in the video-RAM chip that provides the frame buffer.

The attractiveness of integrating processor and memory is very much a threshold phenomenon. While processor design was constrained by chip area, there was certainly no motivation to use fabrication techniques other than those specialized for fast processors, and while memory chips were small, so many were used in a system, there was no justification for the added cost of incorporating a processor. However, the capacity of DRAMs has been increasing more rapidly than the transistor count or, more importantly, the area used for processors. At the gigabit DRAM or perhaps the following generation, the incremental cost of the processor is modest, perhaps 20%. From the processor designer's viewpoint, the advantage of DRAM over SRAM is that it has more than an order of magnitude better density. However, the access time is greater, access is more restrictive, and refresh is required[Sau*96].

Somewhat more subtle threshold phenomena further increase the attractiveness of PAM. The capacity of DRAM has been growing faster than the demand for storage in most applications. The rapid increase in capacity has been beneficial at the high end, because it became possible to run very large problems, which tends to reduce the communication-to-computation ratio and make parallel processing more effective. At the low end it has had the effect of reducing the number of memory chips sold per system. When there are only a few memory chips, traditional DRAM interfaces with few pins do not work well, so new DRAM interfaces with high speed logic are essential. When there is only one memory chip in a typical system, there is a huge cost saving in bringing the processor on-chip and eliminating everything in between. However, this raises the question of what the memory organization should be for larger systems.

Augmenting the impact of critical thresholds of evolving technology are new technological factors and market changes. One is high speed CMOS serial links. ASIC cells are available that will drive in excess of 1 Gb/s on a serial link, and substantially higher rates have been demonstrated in laboratory experiments. Previously, these rates were only available with expensive ECL circuits or GaAs technology. High speed links using a few pins provide a cost effective means of integrating PAM chips into a large system, and can form the basis for the parallel machine interconnection network.
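As a rough illustration of why a few fast pins can suffice (the link counts and signalling rates below are assumptions, not figures for any particular part), consider the time to move a cache-line-sized transfer:

# Rough illustration: time to move a 128-byte (1024-bit) transfer over a small
# number of high speed serial links. Link counts and rates are assumptions.
line_bits = 128 * 8
for n_links, gbit_per_s in [(1, 1.0), (2, 1.0), (4, 2.5)]:
    ns = line_bits / (n_links * gbit_per_s)          # 1 Gb/s is one bit per nanosecond
    print(f"{n_links} link(s) at {gbit_per_s} Gb/s: {ns:.0f} ns per 128-byte transfer")
# The wider and faster configurations push the transfer time down toward typical
# memory latencies, which is what makes a handful of serial pins plausible as
# the PAM-to-PAM interconnect.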
A second factor is the advancement and widespread use of FPGA and other configurable logic technology. This makes it possible to fabricate a single building block with processor, memory, and unconfigured logic, which can then be configured to suit a variety of applications. The final factor is the development of low power microprocessors for the rapidly growing market of network appliances, sometimes called WebPCs or Java stations, palmtop computers, and other sophisticated electronic devices. For many of these applications, modest single chip PAMs provide ample processing and storage capacity. The huge volume of these markets may indeed make PAM the commodity building block, rather than the desktop system.
The question presented by these technological opportunities is how the organizational structure of the computer node should change. The basic starting point is indicated by Figure 12-4, which shows that each of the subsystems between the processor and the DRAM bit array is relatively wide internally but presents a narrow interface, because pins and wires are expensive, and adds latency. Within the DRAM chips, the datapath is extremely wide. The bit array itself is a collection of incredibly tightly packed trench capacitors, so little can be done there. However, the data buffers between the bit array and the external interface are still wide, less dense, and are essentially SRAM and logic. Recall, when a DRAM is read, a portion of the address is used to select a row, which is read into the data buffer, and then another portion of the address is used to select a few bits from the data buffer. The buffer is written back to the row, since the read is destructive. On a write, the row is read, a portion is modified in the buffer, and eventually it is written back.
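The row-buffer behavior just described can be captured in a toy model; the array dimensions, the single buffer per bank, and the write-back-on-row-change policy are illustrative assumptions rather than a description of any particular DRAM.

# Toy model of one DRAM bank with a single wide data (row) buffer. Sizes are
# arbitrary; the point is the destructive read into the buffer, the column
# select, and the write-back of the buffer when a different row is needed.
class DramBank:
    def __init__(self, n_rows=16, row_bytes=512):
        self.rows = [bytearray(row_bytes) for _ in range(n_rows)]
        self.buffer = None        # the wide data buffer (essentially SRAM and logic)
        self.open_row = None

    def _activate(self, row):
        if self.open_row != row:
            self._precharge()
            self.buffer = self.rows[row][:]            # destructive read: row moves to the buffer
            self.open_row = row

    def _precharge(self):
        if self.open_row is not None:
            self.rows[self.open_row][:] = self.buffer  # restore (write back) the open row

    def read(self, row, col, nbytes):
        self._activate(row)                            # row address selects a row
        return bytes(self.buffer[col:col + nbytes])    # column address selects a few bytes

    def write(self, row, col, data):
        self._activate(row)
        self.buffer[col:col + len(data)] = data        # modify in the buffer; the bit array is updated at precharge

bank = DramBank()
bank.write(3, 128, b"\x2a" * 8)
print(bank.read(3, 128, 8))      # hits the open row buffer; no second array access
print(bank.read(7, 0, 4))        # different row: write back row 3, then read row 7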
Figure 12-4 Bandwidths across a computer system The processor data path is several words wide, but typically has an interface only a couple of words wide to its L1 cache. The L1 cache blocks are 32 or 64 bytes wide, but the cache is constrained between the processor's word-at-a-time operation and the microprocessor chip interface. The L2 cache blocks are even wider, but the cache is constrained between the microprocessor chip interface and the memory bus interface, both of which are width critical. The SIMMs that form a bank of memory may have an interface that is wider and slower than the memory bus. Internally, this interface is a small section of a very wide data buffer, which is transferred directly to and from the actual bit arrays.
Current research investigates three basic possibilities, each of which has a substantial history and can be understood, explored, and evaluated in terms of the fundamental design principles put forward in this book. They are surveyed briefly here, but the reader will surely want to consult the most recent literature and the web. The first option is to place simple, dedicated processing elements into the logic associated with the data buffers of more-or-less conventional DRAM chips, as indicated in Figure 12-5. This approach has been called Processor-in-Memory (PIM)[Gok*95] and Computational RAM[Kog94,ElSn90]. It is fundamentally SIMD processing of a restricted class of data parallel operations. Typically, these will be small bit-serial processors providing basic logic operations, but they could operate on multiple bits or even a word at a time. As we saw in Chapter 1, the
approach has appeared several times in the history of parallel computer architecture. Usually it appears at the beginning of a technological transition when general purpose operations are not quite feasible, so the specialized operations enjoy a generous performance advantage. Each time it has proved applicable for a limited class of operations, usually image processing, signal processing, or dense linear algebra, and each time it has given way to more general purpose solutions as the underlying technology evolves.
For example, in the early 60's, there were numerous SIMD machines proposed which would allow construction of a high performance machine by only replicating the function units and sharing a single instruction sequencer, including Staran, Pepe, and Illiac. The early effort culminated in the development of the Illiac IV at the University of Illinois, which took many years and became operational only months before the Cray-1. In the early 80s the approach appeared again with the ICL DAP, which provided an array of compact processors, and got a big boost in the mid 80s when processor chips got large enough to support 32 bit-serial processors, but not a full 32-bit processor. The Goodyear MPP, Thinking Machines CM-1, and MasPar grew out of this window of opportunity. The key recognition that made the latter two machines much more successful than any previous SIMD approach was the need to provide a general purpose interconnect between the processing elements, not just a low dimensional grid, which is clearly cheap to build. Thinking Machines also was able to capture the arrival of the single chip floating point unit and modify the design in the CM-2 to provide operations on two thousand 32-bit PEs, rather than 64 thousand 1-bit PEs. However, these designs were fundamentally challenged by Amdahl's law, since the high performance mode can only be applied on the fraction of the problem that fits the specialized operations. Within a few years, they yielded to MPP designs with a few thousand general purpose microprocessors, which could perform SIMD operations as well as more general operations, so the parallelism could be utilized more of the time. The PIM approach was deployed in the Cray 3/SSS, before the company filed for Chapter 11, to provide special support for the National Security Agency. It has also been demonstrated for more conventional technology[Aim96,Shi96].
Figure 12-5 Processor-in-memory Organization Simple function units (processing elements) are incorporated into the data buffers of fairly conventional DRAM chips.
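A minimal sketch of the PIM idea follows; it uses word-wide processing elements for readability, whereas, as noted above, real designs are often bit-serial, and the row width and operations are illustrative assumptions.

# Sketch of SIMD processing elements attached to a DRAM row buffer: every
# element applies the same operation to its own column, in lock step.
def pim_row_op(row_buffer, op):
    return [op(word) for word in row_buffer]          # conceptually, all PEs fire at once

def pim_row_op2(row_a, row_b, op):
    return [op(a, b) for a, b in zip(row_a, row_b)]   # two-operand form, e.g., row + row

row_a = list(range(16))                               # one (tiny) row in the data buffer
row_b = [3] * 16
print(pim_row_op(row_a, lambda w: w & 0x0F))          # e.g., a masking step
print(pim_row_op2(row_a, row_b, lambda a, b: a + b))  # e.g., elementwise add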
A second option is to enhance the data buffers associated with banks of DRAM so they can be used as vector registers. There can be high bandwidth transfers between DRAM rows and vector
registers using the width of the bit arrays, but arithmetic is performed on vector registers by streaming the data through a small collection of conventional function units. This approach also rests on a good deal of history, including the very successful Cray machines, as well as several unsuccessful attempts. The Cray-1 success was startling in part because of the contrast with the unsuccessful CDC Star-100 vector processor a few years earlier. The Star-100 internally operated on short segments of data that were streamed out of memory into temporary registers. However, it provided vector operations only on contiguous (stride 1) vectors, and the core memory it employed had long latency. The success of the Cray-1 was not just a matter of exposing the vector registers, but a combination of its use of fast semiconductor memory, providing a general interconnect between the vector registers and many banks of memory so that non-unit stride and later gather/scatter operations could be performed, and a very efficient coupling of the scalar and vector operations. In other words, low latency, high bandwidth for general access patterns, and efficient synchronization. Several later attempts at this style of design in newer technologies have failed to appreciate these lessons completely, including the FPS-DMAX, which provided linear combinations of vectors in memory, the Stellar, Ardent, and Stardent vector workstations, the Star-100 vector extension to the Sparcstation, the vector units in the memory controllers of the CM-5, and the vector function units on the Meiko CS-2. It is easy to become enamored with the peak performance and low cost of the special case, but if the start-up costs are large, addressing capability is limited, or interaction with scalar access is awkward, the fraction of time that the extension is actually used drops quickly and the approach is vulnerable to the general purpose solution.
Figure 12-6 Vector IRAM Organization DRAM data buffers are enhanced to provide vector registers, which can be streamed through pipelined function units. The on-chip memory system and vector support are interfaced to a scalar processor, possibly with its own caches.
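A sketch of this data flow is shown below: a row is transferred at full width into a vector register, and arithmetic is done by streaming registers through a function unit. The sizes and the strided-load capability are illustrative assumptions, echoing the Cray lessons discussed above.

# Sketch of the vector-IRAM flow: wide row-to-register transfers, then streaming
# the registers through a pipelined function unit.
ROW_WORDS = 64
dram = [[r * ROW_WORDS + c for c in range(ROW_WORDS)] for r in range(8)]   # toy bank

def load_vreg(row, stride=1, length=16):
    # High bandwidth transfer from the row's data buffer into a vector register;
    # supporting non-unit stride is one of the lessons noted in the text.
    return [dram[row][i * stride] for i in range(length)]

def vector_fu(vreg_a, vreg_b, op):
    # Stream two vector registers through a single pipelined function unit.
    return [op(a, b) for a, b in zip(vreg_a, vreg_b)]

v1 = load_vreg(row=2, stride=2)
v2 = load_vreg(row=3, stride=1)
print(vector_fu(v1, v2, lambda a, b: a * b)[:4])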
The third option pursues a general purpose design, but removes the layers of abstraction that have been associated with the distinct packaging of processors, off-chip caches, and memory systems. Basically, the data buffers associated with the DRAMs are utilized as the final layer of caches. This is quite attractive since advanced DRAM designs have essentially been using the data buffers as caches for several years. However, in the past they were restricted by the narrow interface out of the DRAM chips and the limited protocols of the memory bus. In the more integrated design, the DRAM-buffer caches can interface directly to the higher level caches of one or more on-chip processors. This approach changes the basic cache design trade-offs somewhat, but conventional analysis techniques apply. The cost of long lines is reduced, since the transfer between
the data buffer and the bit array is performed in parallel. The current high cost of logic near the DRAM tends to place a higher cost on associativity. Thus, many approaches use direct mapped DRAM-buffer caches and employ victim buffers or analogous techniques to reduce the rate of conflict misses[Sau*96]. When these integrated designs are used as a building block for parallel machines, the expected effects are observed: long cache blocks cause increased false sharing, along with improved data prefetching when spatial locality is present[Nay*97].

When the PAM approach moves from the stage of academic, simulation based study to active commercial development, we can expect to see an even more profound effect arising from the change in the design process. No longer is a cache designer in a position of optimizing one step in the path, constrained by a fast interface on one side and a slow interface on the other. The boundaries will have been removed and the design can be optimized as an end-to-end problem. All of the possibilities we have seen for integrating the communication assist are present: at the processor, into the cache controller, or into the memory controller, but in a PAM they can be addressed in a uniform framework, rather than within the particular constraints of each component of the system.

However the detailed design of integrated processor and memory components shakes out, these new components are likely to provide a new, universal building block for larger scale parallel machines. Clearly, collections of them can be connected together to form distributed memory machines, and the communication assist is likely to be much better integrated with the rest of the design, since it is all on one chip. Also, the interconnect pins are likely to be the only external interface, so communication efficiency will be even more critical. It will clearly be possible to build parallel machines on a scale far beyond what has ever been possible. A practical limit which has stayed roughly constant since the earliest computers is that large scale systems are limited to about ten thousand components. Larger systems have been built, but tend to be difficult to maintain. In the early days, it was ten thousand vacuum tubes, then ten thousand gates, then ten thousand chips. Recently, large machines have had about a thousand processors and each processor required about ten components, either chips or memory SIMMs. The first teraflops machine has almost ten thousand processors, each with multiple chips, so we will see if the pattern has changed. The complete system on a chip may well be the commodity building block of the future, used in all sorts of intelligent appliances at a volume much greater than the desktop, and hence at a much lower cost. In any case, we can look forward to much larger scale parallelism as the number of processors per chip continues to rise.

A very interesting question is what happens when programs need access to more data than fits on the PAM. One approach is to provide a conventional memory interface for expansion. A more interesting alternative is to simply provide CC-NUMA access to other PAM chips, as developed in Chapter 8. The techniques are now very well understood, as is the minimal amount of hardware support required to provide this functionality. If the processor component of the PAM is indeed small, then even if only one processor in the system is ever used, the additional cost of the other processors sitting near the memories is small.
Since parallel software is becoming more and more widespread, the extra processing power will be there when applications demand it.
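As a sketch of the direct mapped DRAM-buffer cache with a small victim buffer mentioned above, the following toy model shows how the victim buffer absorbs conflict misses; the sizes and the FIFO replacement policy are assumptions for illustration only.

from collections import OrderedDict

# Toy model: a direct mapped cache of DRAM row buffers backed by a small fully
# associative victim buffer that catches rows recently evicted by conflicts.
class RowBufferCache:
    def __init__(self, n_sets=8, victim_entries=4):
        self.n_sets = n_sets
        self.victim_entries = victim_entries
        self.sets = {}                         # set index -> row tag currently buffered
        self.victims = OrderedDict()           # recently evicted row tags (FIFO)

    def access(self, row_tag):
        idx = row_tag % self.n_sets
        if self.sets.get(idx) == row_tag:
            return "row-buffer hit"
        if row_tag in self.victims:            # conflict miss caught by the victim buffer
            del self.victims[row_tag]
            self._install(idx, row_tag)
            return "victim-buffer hit"
        self._install(idx, row_tag)
        return "miss (DRAM array access)"

    def _install(self, idx, row_tag):
        evicted = self.sets.get(idx)
        if evicted is not None:
            self.victims[evicted] = True
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)   # FIFO eviction from the victim buffer
        self.sets[idx] = row_tag

cache = RowBufferCache()
for tag in [5, 13, 5, 13, 5]:                  # tags 5 and 13 map to the same set
    print(tag, cache.access(tag))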
12.2 Applications and System Software

The future of parallel computer architecture is clearly going to be increasingly a story of parallel software and of hardware/software interactions. We can view parallel software as falling into five
different classes: applications, compilers, languages, operating systems, and tools. In parallel software too, the same basic categories of change apply: evolution, hitting walls, and breakthroughs.

12.2.1 Evolution

While data management and information processing are likely to be the dominant applications that exploit multiprocessors, applications in science and engineering have always been the proving ground for high end computing. Early parallel scientific computing focused largely on models of physical phenomena that result in fairly regular computational and communication characteristics. This allowed simple partitionings of the problems to be successful in harnessing the power of multiprocessors, just as it led earlier to effective exploitation of vector architectures. As the understanding of both the application domains and parallel computing grew, early adopters began to model the more complex, dynamic and adaptive aspects that are integral to most physical phenomena, leading to applications with irregular, unpredictable characteristics. This trend is expected to continue, bringing with it the attendant complexity for effective parallelization.

As multiprocessing becomes more and more widespread, the domains of its application will evolve, as will the applications whose characteristics are most relevant to computer manufacturers. Large optimization problems encountered in finance and logistics (for example, determining good crew schedules for commercial airlines) are very expensive to solve, have high payoff for corporations, and are amenable to scalable parallelism. Methods developed under the fabric of artificial intelligence, including searching techniques and expert systems, are finding practical use in several domains and can benefit greatly from increased computational power and storage. In the area of information management, an important trend is toward the increased use of data mining and decision support techniques to extract trends and inferences from large volumes of data (in the latter, complex queries are made to determine trends that will provide the basis for important decisions). These applications are often computationally intensive as well as database-intensive, and mark an interesting marriage of computation and data storage/retrieval to build computation-and-information servers. Such problems are increasingly encountered in scientific research areas as well, for example in manipulating and analyzing the tremendous volumes of biological sequence information that is rapidly becoming available through the sciences of genomics and sequencing, and computing on the discovered data to produce estimates of three-dimensional structure. The rise of wide-scale distributed computing, enabled by the internet and the world-wide web, and the abundance of information in modern society make this marriage of computing and information management all the more inevitable as a direction of rapid growth. Multiprocessors have already become servers for internet search engines and world-wide web queries. The nature of the queries coming to an information server may also span a broad range, from a few, very complex mining and decision support queries at a time to a staggeringly large number of simultaneous queries in the case where the clients are home computers or handheld information appliances.
As the communication characteristics of networks improve, and the need for parallelism with varying degrees of coupling becomes stronger, we are likely to see a coming together of parallel and distributed computing. Techniques from distributed computing will be applied to parallel computing to build a greater variety of information servers that can benefit from the better performance characteristics "within the box", and parallel computing techniques will be employed in the other direction to use distributed systems as platforms for solving a single problem in parallel.
Multiprocessors will continue to play the role of servers (database and transaction servers, compute servers, and storage servers), though the data types manipulated by these servers are becoming much richer and more varied. From information records containing text, we are moving to an era where images, three-dimensional models of physical objects, and segments of audio and video are increasingly stored, indexed, queried, and served out as well. Matches in queries to these data types are often approximate, and serving them out often uses advanced compression techniques to conserve bandwidth, both of which require computation as well. Finally, the increasing importance of graphics, media, and real-time data from sensors (for military, civilian, and entertainment applications) will lead to the increasing significance of real-time computing on data as it streams in and out of a multiprocessor. Instead of reading data from a storage medium, operating on it, and storing it back, this may require operating on data on its way through the machine between input and output data ports. How processors, memory, and network integrate with one another, as well as the role of caches, may have to be revisited in this scenario.

As the application space evolves, we will learn what kinds of applications can truly utilize large-scale parallel computing, and what scales of parallelism are most appropriate for others. We will also learn whether scaled up general-purpose architectures are appropriate for the highest end of computing, or whether the characteristics of these applications are so differentiated that they require substantially different resource requirements and integration to be cost-effective.

As parallel computing is embraced in more domains, we begin to see portable, pre-packaged parallel applications that can be used in multiple application domains. A good example of this is an application called Dyna3D that solves systems of partial differential equations on highly irregular domains, using a method called the finite-element method [cite Hughes book, cite Dyna3D]. Dyna3D was initially developed at government research laboratories for use in modeling weapons systems on exotic, multi-million dollar supercomputers. After the cold war ended, the technology was transitioned into the commercial sector and it became the mainstay of crash modeling in the automotive industry and elsewhere on widely available parallel machines. By the time this book was written, the code was widely used on cost effective parallel servers for performing simulations on everyday appliances, such as what happens to a cellular phone when it is dropped. A major enabling factor in the widespread use of parallel applications has been the availability of portable libraries that implement programming models on a wide range of machines. With the advent of the Message Passing Interface, this has become a reality for message passing, which is used in Dyna3D: the application of Dyna3D to different problems is usually done on very different platforms, from machines designed for message passing like the Intel Paragon, to machines that support a shared address space in hardware like the Cray T3D, to networks of workstations. Similar portability across a wide range of communication architecture performance characteristics has not yet been achieved for a shared address space programming model, but is likely soon.
As more applications are coded in both the shared address space and message passing models, we find a greater separation of the algorithmic aspects of creating a parallel program (decomposition and assignment) from the more mechanical and architecture-dependent aspects (orchestration and mapping), with the former being relatively independent of the programming model and architecture. Galaxy simulations and other hierarchical N-body computations, like Barnes-Hut, provide an example. The early message passing parallel programs used an orthogonal recursive bisection method to partition the computational domain across processors, and did not maintain a global tree data structure. Shared address space implementations, on the other hand, used a global tree and a different partitioning method that led to a very different domain decomposition. With time, and with the improvement in message passing communication architectures, the message passing
versions also evolved to use partitioning techniques similar to those of the shared address space version, including building and maintaining a logically global tree using hashing techniques (a small sketch of this idea appears below). Similar developments have occurred in ray tracing, where parallelizations that assumed no logical sharing of data structures gave way to logically shared data structures implemented in the message passing model using hashing. A continuation of this trend will also contribute to the portability of applications between systems that preferentially support one of the two models.
Like parallel applications, parallel programming languages are also evolving apace. The most popular languages for parallel computing continue to be based on the most popular sequential languages (C, C++, and FORTRAN), with extensions for parallelism. The nature of these extensions is now increasingly driven by the needs observed in real applications. In fact, it is not uncommon for languages to incorporate features that arise from experience with a particular class of applications and are then found to generalize to other classes of applications as well. In a similar vein, portable libraries of commonly occurring data structures and algorithms for parallel computing are being developed, and templates are being defined that programmers can customize to the needs of their specific applications.
Several styles of parallel languages have been developed. The least common denominators are MPI for message passing programming, which has been adopted across a wide range of platforms, and a sequential language enhanced with primitives like the ones we discussed in Chapter 2 for a shared address space. Many of the other language systems try to provide the programmer with a natural programming model, often based on a shared address space, and hide the properties of the machine’s communication abstraction through software layers. One major direction has been explicitly parallel object-oriented languages, with emphasis on appropriate mechanisms to express concurrency and synchronization in a manner integrated with the underlying support for data abstraction. Another has been the development of data parallel languages with directives for partitioning, such as High Performance Fortran, which are currently being extended to handle irregular applications. A third direction has been the development of implicitly parallel languages. Here, the programmer is not responsible for specifying the assignment, orchestration, or mapping, but only for the decomposition into tasks and for specifying the data that each task accesses. Based on these specifications, the runtime system of the language determines the dependences among tasks and assigns and orchestrates them appropriately for parallel execution on a platform. The burden on the programmer is decreased, or at least localized, but the burden on the system is increased. With the increasing importance of data access for performance, several of these languages have proposed language mechanisms and runtime techniques to divide the burden of achieving data locality and reducing communication between the programmer and the system in a reasonable way. Finally, the increasing complexity of the applications being written and of the phenomena being modeled by parallel applications has led to the development of languages that support composing larger applications from smaller parts.
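To make the hashing idea mentioned above concrete, here is a minimal, hedged sketch; the key encoding, the hash function, and the machine size are all invented for illustration and are not drawn from any particular Barnes-Hut code. Each cell of the logically global tree is named by a key, and a hash of the key determines which processor owns the cell, so any processor can locate a cell's owner without shared storage.

    #include <stdio.h>
    #include <stdint.h>

    #define NPROCS 64   /* assumed number of processors, for illustration only */

    /* A tree cell is named by a key, e.g. a bit-interleaved (octree path) code. */
    typedef uint64_t cell_key_t;

    /* Map a cell key to the processor that owns its entry in the logically
     * global tree.  Any hash that spreads keys evenly will do; this one mixes
     * the bits and reduces modulo the processor count. */
    static int owner_of(cell_key_t key)
    {
        key ^= key >> 33;
        key *= 0xff51afd7ed558ccdULL;   /* an arbitrary odd mixing constant */
        key ^= key >> 33;
        return (int)(key % NPROCS);
    }

    int main(void)
    {
        /* A processor that needs a cell checks whether it owns it; if not, it
         * sends a request message to owner_of(key) and awaits the reply (the
         * message-passing calls themselves are omitted from this sketch). */
        cell_key_t examples[] = { 0x1ULL, 0x2a5cULL, 0xdeadbeefULL };
        for (int i = 0; i < 3; i++)
            printf("cell %llx is owned by processor %d\n",
                   (unsigned long long)examples[i], owner_of(examples[i]));
        return 0;
    }

Because every processor applies the same hash, the ownership map is implicit and requires no centralized directory; the cost is that a cell's owner is arbitrary with respect to the partitioning of bodies, so real implementations typically cache remote cells locally as well.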
We can expect much continued evolution in all these directions (appropriate abstractions for concurrency, data abstraction, synchronization and data management; libraries and templates; explicit and implicit parallel programming languages), with the goal of achieving a good balance between ease of programming, performance, and portability across a wide range of platforms (in both functionality and performance). The development of compilers that can automatically parallelize programs has also been evolving over the last decade or two. With significant developments in the analysis of dependences among data accesses, compilers are now able to automatically parallelize simple array-based FORTRAN programs and achieve respectable performance on small-scale multiprocessors [cite].
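To make the preceding claim concrete, the following hedged sketch (in C rather than FORTRAN, and not drawn from any particular compiler's test suite) shows the kind of loop that dependence analysis can prove parallel, together with a pointer-based variant of the kind discussed below that defeats such analysis.

    #define N 1024

    double a[N], b[N], c[N];

    /* Every iteration reads b[i] and c[i] and writes a[i]; since the arrays
     * are distinct and the subscripts are the loop index itself, dependence
     * analysis can prove the iterations independent and run them in parallel. */
    void vector_add(void)
    {
        for (int i = 0; i < N; i++)
            a[i] = b[i] + c[i];
    }

    /* The same computation through pointer parameters: if dst might overlap
     * src1 or src2, iterations may depend on one another, and without further
     * information the compiler must conservatively assume that they do. */
    void vector_add_ptr(double *dst, const double *src1, const double *src2, int n)
    {
        for (int i = 0; i < n; i++)
            dst[i] = src1[i] + src2[i];
    }
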
Advances have also been made in compiler algorithms for managing data locality in caches and main memory, and in optimizing the orchestration of communication given the performance characteristics of an architecture (for example, making communication messages larger for message passing machines). However, compilers are still not able to parallelize more complex programs, especially those that make substantial use of pointers. Since the address stored in a pointer is not known at compile time, it is very difficult to determine whether a memory operation made through a pointer is to the same or a different address than another memory operation. Acknowledging this difficulty, one approach that is increasingly being taken is that of interactive parallelizing compilers. Here, the compiler discovers the parallelism that it can, then gives intelligent feedback to the user about the places where it failed, and asks the user questions about choices it might make in decomposition, assignment and orchestration. Given these directed questions and information, a user familiar with the program may have, or may discover, the higher-level knowledge of the application that the compiler did not have. Another approach in a similar vein is integrated compile-time and run-time parallelization tools, in which information gathered at runtime may be used to help the compiler (and perhaps the user) parallelize the application successfully. Compiler technology will continue to evolve in these areas, as well as in the simpler area of providing support to deal with relaxed memory consistency models.
Operating systems are making the transition from uniprocessors to multiprocessors, and from treating multiprocessors as batch-oriented compute engines to treating them as multiprogrammed servers of computational and storage resources. In the former category, the evolution has included making operating systems more scalable by reducing the serialization within the operating system, and making the scheduling and resource management policies of operating systems take spatial and temporal locality into account (for example, not moving processes around too much in the machine, scheduling them close to the data they access, and scheduling them on the same processor every time as far as possible). In the latter category, a major challenge is managing resources and process scheduling in a way that strikes a good balance between fairness and performance, just as was done in uniprocessor operating systems. Attention is paid to data locality in scheduling application processes in a multiprogrammed workload, as well as to parallel performance in determining resource allocation: for example, the relative parallel performance of applications in a multiprogrammed workload may be used to determine how many processors and other resources to allocate to each application at the cost of the others. Operating systems for parallel machines are also increasingly incorporating the characteristics of the multiprogrammed mainframe machines that were the mainstay servers of the past. One challenge in this area is exporting to the user the image of a single operating system (called a single system image), while still providing the reliability and fault tolerance of a distributed system. Other challenges include containing faults, so that only the faulting application or the resources it uses are affected by a fault, and providing the reliability and availability that people expect from mainframes.
This evolution is necessary if scalable microprocessor-based multiprocessors are to truly replace mainframes as “enterprise” servers for large organizations, running multiple applications at a time.
With the increasing complexity of application-system interactions, and the increasing importance of the memory and communication systems for performance, it is very important to have good tools for diagnosing performance problems. This is particularly true for a shared address space programming model, since communication there is implicit and artifactual communication can often dominate other performance effects. Contention is a particularly prevalent and difficult performance effect for a programmer to diagnose, especially because the point in the program (or machine) where the effects of contention are felt can be quite different from the point that causes the contention. Performance tools continue to evolve, providing feedback about where in the
program the largest overhead components of execution time lie, what data structures might be causing the data access overheads, and so on. We are likely to see progress in techniques that increase the visibility performance monitoring tools have into the machine, helped by the recent increasing willingness of machine designers to add a few registers or counters at key points in a machine solely for the purpose of performance monitoring. We are also likely to see progress in the quality of the feedback that is provided to the user, ranging from where in the program code a lot of time is being spent stalled on data access (available today), to which data structures are responsible for the majority of this time, to what the cause of the problem is (communication overhead, capacity misses, conflict misses with another data structure, or contention). The more detailed the information, and the better it is cast in terms of the concepts the programmer deals with rather than in machine terms, the more likely it is that a programmer can respond to the information and improve performance. Evolution along this path will hopefully bring us to a good balance between software and hardware support for performance diagnosis. A final aspect of the evolution will be the continued integration of the system software pieces described above. Languages will be designed to make the compiler’s job easier in discovering and managing parallelism, and to allow more information to be conveyed to the operating system for its scheduling and resource allocation decisions.
12.2.2 Hitting a Wall
On the software side, there is a wall that we have been hitting up against for many years now: the wall of programmability. While programming models are becoming more portable, architectures are converging, and good evolutionary progress is being made in many areas as described above, it is still the case that parallel programming is much more difficult than sequential programming. Programming for good performance takes a lot of work, sometimes in determining a good parallelization, and other times in implementing and orchestrating it. Even debugging parallel programs for correctness is an art and at best a primitive science. The parallel debugging task is difficult because of the interactions among multiple processes with their own program orders, and because of sensitivity to timing. Depending on when events in one process happen to occur relative to events in another process, a bug in the program may or may not manifest itself at runtime in a particular execution, as the small example at the end of this subsection illustrates. And if it does, instrumenting the code to monitor certain events can cause the timing to be perturbed in such a way that the bug no longer appears. Parallel programming has been up against this wall for some time now; while evolutionary progress is helpful and has greatly increased the adoption of parallel computing, breaking down this wall will take a breakthrough that truly allows parallel computing to realize the potential afforded to it by technology and architectural trends. It is unclear whether this breakthrough will be in languages per se, or in programming methodology, or whether it will simply be an evolutionary process.
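To illustrate the timing sensitivity just described, the following is a small, hedged example written for this discussion; it is not taken from the book's programs, and it uses POSIX threads rather than the parallelism constructs used elsewhere in the text. Two threads update a shared counter without synchronization, and whether the lost updates are visible in a given run depends on how the threads happen to interleave.

    #include <stdio.h>
    #include <pthread.h>

    static long counter = 0;   /* shared and unprotected: this is the bug */

    static void *work(void *arg)
    {
        (void)arg;
        for (int i = 0; i < 1000000; i++)
            counter++;         /* read-modify-write race with the other thread */
        return NULL;
    }

    int main(void)
    {
        pthread_t t1, t2;
        pthread_create(&t1, NULL, work, NULL);
        pthread_create(&t2, NULL, work, NULL);
        pthread_join(t1, NULL);
        pthread_join(t2, NULL);

        /* Expected 2000000; an actual run may print less, or may happen to
         * print the right answer, depending on how the threads interleave.
         * Adding a printf inside the loop to "instrument" the program slows
         * the threads down and can make the lost updates disappear. */
        printf("counter = %ld\n", counter);
        return 0;
    }
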
12.2.3 Potential Breakthroughs
Other than breakthroughs in parallel programming languages or methodology and in parallel debugging, we may hope for a breakthrough in performance models for reasoning about parallel programs. While many models have been quite successful at exposing the essential performance characteristics of a parallel system [Val90, Cul*96], and some have even provided a methodology for using these parameters, the more challenging aspect is modeling the properties of complex parallel applications and their interactions with the system parameters. There is not yet a well-defined methodology for programmers or algorithm designers to use for this purpose: to determine
how well an algorithm will perform in parallel on a system, or which among competing partitioning or orchestration approaches will perform better. Another breakthrough may come from architecture, if we can somehow design machines in a cost-effective way that makes it much less important for a programmer to worry about data locality and communication; that is, if we can truly design a machine that looks to the programmer like a PRAM. An example would be an architecture that could tolerate all the latency incurred by the program. However, this is likely to require tremendous bandwidth, which has a high cost, and it is not clear how to extract from an application the concurrency needed for this degree of latency tolerance, if the application has it to begin with. The ultimate breakthrough, of course, would be the complete success of parallelizing compilers in taking a wide range of sequential programs and converting them into efficient parallel executions on a given platform, achieving good performance at a scale close to that inherently afforded by the application (i.e., close to the best one could do by hand). Besides the problems discussed above, parallelizing compilers tend to look for and utilize low-level, localized information in the program, and are not currently good at performing high-level, global analysis and transformations. Compilers also lack semantic information about the application; for example, if a particular sequential algorithm for a phase of a problem does not parallelize well, there is nothing the compiler can do to choose another one. And for a compiler to take data locality and artifactual communication into consideration and manage the extended memory hierarchy of a multiprocessor is very difficult. However, even effective programmer-assisted compiler parallelization, keeping the programmer involvement to a minimum, would be perhaps the most significant software breakthrough in making parallel computing truly mainstream.
Whatever the future holds, it is certain that there will be continued evolutionary advances that cross critical thresholds, that significant walls will be encountered but ways around them are likely to be found, and that there will be unexpected breakthroughs. Parallel computing will remain the place where exciting changes in computer technology and applications are first encountered, and there will be an ever-evolving cycle of hardware/software interaction.
12.3 References

[Bal*62]   J. R. Ball, R. C. Bollinger, T. A. Jeeves, R. C. McReynolds, and D. H. Shaffer, On the Use of the Solomon Parallel-Processing Computer, Proceedings of the AFIPS Fall Joint Computer Conference, vol. 22, pp. 137-146, 1962.
[Bat79]    Kenneth E. Batcher, The STARAN Computer, Infotech State of the Art Report: Supercomputers, vol. 2, ed. C. R. Jesshope and R. W. Hockney, Maidenhead: Infotech Intl. Ltd., pp. 33-49.
[Bat80]    Kenneth E. Batcher, Design of a Massively Parallel Processor, IEEE Transactions on Computers, C-29(9):836-840.
[Bou*72]   W. J. Bouknight, S. A. Denenberg, D. E. McIntyre, J. M. Randall, A. H. Sameh, and D. L. Slotnick, The Illiac IV System, Proceedings of the IEEE, 60(4):369-388, Apr. 1972.
[Bur*96]   D. Burger, J. Goodman, and A. Kagi, Memory Bandwidth Limitations in Future Microprocessors, Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 78-89, May 1996.
[Burg97]   Doug Burger, System-Level Implications of Processor-Memory Integration, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA97, June 1997.
[Cor72]    J. A. Cornell, Parallel Processing of Ballistic Missile Defense Radar Data with PEPE, COMPCON 72, pp. 69-72.
[Cul*96]   D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken, LogP: A Practical Model of Parallel Computation, CACM, vol. 39, no. 11, pp. 78-85, 1996.
[Ell*92]   Duncan G. Elliott, W. Martin Snelgrove, and Michael Stumm, Computational RAM: A Memory-SIMD Hybrid and its Application to DSP, Custom Integrated Circuits Conference, pp. 30.6.1-30.6.4, Boston, MA, May 1992.
[Ell*97]   Duncan Elliott, Michael Stumm, and Martin Snelgrove, Computational RAM: The Case for SIMD Computing in Memory, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA97, June 1997.
[Gok*95]   Maya Gokhale, Bill Holmes, and Ken Iobst, Processing in Memory: The Terasys Massively Parallel PIM Array, Computer, 28(3):23-31, April 1995.
[Hill85]   W. Daniel Hillis, The Connection Machine, MIT Press, 1985.
[JoPa97]   Norman P. Jouppi and Parthasarathy Ranganathan, The Relative Importance of Memory Latency, Bandwidth, and Branch Limits to Performance, Workshop on Mixing Logic and DRAM: Chips that Compute and Remember at ISCA97, June 1997.
[Kog94]    Peter M. Kogge, EXECUBE - A New Architecture for Scalable MPPs, 1994 International Conference on Parallel Processing, pp. I77-I84, August 1994.
[Nic90]    J. R. Nickolls, The Design of the MasPar MP-1: A Cost Effective Massively Parallel Computer, COMPCON Spring '90 Digest of Papers, San Francisco, CA, 26 Feb.-2 March 1990, p. 258.
[Olu*96]   Kunle Olukotun, Basem A. Nayfeh, Lance Hammond, Ken Wilson, and Kunyung Chang, The Case for a Single-Chip Multiprocessor, ASPLOS, Oct. 1996.
[Patt95]   D. Patterson, Microprocessors in 2020, Scientific American, September 1995.
[Pat*97]   David Patterson, Thomas Anderson, Neal Cardwell, Richard Fromm, Kimberly Keeton, Christoforos Kozyrakis, Randi Thomas, and Katherine Yelick, A Case for Intelligent RAM, IEEE Micro, April 1997.
[Prz*88]   Steven Przybylski, Mark Horowitz, and John Hennessy, Performance Tradeoffs in Cache Design, Proceedings of the 15th Annual International Symposium on Computer Architecture, pp. 290-298, May 1988.
[Red73]    S. F. Reddaway, DAP - A Distributed Array Processor, First Annual International Symposium on Computer Architecture, 1973.
[Sau*96]   Ashley Saulsbury, Fong Pong, and Andreas Nowatzyk, Missing the Memory Wall: The Case for Processor/Memory Integration, Proceedings of the 23rd Annual International Symposium on Computer Architecture, pp. 90-101, May 1996.
[Slo*62]   Daniel L. Slotnick, W. Carl Borck, and Robert C. McReynolds, The Solomon Computer, Proceedings of the AFIPS Fall Joint Computer Conference, vol. 22, pp. 97-107, 1962.
[Slot67]   Daniel L. Slotnick, Unconventional Systems, Proceedings of the AFIPS Spring Joint Computer Conference, vol. 30, pp. 477-481, 1967.
[Sto70]    Harold S. Stone, A Logic-in-Memory Computer, IEEE Transactions on Computers, C-19(1):73-78, January 1970.
[TuRo88]   L. W. Tucker and G. G. Robertson, Architecture and Applications of the Connection Machine, IEEE Computer, pp. 26-38, Aug. 1988.
[Yam*94]   Nobuyuki Yamashita, Tohru Kimura, Yoshihiro Fujita, Yoshiharu Aimoto, Takashi Manaba, Shin'ichiro Okazaki, Kazuyuki Nakamura, and Masakazu Yamashina, A 3.84 GIPS Integrated Memory Array Processor LSI with 64 Processing Elements and 2 Mb SRAM, International Solid-State Circuits Conference, pp. 260-261, San Francisco, February 1994. NEC.
[Val90]    L. G. Valiant, A Bridging Model for Parallel Computation, CACM, vol. 33, no. 8, pp. 103-111, Aug. 1990.
[ViCo78]   C. R. Vick and J. A. Cornell, PEPE Architecture - Present and Future, AFIPS Conference Proceedings, 47:981-1002, 1978.
APPENDIX A
Parallel Benchmark Suites
Morgan Kaufmann is pleased to present material from a preliminary draft of Parallel Computer Architecture; the material is (c) Copyright 1997 Morgan Kaufmann Publishers. This material may not be used or distributed for any commercial purpose without the express written consent of Morgan Kaufmann Publishers. Please note that this material is a draft of a forthcoming publication, and as such neither Morgan Kaufmann nor the authors can be held liable for changes or alterations in the final edition.
Despite the difficulty of identifying “representative” parallel applications and the immaturity of parallel software (languages, programming environments), the evaluation of machines and architectural ideas must go on. To this end, several suites of parallel “benchmarks” have been developed and distributed. We put benchmarks in quotation marks because of the difficulty of relying on a single suite of programs to make definitive performance claims in parallel computing. Benchmark suites for multiprocessors vary in the kinds of application domains they cover, whether they include toy programs, kernels or real applications, the communication abstractions they are targeted for, and their philosophy toward benchmarking or architectural evaluation. Below we discuss some of the most widely used, publicly available benchmark/application suites for parallel processing. Table 0.1 shows how these suites can currently be obtained.
A.1 ScaLapack

This suite [DW94] consists of parallel, message-passing implementations of the LAPACK linear algebra kernels. The LAPACK suite contains many important linear algebra routines, including solvers for linear systems of equations, eigenvalue problems, and singular value problems. ScaLapack therefore also includes the many types of matrix multiplications and factorizations (LU and others) used by these routines. Information about the ScaLapack suite, and the suite itself, can be obtained from the Netlib repository maintained at Oak Ridge National Laboratory.
A.2 TPC

The Transaction Processing Performance Council (TPC), founded in 1988, has created a set of publicly available benchmarks, called TPC-A, TPC-B and TPC-C, that are representative of different kinds of transaction-processing and database system workloads <>. While database and transaction processing workloads are very important in actual usage of parallel machines, source codes for these workloads are almost impossible to obtain because of their competitive value to their developers.
TPC has a rigorous policy for reporting results. It uses two metrics: (i) throughput in transactions per second, subject to a constraint that over 90% of the transactions have a response time less than a specified threshold; and (ii) cost-performance in price per transaction per second, where price is the total system price plus the maintenance cost for five years. In addition to this innovation in metrics, the benchmarks scale with the power of the system in order to measure it realistically.
The first TPC benchmark was TPC-A, intended to consist of a single, simple, update-intensive transaction that provides a repeatable unit of work and is designed to exercise the key features of an on-line transaction processing (OLTP) system. The chosen transaction is from banking: it reads 100 bytes from a terminal, updates the account, branch and teller records, writes a history record, and finally writes 200 bytes back to the terminal. TPC-B, the second benchmark, is an industry-standard database (not OLTP) benchmark, designed to exercise the system components necessary for update-intensive database transactions. It therefore has significant disk I/O, moderate system and application execution time, and requires transaction integrity. Unlike OLTP, it does not require terminals or networking.
The third benchmark, TPC-C, was approved in 1992. It is designed to be more realistic than TPC-A while carrying over many of its characteristics, and it is likely to displace TPC-A with time. TPC-C is a multi-user benchmark, and requires a remote terminal emulator to emulate a population of users with their terminals. It models the activity of a wholesale supplier with a number of geographically distributed sales districts and supply warehouses, including customers placing orders, making payments or making inquiries, as well as deliveries and inventory checks. It uses a database size that scales with the throughput of the system. Unlike TPC-A, which has a single, simple type of transaction that is repeated and therefore requires a simple database, TPC-C has a more complex database structure, multiple transaction types of varying complexity, online and deferred execution modes, higher levels of contention on data access and update,
patterns that simulate hot spots, access by primary as well as non-primary keys, realistic requirements for full-screen terminal I/O and formatting, requirements for full transparency of data partitioning, and transaction rollbacks.
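For concreteness, here is a small sketch of how the TPC cost-performance metric described above is computed; the system price, maintenance cost, and throughput figures are invented for illustration and do not describe any real TPC result.

    #include <stdio.h>

    int main(void)
    {
        /* Hypothetical numbers, chosen only to illustrate the metric. */
        double system_price    = 500000.0;   /* total system price ($) */
        double maintenance_5yr = 100000.0;   /* five-year maintenance cost ($) */
        double throughput_tps  = 120.0;      /* transactions per second, with over
                                                90% of transactions completing under
                                                the response-time threshold */

        double price_per_tps = (system_price + maintenance_5yr) / throughput_tps;
        printf("cost-performance: $%.0f per transaction per second\n", price_per_tps);
        return 0;
    }

With these assumed numbers the reported cost-performance would be $5,000 per transaction per second; a real TPC report also fixes the workload scale and response-time bound in detail.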
A.3 SPLASH

The SPLASH (Stanford ParalleL Applications for SHared Memory) suite [SWG92] was developed at Stanford University to facilitate the evaluation of architectures that support a cache-coherent shared address space. It was replaced by the SPLASH-2 suite [SW+95], which enhanced some applications and added more. The SPLASH-2 suite currently contains 7 complete applications and 5 computational kernels. Some of the applications and kernels are provided in two versions, with different levels of optimization in the way data structures are designed and used (see the discussion of levels of optimization in Section 4.3.2). The programs represent various computational domains, mostly scientific and engineering applications and computer graphics. They are all written in C, and use the Parmacs macro package from Argonne National Laboratory [LO+87] for parallelism constructs. All of the parallel programs we use for workload-driven evaluation of shared address space machines in this book come from the SPLASH-2 suite.
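The Parmacs macros provide constructs for process creation, locking, and barriers. As a hedged illustration of the style of parallelism involved (written for this appendix with POSIX threads rather than the macro package itself, and with invented variable names), the following sketch creates workers that update a shared variable under a lock and then waits for them all to finish; the comments note the rough analogues among the macro-style constructs.

    #include <stdio.h>
    #include <pthread.h>

    #define NPROCS 4                      /* number of workers, chosen for illustration */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int shared_sum = 0;            /* shared data protected by the lock */

    static void *worker(void *arg)
    {
        int id = *(int *)arg;
        pthread_mutex_lock(&lock);        /* analogous to a macro-style LOCK */
        shared_sum += id;
        pthread_mutex_unlock(&lock);      /* analogous to UNLOCK */
        return NULL;
    }

    int main(void)
    {
        pthread_t threads[NPROCS];
        int ids[NPROCS];

        for (int i = 0; i < NPROCS; i++) {    /* analogous to macro-style process creation */
            ids[i] = i;
            pthread_create(&threads[i], NULL, worker, &ids[i]);
        }
        for (int i = 0; i < NPROCS; i++)      /* analogous to waiting for all workers */
            pthread_join(threads[i], NULL);

        printf("shared_sum = %d\n", shared_sum);
        return 0;
    }
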
A.4 NAS Parallel Benchmarks

The NAS benchmarks [BB+91] were developed by the Numerical Aerodynamic Simulation group at the National Aeronautics and Space Administration (NASA). They are a set of eight computations: five kernels and three pseudo-applications (not complete applications, but more representative than the kernels of the kinds of data structures, data movement and computation required in real aerophysics applications). Each computation is intended to focus on some important aspect of the types of highly parallel computations that arise in aerophysics applications. The kernels include an embarrassingly parallel computation, a multigrid equation solver, a conjugate gradient equation solver, a three-dimensional FFT equation solver, and an integer sort. Two different data sets, one small and one large, are provided for each benchmark. The benchmarks are intended to evaluate and compare real machines against one another; the performance of a machine on the benchmarks is normalized to that of a Cray Y-MP (for the smaller data sets) or a Cray C-90 (for the larger data sets).
The NAS benchmarks take a different approach to benchmarking than the other suites described in this section. Those suites provide programs that are already written in a high-level language
(such as Fortran or C, with constructs for parallelism). The NAS benchmarks, on the other hand, do not provide parallel implementations, but are so-called “paper-and-pencil” benchmarks. They specify the problem to be solved in complete detail (the equation system and constraints, for example) and the high-level method to be used (multigrid or conjugate gradient method, for example), but do not provide the parallel program to be used. Instead, they leave it up to the user to use the best parallel implementation for the machine at hand. The user is free to choose language constructs (though the language must be an extension of Fortran or C), data structures, communication abstractions and mechanisms, processor mapping, memory allocation and usage, and low-level optimizations (with some restrictions on the use of assembly language). The motivation for this approach to benchmarking is that since parallel architectures are so diverse and there is no established dominant programming language or communication abstraction that is most efficient on all architectures, a parallel implementation that is best suited to one machine may not be appropriate for another. If we want to compare two machines using a given computation or benchmark, we should use the most appropriate implementation for each machine. Thus, fixing the code to be used is inappropriate, and only the high-level algorithm and the inputs with which it should be run should be specified. This approach puts a greater burden on the user of the benchmarks, but is more appropriate for comparing widely disparate machines. Providing the codes themselves, on the other hand, makes the user’s task easier, and may be a better approach for exploring architectural tradeoffs among a well-defined class of similar architectures. Because the NAS benchmarks, particularly the kernels, are relatively easy to implement and represent an interesting range of computations for scientific computing, they have been widely embraced by multiprocessor vendors.
Table 0.1 Obtaining Public Benchmark and Application Suites

Benchmark Suite    Communication Abstraction         How to Obtain Code or Information
ScaLapack          Message-passing                   E-mail: [email protected]
TPC                Either (not provided parallel)    Transaction Proc. Perf. Council
SPLASH             Shared Address Space (CC)         ftp/Web: mojave.stanford.edu
NAS                Either (paper-and-pencil)         Web: www.nas.nasa.gov
PARKBENCH          Message-passing                   E-mail: [email protected]

A.5 PARKBENCH

The PARKBENCH (PARallel Kernels and BENCHmarks) effort [PAR94] is a large-scale effort
to develop a suite of microbenchmarks, kernels and applications for benchmarking parallel machines, with at least an initial focus on explicit message-passing programs using Fortran77 with the Parallel Virtual Machine (PVM) [BD+91] library for communication. Versions of the programs in the High Performance Fortran language [HPF93] are also provided, for portability across message-passing and shared address space platforms. PARKBENCH provides the different types of benchmarks we have discussed: low-level benchmarks or microbenchmarks, kernels, compact applications, and compiler benchmarks (the last two categories yet to be provided). For the
low-level benchmarks, the PARKBENCH committee first suggests using some existing benchmarks to measure per-node performance, and provides some of its own microbenchmarks for this purpose as well. The purpose of these uniprocessor microbenchmarks is to characterize the performance of various aspects of the architecture and compiler system, and to obtain parameters that can be used to understand the performance of the kernels and compact applications. The uniprocessor microbenchmarks include timer calls, arithmetic operations, and memory bandwidth and latency stressing routines. There are also multiprocessor microbenchmarks that test communication latency and bandwidth (both point-to-point and all-to-all) as well as global barrier synchronization. Several of these multiprocessor microbenchmarks are taken from the earlier Genesis benchmark suite.
The kernel benchmarks are divided into matrix kernels (multiplication, factorization, transposition and tridiagonalization), Fourier transform kernels (a large 1-D FFT and a large 3-D FFT), partial differential equation kernels (a 3-D successive-over-relaxation iterative solver, and the multigrid kernel from the NAS suite), and others, including the conjugate gradient, integer sort and embarrassingly parallel kernels from the NAS suite and a paper-and-pencil I/O benchmark. Compact applications (full but perhaps simplified applications) are intended in the areas of climate and meteorological modeling, computational fluid dynamics, financial modeling and portfolio optimization, molecular dynamics, plasma physics, quantum chemistry, quantum chromodynamics and reservoir modeling, among others. Finally, the compiler benchmarks are intended for people developing High Performance Fortran compilers to test their compiler optimizations, not to evaluate architectures.
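As an illustration of the point-to-point latency microbenchmarks described above, the following is a generic ping-pong sketch written in C with MPI; it is not PARKBENCH's own Fortran77/PVM code, and the repetition count and message size are chosen only for illustration. Two processes bounce a small message back and forth and divide the elapsed time by the number of one-way trips.

    #include <stdio.h>
    #include <mpi.h>

    #define REPS   1000
    #define NBYTES 8     /* small message, so the time is dominated by latency */

    int main(int argc, char *argv[])
    {
        char buf[NBYTES] = {0};
        int rank;
        MPI_Status status;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Run with two processes: rank 0 and rank 1 ping-pong a message. */
        double t0 = MPI_Wtime();
        for (int i = 0; i < REPS; i++) {
            if (rank == 0) {
                MPI_Send(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD);
                MPI_Recv(buf, NBYTES, MPI_CHAR, 1, 0, MPI_COMM_WORLD, &status);
            } else if (rank == 1) {
                MPI_Recv(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD, &status);
                MPI_Send(buf, NBYTES, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
            }
        }
        double t1 = MPI_Wtime();

        if (rank == 0)
            printf("one-way latency estimate: %g microseconds\n",
                   (t1 - t0) / (2.0 * REPS) * 1e6);

        MPI_Finalize();
        return 0;
    }

A bandwidth version of the same microbenchmark simply increases NBYTES until the transfer time, rather than the per-message overhead, dominates.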
A.6 Other Ongoing Efforts

SPEC/HPCG: The developers of the widely used SPEC (Standard Performance Evaluation Corporation) suite for uniprocessor benchmarking [SPE89] have teamed up with the developers of the Perfect Club (PERFormance Evaluation for Cost-effective Transformations) benchmarks for traditional vector supercomputers [BC+89] to form the SPEC High Performance Computing Group (HPCG), which is developing a suite of benchmarks measuring the performance of systems that “push the limits of computational technology”, notably multiprocessor systems.
Many other benchmarking efforts exist. As we can see, most of them so far are for message-passing computing, though some are beginning to provide versions in High Performance Fortran that can be run on any of the major communication abstractions with the appropriate compiler support. We can expect the development of more shared address space benchmarks in the future. Many of the existing suites are also targeted toward scientific computing (the PARKBENCH and NAS efforts being the most large-scale among these), though there is increasing interest in producing benchmarks for other classes of workloads (including commercial and general-purpose time-shared workloads) for parallel machines. Developing benchmarks and workloads that are representative of real-world parallel computing and are also effectively portable to real machines is a very difficult problem, and the above are all steps in the right direction.