Listing 1.2. Wire-up code for our numerical integration demo (fragment):
  i = Pollarder::Factory().get();
  return i->integrate(0, 1);

Fig. 4. Exemplary use of Pollarder. Notice how similar the componentization and wire-up are, even though the applications have to perform very different computations.
The heterogeneous networks we use in our multi-cluster setup have proven to be problematic for MuCluDent, as the slowest node determines the whole system's performance. With overlapping computation and communication, a load balancer can reduce the time needed for computation on slower nodes, but to compensate for high-latency networks, the ghost zone would have to be enlarged. Enlarged ghost zones on all nodes would be undesirable (they come at the expense of increased overhead and would be unnecessary between nodes sharing a low-latency network), but handling a locally increased ghost zone width only between selected nodes turned out to be overly complex. We plan to smooth this out with a HAP-capable parallelization for MuCluDent, but, unlike our other parallelizations, this implementation is not yet able to perform load balancing. As MuCluDent's computational load is distributed very unevenly across the simulation grid, we could not gather meaningful benchmark results for this parallelization so far.

We did, however, test the HAP pattern with a demo application which implements a simple numerical integration for one-dimensional functions. Figure 4 shows a small code excerpt. For the test run we used three dual-Opteron nodes from our RA cluster.
[Fig. 5 graphic: the detected hierarchy of sub-clusters, labeled with middleware (MPI) and diameter according to our distance measure – mirz.uni-jena.de (diameters 0.99 and 0.24), mipool.uni-jena.de (1.02), inf-ra.uni-jena.de (0.62), enclosed by a cluster of diameter 1.71; node and CPU counts omitted.]
Fig. 5. Cluster analysis from a test run on our multi-cluster system. The initial maximum diameter was 0.7, the diameter multiplier was 1.6.
For comparison we integrated f(x) = x² on the interval [0, 1] once using a "flat" MPI parallelization with six processes (two on each machine) and once with a stacked HAP parallelization that used three MPI processes which forwarded their sub-intervals to a threaded parallelization. Although the problem scales well, the HAP parallelization turned out to be 31% faster than the flat parallelization. This is because, due to the low number of samples (2000), the initial scatter of interval borders and the final gather of results dominated the running time, and a reduced number of MPI processes sped them up. Still, this substantiates our claim that HAP may benefit a system's efficient usage. As the flat parallelizations in the MuCluDent project suffer from the same problem, namely that communication may be the dominating factor in a multi-cluster, we expect a comparable gain for our geometric decomposition codes.

Figure 5 shows the result of Pollarder's environment detection using our cluster analysis algorithm. For the test, 20 nodes from the Unix pool were used along with 10 from the Linux pool and three from the RA cluster (two of which were dual Opterons). Despite its early stage, our prototype was able to reliably detect the system's structure, including the two dual Opterons on the right. An interesting observation is the sub-cluster of diameter 0.24 in the Unix pool (mirz.uni-jena.de). Initially this seemed to be a bug in our algorithm, but it turned out that the nodes in this sub-cluster have gigabit Ethernet, in contrast to the other Unix pool nodes, which only use Fast Ethernet.
8 Summary and Outlook
Complexity and variety of contemporary grid systems have become major challenges for scientific computing. We have presented a new approach to grid application componentization, specially targeted at adaptive parallelizations. The presented design patterns can break down an application's functionality into small, reusable components. Our prototype suggests that these patterns are
generic enough to be employed in a variety of applications, ranging from loosely coupled problems like simple function integration to tightly coupled geometric decomposition codes. The Model-Parallelization-Balancer pattern takes care of coarse-grained adaptation, while Hierarchical Adaptive Parallelization can decompose complex parallelizations into smaller sub-parallelizations. This is especially important in the face of increasingly popular combined multi-core and MPI cluster setups. A factory takes over environment discovery and assembles the application's components, thereby enabling self-adaptation to multiple environments and relieving the user from manual interaction. While the adaptation provided by the factory is static in nature, the balancer in the MPB pattern can provide dynamic adaptation at runtime. Despite being only a prototype, our current implementation has already proven itself in a real application and is able to reliably detect even complex multi-cluster setups.
The Design and Evaluation of MPI-Style Web Services
Ian Cooper and Yan Huang
School of Computer Science, Cardiff University, United Kingdom
{i.m.cooper,yan.huang}@cs.cardiff.ac.uk
Abstract. This paper describes how Message Passing Web Services (MPWS) can be used as a message passing tool to enable parallel processing between WS-based processes in a web-service-oriented computing environment. We describe the evaluation tests performed to assess the point-to-point communications performance of MPWS compared to mpiJava wrapping MPICH. Following these evaluations we conclude that using web services to enable parallel processing is a practical solution for coarse-grained parallel applications, and that, due to inter-message pipelining, the MPWS system can, under certain conditions, improve on the communication times of mpiJava.
1 Introduction
A workflow is a series of processing tasks, each of which operates on a particular data set and is mapped to a particular processor for execution. In a loosely coupled web service environment, a workflow can itself be presented as a web service and invoked by other workflows. Web service standards and technologies provide an easy and flexible way of building workflow-based applications, encouraging the re-use of existing applications and the creation of large and complex applications from composite workflows. BPEL4WS is commonly used for web-service-based scientific workflow compositions [1], but users are limited to applications with non-interdependent processes. Furthermore, issues relating to the unsatisfactory performance of SOAP messaging have tended to inhibit the wide adoption of web service technologies for high performance distributed scientific computing. In spite of the performance concerns, the use of web service architectures to build distributed computing systems for scientific applications has become an area of much active research. Recently developed workflow languages have started addressing the problem of intercommunicating processes. Grid Services Flow Language (GSFL) [2] is one example; it provides the functionality for one currently executing Grid service to communicate directly with another concurrently executing Grid service. Another example is Message Passing Flow Language (MPFL) [3], which specifies an XML-based language that enables web-service-based workflows to be described using MPI-style send and receive commands. Neither of the examples mentioned above has presented a workflow engine, and currently there is no workflow engine that supports MPI-style
direct message passing; the GSFL paper describes an implementation using OGSA notification ports in a subscriber-producer methodology, but the MPFL remains a draft language with no implementation details. In this paper, we investigate the potential and suitability of using a web services infrastructure to support parallel applications that require MPI-like message passing. We look at various methods and tools that can be used to implement these message exchange patterns (MEPs) and assess the suitability of previous work, within the web service framework, for this emerging workflow use. We then propose an implementation for Message Passing Web Services (MPWS) and present performance results comparing MPWS against mpiJava [4], a leading HPC Java implementation [5]. We have used mpiJava as it is a tool for distributed computing rather than for use within a cluster environment; MPWS combines distributed, loosely coupled services to form a temporary, tightly coupled application with a similar goal. There has also been much research comparing mpiJava to other HPC systems [6].
2 Background and Related Research
In the context of parallel computing and MPI, message passing refers to the act of cooperatively passing data between two or more separate workers or processes [7]. Thus, message passing is used in parallel scientific applications to share data between cooperating processes. It enables applications to be split into concurrently running subtasks that have data interdependencies. In a service-oriented scenario, this can be translated to the act of sending data from one executing service to another, concurrently executing, service. The problem here is that a service can be concurrently invoked many times; once a service is invoked, there must be a way of determining which instance of the service needs to receive the message. SOAP-based web services communicate via SOAP messages, and these messages are exchanged in a variety of patterns. Within the WS framework there is normally a simple Message Exchange Pattern (MEP) that involves either a request only, or a request and response message. The normal invocation of a service during the execution of a workflow is for the workflow manager to request a service and then, when the service has completed, a response is returned to the workflow manager. It can be seen that this requires mediation by the workflow manager at every step of the workflow process. Kut and Birant [8] have suggested that web services could become a tool for parallel processing and present a model, using threads to call web services in parallel, that allows web services to perform parallel processing tasks. This model can be extended (as shown in Fig. 1) to allow these services to exchange data directly, which removes the need for the workflow manager to intervene every time a process transfers data [2]. Currently there is no standard for directly passing data from one service to another running service.
Fig. 1. Extending the use of parallel executing services to perform message passing
Alternative MEPs are in various stages of research; in-only patterns are in common usage in most web service platforms, and research has been undertaken into a single request multiple response (SRMR) MEP [9]. In this framework for SRMR, an agent is used to relay the service call, and a centralized web service collects the responses. Research into the use of web services in parallel computations is presented by Puppin et al. [10], who developed an approach for wrapping MPI nodes within web services. Their paper shows that the performance of wrapped MPI nodes can be comparable with MPI running in a cluster environment, although many more computers are required for the wrapped MPI version. In our research, we focus on developing and evaluating web services that are capable of MPI-like communication with other services; the performance of SOAP messaging is a key issue in determining whether MPWS can be made comparable in performance with other distributed message-passing systems. There is a problem when it comes to sending the data within a SOAP message. SOAP uses XML, and if true XML formatting is to be used, i.e. listing each entity of the data within a tagged element, the space overhead for the message is potentially very large. The most efficient method of encoding data is to serialize it into a binary representation. The Java language has an in-built facility to transform objects to their binary encoded representation; this is the mechanism that mpiJava uses to encode its objects before sending them to a socket. The problem is that we cannot translate a binary file directly to string format, as there are not enough characters available. There are four solutions available to this problem: binary-to-character encoding [11], packaging, binary XML encoding [12], and linking [11]. Packaging, such as SOAP with Attachments (SwA) [13] or Message Transmission Optimization Mechanism (MTOM) [14], allows data to be transmitted externally to the SOAP envelope. A comparison of transmission speeds using SOAP with Attachments and true XML formatting is given in [15]. MTOM also stores the data within the object model. MTOM has been chosen as the transmission protocol for these messages as it is SOAP based, yet it increases the speed of data transmission by allowing attachments while keeping the data accessible in the object model. MTOM does not have the
coding overheads of either the binary-to-character or the binary XML encoding, and it stays within the SOAP communication protocols, unlike linking.
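To make the encoding discussion concrete, the following is a minimal, self-contained Java sketch of the built-in binary serialization referred to above (the mechanism mpiJava relies on before writing objects to a socket). The class and method names are ours, not taken from mpiJava or MPWS; the point is only that the result is an opaque byte array that cannot be embedded directly in XML as text, which is why packaging mechanisms such as MTOM are attractive.

    import java.io.*;

    // Round-trips an object through Java's built-in binary serialization.
    // Illustrative helper; not part of mpiJava or MPWS.
    public final class BinaryEncoding {

        static byte[] toBytes(Serializable payload) throws IOException {
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
                oos.writeObject(payload);        // built-in binary encoding
            }
            return bos.toByteArray();            // opaque bytes, not valid XML text
        }

        static Object fromBytes(byte[] data) throws IOException, ClassNotFoundException {
            try (ObjectInputStream ois =
                     new ObjectInputStream(new ByteArrayInputStream(data))) {
                return ois.readObject();
            }
        }

        public static void main(String[] args) throws Exception {
            double[] block = {1.0, 2.0, 3.0};
            byte[] wire = toBytes(block);
            System.out.println("encoded " + wire.length + " bytes");
            double[] back = (double[]) fromBytes(wire);
            System.out.println("first element after round trip: " + back[0]);
        }
    }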
3 The Design of a Message-Passing Web Service
The challenge is to design a tool which combines a tightly coupled programming concept like MPI with the distributed, loosely coupled architecture of SOAP web services; to do this we need to adhere to WS and SOAP messaging standards whilst providing an efficient form of communication between services. MPWS is designed to address three areas: the creation of a set of services, the initialisation of those services so they are aware of each other, and the communication between the services. The creation of a set of services is achieved by the workflow manager; its role is to accept jobs, normally specified using an XML-based workflow language such as MPFL, then find a collection of suitable services for those jobs and invoke them all within a unique communication domain. A communication domain is a collection of service instances which are involved in the same composite application and can communicate directly with each other; this means that each service instance must be aware of all other service instances in the domain. Based on the job definition, the workflow manager will discover and select a group of suitable Message Passing (MP) web services using standard WS techniques, then generate a communication domain ID for the workflow application. The workflow manager can then specify the rank number and invoke a run method for each MP service involved. The initialisation of the service is performed in the invocation of the run method; the input data for the application, as well as the binding information for the services to work together, is passed to each individual service involved in the same workflow application. The binding information includes the communication domain ID, the rank number for the service, and a list of service endpoint references, each associated with a particular rank ID. Knowing the rank number as well as the service endpoint references allows the service to perform point-to-point message passing with all other services in the same communication domain. An MP web service can participate in multiple applications concurrently, so in order to solve the problem of identifying which service invocation is to be addressed, there is one communication domain established for each application instance; this is associated with a unique identifying number, the communication domain ID. Each MP web service instance belongs to a communication domain, and each service instance has an associated resource; this resource is identified by the communication domain ID, is initiated for the particular communication domain, and stores the binding information and messages sent to that service instance. WS-Resources are defined in the WSRF specifications [16]; they allow for the concept of state within web services. A resource is uniquely identifiable and accessible via the web service [17]. The use of resources provides message buffers for an MP web service. Instead of sending
and receiving the messages synchronously, the message is sent to the resource associated with the receiving web service instance; the receiving web service can then retrieve a particular message from the corresponding resource. A message is associated with a communication domain ID and a message tag; this ensures that the message can be identified within a communication domain. MPWS has been designed to conform to WS standards and to SOAP messaging standards, to allow the use of loosely coupled services in a traditionally tightly coupled MPI coding style. To this end we have designed MPWS to support multi-layer interfaces: the upper layer as a WS layer, and the lower layer as a message-passing (MP) layer. With the web service layer, an MP web service supports WSDL standards, providing loosely coupled services which can be easily published, discovered and reused. There are two main methods exposed via the web services interface:
– Run method: this mainly consists of a sequence of instructions so that it performs one or more particular tasks. Since an MP web service normally involves cooperation with other MP web services for a particular application, setting up communication domains is the first task when the run method is invoked.
– Store method: this receives messages sent from other MP web services and stores them to the resource associated with the MP web service instance.
With the message-passing layer, an MP web service is able to conduct message-passing communication with other MP web services by supporting message-passing interfaces, including send, receive, broadcast, and sendReceive. The message-passing interfaces are not exposed via WSDL, but are low-level interfaces that can only be invoked via the WSDL-level methods. For example, inside a run method body, there may be instructions such as sending data to a particular MP web service or receiving data from a particular MP web service, and these can be carried out by directly invoking the methods provided within the message-passing programming package; MTOM is used as the transmission protocol in this layer. Fig. 2(a) gives an example which shows a send operation scenario between two MP web services, A and B. A communication domain was initiated with the communication domain ID equal to 3303. Service A sends a message to service B within the communication domain. The send method from the MP service is called to send the message to service B. This is done by invoking the store method provided by service B. When the store method is called, it stores the message it received into the resource associated with the domain ID 3303. Although service B has received the message and stored it within one of its associated resources, the message cannot be used unless a receive method is called. The receive method retrieves this message from the resource (ID = 3303) associated with the service instance; the tag name associated with the message is used to identify the particular message within the communication domain (Fig. 2(b)).
Fig. 2. An example of sending a message from Service A to Service B
The use of the resource to provide a buffering service for message passing encourages the adoption of the asynchronous fire-and-forget style [18] of message sending, which is supported in AXIS 2.1.1. The fire-and-forget send method returns immediately after the existence of the receiving host is confirmed, providing increased performance over the sendReceive or sendRobust styles.
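As a rough illustration of the binding information described in this section, the following hypothetical Java class sketches what an MP web service instance might hold after its run method is invoked: the communication domain ID, its own rank, and an endpoint reference for every rank in the domain. All names here are ours, not MPWS classes, and the actual web service call behind a send (invoking the receiver's store method, e.g. over MTOM) is deliberately stubbed out.

    import java.util.HashMap;
    import java.util.Map;

    // Hypothetical sketch of per-instance binding information in an MP web service.
    public final class CommunicationDomainBinding {
        private final String domainId;                      // e.g. "3303"
        private final int myRank;
        private final Map<Integer, String> endpointByRank;  // rank -> service endpoint

        public CommunicationDomainBinding(String domainId, int myRank,
                                          Map<Integer, String> endpointByRank) {
            this.domainId = domainId;
            this.myRank = myRank;
            this.endpointByRank = new HashMap<Integer, String>(endpointByRank);
        }

        // A send in this design is an invocation of the receiver's store method,
        // carrying the domain ID and a message tag so the receiver's WS-Resource
        // can buffer the message until a matching receive(tag) retrieves it.
        public void send(int destRank, String tag, byte[] payload) {
            String endpoint = endpointByRank.get(destRank);
            // The SOAP/MTOM call to the store operation at 'endpoint' is omitted
            // here, as it is specific to the web service toolkit being used.
            System.out.printf("store(domain=%s, tag=%s, %d bytes) -> %s%n",
                    domainId, tag, payload.length, endpoint);
        }

        public static void main(String[] args) {
            Map<Integer, String> eprs = new HashMap<Integer, String>();
            eprs.put(0, "http://hostA/services/MPService");
            eprs.put(1, "http://hostB/services/MPService");
            CommunicationDomainBinding domain =
                new CommunicationDomainBinding("3303", 0, eprs);
            domain.send(1, "columnBlock", new byte[16]);
        }
    }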
4 The Evaluation

4.1 Testing
Many benchmark suites have been devised and put forward as the definitive parallel computing benchmark tests ([19], [20]); many of these are designed to test the underlying hardware or the collective communications features of the message passing tools. The purpose of the tests performed on MPWS and mpiJava is to find the speed of the communication implementations, not the capabilities of the network. The ping pong test is used in most of the benchmark suites as a simple bandwidth and latency test. Getov et al. [21] used a number of variations of the ping pong test to compare the performance of MPI and Java-MPI, and Foster and Karonis [22] use the ping pong test to evaluate MPICH-G, a grid-enabled MPI. We decided to use two variations of the ping pong test. The first, PingPong, transfers data from one process to another and then back again. In this test, there is an even number of processors within the communication domain, paired up to concurrently pass data to and from each other, see Fig. 3(a). In this figure the messages are represented by the solid arrows; the time taken for the message to be sent from one service to a second service and then back again is measured as the round trip time. The second test is the Ping*Pong test [21], which involves sending multiple messages from one service to a second service before the second service returns a message, as seen in Fig. 3(b). This test differentiates between the intra-message pipeline effect, where the message is broken into smaller parts by the system and processed through a pipeline to speed up the communication, and the inter-message pipeline effect, where the system does not have to wait for one message to complete its transfer before starting to process the next message [21]. The ping*pong test may show a more realistic view of the system's performance, as it emulates many real applications of message passing (such as a matrix multiplication).
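Below is a minimal sketch of the PingPong measurement as we read it, written against the mpiJava 1.2 API used in the evaluation (MPI.Init, COMM_WORLD Send/Recv, MPI.Wtime). The message size, the pairing of ranks, and the single-repetition timing are our simplifications, not the authors' actual harness.

    import mpi.MPI;

    // PingPong sketch: even rank r exchanges a byte buffer with rank r+1 and
    // reports the round-trip time. Assumes an even number of ranks.
    public class PingPong {
        public static void main(String[] args) throws Exception {
            args = MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int size = MPI.COMM_WORLD.Size();
            byte[] buf = new byte[200 * 1024];          // message size: our choice
            int partner = (rank % 2 == 0) ? rank + 1 : rank - 1;

            if (partner >= 0 && partner < size) {
                double t0 = MPI.Wtime();
                if (rank % 2 == 0) {                    // "ping" side
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, partner, 0);
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, partner, 0);
                } else {                                // "pong" side
                    MPI.COMM_WORLD.Recv(buf, 0, buf.length, MPI.BYTE, partner, 0);
                    MPI.COMM_WORLD.Send(buf, 0, buf.length, MPI.BYTE, partner, 0);
                }
                if (rank % 2 == 0) {
                    System.out.println("rank " + rank + ": round trip "
                            + (MPI.Wtime() - t0) + " s");
                }
            }
            MPI.Finalize();
        }
    }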
Fig. 3. Communication Diagram for PingPong, Ping*Pong and matrix multiplication tests
As a further, more realistic test, a one-dimensionally blocked parallel matrix multiplication application is used. This application is based on a simple parallelisation of the matrix multiplication problem. The communications for the matrix multiplication application are shown in Fig. 3(c); each arrow represents a portion of the matrix being sent from rank(i) to another processor. It is important to note that while the order of the sends for each rank is fixed, a rank can start sending its data as soon as it has received data from the preceding rank. For the matrix multiplication application, the actual multiplication calculations are extremely time consuming and dilute the performance of the communications with variances in processor utilisation at the time of testing. We have therefore omitted the calculation part of the application and present only the communication part.
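For reference, here is one possible reading of the block exchange in Fig. 3(c), again against the mpiJava API: every rank ends up holding every other rank's block, realized below with a simple deadlock-free Sendrecv schedule. The fixed per-rank send ordering and the pipelining between consecutive sends that the text emphasizes are not reproduced here; the block size is also assumed.

    import mpi.MPI;

    // All-to-all exchange of matrix blocks (communication part only, as in the
    // paper's test). Schedule and block size are our assumptions.
    public class BlockExchange {
        public static void main(String[] args) throws Exception {
            args = MPI.Init(args);
            int rank = MPI.COMM_WORLD.Rank();
            int p = MPI.COMM_WORLD.Size();
            int blockLen = 500 * 500;                   // elements per block (assumed)
            double[] myBlock = new double[blockLen];
            double[][] received = new double[p][];
            received[rank] = myBlock;

            double t0 = MPI.Wtime();
            for (int step = 1; step < p; step++) {
                int dest = (rank + step) % p;           // my block goes here this step
                int src = (rank - step + p) % p;        // whose block arrives this step
                received[src] = new double[blockLen];
                MPI.COMM_WORLD.Sendrecv(myBlock, 0, blockLen, MPI.DOUBLE, dest, step,
                        received[src], 0, blockLen, MPI.DOUBLE, src, step);
            }
            if (rank == 0) {
                System.out.println("communication time: " + (MPI.Wtime() - t0) + " s");
            }
            MPI.Finalize();
        }
    }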
4.2 Evaluation Results and Discussion
Versions of each test have been written and evaluated both as a web service, running on Tomcat 5.5.20 using AXIS 2.1.2, and in Java using the mpiJava API (V1.2 wrapping MPICH 1.2.6); all code was written in Java 1.6.0. The MPWS evaluation tests were undertaken on a public network of university machines, all of which are prone to unforeseen activity. The tests were done during low-usage hours to reduce inconsistencies, and all graphs show minimum timings to reduce the impact of the network on the results; the error bars show maximum timings over the set of tests. The Linux machines used for the testing have twin Intel Pentium 4, 2.8 GHz processors; in order to eliminate the discrepancies between the different handling of threads in the MPWS and mpiJava systems, both systems were restrained to using only one processor on each machine. The graphs in Fig. 4 and Fig. 5 show the timings of MPWS and mpiJava running the ping pong tests. The results show the expected communications overhead of the SOAP message, which degrades the performance for smaller messages, but they also show that over a message data size threshold of approximately 200 Kbytes (or n = 160) the extra communication overhead has been absorbed by the total MPWS communication time, making the MPWS and MPI systems run at a relatively similar speed. The graph in Fig. 5 concentrates on the timings for smaller message sizes, allowing the reader to easily compare the two systems.
Fig. 4. Times of Ping Pong test MPWS and mpiJava
Fig. 5. Times of Ping Pong test MPWS and mpiJava; small message sizes
The ping pong test shows that for large message sizes the MP web services are an acceptable alternative to mpiJava, but below data sizes of around 125 Kbytes the system's overheads are very noticeable. This is not really unexpected, as there are the overheads of the SOAP headers and the HTTP protocol to consider. The results for the ping*pong test are shown in Fig. 6; it can be seen that the threshold (n = 130) at which MPWS absorbs the overhead of the SOAP messages is slightly lower than with the PingPong test. More significant is the tendency for MPWS to outperform the version using mpiJava's standard send; we put this down to the inter-message pipeline effect and the buffer handling of the two different systems. The parallel matrix multiplication communication results are shown in Fig. 7; they consistently show that MPWS performs the communications faster than mpiJava at matrix sizes above the overhead threshold. We again put the results of the matrix test down to the application of the system buffers in the MPWS and mpiJava implementations, and the inter-message pipeline effect. In the ping*pong test, both the inter-message pipelining of the send and the receive were being tested, but in the matrix multiplication test, each of the consecutive sends from every processor is received by a different processor.
Fig. 6. Times for the Ping*Pong test MPWS and mpiJava
Fig. 7. Times for the Matrix Multiplication test MPWS and mpiJava
In MPWS, the main message buffering occurs in the receiving processor. This distributes the message buffering process at the time of high utilisation.
5 Conclusion and Further Work
From the tests we have discovered that, despite using MTOM, the overhead of SOAP messaging is still a problem which affects the performance of MPWS when message sizes are small. However, when the message sizes reach a threshold, the MPWS and mpiJava systems run at a relatively similar speed. We also found that the inter-message pipeline effect is a noticeable feature in MPWS applications that use consecutive sends; it is even more so in those applications whose consecutive sends are received by a distributed selection of processors. From the above observations, we conclude that MPWS is an effective tool for coarse-grained parallel applications, such as a parallel matrix multiplication, implemented in a service-oriented environment. The next steps will be to consider the design of other send styles, such as ssend (synchronous send), and to evaluate MPI-style collective communication functionality such as broadcast, gather and scatter, and all-reduce.
References 1. Akram, A., Meredith, D., Allan, R.: Evaluation of bpel to scientific workflows. In: CCGRID 2006: Proceedings of the Sixth IEEE International Symposium on Cluster Computing and the Grid (CCGRID 2006), pp. 269–274. IEEE Computer Society, Washington (2006) 2. Krishnan, S., Wagstrom, P., von Laszewski, G.: Gsfl: A workflow framework for grid services (2002) Preprint ANL/MCS-P980-0802 3. Huang, Y., Huang, Q.: Ws-based workflow description language for message passing. In: 5th IEEE International Symposium on Cluster Computing and Grid Computing, Cardiff, Wales, U.K. (2005) 4. Carpenter, B., Fox, G., Ko, S., Lim, S.: mpiJava 1.2: API Specification (October 1999), http://www.npac.syr.edu/projects/pcrc/mpiJava/mpiJava.html 5. Baker, M., Carpenter, B., Shafi, A.: An Approach to Buffer Management in Java HPC Messaging. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 953–960. Springer, Heidelberg (2006) 6. Lee, H.K., Carpenter, B., Fox, G., Lim, S.B.: Benchmarking hpjava: Prospects for performance. In: 6th Workshop on Languages, Compilers and Run-time Systems for Scalable Computers (March 2002) 7. Gropp, W.: Tutorial on MPI: The Message-Passing Interface 8. Kut, A., Birant, D.: An approach for parallel execution of web services. In: Proceedings - IEEE International Conference on Web Services, pp. 812–813. IEEE Computer Society, Los Alamitos (2004) 9. Ruth, M., Lin, F., Tu, S.: Adapting single-request/multiple-response messaging to web services. In: Computer Software and Applications Conference, 29th Annual International, vol. 2, pp. 287–292 (2005) 10. Puppin, D., Tonellotto, N., Laforenza, D.: How to run scientific applications over web services. In: International Conference Workshops on Parallel Processing. ICPP 2005 Workshops, pp. 29–33 (2005)
11. Harrington, B., Brazile, R., Swigger, K.: Ssrle: Substitution and segment-run length encoding for binary data in xml. In: 2006 IEEE International Conference on Information Reuse and Integration, September 2006, pp. 11–16 (2006) 12. Bayardo, R.J., Gruhl, D., Josifovski, V., Myllymaki, J.: An evaluation of binary XML encoding optimizations for fast stream based XML processing. In: WWW 2004: Proceedings of the 13th international conference on World Wide Web, pp. 345–354. ACM Press, New York (2004) 13. Barton, J.J., Thatte, S., Nielsen, H.F.: Soap messages with attachments. W3c note, W3C (December 2000) 14. The Apache Software Foundation: MTOM Guide -Sending Binary Data with SOAP. 1.0 edn. (May 2005), http://ws.apache.org/axis2/1 0/mtom-guide.html 15. Ying, Y., Huang, Y., Walker, D.W.: Using soap with attachments for e-science. In: Proceedings of the UK e-Science All Hands Meeting 2004 (August 2004) 16. Czajkowski, K., Ferguson, D.F., Foster, I., Frey, J., Graham, S., Sedukhin, I., Snelling, D., Tuecke, S., Vambenepe, W.: The ws-resource framework version 1.0. Technical report, Globus Alliance and IBM (2004) 17. Graham, S., Karmarkar, A., Mischkinsky, J., Robinson, I., Sedukhin, I.: Web Services Resource 1.2 (WS-Resource) Public Review Draft 01. OASIS, June 10 (2005) 18. Jayasinghe, D.: Invoking web services using apache axis2 (December 2006), http://today.java.net/pub/a/today/2006/12/13/ invoking-web-services-using-apache-axis2.html (Accessed August 2007) 19. Luszczek, P., Dongarra, J., Koester, D., Rabenseifner, R., Lucas, B., Kepner, J., McCalpin, J., Bailey, D., Takahashi, D.: Introduction to the hpc challenge benchmark suite. Technical report, icl.cs.utk.edu (March 2005) 20. Intel: Intel mpi benchmarks. Technical report, Intel (June 2006) 21. Getov, V., Gray, P., Sunderam, V.: Mpi and java-mpi: contrasts and comparisons of low-level communication performance. In: Supercomputing 1999: Proceedings of the 1999 ACM/IEEE conference on Supercomputing (CDROM), p. 21. ACM Press, New York (1999) 22. Foster, I., Karonis, N.: A grid-enabled mpi: Message passing in heterogeneous distributed computing systems. In: IEEE/ACM Conference on Supercomputing, 1998. SC 1998, pp. 46–46. IEEE Computer Society, Los Alamitos (1998)
Automatic Data Reuse in Grid Workflow Composition
Ondrej Habala, Branislav Simo, Emil Gatial, and Ladislav Hluchy
Institute of Informatics, Slovak Academy of Sciences, Dubravska 9, 84507 Bratislava, Slovakia
{Ondrej.Habala,Branislav.Simo,Emil.Gatial,Ladislav.Hluchy}@savba.sk
Abstract. Many papers, research projects, and software products have tackled the problem of automatic composition of a workflow of computer processes which computes certain data or performs a specific task. In recent years this has also gained popularity in grid computing, especially in connection with semantic description of resources usable in the workflow. However, most of the works dealing with semantically-aided workflow composition propose solutions only for workflows of processes, without the data necessary to execute them. We describe the design of a system which will be able to find not only the processes, but also the content for their execution, based solely on the list of available resources and a description of the required target of the workflow. The solution is based on our previous work in the project K-Wf Grid, utilizes semantic description of resources by means of ontologies, and operates on a SOA-based grid composed of web services. It is being developed in the context of a project called SEMCO-WS.¹
Keywords: Semantic grid, SOA, web services, automated workflow composition.
¹ This work is supported by projects SEMCO-WS APVV-0391-06, int.eu.grid EU 6FP RI-031857, VEGA 2/7098/27.
1 Introduction

Many papers, projects, and software solutions [1-3] have tackled the problem of automatic composition of a workflow of computer processes. This type of automation is very attractive especially in software engineering applied to scientific research, where complicated simulations and parameter studies often require tens of single steps in order to obtain the solution required by the scientist. Since the inception of grid computing, workflow composition of grid jobs into complex workflows has also gained prominence with its apparent usefulness, long history of previous works not applied specifically to the grid, and robust mathematical theory based mainly on directed acyclic graphs. In recent years, advances in the semantic web have been applied also in grid computing – creating the semantic grid [4] – and specifically in the area of semantically-aided composition of workflows of grid tasks. However, most of the many works on this topic have concerned themselves only with the composition of a
workflow of computer processes – represented by grid jobs, calls to web service interfaces, or other custom tasks – solving the "how", and have omitted the "what" of this problem, namely the data on which these processes operate. This has been left to the user. While the sought-after result is a system in which the user enters a description of the data he/she requires and the system composes a workflow able to compute it, most of the existing solutions create only a workflow able to solve a class of problems, and the selection of the one unique member of this class, via entering the correct data, is left to the user.

We have designed, and begun to implement, a system which solves also the "what" of automated workflow composition. The proposed system is based on previous work done in the context of the project K-Wf Grid [8], and extends it with tools which are able to determine exactly which data is necessary for which process in the composed workflow in order to get – at the end – the data which the user has described as his/her target. The system is based on semantic description of data and grid services by ontologies. The workflows are modeled as Petri nets, this being a legacy of K-Wf Grid which offers very good means to model data (as Petri net tokens). The system interacts with the user only to the extent absolutely necessary to acquire data or services which are required for the solution but currently not available in the grid.

The rest of this paper first presents the project K-Wf Grid and its results, and establishes a frame of reference for our own work. Then we briefly present the project SEMCO-WS [9], and the main part of the paper describes the proposed solution of automatically composing workflows with not only processes, but also data.
2 Results of K-Wf Grid

The project Knowledge Based Workflow System for Grid Applications – K-Wf Grid – started in September 2004 and ended in February 2007. It was very successful in attaining its goals, and the final review in March 2007, as well as a public showcase at the Cracow Grid Forum '06, were a success. The consortium was composed of six partners, and the work was very well focused on one goal – automated composition of workflows of grid services using semantic support, accessible through a comfortable web-based graphical interface. For the purpose of this paper, we will discuss only the parts of the K-Wf Grid middleware connected with application workflow composition. The middleware has been tested on three pilot applications. Each application went through several stages, beginning with integration and ending with a successful workflow execution. A K-Wf Grid application is a set of web or grid (WSRF) services. Any application has first to be integrated into the system's knowledge base [10]. The process of integration is mostly automatic [11]; the application expert has only to annotate the WSDL documents of the application's services with markup denoting the input and output structures used in service calls. Following the integration, the application may be used in the system. The user enters a textual description of his/her problem, and a tool [12] connected to the knowledge base finds any available targets – service results – relevant to this problem description. The user selects one or more of the found targets, and thus establishes a context for a workflow.
The workflow is formed and executed in several stages. We have to remember that it is always modeled as a Petri net, so the appropriate terms are activities (for processes), tokens (for data), and data places – for inputs and outputs of the activities. It begins with an abstract description of a problem, having only one activity, and one output (the target of the workflow). This simple workflow is then expanded [13] using descriptions of available services, stored in the knowledge base. The result is a workflow of service classes – descriptions of service interfaces, without actual grounding (concrete service providers). Then in another step [14] the available service providers for each service are found, and the workflow is now composed of a set of activities, each of which presents several possible choices of a concrete web/grid service for its execution (firing, in terms of Petri nets). The final pre-execution step is scheduling, during which the scheduler [5] selects for each activity one service, assumed to be the best one (considering a metric for evaluating different properties of services) for the workflow. After this workflow construction, the workflow (assuming that the system was able to find all necessary service classes and grounded services for these classes) may be executed. During execution, one or more input data structures may be necessary – the user will be asked to provide data description. Although this step is also comfortable, and the user may use custom web forms developed by the application developer, it is still necessary for the user to know the application and be able to judge which data will be necessary to produce the target he/she wishes to obtain. Following the workflow execution, the data may be downloaded from grid storage; or, if the application contains also visualization tools and custom services able to cooperate with grid middleware (so-called job packagers), it is transformed into easily readable form and made accessible through the K-Wf Grid portal.
3 Semantic Workflow Composition in SEMCO-WS

The project SEMCO-WS is a small national project with a consortium of four members, all from Slovakia. It started in February 2007 and is scheduled to end in January 2010. The project is trying to expand and refine several components of K-Wf Grid (and to add other features not present in K-Wf Grid). While the whole process of workflow construction and execution in K-Wf Grid was observable through a graphical workflow visualization tool, the user could edit only data tokens, not the workflow structure. SEMCO-WS will include an improved version of the visualization tool, supporting also complete workflow editing. The process of knowledge base management will also be supported by comfortable graphical tools, and the knowledge base itself will be decentralized. Most importantly for this paper, the process of data selection, left to the user in K-Wf Grid, will be fully automated. The user will be asked only to provide data which is not available and described in the knowledge base. The design of this part of the SEMCO-WS middleware is explained in the following chapter.
4 Adding Automatic Data Selection to Workflow Composition

To be able to propose not only the activities of a workflow, but also the content of the initial tokens in it, based only on the content of the final output token (which represents the target data of the workflow), the system has to be able to
1. know the content of any data present in the system,
2. infer the necessary input token parameters, based on the parameters of the output token to be generated by any activity,
3. decompose output structures of web and grid services into separate tokens, and
4. compose input structures for web and grid services from existing tokens.
The solutions of these partial problems are described below.

4.1 Content of Data – Metadata

Any data piece available to the workflow composition system is represented by a token. If it is a file, it can be its content (for smaller files), its URL, LFN, or any other identifier. It can be a URL of a database and an identifier of an item in the database. The actual content of a token is application dependent, and the system does not need to be able to read it. Any token has first to be either entered by a user or generated by an application activity, and it is processed only by another application activity or a user, so the application dependence is fully hidden in the application domain. The workflow system only needs to know the metadata of the token, to be able to evaluate its content for possible use in a workflow. The metadata, represented by OWL and OWL-S constructs, is composed of a generic, application-independent part and an application-dependent part. This layering of ontologies was also used in K-Wf Grid, where the application ontologies used a common base ontology layer with basic facilities for describing services, files, resources, computers, clusters, etc. The OWL standard is used mainly because it has already been incorporated into the K-Wf Grid middleware, upon which our system is based. Alternatively, another suitable ontology representation language could be used, or even the WSMO language [15], developed specifically for modeling web services. The requirement that the system has to be able to use data computed in the past for current workflows also implies that all tokens created by any application have to be stored in a database. Since in K-Wf Grid and in SEMCO-WS the workflows (including tokens) are described in an XML dialect called the Grid Workflow Description Language [7], a simple XML database is sufficient for token storage and later lookup. When the system identifies a data piece based on its metadata, the ontology will also contain the identifier of the token representing the data, and the token can be retrieved from the XML database using this identifier. Each newly created token has to be described by its metadata. This can be done in two ways – either the metadata is generated by another application component (a simple web service or other module), or it may be computed using a set of mathematical and logical formulas. As we will discuss below, it is necessary to be able to infer the properties of input data from required output parameters, so it may be also
possible – at least for a subclass of activities – to describe the inverse transformation and infer properties of output tokens from known properties of input tokens.

4.2 Backtracking from Required Output Token to the Necessary Input Tokens

The process of constructing a workflow in K-Wf Grid has been sufficiently described before [13], [14], [6]. This process did not provide for reusing existing data, and always assumed that the whole workflow chain has to be computed anew, even if some partial results from a previous workflow could be reused and could replace parts of the newly created workflow. The workflow construction process used backtracking, from the final activity to the initial activities of the workflow. In SEMCO-WS, the process will also include backtracking from the final token to the initial tokens of the workflow. We can abstract tokens and activities as data providers, with the difference that tokens provide data and require no inputs, while activities also provide data but require input data – other tokens. So the process of workflow construction can be described using this algorithm:

    Program construct_workflow (token_metadata_list)
      // The input is a list of metadata descriptions of all
      // tokens we wish to produce with the constructed workflow
      Variables:
        workflow        // list of components of the constructed workflow
        token
        activity
        token_metadata  // member of token_metadata_list
      1. Foreach token_metadata in token_metadata_list
         a. token ← find in token_db based on token_metadata
         b. If token ≠ nil
              workflow = workflow + token
            Else
              Find activity able to produce token
              workflow = workflow + activity
              token_metadata_list = token_metadata_list +
                  token_metadata of all input tokens of activity
      2. If token_metadata_list is not empty, goto 1
      3. Output workflow

So we see that we first try to find the data we need, and if it cannot be found (it was not yet computed or entered into the system), we find an activity which can compute such data. Of course, then we have to find the correct input data for this activity too, and the process repeats itself. The K-Wf Grid incarnation of this algorithm looked only for activities, and essentially omitted step 1a of the algorithm. In the 1b-else clause, we are looking for an activity able to produce a token with certain parameters. To be able to do this, we also need descriptions of the capabilities
of activities, i.e. what are the possible parameters of tokens the activity is able to produce. If this description also includes the parameters of the token to be produced, the activity may be used to produce it.

4.3 From Tokens and Activities to Data Structures and Services

Our Petri net model of a workflow operates with activities, places, and tokens. For purposes of management of SOA applications, we need to be able to transform these concepts to web service calls and input/output structures of web service interfaces (data structures). While the transition from activities to web service calls is straightforward, tokens and data structures do not map directly. Any activity may require more than one piece of data – so on its input are several tokens, coming from several input places, but the underlying service can consume only one input structure. Also, the output data structure of the service may contain several data fragments – values, references to files, etc. – and we want to store them as separate tokens, since they may later be used separately as inputs to other activities. For the purpose of constructing input data structures for services, and decomposing their output structures into separate tokens, we have extended the annotation of the WSDL documents of an application's services with additional elements. These elements (contained inside the definition of data structures) contain XSLT code, which may be used to automatically compose the structure from several XML fragments identical with tokens, and also to automatically split the output structure into the fragments which then become the output tokens of the activity. The composition and decomposition process is quite straightforward; an activity may have several input and output places, but the actual web service hidden behind the activity has only one input and one output, both represented by an XML structure. Upon activity firing, the input tokens (XML fragments) are concatenated together and then transformed into the format of the web service input structure using the XSLT provided in the WSDL document. Similarly, the output structure is then filtered by XSLT into distinct output tokens – one XSLT document for each output token.

4.4 The Workflow Composition Process

The whole process of automated workflow construction using also existing data has been decomposed into several steps (see Fig. 1):
1. Initial data has to be entered by the user; he/she is a domain expert, so it is easy (using custom web forms which are part of the application) for him/her to create the token, as well as describe it with metadata; both token and metadata are stored.
2. When the GWES looks for components of the workflow, it queries the knowledge base for any existing tokens which can provide the necessary data.
3. If such a token is found, it is extracted from the token database; otherwise an activity is used (not shown in Fig. 1, since it is not the focus of this paper; composition of services into a workflow has been sufficiently described in other works).
Fig. 1. The data location/creation process
4. During execution of the workflow, GWES has to compose an input structure for the actual application service from the input tokens of the service's activity. This is done automatically, using the XSLT transformations present in the WSDL document of the service.
5. The composed input structure is used in a call to the application service; the service replies with an output structure.
6. The obtained output structure is decomposed into single tokens, which represent the data items contained in the structure. This is also done automatically, using XSLT transformations inside the WSDL document (see Chapter 4.3 for details; an illustrative sketch of such a transformation follows at the end of this section).
7. The created tokens have to be annotated; this is another application-specific step of the process, but (as discussed above) we can avoid helper services by using mathematical and logical formulas which describe the transformation of the parameters of input tokens (which are known) into the parameters of output tokens. Alternatively, if such formulas cannot be used because of the complexity of the transformation, a helper service can be used. The formulas or the URL of the helper service can also be found inside the annotated WSDL document of the application service.
8. The created tokens (extracted from the output of the application service in step 6) are stored in the token database for later reuse.
9. The metadata of the new tokens (created in step 7) is stored in the system's knowledge base.

The cycle may begin another iteration from step 2, or – if the workflow has no more activities which can be fired – end. We have not discussed a situation which may arise when the composition of a workflow is halted by a requirement for data which cannot be found in the database nor produced by any known activity. In such a case, the user may be asked to provide the data. Alternatively, he/she may abort the workflow composition process, enter a new application service into the system, and restart the workflow composition. So the situation can be resolved by adding either the data or a service which can produce it.
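To make steps 4 and 6 concrete, here is a small, self-contained Java sketch of applying an XSLT stylesheet to a service's output structure in order to cut out the XML fragment that becomes one output token, in the spirit of the WSDL-embedded XSLT described in Chapter 4.3. The stylesheet, element names, and class are invented for illustration and are not taken from K-Wf Grid or SEMCO-WS.

    import javax.xml.transform.*;
    import javax.xml.transform.stream.*;
    import java.io.StringReader;
    import java.io.StringWriter;

    // Applies one XSLT stylesheet to an XML output structure, yielding the
    // fragment that would be stored as one token. Purely illustrative.
    public final class TokenExtractor {

        static String applyStylesheet(String xml, String xslt) throws TransformerException {
            Transformer t = TransformerFactory.newInstance()
                    .newTransformer(new StreamSource(new StringReader(xslt)));
            StringWriter out = new StringWriter();
            t.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        }

        public static void main(String[] args) throws Exception {
            String output = "<result><lfn>lfn:/grid/out1.dat</lfn><count>42</count></result>";
            // One stylesheet per output token: this one keeps only the <lfn> element.
            String xslt =
                "<xsl:stylesheet version='1.0'"
                + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
                + "<xsl:template match='/result'>"
                + "<token><xsl:copy-of select='lfn'/></token>"
                + "</xsl:template>"
                + "</xsl:stylesheet>";
            System.out.println(applyStylesheet(output, xslt));
        }
    }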
5 Conclusions

We have shown that fully automated construction of workflows of web and grid services can also reuse existing data, provided that the process is supported by a semantic annotation layer and the application services are annotated. Three key components of the annotation are formulas for the transformation of input metadata into output metadata, formulas for the inverse transformation, and descriptions of the capabilities of application services. The design presented in this paper is currently entering the implementation phase. The project SEMCO-WS is continuing the work of K-Wf Grid in some areas where the semantic support of grid workflow composition can be further improved, and the inclusion of automatic data reuse is one of them. The prototype of this system is to be ready by 2009, and when SEMCO-WS finishes in 2010, the system will be ready to be used and further improved by other researchers.
References 1. Bubak, M., Gubala, T., Kapalka, M., Malawski, M., Rycerz, K.: Grid Service Registry for Workflow Composition Framework. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3038, pp. 34–41. Springer, Heidelberg (2004) 2. VDS – The GriPhyN Virtual Data System (Accessed January 2008), http://www.ci.uchicago.edu/wiki/bin/view/VDS/VDSWeb/WebMain 3. Krishnan, S., Wagstrom, P., von Laszewski, G.: GSFL: A Workflow Framework for Grid Services. In: Preprint ANL/MCS-P980-0802, Argonne National Laboratory, 9700 S. Cass Avenue, Argonne, 1L 60439, U.S.A. (2002) 4. Semantic Grid Community Portal (Accessed January 2008), http://www.semanticgrid.org/ 5. Wieczorek, M., Prodan, R., Fahringer, T.: Comparison of Workflow Scheduling Strategies on the Grid. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 792–800. Springer, Heidelberg (2006) 6. Hoheisel, A., Der, U.: An XML-based Framework for Loosely Coupled Applications on Grid Environments. In: Sloot, P.M.A., Abramson, D., Bogdanov, A.V., Gorbachev, Y.E., Dongarra, J., Zomaya, A.Y. (eds.) ICCS 2003. LNCS, vol. 2657, pp. 245–254. Springer, Heidelberg (2003) 7. Alt, M., Gorlatch, S., Hoheisel, A., Pohl, H.W.: A Grid Workflow Language Using HighLevel Petri Nets. In: Wyrzykowski, R., Dongarra, J., Meyer, N., Waśniewski, J. (eds.) PPAM 2005. LNCS, vol. 3911, pp. 715–722. Springer, Heidelberg (2006) 8. Knowledge-based Workflow System for Grid Applications (K-Wf Grid). EU 6th FP Project, 2004-2007 (Accessed January 2008), http://www.kwfgrid.eu 9. Semantic composition of Web and Grid Services (SEMCO-WS). Slovak APVV project, 2007-2009 (Accessed January 2008), http://semco-ws.ui.sav.sk/ 10. Kryza, B., Pieczykolan, J., Babik, M., Majewska, M., Slota, R., Hluchy, L., Kitowski, J.: Managing Semantic Metadata in K-Wf Grid with Grid Organizational Memory. In: Bubak, M., Turala, M., Wiatr, K. (eds.) Proceedings of Cracow Grid Workshop – CGW 2005, Krakow, Poland, November 20-23 2005, pp. 66–73 (2005) 11. Habala, O., Babik, M., Hluchy, L., Laclavik, M., Balogh, Z.: Semantic Tools for Workflow Construction. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3993, pp. 980–987. Springer, Heidelberg (2006)
12. Laclavik, M., Seleng, M., Hluchy, L.: User Assistant Agent (EMBET): Towards Collaboration and Knowledge Sharing in Grid Workflow Applications. In: Cracow 2006 Grid Workshop: K-Wf Grid, pp. 122–130 (2007) ISBN 978-83-915141-8-4 13. Gubala, T., Herezlak, D., Bubak, M., Malawski, M.: Semantic Composition of Scientific Workflows Based on the Petri Nets Formalism. In: Proc. of e-Science 2006, Amsterdam (2006) ISBN-0-7695-2734-5 14. Dutka, L., Kitowski, J.: AAB – Automatic Service Selection Empowered by Knowledge. In: Bubak, M., Turała, M., Wiatr, K. (eds.) Proceedings of Cracow Grid Workshop – CGW 2005, ACC-Cyfronet USTs, ACC-Cyfronet UST, November 20-23 2005, p. 58 (2006) 15. Web Service Modeling Ontology (Accessed March 2008), http://www.wsmo.org/
Performance Analysis of GRID Middleware Using Process Mining∗ Anastas Misev1 and Emanouil Atanassov2 1
University Sts Cyril and Methodius, Faculty of Natural Sciences & Mathematics Institute of Informatics, Skopje, Macedonia 2 Bulgarian Academy of Sciences, Institute for Parallel Processing, Sofia, Bulgaria [email protected], [email protected]
Abstract. Performance analysis of the GRID middleware used in a production setting can give valuable information to both GRID users and developers. A new approach to this issue is to use the process mining techniques. Analyzing logs of the middleware activities, performed on the SEE-GRID pilot production Grid infrastructure, objective qualitative and quantitative information on what actually happens can be obtained. Using the appropriate tools like ProM to apply the process mining algorithms, many interesting findings and conclusions can be drawn. In this paper we describe our approach and show some of our conclusions. Keywords: Grid, middleware, performances, process mining.
1 Introduction
Performance analysis of the GRID middleware can give valuable information to both GRID users and developers. Users gain by better understanding the workflow that is followed during a job's lifecycle and the possible obstacles. Since the Grid middleware usually presents alternative ways to accomplish the same final result, the relevant performance information enables the users to optimize their choices and improve their throughput. Developers can benefit by locating the bottlenecks and other problematic points during the job lifecycle and trying to modify the middleware appropriately. They can also compare various implementations. The performance of the GRID middleware can be analyzed from various aspects. As seen in [1], [2], analysis can be performed on the MDS, OGSA-DAI, etc. All of this focuses mostly on the developers' view of the middleware. In this work, we analyze the performance using the logging and bookkeeping data obtained from the Logging and Bookkeeping (L&B) Service. In this way, we try to quantify the perception of reliability that the users get when they look at the final outcome (success/failure) of their jobs. ∗
This paper is based on the work done at the Institute for Parallel Processing at the Bulgarian Academy of Sciences, during the one month stay, supported by the FP6 project: Bulgarian IST Centre of Competence in 21 Century (BIS-21++), Contract no.: INCO-CT-2005-016639.
2 Description of the Logging and Bookkeeping Service and Database
The Logging and Bookkeeping (L&B) service [3] tracks jobs managed by the gLite WMS (workload management system) or the Resource Broker (RB). It gathers events from various WMS/RB components in a reliable way and processes them in order to give a higher-level view: the status of a job. Virtually all the important data are fed to L&B internally from various gLite middleware components, transparently from the user's point of view. Three main features of the system are event delivery, notifications, and security and access control. For a deeper understanding of them, refer to [3], [4]. All the data that the service receives is stored in a relational database. The diagram of the database is shown in Fig. 1.
Fig. 1. Database structure of the L&B database
The unique user id, along with the cert data from the X509 certificate, is stored in the Users table. Each job is assigned a unique identifier that is used as a foreign key in the related tables and is stored in the Jobs table. For each job, several events are created in the Events table. Each event has the job id, the sequence number, an event code and a time stamp. Two more tables relate to the Events table: Short_fields and Long_fields. Both of them store pairs of (name, value) data, related to the event by the job id and event sequence number. The Short_fields table contains shorter values, strings of up to 255 characters. For the longer values (the entire JDL for a particular job, the CE name, the name of the queue, the names of the files accompanying the job, etc.), which can be up to 16 million characters, the database uses the Long_fields table, with the same reference to the Events table (job id and event sequence number). When the job's lifetime ends, a record is added to the States table, referencing the job's final sequence number and a large string field containing a detailed description of the job lifetime.
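As an illustration of how these tables fit together, the following Python sketch reconstructs the ordered event trace of a single job. It is not taken from the gLite sources; the table and column names (events and short_fields, with columns jobid, seq, code, time_stamp, name, value) are assumptions derived from the description above, not the exact L&B schema.

import sqlite3

def job_trace(db_path, job_id):
    # Illustrative sketch: table and column names are assumed from the textual
    # description of the L&B schema, not copied from the actual database.
    con = sqlite3.connect(db_path)
    cur = con.cursor()
    cur.execute("SELECT seq, code, time_stamp FROM events "
                "WHERE jobid = ? ORDER BY seq", (job_id,))
    trace = []
    for seq, code, stamp in cur.fetchall():
        # Attach the short (name, value) pairs recorded for this event.
        cur.execute("SELECT name, value FROM short_fields "
                    "WHERE jobid = ? AND seq = ?", (job_id, seq))
        trace.append({"seq": seq, "code": code, "time": stamp,
                      "fields": dict(cur.fetchall())})
    con.close()
    return trace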
3 The Rationale for L&B-Based Performance Analysis
Various approaches have been proposed to tackle the performance analysis of distributed systems. The work done by Margalef et al. [5] proposes three different approaches: static, run-time and dynamic. In that context, L&B-based analysis is a static, post-mortem analysis. As such, it has many advantages, but also some disadvantages. The main advantage is that it does not introduce any overhead to the production system, since the analysis is done off-line. Analyzing trace files can require lots of time, but since time is not an issue in off-line analysis, a more comprehensive and in-depth analysis can be performed, helping even non-expert users to fine-tune their applications. The main disadvantage of this approach is that such analyses require a high level of detail in the log files, which in turn requires lots of resources (both processing and storage) for their manipulation. Also, since the analysis is static and post-mortem, it cannot cope with dynamic application behavior that occurs during each execution. Process mining has not yet been used as a tool for performance analysis of GRID middleware. Other applications of the technique have proven very useful [7], [11].
4 Short Overview of Process Mining
Process mining techniques allow for extracting information from event logs [6], [8], [9]. They target the automatic discovery of information from an event log. This information can be used to deploy new systems that support the execution of business processes, or as a feedback tool that helps in auditing, analyzing and improving already enacted business processes. The main benefit of process mining techniques is that information is objectively compiled. In other words, process mining techniques are helpful because they gather information about what is actually happening according to an event log of an organization, and not what people think is happening in that organization. The type of data in an event log determines which perspectives of process mining can be discovered and specifies the type of questions that can be answered using the mining process:
1. The control flow perspective can be mined if the logs contain tasks executed by a process. The key elements in this perspective are processes and cases (process instances). This represents the “How?” questions.
2. If the log provides information about the persons/systems that executed the tasks, the organizational perspective can be discovered, giving answers to the “Who?” questions.
3. When the log contains more details about the tasks, like the values of data fields that the execution of a task modifies, the case perspective (i.e. the perspective linking data to cases) can be discovered. This relates to the “What?” questions.
We have chosen the ProM framework [10], [11] for several reasons: it is open-source, Java-based, and it has a big variety of available plug-ins. It is extensible with new plug-ins, if required. The included plug-ins can be used either on logs only, called discovery plug-ins, or on logs and process models, called conformance and extension plug-ins. The discovery plug-ins can be used to discover the process elements from the log files only. They can then depict the process in various formats (Petri nets for example). The conformance plug-ins rely on both log data and a process model. They can be used to test the conformance of the data in the log to the proposed process model. The extension plug-ins also require both logs and a process model, but they discover information that enables enhancing the process model. The ProM framework uses its own format to store the log data and additional attributes. The format is called MXML and is based on XML. Along with ProM, there is an open-source tool called ProMImport [12] that enables conversion from various well-known log formats into MXML.
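To make the conversion step concrete, the Python sketch below writes a minimal MXML-style log from already extracted job traces. The element names (WorkflowLog, Process, ProcessInstance, AuditTrailEntry, WorkflowModelElement, EventType, Timestamp, Originator) follow the commonly documented MXML layout and should be read as an approximation of what ProMImport produces, not as a drop-in replacement for it.

from xml.sax.saxutils import escape, quoteattr

def write_mxml(jobs, path):
    """jobs: dict mapping a job id to a list of (status, originator, iso_timestamp)
    tuples, ordered by event sequence number."""
    with open(path, "w") as out:
        out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
        out.write('<WorkflowLog>\n<Process id="L-and-B">\n')
        for job_id, events in jobs.items():
            out.write('  <ProcessInstance id=%s>\n' % quoteattr(str(job_id)))
            for status, originator, stamp in events:
                out.write('    <AuditTrailEntry>\n')
                out.write('      <WorkflowModelElement>%s</WorkflowModelElement>\n'
                          % escape(status))
                out.write('      <EventType>complete</EventType>\n')
                out.write('      <Timestamp>%s</Timestamp>\n' % escape(stamp))
                out.write('      <Originator>%s</Originator>\n' % escape(originator))
                out.write('    </AuditTrailEntry>\n')
            out.write('  </ProcessInstance>\n')
        out.write('</Process>\n</WorkflowLog>\n')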
5 Application of Process Mining on the L&B Log Data
For the purpose of our analysis, we use the job identifiers as process instances (or cases) and events as audit trail entries. We also use the status field of the job (from the Events table) as the model element (or state a job can be in) and the combination of program and host name as the originator (the entity performing the process element). For future research, we will add more attributes to the analysis (CE name, detailed status of the job, queue and VO name, etc.). After we have imported the log data into the MXML format and loaded the log into the framework, we can proceed with log filtering. Log filtering enables us to select only the data that is relevant to the analysis. For example, we can define that only logs for jobs starting with the REGJOB event will be used. Also, we can select that we will analyze only complete instances, so we can define another filter that will include only jobs with a particular event as the last event (DONE, CLEAR, CANCEL…). It is possible to perform more advanced filtering using the advanced filtering tab. For example, we can retain only events reported by specific originators, which helps us reduce the data for some of the analyses. This is important in our case since some of the events are reported to the service by multiple originators. Also, we can use the remapping filter to remap sub-jobs to a parent job.
5.1 Log Summary
We have started the mining process of the L&B logs with the simple log summary plug-in. It gives an overview of the number of jobs (process instances) and events (audit trail entries). For each of the model elements, a frequency is calculated and shown. Also, the model elements that are first in the audit trails (Starting log events) and last (Ending log events) are shown with the frequency of their occurrence. Finally, each of the originators is shown, along with the frequency of its occurrences in the audit trails. From here we can get a basic notion about the data we are mining. For example, we can instantly see how many of the process instances finished successfully by looking
at the Ending log events. We can also see the workload performed by various originators (program-service and host name combinations).
5.2 Heuristic Miner
Another appropriate plug-in for analyzing data that is less structured or has instances that follow several different paths of execution is the Heuristic Miner. Using the tool, a heuristic network can be produced depicting the control flow in the given process model. A simplified example is shown in Fig. 2. The numbers in the boxes represent the number of occurrences of the specific event and the numbers on the links represent the frequency and the absolute number of occurrences of a specific transition.
Fig. 2. Heuristic network
Using this network, we can easily recognize the frequencies of various transitions in the job's lifespan. The network can also be converted to a Petri net, one of the most common formalisms used to represent workflows.
5.3 Petri Net Performance Analysis
Once we have a Petri net from the log, we can perform additional analysis. Especially useful is the Petri net performance analysis. As a result of this analysis, an interactive diagram is produced, helping in identifying the bottlenecks in the process model. Different color coding of the places (circles) of the Petri net marks the different time needed in each one of them, as shown in Fig. 5 (blue means low waiting time, yellow middle and purple high). Also, by selecting two transitions in the net, the tool shows the statistics (min, max and average time needed from one to the other).
5.4 Performance Sequence Diagram
The performance sequence diagram plug-in can be especially helpful if you want to know which behavior in your processes is common, which behaviors are rare and which behavior may result in extreme situations (e.g. instances with extremely high throughput times). An example of the output is given in Fig. 4. We have used this output to identify the most common sequences of events that occur during a job's lifetime, along
with their basic statistics. The diagram can be a full diagram, showing all the instances in time, or a pattern diagram, grouping similar sequences into patterns (by variable parameters). It also has a rich set of filtering options allowing us to select sets of process instances, or even individual ones. This plug-in, if used by end-users, can help them visually identify the problems with their job submissions.
Fig. 3. LTL plug-in
5.5 Conformance Analysis
The conformance analysis plug-in requires both log data and a process model (a Petri net for example). It replays the entire log and checks the conformance of each job with the model. It offers two perspectives: log and model. The log perspective illustrates each separate job and indicates the ones that do not conform to the model. The model perspective shows the Petri net and indicates the non-conformant points. It can also show the number of times an activity should occur (according to the model) but actually did not, and vice versa.
5.6 LTL Plug-In
The Linear Temporal Logic (LTL) plug-in checks the validity of LTL formulas on the analyzed log. It has a rich set of options and predefined formulas. As a result, it divides the set of process instances into those that satisfy the formula and those that do not. The example shown in Fig. 3 shows the conformance of the processes to the formula “eventually activity RUNNING then DONE then CLEAR”.
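The “eventually RUNNING then DONE then CLEAR” property used in the example can also be checked directly on a status trace. The following Python sketch is a plain re-implementation of that single formula for illustration, not a use of the ProM LTL engine; the example job names and traces are hypothetical.

def eventually_then(trace, *pattern):
    """True if the statuses in `pattern` occur in `trace` in that order
    (not necessarily consecutively), e.g. ("RUNNING", "DONE", "CLEAR")."""
    it = iter(trace)
    return all(any(status == wanted for status in it) for wanted in pattern)

# Partition process instances by conformance to the formula (hypothetical data).
jobs = {"job1": ["REGJOB", "RUNNING", "DONE", "CLEAR"],
        "job2": ["REGJOB", "RUNNING", "ABORT"]}
satisfying = {j for j, t in jobs.items()
              if eventually_then(t, "RUNNING", "DONE", "CLEAR")}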
6 Some Important Findings about Middleware Performance Derived from the Process Mining We have performed the analysis on different subsets of the L&B data. At the beginning, mostly for performance reasons, we have analyzed jobs from several users,
user by user. Subsequently, we have made a filtered data set from the whole database. We must note that some of the plug-ins require much more time when working with large datasets, especially the ones that perform log replay. Most of the results that follow are from the overall analysis.
6.1 Percentage of Successful Jobs
From the performed analysis, we can conclude that the underlying infrastructure (SEE-GRID [13]) performs satisfactorily. The overall percentage of successful jobs is around 70%. We identified several factors that influence the success rate of jobs. First of all, there is the human factor. Analysis of logs of jobs submitted by experienced users shows a greater percentage of success. If we analyze filtered logs from more experienced users, we can conclude that up to 80% end either with status DONE or with status CLEAR (retrieved output); a sketch of this computation is given after the list below. We have to perform a deeper analysis, which requires additional data attributes added to the analyzed logs (like status code, exit code, etc.), to better understand the percentage of finished jobs and, more importantly, to discover the reasons why the other jobs failed. Other factors include:
1. the “quality” of the Grid sites – usually larger sites in terms of number of CPUs have better support
2. software versions – the installation of a new middleware version or revision usually causes some hiccups
3. lack of resources or inappropriate job scheduling mechanisms – a large percentage of failures are caused by jobs waiting in the queue for too long. The so-called proxy renewal mechanism did not work reliably until newer versions of the middleware solved the problem.
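A minimal sketch of the success-rate computation referred to above, under the assumption that each job trace ends with its final L&B status and that DONE and CLEAR count as success:

from collections import Counter

def success_rate(jobs, ok_states=("DONE", "CLEAR")):
    """jobs: dict mapping a job id to its ordered list of statuses.
    Returns the fraction of jobs whose last recorded status counts as success."""
    final = Counter(trace[-1] for trace in jobs.values() if trace)
    ok = sum(final[s] for s in ok_states)
    total = sum(final.values())
    return ok / total if total else 0.0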
6.2 Patterns of Job Control Flow in the Middleware
Using the performance sequence diagram, we can obtain useful data about the patterns of events that the jobs follow during their lifetime. Analyzing an RB log consisting of
Fig. 4. Patterns of job control flow
around 18500 jobs, we have identified 85 different patterns of behavior. As shown in Fig. 4, most of the jobs that finish successfully follow the first and the third pattern. They do so in an average time of 27 hours and 11 hours respectively, with the former length being due to user intervention to pick up the results. This can be used as a good reference for the validity lengths of the proxy certificates. Another conclusion that we can draw from these results is the relatively large number of jobs following the fifth pattern (Pattern 4 in Fig. 4). Around 600 jobs have failed after waiting in the queues for an average of more than 280 hours. Using this data and examining the specific instances we could identify the reason for such failures.
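The grouping into behavioral patterns can be approximated by keying jobs on their exact status sequence. The Python sketch below is an illustration rather than the performance sequence diagram algorithm itself; it counts how often each pattern occurs and its average duration, assuming each event carries a Unix timestamp.

from collections import defaultdict

def behavior_patterns(jobs):
    """jobs: dict mapping a job id to a list of (status, unix_time) pairs.
    Groups jobs that follow the same sequence of statuses and reports, per
    pattern, the number of occurrences and the average duration in hours."""
    groups = defaultdict(list)
    for trace in jobs.values():
        if len(trace) < 2:
            continue
        pattern = tuple(status for status, _ in trace)
        duration = trace[-1][1] - trace[0][1]
        groups[pattern].append(duration)
    return {p: (len(d), sum(d) / len(d) / 3600.0) for p, d in groups.items()}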
Fig. 5. Performance analysis with bottlenecks (details)
6.3 Bottlenecks in the Job Lifetime
Using the performance sequence diagram, we have analyzed jobs from a single user (for performance's sake). Out of 470 jobs, almost 20% of them have been waiting in the queues for an average time between 57 and 65 hours. All of them finished with ABORT (mostly due to expired proxies). Using the Petri net performance analysis we could also locate the point with the biggest waiting time (shown in purple in Fig. 5). We can see that jobs spend most of their time waiting to start running. The other two bottlenecks are the running time (which greatly depends on the job itself) and the time before the output is retrieved (shown in yellow), which involves human interaction.
7 Future Works
The work presented in this paper is only the beginning of a deeper and wider performance analysis of the GRID middleware. Several issues that we will tackle soon include: building a custom import filter, based on the ProMImport framework, to import the data directly from the L&B database; extending the data that is imported into the framework with additional elements (CE name, matching process results, some JDL attributes, etc.) to enable even more analyses; enabling a direct connection to the L&B web service interface, so that users can select particular jobs from within the ProM framework and get even more data from the service directly; and proposing a more
intuitive interface (possibly web-based) to the L&B data, to enable users to get a better understanding. Waiting times in the queues can be quite long. Some solutions to these problems that we will investigate further are:
− providing separate queues for various types of jobs, which could strengthen the user's perception of the GRID,
− providing an end-to-end mechanism for job prioritization.
8 Conclusion
Using process mining to analyze GRID middleware is not a new idea, but very little has been done to actually analyze the platform. Using the L&B database as a source of logging data was a natural choice. After researching the appropriate tools, the ProM tool was chosen, mostly for the features mentioned above. The initial results of the mining process are presented in this paper. A very important conclusion from the analysis is that the underlying infrastructure performs satisfactorily. With an overall job success rate of around 70%, it is quite near the EGEE [14] average of 79% [15]. Since user experience affects the percentage of successful jobs, educating the users about the underlying technology will increase the overall performance. The more aware the users are of the possibilities of the infrastructure and of the ways to evaluate certain sites, the better the success rate will be. In this context, measuring the success rate of each site can help users choose only the set of sites that promise higher throughput.
References 1. Zhang, X., Schopf, J.M.: Performance Analysis of the Globus Toolkit Monitoring and Discovery Service. In: MDS2, Proceedings of the International Workshop on Middleware Performance (MP 2004), part of the 23rd International Performance Computing and Communications Conference (IPCCC) (2004) 2. Jackson, M., Antonioletti, M., Chue Hong, N., Hume, A., Krause, A., Sugden, T., Westhead, M.: Performance Analysis of the OGSA-DAI Software. In: Proceedings of the UK e-Science All Hands Meeting, Nottingham, UK (September 2004) 3. EGEE User’s Guide, Service Logging And Bookkeeping (L&B) (2007), https://edms.cern.ch/document/571273/1 4. Kouril, D., Krenek, A., Matyska, L., Mulac, M., Pospısil, J., Ruda, M., Salvet, Z., Sitera, J., Skrabal, J., Vocu, M.: Advances in the L&B Grid Job Monitoring Service (2007) (visited 06.08.2007), http://lindir.ics.muni.cz/dg_public/lb2.pdf 5. Margalef, T., Jorba, J., Morajko, O., Morajko, A., Luque, E.: Different Approaches to Automatic Performance Analysis of Distributed Applications. In: Getov, V., et al. (eds.) Performance Analysis and Grid Computing. Springer, Heidelberg (2004) 6. van der Aalst, W.M.P., Weijters, A.J.M.M. (eds.): Process Mining. Special Issue of Computers in Industry, vol. 53. Elsevier Science Publishers, Amsterdam (2004)
7. Rozinat, A., de Jong, I.S.M., Gunther, C.W., van der Aalst, W.M.P.: Process Mining of Test Processes: A Case Study, BETA Working Paper Series, WP 220, Eindhoven University of Technology, Eindhoven (2007) 8. Alves de Medeiros, A.K., Günther, C.W.: Process Mining: Using CPN Tools to Create Test Logs for Mining Algorithms. In: Sixth Workshop and Tutorial on Practical Use of Colored Petri Nets and the CPN Tools, Aarhus, Denmark (October 2005) 9. Process mining (2007), http://www.processmining.org/ 10. ProM tool (2007), http://is.tm.tue.nl/~cgunther/dev/prom/ 11. Alves de Medeiros, A.K., Weijters, A.J.M.M. (Ton): ProM tutorial, Technische Universiteit Eindhoven, The Netherlands (November 2006) 12. ProMimport, http://is.tm.tue.nl/~cgunther/dev/promimport/ 13. SEE-GRID – South Eastern Europe GRID-enabled eInfrastructure Development (2007), http://www.see-grid.eu/ 14. EGEE – Enabling Grids for E-sciencE (2007), http://www.eu-egee.org/ 15. Monitoring and visualization tool for LCG (statistics for February 2008), http://gridview.cern.ch/GRIDVIEW/job_index.php
Bi-criteria Pipeline Mappings for Parallel Image Processing Anne Benoit1 , Harald Kosch2 , Veronika Rehn-Sonigo1 , and Yves Robert1 1
LIP, ENS Lyon, 46 Allée d'Italie, 69364 Lyon Cedex 07, France {Anne.Benoit,Veronika.Sonigo,Yves.Robert}@ens-lyon.fr 2 University of Passau, Innstr. 43, 94032 Passau, Germany [email protected]
Abstract. Mapping workflow applications onto parallel platforms is a challenging problem, even for simple application patterns such as pipeline graphs. Several antagonistic criteria should be optimized, such as throughput/period and latency (or a combination). Typical applications include digital image processing, where images are processed in steady-state mode. In this paper, we study the bi-criteria mapping (minimizing period and latency) of the JPEG encoding on a cluster of workstations. We present an integer linear programming formulation for this NP-hard problem, and an in-depth performance evaluation of several polynomial heuristics. Keywords: pipeline, workflow application, multi-criteria, optimization, JPEG encoding.
1 Introduction
This work considers the problem of mapping workflow applications onto parallel platforms. This is a challenging problem, even for simple application patterns. For homogeneous architectures, several scheduling and load-balancing techniques have been developed, but the extension to heterogeneous clusters makes the problem more difficult. Structured programming approaches rule out many of the problems which the low-level parallel application developer is usually confronted with, such as deadlocks or process starvation. We therefore focus on pipeline applications, as they can easily be expressed as algorithmic skeletons. More precisely, in this paper, we study the mapping of a particular pipeline application: we focus on the JPEG encoder (baseline process, basic mode). This image processing application transforms numerical pictures from any format into a standardized format called JPEG. This standard was developed almost 20 years ago to create a portable format for the compression of still images, and new versions have been created since then (see http://www.jpeg.org/). Meanwhile, several parallel algorithms have been proposed [9]. JPEG (and later JPEG 2000) is used for encoding still images in Motion-JPEG (later MJ2). These standards are commonly employed in IP-cams and are part of many video applications in the world of game consoles. Motion-JPEG (M-JPEG) has been adopted and further developed to several
other formats, e.g., AMV (alternatively known as MTV), which is a proprietary video file format designed to be consumed on low-resource devices. The manner of encoding in M-JPEG and subsequent formats leads to a flow of still image coding, hence pipeline mapping is appropriate. We consider the different steps of the encoder as a linear pipeline of stages, where each stage gets some input, has to perform several computations and transfers the output to the next stage. The corresponding mapping problem can be stated informally as follows: which stage to assign to which processor? We require the mapping to be interval-based, i.e., a processor is assigned an interval of consecutive stages. Two key optimization parameters emerge. On the one hand, we target a high throughput, or short period, in order to be able to handle as many images as possible per time unit. On the other hand, we aim at a short response time, or latency, for the processing of each image. These two criteria are antagonistic: intuitively, we obtain a high throughput with many processors to share the work, while we get a small latency by mapping many stages to the same processor in order to avoid the cost of inter-stage communications.
Fig. 1. Steps of the JPEG encoding: Source Image Data, Scaling, YUV Conversion, Subsampling, Block Storage, FDCT, Quantizer (with Quantization Table), Entropy Encoder (with Huffman Table), Compressed Image Data
2 Framework
Principles of JPEG encoding. Here we briefly present the mode of operation of a JPEG encoder (see [14] for further details). The encoder consists of seven pipeline stages, as shown in Fig. 1. In the first stage, the image is scaled to have a multiple of an 8x8 pixel matrix, and the standard even claims a multiple of 16x16. In the next stage a color space conversion is performed from the RGB to the YUV color model. The sub-sampling stage is an optional stage, which, depending on the sampling rate, reduces the data volume: as the human eye can resolve luminosity more easily than color, the chrominance components are sampled more rarely than the luminance components. Admittedly, this leads to a loss of data. The last preparation step consists in the creation and storage of so-called MCUs (Minimum Coded Units), which correspond to 8x8 pixel blocks in the picture. The next stage is the core of the encoder. It performs a Fast Discrete Cosine Transformation (FDCT) (e.g., [15]) on the 8x8 pixel blocks, which are interpreted as a discrete signal of 64 values. After the transformation, every point in the matrix is represented as a linear combination of the 64 points. The quantizer reduces the image information to the important parts. Depending on the quantization factor and quantization matrix, irrelevant frequencies are reduced. Thereby quantization errors can occur, which are noticeable as quantization noise or block artifacts in the encoded image. The last stage is the entropy encoder, which performs a modified Huffman coding.
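For reference, the transform applied in the FDCT stage to each 8x8 block f(x, y) is the standard two-dimensional DCT-II of baseline JPEG, with the quantizer subsequently dividing each coefficient F(u, v) by the corresponding entry of the quantization table:

F(u,v) = \frac{1}{4}\, C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y)\, \cos\frac{(2x+1)u\pi}{16}\, \cos\frac{(2y+1)v\pi}{16}, \qquad C(0) = \frac{1}{\sqrt{2}},\; C(k) = 1 \text{ for } k > 0.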
Applicative framework. From a theoretical point of view, we consider a pipeline of n stages S_k, 1 ≤ k ≤ n. Tasks are fed into the pipeline and processed from stage to stage, until they exit the pipeline after the last stage. The k-th stage S_k first receives an input from the previous stage, of size δ^{k−1}, then performs a number w^k of computations, and finally outputs data of size δ^k to the next stage. These three operations are performed sequentially. The first stage S_1 receives an input of size δ^0 from the outside world, while the last stage S_n returns the result, of size δ^n, to the outside world; thus these particular stages behave in the same way as the others. From a practical point of view, we consider the applicative pipeline of the JPEG encoder as presented in Fig. 1 and its seven stages.
Target platform. We target a platform with p processors P_u, 1 ≤ u ≤ p, fully interconnected as a (virtual) clique. There is a bidirectional link link_{u,v}: P_u → P_v between any processor pair P_u and P_v, of bandwidth b_{u,v}. The speed of processor P_u is denoted as s_u, and it takes X/s_u time-units for P_u to execute X floating point operations. We enforce a linear cost model for communications: it takes X/b_{u,v} time-units to send (resp. receive) a message of size X to (resp. from) P_v. Communication contention is taken care of by enforcing the one-port model [3].
Bi-criteria interval mapping problem. We seek to map intervals of consecutive stages onto processors [13]. Intuitively, assigning several consecutive tasks to the same processor will increase their computational load, but may well dramatically decrease communication requirements. We search for a partition of [1..n] into m ≤ p intervals I_j = [d_j, e_j] such that d_j ≤ e_j for 1 ≤ j ≤ m, d_1 = 1, d_{j+1} = e_j + 1 for 1 ≤ j ≤ m − 1 and e_m = n. The optimization problem is to determine the best mapping, over all possible partitions into intervals, and over all processor assignments. The objective can be to minimize either the period, or the latency, or a combination: given a threshold period, what is the minimum latency that can be achieved? And the counterpart: given a threshold latency, what is the minimum period that can be achieved? The decision problem associated to this bi-criteria interval mapping optimization problem is NP-hard, since the period minimization problem is NP-hard for interval-based mappings (see [2]).
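Under this cost model, the period and latency of a concrete interval mapping can be evaluated directly. The following Python sketch is an illustration of the definitions above, not code from the paper; it assumes delta[0..n], w[1..n] (w[0] unused), speeds s and bandwidths b, with the fictitious endpoints "in" and "out" included in b.

def period_and_latency(intervals, delta, w, s, b):
    """intervals: list of ((d, e), proc) pairs in pipeline order covering the
    stages 1..n; delta[k] is the size of the data produced by stage k (delta[0]
    is the initial input), w[k] the work of stage k, s[p] the speed of
    processor p, and b[p][q] the bandwidth of the link from p to q."""
    n = len(w) - 1
    procs = ["in"] + [p for _, p in intervals] + ["out"]
    period, latency = 0.0, 0.0
    for j, ((d, e), p) in enumerate(intervals):
        prev_p, next_p = procs[j], procs[j + 2]
        comp = sum(w[k] / s[p] for k in range(d, e + 1))
        t_in = delta[d - 1] / b[prev_p][p]     # receive the interval's input
        t_out = delta[e] / b[p][next_p]        # send the interval's output
        period = max(period, t_in + comp + t_out)
        latency += t_in + comp                 # accumulate along the chain
    latency += delta[n] / b[intervals[-1][1]]["out"]
    return period, latency

The period is the largest cycle-time over the processors, while the latency accumulates input communications and computations along the chain plus the final output transfer, matching the quantities bounded by the linear program below.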
3 Linear Program Formulation
We present here an integer linear program to compute the optimal interval-based bi-criteria mapping on Fully Heterogeneous platforms, respecting either a fixed latency or a fixed period. We consider a framework of n stages and p processors, plus two fictitious extra stages S_0 and S_{n+1} respectively assigned to P_in and P_out. First we need to define a few variables. For k ∈ [0..n+1] and u ∈ [1..p] ∪ {in, out}, x_{k,u} is a boolean variable equal to 1 if stage S_k is assigned to processor P_u; we let x_{0,in} = x_{n+1,out} = 1, and x_{k,in} = x_{k,out} = 0 for 1 ≤ k ≤ n. For k ∈ [0..n], u, v ∈ [1..p] ∪ {in, out} with
u ≠ v, z_{k,u,v} is a boolean variable equal to 1 if stage S_k is assigned to P_u and stage S_{k+1} is assigned to P_v: hence link_{u,v}: P_u → P_v is used for the communication between these two stages. If k = 0 then z_{k,u,v} = 0 whenever u ≠ in, and if k = n then z_{k,u,v} = 0 whenever v ≠ out. For k ∈ [0..n] and u ∈ [1..p] ∪ {in, out}, y_{k,u} is a boolean variable equal to 1 if stages S_k and S_{k+1} are both assigned to P_u; we let y_{k,in} = y_{k,out} = 0 for all k, and y_{0,u} = y_{n,u} = 0 for all u. For u ∈ [1..p], first(u) is an integer variable which denotes the first stage assigned to P_u; similarly, last(u) denotes the last stage assigned to P_u. Thus P_u is assigned the interval [first(u), last(u)]. Of course 1 ≤ first(u) ≤ last(u) ≤ n. T_opt is the variable to optimize, so depending on the objective function it corresponds either to the period or to the latency. We list below the constraints that need to be enforced. For simplicity, we write Σ_u instead of Σ_{u∈[1..p]∪{in,out}} when summing over all processors.
First there are constraints for processor and link usage. Every stage is assigned a processor:
∀k ∈ [0..n+1], Σ_u x_{k,u} = 1.
Every communication either is assigned a link or collapses because both stages are assigned to the same processor:
∀k ∈ [0..n], Σ_{u≠v} z_{k,u,v} + Σ_u y_{k,u} = 1.
If stage S_k is assigned to P_u and stage S_{k+1} to P_v, then link_{u,v}: P_u → P_v is used for this communication:
∀k ∈ [0..n], ∀u, v ∈ [1..p] ∪ {in, out}, u ≠ v, x_{k,u} + x_{k+1,v} ≤ 1 + z_{k,u,v}.
If both stages S_k and S_{k+1} are assigned to P_u, then y_{k,u} = 1:
∀k ∈ [0..n], ∀u ∈ [1..p] ∪ {in, out}, x_{k,u} + x_{k+1,u} ≤ 1 + y_{k,u}.
If stage S_k is assigned to P_u, then necessarily first_u ≤ k ≤ last_u. We write this constraint as:
∀k ∈ [1..n], ∀u ∈ [1..p], first_u ≤ k·x_{k,u} + n·(1 − x_{k,u}) and last_u ≥ k·x_{k,u}.
Furthermore, if stage S_k is assigned to P_u and stage S_{k+1} is assigned to P_v ≠ P_u (i.e., z_{k,u,v} = 1) then necessarily last_u ≤ k and first_v ≥ k + 1 since we consider intervals. We write this constraint as:
∀k ∈ [1..n−1], ∀u, v ∈ [1..p], u ≠ v, last_u ≤ k·z_{k,u,v} + n·(1 − z_{k,u,v}) and first_v ≥ (k+1)·z_{k,u,v}.
The latency of the schedule is bounded by T_latency:
Σ_{u=1}^{p} Σ_{k=1}^{n} ( Σ_{t≠u} (δ^{k−1}/b_{t,u}) z_{k−1,t,u} + (w^k/s_u) x_{k,u} ) + Σ_{u∈[1..p]∪{in}} (δ^n/b_{u,out}) z_{n,u,out} ≤ T_latency,
where t ∈ [1..p] ∪ {in, out}. It remains to express the period of each processor and to constrain it by T_period:
∀u ∈ [1..p], Σ_{k=1}^{n} ( Σ_{t≠u} (δ^{k−1}/b_{t,u}) z_{k−1,t,u} + (w^k/s_u) x_{k,u} + Σ_{v≠u} (δ^k/b_{u,v}) z_{k,u,v} ) ≤ T_period.
Finally, the objective function is either to minimize the period T_period respecting the fixed latency T_latency, or to minimize the latency T_latency with a fixed period T_period. So in the first case we fix T_latency and set T_opt = T_period. In the second case T_period is fixed a priori and T_opt = T_latency. With this mechanism the objective function reduces to minimizing T_opt in both cases.
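For concreteness, the program can be transcribed almost literally with an off-the-shelf MILP modeller. The Python sketch below uses the PuLP library and the mode that minimizes the period under a latency bound; it is a condensed illustration of the formulation above (the boundary fixings of z and y are implied by the other constraints and omitted), and the parameter containers (delta[0..n], w[1..n], speeds s, and a bandwidth table b defined for every ordered pair of processors including the fictitious "in" and "out" endpoints) are assumptions about how the data is supplied.

from pulp import LpProblem, LpVariable, LpMinimize, lpSum, LpBinary, LpInteger

def build_ilp(n, procs, delta, w, s, b, latency_bound):
    ALL = list(procs) + ["in", "out"]
    prob = LpProblem("bicriteria_interval_mapping", LpMinimize)
    x = LpVariable.dicts("x", (range(n + 2), ALL), cat=LpBinary)
    z = LpVariable.dicts("z", (range(n + 1), ALL, ALL), cat=LpBinary)
    y = LpVariable.dicts("y", (range(n + 1), ALL), cat=LpBinary)
    first = LpVariable.dicts("first", procs, 1, n, LpInteger)
    last = LpVariable.dicts("last", procs, 1, n, LpInteger)
    T = LpVariable("T_period", lowBound=0)
    prob += T                                    # objective: minimize the period
    prob += x[0]["in"] == 1
    prob += x[n + 1]["out"] == 1
    for k in range(1, n + 1):                    # real stages never sit on in/out
        prob += x[k]["in"] == 0
        prob += x[k]["out"] == 0
    for k in range(n + 2):                       # every stage gets one processor
        prob += lpSum(x[k][u] for u in ALL) == 1
    for k in range(n + 1):                       # one link used, or the communication collapses
        prob += (lpSum(z[k][u][v] for u in ALL for v in ALL if u != v)
                 + lpSum(y[k][u] for u in ALL)) == 1
        for u in ALL:
            prob += x[k][u] + x[k + 1][u] <= 1 + y[k][u]
            for v in ALL:
                if u != v:
                    prob += x[k][u] + x[k + 1][v] <= 1 + z[k][u][v]
    for k in range(1, n + 1):                    # stages of P_u form [first(u), last(u)]
        for u in procs:
            prob += first[u] <= k * x[k][u] + n * (1 - x[k][u])
            prob += last[u] >= k * x[k][u]
    for k in range(1, n):
        for u in procs:
            for v in procs:
                if u != v:
                    prob += last[u] <= k * z[k][u][v] + n * (1 - z[k][u][v])
                    prob += first[v] >= (k + 1) * z[k][u][v]
    prob += (lpSum(delta[k - 1] / b[t][u] * z[k - 1][t][u]
                   for u in procs for k in range(1, n + 1) for t in ALL if t != u)
             + lpSum(w[k] / s[u] * x[k][u] for u in procs for k in range(1, n + 1))
             + lpSum(delta[n] / b[u]["out"] * z[n][u]["out"] for u in list(procs) + ["in"])
             ) <= latency_bound                  # latency of the whole schedule
    for u in procs:                              # cycle-time of each processor
        prob += (lpSum(delta[k - 1] / b[t][u] * z[k - 1][t][u]
                       for k in range(1, n + 1) for t in ALL if t != u)
                 + lpSum(w[k] / s[u] * x[k][u] for k in range(1, n + 1))
                 + lpSum(delta[k] / b[u][v] * z[k][u][v]
                         for k in range(1, n + 1) for v in ALL if v != u)
                 ) <= T
    return prob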
4 Overview of the Heuristics
The problem of bi-criteria interval mapping of workflow applications is NP-hard [2], so in this section we briefly describe polynomial heuristics to solve it. See [2] for a more complete description or refer to the Web at: http://graal.ens-lyon.fr/~vsonigo/code/multicriteria/ In the following, we denote by n the number of stages, and by p the number of processors. We distinguish two sets of heuristics. The heuristics of the first set aim to minimize the latency respecting an a priori fixed period. The heuristics of the second set minimize the counterpart: the latency is fixed a priori and we try to achieve a minimum period while respecting the latency constraint.
4.1 Minimizing Latency for a Fixed Period
All the following heuristics sort processors by non-increasing speed, and start by assigning all the stages to the first (fastest) processor in the list. This processor becomes used.
H1-Sp-mono-P: Splitting mono-criterion. At each step, we select the used processor j with the largest period and we try to split its stage interval, giving some stages to the next fastest processor j′ in the list (not yet used). This can be done by splitting the interval at any place, and either placing the first part of the interval on j and the remainder on j′, or the other way round. The solution which minimizes max(period(j), period(j′)) is chosen if it is better than the original solution. Splitting is performed as long as we have not reached the fixed period or until we cannot improve the period anymore.
H2-Sp-bi-P: Splitting bi-criteria. This heuristic uses a binary search over the latency. For this purpose at each iteration we fix an authorized increase of the optimal latency (which is obtained by mapping all stages on the fastest processor), and we test if we get a feasible solution via splitting. The splitting mechanism itself is quite similar to H1-Sp-mono-P, except that we choose the solution that minimizes max_{i∈{j,j′}}(Δlatency/Δperiod(i)) within the authorized latency increase to decide where to split. While we get a feasible solution, we reduce the authorized latency increase for the next iteration of the binary search, thereby aiming at minimizing the mapping's global latency.
H3-3-Sp-mono-P: 3-splitting mono-criterion. At each step we select the used processor j with the largest period and we split its interval into three parts. For this purpose we try to map two parts of the interval on the next pair of fastest processors in the list, j′ and j″, and to keep the third part on processor j. Testing all possible permutations and all possible positions where to cut, we choose the solution that minimizes max(period(j), period(j′), period(j″)).
H4-3-Sp-bi-P: 3-splitting bi-criteria. In this heuristic the choice of where to split is more elaborate: it depends not only on the period improvement, but also on the latency increase. Using the same splitting mechanism as in H3-3-Sp-mono-P, we select the solution that minimizes max_{i∈{j,j′,j″}}(Δlatency/Δperiod(i)).
Here Δlatency denotes the difference between the global latency of the solution before the split and after the split. In the same manner, Δperiod(i) denotes the difference between the period before the split (achieved by processor j) and the new period of processor i.
4.2 Minimizing Period for a Fixed Latency
As in the heuristics described above, first of all we sort processors according to their speed and map all stages on the fastest processor.
H5-Sp-mono-L: Splitting mono-criterion. This heuristic uses the same method as H1-Sp-mono-P with a different break condition. Here splitting is performed as long as we do not exceed the fixed latency, still choosing the solution that minimizes max(period(j), period(j′)).
H6-Sp-bi-L: Splitting bi-criteria. This variant of the splitting heuristic works similarly to H5-Sp-mono-L, but at each step it chooses the solution which minimizes max_{i∈{j,j′}}(Δlatency/Δperiod(i)) while the fixed latency is not exceeded.
Remark. In the context of M-JPEG coding, minimizing the latency for a fixed period corresponds to a fixed coding rate, and we want to minimize the response time. The counterpart (minimizing the period respecting a fixed latency L) corresponds to the question: if I accept to wait L time units for a given image, which coding rate can I achieve? We evaluate the behavior of the heuristics with respect to these questions in Section 5.
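A simplified variant of the splitting idea behind H1/H5 can be sketched in Python as follows (an illustration, not the authors' implementation): instead of pre-selecting the bottleneck processor, it tries every cut of every current interval, with the next unused processor placed before or after the cut, and keeps the split that most reduces the global period. The period_of argument abstracts the period computation, for instance the period part of the helper sketched in Section 2.

def greedy_split(n, procs_by_speed, period_of, target_period):
    # Start with all stages [1..n] on the fastest processor and repeatedly
    # apply the single interval split that most reduces the global period,
    # until the target period is met, no unused processor remains, or no
    # split helps anymore.
    mapping = [((1, n), procs_by_speed[0])]
    unused = list(procs_by_speed[1:])
    while unused and period_of(mapping) > target_period:
        q, best = unused[0], None
        for i, ((d, e), p) in enumerate(mapping):
            for cut in range(d, e):                   # split [d..e] after stage `cut`
                for left, right in ((p, q), (q, p)):  # new processor before or after
                    trial = (mapping[:i] + [((d, cut), left), ((cut + 1, e), right)]
                             + mapping[i + 1:])
                    if best is None or period_of(trial) < period_of(best):
                        best = trial
        if best is None or period_of(best) >= period_of(mapping):
            break                                     # no split improves the period
        mapping, unused = best, unused[1:]
    return mapping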
5 Experiments and Simulations
In the following experiments, we study the mapping of the JPEG application onto clusters of workstations.
Fig. 2. LP solutions strongly depend on fixed initial parameters: (a) interval mappings of the seven stages obtained for fixed periods P_fix = 310, 320 and 330; (b) interval mappings obtained for fixed latencies L_fix = 330, 340 and 370.
Influence of fixed parameters. In this first test series, we examine the influence of fixed parameters on the solution of the linear program. As shown in
Fig. 2, the division into intervals is highly dependent on the chosen fixed value. The optimal solution to minimize the latency (without any supplemental constraints) obviously consists in mapping the whole application pipeline onto the fastest processor. As expected, if the period fixed in the linear program is not smaller than the latter optimal mono-criterion latency, this solution is chosen. Decreasing the value of the fixed period forces the stages to be split among several processors, until no more solutions can be found. Fig. 2(a) shows the division into intervals for a fixed period. A fixed period of T_period = 330 is sufficiently high for the whole pipeline to be mapped onto the fastest processor, whereas smaller periods lead to splitting into intervals. We would like to mention that for a period fixed to 300, no solution exists anymore. The counterpart, with a fixed latency, can be found in Fig. 2(b). Note that the first two solutions find the same period, but for a different latency. The first solution has a high value for the latency, which allows more splits, hence larger communication costs. Comparing the last lines of Fig. 2(a) and (b), we state that both solutions are the same, and we have T_period = T_latency. Finally, expanding the range of the fixed values, a sort of bucket behavior becomes apparent: increasing the fixed parameter at first has no influence, and the LP still finds the same solution until the increase crosses a certain bound and the LP can find a better solution. This phenomenon is shown in Fig. 3.
Fig. 3. Bucket behavior of LP solutions: (a) optimal latency as a function of the fixed period; (b) optimal period as a function of the fixed latency.
Assessing heuristic performance. The comparison of the solution returned by the LP program, in terms of optimal latency respecting a fixed period (or the converse), with the heuristics is shown in Fig. 4. The implementation is fed with the parameters of the JPEG encoding pipeline and computes the mapping on 10 randomly created platforms with 10 processors. On platforms 3 and 5, no valid solution can be found for the fixed period. There are two important points to mention. First, the solutions found by H2 are often not valid, since they do not respect the fixed period, but they have the best latency/period ratio. Fig. 5(b) plots some more details: H2 achieves good latency results, but the fixed period of P = 310 is often violated. This is a consequence of the fact that the fixed period value is very close to the smallest feasible period. When the tolerance for the period is bigger, this heuristic succeeds in finding low-latency solutions. Second, all solutions, LP and heuristics, always keep the stages 4 to 7 together (see Fig. 2
for an example). As stage 5 (DCT) is the most costly in terms of computation, the interval containing these stages is responsible for the period of the whole application. Finally, in the comparative study H1 always finds the optimal latency for the fixed period, and we therefore recommend this heuristic for latency optimization under a period constraint. For minimizing the period under a fixed latency, H5 is the one to use, as it always finds the LP solution in the experiments. This is a striking result, especially given the fact that the LP integer program may require a long time to compute the solution (up to 11389 seconds in our experiments), while the heuristics always complete in less than a second and find the corresponding optimal solution.
Fig. 4. Behavior of the heuristics (compared to the LP solution) on the ten random platforms: (a) fixed P = 310, latency of LP, H1, H2, H3 and H4; (b) fixed L = 370, period of LP, H5 and H6.
Fig. 5. MPI simulation results: (a) simulated versus theoretical latency per heuristic; (b) H2 versus the LP solution in the period/latency plane.
MPI simulations on a cluster. This last experiment performs a JPEG encoding simulation. All simulations are run on a cluster of homogeneous Optiplex GX 745 machines with an Intel Core 2 Duo 6300 at 1.83 GHz. Heterogeneity is enforced by increasing and decreasing the number of operations a processor has to execute. The same holds for bandwidth capacities. For simplicity we use an MPI program whose stages have the same communication and computation parameters as the JPEG encoder, but we do not encode real images (hence the name simulation, although we use an actual implementation with MPICH). In this experiment the same random platforms with 10 processors and fixed parameters as in the theoretical experiments are used. We measured the latency of the simulation, even for the heuristics with fixed latency, and computed the average over all random platforms. Fig. 5(a) compares the average of the theoretical
results of the heuristics to the average simulative performance. The simulative behavior nicely mirrors the theoretical behavior, with the exception of H2 (see Fig. 5(b)). Here once again, some solutions of this heuristic are not valid, as they do not respect the fixed period.
6 Related Work
The blockwise independent processing of the JPEG encoder allows simple data parallelism to be applied for efficient parallelization. Many papers have addressed this fine-grain parallelization opportunity [5,12]. In addition, parallelization of almost all stages, from color space conversion, over DCT, to the Huffman encoding has been addressed [1,7]. Recently, with respect to the JPEG2000 codec, efficient parallelization of wavelet coding has been introduced [8]. All these works target the best speed-up with respect to different architectures and possibly varying load situations. Optimizing the period and the latency is an important issue when encoding a pipeline of multiple images, as for instance for Motion JPEG (M-JPEG). To meet these issues, one has to solve, in addition to the above mentioned work, a bi-criteria optimization problem, i.e., optimize the latency as well as the period. The application of coarse grain parallelism seems to be a promising solution. We propose to use an interval-based mapping strategy allowing multiple stages to be mapped to one processor, which is the most flexible way to meet the domain constraints (even for very large pictures). Several pipelined versions of the JPEG encoding have been considered. They rely mainly on pixel or blockwise parallelization [6,10]. For instance, Ferretti et al. [6] use three pipelines to carry out concurrently the encoding on independent pixels extracted from the serial stream of incoming data. The pixel and block-based approach is however useful for small pictures only. Recently, Shee et al. [11] consider a pipeline architecture where each stage presents a step in the JPEG encoding. The targeted architecture consists of Xtensa LX processors which run subprograms of the JPEG encoder program. Each program accepts data via the queues of the processor, performs the necessary computation, and finally pushes it to the output queue into the next stage of the pipeline. The basic assumptions are similar to our work; however, no optimization problem is considered and only runtime (latency) measurements are available. The schedule is static and set according to basic assumptions about the image processing, e.g., that the DCT is the most complex operation in runtime.
7 Conclusion
In this paper, we have studied the bi-criteria (minimizing latency and period) mapping of pipeline workflow applications, from both a theoretical and practical point of view. On the theoretical side, we have presented an integer linear programming formulation for this NP-hard problem. On the practical side, we have studied in depth the interval mapping of the JPEG encoding pipeline on a cluster of workstations. Owing to the LP solution, we were able to characterize
a bucket behavior in the optimal solution, depending on the initial parameters. Furthermore, we have compared the behavior of some polynomial heuristics to the LP solution and we were able to recommend two heuristics with almost optimal behavior for parallel JPEG encoding. Finally, we evaluated the heuristics running a parallel pipeline application with the same parameters as a JPEG encoder. The heuristics were designed for general pipeline applications, and some of them were aiming at applications with a large number of stages (3-splitting), thus a priori not very efficient on the JPEG encoder. Still, some of these heuristics reach the optimal solution in our experiments, which is a striking result. A natural extension of this work would be to consider further image processing applications with more pipeline stages or a slightly more complicated pipeline architecture. Naturally, our work extends to JPEG 2000 encoding, which offers, among others, wavelet coding and more complex multiple-component image encoding [4]. Another extension is for the MPEG coding family, which uses lagged feedback: the coding of some types of frames depends on other frames. Differentiating the types of coding algorithms, a pipeline architecture again seems to be a promising solution architecture.
References 1. Agostini, L.V., Silva, I.S., Bampi, S.: Parallel color space converters for JPEG image compression. Microelectronics Reliability 44, 697 (2004) 2. Benoit, A., Rehn-Sonigo, V., Robert, Y.: Multi-criteria Scheduling of Pipeline Workflows. In: HeteroPar 2007, Algorithms, Models and Tools for Parallel Computing on Heterogeneous Networks. IEEE Computer Society Press, Los Alamitos (2007) 3. Bhat, P., Raghavendra, C., Prasanna, V.: Efficient collective communication in distributed heterogeneous systems. Journal of Parallel and Distributed Computing 63, 251 (2003) 4. Christopoulos, C., Skodras, A., Ebrahimi, T.: The JPEG2000 still image coding system: an overview. IEEE Trans. on Consumer Electronics 46, 1103 (2000) 5. Falkemeier, J., Joubert, G.: Parallel image compression with JPEG for multimedisa applications. High Performance Computing: Technologies, Methods and Applications, Advances in Parallel Computing, 379–394 (1995) 6. Ferretti, M., Boffadossi, M.: A Parallel Pipelined Implementation of LOCO-I for JPEG-LS. In: 17th International Conference on Pattern Recognition (ICPR 2004), vol. 1, pp. 769–772 (2004) 7. Kumaki, T., et al.: Acceleration of DCT Processing with Massive-Parallel MemoryEmbedded SIMD Matrix Processor. IEICE Trans. on Information and Systems LETTER- Image Processing and Video Processing E90-D, 1312 (2007) 8. Meerwald, P., Norcen, R., Uhl, A.: Parallel JPEG2000 Image Coding on Multiprocessors. In: IPDPS 2002. IEEE Computer Society Press, Los Alamitos (2002) 9. Monnes, P., Furht, B.: Parallel JPEG Algorithms for Still Image Processing. In: Southeastcon 1994. Creative Technology Transfer - A Global Affair. Proceedings of the 1994 IEEE, pp. 375–379 (1994)
10. Papadonikolakis, M., Pantazis, V., Kakarountas, A.P.: Efficient high-performance ASIC implementation of JPEG-LS encoder. In: Proceedings of the Conference on Design, Automation and Test in Europe (DATE 2007). IEEE Communications Society Press (2007) 11. Shee, S.L., Erdos, A., Parameswaran, S.: Architectural Exploration of Heterogeneous Multiprocessor Systems for JPEG. International Journal of Parallel Programming 35 (2007) 12. Shen, K., Cook, G., Jamieson, L., Delp, E.: An overview of parallel processing approaches to image and video compression. In: Image and Video Compression, Proc. SPIE, vol. 2186, pp. 197–208 (1994) 13. Subhlok, J., Vondran, G.: Optimal latency-throughput tradeoffs for data parallel pipelines. In: ACM SPAA 1996, pp. 62–71. ACM Press, New York (1996) 14. Wallace, G.K.: The JPEG still picture compression standard. Commun. ACM 34, 30 (1991) 15. Wen-Hsiung, C., Smith, C., Fralick, S.: A Fast Computational Algorithm for the Discrete Cosine Transform. IEEE Trans. on Communications 25, 1004 (1977)
A Simulation Framework for Studying Economic Resource Management in Grids Kurt Vanmechelen, Wim Depoorter, and Jan Broeckhove University of Antwerp, BE-2020 Antwerp, Belgium [email protected]
Abstract. Economic principles are increasingly being regarded as a way to address conflicting user requirements, to improve the effectiveness of grid resource management systems, and to deliver incentives for providers to join virtual organizations. Because economic resource management mechanisms can encourage grid participants to reveal the true valuations of their jobs and resources, the system becomes capable of making better scheduling decisions. A lot of exploratory research into different market mechanisms for grids is ongoing. Since it is impractical to conduct analysis of novel mechanisms on operational grids, most of this research is being carried out using simulation. This paper presents the Grid Economics Simulator (GES) in support of such research. The key design goals of the framework are enabling a wide variety of economic and non-economic forms of resource management while simultaneously supporting distributed execution of simulations and exhibiting good scalability properties.
1 Introduction
Conducting research into resource management systems (RMS) on real grids is difficult for two main reasons. The first one relates to the costs involved in setting up and maintaining such a system. The second is the need to test new RMS's under a variety of different load patterns and infrastructural arrangements, which is all but impossible to achieve with a real grid system. The large scale on which grid RMS's need to be studied exacerbates these problems. The only viable option for researchers then is to resort to simulation. While there exist a number of general purpose simulators for grids, they have limited support for economic resource management systems (ERMS). There is a need for such support however, as it allows easy comparison between different economic and non-economic approaches and enables researchers to focus on the mechanism design and implementation of the chosen approach, while leveraging the strength of the existing general purpose framework in setting up the grid environment, running the simulation and monitoring the desired metrics.
2 Related Work
To provide some background on existing simulators and their capabilities we describe a number of them [1,2,3,4,5,6,7] here. For a more elaborate overview one can consult [8]. The Bricks simulator was designed as a performance evaluation system to analyse different scheduling approaches for High Performance Computing systems in a global setting [1]. Two of the most interesting features of Bricks are the use of a scripting language to describe the configuration and parameters of the simulation and its ability to incorporate external components such as NWS into simulations. Bricks has also been used to evaluate fixed cost-based scheduling approaches [9]. The framework dictates a centralized approach for resource management however, limiting its general applicability. Development has ceased and the framework is no longer available from the official project site. MicroGrid can create virtual Globus environments of arbitrary composition and allows for the execution of real applications [2]. As such, it is actually an emulator rather than a simulator. This makes MicroGrid interesting for optimizing grid applications with regards to the target configuration of the grid or conversely allow designers of grids to play with various parameters to optimize the grid architecture. Since MicroGrid is an emulator running real applications, it is very time intensive. It is also difficult to test new resource management approaches as all of them have to be compatible with Globus. Active development seems to have halted after 2004. SimGrid [3] is an extensive toolkit for the simulation of distributed applications and is written in C. The toolkit started out with a central scheduling approach and was subsequently adapted to allow for decentralized scheduling [10]. Later on, it was extended in order to allow developers to implement distributed services in the simulator and transfer them to a real world grid without code modification. Development is ongoing with the addition of MPI support and modifications to the networking layer. SimGrid focuses heavily on the network aspects of grids and less on scheduling strategies. To accommodate for economic resource management, substantial modifications would have to be made to make the simulated entities economic aware and to support the required interaction patterns. While SimGrid has been used in combination with economic scheduling approaches [11], the auctions were performed outside of the framework, with SimGrid only executing the resulting schedule. GridSim is written in Java on top of the SimJava 2.0 basic discrete event infrastructure, dating from 2002. GridSim allows for packet-level simulation of the network and also offers components oriented towards data grids. Additionally, it supports advance reservations, workload traces, an output statistics framework and background network traffic. GridSim has been used to simulate a NimrodG like deadline and budget constrained scheduling system [4] and an auction environment [5]. Development is ongoing with the latest release dating from September 2007. OptorSim [6] is a discrete event simulator that has been developed to simulate data access optimization algorithms in grids. In this regard, it takes inter-site
bandwidth into account for data transfers between grid sites. The simulator’s focus is on overall optimization of grid resources rather than intra-site or per-user optimization. This allows OptorSim to simplify two aspects; all users are modeled as a single Users entity and the worker nodes at each grid site are represented by a single entity as well. The simulation model is based on a simplification of the architecture proposed by the European DataGrid (EDG). OptorSim has been used to evaluate cost-based replication-aware algorithms for Resource Brokers and Replica Optimization Services (ROS). The latest version was released in October 2006. jCase [7] is a tool for evaluating combinatorial auctions through simulation. It has been applied to the field of grid resource management and supports multiple algorithms for price determination and solvers for determining the optimal set of winners in a combinatorial auction. As such, it is one of the few simulation tools that support research into ERMS’s. jCase however, is not a general purpose framework and specifically targets combinatorial auctions. Currently it also lacks support for simulations of dynamic systems over time.
3 GES Overview
The Grid Economics Simulator (GES) is a discrete event simulator that has been developed for the evaluation of various economic approaches in their ability to efficiently organize a resource market. This section will present an overview of the simulator's architecture, operation and features.
3.1 Key Abstractions
Since the focus of GES is on economic grid resource management, we will describe the key abstractions from an economic point of view. It is important to note however that GES also supports non-economic resource management in which case aspects such as billing and pricing are omitted. The consumer represents a grid user that wants to execute computational jobs. Each consumer has a queue of jobs that need to be executed and for which resources must be acquired from providers through participation in the market. A consumer is provided with a budgetary endowment that may be replenished periodically. In every simulation step, consumers are billed with the usage rate prices for all resources that are allocated to their jobs at that particular moment. Every provider hosts a number of CPU and disk resources that are supplied to the computational market. Providers interact with consumers to agree upon a price for the execution of a job. When agreement is reached, the provider will bill the consumer. The execution of a job may start immediately or in the future. Once a resource is allocated to a job, it remains allocated until the job completes. The market brings together consumers and providers. It also dictates the interaction pattern used for negotiating resource allocations. A market has a bank facility that keeps accounts for each consumer and provider. The bank also handles all transactions necessary for paying the bills associated with resource usage.
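These abstractions can be pictured with a minimal Python sketch (class and attribute names here are illustrative, not GES's actual API): a consumer owns a job queue and a budget, a provider prices its capacity, and the bank settles the per-step bills.

```python
from dataclasses import dataclass, field

@dataclass
class Job:
    length: float           # processing requirement in time steps
    allocated_to: str = ""  # provider id once an agreement is reached

@dataclass
class Consumer:
    name: str
    budget: float
    jobs: list = field(default_factory=list)

@dataclass
class Provider:
    name: str
    cpu_capacity: float  # collective processing capacity
    usage_rate: float    # price charged per capacity unit per step

class Bank:
    """Keeps an account per participant and settles bills."""
    def __init__(self, participants):
        self.accounts = {p.name: getattr(p, "budget", 0.0) for p in participants}

    def transfer(self, consumer, provider, amount):
        self.accounts[consumer.name] -= amount
        self.accounts[provider.name] += amount

# Per-step billing: a consumer pays the usage rate for the capacity
# currently allocated to its jobs.
def bill_step(bank, consumer, provider, allocated_capacity):
    bank.transfer(consumer, provider, provider.usage_rate * allocated_capacity)
```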
A market follows either a spot market or a future market allocation paradigm. The former is characterized by immediate dispatching of a job to a resource, while the latter supports advance reservation. A more in-depth explanation of these allocation paradigms is given in Section 3.5.
3.2 Simulation Parameters
All simulated entities are characterized by a number of parameters. The most important ones relate to the number of consumers and providers participating in the simulation, the number of jobs and their induced workload, the budgets of the consumers, and the number of CPUs and their collective processing capacity. For increased flexibility, values for these parameters may be chosen in a multitude of ways and at different grouping levels as supported by the configuration layer described in the next subsection. For example, the average number of jobs in the simulation $N^{aj}$ is related to the normalized total system load $L$ and the average normalized load of a job $l^{aj}$ by $L = N^{aj} \times l^{aj}$. Therefore it is possible to choose two of these three parameters to fully specify the load of a scenario. When we want to simulate the arrival of jobs over time we can also use traces of job arrivals $T^j$ or approximate them using arrival distributions $D^j$. The previous discourse is also applicable on a consumer group level as well as on the individual consumer level. For consumer group $i$, for example, we can choose values for the average number of jobs in the group $N_i^{aj}$ and the average normalized load of a job $l_i^{aj}$. The translation of these averages into concrete values for each consumer can be done in a straightforward way by distributing them equally over all consumers in the group, but also by means of a chosen random distribution.
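As a concrete illustration of the relation $L = N^{aj} \times l^{aj}$, the following hypothetical helper derives whichever of the three parameters is left unspecified; the function and its interface are ours, not part of GES's configuration layer.

```python
def complete_load_spec(L=None, n_aj=None, l_aj=None):
    """Given two of (total system load L, average job count N^aj,
    average normalized job load l^aj), derive the third via L = N^aj * l^aj."""
    if L is None:
        return n_aj * l_aj, n_aj, l_aj
    if n_aj is None:
        return L, L / l_aj, l_aj
    return L, n_aj, L / n_aj

# e.g. a total load of 1.5 spread over jobs of average normalized load 0.005
L, n_aj, l_aj = complete_load_spec(L=1.5, l_aj=0.005)  # n_aj = 300 jobs
```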
3.3 Architecture
An overview of the GES architecture is given in the layered diagram of figure 1. Each layer is mapped to a package in the simulator’s codebase. Key design goals of the architecture are extensibility and reusability. This “extend-and-refine” philosophy can be found throughout the whole simulation core layer and its components. The domain layer contains base classes for all domain entities such as Consumer, Provider, Job, GridResource and GridEnvironment. The Bank entity is situated in the economic layer. Support for traditional forms of resource management is provided through the non-economic layer. Class extension is heavily used from the domain layer up to the specific RMS implementation. For instance, a Consumer class of the Domain layer only keeps track of job status metrics, while an EconomicConsumer also keeps track of budgetary metrics. Existing components can be easily extended when new RMS algorithms are added to the framework. An overview of the different RMS systems that are currently supported by GES is given in section 3.5.
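The extend-and-refine idea can be sketched as follows (illustrative classes, not the simulator's real code): the economic layer refines the plain domain consumer with budgetary metrics.

```python
class Consumer:
    """Domain-layer consumer: only tracks job status metrics."""
    def __init__(self, name):
        self.name = name
        self.completed_jobs = 0

    def on_job_completed(self, job):
        self.completed_jobs += 1


class EconomicConsumer(Consumer):
    """Economic-layer refinement: additionally tracks budgetary metrics."""
    def __init__(self, name, budget):
        super().__init__(name)
        self.budget = budget
        self.total_spent = 0.0

    def pay(self, amount):
        self.budget -= amount
        self.total_spent += amount
```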
Fig. 1. Overview of the architecture of GES
Examples of reusability can be found in the Economic layer which provides components for accounting, billing and transactions, the Future Market layer which hosts reservation mechanisms for preemptible and non-preemptible workloads, the Auctions layer that supports pluggable protocols for auctioning, and the Tendering layer where new negotiation strategies can be plugged in. Simulations can be distributed over multiple processing nodes through the distribution layer. This layer interfaces with compute resources that host a Jini-enabled compute service, clusters fronted by a Sun Grid Engine head node, or clusters with a passwordless SSH setup. Currently, distribution is supported at the granularity of a simulated scenario. Possibilities for distributed execution of the individual entities in the simulation are planned for future releases. The gui layer allows the user to create, run and monitor live market scenarios. A screenshot of the user interface is given in figure 2. A persistency framework
Fig. 2. Screenshot of the GES UI
allows for storing both scenario configurations and configurations of the UI layout. Aggregated metrics over simulation runs and over a selection of simulated entities (e.g. a collection of consumers) are supported in the form of means, variances, standard deviations and box plots. After data collection and analysis, data can be directly exported from the simulator’s UI to standard data formats such as csv or graphical formats such as eps and png.
3.4 Operational Overview
A simulation runs for a number of time steps. Each time step consists of a number of phases. For the spot markets (see 3.5), these phases are listed on figure 3. First a central controller updates the joblist and budget of the consumer. Then, depending on the market mechanism used, the consumers, providers or both are instructed to start negotiations. In order to execute jobs, the consumer accepts a bill and sends it to the bank in phase 4. In phase 5 all monetary transactions take place. Finally providers are instructed to execute the relevant jobs. When these are finished, the consumer is notified in phase 7.
Fig. 3. Overview of a simulation step in GES
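A schematic rendering of one spot-market time step, with the seven phases in the order described above; all method names are placeholders rather than GES's API.

```python
def simulation_step(controller, market, consumers, providers, bank):
    # Phase 1: the central controller refreshes job lists and budgets.
    for c in consumers:
        controller.update_joblist_and_budget(c)

    # Phases 2-3: depending on the market mechanism, consumers and/or
    # providers are instructed to start negotiations; each agreement carries a bill.
    agreements = market.negotiate(consumers, providers)

    # Phase 4: consumers accept the bills and forward them to the bank.
    bills = [consumer.accept_bill(agreement) for consumer, agreement in agreements]

    # Phase 5: all monetary transactions are settled.
    for bill in bills:
        bank.settle(bill)

    # Phase 6: providers execute the jobs covered by the agreements.
    finished = [provider.execute_allocated_jobs() for provider in providers]

    # Phase 7: consumers are notified about completed jobs.
    for jobs in finished:
        for job in jobs:
            job.owner.notify_completed(job)
```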
3.5 RMS Frameworks
GES comes with built-in support for a number of reference and experimental resource management systems, both non-economic and economic. The noneconomic RMS’s are provided as a reference and for the purpose of comparison. We have implemented an offline central scheduler that can be initialized with different non-economic scheduling policies: – An Earliest Deadline First policy that schedules in the jobs of the consumer with the earliest deadline first. – A Priority policy where jobs are processed in order of the consumer’s configured priority level. – A Round Robin policy scheduling jobs from different consumers in a round robin fashion. – A FIFO policy that schedules jobs in first-in-first-out manner as they arrive. – The DONE policy that aims to maximize the number of consumers that meet their deadline. It follows a greedy approach, scheduling in consumer requests in order of increasing workload. When planning in an individual job, the CPU with the largest available remaining processing capacity is selected and the job is planned in as close as possible to its deadline.
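As an illustration of the last item, a compact sketch of the DONE policy's greedy rule (the data structures are ours, not the GES implementation): consumer requests are served in order of increasing workload, each job goes to the CPU with the largest remaining capacity, and is planned as late as its deadline allows.

```python
def done_schedule(consumer_requests, cpu_free_time, horizon):
    """Greedy DONE-style planner (illustrative, not the GES code).

    consumer_requests: list of (total_workload, [(job_length, deadline), ...])
    cpu_free_time:     dict cpu_id -> free processing time within the window
    Returns [(job_length, cpu_id, start_time)] for the jobs that were planned in.
    """
    plan = []
    # Serve consumer requests in order of increasing workload.
    for _, jobs in sorted(consumer_requests, key=lambda r: r[0]):
        for length, deadline in jobs:
            # Select the CPU with the largest remaining capacity.
            cpu = max(cpu_free_time, key=cpu_free_time.get)
            if cpu_free_time[cpu] < length or length > min(deadline, horizon):
                continue  # this job cannot be planned in before its deadline
            # Plan the job as close to its deadline as possible.
            start = min(deadline, horizon) - length
            plan.append((length, cpu, start))
            cpu_free_time[cpu] -= length
    return plan
```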
The economic RMS’s implemented in GES are divided into two separate branches. The first one encompasses the spot markets while the second one incorporates the future markets. Spot markets are characterized by very dynamic price setting and quick reaction to changing conditions, but also suffer from the exposure problem [12]. Future markets with support for advance reservation and co-allocation solve this problem at the expense of increased complexity and reaction time to changing market conditions. In spot markets, consumers have to negotiate per job for execution rights, while in future markets they can do this for an entire application consisting of multiple jobs. The spot markets that are implemented in GES are the following: – A Selective Tendering market with congestion control [13], where consumers request quotes from a group of selected providers. If a consumer is unable to obtain an allocation after requesting a certain number of quotes, it backs off and tries again at a later point in time. – An Auction market which supports double auctions as well as English, Dutch, First-Price Sealed-Bid and Vickrey auctions [14]. – A Commodity market that uses a Walrasian Auctioneer [15] for pricing. Multiple price adjustment schemes can be used, ranging from a routine based on Smale’s method [16] to various optimization routines delivered by the Matlab Optimization Toolbox, which are interfaced through RMI. – An implementation of the market mechanisms used in Tycoon [17]. The future markets supported by GES are: – The CBS [18], a centralized brokering system where consumers have to direct their application processing requests to a central broker entity that will negotiate with the providers. The broker aims to maximize the total value generated by fulfilling the consumers’ requests. – The DAS market [18] is a decentralized auctioning system where each provider holds auctions for selling its resources over the scheduling window. A consumer will place a sealed bid for each of its jobs at potentially multiple providers. These providers then calculate the winners of the auctions using a greedy heuristic. Multiple rounds can be held in order to schedule in as many consumers as possible.
4 Case Study: Value Realization for Users with Hard Deadlines
In this case study, we will use GES to study the difference in value realization between different RMS’s for consumers which assign a hard deadline to the execution of their application. We compare the economic DAS and CBS markets with a non-economic, deadline-based scheduler that adopts the DONE policy. We varied the processing capacity of the Grid while measuring realized value, infrastructure utilization, price levels and resource shares. For each sample point, we requested 100 runs in order to monitor the variance in the output metrics as a
result of the use of stochastic variables. In total, 5700 simulations were necessary for the data collection. The simulation was run for 2016 simulated time steps with a grid environment hosting 300 consumers and 20 providers. Consumers were divided into three groups with different deadline ranges and associated valuation factors (V Fdeadline ) as shown in figure 4 (left). Every consumer hosted between 210 and 390 jobs with each job having a processing requirement between 1 and 80 time steps. Each consumer’s valuation was determined by multiplying a base valuation of 10000 credits with a load dependent factor and the V Fdeadline factor. For this setup, we assumed consumers to bid truthfully and consequently equated each consumer’s bid with its valuation. The total processing capacity in the system was 1250, which was uniformly distributed over the providers in the environment. The processing capacity of each individual CPU varied between 0.5 and 1. We ran our simulations on the CalcUA cluster at the university of Antwerp which hosts 256 Opteron 250 nodes using GES’ distribution layer. Our experiment took 10069 seconds on the cluster, yielding a speedup of 85. This speedup closely corresponded to the amount of nodes available to us on the cluster. The right graph in figure 4 shows the percentagewise increase of realized consumer value compared to the DONE RMS when varying infrastructural capacity. As can be observed from the graph, both the CBS and DAS markets compare favourably to the non-economic approach.
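The valuation rule used in this setup can be written out directly; the load-dependent factor below is a placeholder for whatever concrete factor was used in the experiments.

```python
BASE_VALUATION = 10_000  # credits, as stated in the setup

def consumer_valuation(load_factor, vf_deadline):
    """Valuation = base valuation x load-dependent factor x VF_deadline.
    With truthful bidding the consumer's bid equals this valuation."""
    return BASE_VALUATION * load_factor * vf_deadline

# e.g. a consumer from a high-value group (VF_deadline = 9) with load factor 1.2
bid = consumer_valuation(1.2, 9)  # 108000 credits
```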
Fig. 4. Valuation factors for the three consumer groups (left) and the value increase for the CBS and DAS markets compared to the DONE RMS, under varying capacity (right)
Tables 1 and 2 show the results for the CBS market and the DONE RMS, respectively. Standard deviation is shown only for the utilization and value metrics due to space considerations. Although the non-economic approach attains higher utilization levels as a consequence of its preference for smaller workloads, it does not realize as much value for users as the economic approach. The high-value consumer groups are allotted a greater share in the CBS market because of their larger budgetary endowment and valuations. In the non-economic approach, the low-value group is given the largest share because it has the execution window with the least amount of resource competition. Table 1 shows that cost levels
Table 1. Output metrics under varying capacity for the CBS market

Cap.   Util.(%)      Value(%)      ShareI(%)  ShareII(%)  ShareIII(%)  CostI  CostII  CostIII
2000   79.28±1.41    95.54±0.78    37.23      37.60       25.17        1.90   1.59    0.85
1000   81.82±1.08    58.82±0.90    56.96      28.07       14.97        6.25   2.87    0.89
200    83.45±1.40    12.59±0.23    56.72      29.20       14.08        7.02   3.35    1.05

Table 2. Output metrics under varying capacity for the DONE RMS

Cap.   Util.(%)      Value(%)      ShareI(%)  ShareII(%)  ShareIII(%)
2000   83.99±1.30    92.94±2.12    32.29      33.34       34.37
1000   92.40±1.51    48.06±1.62    29.68      31.49       38.84
200    93.08±1.67    9.45±0.69     29.16      29.58       41.26
per unit of workload adjust to the degree of congestion in the system and the budgetary capabilities of the different consumer groups.
5 Summary and Future Work
Economic forms of resource management offer great opportunities for building grids that deliver incentives for provider participation and that try to maximize realized consumer value. There is a need for general purpose simulators with economic support to assist research in this field. We have introduced the Grid Economics Simulator and illustrated its extensibility by describing its architecture and operation and by providing an overview of the different supported RMS’s. We demonstrated the capabilities of GES with a case study highlighting various aspects of the framework. While GES in its current form has proven to be very useful in our research [15,18,19], we are planning for the inclusion of additional features. The first is the inclusion of network abstractions. This is a necessary step for more realistic simulation and to enable planned future research towards bandwidth pricing. In addition, we wish to be able to import traces from workload databases such as the Grid Workloads Archive [20]. This would allow us to use more realistic user and job profiles in simulations.
References 1. Takefusa, A., Matsuoka, S., Nakada, H., Aida, K., Nagashima, U.: Overview of a performance evaluation system for global computing scheduling algorithms. In: Proceedings of HPDC 1999, pp. 97–104. IEEE Computer Society, Los Alamitos (1999) 2. Song, H.J., Liu, X., Jakobsen, D., Bhagwan, R., Zhang, X., Taura, K., Chien, A.: The microgrid: a scientific tool for modeling computational grids. Sci. Program. 8(3), 127–141 (2000)
3. Casanova, H.: Simgrid: a toolkit for the simulation of application scheduling. In: Proceedings of CCGrid 2001, pp. 430–437. IEEE Computer Society, Los Alamitos (2001) 4. Buyya, R.: Economic-based Distributed Resource Management and Scheduling for Grid Computing. PhD thesis, Monash University, Australia (2002) 5. Assun¸ca ˜o, M.A., Buyya, R.: An evaluation of communication demand of auction protocols in grid environments. In: Proceedings of GECON 2006, pp. 24–33. World Scientific, Singapore (2006) 6. Cameron, D.G., Millar, A.P., Nicholson, C., Carvajal-Schiaffino, R., Stockinger, K., Zini, F.: Analysis of Scheduling and Replica Optimisation Strategies for Data Grids Using OptorSim. Journal of Grid Computing 2(1), 57–69 7. Schnizler, B.: Resource Allocation in the Grid; A Market Engineering Approach. PhD thesis, University of Karlsruhe (2007) 8. Sulistio, A., Yeo, C.S., Buyya, R.: A taxonomy of computer-based simulations and its mapping to parallel and distributed systems simulation tools. Softw. Pract. Exper. 34, 653–673 (2004) 9. Takefusa, A., Casanova, H.: A study of deadline scheduling for client-server systems on the computational grid. In: Proceedings of HPDC 2001, pp. 406–415 (2001) 10. Legrand, A., Lerouge, J.: Metasimgrid: Towards realistic scheduling simulation of distributed applications. Technical Report 2002-28, LIP (2002) 11. Das, A., Grosu, D.: Combinatorial auction-based protocols for resource allocation in grids. In: Proceedings of PDSEC 2005, IEEE Computer Society, Los Alamitos (2005) 12. Bykowsky, M.M., Cull, R.J., Ledyard, J.O.: Mutually destructive bidding: The fcc auction design problem. Journal of Regulatory Economics 17(3), 205–228 (2000) 13. Depoorter, W.: Establishment of agency as an effective market based resource allocation method using ges. Master’s thesis, University of Antwerp (2007) 14. Vanmechelen, K., Broeckhove, J.: A comparative analysis of single-unit vickrey auctions and commodity markets for realizing grid economies with dynamic pricing. In: Veit, D.J., Altmann, J. (eds.) GECON 2007. LNCS, vol. 4685, pp. 98–111. Springer, Heidelberg (2007) 15. Stuer, G., Vanmechelen, K., Broeckhove, J.: A commodity market algorithm for pricing substitutable grid resources. Fut. Gen. Comput. Syst. 23(5), 688–701 (2007) 16. Smale, S.: A convergent process of price adjustment and global newton methods. Journal of Mathematical Economics 3(2), 107–120 (1976) 17. Feldman, M., Lai, K., Zhang, L.: A price-anticipating resource allocation mechanism for distributed shared clusters. In: Proceedings of EC 2005, British Columbia, ACM Press, New York (2005) 18. Vanmechelen, K., Depoorter, W., Broeckhove, J.: Economic grid resource management for CPU bound applications with hard deadlines. In: Proceedings of CCGrid 2008, IEEE Computer Society, Los Alamitos (in press, 2008) 19. Vanmechelen, K., Stuer, G., Broeckhove, J.: Pricing substitutable grid resources using commodity market models. In: Proceedings of GECON 2006, pp. 103–112. World Scientific, Singapore (2006) 20. Iosup, A., Li, H., Jan, M., Anoep, S., Dumitrescu, C., Wolters, L., Epema, D.H.J.: The Grid Workloads Archive. FGCS (submitted, 2007)
Improving Metaheuristics for Mapping Independent Tasks into Heterogeneous Memory-Constrained Systems Javier Cuenca1 and Domingo Giménez2
1 Departamento de Ingeniería y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain [email protected] 2 Departamento de Informática y Sistemas, Universidad de Murcia, 30071 Murcia, Spain [email protected]
Abstract. This paper shows different strategies for improving some metaheuristics for the solution of a task mapping problem. Independent tasks with different computational costs and memory requirements are scheduled in a heterogeneous system with computational heterogeneity and memory constraints. The tuned methods proposed in this work could be used for optimizing realistic systems, such as scheduling independent processes onto a processors farm. Keywords: processes mapping, metaheuristics, heterogeneous systems.
1 Introduction
In this work the problem of mapping independent tasks to the processors in a heterogeneous system is considered. The tasks are generated by a processor and sent to other processors which solve them and return the solutions to the initial one. So, a master-slave scheme is used. The master-slave scheme is one of the most popular parallel algorithmic schemes [1], [2]. There are publications about optimal mapping master-slave schemes in parallel systems [3], [4], [5], but in those works the optimal mappings are obtained only under certain restrictions, and memory constraints are not considered. In our approach each task has a computational cost and a memory requirement. The processors in the system have different speeds and a certain amount of memory, which imposes a restriction on the tasks which it can be assigned. The goal is to obtain a task mapping which leads to a low total execution time. To obtain the optimum mapping in the general case is an NP problem [6], and heuristic methods may be preferable. In our previous work [7], the basic scheduling problem was explained together with some possible variants. To solve them, different metaheuristics (Genetic Algorithm, Scatter Search, Tabu Search and
This work has been partially supported by the Consejería de Educación de la Región de Murcia, Fundación Séneca 02973/PI/05.
GRASP) [8], [9] were proposed. In this work these metaheuristics are improved in different ways in order to reduce the time to perform the task mapping and to obtain a better solution. The paper is organized in the following way: in section 2 the basic scheduling problem is explained; in section 3 some metaheuristics for the solution of the proposed scheduling problem are analysed; and, finally, section 4 summarizes the conclusions and outlines future research.
2 Scheduling Problem
Of the different scheduling problems introduced in our previous work [7], this paper studies, as an example, the problem with fixed arithmetic costs and no communications in depth. In this problem, given $t$ tasks, with arithmetic costs $c = (c_0, c_1, \ldots, c_{t-1})$ and memory requirements $i = (i_0, i_1, \ldots, i_{t-1})$, and $p$ processors with the times to perform a basic arithmetic operation $a = (a_0, a_1, \ldots, a_{p-1})$, and memory capacities $m = (m_0, m_1, \ldots, m_{p-1})$, from all the mappings of tasks to the processors, $d = (d_0, d_1, \ldots, d_{t-1})$ ($d_k = j$ means task $k$ is assigned to processor $j$), with $i_k \le m_{d_k}$, find $d$ with which the following minimum is obtained:

$$\min_{\{d \,/\, i_k \le m_{d_k}\ \forall k=0,1,\ldots,t-1\}} \ \left\{ \max_{j=0,1,\ldots,p-1} \left\{ a_j \sum_{l=0,1,\ldots,t-1;\ d_l=j} c_l \right\} \right\} \qquad (1)$$

where the minimum of the mapping times which satisfy the memory constraints is obtained, and for each mapping the time is that of the processor which takes most time in the solution of the tasks it has been assigned. There is a maximum of $p^t$ assignations (with the memory constraints the number of possibilities may decrease), and it is not possible to solve the problem in a reasonable time by generating all the possible mappings. An alternative is to obtain an approximate solution using some heuristic method. This possibility is considered in this paper.
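Objective (1) can be evaluated directly for a candidate mapping; the following sketch (ours, not the authors' code) is also the natural fitness function for the metaheuristics discussed next.

```python
def mapping_time(d, c, mem_req, a, m):
    """Modelled execution time of mapping d (d[k] = processor of task k).

    c[k]: arithmetic cost of task k, mem_req[k]: its memory requirement,
    a[j]: time per basic operation on processor j, m[j]: its memory capacity.
    Returns None if the mapping violates a memory constraint.
    """
    p = len(a)
    if any(mem_req[k] > m[d[k]] for k in range(len(c))):
        return None
    load = [0.0] * p
    for k, proc in enumerate(d):
        load[proc] += c[k]
    return max(a[j] * load[j] for j in range(p))

# Example: 3 tasks on 2 processors
print(mapping_time([0, 1, 1], c=[8, 4, 4], mem_req=[2, 1, 1], a=[0.5, 0.25], m=[4, 2]))
# -> 4.0 (processor 0: 0.5 * 8 = 4.0; processor 1: 0.25 * 8 = 2.0)
```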
3 Application of Metaheuristics to the Scheduling Problem
In this section the application of metaheuristic methods to the version of the scheduling problem previously described is analysed. The methods considered are: Genetic Algorithm (GA), Scatter Search (SS), Tabu Search (TS) and GRASP (GR). The four metaheuristics are analysed from the same perspective, identifying common routines and element representations. The goal is to obtain a mapping with an associated modelled time close to the optimum, but with a low assignation time, because this time is added to the execution time of the routine. A general metaheuristic scheme is considered [10]. One such scheme is shown in algorithm 1. Each of the functions that appears in that scheme works in a different way depending on the metaheuristic chosen:
Algorithm 1. General scheme of a metaheuristic method.

    Initialize(S);
    while not EndCondition(S) do
        SS = ObtainSubset(S);
        if |SS| > 1 then
            SS1 = Combine(SS);
        else
            SS1 = SS;
        end
        SS2 = Improve(SS1);
        S = IncludeSolutions(SS2);
    end
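The same scheme in executable form, with the five routines passed in as callables so that each metaheuristic can specialize them (a sketch of Algorithm 1, not the authors' implementation):

```python
def metaheuristic(initialize, end_condition, obtain_subset,
                  combine, improve, include_solutions):
    """Generic driver mirroring Algorithm 1. Each argument is a callable
    implementing one of the routines described below; include_solutions
    also receives the current set S, which Algorithm 1 leaves implicit."""
    S = initialize()
    while not end_condition(S):
        SS = obtain_subset(S)
        SS1 = combine(SS) if len(SS) > 1 else SS
        SS2 = improve(SS1)
        S = include_solutions(S, SS2)
    return S
```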
– Initialize. To create each individual of the initial set S, this function assigns tasks to processors with a probability proportional to the processor speed.
  • GA works with a large initial population of assignations.
  • SS works with a reduced number of elements in S. This could produce a lower time for this method than that of the GA.
  • TS works with a set S with only one element.
  • GR: In each iteration the cost of each candidate is evaluated, and a number of candidates are selected to be included in the set of solutions.
– ObtainSubset: In this function some of the individuals are selected randomly.
  • GA: The individuals with better fitness function (equation 1) have more likelihood of being selected.
  • SS: It is possible to select all the elements for combination, or to select the best elements (those with better fitness function) to be combined with the worst ones.
  • TS: This function is not necessary because |S| = 1.
  • GR: One element from the set of solutions is selected to constitute the set SS (|SS| = 1).
– Combine: In this function the selected individuals are crossed, and SS1 is obtained.
  • GA, SS: The individuals can be crossed in different ways. One possibility is to cross pairs of individuals by exchanging half of the mappings, obtaining two descendants.
  • TS, GR: This function is not necessary.
– Improve:
  • GA: A few individuals are selected to obtain other individuals, which can differ greatly. This process is done by using mutation operators. The aim is to diversify the population to avoid falling into local optima.
  • SS: This function consists of a greedy method which works by evaluating the fitness value of the elements obtained with the p possible processors (with memory constraints) in each component, in order to search for a better element in its neighborhood.
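As an illustration of the greedy Improve step used by SS (and reused by GR), the following sketch repeatedly tries to move one task away from the processor that currently determines the modelled time; the fitness argument is a function such as the mapping-time evaluation sketched earlier. This is our rendering, not the original code.

```python
def greedy_improve(d, c, mem_req, a, m, fitness):
    """d: mapping as a list (task k -> processor d[k]), assumed feasible.
    Reassign one task away from the processor with the highest completion
    time whenever that lowers the fitness (modelled time)."""
    improved = True
    while improved:
        improved = False
        current = fitness(d)
        # Processor that takes the most time under mapping d.
        loads = {}
        for k, proc in enumerate(d):
            loads[proc] = loads.get(proc, 0.0) + c[k]
        worst = max(loads, key=lambda j: a[j] * loads[j])
        for k, proc in enumerate(d):
            if proc != worst:
                continue
            for j in range(len(a)):
                if j == worst or mem_req[k] > m[j]:
                    continue  # respect the memory constraints
                candidate = d[:k] + [j] + d[k + 1:]
                if fitness(candidate) < current:
                    d, improved = candidate, True
                    break
            if improved:
                break
    return d
```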
  • TS: Some elements in the neighborhood of the current element are analysed, excluding those in a list of previously analysed tabu elements.
  • GR: This function consists of a local search to improve the element selected. Some greedy method can be used, or all the elements in the neighborhood of the selected one can be analysed.
– IncludeSolutions: This function selects some elements of SS2 to be included in S for the next iteration.
  • GA: The best individuals from the original set, their descendants and the individuals obtained by mutation are included in the next S.
  • SS: The best elements are selected, as well as some elements which are scattered with respect to them, to avoid falling into local minima.
  • TS, GR: The best element from those analysed is taken as the next solution.
– EndCondition:
  • GA, SS, TS, GR: The convergence criterion could be a maximum number of iterations, or that the best fitness value from the individuals in the population does not change over a number of iterations.
3.1 Basic Experimental Tuning of the Metaheuristics
Experiments with different tasks and systems configurations have been carried out, obtaining similar results. The experiments, whose results are shown beyond, has the following configuration: The size of each task has been randomly generated between 1000 and 2000, the arithmetic cost is n3 , and the memory requirement n2 . The number of processors in the system is the same as the number of tasks. The costs of basic arithmetic operations has been randomly generated between 0.1 and 0.2 μsecs. The memory of each processor is between half the memory needed by the biggest task and one and a half times this memory. Preliminary results for the proposed problem in section 2 have been obtained using the following parameter values, whereas with other close values the results would be similar. – GA: • Initialize: The population has 80 elements; the elements in S are initially generated randomly assigning the tasks to the processors, with the probability proportional to the processor speed. • Combine: Each pair of elements is combined with half of the components of each parent; in each combination the best parent and the best descendant are included in the population. • Improve: the probability of mutation is 1/5. • EndCondition: the maximum number of iterations is 800, and the maximum number of iterations without improving the optimum solution is 80. – SS: • Initialize: S has 20 elements. The initialization is that in GA.
• Combine: The combination is that in GA. • Improve: Each element is improved with a greedy method, which works by selecting for the processor with highest execution time a task which could be assigned to another processor reducing the fitness function (equation 1). • IncludeSolutions: The elements with lowest cost function and those most scattered with respect to the best ones (using a 1-norm) are included in the reference set. • EndCondition: The maximum number of iterations is 400, and the maximum number of iterations without improving the optimum solution is 40. – TS: • Improve: The neighborhood has 10 elements, obtained by taking the tasks assigned to the processor with most cost and reassigning them to other processors. The tabu list has 10 elements. • EndCondition: The maximum number of iterations is 200, and the maximum number of iterations without improving the solution is 20. – GR: • Initialize: The initial set has 20 elements. The elements are generated as in GA and SS. • ObtainSubset: The element selected from S is chosen randomly, with more probability for the elements with better fitness function (equation 1). • Improve: The element is improved with the greedy method used in SS. • EndCondition: The number of iterations is 20. Table 1 compares the mapping time and the simulated time obtained, in a PC Centrino 2.0 GHz., with each of the heuristics, and those with a backtracking, for those problem sizes where the backtracking obtains a solution using a reasonable mapping time. Those cases where the corresponding method does not obtain the optimal solution are in bold. In almost all the cases the metaheuristics provide the best solution and use less time than a backtracking. Table 1. Comparison of backtracking and the metaheuristics. Mapping time and modelled execution time (in seconds), varying the number of tasks.
        Back             GA               SS               TS               GR
tasks   map.    simul.   map.    simul.   map.    simul.   map.    simul.   map.    simul.
4       0.025   3132     0.051   3132     0.065   3132     0.010   3132     0.019   3132
8       0.034   4731     0.028   4731     0.132   4731     0.015   4731     0.024   4731
12      0.058   1923     0.021   1923     0.158   1923     0.016   2256     0.029   1923
13      0.132   1278     0.055   1278     0.159   1278     0.016   1376     0.024   1278
14      0.791   1124     0.081   1124     0.192   1124     0.017   1124     0.027   1135
For big systems and using the different heuristics, satisfactory mappings are obtained in a reduced time. In Table 2 the mapping and the simulated times for big systems are shown. Those cases where the best solution of modelled time is obtained for each problem size appear in bold. GA and SS are the methods that need more mapping time to obtain a good solution with the parameters considered. GR and TS use much less time and obtain the best solution for almost all the cases. TS needs less time than GR, but its solutions are not always as good. Therefore, GR is the method which behaves best. Following these results with the preliminary tunings, a deeper study on how to improve those metaheuristics is now underway. For example, the next subsection shows how advanced tunings can be applied to the Genetic Algorithm. Table 2. Comparison of the metaheuristics for big systems. Mapping time and modelled execution time (in seconds), varying the number of tasks.
        GA               SS               TS               GR
tasks   map.    simul.   map.    simul.   map.    simul.   map.    simul.
25      0.139   1484     0.259   1450     0.010   1450     0.045   1450
50      0.413   1566     0.429   1900     0.015   1757     0.078   1524
100     0.592   1903     0.834   1961     0.022   3018     0.158   1460
200     0.825   3452     1.540   3452     0.079   3452     0.293   3452
400     3.203   3069     2.682   3910     0.375   3069     0.698   3069

3.2 Advanced Tuning of the Genetic Algorithm
Various tuning possibilities have been studied in order to improve the GA method. The most significant are:
– In the routine Combine:
  • T1. It is possible to change the heredity method. Instead of a descendant inheriting strictly each half of its components from each parent, each component is inherited pseudo-randomly, giving more probability to the parent with the best fitness value (the fitness value of a solution is the modelled execution time of the processor that needs more time to finish its assigned tasks) (equation 1).
  • T2. Another possibility of changing the heredity method consists of choosing each component of a descendant from the less loaded of the two processors proposed by its parents. The load of a processor $r$, $W_r$, is the product of the cost of performing an arithmetic operation in $r$ and the sum of the costs of the tasks assigned to $r$:

$$W_r = a_r \sum_{\{l=0,1,\ldots,t-1;\ d_l = r\}} c_l \qquad (2)$$

In other words, if for the $i$-th component the task is assigned in parent A to processor $r$, which has a load of $W_r$, and in parent B to processor $q$, which has a load of $W_q$, then in the descendant the component $i$ will be $r$ if $W_r < W_q$, or $q$ otherwise.
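A sketch of the T2 combination rule under the same representation (a mapping is a list d with d[k] the processor of task k); this is our illustration, not the authors' implementation.

```python
def combine_t2(parent_a, parent_b, c, a):
    """Each gene is taken from whichever parent assigns that task to the
    less loaded processor, with loads W_r computed as in equation (2)."""
    def loads(d):
        w = [0.0] * len(a)
        for k, proc in enumerate(d):
            w[proc] += c[k]
        return [a[j] * w[j] for j in range(len(a))]

    load_a, load_b = loads(parent_a), loads(parent_b)
    child = []
    for k in range(len(parent_a)):
        r, q = parent_a[k], parent_b[k]   # processors proposed by each parent
        child.append(r if load_a[r] < load_b[q] else q)
    return child
```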
– T3. In the routine Improve it is possible to introduce a hybrid approach, using a steered mutation instead of a pure mutation. In the solution to be improved, each task assigned to an overloaded processor (a processor is overloaded if its load (equation 2) is greater than the average load of all the processors) is reassigned randomly to another processor. Therefore, this routine mutates the solution to another where the total loads of the most overloaded processors have been reduced.
– T4. In the routine ObtainSubset, where the solutions that will be combined are chosen, it is possible to choose these solutions pseudo-randomly, giving more probability to the solutions with better fitness.
In the first column of Table 3 the times obtained with the base case (the original GA) on a PC Centrino 2.8 GHz are shown for different numbers of tasks. In the second column, the times obtained when the T1 tuning is applied are shown. The solutions are obtained more quickly than in the base case (less mapping time), but these solutions are worse than the previous ones (more simulated execution time). This could be because T1 is a greedy tuning that leads the algorithm to a local minimum. In the third column, the times obtained when the T2 tuning is applied are shown. Now, the time to obtain the solutions is very similar to the base case, but the solutions for some of the problems are better. So this tuning could be an interesting improvement. In the fourth column the times obtained when the T3 tuning is applied to the routine Improve are shown. The solutions are worse than in the base case, that is, a steered mutation does not work as well as we thought (a deeper study is given below). In the fifth column, the times obtained when the T4 tuning is applied to the routine ObtainSubset are shown. The times are better in some cases. It converges faster than the base case, using less mapping time. Since the improvements T2, T3 and T4 seem interesting and they affect different parts of the algorithm, it could be appropriate to combine them. In this way in the sixth and seventh columns of Table 3 the results when using T2 with the other tunings are shown. Combining T2 with T3 or T4 the results do not improve those obtained only with T2 and they need more mapping time. Therefore, it is better to apply just T2. Finally, combining T3 and T4 the results do not improve any of them.

Table 3. Comparison of the different tunings applied to the Genetic Algorithm, varying the number of tasks

        basic GA        T1              T2              T3              T4              T2+T3           T2+T4           T3+T4
tasks   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.   map.   simul.
50      0.13   1646     0.02   2277     0.05   1524     0.08   1715     0.09   1715     0.05   1524     0.06   1524     0.08   1715
100     0.25   2068     0.09   2581     0.13   1460     0.14   2230     0.25   2000     0.17   1460     0.16   1460     0.14   2230
150     0.47   2422     0.19   2908     0.19   2039     0.25   2464     0.36   2418     0.22   2039     0.22   2039     0.25   2464
200     0.41   3452     0.28   3717     0.31   3452     0.31   3452     0.33   3452     0.34   3452     0.34   3452     0.33   3452
400     1.56   3069     1.19   4184     1.19   3069     1.67   3069     1.42   3069     1.20   3069     1.25   3069     1.72   3069
1600    12.10  3680     10.50  4061     11.77  1735     11.38  3882     12.08  3482     12.56  1735     11.28  1735     12.09  3882
In order to better understand the behavior of the algorithm with the different tunings, Figs. 1, 2 and 3 show the evolution of the best solution from the newly generated individuals per iteration, in each case, along all the iterations, for the problem of mapping 1600 tasks.
Fig. 1. Evolution of the best solution from the new generated individuals per iteration for a problem size of 1600 tasks. Without tuning (T0) applied to the routine Combine, with T1 and with T2.
Regarding the routine Combine (Fig. 1), if the tuning T1 is applied, the restriction of inheriting each component from the best parent confers a more greedy tendency on the algorithm. It falls into local minima, with worse solutions than in the base case, from which it can seldom exit. However, with the tuning T2 each component of a descendant can come from any of the parents, so a bigger
Fig. 2. Evolution of the best solution from the new generated individuals per iteration for a problem size of 1600 tasks. Without tuning (T0) applied to the routine Improve and with T3.
Fig. 3. Evolution of the best solution from the new generated individuals per iteration for a problem size of 1600 tasks. Without tuning (T0) applied to the routine ObtainSubset and with T4.
mixture of genetic code is produced, causing more diversity of descendants and so allowing the algorithm to exit from local minima easily. The tendency, from the first iteration, is to improve the best solution because the most overloaded processors are unloaded in each step. In the routine Improve (Fig. 2), with the tuning T3 the mutation operation is steered towards better solutions quickly, but this kind of mutation prevents the genetic code of the descendant from differing a lot from those of the parents. In this way, if the algorithm falls into a local minimum it is very difficult to get out of it, because it does not have a pure mutation. If the tuning T4 is applied to the routine ObtainSubset (Fig. 3), the algorithm progresses slowly but surely, because in each iteration only the best solutions are chosen to have descendants and few false moves are made.
4 Conclusions and Future Works
The paper presents some improvements on previous proposals for the application of metaheuristic techniques to task-to-processor mapping problems, where the tasks are independent and have various computational costs and memory requirements, and the computational system is heterogeneous in computation and with different memory capacities (communications are not yet considered). The metaheuristics considered have been: the Genetic Algorithm, which is a global search method; Scatter Search, also a global search method, but with improvement phases; Tabu Search, a local search method with the search guided by historic information; and the GRASP method, a multiple local search method. The
parameters and the routines have been tuned and the experiments to obtain satisfactory versions of the metaheuristics have been carried out, mainly with the Genetics Algorithm where some detailed tuning techniques have been studied. In future works advanced tunings, like those applied to the Genetic Algorithm in this work, will be applied to the other metaheuristics. On the other hand, different characteristics of the heterogeneous systems will be considered: variable arithmetic cost in each processor depending on the problem size, variable communication cost in each link,... Other general approximations (dynamic assignation of tasks, adaptive metaheuristics,...) will also be studied. The tuned methods proposed in this work will be used for optimizing realistic systems, such as scheduling independent processes or mapping MPI jobs onto a processors farm.
References 1. Wilkinson, B., Allen, M.: Parallel Programming: Techniques and Applications Using Networked Workstations and Parallel Computers, 2nd edn. Prentice-Hall, Englewood Cliffs (2005) 2. Grama, A., Gupta, A., Karypis, G., Kumar, V.: Introduction to Parallel Computing, 2nd edn. Addison-Wesley, Reading (2003) 3. Banino, C., Beaumont, O., Legrand, A., Robert, Y.: Sheduling strategies for master-slave tasking on heterogeneous processor grids. In: Fagerholm, J., Haataja, J., J¨ arvinen, J., Lyly, M., R˚ aback, P., Savolainen, V. (eds.) PARA 2002. LNCS, vol. 2367, pp. 423–432. Springer, Heidelberg (2002) 4. Pinau, J.F., Robert, Y., Vivien, F.: Off-line and on-line scheduling on heterogeneous master-slave platforms. In: 14th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP 2006), pp. 439–446 (2006) 5. Brucker, P.: Scheduling Algorithms, 1st edn. Springer, Heidelberg (2007) 6. Lennerstad, H., Lundberg, L.: Optimal scheduling results for parallel computing. SIAM News, 16–18 (1994) 7. Cuenca, J., Gim´enez, D., L´ opez, J.J., Mart´ınez-Gallary, J.P.: A proposal of metaheuristics to schedule independent tasks in heterogeneous memory-constrained systems. In: CLUSTER (2007) 8. Hromkovic, J.: Algorithmics for Hard Problems, 2nd edn. Springer, Heidelberg (2003) 9. Dr´eo, J., P´etrowski, A., Siarry, P., Taillard, E.: Metaheuristics for Hard Optimization. Springer, Heidelberg (2005) 10. Raidl, G.R.: A unified view on hybrid metaheuristics. In: Almeida, F., Blesa Aguilera, M.J., Blum, C., Moreno Vega, J.M., P´erez P´erez, M., Roli, A., Sampels, M. (eds.) HM 2006. LNCS, vol. 4030, pp. 1–12. Springer, Heidelberg (2006)
A2 DLT: Divisible Load Balancing Model for Scheduling Communication-Intensive Grid Applications M. Othman , M. Abdullah, H. Ibrahim, and S. Subramaniam Department of Communication Technology and Network, University Putra Malaysia, 43400 UPM Serdang, Selangor D.E., Malaysia [email protected], [email protected]
Abstract. Scheduling an application in a data grid is significantly complex and very challenging because of the heterogeneous nature of the grid system. Divisible Load Theory (DLT) is a powerful model for modelling data-intensive grid problems where both communication and computation loads are partitionable. This paper presents a new divisible load balancing model known as adaptive ADLT (A2DLT) for scheduling communication-intensive grid applications. This model reduces the maximum completion time (makespan) as compared to the ADLT and Constraint DLT (CDLT) models. Experimental results showed that the model can balance the load efficiently, especially when communication-intensive applications are considered. Keywords: Divisible Load Theory, Data Grid, Load Balancing.
1 Introduction
In a data grid environment, many large-scale scientific experiments and simulations generate very large amounts of data in distributed storage, spanning thousands of files and data sets [1]. Due to the heterogeneous nature of the grid system, scheduling applications in such an environment, either data- or communication-intensive, is significantly complex and challenging. Grid scheduling is defined as the process of making scheduling decisions involving allocating jobs to resources over multiple administrative domains [2]. The DLT has emerged as a powerful model for modelling data-intensive grid problems [3]. The DLT model exploits the parallelism of a divisible application, which is continuously divisible into parts of arbitrary size, by scheduling the loads of a single source onto multiple computing resources. Load scheduling in a data grid has been addressed using the DLT model with the additional constraint that each worker node receives the same load fraction from each data source [4]. Most of the previous models do not take into account the communication time. In
The author is also an associate researcher at the Lab of Computational Science and Informatics, Institute of Mathematical Research (INSPEM), University Putra Malaysia.
order to achieve high performance, we must consider both communication and computation times [5,6]. In [7], CDLT is used for scheduling decomposable data-intensive applications and the results are compared with those of a genetic algorithm. The same constraint that was suggested in [4] is tested: each worker node receives the same load fraction from each data source. They considered the communication time, but not in dividing the load: first they divided the load using the DLT model and then added the communication time to the makespan. Later, the ADLT model was proposed for scheduling such applications and compared with the CDLT model, giving better performance [8]. In this paper, the A2DLT model is proposed as an improvement of the ADLT model. The objective of the model is to distribute loads over all sites in such a way as to achieve an optimal makespan for large-scale jobs.
2 Scheduling Model
In [5], the target data-intensive application model can be decomposed into multiple independent subtasks and executed in parallel across multiple sites without any interaction among subtasks. Let us consider job decomposition by decomposing input data objects into multiple smaller data objects of arbitrary size and processing them on multiple virtual sites. High Energy Physics (HEP) jobs are arbitrarily divisible at event granularity and at intermediate data product processing granularity [1]. In this research we assume that a job requires a very large logical input data set (D) consisting of N physical datasets, and that each physical dataset (of size Lk) resides at a data source (DSk, for all k = 1, 2, . . . , N) of a particular site. Fig. 1 shows how the logical input data (D) is decomposed onto networks and their computing resources. The scheduling problem is to decompose D into datasets (Di for all i = 1, 2, . . . , M) across M virtual sites in a Virtual Organization (VO) given its initial physical decomposition. Again, we assume that the decomposed data can be analyzed on any site.
2.1 Notations and Definitions
All notations and their definitions used throughout this paper are shown in Table 1. 2.2
Cost Model
The execution time cost ($T_i$) of a subtask allocated to site $i$ and the turn around time ($T_{Turn\ Around\ Time}$) of a job can be expressed as

$$T_i = T_{input\_cm}(i) + T_{cp}(i) + T_{output\_cm}(i, d)$$

and
Fig. 1. Data decomposition and their processing

Table 1. Notation and Definition

Notation   Definition
M          The total number of nodes in the system
N          The total number of data files in the system
Li         The loads in data file i
Lij        The loads that node j will receive from data file i
L          The sum of loads in the system, where $L = \sum_{i=1}^{N} L_i$
αij        The amount of load that node j will receive from data file i
αj         The fraction of L that node j will receive from all data files
wj         The inverse of the computing speed of node j
Zij        The link between node i and data source j
Ti         The processing time in node i
$$T_{Turn\ Around\ Time} = \max_{i=1}^{M}\{T_i\},$$

respectively. The input data transfer ($T_{input\_cm}(i)$), computation ($T_{cp}(i)$), and output data transfer to the client at the destination site $d$ ($T_{output\_cm}(i, d)$) are presented as $\max_{k=1}^{N}\{l_{ki}/Z_{ki}\}$, $d_i \cdot w_i \cdot ccRatio$ and $f(d_i)/Z_{id}$, respectively. $Z_{ij}$ is the network bandwidth between site $i$ and $j$, $w_i$ is the computing time to process a unit dataset of size 1 MB at site $i$, the function $f(d_i)$ is the output data size and $ccRatio$ is the non-zero ratio of computation to communication. The turn around time of an application is the maximum among all the execution times of the subtasks. The problem of scheduling a divisible job onto $M$ sites can be stated as deciding the portion of the original workload ($D$) to be allocated to each site, that is, finding a distribution of $l_{ki}$ which minimizes the turn around time of the job.
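Under the reading of the cost model given above (transfer time as size divided by bandwidth, and $d_i = \sum_k l_{ki}$ — both assumptions on our part), the turnaround time of a candidate distribution can be computed as follows:

```python
def turnaround_time(l, Z, Z_out, w, f_out, cc_ratio):
    """l[k][i]: load site i receives from source k; Z[k][i]: bandwidth source k -> site i;
    Z_out[i]: bandwidth from site i to the destination d; w[i]: time per unit data at site i;
    f_out: output size as a function of input size. Assumes d_i = sum_k l[k][i]."""
    N, M = len(l), len(w)
    times = []
    for i in range(M):
        d_i = sum(l[k][i] for k in range(N))
        t_in = max(l[k][i] / Z[k][i] for k in range(N))  # slowest input transfer
        t_cp = d_i * w[i] * cc_ratio                     # computation
        t_out = f_out(d_i) / Z_out[i]                    # output transfer to the client
        times.append(t_in + t_cp + t_out)
    return max(times)                                    # turnaround = slowest site
```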
The proposed model uses this cost model when evaluating solutions at each generation.
3 ADLT Scheduling Model
In all the literature related to divisible load scheduling, an optimality criterion [6] is used to derive an optimal solution. In order to obtain an optimal makespan, it is necessary and sufficient that all the sites that participate in the computation complete at the same time. Otherwise, load could be redistributed among the sites and this would improve the processing time. This optimality principle is used in the design of the load distribution strategy. The communication time fraction is added into the ADLT model, and the final fractions of the model are shown below:

$$CM_{i,j} = \frac{\frac{1}{w_j}}{\sum_{x=1}^{M}\frac{1}{w_x}} + \frac{\frac{1}{Z_{i,j}}}{\sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}} \qquad (1)$$

$$\alpha_j = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}} \qquad (2)$$

and

$$\alpha_{i,j} = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}}\, L_i. \qquad (3)$$
Details of this model and their derivation can be found in [8].
4 Proposed A2DLT Model
In the ADLT model, the fraction equations (1), (2) and (3) are taken separately for each source, see [8]. In addition, the node speed and link speed fractions are also taken separately, which yields the node speed fraction

$$\frac{\frac{1}{w_j}}{\sum_{x=1}^{M}\frac{1}{w_x}} \qquad (4)$$

while the link speed fraction for each link is given as

$$\frac{\frac{1}{Z_{i,j}}}{\sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}}. \qquad (5)$$

Again, the summations of these fractions are also taken separately at each source. The loads are divided using these fractions and finally the makespan is calculated. In the proposed model, we must balance the load of the whole system (that is, all sources). In other words, the node speed and link speed fractions are now normalized together over the whole system, which yields the node speed and link speed fractions

$$\frac{\frac{1}{w_j}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}} \qquad (6)$$

and

$$\frac{\frac{1}{Z_{i,j}}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}}, \qquad (7)$$

respectively. Finally, the new fraction is given as

$$CM_{i,j} = \frac{\frac{1}{w_j}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}} + \frac{\frac{1}{Z_{i,j}}}{\left(\sum_{x=1}^{M}\frac{1}{w_x}\right) + \sum_{x=1}^{N}\sum_{y=1}^{M}\frac{1}{Z_{x,y}}}, \qquad (8)$$

$$\alpha_j = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}}, \qquad (9)$$

and

$$\alpha_{i,j} = \frac{CM_{i,j}}{\sum_{i=1}^{N}\sum_{j=1}^{M} CM_{i,j}}\, L_i. \qquad (10)$$
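Equations (8)–(10), as reconstructed above, translate directly into a short routine that computes how much load each node receives from each source (a sketch under our reading of the model, not the authors' code):

```python
def a2dlt_fractions(w, Z, L):
    """w[j]: inverse computing speed of node j; Z[i][j]: bandwidth of the link
    between data source i and node j; L[i]: load held by data source i.
    Returns alpha[i][j], the load node j receives from source i (eq. 10)."""
    M, N = len(w), len(Z)
    denom = sum(1.0 / w[j] for j in range(M)) + \
            sum(1.0 / Z[x][y] for x in range(N) for y in range(M))
    CM = [[(1.0 / w[j] + 1.0 / Z[i][j]) / denom for j in range(M)]
          for i in range(N)]
    total = sum(CM[i][j] for i in range(N) for j in range(M))
    return [[CM[i][j] / total * L[i] for j in range(M)] for i in range(N)]
```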
5 Numerical Experiments
To measure the performance of the proposed A2 DLT model against the previous models, randomly generated experimental configurations are used, see [7,8]. The network bandwidth between sites is uniformly distributed between 1Mbps and 10Mbps. The location of n data sources (DSk ) is randomly selected and each physical dataset size (Lk ) is randomly selected with a uniform distribution in the range of 1GB to 1TB. We assumed that the computing time spent in a site i to process a unit dataset of size 1MB is uniformly distributed in the range 1/rcb to 10/rcb seconds where rcb is the ratio of computation speed to communication speed. We examined the overall performance of each model by running them under 100 randomly generated Grid configurations. We varied the parameters, ccRatio (0.001 to 1000), M (20 to 100), N (20 to 100), rcb (10 to 500) and data file size (1 GB to 1 TB). When both the number of nodes and the number of data files are 50, the results are collected and shown in Fig. 2. The results showed that the makespan of the proposed model is better than the other models, especially when the ccRatio is less than 1 (communicationintensive applications). Thus, the proposed model balances the load among the nodes more efficiently. From Table 2, the results show that the A2 DLT is 34% better than CDLT in terms of makespan. While the A2 DLT is better than ADLT by 25%. These results showed that A2 DLT is the best among CDLT and ADLT models.
Fig. 2. Makespan for A2 DLT, ADLT and CDLT models (N =50, M =50 and ccRatio=0.001 to 1000)
Table 2. Percentage makespan improvements of A2DLT against the CDLT and ADLT models

ccRatio    CDLT (%)   ADLT (%)
0.001      49         49
0.01       53         43
0.1        30         20
1          5          -13
Average    34         25
Fig. 3. Makespan vs. Data file Size for A2 DLT, ADLT and CDLT models (N =100, M =100 and ccRatio=0.001)
When we compare the A2DLT model to CDLT and ADLT for different data file sizes, the A2DLT model produces better results as the data file size increases. The result is shown in Fig. 3. The impact of the ratio of output data size to input data size is also shown in Fig. 4. The A2DLT model performs better for communication-intensive applications that generate small output data compared to the input data size (low oiRatio). For computation-intensive applications, the ratio of output data size to input data size does not affect the performance of the algorithms much, except when ccRatio is 1000.
Fig. 4. The impact of output data size to input data size: (a) oiRatio > 0.5, (b) oiRatio = 0: no output or small output size
6 Conclusion
Previously, the ADLT model reduced the makespan for scheduling divisible load applications. In this paper, an improved version of the ADLT model, known as the A2DLT model, is proposed. The new model reduces the makespan and balances the load better than the ADLT model, especially for communication-intensive applications. The experimental results showed that the A2DLT model improves the makespan by an average of 34% and 25% compared to the CDLT and ADLT models, respectively. With such an improvement, the proposed model can be integrated into existing data grid schedulers in order to improve their performance.
References 1. Jaechun, N., Hyoungwoo, P.: GEDAS: A Data Management System for Data Grid Environments. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3514, pp. 485–492. Springer, Heidelberg (2005)
2. Venugopal, S., Buyya, R., Ramamohanarao, K.: A Taxonomy of Data Grids for Distributed Data Sharing, Management and Processing. ACM Computing Surveys 38(1), 1–53 (2006) 3. Robertazzi, T.G.: Ten Reasons to Use Divisible Load Theory. IEEE Computer 36(5), 63–68 (2003) 4. Wong, H.M., Veeravalli, B., Dantong, Y., Robertazzi, T.G.: Data Intensive Grid Scheduling: Multiple Sources with Capacity Constraints. In: Proceeding of the IASTED Conference on Parallel and Distributed Computing and Systems, Marina del Rey, USA (2003) 5. Mequanint, M.: Modeling and Performance Analysis of Arbitrarily Divisible Loads for Sensor and Grid Networks. PhD Thesis. Dept. Electrical and Computer Engineering, Stony Brook University, New York USA (2005) 6. Bharadwaj, V., Ghose, D., Robertazzi, T.G.: Divisible Load Theory: A New Paradigm for Load Scheduling in Distributed Systems. Cluster Computing 6, 7– 17 (2003) 7. Kim, S., Weissman, J.B.: A Genetic Algorithm Based Approach for Scheduling Decomposable Data Grid Applications. In: Proceeding of the International Conference on Parallel Processing. IEEE Computer Society Press, Washington (2004) 8. Othman, M., Abdullah, M., Ibrahim, H., Subramaniam, S.: Adaptive Divisible Load Model for Scheduling Data-Intensive Grid Applications. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 446–453. Springer, Heidelberg (2007)
Evaluation of Eligible Jobs Maximization Algorithm for DAG Scheduling in Grids
Tomasz Szepieniec1 and Marian Bubak1,2
1 Academic Computer Centre CYFRONET AGH, ul. Nawojki 11, 30-950 Kraków, Poland
2 Institute of Computer Science, AGH, al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected], [email protected]
Phone: (+48 12) 617 43 35, Fax: (+48 12) 633 80 54
Abstract. Among many attempts to design DAG scheduling algorithms that would face grid environment requirements, the strategy of number of eligible jobs maximization seems promising. Therefore, this paper presents the results of thorough analysis and evaluation of this strategy and its implementation called PRIO. We have analysed a large space of random DAGs and various resources parameters to compare results of PRIO algorithm with standard critical path length prioritization, FIFO prioritization as well as with quasi-optimal solution. Results of this comparison, in terms of the makespan and robustness, are supplemented by a theoretical and specific case analysis. We conclude with an assessment of usefulness of the current implementation of eligible jobs maximization strategy. Keywords: DAG, list scheduling, application scheduling, Internet-based computing, eligible jobs maximization.
1 Introduction
Modern, large scale applications require efficient execution on available resources. The structure of many of these applications may be represented by DAGs. Many attempts at DAG-based application scheduling for grid environments were enhancements of existing solutions [1,2]; however, one may argue that they address the requirements only partially or require knowledge about the environment that is hard to obtain. The uncertainty level typical for Internet-based computing [3] precludes an accurate identification of a critical path. The observation, made from the perspective of a user or an application scheduler in large computation environments, that free resources quickly become busy if not allocated immediately, provides the foundation for the idea of keeping an application ready to use resources immediately when they become available. In the case of DAGs, keeping applications 'ready' means scheduling jobs in a way which maximises the number of jobs that are eligible for mapping to new resources when they appear. An implementation of this strategy was the subject of a series of papers [3,4,5,6].
An algorithm was introduced to provide an Internet-Computing (IC) optimal schedule for a large class of DAGs. However, in practice an algorithm that may be applied to all possible DAGs is needed, so a heuristic algorithm was designed to provide an IC-optimal schedule if it exists, while for the remaining DAGs the heuristics take steps to enhance the eligible jobs ratio [7]. The heuristics were implemented and integrated with Condor DAGMan under the name PRIO Tool. This implementation was used in [7] for a comparison with FIFO ordering. The results of that evaluation were promising; however, since FIFO ordering is a rather basic strategy, we believe that a comparison with stronger algorithms is still needed to understand the usefulness of Malewicz's heuristics. The goal of this paper is to provide a comparison with more advanced techniques and a better understanding of the usability of the PRIO algorithm for DAG scheduling in contemporary grids. Specifically, we focused on the following aspects:
– assessment of the PRIO algorithm's applicability to real grid systems; we tried to identify possible strengths and weaknesses of this algorithm;
– statistical evaluation of the algorithm in comparison with the standard DAG job prioritization method, FIFO ordering and optimal schedules, in order to clarify how successful the algorithm is and how robust the results it provides are;
– case analysis of a few typical results to understand the characteristics of schedules produced by the PRIO prioritization.
Before we describe the results of the analysis in Section 3, we give a brief overview of related works in Section 2. Next, we introduce the practical evaluation by describing a grid model implemented for simulation in Section 4. The methodology we have used is given in Section 5. Statistical evaluation and case analysis are presented in Sections 6 and 7. Finally, we summarize our evaluation in Section 8.
2 Related Works
An extensive overview of DAG scheduling algorithms related to grids is given in [8]. In a grid environment, new challenges are faced. The most important ones are the heterogeneity of resources and dynamic changes of several parameters of the environment, such as resource availability and transfer times between nodes. Several attempts were made to adapt previous DAG scheduling heuristics, like HEFT [9] and FCP [10], to the grid environment [1,2]; however, none of them fully addresses the new requirements. An important issue in list scheduling algorithms, like HEFT or SDC [11], is their sensitivity to the method used to evaluate job performance on nodes [13]. From a practical point of view, in a large environment it is complicated to gather the relevant data, even if we use a less sensitive hybrid algorithm [12]. None of the above-mentioned heuristics considers the heterogeneity of the network parameters. For applications which are sensitive to them, clustering heuristics would be a better solution [8]. Finally, some of the proposed algorithms, e.g. JDCS [14], provide sophisticated mechanisms such as back-trace techniques to reduce the data
preloading delay, but they usually require rarely available services, like grid performance prediction or resource reservation, and depend heavily on their quality. Currently available evaluations of PRIO [7] are limited to a comparison with FIFO-based ordering. Since the latter method in practice means 'no prioritization', we consider such an evaluation insufficient; however, PRIO proved to be substantially better than FIFO. It is worth mentioning that the comparison was made on DAGs with almost equal execution times (random variations of up to 30 per cent) and on nearly homogeneous resources.
3 Analysis of the Eligible Jobs Maximization Algorithm
The heuristics mentioned in the previous section, like HEFT, focus directly on reducing the makespan. By contrast, PRIO maximises the number of eligible jobs in each step, in the hope that this strategy increases resource usage and, eventually, decreases the overall DAG makespan. PRIO Tool is an implementation of a heuristic algorithm proposed by Malewicz et al. [3]. The heuristic builds on the results of research on algorithms that find optimal IC schedules for some classes of DAGs [3,4,5,6]. PRIO takes advantage of the idea that it is possible to derive an IC-optimal schedule for a complex DAG by decomposing it into simple components, scheduling each component independently, and then combining the resulting schedules [3]. The tool provides a prioritization of DAG jobs, which is typically the first stage of list scheduling algorithms. Mapping jobs to resources is expected to be done according to the prioritized list. The target platform of PRIO is application scheduling in grids with concurrent schedulers. In such environments, if heavily used, resources that become available are allocated immediately to jobs that are eligible. So, from the point of view of an application-level scheduler, resources are lost if there are no eligible jobs available at the moment when the resources appear. The aim of PRIO is to minimise the probability of this kind of situation, called a gridlock. An important advantage of the PRIO algorithm is that it does not require estimates of the time needed to complete either jobs or transfers. The only input data that PRIO requires is the structure of a DAG. Therefore, in environments in which there are no means of obtaining estimates of node and link efficiency (in some cases this efficiency could vary for different jobs and/or change over time), or in which we do not know the jobs' reference execution times, PRIO would be a better choice than methods that are very sensitive to the accuracy of performance data [13]. However, in many cases, taking into consideration only the structure of the DAG weakens the results. In particular, it is worth noting that the sequence of jobs produced by the PRIO algorithm is, in fact, intended as a completion order, while it is at job completion time that new eligible jobs are triggered. PRIO proposes this order for submission, which implicitly assumes that all jobs and transfers take the same time or, at least, that the order of completion remains the
same. This limits the heuristic's applicability to heterogeneous environments and its usage for DAGs composed of jobs with different execution times. In such a case a schedule built according to PRIO would suffer from two problems:
– the aim of maximising eligible jobs would be missed as a result of changes in the jobs' completion sequence;
– the gridlock probability would increase when long jobs are scheduled while short jobs would have provided new eligible jobs earlier.
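To make the notion of eligible jobs concrete, a minimal sketch follows; representing the DAG as a dictionary of predecessor sets is an illustrative assumption of this sketch.

    def eligible_jobs(predecessors, completed):
        # predecessors: dict mapping each job to the set of jobs it depends on
        # completed: set of jobs whose execution has finished
        return {job for job, preds in predecessors.items()
                if job not in completed and preds <= completed}

    # example: dag = {'a': set(), 'b': {'a'}, 'c': {'a'}, 'd': {'b', 'c'}}
    # eligible_jobs(dag, {'a'}) yields {'b', 'c'}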
4 Simulation Method and Related Grid Model
Since the PRIO algorithm was designed for application-level scheduling, in the simulation environment we try to model the resources which are available to a single user in a computational grid organized similarly to the WLCG/EGEE grid [15]. In WLCG the resources consist of about 250 computation clusters, called 'sites', ranging in size from several to thousands of CPUs. Jobs at each site are scheduled by a Local Resource Management System (LRMS) according to a local policy of supporting a set of virtual organizations, which are mapped onto queues at the LRMS. A resource broker service chooses suitable services according to the job owner's membership in a virtual organization and the job requirements. This decision is made immediately, according to a resource list that is typically ranked by the expected waiting time in a local queue. In such a grid architecture, requesting a resource for a job at the moment when it becomes eligible introduces unacceptable overheads. A more reasonable solution is to use lazy scheduling, in which resources are requested in advance according to the expected need, but allocation to jobs is done at the moment when resources become available. When no jobs are ready for execution at this time, the resources are freed by the application, since keeping resources that are not actively used is usually against the rules. Such conditions create room for application scheduling in which the environment is seen as a stream of resources ready for job allocation and usage. We use a simple, discrete-event simulator, similar to the one applied in the first evaluation of the PRIO algorithm. Resources are modelled as a probabilistic stream of free workers (CPUs) parameterized by the average time between resources-appear events and the average number of workers available in such events. Computational resources are parameterized by a heterogeneity level ranging from homogeneous resources to a level at which some workers can be 4 times more efficient than others, which is the value observed in WLCG. Since in our evaluation we do not focus on brokering resources to a specific site but on prioritization only, data transfers are not modelled separately; we consider them included in the overall cost of executing a job and in the heterogeneity of resources. We also assume that the resource broker takes resources in random order, regardless of their efficiency.
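A minimal sketch of such a worker stream is given below; the use of exponential distributions and the function signature are assumptions of this sketch, since the model above only specifies the two averages and the heterogeneity level.

    import random

    def resource_stream(mean_tbe, mean_size, heterogeneity, horizon):
        # yields (time, [worker speeds]) events: free workers appearing in the grid
        t = 0.0
        while t < horizon:
            t += random.expovariate(1.0 / mean_tbe)          # time between resources-appear events
            count = max(1, int(random.expovariate(1.0 / mean_size)))
            # worker efficiency drawn between 1 and the heterogeneity index (up to 4 in WLCG)
            speeds = [random.uniform(1.0, heterogeneity) for _ in range(count)]
            yield t, speeds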
5 Evaluation Set-Up
The aim of our evaluation was to understand PRIO usability by comparing its results with those of other solutions for a wide range of cases. Below we describe the prioritization methods chosen for the comparison; in each case the motivation for adding the algorithm to our comparison is provided.
1. BTIME – a classic approach that prioritizes jobs according to the maximal sum of execution costs on a path from the current job to one of the sinks of the DAG, based on estimated computing and communication costs of the DAG jobs (a sketch of this bottom-level computation is given after Table 1). A comparison with this approach is the most interesting, as this method is commonly used.
2. FIFO – jobs are scheduled in the order in which they are made eligible. When more than one job becomes eligible at the same time, the order from the DAG definition is taken. The reason for adding this algorithm to the comparison was twofold: (1) this model represents no prioritization, so we could measure the added value of PRIO, and (2) this algorithm was used in previous PRIO evaluations, so it gave us an opportunity to validate our results.
3. Quasi-optimal – a post mortem search for the optimal prioritization based on the exact availability of resources. We applied an algorithm that tests all possible prioritizations, with some optimizations that speed up the process (e.g. detecting schedules already tested). However, it was still necessary to limit both the execution time and the solution buffer size. In cases when the algorithm was not able to complete within the defined time, we took as its result the best of the partial result of the optimal algorithm and the results of the other algorithms in the comparison. The motivation for adding this algorithm was to estimate the room for further improvement.
Random DAG generation was done using a modified DAGGEN tool [16]. We generated DAGs of different characteristics, parametrized by fatness (FAT), density of communication (DENS), regularity (REG), jumps between levels (JUMP), difference in cost between jobs (CCR) and the overall size of the DAG. The values of the parameters used are shown in the upper part of Table 1. We had 3000 different configurations for DAG generation, each used 10 times in the process of generating DAGs.

Table 1. Values for DAG parameters and environment parameters

  Parameter       Values
  FAT             0.05, 0.1, 0.2, 0.3, 0.5
  DENS            0.05, 0.1, 0.2, 0.3, 0.4
  REG             0.01, 0.05, 0.1, 0.2, 0.4
  JUMP            1, 3, 5
  CCR             0, 1, 2, 3
  size of DAGs    10, 30
  TBE             5, 10, 20, 40, 80, 160
  RS              1, 3, 5, 7, 9, 11
  HI              1, 2, 3, 4
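A minimal sketch of the bottom-level computation underlying BTIME; the successor and cost dictionaries are illustrative assumptions of this sketch, since the exact cost model is not spelled out above.

    def btime_priority(successors, exec_cost, comm_cost):
        # bottom level: maximal sum of costs on a path from a job to a sink of the DAG
        memo = {}
        def bottom_level(job):
            if job not in memo:
                memo[job] = exec_cost[job] + max(
                    (comm_cost.get((job, s), 0) + bottom_level(s) for s in successors[job]),
                    default=0)
            return memo[job]
        # jobs with the largest bottom level get the highest priority
        return sorted(successors, key=bottom_level, reverse=True)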
Table 2. Ratio of wins for each prioritization algorithm in nontrivial cases
                     PRIO    BTIME   FIFO
  summary of wins    59.0%   93.7%   8.6%
  individual wins    3.9%    34.0%   0.9%
Table 3. Normalized average makespan and its standard deviation achieved by each method in nontrivial cases

                       PRIO    BTIME   FIFO
  average makespan     1.111   1.097   1.129
  standard deviation   0.164   0.158   0.174
Resources are characterized by the average time between resources-appear events (TBE), the average resource size (RS) in each event and the heterogeneity index (HI) of the system, defined as the maximum performance ratio between two machines in the system. The list of parameter values used is collected in the lower part of Table 1. There were 90 parameter combinations, each applied 10 times to every generated DAG. In total, 2.7M simulations were performed for every prioritization method. It is worth mentioning that in the mapping process we mapped all the eligible jobs in prioritization order, so the overall mapping order could differ from the prioritization list. As we focus on the evaluation of prioritization algorithms, the simplest method of resource mapping was used. To make the competition fair, the efficiency of the workers (heterogeneity index) mapped onto the same job remains the same for all methods. The simulations were performed on the Zeus cluster at ACC CYFRONET AGH. Limiting the optimal algorithm's execution time to 1 minute, we were able to collect all the results in less than 72 hours using 150 cores of 4-core Intel Xeon 2.33 GHz processors.
6 Statistical Analysis of Results
From the analysis presented here, we excluded trivial cases, by which we mean schedules that produced the same results for all algorithms. We had two classes of such cases: (1) resources were so limited that the whole process became almost sequential, (2) the availability of resources allowed all eligible jobs to be scheduled virtually immediately. In our experiment 82% of all cases were classified as trivial. Table 2 presents a summary of the comparison, assuming that we are interested only in the best algorithm for each case. For every case we chose the algorithm that provides the smallest makespan, which we call the winner. In many cases more than one method provided the best result. In general, the most successful prioritization method was BTIME. PRIO proved to be substantially better than FIFO, but failed to provide results comparable to BTIME.
Table 4. Selected parameter's value correlation with makespan and range of changes of average makespan results for PRIO and BTIME

                                      PRIO                       BTIME
                                      Correlation  Max. change   Correlation  Max. change
  density                             0.2          0.1%          97.6         0.0%
  allowed jumps between levels        0.99         2.6%          0.99         3.0%
  allowed jumps between levels        0.99         2.6%          0.99         3.0%
  performance ratio between jobs      -0.95        1.4%          -0.85        1.3%
  heterogeneity of environment        -0.95        1.4%          -0.85        1.3%
[Plot: average normalized makespan (y-axis, 1.0 to 1.22) versus resources usage ratio (x-axis, 0 to 0.9) for the FIFO, PRIO and BTIME prioritizations.]
Fig. 1. Average normalized makespan as a function of the resource usage ratio
For a better understanding of the usability of the PRIO algorithm we analysed the parameters in the class of the 3.9% of cases in which PRIO provided the best result. The results show the same distribution of parameters as in the whole set of nontrivial cases. Therefore, we can assume that the success of the PRIO algorithm depends on the specific structure of the DAG and its relation to the resource stream. The rest of the analysis is based on makespan values normalized with respect to the quasi-optimal results. Normalization was done so that long and short schedules have the same impact on the presented data. A general summary of the average makespan of each algorithm and its standard deviation is presented in Table 3. We see that PRIO is situated between the other two algorithms both in makespan and in robustness measured in terms of the standard deviation of the makespan [17]. Additionally, we observe that each prioritization method produced an average makespan about 10% worse than the quasi-optimal one. Taking into account also a standard deviation that exceeds 15%, we get a picture of substantial room for further improvement in this field. We also analysed how the simulation parameters influence the results; even where there was a strong correlation between parameter values and makespan, the overall impact on the makespan within the range of parameters we evaluated does not exceed 3% (Table 4).
Table 5. Average stall ratio
                        PRIO    BTIME   FIFO
  average stall ratio   0.412   0.408   0.415
  standard deviation    0.263   0.261   0.259
Figure 1 presents the average makespan as a function of the average resource usage, in order to check how the schedules depend on a value that reflects the availability of resources. We note that in every class of each parameter the BTIME prioritization outperformed PRIO. Additionally, we can conclude that the scheduling of DAGs prioritized by the evaluated algorithm depends strongly on the specific structure of a DAG and on resource availability. Another question that we tried to answer in this evaluation was whether the PRIO algorithm was able to provide more eligible jobs. If so, the ratio of 'stalls', defined as the number of cases where there were no eligible jobs when a resource event appeared in the system, should be smaller than for the other algorithms. Unfortunately, as we can see in Table 5, the stall ratio was higher than in the BTIME case. This may be connected with the fact that bad scheduling decisions lead to stalls while waiting for a job whose execution takes longer than average. The large standard deviation is caused by the varying parameters of the resource flow.
7 Typical Cases Analysis
In the previous section we presented the results of an evaluation which shows that the PRIO prioritization provides results statistically worse than the simple BTIME prioritization. In this section we analyse two cases taken from the experiment, illustrated in Figure 2, in order to better understand the results. In both graphs, the number of jobs remaining for submission and the number remaining to complete their execution are shown as functions of time. The space between these two lines is filled to better illustrate the running jobs. In the left graph we can observe that an application scheduled according to the PRIO prioritization gained more resources and completed more jobs in the first stage (t < 150) of the computation process. This was enough to start the next stage substantially earlier and save about 6% of the overall makespan. So, in this case the PRIO algorithm provided results in line with expectations. In the right graph we can observe that for most of the time both prioritizations were equally efficient. The proposed sequences, although different in the second half of the makespan, did not cause differences in allocating the resources that were appearing. What is more, PRIO gained two slots in the last stage, substantially before the other algorithm. Surprisingly, in the end the schedule based on the BTIME prioritization wins, because with these priorities it was possible to start the second-to-last job without waiting for the completion of the others still running. So, in this particular case the strategy that PRIO is built on clearly provides bad solutions. Summing up, we should note that the overall results depend on subtle relations between job timing and resource-appear events.
Fig. 2. Traces of two different schedules according to the PRIO and BTIME prioritizations. The graphs present the number of jobs remaining for submission (lower line) and for completion (upper line) as a function of time. The dashed area between them illustrates jobs in execution. Overlapping dashed areas do not imply the same set of jobs!
Concerning the applicability of the strategy used in PRIO, we conclude that gaining more resources gives good results as long as it does not happen only in the final part of the schedule. In the final stage of DAG execution it is important to keep jobs running concurrently, avoiding a situation in which the last jobs are left to run alone at the end.
8 Conclusions
In this paper we have considerably extended the knowledge about the heuristics that maximize eligible jobs (the PRIO algorithm) by comparing it with other heuristics and analysing the applicability of this algorithm in real usage. The main conclusion is that the tested implementation of the eligible jobs maximization strategy is not mature enough to be used in the described environment. At this moment, simpler mechanisms, like the BTIME heuristics, provide better schedules. However, there is significant room for improvement in which the strategy of maximizing eligible jobs could be useful for improving existing solutions. To achieve this, we will continue our research towards eliminating the weaknesses of the PRIO algorithm identified in this paper.
Acknowledgments. We would like to thank Grzegorz Malewicz for providing us with the implementation of the PRIO tool and the simulator on which the presented evaluation was obtained. Simulations were processed on the Zeus cluster at ACC CYFRONET AGH. This work was partly supported by the EU IST CoreGRID Project.
References 1. You, S.Y., Kim, H.Y., Hwang, D.H., Kim, S.C.: Task Scheduling Algorithm in GRID Considering Heterogeneous Environment. In: Proc. of the International Conference on Parallel and Distributed Processing Techniques and Applications, PDPTA 2004, Nevada, USA, June 2004, pp. 240–245 (2004) 2. Ma, T., Buyya, R.: Critical-Path and Priority based Algorithms for Scheduling Workflows with Parameter Sweep Tasks on Global Grids. In: Proc. of the 17th International Symposium on Computer Architecture and High Performance Computing, Rio de Janeiro, Brazil (October 2005) 3. Malewicz, G., Rosenberg, A., Yurkewych, M.: Towards a Theory for Scheduling Dags in Internet-Based Computing. IEEE Transactions on Computers 55(6), 757– 768 (2006) 4. Rosenberg, A.L.: On scheduling mesh-structured computations for Internet-based computing. IEEE Trans. Comput. 53, 1176–1186 (2004) 5. Rosenberg, A.L., Yurkewych, M.: Guidelines for scheduling some common computation-dags for Internet-based computing. IEEE Trans. Comput. 54, 428– 438 (2005) 6. Cordasco, G., Malewicz, G., Rosenberg, A.L.: Advances in IC-Scheduling Theory: Scheduling Expansive and Reductive Dags and Scheduling Dags via Duality. IEEE TPDS 18(11) (November 2007) ISSN: 1045-9219 7. Malewicz, G., Foster, I., Rosenberg, A., Wilde, M.: A Tool for Prioritizing DAGMan Jobs and Its Evaluation. In: 15th IEEE International Symposium on High Performance Distributed Computing (HPDC-15), pp. 156–167 (2006) 8. Dong, F., Akl, S.G.: Scheduling Algorithms for Grid Computing: State of the Art and Open Problems. Technical Report of Queen’s University School of Computing, 2006-504 (January 2006) 9. Topcuoglu, H., Hariri, S., Wu, M.Y.: Performance-Effective and Low-Complexity Task Scheduling for Heterogeneous Computing. IEEE Transactions on Parallel and Distributed Systems 13(3), 260–274 (2002) 10. Radulescu, A., van Gemund, A.J.C.: On the Complexity of List Scheduling Algorithms for Distributed Memory Systems. In: Proc. of 13th International Conference on Supercomputing, Portland, Oregon, USA, ovember 1999, pp. 68–75 (1999) 11. Shi, Z., Dongarra, J.J.: Scheduling workflow applications on processors with different capabilities. Future Generation Computer Systems 22(6), 665–675 (2006) 12. Sakellariou, R., Zhao, H.: A Hybrid Heuristic for DAG Scheduling on Heterogeneous Systems. In: Proc. of 18th International Parallel and Distributed Processing Symposium (IPDPS 2004), Santa Fe, New Mexico USA, April 2004, pp. 111–123 (2004) 13. Zhao, H., Sakellariou, R.: An Experimental Investigation into the Rank Function of the Heterogeneous Earliest Finish Time Scheduling Algorithm. In: Kosch, H., B¨ osz¨ orm´enyi, L., Hellwagner, H. (eds.) Euro-Par 2003. LNCS, vol. 2790, pp. 189– 194. Springer, Heidelberg (2003) 14. Dong, F., Akl, S.G.: A Joint Data and Computation Scheduling Algorithm for the Grid. In: Kermarrec, A.-M., Boug´e, L., Priol, T. (eds.) Euro-Par 2007. LNCS, vol. 4641, pp. 587–597. Springer, Heidelberg (2007) 15. Wordwide LHC Grid Computing. Web page: http://lcg.web.cern.ch/LCG/ 16. Daggen – Synthetic DAG Generation. http://www.loria.fr/∼ suter/dags.html 17. Canon., L.C., Jeannot, E.: A Comparison of Robustness Metrics for Scheduling DAGs on Heterogeneous Systems. In: 6th Int. Workshop on Algorithms, Models and Tools – HeteroPar 2007, Austin (2007)
Parallel Path-Relinking Method for the Flow Shop Scheduling Problem
Wojciech Bożejko1 and Mieczysław Wodecki2
1 Wrocław University of Technology, Institute of Computer Engineering, Control and Robotics, Janiszewskiego 11-17, 50-372 Wrocław, Poland
[email protected]
2 University of Wrocław, Institute of Computer Science, Joliot-Curie 15, 50-383 Wrocław, Poland
[email protected]
Abstract. The use of scheduling algorithms in parallel computing environments is discussed in this paper. A parallel path-relinking approach based on the scatter search metaheuristic is proposed for the flow shop problem with the Cmax and Csum criteria. The results obtained are very promising: superlinear speedup is observed for some versions of the parallel algorithm.
1 Introduction
The main issue discussed here is the problem of using scheduling algorithms in parallel environments, such as multiprocessor systems, clusters or local networks. On the one hand, the sequential character of the scheduling algorithms' computation process is an obstacle to designing sufficiently effective parallel algorithms. On the other hand, parallel computations offer essential advantages for solving difficult problems of combinatorial optimization. We consider the permutation flow shop scheduling problem, a classic NP-hard problem of combinatorial optimization, which can be described as follows. A number of jobs are to be processed on a number of machines. Each job must go through all the machines in exactly the same order and the job order must be the same on each machine – the machines are ordered as a linear chain. Each machine can process at most one job at any point of time and each job may be processed on at most one machine at any time. The objective is to find a schedule that minimizes the sum of the jobs' completion times (the F ||Csum problem) or the maximal job completion time (the F ||Cmax problem). Garey, Johnson & Seti [4] show that F ||Cmax is strongly NP-hard for more than 2 machines. A branch and bound algorithm was proposed by Grabowski [5]. Its performance is not entirely satisfactory, however, as it experiences difficulty in solving instances with 20 jobs and 5 machines. Thus, there exist two mutually non-conflicting approaches which allow one to solve large-size instances
in acceptable time: (1) approximate methods (mainly metaheuristics), and (2) parallel methods. As regards parallel metaheuristics dedicated mainly to homogeneous multiprocessor systems (such as mainframe computers and specialized clusters), a parallel variant of the scatter search method, currently one of the most promising methods of combinatorial optimization, has been designed and studied experimentally in application to flow shop scheduling problems with the Cmax and Csum criteria. In some cases the effect of superlinear speedup has been observed. Although the algorithms have not been executed with a huge number of iterations, a new best solution has been obtained for the Csum flow shop problem on the benchmark instances of Taillard [13]. This work is a continuation of the authors' research on constructing efficient parallel algorithms to solve hard combinatorial problems ([2,3,15]). In the following, we present a parallel algorithm based on the scatter search method which not only speeds up the computations but also improves the quality of the results.
2 The Problems
The flow shop problem with makespan criterion. We consider, as the test case, the strongly NP-hard problem well known in scheduling theory as the permutation flow shop problem with the makespan criterion, denoted by F ||Cmax. Consciously skipping the long list of papers dealing with this subject, we only refer the reader to recent reviews and the best up-to-now algorithms [8,6,9]. The problem is introduced as follows. There are n jobs from a set J = {1, 2, . . . , n} to be processed in a production system having m machines, indexed by 1, 2, . . . , m, organized in a line (sequential structure) – ordered as a linear chain. A single job reflects the manufacturing of one final product (or sub-product). Each job is performed in m subsequent stages, in a way common to all tasks. Stage i is performed by machine i, i = 1, . . . , m. Every job j ∈ J is split into a sequence of m operations O1j, O2j, . . . , Omj performed on the machines in turn. Operation Oij reflects the processing of job j on machine i with processing time pij > 0. Once started, a job cannot be interrupted. Each machine can execute at most one job at a time, and each job can be processed on at most one machine at a time. The sequence of loading jobs into the system is represented by a permutation π = (π(1), . . . , π(n)) on the set J. The optimization problem is to find the optimal sequence π* such that

  C_{max}(\pi^*) = \min_{\pi \in \Pi} C_{max}(\pi),   (1)

where Cmax(π) is the makespan for permutation π and Π is the set of all permutations. Denoting by Cij the completion time of job j on machine i, we have Cmax(π) = Cm,π(n). The values Cij can be found using the recursive formula

  C_{i\pi(j)} = \max\{C_{i-1,\pi(j)}, C_{i,\pi(j-1)}\} + p_{i\pi(j)}, \quad i = 1, \ldots, m, \; j = 1, \ldots, n,   (2)

with initial conditions C_{i\pi(0)} = 0, i = 1, \ldots, m, and C_{0\pi(j)} = 0, j = 1, \ldots, n.
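Recursion (2) can be transcribed directly into a small function; the machine-major layout of the processing-time matrix p[i][j] is an assumption of this sketch.

    def makespan(p, pi):
        # p[i][j]: processing time of job j on machine i; pi: permutation of job indices
        m, n = len(p), len(pi)
        C = [[0.0] * (n + 1) for _ in range(m + 1)]   # C[i][k]: completion of the k-th job of pi on machine i
        for k in range(1, n + 1):
            job = pi[k - 1]
            for i in range(1, m + 1):
                C[i][k] = max(C[i - 1][k], C[i][k - 1]) + p[i - 1][job]
        return C[m][n]   # Cmax = completion of the last job on the last machine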
Notice that the problem of transforming a sequential algorithm for scheduling problems into a parallel one is nontrivial, because of the strongly sequential character of the computations carried out by (2) and by other known scheduling algorithms.
The flow shop problem with Csum criterion. The objective is to find a schedule that minimizes the sum of the jobs' completion times. The problem is denoted by F ||Csum. Thanks to special properties (blocks of the critical path, [6]), the problem with the Cmax objective is regarded as an easier one; unfortunately, there are no similar properties that could speed up computations for the F ||Csum flow shop problem. There are plenty of good heuristic algorithms for solving the F ||Cmax flow shop problem, with the objective of minimizing the maximal job completion time. Constructive algorithms (LIT and SPD from [14]) have low efficiency and can only be applied to a limited range of instances. Smutnicki [12] provides a worst-case analysis of the known approximate algorithms. Bożejko and Wodecki [3] proposed a parallel genetic algorithm, and Reeves and Yamada [11] a hybrid algorithm consisting of elements of tabu search, simulated annealing and path relinking methods. The results of the latter algorithm, applied to the Taillard benchmark tests [13], are the best known in the literature nowadays. The flow shop problem with the sum of jobs' completion times criterion can be formulated using the notation from the previous paragraph. We wish to find a permutation π* ∈ Π such that

  C_{sum}(\pi^*) = \min_{\pi \in \Pi} C_{sum}(\pi), \quad \text{where} \quad C_{sum}(\pi) = \sum_{j=1}^{n} C_{m\pi(j)},

where Ciπ(j) is the time required to complete job j on machine i in the processing order given by the permutation π.
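Analogously, a sketch for the Csum objective reuses the same recursion and sums the completion times on the last machine.

    def total_flowtime(p, pi):
        # sum of the completion times of all jobs on the last machine (Csum)
        m, n = len(p), len(pi)
        C = [[0.0] * (n + 1) for _ in range(m + 1)]
        total = 0.0
        for k in range(1, n + 1):
            job = pi[k - 1]
            for i in range(1, m + 1):
                C[i][k] = max(C[i - 1][k], C[i][k - 1]) + p[i - 1][job]
            total += C[m][k]
        return total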
3 Multi-thread Search: Scatter Search Method
The main idea of the scatter search method is presented in [7]. The algorithm is based on maintaining and evolving a set of so-called starting solutions. In the classic version, a linear combination of the starting solutions is used to construct a new solution. In the case of a permutational representation of solutions, taking a linear combination of permutations yields an object which is not a permutation. Therefore, in this paper a path relinking procedure is used to construct a path from one solution of the starting set to another solution from this set. The best element of such a path is chosen as a candidate to be added to the starting solution set.
3.1 Path Relinking
The base of the path relinking procedure, which connects two solutions π1 , π2 ∈ Π, is a multi-step crossover fusion (MSXF) described by Reeves and Yamada [11]. Its idea is based on a stochastic local search, starting from π1 solution,
to find a new good solution, where the other solution π2 is used as a reference point. The neighborhood N(x) of a permutation (individual) x is defined as the set of permutations that can be obtained from x by exactly one adjacent pairwise exchange operator, which swaps the positions of two adjacent jobs of the solution represented by x. The distance measure d(π, σ) is defined as the number of adjacent pairwise exchanges needed to transform permutation π into permutation σ; this measure is known as Kendall's τ.

Algorithm 2. Path-relinking procedure
Let π1, π2 be reference solutions. Set x = q = π1;
repeat
  For each member yi ∈ N(x), calculate d(yi, π2);
  Sort yi ∈ N(x) in ascending order of d(yi, π2);
  repeat
    Select yi from N(x) with a probability inversely proportional to the index i;
    Calculate Csum(yi);
    Accept yi with probability 1 if Csum(yi) ≤ Csum(x), and with probability
      PT(yi) = exp((Csum(x) − Csum(yi)) / T) otherwise (T is the temperature);
    Change the index of yi from i to n and the indices of yk, k = i+1, . . . , n from k to k−1;
  until yi is accepted;
  x ← yi;
  if Csum(x) < Csum(q) then q ← x;
until some termination condition is satisfied;
return q   { q is the best solution lying on the path from π1 to π2 }

The termination condition was exceeding 100 iterations of the path-relinking procedure.
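A minimal Python sketch of this path-relinking scheme follows; it is not the authors' C++ implementation, and the cost callable, the temperature T and the 100-iteration limit are parameters of the sketch.

    import math, random

    def adjacent_neighbors(x):
        # all permutations reachable by one adjacent pairwise exchange
        return [x[:i] + [x[i + 1], x[i]] + x[i + 2:] for i in range(len(x) - 1)]

    def kendall_tau_distance(a, b):
        # number of adjacent transpositions needed to turn a into b (inversion count)
        pos = {job: i for i, job in enumerate(b)}
        ranks = [pos[job] for job in a]
        return sum(1 for i in range(len(ranks)) for j in range(i + 1, len(ranks))
                   if ranks[i] > ranks[j])

    def path_relinking(pi1, pi2, cost, T=1.0, max_iter=100):
        x = q = list(pi1)
        for _ in range(max_iter):
            ys = sorted(adjacent_neighbors(x), key=lambda y: kendall_tau_distance(y, pi2))
            while True:
                # selection probability inversely proportional to the index in the sorted list
                weights = [1.0 / (i + 1) for i in range(len(ys))]
                i = random.choices(range(len(ys)), weights=weights)[0]
                y = ys[i]
                if cost(y) <= cost(x) or random.random() < math.exp((cost(x) - cost(y)) / T):
                    break
                ys.append(ys.pop(i))        # move the rejected neighbor to the end of the list
            x = y
            if cost(x) < cost(q):
                q = x
        return q

Here cost would be the Csum (or Cmax) of a permutation, for example computed with the recursion of Section 2.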
3.2 Parallel Scatter Search Algorithm
The parallel algorithm was designed to execute on two machines:
– a cluster of 152 dual-core Intel Xeon 2.4 GHz processors connected by Gigabit Ethernet with 3Com SuperStack 3870 switches (for the F ||Csum problem),
– a Silicon Graphics SGI Altix 3700 Bx2 with 128 Intel Itanium2 1.5 GHz processors and cache-coherent Non-Uniform Memory Access (cc-NUMA), NUMAflex4 craylinks in a fat tree topology with a bandwidth of 4.3 Gbps (for the F ||Cmax problem),
both installed in the Wrocław Center of Networking and Supercomputing. Both supercomputers have distributed memory, where each processor has its local cache memory (in the same node) which is accessible in a very short time (compared with the time of access to the memory in another node). Taking into consideration this type of architecture, we chose a client-server model for the scatter
Fig. 1. Executing concurrent path-relinking procedures in the set S
search algorithm proposed here, where the calculations of the path-relinking procedures are executed by processors on local data and communication takes place rarely, to create a common set of new starting solutions. The process of communication and evaluation of the starting solution set S is controlled by processor number 0. We call this model global. For comparison, a model without communication was also implemented, in which independent scatter search threads are executed in parallel; the result of such an algorithm is the best solution among those generated by all the searching threads. We call this model independent. The algorithms were implemented in C++ using the MPI (mpich 1.2.7) library and executed under the OpenPBS batching system, which measures the processor usage times.

Algorithm 3. Parallel scatter search algorithm for the SIMD model without shared memory
parfor p := 1 to number of processors do
  for i := 1 to iter do
    Step 1.
    if (p = 0) then {only processor number 0}
      Generate a set of unrepeated starting solutions S, |S| = n.
      Broadcast the set S among all the processors.
    else {other processors}
      Receive from processor 0 the set of starting solutions S.
    end if;
    Step 2.
    For randomly chosen n/2 pairs from S apply the path relinking procedure to generate a set S′ of n/2 solutions which lie on the paths.
    Step 3.
    Apply the local search procedure to improve the value of the cost function of the solutions from the set S′.
    Step 4.
    if (p ≠ 0) then
      Send the solutions from the set S′ to processor 0
    else {only processor number 0}
      Receive the sets S′ from the other processors and add their elements to the set S
      Step 5.
      Leave in the set S at most n solutions by deleting the worst and repeated solutions.
      if |S| < n then
        Add new random solutions to the set S such that the elements of S do not duplicate and |S| = n.
      end if;
    end if;
  end for;
end parfor.
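A hedged sketch of the communication skeleton of Algorithm 3, using mpi4py as a stand-in for the paper's C++/MPI code; generate_set, path_relink, local_search and merge are placeholder callables, and in this sketch the starting set is generated once rather than in every iteration.

    import random
    from mpi4py import MPI

    def random_pairs(S):
        # n/2 random pairs drawn from the starting set S
        idx = random.sample(range(len(S)), len(S) // 2 * 2)
        return [(S[idx[k]], S[idx[k + 1]]) for k in range(0, len(idx), 2)]

    def parallel_scatter_search(iterations, generate_set, path_relink, local_search, merge):
        comm = MPI.COMM_WORLD
        rank = comm.Get_rank()
        S = generate_set() if rank == 0 else None          # Step 1: processor 0 builds the starting set
        for _ in range(iterations):
            S = comm.bcast(S, root=0)                      # broadcast the starting set to all processors
            S1 = [path_relink(a, b) for a, b in random_pairs(S)]   # Step 2: solutions on the paths
            S1 = [local_search(x) for x in S1]                     # Step 3: improve them locally
            gathered = comm.gather(S1, root=0)                     # Step 4: collect S' on processor 0
            if rank == 0:
                S = merge(S, [x for part in gathered for x in part])  # Step 5: keep at most n solutions
        return S if rank == 0 else None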
3.3 Computer Simulations
Tests were based on 50 instances with 100, . . . , 500 operations (n × m = 20×5, 20×10, 20×20, 50×5, 50×10) due to Taillard [13], taken from the OR-Library [10]. The results were compared with the best known ones, taken from [10] for F ||Cmax and from [11] for F ||Csum. For each version of the scatter search algorithm (global or independent) the following metrics were calculated:
– ARPD – Average Percentage Relative Deviation from the benchmark's cost function value, where

  PRD = \frac{F_{ref} - F_{alg}}{F_{ref}} \cdot 100\%,

where Fref is the reference criterion value from [10] for F ||Cmax and from [11] for F ||Csum, and Falg is the result obtained by the parallel scatter search algorithm. There were no situations where Fref = 0 for the benchmark tests.
– ttotal (in seconds) – the real time of executing the algorithm for the 50 benchmark instances from [13],
– tcpu (in seconds) – the sum of the times consumed on all processors for the 50 benchmark instances from [13].
Table 1. Values of ARPD for the parallel scatter search algorithm for the F ||Cmax problem (global model). The total number of iterations over all processors is 9600.
Processors 1 2 4 8 16 iter =9600 2 iter = 4800 iter = 2400 8 iter = 1200 iter = 600
20 × 5 20 × 10 20 × 20 50 × 5 50 × 10
0.000% 0.097% 0.039% 0.007% 0.345%
0.000% 0.060% 0.035% -0.001% 0.104%
0.000% 0.072% 0.061% -0.015% 0.113%
0.000% 0.131% 0.062% -0.001% 0.123%
0.096% 0.196% 0.136% 0.007% 0.272%
average
0.098%
0.029%
0.046%
0.063%
0.142%
ttotal (h:min:sec) tcpu (h:min:sec)
30:04:40 30:05:02
15:52:13 31:44:21
7:40:51 30:41:54
3:35:47 28:45:30
1:42:50 27:24:58
Table 2. Values of APRD for parallel scatter search algorithm for the F ||Cmax problem (independent model). The sum of iterations’s number for all processors is 9600. n×m
Processors 1 2 4 8 16 iter =9600 2 iter = 4800 iter = 2400 8 iter = 1200 iter = 600
20 × 5 20 × 10 20 × 20 50 × 5 50 × 10
0.000% 0.097% 0.039% 0.007% 0.345%
0.000% 0.080% 0.062% 0.000% 0.278%
0.000% 0.066% 0.048% 0.007% 0.148%
0.000% 0.039% 0.031% 0.007% 0.238%
0.096% 0.109% 0.031% 0.000% 0.344%
average
0.098%
0.084%
0.054%
0.063%
0.097%
ttotal (h:min:sec) tcpu (h:min:sec)
30:04:40 30:05:02
14:38:29 29:16:14
6:58:59 27:54:19
3:15:34 26:03:33
1:32:46 24:41:24
Table 3. Values of APRD for parallel scatter search algorithm for the F ||Csum problem (independent model). The sum of iterations’s number for all processors is 16000. n×m
Processors 1 2 4 8 16 iter =16000 2 iter = 8000 iter = 4000 8 iter = 2000 iter = 1000
20x5 20x10 20x20 50x5 50x10 average
0.000 0.000 0.000 0.904 0.913 0.363
0.007 0.000 0.000 1.037 0.986 0.406
0.000 0.000 0.000 0.906 1.033 0.388
0.006 0.000 0.000 0.903 0.989 0.380
0.016 0.000 0.000 0.933 1.110 0.412
ttotal (h:min:sec) tcpu (h:min:sec)
75:27:40 75:25:48
37:40:08 75:02:51
18:38:23 74:10:18
9:06:24 72:19:26
4:28:57 70:57:24
Table 4. Values of APRD for parallel scatter search algorithm for the F ||Csum problem (global model). The sum of iterations’s number for all processors is 16000. n×m
Processors 1 2 4 8 16 iter =16000 2 iter = 8000 iter = 4000 8 iter = 2000 iter = 1000
20x5 20x10 20x20 50x5 50x10
0.000 0.000 0.000 0.993 1.103
0.000 0.000 0.000 0.677 0.648
0.000 0.000 0.000 0.537 0.474
0.008 0.004 0.000 0.449 0.404
0.007 0.000 0.000 0.764 0.734
average
0.419
0.265
0.202
0.173
0.301
ttotal (h:min:sec) tcpu (h:min:sec)
75:23:44 75:20:42
41:19:51 77:57:57
23:28:19 75:46:07
14:30:03 74:38:51
7:23:50 73:13:35
Flow shop problem with makespan Cmax criterion. Tables 1 and 2 present the results of the parallel scatter search method for a total number of iterations (summed over all processors) equal to 9600. The cost of the computations, understood as the sum of the times consumed on all processors, is about 7 hours for all 50 benchmark instances of the flow shop problem. The best results (average percentage deviations from the best known solutions) are obtained by the 2-processor version of the global model of the scatter search algorithm (with communication), which is 70.4% better than the average 1-processor implementation (0.029% vs 0.098%). Because the time consumed on all processors is only a little longer than the time of the sequential version, we can say that the speedup of this version of the algorithm is almost linear. For the 4- and 8-processor implementations of the global model, and for the 2-, 4- and 8-processor implementations of the independent model, the average ARPD results are better than the ARPD of the 1-processor version, whereas the times consumed on all processors (tcpu) are shorter. So these algorithms obtain better results at a smaller computational cost – the speedup is superlinear. This anomaly can be understood as a situation in which the sequential algorithm traverses the solution space along a worse path than the one the parallel algorithm is able to choose. More about superlinear speedup can be found in the book by Alba [1].
these implementations (74:38:51 and 73:13:35, hours:minutes:seconds) was smaller than the total execution time of the sequential algorithm (75:20:42). Such a situation takes place only for the global model of the scatter search algorithm – independent searches are not as effective, both in results (ARPD) and in speedup. A new best solution was discovered for the flow shop problem with the Csum criterion during the computational experiments: the new upper bound for the tai50 instance is 88106 (the previous one was 88215, from [11]). Though it was not the purpose of this research, the results obtained by the proposed algorithm are on average only 0.05% worse (4 processors, independent model) than the best results for the Cmax problem, obtained by Nowicki and Smutnicki [9]. For the Csum problem the results are 0.17% worse (8 processors, also the independent model) than the best known ones, obtained by the algorithm of Reeves and Yamada [11].
4 Conclusions
An approach to the parallelization of scheduling algorithms for the flow shop problem has been described here. In multiple-thread search, represented here by a parallel scatter search, parallelization increases the quality of the obtained solutions while keeping the cost of computations comparable. Superlinear speedup is observed in the cooperative (global) model of parallelism. The parallel scatter search skeleton can easily be adapted to solve other NP-hard problems with a permutational solution representation, such as the traveling salesman problem (TSP), the quadratic assignment problem (QAP) or single machine scheduling problems.
References 1. Alba, E.: Parallel Metaheuristics. Wiley & Sons Inc., Chichester (2005) 2. Bo˙zejko, W., Wodecki, M.: Solving the flow shop problem by parallel tabu search. In: Proceedings of PARELEC 2004, pp. 189–194. IEEE Computer Society, Los Alamitos (2004) 3. Bo˙zejko, W., Wodecki, M.: Parallel genetic algorithm for the flow shop scheduling problem. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 566–571. Springer, Heidelberg (2004) 4. Garey, M.R., Johnson, D.S., Seti, R.: The complexity of flowshop and jobshop scheduling. Mathematics of Operations Research 1, 117–129 (1976) 5. Grabowski, J.: A new algorithm of solving the flow-shop problem, Operations Research in Progress, pp. 57–75. D. Reidel Publishing Company (1982) 6. Grabowski, J., Pempera, J.: New block properties for the permutation flow shop problem with application in tabu search. Journal of Operational Research Society 52, 210–220 (2000) 7. James, T., Rego, C., Glover, F.: Sequential and Parallel Path-Relinking Algorithms for the Quadratic Assignment Problem. IEEE Intelligent Systems 20(4), 58–65 (2005) 8. Nowicki, E., Smutnicki, C.: A fast tabu search algorithm for the permutation flow shop problem. European Journal of Operational Research 91, 160–175 (1996)
9. Nowicki, E., Smutnicki, C.: Some aspects of scatter search in the flow-shop problem. European Journal of Operational Research 169, 654–666 (2006) 10. OR-Library: http://people.brunel.ac.uk/∼ mastjjb/jeb/info.html 11. Reeves, C.R., Yamada, T.: Genetic algorithms, path relinking and the flowshop sequencing problem. Evolutionary Computation 6, 45–60 (1998) 12. Smutnicki, C.: Some results of the worst-case analysis for flow shop scheduling. European Journal of Operational Research 109(1), 66–87 (1998) 13. Taillard, E.: Benchmarks for basic scheduling problems. European Journal of Operational Research 64, 278–285 (1993) 14. Wang, C., Chu, C., Proth, J.: Heuristic approaches for n/m/F/ΣCi scheduling problems. European Journal of Operational Research, 636–644 (1997) 15. Wodecki, M., Bo˙zejko, W.: Solving the flow shop problem by parallel simulated annealing. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2001. LNCS, vol. 2328, pp. 236–247. Springer, Heidelberg (2002)
A Fast and Efficient Algorithm for Topology-Aware Coallocation
Valentin Kravtsov1, Martin Swain2, Uri Dubin1, Werner Dubitzky2, and Assaf Schuster1
1 Technion – Israel Institute of Technology, Technion City, 32000, Haifa, Israel
svali [email protected]
2 University of Ulster, Cromore Road, Coleraine, BT52 1SA, Northern Ireland, UK
Abstract. Modern distributed applications require coallocation of massive amounts of resources. Grid level allocation systems must efficiently decide where these applications can be executed. To this end, the resource requests are described as labeled graphs, which must be matched with equivalent labeled graphs of available resources. The coallocation problem described in the paper has real-world requirements and inputs that differ from those of a classical graph matching problem. We propose a new algorithm to solve the coallocation problem. The algorithm is especially tailored for medium to large grid systems, and is currently being integrated into the QosCosGrid system’s allocation module.
1 Introduction
The problem we are tackling is a maximal allocation of a labeled requests graph to a labeled offers 1 graph. This problem is also referred as graph matching. The allocation must satisfy both the constraints of the nodes (machines) and the constraints of the links (network). In our setup, the allocation can be nonoptimal in terms of the allocation size; however, all allocations must obey all computing and network constraints. The motivation for our work comes from real-world scientific applications, including complex systems simulations. Complex systems simulations include highly parallel applications such as large cellular automata; molecular dynamics simulations; combinations of coarse and fine-grained parallel applications, such as distributed evolutionary algorithms for optimizing parameters; techniques such as parallel tempering, where molecular dynamics simulations are combined with Monte Carlo algorithms; and agent-based models where both the frequency of communication between agents and the number of agents is highly variable and may change with time [1]. Such applications rely on the coallocation of large numbers of reliable resources. This requirement has traditionally been met by supercomputing facilities, but some applications researchers are now looking to computational grids as a more economic computing resource. 1
In this paper we use the terms “offers” and “available machines” interchangeably, assuming that only available machines are offered by resource providers.
Fig. 1. A parallelized agent-based model (left) with the agent interactions represented as a graph (right)
Quasi-opportunistic supercomputing is a new approach, designed to enable the execution of demanding parallel applications on massive nondedicated resources in grid environments [2]. Fig. 1 shows an approach for parallelizing an agent-based model in which each agent interacts with others within a certain distance and must be aware of the agents that are within that distance. In Fig. 1, on the left, each black dot represents an agent, the light gray circle represents the distance for definite interactions, and the outer circle indicates possible future interactions. These interactions can be represented by a graph, as shown on the right, and it is this graph which depicts the properties and the topology of the required resources. The matching methods in the literature can be divided into two broad categories: the first contains exact matching methods that require a strict correspondence between the two objects being matched, or at least between their subparts. Algorithms that solve these problems for general graphs are exponential in the worst case. The second category comprises inexact matching methods, where a matching can occur even if the two graphs being compared are structurally different, relaxing the given constraints to some extent [3]. Our case can be seen as a mixture of both categories: as in exact matching, we must not violate any of the constraints, but as in inexact matching, nonoptimal allocation sizes are permissible. Forgoing this optimality requirement allows us to provide an efficient algorithm for resource coallocation, which in practice delivers results that are reasonably close to the optimum. The problem described above can be very hard to solve in real grids, even with heuristic algorithms. As real-world grids may consist of thousands of machines, and we are planning to simultaneously allocate hundreds to thousands of jobs, the number of edges in the offers and requests graphs might be of the order of 10^6. Thus, even light heuristic algorithms which are linear in the product of the numbers of edges are almost useless when dealing with a computation time of O(10^12). To reduce the problem complexity, we propose a simplified version, which we call the clustered topology-aware coallocation problem (CTAAP). In this problem, the offered machines are aggregated into a relatively small number of
homogeneous clusters. Each cluster contains identical machines, interconnected by identical links. This formulation does not account for the differences between the machines in the clusters, but it significantly reduces the problem size. Unfortunately, even the reduced problem is still NP-complete, with no approximation available. In this paper we propose a new heuristic algorithm, the CTAAP-Solver, which solves the CTAAP problem. In our solution, we execute graduated assignment graph matching [4] once and use its output as a starting point for a greedy search procedure. During this greedy search, we repeatedly execute an algorithm for weighted bipartite graph matching, steering it towards a feasible solution that does not violate any constraint. This paper is organized as follows. Related work is summarized in Section 2. In Section 3 we discuss the problem definition and its intractability; in this section we also formalize the problem and give the details of our CTAAP-Solver algorithm and its complexity. Experimental results are given in Section 4.
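A minimal sketch of the aggregation step that turns a flat list of offered machines into homogeneous clusters with capacities; the attribute set (cpu, memory, cluster id) is an illustrative assumption of this sketch.

    from collections import defaultdict

    def cluster_offers(machines):
        # machines: iterable of (cpu, memory, cluster_id); identical machines of a cluster are grouped
        clusters = defaultdict(int)
        for cpu, mem, cid in machines:
            clusters[(cid, cpu, mem)] += 1
        # one entry per homogeneous cluster, with its capacity (number of available machines)
        return [{"cluster": cid, "cpu": cpu, "memory": mem, "capacity": cap}
                for (cid, cpu, mem), cap in clusters.items()]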
2 Related Work
Exact matching. Most of the algorithms for exact graph matching are based on some form of tree search with backtracking. The first important algorithm of this family was published by Ullmann [5] in 1976. Ullmann's algorithm is widely known and, despite its age, it is still widely used and is probably the most popular graph matching algorithm. A more recent algorithm for both isomorphism and subgraph isomorphism is the VF algorithm by Cordella et al. [6]. The authors define a heuristic that is based on the analysis of the sets of nodes adjacent to the ones already considered in the partial mapping. This heuristic is quick to compute, leading to a significant improvement over Ullmann's and other algorithms in many cases. However, the worst-case running time of Ullmann's algorithm is Θ(N!N^3), and that of the VF algorithm is Θ(N!N).
Inexact matching. Tree search with backtracking can also be used for inexact matching. In [7], the A* algorithm is used with a fast and simple heuristic that takes into account only the future cost of unmatched nodes. A radically different approach is to cast graph matching, which is inherently a discrete optimization problem, into a continuous, nonlinear optimization problem. One of the pioneering works of this approach is that of Fischler and Elschlager [8]. In [9], a new matching algorithm based on a probabilistic relaxation framework is proposed, which introduces the definition of a Bayesian graph edit distance. Gold and Rangarajan [4] presented the graduated assignment graph matching (GAGM) algorithm. In this algorithm a technique known as graduated nonconvexity is employed to avoid poor local optima [3]. However, the inexact matching algorithms that we are aware of do not guarantee that constraints will not be violated.
3 3.1
277
The Topology-Aware Coallocation Algorithm Topology-Aware Coallocation: Definition and Analysis
Our coallocation model assumes an "à la operating systems" scheduling system, meaning that the time axis is divided into discrete (potentially long and of varying length) time slots, and the decision about which processes to execute in a certain time slot is made repeatedly by an allocation management system. In this paper, we consider only a certain time slot in which the quantitative values of computing resources are assumed to be constant. Thus we ignore the time index in the following discussion. The mathematical model will be defined in the next subsection, while here we will discuss the intractability of the presented problem. Matching that does not account for link constraints – known also as bipartite graph matching – is a well-studied problem [10], with a variety of efficient (polynomial) solving algorithms [11], [12]. However, matching that takes into account the links between the nodes, which is a general graph-matching problem, becomes NP-hard. Even the simplified form of the problem defined above – the CTAAP, where offered computing machines are grouped into homogeneous clusters – is still NP-complete, with no approximation, even for a constant number of clusters. This can be shown by reducing the independent set (IS) problem to the CTAAP. IS is defined as follows:

– Input: Graph G = (V, E) and a positive integer k.
– Question: Is there a subset V′ ⊆ V of size k such that no two vertexes in V′ are joined by an edge in E?

The reduction is as follows: given a graph G of n vertexes, we will treat it as a requests graph. We will create an offers graph with two clusters, one of size k with no links between the nodes, and another cluster with n − k nodes, all of which are interconnected and also connected to all the nodes of the first cluster. There is a solution of CTAAP that can allocate all the requests to offers iff there is a solution to the IS problem. Not only is IS NP-complete [13], but, as was shown in [14], no polynomial time algorithm can approximate it within a factor of n/2^((log n)^(1−ε)) for any ε > 0, unless NP = ZPP. Given that fact, it is clear that CTAAP cannot be solved even approximately by polynomial time algorithms.
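The reduction can be made concrete with a few lines of code. The following C sketch builds the two-cluster offers side for a given graph G and target size k; the sample graph, sizes and struct names are made up for illustration, and the code only restates the construction above in executable form.

#include <stdio.h>

#define MAXN 64

/* Offers side of the constructed CTAAP instance (m = 2 clusters). */
struct offers {
    int cap[2];        /* capacities: k and n-k                              */
    int bhat[2][2];    /* cluster-to-cluster connectivity (1 = link, 0 = no) */
};

/* Given |V(G)| = n and the target independent-set size k, build the offers
 * graph used by the reduction; G itself (with unit node and link demands)
 * is used unchanged as the requests graph.                                 */
static struct offers reduce_is_to_ctaap(int n, int k)
{
    struct offers o;
    o.cap[0] = k;                        /* cluster 0: k isolated machines   */
    o.cap[1] = n - k;                    /* cluster 1: fully interconnected  */
    o.bhat[0][0] = 0;                    /* no links inside cluster 0        */
    o.bhat[0][1] = o.bhat[1][0] = 1;     /* cluster 1 reaches cluster 0      */
    o.bhat[1][1] = 1;
    return o;
}

int main(void)
{
    /* Requests graph: a 4-cycle with unit bandwidth on every edge; we ask
     * whether an independent set of size k = 2 exists.                      */
    int n = 4, k = 2, edges = 0;
    int b[MAXN][MAXN] = {{0}};
    b[0][1] = b[1][0] = b[1][2] = b[2][1] = 1;
    b[2][3] = b[3][2] = b[3][0] = b[0][3] = 1;
    for (int i = 0; i < n; i++)
        for (int j = i + 1; j < n; j++)
            edges += b[i][j];

    struct offers o = reduce_is_to_ctaap(n, k);
    printf("requests: %d tasks, %d unit-bandwidth links\n", n, edges);
    printf("offers: clusters of capacity %d (isolated) and %d (connected)\n",
           o.cap[0], o.cap[1]);
    printf("all tasks allocatable  <=>  G has an independent set of size %d\n", k);
    return 0;
}

Any allocation that places all n tasks must put exactly k of them into the isolated cluster, and those k tasks can carry no edge between them, which is precisely the independent-set condition.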
3.2 Clustered Topology-Aware Coallocation: Model Formalization
Specifying the topology request. The request for n tasks is presented as a graph GR = (V, E), where |V | = n and vi denotes a request for a single resource (machine). A positive vector C = [c1 , c2 , . . . , cn ] represents request properties, where ci denotes the minimal quantitative properties for the required computational resource vi (e.g. FLOPS). Different quantitative properties might be described by multiple property vectors. For example, if vector C represents the minimal
CPU requirements, then the minimal memory requirements of the n tasks are represented by the positive vector M = [m_1, m_2, ..., m_n]. The properties of edges e ∈ E are represented by an n-by-n adjacency matrix B, where b_ij refers to the connectivity level between a user's tasks v_i and v_j. Usually, this matrix is symmetric, and ∀i, 1 ≤ i ≤ n: b_ii = 0. Matrix B represents the communication bandwidth between tasks as estimated by the user.

Specifying the resource offer. Analogously, an offer of m clusters of identical machines is denoted as a graph Ĝ_O = (V̂, Ê), where |V̂| = m. The individual properties of the identical machines in the clusters are denoted as Ĉ = [ĉ_1, ĉ_2, ..., ĉ_m] (CPU) and M̂ = [m̂_1, m̂_2, ..., m̂_m] (memory). A capacity vector CÂP = [câp_1, câp_2, ..., câp_m] denotes the number of available machines in each cluster; an m-by-m adjacency matrix B̂ represents the edges' properties (e.g., the currently available communication bandwidth within and between the m clusters in the grid), assuming identical connectivity properties between all the machines in each cluster. In the offers graph there are usually self-loops. The self-loop of the node v̂_i denotes the connectivity level between the machines in cluster i: b̂_ii ≠ 0. The bandwidth could be estimated between two adjacent (physically connected) clusters or between two distant but connected clusters using maximum flow techniques.

The allocation matrix. We are interested in finding the n-by-m allocation matrix X, in which the term x_ij = 1 represents an allocation of a requested task v_i to an offered resource v̂_j. Several constraints must hold for a correct coallocation:

    ∀i, 1 ≤ i ≤ n:   Σ_{j=1}^{m} x_ij ≤ 1,                                    (1)
denoting that one requested task can be mapped to at most one offered resource;

    ∀j, 1 ≤ j ≤ m:   Σ_{i=1}^{n} x_ij ≤ câp_j,                                (2)

denoting that an offered cluster j can serve at most câp_j tasks;

    ∀i, j, 1 ≤ i ≤ n, 1 ≤ j ≤ m:   x_ij c_i ≤ ĉ_j  ∧  x_ij m_i ≤ m̂_j,        (3)

denoting that the individual (computation/memory) properties of a request must fit the properties of the matched offer;

    ∀i, k, j, l, 1 ≤ i, k ≤ n, 1 ≤ j, l ≤ m:   x_ij b_ik x_kl ≤ b̂_jl,         (4)

denoting that the pairwise (connectivity) properties of any pair of requests must fit the properties of the matched offers pair; and

    ∀i, j, 1 ≤ i ≤ n, 1 ≤ j ≤ m:   x_ij ∈ {0, 1},                             (5)
denoting that the decision is binary, where 1 indicates an allocation of a requested task v_i to an offered resource v̂_j. Different objective functions will express different "global welfare" schemas. Here we are interested in maximizing the system utilization:

    max  Σ_{i=1}^{n} Σ_{j=1}^{m} x_ij.
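For reference, the five constraints translate directly into a feasibility check. The following C sketch (array sizes and the toy instance in main are made up for illustration) verifies constraints (1)-(5) for a candidate allocation matrix; it illustrates the model above and is not part of the CTAAP-Solver implementation.

#include <stdio.h>

#define N 3   /* requested tasks  */
#define M 2   /* offered clusters */

static int feasible(int x[N][M],
                    int c[N],  int m[N],              /* request properties  */
                    int ch[M], int mh[M],             /* offer properties    */
                    int cap[M],
                    int b[N][N], int bh[M][M])
{
    for (int i = 0; i < N; i++) {                 /* (1) at most one offer   */
        int row = 0;
        for (int j = 0; j < M; j++) {
            if (x[i][j] != 0 && x[i][j] != 1) return 0;    /* (5) binary     */
            row += x[i][j];
        }
        if (row > 1) return 0;
    }
    for (int j = 0; j < M; j++) {                 /* (2) cluster capacity    */
        int col = 0;
        for (int i = 0; i < N; i++) col += x[i][j];
        if (col > cap[j]) return 0;
    }
    for (int i = 0; i < N; i++)                   /* (3) per-node properties */
        for (int j = 0; j < M; j++)
            if (x[i][j] && (c[i] > ch[j] || m[i] > mh[j])) return 0;
    for (int i = 0; i < N; i++)                   /* (4) pairwise bandwidth  */
        for (int k = 0; k < N; k++)
            for (int j = 0; j < M; j++)
                for (int l = 0; l < M; l++)
                    if (x[i][j] && x[k][l] && b[i][k] > bh[j][l]) return 0;
    return 1;
}

int main(void)
{
    /* Three unit tasks forming a path, two clusters of capacity 2 each. */
    int c[N]   = {1, 1, 1}, m[N] = {1, 1, 1};
    int ch[M]  = {2, 2},    mh[M] = {2, 2},   cap[M] = {2, 2};
    int b[N][N]  = {{0,1,0},{1,0,1},{0,1,0}};
    int bh[M][M] = {{1,1},{1,1}};
    int x[N][M]  = {{1,0},{1,0},{0,1}};
    printf("allocation %s all constraints\n",
           feasible(x, c, m, ch, mh, cap, b, bh) ? "satisfies" : "violates");
    return 0;
}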
3.3 An Algorithm for Clustered Topology-Aware Coallocation
Our algorithm consists of three procedures. The first one finds a weights matrix X by executing a modified version of the graduated assignment graph matching [4] algorithm. Matrix X contains values between 0 and 1, which denote the "profitability" of each allocation X_ij. An extra row and column are added to hold the slack variables (this augmented matrix is denoted by X̂). By incorporating slack variables, the graph matching algorithm can handle outliers (spurious or missing nodes or links) in a statistically robust manner. As β is constantly increased, only one number in each row and up to câp_j numbers in each column approach 1, while all the others approach 0.

Input: vectors C, M, Ĉ, M̂, CÂP, matrixes B, B̂, edge compatibility function F(e_r, e_o) → R | e_r ∈ G_R, e_o ∈ G_O
Output: weights matrix X

β ← β_0;  X̂ ← 1 + ε;
while β ≤ β_f do
    while X does not converge AND #iterations ≤ I_0 do
        Q_ij ← (c_i > ĉ_j ∨ m_i > m̂_j) ? 0 : Σ_{k=1}^{N} Σ_{l=1}^{M} X̂_kl F(e_ik, e_jl);
        X̂_ij ← exp(β Q_ij);
        while X̂ does not converge AND #iterations ≤ I_1 do
            X̂_ij ← X̂_ij / Σ_{k=1}^{M+1} X̂_ik;                        // update X̂ by normalizing across rows
            X̂_ij ← min(1, X̂_ij / Σ_{k=1}^{N+1−(câp_j−1)} smallest_k);  // normalizing across columns,
                                                                        // smallest_k stands for the k-th smallest element in column j
        end
    end
    β ← β · β_r;
end
return X;

Algorithm 1. Step 1 – inexact graph matching

In the second procedure we address equations 1-3 and 5 only (i.e., computational and capacity constraints). Discarding equation 4, we have an instance of a weighted bipartite matching problem, modeled as follows: G′ = (V′, E′), where V′ = V ∪ V̂, and E′ = {(v_i, v̂_j) | v_i ∈ V ∧ v̂_j ∈ V̂ ∧ c_i ≤ ĉ_j ∧ m_i ≤ m̂_j}, where we use weights computed by procedure 1: w((v_i, v̂_j)) = X_ij. To solve the
maximum weighted bipartite problem, we use a slightly modified version of the LEDA implementation [15]. The resulting allocation suits constraints 1-3 and 5 but might violate constraint 4.

Input: vectors C, M, Ĉ, M̂, CÂP, weights matrix X
Output: allocation matrix A

A ← solve_maximum_weighted_bipartite_matching(C, M, X, Ĉ, M̂, CÂP);
return A;

Algorithm 2. Step 2 – weighted bipartite graph matching

In the last procedure, we have to make sure that no connectivity constraints were violated by procedure 2. To do so, we analyze all the allocation pairs (X_ij and X_kl), counting how many connectivity-violating allocation pairs each allocation X_ij appears in. If no connectivity violations were detected, the algorithm terminates. Otherwise, the "worst" allocation X_ij (the one that appears in the most violating pairs) is removed from the allocation matrix, from then on forcing X_ij = 0, and step 2 is repeated.

Input: allocation matrix A, vectors C, M, Ĉ, M̂, CÂP and matrixes B, B̂
Output: final allocation matrix

problem_cost ← ZERO_MATRIX;
curr_alloc ← {(i, j) | i ∈ {1..n} ∧ j ∈ {1..m} ∧ A_ij = 1};
forall (i, j) ∈ curr_alloc do
    forall (k, l) ∈ curr_alloc ∧ (k, l) ≠ (i, j) do
        if B_ik > B̂_jl then
            // Allocation violates edge constraints
            problem_cost_ij++;
            problem_cost_kl++;
        end
    end
end
if problem_cost is ZERO_MATRIX then
    return A;
else
    (i, j) ← index of the biggest number in problem_cost;
    A_ij ← 0;
    Go To Procedure 2;
end

Algorithm 3. Step 3 – cleanup
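As a concrete illustration, the cleanup step can be rendered as a short C routine that counts, for every current allocation, the number of connectivity-violating pairs it participates in and reports the worst one. The routine and the toy instance in main are illustrative only and not the authors' implementation.

#include <stdio.h>

#define N 3   /* tasks    */
#define M 2   /* clusters */

/* Returns the flat index i*M+j of the allocation appearing in the most
 * violating pairs, or -1 when no pair violates the bandwidth constraint. */
static int worst_allocation(int a[N][M], int b[N][N], int bh[M][M])
{
    int cost[N][M] = {{0}};
    int worst = -1, worst_cost = 0;

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (a[i][j])
                for (int k = 0; k < N; k++)
                    for (int l = 0; l < M; l++)
                        if (a[k][l] && !(k == i && l == j) && b[i][k] > bh[j][l])
                            cost[i][j]++;      /* this pair violates (4) */

    for (int i = 0; i < N; i++)
        for (int j = 0; j < M; j++)
            if (a[i][j] && cost[i][j] > worst_cost) {
                worst_cost = cost[i][j];
                worst = i * M + j;
            }
    return worst;
}

int main(void)
{
    /* Tasks 0 and 1 need bandwidth 2, but clusters 0 and 1 are linked with 1. */
    int a[N][M]  = {{1,0},{0,1},{0,1}};
    int b[N][N]  = {{0,2,0},{2,0,0},{0,0,0}};
    int bh[M][M] = {{2,1},{1,2}};
    int w = worst_allocation(a, b, bh);
    if (w < 0) printf("no connectivity violations\n");
    else       printf("drop allocation (task %d, cluster %d) and rerun step 2\n",
                      w / M, w % M);
    return 0;
}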
3.4 Algorithm Complexity
In order to analyze the complexity of the entire algorithm, we will analyze each one of its three steps. Here we will assume that the number of clusters in the grid is M and the number of jobs is N .
Step 1: the normalization across rows takes O(N^2 M^2), while the normalization across columns takes O(N^2 M^2 log N). The overall complexity of step 1 is O(N^2 M^2 log N).
Step 2: the constructed graph G′ has O(N + M) vertexes and O(NM) edges. Using the algorithm proposed in [15], the overall complexity of this step is O((N + M)^2 log(N + M) + NM(N + M)) = O(N^2 log(N + M) + N^2 M), assuming that N ≫ M.
Step 3: as there is a maximal number of N allocations, the analysis of the correctness of all given allocation pairs takes O(N^2) time.
Overall: In the worst case, steps 2 and 3 are repeated N·M times; thus the overall time complexity in the worst case is O(N^3 M log(N + M) + N^3 M^2). However, we expect the average performance to be much better. It is also important to note that the algorithm is polynomial in the number of clusters only, regardless of the number of actual machines in those clusters.
4 Experimental Results
In order to evaluate the performance of the CTAAP-Solver algorithm, we performed a series of experiments to estimate both its quality and speed. The results for the CTAAP-Solver algorithm are compared to the optimal results calculated by an integer programming technique. The integer programming model that we used is based on the five equations listed above. The fourth, quadratic equation was replaced by a series of equivalent linear equations. The system of five equations, including the modified equation 4, was fed into the integer programming solvers GLPK [16] and CBC [17], which provided an optimal solution. The following values for the constants were used in all the experiments: β_0 = 0.5, β_f = 10, β_r = 1.075, I_0 = 4, and I_1 = 30. Fig. 2 describes the results of the first experiment, in which the CTAAP-Solver algorithm results are compared with the optimal solution. The requests graph of 50 nodes has computing and network properties set to random values in the range of 1...100 (all the random numbers mentioned in this text have a uniform distribution in the given range). The offers graph consisted of 5 homogeneous clusters, each with a random capacity in the range of 1...11. In five successive tests, each composed of 100 independent runs, we increased the offered properties ranges from 1...100 to 1...200, then to 1...300, etc. The results depicted in Fig. 2 show that as the chances of a single request to be mapped to a single available resource increase, our algorithm performs better. The "range" itself is of no importance: a request can be mapped to an offer iff the offer's properties are not lower than the request's properties. Only the order of the requests and offers is important, and not the values themselves. In the first point of the graph, the chances of a single requested machine to be mapped to a specific available machine are 50.5% (both offer and request are integers, randomized in a range of 1...100). In the second test, these chances increase
Fig. 2. The success rate of the CTAAP-Solver algorithm
(as the offer is randomized in the range of 1...200, but the request is still randomized in the range of 1...100) and thus become 1/2 + 1/2 · 0.505 ≈ 75%, while in the third experiment they are 2/3 + 1/3 · 0.505 ≈ 83%, and so on. Another experiment, the results of which are given in Fig. 3, compares the runtime of the CTAAP-Solver algorithm with the runtime of one of the best open-source integer programming solvers – CBC2 [17]. In this experiment the size of the requests and offers graphs was constantly increased. The computing and network properties were random values in the range of 1...100. The offers graph consisted of 5 homogeneous clusters, each with a random capacity between 1 and 1/5 of the size of the requests graph. The computing and network properties of the offers were random numbers in the range of 1...200.
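The mapping probabilities quoted above are easy to verify by exhaustive enumeration. The short C program below computes P(offer ≥ request) for a request drawn uniformly from 1...100 and offers drawn uniformly from increasingly large ranges; it reproduces the 50.5%, 75% and 83% figures.

#include <stdio.h>

/* Exact probability that a uniform offer in 1..r_off satisfies a uniform
 * request in 1..r_req, computed by enumerating all pairs.                 */
static double p_mappable(int r_req, int r_off)
{
    long ok = 0;
    for (int req = 1; req <= r_req; req++)
        for (int off = 1; off <= r_off; off++)
            if (off >= req)
                ok++;
    return (double)ok / ((double)r_req * r_off);
}

int main(void)
{
    for (int scale = 1; scale <= 5; scale++)
        printf("offers in 1..%d: P(offer >= request) = %.1f%%\n",
               scale * 100, 100.0 * p_mappable(100, scale * 100));
    return 0;
}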
Fig. 3. Left: a comparison of the runtime of the CTAAP-Solver algorithm vs. CBC2 optimized exhaustive search. Right: the runtime of the CTAAP-Solver algorithm.
5 Conclusions
Here we have presented a new algorithm that provides a fast and efficient solution to the topology-aware coallocation problem. This algorithm is currently used as an important building block in the QosCosGrid scheduling system.
Acknowledgments. The work in this paper was supported by EC grant QosCosGrid IST FP6 STREP 033883.
References

1. Charlot, M., et al.: The QosCosGrid project: Quasi-Opportunistic Supercomputing for Complex Systems Simulations. Description of a general framework from different types of applications. In: Ibergrid Conference, Centro de Supercomputacion de Galicia (GESGA) (2007)
2. Kravtsov, V., Carmeli, D., Schuster, A., Yoshpa, B., Silberstein, M., Dubitzky, W.: Quasi-Opportunistic Supercomputing in Grids, Hot Topic Paper. In: IEEE International Symposium on High Performance Distributed Computing, Monterey Bay, California, USA (2007)
3. Conte, D., Foggia, P., Sansone, C., Vento, M.: Thirty Years of Graph Matching in Pattern Recognition. International Journal of Pattern Recognition and Artificial Intelligence 18(3), 265–298 (2004)
4. Gold, S., Rangarajan, A.: A Graduated Assignment Algorithm for Graph Matching. IEEE Trans. Pattern Anal. Mach. Intell. 18(4), 377–388 (1996)
5. Ullmann, J.R.: An algorithm for subgraph isomorphism. J. Assoc. Comput. Mach. 23, 31–42 (1976)
6. Cordella, L.P., Foggia, P., Sansone, C., Tortorella, F., Vento, M.: Graph matching: a fast algorithm and its evaluation. In: 14th Int. Conf. Pattern Recognition, pp. 1582–1584 (1998)
7. Gregory, L., Kittler, J.: Using graph search techniques for contextual colour retrieval. In: Joint IAPR Int. Workshops SSPR and SPR, pp. 186–194 (2002)
8. Fischler, M., Elschlager, R.: The representation and matching of pictorial structures. IEEE Trans. Computing 22, 67–92 (1973)
9. Myers, R., Wilson, R.C., Hancock, E.R.: Bayesian graph edit distance. IEEE Trans. Patt. Anal. Mach. Intell. 22, 628–635 (2000)
10. Lovasz, L., Plummer, M.D.: Matching Theory. Elsevier Science Publishing Company, New York (1986)
11. Blum, N.: A Simplified Realization of the Hopcroft-Karp Approach to Maximum Matching in General Graphs. Univ. of Bonn, Computer Science V, 85232-CS (2001)
12. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms, The Bellman-Ford algorithm, pp. 588–592. MIT Press and McGraw-Hill, New York, USA (2001)
13. Pardalos, P.M., Xue, J.: The maximum clique problem. Journal of Global Optimization 4(3), 301–328 (1994)
14. Khot, S.: Improved Inapproximability Results for MaxClique, Chromatic Number and Approximate Graph Coloring. In: Proceedings of the 42nd IEEE Symposium on the Foundations of Computer Science, p. 600, Washington, DC, USA (2001)
15. Mehlhorn, K., Naeher, S.: The LEDA Platform of Combinatorial and Geometric Computing. Cambridge University Press, Cambridge (1999)
16. Andrew, M.: GNU Linear Programming Kit, Version 4.22, http://www.gnu.org/software/glpk/glpk.html
17. Bonami, P., et al.: An Algorithmic Framework For Convex Mixed Integer Nonlinear Programs. Research Report RC 23771, IBM T. J. Watson Research Center, Yorktown, USA (2005)
View-OS: A New Unifying Approach Against the Global View Assumption

Ludovico Gardenghi¹, Michael Goldweber², and Renzo Davoli¹

¹ Dept. of Computer Science, University of Bologna, Bologna, Italy
  {garden,renzo}@cs.unibo.it
² Dept. of Mathematics and Computer Sciences, Xavier University, Cincinnati, OH
  [email protected]
Abstract. One traditional characteristic of operating systems is that all the processes share the same view of the environment. This global view assumption (GVA) means that for processes running on the same computer, the same pathname points to the same file, the processes share the same network stack and therefore the same IP addresses, the routing characteristics are identical, etc. There have been many proposals for “bending” the GVA for either individual processes or for the system as a whole. Some of these proposals include microkernels or specialized virtual machines. Most proposals are for system administrators, others are tailored to specific applications. A View-OS is our unifying solution for altering the GVA. It allows a user to partially or completely redefine the behavior of an arbitrary subset of the system calls called from his processes, thus altering his view of the environment in terms of file system, communication, devices, access control etc. We have implemented it with a system-call, partial, modular virtual machine called *MView. Each divergence from the standard view may be implemented in a specific module. Hence instead of always having to load a complete kernel (e.g. Usermode Linux), the overhead of a per-process definition of the environment depends on the degree of divergence from the standard global view.
1 Introduction: A Change in Perspective
Modern operating systems make processes run inside “sandboxes” that isolate them and provide protection with respect to other processes and resources. When a process needs a system service (e.g. more memory, new processes, communication with existing ones, access to the file system or to the network, I/O with devices) the boundary of these sandboxes is crossed via a system call. Hence from the perspective of a process, the set of system calls made available by the operating system is the only “window” through which a process can “see the
world” beyond itself. All possible interaction between processes and resources is mediated through this facility. Actually, other kinds of communication are available. For instance, in a UNIX-like system, processes may use shared memory or send/receive signals. Shared memory allocation, as well as signals, however, are only made available through system calls. Similarly, the structure and content of file systems as well as communication with devices, other processes, and other hosts can be defined exclusively in terms of “what responses are given to the process by system calls.” This set of answers represents the view of the world for the process. All processes whose system calls are serviced by the same kernel, therefore, share the same view—the global view assumption, or GVA. Consider the following: a process running on system A has all its system calls answered by the kernel running on system B. This process would work the same as if it were really running on B: it would see B’s file system, its network address and routing, and so on. The view is provided to the process by system B’s kernel. We define a system where each process is allowed to create its own unique view or perspective as a View-OS [1] [2]. In a View-OS each process makes use of the services and resources offered by the kernel to define its own process-specific view. For instance, an individual view may contain:

– A modified view of the file system from that shown by the kernel. This might include added, removed or changed subsets, different permissions, or be physically deployed on remote hosts.
– New network interfaces with ad-hoc addresses, hidden addresses, or different routing and filtering rules.
– More generally, a different set of visible and accessible devices, each with its own set of permissions and semantics. Notably, both the permissions and semantics may be different from those provided by the kernel.

This is accomplished in a View-OS by providing each user-mode process with the capability to redefine system call behavior or even to define new system calls.
1.1 Security Issues
A View-OS, like any traditional global view operating system, must insure that there is no danger to system security. Process views cannot be constructed without regard to security issues. A process can build a view only by relying on the set of resources the kernel would “normally” make available to it. No global changes to the system can be made while inside a personal view if those changes are not allowed under normal conditions. Key to the View-OS concept is that a process could however see local changes as if they were global ones. Some examples. Disk image mounting. Assume a user owns a disk image file. This user is able to read its contents. Mounting it as a file system is nothing but interpreting the contents according to some file system structure. In a View-OS, the user could mount the image inside a personal view of the file system namespace.
Remote file systems. Similarly, a user may want to connect a remote subtree from another host to the local file system namespace, using some network transport. As long as the user has enough privileges on the network and on the remote subtree, there is no reason for not allowing the user to extend her personal perspective of the local host with a new portion of file system. The remainder of this paper is as follows. In section 2 we provide an overview of tools, models and architectures that have influenced View-OS, pointing out similarities and differences between them and our project. In section 3 we expand on the basic ideas of a View-OS, focusing on the fundamental concepts of the proposal. Section 4 closes the paper with some observations on using our *MView View-OS in the field, our conclusions and future directions for this project.
2 Other Models and View-OS
In this section we present a comparison between View-OS and a selection of tools, models and techniques for virtualization. In particular our goal is to illustrate how specific aspects of each such tool/technique can be captured by a View-OS. Virtual machines [3] [4] (VMs) are the oldest[5] and most used tools able to change the perspective of processes. Processes running inside a virtual machine have the same view they would have if running on a different system. VMs may virtualize physical architectures[6] [7] [8] [9] [10] [11] [12] [13] [14] or abstract ones[15], at various levels (typically a whole hardware architecture is emulated, thus allowing the user to install and run an operating system and all its applications). The important point here is that the perspective is completely changed with respect to the real, underlying one; moreover, it is shared by all the processes inside the VM. A View-OS allows a greater degree of flexibility: if needed, the perspective change may affect only a subset of the processes and then only a subset of their views. Moreover, a View-OS offers a lighter approach when it is not necessary to boot an entire operating system. In a paravirtualization, the VM monitor is a light layer called the hypervisor. As with virtual machines, each virtualized system (or domain) needs to boot a kernel; the hypervisor just provides scheduling between domains. The shared devices are managed by a specific, privileged “Domain 0.” Paravirtualization is the key idea of Xen[16]. Xen provides good support for device drivers, inherited from those running in Domain 0. Since Xen implements entire virtual machines, the VM management must be done by root. As with “classical” virtual machines, a View-OS can be used to overcome the issues addressed by paravirtualization while sharing the concept that device drivers and existing applications can be inherited from existing systems or lower (possibly virtual) layers. In microkernel systems, requests made to the kernel are sent as messages to specific servers. Each server is responsible for a specific task (e.g. file system management, network, memory). In a typical microkernel each server manages requests for every process, thus giving the usual global view. However, it is possible to have different servers for different groups of processes, thus creating
more than one perspective. This is where microkernels and a View-OS are similar. There are, however, two very important differences.

– As stated by the name, a microkernel is a kernel architecture. Its aim is the same as a monolithic kernel: providing services to processes. Using a pool of servers instead of a single, bigger process is only a matter of cleaner design and failure isolation. Usually, processes are not able to start new, personal servers. Server management is a privileged action that can be only performed by the kernel.
– Microkernels come as completely new and different operating systems. They have their set of system calls, their device drivers, their ABI, etc. This is a serious obstacle toward the effective usability of microkernel prototypes: the usually limited hardware compatibility and the need to port existing applications to the new system (or to write new ones from scratch) greatly reduce the user base.

Alternatively, View-OS concepts, as demonstrated by our working prototypes, may be implemented gradually on existing systems, thus keeping binary compatibility with existing applications and relying on existing operating systems for device drivers. The Exokernel [17] architecture is an attempt to loosen the tight link between interface and implementation of the various abstractions given by the operating system. Its approach is to move the physical resource management into user space (or application level) and provide a low-level interface from a minimal kernel to “untrusted” libraries. A View-OS also may allow this. Provided that a user process has enough authorizations to access a device in raw mode, it may exploit View-OS by inserting a personal, custom driver between itself and the device. The View-OS approach adds these capabilities on top of an existing kernel, allowing the simultaneous usage of some kernel and some user device drivers, together with a gradual migration from the former to the latter depending on the availability of new user-level drivers. Plan 9 [18] was a very important research project by Bell Labs. In Plan 9 each process has its own name space. A process can change, add, or remove entities from its name space without affecting other processes. Everything in Plan 9 is accessible through names: networking, file system, GUI windows, etc. Thus a change in the name space implies a change in the process view. Unfortunately, Plan 9 has its own kernel, thus the number of supported architectures and available device drivers is quite limited. Moreover, since Plan 9 has a very different system model from other modern operating systems, the porting of applications from other operating systems to Plan 9 is often difficult. A View-OS provides processes with some of the features of Plan 9 while retaining compatibility with standard (Linux) kernels and applications. System Call Interposition. This technique is used to monitor every system call generated by a process. Its main goal is to create a sort of “jail,” “sandbox,” or “guarded environment” for process execution, and to keep track of potential malicious activities by processes. System Call Interposition is often used only to deny dangerous operations (e.g. access to sensitive data). Blocking or denying
system calls becomes a trivial sub-case of system call redefinition and, thus, easy to implement with a View-OS. FUSE [19] is a mechanism, available in recent Linux releases, that allows the kernel to rely on user-mode programs for file system support. It suffers from the GVA, as file systems are mounted with the usual mount semantics and affect the whole system. For this reason, this special purpose virtualization is commonly restricted from users or allowed with specific limitations. A View-OS provides the same features as FUSE, with the additional ability of limiting the visibility only to a subset of processes, and allowing regular users to mount real, virtual, local or remote file systems with no interference between each other. Moreover, our implementation has source-level compatibility with existing FUSE modules. “Minor” partial virtualities. There are a number of classic and well-known tools from UNIX-like systems that are good examples of specific applications which modify the GVA using different techniques: chroot, a system call operating on the filesystem namespace; fakeroot, a user-space access control hack based on dynamic library preloading; /dev/tty and /proc/self, files with a different meaning for each process. This enumeration, while far from complete, is hopefully representative. There are also many partial, ad-hoc tools which are neither interoperable nor integratable. Regardless, our goal is to show that a View-OS is a very strong step toward providing a unified framework for discussing and implementing various methodologies and strategies for modifying the GVA.
3 View-OS: Relaxing the Global View Assumption
UMView (and KMView) are proof-of-concept prototypes of a View-OS. In this section we describe a View-OS—its goals and capabilities—in more general terms. We will, nevertheless, refer to *MView when we wish to provide a practical example of a given concept. All the software has been released under the GPL free license and is available in the standard Debian distribution. The Virtual Square wiki [20] provides access to the software, technical documents and examples. A View-OS allows users to change the perspective of their processes. The basic idea is to redirect each system call to a monitor, or hypervisor. The running process does not have immediate access to the “real” services given by the kernel through system calls; each system call request is “intercepted” and checked by the hypervisor. The hypervisor then decides on one of two behaviors depending on the specific system call and on its parameters. If the hypervisor decides that the system call refers to an unmodified part of the perspective/view, it just asks—or makes the process ask—the underlying level to execute the call. A View-OS, defined in this manner, is a natural fit for nesting, as its interface is the same toward both its upper and lower levels. For this reason, letting the process run the system call “as it is” may mean asking the real kernel to execute the request, or asking a lower View-OS instance to check for a possible change in perspective. From the point of view of the hypervisor, there is no difference between these two cases.
On the other hand, the hypervisor may elect to “trigger” and implement a change in some portion of the view of a process. This may mean executing an existing system call with altered semantics or the execution of a new system call not supported by the underlying level. For example the View-OS may implement a new system call to open a disk image file, read its content, parse it according to a specific file system format, and let the calling process believe it is accessing a real disk with real files and directories. A View-OS allows a user to “boot” a minimalist “kernel” and configure it to manage different file systems, network stacks and services other than the real ones (i.e. the ones made available by the kernel used to boot the machine). The same may also apply to device drivers: the underlying kernel exports a raw view of a device and the hypervisor takes care of using the correct driver for it. This is very similar to the microkernel and exokernel approaches, as described in section 2. What the View-OS concept changes is that the microkernel and exokernel approaches, in addition to the other approaches previously described, are no longer mutually exclusive. Depending on the specific issue, one can choose a more monolithic approach or a more modular one. Both the system administrator and the user may cooperate and have more flexibility in choosing the best option for each service, basing their decision on performance, security needs, and software availability. For instance, if the current monolithic kernel does not yet provide a driver for a given device—it may be under development—the system administrator may choose to use a user-level driver via a View-OS hypervisor. As soon as a more official driver becomes available for the operating system kernel, it may be plugged in and used. Similarly, an inability to update the current kernel or the need for better performance may lead the administrator to delegate some services to the hypervisor or to the kernel. Our prototypes, *MView, are designed to work on a GNU/Linux system and are potentially able to work with every peripheral that is supported by the Linux kernel, allowing the user to run non-modified versions of GNU/Linux software. We denote this flexible approach as a millikernel. That is, this solution lies between the two extreme solutions: microkernels (everything but a minimalist message-passing engine must be outside the kernel) and monolithic kernels (everything that is not a user-created process lives inside the kernel). In the remaining part of this section we will describe some of the most promising View-OS areas of application.
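To make the interception idea more tangible, the following C sketch shows one common way to interpose on system calls under Linux: a monitor process traces a child with ptrace(2) and regains control at every system-call entry. This is only an illustration of the mechanism discussed above, restricted to x86_64 and merely printing system-call numbers; it is not necessarily how *MView implements its hypervisor.

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/ptrace.h>
#include <sys/types.h>
#include <sys/user.h>
#include <sys/wait.h>

int main(int argc, char *argv[])
{
    if (argc < 2) {
        fprintf(stderr, "usage: %s command [args...]\n", argv[0]);
        return 1;
    }
    pid_t child = fork();
    if (child == 0) {
        ptrace(PTRACE_TRACEME, 0, NULL, NULL);   /* ask to be traced         */
        execvp(argv[1], &argv[1]);
        perror("execvp");
        _exit(127);
    }
    int status;
    waitpid(child, &status, 0);                  /* stop at the initial exec */
    int entering = 1;
    while (1) {
        ptrace(PTRACE_SYSCALL, child, NULL, NULL);   /* run to next syscall
                                                        entry or exit stop   */
        waitpid(child, &status, 0);
        if (WIFEXITED(status))
            break;
        if (entering) {                          /* system-call entry stop    */
            struct user_regs_struct regs;
            ptrace(PTRACE_GETREGS, child, NULL, &regs);
            /* A View-OS style hypervisor would decide here whether to let
             * the call through or to virtualize it; we only report it.      */
            printf("syscall %lld\n", (long long)regs.orig_rax);
        }
        entering = !entering;
    }
    return 0;
}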
3.1 Security
A View-OS allows one to implement the required granularity for one of the most theoretically important security principles: the principle of the minimum privilege. At present, UNIX-like systems provide two main authorization mechanisms. The first (and older) one is the usual owner/group permissions system. Recently, the privileges traditionally associated with the superuser have been split into
distinct units known as capabilities, which can be independently enabled and disabled. Capabilities may include the ability to bypass some permission checks on the file system, to kill other users’ processes, to invoke privileged network operations, to change attributes and priority for a process, and so forth. While this mechanism tries to capture the need for greater granularity, it is not useful for regular users, as capabilities only refer to global administrative activities. Nonadministrative users are not able to define their own capabilities nor associate them with their processes. With networking, the security situation is even worse. A given system has a fixed network stack and a fixed set of network interfaces. There are no useful flexible control mechanisms that can be used to allow or deny network-related operations to single users or processes. If a network interface is up, everyone can see it and everyone can open ports and listen for packets on it.¹ These UNIX authorization mechanisms do not comply with the minimum privilege principle, both for users and administrators. A View-OS allows an administrator to give a process exactly the minimum set of (file, network, . . . ) resources that are necessary to complete its task. An administrator may want to refine capabilities for a certain process. For instance, she may decide that a given process may ignore file system permissions but only on a subtree. Users may be given personal TCP/IP addresses, so it becomes easier to apply shaping or filtering rules. While this is currently possible, it requires keeping track of every TCP connection and every single UDP packet. Quite often, users would like to separate and isolate different groups of activities (e.g. work, leisure, experiments). A game should not be able to access e-mail files/directories. If a user wants to try a new, unstable application, she also wants to protect her data from accidental deletions. A View-OS addresses all these problems by allowing users and superusers to describe a minimal set of resources around every process. Finally, with a View-OS, many, if not all, of the operations that require “set user ID” executables (e.g. the mount command) may be converted to regular (non-suid) operations. Hence one can remove many potential security holes: executables run by regular users but with superuser privileges.
3.2 Flexibility
Dealing with malicious or broken software is not the only field where a View-OS proves useful. The open-ended flexibility of a View-OS allows users to build their world around themselves. Our unification approach also helps in making different virtualizations cooperate and integrate. Transformations away from the global view can be relative to the file system, network, devices, and a group of other less frequently used components of a process view.
¹ There are some packet filtering infrastructures such as IPTables which allow the system administrator to enforce various policies but, typically, the granularity on users and processes is too coarse.
While we have described the individual virtualizations made possible by an implementation of a View-OS, it is instructive to consider more complex View-OS applications. The aim is to give an idea of how one is able to combine different, simple, but interoperable virtualizations to create useful structures. It is worth pointing out that View-OS was designed as part of a bigger virtualization framework, named Virtual Square [21] [22]. Also part of this framework is VDE [23], a virtual networking tool able to connect virtual and real machines together. A VDE is often used to combine different View-OS instances and link them to real networks.

A remote encrypted file system. Let’s assume that a (very paranoid) user keeps a Second Extended disk image on her home computer and that a portion of this file system is encrypted using EncFS. This user may want to access this disk image while on another computer. The traditional approach would require her to copy the whole disk image onto the local host and then mount it with superuser privileges via the loopback device. With a View-OS she may combine three different modules: a ssh module to reach the remote file system on her home machine; an ext2 module to mount the disk image; and, finally, an encfs module to decrypt its content. None of these operations requires root access, and none of the other local users can see the contents of the disk image.

Partitioning images and devices at user level. The basic idea is to let a user manage her disk images (or her removable storage devices) entirely in user space. In the case of removable media, we assume that the kernel grants exclusive read/write privileges on the device to that user. A specific *MView module allows one to see a file or device as a disk, to partition it (using the usual Linux tools) and to mount the new partitions entirely at user level. The kernel does not have to support every single (strange) kind of file system that any user may want to use, but simply delegates the parsing of the raw device content to another layer, i.e. a View-OS.

Per-process IP stack or address. Not only IP addresses, but even network stacks may be assigned on a per-process basis. This is useful to test new, experimental implementations or to use different optimizations and tuning for different kinds of applications. The combination of VDE with a network stack module for the View-OS level makes this possible. The increased level of isolation and granularity makes mobility and server reorganization much easier, as a single daemon may have its own specific IP address and can keep it even if it has to be moved onto a different physical host.
3.3 Fast Prototyping
A key advantage of the View-OS approach to virtualization is the possibility to create very light environments with single, focused/specific alterations to the global view. This makes a View-OS very useful every time a designer has to build and test prototypes for applications or protocols and doesn’t want (or can’t) alter the configuration of the running system. Usually, whole-system virtual machines
have to be created, configured and, if needed, interconnected. This “heavy” approach can often be lightened using a View-OS. Examples illustrating how a View-OS allows for fast and lightweight prototyping include copy-on-write for configuration files, providing a light framework for testing network protocols, verifying the effectiveness of new system calls, and the testing of new file system implementations.
4 Conclusions
The founding idea behind a View-OS is the change in perspective that we made while examining a system. Instead of focusing on the operating system and its kernel, which sees processes as a uniform set of objects, we consider a process as the main actor that makes use of operating system services as a way to build up its own perspective. The term we use to denote the classical way of looking at a system is GVA (global view assumption). The goal of a View-OS is to relax this very limiting approach. Furthermore, we endeavored to provide a unified approach encompassing different techniques, concepts, and paradigms, both in operating systems architectures and virtual machine design, toward our goal of relaxing the GVA. Instead of using tool A to relax one portion of a view, and tool B (which might not even interoperate with A) to relax a different portion of one’s view, one need only use a View-OS—a unified paradigm for relaxing any portion (or portions) of a view, all accomplished in user-mode. UMView (and KMView) are working prototypes of a View-OS. UMView is implemented as a System Call Virtual Machine and allows regular Linux programs to run on regular Linux kernels, with the added benefit of allowing processes to create their own personal perspective/view by themselves—no superuser intervention is needed. We believe that having a working, usable implementation is as important as creating a good model. It is a future goal not only to speed up the performance of *MView, but to meaningfully measure the overhead these View-OS implementations impose. It is a non-trivial task to simply determine what to measure for this. We can report that an “empty” KMView environment, i.e. one with no perspective-altering modules loaded, yielded an ad hoc measured 20% loss of performance with respect to an unaltered system. This overhead, though not so low, is quite encouraging considering that KMView is a proof-of-concept prototype and not yet intended as a production environment. Another future goal is to explore the educational potential of a View-OS. If one believes that the best way to learn about something is to build one, then implementing View-OS modules becomes an excellent educational pursuit. Since everything is executed in user mode, students can safely explore system behavior by redefining it in every conceivable way: coherent or not, safe or unsafe, without compromising the integrity of the rest of the system.
References

1. Davoli, R.: The View-OS project, http://www.sf.net/projects/view-os
2. Davoli, R., Goldweber, M., Gardenghi, L.: UMView: View-OS implemented as a system call virtual machine. In: 7th Usenix Symposium on Operating Systems Design and Implementation, Poster Session, Seattle, WA (November 2006)
3. Smith, J., Nair, R.: Virtual Machines: Versatile Platforms for Systems and Processes. Morgan Kaufmann, San Francisco (2005)
4. Smith, J.E., Nair, R.: The architecture of virtual machines. IEEE Computer 38(5), 32–38 (2005)
5. Adair, R., Bayles, R., Comeau, L., Creasy, R.: A virtual machine system for the 360/40. Technical report, IBM Cambridge Scientific Center Report 320-2007, Cambridge, Mass. (May 1966)
6. Bellard, F.: Qemu, a fast and portable dynamic translator. In: USENIX 2005 Annual Technical Conf., FREENIX Track (2005)
7. Bartholomew, D.: Qemu: a multihost, multitarget emulator. Linux J. (145) (2006)
8. Qumranet Inc.: KVM: Kernel-based virtualization driver (2006), http://kvm.qumranet.com
9. Gavare, A.: GXemul project, http://gavare.se/gxemul/
10. Microsoft (formerly from Connectix): Virtual PC, http://www.microsoft.com/windowsxp/virtualpc/
11. VMware, Inc.: VMware, http://www.vmware.com/
12. Lawton, K.: Bochs project home page, http://bochs.sourceforge.net
13. Biallas, S.: PearPC project, http://pearpc.sourceforge.net
14. Morsiani, M., Davoli, R.: Learning operating system structure and implementation through the MPS computer system simulator. In: Proc. of the 30th SIGCSE Technical Symp. on Computer Science Education, New Orleans, pp. 63–67 (1999)
15. Goldweber, M., Davoli, R.: The Kaya project and the μMPS hardware emulator. In: Proc. of ITiCSE 2005, Conf. on Innovation and Technology in Computer Science Education, Lisbon (2005)
16. Dragovic, B., Fraser, K., Hand, S., Harris, T., Ho, A., Pratt, I., Warfield, A., Barham, P., Neugebauer, R.: Xen and the art of virtualization. In: Proc. of the ACM Symp. on Operating Systems Principles (October 2003)
17. Engler, D.R., Kaashoek, M.F., O’Toole, J.: Exokernel: an operating system architecture for application-level resource management. In: SOSP 1995: Proc. of the 15th ACM Symposium on Operating Systems Principles, pp. 251–266. ACM Press, New York (1995)
18. Pike, R., Presotto, D., Dorward, S., Flandrena, B., Thompson, K., Trickey, H., Winterbottom, P.: Plan 9 from Bell Labs. Computing Systems 8(3), 221–254 (Summer 1995)
19. Szeredi, M.: FUSE: filesystem in user space, http://fuse.sourceforge.net
20. Davoli, R.: Virtual square wiki page, http://wiki.virtualsquare.org/
21. Davoli, R.: Virtual square. In: Proc. of OSS2005, Open Source Software 2005, Genova (2005)
22. Davoli, R., Goldweber, M.: Virtual square in computer science education. In: Proc. of ITiCSE05, Conf. on Innovation and Technology in Computer Science Education, Lisbon (2005)
23. Davoli, R.: VDE: virtual distributed ethernet. In: Proc. of Tridentcom 2005, Trento (2005)
Evaluating Sparse Data Storage Techniques for MPI Groups and Communicators

Mohamad Chaarawi and Edgar Gabriel

Parallel Software Technologies Laboratory, Department of Computer Science,
University of Houston
{mschaara,gabriel}@cs.uh.edu
Abstract. In this paper we explore various sparse data storage techniques in order to reduce the amount of memory required for MPI groups and communicators. The idea behind the approach is to exploit similarities between the objects and thus store only the difference between the original process group and the resulting one. For each technique, we detail the memory saved compared to the currently used implementations, and present a runtime decision routine capable of choosing dynamically the most efficient technique for each scenario. Furthermore, we evaluate the performance impact of the new structures using point-to-point benchmarks as well as an application scenario over InfiniBand, Myrinet and Gigabit Ethernet networks.
1 Introduction
The memory footprint of a process running the MPI equivalent of ’hello world’ can reach tens of Megabytes on today’s platforms. Some of the factors contributing to the large memory footprint are related to optimizations within the MPI library code base, such as using statically allocated memory or having many different code paths in order to optimize a particular operation. The larger fraction of the memory utilized by an MPI library is however allocated dynamically at runtime and depends on system parameters such as the network interconnect and application parameters such as the number of processes. While reducing the memory footprint of an MPI process was not considered a high priority for a while, recent hardware developments force us to rethink some concepts used within communication libraries. Machines such as the IBM Blue Gene/L [4] have the capability to run MPI jobs consisting of more than 100,000 processes. At the same time, each node only has 512 MB of main memory available, leading to 256 MB for each MPI process. A similar problem occurs on commodity clusters due to the increasing number of cores per processor [2], giving the end-users the possibility to run parallel jobs consisting of a large number of MPI processes, while at the same time the main memory per core remains constant at best. For platforms facing the problem outlined above, an MPI library should avoid internal structures having a high dependency on the number of MPI processes. In this paper we are exploring various sparse data storage techniques in
order to reduce the amount of memory required for MPI groups and communicators. The idea behind the approach is to exploit similarities between the objects. Instead of storing the entire list of processes which are part of a new group or communicator, the approach presented in this paper stores only the difference between the original communicator and the resulting one. These techniques might not only be relevant for a very large number of processes, but will also be beneficial for applications having a moderate number of processes that, however, generate a very large number of communicators [1]. Much of the work on optimizing memory usage within MPI libraries has focused so far on the networking layer. Panda et al. show in [12] the benefits of using the Shared Receive Queue features of InfiniBand in order to reduce the memory usage of large scale applications. Shipman et al. [11] introduce a new pipelining protocol for network fabrics dealing with registered memory, which further reduces the memory utilization for these network interconnects. The work done in [5] focuses on controlling the number of unexpected messages a process can handle in order to limit the memory usage of the MPI library. The remainder of the paper is organized as follows: Sec. 2 briefly discusses the current implementation of groups and communicators in Open MPI, and presents three different sparse data storage techniques. Sec. 3 evaluates the performance impact of the techniques detailed in the previous section using point-to-point benchmarks as well as the High Performance Linpack (HPL) benchmark. Finally, Sec. 4 summarizes the paper and presents the currently ongoing work in this area.
2 Alternative Storage Formats for Groups and Communicators
In most MPI libraries available today, each MPI group contains the list of its member processes. In Open MPI [6], each entry of the list is a pointer to the corresponding process structure, while in MPICH2 [8] the corresponding list contains the ranks of the processes in MPI_COMM_WORLD. The position in this array indicates the rank of the process in this group. An MPI communicator typically contains pointers to either one MPI group for intra-communicators, or two groups for inter-communicators. While this approach guarantees the fastest access to the process structure for a communication – and thus minimal communication latencies – the information stored in those arrays is often redundant. For example, in case a numerical library creates a duplicate of MPI_COMM_WORLD in order to generate a unique communication context, the communicator structure will contain a process list whose information is redundant with that of the original communicator. For a 100,000-process job, this list will take 8 × 100,000 bytes of memory, assuming that the size of a pointer is 8 bytes. Therefore, three alternative storage formats have been evaluated in this study in order to minimize the redundant information between different process groups and thus minimize the memory footprint of the corresponding structures. In order to evaluate the benefits and disadvantages of these storage formats, we have
implemented all formats in Open MPI. For this, the group structure and the group management functions had to be adapted. The following subsections give some details on each format.

PList Format: The PList storage format is the original storage format containing a list of pointers to the process structures of the group members. The implementation of this format is unchanged compared to the original version.

Range Format: For this format, the included processes in a group are described by ranges of process ranks, e.g. having n consecutive processes starting from rank r. The syntax of this storage format has been derived from the MPI_Group_range_excl/incl functions in MPI [9]. Thus, the group structure in Open MPI has been extended by a list holding the required number of <base rank, number of processes> pairs in order to describe the members of the new group. The base ranks stored in the group-range list correspond to the ranks of the processes in the original group. Thus, the group structure also needs to store a pointer to the original group and increase the reference counter of that object accordingly. While this storage format can be applied to any group/communicator, it will be most memory efficient if the list of ranks included in the new process group can be described by a small number of large blocks.

Strided Format: In some cases, the included processes in a group follow a regular pattern, e.g. a new group/communicator includes every n-th process of the original group. Three integers are required in the group structure in order to support this format: grp_offset contains the rank of the process where the pattern starts; grp_last_elt is the rank of the last process in the pattern; grp_stride describes how many processes from the original group have to be skipped between two subsequent members of the new group. The group creation function for strided groups can automatically determine all three parameters. In case no regular pattern can be determined, the strided group creation function will indicate that it cannot be used for this particular process group. Similarly to the range format, a pointer to the original group is required to be able to determine the process structures.

Bitmap Format: The main idea behind this storage format is to use a bit-array of the size of the original communicator/group. The bit at position i indicates whether the process with rank i in the original group is a member of the resulting group or not. The main restriction of this storage format is that the ranks of the included processes in the new group have to be monotonically increasing in order to be able to uniquely map the rank of a process from one group to another group.
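To make the four variants more concrete, the following C sketch shows one possible group descriptor holding them in a tagged union, together with the strided rank translation. All type and field names are invented for this illustration and do not reflect Open MPI's actual group structure; in particular, the sketch assumes that grp_stride denotes the distance between consecutive member ranks in the parent group.

#include <stdint.h>
#include <stddef.h>

struct proc;                               /* opaque per-process structure     */

enum grp_mode { GRP_PLIST, GRP_RANGE, GRP_STRIDED, GRP_BITMAP };

struct grp_range { int base_rank; int nprocs; };   /* <base rank, count> pair  */

struct grp {
    enum grp_mode  mode;                   /* flag: which storage is in use    */
    int            size;                   /* number of member processes       */
    struct grp    *parent;                 /* original group (sparse formats)  */
    union {
        struct proc      **plist;          /* dense list of process pointers   */
        struct {                           /* range format                     */
            struct grp_range *ranges;
            int               nranges;
        } range;
        struct {                           /* strided format                   */
            int offset, last_elt, stride;
        } strided;
        uint8_t           *bitmap;         /* one bit per parent-group member  */
    } u;
};

/* Rank translation for the strided format: rank r in this group corresponds
 * to rank offset + r*stride in the parent group (under the stride assumption
 * stated above).                                                             */
static inline int strided_rank_in_parent(const struct grp *g, int r)
{
    return g->u.strided.offset + r * g->u.strided.stride;
}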
2.1 Group Management Operations
In addition to the functionality outlined in the previous subsections, each storage format provides a function which estimates the amount of main memory required by this format, given the list of process members. Whenever a new group or communicator is created, these functions are queried in order to decide which of the storage techniques will be applied. Table 1 summarizes the formulas used to estimate the memory consumption of each format.

Table 1. Memory consumption of each storage format

    Storage format    Memory consumption
    PList             number of processes × sizeof(void *)
    Range             number of ranges × 2 × sizeof(int)
    Strided           3 × sizeof(int)
    Bitmap            number of processes / 8

The current runtime decision logic will then pick the storage format requiring the least amount of memory. A flag in the group structure indicates which storage format is being used for a particular group. The groups used by MPI_COMM_WORLD and MPI_COMM_SELF are always stored using the PList format. The most performance-sensitive functions with respect to group and communicator management are those returning a pointer to the process structure for a given tuple of group and rank, as well as those translating ranks between two groups,
e.g. in case one group has been derived from the other group. Unfortunately, we cannot detail the corresponding formulas in this paper due to space limitations; please refer to [3]. The most severe restriction of the current approach is that, as of now, all storage formats assume that a group is derived from a single parent group. This is not the case for the MPI_Group_union and MPI_Group_difference operations as well as for MPI_Intercomm_merge. For these functions, the implementation will automatically fall back to the PList format. For similar reasons, the current implementation does not support inter-communicators.
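The selection logic summarized in Table 1 can be sketched as a small routine that evaluates the estimate of each format and returns the cheapest one. The function below is a hypothetical stand-in for the per-format estimation functions described above, not Open MPI's actual decision code; note that the bitmap estimate is based on the size of the parent group.

#include <stdio.h>
#include <stddef.h>

enum grp_format { FMT_PLIST, FMT_RANGE, FMT_STRIDED, FMT_BITMAP };

static const char *fmt_names[] = { "PList", "Range", "Strided", "Bitmap" };

/* new_nprocs: members of the new group; parent_nprocs: size of the original
 * group; nranges: contiguous rank blocks; strided_ok: whether one
 * <offset, stride> pattern describes all members.                           */
static enum grp_format pick_format(size_t new_nprocs, size_t parent_nprocs,
                                   size_t nranges, int strided_ok,
                                   size_t *bytes)
{
    size_t cost[4];
    cost[FMT_PLIST]   = new_nprocs * sizeof(void *);
    cost[FMT_RANGE]   = nranges * 2 * sizeof(int);
    cost[FMT_STRIDED] = strided_ok ? 3 * sizeof(int) : (size_t)-1;
    cost[FMT_BITMAP]  = (parent_nprocs + 7) / 8;     /* one bit per parent rank */

    enum grp_format best = FMT_PLIST;
    for (int f = FMT_RANGE; f <= FMT_BITMAP; f++)
        if (cost[f] < cost[best])
            best = (enum grp_format)f;
    *bytes = cost[best];
    return best;
}

int main(void)
{
    size_t bytes;
    /* Every second process of a 100,000-process group: one stride pattern,
     * 50,000 members, and 50,000 single-process "ranges".                   */
    enum grp_format f = pick_format(50000, 100000, 50000, 1, &bytes);
    printf("chosen format: %s (%zu bytes)\n", fmt_names[f], bytes);
    return 0;
}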
3 Performance Evaluation
This section evaluates the performance implications of the various storage formats presented in the previous section. Two separate sets of tests have been conducted, namely a point-to-point benchmark in order to quantify the effect on latency, and an application benchmark. The machines used for the tests were the shark cluster at the University of Houston and the IBM BigRed cluster at Indiana University. Shark consists of 24 dual-core 2.2 GHz AMD Opteron nodes connected by a 4x InfiniBand and a Gigabit Ethernet network interconnect. BigRed, which was mainly used for the point-to-point benchmarks, consists of 768 IBM JS21 Blades, each having two dual-core PowerPC 970 MP processors, 8 GB of memory, and a PCI-X Myrinet 2000 adapter. Within the scope of this analysis we used up to 512 MPI processes on 128 nodes of BigRed.
3.1 Point-to-Point Benchmark
In order to evaluate the effect of the alternative storage formats for groups and communicators on the point-to-point performance of Open MPI, we created a new test within the latency test suite [7]. The basic idea behind the latency test suite is to provide building blocks for ping-pong benchmarks, such as different data type constructors, communicator constructors, or data transfer primitives. This allows users to set up their own point-to-point benchmarks, e.g. by mimicking a particular section of their applications. The new test case developed within this project creates a hierarchy of communicators. Starting from the processes in MPI COMM WORLD, the test excludes all odd-ranked elements of the communicator. Using the resulting communicator, the benchmark keeps creating new communicators, excluding the odd-ranked elements, until a communicator consisting of only one or two processes has been created. For each new communicator, a ping-pong benchmark is executed between the first and the second process in one case, and between the first and the last process in another case. An additional overhead on the communication latency is expected when executing with the sparse formats, since getting the actual process pointer for each data transfer operation requires some additional computation and lookup operations. This effect is expected to increase with the depth of the hierarchy of groups depending on each other.
Table 2. Results of the point-to-point benchmark running 48 processes over InfiniBand

          PList            Range            Strided          Bitmap
  Level   0-first  0-last  0-first  0-last  0-first  0-last  0-first  0-last
  0       3.7      3.7     3.7      3.7     3.7      3.7     3.7      3.7
  1       3.7      3.7     3.74     3.74    3.74     3.71    3.74     3.74
  2       3.7      3.7     3.74     3.79    3.74     3.74    3.74     3.79
  3       3.74     3.7     3.74     3.85    3.74     3.74    3.79     3.79
  4       3.74     3.7     3.85     3.85    3.8      3.8     3.8      3.85
  5       3.74     3.7     3.9      3.85    3.8      3.8     3.9      3.9
For each implementation of the groups (plist, range, strided, bitmap), the test was executed five times. The reported results show the minimum latency over all executed tests, i.e. the best achievable result on the corresponding cluster. Times are given in μs.

The results on shark over the InfiniBand network interconnect are shown in Table 2. The level of each communicator, shown in the first column of the table, indicates the number of indirections required to look up the process structure. For level 0 (= MPI COMM WORLD) the latency is independent of the storage format, since this communicator always uses the PList format. Furthermore, there is no performance difference for level 0 whether the ping-pong benchmark is executed between the first and the second process, or between the first and the last process of the communicator. As expected, the latency is mostly constant for the original PList format, and thus independent of the communicator used. For the other formats, the latency does increase with the level of the communicator, i.e. the number of indirections required to look up the process structure. In order to quantify the overhead, let us consider the highest overhead observed in our measurements, which adds 0.2 μs to the original latency. Accessing the process structure for that particular communicator level requires 5 indirections of the algorithm described in Section 2.1. Thus, the average overhead per level can be estimated to be up to 0.04 μs on this architecture.

For the bitmap and the range formats, we would also have expected to see a slight increase in latency when executing the ping-pong benchmark between the first and the last process, compared to the first and the second process. The reason is that the cost of the rank-translation algorithms for these two formats should increase with the rank being translated, since we have to walk linearly through the list of participating processes. However, due to the fact that our maximum job size is only 48 processes and that the number of processes decreases by a factor of two with each level of communicator, we could not observe the expected effect in these benchmarks. There are slight differences in the performance of the alternative storage formats, with strided being slightly faster than the bitmap and the range formats. The reason is probably the rank-translation algorithm, which only requires applying a simple formula for the strided format, compared to a slightly more complex algorithm for the other two sparse storage formats.
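The difference between the rank-translation algorithms mentioned above can be illustrated as follows. This is our own hedged sketch in Java, not Open MPI's implementation: it shows why the strided translation is a constant-time formula while the range and bitmap translations walk linearly through the stored description.

public class RankTranslation {

    /** Strided: parentRank = offset + rank * stride (constant time). */
    static int stridedToParent(int rank, int offset, int stride) {
        return offset + rank * stride;
    }

    /** Range: scan the (base rank, count) pairs until the rank falls inside one. */
    static int rangeToParent(int rank, int[][] ranges) {
        int remaining = rank;
        for (int[] r : ranges) {              // r[0] = base rank, r[1] = count
            if (remaining < r[1]) return r[0] + remaining;
            remaining -= r[1];
        }
        throw new IllegalArgumentException("rank not in group");
    }

    /** Bitmap: count set bits until the (rank+1)-th member is found. */
    static int bitmapToParent(int rank, boolean[] bitmap) {
        int seen = 0;
        for (int parentRank = 0; parentRank < bitmap.length; parentRank++) {
            if (bitmap[parentRank] && seen++ == rank) return parentRank;
        }
        throw new IllegalArgumentException("rank not in group");
    }
}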
Table 3. Results of the point-to-point benchmarks running 48 processes over Gigabit Ethernet

          PList            Range            Strided          Bitmap
  Level   0-first  0-last  0-first  0-last  0-first  0-last  0-first  0-last
  0       51.55    51.84   51.61    51.39   51.11    52.14   51.55    51.2
  1       51.61    52.45   51.65    52.34   52.09    52.95   52.05    52.8
  2       51.7     52.7    51.59    53.8    51.55    52.64   51.75    52.75
  3       51.09    52.3    51.45    53.19   51.4     52.4    51.15    51.86
  4       50.8     51.45   51       51.81   51.75    51.8    51.25    52.6
  5       51.7     51.5    51.86    51.4    51.55    51.14   51.61    51.7
Table 4. Results of the point-to-point benchmark running 512 processes over Myrinet

          PList            Range            Strided          Bitmap
  Level   0-first  0-last  0-first  0-last  0-first  0-last  0-first  0-last
  0       6.25     7.34    6.04     7.45    6.25     7.40    6.10     7.25
  1       6.25     7.40    6.29     7.65    6.29     7.44    6.29     9.05
  2       6.09     7.30    6.25     8.00    6.29     7.45    6.29     9.80
  3       6.20     7.30    6.35     8.00    6.29     7.39    6.35     10.14
  4       6.21     7.20    6.35     8.06    6.40     7.55    6.60     10.44
  5       6.25     7.25    6.45     8.30    6.40     7.55    6.75     10.50
  6       6.75     7.25    7.19     8.00    6.85     7.40    7.56     10.19
  7       6.75     7.25    7.20     8.05    6.89     7.55    7.95     10.05
  8       6.76     7.36    7.40     8.00    7.06     7.55    8.80     9.30
Table 3 summarizes the performance results on shark over Gigabit Ethernet. In our measurements, no performance effects could be observed that could be directly related to the sparse data storage formats used for groups and communicators. The reason is that the perturbation of the measurements over this particular switch was higher than the expected overhead, assuming that the overhead due to the different storage formats is in the same range as for the InfiniBand results presented previously.

Table 4 presents the results obtained for a 512 process run on 128 nodes of BigRed. In order to ensure that the same network protocol is used for all communicator levels, the 0-first tests have been modified such that the first MPI processes on the first two nodes are used for the first three communicators (levels 0, 1, and 2). First, we would like to analyze the results obtained using the plist format – the default Open MPI approach – on this machine. The results for this storage format are presented in the first two columns of Table 4. The most fundamental observation on this machine is that the latency shows a substantially larger variance depending on the nodes used for the ping-pong benchmark. Furthermore, there is a noticeable increase in the communication latency when executing the ping-pong benchmark between
rank 0 and the last process in the communicator, compared to the results obtained in the 0-first tests. In order to explain this effect, we made several verification runs confirming the results shown above. Furthermore, we verified this behavior for the plist format with support for sparse storage formats disabled in Open MPI. Since we can positively exclude any effects due to the sparse storage formats for these results, we think that the most probable explanation involves caching effects when accessing the process structures of processes with a higher rank.

With respect to the sparse storage formats, the results indicate a behavior similar to that obtained on the shark cluster over InfiniBand: the redirections required to look up the process structure lead to a small performance penalty when using the sparse storage techniques. In order to estimate the overhead introduced by the sparse storage techniques, we compare the latency obtained for a particular communicator level with a sparse storage technique to the latency obtained on the same communicator level with the plist storage format. This overhead is then divided by the number of indirect lookup operations required for that communicator level. Since the results show a larger variance than on the shark cluster, we provide an upper bound for the overhead by reporting only the maximum values observed in this set of tests and the average obtained over all levels. In the 0-first tests, the range format introduces an average overhead of 0.057 μs per level; the maximum overhead found was 0.08 μs. The penalty on the latency per level when using the strided format in these tests was 0.04 μs, while the highest overhead observed with this storage format was 0.1 μs. The bitmap format once again shows the highest overhead, with an average penalty of 0.118 μs per level and a maximum overhead of up to 0.255 μs. In contrast to the results obtained on the shark cluster, the 0-last tests show a significant additional overhead for the range and the bitmap formats, due to the fact that the rank-translation operation involves a linear scan over all participating processes in that communicator. The bitmap format has an average overhead of more than 0.8 μs per level in these tests, while the average overhead for the range method increases to 0.197 μs. As expected, the strided format does not show any sensitivity to the rank being used in the rank-translation operation.
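The per-level overhead figures quoted above can be reproduced directly from the tables. The helper below is our own illustration of that arithmetic (it is not part of the benchmark suite); as an example it uses the Range and PList 0-first columns of Table 4 as reconstructed above.

public class OverheadPerLevel {
    /** Per-level overhead: (sparse latency - plist latency) / number of indirections. */
    static double[] overhead(double[] sparseLatency, double[] plistLatency) {
        double[] perLevel = new double[sparseLatency.length];
        for (int level = 1; level < sparseLatency.length; level++) {
            perLevel[level] = (sparseLatency[level] - plistLatency[level]) / level;
        }
        return perLevel;
    }

    public static void main(String[] args) {
        // Range 0-first and PList 0-first columns of Table 4 (BigRed, values in μs).
        double[] range = { 6.04, 6.29, 6.25, 6.35, 6.35, 6.45, 7.19, 7.20, 7.40 };
        double[] plist = { 6.25, 6.25, 6.09, 6.20, 6.21, 6.25, 6.75, 6.75, 6.76 };
        double[] perLevel = overhead(range, plist);
        double max = 0, sum = 0;
        for (int level = 1; level < perLevel.length; level++) {
            max = Math.max(max, perLevel[level]);
            sum += perLevel[level];
        }
        // Close to the values quoted in the text (about 0.057 μs average, 0.08 μs maximum).
        System.out.printf("avg %.3f max %.3f%n", sum / (perLevel.length - 1), max);
    }
}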
3.2 Application Benchmark
In order to determine the impact of the new storage formats on the performance of a real application scenario, we executed multiple test cases of the HPL benchmark [10]. A major requirement for the application benchmark chosen for this subsection is that the code has to create sub-communicators which expose a benefit of the sparse storage techniques. HPL organizes processes in a 2-D Cartesian process topology. Three different types of communicators are created by HPL: (I) a duplicate of MPI COMM WORLD, (II) a row communicator for each row of the 2-D Cartesian topology, and (III) a column communicator for each column of the 2-D Cartesian topology. For (I), the new group creation functions will choose the range format for process numbers larger than 64, and the bitmap
format otherwise. Similarly, communicator (II) can be represented by a single range of processes, while communicator (III) is best described by the strided format. Assuming a 90,000 process run of HPL organized in a 300 × 300 process topology, the default implementation of groups and communicators in Open MPI would take 724,800 bytes per process to store the lists of participating processes for all three communicators (90,000 × 8 bytes for the duplicate of MPI COMM WORLD plus 2 × 300 × 8 bytes for the row and column communicators, assuming 8-byte pointers). Using the sparse storage techniques described in this paper, the memory consumption for that scenario can be reduced to 58 bytes per process.

In the following, we present performance results for 48 process test cases using shark. Table 5 summarizes the measurements over the InfiniBand network interconnect. Four different test cases have been executed, namely two problem sizes (24,000 and 28,000), each executed with two different block sizes (160 and 240). Since the latency does show some dependence on the storage format used for MPI groups, we would expect to see minor increases in the execution time for highly latency-sensitive applications. However, none of the test cases executed in this subsection shows a significant performance degradation related to the storage format.

Table 5. Execution time of the HPL benchmark on 48 processes using InfiniBand, in seconds

  Size   Block size  PList   Range   Strided  Bitmap
  24000  160         65.81   65.81   65.84    65.87
  24000  240         69.05   69.08   69.22    69.14
  28000  160         99.78   99.73   99.84    99.81
  28000  240         104.25  104.31  104.27   104.29

4 Summary
In this paper, we introduced various storage formats for groups and communicators in order to minimize the memory footprint of the corresponding structures. The main idea behind these formats is to store only the difference between the original group and the newly created one. Three different formats – range, strided and bitmap – have been implemented in Open MPI. In addition to the memory consumption of each format, the paper also evaluates the performance impact of the sparse storage formats. Using a modified ping-pong benchmark, we determined the performance overhead due to the new data storage formats to be up to 0.04 μs per hierarchy level of the communicator. This overhead is negligible for most scenarios, especially when taking into account that many applications only derive communicators directly from MPI COMM WORLD. Our tests using the HPL benchmark did not show any measurable overhead due to the new storage formats. The techniques detailed in this paper will be available with Open MPI version 1.3.

Acknowledgments. This research was funded in part by a gift from the Silicon Valley Community Foundation, on behalf of the Cisco Collaborative Research
Initiative of Cisco Systems and was supported in part by the National Science Foundation through TeraGrid resources provided by Indiana University.
References

1. Open MPI users mailing lists (2007), http://www.open-mpi.org/community/lists/users/2007/03/2925.php
2. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical report, EECS Department, University of California, Berkeley (2006)
3. Chaarawi, M.: Optimizations of group and communicator operations in Open MPI. Master's Thesis, Department of Computer Science, University of Houston (2006)
4. Gara, A., et al.: Overview of the Blue Gene/L system architecture. IBM Journal of Research and Development 49(2/3), 195–212 (2005)
5. Farreras, M., Cortes, T., Labarta, J., Almasi, G.: Scaling MPI to short-memory MPPs such as BG/L. In: ICS 2006: Proceedings of the 20th Annual International Conference on Supercomputing, pp. 209–218. ACM, New York (2006)
6. Gabriel, E., Fagg, G.E., Bosilca, G., Angskun, T., Dongarra, J.J., Squyres, J.M., Sahay, V., Kambadur, P., Barrett, B., Lumsdaine, A., Castain, R.H., Daniel, D.J., Graham, R.L., Woodall, T.S.: Open MPI: Goals, concept, and design of a next generation MPI implementation. In: Kranzlmüller, D., Kacsuk, P., Dongarra, J. (eds.) EuroPVM/MPI 2004. LNCS, vol. 3241, pp. 97–104. Springer, Heidelberg (2004)
7. Gabriel, E., Fagg, G.E., Dongarra, J.J.: Evaluating dynamic communicators and one-sided operations for current MPI libraries. International Journal of High Performance Computing Applications 19(1), 67–79 (2005)
8. Gropp, W., Lusk, E., Doss, N., Skjellum, A.: A high-performance, portable implementation of the MPI message passing interface standard. Parallel Computing 22(6), 789–828 (1996)
9. Message Passing Interface Forum: MPI: A Message Passing Interface Standard (June 1995), http://www.mpi-forum.org
10. Petitet, A., Whaley, R.C., Dongarra, J., Cleary, A.: HPL - A Portable Implementation of the High-Performance Linpack Benchmark for Distributed-Memory Computers, Version 1.0, http://www.netlib.org/benchmark/hpl/
11. Shipman, G.M., Brightwell, R., Barrett, B., Squyres, J.M., Bloch, G.: Investigations on InfiniBand: Efficient network buffer utilization at scale. In: Cappello, F., Herault, T., Dongarra, J. (eds.) PVM/MPI 2007. LNCS, vol. 4757, pp. 178–186. Springer, Heidelberg (2007)
12. Sur, S., Koop, M.J., Panda, D.K.: High-performance and scalable MPI over InfiniBand with reduced memory usage: an in-depth performance analysis. In: SC 2006: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing, p. 105. ACM, New York (2006)
Method of Adaptive Quality Control in Service Oriented Architectures

Tomasz Szydlo and Krzysztof Zielinski

Department of Computer Science, AGH University of Science and Technology
[email protected], [email protected]
Abstract. The Internet era is very attractive for growing small companies and start-ups because of the simplicity of selecting services that can be composed into complex applications. This model requires a novel approach to accounting and to providing the Quality of Service defined by a Service Level Agreement. Based on these needs, we propose a method of service adaptation that can dynamically change the offered quality with respect to the customer's preferences and budget.
1 Introduction

Software systems are becoming larger and far more complex than ever. At the same time, they have to provide assured quality and performance, and time to market has shortened significantly. Most of the basic functionalities of modern software are reusable components that might be shared between projects. Analyzing many projects from 2003 to 2006, IBM research has found that most of them follow the same architectural template. The result of these findings is the Service-Oriented Solution Stack (S3) [5], which provides a detailed architectural definition of an SOA across nine layers, from the business process layer to the operational systems layer. These layers are crossed by the integration, quality of service, information architecture and governance layers. Each of them has a logical and a physical aspect: the logical aspect includes architectural elements and design decisions, while the physical aspect is related to the implementation technology.

A Service Level Agreement (SLA) is the contract between provider and customer. The SLA defines the terms and conditions of the service quality that the provider delivers to the customers. Most important in an SLA is the Quality of Service (QoS) information, which consists of several criteria such as execution time, availability, and many more. The SLA also contains financial information, such as the price for using the service and the way in which penalties are compensated.

With such complex systems, it is very difficult or even impossible to analyze and tune the application manually to fulfil SLA requirements. Any change can influence the financial condition of the company, because any deviation from business agreements has to be compensated. Additionally, growing companies insist on more flexible ways of accounting and payment for service
usage. It is very convenient, from the customer's point of view, to point out which QoS metrics are more important than others, and what maximum budget might be spent on using the service. The aim of the service provider is then not only to assure QoS, but also to change it dynamically so as not to exceed the budget. We think that service oriented architecture allows for building applications which can adapt to changes in the execution environment.

The structure of this paper is as follows. Section 2 discusses related work. In Section 3, the concept of adaptive quality control that we propose is presented. A motivating scenario is presented and evaluated in Section 4. Finally, conclusions and future work are sketched in Section 5.
2 Related Work

In this section, we cover related work on QoS-driven adaptable architectures and approaches.

2.1 Autonomic Computing

A system is called adaptive if it is able to modify itself in response to changes in the environment. Such a system must be aware of context information; this is mostly achieved by monitoring modules and the implemented sensors. As a reaction to the changes, the system modifies itself through executing modules equipped with a number of effectors. Planning and evaluating what to do when changes are noticed is described by the adaptation logic. This approach is investigated further below, but it derives from Autonomic Computing, an initiative started by IBM in 2001 whose main goal is to create self-managing systems that overcome growing complexity and reduce the effort of maintenance.
Fig. 1. Monitor-Analyze-Plan-Execute loop
In an autonomic system, the operator does not influence the system directly, but defines general policies which determine system behaviour. IBM has defined the following four areas of usage:
− Self-configuration is the autonomic configuration of components;
− Self-healing automatically discovers and corrects faults;
− Self-optimization automatically monitors and controls resources to ensure good functioning with respect to the defined requirements;
− Self-protection is the ability to identify attacks and protect the system from them.
The adaptation process might be divided along two orthogonal aspects:
− Behavioural versus architectural adaptability. Adaptation is behavioural when the behaviour of the service can be modified without modifying its structure, e.g. by tuning some parameters. In contrast, architectural adaptability takes place when the structure of the system is modified, e.g. by switching between instances of the service.
− Run-time versus design-time adaptation. Adaptation actions may be executed at run-time or at design-time. The adaptation is run-time when it can be performed during execution, and design-time when the selection of services is done at the design stage.

2.2 QoS Frameworks

Several approaches have been proposed for QoS-driven service selection. Zeng [4] considers service selection as a global optimization problem solved using linear programming; the target function that is optimised is a linear combination of QoS metrics. Similar approaches incorporate a modified Dijkstra's algorithm [7], constraint logic programming, or a knapsack formulation. This approach is only applicable to design-time adaptation because it does not recalculate the solution during composite service execution. We share Kokash's [6] opinion that the quality of these solutions depends strongly on the user weights for each QoS metric, which are not trivial to establish correctly. Abdelzaher et al. [8] show how a control-theoretic approach can be used to achieve quality of service guarantees. They demonstrate that a software system can be approximated by a linearized model and controlled by actuators and sensors.
3 Concept of Adaptive Quality Control

An analysis of existing frameworks in terms of QoS reveals that this term is used interchangeably to describe quality from the provider's point of view as well as from the client's point of view. We have therefore decided to distinguish Quality of Experience (QoE) from Quality of Service (QoS).
Fig. 2. Different ways of perceiving quality
QoS is the quality of the provided services; QoE is the quality observed by the customer. The idea is presented in Fig. 2. It is quite common that these values are completely different. Consider calculating an account balance at the end of the month. The accounting service we are using is 99.9% available, but once a month it is disabled for one hour of maintenance. If we access this service during that hour, from our point of view its availability is 0.0%. Assuming that QoS is a vector of values for given metrics, QoE is defined as:
QoE = Σ_{i=1..n} QoS_{t_i} × (1/n),   for n → ∞
where n is the number of invocations of the composite service and t_i is the time at which invocation i takes place. This leads us to monitor QoE and, on that basis, adapt the composite service to fulfil the requirements of the agreement. Further research led us to the idea of a QoS controller, depicted in Fig. 3, responsible for providing the desired QoE described by the SLA. Service oriented architecture decomposes a composite service into a set of inter-working simple base services. A set of services with the same functionality will be called an abstract service. During the invocation of the composite service, we can replace a simple service with any other that belongs to the same abstract service set. We can think of this idea as of an interface and the classes that implement it.
Fig. 3. Control loop
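As a concrete illustration of the QoE definition above and of the monitoring step in the control loop, the sketch below keeps a running average of the per-invocation QoS vectors. This is our own minimal example, not the authors' implementation; metric order and normalization are assumptions left to the caller.

public class QoeMonitor {
    private final double[] sum;   // per-metric sum of observed QoS values
    private long n;               // number of invocations observed so far

    public QoeMonitor(int metricCount) {
        this.sum = new double[metricCount];
    }

    /** Record the QoS vector observed for one invocation of the composite service. */
    public void record(double[] observedQoS) {
        for (int i = 0; i < sum.length; i++) sum[i] += observedQoS[i];
        n++;
    }

    /** Current QoE: the average over all recorded QoS vectors. */
    public double[] qoe() {
        double[] qoe = new double[sum.length];
        for (int i = 0; i < sum.length; i++) qoe[i] = (n == 0) ? 0.0 : sum[i] / n;
        return qoe;
    }
}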
In our work we distinguish four kinds of services:
− Abstract service (Si): a service which is described by its functionality, but whose instance is not pointed out;
− Service (Iik): a concrete functionality provided by a provider;
− Abstract composite service (S): a composite service that contains at least one abstract service;
− Composite service (C): a composite service that can be executed because all of its base services are concrete instances.
A composite abstract service consists of several abstract services S = {S1, S2, .., Sn}, each having several instances Si = {Ii1, Ii2, .., Iiki}. In this work we assume that the services Si are executed sequentially. This is a very strong assumption, applicable only to simple services; the problem of evaluating concurrent services [3] with structured activities is out of the scope of this paper. However, the idea of the QoS controller elaborated later is applicable to this type of composite services as well. To make good use of service adaptation, we have to collect the history of previous invocations of base services. Taking these data into account, we can predict more or less accurately the quality of an invocation. Analyzing it deeper, we have found that the quality of a provided service is strictly related to the context of the execution environment. Due to this fact, one can, for example, use services located in the part of the globe where it is currently night and hence servers are not overloaded.
Assuming that we have a history of Iik invocations, the expected quality of further invocations might be calculated using regression analysis; in the simplest case it can be a simple moving average. Before making any assumptions about the expected quality, we can also ask the base service provider for its execution environment context and, on that basis, analyse only the subset of the invocation history where the context information was the same.

3.1 Quality of Service Metrics

Quality of Service attributes include metrics like throughput, response time and availability, but the exact definition and measurement process must be well defined to give consumer and provider a common understanding. While the WSDL language has been defined for describing a Web service's functional aspects, there is no universal language for describing metrics and the way of measuring them. Moreover, many of the well-known metrics, like availability, do not have a formal definition. For example, availability might be described as a percentile value, as the number of positive invocations during the last 50 invocations, or as the availability during the last hour. Metrics are applicable not only to simple services but also to composite ones, so for each metric an algorithm must be provided for calculating its value for a service composition [1]. In this paper we consider only services invoked in a sequence. We can distinguish quantitative metrics, which have numerical values or might be described by values, and qualitative ones, which are described in words. Nevertheless, for every metric an equation or algorithm must be provided to recalculate it to the <0;1> range. QoS is a vector of values for the given metrics. If service a is better than service b from the point of view of a given metric, the metric value of service a is greater than the value for service b; otherwise, for evaluation we have to use inversions of the metric values.

3.2 QoS Controller

We have designed a QoS controller which continuously monitors the deviation of the current user QoE from the agreed SLA and, on that basis, tunes the service QoS by selecting the service instances that are invoked. The deviation of the provided quality of service is described as follows:

ΔQoS = QoE_SLA − QoE
Before the execution of an abstract service, it is decided which instance to invoke. The decision process takes into account the services already invoked, the correction from the feedback loop, and the influence of each abstract service instance on the possible overall quality. The total number of invocation possibilities is ∏_{i=1..n} |Si|, but only when none of the abstract services has been invoked so far. Assuming that services S1,..,Sk have already been invoked, the total number p of possible invocations is ∏_{i=k+1..n} |Si|. Before the invocation of any base service, all possible invocations are evaluated as presented in Fig. 4. For the services that have not been invoked yet, the mean value of Iik over historical invocations is used for calculating the QoS of the composition. From the set {C1,..,Cp} of possible invocations, we select one that fulfils:
Fig. 4. Execution model
∀i : sign(QoS_i(Cx) − QoE_i) = sign(ΔQoS_i)
This guarantees that if any metric value in the SLA is greater than in the QoE, a composite service will be selected whose QoS contains that metric large enough to balance the difference, and analogously small enough when the metric value in the SLA is less than in the QoE.

3.3 Variable QoS

Providing a contracted quality of service is a very competitive task, as is providing a service with variable quality that does not exceed a specified budget. Our idea is to estimate the number of invocations to the end of the accounting period based on the number of invocations so far. Having this information and the amount of money left to spend, we can change the offer to a cheaper one. The customer has priorities as to which metrics are more important than others. Providing exact weights for each metric is a multi-dimensional decision problem and it is not a trivial task; we have found that decomposing the problem of assigning weights makes it easier to deal with. The Analytic Hierarchy Process (AHP) [2] is a technique based on mathematics and human psychology for prioritizing the elements of a decision problem. For each pair of metrics, the user specifies which one is preferred, in the form of a fraction between 1/9 and 9/1. The result of AHP is a vector of weights wi for each metric i. We will refer to the fitness factor of a QoS vector, taking into account the importance of metrics, as:

fitness(QoS) = Σ_i wi × QoS_i
The customer chooses the set of SLAs in which he is interested. After estimating the cost per single invocation, the system chooses the SLA with the best fitness and a price per invocation lower than the calculated one.

3.4 Service Level Agreement

A client agreement is represented as a tuple: SLA = (QoE_SLA, price, penalties, time), where QoE_SLA is the quality of user experience, price is the amount of money which the client has to pay for each service invocation, penalties is the price which the provider will pay in the case of any deviation from the QoE, and time is the accounting period. The price for using service S is defined as follows:
bill = n × price − missed × penalties

missed = 0,                                if QoE ≥ QoE_SLA
missed = n × max_i (QoE_SLA_i − QoE_i),    if QoE < QoE_SLA
where n is the number of invocations. The accounting algorithm might be illustrated by a simple example. Let us say that the availability agreed in the SLA is 50%, and at the end of the month, out of 20 invocations, only 6 were successful. It means that the customer was overcharged unfairly for 4 invocations, which has to be compensated. It is company policy that defines how the value of price is related to the value of penalties. Accounting for variable QoS is quite different, because the client agreement is represented as a set of SLAs together with the metric weights calculated using AHP from a set of preferences: SLA_var = (preferences, {SLA1,.., SLAl}, budget)
The system dynamically changes the SLA to ensure that the bill will not be greater than the budget. In every accounting period, the client is charged independently for every SLA contract, but the total sum is not greater than the budget. The client is eligible to receive compensation if any of the SLAs is violated.
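The accounting rule above can be sketched as follows. This is our own hedged illustration, not the authors' code; it computes the number of missed invocations from the worst per-metric shortfall and checks it against the 50% availability example given earlier (6 successful invocations out of 20).

public class SlaAccounting {

    /** missed = 0 if QoE meets the SLA, otherwise n * max_i(QoE_SLA_i - QoE_i). */
    static double missedInvocations(long n, double[] qoeSla, double[] qoe) {
        double worstShortfall = 0.0;
        boolean violated = false;
        for (int i = 0; i < qoeSla.length; i++) {
            double shortfall = qoeSla[i] - qoe[i];
            if (shortfall > 0) violated = true;
            worstShortfall = Math.max(worstShortfall, shortfall);
        }
        return violated ? n * worstShortfall : 0.0;
    }

    /** bill = n * price - missed * penalties */
    static double bill(long n, double price, double penalties, double[] qoeSla, double[] qoe) {
        return n * price - missedInvocations(n, qoeSla, qoe) * penalties;
    }

    public static void main(String[] args) {
        // Agreed availability 0.5, observed availability 6/20 = 0.3:
        // the customer is compensated for roughly 20 * (0.5 - 0.3) = 4 invocations.
        double[] qoeSla = { 0.5 };
        double[] qoe = { 6.0 / 20.0 };
        System.out.println(missedInvocations(20, qoeSla, qoe));
    }
}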
4 Motivating Scenario

Let us assume that we run an Internet website that provides information for amateur pilots, and that we want to include a very accurate weather forecast for the nearest airports on the main page. The composite service consists of two base services: a service which takes a city name as input and returns the airports in the near proximity, and a second service that takes the names of the places and returns the weather forecast. We have to define a metric for describing the quality of the provided data. Moreover, our portal is a growing business, hence we have a limited budget to maintain the system, so we do not want to spend more than an assumed price for the service. For this type of service it is better, from the customer's point of view, to obtain very accurate data than to have high availability of poor information. To verify our concepts we have developed a simulation environment that is flexible enough to implement different adaptation strategies.

4.1 Metrics

We assume that base services are invoked in a sequence. Below we describe the metrics used in this example scenario and the algorithms for calculating their values for composite services.

Availability. Availability is the probability that a service is accessible. For our composite service, availability is the product of the availabilities of the base services.

Execution time. Execution time is the time between invoking a service and receiving the response. For a composite service, execution time is the sum of the execution times of the base services.

Data Quality. For this example we have introduced a Data Quality metric which defines the quality of the received data. Data quality simply means the accuracy, in kilometres, of
the weather information. The value of this metric for the composite service is the minimum value over the base services. As mentioned before, our service is very specific, so the customer has to decide what is more important and what is not:
− availability is three times as important as execution_time;
− availability is five times less important than data_quality;
− execution_time is five times less important than data_quality.
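To make the weighting step concrete, the sketch below turns the three pairwise preferences listed above into a weight vector. It uses the common geometric-mean approximation of AHP rather than the exact eigenvector method, so the resulting numbers (roughly 0.20, 0.10 and 0.70) are our own computation and may differ slightly from the weights shown in Fig. 5.

public class AhpWeights {

    /** Approximate AHP weights: normalized geometric means of the pairwise matrix rows. */
    static double[] weights(double[][] pairwise) {
        int n = pairwise.length;
        double[] w = new double[n];
        double total = 0.0;
        for (int i = 0; i < n; i++) {
            double product = 1.0;
            for (int j = 0; j < n; j++) product *= pairwise[i][j];
            w[i] = Math.pow(product, 1.0 / n);   // geometric mean of row i
            total += w[i];
        }
        for (int i = 0; i < n; i++) w[i] /= total;  // normalize so the weights sum to 1
        return w;
    }

    public static void main(String[] args) {
        // Rows/columns: 0 = availability, 1 = execution_time, 2 = data_quality.
        double[][] prefs = {
            { 1.0,     3.0, 1.0 / 5 },   // availability: 3x execution_time, 1/5 of data_quality
            { 1.0 / 3, 1.0, 1.0 / 5 },   // execution_time: 1/5 of data_quality
            { 5.0,     5.0, 1.0     }    // data_quality dominates both
        };
        double[] w = weights(prefs);
        System.out.printf("%.2f %.2f %.2f%n", w[0], w[1], w[2]);  // about 0.20 0.10 0.70
    }
}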
Fig. 5. Evaluated metrics weights
The results of the metric weight evaluation with AHP are presented in Fig. 5. The customer agreed to use three SLAs. Each contract has the same data quality metric, but differs in response time and availability. In Table 1, the contracts are listed in order with respect to the metric weights.

Table 1. SLA contracts

  SLA  Price per 100 invocations  Response time [ms]  Availability  Data quality [km]
  0    1                          100                 0.9           5
  1    0.5                        300                 0.8           5
  2    0.2                        200                 0.7           5
4.2 Evaluation

The customer noticed that his website gets three thousand visits per month. As the website becomes more popular, the number of invocations may significantly increase. In this situation the client decided to buy the service with variable QoS and a maximum budget of 30 €. Detailed statistics of user invocations are presented in Fig. 6. The larger number of page visits during the third week can be explained by a national flying contest taking place. The total number of invocations was not the expected three thousand but almost five thousand. During the month, the system tried to estimate the number of invocations until the end of the accounting period, as depicted in Fig. 7. Because the number of invocations was greater than expected, the system had to switch SLAs to keep the total sum below the assumed 30 €. We can notice that on the 14th day the system decided to switch the SLA one level down, but the increased number of visits stayed for three days, so the system decided to switch one more level down. The website traffic then started coming back to its normal state, so the system came back to the best SLA, as depicted in Fig. 8.
To calculate the expected number of invocations to the end of the month, the system estimates it using a moving average over the last 7 days: the mean number of visits per day is multiplied by the number of days remaining to the end of the period.

Fig. 6. Number of invocations per day
Fig. 7. Estimating the number of invocations to the end of the month (estimated vs. real number of invocations)
Fig. 8. Selected SLA
Fig. 9. Partial bill (variable QoS vs. constant QoS)
Fig. 10. Convergence of availability
The SLA has been changed several times during the accounting period, hence the QoS metrics had to converge to the values contracted in the selected SLA. One can notice that the convergence pattern in Fig. 10 is very similar to the one known from control theory.

4.3 Discussion

Without variable QoS, the customer would be asked to pay 50 €, which is a lot more than the assured 30 €. With variable QoS activated, the system is able to accommodate the busy week on
the website. As stated in the contracted SLAs, the data quality was the same during the whole month, but availability and response time changed as many times as the SLA changed. Fig. 9 shows the partial bill over the whole month. The proposed approach does not take into account deviations caused by Internet connections; a possible extension to incorporate them is to design remote invocation monitoring deployed at the customer side. Secondly, the accounting method might be unfair, especially when the number of composite service invocations is very low or when the invocations are not spread equally over the accounting period.
5 Conclusion

In this paper, we have presented a novel approach to managing composite services. With the growing popularity of the Internet, providing services with guaranteed QoS has become increasingly important. Our algorithm integrates statistical methods for predicting the quality of service of composite service invocations, as well as automatic adaptation strategies that keep to the consumer's budget by changing the SLA during execution. The presented case study verified the usability of our method. Many remaining issues are worth further research. The most interesting is how this model behaves in real applications. Another interesting aspect is how to improve the accounting algorithm to prevent unfair charges. Finally, further examples, experimental tests and practical experience are needed to find the true potential of applying adaptive quality control to different classes of composite services. This remains an important focus for our future research.
References

1. Menascé, D.A.: Composing Web Services: A QoS View. IEEE Internet Computing 8(6), 88–90 (2004)
2. Forman, E.H., Selly, M.A.: Decision By Objectives – How To Convince Others That You Are Right. World Scientific Publishing Co. Pte. Ltd., Singapore (2001)
3. Cardoso, J., Sheth, A.P., Miller, J.A., Arnold, J., Kochut, K.: Quality of service for workflows and web service processes. Journal of Web Semantics 1(3), 281–308 (2004)
4. Zeng, L., Benatallah, B., Ngu, A.H.H., Dumas, M., Kalagnanam, J., Chang, H.: QoS-Aware Middleware for Web Services Composition. IEEE Transactions on Software Engineering 30(5), 311–327 (2004)
5. Arsanjani, A., Zhang, L.-J., Ellis, M., Allam, A., Channabasavaiah, K.: S3: A Service-Oriented Reference Architecture. IT Professional 9(3), 10–17 (2007)
6. Kokash, N.: A Service Selection Model to Improve Composition Reliability. In: Proceedings of the International Workshop on AI for Service Composition (2006)
7. Gu, X., Nahrstedt, K., Chang, H., Ward, C.: QoS-Assured Service Composition in Managed Service Overlay Networks. In: ICDCS 2003: Proceedings of the 23rd International Conference on Distributed Computing Systems, Washington, DC, USA, p. 194 (2003)
8. Abdelzaher, T., Stankovic, J., Lu, C., Zhang, R., Lu, Y.: Feedback performance control in software services. IEEE Control Systems Magazine 23(3), 74–90 (2003)
Ontology Supported Selection of Versions for N-Version Programming in Semantic Web Services

Pawel L. Kaczmarek

Department of Computer Systems Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology
[email protected]
Abstract. The Web Services environment provides capabilities for effective N-version programming, as there exist different versions of software that provide the same functionality. N-version programming, however, faces the significant problem of correlated failures in different software versions. This paper presents a solution that attempts to reduce the risk of failure correlation by selecting for invocation services that have relatively different non-functional features. We use an ontology-driven approach to identify and store information about software features related to differences between software versions, such as the software vendor, design technology or implementation language. We present an algorithm for the selection of software versions using the designed ontology. The solution was verified in a prototypical implementation with the use of an existing OWL-S API library.
1 Introduction
N-version programming (NVP) is a resilient computing mechanism [1] that has been used for decades to increase software dependability. The technique was initially used and researched in sequential systems; however, different research groups now focus on NVP in distributed systems, as described later in this paper. It seems that NVP can be successfully applied in Web Services and Semantic Web Services. The Web Services architecture assumes that services supplying the same functionality are advertised and available for clients. A client can either choose a service that supplies the best price and dependability, or invoke different services in order to increase dependability.

The paper addresses the typical problem that NVP faces: there exists a correlation between errors in different software versions [2]. The correlation results from similar educational backgrounds, programming languages, the algorithms used and other factors. In Web Services, however, services differ in vendor and technology, which might lay the foundations for the creation of dependable N-version systems. In our solution, we attempt to design a technique for the selection of services that are unlikely to fail for similar input or in similar conditions.
2 Semantic N-Version Invocation Module
The designed solution is aimed at increasing the dependability of NVP without increasing the number of invoked versions of a service; the limited number of invoked services reduces the invocation costs. In this solution, we select those services that have relatively different non-functional features, which consequently limits the risk of repeating feature-specific errors during the execution of a selected set.

The first step is to identify service features, the dependencies between features and their impact strength. An N-version features ontology is defined to describe the features related to differences between software versions. Examples of ontology concepts are: implementation language, software vendor, design process, runtime platform and the algorithms used (see Sect. 3.1). It is assumed that already existing service registries know different services supplying the same functionality, and that the existing servers already offer services of equal functionality. Relevant information about available services is stored in the ontology.

The next step is to design an algorithm for the selection of services depending on service features. Generally, the algorithm calculates the number of common features for groups of services and selects a group in which the services are relatively different (see Sect. 3.2). An N-version invocation that uses our service selection mechanism consists of the following steps:
– A matching subsystem selects available services that match the client's request.
– Services are selected from the initial set with respect to service features:
  • The N-version features ontology is queried for service features.
  • Binary service similarities are identified.
  • Service similarities are calculated for potential groups.
  • A group with the lowest service similarity is selected for invocation.
– The selected services are invoked.
– The result is voted on and returned to the client.
Finally, the solution is implemented and validated.
2.1 Module Architecture
The architecture of the N-version invocation module is shown in Fig. 1. The system consists of the following submodules:

Search and matching module - performs the typical tasks of service discovery and matching. It is assumed that an already existing matching module is used and that the module is capable of delivering a set of services that match a client's request.

Selection module - selects services for an N-version invocation from the available services supplied by the Search module. A service features knowledge base is used to create a configuration of possibly different services.

Service knowledge base - uses the N-version features ontology to store information about known services. It can be stored in two ways:
Ontology Supported Selection of Versions for N-Version Programming
Client
invoke OWL−S return
Search appropriate matching
319
Service knowledge base query
Select for invocation
Invoke / vote
N−version features ontology
invoke third party
Fig. 1. Main parts of N-Version invocation modules with replicas selection
– locally - integrated in client’s application – remotely - accessible to different clients through Web Services Invocation module - manages OWL-S [3] definitions, invokes N-version services, votes the result and returns it to a client.
3 Selection of Services
The identified service features and the dependencies related to service differences in NVP are stored as an ontology. We use an ontology-driven approach for the following reasons: (i) it is a systematic and organized way of describing entities, (ii) there already exist ontologies and taxonomies that describe services, and (iii) there are technological similarities between ontologies (OWL) and Semantic Web Services (OWL-S).
3.1 N-Version Features Ontology
We designed the N-version features ontology focusing on concepts concerning the correlation of errors during an N-version invocation. The designed ontology is based on the following existing ontologies and taxonomies: EvoOnt - A Software Evolution Ontology [4], Ontology and Taxonomy of Services [5], Core Software Ontology [6] and the Service Ontology from Obelix [7]. Fig. 2 presents the classes and relations defined in the N-version features ontology. A SemanticService describes a service that contains ontological descriptions of the service bundle contents [7]. A service is a loosely coupled, reusable software component that semantically encapsulates discrete functionality and is distributed as well as programmatically accessible over standard Internet protocols. The concepts describing a SemanticService concern vendor, development and runtime information. A Vendor of a SemanticService is an organization or a person that supplies the service. A SemanticService is designed with the use
Fig. 2. Most important classes of the N-version features ontology
of a DesignTechnology, such as the Waterfall model or the Spiral model, and Algorithms. It is implemented in one or more ImplementationLanguages. A CommonModule and its subclasses describe third-party modules that are included in service code as Libraries or Frameworks. Finally, a SemanticService runs on a RuntimePlatform.
3.2 Service Selection Algorithm
The ontological description is used by the selection algorithm to identify groups of services in which the services have relatively different features. Services from one of the groups are selected for an N-version invocation. The algorithm selects services for the N-version invocation from the available services that match the client's request. The algorithm makes the following assumptions:
– A matching subsystem has already selected services that match the client's request.
– The N-version features ontology describes the available services.
Ensuring that the services in an N-version invocation differ reduces the risk of correlated failures specific to those features. Generally, the algorithm proceeds in the following steps: (i) the ontology is queried for service features, (ii) common features for pairs of services are calculated, (iii) common features are added up for the services of each potential group, (iv) the services from a group with a relatively small number of common features are returned. The selection is done after basic client-server matching and before the actual invocation of the different versions of a service. The input for the selection algorithm is the set of available services that supply the required functionality; the output is the set selected for an N-version invocation. Algorithm 1 presents the most important selection steps.
Algorithm 1. Selection of service versions for N-version invocation.
  input: matchingServices - all services that match the client's request
  input: groupSize - number of versions that are invoked
  output: services selected for invocation

  query the N-version features ontology for information about matchingServices
  for all service in matchingServices do
      fetch serviceFeatures
  end for
  create featureMatrix in which rows and columns correspond to services from matchingServices
  for all servicei, servicej in matchingServices do
      featureMatrix[i][j] = count common features for servicei and servicej
  end for
  create groupsList containing all subsets of size groupSize from matchingServices
  for all group in groupsList do
      calculate similarityMetric: add up values from featureMatrix for each pair of services from group
  end for
  identify bestGroups: select groups with the smallest similarityMetric
  select one finalGroup from bestGroups
  return services from finalGroup
The presented listing simplifies the analysis by treating all features as equally important. However, features from the N-version ontology may have a different impact on the correlation of failures in N-version programming. Although no research on such impact is known to us, the designed algorithm should distinguish feature importance during service selection. We arbitrarily select features of primary and secondary impact strength. Concepts considered to be of primary importance are: ImplementationLanguage, Vendor, MiddlewareServer, CommonModule and DesignTechnology. Other concepts are considered to be of secondary importance.
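The sketch below is our own Java illustration of Algorithm 1 combined with the primary/secondary weighting just discussed; it is not the prototype's code. Each service is described as a map from feature name to value, and the weights of 2 for primary and 1 for secondary features are assumptions made only for the example. With all weights set to 1, similarity() counts common features exactly as featureMatrix does in Algorithm 1.

import java.util.*;

public class VersionSelector {

    static final Set<String> PRIMARY = Set.of(
        "ImplementationLanguage", "Vendor", "MiddlewareServer",
        "CommonModule", "DesignTechnology");

    /** Weighted number of common features shared by two services. */
    static int similarity(Map<String, String> a, Map<String, String> b) {
        int score = 0;
        for (Map.Entry<String, String> e : a.entrySet()) {
            if (e.getValue().equals(b.get(e.getKey()))) {
                score += PRIMARY.contains(e.getKey()) ? 2 : 1;
            }
        }
        return score;
    }

    /** Pick the group whose pairwise similarities add up to the smallest value. */
    static List<Map<String, String>> select(List<Map<String, String>> services, int groupSize) {
        List<List<Map<String, String>>> groups = new ArrayList<>();
        subsets(services, groupSize, 0, new ArrayList<>(), groups);

        List<Map<String, String>> best = null;
        int bestMetric = Integer.MAX_VALUE;
        for (List<Map<String, String>> group : groups) {
            int metric = 0;
            for (int i = 0; i < group.size(); i++)
                for (int j = i + 1; j < group.size(); j++)
                    metric += similarity(group.get(i), group.get(j));
            if (metric < bestMetric) {
                bestMetric = metric;
                best = group;
            }
        }
        return best;
    }

    /** Enumerate all subsets of the requested size. */
    static void subsets(List<Map<String, String>> services, int size, int from,
                        List<Map<String, String>> current,
                        List<List<Map<String, String>>> out) {
        if (current.size() == size) {
            out.add(new ArrayList<>(current));
            return;
        }
        for (int i = from; i < services.size(); i++) {
            current.add(services.get(i));
            subsets(services, size, i + 1, current, out);
            current.remove(current.size() - 1);
        }
    }
}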
3.3 Gathering Data for Service Descriptions
Although the structure of the N-version features ontology is statically defined, it is still necessary to fill in information about known services. The most desirable approach is to automatically fetch data about service features from existing sources of information. In many cases this can be done, as information is available for automatic processing in different places in the Semantic Web infrastructure: WSDL, UDDI and OWL-S. Some features, however, need to be handled manually. Information about a service is available in the WSDL definition, the UDDI registry or OWL-S files at different abstraction levels. The service vendor is described in UDDI definitions in the "businessEntity" part of the service description, with optional detailed information. The service runtime can be determined by querying the service endpoint about its middleware platform.
Existing sources do not provide information about service design and implementation. In particular, it will be necessary to handle information about the development approach, the design process and the used algorithms manually. Information about the implementation language and used frameworks is not normally available in the service description. Additional information is either included directly in the N-version features ontology or in the OWL-S descriptions of individual services.
4 Prototypical Implementation
The designed solution was verified by a prototypical implementation. The implementation covers a simplified N-version features ontology, the service selection algorithm and the invocation of semantic services. The simplified ontology contained the SemanticService class and the following classes that describe service features: Vendor, RuntimePlatform and ImplementationLanguage. The implemented algorithm uses information stored in the ontology to calculate the similarityMetric for the potential groups of services. It is assumed that service features are of equal importance. Selected services are invoked using their OWL-S and WSDL definitions. Partial results from the services are gathered and the final result is voted on using simple majority voting. If consensus is achieved, it is returned to the invoker; otherwise an exception is thrown by the N-version invocation module. We used the following third-party libraries in the implementation:
– Protege-OWL - definition of the N-version features ontology.
– Mindswap OWL-S API - Java API for the invocation of Semantic Web Services and transformation from WSDL to OWL-S.
– Jena SPARQL - execution of SPARQL queries on the N-version features ontology to fetch information about services.
The implementation does not cover service matching between the client's request and the server's offering, as it is not within the scope of this paper. The matching phase is realized by a mock matcher with a fixed matching between services. Automated creation of the N-version features ontology is not yet implemented in the system. Code snippets showing the SPARQL query and service invocation are shown in Listings 1.1 and 1.2.

Listing 1.1. SPARQL query executed on the prototypical ontology

PREFIX ...
SELECT ?service ?runtimePlatform ?vendor ?implLanguage
WHERE {
  ?service rdf:type table:SemanticService .
  ?service table:hasRuntime ?runtimePlatform .
  ?service table:hasVendor ?vendor .
  ?service table:implementedIn ?implLanguage .
}
...
Listing 1.2. Code snippets for service selection and invocation

import com.hp.hpl.jena.query.*;
import org.mindswap.owls.process.*;
...
ProcessExecutionEngine exec;  // from org.mindswap.owls.io
OWLSReader reader;
...
public List selectServices(...) {
  ...
  Query query = QueryFactory.create(queryString);
  QueryExecution qe = QueryExecutionFactory.create(query, model);
  ResultSet results = qe.execSelect();
  ...
}

public String invokeNVariant(...) throws Exception {
  ...
  selectedServices = selectServices(...);
  for (int i = 0; i < selectedServices.size(); i++) {
    ...
    // invoke using Mindswap OWL-S API
    service = reader.read(owlsFile);
    process = service.getProcess();
    exec.execute(process, values);
    ...
  }
  ...
}
4.1 Selection Example
As an example of service selection, let us consider the following demo configuration. Six services supplying the same functionality differ in their implementation language, runtime platform and service vendor. NVP is configured to invoke triples of services: we select three services from the six available ones in such a way that the selected services have possibly different features. Table 1 shows the features of the demo services fetched by the SPARQL query, and Table 2 shows the number of common features for pairs of services. Let ES abbreviate ExemplaryService. For example, services ES1 and ES2 have two common features, RuntimePlatform and Vendor, while services ES1 and ES3 have no common features.

Table 1. Demo services

  Service id         ImplLanguage  RuntimePlatform  Vendor
  ExemplaryService1  CSharp        DotNet           SemanticDemoCorp
  ExemplaryService2  JSharp        DotNet           SemanticDemoCorp
  ExemplaryService3  J2EE          Axis2            FreeSemanticProducts
  ExemplaryService4  J2EE          Axis2            OntologyDemoUniv
  ExemplaryService5  J2EE          JBoss            SemanticDemoCorp
  ExemplaryService6  J2EE          JBoss            FreeSemanticProducts
Table 2. Number of common features between services

      ES1  ES2  ES3  ES4  ES5  ES6
ES1    x    2    0    0    1    0
ES2    -    x    0    0    1    0
ES3    -    -    x    2    1    2
ES4    -    -    -    x    1    1
ES5    -    -    -    -    x    2
ES6    -    -    -    -    -    x
Then the algorithm calculates the similarityMetric for groups of services. Assuming that triples are selected, there are 20 potential groups, with similarityMetric values ranging from 1 to 5. Groups number 9 {ES1, ES4, ES6} and 15 {ES2, ES4, ES6} have the lowest value of the metric (one). Group 19 {ES3, ES5, ES6}, for example, has a similarityMetric of 5. Either group number 9 or 15 is selected and passed on to the invocation and voting procedure. In this scenario, the Java programming language is a common feature of the majority of the invoked services in both group 9 and group 15. It may happen that an error specific to Java programming is repeated and manifests in both implementations for some input. The other features differ among the invoked services, which gives grounds to expect that a feature-specific fault will not corrupt the N-version invocation. For example, an error in the SemanticDemoCorp development process will probably not be repeated in other companies, therefore a fault specific to SemanticDemoCorp will be concealed by the other versions. Analogously, if a failure manifests in the Axis2 middleware for some configuration, it will be concealed by the services running on JBoss and .NET.
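To make the selection step concrete, the following small sketch (ours, not part of the paper's prototype; Python is used only for illustration) enumerates all triples of the demo services from Table 1, computes the similarityMetric as the sum of pairwise common-feature counts, and recovers groups 9 and 15 as the minimizers:

from itertools import combinations

# Features of the demo services from Table 1 (ImplLanguage, RuntimePlatform, Vendor).
features = {
    "ES1": ("CSharp", "DotNet", "SemanticDemoCorp"),
    "ES2": ("JSharp", "DotNet", "SemanticDemoCorp"),
    "ES3": ("J2EE",   "Axis2",  "FreeSemanticProducts"),
    "ES4": ("J2EE",   "Axis2",  "OntologyDemoUniv"),
    "ES5": ("J2EE",   "JBoss",  "SemanticDemoCorp"),
    "ES6": ("J2EE",   "JBoss",  "FreeSemanticProducts"),
}

def common_features(a, b):
    # Number of positions where the two services share the same feature value.
    return sum(1 for fa, fb in zip(features[a], features[b]) if fa == fb)

def similarity_metric(group):
    # Sum of pairwise common-feature counts over all pairs in the group.
    return sum(common_features(a, b) for a, b in combinations(group, 2))

triples = list(combinations(sorted(features), 3))   # the 20 potential groups
best = min(similarity_metric(g) for g in triples)
selected = [g for g in triples if similarity_metric(g) == best]
print(best, selected)
# prints 1 and the two groups {ES1, ES4, ES6} and {ES2, ES4, ES6}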
4.2 Dependability of N-Version Module Itself
The N-version invocation module may itself be a source of additional errors and a threat to computer system dependability. We propose two alternative invocation mechanisms that can be used in case the primary N-version invocation module fails: (i) a secondary, simplified N-version invocation module and (ii) a simple invocation of a single service. The secondary module performs a standard N-version invocation in which randomly chosen services are selected from the set of matching services and invoked. If both the primary and the secondary N-version modules fail, a simple invocation is performed on a service randomly selected from the matching services. A deliberately simple invocation stub that can detect failures or timeouts of the N-version modules is created. The stub invokes either the primary module, the secondary module or a single service.
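A minimal sketch of this degradation order is given below. It is only illustrative: the module interface (invoke) and the NVersionFailure exception are hypothetical names, not part of the described prototype.

import random

class NVersionFailure(Exception):
    """Raised (hypothetically) when an N-version module cannot reach consensus."""
    pass

def invocation_stub(request, primary, secondary, matching_services, timeout=30):
    # Try the primary N-version module, then the simplified secondary one,
    # and finally a plain invocation of one randomly chosen matching service.
    for module in (primary, secondary):
        try:
            return module.invoke(request, timeout=timeout)
        except (NVersionFailure, TimeoutError):
            continue  # fall through to the next, simpler mechanism
    service = random.choice(matching_services)
    return service.invoke(request, timeout=timeout)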
5 Related Work
Although dependability in distributed systems is a mature research discipline, the works related to N-version programming in service oriented architecture are
quite recent. Looker et al. [8] propose an Axis stub for N-version invocations. The solution uses service location and a majority-voting scheme. Our work differs in that we use semantic selection of different versions of a web service; additionally, we propose rules for service selection to achieve the best dependability results. Santos et al. [9] propose a fault-tolerant infrastructure that is based on the FT-CORBA architecture. Similarly to the previous work, the solution does not use semantic information and does not select services from the available ones. Cardoso [10] proposes semantic integration of Web Services with the use of WSDL-S and JXTA technology. The solution is based on creating Semantic Web Services proxies and peer groups that are used as N-version software. Our solution differs in that we consider service features for service selection. We use the N-version features ontology driven algorithm to select those service instances that should be included in N-version invocations. Additionally, in our solution, versions of Web Services are unaware of each other, as they are discovered and invoked by the invocation module. Townend et al. [11] propose a replication-based solution for grid environments. An "FT-Grid co-ordination service" is used to locate, receive, and vote upon jobs submitted by a client program. Our solution differs in that we use semantic information about services and select service versions depending on their features. There is a lot of work on the dependability of SOA systems that is loosely related to the scope of this paper. [1] describes research in software diversity and off-the-shelf components. WS-* standards were proposed for Web Services dependability, such as WS-ReliableMessaging, WS-Reliability, WS-Security, WS-AtomicTransaction and others. These standards usually concern lower layers of software systems. Backward recovery and exception handling are addressed in [12,13].
6 Conclusions and Future Work
The aim of this paper was to propose a technique for the selection of service versions for N-version invocations. The solution aims at increasing software dependability without increasing the number of invoked services, in order to reduce invocation cost. We presented an ontology of service features related to N-version programming and an algorithm for service selection. The designed ontology and algorithm show that services can be selected according to their non-functional features, which reduces the risk of repeating feature-specific failures. A prototype implementation shows that this solution can be effectively applied to Semantic Web Services. The obtained results are promising, especially considering the fact that the Web Services infrastructure supplies different infrastructures for service development and sharing. Future work concerns further effort to fully implement the designed solution. The current implementation does not integrate the matching module, the feature gathering functionality and some classes from the N-version features ontology. It also needs to be updated to OWL-S version 1.1. Additionally, experiments need to be performed to determine the impact of primary and secondary service
features on service execution. The distinction of impact strength was made heuristically and needs to be verified. Finally, the implemented system should be verified in a real-world application.
Acknowledgments. This work was supported by the Polish Ministry of Science and Higher Education under research project No. N519 022 32/2949.
References
1. ReSIST: Resilience for Survivability in IST, A European Network of Excellence: Resilience-Building Technologies: State of Knowledge (2006)
2. Knight, J.C., Leveson, N.G.: An experimental evaluation of the assumption of independence in multiversion programming. IEEE Transactions on Software Engineering (1986)
3. W3C: OWL-S: Semantic Markup for Web Services (2004)
4. Kiefer, C., Bernstein, A., Tappolet, J.: EvoOnt - a software evolution ontology. Technical report, Dynamic and Distributed Information Systems Group, University of Zürich (2007), http://www.ifi.uzh.ch/ddis/msr/
5. Cohen, S.: Ontology and taxonomy of services in a service-oriented architecture. The Architecture Journal, Microsoft Corporation (2007)
6. Gangemi, A., Mika, P., Sabou, M., Oberle, D.: An ontology of services and service descriptions. Technical report, OntoWare.org, Institute AIFB, University of Karlsruhe (2003), http://cos.ontoware.org/
7. Baida, Z., Gordijn, J., Akkermans, H.: Service ontology. Technical report, Ontology-Based ELectronic Integration of CompleX Products and Value Chains (2003)
8. Looker, N., Munro, M., Xu, J.: Increasing web service dependability through consensus voting. In: 29th IEEE Annual International Computer Software and Applications Conference (2005)
9. Santos, G., Lung, L.C., Montez, C.: FTWeb: A fault tolerant infrastructure for web services. In: Ninth IEEE International EDOC Enterprise Computing Conference (2005)
10. Cardoso, J.: Semantic integration of web services and peer-to-peer networks to achieve fault-tolerance. In: IEEE International Conference on Granular Computing (2006)
11. Townend, P., Xu, J.: Replication-based fault tolerance in a grid environment. In: U.K. e-Science 3rd All-Hands Meeting (2004)
12. Xu, J., Romanovsky, A., Randell, B.: Concurrent exception handling and resolution in distributed object systems. IEEE Transactions on Parallel and Distributed Systems (2000)
13. Kaczmarek, P.L., Krawczyk, H.: Remote exception handling for PVM processes. In: Dongarra, J., Laforenza, D., Orlando, S. (eds.) EuroPVM/MPI 2003. LNCS, vol. 2840. Springer, Heidelberg (2003)
Hybrid Index for Metric Space Databases

Mauricio Marin (1), Veronica Gil-Costa (2), and Roberto Uribe (3)

(1) Yahoo! Research, Santiago, Chile
(2) DCC, Universidad Nacional de San Luis, Argentina
(3) DCC, Universidad de Magallanes, Chile
[email protected]
Abstract. We present an index data structure for metric-space databases. The proposed method has the advantage of allowing an efficient use of secondary memory. In the case of an index entirely loaded in main memory, our strategy achieves competitive performance. Our experimental study shows that the proposed index outperforms other strategies known to be efficient in practice. A valuable feature of the proposal is that the index can be dynamically updated once constructed.
1 Introduction
Searching in metric spaces is a very active research field since it offers efficient methods for indexing and searching by similarity in non-structured domains. For example, multimedia databases manage objects without any kind of structure, like images, audio clips or fingerprints. Retrieving the most similar fingerprint to a given one is a typical example of similarity search. The problem of text retrieval is present in systems that range from a simple text editor to big search engines. In this context we can be interested in retrieving words similar to a given one to correct typing errors, or documents similar to a given query. We can find more examples in areas such as computational biology (retrieval of DNA or protein sequences) or pattern recognition (where a pattern can be classified from other previously classified patterns). Similarity search can be trivially implemented by comparing the query with all the objects of the collection. However, the high computational cost of the distance function, and the high number of times it has to be evaluated, makes similarity search very inefficient with this approach. This has motivated the development of indexing and search methods in metric spaces that make this operation more efficient by trying to reduce the number of evaluations of the distance function. This can be achieved by storing in the index information that, given a query, can be used to discard a significant amount of objects from the data collection without comparing them with the query. Although reducing the number of evaluations of the distance function is the main goal of indexing algorithms, there are other important features. Some methods can only work with discrete distance functions while others admit continuous distances too. Some methods are static, since the data collection cannot grow
once the index has been built. Dynamic methods support insertions in an initially empty collection. Another important factor is the possibility of efficiently storing these structures in secondary memory. Search methods in metric spaces can be grouped into two classes [2]: pivot-based and clustering-based search methods. A pivot-based strategy selects some objects as pivots from the collection, then computes the distances between the pivots and the objects of the database, and uses this information to group related objects. The index is built by computing and storing the distances from each pivot to the objects of the database. During the search, this information is used to discard objects from the result without comparing them with the query. Clustering techniques partition the collection of data into groups called clusters such that similar entries fall into the same group. Thus, the space is divided into zones as compact as possible, usually in a recursive fashion, and this technique stores a representative point ("center") for each zone plus a few extra data that permit quickly discarding the zone at query time. In the search, complete regions are discarded from the result based on the distance from their center to the query. In this paper we propose a combination of two existing methods (Sec. 2). The first method is used as proposed by its authors, whereas the second one has been highly optimized by us to deal with secondary memory efficiently and, very importantly, to reduce the running time by increasing the ability of the strategy to quickly discard objects that cannot be part of the solution to a given query (Sec. 3). We present a complete evaluation of the performance of the proposed strategy in Sec. 4, which shows that our strategy consistently outperforms all others in practice. Sec. 5 presents concluding remarks.
2 Metric Spaces and Indexing Strategies
A metric space (X, d) is composed of a universe of valid objects X and a distance function d : X × X → R+ defined among them. The distance function determines the similarity between two given objects. The goal is, given a set of objects and a query, to retrieve all objects close enough to the query. This function holds several properties: strict positiveness (d(x, y) > 0 and if d(x, y) = 0 then x = y), symmetry (d(x, y) = d(y, x)), and the triangle inequality (d(x, z) ≤ d(x, y) + d(y, z)). The finite subset U ⊂ X with size n = |U| is called the database and represents the collection of objects. A k-dimensional vector space is a particular case of metric space in which every object is represented by a vector of k real coordinates. The definition of the distance function depends on the type of the objects we are managing. In a vector space, d could be a distance function of the family L_s(x, y) = (Σ_{1≤i≤k} |x_i − y_i|^s)^{1/s}; for example, s = 2 yields the Euclidean distance. For strings, a typical choice is the edit distance, that is, the number of insertions, deletions or modifications needed to make two words equal. There are three main queries of interest:
– range search: retrieves all the objects u ∈ U within a radius r of the query q, that is, (q, r)_d = {u ∈ U | d(q, u) ≤ r};
– nearest neighbor search: retrieves the most similar object to the query q, that is, NN(q) = {u ∈ U | ∀v ∈ U, d(q, u) ≤ d(q, v)};
– k-nearest neighbors search: a generalization of the nearest neighbor search, retrieving the set kNN(q) ⊆ U such that |kNN(q)| = k and ∀u ∈ kNN(q), v ∈ U − kNN(q), d(q, u) ≤ d(q, v).
We focus on range queries since nearest neighbor queries can be rewritten as range queries in an optimal way [2]. In the following we describe the data structures we combine to produce our metric-space index.
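As a point of reference for the index structures described next, the following sketch (ours; Python is used only for illustration) shows the trivial, index-free implementation of a range query with the Euclidean distance. Every object is compared against the query, which is exactly the cost the indexes below try to avoid.

import math

def euclidean(x, y):
    # L_s distance with s = 2 over k-dimensional vectors.
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

def range_search(database, q, r, dist=euclidean):
    # Brute-force baseline: one distance evaluation per stored object.
    return [u for u in database if dist(q, u) <= r]

db = [(0.0, 0.0), (1.0, 1.0), (3.0, 4.0)]
print(range_search(db, (0.5, 0.5), 1.0))   # [(0.0, 0.0), (1.0, 1.0)]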
2.1 List of Clusters (LC)
This strategy [1] builds the index by choosing a set of centers c ∈ U with radius rc, where each center maintains a bucket that keeps all objects within the extension of the ball (c, rc). Each bucket contains the k objects that are the closest ones to the respective center c. Thus the radius rc is the maximum distance between the center c and its k-nearest neighbor. The buckets are filled as the centers are created, and thereby a given element a located in the intersection of two or more center balls is assigned to the first center. The first center is randomly chosen from the set of objects. The next ones are selected so that they maximize the sum of the distances to all previous centers. A range query q with radius r is solved by scanning the centers in order of creation. At each center we compute d(q, c) and, in the case that d(q, c) ≤ rc + r, all objects in the bucket associated with c are compared against the query. Also, if the query ball (q, r) is totally contained in the center ball (c, rc), there is no need to consider other centers.
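The following sketch is our condensed reading of the construction and search just described (function and variable names are ours, and the sketch ignores edge cases such as an empty collection):

def build_lc(objects, k, dist):
    # List of Clusters: each center keeps its k closest remaining objects;
    # rc is the distance from the center to its k-th nearest neighbour.
    centers = []
    remaining = list(objects)
    c = remaining.pop(0)               # first center (chosen at random in practice)
    while True:
        remaining.sort(key=lambda o: dist(c, o))
        bucket, remaining = remaining[:k], remaining[k:]
        rc = dist(c, bucket[-1]) if bucket else 0.0
        centers.append((c, rc, bucket))
        if not remaining:
            return centers
        # next center: maximizes the sum of distances to all previous centers
        c = max(remaining, key=lambda o: sum(dist(o, cc) for cc, _, _ in centers))
        remaining.remove(c)

def lc_range_search(centers, q, r, dist):
    result = []
    for c, rc, bucket in centers:      # scan centers in creation order
        d = dist(q, c)
        if d <= r:
            result.append(c)
        if d <= rc + r:                # query ball intersects the center ball
            result.extend(o for o in bucket if dist(q, o) <= r)
        if d + r <= rc:                # query ball fully contained: stop early
            break
    return result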
2.2 Sparse Spatial Selection (SSS)
During construction, this pivot-based strategy selects some objects as pivots from the collection and then computes the distance between the pivots and the objects of the database [4]. The result is a table of distances where the columns are the pivots and the rows are the objects. Each cell in the table contains the distance between the object and the respective pivot. These distances are used to solve queries as follows. For a range query (q, r), the distances between the query and all pivots are computed. The objects x from the collection for which |d(pi, x) − d(pi, q)| > r for some pivot pi can be immediately discarded due to the triangle inequality. The objects that pass this test are considered potential members of the final set of objects that form part of the solution for the query, and therefore they are directly compared against the query by applying the condition d(x, q) ≤ r. The gain in performance comes from the fact that it is much cheaper to perform the calculations for discarding objects using the table than to compute the distance between the candidate objects and the query. A key issue for efficiency is the method employed to calculate the pivots, which must be effective enough to drastically reduce the total number of distance computations between the objects and the query. To select the pivot set, let
(X, d) be a metric space, U ⊂ X an object collection, and M the maximum distance between any pair of objects, M = max{d(x, y) | x, y ∈ X}. The set of pivots initially contains only the first object of the collection. Then, for each element xi ∈ U, xi is chosen as a new pivot if its distance to every pivot in the current set of pivots is equal to or greater than αM, where α is a constant parameter. Therefore, an object in the collection becomes a new pivot if it is located at more than a fraction α of the maximum distance from all the current pivots.
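A compact sketch (ours) of the pivot-selection rule and of the resulting distance table follows; note that computing M exactly is quadratic, so in practice it may be estimated from a sample.

def sss_pivots(collection, dist, alpha, M=None):
    # Sparse Spatial Selection: an object becomes a pivot if it lies at
    # distance >= alpha * M from every pivot chosen so far.
    if M is None:
        # exact maximum distance; O(n^2), usually approximated in practice
        M = max(dist(x, y) for x in collection for y in collection)
    pivots = [collection[0]]
    for x in collection[1:]:
        if all(dist(x, p) >= alpha * M for p in pivots):
            pivots.append(x)
    return pivots

def sss_table(collection, pivots, dist):
    # Table of distances: one row per object, one column per pivot.
    return [[dist(obj, p) for p in pivots] for obj in collection]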
2.3 LC-SSS Combination (Hybrid)
We propose a combination of the List of Clusters (LC) and Sparse Spatial Selection (SSS) indexing strategies. We compute the LC centers and the SSS pivots independently. We form the clusters of LC and, within each cluster, we build an SSS table using the global pivots and the organization of columns and rows described above. We emphasize global SSS pivots because intuition suggests that in each LC cluster one should calculate pivots from the objects located in the respective cluster. However, we have found that the quality of SSS pivots degrades significantly when they are restricted to a subset of the database, and also that their total number tends to be unnecessarily large. We call this strategy hybrid.
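Putting the two previous sketches together, a hybrid index could be assembled as follows (again a sketch under our own naming, reusing build_lc, sss_pivots and sss_table from above):

def build_hybrid(objects, k, dist, alpha):
    # Hybrid index: LC clusters, each holding an SSS distance table
    # computed with pivots selected globally over the whole database.
    pivots = sss_pivots(objects, dist, alpha)      # global SSS pivots
    clusters = build_lc(objects, k, dist)          # (center, radius, bucket) triples
    index = []
    for center, rc, bucket in clusters:
        table = sss_table(bucket, pivots, dist)    # per-cluster distance table
        index.append((center, rc, bucket, table))
    return pivots, index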
3 Optimizing Running Time and Secondary Memory
Our contribution to increasing the performance of the SSS index is as follows. During construction of the table of distances we compute, for each pivot, the cumulative sum of the distances between all objects and that pivot. We then sort the pivots by these values in increasing order and define the final order of pivots as follows. Assume that the sorted sequence of pivots is p1, p2, ..., pn. Our first pivot is p1, the second is pn, the third p2, the fourth pn−1, and so on. We also keep the rows of the table sorted by the values of the first pivot, so that upon reception of a range query q with radius r we can quickly (by binary search) determine between which rows the objects that can be selected as candidates are located. This is because objects oi that are part of the answer can only be located between the rows that satisfy d(p1, oi) ≥ d(q, p1) − r and d(p1, oi) ≤ d(q, p1) + r. In practice, during query processing and after the two binary searches on the first column of the table, we can take advantage of the column × row organization of the table of distances by first performing a few, say v, vertical applications of the triangular inequality on the objects located in the rows delimited by the results of the binary searches, followed by horizontal applications of the triangular inequality to discard as soon as possible all objects that are not potential candidates to be part of the query answer. See Fig. 1, which shows the case of two queries being processed concurrently. For secondary memory, the combination of these strategies has the advantage of increasing the locality of disk accesses, and the processor can keep in main memory the first v columns of the table.
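The query procedure just described can be sketched as follows (ours; it assumes the rows of one cluster table are kept sorted by the first pivot, and v controls how many columns are filtered vertically before switching to row-wise processing):

from bisect import bisect_left, bisect_right

def hybrid_range_query(pivots, rows, objs, q, r, dist, v=1):
    # rows[i] holds the distances of objs[i] to the pivots; rows are sorted
    # by the first pivot column.
    dq = [dist(q, p) for p in pivots]            # query-to-pivot distances
    first_col = [row[0] for row in rows]
    lo = bisect_left(first_col, dq[0] - r)       # two binary searches on the
    hi = bisect_right(first_col, dq[0] + r)      # sorted first column
    candidates = list(range(lo, hi))
    # Vertical phase: apply the triangular inequality column by column.
    for j in range(1, min(v + 1, len(pivots))):
        candidates = [i for i in candidates if abs(rows[i][j] - dq[j]) <= r]
    # Horizontal phase: finish each surviving row, then compare directly.
    result = []
    for i in candidates:
        if all(abs(rows[i][j] - dq[j]) <= r for j in range(len(pivots))):
            if dist(q, objs[i]) <= r:
                result.append(objs[i])
    return result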
Fig. 1. Optimization to the SSS distance table: pivots reordered as p1, pn, p2, pn−1, ..., with vertical followed by horizontal processing of the candidate objects for two concurrent queries Q1 and Q2
Fig. 2. Storing the distance table in blocks composed of a fixed number of disk pages: (a) all objects available at construction time, (b) objects inserted on-line one by one
In the experiments performed in this paper we observed that with v = n/4 we achieved competitive running times. In the following we describe two feasible physical organizations of the index on disk pages. The description is illustrated in Fig. 2, which presents two cases
for the distribution of a distance table with 23 objects and 4 pivots. The table is partitioned into 5 blocks. The first 4 columns contain the distances from the objects to the 4 pivots and the last column contains the object ID associated with each row. The cell located at the bottom-right indicates the physical address of the disk page containing the next table block. Each block is stored in contiguous disk pages. We assume that the main memory is large enough to store two blocks. Fig. 2.a represents a case in which all objects 1 ... 23 are available at construction time, and Fig. 2.b a case in which objects arrive one by one at the index and every time a block is filled up a new one is started. The first case requires an external-memory sort by the first pivot. In the latter case the first column is kept sorted every two blocks, since we are assuming that both fit into main memory; thus external sorting is not required. In the next section we show that both strategies achieve very similar performance, which indicates that the scheme efficiently supports further updates once the index has been constructed from an initial set of objects. In Fig. 2.a and 2.b, the grey cells represent the cases in which the triangular inequality gives a positive match for a range query q with d(q, pi) = {6, 8, 3, 7} for pivots pi and radius r = 3. We assume that the query is solved by performing one vertical operation followed by a horizontal operation for each row selected for the first pivot. In fact, as the first column is sorted by distance, it is only necessary to perform two binary searches to detect the first row with value d(q, p1) − r = 3 and the last row with value d(q, p1) + r = 9. Then the sequence of horizontal applications of the triangular inequality determines that objects 22, 17 and 11 are candidates, which must be directly compared against the query object. Notice that a second vertical operation would have reduced significantly the number of horizontal operations (which is a tradeoff that depends on the application).
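The on-line block organization of Fig. 2.b can be pictured with the following sketch (ours; the class and field names are illustrative). The additional step of keeping the first column sorted within each pair of in-memory blocks is omitted here.

class TableBlock:
    # One on-disk block of the distance table: a fixed number of rows, each
    # holding the distances to the pivots plus the object ID, and the address
    # of the disk page containing the next block.
    def __init__(self, capacity):
        self.capacity = capacity
        self.rows = []            # each row: ([d(o, p1), ..., d(o, pk)], object_id)
        self.next_block = None    # pointer to the next table block

def online_insert(blocks, pivot_dists, obj_id, capacity):
    # On-line construction (Fig. 2.b): append to the last block and start a
    # new one when it fills up; no external sorting is needed.
    if not blocks or len(blocks[-1].rows) == capacity:
        block = TableBlock(capacity)
        if blocks:
            blocks[-1].next_block = block
        blocks.append(block)
    blocks[-1].rows.append((pivot_dists, obj_id))
    return blocks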
4 Experiments
The performance of the Hybrid index was tested with several collections of data. First, we used a collection of 100,000 vectors of dimension 10, synthetically generated with a Gaussian distribution. The Euclidean distance was used as the distance function when working with this collection. We also worked with a collection of 86,061 words taken from the Spanish dictionary, using the edit distance as the distance function. The algorithm was compared with other well-known clustering-based indexing methods: M-Tree [6], GNAT [8], EGNAT [7], and Spatial Approximation Trees (SAT) [3]. We also included in the comparison the LC [1] and SSS [4] strategies, and a recent version of the SSS called the SSSTree [5], which uses a tree structure in which the SSS pivots are used to recursively divide the space.
4.1 Cost of Secondary Memory Access
In the left part of Table 1 we show, for the Spanish dictionary data set, the total number of blocks and objects per block for cases in which we limit the total
number of pivots to 4, 8, 12, 16 and 20. The Writes, Seeks and Reads columns show the disk activity when constructing the index with 90% of the data set using the strategy depicted in Fig. 2.a. The last Writes column shows the case when the same data is indexed on-line using the strategy of Fig. 2.b. In this case no block reads are performed, and blocks are written to disk as soon as they become full during the insertion of objects. In the first case, reads and seeks have to be performed in order to sort by the first column and to move whole rows among blocks. However, the actual difference in running time between the two alternatives is negligible, presumably because of disk-cache effects.

Table 1. Disk activity for index construction

Pivots  Blocks  Objects  Writes  Seeks  Reads  Writes (on-line)
4       378     204      399     761    780    380
8       686     113      721     1373   1408   688
12      994     78       1030    1989   2025   996
16      1291    60       1342    2583   2634   1293
20      1614    48       1676    3229   3291   1616
The next 10% of the data set is used to perform range queries with radii 1, 2, 3 and 4. Figures 3.a and 3.b show the total number of block reads performed during the processing of queries for the two methods of index construction. The differences in disk activity are irrelevant, showing that both approaches achieve similar performance. However, for the large radius 4 the on-line creation of the index tends to generate more activity, because large radii tend to generate a large number of candidate objects, which are expected to be evenly distributed over all blocks.
4.2 Calls to the Distance Evaluation Function
Computing the distance between two complex objects is known to be very expensive in terms of running time in metric-space databases. This provides an implementation-independent basis for comparing different strategies. In the following we review previous comparisons of a number of metric-space indexes and then we compare the best performers with our proposal. Figures 4 and 5 show results for the different data structures proposed so far. The Hybrid strategy achieves the best performance in terms of this metric, though very similar to the LC strategy.
4.3 Comparing Running Times
In Fig. 6 we present results for the running times of the different strategies. The proposed Hybrid achieves the best performance in most cases. Notice that structures such as the SAT achieve better performance than ours for range queries with large radii. The results suggest that SAT performs significantly better for large r. However, for these radii almost all objects are part of the solution to the query and we do not see a practical use of such queries in actual applications.
Fig. 3. Disk seeks and their respective block reads during range queries (number of accesses to disk vs. search range, for 4, 8, 12, 16 and 20 pivots; Spanish dictionary, n = 86,061 words). Panels (a) and (b) correspond to the two index construction methods.
Fig. 4. Number of calls to the distance evaluation function per query for different metric-space index data structures (Hyb, SSS, LC, SAT, EGNAT, SSSTree, M-Tree). Results for the Spanish dictionary data set.
Fig. 5. Number of calls to the distance evaluation function per query for different metric-space index data structures (Hyb, LC, SSS, SAT, M-Tree, GNAT, EGNAT, SSSTree). Results for the Gaussian vector space.
Fig. 6. Total running times for processing 10,000 queries with the Spanish dictionary (left) and a Gauss vector data set (right), for the SSS, LC, SAT and Hybrid strategies.
Fig. 7. Running time for the three main components (triangle inequality evaluations, distance evaluations and total) in the execution of the Hybrid, for alpha between 0.3 and 0.9.
Finally, Fig. 7 shows results for the cumulative running time involved in accessing the distance table and executing the distance evaluation function, for different values of the parameter α, namely different numbers of pivots. The results show a tradeoff between both costs, with the optimum at α = 0.7.
5 Conclusions
We have presented a simple but very efficient strategy to solve queries in metric-space databases. Our strategy achieves better performance than most other strategies. However, it is not able to significantly outperform a
tree-based structure called the SSSTree, which is in fact based on a strategy quite similar to ours. However, our strategy has clear advantages with respect to secondary memory management and the total memory used by the index. Also, the organization of the index in terms of a table with columns and rows allows it to exploit in an optimal way the parallelism available in the new multi-core computer architectures devised to support multi-threading in hardware. We are currently evaluating the performance gain on these architectures by solving queries using standard OpenMP.
Acknowledgments. This work has been partially funded by FONDECYT project 1060776, UMAG PR-F1-002IC-06, and UNSL PICT 2002-11-12600.
References
1. Chávez, E., Navarro, G.: A compact space decomposition for effective metric indexing. Pattern Recognition Letters 26(9), 1363–1376 (2005)
2. Chávez, E., Navarro, G., Baeza-Yates, R., Marroquín, J.L.: Searching in metric spaces. ACM Computing Surveys 3(33), 273–321 (2001)
3. Navarro, G.: Searching in metric spaces by spatial approximation. The Very Large Databases Journal (VLDBJ) 711(1) (2002)
4. Brisaboa, N., Pedreira, O.: Spatial selection of sparse pivots for similarity search in metric spaces. In: van Leeuwen, J., Italiano, G.F., van der Hoek, W., Meinel, C., Sack, H., Plášil, F. (eds.) SOFSEM 2007. LNCS, vol. 4362, pp. 434–445. Springer, Heidelberg (2007)
5. Brisaboa, N., Pedreira, O., Seco, D., Solar, R., Uribe, R.: Clustering-based similarity search in metric spaces with sparse spatial centers. In: Geffert, V., et al. (eds.) SOFSEM 2008. LNCS, vol. 4910, pp. 186–197. Springer, Heidelberg (2008)
6. Ciaccia, P., Patella, M., Zezula, P.: M-tree: An efficient access method for similarity search in metric spaces. In: Proceedings of the 23rd International Conference on Very Large Data Bases (VLDB 1997), pp. 426–435 (1997)
7. Uribe, R., Navarro, G., Barrientos, R., Marin, M.: An index data structure for searching in metric space databases. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3991, pp. 611–617. Springer, Heidelberg (2006)
8. Brin, S.: Near neighbor search in large metric spaces. In: 21st Conference on Very Large Databases (1995)
Structural Testing for Semaphore-Based Multithread Programs

Felipe S. Sarmanho, Paulo S.L. Souza, Simone R.S. Souza, and Adenilso S. Simão

Universidade de São Paulo, ICMC, São Carlos - SP, 668, Brazil
{sarmanho,pssouza,srocio,adenilso}@icmc.usp.br
Abstract. This paper presents structural testing criteria for the validation of semaphore-based multithread programs, exploring control, data, communication and synchronization information. A post-mortem method based on timestamps is defined to determine the implicit communication among threads using shared variables. The applicability of the coverage testing criteria is illustrated by a case study.
Keywords: software testing, multithread programs, testing criteria.
1 Introduction
Concurrent programming is important to reduce the execution time in several application domains, such as image processing and simulations. A concurrent program is a group of processes (or threads) that execute simultaneously and work together to perform a task. These threads access a common address space and interact through memory (using shared variables). The most common method to develop multithread programs is to use thread libraries, like PThreads (POSIX Threads). Concurrent program testing is not trivial. Features like synchronization, inter-thread communication and non-determinism make this activity complex [1]. Multiple executions of a concurrent program with the same input may present different results due to different synchronization and communication sequences. Petascale systems also add more factors to this scenario, making it even worse [2]. Structural testing is a test technique that uses source code information to guide the testing activity. Coverage criteria are defined to apply structural testing. A testing criterion is a predicate to be satisfied by a set of test cases and can be used as a guide for test data generation. It is also a good heuristic to indicate defects in programs and thus to improve their quality. This activity is composed of: (1) static analysis to obtain the necessary data about the source code, usually building a Control Flow Graph (CFG) [3]; (2) determining the required elements for the chosen coverage criterion; and (3) analyzing the coverage reached in the source code by the test cases, based on the coverage criterion.
This work is supported by CNPq.
In the literature there are some works that address testing of concurrent programs [4, 5, 6, 7, 8, 9]. Most of these works propose a test model to represent the concurrent program and to support the testing application. The Concurrency States Graph (CG) is a CFG extension proposed in [4], in which nodes represent concurrency states while edges represent the actions required for transitions between these states. That work considers concurrent languages with explicit synchronization using rendezvous-style mechanisms, such as Ada and CSP. It presents coverage criteria adapted to the CG; however, its usage is limited in practice by the state space explosion problem. The PPFG (Parallel Program Flow Graph) is a graph in which the concept of a synchronization node is added to the CFG [5, 10]. In the PPFG each process that composes the program has its own CFG. The synchronization nodes are then connected based on the possible synchronizations. This model was proposed to adapt the all-du-path criterion to concurrent programs. The PCFG (Parallel Control Flow Graph) also adapts the CFG to the context of parallel programs in message-passing environments [7]. The PCFG includes the concept of synchronization nodes, which are used to represent the send and receive primitives. The concept of variables was extended to consider the concept of communicational use (s-use). Coverage criteria were also proposed in [7], based on models of control and data flow for message-passing programs. Lei and Carver propose an approach called reachability testing. Reachability testing is a combination of deterministic and non-deterministic execution, where the information and the required elements are generated on-the-fly, without static analysis [6]. This proposal guarantees that all feasible synchronization sequences will be exercised at least once. The lack of static analysis means it cannot say how many executions are required, which leads to the state space explosion problem. In recent work, Lei et al. [9] present a combinatorial approach, called t-way, to reduce the number of synchronization sequences to be executed. These related works bring relevant improvements to concurrent program testing. However, few works investigate the application of testing coverage criteria and supporting tools in the context of multithreaded programs. For these programs, new aspects need to be considered. For instance, data flow information must consider that an association between one variable definition and its use can occur in different threads. The implicit inter-thread communication that occurs through shared memory makes the test activity complex. The investigation of these challenges is not a trivial task and presents many difficulties. To overcome these difficulties, we present a family of structural testing criteria for semaphore-based multithread programs and a new test model to support the criteria. This model includes important features such as synchronization, communication, parallelism and concurrency. These data are collected using static and dynamic analyses. Information about communication is obtained after the execution of an instrumented version of the program, using a post-mortem methodology. This methodology has been adapted from the work of Lei and Carver [6]. Testing criteria were defined to exploit the control and data flows of these programs, considering their sequential and parallel aspects. The
main contribution of the testing criteria proposed in this paper is to provide a coverage measure that can be used to evaluate the progress of the testing activity. This is important to evaluate the quality of test cases, as well as to decide when a program has been tested enough. It is important to point out that the objective of this work is not to debug concurrent programs in which an error has already been revealed.
2 Test Model for Shared Memory Programs
Let MT = {t0, t1, ..., tn−1} be a multithread program composed of n threads denoted by ti. Threads can execute different functionalities, but they all share the same memory address space. They may also use additional private memory. Each thread t has its own control flow graph CFGt, built using the same concepts as for traditional programs [3]. In short, the CFG of a thread t is composed of a set of nodes Nt and a set of edges EIt. Edges that link nodes in the same thread are called intra-thread edges. Each node n in the thread t is represented by the notation nti, a well-known terminology in the software testing context. Each node corresponds to a set of commands that are sequentially executed or can be associated with a synchronization primitive (post or wait). A multithread program MT is associated with a Parallel Control Flow Graph for Shared Memory (PCFGSM), which is composed of both the CFGt (for 0 ≤ t < n) and the representation of the synchronization among threads. N and E represent the set of nodes and edges of the PCFGSM, respectively. For the construction of the PCFGSM, it is assumed that (1) n is fixed and known at compilation time; (2) there is implicit communication by means of shared variables; (3) there is explicit synchronization using semaphores (which have two basic atomic primitives: post (or p) and wait (or w)); and (4) initialization and finalization of threads act as a synchronization over a virtual semaphore. Three subsets of N are defined: Nt (nodes in the thread t), Np (nodes with post primitives) and Nw (nodes with wait primitives). For each nti ∈ Np with a post to a semaphore sem, we associate a set Mw(nti), defined as the set of nodes nqj ∈ Nw such that there exist a thread q ∈ [0..n − 1] and a wait primitive with respect to (w.r.t.) sem in nqj. Similarly, for each nti ∈ Nw with a wait on a semaphore sem, we associate a set Mp(nti), defined as the set of nodes nqj ∈ Np such that there exist a thread q ∈ [0..n − 1] and a post primitive w.r.t. sem in nqj. In other words, Mw(nti) contains all possible wait nodes that can match with nti and Mp(nti) contains all possible post nodes that can match with nti. Using the above definitions, we also define the set ES ⊂ E that contains the edges that represent the synchronization (edge-s) between two threads, such that:

ES = {(ntj, nqk) | ntj ∈ Mp(nqk) ∧ nqk ∈ Mw(ntj)}    (1)
The concurrent program shown in Fig. 1 is used to illustrate these definitions. This program implements the producer-consumer problem with a limited buffer, using the PThreads library in ANSI C. There are three threads: (1) a master,
Fig. 1. Producer-Consumer implemented with PThreads/ANSI C
which initializes variables and creates the producer and consumer threads; (2) a producer, which populates the buffer; (3) a consumer, which removes items from the buffer for further processing. Table 1 contains the values of all sets introduced above. Figure 2 shows the PCFGSM for the program in Fig. 1. t0, t1 and t2 represent the master, producer and consumer threads, respectively. Dotted lines represent synchronization edges. Some examples of synchronization edges are: (92, 61) is a synchronization over a semaphore, (90, 12) is an initialization synchronization and (121, 100) is a finalization synchronization. Note that there may exist internal synchronization edges, such as (101, 61) and (92, 52) in Fig. 2. A path πt = (nt1, nt2, ..., ntj), where (nti, nti+1) ∈ EIt, is intra-thread if it has no synchronization edges. A path that includes at least one synchronization edge is called an inter-thread path and is denoted by Π = (PATHS, SYNC), where PATHS = {π1, π2, ..., πn} and SYNC = {(pti, wjq) | (pti, wjq) ∈ ES} [7]. Here pti is a post node i in thread t and wjq is a wait node j in thread q. The PCFGSM also captures information about data flow. Besides local variables, multithread programs have two more special types of variables: (1) shared variables, used for communication; and (2) synchronization variables, used by semaphores. V denotes all variables. VLt ⊂ V contains the local variables of thread t. VC ⊂ V contains the shared variables and VS ⊂ V contains the synchronization variables. Therefore, we define: def(nti) = {x | x is a variable defined in nti}.
Table 1. Sets of the test model introduced for the program shown in Fig. 1
Fig. 2. PCFGSM graph that represents the program shown in Fig. 1
n = 3
MT = {t0, t1, t2}
N0 = {10, 20, 30, ..., 120}    N1 = {11, 21, 31, ..., 121}    N2 = {12, 22, 32, ..., 122}
N = N0 ∪ N1 ∪ N2
Np = {30, 50, 60, 80, 90, 101, 111, 121, 92, 102, 122}
Nw = {100, 110, 11, 51, 61, 12, 42, 52}
EI0 = {(10, 20), (20, 30), ..., (100, 110), (110, 120)}
EI1 = {(11, 21), (21, 31), ..., (111, 31), (31, 121)}
EI2 = {(12, 22), (22, 32), ..., (112, 32), (32, 122)}
Es = {(30, 61), (30, 52), (50, 51), (60, 51), (80, 11), (90, 12), (101, 52), (101, 61), (111, 42), (121, 100), (121, 110), (92, 61), (92, 52), (102, 51), (122, 100), (122, 110)}
E = EI0 ∪ EI1 ∪ EI2 ∪ Es
Mw(30) = {61, 52}       Mp(100) = {121, 122}
Mw(50) = {51}           Mp(110) = {121, 122}
Mw(60) = {51}           Mp(11) = {80}
Mw(80) = {11}           Mp(51) = {50, 60, 102}
Mw(90) = {12}           Mp(61) = {30, 101, 92}
Mw(101) = {61, 52}      Mp(12) = {90}
Mw(111) = {42}          Mp(42) = {111}
Mw(121) = {100, 110}    Mp(52) = {30, 101, 92}
Mw(92) = {52, 61}       Mw(102) = {51}
Mw(122) = {100, 110}
VL0 = {prod_h, cons_h}    VL1 = {prod, item}    VL2 = {cons, my_item}
VC = {queue, avail}
VS = {mutex, full, empty}
def(10) = {avail}    def(91) = {prod}
def(21) = {prod}     def(22) = {cons}
def(41) = {item}     def(62) = {cons}
def(71) = {queue}    def(72) = {avail}
def(81) = {avail}    def(82) = {my_item}
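As an illustration of Eq. (1), the following fragment (ours, not the paper's tool) derives synchronization edges from a few of the Mp/Mw entries of Table 1; node i of thread t is written here as the string "i_t".

# Possible matching wait nodes for some post nodes (a fragment of Table 1).
Mw = {"3_0": {"6_1", "5_2"}, "5_0": {"5_1"}, "9_2": {"5_2", "6_1"}}
# Possible matching post nodes for the corresponding wait nodes.
Mp = {"6_1": {"3_0", "10_1", "9_2"}, "5_1": {"5_0", "6_0", "10_2"},
      "5_2": {"3_0", "10_1", "9_2"}}

# Eq. (1): an edge (post, wait) exists when each node appears in the
# other's match set.
Es = {(p, w) for p, waits in Mw.items()
             for w in waits
             if p in Mp.get(w, set())}
print(sorted(Es))
# five edges: (3_0, 5_2), (3_0, 6_1), (5_0, 5_1), (9_2, 5_2), (9_2, 6_1)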
A path πt = (n1, n2, ..., nj, nk) is definition-clear w.r.t. a local variable x ∈ VLt from n1 to node nk or edge (nj, nk) if x ∈ def(n1) and x ∉ def(ni), for i ∈ [2..j]. The notion of a definition-clear path is not applicable to shared variables because the communication (definition and use of shared variables) between threads is implicit. It is hard to establish a path that statically defines and uses shared variables. In Section 2.1, we present a method to determine execution-based definition-clear paths for shared variables using a post-mortem methodology. The use of variables in multithread programs can be: computational use (c-use): computational statements related to a local variable x ∈ VLt; predicative use (p-use): conditional statements that modify the control flow of the thread and are related to a local variable x ∈ VLt; synchronization use (sync-use): synchronization statements on semaphore variables x ∈ VS; communicational c-use (comm-c-use): computational statements related to a shared variable
x ∈ VC; and communicational p-use (comm-p-use): conditional statements that modify the control flow of the thread, related to a shared variable x ∈ VC. Based on these definitions, we establish associations between variable definitions and uses. Five kinds of associations are defined:
– c-use association: defined by a triple (nti, ntj, x) iff x ∈ VLt, x ∈ def(nti), ntj has a c-use of x and there is at least one definition-clear path w.r.t. x from nti to ntj.
– p-use association: defined by a triple (nti, (ntj, ntk), x) iff x ∈ VLt, x ∈ def(nti), (ntj, ntk) has a p-use of x and there is at least one definition-clear path w.r.t. x from nti to (ntj, ntk).
– sync-use association: defined by a triple (nti, (ntj, nqk), sem) iff sem ∈ VS, (ntj, nqk) has a sync-use of sem and there is at least one definition-clear path w.r.t. sem from nti to (ntj, nqk).
– comm-c-use association: defined by a triple (nti, nqj, x) iff x ∈ VC, x ∈ def(nti) and nqj has a c-use of the shared variable x.
– comm-p-use association: defined by a triple (nti, (nqj, nqk), x) iff x ∈ VC, x ∈ def(nti) and (nqj, nqk) has a p-use of the shared variable x.
2.1 Applying Timestamps to Determine Implicit Communication
In this section, we present a method to establish pairs of definition and use of shared variables. These pairs are obtained after execution of the multithread program, by identifying the order in which the concurrent events happened. Lamport [11] presented a way to order concurrent events by means of a happens-before relationship. This relationship can determine whether an event e1 occurs before an event e2, denoted by e1 ≺ e2. To obtain this happens-before relationship it is necessary to assign timestamps to concurrent events. Lei and Carver [6] presented a method to assign timestamps that uses local logical clocks. We adapt this method to assign timestamps in our testing method. The method obtains all synchronizations that happened in an execution and thus generates the communication events. The method assigns a local logical clock vector, denoted by ti.cv, to each thread ti. This vector has dimension n, where n is the total number of threads. Each position i ∈ [0..n − 1] of the clock vector is associated with thread ti. Position i of the clock vector is updated when a new event occurs in thread ti. For instance, observe the c1 event in t0 (before it, the clock vector was [0, 0, 0]). When a synchronization event occurs in tj, other positions i, for i ≠ j, of the clock vector can also be updated. For instance, consider the match (p2, w2). Before this synchronization the clock vector associated with t2 was [0, 0, 0]. Afterwards, the values were updated to [2, 4, 1]. The logical space-time diagram shown in Fig. 3 illustrates the method, using a hypothetical example. This diagram only considers synchronization (pi and wj) and communication (ck) events. Vertical lines represent the logical time of each thread. Arrows among threads represent synchronization events matching post and wait events. For instance, wait events w1 and w2 race for the same post event p1, but the match (w1, p1) has occurred. It is possible that a wait primitive has several posts to match. These posts are inserted in a queue. Our method uses the LIFO access criterion to obtain the happens-before relationship. We chose LIFO to obtain the most up-to-date timestamps.
Fig. 3. Example of logical space-time diagram
Fig. 4. Example of nondeterminism
Rules defined in [6] are used to establish whether an event e1 happens before an event e2. These rules are not shown here for the sake of space. With this method, it is possible to determine the communications that happened in a program execution.
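The vector-clock bookkeeping described above can be sketched as follows (our illustrative adaptation, not the authors' implementation):

def new_clock(n):
    return [0] * n                        # one entry per thread

def local_event(cv, i):
    # Computation or communication event in thread i: advance its own entry.
    cv[i] += 1
    return list(cv)

def synchronization(post_cv, wait_cv, poster, waiter):
    # A matched (post, wait) pair: both events tick their own entries and the
    # waiting thread merges the poster's knowledge (component-wise maximum).
    post_cv[poster] += 1
    wait_cv[waiter] += 1
    wait_cv[:] = [max(a, b) for a, b in zip(post_cv, wait_cv)]
    return list(post_cv), list(wait_cv)

def happens_before(cv_e1, cv_e2):
    # e1 precedes e2 iff e1's timestamp is component-wise <= e2's and differs.
    return all(a <= b for a, b in zip(cv_e1, cv_e2)) and cv_e1 != cv_e2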
3 Coverage Criteria
Based on the control, data and communication flow models and definitions presented in the previous section, we propose two sets of structural testing criteria for shared-memory parallel programs. These criteria allow the testing of sequential and parallel aspects of the programs.

Control Flow and Synchronization-based Criteria
– All-p-nodes criterion: the test set must execute all nodes nti ∈ Np.
– All-w-nodes criterion: the test set must execute all nodes nti ∈ Nw.
– All-nodes criterion: the test set must execute all nodes nti ∈ N.
– All-s-edges criterion: the test set must execute all synchronization edges (nti, nqj) ∈ Es.
– All-edges criterion: the test set must execute all edges (ni, nj) ∈ E.

Data Flow and Communication-based Criteria
– All-def-comm criterion: the test set must execute paths that cover an association comm-c-use or comm-p-use for every definition of x ∈ VC.
– All-def criterion: the test set must execute paths that cover an association c-use, p-use, comm-c-use or comm-p-use for every definition of x ∈ def(nti).
– All-comm-c-use criterion: the test set must execute paths that cover all comm-c-use associations.
– All-comm-p-use criterion: the test set must execute paths that cover all comm-p-use associations.
– All-c-use criterion: the test set must execute paths that cover all c-use associations.
– All-p-use criterion: the test set must execute paths that cover all p-use associations.
– All-sync-use criterion: the test set must execute paths that cover all sync-use associations.

It is necessary to know which path was exercised in order to evaluate the required elements covered by an execution. One option to obtain this information is to instrument the source code to produce an execution trace. This instrumentation can change the original program behaviour. However, this interference does not affect the structural testing proposed here, because it does not prevent the extraction and the future execution of all possible pairs of synchronization. Due to non-determinism, executions of a program with the same input can cause different event sequences to occur. Fig. 4 shows an example where the nodes 81 and 91 in t1 have non-deterministic waits and the nodes 20 (t0) and 22 (t2) have posts to t1. All these operations are on the same semaphore. This case illustrates the possible synchronizations among these threads. During the testing activity it is essential to guarantee that these synchronizations are executed. Controlled execution is a mechanism used to achieve deterministic execution, i.e. two executions of the program with the same input are guaranteed to execute the same instructions and the specified synchronization sequence. The controlled execution used in this work was adapted from Carver's method [12]. Table 2 shows some required elements for the criteria defined in this section. These required elements are obtained by the static analysis.

Table 2. Some required elements by the proposed criteria for the program of Fig. 1

Criteria          Required Elements                                                          Total
All-nodes-p       30, 50, 60, 80, 90, 101, 111, 121, 92, 102, 122                            11
All-nodes-w       100, 110, 11, 51, 61, 12, 42, 52                                           8
All-nodes         10, 20, 30, ..., 120, 11, 21, ..., 121, 12, 22, ..., 122                   36
All-edges-s       (30, 61), (30, 52), (50, 51), (60, 51), (80, 11), (90, 12), (101, 52),
                  (101, 61), (111, 42), (121, 100), (121, 110), (92, 61), ...                16
All-edges         (10, 20), (20, 30), ..., (110, 120), (11, 21), (21, 31), ..., (111, 31),
                  (31, 121), (12, 22), ..., (112, 32), (32, 122), (30, 61), (30, 52), ...,
                  (90, 12), (101, 61), ...                                                   51
All-def-comm      (10, 71, avail), (81, 71, avail), (72, 82, avail), (71, 82, queue)         4
All-def           (10, 82, avail), (21, (31, 41), prod), (41, 71, item), (71, 82, queue),
                  (81, 71, avail), (62, (32, 122), cons), ...                                10
All-comm-c-use    (10, 71, avail), (10, 72, avail), (72, 72, avail), (72, 81, avail),
                  (71, 82, queue), ...                                                       13
All-comm-p-use    ∅                                                                          0
All-c-use         (81, 72, avail), (72, 82, avail), (21, 91, prod), (91, 91, prod),
                  (41, 71, item), (71, 82, queue), (82, 112, my_item), ...                   16
All-p-use         (21, (31, 41), prod), (21, (31, 121), prod), (62, (32, 42), cons),
                  (62, (32, 122), cons), ...                                                 8
All-sync-use      (20, (30, 61), mutex), (20, (30, 51), mutex), (50, (60, 51), empty),
                  (61, (101, 61), mutex), (52, (92, 61), mutex), ...                         14

4 Case Study
In order to illustrate the proposed testing criteria, consider the program in Fig. 1. The buffer is limited to two produced/consumed items. Due to thread scheduling, two executions are possible: (1) produce, consume, produce, consume
(PCPC); (2) produce, produce, consume, consume (PPCC). Using controlled execution, it is possible to force the order of these executions. Considering the first execution (PCPC), the executed paths and their synchronizations are:
π0 = {1,2,3,4,5,6,8,9,10,11,12}
π1 = {1,2,3,4,5,6,8,9,10,11,3,4,5,6,7,8,9,10,11,12}
π2 = {1,2,3,4,5,6,8,9,10,11,3,4,5,6,7,8,9,10,11,12}
SYNC = {(80, 11), (60, 51), (30, 61), (90, 12), (111, 42), (101, 52), (50, 51), (92, 61), (121, 100), (111, 42), (101, 52), (122, 110)}
For this execution, some covered elements are the edges-s (60, 51) and (122, 110) and the comm-c-uses (10, 71, avail), (72, 81, avail), (71, 82, queue). To illustrate how the testing criteria can contribute to revealing faults, consider that the mutex semaphore is initialized with the value 0 or 2 in the main function (code line 17). This will cause a deadlock state or an inappropriate concurrent access to shared variables, respectively. An execution that covers the required elements (30, 61) and (10, 72, avail), an edge-s and a comm-c-use respectively, will reveal the fault in the deadlock case. The execution of the required element comm-c-use (81, 82, queue) will reveal the fault in the case of inappropriate concurrent access. In both cases other required elements can also reveal these faults. To illustrate a communication fault, consider that avail is initialized with 1 (node 10) and all synchronizations are correct. This fault can be revealed by the execution of the required elements comm-c-use (72, 72, avail) and edge-s (92, 52). It is necessary to execute the PPCC sequence to reveal this fault, since the paths executed with the PCPC sequence do not reveal it.
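As an illustration of how coverage can be measured, the fragment below (ours; nodes are written as "i_t" for node i of thread t) intersects the all-s-edges required elements of Table 1 with the SYNC set observed for the PCPC execution, showing that 10 of the 16 synchronization edges are covered and which ones still require further executions:

# Required synchronization edges (Es from Table 1).
required = {("3_0","6_1"), ("3_0","5_2"), ("5_0","5_1"), ("6_0","5_1"), ("8_0","1_1"),
            ("9_0","1_2"), ("10_1","5_2"), ("10_1","6_1"), ("11_1","4_2"),
            ("12_1","10_0"), ("12_1","11_0"), ("9_2","6_1"), ("9_2","5_2"),
            ("10_2","5_1"), ("12_2","10_0"), ("12_2","11_0")}
# Distinct pairs from the SYNC set of the PCPC execution above.
executed = {("8_0","1_1"), ("6_0","5_1"), ("3_0","6_1"), ("9_0","1_2"), ("11_1","4_2"),
            ("10_1","5_2"), ("5_0","5_1"), ("9_2","6_1"), ("12_1","10_0"),
            ("12_2","11_0")}

covered = required & executed
missing = required - executed
print(f"all-s-edges coverage: {len(covered)}/{len(required)}")   # 10/16
print("still required:", sorted(missing))  # includes (9_2, 5_2), needing PPCC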
5 Conclusion
Concurrent program testing is not a trivial activity. This paper contributes in this context by addressing some of these problems for semaphore-based multithreaded programs. The paper introduced both structural testing criteria to validate shared-memory parallel programs and a new test model to capture information about control, data, communication and synchronization from these programs. The paper also presents a post-mortem method based on timestamps to determine which communications (related to shared variables) happened in an execution. This information is important to establish the pairs of definition and use of shared variables. The proposed testing criteria are based on models of control and data flow and include the main features of commonly used PThreads/ANSI C programs. The model considers communication, concurrency and synchronization faults among threads and also faults related to sequential aspects of each thread. The use of the proposed criteria contributes to improving the quality of the test cases. The criteria offer a coverage measure that can be used in two testing procedures: firstly, for the generation of test cases, where the criteria can be used as a guideline for test data selection; secondly, for the evaluation of a test set. The
criteria can be used to determine when the testing activity can be ended and also to compare test sets. The evolution of our work on this subject is directed to several lines of research: 1) development of a supporting tool for the introduced testing criteria (it is now being implemented); 2) development of experiments to refine and evaluate the testing criteria; 3) implementation of mechanisms to validate multithread programs that dynamically create threads; and 4) conduction of an experiment to evaluate the efficacy of the generated test data against ad hoc test sets.
References [1] Yang, C.D., Pollock, L.L.: The challenges in automated testing of multithreaded programs. In: 14th Int. Conference on Testing Computer Software, pp. 157–166 (1997) [2] Dongarra, J.J., Walker, D.W.: The quest for petascale computing. Computing in Science and Engineering 03(3), 32–39 (2001) [3] Rapps, S., Weyuker, E.: Selecting software test data using data flow information. IEEE Transactions on Software Engineering SE-11(4), 367–375 (1985) [4] Taylor, R.N., Levine, D.L., Kelly, C.D.: Structural testing of concurrent programs. IEEE Trans. on Software Engineering 18(3), 206–215 (1992) [5] Yang, C.S.D., Souter, A.L., Pollock, L.L.: All-du-path coverage for parallel programs. In: Young, M. (ed.) ISSTA 1998: Proc. of the ACM SIGSOFT Int. Symposium on Software Testing and Analysis, pp. 153–162 (1998) [6] Lei, Y., Carver, R.: Reachability testing of concurrent programs. IEEE Trans. on Software Engineering 32(6), 382–403 (2006) [7] Vergilio, S.R., Souza, S.R.S., Souza, P.S.L.: Coverage testing criteria for messagepassing parallel programs. In: 6th LATW, Salvador, Ba, pp. 161–166 (2005) [8] Edelstein, O., Farchi, E., Goldin, E., Nir, Y., Ratsaby, G., Ur, S.: Framework for testing multi-threaded Java programs. Concurrency and Computation: Practice and Experience 15(3–5), 485–499 (2003) [9] Lei, Y., Carver, R.H., Kacker, R., Kung, D.: A combinatorial testing strategy for concurrent programs. Softw. Test., Verif. Reliab. 17(4), 207–225 (2007) [10] Yang, C.S.D., Pollock, L.L.: All-uses testing of shared memory parallel programs. Softw. Test, Verif. Reliab. 13(1), 3–24 (2003) [11] Lamport, L.: The implementation of reliable distributed multiprocess systems. Computer Networks 2, 95–114 (1978) [12] Carver, R.H., Tai, K.C.: Replay and testing for concurrent programs. IEEE Softw. 8(2), 66–74 (1991)
Algorithms of Basic Communication Operation on the Biswapped Network Wenhong Wei and Wenjun Xiao Department of Computer Science, South China University of Technology, 510641 Guangzhou, China [email protected], [email protected]
Abstract. The biswapped network (BSN) is a new topology for interconnection networks in multiprocessor systems. A BSN is built of 2n copies of an n-node basic network, for a total of 2n² nodes. Some topological properties of the BSN have been investigated, and some algorithms, such as sorting and matrix multiplication, have been developed on it. In this paper, we develop algorithms for some basic communication operations: broadcast, prefix sum and data sum.
Keywords: BSN, Broadcast, Prefix sum, Data sum.
1 Introduction
The swapped network, also called the OTIS network, has important applications in parallel processing [1,2]. In this architecture, n² processors are divided into n groups of n processors each; processors in the same group are connected by intra-group links, and the groups are connected by inter-group links. However, the swapped network is not a Cayley graph and therefore not a symmetrical network architecture, so some algorithms on it are not always convenient. To remedy this limitation of the swapped network, [3] proposed the biswapped network (BSN); the new network is a Cayley graph whenever the basic network is a Cayley graph, and it is tightly related to the swapped network. The BSN is more regular than the swapped network. It is built of 2n copies of an n-node basic network using a simple connectivity rule that ensures regularity, modularity, fault tolerance and algorithmic efficiency. Some topological properties of the BSN have been investigated [3], and some algorithms, such as sorting and matrix multiplication, have been developed on it [4]. In most parallel algorithms, processors need to exchange data with other processors, so it is most important to develop algorithms for basic communication operations; such algorithms can be used to arrive at efficient parallel algorithms for numerous applications in image processing, computational geometry, matrix algebra, graph theory, and so forth [5]. In [6], Wang and Sahni developed algorithms for basic operations on the OTIS-Mesh; their algorithms, including broadcast, prefix sum and data sum, can only be applied to the OTIS-Mesh. In this paper, we develop deterministic algorithms of basic communication
operations for parallel computation on the BSN, such as broadcast, prefix sum and data sum, and analyze the time complexity of these algorithms. According to [4], the BSN has better topological properties than OTIS, and the basic communication algorithms on the BSN proposed here are more general and better than those on the OTIS-Mesh. For example, on a BSN-Mesh with 2n² processors, our broadcast algorithm's time complexity is 4√n − 2, while on an OTIS-Mesh with n² processors the broadcast algorithm of [6] has time complexity 4√n − 3. As the number of processors in our network is larger than theirs, we can conclude that our broadcast algorithm is better than theirs. The remainder of this paper is organized as follows. In Section 2, we give the definition of the BSN. Section 3 presents the basic data communication algorithms on the BSN, including broadcast, prefix sum and data sum, and analyzes their time complexity. Finally, in Section 4, we provide some concluding remarks.
2 Introduction of BSN
Definition 1. Let Ω be a graph with the vertex set V(Ω) = {h1, h2, ..., hn} and the arc set E(Ω). Our biswapped network Σ(Ω) = Σ = (V(Σ), E(Σ)) is a graph defined as follows [3]:
V(Σ) = {〈g, p, 0〉, 〈g, p, 1〉 | g, p ∈ V(Ω)} and
E(Σ) = {(〈g, p1, 0〉, 〈g, p2, 0〉), (〈g, p1, 1〉, 〈g, p2, 1〉) | (p1, p2) ∈ E(Ω), g ∈ V(Ω)} ∪ {(〈g, p, 0〉, 〈p, g, 1〉), (〈g, p, 1〉, 〈p, g, 0〉) | g, p ∈ V(Ω)}
Intuitively, if we regard the basic network as a group, the definition postulates 2n groups, each group being an Ω digraph: n groups, with nodes numbered 〈group#, processor#, 0〉, form part 0 of the bipartite graph, and the other n groups constitute part 1, with associated node numbers 〈group#, processor#, 1〉. Each group p in either part of Σ has the same internal connectivity as Ω (intra-group edges, forming the first set in the definition of E(Σ)). In addition, node g of group p in part 0/1 is connected to node p in group g of part 1/0 (inter-group or swap edges, the second set in the definition of E(Σ)). The name "biswapped network" (BSN) arises from two defining properties of the network just introduced: when groups are viewed as super-nodes, the resulting graph of super-nodes is a complete 2n-node bipartite graph, and the inter-group links connect nodes in which the group number and the node number within the group are interchanged, or swapped. When Ω = C4 is a ring, an example of the network Σ(C4) is shown in Fig. 1.
Fig. 1. An example of the BSN with Ω=C4
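To make Definition 1 concrete, the following minimal Java sketch enumerates the vertices and edges of Σ(Ω) from an adjacency list of Ω. The representation, class and method names are ours and purely illustrative; edges are stored as ordered pairs even though the network itself is undirected.

import java.util.*;

// Builds the vertex count and edge set of the biswapped network Sigma(Omega)
// for a basic graph Omega given as an edge list over vertices 0..n-1.
public class Biswapped {
    // A node <g, p, part> encoded as a string for brevity.
    static String node(int g, int p, int part) { return "<" + g + "," + p + "," + part + ">"; }

    static Set<List<String>> edges(List<int[]> basicEdges, int n) {
        Set<List<String>> e = new HashSet<>();
        // Intra-group edges: one copy of Omega inside every group of both parts.
        for (int g = 0; g < n; g++)
            for (int[] pq : basicEdges)
                for (int part = 0; part <= 1; part++)
                    e.add(Arrays.asList(node(g, pq[0], part), node(g, pq[1], part)));
        // Inter-group (swap) edges: <g,p,0> connected to <p,g,1>.
        for (int g = 0; g < n; g++)
            for (int p = 0; p < n; p++)
                e.add(Arrays.asList(node(g, p, 0), node(p, g, 1)));
        return e;
    }

    public static void main(String[] args) {
        // Omega = C4, the ring of Fig. 1: edges 0-1, 1-2, 2-3, 3-0.
        List<int[]> c4 = Arrays.asList(new int[]{0,1}, new int[]{1,2}, new int[]{2,3}, new int[]{3,0});
        System.out.println(edges(c4, 4).size() + " edges, " + (2 * 4 * 4) + " nodes");
    }
}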
As in the swapped network (or OTIS), links between vertices of the same group are regarded as intra-group links. Links between vertices of different groups follow a swapping strategy and are regarded as inter-group links.
3 Basic Communication Operations on the BSN
3.1 Broadcast
Broadcast is perhaps the most fundamental operation in parallel computing. In this operation, data initially held by a single processor is transmitted to all other processors in the network. For example, if processor <0, 0, 0> of the BSN has value A, all 2n² processors of the BSN have value A after broadcasting. Suppose that broadcast is applied in all-ports mode; it can be accomplished using the following four-step algorithm if we suppose processor u (u=
Step 1: processor u transmits its data x to processor v (v=
Fig. 2. An example of broadcast on the BSN with Ω=C3
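A sequential sketch of the broadcast is given below. The concrete ordering of the four phases is our assumption, chosen only to match the structure used in the proof of Theorem 1 (two intra-group broadcasts plus two inter-group moves); it is not a transcription of Table 1, and the data layout is ours.

// Sequential simulation of broadcast on a BSN with 2*n*n processors.
// val[part][group][proc] holds the value of processor <group, proc, part>.
public class BsnBroadcast {
    public static void broadcast(int[][][] val, int srcGroup, int srcProc) {
        int n = val[0].length;
        int x = val[0][srcGroup][srcProc];
        // Phase 1: intra-group broadcast inside the source group (part 0).
        for (int p = 0; p < n; p++) val[0][srcGroup][p] = x;
        // Phase 2: every processor <srcGroup, p, 0> sends x over its swap link
        // to <p, srcGroup, 1>, so each group of part 1 receives one copy.
        for (int p = 0; p < n; p++) val[1][p][srcGroup] = x;
        // Phase 3: intra-group broadcast inside every group of part 1.
        for (int g = 0; g < n; g++)
            for (int p = 0; p < n; p++) val[1][g][p] = x;
        // Phase 4: every processor <g, p, 1> sends x back over its swap link
        // to <p, g, 0>, which covers all groups of part 0.
        for (int g = 0; g < n; g++)
            for (int p = 0; p < n; p++) val[0][p][g] = x;
    }
}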
Theorem 1. The broadcast algorithm on the BSN is optimal if the basic network's broadcast algorithm is optimal in all-ports mode.
Proof. If a broadcast algorithm is optimal on a network in all-ports mode, the number of move steps of the broadcast equals the diameter of that network. Let Ω denote the basic network of the BSN and D(Ω) its diameter; according to [3], the diameter of the BSN is 2D(Ω) + 2. Our broadcast algorithm performs two broadcasts in the basic network, which take 2D(Ω) move steps if the basic network's broadcast algorithm is optimal. In addition, there are two inter-group moves, so our broadcast algorithm needs 2D(Ω) + 2 moves, which equals the diameter of the BSN; hence our broadcast algorithm is optimal.
3.2 Prefix Sum
In a BSN with 2n² processors, label the groups of part 0 from 0 to n−1, the groups of part 1 from n to 2n−1, and the processors of each group from 0 to n−1. Now, let D(p) be the data in processor p, 0 ≤ p < 2n². In a prefix sum, each processor p computes

PS(p) = ∑_{i=0}^{p} D(i),  0 ≤ p < 2n².

The prefix sum algorithm results from the following equation:

PS(p) = SD(p) + LS(p)    (1)

where SD(p) is the sum of D(i) over all processors i whose group label is smaller than that of p, and LS(p) is the local prefix sum within the group of p. The algorithm for prefix sum is shown in Table 2.
Table 2. Algorithm for prefix sum
Step 1: perform a local prefix sum in each group.
Step 2: transmit the prefix sums computed in Step 1 for processor n−1 in each group of part 0 to the processors of group 2n−1, and for processor n−1 in each group (except group 2n−1) of part 1 to the processors of group n−1, by inter-group connections.
Step 3: in group n−1 and group 2n−1, perform a modified prefix sum on the data A received in Step 2. In this modification, processor P computes ∑_{i=0}^{P−1} A(i) rather than ∑_{i=0}^{P} A(i) (P ≥ 1).
Step 4: swap the prefix sums computed in Step 3 between processor n−1 in group 2n−1 of part 1 and processor n−1 in group n−1 of part 0 by the inter-group connection.
Step 5: after adding the result of Step 4 to its local prefix sum, processor n−1 in group n−1 of part 0 broadcasts the result to every processor in the group, and each processor in group n−1 of part 0 adds the result to its data A.
Step 6: transmit the prefix sum computed in Step 5 for processor n−1 of group n−1 to every group of part 1, and the prefix sum computed in Step 3 for processor n−1 of group 2n−1 to every group of part 0.
Step 7: broadcast the result from Step 6 within each group.
Step 8: in each processor, add the local prefix sum and the modified prefix sum.

Following Step 1, each group has computed its local prefix sum and the result is stored in the respective processors. Step 2 corresponds to an inter-group transmission: the results of the local prefix sums of all groups of part 0 are transmitted to the processors of group 2n−1 of part 1 and, similarly, the results of the local prefix sums of all groups except group 2n−1 of part 1 are transmitted to the processors of group n−1 of part 0. In Step 3, the processors in group n−1 and group 2n−1 perform a modified prefix sum on the data received in the previous step; the modified prefix sum of the current processor is equal to the prefix sum of the preceding processor. For example, if data A0, A1, ..., An−1 are stored in processor 0, processor 1, ..., processor n−1 respectively, the modified prefix sums of the n data are 0, A0, A0+A1, ..., A0+A1+...+An−2. In Step 4, the modified prefix sums computed in group n−1 of part 0 and group 2n−1 of part 1 are swapped, so that processor n−1 in group n−1 has the prefix sum of the previous n−1 groups of part 0 and processor n−1 in group 2n−1 has the prefix sum of the previous n−1 groups of part 1. In Step 5, the data from processor n−1 in group 2n−1, added to the local prefix sum in processor n−1 of group n−1, is broadcast to the other processors in the same group; each processor in this group then has the prefix sum of the previous n groups of part 0 and, finally, adds it to its modified prefix sum. After Step 6, processor n−1 in each group has the modified prefix sum. Following Step 7, each processor has the modified prefix sum. In the last step, each processor computes Equation (1) and thus holds the final result. Fig. 3 shows the process of prefix sum on the BSN-C3; we denote A0, A1, A2 as the prefix sums of processors <0, 0, 0>, <0, 1, 0>, <0, 2, 0> in group 1 and B0, B1, B2 as the prefix sums of processors <1, 0, 0>, <1, 1, 0>, <1, 2, 0> in group 2; the remaining groups are similar.
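The decomposition PS(p) = SD(p) + LS(p) of Equation (1), together with the role of the modified (exclusive) prefix sum of Step 3, can be checked with a short sequential sketch. This is our own illustration, not the parallel implementation; the data layout and names are assumptions.

// Sequential check of Equation (1): LS is the local prefix sum inside a group
// (Step 1) and SD is the exclusive prefix sum of the group totals (Step 3).
public class BsnPrefixSum {
    // data[g][p] holds D(.) for processor p of group g, groups ordered 0..2n-1.
    public static long[][] prefixSums(long[][] data) {
        int groups = data.length, n = data[0].length;
        long[][] ps = new long[groups][n];
        long carry = 0;                       // SD: sum of all preceding groups
        for (int g = 0; g < groups; g++) {
            long local = 0;                   // LS: running local prefix sum
            for (int p = 0; p < n; p++) {
                local += data[g][p];
                ps[g][p] = carry + local;     // PS(p) = SD(p) + LS(p)
            }
            carry += local;                   // exclusive scan over group totals
        }
        return ps;
    }
}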
(e) After Step 5, the processors in group 2 have the modified prefix sum and the local prefix sum.
Fig. 3. An example of prefix sum in the BSN with Ω=C3
We now consider the algorithm's complexity in the worst case. The algorithm performs 3 inter-group data transmissions, 2 broadcasts and 2 prefix sum operations (the time of arithmetic operations is ignored). Broadcast and prefix sum operations cost the most time on an array, so the complexity is worst when the basic network of the BSN is an array. Assuming that transmitting or broadcasting a data item costs one time unit, the broadcast time is n−1 and the prefix sum time is also n−1 on an n-processor array. So the whole algorithm's complexity on a BSN with 2n² processors is 4n−1 at worst.
3.3 Data Sum
Data sum is also known as semigroup computation: each processor is to compute the sum of the values of all processors. The algorithm is shown in Table 3.
Table 3. Algorithm for data sum
Step 1: each processor performs a data sum within its own group.
Step 2: each processor of each group transmits its data sum to the corresponding processor of the other part through the inter-group connection.
Step 3: each processor of each group performs a data sum operation on the data received from the other groups.
Step 4: each processor in each group swaps the result computed in Step 3 through the inter-group connection.
Step 5: each processor performs the final sum operation within its own group.

In Step 1, each processor performs a data sum within its group. After Step 2, each processor of part 0 has group-sum data from part 1 and, conversely, each processor of part 1 has group-sum data from part 0. In Step 3, each processor sums the data received through the inter-group connections, so the processors of part 0 (part 1) obtain the sum over all processors of part 1 (part 0). After Step 4, each processor in each group has the data needed to form the sum of all processors. In Step 5, all processors perform the sum operation and each processor of each group has the final data sum. On a BSN with 2n² processors, if the basic network is a complete graph, the complexity of our algorithm is the best. Suppose that intra-group and inter-group data transmissions cost one time unit each. One data sum operation is performed in each of Steps 1, 3 and 5, which costs 3 time steps in all at best. One data transmission operation is performed in each of Steps 2 and 4, which costs two time steps in total. So the whole algorithm complexity is 5 at best.

Table 4. Comparison between our algorithms and [6]

               OTIS-Mesh       BSN-Array        BSN-Mesh           BSN-Complete graph
Broadcast      4N^(1/4) − 3    √(2N)            2(8N)^(1/4) − 2    4
Prefix sum     8N^(1/4) − 6    2√(2N) − 1       4(8N)^(1/4) − 5    7
Data sum       8N^(1/4) − 7    3√(2N)/2 − 1     3(8N)^(1/4) − 4    5
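Returning to the data-sum algorithm of Table 3, the following sequential sketch summarizes the five-step exchange. It is our own simplification, not the parallel code; the data layout is an assumption.

// Sequential summary of data sum on a BSN with 2*n*n processors.
// val[part][g][p] is the local value; the returned value is what every
// processor ends up holding after Step 5.
public class BsnDataSum {
    public static long dataSum(long[][][] val) {
        int n = val[0].length;
        long[][] groupSum = new long[2][n];
        // Step 1: data sum inside every group (each processor learns its group's sum).
        for (int part = 0; part < 2; part++)
            for (int g = 0; g < n; g++)
                for (int p = 0; p < n; p++) groupSum[part][g] += val[part][g][p];
        // Steps 2-3: the group sums cross the swap links and each group of the
        // other part sums what it received, yielding the total of the opposite part.
        long sumPart0 = 0, sumPart1 = 0;
        for (int g = 0; g < n; g++) { sumPart0 += groupSum[0][g]; sumPart1 += groupSum[1][g]; }
        // Steps 4-5: the two partial totals are swapped back and added locally.
        return sumPart0 + sumPart1;
    }
}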
4 Conclusion
In this paper, we have developed algorithms for basic communication operations on the BSN, including broadcast, prefix sum and data sum, which are important in parallel computing, and we have also analyzed these algorithms' time complexity. Assuming that there are N processors in the OTIS-Mesh and in the BSN respectively, a comparison
between our basic communication algorithms and those of [6] in terms of time complexity is shown in Table 4. From Table 4, we see that our broadcast, prefix sum and data sum algorithms are better than the algorithms in [6] when the basic networks are the same. In our algorithms, the time complexity is constant when the basic network is a complete graph; that is, the time complexity is constant at best.
Acknowledgments. This work is supported by the Doctorate Foundation of South China University of Technology and the Open Research Foundation of the Guangdong Province Key Laboratory of Computer Network.
References 1. Parhami, B.: Swapped Interconnection Networks: Topological, Performance, and Robustness Attributes. Journal of Parallel and Distributed Computing 65, 1443–1452 (2005) 2. Day, K., Al-yyoub, A.: Topological Properties of OTIS-networks. IEEE Transactions on Parallel and Distributed Systems 13(4), 359–366 (2002) 3. Xiao, W.J., Chen, W.D., He, M.X., Wei, W.H., Parhami, B.: Biswapped Network and Their Topological Properties. In: Proceedings Eighth ACIS International Conference on Software Eng., Artific. Intelligence, Networking, and Parallel/Distributed Computing, pp. 193–198 (2007) 4. Wei, W.H., Xiao, W.J.: Matrix Multiplication on the Biswapped-Mesh Network. In: Proceedings Eighth ACIS International Conference on Software Eng., Artific. Intelligence, Networking, and Parallel/Distributed Computing, pp. 211–215 (2007) 5. Sahni, S., Wang, C.F.: BPC Permutations on the OTIS-Mesh Optoelectronic Computer. In: Proc. Fourth International Conference on Massively Parallel Processing Using Optical Interconnections, pp. 130–135 (1997) 6. Wang, C.F., Sahni, S.: Basic Operations on the OTIS-Mesh Optoelectronic Computer. IEEE Trans. Parallel and Distributed Systems 9, 1226–1236 (1998) 7. Coudert, D., Ferreira, A., et al.: Topologies for Optical Interconnection Networks Based on the Optical Transpose Interconnection System. Applied Optical IP 39(17), 2965–2974 (2000) 8. Day, K., Al-yyoub, A.: Topological Properties of OTIS-networks. IEEE Transactions on Parallel and Distributed Systems 13(4), 359–366 (2002)
Rule Engine Based Lightweight Framework for Adaptive and Autonomic Computing Jakub Adamczyk, Rafał Chojnacki, Marcin Jarząb, and Krzysztof Zieliński Institute of Computer Science, AGH - University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland {j.adamczyk, chojnacki, mj, kz}@agh.edu.pl http://www.ics.agh.edu.pl
Abstract. The paper describes a framework architecture called the Autonomic Management Toolkit (AMT). This toolkit was implemented to support dynamic deployment and management of adaptation loops. This requires automatic resource discovery, instrumentation and attachment to the Autonomic Manager (AM), and furthermore a scalable and easily changed decision-making module, which is a major part of the AM. The architecture of a system satisfying these requirements is proposed and described. This system is compared to PMAC (Policy Management Autonomic Computing) – a highly advanced software tool offered by IBM. The central element of AMT is a lightweight AM with a Rule Engine as its decision-making module. This makes the proposed solution lightweight and flexible. The AM activity is briefly specified and the process of constructing an execution loop is described. The proposed interfaces are specified; these interfaces are generally sufficient to support a wide range of policies, including standard regulators well known from control theory. Subsequently, AMT usage is illustrated by a simple example. The paper ends with an overview of related work and conclusions.
Keywords: autonomic manager, rule engine, adaptive, workload management, Drools, PMAC.
1 Introduction
Adaptive and autonomic management of computing resources is a well known problem facing computer scientists. For decades, system components and software have been evolving to deal with the increased complexity of system control, resource sharing, and operational management. The evolution of these trends addresses the increasingly complex and distributed computing environments of today [2]. The research started several years ago by IBM, Motorola, SUN and many other companies has resulted in software environments supporting policy-driven system management for autonomic computing [6]. One of those environments is PMAC (Policy Based Autonomic Computing) [5], developed by IBM. The majority of existing policy-driven systems [5, 6, 17] rely on the construction of an adaptation loop. A challenging problem is how to construct a system able to discover new resources during runtime, automatically generate an intermediate management layer and apply
selected policies with limited human intervention or a software system restart. Another related issue is the construction of a policy evaluation engine. The decision-making module, being a fundamental part of the Autonomic Manager, should be powerful enough to process many policies in a scalable way and should accept their specification expressed using existing rule [4] or policy specification languages [18]. Implementation of a system satisfying these requirements opens new possibilities for building a management or adaptation loop dynamically and with limited effort. This was the main motivation behind the construction of the Autonomic Management Toolkit (AMT), described in this paper. AMT is a lightweight framework for constructing adaptive and autonomic computing systems. The proposed solution exploits a different concept in comparison to the complex and heavy PMAC technology and promotes a lightweight approach [1]. Lightweight development is such an extensive topic that it is often difficult to specify exactly what it means. In this paper, a lightweight framework implies programming with Plain Old Java Objects (POJOs) and design patterns implemented with JMX [8] MBean components, used for coupling objects and integrating autonomic managers. The paper focuses on a framework architecture called AMT (Autonomic Management Toolkit) which uses JMX to control managed resources. The central element of this architecture is the lightweight autonomic manager. The most innovative concept applied in the AMT construction is the usage of a Rule Engine as a reasoner within the Autonomic Manager. Such an approach introduces a natural mapping of policies to production rule definitions [5]. As this mapping could introduce some constraints related to policy expression, it is not applied in most applications. The manager performs operations on managed resources registered with it. The managed resource's interface is fully compliant with PMAC. Newly discovered resources can be registered during runtime. A wrapper object implementing the managed resource interface for any resource represented as an MBean can be dynamically generated. This guarantees full flexibility and adaptability, not heretofore supported by PMAC. The wrapper objects support local and remote operation invocation and event notification, thus the managed resources can be located anywhere. The ongoing work is related to previous research performed at AGH UST concerning monitoring and management of virtualized environments [3], [12]. The structure of the paper is as follows. First, in Section 2, the AMT architecture is shortly described and the motivation behind its construction is discussed. Next, in Section 3 the functionality of the AMT Autonomic Manager is presented. In Section 4, integration techniques for managed resources are described. In Section 5, examples of AMT usage for management of Solaris containers are presented. Section 6 contains a comparison of existing solutions. The paper ends with conclusions.
2 AMT System Architecture Description The AMT architecture presented in Fig. 1 consists of several key subsystems, which correspond to components existing in PMAC, but are constructed to provide features listed in Table 1. A single entry point to AMT is the Autonomic Manager interface, which exposes operations supported by software modules of the Autonomic Manager described below:
− Policy Management Module (PMM) – policies defined by the system administrator are deployed to AMT.
− Policy Evaluators Module (PEM) – policies are obtained from storage, instantiated by a given reasoner and evaluated. The key point of this subsystem is an interface that supports interoperability with different reasoners.
− Resource Access Module (RAM) – defines resources and manages interactions compatible with the AMT specification. This part of the system has been implemented to offer full functional compatibility with PMAC.
These three modules offer the core AMT functionality, which is accessible through the AMT Console Module (ACM).
Fig. 1. Key components of architecture of Autonomic Management Toolkit
A JIMS Integration Module (JIM) is also available. This module enables integration of AMT with the JIMS [10], [11] (JMX-based Infrastructure Monitoring System) infrastructure, responsible for discovery and instrumentation of managed resources represented as MBeans. This part of the system can be substituted by any other monitoring system or instrumentation technique. Each policy available in AMT has the following properties:
− Name – must be unique within the policy scope.
− Scope – hierarchical structure of non-strings used for denotation of the given policy applicability domain.
− Type – similar to PMAC, each policy can be either solicited or unsolicited. Each solicited decision is a direct decision requested from a resource or any other external system. Solicited decisions have an input and an output; the input and the output are sets of key-value mappings. An unsolicited decision is a reaction to a system state change, evaluated periodically. An evaluation property defines the timespan between intervals.
− Evaluation in milliseconds – only for unsolicited policies, which are evaluated periodically; it defines the timespan between policy evaluations.
− Activated on startup – if set to true, policy evaluation is started immediately following AMT initialization.
− Reasoner – the name of the reasoner valid for policy evaluation.
AMT is designed to support many Rule Engines running in parallel, which may be attached to and detached from a running Policy Evaluator without a system restart. To resolve incompatibilities between different Rule Engines, AMT introduces the Reasoner Adapter (RA) concept, which implements the interface described in Table 1.
Table 1. Specification of the Reasoner Adapter interface
Method Name                 Description
attachResource              Attaches a new resource
detachResource              Detaches a resource
addPolicy                   Adds a policy for unsolicited and solicited decisions
removePolicy                Stops policy evaluation and removes it from the Rule Engine
evaluateSolicitedDecision   Returns a solicited decision or null if the decision is not available
getInfo                     Returns the Rule Engine description
shutdown                    Performs required operations upon detachment
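A plausible Java rendering of this interface is sketched below. The method names follow Table 1, while the parameter and return types, and the Policy stand-in, are assumptions, since the paper does not give signatures.

import java.util.Map;

// Minimal stand-in for AMT's policy object; only the name is modeled here.
interface Policy { String getName(); }

// Hypothetical Java form of the Reasoner Adapter interface of Table 1.
public interface ReasonerAdapter {
    void attachResource(Object resourceWrapper);            // attach a new managed resource
    void detachResource(Object resourceWrapper);            // detach a resource
    void addPolicy(Policy policy);                          // add a solicited or unsolicited policy
    void removePolicy(String policyName);                   // stop evaluation and remove the policy
    Map<String, Object> evaluateSolicitedDecision(String policyName,
                                                  Map<String, Object> input); // null if unavailable
    String getInfo();                                        // description of the underlying Rule Engine
    void shutdown();                                         // clean-up on detachment
}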
3 AMT Autonomic Manager Activities
The application of a Rule Engine as a major part of the Autonomic Manager influences its startup procedure and runtime activity. When the Rule Engine is started, it activates a rule base in the production memory for Rule Engine use. The rule base contains all rules and class definitions to be evaluated against facts. The Rule Engine is driven by the emergence or change of facts residing in the Working Memory. Such changes may be generated by external events, as results of Rule Engine activity, or by the expiration of a timer associated with previously received facts. In our case, facts represent manageable resources, which are implemented as MBean components created on demand, as described in Section 4. The activity of the Autonomic Manager is performed in the following steps, representing the execution of an adaptation loop, depicted in Fig. 2:
1. Managed Resources (MRs) are instantiated as MBeans.
2. Resource Wrappers of MRs, which play the role of facts, are constructed and inserted into the Working Memory. The Reasoner Adapter interface is used in this step.
3. Production Rules representing policies are loaded into the Production Memory. At this point, the Inference Engine is also started.
4. The Pattern Matching algorithm is performed on all rules in the Production Memory and facts present in the Working Memory.
Fig. 2. Autonomic Manager processing steps
5. All rules that are evaluated as true are added to the Agenda to be performed.
6. The action is performed on the representation of the MR in the Working Memory.
7. The action is forwarded to the MR via the Resource Wrapper and enforced with effectors.
8. MR parameter changes accessed by sensors are communicated to the Resource Wrapper, which in turn triggers execution of step 4.
Steps 4 to 8 constitute the main execution loop of the Autonomic Manager. Since rules are declarative Knowledge Representation forms, they are not analogous to functions in a procedural language. Instead, they are fired in response to changes in the facts available to the rule engine.
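The match/agenda/fire cycle of steps 4 to 7 can be illustrated with a small toy model. This is not the Drools engine used by AMT; it is only a sketch of the control flow, with a hypothetical CPU-usage wrapper as the single fact type and two made-up rules.

import java.util.*;
import java.util.function.Consumer;
import java.util.function.Predicate;

// Toy model of the adaptation loop: facts in a working memory are matched
// against rules, matching rules are queued on an agenda, and firing a rule
// acts on the fact, which stands for forwarding the action to the resource.
public class MiniRuleLoop {
    static class Rule {
        final Predicate<CpuWrapper> when; final Consumer<CpuWrapper> then;
        Rule(Predicate<CpuWrapper> when, Consumer<CpuWrapper> then) { this.when = when; this.then = then; }
    }
    static class CpuWrapper {                 // stand-in for a generated resource wrapper
        double usage; int shares;
        CpuWrapper(double usage, int shares) { this.usage = usage; this.shares = shares; }
    }

    public static void main(String[] args) {
        List<CpuWrapper> workingMemory = new ArrayList<>();      // step 2: facts
        List<Rule> productionMemory = List.of(                   // step 3: rules
            new Rule(w -> w.usage < 0.70, w -> w.shares += 10),  // usage below target: raise shares
            new Rule(w -> w.usage > 0.90, w -> w.shares -= 10)); // usage far above target: lower shares

        workingMemory.add(new CpuWrapper(0.55, 50));             // step 1: one managed resource

        // Steps 4-7: match, build the agenda, fire; step 8 would rerun this on sensor updates.
        Deque<Runnable> agenda = new ArrayDeque<>();
        for (CpuWrapper fact : workingMemory)
            for (Rule r : productionMemory)
                if (r.when.test(fact)) agenda.add(() -> r.then.accept(fact));
        while (!agenda.isEmpty()) agenda.poll().run();

        System.out.println("shares after firing: " + workingMemory.get(0).shares); // prints 60
    }
}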
4 Managed Resource Representation
For the execution loop implementation, a crucial point is the representation of managed resources. AMT uses JMX technology for accessing resources. A key element of the JMX architecture is the MBean Server, which is a container for MBean components – Java objects which represent manageable resources and which allow operating on attributes (reading/writing values), executing actions and subscribing to notifications of events related to these resources. Each MBean registered in a server has a unique assigned name and is accessible to clients using various protocols supported by appropriate connectors (RMI, SOAP, HTML, SNMP) installed in an MBean server. Resource representation in AMT is based on the PMAC framework concept, thus it also enables the use of the PMAC Autonomic Manager. To meet this requirement we have defined an abstract class Resource, which implements the JavaManagedResource interface, for resource representation. The Resource class specification and its dependencies on other classes and interfaces are depicted in Fig. 3.
Fig. 3. Specification of the AMT resource classes
For each manageable resource, an MBean resource wrapper class is generated with resource-specific action methods. The implementation of such a wrapper class depends on the MBean's properties, methods and notifications. When an MBean attach request is initiated, AMT checks whether a suitable resource wrapper class is available. If there is no such class, a wrapper generation process is started. The wrapper is generated from a parameterized template and transformed into Java class source code using the Apache Velocity library, then compiled with the Java Compiler API and loaded into the JVM. The proposed mechanism of resource attachment is flexible and enables discovering a managed resource during runtime, with standard JMX protocols, and generating a suitable wrapper class. Therefore, the adaptation loop may be constructed on the fly, which is a unique feature of AMT. Building upon our former work [10, 11, 12] on management of virtualized resources, it creates a complete framework which satisfies the aforementioned requirements.
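The generate-compile-load path can be sketched as follows. AMT uses an Apache Velocity template; a plain string template stands in for it here, and the class and method names, as well as the omission of error handling, are our simplifications.

import java.io.*;
import java.net.*;
import javax.tools.*;

// Rough sketch: expand a source template for a wrapper class, compile it with
// the Java Compiler API and load the result into the running JVM.
// mbeanName is assumed to be a valid Java identifier.
public class WrapperGenerator {
    private static final String TEMPLATE =
        "public class %sWrapper { public String describe() { return \"wraps MBean %s\"; } }";

    public static Class<?> generateAndLoad(String mbeanName, File workDir) throws Exception {
        String className = mbeanName + "Wrapper";
        File source = new File(workDir, className + ".java");
        try (Writer w = new FileWriter(source)) {
            w.write(String.format(TEMPLATE, mbeanName, mbeanName));    // "template" expansion
        }
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();  // requires a JDK, not a bare JRE
        compiler.run(null, null, null, source.getPath());              // emits the .class next to the .java
        URLClassLoader loader = URLClassLoader.newInstance(new URL[]{ workDir.toURI().toURL() });
        return loader.loadClass(className);                            // load the wrapper into the JVM
    }
}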
5 Examples of AMT Usage
This section presents a case study of controlling workloads within the Solaris 10 environment. AMT is used for workload management of Solaris Containers [8] based on mechanisms specified by control theory [13] and structured as a closed-loop AM workload manager. The whole system is treated as a black box. The controller uses current CPU usage values and adjusts them by changing shares (control signals) to maintain the requested CPU usage. The implementation uses a control loop managed by AMT, running within a JIMS Management Agent on the machine on which the workload is running. The goal of the control loop was to adjust the project.cpu-shares resource control to a value which would assure that a given percentage of CPU time
would always be available under conditions of constant load for a specific workload – for instance, a given project should be guaranteed 70% of CPU time when other active workloads (disturbances) are also present. A sample controller algorithm could use the Proportional (P) regulator, Eq. (1), where: (i) Uw – required CPU usage of workload Ww; (ii) Uwt – usage observed at time t for workload Ww; (iii) Swt – number of shares set at time t for workload Ww; (iv) Kp – proportional coefficient.

Swt+1 = Swt + Kp * e(t), where e(t) = Uwt – Uw    (1)
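A direct transcription of Eq. (1) into Java might look as follows. Field names mirror the symbols in the text; how the current usage is sampled and how the share value is written back to project.cpu-shares is outside this sketch, and the sign of Kp must be chosen so that the update drives the observed usage toward the target.

// One-variable proportional controller for project.cpu-shares, Eq. (1).
public class ProportionalShareController {
    private final double targetUsage;   // Uw, e.g. 0.70 for 70% of CPU time
    private final double kp;            // proportional coefficient Kp (Kp = 3 in the experiment)
    private double shares;              // Sw_t, the current project.cpu-shares value

    public ProportionalShareController(double targetUsage, double kp, double initialShares) {
        this.targetUsage = targetUsage; this.kp = kp; this.shares = initialShares;
    }

    /** One control step: Sw_{t+1} = Sw_t + Kp * e(t), with e(t) = Uw_t - Uw. */
    public double step(double observedUsage) {
        double error = observedUsage - targetUsage;   // e(t) as written in Eq. (1)
        shares = shares + kp * error;
        return shares;
    }
}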
The above control algorithm for Solaris 10 is implemented with a Drools rule-based policy. The rule definition implementation is enhanced with some heuristics; this rather complex task is explained in more detail in [12]. The experiment was performed under Solaris 10 running on a Sun Blade B100 (1 GB RAM, 650 MHz SPARC CPU) board. In this scenario a constant disturbance was activated after the controlled workload reached a steady state (considering the fact that it was the only CPU-bound workload, it reached close to 100% CPU usage). Fig. 4 presents a case where only one CPU-bound workload is started in the selected project, at its beginning. After a few seconds, when CPU usage increases to almost 100%, two other CPU-bound workloads are started in other projects, which results in a drop of the CPU usage of the selected project. Subsequently, after several more seconds, controller P is turned on. It changes the share allocation for the controlled project, stabilizing CPU usage at 70%. The experiment was performed with a sample value of Kp = 3. This section presented a simple scenario of how AMT can be used. It is a proof of concept which validates our AMT design rather than a real-life problem. We have described policies which focus on system optimization and try to maintain the system
Fig. 4. AMT used for adaptive management of Solaris 10 project.cpu-shares resource
in a stable state. These policies are able to initiate valid reactions when the environment state changes.
6 Related Work
The AMT system offers an alternative solution to the PMAC framework. Limitations and drawbacks of PMAC in comparison to the Autonomic Management Toolkit (AMT) are presented in Table 2. This comparison points out the AMT features which make the proposed solution more flexible and easier to use. A list of projects in the area of autonomic computing conducted by academic institutions can be found in [14]. The presented research can be compared to the research performed under the project entitled "Models and Extensible Architecture for Adaptive Middleware Platform" [15] by National ICT Australia. Similarly to AMT, this project assumes that the ability to develop/implement autonomic services must have the following general characteristics: (i) Standards-based: programming frameworks to construct autonomic services must leverage standard services; (ii) Maintain Separation of Concerns: the adaptive behavior must be expressed in complete separation from the main application's business logic; (iii) Deployment/Development Scalability: solutions to provide adaptive behavior should scale down to small-scale applications and scale up to large multi-server deployments. Despite similar assumptions, the project facilitates the development of adaptive behavior for legacy server applications [16] with support for aspect programming. The Adaptive Server framework developed under this project [17] supports the development of adaptive behavior for server-side components running on application servers. This is in contrast to AMT, which addresses the general issue of adaptation loop construction, powered by Rule Engines.
Table 2. Comparison of PMAC and AMT frameworks
Feature                                               PMAC      AMT
Application server required to support operations    Yes       No
Resource description by WSRF needed                  Yes       No
Resources attachment only during the system restart  Yes       No
Policy syntax is verbose                              Yes       No
Policy specification expressiveness                   Limited   Depends on policy evaluator
Number of policy evaluators available                 One       Many
Activity logging and errors reporting                 Limited   Full and extendable
Declarative style of policy representation only       Yes       No
7 Conclusions The implementation of the AMT system exploits the potential of Rule Engine-based computing which is a very attractive solution for policy-driven autonomic computing systems. However, integration of the Rule Engine with managed computer resources is not easy and requires proper virtualization of such resources. This aspect is elaborated in related work [10, 11, 12]. As the Rule Engine [4] is a rather sophisticated software module which supports scalable pattern matching algorithms, it may be used for a large number of facts and rules constituting a representation of knowledge. This feature is important for AC systems typically built as a hierarchy of self-management subsystems. Each such subsystem usually provides [6] self-configuration, self-optimization, self-healing and self-protection functionality. In spite of differences between these terms, they are handled similarly – typically driven by rules, which specify a high-level goal of system activity. A more exhaustive evaluation of the AMT concept calls for an extensive performance study. The existing overheads and bottlenecks should be identified. The presented work is only a proof of concept, showing solutions to integration and interoperability problems affecting various technologies used for AMT implementation. The constructed framework is fully operational and open to further enhancements.
References 1. Tate, B.A., Gehtland, J.: Better, Faster, Lighter Java. O’Reilly Media, Sebastopol (2005) 2. Ganek, A.G., Corbi, T.A.: The dawning of the autonomic computing era, http://www.research.ibm.com/journal/sj/421/ganek.html 3. Janik, A., Zielinski, K.: Transparent Resource Management with Java RM API. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3994, pp. 1023–1030. Springer, Heidelberg (2006) 4. Morgan, T.: Business Rules and Information Systems: Aligning IT with Business Goals. Addison-Wesley Professional, Reading (2002) 5. Policy Management for Autonomic Computing, Tivoli, version 1.2.1, http://dl.alphaworks.ibm.com/technologies/pmac/PMDevGuide121. pdf 6. Strassner, J.C.: Policy-based Network Management – Solutions for the Next Generation. Morgan Kaufmann, San Francisco (2004) 7. Sullins, B.G., Whipple, M.B.: JMX in Action. Manning Publication Co. (2003) ISBN: 1930110561 8. Lageman, M.: Solaris Containers – What They Are and How to Use Them, Sun Microsystems (2005), http://www.sun.com/blueprints/0505/819-2679.pdf 9. Szydło, T., Szymacha, R., Zieliñski, K.: Policy-based Context-aware Adaptable Software Components for Mobility Computing. In: EDOC 2006: 10th IEEE International Enterprise Distributed Object Computing Conference, Washington, DC, USA, pp. 483–487 (2006) 10. Zieliński, K., Jarząb, M., Wieczorek, D., Bałos, K.: JIMS Extensions for Resource Monitoring and Management of Solaris 10. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3994, pp. 1039–1046. Springer, Heidelberg (2006)
11. Zielinski, K., Jarzab, M., Wieczorek, D., Balos, K.: Open Interface for Autonomic Management of Virtualized Resources in Complex Systems - Construction Methodology, Future Generation Computer Systems, http://www.sciencedirect.com/science/journal/0167739X 12. Jarzab, M., Zieliński, K.: Framework for Consolidated Workload Adaptive Management. In: 2nd IFIP CEE-SET 2007, Software Engineering in Progress, NAKOM, pp. 17–30 (2007) ISBN 978-83-89529-44-2 13. Hellerstein, J.L., Diao, Y., Parekh, S., Tilbury, D.M.: Feedback Control of Computing Systems. Wiley-IEEE Press, Chichester (2004) 14. Portal for Autonomic Computing resources, http://www.autonomiccomputing.org 15. Enabling Adaptation of J2EE Applications Using Components, WebServices and Aspects, National ICT Australia, http://www.cse.unsw.edu.au/~yliu/asf-demo/index.html 16. Liu, Y., Gorton, I.: Implementing Adaptive Performance Management in Server Applications. In: ICSE 2007 workshop on Software Engineering for Adaptive and Self-managing Systems (SEAMS) (2007) 17. Gorton, I., Liu, Y., Trivedi, N.: An Extensible, Lightweight Architecture for Adaptive J2EE Applications. In: International Workshop of Software Engineering and Middle-ware (SEM) (November 2006) (accepted) 18. Kephart, J.O., Walsh, W.: An artificial intelligence perspective on autonomic computing polices. In: IEEE 5th Intl. Workshop on Policies for Distributed Systems and Networks, pp. 3–12 (2004)
A Monitoring Module for a Streaming Server Transmission Architecture
Sadick Jorge Nahuz1, Mario Meireles Teixeira2, and Zair Abdelouahab1
1 Graduate Program in Electrical Engineering, Federal University of Maranhao, Campus do Bacanga, 65085-580 Sao Luis, MA, Brazil
2 Department of Informatics, Federal University of Maranhao, Campus do Bacanga, 65085-580 Sao Luis, MA, Brazil
[email protected], [email protected], [email protected]
Abstract. The Internet has experienced a considerable increase in the use of audio and video applications, which provoke a large consumption of the resources available in the network and servers. Therefore, the monitoring and analysis of those resources has become an essential task in order to enhance the service delivered to users. This work describes a monitoring module implemented in a video server architecture, which is used to track the transmission of some popular video formats. Our experiments have demonstrated that one of the formats delivers a performance considerably better than the other regarding the bandwidth allocated to each user session, which reinforces the importance of having such a monitoring module available in a server's architecture.
Keywords: monitoring, qos, streaming server, mpeg-4, mov.
1 Introduction
In recent years there has been an amazing development and wide spreading of network applications that transmit and receive audio and video data through the Internet. These applications have service requirements different from those of traditional data-oriented applications based on text/images, e-mail, FTP and DNS [1]. Today, there is a great demand for a class of Internet applications known as Multimedia Systems. This class of systems differs from others because they need to transmit multimedia data such as video and audio, generating a large data stream and provoking an accentuated consumption of Internet resources. Multimedia systems supply services oriented towards the execution of applications that need resource guarantees so as not to hinder their performance. Multimedia systems are very sensitive to end-to-end delays and to delay variation, but they can tolerate occasional data losses [1]. Some current networks cannot adapt to multimedia services because they were not designed to stand such tasks. There are many attempts to expand the current network architectures in order to enhance service quality and to provide support for multimedia applications. Currently, almost all news sites have a video format for their subscribers. In the next years, multimedia services will dominate a great part of the Internet flow, mainly the live transmission ones, but there is much to be researched and defined in that area [2].
As high-speed Internet access reaches an increasing number of households, multimedia services will become part of most users' routine [9]. Everybody will access their favorite programs, irrespective of their transmission scheduling, since there will always be a file stored in a streaming server, which uses a real-time transmission protocol. This paper details a monitoring module designed and implemented in the Darwin Streaming Server, from Apple. This module is used to evaluate this architecture serving both MOV and MP4 files. Transmission peculiarities of both file formats are discussed, and the monitoring demonstrates that one of them is able to reach considerably higher throughput than the other. Sec. 2 discusses the protocols used for streaming media transmission on the Internet. Sec. 3 details the streaming server chosen for this work, namely the Darwin Streaming Server. Sec. 4 deals with the monitoring implemented in the streaming server and also the changes performed in it. Sec. 5 discusses the experiments and their results. Sec. 6 presents this work's main conclusions and their possible unfolding.
2 Streaming Media Protocols
Real-time applications do have some peculiarities, as commented above. Thus one might expect that they demand protocols specifically designed to attend to their characteristics. This section analyzes the Internet protocols used for streaming media transmission: RTP, RTCP and RTSP.
2.1 RTP and RTCP Protocols
The real-time transport protocol, known as RTP, was designed to support the traffic of applications that broadcast real-time data. It is normally integrated inside the application (in user mode), so it is not implemented as part of the operating system kernel. Defined in RFC 1889 [3], the RTP protocol is a product of the Audio/Video Transport Working Group, and it was formally introduced in January 1996 by the Network Working Group of the IETF (Internet Engineering Task Force), aiming at standardizing functionalities for real-time data transmission applications [22]. RTP offers end-to-end transport functions for devices that transmit real-time data, such as audio and video, on unicast or multicast service networks, being characterized as a connectionless protocol. These functions include the identification of the type of data to be handled, sequence numbers, timestamps and data transmission monitoring. Although RTP offers end-to-end transfers, it does not offer all functionalities usually found in a transport protocol. Besides that, it does not reserve network resources and does not guarantee Quality of Service (QoS) [10] for the applications, nor does it promote reordering or retransmission in case of packet loss, assigning these responsibilities to the applications [11]. The RTP and RTCP protocols were designed not to depend on the underlying network and transport layers. While RTP is in charge of the transport of the streaming media (audio and video), RTCP controls the information returned by
users that received the data, informing about the quality of reception and data transfer, and supporting the synchronization of different media flows.
2.2 RTSP Protocol
The RTSP (Real-Time Streaming Protocol) addresses the "streaming stored audio and video" application class, operates in the application layer and was conceived by Real Networks, Netscape Communications and Columbia University. Its first RFC was published by the IETF under the number 2326 [4]. RTSP is a public-domain protocol that allows client-server interaction between the constant-rate media stream source (server) and the user's player [23]. That interactivity arises from the user's need for greater control over media playback in the player. The RTSP functionality can be summarized as control of the file's execution, similar to the functionality a DVD player makes available to its users; RTSP allows a player to control the media stream by means of commands.
2.3 Streaming Media Formats
MPEG-4 is the global multimedia standard that broadcasts professional-quality audio and video over various types of bandwidth, from cellular phone networks to WANs and others. MPEG-4 was defined by the Moving Picture Experts Group (MPEG) [12] [14], which actively participates in the International Organization for Standardization (ISO), which specified the well-known MPEG-1 and MPEG-2 standards. Hundreds of researchers worldwide contributed to the building of MPEG-4 [15], which was concluded in 1998 and became an international standard in 2000. MOV, another quite popular multimedia format, is used to store video sequences and was created by Apple Computer. The QuickTime Player was presented by Apple as an alternative to the Windows Media platform from Microsoft, betting in favor of the variety of formats available for content distribution. This video format became still more popular when its specifications were chosen by the MPEG consortium. In this work, the serving of both types of files, MP4 and MOV, is monitored and analyzed from the perspective of a streaming server, the Darwin Streaming Server, properly enhanced with our Monitor Module.
3 The Darwin Server
The Darwin Streaming Server from Apple is an open-source video server that can serve 3GP, QuickTime (MOV) and MPEG-4 videos to clients over the Internet using the standard RTP and RTSP protocols. It is based on the QuickTime Streaming Server code. Darwin also supports live broadcasting, since it uses a video encoder called mp4live, which is part of MPEG4IP [5]. The latter performs real-time video capture, generating a stream to be sent via unicast to the Darwin server, which distributes it using the RTP and RTSP protocols [16]. Darwin is an event-driven server working as a set of processes that execute the RTSP, RTP and RTCP standard streaming media broadcasting protocols. The server
supports various compatible streaming file formats and stands up to two thousand or more on-line users [6] [18]. To perform streaming media broadcasting via RTP, it is necessary to attach a hint track at the beginning of each file. A hint track specifies, for example, the RTP time scheduling and the packet's maximum size (Maximum Transmission Unit – MTU).
3.1 Server Architecture
The Darwin server is composed of a parent process that creates a child process, which is the core of the server. The parent waits for the child to exit; if the child process exits with an error, the parent process creates a new child process. The core server acts as an interface between the server modules and the clients in the network, which use the RTP and RTSP protocols to send requests and receive responses (Fig. 1). The server modules process requests and deliver packets to the client. The server core does its work by creating four threads:
• A Main Thread manages the server, checking whether it needs to terminate, generating a status log or printing statistics;
• An Idle Task Thread manages a queue of tasks that occur periodically. There are two kinds of task queues: the timeout tasks and the sockets;
• An Event Thread listens to the sockets to receive RTSP requests or RTP packets and delivers them to the Task Threads; and
• The Task Threads receive the RTSP and RTP requests from the Event Thread. Then, the Task Thread asks the proper server module to process and send packets to the client.
The Darwin server is complex, works asynchronously and needs an event-based communication mechanism. For example, when a socket is used in an RTP connection to acquire data, someone must be notified so that the data can be processed [19]. Each Task object has two main methods: Signal and Run. The Signal method is called by the server to deliver an event to a Task object. Then, Run is requested in order to schedule a Task to process the event. The goal of each Task object is to implement the server's functionality using small time slices that do not block. The Run method is in fact a virtual function that is requested whenever a Task object has events to be processed. Within the Run function, the Task object can call GetEvents to automatically receive all the requests that remain in the queue and previously signaled events. The Run function is never reentered if a Task object calls GetEvents within the Run function and signals before the function is completed. The Run function will only be called for a new event after it is finished. In fact, the Run function will be repeatedly invoked until all events in the Task object are cleared up by GetEvents.
3.2 Server Functioning
The Darwin Streaming Server works by broadcasting its content to all clients that requested a file or by serving it on demand. When a client requests a file, the number of packets emitted to each client will depend on the time elapsed since the beginning of the transmission (broadcast) and on the moment in which the client
Fig. 1. Darwin Server Architecture
established the connection with the server [17]. Each client's request is served with a complete broadcast of the file [8]. The server has a modular architecture so as to facilitate the building of new functionalities through independent modules. The server uses a main process, called Core Server, which truly is an interface between the clients' requests and the modules. Modules must follow a specific structure, with two mandatory methods, one requested by the server to initialize the module and the other to execute a certain task. For the server to know which modules must be requested for a certain event, each module must state its role explicitly. Thus, each module must have a list of actions called "Roles". When the server starts, it first loads the modules that are not compiled into the server (dynamic modules) and then the compiled ones (static modules). The server then invokes each module from the QTSS (Core) with the Register functionality, a role every module must support. Next, the module calls QTSS_AddRole to specify the other roles it can have. After this, the server invokes the 'Initialize' role for each module registered in that role. 'Initialize' executes any initialization task the module requires, allocating memory and global data structures. When the server is deactivated, it invokes a 'Shutdown' role for each module registered in that role. When it executes the Shutdown, the module must perform a clean-up and release global structures.
4 The Monitor Module
In this work, we developed a monitoring module which is responsible for obtaining the bandwidth use and packet loss rate of each user session. The Monitor Module directly
Fig. 2. Monitor Module Block Diagram
interacts with the Darwin Server kernel (Fig. 2). It collects data from the server core, and these data are processed on-line within the attributes of a server object. Those attributes work directly with the server's structures, constantly computing the different information needed to assess its performance [17] [21]. This information remains hidden within the Darwin server itself, and the Monitor Module was implemented to collect it and to create an interface to visualize the data. The Monitor Module implements two basic routines: the Main Routine and the Dispatch Routine. As for the Main Routine, all QTSS modules must supply one; the server requests the Main routine at initialization and uses the 'Initialize' role to access the QTSS stub library (which loads the libraries), so that the server can later invoke the module. As for the Dispatch Routine, every QTSS module must provide one; the server requests the dispatch routine when it invokes a module for a specific task, passing to the dispatch routine the task name and a task-specific parameter block. The Monitor Module implements three roles: QTSS_Register_Role, QTSS_Initialize_Role and QTSS_RTSPFilter_Role. When QTSS_Register_Role is invoked by the server, the module records the roles it wishes to be notified of, which are QTSS_Initialize_Role and QTSS_RTSPFilter_Role. When QTSS_Initialize_Role is invoked, the module performs the initialization of global objects, for example obtaining a reference to the QTSS_ServerObject (the object that represents the server). And when the server invokes the QTSS_RTSPFilter_Role, the module processes the request. When the Monitor Module is notified through the QTSS_RTSPFilter_Role, it obtains the value of the server's total transfer rate from the qtssRTPSvrCurBandwidth attribute of the QTSS_ServerObject object. Next, for each client's session, it gets
the transfer rate from the qtssCliSesCurrentBitRate attribute, the client's address from the qtssCliRTSPSessRemoteAddrStr attribute and the amount of lost packets from the qtssCliSesPacketLossPercent attribute of the object that represents each client's session. After this, it calculates the bandwidth use and stores the results in a file. We also developed a program in Java that reads the results file and presents it graphically. There are two visualization modes: one shows the bandwidth and the other shows the packet losses.
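The paper does not give the exact formula used for this calculation; a natural choice, consistent with the attributes listed above and with the graphs of Section 5, is to relate the measured rates to the nominal capacity C of the network link (C is an assumption of this sketch, e.g. the capacity of the LAN used in the experiments):

  bandwidth use [%] = 100 × qtssRTPSvrCurBandwidth / C
  session share [%]  = 100 × qtssCliSesCurrentBitRate / qtssRTPSvrCurBandwidth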
5 Experiments and Results

With the Monitor Module developed in this work, we conducted experiments with the Darwin server in a 100 Mbps LAN with more than 20 computers. All machines accessed a single Darwin video server. In this experimental stage the implemented Monitor Module was successfully tested, since it collected all the information within the core server and generated the connection graphs. The experiments were conducted in two steps. In the first step, MP4 video files of about 50 MB were used. It was noticed that the Darwin server always tries to keep the files at the same data rate, but by doing so the server ends up with a large amount of unused bandwidth that could be scheduled to speed up the broadcast [20]. In the video played by the clients, excellent performance and optimal image quality were observed when using the QuickTime Player [7] and the RTP protocol to access the server. Fig. 3 shows the monitoring graph of the Darwin server with MP4 video files. It is important to notice that the server is only using 10% of the available bandwidth, in
Fig. 3. Monitoring of MPEG-4 broadcasting
such a way that its total data output rate is on the order of 10 Mbps. The Darwin server thus leaves 90% of its bandwidth completely unused. In the second experimental step we used MOV video files of up to 60 MB. It was noticed that the Darwin server treats MOV files differently, since not all files presented the same data output rate. According to the quality of each video file, the server provides more or less bandwidth for the session. Again, the player used was QuickTime, with the RTP protocol to access the server. QuickTime itself adds hint tracks to the MOV files so that they work with the streaming server, but that operation is not automatic: it is necessary to export the MOV file in the hinted-track format. Fig. 4 shows the monitoring graph of the Darwin server with MOV video files. Unlike the former step with MP4 files, this time the Darwin server uses 90% of the available bandwidth, attaining a total data output rate of 90 Mbps, as can be seen in the graph. At some moments, some videos use almost the whole bandwidth available in the server. The image quality is equivalent to the one previously obtained with MP4 files. The Darwin server has a management feature which provides a summary of the active sessions. Analyzing the active sessions we detected that the MOV files present an output rate five times higher than that of the MP4 video files. From the experiments we can conclude that serving MP4 files via streaming takes more processing time on the Darwin server; for this reason, the server cannot deliver these files at the same output rate as the MOV files. We believe that this is not due to some 'hidden' fault in the Darwin server, but to characteristics inherent to each file format, MOV files being, by their nature and historical origins, more naturally suited to being broadcast via streaming. Hence, the importance of a monitoring module such as the one developed here and implemented within the Darwin server architecture is clearly visible.
Fig. 4. Monitoring of MOV broadcasting
Without proper tracking of user sessions, such a broadcasting discrepancy between the two video file formats could easily be overlooked by the system administrator.
6 Conclusions

This paper describes a monitoring module implemented for a streaming video server, in this case the Darwin Streaming Server from Apple, an open-source server based on the QuickTime Streaming Server code of the same manufacturer. For this goal, the peculiarities of Internet multimedia broadcasting were analyzed, as well as the RTP and RTCP protocols. Two popular video file formats, MP4 and MOV, were selected for the experiments reported here. The Darwin server architecture was analyzed in detail, as well as the architecture of our own monitor, which was implemented as a Darwin module. The experimental results, obtained by means of our Monitor Module, show a significant difference between the delivery rates of MP4 and MOV files, the latter being delivered at a rate five times higher. This demonstrates the importance of having a monitoring module such as the one presented here, since it can become an important tool for a system manager who needs to follow the server's performance. The Monitor Module presented here is still an ongoing project. In the near future we intend to extend it to capture information other than the used bandwidth and the packet loss rate of each user session, thus enhancing its applicability. We will also analyze more carefully the different behavior found while serving MOV and MP4 media files.
References 1. Kurose, J., Ross, K.: Redes de Computadores e a Internet. Addison-Wesley, Reading (2003) 2. Tanembaum, A.: Sistemas Operacionais Modernos, 2nd edn. Prentice-Hall, Englewood Cliffs (2003) 3. Schulzrinne, H.: A Transport Protocol for Real-Time Applications (RTP) (1996), http://www.ietf.org/rfc/rfc1889.txt 4. Schulzrinne, H., Rao, A., Lanphier, R.: Real Time Streaming Protocol (RTSP) (1998), http://www.ietf.org/rfc/rfc2326.txt 5. Mpeg4ip Commuty Site, http://mpeg4ip.sourceforge.net 6. Darwin Project Site, http://developer.apple.com/darwin/projects/streaming 7. QuickTime Apple Web Site, http://www.apple.com/quicktime 8. Ferguson, P., Huston, G.: Quality of Service: Delivering QoS on the Internet and in Corporate Networks. John Wiley, Chichester (1998) 9. Comer, D.E.: Internetworking with TCP/IP: Principles, Protocols and Architecture, 4th edn., vol. 1. Prentice-Hall, Englewood Cliffs (2000) 10. Busse, I., Deffner, B., Schulzrinne, H.: Dynamic QoS Control of Multimedia Applications based on RTP. Computer Communications 19, 49–58 (1996) 11. Stallings, W.: High-Speed Networks and Internets: Performance and Quality of Service, 2nd edn. Prentice-Hall, Englewood Cliffs (2002) 12. MPEG Site, Moving Picture Experts Group, http://www.mpeg.org
13. Gnustream: A P2P Media Streaming System Prototype (2003) 14. The MPEG-4 Fine-Grained Scalable Video Coding Method for Multimedia Streaming over IP (2001) 15. Wakamiya, N., Miyabayashi, M., Murata, M., Miyahara, H.: MPEG-4 Video Transfer with TCP-Friendly Rate Control. In: Al-Shaer, E.S., Pacifici, G. (eds.) MMNS 2001. LNCS, vol. 2216, pp. 29–42. Springer, Heidelberg (2001) 16. Seungjoon, S.B.: Scalable Resilient Media Streaming (2003) 17. Zhao, W., Tripathi, S.K.: Bandwidth-Efficient Continuous Media Streaming Through Optimal Multiplexing (1999) 18. Sen, S., Rexford, J., Towsley, D.: Proxy Prefix Caching for Multimedia Streams (1999) 19. Chen, S., Shen, B., Yan, Y., Basu, S., Zhang, X.: Fast Proxy Delivery of Multiple Streaming Sessions in Shared Running Buffers. IEEE Transactions on Multimedia 6(7) (2005) 20. Anastasiadis, S.V., Sevcik, K.C., Stumm, M.: Server-Based Smoothing of Variable BitRate Streams. In: ACM Multimedia, pp. 147–158 (2001) 21. Tripathi, Z.: Bandwidth-Efficient Continuous Media Streaming Through Optim (1999) 22. Schulzrinne, H., Gokus: RTP: A transport protocol for real-time applications (1996) 23. Schulzrinne, H., Rao, A., Lanphier, R.: Real time streaming protocol (RTSP), request for comments 2326 (April 1998)
BSP Functional Programming: Examples of a Cost Based Methodology

Frédéric Gava

Laboratory of Algorithms, Complexity and Logic, University of Paris-Est
[email protected]
Abstract. Bulk-Synchronous Parallel ML (BSML) is a functional dataparallel language for the implementation of Bulk-Synchronous Parallel (BSP) algorithms. It makes an estimation of the execution time (cost) possible. This paper presents some general examples of BSML programs and a comparison of their predicted costs with the measured execution time on a parallel machine. Keywords: BSP Functional Programming, Cost Prediction.
1 Introduction
Solving a problem is often a complex job, especially when a parallel machine is used: it is necessary to manage communication, synchronisation, partition of data, etc. at the same time. Algorithmic models and high-level languages are needed to simplify both the design of parallel algorithms (the ability to compare their costs1) and their programming in a safe, efficient and portable manner. BSML is an extension of ML designed for the implementation of BSP algorithms as functional programs using a small set of parallel primitives. BSP [3,17] is a parallel model which offers a high degree of abstraction and allows scalable and predictable performance on a wide variety of architectures with a realistic cost model based on a small set of machine parameters. Deadlocks and nondeterminism are avoided. BSML is implemented as a parallel library2 for the functional programming language Objective Caml (OCaml). Our methodology is as follows: first, analyse the complexity of the sequential algorithm, then design one or more parallel algorithms, analyse their BSP costs, compute the BSP parameters of the parallel machines, program these algorithms in BSML and finally test the performance of the programs on different architectures. Using safe high-level languages like ML to program BSP algorithms (that is, BSML) allows performance, scalability and expressivity. Other approaches to safe high-performance computation exist. We can cite concurrent programming [6], more or less synchronous processes [5,18], the automatic parallelization of programs [1] and algorithmic skeletons [7]. In the first two cases, the expressivity of concurrence or mobility is sought (but with
1 We speak about complexity for sequential algorithms and cost for parallel ones.
2 Web page at http://bsmllib.free.fr.
lower performance). In other cases, it’s the simplicity of parallelism (which becomes almost transparent) which is sought. As opposed to these methods, we prefer the use of a performance model3 at the cost of expressiveness. Indeed, other approaches do not allow, in their intrinsic models, to predict the run-time4 . That makes algorithmic optimisations and the choice of the best algorithm for a given architecture difficult: even if the number of processes/threads/workers or their computations can be limited, it is often difficult to analyse the communication times or the placement of computation and data on the processors. Nevertheless, it is possible to find empirical optimisations (by successive tests) or to design better schedulers [2]. But algorithmic optimisations are still hard to analyse. In this article, we illustrate our methodology with simple examples of problems that illustrate many aspects of classical algorithmic problems.
2 Functional Bulk-Synchronous Parallel Programming
2.1 The Bulk-Synchronous Parallel Model
In the BSP model, a computer is a set of uniform processor-memory pairs and a communication network allowing inter-processor delivery of messages (for the sake of conciseness, we refer to [3,17] for more details). A BSP program is executed as a sequence of super-steps, each one divided into three successive disjoint phases: each processor uses its local data (only) to perform sequential computations and to request data transfers to other nodes; the network delivers the requested data; a global synchronisation barrier occurs, making the transferred data available for the next super-step. The performance of the BSP machine is characterised by 4 parameters: the local processing speed r; the number of processors p; the time l required for a barrier; and the time g for collectively delivering a 1-relation, a communication phase where every processor receives/sends at most one word. The network can deliver an h-relation (every processor receives/sends at most h words) in time g × h. The execution time (cost) of a super-step is the sum of the maximal local processing time, the data delivery time and the global synchronisation time.
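In symbols, if w_i^(s) is the local work of processor i in super-step s and h_i^(s) the maximum number of words it sends or receives, this cost is the standard BSP formula (restated here for reference):

  cost(s) = max_{0 ≤ i < p} w_i^(s) + g × max_{0 ≤ i < p} h_i^(s) + l

and the cost of a whole program is the sum of the costs of its super-steps.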
3 Note that this observation has already been made in the context of programming C+BSP matrix computations [12,14].
4 Some tools [8,11] exist but are too complex to be used or are not implemented.
bsp_p: unit → int    bsp_l: unit → float    bsp_g: unit → float
mkpar: (int → α) → α par
apply: (α → β) par → α par → β par
put: (int → α) par → (int → α) par
proj: α par → int → α
super: (unit → α) → (unit → β) → α ∗ β

Fig. 1. The BSML primitives
2.2 Bulk-Synchronous Parallel ML
BSML is based on 8 primitives (Fig. 1), three of which are used to access the parameters of the machine. The implementation of these primitives relies either on MPI, on PUB [4] or on the TCP/IP functions provided by the Unix module of OCaml. A BSML program is built as a sequential program on a parallel data structure called a parallel vector. Its ML type is α par, which expresses that it contains a value of type α at each of the p processors. The BSP asynchronous phase is programmed using the two primitives mkpar and apply, so that (mkpar f) stores (f i) on process i (f is a sequential function):

  mkpar f = ⟨ f 0 , … , f i , … , f (p−1) ⟩

and apply applies a parallel vector of functions to a parallel vector of arguments:

  apply ⟨ … , fi , … ⟩ ⟨ … , vi , … ⟩ = ⟨ … , fi vi , … ⟩

The first communication primitive is put. It takes as argument a parallel vector of functions which should return, when applied to i, the value to be sent to processor i. put returns a parallel vector with the vector of received values: at each processor these values are stored in a function which takes as argument a processor identifier and returns the value sent by this processor. The second communication primitive, proj, is such that (proj vec) returns a function f where (f n) returns the nth value of the parallel vector vec. Without this primitive, the global control cannot take into account data computed locally. The primitive super allows the evaluation of two BSML expressions as interleaved threads of BSP computations. From the programmer's point of view, the semantics of the superposition is the same as pairing, but the evaluation of super E1 E2 is different: the phases of asynchronous computation of E1 and E2 are run; then the communication phase of E1 is merged with that of E2 and only one barrier occurs; if the evaluation of E1 needs more super-steps than that of E2 then the evaluation of E1 continues (and vice versa).
2.3 Often Used Parallel Functions
The primitives described in the previous section constitute the core of the BSML language. In this section we present some useful functions which are part of the standard BSML library. For the sake of conciseness, their full code is omitted. replicate: α → α par creates a parallel vector which contains the same value everywhere. The primitive apply can be used only for a parallel vector of functions that take one argument; to deal with functions that take two arguments we need to define the apply2: (α → β → γ) → α par → β par → γ par function. It is also common to apply the same sequential function at each processor. We use the parfun functions, where only the number of arguments to apply differs: parfun: (α → β) → α par → β par and parfun2: (α → β → γ) → α par → β par → γ par.
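For illustration, these helpers can be defined from the core primitives along the following lines (a sketch only; the actual definitions in the BSML standard library may differ):

  let replicate x = mkpar (fun _ -> x)                      (* the same value on every processor *)
  let parfun f v = apply (replicate f) v                    (* apply the same sequential f everywhere *)
  let parfun2 f v1 v2 = apply (apply (replicate f) v1) v2   (* idem for a two-argument function *)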
[Two plots, "BSP parameter g" and "BSP parameter l", give the measured values (in flops) against the number of processors (2 to 16) for MPICH, OPEN MPI and PUB.]
Fig. 2. BSP parameters of our machine
It is common to perform a total exchange. Each processor contains a value (the whole forming a parallel vector of values) and the result of rpl_total: α par → α list applied to ⟨ v0 , … , vp−1 ⟩ is [v0 , … , vp−1 ], a list of these values, on each processor.
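Such a total exchange can be programmed with put in one communication super-step, for instance as follows (a sketch; the library version of rpl_total may be written differently):

  let rpl_total v =
    let msgs  = parfun (fun x -> fun _dst -> x) v in   (* offer the local value to every destination *)
    let recvd = put msgs in                            (* one h-relation with h = p words *)
    let rec upto i n = if i >= n then [] else i :: upto (i+1) n in
    parfun (fun from -> List.map from (upto 0 (bsp_p ()))) recvd   (* [v0; ...; v(p-1)] everywhere *)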
2.4 Computation of the BSP Parameters
One of the main advantages of the BSP model is its cost model: it is quite simple and yet accurate. We used the BSML implementation [15] of a program [3] that benchmarks and determines the BSP parameters of our machine (a cluster of 16 Pentium IV 2.8 GHz, 1 GB RAM nodes interconnected with a Gigabit Ethernet network). We then computed these parameters for 3 libraries, corresponding to 2 different implementations of BSML: MPICH, OPEN-MPI5 and PUB [4]. Fig. 2 summarises the timings (where r is 330 Mflops/s for each library) for an increasing number of processors. We notice that the parameter l grows in a quasi-linear way for the PUB library. However, for the MPI libraries, two jumps are visible; no explanation has been found yet. For the parameter g, it is surprising to see that it is high for a few processors and then more stable (the real exchange parameter of the network). This is certainly due to the buffer management in the communication protocols and the OS: when the number of processors increases, the buffers are filled faster and messages are transmitted immediately.
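With g and l expressed in flops, as in Fig. 2, and r in flops per second, a BSP cost formula is turned into a predicted wall-clock time by a simple division by r. A helper of the following kind is enough for the predictions of the next section (a sketch; the benchmarking program of [3,15] is more elaborate):

  (* predicted time, in seconds, of one super-step with local work w (in flops)
     and an h-relation of h words, on a machine with parameters r, g and l *)
  let superstep_seconds ~r ~g ~l w h = (w +. float_of_int h *. g +. l) /. r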
3 Examples of BSP Problems in BSML
To illustrate our methodology, we present 2 classic problems: the sieve of Eratosthenes and N-body computation. For each problem, we give the parallel methods and BSP cost formulas, as well as comparisons between the theoretical performance (depending on the BSP parameters) and the experimental one. This comparison shows that the BSP cost analysis helps in choosing the best BSML program.
5 http://www.mcs.anl.gov/mpi/mpich1 and http://www.open-mpi.org/
3.1 Sieve of Eratosthenes
The sieve of Eratosthenes generates the list of prime numbers below a given integer n. We study 3 parallelization methods. We generate only the integers that are not multiples of the 4 first prime numbers and we classically iterate up to √n. The probability of a number a being prime is 1/log(a), so we deduce a complexity of (√n × n)/log(n). Fig. 3 gives the BSML code of the 3 methods. We used the following functions: elim: int list → int → int list, which deletes from a list all the integers that are multiples of the given parameter; final_elim: int list → int list → int list, which iterates elim; seq_generate: int → int → int list, which returns the list of integers between 2 bounds; and select: int → int list → int list, which gives the first prime numbers (up to √n) of a list.

Logarithmic reduce method. For our first method we use the classical parallel prefix computation (also called folding reduce):

  scan ⊕ ⟨ v0 , … , vp−1 ⟩ = ⟨ v0 , v0 ⊕ v1 , … , ⊕_{k=0}^{p−1} vk ⟩

We use a divide-and-conquer BSP algorithm (implemented using the super primitive) where the processors are divided into two parts and the scan is recursively applied to those parts; the value held by the last processor of the first part is broadcast to all the processors of the second part, then this value and the values held locally are combined together by the associative operator ⊕ on the second part. In our computation, the sent values are first modified by a given function (select, to send just the first primes up to √n). The parallel method is thus very simple: each processor i holds the integers between i × n/p + 1 and (i + 1) × n/p. Each processor computes a local sieve (processor 0 thus contains the first prime numbers) and then our scan is applied. We then eliminate on processor i the integers that are multiples of integers of processors i − 1, i − 2, etc. We have log(p) super-steps where each processor sends/receives at most 2 values (lists of size at most √n). The BSP cost is accordingly:
  (√m × m) / log(m) + 2 × log(p) × (√n × g + l)    where m = n/p.
Direct method. It is easy to see that our initial distribution (blocks of integers) gives a bad load balancing (processor p − 1 has the biggest integers, which have little probability of being prime). We therefore distribute the integers in a cyclic way: a is given to processor i where a mod p = i. The second method works as follows: each processor computes a local sieve; then the integers that are smaller than √n are globally exchanged; a new sieve is applied to this list of integers (thus giving prime numbers) and each processor eliminates, in its own list, the integers that are multiples of these first primes. The BSP cost is accordingly:

  2 × (√m × m) / log(m) + (√√n × √n) / log(√n) + √n × g + l
let eratosthene_scan n =
  let p = bsp_p () in
  let listes = mkpar (fun pid ->
    if pid = 0 then seq_generate (n/p) 10
    else seq_generate ((pid+1)*(n/p)) (pid*(n/p)+1)) in
  let local_eras = parfun (local_eratosthene n) listes in
  let scan_era = scan_super final_elim (select n) local_eras in
  applyat 0 (fun l -> 2::3::5::7::l) (fun l -> l) scan_era

let eratosthene_direct n =
  let listes = mkpar (fun pid -> local_generation n pid) in
  let etape1 = parfun (local_eratosthene n) listes in
  let selects = parfun (select n) etape1 in
  let echanges = replicate_total_exchange selects in
  let premiers = local_eratosthene n (List.fold_left (List.merge compare) [] echanges) in
  let etape2 = parfun (final_elim premiers) etape1 in
  applyat 0 (fun l -> 2::3::5::7::(premiers@l)) (fun l -> l) etape2

let rec eratosthene n =
  if fin_recursion n then apply (mkpar distribution) (replicate (seq_eratosthene n))
  else
    let carre_n = int_of_float (sqrt (float_of_int n)) in
    let prems_distr = eratosthene carre_n in
    let listes = mkpar (fun pid -> local_generation2 n carre_n pid) in
    let echanges = replicate_total_exchange prems_distr in
    let prems = List.fold_left (List.merge compare) [] echanges in
    parfun (final_elim prems) listes

let eratosthene_rec n =
  applyat 0 (fun l -> 2::3::5::7::l) (fun l -> l) (eratosthene n)

Fig. 3. BSML code of the parallel versions of the sieve of Eratosthenes
Recursive method. Our last method is based on the generation of the first prime numbers up to √n and the elimination of the multiples of this list of integers. We generate this list by an inductive function on n. We suppose that the inductive step gives the first primes up to √n and we perform a total exchange on them to eliminate the non-primes. The end of this induction comes from the BSP cost: we stop when n is small enough so that the sequential method is faster than the parallel one. The inductive BSP cost is accordingly:

  Cost(n) = (√m × m) / log(m) + √n × g + l + Cost(√n)
  Cost(n) = (√n × n) / log(n)                               if BSP cost > complexity
Fig. 4 gives the predicted and measured performances (using the PUB implementation). To simplify our prediction, we suppose that pattern-matching and
[Two plots show the measured acceleration and the BSP-predicted acceleration of the prefix, direct and recursive versions against the number of processors, for N = 10000000, together with the ideal acceleration.]
Fig. 4. Performances (using PUB) of the sieve of Eratosthenes
modulo are constants in time. The size of the lists of integers can be measured using the Marshal module of OCaml. Note that we obtain a super-linear acceleration for the recursive method. This is due to the fact that, using a parallel method, each processor has a smaller list of integers and thus the garbage collector of OCaml is called less often. One can notice that the predicted performances using the BSP cost model are close to the measured ones.
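For instance, the size in bytes of a value such as a list of integers can be obtained along the following lines (Marshal.to_string belongs to the OCaml standard library; using it in this way for the measurements is an assumption of this note):

  let size_in_bytes v = String.length (Marshal.to_string v [])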
3.2 The N-Body Problem
The classic N-body problem is to calculate the gravitational energy of N point masses, which is given by:

  E = − Σ_{i=1}^{N} Σ_{j=1, j≠i}^{N} (m_i × m_j) / ‖r_i − r_j‖
The complexity of this problem is thus of the order of N². To compute this sum, we show two parallel algorithms: one using a total exchange of the point masses and one using a systolic loop6. At the beginning of these two methods, each processor contains a sub-part (as a list) of the N point masses: we thus have a parallel vector of lists of N/p point masses. Fig. 5 gives the BSML code of the 2 algorithms. pair_energy computes the interaction of a list of masses with another one. The sequential method is thus a call of this function with the same list twice. Total exchange method. The method is naive: a total exchange of these lists is done and then each processor computes the interaction of its own list with the other ones; at the end, a parallel fold is applied to sum the partial interactions. The BSP cost is accordingly: N × g + 2 × N + N/p × N + l + p × g + l, that is, two
6 There exist more sophisticated algorithms that take advantage of the symmetry of the sum, but this is not the subject of this article.
super-steps: time of the total exchange and the concatenation of the received lists; time to perform the local interactions and time to finish the fold. Systolic loop. Our second algorithm is based on a systolic loop [13]. In such an algorithm, data is passed around from processor to processor in a sequence of super-steps. We can easily write a generic systolic loop in BSML: (∗ val systolic:(α →α →β ) →(γ →β par →γ ) →α par →γ →γ ∗) let systolic f op vec init = let rec calc n v res = if n=0 then res else let newv=Bsmlcomm.shift right v in calc (n−1) newv (op res (Bsmlbase.parfun2 f vec newv)) in calc (bsp p()) vec init
with shift_right: α par → α par which shifts the values from each processor to its right-hand neighbour (part of the standard BSML library). Initially, each processor receives its share of the N point masses and calculates the interactions among them. Then it sends a copy of its particles to its right-hand neighbour, while at the same time receiving the particles from its left-hand neighbour. It calculates the interactions between its own particles and those that just came in, and then it passes on the particles that came from the left-hand neighbour to the right-hand neighbour. After p − 1 super-steps, all pairs of

  type point = float * float * float
  and atom = point * float

  let minus_point (x1,y1,z1) (x2,y2,z2) = (x1-.x2, y1-.y2, z1-.z2)
  let length_point (x,y,z) = sqrt (x*.x +. y*.y +. z*.z)

  (* val pair_energy : atom list -> atom list -> float *)
  let pair_energy some_bodies other_bodies =
    List.fold_left (fun energy -> function (r1,m1) ->
        energy +. (List.fold_left (fun energy -> function (r2,m2) ->
            let r = length_point (minus_point r2 r1) in
            if r > 0. then energy +. (m1*.m2)/.r else energy)
          0. other_bodies))
      0. some_bodies

  (* Total exchange method *)
  let final_ex = parfun2 pair_energy my_bodies
                   (parfun List.concat (total_exchange my_bodies)) in
  let res_final = fold_direct (+.) 0. final_ex in ...

  (* Systolic method *)
  let energy = parfun2 pair_energy my_bodies my_bodies in
  let final_sys = systolic pair_energy (parfun2 (+.)) my_bodies energy in
  let res_final = fold_direct (+.) 0. final_sys in ...

Fig. 5. BSML code of the parallel versions of the N-body problem
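As an aside, the shift_right function used by the systolic loop can itself be programmed with put in a single super-step, for example as follows (a sketch; the version shipped with the standard BSML library may differ):

  let shift_right v =
    let p = bsp_p () in
    let msgs =                                   (* send the local value only to the right-hand neighbour *)
      apply (mkpar (fun src x dst -> if dst = (src + 1) mod p then Some x else None)) v in
    let recvd = put msgs in
    apply (mkpar (fun me from ->                 (* keep the value received from the left-hand neighbour *)
      match from ((me - 1 + p) mod p) with Some x -> x | None -> assert false)) recvd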
[Two plots show the measured acceleration and the BSP-predicted acceleration of the systolic and total-exchange methods against the number of processors, for N = 50000, together with the ideal acceleration.]
Fig. 6. Performance (using MPICH) of the N-body problem
particles have been treated and a folding of these values can be done to finish the computation. The BSP cost is accordingly:

  p × (N/p × g + l + 2 × N/p + N/p × N/p) + p × g + l  ≡  N × g + p × l + 2 × N + N/p × N + p × g + l
that is, the same as before but with more synchronization time. Fig. 6 gives the predicted and measured performance (using MPICH). The sizes of the lists of particles are measured as before. One can notice that the performance scales well. The naive method has better theoretical and practical performance than the systolic one. The asset of the systolic method appears when the number of particles is so big that the lists do not fit in the main memory of a node of the parallel machine: performance then degenerates due to the paging mechanism used to get enough virtual memory. This is a limitation of the BSP model that could be solved using a more sophisticated model for out-of-core applications [9]. One can also notice that for our two examples (sieve of Eratosthenes and N-body), the measured performances are sometimes better than the predicted ones. This is due to the fact that, in some cases, communications can perform better than predicted (g and l are averages of network parameters).
4 Conclusion
BSML is a language for programming BSP algorithms. We have attempted to show that it is possible to predict the performance of BSP algorithms from the parameters of a given machine and thus to choose the most efficient and scalable BSML program. We have illustrated this with two classical problems. Our work illustrates the value of a high-level parallel paradigm which gives more compact and therefore more readable code without sacrificing too much performance. Even if our methodology might seem lengthy, we believe it is necessary for the future of parallel programming, especially as multi-core machines become the norm. Programming them (as well as clusters) in a safe, expressive, predictable and efficient manner will surely become one of the keys to software design.
Future work will naturally include a comparison with other parallel languages and libraries such as OCamlP3L, C+BSPlib, C+MPI, Eden or Gph [10] (and with bigger programs and other kinds of architectures, such as multi-core ones) in order to validate our approach. Finally, manual cost analysis for functional programs has its limits: it is necessary to estimate (sometimes by testing) the number of flops needed to perform a pattern-matching, build a tuple, etc. We could use [16] in order to estimate them automatically.

Acknowledgements. Thanks to Louis Gesbert for his spell checking.
References 1. Akerholt, G., Hammond, K., Peyton-Jones, S., Trinder, P.: Processing transactions on GRIP, a parallel graph reducer. In: Reeve, M., Bode, A., Wolf, G. (eds.) PARLE 1993. LNCS, vol. 694. Springer, Heidelberg (1993) 2. Benoit, A., Robert, Y.: Mapping pipeline skeletons onto heterogeneous platforms. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 591–598. Springer, Heidelberg (2007) 3. Bisseling, R.H.: Parallel Scientific Computation. A structured approach using BSP and MPI. Oxford University Press, Oxford (2004) 4. Bonorden, O., Juurlink, B., Von Otte, I., Rieping, O.: The Paderborn University BSP (PUB) library. Parallel Computing 29(2), 187–207 (2003) 5. Chailloux, E., Foisy, C.: A Portable Implementation for Objective Caml Flight. Parallel Processing Letters 13(3), 425–436 (2003) 6. Conchon, S., Le Fessant, F.: Jocaml: Mobile agents for Objective-Caml. In: ASA 1999, pp. 22–29. IEEE Press, Los Alamitos (1999) 7. Di Cosmo, R., Li, Z., Pelagatti, S., Weis, P.: Skeletal Parallel Programming with OcamlP3L 2.0. Parallel Processing Letters (2008) 8. Di Cosmo, R., Pelagatti, S., Li, Z.: A calculus for parallel computations over multidimensional dense arrays. Computer Language Structures and Systems (2005) 9. Gava, F.: External Memory in Bulk Synchronous Parallel ML. Scalable Computing: Practice and Experience 6(4), 43–70 (2005) 10. Hammond, K., Trinder, P.: Comparing parallel functional languages: Programming and performance. Higher-order and Symbolic Computation 15(3) (2003) 11. Hayashi, Y., Cole, M.: Bsp-based cost analysis of skeletal programs. In: Michaelson, G., Trinder, P., Loidl, H.-W. (eds.) Trends in Functional Programming, ch. 2, pp. 20–28 (2000) 12. Hill, J.M.D., McColl, W.F.: BSPlib: The BSP Programming Library. Parallel Computing 24, 1947–1980 (1998) 13. Hinsen, K.: Parallel scripting with Python. Computing in Science & Engineering 9(6) (2007) 14. Krusche, P.: Experimental Evaluation of BSP Programming Libraries. Parallel Processing Letters (to appear, 2008) 15. Loulergue, F., Gava, F., Billiet, D.: Bulk Synchronous Parallel ML: Modular Implementation and Performance Prediction. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3515, pp. 1046–1054. Springer, Heidelberg (2005)
16. Scaife, N., Michaelson, G., Horiguchi, S.: Empirical Parallel Performance Prediction From Semantics-Based Profiling. Scalable Computing: Practice and Experience 7(3) (2006) 17. Skillicorn, D.B., Hill, J.M.D., McColl, W.F.: Questions and Answers about BSP. Scientific Programming 6(3), 249–274 (1997) 18. Verlaguet, J., Chailloux, E.: HirondML: Fair Threads Migrations for Objective Caml. Parallel Processing Letters (to appear, 2008)
On the Modeling Timing Behavior of the System with UML(VR)

Leszek Kotulski1 and Dariusz Dymek2

1 Department of Automatics, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Krakow, Poland
2 Department of Computer Science, Cracow University of Economics, 31-510 Krakow, Poland
[email protected], [email protected]
Abstract. UML notation is assumed to be independent from any software modeling methodology. The existing methodologies support the creation of the final system model, but they do not care about the formal documentation of the reasoning process; the associations between the elements belonging to different types of UML diagrams are either remembered as informal documentation outside the UML model or are forgotten. The Vertical Relations described in this paper try to fill this gap and allow us to look at the use of timing diagrams from a new, more comprehensive perspective. The usefulness of Vertical Relations in evaluating the timing properties of Data Warehouse reporting systems is presented.
1 Introduction

Unified Modeling Language (UML) [1] is an open standard controlled by the Object Management Group (OMG). UML is a family of graphical notations backed by a single meta-model. It can be used for describing and designing software systems, in particular those using the object-oriented paradigm. In the current version of the UML standard (ver. 2.0) there are 13 types of diagrams with precisely defined semantics [2]. The variety of diagram types allows us to describe different aspects of a designed system, in particular:
− the use case diagram shows interactions of users or other software systems,
− the software structure is dealt with by the class diagram, the configuration of instances of classes is shown on the object diagram, the package diagram represents the compile-time structure of classes, and the composite structure diagram deals with the runtime decomposition of a class (such as a federate),
− the activity diagram conveys the procedural and parallel behavior of classes, and the state machine diagram shows how events change the interior states of an object,
− to align interactions between objects the sequence diagram is used, the communication diagram is used for emphasizing the links used by interactions, and the timing diagram is used to cope with the timing aspects of interactions,
− the deployment diagram shows the structure of the running system.
In general, the UML standard allows us to cope with a dynamical description of the system beyond the semantic level. UML enables us to describe how a system and its components interact externally as well as internally. UML as a tool became the basis for software development methodologies like RUP (Rational Unified Process) [3] or ICONIX [4]. UML builds on such fundamental concepts as the object-oriented paradigm and distributed and parallel programming, but is independent from those methodologies. This fact gives UML some advantages; in particular, it can be treated as a universal tool for many purposes. In general, software development methodologies based on UML are sequences of informal recommendations on how to design a software system, step by step, using different kinds of UML diagrams. The final result is expressed in UML and the whole designing process is documented only informally. The possibility of creating various software development methodologies based on UML follows from the fact that, inside UML, formal dependencies among diagrams of different kinds are not defined. This leaves room for various methods of reasoning in software development methodologies. In this paper we show that the capability of establishing formal relations among different kinds of UML diagrams gives some new advantages. The way of introducing such relations, presented in Section 2, is independent from any methodology and does not affect the UML structure and properties. We named this type of relation the Vertical Relation, to distinguish it from relations between elements of a single kind of diagram, which we called Horizontal Relations. The proposed approach is an extension of UML and we named it UML(VR) to emphasize the existence of the additional relations. The consistency of the relations among different kinds of UML diagrams is maintained on the basis of graph theory. The introduction of UML(VR) allows us to suggest the application of timing diagrams to describe the timing behavior of the Actors appearing in the use case diagrams. In Section 3 an example of such a solution in the case of a Reporting System based on the Data Warehouse concept is presented. Moreover, having defined the Vertical Relation we are able to use timing diagrams associated with the use case diagrams to generate the timing diagrams associated with elements of the class, the object and the deployment diagrams (see Section 4), which can be useful in refactoring decisions.
2 UML(VR) Concept

UML itself defines relations between elements of a given kind of diagram or among diagrams of the same class. Generally, UML does not formally define relations between various kinds of diagrams. Version 2.0 introduces <<
The problem of considering both horizontal and vertical consistency of a UML model was pointed out a few years ago [5], but in practice those investigations have concentrated on horizontal consistency. Fortunately, UML diagrams can be expressed as EDG graphs using the XMI standard [7]. During the process of designing a software system we can translate each UML diagram into the form of a graph and create a Graph Repository, which gathers the information from every phase of the designing process. This gives us the possibility of taking advantage of graph grammars to trace the software system designing process, treating this process as a sequence of graph transformations. We are able to participate in the designing process and simultaneously modify the Graph Repository. In [8] it was proved that, with the help of the aedNLC graph transformation system [9], we can control the generation of such a Graph Repository with O(n²) computational complexity. This solution enables us to establish a formal linkage between elements from different kinds of UML diagrams as a Vertical Relation. To illustrate the capability of Vertical Relations we present below one of their exemplifications, called the Accomplish Relation (AR) [10], [11]. In the Graph Repository we can distinguish various layers (relevant to UML diagrams): the use case layer (UL), the sequence layer (SL), the class layer (CL) (divided into the class body layer (CBL) and the class method layer (CML)), the object layer1 (OL) (divided into the object body layer (OBL) and the object method layer (OML)), the timing layer (TML) and the hardware layer (HL). For any G representing a subgraph of the graph repository R, the notation G|XL means the graph with the nodes belonging to the XL layer (where XL stands for any UML type of diagram) and the edges induced from the connections inside R. For example, R|UL∪OL means the graph with all the nodes (n_set(R|UL∪OL)) representing user requirements and all the objects servicing these requirements, with the edges (e_set(R|UL∪OL)) representing both horizontal and vertical relations inside the graph repository. Now we can present the definition of the Accomplish Relation function:

  AR: (Node, Layer) → AR(Node, Layer) ⊂ n_set(R|Layer)

is the function where:
  Node ∈ n_set(R|XL), XL ∈ {UL, CBL, CML, OBL, OML, HL},
  Layer ∈ {UL, CBL, CML, OBL, SL, OML, TML, HL}, Layer ≠ XL,
and AR(Node, Layer) is the subset of nodes from n_set(R|Layer) which stay in a relationship of the type "support service" or "is used to" with the given Node, based on the role performed in the system structure. For better understanding, let us consider an example:
− for any user requirement r ∈ n_set(R|UL), AR(r,OBL) returns the set of objects which support this requirement's service,
− for any object o ∈ n_set(R|OBL), AR(o,UL) returns the set of requirements that are supported by any of its methods,
− for any object o ∈ n_set(R|OBL), AR(o,HL) returns a set consisting of the computing (hardware) node on which the given object is allocated,
− for any x ∈ n_set(R|UL∪CBL∪OBL∪SL∪HL), AR(x,TML) returns a set consisting of the timing diagram describing the timing properties of its behavior,
1 Packages introduce some sub-layers structure inside this layer.
− for any class c ∈ n_set(R|CBL), AR(c,UL) returns the set of requirements that are supported by any of its methods.
The above relations are maintained by the repository graph structure, so there are no complexity problems with their evaluation. Moreover, the graph repository is able to trace any software or requirement modification, so these relations change dynamically during the system's lifetime. In the next section the way of using the AR function in practice is presented.
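To fix the intuition, the repository and the AR function can be pictured as a simple layer-labelled graph; the following sketch is only an illustration of the concept (the names and the representation are assumptions of this sketch, not the authors' graph-grammar-based implementation):

  type layer = UL | SL | CBL | CML | OBL | OML | TML | HL
  type node = { id : string; layer : layer }
  type repository = { nodes : node list; edges : (node * node) list }

  (* ar repo n l: all nodes of layer l linked with n in the repository *)
  let ar repo n l =
    List.filter_map
      (fun (a, b) ->
         if a.id = n.id && b.layer = l then Some b
         else if b.id = n.id && a.layer = l then Some a
         else None)
      repo.edges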
3 Association Timing Diagrams for Use Case Actors

One of the most interesting types of diagrams introduced in the UML 2.0 standard are the timing diagrams. They are used to show interactions when the primary purpose of the diagram is to reason about time. Their properties have been exploited in many areas; one of the most interesting is the solution of the protocol compliance verification problem presented by Bunker [11]. Timing diagrams focus on conditions changing within and among lifelines along a linear time axis. They describe the behavior of both individual classifiers and interactions of classifiers, focusing attention on the time of occurrence of events causing changes in the modeled conditions of the lifelines [2]. The classifier is defined in the UML 2.0 standard as: "A collection of instances that have something in common. A classifier has the features that characterize its instances". Classifiers include interfaces, classes, data types, and components [2]. The introduction of timing diagrams is illustrated in the OMG documents only by their application to the sequence diagram. Remembering (as a kind of Vertical Relation) the influence of the existence of elements of one type of diagram (e.g. the use case diagram) on the creation of elements of another type of diagram (e.g. sequence or class diagrams) during the modeling process creates a new possibility of using timing diagrams. We suggest using the timing diagrams to describe how the Actors' activity will change during the system's operation. Let us notice that a system overload occurs if at least one of the following events happens:
− two or more processes that consume most of the computing system's resources start at the same time,
− some process is activated by a large population of users.
Information about the possible schedule of the Actors (defined at the use case level) can be remembered by associating a timing diagram with each of them. However, this information will be useful only if we are able to translate it into timing diagrams describing the structure of the other elements of the software system (i.e. class, object and deployment diagrams). The VR creates such a possibility. In [13] we consider a typical reporting system. Every business organization during its activity generates many individual reports. Some of them are created for managers and executives for internal use only; others are created for external organizations which are entitled to monitor the state and activity of the given organization. For example, in Poland commercial banks have to generate obligatory reports, inter alia, for the National Bank of Poland (WEBIS reports), the Ministry of Finance (MF reports) and the
Warsaw Stock Exchange (SAB reports)2. Altogether, these external reports comprise a few hundred individual sheets with thousands of individual data items. In general, these reports are based on almost the same kind of source data, but the external requirements on format and contents mean that different software tools (based on different algorithms) are needed. These reports have a periodical character: depending on the demands, a given report must be drawn up every day, week, decade, month, quarter, half year or year, based on data from the end of the corresponding day. Let us assume that we have a Reporting Data Mart system based on a Data Warehouse. To simplify the example we skip the organization of the Extraction, Transformation and Loading (ETL) processes and assume that all necessary information is maintained by the Data Warehouse Repository. It is easy to realize that for different Data Marts the set of used DW processes can be different. Analyzing the information content of the reports we can divide them into a few categories, based on the kind of source data and the way of processing it. Each of those categories, regardless of its periodical character, is generated by different processes. Their results are integrated at the level of the user interface depending on the period and the external organization. The schema of the data flow for the Reporting Data Mart is presented in Fig. 1.
Fig. 1. Schema of Reporting Data Mart
Each User Application represents functionality associated with a single period and a single type of obligatory report. Because of that, we can treat these applications as user requirements defining the Data Mart functionality. As we mentioned above, reports have a periodical character. This means that the processes associated with these report categories also have a periodical character: they are executed only in a given period of time. This period is strictly connected with the organizational process of drawing up the given type of report. Let us notice that the obligatory reports for the National Bank of Poland must fulfill many control rules before they can be sent out. In practice, this means that those reports are not generated in a single execution of the proper software process. Instead, we have an organizational process which can go on even for a few days, during which the software process is executed many times, after each data correction. Because of that, if we
2 The structure and information contents of those reports are based on international standards, so the same situation can be found in other countries.
analyze the time of availability of the system functionality connected with those reports, we must take into account a longer period of readiness of the hardware environment than in the case of a single process execution. For the purpose of this example we take a simple Reporting Data Mart with functionality restricted to only three report categories: weekly, decadal and monthly. We assume that the processes associated with the weekly, decadal and monthly report generation are started, respectively, 2, 3 or 4 days before the report delivery time. The second type of reports are ad hoc reports generated by consultants, or the verification of hypotheses prepared by them. The mentioned activities are represented in the use case diagram presented in Fig. 2.
Fig. 2. Schema for Reports Generation activities
To estimate the system workload we first have to estimate the external usage of each system function. Each Actor artifact represents a group of users with a similar kind of behavior. Only the Consultant Actor uses more than one system function (ad hoc report generation and hypothesis verification). Thus we have to create five timing diagrams to express the users' behavior timing characteristics [13]. The first three timing diagrams, representing the activity of the weekly, decadal and monthly report generation processes, are presented in Fig. 3, where the number of active processes of a given type is either 0 or 1. An example of the Consultant population activity with respect to ad hoc report generation and hypothesis verification is presented in Fig. 4 (the Y axis is scaled 1:10 in comparison to Fig. 3), based on the assumption that the ad hoc reports are generated during worktime and the hypothesis verification is made in the background. If we are able to estimate the overload made by a single Actor request (the way of such an estimation will be considered in the next section), we can evaluate the total system workload. For the purpose of this example the assumed estimation is presented in Table 1.

Table 1. Process overloading
  weekly report            30%
  decadal report           30%
  monthly report           30%
  ad hoc report            1.5%
  hypothesis verification  4.5%
Fig. 3. Timing diagrams for periodical reports generation
Fig. 4. Timing diagrams for Consultant activities
Fig. 5. System workload before (a) and after (b) evaluation
Thus (after a simple calculation) we can generate a timing diagram representing the final system overloading, as presented in Fig. 5a. We can observe that the user demand exceeds the computing power of the system on the 9th and 30th day and from the 57th to the 60th day of the system observation. Fortunately, data for
the monthly and decadal report generation are usually prepared by the ETL process a few days earlier, so we can start the decadal report evaluation on the 7th and 29th day and the monthly reports on the 25th and 54th days. Fig. 5b represents the overloading evaluation in such a case. The improvement of the effectiveness of these processes achieved by the distribution of some processes will be considered in the next section.
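The 'simple calculation' mentioned above can be sketched as follows, with each Actor's timing diagram modelled as a function from the day number to the number of active requests of that kind, and the per-request overloads of Table 1 as weights (an illustration only, not the authors' tooling):

  (* workload: percentage of the computing power demanded on a given day *)
  let workload diagrams weights day =
    List.fold_left2 (fun acc d w -> acc +. float_of_int (d day) *. w) 0. diagrams weights
  (* e.g. workload [weekly; decadal; monthly; ad_hoc; hypothesis]
                   [30.; 30.; 30.; 1.5; 4.5] 9 *)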
4 System Workload Estimation

The solution presented in the previous section is based on the assumption that we are able to estimate the workload of the computing system caused by an Actor request. In a system such as a Data Warehouse, where evaluations of the same requests are repeated, such an estimation can be made by observing the real system. However, it seems desirable to consider the influence of the information gathered in the timing diagrams (describing the Actors' timing behavior) on the final model of the developed software system. In all methodologies using UML the use case diagrams (and class diagrams, for the illustration of the Domain Model) are the first diagrams generated during system modeling. Here we assume that the timing diagrams associated with the Actors' activities are generated at the use case level to express the time relations among the elements of the system structure associated with the periodical character of the system functions. The vertical relation AR introduced in Section 2 allows us to designate, for each Actor's request r:
− the set of classes modeling the algorithms used during its service (AR(r,CBL)),
− the set of objects that are responsible for servicing the request r (AR(r,OBL)),
− the deployment of the objects mentioned in the previous point (AR(o,DL)).
Thus we are able to estimate the workload of the software and hardware components in the following way. Let, for each r ∈ n_set(R|UL), TM(r,t) represent the timing diagram associated with r (more formally, TM(r,t) = AR(r,TML)(t)). Having defined TM for requirements we can calculate it for methods, classes, objects and hardware nodes:

  for any m ∈ n_set(R|CML):   TM(m,t) = ⋃_{r ∈ AR(m,UL)} TM(r,t)
  for any c ∈ n_set(R|CBL):   TM(c,t) = ⋃_{r ∈ AR(c,UL)} TM(r,t)
  for any o ∈ n_set(R|OBL):   TM(o,t) = ⋃_{r ∈ AR(o,UL)} TM(r,t)
  for any h ∈ n_set(R|HL):    TM(h,t) = ⋃_{o ∈ AR(h,OBL)} TM(o,t)
where ∪ means the logical sum. The timing diagrams generated for methods and classes help us to better understand the modeled system structure and can be very useful in finding the system elements that should be refactored [14]. The timing diagram generated for the Hardware Layer gives us information about the time of activity of the hardware nodes, triggered by the execution of the processes corresponding to the objects allocated on them.
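To make the composition rule concrete, a timing diagram can be modelled as a boolean function of time and the union as a pointwise logical sum (an illustration only; in the approach described here these diagrams are kept as graph structures in the repository):

  type tm = float -> bool                                   (* is the element active at time t? *)
  let tm_union ds : tm = fun t -> List.exists (fun d -> d t) ds
  (* e.g. the diagram of an object o is tm_union applied to the diagrams
     of the requirements returned by AR(o, UL) *)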
Let us notice that the timing diagrams generated for the objects can be used to estimate the level of utilization of the hardware equipment. Let us assume that:
− we are able to estimate the (average, periodical) performance of the object components (described as per(o)); this estimation should be associated with the computational complexity of the algorithms used inside the object,
− we know the computing power of the hardware nodes (described as cp(h)).
Then the function

  EF(h,t) = ( Σ_{o ∈ AR(h,OBL)} TRA(o,t) ∗ per(o) ) / cp(h)
shows us the efficiency of the hardware node utilization in time. It can be used to indicate the periods of time in which the hardware equipment is almost unused or is very close to overloading. A brief analysis of the presented function shows that we have three ways of influencing its value: (1) we can reschedule the user requirements by changing business processes, (2) we can decrease the performance demanded by the objects' processes by rewriting software modules, or (3) we can increase the hardware computing power.
5 Conclusions

The recent release of UML 2.0 has corrected a lot of design difficulties encountered in the 1.x revisions. One of the newly introduced capabilities is the possibility of characterizing the timing behavior of some components of the modeled system (with the help of timing diagrams). Unfortunately, Engels' observation that a general consistency of the UML model is still missing [7] remains valid. In this paper the idea of formally remembering (as a kind of vertical relation) the associations between elements belonging to different kinds of UML diagrams was presented. Those associations appear during the reasoning process while modeling the system. This formal approach has a specific context: the mentioned associations are remembered as graph structures (equivalent to the UML Interchange standard [15]), so their maintenance and/or evaluation is possible with the help of graph transformations. Based on this idea, an application of timing diagrams as a tool for the description of the Actors' timing behavior was shown. The capability of automatically generating the timing diagrams associated with objects and classes points out the parts of the system that should be considered for possible refactoring. This is all the more important since refactoring techniques in general are based on the system developer's intuition (who discovers "bad smells" parts of a program [14]). The presented UML(VR) concept seems to be a very promising approach. It can be used for different purposes in the development of a software system. The use of the
AR function, which is an exemplification of the Vertical Relation, has also been studied by the authors in such areas as test data generation [18].
References 1. Rumbaugh, J., Jacobson, I., Booch, G.: The Unified Modeling Language Reference Manual. Addison-Wesley Longman Ltd, Amsterdam (1999) 2. Unified Modeling Language, OMG v 2.1.2., http://www.omg.org 3. IBM Rational Unified Process, http://www-306.ibm.com/software/rational/ 4. Rozenberg, D., Scott, K.: Applying Use Case Driven Object Modeling with UML: An Annotated e-Commerce Example. Addison-Wesley, Reading (2001) 5. Kuźniarz, L., Reggio, G., Sourrooille, J., Huzar, Z.: Workshop on Consistency in UML-based Software Development, http://www.ipd.bth.se/uml2002/RR-2002-06.pdf 6. Sourrouille, J., Caplat, G.: A Pragmatic View about Consistency of UML Models. In: Workshop on Consistency Problems in UML-based Software Development II, San Francisco (2003) 7. Engels, G., Groenewegen, L.: Object-Oriented modeling: A road map. In: Finkelstein, A. (ed.) Future of Software Engineering 2000, pp. 105–116. ACM Press, New York (2000) 8. Kotulski, L.: Nested Software Structure Maintained by aedNLC graph grammar. In: Proceedings of the 24th IASTED International Multi-Conference Software Engineering, Innsbruck, Austria, pp. 335–339 (2006) 9. Kotulski L.: Model wspomagania generacji oprogramowania w środowisku rozproszonym za pomocą gramatyk grafowych. Wydawnictwo Uniwersytetu Jagiellońskiego, Kraków (2000) ISBN 83-233-1391-1 10. Dymek, D., Kotulski, L.: On the hierarchical composition of the risk management evaluation in computer information systems. In: The Second International Conference DepCoS RELCOMEX 2007, Szklarska Poreba, Poland, pp. 35–42 (2007) 11. Dymek, D., Kotulski, L.: Evaluation of Risk Attributes Driven by Periodically Changing System Functionality. Transaction on Engineering, Computing and Technology 16, 315– 320 (2006) 12. Bunker, A., Gopalakrishnan, G., Mckee, S.A.: Formal hardware specification languages for protocol compliance verification. ACM Transactions on Design Automation of Electronic Systems 9(1), 1–32 (2004) 13. Kotulski, L., Dymek, D.: On the load balancing of Business Intelligence Reporting Systems. In: Proceedings of the AIS SIGSAND European Symposium on Systems Analysis and Design, University of Gdansk, Poland, pp. 121–125 (2007) 14. Flower., M., Beck, K., Brant, J., Opdyke, W.: Refactoring: Improving the Design of Existing Code. Addison-Wesley Longman Publishing Co. Inc., Amsterdam (2000) 15. UML Diagram Interchange, OMG, version 1.0, http://www.omg.org/technology/documents/modeling_spec_catalog 16. Engels, G., Küster, J.M., Heckel, R., Groenewegen, L.: A methodology for specifying and analyzing consistency of object-oriented behavioral models. In: The 8th European Software Engineering Conference held jointly with ESEC/FSE-9, pp. 186–195. ACM, New York (2001) 17. Snook, C., Butler, M.: UML-B: Formal modeling and design aided by UML. ACM Transaction on Software Engineering Methodology 15(1), 92–122 (2006) 18. Dymek, D., Kotulski, L.: Using UML(VR) for supporting the automated test data generation. In: The Third International Conference DepCoS - RELCOMEX 2008, Szklarska Poreba, Poland (2008)
Reducing False Alarm Rate in Anomaly Detection with Layered Filtering

Rafal Pokrywka

Institute of Computer Science, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland
[email protected]
IBM SWG Laboratory, ul. Armii Krajowej 18, 30-150 Kraków, Poland
Abstract. There is a general class of methods for detecting anomalies in a computer system which are based on heuristics or artificial intelligence techniques. These methods are meant to distinguish between normal and anomalous system behaviour. Their main weakness is the false alarm rate, which is usually measured by counting false positives on a sample set representing normal behaviour. In this measurement the base rate of anomalous behaviour in a live environment is not taken into account, and that leads to the base-rate fallacy. This problem can greatly affect the real number of false alarms, which can be significantly greater than the expected value. Usually little can be done to further improve classification algorithms. In this paper a different approach to reducing the real false alarm rate, based on layered filtering, is presented and discussed. The solution explores the potential in a properly structured system of several anomaly detectors.
1 Introduction
The work presented here is part of research on an Intrusion Detection System (IDS) based on anomaly detection. This kind of security system is usually placed in key nodes of the network infrastructure and is often used to complement signature-based detection systems. The system detects anomalies by comparing normal behaviour, stored in some way as a profile, with the current behaviour of supervised processes. This approach allows an attack to be detected without a priori knowledge of the attack's technique, in contrast to signature-based detectors, which are basically blind to novel attack patterns. Anomaly detectors sometimes fail to correctly classify the current event. Of all the misclassification types, the false-positive error, also referred to as a false alarm, which occurs when normal behaviour is marked as anomalous, is the most significant. This article focuses on a very important aspect of my research – proper handling of this kind of error. The goal is to have an effective system yet with very
This paper is NOT related to any of my job responsibilities as an employee of IBM.
low real false alarm rate – even close to 0. The problem is that a real-life system usually works fine most of the time and abnormal events happen only occasionally – simply, the frequency of anomalies is fairly low. The false alarm rate of anomaly detectors is usually measured by taking the ratio between the number of alarms fired on a sample set of normal events and the size of this set – the base rate is not taken into account in this measurement. The result is that the real false alarm rate achieved during monitoring of real systems can be significantly larger than expected – even large enough to make the anomaly detector basically useless. This phenomenon is known as the base-rate fallacy and stems directly from Bayes' theorem. It also frequently escapes the attention of researchers. It is a hard task to further improve current anomaly detection algorithms, which have already achieved quite low false alarm rates, and even a large improvement may not be enough. A real improvement can be achieved by combining several anomaly detectors with different properties and performance into layers which gradually filter out abnormal events. In this article the term "event" is used to describe the smallest set of information from the system that can be classified. The term "behaviour" relates to a sequence of events.
2 Anomaly Detectors Overview
Anomaly detection algorithms are used to decide whether a current event is normal or not. To simplify the discussion, it can be assumed at this point that information about the type of anomaly, such as an error condition or a buffer overflow, can be neglected. In this case the detector can be seen as a binary classifier, as there are only two possible classes of events – normal event or anomaly. There are four possible outcomes from a detector with respect to the actual class of an event:
– true positive (TP) – when the current event is anomalous and the detector prediction is correct (signalisation of anomaly)
– false positive (FP) – when the current event is normal but the detector prediction is incorrect (signalisation of anomaly)
– true negative (TN) – when the current event is normal and the detector prediction is correct (no signalisation of anomaly)
– false negative (FN) – when the current event is anomalous but the detector prediction is incorrect (no signalisation of anomaly)
The false positive is also known in statistics as a type I error and this is the situation when a false alarm is fired. The false negative is known as a type II error and it describes the situation when a real anomaly is missed by the detector. A true positive is when a detector correctly signals an anomaly and a true negative when it correctly indicates a normal event. In scientific publications anomaly detectors are usually characterized by the following operational characteristics:
– detection rate (DR) – which is exactly the same as the true positive rate and can be calculated using the following equation: DR = TP / (TP + FN)
– false alarm rate (FAR) – which is the same as the false positive rate and is calculated using: FAR = FP / (FP + TN)
A good graphical presentation of these two characteristics (DR and FAR) is provided by the ROC curve. It allows one to conveniently compare different detection algorithms or choose the best set of algorithm parameters. A great survey of a couple of the most popular detection techniques can be found in [3].
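As a small illustration (not from the paper; the confusion-matrix counts below are made up), both rates can be computed directly from the four outcome counts:

```python
def rates(tp, fp, tn, fn):
    """Detection rate (true positive rate) and false alarm rate
    (false positive rate) from confusion-matrix counts."""
    dr = tp / (tp + fn)
    far = fp / (fp + tn)
    return dr, far

# hypothetical counts for one day of monitoring
print(rates(tp=19, fp=10, tn=999990, fn=1))  # -> (0.95, 1e-05)
```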
3 The Base-Rate Fallacy
The difficulty in improving the effectiveness of anomaly detectors due to the base-rate fallacy phenomenon was first pointed out by Stefan Axelsson in [2]. The fallacy stems directly from Bayes' theorem, which relates the prior and posterior probability of an event and is given by the following formula:

P(A|B) = P(A) ∗ P(B|A) / P(B) .     (1)

P(B) can be expressed, using the law of total probability for n mutually exclusive outcomes of A, in the following way:

P(B) = Σ_{i=1}^{n} P(A_i) ∗ P(B|A_i) .     (2)

Finally, after combining (1) and (2), the most popular form of Bayes' theorem can be derived:

P(A|B) = P(A) ∗ P(B|A) / Σ_{i=1}^{n} P(A_i) ∗ P(B|A_i) .     (3)

Following Axelsson, let us assume that I means an anomalous event in the system, ¬I that it is a normal event (no anomaly), A that there is an alarm signalisation from a detector and ¬A that there is no alarm fired. The false alarm rate can be expressed by the probability P(A|¬I) and the detection rate by P(A|I). The true negative and false negative rates can be obtained, respectively, in the following way: P(¬A|¬I) = 1 − P(A|¬I) and P(¬A|I) = 1 − P(A|I). Equation (3) can now be rewritten as:

P(I|A) = P(I) ∗ P(A|I) / (P(I) ∗ P(A|I) + P(¬I) ∗ P(A|¬I)) .     (4)
The goal in anomaly detection is to maximise P(I|A), which is called by Axelsson the Bayesian Detection Rate (BDR), and P(¬I|¬A), which is the probability that the lack of an alarm really means the lack of an anomaly and in this paper will be called the Bayesian True Negative Rate (BTNR). In a real-life environment the frequency of anomalies is fairly low. Based on [2] the assumption has been made that
the average is 2 ∗ 10^1 (i.e. 20) anomalous events per day and 10^6 events overall per day. This allows the following probabilities to be calculated: P(I) = 2 ∗ 10^-5 and P(¬I) = 1 − P(I) = 0.99998. Another assumption has been made about the characteristics of a hypothetical anomaly detector: DR = P(A|I) = 1 and FAR = P(A|¬I) = 10^-5. In fact such values would be a really great achievement – they are simply not realistic and serve only to show how significant the base-rate fallacy is. Taking all these values, the calculated BDR is 0.66667. It means that the probability that a fired alarm is not a false alarm is only 0.66667. In practice this value cannot be tolerated – it makes the anomaly detector useless for an administrator. Under the same base rate assumptions the probability BTNR = P(¬I|¬A) = P(¬I) ∗ P(¬A|¬I) / (P(¬I) ∗ P(¬A|¬I) + P(I) ∗ P(¬A|I)) is dominated by the base rate of normal events and is always close to 1, which means that an anomaly rarely escapes detection. Axelsson argues that it is crucial to keep the false alarm rate as low as possible even if the algorithm complexity and resource consumption are very high. This of course makes the detection very slow and there is a risk that an intrusion is not detected on time. The potential damage and losses may have already been done. Also, further improvement of the FAR of current detection algorithms may not be a feasible task – there is too much effort for little gain. Layered filtering may be an answer to these difficulties.
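As a quick numerical check of these figures — my own sketch, not part of the original paper — the following snippet evaluates (4) with the assumed rates and reproduces BDR ≈ 0.66667:

```python
P_I = 2e-5          # base rate of anomalous events: 20 out of 10**6 per day
P_notI = 1 - P_I    # 0.99998
DR = 1.0            # assumed detection rate P(A|I)
FAR = 1e-5          # assumed false alarm rate P(A|not I)

BDR = P_I * DR / (P_I * DR + P_notI * FAR)                         # eq. (4)
BTNR = P_notI * (1 - FAR) / (P_notI * (1 - FAR) + P_I * (1 - DR))
print(round(BDR, 5), round(BTNR, 5))  # 0.66667 1.0
```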
4 Layered Filtering
Layered filtering is well known in air or water pollution elimination. It consists of at least two filters in a sequence, and each filter is responsible for eliminating a different pollution type. The air, for example, flows through all filters and every filter is responsible for eliminating a different chemical pollutant or dust. As a result, clean air is supposed to be obtained. Returning to the computer science field, there is a method of combining binary classifiers to get a multiple classifier. Such a classifier uses more than one specialised binary classifier for each class and combines their outcomes – see for example [4]. The idea behind layered filtering takes something from both analogies: the computer science one and the non-computer science one. It gradually filters out normal system events as air filters do with pollution. It is also similar to a multiple classifier, but with the exception that there are still only two classes of events and the specialisation concerns only the method of how a check is performed. In the end, layered filtering follows the rule of thumb that the whole exceeds the sum of its parts. Let us consider a sequence of layered anomaly detectors Ld_1, ..., Ld_n where n is the number of layers. Each detector has one input stream and two output streams: an a-stream for anomalous and an n-stream for normal events. A detector Ld_i passes to Ld_{i+1} only those events which are classified as anomalous. Among them there could be a lot of false positives, but this is not important at this point. A detector can perform any processing or transformation of events, under the condition that the a-stream of Ld_i is compatible with the input stream of Ld_{i+1}. The
first and last detectors are distinguished in the sense that Ld_1 must accept the event types of the system under supervision and Ld_n outputs information about anomalies to the security officer. Figure 1 presents a schema for an anomaly detector based on layered filtering.

Fig. 1. Layered filtering schema

Let us introduce the following symbols:
– DR_i – i-th layer detection rate
– FAR_i – i-th layer false alarm rate
– P(I)_i – i-th layer base rate of anomalous events
– P(¬I)_i – i-th layer base rate of normal events
– BDR_i – i-th layer Bayesian detection rate
– BTNR_i – i-th layer Bayesian true negative rate
Because of its operational characteristics, the detector Ld_{i+1} operates on a set of events with significantly changed base rates of anomalies and non-anomalies – this is the most important part, as it reduces the influence of the base-rate fallacy on the real false alarm rate. The base rate probabilities for the next layer are expressed in the following way:

P(I)_{i+1} = BDR_i .     (5)
P(¬I)_{i+1} = 1 − BDR_i .     (6)

It is now possible to write equations for BDR_{i+1} and BTNR_{i+1} as functions of, respectively, BDR_i and BTNR_i:

BDR_{i+1} = BDR_i ∗ DR_{i+1} / (BDR_i ∗ DR_{i+1} + (1 − BDR_i) ∗ FAR_{i+1}) .     (7)
BTNR_{i+1} = (1 − BDR_i) ∗ (1 − FAR_{i+1}) / ((1 − BDR_i) ∗ (1 − FAR_{i+1}) + BDR_i ∗ (1 − DR_{i+1})) .     (8)
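To make the recurrence concrete, here is a minimal Python sketch (my own illustration, not code from the paper). Fed with the layer characteristics used in Table 1 below, it reproduces the BDR and BTNR rows of that table up to rounding:

```python
def layered_filter(p_anomaly, layers):
    """Propagate BDR and BTNR through a sequence of (DR, FAR) layers,
    following equations (4)-(8)."""
    results = []
    p_i = p_anomaly  # base rate of anomalous events seen by layer 1
    for dr, far in layers:
        bdr = p_i * dr / (p_i * dr + (1.0 - p_i) * far)
        btnr = (1.0 - p_i) * (1.0 - far) / (
            (1.0 - p_i) * (1.0 - far) + p_i * (1.0 - dr))
        results.append((p_i, bdr, btnr))
        p_i = bdr  # equation (5): the next layer sees BDR as its base rate
    return results

# layer characteristics (DR_i, FAR_i) of the first example filter
layers = [(1.0, 0.00001), (0.98, 0.0001), (0.98, 0.00001)]
for i, (p, bdr, btnr) in enumerate(layered_filter(2e-5, layers), 1):
    print(f"Ld{i}: P(I)={p:.5f} BDR={bdr:.5f} BTNR={btnr:.5f}")
```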
In this method the requirements on the operational characteristics of Ld_i can be relaxed significantly in terms of false alarm rate. However, it is still best to have the detection rate as high as possible. An additional advantage is that efficiency,
in terms of system resource usage, should be improved, because avoiding false alarms is one of the most complicated tasks for an anomaly detector. What is more, the quantity of events that reaches further layers is greatly reduced, which makes it possible to use more sophisticated algorithms there without the risk of a significant increase in the overall computational complexity.
5 Results and Example IDS
In this section results for two example layered filters are shown. In both cases the initial frequencies of normal and abnormal events are taken from the previous section. The frequencies seen by the next layers are calculated using (5) and (6). BDR and BTNR values are calculated using (7) and (8).

Table 1. Results for the first layered filter

           Ld_1      Ld_2      Ld_3
P(I)_i     0.00002   0.66667   0.99994
P(¬I)_i    0.99998   0.33333   0.00005
DR_i       1.0       0.98      0.98
FAR_i      0.00001   0.0001    0.00001
BDR_i      0.66667   0.99994   0.99999
BTNR_i     1         0.96153   0.00254

Fig. 2. BTNR and BDR changes for each layer of the first filter

Table 2. Results for the second layered filter

           Ld_1      Ld_2      Ld_3
P(I)_i     0.00002   0.000391  0.27135
P(¬I)_i    0.99998   0.99960   0.72864
DR_i       0.98      0.95      0.98
FAR_i      0.05      0.001     0.0001
BDR_i      0.000391  0.271353  0.999726
BTNR_i     0.999999  0.999980  0.99260

Fig. 3. BTNR and BDR changes for each layer of the second filter
The first filter consists of three layers. The characteristics of each layer detector were based on assumptions from [2] and are rather unrealistic. Table 1 shows the calculated probabilities of normal and abnormal events, the operational characteristics, and the BDR and BTNR values for each layer. It can be noticed that the BDR for the first filter increases significantly at the second layer and also that the BTNR value falls dramatically at the third layer. For the anomaly detector to be effective both of these values must be maximised. As it is natural that an increase of one of these values causes a decrease of the other, the right balance must be found between the number of layers and the detector characteristics. Figure 2 shows how the BDR and BTNR values change for each layer.
For the second filter more realistic assumptions have been made for the false alarm rates and detection rates of each layer, based on evaluations from [3]. This filter also consists of three layers. In this case the increase of BDR is a bit slower and BTNR stays at acceptable levels for all layers. Table 2 shows the calculated values. Figure 3 presents the relation between the layer number and the BDR (BTNR) values. A form of real implementation of a layered filter can be found in [1]. The system presented there is not directly referred to as a layered filter, but it consists of one layer based on a Variable Order Markov Chain and a second one based on neural networks and multiagent systems. The false alarm rate achieved there is actually 0, but the tests have been performed on a limited number of data sets and there was help from an additional mechanism for suppressing false alarms called "anergic agents".
6 Conclusions and Further Work
This article shows that the layered filtering approach to anomaly detection has potential for reducing the false alarm rate. Furthermore, it can help in reducing computational complexity because of the relaxed requirements, especially for the first layer. However, each layer detector must be carefully chosen in terms of performance and effectiveness. Also, the balance between BDR and BTNR must be monitored, as for a certain number of layers BTNR starts to fall very quickly. Finally, it would be a mistake to use a detector based on the same methods in more than one layer, as probably no additional classification decisions would be made. The method presented here focuses on reducing the false alarm rate, but a similar approach can be taken to reducing the false negative rate by taking the second output of the previous layer and connecting to it some sort of specialised detector which can verify whether any anomaly has escaped detection and, if so, redirect it to the next layer. It is also possible to build a two-dimensional network of detectors as a more sophisticated system for reducing both values. An additional important remark can be made about constructing a network security system based on signatures and anomaly detectors. Placing a signature-based detector on the first line of defence changes the base rates in the wrong way and makes the anomaly detector useless. The first line in the detector stack must be held by a system based on anomaly detection with proper characteristics minimising the false negative rate. The second line may consist of a signature-based system or another anomaly detector. The work presented here may be an important input to the process of building network security from more than one intrusion detection system. It is very interesting how the operational characteristics change with the number of layers. Further work includes the implementation of an IDS based on layered filtering and performing tests in real-life environments. Also, further research is needed on other methods of increasing the effectiveness of intrusion detection systems in terms of false negative and false positive rates.
References 1. Cetnarowicz, K., Rojek, G., Pokrywka, R.: Intelligent Agents as Cells of Immunological Memory. In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3993, pp. 855–862. Springer, Heidelberg (2006) 2. Axelsson, S.: The Base-Rate Fallacy and the Difficulty of Intrusion Detection. ACM Trans. Inf. Syst. Secur. 3, 186–205 (2000) 3. Warrender, C., Forrest, S., Perlmutter, B.: Detecting Intrusions Using System Calls: Alternative Data Models. In: IEEE Symposium on Security and Privacy, pp. 133– 145 (1999) 4. Klautau, A., Jevtic, N., Orlitsky, A.: Combined Binary Classifiers with Applications to Speech Recognition. In: International Conference on Spoken Language Processing 2002, pp. 2469–2472 (2002)
Performance of Multicore Systems on Parallel Data Clustering with Deterministic Annealing

Xiaohong Qiu, Geoffrey C. Fox, Huapeng Yuan, Seung-Hee Bae, George Chrysanthakopoulos, and Henrik Frystyk Nielsen

Research Computing UITS, Indiana University Bloomington, [email protected]
Community Grids Lab, Indiana University Bloomington, {gcf,yuanh,sebae}@indiana.edu
Microsoft Research, Redmond WA, {georgioc, henrikn}@microsoft.com
Abstract. We present a performance analysis of a scalable parallel data clustering algorithm with deterministic annealing for multicore systems that compares MPI and a new C# messaging runtime library CCR (Concurrency and Coordination Runtime) with Windows and Linux and using both threads and processes. We investigate effects of memory bandwidth and fluctuations of run times of loosely synchronized threads. We give results on message latency and bandwidth for two processor multicore systems based on AMD and Intel architectures with a total of four and eight cores. We compare our C# results with C using MPICH2 and Nemesis and Java with both mpiJava and MPJ Express. We show initial speedup results from Geographical Information Systems and Cheminformatics clustering problems. We abstract the key features of the algorithm and multicore systems that lead to the observed scalable parallel performance. Keywords: Data mining, MPI, Multicore, Parallel Computing, Performance, Threads, Windows.
1 Introduction Multicore architectures are of increasing importance and are impacting client, server and supercomputer systems [1-6]. They make parallel computing and its integration with large systems of great importance as "all" applications need good performance rather than just the relatively specialized areas covered by traditional high performance computing. In this paper we consider data mining as a class of applications that has broad applicability and could be important on tomorrow's client systems. Such applications are likely to be written in managed code (C#, Java) and run on Windows (or equivalent client OS for Mac) and use threads. This scenario is suggested by the recent RMS (Recognition, Mining and Synthesis) analysis by Intel [5]. In our research, we are looking at some core data mining algorithms and their application to scientific areas including cheminformatics, bioinformatics and demographic studies using GIS (Geographical Information Systems). On the computer science side, we are
looking at performance implications of both multicore architectures and use of managed code. Our close ties to science applications ensure that we understand important algorithms and parameter values and can generalize our initial results on a few algorithms to a broader set. In this paper we present new results on a powerful parallel data clustering algorithm that uses deterministic annealing [20] to avoid local minima. We explore in detail the sources of the observed synchronization overhead. We present the performance analysis for C# and Java on both Windows and Linux and identify new features that have not been well studied for parallel scientific applications. This research was performed on a set of multicore commodity PC's summarized in Table 1; each has two CPU chips and a total of 4 or 8 CPU cores. The results can be extended to computer clusters as we are using similar messaging runtime but we focus in this paper on the new results seen on the multicore systems.

Table 1. Multicore PC's used in paper

AMD4: 4 core, 2 Processor HPxw9300 workstation, 2 AMD Opteron 275 CPUs at 2.19 GHz, L2 Cache 2x1 MB (for each chip), Memory 4 GB. XP 64bit & Server 2003
Intel4: 4 core, 2 Processor Dell Precision PWS670, 2 Intel Xeon CPUs at 2.80 GHz, L2 Cache 2x2 MB, Memory 4 GB. XP Pro 64bit
Intel8a: 8 core, 2 Processor Dell Precision PWS690, 2 Intel Xeon E5320 CPUs at 1.86 GHz, L2 Cache 2x4 MB, Memory 8 GB. XP Pro 64bit
Intel8b: 8 core, 2 Processor Dell Precision PWS690, 2 Intel Xeon X5355 CPUs at 2.66 GHz, L2 Cache 2x4 MB, Memory 4 GB. Vista Ultimate 64bit and Fedora 7
Intel8c: 8 core, 2 Processor Dell Precision PWS690, 2 Intel Xeon X5345 CPUs at 2.33 GHz, L2 Cache 2x4 MB, Memory 8 GB. Redhat 5
Sect. 2 discusses the CCR and SALSA runtime described in more detail in [7-9]. Sect. 3 describes our motivating clustering application and explains how it illustrates a broader class of data mining algorithms [17]. These results identify some important benchmarks covering memory effects, runtime fluctuations and synchronization costs discussed in Sections 4-6. There are interesting cache effects that will be discussed elsewhere [8]. Conclusions are in Sect. 8 while Sect. 7 briefly describes the key features of the algorithm and how they generalize to other data mining areas. All results and benchmark codes presented are available from http://www.infomall.org/salsa [16].
2 Overview of CCR and SALSA Runtime Model We do not address possible high level interfaces such as OpenMP or parallel languages but rather focus on the lower level runtime to which these could map. In other papers [7-9] we have explained our hybrid programming model SALSA (Service Aggregated Linked Sequential Activities) that builds libraries as a set of services and uses simple service composition to compose complete applications [10]. Each service then runs in parallel on any number of cores – either part of a single PC or spread out
over a cluster. The performance requirements at the service layer are less severe than at the “microscopic” thread level for which MPI is designed and where this paper concentrates. We use DSS (Decentralized System Services) which offers good performance with messaging latencies of 35 µs between services on a single PC [9]. Applications are built from services; services are built as parallel threads or processes that are synchronized with low latency by locks, MPI or a novel messaging runtime library CCR (Concurrency and Coordination Runtime) developed by Microsoft Research [11-15]. CCR provides a framework for building general collective communication where threads can write to a general set of ports and read one or more messages from one or more ports. The framework manages both ports and threads with optimized dispatchers that can efficiently iterate over multiple threads. All primitives result in a task construct being posted on one or more queues, associated with a dispatcher. The dispatcher uses OS threads to load balance tasks. The current applications and provided primitives support a dynamic threading model with some 8 core capabilities given in more detail in [9]. CCR can spawn handlers that consume messages as is natural in a dynamic search application where handlers correspond to links in a tree. However one can also have long running handlers where messages are sent and consumed at a rendezvous points (yield points in CCR) as used in traditional MPI applications. Note that “active messages” correspond to the spawning model of CCR and can be straightforwardly supported. Further CCR takes care of all the needed queuing and asynchronous operations that avoid race conditions in complex messaging. CCR is attractive as it supports such a wide variety of messaging from dynamic threading, services (via DSS described in [9]) and MPI style collective operations discussed in this paper. For our performance comparisons with MPI, we needed rendezvous semantics which are fully supported by CCR and we chose to use the Exchange pattern corresponding to the MPI_SENDRECV interface where each process (thread) sends and receives two messages equivalent to a combination of a left and right shift with its two neighbors in a ring topology. Note that posting to a port in CCR corresponds to a MPISEND and the matching MPIRECV is achieved from arguments of handler invoked to process the port.
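For comparison, the Exchange pattern built on MPI's send-receive primitive can be sketched as follows; this is my own minimal illustration using mpi4py, not the authors' C# and CCR code:

```python
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
left, right = (rank - 1) % size, (rank + 1) % size

# Each process sends one message to each ring neighbour and receives one from
# each, i.e. a combined left and right shift expressed as two send-receives.
from_right = comm.sendrecv(f"hello from {rank}", dest=left, source=right)
from_left = comm.sendrecv(f"hello from {rank}", dest=right, source=left)
print(rank, from_left, from_right)
```

Run under an MPI launcher (e.g. mpiexec -n 8 python exchange.py), each rank exchanges a message with both ring neighbours, which is the communication pattern timed in Sect. 5.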
3 Deterministic Annealing Clustering Algorithm We are building a suite of data mining services to test the runtime and two layer SALSA programming model. We start with data clustering which has many important applications including clustering of chemical properties which is an important tool [18] for finding for example a set of chemicals similar to each other and so likely candidates for a given drug. We are also looking at clustering of demographic information derived from the US Census data and other sources. Our software successfully scales to cluster the 10 million chemicals in NIH PubChem and the 6 million people in the state of Indiana. Both applications will be published elsewhere and the results given here correspond to realistic applications and subsets designed to test scaling. We use a modification of the well known K-means algorithm [19], using deterministic annealing [20], that has much better convergence properties than K-means and good parallelization properties.
For a set of data points X (labeled by x) and cluster centers Y (labeled by k), one gradually lowers the annealing temperature T and iteratively calculates:

Y(k) = Σ_x p(X(x),Y(k)) X(x) ,
p(X(x),Y(k)) = exp(−d(X(x),Y(k))/T) p(x) / Z_x ,     (1)
with Z_x = Σ_k exp(−d(X(x),Y(k))/T) .
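A serial sketch of one iteration of (1) — my own illustration, not the authors' parallel C# code — assuming squared Euclidean distance for d, a uniform point prior p(x), and the standard normalized form of the centroid update:

```python
import numpy as np

def da_step(X, Y, T):
    """One deterministic-annealing update of cluster centers Y at temperature T.
    X: (N, d) data points, Y: (K, d) current centers."""
    # squared Euclidean distances d(X(x), Y(k)), shape (N, K)
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=2)
    # subtract the per-point minimum for numerical stability
    # (a per-row constant cancels in the normalization below)
    d2 -= d2.min(axis=1, keepdims=True)
    # association probabilities p(X(x), Y(k)), normalized per point by Z_x
    w = np.exp(-d2 / T)
    p = w / w.sum(axis=1, keepdims=True)
    # new centers: probability-weighted means of the points
    return (p.T @ X) / p.sum(axis=0)[:, None]

# usage: start from (nearly) coincident centers and lower T gradually
X = np.random.rand(1000, 2)
Y = np.tile(X.mean(axis=0), (3, 1)) + 1e-4 * np.random.rand(3, 2)
for T in [1.0, 0.3, 0.1, 0.03, 0.01]:
    for _ in range(20):
        Y = da_step(X, Y, T)
```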
Here d(X(x),Y(k)) is the distance defined in the space where clustering is occurring. Parallelism can be implemented by dividing the points X between the cores and there is a natural loosely synchronous barrier where the sums in each core are combined in a reduction collective to complete the calculation in (1). Rather than plot speed-up, we focus in more detail on the deviations from "perfect speed-up (of P)". Such parallel applications have a well understood performance model that can be expressed in terms of a parallel overhead f(n,P) (roughly 1 − efficiency) where different overhead effects are naturally additive. Putting T(n,P) as the execution time on P cores, or more generally processes/threads, we can define

Overhead f(n,P) = (P T(n,P) − T(Pn,1)) / T(Pn,1) ,
efficiency ε = 1/(1+f) and Speed-up = εP .     (2)
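As a small worked example of (2) — mine, with hypothetical timings — overhead, efficiency and speed-up can be computed as:

```python
def parallel_metrics(t_parallel, t_serial, p):
    """Overhead f, efficiency eps and speed-up from equation (2).
    t_parallel: T(n, P) on p cores; t_serial: T(Pn, 1) on one core."""
    f = (p * t_parallel - t_serial) / t_serial
    eps = 1.0 / (1.0 + f)
    return f, eps, eps * p

# e.g. a run where 8 cores take 105 s on a job one core finishes in 800 s
print(parallel_metrics(105.0, 800.0, 8))  # f = 0.05, eps ~ 0.952, speed-up ~ 7.6
```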
For the algorithm of eqn. (1), f(n,P) should depend on the grain size n where each core handles n data points and in fact f(n,P) should decrease proportionally to the reciprocal of the grain size with a coefficient that depends on synchronization costs [6, 21-23]. This effect is clearly seen in Fig. 1, which shows good speed-up on 8 cores of around 7.5 (f(n,P) ~ 0.05) for large problems. However we do not find f(n,P) going fully to zero as n increases. Rather it erratically wanders around a small number, 0.02 to 0.1, as parameters are varied. The overhead also decreases, as shown in Fig. 1, as the number of clusters increases. This is expected from (1) as the ratio of computation to memory access is proportional to the number of clusters. In Fig. 2 we plot the parallel overhead as a function of the number of clusters for two large real problems coming from Census data and chemical property clustering. These clearly show the rather random behavior after f(n,8) decreases to a small value corresponding to quite good parallelism – speedups of over 7 on 8 core systems. The results in Fig. 2(b) show lower asymptotic values which were determined to correspond to the binary data used in Chemistry clustering. This problem showed fluctuations similar in size to 2(a) if one used floating point representation for the Chemistry "fingerprint" data. Of course the binary choice shown in Fig. 2(b) is fastest and the appropriate approach to use. Looking at this performance in more detail we identified effects from memory bandwidth, fluctuations in thread run time and cache interference [24]. We present a summary of the first two here and present cache effects and details in [7, 8].

Fig. 1. Parallel Overhead for GIS 2D Clustering on Intel8b using C# with 8 threads (cores) and CCR synchronization. We use two values (10, 20) for the number of clusters and plot against the reciprocal of the number of data points.

Fig. 2. Parallel Overhead defined in (2) as a function of the number of clusters for (a) 2-dimensional GIS Census data for Indiana in over 200,000 blocks and (b) 40,000 PubChem compounds, each with 1052 binary chemical properties.
4 Memory Bandwidth In Fig. 3, we give typical results of a study of the impact of memory bandwidth in the different hardware and software configurations of Table 1. We isolate the kernel of the clustering algorithm of Sect. 2 and examine its performance as a function of grain size n, number of clusters and number of cores. We employ the scaled speed up strategy and measure thread dependence at three fixed values of grain size n (10,000, 50,000 and 500,000). All results are divided by the number of clusters, the grain size, and the number of cores and scaled so the 10,000 data point, one cluster, one core result becomes 1 and deviations from this value represent interesting performance effects. We display cases for 1 cluster where memory bandwidth effects could be important and also for 80 clusters where such effects are small as one performs 80 floating point steps on every variable fetched from memory. Although we studied C, C#, Windows and Linux, we only present Windows C# results in Fig. 3. C with Windows shows similar effects but of smaller magnitude while Linux shows small effects (the results for all n and cluster counts are near 1). Always we use threads not processes and C uses locks and C# uses CCR synchronization. Data is stored so as to avoid any cache line (false sharing) effects [8, 24]. The results for one cluster in Fig. 3(a) clearly show the effect of memory bandwidth with scaled run time increasing significantly as one increases the number of cores used. The performance improves in Fig. 3(b) (scaled runtime < 1) with more clusters when the memory demands are small. In this benchmark the memory demands scale directly with number of cores and inversely with number of clusters. A major concern with multicore system is the need for a memory bandwidth that increases linearly with the number of cores. In Fig. 3(a) we see a 50% increase in the run time for a grain size of 10,000 and 1 cluster. This is for
C# and Windows and the overhead is reduced to 22% for C on Windows and 13% for C on Linux. Further we note that one expects the 10,000 data point case to get excellent performance as the dataset can easily fit in cache and minimize memory bandwidth needs. However we see similar results whether or not the dataset fits into cache. This must be due to the complex memory structure leading to cache conflicts. We get excellent cache performance for the simple data structures of matrix multiplication. In all cases, we get small overheads for 80 clusters (and in fact for cluster counts greater than 4), which explains why the applications of Sect. 2 run well. There are no serious memory bandwidth issues in cases with several clusters and in this case that dominates the computation. This is usual parallel computing wisdom; real size problems run with good efficiency as long as there is plenty of computation [6, 21-23]. The data mining cases we are studying satisfy this and we expect them to run well on the multicore machines expected over the next 5 years.

Fig. 3. Scaled Run time on Intel8b using Vista and C# with CCR for synchronization on the Clustering Kernel for three dataset sizes with 10,000, 50,000 or 500,000 points per thread (core). Each measurement involved averaging over at least 1000 computations separated by synchronization whose small cost is not included in results.
5 Synchronization Performance The synchronization performance of CCR has been discussed in detail previously [9], where we showed that dynamic threading has an approximate 5 µs overhead. Here we expand the previous brief discussion of the rendezvous (MPI) style performance with Table 2, giving some comparisons between C, C# and Java for the MPI Exchange operation (defined in Sect. 2) running on the maximum number of cores (4 or 8) available on the systems of Table 1. Results for the older Intel8a are available online [16]. In these tests we use a zero size message. Note that the CCR Exchange operation timed in Table 2 has the full messaging transfer semantics of the MPI standards but avoids the complexity of some MPI capabilities like tags [25-27]. We expect that future simplified messaging systems that, like CCR, span from concurrent threads to collective rendezvous will choose such simpler implementations. Nevertheless we think that Table 2 is a fair comparison. Note that in the "Grains" column we list the number of concurrent activities and whether they are threads or processes. These measurements correspond to synchronizations occurring roughly every 30 µs and were averaged over 500,000 such synchronizations in a single run.
Table 2. MPI Exchange Latency

Machine   OS       Runtime              Grains     Latency (µs)
Intel8c   Redhat   MPJE                 8 Procs    181
          Redhat   MPICH2               8 Procs    40.0
          Redhat   MPICH2 Fast Option   8 Procs    39.3
          Redhat   Nemesis              8 Procs    4.21
Intel8c   Fedora   MPJE                 8 Procs    157
          Fedora   mpiJava              8 Procs    111
          Fedora   MPICH2               8 Procs    64.2
Intel8b   Vista    MPJE                 8 Procs    170
          Fedora   MPJE                 8 Procs    142
          Fedora   mpiJava              8 Procs    100
          Vista    CCR                  8 Thrds    20.2
AMD4      XP       MPJE                 4 Procs    185
          Redhat   MPJE                 4 Procs    152
          Redhat   mpiJava              4 Procs    99.4
          Redhat   MPICH2               4 Procs    39.3
          XP       CCR                  4 Thrds    16.3
Intel4    XP       CCR                  4 Thrds    25.8
The optimized Nemesis version of MPICH2 gives the best performance, while CCR, with for example a 20 µs latency on Intel8b, outperforms "vanilla MPICH2". We can expect CCR and C# to improve and compete in performance with systems like Nemesis that use the better optimized (older) languages. We were surprised by the uniformly poor performance of MPI with Java. Here the old mpiJava invokes MPICH2 from a Java-C binding while MPJ Express [27] is pure Java. It appears threads in Java currently are not competitive in performance with those in C#. Perhaps we need to revisit the goals of the old Java Grande activity [29]. As discussed earlier we expect managed code (Java and C#) to be of growing importance as client multicores proliferate, so good parallel multicore Java performance is important.
6 Performance Fluctuations We already noted in Sect. 3 that our performance was impacted by fluctuations in run time that were bigger than seen in most parallel computing studies, which typically look at Linux and processes whereas our results are mainly for Windows and threads. In Figs. 4 and 5 we present some results quantifying this using the same "clustering kernel" introduced in Sect. 4. We average results over 1000 synchronization points in a single run. In Figs. 4 and 5 we calculate the standard deviation of the 1000P measured thread runtimes obtained when P cores are used. Our results show much larger run time fluctuations for Windows than for Linux and we believe this effect leads to the 2-10% parallel overheads seen already in Fig. 2. These figures also show many of the same trends as earlier results. The smallest dataset (10,000), which should be contained in cache, has the largest fluctuations. C and Linux show lower fluctuations
than C# and Windows. Further, turning to Linux, Redhat outperforms Fedora (shown in [9]). C# in Fig. 4 has rather large (5% or above) fluctuations in all cases considered. Note our results with Linux are all obtained with threads and so are not directly comparable with traditional MPI Linux measurements that use processes. Processes are better isolated from each other in both cache and system effects and so it is possible that these fluctuations are quite unimportant in past scientific programming studies but significant in our case. Although these fluctuations are important in the limit of large grain size when other overheads are small, they are never a large effect and do not stop us getting excellent speedup on large problems.

Fig. 4. Ratio of Standard Deviation to mean of thread execution time averaged over 1000 instances using XP on Intel8a and C# with CCR for synchronization on the Clustering Kernel for three dataset sizes with 10,000, 50,000 or 500,000 points per thread (core).

Fig. 5. Ratio of Standard Deviation to mean of thread execution time using Redhat Linux on Intel8c and C with locks for synchronization on the Clustering Kernel for three dataset sizes with 10,000, 50,000 or 500,000 points per thread (core). Fedora shows larger effects than Redhat.
7 Generalization to Other Data Mining Algorithms The deterministic annealing clustering algorithm has exactly the same structure as other important data mining problems including dimensional scaling and Gaussian mixture models with the addition of deterministic annealing to mitigate the local minima that are a well known difficulty with these algorithms [17]. One can show [17] that one gets these different algorithms by different choices for Y(k), a(x), g(k), T and s(k) in (3). As in Sect. 2, X(x) are the data points to be modeled and F is the objective function to be minimized.
F = −T Σ_{x=1}^{N} a(x) ln [ Σ_{k=1}^{K} g(k) exp{−0.5 (X(x) − Y(k))^2 / (T s(k))} ] .     (3)
Thus we can immediately deduce that our results imply that scalable parallel performance can be achieved for all algorithms given by (3). Further it is interesting that the parallel kernels of these data mining algorithms are similar to those well studied by the high performance (scientific) computing community and need the synchronization primitives supported by MPI. The algorithms use the well established SPMD (Single Program Multiple Data) style with the same decomposition for multicore and distributed execution. However clusters and multicore systems use different implementations of collective operations at synchronization points. We expect this structure is more general than the studied algorithm set.
8 Conclusions Our results are very encouraging for both using C# and for getting good multicore performance on important applications. We have initial results that suggest a class of data mining applications run well on current multicore architectures with efficiencies on 8 cores of at least 95% for large realistic problems. We have looked in detail at overheads due to memory, run time fluctuation and synchronizations. Our results are reinforced in [8, 9] with a study of cache effects and further details of issues covered in this paper. Some overheads such as runtime fluctuations are surprisingly high in Windows/C# environments but further work is likely to address this problem by using lessons from Linux systems that show small effects. C# appears to have much better thread synchronization effects than Java and it seems interesting to investigate this.
References 1. Patterson, D.: The Landscape of Parallel Computing Research: A View from Berkeley 2.0 Presentation at Manycore Computing, Seattle, June 20 (2007) 2. Dongarra, J. (ed.): The Promise and Perils of the Coming Multicore Revolution and Its Impact, CTWatch Quarterly, February 2007, vol. 3(1) (2007), http://www.ctwatch.org/quarterly/archives/february-2007 3. Sutter, H.: The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software. Dr. Dobb’s Journal 30(3) (March 2005) 4. Annotated list of multicore sites: http://www.connotea.org/user/crmc/ 5. Dubey P.: Teraflops for the Masses: Killer Apps of Tomorrow Workshop on Edge Computing Using New Commodity Architectures, UNC (May 23, 2006), http://gamma.cs.unc.edu/EDGE/SLIDES/dubey.pdf 6. Fox G.: Parallel Computing 2007: Lessons for a Multicore Future from the Past Tutorial at Microsoft Research (February 26 to March 1 2007) 7. Qiu, X., Fox, G., Ho, A.: Analysis of Concurrency and Coordination Runtime CCR and DSS, Technical Report January 21 (2007) 8. Qiu, X., Fox, G., Yuan, H., Bae, S., Chrysanthakopoulos, G., Nielsen, H.: Performance Measurements of CCR and MPI on Multicore Systems Summary, September 23 (2007)
9. Qiu, X., Fox, G., Yuan, H., Bae, S., Chrysanthakopoulos, G., Nielsen, H.: High Performance Multi-Paradigm Messaging Runtime Integrating Grids and Multicore Systems. In: Proceedings of eScience 2007 Conference, Bangalore, India, December 10-13 (2007) 10. Gannon, D., Fox, G.: Workflow in Grid Systems Concurrency and Computation. Practice & Experience 18(10), 1009–1019 (2006) 11. Nielsen, H., Chrysanthakopoulos, G.: Decentralized Software Services Protocol – DSSP, http://msdn.microsoft.com/robotics/media/DSSP.pdf 12. Chrysanthakopoulos, G.: Concurrency Runtime: An Asynchronous Messaging Library for C# 2.0, Channel9 Wiki Microsoft, http://channel9.msdn.com/wiki/default.aspx/Channel9.ConcurrencyRuntime 13. Richter J.: Concurrent Affairs: Concurrent Affairs: Concurrency and Coordination Runtime, Microsoft, http://msdn.microsoft.com/msdnmag/issues/06/09/ ConcurrentAffairs/default.aspx 14. Microsoft Robotics Studio is a Windows-based environment that includes end-to-end Robotics Development Platform, lightweight service-oriented runtime, and a scalable and extensible platform, http://msdn.microsoft.com/robotics/ 15. Chrysanthakopoulos, G., Singh, S.: An Asynchronous Messaging Library for C#, Synchronization and Concurrency in Object-Oriented Languages (SCOOL) at OOPSLA Workshop, San Diego, CA (October 2005), http://urresearch.rochester.edu/handle/1802/2105 16. SALSA Multicore research Web site, http://www.infomall.org/salsa For Indiana University papers cited here, http://grids.ucs.indiana.edu/ptliupages/publications 17. Qiu X., Fox G., Yuan H., Bae S., Chrysanthakopoulos G., Nielsen H.: Parallel Clustering and Dimensional Scaling on Multicore Systems Technical Report (February 21 2008) 18. Downs, G., Barnard, J.: Clustering Methods and Their Uses in Computational Chemistry. Reviews in Computational Chemistry 18, 1–40 (2003) 19. K-means algorithm at, http://en.wikipedia.org/wiki/K-means_algorithm 20. Rose, K.: Deterministic annealing for clustering, compression, classification, regression, and related optimization problems. Proc IEEE 86, 2210–2239 (1998) 21. Dongarra, J., Foster, I., Fox, G., Gropp, W., Kennedy, K., Torczon, L., White, A. (eds.): The Sourcebook of Parallel Computing. Morgan Kaufmann, San Francisco (2002) 22. Fox, G., Johnson, M., Lyzenga, G., Otto, S., Salmon, J., Walker, D.: Solving Problems in Concurrent Processors, vol. 1. Prentice-Hall, Englewood Cliffs (1988) 23. Fox, G., Messina, P., Williams, R.: Parallel Computing Work! Morgan Kaufmann, San Mateo Ca (1994) 24. How to Align Data Structures on Cache Boundaries, Internet resource from Intel, http://www.intel.com/cd/ids/developer/asmona/eng/dc/threading/knowledgebase/43837.htm 25. Message passing Interface MPI Forum, http://www.mpi-forum.org/index.html 26. MPICH2 implementation of the Message-Passing Interface (MPI), http://wwwunix.mcs.anl.gov/mpi/mpich/ 27. Baker, M., Carpenter, B., Shafi, A.: MPJ Express: Towards Thread Safe Java HPC. In: IEEE International Conference on Cluster Computing (Cluster 2006), Barcelona, Spain, September 25-28 (2006), http://www.mpj-express.org/docs/papers/mpj-clust06.pdf 28. mpiJava Java interface to the standard MPI runtime including MPICH and LAM-MPI, http://www.hpjava.org/mpiJava.html 29. Java Grande, http://www.javagrande.org
Second Generation Quad-Core Intel Xeon Processors Bring 45 nm Technology and a New Level of Performance to HPC Applications

Pawel Gepner, David L. Fraser, and Michal F. Kowalik

Intel Corporation
{pawel.gepner,david.l.fraser,michal.f.kowalik}@intel.com
Abstract. The second generation of Quad-Core Intel Xeon processors was launched on November 12th 2007. In this paper we take a look at what the new 45 nm based Quad-Core Intel Xeon Processor brings to high performance computing. We compare an Intel Xeon 5300 series based system with a server utilizing its successor, the Intel Xeon 5400. We measure both CPU generations operating in dual socket platforms in a typical HPC benchmark scenario using some common HPC benchmarks. The results presented clearly show that the new Intel Xeon processor 5400 family provides a significant performance advantage on typical HPC workloads and would therefore be seen to be an appropriate choice for many HPC installations.
Keywords: HPC, multi-core processors, quad-core processors, parallel processing, benchmarks.
1 Introduction
Today multi-core processors are becoming a standard for high performance computing. The second generation Quad-Core Intel Xeon processor not only represents a shrink to 45 nm process technology but also brings a lot of new mechanisms which improve overall performance and power saving characteristics. The Intel Xeon 5400 is based on the same micro-architecture as the Intel Core Microarchitecture, including some extensions which actually raise the performance. This paper has been written by Intel employees, therefore competitive products were not taken into consideration. The Intel Xeon processor family contains the 3000, 5000 and 7000 series, where each of them is dedicated to different platforms and applications:
– Intel Xeon 3000 family is optimized for single-socket solutions;
– Intel Xeon 5000 family is optimized for dual-socket solutions;
– Intel Xeon 7000 family is optimized for multi-socket systems (4 + way).
All of them are based on the same microarchitecture principles and they are all used in HPC installations, where the Intel Xeon 5000 family is the most common, therefore the authors have focused on it.
Our strategy of building CPUs which do not drive performance via faster clock speeds but rather are based on an energy-efficient architecture has changed the landscape of high-performance computing (HPC) completely. Look at the 30th edition of the TOP500 list released (Nov. 12, 2007) at SC07, the international conference on high performance computing, networking, storage and analysis, in Reno (NV, US): we see 317 systems based on the Intel Core Microarchitecture and 102 of them are systems based on the Quad-Core Intel Xeon processor. True performance is a combination of both clock frequency and Instructions Per Clock (IPC). This shows that performance can be improved by increasing frequency and/or IPC. Frequency is a function of both the manufacturing process and the micro-architecture. Basically there are two micro-architecture approaches which somehow determine CPU design: more IPC or higher frequency. The first approach uses very few transistors, but the path from start to finish is very long; the second is based on a shorter path, but it uses many more transistors [1, 2]. From the manufacturing process perspective a key consideration is reducing the size of the transistors, which means reducing the distances between the transistors and reducing transistor switching times. These two things contribute significantly to faster processor clock frequencies. Unfortunately, as processor frequencies rise, the total heat produced by the processor scales with them. Reducing the transistor size allows the chip to produce less heat, because smaller transistors can operate on lower voltages; this does not, however, solve all of the issues and in fact generates another problem with electrons. Based on quantum mechanics principles, small elements such as electrons are able to spontaneously flow over short distances. The emitter and transistor base are now so close together that a considerable number of electrons can escape from one to the other; this effect is called leakage. Reducing operating voltages also effectively reduces the available voltage swing, and if the difference between logic 1 and logic 0 becomes too small the transistor will not operate properly. In addition we need to deal with the decreased transistor size: as the leakage current increases we require more complicated process technology. In conclusion, this complicated situation, where the number of transistors per unit area needs to increase but the operating frequency must go down, will likely increase the number of cores required and decrease (or not increase so quickly) the clock frequency [3, 4].
2 Processor Microarchitecture
The Architecture typically refers to the high level description of the Instruction Set Architecture (ISA). The Architecture generally defines an instruction set for software to write to and, in turn, the software could then be run on all processor implementations of that particular architecture.
The Microarchitecture defines a specific means of implementing compatible hardware that supports the higher level architecture. New micro-architectures typically define improvements that ultimately increase the user benefits when running software that is compatible with the high-level architecture. Microarchitecture is enhanced with each processor generation, delivering improvements in performance, energy efficiency, and capabilities while still maintaining application-level compatibility. Microarchitecture refers to the implementation of the ISA in silicon, including cache memory design, execution units, and pipelining. In fact, benefits from many microarchitecture enhancements can be achieved without any modification or recompilation of code. The 45 nm next generation Intel Quad Core Xeon processor family (Harpertown) is the next instance of Intel processors based on Intel 45 nm transistor technology, a new transistor breakthrough that allows for processors with nearly twice the transistor density and drastically reduced electrical leakage. The new Intel Quad Core Xeon includes new instructions and microarchitecture enhancements that will deliver superior performance and energy efficiency while maintaining compatibility with already existing applications. Microarchitecture enhancements in the Intel Quad Core Xeon 5400 processor family include:
– New set of instructions – Intel SSE4
– 50% larger L2 Cache
– Super Shuffle Engine and Fast Radix-16 Divider
– Enhanced Cache Line Split Load
– Deep Power Down Technology
– Enhanced Intel Dynamic Acceleration Technology
Intel SSE4 is a set of new instructions designed to improve the performance and energy efficiency of a broad range of applications. Intel SSE4 builds upon the Intel 64 Instruction Set Architecture (ISA), the most popular and broadly used computer architecture for developing 32-bit and 64-bit applications. Intel SSE4 consists of 54 instructions divided into two major categories: Vectorizing Compiler/Media Accelerators and Efficient Accelerated String/Text Processing. The new Intel Quad Core Xeon currently supports 47 of the Intel SSE4 instructions including the Vectorizing Compiler and Media Accelerator instructions. The remaining instructions will be available in future generations of Intel processors. Software will be able to use Vectorizing Compiler and Media Accelerators to provide high performance compiler primitives, such as packed (using multiple operands at the same time) integer and floating point operations, that allow for performance optimized code generation. It also includes highly optimized mediarelated operations such as sum absolute difference, floating point dot products, and memory loads. The Vectorizing Compiler and Media Accelerator instructions should improve the performance of audio, video, and image editing applications, video encoders, 3-D applications, and games. 50% larger L2 Cache (up to 12 MB in Quad Core implementation): Reduces the latencies for accessing instructions and data, improving application
performance (especially those that work on large data sets). 24 Way Set Associativity improves data access versus the previous generation Intel Xeon (16 Way Set Associativity). Improved Store Forwarding maximizes data balance cache to memory. Super Shuffle Engine and Fast Radix-16 Divider: 3X faster shuffles and 1.6X-2X faster divides. The Super Shuffle Engine will greatly improve the performance of Intel SSE4 and Supplemental Streaming SIMD Extensions 3 (SSSE3) instructions. The new super shuffle engine performs a 128 bit operation in a single cycle and does not require any software changes. Enhanced Cache Line Split Load: Greatly improved performance on unaligned loads (those that span across cache boundaries) and optimized store and load operations. A new 16 byte aligned load instruction on WC (write combining) memory improves read bandwidth from WC memory by reading cache-line size quantities. This Streaming Load routine gives 8X faster reading from WC memory and improves the performance of memory-intensive applications. Deep Power Down Technology: A new power state that dramatically reduces processor power consumption. This is an ideal solution for developing energy efficient applications. Enhanced Intel Dynamic Acceleration Technology: Improves energy efficiency by dynamically increasing the performance of active cores when not all cores are utilized. Conceptually it uses the power headroom of the idle cores to boost the performance of the non-idle core. When one core enters an idle power C-state (CC3 or deeper) and the OS requests a higher performance state on the running core, the non-idle core is boosted up to a higher voltage and higher frequency (EDAT frequency); however, the overall chip power envelope still remains within the specified Thermal Design Power (TDP). How all of these innovations and changes in the microarchitecture are reflected in overall system performance and accelerate high performance computing is described below. In the testing environment we have been testing single system performance on typical HPC workloads.
3
Processor Performance
In this section we have focused on processor performance to compare two generations of the Quad-Core Intel Xeon processors. A popular benchmark well-suited for parallel, core-limited workloads is the Linpack HPL benchmark. Linpack is a floating-point benchmark that solves a dense system of linear equations in parallel. The metric produced is Giga-FLOPS or billions of floating point operations per second. Linpack performs operations called LU factorization. They are highly parallel and store most of their working data set in processor cache [6]. The processor operations it performs are predominantly 64-bit floating-point vector operations and use SSE instructions. This benchmark is used to determine the world's fastest computers, published at the website [7].
In both cases each processor core has 3 functional units that are capable of generating 128-bit results per clock. In this case we may assume that a single processor core does two 64 bit floating-point ADD instructions and two 64 bit floating-point MUL instructions per clock. The theoretical performance, calculated as the number of ADD and MUL operations executed in each clock multiplied by the frequency, is the same in both cases. Both implementations are based on the same microarchitecture. For the Quad-Core Intel Xeon processor X5365 this gives 3 GHz x 4 operations per clock x 4 cores = 48 GFLOPS. Exactly the same theoretical performance holds for the new Quad-Core Intel Xeon processor E5472: 3 GHz x 4 operations per clock x 4 cores = 48 GFLOPS. This is theoretical performance only; it is interesting to observe how the 50% bigger cache and 20% faster Front Side Bus (FSB) of the Intel Xeon processor E5472 benefit Linpack, and to see whether this is a good benchmark for CPU performance. In all performance tests we used systems configured as follows, with one exception for the Stream benchmark. Configuration details Quad-Core Intel Xeon processor X5365 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors X5365 3.00 GHz, 2x4 MB L2 cache, 1333 MHz FSB, 16 GB memory (8x2 GB FBDIMM 667 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Intel Optimized SMP LINPACK Benchmark 10 for Linux / LMBENCH 3.0 / Amber, Eclipse, Fluent, Gamess, Gromacs, Gaussian, LS-DYNA, Monte Carlo, PamCrash, Star-CD. Quad-Core Intel Xeon processor E5472 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors E5472 3.00 GHz, 2x6 MB L2 cache, 1600 MHz FSB, 16 GB memory (8x2 GB FBDIMM 800 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Intel Optimized SMP LINPACK Benchmark 10 for Linux / LMBENCH 3.0 / Amber, Eclipse, Fluent, Gamess, Gromacs, Gaussian, LS-DYNA, Monte Carlo, PamCrash, Star-CD. Using LINPACK HPL we see (Fig. 1) a 5% performance improvement between the system based on the Quad-Core Intel Xeon processor X5365 and the Quad-Core Intel Xeon processor E5472. It indicates that, as we expected based on the theoretical performance, we will not see a big performance improvement on CPU intensive tasks. The new Quad-Core Intel Xeon processor E5472 and its bigger cache and 1600 MHz FSB do not play an important role in Linpack scenarios as it is not a memory intensive benchmark.
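The peak figure quoted above is just the product of clock frequency, floating-point operations per clock and core count. A minimal C sketch of this back-of-the-envelope calculation follows; the numbers are the illustrative X5365/E5472 values from the text, not measured results:

#include <stdio.h>

int main(void)
{
    /* theoretical peak = frequency x FP operations per clock x cores */
    const double freq_ghz      = 3.0; /* core clock in GHz               */
    const int    ops_per_clock = 4;   /* 2 x 64-bit ADD + 2 x 64-bit MUL */
    const int    cores         = 4;   /* cores per processor             */

    printf("Theoretical peak: %.1f GFLOPS per processor\n",
           freq_ghz * ops_per_clock * cores);
    return 0;
}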
4
Memory Performance
In this section we illustrate the memory performance of the two generations of Quad-Core Intel Xeon processors. Memory performance is a combination of two elements: latency and throughput. Each is appropriate to a different workload, and each tells a different story. Latency measures how long it takes to chase a chain of pointers through memory.
Fig. 1. LINPACK: Dense Floating-Point Operations
Only a single chain is tracked at a time. Each chain stores only one pointer in a cache line and each cache line is randomly selected from a pool of memory. The pool of memory simulates the working environment of an application. When the memory pool is small enough to be placed inside cache, the benchmark measures the latency required to fetch data from cache. By changing the size of the memory pool we can measure the latency of any specific level of cache, or of main memory by making the pool bigger than all levels of cache. We measured latency using a 3.0 GHz Quad-Core Intel Xeon processor X5365 and a 3.0 GHz Quad-Core Intel Xeon processor E5472. The results of this experiment are shown in Fig. 2. Based on the different FSB as well as the bigger L2 cache, the Quad-Core Intel Xeon processor E5472 should have slightly lower latencies to caches, while the memory latencies compared to its predecessor should be more noticeably reduced, and from the results we can see that this is the case. The bigger L2 cache and faster 1600 MHz FSB benefit L2 cache latency as well as reduce the access time to main memory. All those elements, in conjunction with the new Intel 5400 chipset enabling the 1600 MHz FSB, help random memory access. The second important characteristic of memory performance is the throughput for sequential memory accesses. The benchmark we have used to measure throughput is the Stream benchmark. The Stream benchmark is a synthetic benchmark program, written in standard Fortran 77. It measures both memory reads and memory writes (in contrast to the standard usage of bcopy). It measures the performance of four long vector operations. These operations are: COPY: a(i) = b(i); SCALE: a(i) = q*b(i); SUM: a(i) = b(i) + c(i); TRIAD: a(i) = b(i) + q*c(i). These operations are representative of long vector operations and the array sizes are defined in such a way that each array is larger than the cache of the processors that are going to be tested. This gives us an indication of how effective the memory subsystem is in our implementation, excluding caches.
Fig. 2. Memory latency
As Fig. 3 shows, we see a huge memory bandwidth improvement with the new Quad-Core Intel Xeon processor E5472, mainly due to the 20% faster FSB (1600 MHz) as well as the improved functionality of the Intel 5400 chipset ("Seaburg") capable of operating at a 1600 MHz FSB. This 32% throughput improvement versus the older generation Quad Core Xeon based system will be reflected in all memory intensive applications, not only in Stream but also in other HPC data intensive workloads. Configuration details Quad-Core Intel Xeon processor X5355 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors X5355 2.66 GHz, 2x4 MB L2 cache, 1333 MHz FSB, 16 GB memory (8x2 GB FBDIMM 667 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Stream Benchmark. Quad-Core Intel Xeon processor E5472 based platform details: Intel preproduction server platform with two Quad-Core Intel Xeon processors E5472 3.00 GHz, 2x6 MB L2 cache, 1600 MHz FSB, 16 GB memory (8x2 GB FBDIMM 800 MHz), RedHat Enterprise Linux Server 5 Kernel 2.6.18-8.el5 on x86-64, Intel C/Fortran Compiler. Workload: Stream Benchmark.
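For reference, the four Stream kernels listed above translate directly into simple loops. The sketch below shows only the kernels themselves (the original benchmark is written in Fortran 77; a C rendering is used here purely for illustration, the array size N is a placeholder that must exceed the last-level cache, and the timing and bandwidth reporting of the real benchmark are omitted):

#include <stddef.h>

#define N 20000000              /* placeholder; must exceed the L2 cache size */
static double a[N], b[N], c[N];

void stream_kernels(double q)
{
    size_t i;
    for (i = 0; i < N; i++) a[i] = b[i];            /* COPY  */
    for (i = 0; i < N; i++) a[i] = q * b[i];        /* SCALE */
    for (i = 0; i < N; i++) a[i] = b[i] + c[i];     /* SUM   */
    for (i = 0; i < N; i++) a[i] = b[i] + q * c[i]; /* TRIAD */
}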
5
Application Performance
Linpack and Stream are synthetic benchmarks: they measure the performance of specific subsystems and do not deliver the full picture of system capability. Typical HPC applications use much more than a single subsystem and their nature is much more sophisticated. So, to get a better understanding of how the new Quad-Core Intel Xeon processor E5472 based platform benefits real applications, we have selected a couple of real code examples. These applications
Fig. 3. Stream benchmark – memory throughput
and benchmarks represent a broad spectrum of HPC workloads and seem to be a typical representation of a testing suite for this class of calculation. Amber is a package of molecular simulation programs. The workload measures the number of problems solved per day (PS) using eight standard molecular dynamic simulations. See [8] for more information. Eclipse from Schlumberger – reservoir simulation software for structure, geology, fluids and development scheme. Fluent is a commercial engineering application used to model computational fluid dynamics. The benchmark consists of 9 standard workloads organized into small, medium and large models. These comparisons use all but the largest of the models which do not fit into the 8 GB of memory available on the platforms. The Rating, the default Fluent metric, was used in calculating the ratio of the platforms by taking a geometric mean of the 8 workload ratings measured. GAMESS from Iowa State University, Inc. – a quantum chemistry program widely used to calculate energies, geometries, frequencies and properties of molecular systems. Gaussian from Gaussian, Inc. – a quantum chemistry program widely used to calculate energies, geometries, frequencies and properties of molecular systems. GROMACS (Groningen Machine for Chemical Simulations) from Groningen University is molecular dynamics program widely used to calculate energies, geometries, frequencies and properties of molecular systems. LS-DYNA is a commercial engineering application used in finite element analysis such as a car collision. The workload used in these comparisons is called 3 Vehicle Collision and is publicly available from [9]. The metric for the benchmark is elapsed time in seconds. Monte Carlo from Oxford Center for Computational Finance (OCCF) – financial simulation engine using Monte Carlo technique [10].
Fig. 4. Platform comparison across HPC selected workloads
PamCrash from ESI Group – an explicit finite-element program well suited for crash simulations. Star-CD is a suite of test cases, selected to demonstrate the versatility and robustness of STAR-CD in computational fluid dynamic solutions. The metric produced is elapsed seconds converted to jobs per day. For more information go to [11]. All these selected workloads have been tested on two dual socket HPC optimized platforms. The Quad-Core Intel Xeon processor X5365 based platform has been used as the baseline to illustrate the improvement (Fig. 4) that the new platform is going to bring in different workloads in a typical HPC scenario. As we can see, the Quad-Core Intel Xeon processor E5472 based platform shows a significant performance improvement of up to 37%. The 12 MB L2 cache and new 1600 MHz FSB help especially in data intensive applications and the workloads where data movement plays an important role. If the task is more CPU intensive the difference is around 10-15%.
6
Conclusion
The new Quad-Core Intel Xeon processors bring 45 nm technology to HPC with all the benefits of superior thermal characteristics and substantial production capability. Following the path of their predecessors, these new products continue to demonstrate performance leadership. From the theoretical performance point of view both generations of Quad-Core Intel Xeon families deliver the same theoretical performance peak, but we have seen that in a real life scenario the new architecture extensions bring a lot of performance improvement; even typically CPU intensive tasks show a better performance in the range of 7-15%. In the future this can be extended when the new set of SSE4 instructions becomes even more widely used and starts providing additional headroom for performance improvement. The Vectorizing Compiler instructions should improve the performance of all those HPC applications which use multiple operands at the same time. These are the areas where the new Quad-Core Intel Xeon processors
bring the biggest improvement – the data intensive workloads. The 50% bigger L2 cache as well as the 1600 MHz FSB in conjunction with the Intel 5400 chipset make the new Quad-Core Intel Xeon processor based platform up to 37% more effective compared to the old one – whilst operating at the same processor frequency. We see a significant performance advantage, ranging from 20-37%, and in addition observe a significant improvement in performance per watt, as the platform stays in the same power envelope. All of this drives significant improvements in user experience for the HPC environment and makes the new platform a compelling choice for many of the new HPC installations.
References 1. Gepner, P., Kowalik, M.F.: Multi-Core Processors: New Way to Achieve High System Performance. In: PARELEC 2006, pp. 9–13 (2006) 2. Smith, J.E., Sohi, G.S.: The Microarchitecture of superscalar processors. Proc. IEEE 83, 1609–1624 (1995) 3. Ronen, R., Mendelson, A., Lai, K., Lu, S.-L., Pollack, F., Shen, J.P.: Coming challenges in microarchitecture and architecture. Proc. IEEE 89, 325–340 (2001) 4. Moshovos, A., Sohi, G.S.: Microarchitectural innovations: Boosting microprocessor performance beyond semiconductor technology scaling. Proc. IEEE 89, 1560–1575 (2001) 5. Ramanathan, R.M.: Intel Multi-Core Processors: Leading the Next Digital Revolution, Technology @ Intel Magazine (2005), http://www.intel.com/technology/magazine/computing/multi-core-0905.pdf 6. Dongarra, J., Luszczek, P., Petitet, A.: Linpack Benchmark: Past, Present, and Future, http://www.cs.utk.edu/∼ luszczek/pubs/hplpaper.pdf 7. http://www.top500.org/ 8. http://amber.ch.ic.ac.uk/amber8.bench1.html 9. http://www.topcrunch.org/ 10. http://www.occf.ox.ac.uk/ 11. http://www.cd-adapco.com/products/STAR-CD/
Heuristics Core Mapping in On-Chip Networks for Parallel Stream-Based Applications Piotr Dziurzanski and Tomasz Maka Szczecin University of Technology, ul. Zolnierska 49, 71-210 Szczecin, Poland {pdziurzanski,tmaka}@wi.ps.pl
Abstract. A novel approach for the implementation of multimedia streaming applications into mesh Network on Chip structures is proposed. We provide a new multi-path routing algorithm together with heuristic algorithms for core mapping so as to minimize the total transfer in the hardware implementation. The proposed approach has been tested with two popular stream-based video decompression algorithms. Experimental results confirming the proposed approach are provided. Keywords: Multimedia streaming applications, On-Chip routing algorithm, IP core mapping, Multi-path routing.
1
Introduction
Computation-intensive multimedia applications are especially well suited for parallel and distributed processing due to their data-dominated algorithms that can be split into a number of stages. These stages can be implemented in separate computational units working in a pipeline-like way and transmitting to each other streams of relatively large, but usually fixed, amounts of data. Some widely-known examples of algorithms of that type are, e.g., MPEG-4, DAB, DVB, and many others. In these applications, it is usually required to keep an assumed quality level of service and meet real-time constraints [7]. Multi Processor Systems on Chips (MPSoCs) are often considered as suitable hardware implementations of these applications [3]. As each processing unit of an MPSoC can realize a single stage of streaming application processing, it is still problematic to connect these units together. The simplest point to point (P2P) connections require too much space, whereas bus-based connections result in a large number of conflicts and, consequently, despite various arbitration techniques, decrease the overall performance of the whole system [4]. Besides, both P2P and bus-based realizations do not scale well with the constantly increasing number of independent Intellectual Property (IP) cores (i.e., computational units) required by contemporary devices dealing with a number of various algorithms in a single system [1]. In order to overcome these obstacles, the packet-based Network-on-Chip (NoC) paradigm for the design of chips realizing distributed computation has been introduced [2]. The recent popularity of this approach can be attributed to a
lower number of conflicts in a chip with a large number of cores. It is reported that NoC architectures offer high bandwidth and good concurrent communication capability, but they require additional mechanisms to overcome problems typical for packet switching communication, such as packet deadlock or starvation, and the techniques known for traditional computer networks have to be altered before being applied to on-chip networks [6]. A mesh is one of the most often used on-chip network topologies owing to its regularity and reliability due to the existence of many redundant interconnections between nodes. In NoCs, each mesh node is comprised of the IP core realizing a particular stage of the algorithm and a router which is typically connected to four neighboring nodes. A typical NoC implementation utilizes a packet switching approach called wormhole routing [5]. In this technique, each packet is split into smaller units of equal length, flits (flow control units). Usually, the first flits contain some routing information, such as the destination address. Having obtained the routing information, a wormhole router selects the next-hop router and establishes the path to that neighboring router. This path is used exclusively for transferring the current package flit by flit until the whole package has been transferred. The next-hop router typically does not store the whole package in its buffers, but tries to establish a connection with another router being selected for the transfer. If another package is to be sent through a connection already used for transferring a package, its transfer is deferred until all flits of the previous package have been sent. This situation is known as contention and may result in a significant decrease in efficiency. Contentions are especially likely to be observed in various data-intensive applications, where large streams of data are transferred every second. This may result in violating the real-time constraints and thus making an MPSoC designed in this way inappropriate for the task. The most popular routing algorithm used in NoCs, named XY, can also be viewed as inappropriate for switching large streams of information. According to this algorithm, a flit is first routed along the X axis as long as the X coordinate is not equal to the X coordinate of the destination core, and then the flit is routed vertically. Despite being deadlock-free [5], this algorithm is not adaptive and thus is not equipped with a mechanism for decreasing the contention level. Taking into consideration the above mentioned facts, it follows that in order to design a NoC-based MPSoC for multimedia streaming applications it is necessary to (i) propose a routing algorithm that is more suitable for this task than the traditional XY algorithm and (ii) propose a mapping scheme of IP cores into mesh nodes that decreases the contention level.
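As a point of reference for the discussion above, the XY routing decision itself fits in a few lines of C; this is only a sketch of the rule, with a coordinate convention and port names that are illustrative assumptions rather than any particular router implementation:

typedef enum { PORT_LOCAL, PORT_EAST, PORT_WEST, PORT_NORTH, PORT_SOUTH } port_t;

/* Deterministic XY routing: move along the X axis until the column matches
   the destination, then along the Y axis until the destination row is reached. */
port_t xy_next_hop(int cur_x, int cur_y, int dst_x, int dst_y)
{
    if (cur_x < dst_x) return PORT_EAST;
    if (cur_x > dst_x) return PORT_WEST;
    if (cur_y < dst_y) return PORT_NORTH;
    if (cur_y > dst_y) return PORT_SOUTH;
    return PORT_LOCAL;  /* the flit has arrived at its destination core */
}

Because the decision depends only on the current and destination coordinates, the rule cannot react to congestion, which is exactly the limitation addressed in the following sections.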
2
Proposed Design Flow
The first stage of the proposed approach is to construct a flow network for a given stream-based algorithm. The processing blocks of the algorithm are identified and the transfers between them are computed. We describe all transfers in a so-called transfer table.
Table 1. Transfers table for H.264 decoder

Source  Destination  Transfer [bps]
1       2            11744051
2       3            503316480
3       4            503316480
4       5            788529152
5       8            788529152
1       5            360710144
1       6            37748736
6       4            11744051
1       7            251658240
7       4            1560281088
8       7            2348810240
An example of the transfer table for the H.264 decoder is presented in Table 1. The numbers in the Source and Destination columns represent indices of the nodes in the flow network. Then, we have to determine the mapping of the cores into the mesh structure leading to improved performance of the NoC structure. The impact of the mapping on the final implementation properties is very significant in the case of the traditional wormhole XY routing approach. For the example of the MPEG-4 decoder [9], the difference in required capacity between the best and the worst mappings is about 203.04 per cent. For example, the XY algorithm applied to the H.264 video decoder [8] for core permutation 0-2-3, 7-8-4, 1-6-5 in the first, second, and third row, respectively, leads to the transfers presented in Fig. 1. In this situation, the maximal transfer between adjacent cores is relatively high, being equal to 2240 Mbit/s. This means that every second such an amount of data is to be transferred between cores 8 and 7, so the NoC infrastructure has to offer capacities large enough to cope with this transfer.
Fig. 1. Transfers between cores for the H.264 decoder, XY routing algorithm (in Mbit/s)
Assuming the most popular regular NoC mesh architecture, all the links have to have equal capacities, so all of them have to be capable of transferring 2240 Mbit/s. However, the majority of the remaining links are utilized at a small percentage of this maximal value. This may be expressed with the standard deviation value, which is equal to about 598.36 Mbit/s. Thus, we assume that the standard deviation expresses the transfer balancing level: the smaller the standard deviation, the closer the transfers are to each other. Moreover, only 13 links out of 24 are utilized, which results in unbalanced transfers and poor utilization of the available resources. This is our main motivation to introduce a routing scheme we named tapeworm routing. The efficiency of this routing algorithm depends on the structure of cores and their connections to each other, which is defined by a mapping of the flow network nodes' functionality onto NoC cores. Having selected an appropriate mapping, it is important to balance transfers between each path in the NoC structure. Our mapping algorithms need as input a complete list of data transfers in the network flow built at the previous stage. The proposed technique takes advantage of the well-known Ford-Fulkerson method for determining the maximal throughput of the network between a set of cores. An example of three successive steps of the tapeworm algorithm is presented in Fig. 2. In this figure, the numbers written above links denote a flow and the remaining available capacity. The data length to be sent between cores S and D is equal to 70 bits. At the first stage, 30 bits of the link between routers 2 and 3 have already been allocated. As the available capacity between routers 1 and 2 (i.e., the link selected by the XY rule) is only 50 bits, 50 bits are sent over this link, and the remaining 20 bits are sent by the alternative route to the 4th router. In router 2, the package is further segmented: 20 bits follow the path to router 3, whereas 30 bits are sent to router 5. Thus, the total data length sent between routers 5 and 6 is equal to 50 bits. A few algorithms for heuristic core mapping into a NoC structure are provided in the following section.
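Before moving on to the mapping heuristics, the splitting step illustrated by the 70-bit example above can be sketched in C as follows; the data types and the capacity bookkeeping are simplified assumptions made for this example, not the authors' implementation:

typedef struct {
    long capacity;   /* total link capacity (e.g., in bits)        */
    long allocated;  /* capacity already reserved by earlier flows */
} link_t;

/* Split 'amount' between the link preferred by the XY rule and an alternative
   link: the preferred link receives as much as its free capacity allows, the
   remainder is diverted, as in the 50 + 20 bit split of the example above. */
void tapeworm_split(link_t *preferred, link_t *alternative, long amount,
                    long *sent_pref, long *sent_alt)
{
    long free_pref = preferred->capacity - preferred->allocated;
    if (free_pref < 0) free_pref = 0;

    *sent_pref = (amount <= free_pref) ? amount : free_pref;
    *sent_alt  = amount - *sent_pref;

    preferred->allocated   += *sent_pref;
    alternative->allocated += *sent_alt;
}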
3
On-Chip Cores Mapping Heuristics
In order to determine the appropriate mapping of cores into NoC nodes, i.e., the mapping leading to the minimal value of transfers between cores while using the tapeworm algorithm, it is possible to use an exact algorithm. However, its application is only reasonable for NoC mesh sizes smaller than 4x4 cores due to its immense computational complexity O(n! · n²), where n is the number of cores. For larger n, however, it is possible to use heuristic algorithms that do not significantly worsen the final result, as shown in the sequel of this paper. Below, we propose three heuristics that, according to experimental results (presented in Section 4), lead to results close to the exact approach. We start our algorithm by generating a population of random core permutations to be mapped into the NoC structure. Then, we execute one of the heuristics provided below (Fig. 3-5) for every permutation a number of times.
Fig. 2. Successive steps of the tapeworm algorithm
1. src ← select randomly a core c1
2. do
3.    nDir ← random direction (Left, Right, Up, Down)
4. while (move toward nDir direction is possible)
5. dest ← select the adjacent core of c1, c2, in the nDir direction
6. if (the exchange between src and dest cores decreases the total transfer)
7.    swap(c1, c2)

Fig. 3. Pseudo-code of heuristics 1 for cores mapping
1. select randomly an item from NoC transfers table between src and dest
2. do
3.    nDir ← random direction (Left, Right, Up, Down)
4. while (move dest toward nDir direction is possible)
5. dest ← select the adjacent core of dest in the nDir direction
6. if (the exchange between src and dest cores decreases the total transfer)
7.    swap(src, dest)

Fig. 4. Pseudo-code of heuristics 2 for cores mapping
1. select randomly an item from NoC transfers table between src and dest
2. d0 ← calculate Manhattan distance between src and dest cores
4. (d1, d2, d3, d4) ← calculate Manhattan distance between src and all neighbors of dest
5. if (min(d0, d1, d2, d3, d4) ≠ d0)
6.    nmin ← neighbor with the lowest distance
7.    swap(dest, nmin)

Fig. 5. Pseudo-code of heuristics 3 for cores mapping
We decided to terminate when a maximum number of generations has been produced, or when the obtained transfers do not decrease for a specific number of steps. As we tested our approach on multimedia applications split into a relatively low number of cores (9), we could compare the obtained results with the exact solutions.
In the first heuristic (Fig. 3), we randomly select two neighboring cores, and then compute the total number of bits transmitted in a second for two cases: (i) for the existing core mapping and (ii) for the one obtained after the exchange of the two selected cores. If the latter is characterized by a lower transfer, the exchange is performed. In the second approach (Fig. 4), we randomly select a single transfer from the transfer table of the algorithm to be mapped. Next, a direction from the set {left, right, up, down} is selected, and a new mapping is formed in which the destination core is exchanged with the core situated directly on the selected side. Similarly to the previous heuristic, the total transfers before and after the exchange are computed and, if the second one is lower, the exchange is carried out. In the third heuristic (Fig. 5), we also select a single transfer from the transfer table. Then, we calculate the Manhattan distance between the source and the destination cores of the selected transfer. The Manhattan distances between the source and all the neighbors of the destination core are also computed and, if the distance is lower for any of the neighbors, that neighbor is exchanged with the destination core. In case more than one neighbor is characterized by a lower distance, the one with the lowest distance is chosen. It is worth stressing that in the case of this approach there is no need to compute the total transfer in the NoC, which is time-saving, as presented in the next section.
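As an illustration, one step of the third heuristic can be written in C as below; the mesh dimensions, the row-by-row position numbering and the swap on a position array are simplifying assumptions made for this sketch, not the authors' code:

#include <stdlib.h>

#define MESH_W 3   /* illustrative 3x3 mesh of 9 cores */
#define MESH_H 3

static int manhattan(int a, int b)       /* positions numbered row by row */
{
    return abs(a % MESH_W - b % MESH_W) + abs(a / MESH_W - b / MESH_W);
}

/* One step of heuristic 3: 'src' and 'dest' are the mesh positions currently
   holding the selected transfer's source and destination cores.  If a
   neighbour of dest lies closer to src (Manhattan distance), swap the cores
   mapped to dest and to that neighbour. */
void heuristic3_step(int mapping[MESH_W * MESH_H], int src, int dest)
{
    static const int dx[4] = { -1, 1, 0, 0 };
    static const int dy[4] = { 0, 0, -1, 1 };
    int best = dest, best_d = manhattan(src, dest);

    for (int k = 0; k < 4; k++) {
        int nx = dest % MESH_W + dx[k], ny = dest / MESH_W + dy[k];
        if (nx < 0 || nx >= MESH_W || ny < 0 || ny >= MESH_H) continue;
        int n = ny * MESH_W + nx;
        int d = manhattan(src, n);
        if (d < best_d) { best_d = d; best = n; }
    }
    if (best != dest) {                   /* exchange the two mapped cores */
        int tmp = mapping[dest];
        mapping[dest] = mapping[best];
        mapping[best] = tmp;
    }
}

Note that, consistently with the remark above, this variant never evaluates the total transfer in the NoC.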
4
Experimental Results
In order to verify the approach provided above, we have chosen the MPEG-4 and H.264 decoders and decided to execute all the heuristics described in the previous section to determine the permutation mapping that is characterized by the lowest total transfer in the NoC realization. For the selected decoders, we have built flow networks and determined the amount of data transmitted between their nodes every second. We executed an implementation of the proposed approaches 2000 times and concluded that in every case a local minimum has been found at a relatively early iteration (the solution was the same as the one obtained with the exact algorithm). For the first two approaches, the local minimum has been found no later than in the 367th iteration, whereas for the 3rd approach it was no later than the 1684th iteration. However, the vast majority of the found local minima have been found much earlier, as depicted by the histograms presented in Fig. 6 and Fig. 7 for H.264 and MPEG-4, respectively.

Table 2. Comparison of the mean transfers and standard deviations obtained with the proposed approaches (in Mbit/s)

Algorithm   Approach 1 (Mean / StdDev)   Approach 2 (Mean / StdDev)   Approach 3 (Mean / StdDev)
MPEG-4      1734.74 / 9.03               1734.63 / 9.1                1726.75 / 1.96
H.264       828.32 / 2.79                828.39 / 2.64                829.21 / 2.64
Fig. 6. Minima histogram for H.264 (number of minima vs. iterations, for approaches 1-3)
Fig. 7. Minima histogram for MPEG-4 (number of minima vs. iterations, for approaches 1-3)
Only a few instances have been found in iterations later than 300. In Tab. 2, we have presented the average transfers and standard deviations for a set of 1000 algorithm executions. All three approaches resulted in values close to the minimum.
For H.264 (MPEG-4), we obtained the global minimum in 59.55, 54.9, and 26.7 (47.15, 48.4, and 96.85) per cent of cases for the first, second, and third approach, respectively. It is important to stress that in the first two approaches the total transfer has to be determined a number of times in each iteration. We have measured that for 200 iterations the transfer has to be determined 60100 times. As our tapeworm routing runs for 8.224 seconds on average (Intel Pentium 4 CPU 3 GHz, 1 GB RAM memory), the running time for the first two approaches is less than 6 days. On the other hand, the third approach produces results in 0.5 s. Although the first time seems huge, especially in comparison with the second one, it is still much less than that of the exact algorithm, which is practically unacceptable even for 4x4 cores (a few million years of computation). The execution time of the first two approaches depends only polynomially on the size of the mesh.
5
Conclusion
We described our approach for implementing data-intensive streaming multimedia applications in a Network on Chip based on the mesh topology. We focused on two phases: mapping of algorithm stages onto the target NoC's nodes and developing a new multi-path routing algorithm. Both our proposals benefit from the fact that the static streams of data transferred between IP cores are known at the design stage. We provided three heuristic mapping algorithms and compared them with the exact solutions. The provided experimental results, based on two real-life multimedia decoder examples, showed that in the majority of cases our heuristics give results equal to the exact solution at relatively early iterations. The computational complexity of the proposed heuristics is polynomial with respect to the number of cores, whereas the complexity of the exact solution is intractable, being estimated as O(n! · n²), where n is the number of IP cores to be mapped. Consequently, the proposed technique allows us to practically solve the problem of mapping a streaming multimedia application into a NoC-based MPSoC in a reasonable time so that the data transfers between IP cores are balanced and close to the global minimum.
References 1. Bjerregaard, T., Mahadevan, S.: A Survey of Research and Practices of Networkon-Chip. ACM Computing Surveys (CSUR) 38, Article 1 (2006) 2. Dally, W.J., Towels, B.: Route Packets, Not Wires: On-Chip Interconnection Networks. In: 38th ACM IEEE Design Automation Conference (DAC), pp. 684–689 (2001) 3. Kavaldjiev, N., et al.: Routing of guaranteed throughput traffic in a network-on-chip. Technical Report TR-CTIT-05-42 Centre for Telematics and Information Technology, University of Twente, Enschede (2005) 4. Lee, H.G., Chang, N., Ogras, U.Y., Marculescu, R.: On-chip communication architecture exploration: A quantitative evaluation of point-to-point, bus, and networkon-chip approaches. ACM Transactions on Design Automation of Electronic Systems 12(3), article no. 23 (2007)
5. Li, M., Zeng, Q.A., Jone, W.B.: DyXY: a proximity congestion-aware deadlockfree dynamic routing method for network on chip. In: 43rd ACM IEEE Design Automation Conference (DAC), pp. 849–852 (2006) 6. Ogras, U.Y., Marculescu, R.: Prediction-based Flow Control for Network-on-Chip Traffic. In: 43rd ACM IEEE Design Automation Conference (DAC), pp. 839–844 (2006) 7. Smit, G.J.M., et al.: Efficient Architectures for Streaming DSP Applications, Dynamically Reconfigurable Architectures. Internationales Begegnungs- und Forschungszentrum fuer Informatik (IBFI), Schloss Dagstuhl, Germany (2006) 8. van der Tol, E.B., Jaspers, E.G.T., Gelderblom, R.H.: Mapping of H.264 decoding on a multiprocessor architecture. In: Image and Video Communications and Processing, Santa Clara, CA, USA, vol. 5022, pp. 707–718 (January 2003) 9. van der Tol, E.B., Jaspers, E.G.T.: Mapping of MPEG-4 Decoding on a Flexible Architecture Platform. In: Media Processors 2002, San Jose, CA, USA, vol. 4674, pp. 362–363 (January 2002)
Max-Min-Fair Best Effort Flow Control in Network-on-Chip Architectures Fahimeh Jafari1,2, Mohammad H. Yaghmaee1, Mohammad S. Talebi2, and Ahmad Khonsari3,2 1
Ferdowsi University of Mashhad, Mashhad, Iran 2 IPM, School of Computer, Tehran, Iran 3 ECE Department, University of Tehran, Tehran, Iran {jafari,ak}@ipm.ir, [email protected], [email protected]
Abstract. Network-on-Chip (NoC) has been proposed as an attractive alternative to traditional dedicated busses in order to achieve modularity and high performance in future System-on-Chip (SoC) designs. Recently, end to end flow control has gained popularity in the design process of network-on-chip based SoCs. Where flow control is employed, fairness issues need to be considered as well. In fact, one of the most difficult aspects of flow control is that of treating all sources fairly when it is necessary to turn traffic away from the network. In this paper, we propose a flow control scheme which admits the Max-Min fairness criterion for all sources. In fact, we formulate the Max-Min fairness criterion for the NoC architecture and present an implementation to be used as a flow control mechanism. Keywords: Network-on-Chip, flow control, Max-Min fairness.
1 Introduction Network-on-Chip (NoC) is a new paradigm for designing future System-on-Chips (SoC) [1]. A typical NoC architecture provides a scalable communication infrastructure for interconnecting cores. Since the communication infrastructure as well as the cores from one design can be easily reused for a new product, NoC provides maximum possibility for reusability. NoCs with their flexible and scalable interconnect provide high computational power to support computationally extensive multimedia applications, i.e. those that combine audio, video and data. In contrast to simple data applications, which can work without guarantees of timing of data delivery, multimedia applications require a guaranteed degree of service in terms of required bandwidth and timeliness. According to the networking terminology, we refer to the traffic of simple data as elastic or Best Effort (BE) traffic and to multimedia traffic as inelastic or Guaranteed Service (GS) traffic. Due to the rapid growth of the number of processing elements (PEs) in NoCs [2], employing an efficient policy for flow control is inevitable in the design of NoCs to
provide the required Quality of Service (QoS). A NoC should support network level flow control in order to avoid congestion in the bottleneck links, i.e. link through which several sources pass [3]. The design and control of NoCs raises several issues well suited to study using techniques of operational research such as optimization and stochastic modeling. Recently, some novel researches have been embarked in studying congestion control in NoCs [4-5]. Congestion control schemes in NoCs mainly focus on utilizing NoC’s resources, with the aim of minimizing network cost or maximizing network utility while maintaining the required QoS for Guaranteed Service traffics. Many strategies for flow control have been proposed for off-chip networks, e.g. data networks, etc. [6-9]. On-chip networks pose different challenges. For instance, in off-chip environments, to overcome congestion in links, packet dropping is allowed. On the contrary, reliability of on-chip wires makes NoCs a loss-less environment. So far, several works have addressed this problem for NoC systems. In [4], a prediction-based flow-control strategy for on-off traffic in on-chip networks is proposed where the prediction is used in router to be aware of buffer fillings. In [5] a flowcontrol scheme for Best Effort traffic based on Model Predictive Control is presented, in which link utilization is used as congestion measure. Dyad [10] controls the congestion by switching from deterministic to adaptive routing when system is going to be congested. [11] proposes a flow control scheme as the solution to rate-sum maximization problem for choosing the BE source rates. The solution to the rate-sum optimization problem is presented as a flow control algorithm. Where flow control is employed, fairness issues need to be considered as well [3]. In fact, one of most difficult aspects of flow control is to choose a policy to accommodate a fair rate allocation. All of the abovementioned studies only regarded the flow control by taking into account the constraints of the system and to the best of our knowledge no policy to maintain fairness among sources was chosen. The fairness of TCP-based flow control algorithms was first analyzed in [12]. The analysis in [12] was based on a single bottleneck link. Different flow control approaches can be classified with respect to the fairness criteria, in favor of which rate allocation is done. One of the famous forms of fairness criterion is Max-Min fairness, which has been discussed in earlier literature and described clearly in [13]. Our main contribution in this paper is to present a flow control scheme for Best Effort traffic in NoC which satisfies Max-Min fairness criterion. Our framework is mainly adopted from the seminal work [13] which presents a basic Max-Min fairness optimization problem. In this paper, we reformulate such a problem for the NoC architecture. The organization of the paper is as follows. In Section 2 we present the system model, the concept of Max-Min fairness and formulation of the flow control as an optimization problem. In section 3 we present an iterative algorithm as the solution to the flow control optimization problem. Section 4 presents the simulation results and discussion about them. Finally, the section 5 concludes the paper and states some future work directions.
2 System Model We consider a NoC with two dimensional mesh topology, a set S of sources and a set L of bidirectional links. Let c_l be the finite capacity of link l ∈ L. The NoC is assumed to use wormhole routing. In wormhole-routed networks, each packet is divided into a sequence of flits which are transmitted over physical links one by one in a pipeline fashion. The NoC architecture is also assumed to be lossless, and packets traverse the network on a shortest path using a deadlock free XY routing. A source consists of Processing Elements (PEs), routers and Input/Output ports. Each link is a set of wires, busses and channels that are responsible for connecting different parts of the NoC. We denote the set of sources that share link l by S(l). Similarly, the set of links that source s passes through is denoted by L(s). By definition, s ∈ S(l) if and only if l ∈ L(s). We assume that there are two types of traffic in the NoC: GS and BE traffic. For notational convenience, we divide S into two parts, each one representing sources with the same kind of traffic. In this respect, we denote the set of sources with BE and GS traffic by S_BE and S_GS, respectively. Each link l is shared between the two aforementioned traffic types. GS sources will obtain the required amount of the capacity of links and the remainder should be allocated to BE sources. 2.1 Max-Min Fairness Concept Any discussion of the performance of a rate allocation scheme must address the issue of fairness, since there exist situations where a given scheme might maximize network throughput, for example, while denying access for some users or sources. Max-Min fairness is one of the significant fairness criteria. Crudely speaking, a set of rates is max-min fair if no rate can be increased without simultaneously decreasing another rate which is already smaller. In a network with a single bottleneck link, max-min fairness simply means that flows passing through the bottleneck link would have equal rates. The following definition states the formal definition of Max-Min fairness. Definition 1. A feasible rate allocation x = (x_s, s ∈ S) is said to be "max-min fair" if and only if an increase of any rate within the domain of feasible allocations must be at the cost of a decrease of some already smaller rate. Formally, for any other feasible allocation y, if y_s > x_s then there must exist some s′ such that x_s′ ≤ x_s and y_s′ < x_s′ [13]. Depending on the network topology, a max-min fair allocation may or may not exist. However, if it exists, it is unique (see [14] for proof). In what follows, the condition under which the Max-Min rate allocation exists will be stated. Before we proceed to this condition, we define the concept of bottleneck link.
Definition 2. With our system model above, we say that link l is a bottleneck for source s if and only if
1. link l is saturated: ∑_{s∈S_BE(l)} x_s + ∑_{s∈S_GS(l)} x_s = c_l
2. source s on link l has the maximum rate among all sources using link l.
Intuitively, a bottleneck link for source s is a link which limits x_s. Theorem 1. A max-min fair rate allocation exists if and only if every source has a bottleneck link (see [14] for proof). 2.2 Flow Control Model Our focus will be on two objectives. First, choosing the source rates (IP loads) of BE traffic so as to accomplish flow control in response to demands at a reasonable level. Second, maintaining Max-Min fairness for all sources. We model the flow control problem in NoC as the solution to an optimization problem. For more convenience, we turn the aforementioned NoC architecture into a mathematical model as in [5]. In this respect, the Max-Min fairness flow control problem can be formulated as:
max_x min_{s∈S} x_s                                              (1)
subject to:
∑_{s∈S_BE(l)} x_s + ∑_{s∈S_GS(l)} x_s ≤ c_l,   ∀l ∈ L            (2)
x_s > 0,   ∀s ∈ S_BE                                             (3)
where the source rates, i.e. x_s, s ∈ S, are the optimization variables. Constraint (2) says that the aggregate BE source rates passing through link l cannot exceed its free capacity, i.e. the portion of the link capacity which has not been allocated to GS sources. For notational convenience, we define
u = min_{s∈S} x_s,   ĉ_l = c_l − ∑_{s∈S_GS(l)} x_s               (4)
therefore the above mentioned problem can be rewritten as:
max u                                                            (5)
subject to:
∑_{s∈S_BE(l)} x_s ≤ ĉ_l,   ∀l ∈ L                                (6)
x_s > 0,   ∀s ∈ S_BE                                             (7)
To solve the above problem, it should be converted so as to be in the form of disciplined optimization problems [15] as follows:
max u                                                            (8)
subject to:
u ≤ x_s,   ∀s ∈ S                                                (9)
∑_{s∈S_BE(l)} x_s ≤ ĉ_l,   ∀l ∈ L                                (10)
x_s > 0,   ∀s ∈ S_BE                                             (11)
The above optimization problem can be solved using several methods. In the next section, we introduce a simple and well-known algorithm, known as "progressive filling", to solve (8) iteratively. In order to compare the results of the progressive filling algorithm with the exact values, we solve problem (8) using CVX [16], which is a MATLAB-based software package for disciplined convex optimization problems, whose results will be given in Section 4.
3 Max-Min Fairness Algorithm Theorem 1 is particularly useful in deriving a practical method for obtaining a maxmin fair allocation, called “progressive filling”. The idea is as follows: rates of all flows are increased at the same pace, until one or more links are saturated. The rates of flows passing through saturated links are then frozen, and the other flows continue to increase rates. All the sources that are frozen have a bottleneck link. This is because they use a saturated link, and all other sources using the saturated link are frozen at the same time, or were frozen before, thus have a smaller or equal rate. The process is repeated until all rates are frozen. Lastly, when the process terminates, all sources have been frozen at some time and thus have a bottleneck link. Using Theorem 1, the allocation is max-min fair. Theorem 2. For the system model defined above, with fixed routing policy, there exists a unique max-min fair allocation. It can be obtained by the progressive filling algorithm. (see [14] for proof) In the sequel, we derive the max-min rate allocation as the solution to problem (8) and based on this algorithmic solution, we present a flow control scheme for BE traffic in NoC systems. Thus, the aforementioned algorithm can be employed to control the flow of BE traffic in the NoC. The iterative algorithm can be addressed in distributed scenario. However, due to well-formed structure of the NoC, we focus on a centralized scheme; we use a controller like [5] to be mounted in the NoC to implement the above algorithm. The necessary requirement of such a controller is the ability to accommodate simple mathematical operations and the allocation of few wires to communicate flow control information to nodes with a light GS load.
Algorithm 1. Max-Min Fair (MMF) Flow Control Algorithm for BE in NoC.
Initialization:
1. Initialize ĉ_l of all links.
2. Define:
   a. T as the set of sources not passing through any saturated link.
   b. B as the set of saturated links.
   c. B̄ = L − B and T̄ = S_BE − T.
3. Set the source rate vector to zero.
4. Initialize T = S_BE and B = ∅.
Loop: Do until (T = ∅)
1. Δs = min_{l∈B̄} [ (ĉ_l − ∑_{s∈S_BE} R_ls x_s(t)) / ∑_{s∈T} R_ls ]
2. x_s(t+1) = x_s(t) + Δs,   ∀s ∈ T
3. Calculate new bottleneck links and update B and B̄.
4. ∀s ∈ T: if s passes through any saturated link then T ⇐ T − {s}
Output: Communicate BE source rates to the corresponding nodes.
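A compact C sketch of the progressive-filling loop behind Algorithm 1 is given below. The routing matrix R (R[l][s] = 1 if BE source s crosses link l), the free capacities c_hat and the problem sizes are assumed to be given, and saturation is detected with a crude floating-point tolerance; this is only an illustration of the idea, not the controller implementation described in the text:

#include <math.h>

#define NL 24   /* number of links (illustrative)      */
#define NS 16   /* number of BE sources (illustrative) */

/* Raise all unfrozen rates by the largest common increment that keeps every
   link feasible, then freeze the sources crossing a newly saturated link.
   Repeat until every source is frozen; the result is max-min fair. */
void progressive_filling(const int R[NL][NS], const double c_hat[NL],
                         double x[NS])
{
    int frozen[NS] = { 0 };
    int active = NS;

    for (int s = 0; s < NS; s++) x[s] = 0.0;

    while (active > 0) {
        double delta = INFINITY;
        for (int l = 0; l < NL; l++) {        /* smallest per-source slack */
            double load = 0.0; int unfrozen = 0;
            for (int s = 0; s < NS; s++) {
                if (!R[l][s]) continue;
                load += x[s];
                if (!frozen[s]) unfrozen++;
            }
            if (unfrozen > 0) {
                double d = (c_hat[l] - load) / unfrozen;
                if (d < delta) delta = d;
            }
        }
        for (int s = 0; s < NS; s++)
            if (!frozen[s]) x[s] += delta;

        for (int l = 0; l < NL; l++) {        /* freeze on saturated links */
            double load = 0.0;
            for (int s = 0; s < NS; s++) if (R[l][s]) load += x[s];
            if (c_hat[l] - load < 1e-9)
                for (int s = 0; s < NS; s++)
                    if (R[l][s] && !frozen[s]) { frozen[s] = 1; active--; }
        }
    }
}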
4 Simulation Results In this section we examine the proposed flow control algorithm, listed above as Algorithm 1, for a typical NoC architecture. We have simulated a NoC with 4 × 4 Mesh topology which consists of 16 nodes communicating using 24 shared bidirectional links, each of which has a fixed capacity of 1 Gbps. In our scenario, packets traverse the network on a shortest path using a deadlock free XY routing. We also assume that each packet consists of 500 flits and each flit is 16 bits long. In order to simulate our scheme, some nodes are considered to have GS data, such as multimedia, to be sent to a destination, while other nodes, which may also be in the set of nodes with GS traffic, have BE traffic to be sent. As stated in Section 2, GS sources will obtain the required amount of the capacity of links and the remainder should be allocated to BE traffic. We are mainly interested in investigating the fairness properties among source rates. In order to investigate the rate allocation in the optimal sense, we solved problem (8) using CVX [16], which is a MATLAB-based software package for disciplined convex optimization problems.
Fig. 1. Network topology (4 × 4 mesh of 16 nodes, numbered 0-15)
Optimal source rates, obtained by CVX, are shown in Fig. 2. Source rates obtained from Algorithm 1 are depicted in Fig. 3. The main feature regarding Fig. 2 and Fig. 3 is that both yield equal values for the minimum source rate, i.e. 0.03 Gbps. The main difference is in the aggregate source rate, which is greater for the result of Algorithm 1. In order to compare the results of the proposed Max-Min fair flow control with other fairness criteria, we have accomplished rate allocation based on maximizing the sum of source rates, i.e. the so-called Rate-Sum Maximization, whose results are depicted in Fig. 4. Comparing Fig. 3 with Fig. 4, it is apparent that although the Rate-Sum criterion aims at maximizing the sum of source rates, there is no guarantee for the rates of weak sources, i.e. sources which achieve very small rates. Indeed, in many scenarios with the Rate-Sum criterion, such sources will earn as little as zero. To compare the results of the three above mentioned schemes in more detail, we have considered five parameters featuring the merit of the different schemes, as follows:
1. least source rate
2. sum of source rates
3. variance of source rates with respect to the mean value
4. Jain's fairness index [17]
5. min-max ratio [17]
These parameters are presented in Table 1. Jain's fairness index and the min-max ratio are defined by (12) and (13), respectively.
Jain's Fairness Index = (∑_{s=1}^{S} x_s)² / (S · ∑_{s=1}^{S} x_s²)        (12)

Min-Max Ratio = min_{s∈S} x_s / max_{s∈S} x_s                               (13)
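Both indices are straightforward to evaluate for a given rate vector; a small C helper, written only for illustration (it assumes n > 0 and at least one strictly positive rate), might look as follows:

/* Jain's fairness index (12) and min-max ratio (13) for n source rates. */
void fairness_metrics(const double x[], int n, double *jain, double *min_max)
{
    double sum = 0.0, sum_sq = 0.0, min = x[0], max = x[0];

    for (int s = 0; s < n; s++) {
        sum    += x[s];
        sum_sq += x[s] * x[s];
        if (x[s] < min) min = x[s];
        if (x[s] > max) max = x[s];
    }
    *jain    = (sum * sum) / (n * sum_sq);  /* 1 for a perfectly uniform allocation */
    *min_max = min / max;                   /* 1 for a perfectly uniform allocation */
}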
From Table 1 we see that rate allocation with the Maximum Rate-Sum criterion yields a slightly greater rate-sum than the Max-Min Fair criterion, i.e. Algorithm 1. However, as discussed above, Algorithm 1 guarantees that the rate allocation is max-min fair, and hence the minimum source rate would not be greater with any other feasible rate allocation; rate allocation is thus carried out in favor of weak sources. On the contrary, Maximum Rate-Sum gives no guarantee for such sources and, as a result, the weakest source achieves a rate as low as zero. Another point worth mentioning is that the rate allocation is closer to a uniform allocation in the Max-Min scheme. To be more precise, we have calculated the variance of source rates with respect to the mean value of source rates in equilibrium. Table 1 shows that the variance of the Max-Min rate allocation, obtained from Algorithm 1, is evidently less than that of the Maximum Rate-Sum scheme, which in turn implies the inherent fairness of the Max-Min rate allocation.
Fig. 2. Rate allocation using CVX results (source rate in ×10^8 bps for sources 1-16)
Fig. 3. Rate allocation using Algorithm 1 (source rate in ×10^8 bps for sources 1-16)
Fig. 4. Rate allocation using Rate-Sum Maximization (source rate in ×10^8 bps for sources 1-16)

Table 1. Quantitative comparison between different rate allocation schemes
Scheme                              Least Rate (×10^8 bps)   Sum of Source Rates (×10^8 bps)   Variance   Fairness Index   Min-Max Ratio
Max-Min Fair (Mathematical Model)   0.310                    10.079                            0.1558     0.7181           0.1856
Max-Min Fair (Algorithm 1)          0.310                    13.545                            0.5004     0.5888           0.1148
Maximum Rate-Sum                    0                        15.349                            1.1974     0.4346           0
5 Conclusion In this paper we addressed the flow control problem for BE traffic in NoC systems. We considered two objectives. First, choosing source rates (IP loads) of BE traffic so as to accomplish flow control in response to demands at a reasonable level. Second, maintaining Max-Min fairness for all sources. Flow control was modeled as a simple algorithmic solution to an optimization problem. The algorithm can be implemented by a controller with light computation and communication overhead. Finally, we compared the results of the proposed Max-Min fair flow control with the Rate-Sum Maximization scheme based on several criteria such as Jain's fairness index and the min-max ratio. The comparison shows that, using the proposed flow control scheme, the rate allocation has a larger fairness index, which indicates that the proposed scheme allocates NoC resources in a fair manner.
References 1. Benini, L., DeMicheli, G.: Networks on Chips: A New SoC Paradigm. Computer Magazine of the IEEE Computer Society 35(1), 70–78 (2002) 2. Dally, W.J., Towles, B.: Route Packets, Not Wires: On-Chip Interconnection Networks. In: Design Automation Conference, pp. 684–689 (2001)
3. Cidon, I., Keidar, I.: Zooming in on Network on Chip Architectures. Technion Department of Electrical Engineering (2005) 4. Ogras, U.Y., Marculescu, R.: Prediction-based flow control for network-on-chip traffic. In: Proceedings of the Design Automation Conference (2006) 5. van den Brand, J.W., Ciordas, C., Goossens, K., Basten, T.: Congestion- Controlled BestEffort Communication for Networks-on-Chip. In: Design, Automation and Test in Europe Conference and Exhibition, pp. 948–953 (2007) 6. Kelly, F.P., Maulloo, A., Tan, D.: Rate control for communication networks: Shadow prices, proportional fairness, and stability. J. Oper. Res. Soc. 49(3), 237–252 (1998) 7. Mascolo, S.: Classical control theory for congestion avoidance in high-speed internet. In: Decision and Control IEEE Conference, vol. 3, pp. 2709–2714 (1999) 8. Gu, Y., Wang, H.O., Hong, Y., Bushnell, L.G.: A predictive congestion control algorithm for high speed communication networks. In: American Control Conference, vol. 5, pp. 3779–3780 (2001) 9. Yang, C., Reddy, A.V.S.: A taxonomy for congestion control algorithms in packet switching networks. J. IEEE Network 9(4), 34–45 (1995) 10. Hu, J., Marculescu, R.: DyAD - smart routing for networks-on-chip. In: Design Automation Conference, pp. 260–263 (2004) 11. Talebi, M.S., Jafari, F., Khonsari, A., Yaghmae, M.H.: A Novel Congestion Control Scheme for Elastic Flows in Network-on-Chip Based on Sum-Rate Optimization. In: International Conference on Computational Science and its Applications, pp. 398–409 (2007) 12. Chiu, D.M., Jain, R.: Analysis of the increase and decrease algorithms for congestion avoidance in computer networks. J. Computer Networks and ISDN Systems 17(1), 1–14 (1989) 13. Bertsekas, D.P., Gallager, R.: Data Networks. Prentice-Hall, Englewood Cliffs (1992) 14. Le Boudec, J.Y.: Rate adaptation, Congestion Control and Fairness: A Tutorial. Ecole Polytechnique Fédérale de Lausanne (EPFL) (2001) 15. Bertsekas, D.P.: Nonlinear Programming. Athena Scientific (1999) 16. Grant, M., Boyd, S., Ye, Y.: CVX (Ver. 1.0RC3): Matlab Software for Disciplined Convex Programming, \url{http://www.stanford.edu/boyd/cvx} 17. Jain, R., Chiu, D., Hawe, W.: A Quantitative Measure of Fairness and Discrimination for Resource Allocation in Shared Computer Systems. DEC Research Report TR-301 (1984)
Fast Quadruple Precision Arithmetic Library on Parallel Computer SR11000/J2 Takahiro Nagai1 , Hitoshi Yoshida1 , Hisayasu Kuroda1,2 , and Yasumasa Kanada1,2 1
Dept. of Frontier Informatics, The University of Tokyo, 2-11-16 Yayoi Bunkyo-ku Tokyo, Japan {takahiro.nagai,hitoshi.yoshida}@klab.cc.u-tokyo.ac.jp 2 The Information Technology Center, The University of Tokyo, 2-11-16 Yayoi Bunkyo-ku Tokyo, Japan {kuroda,kanada}@pi.cc.u-tokyo.ac.jp
Abstract. In this paper, fast quadruple precision arithmetic for the four kinds of basic operations and multiply-add operations is introduced. The proposed methods provide a maximum speed-up factor of 5 times compared to gcc 4.1.1 on the POWER 5+ processor used in the parallel computer SR11000/J2. We also developed a fast quadruple precision vector library optimized for the POWER 5 architecture. Quadruple precision numbers, which are a 128 bit long double data type, are emulated with a pair of 64 bit double data type values on the POWER 5+ processor used in the SR11000/J2 with the Hitachi Optimizing Compiler and gcc 4.1.1. To avoid rounding errors in computing quadruple precision arithmetic operations, emulation needs high computational cost. The proposed methods focus on optimizing the number of registers and instruction latency.
1
Introduction
Some numerical methods require much more computational work due to rounding errors as the scale of a problem increases. For example, the CG method, one of the solutions for the linear equation Ax=b using Krylov subspaces, is affected by computation errors on large scale problems. Floating point arithmetic operations generate rounding errors because a real number is approximated with a finite number of significant figures. To reduce errors in floating point arithmetic, quadruple precision arithmetic, i.e. higher precision arithmetic, is required. A quadruple precision number, which is a 128 bit long double data type, can be emulated with a pair of 64 bit double precision numbers on the POWER 5 architecture by the run-time routine. The cost of the quadruple precision operations is much higher than that of the double precision operations. In this paper, we present fast quadruple precision arithmetic for the four basic arithmetic operations, i.e. {+, −, ×, ÷} and the multiply-add operation, and a vector library for POWER 5 architecture based machines such as the parallel computer SR11000/J2. We implemented
Table 1. IEEE 754 data type of 64 bit double and 128 bit long double on SR11000/J2

  Data type    Total bit length  Exponent bit length  Exponent range  Significand bit length  Significant figures (decimal)
  IEEE 754     64                11                   −1022 ∼ 1023    52                      about 16.0
  SR11000/J2   128               11 × 2               −1022 ∼ 1023    52 × 2                  about 31.9
We implemented fast quadruple precision arithmetic and built a quadruple precision vector library including the four basic operations and the multiply-add operation. We achieved a maximum speed-up factor of about 5 times over gcc 4.1.1.
2 128-Bit Long Double Floating Point Data Type
POWER 5+ processors, the CPUs of the SR11000/J2, have 64 bit floating point registers. On this 64 bit architecture, a pair of 64 bit registers is used by software to store a quadruple precision floating point value. Quadruple precision can handle numbers with about 31 decimal digits of precision, compared to the (1 + 52) × log10 2 ≈ 16 digits handled by double precision. The point to notice is that the exponent range is the same as that of double precision: although the precision is greater, the magnitude of representable numbers is the same as for 64 bit double precision numbers. That is, while the 128 bit data type can store numbers with more precision than the 64 bit data type, it does not store numbers of greater magnitude. The details are as follows. Each quadruple precision value consists of two 64 bit floating point numbers, each with sign, exponent and significand. The data format, following the IEEE 754 standard, is summarized in Table 1 [5]. Typically, the low-order part has a magnitude that is less than 0.5 units in the last place of the high-order part, so the values of the two parts never overlap and the entire significand of the low-order number adds precision beyond the high-order number.
3 Quadruple Precision Arithmetic
All of the algorithms that follow operate on double or quadruple precision data types and assume the round-to-nearest rounding mode. We denote the floating point operations {+, −, ×, ÷} by {⊕, ⊖, ⊗, ⊘} respectively. For example, floating point addition satisfies a + b = fl(a + b) + err(a + b) exactly; we use a ⊕ b = fl(a + b) to denote the rounded result of the addition, and err() is the error caused by the operation. We now explain the quadruple precision arithmetic operations, which consist of two basic algorithms, Quick-TwoSum() and TwoSum(). These 64 bit double precision algorithms are already used and implemented in gcc 4.1.1 for the 128 bit long double data type [3] and are explained in papers [1,2,4,8]. This 128 bit long double data type does not support the IEEE special numbers NaN and INF. The quadruple precision algorithms introduced in this paper do not satisfy full IEEE compliance.
3.1 About Precision
There are two kinds of quadruple precision addition algorithms with respect to accuracy:

– accuracy of a full 106 bit significand
– accuracy of about 106 bits, permitting a few bits of rounding error in the last part

The latter method needs only about half the number of instructions of the former, because it omits the extra instructions for error compensation. In this paper, we select the latter algorithm, focusing on speeding up quadruple precision arithmetic. We have already quantitatively analyzed quadruple precision addition and multiplication [11]. Here we introduce the quadruple precision algorithms as optimized and implemented in the vector library.

3.2 Addition
The quadruple precision addition algorithm Quad-TwoSum(a, b), which is built on the floating point addition TwoSum(), computes (sH, sL) = fl(a + b). Here, (sH, sL) represents s with s = sH + sL; sH and sL are 64 bit values holding the high-order and low-order parts respectively, while a, b and s are 128 bit values. We do not need to split the quadruple precision numbers explicitly because each number is stored in memory as a pair of 64 bit values automatically. The TwoSum() algorithm [2] computes s = fl(c + d) and e = err(c + d).

Quad-TwoSum(a, b) {
  (t, r) ← TwoSum(aH, bH)
  e ← r ⊕ aL ⊕ bL
  sH ← t ⊕ e
  sL ← (t ⊖ sH) ⊕ e
  return (sH, sL)
}
TwoSum(c, d) {
  s ← c ⊕ d
  v ← s ⊖ c
  e ← (c ⊖ (s ⊖ v)) ⊕ (d ⊖ v)
  return (s, e)
}
We have to pay attention that both c and d are not 128 bit values but 64 bit doubles. Quad-TwoSum() is the addition routine for quadruple precision numbers with error compensation, built on the TwoSum() algorithm. The number of operation steps is 11 Flop (FLoating point OPerations): 6 Flop from TwoSum() plus 5 Flop from the remaining additions and subtractions. We see that this quadruple precision addition requires 11 times more operations than the 1 Flop of a double precision addition. Throughout this paper Flop denotes the number of floating point operations.
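As an illustration, the two algorithms above translate almost line by line into portable C. The following is a minimal sketch, assuming IEEE 754 round-to-nearest double arithmetic and a compiler that does not re-associate floating point expressions (e.g. no -ffast-math); the quad_t pair type and the function names are ours for illustration and are not part of the library interface described in Section 6.

/* (sH, sL) pair representing s = hi + lo */
typedef struct { double hi, lo; } quad_t;

/* TwoSum(): s = fl(c + d), e = err(c + d) */
static void two_sum(double c, double d, double *s, double *e)
{
    *s = c + d;
    double v = *s - c;
    *e = (c - (*s - v)) + (d - v);
}

/* Quad-TwoSum(): quadruple precision addition, as in the pseudocode above */
static quad_t quad_two_sum(quad_t a, quad_t b)
{
    double t, r;
    two_sum(a.hi, b.hi, &t, &r);
    double e = r + a.lo + b.lo;
    quad_t s;
    s.hi = t + e;
    s.lo = (t - s.hi) + e;
    return s;
}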
3.3 Multiplication
The quadruple precision multiplication algorithm Quad-TwoProd(a, b) computes (pH, pL) = fl(a × b), where (pH, pL) represents p with p = pH + pL; again a, b and p are 128 bit values.

Quad-TwoProd(a, b) {
  m1 ← aH ⊗ bL
  t ← aL ⊗ bH ⊕ m1
  pH ← aH ⊗ bH ⊕ t
  e ← aH ⊗ bH ⊖ pH
  pL ← e ⊕ t
  return (pH, pL)
}

Some processors have an FMA (Fused Multiply-Add) instruction that can compute expressions such as a × b ± c with a single rounding. The merit of this instruction is that there is no double rounding for an addition following a multiplication. The FMA instruction is comparatively fast because it is implemented in hardware, just like the addition and multiplication instructions. The POWER processor series supports FMA, so we built the multiplication algorithm on the FMA instruction. A quadruple precision multiplication costs 8 Flop.
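For processors with an FMA instruction, the same algorithm can be sketched in C with the C99 fma() function from <math.h>, which rounds a × b + c only once. As before, this is an illustrative sketch rather than the library implementation; the quad_t type is repeated here so that the fragment stands alone.

#include <math.h>

typedef struct { double hi, lo; } quad_t;    /* (pH, pL) pair */

/* Quad-TwoProd(): quadruple precision multiplication using fused multiply-add */
static quad_t quad_two_prod(quad_t a, quad_t b)
{
    quad_t p;
    double m1 = a.hi * b.lo;
    double t  = fma(a.lo, b.hi, m1);     /* aL*bH + m1 with one rounding */
    p.hi      = fma(a.hi, b.hi, t);      /* aH*bH + t  with one rounding */
    double e  = fma(a.hi, b.hi, -p.hi);  /* aH*bH - pH with one rounding */
    p.lo      = e + t;
    return p;
}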
3.4 Division
The quadruple precision division algorithm Quad-TwoDiv(a, b) computes (qH, qL) = fl(a ÷ b), where (qH, qL) represents q with q = qH + qL; again a, b and q are 128 bit values.

Quad-TwoDiv(a, b) {
  d1 ← 1.0 ⊘ bH
  m1 ← aH ⊗ d1
  e1 ← −(bH ⊗ m1 ⊖ aH)
  m1 ← d1 ⊗ e1 ⊕ m1
  m2 ← −(bH ⊗ m1 ⊖ aH)
  m2 ← aL ⊕ m2
  m2 ← −(bL ⊗ m1 ⊖ m2)
  m3 ← d1 ⊗ m2
  m2 ← −(bH ⊗ m3 ⊖ m2)
  m2 ← d1 ⊗ m2 ⊕ m3
  qH ← m1 ⊕ m2
  e2 ← m1 ⊖ qH
  qL ← m2 ⊕ e2
  return (qH, qL)
}

This algorithm is based on the Newton-Raphson method. The number of operation steps is 18 Flop plus 1 double precision division. The definition of Flop
does not include the double precision division because it is costly compared to double precision addition and multiplication. This algorithm is applicable in the usual case where special numbers such as NaN or INF are not generated by the initial double precision division (1.0 ⊘ bH).
4 Speeding-Up the Quadruple Precision Arithmetic
We quantitatively evaluated each of the addition and multiplication algorithms on the parallel computer SR11000/J2 at the Information Technology Center, the University of Tokyo. In terms of the number of operations, addition takes 11 Flop and multiplication takes 8 Flop; division takes 18 Flop plus 1 double precision division. From this analysis in terms of addition and multiplication counts it follows that, provided the latencies of floating point instructions such as fadd, fmul and fmadd are the same, performance can be improved by reducing data dependencies between instructions. We also constructed a quadruple precision multiply-add operation by combining the multiplication and addition algorithms.
5 Optimizing Quadruple Precision Arithmetic
First, the theoretical peak performance of one processor of the SR11000/J2 is 9.2 GFlops. Quadruple precision arithmetic operations are rarely limited by the delay of data transfer from main memory to registers because the computation time of one quadruple precision operation is large. To obtain high performance, it is most important to increase throughput and hide instruction latency by pipelining the operations over vector data. To realize pipelined processing we rely on loop unrolling. The latency of floating point instructions such as fadd, fmul, fsub, fabs and fmadd on POWER 5+ is 6 clocks; the throughput is 2 clocks for fmadd and 1 clock for the others. Fig. 1 shows the pipelined processing in the case of a 6 clock instruction latency.
Fig. 1. Pipelining for 6 clock instruction latency
5.1 Hiding Instruction Latency
With loop unrolling we can optimize performance by hiding instruction latency. Data dependencies in the quadruple precision arithmetic operations are resolved by loop unrolling, which lines up identical instructions. An example is shown below, where fr denotes a 64 bit floating point register of the POWER architecture. In Problem() there is a data dependency chain across the three instructions {+, ×, ÷}. The latency can be hidden by loop unrolling as in Solution(), whose unrolling size is 2.

Problem() {
  fr1 ← fr2 + fr3
  fr5 ← fr1 × fr4
  fr7 ← fr5 ÷ fr6
}

Solution() {
  fr1 ← fr2 + fr3
  fr9 ← fr7 + fr8
  fr5 ← fr1 × fr4
  fr11 ← fr9 × fr10
  fr7 ← fr5 ÷ fr6
  fr13 ← fr11 ÷ fr12
}
5.2 Number of Registers
Loop unrolling prevents stalls of CPU resources between instructions. As the POWER 5+ processor has 32 logical floating point registers, we use all of them. (In fact there are 120 physical registers, which are utilized through register renaming.) If m is the number of registers needed for one quadruple precision operation, then

    maximum unrolling size = 32 / m                      (1)

Quadruple precision addition needs 4 registers for one operation ci = ai + bi, that is, m = 4, so we can realize a maximum unrolling size of 8:

    maximum unrolling size = 32 / 4 = 8                  (2)
In a similar way, quadruple precision multiplication ci = ai × bi also needs 4 registers, so m = 4 and the maximum unrolling size is 8. Quadruple precision division ci = ai / bi needs 5 registers per operation, so m = 5 and the maximum unrolling size would be 32/5 = 6. To attain an unrolling size of 8, as for addition and multiplication, we store the contents of one register to memory and reload it when needed. This method achieves an unrolling size of 32/4 = 8.
6 How to Use the Quadruple Precision Arithmetic Operations Library
We have discussed the algorithms and how to optimize quadruple precision arithmetic for vector data in Sections 3, 4 and 5. The interfaces of the quadruple precision arithmetic operations are shown in this section. The library is especially effective for vector data and is implemented in C with optimized assembler code. Users include the header file "quad_vec.h" in C and call the arithmetic functions of the library. We note that adapting it to FORTRAN is straightforward.
Table 2. Compile options

  Compiler                        Compile option
  Optimizing C Compiler 01-03/C   cc -Os +Op -64 -noparallel (not parallelized)
                                  -roughquad (quadruple precision add, multiply, div)
  gcc 4.1.1                       gcc -maix64 -mlong-double-128 -O3
– Addition ci = ai + bi
  void qadd_vec(long double a[], long double b[], long double c[], int n)
– Subtraction ci = ai − bi
  void qsub_vec(long double a[], long double b[], long double c[], int n)
– Multiplication ci = ai × bi
  void qmul_vec(long double a[], long double b[], long double c[], int n)
– Division ci = ai / bi
  void qdiv_vec(long double a[], long double b[], long double c[], int n)
– Multiply-Add ci = s × bi + ci (s : constant)
  void qmuladd_vec(long double *s, long double b[], long double c[], int n)
Here is a sample routine computing a matrix multiplication of size N using qmuladd_vec() described above:

long double a[N][N], b[N][N], c[N][N];
· · ·
for (i = 0; i < N; i++)
  for (j = 0; j < N; j++)
    qmuladd_vec(&a[i][j], b[j], c[i], N);   /* c[i][*] += a[i][j] * b[j][*] */
7 Numerical Experiment
We implemented and evaluated the four kinds of arithmetic operations: addition, multiplication, division and multiply-add. Subtraction is the same operation as addition except for the sign. Our proposed methods were optimized with assembler code and compared with the Hitachi Optimizing Compiler of the SR11000/J2 [10] and gcc 4.1.1. The OS is IBM AIX version 5.3 with the large page setting [9]. Compile options are shown in Table 2. Six data sizes were tested: the size of the L1 cache, half of L2, L2, half of L3, L3, and beyond L3. We measured MQFlops values (1 quadruple precision operation per second is defined as 1 QFlops). Figures 2 to 9 show the quadruple precision arithmetic operation performance. For addition, the effective clocks, i.e. the clocks per loop iteration for each loop unrolling size in our proposed method, are shown in Fig. 2 and the computational performance is shown in Fig. 3. Our proposed methods show high performance over the whole data range for all quadruple precision arithmetic operations. The performances of our proposal and of the Hitachi optimizing compiler for quadruple precision addition are
Fig. 2. Effective Clocks in our proposed addition
Fig. 3. MQFlops in addition
Fig. 4. Effective Clocks in our proposed multiplication
Fig. 5. MQFlops in multiplication
Fig. 6. Effective Clocks in our proposed division
Fig. 7. MQFlops in division
almost the same. Operations with gcc 4.1.1 are much slower because its generated code calls a library function at each step, incurring a large function call overhead. For quadruple precision addition, our proposed method attained a 73.70/38.34 ≈ 1.9 times speed-up over the Hitachi optimizing compiler and a 73.70/13.96 ≈ 5.3 times speed-up over gcc 4.1.1 when the data size just fits in the
Fig. 8. Effective Clocks in multiply-add
Fig. 9. MQFlops in multiply-add
Fig. 10. Matrix multiplication using multiply-add arithmetic operation
L1 cache. Finally, the matrix multiplication result using the optimized multiply-add operation is shown in Fig. 10.
8 Concluding Remarks
In this paper, fast quadruple precision arithmetic for the four basic operations and the multiply-add operation has been developed and evaluated. The proposed methods provide a maximum speed-up of about 5 times over gcc 4.1.1 for vector data on the POWER 5+ processor of the parallel computer SR11000/J2. Although our proposed quadruple precision addition is roughly on a par with the Hitachi optimizing compiler, the other quadruple precision arithmetic operations show high performance over the whole data range. We developed a fast quadruple precision library for vector data optimized for the POWER 5 architecture. Quadruple precision arithmetic operations are costly compared with double precision operations because they must compensate for rounding errors. We therefore applied latency hiding by loop unrolling, tuned to the number of available registers, to the quadruple precision arithmetic operations.
As future work, we plan to develop a quadruple precision library that is available on various architectures such as Intel and AMD. The POWER architecture, as well as PowerPC, has FMA instructions which execute in the same number of clocks as an add or multiply. In environments without an FMA instruction in particular, we will have to develop fast quadruple precision algorithms that do not rely on it.
References

1. Dekker, T.J.: A Floating-Point Technique for Extending the Available Precision. Numer. Math. 18, 224–242 (1971)
2. Knuth, D.E.: The Art of Computer Programming, 2nd edn. Addison-Wesley Series in Computer Science and Information. Addison-Wesley Longman Publishing Co., Inc., Amsterdam (1978)
3. The GNU Compiler Collection, http://www.gnu.org/software/gcc/index.html
4. A fortran-90 double-double library, http://www.nersc.gov/~dhbailey/mpdist/mpdist.html
5. ANSI/IEEE 754-1985 Standard for Binary Floating-Point Arithmetic (1985)
6. Akkas, A., Schulte, M.J.: A Quadruple Precision and Dual Double Precision Floating-Point Multiplier. In: DSD 2003: Proceedings of the Euromicro Symposium on Digital Systems Design, pp. 76–81 (2003)
7. Hida, Y., Li, X.S., Bailey, D.H.: Algorithms for quad-double precision floating point arithmetic. In: Proceedings of the 15th Symposium on Computer Arithmetic, pp. 155–162 (2001)
8. Bailey, D.H.: High-Precision Floating-Point Arithmetic in Scientific Computation. Computing in Science and Engineering 7, 54–61. IEEE Computer Society, Los Alamitos (2005)
9. AIX 5L Differences Guide Version 5.3 (IBM Redbooks). IBM Press (2004)
10. Optimizing C User's Guide For SR11000. Hitachi, Ltd. (2005)
11. Nagai, T., Yoshida, H., Kuroda, H., Kanada, Y.: Quadruple Precision Arithmetic for Multiply/Add Operations on SR11000/J2. In: Proceedings of the 2007 International Conference on Scientific Computing (CSC), Worldcomp 2007, Las Vegas, pp. 151–157 (2007)
Characterizing the Basic Synchronization and Communication Operations in Dual Cell-Based Blades

José L. Abellán, Juan Fernández, and Manuel E. Acacio

Dept. de Ingeniería y Tecnología de Computadores, University of Murcia, Spain
{jl.abellan,juanf,meacacio}@ditec.um.es
Abstract. The Cell Broadband Engine (Cell BE) is a heterogeneous chip-multiprocessor (CMP) architecture designed to offer very high performance, especially on game and multimedia applications. The singularity of its architecture, nine cores of two different types, along with the variety of synchronization and communication primitives offered to programmers, makes the task of developing efficient applications very challenging. This situation gets even worse when we consider Dual Cell-Based Blade architectures, where two separate Cells can be linked together through a dedicated high-speed interface. In this work, we present a characterization of the main synchronization and communication primitives provided by dual Cell-based blades under varying workloads. In particular, we focus on the DMA transfer mechanism, the mailboxes, the signals, the read-modify-write atomic operations, and the time taken by thread creation. Our performance results expose the bottlenecks and asymmetries of these platforms, which must be taken into account by programmers to improve the efficiency of their applications.
1 Introduction
Nowadays, among all contemporary CMP (chip-multiprocessor) architectures, there is one that is currently attracting enormous attention due to its architectural particularities and tremendous potential in terms of sustained performance: the Cell Broadband Engine (Cell BE from now on). From the architectural point of view, the Cell BE can be classified as a heterogeneous CMP. In particular, the first generation of the chip integrates up to nine cores of two distinct types [1]. One of the cores, known as the Power Processor Element or PPE, is a 64-bit multithreaded Power-Architecture-compliant processor with two levels of on-chip cache that includes the vector multimedia extension (VMX) instructions. The main role of the PPE is to coordinate and supervise the tasks performed by the rest of the cores. The remaining cores (a maximum of 8) are called Synergistic Processing Elements or SPEs and provide the main computing power of the Cell BE.
This work has been jointly supported by the Spanish MEC and European Commission FEDER funds under grants "Consolider Ingenio-2010 CSD2006-00046" and "TIN2006-15516-C04-03".
The Cell BE provides programmers with a broad variety of communication and synchronization primitives between the threads that comprise parallel applications, which were evaluated in [2]. Ultimately, the performance achieved by applications running on the Cell BE will depend to a great extent on the ability of the programmer to select the most adequate primitives as well as their corresponding configuration values. The main purpose of this work is to expose the performance bottlenecks and asymmetries of those primitives under varying workloads on a dual Cell-based blade. The rest of the paper is organized as follows. In Section 2 we provide a short revision of the architecture of the Cell BE and a dual Cell-based blade, and a description of some of the communication and synchronization primitives provided to programmers. Next, in Section 3 we introduce our tool, called CellStats, for characterizing these primitives. The results obtained after executing CellStats on a dual Cell-based blade are presented in Section 4. Finally, Section 5 gives the main conclusions of the paper and some of the lessons learned that can help programmers to identify the most appropriate primitive in different situations.
2 Dual Cell-Based Blade

2.1 Architecture
The Cell BE architecture [1] is a heterogeneous multi-core chip composed of one general-purpose processor, called PowerPC Processor Element (PPE), eight specialized co-processors, called Synergistic Processing Elements (SPEs), a high-speed memory interface controller, and an I/O interface, all integrated in a single chip. All these elements communicate through an internal high-speed Element Interconnect Bus (EIB) (see Figure 1(a)). Each SPE is a 128-bit RISC processor designed for high performance on streaming and data-intensive applications [3]. Each SPE consists of a Synergistic Processing Unit (SPU) and a Memory Flow Controller (MFC). The SPUs are in-order processors with two pipelines and 128 128-bit registers. All SPU instructions are inherently SIMD operations that the appropriate pipeline can run at four different granularities. As opposed to the PPE, the SPEs do not have a private cache memory. Instead, each SPU includes a 256 KB Local Store (LS) memory to hold both the instructions and data of SPU programs; that is, the SPUs cannot access main memory directly. The MFC contains a DMA Controller and a set of memory-mapped registers called MMIO Registers. Each SPU can write its MMIO registers through several Channel Commands. The DMA controller supports DMA transfers among the LSs and main memory. These operations can be issued either by the owner SPE, which accesses the MFC through the channel commands, or by the other SPEs (or even the PPE), which access the MFC through the MMIO registers.
(a) Block Diagram of Cell BE.
(b) Block Diagram of a Dual Cell-Based Blade.
Fig. 1. Cell BE Architecture
scientific, game and multimedia applications. The main components of a dual Cell-based blade are shown in Figure 1(b). In this architecture the two Cell BEs operate in SMP mode with full cache and memory coherency. Main memory is split into two different modules, namely XDRAM0 and XDRAM1, that are attached to Cell0 and Cell1 respectively. In turn, the EIB is extended transparently across a high-speed coherent interface running at 20 GBytes/second in each direction.

2.2 Programming
The SPEs use DMA transfers to read from (Get) or write to (Put) main memory. DMA transfer size must be 1, 2, 4, 8 or a multiple of 16 Bytes up to a maximum of 16 KB. DMA transfers can be either blocking or non-blocking. The latter allow overlapping computation and communication: there might be up to 128 simultaneous transfers between the eight SPE LSs and main memory. In addition, an SPE can issue a single command to perform a list of up to 2048 DMA transfers, each one up to 16 KB in size. In all cases, peak performance can be achieved when both the source and destination addresses are 128-Byte aligned and the size of the transfer is an even multiple of 128 Bytes [4]. Mailboxes are FIFO queues that support exchange of 32-bit messages among the SPEs and the PPE. Each SPE includes two outbound mailboxes, called SPU Write Outbound Mailbox and SPU Write Outbound Interrupt Mailbox, to send messages from the SPE; and a 4-entry inbound mailbox, called SPU Read Inbound Mailbox, to receive messages. Every mailbox is assigned a channel command and a MMIO register. The former allows the owner SPE to access the outbound mailboxes. The latter enables remote SPEs and the PPE to access the inbound mailbox. In contrast, signals were designed with the only purpose of sending notifications to the SPEs. Each SPE has two 32-bit signal registers to collect incoming notifications. A signal register is assigned a MMIO register to enable remote SPEs and the PPE to send individual signals (overwrite mode) or combined
signals (OR mode) to the owner SPE. Read-modify-write atomic operations enable simple transactions on single words residing in main memory. For example, the atomic_add_return operation adds a 32-bit integer to a word in main memory and returns its value before the addition. Programming of a dual Cell-based blade is equivalent to that of an independent Cell from a functional point of view. However, there are two important differences. First, dual Cell-based blades have 16 SPEs at the programmer's disposal rather than 8 SPEs. This feature doubles the maximum theoretical performance but also makes it much more difficult to extract thread-level parallelism from applications. Second, from an architectural point of view, any operation crossing the Cell-to-Cell interface results in significantly lower performance than those that stay on-chip (see Section 4). These facts must be taken into account by programmers to avoid unexpected and undesirable surprises when parallelizing applications for a dual Cell-based blade platform.
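As a concrete illustration of the SPE-side DMA interface described above, the fragment below sketches a blocking Get of one buffer from main memory into the local store using the MFC intrinsics from spu_mfcio.h. The intrinsic names are those of the Cell SDK used in this work, but the exact calling details should be treated as assumptions; the buffer name, size and tag value are only illustrative.

#include <spu_mfcio.h>

/* 16 KB destination buffer in the SPE's local store, 128-Byte aligned
   for peak DMA performance                                            */
static char ls_buffer[16384] __attribute__((aligned(128)));

void get_block(unsigned long long ea)   /* ea: effective address in main memory */
{
    const unsigned int tag = 1;                           /* DMA tag group       */

    mfc_get(ls_buffer, ea, sizeof(ls_buffer), tag, 0, 0); /* issue the Get       */
    mfc_write_tag_mask(1 << tag);                         /* select the tag group */
    mfc_read_tag_status_all();                            /* block until complete */
}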
3 CellStats
3.1 Architecture
CellStats is a command-line tool which accepts a number of parameters such as the operation to evaluate, the number of SPEs, the specific Cell or Cells to use, the number of iterations and other operation-specific parameters. However, the process to launch, instruct and synchronize the threads is the same in all cases. First, the PPE marshals a structure called the control block. The control block contains all the information needed by each SPE to complete the operation demanded by the user. Next, the PPE creates as many threads as specified by the user and synchronizes them using mailboxes. In turn, SPEs transfer the control block from main memory to their private LSs, report control block transfer completion to the PPE, and wait for the PPE's approval to resume execution. Then, each SPE performs the task entrusted by the user in a loop. In order to measure the time to complete the loop, the SPE utilizes a register called the SPU Decrementer, which decrements at regular intervals or ticks1. Upon completion of the loop, the SPE sends to the PPE the number of elapsed ticks through its outgoing mailbox. In this way, the PPE can compute not only the elapsed time from the go-ahead indication given to the SPEs, but also the time taken by each individual SPE to complete the task. For further details refer to [2].

3.2 Functionality
CellStats performs a different experiment depending on the parameters specified by the user: thread creation; PPE-to-SPE or SPE-to-SPE synchronization using mailboxes or signals; data transfers between main memory and the local LS, or between a remote LS and the local LS, through DMA operations or lists of DMA operations; and atomic operations such as fetch&add,
1 Duration of every tick for the dual Cell-based blade is 70 ns.
fetch&sub, fetch&inc, fetch&dec, and fetch&set on main memory locations. Besides, it is possible to specify the XDRAM memory module (0 or 1) in which memory buffers are allocated. CellStats manages the placement of memory buffers by using the numactl command.

Thread creation. This operation measures the time to launch the threads that are executed by the SPEs. To do that, an empty task that returns immediately is used. Consequently, this operation takes into account not only the time to create the threads but also the time needed to detect their finalization.

Mailboxes. This operation performs a PPE-to-SPE or an SPE-to-SPE synchronization using mailboxes. The PPE/SPE writes a message in the incoming mailbox (SPU Read Inbound Mailbox) of the receiver SPE. Next, the receiver SPE reads the message and replies with another message written to its outgoing mailbox (SPU Write Outbound Mailbox). When the initiator SPE/PPE reads the message, the synchronization process is complete. In the former case, the PPE uses the runtime management library function spe_write_in_mbox [5], which involves a system call and explains the increased latency. Nevertheless, the PPE can also write directly into the corresponding SPE's MMIO register using a regular assignment.

Signals. Unlike mailboxes, this operation performs a PPE-to-SPE or an SPE-to-SPE synchronization using signals. The initiator SPE/PPE signals the destination SPE by writing to the corresponding MMIO register (SPU Signal Notification). If the initiator is an SPE, the destination SPE signals in turn the source SPE, thus finishing the synchronization cycle. Otherwise, the destination SPE sends the reply to the PPE using its outgoing mailbox (SPU Write Outbound Mailbox). As with mailboxes, it is possible to write directly into the SPE's MMIO register instead of using the runtime management library function call spe_write_signal [5].

Atomic operations. These operations enable sequences of read-modify-write instructions on main memory locations, performed in an atomic fashion by as many SPEs as indicated by the user. The memory location accessed by the SPEs can be shared or private. In the latter case, the user can also specify the distance, measured in Bytes, between two consecutive private variables.

DMA operations. Data transfers between main memory and the local LS, or between a remote LS and the local LS, are achieved through DMA operations. The user can specify not only the DMA size but also whether the source buffer (Gets) or the destination buffer (Puts) is shared or private, and whether the memory location is in main memory or in an SPE's LS. Just like atomic operations, the user can specify the distance, measured in Bytes, between two consecutive private buffers.
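To make the mailbox experiment concrete, the sketch below shows the PPE side of the ping-pong just described: the PPE deposits a token in the SPE's SPU Read Inbound Mailbox and then polls the SPE's outbound mailbox for the reply. Function names follow the SPE Runtime Management Library [5] used by CellStats, but the precise signatures and the token value are assumptions made for illustration; the SPE side would read the token with spu_read_in_mbox() and answer with spu_write_out_mbox().

#include <libspe.h>

unsigned int ppe_mailbox_pingpong(speid_t spe)
{
    spe_write_in_mbox(spe, 0x1234);        /* message into SPU Read Inbound Mailbox */

    while (spe_stat_out_mbox(spe) == 0)    /* busy-wait until the SPE has replied   */
        ;

    return spe_read_out_mbox(spe);         /* reply from SPU Write Outbound Mailbox */
}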
4 Evaluation

4.1 Testbed
To develop CellStats we used the IBM SDK v2.1 for the Cell BE architecture installed atop Fedora Core 6 on a regular PC [6]. This development kit includes
a simulator, named Mambo, that allows programmers to execute binary files compiled for the Cell BE architecture. To obtain the experimental results, we installed the same development kit atop Fedora Core 6 on a dual Cell-based IBM BladeCenter QS20 blade which incorporates two 3.2 GHz Cell BEs v5.1, namely Cell0 and Cell1, with 1 GByte of main memory and a 40 GB hard disk.

4.2 Results
Thread creation. The average latency for launching each new thread, as described in Section 3.2, is considerably high, around 1.68 ms. In order to reduce the cost introduced by thread management, programmers can create SPE threads at startup and keep them alive until the application finishes. In this way, the PPE can submit tasks to the SPE threads by means of communication primitives such as mailboxes or signals, thus minimizing overhead.

Mailboxes and Signals. Table 1 shows the average latencies, measured in nanoseconds, for PPE-to-SPE synchronization using mailboxes or signals. In both cases, the PPE can either invoke a system call (Mailbox-sc or Signal-sc) or write directly into the corresponding SPE's MMIO register (Mailbox or Signal). Besides, we consider that the selected SPE can be placed on either Cell for comparison (PPE-SPEc0 for Cell0 and PPE-SPEc1 for Cell1). As we can see, the latency is shorter when writing directly into the SPE's MMIO registers, as defined in the file cbe_mfc.h, instead of using the runtime management library function calls spe_write_signal or spe_write_in_mbox [5]. In the former case, it is worth noting that the synchronization latency doubles when the destination SPE resides on Cell1, for both mailboxes and signals. In addition, Table 1 summarizes the average SPE-to-SPE synchronization latency, measured in nanoseconds, using mailboxes or signals when both SPEs are located on the same Cell (SPEc0-SPEc0) or on different Cells (SPEc0-SPEc1), respectively. In the former case, the latency is almost four times shorter because the synchronization messages stay on-chip and do not need to cross the Cell-to-Cell interface.

Table 1. Average latency (ns) for PPE-to-SPE and SPE-to-SPE synchronization

  Primitive    PPE-SPEc0   PPE-SPEc1   SPEc0-SPEc0   SPEc0-SPEc1
  Mailbox-sc   10,000.0    10,000.0    N/A           N/A
  Mailbox      779.7       1678.2      158.1         589.9
  Signal-sc    18,000.0    18,000.0    N/A           N/A
  Signal       503.8       1182.3      160.1         619.4
Atomic Operations. The average latency of the fetch&add atomic operation for a single variable is shown in Figure 2. By using numactl, we have selected the variable’s memory location (XDRAM0 or XDRAM1). As we can see, latency remains constant, at approximately 111 ns, when the variable is privately accessed by the SPEs. However latency grows linearly, up to 7.5 μs for 16 SPEs, when the
Fig. 2. Latency of fetch&add on shared and separate variables (128-Bytes stride)
variable is shared by all intervening SPEs. This is due to the fact that shared variables serialize the execution of atomic operations. Results for the rest of the atomic operations are similar and, therefore, have been omitted for the sake of brevity. Notice that the XDRAM memory module employed has a negligible effect on the performance results, because of the small size of the variable (4 Bytes).

DMA Operations. There are three different scenarios for data movement: data transfers between main memory and an SPE's LS (Gets), data transfers between an SPE's LS and main memory (Puts) and data transfers between SPEs' LSs (Movs). Results for Puts do not show significant differences to those of Gets and, therefore, have been omitted for the sake of brevity. In Figure 3 latency and bandwidth figures for Gets using Cell0 and Cell1 are shown. In particular, to generate Figures 3(a), 3(c), 3(e) and 3(g) (left side) all SPEs from Cell0 were used before any SPEs from Cell1, while to generate Figures 3(b), 3(d), 3(f) and 3(h) (right side) SPEs were used in the opposite order. As we can see, two general trends can be identified. First, latency is constant for message sizes smaller than or equal to the cache line, that is, 128 Bytes. Second, latency grows proportionally to the message size for messages larger than the cache line until the available bandwidth is exhausted. In addition, a more in-depth analysis provides other interesting conclusions. Latency is constant, but proportional to the number of SPEs, for message sizes up to 128 Bytes regardless of the originating Cell when shared buffers are used (see Figures 3(a) and 3(b)). Latency is constant, around 300 ns, for message sizes up to 128 Bytes regardless of the originating Cell when private buffers are used (see Figures 3(c) and 3(d)).2 For the bandwidth figures, there are three important trends to be considered. Firstly, when 8 SPEs are involved, Gets initiated in Cell0 obtain an aggregate bandwidth of 24.6 GB/s (close to the peak memory bandwidth), while Gets initiated in Cell1 reach an aggregate bandwidth of 13.6 GB/s. This is due to the fact that buffers are always placed in the XDRAM0 memory module. Therefore, Gets from SPEs in Cell1 must cross the Cell-to-Cell interface, limiting the
2 Stride is larger than or equal to the cache line size in all cases.
(a) Gets from Cell0 (shared memory)
(b) Gets from Cell1 (shared memory)
(c) Gets from Cell0 (private memory)
(d) Gets from Cell1 (private memory)
(e) Gets from Cell0 (shared memory)
(f) Gets from Cell1 (shared memory)
(g) Gets from Cell0 (private memory)
(h) Gets from Cell1 (private memory)
Fig. 3. Latency and bandwidth of DMA Gets on shared and private main memory buffers for a variable number of SPEs and packet sizes using Cell0 and Cell1
maximum achievable aggregate bandwidth. With the numactl command, we have verified that allocating all buffers in XDRAM1 memory module reports just the opposite results. Secondly, when 16 SPEs are considered both Cells are
(a) Intra-Cell Movs (shared LS buffer)
(b) Intra-Cell Movs (shared LS buffer)
(c) Inter-Cell Movs (shared LS buffer)
(d) Inter-Cell Movs (shared LS buffer)
Fig. 4. Latency and bandwidth of Movs on shared LS buffers for a variable number of SPEs and packet sizes using a single Cell and both Cells
involved, thus the figures report the benefits of transferring data from the closest XDRAM memory module (for SPEs within Cell0), and also report the drawback of going through the Cell-to-Cell interface (for SPEs within Cell1). Finally, for private buffers the aggregate bandwidth grows faster for message sizes up to 1 KB thanks to simultaneous transfers to different buffers. After that, the aggregate bandwidth figures converge to the same values as before. In turn, latency and bandwidth figures for Movs using Cell0 and Cell1 are shown in Figure 4. In particular, Figures 4(a) and 4(b) correspond to DMA Movs in Cell0, while Figures 4(c) and 4(d) correspond to DMA Movs between Cell0 and Cell1. In the former case, SPEs approach the maximum available bandwidth of the EIB-to-SPE interface. In the latter case, the Cell-to-Cell interface bandwidth is the limiting factor. Nevertheless, the latency is much longer than expected, resulting in an aggregate bandwidth lower than that of Gets originating in Cell1.
5 Conclusions
In this work, we have evaluated the synchronization and communication mechanisms of the Cell BE on a dual Cell-based blade platform. In this way, we can give some recommendations for dual Cell-based blade programmers such as: programmers should avoid frequent creation of threads, since thread creation introduces a significant overhead; they should use direct writes to the SPEs’
MMIO registers, since using runtime management library calls is very slow; for atomic operations, whenever possible, they should use private buffers residing on different cache memory lines, because latency of shared buffers grows linearly with the number of involved SPEs; in case of DMA transfers, they should use private buffers up to 1KB. For messages larger than 1KB, the latency is identical in both cases; finally programmers should be aware of the Cell-to-Cell interface, which determines the maximum achievable bandwidth, and also the asymmetries that arise when memory locations are in the furthest XDRAM memory module. This can be controlled by using the numactl command.
References

1. Kahle, J., Day, M., Hofstee, H., Johns, C., Maeurer, T., Shippy, D.: Introduction to the Cell Multiprocessor. IBM Journal of Research and Development 49(4/5), 589–604 (2005)
2. Abellán, J.L., Fernández, J., Acacio, M.E.: CellStats: a Tool to Evaluate the Basic Synchronization and Communication Operations of the Cell BE. In: Proceedings of the 16th Euromicro International Conference on Parallel, Distributed and Network-based Processing, pp. 261–268 (2008)
3. Gschwind, M., Hofstee, H.P., Flachs, B., Hopkins, M., Watanabe, Y., Yamazaki, T.: Synergistic Processing in Cell's Multicore Architecture. IEEE Micro 26(2), 10–24 (2006)
4. Kistler, M., Perrone, M., Petrini, F.: Cell Processor Interconnection Network: Built for Speed. IEEE Micro 25(3), 2–15 (2006)
5. IBM Systems and Technology Group: SPE Runtime Management Library Version 2.1 (2007)
6. IBM Systems and Technology Group: Cell Broadband Engine Software Development Toolkit (SDK) Installation Guide Version 2.1 (2007)
Performance Evaluation of the NVIDIA GeForce 8800 GTX GPU for Machine Learning

Ahmed El Zein1, Eric McCreath1, Alistair Rendell1, and Alex Smola2

1 Dept. of Computer Science, Australian National University, Canberra, Australia
2 Statistical Machine Learning Program, NICTA, Canberra, Australia
{Ahmed.ElZein,Eric.McCreath,Alistair.Rendell,Alex.Smola}@anu.edu.au
Abstract. NVIDIA have released a new platform (CUDA) for general purpose computing on their graphical processing units (GPU). This paper evaluates use of this platform for statistical machine learning applications. The transfer rates to and from the GPU are measured, as is the performance of matrix vector operations on the GPU. An implementation of a sparse matrix vector product on the GPU is outlined and evaluated. Performance comparisons are made with the host processor.
1 Introduction
The GeForce 8800 GPU is the first GPU from NVIDIA to implement a unified architecture where pixels and vertices are processed by the same hardware. This provides a higher degree of programmability than previous GPUs and is much better suited to general purpose computing. In recognition of this, NVIDIA have released a general purpose programming interface called CUDA (see Section 2.3 for details) and have packaged the same basic hardware as a dedicated co-processor for use by high performance computing applications (the Tesla product range). Moreover, NVIDIA have also announced [1] that future generations of their hardware will provide support for IEEE double precision arithmetic; a move that will arguably remove the one remaining major bottleneck to the widespread use of GPUs in scientific computations. While the CUDA programming interface significantly eases use of the NVIDIA GPUs for general purpose programming, the programming model provided by CUDA is very different to that available on a traditional CPU. For instance, CUDA has the concepts of shared, constant, texture, and global memories that all have slightly different properties, and determining how best to use each memory type for a given application is non-trivial. Also, it must be remembered that when using any coprocessor the observed performance will depend heavily on what fraction of the application can be run on the coprocessor, and whether the overheads introduced in order to move data to and from the coprocessor are small compared to the computational times involved. In this paper we outline our initial efforts to migrate a Statistical Machine Learning (ML) application to the GeForce 8800 GPU. The kernel of this application involves an iterative solver that performs repeated matrix vector products.
For a number of reasons this application would appear to be well suited to use of a GPU. First, matrix operations are generally well suited to vector or stream processors such as the GeForce 8800. Second, matrix vector products scale as O(N · d), where N is the number of data points and d is the inherent dimensionality of the problem, whereas other steps of a ML application typically scale as O(N ). Consequently for high-dimensional problems, migrating this part of the application to the GPU is potentially beneficial. Third, the matrix does not change between iterations, so it can be copied once to memory on the GPU and reused during each iteration. Finally, for many ML problems single precision arithmetic is sufficient, so the porting effort required is of immediate benefit even before double precision GPUs become available. On closer inspection the situation is not quite as simple. In particular, although matrix operations can be easily vectorized, the amount of data may exceed what a single GPU card can hold (there is 768 MB on the GPU used). Consequently the resulting performance depends heavily on the bandwidth of the bus connecting the CPU and the GPU. Secondly, for matrix-vector multiplications, the limiting factor is the memory bandwidth rather than the raw floating point performance (the latter exceeds the former on both CPU and GPU). This ratio is generally less favourable for GPUs than the ratio between GPU and CPU peak floating point ratios. Finally, many ML problems involve sparse matrices, so the use of a sparse matrix vector product may be preferable to use of the dense equivalent. Sparse matrix algorithms are, however, considerably harder to adapt to stream processors. With the goal of migrating the complete ML application to CUDA this paper addresses three issues: i) What transfer rates can be achieved between host and GPU memory and vice versa, ii) What performance is achieved when using the CUDA supplied BLAS library to perform a variety of dense matrix vector products of sizes similar to those required by ML applications, iii) How does the performance of the latter compare with what we can obtain by hand coding sparse matrix vector products in CUDA. The following section gives background information about the ML application, the NVIDIA 8800 GPU hardware, its CUDA programming model, and methods for sparse matrix vector products. Section 3 details our experimental setup, while Section 4 contains detailed performance results. Section 5 uses the performance data gather here to discusses how a full ML application is likely to perform on the GeForce 8800 and outlines plans for our future work.
2 Background

2.1 The ML Application
One of the key objectives in ML is, given some patterns xi , such as pictures of apples and oranges, and corresponding labels yi , such as the information whether xi is an apple or an orange, to find some function f which allows us to estimate y from x automatically. See e.g. [2] for an introduction. In this quest, convex optimization is a key enabling technology for many problems. For
Fig. 1. Iterative solver algorithm (initial guess w → compute Xw → calculate loss l and gradient g → compute X'g → iterative solver updates w → test for convergence → return w). The black boxes refer to matrix-vector operations which could be accelerated by a GPU.
Table 1. Statistics for some typical ML datasets [3]

  Domain               Dataset          Rows       Columns  Nonzero Elements  Density
  Intrusion Detection  KDDCup99         3,398,431  127      55,503,855        12.86%
  Ranking              NetFlix          480,189    17,770   100,480,507       1.17%
  Text Categorization  Reuters C11      804,414    47,236   60,795,680        0.16%
  Text Categorization  Arxiv astro-ph   62,369     99,757   4,977,395         0.08%
instance, Teo et al. [3] proposed a scalable convex solver for such problems. It is an iterative algorithm that involves guessing a solution vector w, using this to evaluate a loss function l(x, y, w) and its derivative g = ∂w l(x, y, w), and then updating w accordingly. This process is repeated until a desired level of convergence is achieved (see Fig. 1). As mentioned above, the majority of the time is spent evaluating the matrix vector products, and the elements of the matrix (X) do not change between iterations. Many ML datasets are very sparse, as shown in Table 1. Exploiting the sparsity decreases the memory footprint of the matrix as well as the number of floating point operations required for the matrix vector product. Unfortunately it also introduces random memory access patterns and indirect addressing, which is likely to result in less efficient utilization of a GPU's hardware.
2.2 NVIDIA 8800 GTX Hardware
Figure 2 illustrates the architecture of the GeForce 8800 GTX used in this work. At the heart of the device lies the Streaming Processor Array (SPA) consisting of 8 Texture Processor Cluster (TPC) units. Each TPC contains 2 Streaming Multiprocessor (SM) units and a texture unit. The SM in turn consists of 8 Stream Processors (SP) clocked at a default of 1.35 GHz. When running CUDA applications each SP is able to issue one multiply-add (MAD) instruction per cycle. This gives each SM a peak performance of 21.6 GFLOPS, and the GeForce 8800 GTX with 16 SMs an aggregate performance of 345.6 GFLOPS. The SPA is connected to 768 MB of GDDR3 memory through a 384-bit (48 byte) wide interface. Clocked at 900 MHz (1800 MHz effective double data rate) by default, the frame buffer memory has a peak bandwidth of 84.375 GB/s. More details of the NVIDIA hardware can be found in [4].
Fig. 2. GeForce 8800 GTX architecture and CUDA memory model (left: hardware view; right: software view)
2.3 CUDA
The Compute Unified Device Architecture (CUDA) is a hardware and software architecture that enables the issuing and managing of computations on the GPU as a data-parallel device without the need to map the computations to a graphics API. CUDA transforms the hardware's personality from a graphics card to a multi-threaded coprocessor. Provided with CUDA are Basic Linear Algebra Subprograms (BLAS) and Fast Fourier Transform (FFT) implementations; however, NVIDIA only provides a C/C++ API for these. CUDA executes the part of the application that runs on the GPU using hundreds or thousands of threads. These threads are organized into a grid of blocks. The grid can be either one or two dimensional, while each block can be a one, two or three dimensional group of threads. The grid and block dimensions can be set at runtime, with each thread able to retrieve its own thread and block id. Each block of threads is executed on one physical SM, with NVIDIA hardware only allowing synchronization and access to fast shared memory for threads in the same block. An illustration of the CUDA memory model is given in Fig. 2. A programming guide [1] providing additional information is available from NVIDIA (http://www.nvidia.com).
2.4 Sparse Matrices on GPUs
A popular representation for sparse matrices is compressed sparse row (CSR) [5] storage. Non-zero elements are arranged into a dense vector val. For each value in val, its column index from the original matrix is stored in a dense vector of the same size ind at the same offset. A third pointer array (ptr) carries the offset of the first element in every row. CSR storage and associated pseudo code for a sparse matrix vector product are shown in Fig. 3.
for each row i do
  for l = ptr[i] to ptr[i+1]−1 do
    y[i] = y[i] + val[l] · x[ind[l]]
Fig. 3. CSR format and pseudo-code for matrix vector product
Sparse matrix vector products (SpMV) have been implemented on older GPU hardware [6,7,8], but were limited by the graphics API and hardware constraints; Bolz et al. [6] and Krüger et al. [8] achieved 9 and 110 MFLOPS respectively in 2003. Ujaldon et al. [9] achieved 222 MFLOPS in 2005, and recently Sengupta et al. [10] achieved 215 MFLOPS with CUDA on a GeForce 8800 GTX in 2007.
3 Experimental Setup
The GeForce 8800 GTX was hosted in a 2 GHz dual core AMD Athlon64 3800+ system with 2 GB of PC3200 DDR memory. The processor has 128 KB of L1 cache and 1 MB of L2 cache, and it has a theoretical peak performance of 8 GFLOPS. For all benchmarks and experiments, the code was run for 100 iterations and the average time was used to calculate bandwidth and FLOPS. Results for the GPU include the time required to transfer the vector to the GPU and the resulting product vector back to the host. For the sparse matrix vector product, FLOPS were calculated as (2 × nonzero elements ÷ time). For dense matrix vector products the CUDA BLAS library (CUBLAS) was used on the GPU, while ATLAS1 was used on the host. Both specialist matrix vector routines (SGEMV) and general matrix matrix routines (SGEMM) were considered, although for ATLAS SGEMV always outperformed SGEMM, since SGEMM is optimized for matrix-matrix multiplications; thus results for ATLAS SGEMM will not be given. ATLAS permits the matrix to be given in either row or column major format, while CUDA only supports matrices in column major format. Results are given for both normal (N) and transpose (T) ordering of the matrix, as both are required (see Fig. 1). As yet CUBLAS does not support sparse matrices, so our own sparse matrix vector implementation was written (described later). Sparse test matrices were generated using the following code, with the condition that each row contains at least one non-zero element.

s: chosen sparsity (0% to 99%)
for each row i do
  for each column j do
    if s <= random(0,99) do
      matrix[i][j] = 1.0
1 Automatically Tuned Linear Algebra Software, http://math-atlas.sourceforge.net
Table 2. Host initiated memory transfer rates (GB/s)

                                 Latency (μs)   1KB    1MB    100MB
  Main Memory to GPU                 22         0.03   0.80    1.10
  Main Memory (pinned) to GPU        18         0.04   2.70    3.10
  GPU to Main Memory                 18         0.04   0.40    0.50
  GPU to Main Memory (pinned)        15         0.05   2.80    3.00
  GPU Memory to GPU Memory           12         0.14  50.59   71.17

4 Results

4.1 Memory Transfer Rates
Rates for various memory transfer operations are given in Table 2. All transfers are initiated by a CUDA call on the host, with the time recorded from before this call until after the transfer was complete. Hence all benchmarks involve communication over the PCIe bus, which has a maximum bandwidth of 4 GB/s. For host to GPU transfers with large data sizes only ∼25% of the PCIe bandwidth is achieved. CUDA, however, allows for the allocation of non-pageable pinned memory on the host, and when this is used approximately 75% of the peak PCIe rate is achieved. When using unpinned memory, transfer rates from the GPU to main memory are significantly less than from main memory to GPU, but these become roughly equivalent when using pinned memory. All transfers were found to have a latency of ∼20 μs, probably reflecting the latency of the PCIe bus. The bandwidth for transferring data from GPU memory to GPU memory was also measured and found to have an asymptotic value close to the 84.4 GB/s peak, with nearly 60% of this achieved for a 1 MB transfer.
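The pinned-memory path that gives the higher rates in Table 2 is obtained with the standard CUDA runtime calls sketched below; the function and buffer names shown are illustrative, error checking is omitted, and the fragment is a sketch rather than the benchmark code itself.

#include <cuda_runtime.h>

/* Allocate a pinned (page-locked) host buffer, copy it to GPU global
   memory and return the device pointer.                              */
float *upload_pinned(size_t bytes)
{
    float *host_buf = NULL, *dev_buf = NULL;

    cudaMallocHost((void **)&host_buf, bytes);   /* pinned host memory */
    cudaMalloc((void **)&dev_buf, bytes);        /* GPU global memory  */

    /* ... fill host_buf with the matrix or vector data ... */

    cudaMemcpy(dev_buf, host_buf, bytes, cudaMemcpyHostToDevice);
    cudaFreeHost(host_buf);                      /* host copy no longer needed */
    return dev_buf;
}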
4.2 Dense Matrix Vector Performance
Using CUBLAS, matrix dimensions that were not a multiple of 16 were found to have significantly lower performance; since for ML applications padding can be done once, only results for matrices that are a multiple of 16 will be reported here. Performance data for square matrices of ascending sizes are given in Fig. 4. These show that on the host system performance is roughly constant for all matrix sizes and that normal ordering significantly outperforms transpose ordering. On the GPU, performance is much more varied. For normal ordering, SGEMV performance increases dramatically as the dimension increases, but for transpose ordering it is roughly constant. Thus, while transpose SGEMV ordering is over twice as fast as normal ordering when N=1024, by the time N=5120 it is 30% slower. In almost all cases use of SGEMM instead of SGEMV is found to be slower. At best, use of the GPU is ∼4.5× faster than use of the host processor. To observe the effect of matrix shape on performance, the total size of the matrix was set to ∼100 MB (5120 × 5120 or 26,214,400 elements) while the number of rows and columns was varied. The results, given in Fig. 5, show a degradation in performance when the number of columns exceeds the number
  Dimension   Host RowMajor SGEMV   GPU ColMajor SGEMV   GPU ColMajor SGEMM
              N      T              N      T             N      T
  1024        2.7    1.2            3.6    7.8           6.9    6.5
  2048        2.9    1.2            7.2    9.2           6.8    7.1
  2816        2.9    1.1            9.5    9.0           6.4    7.0
  3200        2.9    1.1            10.6   10.0          6.2    6.9
  4480        3.0    1.1            13.0   8.7           6.8    7.6
  5120        3.0    1.2            13.6   9.9           7.0    7.8

  (The accompanying plot "Effect of Size on Performance" shows GFLOPS versus dimension for host-SGEMV-rowMajor-N, gpu-SGEMV-colMajor-N and gpu-sgemm-colMajor-N.)

Fig. 4. Performance (GFLOPS) for square matrix vector products
  Columns  Rows     Host RowMajor SGEMV   GPU ColMajor SGEMV   GPU ColMajor SGEMM
                    N      T              N      T             N      T
  204800   128      1.7    0.5            12.5   9.3           6.0    6.8
  51200    512      2.5    1.1            13.4   9.9           6.7    7.6
  25600    1024     2.8    1.1            12.8   9.7           6.9    7.8
  10240    2560     2.9    1.2            12.0   10.2          7.0    7.8
  5120     5120     3.0    1.2            13.6   9.9           7.0    7.8
  2560     10240    2.7    1.1            8.9    8.3           7.0    7.9
  1024     25600    2.7    1.2            3.9    10.1          7.1    7.7
  512      51200    2.7    1.2            2.0    5.3           7.1    7.9
  128      204800   2.7    0.9            0.5    1.3           1.8    2.0

  (The accompanying plot "Effect of Shape on Performance" shows GFLOPS versus the rows/columns ratio for the same three curves.)

Fig. 5. Performance (GFLOPS) as function of shape for matrix vector product
of rows, particularly when using transpose ordering. Although a similar effect occurs when using SGEMM, it only happens when the difference is two orders of magnitude. In summary, the results given here suggest that there is further scope to optimise the performance of the SGEMV routine in CUBLAS.
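For reference, the dense GPU runs above use the CUBLAS interface along the following lines. This sketch is written against the original (CUDA 1.x era) CUBLAS API shipped with the SDK used here; the exact signatures and the wrapper itself should be treated as assumptions for illustration.

#include <cublas.h>

/* y = A*x ('n') or y = A'*x ('t') for an m x n column-major matrix
   already resident on the GPU.                                      */
void gpu_sgemv(char trans, int m, int n,
               const float *dA, const float *dx, float *dy)
{
    cublasSgemv(trans, m, n,
                1.0f,        /* alpha           */
                dA, m,       /* A, leading dim  */
                dx, 1,       /* x, stride       */
                0.0f,        /* beta            */
                dy, 1);      /* y, stride       */
}

/* Typical setup (once per application): cublasInit(), then cublasAlloc()
   and cublasSetMatrix()/cublasSetVector() to place A and x on the device. */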
4.3 Initial Sparse Matrix Vector Implementation
The approach taken here stores the matrix in CSR and is parallelized by assigning rows to threads, such that each thread multiplies all the elements in a given row by the corresponding elements in the vector before writing the sum to the relevant element in the result vector. The following issues were considered:

Memory Loads: While CUDA only supports scalar operations, it also supports up to 128-bit wide vector data types. Loading a float or a float4 from memory costs the same. For a 3000×3000 matrix at 95% sparsity and 32 threads/block, an SpMV took 827 and 442 μs when using float and float4 data types respectively.

Use of Different Memory Types: The GeForce 8800 GPU offers different types of memory (Fig. 2). The key to optimizing the SpMV code on the GPU is determining the most efficient use of each memory type.
Global Memory. On our card there was 768 MB of global memory. This memory is not cached, but can be read and written by any thread.

Shared Memory. Each SM has 16 KB of shared memory that threads in the same block can use to share data. We do not currently use this.

Constant Memory. There is 64 KB of cacheable read-only memory that is initialized each time a GPU kernel is started. Storing the vector in constant memory limits vector sizes to ∼16000, but further reduces the SpMV duration to 242 μs for a 3000×3000 matrix at 95% sparsity and 32 threads/block.

Texture References. CUDA allows binding of global memory to a texture reference. Our initial results suggest this may be useful for the matrix, but only with large row dimensions. Results given here do not use texture references.

The performance of the SpMV implementation for square matrix vector products with a variety of different numbers of threads per block is given in Fig. 6 for matrices with 75% and 95% sparsity. The results show significant variation in performance as a function of the number of threads per block, particularly at 95% sparsity. First, the optimal number of threads per block changes with the size of the matrix, and although there is a general trend suggesting more threads per block for large matrices, this is not always true. Second, beyond some key dimension the performance drops markedly and becomes roughly the same regardless of thread count. At best a performance of 3-4 GFLOPS is observed. By comparison, recent work of Gahvari et al. [11] using a range of sparse matrices on a 1.4 GHz Opteron gave a maximum performance of ∼400 MFLOPS for an unblocked CSR SpMV and a median performance of ∼180 MFLOPS (on current hardware these values would probably increase by a factor of 3). To determine when it is preferable to use SpMV over dense matrix vector products, we plot in Fig. 7 the speedup of SpMV at 75% and 95% sparsity over the equivalent SGEMV runs. This shows that for 75% sparsity using SpMV can be advantageous for dimensions up to around 4000, while for 95% sparsity using SpMV is always an advantage.
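To make the row-per-thread scheme of Section 4.3 concrete, the following serial C sketch spells out the arithmetic each thread performs; the float4 loads, the constant-memory placement of the vector and the thread/block mapping are omitted, and the function and variable names are illustrative rather than taken from the actual implementation.

/* CSR sparse matrix-vector product, one row per unit of work. */
void spmv_csr(int n_rows, const int *row_ptr, const int *col_idx,
              const float *val, const float *x, float *y)
{
    for (int row = 0; row < n_rows; ++row) {      /* on the GPU: row == thread */
        float sum = 0.0f;
        for (int k = row_ptr[row]; k < row_ptr[row + 1]; ++k)
            sum += val[k] * x[col_idx[k]];
        y[row] = sum;
    }
}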
5 Discussion and Further Directions
The results from Section 4.1 show that it is possible to transfer a large ML dataset to global memory on the GPU card at around 3 GB/s. On the GPU card used in this work there was 768 MB of memory, so if 600 MB of this were used to store the ML dataset it would take ∼0.2 s to transfer the data from host memory to GPU memory. This is the minimum amount of time that must be saved when using the GPU instead of the host to perform the computational work. Achieving this is most likely to be possible if the dataset can be copied to the GPU once, left there and re-used in each iteration of the convex solver (Fig. 1). From Table 1 and using CSR storage this should be possible for the intrusion detection and the two text categorization datasets. For larger datasets an alternative strategy would be to divide the problem/dataset over multiple GPU cards, or to use double buffering to overlap movement of data to the GPU with computation on the GPU.
Fig. 6. SpMV performance at 75% and 95% sparsity
Fig. 7. SpMV speedup over SGEMV at 75% and 95% sparsity
Results for dense matrix vector products show a potential speedup of 3−5× from use of the GPU for matrices of dimension 2500×2500 and above. To obtain this performance from the current version of CUBLAS requires, however, that the number of rows in the (column major) matrix be a multiple of 16. While 3−5× is a useful performance gain, the cost of moving the dataset to the GPU could easily erode this advantage. As a consequence it is not clear which would be the better option if, for example, it were a choice between buying a dual core host with an NVIDIA GeForce 8800 GTX as a coprocessor or a quad core host. Performance for our initial sparse matrix vector product is significantly better than we originally expected, achieving similar 3−5× speedups over the host CPU, even for small 1000×1000 matrices if the sparsity is over 90%. Since many ML datasets are sparse, this suggests that it would be advantageous to place further effort into optimising the SpMV routine for CUDA, and in particular, trying to eliminate the performance drop observed after certain dimensions and trying to determine automatically the optimal number of threads per block to use for a given problem size. While other approaches exist that may offer performance gains in specific areas (such as the use of coalesced memory reads), they also introduce complexities. We are in the process of evaluating such approaches. Finally, for easier integration with existing software it would be useful to implement an OSKI (Optimized Sparse Kernel Interface, http://bebop.cs.berkeley.edu/oski/) front-end.
References
1. NVIDIA: NVIDIA CUDA Programming Guide. 1.0 edn. (2007)
2. Vapnik, V.N.: Statistical Learning Theory. John Wiley & Sons, Inc., Chichester (1998)
3. Teo, C.H., Smola, A., Vishwanathan, S.V., Le, Q.V.: A scalable modular convex solver for regularized risk minimization. In: KDD 2007: Proceedings of the 13th ACM SIGKDD international conference on Knowledge discovery and data mining, pp. 727–736. ACM, New York (2007)
4. NVIDIA: NVIDIA GeForce 8800 GPU Architecture Overview (2006)
5. Duff, I.S., Erisman, A.M., Reid, J.K.: Direct Methods for Sparse Matrices. Oxford University Press, Oxford (1986)
6. Bolz, J., Farmer, I., Grinspun, E., Schröder, P.: Sparse matrix solvers on the GPU: conjugate gradients and multigrid. ACM Trans. Graph. 22(3), 917–924 (2003)
7. Buck, I.: Data parallel computation on graphics hardware (2003)
8. Krüger, J., Westermann, R.: Linear algebra operators for GPU implementation of numerical algorithms. In: SIGGRAPH 2003: ACM SIGGRAPH 2003 Papers, pp. 908–916. ACM, New York (2003)
9. Ujaldon, M., Saltz, J.: The GPU on irregular computing: Performance issues and contributions. In: CAD-CG 2005: Proceedings of the Ninth International Conference on Computer Aided Design and Computer Graphics (CAD-CG 2005), pp. 442–450. IEEE Computer Society, Washington DC (2005)
10. Sengupta, S., Harris, M., Zhang, Y., Owens, J.D.: Scan primitives for GPU computing. In: Graphics Hardware, pp. 97–106. ACM, New York (2007)
11. Gahvari, H., Hoemmen, M., Demmel, J., Yelick, K.: Benchmarking sparse matrix-vector multiply in five minutes. In: SPEC Benchmark Workshop (2007)
Hardware Implementation Aspects of New Low Complexity Image Coding Algorithm for Wireless Capsule Endoscopy

Paweł Turcza 1,4, Tomasz Zieliński 2,4, and Mariusz Duplaga 3,4

1 Department of Instrumentation and Measurement, 2 Department of Telecommunications, AGH University of Science and Technology, Kraków, Poland
3 Collegium Medicum, Jagiellonian University, Kraków, Poland
4 Center of Innovation, Technology Transfer and University Development, Jagiellonian University, Kraków, Poland
{turcza,tzielin}@agh.edu.pl, [email protected]
Abstract. The paper presents hardware implementation aspects of a new efficient image compression algorithm designed for wireless capsule endoscopy with a Bayer color filter array (CFA). Since power limitations, small size conditions and the specific image data format (CFA) exclude the application of traditional image compression techniques, dedicated ones are necessary. The discussed algorithm is based on an integer version of the discrete cosine transform (DCT); therefore it has low complexity and power consumption. It is demonstrated that the performance of the proposed algorithm is comparable to the performance of JPEG2000, a very complex, sophisticated wavelet-based coder. In the paper a VLSI coder architecture is proposed and power requirements are discussed. Keywords: Image coding, integer transformation, DCT, Bayer, JPEG2000.
1 Introduction

An endoscopic medical procedure, often accompanied by a biopsy of pathological changes, plays a fundamental role in the diagnosis of many gastrointestinal (GI) tract diseases. Until recently it has been the most important method of investigation of the upper (gastroscopy) and lower (colonoscopy) parts of the GI tract. However, examination with a flexible endoscope is also a very unpleasant experience for a patient. Due to enormous progress in microelectronics, a wireless endoscopic capsule has been invented recently that makes possible a non-invasive evaluation of the whole GI tract together with the small intestine. The first such capsule was built by Given Imaging Ltd. [1] at the end of the 20th century. It is equipped with a CMOS sensor, lighting, a data processing module and a transmission unit (Fig. 1). After being swallowed by a patient, the capsule passes passively through the GI tract due to peristaltic intestine movements, taking photos (images) that are wirelessly transmitted to a recorder carried by the patient. Unfortunately, this technology is not ideal yet. Due to the lack of an autonomous locomotion and navigation system, detailed investigation of the high-volume stomach is impossible, whereas detailed investigation of the large intestine is
possible only after its inflation, since normally it is shrunk. The quality of the recorded video (256×256 pixels, 8 bits per color, only 2 frames per second) is also very low. It has already been reported in the literature [2], [3] that the endoscopic capsule can reach only approximately one megabit per second transmission bitrate due to the limitation of power consumption and the severe attenuation of radio waves in a human body. Because a CMOS sensor having VGA resolution (640×480 pixels) delivers 2.45×10^6 bits per image, it is obvious that without data compression the transmission of one image would last more than 2 seconds, spending quite a lot of energy. Since existing standard image compression algorithms are not appropriate for capsule endoscopy (CE) due to their high computational complexity, simple dedicated algorithms are under development [3], [4]. In this paper memory requirements optimization and hardware implementation aspects of a recently proposed algorithm [5], [6], better than [3], [4], are presented. The influence of codebook reduction in the entropy coder on algorithm performance is carefully evaluated and justified. It is shown that the performance of such a modified algorithm is comparable to the performance of JPEG2000, a very complex wavelet-based coder. It should be mentioned that the problems associated with the color filter array (CFA) Bayer pattern are efficiently solved by our algorithm, in contrast to the approach presented in [3]. Our method also makes it possible to reconstruct the G1 and G2 CFA image color components more precisely. In contrast to [4], our algorithm offers a higher compression range, so a higher image frame rate is possible. In the paper a VLSI coder architecture is proposed and power requirements are discussed.
2 New Image Coding Algorithm

A simplified block diagram of data processing in the wireless endoscopy capsule is presented in Fig. 1. The output image data from a CMOS image sensor, after image and channel coding (required for bandwidth reduction and error protection), are transmitted by a wireless transceiver (TX) to the outside of the body, where they are received, stored and eventually decompressed for subsequent diagnosis. A control unit in the capsule controls the operation of the compression module according to commands received from an external controller. The proposed image coder performs four operations sequentially: color transformation, image transformation, coefficient quantization and entropy coding.
Fig. 1. Block diagram of data processing in wireless endoscopy capsule
2.1 Color and Structure Transformation

In CE, as well as in inexpensive digital cameras, color filter arrays (CFA) are placed on a monochrome CMOS image sensor to produce color images. The most popular CFA pattern has been proposed by Bayer [7]. The Bayer CFA, shown in Fig. 2, uses a 2×2 repeating pattern having two green pixels, one red and one blue. It is clear that application of the CFA results in the collection of incomplete image information (only one color component for each pixel), so color interpolation is necessary to reconstruct the full color image from the sensor data. From the image compression point of view, the interpolation step introduces redundancy that is difficult to remove. Therefore, for best performance the image compression should precede the color data interpolation step [8]. Additional redundancy reduction can be achieved by a color space transformation. For normal RGB images this is achieved by the following one:
\[
\begin{bmatrix} Y \\ C_b \\ C_r \end{bmatrix} =
\begin{bmatrix} 0.299 & 0.587 & 0.114 \\ -0.146 & -0.288 & 0.434 \\ 0.617 & -0.517 & -0.100 \end{bmatrix}
\begin{bmatrix} R \\ G \\ B \end{bmatrix}
\qquad (1)
\]
where Y is the luma component while Cb and Cr denote the blue and red chroma components, respectively. However, due to CE power limits and the quincunx sampling scheme of the green component in the Bayer CFA, we propose the application of a modified RGB to YCgCo space conversion known from the Fidelity Range Extensions (FRExt) of the H.264 video coding standard [9]:

\[
\begin{bmatrix} Y_1 \\ Y_2 \\ C_g \\ C_o \end{bmatrix} =
\begin{bmatrix} 1/2 & 0 & 1/4 & 1/4 \\ 0 & 1/2 & 1/4 & 1/4 \\ 1/4 & 1/4 & -1/4 & -1/4 \\ 0 & 0 & -1/2 & 1/2 \end{bmatrix}
\begin{bmatrix} G_1 \\ G_2 \\ B \\ R \end{bmatrix}
\qquad (2)
\]
where Cg stands for the green chroma and Co stands for the orange chroma. The quincunx array (luma component) has high-frequency content that makes the compression task difficult. Therefore, a structure conversion [8] (see Fig. 2) with optional reversible filtering (deinterlacing, see Fig. 3) is applied in our coder.
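As an illustration, the forward transformation (2) for a single 2×2 Bayer cell can be written in C as below; the function name is an assumption of this sketch, and a hardware realization would use the equivalent shift-and-add form rather than floating point.

typedef struct { float y1, y2, cg, co; } ycgco_t;

/* Forward color transformation (2) applied to one G1/G2/B/R Bayer cell. */
static ycgco_t rgb_bayer_to_ycgco(int g1, int g2, int b, int r)
{
    ycgco_t c;
    c.y1 = 0.5f * g1 + 0.25f * b + 0.25f * r;
    c.y2 = 0.5f * g2 + 0.25f * b + 0.25f * r;
    c.cg = 0.25f * g1 + 0.25f * g2 - 0.25f * b - 0.25f * r;
    c.co = -0.5f * b + 0.5f * r;
    return c;
}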
2.2 Image Data Transformation and Entropy Coding

A block diagram of a general transform-based image coder (for one image component only) is presented in Fig. 4. The 2D image transformation is the first operation performed. Its goal is to concentrate the image energy in the smallest possible number of transform coefficients, which are then scalar quantized and efficiently coded by the entropy encoder. The discrete cosine (DCT) or wavelet (DWT) transforms are usually chosen as the image transformation. In the proposed algorithm an integer approximation of the DCT transformation is used.
Fig. 2. Bayer CFA and color/structure conversion
Fig. 3. Reversible image filtering (deinterlacing) [8]
[Image component → Integer 2D-DCT → Scalar quantizer → Huffman entropy coder with code book]
Fig. 4. Block diagram of transform based image coder
The two-dimensional (2D) DCT is a separable transformation and is usually implemented as a 1D row-wise transformation followed by a 1D column-wise one. In the proposed algorithm the 2D-DCT of each non-overlapping 4×4 input data block X (separate RGB or YCgCo components) has been computed as
\[
Y = T_f X T_f^T \qquad (3)
\]

where \(T_f\) denotes an integer approximation of the 1D-DCT matrix [10]:
\[
T_f = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 2 & 1 & -1 & -2 \\ 1 & -1 & -1 & 1 \\ 1 & -2 & 2 & -1 \end{bmatrix}
\qquad (4)
\]
and superscript T denotes transposition. Since the transformation (3)-(4) represents only an approximation of the original DCT, an additional scaling by a matrix \(S_f\) is required in (3):

\[
Y = \left( T_f X T_f^T \right) \otimes S_f, \qquad
S_f = \begin{bmatrix} a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \\ a^2 & ab/2 & a^2 & ab/2 \\ ab/2 & b^2/4 & ab/2 & b^2/4 \end{bmatrix}
\qquad (5)
\]

where \(a = 1/2\), \(b = \sqrt{2/5}\), and \(\otimes\) denotes element-by-element multiplication. This element-by-element multiplication can be incorporated into the quantization step. What is important, apart from its good data decorrelation property, the 1D-DCT (4) also has a very efficient computational implementation, presented in Fig. 5.
Fig. 5. Low complexity 4-point DCT butterfly algorithm
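A minimal C sketch of the butterfly from Fig. 5, i.e. one application of T_f to a row (or column) of a 4×4 block, using only additions and shifts (the function name is illustrative):

/* 4-point forward integer DCT butterfly: X = Tf * x */
static void dct4_forward(const int x[4], int X[4])
{
    int s0 = x[0] + x[3], s1 = x[1] + x[2];   /* sums */
    int d0 = x[0] - x[3], d1 = x[1] - x[2];   /* differences */
    X[0] = s0 + s1;
    X[2] = s0 - s1;
    X[1] = (d0 << 1) + d1;                    /* 2*d0 + d1 */
    X[3] = d0 - (d1 << 1);                    /* d0 - 2*d1 */
}

Applying this routine to the four rows and then to the four columns of a block realizes the separable 2D transform (3).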
The inverse DCT of a 4×4 input data block Y can be computed using the following formula:

\[
\hat{X} = T_i^T \left( Y \otimes S_i \right) T_i \qquad (6)
\]
where

\[
T_i = \begin{bmatrix} 1 & 1 & 1 & 1 \\ 1 & 1/2 & -1/2 & -1 \\ 1 & -1 & -1 & 1 \\ 1/2 & -1 & 1 & -1/2 \end{bmatrix},
\qquad
S_i = \begin{bmatrix} a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \\ a^2 & ab & a^2 & ab \\ ab & b^2 & ab & b^2 \end{bmatrix}
\qquad (7)
\]
After the DCT transformation the coefficients of Y are scalar quantized, which results in a high compression ratio and a loss of information.
In the discussed algorithm the coefficients of the DCT transform are first quantized and then entropy coded. A technique well known from the JPEG standard [11] was applied, to 4×4 data blocks instead of 8×8. Since the DC coefficients of adjacent 4×4 blocks are strongly correlated, they are coded differentially. The remaining AC coefficients are entropy coded in a 2-step process. In the first step the sequence of quantized coefficients is converted into an intermediate sequence of symbols (RL, v), where RL is the number of consecutive zero-valued AC coefficients preceding the nonzero AC coefficient v. If all the remaining AC coefficients in the block are zero, they are represented by the symbol (0,0). In the second step variable-length Huffman codes (VLC) are assigned to the symbols z = 16·RL + |v|_2, where |v|_2 denotes the length of the binary representation (without leading zeros) of v (if v > 0) or of −v (if v < 0). The value of v is encoded using a variable length integer (VLI) code whose length in bits, |v|_2, was already encoded using the VLC. The VLI of v equals the binary representation of v if v > 0, or of −v if v < 0. Separate Huffman tables are designed for the DC and AC coefficients.
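A small C sketch of the symbol formation described above (the names are illustrative; the Huffman code assignment itself is table driven and not shown):

/* Builds the Huffman symbol index z = 16*RL + |v|_2 for a nonzero AC coefficient. */
static int ac_symbol(int run_length, int v)
{
    int a = v < 0 ? -v : v;
    int bits = 0;                 /* |v|_2: bits of |v| without leading zeros */
    while (a) { bits++; a >>= 1; }
    return 16 * run_length + bits;
}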
2.3 Hardware Implementation

Finally, the design of a future hardware implementation of the proposed image encoder has been addressed. The developed VLSI architecture is presented in Fig. 6. As expected, memory is the most heavily used resource. It is required by the 2D DCT transformation, the color converter and the entropy encoder. Although the 1D DCT over rows can be implemented very efficiently (during pixel acquisition), the vertical one requires 4 rows to be stored in memory. Since the color converter operates on two lines, one of them has to be acquired first and stored in memory. In order to reduce the memory required by the entropy coder, Huffman codes for symbols z ≥ 64 are not stored in the code book. Instead they are encoded as an escape sequence followed by a fixed-length code. Since such symbols occur very infrequently, this technique allows a four-fold reduction of the codebook size while causing only a negligible decrease in compression ratio. The proposed approach significantly reduces the Huffman code length as well as the width of the codebook memory word. The estimated power budget of the design is given in Tab. 1. For VGA (640×480) image resolution and 10 frames per second (fps) the pixel clock equals 3 MHz. The circuit supply current is then 3 × 0.984 = 2.95 mA. We can conclude that the expected power requirements are similar to the much simpler design proposed in [3], but our algorithm offers a significantly higher compression ratio.
3 Results

The proposed algorithm has been implemented as a computer program and its efficiency has been measured. Two exemplary full RGB color images, one from colonoscopy (Fig. 7) and one from gastroscopy (Fig. 8), have been used in the initial experimental tests.
Fig. 6. Simplified VLSI architecture of the proposed image coder
Table 1. Estimated power budget

Cell type (model current, µA/MHz): DF1 0.345, ADD31 0.342, MUX21 0.148, BUFE2 0.254, SRAM 4 kbits 130, SRAM 8 kbits 160, SRAM 64 kbits 251

Module | DF1 | ADD31 | MUX21 | BUFE2 | SRAM 4 kbits | SRAM 8 kbits | SRAM 64 kbits | Supply current [µA/MHz]
Color trans. | 32 | 64 | 8 | - | - | 1 | - | 193
Line buffer | - | - | - | 16 | - | - | 2 | 503
DCT | 240 | 96 | 192 | - | - | - | - | 144
Quantizer | - | 36 | - | - | - | - | - | 12
Entropy coder | 4 | 4 | - | - | 1 | - | - | 132
TOTAL | | | | | | | | 984
Since the discussed algorithm operates on the Bayer CFA format, the test images have been converted into this format by low-pass filtering (with a short 3-tap half-band filter) and appropriate resampling. As an objective quality measure the peak signal-to-noise ratio (PSNR) has been used:

\[
\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\mathrm{mse}}, \qquad
\mathrm{mse} = \frac{1}{N \cdot M} \sum_{n=1}^{N} \sum_{m=1}^{M} \left( \hat{x}_{n,m} - x_{n,m} \right)^2
\qquad (8)
\]

where \(\hat{x}_{n,m}\) denotes the value of the pixel having coordinates (n, m) in the reconstructed image (in Bayer CFA format).
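A direct C sketch of the measure (8) for two N×M images stored row by row (illustrative only; the evaluation program used for the experiments is not reproduced here):

#include <math.h>

/* PSNR in dB between a reference and a reconstructed 8-bit image. */
double psnr(const unsigned char *ref, const unsigned char *rec, int n, int m)
{
    double mse = 0.0;
    for (int i = 0; i < n * m; ++i) {
        double d = (double)ref[i] - (double)rec[i];
        mse += d * d;
    }
    mse /= (double)(n * m);
    if (mse == 0.0) return INFINITY;   /* identical images */
    return 10.0 * log10(255.0 * 255.0 / mse);
}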
The obtained compression ratio (CR) with the corresponding PSNR for different versions of the DCT-based algorithm is presented in Fig. 9 and compared with the standard JPEG2000 coder. The following denotations are used:

- G1-G2-B-R – independent coding of the 4 images from the Bayer CFA sensor having colors G1, G2, B, R (the simplest approach);
- G12-B-R – coding without the color transformation (2) but with the structure transformation (Fig. 2);
- G12-B-R-di – coding without the color transformation (2) but with the structure transformation (Fig. 2) and additional deinterlacing/filtering (Fig. 3);
- Y12-U-V – coding with the color transformation (2) and the structure transformation (Fig. 2);
- Y12-U-V-di – coding with the color transformation (2), the structure transformation (Fig. 2) and additional deinterlacing/filtering (Fig. 3);
- Y12-U-V-DCT8F – algorithm based on the standard 8-point, floating-point DCT transform with the color transformation (2) and the structure transformation (Fig. 2).

We can conclude from Fig. 9 that: 1) due to aliasing and the high-frequency image content created by the Bayer CFA sampling, the 4×4 DCT algorithm slightly outperforms JPEG2000 for lower compression ratios (e.g. 15) but is worse than JPEG2000 for higher ratios (e.g. 30); both algorithms were defeated by the floating-point 8×8 DCT; 2) usage of the color space conversion (2) is beneficial; 3) the deinterlacing operation, described in Fig. 3, should be neglected since it makes the results worse. In order to obtain results with higher statistical significance, a more extensive test on a longer data set has been performed. 100 video frames were chosen in a random manner from gastroscopy/colonoscopy recordings and coded using Huffman tables with 64 entries (precomputed using a different set of frames).
Fig. 7. Exemplary colonoscopy image
Fig. 8. Exemplary gastroscopy image
Fig. 9. Results for the DCT based image coder. (test images: left - colonoscopy, right – gastroscopy).
Fig. 10. Performance comparison
Fig. 11. Histograms of results from Fig. 10
From the results presented in Figs. 10 and 11 we can conclude that for images with many diagnostic details the proposed compression scheme has an efficiency similar to the JPEG2000 standard. The JPEG2000 algorithm is significantly better only for simple, detail-free images which do not have much diagnostic value.
4 Conclusions

A low-complexity, low-power image compression algorithm suitable for wireless capsule endoscopy has been proposed and tested in the paper. It makes use of an integer version of the discrete cosine transformation. Transform coefficients are encoded using an optimized, low complexity Huffman coder. Assuming that low implementation complexity and a PSNR of at least 36 dB are required, the best PSNR-CR performance has been obtained for the DCT algorithm with the color space conversion (2) and the structure conversion (presented in Fig. 2) but without the image filtering/deinterlacing (shown in Fig. 3). For it we can obtain a compression ratio of CR = 15 for colonoscopy images and CR = 25 for gastroscopy ones.
Acknowledgments. The research activities presented in this paper were conducted under the European Commission R&D Project No 3E061105 (VECTOR).
References
1. Iddan, G., Meron, G., Glukhovsky, A., Swain, P.: Wireless capsule endoscopy. Nature 6785, 417–418 (2000)
2. Mylonaki, M., Fritscher-Ravens, A., Swain, P.: Wireless capsule endoscopy: a comparison with push enteroscopy in patient with gastroscopy and colonoscopy negative gastrointestinal bleeding. Gut J., 1122–1126 (2003)
3. Turgis, D., Puers, R.: Image compression in video radio transmission for capsule endoscopy. Sensors and Actuators A, 129–136 (2005)
4. Xie, X., Li, G.L., Wang, Z.H.: A Near-Lossless Image Compression Algorithm Suitable for Hardware Design in Wireless Endoscopy System. EURASIP Journal on Advances in Signal Processing, Article ID 82160, 1–13 (2007)
5. Turcza, P., Duplaga, M.: Low-Power Image Compression for Wireless Capsule Endoscopy. In: IEEE Int. Workshop on Imaging Systems and Tech. – IST 2007, Krakow, Poland (2007)
6. Turcza, P., Zielinski, T., Duplaga, M.: Low complexity image coding algorithm for capsule endoscopy with Bayer color filter array. Signal Processing, Poznan, Poland, 27–32 (2007)
7. Bayer, B.E.: Color Imaging Array. U.S. Patent 3,971,065 (1976)
8. Koh, C.C., Mukherjee, J., Mitra, S.K.: New efficient methods of image compression in digital cameras with color filter array. IEEE Trans. on Consumer Electronics 49(4), 1448–1456 (2003)
9. ITU-T Rec. H.264 / ISO/IEC 11496-10: Advanced Video Coding, Final Committee Draft, Document JVTF100 (December 2002)
10. Malvar, H.S., Hallapuro, A., Karczewicz, M., Kerofsky, L.: Low-Complexity Transform and Quantization in H.264/AVC. IEEE Trans. on Circuits and Systems for Video Tech. 7 (2003)
11. Wallace, G.K.: The JPEG still picture compression standard. Communications of the ACM 34, 30–44 (1991)
Database Prebuffering as a Way to Create a Mobile Control and Information System with Better Response Time

Ondrej Krejcar and Jindrich Cernohorsky

VSB Technical University of Ostrava, Center of Applied Cybernetics, Department of Measurement and Control, 17. Listopadu 15, 70833 Ostrava Poruba, Czech Republic
{Ondrej.Krejcar,Jindrich.Cernohorsky}@vsb.cz
Abstract. Location-aware services can benefit from indoor location tracking. The widespread adoption of Wi-Fi as the network infrastructure creates the opportunity of deploying WiFi-based location services with no additional hardware costs. Additionally, the ability to let a mobile device determine its location in an indoor environment supports the creation of a new range of mobile control system applications. The main area of interest is a model of a radio-frequency based system enhancement for locating and tracking users of our control system inside buildings. The developed framework, as described here, joins the concepts of location and user tracking as an extension of a new control system. The experimental framework prototype uses a WiFi network infrastructure to let a mobile device determine its indoor position. The user location is used for data pre-buffering and pushing information from the server to the user's PDA. All server data are saved as artifacts together with their position information in the building. Keywords: prebuffering; localization; framework; Wi-Fi; 802.11x; PDA.
1 Introduction

The usage of various wireless technologies has increased dramatically and will keep growing in the following years. This will lead to the rise of new application domains, each with their own specific features and needs. These new domains will also undoubtedly apply and reuse existing (software) paradigms, components and applications. Today this is easily recognized in the miniaturized applications in network-connected PDAs that provide more or less the same functionality as their desktop application equivalents. It is very likely that these new mobile application domains will adopt new paradigms that specifically target the mobile environment. We believe that an important paradigm is context-awareness. Context is relevant to the mobile user, because in a mobile environment the context is often very dynamic and the user interacts differently with the applications on his mobile device when the context is different. While a desktop machine is usually in a fixed context, a mobile device may be used at work, on the road, during a meeting, or at home. Context is not limited to the physical world around the user, but also incorporates the user's
behavior, terminal and network characteristics. Context-awareness concepts can be found as basic principles in the long-term strategic research for mobile and wireless systems such as formulated in [5]. The majority of context-aware computing to date has been restricted to location-aware computing for mobile applications (location-based services). However, position or location information is a relatively simple form of contextual information. To name a few other indicators of context awareness that make up the parametric context space: identity, spatial information (location, speed), environmental information (temperature), resources that are nearby (accessible devices, hosts), availability of resources (battery, display, network, bandwidth), physiological measurements (blood pressure, heart rate), activity (walking, running), schedules and agenda settings. Context-awareness means that anybody is able to use context information. We consider location as the prime form of context information, and we focus on position determination in an indoor environment. Location information is used to determine the actual user position and his future position. We have performed a number of experiments with the control system focusing on position determination, and we are encouraged by the results. The remainder of this paper describes the conceptual and technical details.
2 Basic Concepts and Technologies of User Localization

The proliferation of mobile computing devices and local-area wireless networks has fostered a growing interest in location-aware systems and services. A key distinguishing feature of such systems is that the application information and/or interface presented to the user is, in general, a function of his/her physical location. The granularity of the needed location information may vary from one application to another. For example, locating a nearby printer requires fairly coarse-grained location information, whereas locating a book in a library would require fine-grained information. While much research has been focused on the development of services architectures for location-aware systems, less attention has been paid to the fundamental and challenging problem of locating and tracking mobile users, especially in in-building environments. We focus mainly on RF wireless networks in our research. Our goal is to complement the data networking capabilities of RF wireless LANs with accurate user location and tracking capabilities for pre-buffering the data the user needs. We use this property as an information basis for the extension of the control system.

2.1 Data Collection

A key step of the proposed research methodology is the data collection phase. We record information about the radio signal as a function of the user's location. The signal information is used to construct and validate models for signal propagation. Among other information, the WaveLAN NIC makes the signal strength (SS) and the signal-to-noise ratio (SNR) available. SS is reported in units of dBm and SNR is expressed in dB. A signal strength of s Watts is equivalent to 10·log10(s/0.001) dBm; for example, a signal strength of 1 Watt is equivalent to 30 dBm. Each time a broadcast
Fig. 1. Localization principle – triangulation
packet is received, the WaveLAN driver extracts the SS information from the WaveLAN firmware. It then makes the information available to user-level applications via system calls. It uses the wlconfig utility, which provides a wrapper around the calls to extract the signal information.

2.2 Localization Methodology

The general principle states that if a WiFi-enabled mobile device is close to such a stationary device – an Access Point (AP) – it may "ask" the provider for its location by setting up a WiFi connection. If the mobile device knows the position of the stationary device, it also knows that its own position is within a 100-meter range of this location provider. The granularity of the location can be improved by triangulation of two or several visible WiFi APs. The PDA client supports the application in automatically retrieving location information from nearby location providers, and in interacting with the server. Naturally, this principle can be applied to other wireless technologies. The application (locator) is implemented in C# using MS Visual Studio .NET 2005 with the .NET Compact Framework and a special OpenNETCF library enhancement [6]. The schema in Fig. 1 describes the runtime localization process. Each star indicates a supposed user location that was measured and computed. The real track in the figure presents the real movement of the user over time, while the computed track is derived from the measured WiFi intensity levels.
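One simple way to combine the readings from several visible APs into a position estimate is a signal-strength weighted centroid, sketched below in C. This is only an illustration of the triangulation idea from Fig. 1 under assumed structures and weighting; it is not the actual Locator or PDPT server implementation (which is written in C#).

typedef struct { double x, y; int ss_dbm; } ap_reading_t;  /* known AP position + measured RSSI */

/* Weighted-centroid position estimate; returns 0 on success, -1 if no usable reading. */
int estimate_position(const ap_reading_t *ap, int n, double *px, double *py)
{
    double wsum = 0.0, x = 0.0, y = 0.0;
    for (int i = 0; i < n; ++i) {
        double w = (double)(ap[i].ss_dbm + 100);   /* map e.g. -90..-30 dBm to a positive weight */
        if (w <= 0.0) continue;
        x += w * ap[i].x;
        y += w * ap[i].y;
        wsum += w;
    }
    if (wsum == 0.0) return -1;
    *px = x / wsum;
    *py = y / wsum;
    return 0;
}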
2.3 WiFi Middleware

The WiFi middleware implements the client's side of the location determination mechanism on the Windows Mobile 2005 PocketPC operating system and is part of the PDA client application. The libraries used to manage the WiFi middleware are: AccessPoint, AccessPointCollection, Adapter, AdapterCollection, AdapterType, ConnectionStatus, Networking, NetworkType, and SignalStrength.

2.4 Predictive Data Push Technology

This part of the project is based on a model of location-aware enhancement which we have used in the created control system. This technique is useful in the framework to increase the real dataflow from the wireless access point (server side) to the PDA (client side). The primary dataflow is enlarged by data pre-buffering. These techniques form the basis of predictive data push technology (PDPT). PDPT copies data from the information server to the client's PDA so that they are available when the user arrives at the desired location. The benefit of PDPT consists in the reduction of the time needed to display information requested by a user command on the PDA. This time delay may vary from a few seconds to a number of minutes. It depends on two aspects. The first is the quality of the wireless Wi-Fi connection used by the client PDA. The theoretical speed of a Wi-Fi connection is at most 687 kB/s because of the protocol overhead on the physical layer (approx. 30-40%). However, the tests of the transfer rate from the server to the client's PDA, which we carried out within our Wi-Fi infrastructure, yielded speeds of only 80-160 kB/s (depending on file size and PDA device). The second aspect is the size of the copied data. The current application records just one set of signal strength measurements at a time (by the Locator unit in the PDPT Client). From this set of values the actual user position is determined by the PDPT server side. The PDPT core responds to a location change by selecting the artifacts to load into the PDPT client buffer. The data transfer speed is strongly influenced by the size of these artifacts; for larger artifacts the speed goes down. Theoretical background and tests were needed to determine an average artifact size. First of all, the maximum acceptable response time of the application (PDPT Client) had to be specified. The book "Usability Engineering" [8] specifies the maximum response time for an application as 10 seconds. During this time the user stays focused on the application and is willing to wait for an answer. We used this time period (10 seconds) to calculate the maximum possible size of a file transferred from the server to the client during this period. With a transfer speed of 80 to 160 kB/s the resulting file size is 800 to 1600 kB. The next step was the definition of the average artifact size. We used a sample database of network architecture building plans (AutoCAD file type), which contained 100 files with an average size of 470 kB. The client application can download 2 to 3 such artifacts during the 10-second period. The problem is the time needed for displaying them: in the case of the AutoCAD file type we measured this time to be 45 seconds on average. This time consumption is certainly not acceptable, and for this reason we looked for a better solution. We need to use a basic data format which can be displayed by the PDA natively (BMP, JPG, GIF) without any additional striking time consumption. The solution is a format conversion from any format to one native for PDA devices. In the case of sound and video formats we also recommend using basic data formats (wav, mp3, wmv, mpg).
The final result of our real tests and consequent calculations is the definition of the artifact size to an average value of 500 kB. The buffer size may range from 50 to 100 MB in the case of 100 to 200 artifacts.

2.5 PDPT Framework Data Artifact Management

The PDPT Server SQL database manages the information (for example data about Ethernet hardware such as Ethernet switches, UTP sockets, CAT5 cable leads, etc.) in the context of its location in the building environment. This context information is of the same kind as the location information about the user track. The PDPT core controls the data, which are copied from the server to the PDA client according to the context information (position info). Each database artifact must be saved in the database along with the position information to which it belongs.
Fig. 2. PDPT Framework data artifact management
During the creation of the PDPT Framework a new software application called "Data Artifacts Manager" was developed. This application manages the artifacts in the WLA database (localization oriented database). The user can set the priority, location, and other metadata of an artifact. This manager substitutes the online conversion mechanism, which can transform the real online control system data into WLA database data artifacts, during the test phase of the project. The manager can also be used in the case of an offline version of the PDPT Framework. The artifacts manager in this offline case is shown in Fig. 2.
The Manager allows the administrator to create a new artifact from a multimedia file (image, video, sound, etc.), and to edit or delete an existing artifact. The left side of the screen contains the text fields of the artifact metadata, such as a position in 3D space. This position is determined by the artifact size (in the case of a building plan) or by binding the artifact to some part of a building in 3D space. The 3D coordinates can be taken from a building plan by GIS software like Quantum GIS or by our own implementation [7]. The central part represents the multimedia file and the right side contains the buttons to create, edit, or delete the artifact. The lower part of the application screen shows the actual artifacts in the WLA database located on the SQL Server.

2.6 Framework Design

The PDPT framework design is based on the most commonly used server-client architecture. To process data, the server has an online connection to the control system. Technology data are continually saved to an SQL Server database [3] and [1].
Fig. 3. System architecture – UML design
The part of this database desired by the user's location or his demand is replicated online to the client's PDA, where it is visualized on the screen. The user's PDA has a location sensor component, which continuously sends information about the intensity of nearby APs to the framework kernel. The kernel processes this information and decides whether and which part of the SQL Server database will be replicated to the client's SQL Server CE database. The kernel decisions constitute the most important part of the whole framework, because the kernel must continually compute the position of the user and his track, and make a prediction of his future movement. After making this prediction, the appropriate data (a part of the SQL Server database) are pre-buffered to the client's database for possible future requirements. The PDPT framework server is created as a Microsoft web service to act as a bridge between the SQL Server and the PDPT PDA clients.
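The kernel's prebuffering decision can be sketched as follows: order the artifacts by their distance from the predicted user position and push them to the client buffer until the buffer limit is reached. The structures, the distance measure and the buffer policy below are illustrative assumptions, not the actual PDPT server code.

#include <math.h>
#include <stdlib.h>

typedef struct { int id; double x, y, z; long size_bytes; } artifact_t;

static double pred_x, pred_y, pred_z;            /* predicted user position */

static int by_distance(const void *a, const void *b)
{
    const artifact_t *p = a, *q = b;
    double dp = hypot(hypot(p->x - pred_x, p->y - pred_y), p->z - pred_z);
    double dq = hypot(hypot(q->x - pred_x, q->y - pred_y), q->z - pred_z);
    return (dp > dq) - (dp < dq);
}

/* Fills 'selected' with ids of artifacts to push; returns their count. */
int select_artifacts(artifact_t *all, int n, long buffer_limit, int *selected)
{
    long used = 0;
    int count = 0;
    qsort(all, n, sizeof(artifact_t), by_distance);   /* nearest artifacts first */
    for (int i = 0; i < n && used + all[i].size_bytes <= buffer_limit; ++i) {
        selected[count++] = all[i].id;
        used += all[i].size_bytes;
    }
    return count;
}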
Fig. 4. PDPT Client – Windows Mobile application
2.7 PDPT Client

For testing and tuning of the PDPT Core the PDPT Client application was created. This client realizes a classical client to the server side, extended by the PDPT and Locator modules. Fig. 4 shows the classical view of data presentation from the MS SQL CE database to the user (in this case an image of the Ethernet network in the company area plan). Each process running in the PDPT client is measured with millisecond resolution to provide feedback from the real situation. The PDPT and Locator tabs provide a way to tune the PDPT settings.
3 Experiments

We have executed a number of indoor experiments with the PDPT framework using the PDPT PDA application. WiFi access points are placed at different locations in the building, where the access point cells partly overlap. We have used the triangulation principle on the AP intensities to obtain a better granularity. It has been found that the location determination mechanism selects the access point that is closest to the mobile user as the best location provider. Also, after a loss of IP connectivity,
switching from one access point to another (a new best location provider) takes place within a second in the majority of cases, resulting in only a temporary loss of IP connectivity. This technique partially uses a special Radius server [4] to realize the "roaming" known from cellular networks. A user who has lost the existing AP signal is required to ask the new AP for an IP address. This is known as "renew" in Ethernet networks. At the end of this process, the user has his identical old IP and a connection to the new AP. Another technique to realize roaming is the use of WDS (Wireless Distribution System). Currently, the usability of the PDPT PDA application is somewhat limited due to the fact that the device has to be continuously powered. If it is not, the WiFi interface and the application cannot execute the location determination algorithm and the PDPT server does not receive location updates from the PDA.

3.1 Data Transfer Increase Tests Using PDPT Framework

The main result of utilizing the PDPT framework is a reduction of the data transfer time, i.e., an increase of the effective data transfer speed. The test is focused on the real usage of the developed PDPT Framework and its main benefit of increased data transfer.
Table 1. Data transfer tests description

Test | Type | Mode | Data | Time | Speed
1 | HTC Blueangel | SQL CE | 2949 | 5 | 643
2 | HTC Blueangel | SQL CE | 4782 | 9 | 2228
3 | HTC Blueangel | SQL | 2949 | 34 | 80
4 | HTC Blueangel | SQL | 4782 | 57 | 69
5 | HTC Blueangel | PDPT | 2949 | 12 | 234
6 | HTC Blueangel | PDPT | 4782 | 20 | 278
7 | HTC Universal | SQL CE | 2949 | 3 | 514
8 | HTC Universal | SQL CE | 4782 | 6 | 1782
9 | HTC Universal | SQL | 2949 | 21 | 51
10 | HTC Universal | SQL | 4782 | 38 | 64
11 | HTC Universal | PDPT | 2949 | 9 | 214
12 | HTC Universal | PDPT | 4782 | 16 | 2228
Table 1 summarizes a series of tests with different types of PDA and three types of data transfer mode. Each of these tests was repeated five times for better accuracy, and the data in the table are average values over the iterations. The mode column may contain three different data transfer modes. The SQL CE mode represents data saved in the mobile device memory (SQL Server CE), for which the data transfer time is very low. The second mode, SQL, represents data stored at the server (SQL Server 2005): primary data are loaded over Ethernet/Internet to the SQL Server CE of the mobile device, and only then are the data shown to the user. The time consumption of this method is generally very high, which results in a long waiting time for the user. The third mode, PDPT, is a combination of the previous two methods. The PDPT mode provides very good results in the form of a data transfer acceleration. The realization of this test consists of the user's movement from a sample location A to B along three different directions. Location B was a destination with requested data,
which was not contained in the SQL CE buffer of the mobile device before the test. The resulting time of this mode consists of the average time to view a collection of requested data artifacts on the client PDA. If the requested artifact is in the SQL CE buffer before the request, the time is very short. If the artifact is not present in the buffer, the PDPT client must download it from the server. The final time for the third method therefore represents the real usage result.

Acknowledgment. This work was supported by the Ministry of Education of the Czech Republic under Project 1M0587.
4 Conclusions

The main objective of this paper is the enhancement of a control system for locating and tracking users inside a building. It is possible to locate and track the users with a high degree of accuracy. In this paper we have presented a control system framework that uses and handles location information and control system functionality. The indoor location of a mobile user is obtained through an infrastructure of WiFi access points. This mechanism measures the quality of the link to nearby location provider access points to determine the actual user position. The user location is used in the core of the server application of the PDPT framework for data pre-buffering and pushing information from the server to the user's PDA. Data pre-buffering is the most important technique to reduce the time from a user request to the system response. The experiments show that the location determination mechanism provides a good indication of the actual location of the user in most cases. The median resolution of the system is approximately five meters. Some inaccuracy does not influence the way the localization is derived from the WiFi infrastructure, and, as discussed in the Experiments section, this was not found to be a big limitation for the PDPT framework application. The experiments also show that the current state of the basic technology which was used for the framework (mobile device hardware, PDA operating system, wireless network technology) is now at a level of high usability for the PDPT application [13].
References
1. Reynolds, J.: Going Wi-Fi: A Practical Guide to Planning and Building an 802.11 Network. CMP Books (2003)
2. Wigley, A., Roxburgh, P.: ASP.NET Applications for Mobile Devices. Microsoft Press, Redmond (2003)
3. Tiffany, R.: SQL Server CE Database Development with the .NET Compact Framework. Apress (2003)
4. The Internet Engineering Task Force RADIUS Working Group, http://www.ietf.org/
5. The Wireless World Research Forum (WWRF), http://www.wireless-world-research.org/
6. OpenNETCF - Smart Device Framework, http://www.opennetcf.org
7. Horak, J., Unucka, J., Stromsky, J., Marsik, V., Orlik, A.: TRANSCAT DSS architecture and modelling services. Control and Cybernetics 35, 47–71 (2006)
8. Nielsen, J.: Usability Engineering. Morgan Kaufmann, San Francisco (1994)
9. Krejcar, O.: Prebuffering as a way to exceed the data transfer speed limits in mobile control systems. In: ICINCO 2008, 5th International Conference on Informatics in Control, Automation and Robotics. INSTICC Press, Funchal (2008)
10. Evennou, F., Marx, F.: Advanced integration of WiFi and inertial navigation systems for indoor mobile positioning. EURASIP Journal on Applied Signal Processing, Hindawi Publishing Corp., New York (2006)
11. Olivera, V., Plaza, J., Serrano, O.: WiFi localization methods for autonomous robots. Robotica 24, 455–461 (2006)
12. Salazar, A.: Positioning Bluetooth (R) and Wi-Fi (TM) systems. IEEE Transactions on Consumer Electronics 50, 151–157 (2004)
13. Janckulik, D., Krejcar, O., Martinovic, J.: Personal Telemetric System – Guardian. In: Biodevices 2008, pp. 170–173. INSTICC, Funchal (2008)
Network Traffic Classification by Common Subsequence Finding

Krzysztof Fabjański and Tomasz Kruk

NASK, The Research Division, Wąwozowa 18, 02-796 Warszawa, Poland
{krzysztof.fabjanski,tomasz.kruk}@nask.pl
http://www.nask.pl
Abstract. The paper describes issues related to network traffic analysis. The scope of this article includes a discussion of the problem of network traffic identification and classification. Furthermore, the paper presents two bioinformatics methods: Clustal and Center Star. Both methods were carefully adapted to the network security purpose. In both methods, the concept of extraction of a common subsequence, based on multiple sequence alignment of more than two network attack signatures, was used. This concept was inspired by bioinformatics solutions for problems related to finding similarities in a set of DNA, RNA or amino acid sequences. Additionally, the scope of the paper includes a detailed description of the test procedures and their results. At the end some relevant evaluations and conclusions regarding both methods are presented. Keywords: network traffic analysis, anomaly detection, network intrusion detection systems, common subsequence finding, bioinformatics algorithms, Clustal algorithm, Center Star method, automated generation of network attack signatures
1 Introduction

The Internet became one of the most popular tools, used by almost everyone. It is important to mention that the Internet and the World Wide Web (WWW) are not synonymous: the World Wide Web is one of the many services available in the Internet. The Internet consists of an enormous number of computer networks. Therefore, the issue of network security is very important. The network security issue is not only a set of security methods required for ensuring safety; it also consists of elements related to the network security policy which should be obeyed. Different institutions and companies introduce their private security policies. Often, security policies are defined according to some known standards. Unfortunately, this approach does not guarantee that precious resources will remain unaffected. Other, more sophisticated methods should be introduced. One of the most recognized families of systems are network intrusion detection systems. This group of systems allows alerting about unwanted and malicious activity registered in the network flow. The process of identifying a malicious network flow involves comparing the network flow content with a predefined set of rules. The set of rules,
also known as a set of network attack signatures, describes different Internet threats by mapping their content into a specific format. Despite these methods of malicious flow identification, there are many new Internet threats which have not been discovered yet. Fortunately, there are methods and heuristic approaches which allow the identification of new Internet threats by following different network trends and statistics. Although those methods are very promising, there is still a huge requirement for new algorithms. Those algorithms should be capable of analysing a huge portion of attack signatures for network intrusion detection systems, produced in an automatic manner. In order to support this process, some new approaches were proposed. One of the ways of analysing attack signature collections is the bioinformatics approach. Multiple sequence alignment is a fundamental tool in bioinformatics analysis. This tool allows finding similarities embedded in a set of DNA, RNA or amino acid sequences. The bioinformatics approach can be adapted to the network traffic identification and classification problem. The second section of this article presents different systems for network traffic analysis. Section three briefly develops two bioinformatics methods: Center Star and Clustal. The fourth section includes various test results. The last section discusses algorithm complexity and their suitability for network traffic analysis.

2 Network Traffic Classification and Identification Problem

Computer threats are often the cause of unwanted incidents, which might cause irreversible damage to the system. From the scientific point of view, computer threats use certain vulnerabilities; therefore threats, vulnerabilities and exposures should be considered as disjoint issues. One of the most popular and widely present groups of Internet threats is the Internet worm [1]. Intrusion detection systems (IDS) [2] detect malicious network flows mainly by analyzing their content. An example of IDS are network intrusion detection systems (NIDS). NIDS are able to detect many types of malicious network traffic, including worms. One of the most popular NIDS is Snort. It is an open source program available for various architectures. It is equipped with a regular expression engine which enhances the network traffic analysis. It analyses the network flow by comparing its content (not only the payload) with a specific list of rules. During this process, Snort utilises the regular expression engine. As a result of this analysis, Snort makes a decision regarding a particular network flow, whether it is malicious or regular. An example of a simple Snort rule is shown in Table 1.
2 Network Traffic Classification and Identification Problem Computer threats are often a reason of unwanted incidents, which might cause irreversible damage to the system. From the scientific point of view, computer threats use certain vulnerabilities, therefore threats, vulnerabilities and exposures should be considered as a disjointed issue. One of the most popular and widely present group of Internet threats is Internet Worm [1]. Intrusion detection systems (IDS) [2] detect mainly malicious network flow by analyzing its content. An example of IDS are network intrusion detection systems (NIDS). NIDS are able to detect many types of malicious network traffic including worms. One of the most popular NIDS is Snort. It is an open source program available for various architectures. It is equipped with the regular expression engine which enhance the network traffic analysis. It analyses the network flow by comparing its content (not only a payload) with a specific list of rules. During this process, Snort utilises the regular expression engine. As a result of this analysis, Snort makes a decision regarding a particular network flow, whether it is malicious or regular. An example of simple Snort rule is shown in the table (Table 1). Table 1. An exemplary Snort rule alert udp $EXTERNAL_NET 2002 -> $HTTP_SERVERS 2002 ( msg:"MISC slapper worm admin traffic"; content:"|00 00|E|00 00|E|00 00|@|00|"; depth:10; reference:url,isc.incidents.org/analysis.html?id=167; reference:url,www.cert.org/advisories/CA-2002-27.html; classtype:trojan-activity; sid:1889; rev:5;)
Snort works as a single-threaded application. Its action is to receive, decode and analyse the incoming packets. Snort allows us to identify unwanted malicious network flows by generating appropriate alerts. The main problem is that if the set of rules for Snort has a poor quality, we can expect many false positive or false negative alerts. Therefore, the classification of network attack signatures as well as improving their quality is a matter of great importance. Very often NIDS are combined with systems for automated generation of network attack signatures, such as Honeycomb [3,4]. Tools which join the functions of a NIDS and an automated signature creation system are known as network early warning systems (NEWS). An exemplary network early warning system is Arakis [5]. NEWS develop very sophisticated methods for network traffic classification in order to speed up the process of identification of potential new Internet threats. The main problem concerning classification and identification of network flows is related to the extraction of common regions from sets of network attack signatures [6]. Many techniques were developed. One of the techniques allowing network security specialists to distinguish the regular network flow from the suspicious one is the usage of honeypots [7]. A honeypot is a specially designed system which simulates some network resources in order to capture the malicious flow. Generally, it consists of a part of an isolated, unprotected and monitored network with some virtually simulated computers which seem to have valuable resources or information. Therefore, flow which occurs inside the honeypot is assumed to be malicious by definition. Protocol analysis and pattern-detection techniques performed on flows collected by honeypots result in network attack signature generation. Generation of network attack signatures is mainly based on longest common substring extraction [4,8]. One of the tools allowing generation of network attack signatures is Honeycomb.
3 Sequence Alignment

Sequence alignment is a tool that can be used for extracting common regions from a set of network attack signatures [6]. Extraction of common regions is illustrated in Fig. 1. The task is somewhat similar to a biologist's: a biologist identifies newly discovered genes by comparing them to families of genes whose function is already known, assigning the new genes to known families by common subsequence finding. The problem of extracting common regions from a set of network attack signatures is in fact the multiple sequence alignment (MSA) [9] problem. MSA is a generalization of pairwise alignment [10]: gaps are inserted into each string so that the resulting strings have equal length.

Fig. 1. Problem of the longest common subsequence finding (alignment of the HTTP request strings "GET / HTTP", "GET /a/a.HTM HTTP" and "GET / HTTP/1.1")

Although the problem of multiple sequence
alignment is an NP-complete task, there are many heuristic, probabilistic and other approaches that cope with the issue; a classification of those methods was proposed in [10]. Among the many algorithms, two classical approaches were chosen. The first, and probably the most basic, is the Center Star method. It was chosen for the network traffic identification purpose; the main goal of this adaptation was to check whether the method can be used for extracting common regions from network attack signatures. The second algorithm, required for the classification of network attack signatures, is Clustal. It is worth mentioning that in both algorithms a global alignment was used, computed with the Needleman-Wunsch algorithm [20].

3.1 Center Star Method

The Center Star method [11] belongs to the group of algorithms with some elements of approximation. As mentioned before, multiple sequence alignment is an NP-complete problem; the Center Star method approximates the multiple sequence alignment, so the results it produces may, but do not have to, be optimal. The method consists of three main steps. A detailed description of the Center Star method can be found in [11].

3.2 Clustal Algorithm

Clustering is a method which classifies objects into appropriate groups (clusters). Classification is performed based on a defined distance measure, and every object in a single cluster should share a common trait. Data clustering is widely used in many fields of science, among them data mining, pattern recognition and bioinformatics. Data clustering algorithms can be divided into two main categories:
– hierarchical methods – assign objects to particular clusters by measuring the distance between them,
– partitioning algorithms – start with an initial partition and then optimize an objective function by an iterative control strategy; every cluster is represented by its gravity center or by its center object. In the partitioning approach, new clusters are generated and then the new cluster centers are recomputed.

Among hierarchical methods, in turn, we can distinguish two types:
– agglomerative – the clustering procedure begins with each element as a separate cluster; by merging them into larger clusters, we reach the point where all elements are classified into one big cluster,
– divisive – starts the process from one big set and then divides it into successively smaller subsets.

Clustal is an example of an agglomerative algorithm, also known as a "bottom-up" approach. During the implementation of the Clustal algorithm some modifications were introduced in order to adapt the method for the classification of network attack signatures. Instead of a profile representation of the internal nodes of the dendrogram, a consensus sequence was used. This was caused mainly by the fact that so far the
scoring scheme used for network traffic classification has a very basic structure. Assuming that a network flow can be represented as a sequence of extended ASCII characters, we score 1 for each match and 0 otherwise. Although this standard objective function is the only reasonable solution at the moment, there is ongoing research [12] which may result in a new scoring scheme proposition.
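As a hedged illustration of the global alignment used here, the following self-contained C++ sketch implements the Needleman-Wunsch dynamic program with exactly this basic scoring scheme (match = 1, mismatch = 0, no gap penalty). It is not the implementation evaluated in this paper; the function and type names are our own.

// Needleman-Wunsch global alignment over extended-ASCII strings with the
// basic scoring scheme described above (1 per match, 0 otherwise, gap = 0).
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

struct Alignment { std::string a, b; int score; };

Alignment needlemanWunsch(const std::string& x, const std::string& y,
                          int match = 1, int mismatch = 0, int gap = 0) {
    size_t n = x.size(), m = y.size();
    std::vector<std::vector<int>> S(n + 1, std::vector<int>(m + 1, 0));
    for (size_t i = 0; i <= n; ++i) S[i][0] = static_cast<int>(i) * gap;
    for (size_t j = 0; j <= m; ++j) S[0][j] = static_cast<int>(j) * gap;
    for (size_t i = 1; i <= n; ++i)
        for (size_t j = 1; j <= m; ++j) {
            int diag = S[i-1][j-1] + (x[i-1] == y[j-1] ? match : mismatch);
            S[i][j] = std::max({diag, S[i-1][j] + gap, S[i][j-1] + gap});
        }
    // Traceback to recover one optimal alignment (gaps shown as '-').
    Alignment al; al.score = S[n][m];
    size_t i = n, j = m;
    while (i > 0 || j > 0) {
        if (i > 0 && j > 0 &&
            S[i][j] == S[i-1][j-1] + (x[i-1] == y[j-1] ? match : mismatch)) {
            al.a += x[--i]; al.b += y[--j];
        } else if (i > 0 && S[i][j] == S[i-1][j] + gap) {
            al.a += x[--i]; al.b += '-';
        } else {
            al.a += '-'; al.b += y[--j];
        }
    }
    std::reverse(al.a.begin(), al.a.end());
    std::reverse(al.b.begin(), al.b.end());
    return al;
}

int main() {
    Alignment al = needlemanWunsch("GET / HTTP", "GET /a/a.HTM HTTP/1.1");
    std::cout << al.a << "\n" << al.b << "\nscore = " << al.score << "\n";
}

With this scoring scheme the alignment score equals the length of the longest common subsequence of the two signatures, which is why the precision measure used later compares MSA and LCS lengths.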
4 Tests and Results

This section provides a detailed description of the efficiency tests of the Center Star method and the Clustal algorithm. Tests were executed on an Intel(R) Xeon(TM) 3.00 GHz computer equipped with 2075808 kB of total memory; the compiler was g++ (v4.1). In the test procedure, external data sets extracted from the Arakis database were used. Since the data came from the Arakis database, some test results could be confronted with the results of the Arakis algorithms. The data used in the tests consisted of real network signatures suspected to be malicious. The Arakis algorithms are mainly based on the DBSCAN [19] clustering mechanism and an edit distance measure.

4.1 Center Star Method Tests

Figure 2(a) presents the experimental data set. The horizontal axis represents the total number of characters (counted as the total sum of network attack signature lengths); the vertical axis represents the actual number of processed signatures. This data set was used in the Center Star method tests. Figure 2(b) shows the execution time of the Center Star method, measured in seconds. Figure 2(c) reflects the relation between the length of the multiple sequence alignment (MSA) and the common subsequence extracted from the MSA; this relation gives information about the total length of the extracted subsequence. In the next test, we measured the average length of a single division in one signature. Assuming that a single network attack signature may consist of many parts, this test provided approximate information about the quality of the extracted common subsequence: the greater the average length of a single division in a network attack signature, the lower the probability of false positive or false negative alerts. The Center Star method was also compared with the Arakis algorithms; the results are shown in Fig. 2(d). In some cases the Arakis algorithm seems to obtain better results than the Center Star algorithm. Those situations were investigated precisely, and it turned out that the reason lay in a different interpretation of the cluster representatives: in some cases the Arakis algorithm does not update the cluster representatives even if some very long network attack signatures have expired, which results in overestimating the average single division length of the common subsequence.

4.2 Clustal Algorithm Tests

Most of the Clustal tests were performed in order to compare the results with those of the Arakis algorithm. The data used in the tests are shown in Fig. 3(a).
Fig. 2. Center Star method tests: (a) the number of characters vs the number of signatures; (b) the Center Star method execution time; (c) the MSA and LCS length relation; (d) the average single division length of the common subsequence, Arakis algorithm vs Center Star method
The next test (Fig. 3(b)) investigates the Clustal algorithm execution time with respect to the total number of processed characters; time was measured in seconds. Figure 4 presents the comparison of the Arakis clustering algorithm with the Clustal method. The comparison was made in order to show the main advantage of the Clustal algorithm, which is the possibility of adjustment. Two subfigures (a,c) present the number of clusters produced by the Arakis and Clustal algorithms. In subfigure (a) we can see that the Clustal algorithm produces a smaller number of clusters than the Arakis solution; however, the precision in that test was rather poor (b). The smaller number of clusters was achieved using EPS1 = 0.01. EPS1 is an epsilon which determines whether the investigated signature should be classified to a particular cluster: the condition is considered satisfied when the distance (the Levenshtein distance [18]) between two signatures is greater than EPS1. Precision is expressed as the ratio of the MSA length to the LCS length; the closer the ratio is to 1, the better the precision. Better precision is obtained at the cost of a greater number of clusters. In the next subfigures (c,d), EPS1 was set to 0.9. For this value, the Clustal algorithm produces many more clusters than the Arakis algorithm.
Fig. 3. Clustal algorithm tests: (a) the number of characters vs the number of signatures; (b) the Clustal algorithm execution time

On the other hand, the precision
gained in those two tests was very high. All four subfigures (a,b,c,d) were generated by running the Clustal algorithm with the standard scoring scheme, i.e., 1 for each match and 0 otherwise. The parameters MATCH, MISMATCH and GAP_PENALTY were set according to this standard scoring scheme. The reason why the gap penalty had the same value as the mismatch score is straightforward: so far there is no scoring scheme for the ASCII alphabet, so only the trivial approach is presented, in which gap penalties are not considered. In an extended test procedure, different values for the gap penalties were assigned; those results are preliminary and thus are not published in this paper.
5 Evaluation and Conclusions

In this section, a detailed estimation of the main methods is given. The estimation is based on theoretical assumptions and confronted with the empirical implementation of both methods.

5.1 Center Star Method Complexity

The Center Star method consists of three main phases, after which the multiple sequence alignment is found. In the first phase, all pairwise alignments are formed (distance matrix calculation). The complexity of this phase, in the worst case, is O((N² + 3N) · K/2), where K is the number of input signatures. The second phase is related to finding the signature which is "the closest" to the others; this step requires O(K). In the last phase, the multiple sequence alignment is formed; the computational complexity of this last phase of the Center Star method is O(2N · K/2). The Center Star method provides essential functionality in the common motif finding process: it allows us to extract the common subsequence from the multiple sequence alignment. This procedure requires O(KN).
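To make these phases concrete, the following self-contained C++ sketch shows the skeleton of the method: pairwise scoring (here a simple LCS-length dynamic program stands in for the full pairwise alignment), center selection, and the final common-subsequence extraction over the columns of an already built MSA. The merge step that actually constructs the MSA is omitted, and this is an illustration, not the implementation evaluated above.

// Skeleton of the Center Star phases; names and the stand-in scoring are our own.
#include <algorithm>
#include <iostream>
#include <string>
#include <vector>

// Phase-1 stand-in: pairwise similarity as LCS length (match = 1, else 0).
static int pairScore(const std::string& a, const std::string& b) {
    std::vector<std::vector<int>> d(a.size() + 1, std::vector<int>(b.size() + 1, 0));
    for (size_t i = 1; i <= a.size(); ++i)
        for (size_t j = 1; j <= b.size(); ++j)
            d[i][j] = (a[i-1] == b[j-1]) ? d[i-1][j-1] + 1
                                         : std::max(d[i-1][j], d[i][j-1]);
    return d[a.size()][b.size()];
}

// Phase 2: the center is the signature with the highest summed pair score.
static size_t pickCenter(const std::vector<std::string>& sigs) {
    size_t best = 0; long bestSum = -1;
    for (size_t i = 0; i < sigs.size(); ++i) {
        long sum = 0;
        for (size_t j = 0; j < sigs.size(); ++j)
            if (i != j) sum += pairScore(sigs[i], sigs[j]);
        if (sum > bestSum) { bestSum = sum; best = i; }
    }
    return best;
}

// Common-subsequence extraction: keep the MSA columns (equal-length rows,
// gap character '-') in which every row agrees; O(KN) as stated above.
static std::string commonSubsequence(const std::vector<std::string>& msa) {
    std::string out;
    if (msa.empty()) return out;
    for (size_t col = 0; col < msa[0].size(); ++col) {
        char c = msa[0][col];
        bool same = (c != '-');
        for (size_t row = 1; same && row < msa.size(); ++row)
            same = (msa[row][col] == c);
        if (same) out += c;
    }
    return out;
}

int main() {
    std::vector<std::string> sigs = {"GET / HTTP", "GET /a/a.HTM HTTP", "GET / HTTP/1.1"};
    std::cout << "center signature: " << sigs[pickCenter(sigs)] << "\n";
    std::vector<std::string> msa = {"GET /------- HTTP----",
                                    "GET /a/a.HTM HTTP----",
                                    "GET /------- HTTP/1.1"};
    std::cout << "common subsequence: " << commonSubsequence(msa) << "\n";
}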
Fig. 4. Number of clusters vs precision: comparison of the Arakis algorithm with the Clustal algorithm. (a) number of clusters and (b) pattern length (MSA and LCS), with dist = 1, MATCH = 1, MISMATCH = 0, GAP_PENALTY = 0, EPS1 = 0.01; (c) number of clusters and (d) pattern length, with the same scoring parameters and EPS1 = 0.9
5.2 Clustal Algorithm Complexity

The Clustal algorithm contains complicated and time-consuming procedures: distance matrix calculation, dendrogram creation and the clustering mechanism. These three phases have the following computational complexities:
1. Distance matrix calculation – O((N² + 3N) · K/2)
2. Dendrogram creation – O(K/2 + 2(K − 1) · [K/2 + N + 4(N + K)])
3. Clustering (reading the dendrogram and writing the clusters to a file) – O(K)

All calculations regarding computational complexity were based on theoretical assumptions and source code analysis. The run-time dependencies shown in Fig. 3(b) seem to confirm the results, and the presented computational complexities do not contradict the theoretical assumptions regarding the complexities of the presented methods. To sum up, the Clustal and Center Star algorithms have their advantages and disadvantages. One of the biggest drawbacks of both algorithms is their high run-time complexity; on the other hand, the whole task is an NP-complete problem, so we cannot expect a better run-time complexity. Both Clustal and the Center Star method can be modified in order to decrease this complexity. In the Center Star method, instead of finding
all pairwise alignments, we could take a randomly selected sequence from the set of input signatures and form the multiple sequence alignment by computing the pairwise alignments of the chosen sequence with the remaining sequences. As a result, we would omit the process of choosing the center sequence, which involves computing all pairwise alignments in the set of network attack signatures. In the Clustal algorithm, on the other hand, instead of using the Neighbor-Joining algorithm [13,15,16,17] for dendrogram creation, we could use the Unweighted Pair Group Method with Arithmetic Mean (UPGMA) [14]. UPGMA is faster than the Neighbor-Joining algorithm at the expense of precision. These improvements lead to a better time complexity but result in worse common subsequence extraction and worse network traffic classification. In our case, a better time complexity might turn out to be more important than a somewhat worse common subsequence extraction: extraction of the common subsequence during the preprocessing phase should be performed in online mode, whereas clustering of already created signatures must be performed in offline mode. This paper covers the classification and identification of network attack signatures only; no tests checking the influence on the number of false positive or false negative alerts have been performed. Such experiments will be carried out once we have finally shown that bioinformatics methods are suitable for suspicious network traffic analysis. In further work, the adapted methods will be developed continuously. Although the results of the tests performed on real network traffic data are very promising, there is still the open issue of proposing a new scoring function. Further work will therefore focus on statistics of network traffic. Such statistics will make it possible to represent particular families of Internet threats as profile structures. Profile structures will allow us to create scoring matrices similar to those known from bioinformatics and to deal with Internet threats such as polymorphic worms; furthermore, profiles will allow us to identify those regions in network traffic patterns which remain unchanged even in the case of polymorphic Internet threats.
References
1. Nazario, J.: Defense and Detection Strategies against Internet Worms. Artech House, Boston & London (2004)
2. Kreibich, C., Crowcroft, J.: Automated NIDS Signature Creation using Honeypots. University of Cambridge Computer Laboratory (2003)
3. Kreibich, C., Crowcroft, J.: Honeycomb – Creating Intrusion Detection Signatures Using Honeypots. In: Proceedings of the Second Workshop on Hot Topics in Networks (HotNets-II), ACM SIGCOMM, Cambridge, Massachusetts (2003)
4. Rzewuski, C.: Bachelor's Thesis: SigSearch – automated signature generation system (in Polish). Warsaw University of Technology, The Faculty of Electronics and Information Technology (2005)
5. Kijewski, P., Kruk, T.: Arakis – a network early warning system (in Polish) (2006)
6. Kreibich, C., Crowcroft, J.: Efficient sequence alignment of network traffic. In: IMC 2006: Proceedings of the 6th ACM SIGCOMM Conference on Internet Measurement, pp. 307–312. ACM Press, Brazil (2006)
7. Bakos, G., Beale, J.: Honeypot Advantages & Disadvantages, Las Vegas, pp. 7–8 (November 2002)
8. Kreibich, C.: libstree – A generic suffix tree library, http://www.icir.org/christian/libstree/
9. Gusfield, D.: Efficient method for multiple sequence alignment with guaranteed error bound. Report CSE-91-4, Computer Science Division, University of California, Davis (1991)
10. Reinert, K.: Introduction to Multiple Sequence Alignment. Algorithmische Bioinformatik WS 03, 1–30 (2005)
11. Bioinformatics Multiple sequence alignment, http://homepages.inf.ed.ac.uk/fgeerts/course/msa.pdf
12. Kharrazi, M., Shanmugasundaram, K., Memon, N.: Network Abuse Detection via Flow Content Characterization. In: IEEE Workshop on Information Assurance and Security, United States Military Academy (2004)
13. Saitou, N., Nei, M.: The neighbor-joining method: a new method for reconstructing phylogenetic trees. Mol. Biol. Evol. 2, 406–425 (1987)
14. Tajima, F.: A Simple Graphic Method for Reconstructing Phylogenetic Trees from Molecular Data. In: Reconstruction of Phylogenetic Trees, Department of Population Genetics, National Institute of Genetics, Japan, pp. 578–589 (1990)
15. The Neighbor-Joining Method, http://www.icp.ucl.ac.be/~opperd/private/neighbor.html
16. Weng, Z.: Protein and DNA Sequence Analysis BE561. Boston University (2005)
17. Multiple alignment: heuristics, http://www.bscbioinformatics.com/Stu/Dbq/clustalW.pdf
18. Levenshtein, V.: Binary codes capable of correcting deletions, insertions and reversals. Soviet Physics Doklady, 707–710 (1966)
19. Ester, M., Kriegel, H., Sander, J., Xu, X.: A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining (KDD 1996), Institute for Computer Science, University of Munich (1996)
20. Fabjański, K.: Master's Thesis: Network Traffic Classification by Common Subsequence Finding. Warsaw University of Technology, The Faculty of Electronics and Information Technology, Warsaw (2007)
A Hierarchical Leader Election Protocol for Mobile Ad Hoc Networks

Orhan Dagdeviren¹ and Kayhan Erciyes²

¹ Izmir Institute of Technology, Computer Eng. Dept., Urla, Izmir TR-35340, Turkey
[email protected]
² Ege University, International Computer Institute, Bornova, Izmir TR-35100, Turkey
[email protected]
Abstract. Leader election is an important problem in mobile ad hoc networks and in distributed computing systems. In this study, we propose a hierarchical, cluster based protocol to elect a leader in a mobile ad hoc network. The initial phase of the protocol employs a clustering algorithm to group the nodes of the network, after which a leader for each cluster (clusterhead) is elected. The second phase forms a connected ring of these leaders using the Ring Formation Algorithm. Finally, the Chang-Roberts Leader Election Algorithm for rings is employed in the final phase to elect the super-leader among the clusterheads. We provide performance results of this protocol for various mobility parameters and analyze its time and message complexities.

Keywords: leader election, Chang-Roberts algorithm, mobile ad hoc networks.
1 Introduction
Leader election is a fundamental problem addressed by many researchers in distributed systems. The problem was first introduced by LeLann, who also proposed a solution for it on a unidirectional ring [1]. Chang and Roberts improved this solution and reduced the average message complexity [2]. Various solutions have also been proposed for bidirectional rings and arbitrary networks [3,4,5,6]. Mobile Ad hoc NETworks (MANETs) are a class of distributed networks which do not have a fixed topology and in which the nodes communicate using temporary connections with their neighbors. Leader election in MANETs is a relatively new research area. Malpani et al. [7] propose two leader election protocols based on TORA. The algorithms select a leader for each connected component; the first algorithm is designed to tolerate a single topology change, whereas the second tolerates concurrent topology changes. The nodes only exchange messages with their neighbors, which makes the protocol suitable for MANETs. The authors show the proof of correctness but they do not
give any simulation results. Vasudevan et al. [8] propose a weakly self-stabilizing and terminating leader election protocol for MANETs. Their algorithm uses the concept of diffusing computations; they show the proof of correctness using temporal logic but also do not give any simulation results. Pradeep et al. [9] propose a leader election algorithm similar to that of Malpani et al.; they use the Zone Routing Protocol and show the proof of correctness of their algorithm. Masum et al. [10] propose a consensus based asynchronous leader election algorithm for MANETs and claim that their algorithm is adaptive to link failures. All of these algorithms elect leader(s) from ordinary nodes. Our algorithm, on the other hand, elects one super-leader from previously selected leaders. Cokuslu and Erciyes [11] proposed a two level leader election hierarchy for MANETs. Their protocol is based on constructing dominating sets for clustering and electing the super clusterhead from the subset of connected clusterheads. Since the clusterheads must be connected, the super clusterhead selection protocol is restricted by the underlying protocol; moreover, dominating set construction is an expensive operation under high mobility. Our Mobile CR protocol can use any clustering and routing protocol under BFA and remains stable under high mobility and density. In this study, we propose a Leader Election Protocol (LEP) that has three layers (phases) for MANETs. At the lowest layer, a clustering algorithm divides the MANET into balanced clusters, using the previously designed Merging Clustering Algorithm (MCA) [12]. The second layer employs the Backbone Formation Algorithm (BFA), which provides a virtual ring architecture of the leaders of the clusters formed by MCA [12]. Finally, using the Mobile Chang-Roberts Leader Election Algorithm (Mobile CR), the super-leader among the leaders elected in the second phase is elected. We show experimentally and theoretically that the protocol is scalable and has favorable performance with respect to time and message complexities. The rest of the paper is organized as follows. Section 2 provides the background and the proposed architecture is outlined in Section 3. Section 4 describes the extended Chang-Roberts algorithm on the proposed model, called Mobile CR. The implementation results are explained in Section 5 and the discussions and conclusions are outlined in Section 6.
2 Background

In this section we explain clustering using MCA and backbone formation using BFA to show the underlying mechanism of LEP.

2.1 Clustering Using the Merging Clustering Algorithm
An undirected graph is defined as G = (V, E), where V is a finite nonempty set and E ⊆ V × V. V is the set of nodes v and E is the set of edges e. A graph G_S = (V_S, E_S) is a spanning subgraph of G = (V, E) if V_S = V. A spanning tree of a graph is an undirected connected acyclic spanning subgraph. Intuitively, a minimum spanning tree (MST) for a graph is a subgraph that has the minimum
number of edges for maintaining connectivity. The Merging Clustering Algorithm (MCA) [12] finds clusters in a MANET by merging clusters to form higher level clusters, as in the Gallagher, Humblet and Spira algorithm [6]. However, we focus on the clustering operation and discard the minimum spanning tree, which reduces the message complexity as explained in [12]. The second contribution is the use of upper and lower bound parameters for the clustering operation, which results in a balanced number of nodes in the clusters formed. The protocol was simulated in ns2 and shown to give stable results under varying density and mobility.
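The published MCA is a distributed, message-driven protocol; the following centralized C++ sketch only illustrates the underlying idea of merging neighbouring clusters with a union-find structure while respecting an upper bound on cluster size. The topology, the merge order and the bound value are assumptions made for the example.

// Centralized illustration (not the distributed MCA itself) of bounded cluster merging.
#include <iostream>
#include <numeric>
#include <utility>
#include <vector>

struct Clusters {
    std::vector<int> parent, size;
    explicit Clusters(int n) : parent(n), size(n, 1) {
        std::iota(parent.begin(), parent.end(), 0);   // each node starts alone
    }
    int find(int x) { return parent[x] == x ? x : parent[x] = find(parent[x]); }
    // Merge the clusters of u and v only if the result stays within `upper`.
    bool merge(int u, int v, int upper) {
        int a = find(u), b = find(v);
        if (a == b || size[a] + size[b] > upper) return false;
        parent[b] = a;
        size[a] += size[b];
        return true;
    }
};

int main() {
    // A small example topology given as edges; upper bound of 3 nodes per cluster.
    std::vector<std::pair<int, int>> edges = {{0,1},{1,2},{2,3},{3,4},{4,5},{5,0}};
    Clusters c(6);
    for (auto& e : edges) c.merge(e.first, e.second, 3);
    for (int v = 0; v < 6; ++v)
        std::cout << "node " << v << " -> cluster " << c.find(v) << "\n";
}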
2.2 Backbone Formation Algorithm
The Backbone Formation Algorithm constructs a backbone architecture on a clustered MANET [13]. Unlike other algorithms, the backbone is constructed as a directed ring architecture to exploit the advantages of this topology and to give better services to other middleware protocols. The second contribution is to connect the clusterheads of a balanced clustering scheme, which satisfies two essential needs of clustering: balanced clusters and minimized routing delay. As a third contribution, the backbone formation algorithm is fault tolerant. The main idea is to maintain a directed ring architecture by periodically constructing a minimum spanning tree between clusterheads and classifying clusterheads as BACKBONE or LEAF nodes. To maintain these structures, each clusterhead broadcasts a Leader Info message by flooding; in this phase, cluster member nodes act as routers to transmit Leader Info messages. The algorithm has two modes of operation: a hop-based and a position-based backbone formation scheme. In the hop-based scheme, the minimum number of hops between clusterheads is used in the minimum spanning tree construction; minimum hop counts can be obtained during the flooding scheme. For highly mobile scenarios, an agreement between clusterheads must be maintained to guarantee consistent hop information. In the position-based scheme, the positions of the clusterheads are used to construct the minimum spanning tree. If each node knows its velocity and the direction of its velocity, this information can be appended with a timestamp to the Leader Info message to construct a better minimum spanning tree, but in this mode nodes must be equipped with a position tracker such as a GPS receiver. The backbone formation algorithm has been implemented on top of MCA using the ns2 simulator, and the results under varying MANET conditions are shown to be stable [13].
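The position-based mode can be illustrated with a small centralized sketch: Prim's algorithm over clusterhead coordinates with Euclidean distance as the edge weight. This is only the MST part of the idea; the real BFA is distributed, also classifies nodes as BACKBONE or LEAF, and the coordinates below are made up for the example.

// Prim's MST over clusterhead positions (illustration of the position-based scheme).
#include <cmath>
#include <iostream>
#include <limits>
#include <vector>

struct Point { double x, y; };

static double dist(const Point& a, const Point& b) {
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Returns parent[i] for every clusterhead i (parent of the root is -1).
std::vector<int> primMST(const std::vector<Point>& heads) {
    const double INF = std::numeric_limits<double>::infinity();
    size_t n = heads.size();
    std::vector<int> parent(n, -1);
    std::vector<double> key(n, INF);
    std::vector<bool> inTree(n, false);
    key[0] = 0.0;
    for (size_t it = 0; it < n; ++it) {
        size_t u = n;
        for (size_t v = 0; v < n; ++v)      // pick the cheapest node not yet in the tree
            if (!inTree[v] && (u == n || key[v] < key[u])) u = v;
        inTree[u] = true;
        for (size_t v = 0; v < n; ++v)      // relax edges out of u
            if (!inTree[v] && dist(heads[u], heads[v]) < key[v]) {
                key[v] = dist(heads[u], heads[v]);
                parent[v] = static_cast<int>(u);
            }
    }
    return parent;
}

int main() {
    std::vector<Point> heads = {{39, 47}, {318, 20}, {267, 283}, {92, 468}, {343, 512}};
    std::vector<int> parent = primMST(heads);
    for (size_t i = 1; i < heads.size(); ++i)
        std::cout << "clusterhead " << i << " attaches to " << parent[i] << "\n";
}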
3 The Proposed Architecture
We propose a four layer architecture for MANETs as shown in Fig. 1; implementations of other higher level functions on top of the lower three layers are possible. The lowest layer is the routing layer, in which AODV [14] is used; other routing protocols could also be used. AODV was chosen since it is a widely used routing protocol which also has a stable ns2 release. The second layer is where
the clustering takes place, at the end of which balanced clusters are formed. Our clustering layer ensures that nodes in the vicinity of each other join the same cluster, reducing routing overhead. The third layer takes these clusters as input and forms a virtual ring of the leaders of these clusters. Finally, the fourth layer is the implementation of the Mobile CR Algorithm on top of these three layers.
Fig. 1. The Proposed Architecture: the Mobile Chang-Roberts Algorithm on top of the Backbone Formation Algorithm, the Merging Clustering Algorithm and Ad hoc On Demand Distance Vector routing
4 Mobile Chang-Roberts Algorithm
The Chang-Roberts Algorithm is an asynchronous leader election algorithm for unidirectional ring networks. Assuming each process can be either red, meaning a potential candidate for becoming the leader, or black, meaning a resigned state, an informal description of the algorithm is as follows. Any red process can initiate the algorithm; however, if a red process receives a token before initiating the algorithm, it resigns by turning black [15]. Non-initiators remain black and act as routers. A process that receives a token with a higher id than its own removes the token. A token with a lower id is forwarded to the next node, and if a token reaches its originator, it has a lower id than all others and the originator can then declare itself the leader.

1. Initially all initiator processes are red.
2. For each initiator i, token <i> is sent to its neighbor;
3. do (for every process i)
4.   token <j> ∧ j > i → skip;
5.   token <j> ∧ j < i → send token <j>; color := black (i resigns)
6.   token <j> ∧ j = i → L(i) := i (i becomes the leader)
7. od
8. (for a non-initiator process)
9. do token <j> received → color := black; send <j> od

This algorithm ensures that the lowest id among the initiators wins and becomes the leader; its complexity is O(n²). We provide the following improvement to the classical algorithm in Mobile CR: every node keeps a list of the identities of the nodes it has seen so far in the tokens it has received, and it only passes on tokens that have a smaller sender id than the ones in the list, rather than comparing the id in the incoming token with its own id only.
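The following minimal, single-process C++ simulation of token passing on a unidirectional ring illustrates this rule together with the improvement (each node remembers the smallest initiator id seen and only forwards smaller ones). Message passing, acknowledgements and mobility are deliberately left out; the ring order and the two initiators follow the example operation described later in Sect. 4.2.

// Single-process simulation of the ring election sketched above.
#include <deque>
#include <iostream>
#include <limits>
#include <vector>

struct Token { int origin; size_t at; };    // token id and current ring position

int electOnRing(const std::vector<int>& ids, const std::vector<size_t>& initiators) {
    size_t n = ids.size();
    std::vector<int> minSeen(n, std::numeric_limits<int>::max());
    std::deque<Token> inFlight;
    for (size_t i : initiators) {
        minSeen[i] = ids[i];
        inFlight.push_back({ids[i], (i + 1) % n});   // send token to the next node
    }
    while (!inFlight.empty()) {
        Token t = inFlight.front();
        inFlight.pop_front();
        if (ids[t.at] == t.origin) return t.origin;  // token came home: leader found
        if (t.origin < minSeen[t.at]) {              // smaller than anything seen so far
            minSeen[t.at] = t.origin;
            inFlight.push_back({t.origin, (t.at + 1) % n});
        }                                            // otherwise the token is removed
    }
    return -1;                                       // no initiator (should not happen)
}

int main() {
    std::vector<int> clusterheadIds = {31, 29, 37, 39, 36};  // ring order as in Fig. 3
    int leader = electOnRing(clusterheadIds, {0, 3});        // nodes 31 and 39 initiate
    std::cout << "leader of leaders: " << leader << "\n";    // prints 31
}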
4.1 Finite State Machine Diagram of the Mobile CR Algorithm
The Mobile CR Algorithm is described using a finite state machine diagram in order to capture all of the possible asynchronous activities. Firstly, the list of messages used in the Mobile CR Algorithm is specified as follows:
– LEADER_DEAD: triggered by an internal event which detects the super-leader's crash.
– TOKEN: sent or forwarded by a leader to its next leader for the election.
– LEADER: sent by the leader which is the winner of the election.
Fig. 2. FSM of the Mobile CR Leader
The following is the list of node states:
– SLEEP: I am in the idle state.
– LOST: I am not a CANDIDATE, as there is a candidate with a lower id than mine.
– CANDIDATE: I am a CANDIDATE to become the LEADER and I am in the election.
– LEADER_FOUND: the LEADER is determined and I know who it is.
– LEADER: I am the LEADER.

A minimal code rendering of these states and messages is sketched below.
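The sketch only covers the transitions named in the text and in Fig. 2; timers, acknowledgements and the radio layer are omitted, and the handler signature is an assumption.

// Skeletal, event-driven rendering of the Mobile CR node state machine.
#include <iostream>

enum class State { SLEEP, LOST, CANDIDATE, LEADER_FOUND, LEADER };
enum class Msg   { LEADER_DEAD, TOKEN, LEADER };

struct Node {
    int id;
    State state;

    // React to one message carrying the id of the token/leader it refers to.
    void handle(Msg m, int otherId) {
        switch (state) {
        case State::SLEEP:
            if (m == Msg::LEADER_DEAD) { state = State::CANDIDATE; send(Msg::TOKEN, id); }
            else if (m == Msg::TOKEN)  { state = State::LOST; send(Msg::TOKEN, otherId); }
            else if (m == Msg::LEADER) { state = State::LEADER_FOUND; send(Msg::LEADER, otherId); }
            break;
        case State::CANDIDATE:
            if (m == Msg::TOKEN && otherId > id)  { /* skip: my (smaller) token survives */ }
            else if (m == Msg::TOKEN && otherId < id) { state = State::LOST; send(Msg::TOKEN, otherId); }
            else if (m == Msg::TOKEN && otherId == id) { state = State::LEADER; send(Msg::LEADER, id); }
            else if (m == Msg::LEADER) { state = State::LEADER_FOUND; send(Msg::LEADER, otherId); }
            break;
        case State::LOST:
            if (m == Msg::TOKEN)       send(Msg::TOKEN, otherId);   // act as a router
            else if (m == Msg::LEADER) { state = State::LEADER_FOUND; send(Msg::LEADER, otherId); }
            break;
        default:
            break;   // LEADER / LEADER_FOUND transitions are not detailed in the text
        }
    }

    // Stand-in for handing a message to the next leader on the ring.
    void send(Msg, int otherId) {
        std::cout << "node " << id << " forwards a message about " << otherId << "\n";
    }
};

int main() {
    Node n{39, State::SLEEP};
    n.handle(Msg::LEADER_DEAD, 0);   // node 39 initiates and becomes CANDIDATE
    n.handle(Msg::TOKEN, 31);        // token 31 arrives: node 39 resigns and forwards it
}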
4.2 An Example Operation
Fig. 3 shows an example scenario of the Mobile CR Algorithm for a network with 40 nodes located on a 600 m × 600 m surface area. The x,y coordinates of each node are given next to it, and node 37's coverage area is shown by the dotted circle. As shown by the bold lines, the network is partitioned into 5 clusters with MCA. After partitioning the network, the backbone is constructed with BFA as a directed ring architecture, depicted with the arcs and arrows in Fig. 3. Initially all leader nodes are in the SLEEP state. Within a small amount of
Fig. 3. An Example Operation for Mobile CR: 40 nodes on a 600 m × 600 m surface area, partitioned into Clusters 1–5; the dotted circle marks the coverage area of node 37
time, node 31 becomes an initiator and changes its state to CANDIDATE by sending a TOKEN message to its next leader on the ring, node 29. Node 29, which is in the SLEEP state, receives the TOKEN message, changes its state to LOST and sends an acknowledgement message to node 31 (each protocol message is acknowledged to maintain reliable transmission). Node 29 then forwards the TOKEN message to node 37. At the same time, node 39 becomes an initiator and also makes a state transition from SLEEP to CANDIDATE by sending a token message to its next leader on the ring, node 36. Node 37 forwards the TOKEN of node 31 to node 39. Node 39 loses the election, since the id of the received token is smaller; however, node 39's token is still forwarded towards node 31, and node 31 blocks the TOKEN message of node 39. In the end, the TOKEN message of node 31 circulates the ring and node 31 becomes the LEADER of leaders. Node 31 sends a LEADER message to the next leader on the ring, which then circulates the ring to announce the new leader of leaders.
4.3 Analysis
The proposed protocol consists of three layers as shown in Fig. 1:
1. Merging Clustering Algorithm (MCA)
2. Backbone Ring Formation Algorithm (BFA)
3. Mobile Chang-Roberts Algorithm (Mobile CR)

Theorem 1. The message and time complexity of the protocol is O(kn), where k is the number of clusters.

Proof. The message complexity of the protocol is the sum of the message complexities of the three algorithms above plus the messages required for termination detection of the first two algorithms. Assuming termination detection requires a negligible number of messages, the message complexity of the Leader Election Protocol (LEP) is:

O(LEP) = O(MCA) + O(BFA) + O(Mobile CR)   (1)
O(LEP) = O(n) + O(kn) + O(k²)   (2)
O(LEP) = O(kn)   (3)

By using the same method, the time complexity can be found as:

O(LEP) = O(n) + O(kn) + O(n)   (4)
O(LEP) = O(kn)   (5)

5 Results
We implemented the protocol stack in the ns2 simulator. Flat surfaces of different sizes are chosen for each simulation to create medium, dense and highly dense connected networks. Medium, small and very small surfaces vary from 140 m × 700 m to 700 m × 700 m, from 130 m × 650 m to 650 m × 700 m, and from 120 m × 600 m to 600 m × 600 m, respectively. The average degree of the network is approximately N/4 for the medium connected, N/3.5 for the dense connected and N/3 for the highly dense connected networks, where N denotes the total number of nodes in the network. Although each packet is acknowledged by the destination to maintain reliable transmission and is retransmitted if dropped, sparse networks are not studied because of the lack of connectivity. N varies from 10 to 50 in our experiments. Random movements are generated for each simulation and the random waypoint model is chosen as the mobility pattern. Low, medium and high mobility scenarios are generated and the respective node speeds are limited
between 1.0 and 5.0 m/s, between 5.0 and 10.0 m/s, and between 10.0 and 20.0 m/s, respectively. We use the codes of MCA and BFA, previously simulated by us, under Mobile CR to obtain end-to-end measurements. Each measurement is taken as the average of 20 measurements with the same mobility and density pattern but different randomly generated node locations and speeds. Our previous studies show that MCA and BFA are stable under different density and mobility conditions. Fig. 4 and Fig. 5 show that the election time increases linearly with the total number of nodes and that Mobile CR is stable under density and mobility changes. The run-times decrease as mobility increases, as shown in Fig. 5, since the number of clusterheads forming the ring is smaller in high mobility scenarios, resulting in less network traffic.
Fig. 4. Election Time against Density for Mobile CR
Fig. 5. Election Time against Mobility for Mobile CR
The number of nodes on the ring formed by BFA is an important parameter for the election time in Mobile CR. Upper and lower bound cluster parameters are defined in MCA to adjust cluster sizes; we use these parameters to divide the network into a given number of clusters. The network is divided into 3, 4, 5 and 7 clusters, and the effect of the number of clusterheads on the ring is measured. Fig. 6 shows that the election time changes only slightly with the total number of clusters in the network. One might expect a linear increase of the election time with the total number of clusters, but the actively routing nodes are selected by AODV, and not only clusterheads are used for routing.
Fig. 6. Election Time against Number of Clusters for Mobile CR
Fig. 7. Election Time against Initiator number for Mobile CR
Lastly, we investigate the behaviour of the election time when nodes start concurrently. In our algorithm, each node stores a list of the received initiator ids and blocks the tokens of candidates having a greater id than the prospective leader (the smallest id seen so far); 1 to 5 initiators are selected for the simulations. The results in Fig. 7 show that the algorithm is stable against a varying number of concurrent initiators.
6 Conclusions
We provided a three layer architecture for the dynamic leader election problem in MANETs, where the clustering phase provides the leaders of the clusters in the first phase, the backbone formation algorithm provides a ring network among the local leaders in the second phase, and a super-leader among the local leaders is elected using the Mobile CR algorithm in the final phase. We showed experimentally and theoretically that this approach is scalable and has an overall favorable performance. We think this approach may find various implementation environments in MANETs where sub-activities within the MANET are handled by groups/clusters of nodes, each having a leader for local decisions, while overall control is achieved by a single super-leader node. The protocol can be invoked periodically, which ensures correct handling of failing leaders.
References
1. LeLann, G.: Distributed Systems: Towards a Formal Approach. IEEE Information Processing 77, 155–169 (1977)
2. Chang, E.J., Roberts, R.: An Improved Algorithm for Decentralized Extrema Finding in Circular Arrangements of Processes. ACM Com., 281–283 (1979)
3. Franklin, W.R.: On an Improved Algorithm for Decentralized Extrema Finding in Circular Configurations of Processors. ACM Com., 281–283 (1982)
4. Peterson, G.L.: An O(nlogn) Unidirectional Algorithm for the Circular Extrema Problem. ACM Trans. Prog. Lang. 4, 758–763 (1982)
5. Dolev, D., Klawe, M., Rodeh, M.: An O(nlogn) Unidirectional Distributed Algorithm for Extrema-Finding in a Circle. J. Algorithms 3, 245–260 (1982)
6. Gallagher, R.G., Humblet, P.A., Spira, P.M.: A Distributed Algorithm for Minimum-Weight Spanning Trees. ACM Trans. Prog. Lang., 66–77 (1983)
7. Malpani, N., Welch, J., Vaidya, N.: Leader Election Algorithms for Mobile Ad Hoc Networks. In: Proc. Int. Works. on Disc. Alg. and Meth., pp. 96–103 (2000)
8. Vasudevan, S., Immerman, N., Kurose, J., Towsley, D.: A Leader Election Algorithm for Mobile Ad Hoc Networks. UMass Comp. Sci. Tech. Rep. (2003)
9. Pradeep, P., Kumar, V., Yang, G.-C., Ghosh, R.K., Mohanty, H.: An Efficient Leader Election Algorithm for Mobile Ad Hoc Networks. In: Ghosh, R.K., Mohanty, H. (eds.) ICDCIT 2004. LNCS, vol. 3347, pp. 32–41. Springer, Heidelberg (2004)
10. Masum, S.M., Ali, A.A., Bhuiyan, M.T.I.: Asynchronous Leader Election in Mobile Ad Hoc Networks. In: AINA 2006, pp. 827–831 (2006)
11. Cokuslu, D., Erciyes, K.: A Hierarchical Connected Dominating Set Based Clustering Algorithm for Mobile Ad Hoc Networks. In: IEEE MASCOTS 2007, pp. 60–66 (2007)
12. Dagdeviren, O., Erciyes, K., Cokuslu, D.: Merging Clustering Algorithm for Mobile Ad Hoc Networks. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganà, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3981, pp. 681–690. Springer, Heidelberg (2006)
13. Dagdeviren, O., Erciyes, K.: A Distributed Backbone Formation Algorithm for Mobile Ad Hoc Networks. In: Guo, M., Yang, L.T., Di Martino, B., Zima, H.P., Dongarra, J., Tang, F. (eds.) ISPA 2006. LNCS, vol. 4330, pp. 219–230. Springer, Heidelberg (2006)
14. Perkins, C.E., Belding-Royer, E.M., Das, S.: Ad Hoc On Demand Distance Vector (AODV) Routing. RFC 3561 (2003)
15. Ghosh, S.: Distributed Systems, An Algorithmic Approach, ch. 11, pp. 175–176. Chapman and Hall/CRC (2007)
Distributed Algorithms to Form Cluster Based Spanning Trees in Wireless Sensor Networks

Kayhan Erciyes¹, Deniz Ozsoyeller², and Orhan Dagdeviren³

¹ Ege University, International Computer Institute, Bornova, Izmir TR-35100, Turkey
[email protected]
² Izmir University of Economics, Computer Eng. Dept., Balcova, Izmir TR-35350, Turkey
[email protected]
³ Izmir Institute of Technology, Computer Eng. Dept., Urla, Izmir TR-35340, Turkey
[email protected]
Abstract. We propose two algorithms to form spanning trees in sensor networks. The first algorithm forms hierarchical clusters of spanning trees with a given root, the sink. All of the nodes in the sensor network are then classified iteratively as subroot, intermediate or leaf nodes. At the end of this phase, the local spanning trees are formed, each having a unique subroot (clusterhead) node. Communication and data aggregation towards the sink by an ordinary node are then accomplished by sending data to the local subroot, which routes the data towards the sink. A modified version of the first algorithm is also provided, which ensures that the obtained tree is a breadth-first search tree in which a node can change its parent to obtain a shorter distance to the root. Once the sub-spanning trees in the clusters are formed, a communication architecture such as a ring can be formed among the subroots. This hybrid architecture, which provides co-existing spanning trees within clusters, yields the necessary foundation for a two-level communication protocol in a sensor network, as well as a structure for a higher level abstraction such as the γ synchronizer, where communication between the clusters is performed using the ring, similar to an α synchronizer, and the intra-cluster communication is accomplished using the sub-spanning trees, as in the β synchronizer. We discuss the model along with the algorithms, compare them and comment on their performances.

Keywords: spanning tree, clustering, synchronizers, wireless sensor networks.
1 Introduction
Wireless Sensor Networks (WSNs) have important scientific, environmental, medical and military applications. Example WSN applications include habitat monitoring, remote patient monitoring and military defense systems [1]. WSNs may consist of hundreds or even thousands of nodes that operate independently.
A survey of WSNs can be found in [2]. WSN nodes are small, inexpensive, embedded devices that require low power and are distributed regularly or irregularly over a significantly large area. They are usually deployed in highly dynamic and sometimes hostile environments; it is therefore very important that these networks are capable of unattended, distributed but coordinated operation with the other nodes and provide self-healing in the case of faults. Communication in WSNs can be performed using two fundamental approaches: tree based and cluster based. Cluster based communication requires grouping closely coupled elements of the sensor network into clusters and electing one of these nodes as the clusterhead (cluster leader) [3]. The cluster leader coordinates the communication among the cluster members and with other clusters. Energy is an important and crucial resource in sensor networks due to the limited lifetime of sensor batteries and the difficulty of recharging the batteries of thousands of sensors in remote or hostile environments. Communication dominates the energy consumption of sensor nodes even when they are in the idle-listening state [4]. In this study, we propose a distributed algorithm that forms hierarchical spanning trees in a WSN, where each sub-spanning tree has a root node that acts as the leader for that subtree. Our algorithm produces the topology of a spanning tree but also has a cluster structure with a clusterhead, and is therefore an integration of the tree based and cluster based approaches. To our knowledge, the algorithm in this study is the first attempt to provide a hybrid approach for communication in WSNs. The rest of the paper is organized as follows. Section 2 provides the related work; the algorithms designed are detailed in Sections 3 and 4 along with the analysis and the results obtained. Finally, conclusions are presented in Section 5.
2 Background

2.1 Clustering in WSNs
A WSN can be modelled by a graph G(V, E), where V is the set of vertices (the nodes of the WSN) and E is the set of edges (the communication links among the nodes). Clustering the nodes of a graph, or graph partitioning, is NP-hard; for this reason, clustering in WSNs is usually performed using heuristics. Some of the benefits to be gained from clustering in mobile ad hoc networks and WSNs are the reduction in energy for message transfers and the formation of a virtual backbone for routing purposes [5]. HEED (Distributed Clustering in Ad-hoc Sensor Networks: A Hybrid, Energy-Efficient Approach) [6] proposes a distributed clustering algorithm for sensor networks. Clusterheads in HEED are elected using a probabilistic heuristic that considers the residual energy of a node and the number of its neighbors (its degree). HEED assumes a homogeneous network, assumes that neighbor connectivity is known, and provides balanced clusters. LEACH (Low-Energy Adaptive Clustering Hierarchy) [7] provides rotating clusterheads chosen randomly and
assumes clusterheads consume uniform energy. Both HEED and LEACH find clusters in a finite number of steps. In PEAS [8], a node goes to sleep (turns off its radio) when it detects a routing node in its transmission range. In GAF [9], the sensor network is divided into fixed square grids, each with a routing node; communication to the sink is propagated by the routers, and the ordinary nodes in each grid can turn off their radio components when they have nothing to transmit. GEAR (Geographical and Energy Aware Routing: a recursive data dissemination protocol for wireless sensor networks) [10] and TTDD (Two-tier Data Dissemination Model for Large Scale Wireless Sensor Networks) [11] are examples of other protocols for cluster formation in WSNs.
2.2 Spanning Tree Formation in WSNs
Building spanning trees rooted at a sink node for data collection is a fundamental method for data aggregation in sensor networks. However, due to the nature of sensor networks, the spanning tree should be formed in a decentralized way. Gallagher, Humblet and Spira [12], Awerbuch [13], and Banerjee and Khuller [14] have all proposed distributed spanning tree algorithms. The distributed algorithm of Gallagher, Humblet and Spira determines a minimum weight spanning tree for an undirected graph by combining small fragments into larger fragments; a fragment of a spanning tree is one of its subtrees. The time complexity of this algorithm is O(N log N). The ENCAST (ENergy Critical node Aware Spanning Tree) algorithm [15] finds a shortest path tree (SPT) by breadth-first traversal from the sink, so that each node can reach the sink via the minimum number of hops using this SPT. However, there may be more than one SPT in a dense sensor network, since nodes have many neighbors and some of these neighbors have the same minimum-hop distance from the sink; ENCAST therefore uses the energy of a node as a second selection criterion and attempts to label nodes with less energy as leaf nodes.
3 The Distributed Spanning Tree Algorithm
The first algorithm is a modification of the distributed spanning tree formation algorithm for general networks. We modify this general algorithm below so that clusters, which are subtrees, are also formed with the energy considerations of the WSN in mind. We assume that the sensor nodes are distributed randomly and densely over the area to be monitored and that the sensor field can be mapped onto a two dimensional space. Furthermore, all sensor nodes have identical and fixed transmission ranges and hardware configurations, and each sensor node can monitor its power level E_P.
3.1 Description of the Algorithm
The algorithm we propose is described informally as follows. The sink periodically starts the algorithm by sending a PARENT message to its neighbors. Any
node i that has not received a PARENT message before sets the sender as its parent, sends an ACK(i) message to its parent and sends a PARENT(i) message to all of its neighbors. We provide a depth of subtree parameter d as a modification to the classical spanning tree algorithm above: every node that is designated a parent computes n_hops = (n_hops + 1) MOD d and appends it to its outgoing message. Recipients of a message with n_hops = 0 are SUBROOTs, and nodes with n_hops <= d are INTERMEDIATE or LEAF nodes, depending on their level within a subtree. The state diagram in Fig. 1 depicts the operation of the Distributed Spanning Tree Algorithm (DSTA). The algorithm is initiated by the sink at regular intervals. Any ordinary node that has not been labeled before, on receiving a PARENT message from an upper node, labels itself according to the number of hops the message has traveled, which is given by the parameter of the PARENT message.
Fig. 1. The Finite State Machine Diagram of DSTA
Any further changes of state between subroot, intermediate and leaf nodes are not shown for simplicity. The following is the list of messages used in DSTA:
– PARENT: sent by a parent to its neighbors soliciting children.
– CHILD: sent by a child to its parent, acknowledging that it is a successor.
– TIMEOUT: internal message informing that a timeout has occurred; it prevents a subroot from waiting indefinitely for acknowledgements from potential children.

Each message contains the following fields:
– Sender: SINK, SUBROOT, SUBROOT0, INTERMED, LEAF;
– type: PARENT, CHILD;
– n_hops: an integer giving the number of hops the message has travelled.
If the number of hops in the message is equal to zero, the node labels itself as a SUBROOT. Else, if the number of hops is smaller than the allowed depth d of the subtree, the node is an intermediate (INTERM) node. Once the number of hops equals the depth, the node is classified as a LEAF (a small sketch of this labeling rule is given below, after the list of node states). Each labeled node acknowledges its parent with a CHILD message. The following is the list of sensor node states:
– SUBRT: a node labeled as a subroot, because the message it received from its parent has n_hops = 0.
– SUBCH: a subroot node with at least one confirmed child in the local tree.
– INTERM: an intermediate node, that is, neither a subroot nor a leaf node.
– INTCH: an intermediate node with at least one child.
– LEAF: a node that is a leaf of a local spanning tree.
– LEAFCH: a leaf node with at least one child.
– SUBRT0: a subroot node that has received a SINK message.
– SUBCH0: a subroot 0 node with at least one child.

Remark 1 (Energy Considerations). A sensor node rejects being labeled as a subroot if its energy level is below a threshold, for example two thirds of E_P. This is required because a subroot has more message transfers than an ordinary node. A branch of the spanning tree formed constitutes a cluster in which the subroot node is the clusterhead. Subroots may take on other roles in application specific settings; for our purpose, each subroot has the capability to manipulate or filter any incoming message during convergecast.
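The following centralized C++ sketch shows one consistent reading of the labeling rule: the sink floods PARENT messages, each node adopts the first sender as its parent, and the carried hop counter ((n_hops + 1) MOD d at every forwarding step) decides whether a node becomes a SUBROOT (counter 0), an INTERMEDIATE node, or a LEAF (counter d − 1). The real algorithm is distributed and also applies the energy threshold of Remark 1, which is omitted here.

// Centralized illustration of the DSTA labeling rule (names and example topology assumed).
#include <iostream>
#include <queue>
#include <vector>

enum class Label { SINK, SUBROOT, INTERMEDIATE, LEAF };

struct Result { std::vector<int> parent; std::vector<Label> label; };

Result dstaLabel(const std::vector<std::vector<int>>& adj, int sink, int d) {
    int n = static_cast<int>(adj.size());
    Result r{std::vector<int>(n, -1), std::vector<Label>(n, Label::SINK)};
    std::vector<int> hops(n, -1);
    std::queue<int> q;
    q.push(sink);                      // the sink sends n_hops = 0 to its children
    while (!q.empty()) {
        int u = q.front(); q.pop();
        int forwarded = (hops[u] + 1) % d;          // value appended to the PARENT message
        for (int v : adj[u]) {
            if (v == sink || hops[v] != -1) continue;   // node already has a parent
            hops[v] = forwarded;
            r.parent[v] = u;
            r.label[v] = (forwarded == 0) ? Label::SUBROOT
                        : (forwarded == d - 1) ? Label::LEAF : Label::INTERMEDIATE;
            q.push(v);
        }
    }
    return r;
}

int main() {
    // A small chain 0(sink)-1-2-3-4-5 with depth d = 3.
    std::vector<std::vector<int>> adj = {{1}, {0,2}, {1,3}, {2,4}, {3,5}, {4}};
    Result r = dstaLabel(adj, 0, 3);
    const char* names[] = {"SINK", "SUBROOT", "INTERMEDIATE", "LEAF"};
    for (int v = 1; v < 6; ++v)
        std::cout << "node " << v << ": parent " << r.parent[v]
                  << ", " << names[static_cast<int>(r.label[v])] << "\n";
}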
3.2 Analysis of DSTA
In this section, we analyze the number of communication steps (message count) needed to form the spanning trees using DSTA and comment on its performance. Based on the state machine of Fig. 1, labeling a sensor node as SUBROOT, INTERMED or LEAF requires two messages, PARENT and CHILD: the first is sent by the parent soliciting children and the second is the acknowledgement of the child to its parent.

Theorem 1. The time complexity of DSTA is O(D), where D is the diameter of the network from the sink to the furthest leaf, and its message complexity is O(n).

Proof. The time required by the algorithm is clearly the diameter D of the network. Once a node is labeled and has a designated parent, it sends a message to its neighbors only once. If Δ is the maximum degree of the network graph, the total number of messages is Δ·n, and for small Δ the message complexity is O(n).
3.3 Results
The distributed spanning tree algorithm is implemented with the ns2 simulator. The IEEE 802.11g standard is chosen for the lower layer protocols. The total number
of nodes varies from 100 to 500. Flat surfaces of different sizes are chosen for each simulation to create highly dense, dense and medium connected topologies in order to measure the effect of node degree. Surface areas vary from 2700 m × 1200 m to 17920 m × 1920 m. The depth parameter is changed to obtain different numbers of clusters, as well as of SUBROOT, INTERMEDIATE and LEAF nodes.
Fig. 2. DSTA Run-times against the Number of Nodes
Fig. 2 displays the run-time results of the distributed spanning tree algorithm for 100 to 500 nodes. The run-time values increase almost linearly, except for the case of 300 nodes, which may be due to their random distribution, indicating scalability as the total number of nodes is increased from 100 to 500; about 4.5 s is needed for the formation of the distributed spanning tree with clusters. For a network with 100 nodes, different topologies are created to measure the effect of the average node degree parameter, because each node must be informed by its neighbors to complete reliable flooding and any corrupted message must be retransmitted. Fig. 3 shows that the algorithm performs well up to highly dense topologies with 8 nodes connected on average. The depth parameter of DSTA changes the number of clusters and the node states in the WSN. The numbers of SUBROOT, INTERMEDIATE and LEAF nodes are measured for 300 nodes, as shown in Fig. 4. As the depth parameter is increased from 2 to 6, the SUBROOT node count decreases and the INTERMEDIATE count increases
Fig. 3. DSTA Run-times against the Average Node Degree
Fig. 4. DSTA Number of Node States against the Depth
as expected. The number of LEAF nodes, which are mostly gateway nodes, decreases with the depth parameter in the same way as the SUBROOTs. Our results conform with the analysis in that the run-time values and message counts grow linearly; the algorithm is also stable under different node degrees. The depth parameter changes the number of clusters and node states, and its selection is very important. The worst-case delivery times are scalable and show that nodes can route their packets on top of this spanning tree with reasonable delays.
4 Breadth-First Search Based DSTA
The second algorithm we propose for spanning tree formation in WSNs is a modification of the Breadth-First Search (BFS) spanning tree algorithm for general networks shown below:

1. Initially, the root sets L(root) = 0 and all other vertices set L(v) = ∞.
2. The root sends out the message Layer(0) to all its neighbors.
3. A vertex v which gets a Layer(d) message from a neighbor w checks whether d + 1 < L(v). If so, it does the following:
   – parent(v) = w;
   – L(v) = d + 1;
   – send Layer(d + 1) to all neighbors except w.

We apply the algorithm above; however, based on their designated distances, nodes are labeled as ROOT, SUBROOT and LEAF as in DSTA.

Theorem 2. The time complexity of BFS-DSTA is O(n) and its message complexity is O(n|E|).

Proof. As the longest path in a network has n−1 nodes, the time complexity of the general asynchronous BFS spanning tree algorithm is O(n). Since at every step there will be a maximum of |E| messages, the message complexity is O(n|E|). For BFS-DSTA, the general rules apply and the complexities are the same as those of the asynchronous BFS algorithm.
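A sequential C++ rendering of the layering rule quoted above is given below: every vertex keeps L(v), initially infinite, and adopts a neighbour w as its parent whenever a Layer(d) message from w satisfies d + 1 < L(v). Run to completion this yields the breadth-first spanning tree; asynchrony, message loss and the SUBROOT/LEAF labelling of BFS-DSTA are not modelled here, and the example topology is made up.

// Sequential sketch of the BFS layering rule.
#include <iostream>
#include <queue>
#include <vector>

struct BfsTree { std::vector<int> parent, layer; };

BfsTree bfsSpanningTree(const std::vector<std::vector<int>>& adj, int root) {
    const int INF = 1 << 30;
    int n = static_cast<int>(adj.size());
    BfsTree t{std::vector<int>(n, -1), std::vector<int>(n, INF)};
    t.layer[root] = 0;
    std::queue<int> q;                 // pending Layer(d) announcements, one per sender
    q.push(root);
    while (!q.empty()) {
        int w = q.front(); q.pop();
        for (int v : adj[w]) {
            if (t.layer[w] + 1 < t.layer[v]) {   // the rule: d + 1 < L(v)
                t.layer[v] = t.layer[w] + 1;
                t.parent[v] = w;
                q.push(v);                        // v re-announces its new layer
            }
        }
    }
    return t;
}

int main() {
    // Sink is node 0; the edges form a small mesh.
    std::vector<std::vector<int>> adj = {{1,2}, {0,2,3}, {0,1,4}, {1,4}, {2,3}};
    BfsTree t = bfsSpanningTree(adj, 0);
    for (int v = 1; v < 5; ++v)
        std::cout << "node " << v << ": parent " << t.parent[v]
                  << ", layer " << t.layer[v] << "\n";
}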
4.1 Results for BFS Based DSTA
We implemented BFS-MDSTA (DSTA with multiple sinks using BFS) in an ns2 setting similar to that of DSTA. Fig. 5 shows the running times of BFS-MDSTA for 1, 3 and 5 sinks. There is a linear increase as the number of nodes is increased, and the running times also depend linearly on the number of concurrent sinks.
Fig. 5. The Running Times for Multi-sink Formation with BFS-MDSTA
Fig. 6 shows the number of clusters formed when BFS-MDSTA is applied to a WSN for 1, 3 and 5 sinks with a constant depth of 3. The curves are almost identical, showing an even distribution of clusters independent of the count and location of the sinks.
Fig. 6. The Average Number of Clusters for BFS-MDSTA for d = 3
Fig. 7 shows the effect of the subtree depth d on the cluster count when BFS-DSTA is applied to up to 250 WSN nodes, which was the upper limit the simulator could tolerate due to the data complexity of maintaining 5 concurrent sinks. The count of clusters decreases linearly as d increases, which is expected.
Fig. 7. The Average Number of Clusters for BFS-MDSTA against Depth
5 Discussions and Conclusions
We proposed two distributed algorithms for spanning tree formation to provide a communication infrastructure in sensor networks. The first algorithm (DSTA) has a lower message complexity of O(n) but does not necessarily find the shortest route to the sink. The second algorithm (BFS-DSTA) uses the BFS property and finds the shortest route, at the cost of an elevated message complexity of O(n|E|). These algorithms may be activated at regular intervals by the sink, and the resulting dynamic spanning tree configuration, consisting of healthy nodes only, discards the sensor nodes that have ceased functioning due to energy loss or other hostile environmental conditions. We showed that these algorithms are scalable and provide balanced clusters which consist of trees within the clusters. This architecture may be used for a γ synchronizer, which requires the same structure we propose. One future direction of this work would therefore be another communication structure, such as a ring between the clusterheads, so that an α synchronizer can be constructed among the clusters. The local spanning trees produced by DSTA and BFS-DSTA naturally comprise clusters of the sensor network and can therefore be used for resource management tasks other than the communication infrastructure or the synchronizer function described in this study. The subroot nodes are the leaders of the clusters and can act as representatives of their cluster members for various tasks in the sensor network. These leaders can be connected in various configurations, such as a ring, in order to perform tasks such as mutual exclusion in sensor networks. The advantage of this hybrid approach would be simple and fast data aggregation using the spanning tree within each cluster and a more general framework, such as ring-based communication, among the clusters. Our work is ongoing, and we are looking into labeling some nodes of the WSN as privileged nodes with improved transmission capabilities, so that these nodes may form an upper spanning tree and hence an upper communication backbone of the WSN.

Acknowledgements. This work was partially supported by Turkish Science and Research Council Career Project 104E064.
The Effect of Network Topology and Channel Labels on the Performance of Label-Based Routing Algorithms

Reza Moraveji1,2, Hamid Sarbazi-Azad1,3, and Arash Tavakkol1

1 IPM School of Computer Science, Tehran, Iran
2 Dept. of ECE, Shahid Beheshti Univ., Tehran, Iran
3 Dept. of Computer Engineering, Sharif Univ., Tehran, Iran
{moraveji_r,azad,arasht}@ipm.ir, [email protected]
Abstract. Designing an efficient deadlock-free routing is a point of concern for irregular topologies. In this paper, we take a step toward this goal by developing three novel deadlock-free routing algorithms in the context of a new family of algorithms, called label-based routing algorithms, for irregular topologies. In addition, the newly proposed family covers three previously reported routing algorithms [2, 3]. Moreover, by simulating and comparing the newly proposed and previously reported routing methods, it is shown that the performance of this family highly depends on the network topology and the channel labeling process.

Keywords: Label-based routing algorithm, irregular network, network of workstations, performance evaluation.
1 Introduction

In recent years, cluster-based irregular networks (INs), such as networks of workstations (NOWs) and irregular networks-on-chip, have emerged as one of the cost-effective alternatives to traditional regular parallel computers. In such systems, an irregular high-speed network is often required in order to provide the wiring flexibility needed in the network and to allow the design of scalable systems with incremental expansion capability [4, 7]. Without a careful design of the routing scheme of INs, deadlock may happen in these networks [5, 6]. Since the topology of INs is not predefined, designing and applying deadlock-free routing algorithms is usually done without any pre-assumption about the network topology. Therefore, the major problem of these networks is the complexity of designing a general deadlock-free routing algorithm. The main purpose of this work, presented in Section 2, is to take a step in this direction by developing some deadlock-free routing schemes within a new family of routing algorithms, called label-based routing algorithms, for irregular topologies. Moreover, evaluating the performance of label-based routings in irregular networks under realistic conditions is another major concern. To this end, extensive simulation experiments have been conducted in Section 3. Section 4 concludes the paper and outlines some directions for future work in this line of research.
2 Label-Based Routing Algorithms

In order for a routing algorithm to be deadlock-free, cyclic buffer dependencies between messages and the physical channels they allocate must not occur. When the labeling approach is used for generating deadlock-free routing algorithms, the given topology first has to be prepared for implementing the label-based routing algorithm. Let us briefly describe the way in which the topology is labeled and the method by which the related routing schemes are generated. The main idea of label-based routing algorithms is to classify network channels by assigning predefined labels and then to group the labeled channels in such a way that there is no cyclic dependency within each group. These groups are referred to as zones in [1]. Afterwards, the generated zones are ordered in a sequence such that, when a message passes through the needed channels in the zones (respecting the sequence of the zones), the sequence guarantees that the message reaches its destination.

2.1 Fundamental Concepts of Graph Labeling, Deadlock-Free Zones and Routing Algorithms
The first step in generating a label-based routing algorithm is graph labeling. Since we plan to compare the previously reported routing algorithms with the newly proposed ones, in this paper we use the graph labeling reported in [1-3]. As a starting point, a spanning tree (based on breadth-first search (BFS) graph traversal) is formed on the given irregular network as the basis of the labeling process. Nodes and channels are labeled in two stages as follows.

First stage: Nodes are labeled in ascending order with respect to the spanning tree formation and according to their distances from the root of the spanning tree. A channel that faces toward the lower node label is labeled '1' and a channel that goes away from the lower node label is labeled '0' (Fig. 1).

Second stage: Subsequently, the second stage of labeling is applied to the graph: an increasing number is assigned to each node in the order in which nodes are visited by a pre-order tree traversal. Channels are labeled using the policy of the first stage (Fig. 1). Therefore, each channel is assigned two different labels, and it is possible to think of a channel label as a compound label containing two distinct labels. It is obvious that there may be at most four possible channel labels for a given irregular topology.
These channel labels are: (11), (10), (01), (00). As a result, a single '0' transition (channel) from node A to B means that the corresponding label of node A is lower than that of B, and a single '1' transition from node A to B means that node A has a higher corresponding label than B. Therefore, when both labels are taken into account as a compound label, we have the following outcomes:

A(a0a1) --(11)--> B(b0b1) ⇒ (a0 > b0, a1 > b1)
A(a0a1) --(10)--> B(b0b1) ⇒ (a0 > b0, a1 < b1)
A(a0a1) --(01)--> B(b0b1) ⇒ (a0 < b0, a1 > b1)
A(a0a1) --(00)--> B(b0b1) ⇒ (a0 < b0, a1 < b1)
where (a0a1) and (b0b1) are node labels. The second step in generating a label-based routing algorithm is to group the channels such that there is no cyclic dependency between the channels of the same group. Since there are several ways to group the channels, it is possible to generate various deadlock-free groups (zones) and, in turn, different deadlock-free routing algorithms.
Fig. 1. Node and channel labeling
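The two-stage labeling of Section 2.1 can be reproduced mechanically once a spanning tree root is fixed. The following minimal sketch is an illustration only (the example graph, the 0-based numbering, and the sorted tie-break for visiting children are assumptions, not taken from the paper): it computes the BFS-order and pre-order node labels and derives the compound label of every directed channel, with a bit set to '1' when the channel heads toward the lower node label.

```python
from collections import deque

def label_graph(adj, root):
    """Return the compound label of every directed channel (u, v)."""
    # First stage: BFS spanning tree and ascending BFS-order node labels.
    parent, bfs_label, queue = {root: None}, {}, deque([root])
    while queue:
        u = queue.popleft()
        bfs_label[u] = len(bfs_label)
        for v in sorted(adj[u]):                 # arbitrary tie-break
            if v not in parent:
                parent[v] = u
                queue.append(v)
    children = {u: [v for v in sorted(adj[u]) if parent.get(v) == u] for u in adj}
    # Second stage: pre-order traversal of the same spanning tree.
    pre_label, stack = {}, [root]
    while stack:
        u = stack.pop()
        pre_label[u] = len(pre_label)
        stack.extend(reversed(children[u]))      # visit children left to right
    # Channel bit is '1' when the channel faces toward the lower node label.
    def bit(lu, lv):
        return '1' if lu > lv else '0'
    return {(u, v): bit(bfs_label[u], bfs_label[v]) + bit(pre_label[u], pre_label[v])
            for u in adj for v in adj[u]}

# Small irregular example (an assumption for illustration).
adj = {0: [1, 2, 3], 1: [0, 2], 2: [0, 1, 4], 3: [0, 4], 4: [2, 3]}
for edge, label in sorted(label_graph(adj, 0).items()):
    print(edge, label)
```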
As mentioned in [1] for label-based routing, there is a predefined ordering for traveling between channel groups, and a message cannot use channel labels belonging to a previously traversed group, while it can use channel labels of the same group adaptively. Consequently, there is no cyclic dependency between channels of different groups. Therefore, it is sufficient to group channel labels with no cyclic dependency. As indicated in Fig. 2, message 1 holds channel (A, B), labeled (10), and requests the use of channel (C, D), labeled (11), while message 2 holds (C, D) and requests the use of (A, B). Let us assume that these two channel labels {(10), (11)} are in the same group, and consider the situation in which messages 1 and 2 use only channel labels of this group, {(10), (11)}. For message 1 we have

A(a0a1) --(10)--> B(b0b1) ⇒ (a0 > b0, a1 < b1),   C(c0c1) --(11)--> D(d0d1) ⇒ (c0 > d0, c1 > d1),

and, since every channel of this group has '1' as its first bit, the first node label decreases along the whole path of message 1:

a0 > b0 ≥ ... ≥ c0 > d0 ⇒ a0 > d0.

Therefore, if message 2 wants to request (A, B) while holding (C, D), it has to cross other channels, such as (00) or (01), which contradicts the mentioned group-ordering traversal [1]. Thus, it is possible to put (10) and (11) in one group.
Fig. 2. Cyclic dependency between (A, B) and (C, D)
The following corollary defines a general rule for creating deadlock-free zones, i.e., groups of channels without cyclic dependency.

Corollary: There is no cyclic dependency between channels X(x0x1) and Y(y0y1) if and only if they satisfy the condition (x0 XNOR y0) OR (x1 XNOR y1) = 1, where XNOR and OR are bitwise operators. It should be noted that X(x0x1) and Y(y0y1) are channel labels, not node labels, and the four possible channel labels were introduced above. The possible deadlock-free zones are therefore (11,10), (11,01), (10,00), and (01,00).
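As a quick sanity check, the corollary's condition can be evaluated exhaustively over the four compound labels. The tiny sketch below is an illustration only, not part of the paper; it recovers exactly the four deadlock-free zones listed above.

```python
from itertools import combinations

LABELS = ['11', '10', '01', '00']

def compatible(x, y):
    """(x0 XNOR y0) OR (x1 XNOR y1) = 1, i.e. the labels agree in at least one bit."""
    return any(xb == yb for xb, yb in zip(x, y))

zones = [(x, y) for x, y in combinations(LABELS, 2) if compatible(x, y)]
print(zones)   # [('11', '10'), ('11', '01'), ('10', '00'), ('01', '00')]
```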
The third and last step in generating a label-based routing algorithm is to order the deadlock-free zones in a sequence that guarantees connectivity for the routing algorithm.

Theorem: The path between any pair of nodes is guaranteed by selecting a sequence of channel labels whose first (second) bit is '1', followed by a sequence of channel labels whose first (second) bit is '0' [1].

Proof: When a message chooses a channel with a compound label that contains at least one '1', it gets closer to the root node of the spanning tree. In the worst case, the message reaches the root node, and it is obvious that from the root node there is at least one path to every other node in the spanning tree (and hence in the whole network). Therefore, when the channel labels are ordered in a sequence of '1's followed by '0's, whether in terms of the first or the second bit of the channel label, there is at least one path between each pair of source and destination.

Now, considering the generated deadlock-free zones and the possible sequences, the following label-based routing algorithms can be defined:

1. R1: (11,10) → (01,00) (up/down routing)
2. R2: (11,01) → (10,00) (left/right routing)
3. R3: (11) → (01,00) → (10) (new)
4. R4: (11) → (10,00) → (01) (L-turn routing)
5. R5: (10) → (11,01) → (00) (new)
6. R6: (01) → (11,10) → (00) (new)
3 Empirical Performance Evaluation

The main performance metric of INs is the average message latency (the average amount of time it takes a message to completely reach its destination). In a thorough analysis, this performance metric of the label-based routings is analyzed under different working conditions, considering different irregular topologies and different spanning trees. As will be seen, some interesting points are derived from the results of the analyses that were not reported or referred to in previous works on the performance evaluation of routing algorithms in irregular networks. Analysis of this kind can be conducted through results obtained from a real implementation of the network, but a cost-effective alternative is to use a simulation of the system.

3.1 Simulator

To evaluate the functionality of irregular networks under different conditions, a discrete-event simulator has been developed that mimics the behavior of the described label-based routing algorithms at flit level. The input data (irregular topology) to the simulator is specified in the form of an adjacency matrix. Also, the spanning tree assigned to the network can be determined either by the user or automatically by the well-known BFS or DFS (depth-first search, with a predetermined heuristic) algorithms.

3.2 The Effect of Network Topology

When comparing the performances of two or more routing algorithms under the same working conditions, such as the number of virtual channels, message lengths, and traffic patterns, it is always expected that one (or more) of the compared routing algorithms shows better performance than the others. Generally, changing the conditions for all routing algorithms does not usually change the order of routing performances. For example, the performance of XY routing [7] (ignoring its simplicity of implementation) is worse than that of west-first routing [7] under the same working conditions. By changing the topology to which the respective routing algorithms are applied from Mesh3×3 to Mesh5×5, the performance of west-first routing still remains better, since the latter routing algorithm always provides more adaptivity than the former one. As a result, in most cases a fair comparison provides a definite order of performances of the compared instances. As we will see in this section, this is not true for label-based routing algorithms (the compared instances). The performance of a label-based routing algorithm highly depends on the topology to which the routing is applied. Therefore, it is not possible to establish a general ranking of the performance of the six aforementioned label-based routing algorithms. Another design parameter that has a strong influence on the performance of a label-based routing algorithm is the degree of irregularity of the network topology; that is, the performance of this family depends on the variance of the node degrees of the topology.
In order to show the above characteristics in irregular networks, a comparative performance evaluation is presented in the results of Fig. 3, where the average message latency is plotted against the traffic generation rate. The analyzed irregular networks are as follows: G1(16, 48), G2(36, 124), G3(64, 240), G4(64, 240), G5: Mesh8×8, and G6(100, 364).
Network topology is the first parameter that should be considered when choosing the best label-based routing algorithm. Let us look at the simulation results obtained from different network sizes and network topologies. As can be seen, the sequences of routing performances are totally different from one topology to another. The following list presents the sequences of the routing performances for the different topologies:

• G1: R6, R4, R3, R5, R2, R1
• G2: R1, R3, R6, R2, R4, R5
• G3: R2, R1, R6, R3, R4, R5
• G4: R6, R3, R1, R2, R4, R5
• G5: R3, R4, R1, R2, R5, R6
• G6: R3, R6, R2, R1, R4, R5
To see how different the sequences of routing performances are, consider the sequences in G3 and G4. Excluding R4 and R5, the sequence in G3 is the reverse of that in G4, although these networks contain the same number of nodes and even the same number of channels. The only difference between these two networks is the way the nodes are connected, i.e., the network topology. Another interesting example that exhibits the effect of network topology on the performance of label-based routing is R1, which shows totally inconsistent behavior in G1 and G2: R1 is the best routing algorithm in G2, while it is the worst one in G1. We have the same scenario for R6 in G4 and G5 (Mesh8×8). As a consequence, it is wise first to specify the network topology and then choose the routing algorithm which shows the best performance on the chosen topology. Another interesting effect is that of the degree of irregularity (variance of node degrees) of the topology. It is evident from Fig. 3(e) (Mesh8×8) that the message latencies and the generation rates at which saturation occurs are nearly the same for all six routing algorithms. The reason is that, although the mesh topology is not completely regular (a network is regular when all nodes have the same degree), all of the internal nodes have the same degree of four, so the variance of node degrees goes down. The same result can be seen for G1. As a result, when the irregularity of the topology decreases, i.e., the variance of node degrees diminishes, the performances of the label-based routing algorithms become nearly the same (Figs. 3(a) and 3(e)). It should be noted that when the network size decreases (as in G1), the probability that the variance of node degrees becomes smaller increases (although this may not be true in all cases).

3.3 The Effect of Spanning Tree Construction

In the previous section, the effect of network topology (network size) on the performance of the six label-based routing algorithms was discussed. According to the numerous presented results, it was shown that the performance of this family of routing algorithms highly depends on the network topology. Going further into the structural details of the six label-based routing algorithms leads us to analyze the effect of forming different spanning trees created from different root nodes. The structure of a label-based routing algorithm is determined by two parameters: the number and the order of zones (channel labels).
Fig. 3. The average message latency of label-based routing algorithms on G1 – G6 with a message length of 64 flits
Now assume that in an arbitrary network there are two minimal paths between two nodes, each described by the sequence of channel labels along it:

Path 1: 11 → 10 → 11 → 00 → 01 → 10 → 00
Path 2: 11 → 10 → 00 → 00 → 01 → 00 → 01
Among the six routing algorithms, only R1 can direct a message through both of the existing paths. As a result, if the sequence of channel labels in a routing algorithm, such as R1, admits more possible minimal paths, the average distance of the network will decrease. Moreover, the sequence of channel labels along a path is determined by the graph labeling. Therefore, the performance of a label-based routing algorithm depends on the way the graph is labeled.
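Whether a given routing algorithm admits a particular path can be checked mechanically by walking along the path's channel labels and verifying that the zone index never moves backwards. The sketch below is an illustration only; the six zone sequences are those of R1-R6 from Section 2.1, while the example label sequence is an assumption chosen for demonstration rather than one of the two paths above.

```python
ROUTINGS = {
    'R1': [{'11', '10'}, {'01', '00'}],
    'R2': [{'11', '01'}, {'10', '00'}],
    'R3': [{'11'}, {'01', '00'}, {'10'}],
    'R4': [{'11'}, {'10', '00'}, {'01'}],
    'R5': [{'10'}, {'11', '01'}, {'00'}],
    'R6': [{'01'}, {'11', '10'}, {'00'}],
}

def admits(zones, path_labels):
    """A path is admissible if its labels visit the zones in non-decreasing order."""
    current = 0
    for label in path_labels:
        while current < len(zones) and label not in zones[current]:
            current += 1                   # only forward moves between zones
        if current == len(zones):
            return False                   # the label belongs to an earlier zone
    return True

path = ['11', '10', '10', '00', '01', '00']     # illustrative sequence (assumed)
print({name: admits(zones, path) for name, zones in ROUTINGS.items()})
# Only R1 admits this particular sequence.
```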
Fig. 4. The average message latency of label-based routing algorithms on G1 – G6 with different spanning tree roots and a message length of 64 flits: (a) R6 on G1, (b) R3 on G2, (c) R1 on G3, (d) R6 on G4, (e) R3 on G5, (f) R4 on G6
The labeling of a network is determined by the spanning tree, and the spanning tree is formed based on a root node. Thus, in an irregular topology, the number of different ways in which a graph can be labeled is equal to the number of different spanning trees that can be formed on the graph. Consequently, the performance of a label-based routing algorithm depends on the channel labels and on all the topological parameters that have a direct or indirect effect on the channel labels. Fig. 4 shows the effect of the spanning tree root on the average message latency in G1-G6 using R1, R3, R4, and R6. It can be observed that changing the root of the spanning tree, and in turn the channel labels, causes a substantial difference in the network latencies and in the generation rates at which saturation occurs.
4 Conclusion

First, in addition to covering three previously reported routing algorithms for irregular networks, we proposed three novel deadlock-free routing algorithms in a family of routing algorithms called label-based routing algorithms. Second, this work has confronted the task of evaluating the performance of the mentioned family in irregular networks under realistic conditions. Third, by analyzing the experimental results, we revealed that the network topology, the channel labels, and other topological parameters related to the channel labels have a great influence on the performance of label-based routing algorithms. Therefore, it is not possible to establish a general ranking of the performance of the six aforementioned label-based routing algorithms. With regard to previous work, which models the behavior of routing algorithms in regular networks analytically [8], further research in this line may consider developing such models for irregular networks. Moreover, investigating a general routing methodology for irregular networks and proposing heuristics to compute the best spanning tree on a given topology can be considered for future work.
References

1. Moraveji, R., Sarbazi-Azad, H.: A General Methodology of Routing in Irregular Networks. Technical Report, IPM School of Computer Science, Tehran, Iran (2007)
2. Schroeder, M.D., et al.: Autonet: A High-speed, Self-configuring Local Area Network Using Point-to-point Links. J. Selected Areas in Communication 9, 1318–1335 (1991)
3. Koibuchi, M., Funahashi, A., Jouraku, A., Amano, H.: L-Turn Routing: An Adaptive Routing in Irregular Networks. In: International Parallel Processing Conference, pp. 383–392 (2001)
4. Sancho, J.C., Robles, A., Duato, J.: An Effective Methodology to Improve the Performance of the Up*/Down* Routing Algorithm. IEEE Transactions on Parallel and Distributed Systems 15, 740–745 (2004)
5. Lysne, O., Skeie, T., Reinemo, S., Theiss, I.: Layered Routing in Irregular Networks. IEEE Transactions on Parallel and Distributed Systems 17, 51–65 (2006)
6. Puente, V., Gregorio, J.A., Vallejo, F., Beivide, R.: High-performance Adaptive Routing for Networks with Arbitrary Topology. J. System Architecture 52, 345–358 (2006)
7. Duato, J., Yalamanchili, S., Ni, L.M.: Interconnection Networks: An Engineering Approach. IEEE Computer Society Press, Los Alamitos (2003)
8. Moraveji, R., Sarbazi-Azad, H., Nayebi, A., Navi, K.: Performance Modeling of Wormhole Hypermeshes under Hot-spot Traffic. In: Diekert, V., Volkov, M.V., Voronkov, A. (eds.) CSR 2007. LNCS, vol. 4649, pp. 290–302. Springer, Heidelberg (2007)
On the Probability of Facing Fault Patterns: A Performance and Comparison Measure of Network Fault-Tolerance

Farshad Safaei1,2, Ahmad Khonsari3,2, and Reza Moraveji1,2

1 Dept. of ECE, Shahid Beheshti Univ., Tehran, Iran
2 IPM School of Computer Science, Tehran, Iran
3 Dept. of ECE, Univ. of Tehran, Tehran, Iran
[email protected], {ak,moraveji_r}@ipm.ir
Abstract. An important issue in the design and deployment of interconnection networks is network fault-tolerance for various types of failures. In designing parallel processors that use the torus as the underlying interconnection topology, as well as in designing real applications on such machines, estimates of the network reliability and fault-tolerance are important in choosing the routing algorithms and predicting their performance in the presence of faulty nodes. Under the node-failure model, the faulty nodes may coalesce into fault patterns, which are classified into two major categories, i.e., convex (|-shaped, □-shaped) and concave (L-shaped, T-shaped, +-shaped, H-shaped, U-shaped) regions. In this correspondence, we propose the first solution for computing the probability of a message facing the fault patterns in tori, both for convex and concave regions, and verify it using simulation experiments. Our approach works for any number of faults as long as the network remains connected. We use these models to measure the network fault-tolerance that can be achieved by adaptive routing, and to assess the impact of various fault patterns on the performance of such networks.
1 Introduction

Communication in faulty networks is a classical field in network theory. In practice, one cannot expect nodes or communication links to work without complications. Software or hardware faults may cause nodes or links to go down. To be able to cope with faults without serious degradation of the service, networks and routing protocols have to be set up so that they are fault-tolerant. Several recent studies address fault-tolerance in a diverse range of systems and applications [1-12]. Almost all of the performance evaluation studies for the functionality of these systems, however, have relied solely on simulation experiments. The limitations of simulation-based studies are that they are highly time-consuming and expensive. Effective analytical models are necessary for predicting the behavior of large networks to help weigh the cost-performance trade-offs of various adaptive routing algorithms. In this correspondence, we focus specifically on the impact of the fault patterns, which permits an analytical model to predict the probability of facing fault patterns experienced by a
message when an adaptive routing scheme is used. To the best of our knowledge, no study has so far been reported in the literature on calculating the probability of a message facing fault patterns in order to examine the relative performance merits of adaptive fault-tolerant routing algorithms. In this paper, we investigate the characteristics of fault patterns which are suitable for modeling faults in interconnection networks, particularly in the torus topology. Our approach employs theoretical results from algebra and combinatorics to calculate the probability of facing fault patterns in a 2-D torus. Deriving expressions for characterizing fault patterns plays a critical role in studying the performance of faulty networks by means of mathematical analysis. The rest of the correspondence is organized as follows. In Section 2, we describe the basic properties of the torus topology as well as fault-tolerance in networks. In Section 3, we derive an analytical model for calculating the probability of a message facing the fault patterns. In Section 4, the analytical results and a comparison with simulation experiments are presented. Finally, Section 5 draws conclusions.
2 Terminologies

This section starts with a discussion of the torus structure and then provides a short summary of fault-tolerance in interconnection networks. Some of these definitions are reiterated from previous works [6, 7, 10-12] for the sake of completeness.

2.1 The Torus Topology

The torus has been a popular interconnection network topology in contemporary systems [6] due to its desirable properties, such as ease of implementation and the ability to exploit communication locality to reduce message latency [12]. In addition, the torus is a regular (i.e., all nodes have the same degree) and edge-symmetric network, which improves load balancing across the channels [13].

Definition 1 [13]: An R × C 2-D torus network is denoted by TR×C. Each node
(x1, y1) is connected to its four neighbors (x1 ± 1 mod R, y1) and (x1, y1 ± 1 mod C). Therefore, the total number of channels in the torus TR×C is E = 2 × R × C.

2.2 Network Fault-Tolerance

The growth of parallel applications on multiprocessor systems-on-chip (MP-SoCs), multicomputers, cluster computers, and peer-to-peer systems motivates interest in parallel computer networks. The construction of such networks, connecting a large population of processing units and components (such as routers, channels and connectors), poses several challenges. First, the selected routing algorithm should reflect the full potential of the underlying network topology. Second, connectivity among active nodes of the interconnection network should be maintained, even in the presence of high failure rates or when a large portion of nodes is not active. To address these issues, adaptive fault-tolerant routing algorithms have been frequently suggested as a means of providing continuous operation in the presence of one or more
failures by allowing graceful system degradation. In designing a fault-tolerant routing algorithm, a suitable fault model is one of the most important issues [7-10]. The fault model should reflect fault situations in a real system. Rectangular fault models (also known as block faults) are the most common approach to model faulty nodes and to facilitate routing in 2-D tori [9]. However, rectangular fault regions sacrifice many non-faulty nodes, and hence their resources are wasted. In order to reduce the number of non-faulty nodes included in rectangular fault regions, many studies have addressed the concept of fault patterns with different shapes [7-10, 12], which may form convex or concave regions. A convex region is defined as a region ϕ in which a line segment connecting any two points in ϕ lies entirely within ϕ. If we change the "line segment" in the standard convex region definition to "horizontal or vertical line segment", the resulting region is called a rectilinear convex region [7, 10]. Any region that is not convex is a concave region. Examples of convex regions are the |-shape and □-shape, and examples of concave regions are the L-shape, U-shape, T-shape, H-shape, and +-shape. Detailed mathematical expressions for the characterization of the most common concave and convex fault regions in torus and mesh networks have been reported in [14]. For a comprehensive survey of the important issues of fault-tolerant systems and networks, the reader is referred to the articles in [12-15].
3 Mathematical Analysis

This section starts with a description of the assumptions used in the construction of the analytical models. The derivation and implementation procedure of the mathematical models are then presented. After that, the proposed models are validated through simulation experiments.

3.1 Assumptions

The analytical models are based on common assumptions that have been widely accepted in the literature [6, 7, 9, 11-15]:
i. Messages are uniformly directed to other network nodes.
ii. Messages are routed adaptively through the network. Further, a message is assumed to always follow one of the available shortest paths in the absence of faults.
iii. The probabilities of node failure in the network are equiprobable and independent of each other. Moreover, fault patterns are static [6, 7, 9-15] and do not disconnect the network.
iv. Nodes are more complex than links and thus have higher failure rates [7, 12, 14]. So, we assume only node failures.

3.2 Calculating the Probability of Message Facing Faulty Patterns

Consider an R × C torus network in which some faulty nodes have formed one of the fault patterns, such that the faulty nodes do not disconnect the network. We call such a network a connected R × C torus with the X-shape fault pattern.
In this section, our goal is to calculate the probability of a message facing the existing fault pattern in the connected R × C torus network in the presence of the X-shape fault pattern.

Remark: A path facing the fault pattern means that one or more points of the fault-pattern point set reside on the given path.

In the torus network, the position of the fault pattern does not play an important role, since, by changing coordinates, we can translate the pattern to any other location in the network without changing the relative positions of the nodes. Therefore, we can obtain the exact shape of the pattern by knowing its type and some characteristics. We denote the set of these characteristics of the X-shape fault pattern by SX. For the rectangular fault pattern, the determining characteristics are its width and height, indicated by l and h, respectively; thus S: (l, h). For instance, for the |-shape fault pattern, the determining characteristic of this line segment is its vertical height or its horizontal width:

S|: (1, h)   (vertical line segment)
S|: (l, 1)   (horizontal line segment)
Fig. 1 depicts some of the most common fault patterns together with their precise determining characteristics. For fault patterns whose horizontal and vertical determining characteristics differ, it is possible to transform the horizontal case into the vertical case by interchanging the roles of R and C in the torus network. Therefore, for all the proposed fault patterns, the determining characteristics of the vertical case are sufficient. In case the set of fault points does not match any of the above-mentioned fault patterns, we should know the coordinates of the points of the new fault pattern or define new characteristics according to its shape. Here, we investigate those fault patterns for which we know the determining characteristics of their exact shape in addition to having information about their general shape. The set of X-shape fault pattern points with characteristics SX is denoted by F(X; SX), and the probability of a path confronting it is denoted by P(X; SX). In order to calculate P(X; SX), we should enumerate all existing paths facing the X-shape fault pattern and divide their number by the number of all existing paths in the connected R × C torus network. This probability is expressed formally as
Phit = (number of minimal paths crossing the fault region) / (number of all minimal paths existing in the network)    (1)
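Equation (1) can be checked by brute force on small instances. The sketch below is an illustration only, not the authors' analytical model or simulator, and its counting conventions (shortest paths measured in the full torus, 0-based coordinates, equal-length wrap directions both counted) are assumptions; it enumerates, for every pair of non-faulty nodes, the minimal paths and those that avoid a given fault set, and reports Phit.

```python
from collections import deque
from itertools import product

def torus_adj(R, C):
    """Adjacency of an R x C 2-D torus with 0-based coordinates."""
    return {(x, y): [((x + 1) % R, y), ((x - 1) % R, y),
                     (x, (y + 1) % C), (x, (y - 1) % C)]
            for x, y in product(range(R), range(C))}

def minimal_path_counts(adj, src, faulty):
    """BFS shortest-path DAG counting: total minimal paths from src to every node,
    and minimal paths whose intermediate nodes avoid the fault set."""
    dist, total, avoid = {src: 0}, {src: 1}, {src: 1}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v], total[v], avoid[v] = dist[u] + 1, 0, 0
                queue.append(v)
            if dist[v] == dist[u] + 1:       # (u, v) lies on a shortest path
                total[v] += total[u]
                if u not in faulty:          # internal nodes must be fault-free
                    avoid[v] += avoid[u]
    return total, avoid

def probability_of_facing(R, C, faulty):
    adj, faulty = torus_adj(R, C), set(faulty)
    healthy = [v for v in adj if v not in faulty]
    hit = miss = 0
    for s in healthy:
        total, avoid = minimal_path_counts(adj, s, faulty)
        for t in healthy:
            if t != s:
                miss += avoid[t]
                hit += total[t] - avoid[t]
    return hit / (hit + miss)                # equation (1)

# Assumed example: a 6 x 5 torus with a |-shape fault of height 4 in one column.
print(round(probability_of_facing(6, 5, [(1, 2), (2, 2), (3, 2), (4, 2)]), 3))
```

Because the exact conventions (coordinate origin, handling of equal-length wrap directions) may differ from the paper's, the numbers produced by such a sketch need not coincide exactly with those reported in Table 1.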
The following theorem provides the total number of minimal paths in the network.

Theorem 1: In a connected R × C torus network with an X-shape fault region, the number of all existing paths between the pairs of non-faulty nodes is given by
Fig. 1. Examples of fault patterns in a 2-D torus network
∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b)    (2)
where F(X; SX) is the set of X-shape faulty points with determining characteristics SX.

Proof: Consider a connected R × C torus with the X-shape fault pattern, and consider two non-faulty points a and b in this network. The number of minimal paths from a to b is given by LT(a, b).
Thus, the number of all existing paths in the above network can be calculated as the aggregate of the numbers of minimal paths between all pairs of non-faulty points in the network,

∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b),

which completes the proof. ■
Example 1: Consider an 8 × 7 torus network in which the embedded T-shaped fault region has the three determining characteristics l = 7, h = 5, and h1 = 3 (see Fig. 2). We wish to route messages from point a to point b. In this network, there is a minimal path from a to b as follows:

a = (6, 3) → (6, 2) → (6, 1) → (6, 7) → (5, 7) → (4, 7) = b.

Therefore, the set of first components of the nodes along this path is {6, 5, 4} and the set of second components of the nodes along this path is {3, 2, 1, 7}. So we get

M(a, b) = {6, 5, 4} × {3, 2, 1, 7} = {(6, 3), (6, 2), (6, 1), (6, 7), (5, 3), (5, 2), (5, 1), (5, 7), (4, 3), (4, 2), (4, 1), (4, 7)}.
Fig. 2. (a) A torus network with two arbitrary points a and b in the presence of T-shaped fault pattern; (b) Demonstration of M (a, b) mesh subnetwork.
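The subnetwork M(a, b) can be generated directly from the two coordinate ranges that a minimal path between a and b traverses. The sketch below is a small illustration under stated assumptions (1-based coordinates as in Example 1 and a unique shorter wrap direction in each dimension); it is not part of the paper's formal development, but it reproduces the set computed in Example 1.

```python
from itertools import product

def coord_range(src, dst, size):
    """Labels visited in one torus dimension when moving from src to dst along
    the shorter wrap direction (1-based labels 1..size)."""
    forward, backward = (dst - src) % size, (src - dst) % size
    step = 1 if forward <= backward else -1
    return [((src - 1 + step * i) % size) + 1 for i in range(min(forward, backward) + 1)]

def M(a, b, R, C):
    return set(product(coord_range(a[0], b[0], R), coord_range(a[1], b[1], C)))

# Example 1: an 8 x 7 torus with a = (6, 3) and b = (4, 7).
print(sorted(M((6, 3), (4, 7), 8, 7)))    # the 12 points listed in Example 1
```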
Before we proceed to calculate Equation (1), we pause to give a few definitions; then we present and prove a theorem.

Definition 2: Let a and b be two non-faulty points of v(TR×C). For any two arbitrary points Ci = (xCi, yCi) and Cj = (xCj, yCj) of R(M(a, b)), the number of possible paths from Cj to Ci such that the direction of each path in dimension X (Y) is collinear with the direction of a path from a to b in dimension X (Y) is indicated by LMa,b(Cj, Ci), and is given by

LMa,b(Cj, Ci) = ( (Θx^{a,b}(Cj, Ci) + Θy^{a,b}(Cj, Ci)) choose Θx^{a,b}(Cj, Ci) )    (3)
in which Θx^{a,b}(Cj, Ci) is a function indicating the number of orientations along a path from Cj to Ci in dimension X which are collinear with the orientations along a path from a to b, and its criterion is expressed as

Θx^{a,b}(Cj, Ci) =
  Δx(Cj, Ci)    if (0 ≥ xb − xa ≥ −⌊R/2⌋ or xb − xa > ⌊R/2⌋) and (0 ≥ xCi − xCj ≥ −⌊R/2⌋ or xCi − xCj > ⌊R/2⌋),
  Δx(Cj, Ci)    if (0 ≤ xb − xa ≤ ⌊R/2⌋ or xb − xa < −⌊R/2⌋) and (0 ≤ xCi − xCj ≤ ⌊R/2⌋ or xCi − xCj < −⌊R/2⌋),
  −Δx(Cj, Ci)   otherwise.    (4)
Similarly, we can obtain the criterion of the function Θy^{a,b}(Cj, Ci) by interchanging the roles of X and Y (and of R and C):

Θy^{a,b}(Cj, Ci) =
  Δy(Cj, Ci)    if (0 ≥ yb − ya ≥ −⌊C/2⌋ or yb − ya > ⌊C/2⌋) and (0 ≥ yCi − yCj ≥ −⌊C/2⌋ or yCi − yCj > ⌊C/2⌋),
  Δy(Cj, Ci)    if (0 ≤ yb − ya ≤ ⌊C/2⌋ or yb − ya < −⌊C/2⌋) and (0 ≤ yCi − yCj ≤ ⌊C/2⌋ or yCi − yCj < −⌊C/2⌋),
  −Δy(Cj, Ci)   otherwise.    (5)
Theorem 2: Given that a and b are two non-faulty points of a torus network and R(M(a, b)) = {C1, C2, …, Ck}, the number of paths from a to b that do not traverse the points C1, C2, …, Ck can be calculated as

det_{0 ≤ i, j ≤ k} dij(a, b)    (6)

where

d0j(a, b) = LMa,b(Cj, Ck+1),  j = 0, 1, …, k,
dij(a, b) = LMa,b(Cj, Ci),    i = 1, 2, …, k,  j = 0, 1, …, k.    (7)
Proof: The proof is quite involved and we omit it due to lack of space; the interested reader is referred to [16].

Theorem 3: Let TR×C be a connected R × C torus network with an X-shape fault region having characteristics SX. The number of paths in TR×C not facing the fault pattern is expressed as
∑_{a,b ∈ v(TR×C) \ F(X;SX)}  det_{0 ≤ i, j ≤ Ca,b} dij(a, b)    (8)
in which Ca,b is the number of elements of R(M(a, b)); that is, |R(M(a, b))| = Ca,b.

Proof: Consider two arbitrary points a and b from the set v(TR×C) \ F(X; SX). According to Theorem 2, the number of minimal paths from a to b not traversing the points of R(M(a, b)) equals det_{0 ≤ i, j ≤ Ca,b} dij(a, b). Therefore, the number of minimal paths from a to b not crossing the F(X; SX) points is also equal to det_{0 ≤ i, j ≤ Ca,b} dij(a, b). So, the number of all existing paths in TR×C not traversing the F(X; SX) points is equal to the aggregate of the numbers of such paths between all pairs of non-faulty points in TR×C, that is,

∑_{a,b ∈ v(TR×C) \ F(X;SX)}  det_{0 ≤ i, j ≤ Ca,b} dij(a, b).  ■
It follows from the preceding theorem that the probability that a path in TR×C does not face the fault pattern, Pmiss, is given by

Pmiss = ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} det_{0 ≤ i, j ≤ Ca,b} dij(a, b) ) / ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b) )    (9)
Therefore, it is trivial that

P(X; SX) = Phit = 1 − Pmiss = 1 − ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} det_{0 ≤ i, j ≤ Ca,b} dij(a, b) ) / ( ∑_{a,b ∈ v(TR×C) \ F(X;SX)} LT(a, b) ).    (10)
4 Experimental Results

In the previous sections, we derived mathematical expressions to calculate the probability of facing the fault patterns. These analytical expressions form the core of computations for other fault patterns and other topologies and can be extensively generalized. An experimental approach is necessary to verify the analytical evaluation to which the mathematical analysis led. A program has been developed which simulates the failure of nodes and the subsequent construction of the corresponding fault patterns. The simulator generates faults in the network so that the resulting fault regions are convex or concave. It also checks that all nodes in the network are still connected using adaptive routing. The objective of the simulation is to measure the values of the probability of facing the fault patterns for different numbers of faulty nodes in the torus topology. For every run, the simulator creates the fault pattern and keeps statistics of the following data:
• The number of minimal paths crossing the network.
• The number of minimal paths confronting the fault pattern.
• For each source-destination pair, the probability of facing the fault pattern is computed.

Table 1 reveals the results obtained from the simulation experiments and the mathematical models in the torus for different sizes of the network and various shapes of fault patterns (a minimal sketch of the fault-generation step of such an experiment is given after the table).

Table 1. Experimental results of the probability of facing fault patterns in the torus with different fault patterns and various sizes of the network, which agree with the analytical expressions

                                             Torus network (TR×C)
Fault pattern characteristics              | 9×13  | 10×10 | 11×9  | 6×7   | 6×5
|-shape, h=4                               | 0.173 | 0.165 | 0.147 | 0.214 | 0.194
|-shape, l=3                               | 0.109 | 0.131 | 0.134 | 0.156 | 0.191
||-shape, h=h1=3, l=3, h'=2 (case 1)       | 0.208 | 0.219 | 0.201 | 0.310 | 0.377
||-shape, h=4, h1=3, l=2, h'=2 (case 2)    | 0.212 | 0.201 | 0.183 | 0.257 | 0.244
L-shape, h=3, l=3                          | 0.165 | 0.182 | 0.170 | 0.229 | 0.263
L-shape, h=4, l=5                          | 0.247 | 0.286 | 0.273 | 0.395 | 0.429
T-shape, l=4, h=3, l1=2                    | 0.178 | 0.201 | 0.194 | 0.261 | 0.305
T-shape, l=5, h=4, l1=4                    | 0.237 | 0.261 | 0.257 | 0.375 | 0.397
U-shape, l=3, h=4, h1=2                    | 0.207 | 0.219 | 0.202 | 0.296 | 0.353
□-shape, l=3, h=2                          | 0.133 | 0.150 | 0.147 | 0.170 | 0.195
H-shape, l=3, h=4, h1=3, h'=3, h'1=2       | 0.213 | 0.226 | 0.209 | 0.317 | 0.392
H-shape, l=5, h=4, l1=4, h1=2              | 0.216 | 0.243 | 0.239 | 0.339 | 0.346
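A minimal sketch of the fault-generation step of such an experiment is given below; it is an illustration only, not the authors' simulator, and the exact geometric convention used to place the L-shape is an assumption. It builds an L-shaped fault set from its characteristics and verifies that the surviving nodes of the torus remain connected, which is the precondition required before Phit is measured.

```python
from collections import deque
from itertools import product

def l_shape(anchor, l, h):
    """Faulty nodes of an L-shape of height h and length l anchored at `anchor`
    (a vertical bar plus a horizontal bar; placement convention assumed)."""
    x0, y0 = anchor
    return {(x0 + i, y0) for i in range(h)} | {(x0 + h - 1, y0 + j) for j in range(l)}

def still_connected(R, C, faulty):
    """BFS over the non-faulty nodes of the R x C torus."""
    healthy = {v for v in product(range(R), range(C)) if v not in faulty}
    start = next(iter(healthy))
    seen, queue = {start}, deque([start])
    while queue:
        x, y = queue.popleft()
        for v in (((x + 1) % R, y), ((x - 1) % R, y), (x, (y + 1) % C), (x, (y - 1) % C)):
            if v in healthy and v not in seen:
                seen.add(v)
                queue.append(v)
    return seen == healthy

faults = l_shape((1, 1), l=3, h=3)     # L-shape with l = 3, h = 3 as in Table 1
print(sorted(faults), still_connected(6, 7, faults))
```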
5 Conclusions

In recent years, efforts have been made to integrate the performance and reliability of adaptive routing algorithms in order to overcome the drawbacks of the traditional evaluation methods for interconnection networks. For this purpose, a new performance metric of network reliability, the probability of facing fault patterns, has been introduced. It is used to assess the performance-related reliability of such routing schemes in the presence of fault patterns, which can be categorized into two major classes of convex (|-shape, □-shape) and concave (L-shape, U-shape, +-shape, T-shape, and H-shape) regions. In this paper, we have derived mathematical expressions for calculating the probability of a message facing fault patterns in adaptively-routed torus networks. Predicting network measures, such as message latency and channel waiting times, throughout a faulty network is an application of the results derived in this paper. Since
the mesh topology has become a popular interconnection architecture for constructing massively parallel computers, a more challenging extension of our work would be to propose mathematical expressions for fault patterns in the well-known mesh topologies.
References

1. Chakravorty, S., Kalé, L.V.: A Fault Tolerant Protocol for Massively Parallel Systems. In: Proceedings of the 16th International Symposium on Parallel and Distributed Processing (2004)
2. Al-Karaki, J.N.: Performance Analysis of Repairable Cluster of Workstations. In: Proceedings of the 16th International Symposium on Parallel and Distributed Processing (2004)
3. Karimou, D., Myoupo, J.F.: A Fault-Tolerant Permutation Routing Algorithm in Mobile Ad-Hoc Networks. In: Lorenz, P., Dini, P. (eds.) ICN 2005. LNCS, vol. 3421, pp. 107–115. Springer, Heidelberg (2005)
4. Gupta, G., Younis, M.: Fault-tolerant clustering of wireless sensor networks. In: IEEE Conf. on Wireless Communications and Networking, pp. 1579–1584 (2003)
5. Pande, P.P., et al.: Performance Evaluation and Design Trade-Offs for Network-on-Chip Interconnect Architectures. IEEE Trans. Computers 54(8), 1025–1040 (2005)
6. Dao, B.V., Duato, J., Yalamanchili, S.: Dynamically configurable message flow control for fault-tolerant routing. IEEE Transactions on Parallel and Distributed Systems 10(1), 7–22 (1999)
7. Suh, Y.J., et al.: Software-based rerouting for fault-tolerant pipelined communication. IEEE Trans. on Parallel and Distributed Systems 11(3), 193–211 (2000)
8. Chen, C.L., Chiu, G.M.: A fault-tolerant routing scheme for meshes with nonconvex faults. IEEE Trans. on Parallel and Distributed Systems 12(5), 467–475 (2001)
9. Shih, J.-D.: Fault-tolerant wormhole routing in torus networks with overlapped block faults. IEE Proc.-Comput. Digit. Tech. 150(1), 29–37 (2003)
10. Wu, J., Jiang, Z.: On Constructing the Minimum Orthogonal Convex Polygon in 2-D Faulty Meshes. IPDPS (2004)
11. Theiss, I.: Modularity, Routing and Fault Tolerance in Interconnection Networks. PhD thesis, Faculty of Mathematics and Natural Sciences, University of Oslo (2004)
12. Gómez, M.E., et al.: A Routing Methodology for Achieving Fault Tolerance in Direct Networks. IEEE Transactions on Computers 55(4), 400–415 (2006)
13. Duato, J., Yalamanchili, S., Ni, L.M.: Interconnection networks: An engineering approach. Morgan Kaufmann Publishers, San Francisco (2003)
14. Hoseiny Farahabady, M., Safaei, F., Khonsari, A., Fathy, M.: Characterization of Spatial Fault Patterns in Interconnection Networks. Journal of Parallel Computing 32(11-12), 886–901 (2006)
15. Xu, J.: Topological structure and analysis of interconnection networks. Kluwer Academic Publishers, Dordrecht (2001)
16. Safaei, F., Fathy, M., Khonsari, A., Gilak, M., Ould-Khaoua, M.: A New Performance Measure for Characterizing Fault-Rings in Interconnection Networks. Journal of Information Sciences (submitted, 2007)
Cost-Minimizing Algorithm for Replica Allocation and Topology Assignment Problem in WAN

Marcin Markowski and Andrzej Kasprzak

Wroclaw University of Technology, Chair of Systems and Computer Networks, Wybrzeze Wyspianskiego 27, 50-370 Wroclaw, Poland
{marcin.markowski,andrzej.kasprzak}@pwr.wroc.pl
Abstract. The paper deals with the problem of simultaneous assignment of network topology and server replica placement in a wide area network in order to minimize a criterion composed of the leasing capacity cost and the building cost of the network. An exact algorithm, based on the branch and bound method, is proposed to solve the problem. The algorithm takes into account the problem of ensuring the reliability of the network. Some interesting features observed during the computational experiments are reported.

Keywords: resource replication, topology assignment, WAN.
1 Introduction

Designing or modernizing a wide area network (WAN) consists in the assignment of resource allocation (i.e., servers and replicas of servers), topology and flow routes. The optimal arrangement of resources and the optimal allocation of network channels let us obtain the most efficient and economical solution. Designing wide area computer networks is always a compromise between the quality of service and the reliability of the network on the one hand, and the costs needed to build and to support the network on the other. Quality of service and the network costs are criteria often used during the designing process. In [1] we considered a problem based on designing the WAN under the assumption that the maximal support cost of the network is bounded. In [2] we proposed an algorithm for the CFA problem with a cost criterion. In those papers the reliability requirements were not considered. In many cases it is useful to formulate the optimization problem as follows: minimize the cost of the network when the acceptable quality level and the required reliability level are given. In this paper, therefore, the problem of server replication and topology assignment with a cost criterion, a delay constraint and a reliability constraint is considered. In our opinion it is well-founded to consider two kinds of cost: the building cost of the network, borne once, and the supporting cost (i.e., connected with the capacity leasing), borne regularly. The criterion function is then composed of two ingredients: the regular channel capacity leasing cost and the one-time server cost. We assume that the maximal acceptable total average delay in the network is given as a constraint.
Designing the wide area network topology consists in the assignment of channel locations and the choice of channel capacities. A properly designed network topology should ensure communication between all pairs of nodes in the WAN: there must be at least one path between each pair of nodes in the network. In case of a channel or node failure, all paths leading through this channel or node become unserviceable, and when only one path exists between a pair of nodes, the failure of any of the path elements makes communication between those nodes impossible. To ensure the reliability and survivability of the network, several different paths should exist between each pair of nodes. Paths are different when they do not have any common channel or node, except the source and destination nodes. Usually the minimal condition imposed on the network is to ensure two paths between each node pair. Some problems and solutions related to network reliability are presented in [3, 4]. We assume that the minimal number of paths between each pair of nodes is given as a constraint; this allows us to design a network topology with the required reliability level. The problem considered here may be formulated as follows:

given: user allocation at nodes; the set of nodes to which replicas may be connected, for each server; the maximal value of the total average delay in the network; traffic requirements (user-user and user-server); the set of potential channels; capacities and their costs (i.e., the cost-capacity function) for each potential channel;
minimize: a linear combination of the capacity leasing cost and the server cost;
over: server allocation, channel capacities and the multicommodity flow;
subject to: multicommodity flow constraints, channel capacity constraints, server allocation constraints, the total average delay constraint, and the network reliability constraint (a minimal number of different paths between nodes).

We consider a discrete cost-capacity function because it is the most important from the practical point of view: the channel capacities can be chosen from the sequence defined by international ITU-T recommendations. The problem formulated in this way is NP-complete, as it is more general than the capacity and flow assignment (CFA) problem with a discrete cost-capacity function, which is NP-complete [5]. The literature focusing on the simultaneous server replication and topology assignment problem is very limited. Some algorithms for this problem with a different delay criterion may be found in [1, 6]. The problem formulated here is more general: it uses a cost criterion and takes into account the maximal acceptable average delay in the WAN as a constraint. Moreover, it takes into account some aspects of the reliability of the designed network. Such a formulation of the problem has not been considered in the literature yet.
2 Problem Formulation
Let $n$ be the number of nodes of the wide area network and $b$ the number of potential channels which may be used to build the network. For each potential channel $i$ there is a set $C^i = \{c^i_1, \ldots, c^i_{s(i)-1}\}$ of alternative capacity values from which exactly one must be chosen if the $i$-th channel is used to build the WAN. Let $d^i_j$ be the cost of leasing the capacity $c^i_j$ [€/month]. Let $c^i_{s(i)} = 0$ for $i = 1, \ldots, b$, and let $\overline{C}^i = C^i \cup \{c^i_{s(i)}\}$ be the set of alternative capacities from among which exactly one must be used for channel $i$. If the capacity $c^i_{s(i)}$ is chosen, then the $i$-th channel is not used to build the wide area network. Let $x^i_j$ be the decision variable, equal to one if the capacity $c^i_j$ is assigned to channel $i$ and equal to zero otherwise. Since exactly one capacity from the set $\overline{C}^i$ must be chosen for channel $i$, the following condition must be satisfied:

$$\sum_{j=1}^{s(i)} x^i_j = 1 \quad \text{for } i = 1, \ldots, b. \tag{1}$$
Let $W^i = \{x^i_1, \ldots, x^i_{s(i)}\}$ be the set of variables $x^i_j$ which correspond to the $i$-th channel. Let $X'_r$ be a permutation of values of all variables $x^i_j$ for which condition (1) is satisfied, and let $X_r$ be the set of variables which are equal to one in $X'_r$. When designing the wide area network topology (channel allocation), the reliability of the network should be ensured; in particular, in case of a failure of a link (channel) or a node, some routes must be redirected to another path. Let $PN$ be the least number of different paths between each pair of nodes in the network. Paths between two nodes are different only when they do not have any common nodes or channels. Let $MPN$ be the minimal number of paths which have to exist between all pairs of nodes of the WAN; ensuring $MPN$ paths is very important for the reliability of the network. Let $K$ denote the total number of servers which must be allocated in the WAN and let $LK_k$ denote the number of replicas of the $k$-th server. Let $M_k$ be the set of nodes to which the $k$-th server (or a replica of the $k$-th server) may be connected, and let $e(k)$ be the number of all possible allocations of the $k$-th server. Since only one replica of a server may be allocated at a node, the following condition must be satisfied:

$$LK_k \le e(k) \quad \text{for } k = 1, \ldots, K. \tag{2}$$

Let $y_{kh}$ be the binary decision variable for the $k$-th server allocation; $y_{kh}$ is equal to one if a replica of the $k$-th server is connected to node $h$, and equal to zero otherwise. Since $LK_k$ replicas of the $k$-th server must be allocated in the network, the following condition must be satisfied:

$$\sum_{h \in M_k} y_{kh} = LK_k \quad \text{for } k = 1, \ldots, K. \tag{3}$$
Let $Y_r$ be the set of all variables $y_{kh}$ which are equal to one. The pair of sets $(X_r, Y_r)$ is called a selection. Let $\Re$ be the family of all selections. $X_r$ determines the network topology and the capacities of channels, and $Y_r$ determines the replica allocation at the nodes of the WAN. Let $T(X_r, Y_r)$ be the minimal average delay per packet in a WAN in which the values of channel capacities are given by $X_r$ and the traffic requirements are given by $Y_r$ (depending on the server replica allocation). $T(X_r, Y_r)$ can be obtained by solving a multicommodity flow problem in the network [7]. Let $U(Y_r)$ be the server cost and let
$D(X_r)$ be the capacity cost. Let $Q(X_r, Y_r)$ be a linear combination of the capacity cost and the server cost:

$$Q(X_r, Y_r) = \alpha D(X_r) + \beta U(Y_r) \tag{4}$$
where $\alpha$ and $\beta$ are positive coefficients, $\alpha, \beta \in [0,1]$, $\alpha + \beta = 1$. Let $T^{max}$ be the maximal acceptable average delay per packet in the WAN. Then the considered server allocation, capacity and flow assignment problem in the WAN with a total average delay constraint is formulated as follows:

$$\min_{(X_r, Y_r)} Q(X_r, Y_r) \tag{5}$$

subject to

$$PN_{X_r} \ge MPN \tag{6}$$
$$(X_r, Y_r) \in \Re \tag{7}$$
$$T(X_r, Y_r) \le T^{max} \tag{8}$$

where $PN_{X_r}$ is the least number of different paths between each pair of nodes in the network in which the values of channel capacities are given by $X_r$.
3 Calculation Scheme of the Branch and Bound Algorithm
Assuming that $LK_k = 1$ for $k = 1, \ldots, K$ and $\overline{C}^i = C^i$ for $i = 1, \ldots, b$, and omitting constraint (6), the problem (5)–(8) reduces to the "host allocation, capacity and flow assignment problem". Since the host allocation, capacity and flow assignment problem is NP-complete [6, 7], the problem (5)–(8) is also NP-complete as it is more general. The branch and bound method can therefore be used to construct an exact algorithm. Starting with a selection $(X_1, Y_1) \in \Re$, we generate a sequence of selections $(X_s, Y_s)$. Each selection $(X_s, Y_s)$ is obtained from a certain selection $(X_r, Y_r)$ of the sequence by complementing one variable $x^i_j$ (or $y_{kh}$) with another variable from $W^i$ (or from $\{y_{km} : m \in M_k \text{ and } m \ne h\}$). For each selection $(X_r, Y_r)$ we constantly fix a subset $F_r \subseteq X_r \cup Y_r$ and momentarily fix a set $F^t_r$. The variables in $F_r$ are constantly fixed and represent the path
from the initial selection $(X_1, Y_1)$ to the selection $(X_r, Y_r)$. Each momentarily fixed variable in $F^t_r$ is a variable abandoned during the backtracking process. Variables which belong neither to $F_r$ nor to $F^t_r$ are called free in $(X_r, Y_r)$. There are two important elements of the branch and bound method: the testing operation (a lower bound of the criterion function) and the branching rules. In the next sections of the paper the testing operation and the choice operation are therefore proposed. The lower bound $LB_r$ and the branching rules are calculated for each selection $(X_r, Y_r)$. The lower bound is calculated to check whether a "better" solution (a selection $(X_s, Y_s)$) may be found. If the test is negative, we abandon the considered selection $(X_r, Y_r)$ and backtrack to the selection $(X_p, Y_p)$ from which the selection $(X_r, Y_r)$
was generated. The basic task of the branching rules is to find the variables whose complementing generates a new selection with the least possible value of the criterion function. A detailed description of the calculation scheme of the branch and bound method may be found in [8].
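The scheme can be pictured with the depth-first skeleton below. It is a generic branch-and-bound sketch in which the selection type and the four operations are placeholders for the routines described in Sections 4 and 5; it is not a reproduction of the exact scheme of [8].

```cpp
#include <algorithm>
#include <functional>
#include <limits>
#include <vector>

// Generic depth-first branch-and-bound skeleton.  "Node" stands for a selection
// (X_r, Y_r) together with its fixed sets F_r and F_r^t.
template <typename Node>
double branchAndBound(const Node& root,
                      std::function<double(const Node&)> lowerBound,   // LB_r of the relaxation
                      std::function<double(const Node&)> criterion,    // Q(X_r, Y_r), eq. (4)
                      std::function<bool(const Node&)> feasible,       // constraints (6)-(8)
                      std::function<std::vector<Node>(const Node&)> successors)
{
    double best = std::numeric_limits<double>::infinity();
    std::vector<Node> stack{root};                     // depth-first search, explicit backtracking
    while (!stack.empty()) {
        Node current = stack.back();
        stack.pop_back();
        if (lowerBound(current) >= best) continue;     // testing operation: prune and backtrack
        if (feasible(current))
            best = std::min(best, criterion(current)); // record the incumbent solution
        for (const Node& s : successors(current))      // branch by complementing x_j^i or y_kh
            stack.push_back(s);
    }
    return best;                                       // value of (5) when the search completes
}
```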
4 The Lower Bound
Since the traffic requirements in the network depend on the server allocation, obtaining a lower bound for the problem (5)–(8) is difficult. We propose that the lower bound be obtained by relaxing some constraints and by approximating the discrete cost-capacity curves with their lower linear envelope [5, 7]. To find the lower bound $LB_r$ of the criterion function (4) we reformulate the problem (5)–(8) in the following way:
- we assume that the variables $x^i_j \in X_r - F_r$, such that $X^i \cap F_r = \emptyset$, are continuous variables. Then we can approximate the discrete cost-capacity curves (given by the set $C^i$) with their lower linear envelope. Let $Z'$ be the set of channels $i$ for which the variables $x^i_j$ are continuous. The criterion function (4) then turns into:
$$Q(X_r, Y_r) = \alpha \left( \sum_{i \in Z'} c^i d^i + \sum_{i:\, x^i_j \in F_r} d^i_j \right) + \beta \sum_{y_{kh} \in Y_r} y_{kh} u_{kh},$$

where $d^i = \min_{x^i_j \in X^i} \left( d^i_j / c^i_j \right)$ and $c^i$ is the (continuous) capacity of channel $i$.
- we assume that the variables $y_{kh} \in Y^k - F_r$, for $k = 1, \ldots, K$, such that $Y^k \cap F_r = \emptyset$, are continuous variables. We create the model of the WAN in the following way. We add to the considered network $2K$ new artificial nodes, numbered from $n+1$ to $n+2K$. The artificial nodes $n+k$ and $n+K+k$ correspond to the $k$-th server. Moreover, we add to the network the directed artificial channels $\langle n+k, m \rangle$, $\langle m, n+K+k \rangle$, $\langle n+K+k, n+k \rangle$, for all $m \in M_k$ and $k = 1, \ldots, K$ such that $Y^k \cap F_r = \emptyset$. The capacities of the new artificial channels are: $c(n+k, m) = \infty$, $c(m, n+K+k) = \infty$, $c(n+K+k, n+k) = \sum_{h=1}^{n} u_{kh}$. The leasing costs of all artificial channels are equal to zero. Then the lower bound $LB_r$ of the minimal value of the criterion function $Q(X_s, Y_s)$
for every possible successor ( X s , Ys ) generated from ( X r , Yr ) may be obtained by solving the following optimization problem: 2 ⎞ ⎛ ⎛ ⎞ ⎛ ⎟ ⎜ ⎜ ⎟ d i ⎞⎟ ⎜ ⎟ ⎜ ⎜ f ⎟ ∑ i ⎜⎜ ⎟ ⎜ ⎜ c i ⎟⎟ ⎟ i∈Z ' ⎝ ⎠ LBr = min ⎜ α ⎜ ∑ d i f i + + ∑ d i ⎟ + β ∑ ykhukh ⎟ f f ⎜ ⎜ i∈Z ' ⎟ y kh ∈Yr i: x ij ∈Fr ⎟ γT max − ∑ i i i ⎟ ⎜ ⎜ ⎟ i x c f − i x j ∈Fr j j ⎟⎟ ⎜⎜ ⎜ ⎟ ⎠ ⎠ ⎝ ⎝ subject to
f i ≤ c i for i ∈ Z' fi ≤ fi ≤
x ij c ij
ir c max
for
x ij
∈ Fr
for each i ∈ Z '
(9)
(10) (11) (12)
ir where c max is maximal capacity connected with variables x ij ∈ X i − Frt , and Frt is the subset of momentarily fixed variables. The solution of problem (9−12) gives the lower bound LBr. To solve the problem (9-12) we can use an efficient Flow Deviation method [5, 7].
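As a small illustration of the relaxation, the sketch below computes the lower-linear-envelope slope $d^i = \min_j d^i_j / c^i_j$ used above for every channel whose capacity variables are free; the container layout is an assumption for the sake of the example, not taken from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <limits>
#include <vector>

// Slope of the lower linear envelope of the discrete cost-capacity points of each
// channel: d^i = min_j d_j^i / c_j^i, taken over the capacity options with c_j^i > 0.
std::vector<double> envelopeSlopes(
        const std::vector<std::vector<double>>& capacity,   // c_j^i
        const std::vector<std::vector<double>>& leaseCost)  // d_j^i [cost/month]
{
    std::vector<double> slope(capacity.size(),
                              std::numeric_limits<double>::infinity());
    for (std::size_t i = 0; i < capacity.size(); ++i)
        for (std::size_t j = 0; j < capacity[i].size(); ++j)
            if (capacity[i][j] > 0.0)
                slope[i] = std::min(slope[i], leaseCost[i][j] / capacity[i][j]);
    return slope;   // slope[i] is the cost per unit of the relaxed, continuous capacity c^i
}
```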
5 Branching Rules
The purpose of the branching rules is to find the normal variable of $(X_r, Y_r)$ whose complementing generates a successor of the selection $(X_r, Y_r)$ with the least possible value of the criterion function (4). The choice criteria should be constructed in such a way that complementing reduces the value of (4) while the increase of the total average delay in the network is as small as possible. Complementing a variable $x^i_j$ changes the capacity of channel $i$; then the average delay in the network and the capacity cost change, while the server cost does not change. Moreover, complementing a variable $x^i_j$, $j < s(i)$, with
$x^i_{s(i)} = 0$ changes the network topology, and constraint (6) must not be violated. We propose the following choice criterion for complementing the variable $x^i_j \in X_r$ with the variable $x^i_l \in X_s$:
$$\Delta^i_{jl} = \begin{cases} \dfrac{\dfrac{f_i}{c^i_l - f_i} - \dfrac{f_i}{c^i_j - f_i}}{\alpha \left( d^i_j - d^i_l \right)} & \text{if } f_i < c^i_l \text{ and } PN_{X_s} \ge MPN \\[2ex] \infty & \text{otherwise} \end{cases}$$
where $f_i$ is the flow in the $i$-th channel obtained by solving the multicommodity flow problem for the network topology and channel capacities given by the selection $X_r$. The choice criterion for complementing the variable $y_{kh} \in Y_r$ with the variable $y_{km} \in Y_s$ may be formulated as follows:
$$\delta^k_{hm} = \begin{cases} \dfrac{\dfrac{1}{\gamma} \displaystyle\sum_{x^i_j \in X_r} \dfrac{\tilde{f}_i}{x^i_j c^i_j - \tilde{f}_i} \; - \; T(X_r, Y_r)}{\beta \left( u_{kh} - u_{km} \right)} & \text{if } \tilde{f}_i < x^i_j c^i_j \text{ for } x^i_j \in X_r \\[2ex] \infty & \text{otherwise} \end{cases}$$
where the flow $\tilde{f} = [\tilde{f}_1, \ldots, \tilde{f}_b]$ is constructed as follows: the flow from all users to the $k$-th server is moved from the routes leading from the users to node $h$ to the routes leading from the users to node $m$. The calculation scheme for obtaining the flow $\tilde{f}$ may be found in [1]. Let $E_r = (X_r \cup Y_r) - F_r$, and let $G_r$ be the set of all reverse variables of the normal variables which belong to the set $E_r$. We want to choose a normal variable whose complementing generates a successor with the least possible value of the criterion (4). We should therefore choose the pair $\{(x^i_j, x^i_l) : x^i_j \in E_r, x^i_l \in G_r\}$ or $\{(y_{kh}, y_{km}) : y_{kh} \in E_r, y_{km} \in G_r\}$ for which the value of the criterion $\Delta^i_{jl}$ or $\delta^k_{hm}$ is minimal.
6 Computational Results
The presented algorithm was implemented in C++. Extensive numerical experiments have been performed with this algorithm for many different networks. The experiments were conducted with two main purposes in mind: first, to examine the impact of various problem parameters on the solution (i.e. on the value of the criterion $Q$), and second, to test the computational efficiency of the algorithm. The typical dependence of the optimal value of $Q$ on the maximal acceptable total average delay per packet $T^{max}$, for different values of the parameters $\alpha$ and $\beta$, is presented in Fig. 1. It follows from the computational experiments that $Q$ is a decreasing function of $T^{max}$. The following conclusion follows from the computer experiments (Fig. 1).
Conclusion 1. There exists an acceptable total average delay per packet $T^{max}_*$ such that the problem (5)–(8) has the same solution for each $T^{max}$ greater than or equal to $T^{max}_*$.
The typical dependence of the optimal value of $D$ on the value of the parameter $\alpha$ ($\beta = 1 - \alpha$) is presented in Fig. 2. It follows from the computational experiments and from Fig. 2 that $D$ is an increasing function of $\alpha$. We have examined the impact of the reliability parameter $MPN$ on the solution. Typical dependences of the optimal value of the criterion $Q$ and of the total average delay in the network on the minimal number of different paths are presented in Fig. 3 and Fig. 4. Experiments were conducted for $MPN = 1$, 2 and 3. Obtaining more than three different paths between each pair of nodes is difficult or even impossible for small and medium wide area networks. To ensure $MPN = 4$ there must exist at least four channels adjacent to each node of the network; this makes the network very expensive, because the leasing costs of the channels increase quickly and the cost of the nodes (WAN switches) increases as well. As follows from Fig. 3, the value of the combined cost criterion $Q$ is quite similar for $MPN = 1$ and $MPN = 2$ and increases rapidly for $MPN > 2$. Similar dependences were observed for all examined networks. The typical dependence of the optimal value of the total average delay in the network, obtained by solving the problem (5)–(8), on the minimal number of different paths is a decreasing function (Fig. 4). In most cases the dependence of $T$ on $MPN$ can be approximated by a linear function. Based on the results, partially presented in Fig. 3 and Fig. 4, we can formulate the following conclusion. Conclusion 2. For small and medium wide area networks the optimal value of the minimal number of paths between each pair of nodes is equal to two. The computational properties of the presented algorithm were tested during the experiments. Let $NT = \frac{T^{max} - T_{min}}{T^{max}_* - T_{min}} \cdot 100\%$ be the normalized maximal acceptable total average delay per packet in the network; the problem (5)–(8) has no
Fig. 1. Typical dependence of the optimal value of criterion Q on maximal acceptable total average delay per packet
Fig. 2. Typical dependence of the optimal value of D on the coefficient α
solution for $T^{max} < T_{min}$. This normalized value lets us compare the results obtained for different wide area network topologies and for different numbers and locations of servers. Let $P(u, v)$, in percent, be the arithmetic mean of the relative number of iterations for $NT \in [u, v]$, calculated over all considered network topologies and different server locations. Fig. 5 shows the dependence of $P$ on the divisions $[0\%, 10\%), [10\%, 20\%), \ldots, [90\%, 100\%]$ of $NT$. It follows from Fig. 5 that the exact algorithm is especially effective from the computational point of view for $NT \in [60\%, 100\%]$.
Fig. 3. Typical dependence of the optimal value of criterion Q on the minimal number of different paths. Fig. 4. Typical dependence of the optimal value of the total average delay in the network on the minimal number of different paths.
7 Conclusion
In this paper an exact algorithm for solving the server replication and topology assignment problem with a cost criterion, a delay constraint and a reliability constraint is proposed. Such a formulation of the problem has not been considered in the literature before. It follows from the computational experiments that the presented algorithm is computationally effective for larger values of the acceptable average delay in
Fig. 5. The dependence of P on normalized maximal average delay per packet NT
the WAN (Fig. 5). In our opinion the WAN property formulated as Conclusion 2 is very important from the practical point of view: it gives the optimal value of the minimal number of paths between each pair of nodes in small and medium networks. Moreover, the properties presented in Sections 4 and 5 may be very useful for constructing an effective approximate algorithm for solving the problem (5)–(8). This work was supported by a research project of the Polish State Committee for Scientific Research in 2005–2007.
References
1. Markowski, M., Kasprzak, A.: The Three-Criteria Servers Replication and Topology Assignment Problem in Wide Area Networks. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganá, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3982, pp. 1119–1128. Springer, Heidelberg (2006)
2. Markowski, M., Kasprzak, A.: An Exact Algorithm for the Servers Allocation, Capacity and Flow Assignment Problem with Cost Criterion and Delay Constraint in Wide Area Networks. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 442–445. Springer, Heidelberg (2007)
3. Koide, T., Shinmori, S., Ishii, H.: Topological Optimization with a Network Reliability Constraint. Discrete Applied Mathematics 115, 135–149 (2001)
4. Yi-Kuei, L.: Reliability of a Flow Network Subject to Budget Constraints. IEEE Transactions on Reliability 56(1), 10–16 (2007)
5. Pioro, M., Medhi, D.: Routing, Flow, and Capacity Design in Communication and Computer Networks. Elsevier, Morgan Kaufmann Publishers, San Francisco (2004)
6. Chari, K.: Resource Allocation and Capacity Assignment in Distributed Systems. Computers Ops Res. 23(11), 1025–1041 (1996)
7. Fratta, L., Gerla, M., Kleinrock, L.: The Flow Deviation Method: An Approach to Store-and-Forward Communication Network Design. Networks 3, 97–133 (1973)
8. Wolsey, L.A.: Integer Programming. Wiley-Interscience, New York (1998)
Bluetooth ACL Packet Selection Via Maximizing the Expected Throughput Efficiency of ARQ Protocol
Xiang Li1,2,*, Man-Tian Li1, Zhen-Guo Gao2, and Li-Ning Sun1
1 Robot Research Institute, Harbin Institute of Technology, Harbin, 150001, China
2 College of Computer Science and Technology, Harbin Engineering University, Harbin, 150001, China
{leexiang, gag}@hrbeu.edu.cn, {limt, lnsun}@hit.edu.cn
Abstract. Bluetooth provides several data packet types with different sizes and error correction mechanisms, so the adapter layer can choose the most suitable packet to transmit according to the error rate on the link and the application requirements. Based on the acknowledgement history of the most recently transmitted packets, an adaptive algorithm is proposed to choose the most suitable Bluetooth data packet for transmission by maximizing the expected throughput efficiency of the ARQ protocol on the Bluetooth ACL data communication link. Simulation results indicate that this method works very well with a short observation history and also show the distinct performance of DM and DH data packet transmission. Keywords: Bluetooth, Piconet, ARQ, ACL, Throughput Efficiency.
1 Introduction
Bluetooth (BT) [1,2] is a short-range radio link intended to be a cable replacement between portable and/or fixed electronic devices. Two types of transmission links are used: SCO and ACL links. An SCO link is a symmetric point-to-point link supporting time-bounded voice traffic; SCO packets are transmitted over reserved intervals without being polled. An ACL link is a point-to-multipoint link between the master and all slaves in the piconet and can use all the remaining slots of the channel not used for SCO links. Bluetooth is a frequency hopping system which can support multiple communication channels in a common area (each channel is defined by a unique frequency hopping sequence). Frequency hopping is used in such a way that the radio is tuned to the same frequency for the entire duration of a packet, but changes to a different frequency each time it transmits a new packet or retransmits an erroneous packet. Since the fading and interference in the new frequency channel will be significantly different from those of the previous one, the use of frequency hopping with ARQ provides an effective method of diversity. Automatic Repeat Request (ARQ) protocols are designed to remove transmission errors from data communication systems. When used over relatively high bit-error rate (BER) links (e.g., $10^{-5}$ or higher) such as wireless or satellite links, their performance is
Supported by the Harbin Engineering University Foundation (HEUFT06015).
sensitive to the packet size used in the transmission. When too large a packet size is employed, there is an increased need for retransmissions, while too small a packet size is inefficient because of the fixed overhead required per packet. When an ARQ scheme is to be used at the link layer over a relatively high error-rate link, the packet size should be chosen based on the error rate [3]. The problem of optimal communication in Bluetooth has been investigated in several papers. In [4], a solution is proposed to enhance the Bluetooth link layer so that it makes use of channel state information and adopts the most suitable Bluetooth packet type to enhance TCP throughput. The throughput of the six Bluetooth ACL packet types that use ARQ is derived as a function of the channel symbol SNR in [5]; the optimal packet type can then be selected at different SNRs. Reference [6] provides algorithms to maximize the throughput under lossy transmission conditions in a piconet with one or more slaves by selecting the packet lengths optimally in accordance with the channel conditions at different frequencies. All these works concentrate on throughput under different channel conditions, such as the BER or SNR, which are not easy to know in advance. In this paper, we are concerned with choosing the optimal packet payload length on Bluetooth ACL data communication links, in terms of maximizing the throughput efficiency of the ARQ protocol based on the acknowledgement history of the most recently transmitted packets. That is, given the number of packets that required retransmission, an estimate of the channel BER is made, based on which a packet size is chosen that maximizes the expected throughput efficiency of the data link protocol.
2 Bluetooth Data Packets In Bluetooth, the data on the piconet channel is conveyed in packets. The general packet format is shown in Fig.1. Each packet consists of 3 entities: the access code, the header, and the payload. In Fig. 1, the number of bits per entity is indicated [1].
Fig. 1. Standard Packet Format
The access code and header are of fixed size: 72 bits and 54 bits respectively. The payload length can range from zero to a maximum of 2745 bits. Different packet types have been defined. Packets may consist of the (shortened) access code only, of the access code and header, or of the access code, header and payload. Data in Bluetooth can be transmitted asynchronously using ACL packets. In this paper, we mainly focus on the ACL data packets used in asynchronous connections. Seven ACL packet types are defined in Bluetooth. DM stands for Data-Medium rate, DH for Data-High rate. DM packets are all 2/3-FEC encoded to tolerate possible transmission errors. Not being FEC encoded, DH packets are more error-vulnerable, but they can carry more information.
3 Adaptive Packet Selection Algorithm
3.1 Throughput Efficiency of ARQ Protocol
A protocol's performance is usually characterized by many parameters which are defined by the communication system requirements. The most important parameters are the probability of receiving a message without errors and the protocol throughput efficiency. There are several definitions of the protocol throughput efficiency. Most frequently it is defined as the ratio of the mean number of information bits successfully accepted by the receiver to the number of bits that could have been transmitted during the same time interval [7]. We must therefore first derive an expression for the throughput efficiency of the ARQ protocol. The expressions derived in this section assume the use of an "optimal" ARQ protocol in which only packets containing errors are retransmitted. The throughput efficiency of an ARQ scheme that uses packets having $n$ bits, of which $k$ are information bits, is determined by [8]:

$$\eta = \left(\frac{k}{n}\right) / \bar{R} \tag{1}$$

where the first term $k/n$ represents the ratio of information bits to total bits in a packet, and $\bar{R}$ represents the average number of transmission attempts per packet. Assuming that the ARQ scheme retransmits a packet until the acknowledgement of a successful reception, the average number of attempts $\bar{R}$ needed to successfully transmit one packet is given by [4]:

$$\bar{R} = 1 \times (1-p) + 2 \times p(1-p) + 3 \times p^2(1-p) + \dots = \frac{1}{1-p} \tag{2}$$

where $p$ is the packet error rate. So, for a given $p$, the throughput efficiency of an ARQ scheme that uses packets having $n$ bits, of which $k$ are information bits, is given by:

$$\eta = \left(\frac{k}{n}\right)(1-p) \tag{3}$$
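Equations (1)–(3) translate directly into code; the helper below is only an illustrative transcription (splitting the packet into $k$ information bits out of $n$ total is taken as given), not part of any Bluetooth stack.

```cpp
// Average number of transmission attempts per packet, equation (2).
double meanAttempts(double p) {            // p = packet error rate, 0 <= p < 1
    return 1.0 / (1.0 - p);
}

// Throughput efficiency of the ARQ scheme, equations (1) and (3):
// k information bits out of n total bits, packet error rate p.
double throughputEfficiency(double k, double n, double p) {
    return (k / n) / meanAttempts(p);      // identical to (k / n) * (1 - p)
}
```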
3.2 Choosing Packet Size Via Maximizing the Expected Throughput Efficiency of ARQ Protocol
When a perfect retransmission algorithm is employed (a perfect retransmission algorithm is one that only retransmits packets that are in error and can continuously transmit new packets as long as no errors occur; the selective repeat protocol is an example of a perfect retransmission algorithm), the optimal packet size to be used by the data link protocol is given by [3]:

$$k_{opt} = \frac{-h \ln(1-b) - \sqrt{-4h \ln(1-b) + h^2 \ln^2(1-b)}}{2 \ln(1-b)} \tag{4}$$

where $b$ is the known channel BER and $h$ is the number of overhead bits per packet (these bits are used for control, error detection, and framing).
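Formula (4) can be evaluated directly. The function below is a small sketch of such an evaluation (it is not the code of [3]); the numbers in the comment follow from the formula itself.

```cpp
#include <cmath>

// Optimal information block size of equation (4) for a perfect retransmission
// protocol, given the channel BER b (0 < b < 1) and h overhead bits per packet.
double optimalPacketSize(double b, double h) {
    const double L = std::log(1.0 - b);                              // ln(1 - b) < 0
    return (-h * L - std::sqrt(h * h * L * L - 4.0 * h * L)) / (2.0 * L);
}

// Example: optimalPacketSize(1e-4, 126.0) is about 1060 bits, and the result
// shrinks as b grows, which is the trend shown in Fig. 2.
```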
When $h$ equals 126, the optimal packet size under different channel BERs given by formula (4) is displayed in Fig. 2. Fig. 2 shows that the optimal packet size decreases as the channel BER increases. This trend meets the real application requirement: if a much larger packet size were used while the channel BER is high, the efficiency of the protocol would drop dramatically. Therefore a much smaller packet size is efficient under a much higher channel BER, because a small packet has a low packet error rate. Conversely, a much larger packet size makes efficient use of the channel when the channel BER is low.
Fig. 2. Optimal packet size under different channel bit error rate
In Bluetooth, an ARQ mechanism is adopted in order to guarantee reliable transmission. That is, the receiving side sends back a special control frame as an acknowledgement or negative acknowledgement (ACK/NACK) to the sender. If a frame or an acknowledgement message is lost, a timeout signal is generated when the timer expires, reminding the other side that a problem has occurred and that the frame must be retransmitted. At the same time, the receiver must be capable of distinguishing between retransmitted and new frames. With the ARQ scheme in the Bluetooth specification, DM, DH and the data field of DV packets are transmitted and retransmitted until an acknowledgement of a successful reception is returned by the destination (or a timeout is exceeded). The acknowledgement information is included in the header of the return packet, so-called piggy-backing. To determine whether the payload is correct or not, a CRC code is added to the packet. The ARQ scheme only works on the payload of the packet (only payloads which have a CRC); the packet header and the voice payload are not protected by the ARQ. Depending on the packet retransmission record on the current link, an adaptive method to select the best packet for data transmission is proposed to improve the performance of the Bluetooth system. The basic idea behind this scheme is that a large packet has low overhead and is advantageous when the BER is relatively low, while a small packet has a low packet error rate and thus is advantageous when the BER is high. So, depending on the channel BER, every type of Bluetooth ACL data packet performs differently. Without any bit errors, the DH5 packet would give the best performance, since it carries the most information bits per unit time. However, as the BER increases, the packet error rate of DH5 increases faster than that of smaller packets. Thus, the problem is how to select a suitable packet to adapt to the current channel conditions. It is difficult to estimate the channel BER in a short time when the link is operating under error detection with retransmission [9], but we can acquire the packet retransmission history easily.
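The only channel feedback the adaptive scheme needs is $R$, the number of retransmissions among the last $M$ payload transmissions. A minimal way to maintain this record is a circular buffer of acknowledgement outcomes, as sketched below; this bookkeeping class is an illustration only and is not part of the Bluetooth baseband specification.

```cpp
#include <cstddef>
#include <vector>

// Sliding window over the outcomes of the last M transmitted payloads:
// 'true' means the packet had to be retransmitted (NACK or timeout).
class RetransmissionHistory {
public:
    explicit RetransmissionHistory(int M) : window_(M, false) {}

    void record(bool retransmitted) {
        count_ += (retransmitted ? 1 : 0) - (window_[next_] ? 1 : 0);
        window_[next_] = retransmitted;
        next_ = (next_ + 1) % window_.size();
    }

    int R() const { return count_; }                        // retransmissions in last M packets
    int M() const { return static_cast<int>(window_.size()); }

private:
    std::vector<bool> window_;
    std::size_t next_ = 0;
    int count_ = 0;
};
```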
This paper proposes a simple algorithm to choose the packet size such that the conditional efficiency of the protocol is maximized under different channel BERs, based on the packet transmission record. Suppose the BER is $b$ and $R$ is the number of retransmission requests out of the last $M$ packet transmissions; we average the above expression over all possible values of $b$ using the conditional distribution of $b$ given $R$ (assuming that $b$ is constant over the period of interest). The resulting expression is given by [3]:

$$\eta_R(k) = \int_0^1 \eta \, P(b \mid R) \, db \tag{5}$$

where $P(b \mid R)$ is the conditional probability of $b$ given that $R$ out of the last $M$ packets required retransmission. We now wish to choose the value of $k$ that maximizes $\eta_R$. To do so we must first express the conditional probability of $b$ given $R$, which can be written as:

$$P[b \mid R] = \frac{P[b, R]}{P[R]} = \frac{P[R \mid b] \, P[b]}{P[R]} \tag{6}$$

Solving for the above conditional probability requires knowledge of a prior distribution of $b$. In the absence of a prior, we assume a uniform prior, that is $P[b] = 1$. This approach, in essence, is the same as a maximum likelihood approach where a uniform prior is assumed, except that here we associate a cost function with the estimates of $b$. With this approach we get

$$P[R] = \int_0^1 P[R \mid b] \, P[b] \, db = \int_0^1 P[R \mid b] \, db \tag{7}$$

and so

$$P[b \mid R] = \frac{P[R \mid b]}{\int_0^1 P[R \mid b] \, db} \tag{8}$$

In wireless communication, we generally assume that the channel errors are independent and identically distributed; that is, bit errors on the channel are independent of each other and the error rate remains constant. Given the packet error rate $p$, the probability that $R$ packets contain errors and therefore require retransmission is the probability that $R$ out of $M$ packets are in error. Since packet errors are independent from packet to packet, this probability can be expressed according to the binomial distribution with parameter $p$; therefore $P[R \mid b]$ can be expressed as:

$$P[R \mid b] = \binom{M}{R} p^R (1-p)^{M-R} \tag{9}$$

The packet error rate $p$ for DH packets is:

$$p = 1 - (1-b)^k \tag{10}$$

Recall that DM packets are protected by a (15,10) Hamming code (encoded with a 2/3 block FEC), i.e., in every block 15 bits are used to encode 10 bits of data, which is capable of correcting one bit error per 15-bit code block; the payload is correctly decoded provided that all code blocks contain one or fewer errors. The packet error rate $p$ for DM packets can therefore be approximated as:
X. Li et al.
p = 1 − ((1 − b)15 + 15b(1 − b)14 ) k / 15
(11)
Hence, for DH packets, P[R|b] can be expressed as: k' R k '( M − R ) P[ R | b ] = ( M R )(1 − (1 − b ) ) (1 − b )
(12)
For DM packets, P[R|b] can be expressed as: 15 14 k '/ 15 R P[R | b] = (M ) ((1− b)15 +15b(1− b)14)k '(M −R) /15 R )(1 − ((1 − b) +15b(1 − b) )
where
k ' is the payload size used in the previous M transmissions.
Combining equations (5)–(13), we can get tively as: 1
η R (k ) = ∫ [ 0
k (1 − b ) k × n
ηR(k) for DH and DM packets respec-
( MR )(1 − (1 − b ) k ' ) R (1 − b ) 1
∫( 0
M R
k'
(M −R)
)(1 − (1 − b ) k ' ) R (1 − b ) k '( M − R ) db
]db
( RM )(1 − ((1 − b)15 + 15b(1 − b)14 )k' /15 ) R ((1 − b)15 + 15b(1 − b)14 )k' (M − R)/15 1
∫( 0
M R
(13)
)(1 − ((1 − b)15 + 15b(1 − b)14 )k' /15 )R ((1 − b)15 + 15b(1 − b)14 )k' (M − R)/15 db
η R (k) =
∫
1 0
[
(14)
]db
(15)
k((1 − b) 15 + 15b(1 − b) 14 ) k/15 × n
where n=k+126. It is now possible to choose the value of k, the payload length to be used in future transmissions, so that the throughput efficiency of the ARQ protocol is maximized. This can be done by choosing the value of k that maximizes equation (14) or (15) for a value of R that is equal to the number of retransmission requests that occurred during the previous M transmissions using the payload size k ' .
4 Simulation Results Usually, the solution way of the maximization problem for ηR(k) in equation (7) or (8) is difficult; However, for specific values of M, R and k ' equation (7) or (8) can be solved
numerically. An optimal value for k can now be found using numerical search algorithms. Since the numerical evaluation of this integral is very intensive, a comprehensive search for the optimal value of k is not practical. Instead, a restricted search using select values for k can be performed. Such a search, for example, can consider values of k that are a multiple of 100; thereby significantly reducing the complexity of the search. Such a restricted search has little impact on the performance of the protocol since values of k that are within 100 bits of the optimal block size should result in near-optimal performance. Here, analysis of equation (14) or (15) is taken using Matlab7.0, where k is always valued as the multiple of 100 and the pace of b is 0.000001. In fig. 3, we plot the optimal payload size when a history of 50 previously transmitted 1500 bit packets payload is considered. As can be seen from the fig. 3 (a), for DH
In Fig. 3, we plot the optimal payload size when a history of 50 previously transmitted packets with a payload length of 1500 bits is considered. As can be seen from Fig. 3 (a), for DH packet transmission, when the previous fifty transmissions resulted in no errors the payload length can be increased to the maximum of 2744 bits (the maximum throughput efficiency would be obtained at a payload length of 3200 bits). When one or two errors occurred, the payload length can be increased to 2100 and 1700 bits respectively. When three errors occurred, the payload length can be kept at 1500 bits, and when more than three errors occur the payload length is reduced. As depicted in Fig. 3 (b), for DM packet transmission, when the previous fifty transmissions resulted in no errors the payload length can be increased to the maximum of 2745 bits (the maximum throughput efficiency would be obtained at a payload length of 4400 bits). When one, two or three errors occurred, the payload length can be increased to 2500, 1900 and 1600 bits respectively, and when more than three errors occur the payload length is reduced.
Fig. 3. Optimal packet size selection based on retransmission history: (a) DH packets, (b) DM packets
Let $k$ be the optimal packet size chosen for a given value of $R$ out of the last $M$ packet transmissions. The efficiency of the ARQ protocol with that value of $k$ can be computed according to equation (3) combined with equation (10) or (11). It can then be averaged over the distribution of $R$ given $b$ to yield the performance of the ARQ scheme for a given value of $b$. Fig. 4 shows the mean throughput efficiency of ARQ for various values of $M$ and $b$, and a previous packet payload length of 1500 bits. As can be seen from the figure, whether for DH or DM packet transmission, good performance is obtained with a history of just 75000 payload bits (50 packets with a payload size of 1500 bits). When $b$ is higher than $10^{-5}$, more history is required to obtain a reasonable estimate of the throughput efficiency for DM packet transmission; for DH packet transmission the situation is quite different, and only a short history of packet transfers is required to obtain a high throughput efficiency. The throughput efficiency can also be computed according to equation (3) using the optimal packet size obtained from formula (4) under different channel BERs. For the previous fifty packet transmissions with a payload length of 1500 bits, the mean throughput efficiency of DH packets is calculated based on the selected optimal packet size for different numbers of retransmitted packets. Fig. 5 compares the mean throughput efficiency of DH packet transmission based on the retransmission history with the optimal packet transmission (opt) according to formula (4) for various $b$. As can be seen from
Fig. 4. Mean throughput efficiency of algorithm for various b
Fig. 5, both mean throughput efficiencies increase as $b$ decreases. The mean throughput efficiency of DH packets based on the retransmission history always lies below that of the optimal packet transmission (opt), but the difference is not large. Similarly, the mean throughput efficiency of DM packets is calculated based on the selected optimal packet size for different numbers of retransmitted packets. Fig. 6 compares the mean throughput efficiency of DM packet transmission based on the retransmission history with the optimal packet transmission (opt) according to formula (4) for various $b$. As can be seen from the figure, both mean throughput efficiencies increase as $b$ decreases. The mean throughput efficiency of Bluetooth DM packets based on the retransmission history levels off when $b$ is less than $10^{-3}$, approaching 0.7773. When $b$ is larger than $10^{-3}$, the mean throughput efficiency of DM packets based on the retransmission history is larger than that of the optimal packet transmission (opt). Hence, a larger throughput efficiency can be gained by using DM packet transmission with the ARQ protocol when the channel BER is high. For the previous fifty packet transmissions with a payload length of 1500 bits and the optimal packet size selected for different numbers of retransmitted packets, Fig. 7 compares the mean throughput efficiency of ARQ for DH and DM packet transmission for various $b$.
Fig. 5. Comparing mean throughput efficiency of DH and optimal packets transmission (opt)
Fig. 6. Comparing mean throughput efficiency of DM and optimal packets transmission (opt)
It is important to note that the performance of ARQ with DH packets is much more vulnerable when $b$ is high. That is, when $b$ is high, the choice between the DH and DM packet types can have a dramatic effect on the throughput efficiency, and DM packet transfer can produce a higher throughput efficiency than DH packets. When $b$ is low, the throughput efficiency varies only slightly with the bit error rate $b$ and the packet type (DH/DM). So in a high error rate environment it is better to use DM packets for data transmission, which accords with the capability of DM packets to tolerate a high transmission error rate. Conversely, in a low error rate environment it is better to use DH packets for data transmission, because, not being 2/3-FEC encoded, DH packets have a relatively higher data transfer rate than DM packets.
Fig. 7. Comparing mean throughput efficiency of DH and DM packets transmission
5 Conclusion
This paper introduces a method to select the optimal packet payload length used by the Bluetooth ACL data link layer. The throughput efficiency of the ARQ protocol is derived based on the retransmission history. Thus, given the packet transmission record, we can choose the packet size such that the expected throughput efficiency of the ARQ protocol is maximized under different channel BERs. Simulation results show that the method works very well even with a short observation history (50 packets with a payload size of 1500 bits, i.e. 75000 payload bits in total). In a high error rate environment it is
better to use DM packets for data transmission, while in a low error rate environment it is better to use DH packets for data transmission to improve the data transfer rate.
References
1. Bluetooth V1.1 Core Specifications, http://www.bluetooth.org
2. Haartsen, J.: The Bluetooth radio system. IEEE Personal Communications 7(1), 28–36 (2000)
3. Modiano, E.: An adaptive algorithm for optimizing the packet size used in wireless ARQ protocols. Wireless Networks 5, 279–286 (1999)
4. Chen, L.J., Kapoor, R., Sanadidi, M.Y., Gerla, M.: Enhancing Bluetooth TCP throughput via link layer packet adaptation. In: Proc. of the 2004 IEEE International Conference on Communications (ICC 2004), pp. 4012–4016. IEEE Press, Paris (2004)
5. Valenti, M.C., Robert, M., Reed, J.H.: On the throughput of Bluetooth data transmissions. In: Proc. of the IEEE Wireless Commun. and Networking Conf., pp. 119–123. IEEE Press, Orlando (2002)
6. Sarkar, S.: Optimal Communication in Bluetooth Piconets. IEEE Transactions on Vehicular Technology 54(2), 709–721 (2005)
7. Turin, W.: Throughput analysis of the Go-Back-N protocol in fading radio channels. IEEE Journal on Selected Areas in Communications 17(5), 881–887 (1999)
8. Pribylov, V.P., Chernetsky, G.A.: Throughput efficiency of automatic repeat request algorithm with selective reject in communication links with great signal propagation delay. In: Proc. of the 3rd IEEE-Russia Conference Microwave Electronics: Measurements, Identification (MEMIA 2001), pp. 202–205. IEEE Press, Novosibirsk (2001)
9. Jesung, J., Yujin, L., Yongsuk, K., Joong, S.M.: An adaptive segmentation scheme for the Bluetooth-based wireless channel. In: Proc. of the 10th International Conference on Computer Communications and Networks, pp. 440–445. IEEE Press, Scottsdale (2001)
High Performance Computer Simulations of Cardiac Electrical Function Based on High Resolution MRI Datasets
Michal Plotkowiak1, Blanca Rodriguez2, Gernot Plank3, Jürgen E. Schneider4, David Gavaghan1,2, Peter Kohl5, and Vicente Grau6
1 LSI Doctoral Training Centre, University of Oxford, UK [email protected]
2 Computing Laboratory, University of Oxford, UK
3 University of Graz, Austria
4 Department of Cardiovascular Medicine, University of Oxford, UK
5 Department of Physiology, Anatomy and Genetics, University of Oxford, UK
6 Department of Engineering Science and Oxford e-Research Centre, University of Oxford, UK [email protected]
Abstract. In this paper, we present a set of applications that allow performance of electrophysiological simulations on individualized models generated using high-resolution MRI data of rabbit hearts. For this purpose, we propose a pipeline consisting of: extraction of significant structures from the images, generation of meshes, and application of an electrophysiological solver. In order to make it as useful as possible, we impose several requirements on the development of the pipeline. It has to be fast, aiming towards real time in the future. As much as possible, it must use non-commercial, freely available software (mostly open source). In order to verify the methodology, a set of high resolution MRI images of a rabbit heart is investigated and tested; results are presented in this work.
1 Introduction
The heart is an electromechanical pump, whose function and efficiency are known to be intimately related to cardiac histoanatomy. A large number of anatomical and structural factors affect cardiac electromechanical activity, but their detailed influence is poorly understood. Computer simulations have demonstrated the ability to provide insight into the role of cardiac anatomy and structure in cardiac electromechanical function in health and disease [1],[2]. The most advanced cardiac models to date incorporate realistic geometry and fibre orientation [3]. However, for each animal species, only one example of cardiac anatomy is generally used, which obscures the effect of natural variability. In addition, cardiac tissue is generally represented as structurally homogeneous, and cardiac geometry is often overly simplified, as the endocardial structures for example are not represented in detail. This limits the utility of the computational models to
understand the role of heart structure and anatomy in cardiac electromechanical function. Recent advances in medical imaging techniques allow generation of high resolution images containing a wealth of information on the 3D cardiac anatomy and structure. Among all possible techniques, magnetic resonance imaging (MRI) is the most suitable for our purposes. MRI allows acquisition of images in vivo as well as in vitro, and can provide high-quality, high resolution datasets. Thus, MRI images of whole hearts can be used to obtain highly detailed, high resolution models of cardiac anatomy and structure. Figure 1 shows two anatomical MRI sections through two different rabbit hearts, obtained using an 11.7 T MRI system (500 MHz). The information that can be obtained from these high resolution MRI datasets can be used to build the next generation of cardiac computational models with a realistic and accurate representation of cardiac anatomy and structure. The goal of the present study is to develop and identify a set of methodologies to run computer simulations of cardiac electrical activity. The proposed heart models incorporate a detailed description of cardiac anatomy and structure based on high resolution MRI datasets.
Fig. 1. MRI slices of two rabbit hearts acquired at different resolutions and different levels of contraction, the left image at 26.4 x 26.4 x 24.4 μm, and the right image at 32 x 32 x 44 μm
2 Methodology
This section describes the techniques to go from the high-resolution MRI images to computer simulations of cardiac electrophysiological function. This involves the use of software developed specifically for this project, based on open-source libraries (for segmentation and surface generation), as well as available software packages (for mesh generation and the cardiac electrophysiology solver).
2.1 Data Acquisition
The MRI data were acquired using an 11.7 T (500 MHz) MRI system, consisting of a vertical magnet (bore size 123 mm; Magnex Scientific, Oxon, UK), a Bruker
Avance console (Bruker Medical, Ettlingen, Germany), and a shielded gradient system (548 mT/m, rise time 160 μs; Magnex Scientific, Oxon, UK). For MRI signal transmission/reception, dedicated quadrature-driven birdcage RF coils were used. Scanning was performed using a fast gradient echo technique for high-resolution gap-free 3D MRI. Images of coronary perfusion-fixed rabbit hearts, embedded in agarose, were acquired with an in-plane resolution of 26.4 μm x 26.4 μm, and an out-of-plane resolution of 24.4 μm. All the methodological details were described previously in [5].
2.2 Segmentation
Segmentation of the MRI datasets is the first step towards extracting anatomical information for incorporation into computational models of cardiac electrophysiology. Segmentation may be described as the process of labelling each voxel in a medical image to indicate its tissue type or to which anatomical structure it belongs. As a first step, the aim of this study was to segment high resolution MRI of rabbit hearts in order to generate an accurate description of the epicardial and endocardial structures. For this purpose an application was developed based on the Insight Toolkit libraries (ITK, www.itk.org). ITK is an open-source software system originally developed to support the Visible Human Project (www.nlm.nih.gov/research/visible/visible_human.html). In recent years it has become a standard tool for biomedical image segmentation and registration. Segmentation in the present study is performed using the fast marching method, one of the family of level set methods. Level Set Method. Level set methods [6] are a family of numerical techniques for tracking the evolution of contours and surfaces. The main idea is to embed the evolving contour/surface in a higher-dimensional function ψ. The contour C is represented as the zero level set of this function. In the context of image segmentation, level sets are generally used to evolve a contour/surface using the evolution of the higher-dimensional function. A large number of variations have been presented in the literature; a comprehensive review is beyond the scope of this paper. In the formulation introduced in [6], the level set function is generally initialized with a signed distance map to an initial surface, and then evolves guided by a speed function combining internal (generally related to contour regularity) and external (generally linked to image features) influences. We use a simplified level set method called the fast marching method [7]. In this method it is assumed that the speed function F > 0, which means that the front always moves forward. This has the advantage of having to consider each voxel only once, thus making the algorithm significantly faster. We chose this method both for its reduced computational requirements (necessary when dealing with large 3D datasets as in this case) and for convenience (an implementation is available in ITK). The position of the moving front can be characterized by calculating its arrival time T(x, y) as it crosses each point (x, y) on the plane or in space. Thresholding the arrival time provides a segmentation of the image.
In the current application, we initialize the contour using a set of automatically generated seeds. These are located using a set of heuristic rules involving the intensity of a block centered at each image voxel. A highly conservative threshold was used here to ensure that all the seeds belong to the object. The images were pre-processed using an anisotropic filter before application of the fast marching algorithm.
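For the fast marching step, a minimal ITK pipeline along the lines described above can be assembled as in the following sketch (edge-preserving smoothing, a gradient-based speed image, fast marching from seed points, and thresholding of the arrival time). It is an illustrative sketch only: the filter parameters, the sigmoid-based speed image and the whole-volume processing are assumptions, not the authors' implementation, which generates roughly 4000 seeds per slice with heuristic intensity rules.

```cpp
#include <cstddef>
#include <vector>
#include "itkImage.h"
#include "itkCurvatureAnisotropicDiffusionImageFilter.h"
#include "itkGradientMagnitudeRecursiveGaussianImageFilter.h"
#include "itkSigmoidImageFilter.h"
#include "itkFastMarchingImageFilter.h"
#include "itkBinaryThresholdImageFilter.h"

using ImageType = itk::Image<float, 3>;

// Fast-marching segmentation of one volume from a list of seed indices.
ImageType::Pointer segmentHeart(ImageType::Pointer input,
                                const std::vector<ImageType::IndexType>& seedIndices,
                                double arrivalTimeThreshold)
{
    // 1. Edge-preserving smoothing (the "anisotropic filter" mentioned above).
    auto smoother = itk::CurvatureAnisotropicDiffusionImageFilter<ImageType, ImageType>::New();
    smoother->SetInput(input);
    smoother->SetNumberOfIterations(5);
    smoother->SetTimeStep(0.04);          // below the 3D stability limit of 0.0625
    smoother->SetConductanceParameter(3.0);

    // 2. Speed image: low speed at edges, high speed in homogeneous tissue.
    auto gradient = itk::GradientMagnitudeRecursiveGaussianImageFilter<ImageType, ImageType>::New();
    gradient->SetInput(smoother->GetOutput());
    gradient->SetSigma(0.5);

    auto sigmoid = itk::SigmoidImageFilter<ImageType, ImageType>::New();
    sigmoid->SetInput(gradient->GetOutput());
    sigmoid->SetOutputMinimum(0.0);
    sigmoid->SetOutputMaximum(1.0);
    sigmoid->SetAlpha(-0.5);              // negative alpha slows the front at strong edges
    sigmoid->SetBeta(3.0);

    // 3. Fast marching from the seed points; the output is the arrival-time map T(x).
    using FastMarchingType = itk::FastMarchingImageFilter<ImageType, ImageType>;
    auto fastMarching = FastMarchingType::New();
    auto seeds = FastMarchingType::NodeContainer::New();
    seeds->Initialize();
    for (std::size_t i = 0; i < seedIndices.size(); ++i) {
        FastMarchingType::NodeType node;
        node.SetValue(0.0);               // seeds start at arrival time zero
        node.SetIndex(seedIndices[i]);
        seeds->InsertElement(static_cast<unsigned int>(i), node);
    }
    fastMarching->SetInput(sigmoid->GetOutput());
    fastMarching->SetTrialPoints(seeds);
    fastMarching->SetOutputSize(input->GetBufferedRegion().GetSize());
    fastMarching->SetStoppingValue(2.0 * arrivalTimeThreshold);

    // 4. Thresholding the arrival time yields the binary segmentation.
    auto threshold = itk::BinaryThresholdImageFilter<ImageType, ImageType>::New();
    threshold->SetInput(fastMarching->GetOutput());
    threshold->SetLowerThreshold(0.0);
    threshold->SetUpperThreshold(arrivalTimeThreshold);
    threshold->SetInsideValue(1);
    threshold->SetOutsideValue(0);
    threshold->Update();
    return threshold->GetOutput();
}
```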
2.3 Surface Generation
In order to generate surface data for further volume meshing and visualization, special algorithms were applied, based on the Visualization Toolkit (VTK) libraries. For a more detailed description refer to [8]. The proposed surface generation application contains three main elements: marching cubes, decimation and smoothing. Its role is not only to generate a spatial object from binary 2D data but also to prepare the structures of interest for finite element meshing. Marching Cubes. The marching cubes algorithm produces a triangular mesh by computing isosurfaces from volumetric data [9]. A cube is defined by the values at its eight vertices, corresponding to eight voxels in the original 3D image. When one or more vertices of a cube have values less than the specified isovalue, and one or more have values greater than this value, the cube contributes some component to the isosurface. By determining which edges of the cube are intersected by the isosurface, a triangular patch can be created. The final surface representation is obtained by connecting the patches from all cubes. Decimation. Marching cubes usually generates a large number of polygons, and so this number has to be reduced before generating a finite element mesh. We used the decimation algorithm from [10], available in VTK. Decimation is designed to reduce the total number of triangles in a mesh, while preserving the original topology [8]. The proposed decimation is an iterative process in which each point of a triangle mesh is visited. Three basic steps are carried out for each point. In the first one, the local geometry and topology in the neighbourhood of the point are classified. Next, the vertex is assigned to one of five possible categories: simple, boundary, complex, edge, or corner point. Finally, using a decimation criterion based on a local error measure, it is determined whether the point can be removed. If the criterion is satisfied, the point and the associated triangles are deleted and the resulting hole is re-triangulated. Smoothing. Mesh smoothing is a method of shifting the points of a mesh that can significantly improve its quality (i.e. appearance and shape) without modifying the mesh topology. Smoothing improves isosurfaces by removing surface noise. We have used Laplacian smoothing, a general smoothing technique that has been used successfully in other applications [8].
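A corresponding VTK sketch of the surface-generation chain (marching cubes, decimation, Laplacian smoothing, STL export) is given below; the reduction factor, iteration counts and the choice of vtkDecimatePro and vtkSmoothPolyDataFilter are illustrative assumptions rather than the exact classes and parameters used in the paper.

```cpp
#include <vtkSmartPointer.h>
#include <vtkImageData.h>
#include <vtkMarchingCubes.h>
#include <vtkDecimatePro.h>
#include <vtkSmoothPolyDataFilter.h>
#include <vtkSTLWriter.h>

// Iso-surface from the binary segmentation, roughly 50% triangle reduction,
// then Laplacian smoothing, written out as an STL surface for meshing.
void extractSurface(vtkImageData* segmentation, const char* stlFile)
{
    auto mc = vtkSmartPointer<vtkMarchingCubes>::New();
    mc->SetInputData(segmentation);
    mc->SetValue(0, 0.5);                      // iso-value between background (0) and tissue (1)

    auto decimate = vtkSmartPointer<vtkDecimatePro>::New();
    decimate->SetInputConnection(mc->GetOutputPort());
    decimate->SetTargetReduction(0.5);         // remove about half of the triangles
    decimate->PreserveTopologyOn();            // decimation must not change the topology

    auto smooth = vtkSmartPointer<vtkSmoothPolyDataFilter>::New();
    smooth->SetInputConnection(decimate->GetOutputPort());
    smooth->SetNumberOfIterations(30);         // Laplacian smoothing passes
    smooth->SetRelaxationFactor(0.1);

    auto writer = vtkSmartPointer<vtkSTLWriter>::New();
    writer->SetInputConnection(smooth->GetOutputPort());
    writer->SetFileName(stlFile);              // the STL output feeds the TetGen meshing step
    writer->Write();
}
```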
2.4 Finite Element Mesh
The majority of the leading cardiac electrophysiology simulators (such as the one used in this study and described in the next section) use tetrahedral finite element meshes as input. Therefore, from the model cardiac surfaces generated as described in the previous section, a tetrahedral finite element mesh was generated. First, a surface mesh composed of triangles or rectangles was generated, and then a volume mesh composed of tetrahedral elements was fitted into the cardiac volume. The most common unstructured meshing algorithms are Delaunay triangulation and advancing front methods. Many different automatic mesh generation tools are available; however, meshes generated in this way may contain poorly shaped or distorted elements that cause numerical problems. For instance, the size of the dihedral angles is very important: if they are too small, the condition number of the corresponding elemental matrices increases; if they are too large, the discretization error in the finite element solution increases. The meshing program used in this project, TetGen (http://tetgen.berlios.de/), performs Delaunay tetrahedralization using the algorithms presented in [11]. It also includes algorithms for quality control, e.g. Shewchuk's Delaunay refinement algorithm. This algorithm ensures that no tetrahedron in the generated mesh has a radius-edge ratio greater than 2.0. The reason for choosing TetGen is that it is freely available, contains state-of-the-art tetrahedralization algorithms, and offers good mesh quality control.
2.5 Electrophysiological Simulations
The finite element meshes were generated specifically to conduct simulations of ventricular electrophysiological activity, using one of the most advanced bidomain cardiac simulators available to date, namely the Cardiac Arrhythmias Research Package (CARP, http://carp.meduni-graz.at/). CARP uses computational techniques to solve the cardiac bidomain equations [12], defined as:

$$\nabla \cdot (\sigma_i \nabla V_m) + \nabla \cdot \left[ (\sigma_i + \sigma_e) \nabla \phi_e \right] = -I_{stim} \tag{1}$$
$$\nabla \cdot (\sigma_e \nabla \phi_e) = -I_{ion} - I_{stim} \tag{2}$$
$$V_m = \phi_i - \phi_e \tag{3}$$
where $\phi_i$ and $\phi_e$ are the intra- and extracellular potentials, $\sigma_i$ and $\sigma_e$ are the intra- and extracellular conductivity tensors, and $I_{ion}$ and $I_{stim}$ are the volume densities of the transmembrane and stimulus currents. In the bidomain model, membrane kinetics are represented by a system of ordinary differential equations that are used to compute the total transmembrane ionic current $I_{ion}$. CARP uses an expanded library of ionic models and plugins (augmentations) called LIMPET. As the animal species in this study is the rabbit, the Puglisi-Bers rabbit ionic model described in [13] was used, which consists of
a system of 17 ordinary differential equations (ODEs) that describe the electrophysiological behaviour of ion channel currents, pumps and exchangers in rabbit ventricular cells. Bidomain models are often used for studies of defibrillation, simulating the application of strong shocks to the heart, which requires representation of the extracellular space. However, simulations of cardiac propagation often use monodomain models, which can be obtained from the bidomain model by assuming $\phi_e = 0$. The detailed representation of cardiac structure incorporated in the meshes results in a large number of mesh nodes and therefore in computationally very demanding simulations, despite CARP's efficiency. Thus, the simulations required the use of high performance computing such as offered by the UK National Grid Service (NGS) (www.grid-support.ac.uk).
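To give a flavour of what a monodomain solver computes, the self-contained sketch below integrates a toy 1D monodomain cable with FitzHugh-Nagumo-type kinetics in place of the 17-variable Puglisi-Bers model, using explicit finite differences. It is a didactic stand-in only: the grid, time step and parameter values are illustrative assumptions and the code bears no relation to CARP's numerics.

```cpp
#include <cstdio>
#include <vector>

// Toy 1D monodomain cable, dv/dt = D * d2v/dx2 + Iion(v, w), integrated with
// explicit Euler; FitzHugh-Nagumo-type kinetics stand in for the ionic model.
int main() {
    const int    N  = 200;                                  // grid points
    const double dx = 0.025, dt = 0.01, D = 0.001;          // cm, ms, cm^2/ms (illustrative)
    const double a = 0.13, b = 0.013, c1 = 0.26, c2 = 0.1, d = 1.0;

    std::vector<double> v(N, 0.0), w(N, 0.0), vNew(N, 0.0);
    for (int i = 0; i < 10; ++i) v[i] = 1.0;                // stimulate one end ("apex")

    for (int step = 0; step <= 150000; ++step) {
        for (int i = 0; i < N; ++i) {
            const int il = (i == 0) ? 1 : i - 1;            // no-flux boundaries
            const int ir = (i == N - 1) ? N - 2 : i + 1;
            const double lap  = (v[il] - 2.0 * v[i] + v[ir]) / (dx * dx);
            const double iion = c1 * v[i] * (v[i] - a) * (1.0 - v[i]) - c2 * v[i] * w[i];
            vNew[i] = v[i] + dt * (D * lap + iion);
            w[i]   += dt * b * (v[i] - d * w[i]);           // recovery variable
        }
        v.swap(vNew);
        if (step % 25000 == 0)                              // trace the wave at the cable midpoint
            std::printf("t = %7.1f ms   v[mid] = %.3f\n", step * dt, v[N / 2]);
    }
    return 0;
}
```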
3 Results and Discussion
Here we present results for each step of the model development: segmentation, surface generation, finite element mesh, and electrophysiological simulations, using a high resolution MRI dataset of the rabbit heart.
3.1 Segmentation Results
MRI data were reconstructed and stored in the form of 1440 2D TIFF images with a resolution of 1024 x 1024 pixels. For our purposes, the TIFF images were down-sampled by a factor of 4 along each axis in order to speed up the segmentation process. The same segmentation method can be applied to the full resolution data; however, this would require the use of high performance computing capabilities such as the NGS, for which our segmentation software is not ready at this stage. As a first step in the model development process, we focused on segmentation of the endocardial and epicardial surfaces. Even though, in theory, the fast marching process could work from a single seed, in practice a reduced number of seeds was prone to produce leakage in areas where the gradient values were small. In our segmentation program, we generate about 4000 seed points for each MRI slice. The results of the segmentation of the rabbit heart are shown in Fig. 2.
3.2 Surface Generation Results
The marching cubes algorithm was used to generate 3D isosurfaces from the segmented 2D slices. Due to the large size of the data (about 200 MB), a decimation algorithm was applied. Parallel algorithms are being developed to allow handling of the full resolution dataset. Decimation allowed a reduction of the data size by about 50% while still maintaining a very detailed structure. Finally, smoothing was applied to improve the quality of the decimated isosurfaces. The final isosurfaces are presented in Fig. 3.
Fig. 2. Segmentation of MRI slices of rabbit heart. The black outline on the MRI slices shows the segmented boundaries of ventricular structure.
Fig. 3. Rabbit heart isosurfaces generated using marching cubes algorithm. Surfaces were decimated and smoothed using VTK functions.
3.3 Finite Element Mesh
As a result of the steps described above, we obtain an isosurface in STL format, fully compatible with the mesh generator TetGen (http://tetgen.berlios.de/) that we use here. In order to ensure a good mesh quality, i.e. a mesh where the radius-edge ratio is smaller than 2.0 for all tetrahedra and the maximum element volume is constrained, a set of TetGen switches was used. The final mesh, consisting of 828,476 nodes and 3,706,400 tetrahedral elements, is shown in Fig. 4.
3.4 Electrophysiological Simulations Results
The output tetrahedral mesh from TetGen had to be converted into a format compatible with the CARP solver. All simulations for the generated models were
Fig. 4. Different cuts through the finite element mesh for rabbit heart containing about 4 million tetrahedral elements
Fig. 5. Different stages of electrical propagation in the developed model. Transmembrane potential values at each epicardial node are shown using a grey scale, where black is resting potential and white is depolarized.
Fig. 6. Proposed model development pipeline illustrating the main applications, methods, file formats, and visualization programs used
carried out using a monodomain model. There is no special preparation needed for the bidomain calculation, however for the sake of obtaining a simple electrical propagation the monodomain mode is sufficient and computationally more efficient. Figure 5 shows transmembrane potential distribution on the epicardium at several time points during electrical propagation from apex to base following the application of the electrical stimulus.
4 Conclusions
The main contribution of this paper is the design and implementation of an application pipeline that uses high resolution MR images to create individualized, anatomically detailed heart models that are compatible with advanced cardiac electrophysiology simulators (such as CARP). Techniques such as the ones presented here, by generating models with unprecedented realism and level of detail, and introducing natural variability between individuals, have the potential to strengthen and broaden electrophysiological models as a fundamental tool in cardiac research. The main techniques applied in the presented model development are: segmentation, which uses the fast marching algorithm to extract ventricular geometry; surface generation, which uses the marching cubes algorithm and some surface processing (decimation and smoothing); and finite element mesh generation using Delaunay tetrahedralization. The main parts of the developed pipeline are presented in Fig. 6. The main objectives of this work were to demonstrate the feasibility of the process, and to propose a working prototype. Most of the methods used are amenable to improvement; for instance, more sophisticated segmentation algorithms will be needed in order to extract different anatomical structures, such as papillary muscles and blood vessels, and adaptive meshing techniques may be applied for creating more efficient finite element models. The electrophysiological model used here has some limitations, and in order to be useful in particular applications additional information, such as fibre orientation or cell type distribution, will have to be included. This can be obtained using additional segmentation techniques, or by including information from different image modalities such as histology. In addition, cardiac electro-mechanically coupled solvers are currently being developed that will allow simulation of cardiac electromechanical activity using the meshes developed through the proposed pipeline.
Acknowledgements This work was supported by a LSI DTC scholarship (to M.P.), an MRC Career Development Award (to B.R.), a Marie Curie Fellowship (to G.P.), and a BBSRC grant (BB E003443 to P.K.).
References 1. Kerckhoffs, R.C.P., et al.: Computational methods for cardiac electromechanics. Proc. IEEE 1994, 769–783 (2006) 2. Rodriguez, B., et al.: Differences between left and right ventricular geometry affect cardiac vulnerability to electric shocks. Circ. Res. 97, 168–175 (2005) 3. Vetter, F.J., McCulloch, A.D.: Three-dimensional analysis of regional cardiac function: A model of rabbit ventricular anatomy. Prog. Biophys. Mol. Biol. 69, 157–183 (1998) 4. Nielsen, P.M.F., et al.: Mathematical-model of geometry and fibrous structure of the heart. Amer. J. Physiol. 260, H1365–H1378 (1991)
5. Burton, R., et al.: Three-Dimensional Models of Individual Cardiac Histoanatomy: Tools and Challenges. Ann. NY Acad. Sci. 1080, 301–319 (2006) 6. Osher, S., Sethian, J.: Fronts Propagating with Curvature-Dependent Speed: Algorithms Based on Hamilton-Jacobi Formulations. Journal of Computational Physics 79, 12–49 (1988) 7. Sethian, J.: Level Set Methods and Fast Marching Methods. Cambridge University Press, Cambridge (2002) 8. Schroeder, W., et al.: The Visualization Toolkit, 3rd edn. Kitware Inc (2004) 9. Lorensen, W., Cline, H.: Marching Cubes: A High Resolution 3D Surface Construction Algorithm. Computer Graphics 21(3), 163–169 (1987) 10. Schroeder, W., et al.: Decimation of Triangle Meshes. Computer Graphics 26(2), 65–70 (1992) 11. Si, H., Gaertner, K.: Meshing Piecewise Linear Complexes by Constrained Delaunau Tetrahedralizations. In: Proceeding of the 14th International Meshing Roundtable (2005) 12. Vigmond, E.J., et al.: Solvers for the cardiac bidomain equations. Prog. Biophys. Mol. Biol. 33, 10 (2007) 13. Puglisi, J., Bers, D.: LabHEART: an interactive computer model of rabbit ventricular myocyte ion channels and Ca transport. Am. J Physiol. Cell. Physiol. 281(6), 2049–2060 (2001)
Statistical Modeling of Plume Exhausted from Herschel Small Nozzle with Baffle Gennady Markelov1 and Juergen Kroeker2 1
AOES Group BV, Pustbus 342, 2300 AH Leiden, The Netherlands [email protected] 2 EADS Astrium GmbH, Friedrichshafen, Germany
Abstract. Helium, constantly released on board the Herschel spacecraft, is used to cool three scientific instruments down to 0.3 K. The Helium is released by small nozzles, creating a counter-torque. This compensates the torque caused by the solar pressure acting on the spacecraft surfaces. An optimization of the nozzle shape alone could not avoid severe plume impingement on the spacecraft surfaces, and consequently the application of baffles has been considered to reduce plume impingement effects. Two baffle shapes, cylindrical and conical, with different radii and lengths have been analyzed numerically. The analysis has been performed with a kinetic approach, namely the direct simulation Monte Carlo (DSMC) method, to cope with the flow regime changing from continuum in the subsonic part of the nozzle through transitional to free-molecular flow inside the baffle. A direct application of DSMC-based software would require large computer resources to model the nozzle and plume flows simultaneously. Therefore, the computation was split into two successive computations for the nozzle and nozzle/plume flow. Computations of plume flow with and without baffles were performed to study the effects of baffle size and shape on the plume divergence and the plume impingement on the Herschel spacecraft. It has been shown that small baffles even widen the plume. An increase of the radius/length of the baffle decreases the plume divergence; however, even the largest baffle could not meet the requirements.
1 Introduction
The 'Herschel Space Observatory' is part of the fourth cornerstone mission in the 'Horizons 2000' program of the European Space Agency (ESA), with the objectives to study the formation of galaxies in the early universe and the creation of stars. In a dual launch together with Planck, Herschel will be placed in an operational Lissajous orbit around the Earth-Sun L2 point by an Ariane 5 in 2008 to perform photometer and spectrometer measurements, covering the full far infrared to sub-millimetre wavelength range from 60 to 670 micrometers during its operational lifetime of 3.5 years. The prime contractor for Herschel/Planck is ThalesAlenia Space in Cannes, France, while the Herschel Payload Module is developed, built and tested under responsibility of EADS Astrium Spacecrafts in Friedrichshafen, Germany.
The released Helium creates at the nozzles a counter-torque, which is used to compensate the torque caused by the solar pressure acting on the spacecraft. This counter-torque is partly neutralized by the Helium plume impingement on the spacecraft surfaces. To reduce the effect of the plume impingement, 95% of the total thrust has to be within a half-cone of 30 deg in a distance of 0.5 m from the nozzle. To achieve the goal the following investigations have been performed: – Optimization of the nozzle geometry to decrease a plume divergence, – Application of a baffle for further decrease of the divergence (the baffle shall be small and have a simple shape), – Definition of proper cant angle and nozzle locations if the design goal could not be achieved by the above activities. Plume exhausted from a nozzle in a hard vacuum is characterized by the presence of all flow regimes, from continuum in the nozzle and even in the plume near field through transitional to free-molecular flow at a large distance from the nozzle. Modelling of such flows requires a special approach, for example, a successive application of continuum and kinetic methods (see [1] and refs. herein). The given problem is complicated by the facts that 1) a transitional flow regime occurs inside the nozzle due to a small mass flow rate and 2) a baffle can affect the flow at the nozzle exit plane and, probably, even inside the supersonic part of the nozzle. This complicates the splitting of the computational domain into regions and the application of proper numerical methods modelling the flow inside the domains. This paper applies the direct simulation Monte Carlo (DSMC) method [2] to model plume flow with and without baffles, study effects of baffle size and shape on the plume divergence, and plume impingement on the Herschel spacecraft. This method is a computer simulation of movement and collisions of particles and it applies a statistical approach to perform the collisions. This is the most widely used numerical method in the area of rarefied gas dynamics and it was successfully applied to model plume flows and near continuum flows.
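For readers unfamiliar with DSMC, the core of the method is the stochastic collision step performed cell by cell. The sketch below shows a no-time-counter style collision routine for a hard-sphere gas of equal-mass particles (such as helium); it is a generic textbook-style illustration in the spirit of [2], not the SMILE implementation, and the function name, arguments and example numbers are ours.

```python
import numpy as np

def dsmc_collide_cell(vel, dt, cell_volume, f_num, d_ref, rng):
    """One DSMC collision step in a single cell.  `vel` is an (N, 3) array of
    particle velocities, `f_num` the number of real molecules represented by
    one simulated particle, `d_ref` the hard-sphere diameter."""
    n = len(vel)
    if n < 2:
        return 0
    sigma = np.pi * d_ref**2                        # hard-sphere cross-section
    cr_max = 2.0 * np.linalg.norm(vel - vel.mean(axis=0), axis=1).max() + 1e-12
    # no-time-counter estimate of the number of candidate pairs
    n_cand = int(0.5 * n * (n - 1) * f_num * sigma * cr_max * dt / cell_volume)
    n_coll = 0
    for _ in range(n_cand):
        i, j = rng.choice(n, size=2, replace=False)
        cr = np.linalg.norm(vel[i] - vel[j])
        if rng.random() < cr / cr_max:              # accept ~ relative speed
            # isotropic scattering of the relative velocity (equal masses)
            cm = 0.5 * (vel[i] + vel[j])
            cos_t = 2.0 * rng.random() - 1.0
            sin_t = np.sqrt(1.0 - cos_t**2)
            phi = 2.0 * np.pi * rng.random()
            rel = cr * np.array([cos_t, sin_t * np.cos(phi), sin_t * np.sin(phi)])
            vel[i], vel[j] = cm + 0.5 * rel, cm - 0.5 * rel
            n_coll += 1
    return n_coll

# usage with helium-like thermal speeds (all values illustrative)
rng = np.random.default_rng(0)
v = rng.normal(0.0, 300.0, size=(500, 3))
dsmc_collide_cell(v, dt=1e-6, cell_volume=1e-9, f_num=1e10, d_ref=2.3e-10, rng=rng)
```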
2 Nozzle and Baffle Geometries
Initially, the nozzle had a supersonic part with a half-angle of 15 deg and an exit diameter of 5.45 mm. The temperature of the helium at the nozzle inlet is 69 K and the mass flow rate is 1.1 mg/sec. The nozzle creates a rather wide plume with about 66% of the total thrust within a half-cone of 30 deg. An optimization of the nozzle shape increased the plume collimation and achieved a plume shape with 74% of the total thrust inside a half-cone of 30 deg. The optimal nozzle has a larger supersonic part: a half-cone angle of 32 deg and an exit diameter of 15 mm (Fig. 1 left). However, this nozzle does not meet the design goal, 95%, and the plume still impinges on the Herschel spacecraft surface. Figure 1 right shows the surface distribution of the torque created by the plumes. The plumes impinge mainly the SVM shield, the spacecraft body and the radiators, creating a torque acting on the spacecraft body. To decrease further the plume divergence, the
Fig. 1. Local Knudsen number flow-field for the optimal nozzle (left) and My torque distribution over the spacecraft surface (nozzles without baffles, values in N/m)
Fig. 2. Cylindrical (left) and conical (right) baffles
application of baffles has been considered. The baffles shall have a simple shape, either cylindrical or conical. Figure 2 shows the geometrical parameters of the baffle, where Lb is the axial length, Rb is the radius, and β is the baffle angle.
3 Numerical Approach
Modeling of plume flow is quite complex from a numerical viewpoint because it includes different flow regimes: a continuum regime in the nozzle and transitional and free-molecular regimes in the far plume field. For the Herschel small nozzle the transitional regime occurs already inside the nozzle due to the small mass flow rate. For example, values of the local Knudsen number are larger than the threshold value of 0.1, which defines the border between the continuum and transitional regimes (see Fig. 1 left). Therefore, the kinetic approach has to be applied already inside the nozzle. Computations were performed with the DSMC-based software SMILE [3]. The software has a 20-year history of development and has been thoroughly validated. A variable hard sphere model [2] was applied to model intermolecular
collisions, and diffuse reflection with complete energy accommodation was used as the gas/surface interaction model. To perform collisions between model particles, SMILE uses a Cartesian uniform grid. Each cell of the grid can be subdivided into smaller Cartesian cells to meet the method's requirements on the linear size of the collisional cell. This allows implementation of an efficient algorithm to trace the model particles. The uniform cells are used as a base for other algorithms: radial weights and parallel algorithms. A radial weight is assigned to each strip of cells along the X-axis to control the number of model particles and make their distribution more uniform in the radial direction. The parallel algorithm applies a static distribution of these cells between processors, and cells are distributed to the processors on a statistical basis. In this case all processors communicate with each other. However, this algorithm allows a good load balance for a small number of processors and, as a result, an efficient use of the parallel computer [4]. It is desirable to have the cells small enough to make these algorithms efficient. The plume has to be computed up to a distance of 0.5 meters from the nozzle, so collisional cells have to be small inside the nozzle and large in the plume far field. By adaptation of the uniform cells (subdivision into smaller collisional cells) the required flow resolution can be achieved in any place of the computational domain. However, even for a grid of 1000x1000 cells in the axial and radial directions, respectively, the nozzle occupies only two or three cells. This leads to very inefficient use of the software due to a large load imbalance over processors and requires large computer resources. To reduce the requirements on computer resources, the computation of the nozzle and plume flows has been split into the following two successive computations:
1. modeling of the flow inside the nozzle and in the vicinity of the nozzle exit,
2. modeling of the nozzle flow from the nozzle throat and the plume flow.
The two-step approach allows us to reduce the requirements significantly. An additional benefit is that the first computation has been done only once for all the baffle geometries, and it requires more computer resources than the second computation. Numerical solutions inside the supersonic part of the nozzle are exactly the same for both computations. In principle, the second computation could be started using an inflow boundary located downstream from the nozzle throat. In this case, a velocity distribution function has to be sampled along this boundary; otherwise, any simplification of the function, for example the application of an ellipsoidal Maxwellian, decreases the solution accuracy. For the first computation the efficiency of the computational cluster used is not less than 85% using up to 28 processors. However, the second computation is very inefficient, as the nozzle flow occupies only a few cells, which yields poor load balancing. For example, an increase in the number of processors from 2 to 8 yields a growth of the wall clock time required to perform the computation. An improvement has been achieved by redistributing the cells over processors. For the redistribution, the computation has been stopped at certain moments and the work load has been estimated for each cell by comparing the time spent
by each processor and the number of model particles in that cell. Then cells were grouped along the Y-axis to obtain an approximately equal work load over the groups. This redistribution also reduces the communication between processors, as each processor now has a closed sub-domain rather than loose cells scattered over the entire computational domain. After this cell redistribution the efficiency has been increased up to 80% using four processors.
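The redistribution described above — estimate the work load per cell, then group cell strips into contiguous blocks of roughly equal load — can be sketched as a simple greedy partitioning. The code below illustrates that idea only; SMILE's actual redistribution procedure is not published in this text.

```python
def group_cells_by_load(cell_loads, n_procs):
    """Greedily split an ordered list of per-cell work estimates (e.g. time
    spent or particle counts per cell strip) into `n_procs` contiguous groups
    of roughly equal total load."""
    total = sum(cell_loads)
    target = total / n_procs
    groups, current, acc = [], [], 0.0
    for idx, load in enumerate(cell_loads):
        current.append(idx)
        acc += load
        remaining_cells = len(cell_loads) - idx - 1
        remaining_groups = n_procs - len(groups) - 1
        # close the group once its load reaches the target, but keep enough
        # cells for the processors that still need a group
        if acc >= target and remaining_groups > 0 and remaining_cells >= remaining_groups:
            groups.append(current)
            current, acc = [], 0.0
    groups.append(current)          # the last processor takes the rest
    return groups

# usage: heavy load near the nozzle, light load in the far plume field
loads = [120, 90, 60, 30, 15, 8, 4, 2, 1, 1, 1, 1]
print(group_cells_by_load(loads, 4))
```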
4 Numerical Results
Computations of plume flow have been performed for the optimal nozzle geometry. This section uses the following parameters to describe the plume properties at a distance of 0.5 m from the nozzle: t30 is the fraction of the total thrust t within a half-angle of 30 deg; α95 is the half-angle which includes 95% of the total thrust; ṁ and ta are the mass flow rate and the thrust along the plume centerline.
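Once the angular distribution of thrust at 0.5 m has been sampled, the two divergence measures are straightforward to evaluate. The sketch below shows the computation on hypothetical data; the function and variable names are ours.

```python
import numpy as np

def plume_divergence_measures(angles_deg, thrust_per_bin):
    """Compute t30 (fraction of total thrust within a 30 deg half-angle) and
    alpha95 (half-angle containing 95% of total thrust) from a sampled
    angular thrust distribution."""
    order = np.argsort(angles_deg)
    ang = np.asarray(angles_deg, dtype=float)[order]
    cum = np.cumsum(np.asarray(thrust_per_bin, dtype=float)[order])
    total = cum[-1]
    t30 = cum[np.searchsorted(ang, 30.0, side="right") - 1] / total
    alpha95 = ang[np.searchsorted(cum, 0.95 * total)]
    return t30, alpha95

# illustrative distribution: most thrust close to the axis, long tail
ang = np.arange(1.0, 91.0, 1.0)
thr = np.exp(-ang / 18.0)
print(plume_divergence_measures(ang, thr))
```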
4.1 Cylindrical Tube
A geometry of the cylindrical tube is defined by two parameters, length and radius. Computations have been performed to study effects of both parameters on the plume divergence and the obtained results are shown in this section. Effect of the tube length. The cylindrical tubes with a length of Lb = 63, 100, and 250 mm have been considered. The tube diameter is calculated assuming that the tube trailing edge is defined by a half-cone angle of φ = 30 deg. A tip of the cone is near the beginning of the throat. Figures 3 and 4 show Mach number flow fields for all these tubes. Helium atoms reflect on the tube surface in accordance with diffuse reflection. Some of Helium atoms go back to the nozzle and disturb the plume near field flow. An application of the tube makes the plume more collimated in terms of t30 (Table 1). However, tubes with 63-100 mm length create a more divergent plume in terms of α95 parameter. Only an application of 250 mm tube leads to very collimated plume with t30 = 0.901, which is close to the design goal of the Herschel small nozzles. The influence of the tube on the plume structure is clearly seen using a density distribution at a distance of 0.5 m from the nozzle (Fig. 4 right). The tube creates a sudden drop in the density distribution at 30 deg half-cone angle and this drop is larger and sharper for larger tube lengths. Tubes with a length of 63-100 mm create lower density at angles less than 15 deg. The tube with the length of 250 mm provides higher density for angles up to 30 deg and lower density at larger angles comparing with the plume created by the bare nozzle. Due to an open left hand end of the tube, 0.1 mg/sec Helium is released in the opposite direction for the 250 mm tube. When the left hand end is closed, Helium flows only along X direction. However, the closed end causes a slight plume divergence, for example, 100 mm tube creates a wider plume as the open tube with 63 mm length (cf. Tables 1 and 2).
Fig. 3. Mach number flow field for tubes with a length of 63 mm (left) and 100 mm (right)
Fig. 4. Mach number flow field for a tube with a length of 250 mm (left) and density distribution at a distance of 0.5 m (right)
Table 1. The plume properties for tubes

length, mm   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
0            0.741   46.64      1.10        0.833    0.925
63           0.763   51.98      1.02        0.760    0.851
100          0.799   50.31      1.01        0.751    0.834
250          0.901   37.23      0.99        0.748    0.808

Table 2. The plume properties for tubes with closed left end

length, mm   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
100          0.763   51.76      1.09        0.801    0.898
250          0.877   38.86      1.09        0.807    0.877

Table 3. Effect of tube radius on plume properties

Lb, mm   Rb, mm   φ, deg   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
100      58.34    30       0.799   50.31      1.01        0.751    0.834
100      40.00    20       0.637   56.43      0.92        0.623    0.726
250      144.94   30       0.901   37.23      0.99        0.748    0.808
250      91.59    20       0.832   41.30      0.86        0.612    0.674

Effect of the tube radius. The effect has been investigated for tubes with a length of 100 mm and 250 mm. The radius of the 100 mm tube was decreased from 58.34 mm down to 40 mm. As a result, the trailing edge of the baffle lies on the half-cone angle of φ = 20 deg. From an intuitive viewpoint, this should increase the plume collimation. However, a small baffle leads to a large disturbance of the flow in the near plume field, where intermolecular collisions occur. As a result,
the tube with a radius of 40 mm creates a plume wider than the bare nozzle does (Table 3). Figure 5 shows that the density distribution for this plume does not have the significant drop at 20 deg, and the density in the core flow is lower than the corresponding values for the larger tube radius. A decrease of the tube radius for the 250 mm length also yields a wider plume. In this case the baffle surface is still far from the nozzle (cf. Figs. 4 left and 5 right) and the plume is more collimated than without the baffle. The decrease of the tube radius does not affect the density distribution for small angles. However, the density drop is not as big as it is for the larger tube radius and, as a result, the density is higher at angles larger than 30 deg.
4.2 Conical Baffle
Various conical baffles have been considered with half-cone angles of 5, 10, 15, and 20 deg. The length of the baffles is set to Lb = 100 mm. The trailing edge
Fig. 5. Density distribution at a distance of 0.5 m for tubes (left) and Mach number flow field for a tube with a length of 250 mm and smaller radius (right)
Fig. 6. Mach number flow field for a conical baffle with a length of 100 mm (left, 5 deg; right, 10 deg)
Fig. 7. Mach number flow field for a conical baffle with a length of 100 mm (left, 15 deg; right, 20 deg)
position is defined by a half-cone angle of 30 deg. Figures 6 and 7 show Mach number flow-fields for these baffles. An increase of the baffle angle decreases the plume collimation very slightly in terms of t30 (Table 4). The plausible reason is that a larger portion of the Helium flux is emitted along the X direction. At the same time the increase of β leads to a more collimated plume in terms of α95.
4.3 Conclusions
The Herschel spacecraft uses Helium to cool scientific instruments and to compensate the torque caused by the solar pressure acting on the spacecraft surface. The Helium is emitted through small nozzles and their design has to provide a minimum plume impingement. The flow in the nozzles and in the plume passes from continuum regime through transitional to free-molecular regime and only kinetic approach could handle such flows. The kinetic approach, namely, the direct simulation Monte Carlo method has been applied to perform a numerical analysis.
Table 4. Cone angle effect on plume properties

β, deg   t30     α95, deg   ṁ, mg/sec   ta, mN   t, mN
0        0.799   50.31      1.01        0.751    0.834
5        0.791   49.97      1.03        0.766    0.851
10       0.786   49.83      1.05        0.778    0.865
15       0.783   49.04      1.07        0.788    0.876
20       0.779   47.88      1.08        0.800    0.888
A straightforward application of DSMC-based software, SMILE, would have required enormous computer resources to model the nozzle and plume flows with the required accuracy. Consequently the nozzle and plume flow analyses have been split into two successive computations. In the first analysis the flow inside the entire nozzle has been computed, and subsequently a computation of the flow in the supersonic part of the nozzle and the plume flow has been performed. The second computation has used as boundary conditions the results of the first analysis. This allowed us to significantly reduce the requirements on computer resources while achieving the required accuracy. To decrease the plume divergence and, hence, the plume impingement on the spacecraft surface, the application of various baffle shapes, cylinders and cones, was investigated. It was shown that small baffles created an even wider plume than the bare nozzle. An increase of the radius/length of the cylindrical baffle decreases the plume divergence. The baffle with the largest length and radius showed the best performance, close to the requirement. The application of the conical baffle increases the plume collimation, but no significant effect of the half-cone angle was observed.
References 1. Markelov, G.N.: Plume Impingement Analysis for Aeolus Spacecraft and Gas/Surface Interaction Models. J. Spacecraft Rockets 3, 607–618 (2007) 2. Bird, G.A.: Molecular Gas Dynamics and the Direct Simulation of Gas Flows. Pergamon Press, Oxford (1994) 3. Ivanov, M.S., Markelov, G.N., Gimelshein, S.F.: Statistical Simulation of Reactive Rarefied Flows: Numerical Approach and Applications. AIAA Paper 98-2669 (1998) 4. Ivanov, M., Markelov, G., Taylor, S., Watts, J.: Parallel DSMC strategies for 3D computations. In: Schiano, P., Ecer, A., Periaux, J., Satofuka, N. (eds.) Parallel CFD 1996, pp. 485–492. North Holland, Amsterdam (1997)
An Individual-Based Model of Influenza in Nosocomial Environments Boon Som Ong1, Mark Chen2, Vernon Lee2, and Joc Cing Tay1,* 1 ROSS Scientific Pte Ltd Innovation Centre, Units 211-212, 16 Nanyang Drive Singapore 637722 * [email protected] 2 Department of Clinical Epidemiology, Tan Tock Seng Hospital Moulmein Road, Singapore 30843
Abstract. Traditional approaches in epidemiological modeling assume a fully mixed population with uniform contact rates. These assumptions are inaccurate in a real epidemic. We propose an agent-based and spatially explicit epidemiological model to simulate the spread of influenza for nosocomial environments with high heterogeneity in interactions and susceptibilities. A field survey was conducted to obtain the activity patterns of individuals in a ward of Tan Tock Seng Hospital in Singapore. The data collected supports modeling of social behaviors constrained by roles and physical locations so as to achieve a highly precise simulation of the ward's activity. Our results validate the long-standing belief that within the ward, influenza is typically transmitted through staff and less directly between patients, thereby emphasizing the importance of staff-oriented prophylaxis. The model predicts that outbreak size (and attack rate) will increase exponentially with increasing disease infectiousness beyond a certain threshold but eventually tapers due to a target-limited finite population. The latter constraint also gives rise to a peak in epidemic duration (at the threshold level of infectiousness) that decreases to a steady value for increasing infectiousness. Finally, the results show that the rate of increase in distinct cumulated contacts will be highest within the first 24 hours and gives the highest yield for contact tracing among patients that had prolonged periods of non-isolation. We conclude that agent-based models are a necessary and viable tool for validating epidemiological beliefs and for prediction of disease dynamics when local environmental and host factors are sufficiently heterogeneous. Keywords: Agent-based modeling, Spatially-explicit model, Epidemiology, Influenza, Contact patterns.
1 Introduction

During the SARS crisis, hospitals were found to be especially vulnerable to outbreaks [1-3]. Hospitals are also susceptible to nosocomial influenza, and rapid cross-infection between healthcare workers and patients can occur [4-6]. In spite of this, there has been little work in simulating the potential spread of infections in the hospital setting, with a granularity that allows policy makers and infection control
practitioners to explore the utility and potential impact of various hospital outbreak and infection control measures. We have therefore chosen 1) to base the geographic and spatial context of our epidemiological model, in which the disease outbreak takes place, on a hospital environment, and 2) to use agents whose behaviors are based on surveyed activity data of patients and healthcare workers. We have designed and developed a spatially explicit agent-based epidemiological simulation model called ASINE (which stands for Agent-based Simulator for Infections in the Nosocomial Environment). This model can simulate the dynamics of disease spread through person-to-person contact among the staff and patient population for a particular ward at Tan Tock Seng Hospital (TTSH) in Singapore. While hospital infections have traditionally been modeled using compartmental models, our use of a spatially explicit agent-based simulation is driven by the fact that, in a hospital environment: 1) individuals interact with each other locally, 2) individuals are mobile but may be restricted to certain areas, and 3) the individual environment is heterogeneous [7, 8]. Although individual-based models have been used to model the spread of community influenza [7-10], such an approach, to our knowledge, has not been applied to nosocomial influenza.
2 The Spread of Influenza within a Hospital Ward

As alluded to in our introduction, the primary motivation for our work arose out of the experience of nosocomial SARS outbreaks in 2003, and the threat which pandemic influenza may pose to the hospital environment. From this section onwards, we will refer primarily to influenza, as an example of an infectious disease which can potentially be spread in the hospital through staff and patient interactions. In a typical hospital environment, the main venues for human traffic are within the clinical wards; these were also the key locations where outbreaks were observed during the SARS epidemic in Singapore [1]. Influenza is predominantly spread from person to person, by droplet spray or by direct or indirect (e.g. via fomites) contact with nasopharyngeal secretions [6]. In our model, the geographic context is the spatial environment of CDC (Communicable Disease Centre, TTSH) ward 71.

2.1 Modeling the Environment

We implemented a Geographic Information System (GIS) as a data model for a two-dimensional schematic map that represents the environment of interest, in which individuals perform their social activities [11, 12]. The spatial environment only consists of location objects in specific positions with no explicit path information. Therefore, a graph is used to provide the navigational structure for agents to move within the ward [13, 14]. Each location can be thought of as a node in the graph, and an edge can then be added between two nodes to denote that a path exists between them. For each location object, we specify a Cartesian coordinate for its position and a rectangle of a certain width and height for its shape (say, for a bed) in the two-dimensional environment. The topology for the CDC ward was thereby approximated in this manner in accordance with our onsite inspection of the ward.
2.2 Modeling the Human Population at CDC Ward 71

The healthcare staff can be categorized as: doctors, nurses and health attendants. There is only one clerk and one cleaner. There are many types of patients, but we have categorized them into two types: ambulant and non-ambulant. Ambulant patients are allowed to move around the ward area, but a non-ambulant patient would be bedridden for the whole duration of his/her stay at the ward. The last type of agent is a visitor, who may visit the patients. In summary, the population at CDC ward 71 (which we modeled) comprises:
• 18 nurses, 6 for each shift.
• 3 health attendants, 1 for each shift.
• 4 doctors, who do their rounds at the ward daily.
• 1 ward clerk. There is only 1 shift for the ward clerk.
• 1 ward cleaner. There is only 1 shift for the ward cleaner.
• A number of patients (ambulant and non-ambulant) and visitors that can be parameterized during initialization.
2.3 Modeling Agent Behaviors There are two types of routines - standard and miscellaneous. Healthcare workers have standard routines to follow during a work shift. Nurses need to carry out tasks like taking parameters for patients and bed turnings for non-ambulant patients. Health attendants need to serve meals during meal timings and doctors usually make their rounds in the morning. These standard routines occur during certain times of the day and they must be carried out. Apart from these standard routines, different types of individuals each have a set of activities that may be performed. These activities are categorized as miscellaneous routines an agent performs. For instance, a ward clerk may only visit administrative areas like the doctor’s office or the nursing station. A visitor may only visit the patient’s room and nursing station, but is out of bounds to the staff room. By definition, the visitor may also choose not to visit a patient. Hence each patient can have 0 or more visitors during the visitation hour. Each individual agent performs such activities or actions probabilistically. The algorithm for the selection of an action is based on roulette wheel selection where each action is associated with a probability value that corresponds to its fitness. The fitter the action, the greater the chance the agent will perform this action. The social interactions of each individual are simulated on a daily basis with activity patterns obtained from a field survey. This field survey helps to derive the sets of routines mentioned previously that an agent has to carry out. The survey method was sample-based and purely observational. Movements of representative healthcare staff, patients and visitors was observed during an average work day, so as to establish the frequency, duration and intensity of contacts between healthcare staff, patients and their visitors. The ethics review committee of the National Healthcare Group, Singapore, was consulted, with approval obtained, to ensure that the conduct of the field survey respected privacy and ethical consults. For each location x visited by an individual y (upon observation of y during the survey), the probability of visiting x is calculated based on the observed frequencies of visit to x given by Nx divided by the total frequencies of visit to all locations by y,
given by N. For example, a total of 50 activities for a particular nurse were observed (during the survey). Out of these 50 activities, 10 of them are activities performed at the nursing station. So the probability of a nurse going to a nursing station is 1/5. The actual type of activities performed by each individual is not relevant. The duration an agent ai spends at location x is drawn from a normal distribution based on the frequencies of visit and the amount of time associated with each visit. 2.4 Modeling Disease Transmission Epidemics are usually described using a set of states; namely, susceptible (S), infected (E), infectious (I) and recovered/removed (R) [15]. Depending on the disease’s natural history, an epidemiological model can be described using the SEIR, SEIS, SIR, or SIS pattern as shown in Fig. 1. Fig. 1 illustrates a finite state machine (FSM) which describes the possible state transitions of an infectious disease. The possible set of state transitions that can be obtained from the FSM is SEIR, SEIS, SIR and SIS. The FSM allows us to model diseases of different natural history through alternative state transition routes.
Fig. 1. Possible state transitions of an infectious disease
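A minimal sketch of such a disease finite state machine is given below. The state names follow the text; the class itself, the uniform draws for the state durations and the `reinfectable` switch selecting the SEIS/SIS routes are illustrative assumptions, not ASINE's implementation.

```python
import random

class DiseaseFSM:
    """Per-host disease state machine: S -> E -> I -> R (or back to S)."""
    def __init__(self, infected_days, infectious_days, reinfectable=False):
        self.state = "S"
        self.timer = 0.0
        self.infected_days = infected_days      # (min, max) latent period
        self.infectious_days = infectious_days  # (min, max) infectious period
        self.reinfectable = reinfectable        # True selects SEIS/SIS routes

    def expose(self):
        if self.state == "S":
            self.state = "E"
            self.timer = random.uniform(*self.infected_days)

    def step(self, dt_days):
        if self.state in ("E", "I"):
            self.timer -= dt_days
            if self.timer <= 0.0:
                if self.state == "E":
                    self.state = "I"
                    self.timer = random.uniform(*self.infectious_days)
                else:
                    self.state = "S" if self.reinfectable else "R"

# influenza-like settings (infected 1-3 days, infectious 3-6 days)
host = DiseaseFSM(infected_days=(1, 3), infectious_days=(3, 6))
host.expose()
```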
To simulate the pathogenesis of a disease within a host and transmission between hosts, each individual is associated with a disease model that describes the health state of that individual (using the discrete states susceptible, infected, infectious and recovered). A disease is modeled as an agent and is responsible for transiting between epidemic states and performs computations for infecting other susceptible human agents. The joint preconditions for successful influenza transmission between two agents are 1) both agents must first collocate at a location in order for influenza to spread, and 2) the distance between infector agent x and infectee agent y must be within a certain radius. In a spacious room, therefore, the disease may not be transmitted so easily. In our model, the infection radius is defined as twice the size of the human agent. Transmission probability, β, is defined as the transmissibility of the infectious agent multiplied by the susceptibility of the susceptible agent, which is based on unit-time per contact with the following formula:

β1 = 1 − (1 − β)^(T/T1)     (1)
where T = Newtonian time and T1 = simulation time. Both transmissibility and susceptibility depend on the instantaneous health state of the individual. The latter is a random variable drawn from a normal distribution with a mean of 0.5 and standard deviation of 0.5. The range is between 0 (indicates severely
immune-suppressed) and 1 [8]. Therefore, a healthy person may be less susceptible to infection and less likely to transmit the disease. All human agents are initialized to be at a susceptible state. The infected and infectious states are both associated with a non-zero time period. When an infectious agent infects a susceptible agent, the susceptible agent will move to the infected state. After the infected period has expired, the agent will move to the infectious state. This is the state where the disease becomes contagious and transmission to other agents can occur at this state. After the infectious period has expired, the agent recovers. Mortality is currently not modeled.
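The unit-time conversion of formula (1) can be written as a one-line helper. The extracted text leaves the orientation of the exponent somewhat ambiguous; the sketch below uses the ratio (simulation time step)/(contact duration), the choice under which repeatedly applying the per-step probability over a whole contact recovers the per-contact probability β. All names and the numbers in the example are ours.

```python
def unit_time_probability(beta, contact_time, time_step):
    """Per-time-step transmission probability derived from the per-contact
    probability `beta`, in the spirit of formula (1)."""
    return 1.0 - (1.0 - beta) ** (time_step / contact_time)

# example: beta = 0.1 over a 60-minute contact, simulation step of 1 minute
p_step = unit_time_probability(0.1, contact_time=60.0, time_step=1.0)
# applying p_step sixty times reproduces the original per-contact probability
p_total = 1.0 - (1.0 - p_step) ** 60          # == 0.1 (up to rounding)
```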
3 Model Validation and Results

We designed several experiments with two aims in mind. The first experiment (A) aimed to validate our agent-based epidemiological model in terms of its ability to simulate the ward environment. The second and third experiments (B and C) are samples of the type of results we can obtain from this model, which may be useful in guiding control measures.

A) Contact patterns between individuals within the ward environment

Table 1. Contact patterns between individuals within the ward environment
Distinct contacts of index for 1 day, by individual type:

Index individual       Cleaner  Clerk  Doctor  Health Attendant  Nurse  Ambulant Patient  Non-ambulant Patient  Visitor  Total Contacts
Cleaner                0        0.94   0.04    1.76              10.12  0.4               0.15                  0.72     14.13
Clerk                  0.89     0      1.72    1.85              14.28  1.21              0                     3.96     23.91
Doctor                 0        0.37   2.83    0.2               5.62   7.85              14.65                 4.35     35.87
Health Attendant       1.27     0.64   0.67    0.21              8.21   0                 0.3                   2.06     13.36
Nurse                  0.6      0.85   1.19    1.36              12.89  4.06              5.65                  3.54     30.14
Ambulant Patient       0.05     0.16   3.45    0.4               8.18   0                 1.91                  2.36     16.51
Non-ambulant Patient   0.01     0      3.58    0.03              7.03   0                 0                     1.29     11.94
Visitor                0.04     0.09   0.4     0.19              1.86   0.57              0.57                  2.46     6.18
Table 1 shows the average number of distinct contacts (from 100 realizations) encountered by a putative index individual within the course of a 24-hour period. We see that the contact patterns in the ward which are generated by the model resemble what we expect of a ward environment. For example, the clerk has a high number of contacts mainly with the nursing staff, due to her central location at the nursing station; she has, however, minimal contact with patients. The doctors, on the other hand, have a high number of contacts with nurses, patients and visitors; they also have the highest number of contacts amongst all staff types. When we look at patients, we see that patients have fewer contacts overall, and that patients have few contacts with other patients; in particular, we see that non-ambulant patients are unlikely to have contact with other patients. The model thus affirms the current opinion that, in many nosocomial diseases, transmission is not occurring from direct patient-to-patient contact but through healthcare workers as vectors.
B) Outbreak size and path-length for an infectious disease with different transmission probabilities

We can also use our model to simulate other diseases which may, in the future, cause outbreaks spread by direct person-to-person contact. The set of parameters which have been used to describe the biology and natural history of influenza is shown in Table 2. In this set of experiments, we simulated the propagation of the entire outbreak (i.e. all generations of cases) and then calculated the simulated outbreak size, the total attack rate (including staff, patients and visitors), the epidemic duration, and the number of generations of cases.

Table 2. Disease parameters for influenza (adopted from [10])

Duration (in days)   Minimum   Maximum   Mean
Infected state       1         3         1.9
Infectious state     3         6         4.1
Again, the parameter describing the infectiousness of the disease is unknown, so we simulated a range of values for infectiousness (β), as shown in Fig. 2a. We see that, below a certain threshold value of infectiousness, the average outbreak size is very small; this is because, at these values, the index case on the average produces less than one other infectious case (R0 < 1), therefore no propagated transmission is possible. With higher levels of infectiousness, the outbreak size has a near linear association (on a logarithmic scale) with infectiousness. Further increase in infectiousness, however, only has a marginal effect on outbreak size since the ward environment is a finite population and almost all individuals who can be infected would have been infected; this is illustrated by Fig. 2b, which shows the attack rates approaching 100% at values of infectiousness exceeding 0.1. When looking at epidemic duration and the maximum number of generations within an outbreak, an interesting pattern emerges. At lower levels of infectiousness, epidemic duration increases with increasing infectiousness (Fig. 2c). This is because of the likelihood that the epidemic will generate successive generations of cases (Fig. 2d). However, with higher levels of infectiousness, epidemic durations decrease; this is because, when the average number of cases infected by an infectious patient increases, the finite number of individuals within the ward environment can be infected in fewer generations than at lower levels of infectiousness. C) Outcome and yield of contact tracing for simulated outbreaks with different infection parameters Fig. 3a and Fig. 3b simulate the dynamics of a commonly used intervention, that of contact tracing; the situation simulated is one where the ward environment is exposed to an infectious index case for a number of days before the case is identified and isolated. For Fig. 3a, we explore the number of distinct contacts that would be generated over the infectious period if the index were a patient, or any of the staff types shown in the picture. We see that staff have far more contacts than patients, and nonambulant patients have the least number of contacts. The number of distinct contacts
Fig. 2. Outbreak size and path length of influenza
Fig. 3. Outcome and yield of contact tracing
accumulated over time is interesting to note; the sharpest increase is between time zero and day 1; this is because, within the first 24 hours, an index case (in particular in the case of patients), would have met most of the individuals which he/she will ever meet within the ward environment. The result of this cumulative contact pattern translates into the patterns observed in Fig. 3b when we look at the yield of contact tracing, when the index case is a patient. If uniform infectiousness is assumed over time, then the cumulative contacts infected by an index case increases with the days that the patient is left without isolation at a faster rate than the increase in the number of contacts, with the result being that the yield of contact tracing is higher for cases who have not been isolated for a longer period, regardless of the level of infectiousness assumed.
4 Conclusions

We have designed and developed an individual-based epidemiological simulation model that can accurately simulate the spread of influenza within a ward at the Communicable Disease Centre of Tan Tock Seng Hospital in Singapore. As influenza is typically spread by droplet or direct person-to-person contact, the basic interactivity and social structure of the domain would be of paramount importance. As such, we undertook field surveys of the movement patterns of staff, patients and their visitors. These movements result from visitation patterns, bed and nurse-patient allocation
methods, and from miscellaneous activities such as visitations to various staff rooms, pantries, washrooms, or taking work breaks and mandatory activities like bed turning, taking of parameters, administration of medication and shift change. We also employed a two dimensional topology for the physical structure of the ward to constrain the navigational movement of individuals. Disease transmission was based on an individualized SEIR model without morbidity and mortality. Heterogeneity in individual health statuses and interaction patterns determine local transmissibility and form the measurable dynamics of disease spread for parameters such as epidemic duration and attack rate. The resulting individual-based, stochastic model of influenza spread within a constrained environment with finite population was validated with epidemiologists through a number of experiments. We established the long-standing belief that within the wards, influenza is typically transmitted through staff and less directly between patients, thereby emphasizing the importance of staff-oriented prophylaxis. Also, results show that outbreak size (and attack rate) increases exponentially with increasing disease infectiousness beyond a certain threshold but tapers eventually due to a target-limited finite population constraint. The latter constraint also gave rise to a peak in epidemic duration (at the threshold level of infectiousness) that decreases to a steady value for increasing infectiousness. Finally, we showed that the rate of increase in distinct cumulated contacts was highest within the first 24 hours and gave the highest yield for contact tracing among patients that had longer periods of non-isolation. Through this project, we showed that agent-based models are a necessary tool for validating epidemiological beliefs and for prediction of disease dynamics when local environmental and host factors are sufficiently heterogeneous. Acknowledgments. We would like to thank CDC, Tan Tock Seng Hospital of Singapore for providing needed data clearance and access to the ward and its staff. In particular, A/Prof. Leo Yee Sin, Dr. Angela Chow, Staff Nurse Quek Lee Kheng and student helpers Ms. Guo Zaiyi and Ms. Christine Ong.
References 1. Heng, B.H., Lim, S.W.: Epidemiology and control of SARS in Singapore. Epidemiological News Bulletin, Ministry of Health Singapore 29, 42–47 (2003) 2. Skowronski, D.M., Astell, C., Brunham, R.C., Low, D.E., Petric, M., Roper, R.L., Talbot, P.J., Tam, T., Babiuk, L.: Severe acute respiratory syndrome (SARS): a year in review. Annual Review of Medicine 56, 357–381 (2005) 3. Yu, I.T.S., Sung, J.J.Y.: The epidemiology of the outbreak of severe acute respiratory syndrome (SARS) in Hong Kong–what we do know and what we don’t. Epidemiology and Infection 132, 781–786 (2004) 4. Salgado, C.D., Farr, B.M., Hall, K.K., Hayden, F.G.: Influenza in the acute hospital setting. Lancet Infectious Diseases 2, 145–155 (2002) 5. Sartor, C., Zandotti, C., Romain, F., Jacomo, V., Simon, S., Atlan-Gepner, C., Sambuc, R., Vialettes, B., Drancourt, M.: Disruption of services in an internal medicine unit due to a nosocomial influenza outbreak. Infection control and hospital epidemiology 23, 615–619 (2002) 6. Stott, D.J., Kerr, G., Carman, W.F.: Nosocomial transmission of influenza. Occupational Medicine 52, 249–253 (2002)
7. Bian, L.: A conceptual framework for an individual-based spatially explicit epidemiological model. Environment and Planning B: Planning and Design 31, 381–395 (2004) 8. Dunham, J.B.: An Agent-Based Spatially Explicit Epidemiological Model in MASON. Journal of Artificial Societies and Social Simulation 9 (2005) 9. Ferguson, N.M., Cummings, D.A., Fraser, C., Cajka, J.C., Cooley, P.C., Burke, D.S.: Strategies for mitigating an influenza pandemic. Nature 442, 448–452 (2006) 10. Longini Jr., I.M., Halloran, M.E., Nizam, A., Yang, Y.: Containing Pandemic Influenza with Antiviral Agents. American Journal of Epidemiology 159, 623–633 (2004) 11. Crooks, A.T.: Exploring Cities using Agent-based Models and GIS. In: Proceedings of the Agent 2006 Conference on Social Agents: Results and Prospects, University of Chicago and Argonne National Laboratory, Chicago, IL (2006), http://www.agent2006.anl.gov/2006procpdf/Crooks_Agent_2006. pdf 12. Gonçavels, A.S., Rodrigues, A., Correia, L.: Multi-Agent Simulation Within Geographic Information Systems. In: Coelho, H., Espinasse, B. (eds.) Proceedings of 5th Workshop on Agent-Based Simulation (2004) 13. Buckland, M.: Programming game AI by example. Wordware Pub. (2005) 14. Smed, J., Hakonen, H.: Algorithms and Networking for Computer Games. Wiley, Chichester (2006) 15. Hethcote, H.W.: The Mathematics of Infectious Diseases. SIAM Review 42, 599–653 (2000)
Modeling Incompressible Fluids by Means of the SPH Method: Surface Tension and Viscosity Pawel Wróblewski1, Krzysztof Boryczko1, and Mariusz Kopeć2 1
Department of Computer Science, AGH University of Science and Technology, Kraków {pawel.wroblewski,boryczko}@agh.edu.pl 2 Faculty of Physics and Applied Computer Science, AGH University of Science and Technology, Kraków [email protected]
Abstract. Adaptations of the SPH method for simulating incompressible fluids, focusing on two features, surface tension and artificial viscosity, are presented in this article. The background and principles of the SPH method are explained and its application to incompressible fluid simulations is discussed. The methodology and implementation of artificial viscosity in the SPH method are presented. A modification for surface tension simulation, which relies on incorporating additional forces into the model, is suggested together with the corresponding methodology. New equations for artificial viscosity, able to simulate the flow of non-newtonian fluids, are also presented. The results obtained with the method are presented and discussed.
1 Introduction
A number of existing computer methods can be used for simulating phenomena from the real world. Many of these phenomena, very important in contemporary science and engineering, are related to fluid mechanics. These phenomena refer to different spatio-temporal scales, starting from the micro scale, through the meso scale, to the macro scale. Very interesting processes, e.g. turbulence, wall-fluid interactions and free surface behavior, take place in the domain between these scales. However, neither correct nor efficient methods have been available for this area yet. The SPH method is a very popular particle method for simulating processes in the macro scale [12]; it also seems to be possible to simulate phenomena from the domain between the macro and meso scales. However, despite many advantages, the proposed method also has several drawbacks. From the authors' point of view, one of the most awkward is the lack of the possibility to simulate surface tension. There are a few papers presenting modifications of the SPH method which remove this disadvantage [16][13]. However, the analysis of these modifications reveals that they are either deprived of a strong physical background or are too complicated for computer implementation. Another drawback of the SPH method is the problem with modeling a flow of non-newtonian fluids. This topic is almost not present in papers concerning SPH simulations [15] and there
is no straightforward scheme for obtaining a suitable model of non-newtonian viscosity, which opens this research area to new investigations. The background and principles of the SPH method are shown along with its variants, depending on the target application. Adaptations of this method to incompressible fluid simulations are also presented. The proposed modification of the SPH method enables modeling surface tension in several physical phenomena. The changes rely on additional forces acting between SPH particles as well as between SPH particles and the walls of a vessel. The methodology of adding the new forces is discussed. The implemented modified algorithm has been employed for simulating two phenomena: the behavior of a fluid drop in vacuum without gravity, and the capillary rise between two vertical, parallel plates inserted into a fluid. The second application of the SPH method presented in this paper is the simulation of a flow of a non-newtonian fluid. In order to achieve it we propose new equations for artificial viscosity, which are in fact a modification of Monaghan's artificial viscosity model. This modification consists of a change in the character of the viscosity's dependence on the interparticle velocity. In the proposed model this dependence is non-linear. The new model of viscosity was validated in the simulation of the flow of a viscous fluid in a long, cylindrical vessel. The non-newtonian character of the fluid manifested itself in the modified velocity profile for the modeled flow [4].
2 Smoothed Particle Hydrodynamics
The SPH method was created in order to simulate astrophysical phenomena [10][5]. The main idea of the method is a field approximation on a set of points in space. The hydrodynamical forces, corresponding to the Navier-Stokes equations, are calculated at these points (the SPH particles), and with such a background the equations of motion are solved. The approximation procedure uses a kernel function which vanishes at infinity and whose integral is equal to unity. From the theoretical point of view one could choose a Gaussian bell-shaped function; however, in practice it is common to use a spline function with compact support. The authors present results obtained by means of the kernel function proposed in [11]. The approximation procedure applies not only to the hydrodynamical forces, but also to other quantities referred to by the modeled phenomenon. The approximation equation for the density is presented below [11]:

ρi = Σj mj Wij ,     (1)
where mj is the mass of particle j, Wij = W(rij, hij) is the kernel function evaluated for particles i and j, rij is the distance between particles i and j, and hij is a smoothing length. The sum in the above equation runs over all particles in the system. In practice, however, if the support of the kernel function is compact, it is enough to count only those SPH particles for which the kernel function is non-zero. The character of the interparticle interactions is then short-range: there exists a cut-off radius rcut such that, for every pair of particles whose
distance is larger than r_{cut}, the force acting between them is equal to zero. For simulations with short-range interactions it is possible to use a structure of cubic cells, which greatly accelerates the calculations [2]. Every SPH particle undergoes the acceleration given by the formula:

\frac{d\mathbf{v}_i}{dt} = -\sum_j m_j \left( \frac{P_i}{\rho_i^2} + \frac{P_j}{\rho_j^2} + \Pi_{ij} \right) \nabla_i W_{ij} ,   (2)

where P_i is the pressure at point i, \rho_i is the density at point i and \Pi_{ij} is the viscosity part of the force. The full derivation of equation (2) can be found in [8]. Besides the force acting between fluid particles, it is also necessary to incorporate into the model forces acting between fluid particles and the walls. In this paper we use a wall consisting of particles, and the corresponding force is very similar to the one given in [12]. It is given by the formula:

\frac{d\mathbf{v}_i}{dt} = \sum_j \frac{c_0^2}{10} \, \frac{\Gamma(r_{ij}/r_{cut})}{r_{ij}} \, \frac{m_j}{m_i + m_j} ,   (3)

where

\Gamma(q) = \begin{cases} \frac{2}{3}, & \text{if } q < \frac{1}{3}, \\ 2(2q - 3q^2), & \text{if } \frac{1}{3} < q < \frac{1}{2}, \\ 2(1-q)^2, & \text{if } \frac{1}{2} < q < 1, \\ 0, & \text{otherwise.} \end{cases}   (4)
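A minimal sketch of the wall term (3)-(4) is given below. The particle record, its field names and the way neighbours are gathered are assumptions made for illustration, as is the choice of applying the force along the line from the wall particle to the fluid particle (the scalar form of Eq. (3) leaves the direction implicit). Only pairs with r_ij < r_cut contribute, which reflects the compact support discussed above.

```cpp
#include <cmath>
#include <vector>

// Hypothetical particle record; field names are illustrative only.
struct Particle {
    double x[3];   // position
    double a[3];   // accumulated acceleration
    double m;      // mass
    bool   isWall; // true for boundary particles
};

// Repulsion factor Gamma(q) from Eq. (4).
double gammaWall(double q) {
    if (q < 1.0 / 3.0) return 2.0 / 3.0;
    if (q < 0.5)       return 2.0 * (2.0 * q - 3.0 * q * q);
    if (q < 1.0)       return 2.0 * (1.0 - q) * (1.0 - q);
    return 0.0;
}

// Adds the fluid-wall acceleration of Eq. (3) to fluid particle pi,
// summing over wall particles within the cut-off radius.
void addWallForce(Particle& pi, const std::vector<Particle>& all,
                  double c0, double rcut) {
    for (const Particle& pj : all) {
        if (!pj.isWall) continue;
        double d[3], r2 = 0.0;
        for (int k = 0; k < 3; ++k) { d[k] = pi.x[k] - pj.x[k]; r2 += d[k] * d[k]; }
        double r = std::sqrt(r2);
        if (r <= 0.0 || r >= rcut) continue;            // outside compact support
        double mag = (c0 * c0 / 10.0) * gammaWall(r / rcut) / r
                     * pj.m / (pi.m + pj.m);            // Eq. (3) magnitude
        for (int k = 0; k < 3; ++k) pi.a[k] += mag * d[k] / r;  // push away from wall
    }
}
```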
3 Incompressible SPH
The standard SPH formula for the density (1) is useless when modeling fluids with a free surface. If this equation is applied, the density in the vicinity of the surface changes continuously from the value assumed for all particles down to zero over a distance of 2h, which clearly disagrees with experiments. In the case of such fluids another formula derived from the SPH approximation is used:

\frac{d\rho_i}{dt} = \sum_j m_j (\mathbf{v}_i - \mathbf{v}_j) \cdot \nabla_i W_{ij} ,   (5)
which evaluates only the rate of change of the density. Application of this equation requires the initialization of the density values at the beginning of the simulation. The incompressible character of the fluid is modeled by an appropriate equation of state, which is used for evaluating the pressure values in equation (2). The authors use the equation of state given by [12]:

P = \frac{\rho_0 c_0^2}{7} \left[ \left( \frac{\rho}{\rho_0} \right)^{7} - 1 \right] .   (6)
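The weakly compressible closure (6) translates into a one-line pressure update; the sketch below combines it with a single pairwise contribution to the density rate of Eq. (5). The argument names and the externally supplied kernel gradient are assumptions of this illustration.

```cpp
#include <cmath>

// Equation of state, Eq. (6): pressure from density.
double pressure(double rho, double rho0, double c0) {
    return rho0 * c0 * c0 / 7.0 * (std::pow(rho / rho0, 7.0) - 1.0);
}

// One pairwise contribution to d(rho_i)/dt from Eq. (5).
// gradW[] is the gradient of the kernel with respect to particle i,
// evaluated for the pair (i, j) by some kernel routine (not shown here).
double densityRateContribution(const double vi[3], const double vj[3],
                               double mj, const double gradW[3]) {
    double sum = 0.0;
    for (int k = 0; k < 3; ++k) sum += (vi[k] - vj[k]) * gradW[k];
    return mj * sum;
}
```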
When modeling incompressible fluids it is very problematic to choose a proper timestep. If the real value of the speed of sound c is applied, the timestep is too small for any practical application. Therefore it is convenient to use a value of c several orders of magnitude smaller than the real one. This approach accelerates the calculation significantly and does not influence the results [3].
4 Modeling Viscosity with SPH
In many simulations of fluid flow it is necessary to account for the transition of the kinetic energy of the fluid into its thermal energy. In the SPH method presented here, where the thermal energy of the fluid is not considered, it is necessary to incorporate the dissipation of the fluid energy by means of viscosity. Also, the everyday experience of viscosity in almost all real fluids demands incorporating the viscosity term \Pi_{ij} into equation (2). The most often used [8] model of artificial viscosity in SPH simulations is the one proposed by Monaghan [11]:
\Pi_{ij} = \begin{cases} \dfrac{-\alpha c_{ij} \mu_{ij} + \beta \mu_{ij}^2}{\rho_{ij}}, & \text{if } \mathbf{v}_{ij} \cdot \mathbf{r}_{ij} < 0, \\ 0, & \text{if } \mathbf{v}_{ij} \cdot \mathbf{r}_{ij} \ge 0, \end{cases}   (7)

where

\mu_{ij} = \frac{h \, \mathbf{v}_{ij} \cdot \mathbf{r}_{ij}}{r_{ij}^2 + \eta^2} .   (8)
Monaghan proposes to set \eta^2 = 0.01 h^2. In this model viscosity vanishes when \mathbf{v}_{ij} \cdot \mathbf{r}_{ij} \ge 0, which has an equivalent on the SPH interpolation level: \nabla \cdot \mathbf{v} \ge 0 [11]. Accordingly, viscosity is present only when two particles approach each other; in the opposite case the viscosity force is equal to 0. There are also other models of artificial viscosity used in SPH simulations, such as those proposed by Hernquist and Katz [6] or by Balsara [1]. A more detailed discussion of the appropriate choice of the artificial viscosity model is presented in [9]. In the simulations presented in this article the authors use the model proposed by Monaghan. Its advantages are simplicity and relatively low computational cost.
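A sketch of the pairwise term (7)-(8) is given below; the pair-averaged sound speed and density passed in for c_ij and rho_ij, as well as all argument names, are assumptions of this illustration rather than part of the original formulation.

```cpp
#include <cmath>

// Monaghan's artificial viscosity for a particle pair, Eqs. (7)-(8).
// vij[] = v_i - v_j, rij[] = r_i - r_j; cij and rhoij are pair-averaged
// sound speed and density (the averaging convention is assumed here).
double artificialViscosity(const double vij[3], const double rij[3],
                           double h, double alpha, double beta,
                           double cij, double rhoij) {
    double vdotr = 0.0, r2 = 0.0;
    for (int k = 0; k < 3; ++k) { vdotr += vij[k] * rij[k]; r2 += rij[k] * rij[k]; }
    if (vdotr >= 0.0) return 0.0;                 // particles receding: no viscosity
    double eta2 = 0.01 * h * h;                   // Monaghan's choice for eta^2
    double mu = h * vdotr / (r2 + eta2);          // Eq. (8)
    return (-alpha * cij * mu + beta * mu * mu) / rhoij;  // Eq. (7)
}
```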
5 The Proposed Modifications of the SPH Method for Simulating Surface Tension and the Non-Newtonian Character of a Fluid

5.1 Additional Forces for the Surface Tension Model
When modeling phenomena in which surface tension effects arise, one needs to incorporate two additional parts into the model. The first represents interactions acting between fluid particles and is responsible for modeling the surface tension of the fluid. The second is a modification of the fluid-wall interactions, which is responsible for properly modeling the wetting character of the fluid.
Fig. 1. The function of the new additional force
Fluid-fluid interactions. The surface tension is an effect of the mutual attraction of fluid molecules. It is impossible to simulate the exact inter-molecular interaction in the SPH method, because the scale of the method is much larger than the scale at which intermolecular forces are present. However, the main idea of the model is still the same and it is realized by incorporating additional attractive forces. When trying to find the form of this force, the authors found that it was very difficult to do so when the range of the new, additional attractive force was the same as the range of the SPH interactions. In this case the additional force modified the nature of the SPH force, and together they led to numerical artifacts; for example, the SPH particles tended to bind in pairs. This is why, following the advice given in [14], the authors move the range of action of the additional force beyond the SPH range; in fact it reaches twice as far as the SPH range. In this case the artifacts are not observed anymore, and the results are in good agreement with expectations. The form of the new, proposed force is given by the equation:

F_{ij} = -A \cdot W\!\left( \frac{3}{2} r_{cut} - r_{ij}, \; \frac{1}{4} r_{cut} \right) ,   (9)

where A is a positive constant and W is the kernel function. The plot of this new function is depicted in Fig. 1. The form of the proposed additional force is one of many possible. During tests with many different forms of this force the authors concluded that, in general, the form of the force does not influence the results as long as the values of the force are negative in the range [r_{cut}, 2 r_{cut}]. Therefore, the authors proposed the force given by (9), which is convenient to implement.

Fluid-wall interactions. Similarly, in order to model phenomena which concern hydrophilic or hydrophobic fluids, it is also necessary to incorporate additional attractive forces acting between fluid particles and wall particles. In the original SPH model the force acting between the walls and the fluid was purely repulsive. The authors incorporated additional attractive forces into the simulation. Their form is given by formula (9), i.e. it is exactly the same as in the case of fluid-fluid interactions, but with a different value of the constant A. The reasons why this particular form of the force was used are the same as in the case mentioned above.
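The attractive term (9) only needs the kernel evaluated with a shifted argument and a reduced smoothing length. In the sketch below the kernel W(r, h) is passed in as a callable so that whatever kernel is used elsewhere in the simulation can be reused; the function name and the explicit band check are assumptions of this illustration.

```cpp
#include <functional>

// Additional attractive force of Eq. (9), intended for the range [rcut, 2*rcut].
// The SPH kernel W(r, h) is supplied by the caller (e.g. the spline of [11]).
double surfaceTensionForce(double rij, double rcut, double A,
                           const std::function<double(double, double)>& W) {
    if (rij < rcut || rij > 2.0 * rcut) return 0.0;   // outside the attraction band
    return -A * W(1.5 * rcut - rij, 0.25 * rcut);      // negative value = attraction
}
```

The same routine can serve the fluid-wall interactions, simply called with a different value of A, as described above.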
5.2 The Modification of the Artificial Viscosity Model
Additionally, in order to propose a model capable of simulating flows of non-Newtonian fluids, we propose a modification of Monaghan's artificial viscosity. The modification relies on a change of equation (8) to:

\mu_{ij} = \frac{h \, \mathbf{v}_{ij} \cdot \mathbf{r}_{ij}}{r_{ij}^2 + \eta^2} \cdot \frac{\exp(v_{ij}) - 1}{v_{ij}} .   (10)
According to this change, the artificial viscosity acting between two particles depends nonlinearly on their mutual velocity, and in this way it is possible to obtain the non-Newtonian character of a fluid flow. This manifests itself in a change of the velocity profile of the flow, which now corresponds to a viscosity coefficient nonlinearly dependent on the shear rate [4]. By using the modification given by (10) the authors introduced a non-linear dependence of \mu_{ij} on v_{ij}. Equation (10) is only an example of such a modification and is intended to show the possibilities for further research in this area. The authors also tested several other equations and obtained similar results.
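For comparison with the sketch of Eqs. (7)-(8) above, the modified pair term of Eq. (10) differs only by an extra velocity-dependent factor; taking v_ij as the magnitude of the relative velocity is an assumption of this illustration.

```cpp
#include <cmath>

// Modified mu_ij of Eq. (10): the linear term of Eq. (8) multiplied by
// (exp(v) - 1) / v, where v is assumed to be |v_i - v_j|.
double muModified(const double vij[3], const double rij[3], double h, double eta2) {
    double vdotr = 0.0, r2 = 0.0, v2 = 0.0;
    for (int k = 0; k < 3; ++k) {
        vdotr += vij[k] * rij[k];
        r2    += rij[k] * rij[k];
        v2    += vij[k] * vij[k];
    }
    double muLinear = h * vdotr / (r2 + eta2);                     // Eq. (8)
    double v = std::sqrt(v2);
    double factor = (v > 1e-12) ? (std::exp(v) - 1.0) / v : 1.0;   // -> 1 as v -> 0
    return muLinear * factor;                                      // Eq. (10)
}
```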
6 Results
The modifications presented above have been tested in simulations of three different fluid phenomena.

6.1 Fluid Drop Oscillations without Gravity
The first phenomenon used for validating the form of the additional attractive forces was the behavior of a fluid drop in vacuum. We took a well-equilibrated circular drop and transformed it into an ellipsoid with the transformation [14]:

\begin{pmatrix} x' \\ y' \end{pmatrix} = \frac{2r}{\sin\varphi} \begin{pmatrix} \sin(\varphi/2) \sin u \\ \cos(\varphi/2) \cos u \end{pmatrix} \operatorname{sgn}(y) ,   (11)

where r = \sqrt{x^2 + y^2}, \varphi = 0.63\pi and u = \arctan(x/y). The z coordinate of every particle remained unchanged. Then we examined the relaxation time of the drop, which should depend on the surface tension of the fluid. If there is no artificial viscosity, the drop deformed with the above formula oscillates about the equilibrium state, which is an ideal sphere. However, when the artificial viscosity is present, it is more convenient to examine the relaxation time. A sample relaxation scheme is depicted in Fig. 2. The authors ran four different simulations, for four different values of the parameter corresponding to the surface tension of the fluid. The obtained results are presented in Fig. 3. The relaxation time should depend on the surface tension coefficient \gamma as \sim \gamma^{-1/2} [14]. The plot presented in Fig. 3 shows that the results from the simulation are, at least qualitatively, in good agreement with this dependence.
Fig. 2. A relaxation-oscillation of a fluid drop in vacuum
Fig. 3. Relaxation time versus surface tension
6.2 Capillary Rise of the Liquid
The second phenomenon used for validating the surface tension model within the SPH method is the capillary rise of the modeled liquid. The effect in this phenomenon depends on the difference between the attractive forces of the fluid-fluid interactions and of the fluid-wall interactions. If the fluid-fluid attractive forces are stronger than the fluid-wall interactions, one should expect to see a convex meniscus (negative capillary rise). In the opposite case, a concave meniscus should be visible (positive capillary rise). This is what the authors obtained from the simulations. For fluid-fluid attractive forces stronger than the fluid-wall interactions, the results are as presented in Fig. 4a. On the other hand, when the fluid-wall interactions are stronger than the fluid-fluid ones, the obtained results are as in Fig. 4b. Both simulated phenomena show that the proposed modification of the SPH method properly models the surface tension effects.

6.3 The Fluid Flow in an Elongated Vessel
The next phenomenon modeled by means of the SPH method is the fluid flow in an elongated vessel. At the beginning all fluid particles filling the vessel were at rest. Then we applied an initial velocity to all of them and continued the simulation until the flow stopped. The flow was decelerated by the forces acting between the
Fig. 4. Two menisci obtained from the simulations: a) convex meniscus, b) concave meniscus
Fig. 5. a) Velocity distribution. b) Velocity profiles for two different viscosity models.
wall and fluid particles. The wall-fluid interactions were modeled by means of the DPD method [7], with the Brownian part of the force omitted; it was the dissipative part of the DPD model that decelerated the flow of the fluid. A sample velocity distribution, with velocities marked by color, is presented in Fig. 5a. The authors ran several such simulations, each with a different artificial viscosity model. When comparing Monaghan's model with the modified artificial viscosity, it seems that our modification can be treated as a starting point for further research in the area of non-Newtonian fluid flows. This can be inferred from the analysis of the velocity profiles from the executed simulations. The velocity profiles for the two models, Monaghan's model and the one given by equation (10), are presented in Fig. 5b. The obtained profile is more flattened than the one for Monaghan's model, which meets the expectations [4].

6.4 The Parallel Implementation of the Model
The SPH method, along with the presented modifications, was implemented in parallel by means of the OpenMP environment. The simulation of a fluid flow was
Fig. 6. The relative efficiency of the OpenMP implementation
run for several values of the number of processors. The execution time of a single simulation step was measured, and the relative efficiency was evaluated. The results are depicted in Fig. 6. Similar values of the relative efficiency were obtained for the simulation of the SPH method with the modifications for surface tension. In this case, however, the execution time of a single simulation step was larger, since the interparticle interaction range was doubled.
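A minimal sketch of how such an OpenMP parallelization of the per-particle force loop and the timing of a single step might look is given below; the particle container, the pairwise force routine and the loop schedule are assumptions of this illustration, not the authors' actual code.

```cpp
#include <omp.h>
#include <cstdio>
#include <vector>

struct Particle { double x[3], v[3], a[3], m; };  // illustrative layout

// Placeholder for the pairwise SPH interaction (pressure, viscosity, ...).
void accumulatePairForce(Particle& pi, const Particle& pj) { /* ... */ }

double simulationStep(std::vector<Particle>& p) {
    double t0 = omp_get_wtime();
    // Each thread owns a disjoint range of particles i; writes go only to p[i].
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < static_cast<long>(p.size()); ++i) {
        for (std::size_t j = 0; j < p.size(); ++j)
            if (static_cast<std::size_t>(i) != j) accumulatePairForce(p[i], p[j]);
    }
    return omp_get_wtime() - t0;   // wall-clock time of one step
}

int main() {
    std::vector<Particle> particles(1000);
    double t = simulationStep(particles);
    // Relative efficiency = (t_serial / t_parallel) / n_threads, measured over runs.
    std::printf("step time: %f s on %d threads\n", t, omp_get_max_threads());
}
```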
7 Conclusions and Future Work
The results presented in the paper show that the proposed form of the additional forces in the SPH method properly models the surface tension of the simulated fluid. These forces were validated in two simulated phenomena, and satisfactory results were obtained. Also, the simulation of the fluid flow with the modified artificial viscosity indicates that the proposed improvements allow for reasonable modeling of non-Newtonian fluids. Such simulations will find various applications in science and engineering, for example in modeling the properties of blood flows [4]. However, this method still needs additional work and a more detailed derivation of its equations. The presented results qualitatively confirm the correctness of the modified method. However, it is still a big challenge to validate the new forces quantitatively, as no analytical relation between physical quantities such as viscosity or surface tension and the simulation parameters is yet known.

Acknowledgments. This research is financed by the Polish Ministry of Science and Higher Education, Project No. 3 T11F 010 30.
References
1. Balsara, D.: Von Neumann stability analysis of smoothed particle hydrodynamics - Suggestions for optimal algorithms. J. Comput. Phys. 121, 357 (1995)
2. Boryczko, K., Dzwinel, W., Yuen, D.: Parallel implementation of the fluid particle model for simulating complex fluids in the mesoscale. Concurrency and Computation: Practice and Experience 14, 137–161 (2002)
3. Colagrossi, A., Landrini, M.: Numerical simulation of interfacial flows by smoothed particle hydrodynamics. J. Comput. Phys. 191, 448–475 (2003)
4. Gijsen, F., Vosse, F., Janssen, J.: The influence of the non-newtonian properties of blood on the flow in large arteries: steady flow in a carotid bifurcation model. Journal of Biomechanics 32, 601–608 (1999)
5. Gingold, R.A., Monaghan, J.J.: Smoothed particle hydrodynamics - Theory and application to non-spherical stars. Mon. Not. R. Astr. Soc. 181 (1977)
6. Hernquist, L., Katz, N.: TREESPH: A unification of SPH with the hierarchical tree method. The Astrophysical Journal Supplement Series 70, 419–446 (1989)
7. Hoogerbrugge, P.J., Koelman, J.: Simulating microscopic hydrodynamic phenomena with dissipative particle dynamics. Europhys. Lett. 19, 155–160 (1992)
8. Liu, G.R., Liu, M.B.: Smoothed particle hydrodynamics: a meshfree particle method. World Scientific, Singapore (2003)
9. Lombardi, J., Alison, S., Rasio, F., Shapiro, S.: Tests of Spurious Transport in Smoothed Particle Hydrodynamics. Journal of Computational Physics 152, 687–735 (1999)
10. Lucy, L.B.: A numerical approach to the testing of the fission hypothesis. Astron. J. 82, 1013–1024 (1977)
11. Monaghan, J.J.: Smoothed Particle Hydrodynamics. Annu. Rev. Astron. Astrophys. 30, 543–574 (1992)
12. Monaghan, J.J.: Smoothed Particle Hydrodynamics. Rep. Prog. Phys. 68, 1703–1759 (2005)
13. Morris, J.P.: Simulating surface tension with smoothed particle hydrodynamics. Int. J. Numer. Meth. Fluids 33, 333–353 (2000)
14. Nugent, S., Posch, H.A.: Liquid drops and surface tension with smoothed particle applied mechanics. Phys. Rev. E 62, 4968–4975 (2000)
15. Shao, S., Lo, E.Y.M.: Incompressible SPH method for simulating Newtonian and non-Newtonian flows with a free surface. Advances in Water Resources 26, 787–800 (2003)
16. Tartakovsky, A., Meakin, P.: Modeling of surface tension and contact angles with smoothed particle hydrodynamics. Phys. Rev. E 72, 02630 (2005)
17. Wróblewski, P., Boryczko, K., Kopeć, M.: SPH - a comparison of neighbor search methods based on constant number of neighbors and constant cut-off radius. TASK Quart. 11, 275–285 (2007)
Optimal Experimental Design in the Modelling of Pattern Formation

Adrián López García de Lomana, Àlex Gómez-Garrido, David Sportouch, and Jordi Villà-Freixa

Grup de Recerca en Informàtica Biomèdica, IMIM-Universitat Pompeu Fabra, C/Doctor Aiguader, 88, 08003 Barcelona, Catalunya, Spain
{adrianlopezgarciadelomana,david.sportouch}@gmail.com, {agomez,jvilla}@imim.es
http://cbbl.imim.es
Abstract. Gene regulation plays a major role in the control of developmental processes. Pattern formation, for example, is thought to be regulated by a limited number of genes translated into transcription factors that control the differential expression of other genes in different cells of a given tissue. We focused on the Notch pathway during the formation of chess-like patterns in development. Simplified models exist of the patterning by lateral inhibition due to the Notch-Delta signalling cascade. We show here how parameters from the literature are able to explain the steady-state behavior of model tissues of several sizes, although they are not able to reproduce time series of experiments. In order to refine the parameter set for data from real experiments we propose a practical implementation of an optimal experimental design protocol that combines parameter estimation tools with sensitivity analysis, in order to minimize the number of additional experiments to perform. Keywords: lateral inhibition, GRN, optimal experimental design, multicellular system.
1 Introduction
One of the most breathtaking processes in biology is the development of a complex creature. In a matter of just a day (a fly maggot), a few weeks (a mouse) or several months (ourselves), an egg grows into millions, billions, or, in the case of humans, 10 trillion cells formed into organs, tissues and parts of the body. So, the main question in developmental biology is to understand how cells arising from the division of a single cell become different from each other. The complexity of the process of pattern formation in developmental biology has been dealt with by a number of researchers in the last decades (for reviews see [1]), both topologically, studying the different genes involved in the process and their relationships, and dynamically, measuring and modeling the temporal behavior of those genes and their products. Different simulation methods have been applied to dynamical models of patterning, involving both ordinary (ODE) [2]
and partial (PDE) differential equations, as well as discrete representations of the cells as cellular automata, among others [3]. Initial models of pattern formation were based on simple assumptions that were able to capture most of the relevant information for a given general question. Thus, it is worth noting the efforts of Meinhardt [4] and others [5] to unravel the general rules governing the formation of complex patterns during embryo development by using simple yet sound mathematical models. At times, high-throughput studies can also be performed in order to obtain time-dependent qualitative information on the topology of the GRN. This type of information can be processed by probability and statistical inference tools that complement the verbal models defined by the experimentalists and provide a first formal model of the network. However, if one is able to quantify the dynamical information about the expression levels of different genes, even at the level of a few key genes, by, for example, real-time polymerase chain reaction (RT-PCR) experiments, global optimization protocols can be used to refine the parameters that describe the dynamics of the model. In typical situations the modeller asks for experimental data that is scarce, of low quality and, more importantly in most cases, difficult to obtain. How to maximize the outcome from limited resources is the aim of this paper. Here we present the implementation of a practical optimal experimental design pipeline in a theory/experiment integrated fashion. We demonstrate the utility of the protocol in the parameter estimation for one of the simplest models of pattern formation in biology, namely the Notch-Delta pathway for lateral inhibition (LI). To demonstrate the implementation of the method, we work with fictitious RT-PCR experimental data obtained from known models of the Notch-Delta interaction, as the LI model operates between partner cells in a tissue, which poses an extra challenge for experimental manipulation. However, the proposed protocol is completely general for an RT-PCR experimental setting in any biological system that suits this technique.
2 Methods

2.1 Problem Statement
As outlined in the introduction, dynamical biological systems can be described by a large variety of mathematical models. Here we will restrict ourselves to models defined in terms of ODEs. Following [6], the time evolution of a system state of K species x(t, \theta) \in \mathbb{R}^K is the solution of this set of ODEs:

\frac{\partial}{\partial t} x(t, \theta) = f(x(t, \theta), \theta, u(t)), \qquad x(0) = x_0 .   (1)

Here \theta \in \mathbb{R}^P denotes the parameters of the system, and u(t) is a vector containing the input of the system. The L properties of the system y^M(t, \theta) \in \mathbb{R}^{L_i} that can be measured are described by an observation function g at time t_i, i = 1, ..., N (N is here the number of design points):
y^M(t_i, \theta, u) = g(x(t_i, \theta, u)), \qquad i = 1, ..., N.   (2)

The observations Y^D(t_i) \in \mathbb{R}^{L_i}, i = 1, ..., N are considered as random variables and are given by

Y^D(t_i) = y^M(t_i, \theta_0, u) + \epsilon_i, \qquad i = 1, ..., N,   (3)

where \theta_0 is the true parameter vector and \epsilon_i \in \mathbb{R}^{L_i}, i = 1, ..., N describes the distribution error at time t_i. We assume that the distribution of the noise (observation error) follows a normal law (where the variances \sigma_{ij}^2 can be estimated from repetitions of the experiments):

\epsilon_{ij} \sim \mathcal{N}(0, \sigma_{ij}^2), \qquad j = 1, ..., L_i, \quad i = 1, ..., N.   (4)

In fact, y^M(t, \theta) refers to theoretical values (given by the model) and y^D(t_i) (realizations of the random variables Y^D(t_i)) refers to practical values (it corresponds to the L_i measurements made experimentally at each time t_i, i = 1, ..., N).

Maximum Likelihood Method. This method gives us estimates of the parameters of the system. We need to maximize the likelihood function J_{ml}(\theta) to get estimates of the parameter vector \theta. This function is defined as:

J_{ml}(\theta; y^D(t_1), \ldots, y^D(t_N)) = f_{(Y^D(t_1), \ldots, Y^D(t_N))}(y^D(t_1), ..., y^D(t_N)).   (5)
As defined, the random variables Y^D(t_i) follow a multivariate normal law

Y^D(t_i) \sim \mathcal{N}_L(y^M(t_i, \theta, u), C(t_i)), \qquad i = 1, ..., N,   (6)

where C(t_i) \in M_{L,L}(\mathbb{R}) is the covariance matrix defined as C_{ll}(t_i) = \sigma_{il}^2 and C_{kl}(t_i) = 0 if k \ne l. Hence

J_{ml}(\theta; y^D(t_1), \ldots, y^D(t_N)) = \frac{1}{(2\pi)^{\frac{NL}{2}} \prod_{i=1}^{N} \prod_{j=1}^{L_i} \sigma_{ij}} \; e^{-\frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left( \frac{y_j^D(t_i) - y_j^M(t_i, \theta, u)}{\sigma_{ij}} \right)^2} .   (7)
Maximizing the likelihood function (with respect to \theta) is in fact the same as maximizing the logarithm of the likelihood function, which in turn is the same as minimizing the opposite of the logarithm of the likelihood function. So, this leads to minimizing the following function:

-\ln(J_{ml}(\theta; y^D(t_1), \ldots, y^D(t_N))) = \frac{NL}{2} \ln(2\pi) + \sum_{i=1}^{N} \sum_{j=1}^{L_i} \ln(\sigma_{ij}) + \frac{1}{2} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left( \frac{y_j^D(t_i) - y_j^M(t_i, \theta, u)}{\sigma_{ij}} \right)^2 .   (8)

This amounts to minimizing the following function:

\chi^2(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left( \frac{y_j^D(t_i) - y_j^M(t_i, \theta, u)}{\sigma_{ij}} \right)^2 .   (9)
This corresponds to the minimization of a weighted residual sum of squares (with weights w_{ij} = 1/\sigma_{ij}^2) to get the estimated parameters. At this point, we can compute analytically asymptotic estimates of the parameters \hat{\theta} and asymptotic confidence intervals. In this scope, we assume that we are in the case where we have so many observations that the deviation \Delta\theta between the real parameters \theta_0 and the estimated parameters \hat{\theta} is small. Thus, we can expand the observation function in a Taylor series:

y_j^M(t_i, \theta, u) = y_j^M(t_i, \theta_0, u) + \nabla_\theta y_j |_{t_i, \theta_0} (\theta - \theta_0).   (10)

We insert this result in the function to minimize, and we get:

\chi^2(\theta) = \sum_{i=1}^{N} \sum_{j=1}^{L_i} \left[ \frac{\epsilon_{ij}^2}{\sigma_{ij}^2} - 2 \frac{\epsilon_{ij}}{\sigma_{ij}^2} \nabla_\theta y_j |_{t_i, \theta_0} \Delta\theta + \frac{1}{\sigma_{ij}^2} \Delta\theta^T \left( [\nabla_\theta y_j]^T [\nabla_\theta y_j] \right) |_{t_i, \theta_0} \Delta\theta \right] .   (11)

To minimize \chi^2(\theta), we need to solve the equation \frac{\partial}{\partial\theta} \chi^2(\theta) = 0, so we get the estimated parameters:

\Delta\theta = F^{-1} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \frac{\epsilon_{ij}}{\sigma_{ij}^2} \left( [\nabla_\theta y_j]^T \right) |_{t_i, \theta_0} ,   (12)
where F is the Fisher information matrix.

Parameter Estimation and Covariance Matrix. From the knowledge of F we can easily get the exact values of the (asymptotic) estimated parameters:

\hat{\theta} = \theta_0 + F^{-1} \sum_{i=1}^{N} \sum_{j=1}^{L_i} \frac{\epsilon_{ij}}{\sigma_{ij}^2} \left( [\nabla_\theta y_j]^T \right) |_{t_i, \theta_0} .   (13)

As we assumed that the residuals are independently distributed, the covariance matrix of the estimated parameter vector is computed by (where the average is over the repetition of experiments):

\Sigma = \langle \Delta\theta \, \Delta\theta^T \rangle = F^{-1} .   (14)

Thanks to this covariance matrix, we can see the correlation between the parameters. The correlation matrix is defined by:

R_{ij} = \frac{\Sigma_{ij}}{\sqrt{\Sigma_{ii}\,\Sigma_{jj}}} \;\text{ if } i \ne j, \qquad R_{ij} = 1 \;\text{ if } i = j.   (15)
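As a rough illustration of Eqs. (9) and (12)-(15), the sketch below accumulates the chi-square value and the Fisher information matrix from precomputed sensitivities of the observations with respect to the parameters, and turns a covariance matrix into the correlation matrix. How the sensitivities are obtained and how Sigma = F^{-1} is inverted (e.g. with a linear algebra library) is left outside this fragment, and all names are illustrative.

```cpp
#include <cmath>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// chi^2 of Eq. (9) from flattened residuals (yD - yM) and their sigmas.
double chiSquare(const std::vector<double>& residual,
                 const std::vector<double>& sigma) {
    double chi2 = 0.0;
    for (std::size_t k = 0; k < residual.size(); ++k) {
        double r = residual[k] / sigma[k];
        chi2 += r * r;
    }
    return chi2;
}

// Fisher information matrix: F = sum_k (1/sigma_k^2) * grad_k * grad_k^T,
// where grad[k][p] is the sensitivity of observation k to parameter p.
Matrix fisherInformation(const Matrix& grad, const std::vector<double>& sigma) {
    std::size_t P = grad.empty() ? 0 : grad[0].size();
    Matrix F(P, std::vector<double>(P, 0.0));
    for (std::size_t k = 0; k < grad.size(); ++k) {
        double w = 1.0 / (sigma[k] * sigma[k]);
        for (std::size_t a = 0; a < P; ++a)
            for (std::size_t b = 0; b < P; ++b)
                F[a][b] += w * grad[k][a] * grad[k][b];
    }
    return F;
}

// Correlation matrix of Eq. (15) from the covariance Sigma = F^{-1}.
Matrix correlation(const Matrix& Sigma) {
    std::size_t P = Sigma.size();
    Matrix R(P, std::vector<double>(P, 1.0));
    for (std::size_t a = 0; a < P; ++a)
        for (std::size_t b = 0; b < P; ++b)
            if (a != b) R[a][b] = Sigma[a][b] / std::sqrt(Sigma[a][a] * Sigma[b][b]);
    return R;
}
```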
2.2 Parameter Correlation and Identifiability Criteria
Equipped with Eqn. (15), we can measure the interrelationship between the parameters and get an idea of the compensation effects of changes in the parameter
values on the model output. For instance, if two parameters are highly correlated, a change in the model output caused by a change in one model parameter can be compensated by an appropriate change in the other parameter value. This prevents such parameters from being uniquely identifiable even if the model output is very sensitive to changes in the individual parameters. We can then try to improve the information contained in the data by optimizing one of the criteria derived from \Sigma. We used the modified E-optimal design criterion, \min(\lambda_{max}(\Sigma)/\lambda_{min}(\Sigma)). As it minimizes the ratio of the largest to the smallest eigenvalue, it optimizes the functional shape of the confidence intervals. All calculations have been performed with ByoDyn (http://cbbl.imim.es/ByoDyn), most of them in the QosCosGrid [7] environment.
3 Results

3.1 The ODE Model for the Notch-Delta System
In the model, two adjacent cells, i and j, initially expressing the same amount of the genes notch and delta, generate an asymmetric final expression of the genes by the lateral inhibition mechanism. The interaction of the protein NOTCH with its ligand Delta activates the cleavage of the NOTCH intracellular domain (NICD) by a γ-secretase. NICD activates the expression of hes5 and ultimately downregulates delta. If one assumes, in a very rough approximation, that the quantities of the different species in the system are large enough to work with concentrations, we can formalize the verbal model represented in Figure 1; after adimensionalization by t = T_0 \tau and [x]_i = [x]_0 x_i(\tau) it reads:

\frac{d\,notch_i(\tau)}{d\tau} = T_0 K_{deg}^{notch} (r_{notch} - notch_i(\tau)),

\frac{d\,NOTCH_i(\tau)}{d\tau} = T_0 \left[ K_{deg}^{NOTCH} (notch_i(\tau) - NOTCH_i(\tau)) - \frac{k}{n^2} K_{bind}^{ND} [DELTA]_0 \, NOTCH_i(\tau) \sum_{j=1}^{k} DELTA_j(\tau) \right],

\frac{d\,delta_i(\tau)}{d\tau} = T_0 K_{deg}^{delta} \left[ \left( 1 - \frac{HES5_i^s(\tau)}{\kappa_{HES5} + HES5_i^s(\tau)} \right) - delta_i(\tau) \right],

\frac{d\,DELTA_i(\tau)}{d\tau} = T_0 \left[ K_{deg}^{DELTA} (delta_i(\tau) - DELTA_i(\tau)) - \frac{k}{n^2} K_{bind}^{ND} [NOTCH]_0 \, DELTA_i(\tau) \sum_{j=1}^{k} NOTCH_j(\tau) \right],

\frac{d\,ND_i(\tau)}{d\tau} = T_0 \left( \frac{k}{n^2} K_{bind}^{ND} [DELTA]_0 \, NOTCH_i(\tau) \sum_{j=1}^{k} DELTA_j(\tau) - K_{deg}^{ND} ND_i(\tau) \right),

\frac{d\,hes5_i(\tau)}{d\tau} = T_0 K_{deg}^{hes5} \left( \frac{ND_i^m(\tau)}{\kappa_{ND} + ND_i^m(\tau)} - hes5_i(\tau) \right),

\frac{d\,HES5_i(\tau)}{d\tau} = T_0 K_{deg}^{HES5} (hes5_i(\tau) - HES5_i(\tau)).   (16)
where we have assumed that the NOTCH cleavage after the formation of the ND complex, the NICD transport to the nucleus and the transcription factor activation can be simply approximated by the amount of ND complex that is formed
Fig. 1. Simplified model of the Notch/Delta pathway for two adjacent cells i and j. NOTCH* and DELTA* refer to the activated forms.
on the membrane surface. In (16), notch is constitutively activated, while sigmoidal activation and inhibition curves are used for hes5 and delta, respectively. By using the parameters K_{deg}^{hes5} = K_{deg}^{HES5} = K_{deg}^{NOTCH} = K_{deg}^{DELTA} = K_{deg}^{ND} = K_{deg}^{delta} = 0.01; s = m = 2.0; \kappa_{ND} = \kappa_{HES5} = 0.1; NOTCH_0 = 5.0; K_{deg}^{notch} = 0.0016649; K_{bind}^{ND} = 0.25; DELTA_0 = 3.0; r_{notch} = 0.620926, the steady-state concentrations of the three genes in our model acquire the characteristic chess-like pattern represented in Figure 2, in which different cell types are clearly defined. In addition, the figure shows the correlation matrices for the diverse systems. It appears that the boundary effect vanishes with bigger tissue sizes and that the 5×5 cells model can be considered converged for the purposes of this paper, as seen from the invariant correlation matrix when comparing the 5×5 and 7×7 systems. Thus, in the following paragraphs we will present our protocol for experimental design based on the 5×5 tissue model. Next, we consider a typical experimental setting in which RT-PCR experiments are carried out and provide time-dependent data for each of the genes involved in our model. We will generate hypothetical data from real experiments of inner ear early development in chick [8]. In a typical scenario of the model, 4 tissue samples may be extracted at different stages of development. For each of them RT-PCR experiments may be performed, using three replicas for security, showing a behavior that in the best case will be just close to the simulated concentration profiles from Eq. (16). The parameter set \theta may then be globally optimized with several methods. We use here a simple approach consisting of local optimizations from 10 or 100 starting random values of \theta, with varying values of \sigma^2 for the generated data points. Once the fitting parameters are obtained to some approximation, by using the above detailed simple approach or by more sophisticated methods [9], we are interested in improving their confidence intervals. This can be achieved, of course,
Fig. 2. (a.) Delta steady state concentration distribution in model tissues of different dimensions. (b.) Correlation matrices of the adimensional parameters in each case in (a). The steady state is achieved at a biologically plausible time scale.
by choosing a better optimization algorithm or, complementarily, by using information theory in order to estimate which data will provide more information to improve the practical identifiability of the parameters. This is extremely relevant, as new experiments can consume an important amount of resources, and one may even decide that they are not worth trying because of intrinsic identifiability problems of the model. In order to learn about the information content of new data, we generate in silico data at 200 time points through the total simulation time t_{total} using the parameters optimized in the previous step; we call this set θ. In a leave-one-out fashion, each value is deleted at a time and the modified E-criteria is evaluated for the remaining data, in order to discover the computer-generated point that, according to the current model (topology plus parameters), contains the most information. Figure 3 shows the result of this approach for the two genes of the system. In the first iteration of the protocol, the modified E-criteria suggests that new values for the concentration of hes5 at time t = 1260 would be the most informative. At this stage we measure new data for that gene at such a time step and we proceed to the next iteration of the approach. Such a measurement in a real experimental setup is simulated here by a new in silico value obtained with or without noise with respect to the known model. Finally, Figure 4 shows the evolution of the modified E-criteria over a number of iterations of the protocol. It can be seen that the higher information content of the new experimental data set (increased after each OED iteration) does not necessarily involve a better (lower) value of the modified E-criteria. This problem has multiple origins, such as the noise of the newly measured data or the fact that the optimization method does not find the same minimum in each parameter estimation step.
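A schematic of the leave-one-out scoring step described above might look as follows; the candidate list, the criterion callback and the convention that a larger criterion value means a worse design are assumptions of this sketch rather than the ByoDyn implementation.

```cpp
#include <cstddef>
#include <functional>
#include <limits>
#include <vector>

struct Candidate { double time; int geneIndex; double value; };  // one in-silico point

// Returns the index of the candidate whose removal degrades the design most,
// i.e. the point carrying the most information under the supplied criterion
// (e.g. the modified E-criteria computed from the covariance of the design).
std::size_t mostInformativePoint(
        const std::vector<Candidate>& candidates,
        const std::function<double(const std::vector<Candidate>&)>& criterion) {
    std::size_t best = 0;
    double worstScore = -std::numeric_limits<double>::infinity();
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        std::vector<Candidate> reduced = candidates;                 // leave-one-out copy
        reduced.erase(reduced.begin() + static_cast<std::ptrdiff_t>(i));
        double score = criterion(reduced);                           // larger = worse design
        if (score > worstScore) { worstScore = score; best = i; }    // most missed point
    }
    return best;  // measure this point experimentally in the next OED iteration
}
```

In each OED iteration the selected point would be measured (or, as here, generated in silico with optional noise), added to the data set, and the parameters re-estimated before the next scoring pass.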
Fig. 3. Modified E-criteria after adding one time point per gene at each time, from a set of previously targeted behavior (curves: delta, hes5)
Fig. 4. Evolution of the modified E-criteria for 5 iterations of the OED procedure (curves: 100 gradient searches, error = 0.0; 10 gradient searches, error = 0.1; 100 gradient searches, error = 0.1; 100 gradient searches, error = 0.25)
4 Conclusions
Optimal experimental design has been demonstrated in a realistic example of an experiment/theory iterative protocol. In this paper, the experimental data is
indeed estimated from new calculations in order to show the general applicability of the protocol, although its migration to real experimental setups is straightforward. The benefits of using the proposed approach are clear, as the new experiments to be carried out are decided from the predicted behavior of the modified E-criteria for a set of in silico data generated from the model with the optimal parameters from the previous step in the iteration. The proposed protocol provides an easy and neat method to incorporate experimental data, which may be difficult or expensive to obtain, in an informed way. At the same time it provides clues about the identifiability of the parameters of the proposed model, according to the evolution of the modified E-criteria with the iterations of the OED approach. Thus, one expects the modified E-criteria to approach the limit of 1 for a perfectly identifiable model if a large number of experiments is performed, while reaching a different limiting value is indicative of the unidentifiability of the model. The protocol has been exemplified on a hypothetical situation in which a simple gene regulatory network includes three genes interacting in a multicellular system. However, the data proposed, its distribution and the error one makes in the experimental evaluations are realistic and match a typical experimental setting. The next step is to apply this protocol to real data on a more complex model, like the regionalization of cellular systems during vertebrate development [10]. Finally, the practical implementation of the protocol makes it suitable for parallelization at several points, like the multiple optimizations in each step or the evaluation of the modified E-criteria itself for several time/species trial values.

Acknowledgments. ALGL thanks the Generalitat de Catalunya for a PhD fellowship. Partially funded by grant BQU2003-04448 (MCYT: Spanish Ministry of Science and Technology), and EC-STREP projects QosCosGrid (FP6-IST-2005-033883) and BioBridge (FP6-LIFESCIHEALTH-2005-037909). The authors thankfully acknowledge the computer resources and assistance provided by the Barcelona Supercomputing Center.
References
1. Tomlin, C.J., Axelrod, J.D.: Biology by numbers: mathematical modelling in developmental biology. Nature Reviews 8, 331–340 (2007)
2. Jaeger, J., Surkova, S., Blagov, M., Janssens, H., Kosman, D., Kozlov, K.N., Manu, M.E., Vanario-Alonso, C., Samsonova, M., Sharp, D.H., Reinitz, J.: Dynamic control of positional information in the early Drosophila embryo. Nature 430, 368–371 (2004)
3. de Jong, H.: Modeling and simulation of genetic regulatory systems: a literature review. J. Comput. Biol. 9(1), 67–103 (2002)
4. Meinhardt, H.: Computational modelling of epithelial patterning. Curr. Opin. Genet. Dev. 17(4), 272–280 (2007)
5. von Dassow, G., Meir, E., Munro, E.M., Odell, G.M.: The segment polarity network is a robust developmental module. Nature 406, 188–192 (2000)
6. Faller, D., Klingmüller, U., Timmer, J.: Simulation methods for optimal experimental design in systems biology. Simulation 79, 717–725 (2003)
7. Coti, C., Herault, T., Peyronnet, S., Rezmerita, A., Cappello, F.: Grid services for MPI. In: ACM/IEEE (ed.) Proceedings of the 8th IEEE International Symposium on Cluster Computing and the Grid (CCGrid 2008), Lyon, France (May 2008)
8. Alsina, B., Abello, G., Ulloa, E., Henrique, D., Pujades, C., Giraldez, F.: FGF signaling is required for determination of otic neuroblasts in the chick embryo. Dev. Biol. 267(1), 119–134 (2004)
9. Rodriguez-Fernandez, M., Egea, J.A., Banga, J.R.: Novel metaheuristic for parameter estimation in nonlinear dynamic biological systems. BMC Bioinformatics 7, 483 (2006)
10. Alsina, B., García de Lomana, A., Villà-Freixa, J., Giraldez, F.: (submitted, 2008)
Self-Organised Criticality as a Function of Connections' Number in the Model of the Rat Somatosensory Cortex

Grzegorz M. Wojcik and Wieslaw A. Kaminski

Institute of Computer Science, Maria Curie-Sklodowska University, pl. Marii Curie-Sklodowskiej 5, 20-031 Lublin, Poland
[email protected]
Abstract. A model of a part of the rat somatosensory cortex was examined. A large network of Hodgkin-Huxley neurons was simulated, and the modular architecture of this structure, divided into layers and subregions, was implemented. The high degree of complexity required effective parallelisation of the simulation. In this article the results of the parallel neural computations are presented. The occurrence of self-organised criticality was observed and its characteristics as a function of the number of connections were investigated. It was shown that the frequency of the so-called spike-potential avalanches depends on the density of inter-neuron connections. In addition, some benchmarking runs were conducted and the parallelisation effectiveness is presented to some extent.
1 Introduction
The critical point is a point at which a system radically changes its behaviour or structure. Self-organised critical phenomena are defined by a complex system which reaches a critical state by its intrinsic dynamics, independently of the value of any control parameter. A typical example of a system exhibiting self-organised criticality (SOC) is the sand pile model. Sand is slowly dropped onto a surface, forming a pile. As the pile grows, avalanches occur which carry the sand from the top to the bottom of the pile. At least in the model, the slope of the pile becomes independent of the rate at which the system is driven by dropping sand. This exemplifies the so-called (self-organised) critical slope [1]. The oldest numerical models describing the sand-pile problem are presented, e.g., in [1], [3], [6]. In this model, a one-dimensional pile of sand is considered. Grains of sand are stored in columns. The dynamics of the system is defined by a set of equations describing the effect of adding one grain. After a certain number of grains have been added to the appropriate columns, a critical inclination of the sand pile occurs and this causes a disorder leading to the relaxation of the whole system. This disorder is referred to as an avalanche. Critical states of a system are signalled by a power-law distribution in some observable. In the case of sand piles, the size and the distribution of the avalanches
can be measured. The frequency of avalanche occurrence in the system is a function of its size and can be expressed by the power law [1]:

D(S) \sim S^{-k} ,   (1)
where k is a characteristic number for a given system. Complex systems exhibiting the behaviour predicted by SOC have been widely investigated [4], [5], [8], [10], [11], [14]. Earthquakes, forest fires and biological evolution are just three examples of the wide range of phenomena that have been successfully modelled this way [1]. There are experiments that confirm the existence of frequency tuning and adaptation to stimulus statistics in neurons of the rat somatosensory cortex [7]. SOC was found in a model of large biological neural networks [13]; however, the aim of the research discussed in this contribution was to investigate whether and how the occurrence of SOC depends on the number of connections in the simulated brain tissue. A good understanding of the SOC mechanism in the model will allow us to design new series of experiments with a large number of interacting neurons, leading to the discovery of a new class of neurodynamical phenomena taking place in real brains. Simulations of microcircuits consisting of numerically complicated Hodgkin-Huxley (HH) neurons [9] are computationally demanding. The simulation time can be shortened by using cluster-based parallel computing. All the simulations discussed in this paper were conducted with the parallel version of GENESIS compiled for the MPI environment [15]. The choice of the GENESIS simulator allowed us to use many processors and to design an effective way of parallelisation. Remarkably, in this article we demonstrate that the critical relaxation phenomena depend on the density of inter-neuron connections existing in the network. Consequently, the effectiveness of the model's parallelisation, the simulation time and its speedup as functions of the number of connections will be presented in the last section.
2 Model and Method of Parallelisation
The somatosensory pathways bring sensory information from the periphery into the brain, e.g., from a rat's whisker to the somatosensory cortex. Information from the snout passes along the trigeminal nerve, projecting to the trigeminal complex in the brainstem, which sends projections to the medial ventral posterior nucleus of the thalamus (VPm). Each whisker has a representative physical structure in the brain, forming 2-D maps of the whisker pad throughout the pathway. In the cortex, these structures are known as barrels. They are formed from clusters of stellate cortical neurons, with the cell bodies arranged in a ring and the dendrites filling the middle "hole". The dendrites form synapses with multiple axons rising from the VPm [16]. The neurons chosen for the simulations were implemented according to the HH model [9]. The cells are relatively simple (for details, see Appendix A). The only modification of the neuron model was made in order to avoid rapid synchronisation of the whole network: an additional parameter responsible
for the probability of exocytosis was added for each synaptic connection in the post-synaptic neuron. Such a change required a simple modification of the original GENESIS code. The changed version of GENESIS, compiled for Linux and MPI, can be downloaded from [15]. The simulated net consisted of 2025 of the above-mentioned neurons. All the cells were placed on a square-shaped grid with 45 rows and 45 columns. Each neuron was identified by a pair of numbers ranging from 0 to 44. The network cells were divided into 22 groups, called layers, numbered from 1 to 22. Communication between the neurons was based on the following principle: the input signal from each cell of the m-th layer was transported to all the neurons of layers m + 1, m + 2, m + 3, ..., m + Ns, where Ns was an integer number not greater than the number of layers (see Fig. 1). Note that such a structure (2-D with dense "neural rings") imitates the structure of the rat's cortical barrels. The system can be easily parallelised, so we decided to simulate the problem on 15 processors. The network was divided into 15 zones. In each zone the same number of neurons was simulated. The zones were numbered from 1 to 15 and the way in which they were arranged is presented in Fig. 1. Such a choice allowed us to run the simulations in an optimal way, without the barriers timing out.
Fig. 1. Scheme of the simulated network. Layers are highlighted by thick lines. The stimulating neuron is marked with the black square and all other neurons are placed at the intersections of grid lines. Neuron coordinates are marked on the top and the left side of the scheme. In each zone there are 3 columns of neurons, as marked at the bottom. The choice of columns belonging to particular zones is arbitrary.
The complexity of the system increases rapidly with Ns, and so does the simulation time. A good parallelisation of the model not only shortens its simulation time, but often makes it executable at all. That is why parallelisation techniques are so important for HH systems with a large number of synapses. Synaptic connections were characterised by three parameters: the weight w, the time delay τ, and the probability p of transporting the signal, which corresponded to the mentioned probability of the occurrence of exocytosis. The probability p was set to a constant, the same for all of the synapses (p = 0.5). The values of the two other parameters depend on the positions of both the pre-synaptic and the post-synaptic neuron. For each pair of neurons from the m-th layer and the n-th layer, respectively, the parameters w and τ were chosen according to the following rules:

w = \frac{w_0}{|m - n|} ,   (2)

\tau = 10^{-4} |m - n| \; [\mathrm{s}] ,   (3)
where w_0 was a positive constant (in our simulations w_0 = 2). The system was stimulated by the neuron N[23, 23], which was the main receptor of activity from outside the net (i.e., a glass capillary stimulating the whisker [7] or an electrode transmitting some random stimulus directly into the cortex). As a result, the receptor produced a periodic spike potential with a frequency of about 80 Hz. The net was characterised by the parameter T that corresponded to the biological system's real working time (in these simulations T = 15 s).
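The wiring rule and the connection parameters (2)-(3) can be expressed compactly; the sketch below enumerates the layer pairs connected for a given Ns and assigns w and τ, with the container and field names being assumptions of this illustration.

```cpp
#include <vector>

struct Connection { int fromLayer, toLayer; double weight, delay; };

// Builds layer-to-layer connections: layer m projects to layers m+1 ... m+Ns,
// with weight w = w0 / |m - n| (Eq. 2) and delay tau = 1e-4 * |m - n| s (Eq. 3).
std::vector<Connection> buildLayerConnections(int numLayers, int Ns, double w0) {
    std::vector<Connection> conns;
    for (int m = 1; m <= numLayers; ++m) {
        for (int n = m + 1; n <= m + Ns && n <= numLayers; ++n) {
            double dist = static_cast<double>(n - m);   // |m - n|, since n > m here
            conns.push_back({m, n, w0 / dist, 1e-4 * dist});
        }
    }
    return conns;
}
// Example: buildLayerConnections(22, 6, 2.0) wires the 22-layer, Ns = 6 setup.
```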
3 Simulations and Results
The stimulus was transported from the central unit to all other cells through the arranged connections. During the simulation, the time of each spike-potential occurrence was collected for each neuron. An avalanche occurs when a group of neurons spikes in the same, small interval of time (here t_i = 1 ms). The algorithm used to compute the number of avalanches was implemented in C++ (a simple analysis of text files containing the times and values of the membrane potential, searching for neurons with high spiking activity in the same time interval). It was shown that for a system with a small number of neighbourhoods (Ns < 6), and thus a small number of connections, the power law cannot precisely describe the number of spike-potential avalanche occurrences as a function of their size (Fig. 2). When Ns = 6, a kind of phase transition leading the system to SOC behaviour can be observed (Fig. 3) [13]. A systematic analysis of SOC was performed, e.g., by Peter Sloot [12], and the aim of the research described in this contribution was to investigate whether the occurrence of SOC depends only on the number of neighbourhoods, or whether it can appear or disappear for a given Ns as a function of the number of intra-network connections. As the most sensitive value, Ns = 6 was chosen in all series of the aforementioned experiments [13].
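A minimal version of such an avalanche count, binning spike times into 1 ms windows and recording how many spikes fall in each window, could look as follows; the input format and the definition of the avalanche size as the per-window spike count are simplifying assumptions of this sketch.

```cpp
#include <cmath>
#include <map>
#include <vector>

// spikeTimes[n] holds the spike times (in seconds) of neuron n.
// Returns D(s): how many time windows of width binWidth contained exactly
// s spikes, i.e. the frequency of avalanches of size s.
std::map<int, int> avalancheSizeDistribution(
        const std::vector<std::vector<double>>& spikeTimes,
        double binWidth = 1e-3) {
    std::map<long, int> spikesPerBin;           // bin index -> spikes in that window
    for (const auto& neuron : spikeTimes)
        for (double t : neuron)
            ++spikesPerBin[static_cast<long>(std::floor(t / binWidth))];

    std::map<int, int> distribution;            // avalanche size s -> count D(s)
    for (const auto& bin : spikesPerBin)
        ++distribution[bin.second];
    return distribution;
}
```

Plotting the resulting D(s) against s on a log-log scale is then enough to check for the power-law behaviour of Eq. (1).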
Fig. 2. Frequency D(s) as a function of avalanche size for Ns = 1, px = 1, T = 15 s
Fig. 3. Frequency D(s) as a function of avalanche size for Ns = 6, px = 1, T = 15 s
Then another parameter, px, defining the probability of synapse creation between two neurons of the simulated network, was introduced into the model. Surprisingly, it was noted that the self-organisation depends not only on the number of connections (as could be concluded from previous research) but also on the network architecture. On the basis of Fig. 2 and Fig. 3 one could hypothesise that when the number of connections falls, the self-organisation disappears. However, for Ns = 6 the SOC manifests itself even better when we decrease the strength of the inter-neuron communication by setting px below 0.07 (Fig. 4). What is more, for 0.07 < px < 0.4 the SOC behaviour tends to disappear (see two of the curves in Fig. 5), only to come back for px > 0.4 (Fig. 6). Because of the relatively high system complexity and the tendency to rapid synchronisation, the number of spikes decreases with the number of connections in the network (Fig. 7). That is why the number of avalanches and the inclination of the SOC
Fig. 4. Frequency D(s) as a function of avalanche size for Ns = 6, px < 0.07 (px = 0.01, 0.04, 0.06), T = 15 s
Fig. 4. Frequency D(s) as a function of avalanche size for Ns = 6, px < 0.07, T = 15 s 10000
px = 0.08 px = 0.20 px = 0.40
D(s)
1000
100
10
1 10
100 s - Size of Avalanche
1000
Fig. 5. Frequency D(s) as a function avalanche size for Ns = 6, 0.07 < px < 0.4 10000
Fig. 6. Frequency D(s) as a function of avalanche size for Ns = 6, px > 0.4 (px = 0.10, 0.40, 0.80), T = 15 s
Fig. 7. Scale of the SOC for Ns = 6, T = 15 s and different px (px = 0.20, 0.70, 0.80)
Fig. 8. SOC inclination for Ns = 6, T = 15 s and different px (px = 0.01, 0.80)
curve are different and depend on px (Fig. 7-8). The number of spikes in the network falls with the growth of connections’ density (Fig. 9).
4 Parallelisation Effectiveness
The local cluster used for all the simulations was built of 13 machines, including one special machine, the so-called "access node". Each SMP machine had two 64-bit 1.4 GHz Itanium2 IA64 processors with 4 GB of RAM. The cluster works under the control of Debian Linux Sarge (v. 3.1) with kernel version 2.6.8-1. The model was simulated in the GEneral NEural SImulation System GENESIS v.2.2.1 with its MPI extension. The gcc compiler was used for the system compilation, and in the case of MPI and the Linux OS the compilation required some tuning of the GENESIS code. The changed version can be found in [15].
Fig. 9. Density of connections and number of spikes as a function of px 1000 speedup 1 node
Simulation Time [h] / Speedup
15 nodes
100
10
1 0
0.1
0.2
0.3
0.4
0.5 px
0.6
0.7
0.8
0.9
1
Fig. 10. Time of simulation and speedup as a function of px
The length of a typical run for Ns = 6 and T = 15 s was about 10 hours when the problem was parallelised over 15 nodes. However, on one node the simulation time ranged from 12 h to 230 h, depending on the value of px. In the best case this gave us a speedup of 23 (Fig. 10). At first sight this is a very optimistic result, especially for structures with a large number of synapses. One should remember that 5 years ago such networks with Ns > 6, modelled on one 400 MHz SPARC node, had a simulation time of about 3 weeks.
5 Conclusions
A systematic analysis of the dynamics of the simulated part of the rat's somatosensory cortex was conducted. Effective parallelisation was applied. SOC manifests itself in large biological neural networks. The "quality" of SOC depends both on the number of connections and on the architecture of the system. The role of SOC
phenomena in mammalian brains is still unrecognised. However, good modelling will make it possible for us to design new series of neuroscientific experiments, leading in the end to a better understanding of brain functionality.
Appendix A: Properties of HH Neurons

Our model consisted of multicompartmental neurons with two dendrite compartments, a soma and an axon. The dendrites contained a synaptically activated channel and the soma had voltage-activated HH sodium and potassium channels. The behaviour of each compartment was equivalent to the behaviour of a certain electrical circuit [2]. Thus, each circuit was characterised by a group of parameters typical for GENESIS, set as follows: resistances Ra = 0.3 Ω and Rm = 0.33 Ω, capacity Cm = 0.01 F, and potential Em = 0.07 V. For the soma compartment Ek = 0.0594 V and for the dendrites Ek = 0.07 V. The conductance for each type of ionic channel was chosen to be GK = 360 Ω^-1 and GNa = 1200 Ω^-1. These parameters originate from neurophysiological experiments [2] and were chosen to make the model biologically more realistic. The soma had a circular shape with a diameter of 30 μm; the dendrites and the axon were cable-like, with a length of 100 μm. All the other parameters were chosen as suggested by the GENESIS authors to simulate the behaviour of biologically plausible neurons [2]. More details concerning the HH model can be found elsewhere [2], [9].

Acknowledgements. This work has been supported by the Maria Curie-Sklodowska University, Lublin, Poland (under the grant of the UMCS Vice President, 2007) and the Polish State Committee for Scientific Research under grant number N519 017 32/2120. Special thanks to Peter Sloot for inspiration during the meeting in Russia.
References
1. Bak, P.: How nature works: The Science of Self-Organised Criticality. Copernicus Press, New York (1996)
2. Bower, J.M., Beeman, D.: The Book of GENESIS – Exploring Realistic Neural Models with the GEneral NEural SImulation System. Telos, New York (1995)
3. Jensen, H.J.: Self Organizing Criticality. Cambridge University Press, Cambridge (1998)
4. Aegerter, C.M., Günther, R., Wijngaarden, R.J.: Avalanche dynamics, surface roughening, and self-organized criticality: Experiments on a three-dimensional pile of rice. Phys. Rev. E 67, 051306 (2003)
5. Bak, P., Christensen, K., Danon, L., Scanlon, T.: Unified Scaling Law for Earthquakes. Phys. Rev. Lett. 88, 178501 (2002)
6. Bak, P., Tang, C., Wiesenfeld, K.: Self-organized criticality: An explanation of the 1/f noise. Phys. Rev. Lett. 59, 381–384 (1987)
7. Garcia-Lazaro, J.A., Ho, S.S.M., Nair, A., Schnupp, J.W.H.: Adaptation to Stimulus in Rat Somatosensory Cortex. FENS Abstr. 3, A109.4 (2006)
8. Lubeck, S.: Crossover phenomenon in self-organized critical sandpile models. Phys. Rev. E 62, 6149–6154 (2000)
9. Hodgkin, A.L., Huxley, A.F.: A Quantitative Description of Membrane Current and its Application to Conduction and Excitation in Nerve. J. Physiol. 117, 500–544 (1952)
10. Paczuski, M., Bassler, K.E.: Theoretical results for sandpile models of self-organized criticality with multiple topplings. Phys. Rev. E 62, 5347–5352 (2000)
11. Pastor-Satorras, R., Vespignani, A.: Corrections to scaling in the forest-fire model. Phys. Rev. E 61, 4854–4859 (2000)
12. Sloot, P.M.A., Overeinder, B.J., Schoneveld, A.: Self-organized criticality in simulated correlated systems. Comp. Phys. Comm. 142, 7–81 (2001)
13. Wojcik, G.M., Kaminski, W.A., Matejanka, P.: Self-organised Criticality in a Model of the Rat Somatosensory Cortex. In: Malyshkin, V.E. (ed.) PaCT 2007. LNCS, vol. 4671, pp. 468–476. Springer, Heidelberg (2007)
14. Yang, X., Du, S., Ma, J.: Do Earthquakes Exhibit Self-Organized Criticality? Phys. Rev. Lett. 92, 228501 (2004)
15. The GENESIS compiled for Linux MPI: http://www.luna.umcs.lublin.pl/download/modgenesis4mpi.tgz
16. The Rat Somatosensory Pathway: http://www.bris.ac.uk/Depts/Synaptic/info/pathway/somatosensory.htm
Approximate Clustering of Noisy Biomedical Data

Krzysztof Boryczko and Marcin Kurdziel

Institute of Computer Science, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland
{boryczko,kurdziel}@agh.edu.pl
Abstract. Classical clustering algorithms often perform poorly on data harboring background noise, i.e. a large number of observations distributed uniformly in the feature space. Here, we present a new density-based algorithm for approximate clustering of such noisy data. The algorithm employs Shared Nearest Neighbor Graphs for estimating local data density and identifying core points, which are assumed to indicate the locations of clusters. Partitioning of the core points into clusters is performed by means of the Mutual Nearest Neighbor distance measure. This similarity measure is sensitive to changes in local data density and is thus useful for discovering clusters that differ in this respect. The performance of the presented algorithm was demonstrated on three data sets, two synthetic and one real world. In all cases, meaningful clustering structures were discovered.
Keywords: Cluster analysis, Noisy data, Multidimensional data, Shared Nearest Neighbor Graph, Mutual Nearest Neighborhood.
1 Introduction
Formerly, research in cluster analysis focused on data sets where almost all observations are believed to be members of some clusters. Even if outlier observations were accounted for, they were thought of as exceptions rather than a significant fraction of the data set. In recent years, however, efforts have been made to develop clustering techniques suitable for data sets where outlier observations are so frequent that they in fact become a noisy background in which clusters are submerged. A classical example is the DBSCAN algorithm [1], which employs a density-based definition of clusters. A density-based notion of clusters was also adopted in [2]. Unlike DBSCAN, which relies on simple counting of points within spheres of some given radius, this method employs Shared Nearest Neighbor (SNN) graphs for density estimation. Some approaches to noisy data clustering employ data sampling instead of explicit density estimation. This is the case in the CURE [3] algorithm, for example. Yet another approach to this task focuses on graph-based cluster connectivity measures. Typical representatives of this approach are the Chameleon [4] and ROCK [5] algorithms.
We present a new algorithm for clustering of high-dimensional, noisy data. The algorithm, named Clustering With Nearest Neighborhood (CWNN), is inspired by ideas presented in [2], [6] and [7]. CWNN employs the SNN graph to detect the so-called core data points. This allows for explicit handling of background noise as well as automatic assessment of the number and shapes of clusters. The strength of our approach lies in the method for partitioning core points into origins of clusters. We propose to partition the set of core points by employing the Mutual Nearest Neighbor (MNN) distance measure computed over the proximity measure derived from the SNN graph. Experimental results illustrating the performance of this method for data harboring background noise, including multidimensional cases, are presented. It is important to note here that perfect discrimination between background noise and data clusters is often unattainable, especially if they are of comparable densities. Therefore, CWNN should be seen as an approximate clustering method.
1.1 Mutual Nearest Neighbor Distance Measure
Consider a set of points X = {x_1, x_2, ..., x_n} and a distance metric d(·, ·). For example, this can be a finite subset of an m-dimensional cube, X ⊂ [−γ, γ]^m ⊂ R^m, and the Euclidean distance. Let NGH(x_i) be a list of neighbors of the point x_i, sorted in an ascending order according to d(·, ·). Further, let G_k = (X, E) be the k-Nearest Neighbor (k-NN) graph of X, i.e.:

(x_i, x_j) ∈ E ⇔ x_j ∈ {NGH_l(x_i) : l = 1 ... k}     (1)
where NGH_l(x_i) is the l-th element of the list NGH(x_i). The Mutual Nearest Neighbor distance measure, originally proposed in [7], estimates proximity between a pair of points on the basis of their rankings in mutual k-NN lists. In particular, for a pair of points x_i, x_j ∈ X, such that

x_i = NGH_k(x_j) ∧ x_j = NGH_l(x_i),     (2)

the value of the MNN distance measure is equal to

MNN(X, x_i, x_j) = k + l.     (3)
For clustering purposes, the MNN distance measure has a strong advantage over classical distance metrics, e.g. the Euclidean distance, of being more sensitive to changes in local data density [7]. This is illustrated by the example depicted in Fig. 1. The points form two clusters of different densities (marked C1 and C2). Suppose that we would like to identify the clusters by simply comparing the distances between the points. A straightforward approach would be to identify the connected components within the data set, assuming that any two points are connected when the distance between them is smaller than some threshold value ε. Using the Euclidean distance, it is impossible to choose a proper threshold value ε. For ε ≤ 2 each point from the cluster C1 will be assigned to a separate, artificial cluster. On the other hand, for every
ε > 2 the whole data set will be assigned to a single cluster. Now, consider the MNN distance measure. For every two points from the cluster C1 that are adjacent to each other and placed in the same column (e.g., points a and b in Fig. 1) or the same row (e.g., points b and c in Fig. 1), the value of the MNN distance measure is equal to 2. The same situation occurs in the cluster C2. However, the value of the MNN distance measure between the points x and y is equal to MNN(C1 ∪ C2, x, y) = 4. We can clearly see this from the lists of the nearest neighbors of those two points: NGH(x) = [{x_n1, x_n2, y}, y_n2, ...] and NGH(y) = [{y_n1, y_n2}, y_n3, {x, y_n4, y_n5}, ...]. Consequently, this data set can be properly clustered with the threshold value for the MNN distance measure ε = 3.
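The computation behind eq. (3) is small enough to sketch in code. The fragment below is our own illustration (not part of the original work); the function and variable names are hypothetical, and the neighbour lists are assumed to be pre-sorted by distance.

```cpp
#include <vector>

// Illustrative sketch of the MNN distance of eq. (3): ngh[i] is the list of
// neighbours of point i, sorted by increasing distance d(.,.).  The rank of a
// neighbour is its 1-based position in that list.  Returns k + l, or -1 if the
// two points do not appear in each other's lists.
int mnnDistance(const std::vector<std::vector<int>>& ngh, int i, int j) {
    auto rankOf = [&](int of, int target) -> int {
        const std::vector<int>& list = ngh[of];
        for (std::size_t r = 0; r < list.size(); ++r)
            if (list[r] == target) return static_cast<int>(r) + 1;
        return -1;
    };
    const int l = rankOf(i, j);   // x_j is the l-th neighbour of x_i
    const int k = rankOf(j, i);   // x_i is the k-th neighbour of x_j
    return (k < 0 || l < 0) ? -1 : k + l;
}
```

With the configuration of Fig. 1, adjacent points within one cluster yield the value 2, while the pair x, y yields 4, so a threshold of 3 separates the clusters.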
Fig. 1. An example data set, made of two clusters that cannot be discovered using only the Euclidean metric but can be found with the MNN distance measure
1.2 Estimating Proximity in Multidimensional Spaces with Sparse Shared Nearest Neighbor Graphs
The Euclidean metric is not well suited for estimating proximity of points in high-dimensional spaces (see e.g. [8]). A proximity measure that is better suited for multidimensional data was proposed in [6]. In that work, the proximity between a pair of points was defined to be the number of neighbors they share. We employ this idea in a slightly modified manner. Consider a k-NN graph G_k = (X, E) of the input data set X. A graph S_k = (X, E, W) in which the weights given by

w_ij = #{x_s ∈ X \ {x_i, x_j} : (x_i, x_s) ∈ E ∧ (x_j, x_s) ∈ E}     (4)
are assigned to the edges (x_i, x_j) ∈ E, is called the Shared Nearest Neighbor graph of X. Provided that the number of shared neighbors depends on how close the points are (which is a reasonable assumption), their proximity can be defined in the following way:

d_Sk(x_i, x_j) = k − w_ij.     (5)

The measure d_Sk(x_i, x_j) is well defined only if both the edges (x_i, x_j) and (x_j, x_i) belong to S_k. If this is not the case, we consider d_Sk(x_i, x_j) to be infinite.
The SNN graph can be used to establish a strong neighborhood relationship in X [2]. This is done by removing from S_k all edges (x_i, x_j) ∈ E for which w_ij < t, where t is a threshold value. In the resultant graph, denoted by S_k^t, edges connect strong neighbors. The relation defined in this way has the advantage of being relatively immune to background noise. In particular, if a sufficiently high threshold value t is chosen, the noise points that lie outside of high-density regions (i.e. clusters) will not have any strong neighbors.
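As an illustration of eq. (4) and of the sparsification step, the sketch below (our own, with hypothetical names, not the authors' implementation) computes the shared-neighbour weight for every k-NN edge and keeps only the strong edges with w_ij ≥ t.

```cpp
#include <vector>
#include <unordered_set>

struct SnnEdge { int i, j, weight; };

// Build the sparsified SNN graph S_k^t from k-NN lists: for each k-NN edge
// (i, j), the weight is the number of neighbours shared by i and j (eq. (4));
// edges with weight below the threshold t are dropped.
std::vector<SnnEdge> sparseSnnGraph(const std::vector<std::vector<int>>& knn, int t) {
    const std::size_t n = knn.size();
    std::vector<std::unordered_set<int>> nbr(n);
    for (std::size_t i = 0; i < n; ++i)
        nbr[i].insert(knn[i].begin(), knn[i].end());

    std::vector<SnnEdge> edges;
    for (std::size_t i = 0; i < n; ++i) {
        for (int j : knn[i]) {
            int w = 0;                                   // shared neighbours of i and j
            for (int s : knn[i])
                if (s != j && s != static_cast<int>(i) && nbr[j].count(s)) ++w;
            if (w >= t)                                  // keep only strong edges
                edges.push_back({static_cast<int>(i), j, w});
        }
    }
    return edges;
}
```

The SNN proximity of eq. (5) is then simply k minus the stored weight for the surviving edges, and a point with more than t_d strong neighbours inside a sphere of radius ε_n becomes a core point, as described in the next section.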
2 Clustering Method Based on the SNN Graph and the MNN Distance Measure
The pseudo-code of CWNN is presented as Algorithm 1. First, the SNN graph S_k of the input data set is built and sparsified using a threshold value t. The resultant sparse SNN graph S_k^t is used to construct the set of core points X_c, i.e. the set of data points that have more than t_d strong neighbors within a sphere of radius ε_n. It is assumed that the core points span regions of the data space with a relatively dense distribution of data points. In the next step, a graph G_c of core points is created, in which points u, v ∈ X_c are connected if and only if the SNN distance measure d_Sk(u, v) < ε and the MNN distance measure MNN_Sk(X_c, u, v) < t_m. Here, MNN_Sk denotes the MNN distance measure calculated over the proximity measure derived from the SNN graph S_k (eq. 5). To locate the origins of clusters, CWNN identifies connected components in G_c. Finally, CWNN evaluates each of the non-core data points x_i, and if d(x_i, c) < ε_n, where c ∈ X_c is the core point that is closest to x_i, then the point x_i is assigned to the cluster represented by c. Otherwise, x_i is assigned to the noise cluster C_N. Our approach employs the MNN distance measure for constructing connected components in the graph of core points. To strengthen this measure against background noise, it is calculated over the proximity measure derived from the SNN graph. As the MNN distance measure is effective in identifying local changes in data density (see Section 1.1), gradients in the data density should split the graph of core points into a number of connected components, each one with a more uniform density distribution. Consequently, the clustering structure should reveal more information about the analyzed data set. The number of clusters constructed by CWNN is equal to the number of connected components in the graph of core points. This is controlled by two threshold values: t_m for the MNN distance measure and ε for the SNN distance measure. CWNN does not assume any particular geometry of the clusters. In principle, the shapes of clusters depend only on the shapes of the connected components. The number of points assigned to the noise cluster depends on two factors: the number of core points identified by the algorithm and the threshold value ε_n. The threshold ε_n specifies the maximum distance d(·, ·) between a given point x_i and its nearest core point c which still allows for assigning x_i to the cluster represented by c. The number of core points depends mainly on the distribution of density within the data set. However, setting a broader initial neighborhood
Algorithm 1. The CWNN algorithm
INPUT: set of points X = {x_1, x_2, ..., x_n}; distance metric d(·, ·)
PARAMETERS: k, t, t_d, t_m, ε ∈ N+; ε_n ∈ R+
OUTPUT: set of data clusters Ω = {C_1, C_2, ..., C_k}; noise cluster C_N

G_k = build the k-NN graph of X
S_k = build the SNN graph from the graph G_k
S_k^t = sparsify the SNN graph S_k with the threshold t
X_c = ∅, V_c = ∅
for all x_i ∈ X do
    ρ = number of strong neighbors of the point x_i that lie in a sphere of radius ε_n
    if ρ > t_d then
        X_c = X_c ∪ {x_i}
    end if
end for
for all (x_i, x_j) ∈ X_c × X_c, i < j do
    if [d_Sk(x_i, x_j) < ε] and [MNN_Sk(X_c, x_i, x_j) < t_m] then
        V_c = V_c ∪ {(x_i, x_j), (x_j, x_i)}
    end if
end for
Ω = find the connected components of the graph G_c = (X_c, V_c)
for all x_i ∈ X \ X_c do
    c = find the point x ∈ X_c such that ∀x′ ∈ X_c \ {x} : d(x, x_i) < d(x′, x_i)
    if d(x_i, c) < ε_n then
        C_j = find the cluster C ∈ Ω such that c ∈ C
        C_j = C_j ∪ {x_i}
    else
        C_N = C_N ∪ {x_i}
    end if
end for
(i.e., a higher number of neighbors in the k-NN graph) or decreasing the number of required strong neighbors, t_d, will increase the number of core points. The computational complexity of the first part of CWNN, i.e. the identification of core points, is O(n² log n + nk log k), due to the construction of the k-NN graph and the counting of shared neighbors. The computational complexity of the remaining part of the algorithm is O(m² log m + m·n), where m is the number of core points. However, the number of core points is smaller than the total number of points: m ≤ n. Consequently, the computational complexity of the whole CWNN algorithm is approximately O(n² log n). The memory complexity is O(n²).
3 Experimental Results
Three data sets were used to demonstrate the effectiveness of CWNN. The first one, further called the Chameleon data set, was taken from [4]. The second one is a synthetic three-dimensional test set, further called the tube data set. The third test set, i.e. the microcalcification data set, consists of feature vectors constructed by the
Table 1. Parameters of CWNN used for clustering of the test data sets

Data set                       k     t     td   tm    ε     εn
Chameleon data set             100   75    4    20    25    10.0
Tube data set                  250   180   13   15    115   59.0
Microcalcification data set    1600  1100  275  1900  1050  2.19
authors during work on an algorithm for detecting suspicious lesions in digital mammograms [9]. The parameters used for clustering the test sets are given in Table 1. To set the values of these parameters we used the following heuristic. First, we set the value for the number of neighbors in the k-NN graph. This graph should reveal local properties of the data set. Therefore, we use a small fraction, i.e. between 1% and 2%, of the total number of points for this parameter. The upper value is used for multidimensional data. Next, we construct the histogram of the number of shared neighbors in the k-NN graph and locate its maximum. The first minimum following the maximum corresponds to the number of shared nearest neighbors above the most frequent one. We set the parameter t to a value near this minimum, ensuring that the strong neighborhood relationship connects only truly close points. In the next step, we construct the histogram of the number of strong neighbors. This histogram will usually have a peak corresponding to the background noise, followed by peaks corresponding to clusters. We set the parameter t_d to a value after the peak from the noisy background, ensuring correct noise identification. The value for ε_n is set by evaluating all non-noise points. For each such point we calculate the distance to its furthest strong neighbor. The histogram of these distances is used to identify the most frequent values, and ε_n is set close to them. The parameters ε and t_m are set by inspecting minimal spanning trees of the core points, constructed using d_Sk(·,·) and MNN_Sk(X_c, ·, ·), respectively. Again, we construct histograms of edge lengths in these trees. The minima in these histograms correspond to edges connecting clusters. The first such minimum is usually a good choice for the value of the underlying parameter. Subsequent minima can be used if a more coarse-grained clustering is desirable.
Comparative tests. In [4], four two-dimensional data sets were used for the evaluation of the Chameleon algorithm, three of which contain background noise. We applied CWNN to these three noisy data sets and in each case were able to obtain the correct clustering. For lack of space we report only the result for the hardest case (called DS4 in [4]). The Chameleon data set is pictured in Fig. 2a. The clustering given by CWNN is presented in Fig. 2b. As we can see, CWNN was able to remove the background noise while preserving the bona fide clusters. We should note here that these clusters differ in densities. Furthermore, the density of the rectangular cluster is near the density of the noise. In addition, the clusters lie close to each other. In particular, the triangle-like clusters are nearly adjacent. Nevertheless, CWNN did succeed in clustering this data set. In comparison, according to [4], DBSCAN, a well-known density clustering method, is unable to identify the correct clustering
Fig. 2. (a) Two-dimensional Chameleon data set. (b) Clustering of the Chameleon data set obtained with CWNN algorithm.
in this data set. Another density clustering method, CURE, is also reported to fail on this data. Additional results provided on a web page referenced in [4] show that CLARANS [10], ROCK, and group average hierarchical clustering are all unable to correctly cluster the Chameleon data set. The Chameleon algorithm itself managed to identify the genuine clusters in this data. However, this algorithm has no explicit noise removal technique. Therefore, in addition to the bona fide clusters, Chameleon constructed additional, spurious clusters out of the background noise, which is evident in Fig. 6 in [4].
Tube data set. The tube data set is pictured in Fig. 3a. It consists of two cubical clusters, namely cluster A and cluster B, surrounded by variable-density background noise. The density of data points in the clusters is five times greater than the density of the surrounding noise. The density of the noise itself increases linearly from the right to the left end of the tube, resulting in a 4-fold difference between the ends. The result of clustering the tube data set is pictured in Fig. 3b. The two biggest clusters found by CWNN are depicted in Fig. 3c. CWNN managed to identify 78.8% of the noise points (i.e. 16,207 out of 20,560). From the 577 data points in cluster A, 516 were found (89.4%). In cluster B, out of the 1078 data points, 1075 points were found (99.7%). CWNN was therefore successful in discovering both clusters. Some artificial clusters were created from the background noise near cluster B. This is a consequence of the noise density near the left end of the tube being comparable with the density of cluster A. Therefore, assignment of the whole noisy background to the noise cluster would result in the loss of cluster A. Yet, cluster B was not merged with any of the artificial clusters (see Fig. 3c). The gradient of the data density at the border of cluster B, which is well preserved in the set of core points, increases the value of the MNN distance measure between core points in the cluster and core points in the background. This prevents the merging of artificial clusters with cluster B.
Microcalcification feature vectors data set. The microcalcification data set contains feature vectors describing suspicious regions of interest (ROIs) found in 200 high-resolution digital mammograms from the DDSM database [11]. Analysis of such data sets is an important step in the design and implementation of
Fig. 3. (a) Three-dimensional tube data set containing two clusters enclosed by noisy background. (b) Clustering of the tube data set obtained with CWNN algorithm. (c) Two biggest clusters found by CWNN.
Fig. 4. Example mammogram regions of interest (ROIs), corresponding to feature vectors that belong to different clusters discovered by CWNN
computer aided detection (CAD) systems. Moreover, CAD systems for screening mammography are among the most heavily researched computerized detection techniques, owing to the difficulties in recognizing early cancer symptoms in mammogram images. Thus, the microcalcification data set is an example of a rather important class of biomedical data. Each mammogram ROI is described by 27 pixel-intensity features, such as entropy, contrast, moments of the brightness histogram, and others. As the algorithm used for the initial selection of ROIs was tuned for high sensitivity, a large number of false-positive detections was made. Feature vectors of false-positive ROIs constitute the noise in the data set. Additional details are given in [9]. Out of approximately 93,300 data points in the microcalcification data set, approximately 53,000 (i.e. 57%) were assigned to noise by CWNN. Six clusters were constructed from the remaining data points. Example ROIs corresponding to feature vectors selected randomly from the six discovered data clusters are presented in Fig. 4. As can be seen, clusters 4, 5 and 6 contain flat ROIs characterized by low image contrast. These ROIs differ in average brightness. Cluster no. 1 contains ROIs with linear structures, usually on a dark background. The regions of interest from cluster no. 3 contain similar structures but on a brighter background. Finally, cluster no. 2 contains round, punctate occlusions resembling small microcalcifications.
4 Implementation Notes
We have implemented parallel versions of the most costly routines in CWNN, namely the construction of the k-NN and sparse SNN graphs and the calculation of the MNN distance measure. The parallel versions of these routines were designed for shared memory machines and implemented using the OpenMP standard¹. Parallelization of the k-NN graph construction is straightforward. In particular, the first outer loop of this routine runs over all data points, for each one calculating the distances to the points with greater indices. There are no data dependencies between the iterations of this loop and thus they can be directly split between the threads. In the next step, for each point the distances to the remaining points are sorted in ascending order. The loop performing this sorting can also be parallelized by direct splitting between the threads. The routine for constructing the sparse SNN graph contains two nested loops. The outer loop runs over all data points. In the i-th iteration of the outer loop, the inner loop runs over the nearest neighbors y ∈ NGH(x_i) of the point x_i, assessing for each of them whether x_i ∈ NGH(y) and counting shared neighbors. These operations require read-only access to the k-NN graph and therefore do not impose data dependencies. However, in rare cases threads can compete for write access to the weights matrix of the SNN graph. We efficiently solved this issue by employing a small hash table of locks that protects elements of the weights matrix. This enables splitting of the outer loop between the threads.
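A minimal sketch of the two ingredients described above, written with OpenMP, is given below. This is our own illustration under stated assumptions (dense coordinate and distance arrays, a lock table of 256 entries), not the authors' implementation.

```cpp
#include <cmath>
#include <vector>
#include <omp.h>

// Distance computation for the k-NN graph: the outer loop is split between
// threads; iteration i writes only to the upper-triangular entries dist[i][j],
// j > i, so no synchronisation is needed here.
void pairwiseDistances(const std::vector<std::vector<double>>& x,
                       std::vector<std::vector<double>>& dist) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < n; ++i)
        for (int j = i + 1; j < n; ++j) {
            double d2 = 0.0;
            for (std::size_t c = 0; c < x[i].size(); ++c) {
                const double diff = x[i][c] - x[j][c];
                d2 += diff * diff;
            }
            dist[i][j] = std::sqrt(d2);   // upper triangle only
        }
}

// A small hash table of locks, as mentioned above, protecting concurrent
// updates of the SNN weight matrix (the table size is an assumption).
constexpr int kNumLocks = 256;
omp_lock_t gLocks[kNumLocks];

void initLocks() { for (auto& l : gLocks) omp_init_lock(&l); }

omp_lock_t* lockFor(int i, int j) {
    return &gLocks[(static_cast<unsigned>(i) * 31u + static_cast<unsigned>(j)) % kNumLocks];
}
// Usage inside the parallel SNN loop:
//   omp_set_lock(lockFor(i, j));  weight[i][j] = w;  omp_unset_lock(lockFor(i, j));
```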
Fig. 5. The execution time (a) and speedup (b) of the parallel implementation of CWNN algorithm. The results were obtained on the microcalcification data set.
Calculation of the MNN distance measure requires read-only access to the weighted graph of core points. The edge weights are the distance measures derived from the SNN graph (see Sections 1.2 and 2). Write access is needed only for the array storing the results. Therefore, we can employ a parallelization strategy similar to the one used in distance calculation during construction of the k-NN graph. To illustrate the efficiency of the parallelization scheme, benchmark runs on the microcalcification data set were made. The tests were carried out on the SGI 1
¹ www.openmp.org
Altix platform, equipped with 1.5 GHz Intel Itanium 2 processors and running the Linux operating system. The results are presented in Fig. 5. As we can see, the algorithm scales almost linearly for numbers of processors between 2 and 32.
5 Future Work
In the current setup, CWNN can be applied to various types of data, provided that a distance metric d(·, ·) is available for them. However, the performance of the algorithm in such cases needs further evaluation, which will be the focus of our future research. Another issue to be studied thoroughly is the choice of methods for estimating the initial values of the CWNN parameters. Although our experience shows that, with the help of a simple heuristic, reasonable values for these parameters can be established in a few trial runs, an automatic method would make the algorithm more user-friendly.
Acknowledgements. The authors are grateful to Professor Witold Dzwinel for his valuable comments. This work was partly funded by the Polish Committee for Scientific Research (KBN) grants no. 3T11F01030 and 3T11F01930.
References 1. Ester, M., Kriegel, H., Sander, J., Xu, X.: A density-based algorithm for discovering clusters in large spatial databases with noise. In: Proceedings of the 2nd International Conference on Knowledge Discovery and Data Mining, pp. 226–231. AAAI Press, USA (1996) 2. Ert¨ oz, L., Steinbach, M., Kumar, V.: Finding clusters of different sizes, shapes, and densities in noisy, high dimensional data. In: Proceedings of the Third SIAM International Conference on Data Mining, San Francisco, CA, USA, vol. 47 (2003) 3. Guha, S., Rastogi, R., Shim, K.: Cure: an efficient clustering algorithm for large databases. In: Proceedings of the 1998 ACM SIGMOD international conference on Management of data, pp. 73–84. ACM Press, New York (1998) 4. Karypis, G., Han, E., Kumar, V.: Chameleon: hierarchical clustering using dynamic modeling. IEEE Computer 32(8), 68–75 (1999) 5. Guha, S., Rastogi, R., Shim, K.: Rock: A robust clustering algorithm for categorical attributes. Information Systems 25(5), 345–366 (2000) 6. Jarvis, R., Patrick, E.: Clustering using a similarity measure based on shared near neighbors. IEEE Transactions on Computers 22(11), 1025–1034 (1973) 7. Gowda, K., Krishna, G.: Agglomerative clustering using the concept of mutual nearest neighborhood. Pattern Recognition 10, 105–112 (1978) 8. Aggarwal, C., Hinneburg, A., Keim, D.: On the surprising behaviour of distance metrics in high dimensional space. In: Van den Bussche, J., Vianu, V. (eds.) ICDT 2001. LNCS, vol. 1973, pp. 420–434. Springer, Heidelberg (2000) 9. Boryczko, K., Kurdziel, M.: Recognition of subtle microcalcifications in highresolution mammograms. In: Proceedings of 4th International Conference on Computer Recognition Systems, Advances in Soft Computing, pp. 485–492 (2005)
10. Ng, R., Han, J.: Clarans: A method for clustering objects for spatial data mining. IEEE Transactions on Knowledge and Data Engineering 14(5), 1003–1016 (2002) 11. Heath, M., Bowyer, K., Kopans, D., Kegelmeyer, W., Moore, R., Chang, K., Munishkumaran, S.: Current status of the digital database for screening mammography. In: Digital Mammography, pp. 457–460. Kluwer Academic Publishers, Dordrecht (1998)
Domain Decomposition Techniques for Parallel Generation of Tetrahedral Meshes
Barbara Glut and Tomasz Jurczyk
AGH University of Science and Technology, Kraków, Poland
{glut,jurczyk}@agh.edu.pl
Abstract. We present solutions for dealing with the problem of parallel generation of unstructured meshes for 3D domains. The selected approach is based on a geometric decomposition of the domain, where the input data is given in the form of a surface mesh. The difference between the two presented decomposition techniques lies in the step where the actual partitioning takes place. In the first method the partitioning is obtained using solely the surface mesh, while in the latter a coarse tetrahedral mesh is used. After the decomposition and the creation of an interface mesh (which is a surface mesh in both approaches), the final volume meshes are generated independently for each subdomain. The interface mesh and the subdomain meshes are constructed using a Riemannian metric combined with a control space structure, which makes it possible to generate meshes with varied density and anisotropy [1].
Keywords: Mesh Generation, Geometric Decomposition, Tetrahedral Mesh, Anisotropic Metric.
1 Introduction
In modern simulations of processes with the finite element method, the increasingly complicated models require a very large number of elements in order to achieve sufficiently precise computations. Consequently, such computational tasks are often solved using a parallel approach. However, the parallelization of the solver does not solve the problem completely. The sequential construction of meshes with a large number of elements poses some problems as well, mainly with respect to the memory requirements. It should be noted that the problem of parallel mesh generation is considered much more difficult than the parallelization of the subsequent computation step [2]. An efficient parallel algorithm requires adequate load balancing for the computational nodes while minimizing the communication overhead between the processors. The task of decomposing the domain for the subsequent mesh generation is complicated due to the limited information available at the beginning of this process (which usually includes only the geometric model description). At this point it is usually difficult to properly assess the time required to discretize the subdomains of the created partitioning, which is necessary to achieve the proper
load balancing. An additional complication is often introduced by an irregular density and anisotropy of the mesh in some areas of the discretized model.¹ In recent years much attention has been devoted to the problem of parallel mesh generation and a number of solutions have been proposed [3]. It seems that the key to the successful parallelization of the mesh generation problem is the proper partitioning of the domain into subdomains and a possibly independent discretization of these subdomains. Depending on the method and the order of the interface mesh generation, three classes of methods can be identified [4]:
1. The a priori class includes methods which first create the meshes of the interfaces and then, in parallel, generate the meshes for each subdomain.
2. The a posteriori methods generate the meshes of the subdomains in parallel first. These meshes are then adjusted in a way which assures the consistency of the mesh within the whole domain.
3. The third class contains methods where the interface meshes and the subdomain meshes are generated concurrently.
So far, no methods have been developed which would solve this problem in a satisfactory way for a wide class of 3D models. This is partly due to the fact that even the sequential problem of volume mesh generation for arbitrary domains is difficult enough. A number of different mesh construction techniques are utilized in various generators, which makes it difficult to choose the most advantageous class of parallelization methods for a given task. Additionally, the geometric complexity of the considered models is constantly increasing, which reduces the chances of finding a definite solution for this problem. However, due to the importance of this problem, heuristic solutions applicable to a possibly wide family of models have to be sought.
2 Main Concept of the Proposed Techniques
Two approaches to decomposing the discretized domain into subdomains are presented in this article. For both methods the input data are the boundary surface meshes. The difference between these methods lies in the selection of the moment when the parallelization procedure is executed. Both methods can be categorized as belonging to the a priori class, where the interface meshes are constructed first and then the mesh of each subdomain can be independently generated in parallel. Such an approach has a number of advantages. During the parallel generation of the mesh the communication between the computational nodes is limited. The volume mesh does not need to be stored in the memory of a single computational node. The only data interchanged during the simulation phase is the information about the interface meshes, which have to be compatible. Moreover, the sequential mesh generator can be utilized without any modifications. This technique also assures keeping the initial surface mesh intact, which can be beneficial for the computational process.
¹ Such requirements concerning the shape and density of elements may result, for example, from the computational aspects or the geometric features of the domain.
The studies presented in this article are founded on the mesh generator developed by the authors [1]. This generator constructs meshes using the information gathered in the control space [5]. The concept of a Riemannian metric stored within the adaptive control space structure has a substantial influence on the proposed techniques of mesh decomposition for parallel generation. As a consequence, the developed methods can be successfully used in problems with a high local variation of the density and anisotropy of the mesh, which are often found in contemporary simulations with adaptive solvers.
3 Method I: Decomposition of Surface Mesh (DSM)
The DSM technique (Fig. 1) [8] is based on the geometric decomposition of the domain using the surface meshes only. The surface mesh is partitioned by cutting it with separators, which at this development stage are implemented as planes. Then, the subdomains are closed by the generation of a proper surface mesh on the separators. Finally, the volume mesh can be constructed independently for each of these closed subdomains. The main steps for each closed subdomain are:
1. Selection of the separator.
2. Localization of the intersection of the separator with the surface mesh and determination of the cutting contours.
3. Generation of a surface mesh on the separator (Fig. 1(b)), which in the case of a planar separator requires:
   – construction of a 2D control space taking into account various metric sources,
   – generation of a 2D mesh on the cutting plane,
   – projection of the planar mesh to the 3D space.
4. Closing of the subdomains (Fig. 1(c)).
5. Generation of volume meshes in the subdomains (Fig. 1(d)).
The selection of a separator (with respect to both its shape and placement) is crucial for the final effect of the domain decomposition as well as for the course of the subsequent phases of the method. The selection of the separator should assure a low cut size, proper load balancing and a minimal number of multiply connected interfaces. In the literature, two main techniques are usually proposed for the selection of the cutting plane: along the inertial axis [6] and perpendicular to the longest edge of the bounding cuboid [7]. However, none of these methods guarantees sufficiently good results in the general case, and this problem is considered a subject of further studies. In the presented examples the authors applied cutting using the information about the bounding box. The construction of an interface mesh requires first the localization of the cutting contours of the surface mesh and the separator. These contours are then projected onto the separator plane. In order to generate the mesh of the cutting plane, a special 2D control space is created. The metric in this case is associated with the lengths of edges in the cutting contour, calculated both in the three- and two-dimensional space. Any other available metric sources are also included.
Fig. 1. Subsequent steps of DSM: (a) surface mesh, (b) split, (c) closing (cross-section), (d) final mesh (cross-section). (The cross-section visualization of a mesh is created by removing a set of elements.)
Using the created metric field, the 2D mesh is generated and projected to the 3D space. This technique was described in more detail in [8,9], where different problems regarding the placement of a separator were also considered.
4 Method II: Decomposition of Coarse Volume Mesh (DCVM)
In the second method (Fig. 2) a coarse volume mesh is utilized to partition the discretization domain. This coarse mesh is created as a result of a discretization based on the boundary nodes only. In this technique the separators are purely virtual and their purpose is to guide the mesh refinement in the selected subdomain. The partitioning of the domain is achieved by separating the mesh along the refined fragments of the mesh, which also defines the boundaries of the closed subdomains. The meshes for the closed subdomains are generated independently, as in the first approach. In this method the cost of the sequential part of the procedure increases, but more detailed information about the discretized domain becomes available, which might help to achieve a better decomposition. The subsequent steps of this method are:
1. Generation of a coarse 3D mesh (Fig. 2(a)).
2. Determination of the separator placement.
3. Refinement of the mesh in the vicinity of the separator (Fig. 2(b)).
4. Separation of the subdomains and recognition of the interface surface (Fig. 2(c)).
5. Refinement of the volume meshes in the subdomains (Fig. 2(d)).
The coarse mesh based on the boundary nodes only is created with the utilization of a three-dimensional control space. The contents of this space are determined
Fig. 2. Subsequent steps of DCVM: (a) coarse mesh (cross-section), (b) refinement (cross-section), (c) split (cross-section), (d) final mesh (cross-section).
using the geometry of the model and any additional criteria which may be introduced by the user. As in the DSM, the first problem which has to be solved is determining the placement of the virtual separator. The solutions proposed in the literature for methods starting the decomposition from a coarse mesh usually apply partitioning libraries (like METIS², CHACO³, etc.) [10], neural networks or genetic algorithms [11]. However, in order to better compare both presented methods, the selection of a separator is determined in this work similarly as in the DSM. In the vicinity of the separator the control space (and the metric stored therein) is modified in order to obtain a selective refinement of the mesh in the subsequent step of the procedure. This special control space CS3d_p, used to prepare the coarse mesh for partitioning, is calculated using the main control space CS3d_m. The metric near the separator is copied directly from CS3d_m, and in the other areas of the discretized domain the maximum metric is applied (with an additional smoothing of the metric field). The mesh created with respect to CS3d_p is partitioned along the virtual separator, which at this point is already properly refined. The procedure of the actual mesh partitioning starts with the identification of all faces incident to two tetrahedral blocks belonging to different partitions. In order to reduce the cut size (i.e. the number of faces between the partitions), an additional operation of moving some mesh elements between the partitions is performed. Finally, the mesh elements from different partitions are divided into separate meshes, which requires the duplication of all mesh entities (vertices, edges and faces) forming the interface mesh and some updating of the mesh interconnections (all these operations are local). The volume meshes in the subdomains can then be further refined independently, and the discretization of each subdomain is guided by the main control space CS3d_m.
² http://glaros.dtc.umn.edu/gkhome/views/metis/
³ http://www.cs.sandia.gov/~bahendr/chaco.html
5 Examples
Both proposed methods were inspected for various geometric models and discretizations of their surfaces. The test meshes are shown in Fig. 3. The results of the mesh generation via domain decomposition (with one separator) using both described methods are shown in Figs. 4, 5, 6 and 7. Since the article concentrates on the partitioning method itself, all tests were computed on a single machine. Table 1 presents the numbers of elements in the volume meshes created using the different approaches. For both presented methods the summary number of tetrahedra is similar to that of the sequential generation. The only significant difference between the methods is visible in the number of faces on the interface between the mesh partitions. Table 2 gathers the running costs (for a single 3.2 GHz Intel P4 computer with 1 GB memory) of the subsequent steps of the mesh generation process. Both tested methods allow the expected parallel meshing time to be decreased.⁴ The running times for the DCVM method are somewhat higher than for the DSM method. In this case the increased time is mostly due to the cost of the initial sequential step. The times of the volume mesh construction in the partitioned subdomains using the second method (DCVM) are lower, since the coarse volume meshes for the subdomains are already available. Also, the boundary recovery cost is absent (since this procedure had already been run in the earlier sequential part), which is most visible in the case of the mesh M3 (a non-convex domain). The quality of the meshes (Table 3) obtained using both the first and the second method is very similar and also close to the quality of the meshes generated sequentially.
Fig. 3. Example meshes: (a) M1, (b) M2, (c) M3, (d) M4.
⁴ The given summary time does not include the cost of transferring the partition data between the computation nodes, which may depend on the specific parallel architecture.
Fig. 4. Decomposition of the mesh M1: (a) DSM, (b) DCVM.
Fig. 5. Decomposition of the mesh M2: (a) DSM, (b) DCVM.
Fig. 6. Decomposition of the mesh M3: (a) DSM, (b) DCVM.

Table 1. Number of elements (NFB [10³] – number of boundary faces, NT [10³] – number of tetrahedra, NTi [10³] – number of tetrahedra in the ith partition, NFI – number of faces on the interface (cut size))

       Sequential       DSM                                DCVM
Mesh   NFB     NT       NT1     NT2     NT(1+2)   NFI      NT1     NT2     NT(1+2)   NFI
M1     4.1     427.2    209.1   206.5   415.6     1566     214.2   227.4   441.6     3657
M2     6.5     794.9    380.9   377.0   757.9     2693     396.5   432.0   828.6     7152
M3     7.9     83.9     41.7    44.1    85.8      391      43.1    43.2    86.3      654
M4     31.1    254.6    127.2   119.5   246.7     432      133.1   127.7   260.8     1075
Fig. 7. Decomposition of the mesh M4 (only one of the created subdomains is shown): (a) DSM, (b) DCVM.

Table 2. Generation time (ts [s] – sequential generation time, ti [s] – mesh generation time for the ith subdomain; the summary parallel generation time tsum is estimated as ts + max(ti))

       Sequential   DSM                          DCVM
Mesh   ts           ts     t1      t2      tsum  ts      t1      t2      tsum
M1     42.1         0.4    17.1    16.7    17.5  14.0    15.4    16.7    30.7
M2     80.0         0.5    32.4    33.0    33.6  27.1    29.1    32.5    59.6
M3     9.6          0.2    4.2     6.0     6.2   6.0     2.6     2.7     8.7
M4     25.1         1.0    12.0    11.3    13.0  10.4    9.1     8.7     19.5
Table 3. Quality of the generated meshes (η̄M – average mean ratio of mesh elements calculated in metric space, ηM^min – minimum mean ratio of mesh elements calculated in metric space, μM – average length of mesh edges calculated in metric space [12])

       Sequential                 DSM                        DCVM
Mesh   η̄M      ηM^min   μM       η̄M      ηM^min   μM       η̄M      ηM^min   μM
M1     0.882    0.167    1.037    0.881    0.034    1.018    0.882    0.069    1.037
M2     0.874    0.050    1.031    0.874    0.009    1.008    0.874    0.025    1.031
M3     0.858    0.035    1.034    0.858    0.092    1.025    0.846    0.001    1.046
M4     0.875    0.109    1.040    0.874    0.153    1.027    0.875    0.065    1.040

6 Conclusions
The DSM method, based on the generation of a mesh on a separator surface, has the benefit of a low cost of the sequential part and a small cut size (for a given selection of a separator). However, this technique is sensitive to the proper placement of the separator [8]. If the angle between the separator and the surface mesh is too small (which can be difficult to avoid for complex models), problems may arise with the correct projection of contour nodes onto the separator surface. Moreover, the quality of the volume elements generated in such areas may be unacceptably low.
The second method (DCVM) overcomes this problem, and the quality of the generated mesh elements is unaffected by the selection of the (virtual) separator placement. The separator type for this method can also be easily extended to a non-planar surface. Moreover, due to the availability of the coarse volume mesh during the partitioning phase, the predicted number of elements in the final mesh for various areas of the discretized domain can be assessed with higher accuracy. As a result, better balancing of the decomposition may be achieved. Unfortunately, these benefits are combined with an increased cost of the sequential part of the algorithm and a higher cut size.
7 Further Research Directions
The computational and communication costs are different for various computer architectures. Because of this, it is difficult to select the proper parallelization strategy applicable to different architectures. From this point of view, the approach where each subdomain becomes an individual object to discretize appears to be advantageous. However, this thesis has to be tested and verified for a number of different architecture configurations and tools. Further studies are also required with respect to the localization of the optimal placement and shape of the separator. This task is correlated with the problem of assessing the predicted number of volume elements in the final mesh based only on the number of boundary elements. The authors have investigated a similar problem for the two-dimensional case [13]. Further studies are, however, necessary for three-dimensional meshes, where the prediction will additionally utilize the information from the control space.
Acknowledgments. The partial support of the AGH Grant No. 11.11.120.777 is gratefully acknowledged.
References 1. Glut, B., Jurczyk, T., Kitowski, J.: Anisotropic Volume Mesh Generation Controlled by Adaptive Metric Space. In: AIP Conf. Proc. NUMIFORM 2007, Materials Processing and Design: Modeling, Simulation and Applications, Porto, Portugal, June 17-21, vol. 908, pp. 233–238 (2007) 2. Tu, T., Yu, H., Ramirez-Guzman, L., Bielak, J., Ghattas, O., Ma, K.-L., O’Hallaron, D.R.: From Mesh Generation to Scientific Visualization: An End-toEnd Approach to Parallel Supercomputing. In: Proc. of SC 2006, Tampa, FL (2006) 3. Chrisochoides, N.: A survey of parallel mesh generation methods, http://www.cs.wm.edu/∼ nikos/pmesh survey.pdf 4. Cougny, H.L., Shepard, M.S.: Parallel volume meshing face removals and hierarchical repartitioning. Comput. Methods Appl. Mech. Engrg. 174, 275–298 (1999) 5. Jurczyk, T., Glut, B.: Adaptive Control Space Structure for Anisotropic Mesh Generation. In: Proc. of ECCOMAS CFD 2006 European Conference on Computational Fluid Dynamics, Egmond aan Zee, The Netherlands (2006)
6. Ivanov, E., Andr¨ a, H., Kudryavtsev, A.N.: Domain decomposition approach for automatic parallel generation of 3D unstructured grids. In: Proc. of ECCOMAS CFD 2006 European Conference on Computational Fluid Dynamics, Egmond aan Zee, The Netherlands (2006) 7. Larwood, B.G., Weatherill, N.P., Hassan, O., Morgan, K.: Domain decomposition approach for parallel unstructured mesh generation. Int. J.Numer. Meth. Engng. 58, 177–188 (2003) 8. Glut, B., Jurczyk, T., Breitkopf, P., Rassineux, A., Villon, P.: Geometry Decomposition Strategies for Parallel 3D Mesh Generation. In: Proc. of Int. Conf. on Computer Methods and Systems CMS 2005, Krak´ ow, Poland, vol. 1, pp. 443–450 (2005) 9. Jurczyk, T., Glut, B., Breitkopf, P.: Parallel 3D Mesh Generation using Geometry Decomposition. In: AIP Conf. Proc. NUMIFORM 2007, Materials Processing and Design: Modeling, Simulation and Applications, Porto, Portugal, June 17-21, vol. 908, pp. 1579–1584 (2007) 10. Ito, Y., Shih, A.M., Erukala, A.K., Soni, B.K., Chernikov, A., Chrisochoides, N.P., Nakahashi, K.: Parallel unstructured mesh generation by an advancing front method. Mathematics and Computers in Simulation 75, 200–209 (2007) 11. Sziveri, J., Seale, C.F., Topping, B.H.V.: An enhanced parallel sub-domain generation method for mesh partitioning in parallel finite element analysis. Int. J. Numer. Meth. Engng. 47, 1773–1800 (2000) 12. Jurczyk, T.: Efficient Algorithms of Automatic Discretization of Non-Trivial Three-Dimensional Geometries and its Object-Oriented Implementation. PhD thesis, AGH University of Science and Technology, Krak´ ow, Poland (2007), http://home.agh.edu.pl/jurczyk/papers/phd-jurczyk.pdf 13. Jurczyk, T., Glut, B.: Organization of the Mesh Structure. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3037, pp. 646–649. Springer, Heidelberg (2004)
The Complete Flux Scheme for Spherically Symmetric Conservation Laws
J.H.M. ten Thije Boonkkamp and M.J.H. Anthonissen
Eindhoven University of Technology, Department of Mathematics and Computer Science, PO Box 513, 5600 MB Eindhoven, The Netherlands
{j.h.m.tenthijeboonkkamp,m.j.h.anthonissen}@tue.nl
Abstract. We apply the finite volume method to a spherically symmetric conservation law of advection-diffusion-reaction type. For the numerical flux we use the so-called complete flux scheme. In this scheme the flux is computed from a local boundary value problem for the complete equation, including the source term. As a result, the numerical flux is the superposition of a homogeneous flux and an inhomogeneous flux. The resulting scheme is second order accurate, uniformly in the Peclet numbers. Keywords: finite volumes, advection diffusion equation, complete flux scheme.
1 Introduction
Many problems in physics and engineering can be modelled using conservation laws. These laws lead in general to a system of partial differential equations that cannot be solved analytically. Finite volume methods are a popular choice to discretise these equations, because they feature a discrete conservation property: the computational domain is divided into control volumes and on each volume a discrete conservation law holds. In this paper we study a finite volume method for three-dimensional spherically symmetric steady conservation laws. This type of equation arises, e.g., in combustion theory, where the study of laminar spherical flames is useful for finding parameters such as burning velocity or flame curvature in premixed combustion [1]. Our model problem includes advection, diffusion and reaction terms and we shall develop a numerical scheme that is second order accurate for all flow conditions. This means that the scheme should always retain its high accuracy, unlike, e.g., standard exponentially fitted schemes, which are second order accurate for diffusion dominated flows but reduce to the first order upwind scheme when the advection term becomes large. Additionally the proposed scheme does not produce spurious oscillations for advection dominated flows, which is a well-known flaw of standard central discretisations. High accuracy and the absence of wiggles are favourable properties that may also be achieved by using high resolution schemes such as flux limiting or
(weighted) essentially nonoscillatory (ENO) methods [4]. These techniques lead to larger discretisation stencils however which is disadvantageous. The method we present uses direct neighbours only. Our algorithm is an extension of the finite volume methods for Cartesian grids introduced in [2,5,6] to spherically symmetric conservation laws. We use an exponential scheme for computing the numerical fluxes. The approximation for the flux is based on the complete differential equation. This implies that we also include the source term in the numerical fluxes. Manzini and Russo [3] also present a finite volume method for advectiondominated problems that is second-order accurate away from boundary and internal layers. They pay special attention to the construction of the numerical advective fluxes in order to prevent numerical oscillations. This goal is achieved by a sophisticated reconstruction algorithm for cell gradients and a velocitybiased mixing of upwind and downwind contributions. Their scheme contains a nonlinear term for shock capturing. This paper is organized as follows. In Section 2, we formulate a stationary advection-diffusion-reaction equation, introduce control volumes and formulate a second order discrete conservation law. In Section 3, we derive an expression for the numerical flux that is second order accurate for all flow conditions. In Section 4, we combine the discrete conservation law with the numerical flux and apply the resulting scheme to a spherically symmetric boundary value problem. We show numerical results using both the homogeneous and the complete flux scheme. By means of Richardson extrapolation we verify the order of accuracy of the finite volume scheme for different flow conditions.
2 Finite Volume Discretization
In this section we outline the finite volume method (FVM) for three-dimensional, spherically symmetric conservation laws. Consider the following steady conservation law of advection-diffusion-reaction type, i.e., ∇·(mϕ − Γ ∇ϕ) = s, (1) where m is the mass flux, Γ ≥ Γmin > 0 a diffusion/conduction coefficient and s a (chemical) source term. The unknown ϕ can be, e.g., the temperature or the concentration of a species in a reacting mixture. The parameters Γ and s are usually (complicated) functions of the unknown ϕ, however, for the sake of discretisation we will consider these as given functions of the spatial coordinate x. Equation (1) has to be coupled with the flow equations, i.e., the continuity equation and the momentum equations. The former reads ∇·m = 0.
(2)
Associated with (1), we introduce the flux vector f , defined by f := mϕ − Γ ∇ϕ.
(3)
Equation (1) then simply reduces to ∇·f = s. In a FVM we cover the domain with a finite number of control volumes Ω_j (j = 1, 2, ..., N) and impose the integral form of the conservation law for each control volume, i.e.,

∫_{∂Ω_j} f·n dS = ∫_{Ω_j} s dV,     (4)
where n is the outward unit normal on the boundary ∂Ω_j. Next, we need numerical approximations for the integrals in (4). In the following, we assume the problem to be spherically symmetric, i.e., ϕ = ϕ(r), and likewise for all other variables, and moreover, f = f(r) e_r with e_r the first basis vector in spherical coordinates. We introduce a spatial grid {r_j} of (uniform) grid size Δr. As control volumes we choose the spherical shells Ω_j := (r_{j−1/2}, r_{j+1/2}) with r_{j+1/2} := ½(r_j + r_{j+1}). Then, the surface integral in (4) reduces to

∫_{∂Ω_j} f·n dS = ∫_{r=r_{j+1/2}} f·e_r dS − ∫_{r=r_{j−1/2}} f·e_r dS = 4π ( r²_{j+1/2} f(r_{j+1/2}) − r²_{j−1/2} f(r_{j−1/2}) ).     (5)

For the approximation of the volume integral in (4) we apply the midpoint rule, to find

∫_{Ω_j} s dV ≐ (4/3) π ( r³_{j+1/2} − r³_{j−1/2} ) s_j,     (6)
with s_j := s(r_j). Combining (4), (5) and (6) and using the relation x³ − y³ = (x − y)(x² + xy + y²), we obtain the second order discrete conservation law

r²_{j+1/2} F_{j+1/2} − r²_{j−1/2} F_{j−1/2} = Δr ( r_j² + (1/12) Δr² ) s_j,     (7)

where F_{j+1/2} is the numerical flux at the cell interface approximating f(r_{j+1/2}). Finally, the FVM has to be completed with the derivation of an expression for the numerical flux.
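To make the role of (7) concrete, the sketch below (our own illustration; the flux routine is a placeholder to be supplied by the scheme derived in the next section, and the boundary fluxes are assumed to be provided by the same routine) evaluates the residual of the discrete conservation law on a uniform grid.

```cpp
#include <functional>
#include <vector>

// Residual of the discrete conservation law (7) for cells j = 0..N-1 on the
// uniform grid r_j = r0 + j*dr.  numFlux(j) must return the numerical flux
// F_{j+1/2} at interface r_{j+1/2}; source(r) returns s(r).
std::vector<double> fvResidual(double r0, double dr, int N,
                               const std::function<double(int)>& numFlux,
                               const std::function<double(double)>& source) {
    std::vector<double> res(N);
    for (int j = 0; j < N; ++j) {
        const double rj = r0 + j * dr;
        const double rw = rj - 0.5 * dr;   // r_{j-1/2}
        const double re = rj + 0.5 * dr;   // r_{j+1/2}
        res[j] = re * re * numFlux(j) - rw * rw * numFlux(j - 1)
               - dr * (rj * rj + dr * dr / 12.0) * source(rj);
    }
    return res;
}
```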
3 Derivation of the Numerical Flux
Our objective in this section is to derive an expression for the numerical flux that is uniformly second order accurate in the grid size, i.e., the discretisation error should always be second order for all flow regimes in combination with a source term of arbitrary strength. We adopt the following notation: variables defined in the grid points r_j and r_{j+1} are indicated with the subscripts C and E, respectively, and variables at the interface r_{j+1/2} by the subscript e. The derivation of the expression for the numerical flux F_e at the eastern cell interface r_e located between the grid points r_C and r_E is based on the following model boundary value problem (BVP) for the unknown ϕ:

(1/r²) ( r² ( m ϕ − Γ ϕ′ ) )′ = s,   r_C < r < r_E,     (8a)
ϕ(r_C) = ϕ_C,   ϕ(r_E) = ϕ_E,     (8b)
where the prime (′) denotes differentiation with respect to r. In the derivation that follows, we assume that

M := r² m = Const > 0   for r ∈ (r_C, r_E).     (9)
Note that the condition M = Const is a direct consequence of the continuity equation (2). The diffusion coefficient Γ and the source term s are arbitrary sufficiently smooth functions of r. The (scalar) flux corresponding to (8) reads

f := m ϕ − Γ ϕ′.     (10)
To derive the expression for F_e we carry out the following procedure:
1. Derive the integral expression for ϕ(r) from the inhomogeneous BVP (8).
2. Derive the integral representation for f(r_e) from (10).
3. Approximate all integrals involved.
In the following, we need the variables D, λ, Λ and S, defined by

D(r) := Γ(r) r²,   λ(r) := M / D(r),   Λ(r) := ∫_{r_e}^{r} λ(η) dη,   S(r) := ∫_{r_e}^{r} η² s(η) dη.     (11)

The variable Λ(r) is called the Peclet integral. Substituting (10) in (8a) and integrating the resulting equation we obtain the following integral balance

r² f(r) − r² f(r_e) = S(r).     (12)

Using the definitions of D and Λ in (11), it is clear that the expression for the flux can be rewritten as

r² f(r) = −D ( ϕ e^{−Λ} )′ e^{Λ}.     (13)

Inserting this expression in (12) and once more integrating we obtain the following expression for the flux f(r_e):

r² f(r_e) = (r² f^{(h)})(r_e) + (r² f^{(i)})(r_e),     (14a)

(r² f^{(h)})(r_e) = − ( e^{−Λ(r_E)} ϕ_E − e^{−Λ(r_C)} ϕ_C ) / ∫_{r_C}^{r_E} D^{−1} e^{−Λ} dr,     (14b)

(r² f^{(i)})(r_e) = − ∫_{r_C}^{r_E} D^{−1} S e^{−Λ} dr / ∫_{r_C}^{r_E} D^{−1} e^{−Λ} dr,     (14c)

where (r² f^{(h)})(r_e) and (r² f^{(i)})(r_e) are the homogeneous and inhomogeneous part, corresponding to the homogeneous and particular solution of (8), respectively. We introduce some notation. ⟨a, b⟩ denotes the usual inner product of two functions a(r) and b(r) defined on (r_C, r_E), i.e.,

⟨a, b⟩ := ∫_{r_C}^{r_E} a(r) b(r) dr.     (15)
Fig. 1. The Bernoulli function B(z)
For a generic variable v(r) defined on (r_C, r_E) we indicate the geometric average (of v_C and v_E) and the harmonic average by ṽ_e and v̂_e, respectively, i.e.,

\tilde{v}_e := \sqrt{v_C v_E}, \qquad \frac{1}{\hat{v}_e} := \frac{1}{\Delta r}\,\langle v^{-1}, 1\rangle.    (16)

Consider the expression for the homogeneous flux. Assume first that Γ(r) = Const on (r_C, r_E). In this case expression (14b) reduces to

r^2 f^{(h)}(r_e) = \frac{\tilde{D}_e}{\Delta r}\bigl(B(-P_e)\varphi_C - B(P_e)\varphi_E\bigr),    (17)

where P_e is the Peclet number defined by

P_e := \frac{M\,\Delta r}{\tilde{D}_e}.    (18)

Furthermore, B(z) is the Bernoulli function, defined by

B(z) := \frac{z}{e^z - 1},    (19)

see Figure 1. For the constant coefficient homogeneous flux, i.e., Γ(r) and M constant on (r_C, r_E), we introduce the notation

\bigl(r^2 f^{(h)}\bigr)(r_e) = F^{h}\bigl(\tilde{D}_e/\Delta r,\, P_e;\, \varphi_C, \varphi_E\bigr),    (20)

to denote the dependence of (r^2 f^{(h)})(r_e) on the parameters D̃_e/Δr and P_e and on the function values ϕ_C and ϕ_E. In the general case, when Γ(r) is an arbitrary function of r, we can rewrite the homogeneous flux in (14b) as

r^2 f^{(h)}(r_e) = F^{h}\bigl(\hat{D}_e/\Delta r,\, \langle\lambda, 1\rangle;\, \varphi_C, \varphi_E\bigr).    (21)

Thus, the flux can be written as the constant coefficient flux with D̃_e and P_e replaced by D̂_e and ⟨λ, 1⟩, respectively.
Next, consider the inhomogeneous flux. Assume first that λ(r) = Const on (r_C, r_E) and define P := λΔr. Substituting the expression for S(r) in (14c) and changing the order of integration, we find the following alternative representation for the inhomogeneous flux

r^2 f^{(i)}(r_e) = \Delta r \int_{r_C}^{r_E} G(\sigma(r); P)\,r^2 s(r)\,\mathrm{d}r, \qquad \sigma(r) := \frac{r - r_C}{\Delta r},    (22)

where σ(r) is the normalized coordinate on (r_C, r_E) and where G(σ; P) is the Green's function for the flux. It is given by

G(\sigma; P) = \begin{cases} \dfrac{1 - e^{-P\sigma}}{1 - e^{-P}} & \text{for } 0 \le \sigma \le \tfrac{1}{2}, \\[1ex] -\dfrac{1 - e^{P(1-\sigma)}}{1 - e^{P}} & \text{for } \tfrac{1}{2} < \sigma \le 1; \end{cases}    (23)

see Figure 2. Note that G(σ; P) relates the flux to the source term and is different from the usual Green's function, which relates the solution to the source term. If we furthermore assume that s(r) = Const on (r_C, r_E), relation (22) reduces to

r^2 f^{(i)}(r_e) = \Delta r\,\bigl(\tfrac{1}{2} - W(P)\bigr)\,r_C^2\,s + O(\Delta r^2),    (24)

where W(z) is a weighting function, defined by

W(z) := \frac{e^z - 1 - z}{z\,(e^z - 1)};    (25)

see Figure 3. From both figures, it is clear that the inhomogeneous flux is only of importance for advection dominated flow, i.e., |P| ≫ 1, in combination with a large source term, and in this case the upwind value s_C for s(r) should be taken.
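The functions B(z), W(z) and G(σ; P) defined in (19), (25) and (23) are straightforward to evaluate, but B and W have removable singularities at z = 0 (B(0) = 1, W(0) = 1/2) and G degenerates for P → 0, so a practical implementation should switch to the limit values for small arguments. The following Python sketch is only an illustration of one numerically safe way to code them; it is not taken from the paper.

import math

def bernoulli_B(z: float) -> float:
    """Bernoulli function B(z) = z / (e^z - 1), Eq. (19); B(0) = 1."""
    if abs(z) < 1.0e-8:
        return 1.0 - 0.5 * z                    # leading terms of the Taylor expansion
    return z / math.expm1(z)                    # expm1 avoids cancellation for small z

def weight_W(z: float) -> float:
    """Weighting function W(z) = (e^z - 1 - z) / (z (e^z - 1)), Eq. (25); W(0) = 1/2."""
    if abs(z) < 1.0e-8:
        return 0.5 - z / 12.0                   # W(z) ~ 1/2 - z/12 near z = 0
    em1 = math.expm1(z)
    return (em1 - z) / (z * em1)

def green_G(sigma: float, P: float) -> float:
    """Green's function for the flux, Eq. (23), for 0 <= sigma <= 1."""
    if abs(P) < 1.0e-12:                        # diffusive limit: G -> sigma or sigma - 1
        return sigma if sigma <= 0.5 else sigma - 1.0
    if sigma <= 0.5:
        return (-math.expm1(-P * sigma)) / (-math.expm1(-P))
    return math.expm1(P * (1.0 - sigma)) / (-math.expm1(P))

Evaluating bernoulli_B and weight_W on a grid of z values reproduces the curves of Figures 1 and 3.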
Fig. 2. Green's function G(σ; P) for several values P > 0 (curves shown for P = 0.01, 1, 5, 10; horizontal axis σ)
Fig. 3. The weighting function W (z)
For arbitrary functions Γ(r) and s(r) we have a similar representation for the inhomogeneous flux, i.e.,

r^2 f^{(i)}(r_e) = \int_{r_C}^{r_E} G\bigl(\sigma(r); \langle\lambda, 1\rangle\bigr)\,r^2 s(r)\,\mathrm{d}r,    (26a)

with G(σ; P) defined in (23) and where the normalized coordinate σ(r) is defined by

\sigma(r) := \int_{r_C}^{r} \lambda(\eta)\,\mathrm{d}\eta \Big/ \langle\lambda, 1\rangle.    (26b)

Note that λ(r) > 0, implying that σ(r) is a monotonically increasing function on (r_C, r_E). Expanding s(r) in a Taylor series, we can also evaluate the integral in (26a), to find

r^2 f^{(i)}(r_e) = \Delta r\,\bigl(\tfrac{1}{2} - W(\langle\lambda, 1\rangle)\bigr)\,r_C^2\,s_C + O(\Delta r^2).    (27)

Summarizing, we have the exact representation (21) for the homogeneous flux and the second order approximation (27) for the inhomogeneous flux. Both expressions hold for arbitrary Γ(r) and s(r). Since D̂_e/Δr = M/⟨λ, 1⟩, the inner product ⟨λ, 1⟩ is the only integral that remains to be approximated. Straightforward integration and applying the mean value theorem of integration gives

\langle\lambda, 1\rangle = \frac{M\,\Delta r}{\Gamma(r^*)\,\tilde{r}_e^2}, \qquad r^* \in (r_C, r_E).    (28)

Using the approximation Γ(r^*) = Γ̃_e + O(Δr) we obtain ⟨λ, 1⟩ = P̃_e + O(Δr^2). Inserting these approximations in the expressions (21) and (27) and omitting O(Δr^2)-terms we obtain the following result for the numerical flux:

(r^2 F)_e = (r^2 F^{(h)})_e + (r^2 F^{(i)})_e,    (29a)
(r^2 F^{(h)})_e = F^{h}\bigl(\tilde{D}_e/\Delta r,\, \tilde{P}_e;\, \varphi_C, \varphi_E\bigr),    (29b)
(r^2 F^{(i)})_e = \Delta r\,\bigl(\tfrac{1}{2} - W(\tilde{P}_e)\bigr)\,r_C^2\,s_C,    (29c)

which is a second order approximation of (14).
4 Numerical Schemes and Example
Combining the expressions (29) for the numerical flux with the discrete conservation law (7) we can derive two numerical schemes, i.e., the complete flux (CF) scheme and the homogeneous flux (HF) scheme, for which we only take into account the homogeneous part of the flux. We apply these schemes to a model BVP to investigate their performance for both diffusion dominated and advection dominated flow. Substituting (29) in (7) we obtain the numerical scheme

-a_{W,j}\varphi_{j-1} + a_{C,j}\varphi_j - a_{E,j}\varphi_{j+1} = b_{W,j}\,s_{j-1} + \bigl(b_{C,j} + \Delta r\,(r_j^2 + \tfrac{1}{12}\Delta r^2)\bigr)\,s_j,    (30a)

with coefficients a_{C,j} etc., given by

a_{W,j} = \frac{\tilde{D}_{j-1/2}}{\Delta r}\,B(-P_{j-1/2}), \qquad a_{E,j} = \frac{\tilde{D}_{j+1/2}}{\Delta r}\,B(P_{j+1/2}), \qquad a_{C,j} = a_{W,j} + a_{E,j},    (30b)
b_{W,j} = \Delta r\,\bigl(\tfrac{1}{2} - W(P_{j-1/2})\bigr)\,r_{j-1}^2, \qquad b_{C,j} = \Delta r\,\bigl(-\tfrac{1}{2} + W(P_{j+1/2})\bigr)\,r_j^2.    (30c)
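To make the structure of (30) concrete, the sketch below assembles the tridiagonal system and solves it on a uniform grid. It is an illustration written for this text and not the authors' code: it assumes Dirichlet values at both ends of the interval (so it does not reproduce the Neumann condition of the example below), and the functions Gamma and s as well as all parameter values in the usage example are placeholders.

import numpy as np
from scipy.linalg import solve_banded

def B(z):                                        # Bernoulli function, Eq. (19)
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < 1.0e-8
    safe = np.where(small, 1.0, z)
    return np.where(small, 1.0 - 0.5 * z, safe / np.expm1(safe))

def W(z):                                        # weighting function, Eq. (25)
    z = np.asarray(z, dtype=float)
    small = np.abs(z) < 1.0e-8
    safe = np.where(small, 1.0, z)
    return np.where(small, 0.5 - z / 12.0, (np.expm1(safe) - safe) / (safe * np.expm1(safe)))

def cf_scheme(r, M, Gamma, s, phi_left, phi_right, homogeneous_only=False):
    """Assemble and solve the CF scheme (30a)-(30c); HF scheme if homogeneous_only=True."""
    dr = r[1] - r[0]
    D = np.sqrt(Gamma(r[:-1]) * Gamma(r[1:])) * r[:-1] * r[1:]   # D~ at the interfaces
    P = M * dr / D                                               # interface Peclet numbers
    sj = s(r)
    aW = D[:-1] / dr * B(-P[:-1])                # Eq. (30b)
    aE = D[1:] / dr * B(P[1:])
    aC = aW + aE
    bW = dr * (0.5 - W(P[:-1])) * r[:-2] ** 2    # Eq. (30c)
    bC = dr * (-0.5 + W(P[1:])) * r[1:-1] ** 2
    if homogeneous_only:                         # HF scheme: b_W = b_C = 0
        bW = np.zeros_like(bW)
        bC = np.zeros_like(bC)
    rhs = bW * sj[:-2] + (bC + dr * (r[1:-1] ** 2 + dr ** 2 / 12.0)) * sj[1:-1]
    rhs[0] += aW[0] * phi_left                   # fold the boundary values into the RHS
    rhs[-1] += aE[-1] * phi_right
    n = len(r) - 2
    ab = np.zeros((3, n))                        # banded storage of the tridiagonal matrix
    ab[0, 1:] = -aE[:-1]
    ab[1, :] = aC
    ab[2, :-1] = -aW[1:]
    phi = solve_banded((1, 1), ab, rhs)
    return np.concatenate(([phi_left], phi, [phi_right]))

r = np.linspace(0.1, 1.0, 41)
phi = cf_scheme(r, M=1.0,
                Gamma=lambda x: 1.0e-3 * (1.0 + np.sqrt(x)),
                s=lambda x: 1.0 / (1.0 + (2.0 * x - 1.0) ** 2),
                phi_left=5.0, phi_right=0.0)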
For the HF scheme we have to take b_{W,j} = b_{C,j} = 0. Note that both schemes give rise to a tridiagonal system Aϕ = Bs, which can be very efficiently solved using LU-decomposition. Consider the following BVP

\frac{1}{r^2}\,\frac{\mathrm{d}}{\mathrm{d}r}\Bigl(M\varphi - \Gamma r^2\,\frac{\mathrm{d}\varphi}{\mathrm{d}r}\Bigr) = s, \quad 0 < r < 1,    (31a)
\varphi(0) = 5, \qquad \frac{\mathrm{d}\varphi}{\mathrm{d}r}(1) = 0,    (31b)

with Γ(r) and s(r) given by

\Gamma(r) = \Gamma_{\min}\bigl(1 + \sqrt{r}\bigr), \qquad s(r) = \frac{s_{\max}}{1 + s_{\max}(2r - 1)^2}.    (31c)
The diffusion coefficient Γ(r) is a smoothly varying function whereas the source term has a sharp peak, introducing a steep interior layer near r = 1/2; see Figure 4. To assess the order of accuracy of both schemes, we compute numerical approximations of ϕ(1/2) with increasingly smaller grid sizes and apply Richardson extrapolation to these results. More precisely, let

\varphi(\tfrac{1}{2}) = \varphi^h + \varepsilon^h = \varphi^{h/2} + \varepsilon^{h/2} = \varphi^{h/4} + \varepsilon^{h/4}, \qquad h = \Delta r,    (32)

where ϕ^h denotes the numerical approximation of ϕ(1/2) computed with grid size h and ε^h the corresponding (global) discretisation error, etc. Assuming the expansion

\varepsilon^h = C h^p + O(h^q), \quad q > p,    (33)

we can derive the following expression for the order of accuracy p:

2^p \doteq \frac{\varphi^{h/2} - \varphi^h}{\varphi^{h/4} - \varphi^{h/2}} =: q^h.    (34)
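Given the three approximations of ϕ(1/2) on grids h, h/2 and h/4, the observed order follows directly from (34); the small Python sketch below (illustrative only, with a manufactured second order error) computes q^h and the corresponding order p = log2 q^h.

import math

def observed_order(phi_h: float, phi_h2: float, phi_h4: float):
    """Richardson estimate (34): returns q^h and the order p = log2(q^h)."""
    q_h = (phi_h2 - phi_h) / (phi_h4 - phi_h2)
    return q_h, math.log2(q_h)

# manufactured example with error ~ C*h^2, so q^h = 4 and p = 2:
exact, C, h = 1.0, 0.3, 0.1
approximations = [exact + C * (h / 2 ** i) ** 2 for i in range(3)]
print(observed_order(*approximations))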
Fig. 4. Solution of the model BVP (31). Parameter values are: M = 1, Γmin = 10^{-7} and smax = 10^3.

Table 1. The q^h-values for the complete flux scheme and the homogeneous flux scheme as a function of N = 1/h. Parameter values are: M = 1 and smax = 10^3.

            Γmin = 10^{-1}        Γmin = 10^{-7}
    N        HF       CF          HF       CF
    10       2.82     2.57        2.37     2.99
    20       5.56     5.65        2.70     6.59
    40       10.03    12.67       2.31     18.08
    80       4.92     5.24        2.03     6.07
    160      4.07     3.97        2.01     4.07
    320      4.02     3.96        2.00     4.02
    640      4.01     3.98        2.00     4.00
    1280     4.02     4.01        2.00     4.00
Table 1 lists the values of q^h for both schemes. Clearly, when diffusion is dominant, i.e., for Γmin = 10^{-1}, both schemes are second order accurate (q^h → 4). Thus both schemes perform equally well, in agreement with the previous observation that the inhomogeneous flux is only of importance for advection dominated flow. On the other hand, for dominant advection, i.e., Γmin = 10^{-7}, the homogeneous flux scheme reduces to first order (q^h → 2), whereas the complete flux scheme is still second order.
5 Conclusions and Future Research
In this paper we have derived the complete flux scheme for spherically symmetric conservation laws of advection-diffusion-reaction type. The numerical flux is computed from a local BVP for the entire equation, including the source term. All parameters are assumed to be arbitrary, sufficiently smooth functions of the radial coordinate r. As a result, the numerical flux is the superposition of a homogeneous flux, corresponding to the homogeneous solution of the BVP, and an inhomogeneous flux, corresponding to the particular solution. The resulting scheme is second order accurate, uniformly in the Peclet number, does not
introduce numerical oscillations near a steep layer and has a simple three-point stencil. Directions for further research are the following. A first obvious extension is to apply the scheme to time dependent conservation laws. An option would be to include the time derivative in the inhomogeneous flux and subsequently apply a suitable time integration method. A second extension is to apply the scheme to the conservation laws of a spherically symmetric flame. Since our model BVP has an interior layer reminiscent of a flame front, it is expected that the CF scheme will give accurate results for laminar flames. The major problem in this case is to construct fast and robust iterative methods to solve the nonlinear, discrete system. A final extension the authors have in mind is to simulate time dependent, i.e., expanding or imploding, spherical flames. All these issues will be the subject of future research.
Computer Simulation of the Anisotropy of Fluorescence in Ring Molecular Systems: Tangential vs. Radial Dipole Arrangement
Pavel Heřman¹, Ivan Barvík², and David Zapletal¹,³
¹ Department of Physics, Faculty of Education, University of Hradec Králové, Rokitanského 62, CZ-500 03 Hradec Králové, Czech Republic, [email protected]
² Institute of Physics of Charles University, Faculty of Mathematics and Physics, CZ-12116 Prague, Czech Republic
³ Department of Mathematics, University of Pardubice, Studentská 95, CZ-53210 Pardubice, Czech Republic
Abstract. Time dependence of the anisotropy of fluorescence in recently discovered cyclic antenna units of the BChl photosystem is modeled. Interaction with a bath and a static disorder, here modeled as uncorrelated Gaussian disorder in the transfer integrals, is taken into account. A parallel computer environment is used because one is forced to recalculate every physical quantity for several thousands of different realizations of disorder. Results for the ring LH4 with radial optical transition dipole arrangement are compared with those for the ring LH2 with the tangential one. The difference between LH2 and LH4 results for the static disorder in transfer integrals has an opposite sign in comparison with that for the static disorder in local energies. Equivalent differences are shifted to smaller times for the stronger interaction with a bath.
1 Introduction
The most common antenna complexes in purple bacteria are the light-harvesting complexes LH1 and LH2. Some bacteria express also other types of LH complexes such as the B800-820 LH3 complex in Rhodopseudomonas acidophila strain 7050 or the B800 LH4 complex in Rhodopseudomonas palustris [1]. The general organization of the LH2 and LH4 complexes is the same: a ring-shaped structure is formed from cyclically repeated identical subunits. However, the symmetries of these rings are different: LH2 and LH3 are usually nonameric but LH4 is octameric. The other difference is the presence of four bacteriochlorophyll (BChl) molecules per repeating unit in LH4 rather than three ones found in LH2 and LH3. The most striking difference is the occurrence of an additional Bchl-a ring in the LH4 complex, the B800-2 ring, at a position approximately halfway between the densely packed B-a/B-b ring and the B800-1 ring that are both also present in LH2 [1]. In LH2, the B850 ring has nearly tangentially oriented Bchl-a pigments, whereas in LH4 the equivalent B-a/B-b pigments are organized in a more radial fashion. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 661–670, 2008. c Springer-Verlag Berlin Heidelberg 2008
We are therefore dealing with ring-shaped units with nonameric and octameric symmetry resembling those rings from antenna complexes LH2 and LH4 with a strong interaction J (in the range 150 − 450 cm− 1) between BChl molecules. Our theoretical approach therefore considers an extended Frenkel exciton states model. Despite intensive study, the precise role of the protein moiety in governing the dynamics of the excited states is still under debate [2]. At room temperature the solvent and protein environment fluctuate with characteristic time scales ranging from femtoseconds to nanoseconds. The dynamical aspects of the system are reflected in the line shapes of electronic transitions. To fully characterize them and thereby the dynamics of the system, one needs to know not only the fluctuation amplitude (coupling strength) but also the time scale of each process involved. The observed linewidth reflects the combined influence of static disorder and exciton coupling to intermolecular, intramolecular, and solvent nuclear motions. The simplest approach is to decompose the line profile into homogeneous and inhomogeneous contributions of the dynamic and static disorder. Yet, a satisfactory understanding of the nature of static disorder in light-harvesting systems has not been reached. In the site excitation basis, static disorder can be present as in diagonal hamiltonian matrix elements as in off-diagonal ones. Time-dependent experiments [3,4] led for the B850 ring in LH2 complex to conclusion that the elementary dynamics occurs on a time scale of about 100 fs [5,6,7]. For example, depolarization of fluorescence was studied already quite some time ago for a model of electronically coupled molecules [8,9]. Rahman et al. [8] were the first who recognize the importance of the off-diagonal density matrix elements (coherences) [10] which can lead to an initial anisotropy larger than the incoherent theoretical limit of 0.4. Already some time ago substantial relaxation on the time scale of 10-100 fs and an anomalously large initial anisotropy of 0.7 was observed by Nagarjan et al. [5]. The high initial anisotropy was ascribed to a coherent excitation of a degenerate pair of states with allowed optical transitions and then relaxation to states at lower energies which have forbidden transitions. Nagarjan et al. [6] concluded, that the main features of the spectral relaxation and the decay of anisotropy are reproduced well by a model considering decay processes of electronic coherences within the manifold of the excitonic states and thermal equilibration among the excitonic states. In that contribution the exciton dynamics was not calculated explicitly. In several steps [11,12,13,14,15,16,17] we have extended the former investigations by Kumble and Hochstrasser [18] and Nagarjan et al. [6] for LH2 rings. We added the effect of dynamic disorder by using a quantum master equation in the Markovian [11] and non-Markovian limits [13,14]. We also investigated influence of four types of the uncorrelated static disorder (Gaussian disorder in local energies, transfer integrals, radial positions of BChls and angular positions of BChls) [12,15,16,17]. Influence of correlated static disorder, namely an elliptical deformation of the ring, has been also investigated [11]. Recently we have investigated the time dependence of the anisotropy of fluorescence for newly discovered type of the molecular ring, the LH4 ring with
the uncorrelated static disorder in local energies [19]. Main goal of our present investigation is the comparison of the time dependence of the anisotropy of fluorescence after an impulsive excitation for two molecular rings: for molecular ring with tangentially arranged optical transition dipoles rt (t), like in LH2, as well as for the radially arranged one rr (t) like in LH4 [1]. We concentrate on the uncorrelated static disorder - Gaussian disorder in transfer integrals.
2 Model
In the following we assume that only one excitation is present on the ring after an impulsive excitation [18]. The Hamiltonian of an exciton in the ideal ring coupled to a bath of harmonic oscillators reads

H^0 = \sum_{m,n\,(m\neq n)} J_{mn}\,a_m^{\dagger} a_n + \sum_q \hbar\omega_q\,b_q^{\dagger} b_q + \frac{1}{\sqrt{N}}\sum_m \sum_q G_q^m \hbar\omega_q\,a_m^{\dagger} a_m\,(b_q^{\dagger} + b_{-q}) = H_{ex}^0 + H_{ph} + H_{ex-ph}.    (1)

H_{ex}^0 represents the single exciton, i.e. the system. The operator a_m^{\dagger} (a_m) creates (annihilates) an exciton at site m. J_{mn} (for m ≠ n) is the so-called transfer integral between sites m and n. H_{ph} describes the bath of phonons in the harmonic approximation. The phonon creation and annihilation operators are denoted by b_q^{\dagger} and b_q, respectively. The last term in Eq. (1), H_{ex-ph}, represents the exciton–bath interaction which is assumed to be site–diagonal and linear in the bath coordinates. The term G_q^m denotes the exciton–phonon coupling constant. Inside one ring the pure exciton Hamiltonian H_{ex}^0 (Eq. (1)) can be diagonalized using the wave vector representation with corresponding delocalized "Bloch" states and energies. Considering the homogeneous case with only nearest neighbour transfer matrix elements J_{mn} = J_{12}(\delta_{m,n+1} + \delta_{m,n-1}) and using Fourier transformed excitonic operators (Bloch representation)

a_k = \sum_n a_n e^{ikn}, \qquad k = \frac{2\pi}{N}\,l, \quad l = 0, \pm 1, \ldots, \pm N/2,    (2)

the simplest exciton Hamiltonian in k representation reads

H_{ex}^0 = \sum_k E_k\,a_k^{\dagger} a_k, \qquad \text{with } E_k = -2 J_{12} \cos k.    (3)
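A direct way to see the band structure (3) is to diagonalize the nearest-neighbour Hamiltonian of the ideal ring numerically. The Python sketch below is an illustration added to this text (with N = 8 and J12 = 1 chosen as example values); it builds the exciton Hamiltonian in the site basis and compares its spectrum with the cosine band.

import numpy as np

N, J12 = 8, 1.0                                  # octameric ring, energies in units of J12
H0 = np.zeros((N, N))
for m in range(N):
    H0[m, (m + 1) % N] = J12                     # nearest-neighbour transfer on a closed ring
    H0[m, (m - 1) % N] = J12

eig_site = np.sort(np.linalg.eigvalsh(H0))
k = 2.0 * np.pi / N * np.arange(-N // 2, N // 2)
eig_band = np.sort(-2.0 * J12 * np.cos(k))       # Eq. (3); for an even-membered ring the
print(np.allclose(eig_site, eig_band))           # sign convention of J12 does not change the set of levels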
In the local site basis, the influence of static disorder is modeled by a Gaussian distribution for the uncorrelated transfer integral fluctuations δJ_{mn} with a standard deviation Δ_J,

H_s = \sum_{m,n\,(m\neq n)} \delta J_{mn}\,a_m^{\dagger} a_n.

We use the nearest neighbour approximation J = J_{12}. The Hamiltonian of the static disorder adds to the Hamiltonian of the ideal ring,

H = H^0 + H_s.    (4)
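One realization of the uncorrelated static disorder in the transfer integrals can then be generated as in the following sketch (illustrative only; N, J12 and Δ_J are example values). It draws a Gaussian fluctuation δJ for every nearest-neighbour coupling of the ring and adds it symmetrically to the ideal Hamiltonian, which is the object that has to be re-diagonalized for thousands of realizations.

import numpy as np

def disordered_hamiltonian(N=8, J12=1.0, delta_J=0.2, rng=None):
    """One realization of H = H0 + Hs with Gaussian disorder in the transfer integrals."""
    rng = np.random.default_rng() if rng is None else rng
    H = np.zeros((N, N))
    for m in range(N):
        n = (m + 1) % N                          # nearest neighbour on the ring
        dJ = rng.normal(0.0, delta_J * J12)      # delta J_{mn}, standard deviation Delta_J
        H[m, n] = H[n, m] = J12 + dJ             # kept Hermitian
    return H

rng = np.random.default_rng(0)
levels = np.array([np.linalg.eigvalsh(disordered_hamiltonian(rng=rng)) for _ in range(1000)])
print(levels.mean(axis=0))                       # disorder-averaged exciton levels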
All of the Qy transition dipole moments of the chromophores (BChls B850) in a ring without static and dynamic disorder lie approximately in the plane of the ring and the entire dipole strength of the B850 band comes from a degenerate pair of orthogonally polarized transitions (at an energy slightly higher than the transition energy of the lowest exciton state (LH2), slightly lower than the one of the highest exciton state (LH4)). The dipole strength μ_a of eigenstate |a⟩ of the ring with static disorder and the dipole strength μ_α of eigenstate |α⟩ of the ring without static disorder read

\vec{\mu}_a = \sum_{n=1}^{N} c_n^a\,\vec{\mu}_n, \qquad \vec{\mu}_\alpha = \sum_{n=1}^{N} c_n^\alpha\,\vec{\mu}_n,    (5)
where c_n^\alpha and c_n^a are the expansion coefficients of the eigenstates of the unperturbed ring and the disordered one in site representation, respectively. In the case of impulsive excitation the dipole strength is simply redistributed among the exciton levels due to disorder [18]. Thus the impulsive excitation with a pulse of sufficiently wide spectral range will always prepare the same initial state, irrespective of the actual eigenstates of the real ring. After impulsive excitation with polarization e_x the excitonic density matrix ρ [12] is given by [6]

\rho_{\alpha\beta}(t = 0; \vec{e}_x) = \frac{1}{A}\,(\vec{e}_x \cdot \vec{\mu}_\alpha)(\vec{\mu}_\beta \cdot \vec{e}_x), \qquad A = \sum_\alpha (\vec{e}_x \cdot \vec{\mu}_\alpha)(\vec{\mu}_\alpha \cdot \vec{e}_x).    (6)
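In the eigenstate basis, the initial condition (6) is simply an outer product of the projected dipole moments normalized by A. The sketch below illustrates the construction for an idealized tangential (LH2-like) dipole geometry; the geometry, the ring size and the Hamiltonian used to obtain the expansion coefficients are example choices made for this text only.

import numpy as np

def initial_density_matrix(mu_alpha: np.ndarray, e_x: np.ndarray) -> np.ndarray:
    """Eq. (6): rho_{alpha,beta}(0) = (e_x . mu_alpha)(mu_beta . e_x) / A."""
    proj = mu_alpha @ e_x                        # (e_x . mu_alpha) for every eigenstate
    A = np.sum(proj * proj)
    return np.outer(proj, proj) / A

N = 8
angles = 2.0 * np.pi * np.arange(N) / N
mu_site = np.stack([-np.sin(angles), np.cos(angles), np.zeros(N)], axis=1)  # in-plane tangents
H0 = np.roll(np.eye(N), 1, axis=1) + np.roll(np.eye(N), -1, axis=1)         # ideal ring
_, c = np.linalg.eigh(H0)                        # columns are the coefficients c^alpha_n
mu_alpha = c.T @ mu_site                         # Eq. (5)
rho0 = initial_density_matrix(mu_alpha, np.array([1.0, 0.0, 0.0]))
print(np.trace(rho0))                            # the initial density matrix has unit trace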
The usual time-dependent anisotropy of fluorescence

r(t) = \frac{\langle S_{xx}(t)\rangle - \langle S_{xy}(t)\rangle}{\langle S_{xx}(t)\rangle + 2\langle S_{xy}(t)\rangle}, \qquad S_{xy}(t) = \int P_{xy}(\omega, t)\,\mathrm{d}\omega    (7)

is determined from

P_{xy}(\omega, t) = A \sum_{a} \sum_{a'} \rho_{a a'}(t)\,(\vec{\mu}_a \cdot \vec{e}_y)(\vec{e}_y \cdot \vec{\mu}_{a'})\,[\delta(\omega - \omega_{a'0}) + \delta(\omega - \omega_{a0})].    (8)
The brackets in Eq. (7) denote the ensemble average and the orientational average over the sample. The crucial quantity entering r(t) in Eq. (7) is the exciton density matrix ρ. The dynamical equations for ρ obtained by Čápek [20] read

\frac{\mathrm{d}}{\mathrm{d}t}\,\rho_{mn}(t) = \mathrm{i} \sum_{pq} \bigl(\Omega_{mn,pq} + \delta\Omega_{mn,pq}(t)\bigr)\,\rho_{pq}(t).    (9)

In the long time approximation the coefficient δΩ(t → ∞) becomes time independent. All details of the calculations leading to the time convolution-less dynamical equations for ρ(t) are given elsewhere [14] and we shall not repeat them here. Obtaining the full time dependence of δΩ(t) is not a simple task. We have succeeded in calculating microscopically the full time dependence of δΩ(t) only for the simplest molecular model, namely a dimer [21]. In the case of a molecular ring we have to resort
to some simplification [14]. In what follows we use the Markovian version of Eq. (9) with a simple model for the correlation functions C_{mn} of the bath, assuming that each site (i.e. each chromophore) has its own bath completely uncoupled from the baths of the other sites. Furthermore it is assumed that these baths have identical properties [3,22]. Then only one correlation function C(ω) of the bath is needed,

C_{mn}(\omega) = \delta_{mn}\,C(\omega) = \delta_{mn}\,2\pi\,[1 + n_B(\omega)]\,[J(\omega) - J(-\omega)].    (10)

Here J(ω) is the spectral density of the bath [22] and n_B(ω) the Bose-Einstein distribution of phonons. The model of J(ω) often used in the literature is

J(\omega) = \Theta(\omega)\,j_0\,\frac{\omega^2}{2\omega_c^3}\,e^{-\omega/\omega_c}    (11)

and has its maximum at 2ω_c.
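For reference, the spectral density (11) and the bath correlation function (10) can be evaluated as in the sketch below (illustrative only; the temperature and the parameters j0 and ω_c are placeholders, and units with ħ = k_B = 1 are assumed).

import numpy as np

def spectral_density(omega, j0=0.4, omega_c=0.3):
    """Eq. (11): J(w) = Theta(w) * j0 * w^2 / (2 w_c^3) * exp(-w / w_c)."""
    omega = np.asarray(omega, dtype=float)
    w = np.where(omega > 0.0, omega, 0.0)        # Theta(omega) cuts off negative frequencies
    return j0 * w ** 2 / (2.0 * omega_c ** 3) * np.exp(-w / omega_c)

def bath_correlation(omega, T=0.5, **kwargs):
    """Eq. (10): C(w) = 2*pi*(1 + n_B(w)) * (J(w) - J(-w)) for w != 0."""
    omega = np.asarray(omega, dtype=float)
    n_B = 1.0 / np.expm1(omega / T)              # Bose-Einstein distribution of phonons
    return 2.0 * np.pi * (1.0 + n_B) * (spectral_density(omega, **kwargs) - spectral_density(-omega, **kwargs))

w = np.linspace(-3.0, 3.0, 121)
print(bath_correlation(w[w != 0.0])[:5])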
3 Numerical Solution
For the time propagation of the density matrix ρ (Eq. 9) the short iterative Arnoldi method [23] as well as the standard Runge-Kutta scheme have been used. An advantage of the first one with respect to the second one is the low computational effort for moderate accuracy [24]. Furthermore, the expansion coefficients are adapted at each time to a fixed time step with a prespecified tolerance in contrast to the Runge-Kutta scheme in which the time step is adapted. An uniform time grid is important for averaging of various realizations at the same time points without interpolation. The realization averaging and the orientational averaging can easily be parallelized by means of Message passing interface (MPI). Some computations were performed on a PC cluster. So instead of running about 10 000 realizations on one node, 312 realizations can be calculated on each of the 32 nodes (or 52 realizations on each of 192 nodes). Results of our simulations are presented graphically in the next section. We use dimensionless energies normalized to the transfer integral J12 = J and the renormalized time τ . To convert τ into seconds one has to divide τ by 2πcJ with c being the speed of light in cm s−1 and J in cm−1 . Estimation of J varies between 250 cm−1 and 400 cm−1 . Our time unit (τ = 1) corresponds for these extreme values to 21.2 fs or 13.3 fs.
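The conversion between the dimensionless time τ and physical time quoted above follows directly from t = τ/(2πcJ); the following few lines (an illustration, not part of the original text) reproduce the 21.2 fs and 13.3 fs values.

import math

def tau_to_fs(tau: float, J_cm: float, c_cm_per_s: float = 2.99792458e10) -> float:
    """Convert the dimensionless time tau to femtoseconds for a transfer integral J in cm^-1."""
    return tau / (2.0 * math.pi * c_cm_per_s * J_cm) * 1.0e15

print(tau_to_fs(1.0, 250.0), tau_to_fs(1.0, 400.0))   # approx. 21.2 fs and 13.3 fs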
4 Results
The molecular ring is a common shape of many antenna units in bacterial photosynthetic systems. They differ in the number of BChl molecules, the orientation of their optical transition dipoles, the interchromophore distance, etc. We present graphically the results of our modeling of the time dependence of the fluorescence anisotropy in the recently discovered cyclic antenna unit of the BChl photosystem, namely in LH4
Fig. 1. The time and ΔJ dependence of the anisotropy of fluorescence rr of the LH4 (octameric) ring (without dynamic disorder)
with radial optical transition dipole arrangement. The transition dipole arrangement has a pronounced effect on the strength of the interaction J between BChl molecules. The width of the exciton energy band of the ideal ring is two times larger for the tangential arrangement as in LH2 in comparison with radial arrangement in LH4. Also signs of J are opposite in both configurations. Optically accessible exciton states in ideal rings are near the bottom (upper) edge of the exciton band in LH2 (LH4) respectively. The time dependence of fluorescence anisotropy (Eq. (7)) has been calculated using dynamical equations for the exciton density matrix ρ to express the time dependence of the optical properties of the ring units in the femtosecond time range. Details are the same as in Ref. [14,15,16]. Substantial relaxation on the
Fig. 2. The time and ωc dependence of the anisotropy of fluorescence rr of the LH4 (octameric) ring. The dynamic disorder is included at T = 0.5 J (j0 = 0.2 J - left, j0 = 0.4 J - right).
Fig. 3. The time and ΔJ dependence of the anisotropy of fluorescence rr of the LH4 (octameric) ring. The dynamic disorder is also included at T = 0.5 J (j0 = 0.2 J upper row, j0 = 0.4 J - lower row, ωc = 0.1 J - left column, ωc = 0.3 J - right column).
Fig. 4. The time and ΔJ dependence of the difference between anisotropy of fluorescence rt − rr of the LH2 (nonameric) ring and the LH4 (octameric) one (without dynamic disorder)
time scale of 10-100 fs and an anomalously large initial anisotropy of 0.7 has been observed. Nagarjan et al. [6] suggested a model considering decay processes of electronic coherences within the manifold of the exciton states and thermal equilibration among the excitonic states. He supposed (without explicit calculation of the exciton dynamics) that this model reproduces well main features of the spectral relaxation and the decay of anisotropy in cyclic molecular units.
Fig. 5. The time and ωc dependence of the difference between anisotropy of fluorescence rt − rr of the LH2 (nonameric) ring and the LH4 (octameric) one for two strengths j0 of dynamic disorder at temperature T = 0.5 J (j0 = 0.2 J - left, j0 = 0.4 J - right)
Fig. 6. The time and ΔJ dependence of the difference between the anisotropy of fluorescence rt − rr of the LH2 (nonameric) ring and LH4 (octameric) one. The dynamic disorder is also included at T = 0.5 J (j0 = 0.2 J - upper row, j0 = 0.4 J - lower row, ωc = 0.1 J - left column, ωc = 0.3 J - right column).
Let us look at time decay of the anisotropy of fluorescence in the molecular ring with radial arrangement rr (t) - LH4, octameric one. Recently we discussed [19] results in the case of one type of the static disorder - uncorrelated Gaussian disorder in the local energies.
In the present paper we concentrate on other type of the static disorder, the Gaussian uncorrelated disorder in the transfer integrals J, characterized by ΔJ . In Fig. 1. there is the time dependence of the anisotropy of fluorescence in case of the pure static disorder in transfer integrals for different ΔJ . It is seen that the pure static disorder ΔJ = 0.2 leads to decay of the anisotropy of fluorescence from 0.7 to 0.4 within τ = 20. Influence of the pure dynamic disorder, for two its strengths j0 and different maxima 2ωc of the spectral density function J(ω) is shown in Fig. 2. Dynamic disorder has a pronounced effect mainly in case of lower ωc . Consequences of the combined static and dynamic disorder are presented in Fig.3. The time dependence of the anisotropy of fluorescence is displayed on four 3D graphs for two strengths of interaction with the bath and two maxima 2ωc of the spectral density function. While for ωc = 0.1J the influence of the static disorder is secondary due to dominance of the dynamic disorder, for larger ωc = 0.3J the influence of the static disorder is more pronounced. Comparison of the time dependence of fluorescence anisotropy for the octameric ring with radial arrangement of optical transition dipoles (like in LH4) and for the nonameric one with tangential arrangement of optical transition dipoles is presented graphically as differences rt (τ ) − rr (τ ) in Figs 4-6.
5 Conclusions
The difference of the anisotropy of fluorescence between the nonameric tangentially arranged ring rt (τ ) and octameric radially arranged one rr (τ ) for the static disorder in transfer integrals (Fig. 4) has an opposite sign in comparison with the result for the static disorder in local energies as shown in Fig. 2 in [19]. For the influence of the dynamic disorder (interaction with the bath), given by Eqs(9-11), we can conclude (Fig. 5) that the same differences are shifted to a smaller times for the stronger interaction j0 . Similar conclusion can be drawn for the case of simultaneously acting static and dynamic disorder (shown in Fig. 3 and 6). We can also see more rapid decay of the anisotropy of fluorescence due to dynamic disorder in nonameric ring for smaller values of ωc (ωc = 0.1 J) (negative difference) and in octameric ring for larger values of ωc (ωc = 0.3 J) in presence of static disorder in transfer integrals (Fig. 6).
Acknowledgement Support from the Ministry of Education, Youth and Sports of the Czech Republic (projects MSM0021620835 - I.B. and LC06002 - P.H.) is gratefully acknowledged.
References 1. de Ruijter, P.F., et al.: Biophysical J. 87, 3413 (2004) 2. Jang, S., Dempster, S.F., Silbey, R.J.: J. Phys. Chem. B 105, 6655 (2001)
3. Sundstr¨ om, V., Pullerits, T., van Grondelle, R.: J. Phys. Chem. B 103, 2327 (1999) 4. Novoderezhkin, V., van Grondelle, R.: J. Phys. Chem. B 106, 6025 (2002) 5. Nagarjan, V., Alden, R.G., Williams, J.C., Parson, W.W.: Proc. Natl. Acad. Sci. USA. 93, 13774 (1996) 6. Nagarjan, V., Johnson, E.T., Williams, J.C., Parson, W.W.: J. Phys. Chem. B 103, 2297 (1999) 7. Nagarjan, V., Parson, W.W.: J. Phys. Chem. B 104, 4010 (2000) 8. Rahman, T.S., Knox, R.S., Kenkre, V.M.: Chem. Phys. 44, 197 (1979) 9. Wynne, K., Hochstrasser, R.M.: Chem. Phys. 171, 179 (1993) 10. K¨ uhn, O., Sundstr¨ om, V., Pullerits, T.: Chem. Phys. 275, 15 (2002) 11. Heˇrman, P., Kleinekath¨ ofer, U., Barv´ık, I., Schreiber, M.: J. Lumin. 447, 94–95 (2001) 12. Heˇrman, P., Kleinekath¨ ofer, U., Barv´ık, I., Schreiber, M.: Chem. Phys. 275, 1 (2002) 13. Barv´ık, I., Kondov, I., Heˇrman, P., Schreiber, M., Kleinekath¨ ofer, U.: Nonlin. Opt. 29, 167 (2002) 14. Heˇrman, P., Barv´ık, I.: Czech. J. Phys. 53, 579 (2003) 15. Reiter, M., Heˇrman, P., Barv´ık, I.: J. Lumin. 110, 258 (2004) 16. Heˇrman, P., Barv´ık, I., Reiter, M.: J. Lumin. 112, 469 (2005) 17. Heˇrman, P., Barv´ık, I.: J. Lumin. 558, 122–123 (2007) 18. Kumble, R., Hochstrasser, R.: J. Chem. Phys. 109, 855 (1998) 19. Heˇrman, P., Barv´ık, I., Zapletal, D.: J. Lumin. 128, 768 (2008) ˇ apek, V.: Z. Phys. B 99, 261 (1996) 20. C´ 21. Barv´ık, I., Macek, J.: J. Chin. Chem. Soc. 47, 647 (2000) 22. May, V., K¨ uhn, O.: Charge and Energy Transfer in Molecular Systems. WileyWCH, Berlin (2000) 23. Pollard, W.T., Friesner, R.A.: J. Chem. Phys. 100, 5054 (1994) 24. Kondov, I., Kleinekath¨ ofer, U., Schreiber, M.: J. Chem. Phys. 114, 1497 (2001)
Functional Availability Analysis of Discrete Transport System Realized by SSF Simulator Tomasz Walkowiak and Jacek Mazurkiewicz Institute of Computer Engineering, Control and Robotics, Wroclaw University of Technology, ul. Janiszewskiego 11/17, 50-372 Wroclaw, Poland {Tomasz.Walkowiak,Jacek.Mazurkiewicz}@pwr.wroc.pl
Abstract. The paper describes a novel approach to functional availability analysis of discrete transport systems realized using Scalable Simulation Framework (SSF). The proposed method is based on modeling and simulating of the system behavior by Monte Carlo simulation. No restriction on the system structure and on a kind of distribution is the main advantage of the method. The paper presents some exemplar system modeling. The authors stress the problem of influence of the functional parameters on final system availability. The problem described in the paper is practically essential for defining an organization of vehicle maintenance and transport system logistics.
1 Introduction
Decisions related to transport systems ought to be taken based on different and sometimes contradictory conditions. The transport systems are characterized by a very complex structure. The performance of the network can be impaired by various types of faults related to the transport vehicles, communication infrastructure or even by traffic congestion [8]. The analysis of transport system functionality can only be done if there is a formal model of the transport logistics. The classical models used for reliability analysis are mainly based on Markov or Semi-Markov processes [1] which are idealized and it is hard to reconcile them with practice. We suggest the Monte Carlo simulation [4] for proper functional parameters calculation. No restriction on the system structure and on a kind of distribution is the main advantage of the method [9]. We propose to use the SSF (Scalable Simulation Framework) [2] instead of dedicated system elaboration. Our previous works [5], [7], [9], [10] show that it is very hard to prepare the simulator which includes all aspects of discrete transport. The SSF is a simulation core. It was developed for a usage in the SSFNet [3] a popular simulator of computer networks. We developed an extension to SSF allowing to simulate transport systems. We propose a formal model of discrete transport system to analyze functional aspects of complex systems. The presented in the next chapter discrete transport system model is based on the Polish Post regional centre of mail distribution. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 671–678, 2008. c Springer-Verlag Berlin Heidelberg 2008
2 Discrete Transport System with Central Node and Time-Table (DTSCNTT)
The model can be described as follows:

DTSCNTT = ⟨CN, N, R, V, I, M, TT⟩,    (1)
where: CN - central node, N - set of ordinary nodes, R - set of routes, V - set of vehicles, I - input model, M - set of maintenance crews and TT - vehicles’ time-table. Commodities: We can discuss several kinds of a commodity transported in the system. Single kind commodity is placed in a unified container, and containers are transported by vehicles. The commodities are addressed and there are no other parameters describing them. Nodes: We have single central node in the system. The central node is the destination of all commodities taken from other - ordinary nodes. Moreover the length between each two nodes is given. Input Model: The aim of the system is to transport containers from the central node to ordinary nodes and in the opposite way. The containers are generated in each node. The central node is the global generator of commodities driven to each ordinary nodes of the system. The generation of containers is described by Poisson process. In case of central node there are separate processes for each ordinary node. Whereas, for ordinary nodes there is one process. The input model includes intensities of container generation in each ordinary node (routed to central node) and a table of intensities of containers for each ordinary node in the central node. Vehicles: We assumed that all vehicles are of the same type and are described by following functional and reliability parameters: mean speed of a journey, capacity - number of containers which can be loaded, reliability function and time of vehicle maintenance. The central node is the start and destination of vehicle travels. The temporary state of each vehicle is characterized by following data: vehicle state, distance traveled from the begin of the route, capacity of the commodities. The vehicle running to the end of the route is able to take different kinds of commodity (located in unified containers, each container includes singlekind commodity). The vehicle hauling a commodity is always fully loaded or taking the last part of the commodity if it is less than its capacity. Routes: Each route describes possible trip of vehicles. The set of routes we can describe as series of nodes: R = c, v1 , ..., vn , c and vi ∈ N and c = CN.
(2)
Maintenance Crews: Maintenance crews are identical and unrecognized. The crews are not assigned to any node, are not combined to any route, they operate in the whole system and are described only by the number of them. The temporary state of maintenance crews is characterized by: number of crews which are not involved into maintenance procedures and queue of vehicle waiting for the maintenance.
Time-Table: Vehicles operate according to the time-table exactly as city buses or intercity coaches. The time-table consists of a set of routes (sequence of nodes staring and ending in the central node, times of approaching each node in the route and the recommended size of a vehicle. The number of used vehicle, or the capacity of vehicles does not depend on temporary situation described by number of transportation tasks or by the task amount for example. It means that it is possible to realize the journey by completely empty vehicle or the vehicle cannot load the available amount of commodity (the vehicle is to small). Time-table is a fixed element of the system in investigated time horizon, but it is possible to use different time-tables for different seasons or months of the year. Each day a given time-table is realised, it means that at a time given by the time table a vehicle, selected randomly from vehicles available in the central node, departures from central node and loaded with containers addressed to each ordinary nodes included in a given route. This is done in a proportional way. Next, after arriving at given node (it takes some time according to vehicle speed - random process and road length) the vehicle waits in an input queue if there is any other vehicle being loaded/unload at the same time. There is only one handling point in each node. The time of loading/unloading vehicle is described by a random distribution. The containers addressed to given node are unloaded and empty space in the vehicle is filled by containers addressed to a central node. The operation is repeated in each node on the route and finally the vehicle is approaching the central node when is fully unloaded and after it is available for the next route. The process of vehicle operation could be stopped at any moment due to a failure (described by a random process). After the failure, the vehicle waits for a maintenance crew (if there are no available due to repairing other vehicles), is being repaired (random time) and after that it continues its journey.
3 Simulation Methodology
Discrete transport system described in the previous section is very hard to analyze by a formal model. It does not fit the Markov process framework. A common way of analyzing that kind of systems is a computer simulation. To analyze the system we must at first build a model and then operate the model. The system model needed for simulation has to encompass the system elements behavior and interaction between elements. In case of dependability we have to include system element reliability model. Except the system functionality model we have to model the traffic in the system. The data for simulation of a given real exemplar system consists of system element model (described in the system functionality meta-model formalism) and a given traffic configuration. Once a model has been developed, it is executed on a computer by an eventsimulation, which is based on a idea of event. The event is described by time of event occurring, type of event (in case of DTSCNTT it could be vehicle failure) and element or set of elements of the system on which event has its influence. The simulation is done by analyzing a queue of event (sorted by time of event occurring) while updating the states of system elements according to rules related
to a proper type of event. The event-simulation program could be written in general purpose programming language (like C++), in fast prototyping environment (like Matlab) or special purpose discrete-event simulation kernels. One of such kernels, is the Scalable Simulation Framework (SSF) [2] which is a used for SSFNet [3] computer network simulator. SSF is an object-oriented API - a collection of class interfaces with prototype implementations. It is available in C++ and Java. SSFAPI defines just five base classes: Entity, inChannel, outChannel, Process, and Event. The communication between entities and delivery of events is done by channels (channel mappings connects entities) [3]. For the purpose of simulating DTSCNTT we have used Parallel Real-time Immersive Modeling Environment (PRIME) [6] implementation of SSF due to much better documentation then that available for original SSF. We have developed a generic class (named DTSObject) derived from SSF Entity which is a base of classes modeling DTSCNTT objects like: scheduler, node, truck and crew which model the behavior of presented in section 2 discrete transport system. The effectiveness of simulation done in PRIME environment is very promising. The tests done on one batch of simulation of DTSCNTT exemplar described in the next section needed from 3.9 to 9 seconds on Pentium 2 GHz computer. The time needed to perform one simulation depends on the number of events presented in the system, which is a result of DTSCNTT configuration. Due to a presence of randomness in the DTSCNTT model the analysis of it has to be done based on Monte-Carlo approach. It requires a large number of repeated simulation. The SSF is not a Monte-Carlo framework but by simple re-execution of the same code (of course we have to start from different values of random number seed) the statistical analysis of system behavior could be realized [12].
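The event-driven principle sketched above - a time-ordered queue of events, each of which updates part of the system state and may schedule further events - can be illustrated in a few lines of Python with a priority queue. This is only a didactic sketch of the idea with made-up parameters, not the SSF/PRIME API used by the authors.

import heapq
import random

def simulate(horizon_h=2400.0, mttf_h=1000.0, repair_mean_h=2.0, seed=0):
    """Toy event-driven run: one truck alternating between failure and repair events."""
    rng = random.Random(seed)
    events = [(rng.expovariate(1.0 / mttf_h), "failure")]      # (time, type) priority queue
    failures = 0
    while events:
        t, kind = heapq.heappop(events)                        # next event in time order
        if t > horizon_h:
            break
        if kind == "failure":
            failures += 1
            repair = max(0.0, rng.gauss(repair_mean_h, 0.5))
            heapq.heappush(events, (t + repair, "repaired"))
        else:                                                  # after repair, schedule the next failure
            heapq.heappush(events, (t + rng.expovariate(1.0 / mttf_h), "failure"))
    return failures

print(simulate())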
4 Functional Availability of DTSCNTT
The analysis of a given system requires a metric. We propose to use the availability of the system. We define it as an ability of realising the transportation task in required time. The availability is a probability measure. Introducing the following notation:
– T - time measured from the moment when the container was introduced to the system to the moment when the container was transferred to the destination (random value),
– Tg - guaranteed time of delivery, if exceeded the container is delayed,
– N(t) - stochastic process describing the number of delayed containers at time t,
– k - the level of acceptable delay,
we can define the functional availability Ak(t) as a probability that the number of delayed containers at time t does not exceed k, i.e.:

A_k(t) = \Pr\{N(t) \le k\}.    (3)
The calculation of stochastic process N(t) is based on analysing a state of each
Fig. 1. The delivery in guaranteed time (a) and delayed delivery (b)
not yet delivered container. As illustrated in Fig. 1. we can observe two possible situations: (a) - delivery was realised before guaranteed time Tg - there is no delay, (b) - delivery was delayed - time of delay: T - Tg .
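Given per-run trajectories of the number of delayed containers N(t), sampled on a common time grid, the functional availability (3) is estimated as the fraction of simulation runs in which N(t) ≤ k at each time point. A minimal sketch follows; the synthetic Poisson trajectories only stand in for the simulator output and are not related to the real system.

import numpy as np

def availability(delayed_counts: np.ndarray, k: int = 20) -> np.ndarray:
    """Eq. (3): A_k(t) estimated as the fraction of runs with N(t) <= k at every time point."""
    return (delayed_counts <= k).mean(axis=0)

rng = np.random.default_rng(1)
N_t = rng.poisson(lam=12.0, size=(10000, 1000))   # 10 000 runs x 1000 time samples (synthetic)
A20 = availability(N_t, k=20)
print(A20[:5])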
5 DTSCNTT Case Study
For testing purposes of the presented DTSCNTT system (chapter 2) and the developed extension of SSF (chapter 3) we have developed an exemplar transport system. It consists of one central node (the city of Wroclaw, Poland) and three ordinary nodes (cities near Wroclaw: Rawicz, Olesnica and Nysa). The distances between nodes have been set approximating the real distances between the used cities and they equal: 85, 60 and 30 km. We assumed a usage of 5 trucks (two with capacity set to 10 and three with capacity 15) with mean speed 50 km/h. The vehicles realized 19 trips a day: from the central node to an ordinary node and the return trip. Failures of trucks were modeled by an exponential distribution with mean time
NET [
  Vertex [ID Nys MTTB 0.6]
  Vertex [ID Raw MTTB 0.4]
  Vertex [ID Ole MTTB 0.3]
  CeVertex [ID Wro MTTB [Nys 0.5 Raw 0.4 Ole 0.3] ]
  Truck [No 2 Speed 50 Size 10 MTTF 1000]
  Truck [No 3 Speed 50 Size 15 MTTF 1000]
  Trip [Size 10 Start 8.00 Dest[ID Ole Time 8.40]]]
  Trip [Size 10 Start 9.30 Dest[ID Ole Time 10.10]]
  Trip [Size 10 Start 11.00 Dest[ID Ole Time 11.40]]
  Trip [Size 10 Start 12.30 Dest[ID Ole Time 13.10]]
  Trip [Size 10 Start 14.00 Dest[ID Ole Time 14.40]]
  Trip [Size 10 Start 15.30 Dest[ID Ole Time 16.10]]
  Trip [Size 10 Start 17.00 Dest[ID Ole Time 17.40]]
  …
Fig. 2. Exemplar DTSCNTT description in DML file
Fig. 3. Functional availability A20(t) of the DTSCNTT as a function of time t, 5 trucks operate
Fig. 4. Functional availability A20(t) of the DTSCNTT as a function of time t, 4 trucks operate
to failure equal to 1000h. The repair time was modeled by normal distribution with mean value equal to 2h and variance of 0.5h. The containers addressed to ordinary nodes were available in the central node at every 0.5, 0.4 and 0.3 of an hour respectively. Containers addressed to the central node were generated at every 0.6, 0.4, 0.3 of hour in following ordinary nodes. There was a single maintenance crew. The availability of the system Ak (t) was calculated with guaranteed time Tg =24h and parameter k =20. Time-table as well as other functional parameters were described in a DML file (see example in Fig. 2.). The Domain Modeling Language (DML) [6] is a SSF specific text-based language which
includes a hierarchical list of attributes used to describe the topology of the model and model attribute values. Based on 10 000 time simulations (each covering 100 days) the availability of the system was calculated. Results presented in Fig. 3 show periodic changes. This situation is an effect of the used time-tables and the method of containers' generation. The containers are generated throughout the day (by a Poisson process) but, according to the time-table, trucks do not operate in the night. The probability of delay increases at night, but the selected number of trucks (5) is satisfactory for the given system. We have also analyzed a system with a reduced number of vehicles (with 4). The resulting value of the availability function is presented in Fig. 4. It could be noticed that the availability of the system decreases due to the lack of a sufficient number of trucks. It should be noticed here that, looking at the used time-table and not taking into consideration the randomness of the transport system (failures and traffic jams), only three vehicles should be enough to transport all the generated containers.
6 Conclusion
We have presented a simulation approach to functional analysis of Discrete Transport System with Central Node and Time-Table (DTSCNTT). The DTSCNTT models behavior of the Polish Post regional centre of mail distribution. Developed simulation software allows to analyze availability of the system in a function of all model parameters, like for example changes in a time-table or in a number of used trucks. Also, some economic analysis could be done following the idea presented in [5], [11], [12]. It could be used for example for selection of the optimum value for SLA (service level agreement). The presented results, i.e. changes of availability in a function of a number of used trucks shows that presented approach allows to answer a non trivial question what should be a number of vehicles to fulfill some requirements given to the transport system. The implementation of DTSCNTT simulator done based on SSF allows to apply in a simple and fast way changes in the transport system model. Also the time performance of SSF kernel results in a very effective simulator of discrete transport system. Therefore, in our opinion introduced exemplar analysis shows that the described method of transport system modeling can serve for practical solving of essential decision problems related to an organization and parameters of a real transport system. The proposed analysis seems to be very useful for mail distribution centre organization. Work reported in this paper was sponsored by a grant No. 4 T12C 058 30, (years: 2006-2009) from the Polish Committee for Scientific Research (KBN).
References 1. Barlow, R., Proschan, F.: Mathematical Theory of Reliability, Society for Industrial and Applied Mathematics, Philadelphia (1996) 2. Cowie, J.H.: Scalable Simulation Framework API reference manual (1999), http://www.ssfnet.org/SSFdocs/ssfapiManual.pdf
3. Cowie, J.H., Nicol, D.M., Ogielski, A.T.: Modeling the Global Internet. Computing in Science and Engineering 1(1), 42–50 (1999) 4. Fishman: Monte Carlo: Concepts, Algorithms, and Applications. Springer-Verlag, New York (1996) 5. Kaplon, K., Mazurkiewicz, J., Walkowiak, T.: Economic Analysis of Discrete Transport Systems. Risk Decision and Policy 8(3), 179–190 (2003) 6. Liu, J.: Parallel Real-time Immersive Modeling Environment (PRIME), Scalable Simulation Framework (SSF), User’s manual, Colorado School of Mines Dep. of Mathematical and Computer Sciences (2006), http://prime.mines.edu 7. Mazurkiewicz, J., Walkowiak, T.: Fuzzy Economic Analysis of Simulated Discrete Transport System. In: Rutkowski, L., Siekmann, J.H., Tadeusiewicz, R., Zadeh, L.A. (eds.) ICAISC 2004. LNCS (LNAI), vol. 3070, pp. 1161–1167. Springer, Heidelberg (2004) 8. Sanso, B., Milot, L.: Performability of a Congested Urban-Transportation Network when Accident Information is Available. Transportation Science 33, 1 (1999) 9. Walkowiak, T., Mazurkiewicz, J.: Hybrid Approach to Reliability and Functional Analysis of Discrete Transport System. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3037, pp. 236–243. Springer, Heidelberg (2004) 10. Walkowiak, T., Mazurkiewicz, J.: Reliability and Functional Analysis of Discrete Transport System with Dispatcher. In: Advances in Safety and Reliability, European Safety and Reliability Conference - ESREL 2005, pp. 2017–2023. Taylor & Francis Group, London (2005) 11. Walkowiak, T., Mazurkiewicz, J.: Simulation Based Management and Risk Analysis of Discrete Transport Systems. In: IEEE TEHOSS 2005 Conference, pp. 431–436 (2005) 12. Walkowiak, T., Mazurkiewicz, J.: Discrete transport system simulated by SSF for reliability and functional analysis. In: International Conference on Dependability of Computer Systems. DepCoS - RELCOMEX 2007, pp. 352–359. IEEE Computer Society Press, Los Alamitos (2007)
Parallel Implementation of Vascular Network Modeling
Krzysztof Jurczuk and Marek Krętowski
Faculty of Computer Science, Bialystok Technical University, Wiejska 45a, 15-351 Bialystok, Poland
{kjurczuk,mkret}@wi.pb.edu.pl
Abstract. The paper presents modeling of the vascular system in a parallel environment. The aim of this approach is to accelerate the simulation of vascular network growth and make it closer to analogous real life processes. We concentrated on the perfusion process and made an attempt to parallelize the process of connecting ischemic macroscopic functional units to existing vascular systems. The proposed method was implemented on a computing cluster with the use of the MPI standard. The results show that it is possible to gain a significant speedup that allows us to make simulations for a greater number of macroscopic functional units and vessels in a reasonable time, which increases the possibility to create more complex and more precise virtual organs.
1 Introduction
The human body is characterized by high complexity. It can be observed on each level, starting from molecules, cells and ending on organs and the whole organism [3]. Moreover, a lot of internal mechanisms are parallel or even distributed. These factors are the main reasons why the modeling of living systems is becoming more and more important. The modeling provides new ways to better understand complex interactions between elementary mechanisms and behaviors of the whole organs. One of the main difficulties in model designing is the necessity to capture the most essential properties of the system and disregard the elements whose role is insignificant. It is not easy to choose appropriate simplifications, which applies especially to living organisms. Too simple models can be useless but, on the other hand, too elaborate models can be ineffective in practical cases, which means that the computations cannot be done in a reasonable time. Therefore, it appears natural to attempt to use parallel computing in the modeling of living organisms, especially vascular networks. Implementations in a parallel environment can accelerate the simulation process and allow us to introduce more sophisticated details. In this paper, we focus on the modeling of vascular systems. They are very important in the detection processes of various pathological anomalies, because changes in vascular structures can be directly caused by diseases. Most of these modifications appear in medical images, especially when the contrast product M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 679–688, 2008. c Springer-Verlag Berlin Heidelberg 2008
is administrated. Vessels play a key role in a contrast material propagation and they are one of the most visible structures in dynamic images. Therefore, the modeling of vascular systems can support the development of methods to detect early indicators of diseases and help to understand the mechanisms of image formation. Many vascular models have been proposed, e.g. Constrained Constructive Optimization (COO) method for an arterial tree generation [11], an algorithm of arterial tree growth inside a defined and gradually expanding shape [1], improved CCO method to simulate the coronary tree [6] or a fractal model [13]. According to our knowledge, all of them use the sequential algorithm to develop vascular systems. In our previous studies [2] and [7], we used the physiological modeling as a way to better understand medical images (both CT and MRI) and to find some image markers of pathologies. In the research we also made use of the sequential algorithm to generate a virtual organ of liver (represented by three vascular trees and parenchyma) [8], CT simulator and MRI virtual scanner implemented in a parallel environment. In this paper, however, we propose a parallel algorithm of the vascular network growth, based on the previously used sequential algorithm. We concentrated on the perfusion process and parallelized the process of connecting new cells to existing vascular trees. The aim of this research is to accelerate the simulation of the vascular network growth and bring it as close to the real, analogous process as possible. The rest of the paper is organized as follows. In the next section the organ model with sequential algorithm of vascular system development is briefly recalled. Whereas, in Sect. 3 the parallel algorithm of the same vascular growth process is presented. An experimental validation of the presented approaches is performed in Sect. 4. The conclusion and some plans for future research are sketched in the last section.
2 Model Description
In its generic form [8], the discussed model was constructed for the modeling of internal organs which develop by a division of their structural elements. But it should be emphasized that it is oriented towards an image generation. Therefore, the model concentrates on elements which are directly visible in images or have a significant influence on image analysis. The main components of the model are: the tissue and the vascular network that perfused it. Most of features are not linked with any internal organ. However, it is very hard to model particular organs without some kind of specialization. Therefore, the model expresses the specificity of liver, as it is one of the most important organs. It plays a major role in the metabolism and has a number of functions in the body, including glycogen storage, decomposition of red blood cells, plasma protein synthesis, and detoxification [12]. Moreover, it possesses an unique organization of the vascular network with three types of trees: hepatic artery, portal vein and hepatic vein.
2.1 Tissue Modeling
The tissue is represented by a set of Macroscopic Functional Units (MFU) that are distributed regularly inside the specified shape. An MFU is a small, fixed size part of tissue that constitutes the functional unit of the model. It is described by its class, which specifies most of its properties, both functional and structural (e.g. probability of mitosis and necrosis, blood flow rate, blood pressure, size and density). Moreover, certain parameters are described by defined distributions. This mechanism enables modeling the blood flow with more natural variability. Several classes of MFU can be defined in the organ, which allows simulating pathological changes.
2.2 Vascular Network Modeling
In the model, each vessel is represented by an ideal, rigid tube with a fixed radius, wall thickness, length and position. The wall thickness depends on the vessel diameter and its function. The model distinguishes vessels larger than capillaries, whereas the capillaries themselves are hidden in the MFUs. Based on this simplification, the vascular tree model assumes the form of a binary tree (see Fig. 1a). It means that anastomoses, which occur sporadically, especially in pathological situations, are not taken into account. The binary trees, representing vascular trees, are built of nodes characterized by their spatial position, blood flow rate and pressure.
Fig. 1. Binary vascular trees: a) successive bifurcations, b) perfusion process of a new MFU - searching the closest vessels in the three vascular trees
In the model, the blood is treated as a Newtonian fluid with constant viscosity (μ), which makes it possible to calculate the pressure difference (ΔP) between the two extremities of a vessel by Poiseuille's law:

\Delta P = \frac{8 \mu l}{\pi r^{4}}\, Q, \qquad (1)
where l is the length, r is the radius and Q is the blood flow of the vessel. Moreover, at each bifurcation the law of matter conservation has to be observed:

Q = Q_r + Q_l, \qquad (2)
where Q is the blood flow in the parent vessel and Q_r, Q_l are the blood flows in the descendant vessels (right and left daughter branches). It means that the quantity of blood entering and leaving a bifurcation has to be equal. Another equation is connected with the decreasing radius of the vessels as we move from proximal to distal segments of the vascular trees:

r^{\gamma} = r_r^{\gamma} + r_l^{\gamma}, \qquad (3)

where r is the radius of the parent vessel, r_r, r_l are the radii of the descendant vessels (right and left daughter branches) and γ varies between 2 and 3 [5]. This morphological law describes the dependency between the mother vessel radius and the radii of its two daughters.
2.3 Sequential Vascular System Growth Algorithm
The organ growth is modeled as an analogy to the hyperplasia process (the increasing number of cells). It starts with an organ whose size is a fraction of the adult one. As shown in Fig. 2, after parameter initialization, the organ enlarges its size at discrete time moments (called cycles). Therefore, new empty space appears between MFUs. Additionally, each cycle consists of subcycles. In each subcycle, an MFU has a certain probability to give birth to a new MFU of the same class or to die. Consequently, changes appear in the tissue and in the corresponding vascular network. The processes of birth/perfusion and death/retraction are repeated in each subcycle until the empty space, which can appear between cycles, is occupied by new MFU elements. The whole process ends when the organ reaches its full, adult size. At the beginning of each subcycle, for every MFU, a few randomly chosen spatial positions for a new MFU in its neighborhood are tested. If all conditions connected with free space and tissue density are fulfilled, a new MFU is created. This new, small functional unit is not yet perfused by the existing vascular system and is initially ischemic. The next step is to find an optimal bifurcation point which can be used to perfuse the new MFU. First, the distances between all vessels and the new element are calculated and a fixed number of the closest vessels is chosen (see Fig. 1b). Later, temporary bifurcations are created. When there is more than one tree, the algorithm considers all possible combinations of candidate vessels (a single combination consists of one vessel from each tree). The spatial position of the bifurcation is optimized by the Downhill Simplex procedure [10] (minimization of the additional blood volume needed for the new MFU perfusion [8]). Only one combination of vessels can be used to perfuse the new MFU, so the most appropriate one has to be chosen from among all candidates. Additionally, possible collisions between vessels must be checked: only non-crossing configurations are taken into account, so the algorithm detects intersections between perfusing vessels both from the same and from different trees. Finally, for each remaining configuration the volume of the whole tree is computed. The combination with the lowest sum of volumes permanently perfuses the MFU.
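The selection of the permanent bifurcation can be summarized by the following sketch; the function and type names (candidate_combinations, optimize_bifurcation, ...) are illustrative assumptions rather than the original interface.

#include <limits>
#include <vector>

// Hypothetical interfaces; the original implementation may differ.
struct MFU;
struct Vessel;
struct Bifurcation { double volume; /* position, perfusing vessels, ... */ };

std::vector<std::vector<Vessel*>> candidate_combinations(const MFU& m, int per_tree);
Bifurcation optimize_bifurcation(const std::vector<Vessel*>& combo, const MFU& m); // Downhill Simplex
bool crosses_existing_vessels(const Bifurcation& b);

// Returns the non-crossing candidate with the lowest total tree volume.
Bifurcation best_perfusion(const MFU& new_mfu, int closest_per_tree) {
    Bifurcation best;
    best.volume = std::numeric_limits<double>::infinity();
    for (const auto& combo : candidate_combinations(new_mfu, closest_per_tree)) {
        Bifurcation b = optimize_bifurcation(combo, new_mfu);  // temporary bifurcation
        if (crosses_existing_vessels(b)) continue;             // reject crossing configurations
        if (b.volume < best.volume) best = b;                  // keep the cheapest candidate
    }
    return best;  // caller checks whether volume is still infinity (no valid candidate)
}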
Fig. 2. Flow chart representing two loops of events which are distinguished in the presented modeling of the organ
The MFUs are perfused by the vascular system in a sequential manner, one by one. This process is time consuming because for each new MFU a large number of temporary bifurcations is created, and many calculations are required to assure the consistency of the characteristics (i.e. blood flow and pressure) describing the individual vessels. After the reproduction process comes the degeneration phase, in which some of the MFUs can die. The retraction algorithm is much less time consuming than the perfusion process: the vessels supplying the MFU simply retract and disappear, which requires only a single recalculation of the constraints for the vascular system.
3 Parallel Vascular System Development
The most time consuming operation in the presented algorithm of vascular network growth is the perfusion process. This results from the large number of MFUs, the complicated structure of the vascular trees and, especially, the necessity to find the optimal bifurcation. Therefore, a decision was made to spread the computations concerning the perfusion process over the computational nodes. Moreover, our intention was to bring the solution closer to reality, where the analogous perfusion processes also run in parallel. The general scheme of the proposed algorithm in a parallel environment is presented in Fig. 3. The algorithm has two parts. The first (see Fig. 4) is
Fig. 3. Two parts of the parallel algorithm. The first is performed at the beginning of each subcycle and is connected with the tree and tissue migration. The second is performed between subcycles and is responsible for the distribution of the perfusion process over the nodes.
Fig. 4. The first part of the algorithm connected with the trees and tissue migration
performed at the beginning of each subcycle, while the second (see Fig. 5) does its calculations between subcycles. Each node must have the vascular system, i.e. the trees and MFUs, as recent as possible. Therefore, at the beginning of each subcycle (the first part of the presented algorithm) the managing node sends the latest vascular trees and the tissue represented by MFUs to the computational nodes. The vascular system can be large and complex, so its migration between processors within the framework of the message passing interface is composed of three steps: packing the nodes and MFUs into a flat message, sending the message, and unpacking the corresponding nodes and MFUs. In order to minimize the message size we choose only the parameters of the nodes that cannot be reconstructed: position in space, possession of children, individual node number and MFU class. When a computational node receives the message with the vascular trees, the remaining characteristics are restored. Almost all indispensable information about the MFUs is sent with the trees; additionally, as the blood flow is unique in each MFU, we also have to transfer the value of the flow. Moreover, many structural parameters are read from input files at each node, which enables us to send quite a small package in comparison to the real size of the vascular system. We also assigned individual numbers to tree nodes and MFUs, which facilitated the process of migration, rebuilding and permanent perfusion in the managing node. To sum up, after the completion of the first part of the algorithm each computational node possesses identical vascular trees and MFUs.
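The pack-send-unpack step could, for instance, flatten the listed per-node fields into a plain struct array and broadcast it; the sketch below is only an assumption about the payload and transfer, not the authors' actual code.

#include <mpi.h>
#include <vector>

// Minimal per-node payload, as described in the text (illustrative names).
struct PackedNode {
    int   id;            // individual node number
    int   mfu_class;     // class of the attached MFU (or -1 for internal nodes)
    char  has_children;  // possession of children
    float pos[3];        // position in space
};

// Pack and broadcast the tree skeleton from the managing node (rank 0).
void broadcast_tree(std::vector<PackedNode>& nodes, MPI_Comm comm) {
    int rank, count = static_cast<int>(nodes.size());
    MPI_Comm_rank(comm, &rank);
    MPI_Bcast(&count, 1, MPI_INT, 0, comm);       // first tell everyone the size
    if (rank != 0) nodes.resize(count);
    // Sending the structs as raw bytes is the simplest flattening; a portable
    // implementation would rather register an MPI derived datatype.
    MPI_Bcast(nodes.data(), count * static_cast<int>(sizeof(PackedNode)),
              MPI_BYTE, 0, comm);
    // Receivers now rebuild the binary trees and recompute the remaining
    // characteristics (radii, pressures, flows) locally.
}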
Fig. 5. The second part of the algorithm responsible for MFUs perfusion
The second part of the algorithm is responsible for the calculations occurring between individual subcycles. First, the managing node creates the list of new MFUs which can be added to the vascular network. This step could also be modeled in a parallel environment, but the time it needs is negligible in comparison to the perfusion time. Next, the managing node makes no attempt to find vessels for the new MFUs itself; they are sent to the computational nodes. The MFU migration is simpler and less time consuming than migrating the entire trees, but to minimize the message within the message passing framework we again choose only the essential information, namely: position in space, blood flow and MFU class. When a computational node receives the message, it tries to find the closest vessels and the optimal point to connect the received MFU to the vascular network, thus simulating the perfusion process. If the search ends with success, the computational node does not permanently perfuse the new MFU but sends the parameters of the bifurcation to the managing node. The message contains only the position of the bifurcation point in space and the numbers of the perfusing vessels. Next, when the managing node receives the parameters of the bifurcation, it checks whether there have been any other changes in its trees since the last contact with the sender of the message. If there have been, it inspects the changed vessels. If at least one changed vessel is on the list of vessels chosen to perfuse the new element, the MFU is rejected. Otherwise, the managing node permanently joins the new MFU and broadcasts all new changes that occurred between the previous and the present contact with the sender of the message. The migration of changes is less time consuming than sending the entire trees; the full vascular trees are only sent at the beginning of each subcycle. If there are more MFUs to perfuse, the managing node sends the next one. The whole vascular system has to be sent at the beginning of each subcycle because there are several other processes (e.g. degeneration and growth of the organ shape) between cycles and subcycles which have an influence on the entire trees. The algorithm of creating new MFUs ensures that the number of rejected MFUs is small. Moreover, the rejected MFUs leave empty space between vessels and other MFUs, which increases the probability that the vascular network
growth algorithm will choose more macroscopic functional units in the next subcycle.
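The accept/reject protocol at the managing node might look like the following sketch; the tags, message layouts and helper names are assumptions made for illustration only.

#include <mpi.h>

// Illustrative message tags and helpers; not taken from the original code.
enum { TAG_PROPOSAL = 1, TAG_UPDATE = 2, TAG_NEXT_MFU = 3 };

struct Proposal { float position[3]; int vessel_ids[3]; };  // one perfusing vessel per tree

bool conflicts_with_recent_changes(const Proposal& p, int worker);
void apply_permanent_perfusion(const Proposal& p);
void send_changes_since_last_contact(int worker, MPI_Comm comm);   // uses TAG_UPDATE
bool send_next_mfu_if_any(int worker, MPI_Comm comm);              // uses TAG_NEXT_MFU

// Managing-node loop for one subcycle: collect proposals until no MFUs remain.
void manage_perfusion(int pending_mfus, MPI_Comm comm) {
    while (pending_mfus > 0) {
        Proposal p;
        MPI_Status st;
        MPI_Recv(&p, sizeof(Proposal), MPI_BYTE, MPI_ANY_SOURCE,
                 TAG_PROPOSAL, comm, &st);
        --pending_mfus;
        if (!conflicts_with_recent_changes(p, st.MPI_SOURCE))
            apply_permanent_perfusion(p);          // accept: join the MFU permanently
        // Rejected MFUs are simply dropped; their space is reused in later subcycles.
        send_changes_since_last_contact(st.MPI_SOURCE, comm);
        if (send_next_mfu_if_any(st.MPI_SOURCE, comm)) ++pending_mfus;
    }
}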
4 Experimental Results
This section contains a preliminary experimental verification of the proposed algorithm in a parallel environment. The presented results were obtained from many experiments. We used the default settings of the sequential version (about 12000 MFUs). Moreover, we checked the behavior of the proposed solution for large configurations with about 50000 MFUs and consequently about 300000 vessels (Fig. 6 shows a visualization of one of the obtained vascular systems).
Fig. 6. Visualization of the adult liver (about 50000 MFUs and 300000 vessels): a) portal veins, b) main hepatic arteries, portal veins and hepatic veins with liver shape
In the experiments a cluster of sixteen SMP servers running Linux 2.6 and connected by an InfiniBand network was used. Each server was equipped with two 64-bit Xeon 3.2 GHz CPUs with 2 MB L2 cache, 2 GB of RAM and an InfiniBand 10 Gb/s HCA connected to a PCI-Express port. We used MVAPICH version 0.9.5 [9] as the MPI implementation [4]. Figure 7a presents the obtained mean speedup. It is far from linear, but from a practical point of view it is very satisfactory. Usually the process of obtaining an organ with about 50000 MFUs on a single-processor machine (3.2 GHz CPU and 2 GB of RAM) takes about 24 hours, whereas the parallel version can simulate it approximately 8 times faster (with 16 processors). Moreover, it is worth noting that the speedup does not decrease significantly with the increasing number of MFUs; for 16 nodes it still varies around 8. In order to explain the presented results in more depth, a detailed time-sharing figure is presented for the case of 50000 MFUs (see Fig. 7b). It is clearly visible that the most time consuming operation is the perfusion process. The degeneration phase takes only a small part of the time necessary to develop the adult organ. The time connected with the MPI operations (e.g. sending and receiving messages) is insignificant in comparison to the time of the other operations; in our case it was always less than 1% of the whole simulation time. Moreover,
Fig. 7. Efficiency of the parallel implementation: a) mean speedup for many configurations of MFUs (absolute speedup vs. number of processors, compared with the theoretical linear speedup), b) detailed time-sharing figure for the case of about 50000 MFUs (organ growth time in hours split into perfusion, degeneration, tree uniformity and MPI time)
we can observe that the algorithm also spends a short period of time on the processes connected with maintaining the uniformity of the vascular system (e.g. selecting and gathering changes at the managing node and applying changes at the computational nodes). Furthermore, it should be mentioned that we changed the memory organization at the computational nodes. To optimize the time connected with tree rebuilding we introduced a continuous memory representation, which decreased the time needed to allocate and deallocate memory. Prior to this mechanism the mean speedup for 16 nodes was approximately 6.7.
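One way to obtain such a continuous representation is to keep all tree nodes in a single preallocated array and to address them by index instead of by pointer; the sketch below is an assumption about this idea, not the authors' data structure.

#include <vector>

// Contiguous pool of vessel-tree nodes addressed by index instead of pointers.
// Rebuilding a received tree then only appends into one preallocated buffer,
// avoiding per-node allocation and deallocation.
struct VesselNode {
    int   left  = -1;      // index of left daughter, -1 if none
    int   right = -1;      // index of right daughter, -1 if none
    float pos[3] = {0.0f, 0.0f, 0.0f};
    float radius = 0.0f, flow = 0.0f, pressure = 0.0f;
};

class VesselPool {
public:
    explicit VesselPool(std::size_t expected) { nodes_.reserve(expected); }
    int add(const VesselNode& n) {            // returns the index of the new node
        nodes_.push_back(n);
        return static_cast<int>(nodes_.size()) - 1;
    }
    VesselNode&       operator[](int i)       { return nodes_[i]; }
    const VesselNode& operator[](int i) const { return nodes_[i]; }
    void clear() { nodes_.clear(); }          // reuse the same storage next subcycle
private:
    std::vector<VesselNode> nodes_;
};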
5 Conclusion
In this paper, a parallel algorithm to model vascular network growth was investigated. It was shown that the presented solution significantly reduces the computation time, which increases the possibility to create more elaborate and precise virtual organs. This can be very useful when they are used in CT or MRI simulators. Moreover, this solution can be treated as a first step to bring the presented model closer to reality, in which the analogous processes of vascular network growth occur in parallel. The presented algorithm is still under development. We see many possible directions for future improvements and at least a few different approaches. First, we would like to introduce a more decentralized solution. We also plan to implement the perfusion process in the framework of shared-memory parallel programming (OpenMP), which would make it possible to reduce the time connected with tree rebuilding, waiting for new MFUs and tree uniformity.
Acknowledgments. This work was supported by the grant W/WI/5/08 from Bialystok Technical University.
References
1. Bézy-Wendling, J., Bruno, A.: A 3-D dynamic model of vascular trees. Journal of Biological Systems 7(1), 11–31 (1999)
2. Bézy-Wendling, J., Krętowski, M., Mescam, M., Jurczuk, K., Eliat, P.-A.: Simulation of hepatocellular carcinoma in MRI by combined macrovascular and pharmacokinetic models. In: 4th IEEE International Symposium on Biomedical Imaging: From Nano to Macro, pp. 1272–1275. IEEE Press, Washington (2007)
3. Demongeot, J., Bézy-Wendling, J., Mattes, J., Haigron, P., Glade, N., Coatrieux, J.-L.: Multiscale modeling and imaging: the challenges of biocomplexity. Proceedings of the IEEE 91, 1723–1737 (2003)
4. Juhasz, Z., Kacsuk, P., Kranzlmuller, D.: Distributed and Parallel Systems: Cluster and Grid Computing. Springer, Heidelberg (2005)
5. Kamiya, A., Togawa, T.: Optimal branching structure of the vascular trees. Bulletin of Mathematical Biophysics 34, 431–438 (1972)
6. Karch, R., Neumann, F., Neumann, M., Schreiner, W.: Staged growth of optimized arterial model trees. Annals of Biomedical Engineering 28, 495–511 (2000)
7. Krętowski, M., Bézy-Wendling, J., Coupe, P.: Simulation of biphasic CT findings in hepatic cellular carcinoma by a two-level physiological model. IEEE Trans. on Biomedical Engineering 54(3), 538–542 (2007)
8. Krętowski, M., Rolland, Y., Bézy-Wendling, J., Coatrieux, J.-L.: Physiologically based modeling for medical image analysis: application to 3D vascular networks and CT scan angiography. IEEE Trans. on Medical Imaging 22(2), 248–257 (2003)
9. Liu, J., Wu, J., Panda, D.K.: High performance RDMA-based MPI implementation over InfiniBand. Int. Journal of Parallel Programming 32(3), 167–198 (2004)
10. Press, W.H., Teukolsky, S.A., Vetterling, W.T., Flannery, B.P.: Numerical Recipes in C: The Art of Scientific Computing. Cambridge University Press, Cambridge (1992)
11. Schreiner, W., Buxbaum, P.F.: Computer-optimization of vascular trees. IEEE Trans. on Biomedical Engineering 40(5), 482–491 (1993)
12. Sherlock, S., Dooley, J.: Diseases of the Liver and Biliary System. Blackwell Science, Malden (2002)
13. Zamir, M.: Arterial branching within the confines of fractal L-system formalism. Journal of General Physiology 118, 267–275 (2001)
Some Remarks about Modelling of Annular Three-Layered Plate Structure Dorota Pawlus Faculty of Mechanical Engineering and Computer Science, University of Bielsko-Biała, Willowa 2, 43-309 Bielsko-Biała, Poland
Abstract. The influence of the examined mesh structure on the computational results is evaluated in this paper. Annular plates with a three-layered cross-section and a soft core are analysed in the range of their critical behaviour after the loss of static and dynamic stability. Several meshes of plate models which can be applied in numerical calculations are presented. The results obtained with the finite element method are compared with the results for plates solved using the finite difference method. The analysis covers a wide range of the examined problems, taking into account not only the global forms of critical plate deformations but also local ones, and analysing different plate buckling forms with several transverse waves in the circumferential direction. In the discussion, the sensitivity of the presented plate models depending on the problem they are applied to is pointed out. Keywords: mesh model, sandwich plate, static, dynamic stability, FEM, FDM.
1 Introduction
The different critical and overcritical behaviours of sandwich plate structures under lateral loads require building a proper computational model. The geometrical and material parameters of the component layers essentially determine the structure's behaviour, especially when there are significant differences among them. The widely examined structure of a three-layered plate with a soft, thick foam core is exactly such an object, whose computational model shows significant sensitivity to the parameters describing it. The way of building such a model for a plate with annular shape, which enables the solution of the static and dynamic plate problems, together with an indication of its computational sensitivity, is considered in this paper. Among recent works on the axisymmetric dynamic stability problems of sandwich plates, [1] and [2] can be mentioned.
2 Problem Formulation
The problem undertaken in this paper consists in the evaluation of the influence of the model structure of the three-layered plate on the computational results. The annular
plate with the soft core, compressed in the facings plane by loads acting on their inner and/or outer perimeters, is the subject of the analysis. The scheme of plate loading is presented in Fig. 1. Such loads cause the loss of plate stability, characterized by critical parameters such as the critical static or dynamic load, the form of buckling and the critical deflection. These quantities are analysed in this work for plate examples differing in the model built by means of the finite element method. The examined, exemplary plate has slidably clamped edges, a symmetrical cross-sectional structure and the following material and geometrical parameters:
- the inner radius ri=0.2 m;
- the outer radius ro=0.5 m;
- the facing thickness (equal for each facing) h'=0.0005 m or h'=0.001 m;
- the core thickness h2=0.005 m, 0.02 m, 0.06 m;
- the steel facing material with Young's modulus E=2.1·10^5 MPa, Poisson's ratio ν=0.3 and mass density μ=7.85·10^3 kg/m^3;
- two polyurethane foam core materials: one with Kirchhoff's modulus G2=5 MPa and mass density μ2=64 kg/m^3 [3], the other with G2=15.82 MPa and μ2=93.6 kg/m^3 [4]; the Poisson's ratio is ν=0.3 and the values of Young's modulus, E2=13 MPa and E2=41.13 MPa respectively, are calculated treating the foam material as isotropic.
Fig. 1. The scheme of the plate loaded: a) on the inner perimeter, b) on the outer perimeter
The results obtained with the finite element method are compared with the computational results for plates solved using the finite difference method. In building the plate models solved by both methods, finite element (FEM) and finite difference (FDM), the basic distribution of stresses between the plate layers is used: the facings carry the normal stresses and the core carries the shear. Such a load distribution between the layers of a plate with a soft core is the assumption of the classical theory of sandwich plates. In work [5] the proposal of applying mixed shell and solid elements in the mesh structure was presented; it has been used in the modelling of the structure of the analysed plate. The application of shell elements with the COMPOSITE option to specify a shell cross-section does not assure proper results in the stability problem of plates with a soft core. Calculations of plate models built of shell elements, for elastic core characteristics corresponding to the facings material parameters, were presented in work [6].
2.1 Plate Models Built in the Finite Element Method
The calculations were carried out in the ABAQUS system at the Academic Computer Center CYFRONET-CRACOW (KBN/SGL_ORIGIN_2000/PŁódzka/030/1999) [7]. The model in the form of the full annulus of the plate is the basic model accepted in the problem analysis. This model is composed of 9-node 3D shell elements and 27-node 3D solid elements building the facings and core meshes, respectively. The mesh of the model is presented in Fig. 2a.
Fig. 2. The forms of plate models: a) full annulus, b) annular sector with single or double core layer, c) built of axisymmetric elements with single, double and quaternary core layer (the facings are meshed with shell elements and the core with solid elements)
The examinations of the selected forms of critical plate deformations have been carried out for models in the form of an annular sector being a 1/8 or 1/6 part of the full plate perimeter. The facings mesh is again built of 9-node 3D shell elements and the core mesh of 27-node 3D solid elements. The solid elements can be arranged in a single or double layer in the core mesh. These models are presented in Fig. 2b. For some plate examples, whose minimal value of the critical load pcr (important in the stability problem) corresponds to the regular, axially-symmetrical form of buckling, the mesh can be simplified to a form where only axisymmetric elements are used. The cross-sectional structure presented in Fig. 2c is composed of 3-node shell and 8-node solid elements arranged in a single, double or quaternary core mesh layer. The regular, axially-symmetrical form of plate buckling corresponds to the minimal value of the critical load for plates slidably clamped and compressed on the inner perimeter [8]. Each of the analysed plate models uses a surface contact interaction (the TIE option) to connect the elements of the facings mesh with the elements of the core mesh. The proper symmetry conditions on the partitioned edges have been formulated for the annular sector models. The boundary conditions with the limitation of relative radial displacements in the slidably clamped plate edges are imposed on the outer
and inner plate edges. The introduction of an additional condition for the plate layers, connecting them by the equal deflection, increases the number of examined plate models.
2.2 Solution to the Plate Stability Problem Using the Finite Difference Method
The solution uses the classical theory of sandwich plates with the broken line hypothesis [8]. Equal deflections of the plate layers have been assumed. The basic elements in the solution to the static stability problem are as follows:
- formulation of the equilibrium equations for each plate layer,
- determination of the equations of radial and circumferential core deformation,
- formulation of the physical relations of the material of the plate layers,
- determination, on the strength of the equations of the sectional forces and moments and the suitable equilibrium equations, of the formulas for the resultant radial and circumferential forces and the resultant membrane radial, circumferential and shear forces expressed by means of the introduced stress function,
- formulation of the basic differential equation describing the deflections of the analysed plate by using the equilibrium equations of the projections, in the 3-direction, of the forces loading the plate layers,
- determination of the additional equilibrium equations of projections in the radial and circumferential directions of the forces loading the undeformed outer plate layers,
- determination of the boundary conditions and dimensionless quantities,
- assumption that the stress function is a solution to the disk state,
- application of the finite difference method for the approximation of the derivatives with respect to the radius and the solution of the eigenvalue problem with the calculation of the minimal value of p* as the critical static load pcr:
\det\left[\left(\mathrm{MAP} + \mathrm{MAD}\cdot\mathrm{MATD} + \mathrm{MAG}\cdot\mathrm{MATG}\right) - p^{*}\,\mathrm{MAC}\right] = 0 \qquad (1)
where p* = p/E, and MAP, MAC, MAD, MAG, MATD, MATG are matrices whose elements are composed of the geometric and material plate parameters, the length b of the interval in the finite difference method and the number m of buckling waves. The detailed description of the problem solution has been presented in work [9].
The results of the plate dynamic stability calculations presented in this work have been limited to the regular, axially-symmetrical form of critical plate deformation. This form corresponds to the minimal value of the critical load for the analysed plates compressed on the inner facings perimeters. The solution then requires the formulation of:
- the dynamic equilibrium equations,
- the description of the core deformation taking into account the plate imperfection,
- the determination of the initial loading conditions,
- the assumption of the form of plate predeflection,
- the formulation of the system of equations using the finite difference method.
The description of the solution is presented in work [10]. The numerical calculations in the finite difference method require a proper choice of the number of discrete points to keep the accuracy of the results within a 5% technical error. The calculations were carried out for 14 discrete points.
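As a reminder of how such radial derivatives are typically discretised on a uniform grid of spacing b, a standard central-difference scheme reads (this is only an illustration; the paper's own discretisation in dimensionless variables may differ in detail):

\frac{dw}{d\rho}\Big|_{i} \approx \frac{w_{i+1}-w_{i-1}}{2b}, \qquad \frac{d^{2}w}{d\rho^{2}}\Big|_{i} \approx \frac{w_{i+1}-2w_{i}+w_{i-1}}{b^{2}},

where w_i denotes the deflection at the i-th of the discrete points distributed along the dimensionless radius ρ.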
3 Discussion of Computational Results
The discussion of the computational results of the analysed plates is presented separately for the static and the dynamic stability problems. A distinction is also made between plates whose core thickness is treated as medium (h2=0.005 m and 0.02 m) and thick (h2=0.06 m).
3.1 Critical Static Loads
The observed buckling forms of the analysed plates loaded on the inner edge are regular and axially-symmetrical, whereas the plates compressed on the outer perimeter lose their stability with different numbers m of transverse waves in the circumferential direction. Essentially, the global, quasi-Eulerian forms of critical plate buckling are observed. For plates with a thick core, primary local forms of the loss of stability are expected, in which the critical deformations of the plate layers (the core, in particular) do not occur with equal deflections.
3.1.1 Analysis of Plates with Medium Core
The computational results of the different models of plates loaded on the inner edge are presented in Table 1. The critical form of deformation is regular and axially-symmetrical for all plate examples; for each plate model it is presented in Fig. 3.

Table 1. Values of the critical static stress pcr [MPa] of plates loaded on the inner edge of the facings

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | model built of axisymmetric elements | model built of axisymmetric elements 1) | annular sector (1/8 part) | annular sector (1/8 part) 1) | annular sector (1/8 part) 2) | FDM
0.0005/0.005/5.0   | 57.84  | 57.48  | 57.52  | 57.73  | 57.82  | 57.78  | 64.12
0.001/0.005/5.0    | 64.08  | 64.00  | 64.00  | 62.94  | 63.15  | 63.14  | 75.61
0.0005/0.02/5.0    | 170.50 | 168.32 | 172.27 | 168.04 | 173.94 | 169.43 | 165.51
0.001/0.02/5.0     | 143.77 | 143.20 | 144.16 | 143.22 | 144.17 | 143.17 | 150.29
0.0005/0.005/15.82 | 137.93 | 136.49 | 136.63 | 137.41 | 137.66 | 137.46 | 149.91
0.001/0.005/15.82  | 120.30 | 119.92 | 119.94 | 119.21 | 119.43 | 119.38 | 149.34
0.0005/0.02/15.82  | 449.24 | 434.56 | 457.95 | 435.62 | 462.66 | 438.89 | 437.54
0.001/0.02/15.82   | 326.41 | 324.01 | 328.30 | 323.93 | 330.41 | 325.67 | 338.94

1) Plate layers connected with the condition of the equal deflection. 2) Facings connected with the condition of the equal deflection.
The consistency of the results of all FEM plate models is observed. Good compatibility between the critical loads calculated with the finite difference and finite element methods is particularly observed for plates with the core thickness equal to h2=0.02 m. The values of critical loads of plate models with the condition of the equal
Fig. 3. Regular axially-symmetrical form of plate buckling for: a) full annulus plate model, b) annular plate sector, c) model built of axisymmetric elements
layers deflection are generally slightly higher than the values obtained for plates without this condition. The increase of these values above the values calculated with the FDM plate model appears for plates with thin facings h'=0.0005 m and a thicker core (h2=0.02 m). The results for the plates compressed on the outer perimeter are presented in Table 2, which contains the minimal values of the critical loads pcr and the number of buckling waves m. Some forms of plate buckling are presented in Fig. 4.

Table 2. Values of the critical static stress pcr [MPa] of plates loaded on the outer edge of the facings

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | full annulus plate model 1) | FDM
0.0005/0.005/5.0   | 19.16 (m=7)   | 19.18 (m=7)   | 22.37 (m=8)
0.001/0.005/5.0    | 16.48 (m=5)   | 16.49 (m=5)   | 20.52 (m=5)
0.0005/0.02/5.0    | 66.56 (m=10)  | 67.39 (m=9)   | 69.49 (m=12)
0.001/0.02/5.0     | 43.71 (m=7)   | 43.97 (m=7)   | 46.95 (m=7)
0.0005/0.005/15.82 | 52.03 (m=9)   | 52.10 (m=8)   | 61.51 (m=9)
0.001/0.005/15.82  | 35.04 (m=6)   | 35.06 (m=6)   | 46.53 (m=6)
0.0005/0.02/15.82  | 193.15 (m=12) | 198.65 (m=12) | 200.62 (m=18)
0.001/0.02/15.82   | 115.10 (m=9)  | 116.31 (m=8)  | 125.11 (m=9)

1) Plate layers connected with the condition of the equal deflection.
Fig. 4. The forms of critical plate deformations (for m=5, m=9 and m=12 circumferential waves)
The results show a slight increase in the values of the critical loads of plates with the condition of the equal layers deflection. In such cases the plate buckling form can change, with a decrease in the number m of waves. Additionally, Table 3 presents the results of selected plate examples obtained for the annular sector plate models. These values of the critical loads are suitably higher than the results obtained for the full annulus plate model.

Table 3. Critical static loads pcr [MPa] for plate examples compressed on the outer edges, with the results of the annular sector plate models

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | full annulus plate model 1) | annular sector (1/6 part) | annular sector (1/8 part) | annular sector (1/8 part) 1) | FDM
0.0005/0.005/15.82 | 52.03 (m=9)  | 52.10 (m=8)  | 52.75 (m=9)  | 52.63 (m=8)  | 55.26 (m=8)  | 61.51 (m=9)
0.001/0.005/15.82  | 35.04 (m=6)  | 35.06 (m=6)  | 36.74 (m=6)  | -            | -            | 46.53 (m=6)
0.001/0.02/15.82   | 115.10 (m=9) | 116.31 (m=8) | 123.23 (m=9) | 124.43 (m=8) | 116.65 (m=8) | 125.11 (m=9)

1) Plate layers connected with the condition of the equal deflection.
3.1.2 Analysis of Plates with Thick Core
The results of various models of plates compressed on the inner perimeter are presented in Table 4. The results marked by * concern the plate models whose critical deformation does not have the regular, axially-symmetrical form. For all examined models, a decrease in the values of the critical loads and a change in the form of the critical deformation are observed for the plates with thin facings h'=0.0005 m and a thick core h2=0.06 m, particularly for the core Kirchhoff's modulus equal to G2=15.82 MPa. Example forms of critical deformations are presented in Fig. 5.

Table 4. Values of the critical loads pcr [MPa] of plates with thick core loaded on the inner edge

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | model built of axisymmetric elements | model built of axisymmetric elements 1) | annular sector (1/8 part) 2) | model built of axisymmetric elements 3)
0.0005/0.06/5.0   | 347.54   | 345.48   | 440.20  | 324.10 * | 315.53
0.001/0.06/5.0    | 293.90   | 292.68   | 317.34  | 293.01   | 288.45
0.0005/0.06/15.82 | 791.37 * | 774.10 * | 1238.81 | 670.41 * | 684.19 *
0.001/0.06/15.82  | 689.10 * | 686.66   | 796.31  | 677.64   | 659.77

h' [m]/h2 [m]/G2 [MPa] | annular sector (1/8 part) | annular sector (1/8 part) 1) | annular sector (1/8 part) 3) | model built of axisymmetric elements 4) | FDM
0.0005/0.06/5.0   | 329.58   | 445.26  | 251.81 * | 309.47   | 406.98
0.001/0.06/5.0    | 291.53   | 319.62  | 279.29   | 287.84   | 312.53
0.0005/0.06/15.82 | 718.51 * | 1252.27 | 511.49 * | 649.49 * | 1191.70
0.001/0.06/15.82  | 676.11   | 804.84  | 586.19 * | 655.17   | 749.53

1) Plate layers connected with the condition of the equal deflection. 2) Facings connected with the condition of the equal deflection. 3) The core mesh built of two layers of solid elements. 4) The core mesh built of four layers of solid elements.
The results obtained for plates whose layers are connected by the condition of the equal deflection are an obvious exception: the values of the loads and the global buckling forms correspond to the results obtained for the plate models calculated with FDM. It can be suspected that these values are too high. The values of the critical loads of the plate models built of two or four layers of core elements are lower than the values obtained for the models with a single core layer. In particular, an essential decrease in the values of the critical loads is observed for the plate model in the form of the annular sector with the double core layer.
Fig. 5. The forms of buckling of plate models: a) full annulus (pcr=791.37 MPa), b) annular sector (pcr=718.51 MPa), c) model built of axisymmetric elements (pcr=774.10 MPa), d) model built of two layers of core axisymmetric elements (pcr=684.19 MPa)
The sensitivity of plate models with thin facings and a thick core is also observed for plates compressed on the outer perimeter. Example results are presented in Table 5.

Table 5. Values of the critical loads pcr [MPa] of plates with thick core loaded on the outer edge

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | full annulus plate model 1) | annular sector (1/8 part) 1) | FDM
0.0005/0.06/5.0   | 149.37 *      | 189.45 (m=12) | 187.15 (m=20) | 185.68 (m=17)
0.001/0.06/5.0    | 102.86 (m=9)  | 111.60 (m=8)  | 124.92 (m=8)  | 113.39 (m=9)
0.0005/0.06/15.82 | 402.44 *      | 571.69 (m=14) | 519.03 (m=24) | 542.70 (m=27)
0.001/0.06/15.82  | 274.23 (m=12) | 318.34 (m=10) | 343.26 (m=12) | 320.63 (m=12)

1) Plate layers connected with the condition of the equal deflection.
Also for plates loaded on the outer edge, the values of the critical loads of the models with connected layers seem to be too high. The results obtained for the annular sector plate model with connected layers are, both in the values of the critical loads and in the forms of buckling, closer to the results obtained with the finite difference method.
3.2 Critical Dynamic Loads
The calculations of the critical dynamic loads have been carried out for plates compressed on the inner edges of the facings with a linear, rapidly increasing stress expressed by the formula:
p = s\,t, \qquad (2)

where p is the compressive stress, s the rate of plate loading growth and t the time.
The rate of plate loading growth s is the same for each numerically analysed plate. The value of the rate s results from the equation s = K7·pcr, where the parameter K7 is accepted as K7=20 and, solving the eigenproblem, the critical stress is pcr=217.3 MPa, calculated for the plate with facing thickness h'=0.001 m, core thickness h2=0.01 m and core Kirchhoff's modulus G2=15.82 MPa. The calculations are carried out for the regular, axially-symmetrical form of plate buckling; the plate predeflection has this form, too. As the criterion of the loss of plate stability, the criterion presented in work [11] was adopted. According to this criterion, the loss of plate stability occurs at the moment of time when the speed of the point of maximum deflection reaches its first maximum value. The results for some plate examples obtained using the finite element method, in the form of time histories of the plate maximum deflection and the velocity of deflection, are
Fig. 6. Time histories of deflection and velocity of deflection for plates with parameters: a) h'=0.001 m, h2=0.005 m, G2=5 MPa, b) h'=0.001 m, h2=0.06 m, G2=5 MPa

Table 6. Values of the critical dynamic loads pcrdyn [MPa] and critical deflections wcr·10^-3 [m] of plates loaded on the inner edge (each cell gives pcrdyn / wcr)

h' [m]/h2 [m]/G2 [MPa] | full annulus plate model | model built of axisymmetric elements | annular sector (1/8 part) | annular sector (1/8 part) 3) | FDM
0.001/0.005/5.0   | 91.27 / 4.25  | 86.93 / 3.28  | 86.93 / 4.38  | 86.92 / 4.38  | 98.88 / 3.92
0.001/0.02/5.0    | 152.11 / 3.13 | 147.78 / 2.94 | 147.78 / 3.56 | 147.76 / 3.57 | 159.30 / 3.77
0.001/0.06/5.0    | 304.22 / 4.16 | 304.22 / 4.35 | 299.87 / 4.80 | 291.18 / 4.42 | 321.65 / 5.21
0.001/0.005/15.82 | 136.90 / 3.08 | 139.10 / 3.67 | 134.74 / 4.31 | 134.73 / 4.31 | 166.69 / 3.87
0.001/0.02/15.82  | 330.30 / 2.99 | 326.0 / 3.08  | 326.0 / 3.88  | 325.95 / 3.9  | 346.64 / 4.25

3) The core mesh built of two layers of solid elements.
presented in Fig. 6. Table 6 shows the values of the critical dynamic loads and the critical deflections for the plate models built in the finite element and finite difference methods. The results obtained with FEM indicate mutually good consistency. For plates with a thicker core, h2=0.02 m and 0.06 m, these results also correspond to the results obtained with the finite difference method. The major fluctuations are observed for the plate critical deflections. The calculations show that in the dynamic problem, in the range of the global buckling observations, the influence of the plate model structure built in the finite element method on the final results is not as essential as in the static analysis.
4 Conclusions
The results obtained for the presented models, which can be applied in computational plate examinations, indicate a certain sensitivity of their structures. The observed differences in the values of the critical loads, particularly for plates with a thick core, are in the range of dozens of MPa; therefore, these differences are significant. This problem particularly concerns the static stability issue, when critical forms other than the global ones can occur. The study of the mesh structure of these plate models seems to be especially important. The lowest values of the critical loads are obtained for plate models with the core mesh composed of several layers of solid elements. The computational results of these models can be an essential complement to the plate examinations carried out for the basic model in the form of the full annulus. Comparing the computational results of plates calculated with the two methods, finite element and finite difference, a compatibility of the results can essentially be established for plate models with a medium core. For plates with a thick core, consistency of the results is observed for those plate models whose critical deformation has the global, quasi-Eulerian form; then the values of the critical static loads may really be too high.
References
1. Wang, H.J., Chen, L.W.: Axisymmetric dynamic stability of sandwich circular plates. Composite Structures 59, 99–107 (2003)
2. Chen, Y.R., Chen, L.W., Wang, C.C.: Axisymmetric dynamic instability of rotating polar orthotropic sandwich annular plates with a constrained damping layer. Composite Structures 73(2), 290–302 (2006)
3. Majewski, S., Maćkowski, R.: Creep of Foamed Plastics Used as the Core of Sandwich Plate. Engineering and Building Industry (Inżynieria i Budownictwo) 3, 127–131 (1975) (in Polish)
4. Romanów, F.: Strength of Sandwich Constructions. WSI, Zielona Góra, Poland (1995) (in Polish)
5. Kluesener, M.F., Drake, M.L.: Mathematical Modelling. Damped Structure Design Using Finite Element Analysis. Shock and Vibration Bulletin 52, 1–12 (1982)
6. Pawlus, D.: Homogeneous and sandwich elastic and viscoelastic annular plates under lateral variable loads. In: Proceedings of the Third International Conference on Thin-Walled Structures, pp. 515–522. Elsevier Science, Amsterdam (2001)
7. Hibbitt, Karlsson and Sorensen, Inc.: ABAQUS/Standard User's Manual, version 6.1 (2000)
8. Volmir, C.: Stability of Deformed Systems. Science, Moskwa (1967) (in Russian)
9. Pawlus, D.: Solution to the Static Stability Problem of Three-Layered Annular Plates With a Soft Core. Journal of Theoretical and Applied Mechanics 44(2), 299–322 (2006)
10. Pawlus, D.: Dynamic Stability Problem of Three-Layered Annular Plate under Lateral Time-Dependent Load. Journal of Theoretical and Applied Mechanics 43(2), 385–403 (2005)
11. Volmir, C.: Nonlinear Dynamics of Plates and Shells. Science, Moskwa (1972) (in Russian)
Parallel Quantum Computer Simulation on the CUDA Architecture Eladio Gutierrez, Sergio Romero, Maria A. Trenas, and Emilio L. Zapata Department of Computer Architecture, University of Malaga, 29071 Malaga, Spain {eladio,sromero,maria,ezapata}@ac.uma.es
Abstract. Due to their increasing computational power, modern graphics processing architectures are becoming more and more popular for general purpose applications with high performance demands. This is the case of quantum computer simulation, a problem with high computational requirements both in memory and in processing power. When dealing with such simulations, multiprocessor architectures are an almost obligatory tool. In this paper we explore the use of the new NVIDIA CUDA graphics processor architecture in the simulation of some basic quantum computing operations. This new architecture is oriented towards a more general exploitation of the graphics platform, allowing it to be used as a parallel SIMD multiprocessor. In this direction, some implementation strategies are proposed, showing that the effectiveness of the codes depends on the right exploitation of the underlying memory hierarchy.
1 Introduction
Contrary to classical computers, quantum computers are devices that process information on the basis of the laws of quantum physics. Due to this fact, they could provide an efficient implementation of different algorithms with respect to both computing time and storage requirements. This way, they would be able to solve some problems of non-polynomial complexity in a much smaller time [10]. Although this approach is quite promising, at the present time it is still necessary to face certain limitations. On the one hand, existing technology only allows the construction of quantum computers of very reduced dimensions [7], and on the other hand, only a small number of effective algorithms [5,8,13] are known. Nevertheless, the analysis of this computational model constitutes a topic of great interest for physicists, computer scientists and engineers. Actually, a quantum computer can be considered as a hardware accelerator of the classical processor, from which it receives the orders for the resolution of a concrete problem [7], as shown in Fig. 1. As it is not possible to know the inner state of a quantum computer, according to the laws that govern it, outputs must be obtained by means of a measurement, issuing a result with a certain probability. Quantum parallelism is one of the sources of the power of quantum computers, as it allows simultaneous operations to be performed on an exponential set of
superposed states. This causes quantum computer simulation to demand high computational power; parallelism is thus a suitable tool for mitigating such requirements [7,11]. The simulation of quantum computers not only requires a high computational effort but also presents data access patterns with low locality. Several interests arise for this kind of simulation, giving rise to the development of different simulators, both in software [2,3,7,11] and in hardware [6,9,14]. In this paper we show that modern architectures based on Graphics Processing Units (GPU) are suitable to accomplish an efficient simulation of quantum computers. GPUs are devices specialized in graphics algorithms involving very intensive and highly parallel computations which, due to their high computational power, are nowadays also used for general purpose applications. With this purpose, we have implemented several parallel approaches to the simulation of the basic operators of an ideal quantum computer, using the new compute unified device architecture (CUDA) [12], lately released by the GPU manufacturer NVIDIA. Different strategies are explored, looking for the exploitation of data reference locality in this sort of architecture.
2 Quantum Computing
The ideal quantum computer to be simulated follows the model presented in [4], consisting of the successive application of a network of quantum gates to a quantum register with a classical initial state. The quantum bit (qubit) can be imagined as the linear superposition of two homologous classical states, noted |0⟩ and |1⟩ in Dirac notation. The state of a qubit can be represented using a two-dimensional complex vector, with |0⟩ and |1⟩ as the basis. Thus, the state of a qubit can be written as Ψ = α_0|0⟩ + α_1|1⟩, where the coefficients, or amplitudes, verify |α_0|^2 + |α_1|^2 = 1; |α_0|^2 and |α_1|^2 are interpreted as the probability of measuring |0⟩ or |1⟩. In vector notation, we write Ψ = (α_0, α_1)^T, |0⟩ = (1, 0)^T and |1⟩ = (0, 1)^T. A quantum register generalizes the qubit definition. The state of an n-qubit quantum register is determined by the linear superposition of the 2^n possible classical states provided by n bits. Hence the state of a quantum register can be written as Ψ = Σ_{i=0}^{2^n−1} α_i|i⟩ with α_i ∈ C and Σ_{i=0}^{2^n−1} |α_i|^2 = 1, since |α_i|^2 is interpreted as the probability of obtaining |i⟩ when the register is measured. Thus, Ψ belongs to a 2^n-dimensional complex vector space, for which the states |i⟩, 0 ≤ i ≤ 2^n−1, constitute a basis. For example, for n = 3 we will write |6⟩ = |110⟩ = (0 0 0 0 0 0 1 0)^T. By applying the Kronecker tensor product, it is possible to represent the elements of the state space basis of the register as a function of the individual states of the qubits, for example |6⟩ = |110⟩ = |1⟩ ⊗ |1⟩ ⊗ |0⟩. The state of a quantum register evolves according to a transformation, which can be interpreted as an operator U applied to the register state. The laws of quantum physics settle that the operator U must be linear and unitary. It follows that for an n-qubit register, a 2^n×2^n matrix can be found verifying
Fig. 1. A quantum computer model
Fig. 2. A QFT implementation
Fig. 3. Representation of quantum transformations as quantum gate networks
U U* = I, where U* is the matrix U conjugated and transposed, and I is the identity matrix. Usually, this kind of transformation is represented in the manner of Fig. 3(a). As a particular instance, let us consider the application of a transformation to one particular qubit, as shown in Fig. 3(b). In this case, the global transformation will be the tensor product of all the 1-qubit transformations simultaneously applied to each individual qubit. This means that the resulting global transformation U_g is equivalent to also applying the identity transformation to the remaining qubits. If the 1-qubit operator U is applied to the k-th qubit, then:

U_g = I^{\otimes(n-k-1)} \otimes U \otimes I^{\otimes k}, \qquad (1)

i.e. the identity acts on the n−k−1 more significant qubits and on the k less significant ones.
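For small registers, U_g can be built explicitly by Kronecker products, as the following sketch illustrates (this direct construction is our own illustration of Eq. (1); the simulation scheme described below never builds the full 2^n×2^n matrix).

#include <complex>
#include <vector>

using cplx = std::complex<double>;
using Matrix = std::vector<std::vector<cplx>>;  // dense row-major matrix

// Kronecker (tensor) product of two dense matrices.
Matrix kron(const Matrix& a, const Matrix& b) {
    std::size_t ra = a.size(), ca = a[0].size();
    std::size_t rb = b.size(), cb = b[0].size();
    Matrix out(ra * rb, std::vector<cplx>(ca * cb));
    for (std::size_t i = 0; i < ra; ++i)
        for (std::size_t j = 0; j < ca; ++j)
            for (std::size_t k = 0; k < rb; ++k)
                for (std::size_t l = 0; l < cb; ++l)
                    out[i * rb + k][j * cb + l] = a[i][j] * b[k][l];
    return out;
}

// U_g = I^{(n-k-1)} (x) U (x) I^{(k)} for a 1-qubit gate U acting on qubit k.
Matrix embed_1qubit_gate(const Matrix& u, int n, int k) {
    const Matrix I = {{1, 0}, {0, 1}};
    Matrix ug = {{1}};                       // 1x1 identity, neutral element
    for (int q = 0; q < n - k - 1; ++q) ug = kron(ug, I);
    ug = kron(ug, u);
    for (int q = 0; q < k; ++q) ug = kron(ug, I);
    return ug;                               // size 2^n x 2^n, i.e. O(4^n) memory
}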
The transformation applied to a single qubit can be interpreted as a unitary quantum gate of order 2×2. Table 1 presents several well-known transformations. As an example, the Pauli transformation X = |0⟩⟨1| + |1⟩⟨0| projects the component |0⟩ over |1⟩ and vice versa, so that its application to a classical state 0 or 1 is equivalent to the logic operator NOT. The generalization to gates with more than one qubit is straightforward, resulting in an associated matrix of order 2^n×2^n for n qubits. A quantum computer can be thought of as a quantum device on which a sequence (or network) of transformations is applied successively to the state of a quantum register [4]. Different minimal universal sets of gates have been proposed, such that any n-qubit transformation can be expressed as a network of these gates. It is proven that a complete set can be built from the gates Φ(δ), Rz(α), Ry(θ) and CNOT (2 qubits), as stated in [1]. A remarkable transformation is the Quantum Fourier Transform (QFT). The QFT [10] is a key element of some quantum algorithms like the integer
Table 1. Inventory of quantum computer gates

1-qubit transformations:
- Identity: I = |0⟩⟨0| + |1⟩⟨1|
- Pauli X: X = |0⟩⟨1| + |1⟩⟨0|
- Pauli Y: Y = j|1⟩⟨0| − j|0⟩⟨1|
- Pauli Z: Z = |0⟩⟨0| − |1⟩⟨1|
- Hadamard: H = (1/√2)(X + Z)
- y-axis rotation: Ry(θ) = cos(θ/2) I + sin(θ/2) Y
- z-axis rotation: Rz(α) = e^{jα/2}|0⟩⟨0| + e^{−jα/2}|1⟩⟨1|

2-qubit transformations:
- Controlled NOT: CNOT = |0⟩⟨0| + |1⟩⟨1| + |2⟩⟨3| + |3⟩⟨2|
- Controlled Phase Shift: CPh(K) = |0⟩⟨0| + |1⟩⟨1| + |2⟩⟨2| + e^{j2π/K}|3⟩⟨3|
factorization proposed by Shor [13], of great interest in cryptography. An implementation of the QFT is depicted in Fig. 2 in terms of 1-qubit Hadamard gates and 2-qubit controlled-phase gates. The QFT is defined analogously to the classical transform, but a normalization coefficient 1/√(2^n) makes it unitary:

|\Psi\rangle_{out} = \frac{1}{\sqrt{2^{n}}} \sum_{c=0}^{2^{n}-1} \left( \sum_{k=0}^{2^{n}-1} \alpha_{k}^{in}\, e^{j 2\pi c k / 2^{n}} \right) |c\rangle \qquad (2)

3 Elementary Quantum Gate Simulation
The simulation of a quantum computer consists of determining the state of an n-qubit register after the application of a unitary linear transformation. This means that we have to compute the register's state vector |Ψ⟩_out = Σ_{i=0}^{2^n−1} α_i^{out}|i⟩ from the initial state |Ψ⟩_in = Σ_{i=0}^{2^n−1} α_i^{in}|i⟩, that is, to determine the coefficients α_i^{out} of the final state as a function of the coefficients α_i^{in} of the initial state and the unitary matrix U defining the transformation. In general, the application of this unitary transformation requires computations of complexity O(2^{2n}), as the matrix U is of order 2^n×2^n, 2^n being the dimension of the associated vector space. By decomposing this transformation into a set of successive transformations on a lower number of qubits (quantum gates or stages), the effective complexity of the simulation can be reduced. As case studies, the simulation of three basic quantum computing operations is analyzed: a generic gate U over one qubit, the operator U^{⊗n} applied to an n-qubit register, and the QFT.
Simulation of a 1-qubit Gate U. The application of a 1-qubit quantum gate performs the operation |Ψ⟩_out = U_g|Ψ⟩_in, where U_g comes from expression (1) as a function of the 1-qubit transformation U. If we consider that the initial state is a superposition of states Ψ_in = Σ_{i=0}^{2^n−1} α_i^{in}|i⟩, the effect of the transformation on the coefficients α_i^{in} can be determined.
Consider that the 1-qubit transformation U = u_{00}|0⟩⟨0| + u_{01}|0⟩⟨1| + u_{10}|1⟩⟨0| + u_{11}|1⟩⟨1| is applied to the q-th qubit of an n-qubit register, and that the initial state is a classical one, Ψ_in = |i⟩ = |b_{n−1} b_{n−2} ... b_1 b_0⟩, that is, an element of the basis of the state space, where the b_k are the bits of the binary expression of the natural number i. The transformation U over the q-th bit b_q results in:

\Psi_{out} = |b_{n-1}\rangle \otimes \dots \otimes U|b_q\rangle \otimes \dots \otimes |b_0\rangle =
\begin{cases}
u_{00}\,|b_{n-1}...0...b_1 b_0\rangle + u_{10}\,|b_{n-1}...1...b_1 b_0\rangle & \text{if } b_q = 0 \\
u_{01}\,|b_{n-1}...0...b_1 b_0\rangle + u_{11}\,|b_{n-1}...1...b_1 b_0\rangle & \text{if } b_q = 1
\end{cases} \qquad (3)

By means of the bitwise exclusive or (⊕), this can be expressed as:

\alpha_i^{out} = \alpha_{b_{n-1}...b_q...b_1 b_0}^{out} =
\begin{cases}
u_{00}\,\alpha_i^{in} + u_{01}\,\alpha_{i\oplus 2^q}^{in} & \text{if } b_q = 0 \\
u_{10}\,\alpha_{i\oplus 2^q}^{in} + u_{11}\,\alpha_i^{in} & \text{if } b_q = 1
\end{cases} \qquad (4)

This means we can compute the output coefficients from the input ones, but it requires traversing each one of the coefficients, with a complexity of O(2^n). Actually, the coefficients associated to both b_q = 0 and b_q = 1 can be computed simultaneously. This reduces the complexity of the simulation loop to O(2^{n−1}), each iteration performing an effort equivalent to a matrix-vector product of order 2×2 (computation in place).
Simulation of a Factorizable r-qubit Gate. Let us consider the case of applying an r-qubit gate to an n-qubit register, this gate being factorizable in terms of the Kronecker product of 1-qubit gates. Its simulation follows from the simulation of the single-qubit gate: basically, it consists of applying the gate U to one qubit after another, resulting in r consecutive steps. In order to analyze the effect of the transformation on the coefficient space, let us study the application of a 1-qubit transformation U to r consecutive qubits, from the q-th to the (q+r−1)-th qubit. Following the argument of equations (3) and (4), we can infer that in this case the number of input coefficients that contribute to the calculation of a given output coefficient α_k^{out} is 2^r. Furthermore, this set of coefficients is given by

G_k^{q,r} = \bigcup_{m=0}^{2^r-1} \{ \alpha_{k\oplus 2^q \cdot m} \} \qquad (5)
Observe that the groups just defined act as closed groups, because for every α^{in} ∈ G_k its corresponding output coefficient α^{out} can be determined without any information from outside G_k. For example, when transforming the 4th and 5th qubits we have q = 4, r = 2 and G_k^{4,2} = {α_k, α_{k⊕2^4}, α_{k⊕2·2^4}, α_{k⊕3·2^4}}. Thus, the coefficients α_k^{out}, α_{k⊕2^4}^{out}, α_{k⊕2·2^4}^{out} and α_{k⊕3·2^4}^{out} can be computed from those included in G_k^{4,2} only. The number of disjoint groups existing in the coefficient space is 2^{n−r}. Therefore, the computational complexity is equivalent to applying 2^{n−r} times the
1-qubit gate U to an r-qubit register. In order to simulate a given gate U^{⊗n} applied to an n-qubit register, the computations can be organized by splitting the register into subregisters of r qubits. This way we can work over closed groups of a desired size. Note that the partition of the coefficient space into groups is different for the different subregisters; a host-side reference of this qubit-by-qubit procedure is sketched below.
Simulation of the QFT. The n-qubit Quantum Fourier Transform [10,3] can be implemented as shown in Fig. 2. A straightforward simulation, gate by gate, will involve n(n−1)/2 steps, which is very inefficient. Nevertheless, a more efficient implementation can be performed by grouping the gates into n stages, one for each k-th qubit, denoted as U_k in Fig. 2: QFT = U_{n−1} ··· U_1 · U_0. It is remarkable that each one of these stages U_k operates only on one of the qubits through controlled phase transformations and Hadamard gates. Therefore, the computational workload of simulating each stage U_k is the same as that of simulating a 1-qubit gate. This way, the resulting sequence of stages follows a scheme similar to the previously described simulation of U^{⊗n}.
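As a CPU reference for this scheme, the following sketch applies a 1-qubit gate in place to qubit q of the state vector, and U^{⊗n} by repeating it for every qubit; this is our own illustration of equations (3)-(5), not code from the paper.

#include <complex>
#include <cstddef>
#include <vector>

using cplx = std::complex<float>;

// In-place application of a 2x2 gate u to qubit q of an n-qubit state vector.
// Each iteration updates one pair {alpha[k], alpha[k ^ (1<<q)]}, as in Eq. (4).
void apply_1qubit_gate(std::vector<cplx>& alpha, int q, const cplx u[2][2]) {
    const std::size_t stride = std::size_t(1) << q;
    for (std::size_t base = 0; base < alpha.size(); base += 2 * stride)
        for (std::size_t k = base; k < base + stride; ++k) {
            cplx a0 = alpha[k];            // coefficient with bit q = 0
            cplx a1 = alpha[k + stride];   // coefficient with bit q = 1
            alpha[k]          = u[0][0] * a0 + u[0][1] * a1;
            alpha[k + stride] = u[1][0] * a0 + u[1][1] * a1;
        }
}

// U^{(x)n}: apply the same 1-qubit gate to every qubit, one stage after another.
void apply_gate_to_all_qubits(std::vector<cplx>& alpha, int n, const cplx u[2][2]) {
    for (int q = 0; q < n; ++q)
        apply_1qubit_gate(alpha, q, u);
}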
4 Implementation
In this section we map our simulation schemes onto the CUDA architecture for the three cases under study: the 1-qubit gate, the U^{⊗n} transformation and the QFT. As mentioned before, our simulation model (Fig. 1) involves a classical computation (code running on the host) and a quantum computation which is simulated on the GPU (kernel code running in parallel on the device). The main challenge when designing the kernel code is to handle the limitations imposed by the CUDA architecture. The first limitation arises from the organization of the device's memory system. On the one hand, the vector of coefficients describing the state of the quantum register tends to be very large, and only the device's global memory is able to store all of it. Although shared memory is two orders of magnitude faster (it acts as a parallel data cache for each multiprocessor), it has a small size and can hold only a small portion of the coefficients. Subsets of coefficients can be transferred from global to shared memory (copy-in) when they are reused frequently, in which case a substantial performance increase is obtained; notice that the results then have to be transferred back to global memory (copy-out). On the other hand, efficient transfers between global and shared memory are restricted to contiguous words (coalescing). Another important limitation comes from the synchronization mechanism inherent to CUDA. It is only a short-range mechanism, as it allows synchronizing only threads belonging to the same block; synchronization of threads belonging to different blocks must be handled on the host side.

Simulation of a 1-qubit Gate U. The simulation of a 1-qubit gate U is derived from the parallel SIMD execution of (4), where U is applied to the q-th qubit of an n-qubit register. According to this expression, the computation of a coefficient
α_k^{out} requires access to the coefficient α_k^{in} itself and to the coefficient α_{k⊕2^q}^{in}. Moreover, once these two coefficients have been read, α_{k⊕2^q}^{out} can also be calculated. Consequently, the kernel code must determine in parallel every pair of the form {α_k, α_{k⊕2^q}}. Since there exist 2^{n−1} such pairs, the same number of threads is required, with each thread in charge of applying the transformation U to one pair. As a given coefficient belongs to one and only one pair, only one read and one write operation in global memory are necessary, assuming that each coefficient α_k is located at the k-th position of the state vector in global memory. Thus, when a single gate is simulated in isolation, the use of shared memory does not improve performance, because there is no reuse of the data transferred from global to shared memory. Due to the disjointness of the pairs, the coefficients computed after a transformation can directly overwrite the inputs (in-place computation); this way, a larger number of qubits can be simulated. Note that synchronization points become mandatory when consecutive 1-qubit gates are simulated, in order to guarantee the correctness of the computation.
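The following Java sketch is a sequential, host-side reference version of this pair-wise in-place update of equation (4); it is our own illustration rather than the paper's CUDA kernel, and the split real/imaginary array layout is an assumption of the sketch.

```java
// Host-side reference sketch of the in-place 1-qubit update of equation (4).
// Each pair {a_k, a_{k XOR 2^q}} is read once, transformed by the 2x2 matrix
// U = [[u00,u01],[u10,u11]], and written back in place. There are 2^(n-1) pairs.
public final class OneQubitGate {
    static void apply(double[] re, double[] im, int q,
                      double[] uRe, double[] uIm) {  // u arrays: {u00,u01,u10,u11}
        final int stride = 1 << q;
        for (int base = 0; base < re.length; base += 2 * stride) {
            for (int k = base; k < base + stride; k++) { // bit q of k is 0
                int k1 = k ^ stride;                      // partner: bit q set
                double a0r = re[k],  a0i = im[k];
                double a1r = re[k1], a1i = im[k1];
                // a_k'  = u00*a0 + u01*a1 ;  a_k1' = u10*a0 + u11*a1
                re[k]  = uRe[0]*a0r - uIm[0]*a0i + uRe[1]*a1r - uIm[1]*a1i;
                im[k]  = uRe[0]*a0i + uIm[0]*a0r + uRe[1]*a1i + uIm[1]*a1r;
                re[k1] = uRe[2]*a0r - uIm[2]*a0i + uRe[3]*a1r - uIm[3]*a1i;
                im[k1] = uRe[2]*a0i + uIm[2]*a0r + uRe[3]*a1i + uIm[3]*a1r;
            }
        }
    }
}
```

For the Hadamard gate, for instance, uRe = {1/√2, 1/√2, 1/√2, -1/√2} and uIm = {0, 0, 0, 0}; on the GPU each iteration of the inner loop corresponds to the work of one thread.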
Simulation of a Factorizable n-qubit Gate. Let us consider the simulation of a multiple-qubit gate that is factorizable in terms of the Kronecker product of 1-qubit gates and/or controlled gates. Without loss of generality, the same 1-qubit gate U is applied to every qubit of the register, that is, the transformation to be analyzed is U^{⊗n}. A first approach follows from the simulation of the single-qubit gate: the gate U is applied to one qubit after another, resulting in n consecutive stages. Due to the lack of inter-block synchronization on the GPU side, synchronization on the host side is necessary, which implies a separate kernel invocation for each qubit, i.e., for each stage. This solution has two main disadvantages: the time overhead due to host synchronization and the inefficiency of not being able to use the fast shared memory. In contrast, a more efficient proposal is introduced hereafter. The key idea is to copy in a subset of coefficients from global to shared memory, perform all possible computations there, and then copy the results back from shared to global memory. The coefficients held in shared memory can be reused several times, resulting in more accesses to fast shared memory and fewer accesses to global memory. Basically, we consider quantum subregisters of consecutive qubits such that the corresponding groups G_k of expression (5) fit into the multiprocessor's shared memory. That is, the resulting groups must fulfill the condition Card(G_k) = S, where S is the number of coefficients that fit in shared memory; the number of qubits in the subregister to be transformed is then log_2(S). This approach is expressed in Fig. 5, which corresponds to the scheme of Fig. 4(a) for q = 0, r = 4 and S = 2^r, where the gates are applied to the log_2(S) least significant qubits. Note that in this case, q = 0, the coefficients of a group G_k occupy consecutive memory positions, so the copy-in/out operations benefit from coalescing.
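As an illustration of this copy-in/compute/copy-out organization for the least significant qubits (cf. Fig. 5 below), the following CPU sketch reuses the one-qubit update shown earlier. It is our own code: the local buffer stands in for the GPU shared memory and the outer loop stands in for the grid of thread blocks.

```java
// CPU sketch of the blocked scheme for the log2(S) least significant qubits:
// copy S contiguous coefficients into a local buffer (the "shared memory"),
// apply U to every qubit of the subregister there, then copy the block back.
final class BlockedLsbGate {
    static void applyLsbBlocked(double[] re, double[] im, int S,
                                double[] uRe, double[] uIm) {
        int r = Integer.numberOfTrailingZeros(S);      // S = 2^r coefficients
        double[] locRe = new double[S], locIm = new double[S];
        for (int base = 0; base < re.length; base += S) {
            System.arraycopy(re, base, locRe, 0, S);   // copy-in (coalesced)
            System.arraycopy(im, base, locIm, 0, S);
            for (int q = 0; q < r; q++) {              // U on each LSB qubit
                OneQubitGate.apply(locRe, locIm, q, uRe, uIm);
            }
            System.arraycopy(locRe, 0, re, base, S);   // copy-out
            System.arraycopy(locIm, 0, im, base, S);
        }
    }
}
```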
Fig. 4. Using the shared memory makes it possible to reduce the number of GPU/CPU synchronizations (a), but for the most significant qubits (b) the lack of coalescing may degrade performance. However, contiguous memory locations can be used by defining coalesced groups of coefficients (c).

Let G = G_k^{r,q=0} = \bigcup_{m=0}^{2^r-1} \{\alpha_{k \oplus m}\},  with Card(G) = 2^r = S
  Copy_in(G)
  Apply U^{\otimes r} to the \alpha \in G
  Copy_out(G)

Fig. 5. Kernel for the LSB qubits of U^{\otimes n}

Let H = G_k^{r-m,q} \cup G_{k+1}^{r-m,q} \cup \ldots \cup G_{k+M}^{r-m,q},  with m = \log_2(M), q > \log_2(S),
  Card(G_k^{r-m,q}) = 2^{r-m} = S/M,  Card(H) = S
  Copy_in(H)
  Apply U^{\otimes(r-m)} \otimes I^{\otimes m} to the \alpha \in H
  Copy_out(H)

Fig. 6. Kernel for the MSB qubits of U^{\otimes n}
When proceeding with the remaining n − log_2(S) qubits (q > log_2(S)), a first attempt may consist of building new groups G_k^{q,r} of cardinality S, as described above. This situation is shown in Fig. 4(b). Observe that, according to equation (5), the coefficients of these groups are located far apart in global memory when q > log_2(S). This means that global memory accesses are not coalesced, which adversely affects performance. To overcome this lack of coalescing, a superset of coefficient groups can be copied in/out by selecting the groups appropriately. With M being the number of these new groups, log_2(S/M) qubits can now be transformed, because the maximum number of coefficients in shared memory is S. The M groups are selected in such a way that the coefficients altogether constitute S/M series of M consecutive coefficients. This idea is illustrated in Fig. 6, where H denotes the superset of groups. In contrast to the approach followed for the LSB qubits, where log_2(S) gates are simulated, this one reduces the number of qubits processed per kernel to log_2(S/M), but locality is greatly improved. This results in more kernel invocations, as depicted in Fig. 4(c), where q = 4, S = 16 and M = 4. Nevertheless, the extra copy-in/out operations, including host-side synchronizations, are worthwhile because the memory accesses are coalesced.

Simulation of the QFT. As discussed in Section 3, the QFT can be carried out in n stages, denoted U_k, one for each k-th qubit, as shown in Fig. 2. The resulting chain of stages is analogous to the simulation of U^{⊗n} previously
described, but the transformation to be applied differs from stage (qubit) to stage. We have considered two implementations. The first one proceeds stage by stage, working on the state-vector coefficients stored in the device's global memory; this requires n different kernel invocations, one for each stage U_k. Better performance can be achieved by using the shared memory to hold a portion of the coefficient space and transferring data between global and shared memory according to the coalescing criterion. In this case, however, the computational scheme becomes more complex for two reasons. On the one hand, each transformation U_k involves controlled (conditional) updates of the state-vector coefficients, so each coefficient is transformed in a different way, in contrast to U^{⊗n}, where the same matrix is applied to all qubits. On the other hand, the transformations U_k must be translated properly when working on the local space of coefficients copied into shared memory, since coalesced consecutive coefficients in shared memory do not necessarily correspond to consecutive coefficients in the global space.
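To illustrate the kind of conditional update a stage U_k performs, the following Java sketch (our own, not the paper's kernel) applies a single controlled phase rotation: each amplitude is touched at most once and is multiplied by a phase only when both the control and target bits are set, which is why a whole stage costs no more than the simulation of a 1-qubit gate.

```java
// Controlled phase rotation R(theta): multiply amplitude a_i by e^{i*theta}
// only when both the control and the target bits of the index i are 1.
// Together with the Hadamard update sketched earlier, this is the building
// block of a QFT stage U_k.
final class ControlledPhase {
    static void apply(double[] re, double[] im,
                      int control, int target, double theta) {
        final double c = Math.cos(theta), s = Math.sin(theta);
        final int mask = (1 << control) | (1 << target);
        for (int i = 0; i < re.length; i++) {
            if ((i & mask) == mask) {
                double r0 = re[i], i0 = im[i];
                re[i] = r0 * c - i0 * s;
                im[i] = r0 * s + i0 * c;
            }
        }
    }
}
```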
5 Experimental Results
Simulations of a multiple-qubit gate U^{⊗n} and of the QFT for n qubits have been implemented and tested. The target GPU platform was an NVIDIA GeForce 8800GTX with the following features: 16 multiprocessors of 8 processors at 1.35 GHz, 768 MB of device global memory with a latency of up to 600 clock cycles, and an 8 KB parallel data cache (shared memory) per multiprocessor with a latency of 4 cycles. The CUDA 1.1 SDK and toolkit were used. The goal was to evaluate the two main strategies previously introduced: the gate-by-gate (stage-by-stage for the QFT) and the coalescing-improved implementations. In the case of U^{⊗n}, we select U = H (the Hadamard gate), giving rise to the so-called Walsh gate [10]. Table 2 shows the execution time in milliseconds for the stage-by-stage and coalescing-improved strategies on the GPU platform for different sizes of the quantum register. As a time reference, it also shows a sequential simulation on an Intel Core2 6300 based platform at 1.86 GHz running Linux and using libquantum [2], one of the most popular quantum simulation packages. The upper register size limit is 26 qubits, which is imposed by the memory size of the device. Notice that each state-vector coefficient is a complex number made of two single-precision floating point numbers (32 bits each). Two other significant parameters for the coalescing-improved version are the number of coefficients fitting in shared memory (S) and the optimal number of consecutive coefficients to be transferred (M). These values are platform dependent, being S = 1024 and M = 32 for our device. Also, the number of threads is set to half the number of coefficients. Several facts can be highlighted concerning these results. Firstly, good scalability with respect to the number of coefficients can be observed for both parallel versions. Secondly, the coalescing-improved version exhibits better performance over the whole range. In other words, high performance requires good exploitation of the device memory hierarchy. On the one hand, data to be reused must be allocated in shared memory whenever possible. On the other hand, it is crucial to coalesce the accesses to global memory. Finally, the coalescing-improved GPU version reaches a speedup of up to 85 relative to the CPU implementation for the fastest execution.

Table 2. Execution time (msec) for the Walsh gate and the QFT simulations

                                        Number of qubits of the quantum register
Implementation                        15    16    17    18    19    20    21    22    23     24     25     26
H^{⊗n}  CPU sequential: libquantum     -     -     -    31    78   156   328   688  1453   3031   6281  13062
        GPU stage-by-stage             -     -     -   1.8   3.2   6.1  12.0  24.3  49.9    102    212    439
        GPU coalescing-improved        -     -     -   1.1   2.0   4.1   8.8  18.1  37.1   76.2    158    342
QFT     CPU sequential: libquantum    10    30    80   150   350   730  1560  3260  6710  13910  28160  56890
        GPU stage-by-stage          2.38  4.58  9.09  18.6  39.5  85.2   185   402   874   1897   4100   8847
        GPU coalescing-improved     0.20  0.43  0.81  1.83  3.78  7.84  16.9  35.1  72.9    150    313    667
References 1. Barenco, A., Bennett, C.H., Cleve, R., DiVicenzo, D.P., Margolus, N., Shor, P., Sleator, T., Smolin, J.A., Weinfurter, H.: Elementary Gates for Quantum Computation. Phys. Rev. A 52, 3457–3467 (1995) 2. Butscher, B., Weimer, H.: The libquantum Library, http://www.enyo.de/libquantum/ 3. De Raedt, K., Michielsen, K., De Raedt, H., Trieu, B., Arnold, G., Richter, M., Lippert, T., Watanabe, H., Ito, N.: Massively Parallel Quantum Computer Simulator. Computer Physics Communications 176, 121–136 (2007) 4. Deutsch, D.: Quantum Computational Networks. Proceedings of Royal Society of London, Series A 425, 73–90 (1989) 5. Deutsch, D., Jozsa, R.: Rapid Solution of Problems by Quantum Computation. Proceedings of Royal Society of London, Series A 439, 553–558 (1992) 6. Fujishima, M.: FPGA-Based High-Speed Emulator of Quantum Computing. In: IEEE Int’l Conference on Computer Design (2004) ¨ 7. Glendinning, I., Omer, B.: Parallelization of the QC-Lib Quantum Computer Simulator Library. In: Wyrzykowski, R., Dongarra, J., Paprzycki, M., Wa´sniewski, J. (eds.) PPAM 2004. LNCS, vol. 3019, pp. 461–468. Springer, Heidelberg (2004) 8. Grover, L.K.: A Fast Quantum Mechanical Algorithm For Database Search. In: Annual ACM Symposium on the Theory of Computation, pp. 212–219 (1996) 9. Khalid, A.U., Zilic, Z., Radecka, K.: FPGA Emulation of Quantum Circuits. In: IEEE Int’l Conference on Field-Programming Technology (2003) 10. Nielsen, M., Chuang, I.: Quantum Computation and Quantum Information. Cambridge University Press, Cambridge (2004) 11. Niwa, J., Matsumoto, K., Imai, H.: General-Purpose Parallel Simulator for Quantum Computing. In: Calude, C.S., Dinneen, M.J., Peper, F. (eds.) UMC 2002. LNCS, vol. 2509, pp. 230–251. Springer, Heidelberg (2002) 12. NVIDIA CUDA Programming Guide, SDK and Toolkit, http://developer.nvidia.com/object/cuda.html 13. Shor, P.W.: Algorithms for Quantum Computation: Discrete Logarithm and Factoring. In: 35th Symposium on Foundations of Computer Science, pp. 124–134 (1995) 14. Udrescu, M., Prodan, L., Vladutiu, M.: Using HDLs for Describing Quantum Circuits: A Framework for Efficient Quantum Algorithm Simulation. In: Computing Frontiers Conference (2004)
Comparison of Numerical Models of Impact Force for Simulation of Earthquake-Induced Structural Pounding Robert Jankowski Faculty of Civil and Environmental Engineering, Gdańsk University of Technology, ul. Narutowicza 11/12, 80-952 Gdańsk, Poland [email protected]
Abstract. Structural pounding during earthquakes is a complex phenomenon involving plastic deformations, local cracking, etc. The aim of the present paper is to check the accuracy of three pounding force numerical models, such as: the linear viscoelastic model, the non-linear elastic model following the Hertz law of contact and the non-linear viscoelastic model. In the analysis, the results of numerical simulations have been compared with the results of an impact experiment conducted by dropping balls of different building materials. The results of the study indicate that the non-linear viscoelastic model is the most precise one in simulating the pounding force time history during impact. Keywords: pounding, earthquakes, impact force, numerical simulation.
1 Introduction Earthquake-induced pounding between neighbouring, inadequately separated buildings or bridge segments can lead to considerable damage or even collapse of the colliding structures (see, for example, [1,2]). Impact itself is a highly complex phenomenon involving plastic deformations at contact points, local cracking or crushing, friction, etc., which makes it difficult to model. Structural pounding has recently been studied intensively using different numerical models of impact force. The fundamental study on pounding between buildings in series using a linear viscoelastic model has been conducted by Anagnostopoulos [3]. Jankowski et al. [4] used the same model to study pounding of superstructure segments in bridges. In order to simulate the force-deformation relation more realistically, a non-linear elastic model following the Hertz law of contact has been adopted by a number of researchers (see, for example, [5,6]). For the purposes of a more precise simulation of the physical phenomenon, a non-linear viscoelastic model has also been considered [7-9]. In this model, a non-linear spring following the Hertz law of contact is applied together with an additional non-linear damper, which is activated during the approach period of collision in order to simulate the process of energy loss taking place mainly during that period. The aim of the present paper is to check the accuracy of these pounding force models for the simulation of impacts between different building materials. In the
analysis, the results of numerical simulations have been compared with the results of an impact experiment conducted by dropping balls of different mass.
2 Pounding Force Numerical Models

2.1 Linear Viscoelastic Model

The linear viscoelastic model is the most frequently used one for the simulation of structural pounding under earthquake excitation (see, for example, [3,4]). The pounding force during impact, F(t), for this model is expressed as:

F(t) = k\,\delta(t) + c\,\dot{\delta}(t),   (1)

where δ(t) describes the deformation of the colliding structural members, \dot{\delta}(t) denotes the relative velocity between them, k is the impact element's stiffness simulating the local stiffness at the contact point and c is the impact element's damping, which can be obtained from the formula [3]:

c = 2\xi \sqrt{k\,\frac{m_1 m_2}{m_1 + m_2}},   (2)

where m_1, m_2 are the masses of the structural members and ξ is the damping ratio related to the coefficient of restitution, e, which accounts for the energy dissipation during impact [10]. A value of e = 1 corresponds to a fully elastic collision, a value of e = 0 to a fully plastic one. The relation between ξ and e in the linear viscoelastic model is given by the formula [3]:

\xi = \frac{-\ln e}{\sqrt{\pi^2 + (\ln e)^2}}.   (3)
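For illustration, a direct Java transcription of formulas (2) and (3) might look as follows (our own sketch, not code from the paper):

```java
// Damping ratio and impact element's damping of the linear viscoelastic model.
final class LinearViscoelastic {
    static double xi(double e) {                       // equation (3)
        double ln = Math.log(e);
        return -ln / Math.sqrt(Math.PI * Math.PI + ln * ln);
    }
    static double damping(double k, double m1, double m2, double e) {
        return 2.0 * xi(e) * Math.sqrt(k * m1 * m2 / (m1 + m2));  // equation (2)
    }
}
```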
2.2 Non-linear Elastic Model
In order to model the pounding force-deformation relation more realistically, a non-linear elastic model following the Hertz law of contact has been adopted by a number of researchers [5,6]. The pounding force, F(t), for this model is expressed by the formula:

F(t) = \beta\,\delta^{3/2}(t),   (4)
where β is the impact stiffness parameter, which depends on material properties and geometry of colliding bodies. The disadvantage of the Hertz contact law model is that it is fully elastic and does not account for the energy dissipation during contact due to plastic deformations, local crushing, etc.
2.3 Non-linear Viscoelastic Model
For the purposes of a more precise simulation of an impact phenomenon, a non-linear viscoelastic model has been proposed [7]. The pounding force, F(t), for this model is expressed by the formula:

F(t) = \beta\,\delta^{3/2}(t) + c(t)\,\dot{\delta}(t)   for \dot{\delta}(t) > 0 (approach period),
F(t) = \beta\,\delta^{3/2}(t)                           for \dot{\delta}(t) \le 0 (restitution period),   (5)

where β is the impact stiffness parameter and c(t) is the impact element's damping, which at any instant of time can be obtained from the formula [7]:

c(t) = 2\xi \sqrt{\beta \sqrt{\delta(t)}\,\frac{m_1 m_2}{m_1 + m_2}},   (6)

where ξ denotes the damping ratio related to the coefficient of restitution, e. The approximate relation between ξ and e in the non-linear viscoelastic model is expressed by the formula [11]:

\xi = \frac{9\sqrt{5}}{2} \cdot \frac{1 - e^2}{e\,(e(9\pi - 16) + 16)}.   (7)
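A compact Java rendering of equations (5)-(7) could look like the following. It is an illustrative sketch of ours, not code used in the study; the handling of the out-of-contact case (δ(t) ≤ 0) is our addition, anticipating the drop-test setup of Section 3.

```java
// Non-linear viscoelastic pounding force (equations (5)-(7)).
// delta and deltaDot are the current deformation and relative velocity.
final class NonlinearViscoelastic {
    static double xi(double e) {                       // equation (7)
        return 9.0 * Math.sqrt(5.0) / 2.0
             * (1.0 - e * e) / (e * (e * (9.0 * Math.PI - 16.0) + 16.0));
    }
    static double force(double beta, double delta, double deltaDot,
                        double m1, double m2, double e) {
        if (delta <= 0.0) return 0.0;                  // not in contact
        double hertz = beta * Math.pow(delta, 1.5);    // beta * delta^(3/2)
        if (deltaDot > 0.0) {                          // approach period
            double c = 2.0 * xi(e)                     // equation (6)
                     * Math.sqrt(beta * Math.sqrt(delta) * m1 * m2 / (m1 + m2));
            return hertz + c * deltaDot;
        }
        return hertz;                                  // restitution period
    }
}
```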
3 Comparison of Pounding Force Numerical Models

In order to verify the accuracy of the pounding force models for different building materials, the results of the numerical simulations have been compared with the results of an impact experiment conducted by dropping steel, concrete and timber balls of different mass on a rigid surface. The balls have been dropped from various heights in order to obtain different impact velocities. The properties of the balls used in the experiment are specified in Table 1. The experimental setup is shown in Fig. 1.

Table 1. Properties of balls used in the experiment

Material                  Ball diameter (mm)   Ball mass (kg)
Steel (type 18G2A)         21                  0.053 – 0.054
                           50                  0.538 – 0.541
                           83                  2.013
Concrete (grade C30/37)   103                  1.329 – 1.350
                          114                  1.763 – 1.835
                          128                  2.531 – 2.636
Timber (pinewood)          55                  0.065 – 0.066
                           71                  0.109 – 0.112
                          118                  0.493 – 0.497
Fig. 1. Setup of the experiment
The numerical analysis has been conducted using the following equation of motion:

m\,\ddot{y}(t) + F(t) = m g,   (8)

where m is the mass of the ball, \ddot{y}(t) its vertical acceleration and g the acceleration of gravity. The pounding force, F(t), has been set to zero when y(t) ≤ h (h being the drop height) and has been calculated according to Equation (1), (4) or (5) when y(t) > h, whereas the deformation, δ(t), has been calculated as:

\delta(t) = y(t) - h.   (9)

A time-stepping integration procedure with a constant time step Δt = 1 × 10^{-6} s has been applied to solve Equation (8) numerically. The values of the impact stiffness parameters (k for the linear viscoelastic model and β for the two non-linear models) have been determined using the method of least squares. The difference between the results of the experiment and the results of the numerical analysis has been assessed by calculating the normalised error:

E = \frac{\| F - \bar{F} \|}{\| F \|} \cdot 100\%,   (10)

where F is the impact time-history vector obtained from the experiment, \bar{F} is the impact time-history vector obtained from the numerical analysis and \| F - \bar{F} \| denotes the Euclidean norm of F - \bar{F}.
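The following Java sketch shows one possible realization of this procedure for the non-linear viscoelastic model, using the force routine sketched in Section 2.3. It is our own illustration: the paper does not specify the integration scheme, and the semi-implicit Euler update, the value of g and the treatment of the rigid surface as an effectively infinite mass are assumptions of the sketch.

```java
// Drop-test integration of equations (8)-(9) with a constant time step, and the
// normalised error of equation (10).
final class DropTest {
    static double[] simulate(double m, double h, double v0, double beta,
                             double e, double dt, int steps) {
        double y = h, v = v0, g = 9.81;                // start at impact
        double[] force = new double[steps];
        for (int i = 0; i < steps; i++) {
            double delta = y - h;                      // equation (9)
            double F = (y > h)                         // contact only when y > h
                     ? NonlinearViscoelastic.force(beta, delta, v, m, 1e12, e)
                     : 0.0;                            // 1e12 ~ rigid surface mass
            double a = g - F / m;                      // from m*y'' + F = m*g (8)
            v += a * dt;                               // semi-implicit Euler step
            y += v * dt;
            force[i] = F;
        }
        return force;                                  // pounding force history
    }

    static double normalisedError(double[] fExp, double[] fNum) { // equation (10)
        double num = 0.0, den = 0.0;
        for (int i = 0; i < fExp.length; i++) {
            double d = fExp[i] - fNum[i];
            num += d * d;
            den += fExp[i] * fExp[i];
        }
        return 100.0 * Math.sqrt(num) / Math.sqrt(den);
    }
}
```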
The experimental study and the numerical analysis have been conducted for a large number of impact cases. In the following sections, examples of the results are presented.

3.1 Steel-to-Steel Impact
In the first example, the results of the numerical analysis are compared with the results of the experiment conducted for a steel ball of mass 2.013 kg impacting the steel surface with a velocity of 0.92 m/s. In the numerical analysis, the following values of the parameters defining the different pounding force models have been used: k = 4.82 × 10^8 N/m, ξ = 0.17 (e = 0.58) for the linear viscoelastic model, β = 7.55 × 10^{10} N/m^{3/2} for the non-linear elastic model, and β = 6.60 × 10^{10} N/m^{3/2}, ξ = 0.49 (e = 0.58) for the non-linear viscoelastic model. The pounding force time history measured during the experiment and the histories obtained from the numerical analysis for the considered example of steel-to-steel impact are presented in Fig. 2. Using Equation (10), the simulation errors for the pounding force histories have been calculated as: 15.9% for the linear viscoelastic model, 64.1% for the non-linear elastic model and 15.3% for the non-linear viscoelastic model.
Fig. 2. Pounding force time histories for steel-to-steel impact (experiment vs. the linear viscoelastic, non-linear elastic and non-linear viscoelastic models; pounding force [N] vs. time [ms])
3.2 Concrete-to-Concrete Impact
The second example concerns the comparison between the results of the numerical simulations and the experiment conducted for a concrete ball of mass 1.763 kg impacting the concrete surface with a velocity of 0.13 m/s. In the numerical analysis, the following values of the parameters defining the different pounding force models have been used: k = 4.91 × 10^7 N/m, ξ = 0.09 (e = 0.76) for the linear viscoelastic model, β = 1.04 × 10^{10} N/m^{3/2} for the non-linear elastic model, and β = 1.02 × 10^{10} N/m^{3/2}, ξ = 0.22 (e = 0.76) for the non-linear viscoelastic model. The pounding force time history measured during the experiment and the histories obtained from the numerical analysis for the considered example of concrete-to-concrete impact are presented in Fig. 3. The simulation errors for the pounding force histories from Fig. 3 have been calculated as: 12.7% for the linear viscoelastic model, 33.5% for the non-linear elastic model and 11.6% for the non-linear viscoelastic model.
Fig. 3. Pounding force time histories for concrete-to-concrete impact (experiment vs. the linear viscoelastic, non-linear elastic and non-linear viscoelastic models; pounding force [N] vs. time [ms])
3.3 Timber-to-Timber Impact
In the third example, the results of the numerical analysis are compared with the results of the experiment conducted for a timber ball of mass 0.109 kg impacting the
timber surface with a velocity of 0.39 m/s. In the numerical analysis, the following values of the parameters defining the different pounding force models have been used: k = 2.28 × 10^6 N/m, ξ = 0.16 (e = 0.61) for the linear viscoelastic model, β = 3.24 × 10^8 N/m^{3/2} for the non-linear elastic model, and β = 2.52 × 10^8 N/m^{3/2}, ξ = 0.43 (e = 0.61) for the non-linear viscoelastic model. The pounding force time history measured during the experiment and the histories obtained from the numerical analysis for the considered example of timber-to-timber impact are presented in Fig. 4. Using Equation (10), the simulation errors for the pounding force histories have been calculated as: 20.9% for the linear viscoelastic model, 61.0% for the non-linear elastic model and 19.5% for the non-linear viscoelastic model.
Fig. 4. Pounding force time histories for timber-to-timber impact (experiment vs. the linear viscoelastic, non-linear elastic and non-linear viscoelastic models; pounding force [N] vs. time [ms])
4 Conclusions

The results of the study indicate that the non-linear viscoelastic model is the most precise of the three in simulating the pounding force time histories during impact for the three different building materials. The model allows us to simulate the relatively rapid increase in the pounding force during the approach period of collision and the decrease in the force, with a lower unloading rate, during the restitution period. Because of the above, the model can be successfully used for numerical simulations of the pounding-involved response of structures under earthquake excitation in order to
enhance the accuracy of the analysis. On the other hand, Figs. 2-4 show the drawbacks of the two other models considered. In the case of the linear viscoelastic model, a negative force can be observed just before separation, which has no physical explanation. In the case of the non-linear elastic model following the Hertz contact law, the pounding force history in the approach and restitution periods is symmetric, due to the elastic behaviour, and the maximum pounding force attains a higher value than in the experimental results.
References 1. Rosenblueth, E., Meli, R.: The 1985 earthquake: causes and effects in Mexico City. Concrete international 8, 23–34 (1986) 2. Kasai, K., Maison, B.: Building pounding damage during the 1989 Loma Prieta earthquake. Engineering Structures 19, 195–207 (1997) 3. Anagnostopoulos, S.A.: Pounding of buildings in series during earthquakes. Earthquake Engineering and Structural Dynamics 16, 443–456 (1988) 4. Jankowski, R., Wilde, K., Fujino, Y.: Pounding of superstructure segments in isolated elevated bridge during earthquakes. Earthquake Engineering and Structural Dynamics 27, 487–502 (1998) 5. Jing, H.-S., Young, M.: Impact interactions between two vibration systems under random excitation. Earthquake Engineering and Structural Dynamics 20, 667–681 (1991) 6. Chau, K.T., Wei, X.X.: Pounding of structures modeled as non-linear impacts of two oscillators. Earthquake Engineering and Structural Dynamics 30, 633–651 (2001) 7. Jankowski, R.: Non-linear viscoelastic modelling of earthquake-induced structural pounding. Earthquake Engineering and Structural Dynamics 34, 595–611 (2005) 8. Jankowski, R.: Impact force spectrum for damage assessment of earthquake-induced structural pounding. Key Engineering Materials 293–294, 711–718 (2005) 9. Jankowski, R.: Pounding force response spectrum under earthquake excitation. Engineering Structures 28, 1149–1161 (2006) 10. Goldsmith, W.: Impact: The theory and physical behaviour of colliding solids. Edward Arnold Ltd., London (1960) 11. Jankowski, R.: Analytical expression between the impact damping ratio and the coefficient of restitution in the non-linear viscoelastic model of structural pounding. Earthquake Engineering and Structural Dynamics 35, 517–524 (2006)
Large-Scale Image Deblurring in Java Piotr Wendykier and James G. Nagy Dept. of Math and Computer Science, Emory University, Atlanta GA, USA [email protected], [email protected]
Abstract. This paper describes Parallel Spectral Deconvolution (PSD) Java software for image deblurring. A key component of the software, JTransforms, is the first, open source, multithreaded FFT library written in pure Java. Benchmarks show that JTransforms is competitive with current C implementations, including the well-known FFTW package. Image deblurring examples, including performance comparisons with existing software, are also given.
1 Motivation Instruments that record images are integral to advancing discoveries in science and medicine – from astronomical investigations, to diagnosing illness, to studying bacterial and viral diseases [1][2][3]. Computational science has an important role in improving image quality through the development of post-processing image reconstruction and enhancement algorithms and software. Probably the most commonly used post-processing technique is image deblurring, or deconvolution [4]. Mathematically this is the process of computing an approximation of a vector x_{true} (which represents the true image scene) from the linear inverse problem

b = A x_{true} + \eta.   (1)
Here, A is a large, usually ill-conditioned matrix that models the blurring operation, η is a vector that models additive noise, and b is a vector representing the recorded image, which is degraded by blurring and noise. Generally, it is assumed that the blurring matrix A is known (at least implicitly), but the noise is unknown. Because A is usually severely ill-conditioned, some form of regularization needs to be incorporated [5][6]. Many regularization methods, including Tikhonov, truncated singular (or spectral) value decomposition (TSVD), and Wiener filter, compute solutions of the form xreg = A†r b, where A†r can be thought of as a regularized pseudo-inverse of A. The precise form of A†r depends on many things, including the regularization method, the data b, and the blurring matrix A [4]. The actual implementation of computing xreg can often be done very efficiently using fast Fourier transforms (FFT) and fast discrete cosine transforms (DCT). This paper describes our development of Parallel Spectral Deconvolution (PSD) [7] Java software for image deblurring, including a plugin for the open source image processing system, ImageJ [8]. A key component of our software is the first, open source, multithreaded FFT library written in pure Java, which we call JTransforms [7].
Research supported by the NSF under grant DMS-05-11454.
This paper is organized as follows. In Section 2 we describe some basic image deblurring algorithms, and how fast transforms, such as FFTs and DCTs, can be used for efficient implementations. Section 3 describes the performance of our Java implementations, with a particular focus on JTransforms. Benchmarks show that our multithreaded Java approach is competitive with current C implementations, including the well-known FFTW package [9]. Image deblurring examples, including performance comparisons with existing software, are also given.
2 Deblurring Techniques

The deblurring techniques considered in this paper are based on filtering out certain spectral coefficients of the computed solution.

2.1 Regularization by Filtering

We begin by showing why regularization is needed, and how it can be done through spectral filtering. To simplify the discussion, we assume A is an n × n normal matrix [10], meaning that it has a spectral value decomposition (SVD)

A = Q^* \Lambda Q,   (2)

where Λ is a diagonal matrix containing the eigenvalues of A, Q is a matrix whose columns, q_i, are the corresponding eigenvectors, Q^* is the complex conjugate transpose of Q, and Q^* Q = I. (We realize that "SVD" usually refers to "singular value decomposition"; we do not think there should be any confusion, because our discussion of filtering can be done using the singular value decomposition in place of the spectral value decomposition.) We assume further that the eigenvalues are ordered so that |λ_1| ≥ |λ_2| ≥ ··· ≥ |λ_n| ≥ 0. Using the spectral decomposition, the inverse solution of (1) can be written as

x_{inv} = A^{-1} b = A^{-1}(A x_{true} + \eta) = x_{true} + A^{-1}\eta = x_{true} + \sum_{i=1}^{n} \frac{\tilde{\eta}_i}{\lambda_i}\, q_i,   (3)

where \tilde{\eta} = Q^* \eta. That is, the inverse solution is comprised of two terms: the desired true solution and an error term caused by noise in the data. To understand why the error term usually dominates the inverse solution, it is necessary to know the following properties of image deblurring [4][5]:
– Assuming the problem is scaled so that |λ_1| = 1, the eigenvalues, |λ_i|, decay to, and cluster at, 0, without a significant gap to indicate numerical rank.
– The eigenvectors q_i corresponding to small |λ_i| tend to have more oscillations than the eigenvectors corresponding to large |λ_i|.
These properties imply that the high frequency components in the error are highly magnified by division by the small eigenvalues. The computed inverse solution is dominated by
these high frequency components, and is in general a very poor approximation of the true solution, x_{true}. In order to compute an accurate approximation of x_{true}, or at least one that is not horribly corrupted by noise, the solution process must be modified. This process is usually referred to as regularization [5][6]. One class of regularization methods, called filtering, can be formulated as a modification of the inverse solution [5]. Specifically, a filtered solution is defined as

x_{reg} = A_r^{\dagger} b,   where   A_r^{\dagger} = Q^* \,\mathrm{diag}\!\left(\frac{\phi_1}{\lambda_1}, \frac{\phi_2}{\lambda_2}, \ldots, \frac{\phi_n}{\lambda_n}\right) Q.   (4)

The filter factors, φ_i, satisfy φ_i ≈ 1 for large |λ_i|, and φ_i ≈ 0 for small |λ_i|. That is, the large eigenvalue (low frequency) components of the solution are reconstructed, while the components corresponding to the small eigenvalues (high frequencies) are filtered out. Different choices of filter factors lead to different methods; popular choices are the truncated SVD (or pseudo-inverse), Tikhonov, and Wiener filters [5][6][11].

2.2 Tikhonov Filtering

To illustrate spectral filtering, consider the Tikhonov regularization filter factors

\phi_i = \frac{|\lambda_i|^2}{|\lambda_i|^2 + \alpha^2},   (5)

where the scalar α is called a regularization parameter, and usually satisfies |λ_n| ≤ α ≤ |λ_1|. Note that smaller α lead to more φ_i approximating 1. The regularization parameter is problem dependent, and in general it is nontrivial to choose an appropriate value. Various techniques can be used, such as the discrepancy principle, the L-curve, and generalized cross validation (GCV) [5][6]. There are advantages and disadvantages to each of these approaches [12], especially for large-scale problems. In this work we use GCV, which, using the SVD of A, requires finding α to minimize the function

G(\alpha) = \frac{ n \sum_{i=1}^{n} \left( \dfrac{\alpha^2\, \hat{b}_i}{|\lambda_i|^2 + \alpha^2} \right)^2 }{ \left( \sum_{i=1}^{n} \dfrac{\alpha^2}{|\lambda_i|^2 + \alpha^2} \right)^2 },   (6)

where \hat{b} = Q^* b. Standard optimization routines can be used to minimize G(α). Tikhonov filtering, and using GCV to choose regularization parameters, has proven to be effective for a wide class of inverse problems. Unfortunately, for large scale problems such as image deblurring, it may not be computationally feasible to compute the SVD of A. One way to overcome this difficulty is to exploit structure in the problem.

2.3 Fast Transform Filters

In image deblurring, A is a structured matrix that describes the blurring operation, and is given implicitly in terms of a point spread function (PSF). A PSF is an image of a point source object, and provides the essential information to construct A. The structure
of A depends on the PSF and on the imposed boundary condition [4]. In this subsection we describe two structures that arise in many image deblurring problems. However, due to space limitations, we cannot provide complete details; the interested reader should see [4] for more information. If the blur is assumed to be spatially invariant then the PSF is the same regardless of the position of the point source in the image field of view. In this case, if we also enforce periodic boundary conditions, then A has a circulant matrix structure, and the spectral factorization

A = F^* \Lambda F,   (7)

where F is a discrete Fourier transform (DFT) matrix; a d-dimensional image implies F is a d-dimensional DFT matrix. In this case, the matrix F does not need to be constructed explicitly; a matrix-vector multiplication F b is equivalent to computing a DFT of b, and similarly F^* b is equivalent to computing an inverse DFT. Efficient implementations of DFTs are usually referred to as fast Fourier transforms (FFT). The eigenvalues of A can be obtained by computing an FFT of the first column of A, and the first column of A can be obtained directly from the PSF. Thus, the computational efficiency of spectral filtering methods for image deblurring with a spatially invariant PSF and periodic boundary conditions requires efficient FFT routines. If the image has significant features near the boundary of the field of view, then periodic boundary conditions can cause ringing artifacts in the reconstructed image. In this case it may be better to use reflexive boundary conditions. But changing the boundary conditions changes the structure of A, and it no longer has the Fourier spectral decomposition given in (7). However, if the PSF is also symmetric about its center, then A is a mix of Toeplitz and Hankel structures [4], and has the spectral value decomposition

A = C^T \Lambda C,   (8)
where C is the discrete cosine transform (DCT) matrix; a d-dimensional image implies C is a d-dimensional DCT matrix. As with FFTs, there are very efficient algorithms for evaluating DCTs. Furthermore, computations such as the matrix vector multiplication Cb and CT b are done by calling DCT and inverse DCT functions. The eigenvalues of A can be obtained by computing a DCT of the first column of A, and the first column of A can be obtained directly from the PSF. Note that in the case of the FFT, F has complex entries and thus computations necessarily require complex arithmetic. However, in the case of the DCT, C has real entries, and all computations can be done in real arithmetic. Efficient FFT and DCT routines are essential for spectral deblurring algorithms. The next section describes our contribution to the development of efficient parallel Java codes for these important problems.
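As a concrete illustration of such spectral filtering, the following Java sketch applies the Tikhonov filter of (4)-(5) in the Fourier domain and evaluates the GCV function (6). It assumes that the eigenvalues of A and the transformed blurred image have already been obtained with an FFT, stored as interleaved real/imaginary arrays; it is our own reference sketch, not code from the PSD package.

```java
// Tikhonov filtering in the spectral domain. lambda holds the eigenvalues of A
// (e.g. the FFT of the PSF-derived first column of A) and bhat holds F*b, both
// as interleaved complex arrays {re0, im0, re1, im1, ...}.
final class TikhonovSpectral {
    static double[] filter(double[] lambda, double[] bhat, double alpha) {
        double[] xhat = new double[bhat.length];
        double a2 = alpha * alpha;
        for (int i = 0; i < bhat.length; i += 2) {
            double lr = lambda[i], li = lambda[i + 1];
            double br = bhat[i],   bi = bhat[i + 1];
            double denom = lr * lr + li * li + a2;     // |lambda_i|^2 + alpha^2
            // phi_i / lambda_i = conj(lambda_i) / (|lambda_i|^2 + alpha^2)
            xhat[i]     = (lr * br + li * bi) / denom;
            xhat[i + 1] = (lr * bi - li * br) / denom;
        }
        return xhat;   // an inverse FFT of xhat yields the regularized image
    }

    static double gcv(double[] lambda, double[] bhat, double alpha) { // eq. (6)
        double a2 = alpha * alpha, num = 0.0, den = 0.0;
        int n = bhat.length / 2;
        for (int i = 0; i < bhat.length; i += 2) {
            double l2 = lambda[i] * lambda[i] + lambda[i + 1] * lambda[i + 1];
            double b2 = bhat[i] * bhat[i] + bhat[i + 1] * bhat[i + 1];
            double w = a2 / (l2 + a2);
            num += w * w * b2;
            den += w;
        }
        return n * num / (den * den);
    }
}
```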
3 Using Java for Image Deblurring Java is ideally suited to provide efficient, open source image deblurring software that can be used in inexpensive imaging devices for point of care medical applications. Java implementations are available for virtually all computing platforms, and since May
2007 the source code of Java is distributed under the terms of the GNU General Public License. Moreover, Java has native support for multithreaded programming, which has become a mandatory paradigm in the era of multicore CPUs. Finally, sophisticated imaging functionality is built into Java, allowing for efficient visualization and animation of computational results. Significant improvements have been made to Java since the 1996 release of JDK 1.0, including Just-In-Time compilation, memory allocation enhancements, and utilization of performance features in modern x86 and x64 CPUs [13]. It is no longer the case that Java is too slow for high-performance scientific computing applications; this point is illustrated below for spectral image deblurring. There are disadvantages to using Java in scientific computing, including no primitive type for complex numbers, an inability to do operator overloading, and no support for IEEE extended precision floats. In addition, Java arrays were not designed for high-performance computing: a multi-dimensional array is an array of one-dimensional arrays, making it difficult to fully utilize cache memory. Moreover, Java arrays are not resizable, and only 32-bit array indexing is possible. Fortunately, open source numerical libraries, such as Colt [14], have been developed to overcome these disadvantages. For our work, we are implementing a fully multithreaded version of Colt, which we call Parallel Colt [7]. In the rest of this section we describe Java implementations of JTransforms, ImageJ and associated plugins for image deblurring.

3.1 JTransforms

Fast Fourier Transform. An FFT algorithm is the most efficient method to compute a DFT, with a complexity of Θ(N log(N)) for a DFT of a d-dimensional array containing N components. An FFT algorithm was first proposed by Gauss in 1805 [15], but it was the 1965 work by Cooley and Tukey [16] that is generally credited for popularizing its use. The most common variant of the algorithm, called radix-2, uses a divide-and-conquer approach to recursively split a DFT of size N into two parts of size N/2. Other splittings can be used as well, including mixed-radix and split-radix algorithms [17]. The split-radix algorithm has the lowest arithmetic operation count to compute a DFT when N is a power of 2 [18]. The algorithm was first described in 1968 by Yavne [19] and then reinvented in 1984 by Duhamel and Hollmann [20]. The idea here is to recursively divide a DFT of size N into one DFT of size N/2 and two DFTs of size N/4. Further details about the split-radix algorithm can be found in [17].

Parallel Implementation in Java. JTransforms is the first open source, multithreaded FFT library written in pure Java. The code was derived from the General Purpose FFT Package (OouraFFT) written by Ooura [21]. OouraFFT is a multithreaded implementation of the split-radix algorithm in C and Fortran. In order to provide more portability, both Pthreads and Windows threads are used in the implementation. Moreover, the code is highly optimized and in some cases runs faster than FFTW. Even so, the package has several limitations arising from the split-radix algorithm. First of all, the length of the
input data has to be a power of two. Second, the number of computational threads must also be a power of 2. Finally, one-dimensional transforms can only use two or four threads. JTransforms, with few exceptions, shares all the features and limitations of Ooura's C implementation. However, there are some important distinctions. First, JTransforms uses thread pools, while OouraFFT does not. Although thread pooling with Pthreads is possible, there is no code for this mechanism available in the standard library, and therefore many multithreaded applications written in C do not use thread pools; this has the added problem of incurring the overhead of creating and destroying threads every time they are used. Another difference between JTransforms and OouraFFT is the use of "automatic" multithreading. In JTransforms, threads are used automatically when computations are done on a machine with multiple CPUs. Conversely, both OouraFFT and FFTW require manually setting the maximum number of computational threads. Lastly, the JTransforms API is much simpler than that of OouraFFT, or even FFTW, since it is only necessary to specify the size of the input data; work arrays are allocated automatically and there is no planning phase. The release of Java 5 in 2004 came with a number of significant new language features [22]. One feature that we have found to be very useful is the cached thread pool, which creates new threads as needed and reuses previously constructed threads when they become available. This feature makes it possible to improve the performance of programs that execute many short-lived asynchronous tasks.

Benchmark. To show the performance of JTransforms we have benchmarked the code against the original OouraFFT and also against FFTW 3.1.2. The benchmark was run on a Sun Microsystems SunFire V40z server, with 4 Dual Core AMD Opteron 875 processors (2.2 GHz) and 32 GB of RAM. The machine had Red Hat Enterprise Linux version 5 (kernel 2.6.18-8.1.14.el5), gcc version 3.4.6 and Java version 1.6.0_03 (64-bit server VM) installed. The following Java options were used: -d64 -server -Xms15g -Xmx15g. For OouraFFT, we used the -O2 flag for the C compiler (one can get slightly better timings with unsafe flags: -O6 --fast-math). All libraries were set to use a maximum of eight threads and the DFTs were computed in place. The timings in Tables 1 and 2 are averages over 100 calls of each transform. This average execution time does not include the "warm-up" phase (the first two calls require more time) for JTransforms and OouraFFT. Similarly, for FFTW, the times do not include the planning phase. Table 1 presents the benchmark results for computing two-dimensional complex forward DFTs. For the 2^9 × 2^9, 2^10 × 2^10 and 2^12 × 2^12 sizes, JTransforms outperforms all other tested libraries.

Table 1. Average execution time (milliseconds) for 2-D, complex forward DFT

Library \ Size     2^7     2^8     2^9    2^10     2^11     2^12      2^13      2^14
JTransforms       2.43    3.76    6.21   32.84   198.31   529.81   4028.17  15682.78
OouraFFT          0.74    3.15   12.60   33.66   202.78   789.25   4165.33  16738.65
FFTW_ESTIMATE     1.15    4.84   31.75  131.80  1149.87  2715.39  26889.97  49670.29
FFTW_MEASURE      0.83    2.91   10.73   37.65   182.77   840.09   6665.73  14735.13
FFTW_PATIENT      0.67    2.81   11.73   36.84   179.55   884.39   3761.50  56522.40
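To give a flavour of the API simplicity mentioned above, a hypothetical usage of JTransforms for a 2-D complex forward DFT is sketched below. The class and method names follow the releases we are aware of from around the time of this paper (package edu.emory.mathcs.jtransforms; newer releases ship under org.jtransforms), so treat the exact names as an assumption.

```java
// Hypothetical JTransforms usage: the only required argument is the data size;
// work arrays and threads are handled automatically, with no planning phase.
import edu.emory.mathcs.jtransforms.fft.DoubleFFT_2D;

public class FftDemo {
    public static void main(String[] args) {
        int rows = 1024, cols = 1024;
        // Interleaved complex layout: data[2*(r*cols+c)] = Re, +1 = Im.
        double[] data = new double[2 * rows * cols];
        data[0] = 1.0;                        // a unit impulse as sample input
        DoubleFFT_2D fft = new DoubleFFT_2D(rows, cols);
        fft.complexForward(data);             // in-place forward transform
        fft.complexInverse(data, true);       // scaled inverse restores the input
    }
}
```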
Table 2. Average execution time (milliseconds) for 3-D, complex forward DFT

Library \ Size     2^2    2^3    2^4    2^5    2^6     2^7      2^8       2^9
JTransforms       0.12   1.09   2.35   5.02   6.43   46.85   553.21   7115.84
OouraFFT         0.001   0.02   0.15   1.67  11.38   58.63   847.13  12448.24
FFTW_ESTIMATE     0.48   0.39   0.44   1.59  11.18  110.14  1471.14  34326.50
FFTW_MEASURE      0.48   0.37   0.44   1.23   8.28   48.69   601.88   7432.08
FFTW_PATIENT     0.001   0.01   0.10   1.48   8.36   47.27   573.77   8936.34
Table 2 shows the benchmark results for three-dimensional, complex forward DFTs. Once again, our Java implementation is faster than OouraFFT for almost all sizes of input data. Moreover, starting from 2^6 × 2^6 × 2^6, JTransforms is faster than FFTW. More benchmark results, including discrete cosine and sine transforms, can be found at the JTransforms website [7].

3.2 Deconvolution Plugins for ImageJ

ImageJ [8] is an open source image processing program written in Java by Wayne Rasband, a researcher working at the U.S. National Institutes of Health (NIH). Besides having a large number of options for image editing applications, ImageJ is designed with a pluggable architecture that allows developing custom plugins (over 300 user-written plugins are currently available). Due to this unique feature, ImageJ has become a very popular application among a large and knowledgeable worldwide user community. DeconvolutionJ [23] is an ImageJ plugin written by Nick Linnenbrügger that implements spectral deconvolution based on the regularized Wiener filter [11]. The plugin has a number of limitations. It can handle arbitrary-sized two- and three-dimensional images, but it requires the PSF image to be the same size as the blurred image and centered in the field of view. In addition, the regularization parameter of the Wiener filter must be specified manually, and there is no update option to efficiently deblur the same image with different values of the regularization parameter. Last, but not least, DeconvolutionJ is a serial implementation, and therefore cannot take advantage of modern multicore processors. Our spectral deconvolution plugin, Parallel Spectral Deconvolution (PSD), does not suffer from any of these limitations. The current version (1.4) implements Tikhonov- and TSVD-based image deblurring [4]. Our multithreaded approach uses both JTransforms and Parallel Colt, so we were able to achieve superior performance compared to DeconvolutionJ. PSD's features include two choices of boundary conditions (reflexive and periodic), automatic choice of the regularization parameter using GCV, a threshold (the smallest nonnegative pixel value assigned to the restored image), single and double precision, a very fast parameter update option, and the possibility of defining the number of computational threads. By default, the plugin recognizes the number of available CPUs and uses that many threads. Nevertheless, the current implementation of PSD has a couple of limitations. First, color images are not supported (DeconvolutionJ is also limited to grayscale images). The second limitation arises due to JTransforms, where the size of the input data and the number of threads must
be powers of two. To support images of arbitrary size, PSD uses padding; the number of threads, however, must be a power of two. In order to test the performance of PSD, we also used the SunFire V40z, with ImageJ version 1.39s. The following Java options were used: -d64 -server -Xms15g -Xmx15g -XX:+UseParallelGC. The test image (see Fig. 1) is a picture of Ed White performing the first U.S. spacewalk in 1965 [24]. The true image is of size 4096 × 4096 pixels. The blurred image was generated by reflexive padding of the true data to size 6144 × 6144, convolving it with a Gaussian blur PSF (standard deviation = 20), adding 1% white noise and then cropping the resulting image to the size of 4096 × 4096 pixels.
Fig. 1. Astronaut image: blurred and restored data (panels: blurred image; blurred image, cropped; restored image, PSD; restored image, DeconvolutionJ)
Figure 1 shows the blurred data as well as the deblurred astronaut images obtained with DeconvolutionJ and PSD. To better illustrate the quality of the deblurring, we display a small region of the blurred and reconstructed images. In PSD, we used the Tikhonov method with reflexive boundary conditions and a regularization parameter equal to 0.004. Similarly, in DeconvolutionJ, we used no resizing (the image size was already a power of two), double precision for complex numbers and the same value of the regularization parameter. Table 3 presents the average execution times over 10 calls of each method. All timings are given in seconds and the numbers in brackets include the computation of the regularization parameter. One should notice a significant speedup, especially from 1 to 2 threads. The last row in Table 3 shows the execution time for DeconvolutionJ, which is over 11 times greater than the worst case of PSD (Tikhonov, FFT, 1 thread) and almost 30 times greater than the best case of PSD (Tikhonov, DCT, 8 threads). For 3-D deblurring we used exactly the same hardware and software. This time the test image (see Fig. 2) is a T1-weighted MRI image of Jeff Orchard's head [25].

Table 3. Average execution times (in seconds) for 2-D deblurring (numbers in brackets include the computation of the regularization parameter)

Method           1 thread       2 threads      4 threads      8 threads
Tikhonov, FFT    16.3 (54.3)    12.1 (37.8)    10.9 (28.8)    10.6 (27.8)
Tikhonov, DCT    14.8 (53.3)     9.1 (32.5)     6.7 (23.7)     6.1 (22.4)
DeconvolutionJ  181.7              -              -              -
Table 4. Average execution times (in seconds) for 3-D deblurring (numbers in brackets include the computation of the regularization parameter)

Method           1 thread      2 threads     4 threads     8 threads
Tikhonov, FFT    9.2 (27.8)    7.3 (18.7)    7.0 (15.6)    6.7 (14.4)
Tikhonov, DCT    6.2 (25.6)    3.9 (14.9)    2.4 (10.3)    2.0 (9.7)
DeconvolutionJ  31.6              -             -             -
The true image is of size 128 × 256 × 256 pixels. The blurred image was generated by zero padding of the true data to size 128 × 512 × 512, convolving it with a Gaussian blur PSF (standard deviation = 1), adding 1% white noise and then cropping the resulting image to the size of 128 × 256 × 256 pixels. Figure 2 shows the 63rd slice of the deblurred head images. In PSD, we used the Tikhonov method with reflexive boundary conditions and a regularization parameter equal to 0.02. In DeconvolutionJ, we used exactly the same parameters as for the 2-D astronaut image and 0.01 for the regularization parameter. In Table 4, we have collected all the timings. Once again, the execution time for DeconvolutionJ is over 3 times greater than the worst case of PSD (Tikhonov, FFT, 1 thread) and almost 16 times greater than the best case of PSD (Tikhonov, DCT, 8 threads).
Fig. 2. Head image (63rd slice): blurred and restored data (panels: blurred image; restored image, PSD; restored image, DeconvolutionJ)
4 Conclusion

In this paper we have described our research efforts to develop computationally efficient Java software for image deblurring. A key component of this software, JTransforms, is the first open source, multithreaded FFT library written in pure Java. Due to the use of a cached thread pool we are able to achieve superior performance and speedup on symmetric multiprocessing machines. Numerical results illustrate that our Parallel Spectral Deconvolution package outperforms the ImageJ plugin DeconvolutionJ, and that our Java FFT implementation, JTransforms, is highly competitive with optimized C implementations such as FFTW.
References 1. Sarder, P., Nehorai, A.: Deconvolution methods for 3D fluorescence microscopy images. IEEE Signal Proc. Mag., 32–45 (May 2006) 2. Roggemann, M.C., Welsh, B.: Imaging Through Turbulence. CRC Press, Boca Raton (1996) 3. Sechopoulos, I., Suryanarayanan, S., Vedantham, S., D’Orsi, C.J., Karellas, A.: Scatter radiation in digital tomosynthesis of the breast. Med. Phys. 34, 564–576 (2007) 4. Hansen, P.C., Nagy, J.G., O’Leary, D.P.: Deblurring Images: Matrices, Spectra and Filtering. SIAM (2006) 5. Hansen, P.C.: Rank-deficient and discrete ill-posed problems. SIAM (1997) 6. Vogel, C.R.: Computational Methods for Inverse Problems. SIAM (2002) 7. Wendykier, P.: JTransforms, Parallel Colt, Parallel Spectral Deconvolution (2008), http://piotr.wendykier.googlepages.com/ 8. Rasband, W.S.: ImageJ, U. S. National Institutes of Health, Bethesda, Maryland, USA (2008), http://rsb.info.nih.gov/ij/ 9. Frigo, M., Johnson, S.G.: The design and implementation of FFTW3. Proceedings of the IEEE 93(2), 216–231 (2005) 10. Stewart, G.W.: Matrix Algorithms, Volume 1: Basic Decompositions. SIAM (1998) 11. Gonzalez, R.C., Wintz, P.: 5. Digital Image Processing. Addison-Wesley, Reading (1977) 12. Kilmer, M.E., O’Leary, D.P.: Choosing regularization parameters in iterative methods for ill-posed problems. SIAM J. Matrix Anal. Appl. 22, 1204–1221 (2001) 13. Doederlein, O.: Mustang’s HotSpot Client gets 58% faster! (2005), http://weblogs.java.net/blog/opinali/archive/2005/11/ mustangs_hotspo_1.html 14. Hoschek, W.: Colt Project (2004), http://dsd.lbl.gov/%7Ehoschek/colt/index.html 15. Heideman, M.T., Johnson, D.H., Burrus, C.S.: Gauss and the history of the fast Fourier transform. Archive for History of Exact Sciences 34, 265–277 (1985) 16. Cooley, J.W., Tukey, J.W.: An Algorithm for the Machine Calculation of Complex Fourier Series. Mathematics of Computation 19(90), 297–301 (1965) 17. Van Loan, C.: Computational Frameworks for the Fast Fourier Transform. SIAM (1992) 18. Johnson, S.G., Frigo, M.: A modified split-radix FFT with fewer arithmetic operations. IEEE Trans. Signal Processing 55(1), 111–119 (2007) 19. Yavne, R.: An economical method for calculating the discrete Fourier transform. In: AFIPS Fall Joint Computer Conference, pp. 115–125 (1968) 20. Duhamel, P., Hollmann, H.: Split Radix FFT Algorithms. Electronic Letters 20, 14–16 (1984) 21. Ooura, T.: General Purpose FFT (Fast Fourier/Cosine/Sine Transform) Package (2006), http://www.kurims.kyoto-u.ac.jp/%7Eooura/fft.html 22. Sun Microsystems: New Features and Enhancements J2SE 5.0 (2004), http://java.sun.com/j2se/1.5.0/docs/relnotes/features.html 23. Linnenbrügger, N.: FFTJ and DeconvolutionJ (2002), http://rsb.info.nih.gov/ij/plugins/fftj.html 24. NASA: Great Images in NASA. Ed White performs first U.S. spacewalk (1965), http://grin.hq.nasa.gov/ABSTRACTS/GPN-2006-000025.html 25. Orchard, J.: His Brain (2007), http://www.cs.uwaterloo.ca/%7Ejorchard/mri/
A New Signature-Based Indexing Scheme for Trajectories of Moving Objects on Spatial Networks∗ Jaewoo Chang, Jungho Um, and Youngjin Kim Dept. of Computer Eng., Chonbuk National Univ., Chonju, Chonbuk 561-756, Korea {jwchang,jhum,yzkim}@chonbuk.ac.kr
Abstract. Because moving objects usually move on spatial networks, their trajectories play an important role in indexing them for spatial network databases. In this paper, we propose a new signature-based indexing scheme for moving objects’ trajectories on spatial networks. For this, we design it so that we can efficiently deal with the trajectories of current moving objects as well as for maintaining those of past moving objects. In addition, we provide both an insertion algorithm to store the segment information of moving objects’ trajectories and a retrieval algorithm to find a set of moving objects whose trajectories match with a query trajectory. Finally, we show that our indexing scheme achieves much better performance on trajectory retrieval than the leading trajectory indexing schemes, such as TB-tree and FNR-tree. Keywords: signature-based index scheme, trajectory, spatial network.
1 Introduction
Most of the existing work in spatial databases considers Euclidean spaces, where the distance between two objects is determined by the ideal shortest path connecting them [6]. However, in practice, objects can usually move on road networks, where the network distance is determined by the length of the real shortest path connecting two objects on the network. For example, a gas station nearest to a given point in Euclidean spaces may be more distant in a road network than another gas station. Therefore, the network distance is an important measure in spatial network databases (SNDB). Recently, there have been some studies on SNDB for emerging applications such as location-based services (LBS) [1, 5, 7, 8]. First, Speicys et al. [8] dealt with a computational data model for spatial networks. Secondly, Shahabi et al. [7] presented k-nearest neighbors (k-NN) query processing algorithms for SNDB. Finally, Papadias et al. [5] designed a novel index structure for supporting query processing algorithms for SNDB. Because moving objects usually move on spatial networks, instead of on Euclidean spaces, their trajectories play an important role in indexing them for spatial network ∗
This work is financially supported by the Ministry of Education and Human Resources Development (MOE), the Ministry of Commerce, Industry and Energy (MOCIE) and the Ministry of Labor (MOLAB) through the fostering project of the Lab of Excellency. This work is also supported by the second stage of the Brain Korea 21 Project.
databases. However, there has been little research on trajectory indexing schemes for spatial networks, even though efficient index structures are required to gain good retrieval performance on their trajectories. In this paper, we propose a new signaturebased indexing scheme for moving objects’ trajectories on spatial networks. For this, we design it so that we can efficiently deal with the trajectories of current moving objects as well as for maintaining those of past moving objects. In addition, we provide both an insertion algorithm to store the segment information of moving objects’ trajectories and a retrieval algorithm to find a set of moving objects whose trajectories match with a query trajectory. The rest of the paper is organized as follows. In Sect. 2, we introduce related work. In Sect. 3, we propose a signature-based indexing scheme for moving objects’ trajectories. In Sect. 4, we provide the performance analysis of our indexing scheme. Finally, we draw our conclusion in Sect. 5.
2 Related Work
There has been little research on trajectory indexing schemes for spatial networks, so we overview both a predominant trajectory index structure for Euclidean spaces and a leading trajectory index structure for spatial networks. First, Pfoser et al. [4] proposed a hybrid index structure which preserves trajectories as well as allows for R-tree-style range search in Euclidean spaces, called the TB-tree (Trajectory-Bundle tree). The TB-tree has fast accesses to the trajectory information of moving objects, but it has a couple of problems in SNDB. First, because moving objects move on a predefined spatial network in SNDB, the paths of moving objects overlap due to frequently used segments, like downtown streets. This leads to a large volume of overlap among the MBRs of internal nodes. Secondly, because the TB-tree constructs a three-dimensional MBR including time, the dead space for the moving object trajectory is highly increased in the case of a long-duration movement. This leads to a large volume of overlap with other objects’ trajectories. Meanwhile, Frentzos [2] proposed a new indexing technique, called the FNR-tree (Fixed Network R-tree), for objects constrained to move on fixed networks in two-dimensional space. The general idea of the FNR-tree is to construct a forest of 1-dimensional (1D) R-trees on top of a 2-dimensional (2D) R-tree. The 2D R-tree is used to index the spatial data of the network, e.g. roads consisting of line segments, while the 1D R-trees are used to index the time interval of each object movement inside a given link of the network. The FNR-tree outperforms the R-tree in most cases, but it has a critical drawback: it has to maintain a tremendously large number of R-trees, thus leading to a great amount of storage overhead. This is because the FNR-tree constructs as many R-trees as the total number of segments in the network, which is greater than 1 million in some cases.
3 Signature-Based Indexing Scheme 3.1 Trajectory Indexing Scheme for Current Moving Objects Because moving objects change their locations continuously on road networks, the amount of trajectory information for a moving object is generally very large. To solve
the problems of the TB-tree mentioned in Sect. 2, we propose a new signature-based indexing scheme which can have fast accesses to moving object trajectories. Figure 1 shows the structure of our trajectory indexing scheme. The main idea of our trajectory indexing scheme is to create a signature of a moving object trajectory and maintain partitions which store a fixed number of moving object trajectories and their signatures together in the order of their start time. There are a couple of reasons for using partitions. First, because a partition is created and maintained depending on its start time, it is possible to efficiently retrieve the trajectories of moving objects at a given time. Next, because a partition can be accessed independently to answer a trajectory-based query, it is possible to achieve better retrieval performance by searching partitions in parallel.

Fig. 1. Signature-based trajectory indexing scheme: a memory-resident partition table points to partitions 1..n; each partition keeps its signature information and location information in main memory and its trajectory information on disk.
Our trajectory indexing scheme consists of a partition table and a set of partitions. A partition can be divided into three areas; trajectory information, location information, and signature information. A partition table maintains a set of partitions which store trajectories for current moving objects. The partition table is resided in a main memory due to its small size. To answer a user query, we find partitions to be accessed by searching the partition table. The trajectory information area maintains moving object trajectories which consist of a set of segments (or edges). The location information area contains the location of an object trajectory stored in the trajectory information area. This allows for accessing the actual object trajectories corresponding to potential matches to satisfy a query trajectory in the signature information area. The location information area also allows for filtering out irrelevant object trajectories based on the time condition of a query trajectory because it includes the start time, the current time, and the end time for a set of object trajectories. To create a signature from a given object trajectory in an efficient manner, we make use of a superimposed coding because it is very suitable to SNDB applications where the number of segments for an object trajectory is variable [10]. To achieve good retrieval performance, we store both the signature and the location information in a main memory.
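To make the superimposed-coding step concrete, the following sketch shows one way such a trajectory signature could be generated. It is an illustrative assumption, not the paper's actual implementation: the signature length F, the number of bits per segment K, and the helper names are all hypothetical choices.

    import hashlib

    F = 256   # assumed signature length in bits
    K = 3     # assumed number of bits set per trajectory segment

    def segment_bits(segment_id: str, k: int = K, size: int = F):
        """Map one trajectory segment (edge id) to k bit positions."""
        positions = []
        for i in range(k):
            digest = hashlib.md5(f"{segment_id}:{i}".encode()).hexdigest()
            positions.append(int(digest, 16) % size)
        return positions

    def trajectory_signature(segment_ids):
        """Superimpose (bitwise OR) the codes of all segments of a trajectory."""
        signature = 0
        for seg in segment_ids:
            for pos in segment_bits(seg):
                signature |= (1 << pos)
        return signature

    def may_contain(trajectory_sig: int, query_sig: int) -> bool:
        """Signature filter: every bit set in the query signature must also be set
        in the trajectory signature. False drops are impossible, false positives
        are possible, so candidates still need exact verification."""
        return (trajectory_sig & query_sig) == query_sig

A query trajectory is encoded in the same way, and a test like may_contain then acts as the filter that selects candidate entries for exact verification.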
3.2 Trajectory Indexing Scheme for Past Moving Objects To answer trajectory-based queries with a past time, it is necessary to efficiently search the trajectories of past moving objects which no longer move on road networks. The trajectories of moving objects can be divided into two groups; one being frequently used for answering queries based on current object trajectories (COTSS) and the other for answering queries based on past object trajectories (POTSS). Figure 2 shows an overall architecture of indexing schemes for moving object trajectories. When a current moving object trajectory in COTSS is no longer changed due to the completion of the object movement, the object trajectory should be moved from COTSS to POTSS. The signature and the location information areas of COTSS are resided in a main memory for fast retrieval, whereas all of three areas of POTSS are maintained in a secondary storage.
Fig. 2. Overall architecture of indexing schemes for moving objects’ trajectories: COTSS (Current Object Trajectory Storage Structure) and POTSS (Past Object Trajectory Storage Structure)
To move current object trajectories from COTSS to POTSS, we should consider three requirements; retrieval of past object trajectories in an efficient way, accesses of the small number of partitions to answer a trajectory-based query, and construction of an efficient time-based index structure. To satisfy the first requirement, we make use of a bit-sliced method [10] for constructing a signature-based indexing scheme in POTSS, instead of using a bit-string method in COTSS. In the bit-sliced method, we create a fixed-length signature slice for each bit position in the original signature string. When the number of segments in a query trajectory is m and the number of bits assigned to a segment is k, the number of page I/O accesses for answering the query in the bit-sliced method is less than k*m. Therefore, when the number of segments in a query trajectory is small, our indexing scheme requires the small number of page I/O accesses due to the small number of signature slices needed for the query. To satisfy the second requirement, we maintain all the partitions in POTSS so that they can hold the condition that if start_time(partition i)<start_time(partition i+1), end_time(partition i) ≤end_time(partition i+1). If this condition is not satisfied among partitions in POTSS, query processing may be inefficient depending on the time window distribution of partitions in POTSS, even for queries with the same time window. Actually, if all the trajectories of the partition i have completed their movements earlier than those of the partition i-1, the partition i should move from COTSS to POTSS earlier than the partition i-1, leading to the dissatisfaction of the above condition. To prevent it, we require a strategy to store partitions such that if all the trajectories of the partition i are no longer changed, but those of the partition i-1 are changed, we exchange trajectories being changed in the partition i-1 with those having the smallest end time in the partition I and then move the partition i-1 from COTSS to POTSS. To satisfy the final requirement, we construct a B+-tree by using
the end time of a partition as a key so as to have fast accesses to partitions in POTSS. Figure 3 shows the time-based B+-tree structure. A record, Rec, of a leaf node in the time-based B+-tree stores a pointer to the corresponding partition, as illustrated in Fig. 3.

Fig. 3. Time-based B+-tree structure for partitions in POTSS (leaf nodes of the sequence set hold partition pointers that delimit the search space)
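As a rough illustration of the bit-sliced organization described above (not the paper's code), each bit position of the signature can be stored as its own slice; a query then reads only the slices for the bits set in the query signature — at most k·m of them — and intersects them. The class layout below is an assumed, simplified in-memory stand-in for the on-disk slices.

    from typing import List, Set

    class BitSlicedSignatureFile:
        """Toy bit-sliced signature store: slices[p] is the set of entry ids whose
        signature has bit p set. A real system stores fixed-length bit vectors on disk."""

        def __init__(self, signature_length: int):
            self.slices: List[Set[int]] = [set() for _ in range(signature_length)]

        def insert(self, entry_id: int, signature: int):
            p = 0
            while signature:
                if signature & 1:
                    self.slices[p].add(entry_id)
                signature >>= 1
                p += 1

        def candidates(self, query_signature: int) -> Set[int]:
            """Intersect only the slices selected by the query signature."""
            result = None
            p = 0
            while query_signature:
                if query_signature & 1:
                    slice_p = self.slices[p]
                    result = slice_p.copy() if result is None else (result & slice_p)
                query_signature >>= 1
                p += 1
            return result if result is not None else set()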
3.3 Insertion Algorithms for Moving Object Trajectories The algorithms for inserting moving objects trajectories can be divided into an initial trajectory insertion algorithm and a segment insertion algorithm for its trajectory. For the initial trajectory insertion, we find the last partition in the partition table and obtain an available entry (NE) in the last partition. The initial trajectory insertion can be performed according to two cases; one with no expected future trajectories and the other with expected trajectories. The detailed algorithm will be omitted due to its space requirement. For the segment insertion of a moving object trajectory, we find a partition storing its trajectory from the partition table by using the start time (ST) of the moving object. In addition, we obtain the entry storing the trajectory information in the partition. Figure 4 shows the segment insertion algorithm (i.e., InsertSeg) for moving object trajectories. Here NE is the entry in the partition covering the object identified by MOid and Loc is the location of the NE entry in the trajectory information area. The segment insertion can be performed in two cases. First, for a segment insertion for trajectories with no expected future ones, we just store a new segment (TrajSeg) into the NE entry of the trajectory information area, being addressed by Loc. In addition, we generate a trajectory signature (SigTS) from the TrajSeg and store the SigTS into the NE entry of the signature information area. Then, we store <MOid, Loc, StartT, CurrentT, ExpectET> into the NE entry of the location information area. Secondly, for a segment insertion for trajectories with expected future ones, we can store a new segment according to three types of the discrepancy between a new segment and the expected segment of a trajectory. To check if a new segment ac-cords with an expected trajectory’s segment, we call a find-seg() function to find a segment coinciding with TrajSeg from the expected trajectory of the NE entry. First, in case of no segment coinciding with TrajSeg (seg_pos = 0), we perform the same procedure as the segment insertion algorithm with no expected future
segments. In addition, we move the trajectory’s expected segments backward by one and store the TrajSeg into the (#_actual_seg)-th segment of the NE entry. Secondly, in case the segment coinciding with TrajSeg is the first one (seg_pos = 1), we store only the TrajSeg into the (#_actual_seg)-th segment of the NE entry because the TrajSeg is the same as the first expected segment of the trajectory. Otherwise (seg_pos > 1), we delete the (seg_pos-1) number of segments from the expected segments of the NE entry, store the TrajSeg into the (#_actual_seg)-th segment, and move all the expected segments forward by seg_pos-2. If the ratio of mismatched segments (#_mismatch) over all the segments of the trajectory is less than a threshold (τ), we store the trajectory signature (SigTS) generated from the TrajSeg into the NE entry of the signature information area. Otherwise, we regenerate SigTS from the trajectory information by calling a signature regeneration function (regenerate_sig). Finally, we update the values of #_actual_seg, #_future_seg, and #_mismatch in the NE entry, and we update the CurrentT of the NE entry in the location information area and that of the partition P’s entry in the partition table.

    Algorithm InsertSeg(MOid, TrajSeg, ST)
    /* TrajSeg contains a segment for the trajectory of a moving object MOid,
       to be stored with an object trajectory’s start time, ST */
    1.  Generate a signature SigTS from TrajSeg
    2.  Locate a partition P covering ST in the partition table
    3.  Locate an entry E covering ST for the moving object with MOid and get its
        location, Loc, in the trajectory information area
    4.  Obtain #actual_seg, #future_seg, and #mismatch of the trajectory info entry E
        (i.e., TE) for the MOid in P
    5.  if (#future_seg = 0) {                  // no expected trajectory
    6.      Insert TrajSeg into the (#actual_seg+1)-th trajectory segment of TE
    7.      Store SigTS into the entry E of the signature info area in P }
    8.  else {                                  // expected trajectory exists
    9.      seg_pos = find_seg(TrajSeg, Loc)
    10.     #actual_seg++, #future_seg = #future_seg – seg_pos
    11.     case (seg_pos = 0) {                // found no segment
    12.         Insert TrajSeg into the segment of TE and relocate the future
                trajectory segments backward
    13.         Store SigTS into entry E of the signature info area in P }
    14.     case (seg_pos = 1)                  // found the first segment
    15.         Insert TrajSeg into the (#actual_seg)-th trajectory segment of TE,
                exchanging the old segment
    16.     case (seg_pos > 1) {                // found the (seg_pos)-th segment
    17.         #mismatch = #mismatch + seg_pos – 1
    18.         Insert TrajSeg into the (#actual_seg)-th segment of TE and relocate
                the future trajectory segments forward
    19.         if (#mismatch/(#future_seg+#actual_seg) > τ)
                    regenerate_sig(Loc, SigTS, E, P) }   // end of case
    20. }                                        // end of else
    21. Update #actual_seg, #future_seg, and #mismatch of TE
    22. CurrentT = te of TrajSeg
    23. Store CurrentT into the current_time of the entry E and into the
        current_time of the partition P entry
    End InsertSeg

Fig. 4. Segment insertion algorithm for moving object trajectories
3.4 Retrieval Algorithm for Moving Object Trajectories The retrieval algorithm for moving object trajectories finds a set of objects whose trajectories match the segments of a query trajectory. Figure 5 shows the retrieval algorithm (i.e., Retrieve) for moving object trajectories. To find a set of partitions satisfying the time interval (TimeRange) represented by
    Algorithm Retrieve(QSegList, TimeRange, MOidList)
    /* MOidList is the list of moving objects’ ids that satisfy QSegList for TimeRange */
    1.  QSig = 0, #qseg = 0, partList = Ø
    2.  t1 = TimeRange.lower, t2 = TimeRange.upper
    3.  for each segment QSj of QSegList {
    4.      Generate a signature QSSj from QSj
    5.      QSig = QSig | QSSj, #qseg = #qseg + 1 }
    6.  find_partition(TimeRange, partList)
    7.  for each partition Pn of partList {
    8.      Obtain a set of candidate entries, CanList, by examining the signatures
            of the signature info area in Pn
    9.      for each candidate entry Ek of CanList {
    10.         Let s, e, c be start_time, end_time, current_time of the entry Ek
                of the location information area
    11.         if ((s ≤ t2) AND (e ≥ t1 OR c ≥ t1)) {
    12.             #matches = 0
    13.             Obtain the first segment ESi of the entry Ek, TEk, and the first
                    segment QSj of QSegList
    14.             while (ESi ≠ NULL and QSj ≠ NULL) {
    15.                 if (match(ESi, QSj) = FALSE) Obtain the next segment ESi of TEk
    16.                 else { #matches = #matches + 1
    17.                        Obtain the first segment ESi of TEk }
    18.                 if (#matches = #qseg) MOidList = MOidList ∪ {TEk’s MOid}
    19.             } } }   // end of while // end of if // end of for - CanList
    20. }                   // end of for - partList
    End Retrieve

Fig. 5. Retrieval algorithm for moving object trajectories
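The retrieval logic of Fig. 5 can be paraphrased as the following sketch. It is illustrative only: the partition and entry attributes are simplified assumptions, and it reuses the hypothetical trajectory_signature and may_contain helpers from the earlier signature sketch.

    def retrieve(partitions, query_segments, t1, t2):
        """Return ids of moving objects whose trajectories contain all query
        segments (in order) and overlap the time window [t1, t2]."""
        query_sig = trajectory_signature(query_segments)  # from the earlier sketch
        result = set()
        for part in partitions:
            # partition-level time filter (the paper uses the partition table / B+-tree)
            if part.start_time > t2 or part.end_time < t1:
                continue
            for entry in part.entries:
                # signature filter: drop entries that cannot possibly match
                if not may_contain(entry.signature, query_sig):
                    continue
                # entry-level time filter on start, end and current times
                if entry.start_time > t2 or max(entry.end_time, entry.current_time) < t1:
                    continue
                # exact verification: query segments must appear in order
                it = iter(entry.segments)
                if all(any(seg == s for s in it) for seg in query_segments):
                    result.add(entry.mo_id)
        return result

Because the query signature combines all query segments into a single filter, the number of candidate entries that need exact verification stays small even for long query trajectories, which is consistent with the roughly constant retrieval time reported below.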
4 Performance Analysis
We implement our trajectory indexing scheme on a Pentium-IV 2.0 GHz CPU with 1 GB main memory, running Windows 2003. For our experiment, we use a road network consisting of 170,000 nodes and 220,000 edges [9]. We also generate 50,000 moving objects randomly on the road network by using Brinkhoff’s algorithm [1]. For performance analysis, we compare our indexing scheme with the TB-tree and the FNR-tree in terms of insertion time and retrieval time for moving object trajectories. First, Table 1 shows the insertion performance for storing one moving object trajectory. The result shows that our indexing scheme preserves nearly the same insertion performance as the TB-tree, whereas the FNR-tree provides about two orders of magnitude worse insertion performance than the TB-tree. This is because the FNR-tree constructs a tremendously large number of R-trees, i.e., one per segment in the road network.

Table 1. Trajectory insertion performance

                                      TB-tree   FNR-tree   Our indexing scheme
  Trajectory insertion time (sec)     1.232     401        1.606
We measure retrieval time for answering queries whose trajectories contain 2 to 20 segments. Figure 6 shows the trajectory retrieval performance. The results show that our indexing scheme requires about 20 ms while the FNR-tree and the TB-tree need 25 ms and 93 ms, respectively, when the number of segments in a query is 2. Our indexing scheme thus outperforms the existing schemes even when the number of segments in a query trajectory is small. On the contrary, the TB-tree achieves poor retrieval performance due to the large extent of overlap in its internal nodes even when the number of segments in a query trajectory is small. As the number of segments in queries increases, the retrieval time increases for both the FNR-tree and the TB-tree; our indexing scheme, however, requires roughly constant retrieval time. The reason is that our indexing scheme creates a query signature combining all the segments in a query and searches for potentially relevant trajectories of moving objects once, by using the query signature as a filter.
Fig. 6. Trajectory retrieval performance (wall time in seconds vs. number of query segments, 2–20, for the TB-tree, the FNR-tree, and our indexing scheme)
When the number of segments in a query is 20, our indexing scheme requires about 20 ms while the FNR-tree and the TB-tree need 150 ms and 850 ms, respectively. Thus our indexing scheme achieves about one order of magnitude better retrieval performance than the existing schemes. This is because our indexing scheme constructs an efficient signature-based index structure by using a superimposed coding technique. On the contrary, the TB-tree builds an MBR for each segment in a query and performs a range search for each MBR. Because the number of range searches increases in proportion to the number of segments, the TB-tree’s trajectory retrieval performance degrades dramatically when the number of segments is large. Similarly, the FNR-tree has to search an R-tree for each segment in a query. Because it accesses as many R-trees as there are segments in the query, the FNR-tree’s retrieval performance also degrades as the number of segments increases.
5 Conclusion Because moving objects usually move on spatial networks, instead of on Euclidean spaces, efficient index structures are needed to gain good retrieval performance on
their trajectories. However, there has been little research on trajectory indexing schemes for spatial network databases. Therefore, we proposed a signature-based indexing scheme for moving objects’ trajectories on spatial networks so that we might efficiently deal with the trajectories of current moving objects as well as for maintaining those of past moving objects. In addition, we provided both a segment insertion algorithm and a retrieval algorithm. Finally, we showed that our indexing scheme could achieve, to a large extent, about one order of magnitude better retrieval performance than the existing schemes, such as the FNR-tree and TB-tree. As future work, it is required to extend our indexing scheme to a parallel environment so as to achieve better retrieval performance due to the characteristic of signature files [10].
References
1. Brinkhoff, T.: A Framework for Generating Network-Based Moving Objects. GeoInformatica 6(2), 153–180 (2002)
2. Frentzos, R.: Indexing Moving Objects on Fixed Networks. In: International Conference on Spatial and Temporal Databases, Santorini Island, Greece, pp. 289–305 (2003)
3. Faloutsos, C., Christodoulakis, S.: Signature Files: An Access Method for Documents and Its Analytical Performance Evaluation. ACM Transactions on Office Information Systems 2(4), 267–288 (1984)
4. Pfoser, D., Jensen, C.S., Theodoridis, Y.: Novel Approach to the Indexing of Moving Object Trajectories. In: 27th International Conference on VLDB, Egypt, pp. 395–406 (2000)
5. Papadias, S., Zhang, J., Mamoulis, N., Tao, Y.: Query Processing in Spatial Network Databases. In: 29th International Conference on VLDB, Germany, pp. 802–813 (2003)
6. Shekhar, S.: Spatial Databases - Accomplishments and Research Needs. IEEE Transactions on Knowledge and Data Engineering 11(1), 45–55 (1999)
7. Shahabi, C., Kolahdouzan, M.R., Sharifzadeh, M.: A Road Network Embedding Technique for K-Nearest Neighbor Search in Moving Object Databases. GeoInformatica 7(3), 255–273 (2003)
8. Speicys, L., Jensen, C.S., Kligys, A.: Computational Data Modeling for Network-Constrained Moving Objects. In: 17th ACM International Symposium on Advances in Geographic Information Systems, New Orleans, Louisiana, USA, pp. 118–125 (2003)
9. Penn State University Libraries, http://www.maproom.psu.edu/dcw/
10. Zobel, J., Moffat, A., Ramamohanarao, K.: Inverted Files Versus Signature Files for Text Indexing. ACM Transactions on Database Systems 23(4), 453–490 (1998)
Effective Emission Tomography Image Reconstruction Algorithms for SPECT Data
J. Ramírez1, J.M. Górriz1, M. Gómez-Río2, A. Romero1, R. Chaves1, A. Lassl1, A. Rodríguez2, C.G. Puntonet4, F. Theis5, and E. Lang3
1 Dept. of Signal Theory, Networking and Communications, University of Granada, Spain [email protected] 2 Servicio de Medicina Nuclear, Hospital Universitario Virgen de las Nieves (HUVN), Granada, Spain 3 Institut für Biophysik und physikalische Biochemie, University of Regensburg, Germany 4 Dept. of Architecture and Computer Technology, University of Granada, Spain 5 Max Planck Institute for Dynamics and Self-Organisation, Bernstein Center for Computational Neuroscience, Göttingen, Germany
Abstract. Medical image reconstruction from projections is computationally intensive task that demands solutions for reducing the processing delay in clinical diagnosis applications. This paper analyzes reconstruction methods combined with pre- and post-filtering for Single Photon Emission Computed Tomography (SPECT) in terms of convergence speed and image quality. The evaluation is performed by means of an image database taken from a concurrent study investigating the use of SPECT as a diagnostic tool for the early onset of Alzheimer-type dementia. Filtered backprojection (FBP) methods combined with frequency sampling 2D pre- and post-filtering provides a good trade-off between image quality and delay. Maximum likelihood expectation maximization (ML-EM) improves the quality of the reconstructed image but with a considerable increase in processing delay. To overcome this problem the ordered subsets expectation maximization (OS-EM) method is found to be an effective algorithm for reducing the computational cost with an image quality similar to ML-EM.
1 Introduction
Emission-computed tomography (ECT) has been widely employed in biomedical research and clinical medicine during the last three decades. ECT differs fundamentally from many other medical imaging modalities in that it produces a mapping of physiological functions as opposed to imaging anatomical structure. Tomographic radiopharmaceutical imaging, or ECT, provides in vivo three-dimensional maps of a pharmaceutical labeled with a gamma-ray-emitting radionuclide. The distribution of radionuclide concentrations is estimated from a set of projectional images acquired at many different angles around the patient.
Single Photon Emission Computed Tomography (SPECT) imaging techniques employ radioisotopes which decay emitting predominantly a single gamma photon. This represents the fundamental difference between PET (Positron Emission Tomography) and SPECT. PET systems employ isotopes in which a couple of photons are produced in each individual annihilation. There is a rich variety of isotopes that decay, emitting a single-photon and which consequently can be utilized in SPECT. When the nucleus of a radioisotope disintegrates, a gamma photon is emitted with a random direction which is uniformly distributed in the sphere surrounding the nucleus. If the photon is unimpeded by a collision with electrons or other particles within the body, its trajectory will be a straight line or “ray”. In order for a photon detector external to the patient to discriminate the direction that a ray is incident from, a physical collimation is required. Typically, lead collimator plates are placed prior to the the detector’s crystal in such a manner that the photons incident from all but a single direction are blocked by the plates. This guarantees that only photons incident from the desired direction will strike the photon detector. Brain SPECT has become an important diagnostic and research tool in nuclear medicine. The ultimate value of this procedure depends on good technique in acquisition setup and proper data reconstruction [1,2]. This paper analyzes reconstruction methods combined with pre- and post-filtering for Single Photon Emission Computed Tomography (SPECT) in terms of convergence speed and image quality.
2 Filtered Backprojection Reconstruction
An image of the cross section of an object can be recovered or reconstructed from the projection data. In ideal conditions, projections are a set of measurements of the integrated values of some parameter of the object. If the object is represented by a two-dimensional function f(x, y) and each line integral by the (θ, t) parameters, the line integral Pθ(t) is defined as:

\[ P_\theta(t) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x,y)\,\delta(x\cos\theta + y\sin\theta - t)\,dx\,dy \qquad (1) \]
The function Pθ (t) is known as the Radon transform of the function f (x, y). A projection is formed by combining a set of line integrals. The simplest projection is a collection of parallel ray integrals and is given by Pθ (t) for a constant θ. Another type of projection is possible if a single source is placed in a fixed position relative to a line of detectors. This projection is known as fan beam projection because the line integrals are measured along fans. The key to tomographic imaging is the Fourier Slice Theorem which relates the measured projection data to the two-dimensional Fourier transform of the object cross section. The Fourier Slice Theorem is stated as follows: “The Fourier
transform Sθ(w) of a parallel projection Pθ(t) of an image f(x, y) taken at angle θ and defined to be:

\[ S_\theta(w) = \int_{-\infty}^{+\infty} P_\theta(t)\,\exp(-j 2\pi w t)\,dt \qquad (2) \]

gives a slice of the two-dimensional Fourier transform:

\[ F(u, v) = \int_{-\infty}^{+\infty}\int_{-\infty}^{+\infty} f(x, y)\,\exp(-j 2\pi (ux + vy))\,dx\,dy \qquad (3) \]

subtending an angle θ with the u-axis”, that is,

\[ S_\theta(w) = F(u = w\cos\theta,\; v = w\sin\theta) \qquad (4) \]

The above result is the essence of straight ray tomography and indicates that by having projections of an object function at angles θ1, θ2, ..., θk and taking the Fourier transform of them, the values of F(u, v) can be determined on radial lines. In practice only a finite number of projections of an object can be taken. In that case it is clear that the function F(u, v) is only known along a finite number of radial lines, so that one must then interpolate from these radial points to the points on a square grid. The filtered backprojection (FBP) algorithm can be easily derived from the Fourier Slice Theorem. An image of the cross section f(x, y) of an object can be recovered by:

\[ f(x, y) = \int_{0}^{\pi} Q_\theta(x\cos\theta + y\sin\theta)\,d\theta \qquad (5) \]

where

\[ Q_\theta(t) = \int_{-\infty}^{+\infty} S_\theta(w)\,|w|\,\exp(j 2\pi w t)\,dw \qquad (6) \]
The FBP algorithm then consists of two steps: the filtering part, which can be visualized as a simple weighting of each projection in the frequency domain, and the backprojection part.
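For readers who want to experiment, the two steps can be prototyped in a few lines of Python/NumPy. This is a didactic sketch under simplifying assumptions (parallel-beam geometry, plain ramp filter, nearest-neighbour backprojection), not the evaluation code used in this paper.

    import numpy as np

    def fbp_reconstruct(sinogram, thetas_deg):
        """sinogram: (num_angles, num_detectors) parallel projections P_theta(t).
        Returns an (N, N) reconstruction with N = num_detectors."""
        n_angles, n_det = sinogram.shape
        # 1) Filtering: weight each projection by |w| in the frequency domain (ramp filter)
        ramp = np.abs(np.fft.fftfreq(n_det))
        filtered = np.real(np.fft.ifft(np.fft.fft(sinogram, axis=1) * ramp, axis=1))
        # 2) Backprojection: smear each filtered projection back across the image
        N = n_det
        xs = np.arange(N) - N / 2.0
        X, Y = np.meshgrid(xs, xs)
        image = np.zeros((N, N))
        for proj, theta in zip(filtered, np.deg2rad(thetas_deg)):
            t = X * np.cos(theta) + Y * np.sin(theta) + N / 2.0
            idx = np.clip(np.round(t).astype(int), 0, n_det - 1)
            image += proj[idx]
        return image * (np.pi / (2 * n_angles))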
3 Maximum Likelihood Expectation Maximization (ML-EM)
In emission tomography, a compound containing a radioactive isotope is introduced into the body and forms an unknown emitter density λ(x, y) under the body’s functional activity. Emissions then occur according to a Poisson process. The acquisition system usually consists of D detectors, so that the measured data n*(1), ..., n*(D) represent the counts of photons emitted by the body and measured by each one of the detectors. The maximum likelihood expectation maximization (ML-EM) algorithm [3,4,5,6] determines an estimate λ̂ of λ which maximizes the probability p(n*(1), ..., n*(D) | λ) of observing the actual detector count data over all possible densities. Let n(b) represent the number of unobserved emissions in each of B boxes (pixels) partitioning an object containing an emitter, and let p(b, d) be the probability that an emission in box b is detected in detector unit d. ML-EM is an iterative reconstruction algorithm which starts with an initial estimate λ̂⁰ and gives the new estimate λ̂ from an old estimate λ̂′:

\[ \hat{\lambda}(b) = \hat{\lambda}'(b) \sum_{d=1}^{D} \frac{n^{*}(d)\, p(b, d)}{\sum_{b'=1}^{B} \hat{\lambda}'(b')\, p(b', d)} \qquad (7) \]
4 Ordered Subset Expectation Maximization (OS-EM)
The application of Expectation Maximization (EM) algorithms in emission tomography has led to the introduction of many related techniques. While quality of reconstruction is good, the application of EM is computer intensive and its convergence slow. Ordered subset expectation maximization (OS-EM) [7] algorithm for computed tomography groups projection data in ordered subsets. The standard EM algorithm (i.e., projection followed by backprojection) is then applied to each of the subsets in turn. The resulting reconstruction becomes the starting value for use with the next subset. An iteration of the OS-EM algorithm is defined as a single pass through all the specified subsets. Further iterations may be performed by passing through the same ordered subsets, using as a starting point the reconstruction provided by the previous iteration. By selecting mutually exclusive subsets, each OS-EM iteration has a similar computation to a single EM iteration. In SPECT, the sequential processing of ordered subsets is very natural, as projection data is collected separately for each projection angle (as a camera rotates around the patient in SPECT); counts on single projections can form successive subsets.
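A compact way to see the relationship between ML-EM (7) and OS-EM is the following sketch, in which rows of an assumed system matrix play the role of the ordered projection subsets; with a single subset it reduces to plain ML-EM. This is an illustrative prototype, not the implementation evaluated here, and the extra division by the subset sensitivity is the usual normalisation (Eq. (7) as printed assumes the sensitivities sum to one).

    import numpy as np

    def os_em(A, counts, n_iters=1, n_subsets=15, eps=1e-12):
        """A: (D, B) detection probabilities p(b, d), one row per detector element.
        counts: (D,) measured detector counts n*(d). Returns emitter estimate (B,)."""
        D, B = A.shape
        lam = np.ones(B)                           # initial estimate lambda^0
        subsets = np.array_split(np.arange(D), n_subsets)
        for _ in range(n_iters):
            for sub in subsets:                    # one EM-style update per ordered subset
                A_s = A[sub]                       # rows of the current subset
                forward = A_s @ lam + eps          # expected counts in this subset
                ratio = counts[sub] / forward
                lam = lam * (A_s.T @ ratio) / (A_s.sum(axis=0) + eps)
            # with n_subsets == 1, the inner loop is exactly one ML-EM update of Eq. (7)
        return lam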
5 Prefiltering and Postfiltering
A major drawback of FBP algorithms for tomographic image reconstruction is the undesired amplification of the high frequency noise and its impact on image quality. These effects are caused by the filtering operation or multiplication of Sθ (w) by |w| in equation 6. In order to attenuate the high frequency noise amplified during FBP reconstruction, a number of window function has been proposed. In this way, the reconstruction method described by equations 5 and 6 is normally redefined by applying a frequency window which returns to zero as the frequency tends to π. Among the most common window functions used for FBP reconstruction are: i) sinc (Shepp-Logan filter), ii) cosine, iii) Hamming and, iv ) Hanning window functions. However, even when the reconstruction noise is kept low using a noise controlled FBP approach, the noise captured by the acquisition system needs to be filtered out to improve the quality of reconstructed images.
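The windowed ramp filters mentioned above can be written down directly. The sketch below gives plausible definitions of the |w|-weighted Shepp-Logan, cosine, Hamming and Hanning windows; exact cut-off handling and constants vary between implementations, so treat it as an assumption-laden illustration rather than the filters used in this study.

    import numpy as np

    def windowed_ramp(n_det, window="hann"):
        """Return a frequency-domain FBP filter |w| * W(w) of length n_det."""
        w = np.abs(np.fft.fftfreq(n_det))      # ramp |w|
        x = w / (w.max() + 1e-12)              # normalised frequency in [0, 1]
        if window == "ramp":
            win = np.ones_like(w)
        elif window == "shepp-logan":
            win = np.sinc(x / 2.0)             # sinc window
        elif window == "cosine":
            win = np.cos(np.pi * x / 2.0)
        elif window == "hamming":
            win = 0.54 + 0.46 * np.cos(np.pi * x)
        elif window == "hann":
            win = 0.5 * (1.0 + np.cos(np.pi * x))
        else:
            raise ValueError(f"unknown window: {window}")
        return w * win

Such a filter would replace the plain ramp array in the FBP sketch above, attenuating the amplified high-frequency noise at the cost of some resolution.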
Fig. 1. a) SPECT sinogram acquired by a three head gammacamera, b) Filtered SPECT data, c) Frequency response of noise filter
Moreover, the preprocessing stage of most automatic SPECT image processing systems often incorporates prefiltering, reconstruction and postfiltering to minimize the noise acquired by the gammacamera as well as the noise amplified during FBP reconstruction.
6 Image Quality and Performance Evaluation
Image files were taken from a concurrent study investigating the use of SPECT as a diagnostic tool for the early onset of Alzheimer-type dementia. SPECT data were acquired by a three head gammacamera Picker Prism 3000. The patient is injected with a 99m Tc-ECD radiopharmeceutical which emits gamma rays that are detected by the detector. A total of 180 projections were taken with a 2-degree angular resolution. Fig. 1 shows the effects of noise and filtering on projection data acquired by the gammacamera. The acquired sinogram shows a visible high frequency noise that is effectively filtered by a 2D frequency sampling FIR filter. Fig. 2 shows the effect of noise on FBP image reconstruction and the effectiveness of pre- and post-filtering. Note that, when a simple ramp filter is used, the noise completely
Fig. 2. FBP reconstruction of a SPECT image with a) RAM-LAK filter, b) with FIR prefiltering, c) with pre- and post-filtering
corrupts the reconstructed image. Pre- and post-filtering improves the quality of FBP reconstruction by removing the huge high frequency noise present in SPECT data and the residual noise after reconstruction. Thus, FBP yields good image quality for analysis and display of SPECT data although residual noise is observed in the images as shown in Fig. 2a. ML-EM algorithm is an iterative algorithm for image reconstruction that better models the photons emitted by a radioactive source and yields better image quality when compared to FBP. Fig. 3 shows the slow convergence of ML-EM to a final reconstructed image. It is shown that the iterative algorithm described by equation 7 requires from 15 to 20 iterations to converge and yields a final image with improved quality when compared to FBP as shown in Fig. 4. OS-EM image reconstruction, that groups projection data in ordered subsets and performs projection followed by backprojection to each of the subsets in turn, yields an acceleration on the convergence of the image reconstruction process. Thus, with a number of subset equal to the number of iterations required by
Fig. 3. Convergence of ML-EM SPECT image reconstruction
Fig. 4. Comparison of FBP and ML-EM methods for SPECT image reconstruction. a) FBP, b) ML-EM.
Fig. 5. Result after one OSEM iteration. Image reconstruction partitioning the set of detectors into a) 10, b) 15 and c) 20 subsets.
ML-EM to converge, OS-EM converges in a single iteration performed over all the subsets. Thus, for the example shown in Fig. 3, OS-EM should converge by partitioning the whole set of detector elements into about 15-20 subsets. Fig. 5 shows the results of a single iteration performed by the OS-EM algorithm with a different number of subsets. It is clearly shown that with 15 and 20 subsets the image appear to be of similar quality to ML-EM with the advantage of a speedup of the order of the number of subsets.
7 Conclusions
Classical filtered backprojection and statistical maximum likelihood expectation maximization image reconstruction algorithms were evaluated in terms of image quality and processing delay. Image files were taken from a concurrent study investigating the use of SPECT as a diagnostic tool for the early onset of Alzheimer-type dementia. FBP image reconstruction needs careful noise control since it tends to amplify high-frequency noise. Pre- and post-filtering improves the quality of FBP reconstruction by removing the huge high frequency
noise present in SPECT data and the residual noise after reconstruction. ML-EM yields better image quality when compared to FBP since a precise statistical model of the emission is used. However, the processing delay is considerable due to its slow convergence. OS-EM was found to be a good trade-off between image quality and processing delay since it converges in a single iteration by partitioning the set of detector elements into about 15-20 subsets. Acknowledgments. This work has been funded by the PETRI project DENCLASES (PET2006-0253) of the Spanish MEC and the regional Excellence Project (TIC-02566) of the Consejería de Innovación Ciencia y Empresa (Junta de Andalucía, Spain). We also acknowledge financial support by the German Academic Exchange Service (DAAD).
References
1. Vandenberghea, S., D’Asselera, Y., de Wallea, R.V., Kauppinenb, T., Koolea, M., Bouwensa, L., Laerec, K.V., Lemahieua, I., Dierckx, R.: Iterative reconstruction algorithms in nuclear medicine. Computerized Medical Imaging and Graphics, 105–111 (2001)
2. Bruyant, P.P.: Analytic and iterative reconstruction algorithms in SPECT. The Journal of Nuclear Medicine 43, 1343–1358 (2002)
3. Shepp, L.A., Vardi, Y.: Maximum likelihood reconstruction for emission tomography. IEEE Transactions on Medical Imaging MI-1, 113–122 (1982)
4. Vardi, Y., Shepp, L.A., Kaufman, L.: A statistical model for positron emission tomography. Journal of the American Statistical Association 80, 8–20 (1985)
5. Lange, K., Carson, R.: EM reconstruction for emission and transmission tomography. Journal of Computer Assisted Tomography 8, 306–312 (1984)
6. Chornoboy, E.S., Chen, C.J., Miller, M.I., Miller, T.R., Snyder, D.L.: An evaluation of maximum likelihood reconstruction for SPECT. IEEE Transactions on Medical Imaging 9, 99–110 (1990)
7. Hudson, H.M., Larkin, R.S.: Accelerated image reconstruction using ordered subsets of projection data. IEEE Transactions on Medical Imaging 13, 601–609 (1994)
New Sky Pattern Recognition Algorithm Wojciech Makowiecki1 and Witold Alda2 1
Astronomical Observatory of the Jagiellonian University, Kraków, Poland [email protected] http://www.wojciech.us 2 AGH University of Science and Technology, Kraków, Poland [email protected]
Abstract. We present here a new algorithm which enables the identification of stars in astronomical photographs using readily available star catalogs for this purpose. The algorithm was implemented in a standalone application called ‘Skyprint‘ capable of performing a matching process. The computational aspect of the problem can be designated to the wide class of image recognition methods and analysis of multidimensional data. The astronomical aspect concentrates on astrometry – the method of determining the coordinates of stars in the celestial sphere. The problem of identifying star patterns occurs most often in such areas as cosmic probe navigation, adjusting and merging numerous photographs of the sky together, or in recovering missing information in relation to a fragment of the sky represented in the photograph. Keywords: Pattern recognition, Astrometry, Geometric hashing.
1 Introduction
The concept of matching the stars in a photograph with stars in the catalog has been investigated for many years. The majority of algorithms which are able to successfully solve the problem are based on the idea of comparing polygonal shapes in the photograph with those in the catalog. The shapes in the photograph are created by connecting stars in the photograph together and likewise the shapes in the catalog. We can distinguish two strategies of approaching the problem of matching these two groups of shapes. The first one, used for example by Valdes [1], Groth [2] and Murtagh [3], relies on heavy restrictions that are put on the fragment of the catalog to be compared with the photograph. We must know that the chosen fragment of the catalog contains most of the stars visible in the photograph. Also the size of the fragment can not be larger than a few times the size of the field of view (FOV) shown on the photograph. The shapes (in this case triangles), made up of stars brighter than a certain limiting magnitude, are created in the picture and in the chosen part of the catalog. Then the triangles are saved in two separate lists so that they can be later investigated for similarities. Confirmation that two found shapes are made up of the same objects is done in the so called ‘voting process‘. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 749–758, 2008. c Springer-Verlag Berlin Heidelberg 2008
In this paper the authors have concentrated on proposing a new algorithm belonging to the second group, that has emerged recently e.g. Lang [4]. Algorithms belonging to this group are able to recognize an arbitrary fragment of the catalog shown on the photograph, without being limited by cases where the approximate location is not known. Such algorithms are not usually able to identify particular stars belonging to the shape, however, for this purpose, one of the algorithms from the first group which has already been developed, can be chosen. The main idea behind the algorithm presented is to create and subsequently compare two sets of convex quads. The first set is created by using the stars found in the photograph, and the second one by using the stars present in the catalog.
2 Algorithm
One of the most important issues in the algorithm is its method of characterizing the shapes. We need to do this in a way which is invariant with respect to translation, rotation and scaling. Each shape consists of N points. Each point has 2 coordinates, therefore we need 2N parameters to determine the positions of all N points in the 2-dimensional space. As translation, rotation and scaling in a 2D coordinate system require 4 degrees of freedom, we end up with the 2N − 4 parameters which determine the shape unambiguously. This gives us 2 parameters for each triangle and 4 parameters for each quad. In our approach we have chosen to use quads and to compare them by using only 3 internal angles of each quad. Although such a measure is not sufficient to determine the convex quad unambiguously, we sacrifice this for the advantage of having only three parameters for each shape. Our tests show that such an approach performs well and is sufficient in most cases. The algorithm consists of several steps:
1. Creating the ‘grid of seeds‘. By ‘seed‘ we mean a virtual point having 2 coordinates (Right Ascension and Declination). We define ‘grid of seeds‘ as points distributed in one of any number of regular ways. For example one can choose to satisfy the following condition:
   (number of seeds per unit area) / (number of stars per unit area) ≈ const.   (1)
   We believe this rule of placing seeds is one of the best.
2. Choosing only the convex shapes for later comparison. Now we use the seeds to create shapes as shown in Fig. 1. We try to create a convex quad in the vicinity of each seed. A case with three collinear stars is treated as a correct convex quad.
3. Angle calculation, discretization and hashing. In this stage we first calculate angles by using one of three different methods mentioned later. Subsequently we discretize angles every 2 degrees. For each shape we store 5 values: 3 discrete angles and 2 coordinates of the seed which
Fig. 1. Stages of creating a convex quad. Black dots denote seeds of the grid, blue circles denote stars. Sequence of stages are as follows: a) select one seed and choose 4 closest stars b) create quad only if these 4 stars can make a convex one c) calculate internal angles of the quad d) choose the largest angle and two more angles by moving in a clockwise direction, remember values and sequence of these 3 angles.
was used to create this particular quad. We use a standard hash function with primes as coefficients in the form: (α ∗ 23827 + β ∗ 34693 + γ ∗ 46021) % 904997
(2)
which transforms 3 input angles (α, β and γ) to an array index. This method enables quick matching of quads. 4. Division of sky into fragments In this step, in order to use ‘voting‘, we divide the whole sky into fragments as shown in Fig. 2, 3 and 4. Fragments can have different shapes and sizes so there are many different ways of doing this. In our approach we use orthogonal division. We limit the minimum size of the fragments to the maximum FOV of the input photograph. In other words, the minimum size of the fragment must be larger than the photograph. However, the limiting size cannot be too large - because in such a case, identifying fragments would be imprecise or incorrect. Even when the photograph has angular dimensions which fit in the single fragment of the catalog, it may be only partially contained. Figure 2 shows our solution to this problem. We enlarge the fragment
752
W. Makowiecki and W. Alda
Fig. 2. Enlarging fragments of the catalog by a factor of 2 in order to be sure that the photograph is always entirely contained by at least one fragment
by a factor of 2 in order to be sure that the photograph is always entirely contained by at least one fragment. 5. Counting matched quads and voting on catalog fragments We then match the convex quads from the photograph with those in the catalog. Once a match is found, one or more areas in the sky receives votes as shown in Fig. 3. At the end of the process, the region of the catalog which has received the highest number of votes is the one we were looking for (Fig. 4). We use seeds for two reasons. Firstly, to introduce a unique and easy method for choosing stars, which will be used later to create quads. Secondly, using seeds is necessary for reducing the number of convex quads created from the catalog. The problem shows certain asymmetry. For the photograph, we can create as many convex quads as possible without using seeds at all. The number of convex quads is the main reason for this. A photograph always has many fewer stars than the whole sky catalog. However, as it turned out in our tests, such a method is not optimal and introduces difficulties. The main problem concerns the scale of quads. If we create all possible convex quads from the stars in the photograph
New Sky Pattern Recognition Algorithm
753
Fig. 3. Example of quads matching process
we would create convex quads in many different scales, whereas using a grid of seeds for the catalog would create only quads of similar sizes.
3
Software
The name ‘Skyprint‘ [10] was intended to be analogous with fingerprinting because it tries to find the same patterns of stars in a large database. Skyprint is a highly interactive application written in C++ with Qt interface. Sample screen of the program’s graphical interface is shown in Fig. 5. There are two algorithms implemented in Skyprint. First, the algorithm presented in the previous section and second - a version of the algorithm presented by Valdes [1]. The distortion effects originating from projecting a celestial sphere onto a 2D plane can no longer be neglected for photographs with FOV larger than 30 arcmin. For this reason Skyprint gives the option of choosing between three different types of projection. The first one is a ‘no projection‘ method (the angles are calculated on the sphere with use of spherical trigonometry). The second method is to calculate the angles on the sphere by treating them as if they were in a two dimensional plane. The last one, the ‘gnomonic projection‘,
754
W. Makowiecki and W. Alda
Fig. 4. Sample results of voting procedure
Fig. 5. Skyprint - view of the main window
New Sky Pattern Recognition Algorithm
755
is the best option as it uses exactly the same distortion as the one that appears when a picture of the sky in taken. The typical use case of Skyprint is as follows: – – – – –
4
Open the photograph. Select or correct position of stars on the photograph. Choose sky region (possibly the whole sky). Select between algorithm I and II. Choose matching parameters for the selected algorithm.
Results
Although we made our tests on a number of images, here we would like to concentrate on the comparison of results obtained by matching a single photograph while changing many parameters of the algorithm. We have used the simplest orthogonal division as shown in Fig. 3. This allowed us to divide the catalog into 128 parts. We placed seeds much more densely, but in precisely the same manner. Seed density is a parameter of the software. Typically we use two densities: 600 x 300 and 1200 x 600. However, these seeds are not uniformly distributed on the sphere. This particular distribution causes the seeds with the highest and lowest declinations to be close to other seeds. This is not desirable, because in this way there would be many more quads created for the regions close to the poles than for the parts of the sky which lie beside the equator. The same problem applies to choosing the fragments that the catalog is divided into. One solution is to use HEALPix [8] for determining the uniform distribution of seeds and regions of the catalog. The other is to utilize the constraint in equation 1. For the tests we used an M45 [7] photograph with a FOV of about 1 degree, which is very small in size when compared with 22.5 degrees (the largest part into which we divide the catalog). The larger the ratio of FOV to the size of the part of the catalog, the harder it is for the algorithm to work properly. This was one of the reasons for choosing this photograph. The second reason was to show that even at such a small FOV (small in the context of amateur photography) we need to make sure we pay attention to distortions. In Fig. 6, 7 and 8 we show the number of created quads compared with the number of votes in the regions with the two highest ranks, obtained for each method of angle calculation.

Fig. 6. Results obtained without using any projection. Each chart plots, against the maximal spacing between stars in the photograph (in pixels), the number of created quads together with the number of votes received by the catalog regions with the highest and the second highest number of votes.

Fig. 7. Results obtained by calculating the angles on the sphere with use of spherical trigonometry (same quantities and axes as in Fig. 6).

Fig. 8. Results obtained by using the gnomonic projection (same quantities and axes as in Fig. 6).
5 Conclusions
We have shown that the implemented algorithm works well with almost any class of photographs. We have made no assumptions about the photograph apart from limiting its size which makes the algorithm much more efficient. In practice the maximum size of the photograph is usually well known and can be set as a parameter, thus representing a very soft constraint. We have also determined the most important factors in the identification of the fragment of the sky. One of them is the number of stars selected in the photograph either by an automatic filter or the user. The more stars selected, properly and precisely, the higher the probability of successful matching. The projection and the method used to calculate angles of the shape is also very important, especially for larger FOVs. For matching real life photographs of the sky, the gnomonic projection is the best choice. Because there are countless methods of characterizing shapes we plan to explore some more possibilities further in our future research on this subject. The work of Arzoumanian et.al. [9] shows that different applications of the star pattern matching algorithms may exist in a disparate fields of science and are still to be explored. AGH Grant no. 11.11.120.777 is acknowledged.
References
1. Valdes, F.G., Campusano, L.E., Velasquez, J.D., Stetson, P.B.: FOCAS Automatic Catalog Matching Algorithms. PASP 107, 1119 (1995)
2. Groth, E.J.: A pattern-matching algorithm for two-dimensional coordinate lists. The Astrophysical Journal 91, 1244–1248 (1986)
3. Murtagh, F.: A new approach to point-pattern matching. PASP 104, 301–307 (1992)
4. Lang, D., Hogg, D.W., Mierle, K., Blanton, M., Roweis, S.: Making the sky searchable (submitted, 2007), http://www.astrometry.net
5. Harvey, C.: New Algorithms for Automated Astrometry. M.Sc. Thesis, University of Toronto (2004)
6. Roeser, S., Bastian, U.: PPM Star Catalog, vol. I & II. Astronomisches Rechen-Institut, Heidelberg (1991)
7. De Martin, D.: M45 picture, http://www.skyfactory.org
8. Górski, K.M., Hivon, E., Banday, A.J., Wandelt, B.D., Hansen, F.K., Reinecke, M., Bartelmann, M.: HEALPix: A Framework for High-Resolution Discretization and Fast Analysis of Data Distributed on the Sphere. The Astrophysical Journal 622(2), 759–771 (2005)
9. Arzoumanian, Z., Holmberg, J., Norman, B.: An astronomical pattern-matching algorithm for computer-aided identification of whale sharks Rhincodon typus. Journal of Applied Ecology 42(6), 999–1011 (2005)
10. Makowiecki, W.: Skyprint software for interactive Star Pattern Matching, http://www.skyprint.info
A Generic Context Information System for Intelligent Vision Applications
Luo Sun, Peng Dai, Linmi Tao, and Guangyou Xu
Tsinghua National Laboratory for Information Science and Technology, Tsinghua University, 100084 Beijing, China
{sunluo00,daip02}@mails.tsinghua.edu.cn, {linmi,xgy-dcs}@tsinghua.edu.cn
Abstract. Future intelligent vision is expected to be highly context-aware such that it can perceive and be aware of the user's situation and react accordingly. In this paper, we propose a context representation mechanism and build a high-performance, extensible, distributed context information system based on it, in order to facilitate context-awareness development and information sharing. It pays attention to representing and organizing contextual information in an effective way and does not force any certain type of context reasoning algorithm. It can provide information-related services for distributed intelligent vision applications, mainly including representation, storing and retrieval, forming a whole pipeline of real-time semantic metadata generation and management. Besides user context, which is used to support runtime context communication between application components, our system also contains contextual descriptions of the running environment and system configuration, so that applications based on it can move to another environment or configuration seamlessly. Moreover, context representation in our system has a well-designed plugin-based architecture, helping users add their own context types without any modification of the original system. We introduce a context-aware meeting application based on our system, which employs a Dynamic Bayesian Network as its context reasoning algorithm. Experiment results show our context information system has excellent configurability, extensibility and performance. Keywords: Context Information System, Context Representation, Context Storing and Context Retrieval.
1 Introduction
The final goal of intelligent vision is to provide the right service to the right object in the right place at the right time, which requires it to be highly context-aware such that it can perceive and be aware of the user's situation and react accordingly [1]. By context, we mean a dynamic structure of information that is used to characterize the situation, viewed over a period of time, episode of use, social interaction, internal goals and local influence [2, 3]. In the real world, understanding of each person's behavior is affected by his relations with others and the surrounding environment,
therefore contextual information has to be considered. Moreover, with the development of multimedia sensing technology, more and more data can be acquired with much less effort than ever before. Facing a large amount of available data, the use of context can help decrease the complexity significantly by processing only relevant scenarios and using heuristics to choose only the objects that may be involved in a particular activity. The very first step of context-awareness is to represent context in an effective computer-applicable format [1, 2]. Besides representing contextual information, information representation and sharing plays an important role in the domain of computer vision. Over the last 20 years, we have seen a remarkable amount of progress in the abilities and usability of computer vision. Although there is lots of collaboration between groups working in the same field, the majority of researchers have their own ways of working and representation formats. This causes the issue that we still only see very individual and proprietary use of this work, both in research and in industry [4]. In this regard, a generic context information system to facilitate context-awareness development and information sharing, which does not force any certain type of context reasoning algorithm, is an urgent need in the development of intelligent computer vision applications. There have been several attempts for this purpose recently. Most of them represent information in XML because of its open, human-readable format and natural hierarchical structure. VEML (Video Event Markup Language) [5] has been implemented for representing and recognizing events in videos, in order to facilitate the development of applications such as video surveillance, video browsing and content-based video indexing. CVML (Computer Vision Markup Language) [4] summarized many common visual processing requirements and proposed a framework dedicated to describing computer vision information. MPEG-7, formally named "Multimedia Content Description Interface", is a widely accepted standard for describing multimedia content data, which aims to resolve the problems of management and retrieval. It has rich functionality, which also makes it too complex to employ. Moreover, MPEG-7 is not suitable for dynamic scenes. However, VEML and CVML are not as widely accepted as expected, and these three information-description languages do not address the concept of context. Context Toolkit [6, 7] aims to be a reusable solution to handle context in a distributed infrastructure. However, it doesn't provide adequate support for organizing context in a formal structured format and therefore cannot represent the dynamic nature of context [3]. It also adopts a hard-coded context ontology and cannot easily be extended or interoperate with others. Facing the issue that information systems don't consider context-awareness and context-aware systems don't take information sharing into account, we propose a context representation mechanism for intelligent computer vision and build a high-performance, extensible, distributed context information system based on it, in order to facilitate context-awareness and information sharing. It is used to support information-related services for distributed intelligent vision applications, including representation, storing and retrieval, forming a whole pipeline of real-time semantic metadata generation and management.
We have already implemented a flexible multi-server platform for distributed visual information processing in our previous work [8]. Combining it with our context information system, a complete infrastructure of
intelligent computer vision has emerged. The rest of this paper is organized as follows: Section 2 describes the system, including our basic idea, several design aspects of the architecture and some key technologies. Section 3 shows a context-aware meeting application based on our information system, as well as some performance results. In the end, some conclusions are drawn.
2 Context Information System
2.1 Distributed Architecture
As a complete information system, at least the following functions should be provided. They are the minimal requirements to form a whole pipeline of metadata generation and management.
1. Information representation, the mechanism to formalize information into metadata, representing information in an effective computer-applicable way.
2. Information storing, the ability to store formalized information persistently, usually to hard disks. Information is usually indexed for faster retrieval.
3. Information retrieval, the ability to retrieve stored information later with specified restrictions.
Among them, information representation is usually considered the most important one since it determines the representation scope and the ability to interoperate with others. However, as more and more sensors are employed in intelligent vision systems, it is not possible to finish all procedures on just one computer. In this regard, a distributed architecture should be taken into account.
Fig. 1. System Architecture. The whole system is organized in a distributed manner, including three separate services, shown in dashed boxes.
In our previous work, a flexible platform for distributed visual processing has been finished, which acts as a container for information processing and analysis algorithms to plug into. The platform is composed of a set of servers which collaborate with each other to accomplish tasks such as video capture, transmission, buffering and synchronization. With the help of this convenient platform, we designed a distributed architecture for our information system, as shown in Fig. 1. There are three kinds of services within the system, namely the local storing service, the local retrieval service and the global archive service. Each service is a separate process, using local or remote sockets to communicate with other services or visual processing components. The local storing and local retrieval services run on every computer in the vision system, providing information-related services to all visual processing components working on the same machine. The global archive service is unique within the whole system, usually running on a dedicated server for information archiving. We employ XML as the basic storing format in our system. Therefore we can use Berkeley DB XML [9] to store metadata; it is a high-performance, open-source, embeddable XML database with XQuery support. It stores XML in an optionally indexed way, providing extremely fast access. Our information system architecture features fast storing and retrieval access. This is important if we want to deploy it to existing intelligent vision applications, since they usually do not consider a unified information format carefully at the very beginning and slow access may cause a significant decrease in their performance. Fast storing benefits from the local information database, which is small and acts as a local cache. New information is stored into the local database directly by synchronized storing function calls, while a background thread is dedicated to synchronizing the local database with the global one. This design effectively avoids data loss in case of application crashes, while keeping the storing procedure fast enough not to slow down the applications based on it. On the other hand, retrieval access cannot be handled in the same way. It has to be synchronous since applications need results immediately. We optimize it by increasing the capacity of the internal cache of Berkeley DB XML and by using the UDP protocol with confirmation replies, instead of TCP. The experiments described later show our context information system's performance.
2.2 Context Representation and Plugin-Based Implementation
The very first step of context-awareness is to represent it in a computer-applicable format. Traditional context models deal with individuals or their relationships with environmental objects, which can be represented as single-level events such as changes of location, time, temperature, etc. This method suffers especially in dynamic interaction scenes since it ignores the dynamic nature of context. Greenberg refined context as a dynamic construct, which is viewed over a period of time, episode of use, social interaction, internal goals, and local influence [3]. Following his definition, we propose a computer-applicable version of context representation in the domain of computer vision. Context is defined as a multi-level hierarchy, which represents situations at different abstraction levels. Moreover, a complete context representation mechanism should not only include runtime dynamic information representing human physical and mental
states, but also needs to contain static environment and configuration settings. There are three kinds of descriptions in our information system.
1. Environment configuration. The external description of a system, used to describe the environment where a system would run, including PC configuration, room configuration, physical objects and their properties, etc.
2. System configuration. The static internal description of a system, used to describe how a system would run, including the connection relationships between visual processing components, what kind of services the system should provide to users, etc.
3. Run-time information communication. The dynamic internal description of a system, used to regulate the information format and share it through the whole system, helping all components understand and communicate with each other.
Respectively, context representation defined in our information system can be divided into three parts: environment context, system context and user context. User context is only valid during runtime and contains processing targets and their properties, from low-level features to high-level semantic descriptions. Environment context and system context are rather static, usually remaining unchanged while the system is running. User context highly depends on the interaction status, which changes frequently. It is organized in a hierarchical structure at different time scales and group sizes. Fig. 2 shows a user context example under a meeting scenario.
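As a concrete illustration of this hierarchy, the sketch below shows one possible data structure for a user context node. It is our own minimal example (all field names are assumptions, not taken from the paper): each node covers a time scale and a participant group and refines into shorter-scale child interactions.

```cpp
#include <memory>
#include <string>
#include <vector>

// Hypothetical sketch of the multi-level user context hierarchy: a node such as
// "group meeting" refines into "discussion", which refines into "A talking to B".
struct ContextNode {
    std::string label;                                  // e.g. "discussion", "A talking to B"
    double startTime = 0.0, endTime = 0.0;              // time scale covered by this level
    std::vector<std::string> participants;              // group size at this level
    std::vector<std::shared_ptr<ContextNode>> children; // finer-grained interactions
};
```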
Fig. 2. A User Context Example under a Meeting Scenario. There are four people within a meeting. At one moment, A and B are talking to each other, while C is staring at A and D is staring at B. Some key nodes of the current user context are shown as a tree on the right side. At the longest time scale, it is a discussion during a group meeting. There are three short-term interactions currently, namely A talking to B, D staring at B and C staring at A.
Context representation in our system is implemented with a plugin architecture. Plugin-based systems are widely used nowadays, e.g. in Firefox and Eclipse. They have many advantages, such as rich extensibility and easy maintenance. They also help reduce
Fig. 3. Modules in Our Plugin System. The ID context plugin, which implements the interface IContextInterface, is a DLL (Dynamic Link Library) file under Windows or an SO (Shared Object) file under Linux and can produce the ID context representation, which implements the interface IContextRepresentation. The Location Context plugin works the same way. All plugins are controlled by a module named Context Controller, which provides several useful functions for plugin management, such as loading, refreshing and searching for a specified plugin.
Fig. 4. Some of the Implemented Context Representations in Our Information System. Blue rectangles are categories and blue arrows between categories represent their relationships and how they work together. Each category contains several context types, which can be used to describe objects or their properties in the real world. Except for the categories "System" and "Environment", all categories belong to the user context.
future development labor to some extent. In our design, each context type belongs to a
certain category, which is used to organize context types based on what kind of information they designate. A context category is actually a directory on the disk with a configuration file describing its relationships with other categories. The code of every context type is stored as a Dynamic Link Library (DLL) file under Windows, and as a Shared Object (SO) file under Linux, in the corresponding category directory. When the information system starts, it scans all possible directories and loads context representations dynamically. Therefore, users can add their own context representations, just following some predefined interfaces, without recompiling the whole information system. The implementation of our plugin system is based on Qt's plugin system [10] and the relationships among modules are shown in Fig. 3. Every context plugin needs to implement a common interface IContextInterface and every context representation needs to implement a common interface IContextRepresentation. Some of the implemented context representations in our system are shown in Fig. 4. The system contains many useful context types for intelligent vision. Although they may not cover every requirement, developers can add their own context types very easily with the help of our plugin architecture.
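A minimal sketch of how such interfaces and the loading step inside the Context Controller might look with Qt's plugin mechanism is given below. The method names on the two interfaces are our assumptions, since the paper only names the interfaces themselves.

```cpp
#include <QDir>
#include <QObject>
#include <QPluginLoader>
#include <QString>
#include <QVector>

// Assumed interface of a single context representation (e.g. an ID or Location context).
class IContextRepresentation {
public:
    virtual ~IContextRepresentation() {}
    virtual QString contextType() const = 0;   // hypothetical accessor
    virtual QString toXml() const = 0;         // hypothetical serialisation hook
};

// Assumed interface every context plugin has to implement.
class IContextInterface {
public:
    virtual ~IContextInterface() {}
    virtual IContextRepresentation* createRepresentation() = 0;
};

Q_DECLARE_INTERFACE(IContextInterface, "example.vision.IContextInterface/1.0")

// Context Controller side: scan one category directory and load every plugin in it.
QVector<IContextInterface*> loadCategory(const QString& categoryDir)
{
    QVector<IContextInterface*> plugins;
    QDir dir(categoryDir);
    foreach (const QString& file, dir.entryList(QDir::Files)) {   // DLL or SO files
        QPluginLoader loader(dir.absoluteFilePath(file));
        QObject* instance = loader.instance();                    // loads the shared library
        if (IContextInterface* p = qobject_cast<IContextInterface*>(instance))
            plugins.append(p);
    }
    return plugins;
}
```

A concrete plugin class would additionally declare Q_INTERFACES(IContextInterface) and be exported with Qt's plugin export macro so that qobject_cast can resolve the interface at load time.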
3 Experiment Systems and Results
In this section, we show a context-aware meeting application supported by our context information system, as well as some results of performance tests. The major task of this indoor meeting analysis and archiving system is to generate and archive a hierarchical context representation of multimodal meeting data, which can be further employed for retrieval and more sophisticated analysis [11]. Our
Fig. 5. Sensor Settings in Meeting Room. Three fixed cameras are set to monitor the meeting room from three distinct perspectives. A PTZ camera is placed on the table to focus on any
specific target. Three linear microphone arrays are assembled on the table to get audio information from various participants.
information system plays an important role in environment configuration, system configuration and user context representation. A flexible multi-server platform has been developed to support distributed multimedia data capturing and transferring. Multiple audio and visual sensors are installed in the meeting room so as to extract multimodal information in real time, as illustrated in Fig. 5. Since context cannot be determined by a simple collection of multimodal sensor data, this system employs a Dynamic Bayesian Network for context reasoning, which can generate analysis results at run-time. The structure of the user context is shown in Fig. 6, where different layers have very strong semantic meanings. Fig. 7 shows a description example. At one moment, person A is staring at B (not shown in the picture). Some body parts of A, including head position and orientation, hand positions and so on, are tracked as low-level features for later analysis.
Fig. 6. Hierarchical User Context Structure under the Meeting Scene. The top layer is the meeting scenario itself. The second layer describes what the current situation is from a high semantic perspective. The third layer represents how people interact with each other individually.
Fig. 7. A Description Example under Meeting Scene. The picture on the left side is from the original video overlapped with some color boxes designating different parts of a person. The
XML segments in the center and on the right are the corresponding low-level and high-level semantic descriptions, respectively. Orange arrows help show the relationships among them.
Storing and retrieval performance has been tested on data extracted from several meetings and from another context-aware outdoor surveillance application. Fig. 8 shows some results about storing performance. From the figure, we can see that the local storing speed is fairly fast, less than 2 ms even when the local database is 20 MB. However, the storing speed is affected by the size of the local database and by the concurrency. Synchronization time increases nearly linearly with the sizes of the global and local databases.
Fig. 8. Results of Storing Performance Experiment. The left plot shows some results of storing a specified event into the local database and different square colors represent different numbers of concurrent storing. The right one shows the background synchronization time and different square colors represent different sizes of global database.
Fig. 9. Results of Retrieval Performance Experiment. These figures show how query time changes with respect to database size and concurrency. The left plot is searching for a specified type in the surveillance application, while the right one for a specified event in the meeting application. Different square colors represent different numbers of concurrent queries.
Fig. 9 shows some experiment results about retrieval performance. An event query is a little more complex than a type query; therefore it costs a bit more time. Query time increases sublinearly with the size of the database and the number of concurrent query processes. When the database is small, queries are fast since the cache can stay very hot, playing an important role in retrieval. As the database grows, disk I/O costs more and more time.
4 Conclusion
Our work in this paper focused on the reusability of context by means of splitting it into representation and reasoning. Reasoning differs significantly across context-aware systems while representation remains similar. As a proof of concept, we presented a context representation mechanism and built a high-performance, extensible, distributed context information system based on it to facilitate context-awareness and information sharing. It can support information-related services for distributed intelligent vision applications, mainly including representation, storing and retrieval, forming a whole pipeline of real-time semantic metadata generation and management. Our proposed contextual information includes not only static environment and system settings but also dynamic information representing human physical and mental states. The former is used to describe where and how the application would run, while the latter is used for runtime intercommunication between application components. Contextual information has been implemented in a plugin manner, which means developers can add their own context types without any modification of the original system, just by following our interface definition. As a result of our basic idea, this context information system pays attention to representing and organizing contextual information in an effective way and does not force any certain type of context reasoning algorithm; therefore developers can employ any algorithm they prefer, rule-based or probability-based. We also introduced a context-aware meeting analysis and archiving application based on our system, which employs a Dynamic Bayesian Network as its context reasoning algorithm. Our information system plays an important role in it, showing its configurability and extensibility. The performance of storing and retrieval has been tested on a large amount of real data and the results show our system has excellent performance. In our previous work, we had already finished a flexible platform for distributed visual processing. Our information system can actually be considered an important complement to it. Combining them together, a complete infrastructure of intelligent vision has emerged. Acknowledgments. This research is supported by NSFC projects 60433030 and 60673189 of China. We are thankful for the thoughtful comments and suggestions of our reviewers.
References
1. Goh, E., Chieng, D., Mustapha, A., Ngeow, Y.-C., Low, H.-K.: A Context-Aware Architecture for Smart Space Environment. In: Proceedings of International Conference on Multimedia and Ubiquitous Engineering, Korea, pp. 908–913 (2007)
2. Dey, A.K., Abowd, G.D., Salber, D.: A Conceptual Framework and a Toolkit for Supporting the Rapid Prototyping of Context-Aware Applications. Human-Computer Interaction 16, 97–166 (2001)
3. Greenberg, S.: Context as a Dynamic Construct. Human-Computer Interaction 16, 257–268 (2001)
4. List, T., Fisher, R.B.: CVML – An XML-based Computer Vision Markup Language. In: Proceedings of International Conference on Pattern Recognition, Cambridge, vol. 1, pp. 789–792 (2004)
5. Nevatia, R., Hobbs, J., Bolles, B.: An Ontology for Video Event Representation. In: Proceedings of International Conference on Computer Vision and Pattern Recognition Workshop (2004)
6. Context Toolkit, http://www.cs.cmu.edu/~anind/context.html
7. Edwards, K., Bellotti, V., Dey, A.K., Newman, M.: Stuck in the Middle: The Challenges of User-Centered Design and Evaluation for Middleware. In: Proceedings of The International Conference on Human Factors in Computing Systems (2003)
8. Wang, Y., Tao, L., Liu, Q., Zhao, Y., Xu, G.: A Flexible Multi-server Platform for Distributed Video Information Processing. In: Proceedings of 5th International Conference on Computer Vision Systems (2007)
9. Oracle Berkeley DB XML, http://www.oracle.com/database/berkeley-db/xml/index.html
10. Qt product page, http://www.trolltech.com/products/qt
11. Dai, P., Di, H., Dong, L., Tao, L., Xu, G.: Group Interaction Analysis in Dynamic Context. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics (to appear, 2007)
Automated Positioning of Overlapping Eye Fundus Images
Povilas Treigys¹, Gintautas Dzemyda¹, and Valerijus Barzdziukas²
¹ Institute of Mathematics and Informatics, Akademijos str. 4, LT-08663 Vilnius, Lithuania
{treigys,dzemyda}@ktl.mii.lt
² Kaunas University of Medicine, Eiveniu str. 4, LT-3007 Kaunas, Lithuania
[email protected]
Abstract. Changes in eye fundus images can be associated with numerous vision-threatening diseases such as glaucoma, optic neuropathy, swelling of the optic nerve head, or can be related to some systemic disease. Tracking the progress of a possible disease of the patient becomes very difficult from separate retinal images. In this article we present a method which registers two retinal images so that the fundus images overlap each other in the best way. As a separate case, this article shows that in order to solve the optic nerve disc registration problem a linear transformation of the retinal image is sufficient. A human identification possibility via retinal image registration is discussed as well. Keywords: automated eye fundus registration, vasculature structure extraction, automated shifting, optic nerve registration, identification, retinal image transformation.
1 Introduction
At present ophthalmologists can collect and analyze the eye fundus from digital images. Whenever the image of the eye fundus becomes digital, the means of automatic image processing come into play. A high quality colour photograph of the eye fundus is helpful in the accommodation and follow-up of the development of the eye disease. Evaluation of the eye fundus images is complicated because of the variety of anatomical structures and possible fundus changes in eye diseases. The optic nerve disc (OD) appears in the normal eye fundus image as a yellowish disc with whitish central cupping (excavation) through which the central retinal artery and vein pass (Fig. 1, left and centre images). Changes of the optic nerve disc can be associated with numerous vision-threatening diseases such as glaucoma, optic neuropathy, swelling of the optic nerve head, or can be related to some systemic disease. Thus, one of the basic tasks in ophthalmology is to analyze the optic nerve disc. Also, the analysis of the vasculature can be helpful to indicate pathologic changes associated with diseases such as hypertension, diabetes, or atherosclerosis [7]. Vasculature extraction methods for retinal images can be classified into one of three groups: kernel-based, classifier-based and tracing-based [19]. In the kernel-based methods, an image is in most cases convolved with a predefined kernel. Further, the Gaussian filter
is introduced in order to model the cross-section of the vessels. Afterwards the vessel identification filters [6] are applied. Such a class of vasculature structure extraction algorithms is commonly combined with neural networks [8] and is very time-consuming. Classification-based methods are composed of two steps. During the first step, segmentation of an image is performed. Segmentation [16, 14] is basically accomplished by the kernel-based methods. In the second step, a set of features has to be provided for the algorithm. Such a set describes the vessels visible in the image. The methods that belong to this class allow processing of objects with complex structures [1]. This enables the algorithms to perform faster; however, these algorithms cannot be automatic in most cases. In the tracing-based class of algorithms [2], the algorithm traces the structure of a vessel between predefined points. Basically, tracing ends at the provided reference points. It is common that these reference points are provided interactively by a human. We present here a method for automatic retinal image registration. Image registration is the process of transforming different sets of data into one coordinate system. In this particular situation, the registration should be performed so that the visible structures in two images overlap each other in the resulting image (Fig. 1, right side image). The resulting image comes with a quality measurement parameter that can later be introduced into a decision support system for ophthalmologists. Besides, this article shows that in order to solve the optic nerve disc registration problem, a linear transformation is sufficient. It should be noted that the structure of the vasculature is commonly used for human identification purposes. The problem of automated human identification within the patient database is presented as well.
Fig. 1. The base retinal image (left); the retinal image committed for registration (centre); superimposed retinal images whose structures have not yet been registered (right)
2 Image Pre-processing and Scaling
The eye fundus images were collected in the Department of Ophthalmology of the Institute for Biomedical Research of Kaunas University of Medicine, using the fundus camera Canon CF-60UVi at a 60° angle. 6.3 Mpixel images (image size 3072x2048 pixels) were taken. The magnification quotient was 0.0065248 mm/pixel and the common magnification quotient for the eye–fundus-camera system was 0.556782±0.000827 (mean±SD). The scale for the fundus camera was 0.01171875 mm/pixel.
In order to register the position of the OD in two retinal images, first of all we have to pre-process the images. The first step of image pre-processing is accomplished by scaling down the retinal image to the size of 768x512 pixels. Scaling is performed in order to decrease the computation time. Basically, the morphological operations are the most time-consuming procedures, because each pixel in the spatial domain is probed with some structuring element, also known as a convolution kernel. Scaling therefore leads to a substantial acceleration of vessel structure extraction, which is very important at this stage. It is well known that every pixel in a colour image can be described by three components, namely the red (R), green (G), and blue (B) channel intensity values. Then, every image that consists of NxM pixels can be described by three separate matrices {R(x,y); G(x,y); B(x,y)}, where x = 1,…,N; y = 1,…,M. Here each function returns the intensity value of the corresponding channel at position (x,y). As usual, in order to calculate the monochrome luminance of a colour image, we need to apply coefficients related to the eye's sensitivity to each of the RGB channels. This is done according to the NTSC standard and can be expressed by
I(x,y) = 0.2989*R(x,y) + 0.5870*G(x,y) + 0.1140*B(x,y).
Here I is the intensity image with integer values ranging from a minimum of zero to a maximum of 255.
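For illustration, a straightforward implementation of this conversion (our own sketch, operating on an interleaved 8-bit RGB buffer) could look as follows.

```cpp
#include <cstddef>
#include <vector>

// Convert an interleaved 8-bit RGB image to a single-channel intensity image
// using the NTSC weights quoted above.
std::vector<unsigned char> toIntensity(const std::vector<unsigned char>& rgb,
                                       int width, int height)
{
    std::vector<unsigned char> intensity(static_cast<std::size_t>(width) * height);
    for (std::size_t p = 0; p < intensity.size(); ++p) {
        const unsigned char r = rgb[3 * p + 0];
        const unsigned char g = rgb[3 * p + 1];
        const unsigned char b = rgb[3 * p + 2];
        // round to the nearest integer intensity value in [0, 255]
        intensity[p] = static_cast<unsigned char>(0.2989 * r + 0.5870 * g + 0.1140 * b + 0.5);
    }
    return intensity;
}
```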
3 Mathematical Morphology and Point-Wise Operations
Morphological operations were originally developed to deal with binary images, but they can easily be applied to intensity images. In the case of intensity images, erosion and dilation are understood as a nonlinear search for the minimum or maximum under a filter window, while opening and closing are combinations of erosion and dilation. However, the fundamental concepts of grey-level morphology operations cannot be directly applied to colour images [5, 13]. Thus, we need to convert colour images to intensity ones as described in the pre-processing section. Morphological operations typically probe an image with a small shape or template known as a structuring element. The four basic morphological operations are erosion, dilation, opening, and closing [15]. The grey-scale erosion can be described as the calculation of the minimum pixel value within the structuring element centred on the current pixel $A_{i,j}$. Denoting an image by $I$ and a structuring element by $Z$, the erosion operation $I \ominus Z$ at a particular pixel $(x,y)$ is defined as

$$I \ominus Z = \min_{(i,j)\in Z} \left( A_{x+i,\,y+j} \right). \qquad (1)$$

Here $i$ and $j$ are indices of the pixels of $Z$. The grey-scale dilation is considered in a dual manner and thus can be written as

$$I \oplus Z = \max_{(i,j)\in Z} \left( A_{x+i,\,y+j} \right). \qquad (2)$$

The opening of an image is defined as erosion followed by dilation, while the image closing includes dilation followed by erosion. Thus, the morphological closing operation can be defined as follows:

$$I \bullet Z = (I \oplus Z) \ominus Z = \min_{(i,j)\in Z} \left( \max_{(i,j)\in Z} \left( A_{x+i,\,y+j} \right) \right). \qquad (3)$$
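A minimal sketch of grey-scale erosion, dilation and closing for an 8-bit intensity image is shown below; it assumes a square (2k+1)x(2k+1) structuring element, which is our simplification, since the paper does not fix a particular shape here.

```cpp
#include <algorithm>
#include <vector>

// Grey-scale erosion/dilation with a (2k+1)x(2k+1) square structuring element,
// following Eqs. (1)-(2); neighbours outside the image are simply skipped.
static std::vector<unsigned char> morph(const std::vector<unsigned char>& img,
                                        int w, int h, int k, bool dilate)
{
    std::vector<unsigned char> out(img.size());
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            int best = dilate ? 0 : 255;
            for (int j = -k; j <= k; ++j)
                for (int i = -k; i <= k; ++i) {
                    const int xx = x + i, yy = y + j;
                    if (xx < 0 || yy < 0 || xx >= w || yy >= h) continue;
                    const int v = img[yy * w + xx];
                    best = dilate ? std::max(best, v) : std::min(best, v);
                }
            out[y * w + x] = static_cast<unsigned char>(best);
        }
    return out;
}

// Closing = dilation followed by erosion (Eq. 3); the structuring element must be
// large enough to cover the vessel cross-sections it is meant to remove.
std::vector<unsigned char> closing(const std::vector<unsigned char>& img, int w, int h, int k)
{
    return morph(morph(img, w, h, k, true), w, h, k, false);
}
```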
The closing operator usually smoothes away the small-scale dark structures from colour retinal images. As closing only eliminates the image details smaller than the structuring element used, it is convenient to set the structuring element big enough to cover all possible vascular structures. Mendels et al. [9] applied the closing grey-level morphology operation to smooth the vascular structures out of retinal images. Let us assume that the scheme presented above is applied to two pre-processed images. Also, let us say that one image (P1) is the base image and the other image (P2) is the image that has to be registered on the base image. In order to see the differences between two spatial images, the technique of intensity value subtraction is frequently used. This operation can be defined as Q(x,y) = |P1(x,y) − P2(x,y)|. After this operation (for each x = 1,…,N; y = 1,…,M), if there are no changes at a particular position in the spatial domain, the subtracted intensity value becomes 0; otherwise, if there are some differences, the intensity value does not become 0. In order to visualize the subtracted image, we have to apply an intensity adjustment procedure. This is needed because the vessel intensity values in colour images are very low compared to the surrounding background of the retinal image. Basically, the intensity adjustment procedure can be described in this way. Let us assume that the distribution of intensity values of the subtracted image Q(x,y) and the transformation function f are continuous in the interval [0, 1] [4]. Moreover, assume that the transfer function is single-valued and monotonically increasing. Then the actual intensity levels in this interval are recalculated using the function f to the desired intensity levels in a desired interval. In our investigation we used the desired interval with a minimum value of 0 and a maximum value of 255. Also, the Gamma correction factor was set to 1 (this transfer function is nearly linear). By thresholding the intensity-adjusted image, we will next be able to apply the skeletonization operation in order to obtain the reference vasculature structures in both images. For the automated threshold level calculation we use Otsu's method based on the weighted histogram calculation [12]. Otsu's method maximizes the a posteriori between-class variance $\sigma_B^2(\tau_1)$ given by

$$\sigma_B^2(\tau_1) = w_0(\tau_1)\left[1 - w_0(\tau_1)\right] \left( \frac{\mu_T(\tau_1) - \mu_1(\tau_1)}{1 - w_0(\tau_1)} - \frac{\mu_1(\tau_1)}{w_0(\tau_1)} \right)^2. \qquad (4)$$

Here

$$w_0(\tau_1) = \sum_{i=0}^{\tau_1} \frac{n_i}{N}; \quad w_1(\tau_1) = 1 - w_0(\tau_1); \quad \mu_1(\tau_1) = \sum_{i=0}^{\tau_1} \frac{i\, n_i}{N}; \quad \mu_T(\tau_1) = \sum_{i=0}^{L-1} \frac{i\, n_i}{N}.$$

The optimal threshold $\tau_1$ is found by Otsu's method through a sequential search for the maximum of $\sigma_B^2(\tau_1)$ over $0 \le \tau_1 < L$, where $n_i$ represents the number of pixels in grey-level $i$, $L$ is the number of grey-levels, and $N$ is the total number of pixels in the image [18]. The Otsu thresholding method was applied because in the subsequent morphological operation we have to distinguish unambiguously what is foreground and what is background in the retinal image. The foreground is assumed to be the vasculature of the retinal image and the background is the remaining part of the retinal image.
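The following sketch computes the Otsu threshold of Eq. (4) directly from the grey-level histogram; it is our own minimal implementation, not the authors' code.

```cpp
#include <vector>

// Otsu threshold for an 8-bit intensity image: maximise the between-class
// variance of Eq. (4) over all candidate thresholds.
int otsuThreshold(const std::vector<unsigned char>& img)
{
    const int L = 256;
    std::vector<double> hist(L, 0.0);
    for (unsigned char v : img) hist[v] += 1.0;
    const double N = static_cast<double>(img.size());

    double muT = 0.0;                              // mu_T: mean grey-level of the whole image
    for (int i = 0; i < L; ++i) muT += i * hist[i] / N;

    int best = 0;
    double bestVar = -1.0, w0 = 0.0, mu1 = 0.0;
    for (int t = 0; t < L - 1; ++t) {
        w0  += hist[t] / N;                        // w0(t): cumulative class probability
        mu1 += t * hist[t] / N;                    // mu1(t): cumulative class mean term
        if (w0 <= 0.0 || w0 >= 1.0) continue;      // skip degenerate splits
        const double d = (muT - mu1) / (1.0 - w0) - mu1 / w0;
        const double varB = w0 * (1.0 - w0) * d * d;
        if (varB > bestVar) { bestVar = varB; best = t; }
    }
    return best;                                   // foreground: pixels above this level
}
```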
The next step is to extract the structure of the vasculature from both images: from the base image and from the image to be registered. To this end we have used the medial axis transform (skeletonization) [10]. Basically, the skeletonization operation is computed by shifting the origin of the structuring element (Fig. 2) to each possible pixel position in the image. Then, at each position it is compared with the underlying image pixels. If the foreground and background pixels in the structuring element match exactly the foreground and background pixels in the image, then the image pixel situated under the origin of the structuring element is set to the background; otherwise, it is left unchanged. Here a foreground pixel is assumed to be 1 and a background pixel 0. An empty cell means that a particular pixel is of no interest and is not taken into account for evaluation.
Fig. 2. Structuring elements used for skeletonization
In Fig. 2, both images are first skeletonized by the left-hand structuring element, and afterwards by the right-hand one. Then the above presented process is performed with the remaining six 90° rotations of those two elements during the same iteration. The iteration process is stopped when there are no changes in the images for the last two iterations.
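A sketch of one such skeletonization pass is shown below. The concrete 3x3 elements of Fig. 2 are not reproduced here, so the structuring element is passed in as a parameter (1 for required foreground, 0 for required background, -1 for a don't-care cell), and the driver loop over the eight rotated elements is left to the caller; evaluating all matches against the input of the pass is our simplification.

```cpp
#include <array>
#include <vector>

using Binary = std::vector<int>;                    // 0 = background, 1 = foreground
using Element = std::array<std::array<int, 3>, 3>;  // 1/0 must match, -1 = don't care

// Rotate a 3x3 structuring element by 90 degrees clockwise.
Element rotate90(const Element& e)
{
    Element r{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            r[j][2 - i] = e[i][j];
    return r;
}

// One thinning pass: every foreground pixel whose neighbourhood matches the element
// exactly (ignoring don't-care cells) is set to background. Returns true if anything changed.
bool thinOnce(Binary& img, int w, int h, const Element& e)
{
    bool changed = false;
    Binary out = img;                               // all matches tested against the pass input
    for (int y = 1; y < h - 1; ++y)
        for (int x = 1; x < w - 1; ++x) {
            if (img[y * w + x] != 1) continue;
            bool match = true;
            for (int j = -1; j <= 1 && match; ++j)
                for (int i = -1; i <= 1 && match; ++i) {
                    const int want = e[j + 1][i + 1];
                    if (want != -1 && want != img[(y + j) * w + (x + i)]) match = false;
                }
            if (match) { out[y * w + x] = 0; changed = true; }
        }
    img = out;
    return changed;
}
```

The full skeletonization repeats this pass over the eight rotations of the two elements of Fig. 2 until an iteration leaves the image unchanged.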
4 Transformation to the Frequency Domain
In order to register the two vasculature trees obtained by the proposed scheme, we have to incorporate some cross-correlation method. It is well known that for big images the spatial convolution methods designed for cross-correlation run very slowly. This problem can be solved by introducing the discrete Fourier transform (DFT) [3]. Usually the DFT is defined for a discrete function f(m,n) that is non-zero over the finite region $0 \le m \le M-1$ and $0 \le n \le N-1$. In our case, this function represents a retinal image in the spatial domain. Then, the two-dimensional discrete Fourier transform of the M by N matrix can be calculated as follows:

$$F(p,q) = \sum_{m=0}^{M-1} \sum_{n=0}^{N-1} f(m,n)\, e^{-i\left(\frac{2\pi}{M}\right)pm}\, e^{-i\left(\frac{2\pi}{N}\right)qn}, \qquad (5)$$

where p = 0,…,M−1 and q = 0,…,N−1. The inverse DFT can be obtained by applying

$$f(m,n) = \frac{1}{MN} \sum_{p=0}^{M-1} \sum_{q=0}^{N-1} F(p,q)\, e^{i\left(\frac{2\pi}{M}\right)pm}\, e^{i\left(\frac{2\pi}{N}\right)qn}. \qquad (6)$$

Here m = 0,…,M−1 and n = 0,…,N−1.
The Fourier transform produces a complex-valued output image. This image can be displayed with two images, either the real and imaginary parts or the magnitude and phase. In our investigation, we apply Eq. 5 to the base retinal image. The retinal image committed for the registration process is rotated by 180°, since the convolution operation itself reverses the provided pattern [17]. Then Eq. 5 is applied to the rotated pattern as well. This results in four arrays: the real and imaginary parts of the two images being convolved. Multiplying the real and imaginary parts of the base image by those of the image committed for registration (a pointwise complex multiplication) generates a new frequency image with real and imaginary parts. Taking the inverse DFT of the newly created frequency image, described by Eq. 6, completes the algorithm by producing the final convolved image. The value of each pixel in the convolved correlation image is a measure of how well the target image matches the searched image at a particular point. The calculated correlation image is composed of noise plus a single high peak, indicating the best match of the vasculature of the image to be registered within the base retinal image vasculature. Simply locating the highest peak in this image specifies the detected coordinates of the best match. The frequency transformation procedure described above is also applied by convolving the structure of the vasculature of the image to be registered with itself. This is done because additional coordinates are necessary that show where the best match of the image with itself is (Fig. 3).
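The whole frequency-domain matching step can be sketched as below. For clarity the DFT of Eqs. (5)–(6) is evaluated directly (an FFT library would be used in practice), the images are assumed to be real-valued skeleton images of equal size stored in row-major order, and the function only returns the coordinates of the correlation peak; as described above, this peak is then compared with the peak of the image correlated with itself to obtain the shift.

```cpp
#include <cmath>
#include <complex>
#include <utility>
#include <vector>

using cpx = std::complex<double>;

// Direct evaluation of Eq. (5)/(6); O((MN)^2), for illustration only.
std::vector<cpx> dft2d(const std::vector<cpx>& f, int M, int N, bool inverse)
{
    const double PI = 3.14159265358979323846;
    const double sign = inverse ? 1.0 : -1.0;
    std::vector<cpx> F(static_cast<std::size_t>(M) * N);
    for (int p = 0; p < M; ++p)
        for (int q = 0; q < N; ++q) {
            cpx sum(0.0, 0.0);
            for (int m = 0; m < M; ++m)
                for (int n = 0; n < N; ++n) {
                    const double phase = sign * 2.0 * PI * (double(p) * m / M + double(q) * n / N);
                    sum += f[m * N + n] * cpx(std::cos(phase), std::sin(phase));
                }
            F[p * N + q] = inverse ? sum / double(M * N) : sum;
        }
    return F;
}

// Locate the correlation peak of 'pattern' against 'base' (both M x N, row-major 0/1 values).
// Note that the product of DFTs gives a circular convolution; in practice the images
// can be zero-padded to avoid wrap-around effects.
std::pair<int, int> correlationPeak(const std::vector<double>& base,
                                    const std::vector<double>& pattern, int M, int N)
{
    std::vector<cpx> a(static_cast<std::size_t>(M) * N), b(a.size());
    for (int m = 0; m < M; ++m)
        for (int n = 0; n < N; ++n) {
            a[m * N + n] = base[m * N + n];
            b[m * N + n] = pattern[(M - 1 - m) * N + (N - 1 - n)];  // 180-degree rotation
        }
    std::vector<cpx> A = dft2d(a, M, N, false), B = dft2d(b, M, N, false);
    for (std::size_t i = 0; i < A.size(); ++i) A[i] *= B[i];        // pointwise complex product
    const std::vector<cpx> corr = dft2d(A, M, N, true);             // back to the spatial domain
    std::size_t best = 0;
    for (std::size_t i = 1; i < corr.size(); ++i)
        if (corr[i].real() > corr[best].real()) best = i;
    return { static_cast<int>(best % N), static_cast<int>(best / N) };  // (x, y) of the peak
}
```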
Fig. 3. Peaks indicating the shift along the x axis (left) and along the y axis (right)
In Fig. 3, on both sides, the smaller peak corresponds to the two different images convolved together. The biggest peak corresponds to the image convolved with itself. Then, by introducing a simple linear transform to the retinal image committed for registration, we shift its pixels by the calculated distance along the x and y axes. The result of closing, subtraction, histogram equalization, thresholding, skeletonization and shift calculation is shown in Fig. 4. Fig. 4 shows the two structures of vasculature extracted by the method proposed above. The stronger structure belongs to the base retinal image on which the retinal image intended for registration has to be placed. The weaker structure of vasculature belongs to the retinal image intended for registration.
Fig. 4. Superimposed structures of the extracted vasculature of two retinal images before registration (top) and the registered vasculature structures of the two retinal images (bottom)
5 Results
Eye fundus images were provided by the Department of Ophthalmology of the Institute for Biomedical Research of Kaunas University of Medicine (BRKU). The testing
set consisted of retinal images of both eyes of 19 patients. It should be noted that registration of images is possible only if those images are of the same patient and of the same eye. This comes from the fact that the structure of the eye vasculature of each human is unique. In order to verify this fact and to obtain a measure of the registration error, the proposed algorithm was applied to retinal images of the same eye taken from different patients (Fig. 5).
Fig. 5. Two correlation results: the correlation between the base retinal image and the image committed for registration (left), and the self-correlation of a retinal image (right)
Fig. 5 shows the magnitudes of the convolved images along the x axis. In this particular case, where the images are not of the same person, note that the magnitudes in the left image are dramatically lower than those in the image on the right. Thus, to evaluate the quality of the registration, we computed the peak-signal-to-noise ratio (PSNR). 119 possible pairs of eye fundus images have been investigated. The conditions for those images to be of the same person and also of the same eye have been satisfied. The results achieved are shown in Fig. 6.
Fig. 6. Histogram of the peak-signal-to-noise ratio
In Fig. 6, a histogram of the peak-signal-to-noise ratio is presented. According to [11], acceptable PSNR values are between 20 dB and 40 dB. The higher the value in decibels, the better the registration performed (Fig. 7). Here we can draw a conclusion on automated human identification from retinal images: if the PSNR is dramatically lower than 20 dB, one can decide that it is not the same person. In the case shown in Figs. 5 and 3 the calculated PSNR value was 4.3 dB.
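The paper does not spell out the exact PSNR formula it uses; the sketch below applies the standard definition from the image coding literature [11], computed here between the base image and the registered image over 8-bit intensities. Using that pair of images as the inputs is our assumption.

```cpp
#include <cmath>
#include <cstddef>
#include <limits>
#include <vector>

// Standard PSNR between two 8-bit intensity images of equal size (our assumption of
// the measurement setup, not necessarily the authors' exact procedure).
double psnr(const std::vector<unsigned char>& a, const std::vector<unsigned char>& b)
{
    double mse = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        mse += d * d;
    }
    mse /= static_cast<double>(a.size());
    if (mse == 0.0) return std::numeric_limits<double>::infinity();  // identical images
    return 10.0 * std::log10(255.0 * 255.0 / mse);                   // result in dB
}
```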
Fig. 7. Two overlapping images (left) and the registered image with a quality of 51 dB (right)
A comparative PSNR analysis can be made over the patients' database in order to identify a person automatically. This can also be used to address the problem of patient data protection, because the physician works only with the data about the state of the patient without knowing who the patient really is.
6 Conclusions
In this article the authors presented an automated technique for retinal image registration in which two images overlap each other in the best way. The task was accomplished by introducing intensity-level morphology operations for vessel extraction. Then the intensity adjustment procedure was performed to enhance the resulting image after subtraction. This operation was followed by image binarization, after which the skeletonization operation was introduced. In the next step the spatial domain of the extracted vasculature structure was converted into the frequency domain, which allowed a fast convolution of the two images. This enables us to calculate the image shift. The analysis of the provided retinal images showed that the registration quality parameter generally falls within the bounds of decibels accepted in the literature. Also, we have shown that for the fundus image registration problem a linear transformation is enough to obtain satisfactory results. The disclosed human identification problem revealed that the proposed algorithm is also suitable for solving identification-related problems. However, a more careful analysis should be made in order to evaluate the identification results. Acknowledgements. The research is partially supported by the Lithuanian State Science and Studies Foundation project "Information technology tools of clinical decision support and citizens wellness for e.Health system" No. B-07019.
References
1. Chanwimaluang, T., Guoliang, F., Fransen, S.R.: Hybrid retinal image registration. IEEE Transactions on Information Technology in Biomedicine 10(1), 129–142 (2006)
2. Dongxiang, X., Jenq-Neng, H., Chun, Y.: Atherosclerotic blood vessel tracking and lumen segmentation in topology changes situations of MR image sequences. In: Proceedings of the International Conference on Image Processing, vol. 1, pp. 637–640 (2000)
3. Kamen, E.W., Heck, B.S.: Fundamentals of Signals and Systems Using the Web and MATLAB (2000)
4. Gonzalez, R., Woods, R.: Digital Image Processing. Addison-Wesley, Reading (1992)
5. Goutsias, J., Heijmans, H., Sivakumar, K.: Morphological operators for image sequences. Computer Vision and Image Understanding 62, 326–346 (1995)
6. Hoover, A., Goldbaum, M.: Locating the optic nerve in a retinal image using the fuzzy convergence of the blood vessels. IEEE Transactions on Medical Imaging 22(8), 951–958 (2003)
7. Lowell, J., Hunter, A., Steel, D., Basu, A., Ryder, R., Kennedy, R.L.: Measurement of Retinal Vessel Widths from Fundus Images Based on 2-D Modeling. MedImg 23(10), 1196–1204 (2004)
8. Matsopoulos, G.K., Asvestas, P.A., Mouravliansky, N.A., Delibasis, K.K.: Multimodal registration of retinal images using self organizing maps. MedImg 23(12), 1557–1563 (2004)
9. Mendels, F., Heneghan, C., Thiran, J.: Identification of the optic disc boundary in retinal images using active contours. In: Proceedings of the Irish Machine Vision and Image Processing Conference, pp. 103–115 (1999)
10. Mukherjee, J., Kumar, A.M., Das, P.P., Chatterji, B.N.: Use of medial axis transforms for computing normals at boundary points. Pattern Recognition Letters 23(14), 1649–1656 (2002)
11. Netravali, A.N., Haskell, B.G.: Digital Pictures: Representation, Compression, and Standards, 2nd edn. Plenum Press, New York (1995)
12. Otsu, N.: A threshold selection method from gray-level histograms. IEEE Trans. Syst. Man Cybernet. SMC-9(1), 62–66 (1979)
13. Peters, R.: Mathematical morphology for angle-valued images. Non-linear Image Processing. In: International Conference on Electronic Imaging, Society of Photo-optical Instrumentation Engineers, pp. 1–11 (1997)
14. Soares, J.V.B., Leandro, J.J.G., Cesar Jr., R.M., Jelinek, H.F., Cree, M.J.: Retinal vessel segmentation using the 2-D Gabor wavelet and supervised classification. MedImg 25(9), 1214–1222 (2006)
15. Soille, P.: Morphological Image Analysis. Springer, Berlin (1999)
16. Staal, J., Abramoff, M.D., Niemeijer, M., Viergever, M.A., van Ginneken, B.: Ridge-based vessel segmentation in color images of the retina. MedImg 23(4), 501–509 (2004)
17. Smith, S.W.: The Scientist & Engineer's Guide to Digital Signal Processing. California Technical Publishing (1997)
18. Tian, H., Lam, S.K., Srikanthan, T.: Implementing Otsu's Thresholding Process Using Area-Time Efficient Logarithmic Approximation Unit. In: IEEE International Symposium on Circuits and Systems (ISCAS), vol. 4, pp. 21–24 (2003)
19. Vermeer, K.A., Vos, F.M., Lemij, H.G., Vossepoel, A.M.: A model based method for retinal blood vessel detection. Computers in Biology and Medicine 34, 209–219 (2004)
Acceleration of High Dynamic Range Imaging Pipeline Based on Multi-threading and SIMD Technologies
Radoslaw Mantiuk and Dawid Pająk
Szczecin University of Technology, Zolnierska 49, 71-210 Szczecin, Poland
{rmantiuk,dpajak}@wi.ps.pl
http://zgk.wi.ps.pl
Abstract. In this paper we present a holistic approach to CPU based acceleration of the high dynamic range imaging (HDRI) pipeline. The high dynamic range representation can encode images regardless of the technology used to create and display them, with the accuracy that is only constrained by the limitations of the human eye and not a particular output medium. Unfortunately, the increase in accuracy causes significant computational overhead and effective hardware acceleration is needed to ensure a utility value of HDRI applications. In this work we propose a novel architecture of the HDRI pipeline based on CPU SIMD and multi-threading technologies. We discuss the impact on processing speed caused by vectorization and parallelization of individual image processing operations. A commercial application of the new HDRI pipeline is described together with evaluation of achieved image processing speed-up. Keywords: high dynamic range imaging, SIMD architecture, SSE, multi-threading architecture, image processing, computer visualization.
1 Introduction
The advances in high dynamic range imaging (HDRI), especially in display and camera technology, have a significant impact on existing imaging systems. The assumptions of traditional low dynamic range imaging, designed for paper print as a major output medium, are ill suited for the range of visual material that is shown on modern displays. The high dynamic range representation can encode images regardless of the technology used to create and display them, with an accuracy that is only constrained by the limitations of the human eye and not by a particular output medium. The disadvantage of HDRI technology is its computational complexity. A single pixel in a high dynamic range image consumes 4 times more memory than in a low dynamic range image (three 4-byte floating point numbers against 3 bytes in the low dynamic range representation). This complexity means that the HDRI pipeline is not used in many applications which would
significantly benefit from HDR accuracy. For example, processing of huge datasets from medical computed tomography or RAW photographs is problematic for typical personal computers or laptops. Moreover, the constant increase of image resolution and of the complexity of image processing algorithms will make this problem even worse in the future. In this paper we argue that by using modern CPU technologies, it is possible to accelerate the image processing of the HDRI pipeline significantly. The acceleration can be achieved based on existing CPU capabilities: the SIMD instruction set and multi-processor and multi-core architectures. SIMD (Single Instruction Multiple Data) instructions allow us to speed up the processing of floating point vector data, which can represent HDR pixels. Because of the independent transformation of individual pixels, most HDRI algorithms are well suited for parallel processing. A multi-threading architecture accelerates such processing almost by a factor equal to the number of available threads. A goal of efficient HDRI computing is to accelerate the whole HDRI pipeline rather than to speed up individual operations. In this paper we propose the architecture of an accelerated pipeline which benefits from careful optimization of HDRI algorithms and from effective RAM memory management. We also introduce queueing techniques that enable grouping many simple operations into one complex command path. In this way automatic optimization of CPU hardware usage is implemented and effective acceleration of complex algorithms is possible. We review existing SIMD and multi-threading based technologies in Section 2. The concept of the high dynamic range imaging pipeline and its possible acceleration techniques are discussed in Section 3. In Section 4 we present the architecture of our novel HDRI pipeline which uses SIMD and multi-threading technologies to speed up data processing. We then describe a software package that operates on the new HDRI pipeline (Section 5) and discuss the achieved results.
2 Previous Work
A general approach to image processing using SIMD and parallel hardware can be found in a few software packages. The VIPS (VASARI Image Processing System) library [7] seems to be the best-known LGPL system for processing huge images. It divides images into small arrays and uses multi-threading to process them effectively on SMP (Symmetric Multiprocessor) computers. Intel Integrated Performance Primitives (Intel IPP) [8] is a library of multi-core-ready, optimized software functions for multimedia data processing. Careful programming in plain C/C++ code and compilation based on IPP compilers can speed up applications. The Image Processing Toolbox (IPT) [10] provides a set of functions for image manipulation and analysis. The IPT capabilities include SIMD and multi-threading optimized color space transformations, linear filtering, mathematical morphology, geometric transformations, image filtering and analysis. Acceleration is achieved based on the O-Matrix engine [10] that supports fast matrix processing. The multi-threading and SIMD architecture is also exploited by the GENIAL (GENeric Image Array Library) library [9] to speed up the computation of signal processing
algorithms. The architecture of GENIAL is based on the same conventions as the Standard Template Library (STL), consisting of containers, iterators, adaptors, function objects and algorithms. The intensive use of templates makes it possible for the library to automatically adapt calculations on containers to the specified problem in order to achieve faster execution. There are a few acceleration toolkits in the medical imaging community. ITK (Insight Segmentation and Registration ToolKit) [11] is a library for image segmentation and registration. Another available choice, MITK (Medical Imaging ToolKit) [12], uses CPU SIMD instructions to accelerate matrix and vector computations, and linear and tri-linear interpolation computations. Both toolkits provide a general framework for medical imaging rather than a set of highly optimized image processing functions. All these approaches are general and not optimized for HDRI processing. In particular, they are not intended for rendering of the complete HDRI pipeline. In this paper we propose a more efficient solution at the cost of rejecting the generality of computations. We present a holistic approach to CPU based acceleration of the HDRI pipeline. We do not deal with GPU (Graphics Processing Unit) acceleration techniques [1]. The usage of the GPU seems promising for fast HDRI processing, but this technology is not common on many platforms (e.g. mobile phones or PDA devices). Moreover, advanced GPU capabilities (e.g. shader support) are not well standardized yet, so we leave GPU acceleration as future work.
3 Acceleration of the HDRI Pipeline
High dynamic range imaging [2] is a new paradigm that involves a highly accurate representation of images. As it originates from light simulation (computer graphics rendering) and measurements, the pixels of HDR images are assigned a physical meaning. This highly accurate representation of images gives a unique opportunity to create a common imaging framework that could meet the requirements of different imaging disciplines. Figure 1 illustrates an example of the HDRI pipeline [3] that starts with the acquisition of a real world scene or the rendering of an abstract model using computer graphics techniques and ends at a display. This pipeline overcomes the shortcomings of a typical graphics pipeline that doesn't support devices of a higher dynamic range or a wider color gamut [4]. The drawback of the HDRI pipeline is that much more data has to be processed in comparison to a typical 8-bit pipeline. Pixels of an HDR image are represented by a vector of floating point values: one 4-byte floating point number for each of 3 or more color channels (e.g. in the case of multi-spectral imaging more than 10 channels should be supported). The number of bpp (bits per pixel) is then four or more times higher than for 8-bit images. Additionally, it should be considered that most HDRI devices generate huge sets of data because of the growing sampling resolutions of input and output devices. Like in typical images, the dimensionality of HDR images is important and changes of context in both horizontal and vertical directions cannot be neglected. Summing up, to achieve
Acceleration of HDRI Pipeline
783
SIMD Acceleration Thread 1
Real scene
HDR camera
Image chunk 1
Image processing
Tone mapping
Image chunk 2
Image processing
Tone mapping
Image chunk 3
Image processing
Tone mapping
Image chunk 4
Image processing
Tone mapping
Image ch. n-1
Image processing
Tone mapping
Image chunk n
Image processing
Tone mapping
Thread 2 Image splitting
HDR Storage
Image merging
... Thread N
Abstract 3D model
Rendering engine (floating point processing)
LDR display
HDR display
Fig. 1. High dynamic range imaging pipeline accelerated by SIMD and multi-threading operations
the same performance as for the low dynamic range pipeline, the data in the HDRI pipeline need to be processed much faster. Recent developments in CPU hardware allow HDRI processing to be sped up significantly. In particular, SIMD instructions can be exploited effectively to process many data elements in one CPU cycle, and HDRI processing is amenable to parallelization, so multi-threading accelerates the computations as well. To exploit SIMD and multi-threading efficiently, we selected a set of algorithms crucial to HDRI processing and then propose a new architecture of the HDRI pipeline. The set of selected algorithms is depicted in Figure 2. From the acceleration viewpoint, the most important group of operations is matrix arithmetic. This group covers both matrix-by-matrix operations (e.g. matrix multiplication) and scalar-by-matrix operations (e.g. multiplication of all matrix elements by a scalar value). Also, matrix manipulation algorithms, like transposition or vertical and horizontal shifts, should be considered. Channel masking and pixel masking can eliminate selected color channels or pixels from pipeline processing based on conditional expressions. The accumulation algorithms, such as the computation of a sum of pixel values in an image area, are time consuming and should be accelerated. In many cases HDR images must be transformed by non-linear functions, and a Look-Up Table (LUT) is the fastest and simplest way to do so. Finally, a selected group of advanced image processing algorithms is accelerated; this group includes image scaling, color space conversions and color profile conversions.
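To make the scalar-by-matrix group concrete, the sketch below applies one such operation with SSE intrinsics. It is an illustrative fragment rather than code from the library described later; the function name and the alignment and size assumptions are ours.

#include <xmmintrin.h>  // SSE intrinsics
#include <cstddef>

// Multiply every element of a float buffer by a scalar, 4 elements per SSE
// instruction. Assumes 'data' is 16-byte aligned and 'count' is a multiple of 4.
void scale_buffer_sse(float* data, std::size_t count, float scalar) {
    const __m128 s = _mm_set1_ps(scalar);          // broadcast the scalar to 4 lanes
    for (std::size_t i = 0; i < count; i += 4) {
        __m128 v = _mm_load_ps(data + i);          // load 4 packed floats
        v = _mm_mul_ps(v, s);                      // 4 multiplications at once
        _mm_store_ps(data + i, v);                 // store 4 results
    }
}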
Matrix operations:
- scalar-by-matrix: multiplication/division, addition/subtraction, exponential operations, logarithmic operations
- matrix-by-matrix: multiplication, matrix reciprocal, transposition

Accumulation operations: sum of pixel values, computation of the maximum/minimum value, arithmetic average, geometric average

Conditional expressions: channel masking, pixel masking, logic operations

Look-Up Table operations

Advanced operations: image scaling, image rotation, color transformations, color profile conversion, histogram computation, Gaussian pyramid, convolution, statistic coefficients
Fig. 2. HDRI operations crucial for fast image processing
4
Using SIMD Operations and Multi-threading in HDRI Processing
In Figure 1 a novel HDRI pipeline accelerated by CPU SIMD operations and multi-threading is presented. The main goal of the pipeline processing is to exploit the SIMD and multi-threading architectures efficiently. Moreover, data interchange between the CPU and the computer's RAM should be limited as much as possible: all intermediate and temporary results should be stored in CPU registers or in the CPU cache memory. If larger temporary storage is required (e.g. in local tone mapping operators [2]), the algorithm can exploit the L1 cache, as it offers lower access latency and greater performance than system memory. In this case, however, an HDR image must be divided into suitable chunks due to the limited cache size. Parallel processing is exploited in the new pipeline: we divide an HDR image into arrays/chunks and the chunks are processed independently. The size of a chunk is limited by the size of the CPU L1 cache rather than by the number of available threads or processors; we noticed that even in a single-threaded system it is faster to process small chunks of data than the whole image at once. In the pipeline implementation a fixed number of threads (equal to the number of execution units reported by the operating system) is created at run-time. Threads become active only when new tasks are assigned to them and go to sleep right after they finish processing the scheduled tasks. Thread management is handled by the operating system; it is the operating system's responsibility to identify Hyper-Threading, multi-core or multi-processor hardware and manage the threads efficiently. The goal of SIMD computation is to perform a single operation on many data elements. Modern CPU hardware is equipped with a set of SIMD arithmetic, logical, comparison and conversion instructions. They process 128-bit words in one CPU cycle, so a 4-times speed-up of basic operations is potentially possible. Almost all current CPUs offer a SIMD instruction set; examples are SSE [5] in Intel and AMD processors or AltiVec in IBM's PowerPC.
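A minimal, portable sketch of this chunked, thread-per-execution-unit processing is given below. It uses std::thread instead of the Win32 primitives mentioned later in the paper, and the chunk size and function names are illustrative assumptions, not the library's actual values.

#include <thread>
#include <vector>
#include <algorithm>
#include <cstddef>

// Illustrative chunk size: a few tens of KB so that one chunk plus temporaries
// fits comfortably in the L1 data cache. A real value would be tuned per CPU.
constexpr std::size_t kChunkFloats = 8 * 1024;

// Apply 'op' to an HDR buffer chunk by chunk, one worker per hardware thread.
template <typename Op>
void process_chunks(float* pixels, std::size_t count, Op op) {
    const unsigned workers = std::max(1u, std::thread::hardware_concurrency());
    const std::size_t chunks = (count + kChunkFloats - 1) / kChunkFloats;

    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([=] {
            // Static interleaved assignment of chunks to workers.
            for (std::size_t c = w; c < chunks; c += workers) {
                const std::size_t begin = c * kChunkFloats;
                const std::size_t end = std::min(begin + kChunkFloats, count);
                op(pixels + begin, end - begin);   // process one cache-sized chunk
            }
        });
    }
    for (auto& t : pool) t.join();   // synchronize before the next pipeline stage
}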
HDRI processing is especially suited to acceleration based on the SIMD architecture. A common 4-channel RGBA representation (red, green, blue and alpha channels) of an HDR pixel can be considered as one 128-bit word, so all channels can be processed simultaneously. The CPU SIMD instruction set delivers most of the operations required in HDRI computing [6]. In the case of advanced operations which are not available (like logarithm computation or exponentiation), we use existing instructions to approximate the results. For example, to compute log2(x) we use specific features of the single precision floating-point number representation. As defined in the IEEE 754-1985 specification, a single precision number is described by s * 2^e * m, where s is the sign bit, e is the 8-bit exponent and m is the 24-bit normalized mantissa. We calculate the log2(x) value by extracting the exponent from the number representation and adding it to an approximation of log2(m) of the extracted mantissa: log2(x) = log2(2^e * m) = e + log2(m), where x > 0. In our implementation we use a fifth-degree Chebyshev minimax polynomial to approximate the function log2(m). This technique results in a small relative error (about 10^-6) and can be implemented very efficiently on the SIMD architecture. Most of the operations performed inside the HDRI pipeline are executed on the luminance (single channel) values of HDR pixels. This makes the calculation even more efficient, as we process 4 pixels with each instruction. An example of this set of operations is the Look-Up Table transformation, which lets the user apply a custom non-linear transformation to the input luminance values (e.g. a custom global TMO curve). Given a set of non-overlapping input ranges [a0, b0], ..., [an, bn] and their mapped counterpart ranges [c0, d0], ..., [cn, dn], we transform a luminance value l using the formula le = cx + ((l - ax) / (bx - ax)) * (dx - cx), where ax <= l < bx. The most time consuming task is finding the right input range for every pixel. By using SIMD selective write operations and bit masking we speed up the procedure by finding the ranges for 4 pixels at the same time. Because of the characteristics of image data (neighbouring pixels have relatively small differences in luminance values), the performance drawback coming from redundant loop passes (some pixels in the vector might have their range locked, but we continue the search until the ranges for all 4 pixels are found) is in most cases negligible.
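The exponent/mantissa decomposition can be sketched in scalar C++ as follows. This is not the authors' SIMD code: the bit manipulation mirrors the description above, while std::log2 stands in for the fifth-degree Chebyshev polynomial that the real implementation evaluates in SSE registers.

#include <cmath>
#include <cstdint>
#include <cstring>

// Scalar illustration of the log2 trick described above: split x > 0 into
// exponent and mantissa, so that log2(x) = e + log2(m) with m in [1, 2).
float log2_via_exponent(float x) {
    std::uint32_t bits;
    std::memcpy(&bits, &x, sizeof bits);            // reinterpret the IEEE 754 bits

    const int e = static_cast<int>((bits >> 23) & 0xFF) - 127;  // unbiased exponent

    // Force the exponent field to 127 so the remaining bits give m in [1, 2).
    const std::uint32_t mant_bits = (bits & 0x007FFFFFu) | 0x3F800000u;
    float m;
    std::memcpy(&m, &mant_bits, sizeof m);

    return static_cast<float>(e) + std::log2(m);    // the polynomial would replace std::log2
}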
5
Example Application: The HDRI Library
We implemented the HDRI pipeline as a Windows dynamic library. All public methods of the library are accessed through a simple C API, similar to OpenGL, with an internal state machine on the library side. The library uses multi-threading and vector processing capabilities. The functions are manually optimized with intensive usage of MMX/SSE/SSE2 intrinsic code. The initialization code is able to detect the features of the CPU (by invoking the cpuid instruction) and choose the best possible code path for the library methods. A special 64-bit version is also available for Win64 systems (an additional performance gain comes from the increased number of CPU SIMD registers). The library itself was implemented in C++.
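A sketch of how such run-time dispatch might look is shown below, using the MSVC __cpuid intrinsic. The feature bits tested are the documented CPUID leaf-1 bits for SSE/SSE2/SSE3; the structure and function names are ours and do not reflect the library's internal API.

#include <intrin.h>   // MSVC: __cpuid

struct CpuFeatures { bool sse, sse2, sse3; };

// Query CPUID leaf 1 and read the SSE feature bits (EDX bit 25/26, ECX bit 0).
CpuFeatures detect_cpu_features() {
    int regs[4] = {0, 0, 0, 0};       // EAX, EBX, ECX, EDX
    __cpuid(regs, 1);
    CpuFeatures f;
    f.sse  = (regs[3] & (1 << 25)) != 0;
    f.sse2 = (regs[3] & (1 << 26)) != 0;
    f.sse3 = (regs[2] & (1 << 0))  != 0;
    return f;
}

The detected features would then be used to select, for example, an SSE2 or plain FPU code path for each exported function.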
Fig. 3. Selected HDRI library pipeline stages (simplified). The default configuration of the HDRI pipeline combines a specified number of input HDR images into one intermediate image, which is then processed and written as an output LDR image. Most of the functions work on the luminance channel, which is extracted in the pre-processing stage. At the end of the pipeline the results of the luminance processing are applied to all channels of the input HDR image.

The activity diagram in Figure 3 shows the simplified processing stages of the HDRI pipeline: blending and geometry transformation (e.g. cropping) of the input HDR images, luminance channel extraction, optional computation of luminance statistics (e.g. minimum and maximum values), optional luminance clipping, optional false colour image generation, tone mapping (TMO), RGB retrieval, optional sRGB gamma correction, and output of the LDR image. The actual implementation of the pipeline stages is much more complicated and includes additional features and modules; we skip them because they do not influence the design of the pipeline acceleration methods. A synchronization of all working threads is required to gather the results, calculate the required image-specific factors and then move to the next stage. The synchronization of threads is implemented via the Win32 event system, which is known for low latency and quick kernel dispatching time. Even though we use the most efficient synchronization method available, this task tends to be the slowest element of the pipeline processing (the delays are not even correlated with the image data size). By carefully designing the pipeline stages and the operation types at each stage we reduced the number of synchronizations to the absolute minimum. Besides the fixed functionality pipeline we have implemented a mechanism which lets the programmer combine many simple SIMD accelerated functions (addition, multiplication, division, etc.) with their corresponding arguments into one complex command, which is then executed by the multi-threaded, chunk processing engine of the pipeline. This queueing method adds flexibility to the computational abilities of the library, but it comes at the price of degraded locality of calculations and increased memory bandwidth usage; thus it cannot compete in terms of performance with the fixed functionality pipeline. A set of 5 HDR images (6 M-pixels each) was used to conduct performance tests of the library. The tests cover both low level (basic data processing) operations and advanced HDRI processing operations. The timings in the upper part of Table 1 are arithmetic means of the results obtained for the individual test images. The lower part of the table covers tests of the execution of the whole HDRI pipeline, starting from HDR image blending (for 5 input images) and ending at tone mapping (TMO) and LDR image generation. The test platform was a Windows XP based PC equipped with an Athlon X2 3800+ dual core CPU and 2 GB of DDR RAM. Noticeable speed-ups are visible both in internal functions and in overall pipeline performance. The computational power of the SIMD architecture is especially exposed in the calculation of the log-average (arithmetic mean of logarithms) or the sRGB gamma correction, where approximated functions are used instead of their built-in counterparts. The decreased acceleration ratio of functions with conditional execution (compare the timings for log-average and conditional log-average) is caused by the streaming nature of SIMD computation: we calculate the values of all elements in the vector and write the result depending on the condition bit mask, whereas the SISD implementation only calculates the results for the elements which pass the condition statement. Operations performing well on a super-scalar FPU (e.g. image blending) or resistant to vectorization due to the nature of the algorithm (tone mapping) still take advantage of parallel processing, with a speed-up close to the theoretical limit of 2 (the number of cores in our test system). The performance of false color image generation (a process of encoding the HDR image luminance into an LDR image where certain luminance values are mapped to predefined RGB triplets from a LUT)
does not scale linearly with the number of available execution units and seems to be bound by memory bandwidth.
Table 1. Computation time of internal pipeline functions. For the SIMD implementation both single-threaded (st) and multi-threaded (mt) tests were performed. The speed-up is the ratio between the measured time of the single-threaded FPU code and the multi-threaded SIMD implementation.

Operation                                     Time [ms]                      Speed-up
                                              FPU(st)  SIMD(st)  SIMD(mt)
HDR image blending                                257       238       139       1.85
Log-average                                       395        33        19      20.79
Conditional log-average                           238        43        25       9.52
XYZ to RGB conversion                              64        38        20       3.2
Conditional XYZ to RGB conversion                  90        49        22       4.09
sRGB gamma correction                            1971       302       157      12.55
Global photographic TMO [13]                     3146       590       360       8.74
Gamma TMO                                        3356       621       375       8.95
False color image generation (LUT example)        690       321       220       3.14
We have compared our implementation with the VIPS library in terms of low-level function performance. All tested functions of the HDRI library performed better than their VIPS equivalents. For example, the computation of the log-average is about 11 times faster in our solution (both test programs utilize the multi-threading architecture). This was expected, as the VIPS library is architecture independent and does not use the SIMD instruction set or math function approximations.
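The branch-free conditional execution responsible for the lower speed-up of the conditional variants in Table 1 (all lanes are always computed and the result is then selected by a bit mask) can be illustrated with a generic SSE sketch. This is not the library's conditional log-average code; the alignment and size assumptions are ours.

#include <xmmintrin.h>
#include <cstddef>

// Branch-free conditional sum: add only the elements greater than 'threshold'.
// All four lanes are always computed; the compare mask zeroes out the rejected
// ones before accumulation. Assumes 16-byte alignment and count % 4 == 0.
float masked_sum_sse(const float* data, std::size_t count, float threshold) {
    const __m128 thr = _mm_set1_ps(threshold);
    __m128 acc = _mm_setzero_ps();
    for (std::size_t i = 0; i < count; i += 4) {
        const __m128 v    = _mm_load_ps(data + i);
        const __m128 mask = _mm_cmpgt_ps(v, thr);   // all-ones where v > threshold
        acc = _mm_add_ps(acc, _mm_and_ps(v, mask)); // rejected lanes contribute 0
    }
    float lanes[4];
    _mm_storeu_ps(lanes, acc);
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
}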
6
Conclusions and Future Work
The limitations of the existing low dynamic range imaging technology can be addressed and eliminated in the HDR imaging pipeline, which offers a higher precision of visual data. In this work we outlined acceleration techniques for processing HDR images. We use the SIMD and multi-threading technologies available in present CPU hardware to overcome computation bottlenecks. Those technologies, together with the proposed novel data processing architecture, make the HDRI pipeline as fast as the traditional pipeline. In the final section of the paper we described the software library implemented based on the proposed design. The utility of this library has been proven in commercial applications (they will be available on the market in the near future). A further increase of the computation speed of the HDRI library could be achieved using specialized instructions from the SSE3/SSE4 extensions (e.g. the hardware dot product and horizontal data processing instructions could speed up accumulation operations by a factor of 4). At the time of writing the implementation, CPUs with SSE3/SSE4 support were not widespread enough in the market to include this code path in the library design, but we plan to use those operations in the future. We have also been extending the functionality of the library to accelerate more complex image processing operations (e.g. histogram computation or Gaussian pyramid usage). In the future we plan to port the library to GPU hardware.
Acknowledgments. The research work whose results are presented in this paper was sponsored by the Polish Ministry of Science and Higher Education (years 2006-2008).
References

1. Owens, J.D., Luebke, D., Govindaraju, N., Harris, M., Kruger, J., Lefohn, A.E., Purcell, T.: A survey of general-purpose computation on graphics hardware. In: Proc. of Eurographics 2005, State of the Art Reports, September 2005, pp. 21–51 (2005)
2. Reinhard, E., Ward, G., Pattanaik, S., Debevec, P.: High Dynamic Range Imaging: Data Acquisition, Manipulation, and Display. Morgan Kaufmann, San Francisco (2005)
3. Mantiuk, R., Krawczyk, G., Mantiuk, R.: High Dynamic Range Imaging Pipeline: Merging Computer Graphics, Physics, Photography and Visual Perception. In: Spring Conference on Computer Graphics 2006 Posters and Conference Materials, April 20-22, pp. 37–40 (2006)
4. Mantiuk, R., Krawczyk, G., Mantiuk, R., Seidel, H.P.: High Dynamic Range Imaging Pipeline: Perception-motivated Representation of Visual Content. In: Proc. of SPIE, Human Vision and Electronic Imaging XII, vol. 6492, 649212 (2007)
5. Intel 64 and IA-32 Architectures Optimization Reference Manual (May 2007)
6. Intel Architecture Software Developer Manual, Instruction Set Reference, vol. 2 (1999)
7. Martinez, K., Cupitt, J.: VIPS - a highly tuned image processing software architecture. In: Proceedings of the IEEE International Conference on Image Processing, Genova, vol. 2, pp. 574–577 (2005)
8. Taylor, S.: Intel Integrated Performance Primitives Book. ISBN 0971786135, ISBN-13 9780971786134 (2004)
9. Laurent, P.: GENIAL - GENeric Image Array Library, http://www.ient.rwth-aachen.de/team/laurent/genial/genial.html
10. Harmonic Software Inc.: IPT - The Image Processing Toolbox for O-Matrix, http://www.omatrix.com/ipt.html
11. Ibanez, L., Schroeder, W., Ng, L., Cates, J.: The ITK Software Guide, 2nd edn. Kitware, Inc. (November 2005)
12. Zhao, M., Tian, J., Zhu, X., Xue, J., Cheng, Z., Zhao, H.: The Design and Implementation of a C++ Toolkit for Integrated Medical Image Processing and Analysis. In: Proc. of SPIE Conference, vol. 6, pp. 5367–5374 (2004)
13. Reinhard, E., Stark, M., Shirley, P., Ferwerda, J.: Photographic Tone Reproduction for Digital Images. ACM Trans. on Graphics 21(3), 267–276 (2002)
Monte Carlo Based Algorithm for Fast Preliminary Video Analysis Krzysztof Okarma and Piotr Lech Szczecin University of Technology, Faculty of Electrical Engineering, Chair of Signal Processing and Multimedia Engineering, 26. Kwietnia 10, 71-126 Szczecin, Poland {krzysztof.okarma,piotr.lech}@ps.pl
Abstract. In the paper a fast statistical image processing algorithm for video analysis is presented. Our method can be used on colour as well as grayscale or even binary images. The main component of the proposed approach is based on statistical analysis using the Monte Carlo method. A video’s statistical information is acquired by specifying a logical condition for the Monte Carlo technique. The results of the algorithm depend on the correct choice of threshold values; thus the application area is limited by the adaptability of the thresholds to videos with large heterogeneity: e.g. videos with objects moving into and out of the scene, rapidly varying illumination, etc. Keywords: statistical image analysis, Monte Carlo method.
1
Description of the Method
For a static image of an analysed scene a constant value can be defined, related to the number of pixels fulfilling a specified logical condition. Such a condition can be defined e.g. as the membership of an image sample in a specified luminance range. In that case the algorithm works as an area estimator for the objects fulfilling the specified luminance criterion. As the result of analysing the whole image, a corresponding binary image is created which stores the value 1 for the samples that fulfil the condition and 0 for the others. This gives quantitative information related to the object's features described by the logical condition, which can be obtained by summing all the "ones". The estimator L^, given as

L^ = m ,                                                        (1)

where m stands for the total number of "ones" in the binary image, can be related to the area of a single object located in an empty scene, assuming the object's pixels fulfil the specified logical condition. After some additional morphological operations it is possible to easily estimate some other parameters such as the object's perimeter, diameter, moments etc.
Fig. 1. The example of the distortions caused by camera lens
Counting all the "ones" in a high resolution image may be time consuming, because all image samples have to be analysed. In order to increase the speed of the algorithm it is sufficient to reduce the number of analysed samples; in that case a statistical experiment using the Monte Carlo method is useful. The number of analysed points is equal to the number of draws from a pseudo-random generator with uniform distribution. The binary image samples can be stored in a one-dimensional vector, numbered from 1 to N, where N is the total number of samples in the scene. Then n independent draws (with replacement) are performed from the vector and the number k of "ones" drawn is stored. The estimated number of "ones" is equal to

L^_MMC = (k / n) * N ,                                          (2)

where k is the number of "ones" drawn, n is the number of draws and N is the total number of samples. The estimation error can be expressed as

eps_alpha = (u_alpha / sqrt(n)) * sqrt( (K/N) * (1 - K/N) ) ,   (3)

where u_alpha is the value denoting the two-sided critical range and K is the total number of "ones" in the entire image. The considerations presented above are valid for a generator with uniform distribution; preventing an increase of the error requires good statistical properties of the generator.
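A minimal sketch of the estimator from Eq. (2) is given below. The Mersenne Twister generator stands in for the "good" uniform pseudo-random generator required by the method; the function name and the flat 0/1 image representation are our assumptions.

#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Monte Carlo area estimator from Eq. (2): draw n sample positions (with
// replacement) from a binary image stored as a flat vector of 0/1 values and
// scale the hit count k by N / n.
double estimate_area_mc(const std::vector<std::uint8_t>& binary, std::size_t n,
                        std::mt19937& rng) {
    const std::size_t N = binary.size();
    std::uniform_int_distribution<std::size_t> pick(0, N - 1);

    std::size_t k = 0;                     // number of "ones" drawn
    for (std::size_t i = 0; i < n; ++i)
        k += binary[pick(rng)];

    return static_cast<double>(k) * static_cast<double>(N) / static_cast<double>(n);
}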
2
Influence of Geometrical Distortions
In order to simplify the considerations it is assumed that the goal of the algorithm is the estimation of the object's area.

Fig. 2. The comparison of the normalised results for locations 1-9 and the three camera lenses, for both the classical "full-analysis" procedure and the Monte Carlo method.

The sources of the geometrical distortions
are mainly related to the camera optics and may be observed mostly near the image corners, as shown in Fig. 1. The experiment has been performed using a digital camera with adjustable optical parameters (various lenses): the first test with the most visible distortions and the third one with almost invisible ones. The camera has been installed directly over a scene containing a single object of known dimensions. In this experiment the scene has been constantly lit by uniformly distributed light in order to eliminate the influence of lighting changes. The location of the object has been changed (see Fig. 1). The comparison of the results obtained using three different lenses for the classical "full-analysis" procedure (counting all the pixels belonging to the object) and the Monte Carlo method is illustrated in Fig. 2. All presented results are normalised assuming the exact object's area is equal to 1. They show that the influence of lens quality is relevant for both methods, and that using the Monte Carlo method with a good quality lens does not introduce significant errors. In systems with poor quality cameras some additional digital image correction algorithms may be needed to ensure high accuracy.
Fig. 3. Relative errors of area estimation for distortions level below 10% and over 10%
3
Influence of Lighting Conditions
In addition to the static errors dependent on the lens quality, in most practical applications some dynamic errors, caused by changes of local or global lighting conditions, may also appear. In order to analyse the influence of lighting it is assumed that the object is located in the centre of the scene and the best lens is used. The experiment has been performed for a small amount of distortions (below 10% of distorted pixels in the whole scene, compared to the uniformly lit scene) and for a significantly higher number of distorted pixels. All the distortions have been caused by light changes with a direct influence on the number of pixels fulfilling the same logical condition in each case. The threshold value of 10% distorted pixels has been chosen experimentally for the specified lens and the analysed static scene to better illustrate the observed effects. During the experiment 16 measurements have been performed under various lighting conditions (changes of the number of light sources, their locations and parameters). The results obtained for the two cases described above are presented in Fig. 3. Comparing the relative errors obtained for the two cases analysed in the paper, it is worth noticing that for a higher amount of distortions the Monte Carlo approach leads to results similar to the "full analysis". However, for the lower distortion level the advantage of the Monte Carlo method is much more visible.
Fig. 4. Idea of the edge detection using the Monte Carlo approach
4
Applications of the Method
The universality of the Monte Carlo method makes it possible to use it in many areas of digital image and video analysis. Many articles and books related to more or less advanced statistical techniques applied to video analysis have been published in recent years. A popular approach seems to be the usage of Sequential Monte Carlo [5] or Markov Chain Monte Carlo methods [9], e.g. for video text segmentation [1] as well as for some tracking purposes [6]. Nevertheless, such algorithms usually require relatively high computational power, so their application in some real-time systems is limited. Apart from that, a good example of reasonable requirements (a Pentium PC) is the real-time low bit-rate video segmentation approach presented in the paper [4], also designed for specific applications. The most crucial features of the proposed approach are its easy implementation and low computational complexity. All the applications where analysis of the whole image is not necessary and high accuracy of the results is not required are a potential field of its usage.

4.1
Estimation of Geometrical Features
A typical application of the Monte Carlo method is area estimation. However, the presented idea can also be applied to the estimation of some other geometrical features. The simplest one is perimeter estimation by edge detection. The analysed binary image should be divided into T x S elements of r x r pixels each using a square grid, with the Monte Carlo method used for the area estimation. All the blocks with an area equal to zero or equal to the size of the elementary block are not used in further analysis, because they do not correspond to the figure's edge, as shown in Fig. 4. In the next step the area of each object's fragment in the elementary square elements is calculated. The estimated values are stored in an array of T x S elements. Based on the binary image, an array K of the same size is created. The elements of that array have the following values: zero if the corresponding element's value in the binary image is equal to the size of the elementary block
(all pixels belong to the object) and none of its neighbouring blocks (using the 8-directional neighbourhood) has a zero value; zero if the corresponding element's value in the binary image is equal to zero (background); and one for the others (representing the edge). The array K shows the projection of the edges detected in the source image. The counted number of non-zero elements of the array K represents the estimated value of the object's perimeter expressed in squares of r x r pixels. In order to obtain a better estimate, in the final step the number of square elements should be increased (smaller values of the parameter r) and the first steps repeated. Additionally, using the array K obtained in the third step, the analysis of blocks of the binary image with zero values is not needed, so further analysis is performed only for a strongly limited number of indexed elements corresponding to the edge obtained in the third step. The limit accuracy of the algorithm is determined by the elementary block size of 1 pixel, which is equivalent to using convolution edge detection filters. Dividing the scene into smaller squares, it is also possible to easily estimate some motion parameters such as direction and velocity by applying the Monte Carlo procedure to each block. If the whole binary image is divided into T x S square blocks containing r x r pixels each, there is also the possibility to estimate some additional geometrical parameters which may be treated as local (e.g. mean diameter or average area) or global ones (e.g. the number of objects inside a given area). For a single object on the image plane the most interesting parameters are those which are insensitive to image deformations introduced during acquisition and to typical geometrical transformations such as scaling, translation and rotation. In this sense the usefulness of the simplest parameters, such as area or perimeter, is strongly limited, but many other factors, such as moments, can often be determined on the basis of the simplest ones. Some typical geometrical parameters used in image analysis are the horizontal and vertical projection lengths (easily extended by the analysis of the presence of concavities in the object's shape) and Feret's diameters as the measure of the object's maximum horizontal and vertical size [3]. An interesting group of parameters are the shape coefficients, because of the possibility of fast estimation and the wide opportunities of their usage for classification and recognition purposes. Most of them can be easily calculated on the basis of the area, the perimeter or Feret's diameters. Another group of parameters used during binary image analysis is represented by the linear moments, e.g. the first order ones related to the object's centre of gravity and the second order moments used as the object's inertia measures. The binary array used in the Monte Carlo approach can be treated as the equivalent of a reduced resolution image, so geometrical parameters can be expressed in blocks of r x r pixels instead of individual pixels.
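To make the block-grid procedure described at the beginning of this subsection concrete, the following sketch marks edge blocks and counts them as a perimeter estimate expressed in r x r blocks. Block occupancy is computed exactly here for clarity; in the method above the Monte Carlo estimator of Eq. (2) would be applied per block instead, and the function name and image layout are our assumptions.

#include <cstddef>
#include <cstdint>
#include <vector>

// Blocks that are partially filled, or fully filled but touching an empty
// 8-neighbour, are marked as edge blocks (array K = 1); counting them gives
// the perimeter expressed in r x r blocks.
std::size_t estimate_perimeter_blocks(const std::vector<std::uint8_t>& img,
                                      std::size_t width, std::size_t height,
                                      std::size_t r) {
    const std::size_t T = width / r, S = height / r;         // grid dimensions
    std::vector<std::size_t> fill(T * S, 0);                  // pixels set per block
    for (std::size_t y = 0; y < S * r; ++y)
        for (std::size_t x = 0; x < T * r; ++x)
            fill[(y / r) * T + x / r] += img[y * width + x];

    const std::size_t full = r * r;
    std::size_t edge_blocks = 0;
    for (std::size_t by = 0; by < S; ++by)
        for (std::size_t bx = 0; bx < T; ++bx) {
            const std::size_t f = fill[by * T + bx];
            if (f == 0) continue;                              // background block
            bool boundary = (f < full);                        // partially filled block
            if (!boundary)                                     // full block: check 8-neighbours
                for (int dy = -1; dy <= 1 && !boundary; ++dy)
                    for (int dx = -1; dx <= 1 && !boundary; ++dx) {
                        if (dx == 0 && dy == 0) continue;
                        const std::ptrdiff_t nx = static_cast<std::ptrdiff_t>(bx) + dx;
                        const std::ptrdiff_t ny = static_cast<std::ptrdiff_t>(by) + dy;
                        if (nx < 0 || ny < 0 || nx >= static_cast<std::ptrdiff_t>(T) ||
                            ny >= static_cast<std::ptrdiff_t>(S))
                            continue;
                        if (fill[static_cast<std::size_t>(ny) * T +
                                 static_cast<std::size_t>(nx)] == 0)
                            boundary = true;
                    }
            if (boundary) ++edge_blocks;                       // element of array K set to 1
        }
    return edge_blocks;   // perimeter in units of r x r blocks
}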
The processing time is then significantly shortened due to the reduction of the number of analysed pixels, and the performed experiments showed a good accuracy of their computation (sufficient for some typical image recognition applications). The performed tests have shown that the proposed approach can be valuable for the estimation of shape coefficients which depend on the area and the perimeter; in that case the estimation errors depend on the accuracy of the area and perimeter estimation. It is worth noticing that the approximation of moments and of geometrical parameters based on the perimeter requires the usage of the Monte Carlo method with division of the image into blocks, while for the geometrical parameters based only on the area or the projection lengths the standard Monte Carlo method is sufficient. In order to determine the projection lengths, additional storing of the minimum and maximum coordinates of the analysed pixels representing the object is necessary. The estimation of the moments requires a slightly more sophisticated algorithm, because an accuracy limited to the size of the block (r x r pixels) is not sufficient for most applications. The simplest solution is similar to the technique used for the perimeter estimation: blocks corresponding to the object's contour can be divided into smaller elements and then adjustment vectors for the moments are calculated. Table 1 illustrates the results of the statistical experiment performed for the estimation of the area and the perimeter of a single object located in the scene. The perimeter has been estimated using two approaches: counting all 32x32 pixel blocks representing the contour (Perimeter1) and using the square root of the estimated area of each block (Perimeter2). The first method leads to overestimation and the second one produces too small values. The correct values of the area and the perimeter obtained by the analysis of all pixels are 30286 and 552 pixels respectively.

Table 1. Area and perimeter values (in pixels) obtained in the experiments for various numbers of analysed points in the image and in the block, respectively

Method: full-image Monte Carlo
Points (N)    100     200     500     1000    5000
Area          31518   26127   29030   28698   29528
Perimeter1    -       -       -       -       -
Perimeter2    -       -       -       -       -

Method: Monte Carlo on 32x32 pixel blocks
Points        10      20      50      100     500
Area          32957   31938   30945   30530   30276
Perimeter1    640     672     736     768     768
Perimeter2    437     417     440     436     434

4.2
Fast Image Quality Estimation
Conventional objective full-reference image quality metrics, mainly based on the Mean Square Error [2], are poorly correlated with the Human Visual System, so some other proposals have been presented in recent years. One of the most popular seems to be the Universal Image Quality Index proposed by Wang and Bovik [7]. Such a measure models image distortions as the combination of three elements: loss of correlation, luminance distortion and loss of contrast. Assuming xi,j and yi,j are the values of the luminance for the pixel (i, j) of the original
and distorted image respectively, it is defined as the local index for a single image block (N x N pixels, usually N = 8) as

Q = (sigma_xy / (sigma_x * sigma_y)) * (2 * x_mean * y_mean / (x_mean^2 + y_mean^2)) * (2 * sigma_x * sigma_y / (sigma_x^2 + sigma_y^2))
  = 4 * sigma_xy * x_mean * y_mean / [ (sigma_x^2 + sigma_y^2) * (x_mean^2 + y_mean^2) ],      (4)

where x_mean and y_mean are the mean values and sigma stands for the standard deviation in the original and distorted image blocks respectively [7]. The overall quality index is defined as the mean value of metric (4) obtained for all blocks using a sliding window approach. In the paper [8] definition (4) has been extended into the Structural Similarity (SSIM) index, introducing the possibility of choosing an importance exponent for each of the three factors in Eq. 4, with an additional stability enhancement for the regions where x_mean or sigma_x^2 are close to zero. The modified expression, based on the usage of two coefficients preventing such instability, can be expressed as

SSIM = [ (2 * x_mean * y_mean + C1) * (2 * sigma_xy + C2) ] / [ (x_mean^2 + y_mean^2 + C1) * (sigma_x^2 + sigma_y^2 + C2) ],      (5)
where C1 and C2 are small values chosen experimentally, as suggested by the authors of the paper [8]. However, analysis of all the image pixels is time consuming, and in many applications exact image quality assessment is not the most crucial element, because the image quality estimation should be performed quickly and not necessarily very accurately. Besides, all image quality metrics should actually be treated as estimators, because there is no ideal objective image quality measure. For applications where the image quality estimation should be fast enough to avoid introducing additional delays, the calculation of the SSIM index using the Monte Carlo approach is proposed. The estimation of the local SSIM index should be performed only for some randomly chosen pixels inside the current sliding window. Assuming a good quality of the pseudo-random generator, the expected number of drawn pixels inside the window should be almost the same for each window position. The advantage of this approach is an equal chance to analyse each pixel of the image, so there is no need to use any sophisticated method for decreasing the resolution in order to preserve some patterns. The results of the image quality estimation using the proposed approach for test images after low-pass and median filtering, JPEG compression and contamination by an achromatic impulse noise are shown in Tables 2 - 4. The usage of a limited number of randomly distributed pixels during the calculation of the Structural Similarity index may lead to a good quality estimate, assuming a pseudo-random generator with uniform distribution, even for a low number of samples for some typical distortions. The results obtained for the image 'Baboon' (Table 3) differ from the other images because of the specific character of this image, with many details. It leads to a lower quality index for the lossy JPEG compression (many pixels differ from their originals) and a better one for the images contaminated by an impulse ('salt and pepper') noise.
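A sketch of the sampled local SSIM computation for a single window position is given below; the overall index would average this value over all sliding-window positions. The constants C1 and C2 are placeholders (the paper chooses them experimentally following [8]), and the function signature is our assumption.

#include <cstddef>
#include <random>
#include <vector>

// Sampled local SSIM index from Eq. (5): statistics of an N x N window with
// top-left corner (x0, y0) are estimated from 'samples' randomly drawn pixel
// positions instead of all N*N pixels.
double sampled_ssim_window(const std::vector<double>& x, const std::vector<double>& y,
                           std::size_t width, std::size_t x0, std::size_t y0,
                           std::size_t N, std::size_t samples, std::mt19937& rng,
                           double C1 = 1e-4, double C2 = 1e-4) {
    std::uniform_int_distribution<std::size_t> off(0, N - 1);
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t s = 0; s < samples; ++s) {
        const std::size_t idx = (y0 + off(rng)) * width + (x0 + off(rng));
        const double a = x[idx], b = y[idx];
        sx += a; sy += b; sxx += a * a; syy += b * b; sxy += a * b;
    }
    const double n = static_cast<double>(samples);
    const double mx = sx / n, my = sy / n;
    const double vx = sxx / n - mx * mx;          // sample variance of the original block
    const double vy = syy / n - my * my;          // sample variance of the distorted block
    const double cxy = sxy / n - mx * my;         // sample covariance
    return ((2 * mx * my + C1) * (2 * cxy + C2)) /
           ((mx * mx + my * my + C1) * (vx + vy + C2));
}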
Table 2. The SSIM index obtained for various distortions using the Monte Carlo approach with various numbers of points - the image 'Kodim'

Number     low-pass  low-pass  median  median  5%      20%     JPEG    JPEG
of points  3x3       5x5       3x3     5x5     noise   noise   60%     10%
50         0.9280    0.8581    0.9418  0.8607  0.0880  0.0613  0.9448  0.8495
100        0.9355    0.8780    0.9471  0.8997  0.1144  0.0799  0.9472  0.8448
200        0.9366    0.8811    0.9437  0.8982  0.1094  0.0682  0.9428  0.8448
500        0.9405    0.8803    0.9508  0.9013  0.1183  0.0723  0.9450  0.8475
1000       0.9384    0.8718    0.9489  0.8945  0.1217  0.0768  0.9472  0.8502
5000       0.9387    0.8737    0.9498  0.8956  0.1220  0.0759  0.9454  0.8447
10000      0.9392    0.8762    0.9492  0.8961  0.1847  0.0746  0.9457  0.8485
all        0.9393    0.8768    0.9493  0.8974  0.2888  0.0741  0.9455  0.8468
Table 3. The SSIM index obtained for various distortions using the Monte Carlo approach with various numbers of points - the image 'Baboon'

Number     low-pass  low-pass  median  median  5%      20%     JPEG    JPEG
of points  3x3       5x5       3x3     5x5     noise   noise   60%     10%
50         0.6919    0.4456    0.7333  0.4613  0.5661  0.2323  0.8987  0.7059
100        0.6917    0.4790    0.7123  0.5160  0.5043  0.1826  0.9003  0.6790
200        0.6794    0.4540    0.7039  0.4633  0.5281  0.2342  0.8986  0.6838
500        0.6906    0.4593    0.7178  0.4846  0.5674  0.2363  0.8992  0.6880
1000       0.6877    0.4629    0.7168  0.4841  0.5409  0.2254  0.8954  0.6802
5000       0.6860    0.4591    0.7162  0.4796  0.5548  0.2270  0.8992  0.6790
10000      0.6846    0.4591    0.7125  0.4798  0.5532  0.2318  0.8977  0.6778
all        0.6861    0.4608    0.7147  0.4815  0.5518  0.2295  0.8986  0.6800
Table 4. The SSIM index obtained for various distortions using the Monte Carlo approach with various numbers of points - the image 'Lena'

Number     low-pass  low-pass  median  median  5%      20%     JPEG    JPEG
of points  3x3       5x5       3x3     5x5     noise   noise   60%     10%
50         0.9212    0.8533    0.9324  0.8807  0.2892  0.1098  0.9441  0.8438
100        0.8967    0.8123    0.9072  0.8398  0.3052  0.0948  0.9175  0.7926
200        0.9043    0.8130    0.9175  0.8456  0.3393  0.1115  0.9258  0.8121
500        0.9087    0.8305    0.9202  0.8565  0.3411  0.0920  0.9234  0.8136
1000       0.9091    0.8309    0.9210  0.8585  0.3469  0.0953  0.9250  0.8173
5000       0.9105    0.8340    0.9215  0.8612  0.3432  0.0935  0.9256  0.8159
10000      0.9097    0.8355    0.9203  0.8613  0.3391  0.0930  0.9243  0.8161
all        0.9103    0.8347    0.9212  0.8612  0.3397  0.0945  0.9251  0.8164
Analysing the presented results for many typical distortions, the relative errors of the SSIM values estimated using a strongly limited number of pixels are about 1-2%, so the proposed fast Monte Carlo SSIM estimation can be treated as an interesting alternative for applications in lower performance systems with a smaller amount of memory.
5
Conclusions
Considering the fact that the Monte Carlo method is much faster than full image analysis, it seems to be a good alternative to the classical methods in a wide area of applications. The presented examples are not comprehensive, but it is worth noticing that the presented approach can be very useful especially in embedded systems with low computational power and a limited amount of memory.
References

1. Chen, D., Odobez, J.-M.: Sequential Monte Carlo Video Text Segmentation. In: International Conference on Image Processing ICIP 2003, vol. 3, pp. 21–24. IEEE Press, New York (2003)
2. Eskicioglu, A., Fisher, P., Chen, S.: Image Quality Measures and Their Performance. IEEE Trans. Comm. 43(12), 2959–2965 (1995)
3. Kindratenko, V.: Development and Application of Image Analysis Techniques for Identification and Classification of Microscopic Particles. PhD thesis, Antwerp University (1997)
4. Luo, H., Eleftheriadis, A., Kouloheris, J.: Statistical Model-Based Video Segmentation and its Application to Very Low Bit-Rate Video Coding. Signal Processing: Image Communication 16(3), 333–352 (2000)
5. Quan, G., Chelappa, R.: Structure from Motion Using Sequential Monte Carlo Methods. Int. Journal of Computer Vision 59(1), 5–31 (2004)
6. Vermaak, J., Ikoma, N., Godsill, S.J.: Sequential Monte Carlo Framework for Extended Object Tracking. IEE Proc. Radar Sonar Navig. 152(5), 353–363 (2005)
7. Wang, Z., Bovik, A.: A Universal Image Quality Index. IEEE Signal Process. Letters 9(3), 81–84 (2002)
8. Wang, Z., Bovik, A., Sheikh, H., Simoncelli, E.: Image Quality Assessment: From Error Measurement to Structural Similarity. IEEE Trans. Image Process. 13(4), 600–612 (2004)
9. Zhai, Y., Shah, M.: Video Scene Segmentation Using Markov Chain Monte Carlo. IEEE Trans. on Multimedia 8(4), 686–697 (2006)
Interactive Learning of Data Structures and Algorithmic Schemes Clara Segura, Isabel Pita, Rafael del Vado V´ırseda, Ana Isabel Saiz, and Pablo Soler Departamento de Sistemas Inform´ aticos y Computaci´ on Universidad Complutense de Madrid, Spain {csegura,ipandreu,rdelvado}@sip.ucm.es, {anussita,ileras}@gmail.com
Abstract. We present an interactive environment called Vedya for the visualization of data structures and algorithmic schemes which can be used as a very useful educational tool in Computer Science. The integration of Vedya and the Virtual Campus of the Complutense University of Madrid has allowed us to manage the whole administration of the individual students’ homework, including generating exercises, tests, grading delivered homework, and storing the achieved results. The part of the system concerning data structures has been evaluated during the last academic course 2006/07. By means of the Vedya tool, the students benefited from complementary and interactive material, facilitating the intuitive comprehension of most typical operations of classical data structures without any restriction of time or material.
1
Introduction
We present an interactive environment tool called Vedya for the visualization of data structures and algorithmic schemes. The pedagogical aim of Vedya is to facilitate the student's grasp of the core procedures taught in Computer Science by means of interactive learning, and to foster teamwork and communication between teachers and students. For this purpose, we have integrated Vedya in a motivating environment such as the Virtual Campus of the Complutense University of Madrid (https://www.ucm.es/info/uatd/CVUCM/index.php), facilitating the accessibility, understanding and visualization of the main data structures and algorithmic schemes. The combination of the Vedya tool and the Virtual Campus has allowed us to control the whole administration of the individual students' homework, including generating exercises, tests, grading delivered homework, and storing the achieved results. By means of the Vedya tool, the students have benefited from complementary and interactive material, facilitating the intuitive comprehension of the most typical operations of classical data structures without restrictions of time or material.
The authors have been partially supported by the Spanish National Projects MERITFORMS (TIN2005-09027-C03-03) and PROMESAS-CAM (S-0505/TIC/0407).
Moreover, the continuous utilization of the tool during the theoretical classes of the second four-month period has allowed us to reach one of the most useful educational aims in Computer Science in order to settle some of the academic deficiencies of this kind of subjects: to support the continuous, personal and interactive work across a virtual classroom. During the academic course 2006/07, Vedya has been freely accessible across the Virtual Campus. In this time, and within the framework of an Educational Innovation Project in Computer Science, we have evaluated the part of the tool dedicated to the study of the main data structures that provides interactive learning support to guide our students in their comprehension of a modern (imperative or declarative) programming language. At this moment, the tool has been widely used to illustrate in a graphical, visual and intuitive way the following well-known data structures [1]: linear data structures (stacks and queues), tree-like data structures (binary search trees, AVL trees, and heaps) and functional data structures (ordered and hash tables). Additionally, the flash animations incorporated in the tool have been used to illustrate other data structures like red-black trees, 2-3-4 trees and graphs; and to show how data structures are used to solve problems. Thanks to this effort, our students have assimilated, inside a motivating framework, one of the fundamental concepts of the subject, the difference between the formal description of the behavior of the data structures provided by the algebraic specification [2,3], and their implementation using a concrete programming language. The outline of the paper is the following. In Section 2 we describe Vedya from the point of view of the tool’s user. In Section 3 the implementation is explained. Section 4 provides the results obtained from the application of the tool in the last academic course. Finally, in Section 5 we conclude and outline future work.
2
The Vedya Tool: Description
Vedya is an interactive environment for learning data structures and algorithms. It covers the most common data structures: stacks, queues, binary search trees, AVL trees, priority queues, and sorted and hash tables. Moreover, it also provides other types of data structures, like one used to implement a doctor's office. Concerning the algorithmic schemes, it covers the most common resolution methods [4]: greedy, divide and conquer, dynamic programming, backtracking, and branch and bound. Lots of work has already been done on data structure and algorithm visualization. However, usually, tools are not complete, lack a common Graphical User Interface, or can only be executed on some operating systems. In [5] Chen and Sobh present a tool for data structure visualization and user-defined algorithm animation. The data structures available are arrays, stacks, queues, binary search trees, heaps and graphs. The most relevant improvement of this tool is the possibility of executing user-defined algorithms and visualizing the state of the data structures used by the algorithms.
Fig. 1. Main window for the data structures part of the Vedya tool
Nevertheless, Vedya is something more than a data structure execution tool. The main features that differentiate it from other interactive tools are:

(1) First of all, Vedya was created to supply the students with a tool that facilitates the study of both data structures and algorithms. Therefore, all the data structures and algorithmic methods taught in our courses are integrated in the same environment. The environment provides many facilities to make executions more user-friendly: Vedya can be executed using Sun's j2re 1.4.2.xx. It allows the execution of several data structures/algorithms and several sequences of operations on the same structure at the same time, making use of a multi-window and multi-frame system. It also allows saving sequences of operations in order to continue the execution later on. Operations in a sequence can be deleted and added.

(2) Vedya offers several learning possibilities. The main one is the interactive execution of data structures and algorithms, but it is also possible to create simulations that execute automatically, visualize tutorials, and solve tests within the same environment. It also integrates a set of animations that show how data structures are used to solve some problems.

(3) Concerning the data structures part, Vedya offers different views for the data structure behavior and the data structure implementations. Moreover, for most of the data structures both a static and a dynamic implementation are shown. Operations may be executed in any view, and the user can move from one view to another to see the changes at any moment.

(4) The following algorithms are implemented:
• divide and conquer: binary search and quicksort,
• dynamic programming: knapsack problem,
• greedy method: non-fractional knapsack problem, Dijkstra's algorithm,
• branch and bound: knapsack problem.
The students can visualize how different algorithmic schemes are applied to solve the same problem and under which conditions they can be used.

(5) The environment integrates documentation related to the algebraic specification, the implementation code and the operation cost of each data structure/algorithm.

Currently there exist two versions of the tool. The old version contains all the data structures and algorithmic schemes mentioned above, while the new one offers a subset of them in a more attractive visual environment. Figures in this paper correspond to the new version, which can be found at http://www.fdi.ucm.es/profesor/csegura/.
Tool Usage
When the application is started the user selects one area, data structures or algorithmic schemes, and chooses a particular one. Then the main window for the selected data structure or algorithmic scheme is opened. In all cases this window looks similar. The central panel is used to represent the data structure; on the left there is a list of the actions that can be executed. The allowed actions are highlighted while the non-allowed ones are disabled. The panel on the right shows the actions that have already been executed on the data structure. In Fig. 1 we show an example of the main window for binary search trees. The selected key type is string, but the system also allows char and string types. On this tree, the user may continue executing actions using the left hand operations panel, or use the simulation facilities of the menu to go up in the sequence of actions to see previous states, or restart the sequence from the beginning. The central panel offers some tree drawing facilities that allow the user to expand and contract the tree, as well as to move over the screen to see the hidden parts. Notice also that just above the central panel the result of the last action is shown. At the top of the screen there is a menu that facilitates managing the system. We can create a new data structure, open an existing one, or save the state of the one being edited. We can change the view, from behavior, which is the default, to implementation, either static or dynamic.

Fig. 2. Behavior and implementation views of queues

In Fig. 2 we can first see the behavior view of a queue. When we insert an item, a truck throws it on the top of the maze. Then the item falls down until it finds the end of the maze or a previous item. When we extract an item, the end of the maze opens and the first item falls down. The use of the maze illustrates that items cannot jump over the previous ones and the fact that in a queue items are extracted in the order in which they come in. Below the behavior view we have the representation of the static implementation based on a circular array and a dynamic implementation based on pointers. In the static implementation, the array is represented by a circle where items are inserted and extracted counterclockwise. This representation stresses the fact that there is no final element of the array and also that we
can always insert an item unless the array is completely full. In the dynamic implementation we use short arrows to represent the pointers between the items. Two long arrows point to the first item (written as primero in the figure) and to the last one (written as ultimo). The sequence of modifications is shown step by step so that students notice the order of the pointer updates needed in order not to lose any pointer. Using the menu we can also execute the operations on the data structure, use the simulation facilities and change the execution speed. In the documentation part the user can consult documentation about the data structure, such as the algebraic specification, the implementation code and the cost of each implementation, and finally switch to the associated Vedya-test tool, where the user can answer proposed tests about the selected data structure. The main window for the execution of algorithmic schemes looks similar. There are panels for drawing the execution of the algorithm, for introducing the input data and for showing the actions being executed. The simulation facilities are also available.

2.2
Data Types Animations
Vedya is complemented with a set of tutorials on data types and a set of algorithm animations showing the usage of a particular data type to solve a given problem. This set of tutorials and algorithm animations is developed in Flash, and can be accessed independently from Vedya's initial menu. We have tutorials for stacks, queues, binary search trees, red-black trees and priority queues. We have a tutorial about the heap-sort algorithm, an animation of the insertion in a 2-3-4 tree, and examples of:
– The use of stacks to evaluate an expression in postfix form, or to transform an infix expression to a postfix one.
– The use of queues to obtain the breadth-first tree traversal.
– The use of stacks, queues and double queues to check palindromes.
Finally, we have some animations on graphs: to obtain the minimum spanning tree using the Prim and Kruskal algorithms and to compute minimum paths using the Dijkstra algorithm [6].

2.3
The Vedya-test Tool
As we have said Vedya also offers facilities to solve tests about the data structures and algorithms that are being studied. The Vedya-test tool can be invoked from the Vedya tool, or it can be executed independently. The tool offers facilities to teachers that allow them to create/modify/delete questions in a database, and to create tests from the database of questions. The student visualizes the tests, solves them and obtains the solutions. Questions are grouped by subject on the database, but it is possible to mix questions about different data structures in the same test.
Fig. 3. Tool design
3
The Vedya Tool: Implementation
One of the main objectives when Vedya was implemented was that its extension with new data structures or algorithmic schemes should be as easy as possible. The main feature of the implementation is that both the data structures and the algorithms being represented are actually implemented in the tool, separately from their graphical representations. The design, shown in Fig. 3, is modularly divided into four blocks: interface, graphics, threads and implementations of the data structures and algorithms. This means that whenever we desire to change a graphical representation we do not need to change the data structure or the algorithm itself. The windows and frames act as a communication channel between the implementation and the graphics. The threads only communicate with the graphics. Inheritance is exploited in order to reuse as much code as possible. The different blocks communicate by means of "actions":
• When the user executes an operation in the interface over a data structure or executes an algorithm, the frame sends it to the implementation of the data structure/algorithm to be executed.
• As a result of such an execution, a vector of atomic actions to be graphically represented in an animated way is returned to the frame. For example, when inserting an element in a binary search tree, information about the path followed by the element being inserted is returned.
• The frame sends the vector of actions to the graphical module, which paints the necessary elements and sends to the threads block each action to be animated, depending on the kind of selected visualization (user, static implementation or dynamic implementation).
Fig. 4. Stack tracks design
Java threads are used in order to animate operations on data structures. Each operation divides into actions and the animation sequence of each action is divided into tracks where different kinds of movements are applied: circular, radial, horizontal, vertical or point to point. In Fig. 4 we show the circular tracks for the stack user representation, which consists of a dead-end pipe.
4 Interactive Learning
In order to obtain a detailed evaluation of the usage of Vedya, we have proposed several tests related to the behavior, implementation and application of the main data structures offered by the tool. We have also collected students' opinions on using Vedya in the Data Structures course in the second year and in the Programming Methodology and Technology course in the third year. The vast majority of our engineering and computer science students have taken an introductory programming course in the first academic year, typically in Pascal. Although learning the main algorithmic schemes and programming techniques is not a prerequisite for Data Structures, many students take Data Structures either prior to, or concurrently with, Programming Methodology and Technology. As a result, although a pseudocode programming language is the assumed language for Data Structures, many students have sufficient knowledge of C++ or Java through the integrated programming laboratories of parallel courses. Taking into account this profile, the skills and the background of our students, we have proposed 8 tests in the Virtual Campus of the Complutense University of Madrid. The number of engineering students registered in the Virtual Campus was just over 320, distributed in three groups (130 in group A, 59 in group B, and 131 in group C). Table 1 shows the number of students who answered each of the tests in the corresponding group.
Table 1. Students answering the tests

                Stacks 1  Stacks 2  Queues  Sequences  BST  AVL  RB  Heaps
Group A (130)      61        50       45       32       37   34   41    38
Group B (59)       26        23       23       19       18   17   17    18
Group C (131)      59        44       37       24       36   45   32    28
Total             147       118      105       75       91   96   90    76
We observe that, from the second test on, the number of students stabilizes at a level slightly below the number of students who regularly access the Virtual Campus. These numbers, though seemingly high, represent only between 23% (75 students of 320) and 37% (118 of 320) of registered students, which shows the high rate of students who give up on this subject from the beginning.
Table 2. Percentage of correct answers
           Stacks 1  Stacks 2  Queues  Sequences  BST    AVL    RB     Heaps
Group A     76.4%     82.5%    77.8%    65.6%     82.2%  84.9%   –     86.3%
Group B     78.9%     83.6%    85.0%    63.6%     86.2%  87.7%  90.9%  90.2%
Group C     76.2%     79.8%    73.5%    69.0%     83.5%   –     68.9%  86.8%
Table 2 shows the percentage of correct answers in the three groups. In general, it is high, which demonstrates the interest of the students who took part. In group B the percentage is slightly higher than in groups A and C, since 85% of the students of group B who decided to complete the tests in the Virtual Campus are not "new" students of this subject.
Table 3. Comparison of academic results with previous courses
               2002/03  2003/04  2004/05  2005/06  2006/07
Not attended    57.6%    45.3%    42.3%    64.7%    50.8%
Passed          15.3%    22.2%    20.2%    18.2%    30.1%
Failed          27.1%    32.5%    37.5%    17.1%    18.9%
Table 3 shows the percentage of students who did not attend the final exam, who passed, and who failed during the last five years. We observe that in the last course, in which we applied the Vedya tool, we reduced the percentage of students giving up the course by 14 percentage points with respect to the previous course, and at the same time increased the percentage of students who passed the exam by 12 points. The percentage of students who failed the exam increased by 2 points due to the rise in students attending the exam. Compared with earlier courses, the percentage of students who passed has increased by between 8 points (with respect to the course 2003/04) and 15 points (with respect to the course 2002/03).
5 Conclusions and Future Work
In this paper we have presented Vedya, a novel interactive tool for the visualization of data structures and algorithmic schemes, which can be used as an educational aid to help first-year engineering and computer science students learn Data Structures and Algorithms. The main benefit of this kind of software is that it facilitates the students' grasp of the target concepts and eases teamwork and communication between teachers and students. In this sense, the integration of the Vedya tool in the virtual classroom has allowed us to motivate the participation of the students, one of the most important goals from the educational viewpoint. Furthermore, the personalization and automation of the learning process has countered the lack of motivation for abstract subjects in Computer Science, because students find them 'useful'. The fact that the interface language is Spanish is, for our students, one additional reason to find the environment friendlier. Finally, a tool frequently turns out to be useful not only for the original purpose it was created for, but also for other subsidiary (though no less important) uses. In this sense, we believe that the development of this tool can help educators to use Vedya directly or to create other similar tools. As future work, we plan to develop alternative ways to integrate the Vedya tool in a Virtual Campus based on WebCT. We are interested in the application of the tool and the interactive learning methodology presented in this paper within different models of virtualization and e-learning in Computational Science: the development of digital repositories of Learning Objects about data structures and algorithmic schemes in the Vedya tool using IMS DRI and Moodle, the integration of the Vedya tool in an Intelligent Tutorial System, or the application of typical tools of the Web 2.0 philosophy.
References
1. Weiss, M.: Data Structures and Problem Solving Using Java. Addison-Wesley, Reading (1998)
2. Martí, N., Ortega, Y., Verdejo, A.: Estructuras de datos y métodos algorítmicos: ejercicios resueltos. Prentice-Hall, Englewood Cliffs (2003)
3. Peña Marí, R.: Diseño de programas. Formalismo y abstracción, 3rd edn. Prentice-Hall, Englewood Cliffs (2005)
4. Cormen, T.H., Leiserson, C.E., Rivest, R.L.: Introduction to Algorithms. The MIT Press/McGraw-Hill (2001)
5. Chen, T., Sobh, T.: A tool for data structure visualization and user-defined algorithm animation. In: Frontiers in Education Conference (2001)
6. Pita, I., Segura, C.: A tool for interactive learning of data structures and algorithms. In: 8th International Symposium on Computers in Education, SIIE 2006, vol. 1, pp. 141–148 (2006)
Prediction and Analysis of Weaning Results of Ventilator-Dependent Patients with an Artificial Neuromolecular System Jong-Chen Chen1, Shou-Wei Chien1, and Jinchyr Hsu2 1
Department of Information Management, National Yunlin University of Science and Technology, Touliu, Taiwan 2 Department of Internal Medicine, TaiChung Hospital, TaiChung, Taiwan [email protected], {jans.alex,jinchyr.hsu}@msa.hinet.net
Abstract. We have developed a vertical information processing model, motivated by physiological evidence, which integrates intra- and inter-neuronal information processing. Information processing at the intraneuronal level creates a repertoire of pattern processing neurons. Information processing at the interneuronal level groups appropriate pattern processing neurons to constitute an effective pattern processing system. The system was applied to a database of the weaning results of ventilator-dependent patients. A ventilator is used to support a patient's breathing, and weaning is the gradual process of removing it from ventilator-dependent patients. Experiments with the model show that the integrated system is able to learn to differentiate data in an autonomous manner, separating those patients who have successful weaning results from those who do not. Our parameter analysis shows that most of the parameters identified as significant by the system are the same as those identified by physicians, but some are not. Keywords: Weaning, Evolutionary Learning, Artificial Neural Networks.
1 Introduction
People are in great danger when they have difficulty in breathing, especially if they have chronic pulmonary disease or neuromuscular problems. Ventilators have been commonly used to support their breathing needs. Statistics show that most people in an intensive care unit (ICU) are in critical condition and around 40% of them need a ventilator to support their breathing [1], [18]. When patients are no longer in critical condition, they are transferred to a respiratory care center (RCC) for further care. Removing the ventilator from patients too early could put them in a dangerous situation. However, for those patients who can breathe naturally, placing them on a ventilator may be of little help. When to remove a ventilator from a patient is sometimes a tough question. Statistics show that the successful weaning rate is between 35% and 60% [16], [17]. The decision is usually based on the subjective assessment of physicians. The parameters that the physician takes into account are physiological factors, including maximal inspiratory pressure (Pimax) [10], [20], vital capacity
(VC) [20], rapid shallow breathing index (RSBI) [23], minute ventilation (VE) [10], [21], pH and PCO2 [2], [16], APACHE II score, and blood urea nitrogen (BUN). The above studies concern which parameters to consider when the physician tries to wean patients off the ventilator in the ICU. However, there are limited studies regarding weaning in the RCC, where one of the major tasks is to help patients progress from mechanical ventilation to spontaneous breathing as early as possible. At present, there are 27 parameters used to determine whether weaning is successful or not for ventilator-dependent patients in the RCC. These parameters are divided into four categories. The first category includes gender and age. The second category comprises physiological parameters: APACHE II score and coma scale at admission time, blood urea nitrogen (BUN), creatinine (Cr), albumin (Alb), hemoglobin (Hb), the ratio of respiratory frequency to tidal volume (RSBI), and coma scale before removing the ventilator. The third category of parameters is related to the diseases carried by patients. These include chronic obstructive pulmonary disease, cardiovascular disease, cerebral vascular disease, other internal factors, septic syndrome with multiple organ failure, respiratory tract disease, trauma, acute respiratory distress syndrome, brain surgery, and other surgeries. The fourth category of parameters concerns the treatment and complications occurring during the patient's admission time. These include the length of time staying in the ICU, the length of time using the ventilator, tracheostomy, respiratory tract infection, blood stream infection, urinary tract infection, and other infection. Successfully weaning patients off the ventilator requires careful assessment. It may be helpful to develop an intelligent system to assist physicians in making such a decision. In this study, we apply an artificial neuromolecular system (ANM), a biologically motivated model that integrates intra- and inter-neuronal information processing, to differentiate a database of ventilator-dependent patients. The system rests on two hypotheses. The first is that some neurons in the brain have significant intraneuronal information processing, which might directly or indirectly relate to their firing behavior [14], [13], [15]. Neurons of this type will be called cytoskeletal neurons or enzymatic neurons [4], [6]. They combine, or integrate, input signals in space and time to yield temporally patterned output signals to control other neurons. The second hypothesis is that some neurons in the brain serve as pointers to other neurons in a way that allows for memory manipulation. Neurons of this type are called reference neurons [6-7]. Reference neurons are used to select appropriate subsets of cytoskeletal neurons, which then control the manner in which input patterns are transduced to output patterns.
2 The Architecture
The ANM system as currently implemented comprises eight competing subnets, each consisting of 30 cytoskeletal neurons. Cytoskeletal neurons are manipulated by two levels of reference neurons. Low-level reference neurons select comparable cytoskeletal neurons in each subnet (i.e., neurons that have similar cytoskeletal structures). High-level reference neurons select different combinations of the low-level reference neurons. Fig. 1 provides a simplified picture (only two of the competing subnets are shown, each consisting of only four cytoskeletal neurons). In Fig. 1, the intraneuronal
Fig. 1. Connections between reference and cytoskeletal neuron layers. When Ra fires, it will fire r1 and r4. Similarly, the firing of Rb will cause r3 and r4 to fire, which in turn fires E3 and E4 in each subnet. (Ei stands for cytoskeletal neuron i.).
structures of E1, E2, E3, and E4 in subnet 1 are similar to those of E1, E2, E3, and E4 in subnet 2, respectively.
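To make the two-level selection concrete, the following minimal Python sketch (our own illustration, not code from the ANM system) treats reference neurons as pointers that select the same subset of cytoskeletal neurons in every subnet, as in Fig. 1:

```python
# Hypothetical sketch of the reference-neuron hierarchy of Fig. 1.
# Low-level reference neurons select "comparable" cytoskeletal neurons (same index
# in every subnet); high-level reference neurons fire combinations of low-level ones.
NUM_SUBNETS = 2
low_level = {"r1": 0, "r2": 1, "r3": 2, "r4": 3}   # r_i -> index of cytoskeletal neuron E_i
high_level = {"Ra": ["r1", "r4"], "Rb": ["r3", "r4"]}

def fire(high_ref: str):
    """Return, per subnet, the cytoskeletal neurons activated by a high-level reference neuron."""
    selected = sorted({low_level[r] for r in high_level[high_ref]})
    return {f"subnet {s + 1}": [f"E{i + 1}" for i in selected]
            for s in range(NUM_SUBNETS)}

print(fire("Rb"))   # {'subnet 1': ['E3', 'E4'], 'subnet 2': ['E3', 'E4']}
```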
3 Pattern Processing Neurons
Cytoskeletal neurons are the major pattern processing neurons in the ANM system. The information processing sketched in cytoskeletal neurons is motivated by physiological evidence that the internal dynamics of a neuron control its firing behavior [13], [14], [15]. Our hypothesis is that the cytoskeleton plays the role of signal integration; that is, it is capable of integrating signals in space and time to yield spatiotemporal output signals. The dynamics of cytoskeletal neurons are simulated with 2-D cellular automata [22]. Our implementation of the cytoskeletal neuron (Fig. 2) captures this signal integration capability (to be described below). When an external signal impinges on the membrane of a cytoskeletal neuron, a readin enzyme residing at the same site is activated. The activated readin enzyme then activates the cytoskeletal component at the same site, which in turn activates its neighboring component of the same type, and so on. Thus, the activation of a readin enzyme activates a chain of neighboring components of the same type (i.e., initiates a unidirectional signal flow). As to a neighboring component of a different type, an activated component will affect its state when there is an MAP (microtubule associated protein) linking them together. The interactions between two different types of neighboring components are assumed to be asymmetric. For example, in Fig. 3a, the activation of the readin enzyme at location (2,2) will trigger a cytoskeletal signal flow along the C2 components of the second column, starting from location (2,2) and running to location (8,2). When the signal arrives at location (8,2), the C2 component at that site will be activated, which in turn stimulates its neighboring C1 component at location (8,1) to a more excited state (but this is not sufficient to activate it). A counter example (Fig. 3b) is that the activation of the readin enzyme at location (4,1) will trigger a cytoskeletal signal flow along the C1 components of the first column, starting from location (4,1) and running to
location (8,1). The activation of the C1 component at location (8,1) will activate its neighboring C2 component at location (8,2). The activation of the latter will in turn activate the C2 component at location (7,2), its next neighboring component at location (6,2), and so on. Thus, it will trigger a signal flow on the second column, starting from location (8,2) and running to location (2,2).
Fig. 2. Cytoskeletal neurons. Each grid location has at most one of three types of components: C1, C2, or C3. Some sites may not have any component at all. Readin enzymes could reside at the same site as any one of the above components. Readout enzymes are only allowed to reside at the site of a C1 component. Each site has eight neighboring sites. The neighbors of an edge site are determined in a wrap-around fashion. Two neighboring components of different types may be linked by an MAP (microtubule associated protein).
We have described the feature that different types of components interact with each other in an asymmetric manner. The other feature is that different types of components transmit signals at different speeds. The summary of the above two features is that C1 components transmit signals at the slowest speed, but with the highest activating value, and that C3 components transmit signals at the fastest speed, but with the lowest activating value. The activation value of C2 components and their transmitting speed are intermediate between those of C1 and C3 components. When the spatiotemporal combination of cytoskeletal signals arriving at a site of readout enzyme is suitable, the readout will be activated and then the neuron will fire. Fig. 4 shows that there are three possible cytoskeletal signal flows, initiated by external signals, to activate the readout enzyme at location (8,3). One is a signal flow on the second column, the other on the third column, and another on the fourth column. Any two of the above three signal flows might activate the readout enzyme at location (8,3), which in turn will cause the neuron to fire. Nevertheless, the neuron might fire at different times in response to different combinations of signal flows along these fibers. One is that different types of components transmit signals at
different speeds. The other is that signals may be initiated by different readin enzymes. For example, the signal flow on the second column may be initiated either by the readin enzyme at location (2,2) or by the readin enzyme at location (3,2). Similarly, the signal flow on the fourth column may be initiated either by the readin enzyme at location (1,4) or by the enzyme at location (2,4). A signal initiated by a different enzyme will integrate with another signal at a different time. All of these factors affect the temporal firing behavior of a neuron.
Fig. 3. Interaction between different types of components via an MAP. (a) An external signal will trigger a signal flow on the second column, starting from location (2,2) and running to location (8,2). When this signal arrives at location (8,2), it will shift its neighboring C1 component at location (8,1) to a more excited state (i.e., a state that is much easier to activate later). (b) An external signal will trigger a signal flow on the first column, starting from location (4,1) and running to location (8,1). The activation of the C1 component at location (8,1) will in turn activate the C2 component at location (8,2) via the MAP, which in turn will trigger a signal flow on the second column, starting from location (8,2) and running to location (2,2).
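As an illustration of the temporal integration just described, the following minimal Python sketch (our own toy model, not the ANM implementation) computes when a readout enzyme would fire under the "any two of three signal flows" rule of Fig. 4, given per-component-type signal speeds and the distance from each readin enzyme to the readout site. The numeric speeds are assumptions; the paper only states the ordering C1 (slowest) < C2 < C3 (fastest).

```python
# Hypothetical per-type signal speeds (grid cells per time step).
SPEED = {"C1": 1.0, "C2": 2.0, "C3": 3.0}

def arrival_time(component_type: str, distance_cells: int) -> float:
    """Time for a signal to travel `distance_cells` along a fiber of one component type."""
    return distance_cells / SPEED[component_type]

def firing_time(flows, required=2):
    """
    flows: list of (component_type, distance_to_readout) for each initiated signal flow.
    The readout enzyme is activated once `required` flows have arrived; the neuron
    fires at that moment (returns None if too few flows were initiated).
    """
    arrivals = sorted(arrival_time(t, d) for t, d in flows)
    return arrivals[required - 1] if len(arrivals) >= required else None

# Signals started at readin enzymes on the 2nd, 3rd and 4th columns (cf. Fig. 4):
print(firing_time([("C2", 6), ("C1", 5), ("C3", 7)]))   # firing time = when the 2nd flow arrives
```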
4 Evolutionary Learning
Six levels of evolutionary variation are possible in the system: at the level of readin enzymes (initiating signal flows), at the level of readout enzymes (responding to signal flows), at the level of MAPs (modulating signal flows), at the level of cytoskeletal components (transmitting signal flows), at the level of connections between receptor neurons and cytoskeletal neurons, and at the level of reference neurons. In the current implementation we allow variation-selection operators to act on only one level at a time. One level, or aspect, is open to evolution for sixteen cycles; during this time all the other levels are held constant. The levels of evolutionary change are turned on in the sequence: reference neurons, readin enzymes, reference neurons, connections between receptor neurons and cytoskeletal neurons, reference neurons, cytoskeletal components, reference neurons, readout enzymes, reference neurons, and MAPs.
Fig. 4. Different combinations of cytoskeletal signals that fire the neuron. The figure demonstrates that the readout enzyme at location (8,3) might be activated by any two of the signal flows along the C1, C2, and C3 components.
Evolutionary learning at the cytoskeletal neuron level has three steps:
1. Each subnet is activated in turn to evaluate its performance.
2. The pattern of readout enzymes, readin enzymes, MAPs, connectivities, and other components of the best-performing subnets is copied to lesser-performing subnets, depending on which level of evolution is operative.
3. The pattern of readout enzymes, readin enzymes, MAPs, connectivities, and other components of the lesser-performing subnets is slightly varied.
Evolutionary learning at the reference neuron level has three steps:
1. The cytoskeletal neurons controlled by each reference neuron are activated to evaluate their performance.
2. The pattern of neural activities controlled by the best-performing reference neurons is copied to lesser-performing reference neurons.
3. Lesser-performing reference neurons control a slight variation of the neural grouping controlled by the best-performing reference neurons.
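The copy-and-vary cycle above can be condensed into a few lines. The sketch below is a schematic Python rendering of the subnet-level steps, with the fitness function `evaluate` and the level-specific `mutate` operator left abstract; it is an illustration under those assumptions, not the ANM implementation. The reference-neuron level follows the same template, with subnets replaced by the neuron groupings each reference neuron controls.

```python
import copy
import random

def evolve_subnets(subnets, evaluate, mutate, n_best=2):
    """
    One variation-selection cycle at the cytoskeletal-neuron level:
    1) evaluate every subnet, 2) copy the best-performing subnets over the
    lesser-performing ones, 3) slightly vary the copies.
    `evaluate(subnet)` returns a fitness score; `mutate(subnet)` perturbs one
    evolutionary level (readin/readout enzymes, MAPs, components, ...) in place.
    """
    ranked = sorted(subnets, key=evaluate, reverse=True)
    best, rest = ranked[:n_best], ranked[n_best:]
    new_rest = []
    for _ in rest:
        clone = copy.deepcopy(random.choice(best))   # step 2: copy a winner's pattern
        mutate(clone)                                # step 3: slight variation
        new_rest.append(clone)
    return best + new_rest
```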
5 Application Domains
In this study, the ANM system was employed to differentiate a clinical database of ventilator-dependent patients. In total, there were 189 records, of which 84 patients had been successfully weaned from their ventilators while the remaining 105 had failed. The following procedure describes how the ANM system was linked with the database. Note that in this model the cytoskeletal neurons serve as the major components responsible for information processing. We first explain how the connections between the 27 parameters and the cytoskeletal neurons are set up. Then we describe how each parameter value is mapped to an external signal for the cytoskeletal neurons. Finally, we explain how the system's performance is evaluated.
We note that the connections between the parameter layer and the cytoskeletal neuron layer were only partial. That is, each of these neurons was responsible for processing only a small subset of the stimuli generated from the 27 parameters (through evolutionary learning). However, it should be noted that all cytoskeletal neurons connected to a parameter would receive the same pattern of stimuli. The initial connections between the 27 parameters and the cytoskeletal neurons were randomly decided, but subject to change as learning continued. Through evolutionary learning, each cytoskeletal neuron would be trained to be a specific input-output pattern transducer. Each parameter was encoded with a 5-bit pattern; in total, 135 bits were required to encode all 27 parameters. For each parameter, the minimal and maximal values over the 189 records were determined (denoted by MIN and MAX, respectively). The difference between these two values was equally divided into 5 increments of size INCR. The transformation of each actual parameter value (denoted by ACTUAL) into the corresponding 5-bit pattern is
\[
\text{pattern} =
\begin{cases}
00001, & \text{if } \mathrm{MIN} \le \mathrm{ACTUAL} < \mathrm{MIN} + \mathrm{INCR}\\
00010, & \text{if } \mathrm{MIN} + \mathrm{INCR} \le \mathrm{ACTUAL} < \mathrm{MIN} + 2\,\mathrm{INCR}\\
00100, & \text{if } \mathrm{MIN} + 2\,\mathrm{INCR} \le \mathrm{ACTUAL} < \mathrm{MIN} + 3\,\mathrm{INCR}\\
01000, & \text{if } \mathrm{MIN} + 3\,\mathrm{INCR} \le \mathrm{ACTUAL} < \mathrm{MIN} + 4\,\mathrm{INCR}\\
10000, & \text{if } \mathrm{MIN} + 4\,\mathrm{INCR} \le \mathrm{ACTUAL} \le \mathrm{MAX}
\end{cases}
\]
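For concreteness, the 5-bit one-hot encoding above can be written as a small function. This is an illustrative Python rendering of the transformation (the example values for RSBI are made up), not code from the ANM system:

```python
def encode_5bit(actual: float, min_val: float, max_val: float) -> str:
    """Map a parameter value to the 5-bit one-hot pattern defined above."""
    if not (min_val <= actual <= max_val):
        raise ValueError("value outside the observed [MIN, MAX] range")
    incr = (max_val - min_val) / 5.0
    # Bin index 0..4; the top edge (ACTUAL == MAX) falls into the last bin.
    bin_index = min(int((actual - min_val) / incr), 4) if incr > 0 else 0
    return format(1 << bin_index, "05b")

# e.g. a record's RSBI value of 72 with MIN = 20, MAX = 120 falls in the third bin:
print(encode_5bit(72, 20, 120))   # '00100'
```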
Each bit with value '1' represented a specific set of stimuli to a cytoskeletal neuron. When a readin enzyme received an external stimulus, a cytoskeletal signal was initiated. Which readin enzymes of a neuron would receive the stimuli from a parameter was randomly decided in the beginning and subject to change during the course of learning. For each record, all stimuli were sent to the cytoskeletal neurons simultaneously; this means that all cytoskeletal signals were initiated at the same time. The cytoskeleton integrated these signals in space and time. For each input record, the class of the first firing cytoskeletal neuron was assigned as its output (to be described below). Cytoskeletal neurons were equally divided into two classes, corresponding to the two different groups of records: one class of neurons represented the group of patients with successful weaning, while the other represented those with unsuccessful weaning. For each record (input pattern), the system was defined to make a correct response when the class of the first firing neuron was in accordance with the group shown in the database. The ANM system was tested with each of the 189 records in sequence. The greater the number of correct responses made by the system, the higher its fitness.
6 Experimental Results
Two experiments were performed. The first applied the system to the RCC database. For comparison, we also tested the database with two other machine learning methods (a back-propagation neural network and a support vector machine) and one decision-tree-based toolkit (the Waikato Environment for Knowledge Analysis, Weka). In the second experiment, we examined the effectiveness of each parameter in determining weaning results.
6.1 Data Differentiation
The 189 records were divided into two sets: training and testing. The training set consisted of 120 records, about two thirds of the database. The testing set consisted of 69 records, about one third of the database. Ten runs were performed. For each run, 120 of the 189 records were selected at random as the training set, and the remaining 69 records formed the test set. The experimental results showed that the number of records recognized by the ANM system increased continuously during the course of learning; the average differentiation rate was 91%. The system was then tested after substantial learning. The average differentiation rate over the ten test sets was 71.0%, implying that it possessed a certain degree of differentiation capability. For comparison, we first tested the database with the Weka tool, a collection of data mining algorithms, using tree-based classification. The best results obtained with this tool were collected; the average differentiation rate of 10 runs on the testing set was 63.2%. We then tested it with the support vector machine; the average differentiation rate on the testing set was 69.9%. Lastly, we applied the back-propagation neural network to the database. Several combinations of the numbers of hidden layers (and nodes) and transfer functions were tested, and the ten best results were collected. The average differentiation rate on the testing set was 70.9%.
6.2 Parameter Analysis
In this experiment, we investigated the effectiveness of each parameter in determining the weaning results of ventilator-dependent patients. We noted that the system learned to treat a particular parameter as significant when it could always be used to give a correct response to all training records. In other words, altering the values of this parameter might lead to a completely different result. By contrast, the system tended to ignore insignificant parameters: for an insignificant parameter, any alteration of its values had no effect on the response. To generate a test set, we copied the training set and varied a specific parameter of each of the training patterns. The variation was made by setting the parameter at a specific value, from 0.1 to 1 (with an increment of 0.1). Thus, eleven testing sets were generated for each of the 27 parameters; in total, there were 297 testing sets. For each testing set, we set one parameter at a specific value but kept the values of the other 26 parameters unchanged. For example, the first testing set was identical to the training set except that the value of the first parameter of each record was set to 0.1. The second testing set was generated in a similar manner, but the value of the first parameter was set to 0.2. The system after substantial learning was employed. The experimental results showed that there were almost no changes in the system's outputs when the values of these 7 parameters (Cr, respiratory tract disease, cardiovascular disease, chronic obstructive pulmonary disease, septic syndrome with multiple organ failure, other surgeries, and the length of time staying in the ICU) were altered. That is, altering the values of these parameters had no significant effect on the system's outputs. This implied that there was essentially no direct relationship between these parameters and
patients' weaning results. By contrast, the outputs were quite different when the following 10 parameters (age, Alb, BUN, Hb, cerebral vascular disease, coma scale before removing the ventilator, RSBI, respiratory tract infection, urinary tract infection, other infection) were set at different values, illustrating that they played a vital role in affecting patients' weaning results. Among these 10 parameters, age and coma scale before removing the ventilator are the two most important. For the remaining 10 parameters, the results showed that the changes in the system's outputs fell within a certain range, suggesting that they were important to some extent. We were particularly interested in those parameters identified as very important by one side but as insignificant by the other. Our experimental results showed that four parameters were identified by the physicians as very significant but by the ANM system as almost insignificant: respiratory tract disease, cardiovascular disease, chronic obstructive pulmonary disease, and septic syndrome with multiple organ failure. All of these parameters are related to the diseases carried by the patients. Even though the result was not quite the same as that of the physicians, in some sense it provides another dimension of information to physicians: it implies that some of the parameters that physicians consider important might not be that critical.
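The parameter analysis of Sect. 6.2 amounts to a simple sensitivity scan. The sketch below is a schematic Python rendering of that procedure; `classify` stands for the trained system's record-level output and is assumed here, and we use eleven evenly spaced clamp levels on [0, 1] since the paper's grid (0.1 to 1 in steps of 0.1, yet eleven sets) is ambiguous about whether 0 is included.

```python
import numpy as np

def parameter_sensitivity(train_X, classify, n_levels=11):
    """
    For each (normalized) parameter, clamp it to fixed values 0.0, 0.1, ..., 1.0
    in every training record and count how often the system's output changes.
    Parameters whose clamping never changes the output are treated as insignificant.
    """
    baseline = np.array([classify(x) for x in train_X])
    n_records, n_params = train_X.shape
    changes = np.zeros(n_params)
    for p in range(n_params):
        for level in np.linspace(0.0, 1.0, n_levels):
            perturbed = train_X.copy()
            perturbed[:, p] = level                      # one synthetic test set
            outputs = np.array([classify(x) for x in perturbed])
            changes[p] += np.sum(outputs != baseline)
    return changes / (n_records * n_levels)              # fraction of flipped responses
```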
7 Conclusions
The experimental results with the RCC database are consistent with our previous results showing that the system has a high differentiation capability. Finally, we note that a parameter is significant if it can be used to differentiate the data effectively, and redundant if it cannot. The significance of each parameter depends on the structure of a data set; different sets possess different significant parameters. The parameter analysis result has some implications for clinical studies. It provides physicians with information about the effectiveness of each parameter in determining weaning for patients. Our experimental results show that some of these significant parameters are the same as those recognized by physicians whereas some are not. The finding of significant parameters not previously recognized by physicians may provide them with another dimension of information, which in turn may open up the possibility of exploring some unknown phenomena.
References
1. Adams, A.B., et al.: Survey of Long-Term Ventilator Support in Minnesota: From 1986 to 1992. Chest 103, 463–469 (1993)
2. Bouachour, G., et al.: Gastric Intramural pH: An Indicator of Weaning from Mechanical Ventilation in Patients. Eur. Respir. J. 9, 1868–1873 (1996)
3. Bremermann, H.J.: Optimization through Evolution and Recombination. In: Yovits, Jacobi, Goldstein (eds.) Self-Organizing Systems, pp. 93–106. Spartan Books, Washington, D.C. (1962)
4. Conrad, M.: Molecular Information Processing in the Central Nervous System, Parts I and II. In: Conrad, M., Güttinger, W., Dal Cin, M. (eds.) Physics and Mathematics of the Nervous System, pp. 82–127. Springer, Heidelberg (1974a)
5. Conrad, M.: Evolutionary Learning Circuits. J. Theor. Biol. 46, 167–188 (1974b)
6. Conrad, M.: Molecular Information Structures in the Brain. J. Neurosci. Res. 2, 233–254 (1976a)
7. Conrad, M.: Complementary Molecular Models of Learning and Memory. BioSystems 8, 119–138 (1976b)
8. Conrad, M.: Principle of Superposition-Free Memory. J. Theor. Biol. 67, 213–219 (1977)
9. Conrad, M., Kampfner, R.R., Kirby, K.G., Rizki, E.N., Schleis, G., Smalz, R., Trenary, R.: Towards an Artificial Brain. BioSystems 23, 175–218 (1989)
10. Feeley, T., Hedley-Whyte, J.: Weaning from Controlled Ventilation and Supplemental Oxygen. N. Engl. J. Med. 292, 303–306 (1975)
11. Fogel, D.: Evolutionary Computation: Towards a New Philosophy of Machine Intelligence. IEEE Press, Piscataway (1995)
12. Fogel, L., Owens, A., Walsh, M.: Artificial Intelligence through Simulated Evolution. Wiley, New York (1966)
13. Hameroff, S.R.: Ultimate Computing. North-Holland, Amsterdam (1987)
14. Liberman, E.A., Minina, S.V., Shklovsky-Kordy, N.E., Conrad, M.: Microinjection of Cyclic Nucleotides Provides Evidence for a Diffusional Mechanism of Intraneuronal Control. BioSystems 15, 127–132 (1982)
15. Matsumoto, G., Tsukita, S., Arai, T.: Organization of the Axonal Cytoskeleton: Differentiation of the Microtubule and Actin Filament Arrays. In: Warner, F.D., McIntosh, J.R. (eds.) Cell Movement: Kinesin, Dynein, and Microtubule Dynamics, pp. 335–356. Alan R. Liss, New York (1989)
16. Modawal, A., et al.: Weaning Success among Ventilator-Dependent Patients in a Rehabilitation Facility. Arch. Phys. Med. Rehabil. 83, 154–157 (2002)
17. Nava, S., et al.: Survival and Prediction of Successful Ventilator Weaning in COPD Patients Requiring Mechanical Ventilation for more than 21 Days. Eur. Respir. J. 7, 1645–1652 (1994)
18. Robinson, R.: Ventilator Dependency in the United Kingdom. Archives of Disease in Childhood 65, 1235–1236 (1990)
19. Rosen, R.: Dynamical System Theory in Biology. Wiley, New York (1970)
20. Sahn, S.A., Lakshminarayan, S.: Bedside Criteria for Discontinuation of Mechanical Ventilation. Chest 63, 1002–1005 (1973)
21. Stetson, J.B.: Introductory Essay. Int. Anesthesiology Clinics 8(4), 767–779 (1970)
22. Wolfram, S.: Cellular Automata as Models of Complexity. Nature 311, 419–424 (1984)
23. Yang, K.L., Tobin, M.J.: A Prospective Study of Indexes Predicting the Outcome of Trials of Weaning from Mechanical Ventilation. N. Engl. J. Med. 324(21), 1445–1450 (1991)
Licence Plate Character Recognition Using Artificial Immune Technique Rentian Huang, Hissam Tawfik, and Atulya Nagar Intelligent and Distributed Systems Lab, Deanery of Business and Computer Sciences, Liverpool Hope University, Liverpool, United Kingdom L16 9JD {10076507,TAWFIKH,NAGARA}@Hope.ac.uk
Abstract. This paper proposes the application of an Artificial Immune Technique to Licence Plate Character Recognition (LPCR). The use of the Clonal Selection Algorithm (CSA) is composed of two main stages: (1) dynamic training of samples; and (2) a choice of the best antibodies based on the three main clonal operations of cloning, clonal mutation and clonal selection. Once memory cells are established, the classification results are output using a fuzzy K-Nearest Neighbour (KNN) approach. The performance of the CSA is compared to that of Back Propagation Neural Networks (BPNN) on an LPCR problem. The experimental results show that the Artificial Immune Technique performs favourably, being more accurate and robust. Keywords: Artificial Immune System (AIS), Clonal Selection Algorithm (CSA), Licence Plate Recognition (LPR).
1 Introduction
A Licence Plate Recognition (LPR) system combines image processing and character recognition technology to identify vehicles by automatically reading their number plates. A typical LPR process consists of three stages: 1) licence plate location, 2) character segmentation and 3) character recognition. LPR is a particularly useful and practical vehicle identification technology, as it assumes no means of vehicle identification beyond the existing and legally required number plate. Furthermore, when the data gathered by an LPR system are stored and organized within a database, more complex information-driven tasks may potentially be performed, such as vehicle travel time calculations as well as border controls. However, in practice it is a very difficult task due to the variety of environmental conditions. LPR is usually conducted under certain restrictive conditions such as indoor scenes, stationary backgrounds, fixed illumination, prescribed driveways, limited vehicle speed, and a designated range of distances between camera and vehicle [1]. Despite these current limitations, LPR finds applications in private parking management, traffic monitoring, automatic toll payment, surveillance and security enforcement [2]. Numerous algorithms have previously been exploited, such as Hidden Markov Models (HMM) [3], Artificial Neural Networks (ANN) [4], Hausdorff Distance [5],
Support Vector Machine (SVM)-based character recognizers [6] and template matching [7], all of which leave room for improvement. The focus of this paper is to investigate a character recognition technique using the Artificial Immune System (AIS)-based CSA. A number of adjustments are made to the basic implementation of the CSA in order to improve performance, especially a new dynamic training scheme to establish the immune memory (a collection of antibodies) for classification. Additionally, neural-network results are presented for comparison. The experimental results show that the CSA has a better performance in terms of successful classification of licence plate characters; it proved more accurate and robust than the neural network. The rest of this paper is organized as follows: Section 2 presents the LPR architecture, including a review of relevant techniques used for tackling character recognition in LPR. Section 3 introduces our CSA and the features added to it for character recognition. Section 4 provides the experimental details and compares the performance of the CSA in character recognition. Section 5 gives the conclusion and proposes future work.
2 Car Plate Recognition
Car plate recognition algorithms reported in the literature are generally composed of three main steps: 1) locating licence plates, 2) segmenting licence numbers and 3) identifying the characters. Fig. 1 illustrates our proposed LPR process. For locating the licence plate, a colour edge detector is developed to detect the type of edges contained within the licence plate. While multiple licence plate candidates are normally detected, size and shape filtering is used to remove objects that do not satisfy specific conditions. The target regions selected are those that can serve as possible licence plate boundaries; to identify them, the area-to-perimeter ratio of each candidate area is compared with the standard ratio of a number plate. Once a licence plate candidate has been extracted from the image, the licence number segmentation preprocessing component performs three tasks: grey-level transform, median filtering and binarisation. A vertical projection is then performed to segment the characters, with each character image normalized to a size of 16x16 after segmentation. Following character segmentation from the plate region, a method needs to be selected for character recognition, which is the main subject of this work.
Fig. 1. Diagram of our LPR process
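As a rough illustration of the segmentation preprocessing just described (grey-level transform, median filtering, binarisation, vertical projection, and 16x16 normalization), the following Python sketch uses OpenCV and NumPy. It is an approximation with assumed thresholds, not the code of the reported system:

```python
import cv2
import numpy as np

def segment_characters(plate_bgr, min_col_sum=2, size=(16, 16)):
    """Return normalized 16x16 binary character images cut from a plate image."""
    gray = cv2.cvtColor(plate_bgr, cv2.COLOR_BGR2GRAY)           # grey-level transform
    gray = cv2.medianBlur(gray, 3)                                # median filtering
    _, binary = cv2.threshold(gray, 0, 255,
                              cv2.THRESH_BINARY_INV + cv2.THRESH_OTSU)  # binarisation
    col_sum = (binary > 0).sum(axis=0)                            # vertical projection
    chars, start = [], None
    for x, s in enumerate(np.append(col_sum, 0)):                 # walk the projection
        if s >= min_col_sum and start is None:
            start = x                                             # a character begins
        elif s < min_col_sum and start is not None:
            chars.append(cv2.resize(binary[:, start:x], size))    # normalize to 16x16
            start = None
    return chars
```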
A large number of character recognition techniques have been reported. HMM-based recognition begins with preprocessing and parameterization of the regions of interest detected in the previous phase. Researchers report that the width of the plate in the image after rescaling lies between 25% and 75% for image widths ranging from 200 to 600 pixels. This reveals the necessity for good character analysis when implementing HMMs, which places a restriction on the effective distance of the recognition system [3]. Various types of ANN have been used for licence plate character identification, such as the work done by Broumandnia and Fathy [8]. A self-organized neural network based on Kohonen's Self-Organized Feature Maps (SOFMs) was implemented to tolerate noisy, deformed, broken, or incomplete characters acquired from licence plates which were bent or tilted with respect to the camera [9]. Probabilistic Neural Networks (PNNs) for LPR were also introduced by Anagnostopoulos et al. [10]. The Hausdorff distance is a method for LPCR that compares two binary images. Its main problem is the computational burden; its recognition rate is very similar to that obtained with neural-network classifiers, but it is slower [5]. Kim et al. designed a system implementing four SVMs and report an impressive average character recognition rate. The architecture, however, is strictly designed for Korean plates [6]. A suitable technique for the recognition of single-font, fixed-size characters is template matching, where the recognition process is based on the computation of normalized cross-correlation values for all shifts of each character template over the sub-image containing the licence plate [7]. The LPR problem continues to be a challenge for artificial intelligence solutions, and novel approaches are therefore needed to improve the performance and efficiency of LPCR algorithms. In this work, an AIS character recognition technique based on the Clonal Selection Algorithm is presented for solving the LPR problem.
3 Immune Techniques for Character Recognition
Artificial Immune Systems form a rapidly emerging technique for developing mechanisms for learning, prediction, memory and adaptation. An AIS mimics the biological immune system, which offers powerful and robust information processing capabilities for solving complex problems [11]. The immune system is a biological pattern recognition and classification system which learns to distinguish self from non-self. The immune system's behaviour is an emergent property of the entire population of diverse agents, and it improves performance by weeding out the weakest players, replacing them with agents as different as possible. The immune system is computationally one of the least understood biological paradigms but has drawn significant attention. AIS have started to be used in many application domains, including computer security, optimization, robotics, data mining, fault detection, anomaly detection, and pattern recognition [12].
3.1 Clonal Selection
Clonal Selection Theory, the famous theory in immunology, was put forward by Burnet in 1978 [13]. Its main idea lies in the fact that the antigen can selectively react
to the antibodies, which are natively produced and spread on the cell surface in the form of peptides. When cells are exposed to an antigen, the antigen stimulates an immune cell with appropriate receptors to proliferate (divide) and mature into terminal plasma cells. The process of cell division generates a clone, i.e., a set of cells that are the progenies of the single cell. In addition to proliferating and maturing into plasma cells, the immune cells can differentiate into long-lived memory cells. Memory cells circulate through the blood, lymph and tissues, and when exposed to a second antigenic stimulus they commence to differentiate into large immune cells (lymphocyte) capable of producing high affinity antibody preselected for the specific antigen that once stimulated the primary response. Fig. 2 depicts the clonal selection principle.
Fig. 2. Clonal Selection Principle
Clonal selection is a dynamic process of the immune system stimulated by the self-adapting antigen. Some biological features such as learning, memory and antibody diversity can be used in artificial immune systems to solve complex problems. De Castro and Von Zuben proposed the first clonal selection algorithm, called CLONALG, and suggested that it could be used for pattern recognition. They generated random antibodies to be used as the target patterns in the CSA, using a set of 12 x 10 binary images as the target patterns [14]. Garain et al. proposed a CSA for a 2-class problem to classify pairs of similar character patterns and claimed promising results. Setting aside the classification power, data reduction has been another capability of the CSA [15]. Garain et al. further explored the potential of the CSA in pattern recognition by applying it to a 10-class classification problem. An empirical study with two datasets shows that the CSA has very good generalization ability, with experimental results reporting an average recognition accuracy of about 96% [16].
3.2 Clonal Selection for LPCR
The proposed clonal selection algorithm for character recognition is composed of two main processes: firstly, selecting samples and training them using the CSA; secondly, using the fuzzy KNN approach to output the classification results for LPR. The essentials of clonal selection are established as follows. Antigens are images
stored in a matrix and represent the licence plate characters of the system. Antibodies are the candidates that go through the clonal process and try to capture and represent the common features of antigens. The affinity between antibody and antigen reflects the total binding strength between them. For the classification problem, Hamming distance (HD) and similarity functions are used to measure the affinity between antigen and antibody. The Hamming distance rule is presented below:
\[
\mathrm{difference} = \sum_{j=1}^{n} \sum_{i=1}^{len} \bigl( Ab_i \oplus Ag_{ij} \bigr),
\qquad \mathrm{Affinity} = -\,\mathrm{difference}
\tag{1}
\]
where Ab_i is the i-th bit in the antibody Ab, Ag_{ij} is the i-th bit in the j-th antigen pattern example Ag_j, n is the number of examples for a particular pattern class, len is the total length of an antibody, and ⊕ represents the exclusive-or (XOR) operator. Another formula used to measure the similarity (affinity) of the antigen-to-antibody interaction is given in Eqn. (2) below:
\[
S(Ag_1, Ag_2) = \frac{1}{2} \;-\; \frac{S_{10} S_{01} - S_{00} S_{11}}
{2 \sqrt{(S_{11}+S_{10})(S_{01}+S_{00})(S_{11}+S_{01})(S_{10}+S_{00})}}
\tag{2}
\]
where Ag_1 and Ag_2 are the two matrices to be compared, and S_{11}, S_{00}, S_{10}, and S_{01} denote the numbers of one-matches, zero-matches, and the two kinds of mismatches between corresponding bits. The value of S is in the range [0, 1], where 1 indicates the highest and 0 the lowest similarity between the samples. In immunology, cloning selects a number of antibodies with the highest affinity and clones them based on their antigenic affinities: the higher the antigenic affinity, the higher the number of clones generated. The total number of clones generated, N_c, is defined in Equation (3):
\[
N_c = \sum_{i=1}^{n} \operatorname{round}\!\left( \frac{\beta \cdot N}{i} \right)
\tag{3}
\]
where β is a multiplying factor, N is the total number of antibodies, and round(·) is the operator that rounds its argument to the closest integer.
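To make Eqns. (1)-(3) concrete, the following Python helpers implement them directly. This is an illustrative rendering (including the square root we read into the reconstructed Eqn. (2)), not the authors' code:

```python
import math

def hamming_affinity(ab, antigens):
    """Eqn. (1): negative count of mismatching bits over all antigen examples of a class."""
    difference = sum(a != g for ag in antigens for a, g in zip(ab, ag))
    return -difference

def similarity(ag1, ag2):
    """Eqn. (2): a correlation-style similarity in [0, 1] between two bit patterns."""
    s11 = sum(a == 1 and b == 1 for a, b in zip(ag1, ag2))
    s00 = sum(a == 0 and b == 0 for a, b in zip(ag1, ag2))
    s10 = sum(a == 1 and b == 0 for a, b in zip(ag1, ag2))
    s01 = sum(a == 0 and b == 1 for a, b in zip(ag1, ag2))
    denom = math.sqrt((s11 + s10) * (s01 + s00) * (s11 + s01) * (s10 + s00))
    if denom == 0:                      # degenerate case: a constant pattern
        return 1.0 if list(ag1) == list(ag2) else 0.0
    return 0.5 - (s10 * s01 - s00 * s11) / (2 * denom)

def total_clones(n, N, beta):
    """Eqn. (3): clone budget for the n highest-affinity antibodies (rank i = 1 is best)."""
    return sum(round(beta * N / i) for i in range(1, n + 1))
```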
In clonal mutation, the clone set C^i is used to produce the mutated offspring C^{i*}. The higher the affinity, the smaller the mutation rate. The mutation-rate control algorithm is described in Fig. 3, and the mutation step Δ is defined as follows:
\[
\Delta(t, y) = y \left( 1 - r^{\left( 1 - \frac{t}{T} \right)^{\lambda}} \right)
\tag{4}
\]
where t is the iteration number, T is the maximum number of iterations, r is a random value in the range [0, 1], and λ determines the degree of non-uniformity of the mutation.
Find the maximum and minimum in population C^i
For each Ab, do
    Generate a random value in the range [0, 1], named mr
    Generate another random value in the range [0, 1], named t0
    If mr < mutation_rate
        If t0 >= 0
            Ab = Ab + Δ(t, max − Ab)
        else
            Ab = Ab − Δ(t, Ab − min)
    return Ab
__________________________________________________________________
Fig. 3. Mutation rate control algorithm
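A runnable Python counterpart of Fig. 3 is sketched below, with antibodies shown as single real values for brevity (in the LPCR system they are 16x16 patterns, so the operation would be applied element-wise). One assumption is flagged in the code: as printed, the condition t0 >= 0 is always true for t0 drawn from [0, 1], so we treat it as a 0.5 threshold.

```python
import random

def delta(t, y, T, lam):
    """Eqn. (4): non-uniform mutation step that shrinks as iteration t approaches T."""
    r = random.random()
    return y * (1.0 - r ** ((1.0 - t / T) ** lam))

def mutate_population(population, t, T, lam=2.0, mutation_rate=0.05):
    """Fig. 3: each antibody (a real value here, for brevity) is nudged toward the
    population maximum or minimum with a step that decays over the iterations."""
    lo, hi = min(population), max(population)
    mutated = []
    for ab in population:
        mr, t0 = random.random(), random.random()
        if mr < mutation_rate:
            if t0 >= 0.5:                      # assumed threshold (see note above)
                ab = ab + delta(t, hi - ab, T, lam)
            else:
                ab = ab - delta(t, ab - lo, T, lam)
        mutated.append(ab)
    return mutated
```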
The final operation is clonal selection, which includes Hamming distance and similarity threshold selections. The Hamming distance threshold defines which antibodies are allowed to stay for further memory cell selection. The similarity threshold defines which antibodies are allowed to be added to the memory cells to become detectors. The value of the Hamming distance threshold should be adjusted empirically such that the antibodies have the ability to detect new cases correctly.
3.3 Dynamic Training Algorithm
One antigen (the UK mandatory typeface) from each class, together with antibodies generated by basic clonal selection, has been chosen to initialize the immune memory. After initialization, real characters are passed to a dynamic training algorithm, and training and testing of the immune memory cells go hand in hand to obtain better memory cells for classification. The clonal dynamic training algorithm is shown in Fig. 4.

While No. <= size of antigens
    Select an antigen Ag and start to train
    Classify the antigen using the current updated memory cells
    If the classification strategy recognizes the antigen, start with another antigen
    Otherwise generate antibodies Abs randomly and calculate the affinity
    Select the n Abs having the highest affinity and clone them
    Apply hypermutation to the clone set C^i to produce mutated offspring C^{i*}
    Re-calculate the HD between Ag and C^{i*}; select Abs for the next step
    Calculate the similarity between Ab and Ag
    Select matured Abs for the memory cells
    Stop training if the required number of matured antibodies is generated
End when all antigens have been trained
__________________________________________________________________
Fig. 4. Dynamic training algorithm
Classification is implemented with a fuzzy KNN approach, proposed by Keller et al. in [17], which provides an improvement over existing classification techniques. For testing, each pattern is passed through the memory cells, and the fuzzy KNN selects the k closest memory cells from the immune memory. The selected memory cells are then grouped according to their class labels, and the class of the largest group identifies the testing pattern.
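A bare-bones version of this classification step is sketched below in Python, using the similarity helper of Eqn. (2) given earlier and the simple largest-group vote described above (the full membership weighting of Keller's fuzzy KNN is omitted). It is an illustration, not the authors' implementation; k = 7 matches the setting the authors report as best overall.

```python
from collections import Counter

def classify(pattern, memory_cells, k=7):
    """
    memory_cells: list of (bit_pattern, class_label) pairs produced by dynamic training.
    Returns the label of the largest class group among the k most similar cells,
    using the similarity() helper of Eqn. (2) defined earlier.
    """
    ranked = sorted(memory_cells,
                    key=lambda cell: similarity(pattern, cell[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]
```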
4 Experiments and Results
Three different datasets, 'LPR0', 'LPR1' and 'LPR2', were collected from a car park, roads, streets and petrol stations within the UK. LPR0 consists of 950 samples of licence plates used for training. LPR1 and LPR2 are two datasets containing 400 samples, used to test the performance of the systems. Characters extracted from the LPR0 data were grouped into two parts: digits and letters. The digits have 10 classes (0 to 9) and the letters have 23 classes (A to Z without Q, O, I). Experiments were carried out with two different training methods: (1) single-pass training, where each antigen produces the same number of antibodies, and (2) dynamic training, described in Section 3.3. In both cases, all the antibodies were first generated from the mandatory typeface, with each antigen producing 30 antibodies; antigens taken from LPR0 generated only 10 antibodies each. The HD threshold was 25 for digits and 50 for letters; the similarity threshold was 0.93 for digits and 0.87 for letters. The initial population of antibodies was 30 and the hypermutation probability was 0.05. All parameters were determined by experimentation. Classification results also depend on the classification strategy. The effect of k in the fuzzy KNN classification was examined; k=5 for the digits and k=7 for the letters gave the best performance, with k=7 giving the best combined performance. Improvement can be further achieved by dividing the letters into two groups, G1 and G2. Table 1 presents the results for both training methods. The results show that dynamic training reduces the difficulties of single-pass training, such as large numbers of immune memory cells, low recognition accuracy, and time wasted in training and recognition.
Table 1. Two different training results for CSA

                 Digits Accuracy %        Letters Accuracy %
Parameter K      Single    Dynamic        Single    Dynamic
K=5              96.5      98.5           86        88
K=7              95        96.5           89        92.4
K=9              93        95             85        89
Table 2 presents the results for best training and testing (C=correct, I=Incorrect) for our Licence Plate Character Recognition.
Table 2. Training & testing results for CSA

Data set       Training LPR0        Testing LPR1           Testing LPR2
               Digits   Letters     C      1I     >2I      C      1I     >2I
Accuracy %     96.5     92.4        89.5   3.0    7.5      83.5   7.5    9.0
The performance of our CSA-based approach has also been compared to a Back Propagation Neural Network [18]; a feed-forward neural network consisting of three layers was employed. In this case, the multilayer perceptron (MLP) model had 256 nodes in the input layer and 20-50 nodes in the hidden layers, which were determined empirically. The letters are divided into two groups, N1 and N2, so that the confusion of similar characters can be corrected by the neural networks. The initial results for the performance of the digit network were sufficiently successful. The results are shown in Table 3.
Table 3. Training & testing results for ANN

Data set       Training LPR0        Testing LPR1           Testing LPR2
               Digits   Letters     C      1W     >2W      C      1W     >2W
Accuracy %     94.3     90.5        84.5   6.0    9.5      81.0   9.0    10.0
The experiments show that the clonal selection algorithm can generalize from examples and can successfully classify previously unseen examples of its training classes. Adjustments to the basic algorithm improved performance and illustrated how clonal selection can be used in LPCR. Compared to our neural network approach, the proposed AIS-based method was better than the ANN by more than 2 percentage points in training and more than 3 percentage points in testing. The algorithm has been designed to increase the number of good candidate memory cells and reduce the training time. Its main weakness, however, lies in efficiency: the time taken to generate the memory cells could make it unattractive for time-dependent applications such as most real-world problems.
5 Concluding Remarks
This paper reports on the use of the clonal selection algorithm for Licence Plate Character Recognition. The clonal selection algorithm can be characterized as a good alternative and a competitive approach in which individual antibodies compete while the whole population cooperates as an ensemble of individuals to present the final solution. The experimental results consistently show that the proposed algorithm has high classification precision.
Moreover, when compared with the BPNN, the average performance of the CSA is more accurate and robust. In particular, the proposed algorithmic structure was not evaluated at a fixed location; the images were acquired manually under various views and illumination conditions in order to closely resemble real-world situations. Licence Plate Recognition remains an important research topic for artificial intelligence systems. Future work includes increasing the size of the test set, improving the detection accuracy and classification speed, and investigating the hybridization of the CSA with Higher Order Neural Networks (HONNs).
References
1. Chang, S.L., Chen, L.S., Chung, Y.C., Chen, S.W.: Automatic licence plate recognition. IEEE Trans. Intell. Transp. Syst. 5(1), 42–53 (2004)
2. Wu, C., On, L.C., Weng, C.H., Kuan, T.S., Kengchung, N.G.: A Macao Licence Plate Recognition System. In: Proceedings of Fourth International Conference on Machine Learning and Cybernetics, Guangzhou, vol. 7, pp. 4506–4510 (2005)
3. Duan, T.D., Hong Du, T.L., Phuoc, T.V., Hoang, N.: Building an automatic vehicle licence plate recognition system. In: Proc. Int. Conf. Comput. Sci. RIVF, pp. 59–63 (2005)
4. Hu, Y., Zhu, F., Zhang, X.: A Novel Approach for License Plate Recognition Using Subspace Projection and Probabilistic Neural Network. In: Wang, J., Liao, X.-F., Yi, Z. (eds.) ISNN 2005. LNCS, vol. 3497, pp. 216–221. Springer, Heidelberg (2005)
5. Martín, F., García, M., Alba, L.: New methods for automatic reading of VLP's (Vehicle Licence Plates). In: Proc. IASTED Int. Conf. SPPRA (2002)
6. Kim, K.I., Jung, K., Kim, J.: Color Texture-Based Object Detection: An Application to License Plate Localization. In: Lee, S.-W., Verri, A. (eds.) SVM 2002. LNCS, vol. 2388, pp. 293–309. Springer, Heidelberg (2002)
7. Anagnostopoulos, C., Anagnostopoulos, I., Loumos, V., Kayafas, E.: A Licence Plate-Recognition Algorithm for Intelligent Transportation System Applications. IEEE Transactions on Intelligent Transportation Systems 7(3) (September 2006)
8. Broumandnia, A., Fathy, M.: Application of pattern recognition for Farsi licence plate recognition. In: The ICGST Int. Conf. Graphics, Vision and Image Processing (GVIP), vol. 2, pp. 25–31 (2005)
9. Chang, S.-L., Chen, L.-S., Chung, Y.-C., Chen, S.-W.: Automatic Licence Plate Recognition. IEEE Transactions on Intelligent Transportation Systems 5(1) (March 2004)
10. Anagnostopoulos, C., Alexandropoulos, T., Boutas, S., Loumos, V., Kayafas, E.: A template-guided approach to vehicle surveillance and access control. In: Proc. IEEE Conf. Advanced Video and Signal Based Surveillance, pp. 534–539 (2005)
11. de Castro, L.N., Timmis, J.: Artificial Immune Systems: A Novel Paradigm to Pattern Recognition. In: Corchado, A.J., Fyfe, C. (eds.) Artificial Neural Networks in Pattern Recognition, pp. 67–84. University of Paisley (2003)
12. Timmis, J., Knight, T., De Castro, L.N., Hart, E.: An Overview of Artificial Immune Systems. Natural Computation Series, 51–86 (2004)
13. Burnet, F.M.: Clonal Selection and After. In: Bell, G.I., Perelson, A.S., Pimbley Jr., G.H. (eds.) Theoretical Immunology, pp. 63–85. Marcel Dekker Inc., New York (1978)
14. de Castro, L.N., Von Zuben, F.J.: aiNet: An Artificial Immune Network for Data Analysis. In: Sacker, R.A., Newton, C.S. (eds.) Data Mining: A Heuristic Approach. Idea Publishing Group, Hershey (2001)
Integration of Ab Initio Nuclear Physics Calculations with Optimization Techniques
Masha Sosonkina¹, Anurag Sharda¹, Alina Negoita², and James P. Vary²
¹ Ames Laboratory/DOE, Iowa State University, Ames, IA 50011, USA {masha,anurag}@scl.ameslab.gov
² Physics Department, Iowa State University, Ames, IA 50011, USA {alina,jvary}@iastate.edu
Abstract. Optimization techniques are finding their inroads into the field of nuclear physics calculations where the objective functions are very complex and computationally intensive. A vast space of parameters needs searching to obtain a good match between theoretical (computed) and experimental observables, such as energy levels and spectra. In this paper, we propose a design integrating the ab initio nuclear physics code MFDn and the VTDIRECT95 code for derivative-free optimization. We experiment with the initial implementation of the design showing good matches for several single-nucleus cases. For the parallel MFDn code, we determine appropriate processor numbers to execute efficiently a multiple-nuclei parameter search. Keywords: No Core Shell Model, MFDn, Derivative-free Optimization, VTDIRECT95.
1 Motivation
Unlike electrons in the atom, the interaction between nucleons is not known precisely and is complicated. The shell model is the fundamental tool to study the structure of nuclei. The basic idea is that the nucleons move in an average potential generated by the mutual interactions of the nucleons. The strong Nucleon-Nucleon (NN) interaction as well as 3-nucleon (NNN) interactions generate the potential that describes the nucleon energy levels in the nucleus. In particular, NN and NNN interactions tuned to fit light nuclei are used in nuclear astrophysics for solar models, supernova modeling, and Big Bang nucleosynthesis. The techniques for solving these problems also find applications in the fields of quantum chemistry, condensed matter physics, and atomic, nuclear, and particle physics. Until recently, the No Core Shell Model [1,2] has been limited to nuclei up to atomic mass A of 16. Work is underway to extend this method to heavier nuclei [3]. The effective Hamiltonian operator derived from the CD-Bonn interaction [4] gives a poor description of nuclei with atomic mass near 48. Figure 1 describes the matches between the theoretical and experimentally obtained energy levels for 49Sc, where the initial version of the theory is given in the rightmost column.
A problem with the existing Hamiltonian is that the computed spectrum is too compressed compared with the experimental spectrum. The addition of the three terms — isospin-dependent V0, central V1, and tensor-interaction Vtens — results in reasonable low-lying spectra for the nuclei involved in the double-beta decay of 48Ca. One of the physics goals is to test whether the same modified Hamiltonian used for the isotopes of the nuclei with atomic mass of 48 is able to describe other heavy nuclei. These three terms and possibly other (up to 20) parameters need to be searched to obtain their best match to the experimental values (see Fig. 1, for example). To find this match according to some criteria, it is required to evaluate energies at many points in a parameter search space. In particular, a criterion, called χ2, may be calculated that quantifies the match using weights (see Sect. 3.2). This process may be automated by taking advantage of optimization techniques which will generate the points at which χ2 may be evaluated. Note that, since derivatives do not come into the picture for complex nuclear physics calculations on which χ2 depends, derivative-free optimization is considered. As a trade-off, a large number of function evaluations is typically required even to find a local minimum. The time taken by such an optimization algorithm is directly proportional to the cost of the objective function evaluation. Thus, parallel implementations of both the function evaluation and the optimization algorithm may be beneficial. The remainder of the paper is organized as follows. In Sect. 2, we give an overview of two parallel software packages under consideration: Many Fermion Dynamics nuclear (MFDn) and the optimization algorithm VTDIRECT95.
Fig. 1. Matching of experimental (Exp) and theoretical energy levels of 49Sc using the CD-Bonn potential in its initial version (CD-Bonn) and with the new searched terms (CD-Bonn + 3 terms). Each energy level is annotated with its spin value.
Fig. 2. Example of the scatter plot of boxes. The F-axis is function values, and the D-axis is box diameters.
In Sect. 3, we first present a design to make the MFDn code and the optimization algorithm work in concert. Then, we describe our initial implementation. Sect. 4 presents the computational experiments and analyzes them with respect to nuclear physics objectives. Sect. 5 concludes.
2 Overview of Nuclear Physics and Optimization Packages
Many Fermion Dynamics nuclear (MFDn) [5] is a parallel code used for large-scale nuclear structure calculations in the No Core Shell Model (NCSM) formalism [1,2], which has been shown to be successful for up to 16-nucleon problems on present-day computational resources. The MFDn code is tasked with computing a few lowest (≈15) converged solutions, called wave functions, to the many-nucleon Schrödinger equation:

H |φ⟩ = E |φ⟩ .    (1)

Then other properties, called observables, are formed from the calculated wave functions. The matrix H in (1) is the Hamiltonian operator, which is typically solved using Lanczos diagonalization since H is symmetric and sparse. However, the Lanczos iterative process may be very expensive due to the huge dimensionality of H with many off-diagonal elements. The number of Lanczos iterations also increases significantly for the energy levels beyond the ground state. For example, for the 16O nucleus in the 6ℏω basis space, the ground-state energy level requires only 35 Lanczos iterations, while 15 excited states need at least 200 Lanczos iterations to converge. Note that, in this case, the constructed Hamiltonian H has the dimension of 26,483,625. MFDn constructs the m-scheme basis space, evaluates the Hamiltonian matrix elements in this basis using efficient algorithms, diagonalizes the Hamiltonian to obtain the lowest eigenvectors and eigenvalues, then post-processes the wave functions to obtain a suite of observables and to compare them with experimental values.
VTDIRECT95 [6] is a Fortran 95 suite of parallel codes implementing the derivative-free optimization algorithm DIRECT [7], which takes a set of problem- and algorithm-dependent input parameters and finds the global minimum of an objective function f inside the feasible set D. Each iteration of DIRECT consists of the following steps.
1. INITIALIZATION. Normalize the feasible set D to be the unit hypercube. Sample the center point ci of this hypercube and evaluate f(ci). Initialize fmin = f(ci), evaluation counter m = 1, and iteration counter t = 0.
2. SELECTION. Identify the set S of "potentially optimal" boxes that are subregions of D. A box is potentially optimal if, for some Lipschitz constant, the function value within the box is potentially smaller than that in any other box (a formal definition with a parameter is given in [8]).
3. SAMPLING. For any box j ∈ S, identify the set I of dimensions with the maximum side length. Let δ equal one-third of this maximum side length. Sample the function at the points c ± δei for all i ∈ I, where c is the center of the box and ei is the ith unit vector.
4. DIVISION. Divide the box j containing c into thirds along the dimensions in I, starting with the dimension with the lowest value of wi = min{f(c + δei), f(c − δei)}, and continuing to the dimension with the highest wi. Update fmin and m.
5. ITERATION. Set S = S − {j}. If S ≠ ∅, go to 3.
6. TERMINATION. Set t = t + 1. If the iteration limit or the evaluation limit has been reached, stop. Otherwise, go to 2.
Initially, only one box exists in the system. As the search progresses, more boxes are generated, as illustrated by the scatter plot shown in Fig. 2, where each circle represents a box. The sizes of boxes increase along the D-axis (diameter) and the function values at box centers increase along the F-axis (function). All the boxes with the same diameter belong to a "box column". Reference [8] proves that all potentially optimal boxes in S are on the lower right convex hull of the scatter plot in Fig. 2. To produce more tasks in parallel, new points are sampled around all boxes in S along their longest dimensions during SAMPLING. This modification also removes the step ITERATION, thus simplifying the loop. In the DIVISION step, multiple new boxes are generated for each potentially optimal box. The multiple function evaluation tasks at each iteration give rise to a natural functional parallelism, which is especially beneficial for expensive objective functions. The parallel implementation distributes the work to multiple masters in the SELECTION phase. The functions are then evaluated by the pool of workers to accomplish SAMPLING. VTDIRECT95 also supports a user-level checkpointing method to restart function evaluations through log files. Several other options are provided by the optimization algorithm to improve the performance on large-scale parallel systems.
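As an illustration of the SELECTION-SAMPLING-DIVISION loop just described, the following is a minimal single-process sketch in Python. It is not the VTDIRECT95 implementation: the selection rule is simplified to keeping the best box per distinct diameter instead of the exact lower-right convex hull, and all names and the toy objective are illustrative. In VTDIRECT95 the evaluations performed inside the SAMPLING step are what get farmed out to the worker pool.

import numpy as np

def direct_sketch(f, lower, upper, max_evals=200):
    """Simplified DIRECT-style search over the box [lower, upper]."""
    lower, upper = np.asarray(lower, float), np.asarray(upper, float)
    span = upper - lower

    def unscale(u):                      # unit hypercube -> original domain
        return lower + u * span

    n = lower.size
    boxes = [(np.full(n, 0.5), np.full(n, 1.0), f(unscale(np.full(n, 0.5))))]
    evals = 1
    while evals < max_evals:
        # SELECTION (simplified): best box for each distinct box diameter.
        best_per_diam = {}
        for idx, (c, s, fc) in enumerate(boxes):
            d = round(float(np.linalg.norm(s / 2.0)), 12)
            if d not in best_per_diam or fc < boxes[best_per_diam[d]][2]:
                best_per_diam[d] = idx
        for idx in set(best_per_diam.values()):
            c, s, fc = boxes[idx]
            longest = np.flatnonzero(np.isclose(s, s.max()))
            delta = s.max() / 3.0
            # SAMPLING: evaluate c +/- delta*e_i along every longest dimension.
            samples = {}
            for i in longest:
                for sign in (+1.0, -1.0):
                    p = c.copy()
                    p[i] += sign * delta
                    samples[(i, sign)] = (p, f(unscale(p)))
                    evals += 1
            # DIVISION: trisect along longest dims, lowest w_i first.
            order = sorted(longest, key=lambda i: min(samples[(i, +1.0)][1],
                                                      samples[(i, -1.0)][1]))
            for i in order:
                s = s.copy()
                s[i] /= 3.0
                for sign in (+1.0, -1.0):
                    p, fp = samples[(i, sign)]
                    boxes.append((p, s.copy(), fp))
            boxes[idx] = (c, s.copy(), fc)   # shrunken parent box keeps its center
    best = min(boxes, key=lambda b: b[2])
    return unscale(best[0]), best[2]

# Usage: minimize a toy quadratic over [-2, 2]^2.
x_best, f_best = direct_sketch(lambda x: float(np.sum((x - 0.3) ** 2)), [-2, -2], [2, 2])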
3 Design of Integrated System
In nuclear structure calculations, the computed and experimental results are matched for single as well as multiple nuclei using the χ2 function. Thus, a general case of multiple nuclei is considered in the design. At the initialization, the search algorithm divides the search domain into multiple sub-domains and assigns each sub-domain a set of masters (SM). Each sub-domain master is responsible for the generation of evaluation points within its own sub-domain. Figure 3 shows a diagram for the case of multiple nuclei, N1, N2, and N3, with one MFDn execution per nucleus. The overall hierarchy consists of three tiers. The first two tiers correspond to the set of processors used by VTDIRECT95 while the last tier corresponds to the set of processors used by MFDn. It is denoted as the MFDn pool and enclosed in a dash-lined oval. The evaluation points generated by the SM are then handed to the workers W assigned to VTDIRECT95 (represented as shaded squares in Fig. 3). These workers act as connectors between VTDIRECT95 and MFDn. Each worker W pre-processes the evaluation point and then delivers the pre-processed data to the MFDn pool. The number of processing elements (PE) required by MFDn typically depends on the size of the Hamiltonian matrix and the hardware characteristics.
Fig. 3. Design layout for the integration system
Another functionality of a W worker is therefore to assemble from the MFDn pool the needed number of PEs for an MFDn run. Once an instance of MFDn completes, its output is communicated back to the associated W worker, which is responsible for computing the χ2 function aggregating the MFDn outputs for all the nuclei and for delivering it to the proper SM.
3.1 Enabling Seamless Integration
The goal of this design is to treat both the VTDIRECT95 and the MFDn codes as "black-box" components and provide an infrastructure to integrate them. Here, we detail the implementation of the design. The MFDn code requires an input file which contains the set of parameters for every MFDn run. The output of the MFDn code is a text file including theoretical observables, such as excitation energies for a given nucleus for the desired number of states. The optimization algorithm produces points in a search space in each iteration and then accepts the function evaluation at those points. Thus, both MFDn and VTDIRECT95 have well-defined input and output interfaces. Consider the following external additions necessary to interface MFDn and VTDIRECT95:
1. Input Modifier (IM) inserts the points generated by VTDIRECT95 into an input file for MFDn.
2. Wait (W) waits for a completion signal from the MFDn code. The Wait stub may also gather the runtime performance information from MFDn.
3. Output Modifier (OM) post-processes the MFDn output file and creates an input for the χ2 function.
For a single iteration of VTDIRECT95, Fig. 4 shows a workflow between the VTDIRECT95 and MFDn codes. The main goal of the described additions is to provide interfaces and placeholders for pre- and post-processing. Thus, we denote them as stubs in Fig. 4. VTDIRECT95 generates an evaluation point which is inserted into the MFDn input file through the IM stub. MFDn produces
then a set of values to be captured by the OM stub and provided to the χ2 evaluator, which in turn returns the objective function value to VTDIRECT95. All the stubs related to the MFDn execution are grouped in a single box together with MFDn in Fig. 4. The χ2 evaluator is a generic module, the implementation of which may change depending on the application scientist's goals. Similarly, VTDIRECT95 is depicted as a separate box since it is a general-purpose optimization code. The integrated workflow, due to its flexibility, allows changing the objective function construction and the search algorithm without affecting the overall organization.
Fig. 4. Workflow diagram for the MFDn and VTDIRECT95
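A minimal Python sketch of the worker-side wiring described above follows. The stub names (IM, W, OM) come from the text, but the MFDn executable name, its input-file format, the output-line format, and the mpirun-style launch are all placeholders invented for illustration, since the paper does not specify them.

import re
import subprocess

def im_stub(point, template="mfdn_input.tmpl", path="mfdn.input"):
    """Input Modifier: write the evaluation point v = (V0, V1, Vtens) into an
    MFDn input file by filling hypothetical @V0@/@V1@/@VTENS@ markers."""
    with open(template) as f:
        text = f.read()
    for name, value in zip(("V0", "V1", "VTENS"), point):
        text = text.replace("@%s@" % name, "%.8f" % value)
    with open(path, "w") as f:
        f.write(text)
    return path

def w_stub(input_path, nprocs):
    """Wait stub: launch MFDn on nprocs PEs and block until it completes."""
    subprocess.run(["mpirun", "-np", str(nprocs), "mfdn", input_path], check=True)

def om_stub(output_path="mfdn.out"):
    """Output Modifier: extract theoretical levels (energy, spin) from the
    MFDn output; the 'LEVEL <energy> J=<spin>' line format is invented."""
    levels = []
    with open(output_path) as f:
        for line in f:
            m = re.match(r"\s*LEVEL\s+(\S+)\s+J=(\S+)", line)
            if m:
                levels.append((float(m.group(1)), m.group(2)))
    return levels

def evaluate_point(point, experimental, chi2, nprocs=15):
    """One objective-function evaluation performed by a worker W."""
    w_stub(im_stub(point), nprocs)
    return chi2(om_stub(), experimental)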
3.2 Construction of the χ2 Function
The χ2 is constructed using a theory file, an experimental file, and the base energy value of the given nucleus. The theory file is an output from the MFDn code which contains calculated observables. The experimental file has the energy levels as found experimentally by different national scientific organizations [9,10]. In addition to the energy values, each energy level is associated with the spin j of the protons/neutrons. There are many options in the construction of χ2. A particular choice depends on such parameters as the quality of the experimental data and the questions nuclear physicists want to answer comparing the theoretical and experimental energy levels. As an example consider the following χ2 definition:

χ2(v) = Σ_{1≤i_t≤15, 1≤i_e≤k} ( E_{i_e}(v) − Ẽ_{i_t}(v) )² · σ²_{i_e} ,    (2)

where v = (V0, V1, Vtens), and E and Ẽ are the absolute experimental and theoretical energies, respectively; i_e and i_t are the indices of the corresponding matched energy levels (one i_e paired with one i_t), i.e., those levels that have the same spin j; and k is the maximum desired number of matches. Each experimental energy level l_e is assigned the weight σ_{l_e}. This weight is inversely proportional to the distance of that energy level from the ground energy level. In (2), different weight schemes were considered, such as, for every l_e, σ_{l_e} = 1/l_e or σ_{l_e} = 1/2l_e.
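A small sketch of how Eq. (2) might be computed once the level matching has been done, in Python. The interpretation of l_e as the excitation energy of the experimental level above the ground state is one reading of the text and is an assumption, as are all names and the toy numbers.

def chi_squared(matched_pairs, ground_energy):
    """matched_pairs: list of (E_exp, E_theory) for levels paired by equal
    spin j, as in Eq. (2).  Weight sigma_le = 1/l_e, with l_e taken here as
    the excitation energy of the experimental level above the ground state."""
    total = 0.0
    for e_exp, e_th in matched_pairs:
        l_e = abs(e_exp - ground_energy)
        sigma = 1.0 / l_e if l_e > 0.0 else 1.0   # guard for the ground level
        total += (e_exp - e_th) ** 2 * sigma ** 2
    return total

# Usage with made-up energies (MeV): three matched levels.
print(chi_squared([(0.0, 0.001), (3.83, 3.79), (4.50, 4.61)], ground_energy=0.0))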
4 Experiments
Computing platforms at the National Energy Research Scientific Computing Center (NERSC) at the Lawrence Berkeley National Laboratory served as the testbed for the development and testing. The results presented here were obtained on the NERSC IBM p575 POWER 5 system, named Bassi. It is a distributed-memory computer with 888 processors (comprising 111 nodes, each sharing 32 GBytes of memory) used for scientific applications. Each Bassi processor has a theoretical peak performance of 7.6 GFlops, and the nodes are interconnected by the IBM "Federation" HPS switch. An implementation of the proposed design for single and multiple nuclei has been developed and tested. It integrates the serial VTDIRECT95 and parallel MFDn codes, while the usage of parallel VTDIRECT95 has been left as future work. The following three nuclei have been considered: 47K, 48Ca, and 49Ca. The Hamiltonian matrices are sparse and their sizes are 136231, 12000, and 15666, respectively, in the lowest available model space. The MFDn execution time depends heavily on the Hamiltonian size. Also, the complexity ("shape") of the objective function drives the time required by the optimization algorithm to find the minimum. Figure 5(a) shows the number of evaluations required per VTDIRECT95 iteration as the number of iterations grows for a sample nucleus. For the multiple-nuclei fit, the runtime is guided by the runtime of the heaviest nucleus since the evaluations for all the nuclei are needed to construct the χ2 in this case. Therefore, it is desirable to adjust the number of processors allocated to a particular nucleus based on its computational cost relative to the other nuclei in the set. In particular, for our example, 47K has the largest Hamiltonian matrix, so executing it on the largest subset of processors makes sense. Figure 5(b) depicts the case when all three nuclei are evaluated once with three different sets of processor numbers (shown on the x-axis). The parallel time to compute all three nuclei indeed decreases as the number of processors is increased compared with the base case of using a small equal number of processors for each nucleus. When about twice as many (15) processors are assigned to the calculation of each nucleus, the timings for the smaller Hamiltonians decrease by about half. The runtime for 47K, however, decreases only slightly. By augmenting the number of processors to 45 for 47K, the execution time is decreased dramatically, while for the smallest Hamiltonian of 48Ca, the time has actually increased with the increase in the number of processors to 21. The latter indicates that the parallel overhead starts to dominate the overall execution time. In general, besides the Hamiltonian matrix size, other factors such as communication overhead and hardware characteristics may affect the number of processors used to calculate a particular nucleus efficiently. The optimization of the parameters for 48Ca produced results (Fig. 6(a)) matching theoretical observables correctly to their counterparts in the experimental data file. The first six states have been matched correctly with their spins, and the difference in energy levels between theoretical observables and experimental data is less than 0.001 MeV. The second nucleus considered is 49Ca.
Fig. 5. Execution of MFDn for single- and multiple-nuclei cases. (a) Evaluations per iteration for 49Ca. (b) Execution time of 1 evaluation of VTDIRECT95 for 47K, 48Ca, and 49Ca on different numbers of processors on Bassi.
The ground-state level in the theoretical observables was matched within 0.02 MeV of the experimental ground-state energy value (Fig. 6(b)). Higher energy levels in the theoretical observables remained unmatched to counterparts in the experimental data. A likely reason is that some of the energy levels in the experimental data have uncertain spin assignments. The findings for 49Ca are important from the physics point of view since they give new directions for fitting nuclei with a similar mass starting with the obtained set of parameters. Similar arguments may be used to explain the unmatched energy levels in 47K. Checkpointing is a valuable feature of VTDIRECT95. Supercomputers with batch scheduling typically have an upper bound on the time any job is allowed to execute.
Fig. 6. Matching of experimental and theoretical energy levels. (a) 48Ca. (b) 49Ca.
For example, the maximum time permitted on NERSC supercomputers is only 48 hours, which is surely not enough to find the global or even a local minimum for such an expensive function evaluation as described in this paper. Hence, we have utilized the checkpointing feature as a routine procedure to restart the integrated code for the next maximum time allowed by the queuing system.
5 Summary and Future Work
We have proposed a design for the integration of the MFDn and VTDIRECT95 parallel codes. The integration uses the master-worker paradigm of the VTDIRECT95 code and produces a three-tier scheme. Our contribution is to show how an expensive multiprocessor function evaluation may fit into this scheme. In the paper, we have shown an implementation of the proposed design for the case of sequential VTDIRECT95, which produces one point at a time. In this situation, we have studied various definitions of the objective function (χ2 ) and obtained good matches between the theoretical and experimental energy levels for 48 Ca and the ground-state energy level for 49 Ca. We have also investigated the efficient executions of MFDn when the results from multiple nuclei are to be considered simultaneously by VTDIRECT95. We have found that assigning different numbers of processors to different MFDn executions, typically in accordance with the Hamiltonian matrix size, reduces the overall time for a function evaluation needed by VTDIRECT95. In the future, we plan to provide an implementation taking advantage of parallelism in both VTDIRECT95 and MFDn and also to consider a variety of nuclei in the multiple-nuclei case. We will also compare the results produced by VTDIRECT95 with other derivative-free optimization techniques, such as those from the Toolkit for Advanced Optimization (TAO) [11]. The integrated software will be a useful tool for a wide range of ab initio nuclear physics calculations. Acknowledgments. The work was supported in part by Iowa State University under the contract DE-AC02-07CH11358 with the U.S. Department of Energy, by the U.S. Department of Energy under the grants DE-FC02-07ER41457 (UNEDF SciDAC-2) and DE-FG02-87ER40371 (Division of Nuclear Physics), and by NERSC.
References 1. Navratil, P., Vary, J., Barrett, B.: Properties of 12 C in the ab-initio Nuclear Shell Model. Phys. Rev. Lett. 84, 5728 (2000) 2. Navratil, P., Vary, J., Barrett, B.: Large-basis ab-initio No-core Shell Model and its application to 12 C. Physical Review C62, 54311 (2000) 3. Vary, J., Popescu, S., Stocia, S., Navratil, P.: No Core Shell Model A=47 and A=49. nucl-th/0607041/ 4. Machleidt, R., Sammarruca, F., Song, Y.: Nonlocal nature of the nuclear force and its impact on nuclear structure. Phys. Rev. C53, C63, 024001, Ref 9 (1996, 2001)
5. Vary, J.: The Many-Fermion Dynamics Shell-Model Code. Iowa State University (1992) 6. He, J., Watson, L., Sosonkina, M.: Algorithm XXX: VTDIRECT95: Serial and parallel codes for the global optimization algorithm DIRECT. ACM Transactions on Math. Soft. (submitted, 2007) 7. Jones, D.: The DIRECT global optimization algorithm. Encyclopedia of Optimization 1, 431–440 (2001) 8. Jones, D., Perttunen, C., Stuckman, B.: Lipschitzian optimization without the Lipschitz constant. J. Optimization Theory and Applications 79, 157–181 (1993) 9. Electronic version of nuclear data sheets. telnet://bnlnd2.dne.bnl.gov 10. Burrows, T.: Nuclear data sheets, 74, 1; Nuclear data sheets 76, 191 (1995) 11. Benson, S., McInnes, L.C., Moré, J., Munson, T., Sarich, J.: TAO user manual (revision 1.9). Technical Report ANL/MCS-TM-242, Mathematics and Computer Science Division, Argonne National Laboratory (2007), http://www.mcs.anl.gov/tao
Non-uniform Distributions of Quantum Particles in Multi-swarm Optimization for Dynamic Tasks Krzysztof Trojanowski Institute of Computer Science, Polish Academy of Sciences Ordona 21, 01-237 Warsaw, Poland [email protected]
Abstract. This paper presents research on a mixed multi-swarm optimization approach applied to dynamic environments. One version of this approach, called mQSO, is the subject of our special interest. The mQSO algorithm works with a set of particles divided into sub-swarms, where every sub-swarm consists of two types of particles: classic and quantum ones. The research is focused on studying properties of the latter type. Two new distributions of new locations for the quantum particles are proposed: a static and an adaptive one. Both of them are based on an α-stable symmetric distribution. In contrast to already published methods of distribution of new locations, the proposed methods allow the locations to be distributed over the entire search space. The obtained results show the high efficiency of the mQSO approach equipped with the proposed two new methods.
1 Introduction
In the presented research a mixed multi-swarm optimization in dynamic environments is studied. Application of the mixed multi-swarm approach to dynamic optimization has already been tested and has proved its efficiency. Approaches with a static and a varying number of sub-swarms [1], [2], [3] as well as approaches with an adaptive number of species in the swarm [4], [5], [6] have been researched. A version with a static number of sub-swarms called mQSO, where two types of particles are in use (quantum particles and classic ones), and especially their rules of movement, became the subject of our interest. Classic particles use velocity vectors to evaluate their new positions. The rules for quantum particles are based on the random distribution of possible new locations of a particle around its current location, similarly to the distribution of the locations in the quantum cloud of the atom. The rules for quantum particles proposed in [2], [3] are based on the idea of uniform distribution over the space of a quantum cloud, which is a hyper-sphere of a constant radius with the current location in the middle. In this paper we examine strategies of the quantum particle movement that are alternative to the strategy mentioned above. Two methods of evaluation of new locations of the quantum particle are proposed and the efficiency of the mQSO equipped with these methods is experimentally verified.
As a dynamic test-bed the MPB [7] generator was selected. In MPB we optimize in a real-valued 5-dimensional search space and the fitness landscape is built of a set of unimodal functions individually controlled by parameters that allow creating different types of changes. The paper is organized as follows. In Sec. 2 there is a brief presentation of the optimization algorithm. Two new methods of generation of the quantum particle's new locations are described in Sec. 3. Section 4 presents the settings of the algorithm's parameters. Section 5 shows a measure used for evaluation of the results of experiments and the selected testing environment, while Sec. 6 presents the results of experiments performed with the environment. Section 7 concludes the presented research.
2 Quantum Multi-swarm
A simple scheme of the particle swarm optimization algorithm is given in Algorithm 1. A PSO optimizer is equipped with a set of particles xi where i ∈ [1, . . . , N]. Each of the particles represents a solution in an n-dimensional real-valued search space. For the search space a fitness function f(·) is defined which is used to evaluate the quality of the solutions. A particle yi represents the best solution found by the i-th particle (called the particle attractor), and a particle y∗ – the best solution found by the swarm (called the swarm attractor). The scheme is written for a maximization problem.

Algorithm 1. The particle swarm optimization
  Create and initialize the swarm
  repeat
    for i = 1 to N do
      if f(xi) > f(yi) then
        yi = xi
      end if
      if f(yi) > f(y∗) then
        y∗ = yi
      end if
    end for
    update location and velocity of all the particles
  until stop condition is satisfied
Search properties of the PSO scheme in Algorithm 1 are represented by the step "update location and velocity". In this step there are two main actions performed: first the velocity of each of the particles is updated and then all the particles change their location in the search space according to the new values of velocity vectors and the kinematic laws. Formally for every iteration t of the search process every j-th coordinate of the vector of velocity v as well as the coordinate of the location x undergo the following transformation [8]:

v_j^{t+1} = χ ( v_j^t + c1 r1^t (y_j^t − x_j^t) + c2 r2^t (y_j^{∗t} − x_j^t) ),
x_j^{t+1} = x_j^t + v_j^{t+1},    (1)
where r1^t and r2^t are random values uniformly generated in the range [0, 1], χ is a constriction factor with χ < 1, and c1 and c2 control the attraction to the best found personal and global solutions, respectively. The basic idea presented in Algorithm 1 has been developed for non-stationary optimization applications. One of the first significant changes in this scheme is the introduction of a multi-swarm. In the presented approach the number of sub-swarms is constant during the process of searching. Each of them is treated as an independent self-governing population which is not influenced by any of the neighbors. However, there are mechanisms which periodically perform some actions based on the information about the state of search of the entire swarm [3]. To guarantee the appropriate distribution of the sub-swarms over the entire search space, the exclusion mechanism eliminates sub-swarms which are located too close to each other. When the sub-swarms are too close to each other, the occupation of the same optimum is most likely to occur. In this case one of them is selected to be eliminated and a new one is generated from scratch. Any two sub-swarms are considered as located too close to each other if, for the best solutions from the compared two sub-swarms, the Euclidean distance is closer than the defined threshold ρ. In [3] yet another mechanism of sub-swarm management was proposed, called anti-convergence, which protects against convergence of sub-swarms. However, it was turned off in the presented experiments. The last of the sub-swarm management mechanisms described in [3] is based on mixing of types of particles in sub-swarms. In the presented research the mixed sub-swarms consist of two types of particles governed by two different rules of movement. While the location of the particles of the first type is evaluated according to the classic formulas as discussed above, the remaining ones are treated as quantum particles and change their location according to the analogy with the quantum dynamics of particles. All the particles in such a mixed sub-swarm share the information about the current best position and the best position ever found by the sub-swarm. The idea of the quantum particle proposed by Blackwell and Branke in [2] originates from the quantum model of the atom where the trajectories of electrons are described as quantum clouds. Adaptation of this idea to the model of movement of the particles rejects the kinematic laws used in classic PSO for evaluation of a distance traveled by the particle with a constant velocity in a period of time. Instead, a new position of the quantum particle is randomly generated inside a cloud of the given range rcloud surrounding y∗, i.e. the current sub-swarm attractor. In the quantum model the particle's speed becomes irrelevant, because every location inside the cloud can be chosen as a new location with a non-zero probability. The model of quantum particles has been extended in this paper. Since the model proposed in [2] assumes the uniform distribution of the set of possible new locations of the particle over the cloud's space, it was interesting to test and verify other types of distributions.
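A minimal Python sketch of the two movement rules just described is given below. The sub-swarm bookkeeping, exclusion, and anti-convergence mechanisms are omitted, and all names are illustrative; the constants come from the parameter settings quoted in Sec. 4.

import numpy as np

rng = np.random.default_rng(0)
CHI, C1, C2 = 0.7298, 2.05, 2.05        # constriction-factor settings (see Sec. 4)

def move_classic(x, v, y, y_star):
    """Constriction-factor update of Eq. (1) for one classic particle."""
    r1, r2 = rng.random(x.size), rng.random(x.size)
    v_new = CHI * (v + C1 * r1 * (y - x) + C2 * r2 * (y_star - x))
    return x + v_new, v_new

def move_quantum_uniform(y_star, r_cloud):
    """Baseline quantum rule: a point drawn uniformly from the hyper-sphere
    (ball) of radius r_cloud centred at the sub-swarm attractor y_star."""
    direction = rng.normal(size=y_star.size)
    direction /= np.linalg.norm(direction)
    radius = r_cloud * rng.random() ** (1.0 / y_star.size)   # uniform in the ball
    return y_star + radius * direction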
3 Movement of Quantum Particles
Two new types of distribution of new locations are considered in this paper. The first one is defined with static rules while the second one with adaptive rules. Both of them are based on a two-phase mechanism. In the first phase a direction θ is selected. In the second phase a distance d from the original is calculated. The direction θ can be obtained with use of a random variable from the angularly uniform distribution on the surface of a hyper-sphere [9]. The distance d is an α-stable random variate and is computed as follows:

d = SαS(0, σ),    (2)

and

σ = rSαS · (Dw/2),    (3)

where SαS(·, ·) represents an α-stable symmetric distribution variate and Dw is the width of the feasible part of the domain, i.e. the distance between a lower and an upper boundary of the search space. The new location is based on the found direction θ and a distance d from the original. This is an isotropic distribution. The α-stable distribution is controlled by four parameters: stability index α (0 < α ≤ 2), skewness parameter β, scale parameter σ and location parameter μ. The Chambers-Mallows-Stuck method of generation of the α-stable symmetric random variables [10] can be used. The method for σ = 1 and μ = 0, with a correction for the case where α = 1 given by Weron in 1996 [11], is presented in (4). To calculate the α-stable distributed random variate X two other independent random variates are needed: a random variate U, which is uniformly distributed on [−π/2, π/2], and an exponential random variate W obtained with rate parameter λ = 1:

X = S_{α,β} · [ sin(α(U + B_{α,β})) / (cos U)^{1/α} ] · [ cos(U − α(U + B_{α,β})) / W ]^{(1−α)/α},   if α ≠ 1,
X = (2/π) [ (π/2 + βU) tan U − β ln( (W cos U) / (π/2 + βU) ) ],   if α = 1,    (4)

where B_{α,β} = (1/α) arctan( β tan(πα/2) ) and S_{α,β} = [ 1 + β² tan²(πα/2) ]^{1/(2α)}. In the symmetric version of this distribution (called SαS, i.e. symmetric α-stable distribution) β is set to 0. For α = 2 the SαS(μ, σ) distribution reduces to the Gaussian N(μ, σ) and in the case of α = 1 the Cauchy C(μ, σ) is obtained. In (3) rSαS is a scale parameter. The difference between the static and the adaptive version of the distribution is in the way of calculation of the distance d. In the static version the distance depends merely on the α-stable generator. In the adaptive version it depends on the value returned by the generator multiplied by the normalized fitness of the particle. The latter version was inspired by a mutation operator introduced in [12], where it was a component of the immune optimization algorithm called opt-aiNet, designed for multimodal function optimization.
The mutation operator uses independent random variates for modification of each of the coordinates. This approach gives an isotropic distribution for Gaussian random variables but unfortunately turns into a non-isotropic one for any other α-stable distribution in a multidimensional search space. We wanted to keep to isotropic distributions, therefore the operator was not migrated as-is. In our adaptive approach the direction is evaluated in the same way as in the static one, but the distance is calculated with respect to the current fitness values of the remaining antibodies in P:

d = SαS(0, σ) · exp(−f̃(x_i)),    (5)

where σ is calculated as in (3) and f̃(x_i) is the fitness of the i-th solution x_i normalized in [0, 1] with respect to the fitness values of all the solutions in P:

f̃(x_i) = ( f(x_i) − f_min ) / ( f_max − f_min ),  where f_max = max_{j=1,...,N} f(x_j) and f_min = min_{j=1,...,N} f(x_j).    (6)
Fig. 1. Distribution of the new points in 2-dimensional search space for α: 2, 1, 0.5, and 0.1
Fig. 1 presents sample distributions of a set of points in the 2-dimensional search space generated for the same original with the static method of generation. There are four distributions for four different values of α: 2, 1, 0.5, and 0.1. It often happens that the domain of possible solutions is limited by a set of box constraints and only the solutions that fit within the constraints are classified as feasible. Both types of distribution presented above allow new locations to be generated over the entire domain, so it is possible to generate feasible as well as unfeasible locations. From the theoretical point of view we can easily cope with unfeasible locations simply by allowing them to stay where they are, because the evaluation function formula is usually defined for all points in Rn. However, from the engineering point of view we cannot accept such a free treatment, since in the real world the constraints are based on the knowledge of the modeled phenomenon and represent its features, e.g. temperature (which cannot be less than -273 C or higher than some reasonable limit: +100 C for water or the smoke point for an oil). Therefore in the presented research it is assumed that the domain of possible solutions is limited by a set of box constraints and only the solutions that fit within the constraints are classified as feasible.
Since the main focus of this paper is not constrained optimization, we selected a very simple procedure for immediately repairing unfeasible particles. Namely, the j-th coordinate of the solution x breaking its box constraints is trimmed to the exceeded limit, i.e.: if xj < loj then xj = loj, and if xj > hij then xj = hij. The procedure is applied in the same way to both types of particles, the classic and the quantum ones. In the case of classic particles the velocity vector v of the repaired particle stays unchanged even if it still leads the particle outside the acceptable search space.
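Below is a small Python sketch of the two proposed sampling rules (static and adaptive), combining Eqs. (2)-(6) with the trimming repair just described. It assumes the new location is generated around the sub-swarm attractor, as in the baseline quantum rule, and all function and variable names are illustrative.

import numpy as np

rng = np.random.default_rng(1)

def sas_variate(alpha):
    """Symmetric alpha-stable variate (beta = 0) via the
    Chambers-Mallows-Stuck method of Eq. (4)."""
    u = rng.uniform(-np.pi / 2.0, np.pi / 2.0)
    w = rng.exponential(1.0)
    if abs(alpha - 1.0) < 1e-12:
        return np.tan(u)                       # Cauchy case
    return (np.sin(alpha * u) / np.cos(u) ** (1.0 / alpha)
            * (np.cos(u - alpha * u) / w) ** ((1.0 - alpha) / alpha))

def new_quantum_location(center, alpha, r_sas, lo, hi, fitness_norm=None):
    """Static rule if fitness_norm is None, adaptive rule (Eq. 5) otherwise;
    fitness_norm is the particle's fitness normalized to [0, 1] as in Eq. (6)."""
    sigma = r_sas * (hi - lo) / 2.0            # Eq. (3), with Dw = hi - lo
    d = sigma * sas_variate(alpha)             # Eq. (2): distance from the original
    if fitness_norm is not None:
        d *= np.exp(-fitness_norm)
    theta = rng.normal(size=center.size)
    theta /= np.linalg.norm(theta)             # angularly uniform direction
    x = center + d * theta
    return np.clip(x, lo, hi)                  # trim to the box constraints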
4 Settings of the Algorithm's Parameters
The settings of the algorithm's parameters applied to the experiments presented below originate from [3]. In the cited publication the authors present results of experiments obtained for different configurations of swarms tested with the MPB benchmark where there are 10 moving peaks. Among the many tested configurations, the best results for the optimization problem with 10 moving peaks are obtained where there are 10 sub-swarms and each of them consists of five classic particles and five quantum ones (see Table III in [3]). The total population of particles consists of 100 solutions divided equally into 10 sub-swarms. The values of the pure PSO parameters are: c1,2 = 2.05 and χ = 0.7298. For QSO the range of exclusion is set to 31.5 (for the best performance the value of ρ should be set close to 30; however, the precision of this parameter's setting is not crucial, and in [3] the authors claim that the algorithm is not very sensitive to small changes of ρ). In the presented algorithm there is no strategy for detecting the appearance of a change in the fitness landscape. Since our main goal was studying the properties of the different distributions of the quantum particles, we assumed that it would just introduce yet another unnecessary bias into the obtained values of offline error and make their analysis even more difficult. Therefore a change is known to the system instantly as it appears and there is no additional computational effort for its detection. When the change appears, all the solutions stored in both classic and quantum particles are reevaluated and the swarm memory is forgotten. Classic particles' attractors are overwritten by the current solutions represented by these particles and sub-swarms' attractors are overwritten by the current best solutions in the sub-swarms.
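For reference, these settings can be collected into a single configuration object; a hedged sketch in Python follows (the key names are invented, the values are the ones quoted above).

MQSO_SETTINGS = {
    "num_subswarms": 10,
    "classic_particles_per_subswarm": 5,
    "quantum_particles_per_subswarm": 5,
    "c1": 2.05,
    "c2": 2.05,
    "chi": 0.7298,                 # constriction factor
    "exclusion_radius_rho": 31.5,  # range of exclusion
    "anti_convergence": False,     # switched off in the presented experiments
    "change_detection": None,      # changes are assumed to be known instantly
}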
5 Applied Measure and the Benchmark
In the performed experiments the offline error (briefly oe) measure [7,13] of obtained results was used. The offline error represents the average deviation from the optimum of the fitness of the best individual evaluated since the last change of the fitness landscape. Every time the solution’s fitness is evaluated,
an auxiliary variable is increased by the value which is the deviation of the best solution evaluated since the last change, including the one just evaluated as well. When the experiment is finished, the sum in the variable is divided by the total number of evaluations and returned as the offline error. Formally:

oe = (1/N_c) Σ_{j=1}^{N_c} [ (1/N_e(j)) Σ_{i=1}^{N_e(j)} ( f_j^∗ − f_{ji}^∗ ) ],    (7)

where N_c is the total number of changes of the fitness landscape in the experiment, N_e(j) is the number of evaluations of the solutions performed for the j-th state of the landscape, f_j^∗ is the value of the optimal solution for the j-th landscape (i.e. between the j-th and (j+1)-th change in the landscape) and f_{ji}^∗ is the current best found fitness value for the j-th landscape, i.e. the best value found among the ones belonging to the set from f_{j1} till f_{ji}, where f_{ji} is the value of the fitness function returned for its i-th call performed for the j-th landscape.
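A short Python sketch of Eq. (7), assuming the raw fitness values are logged per landscape state (all names and the toy numbers are illustrative):

def offline_error(optima, fitness_log):
    """optima[j]     : optimum value f_j* for the j-th landscape state,
    fitness_log[j]   : fitness values in the order they were evaluated
                       during the j-th state (N_e(j) entries)."""
    total = 0.0
    for f_opt, values in zip(optima, fitness_log):
        best_so_far = float("-inf")
        deviations = []
        for f in values:                    # f_ji* = best value seen so far
            best_so_far = max(best_so_far, f)
            deviations.append(f_opt - best_so_far)
        total += sum(deviations) / len(deviations)
    return total / len(optima)

# Usage with toy numbers: two states, optimum 70 each, a few evaluations.
print(offline_error([70.0, 70.0], [[40.0, 55.0, 60.0], [50.0, 65.0]]))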
6 Results of Experiments
We started our research by repeating some of the experiments presented in [3]. It was necessary to make the earlier results comparable to the current ones, since in our case the period of the first 10 changes in the environment is excluded from calculating the offline error, which makes the values of oe significantly smaller than those in [3]. We repeated experiments with the uniform distribution of new locations inside a quantum cloud (further called Mcloud) for a series of values of rcloud: from 0.05 to 4.5 with step 0.05. The three best values of oe were: 1.6264 (std.dev.: 0.4104), 1.6297 (std.dev.: 0.4062), 1.6298 (std.dev.: 0.5227) and they were obtained for rcloud: 0.30, 0.35 and 0.25, respectively.
Fig. 2. Offline error for Mαstatic, i.e. the static version (on the left), and for Mαadapt, i.e. the adaptive version (on the right): rSαS vs. α

Table 1. The three best values of offline error obtained for the two tested methods of the new location generation for the case with 10 moving peaks

method          offline error   std. dev.   σ      α
Mαstatic [1]    1.4603          0.3066      0.25   1.35
Mαstatic [2]    1.4665          0.4188      0.25   1.80
Mαstatic [3]    1.5023          0.3518      0.35   1.00
Mαadapt [1]     1.4614          0.3255      0.60   1.70
Mαadapt [2]     1.4722          0.3041      0.85   1.70
Mαadapt [3]     1.5008          0.3496      0.60   1.75
In our research we wanted to get performance characteristics of the tested distributions first. This way we are able to compare not only the best results possible to obtain for a given test-case but also information about the robustness of the searching engines and their sensitivity to changes in their parameter settings. Thus, two large groups of experiments were performed. We tested the static version of the α-stable symmetric distribution — Mαstatic, and the adaptive version of the α-stable symmetric distribution — Mαadapt. The sets of tests with the two methods were based on variation of the values of two of the methods' parameters: α and rSαS. The former parameter varied from 0.5 to 2 with step 0.05 while the latter – from 0.001 to 0.1 with step 0.001. It gave 3100 configurations for each of the approaches. They were tested on the same class of dynamic environments built by the MPB benchmark with 10 moving peaks and with its parameters set to the values defined above. Obtained values of offline error for the two groups of experiments are presented in Fig. 2. In Table 1, for each of the distributions, the best three configurations and the values of offline error for each of the three are presented. The offline error for Mcloud is higher than for the two remaining methods.
Fig. 3. The first 10% of values of offline error sorted ascending for the two tested distributions – left side graph, and a zoom to the best first 20 values – right side graph
The significance levels obtained with Student's t tests indicate a difference between Mcloud and the distributions over the unlimited area: for both tests between Mcloud and the two methods, the significance level is lower than the commonly accepted level of 0.05 (for Mαstatic – p = 0.025437, for Mαadapt – p = 0.029881). Apart from Fig. 2 and Table 1, yet another quantitative comparison was performed based on the same set of results. The analysis is presented in Fig. 3. To generate Fig. 3, for each of the methods the values of offline error obtained for the tested configurations of parameters were sorted ascending. This way we can observe the sensitivity of the methods to changes in the parameter values as well as to small disturbances in the optimized fitness landscape. We can compare not only the best results but also the number of other configurations giving satisfying results, i.e. the size of the area of useful configurations of the distributions. A graph with the first 10% of sorted values of offline error is in Fig. 3. Both curves in Fig. 3 start from the level of offline error which is presented in Table 1 and which is almost the same for each of them. However, they quickly branch. It is clearly visible that the adaptive method outperforms the static method. The advantage is that the adaptive method is much more tolerant of a poor fit between the method's parameters and the properties of the fitness landscape. In other words, an appropriately tuned algorithm does very well with each of the methods; however, if the algorithm is not perfectly tuned to the problem, the loss of performance is much smaller for the adaptive method of distribution.
7 Conclusions
In this paper two methods of evaluation of the quantum particle’s new position are experimentally verified: the static and the adaptive one. Both methods employ the α-stable symmetrically distributed random variable. The distribution is controlled by the parameter α, which is responsible for the density of distribution of new locations around the quantum particle. Obtained results are satisfactory: they are better than those for uniform distribution of new locations
in the limited area around the quantum particle. Besides, the results of the series of experiments visualized in Fig. 3 showed that the adaptive method of distribution is less sensitive to small changes in its parameters than the static one.
References 1. Blackwell, T.: Particle Swarm Optimization in Dynamic Environments. In: Evolutionary Computation in Dynamic and Uncertain Environments. Studies in Computational Intelligence, vol. 51, pp. 29–49. Springer, Heidelberg (2007) 2. Blackwell, T., Branke, J.: Multi-swarm optimization in dynamic environments. In: Raidl, G.R., Cagnoni, S., Branke, J., Corne, D.W., Drechsler, R., Jin, Y., Johnson, C.G., Machado, P., Marchiori, E., Rothlauf, F., Smith, G.D., Squillero, G. (eds.) EvoWorkshops 2004. LNCS, vol. 3005, pp. 489–500. Springer, Heidelberg (2004) 3. Blackwell, T., Branke, J.: Multiswarms, exclusion, and anti-convergence in dynamic environments. IEEE Trans. Evol. Comput. 10(4), 459–472 (2006) 4. Li, X.: Adaptively choosing neighborhood bests in a particle swarm optimizer for multimodal function optimization. In: Deb, K., et al. (eds.) GECCO 2004. LNCS, vol. 3102, pp. 105–116. Springer, Heidelberg (2004) 5. Li, X., Branke, J., Blackwell, T.: Particle swarm with speciation and adaptation in a dynamic environment. In: GECCO 2006: Proc. Conf. on Genetic and Evolutionary Computation, pp. 51–58. ACM Press, New York (2006) 6. Parrot, D., Li, X.: Locating and tracking multiple dynamic optima by a particle swarm model using speciation. IEEE Trans. Evol. Comput. 10(4), 440–458 (2006) 7. Branke, J.: Memory enhanced evolutionary algorithm for changing optimization problems. In: Proc. of the Congress on Evolutionary Computation, pp. 1875–1882. IEEE Press, Piscataway (1999) 8. Clerc, M., Kennedy, J.: The particle swarm-explosion, stability, and convergence in a multi-dimensional complex space. IEEE Trans. Evol. Comput. 6(1), 58–73 (2002) 9. Marsaglia, G.: Choosing a point from the surface of a sphere. Ann. Math. Statist. 43(2), 645–646 (1972) 10. Chambers, J.M., Mallows, C.L., Stuck, B.W.: A method for simulating stable random variables. J. Amer. Statist. Assoc. 71(354), 340–344 (1976) 11. Weron, R.: On the Chambers-Mallows-Stuck method for simulating skewed stable random variables. Statist. Probab. Lett. 28, 165–171 (1996) 12. de Castro, L.N., Timmis, J.: An artificial immune network for multimodal function optimization. In: Proc. of the IEEE Congress on Evolutionary Computation, vol. 1, pp. 674–699. IEEE Press, Piscataway (2002) 13. Branke, J.: Evolutionary Optimization in Dynamic Environments. Kluwer Academic Publishers, Dordrecht (2002) 14. Trojanowski, K.: B-cell algorithm as a parallel approach to optimization of moving peaks benchmark tasks. In: Sixth International Conf. on Computer Information Systems and Industrial Management Applications (CISIM 2007), IEEE Computer Society Conf. Publishing Services, pp. 143–148 (2007)
An Integer Linear Programming for Container Stowage Problem Feng Li, Chunhua Tian, Rongzeng Cao, and Wei Ding IBM China Research Laboratory, Beijing 100094, P.R. China {lfeng, chtian, caorongz, dingw}@cn.ibm.com
Abstract. So far most stowage planning approaches consider only the stowage of a single type of container or do not consider the stability of the containership. Here a 0-1 linear programming model for this problem is proposed. The objective is to maximize space utilization while minimizing the operation cost from the upload and download of different types of containers at each port of a multi-port journey, with vessel stability, industry regulations and customized rules as the constraints. The model is solved with the branch and cut algorithm from COIN-OR (Common Optimization Interface for Operations Research), and a simulation system is developed to verify the feasibility and practicability of the model. By using the branch and cut algorithm of COIN-OR, the simulation system has shown that our model is efficient and robust. Keywords: Container Stowage Planning, Integer Programming.
1 Introduction
Since the 1970s, containerization has increasingly facilitated the transportation of cargos. Nowadays over 60 percent of deep-sea general cargo is transported in containers, whereas some routes, especially between economically strong and stable countries, are containerized up to 100 percent [1]. There are lots of shipping companies competing around the world to provide profitable container transportation services. In order to increase the benefit of economy of scale, the size of containerships has grown. The increase in capacity has been typically from relatively small 350 Twenty Foot Equivalent Units (TEUs) to containerships with capacities of more than 8000 TEUs [2]. While the increase of containership size contributes to incremental profits of shipping companies, it also brings a drawback, enlarging the complexity and difficulty of the arrangement of containers. The arrangement of containers for a containership is usually called the Container Stowage Problem (CSP). CSP is complicated because of its combinatorial nature. In recent operations research and management science literature, the methods developed to solve it can be grouped into several main classes: mathematical modeling, simulation based upon probability, decision support systems and heuristics. Among them, there are several objectives for the problem: optimizing available space and preventing
damage, minimizing berthing time [3], minimizing the total number of shiftings of containers [4], maximizing the containership's utilization, and so on. Unfortunately those approaches share some common limitations and are mainly devoted to the loading problem. The well-known mathematical model for CSP is integer linear programming [5]. Although those models can provide optimal solutions for CSP, they have incorporated too many simplification hypotheses, which have made them unsuitable for practical applications. The first simulation approach was completed by Shields [6]. In his work, a small number of stowage plans are randomly created to be evaluated and compared by simulation of the voyage, and the best is selected. Although it is efficient in practice, it does not guarantee the optimality of the solutions, and it also takes a long computation time to find a reasonably good solution. Later, Saginaw and Perakis [7] and Shin and Nam [8] developed decision support systems that are based on the knowledge of a manager or an operator in charge of loading and unloading operations, and Wilson et al. presented a realistic model, taking into account all technical restrictions, in order to implement a commercially usable decision support system. They decomposed the CSP into two phases: a strategic process and a tactical process [9]. However, those decision support systems only show a solution of a small sample problem and their efficiency from the practical viewpoint is not shown. The first heuristic for CSP was proposed by Martin et al. [10]. They addressed CSP for the transtainer system, and developed a heuristic algorithm to solve it. Since then Todd and Sen implemented a genetic algorithm procedure with multiple criteria such as proximity in terms of container location on board and minimization of unloading-related re-handles, transverse moment and vertical moment [11], Haghani and Kaisar developed a heuristic algorithm for CSP that minimizes container handling cost while keeping the containership's GM acceptable [12], and Dubrovsky et al. implemented a genetic algorithm-based heuristic for CSP [13]. These heuristics can produce complete and practical but rarely near-optimum solutions to the container stowage problem. Considering the problems mentioned above, an extended model for CSP will be proposed in this paper, which is used to find an optimal plan for stowing containers of different sizes into a containership on a multi-port journey, with a set of structural and operational restrictions. The objective is to maximize the containership utilization while minimizing the operation cost from container re-allocation. Such constraints and the objective are first described in detail, then a basic 0-1 Linear Programming model is proposed. Subsequently, a simulation system is presented to illustrate the efficiency of the model and compare it with a random stowage strategy.
2 Container Stowage Problem
When solving CSP, of particular interest are the constraints related to the structure of the containership and the size of the hold and upper deck. We consider here two types of containerships, namely Ro-Ro (Roll on-Roll off), which load/unload containers through the ramps located either at the bow or the stern
Fig. 1. Containership Structure
of the ship, and Lo-Lo (Lift on-Lift off), which load/unload containers from the top (by using cranes). The basic structure of a containership and its cross section are shown in Figure 1, which provides an idea of how container stowage takes place. There is a given number of locations for placing containers, which can vary depending on the containership. The most common location is 8 feet in height, 8 feet in width and 20 feet in length. Each location is identified by three indices, each consisting of two numbers, giving its position with respect to the three dimensions. In practice, each location is addressed by the following identifiers: (1) bay, which gives its position relative to the cross section of the containership (counted from bow to stern); (2) row, which gives its position relative to the vertical section of the corresponding bay (counted from the center outwards); (3) tier, which gives its position relative to the horizontal section of the corresponding bay (counted from the bottom to the top of the containership). Thus a container will be located in a given bay, a given row and a given tier. In order to describe CSP in detail, we propose the following notation. Bay Index Sets: Let I denote the index set of bays, A and B the index sets of the anterior-part and back-part bays, respectively, and E and O the index sets of the even and odd bays, respectively, where I = A ∪ B, I = E ∪ O, A ∩ B = ∅ and E ∩ O = ∅. Row Index Sets: Let Ri denote the index set of rows for the i-th bay, and LRi and RRi the index sets of the left and right rows of the i-th bay, respectively. For example, the row set of the second bay of the containership shown in Figure 1 is {09, 07, 05, 03, 01, 02, 04, 06, 08, 10}; then R2 = {1, 2, ..., 10}, LR2 = {1, 2, ..., 5} and RR2 = {6, 7, ..., 10}. Tier Index Sets: Let Tij denote the index set of tiers for the i-th bay and the j-th row of Ri, and let UTij denote the index set of upper-deck tiers for the i-th bay and the
j-th row of Ri, and BTij the index set of below-deck tiers of the i-th bay and the j-th row of Ri. Obviously, UTij ∪ BTij = Tij and UTij ∩ BTij = ∅. Port Set: Let P denote the set of ports on the journey of the containership, and suppose P = {1, 2, ..., |P|}. Container Sets: Let Cp (p = 1, 2, ..., |P|) denote the set of containers to be loaded on the containership at port p, and let C = ∪_{p=1}^{|P|} Cp. Let wc and dc denote the weight and destination port of container c, ∀c ∈ C, respectively. In this paper only two types of containers are considered; let F denote the set of 40-foot containers and W the set of 20-foot containers. Let Ĉp (p = 1, 2, ..., |P|) denote the set of containers that were loaded at some earlier port i (i = 1, 2, ..., p − 1) and whose destination ports come after p. Obviously, when p = 1, Ĉp = ∅.
2.1 Constraints
Using these notations, some structural and operational restrictions can be described as follows. Because the number of locations provided by the containership is finite, the maximum number of containers that can be stowed on the containership is limited. Thus, at the current port p,

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \cdot s_c \le m_t,    (1)

\sum_{i \in E} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \le m_f,    (2)

where x_{ijkc} are the decision variables of the optimization, with the following specification:

x_{ijkc} = \begin{cases} 1, & \text{if container } c \text{ is stowed in location } \{ijk\}, \\ 0, & \text{otherwise,} \end{cases}    (3)

and location {ijk} means the i-th bay, the j-th row of R_i and the k-th tier of T_{ij}. s_c denotes the size of container c ∈ C, with

s_c = \begin{cases} 1, & \text{if container } c \text{ is a 20-foot container,} \\ 2, & \text{if container } c \text{ is a 40-foot container.} \end{cases}    (4)

m_f denotes the maximum number of 40-foot containers that can be loaded on the ship, and m_t the maximum number of 20-foot containers that can be stowed on the ship. At the same time, each location of the containership can hold at most one container and each container occupies one and only one location of the containership, therefore

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} x_{ijkc} \le 1, \quad \forall c \in C_p \cup \hat{C}_p,    (5)

\sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \le 1, \quad \forall i \in I, \forall j \in R_i, \forall k \in T_{ij}.    (6)
The total weight of all containers stowed on the containership cannot exceed the maximum weight capacity of the containership:

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} x_{ijkc} \cdot w_c \le Q,    (7)

where Q denotes the maximum weight capacity of the containership. In practice, a 40-foot container cannot be stowed in an odd bay and a 20-foot container cannot be stowed in an even bay; stowing a 20-foot container in an odd bay contiguous to an even-bay location already chosen for a 40-foot container is infeasible, and conversely; a 20-foot container cannot be stowed on top of a 40-foot container; and no container may be stowed hanging without support underneath. These rules give

\sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} = 0, \quad \forall i \in O, j \in R_i, k \in T_{ij},    (8)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ijkc} = 0, \quad \forall i \in E, j \in R_i, k \in T_{ij},    (9)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)jkc} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in T_{ij},    (10)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)jkc} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in T_{ij},    (11)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in UT_{ij}, (k+1) \in UT_{(i+1)j},    (12)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in UT_{ij}, (k+1) \in UT_{(i-1)j},    (13)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in BT_{ij}, (k+1) \in BT_{(i+1)j},    (14)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)j(k+1)c} + \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 1, \quad \forall i \in E, j \in R_i, k \in BT_{ij}, (k+1) \in BT_{(i-1)j},    (15)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 0, \quad \forall i \in O, j \in R_i, k \in UT_{ij}, k+1 \in UT_{ij},    (16)

\sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{ijkc} \le 0, \quad \forall i \in O, j \in R_i, k \in BT_{ij}, k+1 \in BT_{ij},    (17)

2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - 2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)jkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)jkc} \le 0, \quad \forall i \in E, j \in R_i, k \in UT_{ij}, k+1 \in UT_{ij},    (18)

2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ij(k+1)c} - 2 \sum_{c \in F \cap (C_p \cup \hat{C}_p)} x_{ijkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i-1)jkc} - \sum_{c \in W \cap (C_p \cup \hat{C}_p)} x_{(i+1)jkc} \le 0, \quad \forall i \in E, j \in R_i, k \in BT_{ij}, k+1 \in BT_{ij}.    (19)
The stability of the ship is very important for deep-sea container transportation. Since an excessively unbalanced vertical, transverse and longitudinal distribution of the ship's weight, which is the factor with the greatest influence on stability, makes the ship unstable, a bad stowage plan may result in the instability of the ship. To assure the stability of a containership, a stowage plan should satisfy several operational constraints. In this study the following three factors are introduced to describe the vertical, transverse and longitudinal distribution of the ship's weight: metacentric height (GM), heel and trim. For a ship to be stable, GM must be greater than the minimum allowable metacentric height, the heel should be small or at least smaller than a given bound, and the trim should also be close to zero or at least within certain prespecified limits for good performance of the ship. Hence a good stowage plan should make GM as large as possible and heel and trim as small as possible. Thus, the following constraints should be satisfied:

w_c \cdot x_{ijkc} - w_e \cdot x_{ij(k+1)e} \ge 0, \quad \forall c, e \in C_p \cup \hat{C}_p, \forall i \in I, j \in R_i, k \in UT_{ij},    (20)

w_c \cdot x_{ijkc} - w_e \cdot x_{ij(k+1)e} \ge 0, \quad \forall c, e \in C_p \cup \hat{C}_p, \forall i \in I, j \in R_i, k \in BT_{ij},    (21)

-Q1 \le \sum_{i \in A} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} - \sum_{i \in B} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} \le Q2,    (22)

-Q3 \le \sum_{i \in I} \sum_{j \in LR_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} - \sum_{i \in I} \sum_{j \in RR_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} w_c \cdot x_{ijkc} \le Q3,    (23)

where Q1, Q2 and Q3 are given tolerances for the stability of the ship. In order to guarantee that the containers in Ĉ_p stay on the containership when it sails on to its next destination, the following equations should be satisfied:

\sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} x_{ijkc} = 1, \quad \forall c \in \hat{C}_p.    (24)
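To make the flavour of these constraints concrete, the sketch below shows how restrictions of the form (5)-(7) could be written with the open-source PuLP modeller in Python. The slot list, container weights and capacity value are invented toy data, and the helper names are ours, not the authors'; the paper's own experiments use the COIN-OR branch-and-cut solver mentioned in the Conclusion.

import pulp

# toy data (hypothetical): three slots (bay, row, tier) and three containers with weights in tons
slots = [(1, 1, 1), (1, 1, 2), (2, 1, 1)]
weight = {"c1": 20.0, "c2": 24.0, "c3": 18.0}
Q = 60.0  # assumed maximum weight capacity of the ship

# binary variable x[s][c] = 1 if container c is stowed in slot s, as in definition (3)
x = pulp.LpVariable.dicts("x", (slots, list(weight)), cat=pulp.LpBinary)

model = pulp.LpProblem("stowage_sketch", pulp.LpMaximize)

# simplified stand-in for objective (26): maximise the number of stowed containers
model += pulp.lpSum(x[s][c] for s in slots for c in weight)

# (5): each container occupies at most one slot
for c in weight:
    model += pulp.lpSum(x[s][c] for s in slots) <= 1

# (6): each slot holds at most one container
for s in slots:
    model += pulp.lpSum(x[s][c] for c in weight) <= 1

# (7): total stowed weight must not exceed the capacity Q
model += pulp.lpSum(weight[c] * x[s][c] for s in slots for c in weight) <= Q

model.solve(pulp.PULP_CBC_CMD(msg=False))
print(pulp.LpStatus[model.status])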
2.2 Objective Function
In the container transportation industry, containerships make repeated tours of a sequence of ports according to their planned routes. At each port on a tour of
a containership, containers are unloaded and additional containers destined for subsequent ports are loaded. The time required for loading and unloading depends on the arrangement of the cargo on board the ship, i.e. the stowage plan, which specifies where each container is loaded on the ship. Stowage plans, if not prepared well enough, may cause unnecessary handling time, namely the time required by the gantry cranes for temporarily unloading and re-loading containers at the ports. Consequently, port efficiency and ship utilization are largely affected by stowage plans. In order to minimize the total re-handling of the containers on the ship, we propose the following objective function:

\min \sum_{p=2}^{|P|-1} \sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \left[ \sum_{m \in T_{ij};\, m > k} \left( \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijmc} - \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijkc} \right) \right].    (25)
While controlling the shifting number of the containers, we still want to improve the utilization of the containership. Thus we need another objective function, namely

\max \sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{c \in C_p \cup \hat{C}_p} s_c \cdot x_{ijkc}.    (26)

Definition 1. A vertical unit of containers means a set of containers with the same row and whose bay numbers differ by less than 1. Obviously, the value of (25) is not the real shifting number of the containers. In fact, the shifting number of the containers can be calculated as follows:

\sum_{p=2}^{|P|-1} \sum_{i \in I} \sum_{j \in R_i} \sum_{k \in T_{ij}} \sum_{m \in T_{ij};\, m > k} V_{ijmkp},    (27)

where

V_{ijmkp} = \begin{cases} 1, & \text{if } \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijmc} > \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{ijkc}, \\ 0, & \text{otherwise.} \end{cases}

Theorem 1. If the container stowage problem is built from the objective functions (25)-(26) and the constraints (1)-(24), it minimizes the container shifting number.

Proof. Obviously, we only need to prove that when (25) attains its minimum, (27) also attains its minimum. Since only the containers within a vertical unit can cause container shifting, it suffices to prove the claim for an arbitrary vertical unit. Without loss of generality, suppose the bay number of the vertical unit is i' and its row number is j'. When (25) attains its minimum we obtain

\sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{i'j'mc} \le \sum_{c \in C_p \cup \hat{C}_p} d_c \cdot x_{i'j'kc}, \quad \text{if } m > k, \; m, k \in T_{i'j'}.
Therefore, when (25) attains its minimum, the shifting number is zero, and zero is the minimum of (27).
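For intuition about what (27) counts, note that within one vertical stack a shift is needed whenever a container higher in the stack is destined for a later port than a container beneath it. The following self-contained snippet (our own illustration, not the authors' code) counts such pairs for a single bay/row stack:

def shifts_in_stack(dest_bottom_to_top):
    """Count tier pairs (k, m) with m above k whose destination order d_m > d_k,
    i.e. the box below must be dug out before the box above is unloaded."""
    shifts = 0
    for k, d_low in enumerate(dest_bottom_to_top):
        for d_high in dest_bottom_to_top[k + 1:]:
            if d_high > d_low:
                shifts += 1
    return shifts

# destinations (port order) from bottom tier to top tier of one stack
print(shifts_in_stack([3, 1, 2]))  # 1 shift: a later-port box sits above the port-1 box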
3 Numerical Examples
According to this integer programming container stowage planning model, we can generate an optimal container stowage strategy for ocean shipping liners. To verify the efficiency of the integer model for the container stowage problem, we developed a simulation system. In this system there are two stowage engines: the first one is our model, the second one is "random". Here "random" means that the containers are stowed onto the containership randomly; that is, for any container we randomly generate a position given by bay, row and tier, and if the position is available we stow the container there, otherwise we generate another position until an available one is found (a sketch of this engine is given at the end of this section). In order to display the results of the two stowage engines, a 3D view of the containership was developed. The 3D view is built on top of a containership generator; by using this generator, any type of containership can be built. Besides the stowage engines, the 3D view and the containership generator, the simulation system contains data input, container generator, vessel operation calculation and report modules. The data input module reads the route (a sequence of ports) of the containership. The container generator produces containers with different types, sizes and destinations. The vessel operation calculation module calculates the shifting number of containers at each port, the utilization of the vessel, and the weight balance of the vessel, including the absolute weight difference between the front and rear parts and the absolute weight difference between the left and right parts. The report module shows all of the calculation results and the statistics of the containers on the vessel, such as how many containers share the same destination. To verify the efficiency of the model, a containership with 20 bays, 10 rows and 10 tiers was built. We suppose that the maximum capacity of the containership is 800 TEU and its maximum load capacity is 18,000 tons. For this vessel we build a route with 10 ports. At each port we randomly generate containers of different types, weights and destinations. Each time, we generate all containers and then sail the containership: we load all of the containers at the first port of the route, then unload the containers for the second port and load further containers for the coming ports, and so on. We ran the stowage process about 1000 times, and the stowage results are shown in Table 1 and Figures 2–5. Table 1 shows that the shifting number, the utilization of the containership, and the weight balance difference of the vessel are all better for the optimal stowage strategy than for the random strategy. By using our model, the number of container shifts has been cut down to zero. The irrationality of the random stowage strategy may result in some containers not being loaded on the vessel. Since the random stowage strategy does not consider the weight balance
Table 1. Comparison results for the random and optimal stowage strategies

                                     Random              Optimal
                                   Mean   Variation    Mean   Variation
  Shifting Number                  1800   200          0      0
  Utilization                      89%    0.10         95%    0.04
  Weight Balance Difference (TON)  50     10           8      2
Fig. 2. 3D view for the random stowage strategy (up deck)
Fig. 3. 3D view for the random stowage strategy (down deck)
Fig. 4. 3D view for the optimal stowage strategy (up deck)
Fig. 5. 3D view for the optimal stowage strategy (down deck)
of the vessel, the absolute weight differences between the front and back parts and between the left and right parts of the vessel are larger than those of the optimal stowage strategy. Thus the stability of the vessel is improved by the optimal stowage strategy.
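For reference, the "random" stowage engine described at the beginning of this section can be sketched as follows; the function and parameter names are hypothetical and the slot/container representation is simplified to plain identifiers.

import random

def random_stowage(containers, slots, seed=0):
    """Toy version of the 'random' engine: keep drawing bay/row/tier positions
    until a free one is found, then place the container there."""
    rng = random.Random(seed)
    free = set(slots)
    plan = {}
    for c in containers:
        if not free:
            break  # the ship is full; remaining containers stay ashore
        while True:
            s = rng.choice(slots)
            if s in free:
                plan[c] = s
                free.remove(s)
                break
    return plan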
4 Conclusion
In this paper, an integer programming model for the container stowage problem is proposed. It is suitable for multi-port and multimodal container transportation services. In this model, the container loading process and the re-handling process are considered in order to minimize the container shifting number and maximize the utilization of the containership. At the same time, several practical rules of the ocean shipping industry and vessel stability requirements are included as constraints of the model. A simulation system was developed to verify the feasibility and practicability of the model. Using the branch-and-cut algorithm of COIN-OR, the simulation system has shown that our model is efficient and robust.
References
1. Steenken, D., Voß, S., Stahlbock, R.: Container terminal operation and operations research: a classification and literature review. OR Spectrum 26, 3–49 (2004)
2. Methodologies for reducing truck turn time at marine container terminals. Research Report SWUT/05/167830-1
3. Atkins, W.H.: Modern marine terminal operations and management. Boyle, Oakland (1991)
4. Kang, J.G., Kim, Y.D.: Stowage planning in maritime container transportation. Journal of the Operational Research Society 53, 415–426 (2002)
5. Ambrosino, D., Sciomachen, A., Tanfani, E.: Stowing a containership: the master bay plan problem. Transportation Research Part A 38, 81–99 (2004)
6. Shields, J.J.: Container-ship stowage: a computer-aided preplanning system. Marine Technology 21, 370–383 (1984)
7. Saginaw, D.J., Perakis, A.N.: A decision support system for containership stowage planning. Marine Technology 26, 47–61 (1989)
8. Shin, J.Y., Nam, K.C.: Intelligent decision support system for containership autostowage planning. Journal of Korean Institute Port Research 9, 19–32 (1995)
9. Wilson, I.D., Roach, P.A., Ware, J.A.: Container stowage pre-planning: using search to generate solutions, a case study. Knowledge-Based Systems 14, 137–145 (2001)
10. Martin, J., Randhawa, S.U., McDowell, E.D.: Computerized container-ship load planning: A methodology and evaluation. Computers & Industrial Engineering 9, 357–369 (1988)
11. Todd, D.S., Sen, P.: A multiple criteria genetic algorithm for containership loading. In: Proceedings of the Seventh International Conference on Genetic Algorithms, pp. 674–681 (1997)
12. Haghani, A., Kaisar, E.I.: A model for designing container loading plans for containerships. In: Annual Conference of the Transportation Research Board (2001)
13. Dubrovsky, O., Levitin, G., Penn, M.: A genetic algorithm with a compact solution encoding for the container ship stowage problem. Journal of Heuristics 8, 585–599 (2002)
Using Padding to Optimize Locality in Scientific Applications
E. Herruzo (Dept. Electronics, University of Córdoba, Spain, [email protected]), O. Plata and E.L. Zapata (Dept. Computer Architecture, University of Málaga, Spain, [email protected], [email protected])
Abstract. Program locality exploitation is a key issue to reduce the execution time of scientific applications, so as many techniques have been designed for locality optimization. This paper presents new compiler algorithms based on array padding that optimize program locality either locally (at loop level) or globally (the whole program). We first introduce a formal cache model that is used to analyze how all cache levels are filled up when arrays inside nested loops are referenced. We further study the relation between the model parameters and the data memory layout of the arrays, and define how to pad those arrays in order to optimize cache occupation at all levels. Experimental evaluation on some numerical benchmarks shows the benefits of our approach.
1
Introduction
Over the last decades locality exploitation has been one of the main goals for improving the performance of scientific applications, giving rise to a wide range of software optimizations. Nowadays two locality-related trends can be observed: applications process larger and larger data sets, and the processor-memory gap problem is more and more significant. The memory latency problem has been attacked from two different fronts. On the one hand, by means of hardware solutions, like lockup-free caches, multithreading, prefetching, out-of-order execution, data and instruction speculation and so on. On the other hand, by means of compiler techniques for code and/or data transformations [1]. Array padding is a well-known data layout optimization technique that optimizes locality by reducing conflict misses. Although it is a global technique (it affects the whole program), its use can be localized to the nested loop (or few loops) where most of the execution time is spent (a frequent case in scientific applications). This paper presents a simple model of the cache that captures essential information of its behaviour during the execution of a loop nest. This model is used as a framework to define how to pad the arrays in the loop in order to optimize cache occupation. Our method establishes a relationship among a small set of cache parameters, how the array elements are referenced and how they are stored in memory in order to obtain the optimal padding that optimizes cache occupation. The proposed method is subsequently extended to optimize cache
locality for the whole program (global code optimization) as well as for the complete cache hierarchy. The rest of the paper is organized as follows. First, the cache model developed to design our padding methods is introduced. In the next section, our intra-array padding approach is developed for three scenarios: a single loop nest optimization, all loops in the program (global optimization) and all levels in the cache hierarchy. After that, the proposed techniques are experimentally evaluated for a wide set of numerical benchmarks. Finally, some related work is discussed.
2
Modelling the Cache Behaviour
A minimal set of characteristic parameters will be defined with the aim of optimizing cache occupation by using array padding. We consider an L-way set-associative cache of size C × L × W, where C is the number of cache sets, L is the number of blocks per set, and W is the block size in words. We also consider that the array access patterns come from referencing an M-dimensional array, X, within an N-depth nested loop. Expressions in the array dimensions are of the form f_k ∗ I_k, k = 1, . . . , M, where I = (I_1, I_2, . . . , I_N) is the iteration vector of the loop, but any general affine expression is perfectly valid. Without loss of generality, we assume that N = M. To simplify the explanation, we restrict the cache model to a single array X inside a perfectly nested loop. However, this model can be extended to several arrays appearing in the same loop and to not-perfectly nested loops. When the multidimensional array X is allocated in memory, it is linearized and laid out in some order. So, considering for instance a column-major order, the offset (in words) from the beginning of the array of the element of X referenced in the iteration I = (I_1, I_2, . . . , I_M) of the nested loop is given by

ArrOff(X, I) = f_1 \cdot I_1 + \ldots + f_k \cdot I_k \cdot \prod_{i=1}^{k-1} D_i + \ldots + f_M \cdot I_M \cdot \prod_{i=1}^{M-1} D_i,

where D_i is the size of the i-th dimension of X. The stride of array X on loop index I_k is defined as the distance in memory (in words) of array entries referenced by consecutive iterations of loop k, that is,

Stride(X, I_k) = f_k \cdot I_k^{l+1} \cdot \prod_{i=1}^{k-1} D_i - f_k \cdot I_k^{l} \cdot \prod_{i=1}^{k-1} D_i,

where I_k^{l} represents the l-th iteration of loop k. Let us consider now the cache. The execution of consecutive iterations of loop k generates array references separated in memory by a word distance equal to Stride(X, I_k). We now define two new cache-related strides. The above array references are contained in cache blocks (of size W words each). The distance (in blocks) between these cache blocks is defined as the cache block stride, that is,

BlockStride(X, I_k) = block(X, I_k^{l+1}) - block(X, I_k^{l}) = \lfloor MemAddr(X, I_k^{l+1})/W \rfloor - \lfloor MemAddr(X, I_k^{l})/W \rfloor.

Note that although Stride() is constant, BlockStride() may not be constant, depending on the relative offset of the memory references in their corresponding blocks. On the other hand, the distance (in cache sets) of the above blocks once they are placed into the cache is defined as the cache set stride, that is,

SetStride(X, I_k) = (block(X, I_k^{l+1}) - block(X, I_k^{l})) \bmod C = BlockStride(X, I_k) \bmod C.

We are assuming BlockStride is a positive integer, in order to simplify the expressions
(otherwise, we have to take absolute values and consider negative loop steps). With this definition, if we take as a reference the first iteration of loop k (iteration Lw), we can use the following expression to calculate the set in which the block referenced in the l-th iteration of the loop is located:

Set(X, I_k^{l}) = (Set(X, I_k^{Lw}) + l \cdot SetStride(X, I_k)) \bmod C.    (1)
This expression assumes that SetStride() is constant for all iterations of the loop k, which is true if BlockStride() is also constant for that loop. This condition is fulfilled if all array references have the same block offset, that is, if Stride(X, Ik ) is a multiple of W . Our method to optimize cache occupation considers that this is the case. Otherwise, we can always reach this condition by incrementing the k dimension of X by the amount Stride(X, Ik ) mod W .
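To make the three strides concrete, the short Python sketch below evaluates Stride and SetStride for a column-major array traversed along one dimension; the function names and example sizes are ours, and the 32 KB, 2-way, 32-byte-line L1 configuration is the one used later in Section 4.

def stride_words(dims, k, f_k=1):
    """Stride(X, I_k) in words for a column-major array with dimension sizes dims
    (k is 0-based here: k = 1 walks the second dimension)."""
    prod = 1
    for d in dims[:k]:
        prod *= d
    return f_k * prod

def set_stride(dims, k, W, C):
    """SetStride(X, I_k) = (Stride / W) mod C, assuming Stride is a multiple of W
    as the model requires."""
    return (stride_words(dims, k) // W) % C

# example: a 1600 x 1600 array, W = 8 words per block, C = 512 cache sets
print(stride_words((1600, 1600), 1))            # 1600 words
print(set_stride((1600, 1600), 1, W=8, C=512))  # 200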
3
Optimizing Cache Occupation by Array Padding
We propose a new approach to determine how to pad arrays in order to optimize cache occupation, and consequently reduce miss rates. Padding techniques modify the portion of memory reserved for storing array data by including empty memory zones [1]. We focus our analysis on intra-array padding, where the empty memory zones are included among the array dimensions; that is, arrays are redimensioned by increasing the size of some of their dimensions.
3.1 Single Loop Nest
Consider first the case of a single loop nest and a single-level cache hierarchy. In our analysis we are specially interested in those arrays with a stride greater than the cache block size (W ) in the innermost loop of the nest. The cache occupation for these arrays may suffer from block replacements by self-interference, and some cache blocks may remain unoccupied. This happens when SetStride(X, Ik ) > 1. This situation is very common in scientific applications, but the interesting problem occurs when the number of sets involved in the replacements is small. These cases shows an inefficient use of the cache space. Our goal is to change, in these cases, the array memory layout through padding to maximize the cache occupation, and consequently reduce the miss rate (due to self-replacements). Next lemma explains how to determine the dimension increments (intra-array padding) needed to maximize cache occupation. Lemma. Given a loop k in the nest and the array X, the occupation of the cache is maximized if SetStride(X, Ik ) and C are mutually prime, that is, GCD(SetStride(X, Ik ), C) = 1 (GCD condition). Proof. We will prove that if SetStride(X, Ik ) is relatively prime to C then the first C iterations of the loop k touch different cache sets. In that case, if the loop k is longer than C then all cache sets are occupied by references to array X in that loop. If it is shorter, the number of touched cache sets is maximum
Select all X_i arrays with stride greater than W on the innermost loop (I_M)
for (each X_i array) do
  if (SetStride(X_i, I_M) is even) then
    NewStride(X_i, I_M) = Stride(X_i, I_M) + W
  endif
endfor
Fig. 1. Intra-array padding algorithm for a single loop nest and a single cache
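The rule in Fig. 1 can be transcribed almost literally; the sketch below is our own rendering under the same assumption that the stride is already a multiple of W, and it reproduces the 1600 -> 1608 padding used in the evaluation section.

from math import gcd

def pad_single_nest(stride, W, C):
    """Return a padded stride for the innermost non-unit-stride reference.
    Adding one cache block (W words) makes SetStride odd and therefore
    coprime with the power-of-two set count C (the GCD condition)."""
    set_stride = (stride // W) % C
    if gcd(set_stride, C) != 1:   # with C a power of two this just tests evenness
        return stride + W
    return stride

print(pad_single_nest(1600, W=8, C=512))  # 1608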
(as many as the length of the loop). So, let us consider two of the first C iterations of the loop k, say p and q. Note that 0 ≤ p, q < C. Let us assume that references to X in those two iterations are located to the same cache set, that is, Set(X, Ikp ) = Set(X, Ikq ). In that case, from Eq. (1), we have, (Set(X, IkLw ) + p ∗ SetStride(X, Ik )) mod C = (Set(X, IkLw ) + q ∗ SetStride(X, Ik )) mod C, or, eliminating the module operation, Set(X, IkLw ) + p ∗ SetStride(X, Ik ) = Set(X, IkLw ) + q ∗ SetStride(X, Ik ) + C ∗ r, where r is an integer number. A bit of simplification gives, (p − q) ∗ SetStride(X, Ik ) = C ∗ r. Given that SetStride(X, Ik ) and C are mutually prime, p − q must be divisible by C, as r is an integer number. But | p − q | is lower than C, so we came to a contradiction. Thus, references to X in both iterations must touch different cache sets. And this must be true for any pair of such iterations. Q.E.D. In the case that SetStride(X, Ik ) and C are not mutually prime then the lemma does not hold. However, if β is the greatest common factor between both values, then SetStride (X, Ik ) = SetStride(X, Ik )/β and C = C/β are mutually prime. So, the lemma holds for these new reduced values. That means that the first C iterations of loop k touch different cache sets. The rest of them may cause set replacements. In addition, as SetStride(X, Ik ) is assumed constant, the cache sets touched by these first iterations are β sets apart. So, only the fraction C/β cache sets are assured to be occupied. The GCD condition can be used to define a simple algorithm (see Fig. 1) to compute the needed padding of an array in a nested loop for achieving maximum cache occupation. The key idea is touching the maximum number of cache sets when executing the iterations in the innermost loop. That permits to increase the cache set reuse by iterations of the next outer loop. In order to touch all the cache sets, we need to modify SetStride() to make it relatively prime to C. As C is a power of two, an increment of SetStride() by one is enough. This is obtained by increasing BlockStride() by also one, or by increasing Stride() of the array by W (remember that BlockStride() is assumed constant). 3.2
Multiple Loop Nests (Global Code Optimization)
In scientific codes it is very frequent that the same few arrays, usually referenced in the body of loop nests, are re-used during the whole program execution. To maximize the cache occupation for these arrays we may pad them satisfying the
for (each X array in the program) do
  i = 1, P(0) = 0
  for (each loop nest in the program) do
    if (SetStride(X, I_M) is odd) then P(0) = 1
    else P(i) = SetStride(X, I_M)/2, i = i + 1
  endfor
  n = i, even = 0, odd = 0
  if (P(0) == 0) then ΔSetStride = 1
  else
    for (i = 1 to n) do
      if (P(i) is even) then even = even + 1 else odd = odd + 1
    endfor
    if (odd > even) then ΔSetStride = 4 else ΔSetStride = 2
  NewStride(X_i, I_M) = Stride(X_i, I_M) + W ∗ ΔSetStride
endfor
Fig. 2. Intra-array padding procedure for the whole program
GCD condition for all loop nests in the program that include such arrays. We have to bear in mind that we consider the arrays which are referenced over a non-contiguous dimension in the innermost loop of a nest. Then, for a specific array under consideration, we take all SetStride()’s corresponding to all nested loops, and calculate the stride increment (from now on, ΔStride) needed to satisfy the GCD condition for every SetStride(). A new problem arises when for some loop nests SetStride() is even and for other loop nests SetStride() is odd. As ΔStride must be the same for all loop nests, we need to calculate it so that it satisfies the GCD condition for all of them or, at least, it minimizes GCD(SetStride(), C), in order to maximize the number of the used cache sets. In what follows we assume that the number of iterations of the different nested loops are similar. We also assume that the number of the odd and even values of SetStride()’s are similar. This is the worst case because we need to obtain a solution for the even values of SetStride()’s. In our approach, with a mixture of odd and even values for SetStride()’s, an even ΔStride will be calculated, in order to not turning odd values into even ones. To minimize the number of the non touched cache sets, this even increment should minimize GCD(SetStride(), C). In addition, it is convenient that the increment being the smallest possible, in order to minimize the number of (empty) pad locations. Based on this idea, we have defined the algorithm shown in Fig. 2, that determines the array stride increment needed to maximize the occupation of the cache for the whole program. In the procedure a vector P is computed containing the values SetStride()/2 for all even values of SetStride()’s. The first value of P shows if there are odd values of SetStride()’s or not. With the even values, the minimum even increment in stride to minimize GCD(SetStride(), C) is computed for the whole program.
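Summarising Fig. 2 in code form: if every loop nest already has an odd SetStride a single extra block suffices; otherwise an even increment of two or four blocks is chosen from the parity of the halved even SetStrides. The helper below is a sketch of that rule with names of our own choosing.

def global_delta(set_strides):
    """Stride increment, in cache blocks, for one array over the whole program (Fig. 2)."""
    if all(s % 2 == 0 for s in set_strides):
        return 1                     # one block turns every even SetStride odd
    halves = [s // 2 for s in set_strides if s % 2 == 0]
    odd = sum(1 for h in halves if h % 2 == 1)
    even = len(halves) - odd
    return 4 if odd > even else 2

def pad_whole_program(stride, set_strides, W):
    return stride + W * global_delta(set_strides)

print(pad_whole_program(1600, [200, 204, 51], W=8))  # mixed parities -> even increment of 2 blocks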
Select a X array inside a nested loop
Sort cache levels in decreasing order of their block size
for (each cache level take W_i in the sorted order) do
  Apply single-level procedure(Stride(X, I_M), C_i, W_i)
  Stride(X, I_M) = NewStride(X, I_M)
endfor
Fig. 3. Intra-array padding procedure for the complete cache hierarchy
3.3
Complete Cache Hierarchy
In the cache hierarchy the blocks may be the same size for all levels. In this case, the stride increment is the same for all cache levels, so we can apply the same algorithms described above. Otherwise, if block sizes are different over the cache hierarchy, we need to obtain a stride increment appropriate for each cache level. The algorithm in Fig. 1 calculates the stride increment for a specific cache level. To pad the array for other cache level, we need to carry out the same algorithm changing SetStride() and the cache block size (W ). The method to accomplish this process starts with the cache level with the largest block size, continuing with the rest of levels in decreasing order of block size. As an example, consider a two-level cache hierarchy with different block sizes, W 1 and W 2, where W 1 < W 2 (both are power of two). For this system, if Stride()/W 2 is integer, then all the quotients, Stride()/W 1, (Stride() + W 2)/W 2), (Stride() + W 1)/W 1) and (Stride() + W 2 + W 1)/W 1) are all also integer. However, the quotient (Stride() + W 2 + W 1)/W 2) is non-integer. Despite this, its integer part satisfies the GCD condition. Based on these results, a simple algorithm has been developed to extend our padding approach to a cache hierarchy (see Fig. 3).
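Fig. 3's pass over the cache hierarchy then simply reapplies the single-level rule from the largest block size downwards, feeding each level the stride produced by the previous one. A sketch reusing the hypothetical pad_single_nest helper shown earlier:

def pad_hierarchy(stride, levels):
    """levels: iterable of (W_i, C_i) pairs, one per cache level."""
    for W, C in sorted(levels, key=lambda wc: wc[0], reverse=True):
        stride = pad_single_nest(stride, W, C)
    return stride

# L1: 8-word blocks, 512 sets; L2: 32-word blocks, 16384 sets (the MIPS R10K system of Section 4)
print(pad_hierarchy(1600, [(8, 512), (32, 16384)]))  # 1640 = 1600 + W2 + W1, as discussed in the text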
4
Experimental Evaluation
We first tested the basic padding method (single loop nest and one-level cache) for a synthetic single loop nest example on a real machine, and for a small set of kernel benchmarks on a cache simulator. Second, we tested the full padding method (whole program optimization on a complete cache hierarchy) for a selection of benchmarks on a real machine. When using a real machine, different optimization levels were tested, obtaining similar results. In this paper we show those results for two optimization levels on two different processors.
4.1 Basic Padding Method
In this section we present the evaluation of the basic padding method for a simple test code, a double nested loop of 1000 × 1000 iterations (i, j), where the loop body is X(i, j) = 3, being X a 1600 × 1600 single-precision floating-point array (4-byte words). The experiments were conducted in a MIPS R10K-processor system running in exclusive mode. This system has a 2-way set associative 32KByte L1 data cache with a 32-Byte cache block (C = 512, L = 2, W = 8),
Table 1. L1 (left) and L2 (right) data cache misses for different array paddings

  Arr. Dim.  SetStride  GCD  Exec. Time  L1 Misses      Arr. Dim.  SetStride  GCD  Exec. Time  L2 Misses
  1600       200        8    0.212       1,001,000      1600       50         2    0.212       23,100
  1608       201        1    0.171       312,000        1632       51         1    0.214       19,500
  1616       202        2    0.214       1,001,000      1664       52         4    0.219       27,640
  1624       203        1    0.169       310,000        1696       53         1    0.217       20,400
  1632       204        4    0.214       1,001,000      1728       54         2    0.214       24,700
  1640       205        1    0.168       312,000        1760       55         1    0.212       18,600
  1648       206        2    0.216       1,001,000      2048       64         64   0.548       698,500
  1664       208        16   0.217       1,001,000      2176       68         4    0.217       22,200
Table 2. Cache miss ratio for a 16 KB cache using a simulator

                                        Cache Miss Ratio
  Benchmark  Array Dim.  GCD  C=1024  C=512  C=256   C=128   C=64    C=32
                               L=1     L=2    L=4     L=8     L=16    L=32
  test code  400         4    1       1      1       1       1       1
  test code  404         1    1.196   1.442  2.732   13.333  13.333  61.710
  test code  408         2    0.282   0.105  0.873   4.230   4.230   4.645
  test code  412         1    9.310   3.307  3.142   3.142   12.571  12.342
  liver      240         4    1       1      1       1       1       1
  liver      244         1    3.874   3.874  1.193   13.741  14.200  14.200
  liver      248         2    0.869   0.869  3.354   4.531   4.531   4.531
  liver      252         1    1.705   1.705  13.741  13.741  14.200  14.200
  mxm        200         2    1       1      1       1       1       1
  mxm        204         1    0.858   0.256  0.131   0.226   0.645   1.072
  mxm        208         4    2.502   1.031  0.432   0.140   0.145   0.093
  mxm        220         1    7.826   1.351  0.658   0.138   1.001   1.208
and a 2-way set associative 4-MByte L2 data cache with a 128-Byte cache block (C = 16384, L = 2, W = 32). The results correspond to codes compiled with the MIPSpro Fortran 90 (v. 7.30) compiler using the ”O0” optimization option. Considering the test code, we have Stride(X, i) = 1 and Stride(X, j) = 1600. For the L1 cache level we have, BlockStride(X, j) = 1600/8 = 200 (constant), SetStride(X, j) = 200 mod 512 = 200, and GCD(200, 512) = 8. This result means that between two touched cache sets there are 7 other sets which are never used. With an increment by one of SetStride(X, j), we have GCD(201, 512) = 1 and N ewStride(X, j) = 1600 + 8 = 1608. Table 1 (left) shows the experimental results (execution time and L1 data cache misses) for different values of the second dimension of the array X. Besides, the table shows the SetStride(X, j) for each case and the values of GCD(SetStride(X, j), C). Note that the best results are obtained when the GCD condition holds, as expected. For the L2 cache level, we have now BlockStride(X, j) = 1600/32 = 50 (constant), SetStride(X, j) = 50 mod 16384 = 50, and GCD(50, 16384) = 2. So, only half of the L2 cache is used. Table 1 (right) shows the experimental results for the L2 cache and different values for the array second dimension. The table shows three cases fulfilling the GCD condition. For these cases the minimum miss rate is obtained, as expected. A different scenario corresponds to the evaluation of our basic padding method using a cache simulator, in order to test easily different cache configurations. The obtained results are shown in table 2, for three different benchmarks: test
code (test code described above), liver (livermore kernel [3]), and mxm (matrix multiplication of square matrices). The first row in table 2 for each benchmark corresponds to the original dimension size of the array. The rest of rows include different padded dimension sizes. The smallest one (that is, the second row) corresponds to the increment given by our padding technique. For each cache configuration we measured the cache miss rate using our simulator. The results are given in the table as Cache Miss Ratio, that is obtained by dividing the cache miss rate for the original dimension size by the cache miss rate for the padded dimension size. This way, a value larger than 1 means that the miss rate was decreased after applying padding. In case of mxm, padding is not effective due to the small size of the data structures. 4.2
Full Padding Method
A different experimental setup was implemented testing a number of benchmarks from different suites (SPEC95 & 2000, perfectB and NAS) on an Intel Pentium-4 platform. The computer system has an 8-way set associative 16-KByte L1 data cache with a 64-Byte block size, and an 8-way set associative 1-MByte L2 data cache with a 64-Byte block size. The results shown here correspond to codes compiled with the Intel Fortran Compiler v. 9.0, using in this case the "O3" optimization level and the "-align none" option (in order to not interfere with padding). In addition, the code was executed in exclusive mode (hard real-time mode in the Linux scheduler). Table 3 (top) shows the performance results of applying our padding method for the complete cache hierarchy. Note that there is a significant performance improvement from padding the arrays. In order to optimize the whole code, we first have to find all the nested loops where array references in a non-contiguous dimension exist in the innermost loop. Next, we proceed by selecting one of these arrays. From this point on, SetStride()'s are calculated for every level of the cache hierarchy. For each one, the corresponding values of GCD(SetStride(), C) are also calculated, and from these, the stride increments. If the stride values are the same for all the loops, we only have to carry out the sum of the stride increments obtained for each level of the cache memory. Otherwise, we have to apply the corresponding padding algorithm. Table 3 (bottom) shows the improvement obtained by applying our padding method to several benchmarks on the Pentium-4 machine.
5
Related Work
There is a great amount of work in the literature related to the design of compiler techniques for program locality exploitation. In [6] authors present an iterative method that uses ILP (Integer Linear Programming) to compute optimal solutions of memory layout transformations. Other works [7,8] also develop heuristic solutions to data layout and loop transformations, based on data reuse vectors. Works based on Cache Miss Equations (CMEs) [5] are similar to our own proposal in some aspects but with different results. They analyze the cache
Table 3. Improvement for a single loop nest (top) and the whole code (bottom) for the complete cache hierarchy in Pentium 4

  Subroutine          Original      Padded        % Exec. time  % L1 num. misses  % L2 num. misses
                      array dim.    array dim.    improvement   improvement       improvement
  PerfectB Bench.
  adm      hyd        50,50,50      56,52,50      16.2          15.1              0
  arc2d    scaldt     500,500,4     544,500,4     11.3          15.7              13.6
  dyfesm   mnlbyx     500,500,3     544,500,3     7.1           1.2               44.4
  flo52    collc      200,200,200   200,205,200   5.4           5.5               5.7
  mg3d     march      100,100       100,112       3.2           0.7               8.6
  spec77   horiz2     500,500       544,500       16.8          15.5              34.2
  trfd     trfa       500,500       544,500       18.3          3.8               0
  NAS Bench.
  appsp    spentax3   660,33,33     660,35,34     0             6.3               2.0
  appbt    l2norm     50,50,50,50   56,52,56,50   0             0.2               0
  fftpde   transx     1024,1024     1064,1024     120.0         -1.7              90.0
  SPEC CFP Bench.
  tomcatv  cal.res.   513,513       540,520       588.2         211.5             103.8
  applu    rhs        20,50,50,100  20,50,70,100  5.7           4.7               2.9
  swim     calc3z     1335,1335     1352,1336     6.8           1.6               3.6
  hydro2d  v.th.p     1200,800      1224,800      433.2         907.4             173.3
  mgrid    norm2u3    320,200,200   344,201,200   4.2           19.5              18.4
  turb3d   wcal       33,64,64      34,68,64      8.3           -0.1              11.7
  wave     RADFG      100,100,100   116,140,100   5.4           5.5               5.7
  tfft2    transc     500,30,500    500,30,501    6.5           11.7              0
  Benchmark  % Exec. time  % L1 num. misses  % L2 num. misses
             improvement   improvement       improvement
  swim       1.1           3.2               -0.2
  mgrid      3.4           2.5               1.3
  appsp      4.8           2.8               3.7
  fftpde     1.2           3.9               10.4
occupation through references to arrays in loops and define the algorithms to generate the CMEs and some optimizations, as padding. In [11] authors also use CMEs to propose a cost model that combined with a genetic algorithm carries out padding (and tiling) for a multi-level cache memory system. The work in [2] is mainly focused on spatial locality optimization. The method provides a parameterized cost function based on polytopes and Ehrhart polynomials from the iteration space of a loop nest. We also use the polyhedron defined by the iteration space of a loop nest, but with a different objective. Our goal is to determine the cache occupation generated by this polyhedron. In [9] authors present a similar approach to ours for array padding. Their technique is iterative until no self-interference is caused. Our method, however, is a direct calculation, more accurate and with lower algorithmic complexity. Other work [10] computes the conflict distance of array references by directly linearizing the uniformly-generated references. In [12] authors analyze iterative stencil loops and use array padding to remove conflict misses after tiling the loops. Finally, the work in [4] shares with our approach a similar treatment of the cache and the program code. However, we also take into account how array
data is stored in memory, and we introduce the new notions of cache block and set strides and work with them to develop the padding algorithms. These facts, to some extent, complete their work, justifying some of the results they obtain.
6
Conclusions
We presented a parameter-based cache model that is used as a framework to determine how to pad arrays in order to optimize program locality. In fact, the model makes it possible to maximize cache occupation when arrays are referenced within loop nests. This model is used to develop array padding algorithms in different scenarios: single loop nest and multiple loops, single and multiple cache levels. We showed that simple padding techniques are very useful for obtaining respectable performance improvements for a variety of scientific codes.
References
1. Bacon, D.F., Graham, S.L., Sharp, O.J.: Compiler Transformations for High-Performance Computing. ACM Computing Surveys 26(4), 345–420 (1994)
2. Clauss, P., Meister, B.: Automatic Memory Layout Transformation to Optimize Spatial Locality in Parameterized Loop Nests. ACM Computer Architecture News 28(1), 11–19 (2000)
3. Coleman, S., McKinley, K.S.: Tile Size Selection Using Cache Organization and Data Layout. In: ACM Conf. on PLDI, La Jolla (CA), pp. 279–290 (1995)
4. Ferrante, J., Sarkar, V., Thrash, W.: On Estimating and Enhancing Cache Effectiveness. Workshop on Languages and Compilers for Parallel Computers (1991)
5. Ghosh, S., Martonosi, M., Malik, S.: Cache Miss Equations: A Compiler Framework for Analyzing and Tuning Memory Behaviour. ACM TOPLAS 21(4), 703–746 (1999)
6. Kandemir, M., Banerjee, P., Choudhary, A., Ramanujam, J., Ayguade, E.: An Integer Linear Programming Approach for Optimizing Cache Locality. In: ACM Int'l. Conf. on Supercomputing, Rhodes, pp. 500–509 (1999)
7. Kandemir, M., Choudhary, A., Ramanujam, J., Banerjee, P.: Improving Locality Using Loop and Data Transformations in an Integrated Framework. In: ACM/IEEE Int'l. Symp. on Microarchitecture, Dallas (TX), pp. 285–297 (1998)
8. O'Boyle, M., Knijnenburg, P.: Integrating Loop and Data Transformations for Global Optimizations. In: IEEE Int'l. Conf. on Parallel Architectures and Compilation Techniques, Paris, pp. 12–19 (1998)
9. Panda, P., Nakamura, H., Dutt, N., Nicolau, A.: A Data Alignment Technique for Improving Cache Performance. In: Int'l. Conf. on Computer Design: VLSI in Computers and Processors, Austin (TX), pp. 587–592 (1997)
10. Rivera, G., Tseng, C.W.: Data Transformations for Eliminating Conflict Misses. In: ACM Conf. on PLDI, Montreal, pp. 38–49 (1998)
11. Vera, X., Abella, J., Llosa, J., González, A.: An Accurate Cost Model for Guiding Data Locality Transformations. ACM TOPLAS 27(5), 946–987 (2005)
12. Li, Z., Song, Y.: Automatic Tiling of Iterative Stencil Loops. ACM TOPLAS 26(6), 975–1028 (2004)
Improving the Performance of Graph Coloring Algorithms through Backtracking
Sanjukta Bhowmick (Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, [email protected]) and Paul D. Hovland (Mathematics and Computer Science Division, Argonne National Laboratory, 9700 South Cass Avenue, Argonne, IL 60439-4844, [email protected])
Abstract. Graph coloring is used to identify independent objects in a set and has applications in a wide variety of scientific and engineering problems. Optimal coloring of graphs is an NP-complete problem. Therefore there exist many heuristics that attempt to obtain a near-optimal number of colors. In this paper we introduce a backtracking correction algorithm which dynamically rearranges the colors assigned by a top level heuristic to a more favorable permutation thereby improving the performance of the coloring algorithm. Our results obtained by applying the backtracking heuristic on graphs from molecular dynamics and DNA-electrophoresis show that the backtracking algorithm succeeds in lowering the number of colors by as much as 23%. Variations of backtracking algorithm can be as much as 66% faster than standard correction algorithms, like Culberson’s Iterated Greedy method, while producing a comparable number of colors. Keywords: Graph Coloring, Backtracking.
1 Introduction Graph coloring is used for partitioning a collection of objects into "independent" sets. Objects belonging to the same set are identified by having the same color. Objects with the same color are non-conflicting, that is, certain operations can be performed simultaneously on them. Coloring is used in many computational and engineering applications that require identification of concurrent tasks. Some examples include register scheduling, frequency assignments for mobile networking, the evaluation of sparse Jacobian matrices, etc. Optimal coloring strategies improve parallelism. The fewer colors required to classify the objects, the more the inherent parallelism of the problem can be exploited. The holy grail of graph coloring is achieving the chromatic number, the smallest number of colors required to color the graph so that no two adjacent vertices have the same color. Determining the chromatic number is NP-complete [1], and designing polynomial-time heuristics to obtain quasi-optimal solutions is an active area of research.
It has been observed that for many heuristics the order in which vertices are colored significantly affects the number of colors obtained [2]. Based on this observation, we present a backtracking correction heuristic that mitigates the effects of a “bad” vertex ordering. The backtracking algorithm dynamically rearranges the colors assigned by the top level coloring algorithm thereby, changing the vertex ordering to a more favorable permutation. We study the performance of backtracking algorithm in conjunction with several popular coloring heuristics and compare it with another correction technique, Culberson’s iterated greedy scheme [3]. Results from our experiments on graphs obtained from molecular dynamics [4] and DNA electrophoresis [5] show that backtracking improves upon the performance of the top level heuristic as well as the iterated greedy approach. The average reduction of colors is as much as 23% (16%), compared to the original (iterated greedy) method. Most correction algorithms necessarily take more time to determine the near-optimal number of colors. We have designed a variation of backtracking that is as much as 66% faster than the iterated greedy method, while giving the number of colors within 1% of that obtained by the correction method. The rest of the paper is arranged as follows. In Section 2 we present the mathematical description of relevant terms from graph theory and define graph coloring. We provide a brief review of some of the standard coloring heuristics in Section 3. In Section 4 we describe the backtracking algorithm. We discuss experimental results in Section 5 and present improved variations to the heuristic such as Multilevel and Reverse backtracking. Section 6 contains conclusions and discussion of our future research plans.
2 Mathematical Definitions In this section we define some terms used in graph theory. Unless mentioned otherwise, the terms used here are as they are defined in [6]. A graph G = (V, E) is defined as a set of vertices V and a set of edges E. An edge e ∈ E is associated with two vertices u, v which are called its endpoints. If a vertex v is an endpoint of an edge e, then e is incident on v. A vertex u is a neighbor of v if they are joined by an edge. The degree of a vertex u is the number of its neighbors. A walk, of length l, in a graph G is an alternating sequence of v0 , e1 , v1 , e2 , . . . , el , vl vertices and edges, such that for j = 1, . . . , l; vj−1 and vj are the endpoints of edge ej . An internal vertex is a vertex that is neither the initial nor final vertex in this sequence. A path is a walk with no edges or internal vertices repeated. Two vertices are said to be distance-k neighbors if the shortest path connecting them has length at most k [7]. A vertex-coloringof a graph G = (V, E) is a function φ : V → C from the set of vertices to a set C = {1, 2, . . . , n} of “colors”. A distance-k coloring of a graph G = (V, E) is a mapping φ : V → {1, 2, . . . , n} such that φ(u) = φ(v), whenever u and v are distance-k neighbors. The least possible number of colors required for a distance-k coloring of a graph G is called its k-chromatic number [7].
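As a concrete companion to these definitions, the snippet below (our own illustration, in Python) checks whether a color assignment is a valid distance-1 coloring of a graph stored as adjacency lists; a distance-2 check would additionally compare each vertex against its neighbors' neighbors.

def is_distance1_coloring(adj, color):
    """adj maps each vertex to its neighbors; color maps each vertex to an integer color."""
    return all(color[u] != color[v] for u in adj for v in adj[u])

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
print(is_distance1_coloring(adj, {0: 1, 1: 2, 2: 3, 3: 1}))  # True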
3 Review of Some Coloring Algorithms In this section we provide an overview of some standard coloring algorithms. It has been observed that the order in which vertices are colored is an important parameter in
lowering the number of colors. Consequently, many coloring heuristics focus on finding an efficient vertex ordering. Some apply well known graph traversal methods like the depth-first search [6] while others focus on orderings based on the degree of the vertices or the number of colored neighbors. Some examples of the later category include the largest first [8] ordering where the vertices are arranged in non-increasing order of their degrees and the smallest last [9] ordering which dynamically orders the vertices such that the last vertex in the sequence is one with the minimum degree in the subgraph induced by the yet uncolored vertices. In the incidence degree [10] ordering, the vertex with the maximum number of colored neighbors is the next one to be colored. The effectiveness of these heuristics depend on the underlying graph structure. The results can be improved by using correction algorithms such as Culberson’s Iterated Greedy method [3]. In this approach, once the initial algorithm has been applied, the iterated greedy method rearranges the vertices in decreasing order of color, and re-colors them. Culberson’s method guarantees that the reordering does not increase the number of colors.
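All of the ordering heuristics above share the same greedy core: visit the vertices in the chosen order and give each one the smallest color not used by its already-colored neighbors. A minimal sketch (ours, not the authors' implementation), here driven by a largest-first ordering:

def greedy_color(adj, order):
    color = {}
    for v in order:
        used = {color[n] for n in adj[v] if n in color}
        c = 1
        while c in used:
            c += 1
        color[v] = c
    return color

adj = {0: [1, 2], 1: [0, 2], 2: [0, 1, 3], 3: [2]}
largest_first = sorted(adj, key=lambda v: len(adj[v]), reverse=True)
print(max(greedy_color(adj, largest_first).values()))  # number of colors used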
4 The Backtracking Correction Heuristic The backtracking correction heuristic is based on dynamically reassigning colors amongst already colored vertices in order to restrict the number of colors to a user-specified minimum. The heuristic is implemented as follows: the user specifies a coloring threshold set to a lower bound on the chromatic number. The backtracking heuristic is invoked whenever this threshold is exceeded. It is easy to see that when backtracking is called there is only one vertex, designated as the last-vertex, that is assigned a color higher than the threshold. Evidently, the rest of the colors up to the threshold would have been used to color the neighbors of the last-vertex. These colors form the acceptable set of colors. The last-vertex is temporarily assigned a pseudo-color from the acceptable color set. The backtracking algorithm tries to determine whether there is an alternate assignment of colors from the acceptable set to the neighboring vertices that would allow the last-vertex to retain the pseudo-color and prevent conflicts. If such an assignment is found, then we have a coloring within the limits of the threshold. If no such arrangement can be obtained for any color from the acceptable set, the last-vertex is assigned its original color and the threshold is increased by one.
Pseudocode for Backtracking Heuristic
Set threshold to T
For all vertices v
  Color vertex v with initial coloring algorithm
  pseudocolor[v] = color[v]
  If color[v] > T
    For all colors c; 1 ≤ c ≤ T
      Set pseudocolor[v] to c
      Set fail to FALSE
      For all neighbors n of v
        If color[n] = c
          Reassign pseudocolor[n] to avoid conflicts
          If pseudocolor[n] > T
            Set fail to TRUE; Break
      If fail is FALSE
        Re-coloring is successful; Break
      Else continue for next color
    If fail is FALSE (alternative coloring assignment found)
      For all vertices v; set color[v] = pseudocolor[v]
    Else
      For all vertices v; set pseudocolor[v] = color[v]
      Increase T by 1
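A compact Python rendering of this heuristic is sketched below. It follows the pseudocode's structure (threshold T, pseudo-colors, a single level of reassignment for conflicting neighbors), but the data structures and helper names are our own simplifications, not the authors' code; an undirected graph with symmetric adjacency lists is assumed.

def smallest_free(used):
    c = 1
    while c in used:
        c += 1
    return c

def backtracking_coloring(adj, order, T):
    """Color vertices in the given order; when a vertex would exceed the threshold T,
    try to reassign its conflicting neighbors so that it still fits within T."""
    color = {}
    for v in order:
        c = smallest_free({color[n] for n in adj[v] if n in color})
        if c <= T:
            color[v] = c
            continue
        placed = False
        for cand in range(1, T + 1):              # try each pseudo-color for v
            trial = dict(color)
            trial[v] = cand
            ok = True
            for n in adj[v]:
                if n in color and color[n] == cand:
                    taken = {trial[m] for m in adj[n] if m in trial}
                    alt = smallest_free(taken)
                    if alt > T:                   # this neighbor cannot be moved within T
                        ok = False
                        break
                    trial[n] = alt
            if ok:
                color = trial
                placed = True
                break
        if not placed:
            color[v] = c                          # keep the original color ...
            T = c                                 # ... and raise the threshold (here c = T + 1)
    return color, T

# usage: backtracking_coloring(adj, list(adj), T=3) mirrors the distance-1 threshold of Section 5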
5 Performance of Backtracking Heuristic We report on the performance of the backtracking algorithm on two test suites each containing six matrices. We applied the coloring algorithms discussed in Section 3 to the adjacency graphs corresponding to these matrices. The first set obtained from molecular dynamics [4], consists of a group of graphs with fixed vertices (11414) and gradually increasing number of edges. The second set obtained from the Florida Sparse Matrix Collection [5], representing DNA electrophoresis, consists of graphs whose size increases with both vertices and edges. We used the following ordering heuristics; Natural Ordering (N), Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). For each heuristic we conducted three sets of experiments using: i) only the heuristic, ii) Culberson’s Iterated Greedy method [3], with the current heuristic in the first iteration and iii) the heuristic with the Backtracking algorithm. Our experiments include results for both distance-1 and distance-2 coloring objectives. The threshold for distance-1 coloring was set to 3 and the threshold for distance-2 coloring was set to the minimum degree of the graph. 5.1 Reduction of Colors The results summarized in Tables 1 and 2 demonstrate that backtracking can significantly reduce the number of colors. The number of colors obtained is lower than that given by the iterated greedy algorithm. The reduction is higher for distance-1 coloring (maximum reduction of 23%) than for distance-2 coloring (maximum reduction of 18%). This is to be expected, since coloring vertices based on distance-2 neighbors requires fulfilling more constraints, thus reducing the possibility of color reassignments. 5.2 Running Time The time taken to color a graph is proportional to its size. Figure 1 plots the time taken to distance-1 color the two sets of graphs. The results show that though the execution time of backtracking is competitive with the iterated greedy heuristic for smaller
Fig. 1. Comparison of the time taken between Depth First Search (DF), DF using Culberson’s Iterated Greedy Algorithm and DF using Backtracking for distance-1 coloring. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis. Time is given in seconds.
graphs, it gets much larger as the size of the graph increases. The running time is worse for distance-2 coloring, and with particularly bad combinations of vertex ordering and threshold for dense matrices the execution time can go up to as much as 77 times the time taken by the top-level heuristic. The time required to backtrack can be reduced by increasing the level of the threshold. Since backtracking does not start until the threshold is reached, a judicious selection can significantly decrease the time, as shown in Figure 2, without compromising the number of colors.
Fig. 2. Comparison of the time taken to color a molecular dynamics graph with Natural (N) ordering, the Iterated Greedy method (G) and Backtracking (B) with different thresholds, given in parentheses. The values on top of the bars give the number of colors obtained. Time is given in seconds.
5.3 Improving Backtracking

Improvements to backtracking can include i) reduction of colors and ii) reduction of execution time. We will explore the first option with respect to distance-1 coloring and the second option with respect to distance-2 coloring.
Multilevel Backtracking: Multilevel backtracking reduces the number of colors by recursively invoking the correction heuristic. For example, if in the course of backtracking a neighbor of the last-vertex is colored higher than the threshold, we use multilevel backtracking to explore further reassignments of the vertices adjacent to that neighbor in order to limit the colors to the acceptable set. Figure 3 and Table 1 summarize the number of colors for distance-1 coloring of the representative graphs. The results show that 2-level backtracking (recursion used once) can reduce the number of colors by 10% compared to non-recursive backtracking.
Fig. 3. Number of colors required by the graphs for distance-1 coloring. The groups from left to right represent different top level heuristics: Natural Ordering (N), Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the coloring obtained by the iterated greedy algorithm, suffix (B) represents the coloring obtained by backtracking and suffix (2) represents 2-level backtracking. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis.
Reverse Backtracking: Reverse backtracking is used to reduce the execution time of the backtracking algorithm. In this heuristic the vertices are first colored using a top-level algorithm and then backtracking is applied to see if it is possible to re-color all the vertices of the highest color to a lower color. The process is continued until a lower color cannot be assigned. This algorithm is similar to the iterated greedy algorithm in that the vertices are grouped according to color after the initial coloring is complete. However, instead of re-coloring all the vertices again as in the iterated greedy algorithm, reverse backtracking re-colors only the vertices with the highest colors (and their neighbors as required). Consequently reverse backtracking has a lower running time than the iterated greedy heuristic. Figure 4 compares the execution time for distance-2 coloring with respect to the most expensive algorithm (depth-first search). The results show that reverse backtracking is faster by as much as 66% compared to the iterated greedy algorithm. The performance of these algorithms with respect to the number of colors is summarized in Figure 5 and Table 2. For most algorithms, the number of colors obtained by reverse backtracking is competitive with that obtained by the iterated greedy method.
Table 1. Number of colors required by the graphs for distance-1 coloring. The groups from left to right represent different top level heuristics: Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the coloring obtained by the iterated greedy algorithm, suffix (B) represents the coloring obtained by backtracking and suffix (2) represents 2-level backtracking.

Graphs          S    S-G  S-B  S-2   L    L-G  L-B  L-2   I    I-G  I-B  I-2   D    D-G  D-B  D-2
V=11K E=15K     5    5    5    5     5    5    5    5     5    5    5    5     5    5    5    5
V=11K E=64K     8    8    8    8     9    9    9    8     9    9    9    8     11   9    9    8
V=11K E=130K    14   14   14   13    16   16   15   14    15   14   14   13    18   16   15   14
V=11K E=412K    34   34   32   32    42   40   37   35    34   34   33   32    45   42   39   37
V=11K E=1.6M    115  115  110  109   137  135  124  117   117  117  112  109   164  156  142  130
V=11K E=2.6M    183  183  176  171   218  216  187  181   183  182  177  173   266  241  225  201
V=37 E=196      4    5    4    4     5    5    4    4     5    5    4    4     5    5    5    4
V=93 E=692      6    5    5    5     6    7    5    5     7    6    5    5     6    6    5    5
V=1K E=9K       8    7    7    7     8    8    7    6     8    7    7    7     8    7    8    7
V=3K E=38K      9    8    8    8     9    9    8    8     10   9    9    8     11   9    9    8
V=39K E=520K    10   10   10   9     13   11   10   9     12   11   10   10    14   12   11   10
V=130K E=1.9M   12   11   10   9     13   12   10   10    13   11   10   10    14   12   12   11
Fig. 4. Number of colors required by the graphs for distance-2 coloring. The groups from left to right represent different top level heuristics: Natural Ordering (N), Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the iterated greedy algorithm, suffix (B) represents backtracking and suffix (R) represents reverse backtracking. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis.
Reverse Backtracking Heuristic
Color all vertices
Set C to maximum number of colors
While TRUE
    For all vertices colored with C
        Apply backtracking with threshold C-1
    If all vertices can be recolored, set C to C-1
    Else break
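Assuming the `backtrack_correct` helper sketched in Section 4, the reverse heuristic can be driven as below; this is an illustrative sketch rather than the authors' code.

```python
def reverse_backtracking(adj, color):
    """Repeatedly try to eliminate the highest color class (threshold C-1)."""
    C = max(color.values())
    while C > 1:
        recolored = dict(color)
        success = True
        for v in [u for u, c in color.items() if c == C]:
            result = backtrack_correct(adj, recolored, v, C - 1)
            if result is None:
                success = False               # some vertex of color C cannot be lowered
                break
            recolored = result
        if not success:
            break
        color, C = recolored, C - 1           # the whole highest class was removed
    return color
```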
Table 2. Number of colors required by the 12 graphs for distance-2 coloring. The groups from left to right represent different top level heuristics: Smallest Last (S), Largest First (L), Incidence Degree (I), and Depth First Ordering (D). Suffix (G) represents the iterated greedy algorithm, suffix (B) represents backtracking and suffix (R) represents reverse backtracking. Due to space constraints the colors for the graph with V=11K and E=2.6M are given in units of 10^3.

Graphs (V:E)  S     S-G   S-B   S-R    L     L-G   L-B   L-R    I     I-G   I-B   I-R    D     D-G   D-B   D-R
11K:15K       9     9     9     9      9     9     9     9      9     9     9     9      9     9     9     9
11K:64K       22    22    22    22     26    24    24    24     24    23    22    23     29    26    24    26
11K:130K      47    47    45    47     57    56    52    55     50    49    45    48     65    59    55    58
11K:412K      152   150   144   150    183   178   157   170    152   152   148   151    217   198   184   200
11K:1.6M      617   609   585   608    654   651   617   645    615   615   596   610    862   771   710   789
11K:2.6M      1.02  1.01  0.97  1.01   1.07  1.07  1.01  1.05   1.01  1.00  0.98  1.00   1.39  1.22  1.14  1.24
37:196        11    11    10    11     10    10    10    10     11    11    11    11     12    12    11    11
93:692        19    18    17    18     17    17    17    17     17    17    17    16     19    18    17    18
1K:9K         30    28    27    29     30    29    27    28     31    30    28    29     31    31    28    29
3K:38K        39    38    36    39     42    41    38    38     44    42    37    41     48    43    40    43
39K:520K      62    62    56    59     67    67    61    64     66    63    58    62     81    82    68    70
130K:1.9M     68    66    60    64     73    65    72    68     72    71    64    68     87    77    73    78
Fig. 5. Comparison of the time taken between Depth First Search (DF), DF using Culberson’s Iterated Greedy Algorithm and DF using Reverse Backtracking for distance-2 coloring. The left-hand side figure represents graphs from molecular dynamics and the right-hand side figure represents graphs from DNA-electrophoresis. Time is given in seconds.
5.4 Comparison with Results from Integer Programming

We can measure the effectiveness of backtracking by comparing the results with known optimal colorings. We used integer programming to find the optimal number of colors for some graphs of small size. The results provided in Table 3 demonstrate that we can also obtain optimal colorings within a few levels of backtracking. The threshold was set to the minimum number of nonzeros per column of the corresponding column intersection matrix.
Table 3. Number of colors for distance-1 coloring. The columns from left to right represent coloring using Natural ordering, Integer Programming, and Natural ordering with Backtracking. The number of levels of backtracking is given in parentheses. The IP execution for graphs with E=1555 and E=3925 could not be completed within the time bounds.

Graph          Natural  IP                 BT (Levels)
V=12 E=59      11       11                 11 (1)
V=12 E=32      8        8                  8 (1)
V=60 E=1555    50       between 45 and 50  50 (1)
V=20 E=115     10       10                 10 (1)
V=10 E=45      10       10                 10 (1)
V=14 E=77      8        7                  7 (2)
V=72 E=472     8        6                  6 (2)
V=68 E=2074    51       51                 51 (1)
V=100 E=3925   49       unknown            40 (1)
6 Discussion and Future Work

We have described a backtracking algorithm and shown that it is indeed successful in reducing the number of colors. The price for achieving a smaller number of colors is an increase in the time required to compute the coloring. The execution time is generally higher for dense matrices or coloring problems with more constraints, such as distance-2 coloring. This is to be expected, since as the number of neighbors increases, backtracking has more vertices to search. Backtracking has several provisions for lowering the execution time based on user-specific requirements, such as varying the threshold or invoking backtracking only at specific intervals. Reverse backtracking is also effective in reducing the computing time, and the number of colors given by this method is close to that of the original backtracking for distance-2 coloring. The trade-off between the reduction of colors and the execution time of the algorithm depends on the purpose of the coloring. If the underlying application uses the same partition multiple times, then an upfront large cost to obtain a near-optimal partition is justified. We have observed that the performance of backtracking is largely dependent on the top-level heuristic. One of our research goals, therefore, is to design more efficient variations of backtracking to match the top-level coloring techniques. Our other future plans include the application of backtracking in parallel coloring algorithms.

Acknowledgments. This work was supported by the Mathematical, Information, and Computational Sciences Division subprogram of the Office of Computational Technology Research, U.S. Department of Energy under Contract W-31-109-Eng-38. The idea for the backtracking algorithm was inspired by reading the excellent review on graph coloring algorithms by Assefaw Gebremedhin, Fredrik Manne and Alex Pothen [7]. We are also grateful to Assefaw Gebremedhin and Rahmi Aksu for letting us use their graph coloring software. Our implementation of the backtracking algorithm was built on top of this software. We would also like to thank Sven Leyffer for his assistance in using integer programming to find optimal colorings for several small matrices. We thank Gail Pieper for proofreading a draft of this paper.
References
1. Garey, M., Johnson, D.: Computers and Intractability: A Guide to the Theory of NP-Completeness. W.H. Freeman and Company, New York (1979)
2. Siek, J.G., Lee, L., Lumsdaine, A.: The Boost Graph Library: User Guide and Reference Manual. Addison Wesley Professional, Reading (2001)
3. Culberson, J.C.: Iterated Greedy Graph Coloring and the Difficulty Landscape. Technical Report (1992)
4. Carloni, P.: PDB Coordinates for HIV-1 Nef binding to Thioesterase II, http://www.sissa.it/sbp/bc/publications/publications.html
5. Davis, T.: University of Florida Sparse Matrix Collection (1997), http://www.cise.ufl.edu/research/sparse/matrices
6. Gross, J.L., Yellen, J.: Handbook of Graph Theory and Applications. CRC Press, Boca Raton (2004)
7. Gebremedhin, A., Manne, F., Pothen, A.: What Color is your Jacobian? Graph Coloring for Computing Derivatives. SIAM Review 47, 629–705 (2005)
8. Welsh, D.J.A., Powell, M.B.: An upper bound for the chromatic number of a graph and its application to timetabling problems. Computer J., 85–86 (1967)
9. Matula, D.W.: A min-max theorem for graphs with application to graph coloring. SIAM Review, 481–482 (1968)
10. Coleman, T., Moré, J.J.: Estimation of Sparse Jacobian Matrices and Graph Coloring Problems. SIAM Journal of Numerical Analysis 20(1), 187–209 (1983)
Automatic Identification of Fuzzy Models with Modified Gustafson-Kessel Clustering and Least Squares Optimization Methods Grzegorz Glowaty AGH University of Science and Technology, Department of Computer Science, Al. Mickiewicza 30, 30-059 Krakow, Poland [email protected]
Abstract. An automated method to generate fuzzy rules and membership functions from a set of sample data is presented. Our method is based on clustering and uses a modified version of the Gustafson-Kessel algorithm. The aim is to divide the product space into a set of clusters for which the system exhibits behavior close to linear. For each of the clusters we produce a fuzzy rule and generate a set of membership functions for the rule antecedent with use of an approach based on curve fitting. Weighted linear least-squares regression is used to obtain consequent functions for TSK models. Keywords: fuzzy modeling, fuzzy clustering, Gustafson-Kessel algorithm.
1 Introduction

Fuzzy models have proven to be effective function approximators. They are also easily interpretable because they are composed of human readable rules. Those rules can be used to understand the nature of a modeled system. This huge advantage of fuzzy modeling over many other modeling techniques motivates researchers to work on automatic methods of fuzzy modeling, as they eventually allow for easy generation of a human readable interpretation of the system. In this paper we focus on fuzzy model generation with use of fuzzy data clustering. First, we provide a general idea of the application of clustering in fuzzy model identification. We propose modifications to the Gustafson-Kessel fuzzy clustering algorithm with the purpose of producing clusters more suitable for usage in a fuzzy model. Then we show how to convert those clusters to TSK fuzzy models. At the end of this work models produced with the described method are compared with models produced by other classical fuzzy modeling approaches.
2 Clustering in Fuzzy Model Identification

Fuzzy rules introduce a natural partition of the system space. Antecedents of the rules introduce a partition of the input space. This partition defines a set of regions in which particular rules apply. The general idea behind the use of clustering techniques in
fuzzy model identification [6, 8, 10] is that if we are able to find groups of sample data that exhibit similar behavior in a given area of the system space, then we should be able to divide the problem of modeling into several smaller subspaces. In each of these subspaces we create a fuzzy rule that mimics the approximated system's behavior in this area. Fuzzy clustering methods not only find cluster centers, but also assign a membership degree of each of the samples to each of the clusters. We use this information in the generation of fuzzy rules. We modify the Gustafson-Kessel fuzzy clustering algorithm [9] and use it as the basis for our approach.
3 Finding Clusters

3.1 Desired Cluster Properties

The objective of our method is to create fuzzy rules for which antecedents are decomposed into a set of predicates for each of the variables of the input domain. This kind of model provides the best interpretability of produced rules. There are approaches [6] that use n-1 dimensional fuzzy sets as membership functions in rule antecedents (where n is the number of dimensions of the product space). However, those models are harder to interpret. In the best performing of the methods presented in [8], the Gath and Geva clustering algorithm is used and a transformation of input variables is applied. The goal of the transformation is to leverage clusters as if they were parallel to the axes of the space. That also reduces readability of the rules. In order to derive a fuzzy model from a set of fuzzy clusters in the product (input-output) space $X_1 \times \ldots \times X_{n-1} \times X_n$ (where $X_n$ is the output domain), a projection of each of the clusters onto each of the input space axes is obtained. Fuzzy clusters resulting from most of the fuzzy clustering algorithms have the shape of a sphere or hyper-ellipsoid. In the case of spheres it is easy to obtain a "projection" of a cluster onto an axis without a loss of information; however, in the case of hyper-ellipsoids the more axes of the ellipsoid are parallel to the axes of the space, the more information is preserved. Some of the approaches are based on this observation and look for clusters that have all of their axes parallel to the axes of the space [10]. For TSK fuzzy models the consequent of the fuzzy rule may be a linear function of the input variables. In this case there is no need to project the cluster onto the output axis. With this in mind we propose a modified version of the Gustafson-Kessel algorithm that finds clusters that are easily projected onto the input space, and not necessarily parallel to the output axis.

3.2 Gustafson-Kessel Algorithm

Let us assume a set of N samples in the n dimensional space. The target is to find K fuzzy clusters, such that

$$\forall i \in \{1,\ldots,N\}: \; \sum_{k=1}^{K} \mu_{k,i} = 1, \qquad (1)$$
where $\mu_{k,i}$ is the membership degree of sample i to cluster k. The Gustafson-Kessel algorithm finds clusters by minimizing the following function:
$$J_{X,m}(U, V) = \sum_{i=1}^{N} \sum_{k=1}^{K} \mu_{k,i}^{m}\, D_{A_k}^{2}(x_i - v_k), \qquad (2)$$
where U is the set of membership degrees $\mu$, V is the set of cluster centers v, m is the fuzziness factor (usually a value close to 2), X is the set of N samples x, and $D_{A_k}^{2}$ is a norm induced by the matrix $A_k$. Every cluster has its own norm inducing matrix

$$A_k = \left[\sigma_k \det(F_k)\right]^{\frac{1}{n-1}} F_k^{-1}, \qquad (3)$$
where F is a fuzzy covariance matrix defined as follows:

$$F_k = \frac{\sum_{i=1}^{N} \mu_{k,i}^{m} (x_i - v_k)(x_i - v_k)^{T}}{\sum_{i=1}^{N} \mu_{k,i}^{m}}. \qquad (4)$$
The parameter $\sigma_k$ in (3) was introduced as a cluster capacity, so the objective function minimization is not a trivial process of minimizing all values of the matrix A. Usually for the Gustafson-Kessel algorithm a destination capacity of 1 for each of the clusters is assumed. The norm $D_{A_k}^{2}$ induced by matrix $A_k$ is calculated in the following way:

$$D_{A_k}^{2}(x) = (v_k - x)^{T} A_k (v_k - x). \qquad (5)$$
Given the membership degrees, the centers of the clusters are calculated as the weighted mean of the samples, with the membership degrees as weights:

$$v_k = \frac{\sum_{i=1}^{N} \mu_{k,i}^{m}\, x_i}{\sum_{i=1}^{N} \mu_{k,i}^{m}}. \qquad (6)$$
On the other hand, given the cluster centers and the norm inducing matrices it is possible to induce the desired membership degrees of the samples in the following way:

$$\mu_{k,i} = \frac{1 / D_{A_k}^{2}(x_i)}{\sum_{j=1}^{K} 1 / D_{A_j}^{2}(x_i)}. \qquad (7)$$
The Gustafson-Kessel algorithm minimizes the function given by (2) by iterative execution of the following steps:

1. Initialize U with random membership degrees
2. Calculate centers of clusters with (6)
3. Calculate new membership degrees with (7)
4. Calculate fuzzy covariance matrices using (4)
5. Calculate norms induced by those matrices using (3) and (5)
6. If the membership degrees have changed in this iteration by more than the assumed termination value, proceed to step 2

In [10] a modification of this algorithm was proposed to restrict it to finding clusters that are parallel to all the axes of the input-output space. In this method, we propose a modification that results in finding clusters parallel to the input space axes, and not necessarily parallel to the output axis. The Gustafson-Kessel algorithm needs the number of clusters as an input parameter. We identify several models with different numbers of clusters and choose the best one according to the testing set error.

3.3 Modification of Gustafson-Kessel Algorithm to Obtain Desired Clusters
Clusters that are parallel to one of the axes tend to have a significant non-zero variance along this axis and values of all covariances of this axis variable close to zero. As was noticed in [10], a desired covariance matrix for clusters parallel to the axes is a diagonal matrix. In this work, however, we are looking for a wider class of clusters, namely clusters that are parallel only to the input-space axes. To achieve this, we lessen the restriction on the covariance of the output variable, but still do not want to introduce any covariance between input variables. This leads to clusters induced by a covariance matrix of the form

$$F_0 = \begin{pmatrix} c_1 & 0 & \cdots & 0 & 0 \\ 0 & c_2 & \cdots & 0 & 0 \\ \vdots & \vdots & \ddots & 0 & y_i \\ 0 & 0 & 0 & c_{n-1} & 0 \\ 0 & 0 & y_i & 0 & c_n \end{pmatrix}. \qquad (8)$$
If F is a fuzzy covariance matrix, F_0 is the matrix created from F by putting 0 everywhere except for the diagonal and a single place not on the diagonal in the last row and last column. The question to be answered is whether such a matrix is a valid covariance matrix. This is important because the covariance matrix needs to be positive semi-definite, so that it has a positive determinant and the norm inducing matrix A obtained by (3) exists in $\Re^{n \times n}$. It is easy to show that in general (if more of the original elements were preserved) such a matrix may not be a covariance matrix. However, restricting the values in the way shown in (8) leads to a covariance matrix in all cases.
Theorem 1. Let F be a covariance matrix. The matrix F_0 as in (8) created from F has the properties of a covariance matrix.

Proof. Let us consider only two variables, the i-th and the n-th, and their covariance matrix

$$F_{in} = \begin{pmatrix} c_i & y_i \\ y_i & c_n \end{pmatrix}. \qquad (9)$$
From properties of covariance matrices we have:
$$\det(F_{in}) \geq 0 \;\Rightarrow\; c_i c_n - y_i^{2} \geq 0. \qquad (10)$$
For F_0 to be a covariance matrix it is sufficient that it be symmetric positive semi-definite. A sufficient condition for a matrix to be positive semi-definite is that all determinants of the leading minors of the matrix are non-negative. Let $F_0^{j}$ be the j-th leading minor of $F_0$. From definition (8) and from the fact that the diagonal contains only non-negative numbers (variances), for all n-1 leading minors:
$$\forall j = 1 \ldots n-1: \; \det(F_0^{j}) = c_1 \cdot \ldots \cdot c_j \geq 0. \qquad (11)$$
The value of the last minor's determinant (of the whole matrix) is:

$$\det(F_0) = c_1 \cdot \ldots \cdot c_n - c_1 \cdot \ldots \cdot c_{i-1}\, y_i\, c_{i+1} \cdot \ldots \cdot c_{n-1}\, y_i = c_1 \cdot \ldots \cdot c_{i-1}\, c_{i+1} \cdot \ldots \cdot c_{n-1} (c_i c_n - y_i^{2}). \qquad (12)$$
From (10) and (12) we conclude $\det(F_0) \geq 0$, so the matrix (8) is a positive semi-definite symmetric matrix. It is worth noting that the condition stated by the above theorem does not hold in the general case when we leave more than one non-diagonal non-zero value in the last row and column.

We modify the Gustafson-Kessel algorithm so that it finds clusters that have covariance matrices of the form (8), meaning that their axes are not necessarily parallel to the output axis. We do this by introducing a step 4a to the algorithm:

4a. Convert the covariance matrix to the form (8) by preserving only the largest covariance value in the last row/column.

The intuition for this approach is that we would like to preserve the most significant relation in the shape of the obtained cluster. Because of the conclusions of Theorem 1, all calculations performed in the next steps of the algorithm may succeed.
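Step 4a amounts to zeroing the fuzzy covariance matrix outside the diagonal except for the single most significant input-output covariance. A minimal NumPy sketch follows; the function name and the magnitude-based selection are assumptions of this illustration, not the authors' code.

```python
import numpy as np

def restrict_covariance(F):
    """Convert a fuzzy covariance matrix F to the form (8): keep the diagonal
    and only the largest covariance in the last row/column (step 4a)."""
    F0 = np.diag(np.diag(F)).astype(float)
    off = F[-1, :-1]                          # covariances of the output variable
    i = int(np.argmax(np.abs(off)))           # most significant input-output relation
    F0[i, -1] = F0[-1, i] = F[-1, i]
    return F0
```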
4 Converting Clusters to Fuzzy Rules

Having obtained a set of fuzzy cluster centers and a set of norms induced for those clusters, the task is to create the membership functions of the rules' antecedents. In this example we use an asymmetric Gaussian type of membership function, but it must be noted that any classical type of membership function would fit our method. The membership function is based on 4 parameters determining the peak point and the shape of the left and right sides of the curve:

$$f_{\sigma_1,c_1,\sigma_2,c_2}(x) = \begin{cases} e^{-\frac{(x-c_1)^2}{2\sigma_1^2}}, & x < c_1 \\ e^{-\frac{(x-c_2)^2}{2\sigma_2^2}}, & x > c_2 \\ 1, & c_1 \leq x \leq c_2 \end{cases} \qquad (13)$$
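A direct, vectorized Python transcription of (13) may look as follows (the function name is illustrative):

```python
import numpy as np

def asym_gauss(x, sigma1, c1, sigma2, c2):
    """Asymmetric Gaussian membership function of Eq. (13)."""
    x = np.asarray(x, dtype=float)
    left = np.exp(-(x - c1) ** 2 / (2.0 * sigma1 ** 2))
    right = np.exp(-(x - c2) ** 2 / (2.0 * sigma2 ** 2))
    return np.where(x < c1, left, np.where(x > c2, right, 1.0))
```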
Some authors suggest projecting clusters onto each of the axes using fuzzy projection techniques [1, 10]. A curve fitting technique is applied to adjust the membership function parameters so that the degree of fulfillment of the premise of the rule corresponding to a given cluster reflects the membership degree of the measured samples to that cluster. In the TSK model we assume a prod-type AND operator for the rule premise. It should be noted that our technique also applies to different types of operators (e.g. min). The degree of fulfillment of a rule j is calculated as follows:

$$d_j(x) = \prod_{i=1}^{n-1} f^{(i)}(x^{(i)}), \qquad (14)$$
where $f^{(i)}(x^{(i)})$ is the value of the i-th function of the form (13) for the i-th coordinate of the vector x. We employ non-linear least squares optimization to obtain the parameters of $f^{(i)}$. The objective function under minimization for rule i is given by (15):

$$e(\Sigma^{(i)}, C^{(i)}) = \sum_{j=1}^{N} \left(\mu_{i,j} - d_i(x_j)\right)^2, \qquad (15)$$
where $\Sigma^{(i)}$, $C^{(i)}$ are the sets of parameters of the membership functions used in rule i and $\mu_{i,j}$ is the membership degree of sample j to cluster i. If the Jacobian of the objective function is analytically available we may use it in the calculations (this applies to standard Gaussian membership functions). In all other cases we may calculate a Jacobian approximation using finite differences. In this work we used a subspace trust region approach [3] available in the Matlab Optimization Toolbox, but other least squares curve fitting methods could also be applied. Numerical gradient based methods carry a risk of converging to a local minimum. In this case, however, we have a good starting point for the minimization. We can use the cluster center coordinates as initial values for the C parameters. Having the cluster center defined as very close to optimal, we have less chance of converging to some local minimum. We can also calculate a good initial guess for the Σ parameters, such that neighboring membership functions overlap. However, experiments have shown that this calculation is not necessary, as the function converges without it.
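For illustration, the premise fit minimizing (15) can be reproduced with SciPy's trust-region least-squares solver as a stand-in for the Matlab routine used in the paper. Here `asym_gauss` is the sketch given above, `X` holds the sample input coordinates, `mu` the membership degrees of one cluster, and `centers`/`spreads` are cluster-derived initial guesses; all names are assumptions of this sketch.

```python
import numpy as np
from scipy.optimize import least_squares

def fit_premise(X, mu, centers, spreads):
    """Fit one asymmetric Gaussian per input variable so that the product of
    memberships (Eq. 14) matches the cluster membership degrees (Eq. 15)."""
    n_vars = X.shape[1]

    def residuals(p):
        params = p.reshape(n_vars, 4)          # rows: (sigma1, c1, sigma2, c2)
        d = np.ones(len(X))
        for i in range(n_vars):
            s1, c1, s2, c2 = params[i]
            d *= asym_gauss(X[:, i], s1, c1, s2, c2)
        return mu - d                           # residuals of Eq. (15)

    # initial guess: both peak parameters placed at the cluster center
    p0 = np.concatenate([[spreads[i], centers[i], spreads[i], centers[i]]
                         for i in range(n_vars)])
    sol = least_squares(residuals, p0, method="trf")
    return sol.x.reshape(n_vars, 4)
```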
5 Determination of Rule Consequents

Clusters detect groupings of sample data which, due to the construction of the covariance matrix, may be approximated with a linear function. We use weighted least squares linear regression to identify the parameters of the output function for a rule. The philosophy behind using a weighted method is that samples with small membership degrees in a given rule are likely to be evaluated by other rules, so their output should not influence the output function to a big extent. Conversely, samples with high membership
degrees to a rule are evaluated primarily by this rule, so they should have a big impact on the output function. Given the output function for a rule i, we can formulate an error function for the linear regression as shown below:

$$g_i(x^{(1)},\ldots,x^{(n-1)}) = a_0 + \sum_{j=1}^{n-1} a_j x^{(j)}, \qquad (16)$$

$$e_i(A) = \sum_{j=1}^{N} \mu_{i,j} \left[ x_j^{(n)} - g_i\!\left(x_j^{(1)},\ldots,x_j^{(n-1)}\right) \right]^{2}.$$
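The weighted regression of (16) reduces to ordinary least squares after scaling rows by the square roots of the membership degrees; a small NumPy sketch with illustrative names:

```python
import numpy as np

def fit_consequent(X, y, mu):
    """Weighted linear least squares for one TSK rule consequent (Eq. 16):
    minimizes sum_j mu_{i,j} * (y_j - a0 - a . x_j)^2."""
    Phi = np.hstack([np.ones((len(X), 1)), X])    # column of ones for a0
    w = np.sqrt(mu)
    a, *_ = np.linalg.lstsq(Phi * w[:, None], y * w, rcond=None)
    return a                                      # [a0, a1, ..., a_{n-1}]
```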
6 Experimental Results

6.1 Box-Jenkins Gas Furnace
The input data [11] is a series of pairs where u(t) is the rate of flow of gas into the furnace and y(t) is the CO2 concentration at time t. With use of the method described in [1] we conclude that the output y(t) can be predicted with use of 3 variables: y(t-1), u(t-4) and u(t-3). The variable u(t-3) does not significantly improve the performance of the model, while adding computational complexity. As in [6] we use only the y(t-1) and u(t-4) variables. As the learning data set we chose the first half of the samples; the second half is used to calculate the approximation error.
Fig. 1. Membership functions of input variables obtained for gas furnace problem
The membership functions obtained with our approach are depicted in Fig. 1. The resulting TSK rule base is:

IF u(t-4) IS mf 1,1 AND y(t-1) IS mf 1,2 THEN y = -1.38u(t-4) + 0.51y(t-1) + 25.78
IF u(t-4) IS mf 2,1 AND y(t-1) IS mf 2,2 THEN y = -1.39u(t-4) + 0.54y(t-1) + 23.86   (17)
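For illustration only, the rule base (17) can be evaluated as a standard TSK system: prod-type premise activation followed by a membership-weighted average of the linear consequents. The premise parameters below are invented placeholders (the fitted mf parameters are only shown graphically in Fig. 1), so the sketch demonstrates the inference mechanics rather than reproducing the reported numbers.

```python
import numpy as np

# consequent coefficients (a_u, a_y, a_0) taken from rule base (17)
consequents = [(-1.38, 0.51, 25.78), (-1.39, 0.54, 23.86)]

# placeholder premise membership functions built from asym_gauss (parameters invented)
premises = [
    (lambda u: asym_gauss(u, 1.0, -1.5, 1.0, -0.5),    # mf 1,1
     lambda y: asym_gauss(y, 3.0, 47.0, 3.0, 52.0)),   # mf 1,2
    (lambda u: asym_gauss(u, 1.0, 0.5, 1.0, 1.5),      # mf 2,1
     lambda y: asym_gauss(y, 3.0, 54.0, 3.0, 58.0)),   # mf 2,2
]

def tsk_predict(u, y_prev):
    """Prod-type rule activation, then weighted average of rule outputs."""
    weights, outputs = [], []
    for (mf_u, mf_y), (a_u, a_y, a0) in zip(premises, consequents):
        weights.append(float(mf_u(u)) * float(mf_y(y_prev)))
        outputs.append(a_u * u + a_y * y_prev + a0)
    weights = np.array(weights)
    return float(np.dot(weights, outputs) / weights.sum())
```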
Root mean square error (RMSE) for testing set approximation with rule base (17) is 0.391. Table 1 compares our result with results obtained with different methods summarized in [6].
Table 1. Comparison of RMSE for the gas furnace problem

Method        Num. of inputs  Num. of rules  RMSE
Pedrycz (84)        2              81        0.565
Xu (87)             2              25        0.572
Sugeno (91)         6               2        0.261
Sugeno (93)         3               6        0.435
Wang (96)           2               5        0.397
Delgado (99)        2               2        0.396
Rantala (02)        4               5        0.358
This method         2               2        0.391
As can be seen, our method provides good approximation accuracy with a simple model. The Delgado [6] model provides similar accuracy, but uses input membership functions in the product space, hence not achieving the same interpretability as our model. Wang [12] also provides a model of similar accuracy but with a significantly bigger number of rules. The Sugeno [13] model, providing the best accuracy, uses significantly more input information, so these two methods cannot be directly compared on this example.

6.2 Non-linear Function Identification
As another benchmark we use a non-linear function with two input variables:
$$z = (1 + x^{-2} + y^{-1.5})^{2}, \quad 1 \leq x, y \leq 5. \qquad (18)$$
We use 50 random samples for learning and 200 other random samples for the evaluation of the system performance once it is learned. We determined the number of clusters to be 4. Figure 2 presents the actual function vs. our modeling result. As we can see, our approximation does not perform well on the boundaries of the system space. This is due to the fact that very few random samples for learning were selected on the boundary.
Fig. 2. Original function and its fuzzy model
Table 2. Comparison of RMSE for the non-linear function (18)

Method        Num. of rules  RMSE
Wang (96)           6        0.281
Delgado (99)        2        0.266
This method         2        0.233
Table 2 presents the RMSE of our approach compared with other results from the literature.

6.3 Miles Per Gallon (MPG) Prediction
We ran the test against the standard miles per gallon prediction data set [14]. We divided the data set into two equal subsets, performed learning on one of them and measured the RMSE on the other half of the data. We selected 5 inputs for our model (displacement, horsepower, weight, acceleration and year) and 4 rules. The table below shows a comparison of our result with other approaches found in the literature. It must be noted that the differences in MPG prediction are so small that they can be due to the selection of the random learning and testing sets. Our model is more complicated than the Babuska [8] model but provides more interpretability, as the other method uses a transformation of the input variables for the rules. Optimized ANFIS provides similar results to our method but with a more complex underlying model.

Table 3. Comparison of RMSE for MPG approximation
Method                   Inputs  Rules  Training RMSE  Testing RMSE
Jang (96) (linear reg.)     6      -         3.45           3.44
Babuska (02)                5      2         2.72           2.85
ANFIS                       5      6         2.48           2.85
This method                 5      4         2.76           2.84
7 Conclusions

We have shown that existing clustering based approaches to fuzzy modeling may still be improved. By modifying the clustering algorithm in use we are able to obtain accurate fuzzy models and still preserve interpretability. Additionally, it has been shown that curve fitting techniques combined with linear regression methods are a valid approach to converting clusters into fuzzy rules. As the numerical results show, our method provides satisfactory results, very often delivering a simpler model than other approaches. Moreover, the method is extensible and can easily be adapted to find membership functions of types other than Gaussian. It can also be subject to later optimization, providing a very good starting point. Optimization of the model obtained with this method is in the scope of our future work in this area.
References
1. Sugeno, M., Yasukawa, T.: A Fuzzy-Logic-Based Approach to Qualitative Modeling. IEEE Trans. on Fuzzy Systems 1(1), 7–31 (1993)
2. Jang, J.S.R.: ANFIS: Adaptive-network-based fuzzy inference systems. IEEE Trans. on System, Man and Cybernetics 23(3), 665–685 (1993)
3. Coleman, T.F., Li, Y.: An Interior, Trust Region Approach for Nonlinear Minimization Subject to Bounds. SIAM Journal on Optimization 6, 418–445 (1996)
4. Rantala, J., Koivisto, H.: Optimized Subtractive Clustering for Neuro-Fuzzy Models. In: 3rd WSEAS International Conference on Fuzzy Sets and Fuzzy Systems (2002)
5. Wang, W., Zhang, Y.: On fuzzy cluster validity indices. Fuzzy Sets and Systems 158, 2095–2117 (2007)
6. Gomez-Skarmeta, A.F., Delgado, M., Vila, M.A.: About the use of fuzzy clustering techniques for fuzzy model identification. Fuzzy Sets and Systems 106, 179–188 (1999)
7. Parekh, G., Keller, J.M.: Learning the Fuzzy Connectives of a Multilayer Network Using Particle Swarm Optimization. In: IEEE Symposium on Foundations of Computational Intelligence, pp. 591–596 (2007)
8. Abonyi, J., Babuska, R., Szeifert, F.: Modified Gath-Geva fuzzy clustering for identification of Takagi-Sugeno fuzzy models. IEEE Trans. on Systems, Man and Cybernetics 32(5), 612–621 (2002)
9. Gustafson, E.E., Kessel, W.C.: Fuzzy clustering with a fuzzy covariance matrix. In: Proc. of the IEEE Conference on Decision and Control, pp. 761–766 (1979)
10. Klawonn, F., Kruse, R.: Constructing a fuzzy controller from data. Fuzzy Sets and Systems 85, 177–193 (1997)
11. Box, G.E.P., Jenkins, G.M.: Time Series Analysis, Forecasting and Control. Holden Day, San Francisco (1970)
12. Langari, R., Wang, L.: Complex systems modeling via fuzzy logic. IEEE Trans. on Systems, Man, and Cybernetics 26(1), 100–106 (1996)
13. Sugeno, M., Tanaka, K.: Successive identification of a fuzzy model and its applications to prediction of a complex system. Fuzzy Sets and Systems 42(3), 315–334 (1991)
14. Asuncion, A., Newman, D.J.: UCI Machine Learning Repository (2007)
Extending the Four Russian Algorithm to Compute the Edit Script in Linear Space Vamsi Kundeti and Sanguthevar Rajasekaran Department of Computer Science and Engineering University of Connecticut Storrs, CT 06269, USA {vamsik,rajasek}@engr.uconn.edu
Abstract. Computing the edit distance between two strings is one of the most fundamental problems in computer science. The standard dynamic programming based algorithm computes the edit distance and edit script in O(n^2) time and space. Often the edit script is of more importance than the value of the edit distance. The Four Russian Algorithm [1] computes the edit distance in O(n^2 / log n) time but does not address how to compute the edit script within that runtime. Hirschberg [2] gave an algorithm to compute the edit script in linear space but the runtime remained O(n^2). In this paper we present algorithms that compute both the edit script and the edit distance in O(n^2 / log n) time using O(n) space. Keywords: edit distance, edit script, linear space, four russian algorithm, hirschberg's algorithm.
1 Introduction
The edit distance between strings S1 = [a1, a2, a3 ... an] and S2 = [b1, b2, b3 ... bn] is defined as the minimal cost of transforming S1 into S2 using the three operations Insert, Delete, and Change (C) (see e.g., [3]). The first application (global alignment) of the edit distance algorithm to protein sequences was studied by Needleman [4]. Later, algorithms for several variations (such as local alignment, affine gap costs, etc.) of the problem were developed (for example) in [5], [6], and [7]. The first major improvement in the asymptotic runtime for computing the value of the edit distance was achieved in [1]. This algorithm is widely known as the Four Russian Algorithm and it improves the running time by a factor of O(log n) (with a run time of O(n^2 / log n)) to compute just the value of the edit distance. It does not address the problem of computing the actual edit script, which is of wider interest than just the value. Hirschberg [2] has given an algorithm that computes the actual script in O(n^2) time and O(n) space. The space saving idea from [2] was applied to biological problems in [8] and [9]. However the asymptotic complexity of the core algorithm in each of these remained O(n^2). Also, parallel algorithms for the edit distance problem and its application to sequence alignment of biological sequences were studied
extensively (for example) in [10] and [11]. In paper [12] linear space parallel algorithms for the sequence alignment problem were given; however, they assume that O(n^2) is the optimal asymptotic complexity of the sequential algorithm. Please refer to [13] for an excellent survey of all these algorithms. A special case is one where each of these operations is of unit cost. The edit script is the actual sequence of operations that converts S1 into S2. In particular, the edit script is a sequence Escript = {X1, X2, X3 ... Xn}, Xi ∈ {I, D, C}. Standard dynamic programming based algorithms solve both the distance version and the script version in O(n^2) time and O(n^2) space. The main result of this paper is an algorithm for computing the edit distance and edit script in O(n^2 / log n) time and O(n) space. The rest of the paper is organized as follows. In Sec. 2 we provide a summary of the Four Russian algorithm [1]. In Sec. 3 we discuss the O(n^2) time algorithm that consumes O(n) space, and finally in Sec. 4 we show how to compute the edit distance and script using O(n^2 / log n) time and O(n) space.
2 Four Russian Algorithm
In this section we summarize the Four Russian Algorithm. Let D be the dynamic programming table that is filled during the edit distance algorithm. The standard edit distance algorithm fills this table D row by row after initialization of the first row and the first column. Without loss of generality, throughout this paper we assume that all the edit operations cost unit time each. The basic idea behind the Four Russian Algorithm is to partition the dynamic programming table D into small blocks each of width and height equal to t, where t is a parameter to be fixed in the analysis. Each such block is called a t-block. The dynamic programming table is divided into t-blocks such that any two adjacent t-blocks overlap by either a row or column of width (or height) equal to t. See Fig. 1 for more details on how the dynamic programming table D is partitioned. After this partitioning is done the Four Russian algorithm fills up the table D block by block. Algorithm 1 has more details. A quick qualitative analysis of the algorithm is as follows. After the partitioning of the dynamic programming table D into t-blocks we have n^2/t^2 blocks, and if processing each block takes O(t) time then the running time is O(n^2/t). In the case of standard dynamic programming, entries are filled one at a time (rather than one block at a time). Each entry can be filled in O(1) time and hence the total run time is O(n^2). In the Four Russian algorithm, there are n^2/t^2 blocks. In order to be able to fill each block in O(t) time, some preprocessing is done. Theorem 1 is the basis of the preprocessing.

Theorem 1. If D is the edit distance table then |D[i, j] − D[i + 1, j]| ≤ 1 and |D[i, j] − D[i, j + 1]| ≤ 1 for all 0 ≤ i, j ≤ n.

Proof. Note that D[i, j] is defined as the minimum cost of converting S1[1 : i] into S2[1 : j]. Every element of the table D[i, j] is filled based on the values from
D[i − 1, j − 1], D[i − 1, j] or D[i, j − 1]. We have D[i, j] ≥ D[i − 1, j − 1] (the characters S1[i] and S2[j] may be the same or different), D[i, j] ≤ D[i, j − 1] + 1 (the cost of an insert is unity), and D[i, j − 1] ≤ D[i − 1, j − 1] + 1 (the same inequality as the previous one, rewritten for the element D[i, j − 1]). The following inequalities can be derived from the previous ones:

    −D[i, j] ≤ −D[i − 1, j − 1]
    D[i, j − 1] ≤ D[i − 1, j − 1] + 1
    −D[i, j] + D[i, j − 1] ≤ 1
    D[i, j − 1] − D[i, j] ≤ 1
    D[i, j] ≤ D[i, j − 1] + 1   {started with this}
    −1 ≤ D[i, j − 1] − D[i, j]
    |D[i, j − 1] − D[i, j]| ≤ 1

Along the same lines we can also prove that |D[i − 1, j] − D[i, j]| ≤ 1 and D[i − 1, j − 1] ≤ D[i, j].

Theorem 1 essentially states that the value of the edit distance in the dynamic programming table D will either increase by 1, decrease by 1, or remain the same compared to the previous element in any row or column of D. Theorem 1 helps us encode any row or column of D with a vector over {0, 1, −1}. For example, a row in the edit distance table D[i, ∗] = [k, k + 1, k, k, k − 1, k − 2, k − 1] can be encoded with a vector vi = [0, 1, −1, 0, −1, −1, 1]. To characterize any row or column we just need the vector vi and the k corresponding to that particular row or column. For example, if D[i, ∗] = [1, 2, 3, 4, . . . , n], then k = 1 for this row and vi = [0, 1, 1, 1, 1, 1, 1, . . . , 1]. For the computation of the edit distance table D, the leftmost column and the topmost row must be filled (or initialized) before the start of the algorithm. Similarly, in this algorithm we need the topmost row (A) and the leftmost column (B) to compute the edit distance within a t-block (see Fig. 1). Also see Algorithm 2. It is essential that we compute the edit distance within any t-block in constant time. In the Four Russian algorithm the computation of each t-block depends on the variables A, B, K, C, E (see Fig. 1). The variable A represents the top row of the t-block and B represents the left column of the t-block. C and E represent the corresponding substrings of the strings S1 and S2. K is the intersection of A and B. If the value of the variable K is k then from Theorem 1 we can represent A and B as vectors over {0, 1, −1} rather than with exact values along the row and column. As an example, consider the first t-block, which is the intersection of the first t rows and the first t columns of D. For this t-block the variables {A, B, K, C, E} have the following values: K = D[0, 0], A = D[0, ∗] = [0, 1, 1, 1, . . . , 1], B = D[∗, 0] = [0, 1, 1, 1, . . . , 1], C = S2[0, 1, . . . , t], and E = S1[0, 1, . . . , t]. For any t-block we have to compute {A′, B′, K′} as a function of {A, B, K, C, E} in O(1) time. In this example, plugging in {A, B, K, C, E} for the first t-block gives K′ = D[t, t], A′ = [D[0, t], . . . , D[t, t]], B′ = [D[t, 0], . . . , D[t, t]]. To accomplish the task of computing the edit distance in a t-block in O(1) time, we precompute
Fig. 1. Using the preprocessed lookup table {A′, B′, K′} = F(A, B, C, K, E)
all the possible inputs in terms of the variables {A, B, 0, C, E}. We don't have to consider all possible values of K since if K1 is the output value we get with input variables {A, B, 0, C, E}, then the output value for inputs {A, B, K, C, E} would be K1 + K. Thus this encoding (and some preprocessing) helps us in the computation of the edit distance of a t-block in O(1) time. The algorithm is divided into two parts: a pre-processing step and the actual computation.
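A small sketch of the row encoding introduced with Theorem 1: a DP row is stored as its first value k plus a vector of offsets in {−1, 0, 1} (function names are illustrative).

```python
def encode_row(row):
    """Encode a DP row as (k, offsets) with offsets[i] = row[i] - row[i-1]."""
    k = row[0]
    offsets = [0] + [row[i] - row[i - 1] for i in range(1, len(row))]
    return k, offsets

def decode_row(k, offsets):
    """Recover the exact row values from k and the offset vector."""
    row, cur = [], k
    for d in offsets:
        cur += d
        row.append(cur)
    return row

# the example from the text, with k = 5: [k, k+1, k, k, k-1, k-2, k-1]
assert encode_row([5, 6, 5, 5, 4, 3, 4])[1] == [0, 1, -1, 0, -1, -1, 1]
```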
Algorithm 1. Four Russian Algorithm, t is a parameter to be fixed.
INPUT: Strings S1 and S2, Σ, t
OUTPUT: Optimal edit distance
/* Pre-processing step */
F = PreProcess(Σ, t);
for i = 0; i < n; i += t do
    for j = 0; j < n; j += t do
        {A′, B′, D′} = LookUpF(i, j, t);
        [D[i + t, j] . . . D[i + t, j + t]] = A′;
        [D[i, j + t] . . . D[i + t, j + t]] = B′;
    end
end
2.1 Pre-Processing Step
As we can see from the previous description, at any stage of Algorithm 1 we need to do a lookup for the edit distance of a t-block and as a result get the row and column for the adjacent t-blocks. From Theorem 1 it is evident
Algorithm 2. LookUp routine used by Algorithm 1.
INPUT: i, j, t
OUTPUT: A′, B′, D′
A = [D[i, j] . . . D[i, j + t]];
B = [D[i, j] . . . D[i + t, j]];
C = [S2[j] . . . S2[j + t]];
E = [S1[i] . . . S1[i + t]];
K = D[i, j];
/* Encode A, B */
for k = 1; k < t; k++ do
    A[k] = A[k] − A[k − 1];
    B[k] = B[k] − B[k − 1];
end
/* Although K is not used in building the lookup table F, we maintain consistency with Fig. 1 */
return {A′, B′, D′} = F(A, B, C, K, E);
that any input {A, B, K, C, E} (see Fig. 1) to a t-block can be transformed into vectors over {−1, 0, 1}. In the preprocessing stage we try out all possible inputs to a t-block and compute the corresponding output row and column {A′, B′, K′} (see Fig. 1). More formally, the row (A′) and column (B′) that need to be computed for any t-block can be represented as a function F (lookup table) with inputs {A, B, K, C, E}, such that {A′, B′, K′} = F(A, B, K, C, E). This function can be precomputed since we have only a limited number of possibilities. For any given t, we can have 3^t vectors corresponding to each of A and B. For a given alphabet of size Σ we have Σ^t possible inputs corresponding to each of C and E. K will not have any effect since we just have to add K to A′[t] or B′[t] at the end to compute K′. The time to preprocess is thus O((3Σ)^{2t} t^2) and the space for the lookup table F would be O((3Σ)^{2t} t). Since t^2 ≤ (3Σ)^t, if we pick t = log n / (3 log(3Σ)), the preprocessing time as well as the space for the lookup table will be O(n). Here we make use of the fact that the word length of the computer is Θ(log n). This in particular means that a vector of length t can be thought of as one word.

2.2 Computation Step
Once the preprocessing is completed in O(n) time, the main computation step proceeds by scanning the t-blocks row by row and filling up the dynamic programming table D. Algorithm 1 calls Algorithm 2 in the innermost for loop. Algorithm 2 takes O(t) time to encode the actual values in D and calls the function F, which takes O(1) time and returns the row (A′) and column (B′) that are used as input for other t-blocks. The runtime of the entire algorithm is O((n/t)(n/t) · t) = O(n^2/t). Since t = Θ(log n), the run time of the Four Russian Algorithm is O(n^2 / log n).
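One way to see why the table fits in O(n) words is to pack each block input into a single integer key; the sketch below uses illustrative names (the paper does not prescribe a concrete encoding) and also shows the stated choice of t.

```python
import math

def block_key(A, B, C, E, sym_index):
    """Pack offset vectors A, B (entries in {-1,0,1}) and substrings C, E
    (symbols mapped to 0..|Sigma|-1 by sym_index) into one integer key."""
    key = 0
    for d in A + B:                            # 3 possibilities per position
        key = key * 3 + (d + 1)
    for ch in C + E:                           # |Sigma| possibilities per position
        key = key * len(sym_index) + sym_index[ch]
    return key

def choose_t(n, sigma_size):
    """t = log n / (3 log(3|Sigma|)) keeps the number of table entries at O(n)."""
    return max(1, int(math.log(n) / (3 * math.log(3 * sigma_size))))
```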
3 Hirschberg's Algorithm to Compute the Edit Script
In this section we briefly describe Hirschberg's [2] algorithm that computes the edit script in O(n^2) time using O(n) space. The key idea behind this algorithm is an appropriate formulation of the dynamic programming paradigm. We make some definitions before giving the details of the algorithm.

– Let S1 and S2 be strings with |S1| = m and |S2| = n. A substring from index i to j in a string S is denoted as S[i . . . j].
– If S is a string then S^r denotes the reverse of the string.
– Let D(i, j) stand for the optimal edit distance between S1[1 . . . i] and S2[1 . . . j].
– Let D^r(i, j) be the optimal edit distance between S1^r[1 . . . i] and S2^r[1 . . . j].

Lemma 1. D(m, n) = min_{0≤k≤m} {D[n/2, k] + D^r[n/2, m − k]}.

Lemma 1 essentially says that finding the optimal value of the edit distance between strings S1 and S2 can be done as follows: split S1 into two parts (p11 and p12) and S2 into two parts (p21 and p22); find the edit distance (e1) between p11 and p21; find the edit distance (e2) between p12 and p22; finally add both distances to get the final edit distance (e1 + e2). Since we are looking for the minimum edit distance we have to find a breaking point (k) that minimizes the value of (e1 + e2). We would not miss this minimum even if we break one of the strings deterministically and find the corresponding breaking point in the other string. As a result we keep the place where we break one of the strings fixed (say we always break it in the middle). Then we find a breaking point in the other string that gives us the minimum value of (e1 + e2). The k in Lemma 1 can be found in O(mn) time and O(m) space for the following reasons. To find the k at any stage we need two rows (D[n/2, ∗] and D^r[n/2, ∗]) from the forward and reverse dynamic programming tables. Since the values in any row of the dynamic programming table depend only on the previous row, we just have to keep track of the previous row while computing the tables D and D^r. Once we find k we can also determine the path from the previous row (n/2 − 1) to row (n/2) in both dynamic programming tables D and D^r (see Fig. 2). Once we find these subpaths we can continue to do the same for the two subproblems (see Fig. 2) and continue recursively. The run time of the algorithm can be computed from the following recurrence relation:

    T(n, m) = T(n/2, k) + T(n/2, m − k) + mn
    T(n/2, k) + T(n/2, m − k) = mn/2 + mn/4 + · · · = O(mn)

In each stage we use only O(m) space and hence the space complexity is linear.
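A linear-space sketch of the row computation used to locate k: the classical two-row recurrence with unit costs. This is the O(mn) version that Sec. 4 later replaces with block lookups; names are illustrative.

```python
def dp_row(s1, s2, row_index):
    """Return the edit-distance row D(row_index, *) over columns 0..len(s1),
    keeping only two rows in memory (rows range over s2, columns over s1)."""
    m = len(s1)
    prev = list(range(m + 1))                  # row 0
    for i in range(1, row_index + 1):
        cur = [i] + [0] * m
        for j in range(1, m + 1):
            cost = 0 if s2[i - 1] == s1[j - 1] else 1
            cur[j] = min(prev[j] + 1,          # delete
                         cur[j - 1] + 1,       # insert
                         prev[j - 1] + cost)   # change / match
        prev = cur
    return prev

def split_point(s1, s2):
    """The k of Lemma 1: argmin of D(n/2, k) + D^r(n/2, m - k)."""
    half = len(s2) // 2
    fwd = dp_row(s1, s2, half)
    rev = dp_row(s1[::-1], s2[::-1], len(s2) - half)
    return min(range(len(s1) + 1), key=lambda k: fwd[k] + rev[len(s1) - k])
```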
Fig. 2. Illustration of Hirschberg's recursive algorithm
4 Our Algorithm
Our algorithm combines the framework of the Four Russian algorithm with that of Hirschberg's algorithm. It finds the edit script in O(n^2 / log n) time using linear space. We extend the Four Russian algorithm to accommodate Lemma 1 and to compute the edit script in O(n) space. At the top level of our algorithm we use a dynamic programming formulation similar to that of Hirschberg. Our algorithm is recursive and in each stage of the algorithm we compute k and also find the sub-path as follows:

    D(m, n) = min_{0≤k≤m} {D(n/2, k) + D^r(n/2, m − k)}

The key question here is how to use the Four Russian framework in the computation of D(n/2, k) and D^r(n/2, m − k) for any k in time better than O(n^2). Hirschberg's algorithm needs the rows D(n/2, ∗) and D^r(n/2, ∗) at any stage of the recursion. In Hirschberg's algorithm, at recursive stage R(m, n), D(n/2, k) and D^r(n/2, m − k) are computed in O(mn) time. We cannot use the same approach since the run time would be Ω(n^2). We have to find a way to compute the rows D(n/2, ∗) and D^r(n/2, ∗) with a run time of O(n^2 / log n). The top-level outline of our algorithm is illustrated by the pseudo-code in TopLevel (see Algorithm 3). The algorithm starts with input strings S1 and S2 of length m and n, respectively. At this level the algorithm applies Lemma 1 and finds k. Since the algorithm requires D(n/2, ∗) and D^r(n/2, ∗) at this level, it calls the algorithm FourCompute to compute the rows D(n/2, ∗), D(n/2 − 1, ∗), D^r(n/2, ∗) and
D^r(n/2 − 1, ∗). Note that although for finding k we require the rows D(n/2, ∗) and D^r(n/2, ∗), to compute the actual edit script we require the rows D(n/2 − 1, ∗) and D^r(n/2 − 1, ∗). Also note that these are passed to the algorithm FindEditScript to report the edit script around index k. Once the algorithm finds the appropriate k for which the edit distance is minimum at this stage, it divides the problem into two sub-problems (see Fig. 2): (S1[1 . . . k1 − 1], S2[1 . . . n/2 − 1]) and (S1[m − k2 + 1 . . . m], S2[n/2 + 1 . . . n]). Observe that k1 and k2 are returned by FindEditScript. FindEditScript determines whether the sub-path passes through the row n/2 (at the corresponding level of recursion) and updates k so that we can create the sub-problems (please see the arcs (sub-paths) in Fig. 2). Once the sub-problems are properly updated, the algorithm solves each of them recursively.

We now describe the algorithm FourCompute, which finds the rows D(n/2, ∗) and D^r(n/2, ∗) (required at each recursive stage of TopLevel (Algorithm 3)) in time O(nm/t), where t is the size of the blocks used in the Four Russian Algorithm. We do exactly the same pre-processing as the Four Russian Algorithm and create the lookup table F. FourCompute is called for both the forward (S1, S2) and reverse strings (S1^r, S2^r). The lookup table F(A, B, K, C, E) has been created for all the strings from Σ of length t, so we can use the same lookup table F for all the calls to FourCompute. A very important fact to remember is that in the Four Russian algorithm, whenever a lookup call is made to F the outputs {A′, B′} are always aligned at rows which are multiples of t, i.e., at any stage of the Four Russian algorithm we only require the values of the rows D(i, ∗) such that i mod t = 0. In our case we cannot directly use the Four Russian Algorithm in the algorithm FourCompute because the lengths of the strings passed to FourCompute from each recursive level of TopLevel are not necessarily multiples of t. Suppose that in some stage of the FourCompute algorithm a row i is not a multiple of t. We apply the Four Russian Algorithm and compute up to row D(⌊i/t⌋t, ∗), find the values in the row D(⌊i/t⌋t − t, ∗) and apply lookups for rows ⌊i/t⌋t − t, ⌊i/t⌋t − t + 1, . . ., and ⌊i/t⌋t − t + i mod t. Basically we need to slide the t-block from the row ⌊i/t⌋t − t to ⌊i/t⌋t − t + i mod t. Thus we can compute any row that is not a multiple of t with an extra (i mod t) · (m/t) time (where m is the length of the string represented across the columns). We can also use the standard edit distance computation in rows ⌊i/t⌋t, ⌊i/t⌋t + 1, . . ., ⌊i/t⌋t + i mod t, which also takes the same amount of extra time. Also consider the space used while we compute the required rows in the FourCompute algorithm. We used only O(m + n) space to store the arrays D[0, ∗] and D[∗, 0] and reused them. So the space complexity of the algorithm FourCompute is linear. The run time is O((n/t)(m/t) · t) = O(nm/t) to compute a row D(n, ∗) or D^r(n, ∗). We arrive at the following Lemma.

Lemma 2. Algorithm FourCompute computes the rows D^r(n/2, ∗), D(n/2, ∗) required by Algorithm TopLevel at any stage in O(mn/t) time and O(m + n) space.
The run time of the complete algorithm is as follows, where c is a constant:

    T(n, m) = T(n/2, k) + T(n/2, m − k) + c·mn/(2t)
    T(n, m) = c(mn/(2t) + mn/(4t) + · · ·) = O(mn/t)

Since t = Θ(log n), the run time is O(n^2 / log n).

Algorithm 3. TopLevel, which calls FourCompute at each recursive level.
Input: Strings S1, S2, |S1| = m, |S2| = n
Output: Edit distance and edit script
D(n/2, ∗) = FourCompute(n/2, m, S1, S2, D(∗, 0), D(0, ∗));
D^r(n/2, ∗) = FourCompute(n/2, m, S1^r, S2^r, D^r(∗, 0), D^r(0, ∗));
/* Find the k which gives the minimum edit distance at this level */
Minimum = (m + n);
for i = 0 to m do
    if (D(n/2, i) + D^r(n/2, m − i)) < Minimum then
        k = i;
        Minimum = D(n/2, i) + D^r(n/2, m − i);
    end
end
/* Compute the edit scripts at this level */
k1 = FindEditScript(D(n/2, ∗), D(n/2 − 1, ∗), k, Forward);
k2 = FindEditScript(D^r(n/2, ∗), D^r(n/2 − 1, ∗), k, Backward);
/* Make recursive calls if necessary */
TopLevel(S1[1 . . . k1 − 1], S2[1 . . . n/2 − 1]);
TopLevel(S1[m − k2 + 1 . . . m], S2[n/2 + 1 . . . n]);

4.1 Space Complexity
The space complexity is the maximum space required at any stage of the algorithm. We have two major stages where we need to analyze the space complexity: the first during the execution of the entire algorithm and the second during preprocessing and storing the lookup table.

4.2 Space during the Execution
The space for algorithm TopLevel is clearly linear since we need to store just 4 rows at any stage: the rows D(n/2, ∗), D(n/2 − 1, ∗), D^r(n/2, ∗) and D^r(n/2 − 1, ∗). From Lemma 2 the space required for FourCompute is also linear. So the space complexity of the algorithm during execution is linear.

4.3 Space for Storing Lookup Table F
We also need to consider the space for storing the lookup table F. The space required to store the lookup table F is also linear for an appropriate value of t (as has been shown in Sec. 2.1). The runtime of the algorithm is O(n^2 / log n).
5 Conclusion
In this paper we have shown that we can compute both the edit distance and the edit script in $O(n^2/\log n)$ time using $O(n)$ space. Acknowledgments. This research has been supported in part by NSF Grant ITR-0326155 and a UTC endowment.
Accuracy of Baseline and Complex Methods Applied to Morphosyntactic Tagging of Polish Marcin Kuta, Michal Wrzeszcz, Pawel Chrzaszcz, and Jacek Kitowski Institute of Computer Science, AGH-UST, al. Mickiewicza 30, Kraków, Poland {mkuta,kito}@agh.edu.pl
Abstract. The paper presents baseline and complex part-of-speech taggers applied to the modified corpus of Frequency Dictionary of Contemporary Polish. The accuracy of 5 baseline part-of-speech taggers is reported. On the basis of these results, complex methods are developed. Thematic split and attribute split methods are proposed and evaluated. Finally, the tagging accuracy of voting methods is evaluated. The most accurate baseline taggers are SVMTool (for the simple tagset) and fnTBL (for the complex tagset). The voting method called Total Precision achieves the top accuracy among all examined methods. Keywords: part-of-speech tagging, natural language processing.
1
Introduction
Part-of-speech (POS) tagging algorithms are intensively exploited in a wide range of applications including syntactic and semantic parsing, speech recognition and generation, ontology construction, machine translation, text understanding, information retrieval and many others [1]. Unfortunately, POS tagging of highly inflecting languages like Polish is much more challenging than its application to analytic languages (e.g. English or French), as the former are annotated with large, complex tagsets describing many morphological categories. POS tagging algorithms are computationally time demanding, especially when applied to inflecting languages; in this domain, training times exceeding 24 h are nothing extraordinary. The time requirements of complex algorithms like split models or voting methods are moreover one order of magnitude higher. The paper examines the accuracy of selected baseline algorithms applied to morphosyntactic tagging of Polish. Next, more complex methods are investigated, both originally proposed (split models) and already known but not yet evaluated on Polish (voting methods). We also present results for the simple tagset (only the first attribute of each tag considered) to provide a point of reference to English and other languages described by small tagsets. The taggers are evaluated on the modified corpus of Frequency Dictionary of Contemporary Polish (m-FDCP), the authors' improved version [2] of the standard FDCP corpus available at the [3] site. The m-FDCP corpus is annotated with the complex tagset, containing over 1200 tags. Tags consist of a set of attributes,
each attribute describing a selected morphological category. A token is the entity subject to tagging. A word segment is a token containing at least one letter or digit. By raw text we mean a sequence of tokens without their tags.
2
Baseline Tagging Algorithms
POS tagging algorithms are roughly divided into statistical and rule-based methods. For a given token w, all methods examine its context, a window of N tokens and tags centred on the token w. The rule-based methods take into account a wide context, which is desirable for languages containing long-distance syntactic dependencies. Statistical algorithms map a sequence of tokens into a sequence of tags with a probability model, which describes the occurrence of the most probable sequence of tags for a given sequence of tokens. 2.1
Evaluated Algorithms
In the paper we first investigate five baseline algorithms, providing new results compared to [2] [4]. The methods furthermore serve as components for the construction of more sophisticated taggers.

Hidden Markov Model (HMM). The algorithm belongs to the statistical methods. Given a sequence of tokens, $w_1, \dots, w_n$, the HMM tagger assigns a sequence of tags, $T = (t_1, \dots, t_n)$, according to the formula
$$\hat{T} = \arg\max_{T} \prod_{i}^{n} p(w_i \mid t_i) \cdot p(t_i \mid t_{i-1}, \dots, t_{i-N}), \qquad (1)$$
where $p(w_i \mid t_i)$ is the conditional probability of occurrence of word $w_i$ given that tag $t_i$ occurred, and $p(t_i \mid t_{i-1}, \dots, t_{i-N})$ is the conditional probability of occurrence of tag $t_i$ given that the tag sequence $t_{i-1}, \dots, t_{i-N}$ previously occurred.

Maximum entropy. This statistical method [5] aims to maximize the entropy function by selection of binary features reflecting dependencies in a training corpus. The model assumes a set of binary features, $f_j$, defined on the combination of a tag $t_i$ and its context $c$. The probabilistic model is built from the family of models
$$p(t_i, c) = \pi \mu \prod_{j} \alpha_j^{f_j(t_i, c)}, \qquad (2)$$
where $p(t_i, c)$ stands for the joint distribution of tags and contexts and $\pi$, $\mu$ are normalisation factors ensuring that $p(\cdot,\cdot)$ forms a probability function.

Memory-based learning. The algorithm exploits a rule-based approach. The tagger [6] acquires examples from training corpora, which are later used in the tagging process. During the learning process, memory-based taggers store in memory a set of examples $(t_i, c_i)$, where $t_i$ denotes the tag and $c_i$ its context. Given a token
w in context c, the memory-based tagger assigns it a tag $t_k$ such that the distance between $c$ and $c_k$ is minimal.

Transformation-based error-driven learning. This rule-based method [7] starts with assigning a trivial sequence of tags to a given tokenised text. The target sequence of tags is determined by applying a series of transformations. Each transformation, F, is a rule of the form: "Replace the value of tag t with value y if the current context c fulfils condition φ." The core of the learning process is the algorithm for finding suitable transformations.

Support vector machines. The SVM is a statistical algorithm which maps input data to a higher dimensional space and then constructs a linear separating hyperplane with the maximal margin. The mapping is done with kernel functions, of which linear, polynomial, radial basis (RBF) and sigmoid functions are the most used. The SVM approach achieves 97.16% accuracy for English [8]. As implementations of the above algorithms, the following taggers have been chosen for evaluation:

Table 1. Taggers used in experiments

  Algorithm             | Tagger name
  ----------------------|------------------------------------------
  HMM                   | TnT [9]
  Maximum entropy       | MXPost [5] (referred to further as MXP)
  Transformation based  | fnTBL [10]
  Memory based          | MBT [6]
  SVM                   | SVMTool [8] (referred to further as SVM)
Next the complex methods have been worked out: the split models have been elaborated and the collective methods evaluated.
3
Split Models
3.1
Thematic Split
To benefit from the thematic split approach, the corpus structure should be nonuniform, i.e., it should consist of segments diversified in language style. This is the case for the m-FDCP corpus, containing 5 segments (thematic parts) differing in vocabulary, style, etc.; e.g., the average sentence length varies from 10.42 tokens/sentence (artistic drama) to 23.27 tokens/sentence (popular science). These differences mean that tagging rules useful in one segment may become inefficient for another one. Thus, instead of providing one overall language model created from the whole corpus, it is worth considering building a number of separate models, each acquired with a baseline tagger from a different thematic part.
3.2
Attribute Split
The attribute split method is applicable only to corpora annotated with complex tagsets. Assuming the tagset provides for the presence of K morphological categories, the entire corpus is replicated K times, each copy corresponding to one morphological category. The i-th copy (1 ≤ i ≤ K) contains the whole text (all tokens), annotated with a small tagset $T_i$. The tagset $T_i$ is obtained from the complex tagset by removing from each tag all attributes except the attribute describing the i-th morphological category. If the i-th morphological category is not applicable to a given token, the token is annotated with a special tag none in the relevant copy. Next, the training procedure and the tagging of the raw test set are performed separately on each copy with a given baseline algorithm. The partition into training and test sets remains the same as in the original corpus annotated with the complex tagset. Finally, the K output files generated by tagging the raw test set are merged into one file, where each token is again annotated with all the relevant morphological categories. The merged file is evaluated against the test set of the corpus annotated with the complex tagset.
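As an illustration, the bookkeeping of the attribute split method is sketched below. This is only a minimal sketch: the tag representation (a K-tuple of attribute values) and the none marker are assumptions for illustration, and the actual training and tagging would be delegated to one of the baseline taggers of Table 1.

```python
# Hypothetical sketch of attribute split: a complex tag is treated as a tuple
# of K attribute values, the corpus is split into K single-attribute copies,
# each copy is tagged independently, and the outputs are merged back.

K = 9  # number of morphological categories in the tagset

def split_corpus(corpus):
    """corpus: list of (token, complex_tag) with complex_tag a K-tuple."""
    copies = [[] for _ in range(K)]
    for token, tag in corpus:
        for i in range(K):
            # attribute not applicable to this token -> special tag 'none'
            copies[i].append((token, tag[i] if tag[i] is not None else "none"))
    return copies

def merge_outputs(tagged_copies):
    """Zip the K per-attribute outputs back into complex tags."""
    merged = []
    for rows in zip(*tagged_copies):
        token = rows[0][0]
        merged.append((token, tuple(tag for _, tag in rows)))
    return merged
```

Only the splitting and merging steps are shown; evaluation against the complex-tagset test set proceeds exactly as for the baseline taggers.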
4
Collective Methods
Collective methods and their performance on English and Dutch are shown in the fundamental work [11]. The idea of collective methods is based on the assumption that different baseline methods make errors in different places, i.e., baseline methods are to some extent complementary. The higher the complementarity of the taggers, the bigger the chance that the combined system compensates the errors of its constituents and performs better than the components alone. A few independent baseline taggers (components) propose a tag simultaneously [11] [12] [13] according to their algorithms described in Sect. 2.1. The proposed tags are compared and the best tag is selected according to an arbitration mechanism. Several voting strategies are available as the arbitration. Assuming the reference test corpus contains n tokens $w_1, \dots, w_n$, a token $w_i$ is annotated in the corpus with a tag $t_i$ ($1 \le i \le n$), and a tagger A guesses for a token $w_i$ a tag $t_i^A$, the accuracy of the tagger A is defined as follows:
$$\mathrm{accuracy} \stackrel{\mathrm{df}}{=} \frac{\#\text{correctly tagged tokens}}{\#\text{all tokens}} = \frac{\sum_{i=1}^{n} \delta(t_i^A, t_i)}{n}, \qquad (3)$$
where $\delta$ is the Kronecker delta function. The sentence accuracy is the ratio of entirely correctly tagged sentences to the total number of sentences. If a tagger B guesses a tag $t_i^B$ respectively, the complementarity of the tagger B to the tagger A, comp(B|A), is determined as follows [14]:
$$\mathrm{comp}(B|A) \stackrel{\mathrm{df}}{=} 1 - \frac{\#\text{common errors of taggers A and B}}{\#\text{errors of tagger A}}. \qquad (4)$$
Given a tag X, the precision and recall of the tagger A on the tag X are given as:
$$\mathrm{prec}_X \stackrel{\mathrm{df}}{=} \frac{\#\text{tokens tagged and annotated with X}}{\#\text{tokens tagged with X}}, \qquad (5)$$
$$\mathrm{recall}_X \stackrel{\mathrm{df}}{=} \frac{\#\text{tokens tagged and annotated with X}}{\#\text{tokens annotated with X}}. \qquad (6)$$
Majority is a simple voting method, with exactly one vote assigned to each tagger. With weighted voting methods, each tagger votes with its accuracy (Total Precision method) or with $\mathrm{prec}_X$ (Tag Precision or Precision-Recall method). Additionally, a tagger may be obliged to support tags other than the one it suggested itself (Precision-Recall method, with weight $1 - \mathrm{recall}_Y$). Ties are resolved by a random selection amongst the winning tags. The vote strength of the particular methods is summarised in Table 2.

Table 2. Vote strength of a component tagger in various voting methods; X stands for the tag proposed by the tagger, Y (Y ≠ X) represents each tag proposed by the opposition

  Voting method     | Tag X     | Tag Y
  ------------------|-----------|-------------
  Majority          | 1         | 0
  Total Precision   | accuracy  | 0
  Tag Precision     | precX     | 0
  Precision-Recall  | precX     | 1 − recallY
For each tagger parameters accuracy, precX and recallX (for each tag X) are to be determined with help of a tuning set, disjoint from a test set, to avoid artificial boosting of the above methods.
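A minimal sketch of the per-token arbitration under these weighting schemes is shown below. The data structures (dictionaries of tagger accuracies and per-tag precision/recall estimated on the tuning set) are illustrative assumptions, not the interfaces of the original system.

```python
import random
from collections import defaultdict

def vote(proposals, accuracy, prec, recall, method="total_precision"):
    """proposals: {tagger: proposed_tag}; returns the winning tag for one token."""
    scores = defaultdict(float)
    tags = set(proposals.values())
    for tagger, tag_x in proposals.items():
        if method == "majority":
            scores[tag_x] += 1.0
        elif method == "total_precision":
            scores[tag_x] += accuracy[tagger]
        elif method == "tag_precision":
            scores[tag_x] += prec[tagger].get(tag_x, 0.0)
        elif method == "precision_recall":
            scores[tag_x] += prec[tagger].get(tag_x, 0.0)
            for tag_y in tags - {tag_x}:            # support for opposing tags
                scores[tag_y] += 1.0 - recall[tagger].get(tag_y, 0.0)
    best = max(scores.values())
    return random.choice([t for t, s in scores.items() if s == best])  # random tie-break
```

The accuracy, prec and recall weights correspond directly to Table 2 and would be estimated on the tuning set described above.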
5
Evaluated Data and Experiments Setup
We used the modified corpus of Frequency Dictionary of Contemporary Polish [15], annotated with the slightly abridged version of the IPI PAN tagset [16]. The m-FDCP corpus is partially corrected and disambiguated version of the FDCP corpus [3], both with manual checking and automatic procedures. The whole process of corpus improving has been described in details in [2]. The corpus is balanced between five thematic parts: (A) popular science, (B) news dispatches, (C) editorials and longer articles, (D) artistic prose and (E) artistic drama, each standing for approximately 20% of the corpus and representing different style of the language. The used tagset provides for 9 morphological categories: grammatical class (part of speech or POS), number, case, gender, person, degree, aspect, negation, vocalicity. The main parameters of the corpus are gathered in Table 3 (4th column). The baseline algorithms have been evaluated with the split of the m-FDCP corpus to a training and test set, standing for 90% and 10% of the corpus, respectively. The balanced character of the training and test set was carefully preserved within the split. Their characteristics are gathered in Table 3. The setup for the thematic split approach was the following: for each of 5 thematic parts of the m-FDCP corpus the separate experiment with the baseline
Table 3. Main parameters of the m-FDCP corpus [2]
                         Training (90%)  Test (10%)  Full (100%)
  tokens                 592729          65927       658656
  word segments          496907          55139       552046
  sentences              36601           4211        40812
  different tokens       87097           19557       92872

  Simple tagset
  tagset size            30              30          30
  ambiguous tokens, %    26.15           26.19       26.16
  mean token ambiguity   1.44            1.43        1.44

  Complex tagset
  tagset size            1191            724         1243
  ambiguous tokens, %    47.76           47.65       47.74
  mean token ambiguity   3.12            3.12        3.12
taggers was performed. Each part was split to a training and test set in 90%/10% ratio (18% and 2% of the whole corpus respectively). In the attribute split approach we used the entire m-FDCP corpus, divided to training and test sets as for baseline tagging. This corpus was replicated 9 times, each with an appropriate tagset, as there are 9 morphological categories. The setup for voting methods was slightly more complex. We avoided partition of the corpus to training, tuning and test sets in 80%/10%/10% ratio [12] but instead the tuning set was created on the base of the 90% training set, used already within baseline tagging experiments. The 90% training set was divided into 9 equal parts among which 8 parts served for training and one part for testing. The procedure was repeated 9 times, with a different part serving for testing each time. The 9 output files were merged into one file, standing for the tuning set; cf. [11]. In this way two important aims have been achieved. We got bigger training set (90% instead of 80% of the corpus) for training the baseline taggers and at the same time 9 times bigger tuning set (90% instead of 10% of the corpus). As baseline components we used the taggers from Table 1, prepared already for the baseline experiments. But when applied to the complex tagset, voting methods omitted the SVM tagger, as achieving much lower accuracy than the rest of components. All experiments were performed at the ACC Cyfronet AGH-UST site on the SMP supercomputer, SGI Altix 3700, equipped with 128 1.5 GHz Intel Itanium 2 processors, 256 GB RAM and 4.75 TB disk storage. The taggers have been compiled with standard optimization level. Depending on baseline tagger chosen training time varies considerably from 3 to 980 · 102 seconds and tagging speed from 5 to 220 · 102 tokens/second [4].
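The tuning-set construction just described amounts to a 9-fold jackknife over the training data; a compact sketch is given below, with train and tag as stand-ins for an actual baseline tagger rather than any real tool's interface.

```python
def build_tuning_set(train_set, train, tag, folds=9):
    """train_set: list of (token, gold_tag).  Returns (token, gold, predicted)
    triples where each fold is tagged by a model trained on the other folds."""
    size = len(train_set) // folds
    parts = [train_set[i * size:(i + 1) * size] for i in range(folds - 1)]
    parts.append(train_set[(folds - 1) * size:])      # last part takes the remainder
    tuning = []
    for i, held_out in enumerate(parts):
        rest = [item for j, part in enumerate(parts) if j != i for item in part]
        model = train(rest)                           # train on the other 8 parts
        predicted = tag(model, [tok for tok, _ in held_out])
        tuning.extend((tok, gold, pred)
                      for (tok, gold), pred in zip(held_out, predicted))
    return tuning
```

The merged output plays the role of the tuning set on which the accuracy, precision and recall weights of the voting methods are measured.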
6 Results

6.1 Results for Baseline Methods
The accuracy of the 5 baseline tagging algorithms is presented in Table 4.

Table 4. Accuracy of baseline taggers trained on 90% of the m-FDCP corpus, [%]

  Simple tagset                     TnT    MXP    fnTBL  MBT    SVM
  All tokens                        96.20  96.30  96.51  95.74  96.74
  Known tokens                      96.98  97.01  97.51  97.10  97.52
  Unknown tokens                    88.65  89.43  86.89  82.60  89.14
  Ambiguous tokens                  89.50  91.09  91.36  89.94  91.36
  Word segments                     95.46  95.57  95.83  94.91  96.10
  Word segments with known tags     96.94  96.79  97.55  97.08  97.60
  Word segments with unknown tags   0.00   28.21  3.85   0.00   0.00
  Unknown word segments             88.65  89.43  86.90  82.60  89.15
  Sentences                         61.48  62.15  63.71  58.54  65.16

  Complex tagset                    TnT    MXP    fnTBL  MBT    SVM
  All tokens                        86.33  85.00  86.79  82.31  73.54
  Known tokens                      88.97  87.53  89.76  85.75  81.12
  Unknown tokens                    60.86  60.55  58.09  49.06  0.23
  Ambiguous tokens                  78.66  78.71  80.34  72.48  63.44
  Word segments                     83.66  82.07  84.21  78.85  68.36
  Word segments with known tags     90.73  87.44  90.27  86.58  80.69
  Word segments with unknown tags   0.00   29.84  30.50  0.66   0.00
  Unknown word segments             60.86  60.55  58.09  49.05  0.23
  Sentences                         28.95  26.88  29.87  22.51  12.04
6.2 Results for Split Models
The observed decline of accuracy with the thematic split model (Table 5, columns 2-7) is due to the considerable reduction of training set sizes compared to the setup for the baseline algorithms, which might mask the benefits of the thematic split. To prevent this masking effect we performed another experiment: we examined the baseline algorithms with a uniform (balanced) training and test set of the same sizes as for the thematic split, i.e., 18% and 2% of the entire corpus (Table 5, column 8). The higher accuracy of the thematic split model over this additional experiment is apparent (column 8 vs. column 7 of Table 5). We also note an accuracy degradation of the attribute split model (Table 6) compared to the baseline taggers. Only the attribute split model for SVM shows an improvement over its baseline tagger, which can be explained by the considerably low accuracy of the SVM baseline.
Table 5. Accuracy of taggers trained on thematic parts A-E, average accuracy (arithmetic mean) over all thematic parts, and accuracy of taggers trained on the uniform 18%/2% set, [%]

  Simple tagset   A      B      C      D      E      average  uniform
  TnT             95.07  95.24  95.24  96.49  95.39  95.49    94.82
  MXP             94.83  94.85  94.63  95.98  94.66  94.99    94.46
  fnTBL           94.58  95.40  94.98  95.63  95.20  95.16    94.64
  MBT             94.00  94.69  94.14  95.61  94.40  94.57    93.80
  SVM             95.39  95.83  95.44  96.62  95.49  95.75    94.93

  Complex tagset  A      B      C      D      E      average  uniform
  TnT             82.61  83.35  82.24  82.39  85.79  83.28    83.26
  MXP             79.05  80.34  79.47  79.83  84.02  80.54    79.22
  fnTBL           80.74  80.80  80.43  80.99  84.00  81.39    80.92
  MBT             77.91  77.80  77.13  78.42  82.92  78.84    78.03
  SVM             66.69  66.72  65.42  67.94  74.53  68.26    67.69
Table 6. Tagging accuracy of the attribute split model applied to the complex tagset; the following rows give the accuracy of its individual components, [%]

                    TnT     MXP     fnTBL   MBT     SVM
  attribute split   81.73   82.55   81.73   81.00   84.06
  POS               96.20   96.30   96.44   95.74   96.74
  number            97.08   97.10   96.74   96.77   97.42
  case              91.00   92.81   92.08   91.03   92.81
  gender            92.46   91.95   91.13   91.86   93.01
  person            99.31   99.45   99.54   99.38   99.53
  degree            98.13   98.44   98.59   98.11   98.58
  aspect            97.73   97.91   98.03   97.54   98.19
  negation          98.82   98.91   98.91   98.70   98.92
  vocalicity        100.00  100.00  100.00  100.00  100.00
6.3 Results for Voting Methods
The potential for improving the baseline algorithms by voting methods is expressed by complementarity (Table 7). A complementarity value near 0% indicates that a pair of taggers gives similar results and no accuracy increase can be expected from voting methods. The closer the complementarity values are to 100%, the higher the margin for an accuracy increase of voting methods. The accuracy of the voting methods themselves is given in Table 8. The methods are ordered according to their growing complexity. The effort connected with gathering the weights required by the voting methods and with the creation of the tuning set paid off: voting methods achieve the highest accuracy among all presented algorithms. The results of the Total Precision method are especially encouraging.
Table 7. Complementarity, comp(B|A), of baseline taggers trained on 90% of the m-FDCP corpus, [%]

  (a) simple tagset
  B \ A    TnT    MXP    fnTBL  MBT    SVM
  TnT      -      35.72  31.29  39.21  25.74
  MXP      37.32  -      36.72  46.58  30.30
  fnTBL    36.84  40.35  -      41.03  25.65
  MBT      31.80  38.55  28.03  -      23.79
  SVM      36.16  38.55  30.47  41.60  -

  (b) complex tagset
  B \ A    TnT    MXP    fnTBL  MBT    SVM
  TnT      -      39.13  32.28  39.25  55.39
  MXP      33.20  -      33.25  43.19  56.42
  fnTBL    34.56  41.22  -      41.07  54.05
  MBT      21.39  33.01  21.09  -      39.96
  SVM      13.62  23.12  7.94   10.17  -
Table 8. Accuracy of voting methods, [%]

  Voting method     Simple tagset  Complex tagset
  Majority          96.93          87.14
  Total Precision   96.95          88.03
  Tag Precision     96.93          87.41
  Precision Recall  96.93          87.21
7
Conclusions
According to our experiments, the following conclusions can be drawn. The SVM tagger achieves the highest, state-of-the-art accuracy among the baseline methods for the simple tagset, although it is useless when applied to the complex tagset. For the complex tagset, fnTBL yields both the highest overall accuracy and the highest sentence accuracy among the baseline methods. If the amount of unknown tokens is prevailing, MXPost should be considered. Attribute split models give lower results than the baseline taggers except in the SVM case, whose baseline accuracy is, however, significantly low. The thematic split model has not outperformed any baseline tagger. The additional experiment with the uniform 20% set of the m-FDCP corpus proved, however, the potential usefulness of the model. The condition of its applicability is that a corpus consists of several thematic segments. The segments should be large enough that the benefit of separate thematic models dominates the accuracy degradation tied to the reduction of the training set size. Voting methods are the most complex methods, requiring the preparation of several baseline taggers. Elaboration of a reliable tuning set is another computationally demanding issue. Each of these methods outperforms the most accurate baseline method. The Total Precision method achieves the highest accuracy and at the same time is simpler than the Tag Precision and Precision Recall methods, yielding in simplicity only to the Majority method. Acknowledgments. This research is partially supported by the Polish Ministry of Science and Higher Education, grant no. 11.11.120.777. ACC CYFRONET AGH is acknowledged for the computing time.
References 1. Mauco, M., Leonardi, M.: A derivation strategy for formal specifications from natural language requirements models. Computing and Informatics 26(4), 421–445 (2007) P., Kitowski, J.: Increasing quality of the Corpus of Frequency 2. Kuta, M., Chrzaszcz, Dictionary of Contemporary Polish for morphosyntactic tagging of the Polish language. Computing and Informatics (to appear) 3. Corpus of Frequency Dictionary of Contemporary Polish, http://www.mimuw.edu.pl/polszczyzna P., Kitowski, J.: A case study of algorithms for morphosyntac4. Kuta, M., Chrzaszcz, tic tagging of Polish language. Computing and Informatics 26(6), 627–647 (2007) 5. Ratnaparkhi, A.: A maximum entropy model for part-of-speech tagging. In: Proc. of the 1st Conf. on Empirical Methods in Natural Language Processing, Univ. of Pennsylvania, USA, pp. 133–142 (1996) 6. Daelemans, W., Zavrel, J., Berck, P., Gillis, S.: MBT: A memory-based part of speech tagger-generator. In: Proc. of the 4th Workshop on Very Large Corpora, Copenhagen, Denmark, pp. 14–27 (1996) 7. Brill, E.: Transformation-based error-driven learning and natural language processing: A case study in part of speech tagging. Computational Linguistics 21(4), 543–565 (1995) 8. Gim´enez, J., M` arquez, L.: SVMTool: A general POS tagger generator based on Support Vector Machines. In: Proc. of the 4th Int. Conf. on Language Resources and Evaluation, Lisbon, Portugal, pp. 43–46 (2004) 9. Brants, T.: TnT - a statistical part-of-speech tagger. In: Proc. of the 6th Applied Natural Language Processing Conf., Seattle, USA, pp. 224–231 (2000) 10. Florian, R., Ngai, G.: Fast Transformation-Based Learning Toolkit manual. John Hopkins Univ., USA (2001), http://nlp.cs.jhu.edu/~rflorian/fntbl 11. van Halteren, H., Zavrel, J., Daelemans, W.: Improving accuracy in word class tagging through the combination of machine learning systems. Computational Linguistics 27(2), 199–229 (2001) 12. van Halteren, H., Zavrel, J., Daelemans, W.: Improving data driven wordclass tagging by system combination. In: Proc. of the 36th Annual Meeting on Association for Computational Linguistics, Montr´eal, Canada, vol. 1, pp. 491–497 (1998) 13. Schr¨ oder, I.: A Case Study in Part-of-Speech Tagging Using the ICOPOST Toolkit, Technical report FBI-HH-M-314/02, Univ. of Hamburg, Germany (2002) 14. Brill, E., Wu, J.: Classifier combination for improved lexical disambiguation. In: Proc. of the 7th Int. Conf. on Computational Linguistics, San Francisco, USA, pp. 191–195 (1998) 15. Modified corpus of Frequency Dictionary of Contemporary Polish, http://nlp.icsr.agh.edu.pl 16. IPI PAN Corpus resources, http://korpus.pl
Synonymous Chinese Transliterations Retrieval from World Wide Web by Using Association Words Chung-Chian Hsu and Chien-Hsing Chen National Yunlin University of Science and Technology, Taiwan {hsucc,g9423809}@yuntech.edu.tw
Abstract. We present a framework for mining synonymous transliterations from a set of Web pages collected via a search engine. An integrated statistical measure is proposed to form search keywords for a search engine in order to retrieve relevant Web snippets. We employ a scheme of comparing the similarity between two transliterations to aid in identifying synonymous transliterations. Experimental results show that the average number of harvesting synonymous transliterations is about 5.04 for an input transliteration. The retrieval results could be beneficial for constructing ontology, especially, in the domain of foreign person names. Keywords: synonymous transliteration, cross lingual information retrieval, Chinese transliteration, person names, ontology.
1 Introduction A transliteration is a local representation of a foreign word, rendering its pronunciation in the alphabet of the local language. With many different translators working without a common standard, there may be many different transliterations for the same proper noun. For example, the inconsistent Chinese transliterations 賓拉登 (bin la deng), 本拉登 (ben la deng) and 本拉丹 (ben la dan) are all translated from the foreign name "Bin Laden". Unfortunately, a person may know only one of those transliterations. As a result, the synonymous transliteration problem may create comprehension obstacles while one is reading. More importantly, it also results in incomplete search results when a user inputs only one of the transliterations to a search engine. For instance, using 賓拉登 (bin la deng) as a search keyword cannot retrieve the Web pages which use 本拉登 (ben la deng) as the transliteration for Bin Laden. In this paper, we attempt to propose a framework for automatically extracting as many synonymous transliterations as possible from the Web with respect to a given input transliteration, as a first step towards the problem. The research result is beneficial to constructing ontology, especially in the domain of famous person names. Some major tasks in natural language processing such as machine translation, named entity recognition, automatic summarization, information extraction and cross-language information retrieval (CLIR) have treated Web corpora as a good knowledge
source for extracting useful information. Search engines have been considered an important tool to retrieve relevant documents. However, a simple, short query usually fail in returning only highly relevant documents and instead a huge amount of Web pages in diversified topics are usually returned. A short query expanded by additional relevant search keywords could help to limit the retrieved pages to what the user is intended. Work in the literature such as query extension [1] proposed some techniques for identifying proper keywords for extension. We follow this idea for collecting high quality candidate snippets which might contain synonymous transliterations. The traditional approaches in CLIR usually require a parallel corpus which suffers from bias and time-consuming due to manually collecting. Instead, we propose an effective framework to mining synonymous transliterations from Web snippets returned by a search engine. A critical step is to use proper keywords for collecting a limited amount of snippets which could include as many synonymous transliterations as possible. To achieve this goal, we use a measure which integrates several statistic approaches of keyword determination so as to raise the keyword quality. After retrieving relevant documents via a search engine, we apply a comparison scheme to determine whether an unknown word segmented from the retrieved snippets is indeed a synonymous transliteration. Our scheme is based on comparing digitalized physical sounds of Chinese characters. The traditional approaches in CLIR are usually grapheme-based or phonetic-based. Compared to those approaches, our approach possesses more powerful discrimination capability.
2 Candidate Snippets Collection We propose a procedure, presented in Fig. 1, for collecting candidate Web snippets in which synonymous transliterations may appear. First, the transliteration (TL) is input to collect a set of n snippets, called core snippets. After text preprocessing, a set of m keywords, called association words, which are highly associated with the TL, is extracted from the core snippets. The association words are used to form search keywords to retrieve a set of k snippets from the Web, called candidate snippets, which are considered likely to contain synonymous transliterations.
[Fig. 1 flowchart: TL → Downloading → Core Snippets → Feature Selection → Association Words → Keywords Formation → Search Keywords → Retrieving → Candidate Snippets]
Fig. 1. A procedure of collecting candidate Web snippets
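The pipeline of Fig. 1 can be outlined as follows. This is only a hypothetical sketch: search_snippets stands in for a real search-engine API, and select_association_words and form_queries are placeholders for the steps detailed in Sects. 2.1 and 2.2.

```python
def collect_candidate_snippets(tl, search_snippets, select_association_words,
                               form_queries, n=20, k=300):
    """Collect candidate Web snippets for an input transliteration `tl`."""
    core = search_snippets(tl, limit=n)            # core snippets for the TL
    assoc = select_association_words(tl, core)     # Sect. 2.1: fused ranking
    candidates = []
    for query in form_queries(tl, assoc):          # Sect. 2.2: query strategies
        candidates.extend(search_snippets(query, limit=k))
    return candidates
```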
2.1 Association Words Selection Several statistical methods [2] can be used to select feature terms with respect to a document category by measuring the association strength between a term and the category, including Information gain (IG), Mutual information (MI), Chi-square (CHI), Correlation coefficient (CC), Relevance score (RS), Odds ratio (OR) and GSS Coefficient (GSS). A fusion approach which integrates features selected by different methods may improve the quality of features, reduce noise, and avoid overfitting [2].
Therefore, to estimate the strength of association between a term $t_k$ and an input transliteration $c_i$, we employ a fusion model integrating six popular feature selection functions. To calculate the strength, we need to compute various joint and conditional probabilities. Recently, several researchers proposed using the returned count of a query to a search engine for estimating term relationships. Cheng et al. [3] used the returned page counts from the search engine to estimate the association strength between two terms. Cilibrasi and Vitanyi [4] used the returned page counts to measure the information distance so as to estimate the similarity among the names of objects. We follow their idea for our needs. To take GSS as an example, $GSS(t_k, c_i) = p(t_k, c_i)\,p(\bar{t}_k, \bar{c}_i) - p(t_k, \bar{c}_i)\,p(\bar{t}_k, c_i)$, where $p(t_k, c_i)$ represents the probability of co-occurrence of $t_k$ and $c_i$, which can be estimated via the returned page counts of a query "$t_k$" + "$c_i$" to a search engine, in which $c_i$ is a transliteration and $t_k$ is a term. Here $t_k$ and $c_i$ denote the presence of the term and the transliteration in a Web page, while $\bar{t}_k$ and $\bar{c}_i$ indicate the opposite, their absence. In practice, we first download a fixed number of Web snippets $D$ for a transliteration $c_i$ via a search engine. Let $T = \{t_1, \dots, t_k, \dots, t_K\}$ be the set of terms extracted from the core snippets $D$; the scores of the six functions for the association strength between each $t_k$ and $c_i$ are measured. Six rankings of each $t_k$ with respect to the six functions are obtained, where $r_{k,m}$ represents the rank of $t_k$ under the $m$-th evaluation function. The average rank $f_k$ is defined as $f_k = \sum_m r_{k,m}/M$. A lower average rank indicates a more important term.

2.2 Search-Keywords Formation

Based on the ranked association words selected in the previous step, there are several alternatives to form a query for further collecting candidate snippets which may contain synonymous transliterations. We consider several strategies and empirically compare their performance. Three entities are used to form the different strategies for synonymous transliterations (ST): the transliteration TL, the association term (AS), and TL's original word (ORI). Three strategies are defined as follows. Strategy 1 (Direct strategy). An ST may appear in the same snippet with a TL or ORI. Therefore, the TL or ORI can be used as the query term. Given a transliteration, its foreign origin can be determined automatically by several techniques found in CLIR [5-12]. Strategy 2 (Indirect strategy). Association words highly related to the TL may retrieve snippets containing an ST. Therefore, in the indirect strategy, we make a query Q out of association words; specifically, a query Qm-As is an m-term query formed by m association words. We select significant association words and use their combinations to generate a set of queries Q. Then, each of the queries is used to collect several hundred snippets which collectively form the set of candidate Web
snippets. For instance, given the top four association words of 賓拉登 (Bin Laden), say { 恐怖份子 (terrorist), 阿富汗 (Afghanistan), 攻擊 (attack), 恐怖主義 (terrorism) }, and m = 2, a query Q2-As is a 2-term query such as (恐怖份子, 阿富汗). The query set Q consists of all two-term combinations of the four ASs. The size of Q is C(4,2) = 6. The set of search keywords in query Q2-As is {q1 = (恐怖份子, 阿富汗); q2 = (阿富汗, 攻擊); q3 = (恐怖主義, 攻擊); …; q6}.

Strategy 3 (Integrated strategy). A combination of the direct and the indirect strategy may improve retrieval effectiveness. Therefore, an integrated strategy containing the Qm-As and the QORI or QTL is considered. Empirically, the integration with ORI is much better than with TL. Thus, we integrate association words with the ORI to produce a query Qm-AsOri. For example, Q1-AsOri = (恐怖份子, Bin Laden) or Q2-AsOri = (恐怖主義、阿富汗, Bin Laden).
3 Synonymous Transliterations Extraction from Candidate Snippets After collecting candidate snippets from the Web, we apply several processes to extract synonymous transliterations. Transliterations are unknown to an ordinary dictionary, so we first discard known words in the snippets with the help of a dictionary and then extract n-gram terms from the remaining text. The length parameter n is set to the range from |TL| - 1 to |TL| + 1 since the length of an ST with respect to an input TL is most likely in that range. A process of dynamic alignment is employed to select candidate synonymous transliterations (CSTs) from the n-gram terms. Then, we compare the similarity between CSTs and the TL. A highly similar CST to the TL is considered a synonymous transliteration. The extraction procedure is presented in Fig. 2. Candidate Snippets
Term Segmenting
n-gram terms
CSTs
Dynamic Aligning
STs
Comparing
Fig. 2. The procedure of extracting synonymous transliterations
3.1 N-Gram Terms and Candidate Synonymous Transliterations Generations The size of n-gram terms segmented from the remaining text after discarding known words is still huge. Most of them are obviously not an ST. We apply a heuristic to discard those n-gram terms and the remaining terms are referred to as candidate synonymous transliterations (CSTs). In particular, we observed that most of synonymous transliterations are highly matching in the first and the last Character, for instance,
戈巴契夫 (ge ba qi fu), 戈爾
巴喬夫 (ge er ba qiao fu) and 戈巴卓夫 (ge ba zhuo fu). That is to say, two terms which does not match well in the first and the last character are very likely not synonymous, for instance, 本拉丹 (ben la dan) and 拉丹襲 (la dan xi) in which the
Synonymous Chinese Transliterations Retrieval from World Wide Web
917
first character of the latter matches with the second character of the former while the
拉丹襲 which has the first two characters from a true synonymous transliteration 本拉丹 is last character of the former matches with the second of the latter. In fact,
generated due to the use of 3-gram segmentation. An extra-last-character exception has to be taken care. Several final foreign phonemes might be ignored in the transliterations by some translators but not be ignored by some other translators. Those phonemes include “m”, “er”, “d” and “s”
姆 (mu), 兒 (er), 爾 (er), 德 (de), and 斯 (si) when they are not ignored. For example, 貝克漢 (bei ke han) and 貝克漢姆 (bei ke han mu).
and usually be transliterated as
Therefore, when a mismatched last character pair is attributed to this exception, we need to further explore the matching between the last second character of the longer word and the last character of the shorter word. According to the above observations, we resort to a dynamic programming technique [12, 13] to determine the optimal alignment between an n-gram term and the TL so as to eliminate the n-gram terms which do not match well with the TL in the first and the last character and neither fall in the extra-last-character exception. Those n-gram terms which match well with the TL in the first and the last character or are well handled by the extra-last-character exception are considered CSTs. Note that the alignment is based on pronunciation similarity of Chinese characters. 3.2 Candidate Synonymous Transliterations Comparison
A transliteration usually has pronunciation close to their original foreign words. Therefore, synonymous transliterations usually have similar pronunciations. We use the Chinese Sound Comparison method (CSC) [12] to compare the pronunciation of two Chinese words, which has advantages over grapheme-based and conventional phoneme-based approaches. Grapheme-based approaches are mainly based on the number of identical alphabets in the two words. Phoneme-based approaches are mainly based on the pronunciation similarity between phones. In the conventional phoneme-based approaches [14, 15], the similarity scores between phones are assigned by some predefined rules which take articulatory features of phones into consideration. Instead, CSC compares two words by their digitalized physical sounds, which raise the effectiveness by embedding more discriminative information in the digitalized sound signals Given two Chinese words A ={a1a2…aN} and B ={b1b2…bM} where an is the nth character in Chinese word A and bm is the mth character in Chinese word B. N is not necessarily equal to M. A dynamic programming based approach to comparing the similarity of smallest distortion for A and B by adjusting the warp on the time axis is employed. The recurrence formula is defined as follows in which T(N, M) is the similarity of {a1a2…aN} and {b1b2…bM}, and sim(an, bm) is the similarity for two Chinese characters. ,
max
N 1, M N 1, M N, M 1
1
N, M
918
C.-C. Hsu and C.-H. Chen
We constructed two similarity matrices for comparing the similarity between Chinese characters [12]. One is for the 37 phonetic symbols which are used to make of the pronunciation of a Chinese character. The other is for the 412 basic sounds which include all pronunciations of Chinese characters without considering tones. The similarity between two Chinese characters is measured by where an.IC . , . 1 , , and bm.IC represent their initial consonant (IC). According to our experience, final sound heavily influences speech sound comparison. Therefore, we adopt an initialweighted comparison approach, which involved a balancing adjustment: weighting the initial consonants of the characters to balance the bias caused by the final sounds. The 37 phonetic symbol similarity matrix is used to provide the similarity data between the initials of the characters. sim(an, bm) is the weighted similarity between character an and bm obtained from the similarity matrices of the 37 phonetic symbols and the 412 character pronunciations. w represents a trade-off between weighting the initial consonant and the whole character and is set to 0.4 empirically. For example,
森
生
the similarity between two Chinese characters (sen) and (sheng) is measured by first converting them to the representation of their corresponding phonetic symbols,
ㄙㄣ(sen) and ㄕㄥ(sheng), respectively. They have initial consonants ㄙ(si) and ㄕ(shi), respectively. Then, the score is calculated by the formula, sim(森,生) = sim( ㄙㄣ,ㄕㄥ) = 0.4 sim (ㄙ, ㄕ) + 0.6 sim (ㄙㄣ, ㄕㄥ). According the two similarity matrices s37 and s412, sim (ㄙ,ㄕ) =0.66 and sim (ㄙㄣ,ㄕㄥ) = 0.69. The result is 0.68, the measured similarity between two Chinese characters 森 and 生.
namely,
s37
s412
s37
s412
The normalized similarity between two words A and B which takes into account the length of the words is defined as scoreCSC(A,B) = T(N,M)/(0.5(N+M)) where N and M are the lengths of A and B, respectively. The choice of normalization operation significantly influences the similarity comparison. We set it to the average length of N and M according empirical results indicated in [12]. A high score between an CST and the TL implies the CST is very likely an ST of the TL.
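A minimal sketch of this dynamic-programming comparison and its length normalization is given below. The two similarity matrices and the 0.4/0.6 initial-consonant weighting are abstracted behind a char_sim callable, which is an assumption of the sketch rather than the paper's actual data structures.

```python
def csc_score(a, b, char_sim):
    """Normalized CSC-style similarity between two Chinese words a and b.
    char_sim(x, y) must return the weighted character-pair similarity."""
    n, m = len(a), len(b)
    T = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            T[i][j] = max(T[i - 1][j - 1] + char_sim(a[i - 1], b[j - 1]),  # align pair
                          T[i - 1][j],                                     # skip a char of a
                          T[i][j - 1])                                     # skip a char of b
    return T[n][m] / (0.5 * (n + m))   # scoreCSC(A, B) = T(N, M) / (0.5 (N + M))
```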
4 Experiments We collected a total of 50 Chinese transliterations (TLs) from the Web. The data were drawn from two major types of proper nouns, i.e., locations and personal names. Their length is 2, 3 or 4, which are most commonly seen in Chinese transliterations. The number of transliterations in each group is 10, 30 and 10, respectively. 4.1 Quality of Query Strategies Each of the 50 TLs was submitted to Google search engine and the first 20 snippets were collected as the core snippets of the TL. For each TL, the top five association words were used to collect various sets of the candidate snippets according to different strategies mentioned in section 2.2. Google also suggests synonymous
Synonymous Chinese Transliterations Retrieval from World Wide Web
919
transliterations with respect to some user queries. We therefore consider their recommendation as well in the experiment. QTL: collecting snippets by using the TL; QOri: collecting snippets by using the original foreign word; Qm-As: collecting snippets by using the query consisting of m associated words; Qm-AsOri: collecting snippets by using a query consisting of m associated word plus the foreign word; QGR: Google recommendation. The following discusses how effective each strategy is able to collect a better set of candidate snippets, which shall contain as many synonymous transliterations as possible. The second row in Table 1 shows the ratio of TL having at least one synonym in the collected snippets under a certain query strategy. Under the strategy QOri, the ratio is 74%; in other words, 37 out of 50 TLs have at least one synonym in the retrieved snippets. Q2-AsOri brings the best performance, which is 92%. The result also shows that only 4% has recommendation from the search engine. Among the inputs, three of the 50 TLs do not has any synonym in the collected snippets, including
托拉斯
赫爾利
雅典娜
(Trust) and (Hurley). (Athena), Surprisingly, the combination of the original word along with association words performs better than using the original word alone. For instance, the transliterations
馬斯哈托夫 (Maskhadov), 巴薩拉 (Basra), 賽普拉斯 (Cypress), 費雪 (Fisher), 蓋亞 (Gaea), and 鮑爾 (Powell) have no STs in the collected snippets by Q , but they do Ori
have by Q2-AsOri or Q1-AsOri. The reason is that these transliterations are more popular than their synonymous transliterations. As a result, all the returned snippets, of which the number is limited to about 1000 by the search engine, by QOri contain only the most commonly seen transliterations, no other synonymous transliterations. A stricter query strategy which additionally include association words along with the original foreign word help to bring the Web pages containing synonymous transliterations to the set of the returned first 1000 pages. Second, we test how many synonymous transliterations could be retrieved in average under different methods with respect to a given TL. Experimental results in the third row of Table 1 show that including Ori along with their association words in the query outperforms their counterpart, which does not include Ori. Furthermore, the parameter m (the number of association words in a query) is better not to be greater than 3. Requesting too many association words in a snippet would limit the number of snippets that we can retrieve. Given the 50 TLs, we retrieved in total 366 STs, of which 252(69%), 246(67%), 136(37%), 145(40%), 86(23%), 54(15%), 40(11%), 22(6%), 2(0.5%) by 2-AsOri, 1AsOri, 3-AsOri, Ori, TL, 2-As, 3-As, 1-As, and GR, respectively. 322 (88%) out of 366 can be retrieved by 2AsOri and 1AsOri together. Finally, we inspect how uniqueness the retrieved result of a method is, i.e., how many words which are retrieved uniquely by the method but not by the other methods.
920
C.-C. Hsu and C.-H. Chen
Table 1. Probability of a TL having at least one synonym in the collected snippets and the average number of retrieved synonymous transliterations Method 2-AsOri 1-AsOri 3-AsOri Ori TL 2-As 3-As 1-As GR ST. Occurrence Probability 0.92 0.9 0.82 0.74 0.50 0.50 0.38 0.36 0.04 Average number of STs 5.04 4.92 2.72 2.90 1.72 1.08 0.80 0.44 0.10 Uniqueness 0.318 0.419 0.062 0.093 0.023 0.054 0.016 0.016 0.000
Table 1 shows among those words retrieved by only one method, about 40% and 30% are by Q1-AsOri and Q2-AsOri, respectively. Except for GR, other methods can retrieve more or less some unique STs. 4.2 Performance of Synonymous Transliterations Extraction This section presents how well the confirmation model can recognize those identified candidate synonymous transliterations (CSTs) as true synonymous transliterations (STs). The evaluation measures include precision, the average number of retrieved STs and the inclusion rate. Because Q2-AsOri was more effective in retrieving candidate snippets in which STs appear, we use the set of CSTs extracted from the candidate snippets by Q2-AsOri via the dynamic alignment process. The dynamic alignment approach reduced the size of the n-gram terms 355,943 to the size of CST terms 56,408. We further utilize the CSC approach [12] to measure the similarity between a CST term and the TL. The initial consonant weight is set to w = 0.4 which is suggested in [12]. A high score indicates high pronunciation similarity between the CST and the TL and implies that they are likely synonymous. Fig. 3 shows retrieval precision and the average number of retrieved STs with respect to various similarity thresholds by the CSC. The result shows that all extracted STs acquire at least a 0.5 CSC similarity score. It also shows that the precision is high (over 0.89) when the score is greater than 0.9.
Fig. 3. Precision and average number of collected synonyms under various similarity scores by CSC
AR (average ranking), ARR (average reciprocal rank) and the inclusion rate, which are commonly used for the evaluation in information retrieval, are calculated for the data set according to the rank of the similarity score of a true ST to the TL. AR and ARR are 7.22 and 0.74, respectively. For the inclusion rate, 67% of STs are included
Synonymous Chinese Transliterations Retrieval from World Wide Web
921
in top-1, 81% are included in top-5, 88% are included in top-10 and 99% are included in top-100. The lowest rank of a true ST is 324.
5 Conclusions and Future Directions In this paper we present a framework for collecting synonymous transliterations from the Web with respect to a given input transliteration. The research result can be applied to construct ontology of famous person names. Our method uses the online retrieved Web pages collection as the corpus. Unlike the conventional approaches in information retrieval, a manually pre-collected set of documents is used as the corpus which may engender bias. Moreover, to extract synonymous transliterations from the retrieved Web snippets, we compare the similarity between unknown words and the input transliteration by an approach based on comparing digitalized physical sounds. We will continue to improve the precision of identified synonymous transliterations in our future work. Acknowledgement. This work is supported by National Science Council, Taiwan under grant NSC 96-2416-H-224-004-MY2.
Reference 1. Carpineto, C., Bordoni, F.U., Mori, R.D., Avignon, U.O., Romano, G., Bordoni, F.U., Bigi, B.: An information-theoretic approach to automatic query expansion. ACM Transactions on Information Systems 19(1), 1–27 (2001) 2. Huang, S., Chen, Z., Yu, Y., Ma, W.-Y.: Multitype features coselection for Web document clustering. IEEE Transactions on Knowledge and Data Engineering 18(4), 448–459 (2006) 3. Cheng, P.-J., Teng, J.-W., Chen, R.-C., Wang, J.-H., Lu, W.-H., Chien, L.-F.: Translating unknown queries with Web corpora for cross-language information retrieval. In: Proceedings of ACM SIGIR, Sheffield, South Yorkshire, UK (2004) 4. Cilibrasi, R.L., Vitanyi, P.M.B.: The Google similarity distance. IEEE Transactions on Knowledge and Data Enginerring 19(3), 370–383 (2007) 5. Tsuji, K.: Automatic extraction of translational Japanese-Katakana and English word pairs from bilingual corpora. International Journal of Computer Processing of Oriental Language, 261–280 (2002) 6. Stalls, B.G., Kevin, K.: Translating names and technical terms in arabic text. In: Proceedings of the COLING/ACL Workshop on Computational Approaches to Semitic Languages (1998) 7. Somers, H.L.: Similarity metrics for aligning children’ s articulation data. In: Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics and 17th International Conference on Computational Linguistics, pp. 1227–1231 (1998) 8. Sakoe, H., Chiba, S.: Dynamic programming algorithm optimization for spoken word recognition. In: IEEE Trans. Acoustics, Speech, and Signal Proc. ASSP, pp. 43–49 (1978) 9. Lin, W.H., Chen, H.H.: Backward machine transliteration by learning phonetic similarity. In: Proceedings of the Sixth Conference on Natural Language Learning, Taipei, Taiwan, pp. 139–145 (2002)
922
C.-C. Hsu and C.-H. Chen
10. Lin, W.H., Chen, H.H.: Similarity measure in backward transliteration between different character sets and its applications to CLIR. In: Proceedings of Research on Computational Linguistics Conference XIII, Taipei, Taiwan, pp. 97–113 (2000) 11. Lee, C.J., Chang, J.S., Jang, J.-S.R.: Alignment of bilingual named entities in parallel corpora using statistical models and multiple knowledge sources. ACM Transactions on Asian Language Information Processing 5(2), 121–145 (2006) 12. Hsu, C.-C., Chen, C.-H., Shih, T.-T., Chen, C.-K.: Measuring similarity between transliterations against noise data. ACM Transactions on Asian Language Information Processing (2007) 13. Kuo, J.-S., Li, H., Yang, Y.-K.: A phonetic similarity model for automatic extraction of transliteration pairs. ACM Trans. Asian Language Information Processing (2007) 14. Kondrak, G.: Phonetic alignment and similarity. Computers and the Humanities 37(3), 273–291 (2003) 15. Chen, H.H., Lin, W., Yang, C.C., Lin, W.H.: Translating/transliterating named entities for multilingual information access. Journal of the American Society for Information Science and Technology, 645–659 (2006)
Parallel Approximate Finite Element Inverses on Symmetric Multiprocessor Systems Konstantinos M. Giannoutakis and George A. Gravvanis Department of Electrical and Computer Engineering, School of Engineering, Democritus University of Thrace, 12, Vas. Sofias street, GR 671 00 Xanthi, Greece {kgiannou, ggravvan}@ee.duth.gr
Abstract. A new parallel normalized optimized approximate inverse algorithm for computing explicit approximate inverses is introduced for symmetric multiprocessor (SMP) systems. The parallelization of the approximate inverse has been implemented by an antidiagonal motion, in order to overcome the data dependencies. The parallel normalized explicit approximate inverses are used in conjunction with parallel normalized explicit preconditioned conjugate gradient schemes, for the efficient solution of finite element sparse linear systems. The parallel design and implementation issues of the new algorithms are discussed and the parallel performance is presented, using OpenMP. The speedups tend to the upper theoretical bounds for all cases, making approximate inverse preconditioning suitable for SMP systems.
1
Introduction
Sparse matrix computations, which have inherent parallelism, are of central importance in computational science and engineering and are the most time-consuming part of such computations. Hence research efforts have focused on the production of efficient parallel computational methods and related software suitable for multiprocessor systems, [1,2,3,10]. Until recently, direct methods were effectively used, but the increase of problem size, even with the use of modern computer systems, has become a barrier to such methods, [2,3,10]. Additionally, the solution of sparse linear systems, because of its applicability to real-life problems, is obtained by iterative methods, which are in competitive demand after the emergence of Krylov subspace methods, [4,8]. An important achievement over the last decades is the appearance and use of Explicit Preconditioned Methods, [4], for solving sparse linear systems; the preconditioned form of a linear system Au = s is M Au = M s, where M is the preconditioner, [2,4,8,10]. The preconditioner M has therefore to satisfy the following conditions: (i) M A should have a "clustered" spectrum, (ii) M can be efficiently computed in parallel and (iii) finally "M × vector" should be fast to compute in parallel, [2,4,8,9,10]. Hence, the derivation of parallel methods was the main objective for which several families of parallel inverses are proposed. The main motive for the derivation of the parallel explicit approximate inverse matrix algorithms is that they
result in parallel iterative methods in conjunction with parallel preconditioned conjugate gradient - type schemes respectively, for solving finite element linear systems on SMP systems. The important feature of the proposed parallel approximate inverse preconditioning is that the approximate inverse is computed explicitly and in parallel, eliminating the forward-backward substitution, which does not parallelize easily, [10]. For the implementation of the parallel programs, the OpenMP application programming interface has been used. OpenMP has emerged as a shared-memory programming standard and it consists of compiler directives and functions for supporting both data and functional parallelism. The parallel for pragma with static scheduling has been used for the parallelization of loops on both the construction of the approximate inverse and the preconditioned conjugate gradient scheme.
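The role of an explicit approximate inverse in such a scheme can be sketched as follows. This is an illustration of preconditioning by multiplication with M only, with dense NumPy arrays standing in for the actual sparse banded storage; it is not the paper's NOROBAIFEM or preconditioned conjugate gradient code.

```python
import numpy as np

def epcg(A, b, M, tol=1e-8, max_iter=1000):
    """Preconditioned conjugate gradient where the preconditioner is an explicit
    approximate inverse M ~ A^{-1}: applying it is a matrix-vector product,
    so no forward/backward substitution is needed."""
    x = np.zeros_like(b)
    r = b - A @ x
    z = M @ r                      # "M x vector" step, easy to parallelize
    p = z.copy()
    rz = r @ z
    for _ in range(max_iter):
        Ap = A @ p
        alpha = rz / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        if np.linalg.norm(r) < tol:
            break
        z = M @ r
        rz_new = r @ z
        p = z + (rz_new / rz) * p
        rz = rz_new
    return x
```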
2
Parallel Explicit Approximate Inverses
Let us consider the arrow-type linear system, i.e., Au = s, where A is a sparse symmetric arrow-type (n × n) matrix (1) with diagonal entries a_1, …, a_n, off-diagonal entries b_1, …, b_{n−1} retained in semi-bandwidth m, and an arrow block V ≡ (v_{k,η}) (2).
Let us assume the normalized finite element approximate factorization of the coefficient matrix A, such that: A ≈ Dr Trt Tr Dr ,
r ∈ [1, . . . , m − 1),
(3)
where r is the “fill-in” parameter, i.e. the number of outermost off-diagonal entries retained in semi-bandwidth m, Dr is a diagonal matrix and Tr is a sparse upper (with unit diagonal elements) triangular matrix of the same profile as the coefficient matrix A, [7].
Parallel Approximate Finite Element Inverses on SMP Systems
927
The elements of the Dr , Tr decomposition factors were computed by the FEANOF algorithm, cf. [12]. The memory requirements of the FEANOF algorithm are ≈ (r + 2 + 2)n words, while the computational work for m << n is ≈ 1/2(r + )(r + + 3)n mults + n square roots, cf. [7]. Let Mrδl = (μi,j ), i ∈ [1, n] j ∈ [max(1, i − δl + 1), min(n, i + δl − 1)], be the normalized approximate inverse of the coefficient matrix A, i.e. −1 −1
rδl Dr−1 , with M
rδl = Trt Tr −1 . Mrδl = Dr−1 Trt Tr Dr = Dr−1 M
(4)
The elements of the approximate inverse were determined by retaining a certain number of elements of the inverse, i.e. only δl elements in the lower part and δl − 1 elements in the upper part of the inverse (by applying the so-called “position-principle”), next to the main diagonal, by the Normalized Optimized Banded Approximate Inverse Finite Element Matrix -2D algorithmic procedure (henceforth called the NOROBAIFEM-2D algorithm, without inverting the decomposition factor Tr , [7]. The challenge of computing parallel approximate inverses is to overcome its data dependencies, which create a critical path and an order of computations, hence any parallel approximate inverse matrix algorithm must abide by those dependencies in order to avoid any data loss. ⎤ ⎡ (15) δl (14) " (13) b " " μ 1,1 μ μ 1,3 " " b " 1,2 " " ⎥ ⎢"" " "(12) "" (14) ""(13) (11)b ⎥ ⎢ μ μ 2,2 " μ 2,3" μ 2,4 " b ⎥ ⎢ 2,1 " " " ⎥ ⎢ " " " " ⎥ ⎢" (13) "" (12) "(11) (10) "" (9) b ⎥ ⎢ μ 3,1 " μ 3,2 " μ 3,3 "μ 3,4" μ 3,5 " "b " " ⎥ ⎢b " " " b " ⎥ ⎢ b " " "(11) " (10) " (9) (8) (7) " ⎥ ⎢ " " b μ μ μ μ μ " " " 4,2 4,3 4,4 4,5 4,6 " " ⎥ ⎢ b "
δl = ⎢ " "" " b M " " ⎥ . (5) " r b ""(9) "" (8) " (7) (6) " (5) " ⎢ "b ⎥ μ 5,3 μ 5,4" μ 5,5 " μ 5,6" μ 5,7 " ⎢ b " b⎥ " " " ⎥ ⎢ " b" "" "(6) "" " ⎥ ⎢ (7) (5) (4) (3) " " " ⎢ μ μ μ μ μ b" 6,4 6,5" 6,6" 6,7 " 6,8 ⎥ " ⎥ ⎢ " " " " b" ⎢ "(5) "" " (4) (3) (2) ⎥ " ⎥ ⎢ " " b μ μ μ μ 7,5 7,6 " 7,7" 7,8 " ⎦ ⎣ " b "" " " " (3) (2) (1) " " b"μ μ μ b 8,6" 8,7 " 8,8 For the parallelization of the NOROBAIFEM-2D algorithm, an antidiagonal motion (wave-like pattern), starting from the element μ 8,8 down to μ 1,1 , has been used, because of the dependency of the elements of the inverse during its construction. More specifically, any element within the banded approximate inverse requires its corresponding right or lower element to be computed first. This sequence of computations, without any loss of generality and for simplicity reasons, is shown for the normalized banded approximate inverse in equation (5) (with n = 8 and δl = 3). The values of the parentheses at the superscript of each (k) i,j was computed at the (k)-th element (e.g. μ i,j ), indicate that the element μ sequential step of the algorithm (k-th antidiagonal), while the elements with the same superscript (i.e. (k)) were computed concurrently. It should be noted that
928
K.M. Giannoutakis and G.A. Gravvanis
due to the data dependencies, for δl = 1, 2 the parallel algorithm will execute sequentially. Let us consider that the command forall denotes the parallel for instruction (forks/joins threads), for executing parallel loops. Then, the algorithm for the implementation of the Parallel ANti Diagonal NOROBAIFEM-2D algorithm (henceforth called the PAND-NOROBAIFEM-2D algorithm), on symmetric multiprocessor systems, can be described as follows: for k = 1 to δl forall l = 1 to k call inverse(n − l + 1,n − k + l) m=2 for k = (δl + 1) to n forall l = m to (k − m + 1) call inverse(n − l + 1,n − k + l) if (k − δl) mod 2 = 0 then m=m+1 m=m−1 for k = (n − 1) downto (δl + 1) forall l = m to k − m + 1 call inverse(l,k − l + 1) if (k − δl) mod 2 = 1 then m=m−1 for k = δl downto 1 forall l = 1 to k call inverse(l,k − l + 1) where the function inverse(i,j) computes the element μi,j of the normalized optimized approximate inverse, and can be described as follows, [7]: function(inverse) Let r = r + , m = m + , mr = m − r, nmr = n − mr, nm = n − m. if i >= j then if j > nmr then if i = j then if i = n then (6) μ 1,1 = 1 else n−j,δl+1 (7) μ n−i+1,1 = 1 − gj · μ else n−i+1,i−j (8) μ n−i+1,i−j+1 = −gj · μ else if j ≥ r and j ≤ nmr then if i = j then nmr−j μ n−i+1,1 = 1 − gj · μ n−j,δl+1 − hr−1−k,j+1−r+k · μ x,y (9) k=0
call mw(n, δl, i, mr + j + k, x, y)
Parallel Approximate Finite Element Inverses on SMP Systems
929
else μ n−i+1,i−j+1 = −gj · μ n−i+1,i−j −
nmr−j
hr−1−k,j+1−r+k · μ x,y
(10)
k=0
call mw(n, δl, i, mr + j + k, x, y) else if j > nm + 1 and j ≤ r − 1 then if i = j then n−j,δl+1 − hj,k · μ x1 ,y1 μ n−i+1,1 = 1 − gj · μ − call mw(n, δl, i, m + k − 1, x1 , y1 ) else
k=j+1−r k>0
nm
hj−1−λ,+1+λ · μ x2 ,y2
call mw(n, δl, i, m + λ, x2 , y2 )
μ n−i+1,i−j+1 = −gj · μ n−i+1,i−j − − call mw(n, δl, i, m + k − 1, x1 , y1 ) else if j ≤ nm + 1 then if i = j then if i = 1 then
(11)
λ=0
nm
hj,k · μ x1 ,y1
k=j+1−r k>0
hj−1−λ,+1+λ · μ x2 ,y2
(12)
λ=0
call mw(n, δl, i, m + λ, x2 , y2 )
n−1,δl+1 − μ n,1 = 1 − g1 · μ
h1,k · μ x,y
(13)
k=1
call mw(n, δl, 1, m + k − 1, x, y) else μ n−i+1,1 = 1 − gj · μ n−j,δl+1 −
k=j+1−r k>0
hj,k · μ x1 ,y1 −
j−1
hj−λ,+λ · μ x2 ,y2
(14)
λ=1
call mw(n, δl, i, m + k − , x1 , y1 ) call mw(n, δl, i, m + λ − 1, x2 , y2 ) else j−1 n−j,δl+1 − hj,k · μ x1 ,y1 − hj−λ,+λ · μ x2 ,y2 (15) μ n−i+1,1 = −gj · μ k=j+1−r k>0
call mw(n, δi, i, m + k − , x1 , y1 ) if i <> j then n−i+1,i−j+1 μ n−i+1,δl+i−j = μ
λ=1
call mw(n, δl, i, m + λ − 1, x2 , y2 ) (16)
The procedure mw(n, δl, s, q, x, y), [5], reduces the memory requirements of the approximate inverse to only n × (2δl − 1)-vector spaces. The computational process is logically divided into 2n − 1 sequential steps representing the 2n − 1 antidiagonals, while synchronization between processes is needed after the
930
K.M. Giannoutakis and G.A. Gravvanis
computation of each antidiagonal, to ensure that the elements of the matrix are correctly computed.
3
Parallel Normalized Preconditioned Conjugate Gradient method
In this section we present a class of parallel Normalized Explicit Preconditioned Conjugate Gradient (NEPCG) method, based on the derived parallel optimized approximate inverse, designed for symmetric multiprocessor systems.The NEPCG method for solving linear systems has been presented in [7]. The computational complexity of the NEPCG method is O [(2δl + 2 + 11) nmults + 3n adds] ν operations,where ν is the number of iterations required for the convergence to a certain level of accuracy, [7]. The Parallel Normalized Explicit Preconditioned Conjugate Gradient (PN EPCG) algorithm for solving linear systems can then be described as follows: forall j = 1 to n (r0 )j = sj − A (u0 )j if δl = 1 then forall j = 1 to n (r0∗ )j = (r0 )j / d2 j else forall j = 1 to n j (r0∗ )j =
(17)
(18)
μ n+1−i,i+1−k k=max(1,j−δl+1) min(n,j+δl−1) +
k=j+1
(r0 )k /dk
μ n+1−k,δl+k−j (r0 )k /dk
/ (d)j
(19)
forall j = 1 to n (σ0 )j = (r0∗ )j (20) forall j = 1 to n (reduction+p0 ) p0 = (r0 )j ∗ (r0∗ )j (21) Then, for i = 0, 1, . . ., (until convergence) compute in parallel the vectors ui+1 , ri+1 , σi+1 and the scalar quantities αi , βi+1 as follows: forall j = 1 to n (qi )j = A (σi )j (22) forall j = 1 to n (reduction +ti ) (23) ti = (σi )j ∗ (qi )j (24) αi = pi /ti forall j = 1 to n (ui+1 )j = (ui )j + αi (σi )j (25) (ri+1 )j = (ri )j − αi (qi )j (26) if δl = 1 then forall 1 to n ∗j = (27) ri+1 j = (ri+1 )j / d2 j
Parallel Approximate Finite Element Inverses on SMP Systems
else forall j = 1 to
n ∗ ri+1 j =
j
μ n+1−i,i+1−k k=max(1,j−δl+1) min(n,j+δl−1) +
k=j+1
(ri+1 )k /dk
μ n+1−k,δl+k−j (ri+1 )k /dk
931
/ (d)j (28)
forall j = 1 to n (reduction+p i+1 ) ∗ pi+1 = (ri+1 )j ∗ ri+1 j βi+1 = pi+1 /pi forall j = 1 to n ∗ + βi+1 (σi )j (σi+1 )j = ri+1 j
(29) (30) (31)
It should be noted that the parallelization of the coefficient matrix A×vector operation has been implemented by taking advantage of the sparsity of the coefficient matrix A.
4
Numerical Results
In this section we examine the applicability and effectiveness of the proposed parallel schemes for solving sparse finite element linear systems. Let us now consider a 2D-boundary value problem: uxx + uyy + u = F,
(x, y) ∈ R,
with
u (x, y) = 0,
(x, y) ∈ ∂R,
(32)
where R is the unit square and ∂R denotes the boundary of R. The domain is covered by a non-overlapping triangular network resulting in a hexagonal mesh. The right hand side vector of the system (1) was computed as the product of the matrix A by the solution vector, with its components equal to unity. The “fill-in”parameter was set to r = 2 and the width parameter was set to = 3. The iterative process was terminated when ri ∞ < 10−5 . It should be noted that further details about the convergence behavior and the impact of the “retention”parameter on the solution can be found in [6]. The numerical results presented in this section were obtained on an SMP machine consisting of 16 2.2 GHz Dual Core AMD Opteron processors, with 32 GB RAM running Debian GNU/Linux (National University Ireland Galway). For the parallel implementation of the algorithms presented, the Intel C Compiler v9.0 with OpenMP directives has been utilized with no optimization enabled at the compilation level. It should be noted that due to administrative policies, we were not able to explore the full processor resources (i.e. more than 8 threads). In our implementation, the parallel for pragma has been used in order to generate code that forks/joins threads, in all cases. Additionally, static scheduling has been used (schedule(static)), whereas the use of dynamic scheduling has not produced improved results. The speedups and efficiencies of the PAND-NOROBAIFEM-2D algorithm for several values of the “retention”parameter δl with n = 10000 and m = 101,
932
K.M. Giannoutakis and G.A. Gravvanis
Table 1. Speedups for the PAND-NOROBAIFEM-2D algorithm for several values of δl Speedups for the PAND-NOROBAIFEM-2D algorithm Retention parameter 2 processors 4 processors 8 processors δl = m 1.8966 3.8458 6.8653 δl = 2m 1.9600 3.8505 7.4011 δl = 4m 1.9741 3.9260 7.5768 δl = 6m 1.9986 3.9501 7.8033
Table 2. Efficiencies for the PAND-NOROBAIFEM-2D algorithm for several values of δl Efficiencies for the PAND-NOROBAIFEM-2D algorithm Retention parameter 2 processors 4 processors 8 processors δl = m 0.9483 0.9615 0.8582 δl = 2m 0.9800 0.9626 0.9251 δl = 4m 0.9870 0.9815 0.9471 δl = 6m 0.9993 0.9875 0.9754
are given in Table 1 and 2. In Fig. 1 the parallel speedups for several values of the “retention”parameter δl are presented for the PAND-NOROBAIFEM2D method, for n = 10000 and m = 101. The speedups and efficiencies of the PNEPCG algorithm for several values of the “retention”parameter δl with n = 10000 and m = 101, are given in Table 3 and 4. In Fig. 2 the parallel speedups for several values of the “retention”parameter δl are presented for the PNEPCG method, for n = 10000 and m = 101. Table 3. Speedups for the PNEPCG algorithm for several values of δl Speedups for the PNEPCG method Retention parameter 2 processors 4 processors 8 processors δl = 1 1.1909 1.5365 1.6097 δl = 2 1.5261 2.2497 2.7299 δl = m 1.8070 3.4351 6.3522 δl = 2m 1.8576 3.4824 6.3636 δl = 4m 1.9103 3.5453 6.4043 δl = 6m 1.9735 3.5951 6.6106
It should be mentioned that for large values of the “retention”parameter, i.e. multiples of the semi-bandwidth m, the speedups and the efficiency tend to the upper theoretical bound, for both the parallel construction of the approximate inverse and the parallel normalized preconditioned conjugate gradient method, since the coarse granularity amortizes the parallelization overheads. For small
Parallel Approximate Finite Element Inverses on SMP Systems
Fig. 1. Speedups versus the NOROBAIFEM-2D algorithm
“retention”parameter
δl
for
the
933
PAND-
Table 4. Efficiencies for the PNEPCG algorithm for several values of δl Efficiencies for the PNEPCG algorithm Retention parameter 2 processors 4 processors 8 processors δl = 1 0.5955 0.3841 0.2012 δl = 2 0.7631 0.5624 0.3412 δl = m 0.9035 0.8588 0.7940 δl = 2m 0.9288 0.8706 0.7954 δl = 4m 0.9551 0.8863 0.8005 δl = 6m 0.9867 0.8988 0.8263
Fig. 2. Speedups versus the “retention”parameter δl for the PNEPCG algorithm
values of the “retention”parameter, i.e. δl = 1, 2, the fine granularity is responsible for the low parallel performance, since the parallel operations reduces to simple ones, like inner products, and the utilization of the hardware platform is decreasing.
934
5
K.M. Giannoutakis and G.A. Gravvanis
Conclusion
The design of parallel explicit approximate inverses results in efficient parallel preconditioned conjugate gradient method for solving finite element linear systems on multiprocessor systems. Finally, further parallel algorithmic techniques will be investigated in order to improve the parallel performance of the normalized explicit approximate inverse preconditioning on symmetric multiprocessor systems, particularly by increasing the computational work output per processor and eliminating process synchronization and any associated latencies. Acknowledgments. The author would like to thank indeed Dr. John Morrison of the Department of Computer Science, University College of Cork for the provision of computational facilities and support through the WebCom-G project funded by Science Foundation Ireland.
References 1. Akl, S.G.: Parallel Computation: Models and Methods. Prentice-Hall, Englewood Cliffs (1997) 2. Dongarra, J.J., Duff, I., Sorensen, D., van der Vorst, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998) 3. Duff, I.: The impact of high performance computing in the solution of linear systems: trends and problems. J. Comp. Applied Math. 123, 515–530 (2000) 4. Gravvanis, G.A.: Explicit Approximate Inverse Preconditioning Techniques. Archives of Computational Methods in Engineering 9(4), 371–402 (2002) 5. Gravvanis, G.A.: Parallel matrix techniques. In: Papailiou, K., Tsahalis, D., Periaux, J., Hirsch, C., Pandolfi, M. (eds.) Computational Fluid Dynamics I, pp. 472–477. Wiley, Chichester (1998) 6. Gravvanis, G.A., Giannoutakis, K.M.: Normalized Explicit Finite Element Approximate Inverses. I. J. Differential Equations and Applications 6(3), 253–267 (2003) 7. Gravvanis, G.A., Giannoutakis, K.M.: Normalized finite element approximate inverse preconditioning for solving non-linear boundary value problems. In: Bathe, K.J. (ed.) Computational Fluid and Solid Mechanics 2003. Proceedings of the Second MIT Conference on Computational Fluid and Solid Mechanics, vol. 2, pp. 1958–1962. Elsevier, Amsterdam (2003) 8. Grote, M.J., Huckle, T.: Parallel preconditioning with sparse approximate inverses. SIAM J. Sci. Comput. 18, 838–853 (1977) 9. Huckle, T.: Approximate sparsity patterns for the inverse of a matrix and preconditioning. Applied Numerical Mathematics 30, 291–303 (1999) 10. Saad, Y., van der Vorst, H.A.: Iterative solution of linear systems in the 20th century. J. Comp. Applied Math. 123, 1–33 (2000)
Fast and Small Short Vector SIMD Matrix Multiplication Kernels for the Synergistic Processing Element of the CELL Processor Wesley Alvaro1 , Jakub Kurzak1 , and Jack Dongarra1,2,3 2
1 University of Tennessee, Knoxville TN 37996, USA Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA 3 University of Manchester, Manchester, M13 9PL, UK {alvaro, kurzak, dongarra}@eecs.utk.edu
Abstract. Matrix multiplication is one of the most common numerical operations, especially in the area of dense linear algebra, where it forms the core of many important algorithms, including solvers of linear systems of equations, least square problems, and singular and eigenvalue computations. The STI CELL processor exceeds the capabilities of any other processor available today in terms of peak single precision, floating point performance. In order to fully exploit the potential of the CELL processor for a wide range of numerical algorithms, fast implementation of the matrix multiplication operation is essential. The crutial component is the matrix multiplication kernel crafted for the short vector Single Instruction Multiple Data architecture of the Synergistic Processing Element of the CELL processor. In this paper, single precision matrix multiplication kernels are presented implementing the C = C − A × B T operation and the C = C −A×B operation for matrices of size 64×64 elements. For the latter case, the performance of 25.55 Gflop/s is reported, or 99.80 percent of the peak, using as little as 5.9 KB of storage for code and auxiliary data structures.
1
Introduction
The CELL Broadband Engine (CBE) processor has been developed jointly by the alliance of Sony, Toshiba and IBM (STI). The CELL processor is an innovative multi-core architecture consisting of a standard processor, the Power Processing Element (PPE), and eight short-vector Single Instruction Multiple Data (SIMD) processors, referred to as the Synergistic Processing Elements (SPEs). The SPEs are equipped with scratchpad memory referred to as the Local Store (LS) and a Memory Flow Controller (MFC) to perform Direct Memory Access (DMA) transfers of code and data between the system memory and the Local Store. All components are interconnected with the Element Interconnection Bus (EIB). This paper is only concerned with the design of computational micro-kernels for the SPE in order to fully exploit Instruction Level Parallelism (ILP) provided by its SIMD architecture. Issues related to parallelization of code for execution on multiple SPEs, including intra-chip communication and synchronization, are M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 935–944, 2008. c Springer-Verlag Berlin Heidelberg 2008
936
W. Alvaro, J. Kurzak, and J. Dongarra
not discussed here. SPE architercural details important to the discussion are presented in Sect. 4.1 and also throughout the text, as needed. Plentiful information about the design of the CELL processor and CELL programming techniques is in public the domain [1].
2
Motivation
The current trend in processor design is towards chips with multiple processing units, commonly referred to as multi-core processors [2]. It has been postulated that building blocks of future architectures are likely to be simple processing elements with shallow pipelines, in-order execution, and SIMD capabilities [3]. It can be observed that the Synergistic Processing Element of the CELL processor closely matches this description. By the same token, investigation into microkernel development for the SPE may have a broader impact by providing an important insight into programming future multi-core architectures. 2.1
Performance Considerations
State of the art numerical linear algebra software utilizes block algorithms in order to exploit the memory hierarchy of traditional cache-based systems [4,5]. Public domain libraries such as LAPACK and ScaLAPACK are good examples. These implementations work on square or rectangular submatrices in their inner loops, where operations are encapsulated in calls to Basic Linear Algebra Subroutines (BLAS), with emphasis on expressing the computation as Level 3 BLAS, matrix-matrix type, operations. Frequently, the call is made directly to the matrix multiplication routine GEMM. At the same time, all the other Level 3 BLAS can be defined in terms of GEMM and a small amount of Level 1 and Level 2 BLAS [6]. 2.2
Code Size Considerations
In the current implementation of the CELL BE architecture, the SPEs are equipped with a Local Store of 256 KB. It is a common practice to use tiles of 64 × 64 elements for dense matrix operations in single precision, which occupy 16 KB buffers in the Local Store. Between six and eight such buffers are necessary to efficiently implement common matrix operations. In general, it is reasonable to assume that half of the Local Store is devoted to application data buffers leaving only 128 KB for the application code, necessary libraries and the stack. Owing to that, the Local Store is a scarse resource and any real-world application is facing the problem of fitting tightly coupled components together in the limited space.
3
Related Work
Implementation of matrix multiplication C = C + A × B T using Intel Streaming SIMD Extensions (SSE) was reported by Aberdeen and Baxter [7]. Analysis
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
937
of performance considerations of various computational kernels for the CELL processor, including the GEMM kernel, was presented by Williams et al. [8]. The first implementation of the matrix multiplication kernel C = A × B for the CELL processor was reported by Chen et al. [9]. Performance of 25.01 Gflop/s was reported on a single SPE, with code size of roughly 32 KB. More recently assembly language implementation of the matrix multiplication C = C − A × B was reported by Hackenberg[10,11]. Performance of 25.40 Gflop/s was reported with code size close to 26 KB.
4 4.1
Implementation SPU Architecture Overview
The core of the SPE is the Synergistic Processing Unit (SPU). The SPU is a RISC-style SIMD processor feturing 128 general purpose registers and 32bit fixed length instruction encoding. SPU includes instructions that perform single precision floating point, integer arithmetic, logicals, loads, stores, compares and branches. SPU includes nine execution units organized into two pipelines, referred to as the odd and even pipeline. Instructions are issued in-order and two independent instructions can be issued simultaneously if they belong to different pipelines. SPU executes code form the Local Store and operates on data residing in the Local Store, which is a fully pipelined, single-ported, 256 KB of Static Random Access Memory (SRAM). Load and store instructions are performed within local address space, which is untranslated, unguarded and noncoherent with respect to the system address space. Loads and stores transfer 16 bytes of data between the register file and the Local Store, and complete with fixed six-cycle delay and without exception. SPU does not perform hardware branch prediction and omits branch history tables. Instead, the SPU includes a Software Managed Branch Target Buffer (SMBTB), which holds a single branch target and is loaded by software. A mispredicted branch flushes the pipelines and costs 18 cycles. A correctly hinted branch can execute in one cycle. Since both branch hint and branch instructions belong to the odd pipeline, proper use of SMBTB can result in zero overhead from branching for a compute-intensive loop dominated by even pipeline instructions. 4.2
Loop Construction
The main tool in loop construction is the technique of loop unrolling. In general, the purpose of loop unrolling is to avoid pipeline stalls by separating dependent instructions by a distance in clock cycles equal to the corresponding pipeline latencies. It also decreases the overhead associated with advancing the loop index and branching. On the SPE it serves the additional purpose of balancing the ratio of instructions in the odd and even pipeline, owing to register reuse between interations.
938
W. Alvaro, J. Kurzak, and J. Dongarra
In the canonical form, matrix multiplication Cm×n = Am×k × Bk×n coinsists of three nested loops iterating over the three dimensions m, n and k. Loop tiling is applied to improve the locality of reference and to take advantage of the O(n3 )/O(n2 ) ratio of arithmetic operations to memory accesses. This way register reuse is maximized and the number of loads and stores is minimized. Conceptually, tiling of the three loops creates three more inner loops, which calculate a product of a submatrix of A and a submatrix of B and updates a submatrix of C with the partial result. Practically, the body of these three inner loops is subject to complete unrolling to a single block of a straight-line code. The tile size is picked such that the cross-over point between arithmetic and memory operations is reached, which means that there is more FMA or FNMS operations to fill the even pipeline than there is load, store and shuffle or splat operations to fill the odd pipeline. The resulting structure consists of three outer loops iterating over tiles of A, B and C. Inevitably, nested loops induce mispredicted branches, which can be alleviated by further unrolling. Aggressive unrolling, however, leads quickly to undesired code bloat. Instead, the three-dimensional problem can be linearized by replacing the loops with a single loop performing the same traversal of the iteration space. This is accomplished by traversing tiles of A, B and C in a predefined order derived as a function of the loop index. A straightforward row/column ordering can be used and tile pointers for each iteration can be constructed by simple transformations of the bits of the loop index. At this point, the loop body still contains auxiliary operations that cannot be overlapped with arithmetic operations. These include initial loads, stores of final results, necessary data rearrangement with splats and shuffles, and pointer advancing operations. This problem is addressed by double-buffering, on the register level, between two loop iterations. The existing loop body is duplicated and two separate blocks take care of the even and odd iteration, respectively. Auxiliary operations of the even iteration are hidden behind arithmetic instructions of the odd iteration and vice versa, and disjoint sets of registers are used where necessary. The resulting loop is preceeded by a small body of prologue code loading data for the first iteration, and then followed by a small body of epilogue code, which stores results of the last iteration. 4.3
C=C-A×B
T
Before going into details, it should be noted, that matrix storage follows C-style row-major format. It is not as much a carefull design decision, as compliance with the common practice on the CELL processor. It can be attributed to C compilers being the only ones allowing to exploit short-vector capabilities of the SPEs through C language SIMD extensions. An easy way to picture the C = C − A × B T operation is to represent it as the standard matrix vector product C = C − A × B, where A is stored using row-major order and B is stored using column-major order. It can be observed that in this case a row of A can readily be multiplied with a column of B to yield a vector containing four partial results, which need to be summed up to
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
939
produce one element of C. The vector reduction step introduces superfluous multiply-add operations. In order to minimize their number, four row-column products are computed, resulting in four vectors, which need to be internally reduced. The reduction is performed by first transposing the 4 × 4 element matrix represented by the four vectors and then applying four vector multiply-add operations to produce a result vector containing four elements of C. The basic scheme is depicted in Fig. 1 (left).
Fig. 1. Basic operation of the C = C − A × B T micro-kernel (left). Basic operation of the C = C − A × B micro-kernel (right).
The crucial design choice to be made is the right amount of unrolling, which is equivalent to deciding the right tile size in terms of the triplet {m, n, k} (Here sizes express numbers of individual floating-point values, not vectors). Unrolling is mainly used to minimize the overhead of jumping and advancing the index variable and associated pointer arithmetic. It has been pointed out in Sect. 4.1 that both jump and jump hint instructions belong to the odd pipeline and, for compute intensive loops, can be completely hidden behind even pipeline instructions and thus introduce no overhead. In terms of the overhead of advancing the index variable and related pointer arithmetic, it will be shown in Sect. 4.5 that all of these operations can be placed in the odd pipeline as well. In this situation, the only concern is balancing even pipeline, arithmetic instructions with odd pipeline, data manipulation instructions. Simple analysis can be done by looking at the number of floating-point operations versus the number of loads, stores and shuffles, under the assumption that the size of the register file is not a constraint. The search space for the {m, n, k} triplet is further truncated by the following criteria: only powers of two are considered in order to simplify the loop construction; the maximum possible number of 64 is chosen for k in order to minimize the number of extraneous floating-point instructions performing the reduction of partial results; only multiplies of four are selected for n to allow for efficient reduction of partial results with eight shuffles per one output vector of C. Under these constraints, the entire search space can be easily analyzed.
940
W. Alvaro, J. Kurzak, and J. Dongarra
Table 1 (left) shows how the number of each type of operation is calculated. Table 2 (left) shows the number of even pipeline, floating-point instructions including the reductions of partial results. Table 2 (center) shows the number of even pipeline instructions minus the number of odd pipeline instructions including loads, stores and shuffles (not including jumps and pointer arithmetic). In other words, Table 2 (center) shows the number of spare odd pipeline slots before jumps and pointer arithmetic are implemented. Finally, Table 2 (right) shows the size of code involved in calculations for a single tile. It is important to note here that the double-buffered loop is twice the size. Table 1. Numbers of different types of operations in the computation of one tile of the C = C − A × B T micro-kernel (left) and the C = C − A × B micro-kernel (right) as a function of tile size ({m, n, 64} triplet) Type of Operation Floating point Load A Load B Load C Store C Shuffle
Pipeline
Number of Operations
Type of Operation
(m × n × 64) ⁄ 4 + m × n
Floating point
Even Odd
m × 64 ⁄ 4
Load A
64 × n ⁄ 4
Load B
m×n ⁄4
Load C
m×n ⁄4
Store C
m×n ⁄4×8
Pipeline
Number of Operations
Even Odd
Splat
(m × n × k) ⁄ 4
m×k ⁄4
k×n ⁄4 m×n ⁄4 m×n ⁄4 m×k
Table 2. Unrolling analysis for the C = C − A × B T micro-kernel: left - number of even pipeline, floating-point operations, center - number of spare odd pipeline slots, right - size of code for the computation of one tile M/N 1 2 4 8 16 32 64
4 68 136 272 544 1088 2176 4352
8 16 32 64 136 272 544 1088 272 544 1088 2176 544 1088 2176 4352 1088 2176 4352 8704 2176 4352 8704 17408 4352 8704 17408 34816 8704 17408 34816 69632
M/N 1 2 4 8 16 32 64
4 -22 20 104 272 608 1280 2624
8 16 32 64 -28 -40 -64 -112 72 176 384 800 272 608 1280 2624 672 1472 3072 6272 1472 3200 6656 13568 3072 6656 13824 28160 6272 13568 28160 57344
M/N 1 2 4 8 16 32 64
4 1.2 1.0 1.7 3.2 6.1 12.0 23.8
8 1.2 1.8 3.2 5.9 11.3 22.0 43.5
16 2.3 3.6 6.1 11.3 21.5 42.0 83.0
32 4.5 7.0 12.0 22.0 42.0 82.0 162.0
64 8.9 13.9 23.8 43.5 83.0 162.0 320.0
It can be seen that the smallest unrolling with a positive number of spare odd pipeline slots is represented by the triplet {2, 4, 64} and produces a loop with 136 floating-point operations. However, this unrolling results in only 20 spare slots, which would barely fit pointer arithmetic and jump operations. Another aspect is that the odd pipeline is also used for instruction fetch and near complete filling of the odd pipeline may cause instruction depletion, which in rare situations can even result in an indefinite stall. The next larger candidates are triplets {4, 4, 64} and {2, 8, 64}, which produce loops with 272 floating-point operations, and 104 or 72 spare odd pipeline slots, respectively. The first one is an obvious choice, giving more room in the odd pipeline and smaller code.
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
941
C=C-A×B
4.4
Here, same as before, row major storage is assumed. The key observation is that multiplication of one element of A with one row of B contributes to one row of C. Owing to that, the elementary operation splats an element of A over a vector, multiplies this vector with a vector of B and accumulates the result in a vector of C (Fig. 1). Unlike for the other kernel, in this case no extra floating-point operations are involved. Same as before, the size of unrolling has to be decided in terms of the triplet {m, n, k}. This time, however, there is no reason to fix any dimension. Nevertheless, similar constraints to the search space apply: all dimensions have to be powers of two, and additionally only multiplies of four are allowed for n and k to facilitate efficient vectorization and simple loop construction. Table 1 (right) shows how the number of each type of operation is calculated. Table 3 (left) shows the number of even pipeline, floating-point instructions. Table 3 (center) shows the number of even pipeline instructions minus the number of odd pipeline instructions including loads, stores and splats (not including jumps and pointer arithmetic). In other words, Table 3 (center) shows the number of spare odd pipeline slots before jumps and pointer arithmetic are implemented. Finally, Table 3 (right) shows the size of code involved in calculations for a single tile. It is should be noted again that the double-buffered loop is twice the size. It can be seen that the smallest unrolling with a positive number of spare odd pipeline slots produces a loop with 128 floating-point operations. Five possibilities exist, with the triplet {4, 16, 8} providing the highest number of 24 spare odd pipeline slots. Again, such unrolling would both barely fit pointer arithmetic and jump operations and be a likely cause of instruction depletion. The next larger candidates are unrollings producing loops with 256 floatingpoint operations. There are 10 such cases, with the triplet {4, 32, 8} being the obvious choice for the highest number of 88 spare odd pipeline slots and the smallest code size. Table 3. Unrolling analysis for the C = C − A × B micro-kernel: left - number of even pipeline, floating-point operations, center - number of spare odd pipeline slots, right size of code for the computation of one tile K 4 4 4 4 4 4 4 8 8 8 8 8 8 8 16 16 16 16 16 16 16
M/N 1 2 4 8 16 32 64 1 2 4 8 16 32 64 1 2 4 8 16 32 64
4 4 8 16 32 64 128 256 8 16 32 64 128 256 512 16 32 64 128 256 512 1024
8 8 16 32 64 128 256 512 16 32 64 128 256 512 1024 32 64 128 256 512 1024 2048
16 16 32 64 128 256 512 1024 32 64 128 256 512 1024 2048 64 128 256 512 1024 2048 4096
32 64 32 64 64 128 128 256 256 512 512 1024 1024 2048 2048 4096 64 128 128 256 256 512 512 1024 1024 2048 2048 4096 4096 8192 128 256 256 512 512 1024 1024 2048 2048 4096 4096 8192 8192 16384
K 4 4 4 4 4 4 4 8 8 8 8 8 4 4 16 16 16 16 16 16 16
M/N 1 2 4 8 16 32 64 1 2 4 8 16 32 64 1 2 4 8 16 32 64
4 -7 -10 -16 -28 -52 -100 -196 -12 -16 -24 -40 -72 -136 -264 -22 -28 -40 -64 -112 -208 -400
8 -9 -10 -12 -16 -24 -40 -72 -14 -12 -8 0 16 48 112 -24 -16 0 32 96 224 480
16 -13 -10 -4 8 32 80 176 -18 -4 24 80 192 416 864 -28 8 80 224 512 1088 2240
32 64 -21 -37 -10 -10 12 44 56 152 144 368 320 800 672 1664 -26 -42 12 44 88 216 240 560 544 1248 1152 2624 2368 5376 -36 -52 56 152 240 560 608 1376 1344 3008 2816 6272 5760 12800
K 4 4 4 4 4 4 4 8 8 8 8 8 4 4 16 16 16 16 16 16 16
M/N 1 2 4 8 16 32 64 1 2 4 8 16 32 64 1 2 4 8 16 32 64
4 0.1 0.1 0.2 0.4 0.7 1.4 2.8 0.1 0.2 0.3 0.7 1.3 2.5 5.0 0.2 0.4 0.7 1.3 2.4 4.8 9.6
8 0.1 0.2 0.3 0.6 1.1 2.2 4.3 0.2 0.3 0.5 1.0 1.9 3.8 7.6 0.3 0.6 1.0 1.9 3.6 7.1 14.1
16 0.2 0.3 0.5 1.0 1.9 3.7 7.3 0.3 0.5 0.9 1.7 3.3 6.4 12.6 0.6 1.0 1.7 3.1 6.0 11.8 23.3
32 0.3 0.5 1.0 1.8 3.4 6.8 13.4 0.6 1.0 1.7 3.1 5.9 11.5 22.8 1.1 1.8 3.1 5.6 10.8 21.0 41.5
64 0.6 1.0 1.8 3.4 6.6 12.9 25.5 1.2 1.8 3.2 5.8 11.1 21.8 43.0 2.2 3.4 5.8 10.6 20.3 39.5 78.0
942
4.5
W. Alvaro, J. Kurzak, and J. Dongarra
Advancing Tile Pointers
The remaining issue is the one of implementing the arithmetic calculating the tile pointers for each loop iteration. Due to the size of the input matrices and the tile sizes being powers of two, this is a straightforward task. The tile offsets can be calculated from the tile index and the base addresses of the input matrices using integer arithmetic and bit manipulation instructions (bitwise logical instructions and shifts). Although a few variations are possible, the resulting assembly code will always involve a similar combined number of integer and bit manipulation operations. Unfortunately, all these instructions belong to the even pipeline and will introduce an overhead, which cannot be hidden behind floating point operations, like it is done with loads, stores, splats and shuffles. One way of minimizing this overhead is extensive unrolling, which creates a loop big enough to make the pointer arithmetic negligible. An alternative is to eliminate the pointer arithmetic operations from the even pipeline and replace them with odd pipeline operations. With the unrolling chosen in Sect. 4.3 and Sect. 4.4, the odd pipeline offers empty slots in abundance. It can be observed that, since the loop boundaries are fixed, all tile offsets can be calculated in advance. At the same time, the operations available in the odd pipeline include loads, which makes it a logical solution to precalculate and tabulate tile offsets for all iterations. It still remains necessary to combine the offsets with the base addresses, which are not known beforehand. However, under additional alignment constraints, offsets can be combined with bases using shuffle instructions, which are also available in the odd pipeline. The precalculated offsets have to be compactly packed in order to preserve space consumed by the lookup table. Since tiles are 16 KB in size, offsets consume 14 bits and can be stored in a 16-bit halfword. Three offsets are required for each loop iteration. With eight halfwords in a quadword, each quadword can store offsets for two loop iterations or a single interation of the pipelined, double-buffered loop. The size of the lookup table constructed in this manner equals N 3 /(m × n × k) × 8 bytes. The last arithmetic operation remaining is the advancement of the itaration variable. It is typical to decrement the iteration variable instead of incrementing it, and branch on non-zero, in order to eliminate the comparison operation, which is also the case here. This still leaves the decrement operation, which would have to occupy the even pipeline. In order to annihilate the decrement, each quadword containing six offsets for one itaration of the double-buffered loop also contains a seventh entry, which stores the index of the quadword to be processed next (preceeding in memory). In other words, the iteration variable, which also serves as the index to the lookup table, is tabulated along with the offsets and loaded instead of being decremented. At the same time, both the branch instruction and the branch hint belong to the odd pipeline. Also, a correctly hinted branch does not cause any stall. As a result, such an implementation produces a continuous stream of floating-point operations in the even pipeline, without a single cycle devoted to any other activity.
Fast and Small Short Vector SIMD Matrix Multiplication Kernels
5
943
Results
Both presented SGEMM kernel implementations produce a continuous stream of floating-point instructions for the duration of the pipelined loop. In both cases, the loop iterates 128 times, processing two tiles in each iteration. The C = C − A × B T kernel contains 544 floating-point operations in the loop body and, on a 3.2 GHz processor, delivers 25.54 Gflop/s (99.77 % of peak) if actual operations are counted, and 24.04 Gflop/s (93.90 % of peak) if the standard formula, 2N 3 , is used for operation count. The C = C −A×B kernel contains 512 floating-point operations in the loop body and delivers 25.55 Gflop/s (99.80 % of peak). Here, the actual operation count equals 2N 3 . If used on the whole CELL processor with 8 SPEs, performance in excess of 200 Gflop/s should be expected. Table 4 shows the summary of the kernels’ properties. Table 4. Summary of the properties of the SPE SIMD SGEMM mikro-kernels CharacteristicT Performance
C=C-A×BT C=C-A×BT 24.04
25.55
Gflop/s
Gflop/s
Execution time
21.80 s
20.52 s
Fraction of peak
93.90 %
99.80 %
99.77 %
99.80%
68.75 %
82.81 %
69
69
Code segment size
4008
3992
Data segment size
2192
2048
Total memory footprint
6200
6040
USING THE 2× M× N× K FORMULA
Fraction of peak USING ACTUAL NUMBER OF FLOATING–POINT INSTRUCTIONS
Dual issue rate ODD PIPELINE WORKLOAD
Register usage
The code is freely available, under the BSD license and can be downloaded from the author’s web site http://icl.cs.utk.edu/∼alvaro/.
6
Conclusions
Computational micro-kernels are architecture specific codes, where no portability is sought. It has been shown that systematic analysis of the problem combined with exploitation of low-level features of the Synergistic Processing Unit of the CELL processor leads to dense matrix multiplication kernels achieving peak performance without code bloat.
944
W. Alvaro, J. Kurzak, and J. Dongarra
References 1. IBM Corporation: Cell Broadband Engine Programming Handbook, Version 1.1 (April 2007) 2. Geer, D.: Industry Trends: Chip Makers Turn to Multicore Processors. Computer 38(5), 11–13 (2005) 3. Asanovic, K., Bodik, R., Catanzaro, B.C., Gebis, J.J., Husbands, P., Keutzer, K., Patterson, D.A., Plishker, W.L., Shalf, J., Williams, S.W., Yelick, K.A.: The Landscape of Parallel Computing Research: A View from Berkeley. Technical Report UCB/EECS-2006-183, Electrical Engineering and Computer Sciences Department, University of California, Berkeley (2006) 4. Dongarra, J.J., Duff, I.S., Sorensen, D.C., van der Vorst, H.A.: Numerical Linear Algebra for High-Performance Computers. SIAM, Philadelphia (1998) 5. Demmel, J.W.: Applied Numerical Linear Algebra. SIAM, Philadelphia (1997) 6. K˚ agstr¨ om, B., Ling, P., van Loan, C.: GEMM-Based Level 3 BLAS: HighPerformance Model Implementations and Performance Evaluation Benchmark. ACM Trans. Math. Soft. 24(3), 268–302 (1998) 7. Aberdeen, D., Baxter, J.: Emmerald: A Fast Matrix-Matrix Multiply Using Intel’s SSE Instructions. Concurrency Computat.: Pract. Exper. 13(2), 103–119 (2001) 8. Williams, S., Shalf, J., Oliker, L., Kamil, S., Husbands, P., Yelick, K.: The Potential of the Cell Processor for Scientific Computing. In: ACM International Conference on Computing Frontiers (2006) 9. Chen, T., Raghavan, R., Dale, J., Iwata, E.: Cell Broadband Engine architecture and its first implementation, A performance view (November 2005), http://www-128.ibm.com/developerworks/power/library/pa-cellperf/ 10. Hackenberg, D.: Einsatz und Leistungsanalyse der Cell Broadband Engine. Institut f¨ ur Technische Informatik, Fakult¨ at Informatik, Technische Universit¨ at Dresden, Großer Beleg (February 2007) 11. Hackenberg, D.: Fast matrix multiplication on CELL systems (July 2007), http://tu-dresden.de/die tu dresden/zentrale einrichtungen/zih/forschun/ architektur und leistungsanalyse von hochleistungsrechnern/cell/
Tridiagonalizing Complex Symmetric Matrices in Waveguide Simulations W.N. Gansterer1 , H. Schabauer1 , C. Pacher2, and N. Finger2 1
University of Vienna, Research Lab Computational Technologies and Applications {wilfried.gansterer,hannes.schabauer}@univie.ac.at 2 Austrian Research Centers GmbH - ARC, Smart Systems Division {christoph.pacher,norman.finger}@arcs.ac.at
Abstract. We discuss a method for solving complex symmetric (nonHermitian) eigenproblems Ax = λBx arising in an application from optoelectronics, where reduced accuracy requirements provide an opportunity for trading accuracy for performance. In this case, the objective is to exploit the structural symmetry. Consequently, our focus is on a nonHermitian tridiagonalization process. For solving the resulting complex symmetric tridiagonal problem, a variant of the Lanczos algorithm is used. Based on Fortran implementations of these algorithms, we provide extensive experimental evaluations. Runtimes and numerical accuracy are compared to the standard routine for non-Hermitian eigenproblems, LAPACK/zgeev. Although the performance results reveal that more work is needed in terms of increasing the fraction of Level 3 Blas in our tridiagonalization routine, the numerical accuracy achieved with the nonHermitian tridiagonalization process is very encouraging and indicates important research directions for this class of eigenproblems. Keywords: Tridiagonalization, complex symmetric eigenvalue problems, waveguide simulation, optoelectronic.
1
Introduction
We discuss methods for efficiently tridiagonalizing a complex symmetric (nonHermitian) matrix. The term complex matrix is used to denote a matrix which has at least one element with a nonzero imaginary part. Tridiagonalization is an important preprocessing step in reduction-based (tridiagonalization-based) methods for computing eigenvalues and eigenvectors of dense real symmetric or complex Hermitian matrices. In the context considered here, the underlying complex symmetric eigenvalue problem (EVP) has similar ˆ B ˆ ∈ Cn×n structural, but different mathematical properties. Given matrices A, H H ˆ ˆ ˆ ˆ ˆ ˆ ˆ ˆ with A = A (but A = A ) and B = B (but B = B ), the objective is to efficiently compute eigenvalues λ and eigenvectors y of the generalized EVP ˆ = λBy ˆ . Ay
(1)
The main challenge is to find ways for utilizing the structural symmetry in the absence of the mathematical properties of Hermitian matrices. M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 945–954, 2008. c Springer-Verlag Berlin Heidelberg 2008
946
W.N. Gansterer et al.
Although problems of the type (1) do not occur as frequently in practice as real symmetric or complex Hermitian problems, there are some important applications where they arise (see, for example, [1,2,3]). The efforts summarized in this paper are motivated by the simulation of guided-wave multi-section devices in optoelectronics. As described in Section 2, techniques for numerically solving Maxwell’s equations in this context lead to dense EVPs of the type (1). Analogously to Hermitian problems, one possible approach for solving probˆ = In . Complex symlem (1) starts with reducing it to standard form where B metry allows for special techniques in this reduction step which are not discussed here. After that, a tridiagonalization process is performed on the standard EVP which results in a similar complex symmetric tridiagonal matrix T . After this tridiagonalization step, eigenvalues and eigenvectors of T are computed and the eigenvectors are backtransformed. In the following, we focus on symmetry-preserving approaches for efficiently tridiagonalizing a complex symmetric matrix. This functionality constitutes a central step in the solution process outlined above and is one way of exploiting the available structure. A more detailed discussion of the other steps involved in solving (1) will be provided in a forthcoming paper. Mathematically speaking, structural symmetry is not a very distinctive feature of complex matrices, since every matrix A ∈ Cn×n is similar to a complex symmetric matrix [1]. In contrast to a real symmetric matrix a complex symmetric matrix A is not necessarily diagonalizable. Nevertheless, structural symmetry is of great interest for the development of space- and time-efficient algorithms. Obviously, half of the information in a complex symmetric matrix is redundant, and efficient algorithms should be able to take advantage of this fact in terms of memory requirements as well as in terms of computational effort. The utilization of this purely structural property in the absence of important mathematical properties of Hermitian matrices requires a trade-off in numerical stability. In order to perform a symmetry preserving similarity transformation, the transformation matrix Q ∈ Cn×n needs to be complex orthogonal (but not unitary), that is, it has to satisfy Q Q = In . Related Work. Various non-reduction based methods for solving complex symmetric EVPs have been proposed, for example, based on the Jacobi method [4], on the Lanczos method [5], or on variants of the Jacobi-Davidson method [6]. For dense matrices, reduction-based methods tend to be more efficient. A modified conventional Householder-based reduction method has been described in [2]. The tridiagonalization of a dense complex symmetric matrix has also been investigated in [3]. In [2], the resulting tridiagonal complex symmetric problem is solved using a modified QR algorithm. Other related approaches for computing eigenvalues of a complex symmetric tridiagonal matrix were discussed in [7,8]. Synopsis. In Section 2 of this paper, the motivating application from optoelectronics is summarized. In Section 3, the tridiagonalization method investigated in this paper is discussed in detail, and Section 4 contains experimental results. Conclusions and future work are summarized in Section 5.
Tridiagonalizing Complex Symmetric Matrices in Waveguide Simulations
2
947
Guided-Wave Multisection Devices
The use of high-index contrast waveguides (WGs) in novel guided-wave devices for telecom- and sensing applications allows a very versatile tailoring of the flow of light. However, an efficient design requires the direct numerical solution of Maxwell’s equations in inhomogeneous media. In many important cases such devices can be successfully modeled as follows: (i) in the x-direction (direction of propagation) the material parameters are piecewise constant, (ii) the material parameters and the optical fields do not depend on the y-coordinate, and (iii) in the z-direction the material parameters are allowed to vary arbitrarily. Usually, the z-dimension is of the order of up to several tens of wavelengths whereas the device extension into the x-direction is several hundreds of wavelengths. A powerful numerical method for the solution of Maxwell’s equations in such WG-based devices is the eigenmode expansion technique (which is often referred to as mode-matching (MM ) technique) [9,10,11], where the electromagnetic field components in each subsection being homogeneous in the x-direction are represented in terms of a set of local eigenmodes. MM requires a small computational effort compared to other numerical techniques like two-dimensional finite-elements or FDTD which can be regarded as “brute-force” methods from the viewpoint of device physics. However, MM can only be as stable and efficient as the algorithms used to determine the required set of local WG modes. Due to the open boundary conditions (see Section 2.1) and materials with complex dielectric permittivities these local eigenmodes have typically complex eigenvalues which makes their correct classification very difficult: numerical instabilities can arise from an improper truncation of the mode spectrum. In a recently developed variant of the MM technique — the so-called variational mode-matching (VMM ) [12] — this stability problem is avoided by applying a Galerkin approach to the local wave equations and taking into account the whole spectrum of the discretized operators. 2.1
The VMM-Approach
Within the 2D-assumption ∂y (·) = 0, Maxwell’s equations for dielectric materials characterized by the dielectric permittivity ε(x, z) take the form ∂x a∂x φ + ∂z a∂z φ + k02 bφ = 0 ,
(2)
where φ = Ey , a = 1, b = ε for TE- and φ = Hy , a = 1ε , b = 1 for TMpolarization, respectively; k0 = 2π λ0 (vacuum wavelength λ0 ). In the z-direction, the simulation domain is 0 ≤ z ≤ L. To permit an accurate description of radiation fields, an artificial absorber (that mimics an open domain) has to be installed near z = 0 and z = L. For this purpose so-called perfectly-matched layers (PMLs) are used by employing the complex variable stretching approach [13], i. e., in the vicinity of the domain boundaries the coorz dinate z is extended into the complex plane: z → z˜ = z + ı 0 dτ σ(τ ), where σ is the PML parameter determining the absorption strength. At z = 0 and z = L
948
W.N. Gansterer et al.
Dirichlet- or Neumann boundary conditions (BCs) are set. However, they should not have a significant influence on the overall optical field since the physical BCs must be given by the PMLs. In the x-direction, the structure is divided into nl local WGs, which expand over xl−1 ≤ x ≤ xl = xl−1 + dl with 1 ≤ l ≤ nl . Under the necessary condition that ε does not depend on x Eq. (2) can be solved inside each local WG l with the separation ansatz φ(l) (x, z˜) =
nϕ j=1
ϕj (˜ z)
nϕ
(l) (l) (l) (l) (l) cjρ αρ,+ eık0 νρ (x−xl−1 ) + αρ,− e−ık0 νρ (x−xl ) ,
(3)
ρ=1
where ρ labels the local waveguide modes. The transverse shape functions ϕj (˜ z) (the same set is used for all local WGs) must satisfy the outer boundary conditions. Apart from this constraint, the ϕj ’s may be chosen rather freely allowing for adaptive refinement in z-regions where rapid variations of the field are expected. This ansatz reduces the 2D problem to a set of nl 1D problems. After inserting Eq. (3) into Eq. (2), Galerkin’s method is applied to obtain a (l) discretized version of Eq. (2) for each local WG l. Finally, the coefficients αρ,± are “mode-matched” by imposing the physical boundary conditions at all the xl -interfaces [12]. 2.2
The Complex Symmetric Eigenvalue Problem
For each local WG, the discretized version of Eq. (2) is a generalized EVP of the form Acρ = (νρ )2 Bcρ , (4) where we have suppressed the index l for simplicity. Here, the νρ ’s are the modal refractive indices and the cjρ ’s are the corresponding modal expansion coefficients is a sum of a mass- and a stiffness-matrix, occurring in Eq. (3).1 A z (∂z˜ϕm (˜ z ϕm (˜ z )b(˜ z )ϕj (˜ z ) − k2 d˜ z ))a(˜ z )(∂z˜ϕj (˜ z )), whereas B is a Amj = d˜ 0 z ϕm (˜ z )a(˜ z )ϕj (˜ z ). pure mass-matrix: Bmj = d˜ The generalized EVP (4) has the following properties: (i) A and B are complex symmetric: the complex coordinate z˜ originating from the PMLs (and the possibly complex material constants a and b) are responsible for the complexvaluedness; (ii) B is indefinite (due to the open boundary conditions represented by the PMLs and a possibly negative material constant a); (iii) the typical order of the matrices for 2D problems is 100–1000 (depending on the geometry and the required truncation order of the modal expansion—in 3D models the order can be much higher); (iv ) the full spectrum of eigenpairs is required; (v ) the required accuracy is of the order 10−8 for the eigenpairs corresponding to the lowest order WG modes (approx. 10% of the mode spectrum); a somewhat lower accuracy (approx. 10−6 ) is acceptable for the remainder of the spectrum; (vi) depending on the WG geometry, some of the eigenvalues (especially those corresponding to the lowest order WG modes) may almost degenerate. It is evident that an efficient eigenvalue solver which utilizes the symmetry of the EVP (4) as well as its special properties is a very important building block for efficient 2D and 3D optical mode solvers.
Tridiagonalizing Complex Symmetric Matrices in Waveguide Simulations
3
949
Methodology
The standard approach to solving a dense complex symmetric EVP (as available, for example, in Lapack [14]) is to treat it as a nonsymmetric EVP: the complex symmetric matrix is reduced to upper Hessenberg form, from which eigenvalues and eigenvectors are computed using a QR iteration. The main motivation behind investigating tridiagonalization-based approaches as an alternative is the obvious potential for reducing storage and runtime requirements. In order to preserve symmetry, complex orthogonal similarity transformations (COTs) Q are needed which satisfy Q Q = In . In general, Q2 ≥ 1 and thus the application of complex orthogonal matrices can increase numerical errors. 3.1
Splitting Methods
The real part R and the imaginary part S of a complex symmetric matrix A = R + iS are real symmetric matrices. One basic idea, which has been introduced earlier [3], is to separate the tridiagonalization of R from the tridiagonalization of S as much as possible. More specifically, part of a column of R can be eliminated using a (real) orthogonal Householder transformation Q_R. After that, a (smaller) part of the corresponding column of S can be eliminated without causing any fill-in in R using another (real) orthogonal Householder transformation Q_I. Both of these operations are performed in real arithmetic, and both transformation matrices have norm one. Eventually, a single nonzero element below the subdiagonal in S remains to be eliminated. This operation has to be performed in complex arithmetic, using a 2 × 2 COT, whose norm cannot be bounded a priori. When the column elimination is first performed in R and then in S, we call the procedure the RI variant. Analogously, it is possible to eliminate first in S and then in R. We call this procedure the IR variant. The advantages of splitting methods seem obvious: most of the computation can be done in real arithmetic, only one third of the transformations are potentially unstable, and this danger can easily be monitored because of the low dimensions of the COTs (only order two).

Complex Orthogonal Transformations. The transformation matrix

  G := (1/√(z^2 + s^2)) [ z, s; −s, z ],   z = z_1 + i z_2 ∈ C,  z_1, z_2, s ∈ R,   (5)

(rows separated by semicolons) defines a COT since G^T G = I_2. Consequently, G A G^T is a similarity transformation of A. In the RI variant, a COT G_RI has to be determined such that

  G_RI [ a + ib; ic ] = [ d + ie; 0 ],

where a, b, c, d, e ∈ R and c ≠ 0. Choosing the parameters z = s(b/c − i a/c), s ≠ 0 arbitrary, the COT is given as

  G_RI = (1/√(b^2 − a^2 + c^2 − i(2ab))) [ b − ia, c; −c, b − ia ].   (6)
In the IR variant, a COT G_IR has to be determined such that

  G_IR [ a + ib; c ] = [ d + ie; 0 ].

With z = s(a/c + i b/c), s ≠ 0 arbitrary, the COT is given as

  G_IR = (1/√(a^2 − b^2 + c^2 + i(2ab))) [ a + ib, c; −c, a + ib ].   (7)
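A minimal sketch of Eqs. (6)–(7) follows (Python/NumPy; the function names and test values are invented for illustration). It builds G_RI and G_IR from given real a, b, c, checks complex orthogonality G^T G = I_2, and verifies that the targeted entry is annihilated; note that the 2-norm of G is in general larger than one.

# Sketch of the 2x2 COTs of Eqs. (6)-(7); variable names are illustrative only.
import numpy as np

def cot_ri(a, b, c):
    # G_RI from Eq. (6): maps (a + ib, ic)^T to (d + ie, 0)^T, requires c != 0
    denom = np.sqrt(complex(b**2 - a**2 + c**2, -2.0 * a * b))
    return np.array([[b - 1j * a, c], [-c, b - 1j * a]]) / denom

def cot_ir(a, b, c):
    # G_IR from Eq. (7): maps (a + ib, c)^T to (d + ie, 0)^T, requires c != 0
    denom = np.sqrt(complex(a**2 - b**2 + c**2, 2.0 * a * b))
    return np.array([[a + 1j * b, c], [-c, a + 1j * b]]) / denom

a, b, c = 0.3, -1.2, 0.8
G = cot_ri(a, b, c)
print(np.allclose(G.T @ G, np.eye(2)))            # complex orthogonal: G^T G = I_2
print(G @ np.array([a + 1j * b, 1j * c]))         # second component is (numerically) zero
print(np.linalg.norm(G, 2))                       # ||G||_2 >= 1, not bounded a priori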
3.2 Numerical Aspects
In a splitting method, the complex orthogonal transformations (5) are the only non-unitary transformations; all other transformations used have unit norm. If ‖G‖_2 ≫ 1, the accuracy of the tridiagonalization process could be influenced negatively. G is a normal matrix, and thus its spectral norm is given by its largest eigenvalue in modulus:

  ‖G‖_2 = ( (1 + γ) / (1 − γ) )^{1/4}   with   γ = 2|z_2 s| / (z_1^2 + z_2^2 + s^2).   (8)

If γ approaches one, the accuracy of the tridiagonalization process may deteriorate. For G_RI and G_IR, respectively, γ in (8) becomes

  γ_RI = 2|ac| / (a^2 + b^2 + c^2),   γ_IR = 2|bc| / (a^2 + b^2 + c^2).
We observe that the freedom in choosing the parameter s does not help in controlling the norm of the COT, since γRI and γIR are independent of s. During the tridiagonalization process, monitoring the norms of the COTs makes it possible to detect potentially large errors. Various strategies have been suggested to avoid large norms, such as the recovery transformations proposed in [3]. Adaptive Elimination Order. The order of processing R and S can be determined independently in each iteration of the tridiagonalization process. For both variants, the norm of each COT can be precomputed with only marginal overhead. Based on this information, the COT with the smaller norm can be selected and the corresponding variant carried out. Obviously, this heuristic choice is only a local minimization and there is no guarantee that it minimizes the accumulated norm of all COTs in the tridiagonalization process. Comparison to and combination with recovery transformations are topics of ongoing work.
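The adaptive heuristic can be stated in a few lines. The sketch below (illustrative only; it shows just the local decision, not the full tridiagonalization) evaluates γ_RI and γ_IR for the current column data a, b, c, converts them to COT norms via (8), and selects the variant with the smaller norm; a value of γ close to one would flag a potentially unstable COT.

# Sketch of the adaptive elimination order heuristic: before eliminating the last
# subdiagonal entry of the current column, compare the norms that the RI and IR
# variants would produce (via Eq. (8)) and pick the smaller one.  The inputs a, b, c
# are the real quantities appearing in Eqs. (6)-(7).

def cot_norm(gamma):
    # ||G||_2 from Eq. (8); gamma close to 1 signals a potentially unstable COT
    return ((1.0 + gamma) / (1.0 - gamma)) ** 0.25

def choose_variant(a, b, c):
    s = a * a + b * b + c * c
    gamma_ri = 2.0 * abs(a * c) / s
    gamma_ir = 2.0 * abs(b * c) / s
    if gamma_ri <= gamma_ir:
        return "RI", cot_norm(gamma_ri)
    return "IR", cot_norm(gamma_ir)

print(choose_variant(0.3, -1.2, 0.8))   # -> ('RI', ...): |a| < |b|, so RI gives the smaller norm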
4 Experimental Evaluation
In our experiments, we used the following routines: zsysta reduces a generalized EVP (Â, B̂) to a standard EVP (A), and zsyevv solves the standard complex symmetric EVP. The latter consists of a newly implemented RI tridiagonalization (zsytridi), compev [15] for computing eigenvalues and inverm [15] for computing corresponding eigenvectors of the complex symmetric tridiagonal matrix. zsyevg tests the accuracy of the tridiagonalization process by first calling zsytridi, followed by a call of LAPACK/zgeev on the resulting tridiagonal matrix. The codes were run on a Sun Fire v40z with 4 dual-core Opteron 875 CPUs (2.2 GHz) and 24 GB main memory. Suse Linux Enterprise Server 10, the GNU Fortran 95 compiler, Lapack version 3.1.1, Goto Blas 1.20, and the AMD Core Math Library (Acml 4.0.1) were used. We experimented with random test matrices with elements in [0, 2] as well as with a real application case.

4.1 Numerical Accuracy
Denoting with (λ_i, x_i) the eigenpairs computed by LAPACK/zgeev, and with (λ̃_i, x̃_i) the eigenpairs computed by zsyevv, an eigenvalue error E and a residual error R have been computed according to

  E := max_i |λ̃_i − λ_i| / |λ_i|,   R := max_i ‖(A − λ̃_i I_n) x̃_i‖_2 / ‖A‖_2,   i ∈ {1, ..., n}.
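For reference, the two metrics can be computed as follows (an illustrative NumPy sketch, not the authors' test harness; it assumes the two spectra have already been matched so that λ̃_i and λ_i refer to the same eigenvalue):

# Sketch of the accuracy metrics E and R defined above; lam_ref would come from
# LAPACK/zgeev and lam, X from zsyevv (here they are just arrays).
import numpy as np

def eigen_error(lam_ref, lam):
    return np.max(np.abs(lam - lam_ref) / np.abs(lam_ref))

def residual_error(A, lam, X):
    # columns of X are the computed eigenvectors x_i
    normA = np.linalg.norm(A, 2)
    res = [np.linalg.norm(A @ X[:, i] - lam[i] * X[:, i], 2) for i in range(len(lam))]
    return max(res) / normA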
Fig. 1 illustrates that the loss of accuracy in the tridiagonalization process itself is surprisingly low! Although the total values of E and R of zsyevv increase up to 10^{-6}, most of this error is due to the Lanczos variant used for solving the tridiagonal problem. The error introduced by the RI tridiagonalization is only about two orders of magnitude higher than the one of LAPACK/zgeev.

1D Waveguide Problem. The waveguide structure is a Si/SiOx twin waveguide operated in TM-polarization at a wavelength λ_0 = 1.55 μm. The dielectric constants are ε_Si = 12.96 and ε_SiOx = 2.25. The core thickness and core separation are 0.5 μm and 0.25 μm, respectively. The z-extension of the model domain, terminated by electrically perfectly conducting walls, is 10 μm. The PML-layer thickness is 1 μm with the PML-parameter σ = 1. As shape functions, localized linear hat functions and polynomial bubble functions with a degree up to 24 were used. For reducing the generalized problem (4) to standard form, we computed a generalized (complex) symmetric Cholesky factor F of B̂. With ‖B̂ − F F^T‖_2 = 1.8 · 10^{-16}, the accuracy of this factorization is satisfactory for our test case. The eigenpairs (λ_i, x_i) of the resulting standard problem computed using Gnu Octave were compared with the eigenpairs (λ̃_i, x̃_i) computed by our routine zsyevv. Backtransformation of the eigenvectors leads to a weighted residual error

  max_{i=1,...,n} ‖(Â − λ̃_i B̂) ỹ_i‖_2 / (‖Â‖_2 ‖B̂‖_2) = 3.8 · 10^{-14},

which is a very satisfactory accuracy (for this test case, ‖Â‖_2 = 928, ‖B̂‖_2 = 2).
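The reduction to standard form used here can be sketched as follows (again illustrative only, not the zsysta routine): a lower triangular F with F F^T = B is computed by a Cholesky-like recursion in complex arithmetic without conjugation and without pivoting, which may break down for unlucky pivots, and the pencil is then transformed to F^{-1} A F^{-T}, which is again complex symmetric.

# Sketch of the reduction idea: compute a complex symmetric factor F with F F^T = B
# (no conjugation, no pivoting -- it can break down for unlucky pivots), then transform
# the pencil (A, B) to the standard form F^{-1} A F^{-T}.
import numpy as np

def complex_symmetric_cholesky(B):
    n = B.shape[0]
    F = np.zeros_like(B, dtype=complex)
    for j in range(n):
        pivot = B[j, j] - np.dot(F[j, :j], F[j, :j])
        F[j, j] = np.sqrt(pivot + 0j)               # complex square root
        for i in range(j + 1, n):
            F[i, j] = (B[i, j] - np.dot(F[i, :j], F[j, :j])) / F[j, j]
    return F

def to_standard_form(A, B):
    F = complex_symmetric_cholesky(B)
    Finv = np.linalg.inv(F)
    return Finv @ A @ Finv.T, F                     # standard complex symmetric EVP

# factorization residual, analogous to ||B - F F^T||_2 quoted above:
# F = complex_symmetric_cholesky(B); print(np.linalg.norm(B - F @ F.T, 2))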
Fig. 1. Accuracy of zsyevv, LAPACK/zgeev, and zsyevg operating on random matrices: the eigenvalue error E and the residuals R of zsyevv, zsyevg, and LAPACK/zgeev (roughly between 10^{-14} and 10^{-5}) plotted against the order n = 100–4000 of the eigenproblem Ax = λx.
4.2 Runtime Performance
We compared our routine zsyevv to LAPACK/zgeev using two different implementations of the Blas. Fig. 2 shows that the current version of zsyevv is faster than zgeev only if the Acml Blas is used. With the overall faster Goto Blas, zgeev outperforms our implementation. At first sight, this result is disappointing. Despite the exploitation of the structure, the new routine is slower than the more general routine for nonsymmetric problems for the best Blas available. A more detailed analysis helps to pinpoint the reason. Table 1 shows the percentages of the total runtimes which each of the two routines spent in their different parts for the two different Blas versions. For our routine zsyevv, the tridiagonalization part zsytridi clearly dominates the computation time for all problem sizes and for both Blas versions. This shows that our current code zsytridi is unable to take advantage of the faster Goto Blas. Three different parts of LAPACK/zgeev have been timed separately: zgehrd reduces the complex matrix A to upper Hessenberg form, zhseqr computes the eigenvalues of the Hessenberg matrix, and ztrevc computes corresponding eigenvectors. The runtime for all other code parts of LAPACK/zgeev is summed under “rest”. The picture here is quite different. For the Acml Blas, the operations on the Hessenberg matrix clearly dominate for all problem sizes, whereas for the faster Goto Blas, the percentages for the three dominating parts become very similar for large problem sizes. Summarizing, we observe that our current code cannot utilize a faster Blas. This is not surprising, since so far it is dominated by Level 2 Blas operations and more effort is needed to increase the fraction of Level 3 Blas operations.
Fig. 2. Runtimes [s] of zsyevv and LAPACK/zgeev (each with the ACML and Goto Blas) operating on random matrices, plotted against the order n = 100–4000 of the eigenproblem Ax = λx.

Table 1. Percentages of runtimes spent in parts of zsyevv and LAPACK/zgeev

                    zsyevv                      LAPACK/zgeev
BLAS     n    zsytridi  compev  inverm    zgehrd  zhseqr  ztrevc  rest
Acml   500      87.2      6.5     6.3       8.1    82.6     6.2    3.1
      2000      93.9      1.5     4.6       7.2    84.2     6.2    2.4
      4000      94.5      0.8     4.7       4.4    90.3     3.8    1.5
Goto   500      87.3      6.5     6.2      15.3    66.9    12.0    5.9
      2000      92.7      1.9     5.4      22.7    50.6    18.9    7.8
      4000      93.7      1.0     5.3      28.6    37.8    23.9    9.7
5 Conclusions and Future Work
Motivated by application problems arising in optoelectronics, a tridiagonalization process for complex symmetric matrices based on complex orthogonal transformations has been investigated. Compared to the standard Lapack routine for nonsymmetric eigenproblems, the loss of numerical accuracy caused by the potentially unstable tridiagonalization process is surprisingly low in practice. However, partly in contrast to results published earlier [16], the performance benefits achieved are not yet satisfactory, especially for highly optimized Blas. The effort summarized here motivates various further research activities. Methodologically, the performance results indicate the need for blocked approaches. This suggests that non-splitting methods, where A is not split into real and imaginary parts, can be an attractive alternative. For the optoelectronics problem, the matrices in (4) can be made banded in some situations by choosing appropriate shape functions. This motivates the investigation of efficient algorithms for generalized banded complex symmetric eigenvalue problems.
References
1. Horn, R.A., Johnson, C.R.: Matrix Analysis. Cambridge University Press, Cambridge (1985)
2. Ohnami, K., Mikami, Y.: Resonance scattering in a two-dimensional non-integrable system. J. Phys. A 25, 4903–4912 (1992)
3. Bar-On, I., Ryaboy, V.: Fast diagonalization of large and dense complex symmetric matrices, with applications to quantum reaction dynamics. SIAM J. Sci. Comput. 18, 1412–1435 (1997)
4. Leung, A.Y.T., Liu, Y.F.: A generalized complex symmetric eigensolver. Comput. and Structures 43, 1183–1186 (1992)
5. Cullum, J.K., Willoughby, R.A.: A practical procedure for computing eigenvalues of large sparse nonsymmetric matrices. In: Cullum, J.K., Willoughby, R.A. (eds.) Proceedings of the IBM Europe Institute Workshop on Large Scale Eigenvalue Problems, pp. 193–223. North-Holland, Amsterdam (1986)
6. Arbenz, P., Hochstenbach, M.E.: A Jacobi–Davidson method for solving complex symmetric eigenvalue problems. SIAM J. Sci. Comput. 25(5), 1655–1673 (2004)
7. Luk, F., Qiao, S.: Using complex-orthogonal transformations to diagonalize a complex symmetric matrix. In: Luk, F.T. (ed.) Advanced Signal Processing: Algorithms, Architectures, and Implementations VII, Proc. SPIE, vol. 162, pp. 418–425 (1997)
8. Cullum, J.K., Willoughby, R.A.: A QL procedure for computing the eigenvalues of complex symmetric tridiagonal matrices. SIAM J. Matrix Anal. Appl. 17, 83–109 (1996)
9. Sudbo, A.S.: Film mode matching: A versatile numerical method for vector mode field calculations in dielectric waveguides. Pure and Appl. Optics 2, 211–233 (1993)
10. Franza, O.P., Chew, W.C.: Recursive mode matching method for multiple waveguide junction modeling. IEEE Trans. Microwave Theory Tech. 44, 87–92 (1996)
11. Bienstman, P., Baets, R.: Optical modelling of photonic crystals and VCSELs using eigenmode expansion and perfectly matched layers. Optical and Quantum Electronics 33, 327–341 (2001)
12. Finger, N., Pacher, C., Boxleitner, W.: Simulation of Guided-Wave Photonic Devices with Variational Mode-Matching. American Institute of Physics Conference Series, vol. 893, pp. 1493–1494 (April 2007)
13. Teixeira, F.L., Chew, W.C.: General closed-form PML constitutive tensors to match arbitrary bianisotropic and dispersive linear media. IEEE Microwave Guided Wave Lett. 8, 223–225 (1998)
14. Anderson, E., Bai, Z., Bischof, C.H., Blackford, S., Demmel, J.W., Dongarra, J.J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.C.: Lapack Users' Guide, 3rd edn. SIAM Press, Philadelphia (1999)
15. Cullum, J.K., Willoughby, R.A.: Lanczos Algorithms for Large Symmetric Eigenvalue Computations, vol. 1: Theory, vol. 2: Programs. Birkhäuser, Boston, MA (1985)
16. Bar-On, I., Paprzycki, M.: High performance solution of the complex symmetric eigenproblem. Numerical Algorithms 18, 195–208 (1998)
On Using Reinforcement Learning to Solve Sparse Linear Systems Erik Kuefler and Tzu-Yi Chen Computer Science Department, Pomona College, Claremont CA 91711, USA {kuefler,tzuyi}@cs.pomona.edu
Abstract. This paper describes how reinforcement learning can be used to select from a wide variety of preconditioned solvers for sparse linear systems. This approach provides a simple way to consider complex metrics of goodness, and makes it easy to evaluate a wide range of preconditioned solvers. A basic implementation recommends solvers that, when they converge, generally do so with no more than a 17% overhead in time over the best solver possible within the test framework. Potential refinements of, and extensions to, the system are discussed. Keywords: iterative methods, preconditioners, reinforcement learning.
1 Introduction
When using an iterative method to solve a large, sparse, linear system Ax = b, applying the right preconditioner can mean the difference between computing x accurately in a reasonable amount of time, and never finding x at all. Unfortunately choosing a preconditioner that improves the speed and accuracy of the subsequently applied iterative method is rarely simple. Not only is the behavior of many preconditioners not well understood, but there are a wide variety to choose from (see, for example, the surveys in [1,2]). In addition, many preconditioners allow the user to set the values of one or more parameters, and certain combinations of preconditioners can be applied in concert. Finally, there are relatively few studies comparing different preconditioners, and the guidelines that are provided tend to be general rules-of-thumb. To provide more useful problem-specific guidelines, recent work explores the use of machine learning techniques such as decision trees [3], neural networks [4], and support vector machines [5,6] for recommending preconditioned solvers. This line of research attempts to create a classifier that uses assorted structural and numerical features of a matrix in order to recommend a good preconditioned solver (with parameter settings when appropriate). At a minimum, these techniques recommend a solver that should be likely to converge to the solution vector. However, each paper also describes assorted extensions: [3] attempts to recommend a preconditioned solver that converges within some user-defined parameter of optimal, [5] attempts to give insight into why certain solvers fail,
and [4] considers different use scenarios. In addition, [7] tries to predict the efficiency of a solver in terms of its time and memory usage, and [3] describes a general framework within which many machine learning approaches could be used. Other work explores statistics-based data mining techniques [8]. A drawback of the existing work is its dependence on supervised learning techniques. In other words, to train the classifier they need access to a large body of data consisting not only of matrix features, but also information on how different preconditioned solvers perform on each matrix. If the goal is predicting convergence, the database needs to keep track of whether a particular preconditioned solver with particular parameter settings converges for each matrix. However, if time to convergence is also of interest, the database must have consistent timing information. Furthermore, there must be an adequate number of test cases to allow for accurate training. These requirements may become problematic if such techniques are to be the basis of long term solutions. An appealing alternative is reinforcement learning, which differs from previously applied machine learning techniques in several critical ways. First, it is unsupervised which means the training phase attempts to learn the best answers without being told what they are. This makes it easier to consider a large variety of preconditioned solvers since no large collection of data gathered by running examples is necessary for training the system. Second, it allows the user to define a continuous reward function which it then tries to maximize. This provides a natural way to introduce metrics of goodness that might, for example, depend on running time rather than just trying to predict convergence. Third, reinforcement learning can be used to actually solve linear systems rather than just recommending a solver. After describing how reinforcement learning can be applied to the problem of choosing between preconditioned solvers, results of experiments using a basic implementation are discussed. Extensions and refinements which may improve the accuracy and utility of the implementation are also presented.
2 Using Reinforcement Learning
Reinforcement learning is a machine learning technique that tries to gather knowledge through undirected experimentation, rather than being trained on a specially-crafted body of existing knowledge [9]. This section describes how it can be applied to the problem of selecting a preconditioned iterative solver. Applying reinforcement learning to a problem requires specifying a set of allowable actions, a reward (or cost) associated with each action, and a state representation. An agent then interacts with the environment by selecting an option from the allowable actions, and keeps track of the environment by maintaining an internal state. In response to the actions taken, the environment gives a numerical reward to the agent and may change in a way that the agent can observe by updating its state. As the agent moves within the environment, the agent attempts to assign a value to actions taken while in each state. This value is what the agent ultimately wishes to maximize, so computing an accurate
action-value function is the agent’s most important goal. Note that the reward from taking an action in a state is different from its value: the former reflects the immediate benefit of taking that single action whereas the latter is a long-term estimate of the total rewards the agent will receive in the future as a result of taking that action. The agent learns the action-value function through a training process consisting of some number of episodes. In each episode, the agent begins at some possible starting point. Without any prior experiences to guide it, the agent proceeds by performing random actions and observing the reward it receives after taking such actions. After performing many actions over several episodes, the agent eventually associates a value with every pair of states and actions. As training continues, these values are refined as the agent chooses actions unlike those it has taken previously. Eventually the agent will be able to predict the value of taking each action in any given state. At the end of the training the agent has learned a function that gives the best action to take in any given state. When the trained system is given a matrix to solve, it selects actions according to this function until it reaches a solution.

2.1 Application to Solving Sparse Linear Systems
Reinforcement learning can be applied to the problem of solving sparse linear systems by breaking down the solve process into a series of actions, specifying the options within each action, and defining the allowable transitions between actions. Fig. 1 shows an example which emphasizes the flexibility of the framework. For example, the two actions labelled “scale” and “reorder,” with transitions allowed in either direction between them, can capture the following (not unusual) sequence of actions: equilibrate the matrix, permute large entries to the diagonal, scale the matrix to give diagonal entries magnitude 1, apply a fill-reducing ordering. The implementation simply needs to allow all those matrix manipulations as options within the “scale” and “reorder” actions. Similarly, the single “apply iterative solver” step could include all the different iterative methods described in [10] as options. And every action can be made optional by including the possibility of doing nothing. Of course, increasing the flexibility in the initial specification is likely to increase the cost of training the system. The state can be captured as a combination of where the agent is in the flowchart and assorted matrix features. These features should be cheap to compute and complete enough to represent the evolution of the matrix as it undergoes assorted actions. For example, features might include the matrix bandwidth or a matrix norm: the former is likely to change after reordering and the latter after scaling. While the framework in Fig. 1 does allow for unnecessary redundant actions such as computing and applying the same fill-reducing heuristic twice, a well-chosen reward function will bias the system against such repetition. For example, a natural way to define the reward function is to use the time elapsed in computing each step. This not only allows the algorithm to see the immediate, short-term effects of the actions it plans to take, but also allows it to estimate
Fig. 1. One set of actions that could be used to describe a wide variety of solvers for sparse linear systems
the remaining time that will be required once that action is completed. In other words, the algorithm should be able to learn that taking a time-consuming action (e.g., computing a very accurate preconditioner) could be a good idea if it puts the matrix into a state that it knows to be very easy to solve. Notice that this means the framework gracefully allows for a direct solver (essentially a very accurate, but expensive to compute, preconditioner). In addition, if there are actions that result in failures from which there is no natural way to recover, those could be considered to result in essentially an infinite amount of time elapsing. If later a technique for recovery is developed, it can be incorporated into the framework by adding to the flowchart. Training the system consists of giving it a set of matrices to solve. Since the system must explore the space of possibilities and uses some randomness to do so, it should attempt to solve each matrix in the training set several times.
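The training loop described above can be illustrated with a deliberately tiny toy (Python; everything here, from the action names to the failure model, is a placeholder and not the paper's C++ implementation, which also folds matrix features into the state). It only shows the shape of the procedure: epsilon-greedy action selection per flowchart step, rewards equal to the negative elapsed time, an extra penalty on failure, and a one-step value update.

# Toy illustration of the training loop: tabular value estimates over the flowchart
# positions only.  Rewards are the negative elapsed times; a failed solve gets a
# large extra penalty.  All names and numbers are placeholders.
import random, time

ACTIONS = {"scale": ["none", "equilibrate"],
           "reorder": ["natural", "rcm"],
           "solve": ["ilu+gmres(weak)", "ilu+gmres(strong)"]}
STEPS = ["scale", "reorder", "solve"]
Q = {(s, a): 0.0 for s in STEPS for a in ACTIONS[s]}   # optimistic initial values
alpha, eps, fail_penalty = 0.1, 0.2, 1.0e3

def run_step(step, option, matrix):
    """Placeholder for actually performing the operation; returns (elapsed, failed)."""
    t0 = time.perf_counter()
    failed = step == "solve" and "weak" in option and random.random() < 0.5
    return time.perf_counter() - t0 + random.random(), failed

def episode(matrix):
    for i, step in enumerate(STEPS):
        opts = ACTIONS[step]
        a = random.choice(opts) if random.random() < eps else max(opts, key=lambda o: Q[(step, o)])
        elapsed, failed = run_step(step, a, matrix)
        reward = -elapsed - (fail_penalty if failed else 0.0)
        nxt = 0.0 if i + 1 == len(STEPS) else max(Q[(STEPS[i + 1], o)] for o in ACTIONS[STEPS[i + 1]])
        Q[(step, a)] += alpha * (reward + nxt - Q[(step, a)])

for _ in range(1000):
    episode(matrix=None)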
2.2 Implementation Details
The general framework for applying reinforcement learning to this problem is described above; important details that are specific to the implementation discussed in this paper are presented here. First, the set of steps and allowable actions are restricted to those shown in Fig. 2. There are fewer actions than in Fig. 1, and the options within each action are restricted to the following:
– Equilibrate: The matrix can be initially equilibrated, or left alone.
– Reorder: The rows and columns of the matrix can be left unpermuted (natural), or one or the other could be reordered using a permutation computed using: MC64 [11,12], Reverse Cuthill-McKee [13], or COLAMD [14,15].
– Precondition: The preconditioner is restricted to the ILUTP Mem [16] variant of incomplete LU, with one of 72 combinations of parameter settings: lfil between 0 and 5 inclusive, a droptol of 0, .001, .01, or .1, and a pivtol of 0, .1, or 1.
– Solve: The iterative solver is restricted to GMRES(50) [17] with a maximum of 500 iterations and a relative residual of 1e−8.
The reinforcement learning framework allows for many more combinations of preconditioners than earlier studies which also restrict the solver to restarted
Fig. 2. Possible transitions between steps and their associated actions
GMRES and/or the preconditioner to a variant of ILU [4,5,6,7,18]. Observe, for example, that equilibration is now optional. Hence a total of 576 preconditioned solvers are described by the above framework; this is notably more than used to evaluate systems based on other machine learning techniques [3,4,5]. A system for automatically selecting from amongst so many options is particularly valuable given previous work that shows the difficulty of presenting information accurately comparing different preconditioned solvers across a range of metrics [19]. Note that because the state keeps track of where the program is in the flowchart, the system can restart the entire preconditioned solve if and only if the incomplete factorization breaks down or if GMRES fails to converge. As a result, the final system will be more robust since it can try different approaches if the first fails. While such step-based restrictions are not strictly necessary, incorporating domain knowledge by requiring the agent to perform computations in a logical order should reduce the training time and improve the accuracy of the trained system. The state also keeps track of 32 structural and numerical features derived from the matrix itself. These are the same features as those used in [4], which are a subset of those used in [3,18]. Since each action changed the values of some of the features, this allowed the agent to observe the changes it made to the matrix during the computation and to react to those changes. Finally, since the overall goal is minimizing the total time required to solve the matrices in the training set, the reward function used is the negative of the time required to complete that step. To bias the system against actions which are very fast but do not lead to a successful solve, the agent receives an additional reward (penalty) if GMRES fails to converge or if the ILU preconditioner cannot be computed. Without this safeguard, the agent might repeatedly take an action that cannot succeed and thus make no progress in learning the action-value function. The action-value function is initialized to 0, even though all true action values are negative. This is the “optimistic initial values” heuristic described in [9] that has the beneficial effect of encouraging exploration during early iterations of the
algorithm. Since the agent is effectively expecting a reward of 0 for each action, it will be continually “disappointed” with each action it takes after receiving a negative reward, and will thus be encouraged to experiment with a wide range of actions before eventually learning that they will all give negative rewards. The high-level reinforcement learning algorithm was implemented in C++, with C and Fortran 77 used for the matrix operations. The code was compiled using g++, gcc, and g77 using the -O2 and -pthread compiler flags. The testing, training, and exhaustive solves were run on a pair of Mac Pro computers each running Ubuntu with 2 GB of RAM and four 2.66 GHz processors.
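A small sketch (illustrative, with invented identifiers) makes the size of the restricted search space explicit and shows the optimistic zero initialization: 2 equilibration choices × 4 orderings × 72 ILUTP Mem parameter settings give the 576 preconditioned solvers mentioned above.

# Counting the preconditioned solvers allowed by the restrictions above:
# 2 equilibration choices x 4 orderings x (6 lfil x 4 droptol x 3 pivtol) ILUTP_Mem
# settings, followed by GMRES(50), gives 576.  Names are illustrative.
from itertools import product

equilibrate = [False, True]
reorder = ["natural", "MC64", "RCM", "COLAMD"]
ilutp = list(product(range(6),                  # lfil in 0..5
                     [0.0, 1e-3, 1e-2, 1e-1],   # droptol
                     [0.0, 0.1, 1.0]))          # pivtol
solvers = list(product(equilibrate, reorder, ilutp))
print(len(ilutp), len(solvers))   # 72 576

# optimistic initialization as described above: start every value estimate at 0 even
# though all true rewards (negative times) are below 0, which encourages exploration
Q = {cfg: 0.0 for cfg in solvers}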
3 Experimental Results
The system described above was tested on a pool of 664 matrices selected from the University of Florida sparse matrix collection [20]. So that the results could be compared against the best results possible, all 576 preconditioned solvers allowed for by Fig. 2 were run on each matrix. However, due to time constraints, only 608 of the 664 matrices completed all 576 runs. Fig. 3 plots the number of matrices (out of 608) that converged for a given number of runs; note that each bar represents a decile. Every matrix converged for at least one setting, and 7 converged for all settings. Overall, 42% of the tested preconditioned solvers converged. For each matrix the fastest time taken to solve it was also saved so that the results using the trained system could be compared to it.
Fig. 3. Convergence results from testing all 576 possible preconditioned solvers on 608 of the matrices in the test suite. The y-axis gives the number of matrices which converged for some number of the solvers, the x-axis partitions 576 into deciles.
3.1 Methodology
The following protocol for training and testing was repeated 10 times. The system was trained on 10% of the matrices, chosen at random, by solving each of those matrices 40 times. Since the framework restarts if the ILU factorization fails or GMRES does not converge, potentially many more than 40
attempts were made. As demonstrated in Fig. 3, every matrix can be solved by at least one solver, so eventually repeated restarts should result in finding the solution. After the training phase, the algorithm was tested on two sets of matrices. The first was equivalent to the training set; the second contained all 664 matrices. From each testing set the number of matrices successfully solved on the algorithm’s first attempt (without a restart on failure) was calculated. Next, the time taken to solve each matrix was divided by the fastest time possible within the framework described in Section 2.2. If the reinforcement learning algorithm did a good job of learning, this ratio should be close to 1. If it equals 1, then the algorithm learned to solve every matrix in the best possible way.
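The two reported quantities can be computed directly from the per-matrix timings, as in the following sketch (illustrative; failed solves are marked with NaN, and the best times come from the exhaustive runs):

# Sketch of the two reported metrics: the fraction solved on the first attempt and
# the median of (time taken) / (fastest time found by exhaustive search).
import numpy as np

def summarize(solve_times, best_times):
    """solve_times[i] is the trained system's time for matrix i (np.nan if it failed);
    best_times[i] is the fastest of the 576 exhaustive runs for matrix i."""
    solve_times, best_times = np.asarray(solve_times, float), np.asarray(best_times, float)
    solved = ~np.isnan(solve_times)
    ratios = solve_times[solved] / best_times[solved]
    return solved.mean(), np.median(ratios)

print(summarize([1.2, np.nan, 0.7, 2.0], [1.0, 0.5, 0.6, 1.9]))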
3.2 Results
Table 1 gives both the percentages of matrices that the system successfully solves on its first try and the time it took to solve them. These numbers are given both when the algorithm is tested on matrices in its training set and when it is tested on a more diverse set of matrices.

Table 1. Percent of systems successfully solved, and the median ratio of the time taken to solve those systems vs. the fastest solver possible, both when the testing and training sets are equivalent and when the testing set is larger and more diverse

                  testing = training    testing = all matrices
percent solved          81.8%                   56.4%
ratio of time           1.14                    1.16
As expected, convergence results are best when the training and testing set are identical, with a success rate of 81.8%. When tested on the entire set of matrices, 56.4% of matrices were successfully solved (note that both of these percentages should go up if restarts are allowed). As was done for Fig. 3, Fig. 4 plots the number of matrices that were successfully solved in a given number of trials. Note that there were 10 trials overall and that, on average, a matrix should only be in the training set once. Comparing Fig. 4 to Fig. 3, observe that matrices were more likely to be solved in a greater percentage of cases, and that a larger number of cases converged overall (56% vs 42%). This indicates that the system has learned an action-value function that appropriately penalizes preconditioned solvers which cannot solve a system. Since the time taken to solve each matrix must be compared to the optimal time (as computed through exhaustive search), the second row in Table 1 takes the ratio of solved time to best possible time and gives the median of those ratios. Note that this ratio could only be computed for the 608 matrices on which the full set of exhaustive runs was completed. While the results were slightly better when the training and testing sets were equivalent, overall half the matrices that were solved were done so with no more than 16% overhead over the fastest
Fig. 4. The number of matrices which were correctly solved on the first try for a given number of trials (out of 10)
solution possible regardless of whether the matrix was in the testing set as well as the training set.
4 Discussion
This paper describes a framework for using reinforcement learning to solve sparse linear systems. This framework differs from that of previous systems based on other machine learning techniques because it can easily factor running time into the recommendation, it makes it practical to consider a far larger number of potential preconditioned solvers, and it actually solves the system. In addition, the framework is extensible in the sense that it is simple to add new operations such as a novel iterative solver or a new choice of preconditioner. An initial implementation that focussed on solving systems using ILU preconditioned GMRES is described. And while the convergence results presented in Section 3 are not as good as those in papers such as [4], the problem being solved here is more complex: rather than predicting if any of a set of preconstructed solvers would be likely to solve a particular matrix, this architecture creates its own solver as an arbitrary combination of lower level operations. Furthermore, the results are based on the system’s first attempt at solving a problem — there was no possibility of a restart on failure since, without learning (which injects some randomness) in the final trained system, a restart without some matrix modification would result in the same failure. Note that either incorporating randomness (say by enabling learning) and allowing a restart after any kind of failure, or trying something more complex such as adding αI to A [21] upon a failure to compute the ILU preconditioner, should improve the convergence results. Of course, restarts would take time, so the ratio of time solved to best possible time would increase. The fact that the code had trouble solving general-case matrices when the testing set is much more diverse than the training set suggests that the algorithm may not be generalizing sufficiently. This is a known issue in reinforcement learning (and all other machine learning techniques), and there are standard ways to attempt to improve this. Possibilities include a more sophisticated state
encoding (e.g., Kanerva Coding [22]), or reducing the set of matrix features used to define the state to those that are particularly meaningful (work on determining these features is currently underway). As with other machine learning techniques, there are also many opportunities to find better constants in the implementation. For the tested implementation, values for parameters such as the number of training episodes, the learning rate, the eligibility trace decay, the size of tiles, and the number of tilings were chosen based on general principles and were experimented with only slightly. An intriguing direction for future work is exploring alternative reward functions. Even within the current implementation a modified reward function that, say, punished failure more might improve the behavior of the trained system. But, in addition, the reward function could be modified to use any metric of goodness. For example, a function that depended on a combination of space and time usage could be used to build a recommendation system that would take into account both. And, in fact, one could imagine a personalized system for solving sparse linear systems that allows users to define a reward function which depends on the relative utilities they assign to a wide variety of resources.

Acknowledgements. The authors would like to thank Tom Dietterich for helpful discussions. This work was funded in part by the National Science Foundation under grant #CCF-0446604. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.
References
1. Benzi, M.: Preconditioning techniques for large linear systems: A survey. J. of Comp. Physics 182(2), 418–477 (2002)
2. Saad, Y., van der Vorst, H.A.: Iterative solution of linear systems in the 20th century. J. Comput. Appl. Math. 123(1-2), 1–33 (2000)
3. Bhowmick, S., Eijkhout, V., Freund, Y., Fuentes, E., Keyes, D.: Application of machine learning to the selection of sparse linear solvers. International Journal of High Performance Computing Applications (submitted, 2006)
4. Holloway, A.L., Chen, T.-Y.: Neural networks for predicting the behavior of preconditioned iterative solvers. In: Shi, Y., van Albada, G.D., Dongarra, J., Sloot, P.M.A. (eds.) ICCS 2007. LNCS, vol. 4487, pp. 302–309. Springer, Heidelberg (2007)
5. Xu, S., Zhang, J.: Solvability prediction of sparse matrices with matrix structure-based preconditioners. In: Proc. Preconditioning 2005, Atlanta, Georgia (2005)
6. Xu, S., Zhang, J.: SVM classification for predicting sparse matrix solvability with parameterized matrix preconditioners. Technical Report 450-06, University of Kentucky (2006)
7. George, T., Sarin, V.: An approach recommender for preconditioned iterative solvers. In: Proc. Preconditioning 2007, Toulouse, France (2007)
8. Ramakrishnan, N., Ribbens, C.J.: Mining and visualizing recommendation spaces for elliptic PDEs with continuous attributes. ACM Trans. on Math. Softw. 26(2), 254–273 (2000)
9. Sutton, R.S., Barto, A.G.: Reinforcement Learning: An Introduction. MIT Press, Cambridge (1998)
10. Barrett, R., Berry, M., Chan, T.F., Demmel, J., Donato, J., Dongarra, J., Eijkhout, V., Pozo, R., Romine, C., van der Vorst, H.: Templates for the solution of linear systems: Building blocks for iterative methods. SIAM, Philadelphia (1994)
11. Duff, I.S., Koster, J.: The design and use of algorithms for permuting large entries to the diagonal of sparse matrices. SIAM J. Matrix Anal. Appl. 20(4), 889–901 (1999)
12. Duff, I.S., Koster, J.: On algorithms for permuting large entries to the diagonal of a sparse matrix. SIAM J. Matrix Anal. Appl. 22(4), 973–996 (2001)
13. Cuthill, E., McKee, J.: Reducing the bandwidth of sparse symmetric matrices. In: Proc. of the 24th Natl. Conf. of the ACM, pp. 157–172 (1969)
14. Davis, T., Gilbert, J., Larimore, S., Ng, E.: Algorithm 836: COLAMD, a column approximate minimum degree ordering algorithm. ACM Trans. on Math. Softw. 30(3), 377–380 (2004)
15. Davis, T., Gilbert, J., Larimore, S., Ng, E.: A column approximate minimum degree ordering algorithm. ACM Trans. on Math. Softw. 30(3), 353–376 (2004)
16. Chen, T.-Y.: ILUTP Mem: A space-efficient incomplete LU preconditioner. In: Laganà, A., Gavrilova, M.L., Kumar, V., Mun, Y., Tan, C.J.K., Gervasi, O. (eds.) ICCSA 2004. LNCS, vol. 3046, pp. 31–39. Springer, Heidelberg (2004)
17. Saad, Y., Schultz, M.H.: GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM J. Sci. Stat. Comput. 7(3), 856–869 (1986)
18. Xu, S., Zhang, J.: A data mining approach to matrix preconditioning problem. Technical Report 433-05, University of Kentucky (2005)
19. Lazzareschi, M., Chen, T.-Y.: Using performance profiles to evaluate preconditioners for iterative methods. In: Gavrilova, M.L., Gervasi, O., Kumar, V., Tan, C.J.K., Taniar, D., Laganà, A., Mun, Y., Choo, H. (eds.) ICCSA 2006. LNCS, vol. 3982, pp. 1081–1089. Springer, Heidelberg (2006)
20. Davis, T.: University of Florida sparse matrix collection. NA Digest 92(42), October 16, 1994; NA Digest 96(28), July 23, 1996; and NA Digest 97(23), June 7, 1997. http://www.cise.ufl.edu/research/sparse/matrices/
21. Manteuffel, T.A.: An incomplete factorization technique for positive definite linear systems. Mathematics of Computation 34, 473–497 (1980)
22. Kanerva, P.: Sparse Distributed Memory. MIT Press, Cambridge (1988)
Reutilization of Partial LU Factorizations for Self-adaptive hp Finite Element Method Solver
Maciej Paszynski and Robert Schaefer
Department of Computer Science, AGH University of Science and Technology, Al. Mickiewicza 30, 30-059 Kraków, Poland
paszynsk,[email protected]
http://home.agh.edu.pl/~paszynsk
Abstract. The paper presents theoretical analysis of the extension of the new direct solver dedicated for the fully automatic hp adaptive Finite Element Method. The self-adaptive hp-FEM generates in a fully automatic mode (without any user interaction) a sequence of meshes delivering exponential convergence of the numerical error with respect to the mesh size. The consecutive meshes are obtained by performing h, p or hp refinements. The proposed solver constructs an initial elimination tree based on the nested dissection algorithm executed over the initial mesh. The constructed elimination tree is updated each time the mesh is refined, by adding the elimination sub-tree related to the executed refinement. We propose a new strategy for reutilization of partial LU factorizations computed by the direct solver on the previous mesh, when solving a consecutive mesh from the sequence. We show that the number of LU factorizations that must be recomputed is linearly proportional to the number of singularities in the problem.
1 Motivation and the Basic Idea of Solution
The paper presents a theoretical analysis of the extension of the sequential and parallel solvers [1], [2] dedicated for the self-adaptive hp Finite Element Method [3], [4], [5]. The self-adaptive hp-FEM generates a sequence of approximation spaces delivering exponential convergence of the numerical error of the resulting approximation of the variational problem under consideration. The exponential convergence of the error is obtained with respect to the dimension of the approximation space. The self-adaptive hp-FEM starts from an initial approximation space, constructed by utilizing a given uniform initial finite element mesh. The first order polynomial basis functions ("pyramids") are related to vertices of the mesh, and the higher order polynomial basis functions are related to finite element edges and interiors [3]. The consecutive spaces from the produced sequence are obtained by performing so-called h or p refinements. The h refinement consists in breaking a selected finite element into new son-elements, and adding new basis functions related to the just created elements. The p refinement consists in adding higher order basis functions associated with selected element edges or interiors. The refinements performed to improve the quality of the approximation
Fig. 1. Updating of the elimination tree when the mesh is h refined
space are selected by utilizing a knowledge-driven algorithm [6] based on the graph grammar formalism. An efficient solver must be utilized to compute coefficients of the projection of the considered weak (variational) problem solution onto the current approximation space. The coefficients are called degrees of freedom (d.o.f.). These coefficients, denoted by u^i_{hp}, are computed by solving the system of equations

  Σ_{i=1}^{dim} u^i_{hp} b(e_i, e_j) = l(e_j)   ∀ j = 1, ..., dim,   (1)
where dim denotes the dimension of the approximation space (number of the basis functions), {e_k}_{k=1}^{dim} denote the basis functions, and b(e_i, e_j) and l(e_j) are matrix and right-hand-side vector entries obtained by computing some integrals resulting from the considered problem. Here we present a short description of direct solvers utilized by FEM. The frontal solver browses finite elements in the order prescribed by the user and aggregates d.o.f. to the so-called frontal matrix. Based on the element connectivity information it recognizes fully assembled degrees of freedom and eliminates them from the frontal matrix [7]. This is done to keep the size of the frontal matrix as small as possible. The key to efficient work of the frontal solver is the optimal ordering of finite elements. The multi-frontal solver constructs the d.o.f. connectivity tree based on analysis of the geometry of the computational domain [7]. The frontal elimination pattern is utilized on every tree branch. Finite elements are joined into pairs and d.o.f. are assembled into the frontal matrix associated with the branch. The process is repeated until the root of the assembly tree is reached. Finally, the common dense problem is solved and partial backward substitutions are recursively executed on the assembly tree. The sub-structuring method solver is a parallel solver working over a computational domain partitioned into multiple sub-domains [8]. First, the sub-domains' internal d.o.f. are eliminated with respect to the interface d.o.f. Second, the interface problem is solved. Finally, the internal problems are solved by executing backward substitution on each sub-domain. This can be done by performing frontal decomposition on each sub-domain, and then solving the interface problem by a sequential frontal solver (this method is called the multiple fronts solver [9]). A better method is to
Fig. 2. Elimination tree for a simple two-element mesh. Fully aggregated degrees of freedom from element interiors are eliminated in parallel, the resulting Schur complement contributions are added, and the common interface problem is finally solved. The process is followed by recursive backward substitutions (not presented in the picture).
solve the interface problem also by a parallel solver (this is called the direct sub-structuring method solver). The parallel implementation of the multi-frontal solver is called the sparse direct method solver. The MUlti frontal Massively Parallel Solver (MUMPS) [10] is an example of such a solver. A new efficient sequential and parallel solver for self-adaptive hp-FEM has been designed [1], [2], utilizing an elimination tree constructed based on the history of mesh refinements. The elimination tree for the initial mesh is created by utilizing the nested dissection algorithm. An exemplary two-element mesh with its elimination tree is presented in the first panel of Fig. 1. Each time a decision about mesh refinement is made, the elimination tree is dynamically expanded by adding a sub-tree related to the performed refinements. An example of two h refinements performed on the initial mesh, with the resulting expansion of the elimination tree, is presented in Fig. 1. Thus, we can distinguish two levels on the elimination tree. The first level is related to the initial mesh elements, and the second level is related to refinements performed over the initial mesh. The following observation is the key idea of the designed solver [1], [6]. The integral b(e_i, e_j) is non-zero only if the intersection of the supports of e_i and e_j is not empty. The support of a vertex basis function spreads over the finite elements sharing the vertex, the support of an element edge basis function spreads over the two finite elements adjacent to the edge, and finally the support of an element interior basis function spreads only over the element. Thus, the integral b(e_i, e_j) is zero if the basis functions are related to distant elements. The solver first constructs partially aggregated sub-matrices related to single finite elements, then it eliminates those entries that have already been fully assembled, and then it recursively merges the resulting sub-matrices and eliminates fully assembled entries until it reaches the top of the elimination tree. Finally, it executes recursive backward substitutions, from the root of the tree down to the leaves. An exemplary execution of the solver on the two-element initial mesh from Fig. 1 is presented in Fig. 2. The resulting LU factorizations computed at every node of the elimination tree can be stored at tree nodes for further reutilization. Each time the mesh
Fig. 3. The problem is solved over the first mesh. All LU factorizations (black and grey) are computed. Then, the mesh is refined, and the problem is solved again. Grey LU factorizations are reutilized from the previous mesh, but all brown LU factorizations must be recomputed. Black LU factorizations from previous mesh are deleted.
is refined, the LU factorizations from the unrefined parts of the mesh can be reutilized. There is a need to recompute the LU factorizations over the refined elements, as well as on the whole path from any refined leaf up to the root of the elimination tree. An example of the reutilization of partial LU factorizations after performing two local refinements is presented in Fig. 3.
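A minimal sketch of the bookkeeping implied by Fig. 3 (illustrative Python, not the solver's code): given the set of refined leaves and the parent pointers of the elimination tree, every node on a path from a refined leaf to the root is marked for recomputation, while all other nodes keep their stored factorizations.

# Sketch of which factorizations must be redone after local refinements: every node
# on a path from a refined leaf to the root is marked dirty, everything else keeps
# its stored LU/Schur data.  The tree here is just parent pointers.
def nodes_to_recompute(refined_leaves, parent):
    dirty = set()
    for leaf in refined_leaves:
        node = leaf
        while node is not None and node not in dirty:
            dirty.add(node)
            node = parent.get(node)
    return dirty

# tiny elimination tree:  root -> (a, b),  a -> (a1, a2),  b -> (b1, b2)
parent = {"a": "root", "b": "root", "a1": "a", "a2": "a", "b1": "b", "b2": "b", "root": None}
print(nodes_to_recompute({"a1"}, parent))          # {'a1', 'a', 'root'}
print(nodes_to_recompute({"a1", "b2"}, parent))    # two paths, shared root counted once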
2 Theoretical Analysis of the Solver Efficiency
We start this section with a sketch of the recursive solver algorithm, with reutilization of LU factorizations.

matrix function recursive solver(tree node)
  if (tree node has no son nodes) then
    eliminate leaf element stiffness matrix internal nodes;
    store Schur complement sub-matrix at tree node;
    return (Schur complement sub-matrix);
  else if (tree node has son nodes) then
    do (for each tree node son)
      if (sub-tree has been refined) then
        son matrix = recursive solver(tree node son);
      else
        get the Schur complement sub-matrix from tree node son;
      endif
      merge son matrix into new matrix;
    enddo
    decide which unknowns of new matrix can be eliminated;
    perform partial forward elimination on new matrix;
    store Schur complement sub-matrix at tree node;
    return (Schur complement sub-matrix);
  endif
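The same recursion can be written down concretely for dense sub-matrices. The sketch below (Python/NumPy, an illustration of the idea rather than the actual hp-FEM solver) eliminates the internal unknowns of a node via a Schur complement, merges the sons' contributions through a user-supplied merge function (the d.o.f. numbering is problem dependent), and returns a cached Schur complement unchanged when the corresponding subtree has not been refined; backward substitution and the right-hand side are omitted.

# Dense, single-matrix sketch of the recursion above: each tree node eliminates its
# fully assembled (internal) unknowns and returns the Schur complement for its
# interface unknowns; an unrefined subtree returns the Schur complement cached in a
# previous solve instead of recomputing it.
import numpy as np

class TreeNode:
    def __init__(self, sons=None, leaf_matrix=None, internal=None, interface=None):
        self.sons = sons or []
        self.leaf_matrix = leaf_matrix      # element matrix (leaves only)
        self.internal = internal            # indices eliminated at this node
        self.interface = interface          # indices kept for the parent
        self.refined = True                 # False => reuse the cached Schur complement
        self.schur = None

def eliminate(M, internal, interface):
    Mii, Mib = M[np.ix_(internal, internal)], M[np.ix_(internal, interface)]
    Mbi, Mbb = M[np.ix_(interface, internal)], M[np.ix_(interface, interface)]
    return Mbb - Mbi @ np.linalg.solve(Mii, Mib)          # Schur complement

def recursive_solver(node, merge):
    # merge(list_of_son_schurs) assembles the joint matrix for this node;
    # its definition depends on the shared d.o.f. numbering and is not shown here
    if node.schur is not None and not node.refined:
        return node.schur                                 # reutilization
    if not node.sons:
        M = node.leaf_matrix
    else:
        M = merge([recursive_solver(s, merge) for s in node.sons])
    node.schur = eliminate(M, node.internal, node.interface)
    return node.schur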
Computational Complexity of the Sequential, Recursive Solver Without Reutilization of LU Factorizations. Let us first estimate the number of operations performed by a sequential recursive solver during forward elimination over a square-shaped 2D finite element mesh with N = 2^n × 2^n finite elements. The order of approximation in the interior of an element is assumed to be equal to (p_1, p_2). The orders of approximation on element edges are assumed to be equal to the corresponding orders in the interior. From this assumption it follows that there are 2 edges with order p_1 and 2 edges with order p_2. The total number of d.o.f. in such an element is nrdof = (p_1 + 1)(p_2 + 1) = O(p_1 p_2). To estimate the efficiency of the sequential solver, we assume that p_1 = p_2 = p, e.g. by taking p = max{p_1, p_2}. Thus, the total number of d.o.f. satisfies nrdof = (p + 1)^2 = O(p^2), the number of interior d.o.f. can be evaluated as interior nrdof = (p − 1)^2 = O(p^2), and the number of interface d.o.f. satisfies interface nrdof = (p + 1)^2 − (p − 1)^2 = 4p = O(p). The recursive solver eliminates d.o.f. related to element interiors. The computational complexity of this step is 2^{2n} × O(p^6), since there are 2^{2n} such finite elements and the internal d.o.f. elimination cost is O(p^6) on every element. Then, the solver joins elements into pairs and eliminates d.o.f. related to common edges. The computational complexity of this operation is 2^{2n−1} × ((2 + 4 + 1) p)^2 × (2 + 4) p, since there are 2^{2n−1} such pairs of elements, there are 7 edges in total within a pair, and only one edge is eliminated. In the next step elements are joined into sets of four, and d.o.f. related to two common edges are eliminated. The computational complexity of this step is 2^{2n−2} × ((4 × 2 + 2) p)^2 × (4 × 2) p, since there are 2^{2n−2} such sets of elements, there are 10 edges in every set, and only 2 edges are eliminated.
Fig. 4. Two tested meshes with uniform p = 4 and p = 5
The process is repeated until we reach the root of the elimination tree. The total computational complexity of this process is

  2^{2n} p^6 + 2^{2n−1} (2 + 4 + 1)^2 p^2 (2 + 4) p
  + Σ_{k=1,...,n} [ 2^{2n−2k−1} (2 × 2^{k+1} + 2 × 2^k + 2^k)^2 p^2 (2 × 2^{k+1} + 2 × 2^k) p
                  + 2^{2n−2k} (2 × 2^k + 2 × 2^k + 2^k)^2 p^2 (2 × 2^k + 2 × 2^k) p ].
Fig. 5. The execution time of the parallel solver over the second tested mesh
This can be estimated by utilizing the sum of the geometric series as

  T_1 = O(2^{2n} p^6) + O(2^{2n−1} p^3) + O( Σ_{k=1,...,n} 2^{2n+k+5} p^3 )
      = O( 2^{2n} p^6 + (2^{2n−1} + 2^{3n+6} − 2^{2n+4}) p^3 ) = O(2^{2n} p^6 + 2^{3n} p^3 + 2^{2n} p^3).   (2)

Computational Complexity of the Sequential Solver With Reutilization of LU Factorizations. In this section we perform the same analysis of the computational complexity as in the previous section, but this time we assume that the problem over the computational mesh has already been solved, and only one element has been h refined in the direction of a mesh corner singularity. In this case, there is a need to compute all LU factorizations related to the elimination sub-tree associated with the broken corner element. It is also necessary to recompute all LU factorizations on the single path from the refined element (represented by a leaf in the original elimination tree) up to the root of the tree. The computational complexity over the broken element is

  4 p^6 + 2 (2 + 4 + 1)^2 p^2 (2 + 4) p + (4 × 2 + 2)^2 p^2 (4 × 2) p,   (3)

since there are 4 element interiors, two single common edges and 1 twofold edge. The computational complexity of the recomputation of the whole path from the refined leaf up to the elimination tree root can be estimated by utilizing equation (2), with the correction that there is only one set of elements on every level of the tree, and without the leaf element computations already estimated in (3):

  (2 + 4 + 1)^2 p^2 (2 + 4) p
  + Σ_{k=1,...,n} [ (2 × 2^{k+1} + 2 × 2^k + 2^k)^2 p^2 (2 × 2^{k+1} + 2 × 2^k) p
                  + (2 × 2^k + 2 × 2^k + 2^k)^2 p^2 (2 × 2^k + 2 × 2^k) p ].   (4)
Table 1. Execution time at different elimination tree nodes on the two tested meshes

                              First mesh              Second mesh
Tree level  Nodes number  min time [s]  max time [s]  min time [s]  max time [s]
    1            1           0.115         0.115         0.212         0.212
    2            2           0.854         0.883         1.631         1.674
    3            4           0.864         2.406         1.617         4.625
    4            8           0.828         2.542         1.675         4.535
    5           16           0.904         2.750         1.621         4.686
    6           32           0.049         0.230         1.606         4.763
    7           64          < 10^-2       < 10^-2       < 10^-2        0.110
  8-14       128-9216       < 10^-3       < 10^-3       < 10^-3       < 10^-3
The total computational complexity of the solver reutilizing LU factorizations is equal to the sum of (3) and (4), that is

  T_1^1 = O(p^6) + O(p^3) + O( Σ_{k=1,...,n} 2^{3k+6} p^3 )
        = O( p^6 + (1 + 2^{3n+6} − 2^6) p^3 ) = O(p^6 + 2^{3n} p^3).   (5)

In the case of multiple refined leaves, the pessimistic estimation is that each leaf will generate a separate path to be totally recomputed. Thus, the total computational complexity with r refined leaves (resulting from r/4 singularities) is

  T_1^r = O( r p^6 + (r + r 2^{3n+6} − r 2^6) p^3 ) = O(r p^6 + r 2^{3n} p^3).   (6)
We conclude this section with a comparison of the execution time of the sequential solver with reutilization of LU factorizations with respect to the sequential solver without the reutilization:

  T_1 / T_1^r = O( 2^{2n} / r ) = O( N / r ).   (7)

The solver with reutilization of partial LU factorizations is O(N/r) times faster.

Complexity of the Parallel Solver Without Reutilization of LU Factorizations. The parallel version of the solver exchanges the partially aggregated matrices between same-level nodes [1]. Leaves of the elimination tree are assigned to different processors. When traveling up the elimination tree, the local Schur complements are sent from the second child node to the first one (to the first processor in every set). To estimate the computational complexity of the parallel recursive solver, we assume that the number of processors is P = 2^{2m}. Each processor is responsible for its part of the mesh, with 2^{2n−2m} finite elements. Thus, each processor performs

  O( 2^{2(n−m)} p^6 + 2^{3(n−m)} p^3 )   (8)
operations on its part of the mesh. After this step, all computations over the elimination tree are performed fully in parallel:

  Σ_{k=m+1,...,n} [ (2 × 2^{k+1} + 2 × 2^k + 2^k)^2 p^2 (2 × 2^{k+1} + 2 × 2^k) p
                  + (2 × 2^k + 2 × 2^k + 2^k)^2 p^2 (2 × 2^k + 2 × 2^k) p ]
  = O( p^3 Σ_{k=m+1,...,n} 2^{2k} ) = O( p^3 Σ_{k=1,...,n−m} 2^{2(m+k)} ) = O( 2^{2(n−m)} p^3 ).   (9)
The communication complexity involves 2(n − m + 1) parallel point-to-point communications where sub-matrices related to local Schur complements are exchanged between pairs of tree nodes. The communication complexity is then

  Σ_{k=m+1,...,n} 2 × (2^k × p)^2 = O( p^2 Σ_{k=1,...,n−m} 2^{2(m+k)} ) = O( 2^{2(n−m)} p^2 )   (10)
since the size of every sub-matrix is 2^k × p. The total complexity of the parallel solver without reutilization of the LU factorizations is then

  T_P = ( 2^{2(n−m)} p^6 + 2^{3(n−m)} p^3 + 2^{2(n−m)} p^3 ) × t_comp + 2^{2(n−m)} p^2 × t_comm   (11)

with P = 2^{2m} the number of processors, and p the order of approximation.

Complexity of the Parallel Solver With Reutilization of LU Factorizations. In the case of the parallelization of the reutilization, the maximum number of processors that can be utilized is equal to r, the number of elements refined within the actual mesh. Each refinement requires the recomputation of the whole path from the refined leaf up to the tree root, which is purely sequential. If the number of processors P = 2^{2m} is larger than or equal to the number of executed refinements, 2^{2m} ≥ r, then the total computational complexity can be roughly estimated as the parallel execution of r paths from a leaf to the root of the tree, which is equal to (5). The communication complexity remains unchanged, since there is still a need to exchange the LU factorizations, even if they are taken from local tree nodes. Thus the communication complexity is equal to (10). The total complexity of the parallel solver with reutilization of LU factorizations is
(12)
This is the "best parallel time" that can be obtained by the parallel solver with reutilization of partial LU factorizations, under the assumption that we have enough available processors (P = 2^{2m} ≥ r). In other words, it is not possible to utilize more processors than the number of refined elements r. We can compare the execution time of the parallel solver with reutilization to that of the parallel solver without the reutilization (as usual under the assumption that we have enough processors, P = 2^{2m} ≥ r):
  T_P / T_P^r = O( 2^{2(n−m)} ) = O( N / 2^{2m} ) = O( N / P ) ≤ O( N / r ).   (13)

The parallel solver with reutilization is O(N/r) times faster than the parallel solver without the reutilization.
Reutilization of Partial LU Factorizations
3
973
Test Results
We conclude the presentation with two numerical experiments, presented in Fig. 4. The goal of these experiments is to illustrate the limitation of the scalability of the solver by the sequential part of the algorithm - the longest path from the root of the elimination tree down to the deepest leaf. For more numerical experiments executed for much larger problems, with more detailed discussion on the performance of the solver, as well as for the detailed comparison with the MUMPS solver, we refer to [1]. Both numerical experiments have been performed for the 3D Direct Current (DC) borehole resistivity measurement simulations [11]. The 3D problem has been reduced to 2D by utilizing the Fourier series expansions in the non-orthogonal system of coordinates. We refer to [11] for the detailed problem formulation. The first mesh contains 9216 finite elements with polynomial order of approximation p = 4, and 148, 257 d.o.f. The second mesh contains 9216 finite elements with polynomial order of approximation p = 5, and 231, 401 d.o.f. Both meshes have been obtained by performing two global hp refinements from the initial mesh with 32 × 18 = 576 finite elements with polynomial order of approximation p = 2 or p = 3, respectively. There are necessary 10 nested dissection cross-sections of the initial mesh, since 32 × 18 ≤ 25 × 25 . Thus, the depth of the initial elimination tree is 10. Each global hp refinement consists in breaking each finite element into 4 son elements and increasing polynomial order of approximation by 1. Thus, each global hp refinement adds 2 levels to the elimination tree, so the total number of levels in the elimination tree is 14. Table 1 contains the total number of nodes at given elimination tree level, as well as the minimum and maximum Schur complement computation times for nodes located at given level of the elimination tree. The time of computing the entire path of partial LU factorization from a tree leaf up to the elimination tree root varies from 4 sec. to 9 sec. on the first mesh and from about 10 sec. up to 17 sec. on the second mesh. The execution time of the sequential solver with reutilization of LU factorizations over r times refined mesh will be within (4 × r, 9 × r) sec. over the first and (10 × r, 17 × r) sec. over the second mesh. The execution time of the parallel solver with reutilization of LU factorizations over r times refined mesh will be within (4, 9) sec. over the first and (10, 17) sec. over the second mesh, if there are more processors than refined elements. We present also in Fig. 5 the execution time of the parallel solver over the first mesh with N = 231, 401 unknowns, for increasing number of processors. We observe that the parallel solver execution time is limited by the maximum time required to solve the entire path, which is about 9 second in this case.
4
Conclusions
We proposed a new algorithm for the sequential and parallel solver, that allows for significant reduction of the solver execution time over a sequence of meshes generated by the self-adaptive hp-FEM. The solver reutilized partial LU factorizations computed in previous iterations over unrefined parts of the mesh.
974
M. Paszynski and R. Schaefer
Every local h refinements requires a sequential recomputation of all LU factorization on a path from the refined leaf up to the root of the elimination tree. The maximum number of processors that can be utilized by the parallel solver with reutilization is equal to the number of refined elements. Both, the sequential and parallel solver with reutilization is O Nr faster than the solver without the reutilization, where N is number of elements and r is number of refinements. Acknowledgments. We acknowledge the support of Polish MNiSW grant no. 3TO8B05529 and Foundation for Polish Science under Homming Programme.
References 1. Paszy´ nski, M., Pardo, D., Torres-Verdin, C., Demkowicz, L., Calo, V.: Multi-Level Direct Sub-structuring Multi-frontal Parallel Direct Solver for hp Finite Element Method. ICES Report 07-33 (2007) 2. Paszy´ nski, M., Pardo, D., Torres-Verdin, C., Matuszyk, P.: Efficient Sequential and Parallel Solvers for hp FEM. In: APCOM-EPMSC 2007, Kioto, Japan (2007) 3. Demkowicz, L.: Computing with hp-Adaptive Finite Elements, vol. I. Chapman & Hall/Crc Applied Mathematics & Nonlinear Science, New York (2006) 4. Demkowicz, L., Pardo, D., Paszy´ nski, M., Rachowicz, W., Zduneka, A.: Computing with hp-Adaptive Finite Elements, vol. II. Chapman & Hall/Crc Applied Mathematics & Nonlinear Science, New York (2007) 5. Paszy´ nski, M., Kurtz, J., Demkowicz, L.: Parallel Fully Automatic hp-Adaptive 2D Finite Element Package. Computer Methods in Applied Mechanics and Engineering 195(7-8), 711–741 (2007) 6. Paszy´ nski, M.: Parallelization Strategy for Self-Adaptive PDE Solvers. Fundamenta Informaticae (submitted, 2007) 7. Duff, I.S., Reid, J.K.: The Multifrontal Solution of Indefinite Sparse Symmetric Linear Systems. ACM Trans. on Math. Soft. 9, 302–325 (1983) 8. Giraud, L., Marocco, A., Rioual, J.-C.: Iterative Versus Direct Parallel Substructuring Methods in Semiconductor Device Modelling. Numerical Linear Algebra with Applications 12(1), 33–55 (2005) 9. Scott, J.A.: Parallel Frontal Solvers for Large Sparse Linear Systems. ACM Trans. on Math. Soft. 29(4), 395–417 (2003) 10. Milti-frontal Massively Parallel Sparse Direct Solver (MUMPS), http://graal.ens-lyon.fr/MUMPS/ 11. Pardo, D., Calo, V.M., Torres-Verdin, C., Nam, M.J.: Fourier Series Expansion in a Non-Orthogonal System of Coordinates for Simulation of 3D Borehole Resistivity Measurements; Part I: DC. ICES Report 07-20 (2007)
Linearized Initialization of the Newton Krylov Algorithm for Nonlinear Elliptic Problems Sanjay Kumar Khattri Stord/Haugesund University College, Bjørnsonsgt. 45 Haugesund 5528, Norway [email protected]
Abstract. It is known that the Newton Krylov algorithm may not always converge if the initial assumption or initialization is far from the exact solution. We present a technique for initializing Newton Krylov solver for nonlinear elliptic problems. In this technique, initial guess is generated by solving linearised equation corresponding to the nonlinear equation. Here, nonlinear part is replaced by the equivalent linear part. Effectiveness of the technique is presented through numerical examples.
1
Introduction
The past fifty to sixty years have seen generous improvement in solving linear systems. Krylov subspace methods are the result of the tremendous effort by the researchers during the last century. It is one among the ten best algorithms of the 20th century. There exists optimal linear solvers [16]. But, still there is no optimal nonlinear solver, or the one that we know of. Our research is in the field of optimal solution of nonlinear equations generated by the discretization of the nonlinear elliptic equations [15], [14], [13], [12]. Let us consider the following nonlinear elliptic partial differential equation [15] div(−K grad p) + f (p) = s(x, y) D
p(x, y) = p ˆ g(x, y) = (−K ∇p) · n
in Ω
(1)
on ∂ΩD on ∂ΩN
(2) (3)
Here, Ω is a polyhedral domain in Rd , the source function s(x, y) is assumed to be in L2 (Ω), and the medium property K is uniformly positive. In the equations (2) and (3), ∂ΩD and ∂ΩN represent Dirichlet and Neumann part of the boundary, respectively. f (p) represents nonlinear part of the equation. p is the unknown function. The equations (1), (2) and (3) models a wide variety of processes with practical applications. For example, pattern formation in biology, viscous fluid flow phenomena, chemical reactions, biomolecule electrostatics and crystal growth [9], [5], [6], [7], [8], [10]. There are various methods for discretizing the equations (1), (2) and (3). To mention a few: Finite Volume, Finite Element and Finite Difference methods [12]. These methods convert nonlinear partial differential equations into a system M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 975–982, 2008. c Springer-Verlag Berlin Heidelberg 2008
976
S.K. Khattri
of algebraic equations. We are using the Newton Krylov algorithm for solving the discrete nonlinear system of equations formed by the Finite Volume method [15]. Since, initial guess or initialization is very important for the convergence of the Newton’s algorithm. Thus, for starting the Newton Krylov algorithm, we are solving the corresponding linearised equation, and use this solution as the initial guess for the Newton Krylov algorithm. The corresponding linearized equations to the nonlinear equaion (1) is div(−K grad p)+ f (p) = s. Here, f (p) is the linear representation of the nonlinear part f (p).
2
Newton Krylov Algorithm
For formulating Newton algorithm, equation (1) is discretized in the residual form [15] div(−K grad p) + f (p) − s = 0. Let the discretization of the nonlinear partial differential equations result in a system of nonlinear algebraic equations A(p) = 0. Each cell in the mesh produces a nonlinear algebraic equation [15], [12]. Thus, discretization of the equations (1), (2) and (3) on a mesh with n cells result in n nonlinear equations, and let these equations are given as ⎛ ⎞ A1 (p) ⎜ A2 (p) ⎟ ⎜ ⎟ (4) A(p) = ⎜ . ⎟ . ⎝ .. ⎠ An (p) We are interested in finding the vector p which makes the operator A vanish. The Taylors expansion of nonlinear operator A(p) around some initial guess p0 is A(p) = A(p0 ) + J(p0 ) Δp + hot, (5) where hot stands for higher order terms. That is, terms involving higher than the first power of Δp. Here, difference vector Δp = p − p0 . The Jacobian J is a n × n linear system evaluated at the p0 . The Jacobian J in the equation (5) is given as follows ⎞ ⎛ ∂A1 ∂A1 ∂A1 ··· ⎜ ∂p1 ∂p2 ∂pn ⎟ ⎟ ⎜ ⎜ ∂A2 ∂A2 · · · ∂A2 ⎟ ⎟ ⎜ ∂Ai ∂p1 ∂p2 ∂pn ⎟ =⎜ J= ⎟ ⎜ . . . ∂pj ⎜ .. .. . . . .. ⎟ ⎟ ⎜ ⎝ ∂An ∂An ∂An ⎠ ··· ∂p1 ∂p2 ∂pn Since, we are interested in the zeroth of the non-linear vector function A(p). Thus, setting the equation (5) equals to zero and neglecting higher order terms will result in the following well known Newton Iteration Method
Linearized Initialization of the Newton Krylov Algorithm
J(pk ) Δpk = −A(pk ), pk+1 = pk + Δpk+1 ,
k = 0, . . . , n.
977
(6)
The linear system (6) is solved by the Conjugate Gradient algorithm [16]. The pseudo code is presented in the Algorithm 1. The presented algorithm have been implemented in the C++ language. Three stopping criteria are used in the Algorithm 1. The first criterion is the number of iterations. Second and third criteria are based on the residual vector, A(p) and difference vector Δpk . If the method is convergent, L2 norm of the difference vector, Δp, and the residual vector, A(p), converge to zero [see 11]. We are reporting convergence of both of these vectors. For better understanding the error reducing property of the method, we report variation of A(pk )L2 /A(p0 )L2 and Δ(pk )L2 /Δ(p0 )L2 with iterations (k). Algorithm 1. Newton Krylov algorithm. 1 2 3 4 5 6 7 8 9
Mesh the domain; Form the non-linear system, A(p); Find initial guess p0 ; Set the counter k = 0 ; while k ≤ maxiter or Δpk L2 ≤ tol or A(pk )L2 ≤ tol do Solve the discrete system J(pk )Δpk = −A(pk ); pk+1 = pk + Δpk ; k ++ ; end
Our research work is focus on the initialization step of the above algorithm. Initialization (step three of the Algorithm 1) is a very important part of the Newton Krylov algorithm.
3 3.1
Numerical Work Example 1
Without loss of generality let us assume that K is unity, and the boundary is of Dirichlet type. Let f (p) be γ exp(p). Thus, the equations (1), (2) and (3) are written as −∇2 p + γ exp(p) = f p(x, y) = p
D
in Ω,
(7)
on ∂ΩD .
(8)
Here, γ is a scalar. Let γ be 100. For computing the true error and convergence behavior of the methods, let us further assume that the exact solution of the equations (7) and (8) is the following bubble function
978
S.K. Khattri
0.07 0.06 0.05 0.04 0.03 0.02 0.01 0 1 0.8 0.6 0.4 0.2 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
Fig. 1. Surface plot of the exact solution of example 3.1
p = x (x − 1) y (y − 1). Let our domain be a unit square. Thus, Ω = [0, 1] × [0, 1]. Figure 1 displays the surface plot of the exact solution. We are discretizing equations (7) and (8) on a 40 × 40 mesh by the method of Finite Volumes [11], [12], [13], [15]. Discretization results in a nonlinear algebraic vector (4) with 1600 nonlinear equations. For making initial guess, we are using two approaches. In the first tradtional approach, we make a random initialization. The second approach is based on the linearization of the nonlinear part. Let us now form a linear approximation to the nonlinear part through Taylor series expansion. The Taylor series expansion of the nonlinear part (exponential funciton) is given as ep =
∞
pi i=0
i
,
=1+p+
p3 p2 + + ···. 2 3
From the above expansion, the linear approximation of ep is (1 + p). For forming a corresponding linearized equation to the nonlinear equation (7), we replace, ep by (1 + p). Thus, for finding an initial guess for the Newton algorithm, we are solving the following corresponding linearised equation −∇2 p + γ (1 + p) = f. The Newton iteration for both of these initial guesses are reported in the Fig. 2(a). Figure 2(a) presents the convergence of the residual vector, while Fig. 2(b) presents the convergence of the difference vector for first eight
Linearized Initialization of the Newton Krylov Algorithm
979
0
10
Random Initialization Linearized Initialization −2
10
−4
||A(pk)||L2/||A(p0)||L2
10
−6
10
−8
10
−10
10
−12
10
0
1
2
3
4 Iterations [ k ]
5
6
7
8
(a) Newton iteration vs A(pk )L2 for two different initialization. 0
10
Random Initialization Linearized Initialization −2
10
−4
||Δpk||L2/||Δp0||L2
10
−6
10
−8
10
−10
10
−12
10
0
1
2
3
4 Iterations [ k ]
5
6
7
8
(b) Newton Iteration vs Δ(pk )L2 for different initialization. Fig. 2. Example 3.1
iterations. We are solving the Jacobian system by the ILU preconditioned Conjugate Gradient with a tolerance of 1 × 10−10 . It is clear from the Figs. 2(a) and 2(b) that solving the corresponding linearized equation for the initial guess can make a big difference. With random initialization, the residual after five iterations is about 1/100 of the initial residual. While with linearized initialization, the residual after five iteration is about 1/1012 of the initial residual. It is interesting to note in the Fig. 2(b), with random initialization the Newton Krylov algortithm is not converging in the L2 norm of the difference vector. On the other hand, with a linearized initialization the Newton Krylov algorithm is still reducing the error in difference vector by 1/1012 of the initial error.
980
S.K. Khattri
3.2
Example 2
Let us solve the following equations −∇2 p + ξ sinh(exp(p)) = f p(x, y) = p
in Ω, D
(9)
on ∂ΩD .
(10)
Here, ξ is a scalar. We choose ξ to be 10. Let the exact solution be given as p = cosx + y cos3 x − y + cosx − y sinhx + 3 y + 5 e−(x
2
+y 2 )/8
Let our domain be a unit square. Thus, Ω = [0, 1] × [0, 1]. Figure 3 portrays the surface plot of the exact solution. For forming a corresponding linearized equation. The Taylor series expansion of sinh(exp(p)) around p = 0 is given as 1 1 1 1 + e+ sinhep = e − p 2 2e 2 2e 5e 1 1 + + e p2 + − p3 + . . . . 2 12e 12 The above series expansion is found through the Maple by using the command “taylor(sinh(exp(p)), p = 0, 5)”. From the above expansion, the linear approximation of sinhep is 1 1 1 1 e− e+ + p. 2 2e 2 2e
7 6.5 6 5.5 5 4.5 4 3.5 3 2.5 1 0.8 0.6 0.4 0.2 0
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
Fig. 3. Surface plot of the exact solution of example 3.2
0.8
0.9
1
Linearized Initialization of the Newton Krylov Algorithm
981
For forming a corresponding linearized equation to the nonlinear equation (9), we replace, sinhep by (1/2 e − 1/2 e) + (1/2 e + 1/2 e) p. Thus, for finding an initial guess for the Newton algorithm, we are solving the following linearised equation 1 1 1 1 e− + e+ −∇2 p + ξ p = f. 2 2e 2 2e
4
Conclusions
Robust initialization of the Newton Krylov algorithm is very crucial for the convergence. Initialization plays very important role in the convergence of the Newton Krylov algorithm. We presented a technique for forming the initial guess. Numerical work shows that initializing the Newton Krylov algorithm through the solution of the corresponding linearized equation is computationally efficient.
Bibliography [1] Khattri, S.K.: Newton-Krylov Algorithm with Adaptive Error Correction For the Poisson-Boltzmann Equation. MATCH Commun. Math. Comput. Chem. 1, 197– 208 (2006) [2] Khattri, S.K., Hellevang, H., Fladmark, G.E., Kvamme, B.: Simulation of longterm fate of CO2 in the sand of Utsira. Journal of Porous Media (to be published) [3] Khattri, S.K.: Grid generation and adaptation by functionals. Computational and Applied Mathematics 26, 1–15 (2007) [4] Khattri, S.K.: Numerical Tools for Multicomponent, Multiphase, Reactive Processes: Flow of CO2 in Porous Media. PhD Thesis, The University of Bergen (2006) [5] Host, M., Kozack, R.E., Saied, F., Subramaniam, S.: Treatment of Electrostatic Effects in Proteins: Multigrid-based Newton Iterative Method for Solution of the Full Nonlinear Poisson-Boltzmann Equation. Proteins: Structure, Function, and Genetics 18, 231–245 (1994) [6] Holst, M., Kozack, R., Saied, F., Subramaniam, S.: Protein electrostatics: Rapid multigrid-based Newton algorithm for solution of the full nonlinear PoissonBoltzmann equation. J. of Bio. Struct. & Dyn. 11, 1437–1445 (1994) [7] Holst, M., Kozack, R., Saied, F., Subramaniam, S.: Multigrid-based Newton iterative method for solving the full Nonlinear Poisson-Boltzmann equation. Biophys. J 66, A130–A130 (1994) [8] Holst, M.: A robust and efficient numerical method for nonlinear protein modeling equations. Technical Report CRPC-94-9, Applied Mathematics and CRPC, California Institute of Technology (1994) [9] Holst, M., Saied, F.: Multigrid solution of the Poisson-Boltzmann equation. J. Comput. Chem. 14, 105–113 (1993) [10] M. Holst: MCLite: An Adaptive Multilevel Finite Element MATLAB Package for Scalar Nonlinear Elliptic Equations in the Plane. UCSD Technical report and guide to the MCLite software package. Available on line at, http://scicomp.ucsd.edu/∼ mholst/pubs/publications.html [11] Khattri, S.: Convergence of an Adaptive Newton Algorithm. Int. Journal of Math. Analysis 1, 279–284 (2007)
982
S.K. Khattri
[12] Khattri, S., Aavatsmark, I.: Numerical convergence on adaptive grids for control volume methods. The Journal of Numerical Methods for Partial Differential Equations 9999 (2007) [13] Khattri, S.: Analyzing Finite Volume for Single Phase Flow in Porous Media. Journal of Porous Media 10, 109–123 (2007) [14] Khattri, S., Fladmark, G.: Which Meshes Are Better Conditioned: Adaptive, Uniform, Locally Refined or Locally Adjusted? In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 102– 105. Springer, Heidelberg (2006) [15] S. Khattri, Nonlinear elliptic problems with the method of finite volumes. Differential Equations and Nonlinear Mechanics. Article ID 31797 (2006) [16] van der Vorst, H.A.: Iterative Krylov Methods for Large Linear Systems. Cambridge monographs on applied and computational mathematics. Cambridge University Press, New York (2003)
Analysis and Comparison of Reordering for Two Factorization Methods (LU and WZ) for Sparse Matrices Beata Bylina and Jaroslaw Bylina Department of Computer Science Institute of Mathematics Marie Curie-Sklodowska University Pl. M. Curie-Sklodowskiej 1, 20-031 Lublin, Poland [email protected], [email protected]
Abstract. The authors of the article make analysis and comparison of reordering for two factorizations of the sparse matrices – the traditional factorization into the matrices L and U as well as the factorization into matrices W and Z. The article compares these two factorizations regarding: the produced quantity of non-zero elements alias their susceptibility to a fill-in; the algorithms reorganizing matrix (for LU it will be the algorithm AMD but for WZ it will be a modification of the Markowitz algorithm); as well as the time of the algorithms. The paper also describes the results of a numerical experiment carried for different sparse matrices from Davis Collection.
1
Introduction
It is a very important issue for the numerical linear algebra to solve different linear systems of equations both when the matrix of coefficients is a dense one (that is including few non-zero elements) or when the matrix is sparse. In this paper we deal with a question of solving linear systems with a sparse matrix of coefficients by a factorization of the matrix. Solving sparse systems demands applying direct or iterative methods. Both kinds of methods have their own merits and flaws. However, in this paper we only handle the direct methods based on Gaussian elimination. As far as the direct methods are concerned, they demand applying the coefficient matrix factorization into factors of two matrices, e.g. into LU, WZ or QR as well as into three factors, e.g. into LDLT . We will assume that A is a square (n×n), nonsingular and sparse matrix of not any particular structure. Usually during a factorization of a sparse matrix, matrices which come into existence have far more non-zero elements comparing to the primary matrix. During the matrix A factorization into a product, one has to do with this fill-in problem – consisting in generating additional non-zero elements (except the ones which were non-zero in the matrix A). The fill-in causes a substantial increase in memory requirements and (what comes with that) a worsening M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 983–992, 2008. c Springer-Verlag Berlin Heidelberg 2008
984
B. Bylina and J. Bylina
of a solver performance. Some problems connected to the fill-in are: a reduction of the very fill-in (by some reordering or approximation) and forecasts of positions of non-zeros (for more efficient storing of the matrices’ elements). The fill-in is the reason for applying algorithms and data structures to reduce it and act in due time. A sparse factorization usually consists of two parts. The first part is a reorganization of the matrix and its analysis where a symbolic factorization is done, pointing in anticipation of the places where non-zero elements appear. The second part is a usual numerical sparse matrix factorization into factors. We can find some examples of this approach – as MUMPS [2] and SuperLU [13]. In [3] we can find analysis and comparison of two solvers mentioned above. In this paper we focus on the first part of the algorithm – that is the reordering. Reducing of non-zero elements quantity demands applying different permutations of rows and columns (known as reordering). The number of all possible permutations is n! (for an n × n matrix) and finding, which of them is the best one, belongs to the class of NP-complete problems. For structured matrices (like symmetric ones) we can use the Minimum Degree Algorithm [16] or the Nested Dissection [15]. Of course, we do not always know the structure of the matrix so there are heuristic algorithms which reorganize the matrix. Some of them include the Markowitz scheme [14] and the Markowitz scheme with threshold pivoting (for stability) [10]. In papers [4] and [5] some other modifications of the Markowitz scheme are considered. The article considers reordering for the LU and WZ factorizations [9,10,16,18] for a sparse square (n × n) matrix of not any particular structure. The article describes and examines a matrix transformation leading to a reduction of non-zero elements in the output matrices L, U by applying the AMD (Approximate Minimum Degree) algorithm [1] as well as in the output matrices W, Z by applying the modified Markowitz algorithm (for the WZ factorization) given by the authors. The aim of the paper is to compare the algorithms in their effectiveness of the fill-in reduction. The performance time of the modified Markowitz algorithm is also considered. The reasons for choosing AMD is its popularity, accessibility and wide application. The rest of the paper is organized as follows. Section 2 presents the WZ factorization. Section 3 presents the modifications of the Markowitz scheme for the WZ factorization which ensures the growth of the matrices W and Z sparsity and also a factorization stability. Section 4 describes an environment used to numerical experiments conducted for plenty of matrices from Davis Collection and we also present the results of the examination. We will make an analysis, how many non-zero elements we will find in the matrices L + U and W + Z, and also how the AMD algorithm and the modified Markowitz algorithm influence the number of non-zero elements as well as the time of algorithms performance. In this article we mark the well-known numerical algorithm of the LU factorization simply by LU. The numerical algorithm LU with reordering [1] we mark by AMD – in the same way as it is marked in the literature.
Analysis and Comparison of Reordering for Two Factorization Methods
2
985
WZ Factorization
The WZ factorization was proposed by Evans and Hatzopoulos [12] as the factorization compatible to SIMD computers. SIMD according to Flynn classification means a Single Instruction stream and a Multiple Data stream, so the SIMD architecture is characterized by multiplexing of processing units. The papers [6,7,11,17] develop and examine the modifications of the WZ factorization method and consider its parallel implementations. Let A be a nonsingular matrix. The WZ factorization causes a division of the matrix A into W and Z factors (so that A = WZ) assuming forms which can be described like follows: (for an even n): ⎤ ⎡ 1 0 ⎢ w21 1 0 w2n ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· 1 0 ··· ··· ··· ··· ⎥ ⎥ (1) W=⎢ ⎢ ··· ··· ··· ··· 0 1 ··· ··· ··· ··· ⎥, ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎢ ··· ··· ··· ··· ··· ··· ⎥ ⎥ ⎢ ⎣ wn−1,1 0 1 wn−1,n ⎦ 0 1 ⎡ ⎤ z11 · · · · · · · · · · · · · · · · · · · · · · · · z1,n ⎢ ⎥ z22 · · · · · · · · · · · · · · · · · · z2,n ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ··· ··· ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ⎢ ⎥ ⎢ ⎥ zpp zpq ⎥, Z=⎢ (2) ⎢ ⎥ zqp zqq ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ⎢ ⎥ ⎢ ⎥ ··· ··· ··· ··· ··· ··· ⎢ ⎥ ⎣ ⎦ zn−1,2 · · · · · · · · · · · · · · · · · · zn−1,n zn1 · · · · · · · · · · · · · · · · · · · · · · · · zn,n where m = (n − 1)/2,
p = (n + 1)/2,
An example for an odd n (n = 5): ⎡ ⎤ 1 0 0 0 0 ⎢ w21 1 0 0 w25 ⎥ ⎢ ⎥ ⎥ W=⎢ ⎢ w31 w32 1 w34 w35 ⎥ , ⎣ w41 0 0 1 w45 ⎦ 0 0 0 0 1
⎡
z11 ⎢ 0 ⎢ Z=⎢ ⎢ 0 ⎣ 0 z51
q = (n + 1)/2.
(3)
⎤ z15 0 ⎥ ⎥ 0 ⎥ ⎥. 0 ⎦ z55
(4)
z12 z22 0 z42 z52
z13 z23 z33 z43 z53
z14 z24 0 z44 z54
See also Fig. 1 and Fig. 2. The numerical algorithm of the WZ factorization in this article is marked simply by WZ.
986
B. Bylina and J. Bylina
Fig. 1. The form of the output matrices in the WZ factorization (left: W; right: Z)
Fig. 2. The kth step of the WZ factorization (actually, of the transformation of the matrix A into Z); here k2 = n − k + 1
3
Modification of Markowitz Scheme for WZ Factorization
The original Markowitz scheme was first presented in [14]. It consists in a special way of the pivoting – not regarding to the value of pivot element but to the quantity of non-zero elements in rows and columns left to process. The row having the fewest non-zeros is chosen to be swapped with the current row and similarly columns are swapped. Thus, the number of newly generated non-zeros
Analysis and Comparison of Reordering for Two Factorization Methods
987
(that is the amount of the fill-in) can be reduced significantly. Unfortunately, such an algorithm can lead to a zero pivot and hence make the factorization fail. There are modifications of the Markowitz scheme which ensure success of the factorization (as in [4,5,10]). Here we show a modified Markowitz scheme version for the WZ factorization. Let A(k) be the matrix obtained from the kth step of the WZ factorization with (k) the size (n − 2k + 2) × (n − 2k + 2) (as in Fig. 2), let ri be the number of (k) non-zero values in the ith row of the matrix A . We choose i1 = arg
min
(k)
i∈{k,...,k2 }
ri
(5)
and i2 = arg
(k)
min
i∈{k,...,k2 }\{i1 }
ri .
(6)
Then we swap the kth row with the i1 st row and the k2 nd row with the i2 nd row. (We consider only rows, because in the WZ factorization there would be much more comparisons if we considered also columns because of two pivot rows [instead of only one in LU] and two pivot columns [instead of only one in LU]). Of course, such swapping can lead to the situation where the determinant (k) (k)
(k)
(k)
d = akk ak2 k2 − ak2 k akk2
(7)
(which is the pivot by which we divide in the WZ factorization) will be zero – then the continuation of the factorization will not be possible. That is why we must additionally choose i1 and i2 in the way the determinant d will not equal zero (what is not shown in the above paragraph). It means that in the modified Markowitz scheme (as in the original one) during each turn of completely external loop there is a need to make many comparisons to choose two rows including the smallest number of non-zero elements. The algorithm, which consists of the WZ factorization with our modification of the Markowitz algorithm, we mark as MWZ.
4
Numerical Experiment
Here we try to compare the performance of some algorithms and study the reordering influence on the number of non-zero elements. The algorithms’ implementation was done using C language. Data structures to store the matrices A, W, Z, L, U were two-dimensional arrays located in RAM. The numerical experiment was done using a Pentium IV 2.80 GHz computer with 1 GB RAM. The algorithms were tested in a GNU/Linux environment and the compilation was done using the compiler gcc with an optimization option -O3. Tests were done for matrices from Davis Collection [8]. The tests were done for a set of 40 sparse matrices from different applications. We have not managed to do the WZ factorization for 11 matrices – they were singular. For 14 matrices we needed the WZ factorization with the modified
988
B. Bylina and J. Bylina Table 1. Test matrices chosen from Davis Collection # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
matrix name lfat5 e bcsstko1 nos4 olm100 rdb2001 orsirr 1 comsol rdb2048 ex29 rdb3200 rdb5000 uym5940 raefsky5 fp pd
matrix size 14 48 100 100 2001 1030 1500 2048 2870 3200 5000 5940 6316 7548 8081
number of non-zeros 30 224 347 396 1120 6858 97645 12032 23754 18880 2960 85842 167178 884222 13036
is it symmetric? no yes yes no no no no no no no no no no no no
Table 2. The comparison of non-zero elements quantity for the algorithms LU, WZ, MWZ, AMD # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
LU 44 272 447 639 7674 145528 1176657 258298 217840 505914 990394 2991163 212829 39967153 23526
WZ 44 272 447 639 7368 207125 1101656 254516 131951 274908 980888 2673569 226487 53861147 23818
MWZ 44 272 447 545 4730 86392 934350 114862 120198 216135 409015 1045803 227613 20092097 23088
AMD 52 930 1164 494 3730 50374 213582 82234 127970 150256 82234 656730 226632 2875348 20599
Markowitz scheme (as a kind of pivoting) what enabled the numerical WZ factorization (with no pivoting such factorizations were impossible). Table 1 includes the set of the matrices where the WZ and MWZ algorithms were successfully applied. Table 2 includes information how many non-zero elements (nz) were created while doing the algorithms WZ, LU, AMD and MWZ. By using data from Davis Collection [8] we placed the number of elements for the matrices created by the algorithm AMD; the results for LU, WZ and MWZ are from the authors’ tests.
Analysis and Comparison of Reordering for Two Factorization Methods
989
Table 3. The comparison of the performance times for the algorithms LU, WZ, MWZ (times given in seconds) # 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
LU 0.01 0.01 0.01 0.01 0.01 2.58 7.73 19.14 51.96 71.19 246.53 411.44 506.38 901.20 1135.50
WZ 0.02 0.03 0.05 0.06 0.10 1.44 4.30 10.41 28.95 37.53 143.66 237.18 286.07 503.96 591.44
MWZ 0.04 0.07 0.09 0.13 0.20 1.63 7.07 10.59 29.67 38.19 146.96 248.03 280.85 854.40 599.64
Table 3 presents time during which the algorithms WZ, LU and MWZ were being done. The quantities of non-zero elements and the performance times for chosen four matrices are also presented in Fig. 3 and Fig. 4. (They are scaled for every matrix to show the relative changes of the number of non-zeros and the performance time.) By comparing the algorithms LU and WZ we can notice that the number of non-zero elements generated by these two factorizations is approximately similar. It is possible to find matrices for which the WZ factorization generates fewer nonzero elements than the LU factorization, for example the matrix ex29. But we can find the matrices for which the LU factorization generates fewer non-zero elements, e.g. the matrix fp. For the tested matrices the algorithm WZ generates on the average 2% fewer non-zero elements than the algorithm LU. Applying the Markowitz scheme before the further WZ factorization caused a considerable decline of created non-zero elements number. Applying the Markowitz algorithm for the WZ factorization causes an increase of non-zero elements number for the only one matrix among all the tested matrices. For the rest ones, MWZ causes a decrease of non-zero elements number of average 25% comparing to the WZ algorithm. Applying the AMD algorithm for the tested matrices considerably reduced the quantity of non-zero elements of average 36%. We managed to find such matrices for which the WZ factorization as well as the MWZ factorization produce fewer non-zero elements than the AMD algorithm, e.g. the matrix ex29.
990
B. Bylina and J. Bylina
Fig. 3. Relative numbers of non-zeros in the four algorithms for four sample matrices
Fig. 4. Relative performance times of the three algorithms for four sample matrices
In the Markowitz scheme comparing to the algorithm which does not use any permutation, time for the tested matrices grows 17% on the average.. It is worth noticing that the time for LU is 50% longer than for WZ.
5
Conclusions
In this paper we have presented a detailed analysis and comparison of two reordering schemes. The first, called AMD, is used for the LU factorization; the second – MWZ – proposed by the authors, is used for the WZ factorization. The algorithms’ functioning was presented with some sparse matrices taken from concrete engineering applications. Our analysis is based on experiments with the use of a usual PC. The analysis addresses two aspects of the efficiency of the factorization: the role of the reordering step and the time needed for the factorization. We can summarize
Analysis and Comparison of Reordering for Two Factorization Methods
991
our observations as follows: there exist matrices for which MWZ (proposed by the authors) is worth using instead of AMD. Moreover, it appeared that the time of the WZ algorithm was on the average 50% shorter comparing to the LU algorithm. It results from the fact that loops in the WZ factorization are two times shorter what enables better use of modern processors architecture: threading (possibility to use parallel calculations) and the organization of the processor access to the memory (particularly an optimal use of the multilevel cache memory). Our future works would research problems of the influence of reordering on the results’ numerical accuracy. The other future issue is to name properties of the matrices for which using MWZ is better then using AMD. Acknowledgments. This work was partially supported within the project Metody i modele dla kontroli zatloczenia i oceny efektywno´sci mechanizm´ ow jako´sci uslug w Internecie nastepnej generacji (N517 025 31/2997). This work was also partially supported by Marie Curie-Sklodowska University in Lublin within the project R´ ownolegle algorytmy generacji i rozwiazywania mechanizm´ ow kontroli przecia˙zenia w protokole TCP modelowanych przy u˙zyciu la´ ncuch´ ow Markowa.
References 1. Amestoy, P., Davis, T.A., Duff, I.S.: Algorithm 837: AMD, An approximate minimum degree ordering algorithm. ACM Trans. Math. Soft. 23, 1129–1139 (1997) 2. Amestoy, P.R., Duff, I.S., L’Excellent, J.-I., Koster, J.: A full asynchronous multifrontal solver using distributed dynamic scheduling. SIAM J. Matr. Anal. Apl. 23(1), 15–41 (2001) 3. Amestoy, P.R., Duff, I.S., L’Excellent, J.-I., Li, X.S.: Analysis and Comparison of Two General Sparse Solvers for Distributed Memory Computers. ACM Trans. Math. Soft. 27(4), 388–421 (2001) 4. Amestoy, P., Li, X.S., Ng, E.G.: Diagonal Markowitz Scheme with Local Symmetrization. Report LBNL-53854 (2003); SIAM. J. Matr. Anal. Appl. 29, 228 (2007) 5. Amestoy, P., Pralet, S.: Unsymmetric Ordering Using a Constrained Markowitz Scheme. SIAM J. Matr. Anal. Appl.; Report LBNL-56861 (submitted, 2005) 6. Bylina, B., Bylina, J.: The Vectorized and Parallelized Solving of Markovian Models for Optical Networks. In: Bubak, M., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2004. LNCS, vol. 3037, pp. 578–581. Springer, Heidelberg (2004) 7. Chandra Sekhara Rao, S.: Existence and uniqueness of WZ factorization. Parall. Comp. 23, 1129–1139 (1997) 8. Davis, T.: University of Florida Sparse Matrix Collection. NA Digest 92(42) (1994), NA Digest 96(28) (1996), and NA Digest 97(23) (1997), http://www.cise.ufl.edu/research/sparse/matrices 9. Duff, I.S.: Combining direct and iterative methods for the solution of large systems in different application areas. Technical Report RAL-TR-2004-033 (2004) 10. Duff, I.S., Erisman, A.M., Reid, J.: Direct Methods for Sparse Matrices. Oxford University Press, New York (1986)
992
B. Bylina and J. Bylina
11. Evans, D.J., Barulli, M.: BSP linear solver for dense matrices. Parall. Comp. 24, 777–795 (1998) 12. Evans, D.J., Hatzopoulos, M.: The parallel solution of linear system. Int. J. Comp. Math. 7, 227–238 (1979) 13. Li, X.S., Demmel, J.W.: A scalable sparse direct solver using static pivoting. In: Proceedings of the Ninth SIAM Conference on Parallel Processing for Scientific Computing (1999) 14. Markowitz, H.M.: The elimination form of the inverse and its application to linear programming. Management Science 3, 255–269 (1957) 15. Reid, J., Duff, I.S., Erisman, A.M.: On George’s nested dissection method. SIAM J. Numer. Anal. 13, 686 (1976) 16. Tinney, W.F., Walker, J.W.: Direct solution of sparse network equations by optimally ordered triangular factorization. Proc. IEEE 55, 1801–1809 (1967) 17. Yalamov, P., Evans, D.J.: The WZ matrix factorization method. Parall. Comp. 21, 1111–1120 (1995) 18. Zlatev, Z.: On some pivotal strategies in Gaussian elimination by sparse technique. SIAM J. Numer. Anal. 17, 18–30 (1980)
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis Chuan-Liang Chen1, Yun-Chao Gong2, and Ying-Jie Tian3,∗ 1
Department of Computer Science, Beijing Normal University, Beijing 100875, China 2 Software Institute, Nanjing University, Nanjing, China 3 Research Centre on Fictitious Economy & Data Science, Chinese Academy of Sciences, 100080, Beijing, China [email protected], [email protected], [email protected]
Abstract. Kernel Canonical Correlation Analysis (KCCA) is a technique that can extract common features from a pair of multivariate data, which may assist in mining the ground truth hidden in the data. In this paper, a novel partitioning clustering method called KCK-means is proposed based on KCCA. We also show that KCK-means can not only be run on two-view data sets, but also it performs excellently on single-view data sets. KCK-means can deal with both binary-class and multi-class clustering tasks very well. Experiments with three evaluation metrics are also presented, the results of which reflect the promising performance of KCK-means. Keywords: Kernel Canonical Correlation Analysis, K-means clustering, Similarity Measure, Clustering Algorithm.
1 Introduction Clustering is one of the most commonly techniques which is widely applied to extract knowledge, especially when lacking any a priori information (e.g., statistical models) about the data. Generally, the problem of clustering deals with partitioning a data set consisting of n points embedded in m-dimensional space into k distinct set of clusters, such that the data points within the same cluster are more similar to each other than to data points in other clusters [3]. There are two main approaches of clustering algorithms, hierarchical (e.g., agglomerative methods) and partitional approaches (e.g., k-means, k-medoids, and EM). Most of these clustering algorithms are based on elementary distance properties of the instance space [4]. In some interesting application domains, instances are represented by attributes that can naturally be split into two subsets, either of which suffices for learning [5], such as web pages which can be classified based on their content as well as based on the anchor texts of inbound hyperlinks. Intuitively, there may be some projections in these two views which should have strong correlation with the ground truth. Kernel ∗
Corresponding author.
M. Bubak et al. (Eds.): ICCS 2008, Part I, LNCS 5101, pp. 995–1004, 2008. © Springer-Verlag Berlin Heidelberg 2008
996
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
Canonical Correlation Analysis (KCCA) is such a technique that can extract common features from a pair of multivariate data, which can be used as a statistical tool to identify the correlated projections between two views. Therefore, KCCA is expected to be used to measure the similarity between data points excellently. In this paper, we propose two algorithms based on KCCA which can improve the performances of traditional clustering algorithms—K-means, namely KCK-means for two-view data sets and single-view data sets that could not be split naturally. The results of experiments show that their performances are much better than those of the original algorithms. Our empirical study shows that these two algorithms can not only perform excellently on both two-view and single-view data, but also be able to extract better quality clusters than traditional algorithms. The remainder of this paper is organized as follows. We demonstrate KCCA and propose the algorithms in Sect. 2. Performance measures, experiment results and their analysis are presented in Sect. 3. Finally, Sect. 4 presents the main conclusions.
2 KCK-Means Method 2.1 Canonical Correlation Analysis Firstly, we briefly review Canonical Correlation Analysis (CCA), then its kernel extension—Kernel Canonical Correlation Analysis (KCCA). CCA is computationally an eigenvector problem. It attempts to find two sets of basis vectors, one for each view, such that the correlation between the projections of these two views into the basis vectors are maximized. Let X = {x1, x2, … , xl} and Y = {y1, y2, … , yl} denote two views, i.e. two attribute sets describing the data. CCA finds projection vectors wx and wy such that the correlation coefficient between wTx X and
wTy Y is maximized. That is [12], ⎛
⎞ ⎟ , ⎜ wT C w wT C w ⎟ x xx x y yy y ⎠ ⎝ wTx Cxy wy
ρ = arg max ⎜ wx , wy
(1)
⎧⎪ w Cxx wx = 1 , w.r.t ⎨ ⎪⎩ w C yy wy = 1 where Cxy is the between-sets covariance matrix of X and Y, Cxx and Cyy are respectively the within-sets covariance matrices of X and Y. The maximum canonical correlation is the maximum of ρ with respect to wx and wy. Assume that C yy is invertible, then T x T y
wy =
1
λ
C yy−1C yx wx ,
(2)
and C xy C yy−1C yx wx = λ 2 Cxx wx .
(3)
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis
997
By first solving for the generalized eigenvectors of Eq. 3, we can therefore obtain the sequence of wx ’s and then find the corresponding wy ’s using Eq. 2. However, in complex situations, CCA may not extract useful descriptors of the data because of its linearity. In order to identify nonlinearly correlated projections between the two views, kernel extensions of CCA (KCCA) can be used [12]. Kernel CCA offers an alternative solution by first projecting the data into a higher dimensional feature space, i.e. mapping xi and yi to φ ( xi ) and φ ( yi ) respectively (i = 1, 2, … , l). And then
φ ( xi ) and φ ( yi ) are treated as instances to run CCA routine. Let Sx = { (φ ( x1 ), φ ( x2 ),..., φ ( xl )) }and Sy = { (φ ( y1 ), φ ( y2 ),..., φ ( yl )) }. Then the directions wx and wy can be rewritten as the projection of the data onto the direction α and β ( α , β ∈ ℜl ): wx = S xα and wy = S y β . Let Kx = S xT S x and Ky= S Ty S y be the kernel matrices corresponding to the two views. Substituting into Eq. 1 we can obtain the new objective function
ρ = max α ,β
α T Kx K y β α T K x2α ⋅ β T K y2 β
.
(4)
α can be solved from ( K x + κ I ) −1 K y ( K y + κ I )−1 K xα = λ 2α ,
(5)
where κ is used for regularization. Then β can be obtained from
β=
1 ( K y + κ I ) −1 K xα . λ
(6)
Let Κx(xi, xj) = φx ( xi )φxT ( x j ) and Κy(yi, yj) = φ y ( yi )φ yT ( y j ) are the kernel functions of the two views. Then for any for any x* and y*, their projections can be obtained from P(x*)= Κx(xi, X) α and P(y*)= Κy(yi, Y) β respectively. A number of α and β (and corresponding λ) can be solved from Eq. 5 and Eq. 6. If the two views are conditionally independent given the class label, the most strongly correlated pair of projections should be in accordance with the ground-truth [9]. However, in real-world applications the conditional independence rarely holds, and therefore, information conveyed by the other pairs of correlated projections should not be omitted [9]. So far we have considered the kernel matrices as invertible, although in practice this may not be the case [20]. We use Partial Gram-Schmidt Orthogonolisation (PGSO) to approximate the kernel matrices such that we are able to re-represent the correlation with reduced dimensionality [12]. In PGSO algorithm, there is a precision parameter—η, which is used as a stopping criterion. For low-rank approximations, we need keep eigenvalues greater than η and the number of eigenvalues we need to consider is bounded by a constant that depends solely on the input distribution [20]. Since the dimensions of the projections rely on the N×M lower triangular matrix
998
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
output by PGSO which relies on this stopping criterion, we discuss the influence of η to our algorithm in Sect. 3. More detail about PGSO is described in [20]. 2.2 Two KCK-Means Algorithms
In our method, the similarity between data points is measured partly by the projections obtained by KCCA and extends the K-means algorithm. In [7], Balcan et al. showed that given appropriately strong PAC-learners on each view, an assumption of expansion on the underlying data distribution is sufficient for co-training to succeed, which implies that the stronger assumption of independence between the two views is not necessary, and the existence of sufficient views is sufficient. Similarly, the distance function fsim described below is also calculated based on the assumption that X and Y are sufficient to describe the data respectively, which is the same as the assumption of expansion about the co-training method. Actually, our method is intuitively derived from co-training [10]. Since the two views are sufficient to describe the data, both of them may be consist of some projections correlate with the ground truth. So we intend to measure the similarity between instances using information from two views of data. KCCA is an excellent tool that can carry out this task. Therefore, measuring by the use of KCCA may be a promising way of solving the problem of traditional distance measures. Let m denote the number of pairs of correlated projections that have been identified, then x* and y* can be projected into Pj(x*) and Pj(y*) (j = 1, 2, … ,m). Let fsim denote distance functions, which is L2-norm • in this paper. Of course, other similarity distance functions also could be. Based on the projections obtained by KCCA, a new similarity measure can be defined as follows, 2
f sim ( xi , x j ) = μ xi − x j
2
m
+ ∑ Pk ( xi ) − Pk ( x j )
2
,
(7)
k =1
where μ is a parameter which regulates the proportion of the distance between the original instances and the distance of their projections. Based on this similarity measure, we propose the first algorithm as follows. Input: Output:
X and Y, two views of a data set with n instances k, the number of clusters desired C1 and C2, two vectors containing the cluster indices of each point of X and Y.
Process: 1. Identify all pairs of correlated projections, obtaining α i , β i by solving Eqs. 5 and 6 on X and Y. 2. for i = 1, 2, …, l do Project xi and yi into m pairs projections and obtain P(xi) and P(yi). 3. Get the new data sets by unite X and P(X), Y and P(Y), i.e. Mx = X P(X), My = Y P(Y). Fig. 1. KCK-means Algorithm for two-view data sets
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis
999
Cluster Mx and My respectively as follows: 4. Randomly assign each instance of Mx (My) to one cluster of the k clusters. 5. Calculate the cluster means, i.e., calculate the mean value (both the original value and the projections’ value) of the instance of each cluster. 6. repeat 7. (re)assign each instances to the cluster to which the instance is the most similar by calculating Eq. 7. update the cluster means. 8. 9. until no change. Fig. 1. (continued)
However, two-view data sets are rare in real world, which is the cause that though co-training is a powerful paradigm, it is not widely applicable. In [6], it points out that if there is sufficient redundancy among the features, we are able to identify a fairly reasonable division of them, and then co-training algorithms may show similar advantages to those when they perform on the two-view data sets. Similarly, in this paper, we try to randomly split the single-view data set into two parts and treat them as the two views of the original data set to perform KCCA and then KCK-means. Input:
X , a single-view data set with n instances k, the number of clusters desired C, a vector containing the cluster indices of each point of X.
Output: Process: 1. Randomly spilt X into two views with the same attributes, X1 and X2. 2. Identify all pairs of correlated projections, obtaining D i , E i by solving Eqs. 5 and 6 on X1 and X2. 3. for i = 1, 2, …, l do Project x1, i and x2, i into m pairs projections and obtain P(x1, i) and P(x2, i). 4. Unite P(X1) and P(X2) into P(X), i.e. P(X) = P(X1)ĤP(X2). 5. Get the new data sets by unite X and P(X), i.e. Mx = XĤP(X). Cluster Mx: 6. Randomly assign each instance of Mx to one cluster of the k clusters. 7. Calculate the cluster means, i.e., calculate the mean value (both the original value and the projections’ value) of the instance of each cluster. 8. repeat 9. (re)assign each instances to the cluster to which the instance is the most similar by calculating Eq. 7. 10. update the cluster means. 11. until no change. Fig. 2. The KCK-means Algorithm for single-view data sets
3 Experiments and Analysis Two standard multi-view data sets are applied to evaluate the effectiveness of the first version of KCK-means. They are
1000
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
Course: The course data set has two views and contains 1,051 examples, each corresponding to a web page, which is described in [10]. 200 examples are used in this paper and there are 44 positive examples. Ads: The url and origurl data sets are derived from the ads data set which is described in [16] and has two categories. 300 examples are used in this paper, among which 42 examples are positive. In this paper, we construct a two-view dataset by using the url view and origurl view. In order to find out how well the second version of KCK-means performs on single-view data sets, we use three single-view data sets . F1
A3a: The a3a is a single-view data set derived from Adult Data Set of UCI, which is described in [11]. It has two categories and 122 features. 3,185 examples are used and there are 773 positive examples. W1a: The w1a is a single-view data set derived from web page dataset which is described in [9]. It has two categories and 300 sparse binary keyword attributes. 2,477 examples are used, among which 72 examples are positive. DNA: The DNA is a single-view data set which is described in [8]. It has three categories and 180 attributes. 2,000 examples are used, among which 464 examples are 1st class, 485 examples are 2nd class, and 1,051 examples are 3rd class. We use three performance measures, Pair-Precision, Intuitive-Precision and Mutual Information, to measure the quality of the clusters obtained by the KCK-means. Pair-Precision: The evaluation metric in [2] is used in our experiments. We evaluate a partition i.e. the correct partition using accuracy =
num(correct decisions ) . n(n − 1) / 2
Mutual Information: Though entropy and purity are suitable for measuring a single cluster’s quality, they are both biased to favor smaller clusters. Instead, we use a symmetric measure called Mutual Information to evaluate the overall performance. The Mutual Information is a measure of the additional information known about one when given another [1], that is MI ( A, B) = H ( A) + H ( B ) − H ( A, B) , where H(A) is the entropy of A and can be calculated by using n
H ( A) = −∑ p( xi ) log 2 ( p( xi )) . i =1
Intuitive-Precision: We choose the class label that share with most samples in a cluster as the class label. Then, the precision for each cluster A is defined as: P( A) =
1
1 max( {xi | label ( xi ) = C j } ) . A
On http://www.csie.ntu.edu.tw/~cjlin/libsvmtools/datasets, all these single-view data sets can be downloaded. H
H
KCK-Means: A Clustering Method Based on Kernel Canonical Correlation Analysis
1001
In order to avoid the possible bias from small clusters which have very high precision, the final precision is defined by the weighted sum of the precision for all clusters, as shown in the following equation G
Ak
k =1
N
P=∑
P ( Ak ) ,
where G is the number of categories (classes) and N is the total number of instances.
Fig. 3. Clustering results on two two-view data sets (course and ads, on the left column) and three single-view data sets (a3a, w1a and DNA, on the right column) using KCK-means comparing with two traditional clustering algorithms, K-means and Agglom (agglomerative hierarchical clustering) with three performance measures, P-Precision (Pair-Precision), IPrecision (Intuitive-Precision), and MI (Mutual Information)
The comparison among between KCK-means and K-means, agglomerative hierarchical clustering, are performed. In order to better reflect the performance of the three algorithms, for all experiments demonstrated below with the two partitioning algorithms, K-means and KCK-means, the diagrams are based on averaging over ten
1002
C.-L. Chen, Y.-C. Gong, and Y.-J. Tian
100%
100% 95%
Kmeans(View1) Kmeans(View2) Agglom(View1) Agglom(View2) KCK-means(View1) KCK-means(View2)
P-Precision
80%
90% 85%
P-Precision
90%
70%
Kmeans Agglom KCK-means
80% 75%
60%
70%
50%
65%
40%
100%
100%
95%
95%
90%
90%
85%
85%
1
9
8
0.
7
5 0.
0.
4 0.
6
3 0.
0.
2 0.
η
0.
1 0.
1
0. 9
0. 8
0. 7
0. 5
0. 6
0. 4
0. 3
0. 2
0. 1
60%
η
80% Kmeans(View1) Kmeans(View2) Agglom(View1) Agglom(View2) KCK-means(View1) KCK-means(View2)
75% 70% 65% 60%
I-Precision
I-Precision
Kmeans Agglom KCK-means
80% 75% 70% 65%
1
0. 9
0. 8
9 0.
0. 6
8 0.
η
0. 7
7 0.
0.9
0. 5
6 0.
0.5
0. 4
5 0.
1.0
0. 2
4 0.
0.6
0. 3
3 0.
0. 1
2 0.
1
1 0.
60%
η
0.8 Kmeans
Mutual Information
Mutual Information
0.4
0.7
Kmeans(View1) Kmeans(View2) Agglom(View1) Agglom(View2) KCK-means(View1) KCK-means(View2)
0.3 0.2 0.1
Agglom KCK-means
0.6 0.5 0.4
7
8
9
0.
0.
η
1
6
0.
5
1
0.
9 0.
0.
8 0.
3
7 0.
4
.6 η0
0.
5 0.
2
4 0.
0.
3 0.
1
2 0.
0.
0.3 1 0.
0.
0
Fig. 4. The influence of η on the performance of KCK-means on the two-view data set course and the single-view data set DNA, where η changes from 0.1 to 1.0, all of the three evaluation metrics, Pair-Precision, Intutitive-Precision and Mutual Information, are used
clustering runs to compensate for their randomized initialization. And that is also beneficial for measuring the performance of the second version of KCK-means on the single-view data sets for its randomly splitting these data sets. The performances of the three algorithms are showed in Fig. 3. In Fig. 3, the performances of KCK-means are much better than those of other two traditional clustering algorithms. On some data sets such as a3a, the Pair-Precision and Intuitive-Precision of the results of KCK-means are both almost 100%, but PairPrecision and Intuitive-Precision of the results of K-means and agglomerative hierarchical clustering are 59.74%, 75.73% and 58.87%, 75.73% respectively. KCKmeans also performs excellently on the multi-class data set—DNA and gets 85.03% Pair-Precision, for K-means and agglomerative hierarchical clustering 72.39% and 67.13% respectively. For other two evaluation metrics, KCK-means is also much better than those of the others’.
In our experiments, we also note that when the proportion parameter μ is set to be very small or even zero, the performance of KCK-means is the best, which means that with the projections obtained from KCCA the similarity between instances can already be measured well enough. In all experiments described in this paper, μ is set to 10^{-6}. In Sect. 2.1 we stated that there is a precision parameter (or stopping criterion) η in the PGSO algorithm, on which the dimension of the projections relies. Now we demonstrate its influence on the performance of KCK-means. In order to better measure this influence, we use two data sets, course and DNA, in the experiments described below. Because course is a two-view data set with two classes and DNA is a single-view data set with three classes, we can examine the behaviour of KCK-means on a two-view data set and a single-view data set simultaneously. The results, averaged over more than ten clustering runs, are shown in Fig. 4. In Fig. 4 we can see that, as η changes, the performance of KCK-means changes only a little. Furthermore, even considering this influence, the performance of KCK-means on both data sets is still much better than that of the other two clustering algorithms. However, in the experiments we find that when η is larger than some threshold, which depends on the given data set, the performance of KCK-means degrades considerably, becoming even worse than that of K-means and agglomerative hierarchical clustering. After careful observation, we find that in such situations the number of dimensions of the projections is always very small, sometimes even only one. Just as described in Sect. 2.1, in real-world applications the conditional independence rarely holds, and therefore the information conveyed by the other pairs of correlated projections should not be omitted [9]. Therefore, this performance degradation may be caused by the lack of information conveyed by the other projections.
4 Conclusion In this paper, we propose a novel partitioning method, KCK-means, based on KCCA and inspired by co-training. By using KCCA, which mines the ground truth hidden in the data, KCK-means measures the similarity between instances. Experiments are performed on two two-view data sets, course and ads, and three single-view data sets, a3a, w1a and DNA, using three performance measures: Pair-Precision, Intuitive-Precision and Mutual Information. The results show that KCK-means obtains clusters of much better quality than K-means and the agglomerative hierarchical clustering algorithm. However, we also observe that when the number of dimensions of the projections obtained from KCCA is very small, the performance of KCK-means degrades considerably, becoming even worse than that of the two traditional clustering algorithms. This shows that in real-world applications we need to consider the information conveyed by the other pairs of correlated projections obtained from KCCA, instead of only the strongest projection or very few strong projections. That is, the number of dimensions of the projections obtained from KCCA and then used in KCK-means must be large enough.
Acknowledgments. The research work described in this paper was supported by grants from the National Natural Science Foundation of China (Project No. 10601064, 70531040, 70621001).
References 1. Butte, A.J., Kohane, I.S.: Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. In: Pacific Symposium on Biocomputing, Hawaii, pp. 415–426 (2000) 2. Wagstaff, K., Claire, C.: Clustering with Instance-level Constraints. In: the 17th International Conference on Machine Learning, pp. 1103–1110. Morgan Kaufmann press, Stanford (2000) 3. Khan, S.S., Ahmadb, A.: Cluster center initialization algorithm for K-means clustering. Pattern Recognition Letters 25, 129–1302 (2004) 4. Kirsten, M., Wrobel, S.: Relational distance-based clustering. In: Page, D.L. (ed.) ILP 1998. LNCS, vol. 1446, pp. 261–270. Springer, Heidelberg (1998) 5. Bickel, S., Scheffer, T.: Multi-View Clustering. In: The 4th IEEE International Conference on Data Mining, pp. 19–26. IEEE press, Brighton (2004) 6. Nigam, K., Ghani, R.: Analyzing the effectiveness and applicability of co-training. In: the 9th international conference on Information and knowledge management, pp. 86–93. ACM press, McLean (2000) 7. Balcan, M.F., Blum, A., Yang, K.: Co-training and expansion: Towards bridging theory and practice. In: The 18th Annual Conference on Neural Information Processing Systems, pp. 89–96. MIT press, Vancouver (2005) 8. Hsu, C.W., Lin, C.J.: A comparison of methods for multi-class support vector machines. IEEE Transactions on Neural Networks 13, 415–425 (2002) 9. Zhou, Z.H., Zhan, D.C., Yang, Q.: Semi-supervised learning with very few labeled training examples. In: The 22nd AAAI Conference on Artificial Intelligence, pp. 675–680. AAAI press, Vancouver (2007) 10. Blum, A., Mitchell, T.: Combining labeled and unlabeled data with co-training. In: The Conference on Computational Learning Theory, pp. 92–100. Morgan Kaufmann press, Madison (1998) 11. Kohavi, R.: Scaling Up the Accuracy of Naive-Bayes Classifiers: a Decision-Tree Hybrid. In: The Second International Conference on Knowledge Discovery and Data Mining, pp. 202–207. AAAI press, Oregon (1996) 12. Hardoon, D.R., Szedmak, S., Shawe-Taylor, J.: Canonical correlation analysis; An overview with application to learning methods. Technical report, Department of Computer Science Royal Holloway, University of London (2003)
Application of the Variational Iteration Method for Inverse Stefan Problem with Neumann’s Boundary Condition Damian Slota Institute of Mathematics Silesian University of Technology Kaszubska 23, 44-100 Gliwice, Poland [email protected]
Abstract. In this paper, the possibility of application of the variational iteration method for solving the inverse Stefan problem with a Neumann boundary condition is presented. This problem consists in a calculation of temperature distribution as well as in the reconstruction of the function which describes the heat flux on the boundary, when the position of the moving interface is known. The validity of the approach is verified by comparing the results obtained with the analytical solution. Keywords: Inverse Stefan problem, Variational iteration method, Heat equation, Solidification.
1 Introduction
In this paper, the author is trying to solve the one-phase inverse design Stefan problem with a Neumann boundary condition. This problem consists in a calculation of temperature distribution as well as in the reconstruction of the function which describes the heat flux on the boundary, when the position of the moving interface is known. This paper applies the variational iteration method to the discussed problems. The variational iteration method was developed by Ji-Huan He [1, 2, 3, 4, 5] and is useful for solving a wide range of problems [1, 2, 3, 7, 5, 8, 9, 4, 6, 10, 11]. The application of the variational iteration method for direct and inverse Stefan problems with a Dirichlet boundary condition is considered in paper [12]. It is possible to find an exact analytical solution of the inverse Stefan problem only in a few simple cases. In other cases we are left with approximate solutions only [15, 17, 18, 16, 14, 13]. For example, in papers [14, 13], the authors used the Adomian decomposition method combined with optimization for an approximate solution of a one-phase inverse Stefan problem. However, in paper [17], the authors compare selected numerical methods to solve a one-dimensional, one-phase inverse Stefan problem.
Fig. 1. Domain of the problem
2 Problem Formulation
Let D = {(x, t); t ∈ [0, t*), x ∈ [0, ξ(t)]} be a domain in R² (Figure 1). On the boundary of this domain, three components are distributed:
Γ0 = {(x, 0); x ∈ [0, v]}, v = ξ(0),   (2.1)
Γ1 = {(0, t); t ∈ [0, t*)},   (2.2)
Γg = {(x, t); t ∈ [0, t*), x = ξ(t)},   (2.3)
where the initial and boundary conditions are given. In domain D, we consider the heat conduction equation:
α ∂²u(x, t)/∂x² = ∂u(x, t)/∂t,   (2.4)
with the initial condition on boundary Γ0:
u(x, 0) = ϕ(x),   (2.5)
the Neumann condition on boundary Γ1:
−k ∂u(0, t)/∂x = q(t),   (2.6)
and the condition of temperature continuity and the Stefan condition on the moving interface Γg:
u(ξ(t), t) = u*,   (2.7)
−k ∂u(x, t)/∂x |_{x=ξ(t)} = κ dξ(t)/dt,   (2.8)
where α is the thermal diffusivity, k is the thermal conductivity, κ is the latent heat of fusion per unit volume, u∗ is the phase change temperature, x = ξ(t) is
the function describing the position of the moving interface Γg , and u, t and x refer to temperature, time and spatial location, respectively. The discussed inverse Stefan problem consists in finding a function to describe the temperature distribution u(x, t) in domain D, and function q(t) describing the heat flux on the boundary Γ1 , which will satisfy equations (2.4)–(2.8). All other functions (ϕ(x), ξ(t)) and parameters (α, k, κ, u∗ ), are known.
3 Solution of the Problem
Using the variational iteration method we are able to solve the nonlinear equation:
L(u(z)) + N(u(z)) = f(z),   (3.1)
where L is the linear operator, N is the nonlinear operator, f is a known function and u is a sought function. At first, we construct a correction functional:
u_n(z) = u_{n−1}(z) + ∫₀^z λ [L(u_{n−1}(s)) + N(ũ_{n−1}(s)) − f(s)] ds,   (3.2)
where ũ_{n−1} is a restricted variation [1, 2, 3, 4], λ is a general Lagrange multiplier [19, 1, 2], which can be identified optimally by the variational theory [20, 1, 2, 3], and u_0(z) is an initial approximation. Next, we determine the general Lagrange multiplier and identify it as a function λ = λ(s). Finally, we obtain the iteration formula:
u_n(z) = u_{n−1}(z) + ∫₀^z λ(s) [L(u_{n−1}(s)) + N(u_{n−1}(s)) − f(s)] ds.   (3.3)
The correction functional for equation (2.4) can be expressed as follows:
u_n(x, t) = u_{n−1}(x, t) + ∫₀^x λ [∂²u_{n−1}(s, t)/∂s² − (1/α) ∂ũ_{n−1}(s, t)/∂t] ds.   (3.4)
From equation (3.4), the general Lagrange multiplier can be identified as follows:
λ(s) = s − x.   (3.5)
Hence, we obtain the following iteration formula:
u_n(x, t) = u_{n−1}(x, t) + ∫₀^x (s − x) [∂²u_{n−1}(s, t)/∂s² − (1/α) ∂u_{n−1}(s, t)/∂t] ds.   (3.6)
Next, we select an initial approximation in the form: u0 (x, t) = A + B x,
(3.7)
where A and B are parameters. For the determination of parameters A and B, we will use the Neumann boundary condition (2.6) and the condition of temperature
continuity (2.7). To this end, we require that the initial approximation u_0(x, t) fulfils the above conditions. The boundary condition (2.6) requires:
B = −(1/k) q(t),   (3.8)
whilst the condition (2.7) leads to the result:
A = u* + (1/k) ξ(t) q(t).   (3.9)
Hence, the initial approximation has the form:
u_0(x, t) = u* + (1/k) q(t) (ξ(t) − x).   (3.10)
Finally, we obtain the following iteration formula:
u_0(x, t) = u* + (1/k) q(t) (ξ(t) − x),   (3.11)
u_n(x, t) = u_{n−1}(x, t) + ∫₀^x (s − x) [∂²u_{n−1}(s, t)/∂s² − (1/α) ∂u_{n−1}(s, t)/∂t] ds,   n ≥ 1.   (3.12)
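As a small illustration of formulas (3.11)–(3.12), the following sympy sketch performs a single correction step. Purely for concreteness it uses the data of the example in Sect. 4 (α = 0.1, k = 1, u* = 1, ξ(t) = t/10) together with the exact flux q(t) = e^{t/10}; it is only a sketch, not the author's implementation.

import sympy as sp

x, t, s = sp.symbols('x t s')
alpha, k, u_star = sp.Rational(1, 10), 1, 1
xi, q = t / 10, sp.exp(t / 10)

u0 = u_star + q * (xi - x) / k                          # initial approximation (3.11)
res = sp.diff(u0, x, 2) - sp.diff(u0, t) / alpha        # residual of the heat equation (2.4)
u1 = sp.expand(u0 + sp.integrate((s - x) * res.subs(x, s), (s, 0, x)))  # one step of (3.12)
print(u1)   # u1 gains x**2 and x**3 terms that were missing from u0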
Because function u_n (3.6) depends on an unknown function q(t), we have derived this function in the form of a linear combination:
q(t) = Σ_{i=1}^{m} p_i ψ_i(t),   (3.13)
where p_i ∈ R and the basis functions ψ_i(t) are linearly independent. The coefficients p_i are selected so as to minimize the deviation of function u_n (3.6) from the initial condition (2.5) and the Stefan condition (2.8). Thus, we are looking for the minimum of the following functional:
J(p_1, …, p_m) = ∫₀^v [u_n(x, 0) − ϕ(x)]² dx + ∫₀^{t*} [k ∂u_n(ξ(t), t)/∂x + κ dξ(t)/dt]² dt.   (3.14)
After substituting equations (3.12) and (3.13) into functional J, differentiating it with respect to the coefficients p_i (i = 1, …, m) and equating the obtained derivatives to zero:
∂J(p_1, …, p_m)/∂p_i = 0,   i = 1, …, m,   (3.15)
a system of linear algebraic equations is obtained. In the course of solving this system, coefficients p_i are determined, and thereby, the approximated distributions of the heat flux q(t) on boundary Γ1 and temperature u_n(x, t) in domain D are obtained.
4 Example
The theoretical considerations introduced in the previous sections will be illustrated with an example, where the approximate solution will be compared with an exact solution. We consider an example of the inverse Stefan problem, in which: α = 0.1, k = 1, κ = 10, u* = 1, t* = 1/2 and
ϕ(x) = e^{−x},   ξ(t) = t/10.   (4.1)
Next, an exact solution of the inverse Stefan problem will be found by means of the following functions:
u(x, t) = e^{t/10−x},   (x, t) ∈ D,   (4.2)
q(t) = e^{t/10},   t ∈ [0, t*].   (4.3)
As basis functions we take:
ψ_i(t) = t^{i−1},   i = 1, …, m.   (4.4)
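The whole procedure of Sect. 3 can be imitated in a few lines of computer algebra. The sketch below is only an illustration under the stated assumptions, not the author's implementation: it uses m = 2 basis functions (4.4), performs one correction step (3.12) and minimizes the functional (3.14); note that for this example v = ξ(0) = 0, so only the Stefan-condition term of (3.14) contributes.

import sympy as sp

x, t, s = sp.symbols('x t s')
p1, p2 = sp.symbols('p1 p2')
alpha, k, kappa, u_star, t_star = sp.Rational(1, 10), 1, 10, 1, sp.Rational(1, 2)
xi = t / 10
q = p1 + p2 * t                      # linear combination (3.13) with m = 2

# initial approximation (3.11) and one correction step (3.12)
u0 = u_star + q * (xi - x) / k
res = sp.diff(u0, x, 2) - sp.diff(u0, t) / alpha
u1 = u0 + sp.integrate((s - x) * res.subs(x, s), (s, 0, x))

# functional (3.14); here v = xi(0) = 0, so only the Stefan-condition term remains
J = sp.integrate((k * sp.diff(u1, x).subs(x, xi) + kappa * sp.diff(xi, t))**2,
                 (t, 0, t_star))

# normal equations (3.15) and the reconstructed heat flux
sol = sp.solve([sp.diff(J, p1), sp.diff(J, p2)], [p1, p2])
print(sol, q.subs(sol))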
In Figures 2 and 3, we present an exact and reconstructed distribution of the heat flux on the boundary Γ1 for n = 1, m = 5 and for n = 2, m = 2. The left figure presents the exact (solid line) and the determined approximate position (dash line), whereas the right figure shows diagrams of the distribution of errors which occur when reconstructing the heat flux.
Fig. 2. Heat flux on boundary Γ1 (a) and error distribution in the reconstruction of this heat flux (b) for n = 1 and m = 5 (solid line – exact value qe , dash line – reconstructed value qr )
Figure 4 presents error distributions in the reconstruction of the phase change temperature (left figure) and error distributions in the reconstruction of the Stefan condition along the moving interface (right figure) for n = 1 and m = 5. The calculations were made for an accurate moving interface position and for a position disturbed with a pseudorandom error with a size of 1%, 2% and 5%. Table 1 presents values of the absolute error (δf ) and a percentage relative error
Fig. 3. Heat flux on boundary Γ1 (a) and error distribution in the reconstruction of this heat flux (b) for n = 2 and m = 2 (solid line – exact value qe , dash line – reconstructed value qr )
Fig. 4. Error distribution in the reconstruction of phase change temperature (a) and in the reconstruction of the Stefan condition (b)
(Δf) with which the heat flux on the boundary Γ1 (f = q) and the distribution of the temperature in domain D (f = u) were reconstructed for different perturbations. The values of the absolute errors are calculated from the formulas:
δq = [ (1/t*) ∫₀^{t*} (q_e(t) − q_r(t))² dt ]^{1/2},   (4.5)
δu = [ (1/|D|) ∫∫_D (u_e(x, t) − u_r(x, t))² dx dt ]^{1/2},   (4.6)
where q_e(t) is an exact value of function q(t), q_r(t) is a reconstructed value of function q(t), u_e(x, t) is an exact distribution of temperature in domain D and u_r(x, t) is a reconstructed distribution of temperature in this domain, and:
|D| = ∫∫_D 1 dx dt.   (4.7)
However, percentage relative errors are calculated from the formulas:
Δq = δq · [ (1/t*) ∫₀^{t*} q_e(t)² dt ]^{−1/2} · 100%,   (4.8)
Fig. 5. Error distribution in the reconstruction of heat flux for perturbation equal to 2% (a) and 5% (b) (qe – exact value, qr – reconstructed value)
Δu = δu · [ (1/|D|) ∫∫_D u_e(x, t)² dx dt ]^{−1/2} · 100%.   (4.9)
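The error measures are straightforward to evaluate numerically; for instance, a trapezoidal-rule approximation of (4.5) and (4.8) may look as follows (the "reconstructed" flux below is hypothetical and only serves to make the snippet self-contained).

import numpy as np

def trapez(y, x):
    return float(np.sum((y[1:] + y[:-1]) * np.diff(x)) / 2.0)

t_star = 0.5
t = np.linspace(0.0, t_star, 201)
q_e = np.exp(t / 10)          # exact flux (4.3)
q_r = 1.0 + t / 10            # hypothetical reconstruction, e.g. (3.13) with m = 2

delta_q = np.sqrt(trapez((q_e - q_r) ** 2, t) / t_star)            # (4.5)
Delta_q = delta_q / np.sqrt(trapez(q_e ** 2, t) / t_star) * 100.0  # (4.8)
print(delta_q, Delta_q)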
As shown in the results, the presented algorithm is stable in terms of the input data errors. Each time when the input data were burdened with errors, the error of the heat flux reconstruction did not exceed the initial error. Table 1. Values of errors in the reconstruction of heat flux and distribution of temperature (n = 2, m = 2, δ – absolute error, Δ – percentage relative error)
Per.   δq         Δq          δu         Δu
0%     0.001225   0.11944%    0.000785   0.07721%
1%     0.002957   0.28830%    0.000843   0.08292%
2%     0.008244   0.80389%    0.001065   0.10473%
5%     0.016487   1.60768%    0.001385   0.13620%

5 Conclusion
In this paper, the solution of one-phase inverse Stefan problems is presented. The problem consists in a calculation of the temperature distribution and of a function which describes the heat flux on the boundary, when the position of the moving interface is known. The proposed solution is based on the variational iteration method. The calculations show that this method is effective for solving the problems under consideration. The advantage of the proposed method compared with classical methods consists in obtaining the heat flux and temperature distribution in the form of continuous functions, instead of a discrete form. The method applied does not require discretization of the region, as in the case of classical methods based on the finite-difference method or the finite-element method. The proposed method produces a wholly satisfactory result already after a small number of iterations,
whereas the classical methods require a suitably dense lattice in order to achieve similar accuracy, which considerably extends the time of calculations.
References 1. He, J.-H.: Approximate analytical solution for seepage flow with fractional derivatives in porous media. Comput. Methods Appl. Mech. Engrg. 167, 57–68 (1998) 2. He, J.-H.: Approximate solution of nonlinear differential equations with convolution product nonlinearities. Comput. Methods Appl. Mech. Engrg. 167, 69–73 (1998) 3. He, J.-H.: Variational iteration method – a kind of non-linear analytical technique: some examples. Int. J. Non-Linear Mech. 34, 699–708 (1999) 4. He, J.-H.: Non-Perturbative Methods for Strongly Nonlinear Problems. Dissertation.de-Verlag im Internet GmbH, Berlin (2006) 5. He, J.-H.: Variational iteration method – Some recent results and new interpretations. J. Comput. Appl. Math. 207, 3–17 (2007) 6. Abdou, M.A., Soliman, A.A.: New applications of variational iteration method. Physica D 211, 1–8 (2005) 7. He, J.-H.: Variational iteration method for autonomous ordinary differential systems. Appl. Math. Comput. 114, 115–123 (2000) 8. He, J.-H., Liu, H.-M.: Variational approach to diffusion reaction in spherical porous catalyst. Chem. Eng. Technol. 27, 376–377 (2004) 9. He, J.-H., Wu, X.-H.: Construction of solitary solution and compacton-like solution by variational iteration method. Chaos, Solitions and Fractals 29, 108–113 (2006) 10. Momani, S., Abuasad, S.: Application of He’s variational iteration method to Helmholtz equation. Chaos, Solitions and Fractals 27, 1119–1123 (2006) 11. Momani, S., Abuasad, S., Odibat, Z.: Variational iteration method for solving nonlinear boundary value problems. Appl. Math. Comput. 183, 1351–1358 (2006) 12. Slota, D.: Direct and Inverse One-Phase Stefan Problem Solved by Variational Iteration Method. Comput. Math. Appl. 54, 1139–1146 (2007) 13. Grzymkowski, R., Slota, D.: One-phase inverse Stefan problems solved by Adomian decomposition method. Comput. Math. Appl. 51, 33–40 (2006) 14. Grzymkowski, R., Slota, D.: An application of the Adomian decomposition method for inverse Stefan problem with Neumann’s boundary condition. In: Sunderam, V.S., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2005. LNCS, vol. 3516, pp. 895–898. Springer, Heidelberg (2005) 15. Zabaras, N., Yuan, K.: Dynamic programming approach to the inverse Stefan design problem. Numer. Heat Transf. B 26, 97–104 (1994) 16. Grzymkowski, R., Slota, D.: Numerical method for multi-phase inverse Stefan design problems. Arch. Metall. Mater. 51, 161–172 (2006) 17. Liu, J., Guerrier, B.: A comparative study of domain embedding methods for regularized solutions of inverse Stefan problems. Int. J. Numer. Methods Engrg. 40, 3579–3600 (1997) 18. Slodiˇcka, M., De Schepper, H.: Determination of the heat-transfer coefficient during soldification of alloys. Comput. Methods Appl. Mech. Engrg. 194, 491–498 (2005) 19. Inokuti, M., Sekine, H., Mura, T.: General use Lagrange multiplier in non-linear mathematical physics. In: Nemat-Nasser, S. (ed.) Variational Method in the Mechanics of Solids, pp. 156–162. Pergamon Press, Oxford (1978) 20. Finlayson, B.A.: The Method of Weighted Residuals and Variational Principles. Academic Press, New York (1972)
Generalized Laplacian as Focus Measure Muhammad Riaz1, Seungjin Park2, Muhammad Bilal Ahmad1, Waqas Rasheed1, and Jongan Park1 1
School of Information & Communications Engineering, Chosun University, 501-759 South Korea 2 Dept of Biomedical Engineering, Chonnam National University Hospital, Kwangju, South Korea [email protected]
Abstract. Shape from focus (SFF) uses focus measure operator for depth measurement from a sequence of images. From the analysis of defocused image, it is observed that the focus measure operator should respond to high frequency variations of image intensity and produce maximum values when the image is perfectly focused. Therefore, an effective focus measure operator must be a high-pass filter. Laplacian is mostly used as focus measure operator in the previous SFF methods. In this paper, generalized Laplacian is used as focus measure operator for better 3D shape recovery of objects. Keywords: Shape from focus, SFF, Laplace filter, 3D shape recovery.
1 Introduction The well-known examples of passive techniques for 3D shape recovery from images include shape from focus (SFF). Shape From Focus (SFF) [1], [2] for 3D shape recovery is a search method which searches the camera parameters (lens position and/or focal length) that correspond to focusing the object. The basic idea of image focus is that the objects at different distances from a lens are focused at different distances. Fig. 1 shows the basic image formation geometry. In SFF, the cam-era parameter setting, where the blur circle radius R is zero is used to determine the distance of the object. In Fig. 1, if the image detector (ID) is placed exactly at a distance v, sharp image P’ of the point P is formed. Then the relationship between the object distance u, focal distance of the lens f, and the image distance v is given by the Gaussian lens law:
1/f = 1/u + 1/v   (1)
Once the best-focused camera parameter settings over every image point are determined, the 3D shape of the object can be easily computed. Note that a sensed image is in general quite different from the focused image of an object. The sensors
Fig. 1. Image formation of a 3D object
are usually planar image detectors such as CCD arrays; therefore, for curved objects only some parts of the image will be focused whereas other parts will be blurred. In SFF, an unknown object is moved with respect to the imaging system and a sequence of images that correspond to different levels of object focus is obtained. The basic idea of image focus is that the objects at different distances from a lens are focused at different distances. The change in the level of focus is obtained by changing either the lens position or the focal length of the lens in the camera. A focus measure is computed in the small image regions of each of the image frames in the image sequence. The value of the focus measure increases as the image sharpness or contrast increases and it attains the maximum for the sharpest focused image. Thus the sharpest focused image regions can be detected and extracted. This facilitates auto-focusing of small image regions by adjusting the camera parameters (lens position and/or focal length) so that the focus measure attains its maximum value for that image region. Also, such focused image regions can be synthesized to obtain a large image where all image regions are in focus. Further, the distance or depth of object surface patches that correspond to the small image regions can be obtained from the knowledge of the lens position and the focal length that result in the sharpest focused images of the surface patches. A lot of research has been done on image focus analysis to automatically focus the imaging system [6], [7] or to obtain sparse depth information from the observed scene [2], [3], [4], [8], [9]. Most previous research on Shape From Focus (SFF) concentrated on the developments and evaluations of different focus measures [1], [9]. From the analysis of the defocused image [1], it is shown that defocusing is a low-pass filtering process, and hence, the focus measure should respond to high frequency variations of image intensity and produce maximum values when the image is perfectly focused. Therefore, most of the focus measures in the literature [1], [9] somehow maximize the high frequency variations in the images. The common focus measures in the literature
are: maximization of high frequency energy in the power spectrum using the FFT, variance of image gray levels, L1-norm of the image gradient, L2-norm of the image gradient, L1-norm of second derivatives of the image, energy of the Laplacian, Modified Laplacian [2], histogram entropy of the image, histogram of local variance, Sum-Modulus-Difference, etc. There are other focus measures based on moments, wavelets, DCT and median filters. The traditional SFF (SFFTR) [2] uses the modified Laplacian as focus measure operator. There are spikes in the 3D shape recovery using the modified Laplacian. The Laplacian and modified Laplacian operators are fixed and are not suitable in every situation [5]. In this paper, we have used a generalized Laplacian as focus measure operator, which can be tuned for the best 3D shape results. This paper is organized as follows. Section 2 describes the image focus and defocus analysis and the traditional SFF method. Section 3 describes the generalized Laplacian, and simulation results are shown in Section 4.
2 Image Focus and Defocus Analysis If the image detector (CCD array) coincides with the image plane (see Fig. 1) a clear or focused image f(x,y) is sensed by the image detector. Note that a sensed image is in general quite different from the focused image of an object. The sensors are usually planar image detectors such as CCD arrays; therefore, for curved objects only some parts of the image will be focused whereas other parts will be blurred. The blurred image h(x,y) usually modeled by the PSF of the camera system. In a small image region if the imaged object surface is approximately a plane normal to the optics axis, then the PSF is the same for all points on the plane. The defocused image g(x,y) in the small image region on the image detector is given by the convolution of the focused image with the PSF of the camera system, as:
g ( x, y ) = h ( x, y ) ⊗ f ( x, y )
(2)
where the symbol ⊗ denotes convolution. Now we consider the defocusing process in the frequency domain (w1, w2). Let G(w1, w2), H(w1, w2) and F(w1, w2) be the Fourier Transforms of the functions g(x, y), h(x, y) and f(x, y), respectively. Then, we can express Eq. (2) in the frequency domain, using the fact that convolution in the spatial domain is multiplication in the frequency domain, as:
G ( w1 , w2 ) = H ( w1 , w2 ).F ( w1 , w2 )
(3)
The Gaussian PSF model is a very good model of the blur circle. So the PSF of the camera system can be given as:
h(x, y) = (1/(2πσ²)) exp(−(x² + y²)/(2σ²))   (4)
The spread parameter σ is proportional to the blur radius R in Fig. 1. The Fourier Transform of the PSF is the OTF of the camera system and is given as:
H(w1, w2) = exp(−(w1² + w2²) σ²/2)   (5)
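The low-pass character of (5) is easy to verify numerically. The following sketch (not from the paper) blurs a broadband test image with Gaussian PSFs of increasing σ and measures how the high-frequency part of the spectrum shrinks.

import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)
img = rng.standard_normal((256, 256))          # broadband test "image"

for sigma in (1.0, 2.0, 4.0):
    blurred = gaussian_filter(img, sigma)      # convolution with the PSF h(x, y)
    spec = np.abs(np.fft.fftshift(np.fft.fft2(blurred)))
    r = np.hypot(*np.meshgrid(np.arange(-128, 128), np.arange(-128, 128)))
    hf_ratio = spec[r > 64].sum() / spec.sum() # energy in the outer (high-frequency) band
    print(sigma, round(hf_ratio, 4))           # decreases as sigma (the blur) grows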
We note that low frequencies are passed unattenuated, while higher frequencies are reduced in amplitude, significantly so for frequencies above about 1/σ. Now σ is a measure of the size of the original PSF; therefore, the larger the blur, the lower the frequencies that are attenuated. This is an example of the inverse relationship between scale changes in the spatial domain and corresponding scale changes in the frequency domain. In fact the product R”ρ is constant, where R” is the blur radius in the spatial domain, and ρ is the radius in its transform. Hence, defocusing is a low-pass filtering process where the bandwidth decreases with increase in defocusing. A defocused image of an object can be obtained in three ways: by displacing the sensor with respect to the image plane, by moving the lens, or by moving the object with respect to the object plane. Moving the lens or sensor with respect to one another causes the following problems: (a) The magnification of the system varies, causing the image coordinates of focused points on the object to change. (b) The area on the sensor over which light energy is distributed varies, causing a variation in image brightness. However, object movement is easily realized in industrial and medical applications. This approach ensures that the points of the object are perfectly focused onto the image plane with the same magnification. In other words, as the object moves, the magnification of the imaging system can be assumed to be constant for image areas that are perfectly focused. To automatically measure the sharpness of focus in an image, we must formulate a metric or criterion of “sharpness”. The essential idea underlying practical measures of focus quality is to respond to high-frequency content in the image and, ideally, to produce maximum response when the image area is perfectly focused. From the analysis of the defocused image, it is shown that defocusing is a low-pass filtering, and hence, the focus measure should respond to high frequency variations of image intensity and produce maximum values when the image is perfectly focused. Therefore, most of the focus measures in the literature somehow maximize the high frequency variations in the images. Generally, the objective has been to find an operator that behaves in a stable and robust manner over a variety of images, including those of indoor and outdoor scenes. Such an approach is essential while developing automatically focusing systems that have to deal with general scenes. An interesting observation can be made regarding the application of focus measure operators. Equation (2) relates a defocused image g(x, y) to the focused image through the blurring function. Assume that a focus measure operator o(x, y) is applied by convolution to the defocused image g(x, y). The result is a new image r(x, y), expressed as:
r ( x, y ) = o( x, y ) ⊗ g ( x, y ) = o( x, y ) ⊗ (h( x, y ) ⊗ f ( x, y ))
(6)
Since convolution is linear and shift-invariant, we can rewrite the above expression as: r ( x, y ) = h( x, y ) ⊗ (o( x, y ) ⊗ f ( x, y ))
(7)
Therefore, applying a focus measure operator to a defocused image is equivalent to defocusing a new image obtained by convolving the focused image with the operator. The operator only selects the frequencies (high frequencies) in the focused image that will be attenuated due to defocusing. Since, defocusing is a low-pass filtering process, its effects on the image are more pronounced and detectable if the image has strong
high-frequency content. An effective focus measure operator, therefore, must high-pass filter the image. One technique for passing the high spatial frequencies is to determine its second derivative, such as the Laplacian, given as:
∇²I = ∂²I/∂x² + ∂²I/∂y²   (8)
The Laplacian masks for the 4-neighbourhood and the 8-neighbourhood are given in Fig. 2.

4-neighbourhood:        8-neighbourhood:
  0  -1   0               -1  -1  -1
 -1   4  -1               -1   8  -1
  0  -1   0               -1  -1  -1

Fig. 2. Laplacian masks
The Laplacian is computed for each pixel of the given image window and the criterion function can be stated as:
Σ_x Σ_y ∇²I(x, y)   for ∇²I(x, y) ≥ T   (9)
Nayar noted that in the case of the Laplacian the second derivatives in the x and y directions can have opposite signs and tend to cancel each other. He, therefore, proposed the modified Laplacian (ML) as:
∇²_M I = |∂²I/∂x²| + |∂²I/∂y²|   (10)
The discrete approximation to the Laplacian is usually a 3 x 3 operator. In order to accommodate for possible variations in the size of texture elements, Nayar computed the partial derivatives by using a variable spacing (step) between the pixels used to compute the derivatives. He proposed the discrete approximation of the ML as:
∇²_ML I(x, y) = |2 I(x, y) − I(x − step, y) − I(x + step, y)| + |2 I(x, y) − I(x, y − step) − I(x, y + step)|   (11)
Finally, the depth map or the focus measure at a point (x, y) was computed as the sum of ML values, in a small window around (x, y), that are greater than a threshold value T1:
F(x, y) = Σ_{i=x−N}^{x+N} Σ_{j=y−N}^{y+N} ∇²_ML I(i, j)   for ∇²_ML I(i, j) ≥ T1   (12)
The parameter N determines the window size used to compute the focus measure. Nayar referred to the above focus measure as the sum-modified-Laplacian (SML) or traditional SFF (SFFTR).
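A direct NumPy transcription of (11)–(12) may look as follows; the window half-size N, the spacing step and the threshold T1 follow the notation above, and the concrete default values are only illustrative (this is a sketch, not the authors' code).

import numpy as np

def modified_laplacian(img, step=1):
    ml = np.zeros_like(img, dtype=float)
    ml[step:-step, step:-step] = (
        np.abs(2 * img[step:-step, step:-step]
               - img[:-2*step, step:-step] - img[2*step:, step:-step])
        + np.abs(2 * img[step:-step, step:-step]
                 - img[step:-step, :-2*step] - img[step:-step, 2*step:]))
    return ml                                   # the ML of (11)

def sml(img, x, y, N=4, step=1, T1=7.0):
    # (x, y) must be at least N + step pixels away from the image border
    ml = modified_laplacian(img.astype(float), step)
    win = ml[x - N:x + N + 1, y - N:y + N + 1]
    return win[win >= T1].sum()                 # focus measure F(x, y) of (12)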
3 Generalized Laplacian as Focus Measure For a given camera, the optimally accurate focus measure may change from one object to the other depending on their focused images. Therefore, selecting the optimal focus measure from a given set involves computing all focus measures in the set. In applications where computation needs to be minimized by computing only one focus measure, it is recommended to use simple and accurate focus measure filter for all conditions [5]. Laplacian has some desirable properties such as simplicity, rotational symmetry, elimination of unnecessary in-formation, and retaining of necessary information. Modified Laplacian [2] takes the absolute values of the second derivatives in the Laplacian in order to avoid the cancellation of second derivatives in the horizontal and vertical directions that have opposite signs. In this paper, we tried to use tuned Laplacian [5] as focus measure operator. A 3x3 Laplacian (a) should be rotationally symmetric, and (b) should not respond to any DC component in image brightness. The structure of the Laplacian by considering the above conditions is shown in Fig. 3. The last condition is satisfied if the sum of all elements of the operator equals zero: a + 4b + 4c = 0
(13)
(a) general 3×3 kernel:        (b) tuned kernel (b = −1):
  c   b   c                      c    −1     c
  b   a   b                     −1  4(1−c)  −1
  c   b   c                      c    −1     c

Fig. 3. (a) The 3x3 Laplacian kernel (b) Tuned Laplacian kernel with c = 0.4, b = -1 (c) The Fourier Transform of (b) when c = 0 and (d) when c = 0.4
If b = -1, then a = 4(1-c). Now we have only one variable, c. The problem is now to find c such that the operator's response has sharp peaks. The frequency response of the Laplacian for c = 0 and for c = 0.4 is shown in Fig. 3 (c) and (d). From Fig. 3 (d), we see that the response of the tuned focus measure operator (c = 0.4) has much sharper peaks than the Laplacian (c = 0). The 4-neighbourhood kernel in Fig. 2 is obtained with c = 0, b = -1, and the 8-neighbourhood kernel in Fig. 2 is obtained with c = -1, b = -1.
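The construction of the tuned kernel is easily reproduced. The sketch below (an illustration, not the authors' code) builds the 3×3 kernel of Fig. 3 from the zero-DC condition a + 4b + 4c = 0 and checks that it does not respond to a constant image.

import numpy as np
from scipy.ndimage import convolve

def tuned_laplacian_kernel(c=0.4, b=-1.0):
    a = -4.0 * b - 4.0 * c        # zero response to a constant (DC) image
    return np.array([[c, b, c],
                     [b, a, b],
                     [c, b, c]])

k = tuned_laplacian_kernel()
print(k.sum())                                    # 0.0: no DC response
flat = np.full((16, 16), 7.0)
print(np.abs(convolve(flat, k)).max())            # ~0 on a constant image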
4 Simulation Results We analyze and compare the results of 3D shape recovery from image sequences using the SFFTR with modified Laplacian and generalized Laplacian. Experiments were conducted on three different types of objects to show the performance of the new operator. The first object is a simulated cone whose images were generated using camera simulation software. A sequence of 97 images of the simulated cone was generated corresponding to 97 lens positions. The size of each image was 360 x 360. The second object is a real cone whose images were taken using a CCD camera system. The real cone object was made of hard-board with black and white stripes drawn on the surface so that a dense texture of ring patterns is viewed in images. All image frames in the image sequences taken for experiments have 256 gray levels.
Fig. 4. Images of simulated cone at different lens steps: (a) at lens step 15, (b) at lens step 40, (c) at lens step 70
Fig. 5. Images of real cone at different lens steps: (a) at lens step 20, (b) at lens step 40, (c) at lens step 90
Figs. 4 and 5 show the image frames recorded at different lens positions controlled by the motor. In each of these frames, only one part of the image is focused, whereas the other parts are blurred to varying degrees. We apply the Modified Laplacian and the Generalized Laplacian as focus measure operators using the SFFTR method on the simulated and real cone images. The improvements in the results (Fig. 6) on the simulated cone are not very prominent except a slight sharpening of the peak. However, on the real cone, we see in Fig. 7 (a) that there are some erroneous peaks using the Modified Laplacian which are removed, as shown in Fig. 7 (b), using the generalized Laplacian.
Fig. 6. (a) 3D shape recovery of the Simulated cone using SFFTR with Modified Laplacian as Focus Measure Operator (b) with Tuned Laplacian as Focus Measure operator with b= -0.8, c = 0.45
Fig. 7. (a) 3D shape recovery of the Real cone using SFFTR with Modified Laplacian as Focus Measure Operator (b) with Tuned Laplacian as Focus Measure operator with b= -1, c = 0.4
5 Conclusions In this paper, we have proposed a generalized Laplacian method as focus measure operator for shape from focus. Some improvements in the 3D shape recovery results are obtained. It is also noticed through simulation that erroneous peaks can be reduced
by using the generalized (tuned) Laplacian, as discussed in the previous section. Further investigation is in progress on generalized focus measure operators instead of fixed operators.
Acknowledgement This research was supported by the second BK 21 program of the Korean Government.
References 1. Krotkov, E.: Focusing. International Journal of Computer Vision 1, 223–237 (1987) 2. Nayar, S.K., Nakagawa, Y.: Shape from focus. IEEE Transactions on Pattern Analysis and Machine Intelligence 16(8) (August 1994) 3. Subbarao, M., Choi, T.-S.: Accurate recovery of three dimensional shape from im-age focus. IEEE Transactions on Pattern Analysis and Machine Intelligence 17(3) (March 1995) 4. Nayar, S.K., Watanabe, M., Noguchi, M.: Real-time focus range sensor. In: Proc. of Intl. Conf. on Computer Vision, pp. 995–1001 (June 1995) 5. Subbarao, M., Tyan, J.K.: Selecting the Optimal Focus Measure for Autofocusing and Depth-from-Focus. IEEE Trans. Pattern Analysis and Machine Intelligence 20(8), 864–870 (1998) 6. Schlag, J.F., Sanderson, A.C., Neumann, C.P., Wimberly, F.C.: Implementation of Automatic Focusing Algorithms for a Computer Vision System with Camera Control. Carnegie Mel-lon University, CMU-RI-TR-83-14 (August 1983) 7. Tenenbaum, J.M.: Accommodation in Computer Vision. Ph.D. dissertation, Standford University (1970) 8. Hiura, S., Matsuyama, T.: Depth Measurement by the Multi-Focus Camera. In: Proc. IEEE Int. Conf. Computer Vision and Pattern Recognition, June 1998, pp. 953–959 (1998) 9. Jarvis, R.A.: A Perspective on Range Finding Techniques for Computer Vision. IEEE Trans. Pattern Analysis and Machine Intelligence 5(2) (March 1983)
Application of R-Functions Method and Parallel Computations to the Solution of 2D Elliptic Boundary Value Problems Marcin Detka and Czesław Cichoń Chair of Applied Computer Science, Kielce University of Technology, Al. Tysiąclecia Państwa Polskiego 7, 25-314 Kielce, Poland {Marcin.Detka,Czeslaw.Cichon}@tu.kielce.pl
Abstract. In the paper, the R-function theory developed by Rvachew is applied to solve 2D elliptic boundary value problems. Unlike the well-established FEM or BEM method, this method requires dividing the solution into two parts. In the first part, the boundary conditions are satisfied exactly and in the second part, the differential equation is satisfied in an approximate way. In such a way, it is possible to formulate in one algorithm the so-called general structural solution of a boundary-value problem and use it for an arbitrary domain and arbitrary boundary conditions. The usefulness of the proposed computational method is verified using the example of the solution of the Laplace equation with mixed boundary conditions. Keywords: structural solution, R-functions, parallel computations.
1 Introduction Mathematical models of engineering problems are often defined as boundary-value problems involving partial-differential equations. For the description of such problems it is required to have analytical information connected with the equation itself (or a set of equations) and geometrical information necessary to define boundary conditions. This information concerns the solution domain, shapes of particular parts of the boundary, distribution and forms of the imposed constraints and the like. It is accounted for in a different way in various solution methods. In the paper, such problems are solved in a uniform way using the R-function theory, developed by Rvachew et al. [3]. In this theory, the so-called structural solutions are constructed with the use of elaborated tools of the analytical geometry. As a result, the structural solution exactly satisfying the boundary conditions contains some unknown parameters that have to be computed. The paper is limited to elliptic problems in two dimensions. Such problems are still dealt with because of their well-known relation to many physical models. Furthermore, theoretical and numerical results obtained in this area are very useful in practice.
The discrete solution is determined using orthogonal, structured grid nodes over the Cartesian space C which contains the solution domain Ω . The unknown function of the problem is approximated by means of assuming a set of simple spline functions of the first order. The property of the support locality and density of these functions make it possible to compute, in an effective way, parameters of the structural solution, by redistributing the solution procedure into processors. In the algorithm of the parallel solution, the meshless method, proposed by Yagawa et al. [6], is applied. In this method, the resulting system of linear equations is constructed in a “row-by-row” fashion. The usefulness of the proposed method of computations is verified with the example of the solution of the Laplace equation with mixed boundary conditions.
2 Problem Statement and the Method of Solution Consider the linear operator equation of the form:
Au = f   in Ω ⊂ ℜ²,   (1)
where f ∈ L²(Ω). It is well known that when A is a linear positive-definite operator on a linear set D_A in a separable Hilbert space H, and f ∈ H, the generalized solution u of Eq. (1) is an element of the so-called energy space H_A that minimizes the functional [2]:
J(u) = (1/2) B(u, u) − (f, u)_H,   (2)
where B(u, u) = (Au, u)_H and (f, u)_H are bilinear and linear functionals, respectively. Because of the equivalence of Eqs. (1) and (2), in numerical computations it is preferred to solve Eq. (2) using the Ritz method. It is assumed that for the most general boundary conditions the solution can be represented in the structural form:
u = φ0 + ωϕ (φ1 ) ,
(3)
where ω is a known function that takes on zero values on the boundary ∂Ω and is positive in the interior of Ω . Functions φ0 and φ1 are chosen in such a way so as to satisfy all boundary conditions. The specification of the function
ϕ
depends on the
problem under consideration (see Section 4). It should be noted that functions φ0 and
φ1 can by specified in a piece-wise fashion with different values prescribed to them at each part of the boundary ∂Ω . The advantage of the solutions in the form of Eq. (3)
is that the function ω describes completely all the geometrical information of a particular boundary value problem. The equation ω = 0 defines the geometry of the domain implicitly. The functions ω are constructed using the theory of R-functions developed by Rvachev [3]. Finally, functions φ0 and φ1 can be expressed by only one function φ [5] that in the Ritz approximation is sought in the form:
φ_N = Σ_{j=1}^{N} c_j ψ_j,   (4)
where N is a positive integer, c_j are unknown parameters and {ψ_j} are some basis functions. The sole purpose of the function φ_N is to satisfy the analytical constraints of the boundary value problem. It means that the structure of Eq. (3) does not place any constraints on the choice of the functions ψ_j [4]. After integrating over the domain Ω, the functional J(u) becomes an ordinary function of the parameters c_1, c_2, …, c_N. Therefore, the condition δJ = 0 is
equivalent to the solution of the linear algebraic equation, characterized by the matrix equation:
Kc = F .
(5)
3 Parallel Procedure of Computations Basis functions {ψ_j} can be defined globally in the domain Ω, or locally with dense local supports. As regards the parallel solution of the problem, the local approach is preferable, therefore it is chosen in the paper. Let us define the Cartesian space C ⊂ ℜ² and assume that the solution domain is a subspace Ω ⊂ C, Fig. 1. Then, the space C is discretized using the regular mesh points of a structured grid. It is necessary to choose integers n and m, and define step sizes h and k by h = (x_b − x_a)/n and k = (y_b − y_a)/m, where points a(x_a, y_a) and b(x_b, y_b) are given a priori. For each point of the grid, a simple spline of the first order based on the six triangles containing the grid vertex j is defined, Fig. 2. The basis function ψ_j is composed of the six linear functions:
ψ j = {ψ 1j ,ψ 2j ,...,ψ 6j } ,
(6)
where functions ψ_j^k, k = 1, 2, ..., 6, have the following form in the local coordinate system (s1 = (x − x_j)/h, s2 = (y − y_j)/k):
ψ_j^1 = 1 − s1 − s2   if (s1, s2) ∈ T1,
ψ_j^2 = 1 − s2        if (s1, s2) ∈ T2,
ψ_j^3 = 1 + s1        if (s1, s2) ∈ T3,
ψ_j^4 = 1 + s1 + s2   if (s1, s2) ∈ T4,
ψ_j^5 = 1 + s2        if (s1, s2) ∈ T5,
ψ_j^6 = 1 − s1        if (s1, s2) ∈ T6,
ψ_j = 0 otherwise.   (7)
Fig. 1. Cartesian space C and the solution domain Ω
Fig. 2. Linear dashed basis function ψ_j
The algorithm of parallel computations is shown in Fig. 3. The main steps of the computations are as follows (a schematic sketch of the row-by-row assembly is given after the list):
1. Decomposition of the space C into Cp subdomains, p = 1, 2, …, P, where P is the number of processors.
2. Parallel identification of nodes in each subdomain Cp according to the rule shown in Fig. 4.
3. Parallel modification of subdomain Cp in order to balance the number of nodes in each processor.
4. Parallel supplement of the node set in the domain Cp with neighbouring nodes which are active in the solution of the problem.
5. For each node j, parallel computation of the elements Kjk and Fj, k = 1, 2, …, 7 (max), of the matrix equation (5).
6. Parallel solution of the matrix equation (5) by the conjugate gradient method using the Portable Extensible Toolkit for Scientific Computation (PETSc) library.
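As an illustration of steps 5 and 6, the following is a minimal petsc4py sketch of such a row-by-row assembly and conjugate-gradient solve. It is only a schematic reading of the algorithm above, not the authors' implementation; the helper row_entries(j) is a hypothetical stand-in for the numerical integration of Sect. 3 that returns the (at most 7) nonzero entries K_jk and the value F_j of node j.

from petsc4py import PETSc

def assemble_and_solve(n_nodes, my_nodes, row_entries, comm=PETSc.COMM_WORLD):
    K = PETSc.Mat().createAIJ([n_nodes, n_nodes], nnz=7, comm=comm)
    F = PETSc.Vec().createMPI(n_nodes, comm=comm)
    K.setUp()
    for j in my_nodes:                      # nodes owned by this processor
        cols, kvals, fval = row_entries(j)  # hypothetical element integration
        K.setValues([j], cols, kvals, addv=True)
        F.setValue(j, fval, addv=True)
    K.assemblyBegin(); K.assemblyEnd()
    F.assemblyBegin(); F.assemblyEnd()

    c = F.duplicate()
    ksp = PETSc.KSP().create(comm=comm)
    ksp.setOperators(K)
    ksp.setType('cg')                       # conjugate gradient, as in step 6
    ksp.solve(F, c)
    return c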
Fig. 3. The algorithm of parallel computations
Numerical integration is needed to calculate the matrix K and the vector F . Integration over triangles is performed with the use of 4-point Gaussian quadrature. For the case when the boundary ∂Ω crosses the triangle an additional procedure has been applied in order to divide the integration region into subregions. The rule states
Fig. 4. Decomposition of the solution domain (P=3), identification of nodes
that “the active subregion” is such a part of the triangle that belongs to the Ω domain and contains any of integration points. Next, integrals over the new triangle or quadrilateral subregions are also computed numerically. The above rule has also been applied to the identification of nodes in the subdomains.
4 Example The proposed solution method has been verified with a simple example taken from [1]. Consider the Laplace equation on the domain Ω , shown in Fig. 5
− ∇ 2u ( x, y ) = 0 in Ω ,
(8)
with the boundary conditions on ∂Ω:
∂u/∂y |_{∂Ω1} = 0,   ∂u/∂x |_{∂Ω3} = −2,   u(x, y)|_{∂Ω4} = 80,
u(x, y)|_{∂Ω2} = −4x⁴ + 33x² − 2x + 17.   (9)
The exact solution is equal to:
u(x, y) = 81 − 2x + x² − y².   (10)
The geometric domain Ω can be defined as a Boolean set combination of four primitives
Ω = Ω1 ∩ Ω2 ∩ Ω3 ∩ Ω4,   (11)
Fig. 5. Solution domain Ω of the Laplace equation (8)
defined as
Ω i = {( x, y ) ∈ ℜ 2 : ωi ( x, y ) ≥ 0}.
(12)
Functions ωi , normalized to the first order, have the form:
ω1 = y,   ω2 = (8 − 2x² − y)/√(16x² + 1),   ω3 = x,   ω4 = (√2/2)(y − 1 + x),   (13)
Ω can be expressed in the following way:
ω = ω1 ∧ 0 ω 2 ∧ 0 ω3 ∧ 0 ω 4 , where
(14)
∧ 0 is the R0 – conjunction.
After some manipulations, the structural form of the solution (3) takes the final form
u = g 01 − ωD1 ( g 01 ) + ωg11 − ωD1 (φg 02 ) + φg 02 , where D1 (•) = ∂ω ∂ (•) + ∂ω ∂ (•) and
∂x ∂x
∂y ∂y
(15)
Application of R-Functions Method and Parallel Computations
g 01 =
(−4 x 4 + 33 x 2 − 2 x + 17)ω134 + 80ω123 , ω 234 + ω134 + ω124 + ω123
g11 =
− 2ω1 , ω1 + ω3
ω234 + ω124 , g 02 = ω 234 + ω134 + ω124 + ω123
1029
(16)
where ωijk = ωi ω j ω k .The functional (2) takes the form
⎡⎛ ∂u ⎞ 2 ⎛ ∂u ⎞ 2 ⎤ J (u ) = ∫ ⎢⎜ ⎟ + ⎜⎜ ⎟⎟ ⎥dΩ + 4 ∫ u d∂Ω 3 . ∂x ∂y Ω⎢ ∂Ω 3 ⎣⎝ ⎠ ⎝ ⎠ ⎥⎦
(17)
The formulae for the calculation of the matrix K coefficients and the column vector F , Eq. (5), are given explicitly in [1]. Computations have been made for h=k=0.5, 0.2 and 0.1, which has led to the sets of the basis functions {ψ j }, j = 1,2,3,..., N , where N= 63, 307 and 1124. The quality of the solution has been verified calculating the absolute, relative and least square errors:
ε 1 = max | u exac − u approx | ,
(18)
u exac − u approx |, u exac
(19)
∑ (u exac − u approx ) 2 .
(20)
i
ε 2 = max | i
ε3 =
1 N
i
The results of computations are given in Table 1. It should be noted that the improvement in the calculation accuracy at higher mesh density is smaller than expected. Probably, the reason is that the basic functions ψ j are too simple. The last column in Table 1 presents the data given in [1], where the global approximations are assumed in the form of the third degree complete polynomial. It should be stressed that although the final solution is worse, it is obtained with notably less numerical effort. Table 1. Approximation errors
ε1 ε2 ε3
h=k=0.5 6.00
h=k=0.2 2.34
h=k=0.1 1.98
[1] 10.32
0.15
0.05
0,04
0.41
2.62
0.93
0.85
19.18
1030
M. Detka and C. Cichoń
The graphs of the u function for different vertical and horizontal cross-sections of the Ω domain are shown in Fig. 6.
Fig. 6. Graphs of the u function for the different cross-sections, +++ discretization k=h=0.5, △△△ discretization k=h=0.1,--- polynomial N=3 [1], exact solution
◇◇◇ discretization k=h=0.2,
Fig. 7. Speedup and parallel efficiency in the function of the number of processors (time of the parallel solving of the linear equations set has been omitted), theoretical ideal speedup, +++ discretization k=h=0.5, ◇◇◇ discretization k=h=0.2, △△△ discretization k=h=0.1
As expected, the assumption of simple linear basis functions yields quite satisfactory computational results for suitably dense mesh nodes. Some inaccuracies that occur in the interior of the solution domain probably result from approximate calculations of the function derivatives which appear in the formulae. In the program, these derivatives are calculated using the GNU Scientific Library (GSL). Fig. 7 shows how the speedup and parallel efficiency vary with the number of processors for various problem sizes. The presented algorithm has been parallelized using the Message Passing Interface (MPI, MPICH ver. 1.2.7p1) library functions and the GNU C Compiler (ver. 3.2). It has been tested on a 9-node cluster with two Intel Xeon 2.4 GHz processors and 1 GB of RAM per node. The nodes were connected by Gigabit Ethernet.
5 Conclusions In the paper, the so-called structural solution has been applied to the solution of the elliptic partitial-differential equations. In the algorithm of the computations some properties of the structural solution have been exploited, namely the fact that the solution is composed of two parts, one of them fulfils exactly the boundary conditions and the others fulfils the differential equation in an approximate way. This feature of the solution can be employed effectively if we assume simple, linear basis functions over local simplexes and use the structured grid of nodes. That, together with the “row-by-row” method of computing the coefficients of the resulting system of linear algebraic equations, leads to the effective parallel algorithm of the solution. In the authors’ opinion, the efficiency of the proposed method should be particularly observable in the analysis of the problems with real great domain solutions. On the other hand, if more complex boundary value problems are to be solved, the local basis spline functions of the higher order will probably be needed.
References 1. Grzymkowski, R., Korek, K.: On R-function Theory and its Application in Inverse Problem of Heat Conduction. Information Technology Interfaces. In: Proceedings of the 23rd International Conference on Pula, Croatia, pp. 393–402 (2001) 2. Reddy., J.N.: Applied Functional Analysis and Variational Methods in Engineering. McGraw–Hill Book Company, New York (1986) 3. Rvachew, W.L., Sliesarienko, A.P.: Algiebra łogiki i intierwalnyje prieobrazowanija w krajewych zadaczach (in Russian), Izd. Naukowa Dumka, Kijów (1976) 4. Shapiro, V.: Theory of R-functions and Applications, Technical Report, Cornell University (1988) 5. Wawrzynek., A.: Modelling of solidification and cooling of metals and heat diffusion problems by R-function method (in Polish), Zesz. Nauk. Pol Śląskiej, Mechanika 119, Gliwice, Poland (1994) 6. Yagawa, G.: Node-by-node parallel finite elements: a virtually meshless method. Int. J. Numer. Meth. Eng. 60(1), 69–102 (2004)
Using a (Higher-Order) Magnus Method to Solve the Sturm-Liouville Problem Veerle Ledoux , Marnix Van Daele, and Guido Vanden Berghe Vakgroep Toegepaste Wiskunde en Informatica, Ghent University, Krijgslaan 281-S9, B-9000 Gent, Belgium {Veerle.Ledoux,Marnix.VanDaele,Guido.VandenBerghe}@UGent.be
Abstract. The main purpose of this paper is to describe techniques for the numerical solution of a Sturm-Liouville equation (in its Schr¨ odinger form) by employing a Magnus expansion. With a suitable method to approximate the highly oscillatory integrals which appear in the Magnus series, high order schemes can be constructed. A method of order ten is presented. Even when the solution is highly-oscillatory, the scheme can accurately integrate the problem using stepsizes typically much larger than the solution “wavelength”. This makes the method well suited to be applied in a shooting process to locate the eigenvalues of a boundary value problem.
1
Introduction
In this paper we are concerned with the numerical approximation of problems of the form (1) y (x) = [V (x) − E] y(x), a ≤ x ≤ b This equation is the Sturm-Liouville equation in its Liouville normal form, also called Schr¨ odinger form. Mathematically, Schr¨ odinger problems arise from the standard separation of variables method applied to a linear partial differential equation, and in connection with the inverse scattering transform for solving nonlinear partial differential equations. The Schr¨ odinger equation is also well known as the fundamental equation in quantum physics or quantum chemistry but arises for instance also in geophysical applications, and vibration and heat flow problems in mechanical engineering. Many Schr¨ odinger problems have explicit solutions, and are therefore important in the analytic investigation of different physical models. However most (boundary value) problems cannot be solved analytically, and computationally efficient approximation techniques are of great applicability. Although we focus in this paper on the basic Schr¨ odinger equation in a finite domain and with a smooth potential V (x), our scheme can be extended to a more general Sturm-Liouville problem −(p(x)y (x)) + q(x)y(x) = Ew(x)y(x). The parameter E (also called the eigenvalue) in (1) is unknown, and is to be found subject to some kind of boundary conditions in the endpoints a and b.
Postdoctoral Fellow of the Fund for Scientific Research - Flanders (Belgium) (F.W.O.-Vlaanderen).
It is well known that as E grows, the solutions of (1) become increasingly √ oscillatory. In fact, as E → +∞ the solution “wave length” approaches 2π/ E. This highly oscillatory character of the solution is the reason why standard integrators encounter difficulties in efficiently estimating the higher eigenvalues: a naive integrator will be forced to make increasingly smaller steps severely increasing the running time. By taking advantage of special methods, one can construct numerical algorithms having special advantages over these standard (naive) methods. Pruess suggested to approximate the coefficients of the problem by piecewise constant approximations, solving the problem analytically on the piecewise constant intervals (see [15,16]). For such a coefficient approximation method the step size is not restricted by the oscillations in the solution but the scheme is only second order, unless Richardson extrapolation approximations are made. Two approaches have been suggested to construct higher order schemes, both being natural extensions of the Pruess ideas. A first approach is based on a technique from mathematical physics: the perturbation approximation, leading to the so-called Piecewise Perturbation Methods (PPM) (see [8,9,10,11]). In [2] it was shown that the piecewise perturbation approach may be viewed as the application of a modified Neumann series. The second approach consists in the application of another integral series: the Magnus series. During the last decade, numerical schemes based on the Magnus expansion received a lot of attention due to their preservation of Lie group symmetries (see [5],[14], and references cited therein). More generally, Magnus methods have been applied in spectral theory, Hamiltonian systems, symplectic and unitary integration, control theory, stochastic systems, and quantum chemistry; see [1] for a list of applications. Moan [13] was the first to consider a Magnus method in the context of Sturm-Liouville problems. He applied a Magnus series integrator directly to eq. (1) with a piecewise polynomial V (x). However poor approximations can then be expected for large eigenvalues. Later Degani and Schiff [2,3] and Iserles [4] showed that it is a better idea for oscillatory ordinary differential equations to apply the Magnus series integrator not directly to the equation but to the so-called modified equation. In [12] such a modified Magnus scheme of order eight was constructed for the Schr¨odinger problem and applied in a shooting procedure to compute the eigenvalues of the boundary value problem. In the current paper we present the construction of a modified Magnus method of order ten. In order to reach tenth order, the Filon-based quadrature rule for the oscillatory integrals appearing in the Magnus series, had to be extended to triple integrals. Also this new modified Magnus integrator can be used in a shooting process to efficiently compute eigenvalues.
2 The (Modified) Magnus Method
The differential equation (1) is converted into a system of first-order ODEs
\[ y'(x) = A(x, E)\,y(x), \qquad y(a) = y_0, \qquad (2) \]
where
\[ A(x, E) = \begin{pmatrix} 0 & 1 \\ V(x) - E & 0 \end{pmatrix} \qquad (3) \]
and y = [y(x), y'(x)]^T. Suppose that we have already computed y_i ≈ y(x_i) and that we wish to advance the numerical solution to x_{i+1} = x_i + h_i. We first compute a constant approximation V̄ of the potential function V(x),
\[ \bar V = \frac{1}{h_i}\int_{x_i}^{x_i + h_i} V(x)\,dx. \qquad (4) \]
Next we change the frame of reference by letting
\[ y(x) = e^{(x - x_i)\bar A}\,u(x - x_i), \qquad x_i \le x \le x_{i+1}, \qquad (5) \]
where
\[ \bar A(E) = \begin{pmatrix} 0 & 1 \\ \bar V - E & 0 \end{pmatrix}. \qquad (6) \]
We treat u as our new unknown, which itself obeys the linear differential equation
\[ u'(\delta) = B(\delta, E)\,u(\delta), \qquad \delta = x - x_i \in [0, h_i], \qquad u(0) = y_i, \qquad (7) \]
where
\[ B(\delta, E) = e^{-\delta \bar A}\bigl(A(x_i + \delta) - \bar A\bigr)\,e^{\delta \bar A}. \qquad (8) \]
The matrix B can be computed explicitly. With ξ(Z) and η_0(Z) defined as
\[ \xi(Z) = \begin{cases} \cos(|Z|^{1/2}) & \text{if } Z \le 0, \\ \cosh(Z^{1/2}) & \text{if } Z > 0, \end{cases} \qquad (9) \]
\[ \eta_0(Z) = \begin{cases} \sin(|Z|^{1/2})/|Z|^{1/2} & \text{if } Z < 0, \\ 1 & \text{if } Z = 0, \\ \sinh(Z^{1/2})/Z^{1/2} & \text{if } Z > 0, \end{cases} \qquad (10) \]
we can write B as
\[ B(\delta, E) = \Delta V(\delta)\begin{pmatrix} \delta\,\eta_0(Z_{2\delta}) & \dfrac{1 - \xi(Z_{2\delta})}{2(E - \bar V)} \\[4pt] -\dfrac{1 + \xi(Z_{2\delta})}{2} & -\delta\,\eta_0(Z_{2\delta}) \end{pmatrix}, \qquad (11) \]
where ΔV(δ) = V̄ − V(x_i + δ) and Z_γ = Z(γ) = (V̄ − E)γ². Note that the PPM formulation in, e.g., [8,9] uses the same functions ξ(Z) and η_0(Z). We apply a Magnus method to the modified equation (7). The Magnus expansion is then (where the bracket denotes the matrix commutator)
\[ \sigma(\delta) = \sigma_1(\delta) + \sigma_2(\delta) + \sigma_3(\delta) + \sigma_4(\delta) + \dots, \qquad (12) \]
where
\[ \sigma_1(\delta) = \int_0^{\delta} B(x)\,dx, \qquad \sigma_2(\delta) = -\frac{1}{2}\int_0^{\delta}\!\!\int_0^{x_1} [B(x_2), B(x_1)]\,dx_2\,dx_1, \]
\[ \sigma_3(\delta) = \frac{1}{12}\int_0^{\delta}\Bigl[\int_0^{x_1} B(x_2)\,dx_2, \Bigl[\int_0^{x_1} B(x_2)\,dx_2, B(x_1)\Bigr]\Bigr]\,dx_1, \qquad \sigma_4(\delta) = \frac{1}{4}\int_0^{\delta}\Bigl[\int_0^{x_1}\Bigl[\int_0^{x_2} B(x_3)\,dx_3, B(x_2)\Bigr]dx_2, B(x_1)\Bigr]\,dx_1, \]
and u(δ) = e^{σ(δ)} y_i, δ ≥ 0. Thus, to compute y_{i+1} = e^{h\bar A} e^{σ(h)} y_i with h = h_i, we need to approximate σ(h) by truncating the expansion (12) and replacing the integrals by quadrature (see the next section). The 2 × 2 matrix exponentials e^{h\bar A} and e^{σ(h)} can be written down explicitly. e^{h\bar A} is the matrix exponential of a constant matrix, and thus
\[ \operatorname{expm}\begin{pmatrix} 0 & h \\ h(\bar V - E) & 0 \end{pmatrix} = \begin{pmatrix} \xi(Z_h) & h\,\eta_0(Z_h) \\ Z_h\,\eta_0(Z_h)/h & \xi(Z_h) \end{pmatrix}, \qquad Z_h = Z(h). \qquad (13) \]
To write down an expression for e^{σ(h)}, we note that σ(h) is always a two-by-two matrix with zero trace. For such matrices the following is true:
\[ \operatorname{expm}\begin{pmatrix} a & b \\ c & -a \end{pmatrix} = \begin{pmatrix} \xi(\omega) + a\,\eta_0(\omega) & b\,\eta_0(\omega) \\ c\,\eta_0(\omega) & \xi(\omega) - a\,\eta_0(\omega) \end{pmatrix}, \qquad \omega = a^2 + bc. \qquad (14) \]
Here a, b, c, ω are functions of x and E.
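The closed forms (9), (10) and (14) are straightforward to code. The following short Python sketch (with assumed helper names xi, eta0 and expm_traceless, which are not part of the paper) evaluates ξ, η_0 and the exponential of a traceless 2 × 2 matrix; expm_hAbar reproduces (13) as the special case a = 0, b = h, c = h(V̄ − E).

import math

def xi(Z):
    # xi(Z) = cos(|Z|^(1/2)) for Z <= 0, cosh(Z^(1/2)) for Z > 0, cf. (9)
    return math.cos(math.sqrt(-Z)) if Z <= 0 else math.cosh(math.sqrt(Z))

def eta0(Z):
    # eta_0(Z) of (10)
    if Z < 0:
        r = math.sqrt(-Z)
        return math.sin(r) / r
    if Z == 0:
        return 1.0
    r = math.sqrt(Z)
    return math.sinh(r) / r

def expm_traceless(a, b, c):
    # expm([[a, b], [c, -a]]) via (14) with omega = a^2 + b*c
    w = a * a + b * c
    xi_w, eta_w = xi(w), eta0(w)
    return [[xi_w + a * eta_w, b * eta_w],
            [c * eta_w, xi_w - a * eta_w]]

def expm_hAbar(h, Vbar, E):
    # e^{h*Abar} of (13): Abar = [[0, 1], [Vbar - E, 0]] scaled by h
    return expm_traceless(0.0, h, h * (Vbar - E))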
3 Integration of the Integrals
As shown in [4], the regular Magnus quadrature formulae ([7]) are useless in the presence of high oscillation. For E ≫ V̄ the matrix function B in (11) is highly oscillatory, and quadrature must be used that respects the high oscillation. Filon-type quadrature can be used to approximate highly oscillatory integrals to a suitable precision in a small number of function evaluations per step. As in [12], we apply Filon-type quadrature not only in the oscillatory region E > V̄, but also in the non-oscillatory region E < V(x) (where it is just as good as regular Gauss-Christoffel Magnus quadrature). The univariate Filon rule is discussed in [4] and has the nice property that, while regular quadrature is ineffective in the presence of high oscillation, Filon quadrature delivers accuracy which actually improves as the oscillation increases. Here we use this Filon rule to approximate the univariate (modified) Magnus integral ∫_0^h B(δ) dδ. In fact, this means that ΔV(δ) in (11) is replaced by the Lagrange polynomial L_{ΔV}(δ) = Σ_{k=1}^{ν} ΔV(c_k h) ℓ_k(δ), where ℓ_k is the kth cardinal polynomial of Lagrangian interpolation and c_1, c_2, ..., c_ν are distinct quadrature nodes.
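To illustrate the Filon idea in its simplest form, the sketch below (an illustration only, not the specific rule of this paper, which works with the kernels ξ and η_0) approximates ∫_0^h f(δ) cos(aδ) dδ by interpolating the smooth factor f at a few nodes and integrating the moments ∫_0^h δ^k cos(aδ) dδ exactly, so that the accuracy does not degrade as the frequency a grows.

import numpy as np

def cos_sin_moments(nmax, a, h):
    """C[k] = int_0^h d^k cos(a d) dd and S[k] = int_0^h d^k sin(a d) dd."""
    C = np.zeros(nmax + 1)
    S = np.zeros(nmax + 1)
    C[0] = np.sin(a * h) / a
    S[0] = (1.0 - np.cos(a * h)) / a
    for k in range(1, nmax + 1):
        C[k] = h**k * np.sin(a * h) / a - (k / a) * S[k - 1]
        S[k] = -h**k * np.cos(a * h) / a + (k / a) * C[k - 1]
    return C, S

def filon_cos(f, a, h, deg=4):
    """Approximate int_0^h f(d) cos(a d) dd by interpolating f at deg+1 nodes."""
    nodes = 0.5 * h * (1.0 - np.cos(np.linspace(0.0, np.pi, deg + 1)))
    coeffs = np.polyfit(nodes, f(nodes), deg)[::-1]   # coeffs[k] multiplies d**k
    C, _ = cos_sin_moments(deg, a, h)
    return float(np.dot(coeffs, C[: deg + 1]))

# Example: slowly varying f against a rapidly oscillating cosine.
if __name__ == "__main__":
    f = lambda d: 1.0 + d - 0.3 * d**2
    print(filon_cos(f, a=200.0, h=1.0))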
The resulting integrals can then be solved analytically. An alternative way to obtain the interpolating polynomial L_{ΔV}(δ) is to approximate V by a series of shifted Legendre polynomials,
\[ V(x_i + \delta) \approx \sum_{s=0}^{\nu-1} V_s\,h^s\,P_s^*(\delta/h). \qquad (15) \]
By the method of least squares the expressions for the coefficients V_s are obtained:
\[ V_s = \frac{2s + 1}{h^{s+1}}\int_0^h V(x_i + \delta)\,P_s^*(\delta/h)\,d\delta, \qquad s = 0, 1, 2, \dots. \qquad (16) \]
It can then be noted that V̄ = V_0 and ΔV(δ) ≈ L_{ΔV}(δ) = −Σ_{s=1}^{ν−1} V_s h^s P_s^*(δ/h). To compute the integrals (16), tenth-order Gauss-Legendre quadrature is used, requiring ν = 5 function evaluations of V (Gauss-Lobatto is another option). With
ξ = ξ(Z_{2h}), η_0 = η_0(Z_{2h}), Z_{2h} = 4Z_h = 4(V̄ − E)h²,
and V̂_s = h^{s+1} V_s, s = 1, ..., 4, we then obtain the following:
\[ \frac{1}{h}\int_0^h \Delta V(\delta)\,\delta\,\eta_0(Z_{2\delta})\,d\delta \approx \Bigl(\tfrac{1}{2}\hat V_1 + \tfrac{3}{2}\hat V_2 + 3\hat V_3 + 5\hat V_4\Bigr)\frac{\eta_0}{Z_h} + \frac{(-\hat V_1 - \hat V_2 - \hat V_3 - \hat V_4)\,\xi - \hat V_1 + \hat V_2 - \hat V_3 + \hat V_4}{4Z_h} + \frac{(-3\hat V_2 - 15\hat V_3 - 45\hat V_4)\,\xi + 3\hat V_2 - 15\hat V_3 + 45\hat V_4}{4Z_h^2} + \frac{(15\hat V_3 + 105\hat V_4)\,\eta_0}{2Z_h^2} + \frac{-\tfrac{105}{4}\hat V_4\,\xi + \tfrac{105}{4}\hat V_4}{Z_h^3}, \]
\[ \int_0^h \Delta V(\delta)\,(1 + \xi(Z_{2\delta}))\,d\delta = \int_0^h \Delta V(\delta)\,\xi(Z_{2\delta})\,d\delta, \qquad \int_0^h \Delta V(\delta)\,(1 - \xi(Z_{2\delta}))\,d\delta = -\int_0^h \Delta V(\delta)\,\xi(Z_{2\delta})\,d\delta, \qquad (17) \]
\[ \int_0^h \Delta V(\delta)\,\xi(Z_{2\delta})\,d\delta \approx (\hat V_1 + \hat V_2 + \hat V_3 + \hat V_4)\,\eta_0 + \frac{(3\hat V_2 + 15\hat V_3 + 45\hat V_4)\,\eta_0}{Z_h} + \frac{(-\hat V_1 - 3\hat V_2 - 6\hat V_3 - 10\hat V_4)\,\xi + \hat V_1 - 3\hat V_2 + 6\hat V_3 - 10\hat V_4}{2Z_h} + \frac{210\hat V_4\,\eta_0 + (-15\hat V_3 - 105\hat V_4)\,\xi + 15\hat V_3 - 105\hat V_4}{2Z_h^2}, \qquad (18) \]
which allows us to approximate ∫_0^h B(δ) dδ. Including only this first Magnus term is sufficient to obtain a fourth-order method.
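The per-interval preprocessing of this section can be sketched as follows in Python (assumed function names): the coefficients V_s of (16) are computed with a 5-point Gauss-Legendre rule, V̄ = V_0, and the scaled coefficients V̂_s = h^{s+1} V_s then enter (17)-(18).

import numpy as np
from numpy.polynomial.legendre import leggauss, legval

def shifted_legendre(s, t):
    # P_s^*(t) = P_s(2t - 1), the shifted Legendre polynomial on [0, 1]
    c = np.zeros(s + 1)
    c[s] = 1.0
    return legval(2.0 * t - 1.0, c)

def legendre_coefficients(V, xi, h, nu=5):
    # V_s = (2s+1)/h^(s+1) * int_0^h V(x_i + delta) P_s^*(delta/h) d(delta), cf. (16),
    # evaluated with a nu-point Gauss-Legendre rule (exact up to degree 2*nu - 1).
    g, w = leggauss(nu)              # nodes/weights on [-1, 1]
    t = 0.5 * (g + 1.0)              # delta/h in [0, 1]
    wt = 0.5 * w                     # weights for int_0^1 ... d(delta/h)
    Vvals = np.array([V(xi + h * tk) for tk in t])
    return np.array([(2 * s + 1) / h**s * np.sum(wt * Vvals * shifted_legendre(s, t))
                     for s in range(nu)])     # Vbar = Vs[0]

def scaled_coefficients(Vs, h):
    # Vhat_s = h^(s+1) * V_s for s = 1, ..., 4
    return np.array([h**(s + 1) * Vs[s] for s in range(1, 5)])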
However, to construct a method of order ten, we need to include more Magnus terms. First we consider the approximation of σ_2. We extend the Filon idea to the computation of the double integral. As in [12] we write the double integral as
\[ \int_0^h\!\!\int_0^{\delta_1} [B(\delta_2), B(\delta_1)]\,d\delta_2\,d\delta_1 = 2\Bigl(\int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_1(\delta_1,\delta_2)\,d\delta_2\,d\delta_1\Bigr) U_1 + 2\Bigl(\int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_2(\delta_1,\delta_2)\,d\delta_2\,d\delta_1\Bigr) U_2 + 2\Bigl(\int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_3(\delta_1,\delta_2)\,d\delta_2\,d\delta_1\Bigr) U_3, \qquad (19) \]
where K_1(x, y) = y η_0(Z_{2y}) − x η_0(Z_{2x}), K_2(x, y) = ξ(Z_{2x}) − ξ(Z_{2y}), K_3(x, y) = (x − y) η_0(Z_{2(x−y)}) and
\[ U_1 = \begin{pmatrix} 0 & \frac{1}{2(E - \bar V)} \\ \frac{1}{2} & 0 \end{pmatrix}, \qquad U_2 = \begin{pmatrix} -\frac{1}{4(E - \bar V)} & 0 \\ 0 & \frac{1}{4(E - \bar V)} \end{pmatrix}, \qquad U_3 = \begin{pmatrix} 0 & \frac{1}{2(E - \bar V)} \\ -\frac{1}{2} & 0 \end{pmatrix}. \qquad (20) \]
The three integrals in (19) must be replaced by quadrature. We again replace ΔV by the polynomial L_{ΔV} and solve the resulting integrals analytically (Maple). For brevity we do not list the full expressions of the resulting formulae here; we show only the expression for the third integral:
\[ \int_0^h\!\!\int_0^{\delta_1} \Delta V(\delta_1)\Delta V(\delta_2)\,K_3(\delta_1,\delta_2)\,d\delta_2\,d\delta_1 \approx \frac{\hat V_4^2 + \hat V_2^2 - \hat V_3^2 - \hat V_1^2 + 2(\hat V_4\hat V_2 - \hat V_3\hat V_1)}{4Z_h} + \Bigl( \frac{190\hat V_4^2 - \hat V_1^2 + 15\hat V_2^2 - 66\hat V_3^2 - 42\hat V_3\hat V_1 + 156\hat V_4\hat V_2}{4Z_h^2} + \frac{9\hat V_2^2 - 405\hat V_3^2 + 4335\hat V_4^2 - 30\hat V_3\hat V_1 + 1110\hat V_4\hat V_2}{4Z_h^3} + \frac{\hat V_1^2 - 3\hat V_2^2 + 6\hat V_3^2 - 10\hat V_4^2 - 225\hat V_3^2 + 20475\hat V_4^2 + 630\hat V_4\hat V_2}{4Z_h^4} + \frac{11025\hat V_4^2}{4Z_h^5} \Bigr)\eta_0 + \Bigl( \frac{7\hat V_3\hat V_1 - 13\hat V_4\hat V_2}{4Z_h^2} + \frac{-1110\hat V_4^2 - 270\hat V_4\hat V_2 + 30\hat V_3\hat V_1 - 9\hat V_2^2 + 105\hat V_3^2}{4Z_h^3} + \frac{225\hat V_3^2 - 630\hat V_4\hat V_2 - 5775\hat V_4^2}{4Z_h^4} - \frac{11025\hat V_4^2}{4Z_h^5} \Bigr)\xi + \frac{-\hat V_1^2/12 - \hat V_2^2/20 - \hat V_3^2/28 - \hat V_4^2/36}{Z_h} + \frac{-7\hat V_4\hat V_2 - 5\hat V_3\hat V_1}{4Z_h^2}. \qquad (21) \]
As shown in [12], the inclusion of this second Magnus term leads to an eighth-order algorithm. Next we consider the approximation of σ_3 and σ_4 in order to obtain a tenth-order scheme. The same procedure is applied again: the function
ΔV appearing in the expressions for σ_3 and σ_4 is replaced by a polynomial. By symbolic computation it can be shown that it is sufficient here to replace ΔV(δ) by a third-degree polynomial. Therefore we take ΔV(δ) ≈ −Σ_{s=1}^{3} V_s h^s P_s^*(δ/h), where the coefficients V_s are still the same ones as before. Also, only the terms where the degree in h is smaller than 11 have to be considered: e.g., we do not take into account the V̂_3^3-term. We used the symbolic software package Maple to compute the expressions of the 2 × 2 matrix ς = σ_3 + σ_4. As an illustration, we show some terms of the diagonal elements:
\[ \varsigma_{11} = -\varsigma_{22} = \frac{135\hat V_1^2\hat V_3 + 49\hat V_1^3 + 240\hat V_1\hat V_2\hat V_3 + 45\hat V_2^3 + 150\hat V_1^2\hat V_2 + 123\hat V_1\hat V_2^2}{480Z_h^2} + \Bigl( \frac{961\hat V_1^2\hat V_2 + 105\hat V_1^3 + 8382\hat V_1\hat V_3\hat V_2 + 2475\hat V_1^2\hat V_3 + 2025\hat V_1\hat V_2^2 + 1161\hat V_2^3}{96Z_h^3} + \frac{5859\hat V_1\hat V_2^2 + 59662\hat V_1\hat V_3\hat V_2 + 7245\hat V_1^2\hat V_3 + 8055\hat V_2^3 + 736\hat V_1^2\hat V_2}{32Z_h^4} + \frac{549\hat V_2^3 + 16305\hat V_1\hat V_3\hat V_2/4}{Z_h^5} \Bigr)\xi + \dots \qquad (22) \]
The formulas in (17), (21) and (22) may be problematic for E close to V̄ due to near-cancellation of like terms. Therefore alternative formulas are used for small Z_h values (see [12]). These alternative formulas are obtained by applying a Taylor expansion. The alternative for expression (17) is then, e.g.,
\[ \frac{1}{h}\int_0^h \Delta V(\delta)\,\delta\,\eta_0(Z_{2\delta})\,d\delta \approx \Bigl(\tfrac{1}{3}\hat V_1 + \tfrac{1}{15}\hat V_2\Bigr)Z_h + \Bigl(\tfrac{4}{45}\hat V_1 + \tfrac{4}{105}\hat V_2 + \tfrac{1}{105}\hat V_3 + \tfrac{1}{945}\hat V_4\Bigr)Z_h^2 + \Bigl(\tfrac{1}{105}\hat V_1 + \tfrac{1}{189}\hat V_2 + \tfrac{2}{945}\hat V_3 + \tfrac{2}{3465}\hat V_4\Bigr)Z_h^3 + \dots \qquad (23) \]
The alternative formulae are used in the interval |Z_h| < 0.15; in this case it is found to be sufficient to go up to Z_h^8.
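The switching strategy can be illustrated generically in Python (this is not the paper's actual alternative formula, only the pattern): a quantity such as (1 − ξ(Z))/Z, which suffers cancellation as Z → 0, is evaluated from the closed form above the threshold and from a truncated Taylor series below it.

import math

def one_minus_xi_over_Z(Z, threshold=0.15):
    # (1 - xi(Z))/Z = -1/2 - Z/24 - Z^2/720 - ... for both signs of Z
    if abs(Z) >= threshold:
        xi = math.cos(math.sqrt(-Z)) if Z <= 0 else math.cosh(math.sqrt(Z))
        return (1.0 - xi) / Z
    total, term = 0.0, -0.5          # term for Z^0 is -1/2!
    for k in range(1, 10):           # series up to Z^8
        total += term
        term *= Z / ((2 * k + 1) * (2 * k + 2))
    return total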
4 Shooting for Eigenvalues
As mentioned before, a shooting procedure can be used to locate the eigenvalues of the boundary value problem associated with (1). The modified Magnus method presented here is well suited for the repeated solution of the initial value problems which appear in the shooting procedure. These initial value problems are solved for a fixed potential V but for different values of E. For our modified Magnus integrator, a mesh can be constructed which depends only on V and not on E (a procedure similar to the one in [12] can be used to construct the mesh). This mesh has to be computed only once and is then used in all eigenvalue computations.
Algorithm 1. A Sturm-Liouville solver based on a modified Magnus method
1: Use the stepsize selection algorithm to construct a mesh a = x_0 < x_1 < ... < x_n = b
2: for i = 1 to n do
3:   Compute V̄ and V_s, s = 1, ..., 4, for the ith interval (Gauss-Legendre with 5 nodes).
4: end for
5: Choose a meshpoint x_m (0 ≤ m ≤ n) as the matching point.
6: Set up initial values for y_L satisfying the BC at a and initial values for y_R satisfying the BC at b. Choose a trial value for E.
7: repeat
8:   for i = 0 to m − 1 do
9:     y_L(x_{i+1}) = e^{h_i Ā} e^{σ(h_i)} y_L(x_i)
10:  end for
11:  for i = n down to m + 1 do
12:    y_R(x_{i−1}) = e^{−σ(h_i)} e^{−h_i Ā} y_R(x_i)
13:  end for
14:  Adjust E by comparing y_L(x_m) with y_R(x_m) (Newton iteration).
15: until E sufficiently accurate
Moreover, the value V̄ and the coefficients V_s are computed and stored once and for all before the start of the shooting process. Algorithm 1 shows the basic shooting procedure, in which the modified Magnus algorithm is used to propagate the left-hand and right-hand solutions. For more details on such a shooting procedure we refer to [12].
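A schematic Python rendering of this shooting procedure is given below; the per-step propagators step_forward and step_backward (standing for multiplication by e^{h_i Ā} e^{σ(h_i)} and its inverse) and the bisection-based root search are assumptions made for the sketch, whereas the paper adjusts E by Newton iteration.

import numpy as np

def mismatch(E, mesh, m, step_forward, step_backward, yL0, yR0):
    """Propagate yL from a to x_m and yR from b to x_m; return a matching
    function (Wronskian-like determinant) whose zeros are eigenvalues."""
    yL = np.array(yL0, dtype=float)
    for i in range(0, m):                    # left-hand sweep
        yL = step_forward(i, E, yL)
    yR = np.array(yR0, dtype=float)
    for i in range(len(mesh) - 1, m, -1):    # right-hand sweep
        yR = step_backward(i, E, yR)
    return yL[0] * yR[1] - yL[1] * yR[0]

def shoot(E_lo, E_hi, mesh, m, step_forward, step_backward, yL0, yR0, tol=1e-12):
    """Bisection on the mismatch; assumes [E_lo, E_hi] brackets one eigenvalue."""
    f_lo = mismatch(E_lo, mesh, m, step_forward, step_backward, yL0, yR0)
    while E_hi - E_lo > tol:
        E_mid = 0.5 * (E_lo + E_hi)
        f_mid = mismatch(E_mid, mesh, m, step_forward, step_backward, yL0, yR0)
        if f_lo * f_mid <= 0.0:
            E_hi = E_mid
        else:
            E_lo, f_lo = E_mid, f_mid
    return 0.5 * (E_lo + E_hi)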
5 Numerical Examples
As test potentials we take two well-known test problems from the literature [17]. The Coffey-Evans problem is a Schrödinger equation with
\[ V(x) = -2\beta\cos(2x) + \beta^2\sin^2(2x) \qquad (24) \]
and y(−π/2) = y(π/2) = 0 as boundary conditions. Here we take β = 30. The second problem is the Woods-Saxon problem, defined by
\[ V(x) = -50\,\frac{1 - \dfrac{5t}{3(1+t)}}{1 + t} \qquad (25) \]
with t = e^{(x−7)/0.6} over the interval [0, 15]. The eigenvalue spectrum of this Woods-Saxon problem contains 14 eigenenergies E_0, ..., E_13. We use here an equidistant mesh; note, however, that an automatic stepsize selection algorithm can be constructed as in [12]. We performed eigenvalue computations at different step lengths. The absolute errors ΔE_k = E_k^{exact} − E_k^{comput} are collected in Table 1. For the Coffey-Evans problem some lower eigenvalues come in very close clusters, and to distinguish between them the search algorithm must rely on a highly accurate integrator. Our modified Magnus method deals very well with these close eigenvalues.
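For reference, the two test potentials (24)-(25) written out in Python:

import numpy as np

def coffey_evans(x, beta=30.0):
    # V(x) = -2*beta*cos(2x) + beta^2 * sin^2(2x) on [-pi/2, pi/2], cf. (24)
    return -2.0 * beta * np.cos(2.0 * x) + beta**2 * np.sin(2.0 * x) ** 2

def woods_saxon(x):
    # V(x) = -50 * (1 - 5t/(3(1+t))) / (1+t), t = exp((x-7)/0.6), on [0, 15], cf. (25)
    t = np.exp((x - 7.0) / 0.6)
    return -50.0 * (1.0 - 5.0 * t / (3.0 * (1.0 + t))) / (1.0 + t)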
Table 1. Absolute values of the (absolute) errors ΔE_k for the Coffey-Evans and Woods-Saxon problems. n is the number of (equidistant) steps. aE-b means a·10^{-b}.

Coffey-Evans problem
 k   E_k                        n = 128   n = 256
 0   0.0000000000000000         3.4E-10   2.2E-13
 1   117.9463076620687587       1.5E-9    1.4E-12
 2   231.6649292371271088       2.1E-9    1.1E-12
 3   231.6649293129610125       1.1E-9    1.1E-12
 4   231.6649293887949167       2.1E-9    7.9E-13
 5   340.8882998096130157       4.5E-9    4.4E-12
 6   445.2830895824354620       4.4E-9    3.6E-12
 8   445.2832550313310036       4.4E-9    2.7E-12
10   637.6822498740469991       4.8E-9    4.2E-12
15   802.4787986926240517       2.8E-9    1.7E-12
20   951.8788067965913828       2.3E-9    3.7E-12
30   1438.2952446408023577      2.0E-9    2.5E-12
40   2146.4053605398535082      1.5E-9    2.7E-12
50   3060.9234915114205911      1.0E-9    2.7E-12

Woods-Saxon problem
 k   E_k                  n = 64    n = 128
 0   -49.45778872808258   3.9E-11   8.5E-14
 1   -48.14843042000639   3.8E-10   2.6E-13
 2   -46.29075395446623   2.0E-9    1.6E-12
 3   -43.96831843181467   7.2E-9    6.3E-12
 4   -41.23260777218090   2.0E-8    1.9E-12
 5   -38.12278509672854   4.8E-8    4.6E-11
 6   -34.67231320569997   9.7E-8    9.7E-11
 7   -30.91224748790910   1.7E-7    1.7E-10
 8   -26.87344891605993   2.8E-7    2.9E-10
 9   -22.58860225769320   3.9E-7    4.3E-10
10   -18.09468828212811   5.1E-7    5.7E-10
11   -13.43686904026007   5.9E-7    6.7E-10
12   -8.67608167074520    6.0E-7    7.2E-10
13   -3.90823248120989    5.0E-7    6.6E-10
No systematic deterioration of the accuracy is observed as k increases. This tenth-order method of course gives more accurate approximations than the eighth-order method of [12]: for the first eigenvalue of the Coffey-Evans problem, for example, that method gives an error of 1.0E-7 (n = 128) and 4.0E-10 (n = 256).
6 Conclusion
In this paper we discussed a modified Magnus method of order ten for the integration of a Sturm-Liouville problem in Schrödinger form. To this end, the modified Magnus method described earlier by Degani and Schiff and by Iserles had to be extended to the non-oscillatory region E < V, and a Filon-like quadrature rule had to be defined for the multivariate integrals appearing in the Magnus series. The modified Magnus method can be applied in a shooting procedure in order to compute the eigenvalues of a boundary value problem. Since an E-independent mesh can be constructed, all function evaluations can be done before the actual shooting process, which makes the method well suited to compute large batches of eigenvalues or just particularly large eigenvalues.
References
1. Blanes, S., Casas, F., Oteo, J.A., Ros, J.: Magnus and Fer expansions for matrix differential equations: the convergence problems. J. Phys. A: Math. Gen. 31, 259–268 (1998)
2. Degani, I., Schiff, J.: RCMS: Right Correction Magnus Series approach for oscillatory ODEs. J. Comput. Appl. Math. 193, 413–436 (2006)
3. Degani, I.: RCMS - Right Correction Magnus Schemes for oscillatory ODEs and cubature formulae and commuting extensions. Thesis (PhD), Weizmann Institute of Science (2004)
4. Iserles, A.: On the numerical quadrature of highly oscillatory integrals I: Fourier transforms. IMA J. Numer. Anal. 24, 365–391 (2004)
5. Iserles, A., Nørsett, S.P.: On the solution of linear differential equations in Lie groups. Phil. Trans. R. Soc. Lond. A 357, 983–1019 (1999)
6. Iserles, A.: On the global error of discretization methods for highly-oscillatory ordinary differential equations. BIT 42, 561–599 (2002)
7. Iserles, A., Munthe-Kaas, H.Z., Nørsett, S.P., Zanna, A.: Lie-group methods. Acta Numerica 9, 215–365 (2000)
8. Ixaru, L.G.: Numerical Methods for Differential Equations and Applications. Reidel, Dordrecht-Boston-Lancaster (1984)
9. Ixaru, L.G., De Meyer, H., Vanden Berghe, G.: SLCPM12 - A program for solving regular Sturm-Liouville problems. Comput. Phys. Commun. 118, 259–277 (1999)
10. Ledoux, V., Van Daele, M., Vanden Berghe, G.: CP methods of higher order for Sturm-Liouville and Schrödinger equations. Comput. Phys. Commun. 162, 151–165 (2004)
11. Ledoux, V., Van Daele, M., Vanden Berghe, G.: MATSLISE: A MATLAB package for the numerical solution of Sturm-Liouville and Schrödinger equations. ACM Trans. Math. Software 31, 532–554 (2005)
12. Ledoux, V., Van Daele, M., Vanden Berghe, G.: Efficient numerical solution of the 1D Schrödinger eigenvalue problem using Magnus integrators. IMA J. Numer. Anal. (submitted)
13. Moan, P.C.: Efficient approximation of Sturm-Liouville problems using Lie group methods. Technical report, DAMTP, University of Cambridge (1998)
14. Munthe-Kaas, H., Owren, B.: Computations in a free Lie algebra. Phil. Trans. R. Soc. Lond. A 357, 957–981 (1999)
15. Pruess, S.: Solving linear boundary value problems by approximating the coefficients. Math. Comp. 27, 551–561 (1973)
16. Pruess, S., Fulton, C.T.: Mathematical software for Sturm-Liouville problems. ACM Trans. Math. Software 19, 360–376 (1993)
17. Pryce, J.D.: Numerical Solution of Sturm-Liouville Problems. Clarendon Press (1993)
Stopping Criterion for Adaptive Algorithm
Sanjay Kumar Khattri
Stord/Haugesund University College, Bjørnsonsgt. 45, Haugesund 5528, Norway
[email protected]
Abstract. An adaptive algorithm involves several parameters, for example an adaptivity index, an adaptivity criterion and a stopping criterion. The adaptivity index drives the adaptive algorithm by selecting some elements for further refinement. Apart from this driving force, another important aspect of an algorithm is its stopping criterion. We present a new stopping criterion for adaptive algorithms.
1 Introduction
The convergence rate of the finite volume method on uniform meshes depends on the regularity or singularity of the solution. We develop a finite volume method on adaptive meshes and present its pointwise (infinity-norm) convergence. It is shown that the convergence of the presented adaptive method is independent of the regularity or singularity of the underlying problem. An adaptive technique depends on several factors, such as the error indicator and the adaptive algorithm; we present a simple adaptive criterion and adaptive algorithm. Now let us consider the steady-state pressure equation of a single phase flowing in a porous medium Ω,
\[ -\operatorname{div}(K\,\operatorname{grad} p) = f \quad \text{in } \Omega, \qquad (1) \]
\[ p(x, y) = p^{D} \quad \text{on } \partial\Omega_D. \qquad (2) \]
Here, Ω is a polyhedral domain in R², the source function f is assumed to be in L²(Ω), and the diagonal tensor coefficient K(x, y) is positive definite and piecewise constant; K is allowed to be discontinuous in space. In porous media flow [7,4,1], the unknown function p = p(x, y) represents the pressure of a single phase, K is the permeability or hydraulic conductivity of the porous medium, and the velocity u of the phase is given by Darcy's law as u = −K grad p. The next section presents the finite volume method and the adaptive algorithm.
2 Finite Volume Discretization and Adaptive Algorithm
For solving partial differential equations (PDEs) in a domain by numerical methods such as the finite volume method, the domain is divided into smaller elements called finite volumes or cells.
Fig. 1. Computation of flux across an edge: (a) flux on a matching grid; (b) flux on a non-matching grid.
Finite volume discretization of Equation (1) for a finite volume is given as [7]
\[ \sum_{i=1}^{4} F_i = \int_V f\,d\tau. \qquad (3) \]
Here, F_i is the flux through interface i. Now let us compute the flux for the interface MN shared by cells 1 and 2 (see Fig. 1(a)). The flux [12,13,14,7] through the edge MN is given as
\[ F_{MN} = \Phi_{MN}\,(p_2 - p_1), \qquad (4) \]
where the scalar Φ_MN is referred to as the transmissibility of the interface MN and is given as
\[ \Phi_{MN} = \frac{l\,K_1 K_2}{h_1 h_2\,(K_1/h_1 + K_2/h_2)}. \qquad (5) \]
Here, K_1 and K_2 refer to the permeabilities of cells 1 and 2 in Fig. 1(a). The perpendicular distance of the interface MN from the center of cell 1 is h_1; similarly, h_2 is the perpendicular distance of the interface MN from the center of cell 2. The length of the interface MN is l. Adaptive discretization can result in a non-matching grid as shown in Fig. 1(b); we use the same flux approximation for computing the flux on a non-matching grid. We use the following expression for computing the error of cell i in a mesh [7]:
\[ \epsilon_i \overset{\mathrm{def}}{=} \|f\|_{L^2(\Omega_i)}\,|\Omega_i|^{1/2} + \|(K\,\nabla p_h)\cdot\hat n\|_{L^2(\partial\Omega_i)}\,|\partial\Omega_i|^{1/2}. \qquad (6) \]
Here, |Ω_i| is the area of the finite volume, |∂Ω_i| is the circumference of the finite volume, and n̂ is the unit outward normal. The quantity ||(K ∇p_h)·n̂||_{L²(∂Ω_i)} |∂Ω_i|^{1/2} is the total flux associated with cell i. Let us further define a quantity named the adaptivity index for cell i in a mesh,
\[ \eta_i \overset{\mathrm{def}}{=} \frac{\epsilon_i}{\max_{j\in\mathrm{cells}} \epsilon_j}. \qquad (7) \]
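A small Python sketch of the quantities (4)-(7) defined above (the data layout and function names are assumptions):

import numpy as np

def transmissibility(K1, K2, h1, h2, l):
    # Phi_MN = l * K1 * K2 / (h1 * h2 * (K1/h1 + K2/h2)), Equation (5)
    return l * K1 * K2 / (h1 * h2 * (K1 / h1 + K2 / h2))

def flux(Phi, p1, p2):
    # F_MN = Phi_MN * (p2 - p1), Equation (4)
    return Phi * (p2 - p1)

def error_indicator(f_norm, area, flux_norm, perimeter):
    # eps_i = ||f|| * |Omega_i|^(1/2) + ||(K grad p_h).n|| * |dOmega_i|^(1/2), Equation (6)
    return f_norm * np.sqrt(area) + flux_norm * np.sqrt(perimeter)

def adaptivity_index(eps):
    # eta_i = eps_i / max_j eps_j, Equation (7): values lie in [0, 1]
    eps = np.asarray(eps, dtype=float)
    return eps / eps.max()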
It can be seen from the above definition of the adaptivity index that for a cell with zero error (ε_i = 0) the adaptivity index η_i is zero, and for the cell with maximum error η_i is 1; thus for any cell the adaptivity index η_i lies in the range [0, 1]. The adaptivity index (7) is the driving force of Algorithm 1: it selects the finite volumes for further refinement. Apart from the driving force, another important aspect of an algorithm is its stopping criteria. The two obvious stopping criteria of an adaptive algorithm are the maximum allowable number of degrees of freedom (DOF_max), or the maximum allowed mesh refinement, and the maximum allowed number of adaptive iteration steps, "Iter ≤ Iter_max". For defining a third criterion, let us compute the maximum error associated with a finite volume (cell) on the mesh formed after k iterative steps of the algorithm. Let this error be ξ_k; thus ξ_k = max_{i∈cells} ε_i.
Thus, ξ_0 is the maximum error of a cell on the initial mesh. Our third stopping criterion is defined as ξ_k/ξ_0 ≥ tol. The quotient "ξ_k/ξ_0" measures the error reduction after k iteration steps of the adaptive algorithm: ξ_k denotes the maximum error (the maximum value of ε_i on a mesh) on the adaptively refined mesh after k iteration steps of the adaptive Algorithm 1, and ξ_k/ξ_0, which measures the reduction of the a posteriori error estimate ε_i, provides information on the relative error reduction. Thus ξ_k/ξ_0 can be used as a stopping criterion apart from the maximum number of degrees of freedom; the number of degrees of freedom and the maximum number of iterations of the adaptive algorithm do not provide information about the error reduction. Algorithm 1 is used for adaptive refinement. When a finite volume is selected for further refinement based on the value of the adaptivity index (7), this finite volume is divided into four equal finite volumes. During the adaptive refinement process, all finite volumes Ω_i in the mesh for which the adaptivity index η_i is greater than a given tolerance δ are refined. The tolerance δ lies between 0 and 1. Tolerance δ equal to 0 means uniform refinement (all finite volumes are refined), while δ equal to 1 means that the adaptive algorithm refines a single finite volume per iteration step, which can be costly. Both of these extremes can be computationally expensive and may not be optimal.
Algorithm 1. Adaptive Algorithm with a new stopping criterion [ξ_Iter/ξ_0] ≥ tol.
1: Mesh the domain;
2: Compute ξ_0;
3: Set iteration counter Iter = 0;
4: while DOF ≤ DOF_max or Iter ≤ Iter_max or [ξ_Iter/ξ_0] ≥ tol do
5:   Discretize the PDE on the mesh;
6:   Solve the discrete system to a given tolerance;
7:   forall finite volumes j in the mesh do
8:     if η_j ≥ δ then
9:       Divide the finite volume j into four elements;
10:    end
11:  end
12:  Form a new mesh;
13:  Iter++;
14:  Compute ξ_Iter;
15: end
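The structure of Algorithm 1, with the three stopping criteria, can be sketched as follows (the mesh and solver objects and their methods are assumptions, not a real API; the loop is read here as: stop as soon as any of the three criteria is violated):

def adaptive_solve(mesh, solver, delta=0.5, dof_max=10**6, iter_max=50, tol=1e-3):
    ph = solver.solve(mesh)
    eps = solver.error_indicators(mesh, ph)      # eps_i of Equation (6)
    xi0 = xi_k = max(eps)
    it = 0
    while mesh.dof() <= dof_max and it <= iter_max and xi_k / xi0 >= tol:
        eta = [e / xi_k for e in eps]            # adaptivity index (7)
        marked = [j for j, e in enumerate(eta) if e >= delta]
        mesh = mesh.refine(marked)               # each marked cell -> four cells
        ph = solver.solve(mesh)                  # discretize and solve on the new mesh
        eps = solver.error_indicators(mesh, ph)
        xi_k = max(eps)
        it += 1
    return mesh, ph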
A small δ will refine many finite volumes and thus introduce many new cells per iteration step of the adaptive algorithm. On the other hand, a large value of δ will refine fewer cells and thus introduce fewer new finite volumes per iteration step. It should be kept in mind that during each iteration step of the adaptive algorithm a discrete system needs to be solved. Typically a value of δ = 0.5 is used [15]. To measure the effectiveness of the adaptivity index (7) in selecting the cells with maximum error, we use the relation
\[ \Gamma \overset{\mathrm{def}}{=} \frac{\text{cell number with } \eta = 1.0}{\text{cell number with maximum point-wise error } |p - p_h|}. \qquad (8) \]
Here, Γ is the robustness of the indicator η. If Γ is close to 1, the cells with the maximum point-wise error and the cells with the maximum error given by the error indicator (6) are the same. We compute the robustness quantity Γ of the adaptivity index during each iteration step of the adaptive Algorithm 1.
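A sketch of this robustness check (array names are assumptions):

import numpy as np

def robustness(eta, p_exact, p_h):
    """Gamma = (number of the cell with eta = 1) / (number of the cell with max |p - p_h|)."""
    cell_indicator = int(np.argmax(eta)) + 1      # eta attains 1 at its maximum; 1-based numbering
    cell_pointwise = int(np.argmax(np.abs(np.asarray(p_exact) - np.asarray(p_h)))) + 1
    return cell_indicator / cell_pointwise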
3 Numerical Examples
Let p be the exact solution vector and p_h the finite volume solution vector on a mesh. Let us further assume that p^k is the exact pressure at the center of cell k and p_h^k the discrete pressure given by the finite volume approximation at the same location. The error in the L∞ norm is defined as
\[ \|p - p_h\|_{L^\infty} \overset{\mathrm{def}}{=} \max_{k\in\mathrm{cells}} |p^k - p_h^k|. \qquad (9) \]
The finite volume solution is enforced inside the domain by the Dirichlet boundary condition and the source term. For solving the discrete systems of
equations formed on the sequence of adaptive and uniform meshes, we use the ILU-preconditioned Conjugate Gradient (CG) iterative solver unless mentioned otherwise. Let the domain be Ω = [−1, 1] × [−1, 1], divided into four sub-domains according to the permeability K (see Fig. 2). Let the permeability in sub-domain Ω_i be K_i. It is assumed that the permeability in Ω_1 equals the permeability in Ω_3 and the permeability in Ω_2 equals the permeability in Ω_4, that is, K_1 = K_3 and K_2 = K_4. Let us further assume that K_1 = K_3 = R and K_2 = K_4 = 1. The parameter R is defined below. Let the exact solution in polar form be
\[ p(r, \theta) = r^{\gamma}\,\eta(\theta), \qquad (10) \]
cf. [8,9]. The parameter γ denotes the singularity in the solution [9], and it depends on the permeability distribution in the domain. For the singularity γ = 0.1, Fig. 3 presents the permeability distribution. η(θ) is given as
\[ \eta(\theta) = \begin{cases} \cos[(\pi/2 - \sigma)\gamma]\,\cos[(\theta - \pi/2 + \rho)\gamma], & \theta \in [0, \pi/2], \\ \cos(\rho\gamma)\,\cos[(\theta - \pi + \sigma)\gamma], & \theta \in [\pi/2, \pi], \\ \cos(\sigma\gamma)\,\cos[(\theta - \pi - \rho)\gamma], & \theta \in [\pi, 3\pi/2], \\ \cos[(\pi/2 - \rho)\gamma]\,\cos[(\theta - 3\pi/2 - \sigma)\gamma], & \theta \in [3\pi/2, 2\pi], \end{cases} \qquad (11) \]
and the parameters R, γ, ρ and σ satisfy the nonlinear equations
\[ R = -\tan[(\pi - \sigma)\gamma]\,\cot(\rho\gamma), \qquad 1/R = -\tan(\rho\gamma)\,\cot(\sigma\gamma), \qquad R = -\tan(\sigma\gamma)\,\cot[(\pi/2 - \rho)\gamma], \qquad (12) \]
under the nonlinear constraints
\[ 0 < \gamma < 2, \qquad \max\{0, \pi\gamma - \pi\} < 2\gamma\rho < \min\{\pi\gamma, \pi\}, \qquad \max\{0, \pi - \pi\gamma\} < -2\gamma\sigma < \min\{\pi, 2\pi - \pi\gamma\}. \qquad (13) \]
The constrained nonlinear equations (12) can be solved for the parameters R, σ, and ρ by Newton's iteration for different degrees of singularity γ. The analytical solution p(r, θ) satisfies the usual interface conditions: p and K ∂p/∂n are continuous across the interfaces. It can be shown that the solution p belongs to the fractional Sobolev space H^{1+κ}(Ω) with κ < γ [10]. Let the singularity be γ = 0.1. Parameters that satisfy the relations (12) under the constraints (13) are R ≈ 161.4476, ρ ≈ 0.7854 and σ ≈ −14.9225.
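As an illustration, the system (12) can be solved for (R, ρ, σ) at a given γ with a standard Newton-type solver; the starting guess below is an assumption and must be chosen compatibly with the constraints (13):

import numpy as np
from scipy.optimize import fsolve

def residual(u, gamma):
    # Residuals of the three relations in (12); a root gives (R, rho, sigma).
    R, rho, sigma = u
    return [R + np.tan((np.pi - sigma) * gamma) / np.tan(rho * gamma),
            1.0 / R + np.tan(rho * gamma) / np.tan(sigma * gamma),
            R + np.tan(sigma * gamma) / np.tan((np.pi / 2.0 - rho) * gamma)]

gamma = 0.1
R, rho, sigma = fsolve(residual, [100.0, 0.8, -10.0], args=(gamma,))
print(R, rho, sigma)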
The permeability distribution is shown in Figure 3. The exact solution belongs to the fractional Sobolev space H^{1+κ} with κ < 0.1.
Fig. 2. Domain Ω is divided into four subdomains Ω_i, i = 1, ..., 4, according to the permeability (K_1 = K_3 ≈ 161.45 I, K_2 = K_4 ≈ I).
Fig. 3. Permeability distribution for the singularities γ = 0.1 and γ = 0.1269. The solution is singular at O = (0, 0).
Fig. 4. Surface plots of the exact solution and of the error for the singularity parameter γ = 0.1: (a) the exact solution given by equation (10); (b) a surface plot of the error (u − u_h)/||u||_{L∞}.
We have solved this problem on adaptive and uniform meshes. The outcome of our numerical work is reported in Figs. 4 and 5. Figure 4(a) is a surface plot of the exact solution; the solution is singular at the origin. Figure 4(b) presents a surface plot of the error, which is seen to be largest at the singularity. Figure 5 compares the convergence behaviour on adaptive and uniform meshes in the L∞ norm. We did not notice any convergence in the L∞ norm on uniform meshes up to one million degrees of freedom. A similar behaviour was also observed in [11] on uniform meshes for singular problems, and it was suggested there that adaptive meshes may be ideal for such solutions. On adaptive meshes we obtain ||p − p_h||_{L∞} ≈ DOF^{−P/2} with convergence rate P ≈ 1 (see Figure 5). Because of the regularity of the solution, this convergence is quasi-optimal [9,8].
Fig. 5. ||p − p_h||_{L∞} versus degrees of freedom on adaptive and uniform meshes; on adaptive meshes ||p − p_h||_{L∞} ≈ DOF^{−1/2}.
Fig. 6. ||p − p_h||_{L∞} versus degrees of freedom on adaptive and uniform meshes; on adaptive meshes ||p − p_h||_{L∞} ≈ DOF^{−1/2}.
Let the singularity be γ = 0.1269. Parameters that satisfy the relations (12) under the constraints (13) are R ≈ 99.999999, ρ ≈ 0.7853982 and σ ≈ −11.59263.
The exact solution belongs to the fractional Sobolev space H^{1.126}. Figure 6 compares the convergence behaviour of the finite volume method on adaptive and uniform meshes. Again, we did not observe any convergence up to one million degrees of freedom on uniform meshes (there is some convergence during the last refinement, see Figure 6).
Fig. 7. Decrease of the stopping criterion ξ_k/ξ_0 in Algorithm 1 with adaptive refinement, plotted against the degrees of freedom for γ ≈ 0.10 and γ ≈ 0.13.
Fig. 8. Robustness Γ (defined by Equation (8)) of the adaptivity index for finding the cells with most error, plotted against the iterations of the algorithm [Iter]. Solutions are in the spaces H^{1.126902} and H^{1.1}.
On adaptive meshes we still obtain ||p − p_h||_{L∞} ≈ DOF^{−P/2} with P ≈ 1. Figure 8 is a plot of the robustness against the iterations of the adaptive algorithm. As can be seen in Figure 8, the robustness is almost always equal to 1.0 over all adaptive iterations. This means that the cells with the maximum point-wise error and the cells with the maximum value of the error indicator given by Equation (6) are the same.
References
1. Khattri, S.K.: Nonlinear elliptic problems with the method of finite volumes. Differential Equations and Nonlinear Mechanics, Article ID 31797, 16 pages (2006), doi:10.1155/DENM/2006/31797
2. Khattri, S.K.: Newton-Krylov Algorithm with Adaptive Error Correction for the Poisson-Boltzmann Equation. MATCH Commun. Math. Comput. Chem. 1, 197–208 (2006)
3. Khattri, S.K., Fladmark, G.: Which Meshes Are Better Conditioned: Adaptive, Uniform, Locally Refined or Locally Adjusted? In: Alexandrov, V.N., van Albada, G.D., Sloot, P.M.A., Dongarra, J. (eds.) ICCS 2006. LNCS, vol. 3992, pp. 102–105. Springer, Heidelberg (2006)
4. Khattri, S.K.: Analyzing Finite Volume for Single Phase Flow in Porous Media. Journal of Porous Media 10, 109–123 (2007)
5. Khattri, S.K., Hellevang, H., Fladmark, G.E., Kvamme, B.: Simulation of long-term fate of CO2 in the sand of Utsira. Journal of Porous Media (to be published)
6. Khattri, S.K.: Grid generation and adaptation by functionals. Computational and Applied Mathematics 26, 1–15 (2007)
7. Khattri, S.K.: Numerical Tools for Multicomponent, Multiphase, Reactive Processes: Flow of CO2 in Porous Media. PhD Thesis, The University of Bergen (2006)
8. Morin, P., Nochetto, R.H., Siebert, K.G.: Data oscillation and convergence of adaptive FEM. SIAM J. Numer. Anal. 38, 466–488 (2000)
9. Chen, Z., Dai, S.: On the efficiency of adaptive finite element methods for elliptic problems with discontinuous coefficients. SIAM J. Sci. Comput. 24, 443–462 (2002)
10. Strang, G., Fix, G.J.: An Analysis of the Finite Element Method, vol. 1. Wiley, New York (1973)
11. Eigestad, G., Klausen, R.: On the convergence of the multi-point flux approximation O-method: Numerical experiments for discontinuous permeability. Numerical Methods for Partial Differential Equations 21, 1079–1098 (2005)
12. Aavatsmark, I.: An introduction to multipoint flux approximations for quadrilateral grids. Comput. Geosci. 6, 405–432 (2002)
13. Ewing, R., Lazarov, R., Vassilevski, P.: Local refinement techniques for elliptic problems on cell-centered grids. I. Error analysis. Math. Comp. 56, 437–461 (1991)
14. Ewing, R., Lazarov, R., Vassilevski, P.: Local refinement techniques for elliptic problems on cell-centered grids. III. Algebraic multilevel BEPS preconditioners. Numer. Math. 59, 431–452 (1991)
15. Riviere, B.: Discontinuous Galerkin finite element methods for solving the miscible displacement problem in porous media. PhD Thesis, The University of Texas at Austin (2000)