SIGNALS AND COMMUNICATION TECHNOLOGY
For other titles published in this series, go to http://www.springer.com/series/4748
Keith Jones
The Regularized Fast Hartley Transform Optimal Formulation of Real-Data Fast Fourier Transform for Silicon-Based Implementation in Resource-Constrained Environments
Dr. Keith Jones L-3 Communications TRL Technology Shannon Way, Ashchurch, Tewkesbury Gloucestershire, GL20 8ND, U.K.
ISBN 978-90-481-3916-3
e-ISBN 978-90-481-3917-0
DOI 10.1007/978-90-481-3917-0
Springer Dordrecht Heidelberg London New York
Library of Congress Control Number: 2009944070
© Springer Science+Business Media B.V. 2010
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Cover design: WMXDesign GmbH
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
Most real-world spectrum analysis problems involve the computation of the real-data discrete Fourier transform (DFT), a unitary transform that maps elements of the linear space of real-valued N-tuples, R^N, to elements of its complex-valued counterpart, C^N, and when carried out in hardware it is conventionally achieved via a real-from-complex strategy using a complex-data version of the fast Fourier transform (FFT), the generic name given to the class of fast algorithms used for the efficient computation of the DFT. Such algorithms are typically derived by exploiting the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well. In order to make effective use of a complex-data FFT, however, via the chosen real-from-complex strategy, the input data to the DFT must first be converted from elements of R^N to elements of C^N.

The reason for choosing the computational domain of real-data problems such as this to be C^N, rather than R^N, is due in part to the fact that computing equipment manufacturers have invested so heavily in producing digital signal processing (DSP) devices built around the design of the complex-data fast multiplier and accumulator (MAC), an arithmetic unit ideally suited to the implementation of the complex-data radix-2 butterfly, the computational unit used by the familiar class of recursive radix-2 FFT algorithms. The net result is that the problem of the real-data DFT is effectively being modified so as to match an existing complex-data solution rather than a solution being sought that matches the actual problem.

The increasingly powerful field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies are now giving DSP design engineers far greater control, however, over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate algorithmically-specialized hardware solutions to the real-data DFT may be actively sought and exploited to some advantage with these technologies.

The first part of this monograph thus concerns itself with the design of a new and highly-parallel formulation of the fast Hartley transform (FHT) which is to be used, in turn, for the efficient computation of the DFT. The FHT is the generic name given to the class of fast algorithms used for the efficient computation of the discrete Hartley transform (DHT) – a unitary (and, in fact, orthogonal) transform and close relative of the DFT possessing many of the same properties – which,
for the processing of real-valued data, has attractions over the complex-data FFT in terms of reduced arithmetic complexity and reduced memory requirement. Its bilateral or reversal property also means that it may be straightforwardly applied to the transformation from Hartley space to data space as well as from data space to Hartley space, making it thus equally applicable to the computation of both the DFT and its inverse.

A drawback, however, of conventional FHT algorithms lies in the loss of regularity (as relates to the algorithm structure) arising from the need for two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations, where the regularity equates to the amount of repetition and symmetry present in the design. A generic version of the double butterfly, referred to as the "GD-BFLY" for economy of words, is therefore developed for the radix-4 FHT that overcomes the problem in an elegant fashion. The resulting single-design solution, referred to as the regularized radix-4 FHT and abbreviated to "R24 FHT", lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with parallel computing technology.

A partitioned-memory architecture for the parallel computation of the GD-BFLY and the resulting R24 FHT is next developed and discussed in some detail, this exploiting a single locally-pipelined high-performance processing element (PE) that yields an attractive solution, particularly when implemented with parallel computing technology, that is both area-efficient and scalable in terms of transform length. High performance is achieved by having the PE able to process the input/output data sets to the GD-BFLY in parallel, this in turn implying the need to be able to access simultaneously, and without conflict, both multiple data and multiple twiddle factors, or trigonometric coefficients, from their respective memories. A number of pipelined versions of the PE are described using both fast fixed-point multipliers and phase rotators – where the phase rotation operation is carried out in optimal fashion with hardware-efficient Co-Ordinate Rotation DIgital Computer (CORDIC) arithmetic – which enable arithmetic complexity to be traded off against memory requirement.

The result is a set of scalable designs based upon the partitioned-memory single-PE computing architecture which each yield a hardware-efficient solution with universal application, such that each new application necessitates minimal re-design cost, as well as solutions amenable to efficient implementation with the silicon-based technologies. The resulting area-efficient and scalable single-PE architecture is shown to yield solutions to the real-data radix-4 FFT that are capable of achieving the computational density – that is, the throughput per unit area of silicon – of the most advanced commercially-available complex-data solutions for just a fraction of the silicon resources.

Consideration is given to the fact that when producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost, so that one is seldom blessed with the option of using the latest state-of-the-art device technology.
The most common situation encountered is one where the expectation is to use the smallest (and thus the least expensive) device that is capable of yielding solutions able to meet the performance objectives, which often means using devices that are one, two or even three generations behind the latest specification. As a result, there are situations where there would be great merit
in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources. The designs are thus required to be able to cater for a range of resource-constrained environments where the particular resources being consumed and traded off, one against another, include the programmable logic, the power and the time (update time or latency), as well as the embedded resources already discussed.

The choice of which particular FPGA device to use throughout the monograph for comparative analysis of the various designs is not considered to be of relevance to the results obtained, as the intention is that the attractions of the solutions developed should be valid regardless of the specific device onto which they are mapped – that is, a "good" design should be device-independent. The author is well aware, however, that the intellectual investment made in achieving such a design may seem to fly in the face of current wisdom, whereby the need for good engineering design and practice is avoided through the adoption of ever more powerful (and power consuming) computing devices – no apologies offered.

The monograph, which is based on the fruits of 3 years of applied industrial research in the U.K., is aimed at both practicing DSP engineers with an interest in the efficient hardware implementation of the real-data FFT and academics/researchers/students from engineering, computer science and mathematics backgrounds with an interest in the design and implementation of sequential and parallel FFT algorithms. It is intended to provide the reader with the tools necessary to both understand the new formulation and to implement simple design variations that offer clear implementational advantages, both theoretical and practical, over more conventional complex-data solutions to the problem. The highly-parallel formulation of the real-data FFT described in the monograph will be shown to lead to scalable and device-independent solutions to the latency-constrained version of the problem which are able to optimize the use of the available silicon resources, and thus to maximize the achievable computational density, thereby making the solution a genuine advance in the design and implementation of high-performance parallel FFT algorithms.

L-3 Communications TRL Technology, Shannon Way, Ashchurch, Tewkesbury, Gloucestershire, GL20 8ND, U.K.
Dr. Keith Jones
Acknowledgements
Firstly, and most importantly, the author wishes to thank his wife and partner in crime, Deborah, for her continued support for the project which has occupied most of his free time over the past 12 months or so, time that would otherwise have been spent together doing more enjoyable things. Secondly, given his own background as an industrial mathematician, the author gratefully acknowledges the assistance of Andy Beard of TRL Technology, who has painstakingly gone through the manuscript clarifying those technology-based aspects of the research least familiar to the author, namely those relating to the ever-changing world of the FPGA, thereby enabling the author to provide a more comprehensible interpretation of certain aspects of the results. Finally, the author wishes to thank Mark de Jongh, the Senior Publishing Editor in Electrical Engineering at Springer, together with his management colleagues at Springer, for seeing the potential merit in the research and providing the opportunity of sharing the results with you in this monograph.
Contents

1 Background to Research
  1.1 Introduction
  1.2 The DFT and Its Efficient Computation
  1.3 Twentieth Century Developments of the FFT
  1.4 The DHT and Its Relation to the DFT
  1.5 Attractions of Computing the Real-Data DFT via the FHT
  1.6 Modern Hardware-Based Parallel Computing Technologies
  1.7 Hardware-Based Arithmetic Units
  1.8 Performance Metrics
  1.9 Basic Definitions
  1.10 Organization of the Monograph
  References

2 Fast Solutions to Real-Data Discrete Fourier Transform
  2.1 Introduction
  2.2 Real-Data FFT Algorithms
    2.2.1 The Bergland Algorithm
    2.2.2 The Bruun Algorithm
  2.3 Real-From-Complex Strategies
    2.3.1 Computing One Real-Data DFT via One Full-Length Complex-Data FFT
    2.3.2 Computing Two Real-Data DFTs via One Full-Length Complex-Data FFT
    2.3.3 Computing One Real-Data DFT via One Half-Length Complex-Data FFT
  2.4 Data Re-ordering
  2.5 Discussion
  References

3 The Discrete Hartley Transform
  3.1 Introduction
  3.2 Normalization of DHT Outputs
  3.3 Decomposition into Even and Odd Components
  3.4 Connecting Relations Between DFT and DHT
    3.4.1 Real-Data DFT
    3.4.2 Complex-Data DFT
  3.5 Fundamental Theorems for DFT and DHT
    3.5.1 Reversal Theorem
    3.5.2 Addition Theorem
    3.5.3 Shift Theorem
    3.5.4 Convolution Theorem
    3.5.5 Product Theorem
    3.5.6 Autocorrelation Theorem
    3.5.7 First Derivative Theorem
    3.5.8 Second Derivative Theorem
    3.5.9 Summary of Theorems
  3.6 Fast Solutions to DHT
  3.7 Accuracy Considerations
  3.8 Discussion
  References

4 Derivation of the Regularized Fast Hartley Transform
  4.1 Introduction
  4.2 Derivation of the Conventional Radix-4 Butterfly Equations
  4.3 Single-to-Double Conversion of the Radix-4 Butterfly Equations
  4.4 Radix-4 Factorization of the FHT
  4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
    4.5.1 Twelve-Multiplier Version of Generic Double Butterfly
    4.5.2 Nine-Multiplier Version of Generic Double Butterfly
  4.6 Trigonometric Coefficient Storage, Accession and Generation
    4.6.1 Minimum-Arithmetic Addressing Scheme
    4.6.2 Minimum-Memory Addressing Scheme
    4.6.3 Trigonometric Coefficient Generation via Trigonometric Identities
  4.7 Comparative Complexity Analysis with Existing FFT Designs
  4.8 Scaling Considerations for Fixed-Point Implementation
  4.9 Discussion
  References

5 Algorithm Design for Hardware-Based Computing Technologies
  5.1 Introduction
  5.2 The Fundamental Properties of FPGA and ASIC Devices
  5.3 Low-Power Design Techniques
    5.3.1 Clock Frequency
    5.3.2 Silicon Area
    5.3.3 Switching Frequency
  5.4 Proposed Hardware Design Strategy
    5.4.1 Scalability of Design
    5.4.2 Partitioned-Memory Processing
    5.4.3 Flexibility of Design
  5.5 Constraints on Available Resources
  5.6 Assessing the Resource Requirements
  5.7 Discussion
  References

6 Derivation of Area-Efficient and Scalable Parallel Architecture
  6.1 Introduction
  6.2 Single-PE Versus Multi-PE Architectures
  6.3 Conflict-Free Parallel Memory Addressing Schemes
    6.3.1 Data Storage and Accession
    6.3.2 Trigonometric Coefficient Storage, Accession and Generation
  6.4 Design of Pipelined PE for Single-PE Architecture
    6.4.1 Internal Pipelining of Generic Double Butterfly
    6.4.2 Space Complexity Considerations
    6.4.3 Time Complexity Considerations
  6.5 Performance and Requirements Analysis of FPGA Implementation
  6.6 Constraining Latency Versus Minimizing Update-Time
  6.7 Discussion
  References

7 Design of Arithmetic Unit for Resource-Constrained Solution
  7.1 Introduction
  7.2 Accuracy Considerations
  7.3 Fast Multiplier Approach
  7.4 CORDIC Approach
    7.4.1 CORDIC Formulation of Complex Multiplier
    7.4.2 Parallel Formulation of CORDIC-Based PE
    7.4.3 Discussion of CORDIC-Based Solution
    7.4.4 Logic Requirement of CORDIC-Based PE
  7.5 Comparative Analysis of PE Designs
  7.6 Discussion
  References

8 Computation of 2^n-Point Real-Data Discrete Fourier Transform
  8.1 Introduction
  8.2 Computing One DFT via Two Half-Length Regularized FHTs
    8.2.1 Derivation of 2^n-Point Real-Data FFT Algorithm
    8.2.2 Implementational Considerations
  8.3 Computing One DFT via One Double-Length Regularized FHT
    8.3.1 Derivation of 2^n-Point Real-Data FFT Algorithm
    8.3.2 Implementational Considerations
  8.4 Discussion
  References

9 Applications of Regularized Fast Hartley Transform
  9.1 Introduction
  9.2 Fast Transform-Space Convolution and Correlation
  9.3 Up-Sampling and Differentiation of Real-Valued Signal
    9.3.1 Up-Sampling via Hartley Space
    9.3.2 Differentiation via Hartley Space
    9.3.3 Combined Up-Sampling and Differentiation
  9.4 Correlation of Two Arbitrary Signals
    9.4.1 Computation of Complex-Data Correlation via Real-Data Correlation
    9.4.2 Cross-Correlation of Two Finite-Length Data Sets
    9.4.3 Auto-Correlation: Finite-Length Against Infinite-Length Data Sets
    9.4.4 Cross-Correlation: Infinite-Length Against Infinite-Length Data Sets
    9.4.5 Combining Functions in Hartley Space
  9.5 Channelization of Real-Valued Signal
    9.5.1 Single Channel: Fast Hartley-Space Convolution
    9.5.2 Multiple Channels: Conventional Polyphase DFT Filter Bank
  9.6 Discussion
  References

10 Summary and Conclusions
  10.1 Outline of Problem Addressed
  10.2 Summary of Results
  10.3 Conclusions

Appendix A Computer Program for Regularized Fast Hartley Transform
  A.1 Introduction
  A.2 Description of Functions
    A.2.1 Control Routine
    A.2.2 Generic Double Butterfly Routines
    A.2.3 Address Generation and Data Re-ordering Routines
    A.2.4 Data Memory Accession and Updating Routines
    A.2.5 Trigonometric Coefficient Generation Routines
    A.2.6 Look-Up-Table Generation Routines
    A.2.7 FHT-to-FFT Conversion Routines
  A.3 Brief Guide to Running the Program
  A.4 Available Scaling Strategies

Appendix B Source Code Listings for Regularized Fast Hartley Transform
  B.1 Listings for Main Program and Signal Generation Routine
  B.2 Listings for Pre-processing Functions
  B.3 Listings for Processing Functions

Glossary

Index
Biography
Keith John Jones is a Chartered Mathematician (C.Math.) and Fellow of the Institute of Mathematics & Its Applications (F.I.M.A.), UK, having obtained a B.Sc. Honours degree in Mathematics from the University of London in 1974 as an external student, an M.Sc. in Applicable Mathematics from Cranfield Institute of Technology in 1977, and a Ph.D. in Computer Science from Birkbeck College, University of London, in 1992, again as an external student. The Ph.D. was awarded primarily for research into the design of novel systolic processor array architectures for the parallel computation of the DFT. Dr. Jones currently runs a mathematical/software consultancy in Weymouth, Dorset, with his wife Deborah, as well as being employed as a part-time consultant with TRL Technology in Tewkesbury, Gloucestershire, where he is engaged in the design and implementation of high-performance digital signal processing algorithms and systems for wireless communications. Dr. Jones has published widely in the signal processing and sensor array processing fields, having a particular interest in the application of number theory, algebra, and nonstandard arithmetic techniques to the design of low-complexity algorithms and circuits for efficient implementation with suitably defined parallel computing architectures. Dr. Jones also holds a number of patents in these fields. Dr. Jones has been named in both “Who’s Who in Science and Engineering” and the “Dictionary of International Biography” since 2008.
Chapter 1
Background to Research
Abstract This chapter provides the background to the research results discussed in the monograph that relate to the design and implementation of the regularized FHT. Following a short historical account of the role of the DFT in modern science, a case is made for the need for highly-parallel FFT algorithms geared specifically to the processing of real-valued data for use in the type of resource-constrained (both silicon and power) environments encountered in mobile communications. The relation of the DHT to the DFT is given and the possible benefits of using a highly-parallel formulation of the FHT for solving the real-data DFT problem are discussed. This is followed by an account of the parallel computing technologies now available via the FPGA and the ASIC with which such a formulation of the problem might be efficiently implemented. A hardware-efficient arithmetic unit is also discussed which can yield a flexible-precision solution whilst minimizing the memory requirement. A discussion of performance metrics for various computing architectures and technologies is then given, followed by an outline of the organization of the monograph.
1.1 Introduction

The subject of spectrum or harmonic analysis started in earnest with the work of Joseph Fourier (1768–1830), who asserted and proved that an arbitrary function could be represented via a suitable transformation as a sum of trigonometric functions [6]. It seems likely, however, that such ideas were already common knowledge amongst European mathematicians by the time Fourier appeared on the scene, mainly through the earlier work of Joseph Louis Lagrange (1736–1813) and Leonhard Euler (1707–1783), with the first appearance of the discrete version of this transformation, the discrete Fourier transform (DFT) [36, 39], dating back to Euler's investigations of sound propagation in elastic media in 1750 and to the astronomical work of Alexis Claude Clairaut (1713–1765) in 1754 [24]. The DFT is now widely used in many branches of science, playing in particular a central role in the field of digital signal processing (DSP) [36, 39], enabling digital signals – namely those that have been both sampled and quantized – to be viewed in the frequency domain
where, compared to the time domain, the information contained in the signal may often be more easily extracted and/or displayed, or where many common DSP functions, such as that of the finite impulse response (FIR) filter or the matched filter [36, 39], may be more easily or efficiently carried out. The monograph is essentially concerned with the problem of computing the DFT, via the application of various factorization techniques, using silicon-based parallel computing equipment – as typified by field-programmable gate array (FPGA) and application-specific integrated circuit (ASIC) technologies [31] – bearing in mind the size and power constraints relevant to the particular field of interest, namely that of mobile communications, where a small battery may be the only source of power supply for long periods of time. The monograph looks also to exploit the fact that the measurement data, as with many real-world problems, is real valued in nature, with each sample of data thus belonging to R, the field of real numbers [4], although the restriction to fixed-point implementations limits the range of interest still further to that of Z, the commutative ring of integers [4].
1.2 The DFT and Its Efficient Computation

Turning firstly to its definition, the DFT is a unitary transform [17], which for the case of N input/output samples, may be expressed in normalized form via the equation

$$X^{(F)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n] \cdot W_N^{nk}, \qquad k = 0, 1, \ldots, N-1 \tag{1.1}$$

where the input/output data vectors belong to C^N, the linear space of complex-valued N-tuples [4], and the transform kernel – also known as the Fourier Matrix and which is, as one would expect, a function of both the input and output data indices – derives from the term

$$W_N = \exp(-i 2\pi / N), \qquad i = \sqrt{-1}, \tag{1.2}$$

the primitive Nth complex root of unity [4, 32, 34]. The unitary nature of the DFT means that the inverse of the Fourier Matrix is equal to its conjugate-transpose, whilst its columns form an orthogonal basis [6, 7, 17] – similarly, a transform is said to be orthogonal when the inverse of the transform matrix is equal simply to its transpose, as is the case with any real-valued kernel. Note that the multiplication of any power of the term $W_N$ by any number belonging to C, the field of complex numbers [4], simply results in a phase shift of that number – the amplitude or magnitude remains unchanged.

The direct computation of the N-point DFT as defined above involves $O(N^2)$ arithmetic operations, so that many of the early scientific problems involving the DFT could not be seriously attacked without access to fast algorithms for its efficient solution, where the key to the design of such algorithms is the identification and
exploitation of the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well.

One early area of activity with such transforms involved astronomical calculations, and in the early part of the nineteenth century the great Carl Friedrich Gauss (1777–1855) used the DFT for the interpolation of asteroidal orbits from a finite set of equally-spaced observations [24]. He developed a fast two-factor algorithm for its computation that was identical to that described in 1965 by James Cooley and John Tukey [12] – as with many of Gauss's greatest ideas, however, the algorithm was never published outside of his collected works and only then in an obscure Latin form. This algorithm, which for a transform length of $N = N_1 \times N_2$ involves just $O((N_1 + N_2) \times N)$ arithmetic operations, was probably the first member of the class of algorithms now commonly referred to as the fast Fourier transform (FFT) [5, 6, 9, 12, 17, 35], which is unquestionably the most ubiquitous algorithm in use today for the analysis or manipulation of digital data. In fact, Gauss is known to have first used the above-mentioned two-factor FFT algorithm for the solution of the DFT as far back as 1805, the same year that Admiral Nelson routed the French fleet at the Battle of Trafalgar – interestingly, Fourier served in Napoleon Bonaparte's army from 1798 to 1801, during its invasion of Egypt, acting as scientific advisor.

Although the DFT, as defined above, allows for both the input and output data sets to be complex valued (possessing both amplitude and phase), many real-world spectrum analysis problems, including those addressed by Gauss, involve only real-valued (possessing amplitude only) input data, so that there is a genuine need for the identification of a subset of the class of FFTs that are able to exploit this fact – bearing in mind that real-valued data leads to a Hermitian-symmetric (or conjugate-symmetric) frequency spectrum:

complex-data FFT ⇒ exploitation of kernel symmetry, whilst
real-data FFT ⇒ exploitation of kernel & spectral symmetries,

with the exploitation of symmetry in the transform kernel being typically achieved by invoking the property of periodicity and of the Shift Theorem, as will be discussed later in the monograph. There is a requirement, in particular, for the development of real-data FFT algorithms which retain the regularity – as relates to the algorithm structure – of their complex-data counterparts, as regular algorithms lend themselves more naturally to an efficient implementation. Regularity, which equates to the amount of repetition and symmetry present in the design, is most straightforwardly achieved through the adoption of fixed-radix formulations, such as with the familiar radix-2 and radix-4 algorithms [9, 11], as this essentially reduces the FFT design to that of a single fixed-radix butterfly, the computational engine used for carrying out the repetitive arithmetic operations. Note that with such a formulation the radix actually corresponds to the size of the resulting butterfly, although in Chapter 8 it is seen how a DFT, whose length is a power of two but not a power of four, may be solved by means of a highly-optimized radix-4 butterfly.
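To make the $O(N^2)$ cost of direct evaluation concrete, the following C fragment – a minimal illustrative sketch written for this discussion, not code taken from the monograph's appendices – computes Eq. (1.1) for real-valued input by expanding the kernel $W_N^{nk}$ into its cosine and sine components:

```c
#include <math.h>

/* Direct O(N^2) evaluation of the normalized real-data DFT of Eq. (1.1).
   Illustrative sketch only: the kernel W_N^{nk} = exp(-i*2*pi*nk/N) is
   expanded into its cosine (real) and sine (imaginary) components. */
void direct_dft_real(const double *x, double *Xre, double *Xim, int N)
{
    const double PI = 3.14159265358979323846;
    const double scale = 1.0 / sqrt((double)N);

    for (int k = 0; k < N; k++) {
        double sum_re = 0.0, sum_im = 0.0;
        for (int n = 0; n < N; n++) {
            double arg = 2.0 * PI * (double)n * (double)k / (double)N;
            sum_re += x[n] * cos(arg);   /* contribution of Re{W_N^{nk}} */
            sum_im -= x[n] * sin(arg);   /* minus sign from exp(-i...)   */
        }
        Xre[k] = scale * sum_re;
        Xim[k] = scale * sum_im;
    }
}
```

By way of rough comparison, for N = 1,024 the doubly-nested loop above entails on the order of a million kernel evaluations, whereas Gauss's two-factor decomposition with $N_1 = N_2 = 32$ reduces the count to the order of $(32 + 32) \times 1{,}024 \approx 65{,}500$ operations.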
An additional attraction of fixed-radix FFT formulations, which for an arbitrary radix "R" decompose an N-point DFT into $\log_R N$ temporal stages each comprising N/R radix-R butterflies, is that they lend themselves naturally to a parallel solution. Such decompositions may be defined over the dimensions of either space – facilitating its mapping onto a single-instruction multiple-data (SIMD) architecture [1] – or time – facilitating its mapping, via the technique of pipelining, onto a systolic architecture [1, 29] – that enables them to be efficiently mapped onto one of the increasingly more accessible/affordable parallel computing technologies. With the systolic solution, each stage of the computational pipeline – referred to hereafter as a computational stage (CS) – corresponds to that of a single temporal stage. A parallel solution may also be defined over both space and time dimensions, which would involve a computational pipeline where each CS of the pipeline involves the parallel execution of the associated butterflies via SIMD-based processing – such an architecture being often referred to in the computing literature as parallel-pipelined.
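By way of illustration, the following C fragment sketches the familiar complex-data radix-2 DIT butterfly – the repetitive computational unit referred to above, and not to be confused with the double butterfly developed later in the monograph; the function name and calling convention are assumptions made purely for exposition:

```c
/* One complex-data radix-2 DIT butterfly: the twiddle factor w is
   applied to the second input, then a sum/difference pair is formed
   in place:  a' = a + w*b,  b' = a - w*b.  An N-point radix-2 FFT
   applies log2(N) stages of N/2 such butterflies. */
static void radix2_dit_butterfly(double *a_re, double *a_im,
                                 double *b_re, double *b_im,
                                 double w_re, double w_im)
{
    double t_re = w_re * (*b_re) - w_im * (*b_im);  /* Re{w*b} */
    double t_im = w_re * (*b_im) + w_im * (*b_re);  /* Im{w*b} */
    *b_re = *a_re - t_re;
    *b_im = *a_im - t_im;
    *a_re += t_re;
    *a_im += t_im;
}
```

For a radix-4 formulation the same pattern holds with $\log_4 N$ stages of N/4 four-input butterflies, which is the fixed-radix structure upon which the regularized FHT of later chapters builds.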
1.3 Twentieth Century Developments of the FFT

As far as modern-day developments in FFT design are concerned it is the names of Cooley and Tukey that are always mentioned in any historical account, but this does not really do justice to the many contributors from the first half of the twentieth century whose work was simply not picked up on, or appreciated, at the time of publication. The prime reason for such a situation was the lack of a suitable technology for their efficient implementation, this remaining the case until the advent of the semiconductor technology of the 1960s. Early pioneering work was carried out by the German mathematician Carl Runge [40], who in 1903 recognized that the periodicity of the DFT kernel could be exploited to enable the computation of a 2N-point DFT to be expressed in terms of the computation of two N-point DFTs, this factorization technique being subsequently referred to as the "doubling algorithm". The Cooley–Tukey algorithm, which does not rely on any specific factorization of the transform length, may thus be viewed as a simple generalization of this algorithm, as successive application of the doubling algorithm leads straightforwardly to the radix-2 version of the Cooley–Tukey algorithm. Runge's influential work was subsequently picked up and popularized in publications by Karl Stumpff [45] in 1939 and Gordon Danielson and Cornelius Lanczos [13] in 1942, each in turn making contributions of their own to the subject. Danielson and Lanczos, for example, produced reduced-complexity solutions to the DFT through the exploitation of symmetries in the transform kernel, whilst Stumpff discussed versions of both the "doubling algorithm" and the "tripling algorithm". All of the techniques developed, including those of more recent origin such as the "nesting algorithm" of Shmuel Winograd [49] and the "split-radix algorithm" of Pierre Duhamel [15], rely upon the divide-and-conquer [28] principle, whereby the computation of a composite length DFT is broken down into that of a number of smaller DFTs where the small-DFT lengths correspond to the factors
of the original transform length. Depending upon the particular factorization of the transform length, this process may be repeated in a recursive fashion on the increasingly smaller DFTs. When the lengths of the small DFTs have common factors, as encountered with the familiar fixed-radix formulations, then between successive stages of small DFTs there will be a need for the intermediate results to be modified by elements of the Fourier Matrix, these terms being commonly referred to in the FFT literature as twiddle factors. When the algorithm in question is a fixed-radix algorithm of the decimation-in-time (DIT) type [9], whereby the sequence of data space samples is decomposed into successively smaller sub-sequences, then the twiddle factors are applied to the inputs to the butterflies, whereas when the fixed-radix algorithm is of the decimation-in-frequency (DIF) type [9], whereby the sequence of transform space samples is decomposed into successively smaller sub-sequences, then the twiddle factors are applied to the outputs to the butterflies.

Note, however, that when the lengths of the small DFTs have no common factors at all – that is, when they are relatively prime [4, 32] – then the need for the twiddle factor application disappears as each factor becomes equal to one. This particular result was made possible through the development of a new number-theoretic data re-ordering scheme in 1958 by the statistician Jack Good [20], the scheme being based upon the ubiquitous Chinese Remainder Theorem (CRT) [32, 34, 35] – which for the interest of those readers of a more mathematical disposition provides a means of obtaining a unique solution to a set of simultaneous linear congruences (for example, the pair of congruences $n \equiv 1 \pmod{3}$ and $n \equiv 2 \pmod{5}$ has the unique solution $n = 7$ amongst the integers $0, 1, \ldots, 14$) – whose origins supposedly date back to the first century A.D. [14]. Note also that in the FFT literature, the class of fast algorithms based upon the decomposition of a composite length DFT into smaller DFTs whose lengths have common factors – such as the Cooley–Tukey algorithm – is often referred to as the Common Factor Algorithm (CFA), whereas the class of fast algorithms based upon the decomposition of a composite length DFT into smaller DFTs whose lengths are relatively prime is often referred to as the Prime Factor Algorithm (PFA).

Before moving on from this brief historical discussion, it is worth returning to the last name mentioned, namely that of Jack Good, as his background is a particularly interesting one for anyone with an interest in the history of computing. During World War Two Good served at Bletchley Park in Buckinghamshire, England, working alongside Alan Turing [25] on, amongst other things, the decryption of messages produced by the Enigma machine [19] – as used by the German armed forces. At the same time, and on the same site, a team of engineers under the leadership of Tom Flowers [19] – all seconded from the Post Office Research Establishment at Dollis Hill in North London – were, unbeknown to the outside world, developing the world's first electronic computer, the Colossus [19], under the supervision of Turing and Cambridge mathematician Max Newman. The Colossus was built primarily to automate various essential code breaking tasks such as the cracking of the Lorenz code used by Adolf Hitler to communicate with his generals and was the first serious device – albeit a very large and a very specialized one – on the path towards our current state of technology whereby entire signal processing systems may be mapped onto a single silicon chip.
1.4 The DHT and Its Relation to the DFT

A close relative of the Fourier Transform is that of the Hartley Transform, as introduced by Ralph Hartley (1890–1970) in 1942 for the analysis of transient and steady state transmission problems [23]. The discrete version of this unitary transform is referred to as the discrete Hartley transform (DHT) [8], which for the case of N input/output samples, may be expressed in normalized form via the equation

$$X^{(H)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n] \cdot \mathrm{cas}(2\pi nk/N), \qquad k = 0, 1, \ldots, N-1 \tag{1.3}$$

where the input/output data vectors belong to R^N, the linear space of real-valued N-tuples, and the transform kernel – also known as the Hartley Matrix and which is, as one would expect, a function of both the input and output data indices – is as given by

$$\mathrm{cas}(2\pi nk/N) \equiv \cos(2\pi nk/N) + \sin(2\pi nk/N). \tag{1.4}$$

Note that as the elements of the Hartley Matrix – as given by the "cas" function – are all real valued, the DHT is orthogonal, as well as unitary, with the columns of the matrix forming an orthogonal basis. Unlike the DFT, the DHT has no natural interpretation as a frequency spectrum, its most natural use being as a means for computing the DFT and as such, fast solutions to the DHT, which are referred to generically as the fast Hartley transform (FHT) [7, 8, 43], have become increasingly popular as an alternative to the FFT for the efficient computation of the DFT. The FHT is particularly attractive for the case of real-valued data, its applicability being made possible by the fact that all of the familiar properties associated with the DFT, such as the Circular Convolution Theorem and the Shift Theorem, are also applicable to the DHT, and that the complex-valued DFT output set and real-valued DHT output set may each be simply obtained, one from the other. To see the truth of this, note that the equality

$$\mathrm{cas}(2\pi nk/N) = \mathrm{Re}\left\{W_N^{nk}\right\} - \mathrm{Im}\left\{W_N^{nk}\right\} \tag{1.5}$$

(where "Re" stands for the real component and "Im" for the imaginary component) relates the kernels of the two transformations, both of which are periodic with period $2\pi$. As a result

$$X^{(H)}[k] = \mathrm{Re}\left\{X^{(F)}[k]\right\} - \mathrm{Im}\left\{X^{(F)}[k]\right\}, \tag{1.6}$$

which expresses the DHT output in terms of the DFT output, whilst

$$\mathrm{Re}\left\{X^{(F)}[k]\right\} = \frac{1}{2}\left(X^{(H)}[N-k] + X^{(H)}[k]\right) \tag{1.7}$$
and

$$\mathrm{Im}\left\{X^{(F)}[k]\right\} = \frac{1}{2}\left(X^{(H)}[N-k] - X^{(H)}[k]\right), \tag{1.8}$$

which express the real and imaginary components of the DFT output, respectively, in terms of the DHT output.
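As a concrete illustration of Eqs. (1.3)–(1.8), the following C fragment – a minimal sketch written for this discussion and not taken from the monograph's appendices – evaluates the DHT directly from its definition and then recovers the real and imaginary components of the DFT output, with the index N − k taken modulo N so that the k = 0 term maps onto itself:

```c
#include <math.h>

/* Direct O(N^2) evaluation of the normalized DHT of Eq. (1.3);
   illustrative sketch only. */
void direct_dht(const double *x, double *XH, int N)
{
    const double PI = 3.14159265358979323846;
    const double scale = 1.0 / sqrt((double)N);
    for (int k = 0; k < N; k++) {
        double sum = 0.0;
        for (int n = 0; n < N; n++) {
            double arg = 2.0 * PI * (double)n * (double)k / (double)N;
            sum += x[n] * (cos(arg) + sin(arg));   /* cas kernel, Eq. (1.4) */
        }
        XH[k] = scale * sum;
    }
}

/* DHT outputs -> DFT real/imaginary components via Eqs. (1.7)-(1.8). */
void dht_to_dft(const double *XH, double *Xre, double *Xim, int N)
{
    for (int k = 0; k < N; k++) {
        int nk = (N - k) % N;               /* N-k taken modulo N */
        Xre[k] = 0.5 * (XH[nk] + XH[k]);    /* Eq. (1.7) */
        Xim[k] = 0.5 * (XH[nk] - XH[k]);    /* Eq. (1.8) */
    }
}
```

Note that only the single real-valued array XH need be stored between the two steps, which is the source of the halved memory requirement discussed in the following section.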
1.5 Attractions of Computing the Real-Data DFT via the FHT

Although applicable to the computation of the DFT for both real-valued and complex-valued data, the major computational advantage of the FHT over the FFT, as implied above, lies in the processing of real-valued data. As most real-world spectrum analysis problems involve only real-valued data, significant performance gains may be obtained by using the FHT without any great loss of generality. This is evidenced by the fact that if one computes the complex-data FFT of an N-point real-valued data sequence, the result will be 2N real-valued (or N complex-valued) samples, one half of which are redundant. The FHT, on the other hand, will produce just N real-valued outputs, thereby requiring only one half as many arithmetic operations and one half the memory requirement for storage of the input/output data. The reduced memory requirement is particularly relevant when the transform length is large and the available resources are limited, as may well be the case with the application area of interest, namely that of mobile communications.

The traditional approach to the DFT problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of real-valued data to complex-valued data via a wideband digital down conversion (DDC) process or through the adoption of a real-from-complex strategy whereby two real-data FFTs are computed simultaneously via one full-length complex-data FFT [42] or where one real-data FFT is computed via one half-length complex-data FFT [11, 42]. Each of the real-from-complex solutions, however, involves a computational overhead when compared to the more direct approach of a real-data FFT in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. With the DDC approach, the information content of short-duration signals may also be compromised through the introduction of unnecessary filtering operations. The reason for such a situation is due in part to the fact that computing equipment manufacturers have invested so heavily in producing DSP devices built around the fast multiplier and accumulator (MAC), an arithmetic unit ideally suited to the implementation of the complex-data radix-2 butterfly, the computational unit used by the familiar class of recursive radix-2 FFT algorithms. The net result is that the problem of the real-data DFT is effectively being modified so as to match an existing complex-data solution rather than a solution being sought that matches the actual problem.
It should be noted that specialized FFT algorithms [2,10,15,16,18,30,33,44,46] do however exist for dealing with the case of real-valued data. Such algorithms compare favourably, in terms of arithmetic complexity and memory requirement, with those of the FHT, but suffer in terms of a loss of regularity and reduced flexibility in that different algorithms are typically required for the computation of the DFT and its inverse. Clearly, in applications requiring transform-space processing followed by a return to the data space, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, which may be straightforwardly applied to the transformation from Hartley space to data space as well as from data space to Hartley space, making it thus equally applicable to the computation of both the DFT and its inverse – this is as a result of the fact that its definitions for the two directions, up to a scaling factor, are identical. A drawback of conventional FHT algorithms [7, 8, 43], however, lies in the need for two sizes – and thus two separate designs – of butterfly for fixed-radix formulations, where a single-sized radix-R butterfly produces R outputs from R inputs and a double-sized radix-R butterfly produces 2R outputs from 2R inputs. A generic version of the double-sized butterfly, referred to as the generic double butterfly and abbreviated hereafter to “GD-BFLY”, is therefore developed in this monograph for the radix-4 FHT which overcomes the problem in an elegant fashion. The resulting single-design radix-4 solution, referred to as the regularized FHT [26] and abbreviated hereafter to “R24 FHT”, lends itself naturally to parallelization [1, 3, 21] and to mapping onto a regular computational structure for implementation with parallel computing technology.
1.6 Modern Hardware-Based Parallel Computing Technologies

The type of high-performance parallel computing equipment referred to above is typified by the increasingly attractive FPGA and ASIC technologies which now give design engineers far greater flexibility and control over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate hardware solutions to the real-data FFT may be actively sought and exploited to some advantage with these silicon-based technologies. With such technologies, however, it is no longer adequate to view the complexity of the FFT purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – particularly fast multipliers – and multiple blocks or banks of fast memory in order to enhance the FFT performance via its parallel computation. As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms.

With the recent and explosive growth of wireless technology, and in particular that of mobile communications, algorithms are now being designed subject to new and often conflicting performance criteria where the ideal is either to maximize the throughput (that is, to minimize the update time) or satisfy some constraint on the latency, whilst at the same time minimizing the required
silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget. Note, however, that the throughput is also constrained by the input–output (I/O) speed, as the algorithm cannot process the data faster than it can access it. Such trade-offs are considered in some considerable detail for the hardware solution to the R24 FHT, with the aim, bearing in mind the target application area of mobile communications, of achieving a power-efficient solution. The adoption of the FHT for wireless communications technology seems particularly apt, given the contribution made by the originator of the Hartley Transform (albeit the continuous version) to the foundation of information theory, where the Shannon–Hartley Theorem [37] helped to establish Shannon’s idea of channel capacity [37, 41]. The theorem simply states that if the amount of digital data or information transmitted over a communication channel is less than the channel capacity, then error-free communication may be achieved, whereas if it exceeds that capacity, then errors in transmission will always occur no matter how well the equipment is designed.
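In its standard form – quoted here for reference and not drawn from the monograph itself – the theorem states that a channel of bandwidth B (in Hz) operating at a signal-to-noise power ratio of S/N supports error-free communication at any information rate below the channel capacity

$$C = B \log_2\left(1 + \frac{S}{N}\right) \ \text{bits per second}.$$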
1.7 Hardware-Based Arithmetic Units

When producing electronic equipment, whether for commercial or military use, great emphasis is inevitably placed upon minimizing the unit cost so that one is seldom blessed with the option of using the latest state-of-the-art device technology. The most common situation encountered is one where the expectation is to use the smallest (and thus the least expensive) device that is capable of yielding solutions able to meet the performance objectives, which often means using devices that are one, two or even three generations behind the latest specification. As a result, there are situations where there would be great merit in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources. One way of achieving such flexibility with the R24 FHT would be through the design of a processing element (PE) that minimizes or perhaps even avoids the need for fast multipliers, or fast memory, or both, according to the availability of the resources on the target device.

Despite the increased use of the hardware-based computing technologies, however, there is still a strong reliance upon the use of software-based techniques for the design of the arithmetic unit. These techniques, as typified by the familiar fast multiplier, are relatively inflexible in terms of the precision they offer and, although increasingly more power efficient, tend to be expensive in terms of silicon resources. There are a number of hardware-based arithmetic techniques available, however, such as the shift-and-add techniques, as typified by the Co-Ordinate Rotation
DIgital Computer (CORDIC) arithmetic unit [47], and the look-up table (LUT) techniques, as typified by the Distributed Arithmetic (DA) unit [48], that date back to the DSP revolution of the mid-twentieth century but nevertheless still offer great attractions for use with the new hardware-based technologies. The CORDIC arithmetic unit, for example, which may be used to carry out in an optimal fashion the operation of phase rotation – the key operation for the computation of the DFT – may be implemented by means of a computational structure whose form may range from fully-sequential to fully-parallel, with the latency of the CORDIC operation increasing linearly with increasing parallelism. The application of the CORDIC technique to the computation of the R24 FHT is considered in this monograph for its ability both to minimize the memory requirement and to yield a flexible-precision solution to the real-data DFT problem.
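By way of illustration, the shift-and-add nature of the CORDIC technique may be conveyed with a short fragment of "C" code – a minimal fixed-point sketch under assumed names and parameters (Q14 scaling, 16 micro-rotations), not a fragment of the monograph's accompanying code – which rotates the vector (x, y) through the angle theta using only shifts, additions/subtractions and a small table of pre-computed micro-rotation angles:

#include <math.h>

#define ITERATIONS 16
#define SCALE (1 << 14)                    /* Q14 fixed-point scaling factor */

/* Rotate the fixed-point vector (x, y) through "theta" radians using only
 * shifts and adds; one micro-rotation is performed per bit of precision. */
void cordic_rotate(int *x, int *y, double theta)
{
    static int atan_tab[ITERATIONS];       /* atan(2^-i) in Q14 radians */
    static int init = 0;
    int i, z, xi, yi, tmp;

    if (!init) {                           /* build the angle table once */
        for (i = 0; i < ITERATIONS; i++)
            atan_tab[i] = (int)(atan(pow(2.0, -i)) * SCALE);
        init = 1;
    }
    z  = (int)(theta * SCALE);             /* residual angle accumulator */
    xi = *x;
    yi = *y;
    for (i = 0; i < ITERATIONS; i++) {
        if (z >= 0) {                      /* rotate by +atan(2^-i) */
            tmp = xi - (yi >> i);
            yi  = yi + (xi >> i);
            z  -= atan_tab[i];
        } else {                           /* rotate by -atan(2^-i) */
            tmp = xi + (yi >> i);
            yi  = yi - (xi >> i);
            z  += atan_tab[i];
        }
        xi = tmp;
    }
    *x = xi;   /* outputs carry the CORDIC gain K = 1.6468...; in hardware */
    *y = yi;   /* the factor 1/K is typically folded into an earlier stage */
}

Unrolled in silicon, each loop iteration becomes one pipeline stage, which is precisely why the computational structure may range from fully-sequential to fully-parallel, with the latency growing linearly in the number of micro-rotations.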
1.8 Performance Metrics
Having introduced and defined the algorithms of interest in this introductory chapter, namely the DFT and its close relation the DHT, as well as discussing very briefly the various types of computing architecture and technology available for the implementation of their fast solutions, via the FFT and the FHT, respectively, it is now worth devoting a little time to considering the type of performance metrics most appropriate to each. For the mapping of such algorithms onto a uni-processor computing device, for example, the performance would typically be assessed according to the following:

Performance Metric for Uni-processor Computing Device: An operation-efficient solution to a discrete unitary or orthogonal transform, when executed on a (Von Neumann) uni-processor sequential computing device, is one which exploits the symmetry of the transform kernel such that the transform may be computed with fewer operations than by direct implementation of its definition.
As is clear from this definition, the idea of identifying and exploiting the property of symmetry, whether it exists just in the transform kernel or, in certain circumstances, in the input data and/or output data as well, is central to the problem of designing fast algorithms for the efficient computation of discrete unitary or orthogonal transforms. For the mapping of such algorithms onto a multi-processor computing device, on the other hand, the performance would typically be assessed according to the following:

Performance Metric for Multi-processor Computing Device: A time-efficient solution to a discrete unitary or orthogonal transform, when executed on a multi-processor parallel computing device, is one which facilitates the execution of many of the operations simultaneously, or in parallel, such that the transform may be computed in less time than by its sequential implementation.
With the availability of multiple processors the idealized objective of a parallel solution is to obtain a linear speed-up in performance which is directly proportional to
the number of processors used although, in reality, with most multi-processor applications such a speed-up is rarely achievable. The main problem relates to the communication complexity arising from the need to move potentially large quantities of data between the processors. Finally, for the mapping of such algorithms onto a silicon-based parallel computing device, the performance would typically be assessed according to the following:

Performance Metric for Silicon-Based Parallel Computing Device: A hardware-efficient solution to a discrete unitary or orthogonal transform, when executed on a silicon-based parallel computing device, is one which facilitates the execution of many of the operations simultaneously, or in parallel, such that the transform throughput per unit area of silicon is maximized.
Although other metrics could of course be used for this definition, this particular metric of throughput per unit area of silicon – referred to hereafter as the computational density – is targeted specifically at the type of power-constrained environment that one would expect to encounter with mobile communications, as it is assumed that a solution that yields a high computational density will be attractive in terms of both power efficiency and hardware efficiency, given the known influence of silicon area – to be discussed in Chapter 5 – on the power consumption.
1.9 Basic Definitions
To clarify the use of a few basic terms, note that the input data to unitary transforms, such as the DFT and the DHT, may be said to belong to the data space which, as already stated, is C^N for the case of the DFT and R^N for the case of the DHT. Analogously, the output data from such transforms may be said to belong to the transform space which for the case of the DFT is referred to as Fourier space and for the case of the DHT is referred to as Hartley space. As already implied, all vectors with an attached superscript of "(F)" will be assumed to reside within Fourier space whilst all those with an attached superscript of "(H)" will be assumed to reside within Hartley space. These definitions will be used throughout the monograph, where appropriate, in order to simplify or clarify the exposition. Also, it has already been stated that for the case of a fixed-radix FFT, the trigonometric elements of the Fourier Matrix, as applied to the appropriate butterfly inputs/outputs, are generally referred to as twiddle factors. However, for consistency, the elements of both the Fourier and Hartley Matrices, as required for the butterflies of their respective decompositions, will be referred to hereafter as the trigonometric coefficients – for the fast solution to both transform types the elements are generally decomposed into pairs of real numbers for efficient application. Finally, note that the curly brackets "{·}" will be used throughout the monograph to denote a finite set or sequence of digital samples, as required, for example, for expressing the input–output relationship for both the DFT and the DHT. Also, the indexing convention generally adopted when using such sequences is that the
elements of a sequence in data space – typically denoted with a lower-case character – are indexed by means of the letter “n”, whereas the elements of a sequence in transform space – typically denoted with an upper-case character – are indexed by means of the letter “k”.
1.10 Organization of the Monograph
The first section of the monograph provides the background information necessary for a better understanding of both the problem being addressed, namely that of the real-data DFT, and of the resulting solution described in the research results that follow. This involves, in Chapter 1, an outline of the problem set in a historical context, followed in Chapter 2 by an account of the real-data DFT and of the fast algorithms and techniques conventionally used for its solution, and in Chapter 3 by a detailed account of the DHT and the class of FHT algorithms used for its fast solution, and of those properties of the DHT that make the FHT of particular interest with regard to the fast solution of the real-data DFT. The middle section of the monograph deals with the novel solution proposed for dealing with the real-data DFT problem. This involves, in Chapter 4, a detailed account of the design and efficient computation of a solution to the DHT based upon the GD-BFLY, namely the regularized FHT or R24 FHT, which lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with silicon-based parallel computing technology. Design constraints and considerations for such technologies are then discussed in Chapter 5 prior to the consideration, in Chapter 6, of different architectures for the mapping of the R24 FHT onto such hardware. A partitioned-memory architecture exploiting a single high-performance PE is identified for the parallel computation of the GD-BFLY and of the resulting R24 FHT [27] whereby both the data and the trigonometric coefficients are partitioned or distributed across multiple banks of memory, referred to hereafter as the data memory (DM) and the trigonometric coefficient memory (CM), respectively. Following this, in Chapter 7, it is seen how the fast multipliers used by the GD-BFLY might in certain circumstances be beneficially replaced by a hardware-based parallel arithmetic unit – based upon CORDIC arithmetic – that is able to yield a flexible-precision solution, without need of trigonometric coefficient memory, when implemented with the proposed hardware-based technology. The final section of the monograph deals with applications of the resulting solution to the real-data DFT problem. This involves, in Chapter 8, an account of how the application of the R24 FHT may be extended to the efficient parallel computation of the real-data DFT whose length is a power of two, but not a power of four, this being followed by its application, in Chapter 9, to the computation of some of the more familiar and computationally-intensive DSP-based functions, such as those of correlation – both auto-correlation and cross-correlation – and of the wideband channelization of real-valued radio frequency (RF) data via the polyphase DFT filter bank [22]. With each such function, which might typically be encountered in that
increasingly important area of wireless communications relating to the geolocation [38] of signal emitters, the adoption of the R24 FHT may potentially result in both conceptually and computationally simplified solutions. The monograph concludes with two appendices which provide both a detailed description and a listing of computer source code, written in the “C” programming language, for all the functions of the proposed partitioned-memory single-PE solution to the R24 FHT, this code being used for proving the mathematical/logical correctness of its operation. The computer program provides the user with various choices of PE design and of storage/accession scheme for the trigonometric coefficients, helping the user to identify how the algorithm might be efficiently mapped onto suitable parallel computing equipment following translation of the sequential “C” code to parallel code as produced by a suitably chosen hardware description language (HDL). The computer code for the complete solution is also to be found on the compact disc (CD) accompanying the monograph. Finally, note that pseudo-code, loosely based on the “C” programming language, will be used throughout the monograph, where appropriate, to illustrate the operation of the R24 FHT and of the individual functions of which the R24 FHT is comprised.
References

1. S.G. Akl, The Design and Analysis of Parallel Algorithms (Prentice Hall, Upper Saddle River, NJ, 1989)
2. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 11(10) (1968)
3. A.W. Biermann, Great Ideas in Computer Science (MIT Press, Cambridge, MA, 1995)
4. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977)
5. R. Blahut, Fast Algorithms for Digital Signal Processing (Addison Wesley, Boston, MA, 1985)
6. R.N. Bracewell, The Fourier Transform and Its Applications (McGraw Hill, New York, 1978)
7. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (1984)
8. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986)
9. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
10. G. Bruun, Z-transform DFT filters and FFTs. IEEE Trans. ASSP 26(1) (1978)
11. J.W. Cooley, P.A.W. Lewis, P.D. Welch, The fast Fourier transform algorithm and its applications. Technical Report RC-1743, IBM (1967)
12. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965)
13. G.C. Danielson, C. Lanczos, Some improvements in practical Fourier series and their application to x-ray scattering from liquids. J. Franklin Inst. 233, 365–380, 435–452 (1942)
14. C. Ding, D. Pei, A. Salomaa, Chinese Remainder Theorem: Applications in Computing, Coding, Cryptography (World Scientific, 1996)
15. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
16. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
17. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982)
18. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 33(4) (1985)
19. P. Gannon, Colossus: Bletchley Park's Greatest Secret (Atlantic Books, London, 2006)
20. I.J. Good, The interaction algorithm and practical Fourier series. J. Roy. Stat. Soc. Ser. B 20, 361–372 (1958)
21. D. Harel, Algorithmics: The Spirit of Computing (Addison Wesley, Reading, MA, 1997)
22. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice Hall, Upper Saddle River, NJ, 2004)
23. R.V.L. Hartley, A more symmetrical Fourier analysis applied to transmission problems. Proc. IRE 30 (1942)
24. M.T. Heideman, D.H. Johnson, C.S. Burrus, Gauss and the history of the fast Fourier transform. IEEE ASSP Mag. 1, 14–21 (1984)
25. A. Hodges, Alan Turing: The Enigma (Vintage, London, 1992)
26. K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vision Image Signal Process. 153(1), 70–78 (February 2006)
27. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
28. L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms (Wiley, New York, 1985)
29. S.Y. Kung, VLSI Array Processors (Prentice Hall, Englewood Cliffs, NJ, 1988)
30. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 32(2) (1984)
31. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes (Elsevier), Burlington, MA, 2004)
32. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
33. H. Murakami, Real-valued fast discrete Fourier transform and decimation-in-frequency algorithms. IEEE Trans. Circuits Syst. II: Analog Dig. Signal Process. 41(12), 808–816 (1994)
34. I. Niven, H.S. Zuckerman, An Introduction to the Theory of Numbers (Wiley, New York, 1980)
35. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, Berlin, 1981)
36. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1989)
37. J.R. Pierce, An Introduction to Information Theory: Symbols, Signals and Noise (Dover, New York, 1980)
38. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, Boston, MA, 2005)
39. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
40. C. Runge, Über die Zerlegung empirisch periodischer Funktionen in Sinus-Wellen. Zeit. für Math. und Physik 48, 443–456 (1903)
41. C.E. Shannon, A mathematical theory of communication. BSTJ 27, 379–423, 623–657 (1948)
42. G.R.L. Sohie, W. Chen, Implementation of Fast Fourier Transforms on Motorola's Digital Signal Processors. Downloadable document from website: www.Motorola.com
43. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On computing the discrete Hartley transform. IEEE Trans. ASSP 33, 1231–1238 (1985)
44. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (1987)
45. K. Stumpff, Tafeln und Aufgaben zur Harmonischen Analyse und Periodogrammrechnung (Julius Springer, Berlin, 1939)
46. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Signal Process. 42(11) (1994)
47. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Elect. Comput. EC-8(3), 330–334 (1959)
48. S.A. White, Application of distributed arithmetic to digital signal processing: A tutorial review. IEEE ASSP Mag. 6(3), 4–19 (1989)
49. S. Winograd, Arithmetic Complexity of Computations (SIAM, Philadelphia, PA, 1980)
Chapter 2
Fast Solutions to Real-Data Discrete Fourier Transform
Abstract This chapter discusses the two approaches conventionally adopted for dealing with the real-data DFT problem. The first approach involves the design of specialized fast algorithms, such as those due to Bergland and Bruun, which are geared specifically to addressing real-data applications and therefore able to exploit, in a direct way, the real-valued nature of the data – which is known to result in a Hermitian-symmetric frequency spectrum. The second approach, which is the most commonly adopted, particularly for applications requiring a hardware solution, involves re-structuring the data so as to use an existing complex-data FFT algorithm, possibly coupled with pre-FFT and/or post-FFT stages, to produce the DFT of either one or two (produced simultaneously) real-valued data sets – such solutions thus said to be obtained via a “real-from-complex” strategy. A discussion is finally provided relating to the results obtained in the chapter.
2.1 Introduction
Since the original developments of spectrum analysis in the eighteenth century, the vast majority of real-world applications have been concerned with the processing of real-valued data, where the data generally corresponds to amplitude measurements of some signal of interest. As a result, there has always been a genuine practical need for fast solutions to the real-data DFT, with two quite distinct approaches evolving over this period to address the problem. The first and more intellectually challenging approach involves trying to design specialized algorithms which are geared specifically to real-data applications and therefore able to exploit, in a direct way, the real-valued nature of the data, which is known to result in a Hermitian-symmetric frequency spectrum, whereby for the case of an N-point transform
Re X^{(F)}[k] = Re X^{(F)}[N−k]    (2.1)

and

Im X^{(F)}[k] = −Im X^{(F)}[N−k],    (2.2)
so that one half of the DFT outputs are in fact redundant. Such solutions, as typified by the Bergland [1] and Bruun [3, 13] algorithms, only need therefore to produce one half of the DFT outputs. The second and less demanding approach – but also the most commonly adopted, particularly for applications requiring a hardware solution – involves re-structuring the data so as to use an existing complex-data FFT algorithm, possibly coupled with pre-FFT and/or post-FFT stages, to produce the DFT of either one or two (produced simultaneously) real-valued data sets – such solutions thus said to be obtained via a real-from-complex strategy [17]. Both of these approaches are now discussed in some detail prior to a summary of their relative merits and drawbacks.
2.2 Real-Data FFT Algorithms
Since the re-emergence of computationally-efficient FFT algorithms, as initiated by the published work of James Cooley and John Tukey in the mid-1960s [5], a number of attempts have been made [1, 3, 6, 7, 9, 10, 12, 18, 19] at producing fast algorithms that are able to directly exploit the spectral symmetry that arises from the processing of real-valued data. Two such algorithms are those due to Glenn Bergland (1968) and Georg Bruun (1978) and these are now briefly discussed so as to give a flavour of the type of algorithmic structures that can result from pursuing such an approach. The Bergland algorithm effectively modifies the DIT version of the familiar Cooley–Tukey radix-2 algorithm [2] to account for the fact that only one half of the DFT outputs need to be computed, whilst the Bruun algorithm adopts an unusual recursive polynomial-factorization approach – note that the DIF version of the Cooley–Tukey fixed-radix algorithm, referred to as the Sande–Tukey algorithm [2], may also be expressed in such a form – which involves only real-valued polynomial coefficients until the last stage of the computation, making it thus particularly suited to the computation of the real-data DFT. Examples of the signal flow graphs (SFGs) for both DIT and DIF versions of the radix-2 FFT algorithm are as given below in Figs. 2.1 and 2.2, respectively.
2.2.1 The Bergland Algorithm
The Bergland algorithm is a real-data FFT algorithm based upon the observation that the frequency spectrum arising from the processing of real-valued data is Hermitian-symmetric, so that only one half of the DFT outputs needs to be computed. Starting with the DIT version of the familiar complex-data Cooley–Tukey radix-2 FFT algorithm, if the input data is real valued, then at each of the log₂N temporal stages of the algorithm the computation involves the repeated combination of two transforms to yield one longer double-length transform. From this, Bergland observed that the property of Hermitian symmetry may actually be exploited at each of the
Fig. 2.1 Signal flow graph for DIT decomposition of eight-point DFT

Fig. 2.2 Signal flow graph for DIF decomposition of eight-point DFT
log₂N stages of the algorithm. Thus, as all the odd-addressed output samples for each such double-length transform form the second half of the frequency spectrum, which can in turn be straightforwardly obtained from the property of spectral symmetry, Bergland's algorithm instead uses those memory locations for storing the imaginary components of the data. Thus, with Bergland's algorithm, given that the input data sequence is real valued, all the intermediate results may be stored in just N memory locations – each location thus corresponding to just one word of memory. The computation can also be carried out in an in-place fashion – whereby the outputs from each butterfly are stored in the same set of memory locations as used by the inputs – although the indices of the set of butterfly outputs are not in bit-reversed order, as they are with the Cooley–Tukey algorithm, being instead ordered according to the Bergland
ordering [1], as also are the indices of the twiddle factors or trigonometric coefficients. However, the normal ordering of the twiddle factors may, with due care, be converted to the Bergland ordering and the Bergland ordering of the FFT outputs subsequently converted to the normal ordering, as required for an efficient in-place solution [1, 17]. Thus, the result of the above modifications is an FFT algorithm with an arithmetic complexity of O(N·log₂N) which yields a factor-of-two saving, compared to the conventional zero-padded complex-data FFT solution – to be described in Section 2.3.1 – in terms of both arithmetic complexity and memory requirement.
2.2.2 The Bruun Algorithm
The Bruun algorithm is a real-data FFT algorithm based upon an unusual recursive polynomial-factorization approach, proposed initially for the case of N input samples where N is a power of two, but subsequently generalized to arbitrary even-number transform sizes by Hideo Murakami in 1996 [12]. Recall firstly, from Chapter 1, that the N-point DFT can be written in normalized form as

X^{(F)}[k] = (1/√N) · Σ_{n=0}^{N−1} x[n]·W_N^{nk},   k = 0, 1, ..., N−1    (2.3)

where the transform kernel is derived from the term

W_N = exp(−i2π/N),    (2.4)

the primitive Nth complex root of unity. Then, by defining the polynomial x(z) whose coefficients are those elements of the sequence {x[n]}, such that

x(z) = (1/√N) · Σ_{n=0}^{N−1} x[n]·z^n,    (2.5)

it is possible to view the DFT as a reduction of this polynomial [11], so that

X^{(F)}[k] = x(W_N^k) = x(z) mod (z − W_N^k),    (2.6)

where "mod" stands for the modulo operation [11] which denotes the polynomial remainder upon division of x(z) by (z − W_N^k) [11]. The key to fast execution of the Bruun algorithm stems from being able to perform this set of N polynomial remainder operations in a recursive fashion. To compute the DFT involves evaluating the remainder of x(z) modulo some polynomial of degree one, more commonly referred to as a monomial, a total of N times, as suggested by Equations 2.5 and 2.6. To do this efficiently, one can combine the remainders recursively in the following way. Suppose it is required to evaluate
x(z) modulo U(z) as well as x(z) modulo V(z). Then, by first evaluating x(z) modulo the polynomial product U(z)·V(z), the degree of the polynomial x(z) is reduced, thereby making subsequent modulo operations less computationally expensive. Now the product of all of the monomials, (z − W_N^k), for values of k from 0 up to N−1, is simply (z^N − 1), whose roots are clearly the N complex roots of unity. A recursive factorization of (z^N − 1) is therefore required which breaks it down into polynomials of smaller and smaller degree, each possessing as few non-zero coefficients as possible. To compute the DFT, one takes x(z) modulo each level of this factorization in turn, recursively, until one arrives at the monomials and the final result. If each level of the factorization splits every polynomial into an O(1) number of smaller polynomials, each with an O(1) number of non-zero coefficients, then the modulo operations for that level will take O(N) arithmetic operations, thus leading to a total arithmetic complexity, for all log₂N levels, of O(N·log₂N), as obtained with the standard Cooley–Tukey radix-2 algorithm.

Note that for N a power of two, the Bruun algorithm factorizes the polynomial (z^N − 1) recursively via the rules

z^{2M} − 1 = (z^M − 1)(z^M + 1)    (2.7)

and

z^{4M} + a·z^{2M} + 1 = (z^{2M} + √(2−a)·z^M + 1)(z^{2M} − √(2−a)·z^M + 1),    (2.8)

where "a" is a real constant such that |a| ≤ 2. On completion of the recursion, when M = 1, there remain polynomials of degree two that can each be evaluated modulo two monomials of the form (z − W_N^k). Thus, at each recursive stage, all of the polynomials may be factorized into two parts, each of half the degree and possessing at most three non-zero coefficients, leading to an FFT algorithm with an O(N·log₂N) arithmetic complexity. Moreover, since all the polynomials have purely real coefficients, at least until the last stage, they quite naturally exploit the special case where the input data is real valued, thereby yielding a factor-of-two saving, compared to the conventional zero-padded complex-data FFT solution to be discussed in Section 2.3.1, in terms of both arithmetic complexity and memory requirement.
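As an illustration of these two rules – a worked example not drawn from the monograph itself – the case N = 8 factorizes recursively as

z^8 − 1 = (z^4 − 1)(z^4 + 1)
        = (z^2 − 1)(z^2 + 1)(z^4 + 1)
        = (z^2 − 1)(z^2 + 1)(z^2 + √2·z + 1)(z^2 − √2·z + 1),

the last step applying Equation 2.8 with M = 1 and a = 0, so that every factor has purely real coefficients and at most three non-zero terms, the final splitting into the monomials (z − W_8^k) being deferred to the last stage of the computation.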
2.3 Real-From-Complex Strategies
By far the most common approach to solving the real-data DFT problem is that based upon the use of an existing complex-data FFT algorithm, as it simplifies the problem, at worst, to one of designing pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set (or sets). Note that any fast
algorithm may be used for carrying out the complex-data FFT, so that both DIT and DIF versions of fixed-radix FFTs, as already discussed, as well as more sophisticated FFT designs such as those corresponding to the mixed-radix, split-radix, prime factor, prime-length and Winograd’s nested algorithms [8, 11], for example, might be used.
2.3.1 Computing One Real-Data DFT via One Full-Length Complex-Data FFT
The most straightforward approach to the problem involves first packing the real-valued data into the real component of a complex-valued data sequence, padding the imaginary component with zeros – this action more commonly referred to as zero padding – and then feeding the resulting complex-valued data set into a complex-data FFT. The arithmetic complexity of such an approach is clearly identical to that obtained when a standard complex-data FFT is applied to genuine complex-valued data, so that no computational benefits stemming from the nature of the data are achieved with such an approach. On the contrary, computational resources are wasted with such an approach, as excessive arithmetic operations are performed for the computation of the required outputs and twice the required amount of memory is used for the storage of the input/output data.
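A minimal "C" sketch of the zero-padding step just described (the function and array names here are illustrative assumptions, not part of the monograph's accompanying code):

/* Pack N real-valued samples into an interleaved complex array
 * (re, im, re, im, ...) with the imaginary parts zero-padded, ready
 * for input to any N-point complex-data FFT routine. */
void zero_pad_pack(const double *x, double *cdata, int N)
{
    int n;
    for (n = 0; n < N; n++) {
        cdata[2*n]     = x[n];   /* real component carries the data */
        cdata[2*n + 1] = 0.0;    /* imaginary component zero-padded */
    }
}

Half of the storage, and much of the subsequent arithmetic, is thereby wasted – which is precisely the inefficiency noted above.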
2.3.2 Computing Two Real-Data DFTs via One Full-Length Complex-Data FFT
The next approach to the problem involves computing two N-point real-data DFTs, simultaneously, by means of one N-point complex-data FFT. This is achieved by packing the first real-valued data sequence into the real component of a complex-valued data sequence and the second real-valued data sequence into its imaginary component. Thus, given two real-valued data sequences, {g[n]} and {h[n]}, a complex-valued data sequence, {x[n]}, may be simply obtained by setting

x[n] = g[n] + i·h[n],    (2.9)

with the kth DFT output of the resulting data sequence being written in normalized form, in terms of the DFTs of {g[n]} and {h[n]}, as

X^{(F)}[k] = (1/√N) · Σ_{n=0}^{N−1} x[n]·W_N^{nk}
           = (1/√N) · Σ_{n=0}^{N−1} g[n]·W_N^{nk} + i · (1/√N) · Σ_{n=0}^{N−1} h[n]·W_N^{nk}
           = G[k] + i·H[k]
           = (G_R[k] − H_I[k]) + i·(G_I[k] + H_R[k]),    (2.10)
where G_R[k] and G_I[k] are the real and imaginary components, respectively, of G[k] – the same applying to H_R[k] and H_I[k] with respect to H[k]. Similarly, the (N−k)th DFT output may be written in normalized form as

X^{(F)}[N−k] = (1/√N) · Σ_{n=0}^{N−1} x[n]·W_N^{n(N−k)}
             = (1/√N) · Σ_{n=0}^{N−1} g[n]·W_N^{−nk} + i · (1/√N) · Σ_{n=0}^{N−1} h[n]·W_N^{−nk}
             = G*[k] + i·H*[k]
             = (G_R[k] + H_I[k]) + i·(H_R[k] − G_I[k]),    (2.11)

where the superscript "*" stands for the operation of complex conjugation, so that upon combining the expressions of Equations 2.10 and 2.11, the DFT outputs G[k] and H[k] may be written, in terms of the DFT outputs X^{(F)}[k] and X^{(F)}[N−k], as

G[k] = G_R[k] + i·G_I[k]
     = ½ [Re(X^{(F)}[k] + X^{(F)}[N−k]) + i·Im(X^{(F)}[k] − X^{(F)}[N−k])]    (2.12)

and

H[k] = H_R[k] + i·H_I[k]
     = ½ [Im(X^{(F)}[k] + X^{(F)}[N−k]) − i·Re(X^{(F)}[k] − X^{(F)}[N−k])],    (2.13)

where the terms Re X^{(F)}[k] and Im X^{(F)}[k] denote the real and imaginary components, respectively, of X^{(F)}[k]. Thus, it is evident that the DFT of the two real-valued data sequences, {g[n]} and {h[n]}, may be computed simultaneously, via one full-length complex-data FFT algorithm, with the DFT of the sequence {g[n]} being as given by Equation 2.12 and that of the sequence {h[n]} by Equation 2.13. The pre-FFT data packing stage is quite straightforward in that it simply involves the assignment of one real-valued data sequence to the real component of the complex-valued data array and one real-valued data sequence to its imaginary component. The post-FFT data unpacking stage simply involves separating out the two spectra from the complex-valued FFT output data, this involving two real additions/subtractions for each real-data DFT output together with two scaling operations each by a factor of 2 (which in fixed-point hardware reduces to that of a simple right-shift operation).
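The unpacking stage of Equations 2.12 and 2.13 might be realized along the following lines – an illustrative "C" sketch under assumed names, with X holding the interleaved (re, im) N-point FFT output of x[n] = g[n] + i·h[n], and with G and H each sized to hold the interleaved bins k = 0, ..., N/2, the remaining bins following from Hermitian symmetry:

void unpack_two_real_dfts(const double *X, double *G, double *H, int N)
{
    int k;
    /* Bins 0 and N/2 are purely real for real-valued input sequences. */
    G[0] = X[0];       G[1] = 0.0;
    H[0] = X[1];       H[1] = 0.0;
    G[N] = X[N];       G[N + 1] = 0.0;
    H[N] = X[N + 1];   H[N + 1] = 0.0;
    for (k = 1; k < N/2; k++) {
        double reK = X[2*k],         imK = X[2*k + 1];
        double reN = X[2*(N - k)],   imN = X[2*(N - k) + 1];
        G[2*k]     =  0.5 * (reK + reN);   /* Re G[k] - Equation 2.12 */
        G[2*k + 1] =  0.5 * (imK - imN);   /* Im G[k]                 */
        H[2*k]     =  0.5 * (imK + imN);   /* Re H[k] - Equation 2.13 */
        H[2*k + 1] = -0.5 * (reK - reN);   /* Im H[k]                 */
    }
}

The two halvings per output are the scaling operations noted above and, in fixed-point hardware, reduce to simple right shifts.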
2.3.3 Computing One Real-Data DFT via One Half-Length Complex-Data FFT
Finally, the last approach to the problem involves showing how an N-point complex-data FFT may be used to carry out the computation of one 2N-point real-data DFT. The kth DFT output of the 2N-point real-valued data sequence {x[n]} may be written in normalized form as

X^{(F)}[k] = (1/√(2N)) · Σ_{n=0}^{2N−1} x[n]·W_{2N}^{nk},   k = 0, 1, ..., N−1
           = (1/√(2N)) · Σ_{n=0}^{N−1} x[2n]·W_N^{nk} + W_{2N}^k · (1/√(2N)) · Σ_{n=0}^{N−1} x[2n+1]·W_N^{nk}    (2.14)

which, upon setting g[n] = x[2n] and h[n] = x[2n+1], becomes

X^{(F)}[k] = (1/√(2N)) · Σ_{n=0}^{N−1} g[n]·W_N^{nk} + W_{2N}^k · (1/√(2N)) · Σ_{n=0}^{N−1} h[n]·W_N^{nk},   k = 0, 1, ..., N−1
           = G[k] + W_{2N}^k·H[k].    (2.15)

Therefore, by setting y[n] = g[n] + i·h[n] and exploiting the combined expressions of Equations 2.10 and 2.11, the DFT output Y[k] may be written as

Y[k] = (G_R[k] − H_I[k]) + i·(G_I[k] + H_R[k])    (2.16)

and that for Y[N−k] as

Y[N−k] = (G_R[k] + H_I[k]) + i·(H_R[k] − G_I[k]).    (2.17)
Then, by combining the expressions of Equations 2.15–2.17, the real component of X^{(F)}[k] may be written as

X_R^{(F)}[k] = ½·Re(Y[k] + Y[N−k]) + ½·cos(πk/N)·Im(Y[k] + Y[N−k]) − ½·sin(πk/N)·Re(Y[k] − Y[N−k])    (2.18)

and the imaginary component as

X_I^{(F)}[k] = ½·Im(Y[k] − Y[N−k]) − ½·sin(πk/N)·Im(Y[k] + Y[N−k]) − ½·cos(πk/N)·Re(Y[k] − Y[N−k]).    (2.19)
Thus, it is evident that the DFT of one real-valued data sequence, {x[n]}, of length 2N, may be computed via one N-point complex-data FFT algorithm, with the real component of the DFT output being as given by Equation 2.18 and the imaginary component of the DFT output as given by Equation 2.19. The pre-FFT data packing stage is conceptually simple, but nonetheless burdensome, in that it involves the assignment of the even-addressed samples of the real-valued data sequence to the real component of the complex-valued data sequence and the odd-addressed samples to its imaginary component. The post-FFT data unpacking stage is considerably more complex than that required for the approach of Section 2.3.2, requiring the application of eight real additions/subtractions for each DFT output, together with two scaling operations, each by a factor of 2, and four real multiplications by pre-computed trigonometric coefficients.
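By way of illustration, the post-FFT unpacking stage described above might be realized along the following lines in "C" – the function and array names are ours, and the sketch assumes an unnormalized, interleaved (re, im) N-point FFT output Y of the sequence y[n] = x[2n] + i·x[2n+1]; it produces bins k = 0, ..., N−1 of the 2N-point real-data DFT, the remaining bins following from Hermitian symmetry:

#include <math.h>

#define PI 3.14159265358979323846

void unpack_half_length(const double *Y, double *Xre, double *Xim, int N)
{
    int k;
    for (k = 0; k < N; k++) {
        int m = (N - k) % N;                   /* dual index; k = 0 maps to itself */
        double reS = Y[2*k]     + Y[2*m];      /* Re(Y[k] + Y[N-k]) */
        double reD = Y[2*k]     - Y[2*m];      /* Re(Y[k] - Y[N-k]) */
        double imS = Y[2*k + 1] + Y[2*m + 1];  /* Im(Y[k] + Y[N-k]) */
        double imD = Y[2*k + 1] - Y[2*m + 1];  /* Im(Y[k] - Y[N-k]) */
        double c = cos(PI * k / N), s = sin(PI * k / N);
        Xre[k] = 0.5 * (reS + c * imS - s * reD);   /* Equation 2.18 */
        Xim[k] = 0.5 * (imD - s * imS - c * reD);   /* Equation 2.19 */
    }
}

The four trigonometric multiplications and eight additions/subtractions per output, as counted in the text, are visible directly in the loop body.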
2.4 Data Re-ordering
All of the fixed-radix formulations of the FFT, at least for the case where the transform length is a power of two, require that either the inputs to or the outputs from the transform be permuted according to the digit-reversal mapping [4]. In fact, it is possible to place the data re-ordering either before or after the transform for both the DIT and DIF formulations [4]. For the case of a radix-2 algorithm the data re-ordering is more commonly known as the bit-reversal mapping, being based upon the exchanging of single bits, whilst for the radix-4 case it is known as the di-bit reversal mapping, being based instead upon the exchanging of pairs of bits. Such data re-ordering, when mapped onto a uni-processor sequential computing device, may be carried out via the use of either:

1. An LUT, at the expense of additional memory; or
2. A fast algorithm using just shifts, additions/subtractions and memory exchanges – a sketch of this option is given below; or
3. A fast algorithm that also makes use of a small LUT – containing the reflected bursts of ones that change on the lower end with incrementing – to try and optimize the speed at the cost of a slight increase in memory,

with the optimum choice being dependent upon the available resources and the time constraints of the application. Alternatively, when the digit-reversal mapping is appropriately parallelized, it may be mapped onto a multi-processor parallel computing device, such as an FPGA, possessing multiple banks of fast memory, thus enabling the time-complexity to be greatly reduced – see the recent work of Ren et al. [14] and Seguel and Bollman [16].
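A sketch of option 2 above for the radix-2 case – the classic in-place bit-reversal exchange, written here in illustrative "C" (the function name is an assumption, and the di-bit reversal required by a radix-4 formulation would follow the same pattern with pairs of bits):

/* In-place bit-reversal re-ordering of a length-N sequence (N a power
 * of two) using just shifts, additions/subtractions and exchanges. */
void bit_reverse_reorder(double *x, int N)
{
    int i, j = 0, k;
    for (i = 0; i < N - 1; i++) {
        if (i < j) {                /* exchange each pair only once */
            double tmp = x[i];
            x[i] = x[j];
            x[j] = tmp;
        }
        k = N >> 1;                 /* add 1 to j in reversed binary */
        while (k <= j) {
            j -= k;
            k >>= 1;
        }
        j += k;
    }
}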
The optimum approach to digit-reversal is dictated very much by the operation of the FFT, namely whether the FFT is of burst or streaming type, as discussed in Chapter 6.
2.5 Discussion
The aim of this chapter has been to highlight both the advantages and the disadvantages of the conventional approaches to the solution of the real-data DFT problem. As is evident from the examples discussed in Section 2.2, namely the Bergland and Bruun algorithms, the adoption of specialized real-data FFT algorithms may well yield solutions possessing attractive performance metrics in terms of arithmetic complexity and memory requirement, but generally this is only achieved at the expense of a more complex algorithmic structure when compared to those of the highly-regular fixed-radix designs. As a result, such algorithms would not seem to lend themselves particularly well to being mapped onto parallel computing equipment. Similarly, from the examples of Section 2.3, namely the real-from-complex strategies, the regularity of the conventional fixed-radix designs may only be exploited at the expense of introducing additional processing modules, namely the pre-FFT and/or post-FFT stages for the packing of the real-valued data into the correct format required for input to the FFT algorithm and for the subsequent unpacking of the FFT output data to obtain the spectrum (or spectra) of the original real-valued data set (or sets). An additional set of problems associated with the real-from-complex strategies, at least when compared to the more direct approach of a real-data FFT, relates to the need for increased memory and increased processing delay to allow for the possible acquisition/processing of pairs of data sets. It is worth noting that an alternative DSP-based approach to those discussed above is to first convert the real-valued data to complex-valued data by means of a wideband digital down-conversion (DDC) process, this being followed by the application of a conventional complex-data FFT. Such an approach, however, introduces an additional function to be performed – typically an FIR filter with length dependent upon the performance requirements of the application – which also introduces an additional processing delay prior to the execution of the FFT. Drawing upon a philosophical analogy, namely the maxim of the fourteenth century Franciscan scholar, William of Occam, commonly known as "Occam's razor" [15]: why use two functions when just one will suffice? A related and potentially serious problem arises when there is limited information available on the signal under analysis, as such information might well be compromised via the filtering operation, particularly when the duration of the signal is short relative to that of the transient response of the filter – as might be encountered, for example, with problems relating to the detection of extremely short duration dual-tone multi-frequency (DTMF) signals. Thus, there are clear drawbacks to all such approaches, particularly when the application requires a solution in hardware using parallel computing equipment, so
that the investment of searching for alternative solutions to the fast computation of the real-data DFT is still well merited. More specifically, solutions are required possessing both highly regular designs that lend themselves naturally to mapping onto parallel computing equipment and attractive performance metrics, in terms of both arithmetic complexity and memory requirement, but without requiring excessive packing/unpacking requirements and without incurring the latency problems (as arising from the increased processing delay) associated with the adoption of certain of the real-from-complex strategies.
References

1. G.D. Bergland, A fast Fourier transform algorithm for real-valued series. Comm. ACM 11(10) (1968)
2. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988)
3. G. Bruun, Z-transform DFT filters and FFTs. IEEE Trans. ASSP 26(1) (1978)
4. E. Chu, A. George, Inside the FFT Black Box (CRC Press, Boca Raton, FL, 2000)
5. J.W. Cooley, J.W. Tukey, An algorithm for the machine calculation of complex Fourier series. Math. Comput. 19(4), 297–301 (1965)
6. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and real-symmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986)
7. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987)
8. P. Duhamel, M. Vetterli, Fast Fourier transforms: A tutorial review and a state of the art. Signal Process. 19, 259–299 (1990)
9. O. Ersoy, Real discrete Fourier transform. IEEE Trans. ASSP 33(4) (1985)
10. J.B. Marten, Discrete Fourier transform algorithms for real valued sequences. IEEE Trans. ASSP 32(2) (1984)
11. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979)
12. H. Murakami, Real-valued fast discrete Fourier transform and cyclic convolution algorithms of highly composite even length. Proc. ICASSP 3, 1311–1314 (1996)
13. H.J. Nussbaumer, Fast Fourier Transform and Convolution Algorithms (Springer, Berlin, 1981)
14. G. Ren, P. Wu, D. Padua, Optimizing data permutations for SIMD devices. PLDI'06, Ottawa, Ontario, Canada (2006)
15. B. Russell, History of Western Philosophy (George Allen & Unwin, London, 1961)
16. J. Seguel, D. Bollman, A framework for the design and implementation of FFT permutation algorithms. IEEE Trans. Parallel Distrib. Syst. 11(7), 625–634 (2000)
17. G.R.L. Sohie, W. Chen, Implementation of Fast Fourier Transforms on Motorola's Digital Signal Processors. Downloadable document from website: www.Motorola.com
18. H.V. Sorensen, D.L. Jones, M.T. Heideman, C.S. Burrus, Real-valued fast Fourier transform algorithms. IEEE Trans. ASSP 35(6), 849–863 (1987)
19. P.R. Uniyal, Transforming real-valued sequences: Fast Fourier versus fast Hartley transform algorithms. IEEE Signal Process. 42(11) (1994)
Chapter 3
The Discrete Hartley Transform
Abstract This chapter introduces the DHT and discusses those aspects of its solution, as obtained via the FHT, which make it an attractive choice for applying to the real-data DFT problem. This involves first showing how the DFT may be obtained from the DHT, and vice versa, followed by a discussion of those fundamental theorems, common to both the DFT and DHT algorithms, which enable the input data sets to be similarly related to their respective transforms and thus enable the DHT to be used for solving those DSP-based problems commonly addressed via the DFT, and vice versa. The limitations of existing FHT algorithms are then discussed bearing in mind the ultimate objective of mapping any subsequent solution onto silicon-based parallel computing equipment. A discussion is finally provided relating to the results obtained in the chapter.
3.1 Introduction
An algorithm that would appear to satisfy most if not all of the requirements laid down in Section 2.5 of the previous chapter is that of the DHT, as introduced in Chapter 1, a discrete unitary transform [6] that involves only real arithmetic (thus making it also orthogonal) and that is intimately related to the DFT, satisfying all those properties required of a unitary transform as well as possessing fast algorithms for its solution. Before delving into the details, however, it is perhaps worth restating the definition, namely that for the case of N input/output samples, the DHT may be expressed in normalized form via the equation

X^{(H)}[k] = (1/√N) · Σ_{n=0}^{N−1} x[n]·cas(2πnk/N),   k = 0, 1, ..., N−1    (3.1)

where the input/output data vectors belong to R^N, the linear space of real-valued N-tuples, and the transform kernel is given by

cas(2πnk/N) = cos(2πnk/N) + sin(2πnk/N),    (3.2)
a periodic function with period 2π possessing (amongst others) the following set of useful properties:

cas(A + B) = cosA·casB + sinA·cas(−B)
cas(A − B) = cosA·cas(−B) + sinA·casB
casA·casB = cos(A − B) + sin(A + B)
casA + casB = 2·cas(½(A + B))·cos(½(A − B))
casA − casB = 2·cas(−½(A + B))·sin(½(A − B))    (3.3)
as will be exploited later for derivation of the FHT algorithm.
3.2 Normalization of DHT Outputs
Suppose now that the DHT operation is applied twice, in succession, the first time to a real-valued data sequence, {x[n]}, and the second time to the output of the first operation. Then given that the DHT is bilateral and, like the DFT, a unitary transform, the output from the second operation, {y[n]}, can be expressed as

{y[n]} = DHT(DHT({x[n]})) ≡ {x[n]},    (3.4)

so that the output of the second operation is actually equivalent to the input of the first operation. However, it should be noted that without the presence of the scaling factor, 1/√N, that has been included in the current definition of the DHT, as given by Equation 3.1 above, the magnitudes of the outputs of the second DHT would actually be equal to N times those of the inputs of the first DHT, so that the role of the scaling factor is to ensure that the magnitudes are preserved. It should be borne in mind, however, that the presence of a coherent signal in the input data will result in most of the growth in magnitude occurring in the forward transform, so that any future scaling strategy – as discussed in the following chapter – must reflect this fact. Note that a scaling factor of 1/N is often used for the forward definition of both the DFT and the DHT, the value of 1/√N being used instead here purely for mathematical elegance, as it reduces the definitions of the DHT for both the forward and the reverse directions to an identical form. The fundamental theorems discussed in Section 3.5 for both the DFT and the DHT, however, are valid regardless of the scaling factor used.
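The bilateral property of Equation 3.4 is easily checked numerically. The following illustrative "C" program – a direct O(N²) implementation of Equation 3.1 under assumed names, rather than a fast algorithm – applies the same normalized routine twice and recovers the original sequence:

#include <math.h>
#include <stdio.h>

#define PI 3.14159265358979323846

/* Direct form of Equation 3.1: X[k] = (1/sqrt(N)) * sum over n of
 * x[n]*cas(2*pi*n*k/N), with cas(t) = cos(t) + sin(t). */
void dht(const double *x, double *X, int N)
{
    int n, k;
    for (k = 0; k < N; k++) {
        double acc = 0.0;
        for (n = 0; n < N; n++) {
            double t = 2.0 * PI * n * k / N;
            acc += x[n] * (cos(t) + sin(t));
        }
        X[k] = acc / sqrt((double)N);
    }
}

int main(void)
{
    double x[8] = {1, 2, 3, 4, 5, 6, 7, 8}, X[8], y[8];
    int n;
    dht(x, X, 8);       /* forward transform      */
    dht(X, y, 8);       /* identical reverse pass */
    for (n = 0; n < 8; n++)
        printf("%8.4f  %8.4f\n", x[n], y[n]);   /* y matches x */
    return 0;
}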
3.3 Decomposition into Even and Odd Components
The close relationship between the DFT and the DHT hinges upon symmetry considerations which may be best explained by considering the decomposition of the DHT into its "even" and "odd" components [2], denoted E[k] and O[k], respectively, and written as

X^{(H)}[k] = E[k] + O[k]    (3.5)

where, for an N-point transform, E[k] is such that

E[N−k] = E[k]    (3.6)

and O[k] is such that

O[N−k] = −O[k].    (3.7)

As a result, the even and odd components may each be expressed in terms of the DHT outputs via the expressions

E[k] = ½ (X^{(H)}[k] + X^{(H)}[N−k])    (3.8)

and

O[k] = ½ (X^{(H)}[k] − X^{(H)}[N−k]),    (3.9)

respectively, from which the relationship between the DFT and DHT outputs may be straightforwardly obtained.
3.4 Connecting Relations Between DFT and DHT
Firstly, from the equality

cas(2πnk/N) = Re(W_N^{nk}) − Im(W_N^{nk}),    (3.10)

which relates the kernels of the two transformations, the DFT outputs may be expressed as

X^{(F)}[k] = E[k] − i·O[k],    (3.11)

so that

Re X^{(F)}[k] = ½ (X^{(H)}[N−k] + X^{(H)}[k])    (3.12)

and

Im X^{(F)}[k] = ½ (X^{(H)}[N−k] − X^{(H)}[k]),    (3.13)
whilst

X^{(H)}[k] = Re X^{(F)}[k] − Im X^{(F)}[k].    (3.14)
3.4.1 Real-Data DFT
Thus, from Equations 3.12 to 3.14, the complex-valued DFT output set and the real-valued DHT output set may now be simply obtained, one from the other, so that a fast algorithm for the solution of the DFT may also be used for the computation of the DHT whilst a fast algorithm for the solution of the DHT may similarly be used for the computation of the DFT. Note from the above equations that pairs of real-valued DHT outputs combine to give individual complex-valued DFT outputs, such that

X^{(H)}[k] & X^{(H)}[N−k] ↔ X^{(F)}[k]    (3.15)

for k = 1, 2, ..., N/2 − 1, whilst the remaining two terms are such that

X^{(H)}[0] ↔ X^{(F)}[0]    (3.16)

and

X^{(H)}[N/2] ↔ X^{(F)}[N/2].    (3.17)
With regard to the two trivial mappings provided above by Equations 3.16 and 3.17, it may also be noted from Equation 3.10 that when k = 0, we have

cas(2πnk/N) = W_N^{nk} = 1,    (3.18)

so that the zero-address component in Hartley space maps to the zero-address (or zero-frequency) component in Fourier space, and vice versa, as implied by Equation 3.16, whilst when k = N/2, we have

cas(2πnk/N) = W_N^{nk} = (−1)^n,    (3.19)
so that the Nyquist-address component in Hartley space similarly maps to the Nyquist-address (or Nyquist-frequency) component in Fourier space, and vice versa, as implied by Equation 3.17.
3.4.2 Complex-Data DFT
Now, having defined the relationship between the Fourier-space and Hartley-space representations of a real-valued data sequence, it is a simple task to extend the results to the case of a complex-valued data sequence. Given the linearity of the DFT – this property follows from the Addition Theorem to be discussed in the following
section – the DFT of a complex-valued data sequence, {x_R[n] + i·x_I[n]}, can be written as the sum of the DFTs of the individual real and imaginary components, so that

DFT({x_R[n] + i·x_I[n]}) = DFT({x_R[n]}) + i·DFT({x_I[n]}).    (3.20)

Therefore, by first taking the DHT of the individual real and imaginary components of the complex-valued data sequence and then deriving the DFT of each such component by means of Equations 3.12 and 3.13, the real and imaginary components of the DFT of the complex-valued data sequence may be written in terms of the two DHTs as

Re X^{(F)}[k] = ½ (X_R^{(H)}[N−k] + X_R^{(H)}[k]) − ½ (X_I^{(H)}[N−k] − X_I^{(H)}[k])    (3.21)

and

Im X^{(F)}[k] = ½ (X_R^{(H)}[N−k] − X_R^{(H)}[k]) + ½ (X_I^{(H)}[N−k] + X_I^{(H)}[k]),    (3.22)

respectively, so that it is now possible to compute the DFT of both real-valued and complex-valued data sequences by means of the DHT – pseudo-code is provided for both the real-valued data and complex-valued data cases in Figs. 3.1 and 3.2, respectively. The significance of the complex-to-real decomposition described here for the complex-data DFT is that it introduces an additional level of parallelism to the problem, as the resulting DHTs are independent and thus able to be computed simultaneously, or in parallel, when implemented with parallel computing technology – a subject to be introduced in Chapter 5. This is particularly relevant when the transform is long and the throughput requirement high.
3.5 Fundamental Theorems for DFT and DHT
As has already been stated, if the DFT and DHT algorithms are to be used interchangeably, for solving certain types of signal processing problem, then it is essential that there are corresponding theorems [2] for the two transforms which enable the input data sequences to be similarly related to their respective transforms. Suppose firstly that the sequences {x[n]} and {X^{(F)}[k]} are related via the expression

DFT({x[n]}) = {X^{(F)}[k]},    (3.23)

so that {x[n]} is the input data sequence to the DFT and {X^{(F)}[k]} the corresponding transform-space output, thus belonging to Fourier space, and that {x[n]} and {X^{(H)}[k]} are similarly related via the expression
Description: The real and imaginary components of the real-data N-point DFT outputs are optimally stored in the following way:

XRdata[0]      = zeroth frequency output
XRdata[1]      = real component of 1st frequency output
XRdata[N-1]    = imaginary component of 1st frequency output
XRdata[2]      = real component of 2nd frequency output
XRdata[N-2]    = imaginary component of 2nd frequency output
   ...
XRdata[N/2-1]  = real component of (N/2-1)th frequency output
XRdata[N/2+1]  = imaginary component of (N/2-1)th frequency output
XRdata[N/2]    = (N/2)th frequency output

Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield the zeroth and (N/2)th frequency outputs.

Pseudo-Code for DHT-to-DFT Conversion:

k = N - 1;
for (j = 1; j < (N/2); j = j + 1)
{
    store     = XRdata[k] + XRdata[j];
    XRdata[k] = XRdata[k] - XRdata[j];
    XRdata[j] = store;
    XRdata[j] = XRdata[j] / 2;
    XRdata[k] = XRdata[k] / 2;
    k = k - 1;
}

Fig. 3.1 Pseudo-code for computing real-data DFT from DHT outputs
DHT({x[n]}) = {X^{(H)}[k]},    (3.24)

so that {x[n]} is now the input data sequence to the DHT and {X^{(H)}[k]} the corresponding transform-space output, thus belonging to Hartley space. Then using the normalized definition of the DHT as given by Equation 3.1 – with a similar scaling strategy assumed for the definition of the DFT, as given by Equation 1.1, and its inverse – the following commonly encountered theorems may be derived, each one carrying over from one transform space to the other. Note that the data sequence is assumed, in each case, to be of length N.
3.5.1 Reversal Theorem
The DFT-based relationship is given by

DFT({x[N−n]}) = {X^{(F)}[N−k]},    (3.25)
Description: The complex-data N-point DFT outputs are optimally stored with array "XRdata" holding the real component of both the input and output data, whilst the array "XIdata" holds the imaginary component of both the input and output data.

Note: The components XRdata[0] and XRdata[N/2] do not need to be modified to yield the zeroth and (N/2)th frequency outputs.

Pseudo-Code for DHT-to-DFT Conversion:

k = N - 1;
for (j = 1; j < (N/2); j = j + 1)
{
    // Real Data Channel.
    store     = XRdata[k] + XRdata[j];
    XRdata[k] = XRdata[k] - XRdata[j];
    XRdata[j] = store;
    XRdata[j] = XRdata[j] / 2;
    XRdata[k] = XRdata[k] / 2;

    // Imaginary Data Channel.
    store     = XIdata[k] + XIdata[j];
    XIdata[k] = XIdata[k] - XIdata[j];
    XIdata[j] = store;
    XIdata[j] = XIdata[j] / 2;
    XIdata[k] = XIdata[k] / 2;

    // Combine Outputs for Real and Imaginary Data Channels.
    store1 = XRdata[j] + XIdata[k];
    store2 = XRdata[j] - XIdata[k];
    store3 = XIdata[j] + XRdata[k];
    XIdata[k] = XIdata[j] - XRdata[k];
    XRdata[j] = store2;
    XRdata[k] = store1;
    XIdata[j] = store3;
    k = k - 1;
}

Fig. 3.2 Pseudo-code for computing complex-data DFT from DHT outputs
with the corresponding DHT-based relationship given by

DHT({x[N−n]}) = {X^{(H)}[N−k]}.    (3.26)
3.5.2 Addition Theorem
The DFT-based relationship is given by

DFT({x₁[n] + x₂[n]}) = DFT({x₁[n]}) + DFT({x₂[n]})
                     = {X₁^{(F)}[k]} + {X₂^{(F)}[k]},    (3.27)

with the corresponding DHT-based relationship given by

DHT({x₁[n] + x₂[n]}) = DHT({x₁[n]}) + DHT({x₂[n]})
                     = {X₁^{(H)}[k]} + {X₂^{(H)}[k]}.    (3.28)
3.5.3 Shift Theorem
The DFT-based relationship is given by

DFT({x[n − n₀]}) = {e^{−i2πn₀k/N}·X^{(F)}[k]},    (3.29)

with the corresponding DHT-based relationship given by

DHT({x[n − n₀]}) = {cos(2πn₀k/N)·X^{(H)}[k] + sin(2πn₀k/N)·X^{(H)}[N−k]}.    (3.30)
3.5.4 Convolution Theorem
Denoting the operation of circular or cyclic convolution by means of the symbol "∗", the DFT-based relationship is given by

DFT({x₁[n]} ∗ {x₂[n]}) = {X₁^{(F)}[k]·X₂^{(F)}[k]},    (3.31)

with the corresponding DHT-based relationship given by

DHT({x₁[n]} ∗ {x₂[n]})
   = ½ {X₁^{(H)}[k]·X₂^{(H)}[k] − X₁^{(H)}[N−k]·X₂^{(H)}[N−k]
      + X₁^{(H)}[k]·X₂^{(H)}[N−k] + X₁^{(H)}[N−k]·X₂^{(H)}[k]}.    (3.32)
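The Hartley-space combination of Equation 3.32 might be coded along the following lines – an illustrative "C" sketch with assumed names, in which any overall scaling arising from the chosen DHT normalization is omitted; note that the modulo-N indexing maps k = 0 onto itself, so the zero-address term, which possesses no dual, falls out correctly:

/* Combine the Hartley-space outputs X1 and X2 of two length-N sequences
 * into the Hartley-space representation X3 of their circular convolution,
 * as per Equation 3.32. */
void hartley_convolve(const double *X1, const double *X2, double *X3, int N)
{
    int k;
    for (k = 0; k < N; k++) {
        int m = (N - k) % N;        /* dual (reflected) index */
        X3[k] = 0.5 * ( X1[k] * X2[k] - X1[m] * X2[m]
                      + X1[k] * X2[m] + X1[m] * X2[k] );
    }
}

For k = 0 (and, for even N, k = N/2) the expression collapses to the simple product X1[k]·X2[k], in keeping with the caveat at the end of Section 3.5.9 regarding terms that possess no dual.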
3.5.5 Product Theorem
The DFT-based relationship is given by

DFT({x₁[n]·x₂[n]}) = {X₁^{(F)}[k]} ∗ {X₂^{(F)}[k]},    (3.33)

with the corresponding DHT-based relationship given by

DHT({x₁[n]·x₂[n]})
   = ½ {X₁^{(H)}[k] ∗ X₂^{(H)}[k] − X₁^{(H)}[N−k] ∗ X₂^{(H)}[N−k]
      + X₁^{(H)}[k] ∗ X₂^{(H)}[N−k] + X₁^{(H)}[N−k] ∗ X₂^{(H)}[k]}.    (3.34)
3.5.6 Autocorrelation Theorem
Denoting the operation of circular or cyclic correlation by means of the symbol "⊗", the DFT-based relationship is given by

DFT({x[n]} ⊗ {x[n]}) = {|X^{(F)}[k]|²},    (3.35)

with the corresponding DHT-based relationship given by

DHT({x[n]} ⊗ {x[n]}) = ½ {|X^{(H)}[k]|² + |X^{(H)}[N−k]|²}.    (3.36)
3.5.7 First Derivative Theorem
The DFT-based relationship is given by

DFT({x′[n]}) = {i2πk·X^{(F)}[k]},    (3.37)

with the corresponding DHT-based relationship given by

DHT({x′[n]}) = {−2πk·X^{(H)}[N−k]}.    (3.38)
3.5.8 Second Derivative Theorem
The DFT-based relationship is given by

DFT({x″[n]}) = {−4π²k²·X^{(F)}[k]},    (3.39)

with the corresponding DHT-based relationship given by

DHT({x″[n]}) = {−4π²k²·X^{(H)}[k]}.    (3.40)
3.5.9 Summary of Theorems
This section simply highlights the fact that for every fundamental theorem associated with the DFT, there is an analogous theorem for the DHT, which may be applied, in a straightforward fashion, so that the DHT may be used to address the same type of signal processing problems as the DFT, and vice versa. An important example is that of the digital filtering of an effectively infinite-length data sequence with a fixed-length FIR filter, more commonly referred to as continuous convolution, where the associated linear convolution is carried out via the piecewise application of the Circular Convolution Theorem using either the overlap-add or the overlap-save technique [3]. The role of the DHT, in this respect, is much like that of the number-theoretic transforms (NTTs) [7] – as typified by the Fermat number transform (FNT) and the Mersenne number transform (MNT) – which gained considerable popularity back in the 1970s amongst the academic community. These transforms, which are defined over finite or Galois fields [7] via the use of residue number arithmetic [7], exist purely for their ability to satisfy the Circular Convolution Theorem. An additional and important result, arising from the Product Theorem of Equations 3.33 and 3.34, is that when the real-valued data sequences {x₁[n]} and {x₂[n]} are identical, we obtain Parseval's Theorem [3], as given by the equation

Σ_{n=0}^{N−1} |x[n]|² = Σ_{k=0}^{N−1} |X^{(F)}[k]|² = Σ_{k=0}^{N−1} |X^{(H)}[k]|²,    (3.41)
which simply states that the energy in the signal is preserved under both the DFT and the DHT (and, in fact, under any discrete unitary or orthogonal transformation), so that the energy measured in the data space is equal to that measured in the transform space. This theorem will be used later in Chapter 8, where it will be invoked to enable a fast radix-4 FHT algorithm to be applied to the fast computation of the real-data DFT whose transform length is a power of two, but not a power of four. Finally, note that whenever theorems involve dual Hartley-space terms in their expression – such as the terms X.H/ Œk and X.H/ [N–k], for example, in the
3.6 Fast Solutions to DHT
37
convolution and correlation theorems – that it is necessary that care be taken to treat the zero-address and Nyquist-address terms separately, as neither term possesses a dual.
3.6 Fast Solutions to DHT Knowledge that the DHT is in possession of many of the same properties as the DFT is all very well, but to be of practical significance, it is also necessary that the DHT, like the DFT, possesses fast algorithms for its efficient computation. The first widely published work in this field is thought to be that due to Ronald Bracewell [1, 2], who produced both radix-2 and radix-4 versions of the DIT fixed-radix FHT algorithm. His work in this field was summarized in a short monograph [2] which has formed the inspiration for the work discussed here. The solutions produced by Bracewell are attractive in that they achieve the desired performance metrics in terms of both arithmetic complexity and memory requirement – that is, compared to a conventional complex-data FFT, they require one half of the arithmetic operations and one half the memory requirement – but suffer from the fact that they need two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations. For the radix-4 algorithm, for example, a single-sized butterfly produces four outputs from four inputs, as shown in Fig. 3.3, whilst a double-sized butterfly produces eight outputs from eight inputs, as shown in Fig. 3.4, both of which will be developed in some detail from first principles in the following chapter. This lack of regularity makes an in-place solution somewhat difficult to achieve, necessitating the use of additional memory between the temporal stages, as well as making an efficient mapping onto parallel computing equipment less than straightforward. Although other algorithmic variations for the efficient solution to the DHT have subsequently appeared [4, 5, 10], they all suffer, to varying extents, in terms of their lack of regularity, so that alternative solutions to the DHT are still sought that possess the regularity associated with the conventional complex-data fixed-radix FFT algorithms but without sacrificing the benefits of the existing FHT algorithms in terms of their reduced arithmetic complexity, reduced memory requirement and optimal latency. Various FHT designs could be studied, including versions of the popular radix-2 and split-radix [4] algorithms, but when transform lengths allow for comparison, the radix-4 FHT is more computationally efficient than the radix-2 FHT, its design more regular than that of the split-radix FHT, and it has the potential for an eightfold speed up with parallel computing equipment over that achievable via a purely sequential solution, making it a good candidate to pursue for potential hardware implementation. The radix-4 version of the FHT has therefore been selected as the algorithm of choice in this monograph.
38
3 The Discrete Hartley Transform X[0]
X[0]
X[1]
X[1]
-
X[2]
X[2]
-
X[3]
-
X[3]
-
Zero-address version of single-sized butterfly
X[0]
X[0] 2
X[1]
X[1]
-
X[2]
X[2]
2
X[3]
X[3]
Nyquist-address version of single-sized butterfly
Fig. 3.3 Signal flow graphs for single-sized butterfly for radix-4 FHT algorithm
trigonometric coefficients X[0] X[0]
X[1] -
-
-
X[1] X[2]
X[2]
X[3] -
-
-
X[3] X[4]
X[4]
X[5] -
-
-
X[5] X[6]
X[6]
X[7] -
-
-
Fig. 3.4 Signal flow graph for double-sized butterfly for radix-4 FHT algorithm
X[7]
3.8 Discussion
39
3.7 Accuracy Considerations When compared to a full-length FFT solution based upon one of the real-fromcomplex strategies, as discussed in Section 2.3 of the previous chapter, the FHT approach will involve approximately the same number of arithmetic operations (when the complex arithmetic operations of the FFT are reduced to equivalent real arithmetic operations) in order to obtain each real-data DFT output. The associated numerical errors may be due to both rounding, as introduced via the discarding of the lower order bits from the fixed-point multiplier outputs, and truncation, as introduced via the discarding of the least significant bit from the adder outputs after an overflow has occurred. The underlying characteristics of such errors for the two approaches will also be very similar, however, due to the similarity of their butterfly structures, so that when compared to FFT-based solutions possessing comparable arithmetic complexity the errors will inevitably be very similar [8, 11]. This feature of the FHT will be particularly relevant when dealing with a fixedpoint implementation, as is implied with any solution that is to be mapped onto an FPGA or ASIC device, where the combined effects of both truncation errors [9] and rounding errors [9] will need to be properly assessed and catered for through the optimum choice of word length and scaling strategy.
3.8 Discussion When the DHT is applied to the computation of the DFT, as discussed in Section 3.4, a conversion routine is required to map the DFT outputs from Fourier space to Hartley space. For the real-data case, as outlined in Fig. 3.1, the conversion process involves two real additions/subtractions for each DFT output together with two scaling operations, whilst for the complex-data case, as outlined in Fig. 3.2, this increases to four real additions/subtractions for each DFT output together with two scaling operations. All the scaling operations, however, are by a factor of 2 which in fixed-point arithmetic reduces to that of a simple right shift operation. Note that if the requirement is to use an FHT algorithm to compute the power spectral density (PSD) [3, 6], which is typically obtained from the squared magnitudes of the DFT outputs, then there is no need for the Hartley-space outputs to be first transformed to Fourier space, as the PSD may be computed directly from the Hartley-space outputs – an examination of Equations 3.12–3.14 should convince one of this. Also, it should be noted that with many of the specialized real-data FFT algorithms, apart from their lack of regularity, they also suffer from the fact that different algorithms are generally required for the fast computation of the DFT and its inverse. Clearly, in applications requiring transform-space processing followed by a return to the data space, as encountered for example with matched filtering, this could prove something of a disadvantage, particularly when compared to the adoption of a bilateral transform, such as the DHT, where the definitions of both the transform and its inverse, up to a scaling factor, are identical.
40
3 The Discrete Hartley Transform
References 1. R.N. Bracewell, The fast Hartley transform. Proc. IEEE 72(8) (1984) 2. R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986) 3. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988) 4. P. Duhamel, Implementations of split-radix FFT algorithms for complex, real and realsymmetric data. IEEE Trans. ASSP 34(2), 285–295 (1986) 5. P. Duhamel, M. Vetterli, Improved Fourier and Hartley transform algorithms: Application to cyclic convolution of real data. IEEE Trans. ASSP 35(6), 818–824 (1987) 6. D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982) 7. J.H. McClellan, C.M. Rader, Number Theory in Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1979) 8. J.B. Nitschke, G.A. Miller, Digital filtering in EEG/ERP analysis: Some technical and empirical comparisons. Behavior Res. Methods, Instrum. Comput. 30(1), 54–67 (1998) 9. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975) 10. H.V. Sorensen, D.L. Jones, C.S. Burrus, M.T. Heideman, On Computing the Discrete Hartley Transform. IEEE ASSP 33, 1231–1238 (1985) 11. A. Zakhor, A.V. Oppenheim, Quantization errors in the computation of the discrete Hartley transform. IEEE Trans. ASSP 35(11), 1592–1602 (1987)
Chapter 4
Derivation of the Regularized Fast Hartley Transform
Abstract This chapter discusses a new formulation of the FHT, referred to as the regularized FHT, which overcomes the limitations of existing FHT algorithms given the ultimate objective of mapping the solution onto silicon-based parallel computing equipment. A generic version of the double-sized butterfly, the GD-BFLY, is described which dispenses with the need for two sizes – and thus two separate designs – of butterfly as required via conventional fixed-radix formulations. Efficient schemes are also described for the storage, accession and generation of the trigonometric coefficients using suitably defined LUTs. A brief complexity analysis is then given in relation to existing FFT and FHT approaches to both the real-data and complex-data DFT problems. A discussion is finally provided relating to the results obtained in the chapter.
4.1 Introduction A drawback of conventional FHT algorithms, as highlighted in the previous chapter, lies in the need for two sizes – and thus two separate designs – of butterfly for efficient fixed-radix formulations. For the case of the radix-4 FHT to be discussed here, a single-sized butterfly, producing four outputs from four inputs, is required for both the zero-address and the Nyquist-address iterations of the relevant temporal stages, whilst a double-sized butterfly, producing eight outputs from eight inputs, is required for each of the remaining iterations. We look now at how this lack of regularity might be overcome, bearing in mind the desire, ultimately, to map the resulting algorithmic structure onto suitably defined parallel computing equipment: Statement of Performance Objective No 1: The aim is to produce a design for a generic double-sized butterfly for use by a radix-4 version of the FHT which lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with parallel computing technology.
Note that the attraction of the radix-4 solution, rather than that of the more familiar radix-2 case, is its greater computational efficiency – in terms of both reduced arithmetic complexity and reduced memory access – and the potential for exploiting K. Jones, The Regularized Fast Hartley Transform, Signals and Communications Technology, DOI 10.1007/978-90-481-3917-0 4, c Springer Science+Business Media B.V. 2010
41
42
4 Derivation of the Regularized Fast Hartley Transform
greater parallelism, at the arithmetic level, via the larger sized butterfly, thereby offering the possibility of achieving a higher computational density when implemented in silicon – to be discussed in Chapter 6.
4.2 Derivation of the Conventional Radix-4 Butterfly Equations The first step towards achieving this goal concerns the derivations of the two different sized butterflies – the single and the double – as required for efficient implementation of the radix-4 FHT. A DIT version is to be adopted given that the DIT algorithm is known to yield a slightly better signal-to-noise ratio (SNR) than the DIF algorithm when fixed-point processing is used [7, 10]. In fact, the noise variance of the DIF algorithm can be shown to be twice that of the DIT algorithm [10], so that the DIT algorithm offers the possibility of using shorter word lengths and ultimately less silicon for a given level of performance. The data re-ordering, in addition, is assumed to take place prior to the execution of the transform so that the data may be efficiently generated and stored in memory in the required di-bit reversed order directly from the output of the analog-to-digital conversion (ADC) unit at minimal expense. Let us first decompose the basic DHT expression as given by Equation 3.1 from the previous chapter – although in this instance without the scaling factor and with output vector X.H/ now replaced simply by X for ease of exposition – into four partial summations, such that XŒk D
N=41 X
xŒ4n:cas.2 .4n/k=N/
nD0
C
N=41 X
xŒ4n C 1:cas.2 .4n C 1/k=N/
nD0
C
N=41 X
xŒ4n C 2:cas.2 .4n C 2/k=N/
nD0
C
N=41 X
xŒ4n C 3:cas.2 .4n C 3/k=N/:
(4.1)
nD0
Suppose now that x1Œn D xŒ4n; x2Œn D xŒ4n C 1; x3Œn D xŒ4n C 2 & x4Œn D xŒ4n C 3 (4.2) and note from Equation 3.3 of the previous chapter that cas.2 .4n C r/k=N/ D cas.2 nk=.N=4/ C 2 rk=N/ D cos.2 rk=N/ :cas.2 nk=.N=4// C sin.2 rk=N/ :cas.2 nk=.N=4//
(4.3)
4.2 Derivation of the Conventional Radix-4 Butterfly Equations
43
and cas.2 nk=N/ D cas.2 n.N k/=N/:
(4.4)
Then if the partial summations of Equation 4.1 are written as X1Œk D
N=41 X
x1Œn:cas.2 nk=.N=4//
(4.5)
x2Œn:cas.2 nk=.N=4//
(4.6)
x3Œn:cas.2 nk=.N=4//
(4.7)
x4Œn:cas.2 nk=.N=4//;
(4.8)
nD0
X2Œk D
N=41 X nD0
X3Œk D
N=41 X nD0
X4Œk D
N=41 X nD0
it enables the equation to be re-written as XŒk D X1Œk C cos.2 k=N/:X2Œk C sin.2 k=N/:X2ŒN=4 k C cos.4 k=N/:X3Œk C sin.4 k=N/:X3ŒN=4 k C cos.6 k=N/:X4Œk C sin.6 k=N/:X4ŒN=4 k;
(4.9)
the first of the double-sized butterfly equations. Now, by exploiting the properties of Equations 4.3 and 4.4, the remaining doublesized butterfly equations may be written as XŒN=4 k D X1ŒN=4 k C sin.2 k=N/:X2ŒN=4 k C cos.2 k=N/:X2Œk cos.4 k=N/:X3ŒN=4 k C sin.4 k=N/:X3Œk sin.6 k=N/:X4ŒN=4 k cos.6 k=N/:X4Œk
(4.10)
XŒk C N=4 D X1Œk sin.2 k=N/:X2Œk C cos.2 k=N/:X2ŒN=4 k cos.4 k=N/:X3Œk sin.4 k=N/:X3ŒN=4 k C sin.6 k=N/:X4Œk cos.6 k=N/:X4ŒN=4 k
(4.11)
44
4 Derivation of the Regularized Fast Hartley Transform
XŒN=2 k D X1ŒN=4 k cos.2 k=N/:X2ŒN=4 k C sin.2 k=N/:X2Œk C cos.4 k=N/:X3ŒN=4 k sin.4 k=N/:X3Œk cos.6 k=N/:X4ŒN=4 k C sin.6 k=N/:X4Œk
(4.12)
XŒk C N = 2 D X1Œk cos.2 k=N/:X2Œk sin.2 k=N/:X2ŒN=4 k C cos.4 k=N/:X3Œk C sin.4 k=N/:X3ŒN=4 k cos.6 k=N/:X4Œk sin.6 k=N/:X4ŒN=4 k
(4.13)
XŒ3N=4 k D X1ŒN=4 k sin.2 k=N/:X2ŒN=4 k cos.2 k=N/:X2Œk cos.4 k=N/:X3ŒN=4 k C sin.4 k=N/:X3Œk C sin.6 k=N/:X4ŒN=4 k C cos.6 k=N/:X4Œk (4.14) XŒk C 3N=4 D X1Œk C sin.2 k=N/:X2Œk cos.2 k=N/:X2ŒN=4 k cos.4 k=N/:X3Œk sin.4 k=N/:X3ŒN=4 k sin.6 k=N/:X4Œk C cos.6 k=N/:X4ŒN=4 k
(4.15)
XŒN k D X1ŒN=4 k C cos.2 k=N/:X2ŒN=4 k sin.2 k=N/:X2Œk C cos.4 k=N/:X3ŒN=4 k sin.4 k=N/:X3Œk C cos.6 k=N/:X4ŒN=4 k sin.6 k=N/:X4Œk; (4.16) where N/4 is the length of the DHT output sub-sequences, fX1[k]g, fX2[k]g, fX3[k]g and fX4[k]g, and the parameter “k” varies from 1 up to N/8–1. When k D 0, which corresponds to the zero-address case, we obtain the singlesized butterfly equations XŒ0 D X1Œ0 C X2Œ0 C X3Œ0 C X4Œ0 XŒN=4 D X1Œ0 C X2Œ0 X3Œ0 X4Œ0
(4.17) (4.18)
XŒN=2 D X1Œ0 X2Œ0 C X3Œ0 X4Œ0 XŒ3N=4 D X1Œ0 X2Œ0 X3Œ0 C X4Œ0;
(4.19) (4.20)
and when k D N=8, which corresponds to the Nyquist-address case, we obtain the single-sized butterfly equations
4.3 Single-to-Double Conversion of the Radix-4 Butterfly Equations
p 2:X2ŒN=8 C X3ŒN=8 p XŒ3N=8 D X1ŒN=8 X3ŒN=8 C 2:X4ŒN=8 p XŒ5N=8 D X1ŒN=8 2:X2ŒN=8 C X3ŒN=8 p XŒ7N=8 D X1ŒN=8 X3ŒN=8 2:X4ŒN=8:
XŒN=8 D X1ŒN=8 C
45
(4.21) (4.22) (4.23) (4.24)
Thus, two different-sized butterflies are required for efficient computation of the DIT version of the radix-4 FHT, their SFGs being as given in Figs. 3.3 and 3.4 of the previous chapter. For the single-sized butterfly equations, the computation of each output involves the addition of at most four terms, whereas for the double-sized butterfly equations, the computation of each output involves the addition of seven terms. The resulting lack of regularity makes an attractive hardware implementation very difficult to achieve, therefore, without suitable reformulation of the associated equations.
4.3 Single-to-Double Conversion of the Radix-4 Butterfly Equations In order to derive a computationally-efficient single-design solution to the radix4 FHT, it is therefore necessary to regularize the algorithm structure by replacing the single and double sized butterflies with a generic version of the double-sized butterfly. Before this can be achieved, however, it is first necessary to show how the single-sized butterfly equations may be converted to the same form as that of the double-sized butterfly. When just the zero-address equations need to be carried out, it may be achieved via the interleaving of two sets, each of four equations, one set involving the consecutive samples fX1[0], X2[0], X3[0], X4[0]g, say, and the other set involving the consecutive samples fY1[0], Y2[0], Y3[0], Y4[0]g, say. This yields the modified butterfly equations XŒ0 D X1Œ0 C X2Œ0 C X3Œ0 C X4Œ0
(4.25)
YŒ0 D Y1Œ0 C Y2Œ0 C Y3Œ0 C Y4Œ0 XŒN=4 D X1Œ0 C X2Œ0 X3Œ0 X4Œ0
(4.26) (4.27)
YŒN=4 D Y1Œ0 C Y2Œ0 Y3Œ0 Y4Œ0 XŒN=2 D X1Œ0 X2Œ0 C X3Œ0 X4Œ0
(4.28) (4.29)
YŒN=2 D Y1Œ0 Y2Œ0 C Y3Œ0 Y4Œ0 XŒ3N=4 D X1Œ0 X2Œ0 X3Œ0 C X4Œ0
(4.30) (4.31)
YŒ3N=4 D Y1Œ0 Y2Œ0 Y3Œ0 C Y4Œ0;
(4.32)
with the associated double-sized butterfly being referred to as the “Type-I” butterfly.
46
4 Derivation of the Regularized Fast Hartley Transform
Similarly, when both the zero-address and Nyquist-address equations need to be carried out – which is always true when the Nyquist-address equations are required – they may be combined in the same fashion to yield the butterfly equations XŒ0 D X1Œ0 C X2Œ0 C X3Œ0 C X4Œ0 p XŒN=8 D X1ŒN=8 C 2:X2ŒN=8 C X3ŒN=8 XŒN=4 D X1Œ0 C X2Œ0 X3Œ0 X4Œ0 p XŒ3N=8 D X1ŒN=8 X3ŒN=8 C 2:X4ŒN=8 XŒN=2 D X1Œ0 X2Œ0 C X3Œ0 X4Œ0 p XŒ5N=8 D X1ŒN=8 2:X2ŒN=8 C X3ŒN=8 XŒ3N=4 D X1Œ0 X2Œ0 X3Œ0 C X4Œ0 p XŒ7N=8 D X1ŒN=8 X3ŒN=8 2:X4ŒN=8;
(4.33) (4.34) (4.35) (4.36) (4.37) (4.38) (4.39) (4.40)
with the associated double-sized butterfly being referred to as the “Type-II” butterfly. With the indexing assumed to start from zero, rather than one, the evenindexed equations thus correspond to the zero-address butterfly and the odd-indexed equations to the Nyquist-address butterfly. Thus, the sets of single-sized butterfly equations may be reformulated in such a way that the resulting composite butterflies now accept eight inputs and produce eight outputs, the same as the standard radix-4 double-sized butterfly, referred to as the “Type-III” butterfly. The result is that the radix-4 FHT, instead of requiring both single and double sized butterflies, may now be carried out instead with three simple variations of the double-sized butterfly.
4.4 Radix-4 Factorization of the FHT A radix-4 factorization of the FHT may be obtained in a straightforward fashion in terms of the double-sized butterfly equations through application of the familiar divide-and-conquer [6] principle, as used in the derivation of other fast discrete unitary and orthogonal transforms [4], such as the FFT. This factorization leads to the algorithm described by the pseudo-code of Fig. 4.1, where all instructions within the scope of the outermost “for” loop constitute a single iteration in the temporal domain and all instructions within the scope of the innermost “for” loop constitute a single iteration in the spatial domain. Thus, each iteration in the temporal domain, more commonly referred to as a “stage”, comprises N/8 iterations in the spatial domain, where each iteration corresponds to the execution of a single set of doublesized butterfly equations.
4.4 Radix-4 Factorization of the FHT Fig. 4.1 Pseudo-code for radix-4 factorization of FHT algorithm
47 // Set up transform length. N = 4α; // Di-bit reverse input data addresses. (in) X N =PΦ0 .xN; // Loop through log4 temporal stages. offset = 1; for (i = 0; i <α; i=i+1) { M = 8×offset; // Loop through N/8 spatial iterations. for ( j = 0; j < N; j=j+M) { for (k = 0; k < offset; k=k+1) // Carry out radix-4 double butterfly equations. { // Double Butterfly Routine: //
computes 8 outputs from 8 inputs (out)
XN
( (in)
)
M = f X N ,CM n,k,Sn,k n=0,1,2,3
} } offset = 2 (2i + ;1) }
The implication of the above definitions is that for the processing of a single data set a given stage may only be executed after its predecessor and before its successor, whereas every iteration of a given stage may in theory be executed simultaneously. Thus, each stage is time dependent and may only be executed sequentially, whereas if the data is available then the iterations within each stage may be executed in parallel. Note from the pseudo-code of Fig. 4.1 that “ˆ0 ” is the bijective mapping or permutation – with “Pˆ0 ” the associated permutation matrix – corresponding to the di-bit reversal mapping of the FHT input data addresses, whilst the double-sized butterfly section referred to in the pseudo-code makes use of cosinusoidal and sinusoidal terms, as given by CM n;k D cos.2 nk=M/ n D 0; 1; 2; 3
(4.41)
SM n;k D sin.2 nk=M/; n D 0; 1; 2; 3;
(4.42)
and respectively, the trigonometric coefficients defined in Chapter 1, which are each a function of the indices of the innermost and outermost loops. For the FHT factorization described here, the double-sized butterfly routine referred to in the pseudo-code implements either the Type-I butterfly of Equations 4.9 to 4.16, the Type-II butterfly of Equations 4.25–4.32, or the Type-III butterfly of Equations 4.33–4.40. As a result, the FHT appears to require a different SFG for each “Type” of double butterfly and so appears to lack at this stage the regularity
48
4 Derivation of the Regularized Fast Hartley Transform
necessary for an efficient mapping onto a single regular computational structure, as will be required for an efficient hardware implementation with parallel computing equipment.
4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly The first step towards addressing this problem is to reformulate the double-sized butterfly equations so that they may be expressed in a recursive closed-form fashion, as once this is achieved it will then be a simple task to show how the same SFG can be used to describe the operation of each of the Type-I, Type-II and Type-III doublesized butterflies. This first step is achieved through the introduction of the address permutations “ˆ1 ”, “ˆ2 ”, “ˆ3 ” and “ˆ4 ”, as defined in Table 4.1, and through the introduction of arithmetic redundancy into the processing via the use of the M trigonometric coefficients “EM n;k ” (for even-valued index “k”) and “On;k ” (for oddvalued index “k”), as defined in Table 4.2, where the cosinusoidal and sinusoidal M terms referred to in the table, “CM n;k ” and “Sn;k ”, are as given by Equations 4.41 and 4.42, respectively. Through the use of such operators and terms, it can be shown how the same set of arithmetic operations may be carried out upon the input data set for every instance of the double-sized butterfly, despite the fact that for certain of the Type-I
Table 4.1 Address permutations for generic double butterfly Input address 0 1 2 3 ˆ1 : Type D I, II 0 1 2 6 0 1 2 3 ˆ1 : Type D III 0 4 3 2 ˆ2 : Type D I, II 0 4 2 6 ˆ2 : Type D III 0 4 1 5 ˆ3 : Type D I, II 0 4 1 3 ˆ3 : Type D III 0 4 1 5 ˆ4 : Type D I, II, III
4 4 4 1 1 2 2 6
5 5 5 5 5 6 6 2
Table 4.2 Trigonometric coefficients for generic double butterfly Index m 0 1 2 3 4 M Em;k : Type D I 1 0 1 0 1 1 0 1 0 1 EM m;k : Type D II
5 0 0
EM m;k : Type D III OM m;k : Type D I OM m;k : Type D II
1 0 0
0 1 1
CM 1;k 0 0
SM 1;k 1 1
CM 2;k 0 0
SM 2;k 1 1
OM m;k : Type D III
0
1
SM 1;k
CM 1;k
SM 2;k
CM 2;k
6 3 6 6 3 3 7 3
7 7 7 7 7 7 5 7
6 1
7 0
1 p
1 p
0
SM 3;k 1
1 p
1 p
2 CM 3;k
2 SM 3;k
2
2
CM 3;k
4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
49
and Type-II cases the values of the set of trigonometric coefficients suggest that the multiplications are trivial and thus avoidable – that is, that one or more of the trigonometric coefficients belong to the set f 1; 0; C1g. The even-valued and odd-valued indices for the addressing of the input data to the double-sized butterfly are both arithmetic sequences and consequently generated very simply via the pseudo-code of Fig. 4.2, with the associated double-sized
if (i == 0) { // Set up 1st even and odd data indices for Type-I double butterfly. twice_offset = offset & index_even[0] = j & index_odd[0] = j + 4; // Set up address permutations for Type-I double butterfly.
Φn = Φn
(I,II)
n = 1,2,3,4
} else { twice_offset = 2 × offset; if (k == 0) { // Set up 1st even and odd data indices for Type-II double butterfly. index_even[0] = j & index_odd[0] = j + offset; //
Set up address permutations for Type-II double butterfly.
Φn = Φn
(I,II)
n = 1,2,3,4
} else { // Set up 1st even and odd data indices for Type-III double butterfly. index_even[0] = j + k & index_odd[0] = j + twice_offset – k; //
Set up address permutations for Type-III double butterfly.
Φn = Φn
(III)
n = 1,2,3,4
} //
} Set up remaining even and odd data indices for double butterfly. for (n = 1; n < 4; n=n+1) { index_even[n] = index_even[n-1] + twice_offset; index_odd[n] = index_odd[n-1] + twice_offset; }
Fig. 4.2 Pseudo-code for generation of data indices and address permutations
50
4 Derivation of the Regularized Fast Hartley Transform
butterfly – referred to hereafter as the generic double butterfly or “GD-BFLY” – being expressed via the pseudo-code of Fig. 4.3. The address permutations are dependent only upon the “Type” of GD-BFLY being executed, with just two slightly different versions being required for each of the first three permutations, and only one for the last permutation. The two versions of ˆ1 differ in just two (of the eight possible) exchanges whilst the two versions of ˆ2 and ˆ3 each differ in just three // Set up input data vector. for (n = 0; n < 4; n=n+1) { X[2n] = X(in)[index_even[n]] & X[2n+1] = X(in)[index_odd[n]]; } // Apply 1st address permutation. T
Y = PΦ1 .X //
st
Apply trigonometric coefficients and 1 set of additions/subtractions. for (n = 1; n < 4; n=n+1) { M
store = E2n,k × Y[2n] + E2n+1,k × Y[2n+1]; M M × Y[2n+1]; Y[2n+1] = O × Y[2n] _ O M
2n,k
2n+1,k
Y[2n] = store; } // Apply 2nd address permutation. T X = PΦ 2 .Y
// Apply 2nd set of additions/subtractions. for (n = 0; n < 4; n=n+1) {
//
store = X[2n]+X[2n+1] & X[2n+1]=X[2n] –X[2n+1] & X[2n] = store; } Apply 3rd address permutation. T
Y = PΦ 3 .X // Apply 3rd set of additions/subtractions. for (n = 0; n < 4; n=n+1) { store = Y[2n]+Y[2n+1] & Y[2n+1]=Y[2n] –Y[2n+1] & Y[2n] = store; } // Apply 4th address permutation. T
X = PΦ 4 .Y // Set up output data vector. for (n = 0; n < 4; n=n+1) { X(out)[index_even[n]]=X[2n] & X(out)[index_odd[n]]=X[2n+1]; }
Fig. 4.3 Pseudo-code for carrying out generic double butterfly
4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
51
(of the eight possible) exchanges, as evidenced from the contents of Table 4.1. The trigonometric coefficients, which as stated above include the trivial constants belonging to the set f 1; 0; C1g, are dependent also upon the value of the parameter “k” corresponding to the innermost loop of the pseudo-code of Fig. 4.1. An elegant and informative way of representing the four permutation mappings may be achieved by noting from the group-theoretic properties of the symmetric group [1] – which for order N is the set of all permutations of N objects – that any permutation can be expressed as a product of cyclic permutations [1] and that each such permutation can also be simply expressed as a product of transpositions [1]. As shorthand for describing a permutation, a cyclic notation is first introduced in order to describe how the factorization of a given permutation is achieved. With this notation, each element within parentheses is replaced by the element to its right with the last element being replaced by the first element in the set. Any element that replaces itself is omitted. Thus, the two versions of ˆ1 may be expressed as ˆ1 D .3; 6/
(4.43)
ˆ1 D .:/ ;
(4.44)
and the second version being the length eight identity mapping, the two versions of ˆ2 as ˆ2 D .1; 4/ .2; 3/ D .1; 4/ .3; 2/ (4.45) and ˆ2 D .1; 4/ .3; 6/ ;
(4.46)
the two versions of ˆ3 as ˆ3 D .1; 4; 2/ .3; 5; 6/ D .2; 1; 4/ .5; 6; 3/ D .2; 1/ .2; 4/ .5; 6/ .5; 3/
(4.47)
and ˆ3 D .1; 4; 2/ .5; 6; 7/ D .2; 1; 4/ .5; 6; 7/ D .2; 1/ .2; 4/ .5; 6/ .5; 7/ ;
(4.48)
and finally the single version of ˆ4 as ˆ4 D .1; 4; 6; 3; 5; 2/ D .3; 5; 2; 1; 4; 6/ D .3; 5/ .3; 2/ .3; 1/ .3; 4/ .3; 6/ :
(4.49)
From these compact representations – which are equivalent to those given in tabular form in Table 4.1 – both the commonalities and the differences between the two versions of each permutation are straightforwardly visualized, with each pair being distinguished by means of a single transposition whilst the common
52
4 Derivation of the Regularized Fast Hartley Transform
component (whether in terms of cyclic permutations or transpositions) is fixed and thus amenable to hard-wiring. The ordering of the transpositions has been adjusted in the above expressions so as to minimize the associated communication lengths involved in the exchanges. For ˆ1 the first version involves the application of a single transposition, involving addresses “3” and “6”, whilst for ˆ2 the two versions differ only in the final transposition involving the exchange of address “3” with either address “2” or address “6”, and for ˆ3 they differ only in terms of the final transposition involving the exchange of new address “5” (original address “6”) with either address “3” or address “7”. Notice that as the combined effect of the first four trigonometric coefficients – corresponding to indices m D 0 and m D 1 in Table 4.2 – for every instance of the GD-BFLY, is simply for the first two inputs to the GD-BFLY to pass directly through to the second permutation, then the first four multiplications and the associated pair of additions may be simply removed from the SFG of Fig. 3.4 shown in the previous chapter, to yield the SFG shown below in Fig. 4.4, this being obtained at the cost of slightly reduced regularity, at the arithmetic level, within the GD-BFLY. This results in the need for just 12 real multiplications for the GD-BFLY, rather than 16, whose trigonometric coefficient multiplicands may be obtained, through symmetry relations, from just six stored trigonometric coefficients: two each – both cosinusoidal and sinusoidal – for single-angle, double-angle and triple-angle cases. Also, the number of additions required prior to the second permutation reduces from eight to just six. Thus, the three “Types” of GD-BFLY each map efficiently onto the same regular computational structure, this structure being represented by a SFG consisting
trigonometric coefficients
_
_
_
_
_
_
_
_
Fig. 4.4 Signal flow graph for twelve-multiplier version of generic double butterfly
output data vector
_
Φ4
Address Permutation
_
Φ3
Address Permutation
Address Permutation
input data vector
_
Address Permutation
Φ2
Φ1
4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
53
of three stages of additive recursion, the first being preceded by a point-wise multiplication stage involving the trigonometric coefficients. Denoting the input and output data vectors by X.in/ and X.out/ , respectively, the operation of the GD-BFLY may thus be represented in a closed-form fashion by means of a multi-stage recursion, as given by the expression X.out/ D PTˆ4 : A3 : PTˆ3 : A2 : PTˆ2 : A1 : M1 : PTˆ1 :X.in/ (4.50) where “Pˆ1 ”, “Pˆ2 ”, “Pˆ3 ” and “Pˆ4 ” are the butterfly-dependent permutation matrices [1] associated with the address permutations “ˆ1 ”, “ˆ2 ”, “ˆ3 ” and “ˆ4 ”, respectively. Being orthogonal [1], whereby Pˆ :PTˆ D I8 – the matrix version of the length eight identity mapping – they may each be applied either side of an equation, such that Y D PTˆ :X Pˆ :Y D X; (4.51) where the superscript “T” denotes the transpose operator. The composite matrix “A1 :M1 ” is a butterfly-dependent 2 2 block diagonal matrix [1] containing the trigonometric coefficients (as defined from the contents of Table 4.2 with the first two terms fixed and equal to one), whilst “A2 ” and “A3 ” are fixed addition blocks, also expressed as 2 2 block diagonal matrices, such that 2
C1 0 6 0 C1 6 6 CE2 CE3 6 6 CO2 O3 6 A1 :M1 D 6 6 CE4 CE5 6 6 CO4 O5 6 4 CE6 CE7 CO6 O7 and
3 7 7 7 7 7 7 7; 7 7 7 7 5
(4.52)
3 C1 C1 7 6C1 1 7 6 7 6 C1 C1 7 6 7 6 C1 1 7 6 A2 D A3 D 6 7: 7 6 C1 C1 7 6 7 6 C1 1 7 6 4 C1 C15 C1 1
(4.53)
2
Note that as long as each data set for the GD-BFLY is accompanied by an appropriately set “Type” flag – indicating whether the current instance of the GDBFLY is of Type-I, Type-II or Type-III – then the correct versions of the first three permutators may be appropriately applied for any given instance of the GDBFLY. The reformulated equations, which were obtained through the introduction
54
4 Derivation of the Regularized Fast Hartley Transform
of arithmetic redundancy into the processing, thus correspond to a double butterfly which overcomes, in an elegant fashion, the loss of regularity associated with more conventional fixed-radix formulations of the FHT. The resulting radix-4 algorithm is henceforth referred to as the regularized FHT or “R24 FHT” [5], where the “R24 ” part of the expression is short for “Regularized Radix-4”.
4.5.1 Twelve-Multiplier Version of Generic Double Butterfly As evidenced from the SFG of Fig. 4.4, the GD-BFLY described above requires a total of 12 real multiplications and 22 real additions, whilst the effect of the permutators for a parallel solution is to reduce the communication topology to that of nearest neighbour for input to both the adders and the multipliers, with the data entering/leaving the arithmetic components in consecutive pairs. The only change to the operation of the GD-BFLY, from one instance to another, is in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each such permutation according to the particular “Type” of the GD-BFLY being executed – see the permutation definitions of Table 4.1. As a consequence, each instance of the twelve-multiplier version of the GDBFLY may be carried out using precisely the same components and represented by means of precisely the same SFG.
4.5.2 Nine-Multiplier Version of Generic Double Butterfly A lower-complexity version of the above GD-BFLY may be achieved by noting that each block of four multipliers and its associated two adders corresponds to the solution of a pair of bilinear forms [9], which can be optimally solved, in terms of multiplications, with just three multipliers – see the corresponding section of the SFG for the standard Type-III GD-BFLY in Fig. 4.5. This complexity reduction is achieved at the expense of three extra adders for the GD-BFLY and six extra adders for the generation of the trigonometric coefficients. The complete SFG for the resulting reduced-complexity solution is as shown in Fig. 4.6, from which it can be seen that the GD-BFLY now requires a total of nine real multiplications and 25 real additions. As with the twelve-multiplier version, there are minor changes to the operation of the GD-BFLY, from one instance to another, in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each such permutation according to the particular “Type” of the GD-BFLY being executed – see the permutation definitions of Table 4.1. Additional but minor changes are also required, however, to the operation of the stage of adders directly following the multipliers and to the ordering of the outputs from the resulting operations.
4.5 Closed-Form Expression for Generic Radix-4 Double Butterfly
55
Multiplication-addition block of standard Type-III double butterfly ~ a cos θ sin θ a ~ = b sin θ − cosθ b Multiplicative constants: c1 = cos θ+sin θ c2 = cos θ c3 = cos θ−sin θ c1 ~ a
a +
c2
– +
c3
–
~ b
b
Fig. 4.5 Reduced-complexity arithmetic block for set of bilinear forms Φ2
Φ3
± ± ±
_
_
_ Address Permutation
±
Address Permutation
Address Permutation
input data vector
_
Φ4
_
_
output data vector
trigonometric coefficients
Address Permutation
Φ1
± ±
_
_
Fig. 4.6 Signal flow graph for nine-multiplier version of generic double butterfly
For the first of the three sets of three multipliers, if the GD-BFLY is of TypeI or Type-II then each of the two adders performs addition on its two inputs and the ordering of the two outputs is the same as that of the two inputs, whilst if the GD-BFLY is of Type-III then each of the two adders performs subtraction on its two inputs and the ordering of the two outputs is the reverse of that of the two inputs.
56
4 Derivation of the Regularized Fast Hartley Transform
Similarly, for the second of the three sets of three multipliers, if the GD-BFLY is of Type-I or Type-II then each of the two adders performs addition on its two inputs and the ordering of the two outputs is the same as that of the two inputs, whilst if the GD-BFLY is of Type-III then each of the two adders performs subtraction on its two inputs and the ordering of the two outputs is the reverse of that of the two inputs. Finally, for the last of the three sets of three multipliers, if the GD-BFLY is of TypeI then each of the two adders performs addition on its two inputs and the ordering of the two outputs is the same as that of the two inputs, whilst if the GD-BFLY is of Type-II or Type-III then each of the two adders performs subtraction on its two inputs and the ordering of the two outputs is the reverse of that of the two inputs. Note that the reversal of each pair of outputs is straightforwardly achieved, as shown in Fig. 4.6, by means of a simple switch. As a consequence, each instance of the nine-multiplier version of the GD-BFLY may be carried out using precisely the same components and represented by means of precisely the same SFG.
4.6 Trigonometric Coefficient Storage, Accession and Generation An efficient implementation of the R24 FHT invariably requires an efficient mechanism for the storage and accession of the trigonometric coefficients required for feeding into each instance of the GD-BFLY. The requirement, more exactly, is that six non-trivial coefficients be either accessed from the CM or suitably generated on-the-fly in order to be able to carry out the necessary processing for any given data set. Referring to the definitions for the non-trivial cosinusoidal and sinusoidal terms, as given by Equations 4.41 and 4.42, respectively, if we put “ D N=M, where the parameters “M” and “N” are as defined in the pseudo-code of Fig. 4.1, then N CM n;k D cos.2 nk“=N/ D Cn;k“
for n D 1; 2; 3
(4.54)
N SM n;k D sin.2 nk“=N/ D Sn;k“ ;
for n D 1; 2; 3;
(4.55)
and enabling the terms to be straightforwardly addressed from suitably constructed LUTs via the parameters “n”, “k” and ““”. The total size requirement of the LUT can be minimized by exploiting the relationship between the cosinusoidal and sinusoidal functions, as given by the expression 1 cos.x/ D sin x C ; (4.56) 2 as well as the periodic nature of each, as given by the expressions sin.x C 2 / D sin.x/
(4.57)
4.6 Trigonometric Coefficient Storage, Accession and Generation
57
and sin.x C / D sin.x/ :
(4.58)
Two schemes are now outlined which enable a simple trade-off to be made between memory size and addressing complexity – as measured in terms of the number of arithmetic/logic operations required for computing the necessary addresses. These particular schemes will be later exploited, in Chapter 6, by the conflict-free and (for the data) in-place parallel memory addressing schemes developed for the efficient parallel computation of the R24 FHT.
4.6.1 Minimum-Arithmetic Addressing Scheme As already stated, the trigonometric coefficient set comprises both cosinusoidal and sinusoidal terms for single-angle, double-angle and triple-angle cases. To minimize the arithmetic/logic requirement for the generation of the addresses, the LUT may be sized according to a single-quadrant addressing scheme, whereby the trigonometric coefficients are read from a sampled version of the sinusoidal function with argument defined from 0 up to =2 radians. Thus, for the case of an N-point R24 FHT, it is required that the LUT be of size N/4 words yielding a total CM requirement, Aopt denoted CMEM , of 1 Aopt (4.59) CMEM D N 4 words. This scheme would seem to offer, therefore, a reasonable compromise between the CM requirement and the addressing complexity, using more than the theoretical minimum amount of memory required for the storage of the trigonometric coefficients so as to keep the arithmetic/logic requirement of the addressing as simple as possible.
4.6.2 Minimum-Memory Addressing Scheme Another approach to this problem is to adopt a two-level LUT, this comprising one coarse-resolution region of N/4L words for the sinusoidal function, covering 0 up to =2 radians, and one fine-resolution region of L words for each of the cosinusoidal and sinusoidal functions, covering 0 up to =2L radians. The required trigonometric coefficients may then be obtained from the contents of the two-level LUT through the application of one or other of the standard trigonometric identities cos.™ C ¥/ D cos.™/ : cos.¥/ sin.™/ : sin.¥/
(4.60)
sin.™ C ¥/ D sin.™/ : cos.¥/ C cos.™/ : sin.¥/ ;
(4.61)
and
58
4 Derivation of the Regularized Fast Hartley Transform
where “™” corresponds to the angle defined over the coarse-resolution region and “¥” to the angle defined over the fine-resolution region. By expressing the combined size of the two-level LUT for the sinusoidal function as ı f.L/ D N 4L C L (4.62) words, it can be seen that the optimum LUT size is obtained when ı df D 1 N 4L2 dL is set to zero, giving L D Mopt CMEM , of
(4.63)
p N=2 and resulting in a total CM requirement, denoted Mopt
CMEM D
3p N 2
(4.64)
p p words – N=2 for the coarse-resolution region and N=2 for each of the two fineresolution regions. This scheme therefore yields the theoretical minimum memory requirement for the storage of the trigonometric coefficients at the expense of an increased arithmetic/logic requirement for the associated addressing. The two-level LUT will actually be regardedphereafter as consisting of three separate complementary-angle LUTs, each of size N=2 words, rather than as a single LUT, as all three may need to be accessed simultaneously if an efficient parallel solution to the R24 FHT is to be achieved.
4.6.3 Trigonometric Coefficient Generation via Trigonometric Identities With both of the storage schemes discussed above, after deriving the single-angle trigonometric coefficients from the respective LUT(s), the double-angle and tripleangle trigonometric coefficients may then be straightforwardly obtained from the single-angle trigonometric coefficients through the application of the standard trigonometric identities cos.2™/ D 2: cos2 .™/ 1 sin.2™/ D 2: sin.™/ : cos.™/
(4.65) (4.66)
cos.3™/ D .2: cos.2™/ 1/ : cos.™/ sin.3™/ D .2: cos.2™/ C 1/ : sin.™/ ;
(4.67) (4.68)
and
4.7 Comparative Complexity Analysis with Existing FFT Designs
59
respectively, or alternatively, through the replication of the respective LUT(s) for each of the double-angle and triple-angle cases. This question will be discussed further in Chapter 6 in relation to the conflict-free and (for the data) in-place parallel memory addressing schemes.
4.7 Comparative Complexity Analysis with Existing FFT Designs This chapter has concerned itself with the detailed derivation of a regularized version of the DIT radix-4 FHT, referred to as the R24 FHT, the intention being to use the resulting algorithm for the efficient parallel computation of the real-data DFT. For most applications, the real-data DFT is still generally solved with a realfrom-complex strategy, as discussed in some detail in Chapter 2, whereby an N-point complex-data FFT simultaneously computes the outputs of two N-point real-data DFTs, or where the output of an N-point real-data DFT is obtained from the output of one N/2-point complex-data FFT. Such approaches, however, are adopted at the possible expense of increased memory, increased processing delay to allow for the acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. The class of specialized real-data FFTs discussed in Chapter 2 is also commonly used and although these algorithms compare favourably, in terms of operation counts and memory requirement, with those of the FHT, they suffer in terms of a loss of regularity and reduced flexibility in that different algorithms are required for the computation of the DFT and its inverse. The performance of the R24 FHT is therefore compared very briefly with those of the complex-data and real-data FFTs, as described in Chapter 2, together with the conventional non-regularized FHT [2,3]. Performance is evaluated for the computation of both real-data and complex-data DFTs, where the application of the FHT to complex-valued data is achieved very simply by processing separately the real and imaginary components of the data and additively combining the outputs to yield the complex-data DFT output – this was discussed in some detail in Section 3.4 of the previous chapter. The results are summarized in Table 4.3, where a single-PE architecture is assumed for each solution such that the PE is able to produce all the outputs for a single instance of the respective butterfly (there are two types for the standard non-regularized FHT) simultaneously via the exploitation of fine-grained parallelism at the arithmetic level – such architectural considerations are to be discussed in some depth in future chapters of the monograph. Such a performance may prove difficult (if not impossible) to attain for some approaches, however, as the third row of the table suggests that neither the N-point real-data FFT nor the standard non-regularized FHT lend themselves particularly well to parallelization. However, as can be seen from the table, the regularity/simplicity of the design and the bilateral nature of the algorithm make the R24 FHT an attractive solution compared to the class of real-data FFTs, whilst the reduced processing delay (for the real-data case) and reduced data memory/pin count requirement (for both the
Table 4.3 Algorithmic comparison for real-data and complex-data FFT designs Complex-data Algorithm N-point FFT Design regularity High No of butterfly designs 1 Parallelization High Arithmetic domain Complex field Arithmetic complexity O.N log4 N/ Time complexity O.N log4 N/ Data memory for N-point real-data DFT 2N Data memory for N-point complex-data DFT 2N Pin count for N-point real-data DFT 22N Pin count for N-point complex-data DFT 22N Processing delay for N-point real-data DFT 2D Applicable to forward & inverse DFTs Yes Additive complexity for unpacking of N-point real-data DFT N Additive complexity for unpacking of N-point complex-data DFT – Real-data N-point FFT Low 1 Low Complex field O.N log4 N/ O.N log4 N/ N – 2N – D No – –
Standard N-point FHT Low 2 Low Real field O.N log4 N/ O.N log4 N/ N N 2N 2N D Yes N 4N
Regularized N-point FHT High 1 High Real field O.N log4 N/ O.N log4 N/ N N 2N 2N D Yes N 4N
60 4 Derivation of the Regularized Fast Hartley Transform
4.8 Scaling Considerations for Fixed-Point Implementation
61
real-data and complex-data cases) offer additional advantages over the conventional complex-data FFT approach. The low memory requirement of the R24 FHT approach is particularly relevant for applications involving large transform lengths, as is the case with many wide bandwidth channelization problems, for example. Summarizing the results, the regularity of the design, combined with the ease of parallelization, nearest-neighbour communication topology at the arithmetic component level (as effected by the permutators) for a parallel solution, simplicity of the arithmetic components, optimum processing delay, low pin count and memory requirements, make the R24 FHT an extremely attractive candidate to pursue for possible realization in hardware with parallel computing equipment. The time and arithmetic complexities are shown to be of the same order for each solution considered, with the arithmetic complexity of the GD-BFLY being actually equivalent to that achievable for the butterfly of an optimally designed complex-data radix4 FFT algorithm [8], widely considered the most computationally attractive of all fixed-radix butterflies.
4.8 Scaling Considerations for Fixed-Point Implementation For a fixed-point implementation of the R24 FHT, as is the case of interest in this monograph, the registers available for holding the trigonometric coefficients and the data are of fixed length, whilst the register used for holding the outputs from the arithmetic operations (namely the accumulator), although of fixed length, is generally longer than those used for holding the trigonometric coefficients and the data. This additional length for the accumulator is to prevent the unnecessary loss of accuracy from rounding of the results following the arithmetic operations, as the multiplication of a K-bit word and a L-bit word yields a .K C L/-bit result, whilst the addition of two L-bit words yields a .L C 1/-bit result. When the trigonometric coefficients are each less than or equal to one, however, as they are for the R24 FHT, each multiplication will introduce no word growth, whereas the addition of any two terms following the multiplication stage may produce word growth of one bit. The maximum growth in magnitude through the GD-BFLY occurs when all the input samples possess equal magnitude and the rotation angle associated with the trigonometric coefficients is =4, the magnitude then growing by a factor of up to p 1 C 3 2 5:242. If the data register is fully occupied this will result in three bits of overflow. To prevent this, an unconditional scaling strategy could be applied whereby the data are right shifted by three bits prior to each stage of GD-BFLYs. However, apart from reducing the dynamic range of the data, such scaling introduces truncation error if the discarded bits are non-zero. The possibility of overflow would therefore be eliminated at the cost of unnecessary shifting of the data and a potentially large loss of accuracy. A more accurate approach would be to adopt a conditional scaling strategy, namely the block floating-point technique [8], whereby the data are shifted only when overflow occurs. The block floating-point mechanism comprises two parts.
62
4 Derivation of the Regularized Fast Hartley Transform
The output part calculates the maximum magnitude of the output data for the current stage of GD-BFLYs, from which a scaling factor is derived as a reference value for the input scaling of the next stage of GD-BFLYs. The input part receives the scaling factor generated by the previous stage of GD-BFLYs, so that the number of bits to be right shifted for the current input data set will be based on the scaling factor provided. Therefore, the data overflow and the precision of the integer operations are controlled automatically by the block floating-point mechanism, which provides information not only for the word growth of the current stage of GD-BFLYs but also for the word growth of all the previous stages. Such scaling, however, is far more complex to implement than that of unconditional scaling. An alternative to the above two approaches is to allow the data registers to possess a limited number of guard bits to cater for some or all of the word growth, such that the scaling strategy need only cater for limited word growth, rather than for the worst case. The performance of such a scheme, however, as with that of unconditional scaling, will always yield a sub-optimal performance – in terms of accuracy and dynamic range – when compared to that achievable by the conditional block floating-point scheme.
4.9 Discussion To summarize the situation so far, a new formulation of the radix-4 FHT has been derived, referred to as the regularized FHT or R24 FHT, whereby the major limitation of existing fixed-radix FHT designs, namely the lack of regularity arising from the need for two sizes – and thus two separate designs – of butterfly, has been overcome. It remains now to see how easily the resulting structure lends itself to mapping onto parallel computing equipment, bearing in mind that the ultimate requirement is to derive an area-efficient solution for power-constrained applications, such as mobile communications, where parallelism will need to be fully and efficiently exploited in order that the required throughput rates are attained. There is reason to be optimistic in the endeavor in that the large size of the GD-BFLY, which results in it being able to produce eight outputs from eight inputs, offers the promise of an eightfold speed up with parallel computing equipment over that achievable via a purely sequential solution, whilst the arithmetic requirements of the GD-BFLY as indicated from its SFG suggest that it could well lend itself to internal pipelining, with each CS of the pipeline being made up from various combinations of the arithmetic components (adders and multipliers) and permutators of which the GD-BFLY is composed. Note that the radix-4 butterfly used for the standard formulation of the radix-4 FFT is sometimes referred to in the technical literature as a dragonfly, rather than a butterfly, due to its resemblance to the said insect – a radix-8 butterfly may also be referred to as a spider for the same reason. Finally, it should be noted that the property of symmetry has been exploited not only to minimize the number of arithmetic operations required by both FFT and FHT algorithms, through the regular nature of the respective decompositions, but
References
63
also to minimize the memory requirement through the nature of the fundamental function from which the associated transform kernels are derived, namely the sinusoidal function. The basic properties of this function – together with that of its complementary function the cosinusoid – are as described by Equations 4.56–4.58 given earlier in the chapter, with the sinusoid being an even-symmetric function relative to any odd-integer multiple of the argument =2 and an odd-symmetric function relative to any even-integer multiple of =2, whilst the cosinusoid is an even-symmetric function relative to any even-integer multiple of the argument =2 and an odd-symmetric function relative to any odd-integer multiple of =2. That is, they are each either even-symmetric or odd-symmetric according to whether the axis of symmetry is an appropriately chosen multiple of =2.
References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10.
G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977) R.N. Bracewell, The fast Hartley. Transform. Proc. IEEE 72(8) (1984) R.N. Bracewell, The Hartley Transform (Oxford University Press, New York, 1986) D.F. Elliott, K. Ramamohan Rao, Fast Transforms: Algorithms, Analyses, Applications (Academic, New York, 1982) K.J. Jones, Design and parallel computation of regularised fast Hartley transform. IEE Proc. Vision, Image Signal Process. 153(1), 70–78 (2006) L. Kronsjo, Computational Complexity of Sequential and Parallel Algorithms (Wiley, New York, 1985) Y. Li, Z. Wang, J. Ruan, K. Dai, A low-power globally synchronous locally asynchronous FFT processor. HPCC 2007, LNCS 4782, 168–179 (2007) L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975) S. Winograd, Arithmetic Complexity of Computations (SIAM, Philadelphia, PA, 1980) A. Zakhor, A.V. Oppenheim, Quantization errors in the computation of the discrete Hartley transform. IEEE Trans. ASSP 35(11), 1592–1602 (1987)
Chapter 5
Algorithm Design for Hardware-Based Computing Technologies
Abstract This chapter first provides a brief discussion of the fundamental properties of both FPGA and ASIC devices, together with their relative merits, before analyzing the various design techniques and parameters – namely those relating to the clock frequency, silicon area and switching frequency – and the constraints and trade-offs that need to be made between them when trying to design a low-power solution to the regularized FHT for implementation with such technologies. The benefits of incorporating scalability, partitioned-memory processing and flexibility into the design of the proposed solution are considered as well as the design options available for silicon-based implementation when constrained by the limited availability of embedded resources. A discussion is finally provided relating to the results obtained in the chapter.
5.1 Introduction The type of high-performance parallel computing equipment typified by the increasingly powerful silicon-based FPGA and ASIC technologies [2] now gives design engineers far greater flexibility and control over the type of algorithm that may be used in the building of high-performance DSP systems, so that more appropriate hardware solutions to the problem of solving the real-data DFT may be actively sought and exploited to some advantage with these silicon-based technologies. With such technologies, however, it is no longer adequate to view the FFT complexity purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – adders and fast multipliers – and multiple banks of fast random access memory (RAM) in order to enhance the FFT performance via its parallel computation. As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms. With the recent and explosive growth of wireless technology, and in particular that of mobile communications, where a small battery may be the only source of power supply for long periods of time, algorithms are now being designed subject to new and often conflicting performance criteria where the ideal is either to maximize the throughput (that is, to minimize the update time) or satisfy
some constraint on the latency, whilst at the same time minimizing the required silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget. Note, however, that the throughput is also constrained by the I/O speed, as the algorithm cannot process the data faster than it can access it. To be able to produce such a solution, however, it is first necessary to identify the relevant parameters [2] involved in the design process and then to outline the constraints and trade-offs that need to be made between them. This chapter first looks therefore into those design techniques and parameters – namely those that relate to the clock frequency, silicon area and switching frequency – that need to be considered for the design of a low-power solution to the R24 FHT. The aim, bearing in mind the target application area of mobile communications, is to obtain a solution that optimizes the use of the available silicon resources on the target device whilst keeping the associated power consumption to within the available budget and, in so doing, to maximize the achievable computational density – as defined in Chapter 1. The particular benefits of incorporating scalability, partitioned-memory processing and flexibility into the design of the solution are then discussed as well as the design options available for silicon-based implementation when constrained by the limited availability of embedded resources.
5.2 The Fundamental Properties of FPGA and ASIC Devices An FPGA device [2] is an integrated circuit that contains configurable or programmable blocks of logic along with configurable interconnections between the blocks. DSP design engineers are able to configure or program such devices to perform a wide variety of signal processing tasks with most modern devices offering the facility for repeated re-programming. An ASIC device [2], on the other hand, is custom-designed to address a specific application and as such is able to offer the ultimate solution in terms of size (the number of transistors), complexity and performance, where performance is typically measured in terms of computational density. Designing and building an ASIC is an extremely time-consuming and expensive process, however, with the added disadvantage that the final design is frozen in silicon [2] and cannot be modified without creating a new version of the device. The ASIC is often referred to as being fine-grained because ultimately it is implemented at the level of the primitive logic gates, whereas the FPGA is often referred to as being coarse-grained because it is physically realized using higher-level blocks of programmable logic. Therefore, in order to enhance the capabilities and the competitiveness of the FPGA, manufacturers are now providing embedded resources, such as fast multipliers and banks of fast RAM with dedicated arithmetic routing, which are considerably smaller, faster and more power efficient than when implemented in programmable logic by the user. These features, when coupled with the massive parallelism on offer, enable the FPGA to outperform the fastest of the conventional uni-processor DSP devices by two or even three orders of magnitude.
The system-level attractions of the FPGA are its flexibility and cost-effectiveness for low-volume price-sensitive applications, whilst the additional circuit-level benefits of reduced delay, area and power consumption are known to be even more pronounced with ASIC technology which, as stated above, will always yield optimum performance when that performance is to be measured in terms of computational density. The cost of an FPGA, as one would expect, is much lower than that of an ASIC. At the same time, implementing design changes is also much easier, with the time-to-market for such designs being considerably shorter. This means that the FPGA allows the design engineer to realize software and hardware concepts on an FPGA-based test platform without having to incur the enormous costs associated with ASIC designs. Therefore, high-performance FFT designs, even when ultimately targeted at an ASIC implementation, will generally for reasons of ease, time and cost be developed and tested on an FPGA. For the analysis carried out in this and future chapters, emphasis is placed on the implementation of arithmetic units with FPGA technology where the target device family is the popular Virtex-II Pro as produced by Xilinx Inc. [6] of the USA. Although this device family may be somewhat old, its use is only intended to facilitate comparison between the different types of PE or computing architecture proposed for the parallel computation of the R24 FHT – with real-world applications it is not always possible, for various practical/financial reasons, to have access to the latest device technologies. The FPGA is actually made up of a number of configurable logic blocks (CLBs), which provide one with both logic and storage. Each CLB is made up of a number of “slices”, two for the case of a Virtex-II Pro device, with each logic slice containing two look-up tables (LUTs). Each LUT can in turn be configured as a 16 × 1-bit synchronous RAM or read-only memory (ROM), the former being more commonly referred to as distributed RAM.
5.3 Low-Power Design Techniques Over the past decade or so, power consumption has grown from a secondary to a major constraint in the design of hardware-based DSP solutions. In portable applications, such as mobile communications, low power consumption has long been the main design constraint, due in part to the increasing cost of cooling and packaging, but also to the resulting rise in on-chip temperature, which in turn results in reduced reliability. The result is that the identification and application of low-power techniques, at both arithmetic and algorithmic levels, are crucial to the specification of an achievable design in silicon that meets with the required power-related performance objectives. The power consumption associated with the silicon-based implementation of a high performance DSP algorithm, such as the R24 FHT, comprises both “static” and “dynamic” components. The dynamic component has until recently dominated the total power consumption, although as the devices become ever bigger and
ever more powerful, the contribution of the static (or quiescent) component to the total power consumption is becoming increasingly significant. Given our hardware-efficient objectives, however, we restrict our attention here to the dynamic component, denoted P_D, which may be expressed as

P_D = C · V² · f    (5.1)

where “C” is the capacitance of the switching nodes, “V” is the supply voltage and “f” the switching frequency. This component is primarily driven by the clock frequency of the device, the silicon area required for its implementation – which is determined by the size of the arithmetic unit, the total memory requirement and the data routing – and the average switching rate of the individual circuits in each clock cycle. These items are now discussed in more detail in order that a suitable design strategy might be identified.
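To make the clock-frequency trade-off discussed in the following subsections concrete, the short Python sketch below (the capacitance, voltage and frequency figures are purely illustrative assumptions, not values from this text) evaluates Equation 5.1 for a single fast PE and for an eight-way parallel equivalent of the same throughput:

```python
# Minimal sketch of Equation 5.1; all component values are illustrative
# assumptions rather than measured figures for any real device.

def dynamic_power(c, v, f):
    """P_D = C * V^2 * f, with C in farads, V in volts and f in hertz."""
    return c * v ** 2 * f

# One PE clocked at 200 MHz versus eight PEs (roughly eight times the
# switched capacitance) clocked at 25 MHz: the throughput and the dynamic
# power are the same, but the slower clock leaves headroom to lower the
# supply voltage, which enters the equation quadratically.
p_single   = dynamic_power(1e-9, 1.2, 200e6)   # 0.288 W
p_parallel = dynamic_power(8e-9, 1.2, 25e6)    # 0.288 W
p_low_volt = dynamic_power(8e-9, 1.0, 25e6)    # 0.200 W, roughly 31% lower
print(p_single, p_parallel, p_low_volt)
```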
5.3.1 Clock Frequency To achieve high throughput with a hardware-based solution to the DSP-based problem of interest, the clock frequency is typically traded off against parallelism, with the choice of solution ranging from that based upon the use of a single processor, driven to a potentially very high clock frequency, to that based upon the use of multiple processors, typically operating concurrently via pipelining of the algorithms, which combine to achieve the required performance but with a potentially much reduced clock frequency. For the particular problem of interest in this monograph, the parallelism can be exploited at both arithmetic level, in terms of the fine-grain parallelism of the GD-BFLY, and algorithmic level, in terms of the coarse-grain parallelism of the resulting R24 FHT, with pipelining techniques being the most power-efficient means of achieving parallelism due to the associated nearest-neighbour communication requirement. Given the strong dependence of power consumption on clock frequency, there is clearly great attraction in being able to keep the clock frequency as low as possible for the implementation of the R24 FHT provided the resulting solution is able to meet the required performance objectives relating to throughput. To achieve this, however, it is necessary that an appropriate parallelization scheme be defined, such a scheme being typically based upon one of the two pipelining schemes outlined above, which will additionally impact upon the silicon area requirement, as now discussed.
5.3.2 Silicon Area Suppose that the R24 FHT is of length N, where

N = 4^α    (5.2)

with “α”, the radix exponent corresponding to N, thus representing the number of temporal stages required by the algorithm. High-performance solutions may be obtained through coarse-grain algorithmic parallelization by adopting an α-stage computational pipeline, as shown in Fig. 5.1, where each computational stage is assigned its own PE and double-buffered memory. But this means that the amount of silicon required by the R24 FHT will be both dependent upon and proportional to the size of the transform to be computed, as is the case with most commercially-available intellectual property (IP) core designs. A solution based upon a globally-pipelined multi-PE architecture such as this achieves O(N) time complexity at the cost of O(log₄ N) space complexity, where space complexity refers loosely to the total silicon area requirement. Alternatively, with a single-PE architecture, as shown in Fig. 5.2, high performance may be achieved for the R24 FHT through fine-grain PE-level arithmetic parallelization based upon internal or local pipelining of the PE. However, the success of this scheme relies heavily, if adequate throughput is to be achieved, upon the appropriate partitioning and storage of the data and trigonometric coefficients
[Fig. 5.1 Multi-PE architecture for radix-4 version of regularized FHT: the input data passes through a pipeline of α PEs (PE No 1 to PE No α), each executing N/8 radix-4 GD-BFLYs with its own trigonometric coefficient memory (CM) and data memory (DM), before the output data emerges from the final stage. CM – trigonometric coefficient memory; DM – data memory]

[Fig. 5.2 Single-PE architecture for radix-4 version of regularized FHT: the input data is written to the data memory, the single parallel PE (radix-4 generic double butterfly) loops through α × N/8 radix-4 GD-BFLYs, drawing upon the trigonometric coefficient memory, before the output data is read out]
in partitioned memory so that multiple samples/coefficients may be accessed and (for the data) updated in parallel from their respective memory banks. Optimal efficiency also requires that the processing for each instance of the GD-BFLY be carried out in an in-place fashion so that the memory requirement may be kept to a minimum. When such a solution is possible, the result is both area-efficient and scalable in terms of transform length, with space complexity – apart from the memory requirement – being independent of the size of the transform to be computed. Such a solution achieves O(N·log₄ N) time complexity which, when the I/O requires N clock cycles, ensures continuous real-time operation for α ≤ 8, and thus for N ≤ 64K, at the cost of O(1) space complexity. The greater the area efficiency, therefore, the lower the achievable throughput, as one would expect, so that the ultimate choice of solution will be very much dependent upon the timing constraint, if any, to be imposed upon the problem, as will be discussed in the following chapter. Note that the word “scalable”, which has already been used a few times in this monograph and may mean different things in different contexts, simply refers to the ease with which the sought-after solution may be modified in order to accommodate increasing or decreasing transform sizes – this is discussed further in Section 5.4.1.
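The stated limit follows directly from the cycle counts above, namely one GD-BFLY per clock cycle and (N/8)·log₄N GD-BFLYs per transform, set against the N clock cycles taken by the I/O; the following one-line derivation (added here to make the reasoning explicit) recovers the 64K bound:

\[
\frac{N}{8}\log_4 N \le N
\;\Longleftrightarrow\;
\log_4 N \le 8
\;\Longleftrightarrow\;
N \le 4^{8} = 65{,}536 = 64\mathrm{K}.
\]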
5.3.3 Switching Frequency Another important factor affecting power consumption is the switching power, which relates to the number of times that a gate makes a logic transition, 0 → 1 or 1 → 0, in each clock cycle. Within the arithmetic unit, for example, when one of the inputs is constant, as with the case of the pre-computed trigonometric coefficients, it is possible to use the pre-computed values to reduce the number of logic transitions involved, when compared to a conventional fast multiplier solution, and thus to reduce the associated switching power. With a parallel distributed-arithmetic (DA) unit [4], for example, it is possible to reduce both switching power and silicon area for the implementation of the arithmetic components at the expense of increased memory for the storage of pre-computed sums or inner products [1], whereas with a parallel CORDIC arithmetic unit [3] it is possible to eliminate the CM requirement and the associated power-hungry memory accesses, which also involve switching activity, at the minimal expense of increased arithmetic and control logic for the on-the-fly generation of the trigonometric coefficients within each stage of the CORDIC pipeline.
5.4 Proposed Hardware Design Strategy Having discussed very briefly the key parameters relevant to the production of a low-power silicon-based solution to the R24 FHT, a design strategy is now outlined to assist in achieving such a solution. One should bear in mind, however,
that in using the proposed solution for real-world applications, where the available silicon resources may vary considerably from one application to another, it would be advantageous to be able to define a small number of variations of the basic PE design whereby the appropriate choice of design would enable one to optimize the use of the available silicon resources on the target device so as to obtain a solution that maximizes the achievable computational density.
5.4.1 Scalability of Design The first property of the sought-after solution to be considered is that of scalability, as referred to above in the discussion on silicon area. A desirable feature of our solution is that it should be easily adapted, for new applications, at minimal redesign effort and cost. This may in part be achieved by making the solution scalable, in terms of transform length, such that the same single-PE computing architecture may be used for each new application with the hardware requirements remaining essentially unaltered as the transform length is increased or decreased, other than the varying memory requirement necessary to cater for the varying amounts of data and trigonometric coefficients. Such an approach, in turn, plays a key role in keeping the power consumption to within the available budget. The consequence of using such a strategy is that as the transform length N is increased, the silicon area is kept essentially constant at the expense of an increased update time (the elapsed time between the production of each new real-data FFT output data set) and increased latency (the elapsed time involved in the production of a real-data FFT output data set from its associated input data set), where the latency increases according to the number of times the GD-BFLY is executed per transform, namely (N/8)·log₄N. However, if the required performance dictates simply that the latency satisfy the timing constraint imposed by the I/O requirement – namely the processing of N samples in N clock cycles – then the property of scalability looks to be an extremely attractive mechanism for achieving an area-efficient solution to the R24 FHT, particularly when implemented with silicon-based parallel computing equipment. Note that if the requirement were to keep the update time constant, then it would be necessary either to increase the clock frequency and/or to increase the hardware requirements, in line with the increasing transform size, using a multi-PE architecture – either way, this would in turn result in a significant increase in both cost and power consumption.
5.4.2 Partitioned-Memory Processing An additional requirement arising from the property of scalability, as already indicated, is that relating to the need for the data and the trigonometric coefficients
to be appropriately partitioned and stored in partitioned memory so that multiple samples/coefficients may be accessed and (for the data) updated in parallel from their respective memory banks. The resulting combination of scalability of design and partitioned-memory processing, if it could be achieved, would yield a solution that was both area-efficient and capable of high throughput, and which would be able, for all transform lengths of interest (except for pathologically large cases), to satisfy the latency constraint arising from the I/O requirement. An additional attraction of such processing is that the adoption of partitioned memory, rather than that of a single global memory, results in a further reduction in power consumption [5].
5.4.3 Flexibility of Design The final property of the solution to be considered is that of flexibility, whereby the best possible use might be made of the available silicon resources when the solution is applied to new applications. This is achieved with the provision of a few variations of the basic PE design, each exploiting the same computing architecture, where the variations enable one to select a specific design according to the particular silicon resources available on the target device. Such flexibility has already been implied in the results of Sections 4.5 and 4.6 of the previous chapter, where both nine-multiplier and twelve-multiplier versions of the GD-BFLY were considered together with different CM addressing schemes, one of which minimized the arithmetic complexity at the cost of an increased CM requirement and the other minimizing the CM requirement at the cost of increased arithmetic complexity. A different type of flexibility relates to the available arithmetic precision, as provided by the arithmetic unit. Different signal processing applications involving the use of an FFT may require very different processing functions in order to carry out the necessary tasks and often different levels of precision for each such function. The FFT may well be fed directly by means of an ADC unit, for example, so that the word length of the data into and out of the FFT will be dictated both by the capability of the ADC and by the dynamic range requirements of the processing functions into which the FFT feeds. For the design to have truly universal application, therefore, it would be beneficial that the arithmetic unit should be easily adapted to cater for arbitrary arithmetic precision processing, including those applications where the requirements are not adequately addressed through the use of embedded resources, so that different levels of accuracy may be achieved for different applications without having to alter the basic design of the PE. Such flexibility in terms of the arithmetic precision may be achieved via the use of a pipelined CORDIC arithmetic unit, as discussed in some depth in Chapter 7, where increased precision may be obtained by simply increasing the length of the associated computational pipeline – noting that the CORDIC stages are identical – at the expense of a proportionate increase in latency.
5.5 Constraints on Available Resources As already discussed in Section 1.7 of Chapter 1, when producing electronic equipment, whether for commercial or military use, one is seldom blessed with the option of using the latest state-of-the-art device technology. As a result, there are situations where there would be great merit in having designs that are not totally reliant on the availability of the increasingly large quantities of expensive embedded resources, such as fast multipliers and fast memory, as provided by the manufacturers of the latest silicon-based devices, but are sufficiently flexible to lend themselves to implementation in silicon even when constrained by the limited availability of embedded resources. A problem may arise in practice, for example, when the length of the transform to be computed is very large compared to the capability of the target device such that there are insufficient embedded resources to enable a successful mapping of the transform onto the device. In such a situation, where the use of a larger and more powerful device is simply not an option, it is thus required that some means be found of facilitating a successful mapping onto the available device and one way of achieving this is through the design of a more appropriate arithmetic unit, namely one which does not rely too heavily upon the use of embedded resources. As with the requirement for flexible-precision processing, this may be achieved via the use of a pipelined CORDIC arithmetic unit, to be discussed in Chapter 7, which can be shown to effectively eliminate the requirement for both fast fixed-point multipliers and fast RAM for the trigonometric coefficients.
5.6 Assessing the Resource Requirements Given the device-independent nature of the R24 FHT design(s) sought in this monograph, a somewhat theoretical approach has been adopted for assessing the resource requirements for its implementation in silicon, this assessment being based on the determination of the individual requirements, measured in logic slices, for addressing both the arithmetic complexity and the memory requirement. Such an approach can only tell part of the story, however, as the amount of logic required for controlling the operation and interaction of the various components (which ideally are manufacturer-supplied embedded components for optimal size and power efficiency) of the design is rather more difficult (if not impossible) to assess, if considered in isolation from the actual hardware design process, due in part to the automated and somewhat unpredictable nature of that process, as outlined below. Typically, after designing and implementing the hardware design in an HDL there is a multi-stage process to go through before the design is ready for use in an FPGA. The first stage is synthesis, which takes the HDL code and translates it into a “netlist”, which is simply a textual description of a circuit diagram or schematic. This is followed by a simulation which verifies that the design specified in the netlist functions correctly. Once verified, the netlist is translated into a binary format, the
components and connections that it defines are then mapped to CLBs, before the design is finally placed and routed to fit onto the target device. A second simulation is then performed to help establish how well the design has been placed and routed before a “configuration” file is generated to enable the design to be loaded onto the FPGA. The reality, after this process has been gone through, is that the actual logic requirement will invariably be somewhat greater than predicted by theory, due to the inefficient and unpredictable use made of the available resources in meeting the various design constraints. This situation is true for any design considered, however, so that in carrying out a comparative analysis of different FHT or FFT designs the same inefficiencies will inevitably apply to each. Such an overhead in the logic requirement needs to be borne in mind, therefore, when actually assessing whether a particular device has sufficient resources to meet the given task.
5.7 Discussion This chapter has looked very briefly into those design techniques and parameters that need to be considered and traded off in order to achieve a low-power solution to the R24 FHT, where the sought-after solution is required to be able to optimize the use of the available silicon resources on the target device so as to obtain a solution that maximizes the achievable computational density. This involved a discussion of the benefits of incorporating scalability, partitioned-memory processing and flexibility into the design of the solution and of the design options available for silicon-based implementation when constrained by the limited availability of embedded resources. Clearly, if silicon-based designs can be produced that minimize the requirement for such embedded resources, then smaller lower-complexity devices might be used, rather than those at the top end of the device range, as is commonly the case, thus minimizing the financial cost of implementation. The less the reliance on the use of embedded resources the greater the flexibility in the choice of target hardware. It remains now to see how a suitable computing architecture might be defined which enables the attractions of the hardware-based technologies discussed in this chapter to be effectively exploited for the parallel computation of the R24 FHT, the derivation of which was discussed in some considerable detail in the previous chapter. In doing so, it would be advantageous to offer a choice of PE designs which range from providing optimality in terms of the arithmetic complexity to optimality in terms of the memory requirement, as this would provide the user with the ability to optimize the design of the PE for each new application according to the resources available on the target device. The single-PE architecture discussed in this chapter would be particularly attractive for the case of the R24 FHT given that the associated computing engine, the GD-BFLY, produces eight outputs from eight inputs, so that a parallel solution would offer a theoretical eightfold speed up over a purely sequential solution. This would necessitate the data and trigonometric coefficients being appropriately partitioned and stored in partitioned memory so that multiple samples/coefficients may
be accessed and (for the data) updated in parallel from their respective memory banks – this would in turn result in a further decrease of the power consumption. Being able to place the memory close to where the processing is actually taking place would in addition eliminate the need for long on-chip communication paths which can result in long processing delays and increased power consumption.
References

1. G. Birkhoff, S. MacLane, A Survey of Modern Algebra (Macmillan, New York, 1977)
2. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes (Elsevier), 2004)
3. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Elect. Comput. EC-8(3), 330–334 (1959)
4. S.A. White, Application of distributed arithmetic to digital signal processing: a tutorial review. IEEE ASSP Mag. 4–19 (1989)
5. T. Widhe, J. Melander, L. Wanhammar, Design of efficient radix-8 butterfly PE for VLSI. Proc. IEEE Int. Symp. Circuits Syst., Hong Kong (1997)
6. Xilinx Inc., company and product information available at company web site: www.xilinx.com
Chapter 6
Derivation of Area-Efficient and Scalable Parallel Architecture
Abstract This chapter discusses a partitioned-memory single-PE computing architecture for the parallel computation of the regularized FHT which seeks to maximize the computational density – that is, the throughput per unit area of silicon – when implemented with a silicon-based parallel computing device. A pipelined implementation of the GD-BFLY is discussed together with conflict-free and (for the data) in-place parallel memory addressing schemes for both the data and the trigonometric coefficients which enable the outputs from each instance of the GD-BFLY to be produced within a single clock cycle. Four versions of the solution are discussed which enable trade-offs to be made of arithmetic complexity against memory requirement according to the resources available on the target device. An FPGA implementation of the regularized FHT is then discussed and its performance compared with two commercially-available solutions. A discussion is finally provided relating to the results obtained in the chapter.
6.1 Introduction A point has now been reached whereby an attractive formulation of the FHT algorithm has been produced, namely the R24 FHT, whilst those properties required of such an algorithm and of its associated computing architecture for the achievement of an optimal mapping onto silicon-based parallel computing equipment – as typified by the FPGA and the ASIC – have also been outlined. The question now to be addressed is whether such a mapping can be found, bearing in mind that the ultimate objective appears to be demanding a “squaring of the circle”, namely that of maximizing the computational throughput whilst at the same time minimizing the required quantities of silicon resources so as to reduce both power consumption and cost – see the definition “Performance Metric for Silicon-Based Parallel Computing Device” in Section 1.8 of Chapter 1. To make the objective more concrete, therefore, it is perhaps worth putting it into the following words:
Statement of Performance Objective No 2: The aim is to produce a scalable solution to the regularized FHT – and hence to the real-data FFT – in silicon that leads to the execution of a length-N transform, where N caters for all transform lengths of practical interest, in N clock cycles or less, this performance being subject to the constraint that the least amount of silicon resources be used.
Although other metrics could of course be used for this definition, this particular metric – which for the latency-constrained version of the real-data DFT problem looks for that solution incurring the lowest possible silicon cost which thus equates to maximizing the computational density – is targeted specifically at the type of power-constrained environment that one would expect to encounter with mobile communications, as it is assumed that a solution that yields a high computational density will be attractive in terms of both power consumption and hardware efficiency, given the known influence of silicon area on the power consumption – as discussed in the previous chapter. The restriction to its execution being completed in N clock cycles or less is to ensure continuous operation (whereby the processing rate of the R24 FHT is able to keep up with the speed of the I/O over each block of data) and is valid for transform lengths up to and including 64K provided that the outputs from each instance of the GD-BFLY can be produced within a single clock cycle. For transform lengths longer than this, therefore, it would not be possible to sustain continuous operation with a single PE, so that a multi-PE solution with at least two PEs and a suitable architecture would be required – this point is taken up again later in the chapter. Also, stating the objective in this way, it is possible to ensure that any parallel solution to the R24 FHT, if found, will possess those properties necessary for an attractive hardware implementation, although it will also be necessary that a proper comparison be made both of the time complexity, as given by the latency, and of the required silicon resources of any such solution with those of existing commercially available industry-standard FFT devices – although, as already stated, most if not all such commercially-available solutions will almost invariably involve the computation of the conventional complex-data radix-2 version of the FFT.
6.2 Single-PE Versus Multi-PE Architectures Two types of parallel computing architecture were briefly discussed in Section 5.3.2 of the previous chapter, one based upon the adoption of multiple PEs and the other upon the adoption of a single PE, where the multi-PE architecture achieves the required computational throughput via algorithm-level pipelining and the other via arithmetic-level pipelining within the PE itself. The globally-pipelined multi-PE architecture thus lends itself naturally to streaming operation – which generally takes naturally-ordered input data and produces digit-reversed output data – whereby the data samples are processed as soon as they arrive at the first PE in the pipeline, whilst the locally-pipelined single-PE architecture lends itself more naturally to block-based or burst operation – which generally takes digit-reversed input data
and produces naturally-ordered output data – whereby all the data samples must first be generated and stored before they can be processed. The single-PE architecture certainly looks to offer the most promise for the problem under consideration, bearing in mind the power-constrained environment associated with the target application area of mobile communications, but in order to achieve the required computational throughput it will be necessary that the memory, for both the data and the trigonometric coefficients, be suitably organized. The memory structure should be such that the data and trigonometric coefficients required for the execution of the GD-BFLY may be accessed simultaneously, without conflict, thereby enabling the outputs from each instance of the GD-BFLY to be produced within a single clock cycle. In order to achieve this it is necessary that the memory be organized according to the partitioned-memory architecture of Fig. 6.1, where the topology of the data routing network is shown in the form of an H-Tree so as to keep the communication paths between the PE and each memory bank of equal length – although in reality, when mapping such designs onto an FPGA, one no longer has any control over such matters. The input/output data is partitioned or distributed over eight memory banks and the trigonometric coefficients over three memory banks, so that suitable parallel addressing schemes need now to be defined which ideally enable one data sample to be read from (and written to) each DM bank every clock cycle, in an in-place fashion and without conflict, and two trigonometric coefficients to be read from each CM bank every clock cycle [3], again without conflict. Such addressing schemes, for both the data and the trigonometric coefficients, are now discussed in some detail.
[Fig. 6.1 Partitioned-memory single-PE architecture for regularized FHT: the eight data memory banks M1D–M8D and the three trigonometric coefficient memory banks M1C–M3C connect, via an H-tree routing network, to the generic radix-4 double butterfly – eight data paths (×8) to/from the DM banks, six coefficient reads (×6) from the CM banks into the trigonometric coefficient generator, and nine coefficients (×9) delivered to the butterfly – under the control of the address generator. MnD – nth data memory bank; MnC – nth trigonometric coefficient memory bank]
6.3 Conflict-Free Parallel Memory Addressing Schemes The partitioned-memory addressing schemes described here for the R24 FHT are based upon the assumption that the memories are dual-port. Such memory is assumed to have four data ports, two for the data inputs and two for the data outputs, although there is only one address input for each input/output data pair. As a result, each memory bank is able to cater for either two simultaneous reads, as required for the case of the CM, two simultaneous writes, or one simultaneous read and write using separate read and write addresses. These read/write options will be shown to be sufficient for the addressing requirements of both the CM, which requires two simultaneous reads per clock cycle, and the DM, which for the implementation discussed in this monograph will be shown to need all three options. With regard to the DM, the addressing scheme is also to be regarded as being in-place as the outputs of each instance of the GD-BFLY are to be ultimately written back to the same memory locations from which the GD-BFLY inputs were accessed.
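The read/write combinations just described are easy to capture in a small behavioural model (an illustrative sketch added here, not from the original text; the pre-write read ordering is an assumption of the model rather than a statement about any particular device):

```python
class DualPortBank:
    """Behavioural model of one dual-port memory bank: per clock cycle it can
    service two reads, two writes, or one read plus one write."""

    def __init__(self, depth):
        self.mem = [0] * depth

    def cycle(self, reads=(), writes=()):
        # Only two address ports are available per clock cycle.
        assert len(reads) + len(writes) <= 2, "at most two accesses per cycle"
        out = [self.mem[a] for a in reads]   # reads assumed to see pre-write data
        for a, v in writes:
            self.mem[a] = v
        return out
```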
6.3.1 Data Storage and Accession The GD-BFLY, for Type-I, Type-II and Type-III cases, as described in Chapter 4, requires that eight data samples be read from and written to the DM, in an in-place fashion, in order to be able to carry out the processing for a given data set. One way for this to be achieved is if the eight samples to be processed by the GD-BFLY are stored with one sample in each DM bank, so that all eight DM banks are used for each instance of the GD-BFLY. Another way, given the availability of dual-port memory, would be to have two samples in each of four DM banks with alternate sets of DM banks being used on alternate sets of data. The problem is addressed here by adopting suitably modified versions of the rotation-based radix-4 memory mapping, Ψ₄, as given by:

Definition of Mapping for Data Memory Addressing:

Ψ₄(n, α) = [ ( Σ_{k=1}^{α} ( (n mod 4^k) >> 2(k−1) ) ) mod 4 ] << 1    (6.1)

so that Ψ₄(n, α) ∈ {0, 2, 4, 6}, where the parameter “n” ∈ {0, 1, …, N−1} corresponds to the sample address after di-bit reversal and “α” is the radix exponent corresponding to the transform length “N”, i.e. where α = log₄ N.
The symbols “>>” and “<<” correspond to the binary right-shift and left-shift operations, respectively, which together with the familiar modulo operation, abbreviated here to “mod”, may be straightforwardly and cost-effectively implemented in programmable logic. Introducing now the function “Θ” for representing the DM bank addresses, the initial/final data to/from the R24 FHT may be written/read to/from the DM banks according to:
Definition of Mapping for Pre-FHT and Post-FHT Addressing:

Θ₁(n, α) = Ψ₄(n, α) + (n mod 2)    (6.2)

so that Θ₁(n, α) ∈ {0, 1, …, 7}.
Note that this mapping also holds true for the DM accesses made from within the R24 FHT for the execution of the first stage of GD-BFLYs, where all eight of the DM banks are utilized, whilst those for the remaining stages are carried out according to:

Definition of Mapping for Double Butterfly Addressing:

Θ₂(k, n, α) = Ψ₄(n, α) + (k mod 2)    (6.3)

so that Θ₂(k, n, α) ∈ {0, 1, …, 7}, where the parameter “k” ∈ {0, 1, …, (N/8)−1} corresponds to the GD-BFLY execution number for the current temporal stage.
Thus, from Equation 6.3 above, if the execution number “k” for the current temporal stage is an even-valued integer, then the DM banks required for that particular instance of the GD-BFLY will be the four even-addressed banks, whilst if the execution number is an odd-valued integer, the DM banks required will be the four odd-addressed banks. Having determined the DM bank to which a particular sample belongs, its location within that DM bank may then be straightforwardly obtained, via the function Φ, according to:

Definition of Mapping for Address Offset:

Φ(n) = n >> 3    (6.4)

so that Φ(n) ∈ {0, 1, …, (N/8)−1}, where the parameter “n” ∈ {0, 1, …, N−1} corresponds to the sample address after di-bit reversal.
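The four mappings above are simple enough to be captured in a few lines of code. The following Python sketch (an illustrative model added here, not part of the original text, with function names mirroring the symbols of Equations 6.1–6.4) reproduces the bank addresses of Tables 6.1–6.3 below and confirms that consecutive butterflies occupy complementary halves of the eight DM banks:

```python
def psi4(n, alpha):
    """Generic address mapping of Equation 6.1; returns a value in {0, 2, 4, 6}."""
    total = sum((n % 4 ** k) >> (2 * (k - 1)) for k in range(1, alpha + 1))
    return (total % 4) << 1

def theta1(n, alpha):
    """Pre-/post-FHT DM bank address of Equation 6.2; a value in {0, ..., 7}."""
    return psi4(n, alpha) + (n % 2)

def theta2(k, n, alpha):
    """GD-BFLY DM bank address of Equation 6.3 for butterfly number k."""
    return psi4(n, alpha) + (k % 2)

def phi(n):
    """Address offset of Equation 6.4 within the selected DM bank."""
    return n >> 3

alpha = 3                                 # N = 64, as in Tables 6.1-6.3
for row in range(4):                      # first four rows of Table 6.1
    print([psi4(4 * row + col, alpha) for col in range(4)])
# -> [0, 2, 4, 6], [2, 4, 6, 0], [4, 6, 0, 2], [6, 0, 2, 4]

# Butterflies 0 and 1 of a stage (samples n = 0..7 and n = 8..15) use
# disjoint bank sets: the even- and odd-addressed halves respectively.
banks_0 = {theta2(0, n, alpha) for n in range(0, 8)}     # {0, 2, 4, 6}
banks_1 = {theta2(1, n, alpha) for n in range(8, 16)}    # {1, 3, 5, 7}
assert banks_0.isdisjoint(banks_1)
```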
To better understand the workings of these rotation-based memory mappings for the storage/accession of the data, it is best to first visualize the data as being stored in a two-dimensional array of four columns and N/4 rows, where the data is stored on a row-by-row basis, with four samples to a row. The effect of the generic address mapping, Ψ₄, as shown in the example given in Table 6.1 below, is to apply a left-sense rotation to each row of data where the amount of rotation is dependent upon the particular (N/4) × 4 sub-array to which it belongs, as well as the particular (N/16) × 4 sub-array within that sub-array, as well as the particular (N/64) × 4 sub-array within that sub-array, etc., until all the relevant partitions have been accounted for – there are log₄ N of these. As a result, there is a cyclic rotation being applied to the data over each such sub-array – the cyclic nature of the mapping means that within each sub-array the amount of rotation to be applied to a given row of data is one position greater than that for the preceding row. This property, as will later be seen, may be beneficially exploited by the GD-BFLY through the way in which it stores/accesses
Table 6.1 Structure of generic address mapping Ψ₄ for case of length-64 data set

Row   Value of generic address mapping Ψ₄
 0    0 2 4 6
 1    2 4 6 0
 2    4 6 0 2
 3    6 0 2 4
 4    2 4 6 0
 5    4 6 0 2
 6    6 0 2 4
 7    0 2 4 6
 8    4 6 0 2
 9    6 0 2 4
10    0 2 4 6
11    2 4 6 0
12    6 0 2 4
13    0 2 4 6
14    2 4 6 0
15    4 6 0 2

Table 6.2 Structure of address mapping Θ₁ for case of length-64 data set

Row   Value of address mapping Θ₁
 0    0 3 4 7
 1    2 5 6 1
 2    4 7 0 3
 3    6 1 2 5
 4    2 5 6 1
 5    4 7 0 3
 6    6 1 2 5
 7    0 3 4 7
 8    4 7 0 3
 9    6 1 2 5
10    0 3 4 7
11    2 5 6 1
12    6 1 2 5
13    0 3 4 7
14    2 5 6 1
15    4 7 0 3
the elements of the input/output data sets, for both individual instances of the GD-BFLY, via the address mapping Θ₁, as well as for consecutive pairs of instances, via the address mapping Θ₂, over all eight memory banks. Examples of the address mappings Θ₁ and Θ₂ are given in Tables 6.2 and 6.3, respectively, where each pair of consecutive rows of bank addresses corresponds to the locations of a complete GD-BFLY input/output data set. Suppose now, for ease of exposition, that the arithmetic within the GD-BFLY can be assumed to be carried out fast enough to allow for the data sets processed by the GD-BFLY to be both read from and written back to DM within a single clock cycle – this is not of course actually achievable and a more realistic scenario is to be discussed later in Section 6.4 when the concept of internal pipelining within the PE is introduced.
Table 6.3 Structure of address mapping Θ₂ for case of length-64 data set

Row   Value of address mapping Θ₂
 0    0 2 4 6
 1    2 4 6 0
 2    5 7 1 3
 3    7 1 3 5
 4    2 4 6 0
 5    4 6 0 2
 6    7 1 3 5
 7    1 3 5 7
 8    4 6 0 2
 9    6 0 2 4
10    1 3 5 7
11    3 5 7 1
12    6 0 2 4
13    0 2 4 6
14    3 5 7 1
15    5 7 1 3
The input/output data set to/from the GD-BFLY comprises four even-address samples and four odd-address samples, where for a given instance of the GD-BFLY for the first temporal stage, each of the eight DM banks will contain just one sample, as required, whilst for a given instance of the GD-BFLY for the remaining α−1 temporal stages, four of the eight DM banks will each contain one even-address sample and one odd-address sample with the remaining four DM banks being unused. As a result, it is generally not possible to carry out all eight reads/writes for the same data set using all eight DM banks in a single clock cycle. However, if for all but the first temporal stage we consider any pair of consecutive instances of the GD-BFLY, then it may be shown that the sample addresses of the second instance will occupy the four DM banks not utilized by the first, so that every two clock cycles the eight even-address samples and the eight odd-address samples required by the pair of consecutive instances of the GD-BFLY may be both read from and written to DM, as required for conflict-free and in-place memory addressing – see Fig. 6.2 below. Thus, based upon our simplistic assumption, all eight DM banks for the first temporal stage may be both read from and written to within a single clock cycle, whilst for the remaining α−1 temporal stages it can be shown that in any one clock cycle all the input samples for one instance of the GD-BFLY may be both read from DM and processed by the GD-BFLY, whilst all those output samples produced by its predecessor may be written back to DM. As a result, the solution based upon the single-PE R24 FHT architecture will be able to yield complete GD-BFLY output sets at the rate of one set per clock cycle, as required. An alternative way of handling the pipelining for the last α−1 temporal stages would be to read just four samples for the first clock cycle, with one sample from each of the four even-addressed memory banks. This would be followed by eight samples for each succeeding clock cycle apart from the last, with four samples for the current instance of the GD-BFLY being read from the four
[Fig. 6.2 Addressing of hypothetical pair of consecutive generic double butterflies for all stages other than first: across the eight memory banks (0–7), one butterfly of the pair takes its even-address and odd-address samples from the four even-addressed banks whilst the other takes them from the four odd-addressed banks, two samples per bank. ES – even-address sample; OS – odd-address sample]
even-addressed/odd-addressed memory banks and four samples for the succeeding instance of the GD-BFLY being read from the remaining four odd-addressed/even-addressed memory banks. The processing would be completed by reading just four samples for the last clock cycle, with one sample from each of the four odd-addressed memory banks. In this way, for each clock cycle apart from the first and the last, eight samples could be read/written from/to all eight memory banks, one sample per memory bank, with one complete set of eight GD-BFLY outputs being thus produced and another partly produced, to be completed on the succeeding clock cycle. Note, however, that a temporary buffer would be needed to hold one complete GD-BFLY output set as the samples written back to memory would also need to come from consecutive GD-BFLY output sets, rather than from a single GD-BFLY output set, due to the dual-port nature of the memory. For the last clock cycle, the remaining set of eight GD-BFLY outputs could also be written out to all eight memory banks, again one sample per memory bank. The choice of how best to carry out the pipelining is really down to the individual HDL programmer, but for the purposes of consistency within the current monograph, it will be assumed that all the samples required for a given instance of the GD-BFLY are to be read from the DM within the same clock cycle, two samples per even-addressed/odd-addressed memory bank as originally described, so that all the input samples for one instance of the GD-BFLY may be both read from DM and processed by the GD-BFLY, whilst all those output samples produced by its predecessor are written back to DM.
6.3.2 Trigonometric Coefficient Storage, Accession and Generation Turning now to the trigonometric coefficients, the GD-BFLY, as described in Chapter 4, requires that six non-trivial trigonometric coefficients be either accessed
from CM or efficiently generated in order to be able to carry out the GD-BFLY processing for a given data set. Two schemes are now outlined for performing this task whereby all six trigonometric coefficients may be accessed simultaneously, within a single clock cycle, these schemes offering a straightforward trade-off of memory requirement against addressing complexity – as measured in terms of the number of arithmetic/logic operations required for computing the necessary addresses. The two schemes considered cater for those extremes whereby the requirement is either to minimize the arithmetic complexity or to minimize the CM requirement. Clearly, other options that fall between these two extremes are also possible, but these may be easily defined and developed given an understanding of the techniques discussed here and in Section 4.6 of Chapter 4.
6.3.2.1 Minimum-Arithmetic Addressing Scheme The trigonometric coefficient set comprises cosinusoidal and sinusoidal terms for single-angle, double-angle and triple-angle cases. Therefore, in order for all six trigonometric coefficients to be obtained simultaneously, three LUTs may be exploited with the two single-angle coefficients being read from the first LUT, the two double-angle coefficients from the second LUT, and the two triple-angle coefficients from the third LUT. To keep the arithmetic complexity of the addressing to a minimum each LUT may be defined as in Section 4.6.1 of Chapter 4, being sized according to the single-quadrant addressing scheme, whereby the trigonometric coefficients are read from a sampled version of the sinusoidal function with argument defined from 0 up to π/2 radians. Thus, for the case of an N-point R24 FHT, it is required that each of the three single-quadrant LUTs be of size N/4 words, yielding a total CM requirement, denoted C_MEM^(Aopt), of

C_MEM^(Aopt) = (3/4)·N    (6.5)

words. This scheme would seem to offer a reasonable compromise, therefore, between the CM requirement and the addressing complexity, using more memory than is theoretically necessary, in terms of replicated LUTs, in order to keep the arithmetic/logic requirement of the addressing as simple as possible – namely, a zero arithmetic complexity when using the twelve-multiplier version of the GD-BFLY or six additions when using the nine-multiplier version.
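To make the single-quadrant scheme concrete, the following Python sketch (a behavioural model added here under the assumptions that angles are represented as integer indices m with θ = 2πm/N, and that each LUT holds sin(2πk/N) for k = 0, …, N/4 − 1) shows how both sinusoidal and cosinusoidal values, for the single, double and triple angles, can be recovered from quarter-wave symmetry alone:

```python
import math

def make_quadrant_lut(N):
    """One single-quadrant LUT of N/4 words: sin over [0, pi/2)."""
    return [math.sin(2.0 * math.pi * k / N) for k in range(N // 4)]

def lut_sin(lut, m, N):
    """sin(2*pi*m/N) recovered from the N/4-word LUT by quarter-wave symmetry."""
    q, r = divmod(m % N, N // 4)        # quadrant and offset within it
    if q % 2 == 0:
        mag = lut[r]                    # rising quarter: direct read
    else:
        mag = 1.0 if r == 0 else lut[N // 4 - r]   # falling quarter: mirrored read
    return mag if q < 2 else -mag

def lut_cos(lut, m, N):
    return lut_sin(lut, m + N // 4, N)  # cos(x) = sin(x + pi/2)

# The six GD-BFLY coefficients; in hardware each angle multiple n = 1, 2, 3
# would read its own replicated LUT so that all six arrive in one clock cycle.
def coefficients(lut, m, N):
    return [(lut_sin(lut, n * m, N), lut_cos(lut, n * m, N)) for n in (1, 2, 3)]

N = 64
lut = make_quadrant_lut(N)
assert abs(lut_sin(lut, 3, N) - math.sin(2 * math.pi * 3 / N)) < 1e-12
```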
6.3.2.2 Minimum-Memory Addressing Scheme Another approach to the problem is to adopt a two-level LUT for the first of the three angles, where the associated complementary-angle LUTs are as defined in Section 4.6.2 of Chapter 4, comprising one coarse-resolution region of √N/2 words for the sinusoidal function, and one fine-resolution region of √N/2 words for each of the cosinusoidal and sinusoidal functions. To keep the CM requirement to a minimum, the double-angle and triple-angle trigonometric coefficients are then obtained straightforwardly through the application of standard trigonometric identities, as given by Equations 4.65–4.68 of Chapter 4, so that the solution requires that three complementary-angle LUTs be used for just the single-angle trigonometric coefficient case, each LUT of size √N/2 words, yielding a total CM requirement, denoted C_MEM^(Mopt), of

C_MEM^(Mopt) = (3/2)·√N    (6.6)

words. The double-angle and triple-angle trigonometric coefficients could also be obtained by assigning a two-level LUT to the storage of each, but the associated arithmetic complexity involved in generating the addresses turns out to be identical to that obtained when the trigonometric coefficients are obtained through the direct application of standard trigonometric identities, so that in this instance the replication of the two-level LUT provides us with three times the memory requirement but with no arithmetic advantage as compensation. With the proposed technique, therefore, the CM requirement, as given by Equation 6.6, is minimized at the expense of additional arithmetic/logic for the addressing – namely, an arithmetic complexity of seven multiplications and eight additions when using the twelve-multiplier version of the GD-BFLY or seven multiplications and 14 additions when using the nine-multiplier version.
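A behavioural sketch of this scheme follows (an illustrative decomposition added here, not the book's exact circuit: the derived pairs below cost six multiplications, while the operation counts quoted above also cover reconstruction of the single-angle pair from the coarse/fine LUT reads, so the two tallies are not directly comparable):

```python
def single_angle(sc, cc, sf, cf):
    """sin/cos of (coarse + fine) from the coarse- and fine-region LUT reads."""
    return (sc * cf + cc * sf, cc * cf - sc * sf)

def derived_angles(s1, c1):
    """Double- and triple-angle pairs from the single-angle pair via identities."""
    s2 = 2.0 * s1 * c1          # sin(2a) = 2.sin(a).cos(a)
    c2 = 1.0 - 2.0 * s1 * s1    # cos(2a) = 1 - 2.sin^2(a)
    s3 = s2 * c1 + c2 * s1      # sin(3a) = sin(2a).cos(a) + cos(2a).sin(a)
    c3 = c2 * c1 - s2 * s1      # cos(3a) = cos(2a).cos(a) - sin(2a).sin(a)
    return (s2, c2, s3, c3)
```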
6.3.2.3 Summary of Addressing Schemes The results of this section are summarized in Table 6.4 below, where the CM requirement and arithmetic complexity for each of the conflict-free parallel addressing schemes are given. A trade-off has clearly to be made between CM requirement and arithmetic complexity, with the choice being ultimately made according to the resources available on the target hardware. Versions I and II of the solution to the R24 FHT correspond to the adoption of the minimum-arithmetic addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively, whilst Versions III and IV correspond to the adoption of the minimum-memory addressing scheme for the twelve-multiplier and nine-multiplier PEs, respectively. The trigonometric coefficient accession/generation schemes required for Versions I to IV of the above solution are illustrated via Figs. 6.3–6.6, respectively, with the associated arithmetic complexity for the addressing given by zero when using Version I of the R24 FHT solution, six additions when using Version II, seven multiplications and eight additions when using Version III, and seven multiplications and 14 additions when using Version IV.
Table 6.4 Performance/resource comparison for fast multiplier versions of N-point regularized FHT

                               Version I        Version II       Version III         Version IV
Arithmetic complexity
  Coefficient generator:
    multipliers                0                0                7                   7
    adders                     0                6                8                   14
  Processing element:
    multipliers                12               9                12                  9
    adders                     22               25               22                  25
Memory requirement (words)
  Data                         8×(N/8) = N      8×(N/8) = N      8×(N/8) = N         8×(N/8) = N
  Coefficients                 3×(N/4) = 3N/4   3×(N/4) = 3N/4   3×(√N/2) = (3/2)√N  3×(√N/2) = (3/2)√N
Time complexity (clock cycles)
  Update time                  (N/8)·log₄N      (N/8)·log₄N      (N/8)·log₄N         (N/8)·log₄N
  Latency                      (N/8)·log₄N      (N/8)·log₄N      (N/8)·log₄N         (N/8)·log₄N
[Fig. 6.3 Resources required for trigonometric coefficient accession/generation for Version I of solution with one-level LUTs: the nine coefficients D1–D9 are read directly as Sn = sin(nθ) and Cn = cos(nθ) from LUT[n], n = 1, 2, 3, each LUT of N/4 words]

[Fig. 6.4 Resources required for trigonometric coefficient accession/generation for Version II of solution with one-level LUTs: the values Sn = sin(nθ) and Cn = cos(nθ) are read from LUT[n], each LUT of N/4 words, and combined to form the nine coefficients D1–D9]
Note that with the minimum-memory addressing scheme of Figs. 6.5 and 6.6 pipelining will certainly need to be introduced so as to ensure that a complete new set of trigonometric coefficients is available for input to the GD-BFLY for each new clock cycle.
[Fig. 6.5 Resources required for trigonometric coefficient accession/generation for Version III of solution with two-level LUT: S1 = sin(α) and C1 = cos(α) are read from LUT[1], S2 = sin(β) from LUT[2] and C2 = cos(β) from LUT[3], each LUT of √N/2 words, with delay-matched stages combining them into the coefficient groups D1–D3, D4–D6 and D7–D9 – pipelining required to maintain computational throughput]
[Fig. 6.6 Resources required for trigonometric coefficient accession/generation for Version IV of solution with two-level LUT: as for Fig. 6.5, with S1 = sin(α), C1 = cos(α), S2 = sin(β) and C2 = cos(β) read from LUT[1]–LUT[3], each of √N/2 words, and additional delay-matched stages forming D1–D3, D4–D6 and D7–D9 – pipelining required to maintain computational throughput]
6.4 Design of Pipelined PE for Single-PE Architecture To exploit the multi-bank memories and LUTs, together with the associated conflict-free and (for the data) in-place parallel memory addressing schemes, the PE needs now to be able to produce one complete GD-BFLY output set per clock cycle, as discussed in Section 6.3.1, bearing in mind that although, for the first temporal stage,
all eight DM banks can be both read from and written to within the same clock cycle, for the remaining temporal stages, only those four DM banks not currently being read from may be written to (and vice versa).
6.4.1 Internal Pipelining of Generic Double Butterfly The above constraint suggests that a suitable PE design may be achieved if the GD-BFLY is carried out by means of a β-stage computational pipeline, as shown in the simple example of Fig. 6.7, where “β” is an odd-valued integer and where each CS of the pipeline contains its own set of storage registers for holding the current set of processed samples. In this way, if a start-up delay of D_CG clock cycles is required for a pipelined version of the trigonometric coefficient generator and D_PE clock cycles for a pipelined version of the PE, where

D_PE = β − 1,    (6.7)

then after a total start-up delay of D_SU clock cycles for the first temporal stage of the processing, where

D_SU = D_CG + D_PE,    (6.8)

the PE will be able to read in eight samples and write out eight samples every clock cycle, thereby enabling the first temporal stage to be completed in D_SU + N/8 clock cycles, and subsequent temporal stages to be completed in N/8 clock cycles. Note that the pipeline delay D_PE must account not only for the sets of adders and permutators, but also for the fixed-point multipliers which are themselves typically implemented as pipelines, possibly requiring as many as five CSs according to the required precision. As a result, it is likely that at least nine CSs might be required
Fig. 6.7 Parallel solution for PE using five-stage computational pipeline
[Figure: READ → PE0 → PE1 → PE2 → PE3 → PE4 → WRITE. Stage 0: both even-addressed (EB) and odd-addressed (OB) memory banks are read from and written to at the same time – one sample per memory bank. Stages 1 to α−1: when even-addressed EB memory banks are read from, odd-addressed OB memory banks are written to, and vice versa – two samples per memory bank]
Fig. 6.8 Memory structure and interconnections for internally-pipelined partitioned-memory PE
[Figure: address generation drives eight data-memory banks DM0–DM7 (eight reads and eight writes per cycle) and three coefficient memories CM0–CM2 (six reads per cycle), which feed the computational stages CS0 through CSβ−1. CM – coefficient memory; CS – computational stage; DM – data memory]
for implementation of the computational pipeline, with each temporal stage of the R24 FHT requiring the PE to execute the pipeline a total of N/8 times. A description of the pipelined PE, including the structure of the memory for both the data and the trigonometric coefficients, together with its associated interconnections, is given in Fig. 6.8. Note, however, that depending upon the relative lengths of the computational pipeline, "β", and the transform, "N", an additional delay may need to be applied for every temporal stage, not just the first, in order to ensure that sample sets are not updated in one temporal stage before they have been processed and written back to DM in the preceding temporal stage, as this would result in the production of invalid outputs. If the transform length is sufficiently greater than the pipeline delay, however, this problem may be avoided – these points are discussed further in Section 6.4.3.
6.4.2 Space Complexity Considerations

The space complexity is determined by the combined requirements of the multi-bank dual-port memory and the arithmetic/logic components. Adopting the minimum-arithmetic addressing scheme of Versions I and II of the R24 FHT solution (as detailed in Table 6.4), the worst-case total memory requirement for the partitioned-memory single-PE architecture, denoted M^(W)_FHT, is given by

M^(W)_FHT = 8 × (1/8)N + 3 × (1/4)N = (7/4)N    (6.9)

words, where N words are required by the eight-bank DM and 3N/4 words for the three single-quadrant LUTs required for the CM. In comparison, by adopting the minimum-memory addressing scheme of Versions III and IV of the R24 FHT solution (as detailed in Table 6.4), the best-case total memory requirement for the partitioned-memory single-PE architecture, denoted M^(B)_FHT, is given by

M^(B)_FHT = 8 × (1/8)N + 3 × (1/2)√N = N + (3/2)√N    (6.10)

words, where N words are required by the eight-bank DM and (3/2)√N words for the three complementary-angle LUTs required for the CM. The arithmetic/logic requirement is dominated by the presence of the dedicated fast fixed-point multipliers, with a total of nine or 12 being required by the GD-BFLY and up to seven for the memory addressing, depending upon the chosen addressing scheme.
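As a quick numeric check of Equations 6.9 and 6.10, the following sketch – written in C purely for illustration, with all function names our own – evaluates the two memory bounds for a range of transform lengths:

```c
/* Evaluates the worst-case (Versions I/II, Equation 6.9) and best-case
 * (Versions III/IV, Equation 6.10) memory requirements, in words. */
#include <math.h>
#include <stdio.h>

static double mem_worst(double n) { return 8.0 * (n / 8.0) + 3.0 * (n / 4.0); }
static double mem_best(double n)  { return 8.0 * (n / 8.0) + 3.0 * (sqrt(n) / 2.0); }

int main(void)
{
    for (double n = 1024.0; n <= 65536.0; n *= 4.0)
        printf("N = %6.0f: worst-case %7.0f words, best-case %7.0f words\n",
               n, mem_worst(n), mem_best(n));
    return 0;
}
```

For N = 4,096, for example, this gives 7,168 words against 4,192 words, illustrating how quickly the complementary-angle LUTs pay for themselves as the transform length grows.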
6.4.3 Time Complexity Considerations

The partitioned-memory single-PE architecture, based upon the internally-pipelined PE described in Section 6.4.1, enables a new GD-BFLY output set to be produced every clock cycle. Therefore, the first temporal stage will be completed in D_SU + N/8 clock cycles and subsequent temporal stages in either N/8 clock cycles or D_SM + N/8 clock cycles, where the additional delay D_SM provides the necessary safety margin to ensure that the outputs produced from each stage are valid. The delay depends upon the relative lengths of the computational pipeline and the transform and may range from zero to as large as D_PE. As a result, the N-point R24 FHT, where "N" is as given by Equation 5.2, has a worst-case time complexity, denoted T^(W)_FHT, of

T^(W)_FHT = (D_SU + (1/8)N) + (α − 1)(D_SM + (1/8)N)
          = (D_SU + (α − 1)D_SM) + (1/8)N·log₄N    (6.11)

clock cycles, and a best-case or standard time complexity, denoted T^(B)_FHT, for when the safety-margin delay is not required, of

T^(B)_FHT = (D_SU + (1/8)N) + (α − 1)·(1/8)N
          = D_SU + (1/8)N·log₄N    (6.12)

clock cycles, given that α = log₄N. More generally, for any given combination of pipeline length, "β", and transform length, "N", it should be a straightforward task to calculate the exact safety-margin delay, D_SM, required after each temporal stage in order to guarantee the generation of valid outputs, although for most parameter combinations of practical interest it will almost certainly be set to zero, so that the time complexity for each instance of the transform will be as given by Equation 6.12. Note that a multi-PE R24 FHT architecture, based upon the adoption of an α-stage computational pipeline, could only yield this level of performance by exploiting up to "α" times as much silicon as the single-PE R24 FHT architecture, assuming that the PEs in the pipeline are working in sequential fashion with the data and trigonometric coefficients stored in global memory – that is, with the reads/writes being performed at a rate of one per clock cycle. Each stage of a pipelined multi-PE R24 FHT architecture requires the reading/writing of all N samples, so that α − 1 double-buffered memories – each holding up to 2N samples to cater for both inputs and outputs of the PE – are typically required for connecting the PEs in the pipeline.
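The cycle counts of Equations 6.11 and 6.12 are equally easy to tabulate; the sketch below uses illustrative (not measured) values for the start-up delay D_SU and safety margin D_SM:

```c
/* Clock-cycle counts from Equations 6.11 and 6.12, with alpha = log4(N). */
#include <stdio.h>

static long t_worst(long n, int alpha, long d_su, long d_sm)
{
    return d_su + (long)(alpha - 1) * d_sm + (long)alpha * n / 8;
}

static long t_best(long n, int alpha, long d_su)
{
    return d_su + (long)alpha * n / 8;
}

int main(void)
{
    long n = 4096;                /* alpha = log4(4096) = 6            */
    int alpha = 6;
    long d_su = 20, d_sm = 8;     /* purely illustrative delay values  */
    printf("worst-case: %ld cycles, best-case: %ld cycles\n",
           t_worst(n, alpha, d_su, d_sm), t_best(n, alpha, d_su));
    return 0;
}
```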
6.5 Performance and Requirements Analysis of FPGA Implementation

The theoretical complexity requirements discussed above have been proven in silicon by TRL Technology (a member of L-3 Communications Corporation, U.S.A.) in the U.K., who have produced a generic real-data radix-4 FFT implementation – based upon the R24 FHT – on a Xilinx Virtex-II Pro 100 FPGA [8], running at close to 200 MHz, for use in various wireless communication systems. A simple comparison with the state-of-the-art performances of the RFEL QuadSpeed FFT [4] and Roke Manor Research FFT solutions [5] (both multi-PE designs from the U.K. whereby a complex-data FFT may be used to process two real-valued data sets simultaneously – packing/unpacking of the input/output data sets therefore needs to be accounted for) is given in Table 6.5 for the case of 4K-point and 16K-point real-data FFTs (where 1K = 1,024), where the RFEL and Roke Virtex-II Pro 100 results are extrapolated from company data sheets and where the Version II solution of the R24 FHT described in Section 6.3.2.3 – using the minimum-arithmetic addressing scheme together with a nine-multiplier PE – is assumed for the TRL solution. Clearly, many alternatives to these two commercially-available devices could have been used for the purposes of this comparison, but at the time the comparison was made, these devices were both considered to be viable options with performances that were (and still are) quite representative of this particular class of multi-PE streaming FFT solution.
Table 6.5 Performance and resource utilization for 4K-point and 16K-point real-data radix-4 FFTs (clock frequency 200 MHz)

Solution | FFT length | Input word length | 1K × 18 RAMs (with double buffering) | 18 × 18 multipliers | Logic slices | Latency per real-data FFT (μs) | Update time per real-data FFT (μs) | I/O speed (samples/cycle)
TRLᵃ     | 4K  | 18 | 11 (2.5% capacity)   | 9 (2.0% capacity)    | 5,000 (5.0% capacity) | 15 (1 channel)  | 15 (1 channel)  | 1
RFELᵇ    | 4K  | 12 | 33 (7.5% capacity)   | 30 (6.8% capacity)   | 5,000 (5.0% capacity) | 21 (2 channels) | 10 (2 channels) | 4
ROKEᵇ    | 4K  | 14 | 42 (9.5% capacity)   | 48 (10.8% capacity)  | 3,800 (3.8% capacity) | 21 (2 channels) | 10 (2 channels) | 4
TRLᵃ     | 16K | 18 | 44 (9.9% capacity)   | 9 (2.0% capacity)    | 5,000 (5.0% capacity) | 72 (1 channel)  | 72 (1 channel)  | 1
RFELᵇ    | 16K | 12 | 107 (24.1% capacity) | 37 (8.3% capacity)   | 6,500 (6.5% capacity) | 83 (2 channels) | 41 (2 channels) | 4
ROKEᵇ    | 16K | 10 | 124 (28.0% capacity) | 55 (12.4% capacity)  | 5,800 (5.8% capacity) | 83 (2 channels) | 41 (2 channels) | 4

ᵃ DHT-to-DFT conversion not accounted for in figures
ᵇ Packing/unpacking requirement not accounted for in figures
Note that the particular choice of the real-from-complex strategy for the two commercially-available solutions has been made to ensure that we compare like with like, or as close as we can make it, as the adoption of the DDC-based approach would introduce additional filtering operations, together with an accompanying processing delay, to complicate the issue. As a matter of interest, for an efficient implementation with the particular device used here, the Virtex-II Pro 100, a complex DDC with 84 dB of spurious-free dynamic range (SFDR) has been shown to require approximately 1,700 slices of programmable logic [1].

Although the performances, in terms of the update time and latency figures, are similar for the solutions described, it is clear from the respective I/O requirements that the RFEL and Roke performance figures are achieved at the expense of having to process twice as much data at a time (two channels yielding two output sets instead of one) as the TRL solution and (for the case of an N-point transform) having to execute N/2 radix-2 butterflies every N/2 clock cycles, so that the pipeline needs to be fed with data generated by the ADC unit(s) at the rate of N complex-valued (or 2N real-valued) samples every N/2 clock cycles. This means generating the samples at four times the speed of the TRL solution (four samples per clock cycle instead of just one), which might in turn involve the use of multiple ADC units.

The results highlight the fact that although the computational densities of the three solutions are not that dissimilar, the TRL solution is considerably more area-efficient, requiring a small fraction of the memory and fast multiplier resources of the other two solutions in order to satisfy the latency constraint, whilst its logic requirement – as needed for controlling the operation and interaction of the various components of the FPGA implementation – remains relatively constant, whereas that of the RFEL and Roke solutions increases significantly with transform length. The scalable nature of the TRL solution means that only the memory requirement needs substantially changing from one transform length to another, in order to reflect the increased/decreased quantity of data to be processed, making the cost of adapting the solution to new applications negligible. For longer transforms, better use of the resources could probably be achieved by trading off memory against fast multiplier requirement through the choice of a more memory-efficient addressing scheme – as discussed above in Section 6.3.

Note that double buffering is assumed for the sizing of the TRL solution in order to support continuous processing, whereby the I/O is limited to N clock cycles, this resulting in a doubling of the DM requirement.
6.6 Constraining Latency Versus Minimizing Update-Time

An important point to note is that most, if not all, of the commercially-available FFT solutions are multi-PE solutions geared to streaming operation, where the requirement relates to the minimization of the update time – so as to maximize the throughput – rather than satisfying some constraint on the latency, as has been addressed in this monograph with the design of the R24 FHT. In fact, the point should
perhaps be re-made here that in the "Statement of Performance Objective No 2" made at the beginning of the chapter, the requirement was simply that the N-point transform be executed within N clock cycles. From Equation 6.12, this is clearly achievable for all transform lengths up to and including 64K, so that for transforms larger than this it would be necessary to increase the throughput rate by an appropriate amount in order that continuous operation be maintained – note that two PEs would maintain continuous operation for N ≤ 4^16 (see the sketch at the end of this section).

To clarify the situation, whereas with a pipelined FFT approach, as adopted by the multi-PE commercially-available solutions, one is able to attain a high throughput rate by effectively minimizing the update time, with the R24 FHT it is possible to increase the throughput rate by adopting an SIMD-type approach: either via a "multi-R24 FHT" solution, whereby multiple R24 FHTs are used to facilitate the simultaneous processing of multiple data sets, or via a "multi-PE" solution, whereby multiple PEs are used to facilitate the parallel processing of a single data set by means of a multi-PE version of the R24 FHT. The multi-PE solution could thus be used to maintain continuous operation for the case of extremely large transform lengths, whereas the multi-R24 FHT solution could be used to deal with those computationally-demanding applications where the throughput rate for the generation of each new N-point real-data FFT output data set needs to be greater than one set every N clock cycles.

With the multi-R24 FHT approach, the attraction is that it is possible to share both the control logic and the CM between the R24 FHTs, given that the LUTs contain precisely the same information and need to be accessed in precisely the same order for each R24 FHT. Such an approach could also be used to some advantage, for example, when applied to the computation of the complex-data DFT, as discussed in Section 3.4.2 of Chapter 3, where one R24 FHT is applied to the computation of the real component of the data and one R24 FHT is applied to the computation of the imaginary component. A highly-parallel dual-R24 FHT solution such as this would be able to attain, for the case of complex data, the eightfold speed-up already achieved for the real-data case over a purely sequential solution (now processing eight complex-valued samples per clock cycle rather than eight real-valued samples), yet for minimal additional resources.

With the multi-PE approach – restricting ourselves here to the simple case of two PEs – it needs firstly to be noted that just as dual-port memory is necessary for the operation of the single-PE solution, so quad-port memory would be necessary for the operation of a dual-PE solution, so as to facilitate the reading/writing of two samples from/to each of the eight memory banks for each clock cycle, as well as the reading of four (rather than two) trigonometric coefficients from each of the LUTs, as shared by the PEs. Alternate instances of the GD-BFLY could be straightforwardly assigned to alternate PEs, with all eight GD-BFLY inputs/outputs for each PE being read/written from/to memory simultaneously, so that conflict-free and in-place parallel memory addressing would be maintained for each PE.

At present, genuine quad-port memory is not available from the major FPGA manufacturers, so that such a facility may only be obtained through the modification of existing dual-port memory at the effective cost of a doubling
of the memory requirement. A simple alternative may be obtained, however, by noting that with current FPGA technology there is typically an approximate factor-of-two difference between the dual-port memory read/write access time and the update time of the fast multipliers – and thus the update time of the GD-BFLY computational pipeline. As a result, by doubling the speed at which the reads/writes are performed, a pseudo quad-port memory capability may be achieved whereby the data is read/written from/to the dual-port memory at twice the rate of the computational pipeline and thus at a sufficient rate to sustain the operation of the pipeline.

The ideas considered in this section, involving the use of multi-PE and multi-R24 FHT solutions, suggest that the throughput rate of the most advanced commercially-available solutions could be achieved for reduced quantities of silicon, so that the GD-BFLY-based PE could be used as a building block to define real-data FFT solutions to a range of problems, according to whether the particular design objective involves the satisfying of some constraint on the latency, as addressed by this monograph, or the maximization of the throughput rate.
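As a minimal sketch of the continuous-operation bound quoted at the start of this section (start-up and safety-margin delays ignored, and the assumption – ours – that p PEs divide the butterfly workload evenly):

```c
/* With p PEs the R24 FHT takes roughly (1/(8p)) * N * log4(N) cycles, so
 * continuous operation (completion within N cycles) requires
 * log4(N) <= 8p, i.e. N <= 4^(8p). */
#include <stdio.h>

static double cycles(double n, int log4n, int p)
{
    return n * (double)log4n / (8.0 * p);
}

int main(void)
{
    printf("N = 4^8,  1 PE : %.0f cycles (bound %.0f)\n",
           cycles(65536.0, 8, 1), 65536.0);     /* meets the bound exactly */
    printf("N = 4^9,  1 PE : %.0f cycles (bound %.0f)\n",
           cycles(262144.0, 9, 1), 262144.0);   /* exceeds the bound       */
    printf("N = 4^9,  2 PEs: %.0f cycles (bound %.0f)\n",
           cycles(262144.0, 9, 2), 262144.0);   /* two PEs restore it      */
    return 0;
}
```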
6.7 Discussion

The outcome of this chapter is the specification of a partitioned-memory single-PE computing architecture for the parallel computation of the R24 FHT, together with the specification of conflict-free and (for the data) in-place parallel memory addressing schemes for both the data and the trigonometric coefficients, which enable the outputs from each instance of the GD-BFLY to be produced via this computing architecture within a single clock cycle. Four versions of the PE have been described – all based upon the use of a fixed-point fast multiplier and referred to as Versions I, II, III and IV of the solution – which provide the user with the ability to trade off arithmetic complexity, in terms of both adders and multipliers, against the memory requirement, with a theoretical performance and resource comparison of the four solutions being provided in tabular form. The mathematical/logical correctness of the operation of all four versions of the solution has been proven in software via a computer program written in the "C" programming language.

Silicon implementations of both 4K-point and 16K-point transforms have been studied, each using Version II of the R24 FHT solution – which uses the minimum-arithmetic addressing scheme together with a nine-multiplier version of the PE – and the Xilinx Virtex-II Pro 100 device running at a clock frequency of close to 200 MHz. The R24 FHT results were seen to compare very favourably with those of two commercially-available industry-standard multi-PE solutions, with both the 4K-point and 16K-point transforms achieving the stated performance objective whilst requiring greatly reduced silicon resources compared to their commercial complex-data counterparts. Note that although the target device family may be somewhat old, it was more than adequate for the purpose, which was simply to facilitate comparison of the relative merits of the single-PE and multi-PE architectures. As already stated, with real-world applications it is not always possible, for various practical/financial
reasons, to have access to the latest device technologies. Such a situation does tend to focus the mind, however, as one is then forced to work within whatever silicon budget one happens to have been dealt.

Note that a number of scalable single-PE designs for the fixed-radix FFT [2, 6, 7], along the lines of that discussed in this chapter for the R24 FHT, have already appeared in the technical literature over the past 10–15 years for the more straightforward complex-data case, each such solution using a simplified version of the memory addressing scheme discussed here, whereby multi-bank memory is again used to facilitate the adoption of partitioned-memory processing.

Another important property of the proposed set of R24 FHT designs discussed here is that they are able, via the application of the block floating-point scaling technique, to optimize the achievable dynamic range of the Hartley-space (and thus Fourier-space) outputs and therefore to outperform the more conventional streaming FFT solutions which, given the need to process the data as and when it arrives, are restricted to the use of various fixed scaling strategies in order to address the fixed-point overflow problem. With fully-optimized streaming operation, the application of block floating-point scaling would involve having to stall the optimal flow of data through the computational pipeline, as the entire set of outputs from each stage of butterflies needs to be passed through the "maximum" function in order that the required common exponent may be found. As a result, the block-based nature of the single-PE R24 FHT operation means that it is also able to produce higher-accuracy transform-space outputs than is achievable by its multi-PE FFT counterparts.

Finally, it should be noted that the data re-ordering – carried out here by means of the di-bit reversal mapping, sketched below – to be applied to the input data to the transform can be comfortably carried out in less than N clock cycles, for a length-N transform, so that performance may be maintained through the use of double buffering, whereby one data set is being re-ordered and written to one set of DM banks whilst another data set – its predecessor – is being read/written from/to another set of DM banks by the R24 FHT. The functions of the two sets of DM banks are then interchanged after the completion of each R24 FHT. Thus, we may set up what is essentially a two-stage pipeline, where the first stage of the pipeline carries out the task of data re-ordering and the second carries out the R24 FHT on the re-ordered data. The data re-ordering may be carried out in various ways, as already outlined in Section 2.4 of Chapter 2.
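On the assumption that the di-bit reversal mapping reverses the order of an index's bit-pairs (its base-4 digits) – the natural radix-4 analogue of bit reversal – a minimal sketch of the re-ordering index is:

```c
/* Di-bit (base-4 digit) reversal for an N-point transform, N = 4^alpha.
 * Sample x[idx] is moved to position dibit_reverse(idx, alpha). */
static unsigned dibit_reverse(unsigned idx, unsigned alpha)
{
    unsigned r = 0;
    for (unsigned d = 0; d < alpha; d++) {
        r = (r << 2) | (idx & 3u);   /* peel off the lowest base-4 digit */
        idx >>= 2;
    }
    return r;
}
```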
References

1. R. Hosking, New FPGAs tackle real-time DSP tasks for defense applications. Boards & Solutions Magazine (November 2006)
2. L.G. Johnson, Conflict-free memory addressing for dedicated FFT hardware. IEEE Trans. Circuits Syst. II: Analog Digit. Signal Process. 39(5), 312–316 (1992)
3. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
4. RF Engines Ltd., IP Cores – Xilinx FFT Library, product information sheet available at company web site: www.rfel.com
5. Roke Manor Research Ltd., Ultra High Speed Pipeline FFT Core, product information sheet available at company web site: www.roke.com
6. B.S. Son, B.G. Jo, M.H. Sunwoo, Y.S. Kim, A high-speed FFT processor for OFDM systems. Proc. IEEE Int. Symp. Circuits Syst. 3, 281–284 (2002)
7. C.H. Sung, K.B. Lee, C.W. Jen, Design and implementation of a scalable fast Fourier transform core. Proc. 2002 IEEE Asia-Pacific Conference on ASICs, 295–298 (2002)
8. Xilinx Inc., company and product information available at company web site: www.xilinx.com
Chapter 7
Design of Arithmetic Unit for Resource-Constrained Solution
Abstract This chapter discusses a solution to the regularized FHT where the pipelined fixed-point multipliers involving the trigonometric coefficients are now replaced by pipelined CORDIC phase rotators, which eliminate the need for the trigonometric coefficient memory and lead to the specification of a flexible-precision solution. The design is targeted, in particular, at those applications where one is constrained by the limited availability of embedded resources. Theoretical performance figures for a silicon-based implementation of the CORDIC-based solution are derived and the results compared with those for the previously discussed solutions based upon the use of the fast fixed-point multipliers, for various combinations of transform length and word length. A discussion is finally provided relating to the results obtained in the chapter.
7.1 Introduction

The last two chapters have provided us with a detailed account of how the R24 FHT is able to be mapped onto a partitioned-memory single-PE computing architecture so as to effectively exploit the computational power of the silicon-based parallel computing technologies. Four versions of this highly-parallel R24 FHT solution have been produced, with PE designs which range from providing optimality in terms of the arithmetic complexity to optimality in terms of the memory requirement, although the common feature of all four versions is that they each involve the use of a fast fixed-point multiplier. No consideration has been given, as yet, as to whether an arithmetic unit based upon the fast multiplier is always the most appropriate to adopt or, when such an arithmetic unit is used, how the fast multiplier might best be implemented. With the use of FPGA technology, however, the fast multiplier is typically available to the user as an embedded resource which, although expensive in terms of silicon resources, is becoming increasingly more power-efficient and is therefore the logical solution to adopt.

A problem may arise in practice, however, when the length of the transform to be computed is very large compared to the capability of the target device, such that there are insufficient embedded resources – in terms of fast multipliers, fast RAM,
or both – to enable a successful mapping of the transform onto the device to take place. In such a situation, where the use of a larger and more powerful device is simply not an option, some means must be found of facilitating a successful mapping onto the available device, and one way of achieving this is through the design of a more appropriate arithmetic unit, namely one which does not rely too heavily upon the use of embedded resources.

The choice of which type of arithmetic unit to adopt for the proposed resource-constrained solution has been made in favour of the CORDIC unit, rather than the DA unit, as the well-documented optimality of CORDIC arithmetic for the operation of phase rotation [1] – shown to be the required operation here – combined with the ability to generate the rotation angles that correspond to the trigonometric coefficients very efficiently on-the-fly, with trivial memory requirement, make it the obvious candidate to pursue; the DA unit would inevitably involve a considerably larger memory requirement due to the storage of the pre-computed sums or inner products. A number of attractive CORDIC-based FPGA solutions to the FFT have appeared in the technical literature in recent years, albeit for the more straightforward complex-data case, two such solutions being discussed in references [2, 11].

Note that the sizing to be carried out in this chapter for the various R24 FHT solutions, including those based upon both the fast fixed-point multiplier and the CORDIC phase rotator, is to be performed for hypothetical implementations exploiting only programmable logic, in order to facilitate their comparison.
7.2 Accuracy Considerations

To obtain L-bit accuracy in the GD-BFLY outputs it will be necessary to retain sufficient bits out of the multipliers, as well as to use sufficient guard bits, in order to protect both the least significant bit (LSB) and the most significant bit (MSB). This is due to the fact that with fixed-point processing the accuracy may be degraded through the possible word growth of one bit with each stage of adders. For the MSB, the guard bits correspond to those higher-order (initially unoccupied) bits, appended to the left of the L most significant data bits out of the multipliers, that could in theory, after completion of the stages of GD-BFLY adders, contain the MSB of the output data. For the LSB, the guard bits correspond to those lower-order (initially occupied) bits, appearing to the right of the L most significant data bits out of the multipliers, which could in theory, after completion of the stages of GD-BFLY adders, affect or contribute to the LSB of the output data. Thus, the possible occurrence of truncation errors due to the three stages of adders is accounted for by varying the lengths of the registers as the data progresses across the PE. Allowing for word growth in this fashion permits the application of block floating-point scaling [9] – as discussed in Section 4.8 of Chapter 4 – prior to each stage of GD-BFLYs, thereby enabling the dynamic range of any signals present in the data to be maximized at the output of the R24 FHT.
7.3 Fast Multiplier Approach

Apart from the potentially large CM requirement associated with the four PE designs discussed in the previous chapter, an additional limitation relates to their relative inflexibility, in terms of the arithmetic precision offered, due to their reliance on the fast fixed-point multiplier. For example, when the word length, "L", of one or more of the multiplicands exceeds the word-length capability, "K", of the embedded multiplier, it would typically be necessary to use four embedded multipliers and two 2K-bit adders to carry out each L × L multiplication (assuming that K < L ≤ 2K).

When implemented on an FPGA in programmable logic, it is to be assumed that one L × L pipelined multiplier will require of order 5L²/8 slices [3, 4] in order to produce a new output each clock cycle, whilst one L-bit adder will require L/2 slices [4]. The CM will require L-bit RAM, with the single-port version involving L/2 slices and the dual-port version involving L slices [8]. These logic-based complexity figures will be used later in the chapter for carrying out sizing comparisons of the PE designs discussed in this and the previous chapters.

To obtain L-bit accuracy in the outputs of the twelve-multiplier version of the GD-BFLY, which involves three stages of adders, it is necessary that L + 3 bits be retained from the multipliers, each of size L × L, in order to guard the LSB, whilst the first stage of adders is carried out to (L + 4)-bit precision, the second stage to (L + 5)-bit precision and the third stage to (L + 6)-bit precision, in order to guard the MSB, at which point the data is scaled to yield the L-bit results. Similarly, to obtain L-bit accuracy in the outputs of the nine-multiplier version of the GD-BFLY, which involves four stages of adders, it is necessary that the first stage of adders (preceding the multipliers) be carried out to (L + 1)-bit precision, with L + 4 bits being retained from the multipliers, each of size (L + 1) × (L + 1), whilst the second stage of adders is carried out to (L + 5)-bit precision, the third stage to (L + 6)-bit precision and the fourth stage to (L + 7)-bit precision, at which point the data is scaled to yield the L-bit results.

Thus, given that the twelve-multiplier version of the GD-BFLY involves a total of 12 pipelined multipliers, six stage-one adders, eight stage-two adders and eight stage-three adders, the PE can be constructed with an arithmetic-based logic requirement, denoted L^A_M12, of

L^A_M12 ≈ (1/2)(15L² + 22L + 112)    (7.1)

slices, whilst the nine-multiplier version of the GD-BFLY, which involves a total of three stage-one adders, nine pipelined multipliers, six stage-two adders, eight stage-three adders and eight stage-four adders, requires an arithmetic-based logic requirement, denoted L^A_M9, of

L^A_M9 ≈ (1/8)(45L² + 190L + 548)    (7.2)

slices.
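The constant terms in Equation 7.1 follow directly from the component counts just given; the sketch below (illustrative C, names ours) rebuilds the estimate from first principles and checks it against the closed form:

```c
/* Check of Equation 7.1: 12 pipelined L x L multipliers at ~5L^2/8 slices
 * each, plus 6 (L+4)-bit, 8 (L+5)-bit and 8 (L+6)-bit adders at ~width/2
 * slices each. */
#include <stdio.h>

int main(void)
{
    for (int L = 16; L <= 24; L += 4) {
        double mults  = 12.0 * (5.0 * L * L / 8.0);
        double adders = 6.0 * (L + 4) / 2.0 + 8.0 * (L + 5) / 2.0
                      + 8.0 * (L + 6) / 2.0;
        double formula = (15.0 * L * L + 22.0 * L + 112.0) / 2.0;
        printf("L = %2d: %6.0f slices (closed form %6.0f)\n",
               L, mults + adders, formula);
    }
    return 0;
}
```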
These figures, together with the CM requirement – as discussed in Section 6.3 of the previous chapter and given by Equations 6.5 and 6.6 – will combine to form the benchmarks with which to assess the merits of the hardware-based arithmetic unit now discussed.
7.4 CORDIC Approach

The CORDIC algorithm [12] is an arithmetic technique used for carrying out two-dimensional vector rotations. Its relevance here is in its ability to carry out the phase rotation of a complex number, as this will be seen to be the underlying operation required by the GD-BFLY. The vector rotation, which is a convergent linear process, is performed very simply as a sequence of elementary rotations with an ever-decreasing elementary rotation angle, where each elementary rotation can be carried out using just shift and add–subtract operations.

7.4.1 CORDIC Formulation of Complex Multiplier

For carrying out the particular operation of phase rotation, a vector (X, Y) is rotated by an angle "θ" to obtain the new vector (X′, Y′). For the nth elementary rotation, the fixed elementary rotation angle, arctan(2^−n), which is stored in a ROM, is subtracted/added from/to the angle remainder, "θn", so that the angle remainder approaches zero with increasing "n". The mathematical relations for the conventional non-redundant CORDIC rotation operation [1] are as given below via the four sets of equations:

(a) Phase Rotation Operation:

X′ = cos(θ)·X − sin(θ)·Y
Y′ = cos(θ)·Y + sin(θ)·X
θ′ = 0    (7.3)

(b) Phase Rotation Operation as Sequence of Elementary Rotations:

X′ = ∏_{n=0}^{K−1} cos(arctan(2^−n))·(Xn − σn·Yn·2^−n)
Y′ = ∏_{n=0}^{K−1} cos(arctan(2^−n))·(Yn + σn·Xn·2^−n)
θ′ = θ − ∑_{n=0}^{K−1} σn·arctan(2^−n)    (7.4)

(c) Expression for nth Elementary Rotation:

X_{n+1} = Xn − σn·2^−n·Yn
Y_{n+1} = Yn + σn·2^−n·Xn
θ_{n+1} = θn − σn·arctan(2^−n)    (7.5)

where "σn" is either +1 or −1, for non-redundant CORDIC, depending upon the sign of the angle remainder term, denoted here as "θn".

(d) Expression for CORDIC Magnification Factor:

M = ∏_{n=0}^{K−1} 1/cos(arctan(2^−n)) = ∏_{n=0}^{K−1} √(1 + 2^−2n) ≈ 1.647 for large K    (7.6)

which may need to be scaled out of the rotated output in order to preserve the correct amplitude of the phase-rotated complex number.

The choice of non-redundant CORDIC, rather than a redundant version whereby the term "σn" is allowed to be either +1, −1 or 0, ensures that the value of the magnification factor, which is a function of the number of iterations, is independent of the rotation angle being applied and therefore fixed for every instance of the GD-BFLY, whether it is of Type-I, Type-II or Type-III – for the definitions see Section 4.3 of Chapter 4.
7.4.2 Parallel Formulation of CORDIC-Based PE

From Equation 7.5, the CORDIC algorithm requires one pair of shift/add–subtract operations and one add–subtract operation for each bit of accuracy. When implemented sequentially [1], therefore, the CORDIC unit implements these elementary operations, one after another, using a single CS and feeding back the output as the input to the next iteration. A sequential CORDIC unit with L-bit output has a latency of L clock cycles and produces a new output every L clock cycles. On the other hand, when implemented in parallel form [1], the CORDIC unit implements these elementary operations as a computational pipeline – see Fig. 7.1 – using an array of identical CSs. A parallel CORDIC unit with L-bit output has a latency of L clock cycles but produces a new output every clock cycle.

An attraction of the fully-parallel pipelined architecture is that the shifters in each CS involve a fixed right shift, so that they may be implemented very efficiently in the wiring. Also, the elementary rotation angles may be distributed as constants to each CS, so that they may also be hardwired. As a result, the entire CORDIC rotator may be reduced to an array of interconnected add–subtract units. Pipelining is achieved by inserting registers between the add–subtract units, although with most FPGA architectures there are already registers present in each logic cell, so that the addition of the pipeline registers involves no additional hardware cost.
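To make the iteration of Equation 7.5 concrete, here is a minimal fixed-point rotation-mode CORDIC sketch in C (all names, number formats and iteration counts are illustrative choices of ours, not part of the hardware design): data is held in Q16, the angle remainder in Q29 radians, and the magnification factor of Equation 7.6 is divided out at the end rather than being absorbed into the butterfly as described in Section 7.4.3.

```c
#include <math.h>
#include <stdint.h>
#include <stdio.h>

#define ITERATIONS 24                 /* K = L for roughly L-bit accuracy */
#define ANGLE_ONE  (1 << 29)          /* Q29 scaling for angles (radians) */

/* Rotate (x, y) by theta (|theta| <= pi/2) per Equation 7.5; the result
 * is magnified by M of Equation 7.6, which the caller must scale out.   */
static void cordic_rotate(int32_t *x, int32_t *y, double theta)
{
    int32_t z = (int32_t)(theta * ANGLE_ONE);          /* angle remainder */
    for (int n = 0; n < ITERATIONS; n++) {
        int32_t alpha = (int32_t)(atan(ldexp(1.0, -n)) * ANGLE_ONE);
        int32_t xs = *x >> n, ys = *y >> n;            /* fixed shifts    */
        if (z >= 0) { *x -= ys; *y += xs; z -= alpha; }   /* sigma = +1   */
        else        { *x += ys; *y -= xs; z += alpha; }   /* sigma = -1   */
    }
}

int main(void)
{
    int32_t x = 1 << 16, y = 0;                        /* (1.0, 0.0), Q16 */
    cordic_rotate(&x, &y, 3.14159265358979 / 6.0);     /* rotate by 30deg */
    const double M = 1.64676025812107;                 /* Equation 7.6    */
    printf("%f %f\n", x / 65536.0 / M, y / 65536.0 / M);  /* ~0.866 0.5   */
    return 0;
}
```

In the hardware pipeline each loop iteration becomes one hardwired CS, with the shift amounts and the constants αn fixed per stage, exactly as described above.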
Fig. 7.1 Pipeline architecture for CORDIC rotator
[Figure: cascade of identical computational stages; stage n takes (Xn, Yn, Zn), forms X̃n = Xn >> n and Ỹn = Yn >> n via fixed right shifts, and uses add–subtract units steered by sign(Zn) – with the elementary rotation angle αn supplied as a stage constant (LUT) – to produce Xn+1 = Xn ∓ Ỹn, Yn+1 = Yn ± X̃n and Zn+1 = Zn ∓ αn, for n = 0, 1, …, N−1]
7.4.3 Discussion of CORDIC-Based Solution

The twelve-multiplier version of the GD-BFLY produces eight outputs from eight inputs, these samples denoted by (X1, Y1) through to (X4, Y4), with the multiplication stage of the GD-BFLY comprising 12 real multiplications which, together with the accompanying set of additions/subtractions, may be expressed for the case of the standard Type-III GD-BFLY via the three sets of equations

[X2]   [  cos(θ)   −sin(θ) ] [X2]
[Y2] = [ −sin(θ)   −cos(θ) ] [Y2]    (7.7)

[X3]   [  cos(2θ)  −sin(2θ) ] [X3]
[Y3] = [ −sin(2θ)  −cos(2θ) ] [Y3]    (7.8)

[X4]   [  cos(3θ)  −sin(3θ) ] [X4]
[Y4] = [ −sin(3θ)  −cos(3θ) ] [Y4]    (7.9)

where "θ" is the single-angle, "2θ" the double-angle and "3θ" the triple-angle rotation angle. These sets of equations are equivalent to what would be obtained if we multiplied the complex number interpretations of (X2, Y2) by e^{iθ}, (X3, Y3) by e^{i2θ} and (X4, Y4) by e^{i3θ}, followed, for the case of the standard Type-III GD-BFLY, by negation of the components Y2, Y3 and Y4.

As with the nine-multiplier and twelve-multiplier versions of the GD-BFLY, there are minor changes to the operation of the GD-BFLY, from one instance to another, in terms of the definitions of the first three address permutations, with one of two slightly different versions being appropriately selected for each according to the particular "Type" of GD-BFLY being executed – see Table 4.1 of Chapter 4. In addition, however, there are also minor changes required to the outputs of the CORDIC units, in that if the GD-BFLY is of Type-I then the components Y2, Y3 and Y4 do not need to be negated, whereas if the GD-BFLY is of Type-II then only component Y4 needs to be negated, and if the GD-BFLY is of Type-III, as discussed in the previous paragraph, then all three components need to be negated.

Note, however, that the outputs will have grown due to the CORDIC magnification factor, "M", of Equation 7.6, so that this growth needs to be adequately accounted for within the GD-BFLY. The most efficient way of achieving this would be to allow the growth to remain within components (X2, Y2) through to (X4, Y4) and for the components (X1, Y1) to be scaled multiplicatively by the term "M", this being achieved with just two constant-coefficient multipliers – see Fig. 7.2. This would result in a growth of approximately 1.647 in all eight inputs to the second address permutation "Φ2". Note that scaling by such a constant differs from the operation of a standard fast multiplier in that Booth encoding/decoding circuits are no longer required, whilst efficient recoding methods [5] can be used to further reduce the logic requirement of the simplified operation to approximately one third that of the standard fast fixed-point multiplier.

An obvious attraction of the CORDIC-based approach is that the GD-BFLY only requires knowledge of the single-angle, double-angle and triple-angle rotation angles, so that there is no longer any need to construct, maintain and access the potentially large LUTs required for the storage of the trigonometric coefficients – that is, for the storage of sampled versions of the sinusoidal function with argument defined from 0 up to π/2 radians. As a result, the radix-4 factorization of the CORDIC-based FHT may be expressed very simply, with the updating of the rotation angles for the execution of each instance of the GD-BFLY being performed on-the-fly and involving only additions and subtractions.
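The rotations of Equations 7.7–7.9 and the Type-dependent negations can be checked against ordinary complex arithmetic; in the sketch below (names ours), negating the Y component is expressed as complex conjugation:

```c
/* Applies the single-, double- and triple-angle rotations of Equations
 * 7.7-7.9 as multiplications by exp(i*n*theta), then negates the Y
 * components selected by the butterfly Type, as described in the text. */
#include <complex.h>

typedef enum { TYPE_I, TYPE_II, TYPE_III } bfly_type;

static void rotate_pairs(double complex p[4], double theta, bfly_type t)
{
    /* p[1]..p[3] hold (X2,Y2), (X3,Y3), (X4,Y4); p[0] = (X1,Y1) is only
     * ever scaled (by M in the CORDIC version), never rotated.          */
    for (int n = 1; n <= 3; n++)
        p[n] *= cexp(I * n * theta);
    if (t == TYPE_III)                       /* negate Y2, Y3 and Y4     */
        for (int n = 1; n <= 3; n++) p[n] = conj(p[n]);
    else if (t == TYPE_II)                   /* negate Y4 only           */
        p[3] = conj(p[3]);
    /* TYPE_I: no negation required.                                     */
}
```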
7 Design of Arithmetic Unit for Resource-Constrained Solution
Φ2
Φ3
Un-scaled CORDIC Rotator
Negate
Un-scaled CORDIC Rotator
Negate
Un-scaled CORDIC Rotator
Negate
_ Address Permutation
input data vector
Address Permutation
Fixed Scaler
Φ4
_ Address Permutation
Φ1
_
_
_
_
_
output data vector
negated rotation angles
Address Permutation
108
_
Fig. 7.2 Signal flow graph for CORDIC-based version of generic double butterfly
Fig. 7.3 Computational stage of pipeline for CORDIC rotator with scalar inputs
[Figure: a single CS applies the fixed right shift ">> n" and nine add–subtract units to carry out the three elementary rotations simultaneously on the single-angle (X^S, Y^S, Z^S), double-angle (X^D, Y^D, Z^D) and triple-angle (X^T, Y^T, Z^T) channels, each steered by sign(Z); note X̃ = X >> n and Ỹ = Y >> n, where n and αn are fixed]
The optimum throughput for the GD-BFLY is achieved with the fully-parallel hardwired solution of Fig. 7.3, whereby each CS of the pipeline uses nine add–subtract units to carry out simultaneously the three elementary phase rotations – note that in the figure the superscripts "S", "D" and "T" stand for "single angle", "double angle" and "triple angle", respectively.

Due to the decomposition of the original rotation angle into "K" elementary rotation angles, it is clear that execution of the phase rotation operation can only be approximated, with the accuracy of the outputs of the last iteration being limited by the magnitude of the last elementary rotation angle applied. Thus, if L-bit accuracy is required of the rotated output, one would expect the number of iterations, "K", to be chosen so that K = L, as the right shifts carried out in the Kth (and last) iteration would be of length L − 1. This, in turn, necessitates two guard bits on the MSB and log₂L guard bits on the LSB. The MSB guard bits cater for the magnification factor of Equation 7.6 and the maximum possible range extension of √2, whilst the LSB guard bits cater for the accumulated rounding error from the "L" iterations.

Note also, from the definition of the elementary rotation angles,

tan(θn) = ±2^−n,    (7.10)

that the CORDIC algorithm is known to converge over the range −π/2 ≤ θ ≤ +π/2, so that in order to cater for rotation angles between ±π an additional rotation angle of ±π/2 may need to be applied prior to the elementary rotation angles in order to ensure that the algorithm converges, thus increasing the number of iterations from K = L to K = L + 1. This may be very simply achieved, however, via application of the equations:

X′ = −σ·Y
Y′ = +σ·X
θ′ = θ + σ·π/2    (7.11)

where

σ = { +1, Y < 0
    { −1, otherwise    (7.12)

whenever the rotation angle lies outside the range of convergence, with the above equations being carried out via precisely the same components and represented by means of precisely the same SFG as those equations – namely Equations 7.3–7.6 – corresponding to the elementary rotation angles.
7.4.4 Logic Requirement of CORDIC-Based PE

Referring back to the SFG of Fig. 7.2, it is to be assumed that the GD-BFLY outputs are to be computed to L-bit accuracy. Therefore, because of the two stages of adders following the CORDIC rotators, it will be necessary for the CORDIC rotators to adopt L + 3 iterations in order to produce data to (L + 2)-bit accuracy for input to the first stage of adders. This in turn requires that each CORDIC rotator adopt L + 4 + log₂(L + 2) bits for the registers, this including log₂(L + 2) guard bits for the LSB and two guard bits for the MSB. Following their operation, the data will have magnified by one bit, so that just the top MSB guard bit needs to be removed, together with the lowest log₂(L + 2) + 1 bits, to leave the required L + 2 bits for input to the adders. The first stage of adders is then carried out to (L + 3)-bit precision and the second stage to (L + 4)-bit precision, at which point the data is scaled to yield the final L-bit result. The outputs from the two fixed-coefficient multipliers – note that in the time it takes for the CORDIC operation to be executed the same fixed-coefficient multiplier could be used to carry out the scaling operation for both of the first two inputs – are retained to (L + 2)-bit precision in order to ensure consistency with the precision of the outputs from the CORDIC rotators.

Thus, the CORDIC-based version of the GD-BFLY involves three (L + 3)-stage pipelined CORDIC rotators, eight (L + 3)-bit stage-one adders, eight (L + 4)-bit stage-two adders and one shared fixed-coefficient multiplier using an (L + 2)-bit coefficient, so that the PE may be constructed with a total arithmetic-based logic requirement, denoted L^A_C, of

L^A_C ≈ (1/2)(10L² + 83L + 9(L + 3)·log₂(L + 2) + 168)    (7.13)

slices. Note that the single-angle, double-angle and triple-angle rotation angles are fed directly to the GD-BFLY, so that the only memory requirement is for the storage of three generation angles for each stage of the transform, from which the rotation angles may then be recursively derived via simple addition. Thus, assuming single-port memory, the memory-based logic requirement, denoted L^M_C, is given by just

L^M_C ≈ (3/2)αL    (7.14)

slices, with the required single-angle, double-angle and triple-angle rotation angles being computed on-the-fly as and when they are required.
7.5 Comparative Analysis of PE Designs

This section provides a very brief theoretical comparison of the silicon resources required for all five types of PE so far considered – four corresponding to the use of a pipelined fixed-point multiplier, as discussed in the previous chapter, and one corresponding to the use of the pipelined CORDIC arithmetic unit – where the sizing is based upon the logic-based complexity figures discussed in Sections 7.3 and 7.4. An FPGA implementation [6] would of course be able to exploit the available embedded resources, whether using the fast fixed-point multiplier or the CORDIC arithmetic unit, as most FPGA manufacturers now provide their own version of the CORDIC unit, in addition to the fast multipliers and RAM, as an embedded resource to be exploited by the user. A pipelined version of the CORDIC arithmetic unit may even be obtained as an IP core [10] and subsequently used as a building block for constructing larger DSP systems. The assumption here is that any relative advantages obtained from implementation in programmable logic will carry over when the PEs are implemented using such optimized silicon resources.

The arithmetic-based and memory-based logic requirements for all five versions of the R24 FHT solution – catering for both the arithmetic complexity and the CM requirement – are summarized in Table 7.1 below, from which the attraction
Table 7.1 Logic resources required for different versions of PE and trigonometric coefficient generator assuming N-point regularized FHT and L-bit accuracy

Version of solution | Processing element type | Arithmetic-based logic for double butterfly (slices) | Arithmetic-based logic for coefficient generator (slices) | Memory-based logic for coefficients (slices)
I   | Fast multiplier | (1/2)(15L² + 22L + 112)                        | 0                        | (3/4)LN
II  | Fast multiplier | (1/8)(45L² + 190L + 548)                       | 3L                       | (3/4)LN
III | Fast multiplier | (1/2)(15L² + 22L + 112)                        | (1/8)(35L² + 162L + 277) | (3/2)L√N
IV  | Fast multiplier | (1/8)(45L² + 190L + 548)                       | (1/8)(35L² + 246L + 831) | (3/2)L√N
V   | CORDIC unit     | (1/2)(10L² + 83L + 9(L + 3)·log₂(L + 2) + 168) | 0                        | (3/2)L·log₄N
of Version V, the CORDIC-based solution, is evident. The benefits stem basically from the fact that there is no longer any need to construct, maintain and access the potentially large LUTs required for the storage of the trigonometric coefficients. The same word lengths, denoted "L", are assumed for both the input/output data, to/from the GD-BFLY, and the trigonometric coefficients.

The control-based logic requirements – for controlling the operation and interaction of the various components of the design – as discussed in Section 5.6 of Chapter 5, are not included in the results, as they are rather more difficult (if not impossible) to assess if considered in isolation from the actual hardware design process, this due in part to the automated and somewhat unpredictable nature of that process. It seems clear, however, that the gains achieved by the CORDIC-based R24 FHT solution in not having to access the CM will be somewhat counter-balanced by the need to control a potentially large number of adders rather than just a few fast fixed-point multipliers. Also, the two versions of the R24 FHT solution based upon the minimum-memory addressing (Versions III and IV) will involve greater control complexity than those versions based upon the minimum-arithmetic addressing (Versions I and II), as evidenced by the discussions of Section 6.3.2 of the previous chapter. For each of the five versions, however, the control-based logic requirement will vary little with transform length or word length, as indicated in the results of Section 6.5 of Chapter 6, due to the scalable nature of the designs.

Estimates of the logic requirements due to both the arithmetic complexity and the CM, for various combinations of transform length and data/coefficient word length for all the solutions considered, are given in Table 7.2, with the results reinforcing the attraction of the CORDIC-based solution for those parameter sets typically encountered in high-performance DSP applications. It is evident from the results displayed in Tables 7.1 and 7.2 that as the transform length increases, the associated memory-based logic requirement makes all those solutions based upon the fast fixed-point multiplier increasingly less attractive, as the silicon requirement for such solutions is clearly dominated by the increasing memory requirement. The only significant change to the CORDIC-based solution as the transform length varies relates to the memory allocation for storage of the input/output data.
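The Table 7.1 formulas are simple enough to tabulate directly; the sketch below (illustrative C, names ours) evaluates the combined arithmetic-plus-memory logic estimate for each version, and its output can be checked against the Table 7.2 entries that follow:

```c
/* Total logic estimate (butterfly + coefficient generator + coefficient
 * memory) from the Table 7.1 formulas, in slices. */
#include <math.h>
#include <stdio.h>

static double slices(int version, double N, double L)
{
    switch (version) {
    case 1: return (15*L*L + 22*L + 112) / 2 + 0.75 * L * N;
    case 2: return (45*L*L + 190*L + 548) / 8 + 3 * L + 0.75 * L * N;
    case 3: return (15*L*L + 22*L + 112) / 2
                 + (35*L*L + 162*L + 277) / 8 + 1.5 * L * sqrt(N);
    case 4: return (45*L*L + 190*L + 548) / 8
                 + (35*L*L + 246*L + 831) / 8 + 1.5 * L * sqrt(N);
    case 5: return (10*L*L + 83*L + 9*(L + 3) * log2(L + 2) + 168) / 2
                 + 1.5 * L * (log2(N) / 2.0);   /* log4(N) = log2(N)/2 */
    }
    return 0.0;
}

int main(void)
{
    for (int v = 1; v <= 5; v++)
        printf("Version %d: N=1024, L=16 -> ~%.1fK slices\n",
               v, slices(v, 1024.0, 16.0) / 1024.0);
    return 0;
}
```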
Table 7.2 Logic resources required for combinations of transform length N and word length L (approximate sizing in slices × 1K)

                                          N = 1,024         N = 4,096         N = 16,384
Version  Processing element type     L=16  L=20  L=24   L=16  L=20  L=24   L=16  L=20  L=24
I        Fast multiplier               14    18    23     50    63    77    194   243   293
II       Fast multiplier               14    18    22     50    63    76    194   243   292
III      Fast multiplier                4     6     9      5     7    10      7     9    12
IV       Fast multiplier                4     6     8      5     7     9      7     9    11
V        CORDIC unit                    3     4     5      3     4     5      3     4     5

7.6 Discussion

The primary question addressed in this chapter concerned the optimal choice of arithmetic unit, given the requirement for a resource-constrained solution to the R24 FHT. This involved the replacement of the fast fixed-point multipliers used by the GD-BFLY with a hardware-based parallel arithmetic unit which minimized the need for the use of embedded resources – at least in the shape of fast fixed-point multipliers and fast RAM for the trigonometric coefficients.

The particular design investigated was based upon the use of CORDIC arithmetic, as this is known to be computationally optimal for the operation of phase rotation – most FPGA manufacturers now provide their own version of the CORDIC
unit, in addition to the fast multipliers and RAM, as an embedded resource to be exploited by the user. The result of using such a PE is the identification of a solution to the real-data DFT offering the promise of greatly reduced quantities of silicon resources – at least for the arithmetic complexity and memory requirement – and, when implemented with FPGA technology, the possibility of adopting a lower-complexity and lower-cost device compared to that based upon the use of the fast fixed-point multiplier. The mathematical/logical correctness of the operation of the resulting CORDIC-based version of the R24 FHT solution, as with those versions based upon the use of the fast fixed-point multiplier, has been proven in software via a computer program written in the "C" programming language.

The comparative benefits of the various designs, as suggested by the complexity figures derived for a hypothetical implementation with programmable logic, should also carry over, not only when exploiting embedded resources, but also when implemented with ASIC technology, where the high regularity of the CORDIC-based design could prove particularly attractive. In fact, a recent study [7] has shown that, compared to an implementation using a standard-cell ASIC, the FPGA area required to implement a typical DSP algorithm – such as the R24 FHT – is on average 40 times larger, whilst the achievable speed, which relates to the critical path delay and hence the maximum allowable clock frequency, is on average one third of that for the ASIC. As a result, it is possible to hypothesize a dynamic power consumption for an FPGA implementation which is on average nine times greater than that for the ASIC when embedded features are used [7], increasing to 12 times when only programmable logic is used [7].

Note that the design constraint on the PE discussed in Section 6.4 of the previous chapter, concerning the total number of CSs in the computational pipeline, is applicable for the CORDIC-based solution as well as for those based upon the fast fixed-point multiplier, namely that the total number of CSs – including those corresponding to the CORDIC iterations – needs to be an odd-valued integer, so as to avoid any possible conflict problems with regard to the reading/writing of the input/output data sets from/to the eight DM banks for each new clock cycle.

Finally, it should be noted that the benefits of adopting the CORDIC-based design, rather than one of the more conventional designs based upon the use of the fast fixed-point multiplier, may only be achieved at the expense of incurring greater latency, given that the delay associated with a pipelined fixed-point multiplier might typically be of order log₂L clock cycles whereas that for the pipelined CORDIC arithmetic unit is of order L clock cycles. As a result, for the processing of 16-bit to 24-bit data – as discussed in Table 7.2 – whereas the pipelined fixed-point multiplier PE design might typically involve a total pipeline delay of nine clock cycles, say, that based upon the CORDIC arithmetic unit might involve a total pipeline delay of two to three times that size, which might in turn (at least for smaller transform sizes) necessitate the adoption of a safety-margin delay with each stage of GD-BFLYs.
References

1. R. Andraka, A survey of CORDIC algorithms for FPGA based computers. Proceedings of ACM/SIGDA 6th International Symposium on FPGAs (Monterey, CA, 1998), pp. 191–200
2. A. Banerjee, S.D. Anindya, S. Banerjee, FPGA realization of a CORDIC-based FFT processor for biomedical signal processing. Microprocessors Microsyst. (Elsevier) 25(3), 131–142 (2001)
3. M. Becvar, P. Stukjunger, Fixed-point arithmetic in FPGA. Acta Polytech. 45(2), 67–72 (2005)
4. C.H. Dick, FPGAs: the high-end alternative for DSP applications. Journal of DSP Eng. (Spring 2000)
5. K. Hwang, Computer Arithmetic: Principles, Architectures and Design (Wiley, New York, 1979)
6. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
7. I. Kuon, J. Rose, Measuring the gap between FPGAs and ASICs (FPGA '06, Monterey, CA, 2006)
8. C. Maxfield, The Design Warrior's Guide to FPGAs (Newnes/Elsevier, 2004)
9. L.R. Rabiner, B. Gold, Theory and Application of Digital Signal Processing (Prentice Hall, Englewood Cliffs, NJ, 1975)
10. RFEL: rfel.com/products/Products Cordic.asp
11. T. Sansaloni, A. Perez-Pascual, J. Valls, Area-efficient FPGA-based FFT processor. Electron. Lett. 39(19), 1369–1370 (2003)
12. J.E. Volder, The CORDIC trigonometric computing technique. IRE Trans. Electron. Comput. 8(3), 330–334 (1959)
Chapter 8
Computation of 2^n-Point Real-Data Discrete Fourier Transform

Abstract This chapter describes two solutions to the problem of the real-data DFT whereby the GD-BFLY, which has been designed for a radix-4 version of the FHT, is now used for the computation of the 2^n-point DFT where the transform length is a power of two, but not a power of four. This enables it to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4^n-point transform. The first approach is referred to as the "double-resolution" approach, as it involves FHT-based processing at double the required transform-space resolution via two half-length regularized FHTs, whilst the second approach is referred to as the "half-resolution" approach, as it involves FHT-based processing at one half the required transform-space resolution via one double-length regularized FHT. A discussion is finally provided relating to the results obtained in the chapter.
8.1 Introduction

The results discussed so far in this monograph have been concerned solely with the application of the R24 FHT to the computation of the real-data DFT where the transform length is a power of four. Given the amount of effort and resources devoted to the design of the GD-BFLY and the associated R24 FHT, however, there would be great attraction in being able to extend its range of applicability to that of the 2^n-point real-data DFT, where the transform length is a power of two, but not a power of four – note that 2^n is a power of four whenever "n" is an even-valued integer. A radix-2 version of the regularized FHT could of course be developed, but this would yield just fourfold parallelism, at best, rather than the eightfold parallelism of the radix-4 solution, whilst the time complexity would increase by a factor of log₂N/log₄N = 2. If the applicability of the R24 FHT could be generalized, therefore, without significantly compromising performance, it would result in a very flexible solution to the real-data DFT problem that would be able to address a great many more problems, including those that might not necessarily be best solved through the
direct application of a 4^n-point transform [1]. Two approaches to this problem are now addressed:
1. The first involves the exploitation of two half-length versions of the R24 FHT, with one transform being applied to the even-addressed samples of the data sequence and the second transform to the odd-addressed samples.
2. The second involves the exploitation of one double-length version of the R24 FHT, this being applied to a zero-padded version of the data sequence.
Thus, with the first approach the results are produced in Fourier space, whilst with the second approach the results are produced in Hartley space so that conversion from Hartley space to Fourier space is still required. The first approach will be referred to as the "double-resolution" approach, as it involves R24 FHT-based processing at double the required transform-space resolution – thus corresponding to a halving of the required resolving capability – whilst the second approach will be referred to as the "half-resolution" approach, as it involves R24 FHT-based processing at one half the required transform-space resolution – thus corresponding to a doubling of the required resolving capability. The required resolution will in turn be referred to as the full resolution as it will correspond exactly to the resolving capability of the sought-after solution. Note, however, that the effect of zero-padding the input data sequence is to produce interpolated results in the transform space, so that although greater accuracy may be achieved in locating tonal signals within that space, the resolving capability – that is, the ability to distinguish closely-spaced transform-space components of the signal – will not be improved at all.
8.2 Computing One DFT via Two Half-Length Regularized FHTs
This section discusses the first of the two approaches and is concerned with the computation of the 2N-point real-data DFT, where "N" is a power of four. To see how this may be achieved, using two N-point R24 FHTs, a regular and highly-parallel "R4 FHT-to-R2 FFT" conversion routine, or converter, is required which enables the Hartley-space data to be suitably combined and transformed to Fourier space. The converter exploits the following properties:
(a) The outputs of a real-data DFT of length 2N, where N is a power of four, may be obtained from the outputs of a complex-data DFT of length N (as discussed in Section 2.3.3 of Chapter 2).
(b) The real and imaginary components of the complex-data DFT outputs may each be independently obtained via a R24 FHT of length N.
The resulting FFT algorithm, which thus exploits two N-point R24 FHTs and one R4 FHT-to-R2 FFT converter, produces outputs in Fourier space rather than Hartley
space and so may be regarded as belonging to the same class of specialized real-data FFT algorithms as those discussed earlier in Section 2.2 of Chapter 2.
8.2.1 Derivation of 2^n-Point Real-Data FFT Algorithm
Let us start by denoting the real-valued data sequence by {x[n]}, with the even-addressed sub-sequence given by {x_E[n]} and the odd-addressed sub-sequence by {x_O[n]}. After processing each sub-sequence by means of an N-point R24 FHT, let the R24 FHT outputs from the processing of the even-addressed samples be denoted by {X_E^{(H)}[k]} and those obtained from the processing of the odd-addressed samples by {X_O^{(H)}[k]}. The R24 FHT outputs may then be converted to DFT outputs by means of the expressions

X_{R,E}^{(F)}[k] = \tfrac{1}{2}\bigl(X_E^{(H)}[k] + X_E^{(H)}[N-k]\bigr) \qquad (8.1)

X_{I,E}^{(F)}[k] = \tfrac{1}{2}\bigl(X_E^{(H)}[N-k] - X_E^{(H)}[k]\bigr) \qquad (8.2)

for the even-addressed terms, and

X_{R,O}^{(F)}[k] = \tfrac{1}{2}\bigl(X_O^{(H)}[k] + X_O^{(H)}[N-k]\bigr) \qquad (8.3)

X_{I,O}^{(F)}[k] = \tfrac{1}{2}\bigl(X_O^{(H)}[N-k] - X_O^{(H)}[k]\bigr) \qquad (8.4)

for the odd-addressed terms, where "X_{R,E/O}^{(F)}" denotes the real component of the DFT output and "X_{I,E/O}^{(F)}" the imaginary component. Suppose that the sequences {Y_R[k]} and {Y_I[k]} are now introduced via the expressions

Y_R[k] = X_{R,E}^{(F)}[k] - X_{I,O}^{(F)}[k] \qquad (8.5)

Y_I[k] = X_{I,E}^{(F)}[k] + X_{R,O}^{(F)}[k]. \qquad (8.6)

Then

Y_R[N-k] = X_{R,E}^{(F)}[k] + X_{I,O}^{(F)}[k] \qquad (8.7)

Y_I[N-k] = -X_{I,E}^{(F)}[k] + X_{R,O}^{(F)}[k] \qquad (8.8)

and the 2N-point real-data DFT outputs, denoted {X_R^{(F)}[k]} for the real component and {X_I^{(F)}[k]} for the imaginary component, may be written as

X_R^{(F)}[k] = \tfrac{1}{2}\bigl[Y_R[k] + Y_R[N-k] + \cos(2\pi k/2N)\,(Y_I[k] + Y_I[N-k]) - \sin(2\pi k/2N)\,(Y_R[k] - Y_R[N-k])\bigr] \qquad (8.9)

X_I^{(F)}[k] = \tfrac{1}{2}\bigl[Y_I[k] - Y_I[N-k] - \sin(2\pi k/2N)\,(Y_I[k] + Y_I[N-k]) - \cos(2\pi k/2N)\,(Y_R[k] - Y_R[N-k])\bigr], \qquad (8.10)

where the DFT outputs that correspond to the required non-negative half of the frequency spectrum are addressed by means of the index k ∈ {0, 1, ..., N−1}. The even-symmetric nature of the sinusoidal function relative to the argument π/2 (which corresponds here to the index k = N/2), together with its periodicity – see Equations 4.56–4.58 of Chapter 4 – enables four real outputs and four imaginary outputs to be produced from the application of each pair of half-resolution trigonometric function values. Thus, four complex-valued Fourier-space samples may be efficiently computed from two sets of four real-valued Hartley-space samples by means of the R4 FHT-to-R2 FFT converter, as shown in the SFG of Fig. 8.1, where the "cos" and "sin" trigonometric function values are referred to via the parameters "W_R" and "W_I", respectively. The DFT addresses for each set of four complex-valued outputs are expressed via the indices

m_1 ∈ {0, 1, ..., N/4−1}, m_2 = N−m_1−1, m_3 = N/2−m_1−1 & m_4 = N/2+m_1 \qquad (8.11)

so that each of the two memories containing the R24 FHT output sets, which are already physically partitioned column-wise into eight memory banks, now needs to be conceptually partitioned row-wise with each being divided into four quadrants, with the address "m_1" corresponding to locations in the first quadrant, "m_2" to locations in the second quadrant, "m_3" to locations in the third quadrant and "m_4" to locations in the fourth quadrant – as shown in Fig. 8.2.
The R24 FHT outputs of the even-addressed samples, {X_E^{(H)}[k]}, which are stored in the same memory as the original even-addressed data set, are subsequently overwritten by the real components of the DFT outputs, {X_R^{(F)}[k]}, this memory being thus referred to as the even-real data memory (DMER). Similarly, the R24 FHT outputs of the odd-addressed samples, {X_O^{(H)}[k]}, which are stored in the same memory as the original odd-addressed data set, are subsequently overwritten by the imaginary components of the DFT outputs, {X_I^{(F)}[k]}, this memory being thus referred to as the odd-imaginary data memory (DMOI). Note that the two R24 FHTs may be computed in either sequential or parallel mode – as discussed later – but however this is done, the computation of the two R24 FHTs precedes that of the converter, so
Fig. 8.1 Signal flow graph for R4 FHT-to-R2 FFT converter
that the same memories – the DMER and the DMOI – may be used for holding the input/output data to/from both the R24 FHTs and the converter. Thus, in order to obtain a solution that meets the latency constraint discussed in Chapter 6, it is required that after the computation of the two R24 FHTs has been completed, the R4 FHT-to-R2 FFT converter combines and transforms the two sets of Hartley-space outputs to Fourier space, doing so in the same highly-parallel fashion as is done with the individual R24 FHTs. This means that a partitioned-memory computing architecture is required which enables conflict-free and (for the data) in-place parallel memory addressing of both the data and the trigonometric coefficients for the operation of both the R24 FHTs and the R4 FHT-to-R2 FFT converter.
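To make the converter's arithmetic concrete, the following is a minimal behavioural sketch in C of Equations 8.1–8.10, operating on two half-length Hartley-space output sets. It is purely a software model – the function and array names are illustrative, not the monograph's hardware design, and none of the partitioned-memory scheduling described above is represented:

```c
#include <math.h>
#include <stddef.h>

/* Behavioural model of the R4 FHT-to-R2 FFT conversion of
 * Equations 8.1-8.10 (illustrative names, no memory partitioning).
 * XE, XO : N-point Hartley-space outputs of the even- and odd-
 *          addressed sub-sequences.
 * XR, XI : real and imaginary parts of the 2N-point real-data DFT
 *          for the non-negative frequencies k = 0 .. N-1.          */
void fht_to_fft(const double *XE, const double *XO,
                double *XR, double *XI, size_t N)
{
    const double PI = 3.14159265358979323846;
    for (size_t k = 0; k < N; k++) {
        size_t nk = (N - k) % N;                 /* index N-k (mod N) */
        double xre = 0.5 * (XE[k] + XE[nk]);     /* Eq. 8.1 */
        double xie = 0.5 * (XE[nk] - XE[k]);     /* Eq. 8.2 */
        double xro = 0.5 * (XO[k] + XO[nk]);     /* Eq. 8.3 */
        double xio = 0.5 * (XO[nk] - XO[k]);     /* Eq. 8.4 */
        double yr  = xre - xio;                  /* Eq. 8.5 */
        double yi  = xie + xro;                  /* Eq. 8.6 */
        double yrn = xre + xio;                  /* Eq. 8.7: Y_R[N-k] */
        double yin = xro - xie;                  /* Eq. 8.8: Y_I[N-k] */
        double c = cos(2.0 * PI * (double)k / (2.0 * (double)N));
        double s = sin(2.0 * PI * (double)k / (2.0 * (double)N));
        XR[k] = 0.5 * (yr + yrn + c * (yi + yin) - s * (yr - yrn)); /* 8.9  */
        XI[k] = 0.5 * (yi - yin - s * (yi + yin) - c * (yr - yrn)); /* 8.10 */
    }
}
```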
Fig. 8.2 Memory structure for in-place parallel addressing by R4 FHT-to-R2 FFT converter: the even-real memory and the odd-imaginary memory are each partitioned into four quadrants of N/4 rows × 8 columns, indexed (from top to bottom) by "m_1", "m_3", "m_4" and "m_2" – the former storing the regularized FHT output for the even-addressed input samples and the real component of the real-data FFT output, the latter storing the regularized FHT output for the odd-addressed input samples and the imaginary component of the real-data FFT output
8.2.2 Implementational Considerations
Now in order for the R4 FHT-to-R2 FFT conversion routine – which from the SFG of Fig. 8.1 requires a total of eight real multiplications and 20 real additions – to be carried out in such a manner, it is necessary that the input data set for each instance of the converter (which is obtained by taking one sample from each quadrant of both the DMER and the DMOI) is such that of the four samples obtained from each memory, no more than two possess the same memory bank address, given the dual-port nature of the memory. Unfortunately, however, this is not the case, as it is quite possible for all four samples from each memory to appear in precisely the same memory bank. To address this problem, it is first necessary to introduce a small intermediate data memory (DMI), of two-dimensional form, partitioned into eight rows of eight columns where each memory bank is capable of holding a single sample. As a result, the entire memory is capable of holding eight complete sets of GD-BFLY outputs – that is, a total of 64 samples – four even-addressed sets of outputs and four odd-addressed sets. Each GD-BFLY output set, which is stored in a single row of memory (either in the DMER or the DMOI), is now mapped to a single row (or column) of the DMI, so that once the DMI is full, the samples may then be used to provide eight complete sets of input data for feeding to the converter, where the eight samples of each set now come from eight distinct single-sample memory banks. The
R4 FHT-to-R2 FFT converter is now able to update the contents of the DMI with its own outputs, eight samples at a time, after which the contents may be written back to the DMER and the DMOI, again eight samples at a time. Thus, if a row of samples is read from each quadrant of the DMER and the DMOI, eight samples at a time, and subsequently written to a row (or column) of the DMI, eight samples at a time, in the appropriate order, then when any set of eight samples required by the converter is accessed from the DMI, the samples are obtained from eight distinct single-sample memory banks, so that the data is read/written from/to the DMI without conflict. This, in turn, enables the eight reads/writes to be carried out simultaneously in just one clock cycle, so that a suitably pipelined implementation of the R4 FHT-to-R2 FFT converter would be able to produce all eight outputs in a single clock cycle, as required.
With the introduction of a second DMI, identical in form to the first, it is now possible for eight complete sets of eight R24 FHT outputs to be built up and stored in one DMI, whilst the data in the other DMI is being processed to yield eight complete sets of DFT outputs – both the real and the imaginary components – to be written back, eight samples at a time, to the DMER and the DMOI. This processing scheme involves a start-up delay of eight clock cycles to allow for the first DMI to be filled, first time around, and a completion delay of eight clock cycles to allow for the second DMI to be emptied, last time around, whilst in between the functions of the two intermediate memories alternate every eight clock cycles, with the contents of one DMI being updated with data from the DMER and DMOI whilst the contents of the other DMI are being processed by the R4 FHT-to-R2 FFT converter. To achieve this, both the data memory and one of the intermediate memories need to be updated every eight clock cycles, so that simultaneous reads/writes are required for each memory type, where in each case care needs to be taken to ensure that memory is not updated before it has first been used – see the three consecutive updates given by the scheduling scheme of Fig. 8.3.
Thus, it is possible for a partitioned-memory computing architecture to be defined which enables the processing for both the two N-point R24 FHTs and the R4 FHT-to-R2 FFT converter to be efficiently carried out, in an in-place fashion, where the basic components of the solution are as shown in the scheme of Fig. 8.4. Note that the two intermediate memories required of such a scheme are best built with programmable logic, so as not to waste potentially large quantities of fast and expensive embedded RAM in their construction, as embedded memory normally comes with a minimum size of some several thousands of bits, rather than just a few tens of bits, as required for each of the 64 banks of each DMI.

Fig. 8.3 Scheduling of memory-based operations: over three consecutive updates the two intermediate memories alternate, with one DMI being refilled from (and written back to) the data memory whilst the current contents of the other DMI are processed
Fig. 8.4 Scheme for 2N-point real-data FFT using N-point regularized FHT: the even- and odd-addressed input samples feed two R24 FHTs (each with its own data memory and trigonometric coefficient memory), whose outputs feed the R4 FHT-to-R2 FFT converter with its own trigonometric coefficient memory
8.2.2.1 Solution Exploiting One PE for Computation of Regularized FHTs
With regard to the sequential version of the solution, whereby a single PE is assigned to the computation of the two R24 FHTs so that they must be executed sequentially, one after another, the time-complexity for the 2N-point real-data DFT using the double-resolution approach, denoted T_SDR, is given by

T_{SDR} = \tfrac{N}{4}(\log_4 N + 1) + 16 \qquad (8.12)

clock cycles, which includes the start-up and completion delays for the DMI, with the associated arithmetic complexity for both the PE and the converter given by either 20 multiplications and 42 additions, when using Versions I or III of the R24 FHT solution, or 17 multiplications and 45 additions when using Versions II or IV. Note, however, that this figure excludes any contributions for the pipeline start-up delays of both the R24 FHTs and the R4 FHT-to-R2 FFT converter.
The worst-case memory requirement for the 2N-point real-data DFT using the sequential version of the double-resolution approach involves two sets of eight DM banks for storage of the data, with each bank holding N/8 samples, and four LUTs for storage of the trigonometric coefficients – one set of three single-quadrant LUTs with each LUT holding N/4 double-resolution trigonometric coefficients for minimum-arithmetic addressing by the R24 FHT and one single-quadrant LUT holding N/2 full-resolution trigonometric coefficients for minimum-arithmetic addressing by the R4 FHT-to-R2 FFT converter. This results in a total memory requirement, denoted M^{(W)}_{SDR}, of

M^{(W)}_{SDR} = 2N + \tfrac{5}{4}N + 128 = \tfrac{13}{4}N + 128 \qquad (8.13)
words, which includes the requirement for the DMI, with the associated arithmetic complexity for the memory addressing given by zero when using Version I of the R24 FHT solution or six additions when using Version II.
In comparison, the best-case memory requirement for the 2N-point real-data DFT using the sequential version of the double-resolution approach involves two sets of eight DM banks for storage of the data, with each bank holding N/8 samples, and six complementary-angle LUTs (that is, two two-level LUTs) for storage of the trigonometric coefficients – one set of three complementary-angle LUTs with each LUT holding √N/2 double-resolution trigonometric coefficients for minimum-memory addressing by the R24 FHT and one set of three complementary-angle LUTs with each LUT holding √(2N)/2 full-resolution trigonometric coefficients for minimum-memory addressing by the R4 FHT-to-R2 FFT converter. This results in a total memory requirement, denoted M^{(B)}_{SDR}, of

M^{(B)}_{SDR} = 2N + \tfrac{3}{2}\bigl(\sqrt{N} + \sqrt{2N}\bigr) + 128 \qquad (8.14)
words, which includes the requirement for the DMI , with the associated arithmetic complexity for the addressing given by seven multiplications and eight additions when using Version III of the R24 FHT solution or seven multiplications and 14 additions when using Version IV.
8.2.2.2 Solution Exploiting Two PEs for Computation of Regularized FHTs
With regard to the parallel version of the solution, whereby a separate PE is assigned to the computation of each of the R24 FHTs so that they may be executed simultaneously, or in parallel, the time-complexity for the 2N-point real-data DFT using the double-resolution approach, denoted T_PDR, is given by

T_{PDR} = \tfrac{N}{8}(\log_4 N + 2) + 16 \qquad (8.15)
clock cycles, which includes the start-up and completion delays for the DMI , with the associated arithmetic complexity for both the two PEs and the converter given by either 32 multiplications and 64 additions, when using Versions I or III of the R24 FHT solution, or 26 multiplications and 70 additions when using Versions II or IV. Note, however, that this figure excludes any contributions for the pipeline startup delays of both the R24 FHTs and the R4 FHT-to-R2 FFT converter. The worst-case memory requirement for the 2N-point real-data DFT using the parallel version of the double-resolution approach involves two sets of eight DM
banks for storage of the data, with each bank holding N/8 samples, and seven single-quadrant LUTs for storage of the trigonometric coefficients – two sets of three single-quadrant LUTs with each LUT holding N/4 double-resolution trigonometric coefficients for minimum-arithmetic addressing by the two R24 FHTs and one single-quadrant LUT holding N/2 full-resolution trigonometric coefficients for minimum-arithmetic addressing by the R4 FHT-to-R2 FFT converter. This results in a total memory requirement, denoted M^{(W)}_{PDR}, of

M^{(W)}_{PDR} = 2N + 2N + 128 = 4N + 128 \qquad (8.16)
words, which includes the requirement for the DMI, with the associated arithmetic complexity for the addressing given by zero when using Version I of the R24 FHT solution or six additions when using Version II.
In comparison, the best-case memory requirement for the 2N-point real-data DFT using the parallel version of the double-resolution approach involves two sets of eight DM banks for storage of the data, with each bank holding N/8 samples, and nine complementary-angle LUTs (that is, three two-level LUTs) for storage of the trigonometric coefficients – two sets of three complementary-angle LUTs with each LUT holding √N/2 double-resolution trigonometric coefficients for minimum-memory addressing by the two R24 FHTs and one set of three complementary-angle LUTs with each LUT holding √(2N)/2 full-resolution trigonometric coefficients for minimum-memory addressing by the R4 FHT-to-R2 FFT converter. This results in a total memory requirement, denoted M^{(B)}_{PDR}, of

M^{(B)}_{PDR} = 2N + 3\sqrt{N} + \tfrac{3}{2}\sqrt{2N} + 128 \qquad (8.17)
words, which includes the requirement for the DMI, with the associated arithmetic complexity for the addressing given by seven multiplications and eight additions when using Version III of the R24 FHT solution or seven multiplications and 14 additions when using Version IV.
Further memory reductions to those given above by Equations 8.16 and 8.17 may be achieved, however, by simply sharing the LUTs containing the double-resolution trigonometric coefficients between the two R24 FHTs, as they both contain precisely the same information and need to be accessed in precisely the same order for each R24 FHT. This memory reduction – which results in the same memory requirement as for the sequential solution – yields worst-case and best-case figures of

M^{(W)}_{PDR} = \tfrac{13}{4}N + 128 \qquad (8.18)

and

M^{(B)}_{PDR} = 2N + \tfrac{3}{2}\bigl(\sqrt{N} + \sqrt{2N}\bigr) + 128 \qquad (8.19)
words, respectively, which includes the requirement for the DMI and could be achieved at a minimal cost of a slightly more complex memory addressing scheme.
8.2.2.3 Summary of Latency-Constrained Solutions
The theoretical performance and resource utilization figures for the latency-constrained computation of the 2N-point real-data DFT, where N is a power of four, by means of the R24 FHT are summarized in Table 8.1 below, where "S" refers to the sequential solution with one PE being assigned to the computation of both R24 FHTs, and "P" to the parallel solution with one PE being assigned to the computation of each R24 FHT. The results highlight the achievable computational density of the parallel solution, when compared to the sequential version, as the resulting throughput is nearly doubled at the minimal expense of an additional 12 fast fixed-point multipliers and 22 adders for Versions I or III of the R24 FHT solution or just nine fast fixed-point multipliers and 25 adders for Versions II or IV (and, of course, increased programmable logic).
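As a worked example of Equations 8.12 and 8.15 (an illustrative calculation, not an additional result from the monograph), consider N = 1024, i.e. a 2,048-point real-data DFT, for which log_4 N = 5:

T_{SDR} = \tfrac{1024}{4}(5 + 1) + 16 = 1{,}552 \quad\text{and}\quad T_{PDR} = \tfrac{1024}{8}(5 + 2) + 16 = 912

clock cycles, so that the dual-PE solution improves the throughput by a factor of 1552/912 ≈ 1.7 – consistent with the "nearly doubled" figure quoted above.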
8.2.2.4 Reduced-Complexity Solution for Increasing Throughput Rate
An alternative hardware-efficient solution to the 2N-point real-data DFT problem may be obtained, again using the double-resolution approach, by setting up a two-stage computational pipeline with the first stage of the pipeline performing sequentially the computation of the two N-point R24 FHTs and the second stage performing the computation of the R4 FHT-to-R2 FFT converter. With such an approach the DM requirement would need to be increased by 2N words, however, so that the outputs from the R24 FHTs could be double buffered, thereby enabling one pair of R24 FHT output sets to be processed by the converter whilst another pair of output sets was being produced. With such an approach, given that the first CS of the pipeline has 2N clock cycles within which to complete the computation of the two N-point R24 FHTs, the second CS would also have 2N clock cycles within which to complete its own task, namely the computation of the R4 FHT-to-R2 FFT converter. However, when the converter is carried out in a highly-parallel fashion, as previously described in this section, the second CS would require only N/4 + 16 clock cycles (ignoring the pipeline delay) to complete its task, so that the time-complexities of the two stages will differ by a significant factor, whereas optimum utilization of resources requires that they should be comparable. By adopting a much simpler sequential solution to the converter, however, with outputs being produced at the rate of just one or maybe two per clock cycle, rather than eight, comparable time-complexities may be achieved and at greatly reduced silicon cost due to the resulting reduction in processing complexity and the associated simplicity of the control logic. Thus, by doubling the latency and increasing the DM requirement to allow for double buffering, a much simpler solution requiring considerably less control
Table 8.1 Theoretical performance analysis for 2N-point real-data FFT where N is power of four

Type of solution         Multipliers  Adders  Memory requirement (words)        Time complexity (clock cycles)
Version I,   Mode S          20          42   (13/4)N + 128                     (N/4)(log_4 N + 1) + 16
Version I,   Mode P          32          64   (13/4)N + 128                     (N/8)(log_4 N + 2) + 16
Version II,  Mode S          17          51   (13/4)N + 128                     (N/4)(log_4 N + 1) + 16
Version II,  Mode P          26          76   (13/4)N + 128                     (N/8)(log_4 N + 2) + 16
Version III, Mode S          27          50   2N + (3/2)(√N + √(2N)) + 128      (N/4)(log_4 N + 1) + 16
Version III, Mode P          39          72   2N + (3/2)(√N + √(2N)) + 128      (N/8)(log_4 N + 2) + 16
Version IV,  Mode S          24          59   2N + (3/2)(√N + √(2N)) + 128      (N/4)(log_4 N + 1) + 16
Version IV,  Mode P          33          84   2N + (3/2)(√N + √(2N)) + 128      (N/8)(log_4 N + 2) + 16
logic may be achieved which, although not able to meet the latency constraint, is nevertheless able to produce a new set of outputs for the 2N-point real-data DFT every 2N clock cycles.
8.3 Computing One DFT via One Double-Length Regularized FHT
This section discusses the second of the two approaches and is concerned with the computation of the N-point real-data DFT, where "2N" is a power of four. To see how this may be achieved, using one 2N-point R24 FHT, let us first turn to an important result from Section 3.5 of Chapter 3, namely that of Parseval's Theorem, which states that the energy in a signal is preserved under a unitary or orthogonal transformation, such as with the DFT or DHT, this being expressed as

\sum_{n=0}^{N-1} |x[n]|^2 = \sum_{k=0}^{N-1} \bigl|X^{(F)}[k]\bigr|^2 = \sum_{k=0}^{N-1} \bigl|X^{(H)}[k]\bigr|^2, \qquad (8.20)
so that the energy measured in the data space is equal to that measured in the transform space. This result, combined with the familiar DFT-based technique of obtaining an interpolated frequency spectrum by performing the DFT on a zero-padded input data set, is now exploited to obtain a simple algorithm for obtaining the Hartley-space outputs for a 2^n-point FHT by means of the R24 FHT and thus, after Hartley space to Fourier space conversion, the Fourier-space outputs for a 2^n-point real-data DFT.
8.3.1 Derivation of 2^n-Point Real-Data FFT Algorithm
Let us start by applying the DHT to a length-N data sequence so that the output, denoted {X_N^{(H)}[k]}, is given by

X_N^{(H)}[k] = \frac{1}{\sqrt{N}} \sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N) \qquad (8.21)

and then apply the DHT to a length-2N data sequence obtained by appending N zero-valued samples to the same N samples as used above, so that the output, denoted {X_{2N}^{(H)}[k]}, is given by

X_{2N}^{(H)}[k] = \frac{1}{\sqrt{2N}} \sum_{n=0}^{2N-1} x[n]\,\mathrm{cas}(2\pi nk/2N). \qquad (8.22)
Then by considering only the even-addressed outputs of Equation 8.22 we have that

X_{2N}^{(H)}[2k] = \frac{1}{\sqrt{2N}} \sum_{n=0}^{2N-1} x[n]\,\mathrm{cas}(2\pi n\,2k/2N) = \frac{1}{\sqrt{2N}} \sum_{n=0}^{N-1} x[n]\,\mathrm{cas}(2\pi nk/N), \qquad (8.23)

so that

X_{2N}^{(H)}[2k] = \frac{1}{\sqrt{2}}\,X_N^{(H)}[k], \qquad (8.24)

meaning that the signal energy measured at index "k" using the N-point transform is equal to twice that obtained when it is measured at the corresponding index (that is, "2k") using the 2N-point transform, as with the longer transform the energy is being spread over twice as many outputs. In fact, from Parseval's Theorem, we have that

\sum_{n=0}^{N-1} |x[n]|^2 = \sum_{k=0}^{N-1} \bigl|X_N^{(H)}[k]\bigr|^2
 = \sum_{k=0}^{N-1} \Bigl( \bigl|X_{2N}^{(H)}[2k]\bigr|^2 + \bigl|X_{2N}^{(H)}[2k+1]\bigr|^2 \Bigr)
 = \sum_{k=0}^{N-1} \Bigl( \tfrac{1}{2}\bigl|X_N^{(H)}[k]\bigr|^2 + \bigl|X_{2N}^{(H)}[2k+1]\bigr|^2 \Bigr) \qquad (8.25)
so that one half of the signal energy is contained in the even-addressed outputs and the other half in the odd-addressed outputs. The Hartley-space outputs of interest correspond to the even-addressed outputs, so that although the solution to the 2N-point DHT – as carried out by the R24 FHT – produces all 2N outputs, both even-addressed and odd-addressed, it is only the even-addressed outputs that need to be subsequently converted from Hartley space to Fourier space.
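The relationship of Equation 8.24 is easily verified numerically; the following C sketch does so with a naive O(N²) DHT (used here purely for illustration – the whole point of the R24 FHT is, of course, to avoid such direct computation; the data values are arbitrary):

```c
#include <math.h>
#include <stdio.h>

#define N 8                      /* illustrative transform length */

/* Naive O(N^2) DHT with 1/sqrt(len) normalization, as in
 * Equations 8.21-8.22 (for illustration only).                  */
static void dht(const double *x, double *X, int len)
{
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < len; k++) {
        double acc = 0.0;
        for (int n = 0; n < len; n++) {
            double a = 2.0 * PI * n * k / len;
            acc += x[n] * (cos(a) + sin(a));   /* cas(a) */
        }
        X[k] = acc / sqrt((double)len);
    }
}

int main(void)
{
    double x[N] = {1, 3, -2, 5, 0, 4, -1, 2};
    double xp[2 * N] = {0};                    /* zero-padded copy */
    double XN[N], X2N[2 * N];
    for (int n = 0; n < N; n++) xp[n] = x[n];

    dht(x,  XN,  N);
    dht(xp, X2N, 2 * N);

    /* Equation 8.24: X_2N[2k] should equal X_N[k] / sqrt(2) */
    for (int k = 0; k < N; k++)
        printf("k=%d  X2N[2k]=%9.5f  XN[k]/sqrt2=%9.5f\n",
               k, X2N[2 * k], XN[k] / sqrt(2.0));
    return 0;
}
```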
8.3.2 Implementational Considerations Although the data need only be generated N samples at a time, the on-chip memory – in the form of the eight memory banks for the storage of the data and three LUTs for the storage of the trigonometric coefficients, as discussed in Chapter 6 – needs to cater for twice that amount of data and up to twice the corresponding number of trigonometric coefficients due to the fact that half-resolution processing is being used to derive the required DFT outputs. As a result, each DM bank needs to be
able to hold N/4 data samples and each of the three LUTs needs to be able to hold either N/2 trigonometric coefficients, for Versions I or II of the R24 FHT solution, or √(2N)/2 trigonometric coefficients for Versions III or IV. Thus, disregarding the Hartley space to Fourier space conversion requirement, which from Section 3.4 of Chapter 3 is trivial, the time-complexity for the N-point real-data DFT using the half-resolution approach, denoted T_HR, is given by

T_{HR} = \tfrac{N}{4}\log_4 2N \qquad (8.26)

clock cycles, which excludes any contribution for the pipeline delay of the R24 FHT, whilst the worst-case total memory requirement, denoted M^{(W)}_{HR}, is given by

M^{(W)}_{HR} = 2N + \tfrac{3}{2}N = \tfrac{7}{2}N \qquad (8.27)

words, and the best-case total memory requirement, denoted M^{(B)}_{HR}, is given by

M^{(B)}_{HR} = 2N + \tfrac{3}{2}\sqrt{2N} \qquad (8.28)
words. Thus, it is evident from the time-complexity figure of Equation 8.26 that in order to produce a new Hartley-space output set of length 2N every N clock cycles, as required, it will be necessary to set up a new input data set every N clock cycles – each set comprising N new samples and N zero-valued samples – and for values of "N" such that 2N > 256 (from the time-complexity figures of Equations 6.11 and 6.12 in Chapter 6) to effectively double the throughput of the standard R24 FHT-based approach. One way of achieving this is to process alternate data sets on separate R24 FHTs running in parallel and offset by N clock cycles relative to each other – see Fig. 8.5. In this way, the latency of each R24 FHT would be bounded above by 2N clock cycles, for those transform sizes of interest, whilst for the same transform sizes the update time of the dual-R24 FHT system would be bounded above by just N clock cycles.
Fig. 8.5 Dual-R24 FHT approach to half-resolution processing scheme: successive regularized FHTs are launched alternately on the two devices at times t = 0, N, 2N, ..., 7N, offset by N clock cycles
Therefore, given that a single 2N-point R24 FHT requires twice the DM requirement and up to twice the CM requirement (depending upon the addressing scheme used) of a single N-point R24 FHT – albeit with the same arithmetic complexity – the required update time, achieved via the use of two 2N-point R24 FHTs, would involve up to four times the memory requirement and twice the arithmetic complexity. An alternative approach to that described above would be to adopt a single 2N-point R24 FHT, rather than two, but to assign two PEs to its computation, as described in Section 6.6 of Chapter 6, thus doubling the throughput of the R24 FHT and enabling the processing to keep up with the I/O over each block of data. The feasibility of a dual-PE solution such as this would clearly be determined, however, by the viability of using either the more complex quad-port memory or a doubled read/write access rate to the dual-port memory, for both the DM and the CM, as it will be necessary to read/write two samples from/to each of the eight memory banks for each clock cycle, as well as to read four (rather than two) trigonometric coefficients from each of the LUTs. Thus, achieving the required update time via the use of a dual-PE R24 FHT such as this would involve twice the arithmetic complexity of a single N-point R24 FHT solution, together with either the replacement of all dual-port memory by quad-port memory or a doubling of the read/write access rate to the dual-port memory. As a result, the achievable computational density for a solution to the real-data DFT based upon the half-resolution approach that achieves the required timing constraint may be said to lie between one quarter and one half of that achievable for a 4^n-point real-data DFT via the conventional use of the R24 FHT, the exact fraction being dependent upon the length of the transform – the longer the transform the larger the relative memory requirement and the lower the relative computational density – and the chosen approach. The practicality of such a solution is therefore very much dependent upon the implementational efficiency of the R24 FHT compared to that of other commercially-available solutions. The results of Chapters 6 and 7, however, would seem to suggest the adoption of the R24 FHT for both 2^n-point and 4^n-point cases to be a perfectly viable option.
8.4 Discussion
The first solution discussed in Section 8.2 has shown how the highly-parallel GD-BFLY may be effectively exploited for the computation of the 2N-point real-data DFT, where "N" is a power of four. The solution was obtained by means of a "double-resolution" approach involving FHT-based processing at double the required transform-space resolution via the application of two half-length regularized FHTs. The R4 FHT-to-R2 FFT converter uses a conflict-free and in-place parallel memory addressing scheme to enable the computation for the 2^n-point case to be carried out in the same highly-parallel fashion as for the 4^n-point case. The solution has some other interesting properties, even when the complexity is viewed purely in terms of sequential arithmetic operation counts, as the computation of the 2N-point real-data DFT – when N is a power of four – requires a total of
C_{FFT}^{(mply)} = 2N \log_2 2N \qquad (8.29)
real multiplications when obtained via one of the real-from-complex strategies discussed in Chapter 2, using the standard complex-data Cooley–Tukey algorithm, but only

C_{FHT}^{(mply)} = N\,(3\log_4 N + 2) \qquad (8.30)

real multiplications when obtained via the combined use of the R24 FHT and the R4 FHT-to-R2 FFT converter. Thus, for the computation of a 2K-point real-data DFT, for example, this means 22,528 real multiplications via the complex-data radix-2 FFT or 15,360 real multiplications via the R24 FHT, implying a reduction of nearly one-third by using the solution outlined here. The split-radix algorithm could be used instead of the Cooley–Tukey algorithm to further reduce the multiplication count of the radix-2 FFT but only at the expense of a loss of regularity in the FFT design.
The second solution discussed in Section 8.3 has shown how the highly-parallel GD-BFLY may be effectively exploited for the computation of the N-point real-data DFT, where "2N" is a power of four. The solution was obtained by means of a "half-resolution" approach involving FHT-based processing at one half the required transform-space resolution via the application of one double-length regularized FHT. A point worth noting is that if

DHT({x[0], x[1], x[2], x[3]}) = {X^{(H)}[0], X^{(H)}[1], X^{(H)}[2], X^{(H)}[3]} \qquad (8.31)

say, then it is also true, via a theorem applicable to both the DFT and the DHT, namely the Stretch or Repeat Theorem [2], that

DHT({x[0], x[1], x[2], x[3], x[0], x[1], x[2], x[3]}) = {2X^{(H)}[0], 0, 2X^{(H)}[1], 0, 2X^{(H)}[2], 0, 2X^{(H)}[3], 0} \qquad (8.32)
this result being true, not just for the four-point sequence shown, but for a data sequence of any length. As a result, an alternative to the zero-padding approach, which instead involves the idea of transforming a repeated or replicated data set, could be used to extract the required FHT outputs from those of a double-length R24 FHT. Note, however, that the magnitudes of the required even-addressed output samples are twice what they should be so that scaling may be necessary – namely division by two which in fixed-point hardware reduces to that of a simple right shift operation – in order to achieve the correct magnitudes, this being applied either to the input samples or to the output samples.
The two solutions discussed, based upon both double-resolution and half-resolution approaches – and for which the mathematical/logical correctness of their operation has been proven both in software, via a computer program written in the "C" programming language, and in silicon with a non-optimized Virtex-II Pro
100 FPGA implementation – thus enable the R24 FHT to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4^n-point transform.
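As a footnote, the Stretch Theorem result of Equations 8.31 and 8.32 – including the factor of two that motivates the right-shift correction just described – may be checked numerically with a naive un-normalized DHT, as in the following illustrative C sketch (data values arbitrary):

```c
#include <math.h>
#include <stdio.h>

/* Un-normalized DHT, matching the form of the Stretch/Repeat
 * Theorem in Equations 8.31-8.32 (for illustration only).       */
static void dht(const double *x, double *X, int len)
{
    const double PI = 3.14159265358979323846;
    for (int k = 0; k < len; k++) {
        X[k] = 0.0;
        for (int n = 0; n < len; n++) {
            double a = 2.0 * PI * n * k / len;
            X[k] += x[n] * (cos(a) + sin(a));   /* cas(a) */
        }
    }
}

int main(void)
{
    double x[4] = {1, 3, -2, 5};
    double xx[8], X4[4], X8[8];
    for (int n = 0; n < 8; n++) xx[n] = x[n % 4];  /* replicate */

    dht(x,  X4, 4);
    dht(xx, X8, 8);

    /* Expect X8 = {2*X4[0], 0, 2*X4[1], 0, 2*X4[2], 0, 2*X4[3], 0} */
    for (int k = 0; k < 8; k++)
        printf("X8[%d] = %8.4f   %s\n", k, X8[k],
               (k % 2 == 0) ? "(should be 2*X4[k/2])" : "(should be ~0)");
    return 0;
}
```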
References
1. K.J. Jones, R. Coster, Area-efficient and scalable solution to real-data fast Fourier transform via regularised fast Hartley transform. IET Signal Process. 1(3), 128–138 (2007)
2. J.O. Smith III, Mathematics of the Discrete Fourier Transform (DFT) with Audio Applications (W3K Publishing, Stanford, CA, 2007)
Chapter 9
Applications of Regularized Fast Hartley Transform
Abstract This chapter discusses the application of the regularized FHT to a number of computationally-intensive DSP-based functions that may benefit from the adoption of a transform-space solution, and in particular, where the data in question is real valued so that the processing may be efficiently carried out in Hartley space. The functions discussed are those of up-sampling, differentiation, correlation – both auto-correlation and cross-correlation – and channelization. Efficient channelization, for the case of a single channel (or small number of channels), may be achieved by means of a DDC process where the filtering is performed via fast Hartley-space convolution, whilst for the case of multiple channels, efficiency may be achieved via the application of the polyphase DFT filter bank. Each such function might typically be encountered in that increasingly important area of wireless communications relating to the geolocation of signal emitters, with each potentially able to yield both conceptually and computationally simplified solutions when solved via the regularized FHT. A discussion is finally provided relating to the results obtained in the chapter.
9.1 Introduction
Having now seen how the R24 FHT might be used for the efficient parallel computation of an N-point DFT where N may be a power of either two or four – although for optimal computational density it should be a power of four – the monograph concludes with the description of a number of DSP-based functions where the adoption of Hartley space, rather than Fourier space, as the chosen transform space within which to carry out the processing, may potentially lead to conceptually and computationally simplified solutions. Three particular sets of functions common to many modern DSP systems are discussed, namely:
1. The up-sampling and differentiation – for the case of both first and second derivatives – of a real-valued signal, either individually or in combination.
2. The correlation function of two real-valued or complex-valued signals where the signals may both be of infinite duration, as encountered with cross-correlation,
or where one signal is of finite duration and the other of infinite duration, as encountered with auto-correlation.
3. The channelization of a real-valued signal which, for the case of a single channel (or small number of channels), may be achieved by means of a DDC process where the filtering is carried out via fast Hartley-space convolution, whilst for the case of multiple channels, may be achieved via the application of the polyphase DFT filter bank.
One important area of wireless communications where all three sets of functions might typically be encountered is that relating to the geolocation [8] of signal emitters, where there is a requirement to produce accurate timing measurements from the data gathered at a number of sensors, these measurements being generally obtained from the up-sampled outputs of a correlator. When the signal under analysis is of sufficiently wide bandwidth, however, the data would first have to be partitioned in frequency before such measurements could be made so as to optimize the SNR of the signal for specific frequency bands of interest prior to the correlation process. For the case of a single channel (or small number of channels) the associated filtering operation may, depending upon the parameters, be most efficiently carried out by means of fast transform-space convolution, whilst when there is a sufficiently large number of equi-spaced and equi-bandwidth channels, this process – which is generally referred to in the technical literature as channelization – is best carried out by means of a polyphase DFT filter bank [1, 4, 12].
The adoption of the transform-space approach in signal processing makes particular sense when a significant amount of the processing is able to be efficiently carried out in the transform space, so that several distinct tasks might be beneficially performed there before the resulting signal is transformed back to data space. A multi-sensor digital signal conditioner [5] has been defined, for example, which exploits the transform-space approach to carry out in a highly efficient manner, in Fourier space, the various tasks of sample-rate conversion, spectral shaping or filtering and malfunctioning sensor detection and compensation.
9.2 Fast Transform-Space Convolution and Correlation
Given the emphasis placed on the transform-space approach in this chapter it is perhaps worth illustrating firstly its importance by considering the simple case of the filtering of a real-valued signal by means of a length N FIR filter. A linear system [9, 10] such as this is characterized by means of an output signal that is obtained from the convolution of the system input signal with the system impulse response – as represented by a finite set of filter coefficients. A direct data-space formulation of the problem may be written, in un-normalized complex-data form, as

R_{h,x}^{conv}[k] = \sum_{n=0}^{N-1} h^{*}[n]\,x[k-n], \qquad (9.1)
where the superscript "*" refers to the operation of complex conjugation, so that each filter output requires N multiplications – this yields an O(N²) arithmetic complexity for the production of N filter outputs. Alternatively, a fast Hartley-space convolution approach – see Section 3.5 of Chapter 3 – combined with the familiar overlap-save or overlap-add technique [2] associated with conventional FFT-based linear convolution [2] (where the FHT of the filter coefficient set is fixed and precomputed), might typically involve the application of two 2N-point FHTs and one transform-space product of length 2N in order to produce N filter outputs – this yields an O(N log N) arithmetic complexity. Thus, with a suitably chosen FHT algorithm, clear computational gains are achievable via fast Hartley-space convolution for relatively small values of N.
The correlation function is generally defined as measuring the degree of correlation or similarity between a given signal and a shifted replica of that signal. From this, the basic data-space formulation for the cross-correlation function of two arbitrary complex-valued signals may be written, in un-normalized form and with arbitrary upper and lower limits, as

R_{h,x}^{corr}[k] = \sum_{n=lower}^{upper} h^{*}[n]\,x[k+n], \qquad (9.2)
which is similar in form to that for the convolution function of Equation 9.1 except that there is no need to apply the folding operation [2] to one of the two functions to be correlated. In fact, if either of the two functions to be correlated is an even function, then the operations of convolution and correlation are equivalent. The above expression is such that: (1) when both sequences are of finite length, it corresponds to the cross-correlation function of two finite-duration signals – to be discussed in Section 9.4.2; (2) when one sequence is of infinite length and the other a finite-length stored reference it corresponds to the auto-correlation function – to be discussed in Section 9.4.3; or (3) when both sequences are of infinite length it corresponds to the cross-correlation function of two continuous data streams – to be discussed in Section 9.4.4. As evidenced from the discussion above relating to the convolution-based filtering problem, the larger the correlation problem the greater the potential benefits to be gained from the adoption of a transform-space approach, particularly when the correlation operation is carried out by means of a fast unitary/orthogonal transform such as the FFT or the FHT.
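For reference, a direct data-space implementation of the correlation of Equation 9.2 for real-valued data (where the conjugation is trivial) might take the following form in C – this is the O(N²)-style baseline that the transform-space approach is intended to beat; the function and variable names are illustrative only:

```c
#include <stddef.h>

/* Direct data-space cross-correlation of two real-valued finite
 * sequences, per Equation 9.2.  h has Nh samples, x has Nx samples;
 * lag k runs over the Nh+Nx-1 values -(Nh-1) .. Nx-1, with R[k]
 * stored at r[k + Nh - 1].  Cost is O(Nh*Nx).                     */
void xcorr_direct(const double *h, size_t Nh,
                  const double *x, size_t Nx, double *r)
{
    for (ptrdiff_t k = -(ptrdiff_t)(Nh - 1); k < (ptrdiff_t)Nx; k++) {
        double acc = 0.0;
        for (size_t n = 0; n < Nh; n++) {
            ptrdiff_t m = k + (ptrdiff_t)n;      /* x index k+n */
            if (m >= 0 && m < (ptrdiff_t)Nx)
                acc += h[n] * x[m];
        }
        r[k + (ptrdiff_t)Nh - 1] = acc;
    }
}
```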
9.3 Up-Sampling and Differentiation of Real-Valued Signal
This section looks briefly at how two basic DSP-based functions, namely those of up-sampling and differentiation, might be efficiently carried out by first transforming the real-valued signal from data space to Hartley space, via the application of a
DHT, then modifying in some way the resulting Hartley-space data, before returning to the data space via the application of a second DHT to obtain the data corresponding to an appropriately modified version of the original real-valued signal.
9.3.1 Up-Sampling via Hartley Space
The first function considered is that of up-sampling where the requirement is to increase the sampling rate of the signal without introducing additional frequency components to the signal outside of its frequency range or band of definition – this function being also referred to as band-limited interpolation. Suppose that the signal is initially represented by means of "N" real-valued samples and that it is required to increase or interpolate this by a factor of "L". To achieve this, the real-valued data is first transformed from data space to Hartley space, via the application of a DHT of length N, with zero-valued samples being then inserted between the samples of the Hartley-space data according to the following rule [11]:

Y[k] = \begin{cases} L\,X[k] & \text{for } k \in [0,\,N/2-1] \\ \tfrac{1}{2}L\,X[N/2] & \text{for } k = N/2 \\ 0 & \text{for } k \in [N/2+1,\,M-N/2-1] \\ \tfrac{1}{2}L\,X[N/2] & \text{for } k = M-N/2 \\ L\,X[k-M+N] & \text{for } k \in [M-N/2+1,\,M-1] \end{cases} \qquad (9.3)

where M = L·N, before returning to the data space via the application of a second DHT, this time of length M, to obtain the resulting up-sampled signal, as required – see Fig. 9.1. Note that the non-zero terms in the above expression have been magnified by a factor of "L" so as to ensure, upon return to the data space, that the magnitudes of the original samples are preserved.
Note that the above technique, which has been defined for the up-sampling of a single segment of signal data, may be straightforwardly applied to the case of a continuous signal through the piecing together of multiple data-space signal segments
Fig. 9.1 Scheme for up-sampling of signal using DHT: {x[n]} → DHT → {X^{(H)}[k]} → zero-pad centre of spectrum (see Equation 9.3) → {Y^{(H)}[k]} → DHT → {y[n]}
via a suitably adapted reconstruction technique [3] which combines the use of the overlap-save technique, as associated with conventional FFT-based linear convolution, with that of temporal windowing [7], in order to keep the root-mean-square (RMS) interpolation error to an acceptable level. Without taking such precautions, the interpolation error may well prove to be unacceptably high due to the inclusion of error maxima near the segment boundaries – this problem being referred to in the technical literature as the end or boundary effect [2].
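A minimal C sketch of the zero-padding rule of Equation 9.3 is given below, assuming N is even and the N-point DHT outputs are already available; the function and variable names are illustrative only:

```c
#include <stddef.h>

/* Hartley-space zero-padding rule of Equation 9.3: expands the
 * N-point DHT output X into the M-point (M = L*N) array Y, ready
 * for the second (length-M) DHT of Fig. 9.1.  N assumed even.   */
void upsample_hartley(const double *X, double *Y, size_t N, size_t L)
{
    size_t M = L * N;
    size_t k;
    for (k = 0; k < M; k++) Y[k] = 0.0;          /* zero the centre  */
    for (k = 0; k < N / 2; k++)
        Y[k] = (double)L * X[k];                 /* k in [0, N/2-1]  */
    Y[N / 2]     = 0.5 * (double)L * X[N / 2];   /* k = N/2          */
    Y[M - N / 2] = 0.5 * (double)L * X[N / 2];   /* k = M - N/2      */
    for (k = M - N / 2 + 1; k < M; k++)
        Y[k] = (double)L * X[k - M + N];         /* upper band edge  */
}
```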
9.3.2 Differentiation via Hartley Space
The second function considered is that of differentiation, and from the First and Second Derivative Theorems of Section 3.5 in Chapter 3 it was stated, for the case of an N-point data set, that

DHT({x'[n]}) = {2\pi k\,X^{(H)}[N-k]} \qquad (9.4)

and

DHT({x''[n]}) = {-4\pi^2 k^2\,X^{(H)}[k]}, \qquad (9.5)

respectively, so that by transforming the real-valued signal from data space to Hartley space, via the application of a DHT of length N, then modifying the resulting Hartley-space samples according to Equation 9.4 or 9.5 above, before returning to the data space via the application of a second DHT, also of length N, it is possible to obtain the first or second derived function corresponding to the original real-valued signal, as required – see Fig. 9.2.
Fig. 9.2 Scheme for differentiation of signal using DHT: {x[n]} → DHT → {X^{(H)}[k]} → modify, Y^{(H)}[k] = 2πk·X^{(H)}[N−k] → {Y^{(H)}[k]} → DHT → {y[n]}

9.3.3 Combined Up-Sampling and Differentiation
Note from the results of the above two sections that it is a straightforward task to carry out both the up-sampling and the differentiation of the real-valued signal by simply applying both sets of modifications to the same set of Hartley-space samples before returning to the data space. Thus, after modifying the Hartley-space samples according to Equation 9.4 or 9.5 of Section 9.3.2, the resulting samples are then zero-padded according to Equation 9.3 of Section 9.3.1, before being returned to the data space via the application of a second DHT to yield an up-sampled version of the first or second derived function of the original real-valued signal, as required (see Fig. 9.3).

Fig. 9.3 Scheme for combined up-sampling and differentiation of signal using DHT: {x[n]} → DHT → modify, Z^{(H)}[k] = 2πk·X^{(H)}[N−k] → zero-pad centre of spectrum (see Equation 9.3) → {Y^{(H)}[k]} → DHT → {y[n]}
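The Hartley-space modification of Equation 9.4 reduces to a simple re-indexing and scaling loop; the following C sketch implements the first-derivative case exactly as the theorem is stated (names illustrative), and its output array may be fed to the zero-padding routine of Section 9.3.1 for the combined scheme of Fig. 9.3:

```c
#include <stddef.h>

/* Hartley-space modification for the first derivative, per
 * Equation 9.4: Y[k] = 2*pi*k * X[N-k].  A second N-point DHT
 * applied to Y then yields the derived function (Fig. 9.2).    */
void derivative_hartley(const double *X, double *Y, size_t N)
{
    const double TWO_PI = 6.28318530717958647692;
    Y[0] = 0.0;                               /* k = 0 term vanishes */
    for (size_t k = 1; k < N; k++)
        Y[k] = TWO_PI * (double)k * X[N - k];
}
```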
9.4 Correlation of Two Arbitrary Signals
Having covered very briefly the problems of up-sampling and differentiation, the computationally more intensive problem of correlation, as introduced in Section 9.2, is now addressed in some detail. As evidenced from the discussions of Section 9.2 relating to fast transform-space convolution and correlation, when the correlation operation is performed upon two finite segments of signal, each comprising "N" samples, a direct data-space implementation will yield an O(N²) arithmetic complexity, whereas a transform-space implementation involving two forward-direction transforms, one transform-space product and one reverse-direction transform will yield an O(N log N) arithmetic complexity, via the application of a fast unitary/orthogonal transform, which suggests that the larger the correlation problem the greater the potential benefits to be gained from the adoption of a transform-space approach. A key ingredient for the success and the generality of the transform-space approach is in being able to carry out a linear correlation by means of one or more circular correlations, so that by invoking the Circular Correlation Theorem [2] – which is analogous to the more familiar Circular Convolution Theorem [2] – it is possible to move the processing from the data space to the transform space where a fast algorithm may be exploited. Thus, when the data in question is complex-valued,
the processing may be carried out in Fourier space via the use of an FFT, whereas when the data is real-valued, it may be carried out in Hartley space via the use of an FHT. Note that with the problem of geolocation, it is possible for either cross-correlation or auto-correlation to be encountered: if the sensors operate in passive mode, then each operation will be assumed to be that of cross-correlation and thus to be performed on signals from two different sensors to provide time-difference-of-arrival (TDOA) or equivalent relative range measurements, whereas if the sensors operate in active mode, then each operation will be assumed to be that of auto-correlation (so that one of the two signals is simply a stored reference of the other) to provide time-of-arrival (TOA) or equivalent range measurements. The essential difference, in terms of processing requirement, between the two modes of operation, is that with auto-correlation, one of the two signals is of finite duration and the other of infinite duration, whilst with cross-correlation, both of the signals are of infinite duration. The signal of interest is typically in the form of a sampled pulse or pulse train, for both active and passive systems, so that the received signal, although often regarded as being of infinite duration for the purposes of correlator implementation, is actually a succession of temporally-spaced finite-duration segments.
9.4.1 Computation of Complex-Data Correlation via Real-Data Correlation
Although all of the techniques discussed in this chapter are geared to the processing of real-valued signals, it is worth pointing out that as the operation of correlation, denoted by means of the symbol "⊗", is a linear process – and thereby satisfying the property of additivity – the correlation of two complex-valued signals – as encountered, for example, when the signal processing is carried out at base-band [4, 9, 10, 12] – may be decomposed into the summation of four correlations each operating upon two real-valued signals, so that

{X_R[n] + i\,X_I[n]} ⊗ {Y_R[n] + i\,Y_I[n]} ≡ ({X_R[n]} ⊗ {Y_R[n]} + {X_I[n]} ⊗ {Y_I[n]}) + i\,({X_R[n]} ⊗ {Y_I[n]} − {X_I[n]} ⊗ {Y_R[n]}), \qquad (9.6)
this expression taking into account the operation of complex conjugation to be performed upon one of the two input signals – as shown in Equation 9.2. The attraction of the complex-to-real decomposition described here for the complex-data correlation operation is that it introduces an additional level of parallelism to the problem as the resulting real-data correlations are independent and thus able to be computed simultaneously, or in parallel, as shown in Fig. 9.4. This is particularly relevant when the quantities of data to be correlated are large and the
Fig. 9.4 Scheme for complex-data correlation via real-data correlation: the four real-data correlations of {X_R[n]} and {X_I[n]} with {Y_R[n]} and {Y_I[n]} are computed in parallel and additively combined to give {Z_R[n]} and {Z_I[n]}
throughput requirement high as a transform-space approach may then be the only viable approach to adopt, leaving the conventional complex-data approach to rely upon the parallelization of the complex-data FFT and its inverse as the only logical means of achieving the required performance. With the complex-to-real decomposition, however, the required performance may be more easily obtained by running in parallel multiple versions of the R24 FHT in both forward and reverse directions. The transformation from data space to Hartley space, for example, may be carried out by running in parallel two (when using a stored reference) or four (when cross-correlating two arbitrary signals) R24 FHTs, this followed by the computation of four sets of transform-space products, again in parallel, with each transform-space product taking the form of

Z[k] = \tfrac{1}{2} X^{(H)}[k]\bigl(Y^{(H)}[k] + Y^{(H)}[N-k]\bigr) + \tfrac{1}{2} X^{(H)}[N-k]\bigl(Y^{(H)}[N-k] - Y^{(H)}[k]\bigr) \qquad (9.7)

The results of the four transform-space products may then be additively combined prior to the results being transformed back to the data space by running in parallel two R24 FHTs to yield the required data-space correlation results – see Fig. 9.5. Thus, compared to a solution based upon the use of a complex-data FFT, this approach results in a potential doubling of the parallelism (in addition to that achievable via the efficient implementation of the R24 FHT, as discussed in Chapter 6) with which to increase the throughput of the complex-data correlation operation.
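Each of the four transform-space products of Fig. 9.5 reduces to the simple point-wise combining loop of Equation 9.7, sketched below in C (names illustrative):

```c
#include <stddef.h>

/* Hartley-space product of Equation 9.7, used for each of the
 * four real-data correlations of Fig. 9.5.                      */
void hartley_combine(const double *X, const double *Y,
                     double *Z, size_t N)
{
    for (size_t k = 0; k < N; k++) {
        size_t nk = (N - k) % N;    /* index N-k, with k = 0 -> 0 */
        Z[k] = 0.5 * X[k]  * (Y[k]  + Y[nk])
             + 0.5 * X[nk] * (Y[nk] - Y[k]);
    }
}
```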
9.4.2 Cross-Correlation of Two Finite-Length Data Sets
Before moving on to the two important cases of auto-correlation and cross-correlation where at least one of the two signals is of infinite duration, the simple
Fig. 9.5 Scheme for complex-data correlation using DHT: the four input sequences are transformed by parallel DHTs, combined in Hartley space via Equation 9.7, additively merged, and returned to data space by two further DHTs to give {Z_R[n]} and {Z_I[n]}
problem of cross-correlating two finite-duration signals by means of the DHT is considered. To achieve this, if one of the two signal segments is represented by "N1" samples and the other signal segment by "N2" samples, then the length "N" of the DHT is first chosen so that

N \geq N_1 + N_2 - 1 \qquad (9.8)
One segment is then pre-zero-padded out to a length of "N" samples and the other segment post-zero-padded also out to a length of "N" samples. Following this, each zero-padded segment is passed through the N-point DHT, their transforms then multiplied, sample-by-sample, before the transform-space product is transformed back to the data space by means of another N-point DHT, to yield the required cross-correlator output. There will, however, be a deterministic shift of length

S = N - (N_1 + N_2 - 1) \qquad (9.9)

samples, which needs to be accounted for when interpreting the output, as the resulting data set out of the final DHT comprises "N" samples whereas the correlation of the two segments is known to be only of length N_1 + N_2 − 1. This procedure is outlined in Fig. 9.6.
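As an illustrative example of Equations 9.8 and 9.9 (the numbers here are hypothetical), suppose the two segments comprise N_1 = 300 and N_2 = 500 samples: the smallest power-of-two DHT length satisfying Equation 9.8 is then N = 1024 ≥ 799, and Equation 9.9 gives a deterministic shift of S = 1024 − 799 = 225 samples to be allowed for when interpreting the correlator output.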
Fig. 9.6 Scheme for correlation of two signal segments ({x[n]} pre-zero-padded and {y[n]} post-zero-padded, each passed through an N-point DHT to give {X(H)[k]} and {Y(H)[k]}, combined in Hartley space and returned to the data space by a further DHT to give {z[n]}; Combine: (1/2) X(H)[k] (Y(H)[k] + Y(H)[N − k]) + (1/2) X(H)[N − k] (Y(H)[N − k] − Y(H)[k]))

9.4.3 Auto-Correlation: Finite-Length Against Infinite-Length Data Sets

The next type of problem considered relates to that of auto-correlation whereby a finite-duration signal segment – in the form of a stored reference – is being
correlated against a continuous or infinite-duration signal. The stored reference correlator is commonly referred to in the technical literature as a matched filter, where the output of a detector based upon the application of such a filter is known to optimize the peak received SNR in the presence of additive white Gaussian noise (AWGN). The output is also known to correspond – at least for the case of idealized distortion-free and multipath-free propagation – to the auto-correlation function of the stored signal. This type of problem is best tackled by viewing it as a segmented correlation, a task most simply solved by means of the familiar overlap-save or overlap-add technique associated with conventional FFT-based linear convolution. The approach involves decomposing the infinite-duration received signal into segments and computing the correlation of the stored reference and the received signal as a number of smaller circular correlations. With the overlap-save technique, for example, suitable zero-padding of the stored reference combined with the selection of an appropriate segment length enables the required correlation outputs to be obtained from the segmented circular correlation outputs without the need for further arithmetic. With the overlap-add technique, on the other hand, the received signal segments need also to be zero-padded, with the required correlation outputs being obtained through appropriate combination – although only via addition – of the segmented circular correlation outputs. A solution based upon the adoption of the overlap-save technique is as outlined in Fig. 9.7, where the stored reference comprises “N1” samples, the DHT is of length “N”, where

N ≥ 2N1,   (9.10)

and the number of valid samples produced from each length-N signal segment out of the correlator is given by “N2”, where
N2 = N − N1 + 1,
(9.11)
these samples appearing at the beginning of each new output segment, with the last N1 − 1 samples of each such segment being invalid and thus discarded. To achieve this, consecutive length-N segments of signal are overlapped by N1 − 1 samples, with the first such segment being pre-zero-padded by N1 − 1 samples to account for the lack of a predecessor. The optimal choice of segment length is dependent very much upon the length of the stored reference, with a sensible lower limit being given by twice the length of the stored reference – as given by Equation 9.10. Clearly, the shorter the segment length, the smaller the memory requirement but the lower the computational efficiency of the solution, whereas the larger the segment length, the higher the computational efficiency but the larger the memory requirement of the solution. Thus, there is once again a direct trade-off to be made of arithmetic complexity against memory requirement, according to how long one makes the signal segment.

Fig. 9.7 Scheme for auto-correlation using DHT (stored reference {x[n]} post-zero-padded and transformed once to give {X(H)[k]}; successive overlapped signal segments {y[n]} transformed, combined in Hartley space, returned to the data space, and the invalid outputs discarded; Combine: (1/2) X(H)[k] (Y(H)[k] + Y(H)[N − k]) + (1/2) X(H)[N − k] (Y(H)[N − k] − Y(H)[k]))
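A minimal driver for this overlap-save correlator, again using the assumed dht() and hartley_correlate() routines of the earlier sketches, might take the following form; the transform of the zero-padded stored reference is computed once and reused for every segment:

#include <stdlib.h>
#include <string.h>

extern void dht(double *data, size_t N);
extern void hartley_correlate(const double *X, const double *Y,
                              double *Z, size_t N);

/* Overlap-save correlation against a stored reference (sketch). X holds
   the DHT of the zero-padded reference; each overlapped length-N signal
   segment contributes N2 = N - N1 + 1 valid outputs (Equation 9.11),
   taken from the start of the segment's correlator output. The caller is
   assumed to have pre-zero-padded the start of the signal by N1 - 1
   samples, as described above. */
void overlap_save_correlate(const double *X, const double *signal,
                            size_t total, size_t N, size_t N1, double *out)
{
    size_t N2 = N - N1 + 1;
    double *seg = malloc(N * sizeof *seg);
    double *Z = malloc(N * sizeof *Z);
    size_t produced = 0;
    for (size_t pos = 0; pos + N <= total; pos += N2) {
        memcpy(seg, signal + pos, N * sizeof *seg); /* segments overlap by N1-1 */
        dht(seg, N);
        hartley_correlate(X, seg, Z, N);
        dht(Z, N);
        memcpy(out + produced, Z, N2 * sizeof *Z);  /* keep the valid outputs */
        produced += N2;
    }
    free(seg);
    free(Z);
}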
9.4.4 Cross-Correlation: Infinite-Length Against Infinite-Length Data Sets

The final type of problem considered relates to that of cross-correlation whereby a continuous or infinite-duration signal is being correlated against another signal of
similar type. This type of problem, as with that for the auto-correlation problem of the previous section, is best tackled by viewing it as a segmented correlation, albeit one requiring a rather more complex solution. With the cross-correlation of two infinite-duration signals, each region of signal that carries information will be of finite duration, so that if 50% overlapped signal segments are generated from the data acquired at each sensor, where the segment length corresponds to twice the anticipated duration of each signal region of interest added to twice the maximum possible propagation delay arising from the separation of the sensors, then for some given acquisition period the current signal region of interest is guaranteed to appear in the corresponding segment of both sensors. Thus, the cross-correlation breaks down into the successive computation of a number of overlapped cross-correlations of finite-duration signals, one of which corresponds to the cross-correlation of the current signal region of interest. If the length of the segment is short enough to facilitate its direct computation – that is, there is adequate memory to hold the sensor data and cross-correlator outputs – then the overlapped cross-correlation of each two signal segments may be carried out by means of the technique described in Section 9.4.2. If this is not the case, however, then it is likely that the number of cross-correlator outputs of actual significance – that is, that correspond to the temporal region containing the dominant peaks – will be considerably smaller than the number of samples in the segment, so that computational advantage could be made of this fact. To see how this may be achieved [6], each segment needs first to be broken down into a number of smaller sub-segments, with the cross-correlation of the original two segments being subsequently obtained from the cross-correlation of the sub-segments in the following way. Suppose that we regard each long signal segment as being comprised of “K” samples, with the number of samples in each sub-segment being denoted by “N”, where

K = M × N,   (9.12)

for some integer “M”. Then, denoting the sub-segment index by “m”, we carry out the following steps:

1. Segment each set of K samples to give:

   xm[n] = x[n + (m − 1)N]   for n = 0, 1, ..., N − 1
   xm[n] = 0                 for n = N, N + 1, ..., 2N − 1

   for m = 0, 1, ..., M − 2, and

   ym[n] = y[n + (m − 1)N]   for n = 0, 1, ..., 2N − 1

   for m = 0, 1, ..., M − 2, and

   ym[n] = y[n + (m − 1)N]   for n = 0, 1, ..., N − 1
   ym[n] = 0                 for n = N, N + 1, ..., 2N − 1

   for m = M − 1.
   (9.13)
2. Carry out the 2N-point DHT of each sub-segment to give:

   X(H)m[k] = DHT({xm[n]})   for m = 0, 1, ..., M − 1, and
   Y(H)m[k] = DHT({ym[n]})   for m = 0, 1, ..., M − 1.
   (9.14)

3. Multiply the two Hartley-space output sets, sample-by-sample, to give:

   Z(H)m[k] = (1/2) X(H)m[k] (Y(H)m[k] + Y(H)m[2N − k])
            + (1/2) X(H)m[2N − k] (Y(H)m[2N − k] − Y(H)m[k]),
   k = 0, 1, ..., 2N − 1, for m = 0, 1, ..., M − 1.
   (9.15)

4. Sum the transform-space products over all M sets to give:

   Z(H)[k] = Σ (m = 0 to M − 1) Z(H)m[k],   k = 0, 1, ..., 2N − 1.
   (9.16)

5. Carry out the 2N-point DHT of the resulting summed product to give:

   {z[n]} = DHT({Z(H)[k]}).
   (9.17)
The above sequence of steps, which illustrates how to carry out the required segmented cross-correlation operation, is also given in diagrammatic form in Fig. 9.8 below. Note that if the sampled data is complex-valued rather than real-valued, then the above sequence of steps may be straightforwardly modified to account for the four real-data combinations required by the complex-to-real parallel decomposition discussed in Section 9.4.1. Also, as for each of the correlation schemes discussed in this section, if the length of the correlation operations is chosen to be a power of four, then the regularized FHT may be beneficially used to enable the function to be carried out in a computationally-efficient manner.

Fig. 9.8 Scheme for cross-correlation using DHT (successive zero-padded sub-segments {xm[n]} and overlapped sub-segments {ym[n]} each transformed to give {X(H)m[k]} and {Y(H)m[k]}, combined as in Equations 9.15 and 9.16, with a single final DHT producing {z[n]})
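Steps 1–5 translate directly into a short accumulation loop. The sketch below again assumes the hypothetical dht() and hartley_correlate() helpers; note that zero-based sub-segment indexing is used here, so the sub-segment starting points are written as m·N:

#include <stdlib.h>
#include <string.h>

extern void dht(double *data, size_t N);
extern void hartley_correlate(const double *X, const double *Y,
                              double *Z, size_t N);

/* Segmented cross-correlation of two long segments of K = M*N samples
   (sketch of steps 1-5): the Hartley-space products of the zero-padded
   x sub-segments and overlapped y sub-segments are accumulated so that
   only one final inverse transform of length 2N is needed. */
void segmented_xcorr(const double *x, const double *y,
                     size_t K, size_t N, double *z /* 2N samples */)
{
    size_t M = K / N, L = 2 * N;
    double *xm = malloc(L * sizeof *xm);
    double *ym = malloc(L * sizeof *ym);
    double *Zm = malloc(L * sizeof *Zm);
    memset(z, 0, L * sizeof *z);
    for (size_t m = 0; m < M; m++) {
        memset(xm, 0, L * sizeof *xm);
        memcpy(xm, x + m * N, N * sizeof *xm);        /* step 1: zero-padded x */
        memset(ym, 0, L * sizeof *ym);
        memcpy(ym, y + m * N,                          /* step 1: overlapped y, */
               (m < M - 1 ? L : N) * sizeof *ym);      /* last one zero-padded  */
        dht(xm, L);                                    /* step 2 */
        dht(ym, L);
        hartley_correlate(xm, ym, Zm, L);              /* step 3: Equation 9.15 */
        for (size_t k = 0; k < L; k++)
            z[k] += Zm[k];                             /* step 4: Equation 9.16 */
    }
    dht(z, L);                                         /* step 5: Equation 9.17 */
    free(xm);
    free(ym);
    free(Zm);
}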
9.4.5 Combining Functions in Hartley Space

Having shown in the previous sections how different functions, such as those of up-sampling and differentiation, may be efficiently carried out, either individually or in combination, via transformation to Hartley space, it is easy to visualize – through straightforward manipulation of the Hartley-space data – how such functions may also be combined with that of correlation to enable the output signal from
the correlator to be produced in up-sampled form, or as a derived function of the standard correlator output signal or, upon combining the two ideas, as an up-sampled version of a derived function of the standard correlator output signal. The adoption of the first derived function of the standard correlator output signal, for example, enables one to replace peak detection by zero detection for the estimation of either TOA or TDOA. The utility of such an idea is particularly evident in the seemingly intractable problem of trying to find the TOA corresponding to the direct path component of a multi-path signal, given that the largest peak of the standard correlator output signal does not necessarily correspond to the location of the direct path component. With the first derived function, for example, it can be shown that the position of the peak of the direct path signal corresponds to the point at which the value of the first derived function first starts to decrease, whilst with the second derived function, it can be shown that the position of the peak of the direct path signal corresponds to the point at which the first negative peak of the second derived function appears. Thus, both first and second derived functions of the standard correlator output signal may be used to attack the problem. Finally, note that with all the correlation-based expressions given in this section that involve the use of dual Hartley-space terms, such as the terms X(H)[k] and X(H)[N − k], it is necessary that care be taken to treat the zero-address and Nyquist-address terms separately, as neither term possesses a dual.
9.5 Channelization of Real-Valued Signal

The function of a digital multi-channel receiver [9, 10] is to simultaneously down-convert a set of frequency division multiplexed (FDM) channels residing in a single sampled data stream. The traditional approach to solving this problem has been to use a bank of DDC units, with each channel being produced individually via a DDC unit which digitally down-converts the signal to base-band, constrains the bandwidth with a digital filter, and then reduces the sampling rate by an amount commensurate with the reduction in bandwidth. The problem with the DDC approach, however, is one of cost in that multiple channels are produced via replication of the DDC unit, so that there is no commonality of processing and therefore no possibility of computational savings being made. This is particularly relevant when the bandwidth of the signal under analysis dictates that a large number of channels be produced, as the DDC unit required for each channel typically requires the use of two FIR low-pass filters and one stored version of the period of a complex sinusoid sampled at the input rate. Two cases are now considered, the first corresponding to the efficient production of a single channel (or small number of channels) by means of a DDC process where the filtering is carried out via fast Hartley-space convolution, the second corresponding to the production of multiple channels via the application of the polyphase DFT filter bank.
9.5.1 Single Channel: Fast Hartley-Space Convolution

For the simple example of a single channel, after the real-valued signal has been frequency-shifted to base-band, the remaining task of the DDC process is to filter the resulting two channels of data so as to constrain the bandwidth of the signal and thus enable the sampling rate to be reduced by an amount commensurate with the reduction in bandwidth. Each filtering operation may be viewed as a convolution-type problem, where the impulse response function of the digital filter is being convolved with a continuous or infinite-duration signal. As already stated, this convolution-based problem may be solved with either a data-space or a transform-space approach, the optimum choice being very much dependent upon the achievable down-sampling rate out of the two FIR filters – one filter for the in-phase channel and another for the quadrature channel. Clearly, if the down-sampling rate is sufficiently large and/or the length of the impulse response of each filter sufficiently short, then the computational efficiency of the data-space approach may well be difficult to improve upon. For the case of the transform-space approach, however, this type of problem is best tackled by viewing it as a segmented convolution, a task most simply solved by means of the familiar overlap-save or overlap-add technique, as discussed already in relation to the analogous problem of segmented correlation. The approach involves decomposing the infinite-duration received signal into segments and computing the
convolution of the impulse response function of the filter and the received signal as a number of smaller circular convolutions. With the overlap-save technique, for example, suitable zero-padding of the impulse response function combined with the selection of an appropriate segment length enables the required convolution outputs to be obtained from the segmented circular convolution outputs without the need for further arithmetic. A solution based upon the adoption of the overlap-save technique is as outlined in Fig. 9.9, where the impulse response function, {x[n]}, comprises “N1” samples or coefficients, the DHT is of length “N”, where

N ≥ 2N1,
(9.18)
and the number of valid samples produced from each length-N signal segment out of the convolver is given by “N2”, where

N2 = N − N1 + 1,
(9.19)
these samples appearing at the end of each new output segment, with the first N1 − 1 samples of each such segment being invalid and thus discarded. To achieve this, consecutive length-N segments of the in-phase and quadrature components of the signal are overlapped by N1 − 1 samples, with the first such segment being pre-zero-padded by N1 − 1 samples to account for the lack of a predecessor. The transform-space product associated with each of the small circular convolutions takes the form of

Z[k] = (1/2) X(H)[k] (Y(H)[k] + Y(H)[N − k]) + (1/2) X(H)[N − k] (Y(H)[N − k] − Y(H)[k]),
(9.20)
with the in-phase and quadrature components of the final filtered output denoted by {zI[n]} and {zQ[n]}, respectively. The optimum choice of segment length is dependent very much upon the length of the impulse response function of the filter, with a sensible lower limit being given by twice the length of the impulse response function – as given by Equation 9.18. Clearly, as with the case of segmented correlation, the shorter the segment length, the smaller the memory requirement but the lower the computational efficiency of the solution, whereas the larger the segment length, the higher the computational efficiency but the larger the memory requirement of the solution. Thus, there is once again a direct trade-off to be made of arithmetic complexity against memory requirement, according to how long one makes the signal segment.

Fig. 9.9 Scheme for filtering complex-valued signal using DHT (zero-padded impulse response {x[n]} transformed once to give {X(H)[k]}; overlapped segments of the in-phase and quadrature channels {yI[n]}, {yQ[n]} each transformed, combined in Hartley space as in Equation 9.20, returned to the data space as {zI[n]}, {zQ[n]}, and the invalid outputs discarded)
9.5.2 Multiple Channels: Conventional Polyphase DFT Filter Bank

A common situation, of particular interest, is where multiple channels – possibly even thousands of channels – are to be produced which are of equal spacing and of equal bandwidth, as a polyphase decomposition may be beneficially used to enable the bank of DDC processes to be simply transformed into an alternative filter-bank structure, namely the polyphase DFT filter bank, as described in Fig. 9.10 for the most general case of a complex-valued signal, whereby large numbers of channels may be simultaneously produced at computationally acceptable levels. For a brief mathematical justification of this decomposition, it should first be noted that a set of N filters, {Hk(z)}, is said to be a uniform DFT filter bank [1, 4, 12] if the filters are expressible as

Hk(z) = H0(z·WN^k),   (9.21)

where

H0(z) = 1 + z^−1 + ... + z^−(N−1),   (9.22)

with z^−1 corresponding to the unit delay and WN to the primitive Nth complex root of unity, as given by Equation 1.2 in Chapter 1. Two additional ideas of particular importance are those conveyed by the Equivalency Theorem [1, 4, 12] and the Noble Identity [1, 4, 12], where the invoking of the Equivalency Theorem enables the operations of down-conversion followed by low-pass filtering to be replaced by those of band-pass filtering followed by down-conversion, whilst that of the Noble Identity enables the ordering of the operations of filtering followed by down-sampling to be straightforwardly reversed. With these two key ideas in mind, assume that the prototype filter, denoted P(z), is expressible in polyphase form as

P(z) = Σ (n = 0 to N − 1) z^−n · Hn(z^N),   (9.23)
for the case of an N-branch system, so that the filter corresponding to the kth branch, Pk(z), may thus be written as

Pk(z) = P(z·WN^k) = Σ (n = 0 to N − 1) (z·WN^k)^−n · Hn(z^N),   (9.24)

with the output of Pk(z), denoted Yk(z), given by

Yk(z) = Σ (n = 0 to N − 1) WN^−nk [z^−n · Hn(z^N) · X(z)],   (9.25)

which corresponds to the polyphase structure shown in Fig. 9.10.

Fig. 9.10 Scheme for polyphase DFT channelization of complex-valued signal (N polyphase branches H0(z), H1(z), ..., HN−1(z) fed through unit delays and down-sampled by N, followed by an N-point complex-data discrete Fourier transform; {x[n]} – band-pass complex-valued input signal, {yk[m]} – low-pass complex-valued output signals)
With this structure, therefore, the required band-pass filters are obtained by adopting a polyphase filter bank, with each filter branch being obtained by delaying and sub-sampling the impulse response of a single prototype FIR low-pass filter, followed by the application of a DFT to the instantaneous output sets produced by the polyphase filter bank. The effect of the polyphase filtering is to isolate and down-sample the individual channels, whilst the DFT is used to convert each channel to base-band. In this way, the same polyphase filter bank is used to generate all the channels, with additional complexity reduction being made possible by computing the DFT with an appropriately chosen FFT algorithm. When the sampled data is complex valued, the feeding of N samples into an N-branch polyphase DFT filter bank will result in the production of N independent channels via the use of a complex-data FFT, whereas when the sampled data is real valued, the feeding of N samples into the N-branch polyphase DFT filter bank will result in the production of just N/2 independent channels via the use of a real-data FFT. For the efficient computation of the polyphase DFT filter bank, as for that of the standard DFT, the traditional approach to the problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion of the real-valued data to complex-valued data via a wideband DDC process, or through the adoption of a real-from-complex strategy whereby two real-valued data sets are built up from the polyphase filter bank outputs to enable two real-data FFTs to be computed simultaneously via one full-length complex-data FFT, or where one real-data FFT is performed on the polyphase filter bank outputs via one half-length complex-data FFT. The most commonly adopted approach is probably to apply the polyphase DFT filter bank after the real-valued data has first been converted to base-band via the wideband DDC process, which means that the data has to undergo two separate stages of filtering – one stage following the frequency shifting and another for the polyphase filter bank – before it is in the required form. The same drawbacks are therefore equally valid for the computation of the real-data polyphase DFT filter bank as they are for that of the real-data DFT, these drawbacks having already been comprehensively discussed in Chapter 2. A typical channelization problem involves a real-valued wide bandwidth RF signal, sampled at an intermediate frequency (IF) with a potentially high sampling rate, and a significant number of channels, so that the associated computational demands of a solution based upon the use of the polyphase DFT filter bank would typically be met through the mapping of the polyphase filter bank and the associated real-data FFT placed at its output onto appropriately chosen parallel computing equipment, as might be provided by a sufficiently powerful FPGA device. As a result, if the number of polyphase filter branches is a power of four, then the real-data DFT placed at the output of the polyphase filter bank may be efficiently carried out by means of the R24 FHT without recourse to the use of a complex-data FFT.
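The structure of Fig. 9.10 can be expressed compactly: commutate the input across N branch filters, then apply an N-point transform to each instantaneous vector of branch outputs. The sketch below is illustrative only – the function name and parameters are assumptions, the per-sample commutation of the input into the state array is omitted, a direct DFT (of illustrative sign convention) stands in for the FFT (or, for N a power of four, the R24 FHT) that a real implementation would use, and h[] holds the length-N·T prototype filter split into N branches of T taps:

#include <math.h>

#ifndef M_PI
#define M_PI 3.14159265358979323846
#endif

/* One output vector of an N-branch polyphase DFT filter bank (sketch).
   Branch n applies the taps h[n], h[n+N], ..., h[n+(T-1)N] of the
   prototype filter to the history of every Nth input sample held in
   state[]. */
void polyphase_channelize(const double *h, const double *state,
                          int N, int T, double *yre, double *yim)
{
    for (int k = 0; k < N; k++) {            /* channel index */
        yre[k] = 0.0;
        yim[k] = 0.0;
        for (int n = 0; n < N; n++) {        /* branch index */
            double v = 0.0;                  /* branch FIR output */
            for (int t = 0; t < T; t++)
                v += h[n + t * N] * state[n + t * N];
            double ang = 2.0 * M_PI * (double)k * (double)n / (double)N;
            yre[k] += v * cos(ang);          /* DFT across branch outputs */
            yim[k] += v * sin(ang);
        }
    }
}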
9.5.2.1 Alias-Free Formulation

An important problem associated with the polyphase DFT filter bank is that of adjacent channel interference, which arises through the nature of the sampling process – namely, the fact that with the conventional formulation of the polyphase DFT filter bank all the channels are critically sampled at the Nyquist rate – as this results in overlapping of the channel frequency responses and hence aliasing of a signal in the transition region of one channel into the transition region of one of its neighbours. To overcome this problem, the presence of aliased signals arising from the poor filtering performance near the channel boundaries may be reduced or eliminated by over-sampling the individual channels to above the Nyquist rate. This over-sampling may be most simply achieved, with a rational factor, by overlapping the segments of sampled data into the polyphase filter bank, using simple memory shifts/exchanges, and then removing the resulting frequency-dependent phase shifts at the output of the polyphase filter bank by applying circular time shifts to the filtered data, this being achieved by re-ordering the data with simple memory shifts/exchanges [4]. The effect of over-sampling is to create redundant spectral regions between the desired channel boundaries and thus to prevent the overlapping of adjacent channel frequency responses. For a channel bandwidth of “W”, suppose that an over-sampling ratio of 4/3 is used – equating to an overlap of 25% of the sampled data segments – and that the pass-band and stop-band edges are symmetrically placed at (3/4)·(W/2) and (5/4)·(W/2), respectively, relative to the channel boundary. This results in the creation of a spectral band (in the centre of the redundant region), of width “B”, where
B = 2·[(4/3)·(W/2) − (5/4)·(W/2)] = W/12,
(9.26)
(9.26)
which separates adjacent channel stop-band edges and thus prevents possible aliasing problems, so that the redundant regions may be easily identified and removed upon spectrum analysis of the individual channels of interest. Clearly, by suitably adjusting the position of the stop-band edge – that is, by setting it to exactly R .W=2/ where “R” is the over-sampling ratio – it is possible to completely eliminate this spectral safety region such that the locations of the stop-band edges of adjacent channels actually coincide. If the signal is real valued and the number of channels to be produced is equal to N/2 – and hence the length of the sampled data segments as well as the number of branches used by the polyphase filter is equal to N – then an over-sampling ratio of “R” will require an overlap, “O”, of O D N 1 1=R
(9.27)
samples for the data segments. As with the computation of any DSP-based function, there is a direct trade-off to be made between complexity and performance in that
the larger the over-sampling ratio the larger the arithmetic complexity but the easier the task of the polyphase filtering process. This results in a reduction in the number of taps required by each of the small filters of the polyphase filter bank, which in turn leads to a reduced latency and a reduced-duration transient response. A realistic value for the over-sampling ratio, as commonly adopted in many channelization problems, is given by two, whereby the requirement is thus for a 50% overlap of the sampled data segments.
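For concreteness, Equation 9.27 with R = 2 gives an overlap of N/2 samples (a hop of N/2 new samples per transform), whilst R = 4/3 gives an overlap of N/4; a trivial helper, written here as a hedged sketch rather than as part of any existing module:

#include <stddef.h>

/* Overlap (in samples) of consecutive length-N data segments fed to the
   polyphase filter bank for an over-sampling ratio R - Equation 9.27.
   For example, R = 2 gives O = N/2, whilst R = 4/3 gives O = N/4. */
size_t segment_overlap(size_t N, double R)
{
    return (size_t)((double)N * (1.0 - 1.0 / R) + 0.5);
}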
9.5.2.2 Implementational Considerations

With a simplified formulation (albeit not a particularly useful one) of the polyphase DFT filter bank, which takes no account of the aliasing problem, if “N” real-valued samples are fed into an N-branch polyphase filter bank, then the solution to the associated real-data DFT problem will equate to the execution of one N-point real-data FFT every N clock cycles, which is the very problem that has already been addressed in this monograph through the development of the R24 FHT. For the more interesting and relevant situation, however, where an over-sampling ratio of two is adopted to address the aliasing problem, the solution to the associated real-data DFT problem will equate to the execution of one N-point real-data FFT every N/2 clock cycles, so that it will be necessary, for when N > 256 (from the time-complexity figures of Equations 6.11 and 6.12 in Chapter 6), to double the throughput of the standard R24 FHT and hence of the real-data FFT. This may be achieved either by using a dual-PE version of the R24 FHT, as discussed in Section 6.6 of Chapter 6, or by computing two R24 FHTs simultaneously, or in parallel, on consecutive overlapped sets of polyphase filter outputs – along the lines of the dual-R24 FHT scheme described in Section 8.3 of the previous chapter. When the over-sampling ratio is reduced to 4/3, however, the real-data DFT problem simplifies to the execution of one N-point real-data FFT every 3N/4 clock cycles, so that for those situations where N ≤ 4K, a single R24 FHT may well suffice, as evidenced from the time-complexity figures given by Equations 6.11 and 6.12 in Chapter 6.
9.6 Discussion

This chapter has focused on the application of the DHT to a number of computationally-intensive DSP-based functions which may benefit from the adoption of transform-space processing. The particular application area of geolocation was discussed in some detail as it is a potential vehicle for all of the functions considered. With most geolocation systems there is typically a requirement to produce up-sampled correlator outputs from which the TOA or TDOA timing measurements may subsequently be derived. The TOA measurement forms the basis of
those geolocation systems based upon the exploitation of multiple range estimates whilst the TDOA measurement forms the basis of those geolocation systems based upon the exploitation of multiple relative range estimates. The up-sampling, differentiation and correlation functions, as was shown, may all be efficiently performed, in various combinations, when the processing is carried out via Hartley space, with the linearity of the complex-data correlation operation also leading to its decomposition into four parallel real-data correlation operations. This parallel decomposition is particularly useful when the quantities of data to be correlated are large and the throughput requirement high as it enables the correlation to be efficiently computed by running in parallel multiple versions of the R24 FHT. With regard to the channelization problem, it was suggested that the computational complexity involved in the production of a single channel (or small number of channels) by means of a DDC process may, depending upon the parameters, be considerably reduced compared to that of the direct data-space approach, by carrying out the filtering via fast Hartley-space convolution. For the case of multiple channels, it was seen that the channelization of a real-valued signal by means of the polyphase DFT filter bank may also be considerably simplified through the adoption of an FHT for carrying out the associated real-data DFT. In fact, with most RF channelization problems, where the number of channels is large enough to make the question of implementational complexity a serious issue, the sampled IF data is naturally real valued, so that advantage may be made of this fact in trying to reduce the complexity to manageable levels. This can be done in two ways: firstly, by replacing each pair of short FIR filters – applied to the in-phase and quadrature channels – required by the standard solution for each polyphase branch, with a single short FIR filter, as the data remains real valued right the way through the polyphase filtering process; and secondly, by replacing the complex-data DFT at the output of the standard polyphase filter bank by a real-data DFT which for a suitably chosen number of channels may be efficiently computed by means of the R24 FHT. Note that to be able to carry out the real-data DFT component of the polyphase DFT filter bank with a dual-PE solution to the R24 FHT, rather than a single-PE solution, as suggested in Section 9.5.2.2, it would be necessary to use either quad-port memory or a doubled read/write access rate to the dual-port memory, for both the DM and the CM, so as to ensure conflict-free and (for the data) in-place parallel memory addressing of both the data and the trigonometric coefficients for each PE – as discussed in Section 6.6 of Chapter 6 – with all eight GD-BFLY inputs/outputs for each PE being read/written simultaneously from/to memory.
References

1. A.N. Akansu, R.A. Haddad, Multiresolution Signal Decomposition: Transforms – Subbands – Wavelets (Academic Press, San Diego, CA, 2001).
2. E.O. Brigham, The Fast Fourier Transform and Its Applications (Prentice Hall, Englewood Cliffs, NJ, 1988).
3. D. Fraser, Interpolation by the FFT revisited – an experimental investigation. IEEE Trans. ASSP 37(5), 665–675 (1989).
4. F.J. Harris, Multirate Signal Processing for Communication Systems (Prentice Hall, Upper Saddle River, NJ, 2004).
5. K.J. Jones, Digital Signal Conditioning for Sensor Arrays, G.B. Patent Application No: 0112415 (5 May 2001).
6. R. Nielson, Sonar Signal Processing (Artech House, Boston, MA, 1991).
7. A.V. Oppenheim, R.W. Schafer, Discrete-Time Signal Processing (Prentice Hall, Upper Saddle River, NJ, 1989).
8. R.A. Poisel, Electronic Warfare: Target Location Methods (Artech House, Boston, MA, 2005).
9. J.G. Proakis, Digital Communications (McGraw Hill, New York, 2001).
10. B. Sklar, Digital Communications: Fundamentals and Applications (Prentice Hall, Englewood Cliffs, NJ, 2002).
11. C.C. Tseng, S.L. Lee, Design of FIR Digital Differentiator Using Discrete Hartley Transform and Backward Difference (European Signal Processing Conference (EUSIPCO), Lausanne, 2008).
12. P.P. Vaidyanathan, Multirate Systems and Filter Banks (Prentice Hall, Englewood Cliffs, NJ, 1993).
Chapter 10
Summary and Conclusions
Abstract This chapter first outlines the background to the problem addressed by the preceding chapters, namely the computation using silicon-based hardware of the real-data DFT, including the specific objectives that were to be met by the research, this being followed with a further discussion of the results obtained from the research and finally of the conclusions to be drawn.
10.1 Outline of Problem Addressed

The problem addressed in this monograph has been concerned with the parallel computation of the real-data DFT, targeted at implementation with silicon-based parallel computing equipment, where the application area of interest is that of wireless communications, and in particular that of mobile communications, so that resource-constrained (both silicon and power) solutions based upon the highly regular fixed-radix FFT design have been actively sought. With the computing power now available via the silicon-based parallel computing technologies, however, it is no longer adequate to view the FFT complexity purely in terms of arithmetic operation counts, as has conventionally been done, as there is now the facility to use both multiple arithmetic units – adders, fast multipliers and CORDIC phase rotators – and multiple banks of fast RAM in order to enhance the FFT performance via its parallel computation. As a result, a whole new set of constraints has arisen relating to the design of efficient FFT algorithms for silicon-based implementation. With the environment encountered in mobile communications, where a small battery may be the only source of power supply for long periods of time, algorithms are now being designed subject to new and often conflicting performance criteria, where the ideal is either to maximize the throughput (that is, to minimize the update time) or satisfy some constraint on the latency, whilst at the same time minimizing the required silicon resources (and thereby minimizing the cost of implementation) as well as keeping the power consumption to within the available budget. The traditional approach to the DFT problem has been to use a complex-data solution, regardless of the nature of the data, this often entailing the initial conversion
of real-valued data to complex-valued data via a wideband DDC process or through the adoption of a real-from-complex strategy whereby two real-data FFTs are computed simultaneously via one full-length complex-data FFT or where one real-data FFT is computed via one half-length complex-data FFT. Each such solution, however, involves a computational overhead when compared to the more direct approach of a real-data FFT in terms of increased memory, increased processing delay to allow for the possible acquisition/processing of pairs of data sets, and additional packing/unpacking complexity. With the DDC approach, where two functions are used instead of just one, the information content of short-duration signals may also be compromised through the introduction of the additional and unnecessary filtering operation. Thus, the traditional approach to the problem of the real-data DFT has effectively been to modify the problem so as to match an existing complex-data solution – the aim of the research carried out in this monograph has been to seek a solution that matches the actual problem. The DHT, which is an orthogonal real-data transform and close relative to the DFT that possesses many of the same properties, was identified as an attractive algorithm for attacking the real-data DFT problem, as the outputs from a real-data DFT may be straightforwardly obtained from the outputs of the DHT, and vice versa, whilst fast algorithms for its solution – referred to generically as the FHT – are now commonly encountered in the technical literature. A drawback of conventional FHTs, however, lies in the lack of regularity arising from the need for two sizes – and thus two separate designs – of butterfly for fixed-radix formulations, where a single-sized radix-R butterfly produces R outputs from R inputs and a double-sized radix-R butterfly produces 2R outputs from 2R inputs.
10.2 Summary of Results

To address the above situation, a generic version of the double-sized butterfly, referred to as the generic double butterfly and abbreviated to “GD-BFLY”, was developed for the radix-4 version of the FHT which overcame the problem in an elegant fashion. The resulting single-design solution, referred to as the regularized FHT and abbreviated to “R24 FHT”, lends itself naturally to parallelization and to mapping onto a regular computational structure for implementation with one of the silicon-based parallel computing technologies. A partitioned-memory architecture was identified and developed for the parallel computation of the GD-BFLY and of the resulting R24 FHT, whereby both the data and the trigonometric coefficients were partitioned or distributed across multiple banks of memory. The approach exploited a single locally-pipelined high-performance PE that yielded an attractive solution which was both area-efficient and scalable in terms of transform length. High performance was achieved by having the PE able to process the input/output data sets to the GD-BFLY in parallel, this in turn implying the need to be able to access simultaneously, and without conflict,
both multiple data and multiple trigonometric coefficients, from their respective memories. The arithmetic and permutation operations performed on the GD-BFLY data within the PE were mapped onto a computational pipeline where, for the implementation considered, it was required that the total number of CSs in the pipeline was an odd-valued integer so as to avoid any possible conflict problems with the reading/writing of input/output data from/to the DM banks for each new clock cycle. A number of pipelined versions of the PE were thus described using both fast fixed-point multipliers and CORDIC phase rotators which enabled the arithmetic complexity to be traded off against memory requirement. The result was a set of designs based upon the partitioned-memory single-PE computing architecture which each yield a hardware-efficient solution with universal application, such that each new application necessitates minimal re-design cost. The resulting solutions were shown to be amenable to efficient implementation with the silicon-based technologies and capable of achieving the computational density – that is, the throughput per unit area of silicon – of the most advanced commercially-available complex-data solutions for just a fraction of the silicon resources. The area-efficiency makes each design particularly attractive for those applications where the real-data transform is sufficiently long as to make the associated memory requirement a serious issue for more conventional multi-PE solutions, whilst the block-based nature of their operation means that they are also able, via the block floating-point scaling strategy, to produce higher accuracy transform-domain outputs when using fixed-point arithmetic than is achievable by their streaming FFT counterparts. Finally, it was seen how the applicability of the R24 FHT – which is a radix-4 algorithm – could be generalized, without significantly compromising performance, to the efficient parallel computation of the real-data DFT whose length is a power of two, but not a power of four. This enables it to be applied, potentially, to a great many more problems, including those that might not necessarily be best solved through the direct application of a 4^n-point transform. This was followed by its application to the computation of some of the more familiar and computationally-intensive DSP-based functions, such as those of correlation – both auto-correlation and cross-correlation – and of the wideband channelization of RF data via the polyphase DFT filter bank. With each such function, which might typically be encountered in that increasingly important area of wireless communications relating to the geolocation of signal emitters, the adoption of the R24 FHT may potentially result in both conceptually and computationally simplified solutions. Note that the mathematical/logical correctness of the operation of the various functions used by the partitioned-memory single-PE solution to the R24 FHT has been proven in software with a computer program written in the “C” programming language. This code provides the user with various choices of PE design and of storage/accession scheme for the trigonometric coefficients, helping the user to identify how the algorithm might be efficiently mapped onto suitable parallel computing equipment following translation of the sequential “C” code to the parallel code produced by a suitably chosen HDL.
10.3 Conclusions

The aims of this research, as described above, have been successfully achieved, with a highly-parallel formulation of the real-data FFT being defined without recourse to the use of a complex-data FFT and, in so doing, a solution being obtained that yields clear implementational advantages, both theoretical and practical, over the more conventional complex-data solutions to the problem. The highly-parallel formulation of the real-data FFT described in the monograph has been shown to lead to scalable and device-independent solutions to the latency-constrained version of the problem which are able to optimize the use of the available silicon resources, and thus to maximize the achievable computational density, thereby making the solution a genuine advance in the design and implementation of high-performance parallel FFT algorithms.
Appendix A
Computer Program for Regularized Fast Hartley Transform
Abstract This appendix outlines the various functions of which the regularized FHT is comprised and provides a detailed description of the computer code, written in the “C” programming language, for executing the said functions, where integer-only arithmetic is used to model the fixed-point nature of the associated arithmetic operations. The computer source code for the complete solution, which is listed in Appendix B, is to be found on the CD accompanying the monograph.
A.1 Introduction

The processing functions required for a fixed-point implementation of the R24 FHT break down into two quite distinct categories, namely those pre-processing functions that need to be carried out in advance of the real-time processing for performing the following tasks:

– Setting up of LUTs for trigonometric coefficients
– Setting up of permutation mappings for GD-BFLY

and those processing functions that need to be carried out as part of the real-time solution:

– Di-bit reversal
– DM reads and writes
– CM reads and trigonometric coefficient generation
– GD-BFLY computation
– FHT-to-FFT conversion
The individual modules – written in the “C” programming language with the Microsoft Visual C++ compiler under their Visual Studio computing environment – that have been developed to implement these particular pre-processing and processing functions are now outlined, where integer-only arithmetic has been used to model the fixed-point nature of the associated arithmetic operations. This is followed by a brief guide on how to run the program and of the scaling strategies
available to the user. Please note, however, that the program has not been exhaustively tested so it is quite conceivable that various bugs may still be present in the current version of the code. The notification of any such bugs, if identified, would be greatly welcomed by the author. The computer code for the complete solution, which is listed in Appendix B, is to be found on the CD accompanying the monograph.
A.2 Description of Functions

Before the R24 FHT can be executed it is first necessary that a main module or program be produced:

“RFHT4 Computer Program.c”

which carries out all the pre-processing functions, as required for providing the necessary inputs to the R24 FHT, as well as setting up the input data to the R24 FHT through the calling of a separate module:

“SignalGeneration.c”

such that the data – real valued or complex valued – may be either accessed from an existing binary or text file or generated by the signal generation module.
A.2.1 Control Routine

Once all the pre-processing functions have been carried out and the input data made ready for feeding to the R24 FHT, a control module:

“RFHT4 Control.c”

called from within the main program then carries out in the required order all the processing functions that make up the real-time solution, this starting with the di-bit reversal of the input data, followed by the execution of the R24 FHT, and finishing with the conversion of the output data, should it be required, from Hartley space to Fourier space.
A.2.2 Generic Double Butterfly Routines

Three versions of the GD-BFLY have been produced, as discussed in Chapters 6 and 7, with the first version, involving 12 fast fixed-point multipliers, being carried out by means of the module:
“Butterfly V12M.c”,

the second version, involving nine fast fixed-point multipliers, by means of the module:

“Butterfly V09M.c”

and the third version, involving three CORDIC rotation units, by means of the module:

“Butterfly Cordic.c”.

The last version makes use of a separate module:

“Rotation.c”

for carrying out the individual phase rotations.
A.2.3 Address Generation and Data Re-ordering Routines

The generation of the four permutation mappings used by the GD-BFLY, as discussed in Chapter 4, is carried out by means of the module:

“ButterflyMappings.c”,

whilst the di-bit reversal of the input data to the R24 FHT is carried out with the module:

“DibitReversal.c”

and the addresses of the eight-sample data sets required for input to the GD-BFLY are obtained by means of the module:

“DataIndices.c”.

Note that for optimal efficiency the four permutation mappings used by the GD-BFLY only store information relating to the non-trivial exchanges.
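The di-bit reversal itself is a simple re-ordering in which the base-4 digits (bit-pairs) of each sample address are written out in reverse order; a minimal stand-alone version of the mapping (illustrative, and not the code of the module itself) is:

#include <stddef.h>

/* Di-bit reversal of address a for a transform of length N = 4^n: the
   base-4 digits (bit-pairs) of a are emitted in reverse order. */
unsigned dibit_reverse(unsigned a, unsigned N)
{
    unsigned r = 0;
    while (N > 1) {              /* one base-4 digit per iteration */
        r = (r << 2) | (a & 3);
        a >>= 2;
        N >>= 2;
    }
    return r;
}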
A.2.4 Data Memory Accession and Updating Routines

The reading/writing of multiple samples of data from/to DM, which requires the application of the memory address mappings discussed in Chapter 6, is carried out by means of the module:
“MemoryBankAddresses.c”

which, given the address of a single di-bit reversed sample, produces both the memory bank address together with the address offset within that particular memory bank. For optimal efficiency, this piece of code should be tailored to the particular transform length under consideration, although a mapping that is implemented for one particular transform length will also be valid for every transform length shorter than it, albeit somewhat wasteful in terms of unnecessary arithmetic/logic.
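By way of illustration only – the mapping actually adopted by the module is specific to the partitioned-memory design of Chapter 6, and the simple modulo form shown here does not by itself reproduce its conflict-free property – a bank/offset decomposition for eight banks of N/8 words might be expressed as:

/* Illustrative only: decomposition of a di-bit reversed sample address
   into a bank address and an offset within that bank, for eight banks
   of N/8 words. This is the simplest possible decomposition and is not
   the conflict-free mapping used by the accompanying program. */
typedef struct { unsigned bank; unsigned offset; } bank_addr;

bank_addr map_address(unsigned addr)
{
    bank_addr m;
    m.bank   = addr & 7;   /* addr mod 8 */
    m.offset = addr >> 3;  /* addr div 8 */
    return m;
}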
A.2.5 Trigonometric Coefficient Generation Routines

The trigonometric coefficient sets accessed from CM which are required for the execution of the GD-BFLY are dependent upon the particular version of the GD-BFLY used, namely whether it involves 12 or nine fast fixed-point multipliers, as well as the type of addressing scheme used, namely whether the storage of the trigonometric coefficients is based upon the adoption of one-level or two-level LUTs. For the combination of a twelve-multiplier version of the GD-BFLY and the adoption of three one-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients V12M 1Level.c”,

whilst for the combination of a nine-multiplier version of the GD-BFLY and the adoption of three one-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients V09M 1Level.c”,

for the combination of a twelve-multiplier version of the GD-BFLY and the adoption of three two-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients V12M 2Level.c”,

and for the combination of a nine-multiplier version of the GD-BFLY and the adoption of three two-level LUTs, the trigonometric coefficients are generated via the module:

“Coefficients V09M 2Level.c”.

All four versions produce sets of nine trigonometric coefficients which are required for the execution of the GD-BFLY.
A.2.6 Look-Up-Table Generation Routines

The generation of the LUTs required for the storage of the trigonometric coefficients is carried out by means of the module:

“Look Up Table 1Level.c”,

for the case of the one-level LUTs, or the module:

“Look Up Table 2Level.c”

for the case of the two-level LUTs.
A.2.7 FHT-to-FFT Conversion Routines

Upon completion of the R24 FHT, the outputs may be converted from Hartley space to Fourier space, if required, this being carried out by means of the module:

“Conversion.c”.

The routine is able to operate with FHT outputs obtained from the processing of either real-data inputs or complex-data inputs.
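The real-data conversion rests on the standard DHT-to-DFT relations for the usual e^(−j2πnk/N) DFT kernel, namely Re X[k] = (H[k] + H[N − k])/2 and Im X[k] = (H[N − k] − H[k])/2. A minimal floating-point sketch of the real-data case (the module itself operates in fixed point and also handles the complex-data case) is:

#include <stddef.h>

/* Conversion of real-data DHT outputs to DFT outputs (sketch): only the
   zero-frequency term, the Nyquist term and the positive frequencies are
   produced, the zero-address and Nyquist-address terms being purely real. */
void dht_to_dft(const double *H, double *Xre, double *Xim, size_t N)
{
    Xre[0] = H[0];
    Xim[0] = 0.0;
    Xre[N/2] = H[N/2];
    Xim[N/2] = 0.0;
    for (size_t k = 1; k < N/2; k++) {
        Xre[k] = 0.5 * (H[k] + H[N - k]);
        Xim[k] = 0.5 * (H[N - k] - H[k]);
    }
}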
A.3 Brief Guide to Running the Program

The parameters that define the operation of the R24 FHT are listed as constants at the top of the main program, “RFHT4 Computer Program.c”, these constants enabling the various versions of the GD-BFLY to be selected, as required, as well as the transform length, word lengths (for both data and trigonometric coefficients), input/output data formats, scaling strategy, etc., to be set up by the user at run time. The complete list of parameters is reproduced here, as shown in Fig. A.1, this including a typical set of parameter values and an accompanying description of each parameter. The input data set used for testing the various double butterfly and memory addressing combinations may be either read from a binary or text file (real data or complex data), with the appropriate file name being as specified in the signal generation routine, “SignalGeneration.c”, or mathematically generated to model a signal in the form of a single tone (real data or complex data versions) where the address of the excited FHT/FFT bin is as specified on the last line of Fig. A.1. For a real-valued input data set the program is able to produce transform outputs in either Hartley space or Fourier space, whilst when the input data set is complex valued the program will automatically produce the outputs in Fourier space.
// SYSTEM PARAMETERS:

#define FHT_length    1024  // transform length: must be a power of 4
#define data_type     1     // 1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag  1     // 1 => FHT outputs, 2 => FFT outputs
#define BFLY_type     3     // Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type      1     // Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling       2     // 1 => FIXED, 2 => BFP

// REGISTER-LENGTH PARAMETERS:

#define no_of_bits_data    18  // no of bits representing input data
#define no_of_bits_coeffs  24  // no of bits representing trigonometric coefficients

// CORDIC BUTTERFLY PARAMETERS:

#define no_of_iterations  18  // no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle  27  // no of bits representing Cordic rotation angle
#define LSB_guard_bits    5   // no of guard bits for LSB: ~ log2(no_of_iterations)

// FILE PARAMETERS:

#define input_file_format   2  // 1 => HEX, 2 => DEC
#define output_file_format  2  // 1 => HEX, 2 => DEC

// FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:

#define scale_factor_0  2  // bits to shift for stage = 0
#define scale_factor_1  2  // bits to shift for stage = 1
#define scale_factor_2  2  // bits to shift for stage = 2
#define scale_factor_3  2  // bits to shift for stage = 3
#define scale_factor_4  2  // bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5  2  // bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6  1  // bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7  0  // bits to shift for stage = 7 - last stage for 64K FHT

// SYNTHETIC DATA PARAMETERS:

#define data_input       1    // 0 => read data from file, 1 => generate data
#define dft_bin_excited  117  // tone excited: between 0 and FHT_length/2-1

Fig. A.1 Typical parameter set for regularized FHT program
Note that when writing the outputs of an N-point FHT to file, the program stores one sample to a line; when writing the outputs of an N-point real-data FFT to file, it stores the zero-frequency term on the first line followed by the positive frequency terms on the next N/2 – 1 lines, with the real and imaginary components of each term appearing on the same line; and finally, when writing the outputs of an N-point complex-data FFT to file, it stores the zero-frequency term on the first line followed by the positive and then negative frequency terms on the next N – 1 lines, with the real and imaginary components of each term appearing on the same line – although the Nyquist-frequency term, like the zero-frequency term, possesses only a real component. Bear in mind that for the case of the real-data FFT, the magnitude of a zero-frequency tone (or Nyquist-frequency tone, if computed), if measured in
the frequency domain, will be twice that of a comparable positive frequency tone (i.e. having the same signal amplitude) which shares its energy equally with its negative-frequency counterpart.
A.4 Available Scaling Strategies

With regard to the fixed-point scaling strategies, note that when the scaling of the intermediate results is carried out via the conditional block floating-point technique, it is applied at the input to each stage of GD-BFLYs. As a result, any possible magnification incurred during the last stage of GD-BFLYs is not scaled out of the results, so that up to three bits of growth will still need to be accounted for in the R24 FHT outputs according to the particular post-FHT processing requirements. Examples of block floating-point scaling for both the twelve-multiplier and the nine-multiplier versions of the GD-BFLY are given in Figs. A.2 and A.3, respectively, each geared to the use of an 18-bit fast multiplier – the scaling for the CORDIC version is essentially the same as that for the twelve-multiplier version. The program provides the user with specific information relating to the chosen parameter set, printing to the screen the amount of scaling required, if any, for each stage of GD-BFLYs required by the transform.
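As a hedged illustration of the conditional block floating-point idea, the growth of a block of samples can be measured from its largest magnitude and scaled out at the input to the next stage of GD-BFLYs, with the block exponent accumulated for final re-interpretation of the outputs. This is a simplified model, not the program's own code:

#include <stdlib.h>

/* Simplified block floating-point step: measure the bit growth of a
   block relative to the nominal word length, shift it out before the
   next stage, and accumulate the block exponent. Growth per stage of
   GD-BFLYs is at most 3 bits. */
int bfp_rescale(int *data, size_t n, int nominal_bits, int *block_exp)
{
    long maxmag = 0;
    for (size_t i = 0; i < n; i++) {
        long v = labs((long)data[i]);
        if (v > maxmag) maxmag = v;
    }
    int growth = 0;
    while ((maxmag >> (nominal_bits - 1 + growth)) != 0 && growth < 3)
        growth++;
    for (size_t i = 0; i < n; i++)
        data[i] >>= growth;            /* scale out the measured growth */
    *block_exp += growth;              /* record the accumulated exponent */
    return growth;
}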
Fig. A.2 Block floating-point scaling for use with twelve-multiplier and CORDIC versions of generic double butterfly (input data: 18 bits plus zero growth; output data: 18 bits plus growth, where growth ∈ {0, 1, 2, 3}; register details: PE internal = 21 (min) and 24 (max) bits, PE external = 21 bits)
Fig. A.3 Block floating-point scaling for use with nine-multiplier version of generic double butterfly (input data: 17 bits plus zero growth; output data: 17 bits plus growth, where growth ∈ {0, 1, 2, 3}; theory => 23 bits maximum; register details: PE internal = 20 (min) and 23 (max) bits, PE external = 20 bits)
For the case of the unconditional fixed scaling technique – the individual scale factors to be applied for each stage of GD-BFLYs are as specified by the set of constants given in Fig. A.1 – a small segment of code has been included within the generic double butterfly routines which prints to the screen an error message whenever the register for holding the input data to either the fast multiplier or the CORDIC arithmetic unit overflows. For the accurate simulation of a given hardware device this segment of code needs to be replaced by a routine that mimics the “actual” behaviour of the device in response to such an overflow – such a response being dependent upon the particular device used. When the nine-multiplier version of the GD-BFLY is adopted the presence of the stage of adders prior to that of the fast fixed-point multipliers is far more likely to result in an overflow unless additional scaling is applied immediately after this stage of adders has been completed, as is performed by the computer program, or alternatively, unless the data word-length into the GD-BFLY is constrained to be one bit shorter than that for the twelve-multiplier version. Clearly, in order to prevent fixed-point overflow, the settings for the individual scale factors will need to take into account both the transform length and the particular version of the GD-BFLY chosen, with experience invariably dictating when an optimum selection of scale factors has been made. Bear in mind, however, that with the CORDIC-based version of the GD-BFLY there is an associated magnification of the data magnitudes by approximately 1.647 with each temporal stage of GD-BFLYs which needs to be accounted for by the scale factors. Finally, note that when the CORDIC-based GD-BFLY is selected, regardless of the scaling strategy adopted, the program will also print to the screen
Finally, note that when the CORDIC-based GD-BFLY is selected, regardless of the scaling strategy adopted, the program will also print to the screen exactly how many non-trivial shifts/additions are required for carrying out the two fixed-coefficient multiplications for the chosen parameter set. For the case of an 18-stage CORDIC arithmetic unit, for example, a total of nine such non-trivial shifts/additions are required.
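The figure reported is simply the number of non-zero bits, beyond the leading one, in the quantized magnification factor, as computed in the main program of Appendix B. The calculation may be reproduced in isolation as follows – note that the precise count depends upon the chosen angle word-length, which is an illustrative assumption here:

    #include <math.h>
    #include <stdio.h>

    int main (void)
    {
    //  Cordic magnification factor for an 18-stage unit: the product of the
    //  terms sqrt(1 + 2^(-2n)) over the iterations, approximately 1.647.
        double growth = 1.0;
        int n, count = 0, q;
        for (n = 0; n < 18; n++) growth *= sqrt(1.0 + pow(2.0, -2*n));
    //  Quantize using a 27-bit angle word, as in the listings, then count the
    //  non-zero bits: each bit beyond the first costs one non-trivial
    //  shift/addition in the fixed-coefficient multiplier.
        q = (int) (growth * ((1 << 26) - 1));
        for (n = 0; n < 27; n++) if ((q >> n) & 1) count++;
        printf ("growth = %f, shifts/additions = %d\n", growth, count - 1);
        return 0;
    }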
Appendix B
Source Code Listings for Regularized Fast Hartley Transform
Abstract This appendix lists the source code, written in the "C" programming language, for the various functions of which the regularized FHT solution is composed. The actual computer source code is to be found on the CD accompanying the monograph.
B.1 Listings for Main Program and Signal Generation Routine

#include "stdafx.h"
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

//  DEFINE PARAMETERS
//  -----------------
//  SYSTEM PARAMETERS:
#define FHT_length         1024 // transform length: must be a power of 4
#define data_type          1    // 1 => real-valued data, 2 => complex-valued data
#define FHT_FFT_flag       1    // 1 => FHT outputs, 2 => FFT outputs
#define BFLY_type          3    // Bfly type: 1 => 12 mplys, 2 => 9 mplys, 3 => 3 Cordics
#define MEM_type           1    // Memory type: 1 => one-level LUT, 2 => two-level LUT
#define scaling            2    // 1 => FIXED, 2 => BFP
//  REGISTER-LENGTH PARAMETERS:
#define no_of_bits_data    18   // no of bits representing input data
#define no_of_bits_coeffs  24   // no of bits representing trigonometric coefficients
//  CORDIC BUTTERFLY PARAMETERS:
#define no_of_iterations   18   // no of Cordic iterations = output accuracy (bits)
#define no_of_bits_angle   27   // no of bits representing Cordic rotation angle
#define LSB_guard_bits     5    // no of guard bits for LSB: log2(no_of_iterations)
//  FILE PARAMETERS:
#define input_file_format  2    // 1 => HEX, 2 => DEC
#define output_file_format 2    // 1 => HEX, 2 => DEC
//  FIXED SCALING PARAMETERS - ONE FACTOR PER FHT STAGE:
#define scale_factor_0     2    // bits to shift for stage = 0
#define scale_factor_1     2    // bits to shift for stage = 1
#define scale_factor_2     2    // bits to shift for stage = 2
#define scale_factor_3     2    // bits to shift for stage = 3
#define scale_factor_4     2    // bits to shift for stage = 4 - last stage for 1K FHT
#define scale_factor_5     2    // bits to shift for stage = 5 - last stage for 4K FHT
#define scale_factor_6     2    // bits to shift for stage = 6 - last stage for 16K FHT
#define scale_factor_7     2    // bits to shift for stage = 7 - last stage for 64K FHT
//  SYNTHETIC DATA PARAMETERS:
#define data_input         1    // 0 => read data from file, 1 => generate data
#define dft_bin_excited    256  // tone excited: between 0 and FHT_length/2-1
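//  Example (illustrative, not part of the shipped listing): a 4096-point
//  real-data FFT employing the nine-multiplier butterfly, two-level LUTs
//  and unconditional fixed scaling would instead set FHT_length to 4096,
//  FHT_FFT_flag to 2, BFLY_type to 2, MEM_type to 2 and scaling to 1,
//  with scale_factor_5 then covering the last stage of the transform.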
void main ()
{
//  REGULARIZED FAST HARTLEY TRANSFORM ALGORITHM
//  --------------------------------------------
//  Author: Dr. Keith John Jones, June 14th 2009
//
//  FIXED-POINT FHT IMPLEMENTATION FOR FPGA - DATA & COEFFICIENTS QUANTIZED
//  UTILIZES ONE DOUBLE-SIZED BUTTERFLY - TYPE =
//      12 fast multipliers & 22 adders, or
//       9 fast multipliers & 25 adders, or
//       3 Cordic arithmetic units & 2 fixed multipliers & 16 adders
//  UTILIZES EIGHT DATA MEMORY BANKS - SIZE = N/8 words per bank
//  UTILIZES THREE COEFFICIENT MEMORY BANKS - SIZE = N/4 words or
//      sqrt(N)/2 words or zero words per bank
//
//  Description:
//  ------------
//  This program carries out the FHT using a generic radix-4 double-sized
//  butterfly. The solution performs 8 simultaneous reads/writes using 8
//  memory banks, each of length N/8 words. Three LUTs, each of length N/4
//  words or sqrt(N)/2 words, may also be used for holding the trigonometric
//  coefficients, enabling all six coefficients to be accessed simultaneously
//  - these LUTs are not required, however, when the arithmetic is performed
//  with the Cordic unit. Three types of double-sized butterfly are available
//  for use by the FHT: one involves the use of 12 fast fixed-point
//  multipliers and 22 adders, another involves the use of 9 fast fixed-point
//  multipliers and 25 adders, whilst a third involves the use of 3 Cordic
//  arithmetic units, 2 fixed multipliers and 16 adders. Two coefficient
//  memory addressing schemes are also available for use by the FHT: one
//  involves the use of 3 LUTs, each of length N/4 words, whilst another
//  involves the use of 3 LUTs, each of length sqrt(N)/2 words. The following
//  combinations of arithmetic and memory are thus possible:
//  1) for a 12-multiplier double-sized butterfly & N/4 word LUTs, the
//     coefficient generation involves no arithmetic operations;
//  2) for a 12-multiplier double-sized butterfly & sqrt(N)/2 word LUTs, the
//     coefficient generation involves 7 multiplications and 8 additions;
//  3) for a 9-multiplier double-sized butterfly & N/4 word LUTs, the
//     coefficient generation involves just additions;
//  4) for a 9-multiplier double-sized butterfly & sqrt(N)/2 word LUTs, the
//     coefficient generation involves 7 multiplications and 14 additions;
//  whilst 5) for a Cordic double-sized butterfly, the coefficients are
//     efficiently generated on-the-fly.
//  Scaling may be carried out within the regularized FHT to prevent overflow
//  in the data registers - this may be carried out either with fixed scaling
//  coefficients after each temporal stage, or by means of a block
//  floating-point scheme in order to optimize the dynamic range out of the
//  FHT. The program may produce either FHT or FFT output, where the input
//  data may be either real valued or complex valued. For the case of
//  complex-valued data, the FHT is simply applied to the real and imaginary
//  components of the data separately before being appropriately combined via
//  the FHT-to-FFT conversion routine. The inputs/outputs may be read/written
//  from/to file with either decimal or hexadecimal formats.
//
//  Files Used:
//  -----------
//  For input/output data memory:
//      input_data_read.txt     - input file from which data is read.
//      output_data_fht_fft.txt - FHT/FFT output data file.
//  For one-level trigonometric coefficient memory:
//      LUT_A1.txt              - LUT for single-angle argument.
//      LUT_A2.txt              - LUT for double-angle argument.
//      LUT_A3.txt              - LUT for triple-angle argument.
//  For two-level trigonometric coefficient memory:
//      LUT_Sin_Coarse.txt      - coarse resolution sin LUT for single-angle argument.
//      LUT_Sin_Fine.txt        - fine resolution sin LUT for single-angle argument.
//      LUT_Cos_Fine.txt        - fine resolution cos LUT for single-angle argument.
//
//  Functions Used:
//  ---------------
//      FHT_Computer_Program     - main program.
//      SignalGeneration         - signal generation routine.
//      RFHT4_Control            - regularized FHT control routine.
//      LookUpTable_1Level       - one-level LUT generation routine.
//      LookUpTable_2Level       - two-level LUT generation routine.
//      ButterflyMappings        - address permutation generation routine.
//      DibitReversal            - sequential di-bit reversal routine & 1-D to 2-D conversion.
//      Butterfly_V12M           - double butterfly calculation routine: 12-multiply version.
//      Butterfly_V09M           - double butterfly calculation routine: 9-multiply version.
//      Butterfly_Cordic         - double butterfly calculation routine: Cordic version.
//      Coefficients_V12M_1Level - one-level coefficient generation: 12-multiply version.
//      Coefficients_V09M_1Level - one-level coefficient generation: 9-multiply version.
//      Coefficients_V12M_2Level - two-level coefficient generation: 12-multiply version.
//      Coefficients_V09M_2Level - two-level coefficient generation: 9-multiply version.
//      DataIndices              - data address generation routine.
//      Conversion               - DHT-to-DFT conversion routine.
//      MemoryBankAddress        - memory bank address/offset calculation routine.
//      Rotation                 - Cordic phase rotation routine.
//
//  Externs:
//  --------
    void RFHT4_Control (int**, int*, int*, int*, int*, int*, int*, int*,
         int*, int*, int*, int*, int, int, int, int, int, int, int, int,
         int*, int*, int, int*, int*, int*, int, int, int, int, int*, int,
         int, int, int, int, int, int);
    void SignalGeneration (int*, int*, int, int, int, int, int, int);
    void LookUpTable_1Level (int, int, int*, int*, int*, int);
    void LookUpTable_2Level (int, int, int*, int*, int*, int);
    void ButterflyMappings (int*, int*, int*, int*);
    void DibitReversal (int, int, int*, int, int*, int**);
    void Conversion (int, int, int, int*, int*);
    void MemoryBankAddress (int, int, int, int, int*, int*);
//
//  Declarations:
//  -------------
//  Integers:
    int wordsize, m, M, n, n1, n2, N, N2, N4, N8, no_of_bits, data_levels, coef_levels;
    int zero = 0, count, RootN, RootNd2, max_magnitude, real_type = 1, imag_type = 2;
    int fft_length, offset, halfpi, growth, growth_copy, angle_levels, minusquarterpi;
    int Root_FHT_length, alpha, lower, upper;
//  Integer Arrays:
    int index1[4], index2[16], index3[16], index4[8];
    int scale_factors[8], power_of_two_A[15], power_of_two_B[8];
    int beta1[8], beta2[8], beta3[8], growth_binary[32], arctans[32];
//  Floats:
    double pi, halfpi_float, quarterpi_float, twopi, angle, growth_float;
//  Pointer Variables:
    int *XRdata, *XIdata;
    int *bank1, *offset1, *bank2, *offset2, *scale_total;
    int *Look_Up_Sin_A1, *Look_Up_Sin_A2, *Look_Up_Sin_A3;
    int *Look_Up_Sin_Coarse, *Look_Up_Cos_Fine, *Look_Up_Sin_Fine;
    int **XRdata_2D = new int*[8];
//  Files:
    FILE *myinfile, *output;
// ***********************************************************************
// ##   R E G U L A R I S E D   F H T   I N I T I A L I S A T I O N.
//  Set up transform parameters.
    Root_FHT_length = (int) (sqrt(FHT_length+0.5));
    for (n = 3; n < 9; n++) {
      if (FHT_length == (int) (pow(4,n))) alpha = n;
    }
//  Set up standard angles.
    pi = atan(1.0)*4.0; halfpi_float = atan(1.0)*2.0;
    twopi = atan(1.0)*8.0; quarterpi_float = atan(1.0);
    wordsize = sizeof (int);
    memset (&index1[0], 0, wordsize << 2);
    memset (&index2[0], 0, wordsize << 4);
    memset (&index3[0], 0, wordsize << 4);
    memset (&index4[0], 0, wordsize << 3);
//  Set up scale factors for butterfly stages.
    scale_factors[0] = scale_factor_0; scale_factors[1] = scale_factor_1;
    scale_factors[2] = scale_factor_2; scale_factors[3] = scale_factor_3;
    scale_factors[4] = scale_factor_4; scale_factors[5] = scale_factor_5;
    scale_factors[6] = scale_factor_6; scale_factors[7] = scale_factor_7;
//  Set up dynamic memory.
    for (n = 0; n < 8; n++) {
      XRdata_2D[n] = new int [FHT_length/8];
    }
    XRdata = new int [FHT_length]; XIdata = new int [FHT_length];
    if (MEM_type == 1) {
      Look_Up_Sin_A1 = new int [FHT_length/4];
      Look_Up_Sin_A2 = new int [FHT_length/4];
      Look_Up_Sin_A3 = new int [FHT_length/4];
      Look_Up_Sin_Coarse = new int [1];
      Look_Up_Cos_Fine = new int [1];
      Look_Up_Sin_Fine = new int [1];
    }
    else {
      Look_Up_Sin_A1 = new int [1];
      Look_Up_Sin_A2 = new int [1];
      Look_Up_Sin_A3 = new int [1];
      Look_Up_Sin_Coarse = new int [Root_FHT_length/2+1];
      Look_Up_Cos_Fine = new int [Root_FHT_length/2];
      Look_Up_Sin_Fine = new int [Root_FHT_length/2];
    }
    bank1 = new int [1]; bank1[0] = 0; bank2 = new int [1]; bank2[0] = 0;
    offset1 = new int [1]; offset1[0] = 0; offset2 = new int [1]; offset2[0] = 0;
    scale_total = new int [1]; myinfile = stdin;
//  Set up write-only file for holding FHT/FFT output data.
    if ((output = fopen("output_data_fht_fft.txt", "w")) == NULL)
      printf ("\n Error opening output data file");
//  Set up transform length.
    N = FHT_length; N2 = (N >> 1); N4 = (N2 >> 1); N8 = (N4 >> 1);
    RootN = Root_FHT_length; RootNd2 = RootN / 2;
    if (data_type == 1) { fft_length = N2; } else { fft_length = N; }
//  Set up number of quantisation levels for data.
    data_levels = (int) (pow(2,(no_of_bits_data-1))-1);
//  Set up number of quantisation levels for coefficients.
    coef_levels = (int) (pow(2,(no_of_bits_coeffs-1))-1);
//  Set up number of quantisation levels for Cordic rotation angles.
    angle_levels = (int) (pow(2,(no_of_bits_angle-1))-1);
//  Set up maximum allowable data magnitude into double butterfly.
    max_magnitude = (int) (pow(2,(no_of_bits_data-1)));
//  Set up register overflow bounds for use with unconditional fixed scaling strategy.
    lower = -(data_levels+1); upper = data_levels;
//  Set up power-of-two array.
    no_of_bits = alpha << 1;
    for (n = 0; n <= no_of_bits; n++) power_of_two_A[n] = (int) pow(2,n);
//  Set up modified power-of-two array.
    for (n = 0; n <= alpha; n++) power_of_two_B[n] = (int) pow(2,(2*n+1));
//  Set up Cordic initial rotation angles for each temporal stage.
    offset = 1;
    for (n = 0; n < alpha; n++) {
      M = offset << 3; offset = power_of_two_B[n];
      if (n == 0) {
        beta1[0] = 0; beta2[0] = 0; beta3[0] = 0;
      }
      else {
        angle = -(twopi/M);     beta1[n] = (int) ((angle/pi)*angle_levels);
        angle = -2.0*(twopi/M); beta2[n] = (int) ((angle/pi)*angle_levels);
        angle = -3.0*(twopi/M); beta3[n] = (int) ((angle/pi)*angle_levels);
      }
    }
//  Set up Cordic magnification factor.
    growth_float = 1.0;
    for (n = 0; n < no_of_iterations; n++) {
      growth_float *= sqrt(1+pow(2.0,-2*n));
    }
    growth = (int) (growth_float*angle_levels);
//  Calculate binary representation of magnification factor.
    n = 0; count = 0; growth_copy = growth;
    while ((growth_copy >= 0) && (n < no_of_bits_angle)) {
      growth_binary[n] = growth_copy % 2;
      growth_copy = (growth_copy-growth_binary[n]) / 2;
      if (growth_binary[n] == 1) count ++;
      n++;
    }
    if (BFLY_type == 3) {
      printf ("\n No of additions required by fixed multiplier = %d", count-1);
    }
//  Set up Cordic micro-rotation angles.
    for (n = 0; n < no_of_iterations; n++) {
      angle = atan(pow(2.0,-n));
      arctans[n] = (int) ((angle/pi)*angle_levels);
    }
//  Calculate integer form of trigonometric terms.
    halfpi = (int) ((halfpi_float/pi)*angle_levels);
    minusquarterpi = (int) ((-quarterpi_float/pi)*angle_levels);
//  Print program information to screen.
    printf ("\n\n Regularized Fast Hartley Transform\n");
    printf (" - - - - - - - - - - - - - - -\n\n");
    if (BFLY_type == 1) {
      printf ("Butterfly Type = Twelve-Multiplier\n\n");
    }
    else {
      if (BFLY_type == 2) {
        printf ("Butterfly Type = Nine-Multiplier\n\n");
      }
      else {
        printf ("Butterfly Type = Cordic\n\n");
        printf ("LUT Type = Not Relevant\n\n");
      }
    }
    if (BFLY_type < 3) {
      if (MEM_type == 1) { printf ("LUT Type = One-Level\n\n"); }
      else { printf ("LUT Type = Two-Level\n\n"); }
    }
    if (data_input == 0) { printf ("Data Type = Real\n\n"); }
    else { printf ("Data Type = Synthetic\n\n"); }
    printf ("Transform Length = %d\n\n", FHT_length);
    if (scaling == 1) { printf ("Scaling Strategy = Fixed"); }
    else { printf ("Scaling Strategy = Block Floating-Point"); }
    if (BFLY_type == 3) {
      printf ("\n\n No of shifts/additions required by fixed multiplier = %d", count-1);
    }
// *********************************************************************
// ##   S I G N A L   G E N E R A T I O N.
    SignalGeneration (XRdata, XIdata, N, data_type, dft_bin_excited,
                      data_input, data_levels, input_file_format);
// *********************************************************************
// ##   R E G U L A R I S E D   F H T   P R E - P R O C E S S I N G.
//  Set up look-up table of multiplicative constants.
    if (MEM_type == 1) {
//    Standard memory solution.
      LookUpTable_1Level (N, N4, Look_Up_Sin_A1, Look_Up_Sin_A2,
                          Look_Up_Sin_A3, coef_levels);
    }
    else {
      if (MEM_type == 2) {
//      Reduced memory solution.
        LookUpTable_2Level (N, RootNd2, Look_Up_Sin_Coarse,
                            Look_Up_Cos_Fine, Look_Up_Sin_Fine, coef_levels);
      }
    }
//  Set up address permutations.
    ButterflyMappings (index1, index2, index3, index4);
// *********************************************************************
// ##   R E G U L A R I S E D   F H T   P R O C E S S I N G.
//  Process "R E A L" component of data - may be real-valued or complex-valued data.
    scale_total[0] = 0;
//  Di-bit reverse addresses of data & store in 2-D form.
    DibitReversal (N8, no_of_bits, power_of_two_A, alpha, XRdata, XRdata_2D);
//  Regularized FHT routine.
    RFHT4_Control (XRdata_2D, index1, index2, index3, index4,
        Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3, Look_Up_Sin_Coarse,
        Look_Up_Cos_Fine, Look_Up_Sin_Fine, power_of_two_B, alpha, N, N2, N4,
        RootNd2, coef_levels, no_of_bits_coeffs, scaling, scale_factors,
        scale_total, max_magnitude, beta1, beta2, beta3, angle_levels,
        halfpi, minusquarterpi, growth, arctans, no_of_iterations,
        no_of_bits_angle, LSB_guard_bits, lower, upper, BFLY_type, MEM_type);
//  Store output data in 1-D form.
    n1 = 0; n2 = 1;
    for (m = 0; m < N8; m++) {
      for (n = 0; n < 4; n++) {
        MemoryBankAddress (n1, 0, 0, alpha, bank1, offset1);
        MemoryBankAddress (n2, 1, 0, alpha, bank2, offset2);
        XRdata[n1] = XRdata_2D[*bank1][m]; n1 += 2;
        XRdata[n2] = XRdata_2D[*bank2][m]; n2 += 2;
      }
    }
    if (data_type == 2) {
//    Process "I M A G I N A R Y" component of complex-valued data.
      scale_total[0] = 0;
//    Di-bit reverse addresses of data & store in 2-D form.
      DibitReversal (N8, no_of_bits, power_of_two_A, alpha, XIdata, XRdata_2D);
//    Regularized FHT routine.
      RFHT4_Control (XRdata_2D, index1, index2, index3, index4,
          Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3, Look_Up_Sin_Coarse,
          Look_Up_Cos_Fine, Look_Up_Sin_Fine, power_of_two_B, alpha, N, N2,
          N4, RootNd2, coef_levels, no_of_bits_coeffs, scaling, scale_factors,
          scale_total, max_magnitude, beta1, beta2, beta3, angle_levels,
          halfpi, minusquarterpi, growth, arctans, no_of_iterations,
          no_of_bits_angle, LSB_guard_bits, lower, upper, BFLY_type, MEM_type);
//    Store output data in 1-D form.
      n1 = 0; n2 = 1;
      for (m = 0; m < N8; m++) {
        for (n = 0; n < 4; n++) {
          MemoryBankAddress (n1, 0, 0, alpha, bank1, offset1);
          MemoryBankAddress (n2, 1, 0, alpha, bank2, offset2);
          XIdata[n1] = XRdata_2D[*bank1][m]; n1 += 2;
          XIdata[n2] = XRdata_2D[*bank2][m]; n2 += 2;
        }
      }
    }
    if ((FHT_FFT_flag > 1) || (data_type == 2)) {
// ##   F H T - T O - F F T   C O N V E R S I O N.
      Conversion (real_type, N, N2, XRdata, XIdata);
      if (data_type == 2) {
        Conversion (imag_type, N, N2, XRdata, XIdata);
      }
    }
// *********************************************************************
// ##   W R I T I N G   O F   F H T / F F T   O U T P U T   D A T A   T O   F I L E.
    if (output_file_format == 1) {
//    "H E X" file format.
      if (FHT_FFT_flag == 1) {
//      FHT outputs - real-valued input & real-valued output.
        for (n = 0; n < N; n++) fprintf (output,"%x\n", XRdata[n]);
      }
      else {
        if (data_type == 1) {
//        FFT outputs - real-valued input & complex-valued output.
          fprintf (output,"%x %x\n", XRdata[0], zero);
          for (n = 1; n < N2; n++)
            fprintf (output,"%x %x\n", XRdata[n], XRdata[N-n]);
        }
        else {
//        FFT outputs - complex-valued input & complex-valued output.
          for (n = 0; n < N; n++)
            fprintf (output,"%x %x\n", XRdata[n], XIdata[n]);
        }
      }
    }
    else {
//    "D E C" file format.
      if (FHT_FFT_flag == 1) {
//      FHT outputs - real-valued input & real-valued output.
        for (n = 0; n < N; n++) fprintf (output,"%10d\n", XRdata[n]);
      }
      else {
        if (data_type == 1) {
//        FFT outputs - real-valued input & complex-valued output.
          fprintf (output,"%10d %10d\n", XRdata[0], zero);
          for (n = 1; n < N2; n++)
            fprintf (output,"%10d %10d\n", XRdata[n], XRdata[N-n]);
        }
        else {
//        FFT outputs - complex-valued input & complex-valued output.
          for (n = 0; n < N; n++)
            fprintf (output,"%10d %10d\n", XRdata[n], XIdata[n]);
        }
      }
    }
// *********************************************************************
// ##   F I L E   C L O S U R E   &   M E M O R Y   D E L E T I O N.
    fclose (output);
//  Delete dynamic memory.
    for (n = 0; n < 8; n++) delete [] XRdata_2D[n];
    delete [] XRdata_2D;
    delete [] XRdata; delete [] XIdata;
    delete [] bank1; delete [] bank2;
    delete [] offset1; delete [] offset2; delete [] scale_total;
    delete [] Look_Up_Sin_A1; delete [] Look_Up_Sin_A2; delete [] Look_Up_Sin_A3;
    delete [] Look_Up_Sin_Coarse; delete [] Look_Up_Cos_Fine; delete [] Look_Up_Sin_Fine;
//  End of program.
    printf ("\n\n Processing Completed\n\n");
}

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void SignalGeneration (int *XRdata, int *XIdata, int N, int data_type,
     int dft_bin_excited, int data_input, int data_levels, int input_file_format)
{
//  Description:
//  ------------
//  Routine to generate the signal data required for input to the Regularized FHT.
//  Parameters:
//  -----------
//      XRdata            = real component of 1-D data.
//      XIdata            = imaginary component of 1-D data.
//      N                 = transform length.
//      data_type         = data type: 1 => real valued data, 2 => complex valued data.
//      dft_bin_excited   = integer representing DFT bin excited.
//      data_input        = data type: 0 => read data from file, 1 => generate data.
//      data_levels       = no of quantized data levels.
//      input_file_format = input file format: 1 => HEX, 2 => DEC.
//  Note:
//  -----
//  Complex data is stored in data file in the form of alternating real and
//  imaginary components.
//  Declarations:
//  -------------
//  Integers:
    int n;
//  Floats:
    double twopi, argument;
// ***********************************************************************
// ##   T E S T   D A T A   G E N E R A T I O N.
    if (data_input == 0) {
//    Read in FHT input data from file.
      FILE *input;
      if ((input = fopen("input_data_fht.txt", "r")) == NULL)
        printf ("\n Error opening input data file to read from");
      if (input_file_format == 1) {
//      "H E X" file format.
        if (data_type == 1) {
          for (n = 0; n < N; n++) fscanf (input, "%x", &XRdata[n]);
        }
        else {
          for (n = 0; n < N; n++) fscanf (input, "%x %x", &XRdata[n], &XIdata[n]);
        }
      }
      else {
//      "D E C" file format.
        if (data_type == 1) {
          for (n = 0; n < N; n++) fscanf (input, "%d", &XRdata[n]);
        }
        else {
          for (n = 0; n < N; n++) fscanf (input, "%d %d", &XRdata[n], &XIdata[n]);
        }
      }
//    Close file.
      fclose (input);
    }
    else {
//    Generate single-tone signal for FHT input data.
      twopi = 8*atan(1.0);
      for (n = 0; n < N; n++) {
        argument = (twopi*n*dft_bin_excited)/N;
        XRdata[n] = (int) (cos(argument)*data_levels);
        if (data_type == 2) {
          XIdata[n] = (int) (sin(argument)*data_levels);
        }
      }
    }
//  End of function.
}
B.2 Listings for Pre-processing Functions

#include "stdafx.h"
#include <math.h>
#include <stdio.h>

void LookUpTable_1Level(int N, int N4, int *Look_Up_Sin_A1,
     int *Look_Up_Sin_A2, int *Look_Up_Sin_A3, int coef_levels)
{
//  Description:
//  ------------
//  Routine to set up the one-level LUTs containing the trigonometric coefficients.
//  Parameters:
//  -----------
//      N              = transform length.
//      N4             = N / 4.
//      Look_Up_Sin_A1 = look-up table for single-angle argument.
//      Look_Up_Sin_A2 = look-up table for double-angle argument.
//      Look_Up_Sin_A3 = look-up table for triple-angle argument.
//      coef_levels    = number of trigonometric coefficient quantisation levels.
//  Declarations:
//  -------------
//  Integers:
    int i;
//  Floats:
    double angle, twopi, rotation;
// ***********************************************************************
//  Set up output files for holding LUT contents.
    FILE *output1;
    if ((output1 = fopen("LUT_A1.txt", "w")) == NULL)
      printf ("\n Error opening 1st LUT file");
    FILE *output2;
    if ((output2 = fopen("LUT_A2.txt", "w")) == NULL)
      printf ("\n Error opening 2nd LUT file");
    FILE *output3;
    if ((output3 = fopen("LUT_A3.txt", "w")) == NULL)
      printf ("\n Error opening 3rd LUT file");
    twopi = (double) (atan(1.0) * 8.0);
    rotation = (double) (twopi / N);
//  Set up size N/4 LUT for single-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++) {
      Look_Up_Sin_A1[i] = (int) (sin(angle) * coef_levels);
      angle += (double) rotation;
      fprintf (output1,"%x\n", Look_Up_Sin_A1[i]);
    }
//  Set up size N/4 LUT for double-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++) {
      Look_Up_Sin_A2[i] = (int) (sin(angle) * coef_levels);
      angle += (double) rotation;
      fprintf (output2,"%x\n", Look_Up_Sin_A2[i]);
    }
//  Set up size N/4 LUT for triple-angle argument.
    angle = (double) 0.0;
    for (i = 0; i < N4; i++) {
      Look_Up_Sin_A3[i] = (int) (sin(angle) * coef_levels);
      angle += (double) rotation;
      fprintf (output3,"%x\n", Look_Up_Sin_A3[i]);
    }
//  Close files.
    fclose (output1); fclose (output2); fclose (output3);
//  End of function.
}
#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void LookUpTable_2Level(int N, int RootNd2, int *Look_Up_Sin_Coarse,
     int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine, int coef_levels)
{
//  Description:
//  ------------
//  Routine to set up the two-level LUTs containing the trigonometric coefficients.
//  Parameters:
//  -----------
//      N                  = transform length.
//      RootNd2            = sqrt(N) / 2.
//      Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//      Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//      Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//      coef_levels        = number of trigonometric coefficient quantisation levels.
//  Declarations:
//  -------------
//  Integers:
    int i;
//  Floats:
    double angle_coarse, angle_fine, twopi, rotation_coarse, rotation_fine;
// ***********************************************************************
//  Set up output files for holding LUT contents.
    FILE *output1;
    if ((output1 = fopen("LUT_Sin_Coarse.txt", "w")) == NULL)
      printf ("\n Error opening 1st LUT file");
    FILE *output2;
    if ((output2 = fopen("LUT_Cos_Fine.txt", "w")) == NULL)
      printf ("\n Error opening 2nd LUT file");
    FILE *output3;
    if ((output3 = fopen("LUT_Sin_Fine.txt", "w")) == NULL)
      printf ("\n Error opening 3rd LUT file");
    twopi = (double) (atan(1.0) * 8.0);
    rotation_coarse = (double) (twopi / (2*sqrt((float)N)));
    rotation_fine = (double) (twopi / N);
//  Set up size sqrt(N) LUT for single-angle argument.
    angle_coarse = (double) 0.0; angle_fine = (double) 0.0;
    for (i = 0; i < RootNd2; i++) {
      Look_Up_Sin_Coarse[i] = (int) (sin(angle_coarse) * coef_levels);
      Look_Up_Cos_Fine[i] = (int) (cos(angle_fine) * coef_levels);
      Look_Up_Sin_Fine[i] = (int) (sin(angle_fine) * coef_levels);
      fprintf (output1,"%x\n", Look_Up_Sin_Coarse[i]);
      fprintf (output2,"%x\n", Look_Up_Cos_Fine[i]);
      fprintf (output3,"%x\n", Look_Up_Sin_Fine[i]);
      angle_coarse += (double) rotation_coarse;
      angle_fine += (double) rotation_fine;
    }
    Look_Up_Sin_Coarse[RootNd2] = coef_levels;
//  Close files.
    fclose (output1); fclose (output2); fclose (output3);
//  End of function.
}
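The two-level scheme trades memory for a little arithmetic by decomposing each angle into a coarse part, resolved to multiples of 2π/(2√N), and a fine part, resolved to multiples of 2π/N, the required coefficient then being recovered through the addition formula sin(θc + θf) = sin θc cos θf + cos θc sin θf. A minimal stand-alone sketch of the reconstruction, with the transform length and angle index chosen purely for illustration:

    #include <math.h>
    #include <stdio.h>

    int main (void)
    {
    //  Illustrative example: N = 1024, so sqrt(N)/2 = 16, with angle index m
    //  chosen arbitrarily from the range 0 <= m < N/4 used by the FHT.
        int N = 1024, RootNd2 = 16, m = 157;
        int ic = m / RootNd2, jf = m % RootNd2;
        double twopi = 8.0*atan(1.0);
        double theta_c = (twopi/(2.0*sqrt((double)N))) * ic;   // coarse angle
        double theta_f = (twopi/N) * jf;                       // fine angle
    //  Reconstruct sin of the full angle via the addition formula exploited
    //  by the two-level LUT scheme.
        double approx = sin(theta_c)*cos(theta_f) + cos(theta_c)*sin(theta_f);
        double exact  = sin((twopi/N)*m);
        printf ("exact = %.12f, reconstructed = %.12f\n", exact, approx);
        return 0;
    }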
#include "stdafx.h"

void ButterflyMappings(int *index1, int *index2, int *index3, int *index4)
{
//  Description:
//  ------------
//  Routine to set up the address permutations for the generic double butterfly.
//  Parameters:
//  -----------
//      index1 = 1st address permutation.
//      index2 = 2nd address permutation.
//      index3 = 3rd address permutation.
//      index4 = 4th address permutation.
// ***********************************************************************
//  1st address permutation for Type-I and Type-II generic double butterflies.
    index1[0] = 6; index1[1] = 3;
//  1st address permutation for Type-III generic double butterfly.
    index1[2] = 3; index1[3] = 6;
//  2nd address permutation for Type-I and Type-II generic double butterflies.
    index2[0] = 0; index2[1] = 4; index2[2] = 3; index2[3] = 2;
    index2[4] = 1; index2[5] = 5; index2[6] = 6; index2[7] = 7;
//  2nd address permutation for Type-III generic double butterfly.
    index2[8] = 0; index2[9] = 4; index2[10] = 2; index2[11] = 6;
    index2[12] = 1; index2[13] = 5; index2[14] = 3; index2[15] = 7;
//  3rd address permutation for Type-I and Type-II generic double butterflies.
    index3[0] = 0; index3[1] = 4; index3[2] = 1; index3[3] = 5;
    index3[4] = 2; index3[5] = 6; index3[6] = 3; index3[7] = 7;
//  3rd address permutation for Type-III generic double butterfly.
    index3[8] = 0; index3[9] = 4; index3[10] = 1; index3[11] = 3;
    index3[12] = 2; index3[13] = 6; index3[14] = 7; index3[15] = 5;
//  4th address permutation for Type-I, Type-II and Type-III generic double butterflies.
    index4[0] = 0; index4[1] = 4; index4[2] = 1; index4[3] = 5;
    index4[4] = 6; index4[5] = 2; index4[6] = 3; index4[7] = 7;
//  End of function.
}
B.3 Listings for Processing Functions

#include "stdafx.h"
#include <stdio.h>
#include <math.h>

void RFHT4_Control(int **Xdata_2D, int *index1, int *index2, int *index3,
     int *index4, int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3,
     int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine, int *Look_Up_Sin_Fine,
     int *power_of_two, int alpha, int N, int N2, int N4, int RootNd2,
     int coef_levels, int no_of_bits_coeffs, int scaling, int *scale_factors,
     int *scale_total, int max_magnitude, int *beta1, int *beta2, int *beta3,
     int angle_levels, int halfpi, int minusquarterpi, int growth, int *arctans,
     int no_of_iterations, int no_of_bits_angle, int LSB_guard_bits,
     int lower, int upper, int BFLY_type, int MEM_type)
{
//  Description:
//  ------------
//  Routine to carry out the regularized FHT algorithm, with options to use
//  either twelve-multiplier, nine-multiplier or Cordic versions of the
//  generic double butterfly and N/4 word, sqrt(N)/2 word or zero word LUTs
//  for the storage of the trigonometric coefficients.
//  Externs:
//  --------
    void Butterfly_V12M (int, int, int, int*, int*, int*, int*, int*, int*,
         int*, int, int, int, int*, int, int, int);
    void Butterfly_V09M (int, int, int, int*, int*, int*, int*, int*, int*,
         int*, int, int, int, int*, int, int, int, int);
    void Butterfly_Cordic (int*, int*, int*, int*, int*, int*, int*, int,
         int, int, int*, int, int, int, int, int*, int, int, int, int);
    void Coefficients_V12M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
    void Coefficients_V09M_1Level (int, int, int, int, int, int*, int*, int*, int*, int);
    void Coefficients_V12M_2Level (int, int, int, int, int, int, int, int*,
         int*, int*, int*, int, int);
    void Coefficients_V09M_2Level (int, int, int, int, int, int, int, int*,
         int*, int*, int*, int, int);
    void DataIndices (int, int, int, int, int*, int[2][4], int[2][4], int, int);
//  Parameters:
//  -----------
//      Xdata_2D           = 2-D data.
//      index1             = 1st address permutation.
//      index2             = 2nd address permutation.
//      index3             = 3rd address permutation.
//      index4             = 4th address permutation.
//      Look_Up_Sin_A1     = LUT for single-angle argument.
//      Look_Up_Sin_A2     = LUT for double-angle argument.
//      Look_Up_Sin_A3     = LUT for triple-angle argument.
//      Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//      Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//      Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//      power_of_two       = array containing powers of 2.
//      alpha              = no of temporal stages for transform.
//      N                  = transform length.
//      N2                 = N / 2.
//      N4                 = N / 4.
//      RootNd2            = sqrt(N) / 2.
//      coef_levels        = number of trigonometric coefficient quantisation levels.
//      no_of_bits_coeffs  = number of bits representing trigonometric coefficients.
//      scaling            = scaling flag: 1 => FIXED, 2 => BFP.
//      scale_factors      = bits to shift for double butterfly stages.
//      scale_total        = total number of BFP scaling bits.
//      max_magnitude      = maximum magnitude of data into double butterfly.
//      beta1              = initial single-angle Cordic rotation angle.
//      beta2              = initial double-angle Cordic rotation angle.
//      beta3              = initial triple-angle Cordic rotation angle.
//      angle_levels       = number of Cordic rotation angle quantisation levels.
//      halfpi             = integer value of (pi/2).
//      minusquarterpi     = integer value of -(pi/4).
//      growth             = integer value of Cordic magnification factor.
//      arctans            = Cordic micro-rotation angles.
//      no_of_iterations   = no of Cordic iterations.
//      no_of_bits_angle   = no of bits representing Cordic rotation angle.
//      LSB_guard_bits     = no of bits for guarding LSB.
//      lower              = lower bound for register overflow with unconditional scaling.
//      upper              = upper bound for register overflow with unconditional scaling.
//      BFLY_type          = BFLY type: 1 => 12 multipliers, 2 => 9 multipliers,
//                           3 => 3 Cordic units.
//      MEM_type           = MEM type: 1 => LUT = one-level, 2 => LUT = two-level.
//  Declarations:
//  -------------
//  Integers:
    int i, j, k, n, n2, offset, M, beta, bfly_count, Type, negate_flag, shift;
//  Integer Arrays:
    int X[9], kk[4], kbeta[3], Data_Max[1], coeffs[9], threshold[3];
    int index_even_2D[2][4], index_odd_2D[2][4];
// ***********************************************************************
//  Set up offset for address permutations.
    kk[3] = 0;
//  Set up block floating-point thresholds.
    threshold[0] = max_magnitude;
    threshold[1] = max_magnitude << 1;
    threshold[2] = max_magnitude << 2;
//  Loop through log4 temporal stages.
    offset = 1; Data_Max[0] = 0; shift = 0;
    for (i = 0; i < alpha; i++) {
//    Set up look-up table index and address offsets.
      M = (int) (offset << 3); beta = (int) (N / M); bfly_count = 0;
      if ((scaling == 2) && (i > 0)) {
//      Calculate shift to be applied to data so that MSB occupies optimum position.
        shift = 0;
        for (n = 0; n < 3; n++) {
          if ((Data_Max[0] < -threshold[n]) || (Data_Max[0] >= +threshold[n])) shift ++;
        }
//      Increase total number of BFP scaling bits.
        scale_total[0] += shift;
        printf ("\n\n Maximum data magnitude from stage %d", i-1);
        printf (" = %d [threshold=", Data_Max[0]);
        printf ("%d]", max_magnitude);
        printf ("\n Shift to be applied to data for stage %d", i);
        printf (" = %d", shift);
        if (i == (alpha-1)) {
          printf ("\n\n Total shift applied to data = %d", scale_total[0]);
        }
//      Initialise maximum data magnitude for this stage.
        Data_Max[0] = 0;
      }
//    Loop through spatial iterations.
      for (j = 0; j < N; j += M) {
//      Initialise address offsets and double butterfly type.
        kbeta[0] = 0; kbeta[1] = 0; kbeta[2] = 0; Type = 2;
        for (k = 0; k < offset; k++) {
          if (i == 0) {
// ##       S T A G E = 0.
            negate_flag = 0;
//          Set up data indices for double butterfly.
            DataIndices (i, j, k, offset, kk, index_even_2D, index_odd_2D,
                         bfly_count, alpha);
            bfly_count ++;
//          Set up trigonometric coefficients for double butterfly.
            if (BFLY_type == 1) {
//            Butterfly is twelve-multiplier version.
              if (MEM_type == 1) {
//              Standard arithmetic & standard memory solution.
                Coefficients_V12M_1Level (i, k, N2, N4, kbeta[0],
                    Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                    coeffs, coef_levels);
              }
              else {
//              Standard arithmetic & reduced memory solution.
                Coefficients_V12M_2Level (i, k, N2, N4, RootNd2, alpha,
                    kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                    Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
              }
//            Increment address offset.
              kbeta[0] += beta;
            }
            else {
//            Butterfly is nine-multiplier version.
              if (BFLY_type == 2) {
                if (MEM_type == 1) {
//                Reduced arithmetic & standard memory solution.
                  Coefficients_V09M_1Level (i, k, N2, N4, kbeta[0],
                      Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                      coeffs, coef_levels);
                }
                else {
//                Reduced arithmetic & reduced memory solution.
                  Coefficients_V09M_2Level (i, k, N2, N4, RootNd2, alpha,
                      kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                      Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
                }
              }
//            Increment address offset.
              kbeta[0] += beta;
            }
//          R E A D S - Set up input data vector for double butterfly.
            for (n = 0; n < 4; n++) {
              n2 = (n << 1);
              X[n2] = Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]];
              X[n2+1] = Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]];
            }
//          Carry out set of double butterfly equations.
            if (BFLY_type == 1) {
//            Standard arithmetic solution - twelve-multiplier butterfly.
              Butterfly_V12M (i, j, k, X, coeffs, kk, index1, index2, index3,
                  index4, coef_levels, no_of_bits_coeffs, scaling, Data_Max,
                  shift, lower, upper);
            }
            else {
              if (BFLY_type == 2) {
//              Reduced arithmetic solution - nine-multiplier butterfly.
                Butterfly_V09M (i, j, k, X, coeffs, kk, index1, index2,
                    index3, index4, coef_levels, no_of_bits_coeffs, scaling,
                    Data_Max, shift, 1, lower, upper);
              }
              else {
//              Cordic arithmetic solution.
                Butterfly_Cordic (X, kbeta, kk, index1, index2, index3,
                    index4, halfpi, minusquarterpi, growth, arctans,
                    no_of_iterations, no_of_bits_angle, negate_flag, scaling,
                    Data_Max, shift, LSB_guard_bits, lower, upper);
//              Increment address offsets.
                kbeta[0] += beta1[i]; kbeta[1] += beta2[i]; kbeta[2] += beta3[i];
              }
            }
            if (scaling == 1) {
//            F I X E D  S C A L I N G - scale output data according to stage number.
              for (n = 0; n < 8; n++) X[n] = (X[n] >> scale_factors[i]);
            }
//          W R I T E S - Set up output data vector for double butterfly.
            for (n = 0; n < 4; n++) {
              n2 = (n << 1);
              Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]] = X[n2];
              Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]] = X[n2+1];
            }
          }
          else {
// ##       S T A G E > 0.
//          Set up data indices for double butterfly.
            DataIndices (i, j, k, offset, kk, index_even_2D, index_odd_2D,
                         bfly_count, alpha);
            bfly_count ++;
//          Set up trigonometric coefficients for double butterfly.
            if (BFLY_type == 1) {
//            Butterfly is twelve-multiplier version.
              if (MEM_type == 1) {
//              Standard arithmetic & standard memory solution.
                Coefficients_V12M_1Level (i, k, N2, N4, kbeta[0],
                    Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                    coeffs, coef_levels);
              }
              else {
//              Standard arithmetic & reduced memory solution.
                Coefficients_V12M_2Level (i, k, N2, N4, RootNd2, alpha,
                    kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                    Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
              }
//            Increment address offset.
              kbeta[0] += beta;
            }
            else {
//            Butterfly is nine-multiplier version.
              if (BFLY_type == 2) {
                if (MEM_type == 1) {
//                Reduced arithmetic & standard memory solution.
                  Coefficients_V09M_1Level (i, k, N2, N4, kbeta[0],
                      Look_Up_Sin_A1, Look_Up_Sin_A2, Look_Up_Sin_A3,
                      coeffs, coef_levels);
                }
                else {
//                Reduced arithmetic & reduced memory solution.
                  Coefficients_V09M_2Level (i, k, N2, N4, RootNd2, alpha,
                      kbeta[0], Look_Up_Sin_Coarse, Look_Up_Cos_Fine,
                      Look_Up_Sin_Fine, coeffs, coef_levels, no_of_bits_coeffs);
                }
              }
//            Increment address offset.
              kbeta[0] += beta;
            }
//          R E A D S - Set up input data vector for double butterfly.
            for (n = 0; n < 4; n++) {
              n2 = (n << 1);
              X[n2] = Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]];
              X[n2+1] = Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]];
            }
//          Carry out set of double butterfly equations.
            if (BFLY_type == 1) {
//            Standard arithmetic solution - twelve-multiplier butterfly.
              Butterfly_V12M (i, j, k, X, coeffs, kk, index1, index2, index3,
                  index4, coef_levels, no_of_bits_coeffs, scaling, Data_Max,
                  shift, lower, upper);
            }
            else {
              if (BFLY_type == 2) {
//              Reduced arithmetic solution - nine-multiplier butterfly.
                Butterfly_V09M (i, j, k, X, coeffs, kk, index1, index2,
                    index3, index4, coef_levels, no_of_bits_coeffs, scaling,
                    Data_Max, shift, Type, lower, upper);
              }
              else {
//              Cordic arithmetic solution.
                negate_flag = k+1;
                Butterfly_Cordic (X, kbeta, kk, index1, index2, index3,
                    index4, halfpi, minusquarterpi, growth, arctans,
                    no_of_iterations, no_of_bits_angle, negate_flag, scaling,
                    Data_Max, shift, LSB_guard_bits, lower, upper);
//              Increment address offsets.
                kbeta[0] += beta1[i]; kbeta[1] += beta2[i]; kbeta[2] += beta3[i];
              }
            }
            if (scaling == 1) {
//            F I X E D  S C A L I N G - scale output data according to stage number.
              for (n = 0; n < 8; n++) X[n] = (X[n] >> scale_factors[i]);
            }
//          W R I T E S - Set up output data vector for double butterfly.
            for (n = 0; n < 4; n++) {
              n2 = (n << 1);
              Xdata_2D[index_even_2D[0][n]][index_even_2D[1][n]] = X[n2];
              Xdata_2D[index_odd_2D[0][n]][index_odd_2D[1][n]] = X[n2+1];
            }
          }
          Type = 3;
        }
      }
      offset = power_of_two[i];
    }
//  End of function.
}
#include "stdafx.h"
#include <stdlib.h>

void DibitReversal(int N8, int no_of_bits, int *power_of_two, int alpha,
     int *Xdata, int **Xdata_2D)
{
//  Description:
//  ------------
//  Routine to carry out in sequential fashion the in-place di-bit reversal
//  mapping of the input data and to store data in 2-D form.
//  Parameters:
//  -----------
//      N8           = N / 8.
//      no_of_bits   = number of bits corresponding to N.
//      power_of_two = array containing powers of 2.
//      alpha        = no of temporal stages for transform.
//      Xdata        = 1-D data.
//      Xdata_2D     = 2-D data.
//  Externs:
//  --------
    void MemoryBankAddress (int, int, int, int, int*, int*);
//  Declarations:
//  -------------
//  Integers:
    int i1, i2, i3, j1, j2, k, n, store;
//  Pointer Variables:
    int *bank, *offset;
// ***********************************************************************
//  Set up dynamic memory.
    bank = new int [1]; bank[0] = 0;
    offset = new int [1]; offset[0] = 0;
//  Re-order data.
    i3 = 0;
    for (i1 = 0; i1 < N8; i1++) {
      for (i2 = 0; i2 < 8; i2++) {
        j1 = 0; j2 = (i3%2);
        for (k = 0; k < no_of_bits; k += 2) {
          n = no_of_bits - k;
          if (i3 & power_of_two[k]) j1 += power_of_two[n-2];
          if (i3 & power_of_two[k+1]) j1 += power_of_two[n-1];
        }
        if (j1 > i3) {
          store = Xdata[i3]; Xdata[i3] = Xdata[j1]; Xdata[j1] = store;
        }
//      Convert to 2-D form.
        MemoryBankAddress (i3, j2, 0, alpha, bank, offset);
        Xdata_2D[*bank][i1] = Xdata[i3];
        i3 ++;
      }
    }
//  Delete dynamic memory.
    delete [] bank; delete [] offset;
//  End of function.
}
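The effect of the di-bit reversal mapping is easily seen in isolation: the address is read off two bits at a time and the di-bits are written back in reverse order. A small stand-alone sketch of the index mapping alone (the example index is arbitrary, and the memory-bank storage is omitted):

    #include <stdio.h>

    //  Reverse the di-bits (radix-4 digits) of "index", where "bits" is the
    //  even number of address bits - a sketch of the mapping performed, in
    //  sequential fashion and in-place, by DibitReversal.
    static int dibit_reverse (int index, int bits)
    {
        int k, j = 0;
        for (k = 0; k < bits; k += 2) {
            int pair = (index >> k) & 3;     // extract di-bit k/2
            j |= pair << (bits - 2 - k);     // deposit in mirrored position
        }
        return j;
    }

    int main (void)
    {
    //  For N = 64 the addresses use 6 bits: 45 = 10'11'01 maps to 01'11'10 = 30.
        printf ("%d -> %d\n", 45, dibit_reverse (45, 6));
        return 0;
    }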
#include "stdafx.h"

void Conversion(int channel_type, int N, int N2, int *XRdata, int *XIdata)
{
//  Description:
//  ------------
//  Routine to convert DHT coefficients to DFT coefficients. If the FHT is to
//  be used for the computation of the real-data FFT, as opposed to being
//  used for the computation of the complex-data FFT, the complex-valued DFT
//  coefficients are optimally stored in the following way:
//
//      XRdata[0]     = zero'th frequency component
//      XRdata[1]     = real component of 1st frequency component
//      XRdata[N-1]   = imag component of 1st frequency component
//      XRdata[2]     = real component of 2nd frequency component
//      XRdata[N-2]   = imag component of 2nd frequency component
//        ---
//      XRdata[N/2-1] = real component of (N/2-1)th frequency component
//      XRdata[N/2+1] = imag component of (N/2-1)th frequency component
//      XRdata[N/2]   = (N/2)th frequency component
//
//  For the case of the complex-valued FFT, however, the array "XRdata"
//  stores the real component of both the input and output data, whilst the
//  array "XIdata" stores the imaginary component of both the input and
//  output data.
//  Parameters:
//  -----------
//      channel_type = 1 => real input channel, 2 => imaginary input channel.
//      N            = transform length.
//      N2           = N / 2.
//      XRdata       = on input: FHT output for real input channel;
//                     on output: as in "description" above.
//      XIdata       = on input: FHT output for imaginary input channel;
//                     on output: as in "description" above.
//  Declarations:
//  -------------
//  Integers:
    int j, k, store, store1, store2, store3;
// ***********************************************************************
    if (channel_type == 1) {
//    R E A L   D A T A   C H A N N E L.
      k = N - 1;
//    Produce DFT output for this channel.
      for (j = 1; j < N2; j++) {
        store = XRdata[k] + XRdata[j];
        XRdata[k] = XRdata[k] - XRdata[j];
        XRdata[j] = store;
        XRdata[j] /= 2; XRdata[k] /= 2;
        k--;
      }
    }
    else {
//    I M A G I N A R Y   D A T A   C H A N N E L.
      k = N - 1;
//    Produce DFT output for this channel.
      for (j = 1; j < N2; j++) {
        store = XIdata[k] + XIdata[j];
        XIdata[k] = XIdata[k] - XIdata[j];
        XIdata[j] = store;
        XIdata[j] /= 2; XIdata[k] /= 2;
//      Produce DFT output for complex data.
        store1 = XRdata[j] + XIdata[k];
        store2 = XRdata[j] - XIdata[k];
        store3 = XIdata[j] + XRdata[k];
        XIdata[k] = XIdata[j] - XRdata[k];
        XRdata[j] = store2; XRdata[k] = store1; XIdata[j] = store3;
        k--;
      }
    }
//  End of function.
}
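Given the packed storage documented in the header of the routine above, a post-FHT process can recover the two-sided spectrum of a real-data transform without further arithmetic, the remaining bins following from conjugate symmetry. The helper below is a sketch only, with hypothetical names, and is not part of the shipped listing:

    //  Unpack the Hermitian-packed real-data FFT output produced by the
    //  Conversion routine into separate real/imaginary arrays covering
    //  bins 0 to N/2 inclusive.
    static void UnpackRealFFT (const int *XRdata, int N, int *re, int *im)
    {
        int j;
        re[0] = XRdata[0];     im[0] = 0;        // zero'th bin purely real
        for (j = 1; j < N/2; j++) {
            re[j] = XRdata[j];                   // real component of bin j
            im[j] = XRdata[N-j];                 // imaginary component of bin j
        }
        re[N/2] = XRdata[N/2]; im[N/2] = 0;      // Nyquist bin purely real
    }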
#include "stdafx.h"
#include <math.h>

void Coefficients_V09M_1Level(int i, int k, int N2, int N4, int kbeta,
     int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3,
     int *coeffs, int coef_levels)
{
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the
//  nine-multiplier version of the generic double butterfly where one-level
//  LUTs are exploited.
//  Parameters:
//  -----------
//      i              = temporal addressing index.
//      k              = spatial addressing index.
//      N2             = N / 2.
//      N4             = N / 4.
//      kbeta          = temporal/spatial index.
//      Look_Up_Sin_A1 = look-up table for single-angle argument.
//      Look_Up_Sin_A2 = look-up table for double-angle argument.
//      Look_Up_Sin_A3 = look-up table for triple-angle argument.
//      coeffs         = current set of trigonometric coefficients.
//      coef_levels    = number of trigonometric coefficient quantisation levels.
//  Declarations:
//  -------------
//  Integers:
    int m, n, n3, store_00, store_01;
    static int startup, coeff_00, coeff_01, coeff_02, coeff_03, coeff_04;
// ***********************************************************************
    if (startup == 0) {
//    Set up trivial trigonometric coefficients - valid for each type of double butterfly.
      coeff_00 = +coef_levels; coeff_01 = 0; coeff_02 = -coef_levels;
//    Set up additional constant trigonometric coefficient for Type-II double butterfly.
      coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
      coeff_04 = coeff_03 + coeff_03;
      startup = 1;
    }
    if (i == 0) {
//    Set up trigonometric coefficients for Type-I double butterfly.
      n3 = 0;
      for (n = 0; n < 3; n++) {
        coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_00;
      }
    }
    else {
      if (k == 0) {
//      Set up trigonometric coefficients for Type-II double butterfly.
        n3 = 0;
        for (n = 0; n < 2; n++) {
          coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_00;
        }
        coeffs[6] = coeff_04; coeffs[7] = coeff_03; coeffs[8] = 0;
      }
      else {
//      Set up trigonometric coefficients for Type-III double butterfly.
        m = kbeta;
//      Set up single-angle sinusoidal & cosinusoidal terms.
        store_00 = Look_Up_Sin_A1[N4-m];
        store_01 = Look_Up_Sin_A1[m];
        coeffs[0] = store_00 + store_01;
        coeffs[1] = store_00;
        coeffs[2] = store_00 - store_01;
//      Set up double-angle sinusoidal & cosinusoidal terms.
        m <<= 1;
        store_00 = Look_Up_Sin_A2[N4-m];
        store_01 = Look_Up_Sin_A2[m];
        coeffs[3] = store_00 + store_01;
        coeffs[4] = store_00;
        coeffs[5] = store_00 - store_01;
//      Set up triple-angle sinusoidal & cosinusoidal terms.
        m += kbeta;
        if (m < N4) {
          store_00 = Look_Up_Sin_A3[N4-m];
          store_01 = Look_Up_Sin_A3[m];
        }
        else {
          store_00 = -Look_Up_Sin_A3[m-N4];
          store_01 = Look_Up_Sin_A3[N2-m];
        }
        coeffs[6] = store_00 + store_01;
        coeffs[7] = store_00;
        coeffs[8] = store_00 - store_01;
      }
    }
//  End of function.
}
#include "stdafx.h"
#include <math.h>

void Coefficients_V09M_2Level(int i, int k, int N2, int N4, int RootNd2,
     int alpha, int kbeta, int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine,
     int *Look_Up_Sin_Fine, int *coeffs, int coef_levels, int no_of_bits_coef)
{
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the
//  nine-multiplier version of the generic double butterfly where two-level
//  LUTs are exploited.
//  Parameters:
//  -----------
//      i                  = temporal index.
//      k                  = spatial index.
//      N2                 = N / 2.
//      N4                 = N / 4.
//      RootNd2            = sqrt(N) / 2.
//      alpha              = number of FHT temporal stages.
//      kbeta              = temporal/spatial index.
//      Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//      Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//      Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//      coeffs             = current set of trigonometric coefficients.
//      coef_levels        = number of trigonometric coefficient quantisation levels.
//      no_of_bits_coef    = number of bits representing trigonometric coefficients.
//  Declarations:
//  -------------
//  Integers:
    int m, n, n3, sa1, sca2, ca1, sv1, sv2, cv1, cv2, sum1, sum2, sum3;
    int store_00, store_01, store_02, store_03;
    __int64 store1, store2, store3, store_04, store_05;
    static int startup, alpham1, bits_to_shift, bits_to_shift_m1;
    static int coeff_00, coeff_01, coeff_02, coeff_03, coeff_04;
// ***********************************************************************
    if (startup == 0) {
//    Set up trivial trigonometric coefficients - valid for each type of double butterfly.
      coeff_00 = +coef_levels; coeff_01 = 0; coeff_02 = -coef_levels;
//    Set up additional constant trigonometric coefficient for Type-II double butterfly.
      coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
      coeff_04 = coeff_03 << 1;
//    Set up scaling factor for multiplication stage.
      bits_to_shift = no_of_bits_coef - 1;
      bits_to_shift_m1 = bits_to_shift - 1;
//    Set up scaling factor for address calculation.
      alpham1 = alpha - 1;
      startup = 1;
    }
    if (i == 0) {
//    Set up trigonometric coefficients for Type-I double butterfly.
      n3 = 0;
      for (n = 0; n < 3; n++) {
        coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_00;
      }
    }
    else {
      if (k == 0) {
//      Set up trigonometric coefficients for Type-II double butterfly.
        n3 = 0;
        for (n = 0; n < 2; n++) {
          coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_00;
        }
        coeffs[6] = coeff_04; coeffs[7] = coeff_03; coeffs[8] = 0;
      }
      else {
//      Set up trigonometric coefficients for Type-III double butterfly.
        m = kbeta;
//      Set up single-angle sinusoidal & cosinusoidal terms.
        sa1 = m >> alpham1; ca1 = RootNd2 - sa1; sca2 = m % RootNd2;
        cv1 = Look_Up_Sin_Coarse[ca1]; sv1 = Look_Up_Sin_Coarse[sa1];
        cv2 = Look_Up_Cos_Fine[sca2];  sv2 = Look_Up_Sin_Fine[sca2];
        sum1 = cv1 + sv1; sum2 = cv2 + sv2; sum3 = cv2 - sv2;
        store1 = ((__int64)sum1*cv2) >> bits_to_shift;
        store2 = ((__int64)sum2*sv1) >> bits_to_shift;
        store3 = ((__int64)sum3*cv1) >> bits_to_shift;
        store_00 = (int) (store1 - store2);
        store_01 = (int) (store1 - store3);
        coeffs[0] = store_00 + store_01;
        coeffs[1] = store_00;
        coeffs[2] = store_00 - store_01;
//      Set up double-angle sinusoidal & cosinusoidal terms.
        store1 = ((__int64)store_00*store_00) >> bits_to_shift_m1;
        store2 = ((__int64)store_00*store_01) >> bits_to_shift_m1;
        store_02 = (int) (store1 - coef_levels);
        store_03 = (int) store2;
        coeffs[3] = store_02 + store_03;
        coeffs[4] = store_02;
        coeffs[5] = store_02 - store_03;
//      Set up triple-angle sinusoidal & cosinusoidal terms.
        store1 = ((__int64)store_02*store_00) >> bits_to_shift_m1;
        store2 = ((__int64)store_02*store_01) >> bits_to_shift_m1;
        store_04 = (int) (store1 - store_00);
        store_05 = (int) (store2 + store_01);
        coeffs[6] = (int) (store_04 + store_05);
        coeffs[7] = (int) (store_04);
        coeffs[8] = (int) (store_04 - store_05);
      }
    }
//  End of function.
}
#include "stdafx.h"
#include <math.h>

void Coefficients_V12M_1Level(int i, int k, int N2, int N4, int kbeta,
     int *Look_Up_Sin_A1, int *Look_Up_Sin_A2, int *Look_Up_Sin_A3,
     int *coeffs, int coef_levels)
{
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the
//  twelve-multiplier version of the generic double butterfly where one-level
//  LUTs are exploited.
//  Parameters:
//  -----------
//      i              = temporal addressing index.
//      k              = spatial addressing index.
//      N2             = N / 2.
//      N4             = N / 4.
//      kbeta          = temporal/spatial index.
//      Look_Up_Sin_A1 = look-up table for single-angle argument.
//      Look_Up_Sin_A2 = look-up table for double-angle argument.
//      Look_Up_Sin_A3 = look-up table for triple-angle argument.
//      coeffs         = current set of trigonometric coefficients.
//      coef_levels    = number of trigonometric coefficient quantisation levels.
//  Declarations:
//  -------------
//  Integers:
    int m, n, n3;
    static int startup, coeff_00, coeff_01, coeff_02, coeff_03;
// ***********************************************************************
    if (startup == 0) {
//    Set up trivial trigonometric coefficients - valid for each type of double butterfly.
      coeff_00 = +coef_levels; coeff_01 = 0; coeff_02 = -coef_levels;
//    Set up additional constant trigonometric coefficient for Type-II double butterfly.
      coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
      startup = 1;
    }
    if (i == 0) {
//    Set up trigonometric coefficients for Type-I double butterfly.
      n3 = 0;
      for (n = 0; n < 3; n++) {
        coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_02;
      }
    }
    else {
      if (k == 0) {
//      Set up trigonometric coefficients for Type-II double butterfly.
        n3 = 0;
        for (n = 0; n < 2; n++) {
          coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_02;
        }
        for (n = 6; n < 9; n++) {
          coeffs[n] = coeff_03;
        }
      }
      else {
//      Set up trigonometric coefficients for Type-III double butterfly.
        m = kbeta;
//      Set up single-angle sinusoidal & cosinusoidal terms.
        coeffs[0] = Look_Up_Sin_A1[N4-m];
        coeffs[1] = Look_Up_Sin_A1[m];
//      Set up double-angle sinusoidal & cosinusoidal terms.
        m <<= 1;
        coeffs[3] = Look_Up_Sin_A2[N4-m];
        coeffs[4] = Look_Up_Sin_A2[m];
//      Set up triple-angle sinusoidal & cosinusoidal terms.
        m += kbeta;
        if (m < N4) {
          coeffs[6] = Look_Up_Sin_A3[N4-m];
          coeffs[7] = Look_Up_Sin_A3[m];
        }
        else {
          coeffs[6] = -Look_Up_Sin_A3[m-N4];
          coeffs[7] = Look_Up_Sin_A3[N2-m];
        }
//      Set up remaining trigonometric coefficients through symmetry.
        coeffs[2] = coeffs[0]; coeffs[5] = coeffs[3]; coeffs[8] = coeffs[6];
      }
    }
//  End of function.
}
#include "stdafx.h"
#include <math.h>

void Coefficients_V12M_2Level(int i, int k, int N2, int N4, int RootNd2,
     int alpha, int kbeta, int *Look_Up_Sin_Coarse, int *Look_Up_Cos_Fine,
     int *Look_Up_Sin_Fine, int *coeffs, int coef_levels, int no_of_bits_coef)
{
//  Description:
//  ------------
//  Routine to set up the trigonometric coefficients for use by the
//  twelve-multiplier version of the generic double butterfly where two-level
//  LUTs are exploited.
//  Parameters:
//  -----------
//      i                  = temporal index.
//      k                  = spatial index.
//      N2                 = N / 2.
//      N4                 = N / 4.
//      RootNd2            = sqrt(N) / 2.
//      alpha              = number of FHT temporal stages.
//      kbeta              = temporal/spatial index.
//      Look_Up_Sin_Coarse = coarse resolution sin LUT for single-angle argument.
//      Look_Up_Cos_Fine   = fine resolution cos LUT for single-angle argument.
//      Look_Up_Sin_Fine   = fine resolution sin LUT for single-angle argument.
//      coeffs             = current set of trigonometric coefficients.
//      coef_levels        = number of trigonometric coefficient quantisation levels.
//      no_of_bits_coef    = number of bits representing trigonometric coefficients.
//  Declarations:
//  -------------
//  Integers:
    int m, n, n3, sa1, sca2, ca1, sv1, sv2, cv1, cv2, sum1, sum2, sum3;
    __int64 store1, store2, store3;
    static int startup, alpham1, bits_to_shift, bits_to_shift_m1;
    static int coeff_00, coeff_01, coeff_02, coeff_03;
// ***********************************************************************
    if (startup == 0) {
//    Set up trivial trigonometric coefficients - valid for each type of double butterfly.
      coeff_00 = +coef_levels; coeff_01 = 0; coeff_02 = -coef_levels;
//    Set up additional constant trigonometric coefficient for Type-II double butterfly.
      coeff_03 = (int) ((sqrt(2.0) / 2) * coef_levels);
//    Set up scaling factor for multiplication stage.
      bits_to_shift = no_of_bits_coef - 1;
      bits_to_shift_m1 = bits_to_shift - 1;
//    Set up scaling factor for address calculation.
      alpham1 = alpha - 1;
      startup = 1;
    }
    if (i == 0) {
//    Set up trigonometric coefficients for Type-I double butterfly.
      n3 = 0;
      for (n = 0; n < 3; n++) {
        coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_02;
      }
    }
    else {
      if (k == 0) {
//      Set up trigonometric coefficients for Type-II double butterfly.
        n3 = 0;
        for (n = 0; n < 2; n++) {
          coeffs[n3++] = coeff_00; coeffs[n3++] = coeff_01; coeffs[n3++] = coeff_02;
        }
        for (n = 6; n < 9; n++) {
          coeffs[n] = coeff_03;
        }
      }
      else {
//      Set up trigonometric coefficients for Type-III double butterfly.
        m = kbeta;
//      Set up single-angle sinusoidal & cosinusoidal terms.
        sa1 = m >> alpham1; ca1 = RootNd2 - sa1; sca2 = m % RootNd2;
        cv1 = Look_Up_Sin_Coarse[ca1]; sv1 = Look_Up_Sin_Coarse[sa1];
        cv2 = Look_Up_Cos_Fine[sca2];  sv2 = Look_Up_Sin_Fine[sca2];
        sum1 = cv1 + sv1; sum2 = cv2 + sv2; sum3 = cv2 - sv2;
        store1 = ((__int64)sum1*cv2) >> bits_to_shift;
        store2 = ((__int64)sum2*sv1) >> bits_to_shift;
        store3 = ((__int64)sum3*cv1) >> bits_to_shift;
        coeffs[0] = (int) (store1 - store2);
        coeffs[1] = (int) (store1 - store3);
//      Set up double-angle sinusoidal & cosinusoidal terms.
        cv1 = coeffs[0]; sv1 = coeffs[1];
        store1 = ((__int64)cv1*cv1) >> bits_to_shift_m1;
        store2 = ((__int64)cv1*sv1) >> bits_to_shift_m1;
        coeffs[3] = (int) (store1 - coef_levels);
        coeffs[4] = (int) store2;
//      Set up triple-angle sinusoidal & cosinusoidal terms.
        cv2 = coeffs[3];
        store1 = ((__int64)cv1*cv2) >> bits_to_shift_m1;
        store2 = ((__int64)sv1*cv2) >> bits_to_shift_m1;
        coeffs[6] = (int) (store1 - cv1);
        coeffs[7] = (int) (store2 + sv1);
//      Set up remaining trigonometric coefficients through symmetry.
        coeffs[2] = coeffs[0]; coeffs[5] = coeffs[3]; coeffs[8] = coeffs[6];
      }
    }
//  End of function.
}
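The double-angle and triple-angle coefficients in both two-level routines are generated from the single-angle pair through the identities cos 2θ = 2cos²θ − 1, sin 2θ = 2 sin θ cos θ, cos 3θ = 2 cos θ cos 2θ − cos θ and sin 3θ = 2 sin θ cos 2θ + sin θ, the right shift by one bit fewer than for the single-angle products supplying the factor of two. A floating-point check of the recurrence (angle chosen arbitrarily):

    #include <math.h>
    #include <stdio.h>

    int main (void)
    {
        double theta = 0.31415926535898, c1 = cos(theta), s1 = sin(theta);
        double c2 = 2.0*c1*c1 - 1.0;   // cos(2theta) = 2cos^2(theta) - 1
        double s2 = 2.0*c1*s1;         // sin(2theta) = 2sin(theta)cos(theta)
        double c3 = 2.0*c1*c2 - c1;    // cos(3theta) = 2cos(theta)cos(2theta) - cos(theta)
        double s3 = 2.0*s1*c2 + s1;    // sin(3theta) = 2sin(theta)cos(2theta) + sin(theta)
    //  All four differences should print as zero to working precision.
        printf ("%g %g %g %g\n", c2-cos(2*theta), s2-sin(2*theta),
                                 c3-cos(3*theta), s3-sin(3*theta));
        return 0;
    }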
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void Butterfly_V12M (int i, int j, int k, int *X, int *coeffs, int *kk, int *index1,
                     int *index2, int *index3, int *index4, int coef_levels,
                     int no_of_bits_coeffs, int scaling, int *Data_Max, int shift,
                     int lower, int upper)
{
// Description:
// ------------
// Routine to carry out the generic double butterfly computation using twelve
// fixed-point fast multipliers.
//
// Parameters:
// -----------
// i                 = index for temporal loop.
// j                 = index for outer spatial loop.
// k                 = index for inner spatial loop.
// X                 = 1-D data array.
// coeffs            = current set of trigonometric coefficients.
// kk                = offsets for address permutations.
// index1            = 1st address permutation.
// index2            = 2nd address permutation.
// index3            = 3rd address permutation.
// index4            = 4th address permutation.
// coef_levels       = number of trigonometric coefficient quantisation levels.
// no_of_bits_coeffs = number of bits representing trigonometric coefficients.
// scaling           = scaling flag: 1 => FIXED, 2 => BFP.
// Data_Max          = maximum magnitude of output data set.
// shift             = no of bits for input data to be shifted.
// lower             = lower bound for register overflow with unconditional scaling.
// upper             = upper bound for register overflow with unconditional scaling.
//
// Declarations:
// -------------
// Integers:
// ---------
   int m, n, n2, n2p1, n3, n3p1, store, bits_to_shift1, bits_to_shift2;
// Long Integers:
// --------------
   __int64 m1, m2, m3, m4;
// Integer Arrays:
// ---------------
   int Y[8];
// ***********************************************************************
// Apply 1st address permutation - comprising one data exchange.
   m = kk[0];
   store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;
// Set up scaling factor for multiplication stage.
   bits_to_shift2 = no_of_bits_coeffs - 1;
   if (scaling == 1)
   {
      Y[0] = X[0]; Y[1] = X[1];
// ### Check for register overflow & flag when overflow arises.
      for (n = 0; n < 8; n++)
      {
         if ((X[n] < lower) || (X[n] > upper))
         {
            printf ("\n\n Overflow occurred on input register");
         }
      }
// ### Check for register overflow completed.
   }
   else
   {
//    Set up scaling factor for first two samples of input data set.
      bits_to_shift1 = 3 - shift;
//    Shift data so that MSB occupies optimum position.
      Y[0] = X[0] << bits_to_shift1; Y[1] = X[1] << bits_to_shift1;
      for (n = 2; n < 8; n++) X[n] = X[n] >> shift;
//    Build in three guard bits for LSB.
      bits_to_shift2 -= 3;
   }
// Apply trigonometric coefficients and 1st set of additions/subtractions.
   n3 = 0;
   for (n = 1; n < 4; n++)
   {
      n2 = (n << 1); n2p1 = n2 + 1; n3p1 = n3 + 1;
//    Truncate contents of registers to required levels.
      m1 = ((__int64)coeffs[n3]*X[n2]) >> bits_to_shift2;
      m2 = ((__int64)coeffs[n3p1]*X[n2p1]) >> bits_to_shift2;
      Y[n2] = (int) (m1 + m2);
//    Truncate contents of registers to required levels.
      m3 = ((__int64)coeffs[n3p1]*X[n2]) >> bits_to_shift2;
      m4 = ((__int64)coeffs[n3+2]*X[n2p1]) >> bits_to_shift2;
      Y[n2p1] = (int) (m3 - m4);
      n3 += 3;
   }
// Apply 2nd address permutation.
   m = kk[1];
   for (n = 0; n < 8; n++) X[index2[m++]] = Y[n];
// Apply 2nd set of additions/subtractions.
   for (n = 0; n < 4; n++)
   {
      n2 = (n << 1); n2p1 = n2 + 1;
      store = X[n2] + X[n2p1]; X[n2p1] = X[n2] - X[n2p1]; X[n2] = store;
   }
// Apply 3rd address permutation.
   m = kk[2];
   for (n = 0; n < 8; n++) Y[index3[m++]] = X[n];
// Apply 3rd set of additions/subtractions.
   for (n = 0; n < 4; n++)
   {
      n2 = (n << 1); n2p1 = n2 + 1;
      store = Y[n2] + Y[n2p1]; Y[n2p1] = Y[n2] - Y[n2p1]; Y[n2] = store;
   }
   m = kk[3];
   for (n = 0; n < 8; n++)
   {
      if (scaling == 2)
      {
//       Remove three LSB guard bits - MSB may be magnified by up to three bits.
         Y[n] = (Y[n] >> 3);
//       Update maximum magnitude of output data set.
         if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n];
      }
//    Apply 4th address permutation.
      X[index4[m++]] = Y[n];
   }
// End of function.
}
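The shift parameter consumed above under BFP scaling has to be derived, between stages, from the magnitude recorded in Data_Max. The routine below is a minimal sketch of one plausible derivation, not one of the appendix listings: the name Block_Shift, the reg_bits parameter and the headroom rule are illustrative assumptions only, and a full implementation would also budget for the up-to-three-bit growth through the double butterfly.

#include <stdlib.h>

// Hypothetical helper - not part of the appendix listings. Derives the right-shift
// for the next stage's inputs from the largest output magnitude of the current
// stage, assuming samples are held in registers of 'reg_bits' bits.
int Block_Shift (int data_max, int reg_bits)
{
// Count the bits occupied by the largest magnitude seen at the stage output.
   int mag = abs (data_max), bits = 0;
   while (mag > 0) { mag >>= 1; bits++; }
// Shift right only by the amount that exceeds the available register headroom.
   int shift = bits - (reg_bits - 1);
   return (shift > 0) ? shift : 0;
}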
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void Butterfly_V09M (int i, int j, int k, int *X, int *coeffs, int *kk, int *index1,
                     int *index2, int *index3, int *index4, int coef_levels,
                     int no_of_bits_coeffs, int scaling, int *Data_Max, int shift,
                     int Type, int lower, int upper)
{
// Description:
// ------------
// Routine to carry out the generic double butterfly computation using nine fixed-point
// fast multipliers.
//
// Parameters:
// -----------
// i                 = index for temporal loop.
// j                 = index for outer spatial loop.
// k                 = index for inner spatial loop.
// X                 = 1-D data array.
// coeffs            = current set of trigonometric coefficients.
// kk                = offsets for address permutations.
// index1            = 1st address permutation.
// index2            = 2nd address permutation.
// index3            = 3rd address permutation.
// index4            = 4th address permutation.
// coef_levels       = number of trigonometric coefficient quantisation levels.
// no_of_bits_coeffs = number of bits representing trigonometric coefficients.
// scaling           = scaling flag: 1 => FIXED, 2 => BFP.
// Data_Max          = maximum magnitude of output data set.
// shift             = no of bits for input data to be shifted.
// Type              = butterfly type indicator: I, II or III.
// lower             = lower bound for register overflow with unconditional scaling.
// upper             = upper bound for register overflow with unconditional scaling.
//
// Note:
// -----
// Dimension array X[n] from 0 to 8 in calling routine RFHT4_Control.
//
// Declarations:
// -------------
// Integers:
// ---------
   int m, n, n2, n2p1, store, bits_to_shift1, bits_to_shift2;
// Long Integers:
// --------------
   __int64 product;
// Integer Arrays:
// ---------------
   int Y[11];
// ***********************************************************************
// Apply 1st address permutation - comprising one data exchange.
   m = kk[0];
   store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;
// Set up scaling factor for multiplication stage.
   bits_to_shift2 = no_of_bits_coeffs - 1;
   if (scaling == 2)
   {
//    Set up scaling factor for first two samples of input data set.
      bits_to_shift1 = 3 - shift;
//    Shift data so that MSB occupies optimum position.
      X[0] = X[0] << bits_to_shift1; X[1] = X[1] << bits_to_shift1;
      for (n = 2; n < 8; n++) X[n] = X[n] >> shift;
//    Build in three guard bits for LSB.
      bits_to_shift2 -= 3;
   }
// Apply 1st set of additions/subtractions.
   Y[0] = X[0]; Y[1] = X[1];
   Y[2] = X[2]; Y[3] = X[2] + X[3]; Y[4] = X[3];
   Y[5] = X[4]; Y[6] = X[4] + X[5]; Y[7] = X[5];
   Y[8] = X[6]; Y[9] = X[6] + X[7]; Y[10] = X[7];
   if (scaling == 1)
   {
//    Scale outputs of 1st set of additions/subtractions.
      for (n = 0; n < 11; n++) Y[n] = (Y[n] >> 1);
// ### Check for register overflow & flag when overflow arises.
      for (n = 0; n < 11; n++)
      {
         if ((Y[n] < lower) || (Y[n] > upper))
         {
            printf ("\n\n Overflow occurred on input register");
         }
      }
// ### Check for register overflow completed.
   }
// Apply trigonometric coefficients.
   for (n = 0; n < 9; n++)
   {
      product = ((__int64)coeffs[n]*Y[n+2]) >> bits_to_shift2;
      X[n] = (int) product;
   }
// Apply 2nd set of additions/subtractions.
   if (Type < 3)
   {
      Y[2] = X[0] + X[1]; Y[3] = X[1] + X[2];
      Y[4] = X[3] + X[4]; Y[5] = X[4] + X[5];
   }
   else
   {
      Y[2] = X[1] - X[2]; Y[3] = X[0] - X[1];
      Y[4] = X[4] - X[5]; Y[5] = X[3] - X[4];
   }
   if (Type < 2)
   {
      Y[6] = X[6] + X[7]; Y[7] = X[7] + X[8];
   }
   else
   {
      Y[6] = X[7] - X[8]; Y[7] = X[6] - X[7];
   }
// Apply 2nd address permutation.
   m = kk[1];
   for (n = 0; n < 8; n++) X[index2[m++]] = Y[n];
// Apply 3rd set of additions/subtractions.
   for (n = 0; n < 4; n++)
   {
      n2 = (n << 1); n2p1 = n2 + 1;
      store = X[n2] + X[n2p1]; X[n2p1] = X[n2] - X[n2p1]; X[n2] = store;
   }
// Apply 3rd address permutation.
   m = kk[2];
   for (n = 0; n < 8; n++) Y[index3[m++]] = X[n];
// Apply 4th set of additions/subtractions.
   for (n = 0; n < 4; n++)
   {
      n2 = (n << 1); n2p1 = n2 + 1;
      store = Y[n2] + Y[n2p1]; Y[n2p1] = Y[n2] - Y[n2p1]; Y[n2] = store;
   }
   m = kk[3];
   for (n = 0; n < 8; n++)
   {
      if (scaling == 2)
      {
//       Remove three LSB guard bits - MSB may be magnified by up to three bits.
         Y[n] = (Y[n] >> 3);
//       Update maximum magnitude of output data set.
         if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n];
      }
//    Apply 4th address permutation.
      X[index4[m++]] = Y[n];
   }
// End of function.
}
#include "stdafx.h"
#include <stdio.h>
#include <stdlib.h>
#include <math.h>

void Butterfly_Cordic (int *X, int *kbeta, int *kk, int *index1, int *index2,
                       int *index3, int *index4, int halfpi, int minusquarterpi,
                       int growth, int *arctans, int no_of_iterations,
                       int no_of_bits_angle, int negate_flag, int scaling,
                       int *Data_Max, int shift, int LSB_guard_bits, int lower,
                       int upper)
{
// Description:
// ------------
// Routine to carry out the generic double butterfly computation using three Cordic
// arithmetic units.
//
// Externs:
// --------
   void Rotation (int*, int*, int*, int, int, int*);
//
// Parameters:
// -----------
// X                = data.
// kbeta            = current set of rotation angles.
// kk               = offsets for address permutations.
// index1           = 1st address permutation.
// index2           = 2nd address permutation.
// index3           = 3rd address permutation.
// index4           = 4th address permutation.
// halfpi           = integer version of +(pi/2).
// minusquarterpi   = integer version of -(pi/4).
// growth           = integer version of Cordic magnification factor.
// arctans          = micro-rotation angles.
// no_of_iterations = no of Cordic iterations.
// no_of_bits_angle = no of bits to represent Cordic rotation angle.
// negate_flag      = negation flag for Cordic output.
// scaling          = scaling flag: 1 => FIXED, 2 => BFP.
// Data_Max         = maximum magnitude of output data set.
// shift            = no of bits for input data to be shifted.
// LSB_guard_bits   = no of bits for guarding LSB.
// lower            = lower bound for register overflow with unconditional scaling.
// upper            = upper bound for register overflow with unconditional scaling.
//
// Declarations:
// -------------
// Integers:
// ---------
   int m, n, n2, n2p1, store, bits_to_shift1, bits_to_shift2;
// Integer Arrays:
// ---------------
   int Y[8], xs[3], ys[3], zs[3];
// *****************************************************************
// Apply 1st address permutation - comprising one data exchange.
   m = kk[0];
   store = X[index1[m++]]; X[6] = X[index1[m]]; X[3] = store;
// Set up scaling factor for multiplication stage.
   bits_to_shift1 = no_of_bits_angle - 1;
   if (scaling == 1)
   {
// ### Check for register overflow & flag when overflow arises.
      for (n = 0; n < 8; n++)
      {
         if ((X[n] < lower) || (X[n] > upper))
         {
            printf ("\n\n Overflow occurred on input register");
         }
      }
// ### Check for register overflow completed.
   }
   else
   {
//    Set up scaling factor for first two samples of input data set.
      bits_to_shift2 = LSB_guard_bits - shift + 2;
//    Shift data so that MSB occupies optimum position.
      X[0] = X[0] >> shift; X[1] = X[1] >> shift;
      for (n = 2; n < 8; n++) X[n] = X[n] << bits_to_shift2;
//    Build in two additional guard bits for LSB.
      bits_to_shift1 -= 2;
   }
// Scale first two permuted inputs with Cordic magnification factor.
   Y[0] = (int) (((__int64)growth*X[0]) >> bits_to_shift1);
   Y[1] = (int) (((__int64)growth*X[1]) >> bits_to_shift1);
// Set up inputs to Cordic phase rotations of remaining permuted inputs.
   xs[0] = X[2]; xs[1] = X[4]; xs[2] = X[6];
   ys[0] = X[3]; ys[1] = X[5]; ys[2] = X[7];
   zs[0] = kbeta[0]; zs[1] = kbeta[1]; zs[2] = kbeta[2];
   if (negate_flag == 1) zs[2] = minusquarterpi;
// Carry out Cordic phase rotations of remaining permuted inputs.
   Rotation (xs, ys, zs, halfpi, no_of_iterations, arctans);
// Set up outputs from Cordic phase rotations of remaining permuted inputs.
   Y[2] = xs[0]; Y[4] = xs[1]; Y[6] = xs[2];
   Y[3] = ys[0]; Y[5] = ys[1]; Y[7] = ys[2];
   if (scaling == 2)
   {
//    Scale Cordic outputs to remove LSB guard bits.
      for (n = 2; n < 8; n++) Y[n] = Y[n] >> LSB_guard_bits;
   }
// Negate, where appropriate, phase rotated outputs.
   if (negate_flag > 0)
   {
      Y[7] = -Y[7];
      if (negate_flag > 1) { Y[3] = -Y[3]; Y[5] = -Y[5]; }
   }
// Apply 2nd address permutation.
   m = kk[1];
   for (n = 0; n < 8; n++) X[index2[m++]] = Y[n];
// Apply 1st set of additions/subtractions.
   for (n = 0; n < 4; n++)
   {
      n2 = (n << 1); n2p1 = n2 + 1;
      store = X[n2] + X[n2p1]; X[n2p1] = X[n2] - X[n2p1]; X[n2] = store;
   }
// Apply 3rd address permutation.
   m = kk[2];
   for (n = 0; n < 8; n++) Y[index3[m++]] = X[n];
// Apply 2nd set of additions/subtractions.
   for (n = 0; n < 4; n++)
   {
      n2 = (n << 1); n2p1 = n2 + 1;
      store = Y[n2] + Y[n2p1]; Y[n2p1] = Y[n2] - Y[n2p1]; Y[n2] = store;
   }
   m = kk[3];
   for (n = 0; n < 8; n++)
   {
      if (scaling == 2)
      {
//       Remove two LSB guard bits - MSB may be magnified by up to two bits.
         Y[n] = (Y[n] >> 2);
//       Update maximum magnitude of output data set.
         if (abs(Y[n]) > abs(Data_Max[0])) Data_Max[0] = Y[n];
      }
//    Apply 4th address permutation.
      X[index4[m++]] = Y[n];
   }
// End of function.
}
#include "stdafx.h"

void Rotation (int *xs, int *ys, int *zs, int halfpi, int no_of_iterations, int *arctans)
{
// Description:
// ------------
// Routine to carry out the phase rotations required by the Cordic arithmetic unit for the
// single angle, double angle and triple angle cases.
//
// Parameters:
// -----------
// xs               = X coordinates.
// ys               = Y coordinates.
// zs               = rotation angles.
// halfpi           = +(pi/2).
// no_of_iterations = no of Cordic iterations.
// arctans          = set of micro-rotation angles.
//
// Declarations:
// -------------
// Integers:
// ---------
   int k, n;
// Integer Arrays:
// ---------------
   int temp[3];
// ***********************************************************************
//                P H A S E   R O T A T I O N   R O U T I N E.
// Reduce three rotation angles to region of convergence: [-pi/2,+pi/2].
   for (n = 0; n < 3; n++)
   {
      if (zs[n] < -halfpi)
      {
         temp[n] = +ys[n]; ys[n] = -xs[n]; xs[n] = temp[n]; zs[n] += halfpi;
      }
      else if (zs[n] > +halfpi)
      {
         temp[n] = -ys[n]; ys[n] = +xs[n]; xs[n] = temp[n]; zs[n] -= halfpi;
      }
   }
// Loop through Cordic iterations.
   for (k = 0; k < no_of_iterations; k++)
   {
//    Carry out phase micro-rotation of three complex data samples.
      for (n = 0; n < 3; n++)
      {
         if (zs[n] < 0)
         {
            temp[n] = xs[n] + (ys[n] >> k);
            ys[n]  -= (xs[n] >> k);
            xs[n]   = temp[n];
            zs[n]  += arctans[k];
         }
         else
         {
            temp[n] = xs[n] - (ys[n] >> k);
            ys[n]  += (xs[n] >> k);
            xs[n]   = temp[n];
            zs[n]  -= arctans[k];
         }
      }
   }
// End of function.
}
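The arctans table, growth factor and angle constants consumed by Butterfly_Cordic and Rotation must be computed once at start-up. The sketch below shows one way this could be done; the routine name Cordic_Setup is hypothetical, and it assumes the angles are held as signed fixed-point words with (no_of_bits_angle - 1) fractional bits, consistent with the shifts applied in the listings above.

#include <math.h>

// Hypothetical set-up sketch - not one of the appendix listings. Builds the
// arctan LUT and the integer Cordic magnification factor under the assumed
// fixed-point angle format described above.
void Cordic_Setup (int *arctans, int no_of_iterations, int no_of_bits_angle,
                   int *growth, int *halfpi, int *minusquarterpi)
{
   double scale = (double) (1 << (no_of_bits_angle - 1));
   double gain = 1.0;
   int k;
   for (k = 0; k < no_of_iterations; k++)
   {
//    Micro-rotation angles: arctan(2^-k), quantised to the angle format.
      arctans[k] = (int) (atan (pow (2.0, (double) -k)) * scale + 0.5);
//    Accumulate the Cordic gain: K = product over k of sqrt(1 + 2^-2k).
      gain *= sqrt (1.0 + pow (2.0, (double) (-2 * k)));
   }
   growth[0] = (int) (gain * scale + 0.5);
   halfpi[0] = (int) ((3.14159265358979 / 2.0) * scale + 0.5);
   minusquarterpi[0] = -(int) ((3.14159265358979 / 4.0) * scale + 0.5);
}

With this format the accumulated gain approaches the familiar Cordic magnification factor of approximately 1.6468 for large iteration counts, which is the quantity the integer growth word represents.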
#include "stdafx.h"
#include <stdlib.h>

void DataIndices (int i, int j, int k, int offset, int *kk, int index_even_2D[2][4],
                  int index_odd_2D[2][4], int bfly_count, int alpha)
{
// Description:
// ------------
// Routine to set up the data indices for accessing the input data for the generic
// double butterfly.
//
// Parameters:
// -----------
// i             = index for temporal loop.
// j             = index for outer spatial loop.
// k             = index for inner spatial loop.
// offset        = element of power-of-two array.
// kk            = offsets for address permutations.
// index_even_2D = even data address indices.
// index_odd_2D  = odd data address indices.
// bfly_count    = double butterfly address for stage.
// alpha         = no of temporal stages for transform.
//
// Externs:
// --------
   void MemoryBankAddress (int, int, int, int, int*, int*);
//
// Declarations:
// -------------
// Integers:
// ---------
   int n, n1, n2, twice_offset;
// Pointer Variables:
// ------------------
   int *bank1, *offset1, *bank2, *offset2;
// ***********************************************************************
// Set up dynamic memory.
   bank1 = new int [1]; bank1[0] = 0;
   bank2 = new int [1]; bank2[0] = 0;
   offset1 = new int [1]; offset1[0] = 0;
   offset2 = new int [1]; offset2[0] = 0;
// Calculate data indices.
   if (i == 0)
   {
//    S T A G E  =  0.
      twice_offset = offset;
//    Set up even and odd data indices for Type-I double butterfly.
      n1 = j - twice_offset; n2 = n1 + 4;
      for (n = 0; n < 4; n++)
      {
         n1 += twice_offset; n2 += twice_offset;
         MemoryBankAddress (n1, n, 1, alpha, bank1, offset1);
         index_even_2D[0][n] = *bank1; index_even_2D[1][n] = *offset1;
         MemoryBankAddress (n2, n, 1, alpha, bank2, offset2);
         index_odd_2D[0][n] = *bank2; index_odd_2D[1][n] = *offset2;
      }
//    Set up offsets for address permutations.
      kk[0] = 0; kk[1] = 0; kk[2] = 0;
   }
   else
   {
//    S T A G E  >  0.
      twice_offset = (offset << 1);
      if (k == 0)
      {
//       Set up even and odd data indices for Type-II double butterfly.
         n1 = j - twice_offset; n2 = n1 + offset;
         for (n = 0; n < 4; n++)
         {
            n1 += twice_offset; n2 += twice_offset;
            MemoryBankAddress (n1, bfly_count, 1, alpha, bank1, offset1);
            index_even_2D[0][n] = *bank1; index_even_2D[1][n] = *offset1;
            MemoryBankAddress (n2, bfly_count, 1, alpha, bank2, offset2);
            index_odd_2D[0][n] = *bank2; index_odd_2D[1][n] = *offset2;
         }
//       Set up offsets for address permutations.
         kk[0] = 0; kk[1] = 0; kk[2] = 0;
      }
      else
      {
//       Set up even and odd data indices for Type-III double butterfly.
         n1 = j + k - twice_offset; n2 = j - k;
         for (n = 0; n < 4; n++)
         {
            n1 += twice_offset; n2 += twice_offset;
            MemoryBankAddress (n1, bfly_count, 1, alpha, bank1, offset1);
            index_even_2D[0][n] = *bank1; index_even_2D[1][n] = *offset1;
            MemoryBankAddress (n2, bfly_count, 1, alpha, bank2, offset2);
            index_odd_2D[0][n] = *bank2; index_odd_2D[1][n] = *offset2;
         }
//       Set up offsets for address permutations.
         kk[0] = 2; kk[1] = 8; kk[2] = 8;
      }
   }
// Delete dynamic memory.
   delete [] bank1; delete [] bank2; delete [] offset1; delete [] offset2;
// End of function.
}
#include "stdafx.h"

void MemoryBankAddress (int address, int butterfly, int startup, int alpha,
                        int *bank, int *offset)
{
// Description:
// ------------
// Routine to calculate the memory bank address and offset.
//
// Parameters:
// -----------
// address   = sample address: [0,1,...,N-1].
// butterfly = butterfly address for stage: [0,1,...,N/8-1].
// startup   = initialisation flag: 0 => start up, 1 => butterfly.
// alpha     = no of temporal stages for transform.
// bank      = memory bank address of sample: [0,1,2,3,4,5,6,7].
// offset    = address offset within memory bank: [0,...,N/8-1].
//
// Note:
// -----
// For optimum arithmetic efficiency, comment out coding options not relevant to the
// current application.
//
// Declarations:
// -------------
// Integers:
// ---------
   int k1, k2, sub_block_size, mapping;
// ***********************************************************************
// Calculate memory bank address for N up to and including 1K.
// bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+
//            ((address%256)>>6)+(address>>8)) % 4) << 1) + (butterfly%2);
// Calculate memory bank address for N up to and including 4K.
// bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+((address%256)>>6)+
//            ((address%1024)>>8)+(address>>10)) % 4) << 1) + (butterfly%2);
// Calculate memory bank address for N up to and including 16K.
// bank[0] = ((((address%4)+((address%16)>>2)+((address%64)>>4)+
//            ((address%256)>>6)+((address%1024)>>8)+((address%4096)>>10)+
//            (address>>12)) % 4) << 1) + (butterfly%2);
// Calculate memory bank address using generic version of address mapping.
   sub_block_size = 1; mapping = 0;
   for (k1 = 0; k1 < alpha; k1++)
   {
      k2 = k1 << 1;
      sub_block_size <<= 2;
      mapping += ((address % sub_block_size) >> k2);
   }
   bank[0] = ((mapping % 4) << 1) + (butterfly % 2);
// Calculate address offset within memory bank.
   if (startup > 0) offset[0] = address >> 3;
// End of function.
}
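As a quick sanity check of the mapping, the driver below (an illustration, not one of the appendix listings) tabulates the bank and offset assigned to each sample of a small transform; N = 64 gives alpha = 3 temporal stages, and with the butterfly parity held at zero only the even-numbered banks appear.

#include <stdio.h>

void MemoryBankAddress (int, int, int, int, int*, int*);

// Illustrative driver only: tabulate the bank/offset mapping for N = 64 (alpha = 3).
int main (void)
{
   int bank[1], offset[1], address;
   for (address = 0; address < 64; address++)
   {
      MemoryBankAddress (address, 0, 1, 3, bank, offset);
      printf ("sample %2d -> bank %d, offset %d\n", address, bank[0], offset[0]);
   }
   return 0;
}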
Glossary

ADC – analog-to-digital conversion
ASIC – application-specific integrated circuit
AWGN – additive white Gaussian noise
CD – compact disc
CFA – Common Factor Algorithm
CLB – configurable logic block
CM – trigonometric coefficient memory
CN – linear space of complex-valued N-tuples
CORDIC – Co-Ordinate Rotation DIgital Computer
CRT – Chinese Remainder Theorem
CS – computational stage
DA – distributed arithmetic
DDC – digital down conversion
DFT – discrete Fourier transform
DHT – discrete Hartley transform
DIF – decimation-in-frequency
DIT – decimation-in-time
DM – data memory
DMER – even-real data memory
DMI – intermediate data memory
DMOI – odd-imaginary data memory
DSP – digital signal processing
DTMF – dual-tone multi-frequency
FDM – frequency division multiplexed
FFT – fast Fourier transform
FHT – fast Hartley transform
FNT – Fermat number transform
FPGA – field-programmable gate array
GD-BFLY – generic double butterfly
HDL – hardware description language
IF – intermediate frequency
I/O – input–output
IP – intellectual property
LSB – least significant bit
LUT – look-up table
MAC – multiplier and accumulator
MNT – Mersenne number transform
MSB – most significant bit
NTT – number-theoretic transform
O – order
PE – processing element
PFA – Prime Factor Algorithm
PSD – power spectral density
RAM – random access memory
R24 FHT – regularized radix-4 fast Hartley transform
RF – radio frequency
RN – linear space of real-valued N-tuples
ROM – read only memory
SFDR – spurious-free dynamic range
SFG – signal flow graph
SIMD – single-instruction multiple-data
SNR – signal-to-noise ratio
TDOA – time-difference-of-arrival
TOA – time-of-arrival
Index

A
Alias-free formulation, 154–155
Analog-to-digital conversion (ADC), 42, 72, 95
Application-specific integrated circuit (ASIC), 1, 2, 8, 39, 65–67, 77, 114
Area efficient, 62, 70–72, 77–98, 160
Arithmetic complexity, 1, 8, 18–20, 25, 37, 39, 61, 72–74, 77, 85, 86, 97, 110, 112, 114, 124, 125, 132, 137, 140, 145, 151, 155
Auto-correlation, 12, 135–137, 141–146, 161

B
Bergland algorithm, 16–18
Bit reversal mapping, 23, 47, 98
Bruun algorithms, 16, 18–19, 24
Butterfly, 3, 7, 8, 11, 17, 37–39, 41–56, 59, 61, 62, 81, 84, 90–91, 108, 160

C
Channelization, 12, 61, 135, 136, 149–156, 161
Chinese remainder theorem (CRT), 5
Circular convolution, 36, 140, 150
Circular correlation, 140, 144
Clock frequency, 66, 68, 71, 97, 114
Coarse grain parallelism, 68
Common factor algorithm (CFA), 5
Complementary angle LUTs, 85, 86, 92, 125, 126
Computational density, 11, 42, 66, 71, 74, 77, 78, 127, 132, 135, 161, 162
Computational stage (CS), 4, 62, 90, 105, 108, 127
Configurable logic block (CLB), 67
Convolution, 34, 36, 135–137, 140, 144, 149–151, 156
Cooley-Tukey algorithm, 4, 5, 17, 133
Co-Ordinate Rotation Digital Computer (CORDIC), 10, 12, 70, 72, 73, 101, 102, 104–114, 159, 161, 165, 168–171, 173–176, 178–180, 189, 190, 193, 195, 213–215, 217
Correlation, 12, 35, 37, 135–137, 140–149, 151, 156, 161
Cross-correlation, 12, 135, 137, 141–143, 145–148

D
Data memory (DM), 12, 59, 79–84, 90–92, 95, 98, 114, 120, 122, 124–127, 130, 132, 156
Data space, 5, 8, 11, 12, 36, 39, 129, 136–140, 142, 143, 149, 156
Decimation-in-frequency (DIF), 5, 16, 17, 20, 23, 42
Decimation-in-time (DIT), 5, 16, 17, 20, 23, 37, 42, 45, 59
Di-bit reversal mapping, 47, 98
Differentiation, 135, 137–140, 147, 156
Digital down conversion (DDC), 7, 24, 95, 135, 136, 149, 151, 153, 156, 160
Digit reversal mapping, 23
Discrete Fourier transform (DFT), 1–8, 10–12, 15–25, 27–37, 39, 41, 59, 65, 78, 96, 114, 117–136, 149, 151–156, 159–161
Discrete Hartley transform (DHT), 1, 6–8, 10–12, 27–39, 42, 44, 129, 130, 133, 138–140, 143–145, 147, 148, 150, 155, 160
Distributed arithmetic (DA), 10, 70, 102
Divide-and-conquer, 4, 46
Double buffering, 95, 98, 127
Double-resolution approach, 117, 118, 124–127, 132
Dragonfly, 62
Dual-port memory, 91, 96, 97, 132, 156

E
Equivalency theorem, 151

F
Fast Fourier transform (FFT), 1, 3–8, 10, 11, 15–24, 37, 39, 41, 46, 59–62, 65, 67, 71, 72, 74, 78, 93–98, 102, 118–122, 124, 128–130, 133, 137, 139, 141, 142, 144, 153, 155, 159–162
Fast Hartley transform (FHT), 1, 6–10, 12, 27–39, 41–63, 65, 69, 74, 77–79, 87, 101, 107, 111, 117–133, 135–156, 160
Field-programmable gate array (FPGA), 1, 2, 8, 23, 39, 65–67, 73, 74, 77, 79, 93–97, 101–103, 105, 110, 112, 114, 134, 153
Fine grain parallelism, 68
Fixed point, 2, 39, 42, 61–62, 90, 92, 97, 101–103, 107, 110, 112, 114, 127, 133, 161
Fourier matrix, 2, 5
Fourier space, 11, 30, 31, 39, 98, 118, 120, 121, 129–131, 135, 136, 141
Frequency division multiplexed (FDM), 149

G
Generic double butterfly (GD-BFLY), 8, 48, 50, 52, 54–56, 90–91, 108, 160
Global pipelining, 69, 78

H
Half-resolution approach, 117, 118, 131–133
Hardware description language (HDL), 13, 73, 84, 161
Hartley matrix, 6
Hartley space, 8, 11, 30, 32, 36, 39, 98, 118, 120, 121, 129–131, 135–142, 147–151, 156

I
In-place processing, 80, 123, 132
Input-output (I/O), 9, 11, 66, 70–72, 78, 95, 132

K
Kernels, 2–4, 6, 10, 18, 27, 29, 63

L
Latency, 8, 10, 25, 37, 66, 71, 72, 78, 95–97, 105, 114, 121, 127, 129, 131, 155, 159, 162
Linear convolution, 36, 137, 139, 144
Linear correlation, 140
Local pipelining, 69

M
Matched filter, 2, 39, 144
Memory requirement, 7, 8, 10, 18, 24, 25, 37, 58, 59, 61, 63, 68, 70, 71, 73, 74, 86, 91, 92, 95, 97, 101, 102, 110, 112, 114, 124–126, 131, 132, 145, 151, 161
Minimum arithmetic addressing, 57, 85, 86, 91, 93, 112, 124, 126
Minimum memory addressing, 57–58, 85–86, 88, 92, 112, 125, 126
Mobile communications, 2, 7–9, 11, 65–67, 79, 159
Multiplier and accumulator (MAC), 7

N
Nearest-neighbour communication, 61, 68
Noble identity, 151

O
Orthogonal, 2, 6, 10, 11, 27, 36, 46, 53, 129, 137, 140, 160
Overlap-add technique, 137, 144, 149
Overlap-save technique, 36, 139, 144, 150

P
Parseval's theorem, 36, 129, 130
Partitioned-memory processing, 66, 71–72, 74, 98
Pipeline delay, 90, 114, 127, 131
Pipelining, 4, 62, 68, 69, 78, 82–84, 88, 90–91, 105
Polyphase DFT filter bank, 135, 136, 151, 153, 154, 156
Power spectral density (PSD), 39
Prime factor algorithm (PFA), 5
Processing element (PE), 9

Q
Quad-port memory, 96, 97, 132, 156

R
Radix, 3–5, 7, 8, 11, 16, 19, 20, 23, 24, 37, 41–56, 61, 62, 69, 78, 80, 93–95, 98, 117, 133, 159, 160
Random access memory (RAM), 65–67, 73, 101, 103, 110, 112, 114, 123, 159
Read only memory (ROM), 67, 104
Real-from-complex strategy, 7, 15, 16, 59, 95, 153, 160
Regularity, 3, 8, 24, 37, 39, 41, 45, 47, 52, 54, 59, 61, 62, 114, 133, 160
Regularized radix-4 fast Hartley transform (R24 FHT), 8–10, 12, 13, 36–38, 41, 42, 45–48, 54, 56–59, 61, 62, 66–71, 73, 74, 77, 78, 80, 81, 83, 85, 86, 91–93, 95–98, 101, 102, 107, 110, 112, 114, 117–121, 123–127, 129–135, 142, 153, 155, 156, 160, 161, 163–165, 167, 169, 175

S
Scalable, 70, 71, 77–98, 112, 160, 162
Scaling strategy, 28, 39, 61, 62, 98, 161, 163, 167, 169–171, 178, 180
Shannon-Hartley theorem, 9
Signal flow graph (SFG), 16, 17, 38, 45, 47, 48, 52, 54–56, 62, 108, 109, 120–122
Signal-to-noise ratio (SNR), 42, 136, 144
Silicon area, 11, 66, 68–71, 78
Single-instruction multiple-data (SIMD), 4, 96
Single-port memory, 110
Single-quadrant addressing scheme, 57, 85
Space complexity, 69, 70, 91–92
Spider, 62
Spurious-free dynamic range (SFDR), 95
Start-up delay, 90, 123–125
Switching frequency, 66, 68, 70
Symmetry, 3, 10, 16, 29, 52, 62, 63, 205, 207
Systolic, 4

T
Time complexity, 23, 70, 78, 92–93, 117, 124, 125, 127, 131, 155
Time-difference-of-arrival (TDOA), 141, 148, 155, 156
Time-of-arrival (TOA), 141, 148, 155
Transform space, 8, 12, 31, 32, 39, 98, 118, 132, 133, 135–137, 140, 142, 143, 147, 149, 150, 155
Trigonometric coefficient memory (CM), 12, 56–58, 69, 70, 72, 79, 80, 85, 86, 91, 92, 96, 103, 104, 110, 112, 132, 156, 163, 166

U
Unitary, 2, 6, 10, 11, 27, 28, 36, 46, 129, 137, 140
Update-time, 8, 65, 71, 95–97, 132, 159
Up-sampling, 135, 137–140, 147, 156

W
Wireless communications, 9, 13, 135, 136, 159, 161

Z
Zero-padding, 118, 133, 144, 150