Turbo-like Codes
Aliazam Abbasfar
Turbo-like Codes Design for High Speed Decoding
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-10: 1-4020-6390-3
ISBN-13: 978-1-4020-6390-9
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands.
www.springeronline.com
Printed on acid-free paper
All Rights Reserved. © 2007 Springer.
No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
Dedicated to my wife
Contents

List of Figures
List of Tables
Acknowledgments
Abstract

1 Introduction
  1.1 Outline

2 Turbo Concept
  2.1 Turbo Codes and Turbo-like Codes
    2.1.1 Turbo Codes
    2.1.2 Repeat–Accumulate Codes
    2.1.3 Product Codes
  2.2 Iterative Decoding
  2.3 Probability Propagation Algorithms
  2.4 Message-passing Algorithm
  2.5 Graphs with Cycles
  2.6 Codes on Graph
    2.6.1 Parity-check Codes
    2.6.2 Convolutional Codes
    2.6.3 Turbo Codes

3 High-speed Turbo Decoders
  3.1 Introduction
  3.2 BCJR Algorithm
  3.3 Turbo Decoding
  3.4 Pipelined Turbo Decoder
  3.5 Parallel Turbo Decoder
  3.6 Speed Gain and Efficiency
    3.6.1 Definitions
    3.6.2 Simulation Results
  3.7 Interleaver Design
    3.7.1 Low Latency Interleaver Structure
    3.7.2 Interleaver Design Algorithm
    3.7.3 Simulation Results
  3.8 Hardware Complexity
  3.9 Conclusion

4 Very Simple Turbo-like Codes
  4.1 Introduction
    4.1.1 Bounds on the ML Decoding Performance of Block Codes
    4.1.2 Density Evolution Method
  4.2 RA Codes
    4.2.1 ML Analysis
    4.2.2 DE Analysis
  4.3 RA Codes with Puncturing
    4.3.1 ML Analysis
    4.3.2 Performance of Punctured RA Codes with ML Decoding
    4.3.3 DE Analysis
  4.4 ARA Codes
    4.4.1 ML Analysis
    4.4.2 Performance of ARA Codes with ML Decoding
    4.4.3 DE Analysis
  4.5 Other Precoders
    4.5.1 Accumulator with Puncturing
  4.6 Hardware Complexity
  4.7 Conclusion

5 High Speed Turbo-like Decoders
  5.1 Introduction
  5.2 Parallel ARA Decoder
  5.3 Speed Gain and Efficiency
  5.4 Interleaver Design
  5.5 Projected Graph
    5.5.1 Parallel Turbo Decoder
    5.5.2 Other Known Turbo-like Codes
    5.5.3 Parallel LDPC Codes
    5.5.4 More Accumulate–Repeat–Accumulate Codes
  5.6 General Hardware Architecture
  5.7 Conclusion

References
Index
List of Figures

1 The block diagram of a PCCC encoder
2 The block diagram of a SCCC encoder
3 An example of a HCCC encoder
4 Repeat–Accumulator code block diagram
5 Block diagram of a product code
6 The iterative turbo decoding block diagram
7 Examples of Tanner graphs: (a) tree (b) with cycles
8 Probabilistic graphs
9 Variable x and its connections in the graph
10 One constraint node and its connections in the graph
11 A tree graph
12 Tanner graph for Hamming code, H(7,4)
13 Tanner graph for regular LDPC (3,5)
14 Convolutional code Tanner graph
15 The Tanner graph of convolutional codes with state variables
16 A trellis section
17 An example of the graph of a PCCC
18 The messages in a convolutional code
19 Block diagram of the SISO
20 Timing diagram of the traditional SISO
21 Message passing between the constituent codes of turbo codes
22 The iterative decoding structure
23 Pipelined turbo decoder
24 Parallel turbo decoder structure
25 Timing diagram of the parallel SISOs
26 Timing diagram of the parallel SISOs in vector notation
27 Partitioned graph of a simple PCCC
28 Parallel turbo decoder with shared processors for two constituent codes
29 Performances of parallel decoder
30 Efficiency and speed gain
31 Efficiency vs. signal to noise ratio
32 (a) Bit sequence in matrix form (b) after row interleaver (c) A conflict-free interleaver (d) Bit sequence in sequential order (e) The conflict-free interleaved sequence
33 Data and extrinsic sequences in two consecutive iterations for turbo decoder with reverse interleaver
34 Sequences in two consecutive iterations for parallel turbo decoder with reverse interleaver
35 Scheduling diagram of the parallel decoder
36 The flowchart of the algorithm
37 Performance comparison for B = 1,024
38 Performance comparison for B = 4,096
39 (a) alpha recursion (b) beta recursion (c) Extrinsic computation
40 Probability density function of messages in different iterations
41 Constituent code model for density evolution
42 Constituent code model for density evolution
43 SNR improvement in iterative decoding
44 Repeat–Accumulator code block diagram
45 Density evolution for RA codes (q = 3)
46 Accumulator with puncturing and its equivalent for p = 3
47 Block diagram of accumulator with puncturing
48 Block diagram of check_4 code and its equivalents
49 Normalized distance spectrum of RA codes with puncturing
50 Density evolution for RA codes with puncturing (q = 4, p = 2)
51 The block diagram of the precoder
52 ARA(3,3) BER performance bound
53 ARA(4,4) BER performance bound
54 Normalized distance spectrum of ARA codes with puncturing
55 Density evolution for ARA codes with puncturing (q = 4, p = 2)
56 Performance of ARA codes using iterative decoding
57 The block diagram of the new precoder
58 Tanner graph for new ARA code
59 Performance of the new ARA code
60 The partitioned graph of ARA code
61 Parallel turbo decoder structure
62 Projected graph
63 Projected graph with conflict-free interleaver
64 A PCCC projected graph with conflict-free interleaver
65 (a) PCCC with 3 component codes (b) SCCC (c) RA(3) (d) IRA(2,3)
66 A parallel LDPC projected graph
67 Simple graphical representation of a LDPC projected graph
68 ARA code without puncturing
69 (a) Rate 1/3 ARA code (b) rate 1/2 ARA code
70 (a) Rate 1/2 ARA code (b) New rate 1/3 ARA code (c) New rate 1/4 ARA code
71 Improved rate 1/2 ARA codes
72 Irregular rate 1/2 ARA codes
73 Irregular ARA code family for rate >1/2
74 Parallel decoder hardware architecture
75 Window processor hardware architecture
List of Tables

I Probability Definitions
II State Constraint
III The Decoder Parameters
IV Characteristic Factors for the Parallel Decoder at SNR = 0.7 dB (BER = 10^-8)
V An Example of the Interleaver
VI Cut-off Thresholds for RA Codes with Puncturing
VII Cut-off Threshold for Rate 1/2 ARA Codes
VIII Cutoff Threshold for ARA Codes with Rate <1/2
IX Cutoff Threshold for Improved ARA Codes with Rate <1/2
X Cutoff Threshold for ARA Codes with Rate >1/2
Acknowledgments
First and foremost, I would like to express my deepest gratitude to my wife for her patience and sacrifices throughout this research. She has been a constant source of assistance, support, and encouragement. My heartfelt thanks go to my parents for their generous love, encouragement, and prayers. Their sacrifices have been my inspiration throughout my career and I am deeply indebted to them in all my successes and accomplishments. Words cannot express the deep feeling of gratitude I have for my family. There are so many people that I would like to thank for making my experience at UCLA truly one of a kind. I would like to thank my advisor Professor Kung Yao for all the help, support, and opportunities he has provided me over these years. His valuable advice, guidance, and unconditional support helped me overcome all the challenges of the doctoral process. I also want to thank Dr. Flavio Lorenzelli for his support and fruitful discussions throughout my research. My special thanks go to Dr. Dariush Divsalar, who has been the motivating force behind my going into the specific field of channel coding. I have gained most of my knowledge in the field from discussions with him. He has had a profound effect on my Ph.D. both as a mentor and a colleague. Finally, I would also like to thank Professor Parviz Jabehdar-Maralani of the University of Tehran for his continued support and encouragement.
Abstract
The advent of turbo codes has sparked tremendous research activity around the theoretical and practical aspects of turbo codes and turbo-like codes. The crucial novelty in these codes is the iterative decoding. In this work, first a novel high-speed turbo decoder is presented that exploits parallelization. Parallelism is achieved very efficiently by exploiting the message-passing algorithm. It is shown that very large speed gains can be achieved by this scheme while the efficiency is maintained reasonably high. Memory access, which poses a practical problem for the proposed parallel turbo decoder, is solved by introducing the conflict-free interleaver. The latency is further improved by designing a special kind of conflict-free interleaver, and an algorithm to design such an interleaver is presented. Simulation results show that the performance of the turbo code is not sacrificed by using an interleaver with the proposed structure. Although turbo codes have near-Shannon-capacity performance and the proposed architecture for the parallel turbo decoder provides very efficient and highly regular hardware, the circuit is still very complex and demanding for very high-speed decoding. Therefore, it becomes necessary to find turbo-like codes that not only achieve excellent error correction capability, but are also very simple. As a result, a class of new codes for different rates and block sizes, called Accumulate–Repeat–Accumulate (ARA) codes, was invented during this search. The performance of ARA codes is analyzed, and it is shown that some ARA codes perform very close to random codes, which achieve the Shannon limit. The architecture for a high-speed ARA decoder is presented and practical issues are discussed. This leads us to a general class of turbo-like codes with parallelism capability, i.e. codes with projected graphs. It is shown that the parallel turbo decoder, discussed earlier, is in the same class. The projected graph provides a powerful and yet simple method for designing parallelizable turbo-like codes.
Chapter 1
Introduction
Efficient and reliable data communication over noisy channels has been pursued for many decades. Applications such as wire-line modems, wireless and satellite communications, Internet data transfer, digital radio broadcasting, and data storage devices are only a few examples that accelerated the development of data communication systems. The issue of efficiency and reliability in communication systems was fundamentally addressed by Shannon [30] in 1948. Shannon introduced the capacity (C) of a noisy channel, which determines the maximum data rate (R) that can be transferred over the channel reliably (i.e. without any error): for any rate R < C there exists a coding scheme with arbitrarily small error probability. The proof is nonconstructive, which means that it gives no method for constructing capacity-approaching codes. The pursuit of capacity-approaching codes took almost 50 years. The introduction of turbo codes by Berrou, Glavieux, and Thitimajshima [9] was a major breakthrough in the world of practical capacity-approaching codes, causing a revolution in the field of error-correcting codes. As a result, several other classes of capacity-approaching codes were rediscovered or invented, including Low-Density Parity-Check (LDPC) codes [14], Repeat–Accumulate (RA) codes [12], and product codes. The trend in data communications is towards high data rate applications, which require high-speed decoding. Although excellent codes have been proposed and their efficient decoding algorithms are known, the design of codes suitable for high-speed applications is still a challenging task. This includes the design of high-speed decoding architectures, as well as of low complexity codes, which are naturally more suitable for parallelism.
1.1 Outline

The advent of turbo codes has sparked tremendous research activity around the theoretical and practical aspects of turbo codes and turbo-like codes. This study introduces different types of turbo codes and turbo-like codes. These codes include the Parallel Concatenated Convolutional Code (PCCC), originally introduced by Berrou [9], Serial Concatenated Convolutional Codes (SCCC), later introduced in [7], RA codes [12], and product codes.
The common property among turbo-like codes is that they consist of very simple constituent codes that are connected to each other with random or pseudorandom interleavers. The crucial novelty in these codes is the iterative decoding: the constituent codes are decoded separately, which is efficient and practically feasible since they are very simple codes, and they then pass new information to each other over the course of a few iterations. It has been shown that iterative decoding is a generalization of the well-known probability or belief propagation algorithm. The belief propagation algorithm, which has been essential for the development of new ideas throughout this work, is described in the context of coding. The basic theorems for this algorithm are explained and proven. This is then followed by a description of the computational algorithm. The probability propagation algorithm is proven in conjunction with tree-structured graphs (graphs without any cycles). In fact, the graphical representation of any problem solved by this algorithm is the centerpiece of the algorithm. The generalization of the algorithm to graphs with cycles is presented later on. Representation of codes on graphs is the next step towards characterizing iterative decoding as an example of the probability propagation algorithm. The graph representations are presented for a few codes that are commonly used in turbo-like codes. In Chapter 3, first the traditional turbo decoder is introduced. Second, a novel high-speed turbo decoder is presented that exploits parallelization; parallelism is achieved very efficiently by exploiting the message-passing algorithm. Finally, two characterization factors for the proposed parallel turbo decoder are investigated: speed gain and efficiency. It is shown by simulations that very large speed gains can be achieved by this scheme while the efficiency is maintained reasonably high. Memory access poses a practical problem for the proposed parallel turbo decoder; the conflict-free interleaver is introduced to address it. The latency is further improved by designing a special kind of conflict-free interleaver, and an algorithm to design such an interleaver is presented. Simulation results show that the performance of the turbo code is not sacrificed by using this interleaver. Hardware complexity, which is of major concern, is investigated and the overall architecture of the hardware for the high-speed turbo decoder is presented. Although turbo codes have near-Shannon-capacity performance and the proposed architecture for the parallel turbo decoder gives very efficient and highly regular hardware, the circuit is still very complex and demanding for very high-speed decoding. Therefore, Chapter 4 examines simple turbo-like codes that can achieve excellent error correction capability. In order to investigate the performance of these turbo-like codes, some useful analysis tools are used: some maximum likelihood (ML) performance bounds, which evaluate the performance of codes under ML decoding, are briefly explained, and the density evolution (DE) method, which analyzes the behavior of turbo-like codes under iterative decoding, is also described.
The RA code provided valuable insight in the search for low-complexity turbo-like codes. A class of new codes for different rates and block sizes, called ARA codes, was invented during this search. The performance of ARA codes is analyzed, and some are shown to perform very close to random codes, which achieve the Shannon limit. The simplicity of ARA codes allows us to build a high-speed ARA decoder with very low complexity hardware. Chapter 5 first presents the architecture for a high-speed ARA decoder and then discusses practical issues. This leads us to a general structure for turbo-like codes with parallelization capability: the concept of the projected graph. In fact, codes with projected graphs comprise a class of turbo-like codes, and the parallel turbo decoder discussed earlier is in the same class. The projected graph provides a powerful and yet simple method for designing parallelizable turbo-like codes, which is used in future research. Finally, future research directions are discussed in light of this study's findings.
Chapter 2

Turbo Concept

2.1 Turbo Codes and Turbo-like Codes

2.1.1 Turbo Codes
The turbo code introduced by Berrou [9] is a PCCC, which consists of two or more convolutional codes encoding different permuted versions of a block of information. A block diagram of a PCCC turbo encoder is shown in Figure 1. Each convolutional code is called a constituent code. The constituent codes may be the same or different, systematic or nonsystematic; however, they should be of recursive type in order to achieve good performance. In most cases C0 is a systematic code, which means that the input sequence is transmitted along with the coded sequences; the overall code is then systematic too. If the code rates of the constituent codes are r0, r1, ..., rn, then the overall rate is r0 || r1 || ... || rn; the rates combine like parallel resistors, i.e. 1/r = 1/r0 + 1/r1 + ... + 1/rn. For example, a rate-1/2 systematic code in parallel with a rate-1 code gives an overall rate of 1/3. Later, SCCCs were introduced in [7]; a block diagram of an SCCC is drawn in Figure 2. The block of information is encoded by the first constituent code, the output is interleaved and encoded by the second constituent code, and so on; the output of the last stage is sent over the communication channel. If the code rates of the constituent codes are r1, r2, ..., rn, the overall rate is r1 × r2 × ... × rn. To obtain a systematic code, all constituent codes should be systematic. We can combine PCCC and SCCC codes to come up with various Hybrid Concatenated Convolutional Codes (HCCCs); such an HCCC is shown in Figure 3. We can extend the above-described turbo codes to obtain more general codes. In the above codes all the constituent codes are convolutional; if we remove this limitation and let them be arbitrary block codes, we obtain a broader class of codes called turbo-like codes. Some examples of turbo-like codes are given in the sequel.
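As a quick illustration of the two rate formulas above, here is a minimal Python sketch (the function names are mine, not the book's):

```python
def pccc_rate(rates):
    """Overall rate of a parallel concatenation: the constituent
    rates combine like parallel resistors, 1/r = sum(1/r_i)."""
    return 1.0 / sum(1.0 / r for r in rates)

def sccc_rate(rates):
    """Overall rate of a serial concatenation: the rates multiply."""
    result = 1.0
    for r in rates:
        result *= r
    return result

# A rate-1/2 systematic code in parallel with a rate-1 code gives the
# rate-1/3 turbo code used later in the simulation section.
print(pccc_rate([0.5, 1.0]))   # 0.333...
print(sccc_rate([0.5, 0.5]))   # 0.25
```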
Fig. 1 The block diagram of a PCCC encoder

Fig. 2 The block diagram of a SCCC encoder

Fig. 3 An example of a HCCC encoder

2.1.2 Repeat–Accumulate Codes

Perhaps the simplest type of turbo-like code is the RA code, which makes it very attractive for analysis. Its general block diagram is given in Figure 4. It is a serial concatenation of two constituent codes: a repetition code and an
accumulate code. An information block of length N is repeated q times and interleaved to make a block of size qN, which is then passed through an accumulator. Using the rate formula for serial concatenated codes, the code rate of an RA code is 1/q.

Fig. 4 Repeat–Accumulator code block diagram
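A minimal encoder sketch of this repeat/interleave/accumulate chain (Python; the random-permutation default stands in for the interleaver, which in practice is a fixed, designed permutation):

```python
import random

def ra_encode(u, q, perm=None):
    """Repeat-Accumulate encoding: repeat each bit q times, interleave,
    then accumulate (running modulo-2 sum). Output length is q*len(u),
    so the code rate is 1/q."""
    repeated = [b for b in u for _ in range(q)]        # rep(q)
    if perm is None:                                   # a random interleaver
        perm = list(range(len(repeated)))
        random.shuffle(perm)
    interleaved = [repeated[i] for i in perm]          # I
    out, acc = [], 0
    for b in interleaved:                              # ACC: c[n] = c[n-1] + b[n] mod 2
        acc ^= b
        out.append(acc)
    return out

print(ra_encode([1, 0, 1], q=3))                       # 9 coded bits, rate 1/3
```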
2.1.3 Product Codes

The serial concatenation of two block codes with a row–column interleaver results in a product code. In fact, each block code consists of several identical smaller block codes that form the rows/columns of the code word. A block diagram of this code is shown in Figure 5.

Fig. 5 Block diagram of a product code
2.2 Iterative Decoding
The crucial novelty in turbo codes is the introduction of iterative decoding. The other key component is the random or pseudorandom interleaver, which is discussed later. By exploiting these concepts, turbo codes achieve excellent performance with moderate complexity. The iterative decoding algorithm is based on maximum a posteriori (MAP) estimation of the input sequence. However, since it is difficult to find the MAP solution by considering all the observations at the same time, MAP decoding is performed on the observations of each constituent code separately. This is explained here for a PCCC with two constituent codes. Since two codes have been produced from one input sequence, the a posteriori probability (APP) of the data bits coming from the first constituent decoder can be used by the second decoder and vice versa; the decoding process is therefore carried out iteratively. At the beginning we do not have any information about the input sequence, so the MAP decoding of the first constituent code is performed without any prior knowledge of it. This process generates APPs of the input sequence bits that can be exploited by the second decoder. The information passed to the other constituent decoder is called extrinsic information. The BCJR algorithm [5] efficiently and recursively computes the APPs for a convolutional code. In [6] a general unit, called the SISO, is introduced that generates the APPs in the most general case. Chapter 3 explains the BCJR algorithm in terms of decoding on graphs with a tree structure. Since the second constituent code uses the permuted version of the input sequence, the extrinsic information should also be permuted before being used by the second decoder. Likewise, the extrinsic information of the second decoder is permuted in reverse order for the next iteration of the first decoder. The iterative decoding block diagram is shown in Figure 6.
Fig. 6 The iterative turbo decoding block diagram
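A high-level sketch of this extrinsic exchange, with the two SISO decoders assumed as black boxes (each would internally run the BCJR algorithm of Chapter 3; all names here are illustrative):

```python
def iterative_decode(y1, y2, siso1, siso2, interleave, deinterleave,
                     n_iterations=10):
    """One common scheduling of Figure 6: each SISO consumes its channel
    observations plus the other decoder's extrinsics and produces new
    extrinsics. siso1/siso2: (observations, extrinsics) -> extrinsics."""
    ext = [0.0] * len(y1)                   # no a priori information at start
    for _ in range(n_iterations):
        ext1 = siso1(y1, ext)               # decode first constituent code
        ext2 = siso2(y2, interleave(ext1))  # second code sees permuted bits
        ext = deinterleave(ext2)            # permute back for the next pass
    return ext                              # final reliabilities for decisions
```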
2.3 Probability Propagation Algorithms

It has been shown that iterative decoding is a generalization of the well-known probability or belief propagation algorithm. This algorithm was developed in the artificial intelligence and expert systems literature, most notably by Pearl [25] and Lauritzen and Spiegelhalter [19]. The connection between Pearl's belief propagation algorithm and coding was first discovered by MacKay and Neal [22, 23], who showed that the Gallager algorithm [14] for decoding LDPC codes is essentially an instance of the belief propagation algorithm. McEliece et al. [24] independently showed that turbo decoding is also an instance of belief propagation. We describe this algorithm in a way that is more suitable for coding applications. If we have some variables that are related to each other, there is a bipartite graph representation showing their dependence, which is called the Tanner graph [31]. The nodes are divided between variable nodes (circles) and constraint nodes (squares). Two examples of such graphs are shown in Figure 7. If we have some noisy observation of the variables, the graph becomes a probabilistic graph. In some communication systems some of the variables are not sent through the channel, so no observation is available for them at the receiver side; we should therefore distinguish between observed and unobserved variables. Two examples of probabilistic graphs are shown in Figure 8. We denote all variables with the vector x and their observations with the vector y. Some useful definitions are listed in Table I.
Fig. 7 Examples of Tanner graphs: (a) tree, (b) with cycles

Fig. 8 Probabilistic graphs: (a) tree, (b) loopy. (Squares denote constraint nodes; circles denote unobserved and observed variable nodes.)
The probability propagation algorithm is proven for graphs with a tree structure. Given a graphical code model, the probability propagation algorithm computes the APPs of the variables very efficiently. However, for graphs with cycles there is no known efficient algorithm for computing the exact APPs, which makes exact decoding practically infeasible. The probability propagation algorithm can be extended to graphs with cycles by proceeding with the algorithm as if there were no cycles in the graph. Although there is no solid proof of convergence in this case, it usually gives excellent results in practice; the performance of turbo codes is a testimony to the effectiveness of the algorithm. The following theorems are very helpful in explaining the algorithm.

Theorem 1: In a probabilistic tree, the likelihood of each variable can be decomposed into the local likelihood and some independent terms, which are called messages, or extrinsics in the case of turbo decoding.

We prove this theorem using the graph shown in Figure 9, where we have lumped all the variables connecting to each edge into one vector and denoted them and their observations as $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and $\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3$; i.e. $\mathbf{x} = \{x, \mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3\}$ and $\mathbf{y} = \{y, \mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3\}$. The likelihood of variable x is determined by marginalization:

$$P(\mathbf{y} \mid x) = \sum_{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3} P(\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{y}_2, \mathbf{x}_3, \mathbf{y}_3 \mid x)\, P(y \mid x) \qquad (1)$$

Table I Probability Definitions

Definition                 Probability
Likelihood of $\mathbf{x}$   $P(\mathbf{y} \mid \mathbf{x})$
Likelihood of $x$            $P(\mathbf{y} \mid x)$
Local likelihood of $x$      $P(y \mid x)$
Local reliability of $x$     $P(x, y) = P(y \mid x)P(x)$
Reliability of $x$           $P(x, \mathbf{y}) = P(\mathbf{y} \mid x)P(x)$
A Posteriori Probability (APP) of $x$   $P(x \mid \mathbf{y}) = P(x, \mathbf{y})/P(\mathbf{y})$

Fig. 9 Variable x and its connections in the graph

This can be simplified as follows:

$$P(\mathbf{y} \mid x) = \Big[\sum_{\mathbf{x}_1} P(\mathbf{x}_1, \mathbf{y}_1 \mid x)\Big]\Big[\sum_{\mathbf{x}_2} P(\mathbf{x}_2, \mathbf{y}_2 \mid x)\Big]\Big[\sum_{\mathbf{x}_3} P(\mathbf{x}_3, \mathbf{y}_3 \mid x)\Big] P(y \mid x) \qquad (2)$$

$$P(\mathbf{y} \mid x) = P(\mathbf{y}_1 \mid x)\, P(\mathbf{y}_2 \mid x)\, P(\mathbf{y}_3 \mid x)\, P(y \mid x) \qquad (3)$$

$$P(\mathbf{y} \mid x) = \mathrm{IM}_1(x)\, \mathrm{IM}_2(x)\, \mathrm{IM}_3(x)\, P(y \mid x)$$

In the above formula we have three incoming messages (IM), each one coming from one part of the graph, i.e. a subgraph. Since the variable is connected to each subgraph by one edge, there is a correspondence between the IMs and the edges connected to a variable node; in other words, the IMs are communicated along the edges of the graph towards the variable nodes. As the mathematical expression shows, an incoming message is the likelihood of the variable x given only the observations connected to x via that edge. Since the graph is a tree, for every variable node a decomposition of the graph into disjoint subgraphs is always possible. That proves Theorem 1. Furthermore, we define the outgoing messages (OM) as follows:

$$\mathrm{OM}_1(x) = P(\mathbf{y}_2, \mathbf{y}_3, y \mid x) = \mathrm{IM}_2(x)\, \mathrm{IM}_3(x)\, P(y \mid x) \qquad (4)$$

$$\mathrm{OM}_2(x) = P(\mathbf{y}_1, \mathbf{y}_3, y \mid x) = \mathrm{IM}_1(x)\, \mathrm{IM}_3(x)\, P(y \mid x) \qquad (5)$$

$$\mathrm{OM}_3(x) = P(\mathbf{y}_1, \mathbf{y}_2, y \mid x) = \mathrm{IM}_1(x)\, \mathrm{IM}_2(x)\, P(y \mid x) \qquad (6)$$

Therefore, we have

$$P(\mathbf{y} \mid x) = \mathrm{IM}_1(x)\, \mathrm{OM}_1(x) = \mathrm{IM}_2(x)\, \mathrm{OM}_2(x) = \mathrm{IM}_3(x)\, \mathrm{OM}_3(x) \qquad (7)$$

These messages are communicated along the edges emanating from variable x towards the constraint nodes. Because every edge of the graph connects one variable node to one constraint node, there are two messages associated with each edge: one incoming and one outgoing. The message indices can therefore be regarded as edge indices. We will use these messages in Theorem 2.
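For a binary variable node, equations (4)-(6) amount to elementwise products of likelihood pairs. A minimal sketch, representing each message as a (P(.|x=0), P(.|x=1)) tuple (a choice of mine; in practice likelihood ratios would be stored instead, as the text notes below):

```python
def variable_node_messages(local, incoming):
    """Outgoing messages of a binary variable node (equations (4)-(6)):
    the OM on edge k is the product of the local likelihood and all
    incoming messages except IM_k."""
    outgoing = []
    for k in range(len(incoming)):
        m0, m1 = local
        for j, (i0, i1) in enumerate(incoming):
            if j != k:
                m0 *= i0
                m1 *= i1
        outgoing.append((m0, m1))
    return outgoing

# Uninformative incoming messages leave every outgoing message
# proportional to the local likelihood, consistent with equation (7).
print(variable_node_messages((0.8, 0.2), [(0.5, 0.5)] * 3))
```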
Fig. 10 One constraint node and its connections in the graph

It should be noted that messages are functions of the variable. If the variable is quantized, the messages must be computed for all levels of the variable; for binary variables, for example, the message includes the likelihood of being 0 and of being 1. In practice, however, the likelihood ratio is used to simplify the operations and to save storage.

Theorem 2: For each constraint node, the IM on one edge can be computed from the OMs on the other edges connected to the constraint node.

We prove this theorem using the graph shown in Figure 10, where we have again lumped all the variables connecting to each edge into one vector and denoted them and their observations as $\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3$ and $\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3$. Each edge is directly connected to one variable, denoted $x_1 \in \mathbf{x}_1$, $x_2 \in \mathbf{x}_2$, and $x_3 \in \mathbf{x}_3$; therefore $\mathbf{x}_1 = \{x_1, \mathbf{x}'_1\}$, $\mathbf{x}_2 = \{x_2, \mathbf{x}'_2\}$, and $\mathbf{x}_3 = \{x_3, \mathbf{x}'_3\}$. The likelihood of variable x is determined by marginalization:

$$
\begin{aligned}
\mathrm{IM}(x) &= P(\mathbf{y}_1, \mathbf{y}_2, \mathbf{y}_3 \mid x)
= \sum_{\mathbf{x}_1, \mathbf{x}_2, \mathbf{x}_3} P(\mathbf{x}_1, \mathbf{y}_1, \mathbf{x}_2, \mathbf{y}_2, \mathbf{x}_3, \mathbf{y}_3 \mid x) \\
&= \sum_{x_1, x_2, x_3} \Big[\sum_{\mathbf{x}'_1, \mathbf{x}'_2, \mathbf{x}'_3} P(\mathbf{x}'_1, \mathbf{y}_1, \mathbf{x}'_2, \mathbf{y}_2, \mathbf{x}'_3, \mathbf{y}_3 \mid x_1, x_2, x_3)\Big]\, P(x_1, x_2, x_3 \mid x) \\
&= \sum_{x_1, x_2, x_3} \Big[\sum_{\mathbf{x}'_1} P(\mathbf{x}'_1, \mathbf{y}_1 \mid x_1)\Big] \Big[\sum_{\mathbf{x}'_2} P(\mathbf{x}'_2, \mathbf{y}_2 \mid x_2)\Big] \Big[\sum_{\mathbf{x}'_3} P(\mathbf{x}'_3, \mathbf{y}_3 \mid x_3)\Big]\, P(x_1, x_2, x_3 \mid x) \\
&= \sum_{x_1, x_2, x_3} P(\mathbf{y}_1 \mid x_1)\, P(\mathbf{y}_2 \mid x_2)\, P(\mathbf{y}_3 \mid x_3)\, P(x_1, x_2, x_3 \mid x) \\
&= \sum_{x_1, x_2, x_3} \mathrm{OM}_1(x_1)\, \mathrm{OM}_2(x_2)\, \mathrm{OM}_3(x_3)\, P(x_1, x_2, x_3 \mid x)
\end{aligned}
\qquad (8)
$$

As we see, the IM is obtained by marginalizing the appropriately weighted product of the OMs. The weight function explicitly shows the effect of the constraint.
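Equation (8) for binary variables can be sketched directly; the predicate interface below is my own choice, with a single parity check as the example constraint:

```python
from itertools import product

def check_node_messages(outgoing, allowed):
    """Incoming messages produced by a constraint node (equation (8)):
    IM_k(v) sums, over all edge configurations permitted by the
    constraint with bits[k] = v, the product of the other edges' OMs.
    'allowed' is a predicate on the tuple of edge bits."""
    n = len(outgoing)
    incoming = []
    for k in range(n):
        im = [0.0, 0.0]
        for bits in product((0, 1), repeat=n):
            if allowed(bits):
                w = 1.0
                for j in range(n):
                    if j != k:
                        w *= outgoing[j][bits[j]]
                im[bits[k]] += w
        incoming.append(tuple(im))
    return incoming

parity = lambda bits: sum(bits) % 2 == 0        # single parity-check constraint
print(check_node_messages([(0.9, 0.1)] * 3, parity))
```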
2.4 Message-passing Algorithm

Theorems 1 and 2 provide the basic operations needed to compute the reliability of all the variables in the graph in an efficient way. The algorithm is called message passing, and it is essentially a marginalization algorithm. There are many variants of the algorithm that differ only in the scheduling of the computations. Two important versions are described in the sequel.

Efficient schedule: In this schedule the algorithm starts from the OMs of the leaf vertices of the graph, which are simply the local likelihoods of the variables. Messages propagate from the leaves toward the inside of the graph and then back towards the leaves, this time as IMs. Each message is computed only when all the messages it requires are ready. The order of message computation for the graph shown in Figure 11 is:

OM0, OM1, OM2, OM3, OM4, OM5, OM6
IM7, IM8, IM9
OM10, OM11, OM12
IM0, IM10, IM11, IM12
OM7, OM8, OM9
IM1, IM2, IM3, IM4, IM5, IM6

Fig. 11 A tree graph
Flooding schedule: In this schedule all messages are initialized with the local likelihoods of the variables. In each step all the messages are computed, regardless of the location of the edge in the graph and the status of the other messages. Although this schedule is not efficient, it is the fastest algorithm that can be conceived: after several steps all the messages converge to their correct values. The maximum number of steps is the depth of the graph; however, in large graphs the number of steps needed for a given accuracy of the messages is much less than the depth of the graph. The scheduling is shown in the following, where the bold messages are those that have reached their final values.

OM0, OM1, OM2, OM3, OM4, OM5, OM6, OM7, OM8, OM9, OM10, OM11, OM12
IM0, IM1, IM2, IM3, IM4, IM5, IM6, IM7, IM8, IM9, IM10, IM11, IM12
OM0, OM1, OM2, OM3, OM4, OM5, OM6, OM7, OM8, OM9, OM10, OM11, OM12
IM0, IM1, IM2, IM3, IM4, IM5, IM6, IM7, IM8, IM9, IM10, IM11, IM12
OM0, OM1, OM2, OM3, OM4, OM5, OM6, OM7, OM8, OM9, OM10, OM11, OM12
IM0, IM1, IM2, IM3, IM4, IM5, IM6, IM7, IM8, IM9, IM10, IM11, IM12
2.5 Graphs with Cycles
Although the message-passing algorithm is proven only for cycle-free graphs, it has been used for graphs with cycles as well. The success of turbo codes shows the importance of extending this algorithm to graphs with cycles. All good turbo-like codes have graphs with cycles; in fact, they are very rich in cycles, which makes the codes more powerful. However, it is well known that short cycles should be avoided in order to have good performance. In essence, the message-passing algorithm does not change for this class of graphs: it simply ignores the presence of loops, and the messages propagate through the graph based on the same rules, i.e. Theorems 1 and 2. Because of the loops, the effect of one observation can be fed back to itself after some steps; i.e. the messages passed along a loop come back to their origin. This creates some correlation between messages, which makes Theorems 1 and 2 no longer valid. To alleviate this effect, short cycles are removed from the graph; in that case the effect of one variable on its return message is well attenuated. Different scheduling can be used in the message-passing algorithm. The efficient schedule is not applicable here because it relies on the graph having a tree structure; however, some schedules are more efficient than others. The flooding schedule can be used for very fast, low-latency applications. We usually stop the message passing when the messages are well propagated.
Fig. 12 Tanner graph for Hamming code, H(7,4)
2.6 Codes on Graph

The natural setting for codes decoded by iterative algorithms is their graph representation. In this section the graph representations of some commonly used codes are illustrated and explained.

2.6.1 Parity-check Codes

Parity-check codes are binary linear block codes; the parity-check matrix determines the constraints between the binary variables. As an example, the parity-check matrix of the Hamming code H(7,4) is

$$H = \begin{bmatrix} 1 & 0 & 1 & 1 & 1 & 0 & 0 \\ 1 & 1 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 0 & 1 & 0 & 0 & 1 \end{bmatrix} \qquad (9)$$

The Tanner graph for the Hamming code H(7,4) is shown in Figure 12. LDPC codes are those codes that have a very sparse parity-check matrix. A regular LDPC code, first introduced by Gallager [14], has a fixed number of nonzero elements in each row and column of the matrix. The Tanner graph for a regular LDPC code with variable nodes of degree 3 and check nodes of degree 5 is shown in Figure 13.
Fig. 13 Tanner graph for regular LDPC (3,5)
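The parity checks of (9) can be exercised directly; a small NumPy sketch (the function name is illustrative):

```python
import numpy as np

H = np.array([[1, 0, 1, 1, 1, 0, 0],
              [1, 1, 1, 0, 0, 1, 0],
              [1, 1, 0, 1, 0, 0, 1]])

def satisfies_checks(c):
    """A word c is a codeword of H(7,4) iff every check node is
    satisfied, i.e. H c^T = 0 (mod 2)."""
    return not np.any((H @ np.asarray(c)) % 2)

print(satisfies_checks([0, 0, 0, 0, 0, 0, 0]))   # True: all-zero word
print(satisfies_checks([1, 0, 0, 0, 0, 0, 0]))   # False: violates all 3 checks
```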
Fig. 14 Convolutional code Tanner graph
Luby et al. [20, 21] found that using irregular LDPC codes and optimizing the degree distributions of the variable and check nodes gives far superior codes. Building on the analytical techniques developed by Luby et al., Richardson et al. [27, 28] designed long irregular LDPC codes that practically achieve the Shannon limit.
2.6.2 Convolutional Codes

In convolutional codes, the information bits go through a filter: each code bit is derived from a few previous information and code bits, with the dependency defined by the generator polynomials. For example, for a recursive convolutional code with feedback polynomial Den(D) = 1 + D + D^2 and forward polynomial Num(D) = 1 + D^2, the following constraint holds for all n:

$$c[n] + c[n-1] + c[n-2] + u[n] + u[n-2] = 0, \qquad u[-1] = u[-2] = 0, \quad c[-1] = c[-2] = 0 \qquad (10)$$

where u[n] is the nth information bit, c[n] is the nth code bit, and the plus sign denotes modulo-2 addition. These constraints are depicted in Figure 14. However, this graph representation has many short cycles (of length 4), which is undesirable for the message-passing algorithm. With the introduction of state variables we can eliminate all the loops. State variables are fictitious variables that are not part of the code word; i.e. they are not sent over the channel. The size of a state variable is (K − 1) bits, where K is the constraint length of the convolutional code; i.e. it takes values between 0 and 2^(K−1) − 1. The graph of the convolutional code for an 8-bit information block with the state variables is drawn in Figure 15.
Fig. 15 The Tanner graph of convolutional codes with state variables
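Constraint (10) can be implemented directly by solving it for c[n] (a sketch; since the code is systematic, the transmitted stream would be u together with c):

```python
def rsc_parity(u):
    """Parity sequence of the example recursive code with
    Den(D) = 1 + D + D^2 and Num(D) = 1 + D^2, obtained by solving
    (10) for c[n]: c[n] = c[n-1] + c[n-2] + u[n] + u[n-2] (mod 2),
    with u[-1] = u[-2] = c[-1] = c[-2] = 0."""
    c = []
    for n, un in enumerate(u):
        c1 = c[n - 1] if n >= 1 else 0
        c2 = c[n - 2] if n >= 2 else 0
        u2 = u[n - 2] if n >= 2 else 0
        c.append((c1 + c2 + un + u2) % 2)
    return c

print(rsc_parity([1, 0, 0, 0, 0]))   # impulse response of the filter
```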
Table II State Constraint

S[n]  u[n]  c[n]  S[n+1]
0     0     0     0
0     1     1     2
1     0     1     0
1     1     0     2
2     0     0     1
2     1     1     3
3     0     1     1
3     1     0     3

The state constraint shows the relationship between the code word bits and the state variables. For the convolutional code in the previous example, the state constraint only allows the cases listed in Table II. The state constraint can also be described by the trellis section of the convolutional code, shown in Figure 16, where the label on each edge is u/uc.

Fig. 16 A trellis section
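Table II can be used as a lookup table to walk the trellis (a sketch; the dictionary below is a direct transcription of Table II, and each transition corresponds to one edge label u/uc of Figure 16):

```python
# (S[n], u[n]) -> (c[n], S[n+1]), transcribed from Table II
TRELLIS = {
    (0, 0): (0, 0), (0, 1): (1, 2),
    (1, 0): (1, 0), (1, 1): (0, 2),
    (2, 0): (0, 1), (2, 1): (1, 3),
    (3, 0): (1, 1), (3, 1): (0, 3),
}

def encode_on_trellis(u, state=0):
    """Walk the trellis from the all-zero state, emitting one code
    bit per information bit."""
    c = []
    for bit in u:
        out, state = TRELLIS[(state, bit)]
        c.append(out)
    return c, state

print(encode_on_trellis([1, 1, 0, 1]))
```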
Fig. 17 An example of the graph of a PCCC

2.6.3 Turbo Codes

The graph of a turbo code is derived by connecting the graphs of its constituent convolutional codes. The graphical representation of a rate 1/3 PCCC with a block size of 8 is shown in Figure 17. The graphs of other codes can be drawn the same way. It should be noted that although the graph of each constituent convolutional code is loopless, the overall graph has many loops.
Chapter 3

High-speed Turbo Decoders

3.1 Introduction
The graphical representation of a code gives the natural setting for iterative decoding, which is basically the message-passing algorithm. In order to grasp the high-speed turbo decoding architecture, it is essential to first appreciate the traditional approach to turbo decoding. In traditional turbo decoding each constituent code is processed one at a time. The algorithm for efficiently computing the APPs of the bits is called the BCJR algorithm, after its inventors [5]. It presents an efficient way to compute the APP of the bits for any convolutional code, and in [4] it is shown that the BCJR algorithm is indeed the message-passing algorithm. Here we briefly describe the structure of this algorithm and relate it to the message-passing algorithm.
3.2 BCJR Algorithm

The three main steps of this algorithm are as follows:

Forward Recursion (FR): In this step we compute the likelihood of all the states in the trellis given the past observations. Starting from a known state, we go forward along the trellis and compute the likelihood of all the states in one trellis section from the likelihood of the states in the previous trellis section. This recursion continues until the likelihoods of all the states, which are called alpha variables, have been computed in the forward direction. The forward recursion is exactly equivalent to computing the forward messages, as shown in Figure 18. Starting from the first state, which is a leaf vertex, forward messages are computed recursively from the observations and the previous forward messages. Since the state variables are not observed from the channel, the IM and OM for each state variable are the same; hence, only one of them is drawn in Figure 18. It is instructive to note that the forward messages are the likelihoods of the states given the past observations. In the sequel we use alpha variables and forward messages interchangeably.

Backward Recursion (BR): This step is quite similar to the forward recursion. Starting from a known state at the end of the block, we compute the likelihood of the previous states in one trellis section. We thereby compute the likelihood of all the states in the trellis given the future observations; these are called beta variables. This recursion continues until the beginning of the trellis. The backward recursion is exactly equivalent to computing the backward messages shown in Figure 18. Starting from the last state, which is a leaf vertex, backward messages are computed recursively from the observations and the previous backward messages. The backward messages are the likelihoods of the states given the future observations. We use beta variables and backward messages interchangeably.

Fig. 18 The messages in a convolutional code

Output Computation (OC): Once the forward and backward likelihoods of the states are computed, the extrinsic information can be computed from them. The extrinsic information can be viewed as the likelihood of each bit given all the observations. Output computation is equivalent to computing the IMs for the code word bits; these messages match the definition of extrinsic information. As we observe in Figure 18, the likelihoods of both the input and output bits of a convolutional code are computed. However, depending on the connection between the constituent codes in a turbo code, some of the extrinsics might not be used. Benedetto et al. [6] introduced a general unit, called the SISO, which generates the APPs in the most general case. The SISO is a block that implements the BCJR algorithm. The inputs to the SISO block are the observations (r1 or r2), the initial values for the alpha and beta variables (α0 and βN), and the extrinsics coming from other SISOs. The outputs are the alpha and beta variables at the end of the forward and backward recursions, which are not used any more, and the new extrinsics that are passed to the other SISO. The block diagram of a SISO for a convolutional code of length N is sketched in Figure 19.

Fig. 19 Block diagram of the SISO

In the traditional realization of the SISO, the timing schedule for the three steps is as follows. The backward recursion is done completely for the entire block and all beta variables are stored in memory. Then the forward recursion starts from the first trellis section and computes the alpha variables one by one. Since at this point both the alpha and beta variables are available for the first trellis section, the extrinsic for the first bit is computed at this time; the extrinsic computation therefore proceeds along with the forward recursion. The sequence of variables in time is shown in Figure 20, where alpha and beta variables are denoted by a and b, and the incoming and outgoing extrinsics by y and x. We could exchange the order in which the forward and backward recursions are done; however, that schedule outputs the results in reverse order.

Backward: y_{N-1} y_{N-2} ... y_1 y_0 ; b_N b_{N-1} ... b_2 b_1 b_0
Forward:  y_0 y_1 ... y_{N-2} y_{N-1} ; a_0 a_1 ... a_{N-2} a_{N-1} a_N
Output:   x_0 x_1 ... x_{N-2} x_{N-1}

Fig. 20 Timing diagram of the traditional SISO
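A minimal numerical sketch of the two recursions (the branch-weight array gamma is assumed to be precomputed from the observations and extrinsics; the output computation, which combines alpha, gamma, and beta for each bit, is omitted):

```python
import numpy as np

def bcjr_recursions(gamma, alpha0, betaN):
    """Forward/backward recursions of the BCJR algorithm over a trellis
    with S states and N sections. gamma[n, s, s2] is the branch weight
    of section n from state s to state s2; alpha0 and betaN are the
    initial forward and backward state likelihoods."""
    N, S, _ = gamma.shape
    alpha = np.zeros((N + 1, S))
    beta = np.zeros((N + 1, S))
    alpha[0], beta[N] = alpha0, betaN
    for n in range(N):                        # forward recursion (FR)
        a = alpha[n] @ gamma[n]
        alpha[n + 1] = a / a.sum()            # normalize against underflow
    for n in range(N, 0, -1):                 # backward recursion (BR)
        b = gamma[n - 1] @ beta[n]
        beta[n - 1] = b / b.sum()
    return alpha, beta
```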
3.3 Turbo Decoding
Once two or more convolutional codes are connected together with interleavers, turbo codes are produced. Decoding is performed by passing messages between the constituent codes. Each constituent code receives the incoming extrinsics and computes new extrinsics using the BCJR algorithm; the new extrinsics are then used by the other constituent codes. Figure 21 sketches the message passing between the constituent codes for a simple PCCC and SCCC. The decoding starts from one constituent code and proceeds to the other constituent codes. This process is considered one iteration, and it takes several iterations to obtain very good performance for the bit decisions.

Fig. 21 Message passing between the constituent codes of turbo codes

Turbo codes are designed such that the probability density of the extrinsics shifts toward higher values in each iteration. This phenomenon is known as DE, which is also used to compute the capacity of codes under iterative decoding. The decisions are made based on the messages on the information bits; hence, the decisions get better and better with more iterations. Since the second constituent code uses the permuted version of the bit sequence, the extrinsic information should also be permuted before being used by the second SISO. Likewise, the extrinsic information of the second SISO is permuted in reverse order for the next iteration of the first SISO. The structure of the iterative decoder is shown in Figure 22. A SISO block processes the observations serially and outputs the extrinsics serially; hence, in the sequel the traditional iterative turbo decoding is referred to as serial decoding. It should be noted that only one SISO is working at a time. Without any loss of generality, only the simple PCCC, which is the most popular turbo code, is investigated from now on. The methods described here can be applied to all other turbo codes with simple modifications.

Fig. 22 The iterative decoding structure
3.4 Pipelined Turbo Decoder
One possible way to speed up the decoding is to perform the iterations in a pipelined fashion. In this method there is one exclusive SISO for each constituent code at each iteration: the constituent codes are decoded and the results passed to the next stage for further iterations. The block diagram of such a decoder is drawn in Figure 23. All the SISOs run at the same time, but work on different received blocks of coded data. In this method the decoding rate is increased by a factor of 2I, where I is the number of iterations. Although we have achieved some speed gain, the latency remains the same: to get the bit decisions we have to wait until all the stages are finished for a block of data.
Fig. 23 Pipelined turbo decoder
The increase in the decoding rate comes at the expense of more hardware. As we see in Figure 23, the hardware needed for the decoding is also increased 2I times. This hardware consists of the memory to store the observations and extrinsics and the logic for the SISO computations. The method has another disadvantage: the number of iterations is fixed and cannot be changed, whereas the ability to adapt the number of iterations is essential for power reduction. Therefore this method is not interesting for high-speed turbo decoding.
3.5 Parallel Turbo Decoder
In this section, we present a novel method for iterative decoding of turbo codes that can be used for very high-speed decoders. This method was first introduced by Abbasfar and Yao [3]. Although the method is applicable to every turbo code, we explain it for a block PCCC code. The algorithm is as follows. First, the received data for each constituent code are divided into several contiguous nonoverlapping sub-blocks, called windows. Then each window is decoded separately, in parallel, using the BCJR algorithm; in other words, each window processor is a vector decoder. The initial values for the alpha and beta variables come from the previous iteration of the adjacent windows. Since all the windows are processed at the same time, in the next iteration the initial values for all of them are ready to load. Moreover, no extra processing is needed to initialize the state probabilities at each iteration. The size of the windows is a very important parameter that is discussed later. The structure of the decoder is shown in Figure 24. The timing diagram of the messages for one constituent code is shown in Figure 25; it can be simplified by using the vector notation shown in Figure 26, where the variables computed at the same time are simply replaced with a vector. Each vector has M elements, which belong to different window processors (SISOs); for example, a0 = [a_0 a_N a_2N ... a_{MN-N}]^T and b0 = [b_0 b_N b_2N ... b_{MN-N}]^T. This notation is a generalization of the serial decoder, and it also helps in appreciating the new interleaver structure for the parallel decoder discussed later. The proposed structure stems from the message-passing algorithm itself: we have only partitioned the graph into subgraphs and used parallel scheduling

Fig. 24 Parallel turbo decoder structure
for different partitions. Partitioning helps us to parallelize the decoding of one constituent code. The graph of a PCCC and its partitions is shown in Figure 27. There are two types of messages that are communicated between sub-blocks. First, the messages associated with the information bits, i.e. the extrinsic information, which is communicated between the two constituent codes as in the traditional approach. Second, the messages related to the states at the window boundaries, which we call state messages. In effect we have introduced new messages that are passed between sub-blocks at each iteration; these messages are the same as the alpha and beta variables computed in the forward and backward recursions of the BCJR algorithm. In the first iteration there is no prior knowledge available about the state probabilities, so these messages are set to equal probability for all the states. In each iteration, the messages are updated and passed across the border of adjacent partitions.

Fig. 25 Timing diagram of the parallel SISOs

Fig. 26 Timing diagram of the parallel SISOs in vector notation

The optimum way to process a window is serial processing using the forward and backward recursions, i.e. the BCJR algorithm; therefore each window processor is a SISO. The processing of the windows in the two constituent codes can run in parallel. However, when we discuss the interleaver for the parallel decoder, we will find that this is not necessary. In a parallel turbo decoder, when the constituent codes are the same we can share the SISO blocks among all constituent codes; the chosen decoder architecture therefore needs only half the processors, as shown in Figure 28. Table III shows the parameters of a parallel decoder. For window sizes at the two extremes, the approach reduces to known methods: if the window size is B and the number of windows is 1, it turns into the traditional approach; if the window size is 1, the architecture reduces to a fully parallel decoder, as proposed by Frey et al. [13]. It should be noted that the memory requirement is the same in all cases.
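One half-iteration of the windowed scheme can be sketched on top of the bcjr_recursions function from Section 3.2; the boundary exchange at the end implements the state messages just described (the partitioning into M windows of N sections follows Table III):

```python
def parallel_half_iteration(gamma, alpha_in, beta_in, M, N):
    """Run M window SISOs in parallel over a block of M*N trellis
    sections. alpha_in[w] / beta_in[w] are the state messages at the
    window borders from the previous iteration. Returns the per-window
    recursions plus the state messages for the next iteration."""
    alphas, betas = [], []
    for w in range(M):                        # all windows run concurrently
        a, b = bcjr_recursions(gamma[w * N:(w + 1) * N],
                               alpha_in[w], beta_in[w])
        alphas.append(a)
        betas.append(b)
    # boundary exchange: window w hands its final alpha to window w+1
    # and its first beta to window w-1
    alpha_next = [alpha_in[0]] + [alphas[w][-1] for w in range(M - 1)]
    beta_next = [betas[w + 1][0] for w in range(M - 1)] + [beta_in[-1]]
    return alphas, betas, alpha_next, beta_next
```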
Fig. 27 Partitioned graph of a simple PCCC
x
5 5
4
4
x
4
x
6 6
5
5
x
5
x
7 7
6
6
x
6
x
8
7
7
x
7
8
26
3 High-speed Turbo Decoders a0
aN W1
b0
x1
a2N
y1
b2N ... bMN-N
W2
bN
x2
aMN
aMN-N
y2
WM
xM
bMN
yM
Interleaver/Deinterleaver
Fig. 28 Parallel turbo decoder with shared processors for two constituent codes
Processing time is the time needed to decode one block. Since all windows are processed at the same time, each SISO is done after Tw . We assume that all messagepassing computation associated with one state constraint node is done in one clock cycle (Tclk ). We have I iterations and each iteration has two constituent codes, so it takes 2I × TW to complete the decoding. It is worth mentioning that the processing time determines the latency as well. Therefore any speed gain is equivalent to lower latency. Processing load is the amount of computations that we need. The processing load for each SISO is proportional to the number of the state constraints. Hence, it is kB, where k is a constant factor which depends on the complexity of the state constraints. It should be noted that processing load in serial and parallel SISO are the same. Therefore the total processing load is 2I × kB.
3.6
Speed Gain and Efficiency
3.6.1
Definitions
Two characteristic factors should be studied as performance figures. One is the speed gain and the other is the efficiency. In ideal parallelization the efficiency is always 1. It means that there is no extra processing load needed for parallel processing. Table III The Decoder Parameters Parameter
Definition
N M B=M×N I TW = 2N × Tclk T = 2I × TW P = k × 2I × B
Window size Number of windows (SISOs) Block size Number of iterations Window Processing Time Processing Time (Latency) Processing Load
3.6 Speed Gain and Efficiency
27
They are defined as follows: Speed gain = T0 /T Efficiency = P0 /P Where T0 and P0 are the processing time and processing load for the serial approach, i.e. W = B case. The factors can be further simplified to: Speed gain = M × I0 /I Efficiency = I0 /I This is a very interesting result. The speed gain and the efficiency are proportional to the ratio between number of iterations needed for serial case and parallel case. If the number of iterations required for the parallel case is the same as the serial case, we enjoy a speed gain of M without degrading the efficiency, which is ideal parallelization. Therefore we should look at the number of iterations required for a certain performance to further quantify the characteristic factors. In next section we will investigate these factors with some simulations.
3.6.2 Simulation Results For simulations, a PCCC with block size of 4,800 is chosen. The first constituent code is a rate 1/2 of systematic code and the second code is a rate one nonsystematic recursive code. The feed forward and feedback polynomials are the same for both codes and are 1 + D + D 3 and 1 + D 2 + D 3 , respectively. Thus coding rate is 1/3. The simulated channel is an additive white Gaussian noise (AWGN) channel. The bit error rate (BER) performance of the proposed high-speed decoder has been simulated for window sizes of N = 256, 128, 64, 48, 32, 16, 8, 4, 2, and 1. The first observation was that this structure does not sacrifice performance for speed. We can always increase the maximum number of iterations to get similar performance as of the serial decoder. The maximum number of iterations for each case is chosen such that the BER performance of the decoder equals that of the serial decoder after 10 iterations (I0 = 10). Figure 29 shows the BER performance of the decoder with different window sizes. The curves are almost indistinguishable. However, in practice, the iterations are stopped based on a criterion that shows the decoded data is reliable or correct. We have simulated such a stopping criterion in order to obtain the average number of iterations needed. The stopping rule that we use is the equality between the results of two consecutive iterations. The average number of iterations is used for the efficiency computation. The average number of iterations for low signal to noise ratio is the maximum number of iterations for each window size. Efficiency and speed gain of the parallel decoder with different window sizes is shown in Figure 30. It clearly shows that we have to pay some penalty in order
28
Fig. 29 Performances of parallel decoder
Fig. 30 Efficiency and speed gain
3 High-speed Turbo Decoders
3.6 Speed Gain and Efficiency
29
Table IV Characteristic Factors for the Parallel Decoder @SNR = 0.7 dB (BER = 10E − 8) Window size 64 32 16 8 4 2 1
Max # of iterations
Avg. # of iterations
Speed gain
Efficiency (%)
12 14 18 25 42 65 120
5.0 5.8 7.4 10.4 16.3 28.3 52.0
63 109 170 242 310 356 386
84 72 57 40 26 15 8
to achieve the speed gain. Also we observe that the efficiency of parallel decoder decreases gracefully for window sizes greater than 32. The efficiency is degraded dramatically for very small windows, which prohibits us to get speed gain as well. However, the speed gain is a decreasing function. As a summary, in Table IV, the maximum number of iterations, the average number of iterations, and the characteristic factors are tabulated for different window sizes at E b /N0 = 0.7 (BER = 1e − 8). Efficiency curves with respect to SNR are illustrated in Figure 31. The interesting observation in the efficiency curves is the flatness of the curves. In other words, the
Fig. 31 Efficiency vs. signal to noise ratio
30
3 High-speed Turbo Decoders
efficiency of the parallel decoder is almost constant in all SNR. This observation translates to almost constant speed gain over the whole SNR range.
3.7 Interleaver Design In traditional decoder, the extrinsics are usually stored in a memory whose locations is accessed one at a time. The order in which the extrinsic memory is accessed is different for two constituent codes because of the interleavers. It is usually accessed in sequential order for the first constituent code and in interleaved order for other constituent codes. In practice, memory addressing is used to access the memory in desired order. Although the message-passing algorithm allows us to parallelize the decoding process, accessing so many extrinsics at the same time poses a practical problem. Since M SISOs are running at the same time, M extrinsics are being used simultaneously. Having a memory with M read and write ports is practically not feasible, especially for large M. The solution is to have M submemories each one of which is accessed by only one SISO. This should be true not only for extrinsics with sequential orders but also for interleaved orders. Therefore, it imposes a constraint on the interleavers. Such an interleaver is called conflict free. If the bit sequence is put in a matrix with rows corresponding to bits processed with one SISO, the general structure for conflict-free interleaver is derived by random row permutation followed by column permutations. An example for M = 5 and N = 4 is shown in Figure 32. The submemories are distinguished with different colors. Each column of the matrix is processed at the same time. As we see at each time all the submemories are used and not one is accessed twice. This is true for both sequential and interleaved order.
0
0
1
2
3
1
3
2
0
8
3
4
5
6
7
7
4
5
6
1
14 19
0
8
9
10 11
8
10 11
9
18
4
12
9
12 13 14 15
13 14 12 15
7
10
2
6
16 17 18 19
18 17 19 16
13 17
5
15
(a)
(b)
1
2
3
4
5
6
7
8
11 16
(c) 9
10 11 12 13 14 15 16 17 18 19
4
12
(d) 8
3
11 16
1
14 19
0
18
9
7
10
2
6
13 17
5
15
(e)
Fig. 32 (a) Bit sequence in matrix form; (b) after row interleaver; (c) a conflict-free interleaver; (d) bit sequence in sequential order; (e) the conflict-free interleaved sequence
3.7 Interleaver Design
31
With conflict-free interleavers the extrinsic memory access problem is solved. In the next section we present an improved interleaver that almost halves the latency. Although this structure is applicable for every turbo code, we will explain it in the case of a block PCCC code.
3.7.1
Low Latency Interleaver Structure
To explain the interleaver structure we start with the reverse interleaver in a serial decoder. When the reverse interleaver is used, it is observed that the next iteration can start processing as soon as the first extrinsic is ready and every new computed extrinsic is used right away. This property is true only for the reverse interleaver. The reason for this property is that the sequence of extrinsics computed in the current iterations matches the sequence needed in the backward recursion in the next iteration. Figure 33 shows the observation and extrinsic sequences used in two consecutive iterations. The indices used for alpha and beta variables denote the trellis stage. For extrinsics they denote the bit number in the block. The alpha and beta variables in two iterations are totally independent, although for simplicity the same notation is used for them. As it is observed from the sequences in Figure 33 the output sequence of the second iteration is compatible with the input sequence of the first interleaver, which means that we can repeat this pipelining process for succeeding iterations as well. This phenomenon results in two very important advantages. First, the latency of the decoder decreases by almost a factor of two, which translates to speed gain as well. We have, Processing time (Latency) = (2I + 1) × Tw /2 = (I + 1/2) × Tw C1 :
y N −1 y N −2 · · · y1 y0 bN
b N −1 · · · b2 b1 b0 y0 y1 · · · y N −2 y N −1 a0 a1 · · · a N −2 a N −1 a N
C2 :
x0 x1 · · · x N −2 x N −1
extrinsic outputs
x0 x1 · · · x N −2 x N −1
extrinsic inputs (interleaved)
b N b N −1 · · · b2
b1 b0 x N −1 x N −2 · · · x1 a0
a1
x0
· · · a N −2 a N −1 a N
y N −1 y N −2 · · · y1
y0
Fig. 33 Data and extrinsic sequences in two consecutive iterations for turbo decoder with reverse interleaver
32
3 High-speed Turbo Decoders y N −1 y N −2 . . . y1 y0 b N b N −1 . . . b2 b1 b0 y0 a0 x0 .. ..
y1 . . . y N −2 a1 . . . a N −2 x1 . . . x N −2 .. .. .. ..
??
π0 .. ..
y N −1 a N −1 a N x N −1 extrinsic outputs .. ..
?
?
π1 . . . π N −2 π N −1 .. .. .. .. .. ..
??
?
interleavers
?
x0 x1 . . . x N −2 x N −1 extrinsic inputs b N b N −1 . . . b2 b1 b0 x N −1 x N −2 . . . x1 x0 a0 a1 . . . a N −2 a N −1 a N y N −1 y N −2 . . . y1 y0
Fig. 34 Sequences in two consecutive iterations for parallel turbo decoder with reverse interleaver
This advantage comes at the expense of running both the forward and backward recursion circuitry all the time. There is no extra hardware needed, though. Second, the extrinsic information are not required to be stored in a memory and retrieved later on. This memory reduction is due to in place processing of the extrinsics. Despite these advantages, this interleaver is never used for turbo codes. The reason is the poor BER performance of it. In the sequel we use the reverse interleaver in the context of the parallel turbo decoder. By this we get around this problem by incorporating one more permutation while we still exploit the advantages. For the parallel decoder; the timing diagram for two consecutive iterations with the proposed interleaver is shown in Figure 34. The idea is to use the same vector with permuted elements. So x0 in the next iteration is permuted, π0 (x0 ), and y0 will be replaced by π0−1 (x0 ). If we did not have the π0 , π1 , . . . , π N −1 interleavers, there would be nothing more than M parallel decoders each one working on a separate block. However, the presence of the interleavers creates a randomness that will improve the performance of the code while the architecture of the code is almost intact, i.e. the advantages of the reverse interleaver are still in place. The permutations are done differently for each vector. Therefore the interleaver block is time variant, but memoryless. Because the number of parallel blocks is usually small, the interleaver implementation is feasible. Moreover, it does not have FR 1
FR 1'
FR 2
...
FR I'
BR 1
BR 1'
...
BR I
BR I'
OC 1
OC 1'
...
OC I
OC I'
Int
Int-1
...
Int
Int-1
Fig. 35 Scheduling diagram of the parallel decoder
3.7 Interleaver Design
33
the memory access problem due to its memoryless structure, which is the main problem in the parallelization of turbo decoder. The scheduling diagram of the parallel decoder is shown in Figure 35.
3.7.2
Interleaver Design Algorithm
The structure of the interleaver can be best understood by organizing the bits in a matrix with each row having the bits processed in a SISO. Each column of this matrix is a vector that is computed at a time. We have a two-step interleaver: a reverse row interleaver and a random column interleaver. If the matrix elements are denoted by Pi, j , where i = 0, 1, . . . , M−1 and j = 0, 1, . . . , N −1, then the equivalent interleaver sequence is {Q n } = {N *Pi, j + N − j}. As an example, a turbo code with block length of 20 is decomposed into M = 5 sub-blocks of N = 4 bits. Therefore, there are 5 SISOs working in parallel; each one works on a block of 4 bits. Table V shows an example of the interleaver in matrix format. The equivalent interleaver is {11, 2, 5, 4, 15, 18, 9, 12, 19, 14, 1, 16, 3, 6, 13, 8, 7, 10, 17, 0}. In this section we will explain how to design such an interleaver. The algorithm has two main steps: constructing a random interleaver and updating based on a certain constraint. Using the matrix format for the interleaver, we initialize the interleaver design by taking a random interleaver for each column. In the next step we will update the interleaver by applying a certain constraint. To update the interleaver we use columnwise bit swapping, which ensures that the structure of the interleaver is preserved. Since the constraints for interleaved designs are usually applicable for onedimensional interleaver, to update the interleaver it is best to compute the equivalent interleaver, which can be done on the fly. The simplest constraint that we can use is the spread of the interleaver [7]. In other words, we design an S-random interleaver with the proposed structure. An ordinary S-random can be viewed as a special case for this structure, i.e. when we have only one column. Therefore, this algorithm not only presents an algorithm with the proposed structure, but also gives a fast algorithm for designing S-random interleavers. The flowchart of this algorithm is shown in Figure 36. Starting from the first row, the first SISO bits, and the constraint for each bit is checked given the previously designed bits. If the constraint is met, we go to the next Table V An Example of the Conflict-Free Interleaver Bit index in SISO Interleaved bit index SISO0 gets from SISO SISO1 gets from SISO SISO2 gets from SISO SISO3 gets from SISO SISO4 gets from SISO
0 3 2 3 4 0 1
1 2 0 4 3 1 2
2 1 1 2 0 3 4
3 0 1 3 4 2 0
34
3 High-speed Turbo Decoders Fig. 36 The flowchart of the algorithm
Start
Construct a matrix with N random interleaver columns; Pi,j i = 0, j = 0
n=Nxi+j; Qn = N*Pi,j + N−j
m = i;
k = N*m + j ; Qk = N*Pm,j + N-j
Yes
Qk satisfys the constraint ? No m = m +1; m > M?
No
Yes m = i−1;
k = N*m + j; Qk = N*Pm,j + N−j
Yes Pi,j <-> Pm,j
j++; if( j = N) { j = 0, i++}
Qn & Qk satisfy the constraint ? No m = m −1; m<0?
No
i>M? Yes Done
Yes
No
3.8 Hardware Complexity
35
bit. Otherwise we check the remaining elements in the column in place of this bit. If one of them satisfies the constraint, we exchange the indices in the column and go to the next bit. If not, we try exchanging this bit with the previously designed bits in the column. In this situation, when we exchange two bits the constraint for both bits should be satisfied. If none of the previously designed bits can be exchanged with this bit, then the algorithm fails. There are two options available in case the algorithm fails: one is to make the constraint milder and the other one is to redo everything with a new random matrix. We have observed that this algorithm is very fast for S-random interleaver design and it does not fail when the spread is less than sqrt(B/2). The maximum spread that we can achieve for the structured interleaver is slightly smaller than that of the ordinary one. Therefore, one should expect some degradation in performance. A much more complicated constraint can be used in the algorithm in order to improve the code. These constraints usually depend strictly on the code.
3.7.3 Simulation Results For simulations, two PCCCs with block sizes of 1,024 and 4,096 are chosen. The first constituent code is a rate 1/2 systematic code and the second code is a rate one nonsystematic recursive code. The feed forward and feedback polynomials are the same for both codes and are 1 + D + D 3 and 1 + D 2 + D 3 , respectively. Thus coding rate is 1/3. The simulated channel is an AWGN channel. Two interleavers with block length of 1,024 (M = 32, N = 32) and 4,096 (M = 128, N = 32) have been designed with the proposed algorithm. The BER performance of the decoders has been simulated and compared with that of the serial decoder with S-random interleaver. The maximum number of iterations for each case is 10. The performance comparison for the 1,024 case is illustrated in Figure 37. The proposed two-dimensional S-random interleaver is called S2 -random. As we see in the figure, the performances are almost the same. The S-random interleaver has a slightly better error floor. The performances for the 4,096 case is shown in Figure 38. The difference between the error floors is more noticeable. However, the codes have equal threshold in both cases. The error floor can be reduced with a more powerful constraint in the interleaver design algorithm.
3.8
Hardware Complexity
After the speed gain and efficiency were discussed, we are going to investigate hardware complexity for parallel turbo decoder. The turbo decoder hardware consists of two major parts: logic and memory.
36
3 High-speed Turbo Decoders
EB/No Fig. 37 Performance comparison for B = 1,024
The memory requirement for the parallel decoder consists of the following: Observations: These variables are the observation values received from channel. They usually are the logarithm of likelihood ratios (LLR) of the code word bits. The message computations are easier using these variables. The size of the memory is the number of bits in a code word. For a rate 1/3 turbo code with block size of B, the memory size is 3B. Usually the width of this memory is 4–5 bits. This is a readonly memory during the decoding (all iterations). This memory is the same for serial decoder. Extrinsics: The extrinsics are stored in a memory to be used later by other constituent codes. The size of the memory depends on the number of connections between constituent codes. For a simple PCCC, there are only B extrinsics needed. This memory is a read/write with a width of 6–9 bits. For conflict-free interleavers, this memory is divided into M sub-blocks each one accessed independently. The total memory, however, is the same as serial decoder. Beta variables: These variables should be kept in a memory in backward recursion before it is used to compute the extrinsics. Each beta variable is actually an array. The size of the array is 2(K −1) −1, where K is the constraint length of the convolutional code; i.e. the number of possible values for a state variable. For an 8-state convolutional code the size of memory for beta variables is 8B. Each element in the array has usually 8–12 bits. This memory is a big portion of memory requirements for the
3.8 Hardware Complexity
37
EB/No Fig. 38 Performance comparison for B = 4,096
parallel decoder. In serial decoder there is a sliding window approach that reduces this memory at the expense of some more processing. The logic that is needed for processing is mainly for the message computation. There are three operations needed in message computations that have been shown in Figure 39. n
n
x
n
n+1
n
x
n
n+1
n
x
n
n
n
(a)
(b)
(c)
Fig. 39 (a) alpha recursion; (b) beta recursion; (c) extrinsic computation
n+1
38
3 High-speed Turbo Decoders
All of the above operations are done in one clock cycle. In the parallel decoder we have M SISO blocks. Therefore, compared to serial decoder the above computational logic is increased by M times. Contrary to pipelined turbo decoder, the complexity is not increased proportional to the speed gain.
3.9 Conclusion We have proposed an efficient architecture for parallel implementation of turbo decoders. The advantage of this architecture is that the increase in the processing load due to parallelization is minimal. Simulation results demonstrate that this structure not only achieves some orders of magnitude in speed gain, but also maintains the efficiency in processing. Also we have shown that the efficiency and the speed gain of this architecture are almost independent of the SNR. We also have proposed a novel interleaver structure for parallel turbo decoder. The advantages of this architecture are low latency, high speed, and the feasibility of the implementation. Simulation results show that we can achieve very good BER performance by this architecture as well. We also presented a fast algorithm to design such an interleaver, which can be used for designing S-random and other interleavers by just changing the constraint. The regularity of the recently proposed architecture for parallel turbo decoder and the advantages of the proposed interleaver make this the architecture of choice for VLSI implementation of high-speed turbo decoders.
Chapter 4
Very Simple Turbo-like Codes
4.1
Introduction
In searching for simple turbo-like codes RA codes are very inspiring. They are perhaps the simplest turbo-like codes. Surprisingly, they achieve good performance too. Simplicity of these codes lends itself to a more comprehensive analysis of their performance. Divsalar et al. have shown the performance of these codes with ML decoding and proven that they can achieve near Shannon limit performance [12]. Moreover, they have proved that it achieves the Shannon limit when the rate goes to zero. However, RA codes cannot compete with turbo codes or well-designed LDPCs as far as performance is concerned. To improve the performance of RA codes Jin proposed Irregular Repeat–Accumulate (IRA) codes [16,17]. He also presented a method for designing very good IRA codes for binary erasure and additive white Gaussian channels. He showed that they outperform turbo codes for codes with very large block sizes. However, IRA codes lose both the regularity and simplicity at the expense of performance. In this chapter we show that with some simple modifications RA codes can be transformed into very powerful codes while maintaining the simplicity. The modifications include simple puncturing and precoding. First RA codes with regular puncturing are analyzed using iterative decoding, as well as ML decoding. ML decoding performance is shown by a tight bound using the weight distribution of RA codes with puncturing. In fact, with increasing both the repetition and puncturing the code rate remains the same whereas the performance gets better. Then we present ARA codes. These codes not only are very simple, but also achieve excellent performance. The performance of these codes with ML decoding is illustrated and compared to random codes by very tight bounds. It is shown that there are some simple codes that perform extremely close to Shannon limit with ML decoding. The performance of ARA codes using iterative decoding is also investigated and compared to ML decoding later on. RA and ARA codes can be classified as LDPC codes. Despite the fact that LDPC codes in general may have a very computationally involved encoder, RA and ARA codes have a simple encoder structure. ARA codes, especially, allows us to generate a wide range of LDPC codes with various degree distributions including variable
Aliazam Abbasfar, Turbo-Like Codes, 39–65. c Springer 2007
39
40
4 Very Simple Turbo-like Codes
nodes with degree one (RA and IRA code structures do not allow degree one variable nodes). They are able to generate LDPC codes with various code rates and data frame sizes, and with a performance close to the Shannon capacity limit. The proposed coding structure also allows constructing very high-speed iterative decoders using belief propagation (message passing) algorithm. First we describe briefly some of the tools that we use for analyzing the performance of a turbo-like code.
4.1.1
Bounds on the ML Decoding Performance of Block Codes
Since there is no practical ML decoding algorithm available for block codes with large block size, we use the performance bounds to obtain some insight on codes behavior. Using the classic union bound, the frame (word) error rate (FER) and BER for a (N, K) linear block code decoded by an ML criterion over an AWGN channel is upper-bounded by ⎛ ⎞ E b⎠ (11) Ad Q ⎝ 2d r FER ≤ N0 d=d min
BER ≤
⎛ ⎞ K w E b ⎠ Aw,d Q ⎝ 2d r K N0
(12)
d=d min w=1
Where r denotes the code rate, E b /No is the signal to noise ratio, d is the Hamming distance of code words, dmin is the minimum distance between code words, wd is the average input error weight, Ad is the cardinality of code words with distance d, Aw,d is the cardinality of code words with input and output weight of w and d, K is the block length, N is the code word length, and Q denotes the complementary error function defined as 1 Q(x) = √ 2π
∞
e−u
2 /2
du
(13)
x
However, this bound is not very tight in low signal to noise ratios. There are tighter bounds like Viterbi-Viterbi [33], Poltyrev [26], and Divsalar [10] bounds. Divsalar bound is very attractive since it provides tight bounds with closed form expressions for bit-error and word-error probabilities. Here we describe this bound by defining some new variables: δ = d/N a(δ) = ln(Ad )/N Eb c=r N0
(14)
4.1 Introduction
41
Where δ, a(δ), and c are normalized weight, weight distribution, and SNR respectively. Then we define the following: c0 (δ) = (1 − e−2a(δ) ) 1−δ 2δ c f (c, δ) = c0 (δ) + 2c + c2 − c − 1 Then we define the exponent: ⎧ c f (c, δ) 1 ⎪ ⎪ ⎨ ln[1 − 2c0 (δ) f (c, δ)] + , 1 + f (c, δ) E(c, d) = 2 ⎪ ⎪ ⎩ −a(δ) + δc,
c0 (δ) < c <
(15)
e2a(δ) − 1 2δ(1 − δ)
otherwise (16)
Based on Divsalar bound the upper bound on the FER is given by:
√ Pe ≤ min e−NE(c,d) , ena(δ) Q 2cd
(17)
d=d min
The BER is upper bounded by the same formula, but the definition of r(δ) should be changed to the following: ⎛ ⎞ w (18) Aw,d ⎠/N a(δ) = ln ⎝ K w It also obtains a closed form expression for the minimum SNR threshold that serve as a tight upper bound on maximum-likelihood capacity of nonrandom codes, which is the following: Eb 1 = (19) max [c0 (δ)] N0 min r 0≤δ≤1−r The above threshold holds when N goes to infinity. The only information required for the error probability bounds and the threshold is the weight distribution of the code. Fortunately, the codes that are investigated are simple enough to derive their weight distribution.
4.1.2 Density Evolution Method The analysis of iterative decoders for turbo-like codes with short blocks is an open problem. However, there is an asymptotic solution for this; i.e. when N goes to infinity. This analysis is based on the DE method proposed by Richardson and Urbanke [27]. In this method the probability density function of the messages passed between the constituent codes is tracked as this density evolves from iteration to iteration. They used this method to compute the asymptotic threshold of LDPC codes over a binary input AWGN channel. Divsalar et al. [11] generalized this method for turbo and turbo-like codes.
42
4 Very Simple Turbo-like Codes
Fig. 40 Probability density function of messages in different iterations
To explain the DE phenomena, we assume that all the code word bits are zero; i.e. value 1 sent in the transmitter. We sketch the probability density function of the OMs for one constituent codes in different iterations. Example of those curves is shown in Figure 40. As we see the density functions evolves towards higher means. We can approximate the density functions by Gaussian approximation, as Wiberg did in his dissertation [34]. The bit decisions are made based on the messages, so the bit-error probability depends on the SNR of messages. Therefore what is really important is that the SNR of the density functions should increase in order to obtain better performance as the iterations go on. The messages are passed between the constituent codes. Hence, the constituent codes get the evolved message as input and generate new messages at the output. So what we need to know is the SNR transfer function for each constituent codes. Using the transfer functions we can track the behavior of the SNR of the messages Eb/No SNRin SNRout
G
Fig. 41 Constituent code model for density evolution
4.1 Introduction
43
Fig. 42 Constituent code model for density evolution
Eb / No SNRin G2
SNRout
G1
as they are passing between the constituent codes. Therefore, we sketch a general model for each constituent code as in Figure 41. In this model we have one more parameter that is the operating E b /N0 . This is the E b /N0 of the observations – the leaf node messages – that are fed to this constituent code and acts like a bias for the constituent code. The SNR transfer function is denoted by G that indicates the following relationship: SNRout = G(SNRin )
(20)
It should be noted that the transfer function is implicitly dependent on the operating E b /N0 . The transfer function is usually derived by Monte Carlo simulation using the Gaussian approximation or the real density function. For turbo-like codes with two constituent codes the overall block diagram of the iterative decoding is shown in Figure 42. We have: SNRout = G 1 (SNRin ) SNRin = G 2 (SNRout )
(21) (22)
Suppose we start with the first constituent code. There is no prior messages at this time; i.e. SNRin = 0. The observations help to generate new messages with some SNRout = G 1 (SNRin ) > 0. This messages are passed to the second constituent code and output messages have the SNR = G 2 (G 1 (SNRin )), which is the SNRin of the first constituent code for the next iteration. In order to obtain a better SNR at each iteration we should have the following: G 2 (G 1 (SNRin )) > SNRin ;
for any SNRin
(23)
Since G 2 is strictly ascending function, it is reversible and we can write an equivalent relation: G 1 (SNRin ) > G −1 2 (SNRin );
for any SNRin
(24)
If the above constraint holds, then the iteratively decoding will result in correct information bits (SNR goes to infinity). The minimum operating E b /N0 that this constraint holds is denoted as the capacity of the code with iteratively decoding. We usually draw G 1 (SNR) and G −1 2 (SNR) in one figure. These curves give us some sense about the convergence speed as well. An example of such curves and the
44
4 Very Simple Turbo-like Codes
Fig. 43 SNR improvement in iterative decoding
way SNR improves in iterations is shown in Figure 43. As we see the speed of SNR improvement depends on the slopes of G1 and G2 . We use the transfer function curves to analyze the performance of turbo-like codes with iterative decoding.
4.2 RA Codes RA codes are the simplest codes among turbo-like codes, which make them very attractive for analysis. The general block diagram of this code is drawn in Figure 44. An information block of length N is repeated q times and interleaved to make a block of size qN, and then followed by an accumulator. u N
rep(q)
qN
Fig. 44 Repeat–Accumulator code block diagram
I
qN
ACC
qN
4.2 RA Codes
45
The accumulator can be viewed as a truncated rate one recursive convolutional code with transfer function of 1/(1 + D), but sometimes it is better to think of it as a block code whose input block [x1 , x2 , . . . , xn ] and output block [y1 , y2 , . . . , yn ] are related by the following: y1 = x1 y2 = x1 + x2 y3 = x1 + x2 + x3 .... yn = x1 + x2 + x3 + · · · + xn
4.2.1
(25)
ML Analysis
For ML analysis we need the weight distribution of the code. We use the concept of uniform interleaver [8] to compute the overall input-output weight enumerator (IOWE). Therefore, we need to compute the IOWE of both repetition code and the accumulator. For repetition code it is simply the following:
rep(q)
Aw,d
⎧ ⎪ ⎨ N ; w = ⎪ ⎩ 0;
d = qw
(26)
otherwise
It can be expressed as rep(q) Aw,d
=
N w
δ(d − qw)
(27)
where δ(·) is the Kronecker delta function. The IOWE of the accumulator is: Aacc w,d
=
N −d w/2
d −1 w/2 − 1
(28)
Where x and x denote the largest integer smaller than x and smallest integer larger than x, respectively.
46
4 Very Simple Turbo-like Codes
Having the IOWE of the repeat and accumulate codes, we can compute the IOWE of the RA code using the uniform interleaver. RA(q) Aw,d
=
qN h =0
rep(q) Aw,h Aacc h,d qN h
=
N
w qN
qN − d qw/2
d −1 qw/2 − 1
(29)
qw
4.2.2 DE Analysis The RA codes consist of two component codes: repeat and accumulate codes. Hence, the messages are exchanged between these two codes. We use Gaussian approximation to obtain the SNR transfer functions of the constituent codes, which are shown in Figure 45. In accumulator code we have nonzero SNR even with zero messages in the input – coming from repetition code. This is because the observations help to generate some nonzero messages. Therefore, the accumulator is able to jumpstart the iterative decoding. On the other hand, the repetition code has a straight line transfer function
Fig. 45 Density evolution for RA codes (q = 3)
4.3 RA Codes with Puncturing
47
Fig. 46 Accumulator with puncturing and its equivalent for p = 3
with a slope of 2; the reverse function is shown in Figure 45. SNRout is zero when SNRin is zero. This is justified since there is no channel observation available for this code. The curves are almost touching; hence, the threshold of the RA code (rate 1/3) is almost 0.5 dB.
4.3 4.3.1
RA Codes with Puncturing ML Analysis
To compute the IOWE of the RA codes with puncturing we use the equivalent encoder depicted in Figure 46 instead of the accumulator with puncturing. As we see, the equivalent graph is a concatenated code of a regular check code and an accumulator, which is shown in Figure 47. Since the check code is regular and memoryless, the presence of any interleaver between two codes does not change the IOWE of the overall code. In order to compute the IOWE for this code we insert a uniform interleaver between two codes. The next step is to compute the IOWE of the check code. The IOWE can be expressed in a simple closed-form formula if we use the two-dimensional Ztransform. The inverse Z-transform results in Acw,d . We start with N = 1, i.e. we have only one parity check. We have Ac (W, D) = E p (W ) + O p (W )D
pN
Check(p)
N
Acc
N
pN
Fig. 47 Block diagram of accumulator with puncturing
Check(p)
(30)
N
p
N
Acc
N
48
4 Very Simple Turbo-like Codes
where E p (W ) = Even[(1 + W ) p ]
(31)
O p (W ) = Odd[(1 + W ) p ]
(32)
and
Since there are N independent check nodes in the code, the IOWE can be written in Z-transform as: Ac (W, D) = (E p (W ) + O p (W )D) N =
N N d=0
d
E p (W ) N −d O p (W )d D d
(33)
The IOWE is obtained by taking the inverse Z-transform. The closed-form expression for Aw,d for arbitrary p is very complicated. Instead we derive the IOWE for p = 2, 3, and 4, which are practically more useful.
Case p = 2
4.3.1.1
Using the general formula in Z-transform we have: Ac(2) (W, D) = (1 + W 2 + 2WD) N
(34)
It can be expanded as following: N N d=0
d
⎡ ⎤ N N −d N ⎣ N −d (1 + W 2 ) N −d (2W )d D d = W 2 j (2W )d ⎦ D d d j j =0
d=0
(35) Therefore the IOWE can be expressed as
c(2)
Aw,d
⎧ N −d ⎪ ⎨ N 2d ; d j = ⎪ ⎩ 0;
w = d + 2j
for
j = 0, . . . ,N − d
(36)
otherwise
It can be expressed concisely as c(2) Aw,d
=
N d
N −d j=0
N −d j
2d δ(w − d − 2 j)
(37)
4.3 RA Codes with Puncturing
49
where δ(x) is the Kronecker delta function. Example: N = 3 ⎡ 1 0 ⎢0 6 ⎢ ⎢3 0 ⎢ A=⎢ ⎢0 12 ⎢3 0 ⎢ ⎣0 6 1 0
0 0 12 0 12 0 0
⎤ ⎡ 0 1 ⎢0 0⎥ ⎥ ⎢ ⎢ 0⎥ ⎥ ⎢3 ⎥ 8⎥ = ⎢ ⎢0 ⎢ 0⎥ ⎥ ⎢3 ⎦ ⎣0 0 0 1
0 2 0 4 0 2 0
⎤ 0 ⎡ 0⎥ ⎥ 1 ⎥ 0⎥ ⎢ ⎢0 8⎥ ⎥ ⎣0 0⎥ ⎥ 0 0⎦ 0
0 0 4 0 4 0 0
0 3 0 0
⎤ 0 0⎥ ⎥ 0⎦ 1
0 0 3 0
(38)
The second matrix is canceled out when we concatenate this code to other codes with a uniform interleaver.
4.3.1.2
Case p = 3
Starting from general formula in Z-transform we have: Ac(3) (W, D) = (1 + 3W 2 + (3W + W 3 )D) N
(39)
It can be expanded as following: A
c(3)
(W, D) =
N N d=0
A
c(3)
(W, D) =
(1 + 3W 2 ) N −d (3W + W 3 )d D d
d
N −d N −d i 2i 3W i
N N
d
d=0
×
i=0
d i=0
d i
It can be written as: A
c(3)
(W, D) =
N N d=0
×
(40)
d
3 (W ) i
2(d−i)
W
d
Dd
(41)
N −d N −d ii 2ii 3 W ii ii=0
d d i=0
i
(d−i)
3
(W )
2i
W
d
Dd
(42)
Then we have A
c(3)
−d N d N N d N −d ii+(d−i) d+2i+2ii (W, D) = W 3 Dd d i ii d=0
i=0 ii=0
(43)
50
4 Very Simple Turbo-like Codes
If we let j = i + ii, we have A
c(3)
(W, D) =
N N N d=0
×
d N −d j −i
j=0
min( j,d) i=max(0, j−N +d)
d i
3d+ j−2i W d+2 j
Dd
(44)
Therefore, it is easy to show that the IOWE becomes: ⎧ min( j,d) ⎪ d N − d N ⎪ d+ j−2i ⎪ 3 ; ⎪ ⎨ d i=max(0, j−N +d) i j −i c(3) Aw,d = w = d + 2 j for j = 0, . . . ,N ⎪ ⎪ ⎪ ⎪ ⎩ 0; otherwise It can be written as Ac(3) w,d =
N d
⎧ N ⎨ j=0
⎩
min( j,d)
i=max(0, j−N +d)
(45)
⎫ ⎬ d N −d 3d+ j−2i δ(w − d − 2 j) i j −i ⎭ (46)
where δ(·) is the Kronecker delta function. Meanwhile we have the following property: c(3) Ac(3) w,d = A3N −w,N −d
(47)
This property can be proven very easily by taking the complements of three input bits to a check. The output of the check is also inverted. If the number of nonzero input and output bits are w and d, respectively, the number of nonzero bits in a complemented version is 3N − w and N − d. This proves the property. This property helps to save some computations. Example: N = 3 ⎡ 1 0 0 ⎢0 9 0 ⎢ ⎢9 0 27 ⎢ ⎢ 0 57 0 ⎢ ⎢27 0 99 A=⎢ ⎢ 0 99 0 ⎢ ⎢27 0 57 ⎢ ⎢ 0 27 0 ⎢ ⎣0 0 9 0 0 0
⎤ ⎡ 0 1 ⎢0 0⎥ ⎥ ⎢ ⎢ 0⎥ ⎥ ⎢9 ⎥ 27⎥ ⎢ ⎢0 ⎢ 0⎥ ⎥ = ⎢27 ⎥ 27⎥ ⎢ ⎢0 ⎢ 0⎥ ⎥ ⎢27 ⎥ 9⎥ ⎢ ⎢0 0⎦ ⎣0 1 0
0 0 3 0 0 9 19 0 0 33 33 0 0 19 9 0 0 3 0 0
⎤ 0 0⎥ ⎥ 0⎥ ⎥⎡ 27⎥ ⎥ 1 ⎢ 0⎥ ⎥ ⎢0 ⎥ 27⎥ ⎣0 0⎥ ⎥ 0 9⎥ ⎥ 0⎦ 1
0 3 0 0
0 0 3 0
⎤ 0 0⎥ ⎥ 0⎦ 1
(48)
4.3 RA Codes with Puncturing
4N
Check(4)
51
N
4N
Check(2)
4N
Check(2)
p
2N
Check(2)
2N
2N
N
Check(2)
N
Fig. 48 Block diagram of check_4 code and its equivalents
4.3.1.3
Case p = 4
The code for this case can be viewed as a concatenated code as shown in Figure 48. Because the check code is regular and memoryless, we can put any interleaver between the codes without changing the IOWE of the overall code. By using a uniform interleaver and the results found for case p = 2 the IOWE can be written as: c(4)
Aw,d =
c(2) c(2) 2N Aw,h Ah,d
h=0
2N
(49)
h
Using the result for case p = 2, we obtain ⎧ N −d−2 j −d ⎨ 2N N N −d 2N − d − 2 j c(4) 22d+2 j Aw,d = d j i ⎩ j=0
i=0
× δ(w − d − 2i − 2 j) Example: N = 3 ⎡ 1 ⎢ 0 ⎢ ⎢ 18 ⎢ ⎢ 0 ⎢ ⎢111 ⎢ ⎢ 0 ⎢ c(4) =⎢ A ⎢252 ⎢ 0 ⎢ ⎢111 ⎢ ⎢ 0 ⎢ ⎢ 18 ⎢ ⎣ 0 1
0 12 0 156 0 600 0 600 0 156 0 12 0
0 0 48 0 384 0 672 0 384 0 48 0 0
⎫ ⎬ (50)
⎭
⎤ ⎡ 0 1 ⎢ 0 0 ⎥ ⎥ ⎢ ⎢ 0 ⎥ ⎥ ⎢ 18 ⎥ 64 ⎥ ⎢ ⎢ 0 ⎢ 0 ⎥ ⎥ ⎢111 ⎥ 192⎥ ⎢ ⎢ 0 ⎢ 0 ⎥ = ⎥ ⎢252 ⎥ 192⎥ ⎢ ⎢ 0 ⎢ 0 ⎥ ⎥ ⎢111 ⎥ 64 ⎥ ⎢ ⎢ 0 ⎢ 0 ⎥ ⎥ ⎢ 18 ⎦ ⎣ 0 0 0 1
0 4 0 52 0 200 0 200 0 52 0 4 0
0 0 16 0 128 0 224 0 128 0 16 0 0
⎤ 0 0 ⎥ ⎥ 0 ⎥ ⎥ 64 ⎥ ⎥ ⎡ 0 ⎥ ⎥ 1 ⎥ 192⎥ ⎢ ⎢0 0 ⎥ ⎥ ⎣0 192⎥ ⎥ 0 0 ⎥ ⎥ 64 ⎥ ⎥ 0 ⎥ ⎥ 0 ⎦ 0
0 3 0 0
0 0 3 0
⎤ 0 0⎥ ⎥ 0⎦ 1
(51)
52
4 Very Simple Turbo-like Codes
This method can be applied for any p that can be decomposed into two smaller numbers. Having computed the IOWE of the check code, we can use the uniform interleaver formula to come up with the IOWE of the accumulator with puncturing. We have: acc( p)
Aw,d
=
c( p) N Aw,h Aacc h,d
h=0
N
(52)
h
The simplified expressions for cases p = 2, 3, and 4 are as follows: Aacc(2) w,d =
−h N N N −h N −d d −1 2h δ(w − h − 2 j) h/2 h/2 − 1 j
(53)
h=0 j=0
Aacc(3) w,d
=
⎧ N ⎨ N h=0 j=0
⎩
h N −h N −d d −1 h/2 h/2 − 1 i j −i
min( j,h) i=max(0, j−N +h)
⎫ ⎬
× 3h+ j−2i δ(w − h − 2 j)
(54)
⎭
acc(4)
Aw,d
⎧ −h−2 j −h ⎨ 2N N N N −h 2N − h − 2 j N −d d −1 = h/2 h/2 − 1 j i ⎩ h=0 j=0
i=0
⎫ ⎬ × 22h+2 j δ(w − h − 2i − 2 j) ⎭
(55)
It should be noted that despite the fact that we use a uniform interleaver to obtain the IOWE, we come up with the exact IOWE for accumulator with puncturing. The next step is to find the IOWE of the RA code with puncturing, which is derived in case of a uniform interleaver after repetition. rep(q)−acc( p)
Aw,d
=
rep(q) acc(p) qN Aw,h Ah,d
h=0
qN
(56)
h
Therefore, the closed form expressions for IOWE of RA( p = 2, q = 2), RA( p = 3, q = 3), and RA( p = 4, q = 4) will be the following: rep(2)−acc(2)
Aw,d
=
N w
2N
−h N N N −h N −d d −1 2h h/2 h/2 − 1 j h=0 j=0
2w
× δ(2w − h − 2 j)
(57)
4.3 RA Codes with Puncturing rep(3)−acc(3) Aw,d
=
53
N w
3N
× rep(4)−acc(4)
Aw,d
=
N −d h/2
h N −h i j −i
d −1 3h+ j−2i δ(3w − h − 2 j) h/2 − 1
(58)
N w
4N
−h−2 j −h 2N N N N −h 2N − h − 2 j j i h=0 j=0
4w
×
min( j,h)
h=0 j=0 i=max(0, j−N +h)
3w
N N
N −d h/2
i=0
d −1 22h+2 j δ(4w−h −2i −2 j) h/2 − 1
(59)
The above expressions are IOWE of the nonsystematic RA codes with regular puncturing. However, in most cases we need to have systematic codes. It is very easy to compute the IOWE of a systematic code based on its nonsystematic code. The following formula shows the conversion. sys−rep(q)−acc( p)
Aw,d
4.3.2
rep(q)−acc( p)
= Aw,d−w
(60)
Performance of Punctured RA Codes with ML Decoding
RA codes are usually nonsystematic codes, i.e. the information block is not sent along with the output of the accumulator. However, the RA codes with puncturing should be systematic in order to be decodable by iterative decoding. This constraint is because the messages passed towards information variables are always zero; hence not improving. The normalized distance spectrum of some rate 1/2 codes for a block size of 4,000 are illustrated in Figure 49. These codes are RA code (q = 2), systematic RA code with puncturing (q = 3, p = 3), (q = 4, p = 4), and random code. The distance spectrum of a (n, k) random code is n k random 2−k (61) Aw,d = w d k = 2n−k (62) Arandom d d To compute the cutoff thresholds of these codes using Divsalar bound, the normalized distance spectrum is computed when N goes to infinity, i.e. the asymptotic expression of r (δ). For random codes and the asymptotic expression of r (δ) for
54
4 Very Simple Turbo-like Codes
Fig. 49 Normalized distance spectrum of RA codes with puncturing
random codes with code rate Rc is r (δ) = H (δ) + (Rc − 1) ln 2
(63)
where H (δ) is the binary entropy function. The asymptotic expression of r (δ) for RA code with repetition q can be obtained as: r (δ) = max
0<e<1/q
1−q H (qe) + (1 − δ)H q
qe 2(1 − δ)
+δH
qe 2δ
(64)
Now we obtain the asymptotic expression of r (δ) for systematic punctured RA (q = 3, p = 3). After summing (4.60) over w, we let δ = d/2N for 0 < δ < 1, η = h/2N for 0 < η< 1/2, ρ 1 = i/2N for max(0, ρ 2 + η − 1/2) < ρ 1 < min(ρ 2 , h), and Table VI Cut-off Thresholds for RA Codes with Puncturing Cutoff threshold
RA (q =2)
RA_punc. (q = 3, p = 3)
RA_punc. (q = 4, p = 4)
Random code
Shannon limit
Rate 1/2
3.38 dB
1.49 dB
0.87 dB
0.308 dB
0.184 dB
4.3 RA Codes with Puncturing
55
ρ 2 = j/2N for 0 < ρ 2 > 1/2. Also (2ρ 2 + η)/3 < min(0.5, δ). ⎧ ⎨ ρ1 ρ2 − ρ1 4ρ2 + 2η r (δ) = max −H +ηH + (1/2 − η) H η,ρ1 .ρ2 ⎩ 3 η 1/2 − η
2ρ2 + η + (ρ2 + η − 2ρ1 ) ln(3) + 1/2 − δ + 3 ⎫ ⎬ 2ρ2 + η η/2 + δ− H 3 δ − 2ρ2 +η ⎭
H
η/2 1/2 − δ +
2ρ2 +η 3
(65)
3
To derive the asymptotic expression of r (δ) for RA(q = 4, p = 4), we let δ = d/2N for 0 < δ< 1, η = h/2N for 0 < η < 1/2, ρ 1 = i/2N for 0 < ρ 1 < 1 – η – 2ρ 2 , and ρ 2 = j/2N for 0 < ρ 2 < 1 – η. Also (2ρ 1 + 2ρ 2 + η)/4 < min(0.5, δ). ⎧ ⎨ 3 2ρ + 2ρ +η ρ1 1 2 r (δ) = max − H + (1 − η − 2ρ2 )H η,ρ1 .ρ2 ⎩ 2 2 1 − η − 2ρ2
ρ2 2ρ1 + 2ρ2 + η + (2ρ2 + 2η) ln(2)+(1/2 − η) H + 1/2 − δ+ 1/2 − η 4 ⎫ ⎬ η/2 η/2 2ρ1 +2ρ2 + η ×H H + δ− 4 1/2 − δ + 2ρ1 +2ρ2 +η δ− 2ρ1 +2ρ2 +η ⎭ 4
4
(66) The thresholds for different codes have been computed using a brute force search and are compared in Table VI. In fact, by increasing both repetition and puncturing the rate remains the same, whereas the performance gets better. It is very intriguing to know how much the codes improve as repetition increases and whether they can achieve Shannon limit with infinite repetition and puncturing. However, this is not very interesting from the practical point of view. The complexity of code goes up with large number of repetitions and it is not desirable for high-speed decoders. In next section we improve the code with another simple modification.
4.3.3
DE Analysis
Here the two component codes are repeat code and accumulate code with puncturing. Using Gaussian approximation to obtain the SNR transfer functions of the constituent codes, we get the transfer functions that are shown, in Figure 50.
56
4 Very Simple Turbo-like Codes
Fig. 50 Density evolution for RA codes with puncturing (q = 4, p = 2)
Puncturing the accumulators makes the SNR transfer function to start from (0, 0) point. Therefore the information bits is sent; systematic code; to enable the iterative decoding to start. Compared to RA code (q = 3), which has the same rate, an improvement of 0.1dB in threshold is obtained. Since we have more repetition, the improvement comes at the expense of more complexity. However, the threshold is still far from the good turbo-like codes.
4.4 ARA Codes In this section we investigate the effect of an accumulator precoder on the performance of RA codes. It is shown that we can obtain some codes that achieve extremely near Shannon capacity performance with ML decoding. Unfortunately, the iterative decoding algorithm cannot achieve the same performance. However, there are still some good codes that have performance comparable to turbo codes. The main advantages of these codes are the simplicity and the capability of parallelization. Therefore, practical high-speed decoders can be conceived for these codes. The accumulator precoder is a rate one encoder used to improve the performance of the RA codes. However, only a portion of the information block goes to the accumulator. This is mainly because of message-passing algorithm. In other words,
4.4 ARA Codes
57
Fig. 51 The block diagram of the precoder M ACC N-M
N-M
M bits are passed through without any change and the rest (N – M bits) goes through an accumulator. M is considered a parameter in code design. The effect of this parameter is studied in ML and iterative decoding. Then, it is optimized for achieving the best performance. The block diagram of the precoder is shown in Figure 51.
4.4.1
ML Analysis
In order to find the performance of the code we need to compute the IOWE of the precoder. It is easily computed using the IOWE of the accumulator code as follows: M M pre (67) Aacc Aw,d = w−m,d−m m m=0
Therefore the IOWE of the overall code can be written as: pre−rep(q)−acc( p) Aw,d
=
pre rep(q)−acc( p) N Aw,h Ah,d
h=0
N
(68)
h
For systematic ARA code ( p = 3, q = 3), we have
pre−rep(3)−acc(3)
Aw,d
=
N M
m=0 k=0
M m 3N
N N
min( j,h)
h=0 j=0 i=max(0, j−N +h)
3k
h N −h i j −i
d −1 N −M −k+m h/2 − 1 (w − m)/2 k−m−1 × 3h+ j−2i δ(3k − h − 2 j) (w − m)/2 − 1 ×
N −d h/2
pre−rep(4)−acc(4)
Aw,d
=
N M m=0 k=0
M m 4N
−h−2 j −h 2N N N N −h 2N − h − 2 j j i h=0 j=0
i=0
4k
d −1 N −M −k+m h/2 − 1 (w − m)/2 k−m−1 × 22h+2 j δ(4k − h − 2i − 2 j) (w − m)/2 − 1 ×
(69)
N −d h/2
(70)
58
4 Very Simple Turbo-like Codes
Fig. 52 ARA(3,3) BER performance bound
The above expressions are IOWE of the nonsystematic ARA codes with regular puncturing. The IOWE of their systematic codes are derived by the following conversion. sys−ARA(q, p)
Aw,d 4.4.2
ARA(q, p)
= Aw,d−w
(71)
Performance of ARA Codes with ML Decoding
Divsalar BER performance bound of the ARA(3,3) and ARA(4,4) for different Ms are shown in Figures 52 and 53. It is observed that the more number of bits accumulates in the precoder, the lower the code threshold becomes. However, the improvement stops at a certain point, which is M = 1/5 N for ARA(3,3) and M = 2/5 N for ARA(4,4). It is obvious that when M = N the codes turn into RA with puncturing. It is very heartening that the performance of the ARA(4,4) approaches very closely to that of random codes for the same block size in low E b /N0 region. It is very instructive to observe the distance spectrum of these codes (For optimum M). As we see in Figure 54, the only difference between the distance spectrum of these codes and a random code is in the low-distance region, which causes the error floor.
4.4 ARA Codes
59
Fig. 53 ARA(4,4) BER performance bound
Now we obtain the asymptotic expression of r (δ) for systematic punctured ARA(q = 3, p = 3). After summing (4.71) over w, we let α = M/2N for 0 < α < 1/2, ε1 = m/2N for 0 < ε 1 < α, ε2 = (w – m)/2N for 0 <ε2 < 1/2–α, δ = d/2N for 0 < δ < 1, η = h/2N for 0 < η < 1/2, ρ 1 = i/2N for max(0, ρ 2 + η − 1/2) < ρ 1 < min(ρ 2 , h), and ρ 2 = j/2N for 0 <ρ 2 < 1/2. Also (2ρ 2 + η)/3 < min(0.5, δ).
r (δ) =
⎧ ⎨
αH
max
ε1 ,ε2 ,η,ρ1 .ρ2 ⎩
ε 1
α
−
3 H 2
4ρ2 + 2η 3
+ (ρ2 + η − 2ρ1 ) ln(3)
η/2 δ − ε1 − ε2 η/2 2ρ2 + η + (1/2 − δ + ε1 + ε2 )H − ε1 + 1/2 − δ + ε1 + ε2 3 ε2 /2 2ρ2 + η × H 2ρ +η + ε1 + 1/2 − α − 2 3 − ε1 3 ⎫ ⎬ ε2 /2 ×H (72) 1/2 − α − 2ρ23+η + ε1 ⎭
+ηH
ρ1 η
+ (1/2 − η) H
ρ2 − ρ1 1/2 − η
+ (δ − ε1 − ε2 )H
60
4 Very Simple Turbo-like Codes
Fig. 54 Normalized distance spectrum of ARA codes with puncturing
To derive the asymptotic expression of r(δ) for ARA(q = 4, p = 4), we let α = M/2N for 0 < α < 1/2, ε1 = m/2N for 0 < ε1 < α, ε 2 = (w – m)/2N for 0 < ε2 < 1/2 − α, δ = d/2N for 0 < δ < 1, η = h/2N for 0 < η < 1/2, ρ 1 = i/2N for 0 < ρ 1 < 1 − η – 2ρ 2 , and ρ 2 = j/2N for 0 < ρ 2 < 1 – η. Also (2ρ 1 + 2ρ 2 + η)/4 < min(0.5, δ). ⎧ ⎨
ε 2ρ1 + 2ρ2 + η 1 − 2H + (1 − η − 2ρ2 ) r (δ) = max αH ε1 ,ε2 ,η,ρ1 .ρ2 ⎩ α 2
ρ1 η/2 + (1/2 − δ + ε1 + ε2 )H 1 − η − 2ρ2 1/2 − δ + ε1 + ε2 η/2 2ρ1 + 2ρ2 + η + (δ − ε1 − ε2 )H + ε1 + 1/2 − α + δ − ε1 − ε2 4 ε2 /2 2ρ1 + 2ρ2 + η ×H − ε1 + 2 +η 4 1/2 − α + 2ρ1 +2ρ + ε1 4 ⎫ ⎬ ε2 /2 × H 2ρ +2ρ +η (73) + (2ρ2 + 2η) ln(2) 1 2 ⎭ − ε1
×H
4
4.4 ARA Codes
61
Table VII Cut-off Threshold for Rate 1/2 ARA Codes Cutoff threshold
ARA_punc. (q = 3, p = 3)
ARA_punc. (q = 4, p = 4)
Random code
Shannon limit
Rate 1/2
0.509 dB
0.310 dB
0.308 dB
0.184 dB
Table VII tabulates the Divsalar cutoff threshold for the same codes in Figure 54. As we expected, based on the BER performance bound, the cutoff threshold of ARA(4,4) is extremely close to the cutoff threshold of random code.
4.4.3
DE Analysis
Unfortunately the iterative decoding cannot decode, as well as ML decoding. Moreover, the difference between the performances of two codes with iterative decoding cannot be predicted based on their ML decoding performance. The effect of the precoder in iterative decoding is very clear in Figure 55. The accumulator and the repetition are regarded as one constituent code. The SNR
Fig. 55 Density evolution for ARA codes with puncturing (q = 4, p = 2)
62
4 Very Simple Turbo-like Codes
Fig. 56 Performance of ARA codes using iterative decoding
transfer functions of this code and the simple repetition code are shown for comparison. The noticeable difference is a shift in the curve. This improves the threshold almost by 0.5dB. We have used the DE method to optimize the ARA codes for iterative decoding. ARA(4,4) and ARA(3,3) achieve the best performance with M = 0.7 × N and M = 0.5 × N , respectively. The performances of these codes are illustrated in Figure 56.
4.5
Other Precoders
Although the ML decoding performance of ARA codes with simple accumulate precoder is very close to that of random codes, there is no known practical method to actually perform this decoding. Therefore, we are looking for codes that have good performance with iterative decoding. We have observed that with different precoders very good codes can be obtained. In this section we introduce some of these precoders.
4.5 Other Precoders
63
Fig. 57 The block diagram of the new precoder
P0
N
4.5.1
ACC
P1
N
M
N-M
Accumulator with Puncturing
A simple modification is to use puncturing at the output of the accumulator in the precoder. A general block diagram of this precoder is drawn in Figure 57. P0 and P1 denote puncturing with two different patterns. Two puncturing blocks usually have complement patterns. Two points are worth noting here. First, in this precoder all the information bits go to the accumulator contrary to the accumulator precoder. Second, the precoder rate is one. Using the above precoder, we have designed a new ARA code, which have better performance. The optimum value for M is N /2. Therefore one out of two is punctured both from the information bits and from accumulated sequence with complement patterns. The RA code is RA(q = 3, p = 3). The Tanner graph for this code is shown in Figure 58. This code has 0.1dB better threshold and better error floor as shown in Figure 59.
0
1
2
3
4
5
6
7
5
6
7
Interleaver
0
1
2
Fig. 58 Tanner graph for new ARA code
3
4
64
4 Very Simple Turbo-like Codes
Fig. 59 Performance of the new ARA code
4.6 Hardware Complexity Since the building blocks of the decoder is repetition and check nodes, like LDPC codes, and the message passing are very simple for these blocks, the hardware complexity is very low as far as the logic is concerned. The memory requirement depends on the number of edges in the graph. ARA codes have quite small number of edges, which results in low memory requirement.
4.7 Conclusion This study proposes a novel coding structure which is not only very simple, but also achieves the performance comparable or better than the best practical turbo codes and LDPC codes. The ML analysis showed that in some cases they are extremely close to random codes, which achieve the Shannon limit. The proposed coding scheme generates a family of LDPC codes for various code rates and data frame sizes, and with a performance close to the Shannon capacity
4.7 Conclusion
65
limit. Unlike general LDPC codes, they have very simple encoding too. The main innovation is in the inclusion of a very simple precoder, which is constructed by parallel punctured accumulators. Such precoder improves the performance. The regularity and simplicity of the proposed coding structure also allows the construction of very high-speed iterative decoders using message-passing algorithm.
Chapter 5
High Speed Turbo-like Decoders
5.1
Introduction
This chapter presents the architecture for high-speed decoding of ARA codes. Whereby, message-passing algorithm enables us to achieve parallelism. Simulations have shown that efficiency is not compromised in order to obtain speed gains. Like parallel turbo decoder memory access poses a practical problem. We extend the concept of conflict-free interleaver to address this problem. This leads to the introduction of a new class of turbo-like codes that can be decoded very fast. It is shown that the proposed high-speed turbo and ARA decoder are among the codes in this class. The general architecture for decoding this class is presented.
5.2
Parallel ARA Decoder
To build high-speed decoders for ARA codes we go along the similar way used for high-speed turbo decoders. The basic idea is to partition the graph into several subgraphs and let them work in parallel. For hardware regularity it is desirable that subgraphs are identical or have minimum variety. As an example, the ARA code shown in Figure 58 is considered. The partitioned graph is drawn in Figure 60. Each subgraph is decoded using the messages passing algorithm. Since subgraphs have tree structure, the efficient scheduling provides the fastest decoding method. Usually the decoding for each subgraph is done serially, which lowers the complexity. The hardware entity that performs the decoding for one subgraph is called subgraph processor or window processor – each subgraph corresponds to a window of the code word. There are three types of messages that are communicated within/between subblocks: Internal messages are those that correspond to edges within one subgraph. Border messages are those that are related to edges connecting two adjacent subgraphs. External messages are passed between subgraphs through the interleaver. In other words, they correspond to edges with global span. External messages are called extrinsics after their counterparts in turbo codes. Aliazam Abbasfar, Turbo-Like Codes, 67–80. c Springer 2007
67
68
5 High Speed Turbo-like Decoders
0
1
2
3
4
5
6
7
5
6
7
Interleaver
0
1
2
3
4
Fig. 60 The partitioned graph of ARA code
We need memory for storing all messages. The memory for internal messages is local to the subgraph processors, whereas for external messages it is a global memory that all subgraphs should have access to it. Border messages are usually stored in registers that are part of subgraph processors. They are exchanged with neighboring subgraphs at the end of each iteration. Therefore the architecture for the decoder is like in Figure 61, in which a and b are border messages and x and y are extrinsics. Internal messages are not shown. This architecture is very similar to parallel turbo decoder. The only difference is that there are two different window processors here, which are denoted as W and W.
Fig. 61 Parallel turbo decoder structure
5.3 Speed Gain and Efficiency
Unlike turbo decoders, parallel processing here does not cost much additional processing. This is because LDPC codes are inherently parallel; what we are doing here is making the parallelization practically feasible.
5.4 Interleaver Design
Although the message-passing algorithm allows us to parallelize the decoding process, accessing so many extrinsics at the same time poses a practical problem. Since M window processors are running at the same time, M extrinsics are used simultaneously. The extrinsic memory is organized in M banks in order to facilitate this simultaneous access; i.e., M locations are accessed at once. As discussed for the parallel turbo decoder, the interleaver should be such that the window processors get the extrinsics from different banks of memory in interleaved order as well. This forces us to use the conflict-free interleaver presented for the parallel turbo decoder. In this section, however, we look at the problem from a graphical point of view. The parallel decoder comprises M identical processors running in parallel. We put the partitions in parallel planes and then look at the projected graph. The projected graph for the ARA code of Figure 60 is shown in Figure 62. The projected graph can be viewed as the vectorized version of the actual graph; in other words, there is a message vector associated with every edge in the projected graph. The structure of the message memories is such that only one message vector is accessible at a time.
Fig. 62 Projected graph
Fig. 63 Projected graph with conflict-free interleaver
message vectors in its entirety, but the permutation is allowed within a vector. The permutation within a vector is the permutation among the window processors or different planes in the overall graph. This permutation does not change the projected graph. Therefore the interleaver consists of several independent permutations within the message vectors. The way vectors are connected between two constituent codes is another flexibility in the interleaver design. An example of the projected graph with conflict-free interleaver is shown in Figure 63. The dashed edges indicate that the permutation is allowed within a vector. The above connections not only guarantee the conflict-free structure, but also ensure that messages are propagated throughout the graph. Therefore projected graph provides a very useful approach for designing turbo-like codes for high speed decoding.
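The construction can be made concrete with a short sketch. We assume W message vectors of M messages each, with position v*M + m denoting plane m of vector v, and message m of any vector stored in bank m; this representation is chosen for the example, not taken from the book:

    import random

    # A conflict-free interleaver: one independent permutation per message
    # vector, plus a free reordering of the vectors between the two
    # constituent codes. Position k = v*M + m is plane m of vector v.
    def conflict_free_interleaver(M, W, seed=0):
        rng = random.Random(seed)
        vector_order = list(range(W))
        rng.shuffle(vector_order)             # how vectors are connected
        pi = []
        for v in range(W):
            perm = list(range(M))
            rng.shuffle(perm)                 # permutation within one vector
            pi.extend(vector_order[v] * M + perm[m] for m in range(M))
        return pi

    # Check conflict-freeness: every vector access hits each of the M memory
    # banks exactly once.
    M, W = 4, 6
    pi = conflict_free_interleaver(M, W)
    for v in range(W):
        assert {pi[v * M + m] % M for m in range(M)} == set(range(M))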
5.5 Projected Graph

In this section, a design methodology for turbo-like codes based on the projected graph is presented. There are two approaches to designing codes based on projected graphs. The first is the one used so far to parallelize the decoder: partitioning an existing code graph into subgraphs. This method works on any regular or semiregular graph.
Fig. 64 A PCCC projected graph with conflict-free interleaver
The projected graph includes one partition of each constituent code, called a component graph. The component graphs are connected with conflict-free interleavers. This method was used for the ARA code whose projected graph is shown in Figure 63. It is shown that the parallel turbo decoder is a member of this class. Later on, LDPC codes based on the projected graph are also introduced.
5.5.1 Parallel Turbo Decoder
The parallel turbo decoder proposed in Chapter 3 is one example of a code based on a projected graph. The projected graph of such a code is illustrated in Figure 64. It is instructive to note that the reverse interleaver, used for decreasing the latency, is clearly visible, connecting the edges of the two component graphs in reverse order. From the projected-graph point of view, the interleaver structure is just several independent permutations. The number of permutations needed equals the window size; here it is 4.
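As a trivial sketch of that vector-level connection (illustrative only, with the window size W as a parameter):

    # Reverse interleaver at the vector level: vector v of the first component
    # graph connects to vector W-1-v of the second, so the second decoder can
    # start on vector W-1 while the first is still producing later vectors.
    def reverse_vector_order(W):
        return [W - 1 - v for v in range(W)]

    print(reverse_vector_order(4))   # [3, 2, 1, 0] for the 4-window example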
5.5.2 Other Known Turbo-like Codes
The class of codes based on the projected graph covers a wide range of turbo-like codes. This section presents the projected graphs of some known turbo and turbo-like codes. Figure 65 illustrates the projected graphs for a parallel turbo code with three constituent codes, a serial turbo code, an RA code, and an IRA code.
Fig. 65 (a) PCCC with three component codes; (b) SCCC; (c) RA(3); (d) IRA(2,3)
The second approach is to design the code by designing its projected graph directly. In this method we design the component graphs and the connections between them so as to obtain good performance; in other words, the partitions are designed first and then put together to create the constituent codes. This approach is very appealing because the resulting code is parallelizable by construction, and its performance can be analyzed very efficiently through its component graphs. The first example of this approach is parallel LDPC codes, which are explained in the following section.
5.5.3 Parallel LDPC Codes
This section explains how to design LDPC codes with parallel decoding capability. This class of LDPC codes was independently discovered by Richardson et al. [29], who call them vector LDPC codes. Thorpe [32] introduced LDPC codes based on protographs, which is basically the same concept. In this section we present this class as codes with a projected graph. There are two component graphs in these codes: one contains only single parity-check codes (of varying degree) and the other only repetition codes (of varying degree). One example of such a code is shown in Figure 66. There are some noticeable facts about this projected graph. All variable nodes are in one component graph, which means that all the observations are stored and processed in one kind of window processor. Variable and check nodes can have different degrees; therefore this structure is capable of implementing both regular and irregular LDPC codes, and the degree distribution of the variable and check nodes can be read directly from the projected graph. There are no internal messages and no border messages passed between adjacent subgraphs; we only have external edges. Therefore the projected graph can be represented as a simple Tanner graph; the graphical representation for the above example is shown in Figure 67. The number of interleavers needed for this code equals the number of edges of the projected graph. The only disadvantage of this design method is that it does not provide an efficient encoder; sometimes simple encoding is not possible for codes designed this way.
Fig. 66 A parallel LDPC projected graph
Fig. 67 Simple graphical representation of an LDPC projected graph
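To make the construction concrete, the following sketch lifts a projected graph (a base matrix) into a full parity-check matrix by replacing each base edge with an independent M × M permutation, in the spirit of the vector-LDPC and protograph constructions cited above. The base matrix below is hypothetical, chosen only for illustration:

    import random

    # Lift a 0/1 base matrix (the projected Tanner graph) into a full
    # parity-check matrix: every base edge becomes an M x M permutation,
    # i.e. one independent interleaver per edge of the projected graph.
    def lift(base, M, seed=0):
        rng = random.Random(seed)
        rows, cols = len(base), len(base[0])
        H = [[0] * (cols * M) for _ in range(rows * M)]
        for r in range(rows):
            for c in range(cols):
                if base[r][c]:
                    perm = list(range(M))
                    rng.shuffle(perm)        # the per-edge interleaver
                    for m in range(M):
                        H[r * M + m][c * M + perm[m]] = 1
        return H

    base = [[1, 1, 1, 0],                    # hypothetical projected graph with
            [0, 1, 1, 1]]                    # check degree 3, mixed variable degrees
    H = lift(base, M=4)                      # 8 x 16 parity-check matrix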
5.5.4 More Accumulate–Repeat–Accumulate Codes
In this section some improved ARA codes are introduced, designed based on projected graphs. These are ARA codes with different code rates, and their thresholds are compared with the channel capacity limits.
5.5.4.1 Code Rates <1/2
The first idea that comes to mind for lowering the rate is to transmit the punctured bits of the accumulator. We start with the ARA code without puncturing, which gives the lowest code rate; its projected graph is shown in Figure 68, and the code rate is 1/4 in this case. By puncturing more bits in the accumulator, ARA codes with higher rates are obtained. The projected graphs of some codes derived this way are drawn in Figure 69. It should be noted that codes with rates larger than 1/2 cannot be constructed by puncturing still more bits, because iterative decoding fails to start when so many bits are punctured. The thresholds for these codes are listed in Table VIII. Although the construction of the above codes is very simple, they are not competitive with very good codes as far as performance is concerned. With some minor changes in the projected graph of the rate 1/2 code, better codes with lower rates are obtained; their projected graphs are shown in Figure 70. The rate 1/3 code is derived by encoding zero bits instead of one information bit, i.e., variable node 1 in the upper row.
Fig. 68 ARA code without puncturing
Fig. 69 (a) Rate 1/3 ARA code; (b) rate 1/2 ARA code
Table VIII Cutoff Threshold for ARA Codes with Rate <1/2

Rate    Threshold    Shannon limit
1/2     0.51 dB      0.184 dB
1/3     0.09 dB      −0.5 dB
1/4     0.02 dB      −0.7 dB
The rate 1/4 code is constructed by also sending the intermediate variable that is punctured in the rate 1/2 and 1/3 codes. The thresholds of these codes are listed in Table IX and compared to the Shannon limit.
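The Shannon limits quoted in Tables VIII and IX appear to be the binary-input AWGN capacity limits for each rate. As a side note (not from the book), such a limit can be estimated numerically; the sketch below uses a Monte Carlo capacity estimate with bisection, with illustrative sample counts, and reproduces the rate 1/2 figure to within the simulation noise:

    import math, random

    # Estimate the binary-input AWGN capacity C(sigma) by Monte Carlo:
    # C = 1 - E[log2(1 + exp(-2*y/sigma^2))] with y ~ N(1, sigma^2),
    # then solve C(sigma) = R by bisection and convert to Eb/N0 in dB.
    def bi_awgn_capacity(sigma, n=100000, seed=1):
        rng = random.Random(seed)
        s2 = sigma * sigma
        total = sum(math.log2(1.0 + math.exp(-2.0 * rng.gauss(1.0, sigma) / s2))
                    for _ in range(n))
        return 1.0 - total / n

    def shannon_limit_db(rate):
        lo, hi = 0.1, 5.0                     # sigma bracket; C decreases in sigma
        for _ in range(30):
            mid = 0.5 * (lo + hi)
            lo, hi = (mid, hi) if bi_awgn_capacity(mid) > rate else (lo, mid)
        sigma = 0.5 * (lo + hi)
        return 10.0 * math.log10(1.0 / (2.0 * rate * sigma * sigma))

    print(round(shannon_limit_db(0.5), 2))    # ~0.19 dB for rate 1/2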
5.5.4.2 Code Rates = 1/2
The ML analysis of ARA codes showed that their performance improves as more bits are accumulated in the precoder. This was the inspiration in our search for better ARA codes. Figure 71 illustrates two codes that precode 2/3 and 3/4 of the information bits. The thresholds for these codes are 0.4 dB and 0.33 dB, respectively, which are 0.1 dB and 0.17 dB better than the best rate 1/2 code designed so far. So far the repetitions used in the codes are the same for all variables; using varying repetitions gives further improvement. An example of an ARA code with irregular repetition is shown in Figure 72; its threshold is 0.367 dB. It is important to understand that the structure of the projected graph is of major importance in achieving good performance. The projected graphs presented here are the result of an extensive trial-and-error search.
Fig. 70 (a) Rate 1/2 ARA code; (b) new rate 1/3 ARA code; (c) new rate 1/4 ARA code
Table IX Cutoff Threshold for Improved ARA Codes with Rate <1/2

Rate    Threshold    Shannon limit
1/2     0.51 dB      0.184 dB
1/3     −0.05 dB     −0.5 dB
1/4     −0.15 dB     −0.7 dB
Fig. 71 Improved rate 1/2 ARA codes
Fig. 72 Irregular rate 1/2 ARA codes
Table X Cutoff Threshold for ARA Codes with Rate >1/2

Rate     Threshold (dB)    Shannon limit (dB)    Difference (dB)
4/7      0.700             0.530                 0.170
5/8      1.006             0.815                 0.191
2/3      1.272             1.059                 0.213
7/10     1.506             1.272                 0.234
8/11     1.710             1.459                 0.251
3/4      1.894             1.626                 0.268
10/13    2.057             1.777                 0.280

5.5.4.3 Code Rates >1/2
In this section we present a family of ARA codes derived from the rate 1/2 irregular ARA code of Figure 72. The projected graph of these codes is shown in Figure 73, and the performance of the family is listed for different rates in Table X, which also shows how close each threshold is to the Shannon limit.
5.6 General Hardware Architecture

In this section we present a general hardware architecture for implementing parallel turbo-like decoders. Without loss of generality we focus on turbo-like codes with two constituent codes; this is easily extended to codes with several constituent codes by grouping them into two combined constituent codes. The general hardware architecture is shown in Figure 74, where EXTn denotes the external memory for the nth window processor. Since the processors are identical and run in parallel, the scheduling is the same for all of them; therefore only one scheduling controller is needed for each constituent code. The scheduling controller determines which message vector is accessed and which permutation is used. The permutor is a memoryless block that permutes the message vector on the fly.
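A behavioral sketch of the permutor and its bank access follows; the interfaces are invented for illustration and are not the book's implementation:

    # The permutor of Figure 74 as a memoryless block: each cycle the
    # scheduling controller issues an address (which message vector) and a
    # permutation select; the M words read from the M extrinsic banks are
    # reordered on the fly before reaching the window processors.
    def permute(values, perm):
        out = [0.0] * len(values)
        for m, v in enumerate(values):
            out[perm[m]] = v                  # route bank m to processor perm[m]
        return out

    def read_extrinsics(banks, address, perm):
        words = [bank[address] for bank in banks]   # one word per bank
        return permute(words, perm)

    banks = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]    # M = 3 banks, 2 vectors
    print(read_extrinsics(banks, address=1, perm=[2, 0, 1]))  # [0.4, 0.6, 0.2]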
Fig. 73 Irregular ARA code family for rate >1/2
Fig. 74 Parallel decoder hardware architecture
Since the message vectors are permuted differently, the permutor must be programmable. If M, the number of window processors, is large, the permutor can become the bottleneck of the hardware design. The architecture of one window processor is depicted in Figure 75. AM and BM denote the registers that contain the border messages. The observation memory is loaded at the beginning of the decoding and remains intact until the end; this memory is not necessary for all window processors.
Fig. 75 Window processor hardware architecture
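The storage of one window processor can be summarized in a small sketch; the field names are illustrative, not taken from the design:

    from dataclasses import dataclass, field

    # Memory map of one window processor, per Figure 75: an observation
    # memory loaded once before decoding, a memory for internal messages,
    # and the AM/BM border registers surrounding the message-passing core.
    @dataclass
    class WindowProcessorMemory:
        observations: list = field(default_factory=list)  # loaded once, then read-only
        internal: dict = field(default_factory=dict)      # edge -> internal message
        am: float = 0.0                                   # border register AM
        bm: float = 0.0                                   # border register BM

    wp = WindowProcessorMemory(observations=[0.7, -1.2, 0.3])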
5.7 Conclusion

In this chapter, an architecture for high-speed decoding of ARA codes was presented. Two major issues in high-speed decoding were addressed: parallel processing and the memory access problem. This led to the introduction of a new class of turbo-like codes that can be decoded very fast: the codes with a projected graph. This classification provides an alternative method for designing turbo-like codes for high-speed decoding. It was shown that the proposed high-speed turbo and ARA decoders belong to this class. The general architecture for decoding this class of codes was also presented.

The generalized coding structure developed during this research is a powerful approach toward designing turbo-like codes that are suitable for high-speed decoding. However, some areas are not yet covered or could complement this research; they are described as follows.

First, in designing ARA codes the focus was on improving the threshold, while another important aspect of performance, the error floor, was largely ignored. Two factors affect the error floor of a code: the code structure and the interleaver design. Since the code structure is selected to obtain a certain threshold, the interleaver design must be used to improve the error floor. ARA codes with pseudorandom interleavers usually have high error floors; we have been able to improve the error floor by orders of magnitude through manual changes to the interleavers. It is very important to find a systematic way to design or modify interleavers for a low error floor. The design of algorithmic interleavers is a more challenging topic, and one of more practical interest.

Second, the search for good codes based on their projected graph is very rewarding. Since the structure of such a code guarantees the high-speed decoding capability, the only concern is the performance of the code. On the other hand, the projected graph should be very simple, which makes the search easier. One simple way of approaching this problem is to start with known projected graphs and make some changes; analysis of the resulting code determines whether the change is good or not. We have pursued this approach, and preliminary results show its effectiveness.
References
1. A. Abbasfar and K. Yao, An efficient and practical architecture for high speed turbo decoders, Proceedings of VTC, Vol. 1, October 2003, pp. 337–341.
2. A. Abbasfar and K. Yao, Interleaver design for high speed turbo decoders, Proceedings of SPIE, Vol. 5205, August 2003, pp. 282–290.
3. A. Abbasfar and K. Yao, An efficient architecture for high-speed turbo decoders, Proceedings of ICASSP 2003, April 2003, pp. IV-521–IV-524.
4. S. Aji and R.J. McEliece, The generalized distributive law, IEEE Trans. Inform. Theory, March 2000, 46(2), 325–343.
5. L.R. Bahl, J. Cocke, F. Jelinek, and J. Raviv, Optimal decoding of linear codes for minimum symbol error rate, IEEE Trans. Inform. Theory, March 1974, 20(2), 284–287.
6. S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, Soft-input soft-output APP module for iterative decoding of concatenated codes, IEEE Commun. Lett., January 1997, 22–24.
7. S. Benedetto, D. Divsalar, G. Montorsi, and F. Pollara, Serial concatenation of interleaved codes: performance analysis, design, and iterative decoding, IEEE Trans. Inform. Theory, May 1998, 44(3), 909–926.
8. S. Benedetto and G. Montorsi, Unveiling turbo codes: some results on parallel concatenated codes, IEEE Trans. Inform. Theory, March 1996, 42(2), 409–428.
9. C. Berrou, A. Glavieux, and P. Thitimajshima, Near Shannon limit error correcting coding and decoding: turbo codes, Proceedings of the 1993 IEEE International Conference on Communications, Geneva, Switzerland, May 1993, pp. 1064–1070.
10. D. Divsalar, A simple tight bound on error probability of block codes with application to turbo codes, JPL TMO Progress Report 42-139, November 1999, pp. 1–35.
11. D. Divsalar, S. Dolinar, and F. Pollara, Iterative turbo decoder analysis based on Gaussian density evolution, IEEE J. Select. Areas Commun., May 2001, 19(5), 891–907.
12. D. Divsalar, H. Jin, and R.J. McEliece, Coding theorems for turbo-like codes, Proceedings of the 36th Allerton Conference on Communication, Control and Computing, Allerton House, Monticello, IL, September 1998, pp. 201–210.
13. B.J. Frey, F.R. Kschischang, and P.G. Gulak, Concurrent turbo-decoding, Proceedings of the IEEE International Symposium on Information Theory, Ulm, Germany, July 1997, p. 431.
14. R. Gallager, Low-Density Parity-Check Codes, MIT Press, Cambridge, MA, 1963.
15. J. Hsu and C.H. Wang, A parallel decoding scheme for turbo codes, IEEE Symposium on Circuits and Systems, Vol. 4, Monterey, June 1998, pp. 445–448.
16. H. Jin, Analysis and Design of Turbo-like Codes, Ph.D. thesis, California Institute of Technology, Pasadena, 2001.
17. H. Jin, A. Khandekar, and R. McEliece, Irregular repeat–accumulate codes, Proceedings of the 2nd International Symposium on Turbo Codes, Brest, France, 2000, pp. 1–8.
18. F.R. Kschischang and B.J. Frey, Iterative decoding of compound codes by probability propagation in graphical models, IEEE J. Select. Areas Commun., February 1998, 16(2), 219–230.
19. S.L. Lauritzen and D.J. Spiegelhalter, Local computations with probabilities on graphical structures and their applications in expert systems, J. R. Stat. Soc. B, 1988, 50, 157–224.
20. M. Luby, M. Mitzenmacher, M.A. Shokrollahi, D.A. Spielman, and V. Stemann, Practical loss-resilient codes, Proceedings of the 29th Symposium on Theory of Computing, 1997, pp. 150–157.
21. M. Luby, M. Mitzenmacher, M.A. Shokrollahi, and D.A. Spielman, Improved low-density parity-check codes using irregular graphs, IEEE Trans. Inform. Theory, 2001, 47, 585–598.
22. D.J.C. MacKay and R.M. Neal, Good codes based on very sparse matrices, in: C. Boyd (ed.), Cryptography and Coding, 5th IMA Conference, No. 1025 in Lecture Notes in Computer Science, Springer, Berlin, 1995, pp. 100–111.
23. D.J.C. MacKay, Good error correcting codes based on very sparse matrices, IEEE Trans. Inform. Theory, 1999, 45(2), 399–431.
24. R.J. McEliece, D.J.C. MacKay, and J.F. Cheng, Turbo decoding as an instance of Pearl's belief propagation algorithm, IEEE J. Select. Areas Commun., February 1998, 16(2), 140–152.
25. J. Pearl, Fusion, propagation, and structuring in belief networks, Artif. Intell., 1986, 29, 241–288.
26. G. Poltyrev, Bounds on the decoding error probability of binary linear codes via their spectra, IEEE Trans. Inform. Theory, July 1994, 40(4), 1284–1292.
27. T. Richardson and R. Urbanke, The capacity of low density parity check codes under message passing decoding, IEEE Trans. Inform. Theory, February 2001, 47(2), 599–618.
28. T. Richardson, M.A. Shokrollahi, and R. Urbanke, Design of capacity-approaching irregular low-density parity-check codes, IEEE Trans. Inform. Theory, February 2001, 47(2), 619–637.
29. T. Richardson et al., Methods and apparatus for decoding LDPC codes, United States Patent 6,633,856, October 14, 2003.
30. C.E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., 1948, 27, 379–423.
31. R.M. Tanner, A recursive approach to low complexity codes, IEEE Trans. Inform. Theory, 1981, IT-27, 533–547.
32. J. Thorpe, Low-Density Parity-Check (LDPC) Codes Constructed from Protographs, JPL IPN Progress Report 42-154, August 15, 2003.
33. A.J. Viterbi and A.M. Viterbi, An improved union bound for binary-input linear codes on the AWGN channel, with applications to turbo decoding, Proceedings of the IEEE Information Theory Workshop, February 1998.
34. N. Wiberg, Codes and Decoding on General Graphs, Linköping Studies in Science and Technology, Dissertation No. 440, Linköping University, Linköping, Sweden, 1996.