VLSI Design for Video Coding
Youn-Long Steve Lin • Chao-Yang Kao Huang-Chih Kuo • Jian-Wen Chen
VLSI Design for Video Coding: H.264/AVC Encoding from Standard Specification to Chip
Prof. Youn-Long Steve Lin, National Tsing Hua University, Dept. of Computer Science, 101 Kuang Fu Road, Section 2, HsinChu 300, Taiwan, R.O.C.
Huang-Chih Kuo, National Tsing Hua University, Dept. of Computer Science, 101 Kuang Fu Road, Section 2, HsinChu 300, Taiwan, R.O.C.
Chao-Yang Kao, National Tsing Hua University, Dept. of Computer Science, 101 Kuang Fu Road, Section 2, HsinChu 300, Taiwan, R.O.C.
Jian-Wen Chen, National Tsing Hua University, Dept. of Computer Science, 101 Kuang Fu Road, Section 2, HsinChu 300, Taiwan, R.O.C.
ISBN 978-1-4419-0958-9    e-ISBN 978-1-4419-0959-6
DOI 10.1007/978-1-4419-0959-6
Springer New York Dordrecht Heidelberg London
Library of Congress Control Number: 2009943294
© Springer Science+Business Media, LLC 2010
All rights reserved. This work may not be translated or copied in whole or in part without the written permission of the publisher (Springer Science+Business Media, LLC, 233 Spring Street, New York, NY 10013, USA), except for brief excerpts in connection with reviews or scholarly analysis. Use in connection with any form of information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed is forbidden.
The use in this publication of trade names, trademarks, service marks, and similar terms, even if they are not identified as such, is not to be taken as an expression of opinion as to whether or not they are subject to proprietary rights.
Printed on acid-free paper
Springer is part of Springer Science+Business Media (www.springer.com)
Preface
A video signal is represented as a sequence of frames of pixels. There exists a vast amount of redundant information that can be eliminated with video compression technology so that transmission and storage become more efficient. To facilitate interoperability between compression at the video producing source and decompression at the consumption end, several generations of video coding standards have been defined and adopted. After MPEG-1 for VCD and MPEG-2 for DVD applications, H.264/AVC is the latest and most advanced video coding standard defined by the international standard organizations. Its high compression ratio comes at the expense of more computation-intensive coding algorithms. For low-end applications, software solutions are adequate. For high-end applications, dedicated hardware solutions are needed. This book describes an academic project of developing an application-specific VLSI architecture for H.264/AVC video encoding. Each subfunction is analyzed before a suitable parallel-processing architecture is designed. Integration of the subfunctional modules as well as their integration into a bus-based SOC platform is presented. The whole encoder has been prototyped using an FPGA. Intended readers are researchers, educators, and developers in video coding systems, hardware accelerators for image/video processing, and high-level synthesis of VLSI. In particular, those who are interested in state-of-the-art parallel architecture and implementation of intra prediction, integer motion estimation, fractional motion estimation, discrete cosine transform, context-adaptive binary arithmetic coding, and deblocking filter will find design ideas in this book. HsinChu, Taiwan, ROC
Youn-Long Lin Chao-Yang Kao Huang-Chih Kuo Jian-Wen Chen
Acknowledgments
Cheng-Long Wu, Cheng-Ru Chang, Chun-Hsin Lee, Chun-Lin Chiu, Hao-Ting Huang, Huan-Chun Tseng, Huan-Kai Peng, Hui-Ting Yang, Jhong-Wei Gu, Kai-Hsiang Chang, Li-Cian Wu, Ping Chao, Po-Sheng Liu, Sheng-Tsung Hsu, Sheng-Yu Shih, Shin-Chih Lee, Tzu-Jen Lo, Wei-Cheng Huang, Yu-Chien Kao, Yuan-Chun Lin, and Yung-Hung Chan of the Theda.Design Group, National Tsing Hua University contributed to the development of the H.264 Video Encoder System described in this book. The authors appreciate financial support from Taiwan's National Science Council under Contracts no. 95-2220-E-007-024, 96-2220-E-007-013, and 97-2220-E-007-003 and the Ministry of Economic Affairs under Contracts no. 94-EC-17-A-01-S1-038, 95-EC-17-A-01-S1-038, and 96-EC-17-A-01-S1-038. Financial support from Taiwan Semiconductor Manufacturing Company Limited (TSMC) and Industrial Technology Research Institute (ITRI) is also greatly appreciated. Global Unichip Corp. provided us with its UMVP multimedia SOC platform and consultation during the FPGA prototyping stage of the development. The authors are grateful to Chi Mei Optoelectronics for a 52-in. Quad Full HD display panel. Joint research with the Microprocessor Research Center (MPRC) of Peking University has been an important milestone of this project.
Contents

1 Introduction to Video Coding and H.264/AVC
  1.1 Introduction
      1.1.1 Basic Coding Unit
      1.1.2 Video Encoding Flow
      1.1.3 Color Space Conversion
      1.1.4 Prediction of a Macroblock
      1.1.5 Intraframe Prediction
      1.1.6 Interframe Prediction
      1.1.7 Motion Vector
      1.1.8 Prediction Error
      1.1.9 Space-Domain to Frequency-Domain Transformation of Residual Error
      1.1.10 Coefficient Quantization
      1.1.11 Reconstruction
      1.1.12 Motion Compensation
      1.1.13 Deblocking Filtering
  1.2 Book Organization

2 Intra Prediction
  2.1 Introduction
      2.1.1 Algorithm
      2.1.2 Design Consideration
  2.2 Related Works
      2.2.1 Prediction Time Reduction Approaches
      2.2.2 Hardware Area Reduction Approaches
  2.3 A VLSI Design for Intra Prediction
      2.3.1 Subtasks Scheduling
      2.3.2 Architecture
      2.3.3 Evaluation
  2.4 Summary

3 Integer Motion Estimation
  3.1 Introduction
      3.1.1 Algorithms
      3.1.2 Design Considerations
  3.2 Related Works
      3.2.1 Architecture
      3.2.2 Data-Reuse Schemes
  3.3 A VLSI Design for Integer Motion Estimation
      3.3.1 Proposed Data-Reuse Scheme
      3.3.2 Architecture
      3.3.3 Data Flow
      3.3.4 Evaluation
  3.4 Summary

4 Fractional Motion Estimation
  4.1 Introduction
      4.1.1 Algorithms
      4.1.2 Design Considerations
  4.2 Related Works
  4.3 A VLSI Design for Fractional Motion Estimation
      4.3.1 Proposed Architecture
      4.3.2 Proposed Resource Sharing Method for SATD Generator
      4.3.3 Evaluation
  4.4 Summary

5 Motion Compensation
  5.1 Introduction
      5.1.1 Algorithms
      5.1.2 Design Considerations
  5.2 Related Works
      5.2.1 Memory Traffic Reduction
      5.2.2 Interpolation Engine
  5.3 A VLSI Design for Motion Compensation
      5.3.1 Motion Vector Generator
      5.3.2 Interpolator
      5.3.3 Evaluation
  5.4 Summary

6 Transform Coding
  6.1 Introduction
      6.1.1 Algorithms
      6.1.2 Design Consideration
  6.2 Related Works
      6.2.1 Multitransform Engine Approaches
      6.2.2 Trans/Quan or InvQuan/InvTrans Integration Approaches
  6.3 A VLSI Design for Transform Coding
      6.3.1 Subtasks Scheduling
      6.3.2 Architecture
      6.3.3 Evaluation
  6.4 Summary

7 Deblocking Filter
  7.1 Introduction
      7.1.1 Deblocking Filter Algorithm
      7.1.2 Subtasks Processing Order
      7.1.3 Design Considerations
  7.2 Related Works
  7.3 A VLSI Design for Deblocking Filter
      7.3.1 Subtasks Scheduling
      7.3.2 Architecture
      7.3.3 Evaluation
  7.4 Summary

8 CABAC Encoder
  8.1 Introduction
      8.1.1 CABAC Encoder Algorithm
      8.1.2 Subtasks Processing Order
      8.1.3 Design Consideration
  8.2 Related Works
  8.3 A VLSI Design for CABAC Encoder
      8.3.1 Subtasks Scheduling
      8.3.2 Architecture
      8.3.3 Evaluation
  8.4 Summary

9 System Integration
  9.1 Introduction
      9.1.1 Algorithm
      9.1.2 Design Consideration
  9.2 Related Works
  9.3 A VLSI Design for H.264/AVC Encoder
      9.3.1 Subtasks Scheduling
      9.3.2 Architecture
      9.3.3 Evaluation
  9.4 Summary

References

Index
Chapter 1
Introduction to Video Coding and H.264/AVC
Abstract A video signal is represented as a sequence of frames of pixels. There exists a vast amount of redundant information that can be eliminated with video compression technology so that transmission and storage become more efficient. To facilitate interoperability between compression at the video producing source and decompression at the consumption end, several generations of video coding standards have been defined and adopted. For low-end applications, software solutions are adequate. For high-end applications, dedicated hardware solutions are needed. This chapter gives an overview of the principles behind video coding in general and the advanced features of the H.264/AVC standard in particular. It serves as an introduction to the remaining chapters, each covering an important coding tool and its VLSI architectural design for an H.264/AVC encoder.
1.1 Introduction A video encoder takes as its input a video sequence, performs compression, and then produces as its output bit-stream data which can be decoded back to a video sequence by a standard-compliant video decoder. A video signal is a sequence of frames. It has a frame rate defined as the number of frames per second (fps). For typical consumer applications, 30 fps is adequate. However, it could be as high as 60 or 72 for very high-end applications or as low as 10 or 15 for video conferencing over a low-bandwidth communication link. A frame consists of a two-dimensional array of color pixels. Its size is called the frame resolution. A standard definition (SD) frame has 720 × 480 pixels per frame, whereas a full high definition (FullHD) one has 1,920 × 1,088. There is a large number of frame-size variations developed for various applications such as computer monitors. A color pixel is composed of three elementary components: R, G, and B. Each component is digitized to an 8-bit datum for consumer applications or a 12-bit one for high-end applications.
The data rate for a raw video signal is huge. For example, a 30-fps FullHD one will have a data rate of 30 × 1,920 × 1,088 × 3 × 8 ≈ 1.5 Gbps, which is impractical for today's communication or storage infrastructure. Fortunately, by taking advantage of the characteristics of the human visual system and the redundancy in the video signal, we can compress the data by two orders of magnitude without sacrificing the quality of the decompressed video.
1.1.1 Basic Coding Unit In order for a video encoding or decoding system to handle video of different frame sizes and to simplify the implementation, a basic size of 16 × 16 has been popularly adopted. Every mainstream coding standard, from MPEG-1 and MPEG-2 to H.264, has chosen a macroblock of 16 × 16 pixels as its basic unit of processing. Hence, for video of different resolutions, we just have to process different numbers of macroblocks. For every 720 × 480 SD frame, we process 45 × 30 macroblocks, while for every FullHD frame, we process 120 × 68 macroblocks.
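As a quick sanity check of these numbers (our own arithmetic, not code from the book), the raw data rate and the macroblock counts can be computed directly:

#include <stdio.h>

int main(void) {
    /* Raw data rate of 30-fps FullHD RGB video, 8 bits per component. */
    double bps = 30.0 * 1920 * 1088 * 3 * 8;
    printf("raw rate: %.2f Gbps\n", bps / 1e9);            /* ~1.50 Gbps */

    /* Macroblock counts for SD and FullHD frames. */
    printf("SD:     %d x %d MBs\n", 720 / 16, 480 / 16);   /* 45 x 30  */
    printf("FullHD: %d x %d MBs\n", 1920 / 16, 1088 / 16); /* 120 x 68 */
    return 0;
}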
1.1.2 Video Encoding Flow Algorithm 1.1 depicts a typical flow of video encoding. frame(t) is the current frame to be encoded. frame'(t-1) is the reconstructed frame used for referencing, also called the reference frame, and frame'(t) is the reconstructed current frame. We encode frame(t) one macroblock (MB) at a time, starting from the leftmost MB of the topmost row. We call the MB being encoded Curr_MB. It can be encoded in one of three modes: I for intra prediction, P for unidirectional interprediction, and B for bidirectional interprediction. The resultant MB from prediction is called Pred_MB, and the difference between Curr_MB and Pred_MB is called Res_MB, for residuals. Res_MB goes through space-to-frequency transformation and then quantization to become Res_Coef, the residual coefficients. Entropy coding then compresses Res_Coef into the final bit-stream. In order to prepare the reconstructed current frame for future reference, we perform inverse quantization and inverse transform on Res_Coef to get the reconstructed residuals, called Reconst_res. Adding together Reconst_res and Pred_MB, we have Reconst_MB for insertion into frame'(t).
1.1.3 Color Space Conversion Naturally, each pixel is composed of R, G, and B 8-bit components. Applying the following conversion operation, it can be represented as one luminance (luma) component Y and two chrominance (chroma) components Cr and Cb. Since the human
Algorithm 1.1: Encode a frame.
encode_a_frame(frame(t), mode)
  for I = 1, N do              //** N: #rows of MBs per frame
    for J = 1, M do            //** M: #MBs per row
      Curr_MB = MB(frame(t), I, J);
      case (mode)
        I: Pred_MB = Intra_Pred(frame'(t), I, J);
        P: Pred_MB = ME(frame'(t-1), I, J);
        B: Pred_MB = ME(frame'(t-1), frame'(t+1), I, J);
      endcase
      Res_MB = Curr_MB - Pred_MB;
      Res_Coef = Quant(Transform(Res_MB));
      Output(Entropy_code(Res_Coef));
      Reconst_res = InverseTransform(InverseQuant(Res_Coef));
      Reconst_MB = Reconst_res + Pred_MB;
      Insert(Reconst_MB, frame'(t));
    endfor
  endfor
end encode_a_frame;
visual system is more sensitive to the luminance component than the chrominance ones, we can subsample Cr and Cb to reduce the data amount without sacrificing the video quality. Usually one-out-of-two or one-out-of-four subsampling is applied. The former is called the 4:2:2 format and the latter the 4:2:0 format. In this book, we assume that the 4:2:0 format is chosen. Of course, the inverse conversion will give us the R, G, B components from a set of Y, Cr, Cb components.

Y = 0.299R + 0.587G + 0.114B,
Cb = 0.564(B - Y),                (1.1)
Cr = 0.713(R - Y).
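A minimal C sketch of the forward conversion in (1.1). The rounding, clamping, and the +128 offset that centers Cb/Cr in the 8-bit range are our assumptions; the book gives only the three equations:

#include <stdint.h>

static uint8_t clamp8(double v) {             /* clamp to [0, 255] and round */
    if (v < 0.0)   return 0;
    if (v > 255.0) return 255;
    return (uint8_t)(v + 0.5);
}

typedef struct { uint8_t y, cb, cr; } YCbCr;

YCbCr rgb_to_ycbcr(uint8_t r, uint8_t g, uint8_t b) {
    double y = 0.299 * r + 0.587 * g + 0.114 * b;
    YCbCr out;
    out.y  = clamp8(y);
    out.cb = clamp8(0.564 * (b - y) + 128.0); /* +128 offset assumed */
    out.cr = clamp8(0.713 * (r - y) + 128.0);
    return out;
}

For the 4:2:0 format, one would additionally keep only one Cb and one Cr sample per 2 × 2 block of pixels.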
1.1.4 Prediction of a Macroblock A macroblock M has 16 × 16 = 256 pixels. It takes 256 × 3 = 768 bytes to represent it in RGB format and 256 × (1 + 1/4 + 1/4) = 384 bytes in the 4:2:0 format. If we can find during decoding a macroblock M' which is similar to M, then we only have to get from the encoding end the difference between M and M'. If M and M' are very similar, the difference becomes very small, and so does the amount of data needed to be transmitted/stored. Another way to interpret similarity is redundancy. There exist two types of redundancy: spatial and temporal. Spatial redundancy results from similarity between a pixel (region) and its surrounding pixels (regions) in a frame. Temporal redundancy results from slow change of video contents from one frame to the next. Redundant information can be identified and removed with prediction tools.
1.1.5 Intraframe Prediction In an image region with smooth change, a macroblock is likely to be similar to its neighboring macroblocks in color or texture. For example, if all its neighbors are red, we can predict that a macroblock is also red. Generally, we can define several prediction functions; each takes pixel values from neighboring macroblocks as its input and produces a predicted macroblock as its output. To carry out intraframe prediction, every function is evaluated and the one resulting in the smallest error is chosen. Only the function type and the error need to be encoded and stored/transmitted. This tool is also called intra prediction and a prediction function is also called a prediction mode.
1.1.6 Interframe Prediction Interframe prediction, also called interprediction, identifies temporal redundancy between neighboring frames. We call the frame currently being processed the current frame and the neighboring one the reference frame. We try to find from the reference frame a reference macroblock that is very similar to the current macroblock of the current frame. The process is called motion estimation. A motion estimator compares the current macroblock with candidate macroblocks within a search window in the reference frame. After finding the best-matched candidate macroblock, only the displacement and the error need to be encoded and stored/transmitted. The displacement from the location of the current macroblock to that of the best candidate block is called motion vector (MV). In other words, motion estimation determines the MV that results in the smallest interprediction error. A bigger search window will give better prediction at the expense of longer estimation time.
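For illustration, a minimal full-search block-matching sketch in C. The 16 × 16 block size, the SAD cost, and all names are our assumptions; H.264 additionally searches variable block sizes and multiple reference frames:

#include <stdlib.h>
#include <limits.h>

typedef struct { int dx, dy; } MV;

/* Sum of absolute differences between two 16x16 blocks. */
static int sad16x16(const unsigned char *cur, const unsigned char *ref,
                    int stride) {
    int sum = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sum += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sum;
}

/* Exhaustive search over a (2*range+1)^2 window centered on the
   collocated position; the caller must keep the window inside the
   reference frame (border handling is omitted). */
MV full_search(const unsigned char *cur, const unsigned char *ref,
               int stride, int range) {
    MV best = { 0, 0 };
    int best_cost = INT_MAX;
    for (int dy = -range; dy <= range; dy++)
        for (int dx = -range; dx <= range; dx++) {
            int cost = sad16x16(cur, ref + dy * stride + dx, stride);
            if (cost < best_cost) {
                best_cost = cost;
                best.dx = dx;
                best.dy = dy;
            }
        }
    return best;
}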
1.1.7 Motion Vector A MV obtained from motion estimation is adequate for retrieving a block from the reference frame. Yet, we do not have to encode/transmit the whole of it because there exists similarity (or redundancy) among MVs of neighboring blocks. Instead, we can have a motion vector prediction (MVP) as a function of neighboring blocks’ MVs and just process the difference, called motion vector difference (MVD), between the MV and its MVP. In most cases, the MVD is much smaller than its associated MV.
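A sketch of the common-case H.264 predictor, which takes the component-wise median of the left, upper, and upper-right neighbors' MVs; the special cases (16 × 8 and 8 × 16 partitions, unavailable neighbors) are omitted here:

typedef struct { int x, y; } MVec;

static int median3(int a, int b, int c) {
    if (a > b) { int t = a; a = b; b = t; } /* now a <= b        */
    if (b > c) b = c;                       /* b = min(b, c)     */
    return (a > b) ? a : b;                 /* max(a, min(b, c)) */
}

MVec predict_mv(MVec left, MVec up, MVec upright) {
    MVec p;
    p.x = median3(left.x, up.x, upright.x);
    p.y = median3(left.y, up.y, upright.y);
    return p;  /* the encoder then codes only MVD = MV - MVP */
}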
1.1.8 Prediction Error We call the difference between the current macroblock and the predicted one the prediction error. It is also called the residual error, or just the residual.
1.1.9 Space-Domain to Frequency-Domain Transformation of Residual Error Residual error is in the space domain and can be represented in the frequency domain by applying the discrete cosine transform (DCT). DCT can be viewed as representing an image block with a weighted sum of elementary patterns. The weights are termed coefficients. For computational feasibility, a macroblock of residual errors is usually divided into smaller 4 × 4 or 8 × 8 blocks, and DCT is applied to each block one by one.
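As a concrete instance, H.264 replaces the floating-point 4 × 4 DCT with an integer approximation; a sketch of its core transform Y = CXC^T follows (the remaining scaling is absorbed into quantization and omitted here):

static const int C[4][4] = {
    { 1,  1,  1,  1 },
    { 2,  1, -1, -2 },
    { 1, -1, -1,  1 },
    { 1, -2,  2, -1 },
};

/* y = C * x * C^T for a 4x4 residual block x. */
void dct4x4(const int x[4][4], int y[4][4]) {
    int t[4][4];
    for (int i = 0; i < 4; i++)          /* t = C * x   */
        for (int j = 0; j < 4; j++) {
            t[i][j] = 0;
            for (int k = 0; k < 4; k++)
                t[i][j] += C[i][k] * x[k][j];
        }
    for (int i = 0; i < 4; i++)          /* y = t * C^T */
        for (int j = 0; j < 4; j++) {
            y[i][j] = 0;
            for (int k = 0; k < 4; k++)
                y[i][j] += t[i][k] * C[j][k];
        }
}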
1.1.10 Coefficient Quantization Coefficients generated by DCT carry image components of various frequencies. Since the human visual system is more sensitive to low-frequency components and less sensitive to high-frequency ones, we can treat them with different resolutions by means of quantization. Quantization effectively discards certain least significant bits (LSBs) of a coefficient. By giving smaller quantization steps to low-frequency components and larger quantization steps to high-frequency ones, we can reduce the amount of data without sacrificing the visual quality.
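A toy sketch of the idea, a uniform quantizer with a frequency-dependent step size. The rounding rule is our choice; H.264's actual quantizer folds QP-dependent scaling into the transform:

#include <stdlib.h>

int quantize(int coef, int step) {           /* round to the nearest level */
    int sign = (coef < 0) ? -1 : 1;
    return sign * ((abs(coef) + step / 2) / step);
}

int dequantize(int level, int step) {        /* inverse quantization */
    return level * step;
}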
1.1.11 Reconstruction Both the encoding and decoding ends have to reconstruct video frames. At the encoding end, the reconstructed frame instead of the original one should be used as reference because no original frame is available at the decoding end. To reconstruct, we perform inverse quantization and inverse DCT to obtain the reconstructed residual. Note that the reconstructed residual is not identical to the original residual, as quantization is irreversible. Therefore, distortion is introduced here. We then add the prediction data to the reconstructed residual to obtain the reconstructed image. For an intrapredicted macroblock, we apply the prediction function to its neighboring reconstructed macroblocks, while for an interpredicted one we perform motion compensation. Both methods give a reconstructed version of the current macroblock.
1.1.12 Motion Compensation Given a MV, the motion compensator retrieves from the reference frame a reconstructed macroblock pointed to by the integer part of the MV. If the MV has a fractional part, it performs interpolation over the retrieved image to obtain the final reconstructed image. Usually, interpolation is done twice, once for half-pixel accuracy and once for quarter-pixel accuracy.
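A sketch of the two interpolation steps for luma; the six-tap kernel (1, -5, 20, 20, -5, 1) and the final averaging follow the H.264 filters, while clamping at frame borders is ignored:

static int clip255(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Half-pel sample between p[0] and p[1] (horizontal case). */
int halfpel(const unsigned char *p) {
    int v = p[-2] - 5 * p[-1] + 20 * p[0] + 20 * p[1] - 5 * p[2] + p[3];
    return clip255((v + 16) >> 5);
}

/* Quarter-pel sample: average of an integer sample a and a half-pel b. */
int quarterpel(int a, int b) { return (a + b + 1) >> 1; }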
1.1.13 Deblocking Filtering After every macroblock of a frame is reconstructed one by one, we obtain a reconstructed frame. Since the encoding/decoding process is done macroblock-wise, there exist blocking artifacts at the boundaries between adjacent macroblocks or subblocks. A deblocking filter is used to eliminate these artificial edges.
1.2 Book Organization This book describes a VLSI implementation of a hardware H.264/AVC encoder as depicted in Fig. 1.1.
Fig. 1.1 Top-level block diagram of the proposed design (an AMBA interface with an MAU arbiter and DF, MB, SR, and BIT memory access units connects the encoder core to the bus; the core comprises a command receiver, a main control engine, and IntraPred, IntraMD, IME, FME, MC, TransCoding, Recons, DF, PE, and CABAC engines, together with inter-info, unfiltered-pixel, and reconstructed-MB memories)
In Chap. 2, we present intra prediction. Intra prediction is the first process of H.264/AVC intra encoding. It predicts a macroblock by referring to its neighboring macroblocks to eliminate spatial redundancy. There are 17 prediction modes for a macroblock: nine modes for each of the 16 luma 4 × 4 blocks, four modes for a luma 16 × 16 block, and four modes for each of the two chroma 8 × 8 blocks. Because there exists great similarity among the equations generating prediction pixels across prediction modes, effective hardware resource sharing is the main design consideration. Moreover, there exists a long data-dependency loop among luma 4 × 4 blocks during encoding. Increasing parallelism and skipping some modes are two of the popular methods to design a high-performance architecture for high-end applications. However, increasing throughput requires more hardware area, and skipping modes degrades video quality. We will present a novel VLSI implementation for intra prediction in this chapter.

In Chap. 3, we present integer motion estimation. Interframe prediction in H.264/AVC is carried out in three phases: integer motion estimation (IME), fractional motion estimation (FME), and motion compensation (MC). We will discuss these functions in Chaps. 3, 4, and 5, respectively. Because motion estimation in H.264/AVC supports variable block sizes and multiple reference frames, high computational complexity and huge data traffic become the main difficulties in VLSI implementation. Moreover, high-resolution video applications, such as HDTV, make these problems more critical. Therefore, current VLSI designs usually adopt parallel architectures to increase the total throughput and handle the high computational complexity. On the other hand, many data-reuse schemes try to increase the data-reuse ratio and, hence, reduce the required data traffic. We will introduce several key points of VLSI implementation for IME.

In Chap. 4, we present fractional motion estimation. Motion estimation in H.264/AVC supports quarter-pixel precision and is usually carried out in two phases: IME and FME. We have talked about IME in Chap. 3. After IME finds an integer motion vector (IMV) for each of the 41 subblocks, FME performs motion search around the refinement center pointed to by the IMV and further refines the 41 IMVs into fractional MVs (FMVs) of quarter-pixel precision. FME interpolates half-pixels using a six-tap filter and then quarter-pixels using a two-tap one. Nine positions are searched in both half refinement (one integer-pixel search center pointed to by the IMV and eight half-pixel positions) and then quarter refinement (one half-pixel position and eight quarter-pixel positions). The position with the minimum residual error is chosen as the best match. FME can significantly improve the video quality (+0.3 to +0.5 dB) and reduce the bit-rate (20-37%) according to our experimental results. However, our profiling report shows that FME consumes more than 40% of the total encoding time. Therefore, an efficient hardware accelerator for fractional motion estimation is indispensable.

In Chap. 5, we present motion compensation. Following integer and fractional motion estimation, motion compensation (MC) is the third stage in H.264/AVC interframe prediction (P or B frame). After the motion estimator finds MVs and related information for each current macroblock, the motion compensator generates
compensated macroblocks (MBs) from reference frames. Due to the quarter-pixel precision and variable-block-size motion estimation supported in H.264, motion compensation also needs to generate half- or quarter-pixels for MB compensation. Therefore, motion compensation also has high computational complexity and dominates the data traffic on DRAM. Current VLSI designs for MC usually focus on reducing memory traffic or increasing interpolator throughput. In this chapter, we will introduce several key points of VLSI implementation for motion compensation.

In Chap. 6, we present transform coding. In H.264/AVC, both the transform and quantization units consist of forward and inverse parts. Residuals are transformed into frequency-domain coefficients in the forward transform unit and quantized in the forward quantization unit to reduce insignificant data for bit-rate saving. To generate reconstructed pixels for the intra prediction unit and reference frames for the motion estimation unit, quantized coefficients are rescaled in the inverse quantization unit and transformed back to residuals in the inverse transform unit. There are three kinds of transforms used in H.264/AVC: the 4 × 4 integer discrete cosine transform, the 2 × 2 Hadamard transform, and the 4 × 4 Hadamard transform. Designing an area-efficient architecture is the main design challenge. We will present a VLSI implementation of transform coding in this chapter.

In Chap. 7, we present the deblocking filter. The deblocking filter (DF) adopted in H.264/AVC reduces the blocking artifacts generated by block-based motion-compensated interprediction, intra prediction, and the integer discrete cosine transform. The filter for eliminating blocking artifacts is embedded within the coding loop; therefore, it is also called an in-loop filter. Empirically, it achieves up to 9% bit-rate saving at the expense of intensive computation. Even with today's fastest CPU, it is hard to perform software-based real-time encoding of high-resolution sequences such as QFHD (3,840 × 2,160). Consequently, accelerating the deblocking filter by VLSI implementation is indeed required. Through optimizing processing cycles, external memory access, and working frequency, we show a design that can support QFHD at 60 fps by running at 195 MHz.

In Chap. 8, we present context-based adaptive binary arithmetic coding. Context-based adaptive binary arithmetic coding (CABAC), adopted in the H.264/AVC main profile, is the state-of-the-art in terms of bit-rate efficiency. In comparison with the context-based adaptive variable length coding (CAVLC) used in the baseline profile, it can save up to 7% of the bit-rate. However, CABAC occupies 9.6% of the total encoding time, and its throughput is limited by bit-level data dependency. Moreover, for ultrahigh resolutions such as QFHD (3,840 × 2,160), it is difficult for a pure software CABAC encoder to meet the real-time requirement. Therefore, it is necessary to accelerate the CABAC encoder by VLSI implementation. In this chapter, a novel architecture of a CABAC encoder will be described. Its performance is capable of real-time encoding of QFHD video in the worst case of main profile Level 5.1.

In Chap. 9, we present system integration. Hardware cost and encoding performance are the two main challenges in designing a high-performance H.264/AVC encoder. We have proposed several high-performance architectures for the functional
units in an H.264/AVC encoder. In addition, external memory management is another design issue. Our encoder has to access an external memory at up to 3.3 GBps for real-time encoding of 1080pHD video. We propose several AMBA-compliant memory access units (MAUs) to efficiently access an external memory. We will present our H.264/AVC encoder in this chapter.
Chapter 2
Intra Prediction
Abstract Intra prediction is the first process of H.264/AVC intra encoding. It predicts a macroblock by referring to its neighboring macroblocks to eliminate spatial redundancy. There are 17 prediction modes for a macroblock: nine modes for each of the 16 luma 4 × 4 blocks, four modes for a luma 16 × 16 block, and four modes for each of the two chroma 8 × 8 blocks. Because there exists great similarity among the equations generating prediction pixels across prediction modes, effective hardware resource sharing is the main design consideration. Moreover, there exists a long data-dependency loop among luma 4 × 4 blocks during encoding. Increasing parallelism and skipping some modes are two of the popular methods to design a high-performance architecture for high-end applications. However, increasing throughput requires more hardware area, and skipping modes degrades video quality. We will present a novel VLSI implementation for intra prediction in this chapter.
2.1 Introduction H.264/AVC intra encoding achieves a higher compression ratio and quality compared with the latest still-image coding standard, JPEG2000 [1]. The intra prediction unit, which performs the first process of H.264/AVC intra encoding, employs 17 kinds of prediction modes and supports several different block sizes. For the baseline, main, and extended profiles, it supports 4 × 4 and 16 × 16 block sizes. For the high profile, it additionally supports an 8 × 8 block size. In this chapter, we focus on intra prediction for the baseline, main, and extended profiles. The intra prediction unit refers to reconstructed neighboring pixels to generate prediction pixels. Therefore, its superior performance comes at the expense of very high computational complexity. We describe the detailed algorithm of intra prediction in Sect. 2.1.1 and address some design considerations in Sect. 2.1.2.
2.1.1 Algorithm All intra prediction pixels are calculated from the reconstructed pixels of previously encoded neighboring blocks. Figure 2.1 lists all intra prediction modes with their block sizes. For the luma component, a 16 × 16 macroblock can be partitioned into sixteen 4 × 4 blocks or kept as one 16 × 16 block. The chroma component simply contains one 8 × 8 Cb block and one 8 × 8 Cr block. There are nine prediction modes for each of the 16 luma 4 × 4 blocks and four prediction modes for a luma 16 × 16 block and the two chroma 8 × 8 blocks. Figure 2.2 illustrates the reference pixels of a luma macroblock. A luma 16 × 16 block is predicted by referring to its upper, upper-left, and left neighboring luma 16 × 16 blocks. For a luma 4 × 4 block, we utilize its upper, upper-left, left, and upper-right neighboring 4 × 4 blocks. There are 33 and 13 reference pixels for a luma 16 × 16 block and a luma 4 × 4 block, respectively. Predicting a chroma 8 × 8 block is like predicting a luma 16 × 16 block, using its upper, upper-left, and left neighboring chroma blocks. There are 17 reference pixels for a chroma block. Figure 2.3 shows all the computation equations of the luma 4 × 4 modes. Upper-case letters from "A" to "M" denote the 13 reference pixels, and lower-case letters from "a" to "p" denote the 16 prediction pixels.
Component   Block size   Prediction modes                                Abbreviation
Y           16 4×4       0: vertical, 1: horizontal, 2: DC,              L4_VER, L4_HOR, L4_DC,
                         3: diagonal down-left, 4: diagonal down-right,  L4_DDL, L4_DDR, L4_VR,
                         5: vertical-right, 6: horizontal-down,          L4_HD, L4_VL, L4_HU
                         7: vertical-left, 8: horizontal-up
Y           1 16×16      0: vertical, 1: horizontal, 2: DC, 3: plane     L16_VER, L16_HOR, L16_DC, L16_PLANE
Cb          1 8×8        0: DC, 1: horizontal, 2: vertical, 3: plane     CB8_DC, CB8_HOR, CB8_VER, CB8_PLANE
Cr          1 8×8        0: DC, 1: horizontal, 2: vertical, 3: plane     CR8_DC, CR8_HOR, CR8_VER, CR8_PLANE

Fig. 2.1 Intra prediction modes
Fig. 2.2 Reference pixels of a luma macroblock (MB_Current refers to MB_Left, MB_UpperLeft, MB_Upper, and MB_UpperRight; within a macroblock, a current 4 × 4 block blk_C refers to blk_L, blk_UL, blk_U, and blk_UR)
There are four modes for a luma 16 × 16 block and the two chroma 8 × 8 blocks: horizontal, vertical, DC, and plane, as shown in Fig. 2.4. All except the plane mode are similar to their luma 4 × 4 counterparts. The plane modes, defined for smoothly varying images, are the most complicated: every prediction pixel has a unique value. Figure 2.5 shows the equations of the luma 16 × 16 plane mode. Each prediction pixel value depends on its coordinate in the block and on parameters a, b, and c, which are calculated from pixels of neighboring blocks. After generating prediction pixels for each mode, the intra mode decision unit will compute the cost of each mode, based on distortion and bit-rate, and choose the one with the lowest cost. We describe the encoding process of the intra prediction unit using Algorithm 2.1 and show the corresponding flow chart in Fig. 2.6. The primary inputs of the intra prediction unit are Xcoord and Ycoord, which indicate the location of the current macroblock in a frame. For example, in a CIF (352 × 288) frame, (Xcoord, Ycoord) = (0, 0) denotes the first macroblock and (Xcoord, Ycoord) = (21, 17) the last. We use upper-case letters A through M to represent the 13 reference pixels of a luma 4 × 4 block as shown in Fig. 2.3. HSL and VSL represent the sets of left and upper reference pixels for a luma 16 × 16 block, and QL represents the upper-left reference pixels of a luma macroblock. HSCb, VSCb, QCb, HSCr, VSCr, and QCr serve the two chroma 8 × 8 blocks. Figure 2.7 shows the order in which the subtasks of an intra prediction unit are processed. Table 2.1 shows the cycle budget for an H.264/AVC encoder to encode a
Fig. 2.3 Equations of luma 4 × 4 mode prediction pixels (reference pixels: A-D above, E-H above-right, I-L to the left, and M above-left of the block; prediction pixels a-p in raster order. Vertical copies A-D down each column (a = e = i = m = A, etc.), and horizontal copies I-L across each row (a = b = c = d = I, etc.). DC sets all of a-p to (SH + SV + 4) >> 3 when both upper and left pixels are available, where SH = SUM(I-L) and SV = SUM(A-D); it falls back to (SH + 2) >> 2 or (SV + 2) >> 2 when only one side is available and to 128 when neither is. The six skew modes DDL, DDR, VR, HD, VL, and HU build each prediction pixel along the mode's prediction direction from two-tap filters of the form (x + y + 1) >> 1 or three-tap filters of the form (x + 2y + z + 2) >> 2 over the reference pixels, for example c = f = i = (C + 2D + E + 2) >> 2 in DDL)
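To make the mode equations concrete, here is a small C sketch of three of the nine luma 4 × 4 modes (the names are ours; the arrays hold reference pixels A-D and I-L from Fig. 2.3, and availability checks are omitted):

/* Vertical: each column copies its upper reference pixel. */
void pred4x4_vertical(const unsigned char A[4], unsigned char pred[4][4]) {
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = A[x];               /* a=e=i=m=A, ..., d=h=l=p=D */
}

/* Horizontal: each row copies its left reference pixel. */
void pred4x4_horizontal(const unsigned char I[4], unsigned char pred[4][4]) {
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = I[y];               /* a=b=c=d=I, ..., m=n=o=p=L */
}

/* DC with both neighbors available: a-p = (SH + SV + 4) >> 3. */
void pred4x4_dc(const unsigned char A[4], const unsigned char I[4],
                unsigned char pred[4][4]) {
    int sum = 0;
    for (int k = 0; k < 4; k++)
        sum += A[k] + I[k];
    unsigned char dc = (unsigned char)((sum + 4) >> 3);
    for (int y = 0; y < 4; y++)
        for (int x = 0; x < 4; x++)
            pred[y][x] = dc;
}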
Fig. 2.4 (a) Luma 16 × 16 and (b) chroma 8 × 8 prediction modes (vertical, horizontal, DC, and plane; DC predicts every pixel Pred[y, x] as the mean of the 32 neighboring reference pixels, while plane computes a distinct PlanePred[y, x] per pixel)

PlanePred[y, x] = Clip1{(a + b × (x - 7) + c × (y - 7) + 16) >> 5},  x, y = 0..15
a = (ReconsPixel[-1, 15] + ReconsPixel[15, -1]) << 4
b = (5 × H + 32) >> 6
c = (5 × V + 32) >> 6
H = Σ_{x'=0}^{7} (x' + 1) × (ReconsPixel[8 + x', -1] - ReconsPixel[6 - x', -1])
V = Σ_{y'=0}^{7} (y' + 1) × (ReconsPixel[-1, 8 + y'] - ReconsPixel[-1, 6 - y'])

Fig. 2.5 Illustration of plane modes
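The plane equations translate directly into code; here is a sketch for the luma 16 × 16 case, where rec(x, y) is assumed to return the reconstructed neighboring pixel (y = -1 addresses the row above the macroblock, x = -1 the column to its left), and availability handling is omitted:

static int clip1(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

void pred16x16_plane(int (*rec)(int x, int y), unsigned char pred[16][16]) {
    int H = 0, V = 0;
    for (int i = 0; i < 8; i++) {
        H += (i + 1) * (rec(8 + i, -1) - rec(6 - i, -1));
        V += (i + 1) * (rec(-1, 8 + i) - rec(-1, 6 - i));
    }
    int a = (rec(15, -1) + rec(-1, 15)) << 4;
    int b = (5 * H + 32) >> 6;
    int c = (5 * V + 32) >> 6;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            pred[y][x] = (unsigned char)
                clip1((a + b * (x - 7) + c * (y - 7) + 16) >> 5);
}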
Algorithm 2.1: Intra prediction.
Intra_Prediction(Xcoord, Ycoord, A-M, HSL, VSL, QL, HSCb, VSCb, QCb, HSCr, VSCr, QCr)
for each of the 16 luma 4x4 blocks do
  PredPixels4x4_DC = Gen_4x4DC(A,B,C,D,I,J,K,L);
  if up block available then
    PredPixels4x4_VER = Gen_4x4VER(A,B,C,D);
    PredPixels4x4_DDL = Gen_4x4DDL(A,B,C,D,E,F,G,H);
    PredPixels4x4_VL  = Gen_4x4VL(A,B,C,D,E,F,G);
  endif
  if left block available then
    PredPixels4x4_HOR = Gen_4x4HOR(I,J,K,L);
    PredPixels4x4_HU  = Gen_4x4HU(I,J,K,L);
  endif
  if left block available & up block available & up-left block available then
    PredPixels4x4_DDR = Gen_4x4DDR(A,B,C,D,I,J,K,L,M);
    PredPixels4x4_VR  = Gen_4x4VR(A,B,C,D,I,J,K,M);
    PredPixels4x4_HD  = Gen_4x4HD(A,B,C,I,J,K,L,M);
  endif
endfor
PredPixels16x16_DC = Gen_16x16DC(HSL, VSL);
if up luma MB available then
  PredPixels16x16_VER = Gen_16x16VER(VSL);
endif
if left luma MB available then
  PredPixels16x16_HOR = Gen_16x16HOR(HSL);
endif
if left luma MB available & up luma MB available then
  PredPixels16x16_PLANE = Gen_16x16PLANE(VSL, HSL, QL);
endif
if up chroma MB available then
  PredPixels8x8_VER_Cb = Gen_8x8VER(VSCb);
  PredPixels8x8_VER_Cr = Gen_8x8VER(VSCr);
endif
if left chroma MB available then
  PredPixels8x8_HOR_Cb = Gen_8x8HOR(HSCb);
  PredPixels8x8_HOR_Cr = Gen_8x8HOR(HSCr);
endif
if left chroma MB available & up chroma MB available then
  PredPixels8x8_PLANE_Cb = Gen_8x8PLANE(VSCb, HSCb, QCb);
  PredPixels8x8_PLANE_Cr = Gen_8x8PLANE(VSCr, HSCr, QCr);
endif
1080pHD (1,920 × 1,088) video at 30 fps at different working frequencies. If the intra prediction unit generates four prediction pixels per cycle, it will take 960 cycles to predict a macroblock (there are 9 × 256 + 4 × 256 + 4 × 64 × 2 = 3,840 prediction pixels to generate per macroblock).
2.1.2 Design Consideration Because intra prediction refers to reconstructed neighboring pixels, the intra prediction unit can only start to predict a luma 4 × 4 block, a luma 16 × 16 block, or a chroma 8 × 8 block after its neighboring blocks have been reconstructed, as shown in Fig. 2.2.
Fig. 2.6 Flow chart of the intra prediction unit (for each of the 16 luma 4 × 4 blocks, DC is always predicted; VER, DDL, and VL require the upper block, HOR and HU the left block, and DDR, VR, and HD all neighboring blocks; the same availability checks then drive the luma 16 × 16 and chroma 8 × 8 DC, VER, HOR, and plane predictions)
This data dependency exists both at the macroblock level, for a luma 16 × 16 block and two chroma 8 × 8 blocks, and at the 4 × 4 block level, for the 16 luma 4 × 4 blocks. The data dependency among the 16 luma 4 × 4 blocks usually dominates the system performance of an H.264/AVC intra encoder since it takes 576/960 = 60% of the total processing time, as shown in Fig. 2.7. In Fig. 2.8, the arrows denote the data dependency among the 16 luma 4 × 4 blocks, and the numbers inside the 4 × 4 blocks show the processing order defined in the H.264/AVC standard. For example, the arrows pointing to block 13 mean that we have to predict it by referring to the reconstructed pixels of blocks 7, 8, and 10.
Fig. 2.7 Order of processing subtasks (each luma 4 × 4 block goes through DC, then VER/DDL/VL, HOR/HU, and DDR/VR/HD prediction; one 4 × 4 block takes 36 cycles, all 16 luma 4 × 4 blocks are done by cycle 576, luma 16 × 16 modes by cycle 832, and chroma modes by cycle 960)

Table 2.1 Cycle budget at different working frequencies for 1080pHD video

Frequency (MHz)   Cycle budget per macroblock
300               1,225
250               1,021
200               816
166               678
150               612
125               510
100               408
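The table follows directly from the frame geometry; a quick check (our own arithmetic, not code from the book):

#include <stdio.h>

int main(void) {
    /* 1080pHD: (1920/16) * (1088/16) = 8,160 macroblocks per frame,
       so 8,160 * 30 = 244,800 macroblocks per second at 30 fps. */
    const long mb_per_sec = (1920L / 16) * (1088 / 16) * 30;
    const int freqs_mhz[] = { 300, 250, 200, 166, 150, 125, 100 };
    for (int i = 0; i < 7; i++)
        printf("%3d MHz -> %ld cycles/MB\n", freqs_mhz[i],
               freqs_mhz[i] * 1000000L / mb_per_sec); /* matches Table 2.1 */
    return 0;
}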
Fig. 2.8 Data dependency and processing order of 16 luma 4 × 4 blocks (the standard order: blocks 1, 2, 5, 6 occupy the first row of 4 × 4 blocks; 3, 4, 7, 8 the second; 9, 10, 13, 14 the third; and 11, 12, 15, 16 the fourth)
2.2 Related Works Several VLSI architectures exist for H.264/AVC intra prediction. Some of them address how to shorten prediction time to support high-resolution video applications. Others aim to provide a hardware-efficient solution to minimize the system cost.
2.2.1 Prediction Time Reduction Approaches The intra prediction unit and the intra mode decision unit together account for about 80% of the computation time in H.264/AVC intra encoding, according to our profiling using the H.264/AVC reference software Joint Model (JM) 11.0. An all-mode mode decision approach evaluates the costs of nine luma 4 × 4 modes, four luma 16 × 16 modes, and four chroma 8 × 8 modes, whereas a partial-mode approach evaluates fewer modes by skipping modes that have a lower probability of being the best one. Adopting a partial-mode mode decision can shorten prediction time but results in video-quality degradation. Several previous designs [7, 54, 78] propose partial-mode mode decision algorithms. For example, a three-step encoding algorithm is proposed by Cheng and Chang [7] to adaptively choose the next prediction mode. Instead of reducing the computation load, one can reduce prediction time by increasing pixel-level parallelism (PLP). Increasing the PLP directly decreases the total prediction time. For example, if the intra prediction unit predicts 8 pixels per cycle, the lower bound of prediction time will be 480 cycles. Several previous works [21, 29, 32] predict 4 pixels per cycle, and another work [46] predicts 8 pixels per cycle by employing two prediction engines. The data dependency among luma 4 × 4 blocks introduces bubble cycles. Huang et al. [21] propose inserting luma 16 × 16 prediction into these bubble cycles to eliminate them.
2.2.2 Hardware Area Reduction Approaches Instead of using dedicated hardware [61] for each prediction mode, several previous designs [21, 29, 46] employ reconfigurable architectures. Some previous works [29, 46] save hardware area by removing the plane mode from the prediction mode capability (PMC). PMC is defined as the set of prediction modes that the intra prediction unit supports. PMC affects the prediction mode number and
the candidate mode set for the mode decision unit. The size of the candidate mode set also affects compression ratio. A smaller set results in reduced coding efficiency. Several previous designs schedule the processing order of prediction modes according to their computation load to save hardware costs. Huang et al. [21] schedules DC and plane modes last for data preparation, while the unit outputs pixels for vertical and horizontal modes.
2.3 A VLSI Design for Intra Prediction We propose a VLSI architecture for intra prediction in this section. We first describe how we schedule all subtasks in Sect. 2.3.1, and we then propose our hardware architecture in Sect. 2.3.2. In Sect. 2.3.3, we evaluate its performance.
2.3.1 Subtasks Scheduling We categorize all intra prediction modes into reconstruction-loop (RL) modes and nonreconstruction-loop (Non-RL) modes, as shown in Table 2.2. The former include all luma 4 × 4 modes, whereas the latter include the luma 16 × 16 and chroma 8 × 8 modes. Our profiling shows that RL modes occupy 59%, and the three plane modes approximately 40%, of the overall computation. Since the data dependency among 4 × 4 blocks dominates the processing time, prediction of RL modes is the performance bottleneck. Our design spends 5 cycles to generate the prediction pixels of the nine intra 4 × 4 modes for a 4 × 4 block by increasing the PLP to 16 (pixels/mode) × 2 (modes/cycle) = 32 pixels/cycle. Ideally, it takes 5 (cycles/block) × 16 (blocks/macroblock) = 80 cycles to predict 16 luma 4 × 4 blocks.
Table 2.2 Mode categories

            RL modes                Non-RL modes
Category    Luma 4×4                Luma 16×16    Cb 8×8        Cr 8×8
Bypass      L4_HOR, L4_VER          L16_HOR,      CB8_HOR,      CR8_HOR,
                                    L16_VER       CB8_VER       CR8_VER
DC          L4_DC                   L16_DC        CB8_DC        CR8_DC
Plane       -                       L16_PLANE     CB8_PLANE     CR8_PLANE
Skew        L4_DDL, L4_DDR, L4_VR,  -             -             -
            L4_HD, L4_VL, L4_HU
We also increase the PLP for predicting Non-RL modes. Our design generates 16 pixels of Non-RL modes per cycle. Therefore, it takes 16 (cycles/macroblock) × 4 (modes) + 4 (cycles/chroma 8 × 8 block) × 2 (chroma types) × 4 (modes) = 96 cycles to predict a luma 16 × 16 block and two chroma 8 × 8 blocks. To alleviate the performance bottleneck caused by the long data-dependency loop among luma 4 × 4 blocks, we modify the processing order of 4 × 4 blocks to process two luma 4 × 4 blocks at a time. In Fig. 2.9, the arrows show the data dependency among the 16 luma 4 × 4 blocks, and the numbers inside the 4 × 4 blocks show the modified processing order. To predict the two chroma 8 × 8 blocks, we use a 4 × 4 block as the unit for generating prediction pixels, and we predict one Cb and one Cr 4 × 4 block at the same time to shorten the processing time. Moreover, to utilize the bubble cycles between two luma 4 × 4 blocks, we generate prediction pixels of luma 16 × 16 modes after generating prediction pixels of luma 4 × 4 modes for a luma 4 × 4 block. Figure 2.10 shows the timing diagram of our proposed design.
Fig. 2.9 Proposed processing order of 16 luma 4 × 4 blocks (two blocks are processed per iteration; the grid positions are numbered 1', 2', 3', 4' in the first row, 3', 4', 5', 6' in the second, 5', 6', 7', 8' in the third, and 7', 8', 9', 10' in the fourth, so the whole macroblock takes ten iterations)
Fig. 2.10 Timing diagram of the proposed design (after a data fetch, the two RL engines predict luma 4 × 4 modes for their 4 × 4 blocks while the Non-RL engine predicts luma 16 × 16 modes in between and chroma 8 × 8 modes near the end; cycle marks at 16, 32, 129, and 163)
As mentioned before, there is a data dependency among luma 4 × 4 blocks. After the intra prediction unit finishes one luma 4 × 4 encoding iteration (encoding one or two luma 4 × 4 blocks), it needs to wait for the transform, quantization, inverse quantization, inverse transform, and reconstruction units before starting the next iteration. Our design requires 1 cycle to read reference pixels. By adopting the proposed processing order, the two RL engines take 5 (cycles/iteration) × 10 (iterations/macroblock) = 50 cycles to predict 16 luma 4 × 4 blocks. We schedule the Non-RL engine to predict luma 16 × 16 modes between two luma 4 × 4 encoding iterations. The Non-RL engine generates prediction pixels of luma 16 × 16 modes for one 4 × 4 block in iterations 1, 2, 9, and 10 and for two 4 × 4 blocks in the other iterations, as the two RL engines do. The Non-RL engine also computes parameters for the chroma plane and DC modes in iterations 2, 3, 4, 5, and 6 and outputs prediction pixels in iterations 9 and 10. In total, it takes 163 cycles to predict a macroblock.

To schedule all prediction modes efficiently, we divide them into four categories: bypass, DC, skew, and plane, according to their characteristics, as shown in Table 2.2. The bypass category consists of the vertical and horizontal modes, which require no computation. The DC category has four DC modes, in luma 4 × 4, luma 16 × 16, Cb 8 × 8, and Cr 8 × 8, respectively. The skew category has six modes: DDL, DDR, VR, HD, VL, and HU, all for luma 4 × 4 prediction. Their computation equations are mainly three-tap and two-tap filters. The plane category has three modes: luma 16 × 16 plane, Cb 8 × 8 plane, and Cr 8 × 8 plane. They are the most complicated.

To predict RL modes, we first separate those in the skew category into three groups: (1) HD and HU modes, (2) DDR and VR modes, and (3) DDL and VL modes. We then schedule the horizontal and vertical modes into group (4) and the DC mode into group (5). Our design takes 5 cycles to output them in order of group number. To predict Non-RL modes, our design performs horizontal, vertical, DC, and plane prediction in order for a luma 16 × 16 block and two chroma 8 × 8 blocks.

Although increasing the PLP will increase the hardware area, we propose an optimized processing element (PE) scheduling scheme to achieve better resource sharing. A PE is a basic hardware unit used to calculate a prediction pixel. With the optimized scheduling scheme, we aim at generating prediction pixels of RL modes with an optimal number of PEs. Our design predicts 32 pixels of RL modes per cycle. If we gave every prediction pixel a dedicated PE, 32 PEs would be needed for RL modes. However, for all modes in the bypass category, we need nothing but multiplexers. For all DC modes, we use a 4-pixel adder tree to remove the computation load from the PEs. RL modes in the skew category are the most complicated modes. Still, there are multiple prediction pixels sharing the same computation equation. For illustration, we draw Table 2.3 by reorganizing a table proposed by Huang et al. [21]. Table 2.3 shows all computation equations of RL modes in the skew category. Each row represents a computation equation and the number of prediction pixels that utilize the equation in each mode. The equations labeled "T3," "T2," and "Bypass" denote three-tap filter, two-tap filter, and bypass operations, respectively. The numbers in the reference-pixel columns are the multipliers of the corresponding reference pixels, and the numbers in the skew-mode columns denote the number of prediction pixels sharing the corresponding computation equation. For example, four prediction pixels in DDL mode share the same computation equation, T3eq9. We can reuse the PEs which perform the same computation equation during the same cycle. Moreover, our design is able to predict RL modes with an optimal number of PEs by simultaneously performing two of the most similar modes. For example, DDL and VL modes have the greatest similarity according to Table 2.3. If our design predicted DDL and VL modes in the same cycle, only 7 three-tap filter PEs and 5 two-tap filter PEs would be needed. However, DDR mode is similar to both VR mode and HD mode. If we predicted DDR mode with HD mode and VR mode with HU mode in the same cycle, the optimal set of PEs would be 8 three-tap filter PEs and 7 two-tap filter PEs. Therefore, our proposed design predicts DDR mode with VR mode and HD mode with HU mode in the same cycle. By adopting this schedule, our design uses only 7 three-tap filter PEs and 5 two-tap filter PEs.
Table 2.3 Operation table of RL modes in the skew category: one row per computation equation (T3eq0–T3eq12 for three-tap filters, T2eq0–T2eq11 for two-tap filters, and one bypass row), giving the coefficients applied to the reference pixels L, K, J, I, M, and A–H, the number of prediction pixels using the equation in each skew mode (DDL, DDR, VR, HD, VL, HU), and the row sum.
The numbers in the columns of skew modes denote the number of prediction pixels sharing the corresponding computation equation. For example, four prediction pixels in DDL mode share the same computation equation, T3eq9. We can reuse the PEs that perform the same computation equation during the same cycle. Moreover, our design can predict RL modes with an optimal number of PEs by simultaneously performing two of the most similar modes. For example, DDL and VL modes have the greatest similarity according to Table 2.3: if our design predicts DDL and VL modes in the same cycle, only 7 three-tap filter PEs and 5 two-tap filter PEs are needed. DDR mode, however, is similar to both VR mode and HD mode. If we predicted DDR mode together with HD mode, and VR mode together with HU mode, the optimal set of PEs would be 8 three-tap filter PEs and 7 two-tap filter PEs. Therefore, our proposed design predicts DDR mode with VR mode and HD mode with HU mode in the same cycle. By adopting this schedule, our design uses only 7 three-tap filter PEs and 5 two-tap filter PEs.
2.3.2 Architecture

Figure 2.11 shows the top-level block diagram of the proposed design. Its primary inputs are reconstructed neighboring pixels, and its primary outputs are prediction pixels. It contains two RL engines and one Non-RL engine. We use two RL engines to predict two luma 4×4 blocks in parallel, and we design our Non-RL engine to generate 32 prediction pixels for two 4×4 blocks at a time.
Fig. 2.11 Top-level block diagram of the proposed design (two RL engines with block-level HV, DC, and skew pixel generators; one Non-RL engine with MB-level HV and DC pixel generators and the plane parameter generator, plane seed pixel generator, and plane pixel calculator; output multiplexers delivering 16 prediction pixels each; a main controller, reference pixels memory, and address generator)
2.3.2.1 RL Engine
The RL engine consists of a block-level skew pixel generator, a block-level DC pixel generator, and a block-level HV pixel generator. Both the HV and DC pixel generators are easy to implement. The HV pixel generator generates prediction pixels for the vertical and horizontal modes by bypassing the reference pixels, whereas the DC pixel generator uses a 4-pixel adder tree to compute the DC value. The skew pixel generator predicts pixels for the DDL, DDR, HU, HD, VL, and VR modes. Its architecture is shown in Fig. 2.12. It consists of seven PE3s and five PE2s, where PE3 and PE2 perform three-tap and two-tap filter operations, respectively. It first takes 13 reference pixels and distributes them to the appropriate PEs. Next, each PE selects input pixels to produce prediction values. Finally, 32 multiplexers select the prediction values according to the mode specification. We design two customized architectures for PE3 and PE2, as depicted in Fig. 2.13. In each PE3, three multiplexers first select the current input pixels. Next, two adders and one shifter sum up the current input pixels. Then, the rounding & shifting and clipping units postprocess the summation value. The architecture of PE2 is similar to that of PE3 except that it sums up two reference pixels before rounding.
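For reference, the following C sketch (ours) shows the arithmetic that one PE3 and one PE2 implement, assuming the standard H.264 three-tap filter (x + 2y + z + 2) >> 2 and two-tap filter (x + y + 1) >> 1; the function names are hypothetical.

#include <stdint.h>

/* Clip into the 8-bit pixel range, as the Clip units do. */
static inline uint8_t clip255(int v) {
    return (uint8_t)(v < 0 ? 0 : (v > 255 ? 255 : v));
}

/* PE3: three-tap filter -- two adders and one shifter, then
   rounding & shifting, then clipping. */
static inline uint8_t pe3(uint8_t x, uint8_t y, uint8_t z) {
    return clip255((x + (y << 1) + z + 2) >> 2);
}

/* PE2: two-tap filter -- sums two reference pixels before rounding. */
static inline uint8_t pe2(uint8_t x, uint8_t y) {
    return clip255((x + y + 1) >> 1);
}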
2.3.2.2 Non-RL Engine
The Non-RL engine consists of three units: an MB-level HV pixel generator, an MB-level DC pixel generator, and an MB-level plane pixel generator.
Fig. 2.12 Architecture of the block-level skew pixel generator (seven PE3s and five PE2s fed by 13 block-level reference pixels, with multiplexers selecting outputs for the HU/VR/VL and HD/DDR/DDL mode groups)
Fig. 2.13 PE3 and PE2 architecture (each PE3 selects three reference pixels, sums them with two adders and a left shift, then rounds, shifts, and clips; each PE2 sums two reference pixels, then rounds, shifts, and clips)
The HV pixel generator is similar to that of the RL engine. The DC pixel generator consists of two 4-pixel adder trees for summing up the reference pixels. The plane pixel generator generates prediction pixels of the three plane modes; it carries the largest computation load. Although the value of each prediction pixel depends on its coordinate in the block, there are systematic ways to save computation. First, we divide a macroblock into sixteen 4×4 blocks, using a 4×4 block as the unit for generating prediction pixels. We find that each prediction pixel is related to its neighboring prediction pixels: within a 4×4 block, prediction values increase by the parameter b from left to right and by the parameter c from top to bottom. Both parameters b and c are calculated according to the equations shown in Fig. 2.5. Second, we can calculate 16 prediction pixels by using the plane parameters and the seed pixel of every 4×4 block. The seed pixel is a precalculated pixel at the top-left corner of a 4×4 block. There are 16 seed pixels in the luma component and 8 seed pixels in the chroma components. Figure 2.14 illustrates the seed pixel generation.
Fig. 2.14 Seed pixel generation. Seed pixels 0–15 belong to the sixteen luma 4×4 blocks, 16–19 to Cb, and 20–23 to Cr; seed values increase by 4b per block to the right and by 4c per block downward. The underlying equations are:
PlanePred[y, x] = Clip1{(a + 16 + b*(x − 7) + c*(y − 7)) >> 5}
PlanePred[0, 0] = Clip1{SeedPixel_0 >> 5}
SeedPixel_0 = a + 16 + b*(−7) + c*(−7)
SeedPixel_16 = SeedPixel_20 = a + 16 + b*(−3) + c*(−3)
SeedPixel_4 = SeedPixel_0 + 8*b, SeedPixel_8 = SeedPixel_0 + 8*c, SeedPixel_12 = SeedPixel_0 + 8*(b + c)
and each remaining seed pixel equals a major seed pixel (0, 4, 8, 12, 16, or 20) plus 4*b, 4*c, or 4*(b + c).
It is not necessary to use 24 × 18 bits of registers to buffer all 24 seed pixels. Instead, we use only the 6 major seed pixels to calculate the remaining 18 seed pixels. The plane pixel generator consists of three parts: a plane parameter generator, a plane seed pixel generator, and a plane pixel calculator. The plane parameter generator, depicted in Fig. 2.15, produces the parameters a + 16, b, c, 3b, and 3c.
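To make the seed-pixel scheme concrete, here is a minimal C sketch (ours) of luma 16×16 plane prediction that derives each 4×4 block's seed incrementally from SeedPixel_0 and then adds each pixel's compensation value x*b + y*c; a, b, and c are the plane parameters, and clip255 is the 8-bit clip from the earlier sketch.

/* Luma 16x16 plane prediction via seed pixels, following Fig. 2.14:
   pred[Y][X] = clip((a + 16 + b*(X-7) + c*(Y-7)) >> 5). */
void plane_pred_luma16(int a, int b, int c, uint8_t pred[16][16]) {
    int seed0 = a + 16 - 7 * b - 7 * c;                  /* SeedPixel_0 */
    for (int by = 0; by < 4; by++)
        for (int bx = 0; bx < 4; bx++) {
            int seed = seed0 + 4 * bx * b + 4 * by * c;  /* block seed */
            for (int y = 0; y < 4; y++)
                for (int x = 0; x < 4; x++)
                    pred[4 * by + y][4 * bx + x] =
                        clip255((seed + x * b + y * c) >> 5);
        }
}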
Fig. 2.15 Architecture of the plane parameter generator (multiplexers select luma, Cb, or Cr reference pixels; customized multipliers and an increment-by-16 unit produce the parameters)
It then outputs these parameters to the plane seed pixel generator and the plane pixel calculator. The plane seed pixel generator, depicted in Fig. 2.16, takes the parameters and generates the seed pixel for every 4×4 block. It first generates the 6 major seed pixels, 0, 4, 8, 12, 16, and 20, and then computes the remaining 18 seed pixels from the major seed pixels and the parameters. Figure 2.17 illustrates how the plane pixel calculator generates the plane pixels. It contains 16 customized PEs to generate 16 prediction pixels. It first takes the seed pixel of the current 4×4 block and the parameters as inputs. It then computes the compensation value for each prediction pixel according to its coordinate in the 4×4 block; Fig. 2.18 shows the 16 compensation values. Finally, the 16 prediction pixels are produced at a time by summing up the compensation values and the seed pixel.
Fig. 2.16 Architecture of the plane seed pixel generator (parameter set registers, a customized multiplier, and multiplexers produce SeedPixel_0, 4, 8, 12, 16, and 20, from which SeedPixel_[0–23] are registered)
Fig. 2.17 Plane pixels calculation (a customized PE array combines the seed pixel and the parameters into the plane prediction pixels)
Fig. 2.18 Compensation values in a 4×4 block (b increases to the right, c increases downward):
0      b       2b      3b
c      b+c     2b+c    3b+c
2c     b+2c    2b+2c   3b+2c
3c     b+3c    2b+3c   3b+3c
2.3.3 Evaluation

Synthesized toward a TSMC 0.13-μm CMOS cell library, the proposed design totals about 32K gates when running at 161 MHz. It takes 163 cycles to predict a macroblock. Running at 161 MHz, the proposed design can encode 4K×2K (4,096×2,048) video at 30 fps in real time.
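As a quick sanity check of the 161 MHz figure, the following back-of-the-envelope C program (ours, not from the original) computes the clock rate implied by 163 cycles per macroblock at 4K×2K and 30 fps.

#include <stdio.h>

int main(void) {
    long mbs_per_frame = (4096L / 16) * (2048L / 16);  /* 32,768 MBs */
    long cycles_per_second = mbs_per_frame * 163 * 30; /* 163 cycles/MB */
    printf("required clock: %.1f MHz\n", cycles_per_second / 1e6);
    return 0;  /* prints about 160.2 MHz, consistent with the 161 MHz spec */
}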
2.4 Summary

The intra prediction unit, the first stage of H.264/AVC intra encoding, refers to reconstructed neighboring pixels to generate prediction pixels. Its superior performance comes at the expense of very high computational complexity. The main challenges in designing a hardware architecture for high-resolution applications are sharing hardware resources well and shortening the prediction time to speed up the data-dependency loop among luma 4×4 blocks. Many previous works have adopted either a mode-skipping scheme or a configurable architecture. We instead increase the pixel-level parallelism and propose a new processing order to shorten the prediction time. By adopting the proposed optimized PE scheduling scheme, we also achieve better resource sharing among processing elements. Our design processes one macroblock within 163 cycles while consuming only 32K gates. It supports high-resolution applications up to 4K×2K.
Chapter 3
Integer Motion Estimation
Abstract Interframe prediction in H.264/AVC is carried out in three phases: integer motion estimation (IME), fractional motion estimation, and motion compensation; we discuss these functions in this chapter and in Chaps. 4 and 5, respectively. Because motion estimation in H.264/AVC supports variable block sizes and multiple reference frames, high computational complexity and huge data traffic are the main difficulties in VLSI implementation, and high-resolution video applications, such as HDTV, make these problems even more critical. Current VLSI designs therefore usually adopt parallel architectures to increase total throughput and cope with the computational complexity, while many data-reuse schemes try to increase the data-reuse ratio and hence reduce the required data traffic. In this chapter, we introduce several key points of VLSI implementation for IME.
3.1 Introduction

High-quality video sequences usually have a high frame rate of 30 or 60 frames per second (fps). Two consecutive frames in a video sequence are therefore quite similar. The goal of motion estimation is to exploit this characteristic to reduce temporal redundancy. In Fig. 3.1, for example, when encoding frame t + 1, it suffices to encode the difference between frame t + 1 and frame t (i.e., the airplane) instead of the whole frame t + 1.
In current video coding standards, block-based motion estimation (BME) is widely used to estimate the movement of a rectangular block from the current frame. BME fits well with rectangular video frames and with block-based image transforms (e.g., DCT). However, it has several disadvantages: real objects are rarely rectangular, many types of object motion are hard to estimate with BME, and BME causes blocking artifacts after decoding. Despite these disadvantages, BME is employed by most existing video coding standards.
BME compares each N×N-pixel current block in the current frame with several or all N×N-pixel reference blocks (called candidate blocks) in the search window of the reference frame to determine the best match, as shown in Fig. 3.2. The reference frame may be the previous frame, the next frame, or both.
Fig. 3.1 Two consecutive video frames (frame t and frame t+1)
Fig. 3.2 Block-based motion estimation
A popular matching criterion is to measure the residual calculated by subtracting reference blocks from the current block; the reference block that minimizes the residual is chosen as the best match.
In previous video coding standards, the block size of motion estimation is fixed. Fixed-block-size motion estimation (FBSME) expends the same effort to estimate the motion of moving objects and static objects. This inflexibility causes low coding performance. Moreover, two objects moving in different directions in one block can also lead to low estimation accuracy. Therefore, H.264/AVC adopts variable-block-size motion estimation (VBSME). VBSME adaptively uses a smaller block size for estimating moving objects and a larger block size for static ones to increase the coding efficiency. In H.264/AVC, each frame of a video is partitioned into macroblocks (MBs) of 16×16 pixels. Each macroblock can be split into smaller blocks in four ways: one 16×16 block, two 16×8 blocks, two 8×16 blocks, or four 8×8 blocks. If the 8×8 mode is chosen, each of the four 8×8 blocks may be split further into one 8×8 block, two 8×4 blocks, two 4×8 blocks, or four 4×4 blocks.
VBSME in H.264/AVC is carried out in two phases: integer motion estimation (IME) and fractional motion estimation (FME). In the first step, IME computes the motion vector predictor (MVP) of the current MB. The MVP is the median of the MVs of the left, top, and top-right (or top-left) MBs of the current MB. The position pointed to by the MVP is set as the center point of the search window.
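For reference, a minimal C sketch (ours) of the component-wise median used for the MVP:

/* Median of three values; the MVP applies this to each MV component. */
static inline int median3(int a, int b, int c) {
    int lo = a < b ? a : b;
    int hi = a < b ? b : a;
    return c < lo ? lo : (c > hi ? hi : c);
}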
Fig. 3.3 Flow chart of IME
In the next step, IME performs motion search within the search window and finds the integer motion vector (IMV) for all block sizes (4×4 to 16×16). Finally, IME outputs these 41 IMVs (one for the 16×16 block, two for 16×8 blocks, two for 8×16 blocks, four for 8×8 blocks, eight for 8×4 blocks, eight for 4×8 blocks, and sixteen for 4×4 blocks) to FME, which will be discussed in Chap. 4. The flow chart of IME in H.264/AVC is shown in Fig. 3.3. We introduce some block-based IME algorithms in Sect. 3.1.1 and discuss several key points of VLSI implementation for IME in Sect. 3.1.2.
3.1.1 Algorithms

There are many kinds of algorithms for block-based IME. The most accurate strategy is the Full Search (FS) algorithm. By exhaustively comparing all reference blocks in the search window, FS gives the most accurate motion vector, the one that minimizes the sum of absolute differences (SAD) or the sum of squared differences (SSD). The computation of SAD and SSD is shown in (3.1) and (3.2), respectively, where N denotes the block size, CB the current block, RB the reference block, (i, j) the motion vector (MV), and SR the search range. Algorithm 3.1 shows the process steps of the FS algorithm, where W denotes the frame width, H the frame height, SRH the horizontal search range, and SRV the vertical search range.

SAD(i, j) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} |CB(m, n) − RB(m + i, n + j)|,
SADmin = min(SAD(i, j)), −SR ≤ i, j < SR.    (3.1)
Algorithm 3.1: Full search integer motion estimation.
for w = 0 to W/N do
  for h = 0 to H/N do
    MV(w,h) = (0,0); SAD(w,h) = INFINITE;
    for i = −SRH to SRH − 1 do
      for j = −SRV to SRV − 1 do
        SAD(i,j) = 0;
        for x = 0 to N − 1 do
          for y = 0 to N − 1 do
            SAD(i,j) += |CB(x,y) − RB(i + x, j + y)|;
          endfor
        endfor
        if SAD(i,j) < SAD(w,h) then
          MV(w,h) = (i,j); SAD(w,h) = SAD(i,j);
        endif
      endfor
    endfor
    output MV(w,h);
  endfor
endfor
SSD(i, j) = Σ_{m=0}^{N−1} Σ_{n=0}^{N−1} (CB(m, n) − RB(m + i, n + j))²,
SSDmin = min(SSD(i, j)), −SR ≤ i, j < SR.    (3.2)
Exhaustively comparing all the reference blocks in the search window gives the FS algorithm a huge computational cost. Therefore, many computationally efficient heuristics have been proposed. They can be divided into two types. The first type reduces computational complexity by reducing the number of search points, e.g., the Three Step Search (TSS) algorithm [31] and the Diamond Search (DS) algorithm [68, 84]. Figure 3.4 depicts the search steps of the TSS algorithm, where the number in a circle denotes the order of the search step and a gray circle marks the final choice of each step. The TSS algorithm first searches nine points around the center (x, y) with step size s: (x − s, y − s), (x, y − s), (x + s, y − s), (x − s, y), (x, y), (x + s, y), (x − s, y + s), (x, y + s), and (x + s, y + s). The point with minimum distortion (SAD or SSD) becomes the center of the next step. The initial step size in Fig. 3.4 is three, and the step size is decremented after each step. This allows the algorithm to finish in three steps, hence its name. The DS algorithm also has nine initial search points, but they form a diamond instead of a square. Figure 3.5 shows the position of the diamond; the next search step may proceed along the diamond's vertex or face, so five or three new search points are evaluated at every step. DS stops searching when the search point with minimum distortion is at the center of a diamond.
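A compact C sketch of the TSS loop as described above; block_cost() is an assumed helper returning the SAD or SSD of one candidate position, and the step schedule (3, 2, 1) follows the description of Fig. 3.4.

extern long block_cost(int x, int y);  /* assumed matching-cost helper */

/* Three Step Search: evaluate nine points around the center, move the
   center to the winner, shrink the step, and repeat for three steps. */
void three_step_search(int cx, int cy, int *mvx, int *mvy) {
    for (int s = 3; s >= 1; s--) {
        int bx = cx, by = cy;
        long best = block_cost(cx, cy);
        for (int dy = -s; dy <= s; dy += s)
            for (int dx = -s; dx <= s; dx += s) {
                long cost = block_cost(cx + dx, cy + dy);
                if (cost < best) { best = cost; bx = cx + dx; by = cy + dy; }
            }
        cx = bx; cy = by;  /* the winner becomes the next center */
    }
    *mvx = cx; *mvy = cy;
}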
Fig. 3.4 Three step search algorithm
Fig. 3.5 Diamond search algorithm
Both TSS and DS algorithms are widely used, and many algorithms extend them, such as the Four Step Search algorithm. Some algorithms instead reduce the number of search points by an early-termination strategy. These algorithms start from the top-left or center point of the search window and examine reference blocks one by one. The search process terminates when the distortion of a reference block falls below a predefined threshold; the remaining reference blocks are not tested even if one of them might have lower distortion. Such algorithms trade off computational complexity against video quality through the threshold value: the higher the threshold, the fewer the search points, and hence the lower the complexity.
Fig. 3.6 Down sampling
The second type of algorithm reduces the computation of each search point. Koga et al. [31] proposed a down-sampling method in which a 16×16 block can be 1/2-down-sampled to an 8×16 or a 16×8 block, or 1/4-down-sampled to an 8×8 block, as shown in Fig. 3.6. The complexity of the distortion computation of each reference block is thereby reduced to one-half or one-fourth. These two types of complexity-reduction schemes can be combined to further reduce the computational complexity. However, these algorithms reduce search time at the expense of video-quality loss and bit-rate increase, so they are usually used in low-resolution or mobile applications.
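A minimal C sketch (ours) of a 1/4-down-sampled SAD, assuming one representative pixel per 2×2 quad; rb points at the candidate's top-left pixel.

#include <stdint.h>
#include <stdlib.h>

/* SAD over every other pixel in both directions: one quarter of the
   subtractions of a full N x N SAD. */
long sad_quarter_sampled(const uint8_t *cb, const uint8_t *rb,
                         int n, int stride) {
    long sad = 0;
    for (int y = 0; y < n; y += 2)
        for (int x = 0; x < n; x += 2)
            sad += abs(cb[y * stride + x] - rb[y * stride + x]);
    return sad;
}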
3.1.2 Design Considerations

Due to its computational regularity and excellent video quality, full search motion estimation is commonly employed in VLSI implementations, so we concentrate on the Full Search algorithm in the rest of this chapter. This exhaustive search strategy, however, leads to high computational complexity and a huge amount of data traffic. A highly parallel architecture is essential to perform the large amount of computation. Moreover, there are many overlapping pixels between consecutive search windows as well as between reference blocks during block matching, so an efficient data-reuse scheme is essential to reduce redundant data access and thus total data traffic.
3.2 Related Works

We describe several representative IME architectures in Sect. 3.2.1 and introduce data-reuse schemes in Sect. 3.2.2.
3.2.1 Architecture

There are many hardware designs for FBSME. Yang et al. [80] proposed the first 1D array architecture, shown in Fig. 3.7. Each processing element (PE) is responsible for calculating the SAD of one reference block, and the number of PEs equals the number of reference blocks in the horizontal direction within the search range. Current block pixels are propagated through shift registers while reference block pixels are broadcasted to all PEs. This design allows sequential inputs but performs parallel processing with 100% hardware utilization. Moreover, it reduces memory traffic by broadcasting reference block pixels.
Using a similar concept, Yeo and Hu [81] proposed a 2D array architecture, shown in Fig. 3.8. Current block pixels are still propagated through shift registers, but reference block pixels are broadcasted in both the horizontal and vertical directions. A set of N×N PEs is responsible for an N×N region of the search window, where N is the block size (N = 4 in Fig. 3.8). Consequently, there are in total (2SRH/N) × (2SRV/N) sets of PEs for the whole search window. By two-directional data broadcasting, this design further increases the data-reuse ratio.
Komarek and Pirsch [34] proposed a 2D array architecture, shown in Fig. 3.9. Each current pixel is stored in a PE, and hence the number of PEs equals the current block size (block size N = 4 in Fig. 3.9).
Fig. 3.7 Yang’s 1D array architecture
Fig. 3.8 Yeo and Hu’s 2D array architecture
Fig. 3.9 Komarek and Pirsch’s 2D array architecture
Fig. 3.10 Vos and Stegherr's 2D array architecture
Instead of broadcasting, reference block pixels are propagated horizontally from one PE to the next. The SAD of a column is accumulated in the vertical direction first, and then the SAD of the whole block is summed up in the horizontal direction.
Vos and Stegherr [72] proposed a 2D array architecture with N×N PEs, shown in Fig. 3.10 (N = 4). Current block pixels are also stored in PEs. Reference block pixels are stored in shift registers and can be propagated in three directions. With these three-directional shift registers, this design can search reference blocks in snake scan order instead of raster scan order, further increasing the data-reuse rate when the search process switches from one column or row to the next.
Several 1D and 2D array architectures have been proposed for full-search variable-block-size motion estimation (FSVBSME). Instead of performing block matching for every block size, most of them sum up the SADs of small blocks (e.g., 4×4) to generate the SADs of big blocks (e.g., 16×16). This SAD-reuse scheme significantly reduces computational complexity. Yap and McCanny [79] proposed a 1D array architecture of 16 PEs using a data flow similar to that of Yang et al. [80]. Huang et al. [22] proposed a 2D array of 16×16 PEs with 1D data broadcasting and 1D partial result reuse; the block diagram is shown in Fig. 3.11 and the detailed PE architecture in Fig. 3.12.
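As an illustration of the SAD-reuse idea (a sketch with our own naming, not any particular design's datapath), the following C fragment combines the sixteen 4×4 SADs of one candidate into the 8×8 and 16×16 SADs; the 8×4, 4×8, 16×8, and 8×16 partitions combine analogously.

/* Combine 4x4 SADs into larger-partition SADs instead of re-matching.
   sad4x4[by][bx] is the SAD of the 4x4 block at (bx, by) within the
   macroblock, for one candidate position. */
void merge_sads(const long sad4x4[4][4], long sad8x8[2][2], long *sad16x16) {
    *sad16x16 = 0;
    for (int by = 0; by < 2; by++)
        for (int bx = 0; bx < 2; bx++) {
            long s = sad4x4[2 * by][2 * bx]     + sad4x4[2 * by][2 * bx + 1]
                   + sad4x4[2 * by + 1][2 * bx] + sad4x4[2 * by + 1][2 * bx + 1];
            sad8x8[by][bx] = s;  /* an 8x8 SAD is the sum of four 4x4 SADs */
            *sad16x16 += s;      /* the 16x16 SAD is the sum of four 8x8 SADs */
        }
}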
Fig. 3.11 Huang et al.’s 2D array architecture
Fig. 3.12 A 4×4 PE array of Huang et al.'s architecture
Kim et al. [30] also proposed a 2D array of 16×16 PEs, employing a preload register and a search data buffer inside each PE to achieve 100% PE utilization. Chen et al. [3] proposed a multiple(8)-candidate parallel architecture, shown in Fig. 3.13, that uses a shared reference buffer. Each array consists of 16×16 PEs and computes the 41 SADs of one reference block every cycle, as shown in Fig. 3.14. Eight horizontally adjacent reference blocks are processed in parallel. By means of multiple-candidate data reuse, this design significantly reduces on-chip memory traffic and hence power consumption.
We use a simple example to illustrate the difference between the single 2D PE-array architecture, the multiple(Q)-candidate parallel architecture with Q 2D PE-arrays, and our architecture proposed in Sect. 3.3. In Fig. 3.15, there are two consecutive current blocks, and the search windows (SWs) of these two current blocks overlap each other. Two reference blocks (RBs) lie in the overlapping region.
Fig. 3.13 Chen et al.'s multiple(8)-candidate 2D array architecture
Fig. 3.14 A 16×16 PE array of Chen et al.'s architecture
Fig. 3.15 Overlapping between search windows of two adjacent current MBs
Fig. 3.16 Processing steps of the traditional single 2D PE-array architecture
In Fig. 3.16, the traditional design with only one 2D PE-array has to perform block matching sequentially. Steps 1 and 2 perform block matching of CB1 against RBX and RBY; CB2 is processed similarly in Steps 3 and 4. There is no data reuse for the overlapping region between RBX and RBY. In the multiple(2)-candidate parallel architecture, shown in Fig. 3.17, the two RBs are processed in parallel by two 2D PE-arrays. The performance is doubled and memory traffic is reduced, as the pixels in the overlapping region are read only once and broadcasted to two PE-arrays.
Fig. 3.17 Processing steps of the multiple(2)-candidate parallel architecture with two 2D PE-arrays
Although the FS algorithm is commonly employed in VLSI implementations, some designs adopt low-complexity partial-search algorithms to increase performance. Lin et al. [42] proposed a multiresolution parallel motion estimator: three ME modes with different block sizes, search ranges, and down-sampling rates are processed in parallel, and only two modes are passed to FME. Su et al. [64] proposed a two-stage IME: the first stage performs down-sampling motion search on 16×16 blocks only, and the second stage performs VBSME on pixels around the candidates selected in the first stage. Both designs reduce search complexity at the expense of video-quality degradation and bit-rate increase.
3.2.2 Data-Reuse Schemes

During block matching, there are many overlapping pixels between consecutive search windows as well as between reference blocks. It is essential that IME designs increase data reuse and thus reduce memory access. Tuan et al. [71] defined four levels of data-reuse schemes. The Level A scheme reuses overlapped reference pixels between adjacent reference blocks of one current MB. The Level B scheme reuses overlapped reference pixels of adjacent reference block strips of one current MB. The Level C scheme reuses pixels in the overlapped regions of the search windows of consecutive current MBs. The Level D scheme reuses pixels in the entire search window strips of consecutive current MBs. Figure 3.18 shows these four data-reuse schemes, where N denotes the MB size, W the frame width, SRV the vertical search range, and SRH the horizontal search range. Table 3.1 lists the required memory size for each scheme. The Level A scheme has the smallest local memory but results in the highest off-chip memory traffic.
Fig. 3.18 Four levels of data-reuse scheme (a) Level A, (b) Level B, (c) Level C, (d) Level D
Table 3.1 Local memory size of different levels of data-reuse scheme
Level   Local memory size
A       N × (N − 1)
B       (N + SRH) × (N − 1)
C       (N + SRV − 1) × (SRH − 1)
D       (W + SRH − 1) × (SRV − 1)
On the other hand, the Level D scheme achieves the lowest off-chip memory traffic but has the largest local memory. Presently, most ME designs adopt the Level C scheme to strike a balance between memory size and traffic.
3.3 A VLSI Design for Integer Motion Estimation

We propose a multiple(P)-macroblock data-reuse scheme in Sect. 3.3.1. This scheme simultaneously increases the data-reuse ratio and reduces on-chip memory access. Moreover, the required local memory size is fixed and independent of the horizontal search range. We then combine our scheme with the multiple(Q)-candidate data-reuse scheme [3] to obtain an advanced scheme with an excellent data-reuse ratio. Based on the proposed scheme, we present our VLSI architecture in Sect. 3.3.2 and our data flow for multiple reference frames in Sect. 3.3.3 [26, 27]. Finally, we show the experimental results in Sect. 3.3.4.
3.3.1 Proposed Data-Reuse Scheme

In the Level C scheme, reference blocks in the overlapping region of consecutive search windows are held in a local memory and read multiple times during the processing of consecutive current MBs. This results in much unnecessary memory access and power consumption. Our main idea, called multiple(P)-macroblock data reuse, is to reduce memory access by broadcasting data. In the proposed scheme, a reference block is read only once and broadcasted to all PE-arrays. Each PE-array performs block matching for one current MB, so the SADs of a reference block against multiple current MBs are computed in parallel. The degree P is defined as the number of simultaneously processed current MBs.
We use the same example from Sect. 3.2.1 to illustrate our approach, as depicted in Fig. 3.19. Because the overlapping of SWs involves two adjacent current MBs, the degree P is 2. Two 2D PE-arrays are needed in this example, one per current MB. In Step 1, the first reference block is read and broadcasted to both PE-arrays, and block matching of CB1 and CB2 against it is performed in parallel. In Step 2, the same process is performed for the second reference block. With two 2D PE-arrays, the throughput is doubled while the hardware utilization remains 100%. Compared with the traditional scheme shown in Fig. 3.16, the amount of memory access is halved. Moreover, the local memory in the proposed approach needs to hold only a column of reference blocks (N × (N + SRV − 1)) in the search window instead of the whole overlapping region of the search windows. Therefore, we reduce the local memory size to that of the Level B scheme while keeping the memory traffic as low as that of the Level C scheme.
In our multiple(P)-macroblock data-reuse scheme, the optimum degree P that yields 100% hardware utilization depends on the horizontal search range. Increasing the horizontal search range enlarges the search window and, therefore, increases the number of overlapped search windows.
Fig. 3.19 Processing steps of the multiple(2)-macroblock parallel architecture with two 2D PE-arrays
Table 3.2 Relationship between the horizontal search range and the degree of multiple-macroblock parallelism
Degree of multiple-macroblock parallelism   Horizontal search range [+X, −X+1] for 100% utilization
2   X = 16
3   X = 24
4   X = 32
5   X = 40
6   X = 48
7   X = 56
8   X = 64
(For each degree, any other X yields less than 100% hardware utilization.)
Consequently, reference blocks in the overlapping region must be broadcasted to more PE-arrays for more current MBs. If the horizontal search range is [+16, −15], for example, only two search windows overlap. If the horizontal search range increases to [+32, −31], overlapping occurs among four consecutive search windows, and the optimum P is 4. In contrast, employing the multiple(4)-macroblock data-reuse scheme decreases the utilization of the PE-arrays when the horizontal search range is smaller than [+32, −31]. Table 3.2 lists the various degrees of multiple-macroblock parallelism and the associated optimum search ranges.
Chen et al.'s [3] multiple-candidate data-reuse scheme reduces on-chip memory access by reading multiple consecutive reference blocks at the same time and broadcasting overlapped pixels. Our multiple-macroblock data-reuse scheme reduces on-chip memory access by reading each reference block only once and performing block matching of multiple consecutive current MBs. These two schemes are orthogonal, so we integrate them to further increase the data-reuse ratio and decrease on-chip memory access. We use the same example to illustrate this concept, as depicted in Fig. 3.20. The degree of both multiple-candidate and multiple-macroblock parallelism is 2; that is, two reference blocks are read at the same time and broadcasted for block matching of two current MBs, so four 2D PE-arrays work in parallel. The throughput is quadrupled and the hardware utilization is still 100%.
To compare the on-chip memory traffic, assume the block size in the example is 16×16. During the whole process, the traditional scheme in Fig. 3.16 must read each of the two reference blocks twice, or 2 × 2 × (16 × 16) = 1,024 pixels in total. The multiple(2)-candidate data-reuse scheme in Fig. 3.17 must read the two overlapped reference blocks twice, or 2 × (16 × 17) = 544 pixels.
Fig. 3.20 Processing step of the combined multiple(2)-candidate parallel and multiple(2)-macroblock parallel architecture with four 2D PE-arrays
The multiple(2)-macroblock data-reuse scheme in Fig. 3.19 must read one reference block twice, or 2 × (16 × 16) = 512 pixels. In contrast, the combined scheme in Fig. 3.20 only has to read the two overlapped reference blocks once, or 1 × (16 × 17) = 272 pixels. Consequently, 73% of the on-chip memory access is eliminated in comparison with the traditional scheme.
The degrees of multiple-candidate and multiple-macroblock parallelism in the combined scheme are both flexible. The target processing capability determines the required degree of multiple-candidate parallelism, and the horizontal search range determines the optimum degree of multiple-macroblock parallelism. Therefore, this combined scheme suits a range of performance requirements and parameter settings.
3.3.2 Architecture

We target the proposed hardware design toward 1,920×1,088 (HD1080p) at 30 frames per second (fps) with a [+32, −31] vertical and horizontal search range and two reference frames (RFs) when running at 150 MHz. For this requirement, the proposed design combines the multiple(4)-candidate and multiple(4)-macroblock data-reuse schemes: four consecutive reference blocks are processed in parallel for each of four consecutive CBs in an overlapped fashion, and thus 4 × 4 = 16 2D PE-arrays are needed.
The block diagram of the proposed architecture is shown in Fig. 3.21. Four CB Register Arrays hold four consecutive current MBs for multiple(4)-macroblock parallel processing. To further increase the data-reuse ratio, we use a Reference Pixel Register Array to hold four consecutive reference blocks (16×19 pixels).
Fig. 3.21 Block diagram of the proposed multiple(4)-candidate parallel and multiple(4)-macroblock parallel architecture with sixteen 2D PE-arrays
When the next four reference blocks are read into the Reference Pixel Register Array, only 16×4 pixels are updated, whereas 16×15 pixels are reused. Moreover, the Reference Pixel Register Array holds one more column of 19 pixels to eliminate bubble cycles when motion estimation switches from one column to the next. Therefore, the size of the Reference Pixel Register Array is 17 × 19 = 323 bytes.
In the proposed architecture, there are sixteen 2D PE-arrays, each consisting of 16 × 16 = 256 PEs. PE-arrays 0–3 perform block matching for CB1, PE-arrays 4–7 for CB2, PE-arrays 8–11 for CB3, and PE-arrays 12–15 for CB4. Every PE-array can compute the 41 SADs of one reference block in one clock cycle; the architecture of a 2D PE-array is similar to Fig. 3.14. In the next cycle, all SADs are transferred to comparators, and the minimum SADs of the 41 variable-size blocks are sent to the MV Generators. Finally, 41 MVs for RF0 and 41 MVs for RF1 of one current MB are generated and transferred to the fractional motion estimator (FME).
Because sixteen 2D PE-arrays work in parallel, the required chip area is very large. We therefore adopt two popular methods to reduce hardware cost. The first is 1/2-down-sampling, which reduces the number of PEs in a 2D PE-array from 16 × 16 = 256 to 16 × 8 = 128. The second is pixel truncation. He and Liou [20] proposed a pixel truncation method for hardware implementation that truncates several least significant bits (LSBs) of each pixel to reduce the complexity of the distortion computation. A pixel value ranges from 0 to 255 and is therefore represented with 8 bits in hardware; if 3 LSBs are truncated, a pixel is only 5 bits, and 5-bit subtractors are used instead of 8-bit ones when computing the distortion. We use 5 bits instead of 8 to represent a pixel. This method also reduces the required memory traffic.
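A minimal C sketch (ours) of pixel truncation: the 3 LSBs are dropped so the SAD datapath operates on 5-bit values.

#include <stdint.h>
#include <stdlib.h>

/* Keep only the 5 most significant bits of a pixel. */
static inline int trunc5(uint8_t p) { return p >> 3; }

/* SAD on truncated pixels: 5-bit subtractors suffice in hardware. */
long sad_truncated(const uint8_t *cb, const uint8_t *rb, int n, int stride) {
    long sad = 0;
    for (int y = 0; y < n; y++)
        for (int x = 0; x < n; x++)
            sad += abs(trunc5(cb[y * stride + x]) - trunc5(rb[y * stride + x]));
    return sad;
}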
3.3.3 Data Flow

To be compatible with both the multiple-candidate and multiple-macroblock parallel schemes, we employ a column-by-column scan order. Moreover, we adopt the snake scan scheme to further increase the data-reuse ratio and to eliminate the bubble cycles for updating the whole Reference Pixel Register Array when switching from one column to the next. Furthermore, we set (0,0) as the search center to reuse the search window overlap between adjacent MBs.
Present-day ME designs usually process multiple reference frames one by one. However, this straightforward processing order is unsuitable for the proposed architecture, as illustrated in Figs. 3.22 and 3.23. In Fig. 3.22, the search range is [+32, −31], so there are 64 × 64 search points for one current MB. The search points of CB1 include A0–D0 in RF0 and A1–D1 in RF1; those of CB2 include B0–E0 in RF0 and B1–E1 in RF1; those of CB3 include C0–F0 in RF0 and C1–F1 in RF1; and those of CB4 include D0–G0 in RF0 and D1–G1 in RF1. The widths of A0, B0, . . . , and G1 are all 16 pixels.
Figure 3.23a shows the data flow when two reference frames are processed sequentially. In Step 1, PE-arrays 0–3 perform block matching of CB1 against A0. In Step 2, B0 is broadcasted to PE-arrays 0–3 and PE-arrays 4–7 for block matching of CB1 and CB2, respectively. After Step 4, block matching of CB1 against RF0 is completed. In Step 5, block matching of CB1 against RF1 starts, and A1 is transferred to PE-arrays 0–3. However, block matching of CB2–CB4 is still in RF0, and E0 is broadcasted to PE-arrays 4–7, 8–11, and 12–15.
Fig. 3.22 Overlapping between search windows among four consecutive current macroblocks
Therefore, data from two search areas, A1 and E0, should be read and transferred concurrently. This condition also occurs in Steps 6–7 and Steps 9–11 (the gray region in Fig. 3.23a). Reading and transferring the data of two search areas concurrently leads to complex control and data routing, and two address generators are required for Reference Pixel SRAM0 and SRAM1. In addition, some search area data is read more than once in this data flow (e.g., E0 in Steps 5 and 9); this inefficiency increases on-chip memory traffic.
To solve these problems, we propose a reference frame interlaced processing (RFIP) scheme for our multiple-reference-frame ME architecture. Figure 3.23b shows the detailed data flow. In Step 1, PE-arrays 0–3 still perform block matching of CB1 against A0.
Fig. 3.23 (a) Data flow of the traditional sequential process for two reference frames; (b) Data flow of the proposed reference frame interlaced processing for two reference frames
In Step 2, PE-arrays 0–3 perform block matching of CB1 against A1 instead of B0. In Steps 3 and 4, B0 and B1 are broadcasted to PE-arrays 0–3 and 4–7, respectively. After Step 8, block matching of CB1 is completed. In Step 9, PE-arrays 0–3 take over the processing of CB5, and E0 is transferred to all PE-arrays. CB2, CB3, and CB4 are then completed in Steps 10, 12, and 14,
respectively. In this data flow, data from only one search area is required at any time. Therefore, control is simpler and no multiplexer is needed for selecting data from different search windows. Moreover, each pixel is read only once and fully reused, minimizing on-chip memory access. According to the RFIP data flow, our design can process one MB every 512 cycles after spending 2,048 cycles on the first MB of each row. Our design fits into any block-based encoding pipeline, and the RFIP scheme is applicable to three or more reference frames as well.
3.3.4 Evaluation

When designing an IME architecture, one has to trade off local memory size against external memory traffic: a larger local memory increases the data-reuse ratio and reduces the off-chip memory traffic. We analyze the memory traffic requirements of the four data-reuse schemes shown in Table 3.3. Scheme 1 adopts the Level C scheme with a search window memory; both Huang et al. [22] and Kim et al. [30] used it. With the same memory size as Scheme 1, Scheme 2, proposed in [3], adds the multiple-candidate data-reuse scheme to decrease on-chip memory traffic. Schemes 3 and 4 are our proposed approaches. By multiple-macroblock data reuse, Scheme 3 keeps the off-chip memory traffic as low as that of the Level C scheme (Scheme 2) with a memory as small as that of the Level B scheme. Scheme 4 further reduces on-chip memory traffic by adopting both the multiple-candidate and multiple-macroblock data-reuse schemes with a register array holding consecutive reference blocks. Table 3.4 shows the theoretical on-chip and off-chip memory traffic, and Table 3.5 the required storage size, of these four schemes. We also take a real case as an example: 1,920×1,088 (HD1080p), 30-fps video with a [+32, −31] SR and one RF; the degrees of multiple-candidate (Q) and multiple-macroblock (P) parallelism are both 4. Table 3.6 shows the required memory traffic for each of the four schemes. Our architecture with the multiple(4)-candidate and multiple(4)-macroblock data-reuse scheme saves 98.1% of the on-chip memory traffic and 75% of the local memory size compared with the popular Level C scheme.
Table 3.3 Four data-reuse schemes
Scheme                               1   2   3   4
Candidate block strip memory                 v   v
Search window memory                 v   v
Level B data-reuse scheme                    v   v
Level C data-reuse scheme            v   v
Multiple-candidate Q-parallelism         v       v
Multiple-macroblock P-parallelism            v   v
Reference pixel register array                   v
Table 3.4 Memory traffic analysis of four data-reuse schemes
Scheme   On-chip memory traffic (Bps)                          Off-chip memory traffic (Bps)
1        F × (H/N) × (W/N) × SRV × SRH × N²                    F × (H/N) × (W + SRH − 1) × (N + SRV − 1)
2        F × (H/N) × (W/N) × (SRH × SRV/Q) × N × (N + Q − 1)   F × (H/N) × (W + SRH − 1) × (N + SRV − 1)
3        F × (H/N) × (W/N) × SRV × SRH × N²/P                  F × (H/N) × (W + SRH − 1) × (N + SRV − 1)
4        F × (H/N) × (W/N) × SRH × N × (N + SRV − 1)/P         F × (H/N) × (W + SRH − 1) × (N + SRV − 1)

Table 3.5 Storage cost analysis of four data-reuse schemes
Scheme   Local memory size (Bytes)   Register array size (Bytes)
1        (N + SRV − 1) × SRH         –
2        (N + SRV − 1) × SRH         –
3        (N + SRV − 1) × N           –
4        (N + SRV − 1) × N           N × (N + Q − 1)

Table 3.6 Memory traffic and size comparison of four data-reuse schemes
Scheme   On-chip memory traffic (GBps)   Off-chip memory traffic (MBps)   Memory size (Bytes)
1        257                             319                              5,056
2        76                              319                              5,056
3        64                              319                              1,264
4        5                               319                              1,264
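The entries of Table 3.6 can be reproduced from the formulas of Table 3.4; the following small C program (ours) plugs in F = 30 fps, W = 1,920, H = 1,088, N = 16, SRH = SRV = 64, and Q = P = 4.

#include <stdio.h>

int main(void) {
    double F = 30, W = 1920, H = 1088, N = 16;
    double SRH = 64, SRV = 64, Q = 4, P = 4;
    double base = F * (H / N) * (W / N);
    double s1 = base * SRV * SRH * N * N;                  /* Scheme 1 */
    double s2 = base * (SRH * SRV / Q) * N * (N + Q - 1);  /* Scheme 2 */
    double s3 = s1 / P;                                    /* Scheme 3 */
    double s4 = base * SRH * N * (N + SRV - 1) / P;        /* Scheme 4 */
    double off = F * (H / N) * (W + SRH - 1) * (N + SRV - 1);
    printf("on-chip: %.0f %.0f %.0f %.0f GBps\n",
           s1 / 1e9, s2 / 1e9, s3 / 1e9, s4 / 1e9);  /* 257 76 64 5 */
    printf("off-chip: %.1f MBps\n", off / 1e6);      /* 319.6 */
    return 0;
}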
As mentioned in Sect. 3.3.2, we use 1/2-down-sampling and pixel truncation to reduce the hardware cost. These two methods may degrade quality, so we test our design with four video sequences, each consisting of 100 frames in 1,920×1,088 format. Figure 3.24 shows the R-D curves of the reference software JM9.0 and our encoder. The average PSNR drop is 0.03 dB, while the average bit-rate increase is 0.73%. We have implemented the proposed design in synthesizable Verilog HDL and synthesized it toward a TSMC 130-nm CMOS standard cell library. Table 3.7 shows the synthesis results of our IME design.
3.4 Summary

In this chapter, we have discussed the key points of VLSI implementation for IME and introduced some hardwired IME designs. We have also proposed a multiple(P)-macroblock data-reuse scheme for FSVBSME, in which P current MBs perform block matching in parallel and the reference pixels in the overlapping region of consecutive SWs are broadcasted and fully utilized. The data-reuse ratio of the on-chip memory therefore increases P times.
Fig. 3.24 The R-D curve of JM9.0 and our modified encoder. (a) Bluesky, (b) Tractor, (c) Riverbed, (d) Sunflower
Table 3.7 Synthesis result
Process         130 nm
No. of PE       128 × 16
Gate count      453K
Local memory    2.94 KB
Frequency       130 MHz
Max. spec.      1,920 × 1,088, 30 fps
Search range    64 × 64
No. of RF       2
Block size      4×4–16×16
Moreover, we combined the proposed multiple-macroblock data-reuse scheme with an existing multiple-candidate data-reuse scheme to obtain an advanced one that needs ultra-low on-chip memory traffic. Finally, we proposed a hardware architecture for H.264/AVC FSVBSME. This architecture adopts the multiple(4)-candidate and multiple(4)-macroblock data-reuse scheme and contains sixteen 2D PE-arrays working in parallel. For multiple reference frames, our RFIP data flow leads to simpler control, lower hardware cost, and smaller memory traffic. Synthesized into a TSMC 130-nm cell library, our design totals 453K gates with 2.94 KB of local memory. Experimental results show that, running at 130 MHz, our design can process one MB every 512 cycles and is capable of supporting 1,920×1,088, 30-fps video with a 64×64 SR and 2 RFs.
Chapter 4
Fractional Motion Estimation
Abstract Motion estimation in H.264/AVC supports quarter-pixel precision and is usually carried out in two phases: integer motion estimation (IME) and fractional motion estimation (FME). We discussed IME in Chap. 3. After IME finds an integer motion vector (IMV) for each of the 41 subblocks, FME performs motion search around the refinement center pointed to by the IMV and further refines the 41 IMVs into fractional MVs (FMVs) of quarter-pixel precision. FME interpolates half-pixels using a six-tap filter and then quarter-pixels using a two-tap one. Nine positions are searched in half refinement (the integer-pixel search center pointed to by the IMV and 8 half-pixel positions) and again in quarter refinement (1 half-pixel position and 8 quarter-pixel positions). The position with minimum residual error is chosen as the best match. FME significantly improves video quality (+0.3 to +0.5 dB) and reduces bit-rate (20–37%) according to our experimental results. However, our profiling report shows that FME consumes more than 40% of the total encoding time. Therefore, an efficient hardware accelerator for FME is indispensable.
4.1 Introduction

Motion estimation in video coding exploits the temporal redundancy of a video sequence. In previous video coding standards, the block size of motion estimation is fixed. Fixed-block-size motion estimation (FBSME) spends the same effort estimating the motion of moving objects and static objects. This inflexibility causes low coding performance, and two objects moving in different directions in one block also lead to low estimation accuracy. Therefore, H.264/AVC adopts variable-block-size motion estimation (VBSME), which adaptively uses a smaller block size for moving objects and a larger block size for static ones to increase the coding efficiency. In H.264/AVC, each frame of a video is partitioned into macroblocks (MBs) of 16×16 pixels. Each macroblock can be split into smaller blocks in four ways: one 16×16 block, two 16×8 blocks, two 8×16 blocks, or four 8×8 blocks. If the 8×8 mode is chosen, each of the four 8×8 blocks may be split further into one 8×8 block, two 8×4 blocks, two 4×8 blocks, or four 4×4 blocks.
Table 4.1 Performance improvement of H.264/AVC encoder with FME
Format of test sequences (a)   PSNR        Bit-rate
1. 720 × 480                   +0.524 dB   −26.6%
2. 1,280 × 720                 +0.295 dB   −37.0%
3. 1,920 × 1,088               +0.384 dB   −20.7%
(a) GOP: IPPPPP; RDO: off; 2 reference frames; search range: 1. 64×64, 2. 128×128, 3. 256×256
VBSME in H.264/AVC is carried out in two phases: integer motion estimation (IME) and fractional motion estimation (FME). First, IME performs motion search within the whole search range and finds an integer motion vector (IMV) for each of the 41 subblocks. Next, FME interpolates half-pixels with a six-tap filter and then quarter-pixels with a two-tap filter. Finally, FME performs motion search around the refinement center pointed to by each IMV and refines the 41 IMVs into fractional motion vectors (FMVs) of quarter-pixel precision. FME significantly improves video quality and reduces bit-rate, as exemplified in Table 4.1. However, our profiling report shows that FME consumes more than 40% of the total encoding time (5 reference frames, ±16 search range). Therefore, a hardware accelerator for FME is essential. We introduce the FME algorithm in Sect. 4.1.1 and discuss the key points of VLSI implementation for FME in Sect. 4.1.2.
4.1.1 Algorithms

After IME finds motion vectors at integer-pixel precision, FME further refines the motion vectors of the 41 subblocks to quarter-pixel precision. Figure 4.1 shows the flow chart of FME for one subblock. The inputs of FME are the IMVs, reference frame pixels, and current block pixels; the outputs are the refined FMVs of the 41 subblocks. Reference frame pixels from the search window pointed to by the IMVs are used for interpolation, and the residual pixels between current block pixels and interpolated pixels are used to compute the sum of absolute transformed differences (SATD).
FME is divided into two parts: half refinement and quarter refinement. In half refinement, we generate half-pixels using six-tap filters, compute the SATD values and the MV cost based on the half MV-bit-rate estimation, and choose the best half refinement result. In quarter refinement, we generate quarter-pixels using two-tap filters, compute the SATD values and the MV cost based on the quarter MV-bit-rate estimation, and finally choose the best FMV for each subblock.
Figure 4.2 illustrates the refinement algorithm employed in the reference software, which is also commonly employed in hardware implementations. Nine positions are searched in half refinement (the integer-pixel search center pointed to by the IMV and 8 half-pixel positions) and again in quarter refinement (1 half-pixel position and 8 quarter-pixel positions). The position with the lowest cost is chosen as the best match.
Fig. 4.1 Flow chart of FME
In H.264/AVC, the luma prediction values for half-pixels are derived by applying six-tap filters to the integer pixels, and the luma prediction values for quarter-pixels by averaging integer and half-pixels. Figure 4.3 shows some integer and fractional pixels: shaded boxes with upper-case letters are integer pixels; boxes aa, bb, cc, dd, ee, ff, gg, hh, b, h, j, s, and m are half-pixels; boxes a, c, d, e, f, g, i, k, n, p, q, and r are quarter-pixels. Figure 4.4 gives the formulas of the six-tap filter (for half interpolation) and the two-tap filter (for quarter interpolation), and Table 4.2 gives the mapping of all the quarter-pixels in Fig. 4.3.
The cost function of FME considers both distortion and an MV-bit-rate estimate: Cost = Distortion + λ × MV-bit-rate. Distortion is the sum of absolute transformed differences (SATD), while the MV-bit-rate is estimated with a lookup table defined in the reference software.
1. Distortion: We take the SATD value as the distortion in the cost function, summing the absolute values of all Hadamard-transformed residual pixels; in this computation, all types of subblocks are decomposed into 4×4 subblocks (see the sketch after this list).
Fig. 4.2 Refinement algorithm in the reference software
Fig. 4.3 Fractional pixels position
2. λ: The Lagrangian parameter λ is derived from the quantization parameter to trade off distortion against bit-rate. We set its value according to the reference software. 3. MV-bit-rate: It is looked up from a table defined in the reference software.
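To make the distortion term concrete, the following C sketch computes the SATD of one 4 × 4 block the way the reference software does: residual generation, a horizontal and a vertical 4-point Hadamard pass, then a sum of absolute values. It is an illustrative model with our own names, not the book's RTL; the reference software additionally halves the accumulated sum, which we omit here.

    #include <stdlib.h>

    /* SATD of one 4x4 block: residuals, 2-D Hadamard, sum of |coeff| */
    int satd4x4(const int cur[4][4], const int ref[4][4])
    {
        int d[4][4], m[4][4];
        int sum = 0;

        for (int i = 0; i < 4; i++)              /* residual pixels */
            for (int j = 0; j < 4; j++)
                d[i][j] = cur[i][j] - ref[i][j];

        for (int i = 0; i < 4; i++) {            /* horizontal 1-D pass */
            int s01 = d[i][0] + d[i][1], d01 = d[i][0] - d[i][1];
            int s23 = d[i][2] + d[i][3], d23 = d[i][2] - d[i][3];
            m[i][0] = s01 + s23;  m[i][1] = s01 - s23;
            m[i][2] = d01 - d23;  m[i][3] = d01 + d23;
        }
        for (int j = 0; j < 4; j++) {            /* vertical pass + accumulate */
            int s01 = m[0][j] + m[1][j], d01 = m[0][j] - m[1][j];
            int s23 = m[2][j] + m[3][j], d23 = m[2][j] - m[3][j];
            sum += abs(s01 + s23) + abs(s01 - s23)
                 + abs(d01 - d23) + abs(d01 + d23);
        }
        return sum;
    }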
Half_interpolate(x, y, z, u, v, w) = x − 5y + 20z + 20u − 5v + w.
Clip(x) = 0 (if x < 0), 255 (if x > 255), or x (otherwise).
1. h = Clip((Half_interpolate(A, C, G, M, R, T) + 16) >> 5). cc, dd, m, ee, and ff are defined similarly.
2. aa = Clip((Half_interpolate(V, W, A, B, X, Y) + 16) >> 5). bb, b, s, gg, and hh are defined similarly.
3. j = Clip((Half_interpolate(aa, bb, b, s, gg, hh) + 512) >> 10), or j = Clip((Half_interpolate(cc, dd, h, m, ee, ff) + 512) >> 10). Both functions generate the same result.
Quarter_interpolate(x, y) = (x + y + 1) >> 1. For example, a = Quarter_interpolate(G, b) = (G + b + 1) >> 1.

Fig. 4.4 Interpolation formulas

Table 4.2 Function mapping of quarter-pixel generation: Quarter_interpolate(x, y)
Pixel  a  c  d  n  f  i  k  q  e  g  p  r
x      G  H  G  M  b  h  j  j  b  b  h  m
y      b  b  h  h  j  j  m  s  h  m  s  s
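As a concrete illustration of Fig. 4.4 and Table 4.2, the following C sketch models the three interpolation cases. The function names are ours, and for the center pixel j the inputs must be the intermediate six-tap sums before rounding and clipping, as required by the standard.

    /* Minimal C model of the interpolation formulas in Fig. 4.4 */
    static int clip255(int x) { return x < 0 ? 0 : (x > 255 ? 255 : x); }

    /* Half_interpolate: the six-tap filter kernel itself */
    static int half6(int x, int y, int z, int u, int v, int w)
    {
        return x - 5 * y + 20 * z + 20 * u - 5 * v + w;
    }

    /* A half-pixel next to integer pixels, e.g.
       h = half_pixel(A, C, G, M, R, T) per rule 1 above */
    int half_pixel(int p0, int p1, int p2, int p3, int p4, int p5)
    {
        return clip255((half6(p0, p1, p2, p3, p4, p5) + 16) >> 5);
    }

    /* The center half-pixel j: filter six intermediate six-tap sums
       (unclipped, unshifted), then round and shift by 10 */
    int half_pixel_j(int s0, int s1, int s2, int s3, int s4, int s5)
    {
        return clip255((half6(s0, s1, s2, s3, s4, s5) + 512) >> 10);
    }

    /* Quarter_interpolate: two-tap averaging per Table 4.2,
       e.g. a = quarter_pixel(G, b) */
    int quarter_pixel(int x, int y) { return (x + y + 1) >> 1; }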
4.1.2 Design Considerations
In the two-stage refinement, FME interpolates fractional pixels and generates a cost for each search position. Therefore, fractional-pixel interpolation and SATD generation are the most critical parts of an FME design. How to design efficient interpolators and SATD generators, and how to schedule them for high utilization, are the key challenges of FME VLSI implementation.
4.2 Related Works
The two-stage refinement algorithm in the reference software is efficient in terms of video quality and computational complexity. This algorithm is also commonly adopted in FME VLSI implementations. In this section, we describe some representative FME architectures that employ the two-stage refinement algorithm. Chen et al. [10] proposed an architecture with nine 4 × 4 processing units (PUs), as shown in Fig. 4.5. In each refinement stage, nine candidates around the refinement center are evaluated simultaneously. They analyzed the FME algorithm loops and accordingly proposed two decomposing techniques to increase hardware parallelism
Fig. 4.5 Chen's FME architecture
Fig. 4.6 Yang's FME architecture
and utilization. All sizes of blocks are decomposed into 4 × 4 blocks during processing. Therefore, the hardware utilization of the fractional-pixel interpolators and SATD generators is 100%. However, decomposing large blocks into 4 × 4 blocks brings redundant fractional-pixel interpolation. This design can process SDTV (720 × 480) video when running at 100 MHz. They later replicated the architecture of [10] four times to enhance the processing capability from SDTV to HD720p (1,280 × 720) [34]. However, that design supports only the 16 × 16 and 8 × 8 modes. It also suffers from low utilization of its SATD generators. Yang, Goto, and Ikenaga [18] presented an FME architecture with nine 16 × 16 processing units, as shown in Fig. 4.6. This design adopts a short-latency 16-pixel-
wide interpolator to increase throughput and eliminate redundant interpolation. However, the fourfold increase in data-width yields only a 1.67× throughput gain. Moreover, all sizes of blocks are processed by 16 × 16 processing units, so the hardware utilization is low when processing small blocks (4 × 4 and 4 × 8). This design can process HD1080p (1,920 × 1,088) video when running at 200 MHz. There are also some designs adopting fast heuristics to reduce computational complexity or hardware cost. Kuo et al. [35] proposed a single-iteration FME that searches only six points. This design saves 20% of gate count and provides 40% throughput improvement over Chen et al.'s architecture [10]. Wang et al. [73] proposed a two-step search FME with early termination that takes advantage of the cost correlation between neighboring subpixel positions. This design saves 40% of area cost and 14% of computational complexity compared to that of Chen et al. [10]. Su et al. [63] proposed a mode-reduction FME. Small blocks (8 × 4, 4 × 8, and 4 × 4) are evaluated only when the cost of the 16 × 16 block is larger than the summation of the four 8 × 8 blocks. Therefore, 30% of hardware cost is saved. Kao et al. [25] proposed an FME architecture that employs a mathematical model to estimate the SATD of subpixel positions, making fractional-pixel interpolation entirely unnecessary. Therefore, design throughput increases significantly with ultra-low computational complexity. These low-complexity designs reduce search time and hardware cost, but at the expense of video-quality degradation and bit-rate increase.
4.3 A VLSI Design for Fractional Motion Estimation
This section is organized as follows. In Sect. 4.3.1, we propose a three-engine FME architecture for real-time processing of high-definition video [28, 77]. In FME, however, the computational complexity of interpolation is much higher than that of SATD generation. For this reason, most FME designs suffer from low utilization of the SATD generator. In Sect. 4.3.2, we propose a resource-sharing scheme for the SATD generator. This scheme increases hardware utilization and saves 33% of the hardware cost for SATD computation.
4.3.1 Proposed Architecture
For high video quality and high compression efficiency, we employ the search algorithm of the reference software, as shown in Fig. 4.2. Figure 4.7 depicts the block diagram of our proposed FME. It consists of five parts: the input-data scheduler, the processing engines, the SATD generator arbiter, the SATD generators, and the mode selector. The input-data scheduler supplies current and reference block pixels to three processing engines. We dispatch the processing of the 41 subblocks of a macroblock to the three processing engines in a load-balancing way: Engine 1 for 8 × 8 and 4 × 4 subblocks, Engine 2 for 8 × 4 and 4 × 8 subblocks, and Engine 3 for
Fig. 4.7 Block diagram of the proposed FME
16 × 16, 16 × 8, and 8 × 16 subblocks. Each engine performs subpixel interpolation, generates the MV cost, and decides the best candidate. We separate the SATD generators from the processing engines and propose an arbitration scheme in Sect. 4.3.2 to increase their utilization. After finishing SATD generation, the three engines evaluate the cost function and output results to the mode selector. Finally, the mode selector outputs the final mode, FMVs, and cost. Figure 4.8 depicts the top-level timing plan of the three engines for processing one MB. Figure 4.9 shows the block diagram of Engine 1 and Engine 2. Each engine has a half-pixel interpolator, a quarter-pixel interpolator, an SATD accumulator, and an MV cost generator. There are also two FIFOs to store subpixels for SATD generation. Both Engine 1 and Engine 2 are two-stage pipelined to increase throughput. The first stage includes half-pixel interpolation and SATD generation. Because the minimum width of the blocks processed in Engine 1 and Engine 2 is 4 pixels, a 4-pixel-wide half-pixel interpolator is adopted for high hardware utilization. Figure 4.10 depicts the architecture of our half-pixel interpolator, which consists of 5 horizontal filters, 11 vertical filters, and 67 shift registers. The five horizontal filters perform interpolation for ten input horizontal integer pixels. Then, the filtered pixels and six integer pixels (a–f) are routed to shift registers. The vertical filters perform interpolation for pixels in the same column of the shift registers. Both horizontal and vertical filters are two-stage pipelined to shorten the critical path, as shown in Fig. 4.11. The second stage of both Engine 1 and Engine 2 is the most timing critical. Therefore, we further divide this stage into five substages to achieve higher through-
Fig. 4.8 Timing diagram of three engines for processing one MB
Fig. 4.9 Block diagram of Engine 1 and Engine 2
put and higher hardware utilization. These substages are (1) compute the motion vector predictor (MVP); (2) read the half-pixel SATD, compute the half-MV cost for nine candidates, and decide the best half candidate; (3) interpolate quarter-pixels using two-tap filters; (4) compute the quarter-pixel SATD; and (5) read the quarter-pixel SATD, compute the quarter-MV cost for nine candidates, and decide the best quarter candidate. FME of some subblocks must be performed sequentially due to dependencies in MV cost generation, such as the four 4 × 4 blocks in the same 8 × 8 partition. This dependency may lead to low hardware utilization. Therefore, we interleave the
Fig. 4.10 The proposed half-pixel interpolator

6-tap_Interpolate(a, b, c, d, e, f) = a − 5b + 20c + 20d − 5e + f
  = (a + f) − 5(b + e) + 20(c + d)
  = (a + f) − 5[(b + e) − 4(c + d)]
  = (a + f) − [(b + e) − 4(c + d)] − 4[(b + e) − 4(c + d)]

Fig. 4.11 The proposed two-stage-pipelined six-tap filter
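The decomposition in Fig. 4.11 removes the constant multiplications by 5 and 20 in favor of shifts and adds, split across two pipeline stages. A C model of the same arithmetic is given below (the stage boundary is marked by a comment; in hardware it is a pipeline register, and the function name is ours):

    /* Multiplier-free six-tap filter following Fig. 4.11 */
    int sixtap_two_stage(int a, int b, int c, int d, int e, int f)
    {
        /* stage 1: shared subexpressions */
        int af = a + f;
        int t  = (b + e) - ((c + d) << 2);   /* (b+e) - 4(c+d) */
        /* stage 2: af - t - 4t = (a+f) - 5(b+e) + 20(c+d) */
        return af - t - (t << 2);
    }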
processing of an 8 × 8 block with that of four 4 × 4 blocks in Engine 1, and the processing of two 4 × 8 blocks with that of two 8 × 4 blocks in Engine 2. Figure 4.12 shows the timing diagram of the five interleaved substages of the second stage in both Engine 1 and Engine 2. Figure 4.13 shows the block diagram of Engine 3. Engine 3 is not the critical part of our design. Therefore, Engine 3 performs half refinement and quarter refinement sequentially for 16 × 16, 16 × 8, and 8 × 16 blocks. To meet the performance target, Engine 3 consists of 2 six-tap and 2 two-tap filters divided into two
Fig. 4.12 Timing diagram of interleaved five substages of the second stage in (a) Engine 1 and (b) Engine 2
Fig. 4.13 Block diagram of Engine 3
groups: Interpolator bank 0 and Interpolator bank 1. Each Interpolator bank consists of 1 six-tap filter and 1 two-tap filter.
4.3.2 Proposed Resource Sharing Method for SATD Generator
In the FME process, the computational complexity of interpolation is much higher than that of SATD generation. Without appropriate hardware resource allocation, most FME designs suffer from low utilization of the SATD generator. Therefore, we propose a scheme for sharing the SATD generators to increase hardware utilization and reduce total cost.
4.3.2.1 Analysis of SATD Generator Usage
For SATD generation, we decompose all types of blocks into 4 × 4 blocks. A 4 × 4 Processing Unit (PU) is responsible for one 4 × 4 block. Each PU implements the 2D Hadamard transform using two 4-pixel-wide 1D transforms and a transpose array. Because both half refinement and quarter refinement choose the best candidate out of nine (one center point and eight neighboring points), our SATD generator is composed of nine 4 × 4 PUs. Figure 4.14 shows the architecture of our PU and SATD generator. To increase hardware utilization, we analyze the complexity of SATD generation. During the FME process, there are two refinements for each subblock, and each refinement compares nine candidates. Moreover, there are seven block sizes (16 × 16 down to 4 × 4), each covering 256 pixels in total. Therefore, there are 2 × 9 × 7 × 256 = 32,256 pixels to be processed. Our SATD generator (nine 4-pixel-wide PUs) can process 9 × 4 = 36 pixels per cycle and, hence, can finish 32,256
Fig. 4.14 Proposed (a) 4 × 4 PU and (b) SATD generator
pixels in 896 cycles. To meet the performance constraint (HD1080p at 150 MHz), our FME needs only two SATD generators instead of one per engine. Consequently, 33% of the hardware is saved.
4.3.2.2 Customized Arbitration Scheme
Although using only two SATD generators reduces hardware cost and increases hardware utilization, we need a strategy to serve three engines efficiently with two SATD generators. The processing time of each of the three engines varies because the required reference pixels may be stored in different memory banks. Therefore, we cannot apply a fixed plan for resource allocation. Instead, we utilize a bus protocol and model all interpolators as masters and the SATD generators as slaves, as shown in Fig. 4.15. There are six masters: two interpolators in Engine 1, two interpolators in Engine 2, and two interpolator banks in Engine 3. We issue all requests from Engine 1 to SATD generator 1, all requests from Engine 2 to SATD generator 2, and requests from Engine 3 to both SATD generator 1 and SATD generator 2. Moreover, we classify the requests from
Fig. 4.15 The proposed arbiter
the six masters into two groups: real-time and nonreal-time requests. Engine 1 and Engine 2 are the most critical parts of our design, and pipeline stalls in Engine 1 and Engine 2 may degrade the overall performance. Hence, requests from Engine 1 and Engine 2 are treated as real-time requests with high priority, while requests from Engine 3 are treated as nonreal-time requests with low priority. To design the arbitration scheme between six masters and two slaves, we must account for some special features of our three-engine architecture. First, the half refinement and quarter refinement in Engine 1 and Engine 2 are in different pipeline stages, and they may issue requests almost in close succession. This condition leads to temporal locality of requests and, hence, a Hot Area in the timing diagram, as shown in Fig. 4.16a. Second, SATD generation of a 4 × 4 block must be finished without interruption, or it must be recomputed. Third, the input-data scheduler transfers data to different engines sequentially, so the activation times of the pipeline stages in Engine 1 and Engine 2 differ. Therefore, the hot areas of Engine 1 and Engine 2 are staggered, as shown in Fig. 4.16b. For the first feature, we give real-time masters in the hot area the highest priority and tag them with a Hot Area Flag. For the second feature, we make a restriction that SATD generation of a 4 × 4 block cannot be interrupted; neither can an almost-completed master. These two limits prevent a nonreal-time master from starvation. For the third feature, we define three arbitration modes, as depicted in Fig. 4.17. If Engine 2 has no requests, we reserve SATD generator 1 for Engine 1 and SATD generator 2 for Engine 3. If Engine 1 has no requests, we reserve SATD generator 2 for Engine 2 and SATD generator 1 for Engine 3. If both Engine 1 and Engine 2 have requests, we reserve SATD generator 1 for Engine 1 and SATD generator 2 for Engine 2. In this case, Engine 3 should make
Fig. 4.16 (a) An example of a hot area in the timing diagram. (b) The staggered hot areas in the timing diagrams of Engine 1 and Engine 2
Fig. 4.17 Three arbitration modes: (a) mode 1, (b) mode 2, (c) mode 3 (mode 3 combines constant-priority and round-robin arbitration)
requests to both SATD generators 1 and 2. In summary, our customized arbitration scheme consists of three steps (a behavioral sketch in C follows the list):
1. Each real-time master in a hot area raises a hot area flag.
2. The arbiter determines the mode according to the hot area flags and the two anti-starvation restrictions.
3. The arbiter allocates the SATD generators to the masters according to the mode.
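The fragment below is a behavioral sketch of this three-step scheme, not the actual RTL arbiter; the request/flag encoding, the struct, and all names are our own simplification.

    /* Behavioral model of the customized arbitration scheme.
       hot1/hot2: hot-area flags of the real-time Engines 1/2;
       req3: Engine 3 has a pending nonreal-time request;
       busy[k]: generator k+1 is inside an uninterruptible 4x4 SATD. */
    typedef struct { int hot1, hot2, req3, busy[2]; } arb_state;

    /* grant[k]: engine (1..3) served by SATD generator k+1,
       0 = idle, -1 = keep current owner until its 4x4 SATD finishes */
    void arbitrate(const arb_state *s, int grant[2])
    {
        for (int k = 0; k < 2; k++)
            grant[k] = s->busy[k] ? -1 : 0;

        if (s->hot1 && !s->hot2) {              /* mode 1 */
            if (grant[0] == 0) grant[0] = 1;
            if (grant[1] == 0 && s->req3) grant[1] = 3;
        } else if (s->hot2 && !s->hot1) {       /* mode 2 */
            if (grant[1] == 0) grant[1] = 2;
            if (grant[0] == 0 && s->req3) grant[0] = 3;
        } else if (s->hot1 && s->hot2) {        /* mode 3 */
            if (grant[0] == 0) grant[0] = 1;
            if (grant[1] == 0) grant[1] = 2;
        } else if (s->req3) {                   /* no real-time work */
            for (int k = 0; k < 2; k++)
                if (grant[k] == 0) grant[k] = 3;
        }
    }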
Table 4.3 Implementation result of the proposed FME
Technology    TSMC 0.13 μm
Gate count    321K
Frequency     154 MHz
Resolution    HD1080p
Cycles/MB     631
4.3.3 Evaluation
We have implemented the proposed FME in Verilog HDL. Synthesized toward a TSMC 130-nm CMOS cell library, our design takes 9.72 KB of local memory and 321K logic gates, as shown in Table 4.3. We use four 1,920 × 1,088 video sequences for RTL simulation. The results show that the total cycle count is sequence independent and that the proposed FME consumes 631 cycles per MB on average (the maximum is 644 cycles and the minimum is 628, including 9 cycles for mode decision). Therefore, our design can encode 1,920 × 1,088 30-fps video when running at 154 MHz.
4.4 Summary
VBSME with quarter-pixel accuracy is one of the major contributors to H.264/AVC's outstanding compression efficiency and video quality. Due to its high computational complexity, however, VBSME needs a hardware accelerator for real-time or high-resolution applications. In this chapter, we have proposed a three-engine architecture for H.264/AVC FME. Our efficient task scheduling and two-stage pipelined engines enhance the interpolator throughput. We have also proposed a customized arbitration scheme for resource sharing that saves 33% of the hardware cost for SATD generation. Synthesized into a TSMC 130-nm cell library, the total gate count of our design is 321K, and the local memory size is 9.72 KB. Experimental results show that when running at 154 MHz, our FME design can process one MB in 631 cycles and is capable of supporting 1,920 × 1,088 30-fps video.
Chapter 5
Motion Compensation
Abstract Following integer and fractional motion estimation, motion compensation (MC) is the third stage in H.264/AVC interframe prediction (P or B frame). After the motion estimator finds motion vectors and related information for each current macroblock (MB), the motion compensator generates compensated MBs from reference frames. Due to the quarter-pixel-precision and variable-block-size motion estimation supported in H.264, motion compensation also needs to generate half- or quarter-pixels for MB compensation. Therefore, motion compensation also has high computational complexity and dominates the data traffic on DRAM. Current VLSI designs for MC usually focus on reducing memory traffic or increasing interpolator throughput. In this chapter, we will introduce several key points of VLSI implementation for motion compensation.
5.1 Introduction
H.264/Advanced Video Coding (AVC) is the latest video coding standard jointly developed by the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). By adopting several advanced features, H.264 outperforms previous standards by up to 50% in bit-rate reduction. However, these features also lead to intensive computation and memory traffic, especially in high-resolution applications. When encoding a P or B frame, motion compensation uses the motion vector difference (MVD) to generate the motion vector (MV) and uses the MV to compensate a macroblock (MB). Since it requires a large amount of computation and results in high memory traffic, a hardware accelerator is essential for its integration into a pure hardwired encoder.
5.1.1 Algorithms
A motion compensator (MC) takes the MVD and some other information, such as the MB block type and reference index, as its input and outputs a compensated MB in two major steps: (1) generate the MV and (2) generate the compensated MB.
Fig. 5.1 Motion compensation overview
Fig. 5.2 Neighboring blocks: (a) case of using "C"; (b) case of using "D"
In Step 1, MC generates the MV predictor (MVP) and then adds the MVP to the MVD to get the MV. Since the MVD is provided by the fractional motion estimation (FME), the main function in Step 1 is MVP generation. In Step 2, MC generates the compensated MB by finding and combining several variable-sized blocks in reference frames, as shown in Fig. 5.1. When the MV points to an integer position, the compensated MB is directly copied from the reference blocks. In the case of a fractional MV, the MC performs interpolation on the integer pixels to get the compensated pixels up to quarter-pixel accuracy. To generate the MVP, the MVs of previously encoded neighboring blocks must be found first. One of them will be selected as the resultant MVP. In general, there are three neighboring blocks: block A, block B, and block C or D, as shown in Fig. 5.2a, where block A is the left block, block B is the top block, block C is the top-right block, and block D is the top-left block. Normally, A, B, and C are selected as the three neighboring blocks, but if block C is not available, block D replaces it, as shown in Fig. 5.2b. The selection among the MVs of these three neighboring blocks depends on the reference index and the block partition mode; for a detailed description of the algorithm, please refer to the H.264/AVC standard.
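In the common case (all three neighbors available and no special partition or reference-index rule applying), the MVP is the component-wise median of the three neighboring MVs. A simplified C sketch under that assumption follows; the names are ours, and the standard's special cases are omitted.

    typedef struct { int x, y; } mv_t;

    static int median3(int a, int b, int c)
    {
        /* median = a + b + c - max - min */
        int mx = a > b ? (a > c ? a : c) : (b > c ? b : c);
        int mn = a < b ? (a < c ? a : c) : (b < c ? b : c);
        return a + b + c - mx - mn;
    }

    /* Common-case MVP: component-wise median of neighbors A, B, C
       (D substituted for C by the caller when C is unavailable) */
    mv_t mvp_median(mv_t A, mv_t B, mv_t C)
    {
        mv_t p = { median3(A.x, B.x, C.x), median3(A.y, B.y, C.y) };
        return p;
    }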
The compensation comprises interpolation and weighted prediction. The interpolation of luminance pixels is based on the six-tap filter defined in (5.1) and is the same as that in FME.

InterpL(A, B, C, D, E, F) = A − 5B + 20C + 20D − 5E + F.   (5.1)
The interpolation of chrominance pixels is based on (5.2), where xFrac and yFrac are the last three bits of MVx and MVy, respectively.

InterpC(A, B, C, D) = A × (8 − xFrac) × (8 − yFrac) + B × xFrac × (8 − yFrac)
                    + C × (8 − xFrac) × yFrac + D × xFrac × yFrac.   (5.2)

After interpolation, MC applies weighted prediction to the interpolated blocks. In certain parameter-specified situations, we apply the default weighted prediction (5.3), which just linearly combines blocks from list 0 and list 1. Otherwise, the weighted prediction of monoprediction [(5.4) and (5.5)] or biprediction (5.6) is applied, where w, logWD, and offset are all parameters.

R[x, y] = (RL0[x, y] + RL1[x, y] + 1) >> 1,   (5.3)
R[x, y] = ((RL0[x, y] × w0 + 2^(logWD−1)) >> logWD) + offset0,   (5.4)
R[x, y] = ((RL1[x, y] × w1 + 2^(logWD−1)) >> logWD) + offset1,   (5.5)
R[x, y] = ((RL0[x, y] × w0 + RL1[x, y] × w1 + 2^logWD) >> (logWD + 1))
        + ((offset0 + offset1 + 1) >> 1).   (5.6)
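A C model of (5.3)–(5.6) for one sample is shown below; the names are ours, final clipping to [0, 255] is omitted, and (5.4)/(5.5) assume logWD ≥ 1 (the standard handles logWD = 0 separately).

    /* Weighted prediction per (5.3)-(5.6); rl0/rl1 are interpolated
       samples from reference list 0 / list 1 */
    int wp_default_bi(int rl0, int rl1)                 /* (5.3) */
    { return (rl0 + rl1 + 1) >> 1; }

    int wp_mono(int rl, int w, int logWD, int offset)   /* (5.4)/(5.5) */
    { return ((rl * w + (1 << (logWD - 1))) >> logWD) + offset; }

    int wp_bi(int rl0, int rl1, int w0, int w1,
              int logWD, int off0, int off1)            /* (5.6) */
    {
        return ((rl0 * w0 + rl1 * w1 + (1 << logWD)) >> (logWD + 1))
               + ((off0 + off1 + 1) >> 1);
    }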
5.1.2 Design Considerations
Motion compensation has high computational complexity and high memory traffic. For the computational part, the design challenge is to increase the hardware utilization of the interpolator under the throughput constraint. For the memory access part, the design challenge is to minimize the total amount of required data and the DRAM access penalty. How to design efficient interpolators and reduce memory traffic are the key points of MC VLSI implementation.
5.2 Related Works
Several hardware architectures for motion compensation have been proposed in recent years. In this section, we classify them based on two major concerns: memory traffic reduction and interpolation engine.
5.2.1 Memory Traffic Reduction
We classify memory traffic reduction schemes into two types according to whether they change the MC access pattern. Schemes that do not change the MC access pattern include memory mapping optimization and arbitration policy design. MB-based memory mapping [65, 66] is useful for block-based video decoding because it induces no bank conflict or row-miss within an MB. However, as the MC complexity grows, MB cross-over becomes more frequent and thus dominates the DRAM access latency. Pixel duplication and L-C correlated mapping [82] have therefore been proposed to reduce bank conflicts between neighboring MBs. To further reduce system-level access latency, an arbitration-based access scheduling technique [66] has been proposed to alleviate bank conflicts between different functional modules. Other memory traffic reduction schemes change the MC access pattern in different ways. Reference frame data reuse [23, 56, 67] is a popular method in many hardwired MC designs. It reduces redundant DRAM access by employing a register file to hold the data shared by neighboring 4 × 4 blocks of a divided block. To further reduce redundant DRAM access, a cache-based reuse scheme [44] (which we call MB-level unification) has been proposed to eliminate all redundant access during the decoding of a whole MB by translating the scattered access pattern into a unified one.
5.2.2 Interpolation Engine
We classify interpolation engine architectures into three types: 1D based [69, 70], 2D based [37], and separate 1D based [23, 67, 75]. The first two are either inefficient or costly, while the separate 1D-based type is both efficient and low cost. In the following, we analyze the 1D and separate 1D architectures and survey some previous works on cost reduction of the luminance six-tap filter. The single-filter approach with a six-bank memory subsystem [69, 70] achieves full utilization of the filter. By carefully manipulating the local memory format, the luma filter can fetch six neighboring pixels at once, but its throughput is not sufficient when the filtering operations are dense. A separate 1D-based interpolation engine with 13 luma filters and several shift registers for data reuse [23, 67, 75] gives a very high performance/cost ratio. When interpolating a block, it takes a row of nine reference pixels and produces a row of four interpolated pixels every cycle. By utilizing the shift registers, neither redundant computation nor redundant reference pixel fetching is performed when interpolating a block of 4-pixel width. For cases where the input latency is greater than 1 cycle due to bandwidth limitations, a separate 1D filter with shared luma filters [56] has been proposed to improve utilization. It employs only six luma filters to handle blocks of up to 8-pixel width. By assuming that the latency of a 13-pixel input is exactly 4 cycles, all luma filters can be fully utilized. A luma filter needs only six adders [83] to achieve 1-cycle latency, which is insignificant in today's process technology. However, a chroma filter is much
more complicated since it consists of several multipliers. Therefore, a combined luma/chroma filter [23] has been proposed to reduce the area cost of the chroma filter with small interconnection and multiplexer overhead.
5.3 A VLSI Design for Motion Compensation
In this section, we present our motion compensation system (MCS), as shown in Fig. 5.3 [13, 14]. It consists of three major components: the Motion Vector Generator (MVG), the Memory Fetch Unit (MFU), and the Interpolator. The MVG produces MVs and the reference index. The MFU brings in reference pixels. The Interpolator reads and compensates reference pixels into compensated pixels and then computes residual pixels by subtracting the compensated pixels from the current pixels. All of these functions are parallelized. We further describe the MVG in Sect. 5.3.1 and the Interpolator in Sect. 5.3.2.
5.3.1 Motion Vector Generator
Figure 5.4 shows the block diagram of the proposed MVG. The main function of the MVG is to generate the MV by adding together the MV predictor (MVP) and the MV difference (MVD), which comes from FME. MVP generation has two major steps: neighboring
Fig. 5.3 Block diagram of the proposed motion compensation system
Fig. 5.4 Block diagram of the proposed motion vector generator
Fig. 5.5 Motion vectors saved for MVP generation
MV determination and MV selection. For the former, we propose a neighboring MV address generator that checks the availability of each neighboring block and produces the address of the neighboring MV. For the latter, an MV selector reads the neighboring MVs and selects one of them as the resultant MVP. In the MVG, a whole row of top neighboring MVs (the top row) must be stored in memory. We use several local memories and registers to save the currently needed MVs, as shown in Fig. 5.5. The following steps exchange data between the registers and the local memory. Before the MVG starts, the top-row MVs are fetched into a register. After the MVG finishes, the current bottom-row MVs are stored into the local Neighbor MV Memory to serve as the top-row MVs in the next MB row. The execution time of the MVG varies with the MB partition type. Because the MVG adopts a variable-size subblock-level pipeline, saving different numbers of MVs takes different numbers of cycles. Although the MVG spends more cycles to encode a larger subblock, the worst case occurs when the MB consists of sixteen 4 × 4 blocks. In all, the MVG needs 235 cycles in the worst case and 120 cycles in the average case.
5.3.2 Interpolator
The interpolator module performs half- and quarter-pixel interpolation, weighted prediction, and pixel reconstruction. Interpolation is the most computation-intensive part of the motion compensator. Therefore, a highly parallelized hardware architecture is essential for encoding HDTV (1,920 × 1,080) video. Our proposed interpolator adopts a dual interpolation engine scheme that employs two separate 1D interpolation engines to handle two 4 × 4 blocks simultaneously. These two engines are fully utilized in both biprediction and monoprediction interpolation. Figure 5.6 depicts the block diagram of our proposed interpolator, which consists of two different pipelined structures. In the top two modules (the Base Address Look-up Unit and the Static Synchronized Address Generator), we adopt a multiple-cycle pipelined architecture to process a 4 × 4 block. In the other modules, we adopt a single-cycle pipelined architecture to handle one row of data at a time because the interpolation engine produces data row-by-row at uncertain times. Before the interpolation engines start, the Base Address Look-up Unit generates the current reference
Fig. 5.6 Block diagram of the proposed interpolator
region and translates it into a base address. In the next pipeline stage, the Static Synchronized Address Generator sends all reference block addresses. Finally, through the computation of the interpolation engines and the weighted prediction engine, the compensated data is generated.
5.3.2.1 Interpolation Engine
To improve interpolation performance, our proposed interpolation engine uses four chroma filters to interpolate Cb and Cr data simultaneously. Ideally, it takes at most 9 (luma) + 3 (chroma) = 12 cycles to interpolate a 4 × 4 block. At the MB level, the proposed architecture needs 11 cycles of pipeline delay. Therefore, the longest latency for an MB in a B slice is 14 × 16 + 11 = 235 cycles. In the case of a P slice or a monopredicted B slice, the longest latency is 14 × 8 + 11 = 123 cycles. Consequently, if the incidence of biprediction is less than 70%, the 200-cycle constraint on computation can easily be met. In fact, our experimental data shows that the incidence of biprediction is usually less than 30%. To reduce the hardware cost, we apply three modifications to the proposed interpolation engine. First, since we can reduce the interpolation formula as suggested in [69, 70], an 8-bit register is used in the half-pixel shift register file instead of a 13-bit one, as shown in Fig. 5.7. Second, by switching the two horizontal neighboring pixels, we save six vertical registers and one filter compared to [67]. Third, we
Fig. 5.7 Interpolation engine architecture
Fig. 5.8 Six-tap luma filter
propose a low-cost chroma filter to further reduce the hardware cost. Moreover, the luma filter needs only six adders. We divide it into two stages as shown in Fig. 5.8 to meet the frequency constraint.
5.3.2.2 Area-Efficient Chroma Filter
Since the eight chroma filters dominate the hardware cost, we propose a low-cost chroma filter with 75% fewer gates. It consists of two multiply-adders (MAs) and two pipeline registers, as shown in Fig. 5.9. Since we can rewrite the chroma filtering (5.2) into (5.7)–(5.9), two multiply-adders are sufficient to handle the multiplications-additions in both the horizontal and vertical directions. Consider the multiplication-addition case in which the two multiplicators are N-bit numbers with values K and 2^N − K. We can reduce the problem to an (N + 1)-number addition problem. In the case of the H.264 standard, where N = 3, we need just three adders to build a multiply-adder because the rightmost 1s of these two numbers are located in the same bit, and the other, higher bits have at most two 1s in total. Eventually, the proposed low-cost chroma filter contains only six adders.

InterpC(A, B, C, D) = R = Rab × (8 − yFrac) + Rcd × yFrac,   (5.7)
Rab = A × (8 − xFrac) + B × xFrac,   (5.8)
Rcd = C × (8 − xFrac) + D × xFrac.   (5.9)
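A C model of the decomposition (5.7)–(5.9): a single multiply-adder primitive evaluates p0 × (8 − frac) + p1 × frac and is reused for the horizontal and vertical passes. The final rounding and 6-bit shift follow the standard's chroma interpolation; the helper names are ours.

    /* Multiply-adder of the low-cost chroma filter: p0*(8-frac) + p1*frac
       with frac in [0,7]. In hardware this is built from three adders;
       here we let the compiler generate it. */
    static int ma(int p0, int p1, int frac)
    { return p0 * (8 - frac) + p1 * frac; }

    /* Chroma interpolation via (5.7)-(5.9): two horizontal MA passes,
       one vertical MA pass, then the standard rounding and shift */
    int interp_chroma(int A, int B, int C, int D, int xFrac, int yFrac)
    {
        int Rab = ma(A, B, xFrac);                 /* (5.8) */
        int Rcd = ma(C, D, xFrac);                 /* (5.9) */
        return (ma(Rab, Rcd, yFrac) + 32) >> 6;    /* (5.7) + rounding */
    }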
5.3.2.3 Fully Utilized Weighted Prediction Engine
A traditional weighted prediction engine consists of eight multipliers and costs 11.4K logic gates. By analyzing its utilization, we find that it is busy for merely 6 (4 luma + 2 chroma) cycles per 4 × 4 block. Therefore, we propose a fully utilized weighted prediction engine with only four multipliers. Each multiplier and some adders constitute a processing element, as shown in Fig. 5.10. It produces a pixel
Fig. 5.9 The proposed chroma filter
Fig. 5.10 Architecture of the weighted prediction engine
in 2 cycles and consumes 12 cycles per 4 × 4 block. However, it compromises the best-case latency of the Static Synchronized Address Generator. We can make this trade-off because it does not affect the worst-case latency. Eventually, the complete interpolator takes 12–14 cycles in every stage. Since the weighted prediction engine needs to get the list 0 and list 1 data alternately, we employ a switching FIFO. Two rows of pixels are stored into the switching FIFO at the same time and alternately output to the weighted prediction engine. Empirically, the switching FIFO needs only three entries.
Table 5.1 Implementation result of the proposed MC
Technology       TSMC 0.13 μm
Gate count       73K
Frequency        50 MHz
Resolution       HD1080p
Avg. cycles/MB   155
Fig. 5.11 Simulation result of the whole MCS (total cycles per MB for the Tractor, Bluesky, Pedestrian, Station, Rushhour, and Sunflower sequences, ranging from 146 to 162)
5.3.3 Evaluation
The proposed MCS has been implemented in synthesizable Verilog RTL code. Targeted toward a TSMC 0.13-μm cell library at 200 MHz, the implementation result is shown in Table 5.1. To observe the actual MCS behavior, we integrate all three MCS components with a cyclic-queued pipeline memory model and then integrate the MCS with a 128-bit DRAM model. The input data, such as prediction information and reference frames, are generated by the reference software beforehand, and the output data is compared with the golden pattern generated from the reference software. At the system level, we measure the total cycles of the whole MCS, as shown in Fig. 5.11. The resultant cycle counts of every sequence meet the 200-cycle constraint.
5.4 Summary
We have proposed an H.264/AVC MCS for real-time encoding of HDTV video. Our proposed dual interpolation engine scheme halves the latency in both P and B slices. Experiments show that the proposed design can, on average, encode an MB with four reference frames in 150 cycles. The whole system has to run at only 50 MHz for real-time encoding of HDTV at 30 fps. For future applications with higher resolutions, we can extend the unification range to multiple MB rows and employ more interpolation engines and multiple MC modules to further reduce computation and DRAM access latency.
Chapter 6
Transform Coding
Abstract In H.264/AVC, both the transform and quantization units consist of forward and inverse parts. Residuals are transformed into frequency-domain coefficients in the forward transform unit and quantized in the forward quantization unit to reduce insignificant data for bit-rate saving. To generate reconstructed pixels for the intra prediction unit and reference frames for the motion estimation unit, quantized coefficients are rescaled in the inverse quantization unit and transformed back to residuals in the inverse transform unit. There are three kinds of transform used in H.264/AVC: the 4 × 4 integer discrete cosine transform, the 2 × 2 Hadamard transform, and the 4 × 4 Hadamard transform. Designing an area-efficient architecture is the main design challenge. We present a VLSI implementation of transform coding in this chapter.
6.1 Introduction
Unlike previous standards such as H.263 or MPEG-4, the H.264/AVC standard employs a new transform coding algorithm. For the baseline, main, and extended profiles, it introduces a 4 × 4 integer transform modified from the traditional discrete cosine transform (DCT). The modified forward and inverse DCT can be carried out using only shifters and adders, avoiding the mismatch between encoder and decoder sides caused by floating-point arithmetic. It also provides divider-free forward and inverse quantization algorithms to reduce computational complexity. We describe the detailed algorithm of transform coding in Sect. 6.1.1 and address some design considerations in Sect. 6.1.2.
6.1.1 Algorithms
The motion estimation or intra prediction module in a video encoder finds a macroblock that is similar to the current one from reference or present frames. However,
Fig. 6.1 Block diagram of the transform coding unit
the found macroblock usually does not perfectly match the current one. The differences are called residuals (or prediction errors). The transform coding unit takes residuals as input and outputs compressed data to an entropy encoder. Figure 6.1 illustrates the block diagram of the transform coding unit in H.264/AVC. Along the encoding path, the transform (Trans) unit, the quantization (Quan) unit, and the run level coding (RLC) unit are used to reduce the insignificant spatial information of a video sequence. The residuals are transformed and quantized into a set of nonzero coefficients. On the inverse (or reconstruction) path, quantized coefficients are rescaled in the inverse quantization (InvQuan) unit and transformed back into residuals in the inverse transform (InvTrans) unit. Finally, the transformed residuals are combined with prediction pixels to reconstruct a macroblock. There are three blocks within a macroblock: one 16 × 16 luma block and two 8 × 8 chroma blocks (Cb and Cr). A 16 × 16 luma block can be divided into sixteen 4 × 4 blocks, and an 8 × 8 chroma block can be divided into four 4 × 4 blocks. The transform coding unit encodes a macroblock one 4 × 4 residual block at a time. It also needs to encode the DC values of the two chroma blocks; whether to encode the DC values of a 16 × 16 luma block depends on its prediction mode. The processing order and the output order of residual blocks are different. In Fig. 6.2, the output order is listed under the processing order. If a macroblock is encoded using one of the four intra 16 × 16 prediction modes, we call it an intra 16 × 16 macroblock; otherwise, we call it a non-intra 16 × 16 macroblock. For an intra 16 × 16 macroblock, as shown in Fig. 6.2a, the luma AC blocks 0–15 are processed first. Block 16, which contains the DC values of the sixteen 4 × 4 luma blocks, is processed later. After encoding the luma blocks, Cb AC blocks 17–20 are processed, followed by the Cb DC 2 × 2 block 21. Finally, Cr AC blocks 22–25 are processed, followed by the Cr DC 2 × 2 block 26. If the current macroblock is a non-intra 16 × 16 macroblock, we do not have to process the luma DC block, as shown in Fig. 6.2b. After the transform coding, the luma DC block is sent to the entropy encoder first if the macroblock is an intra 16 × 16 macroblock. The two chroma 2 × 2 DC blocks are also sent before the eight chroma 4 × 4 AC blocks. Figure 6.3 shows the flow chart of the transform coding unit. Because there exists a data-dependency loop in intra 4 × 4 prediction, we need to process the 16 luma 4 × 4 residual blocks of the best intra 4 × 4 modes first for an I-type macroblock. If the
Fig. 6.2 Processing order and output order of (a) intra 16 × 16 macroblock and (b) non-intra 16 × 16 macroblock
Fig. 6.3 Flow chart of the transform coding unit (I4: 4 × 4 residuals of the best luma intra 4 × 4 mode; I16: 4 × 4 residuals of the best luma intra 16 × 16 mode; PB4: 4 × 4 residuals of the best luma inter mode; T: 4 × 4 DCT; Q: quantization; IQ: inverse quantization; IT: 4 × 4 inverse DCT; HT: 4 × 4 Hadamard transform; IHT: 4 × 4 inverse Hadamard transform; R: run level coding)
macroblock is encoded by an intra 16 × 16 prediction mode, we additionally need to process the 16 luma 4 × 4 residual blocks of the best intra 16 × 16 mode and its DC block.
6.1.1.1 Transform
Transform algorithms decorrelate spatial-domain residuals into frequency-domain coefficients. Because the coefficients become more compact after being transformed, redundant information can be further removed. Taking the DCT as an example, the frequency-domain data are the weights of standard basic patterns, as shown in Fig. 6.4. Any 4 × 4 residual block can be represented as a weighted combination of the basic patterns. The energy of the data is concentrated into a small number of values by the forward transform. The insignificant coefficients are removed in the quantization unit to reduce the transmitted data size at the expense of a slight degradation in quality. We can reconstruct an approximate copy of the current block using the coefficients and the basic patterns. Three kinds of integer transform are adopted in H.264/AVC, depending on the type of residual block: the 4 × 4 Hadamard transform (HT) for the luma DC block, the 2 × 2 Hadamard transform for the chroma DC blocks, and the 4 × 4 integer DCT for all other 4 × 4 AC blocks.
Fig. 6.4 Basic theorem of 4 × 4 DCT
Equation (6.1) shows the forward integer DCT proposed in H.264/AVC. X is the input 4 × 4 residual block and Y is the matrix of transformed coefficients. C and its transpose C^T are modified DCT matrices; both contain only integers to simplify computation. E is the matrix of postscaling factors (with a = 1/2 and b = √(2/5)). The symbol ⊗ indicates that each element of W (W = CXC^T) is multiplied one-to-one by the corresponding element of matrix E. For example, Y(0,0) is obtained by multiplying W(0,0) by E(0,0), and Y(2,1) by multiplying W(2,1) by E(2,1). To further simplify the computation, the scalar multiplication is integrated into the quantization unit in H.264/AVC, so we only have to implement CXC^T in the transform engine.

Y = (CXC^T) ⊗ E,   (6.1)

where

        | 1  1  1  1 |             | a^2   ab/2   a^2   ab/2  |
    C = | 2  1 -1 -2 |   and   E = | ab/2  b^2/4  ab/2  b^2/4 |
        | 1 -1 -1  1 |             | a^2   ab/2   a^2   ab/2  |
        | 1 -2  2 -1 |             | ab/2  b^2/4  ab/2  b^2/4 |

Equations (6.2) and (6.3) show the 4 × 4 and 2 × 2 forward Hadamard transforms proposed in H.264/AVC. X_LumaDC and X_ChromaDC are the input 4 × 4 luma DC block and 2 × 2 chroma DC block, respectively. Y_LumaDC and Y_ChromaDC are the matrices of transformed coefficients.

Y_LumaDC = (H4 · X_LumaDC · H4) / 2,   (6.2)
Y_ChromaDC = H2 · X_ChromaDC · H2,   (6.3)

where

         | 1  1  1  1 |
    H4 = | 1  1 -1 -1 |   and   H2 = | 1  1 |
         | 1 -1 -1  1 |              | 1 -1 |
         | 1 -1  1 -1 |
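Because the scaling by E is deferred to the quantizer, the transform engine implements only the integer kernel W = CXC^T, which reduces to add/shift butterflies. Below is a C sketch under our own naming (not the book's RTL); the same 1-D butterfly serves both the row and the column pass.

    /* W = C * X * C^T for the 4x4 integer DCT; adds and shifts only.
       Post-scaling by E is folded into the quantizer. */
    static void dct1d(const int in[4], int out[4])
    {
        int s03 = in[0] + in[3], d03 = in[0] - in[3];
        int s12 = in[1] + in[2], d12 = in[1] - in[2];
        out[0] = s03 + s12;
        out[1] = (d03 << 1) + d12;   /* 2*d03 + d12 */
        out[2] = s03 - s12;
        out[3] = d03 - (d12 << 1);   /* d03 - 2*d12 */
    }

    void dct4x4(const int X[4][4], int W[4][4])
    {
        int tmp[4][4];
        for (int i = 0; i < 4; i++)           /* row pass: tmp = X*C^T */
            dct1d(X[i], tmp[i]);
        for (int j = 0; j < 4; j++) {         /* column pass: W = C*tmp */
            int col[4] = { tmp[0][j], tmp[1][j], tmp[2][j], tmp[3][j] };
            int res[4];
            dct1d(col, res);
            for (int i = 0; i < 4; i++)
                W[i][j] = res[i];
        }
    }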
6.1.1.2 Quantization
The H.264/AVC standard proposes a scalar quantizer. The quantization operation is modified to avoid division and floating-point arithmetic. The basic quantization operation is defined in (6.4).

Z_ij = round(Y_ij / Qstep).   (6.4)
Y_ij is the output of the transform unit. Qstep is a quantization step derived from the quantization parameter (QP). There are 52 Qstep values, as shown in Table 6.1. For every increment of six in QP, Qstep doubles. The wide range of Qstep sizes enables the encoder to control the trade-off between bit-rate and video quality.

Table 6.1 Quantization step sizes
QP     0      1       2       3      4  5      6     7      8      9     ...  18  24  30  36  42
QStep  0.625  0.6875  0.8125  0.875  1  1.125  1.25  1.375  1.625  1.75  ...  5   10  20  40  80

To integrate the postscaling matrix E of the transform unit into the quantization unit, the quantization operation is modified as (6.5) shows. W = CXC^T, and POF is equal to a^2, ab/2, or b^2/4 according to the coordinate (i, j). In the reference software, POF/Qstep is implemented using the multiplication factor LevelScale defined in Table 6.2 together with a shift-right operation to avoid division. The parameter qbits is defined in (6.6).

Z_ij = round(W_ij × POF / Qstep) = round(W_ij × LevelScale / 2^qbits),   (6.5)
qbits = 15 + floor(QP/6).   (6.6)

Table 6.2 Multiplication factor LevelScale
QP   Positions (0,0)(2,0)(2,2)(0,2)   Positions (1,1)(1,3)(3,1)(3,3)   Other positions
0    13,107                           5,243                            8,066
1    11,916                           4,660                            7,490
2    10,082                           4,194                            6,554
3    9,362                            3,647                            5,825
4    8,192                            3,355                            5,243
5    7,282                            2,893                            4,559
The quantization operation for 4 × 4 residual blocks can be simplified as (6.7) shows. The symbol >> indicates a shift-right operation. f is 2^qbits/3 for intra blocks and 2^qbits/6 for inter blocks.

|Z_ij| = (|W_ij| × LevelScale[QP%6, (i, j)] + f) >> qbits,
sign(Z_ij) = sign(W_ij).   (6.7)

For 2 × 2 and 4 × 4 DC blocks, the quantization operation is defined in (6.8).

|Z_DCij| = (|W_DCij| × LevelScale[QP%6, (0, 0)] + 2f) >> (qbits + 1),
sign(Z_DCij) = sign(W_DCij).   (6.8)
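A C model of (6.7) for a single coefficient is shown below; the levelscale argument is the Table 6.2 entry for the coefficient's position class, and all names are ours.

    #include <stdlib.h>

    /* Quantize one AC coefficient per (6.7).
       is_intra selects f = 2^qbits/3 (intra) or 2^qbits/6 (inter). */
    int quant_coeff(int w, int qp, int levelscale, int is_intra)
    {
        int qbits = 15 + qp / 6;                      /* (6.6) */
        int f = (1 << qbits) / (is_intra ? 3 : 6);
        int z = (abs(w) * levelscale + f) >> qbits;
        return w >= 0 ? z : -z;                       /* restore sign */
    }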
6.1.1.3 Inverse Quantization
The inverse quantization operation is defined in (6.9). The equation is multiplied by 64 to avoid rounding errors. It incorporates the prescaling factor Ei of the inverse transform unit by multiplying by PRF, which is equal to a^2, ab, or b^2 according to the coordinate (i, j). The outputs of the inverse transform unit are right-shifted by 6 bits to remove this scaling factor.

Y'_ij = Z_ij × Qstep × PRF × 64.   (6.9)

To simplify (6.9), (Qstep × PRF × 64) is defined as InvLevelScale in the reference software, as shown in Table 6.3. For QP greater than 5, InvLevelScale is looked up using QP%6. The modified inverse quantization operation for 4 × 4 residual blocks is shown in (6.10).

Y'_ij = (Z_ij × InvLevelScale[QP%6, (i, j)]) << (QP/6).   (6.10)

Table 6.3 Scaling factor InvLevelScale
QP   Positions (0,0)(2,0)(2,2)(0,2)   Positions (1,1)(1,3)(3,1)(3,3)   Other positions
0    10                               16                               13
1    11                               18                               14
2    13                               20                               16
3    14                               23                               18
4    16                               25                               20
5    18                               29                               23
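For comparison, a matching C model of the AC rescaling in (6.10) is given below (again with our own names; the extra factor of 64 is removed by the 6-bit right shift at the end of the inverse transform).

    /* Rescale one 4x4 AC coefficient per (6.10).
       invlevelscale = InvLevelScale[QP%6][position class], Table 6.3 */
    int dequant_coeff(int z, int qp, int invlevelscale)
    {
        return (z * invlevelscale) << (qp / 6);
    }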
For 4 × 4 DC blocks, the inverse quantization operation is defined in (6.11) and (6.12).

Y'_DCij = (Z_DCij × InvLevelScale[QP%6, (0, 0)]) << (QP/6 − 2)   if QP ≥ 12,   (6.11)
Y'_DCij = (Z_DCij × InvLevelScale[QP%6, (0, 0)] + 2^(1−QP/6)) >> (2 − QP/6)   if QP < 12.   (6.12)

For 2 × 2 DC blocks, the inverse quantization operation is defined in (6.13) and (6.14).

Y'_DCij = (Z_DCij × InvLevelScale[QP%6, (0, 0)]) << (QP/6 − 1)   if QP ≥ 6,   (6.13)
Y'_DCij = (Z_DCij × InvLevelScale[QP%6, (0, 0)]) >> 1   if QP < 6.   (6.14)

6.1.1.4 Inverse Transform
There are also three kinds of inverse transform algorithms adopted in H.264/AVC: the 4 × 4 inverse Hadamard transform (IHT) for the luma DC block, the 2 × 2 inverse Hadamard transform for the chroma DC blocks, and the 4 × 4 inverse DCT (IDCT) for all other 4 × 4 AC blocks.
Equation (6.15) shows the IDCT operation proposed in H.264/AVC. Y is the matrix of inverse-quantized coefficients. Ei is the matrix of prescaling factors and has been integrated into the inverse quantization unit.

X' = Ci^T (Y ⊗ Ei) Ci,   (6.15)

where

           | 1    1     1    1/2 |              | a^2  ab   a^2  ab  |
    Ci^T = | 1    1/2  -1   -1   |   and   Ei = | ab   b^2  ab   b^2 |
           | 1   -1/2  -1    1   |              | a^2  ab   a^2  ab  |
           | 1   -1     1   -1/2 |              | ab   b^2  ab   b^2 |

Equations (6.16) and (6.17) show the 4 × 4 and 2 × 2 inverse Hadamard transforms proposed in H.264/AVC. Y'_LumaDC and Y'_ChromaDC are the matrices of inverse-quantized coefficients, and H4 and H2 are as defined for (6.2) and (6.3).

X'_LumaDC = H4 · Y'_LumaDC · H4,   (6.16)
X'_ChromaDC = H2 · Y'_ChromaDC · H2.   (6.17)

6.1.1.5 Run Level Coding
The RLC unit performs a zig-zag scan, as shown in Fig. 6.5, to sort a 4 × 4 or a 2 × 2 coefficient matrix into a one-dimensional coefficient array. Because the transform unit concentrates the significant data in the low-frequency part of a coefficient matrix, and the insignificant data in the high-frequency part usually become zero after quantization, the RLC unit reorders the coefficients to place the low-frequency data ahead of the high-frequency data. It also calculates two parameters for the decoder: the significant coefficient flag (sig_coeff_flag) and the last significant coefficient flag (last_sig_coeff_flag). sig_coeff_flag[i] is 1 if the coefficient at scanning position [i] of the coefficient array is nonzero. last_sig_coeff_flag[i] is 1 if the coefficients are all zero from
Fig. 6.5 Zig-zag scan for (a) 2 × 2 and (b) 4 × 4 blocks
the subsequent scanning position [i + 1] to the end of the coefficient array. On the encoder side, the transform coding unit sends sig_coeff_flag, last_sig_coeff_flag, and the nonzero coefficients instead of all the coefficients to save bit-rate. The transform decoding unit uses these two parameters to reconstruct the coefficient blocks.
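The scan and the two flags can be modeled in a few lines of C; the 4 × 4 scan table below follows Fig. 6.5b, and the function names are ours.

    /* Zig-zag scan order for a 4x4 block, as row-major indices */
    static const int zigzag4x4[16] = {
        0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15
    };

    /* Reorder coefficients and derive sig_coeff_flag /
       last_sig_coeff_flag for each scanning position */
    void rlc_scan(const int blk[16], int coef[16],
                  int sig[16], int last[16])
    {
        int last_nz = -1;
        for (int i = 0; i < 16; i++) {
            coef[i] = blk[zigzag4x4[i]];
            sig[i]  = (coef[i] != 0);
            if (sig[i]) last_nz = i;
        }
        for (int i = 0; i < 16; i++)     /* all later coefficients zero */
            last[i] = sig[i] && (i == last_nz);
    }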
6.1.1.6 Encoding Process of the Transform Coding Unit
We use Algorithm 6.1 to show the encoding process of the transform coding unit. Its primary inputs are the QP, the residuals, and macroblock information such as the macroblock mode and macroblock type. In addition to sig_coeff_flag and last_sig_coeff_flag, the transform coding unit has to compute two more parameters for the decoder: the coded block pattern (CBP) and the coded block flag (CBF). Figure 6.6 shows the usage of CBP. The right four bits specify which luma 8 × 8 blocks contain at least one nonzero coefficient. The left two bits specify whether the chroma DC blocks or the two chroma 8 × 8 blocks contain nonzero coefficients.

Algorithm 6.1: Transform coding.
Transform_Coding(QP, Residuals, macroblock information)
if current MB is P-type or B-type then
    for 16 luma 4x4 AC blocks of the best inter mode do
        YL = DCT_luma_4x4(XL); ZL = Quan(QP, YL);
        {reordered coefficients, SCF, LSCF} = RLC(ZL);
        Y'L = InvQuan(QP, ZL); X'L = IDCT_luma_4x4(Y'L);
    endfor
else
    for 16 luma 4x4 AC blocks of the best intra 4x4 modes do
        YL = DCT_luma_4x4(XL); ZL = Quan(QP, YL);
        {reordered coefficients, SCF, LSCF} = RLC(ZL);
        Y'L = InvQuan(QP, ZL); X'L = IDCT_luma_4x4(Y'L);
    endfor
    if best mode is an intra 16x16 mode then
        for 16 luma 4x4 AC blocks of the best intra 16x16 mode do
            YL = DCT_luma_4x4(XL); ZL = Quan(QP, YL);
            {reordered coefficients, SCF, LSCF} = RLC(ZL);
            Y'L = InvQuan(QP, ZL); X'L = IDCT_luma_4x4(Y'L);
        endfor
        YLDC = DCT_luma_DC(XLDC); ZLDC = Quan(QP, YLDC);
        {reordered coefficients, SCF, LSCF} = RLC(ZLDC);
        Y'LDC = InvQuan(QP, ZLDC); X'LDC = IDCT_luma_DC(Y'LDC);
    endif
endif
for 4 Cb 4x4 AC blocks do
    YCb = DCT_chroma_4x4(XCb); ZCb = Quan(QP, YCb);
    {reordered coefficients, SCF, LSCF} = RLC(ZCb);
    Y'Cb = InvQuan(QP, ZCb); X'Cb = IDCT_chroma_4x4(Y'Cb);
endfor
YCbDC = DCT_chroma_DC(XCbDC); ZCbDC = Quan(QP, YCbDC);
{reordered coefficients, SCF, LSCF} = RLC(ZCbDC);
Y'CbDC = InvQuan(QP, ZCbDC); X'CbDC = IDCT_chroma_DC(Y'CbDC);
for 4 Cr 4x4 AC blocks do
    YCr = DCT_chroma_4x4(XCr); ZCr = Quan(QP, YCr);
    {reordered coefficients, SCF, LSCF} = RLC(ZCr);
    Y'Cr = InvQuan(QP, ZCr); X'Cr = IDCT_chroma_4x4(Y'Cr);
endfor
YCrDC = DCT_chroma_DC(XCrDC); ZCrDC = Quan(QP, YCrDC);
{reordered coefficients, SCF, LSCF} = RLC(ZCrDC);
Y'CrDC = InvQuan(QP, ZCrDC); X'CrDC = IDCT_chroma_DC(Y'CrDC);
{CBP, CBF} = Gen_MB_info(coefficients);
Fig. 6.6 Specification of coded block pattern (CBP[0] == 0: luma 4 × 4 blocks 0–3 are all zero; CBP[1] == 0: blocks 4–7 are all zero; CBP[2] == 0: blocks 8–11 are all zero; CBP[3] == 0: blocks 12–15 are all zero; CBP[5:4] == 00: chroma blocks 16–25 are all zero; CBP[5:4] == 01: blocks 18–25 are all zero and DC blocks 16–17 are not all zero; CBP[5:4] == 10: at least one nonzero coefficient in blocks 18–25)
If there is any nonzero coefficient in an 8 × 8 block, its CBFs are checked to see which 4 × 4 block has a nonzero coefficient; the CBF equals 1 if the 4 × 4 block has a nonzero coefficient. The transform decoding unit checks whether (1) CBP is equal to 0 and the prediction mode is non-intra 16 × 16, or (2) the CBF of the luma 16 × 16 DC block is equal to 0. If one of these two conditions is true, the current macroblock contains only zero coefficients and will be skipped. In the remaining cases, each macroblock is decoded. If the transform coding unit outputs four coefficients per cycle, it takes 686 (1,162) cycles in total to encode a non-intra 16 × 16 (an intra 16 × 16) macroblock, as shown in Fig. 6.7.
Fig. 6.7 Order of processing subtasks
6.1.2 Design Consideration

Although most of the computations in the transform coding unit are designed to be carried out using only integer arithmetic, its computation load for supporting 1080pHD (1,920x1,088) resolution is still heavy. Encoding 1080pHD video at 30 fps requires a data throughput of up to 94 MBps. To meet this performance requirement, the transform coding unit needs to work at 168 MHz (285 MHz) if it takes 686 (1,162) cycles to encode a macroblock.
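As a sanity check on these numbers, the required frequency follows directly from the macroblock rate; a minimal C sketch, assuming exactly 686 or 1,162 cycles per macroblock and no pipeline overhead:

    #include <stdio.h>

    /* Back-of-the-envelope check of the frequency figures quoted above. */
    int main(void) {
        const long mbs_per_frame = (1920 / 16) * (1088 / 16); /* 8,160 MBs   */
        const long mbs_per_sec   = mbs_per_frame * 30;        /* 244,800 MB/s */

        printf("non-intra 16x16: %.1f MHz\n", mbs_per_sec * 686.0  / 1e6); /* ~168 MHz */
        printf("intra 16x16:     %.1f MHz\n", mbs_per_sec * 1162.0 / 1e6); /* ~285 MHz */
        return 0;
    }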
6.2 Related Works

There are two main streams of hardware implementation for the transform coding unit. One is a multitransform engine combining DCT, IDCT, and Hadamard transform. The other addresses the integration issue by integrating the Trans unit and the Quan unit (or the InvTrans unit and the InvQuan unit) into a single engine.
6.2.1 Multitransform Engine Approaches

Wang et al. [74] propose the first multitransform architecture. It contains two 1D transform units and a transpose memory. Each 1D transform is completed within 1 cycle. To reduce area, it combines three kinds of transforms and outputs four coefficients per cycle. Targeting low power consumption, Chen et al. [6] make two improvements. First, they use four 1D transforms to complete the row part of a 4x4 block and another four 1D transforms to complete the column part. Since the input data of the second 1D transform is ready immediately, no transpose memory is needed to buffer the output of the first 1D transform. Second, they introduce a switching power suppression technique to diminish unnecessary power consumption from data transitions. In [9], the authors of [6] further propose an interlaced I/O schedule. Their design achieves a throughput of 800 Mpixels/s at a working frequency of 100 MHz.
6.2.2 Trans/Quan or InvQuan/InvTrans Integration Approaches

Wang et al. [76] integrate the InvQuan and InvTrans units into a single engine. It contains a 3-kbit local memory and takes 210 cycles to decode a macroblock in the worst case.
Lin et al. [38] propose a combined architecture for an H.264/AVC encoder. It integrates the Trans unit with the Quan unit and the InvQuan unit with the InvTrans unit. It saves the area cost of the Quan and InvQuan units at the expense of a longer critical path. It outputs eight coefficients per cycle and runs at 56 MHz.
6.3 A VLSI Design for Transform Coding

We propose a VLSI architecture for the transform coding unit in this section. We first describe how we schedule all subtasks in Sect. 6.3.1 and then present our hardware architecture in Sect. 6.3.2. In Sect. 6.3.3, we evaluate its performance.
6.3.1 Subtasks Scheduling

Figure 6.8a shows the processing order of our design for a non-intra 16x16 block. At iterations 1, 2, 9, and 10, our design processes eight residuals for one 4x4 residual block per cycle. At iterations 3-8, our design processes two luma 4x4 blocks simultaneously. Figure 6.9 shows the timing diagram for processing a non-intra 16x16 block. Figure 6.8b shows the processing order of our design for luma AC blocks in an intra 16x16 block. Our design processes eight residuals for two luma AC 4x4 blocks (four residuals each) per cycle. Figure 6.10 shows the corresponding timing diagram. We process the luma DC block between the quantization and inverse quantization operations of the luma AC blocks. Our design processes eight residuals per cycle for a luma DC block. Processing two chroma 8x8 blocks is similar to processing an intra 16x16 block. Figure 6.8c shows the processing order of our design for chroma 4x4 AC blocks. Our design also processes eight residuals for two chroma 4x4 AC blocks (four residuals each) per cycle. Figure 6.11 shows the timing diagram for processing chroma blocks. In the proposed design, we spend 178 + 81 = 259 cycles to process a non-intra 16x16 macroblock. If the current macroblock is encoded using an intra 16x16 mode, it takes an additional 163 cycles. In total, our design takes 259 (422) cycles to process a non-intra 16x16 (an intra 16x16) macroblock.
6.3.2 Architecture

Figure 6.12 shows the top-level block diagram of our proposed design. Its primary inputs are residuals, QP, and macroblock information such as mode (intra 16x16 or non-intra 16x16) and type (I, P, or B) for the current macroblock.
Fig. 6.8 Processing order of the proposed design for (a) non-intra 16x16 block, (b) intra 16x16 block, and (c) chroma 8x8 blocks
Its primary outputs are coefficients, transformed residuals, and macroblock information. The main components of our design are a 1D multitransform engine, a combined Q/IQ engine, and an RLC engine. Because our multitransform engine is 1D, we need a transpose buffer to perform the 2D transform. The TPDC buffer is used for transpose operations and DC value storage. Our design also contains a local coefficient memory to store quantized coefficients for the RLC engine and the Q/IQ engine.
6.3.2.1 Multitransform Engine

Figure 6.13 illustrates the architecture of our multitransform engine. Table 6.4 shows its input and output data. There are eight kinds of inputs and four kinds of operations of our multitransform engine.
Fig. 6.9 Timing diagram for processing a non-intra 16x16 block
However, the matrices of the HT and IHT operations are the same, and there is great similarity between the DCT and IDCT operations. By using multiplexers to select the corresponding datapath, we integrate all transforms into a single datapath and achieve better resource sharing by reusing adders and shifters.
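For illustration, one 1D pass of the 4x4 forward transform reduces to the following adder/shifter butterfly; this is a minimal C sketch based on the well-known H.264/AVC integer DCT matrix, with variable names of our own choosing:

    /* One 1-D pass of the H.264/AVC 4x4 integer DCT (matrix
     * [[1,1,1,1],[2,1,-1,-2],[1,-1,-1,1],[1,-2,2,-1]]), written as the
     * adder/shifter butterfly that a shared datapath can implement.
     * Applying it to rows, transposing, then to columns gives the 2-D DCT. */
    static void dct4_1d(const int x[4], int y[4]) {
        int s0 = x[0] + x[3], s3 = x[0] - x[3];
        int s1 = x[1] + x[2], s2 = x[1] - x[2];
        y[0] = s0 + s1;
        y[2] = s0 - s1;
        y[1] = (s3 << 1) + s2;   /* multiply-by-2 realized as a shift */
        y[3] = s3 - (s2 << 1);
    }

The absence of true multipliers in this butterfly is what makes it practical to multiplex the DCT, IDCT, and Hadamard variants onto one set of adders and shifters.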
6.3.2.2 Combined Q/IQ Engine

Figure 6.14 illustrates the architecture of our Q/IQ engine. In the proposed design, we use a table look-up method instead of division to calculate QP/6 and QP%6. Moreover, adders and multipliers are reused for both the Quan and InvQuan operations. By adopting these two methods, we reduce hardware cost significantly.
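A minimal sketch of the table look-up idea follows. The two 52-entry ROMs replace the divider; the multiplication factor array mf and the rounding offset f are illustrative placeholders (the real factors also depend on the coefficient position within the 4x4 block):

    /* QP is at most 51, so QP/6 and QP%6 can come from two small ROMs. */
    static const unsigned char qp_div6[52] = {
        0,0,0,0,0,0, 1,1,1,1,1,1, 2,2,2,2,2,2, 3,3,3,3,3,3,
        4,4,4,4,4,4, 5,5,5,5,5,5, 6,6,6,6,6,6, 7,7,7,7,7,7, 8,8,8,8
    };
    static const unsigned char qp_mod6[52] = {
        0,1,2,3,4,5, 0,1,2,3,4,5, 0,1,2,3,4,5, 0,1,2,3,4,5,
        0,1,2,3,4,5, 0,1,2,3,4,5, 0,1,2,3,4,5, 0,1,2,3,4,5, 0,1,2,3
    };

    static int quantize(int coef, int qp, const int mf[6]) {
        int qbits = 15 + qp_div6[qp];          /* table replaces qp / 6 */
        int f     = (1 << qbits) / 6;          /* inter offset; intra uses /3 */
        int sign  = (coef < 0) ? -1 : 1;
        int level = (sign * coef * mf[qp_mod6[qp]] + f) >> qbits;
        return sign * level;
    }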
Fig. 6.10 Timing diagram for processing an intra 16x16 block
6.3.2.3 RLC Engine

Our proposed RLC engine reads coefficients and performs zig-zag scans for the entropy encoder. It calculates sig_coeff_flag using a bitwise OR operation over every bit of a coefficient. To calculate last_sig_coeff_flag, it uses the sig_coeff_flags to locate the last nonzero coefficient. It also calculates CBF from the 16 sig_coeff_flags.
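A sketch of the flag derivation for one zig-zag-ordered 4x4 block (function and variable names are ours):

    /* sig_coeff_flag is an OR-reduction of each coefficient's bits in
     * hardware; CBF is then an OR over the 16 flags. */
    static void rlc_flags(const short zz[16],
                          unsigned char sig[16],
                          unsigned char last[16],
                          unsigned char *cbf) {
        int i, last_pos = -1;
        for (i = 0; i < 16; i++) {
            sig[i] = (zz[i] != 0);
            if (sig[i]) last_pos = i;    /* remember the last nonzero position */
        }
        for (i = 0; i < 16; i++)
            last[i] = (i == last_pos);   /* set only at the last nonzero coeff */
        *cbf = (last_pos >= 0);          /* OR of the 16 sig_coeff_flags */
    }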
Fig. 6.11 Timing diagram for processing chroma blocks
6.3.2.4 Organization of the Coefficient Memory

The local coefficient memory contains two read ports and one write port. It has four banks, each of 128 bits x 30 entries. Figure 6.15 shows the organization of the coefficient memory. We access the coefficient memory in a ping-pong fashion because the transform coding unit and the entropy encoder work at different pipeline stages in our encoder. If the coefficients of the current macroblock are stored in entries 0-14, the coefficients of the previous macroblock are stored in entries 15-29. While the Q/IQ engine reads or writes the coefficient memory for the current macroblock, the RLC engine reads the coefficients of the previous macroblock for the entropy encoder.
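A minimal sketch of the ping-pong address selection, assuming the half swap happens once per macroblock (the names are ours):

    /* Entries 0-14 and 15-29 alternate between the macroblock being
     * quantized and the one being read out by the RLC engine. */
    static int coeff_mem_addr(int mb_count, int entry /* 0..14 */) {
        int base = (mb_count & 1) ? 15 : 0;  /* swap halves every macroblock */
        return base + entry;
    }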
Fig. 6.12 Top-level block diagram of the proposed design
Fig. 6.13 Architecture of the proposed multitransform engine
Table 6.4 Input and output data of the proposed multitransform engine

Input                               From                 Operation  Output                    To
Residuals                           Intra MD or MC unit  DCT        TransRes                  TPDC buffer
Transposed TransRes                 TPDC buffer          DCT        Coefficients              Q/IQ engine
DC values                           TPDC buffer          HT         TransDC                   TPDC buffer
Transposed TransDC                  TPDC buffer          HT         DC coefficients           Q/IQ engine
Inverse-quantized coefficients      Q/IQ engine          IDCT       InvTransCoeff             TPDC buffer
Transposed InvTransCoeff            TPDC buffer          IDCT       Transformed residuals     Recons unit
Inverse-quantized DC coefficients   Q/IQ engine          IHT        InvTransDC                TPDC buffer
Transposed InvTransDC               TPDC buffer          IHT        Transformed DC residuals  Recons unit
Fig. 6.14 Architecture of the proposed Q/IQ engine
Fig. 6.15 Organization of the coefficient memory
6.3.3 Evaluation

Synthesized toward a TSMC 0.13-μm CMOS cell library, the proposed design has a total gate count of about 77K gates when running at 138 MHz. The multitransform engine consumes about 18K gates and the Q/IQ engine about 36K gates. The design takes at most 422 cycles to process a macroblock. Running at 104 MHz, it can encode 1080pHD video in real time.
6.4 Summary

A transform coding unit is indispensable to a video encoder. It translates data from the spatial domain to the frequency domain and removes insignificant parts. H.264/AVC introduces a new transform coding algorithm to achieve better compression performance. However, the computation load of the transform coding unit in H.264/AVC is heavy for high-resolution applications. Previous works either propose an architecture that supports multiple transforms or integrate the Trans and Quan units (or the InvQuan and InvTrans units) into a single architecture for system integration. We propose an efficient transform coding architecture for H.264/AVC. We have designed a multitransform engine that integrates three kinds of transforms in both forward and inverse parts. We have also designed a combined Q/IQ engine that shares multipliers and adders between the Quan and InvQuan operations. Our design achieves good resource sharing to reduce hardware area cost. It supports real-time encoding of 1080pHD video.
Chapter 7
Deblocking Filter
Abstract The deblocking filter (DF) adopted in H.264/AVC reduces the blocking artifacts generated by block-based motion-compensated inter prediction, intra prediction, and the integer discrete cosine transform. The filter for eliminating blocking artifacts is embedded within the coding loop; therefore, it is also called the in-loop filter. Empirically, it achieves up to 9% bit-rate saving at the expense of intensive computation. Even with today's fastest CPUs, it is hard to perform software-based real-time encoding of high-resolution sequences such as QFHD (3,840x2,160). Consequently, accelerating the deblocking filter by VLSI implementation is required. Through optimizing processing cycles, external memory access, and working frequency, we show a design that supports QFHD at 60 fps by running at 195 MHz.
7.1 Introduction

The deblocking filter [45] eliminates blocking artifacts and thus generates a smooth picture. The prediction unit finds a block similar to the current block. The found block cannot perfectly match the current block, so prediction errors occur. For coding efficiency, the errors are transformed and quantized. After decoding, the reconstructed block differs from the original block; in particular, discontinuity appears at block edges. To reduce the degree of discontinuity, the deblocking filter process is applied. The deblocking filter consumes much computing time because all pixels adjacent to block boundaries must be filtered. To meet real-time requirements, we implement the deblocking filter in a fully hardwired architecture. In the following subsections, we introduce the deblocking filter algorithm and address its design considerations.
Algorithm 7.1: Deblocking filter.
DF (coding information, unfiltered pixels, QPs, FilterOffsetA, FilterOffsetB)
  for each of two adjacent blocks do
    BS = Boundary_strength_generation(coding information);
    {α, β, indexA, indexB} = Get_threshold(QPs, FilterOffsetA, FilterOffsetB);
    filterSampleflag = Calculate_filterSampleflag(unfiltered pixels, α, β);
    if filterSampleflag == 1 then
      if BS == 4 then
        filtered pixels = Strong_edge_filter(unfiltered pixels);
      else
        {δ, δp, δq} = Calculate_delta(BS, indexA, unfiltered pixels);
        filtered pixels = Weak_edge_filter(δ, δp, δq, unfiltered pixels);
      endif
    endif
  endfor
  return with filtered pixels
7.1.1 Deblocking Filter Algorithm

Algorithm 7.1 gives the deblocking filter algorithm and Fig. 7.1 shows its flow chart. Inputs to the filter include unfiltered pixels, coding information, QPs, and slice parameters (FilterOffsetA and FilterOffsetB). At the beginning, we calculate the boundary strength (BS) according to coding information from both the slice level and the MB level to set the filtering strength. If we applied the filter directly according to the boundary strength, it might blur the picture. Therefore, the standard employs two thresholds, α and β, derived from the QPs and slice parameters, to help avoid oversmoothing true edges. Afterward, the filter determines whether the pixels need to be filtered according to filterSampleflag. Finally, according to the boundary strength, it employs one of two edge filters to filter pixels in a specific order. Unlike the strong edge filter, the weak edge filter needs additional delta values before filtering pixels.

7.1.1.1 Filtering Order
H.264/AVC specifies a normative in-loop filter algorithm. The filter process follows a specific filtering order from the top-left MB to the bottom-right one. Each MB consists of one 16x16 luminance block and two 8x8 chrominance blocks in 4:2:0 video format. Filtering an MB references and modifies part of the pixels of the neighboring MBs to the left and above. The deblocking filter process consists of horizontal filtering across all vertical edges and vertical filtering across all horizontal edges. Figure 7.2a, b shows the edges to be filtered for a luminance block and a chrominance block, respectively. For the 16x16 luminance block depicted in Fig. 7.2a, vertical edge V0 is filtered horizontally first, from top to bottom, followed by edges V1, V2, and V3. Then horizontal edge H0 is filtered vertically, from left to right, followed by edges H1, H2, and H3.
Fig. 7.1 The flow chart of deblocking filter
Note that in each 1D filtering operation, at most 6 pixels (p2, p1, p0, q0, q1, q2) among the 8 unfiltered pixels (p3, p2, p1, p0, q0, q1, q2, q3) are changed. The filtering process for chrominance components is similar to that for luminance components, as shown in Fig. 7.2b. Filtering is first applied horizontally on edge V0 from top to bottom, followed by V1. Then the horizontal edges H0 and H1 are filtered sequentially. There are only 4 input pixels (p1, p0, q0, q1) and at most two possible modifications (p0, q0) for each chrominance filtering.

7.1.1.2 Boundary Strength, Thresholds, and FilterSampleflag
The deblocking filter process adapts to the values of the boundary strength and thresholds. The boundary strength is classified into five levels (0-4): 4 for the strongest filtering and 0 for no filtering. A BS value is calculated for each edge between two adjacent 4x4 luminance blocks as well as for each edge between two corresponding 2x2 chrominance blocks. Figure 7.3 shows all the boundary strength parameters.
Fig. 7.2 Horizontal and vertical filtering of (a) luma and (b) chroma blocks
Fig. 7.3 Boundary strengths for (a) luma and (b) chroma edges
Figure 7.4 gives a flow chart for determining the BS value. If either of the two adjacent 4x4 blocks is coded in intra prediction mode and they lie on a macroblock edge, BS is set to 4. If either is intra coded and neither lies on an MB edge, BS is 3. If either contains nonzero coefficients, BS is set to 2. If different reference pictures are used, or the difference between the two blocks' motion vectors is greater than or equal to 4 in units of quarter-pixels, BS equals 1. For the remaining cases, BS is set to 0.
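This decision chain maps onto a simple priority encoder; a C sketch with argument names of our own choosing (mv_diff_ge4 means a motion vector component differs by 4 or more quarter-pixel units):

    /* BS derivation for one edge between blocks p and q, following the
     * flow of Fig. 7.4. */
    static int boundary_strength(int intra_p, int intra_q, int on_mb_edge,
                                 int nonzero_coeff_p, int nonzero_coeff_q,
                                 int diff_ref, int mv_diff_ge4) {
        if ((intra_p || intra_q) && on_mb_edge)  return 4;
        if (intra_p || intra_q)                  return 3;
        if (nonzero_coeff_p || nonzero_coeff_q)  return 2;
        if (diff_ref || mv_diff_ge4)             return 1;
        return 0;
    }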
Fig. 7.4 Flow chart of deriving BS value
Thresholds α and β are used to prevent true edges from being overfiltered. Their values are derived from the quantization parameters according to the following steps. The function Clip3 acts as a saturation function when its third parameter is outside the range defined by the first two. After obtaining indexA and indexB, we get the threshold variables α and β from the Alpha Table and Beta Table.

Step 1. Let the quantization parameters of the two 4x4 blocks be qPp and qPq
Step 2. qPav = (qPp + qPq + 1) >> 1
Step 3. indexA = Clip3(0, 51, qPav + FilterOffsetA)
Step 4. indexB = Clip3(0, 51, qPav + FilterOffsetB)
Step 5. α = Alpha_Table(indexA)
Step 6. β = Beta_Table(indexB)

FilterOffsetA and FilterOffsetB are slice-level parameters configured by the encoder to control the filtering effect. If they are set negative, the threshold values become small and the deblocking filter process seldom takes place, which is suitable for high-quality video sequences. When the input pixels, boundary strength, and threshold variables are ready, the deblocking filter checks the value of filterSampleflag. If it equals 1, the current edge is very likely a blocking artifact, and the edge filter process is applied. The flag is set to 1 when all of the following conditions hold.
Cond. 1. BS != 0
Cond. 2. Abs(p0 - q0) < α
Cond. 3. Abs(p1 - p0) < β
Cond. 4. Abs(q1 - q0) < β
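The six steps and four conditions combine into one small helper; in this C sketch, alpha_table and beta_table stand for the standard's 52-entry tables (not reproduced here) and all other names are ours:

    static int iabs3(int v) { return v < 0 ? -v : v; }
    static int clip3(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }

    /* Derive alpha/beta (Steps 1-6) and evaluate Conds. 1-4 for one edge. */
    static int filter_sample_flag(int bs, int qp_p, int qp_q,
                                  int offset_a, int offset_b,
                                  int p1, int p0, int q0, int q1,
                                  const int alpha_table[52],
                                  const int beta_table[52]) {
        int qp_av   = (qp_p + qp_q + 1) >> 1;          /* Step 2 */
        int index_a = clip3(0, 51, qp_av + offset_a);  /* Step 3 */
        int index_b = clip3(0, 51, qp_av + offset_b);  /* Step 4 */
        int alpha   = alpha_table[index_a];            /* Step 5 */
        int beta    = beta_table[index_b];             /* Step 6 */
        return bs != 0                     /* Cond. 1 */
            && iabs3(p0 - q0) < alpha      /* Cond. 2 */
            && iabs3(p1 - p0) < beta       /* Cond. 3 */
            && iabs3(q1 - q0) < beta;      /* Cond. 4 */
    }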
7.1.1.3 Edge Filter

The deblocking filter performs the edge filter process after obtaining the necessary inputs: unfiltered pixels, α, β, and BS. Based on filterSampleFlag and BS, it decides whether to execute filtering. If BS equals 4, it applies the strong edge filter; otherwise, it applies the weak edge filter. Figures 7.5 and 7.6 show the flow charts of the strong edge filter and the weak edge filter, respectively. The strong edge filter modifies 3 pixels on each side of the edge with a low-pass filter. Unlike the strong edge filter, the weak edge filter uses table-derived values c0 and c1 to clip the delta values. The flag chromaEdgeflag identifies luma versus chroma edges.
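For reference, the core equations of the two luma filters can be sketched as follows; the formulas follow the standard, the q side mirrors the p side, c0 is the table-derived clipping bound mentioned above, and clip1f saturates to the 8-bit pixel range (the conditional p1/q1 adjustments of the weak filter are omitted for brevity):

    static int clip3f(int lo, int hi, int v) { return v < lo ? lo : (v > hi ? hi : v); }
    static int clip1f(int v)                 { return clip3f(0, 255, v); }

    /* BS == 4 and the strong-filter condition holds: three p pixels change. */
    static void strong_filter_p(int p3, int p2, int p1, int p0,
                                int q0, int q1, int out[3]) {
        out[0] = (p2 + 2*p1 + 2*p0 + 2*q0 + q1 + 4) >> 3;  /* p0' */
        out[1] = (p2 + p1 + p0 + q0 + 2) >> 2;             /* p1' */
        out[2] = (2*p3 + 3*p2 + p1 + p0 + q0 + 4) >> 3;    /* p2' */
    }

    /* BS in 1..3: a clipped delta adjusts p0 and q0. */
    static void weak_filter(int p1, int *p0, int *q0, int q1, int c0) {
        int delta = clip3f(-c0, c0, ((((*q0 - *p0) << 2) + (p1 - q1) + 4) >> 3));
        *p0 = clip1f(*p0 + delta);
        *q0 = clip1f(*q0 - delta);
    }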
7.1.2 Subtasks Processing Order

Figure 7.7 shows the top-level timing diagram of the subtasks. At the beginning, the deblocking filter fetches the unfiltered pixels. It then prepares the boundary strength, thresholds, and filterSampleflag of each edge before filtering a boundary. After filtering 48 boundaries, it outputs the filtered pixels.
Fig. 7.5 Flow chart of strong edge filter
Fig. 7.6 Flow chart of weak edge filter
Fig. 7.7 Subtasks timing of the deblocking filter
7.1.3 Design Considerations

Designing a deblocking filter involves five considerations: (1) processing cycles, (2) external memory traffic, (3) working frequency, (4) hardware cost, and (5) skip mode support for eliminating unnecessary filtering. We describe them in the following subsections.
7.1.3.1 Processing Cycles
The cycle count of the deblocking filter is determined by both filter utilization and I/O interface width. Because there are 192 filtering operations per MB, at least 192 cycles are needed using a single edge filter, and 96 cycles using two edge filters. Since 96 x 32 bits of pixel data must be read from the previous stage and written to the external memory, we need at least 96 cycles with a 32-bit I/O interface and 48 with a 64-bit one.
7.1.3.2 External Memory Traffic
The deblocking filter’s external memory access is composed of three parts: write access of current MB, read/write access of left neighbor, and read/write access of top neighbor. The write access of current MB is constant but there is a tradeoff between local SRAM size and traffic for left and top neighbors. With a 32 32-bit of local memory, we can save 192 bytes of traffic for left neighbor. However, in order to eliminate upper neighbor traffic, we need to store the whole row of upper neighboring blocks with a huge local SRAM whose size is proportional to the picture width.
7.1.3.3 Working Frequency
The deblocking filter algorithm is both complex and adaptive. A single filtering operation implemented in hardware includes input selection from memories or registers, table look-up, threshold clipping, pixel comparison, pixel filtering with additions and shifts, pixel clipping in case of overflow, and output to memories or registers. Therefore, it is important to carefully allocate these operations to different pipeline stages for high-frequency applications. A pipelined design should further ensure that there is no pipeline hazard during consecutive filtering operations under a given filtering order. Otherwise, extra stall cycles must be inserted and performance degrades.
7.1.3.4 Hardware Cost
Hardware cost is determined jointly by local memory size and logic gate count. The memory size depends on two factors: the tradeoff between external memory traffic and local SRAM size, and the filtering order. A design with a memory-efficient filtering order completely filters each 4x4 block as soon as possible to reduce the size of the local buffer for partially filtered blocks. The logic gate count mainly depends on the number and structure (pipelined or nonpipelined) of edge filters and the number and size of transpose registers.
Fig. 7.8 Profiling of Skip-top mode and Skip-all mode in P-type frames
7.1.3.5 Skip Mode Support for Eliminating Unnecessary Filtering
According to the distribution of BS values, two special MB modes (Skip-top mode and Skip-all mode) are very likely to occur in an inter-predicted frame. An MB in Skip-top mode has zero BS at its top boundary edges, so we do not have to access external memory for its upper-neighbor pixels. An MB in Skip-all mode has all its BSs equal to zero, so no filtering is needed. Figure 7.8 shows the profiling result of these two modes over several P-type frames. A design that takes advantage of skip modes needs less memory access and fewer cycles.
7.2 Related Works Several previous designs focus on cycle reduction by increasing the filter utilization. Huang’s architecture [19] is the first that adopts straightforward filtering order. Chang’s design [17] reduces processing cycles by overlapping data transfer with filtering operation and employing an adaptive macroblock transmission scheme to skip transfer of pixels that need no filtering. Khurana’s design [33] and Shih’s designs [58, 59] achieve near-optimal cycle counts using single filter by performing horizontal filtering and vertical filtering alternately. Later, Lin [52] proposes an architecture with two edge filters, an 128-bit I/O, and a large local SRAM for cycle reduction. In order to integrate into a hardwired decoder or codec, Liu [48] and Shih [59] hold both left and upper neighbor pixels in a large local SRAM. The later further reduces the size for upper neighbor by storing only half of the chroma components in
the local memory. In contrast, Chang [17] proposes a bus-connected accelerator using an adaptive macroblock transmission scheme. It need not store neighbor pixels in local SRAM, but it requires extra external memory traffic. To achieve high working frequency, many previous designs employ pipelined edge filters. Li [41] proposes a four-stage pipelined datapath that is 47% faster than a nonpipelined one. Chao [16], Bojnordi [2], and Khurana [33] utilize pipelined edge filters with hazard-free filtering orders. Shih [59] employs a forwarding scheme to prevent hazards. For hardware resource saving, Chao's [16] and Khurana's [33] architectures use novel filtering orders and hence need less local memory. Li [41], Lin [52], and Bojnordi [2] all propose special multibank memory organizations to reduce the number of transpose registers.
7.3 A VLSI Design for Deblocking Filter

Based on the above design considerations, our target is to fulfill the demand for ultra-high throughput with low hardware overhead. We propose a DF architecture [43] that nearly achieves the lower bound on cycle count using two filters and a 64-bit I/O interface. Furthermore, our design saves all left-neighbor traffic and most upper-neighbor traffic in inter-predicted frames without storing the entire row of upper neighbors in the local SRAM. It employs a five-stage pipelined, hardware-shared dual-edge filter to generate two filtering results every cycle.
7.3.1 Subtasks Scheduling

To achieve ultra-high throughput, we must schedule the subtasks of the deblocking filter well, as shown in Fig. 7.9. In the beginning, the filter generates the boundary strengths of all edges. It then employs a five-stage pipelined dual-edge filter and a 64-bit I/O to filter two boundary edges within 4 cycles. When filtering the rightmost macroblock, it spends additional cycles to flush the rightmost filtered pixels. Because our H.264/AVC encoder is also pipelined, we can save a further 48 cycles by overlapping boundary strength generation with the filtering of previous blocks.
7.3.2 Architecture

Figure 7.10 shows the block diagram of our deblocking filter architecture. The three dotted boxes, BSG, DF, and MAU-DF, represent the modules involved in the deblocking filtering process. Outside the dotted boxes are global memories.
Fig. 7.9 Subtasks scheduling
These three modules execute at different pipeline stages. While the DF module filters MB N, the Upper-neighbor Prefetch submodule reads the upper neighbor of MB (N + 1) from the reference frame and the BSG module calculates the boundary strengths of MB (N + 2). Meanwhile, the Ref-frame Interface submodule transfers the filtered pixels of MB (N - 1) or (N - 2) from the Filtered memory to the external memory. The BSG module calculates the BSs of an MB within 48 cycles. It gets motion vectors, reference indexes, coding information, and filtering flags from the input memories MV0, MV1, Refidx0, Refidx1, Para, and Predinfo, respectively. BSG also informs the MAU-DF module of Skip-top mode and the DF module of Skip-all mode. The MAU-DF module manages external memory access, including the pixel prefetch for the upper neighbor and the output of filtered pixels. The Upper-neighbor Prefetch submodule reads the upper neighboring blocks from the reference frame and stores them in the Upper-neighbor memory. After an MB is completely filtered, the Ref-frame Interface stores the MB as well as its upper neighbor from the Filtered memory to the reference frame. If it is a Skip-top MB, its upper neighbor is not accessed.
Fig. 7.10 Proposed deblocking filter block diagram
The DF module needs the pixels of the current MB as well as its left and upper neighbors as inputs. It takes the unfiltered current MB from the Reconstruct memory and the upper neighboring blocks from the Upper-neighbor memory. After an MB is filtered, its rightmost column of 4x4 blocks is stored in the Local-Pixel memory as the left neighbor of the next MB. The DF module is composed of several submodules, including the Threshold Generator, BS Selector, Two-Result-Per-Cycle Edge Filter, transpose registers T0 and T1, and memory interfaces. The Threshold Generator and BS Selector provide two sets of boundary strengths and threshold variables for the Two-Result-Per-Cycle Edge Filter. The two transpose registers, T0 and T1, are used for transposing pixels. The Two-Result-Per-Cycle Edge Filter takes 12 pixels as input and filters them through a five-stage pipeline. It selects pixels from the Local-Pixel memory, Reconstruct memory, T0, T1, or its own output port. The Local-Pixel Selection submodule selects either the output of the Two-Result-Per-Cycle Edge Filter and T0 or the unfiltered pixels of the Reconstruct memory for storage. The Parallel-connected ROP Pixel Selector submodule selects two individual 4x4 blocks as input and outputs them alternately as a parallel-connected row-of-pixels.
7.3.2.1 Filtering Order

Figure 7.11 shows the filtering order of our Two-Result-Per-Cycle Edge Filter. It can concurrently filter 12 pixels across two edges every cycle. The number associated with an edge denotes the processing step. The letters "L" and "R" denote the left or right datapath used.
Fig. 7.11 Proposed filtering order
We start from the Cb block, followed by the top half of the Y block except for "23L" and "23R." Then we filter the Cr block and the remaining edges of the Y block. This order ensures no access conflict in the local and output memories; therefore, we can filter all 48 edges without any stall cycles.
7.3.2.2 Memory Organization
We employ a row-of-pixels (ROP) approach for data placement in the memories. Figure 7.12 shows the pixel organizations of the input memories, output memory, and local memory. The Reconstruct memory can provide 8 pixels as a serial-connected ROP in 1 cycle. The Upper-neighbor memory and Filtered memory store pixels as parallel-connected ROPs. Both the Reconstruct and Upper-neighbor memories adopt a ping-pong structure. The Filtered memory spends 60 x 64 x 3 bits for three MBs to collect filtered pixels for effective off-chip memory access. In the Local-Pixel memory, we use a two-ported, two-bank SRAM (28 x 32 x 2 bits) to store the left neighboring pixels and temporary pixels.
7.3.2.3 Two-Result-Per-Cycle Edge Filter

Figure 7.13 shows the proposed five-stage pipelined dual-edge filter. Stage 1 gets 12 pixels from the memories, transpose registers, or output of the edge filter. Stage 2 filters the left edge with the left strong filter (BS = 4). The left delta values of the left edge and the R21 delta value of the right edge are also derived at this stage, since "R21" is ready. Stage 3 filters the left edge with the left weak edge filter (BS < 4); it also performs the R21 filter to generate the "R21" pixel and calculates the right delta values. In Stage 4, the remaining pixels of the right edge are filtered with the right weak edge filter. Finally, Stage 5 outputs filtered pixels to the local memories, transpose registers, or the input of the edge filter.
Fig. 7.12 Different memory organizations of one MB
Fig. 7.13 Proposed pipelined two-result-per-cycle edge filter
Note that the strong edge filter applies only to MB boundaries, so it need not be performed on the right edge under our filtering order. Figure 7.14a takes Step 9 and Step 10 as an example to illustrate the pipeline hazard. In Step 9, blocks B1, B2, and B3 are horizontally filtered. However, the following Step 10 immediately needs the transposed block B3, which is still in the pipeline. To cope with this hazard, we add multiplexers that forward the filtered pixels to the appropriate locations. Figure 7.14b shows the detailed pixel transfer between Step 9 and Step 10. At the last cycle of Step 9, the pixels in the right registers of the edge filter must be transposed and fed back to the middle registers within the next 4 cycles. Meanwhile, the pixels in the left registers are transposed and stored into locations 20-23 of bank 1 of the Local-Pixel memory. At the first cycle of Step 10, the pixels B3d, B3h, B3l, and B3p are forwarded from the right registers to the middle registers, since they have been neither referenced nor modified during the current filtering operation. Similarly, the pixels B1a, B1e, B1i, and B1m are forwarded from the left registers to the Local-Pixel memory during the same cycle. When the forwarding path is activated, the pixels to be referenced or modified are transferred downward, while filtered or unused pixels may be transferred horizontally to fill the empty registers whose pixels have been forwarded out. The different sets of shaded pixels in the left and right registers represent the forwarded pixels of each cycle. With the forwarding path, we solve the pipeline hazard problem and reduce the number of transpose registers used.
7.3.2.4 Pixels Transfer of a Non-skip-all MB

Figure 7.15 illustrates the pixel transfer steps of a Non-skip-all MB in detail. Each step takes 4 cycles and each grid represents a 4x4 block. "L," "M," and "R" denote the left, middle, and right registers, respectively, of our edge filter. "WL1" and "WL2" denote the blocks written to bank 1 and bank 2, respectively, of the Local-Pixel memory; their associated locations are denoted "ADDR." "O1" and "O2" indicate the output blocks, which are processed by the Parallel-connected ROP Pixel Selection submodule. "T0" always buffers and transposes the result of "M." The pixels "L," "M," and "R" of the edge filter are selected from the Reconstruct memory, Local-Pixel memory, "T0," "T1," or the outputs of the edge filter. The output blocks "O1" and "O2" come from "L," "R," "T0," the Reconstruct memory, or the Local-Pixel memory. After 24 filtering steps, the rightmost column of blocks, "B3," "B7," "B11," "B15," "B17," "B19," "B21," and "B23," replaces the original left neighboring blocks in the Local-Pixel memory. To increase hardware utilization, Steps 26 and 27 overlap with Steps 1 and 2 of the next MB. Thus, we can process one Non-skip-all MB within 100 cycles.
Fig. 7.14 (a) Illustration of the pipeline hazard and (b) pixel transfer between Step 9 and Step 10
7.3.3 Evaluation

Figure 7.16 shows the processing cycles of the Non-skip-all mode and Skip-all mode. In Skip-all mode, it takes only 48 cycles to directly transfer the pixels from the Reconstruct memory to the Local-Pixel memory and Filtered memory without any filtering.
Fig. 7.15 Pixels transfer of a Non-skip-all MB
Fig. 7.16 Processing cycles of Non-skip-all mode and Skip-all mode MB
For the right boundary of a picture, we do not have to keep the rightmost column; thus, it takes 16 extra cycles to flush out the pixels stored in the local memory. For a Non-skip-all MB, 96 cycles are needed for filtering and 12 cycles for flushing out the filtered results.
Targeting a TSMC 0.13-μm CMOS cell library, the gate counts of the DF and BSG modules are 17.1K and 5.8K, respectively, when synthesized at 250 MHz.
7.4 Summary

Our architecture nearly achieves the lower bound on processing cycle count, minimizes neighbor access with an efficient local SRAM, and fulfills the high-frequency requirement. Moreover, we employ a hardware-shared Two-Result-Per-Cycle Edge Filter and adopt a mode-aware data flow. The design satisfies the demand for ultra-high throughput.
Chapter 8
CABAC Encoder
Abstract Context-based adaptive binary arithmetic coding (CABAC), adopted in the H.264/AVC main profile, is the state of the art in terms of bit-rate efficiency. In comparison with the context-based adaptive variable length coding (CAVLC) used in the baseline profile, it saves up to 7% of the bit-rate. However, CABAC occupies 9.6% of the total encoding time, and its throughput is limited by bit-level data dependency. Moreover, for ultra-high resolutions such as QFHD (3,840x2,160), a pure software CABAC encoder has difficulty meeting real-time requirements. Therefore, it is necessary to accelerate the CABAC encoder by VLSI implementation. In this chapter, a novel CABAC encoder architecture is described. It is capable of real-time encoding of QFHD video in the worst case of main profile Level 5.1.
8.1 Introduction

In H.264/AVC, a syntax element (SE) carries coding information. The main function of context-based adaptive binary arithmetic coding (CABAC) [53] is to encode syntax elements into a bit-stream. CABAC achieves high compression efficiency by (1) selecting the probability model according to a syntax element's context, (2) adapting the probability estimation based on local statistics, and (3) using arithmetic coding instead of variable-length coding. These techniques lead to high data dependency and computational complexity. For real-time encoding of high-resolution sequences, we implement the CABAC encoder fully in hardware. In the following subsections, we introduce the CABAC algorithm and address its design considerations.
8.1.1 CABAC Encoder Algorithm

CABAC consists of three parts: (1) binarization, (2) context modeler, and (3) binary arithmetic encoder. The binary arithmetic encoder in turn has three coding engines: regular, bypass, and terminal.
Algorithm 8.1: CABAC encoder.
CABAC encoder (SE, inc_idc)
  context_table[0..398] = Build_table(inc_idc);
  repeat
    {bin, process_mode} = binarization(SE);
    repeat
      if process_mode == regular_mode then
        context_index = Context_modeler(SE, bin);
        context_data = context_table[context_index];
        {bitstream, context_data} = regular_coding_engine(bin, context_data);
        context_table[context_index] = context_data;
      else if process_mode == bypass_mode then
        bitstream = bypass_coding_engine(bin);
      else
        bitstream = terminal_coding_engine(bin);
      endif
    until finish encoding the last bin of SE
  until SE == end_slice_flag
  return with bitstream
Algorithm 8.1 outlines the CABAC encoder, and its flow chart is shown in Fig. 8.1. In the beginning, CABAC builds a context table according to the inc_idc variable. Then, the binarization unit maps an incoming SE to a series of bins. Afterward, CABAC determines the encoding mode of each bin. For the regular mode, the context modeler computes a context index (CtxIdx) into the context table. For some SEs, the context index is calculated according to neighboring SEs. After looking up the context table, the regular coder encodes bins according to the context data. Finally, it updates the context table and outputs a bit stream. Unlike the regular coding engine, the other two engines need not look up the context table.
8.1.1.1 Building Context Table

At the beginning of encoding a new slice, the encoder must build a new context table. The context table contains 399 probability models of bins (as defined in the H.264/AVC main profile). Each probability model is composed of the value of the Most Probable Symbol (MPS) and the Probability State Index (pStateIdx). These two parameters are used to estimate the probability of a bin.
8.1.1.2 Binarization Schemes

The result of the binarization process is a series of bins. The CABAC binarization schemes (except the table mapping scheme) enable simple on-the-fly computation without using any table, unlike Huffman coding. There are five basic schemes and three concatenations of these basic schemes.
Fig. 8.1 The flow chart of CABAC encoder
The Unary Code Method

If the syntax element is x, the binarization result is x consecutive "1" bits plus a terminating "0" bit. For example, if the syntax element is "3," the binarization result is "1110."
The Truncated Unary Code Method

If the syntax element value is less than a truncation value S, the binarization result is the same as that of the unary code. Otherwise, the binarization result is S "1"s. For example, when the truncation value is 3 and the syntax element value is "2," the binarization result is "110." But if the syntax element value is "3" or "4," the binarization result is "111."
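A sketch of these two schemes in C (the one-bin-per-byte buffer convention is ours):

    static int unary(unsigned v, unsigned char *bins) {
        unsigned i;
        for (i = 0; i < v; i++) bins[i] = 1;
        bins[v] = 0;                       /* terminating "0" bit */
        return (int)v + 1;                 /* number of bins produced */
    }

    static int truncated_unary(unsigned v, unsigned s, unsigned char *bins) {
        if (v < s) return unary(v, bins);  /* same as plain unary */
        unsigned i;
        for (i = 0; i < s; i++) bins[i] = 1;  /* exactly s "1"s, no terminator */
        return (int)s;
    }

For instance, truncated_unary(2, 3, buf) yields "110" and truncated_unary(3, 3, buf) yields "111", matching the examples above.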
The Fixed-Length Code Method

This is the standard binary representation method. For example, if the syntax element value is "6," the binarization result is "110."
The kth-Order Exp-Golomb Code Method

The CABAC encoder uses two kinds of Exp-Golomb codes: EG0 (0th-order Exp-Golomb) and EG3 (3rd-order Exp-Golomb). Algorithm 8.2 outlines the kth-order Exp-Golomb code method.
The Table Mapping Code Method

Some syntax elements are binarized by looking up tables. For example, if the value of the mb_type SE (I slice) is "5," its binarization is "1001000."
The Concatenation Method

Built on the five basic methods, we have three concatenation methods. The first is a concatenation of a 4-bit fixed-length code for the prefix and the truncated unary code with S = 2 for the suffix. The other two methods mix the unary code with a kth-order Exp-Golomb code. We call them Unary/kth-order Exp-Golomb (UEGk) code methods, where k = 0 in UEG0 and k = 3 in UEG3. For example, UEG0 combines the truncated unary method with S = 14 and the 0th-order Exp-Golomb method. If the syntax element value is "18" and the binarization method is UEG0, the binarization result is "11111111111111 110 01."

Algorithm 8.2: The kth-order Exp-Golomb code method.
kth_order_Exp-Golomb_code (SE, k)
  while true
    if SE >= (1 << k)
      output bit "1" to binstream;
      SE = SE - (1 << k);
      k = k + 1;
    else
      while k >= 0
        output bit to binstream with value = ((SE >> k) & 0x01);
        k = k - 1;
      endwhile
      break;
    endif
  endwhile
  return with binstream
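A sketch of how Algorithm 8.2 composes with the truncated unary prefix in UEG0, reusing the truncated_unary sketch above (buffer handling is our own convention):

    /* kth-order Exp-Golomb, transcribed from Algorithm 8.2. */
    static int exp_golomb_k(unsigned v, unsigned k, unsigned char *bins) {
        int n = 0;
        for (;;) {
            if (v >= (1u << k)) {
                bins[n++] = 1;                /* unary prefix bit */
                v -= 1u << k;
                k++;
            } else {
                int j;
                for (j = (int)k; j >= 0; j--) /* "0" separator + k binary bits */
                    bins[n++] = (v >> j) & 1;
                return n;
            }
        }
    }

    /* UEG0: truncated unary prefix (S = 14) plus an EG0 suffix for v >= 14.
     * E.g. v = 18 -> fourteen 1s followed by "11001", matching the text. */
    static int ueg0(unsigned v, unsigned char *bins) {
        int n = truncated_unary(v < 14 ? v : 14, 14, bins);
        if (v >= 14)
            n += exp_golomb_k(v - 14, 0, bins + n);
        return n;
    }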
8.1.1.3 Context Modeler
The context modeler generates a context index (CtxIdx) for each bin to look up its probability model. The modeler contains 399 probability models (as defined in the H.264/AVC main profile), which are used to estimate the probability of a bin. The probability is then passed to the binary arithmetic encoder to generate the output bit-stream. We employ two equations to calculate CtxIdx. The first is "CtxIdx = CtxIdxOffset + CtxIdxInc," and the second is "CtxIdx = CtxIdxOffset + CtxIdxInc + CtxBlkCatOffset." The CtxIdxOffset value depends on the SE type. The values of CtxIdxInc and CtxBlkCatOffset depend on four reference schemes. The first equation is used in the first three reference schemes; the second is used in the last scheme.
Reference to Syntax Elements in the Neighboring Macroblock

The two parameters, CtxIdx Left-Inc and CtxIdx Top-Inc, are derived from SEs of the neighboring (top and left) macroblocks or subblocks. Their sum gives CtxIdxInc.
Reference to Previously Coded Bin Values

In this scheme, CtxIdxInc is derived from some previously coded bin values.
Reference to Syntax Elements in Previously Encoded Macroblocks

CtxIdxInc is derived from some SEs of the macroblock that precedes the current macroblock in encoding order. When the current macroblock is the first macroblock of a slice, there is no preceding macroblock.
Reference to Syntax Elements in the Same Coefficient Block

This scheme is used for calculating the CtxIdxInc and CtxBlkCatOffset of the second equation. In Table 8.1, CtxBlkCatOffset is derived from two parameters: the SE type and the context category. The context category is decided by the residual block type, e.g., Luma DC, Luma AC, or Chroma DC, as shown in Table 8.2. For calculating CtxIdxInc, two methods are used. The first relies on the position in the scanning path and is applied to the SE types sig_coeff_flag and last_sig_coeff_flag. The second relies on two variables and is applied to the coeff_level type: one variable accumulates the number of coeff_level SEs whose value is greater than 1; the other accumulates the number of coeff_level SEs whose value equals 1.
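A deliberately simplified sketch of this two-counter selection for coeff_abs_level_minus1 bins; the increments below follow the common-case pattern, but special cases in the standard (e.g., chroma DC) are omitted:

    /* num_gt1/num_eq1 are the two accumulators described above. */
    static int level_ctx_inc(int bin_idx, int num_gt1, int num_eq1) {
        if (bin_idx == 0)
            return (num_gt1 != 0) ? 0 : ((1 + num_eq1 < 4) ? 1 + num_eq1 : 4);
        return 5 + ((num_gt1 < 4) ? num_gt1 : 4);
    }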
Table 8.1 CtxBlkCatOffset values as a function of context categories and SE types

SE type                  Context category:  0   1   2   3   4
coded_block_flag                            0   4   8  12  16
sig_coeff_flag                              0  15  29  44  47
last_sig_coeff_flag                         0  15  29  44  47
coeff_abs_level_minus1                      0  10  20  30  39
Table 8.2 Basic block types and associated context categories

Block type                     Context category
Luma DC for intra 16x16        0
Luma AC for intra 16x16        1
Luma for intra 4x4 and inter   2
Chroma DC                      3
Chroma AC                      4
8.1.1.4 Binary Arithmetic Encoder

CABAC employs a recursive interval subdivision procedure, as illustrated in Fig. 8.2, to produce a series of bits. It uses two variables, Low and Range, to specify the subinterval. In addition, since every bin is either the Most Probable Symbol (MPS) or the Least Probable Symbol (LPS), the subinterval is further divided into two parts, RangeMPS (rMps) and RangeLPS (rLps). rMps is used as the new subinterval if the bin is an MPS; otherwise, rLps is used. The new subinterval is renormalized if it becomes too small; several bits of Low are output to the bit-stream during each renormalization. The three arithmetic coding engines employed by the binary arithmetic encoder are described next. Their main difference lies in the source of the input probability model: both the bypass and terminal coding engines use predefined probabilities, while the regular engine uses the probability computed by the context modeler.
Regular Coding Engine

Algorithm 8.3 depicts the regular coding engine. Because this engine must fetch and update the probability model, it is more complex than the others. In the first step, it uses the rangeTabLPS table to simplify the calculation of Range x Probability and obtain an approximate RangeLPS value. In the second step, the new Low and Range are computed according to whether the bin is the MPS. The third step updates the probability model (i.e., MPS and pStateIdx). The last step, RenormE, presented in Algorithm 8.4, performs the renormalization process. The renormalization process consists of shifts, subtractions, and the PutBit function. If the value of Range becomes too small, RenormE renormalizes it and outputs the overflow bits of Low to the bit-stream.
Fig. 8.2 Interval subdivision procedure in binary arithmetic encoding

Algorithm 8.3: The regular coder.
regular coding engine (Range, Low, context_data, bin)
  q = (Range >> 6) & 3;
  RangeLPS = rangeTabLPS[pStateIdx][q];
  RangeMPS = Range - RangeLPS;
  if bin != MPS
    Low = Low + RangeMPS;
    Range = RangeLPS;
    if pStateIdx == 0
      MPS = 1 - MPS;
    endif
    pStateIdx = transIdxLPS[pStateIdx];
  else
    Range = RangeMPS;
    pStateIdx = transIdxMPS[pStateIdx];
  endif
  RenormE(Range, Low);
Bypass Coding Engine

The probability for the bypass engine is predefined as 1/2, so the new Range would be Range/2. However, as shown in Algorithm 8.5, the renormalization process is combined with this engine's interval subdivision procedure, so Range is left unchanged (Low is doubled instead). Finally, output bits of Low are generated.
Terminal Coding Engine

The terminal engine, depicted in Algorithm 8.6, is similar to the regular engine except that its RangeLPS always equals 2. In addition, at the end of each slice, this engine flushes the remaining undetermined output bits (bitOutstanding bits).
Algorithm 8.4: The renormalization process.
RenormE (Range, Low)
  while Range < 256
    if Low < 256
      output one bit with value = 0;
    else if Low >= 512
      Low = Low - 512;
      output one bit with value = 1;
    else
      Low = Low - 256;
      output a bitOutstanding bit;
    endif
    Range = Range << 1;
    Low = Low << 1;
  endwhile
Algorithm 8.5: The bypass coder.
bypass coding engine (Range, Low, bin)
  Low = Low << 1;
  if bin != 0
    Low = Low + Range;
  endif
  if Low >= 1024
    Low = Low - 1024;
    output one bit with value = 1;
  else if Low < 512
    output one bit with value = 0;
  else
    Low = Low - 512;
    output a bitOutstanding bit;
  endif
8.1.1.5 CABAC Encoding Flow
Figures 8.3-8.6 show the flow charts for encoding all types of SEs in a macroblock. The top-level encoding flow, shown in Fig. 8.3, is divided into two subflows, the I flow and the P/B flow, based on the macroblock type (mb_type) SE. All coefficient SEs are encoded in both flows, but the macroblock-information (mb_info) SEs in the two flows are mostly different. We show the details of the I and P/B mb_info SE encoding flows in Figs. 8.4 and 8.5, respectively. Figure 8.4 shows the I mb_info SE encoding flow. If the mb_type SE is intra 4x4, CABAC encodes the intra4x4_pre_mode SE and the rem_intra_pre_mode SE; otherwise, it encodes the chroma_pre_mode SE directly. In Fig. 8.5, P mb_info encoding is a subset of B mb_info encoding.
Algorithm 8.6: The terminal coder.
  terminal_coder(Range, Low, bin)
    Range = Range - 2;
    if bin != 0
      Low = Low + Range;
      Range = 2;
      RenormE(Range, Low);
      flush remaining undetermined bits;
    else
      RenormE(Range, Low);
    endif
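For concreteness, the following is a minimal C sketch of Algorithms 8.3–8.6 under a few assumptions: the 64-entry rangeTabLPS and transIdx tables come from the H.264/AVC standard and are only declared here, and emit() is a placeholder bit sink, not part of any real API.

  #include <stdint.h>

  extern const uint8_t rangeTabLPS[64][4];            /* tables from the standard */
  extern const uint8_t transIdxLPS[64], transIdxMPS[64];
  extern void emit(int bit);                          /* placeholder output sink */

  static uint32_t Range = 510, Low = 0;               /* coder state after init */
  static unsigned bitsOutstanding = 0;

  static void put_bit(int b) {                        /* PutBit: resolves pending bits */
      emit(b);
      while (bitsOutstanding) { emit(!b); bitsOutstanding--; }
  }

  static void renormE(void) {                         /* Algorithm 8.4 */
      while (Range < 256) {
          if (Low < 256)       put_bit(0);
          else if (Low >= 512) { Low -= 512; put_bit(1); }
          else                 { Low -= 256; bitsOutstanding++; }
          Range <<= 1;  Low <<= 1;
      }
  }

  void encode_regular(int bin, uint8_t *pStateIdx, uint8_t *valMPS) {
      uint32_t rLps = rangeTabLPS[*pStateIdx][(Range >> 6) & 3];
      Range -= rLps;                                  /* Range now holds rMps */
      if (bin != *valMPS) {                           /* LPS path */
          Low += Range;  Range = rLps;
          if (*pStateIdx == 0) *valMPS = 1 - *valMPS;
          *pStateIdx = transIdxLPS[*pStateIdx];
      } else {                                        /* MPS path */
          *pStateIdx = transIdxMPS[*pStateIdx];
      }
      renormE();
  }

  void encode_bypass(int bin) {                       /* Algorithm 8.5 */
      Low <<= 1;
      if (bin) Low += Range;
      if (Low >= 1024)     { Low -= 1024; put_bit(1); }
      else if (Low < 512)  put_bit(0);
      else                 { Low -= 512; bitsOutstanding++; }
  }

  void encode_terminal(int bin) {                     /* Algorithm 8.6 */
      Range -= 2;
      if (bin) { Low += Range; Range = 2; }
      renormE();                   /* the slice-end flush is omitted here */
  }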
Fig. 8.3 The top-level CABAC encoding flow for all SEs in one macroblock
In each P (or B) macroblock, there are at most 16 (or 16 × 2) motion vector difference (MVD) SEs. Each MVD SE contains two components, horizontal and vertical, so the maximum number of MVD values is 32 (or 32 × 2). There are four types of SEs related to residual data: coded block flag (CBF), significant coefficient flag (sig_coeff_flag), last significant coefficient flag (last_sig_coeff_flag), and coefficient level (coeff_level). We illustrate the encoding flow in Fig. 8.6. Each coeff_level SE corresponds to a pair of sig_coeff_flag and last_sig_coeff_flag SEs. If any coeff_level SE in a coefficient block is
Fig. 8.4 I mb_info SEs encoding flow
nonzero, the CBF SE of that block equals 1. There are at most 16 sig_coeff_flag SEs, 16 last_sig_coeff_flag SEs, and 16 coeff_level SEs in a coefficient block. There can be up to 27 coefficient blocks, which makes the number of residual SEs much larger than that of the other SE types.
8.1.2 Subtasks Processing Order

Figure 8.7 shows the order of processing subtasks. In the beginning, the encoder spends 1 cycle on binarization. Context modeling then takes 1–4 cycles, because for certain SEs the encoder must refer to neighboring SEs. Afterward, for each regular bin, it reads the context table, performs regular coding, and updates the context table sequentially. Bypass coding and terminal coding each take 1 cycle.
8.1.3 Design Consideration

We calculate the throughput requirement for real-time encoding of QFHD video. We assume the maximum compression ratio (bin/bit) is 1.8 according to both
Fig. 8.5 P/B mb_info SEs encoding flow
[57] and our own experimental results. The maximum bin-rate is thus 240 Mbit/s × 1.8 bin/bit = 432 Mbin/s. To meet this throughput requirement, we must design a high-throughput Binary Arithmetic Encoder Engine (BAE) that can process 2 bins/cycle. Before designing the architecture, we analyze the number of bins of each SE type for six different sequences using the reference software JM 11.0 with search range = [63, 64], GOP = IBPB, and frame rate = 30 fps. The bin distribution of each SE type is shown in Fig. 8.8. The bins of SEs of the sig_coeff_flag, last_sig_coeff_flag, and coeff_level types account for up to 88.46% of the total bins. This percentage increases with the sequence's bit-rate (i.e., with quality). In P or B macroblocks, the bins of the MVD type can account for as much as 44.04%. Based on these observations, we conclude that these four SE types appear much more frequently
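As a quick check of this requirement (assuming the 222-MHz clock that our synthesized design reaches in Sect. 8.3.3), the required per-cycle throughput works out as follows:

  #include <stdio.h>

  int main(void) {
      double bitrate_mbps  = 240.0;   /* Level 5.1 maximum bit-rate */
      double bins_per_bit  = 1.8;     /* assumed maximum bin/bit ratio */
      double clock_mhz     = 222.0;   /* clock of our synthesized design */
      double mbins_per_sec = bitrate_mbps * bins_per_bit;   /* 432 Mbin/s */
      printf("required: %.2f bins/cycle\n", mbins_per_sec / clock_mhz);
      return 0;                       /* prints about 1.95, hence 2 bins/cycle */
  }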
Fig. 8.6 Coefficient SEs encoding flow
than the others, so their efficient processing is the most critical. In addition, certain SE types, such as CBF, also consume many cycles. We describe how to encode these SE types efficiently in the following sections.
8.2 Related Works

Unlike CABAC decoder designs [8, 12], the bottleneck of CABAC encoding lies in the recursive arithmetic coding, so encoder performance is largely determined by the throughput of the Binary Arithmetic Coding Engine (BAE). We summarize some previous works in Table 8.3. They use different pipelined architectures and memory access schemes to eliminate data dependencies between stages. Several works focus on processing multiple bins in parallel. Shojania [60] proposes a one-bin BAE that takes a variable number of cycles for bypass and regular coding using a six-stage pipelined architecture. Since it does not deal with pipeline hazards, its average throughput is only 0.38 bins/cycle. Chen's design [11] performs binarization and context modeling in parallel, as Liu et al. [40] do, and merges the three BAE modes to reduce hardware cost.
Fig. 8.7 Order of processing subtasks of CABAC encoder
It achieves a throughput of 0.55 bins/cycle with a 13.3K-gate BAE. The dynamic pipelined architecture proposed by Li [50] adopts an interstage register bypass scheme to eliminate memory access conflicts, but its renormalization stage takes more than 1 cycle, so its throughput is only 0.58 bins/cycle. Liu et al. [40] proposed the first full-hardware CABAC encoder. It uses a one-bin BAE and employs a forwarding scheme for context memory access to realize a hazard-free pipelined architecture that achieves a throughput of 1 bin/cycle. By analyzing the distribution of bins among SE types, Lo [49] proposed a two-bin BAE to increase throughput and a context memory arrangement scheme to increase memory bandwidth. Osorio [55] proposed a multibin BAE that can encode two dual bins together, reaching a throughput of up to 1.9 bins/cycle. Chen [15] proposed a novel ML-decomposed (MPS-encoder and LPS-encoder decomposed) BAE with a throughput-selection architecture; this design can process up to 840 Mbin/s.
Fig. 8.8 Bin distribution of each SE type for QP = 28 HD video: (a) I macroblock, (b) P macroblock, and (c) B macroblock. [Bar charts over the sequences Rushhour, Station, Pedestrian, Sunflower, Bluesky, and Riverbed (bit-rates 9.3–34.3 Mbit/s); legend: mvd, last_sig., sig., coeff_level, other.]

Table 8.3 Summary of related works

                              Methods for increasing BAE throughput          Hardwired acceleration
  Reference      Pipelined BAE   Context memory access optimization          BCM        BAE
  Shojania [60]  One-bin         No                                          No         Yes
  Chen [11]      One-bin         No                                          Yes        Yes
  Li [50]        One-bin         Yes                                         No         Yes
  Liu [40]       One-bin         Yes                                         Yes        Yes
  Lo [49]        Two-bin         Yes                                         Unknown    Yes
  Osorio [55]    Multibin        Yes                                         Partial    Yes
  Chen [15]      Multibin        Yes                                         No         Yes
8.3 A VLSI Design for CABAC Encoder

8.3.1 Subtasks Scheduling

Figure 8.9 shows the subtask schedule we devised before designing our CABAC encoder architecture. We plan to perform binarization and context modeling concurrently. The encoder spends 1 or 2 cycles on binarization, depending on the SE type. For context modeling, it uses either 1 cycle to generate the context index of a bin or 4 cycles for certain SEs that must refer to neighboring blocks' SEs. Encoding one regular bin sequentially takes 3 cycles. To reduce the cycle count, we employ a pipelined architecture that encodes multiple bins in 1 cycle. In addition, while the encoder spends multiple cycles on arithmetic encoding, it can perform binarization and context modeling of the next SE. Therefore, the encoder performs arithmetic encoding, binarization, and context modeling concurrently.
Fig. 8.9 Subtasks scheduling
Fig. 8.10 Block diagram of the proposed CABAC encoder
8.3.2 Architecture

Figure 8.10 shows the block diagram of our proposed CABAC encoder. There are three major modules: the Binarization and Context Modeler (BCM), the Parallel-In-Parallel-Out (PIPO) buffer, and the Binary Arithmetic Encoder Engine (BAE). At the beginning of a slice, the encoder uses the initial-context-memory module to initialize the two context memory banks. Then, it starts to encode the macroblocks of the slice. First, the Preload-getNei-Mem-AG module loads the SEs of the neighboring macroblocks from the two Neighbor-SE memories. At the same time, the BCM reads SEs from the Mb-info memory or the Run-level-encode module. The BCM then generates up to 14 BCM Output Data Structures (BCMODS) in parallel and puts them into the PIPO buffer. Concurrently, the BAE gets 0, 1, or 2 BCMODSs from the PIPO buffer and encodes them into the bit-stream. Figure 8.11 depicts the three kinds of BCMODS. The first kind contains one bin, its CtxIdx, and its mode (traditional regular or terminal mode). The second contains 1–4 bins, the number of valid bins, and the mode (several-bypass mode). The third contains one regular bin plus one bypass bin, the regular bin's CtxIdx, and the mode (regular–bypass mode). Thus, our BAE can encode a combination of a regular bin and a bypass bin, or a combination of 1–4 bypass bins. Since our design can process two BCMODSs simultaneously, it is a multibin BAE. The following subsections explain BCMODS generation for the most frequent SE types, namely sig_coeff_flag, last_sig_coeff_flag, coeff_level, CBF, and MVD. We then present the architecture of our multibin BAE in the last subsection.
[Fig. 8.11 layout: regular/terminal mode — R_bin (1 bit), context index (9 bits, [11:3]), mode (3 bits, [2:0]); several-bypass mode — B_bins (4 bits, [13:10]), bypass length (3 bits, [9:7]), unused (4 bits, [6:3]), mode (3 bits, [2:0]); regular–bypass mode — R_bin (1 bit, [13]), B_bin (1 bit, [12]), context index (9 bits, [11:3]), mode (3 bits, [2:0]).]
Fig. 8.11 Three kinds of proposed BCMODS
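As an illustration, a C sketch of packing these formats into 14-bit words follows; the 3-bit mode code values and the regular-mode R_bin position are our assumptions, since the text does not fix them:

  #include <stdint.h>

  /* Hypothetical 3-bit mode codes (not specified in the text). */
  enum { MODE_REGULAR = 0, MODE_TERMINAL = 1,
         MODE_SEVERAL_BYPASS = 2, MODE_REGULAR_BYPASS = 3 };

  /* Regular/terminal mode: R_bin, 9-bit context index [11:3], mode [2:0]. */
  static inline uint16_t pack_regular(unsigned bin, unsigned ctxIdx, unsigned mode) {
      return (uint16_t)(((bin & 1) << 12) | ((ctxIdx & 0x1FF) << 3) | (mode & 7));
  }

  /* Several-bypass mode: up to 4 bypass bins [13:10] plus their count [9:7]. */
  static inline uint16_t pack_several_bypass(unsigned bins, unsigned len) {
      return (uint16_t)(((bins & 0xF) << 10) | ((len & 7) << 7)
                        | MODE_SEVERAL_BYPASS);
  }

  /* Regular-bypass mode: R_bin [13], B_bin [12], context index [11:3]. */
  static inline uint16_t pack_regular_bypass(unsigned r_bin, unsigned b_bin,
                                             unsigned ctxIdx) {
      return (uint16_t)(((r_bin & 1) << 13) | ((b_bin & 1) << 12)
                        | ((ctxIdx & 0x1FF) << 3) | MODE_REGULAR_BYPASS);
  }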
8.3.2.1 BCMODS Generation for sig_coeff_flag and last_sig_coeff_flag Types
The context modeling processes for these two SE types are simple because they need not refer to neighboring or preceding SEs. In addition, SEs of both the sig_coeff_flag and the last_sig_coeff_flag types contain only one bin each. Therefore, the BCM can generate their BCMODS easily.
8.3.2.2 BCMODS Generation for coeff_level Type
The context modeling process of a coeff_level SE refers to the previously coded coeff_level SEs in the same coefficient block, without referring to neighboring SEs. The CtxIdxes of most coeff_level bins are the same, so generating the CtxIdx is easy. The binarization of coeff_level SEs combines the truncated unary and the 0th-order Exp-Golomb (EG0) schemes. Bins binarized by the EG0 scheme are of bypass mode. Because there can be up to 33 bypass bins, we simplify the EG0 scheme for coeff_level SEs. We assume that the bin between the prefix and the suffix part is "1" instead of the original "0," which yields

  {1, suffix part} = |coeff_level| - 14.   (8.1)

Moreover, we can easily generate the prefix part because it is just a series of "1"s whose length equals the suffix part's length. For example, when a coeff_level SE is 19, we generate its bypass bins "11,0,01" from 19 - 14 = 5. First, we take the suffix part, "01," from "1,01," the binary representation of the difference 5. Second, we form the prefix part, "11," according to the length of the suffix part "01."
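The following C sketch reproduces this simplified EG0 bin generation for levels above 14; the function name is ours, and the routine implements (8.1) directly:

  #include <stdio.h>

  /* Emit the EG0 bypass bins for an absolute coeff_level > 14. */
  void eg0_bypass_bins(unsigned abs_level) {
      unsigned v = abs_level - 14;        /* e.g., 19 - 14 = 5 = binary 101 */
      int msb = 31;
      while (!(v >> msb)) msb--;          /* position of the leading 1 (v >= 1) */
      for (int i = 0; i < msb; i++)       /* prefix: msb ones, same length as suffix */
          putchar('1');
      putchar('0');                       /* separator bin */
      for (int i = msb - 1; i >= 0; i--)  /* suffix: the bits of v below its leading 1 */
          putchar((v >> i) & 1 ? '1' : '0');
      putchar('\n');                      /* abs_level = 19 prints "11001" */
  }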
8.3.2.3 BCMODS Generation for CBF Type
A CBF SE contains just one regular bin, but its context modeling takes many cycles to access the neighboring blocks' CBF SEs. The CBF SE of a neighboring block is invalid if the current block lies on the picture boundary or if the neighboring block is skipped according to its coded_block_pattern SE. To simplify data access, we redefine the addresses of all CBF SEs of luma 4 × 4 blocks as shown in Fig. 8.12a. The left neighbor of the current block has address "currentAddr - 1" and the top neighbor has address "currentAddr - 5." The CBF SEs and the CtxIdx Left-Inc/Top-Inc parameters share a 24-entry buffer indexed by the redefined addresses. Such resource sharing is feasible because, at the same address, the value of the CtxIdx Left-Inc/Top-Inc parameter equals that of a valid CBF SE. In the case of an invalid CBF SE, on the other hand, the value of the CtxIdx Left-Inc/Top-Inc parameter overrides the buffer entry of that address, as shown in Fig. 8.12b. All CtxIdx Left-Inc/Top-Inc values related to such specific SEs as mb_type and coded_block_pattern are precalculated before the CBF SEs are encoded. Since this
Fig. 8.12 Simplified data access of CBF SEs in a luma AC 4 × 4 block: (a) redefined addresses, (b) 24-entry buffer
buffer stores all parameters according to the redefined addresses, we can use it to fetch the required parameters (the current CBF SE, CtxIdx Left-Inc, and Top-Inc) and compute the bin and its CtxIdx in 1 cycle. This new CtxIdx generation method is also adopted for other block types (e.g., chroma AC).
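A minimal sketch of this one-cycle lookup follows, assuming the standard coded_block_flag context increment ctxIdxInc = Left-Inc + 2 × Top-Inc and the redefined 1-based addresses of Fig. 8.12 (helper names are ours):

  #include <stdint.h>

  /* 24-entry shared buffer (1-based addresses as in Fig. 8.12); each entry
     holds either a valid neighboring CBF value or a precomputed Inc. */
  static uint8_t buf[25];

  unsigned cbf_ctxidx_inc(unsigned currentAddr, unsigned cbf_bin) {
      unsigned leftInc = buf[currentAddr - 1];   /* left-neighbor entry */
      unsigned topInc  = buf[currentAddr - 5];   /* top-neighbor entry  */
      buf[currentAddr] = (uint8_t)cbf_bin;       /* store current CBF for later blocks */
      return leftInc + 2 * topInc;               /* context index increment */
  }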
8.3.2.4 BCMODS Generation for MVD Type
The binarization of MVD SEs employs the truncated unary and the 3rd-order Exp-Golomb (EG3) schemes and is similar to that of coeff_level SEs. Therefore, we can simplify the EG3 scheme of MVD SEs in the same way. An MVD SE is composed of a horizontal MVD (mvd_h) and a vertical MVD (mvd_v). The get-neighbor processes of mvd_h and mvd_v access the same neighboring blocks to generate their CtxIdx Left-Inc/Top-Inc, so we can fetch the neighboring blocks' mvd_h and mvd_v SEs concurrently.
8.3.2.5 Pipelined Multibin BAE Architecture
We propose a multibin BAE that can process two BCMODSs simultaneously. The BAE is a six-stage pipelined architecture whose block diagram is shown in Fig. 8.13. The first stage translates a bin's CtxIdx into a context-memory-bank address. The second stage reads the bin's MPS and pStateIdx from the context memory. The third stage determines whether the bin is an MPS and reads two combinational tables, the transIdxLPS and rangeTabLPS tables, to get the next pStateIdx and four RangeLPS values, respectively. In order to process two BCMODSs without lengthening the
Fig. 8.13 Proposed six-stage pipelined BAE
critical path, we divide the process of updating Range and Low into two stages. The fourth stage (Update Range) and the fifth stage (Update Low) perform interval subdivision and renormalization. They also output the overflow bits of Low (partitioned into determined and bitOutstanding bits). Finally, the sixth stage (Generate bit-stream) collects these overflow bits and resolves the bitOutstanding ones to generate a bit-stream as a sequence of bytes. A traditional renormalization process performs several iterations of shift, subtraction, and PutBit functions, where the PutBit function handles bitOutstanding bits and outputs bits. Such a renormalization process is difficult to implement as a combinational circuit, especially for a multibin BAE. We propose an efficient renormalization method that is easy to implement in our architecture. It consists of four subprocesses, renormalizing Range, renormalizing Low, processing the overflow bits of Low, and generating the bit-stream, scheduled into the fourth, fifth, and sixth pipeline stages. In the following, we describe the detailed implementation of each stage of the proposed multibin BAE.
The First Three Stages

In our BAE, the two BCMODSs may contain two regular bins. In order to read two context data entries concurrently, we partition the context memory into two banks. The first stage therefore translates each bin's CtxIdx into a context-memory-bank address. The second stage reads the two memory banks to get each bin's MPS and pStateIdx. Finally, the third stage decides whether each bin is an MPS, gets four RangeLPS values by table lookup, and generates the new MPS and pStateIdx to update the memory banks. The architectures of the second and third stages are shown in Fig. 8.14. We employ a forwarding circuit in these two stages to prevent read-after-write hazards. Our multibin BAE can process any two BCMODSs in parallel as long as they do not concurrently fetch different entries from the same bank (i.e., a read-memory collision).
The Fourth Stage (Update Range) and the Fifth Stage (Update Low)

Table 8.4 gives the equations for calculating Next-Range and Inter-Low when processing one regular or one bypass bin. We use Inter-Low to denote a Low value that still contains overflow bits. Table 8.5 shows the simplified equations, derived from Table 8.4, for processing the proposed BCMODS types. We divide the updating process into two stages, Update Range (the fourth stage) and Update Low (the fifth stage). All operands (except the current Low) in the Inter-Low equations are generated by the Next-Range equations, so we can process two BCMODSs without lengthening the critical path.
Fig. 8.14 Architectures of the accessing context memory stage and the looking up table stage (the second and third stages)
Table 8.4 The equations of Next-Range and Inter-Low for processing one regular or one bypass bin

  Mode      Bin type     Next-Range    Inter-Low               Overflow-bits Len.
  Bypass    Bypass(1)    Range         (Low << 1) + Range      1
  Bypass    Bypass(0)    Range         Low << 1                1
  Regular   MPS          rMps << Lz    Low << Lz               Lz
  Regular   LPS          rLps << Lz    (Low + rMps) << Lz      Lz
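Read as software, the regular-bin rows of Table 8.4 amount to the following sketch, where Lz is obtained from a leading-one detector on the new 9-bit range (function names are ours):

  #include <stdint.h>

  /* Shift that brings a nonzero 9-bit range back into [256, 511]. */
  static unsigned norm_shift(uint32_t r) {
      unsigned lz = 0;
      while ((r << lz) < 256) lz++;
      return lz;
  }

  void update_regular_bin(uint32_t Low, uint32_t rMps, uint32_t rLps, int isMPS,
                          uint32_t *nextRange, uint32_t *interLow, unsigned *ovLen) {
      uint32_t newRange = isMPS ? rMps : rLps;
      unsigned lz = norm_shift(newRange);
      *nextRange = newRange << lz;                    /* renormalized range */
      *interLow  = (isMPS ? Low : Low + rMps) << lz;  /* Low with overflow bits */
      *ovLen     = lz;                                /* bits that spill past bit 9 */
  }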
The architecture of the fourth stage is shown in Fig. 8.15. Because all operands in the Next-Range equations are derived from the current Range, there is a strong data dependency between the two Range updating processes; therefore, as shown in Fig. 8.15, we employ two concatenated units for encoding two BCMODSs. The architecture of the fifth stage is shown in Fig. 8.16. It implements the Inter-Low equations of Table 8.5 and outputs the overflow bits to the next stage. We implement the underlined part of the equations as "Low-Inc" and assign the multipliers to the fourth stage, as shown in Fig. 8.15, to shorten the critical path of the fifth stage. We use "Inter-Low1" and "Inter-Low2," shown in Fig. 8.16, to represent the result of the Inter-Low equation for one or two BCMODSs, respectively. Figure 8.17 illustrates the processing of the overflow bits of Inter-Low1/Inter-Low2. Because the minimum register precision required for Low is 10 bits, the other bits,
Table 8.5 Simplified equations of Next-Range and Inter-Low for processing multiple BCMODSs

  Mode             Bin type          Next-Range   Inter-Low                                              Overflow-bits Len.
  Regular-bypass   MPS + Bypass(0)   rMps << Lz   Low << (Lz + 1)                                        Lz + 1
  Regular-bypass   LPS + Bypass(0)   rLps << Lz   (Low << (Lz + 1)) + (rMps << (Lz + 1))                 Lz + 1
  Regular-bypass   MPS + Bypass(1)   rMps << Lz   (Low << (Lz + 1)) + (rMps << Lz)                       Lz + 1
  Regular-bypass   LPS + Bypass(1)   rLps << Lz   (Low << (Lz + 1)) + (rMps << (Lz + 1)) + (rLps << Lz)  Lz + 1
  Several-bypass   1-4 bypass bins   Range        (Low << bins num.) + (Range × bins value)              1-4
  Regular          MPS               rMps << Lz   Low << Lz                                              Lz
  Regular          LPS               rLps << Lz   (Low << Lz) + (rMps << Lz)                             Lz
Fig. 8.15 The architecture of the fourth stage (Update Range)
called overflow bits, must be output to the next stage. However, if we output these bits without extra processing, a carry-out problem can occur. To resolve this problem, we check the bits of Low with indices larger than 8 (i.e., from index 9 up to index 9 + Overflow-bits Len.) from right to left. The first zero encountered is referred to as the "dividing zero-bit." For example, in Fig. 8.17a, the dividing zero-bit is at index 12. Within the overflow bits, the bits to the left of the dividing zero-bit are determined bits; the remaining bits are undetermined and are known as bitOutstanding bits.
Fig. 8.16 The architecture of the fifth stage (Update Low)
After the overflow bits are processed, the Next-Low is as shown in Fig. 8.17b. When there is a dividing zero-bit, the ninth bit of Low (i.e., the most significant bit of Low) is set to "0" to accommodate the next generated carry-out; otherwise, it is unchanged.

The Sixth Stage (Generate Bit-Stream)

The sixth stage collects all overflow bits and outputs the bit-stream as a sequence of bytes, as shown in Fig. 8.18. If the incoming overflow bits contain determined bits, the previously pending bitOutstanding bits become determined as well: they resolve to "01..1" or "10..0" depending on the Most Significant Bit (MSB) of the overflow bits. They become "01..1" if the MSB of the overflow bits is 0; otherwise, they become "10..0."
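This rule is the classic carry-propagation handling of arithmetic coders; a minimal sketch (emit() is again a placeholder bit sink):

  extern void emit(int bit);           /* placeholder output sink */
  static unsigned outstanding = 0;     /* pending bitOutstanding count */

  /* Called when a determined bit arrives; the pending outstanding bits
     then resolve to its complement, yielding "10..0" or "01..1". */
  void emit_determined(int bit) {
      emit(bit);
      while (outstanding) { emit(!bit); outstanding--; }
  }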
8.3.3 Evaluation

Our design is synthesized toward a TSMC 0.13-μm CMOS cell library in an ASIC flow. In the worst-case corner, the 222-MHz circuit takes 45,981 gates. The results for encoding 1080 HD sequences with GOP = IBPB are shown in Fig. 8.19. The throughput of our design is directly proportional to the bin-rate; that is, our CABAC throughput (bins/cycle) increases as the quality of the same sequence increases. Therefore, we save more processing cycles on high-quality video.
Fig. 8.17 Symbolic representation of processing the overflow of Inter-Low: (a) carry-out problem, (b) Next-Low
8.4 Summary

We have described a full-hardware CABAC encoder. Based on bin distribution analysis, we propose several methods to decrease the computational complexity of the CABAC encoder, increasing its throughput without lengthening its critical path. The CABAC encoder contains an SE-specific cycle-reduction context modeler and a multibin BAE. To realize the multibin BAE, we propose an optimized context memory access scheme. We also design a PIPO buffer that lets our BCM and BAE operate without unnecessary stalls.
Fig. 8.18 The architecture of the sixth stage (generate bit-stream)
Fig. 8.19 Throughput of our CABAC encoder for encoding each sequence in HD resolution
Our BAE achieves a throughput of 2.37 bins/cycle, and our whole CABAC encoder processes 1.3 bins/cycle on average. It is capable of real-time encoding QFHD sequences at the maximum bit-rate of Main Profile Level 5.1 when running at 222 MHz. Moreover, efficient encoding of the significant SE types saves even more processing cycles on high-quality video. In the future, we plan to minimize the memory size for storing neighboring SEs, to complete the functionality for the High Profile, and to support Rate-Distortion Optimized (RDO) coding.
Chapter 9
System Integration
Abstract H.264/AVC introduces many new tools to achieve better coding performance, but at the expense of high computational complexity. Hardware cost and encoding performance are the two main challenges in designing a high-performance H.264/AVC encoder. We have proposed several high-performance architectures for the functional units of an H.264/AVC encoder. In addition, external memory management is another design issue: our encoder must access an external memory at up to 3.3 GB/s to encode 1080p HD video in real time. We propose several AMBA-compliant memory access units to access the external memory efficiently. We present our complete H.264/AVC encoder in this chapter.
9.1 Introduction

H.264/AVC is the latest video coding standard, developed jointly by ITU-T and ISO/IEC. Compared with MPEG-4, H.263, and MPEG-2, it saves up to 39%, 49%, and 64% of bit-rate [24], respectively. We describe the H.264/AVC encoding algorithm in Sect. 9.1.1 and address some design considerations in Sect. 9.1.2.
9.1.1 Algorithm

Figure 9.1 illustrates the H.264/AVC encoding process. To encode P-type or B-type macroblocks, the integer and fractional motion estimation (IME and FME) units find the best-match blocks from the reference frames for each block type of a macroblock in the current frame. The inter mode decision (InterMD) unit chooses the best block type and sends the corresponding motion vectors (MVs) to the motion compensation (MC) unit. The MC unit then compensates using the best-match block. To encode I-type macroblocks, the intra prediction (IntraPred) unit first refers to reconstructed neighboring pixels to generate the prediction pixels of each intra prediction mode. The intra mode decision (IntraMD) unit then chooses the best one.
Fig. 9.1 H.264/AVC encoding process
Macroblocks in P-type (B-type) frames can be encoded using either inter or intra modes. In other words, a P-type (B-type) frame may contain both P-type (B-type) and I-type macroblocks, whereas an I-type frame contains only I-type macroblocks. If the current frame is P-type or B-type, the macroblock mode decision (MBMD) unit compares the cost of the best inter mode with that of the best intra mode to decide the macroblock type; it then calculates the residuals and macroblock information for the best mode. Otherwise, it calculates the residuals and macroblock information for the best intra mode. The three mode decision units compute the cost of each mode based on distortion and bit-rate as shown in (9.1). The Lagrangian parameter λ is used to make a tradeoff between distortion and bit-rate; its value is derived from the quantization parameter (QP).

  Cost = Distortion + λ × Bit-rate.   (9.1)

The transform (Trans) unit transforms residuals into coefficients. The quantization (Quan) unit then quantizes the coefficients and outputs them to both the inverse quantization (InvQuan) unit and the entropy coding unit. The InvQuan unit and the
inverse transform (InvTrans) unit translate the coefficients back to residuals. The reconstruction (Recons) unit then adds the residuals to the prediction pixels of the best mode to generate reconstructed pixels. The entropy coding unit compresses the coefficients and macroblock information into an MB-level bit-stream. Finally, the variable length coding (VLC) unit encodes the configuration parameters into a bit-stream header and packs it together with the MB-level bit-stream into an H.264/AVC bit-stream for system output. If the current frame is I-type or P-type, the deblocking filter (DF) unit filters the reconstructed macroblock to reduce blocking artifacts and writes it to an external memory to form the reference frames. Algorithm 9.1 shows the H.264/AVC encoding process and Fig. 9.2 shows the corresponding flow chart.
Algorithm 9.1: H.264/AVC encoding.
  H.264/AVC Encoding (video)
    if current frame is P-type or B-type frame do
      IntegerMVs = IME(RefFrames, CurrentMB);
      FractionalMVs = FME(RefFrames, IntegerMVs, CurrentMB);
      {BestFractionalMV, MBinfo_Inter} = InterMD(FractionalMVs, CurrentMB);
      PredPixels_BestInterMode = MC(RefFrame, BestFractionalMV);
    endif
    PredPixels = IntraPred(ReconsPixels);
    {PredPixels_BestIntraMode, MBinfo_Intra} = IntraMD(PredPixels, CurrentMB);
    if current frame is P-type or B-type frame do
      {Residuals, MBinfo} = MBMD(PredPixels_BestIntraMode, PredPixels_BestInterMode, MBinfo_Intra, MBinfo_Inter, CurrentMB);
    else
      {Residuals, MBinfo} = MBMD(PredPixels_BestIntraMode, MBinfo_Intra, CurrentMB);
    endif
    {Coefficients, TransformedResiduals} = transform_coding(Residuals, QP);
    ReconsPixels = Recons(PredPixels_BestMode, TransformedResiduals);
    MB-level bit-stream = entropy_coding(Coefficients, MBinfo);
    H.264/AVC bit-stream = VLC(MB-level bit-stream);
    if current frame is I-type or P-type do
      FilteredMB = DF(ReconsPixels);
    endif

9.1.2 Design Consideration

Many functional units in H.264/AVC have high computational complexity. Unlike a decoder design, where only hardware cost and processing time matter [4, 39, 56], we also have to consider coding performance in designing an encoder. The computation loads of the IME and the FME units are proportional to the number of search candidates, the size of the search range, and the number of reference frames. For example, a full-search motion estimation algorithm that evaluates all search candidates yields the best video quality. However, it takes more cycles
Fig. 9.2 Flow chart of the H.264/AVC encoding process
and generates a large amount of memory traffic. Fast motion estimation algorithms that evaluate fewer search candidates incur lower hardware cost, but at the expense of video-quality degradation. Searching more reference frames also produces better video quality but incurs a large amount of memory traffic. The total number of bytes we must read is decided by the number of reference frames and the size of the search range. For example, if the search range is 64 × 64 and the number of reference frames is 2, we have to read 80 × 80 × 2 = 12,800 bytes of data for the IME unit. We also have to read the frames of the source video and write the reference frames. Therefore, we need an efficient mechanism to manage memory traffic.
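The 80-pixel window side in this example follows from adding the 16-pixel macroblock width to the 64-pixel search range; a one-line check:

  #include <stdio.h>

  int main(void) {
      int sr = 64, mb = 16, refs = 2;
      int side = sr + mb;                          /* 80 pixels per side */
      printf("%d bytes\n", side * side * refs);    /* prints 12800 */
      return 0;
  }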
In intra frame encoding, the IntraPred, IntraMD, Trans, Quan, InvQuan, InvTrans, and Recons units form an encoding loop. Because the IntraPred unit refers to the reconstructed pixels of the upper and left 4 × 4 blocks, we must go through the whole encoding loop 24 times to encode a macroblock. Many fast algorithms have been proposed to speed up the encoding loop, but all of them cause video-quality degradation. There are two mode decision algorithms in the H.264/AVC reference software JM11.0: the rate-distortion optimized (RDO) mode decision [62] and the low-complexity mode decision. The RDO mode decision evaluates the distortion and bit-rate by carrying out the entire encoding process. It achieves the best quality and compression rate at the expense of high computational complexity, which complicates hardware implementation. In the low-complexity mode decision, we can use either the sum of absolute differences (SAD) or the sum of absolute transformed differences (SATD) to evaluate distortion. Using SATD delivers better video quality than using SAD [21]. Equations (9.2) and (9.3) show how to calculate SAD and SATD, respectively, where H denotes the 4 × 4 Hadamard matrix.

  SAD = Σ |Orig[P(i,j)] - Pred[P(i,j)]|,  i, j = 0, ..., 3,   (9.2)

  SATD = Σ |H × (Orig[P(i,j)] - Pred[P(i,j)]) × H|,  i, j = 0, ..., 3.   (9.3)

There are also two entropy coding algorithms in H.264/AVC: context-based adaptive variable length coding (CAVLC) and context-based adaptive binary arithmetic coding (CABAC). CAVLC is used in the baseline and main profiles; CABAC is used in the main profile. CAVLC is much simpler but less effective than CABAC. According to our experiments, CABAC is 13–19% better than CAVLC in bit-rate saving for 1080p HD video. However, CABAC involves many bit-level operations to achieve its better compression ratio.
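As an illustration of (9.3), the following C sketch computes a 4 × 4 SATD with separable Hadamard butterflies (the 1/2 normalization that some implementations apply is omitted, matching the equation above):

  #include <stdlib.h>

  int satd4x4(const int orig[4][4], const int pred[4][4]) {
      int d[4][4], t[4][4];
      for (int i = 0; i < 4; i++)
          for (int j = 0; j < 4; j++)
              d[i][j] = orig[i][j] - pred[i][j];   /* residual block */
      /* Horizontal 4-point Hadamard butterflies. */
      for (int i = 0; i < 4; i++) {
          int a = d[i][0] + d[i][3], b = d[i][1] + d[i][2];
          int c = d[i][1] - d[i][2], e = d[i][0] - d[i][3];
          t[i][0] = a + b; t[i][1] = e + c; t[i][2] = a - b; t[i][3] = e - c;
      }
      /* Vertical butterflies, accumulating absolute transformed values. */
      int sum = 0;
      for (int j = 0; j < 4; j++) {
          int a = t[0][j] + t[3][j], b = t[1][j] + t[2][j];
          int c = t[1][j] - t[2][j], e = t[0][j] - t[3][j];
          sum += abs(a + b) + abs(e + c) + abs(a - b) + abs(e - c);
      }
      return sum;
  }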
9.2 Related Works

Several previous works [21, 29, 36, 46] propose H.264/AVC intra frame encoders. All of them aim to speed up the encoding loop of intra frame encoding. Huang et al. [21] proposed the first H.264/AVC intra frame encoder, with an efficient pipeline schedule for the intra frame encoding loop. Running at 54 MHz, it can encode SD (720 × 480) video sequences at 30 fps. Both Ku et al. [29] and Li et al. [46] adopt fast intra prediction heuristics (i.e., skipping some modes) to speed up the encoding process at the cost of quality loss. The encoder proposed by Ku et al. [29] can encode 720p HD (1,280 × 720) video at 30 fps when running at 117 MHz, but it suffers from a 0.06% bit-rate increase. The encoder proposed by Li et al. [46] runs at 61 MHz and can encode 720p HD video at 30 fps, but it suffers from a 0.68% bit-rate increase.
Other works have proposed complete H.264/AVC encoders. Chen et al. [3] proposed the first H.264/AVC encoder. It reduces the memory traffic of the IME unit by processing eight horizontally adjacent reference blocks in parallel, and it parallelizes the interpolation of some subblocks in the FME unit. Running at 108 MHz, it can encode 720p HD (1,280 × 720) video sequences at 30 fps. The design of Chang et al. [5] adopts fast algorithms for the IME, FME, and IntraPred units to reduce computational complexity. Its fast algorithms have several levels, each with its own computational complexity and resulting video quality. Its minimum working frequency for encoding 720p HD (1,280 × 720) video sequences at 30 fps is 72 MHz. Instead of an ASIC approach, Liu et al. [51] propose an SOC solution whose architecture contains a 32-bit media embedded processor and a 64-Mb embedded DRAM; it also adopts fast algorithms for the IME and FME units. Running at 200 MHz, it can encode 1080p HD (1,920 × 1,088) video sequences at 30 fps. Unlike all previous works, which implement CAVLC to support the H.264/AVC baseline profile, the encoder proposed by Lin et al. [47] is the first H.264/AVC encoder that implements a CABAC encoder. It uses a subsampling algorithm to reduce the memory traffic of the IME unit and searches fewer candidates in its FME unit. It can encode 1080p HD (1,920 × 1,088) video sequences at 30 fps when running at 145 MHz.
9.3 A VLSI Design for H.264/AVC Encoder

We propose an H.264/AVC encoder in this section. We first describe how we schedule all subtasks in Sect. 9.3.1 and then propose our hardware architecture in Sect. 9.3.2. In Sect. 9.3.3, we evaluate its performance.
9.3.1 Subtasks Scheduling

Our encoder supports two reference frames with a 64 × 64 search range. According to our experiments, using two reference frames with a 64 × 64 search range achieves about 0.02-dB better video quality than using one reference frame with a 128 × 64 search range. We also implement the CABAC encoder in our design to achieve better coding performance. To reduce the computational complexity of the CABAC unit, we do not support I-type macroblocks in P-type or B-type frames; this configuration causes a 0.07-dB PSNR drop on average. Because there are no I-type macroblocks in P-type or B-type frames, we do not have to implement the MBMD unit. Furthermore, we integrate the InterMD unit into our FME architecture. In the FME and IntraMD architectures, we use the SATD to measure distortion. We employ a two-stage pipelined architecture for encoding I-type macroblocks. The first stage includes all functional units in the encoding loop. The
Fig. 9.3 Timing diagram for encoding an I-type macroblock
second stage includes the CABAC unit, the VLC unit, and the DF unit. Figure 9.3 shows the timing diagram of our encoder for encoding an I-type macroblock. By adopting the proposed processing order for 4 × 4 blocks and increasing the data bandwidth of each functional unit, we spend 434 or 603 cycles per macroblock in the first intra pipelined stage, depending on the macroblock mode. In the second intra stage, the DF unit takes 100 cycles to process an I-type macroblock. However, the processing time of the CABAC and VLC units varies with the QP and the video content. Empirically, the average processing cycles of the CABAC and VLC units for one macroblock are 101, 159, and 317 when QP is 28, 22, and 16, respectively. We employ a four-stage pipelined architecture for encoding P-type or B-type macroblocks. The first and second inter stages contain the IME and FME units, respectively. In the third inter stage, we perform MC and transform coding. The fourth inter stage is the same as the second intra stage. Figure 9.4 shows the timing diagram of our encoder for encoding a P-type or B-type macroblock. By adopting the proposed reference frame interlace processing order, our IME architecture can encode a macroblock within 512 cycles. Our FME architecture uses
Fig. 9.4 Timing diagram for encoding P-type or B-type macroblocks
three engines to process all block types in parallel. It takes at most 644 cycles to encode a macroblock. In the third inter stage, our MC and transform coding architectures take about 450 cycles to encode a macroblock. Because the DF unit takes almost 100 cycles to process a P-type or B-type macroblock, the fourth inter stage requires almost the same number of cycles as the second intra stage. Figure 9.5 shows the timing diagram of our encoder's external memory accesses. The proposed memory access units (MAUs) prefetch the next macroblock and its search range. These MAUs contain several local memories, which they access in a ping-pong fashion. On average, it takes 208 cycles to read a macroblock of the source video, 344 cycles to read the pixels of the search range, 15 cycles to read the reference data of the proposed DF engine, 43 cycles to write the filtered macroblock, and 9 cycles to write the H.264/AVC bit-stream. To write the reference frames, we propose an MB column-based reference frame mapping. We also propose a mixed Cb–Cr mapping for chroma pixels. Both of
Fig. 9.5 Timing diagram for accessing the external memory
these methods reduce DRAM access penalties. Figure 9.6 shows the proposed reference frame format. To support two reference frames with a 64 × 64 search range, we have to read the data of the search range for the IME, FME, and MC units, as shown in Fig. 9.7. Because the MC unit receives the best MV to compensate the prediction pixels, the reference pixels it needs are a subset of the search range. Because the search ranges of two consecutive macroblocks overlap, as shown in Fig. 9.7, we only have to read the nonoverlapped area for each macroblock except the first macroblock of each row, for which we must read the whole search range. Because the VLC unit generates a different number of bits each time and outputs bits aperiodically, writing its output directly to the external memory would degrade system performance. We therefore buffer the output bit-stream and write it to the external memory whenever 400 bytes have accumulated.
9.3.2 Architecture

Figure 9.8 shows the top-level block diagram of the proposed design. It consists of two main parts: the encoder core, containing all functional units that perform encoding, and the AMBA interface, for data transfer between the encoder core and the external memory.
Fig. 9.6 Proposed reference frame format: (a) MB column-based reference mapping and (b) mixed Cb–Cr mapping
9.3.2.1 Encoder Core

Most of the functional blocks inside the encoder core are described in the corresponding chapters. We describe the remaining three engines in this section.
Fig. 9.7 Illustration of search range
The Parameter Encoding (PE) engine performs the VLC algorithm. By hardwiring this engine, we reduce the performance overhead caused by software/hardware communication. The Recons engine contains only adders and multiplexers. It sums the transformed residuals and the prediction pixels of the best mode to generate reconstructed pixels. It writes the corresponding reconstructed pixels for the intra luma 4 × 4
Fig. 9.8 Top-level block diagram of the proposed design
prediction to the IntraPred engine, the corresponding reconstructed pixels for the intra luma 16 × 16 prediction to the Unfilter memory, and all reconstructed pixels to the ReconsMB memory. Figure 9.9 shows the proposed IntraMD engine. Because our IntraPred engine generates 32 pixels for two 4 × 4 blocks at a time, there are four cost calculators in our IntraMD engine, each containing a 4 × 4 SATD calculator to obtain the distortion cost. The IntraMD engine also calculates the residuals of the best intra mode. Instead of using large amounts of memory to store all prediction pixels, our IntraPred engine can regenerate the prediction pixels of the best mode.

9.3.2.2 AMBA Interface
In addition to the design components inside the encoder core, we implement AMBA-compliant modules to deal with external-memory traffic and software/hardware
Fig. 9.9 Architecture of the proposed IntraMD engine
communication. The AMBA interface contains four MAUs, an MAU arbiter, and a command receiver. The command receiver is connected to an AMBA slave. Users send configuration parameters, including the group-of-pictures (GOP) structure, the QP, and the slice types, in a memory-address format to the command receiver, which decodes its input to obtain these parameters and forwards them to the corresponding modules in the encoder core. The MAU arbiter is connected to an AMBA master and decides which of the four MAUs may access the external memory. The four MAUs are the deblocking filter (DF) MAU, the macroblock (MB) MAU, the search range (SR) MAU, and the bit-stream (BIT) MAU. The DF MAU reads the neighboring reference data for filtering operations and writes filtered macroblocks to the external memory. The BIT MAU writes the compressed H.264/AVC bit-stream to the external memory; we use a queue-based local buffer to store the compressed bit-stream temporarily. The MB MAU reads the macroblocks of the source video from the external memory for the IME, FME, and IntraMD engines, and it contains local memory to store the pixels. Because the IME and FME engines access a luma macroblock differently than the IntraMD engine does, our MB MAU stores the luma macroblock in its local memory using different mappings for different macroblock types. Moreover, there are two processing orders for luma 4 × 4 blocks in our IntraPred engine, and our MB MAU dynamically supports both. Figure 9.10 illustrates the different memory mappings in our MB MAU. The SR MAU reads the search range from the external memory for the IME, FME, and MC engines; it accounts for most of the memory traffic. Figure 9.11 illustrates how our SR MAU reads the search range. The proposed SR MAU contains
Fig. 9.10 Memory mappings for a luma macroblock in the proposed MB MAU
fourteen 128-bit × 128-entry local SR memories. Half of them serve one reference frame. Only five SR memories are needed to store the search range of the IME engine for one reference frame; however, because the IME, FME, and MC engines work at different pipeline stages, we need two additional SR memories for the FME and MC engines.
Fig. 9.11 Illustration of reading search range in the proposed SR MAU
9.3.3 Evaluation

We evaluate the proposed design using a multimedia SOC platform (32-bit AMBA) and its behavioral model (128-bit AMBA model). We use five 1080p HD sequences, each containing 100 frames, to verify our design. Table 9.1 shows the parameter settings of the proposed design. The average number of cycles needed to process a macroblock is 632 when QP is 16. Our design can real-time encode 1080p HD
Table 9.1 Encoder parameter setting

  Maximum resolution           1,920 × 1,088
  Level                        4.0
  QP range                     0–51
  Number of reference frames   2
  Search range                 64 × 64
  I-type MB in P/B slice       Off
  Hadamard transform           On
  Mode decision                Low complexity
  Entropy encoder              CABAC
video when running at 158 MHz. Synthesized using a TSMC 0.13-μm CMOS cell library, the proposed design consumes 1,942K gates (245K for the MAUs and 1,697K for the encoder core) and 248 KB of local memory (163 KB in the MAUs and 85 KB in the encoder core).
9.4 Summary

We have proposed an H.264/AVC encoder for high-resolution video. We propose high-performance architectures in the encoder core and have designed efficient MAUs in the AMBA interface. By adopting the proposed performance-enhancement methods, our encoder can encode 1080p HD video at 30 fps in real time when running at 158 MHz.
References
1. Al A, Rao BP, Kudva S, Babu S, Suman D, Rao A (2004) Quality and complexity comparison of H.264 intra mode with JPEG2000 and JPEG. In: Proceedings of IEEE international conference on image processing, Singapore, October 2004, pp 525–528
2. Bojnordi MN, Fatemi O, Hashemi MR (2006) An efficient deblocking filter with self-transposing memory architecture for H.264/AVC. In: Proceedings of IEEE international conference on acoustics, speech and signal processing, Toulouse, France, May 2006, pp II–II
3. Chen T-C, Chien S-Y, Huang Y-W, Tsai C-H, Chen C-Y, Chen T-W, Chen L-G (2006) Analysis and architecture design of an HDTV720p 30 frames/s H.264/AVC encoder. IEEE Trans Circuits Syst Video Technol 16(6):673–688
4. Chang C-R, Chen J-W, Lo T-J, Chiu C-L, Chang Y-H, Tzeng H-C, Shih S-Y, Kao Y-C, Kao C-Y, Lin Y-L (2006) An H.264/AVC main profile hardwired decoder. In: Proceedings of the 2006 picture coding symposium, Beijing, China, April 2006
5. Chang H-C, Chen J-W, Su C-L, Yang Y-C, Li Y, Chang C-H, Chen Z-M, Yang W-S, Lin C-C, Chen C-W, Wang J-S, Guo J-I (2007) A 7 mW-to-183 mW dynamic quality-scalable H.264 video encoder chip. In: IEEE international solid-state circuits conference, Digest of technical papers, San Francisco, USA, February 2007, pp 280–603
6. Chen K-H, Guo J-I, Chao K-C, Wang J-S, Chu Y-S (2005) A high-performance low power direct 2-D transform coding IP design for MPEG-4 AVC/H.264 with a switching power suppression technique. In: Proceedings of IEEE international symposium on VLSI design, automation and test, Hsinchu, Taiwan, April 2005, pp 291–294
7. Cheng C-C, Chang T-S (2005) Fast three step intra prediction algorithm for 4 × 4 blocks in H.264. In: Proceedings of IEEE international symposium on circuits and systems, Kobe, Japan, May 2005, pp 1509–1512
8. Chen J-W, Chang C-R, Lin Y-L (2005) A hardware accelerator for context-based adaptive binary arithmetic decoding in H.264/AVC. In: Proceedings of IEEE international symposium on circuits and systems, Kobe, Japan, May 2005, pp 4525–4528
9. Chen K-H, Guo J-I, Wang J-S (2006) A high-performance direct 2-D transform coding IP design for MPEG-4 AVC/H.264. IEEE Trans Circuits Syst Video Technol 16(4):472–483
10. Chen T-C, Huang Y-W, Chen L-G (2004) Fully utilized and reusable architecture for fractional motion estimation of H.264/AVC. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, Montreal, Canada, May 2004, pp 9–12
11. Chen J-L, Lin Y-K, Chang T-S (2007) A low cost context adaptive arithmetic coder for H.264/MPEG-4 AVC video coding. In: Proceedings of IEEE international conference on acoustics, speech and signal processing, Hawaii, USA, April 2007, pp 105–108
12. Chen J-W, Lin Y-L (2007) A high-performance hardwired CABAC decoder. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, Hawaii, USA, April 2007, pp 37–40
13. Chao P, Lin Y-L (2008) A motion compensation system with a high efficiency reference frame pre-fetch scheme for QFHD H.264/AVC decoder. In: Proceedings of IEEE international symposium on circuits and systems, Seattle, USA, May 2008, pp 256–259
14. Chao P, Lin Y-L (2008) Reference frame access optimization for ultra high resolution H.264/AVC decoding. In: Proceedings of IEEE international conference on multimedia and expo, Hanover, Germany, June 2008, pp 1441–1444 15. Chen Y-J, Tsai C-H, Chen L-G (2007) Novel configurable architecture of ML-decomposed binary arithmetic encoder for multimedia applications. In: Proceedings of IEEE international symposium on VLSI design, automation, and test, Hsinchu, Taiwan, April 2007, pp 1–4 16. Chao YC, Lin JK, Yang JF, Liu BD (2006) A high throughput and data reuse architecture for H.264/AVC deblocking filter. In: Proceedings of Asia Pacific conference on circuits and systems, Singapore, December 2006, pp 1260–1263 17. Chang SC, Peng WH, Wang SH, Chiang T (2005) A platform based bus-interleaved architecture for deblocking filter in H.264/MPEG-4 AVC. IEEE Trans Consumer Electron 51(1):249–255 18. Yang C, Goto S, Ikenaga T (2006) High performance VLSI architecture of fractional motion estimation in H.264 for HDTV. In: Proceedings of IEEE international symposium on circuits and systems, Island of Kos, Greece, May 2006, pp 2605–2608 19. Huang YW, Chen TW, Hsieh BY, Wang TC, Chang TH, Chen, LG (2003) Architecture design for deblocking filter in H.264/JVT/AVC. In: Proceedings of IEEE international conference on multimedia and expo, Baltimore, USA, July 2003, pp I-693-6 20. He Z, Liou M-I (1997) Reducing hardware complexity of motion estimation algorithms using truncated pixels. In: Proceedings of IEEE international symposium on circuits and systems, Hong Kong, China, June 1997, pp 2809–2812 21. Huang Y-W, Hsieh B-Y, Chen T-C, Chen L-G (2005) Analysis, fast algorithm, and VLSI architecture design for H.264/AVC intra frame coder. IEEE Trans Circuits Syst Video Technol 15(3):378–401 22. Huang Y-W, Wang T-C, Hsieh B-Y, Chen L-G (2003) Hardware architecture design for variable block size motion estimation in MPEG-4 AVC/JVT/ITU-T H.264. In: Proceedings of IEEE international symposium on circuits and systems, Bangkok, Thailand, May 2003, pp II796–II799 23. Hou K-C, Wang S-Z, Huang Y-H, Liu T-M, Lee C-Y (2006) A bandwidth-efficient motion compensation architecture for H.264/AVC HDTV decoder. In: Proceedings of 17th VLSI design/CAD symposium, Hualian, Taiwan, August 2006 24. Joch A, Kossentini F, Schwarz H, Wiegand T, Sullivan G (2002) Performance comparison of video coding standards using Lagrangian coder control. In: Proceedings of IEEE international conference on image processing, New York, USA, September 2002, pp 501–504 25. Kao C-Y, Kuo H-C, Lin Y-L (2006) High performance fractional motion estimation and mode decision for H.264/AVC. In: Proceedings of IEEE international conference on multimedia and expo, Toronto, Canada, July 2006, pp 1241–1244 26. Kao C-Y, Lin Y-L (2008) A high-performance and memory-efficient architecture for H.264/AVC motion estimation. In: Proceedings of IEEE international conference on multimedia and expo, Hanover, Germany, June 2008, pp 141–144 27. Kao C-Y, Lin Y-L (in press) A memory-efficient and highly parallel architecture for variable block size integer motion estimation in H264/AVC. IEEE Trans Very Large Scale Integr Syst 28. Kao C-Y, Wu C-L, Lin Y-L (in press) A high performance three-engine architecture for H.264/AVC fractional motion estimation. IEEE Trans Very Large Scale Integr Syst 29. Ku C-W, Cheng C-C, Yu G-S, Tsai M-C, Chang T-S (2006) A high-definition H.264/AVC intra-frame codec IP for digital video and still camera applications. 
IEEE Trans Circuits Syst Video Technol 16(8):917–928 30. Kim M, Hwang I, Chae S-I (2005) A fast VLSI architecture for full-search variable block size motion estimation in MPEG-4 AVC/H.264. In: Proceedings of Asia and South Pacific design automation conference, Shanghai, China, January 2005, pp 631–634 31. Koga T, Iinuma K, Hirano A, Iijima Y, Ishiguro T (1981) Motion compensated interframe coding for video conferencing. In: Proceedings of national telecommunication conference, New Orleans, USA, November 1981, pp C9.6.1–C9.6.5
32. Kao Y-C, Kuo H-C, Lin Y-T, Hou C-W, Li Y-H, Huang H-T, Lin Y-L (2006) A high-performance VLSI architecture for intra prediction and mode decision in H.264/AVC video encoding. In: Proceedings of IEEE Asia-Pacific conference on circuits and systems, Singapore, December 2006, pp 562–565
33. Khurana G, Kassim AA, Ping CT, Mi MB (2006) A pipelined hardware implementation of in-loop deblocking filter in H.264/AVC. IEEE Trans Consum Electron 52(2):536–540
34. Komarek T, Pirsch P (1989) Array architectures for block matching algorithms. IEEE Trans Circuits Syst 36(10):1301–1308
35. Kuo T-Y, Lin Y-K, Chang T-S (2007) SIFME: a single iteration fractional-pel motion estimation algorithm and architecture for HDTV sized H.264 video coding. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, Hawaii, USA, April 2007, pp 1185–1188
36. Kuo H-C, Lin Y-L (2008) An H.264/AVC full-mode intra-frame encoder for 1080HD video. In: Proceedings of IEEE international conference on multimedia and expo, Hanover, Germany, June 2008, pp 1037–1040
37. Lappalainen V, Hallapuro A, Hamalainen T (2003) Complexity of optimized H.26L video decoder implementation. IEEE Trans Circuits Syst Video Technol 13(7):717–725
38. Lin H-Y, Chao Y-C, Chen C-H, Liu B-D, Yang J-F (2005) Combined 2-D transform and quantization architectures for H.264 video coders. In: Proceedings of IEEE international symposium on circuits and systems, Kobe, Japan, May 2005, pp 1802–1805
39. Lin Y-C, Chao P, Hung W-C, Peng H-K, Lee C-H, Chen J-W, Lo T-J, Chang Y-H, Hsu S-T, Jan K-Y (2006) A pure hardwired H.264/AVC video decoder on an SOC platform. In: Proceedings of international SOC conference, Seoul, Korea, October 2006
40. Liu P-S, Chen J-W, Lin Y-L (2007) A hardwired context-based adaptive binary arithmetic encoder for H.264 advanced video coding. In: Proceedings of IEEE international symposium on VLSI design, automation, and test, Hsinchu, Taiwan, April 2007, pp 1–4
41. Li L, Goto S, Ikenaga T (2005) An efficient deblocking filter architecture with 2-dimensional parallel memory for H.264/AVC. In: Proceedings of Asia South Pacific design automation conference, Shanghai, China, January 2005, pp 623–626
42. Lin C-C, Lin Y-K, Chang T-S (2007) PMRME: a parallel multi-resolution motion estimation algorithm and architecture for HDTV sized H.264 video coding. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, Hawaii, USA, April 2007, pp 285–288
43. Lin Y-C, Lin Y-L (2009) A two-result-per-cycle deblocking filter architecture for QFHD H.264/AVC decoder. IEEE Trans Very Large Scale Integr Syst 17(6):838–843
44. Li Y, Qu Y, He Y (2007) Memory cache based motion compensation architecture for HDTV H.264/AVC decoder. In: Proceedings of IEEE international symposium on circuits and systems, New Orleans, USA, May 2007, pp 2906–2909
45. List P, Joch A, Lainema J, Bjøntegaard G, Karczewicz M (2003) Adaptive deblocking filter. IEEE Trans Circuits Syst Video Technol 13(7):614–619
46. Li D-W, Ku C-W, Cheng C-C, Lin Y-K, Chang T-S (2007) A 61 MHz 72k gates 1,280×720 30 fps H.264 intra encoder. In: Proceedings of IEEE international conference on acoustics, speech and signal processing, Hawaii, USA, April 2007, pp (II)801–804
47. Lin Y-K, Li D-W, Lin C-C, Kuo T-Y, Wu S-J, Tai W-C, Chang W-C, Chang T-S (2008) A 242-mW 10-mm² 1080p H.264/AVC high-profile encoder chip. In: IEEE international solid-state circuits conference, digest of technical papers, San Francisco, USA, February 2008, pp 314–615
48. Liu T-M, Lee W-P, Lin T-A, Lee C-Y (2005) A memory-efficient deblocking filter for H.264/AVC video coding. In: Proceedings of IEEE international symposium on circuits and systems, Kobe, Japan, May 2005, pp 2140–2143
49. Lo C-C, Zeng Y-J, Shieh M-D (2007) Design and test of a high-throughput CABAC encoder. In: Proceedings of IEEE region 10 international technical conference (TENCON), Taipei, Taiwan, October 2007, pp 1–4
50. Li L, Song Y, Ikenaga T, Goto S (2006) A CABAC encoding core with dynamic pipeline for H.264/AVC main profile. In: Proceedings of Asia Pacific conference on circuits and systems, Singapore, December 2006, pp 760–763
51. Liu Z, Song Y, Shao M, Li S, Li L, Ishiwata S, Nakagawa M, Goto S, Ikenaga T (2007) A 1.41-W H.264/AVC real-time encoder SOC for HDTV1080P. In: Proceedings of IEEE symposium on VLSI circuits, Kyoto, Japan, June 2007, pp 12–13
52. Lin H-Y, Yang J-J, Liu B-D, Yang J-F (2006) Efficient deblocking filter architecture for H.264 video coders. In: Proceedings of IEEE international symposium on circuits and systems, Island of Kos, Greece, May 2006, p 4
53. Marpe D, Schwarz H, Wiegand T (2003) Context-based adaptive binary arithmetic coding in the H.264/AVC video compression standard. IEEE Trans Circuits Syst Video Technol 13(7):620–636
54. Meng B, Au O-C (2003) Fast intra-prediction mode selection for 4×4 blocks in H.264. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, Hong Kong, China, April 2003, pp (III)389–392
55. Osorio R, Bruguera J (2006) High-throughput architecture for H.264/AVC CABAC compression system. IEEE Trans Circuits Syst Video Technol 16(11):1376–1384
56. Peng H-K, Lee C-H, Chen J-W, Lo T-J, Chang Y-H, Hsu S-T, Lin Y-C, Chao P, Hung W-C, Jan K-Y (2007) A highly integrated 8-mW H.264/AVC main profile real-time CIF video decoder on a 16-MHz SoC platform. In: Proceedings of Asia and South Pacific design automation conference, Yokohama, Japan, January 2007, pp 112–113
57. Sayood K (2006) Introduction to data compression. Morgan Kaufmann, San Francisco, CA
58. Shih S-Y, Chang C-R, Lin Y-L (2005) An AMBA-compliant deblocking filter IP for H.264/AVC. In: Proceedings of IEEE international symposium on circuits and systems, Kobe, Japan, May 2005, pp 4529–4532
59. Shih S-Y, Chang C-R, Lin Y-L (2006) A near optimal deblocking filter for H.264 advanced video coding. In: Proceedings of Asia South Pacific design automation conference, Yokohama, Japan, January 2006, p 6
60. Shojania H, Sudharsanan S (2005) A high performance CABAC encoder. In: Proceedings of IEEE international Northeast workshop on circuits and systems, Quebec City, Canada, June 2005, pp 315–318
61. Suh K, Park S, Cho H (2005) An efficient hardware architecture of intra prediction and TQ/IQIT module for H.264 encoder. Electron Telecommun Res Inst J 27(5):511–524
62. Sullivan G, Wiegand T (1998) Rate-distortion optimization for video compression. IEEE Signal Process Mag 15(6):74–90
63. Su C-L, Yang W-S, Chen Y-L, Li Y, Chen C-W, Guo J-I, Tseng S-Y (2006) Low complexity high quality fractional motion estimation algorithm and architecture design for H.264/AVC. In: Proceedings of IEEE Asia Pacific conference on circuits and systems, Singapore, December 2006, pp 578–581
64. Su C-L, Yang W-S, Chen Y-L, Yang Y-C, Chen C-W, Guo J-I, Tseng S-Y (2006) A low complexity high quality integer motion estimation architecture design for H.264/AVC. In: Proceedings of IEEE Asia Pacific conference on circuits and systems, Singapore, December 2006, pp 398–401
65. Takizawa T, Hirasawa M (2001) An efficient memory arbitration algorithm for a single chip MPEG2 AV decoder. IEEE Trans Consum Electron 47(3):660–665
66. Takizawa T, Tajime J, Harasaki H (1999) High performance and cost effective memory architecture for an HDTV decoder LSI. In: Proceedings of IEEE international conference on acoustics, speech, and signal processing, Phoenix, USA, March 1999, pp 1981–1984
67. Tsai C-Y, Chen T-C, Chen T-W, Chen L-G (2005) Bandwidth optimized motion compensation hardware design for H.264/AVC HDTV decoder. In: Proceedings of 48th midwest symposium on circuits and systems, Cincinnati, USA, August 2005, pp 1199–1202
68. Tham J, Ranganath S, Ranganath M, Kassim A (1998) A novel unrestricted center-biased diamond search algorithm for block motion estimation. IEEE Trans Circuits Syst Video Technol 8(4):369–377
69. Tseng H-C, Chang C-R, Lin Y-L (2005) A motion compensator with parallel memory for H.264 advanced video coding. In: Proceedings of 16th VLSI design/CAD symposium, Hualian, Taiwan, August 2005
70. Tseng H-C, Chang C-R, Lin Y-L (2005) A hardware accelerator for H.264/AVC motion compensation. In: Proceedings of IEEE workshop on signal processing systems, Athens, Greece, November 2005, pp 214–219
71. Tuan J-C, Chang T-S, Jen C-W (2002) On the data reuse and memory bandwidth analysis for full-search block-matching VLSI architecture. IEEE Trans Circuits Syst Video Technol 12(1):61–72
72. Vos L, Stegherr M (1989) Parameterizable VLSI architectures for the full-search block-matching algorithm. IEEE Trans Circuits Syst 36(10):1309–1316
73. Wang Y-J, Cheng C-C, Chang T-S (2007) A fast algorithm and its VLSI architecture for fractional motion estimation for H.264/MPEG-4 AVC video coding. IEEE Trans Circuits Syst Video Technol 17(5):578–583
74. Wang T-C, Huang Y-W, Fang H-C, Chen L-G (2003) Parallel 4×4 2D transform and inverse transform architecture for MPEG-4 AVC/H.264. In: Proceedings of IEEE international symposium on circuits and systems, Bangkok, Thailand, May 2003, pp (II)800–803
75. Wang S-Z, Lin T-A, Liu T-M, Lee C-Y (2005) A new motion compensation design for H.264/AVC decoder. In: Proceedings of IEEE international symposium on circuits and systems, Kobe, Japan, May 2005, pp 4558–4561
76. Wang S-H, Peng W-H, He Y, Lin G-Y, Lin C-Y, Chang S-C, Wang C-N, Chiang T (2003) A platform-based MPEG-4 advanced video coding (AVC) decoder with block level pipelining. In: Proceedings of the 2003 joint conference of the fourth international conference on information, communications and signal processing and the fourth Pacific rim conference on multimedia, Singapore, December 2003, pp 51–55
77. Wu C-L, Kao C-Y, Lin Y-L (2008) A high performance three-engine architecture for H.264/AVC fractional motion estimation. In: Proceedings of IEEE international conference on multimedia and expo, Hanover, Germany, June 2008, pp 133–136
78. Wang J-C, Wang J-F, Yang J-F, Chen J-T (2007) A fast mode decision algorithm and its VLSI design for H.264/AVC intra-prediction. IEEE Trans Circuits Syst Video Technol 17(10):1414–1422
79. Yap S, McCanny J (2004) A VLSI architecture for variable block size video motion estimation. IEEE Trans Circuits Syst II, Express Briefs 51(7):384–389
80. Yang K-M, Sun M-T, Wu L (1989) A family of VLSI designs for the motion compensation block-matching algorithm. IEEE Trans Circuits Syst 36(10):1317–1325
81. Yeo H, Hu Y-H (1995) A novel modular systolic array architecture for full-search block matching motion estimation. IEEE Trans Circuits Syst Video Technol 5(5):407–416
82. Zhang P, Gao W, Wu D, Xie D (2006) An efficient reference frame storage scheme for H.264 HDTV decoder. In: Proceedings of IEEE international conference on multimedia and expo, Toronto, Canada, July 2006, pp 361–364
83. Zhang N-R, Li M, Wu W-C (2006) High performance and efficient bandwidth motion compensation VLSI design for H.264/AVC decoder. In: Proceedings of 8th international conference on solid-state and integrated circuit technology, Shanghai, China, October 2006, pp 1896–1898
84. Zhu S, Ma K (1997) A new diamond search algorithm for fast block matching motion estimation. In: Proceedings of international conference on information, communications and signal processing, Singapore, September 1997, pp 292–296
Index
kth Order Exp-Golomb Code Method, 128
1-D array architecture, 37
1080pHD, 16, 97
2-D array architecture, 37, 39
4Kx2K, 30
720pHD, 155

A
adaptive macroblock transmission scheme, 115
AMBA, 159, 162
  master, 163
  slave, 163
arbitration policy, 76
ASIC, 156

B
BAE architecture
  ML-decomposed, 137
  Multi-Bin, 137
  One-Bin, 136
  Two-Bin, 137
bank conflict, 76
basic pattern, 89
BCMODS, 140
bi prediction, 75
binarization, 125, 126, 141
Binarization and Context Modeler (BCM), 140
binary arithmetic encoder (BAE), 125, 130, 140
  bypass, 126, 131
  regular, 126, 130
  terminal, 126, 131
bit-stream
  H.264/AVC, 153, 158
  MB-level, 153
bitOutstanding, 131, 144, 146
block
  AC, 86
  DC, 86
blocking artifact, 6, 8, 107, 111
boundary strength (BS), 108, 109, 115, 117

C
CABAC, 125, 155
CAVLC, 8, 125, 155
Cb, 2, 12, 86
chroma pre mode SE, 132
chromaEdgeflag, 112
chrominance (chroma), 2
CIF, 13
coded block flag (CBF), 94, 96, 133
coded block pattern (CBP), 94
coeff level SE, 129, 133
coefficient, 5, 86, 89, 133
Concatenation Method, 128
context index (CtxIdx), 129
context modeler, 125, 129
context table, 126
Cr, 2, 12, 86
CtxBlkCatOffset, 129
CtxIdx Left-Inc, 129
CtxIdx Top-Inc, 129
CtxIdxInc, 129
CtxIdxOffset, 129

D
Data-Reuse Scheme, 43
  Level A, 43
  Level B, 43
  Level C, 43, 45, 52
  Level D, 43
deblocking filter (DF), 8, 107, 108, 116
down-sampling, 36, 49, 53
DRAM, 83, 156
  access penalty, 75, 76, 159
E
edge filter, 108, 112
  strong edge filter, 108, 112
  weak edge filter, 108, 112
entropy encoder, 86
Exp-Golomb code method, 128

F
filtering strength, 108
FilterOffsetA, 111
FilterOffsetB, 111
filterSampleflag, 108, 109, 111
Fixed-Length Code Method, 128
frame
  B-type, 152
  I-type, 152
  P-type, 152
frames per second (fps), 1, 31
full high definition (FullHD), 1

G
group-of-picture (GOP), 163

H
H.263, 85, 151
half refinement, 58, 70
HDTV, 31
Huffman, 126

I
in-loop filter, 108
indexA, 111
indexB, 111
inter-stage register bypass scheme, 137
interpolation, 64, 75, 76, 79
  2-tap filter, 59
  6-tap filter, 59
  half-pixels, 79
  quarter-pixels, 79
interpolation engine
  1-D based, 76
  2-D based, 76
  dual, 79
  separate 1-D based, 76
interprediction, 4
intra encoding, 11
intra mode
  DC, 13
  DDL, 22
  DDR, 22
  HD, 22
  horizontal, 13
  HU, 22
  plane, 13
  vertical, 13
  VL, 22
  VR, 22
intra prediction, 4, 8, 11, 85, 107
intra4x4 pre mode SE, 132
InvLevelScale, 92
ISO/IEC, 73, 151
ITU-T, 73, 151
J
Joint Model (JM), 19, 53, 135, 155
JPEG2000, 11
L
L-C correlated mapping, 76
Lagrangian parameter, 60, 152
last significant coefficient flag, 93, 133
Least Probable Symbol (LPS), 130
least-significant-bits (LSB), 5
LevelScale, 91
list0, 75
list1, 75
luminance (luma), 2
M
macroblock, 2
  B-type, 151
  I-type, 86, 151
  intra 16x16, 86
  non-intra 16x16, 86
  P-type, 151
mb type SE, 128
Media embedded Processor, 156
memory access unit (MAU), 158, 163
mode
  non-reconstruction-loop (Non-RL), 20
  reconstruction-loop (RL), 20
mode decision
  all-mode, 19
  inter, 151
  intra, 13, 19, 151
  macroblock, 152
  partial-mode, 19
  rate-distortion optimized (RDO), 155
mono prediction, 75
Most Probable Symbol (MPS), 130
Most Significant Bit (MSB), 147
motion compensation, 7, 8, 73, 77, 107
motion estimation, 4, 31, 85
  block-based, 31, 33
  Diamond Search, 34
  fixed block size, 32, 57
  fractional, 32, 58
  Full Search, 33, 36, 153
  integer, 32, 58
  Three Step Search, 34
  variable block size, 32, 57
motion vector (MV), 4, 33, 58, 151
  difference, 4, 73, 77
  predictor, 32, 65, 74, 77
motion vector difference (MVD) SE, 133
motion vector difference SE
  horizontal, 143
  vertical, 143
Moving Picture Experts Group (MPEG), 73
MPEG-1, 2
MPEG-2, 2, 151
MPEG-4, 85, 151
multiple-candidate data reuse, 41, 44, 46, 52
multiple-candidate parallel architecture, 41, 42
multiple-macroblock data reuse, 44–46, 52
MV-bit-rate estimation, 58, 59

N
Non-skip-all mode, 122

P
Parallel-In-Parallel-Out (PIPO), 140
partial-search algorithm, 43
ping-pong fashion, 102, 158
pipeline hazard, 121
pixel duplication, 76
pixel truncation, 49, 53
pixel-level parallelism (PLP), 19, 21, 22
post-scaling factor, 90
pre-scaling factor, 91, 93
prediction mode capability (PMC), 19
Probability State Index (pStateIdx), 126
processing element (PE), 22, 37, 39
profile
  baseline, 11, 85
  extended, 11, 85
  high, 11
  main, 11, 85
PSNR, 156
PutBit function, 130, 144
Q
QFHD, 8, 107, 125
quantization, 86
quantization parameter (QP), 90, 152
quantization step (Qstep), 5, 90
quarter pixel accuracy, 74
quarter refinement, 58, 70
queue, 163
R
RangeLPS (rLps), 130
RangeMPS (rMps), 130
rangeTabLPS table, 130
read-after-write hazard, 144
reconfigurable architecture, 19
reference frame, 31, 47, 49, 159
Reference Frame Interlaced Processing (RFIP), 50
reference index, 73
Reference Pixel Register Array, 47
rem intra pre mode SE, 132
renormalization, 130
residual, 4, 86, 89
row-of-pixels (ROP), 119
row-miss, 76
run level coding (RLC), 86, 93
S
scalar multiplication, 90
scalar quantizer, 90
SDTV, 62
search range, 33, 159
search window, 31–33, 41
seed pixel, 26
significant coefficient flag, 93, 133
Skip-all mode, 115, 122
Skip-top mode, 115
snake scan, 49
SOC, 156
standard definition (SD), 1, 155
sum of absolute differences (SAD), 33, 155
sum of absolute transformed differences (SATD), 58, 59, 155, 156
  generation, 63, 68
sum of square difference (SSD), 33
switching FIFO, 82
syntax element (SE), 125
T
Table Mapping Code Method, 128
threshold, 108, 109, 111
transform
  discrete cosine (DCT), 85, 89
  Hadamard (HT), 89, 90
  inverse discrete cosine (IDCT), 92
  inverse Hadamard (IHT), 92
transIdxLPS table, 143
Truncated Unary Code Method, 127
Two-Result-Per-Cycle Edge Filter, 118
U
Unary Code Method, 127

V
variable-length coding, 125
Video Coding Experts Group (VCEG), 73

W
weighted prediction, 75, 82

Z
zig-zag scan, 93, 101