VIDEO TRACES FOR NETWORK PERFORMANCE EVALUATION
Video Traces for Network Performance Evaluation A Comprehensive Overview and Guide on Video Traces and Their Utilization in Networking Research
by
PATRICK SEELING Arizona State University, AZ, U.S.A.
FRANK H.P. FITZEK Aalborg University, Denmark and
MARTIN REISSLEIN Arizona State University, AZ, U.S.A.
A C.I.P. Catalogue record for this book is available from the Library of Congress.
ISBN-10 1-4020-5565-X (HB) ISBN-13 978-1-4020-5565-2 (HB) ISBN-10 1-4020-5566-8 (e-book) ISBN-13 978-1-4020-5566-9 (e-book)
Published by Springer, P.O. Box 17, 3300 AA Dordrecht, The Netherlands. www.springer.com
Printed on acid-free paper
All Rights Reserved © 2007 Springer No part of this work may be reproduced, stored in a retrieval system, or transmitted in any form or by any means, electronic, mechanical, photocopying, microfilming, recording or otherwise, without written permission from the Publisher, with the exception of any material supplied specifically for the purpose of being entered and executed on a computer system, for exclusive use by the purchaser of the work.
To Jody. — Patrick
To Sterica and Lilith. — Frank
To Jana and Tom. — Martin
Contents
1 Introduction

Part I  Digital Video

2 Introduction to Digital Video
   2.1 The Beginning of Moving Pictures
   2.2 Digital Picture and Video Representation
   2.3 Video Hierarchy

3 Video Encoding
   3.1 DCT-Based Video Encoding
      3.1.1 Block Scanning
      3.1.2 Discrete Cosine Transformation
      3.1.3 Quantization
      3.1.4 Zig-Zag Scanning
      3.1.5 Variable Length Coding
   3.2 Inter-frame Coding: Motion Estimation and Compensation
   3.3 Scalable Video Encoding
      3.3.1 Data Partitioning
      3.3.2 Temporal Scalability
      3.3.3 Spatial Scalability
      3.3.4 SNR Scalability
      3.3.5 Object Scalability
      3.3.6 Fine Granular Scalability (FGS)
      3.3.7 Multiple Description Coding (MDC)
   3.4 Wavelet-Based Video Encoding
   3.5 Video Coding Standards

Part II  Video Traces and Statistics

4 Metrics and Statistics for Video Traces
   4.1 Video Frame Size
      4.1.1 Autocorrelation
      4.1.2 Variance-Time Test
      4.1.3 R/S Statistic
      4.1.4 Periodogram
      4.1.5 Logscale Diagram
      4.1.6 Multiscale Diagram
   4.2 Video Frame Quality
   4.3 Correlation between Video Frame Sizes and Qualities
   4.4 Additional Metrics for FGS Encodings
   4.5 Additional Metric for MDC Encodings

5 Video Trace Generation
   5.1 Overview of Video Trace Generation and Evaluation Process
      5.1.1 Video Source VHS
      5.1.2 Video Source DVD
      5.1.3 Video Source YUV Test Sequences
      5.1.4 Video Source Pre-Encoded Video
   5.2 MDC Trace Generation
   5.3 Evaluation of MPEG-4 Encodings
      5.3.1 Single-Layer Encoding
      5.3.2 Temporal Scalable Encoding
      5.3.3 Spatial Scalable Encoding
   5.4 Evaluation of H.264 Encodings
   5.5 Evaluation of MPEG-4 FGS Encodings
   5.6 Evaluation of Wavelet Video Traces
   5.7 Evaluation of Pre-Encoded Content
   5.8 Evaluation of MDC Encodings

6 Statistical Results from Video Traces
   6.1 Video Trace Statistics for MPEG-4 Encoded Video
      6.1.1 Examples from Silence of the Lambs Single Layer Encodings
      6.1.2 Videos and Encoder Modes for Evaluated MPEG-4 Video Traces
      6.1.3 Single Layer Encoded Video
      6.1.4 Temporal Scalable Encoded Video
      6.1.5 Spatial Scalable Encoded Video
   6.2 Video Trace Statistics for H.264 Video Trace Files
   6.3 Video Trace Statistics for Pre-Encoded Video
   6.4 Video Trace Statistics for Wavelet Encoded Video
      6.4.1 Analysis of Video Traffic
      6.4.2 Analysis of Video Quality
      6.4.3 Correlation Between Frame Sizes and Qualities
      6.4.4 Comparison Between Wavelet and MPEG-4 Encoded Video
   6.5 Video Trace Statistics for MPEG-4 FGS Encoded Video
   6.6 Video Trace Statistics for MDC Encoded Video

Part III  Applications for Video Traces

7 IP Overhead Considerations for Video Services
   7.1 Introduction and Motivation
   7.2 Data Plane
      7.2.1 Real Time Protocol (RTP) and User Datagram Protocol (UDP)
      7.2.2 Transmission Control Protocol (TCP)
      7.2.3 Internet Protocol (IP)
   7.3 Signaling Overhead
      7.3.1 Session Description Protocol (SDP)
      7.3.2 Session Announcement Protocol (SAP)
      7.3.3 Session Initiation Protocol (SIP)
      7.3.4 Real Time Streaming Protocol (RTSP)
      7.3.5 Real Time Control Protocol (RTCP)
   7.4 Header Compression Schemes
   7.5 Short Example for Overhead Calculation

8 Using Video Traces for Network Simulations
   8.1 Generating Traffic from Traces
      8.1.1 Stream Level Issues
      8.1.2 Frame/Packet Level Issues
   8.2 Simulation Output Data Analysis
      8.2.1 Performance Metrics in Video Trace Simulations
      8.2.2 Estimating Performance Metrics

9 Incorporating Transmission Errors into Simulations Using Video Traces
   9.1 Video Encoding and Decoding
      9.1.1 Single Layer and Temporal Scalable Encoding
      9.1.2 Spatial and SNR Scalable Video
   9.2 Video Quality after Network Transport
      9.2.1 Single Layer and Temporal Scalable Video
      9.2.2 Spatial Scalable Video
      9.2.3 SNR Scalable Video
   9.3 Video Offset Distortion
      9.3.1 Comparison of Rate-Controlled and Non-Rate-Controlled Video Encoding for Single-Layer Video
      9.3.2 Comparison of Rate-Controlled and Non-Rate-Controlled Video Encoding for Scalable Video
   9.4 Perceptual Considerations for Offset Distortions or Qualities
   9.5 Using Video Offset Distortion Traces
      9.5.1 Assessing the Video Quality After Network Transport Using Video Traces
      9.5.2 Available Tools
   9.6 Offset Distortion Influence on Simulation Results
      9.6.1 Single Layer
      9.6.2 Spatial Scalable Video
   9.7 Error-Prone and Lost MDC Descriptors

10 Tools for Working with Video Traces
   10.1 Using Video Traces with Network Simulators
      10.1.1 NS II
      10.1.2 Omnet++
      10.1.3 Ptolemy II
   10.2 The VideoMeter Tool for Linux
      10.2.1 VideoMeter Usage
      10.2.2 Freeze File
   10.3 RMSE and PSNR Calculator
   10.4 MPEG-4 Frame Size Parser
   10.5 Offset Distortion Calculators
      10.5.1 Single Layers
      10.5.2 Spatial Scalability

11 Outlook

List of Abbreviations
Acknowledgements
References
Index
1 Introduction
Multimedia networking applications and, in particular, the transport of compressed video are expected to contribute significantly to the traffic in the future Internet and wireless networks. For transport over networks, video is typically encoded (i.e., compressed) to reduce the bandwidth requirements. Even compressed video, however, requires large bandwidths on the order of hundreds of kbps to several Mbps. In addition, compressed video streams typically exhibit highly variable bit rates (VBR) as well as long range dependence (LRD) properties. This, in conjunction with the stringent Quality of Service (QoS) requirements (loss and delay) of video traffic, makes the transport of video traffic over communication networks a challenging problem. As a consequence, in the last decade the networking research community has witnessed an explosion in research on all aspects of video transport. The characteristics of video traffic, video traffic modeling, as well as protocols and mechanisms for the efficient transport of video streams have received a great deal of interest among networking researchers and network operators, and a plethora of video transport schemes have been developed. For developing and evaluating video transport mechanisms and for research on video networking in general, it is necessary to have available some characterization of the video. Generally, there are three different ways to characterize encoded video for the purpose of networking research: (i) video traffic model, (ii) video bit stream, and (iii) video traffic trace. Video traffic models, such as [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11], strive to capture the essential properties of the real traffic in parsimonious, accurate, and computationally efficient mathematical models. A traffic model is typically developed based on the statistical properties of samples of the real traffic, or, in many cases, video traces of the real traffic. Video traces are therefore typically a prerequisite for model development. The developed traffic model is verified by comparing the traffic it generates with the video traces. If the traffic model is deemed sufficiently accurate, it can be used for the mathematical analysis of networks, for model driven simulations, and also for generating so-called virtual (synthetic) video traces.
The video bit stream is generated using the encoder and contains the complete video information. The traffic characterization (e.g., the frame size) can be obtained by measuring the traffic or by parsing the bit stream. The video quality can be determined by subjective (viewing) evaluation [12] or objective methods [13, 14, 15]. The advantage of the bit stream is that it allows for networking experiments where the quality of the video, after suffering losses in the network, is evaluated, e.g., [16, 17, 18, 19, 20, 21]. The techniques presented in Chapter 9 bring this capability to assess the video quality after lossy network transport to video traces. One limitation of the bit stream is that it is very large in size; several GBytes for one hour of compressed video or several tens of GBytes for one hour of uncompressed video. Another limitation of bit streams is that they are usually proprietary and/or protected by copyright. This limits the access of networking researchers to bit streams, and also limits the exchange of bit streams among research groups. An additional key limitation of the bit streams is that they require expertise in video coding. As a result, only individuals with expertise and the necessary equipment for both video coding and networking research can conduct video networking research with bit streams. Video traces are an attractive alternative to traffic models and bit streams in that they represent the traffic and quality of the videos. While the bit streams give the actual bits carrying the video information, the traces only give the number of bits used for the encoding of the individual video frames and the quality level (e.g., in PSNR) of the encoding. Thus, there are no copyright issues. Importantly, video traces significantly extend the set of individuals who can conduct high-quality video networking research by providing the relevant video characterizations in plain trace files. The trace files can be processed with standard PCs and be utilized in standard network simulations, thus enabling networking researchers without video coding expertise or equipment to conduct video networking research. As video traces are a very convenient video characterization for networking research, the traces stimulate video networking research. Indeed, the networking research community experienced an initial explosion in the research on video transport after the MPEG-1 traces [22, 23, 24, 25] became publicly available around 1995. These first traces were very elementary in that (i) the traces were for only a small set of videos, (ii) the traces represented only video encoded with the less efficient MPEG-1 codec at a single quality level, and (iii) no traces of scalable encoded video were available. Nevertheless, these MPEG-1 traces provided a useful basis for a wealth of video networking research. Our involvement in video trace creation and dissemination began in 1999 at the Telecommunication Networks (TKN) institute headed by Prof. Adam Wolisz at the Technical University Berlin. We created traces for a moderately large number of videos with the MPEG-4 and H.263 codecs for a range of different quantization scales (quality levels). We focused on single-layer (non-scalable) encoded videos and incorporated encodings both without and with
rate control targeting a prescribed target bit rate. These traces were first used for simulating video traffic in the wireless video streaming study [26] and described in the article [27]. The encodings with a range of quality levels permitted the simulation of the network transport of different versions of the same video, thus widening the range of networking scenarios that can be examined using video traces. At the same time, the statistical analysis of the traces revealed a characteristic “hump” shape of the bit rate variability plotted as a function of the video quality [28]. We continued the video trace creation at Arizona State University, Tempe, and Aalborg University, Denmark, expanding the trace library in several directions. One major direction is traces for scalable encoded video. Scalable video encoding is promising as it permits, with a single encoding, streaming over heterogeneous networks providing variable bit rates and to heterogeneous receivers with different display formats (screen sizes) and processing capabilities. We generated traces for layered scalable MPEG encoding as well as for the fine granular scalable MPEG coding which allows for scalability at a bit granularity. We also generated traces for basic wavelet-based codecs, which provide for highly flexible scalability. A major conceptual advance for video traces has been the offset distortion traces that permit assessing the video quality after lossy network transport, a capability that could previously only be achieved through experiments with actual video bit streams. This book provides a comprehensive introduction to video traces and their use in networking research. After first providing the basics of digital video and video coding, we introduce the video traces, covering the metrics captured in the traces, the trace generation, as well as the statistical characteristics of the video characterized in the traces. We then turn our attention to the use of the video traces in networking research, examining the practical aspects of transporting video over Internet Protocol (IP) networks and the simulation of video transport using traces, including the simulations using the offset distortion traces. Software tools and utilities that facilitate the use of video traces in network simulations and other video networking related software tools are also presented.
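To illustrate how little machinery is needed to work with such plain trace files, the following sketch parses a hypothetical frame-size trace and computes elementary traffic statistics. The column layout (frame index, frame type, frame size in bytes, PSNR in dB) and the file name are assumptions for illustration only; the actual trace files in the library may use a different format.

```python
# Minimal sketch: parse a hypothetical video trace file and compute simple statistics.
# Assumed (hypothetical) whitespace-separated columns: frame_index frame_type size_bytes psnr_dB
def read_trace(path):
    frames = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line or line.startswith("#"):   # skip comment/header lines
                continue
            idx, ftype, size, psnr = line.split()[:4]
            frames.append((int(idx), ftype, int(size), float(psnr)))
    return frames

def summarize(frames, fps=30.0):
    sizes = [size for _, _, size, _ in frames]
    mean_size = sum(sizes) / len(sizes)            # bytes per frame
    peak_size = max(sizes)
    mean_bitrate = mean_size * 8 * fps             # bit/s, assuming a fixed frame rate
    return mean_size, peak_size, mean_bitrate

# Example use (hypothetical file name):
# frames = read_trace("silence_qcif_q10.trace")
# print(summarize(frames))
```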
Part I
Digital Video
2 Introduction to Digital Video
In this chapter, we give an introduction to digital video and the differences between video standards. Starting from the different ways of displaying digital video, we look at different color spaces and conversions between them. For the YUV color space, we look in more detail at the ways of representing each individual pixel or groups of pixels. From this smallest unit, we build up the different levels of the video hierarchy up to a full movie. Our approach is introductory in nature; we refer the interested reader to the cited literature for more detailed information.
2.1 The Beginning of Moving Pictures
In the early 1900s, moving images became popular. One of the first realizations of moving images was the flip book, the simplest form of animation, using a sequence of still images with small changes from one image to the next. To use the flip book, the viewer starts with the first image and flips through the following images. If the flipping rate is sufficiently high, the illusion of motion is created, a phenomenon referred to as persistence of vision. The phenomenon is based on the fact that the retina of the human eye retains an image for a short time. Thus, if more than about 16 images are presented per second, the human brain superimposes them, creating the illusion of real motion. To describe the sensitivity of the human eye to the number of images per second, the so-called "flicker fusion rate" was introduced. It is a statistical measure of when flickering of the moving images is perceived. The flicker fusion rate varies among observers, but if more than 48 images are displayed per second, flicker can hardly be detected. The first mechanisms to create the illusion of moving pictures, such as the flip book, were based on passive lighting. Later, with the introduction of chemical film, light generated by an active source was passed through the images. Capturing a larger number of pictures per second on the chemical film
would be cost-intensive and therefore practically impossible.
Fig. 2.1: Traditional film camera (a) and projector (b).
Most film cameras, such as the one illustrated in Figure 2.1, shoot the movie at 16, 18, or 24 frames per second. As this frame rate is far below the flicker fusion rate, a so-called shutter (see Figure 2.2) was introduced in the film projector. The shutter has two main objectives. The first is to block the light of the projector while the next picture is moved into place and the old picture is removed. Furthermore, the shutter shows each image multiple times, thereby increasing the displayed frame rate. The shutter is simply a rotating disc with openings that let the light pass through or block it. The typical shutters found are two- and three-wing shutters, which increase frame rates of 24 and 16 frames per second, respectively, to a virtual rate of 48 frames per second. The choice of 18 frames per second is not motivated by the flicker rate, but by audio recording considerations. The drawback of suppressing flicker in this way is that the brightness is reduced, which can be compensated by more powerful lamps in the projector compared to a slide projector.
2.2 Digital Picture and Video Representation
Video consists of a sequence of individual video frames or images that are displayed at a certain frame rate. The consecutive display of the individual pictures then creates the effect of captured motion. In traditional video shooting on film reels, for example, the camera captures at a frame rate of 24 frames
per second (fps). This frame rate gives the impression of continuous motion to the human eye when played back. The traditional film-based display is illustrated in Figure 2.3.
Fig. 2.2: A shutter in a film projector.
Fig. 2.3: Concept of capturing motion in film (images captured at times 0, 1/24, 2/24, 3/24, ... s).
Fig. 2.4: Line display in progressive video.
Different standards for frame rates exist. Progressive video draws all the individual lines of a picture in sequence and displays them as illustrated in Figure 2.4. The National Television Standards Committee (NTSC) format frame rate is set at 29.97 fps, or approximately 30 fps. The Phase Alternating Line (PAL) standard uses 25 fps. The frame rate of Super8 is 18 fps. In normal television sets, however, the display of the video frames is done at twice that frequency, whereby the changes in the pictures are captured by sending out only half the lines which comprise the resolution of the full television screen. This concept is called interlacing, and by combining the interlaced lines, the frame rate of non-interlaced or progressive video is obtained. Figure 2.5 illustrates the concept of interlacing line-wise. The idea behind interlacing is that the human brain and eyes work together to eliminate the discrepancies that are caused by the interlacing mechanism. With the advent of the digital representation of film and video and the encoding of the source material, the progressive approach determines the frequency in terms of frames per second that is used throughout this book. With video encoders and decoders working together in computers and set-top boxes before sending the signal to the connected television set (and introducing artificial interlacing), we can assume that the video is processed and displayed on a full-frame basis, whereby the display of the individual images occurs at the points in time given by the chosen frame rate. We illustrate the concept that we assume throughout this book in Figure 2.6. Each individual video frame consists of picture elements (usually referred to as pixels or pels). The frame format specifies the size of the individual frames in terms of pixels. The ITU-R/CCIR-601 format (the common TV format) has 720 × 480 pixels (i.e., 720 pixels in the horizontal direction and 480 pixels in the vertical direction), while the Common Intermediate Format
(CIF) format has 352 × 288 pixels and the Quarter CIF (QCIF) format has 176 × 144 pixels; the CIF and QCIF formats are typically considered in network-related studies.
Fig. 2.5: Line display in interlaced video.
Fig. 2.6: Decoding, storing in the frame buffer of the graphics card, and display of digitally encoded video.
Different color spaces are used to represent a single pixel. The common color space in the computer domain is the representation based on the three component colors red, green, and blue (RGB). With these three components the color of a pixel can be defined. In video transmission, the pixels are represented by three different components: the luminance component (Y) and the two chrominance components hue (U) and intensity (V). This representation dates back to the beginning days of black-and-white and later color television. The YUV color representation was necessary to broadcast color TV signals
while allowing the old black and white TV sets to function without modifications. As the luminance information is located in a separate frequency band, the old tuners are capable of tuning in on the Y signal alone. The conversion between these two color spaces is defined by a fixed conversion matrix. A general conversion matrix for converting from RGB to YUV is given in Equation (2.1):

\begin{pmatrix} Y \\ U \\ V \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.147 & -0.289 & 0.436 \\ 0.615 & -0.515 & -0.100 \end{pmatrix} \cdot \begin{pmatrix} R \\ G \\ B \end{pmatrix}    (2.1)

These values are used to convert the RGB values to the YUV color space used in PAL systems. For NTSC systems, the conversion is given in Equation (2.2):

\begin{pmatrix} Y \\ I \\ Q \end{pmatrix} = \begin{pmatrix} 0.299 & 0.587 & 0.114 \\ -0.595716 & -0.274453 & -0.321263 \\ 0.211456 & -0.522591 & 0.311135 \end{pmatrix} \cdot \begin{pmatrix} R \\ G \\ B \end{pmatrix}    (2.2)

Similarly, the RGB values can be converted back from YUV or YIQ. For all conversion purposes, the values have to be mapped to the typical range of [0 ... 255] that is used in 8-bit digital environments. For more conversions and color spaces, we refer to [29]. In the following, we focus on the YUV representation, which is typically used in video compression schemes. Several different YUV formats exist. They can be categorized by the sub-sampling that is used between the different components and the way in which the values are stored. The original Y, U, and V values can be stored for each individual pixel; this format is referred to as YUV 4:4:4 and is illustrated in Figure 2.7. The human eye, however, is far more sensitive to changes in luminance than to the other components. It is therefore common to reduce the information that is stored per picture by chrominance sub-sampling. With sub-sampling, the ratio of chrominance to luminance bytes is reduced. More specifically, sub-sampling represents a group of typically four pixels by their four luminance components (bytes) and one set of two chrominance values. Each of these two chrominance values is typically obtained by averaging the corresponding chrominance values in the group. In case the four pixels are grouped as a block of 2 × 2 pixels, the format is YUV 4:2:0. If the grouped pixels form a line of 4 × 1, the format is referred to as YUV 4:1:1. These two most common YUV sampling formats are illustrated in Figures 2.8 and 2.9. Using the averaging approach to obtain the chrominance values for YUV 4:1:1 from YUV 4:4:4, the hue values can be calculated as

U_{1,1}(411) = \frac{U_{1,1}(444) + U_{2,1}(444) + U_{3,1}(444) + U_{4,1}(444)}{4} ,    (2.3)

and so on for the remaining pixels. The saturation values are calculated in a similar manner.
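The following sketch illustrates these operations, using the PAL conversion matrix of Equation (2.1) and the 2 × 2 averaging of the chrominance sub-sampling; the clipping of the result to the 8-bit range is a simplification of the mapping to [0 ... 255] mentioned above.

```python
# Sketch: RGB -> YUV conversion (Eq. 2.1) and chrominance sub-sampling by 2x2 averaging.
def rgb_to_yuv(r, g, b):
    y = 0.299 * r + 0.587 * g + 0.114 * b
    u = -0.147 * r - 0.289 * g + 0.436 * b
    v = 0.615 * r - 0.515 * g - 0.100 * b
    return y, u, v

def clip8(x):
    # simplified mapping of a component value into the 8-bit range [0, 255]
    return max(0, min(255, int(round(x))))

def subsample_420(chroma_plane):
    """Average each 2x2 block of a full-resolution chroma plane (list of rows)."""
    h, w = len(chroma_plane), len(chroma_plane[0])
    out = []
    for y in range(0, h, 2):
        row = []
        for x in range(0, w, 2):
            s = (chroma_plane[y][x] + chroma_plane[y][x + 1] +
                 chroma_plane[y + 1][x] + chroma_plane[y + 1][x + 1])
            row.append(s / 4.0)
        out.append(row)
    return out

# Example: convert one bright-red pixel
print(rgb_to_yuv(255, 0, 0))
```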
Using the averaging approach to obtain the chrominance values for YUV 4:2:0 from YUV 4:4:4, the hue values can be calculated as

U_{1,1}(420) = \frac{U_{1,1}(444) + U_{2,1}(444) + U_{1,2}(444) + U_{2,2}(444)}{4} .    (2.4)

Fig. 2.7: YUV 4:4:4 without any sub-sampling.
Fig. 2.8: YUV 4:1:1 sub-sampling.
Fig. 2.9: YUV 4:2:0 sub-sampling.
The second method of characterizing the different YUV formats is by the manner in which the different samples are stored in a file for the
different components. The values can be stored either packed or planar. Packed storage saves the values pixel-wise for each created block component, whereas the planar storage method saves the pixel information in arrays following each other. The exemplary YUV 4:4:4 (progressive) in a packed format would thus be stored in a file with the bytes in the order Y(1,1), U(1,1), V(1,1), Y(2,1), U(2,1), V(2,1), .... The most commonly used YUV 4:2:0 format is stored in planar format and consists of all the byte values for the Y component, followed by the byte values for the U and V components. This results in a file (for 176 × 144 pixels comprising a picture) similar to Y(1,1), Y(2,1), ..., Y(176,144), followed directly by U(1,1)(420), ..., U(88,72)(420) =
\frac{U_{(175,143)}(444) + U_{(176,143)}(444) + U_{(175,144)}(444) + U_{(176,144)}(444)}{4} ,

which are followed directly by V(1,1)(420), ..., V(88,72)(420). Thus the size of one YUV frame with 4:2:0 (or 4:1:1) chrominance sub-sampling in the QCIF format (176 pixel columns by 144 pixel rows for the luminance component and half the rows and columns for each of the two chrominance components) is

176 \cdot 144 \cdot 8\,\text{bit} + 2 \cdot \frac{176 \cdot 144}{4} \cdot 8\,\text{bit} = 304{,}128\,\text{bit} = 38{,}016\,\text{byte}.    (2.5)

The frame sizes and data rates for the different video formats and frame rates are summarized in Table 3.2. As is clearly visible already from this small video resolution and the resulting frame size in Equation (2.5), transmitting this low resolution video at the NTSC frame rate gives an enormous bandwidth of

304{,}128\,\frac{\text{bit}}{\text{frame}} \cdot 30\,\frac{\text{frames}}{\text{sec}} = 9{,}123{,}840\,\frac{\text{bit}}{\text{sec}} \approx 9.1\,\text{Mbps}.    (2.6)
Given the enormous bit rates of the uncompressed video streams, even more for higher resolutions, it is clear that some form of compression is required to allow transmission of video over networks.
2.3 Video Hierarchy
In general, digital video is not processed, stored, compressed, and transmitted on a per-pixel basis, but in a hierarchy [30], as illustrated in Figure 2.10. At the top of this hierarchy is the video sequence, which is divided into individual scenes. One example for a scene could be a discussion of several people. Scenes can be divided into multiple shots, whereby shots are used for dramatization effects by the director. Following the example, the director could have introduced several shots of the discussion showing the people at different camera angles. These first levels of video segmentation are due to the artistic component in video sequences and have a semantic meaning. Below the shot level
Fig. 2.10: Typical composition of a video sequence. come the Groups of Pictures (GoPs). Each GoP in turn consists of multiple video frames. A single frame is divided into slices. Slices represent independent coding units that can be decoded without referencing other slices of the same frame. They consist typically of several consecutive macroblocks. Slicing can be utilized to achieve a higher error–robustness. Each slice consists of several macroblocks (MBs), each typically consisting of 4 × 4 blocks. Each block typically consists of 8 × 8 pixels. While automatic video segmentation is still undergoing major research efforts, the levels that are most relevant for video encoding and decoding are from the GoP level downwards. Different shots may have different content and thus using different GoP patterns can be beneficial, especially when a potentially lossy transmission is considered. Video compression generally exploits three types of redundancies [30]. On a per–frame basis (i.e., single picture), neighboring pixels tend to be correlated and thus have spatial redundancy [31]. Intra-frame encoding is employed to reduce the spatial redundancy in a given frame. In addition, consecutive frames have similarities and therefore temporal redundancy. These temporal
redundancies are reduced by inter-frame coding techniques. The result of the reduction of these two redundancies is a stream of codewords (symbols) that has some redundancy at the symbol level. The redundancy between these symbols is reduced by variable length coding before the binary code is passed on to the output channel.1 The elimination of these redundancies is explained in the following as we give an introductory overview of different video coding principles, we refer the interested reader to [30, 32] for more details.
1 Additional compression schemes, such as the exploitation of object recognition techniques, are also in development, but not commonly applied up to now.
3 Video Encoding
In this chapter, we introduce several different video encoding methods. We start with the most commonly used discrete cosine transform (DCT) without predictive coding and the different mechanisms that are used in the process of applying the DCT in modern video encoders. We continue by introducing the predictive coding mechanisms with their respective intricacies and different methods of scalable video coding. An introduction to wavelet-based video encoding and current video standards conclude this chapter.
3.1 DCT-Based Video Encoding
We focus initially on the principles employed in the MPEG standards and on single-layer (non-scalable) video encoding. The main principles of MPEG video coding are intra-frame coding using the discrete cosine transform (DCT) and inter-frame coding using motion estimation and compensation between successive video frames. The DCT approach is commonly used in today's video encoders/decoders due to the low complexity associated with the transforms. As intra-frame coding alone typically gives only a small compression ratio, inter-frame coding is used to increase the compression ratio. For the intra-frame coding, each video frame is divided into blocks of 8 × 8 Y samples, U samples, and V samples. Each block is transformed using the DCT into a block of 8 × 8 transform coefficients, which represent the spatial frequency components in the original block. These transform coefficients are then quantized using an 8 × 8 quantization matrix which contains the quantization step size for each coefficient. The quantization matrix is obtained by multiplying a base matrix by a quantization scale. This quantization scale is typically used to control the video encoding: a larger quantization scale gives a coarser quantization, resulting in a smaller size (in bits) of the encoded video frame as well as a lower quality. The quantized coefficients are then zig-zag scanned, run-level coded, and variable length coded to achieve further compression.
Fig. 3.1: DCT coding concept (block scanning, DCT, quantization, zig-zag scanning, variable length coding).
The intra-coding (compression) of an individual video frame resembles still picture encoding. It is commonly based on the discrete cosine transformation (DCT). (Wavelet-based transformation schemes have also emerged. Studies indicate that in the field of video encoding, the wavelet-based approach does not improve the quality of the transformed video significantly [33]. However, essentially all internationally standardized video compression schemes are based on the DCT and we will therefore focus on the DCT in our discussion.) The intra-frame coding proceeds by partitioning the frame into blocks, also referred to as block scanning. The size of these blocks today is typically 8 × 8 pixels (previously, also 4 × 4 and 16 × 16 were used). The DCT is then applied to the individual blocks. The resulting DCT coefficients are quantized and zig-zag scanned according to their importance to the image quality. An overview of these steps is given in Figure 3.1.
3.1.1 Block Scanning
In order to reduce the computational power required for the DCT, the original frame is first subdivided into macroblocks. Each macroblock is in turn subdivided into 4 × 4 blocks, since efficient algorithms exist for a block-based DCT [34]. The utilization of block shapes for encoding is one of the limitations of DCT-based compression systems. The typical object shapes in natural pictures are irregular and thus cannot be fitted into rectangular blocks,
as illustrated in Figure 3.2. In order to increase the compression efficiency, different block sizes can be utilized at the cost of increased complexity [35]. With the standardization of H.264 [36], the composition of macroblocks can differ from the previously used 4 × 4 subdivision. The H.264 standard supports seven different macroblock subdivision modes (Modes 1-7), where each macroblock can be subdivided into smaller fragments in order to provide a finer granularity and higher quality. The different subdivision formats are illustrated in Figure 3.3.
Fig. 3.2: Video frame subdivision into blocks (QCIF format into 22 × 18 blocks of 8 × 8 pixels each).
Fig. 3.3: Different macroblock subdivision modes (Modes 1-7) supported by the H.264 video coding standard.
3.1.2 Discrete Cosine Transformation
The DCT is used to convert a block of pixels (e.g., for the luminance component 8 × 8 pixels, represented by 8 bits each, for a total of 512 bits) into a block of transform coefficients. The transform coefficients represent the spatial frequency components of the original block. An example for this transformation of the block marked in Figure 3.2 is illustrated in Figure 3.4.
Fig. 3.4: 8 × 8 block of luminance values f(i,j) (visual representation and numerical values) and the resulting DCT transform coefficients F(u,v) (decimal places truncated).
This transformation is lossless; it merely changes the representation of the block of pixels, or more precisely the block of luminance (chrominance) values. A two-dimensional DCT for an N × N block of pixels can be described as two consecutive one-dimensional DCTs (i.e., horizontal and vertical). With f(i,j) denoting the pixel values and F(u,v) denoting the transform coefficients we have
F(u,v) = \frac{2}{N} \cdot C(u) \cdot C(v) \cdot \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} f(i,j) \, \cos\frac{(2i+1)u\pi}{2N} \, \cos\frac{(2j+1)v\pi}{2N} ,    (3.1)

where

C(x) = \begin{cases} \frac{1}{\sqrt{2}}, & x = 0 \\ 1, & \text{otherwise.} \end{cases}    (3.2)
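A direct, unoptimized implementation of Equation (3.1) can serve as a reference; practical encoders use fast factorizations of the transform, so this sketch is only meant to make the computation explicit.

```python
# Sketch: 2-D DCT of an N x N block, directly following Eq. (3.1).
from math import cos, pi, sqrt

def dct2(block):
    n = len(block)
    c = lambda x: 1 / sqrt(2) if x == 0 else 1.0
    coeffs = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for i in range(n):
                for j in range(n):
                    s += (block[i][j]
                          * cos((2 * i + 1) * u * pi / (2 * n))
                          * cos((2 * j + 1) * v * pi / (2 * n)))
            coeffs[u][v] = (2.0 / n) * c(u) * c(v) * s
    return coeffs

# A flat 8x8 block of value 128 has all its energy in the DC coefficient:
flat = [[128] * 8 for _ in range(8)]
print(round(dct2(flat)[0][0], 1))   # 1024.0; all other coefficients are ~0
```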
The lowest order coefficient is usually referred to as the DC-component, whereas the other components are referred to as AC-components.
3.1.3 Quantization
In typical video frames the energy is concentrated in the low frequency coefficients. That is, a few coefficients with u and v close to zero have a high significance for the representation of the original block. On the other hand, most higher frequency coefficients (i.e., F(u,v) for larger u and v) are small. In order
to compress this spatial frequency representation of the block, a quantization of the coefficients is performed. Two factors determine the amount of compression and the loss of information in this quantization:
1. Coefficients F(u,v) with an absolute value smaller than the quantizer threshold T are set to zero, i.e., they are considered to be in the so-called "dead zone".
2. Coefficients F(u,v) with an absolute value larger than or equal to the quantizer threshold T are divided by twice the quantizer step size Q and rounded to the nearest integer.
Fig. 3.5: Illustration of quantization, T = Q.
In summary, the quantized DCT coefficients I(u,v) are given by
I(u,v) = \begin{cases} 0 & \text{for } |F(u,v)| < T \\ \left[ \frac{F(u,v)}{2Q} \right] & \text{for } |F(u,v)| \geq T, \end{cases}    (3.3)
where [·] denotes rounding to the nearest integer. A quantizer with T = Q, as typically used in practice, is illustrated in Figure 3.5. Figure 3.10 continues the example from Figure 3.4 and shows the quantized values for T = Q = 16. As illustrated there, typically many DCT coefficients are zero after quantization [37]. The larger the step size, the larger the compression gain, as well as the loss of information [38]. The trade-off between compression and decodable image quality is controlled by setting the quantizer step size (and quantizer threshold) [30].
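The quantizer of Equation (3.3) can be sketched as follows; the rounding convention is an assumption, as implementations differ in such details.

```python
# Sketch: uniform quantizer with dead zone, following Eq. (3.3).
def quantize(coeffs, Q, T=None):
    T = Q if T is None else T            # T = Q is the typical choice
    out = []
    for row in coeffs:
        out.append([0 if abs(f) < T else int(round(f / (2 * Q))) for f in row])
    return out

def dequantize(levels, Q):
    # reconstruction at the decoder, which only knows the integer levels
    return [[2 * Q * l for l in row] for row in levels]

# The DC coefficient 1136 from Figure 3.4 with T = Q = 16:
print(quantize([[1136]], 16))   # [[36]]  (1136 / 32, rounded)
```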
Table 3.1: Default MPEG-4 quantization matrix for intra-coded video frames (from [39]).
 8 17 18 19 21 23 25 27
17 18 19 21 23 25 27 28
20 21 22 23 24 26 28 30
21 22 23 24 26 28 30 32
22 23 24 26 28 30 32 35
23 24 26 28 30 32 35 38
25 26 28 30 32 35 38 41
27 28 30 32 35 38 41 45
In modern video encoders, typically an optimized quantization step size for the individual coefficients is used. The fixed quantization steps are stored as quantization matrix and applied during the encoding process. The encoding quality is then controlled by the quantization scale factor q, which is multiplied with the quantization matrix before the quantization takes place. An exemplary quantization matrix is given in Table 3.1. The general trade-off between image quality and compression (frame size in bytes after quantization) is illustrated for the first frame from the Foreman test sequence [40], encoded with the MPEG-4 reference software [39] in Figures 3.6, 3.7, and 3.8. Notice that the quality of the video frame visibly decreases as q increases. In addition, the limitation of the block-based encoding becomes visible, as the blockiness of the image increases. As can be seen from the frame sizes, the quality loss is also reflected in the amount of data needed. The relationship between the quantization scale q and the size of the image can be captured in a quantization scale-rate curve as illustrated in Figure 3.9. We note that applying a very low quantization scale factor results in a very large frame size without any visual impairments, whereas applying medium quantization scale factors can yield visual impairments, yet a vastly reduced encoded frame size. This results in a trade-off decision between size and quality, which we will discuss further later. Our previous outlines have been with respect to quantization scale controlled encodings. Rate control can be applied during the encoding process to adjust the resulting video frame sizes to the bandwidth available. The quantization is adjusted in a closed–loop process (i.e., the result of the quantization is measured for its size and as required encoded again with a different quantizer step size) to apply a compression in dependence of video content and the resulting frame size. The result is a constant bit rate (CBR) video stream but with varying quantization and thus quality. The opposite of CBR is variable bit rate (VBR) encoding. Here the quantization process remains constant, thus it is referred to as open–loop encoding (i.e., the result of the quantization process is no longer subject to change in order to meet bandwidth requirements). To achieve a constant quality, VBR encoding has to be used [41, 42].
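How the base matrix of Table 3.1 and the quantization scale factor q interact can be sketched as follows. This is a simplified illustration; the exact scaling and rounding rules of the MPEG-4 reference encoder are not reproduced here.

```python
# Simplified sketch: quantization with the intra matrix of Table 3.1 scaled by a factor q.
INTRA_QUANT_MATRIX = [
    [ 8, 17, 18, 19, 21, 23, 25, 27],
    [17, 18, 19, 21, 23, 25, 27, 28],
    [20, 21, 22, 23, 24, 26, 28, 30],
    [21, 22, 23, 24, 26, 28, 30, 32],
    [22, 23, 24, 26, 28, 30, 32, 35],
    [23, 24, 26, 28, 30, 32, 35, 38],
    [25, 26, 28, 30, 32, 35, 38, 41],
    [27, 28, 30, 32, 35, 38, 41, 45],
]

def quantize_block(coeffs, q):
    """Divide each DCT coefficient by its matrix entry times the scale factor q."""
    return [[int(round(coeffs[u][v] / (INTRA_QUANT_MATRIX[u][v] * q)))
             for v in range(8)] for u in range(8)]

def count_nonzero(levels):
    return sum(1 for row in levels for l in row if l != 0)

# Larger q -> coarser quantization -> fewer non-zero coefficients -> smaller encoded frame.
```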
Fig. 3.6: First frame from the Foreman test sequence, encoded with MPEG-4 reference encoder, quantization scale q = 1 results in 78007 bytes.
Fig. 3.7: First frame from the Foreman test sequence, encoded with MPEG-4 reference encoder, quantization scale q = 15 results in 9392 bytes.
Fig. 3.8: First frame from the Foreman test sequence, encoded with MPEG-4 reference encoder, quantization scale q = 30 results in 4461 bytes.
3.1.4 Zig-Zag Scanning
The coefficient values obtained from the quantization are scanned by starting with the DC-component and then continuing to the higher frequency components in a zig-zag fashion, as illustrated in Figure 3.10. The zig-zag scanning
Fig. 3.9: Relationship between the quantization scale factor q and the resulting frame size in bit for the 1st frame of the Foreman sequence.
Fig. 3.10: Quantized DCT coefficients (Q = 16) and zig-zag scanning pattern (scan sequence 36, 2, 8, -4, -1, ...).
facilitates the subsequent variable length encoding by encountering the most likely non-zero elements first. Once all non-zero coefficients are scanned, the obtained sequence of values is further encoded to reduce codeword redundancy, see Section 3.1.5. The scanning can be stopped before collecting all quantized non-zero coefficients to achieve further (lossy) compression.
3.1.5 Variable Length Coding
The purpose of variable length coding (VLC) is to reduce the statistical redundancy in the sequence of codewords obtained from zig-zag scanning an intra-coded block (or block of differences for a predicted block). This is the
part of the video encoding process that provides the actual compression. The VLC uses an infinite set of code words and is applied only on the mapping of the symbols and thus reduces the necessity of redefining codewords [43]. The coding is based on a single, static table of codewords which results in a simple mapping process. Short codewords are assigned to values with high probabilities. Longer codewords are assigned to less probable outcomes of the quantization. The mapping between these original values and the code symbols is done within the variable length coding (VLC). The mapping has to be known by both, the sender and the receiver. As shown before, the quantization and zig–zag scanning result in a large number of zeros. These values are encoded using run–level coding. This encoding only transmits the number of zeros instead of the individual zeros. In addition, when no other values are trailing the coefficients with zeros (and this is the most likely case), an End of Block (EOB) codeword is inserted into the resulting bitstream. Huffman coding [44] and Arithmetic coding [45, 44] and their respective derivatives are used to implement VLC. Huffman coding is fairly simple to implement, but achieves lower compression ratios. On the other hand, arithmetic coding schemes are computationally more demanding, but achieve better compression. As processing power is abundant in many of today’s systems, newer codecs mostly apply arithmetic coding [30, 46]. The context–adaptive binary arithmetic coder (CABAC) [46] is one of these coding techniques and used in the H.264 video coding standard. The CABAC approach uses probability distributions to further reduce the space needed to store the encoded frame. Shorter symbols are assigned to bit patterns with a high probability of occurrence and longer symbols to bit patterns with a smaller probability of occurrence. This mapping process achieves lossless compression by porting the sequence of symbols into an interval of real numbers between 0 and 1 with respect to the symbol’s probability at the source. It is therefore exploiting additional correlation of symbols at the encoding side for further reduction of data to be stored for each frame.
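The zig-zag scan and the run-level representation described above can be sketched as follows; the (run, level) pairs and the end-of-block marker illustrate the principle, while the actual variable length codeword tables of a standardized codec are not reproduced here.

```python
# Sketch: zig-zag scan order for an N x N block and run-level coding of the scanned values.
def zigzag_order(n=8):
    # group coordinates by anti-diagonal and alternate the traversal direction per diagonal
    order = []
    for s in range(2 * n - 1):
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        order.extend(diag if s % 2 else diag[::-1])
    return order

def run_level(block):
    """Encode the zig-zag-scanned coefficients as (zero-run, value) pairs plus an EOB marker."""
    pairs, run = [], 0
    for i, j in zigzag_order(len(block)):
        v = block[i][j]
        if v == 0:
            run += 1
        else:
            pairs.append((run, v))
            run = 0
    pairs.append("EOB")          # trailing zeros are implied by the end-of-block marker
    return pairs

# With the quantized block of Figure 3.10 this yields (0, 36), (0, 2), (0, 8), (0, -4), ...
```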
3.2 Inter-frame Coding: Motion Estimation and Compensation In the previous sections, we introduced the typically employed algorithms in today’s video encoders that rely on the DCT transform for individual frames. This process, however, only encodes frames individually. As video consists of a series of frames or images with changing content over time, video encoders commonly employ inter-frame coding to reduce the temporal redundancy between successive frames (images). The basic idea of inter-frame coding is that the content of a given current video frame is typically similar to that of a past or future video frame. The past (or future) frame is used as a reference frame to predict the content of the current frame. This prediction is typically performed on a macroblock or block basis [47].
Fig. 3.11: Determination of the motion vector for a macroblock on frame n. For the inter-frame coding MPEG introduced the frame types Intra-coded (I), Inter-coded (P), and Bi-directional coded (B). In an I frame all blocks are intra-coded as outlined above. The macroblocks (four blocks of 8 × 8 samples per macroblock) in P frames are inter-coded with respect to the preceding I or P frame, while a B frame is intercoded with respect to the preceding I or P frame and the succeeding I or P frame. To inter-code a given macroblock, the best matching macroblock in the reference frame(s) is determined and identified by a motion vector (referred to as motion estimation). Any (typically small) difference between the block to be encoded and the best matching block is transformed using the DCT, quantized, and coded as outlined above (referred to as motion compensation); if a good match can not be found the macroblock is intra coded. (In the optional 4MV mode the above algorithms are applied to blocks instead of macroblocks.) For a given actual block in the current frame a block matching algorithm (BMA) searches for the most similar prediction block in the reference frame, as illustrated in Figure 3.11. The goal of the search is to determine the motion vector, i.e., the displacement vector of the most similar (macro)block on the reference frame to the (macro)block under consideration on the current frame. This search, which is also referred to as motion estimation, is performed over a specific range around the location of the block in the current frame, as illustrated as search boundary. The search algorithms are not limited to individual (macro)blocks, but rather expand on a pixel or even sub–pixel basis. Several different matching (similarity) criteria such as the cross-correlation function, mean squared error, or mean absolute error can be applied. As illustrated in Figure 3.11, several candidates for the (macro)block under consideration may exist on the reference frame. The candidate with the least difference from the (macro)block under consideration is selected and the according motion vector is used to determine the displacement. In case that the comparison is
Fig. 3.12: Illustration of the unrestricted motion estimation mode. performed on a block basis, the process yields four motion vectors for each macroblock. The enhancement of normal motion vectors is the revocation of picture boundaries as limits for the validity of a vector’s target, also known as unrestricted or extended motion vector mode. Since there is no content and thus data available for the outside of a picture, the pixels at the border are simply replicated to fill the nonexistent values needed as references. Figure 3.12 illustrates this algorithm. To find the best match by full search, (2n + 1)2 comparisons are required. Several fast motion estimation schemes such as the three step search [48] or the hierarchical block matching algorithm [49] have evolved to reduce the processing. Once the motion vector is determined, the difference between the prediction block and the actual block is encoded using the intra-frame coding techniques discussed in the preceding section. These differences may be due to lighting conditions, angles, and other factors that slightly change the content of the (macro)block. These differences are typically small and allow for efficient encoding with the intra-frame coding techniques (and variable length coding techniques). The quantizer step size can be set independently for the coding of these differences. The encoding of these differences accounts for the differences between the prediction block and the actual block, and is referred to as motion compensation. The inter–coded frame is represented by (i) the motion vectors (motion estimation), and (ii) the encoded error or difference between the current frame with the determined motion vectors and the reference frame (motion compensation). For the case that the motion estimation does not yield any matches within the search boundary or if encoding the motion vector and remaining difference would result in a larger size than applying intra-coding only to the (macro)block, the (macro)block is encoded using intra-coding techniques only. Newer video coding standards, such as the H.264 video coding standard, allow for multiple reference frames for a single frame under consideration. For motion estimation and compensation, the frame under consideration and all
Fig. 3.13: Frames 90, 91 and 92 from the News test sequence illustrating changing video frame content. reference frames have to be available to the encoder and decoder. This results in large memory requirements as the frame resolution increases. As hardware components for the encoder and decoder, such as computational speed and memory, become less and less of a restriction (in availability and price), the availability of multiple reference frames increases the likelihood of finding good matches for the (macro)block under consideration and thus increase the compression efficiency. Macroblocks and/or blocks in video frames often reveal parts of the background or scene that were not visible before the actual frame [37]. Motion vectors of these areas can therefore not be found by referencing previous frames, but only by considering also future frames. We illustrate this idea in Figure 3.13 for frames 90, 91 and 92 obtained from the News video sequence. The background content changes from frame 90 to frame 91. If only reference frames from the past were allowed, the changed background content in frame 91 would have to be coded in intra mode for this frame. If the following frame 92 is intra–coded by default, then the additional intra–coded (macro)blocks in frame 91 would reduce the compression efficiency. Video coding standards have incorporated this idea and inter-frame coding often considers both prediction from past reference frames as well as future reference frames. There are three basic methods for encoding the original pictures in the temporal domain: Intra coded frames, Predicted frames and Bi– directionally predicted frames, as introduced in the MPEG-1 standard [50]. These encoding methods are applied on the frame, macroblock or block level, depending on the codec. An intra–coded frame consists exclusively of intracoded macroblocks. Thus, an intra–coded frame contains the compressed image information (without any prediction information), resulting in a large frame size (compared to the size of the inter– or bidirectional–coded frames). The inter–coded frames use motion estimation and compensation techniques relying on the previous inter– or intra–coded frame. The bi–directional encoded frames rely on a previous as well as a following intra– or inter–coded frame. This prediction information results in smaller frame sizes for the P– frames and even smaller frame sizes for the B–frames. When B frames do not have any following I– or P–frames that can be used as reference frames, no
Fig. 3.14: Typical MPEG Group of Pictures (GoP) consisting of I, P and B frames (frames 1–12).

Intra-coded frames or blocks do not rely on other video frames and are thus important to stop error propagation. The sequence of frames starting with an intra-coded frame up to, but not including, the next intra-coded frame is referred to as a Group of Pictures (GoP). The relationship between these different encoding types and how frames rely on each other in a typical MPEG frame sequence consisting of 12 frames is illustrated in Figure 3.14. At one extreme, the video sequence may contain only a single I-frame at the beginning, in which case the entire video sequence is a single GoP. At the other extreme, there are no P or B frames at all, in which case the GoP length is 1 and each video frame is encoded independently, similar to individual pictures.
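To make the search step of motion estimation concrete, the following Python sketch implements one exhaustive (full search) block matching step with the sum of absolute differences (SAD) as the matching criterion. The function name, the 16 × 16 block size, and the SAD criterion are illustrative assumptions, not details of any particular reference encoder; the residual between the block and its best match is what motion compensation subsequently encodes.

```python
import numpy as np

def full_search(ref, cur, bx, by, B=16, n=7):
    """Exhaustive block matching: find the motion vector for the BxB block of the
    current frame with top-left corner (by, bx), testing all (2n+1)^2 displacements."""
    block = cur[by:by + B, bx:bx + B].astype(np.int64)
    best, best_sad = (0, 0), np.iinfo(np.int64).max
    for dy in range(-n, n + 1):
        for dx in range(-n, n + 1):
            y, x = by + dy, bx + dx
            # restricted mode: candidate blocks must lie fully inside the picture
            # (unrestricted mode would instead pad ref by edge replication)
            if y < 0 or x < 0 or y + B > ref.shape[0] or x + B > ref.shape[1]:
                continue
            cand = ref[y:y + B, x:x + B].astype(np.int64)
            sad = np.abs(block - cand).sum()   # sum of absolute differences
            if sad < best_sad:
                best_sad, best = sad, (dy, dx)
    return best, best_sad
```

Fast schemes such as the three step search reduce the number of candidate displacements that are evaluated, at the cost of possibly missing the globally best match.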
3.3 Scalable Video Encoding

With scalable encoding the encoder typically produces multiple layers. The base layer provides a basic quality (e.g., low spatial or temporal resolution video) and adding enhancement layers improves the video quality (e.g., increases the spatial resolution or frame rate). A variety of scalable encoding techniques have been developed, which we will introduce in this section. Scalable encoding is a convenient way to adapt to the wide variety of video-capable hardware (e.g., PDAs, cell phones, laptops, desktops) and delivery networks (e.g., wired vs. wireless) [51, 52]. Each of these devices has different constraints due to processing power, viewing size, and so on. Scalable encoding can satisfy these different constraints with one encoding of the video. We briefly note that an alternative to scalable encoding is to encode the video into different versions, each with a different quality level, bit rate, or spatial/temporal resolution. The advantage of having different versions is that it does not require the more sophisticated scalable encoders and does not incur the extra overhead due to the scalable encoding. The drawback is that the multiple versions take up more space on servers and may need to be streamed simultaneously (simulcast) over the network so that the appropriate version can be chosen at any given time.
Fig. 3.15: Data partitioning by priority break point setting.
Transcoding is another alternative to scalable encoding. Transcoding can be used to adapt to different network conditions, as in [53], or to adapt to different desired video formats [54]. The transcoding approach typically requires a high performance intermediate node. Having given an overview of the general issues around scalable video encoding, we now introduce the different approaches to scalable encoding; we refer the interested reader to [55] for more details.

3.3.1 Data Partitioning

Though not explicitly a scalable encoding technique, data partitioning divides the bitstream of non-scalable video standards such as MPEG-2 [56] into two parts. The base layer contains critical data such as the motion vectors and the low-order DCT coefficients, while the enhancement layer contains, for example, the higher order DCT coefficients [32]. The priority break point determines where in the quantization and scanning process the encoding of coefficients into the base layer stops [30], as shown in Figure 3.15. The remaining coefficients are then encoded by resuming the zig-zag scanning pattern at the break point and are stored in the enhancement layer.

3.3.2 Temporal Scalability

Temporal scalability reduces the number of frames in the base layer. The removed frames are encoded into the enhancement layer and reference the frames of the base layer. Different patterns for combining frames in the base and enhancement layers exist [57]. Figure 3.16 shows an enhancement layer consisting of all B-frames, as already used in video trace evaluations [58]. No other frames depend on the successful decoding of B-frames. If the enhancement layer is not decodable, the decoding of the other frame types is not affected.
Fig. 3.16: Temporal scalability with all B-frames in the enhancement layer.

Nevertheless, since the number of frames that are reconstructed changes, the frame rate has to be adjusted accordingly for viewing and for the quality evaluation methods (e.g., the last successfully decoded frame is displayed for a longer period, also called freezing). This adjustment inflicts the loss in viewable video quality.

3.3.3 Spatial Scalability

Scalability in the spatial domain applies different resolutions to the base and the enhancement layers. If, for example, the original sequence is in the CIF format (352 × 288), the base layer is downsampled into the QCIF format (176 × 144) prior to the encoding. Spatial scalability is therefore also known as pyramid coding. In addition to the application of different resolutions, different GoP structures are used in the two layers. The GoP pattern in the enhancement layer references the frames in the base layer. An exemplary layout of the resulting dependencies is illustrated in Figure 3.17. The content of the enhancement layer is the difference between the layers, as well as the frame-based reference of previous and following frames of the same layer. A study of the traffic and quality characteristics of temporal and spatial scalable encoded video is given in [58].

3.3.4 SNR Scalability

SNR scalability provides two (or more) different video layers of the same resolution but with different qualities. The base layer is coded by itself and provides a basic quality in terms of the (P)SNR. The enhancement layer is encoded to provide additional quality when added back to the base layer. The encoding is performed in two consecutive steps: first the base layer is encoded with a low quality, then the difference between the decoded base layer and the input video is encoded with higher quality settings in a second step [55], as illustrated in Figure 3.18.
Fig. 3.17: Example for spatial scalability and cross–layer references.
Fig. 3.18: Example for SNR scalability.
At the receiver side, the base quality is obtained simply by decoding the base layer. For enhanced quality, the enhancement layer is decoded and the result is added to the base layer. There is no explicit need for both layers to be encoded with the same video compression standard, though for ease of use it is advisable to do so.

3.3.5 Object Scalability

Another scalability feature is possible within video standards that support the composition of video frames out of several different objects, such as MPEG-4 [57]. The base layer contains only the information that could not be identified as, or fitted into, video objects. The enhancement layer(s) are made up of the respective information for the video objects, such as shape and texture.
Fig. 3.19: Example for object–based scalability.
The example shown in Figure 3.19 presents a case where the background (landscape) and an object (car) were separated. In this case, the background is encoded independently of the object.

3.3.6 Fine Granular Scalability (FGS)

Fine Granular Scalability (FGS) is a relatively new form of scalable video encoding [59] that has recently been added to the MPEG-4 video coding standard [60] in order to increase the flexibility of video streaming. With FGS, the video is encoded into a base layer (BL) and one enhancement layer (EL). Similar to conventional scalable video coding, the base layer must be received completely in order to decode and display a basic quality video. The enhancement layer has the special property that it can be cut at any bit rate, and the received part of the FGS enhancement layer stream can still be successfully decoded and improves upon the basic video quality [59, 61]. FGS thus removes the restriction of conventional layered encoding where an enhancement layer must be completely received for successful decoding. Similar to conventional scalable encoding, the FGS enhancement layer is hierarchical in that "higher" bits require the "lower" bits for successful decoding. This means that when cutting the enhancement layer bit stream before transmission, the lower part of the bit stream (below the cut) needs to be transmitted and the higher part (above the cut) can be dropped. The FGS enhancement layer can be cut at the granularity of bits, as illustrated in Figure 3.20. The flexibility of FGS makes it attractive for video streaming, as video servers can adapt the streamed video to the available bandwidth in real time (without requiring any computationally demanding re-encoding). However, this flexibility comes at the expense of reduced coding efficiency. Following standardization, the refinement and evaluation of FGS video coding has received considerable interest [62, 63, 64, 65, 66, 67, 68, 69, 70, 71, 72, 73, 74]. There is no motion compensation within the FGS enhancement layer.
Fig. 3.20: Example of truncating the FGS enhancement layer before transmission.

This absence of motion compensation makes the enhancement layer highly resilient to transmission errors, and consequently well suited to transmission over error-prone networks such as the best-effort Internet. A typical scenario for transmitting MPEG-4 FGS encoded videos over the Internet has been proposed by the MPEG-4 committee in [75]. In this scenario the base layer is transmitted with high reliability (achieved through appropriate resource allocation and/or channel error correction) and the FGS enhancement layer is transmitted with low reliability (i.e., in a best effort manner and without error control). We close this brief overview of MPEG-4 FGS encoding by noting that the MPEG-4 standard includes several refinements to the basic SNR FGS approach outlined above and also a temporal scalable FGS mode, which are beyond the scope of our study. (A streaming mechanism adapting the video by adding and dropping the SNR FGS and temporal FGS enhancement layers is studied in [76].) We also note that a Progressive FGS (PFGS) refinement has recently been proposed [77, 73], but not yet standardized. In contrast to MPEG-4 FGS, PFGS allows for partial motion compensation among the FGS bit-planes, while still achieving the fine granularity property. This motion compensation typically improves the coding efficiency, but lowers the error resilience of the enhancement layer [78].
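As an illustration of the cut-anywhere property, the following minimal sketch truncates a per-frame FGS enhancement layer trace to a chosen enhancement layer rate. The function and variable names are hypothetical; in practice a streaming server would perform this cut on the actual EL bitstream, keeping only the prefix (the lower bit planes) of each frame.

```python
def truncate_fgs_el(el_sizes_bits, frame_period_s, el_rate_bps):
    """Cut each frame's FGS enhancement layer to the per-frame bit budget implied
    by the chosen EL rate; the kept prefix remains decodable because the FGS EL
    is embedded (lower bit planes do not depend on higher ones)."""
    budget = int(el_rate_bps * frame_period_s)      # EL bits available per frame
    return [min(size, budget) for size in el_sizes_bits]
```

The server can recompute the budget whenever the available bandwidth estimate changes, which is exactly the real-time adaptation without re-encoding described above.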
3.3.7 Multiple Description Coding (MDC)

With Multiple Description Coding (MDC) [79], the video is encoded into several sub-streams referred to as descriptions. Each description is conveyed toward the receiver. Decoding more descriptions gives a higher video quality, while decoding only an arbitrary subset of the descriptions results in a lower quality. The individual descriptions have no explicit hierarchy or dependency among them, i.e., any combination of the descriptions can be combined and decoded [80]. This is in contrast to conventional hierarchical layered video, where a received enhancement layer is useless if the corresponding base layer is missing, as is the case for FGS. The advantages of MDC have been studied for multi-hop networks [81, 82], Orthogonal Frequency Division Multiplexing (OFDM) [83], Multiple Input Multiple Output (MIMO) systems [84], ad-hoc networks [85], the Universal Mobile Telecommunications System (UMTS) [86], the Transport Control Protocol (TCP) [87], and Content Delivery Networks (CDN) [88]. MDC is especially interesting for the support of heterogeneous terminals in 4G networks, as advocated in [89]. Future Fourth Generation (4G) mobile systems are envisioned to offer wireless services to a wide variety of mobile terminals, ranging from cellular phones and Personal Digital Assistants (PDAs) to laptops [90]. This wide variety of mobile terminals is referred to as heterogeneous terminals. Heterogeneous terminals differ in processing power, memory, storage space, battery life, and data rate capabilities. Unlike DVB-T and DVB-H, where the same spectrum is reserved for the support of each technology in a time multiplex fashion, heterogeneous terminals in 4G should use the same spectrum when the users are interested in the same services, in order to use the spectrum efficiently. In a multicast scenario, high class terminals would receive a large number of streams, while low class terminals would receive a smaller number. Note that the sub-streams of the low class terminal are also received by the high class terminal. The flexibility of the bandwidth assigned to each descriptor and of the number of descriptors assigned to end users makes MDC a very attractive coding scheme for 4G networks. The advantage of multiple description coding is typically achieved at the expense of reduced video compression gains, and existing video traffic characterizations for single and multiple layer coding, as presented in the preceding sections, cannot be used as they would underestimate the required bandwidth.
3.4 Wavelet-Based Video Encoding

With wavelet transform coding [91], a video frame is not divided into blocks as with the DCT-based MPEG coding. Instead, the entire frame is coded into several subbands using the wavelet transform. The wavelet transform has many advantages over the DCT. The most obvious of these is the compact-support property.
Fig. 3.21: Block Diagram of the MC-3DEZBC wavelet encoder [93].
Fig. 3.22: First wavelet decomposition stage.

This compact support allows a time-domain function to be translated into a representation that is localized not only in frequency, but in time as well. The net result is that the wavelet transform can be applied over the entire image within a reasonable computational and bit budget. The obvious visual advantage is that block artifacts, common in DCT-based coding, are eliminated in the wavelet transform. The wavelet transform codec which we introduce here is the MC-3DEZBC [92]. The block diagram of the MC-3DEZBC codec in Figure 3.21 illustrates the complete codec, including the temporal decomposition and the motion estimation. Each video frame undergoes a four-stage spatial decomposition, which is recursively performed on the low frequency subband. The first stage of the filter bank structure used for the spatial decomposition is illustrated in Figure 3.22. Here $X_n$ is the input image. $*_v$ and $*_h$ represent convolution in the vertical and in the horizontal direction, respectively. The impulse responses of the low pass filter and the high pass filter are represented by $h_L$ and $h_H$, respectively. An arrow pointing downwards followed by the number 2 represents subsampling by two in the horizontal or vertical direction (indicated by the subscript preceding the arrow).
Fig. 3.23: Passband structure for MC-3DEZBC [93].

$HL_1$, $LH_1$, and $HH_1$ represent the outputs of the filters of the first decomposition stage. Each stage creates three subbands, while the fourth (which is the lowest frequency subband in both the horizontal and the vertical dimensions) is fed into the next stage of the spatial decomposition. The four-stage decomposition provides 13 subbands, as illustrated in Figure 3.23. These 13 subbands obtained from the four decomposition stages are then coded individually using the 3D version of the embedded zerotree block coding algorithm, 3D-EZBC [92]. This is an extension of the embedded zerotree block coding (EZBC) algorithm developed in [94]. The resulting bit streams are then bit plane encoded and combined to form one sub-stream, as illustrated in Figure 3.24. For easier illustration, each sub-stream in Figure 3.24 is color coded such that it matches the corresponding color in Figure 3.23. All sub-streams of each frame and all frames in the corresponding GoP are then combined to create a hierarchical code stream [93]. Each GoP is coded as a separate message with context-dependent arithmetic coding. Each message is embedded; thus the bitstream can be truncated at any point to meet a given bit budget.
Fig. 3.24: Individually coded sub-bitstreams corresponding to Figure 3.23 [93].

Rate control is implemented on each GoP with the bit budget given by $R_g = N_g \cdot r / F$ (bits), where $N_g$ denotes the number of frames in a GoP, $r$ the given bit rate in bits/sec, and $F$ the frame rate of the image sequence in frames/sec.
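For illustration, the following Python sketch performs one stage of the separable spatial decomposition of Figure 3.22, using simple Haar filters as a stand-in for the actual MC-3DEZBC filters (an assumption made purely for brevity). Applying the stage recursively four times to the returned LL subband yields the 13 subbands of Figure 3.23 (three detail subbands per stage plus the final low-frequency subband).

```python
import numpy as np

def analysis_stage(x, h_lo=(0.7071, 0.7071), h_hi=(0.7071, -0.7071)):
    """One stage of the separable 2-D analysis filter bank: horizontal filtering
    with column subsampling, then vertical filtering with row subsampling."""
    def filt(a, h, axis):
        return np.apply_along_axis(lambda v: np.convolve(v, h, mode="same"), axis, a)
    lo = filt(x, h_lo, axis=1)[:, ::2]    # horizontal low-pass, keep every 2nd column
    hi = filt(x, h_hi, axis=1)[:, ::2]    # horizontal high-pass
    LL = filt(lo, h_lo, axis=0)[::2, :]   # fed into the next decomposition stage
    LH = filt(lo, h_hi, axis=0)[::2, :]
    HL = filt(hi, h_lo, axis=0)[::2, :]
    HH = filt(hi, h_hi, axis=0)[::2, :]
    return LL, LH, HL, HH
```

The embedded coding of the resulting subbands is then truncated per GoP to the bit budget $R_g = N_g \cdot r / F$ given above.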
3.5 Video Coding Standards

Video compression is undergoing constant changes as new coding/decoding (codec) systems are being developed and introduced to the market. Nevertheless, the internationally standardized video compression schemes, such as the H.26x and MPEG-n standards, are based on a common set of fundamental encoding principles, which we reviewed in the previous sections. The sizes of the pictures in the current video formats are illustrated in Figure 3.25. Note that the ITU-R/CCIR 601 format (i.e., the common TV image format) and the CIF and QCIF formats have the same ratio of width to height. In contrast, the High Definition Television (HDTV) image format has a larger width to height ratio, i.e., is perceived as "wider". Each individual image is composed of picture elements (usually referred to as pixels or pels). The specific width and height (in pixels) of the different formats are summarized in Table 3.2. Today, typical formats for wireless video are QCIF (176 × 144 pixel) and CIF (352 × 288 pixel). Despite a large variety of video coding and decoding systems (e.g., the proprietary RealMedia codec), standardization on an international level is performed by two major bodies: ITU-T and ISO/MPEG. The early H.261 codec of the ITU-T was focused on delivering video over ISDN networks with a fixed bit rate of n × 64 kbit/s, where n denotes the number of multiplexed ISDN lines. From this starting point, codecs were developed for different purposes, such as the storage of digital media or delivery over packet-oriented networks.
Fig. 3.25: Illustration of image formats.

Table 3.2: Characteristics for different video formats.

                       QCIF              CIF               TV                    HDTV
Standard               ITU-T H.261       ITU-T H.261       ITU-R/CCIR-601        ITU-R 709-3
Format                 PAL / NTSC        PAL / NTSC        PAL / NTSC            PAL / NTSC
                       [25 Hz] [30 Hz]   [25 Hz] [30 Hz]   [25 Hz] [30 Hz]       [25 Hz] [30 Hz]
Sub-sampling           4:2:0             4:2:0             4:2:2                 4:2:2
Columns (Y)            176               352               720                   1920
Rows (Y)               144               288               576 / 480             1080
Columns (U,V)          88                176               360                   960
Rows (U,V)             72                144               576 / 480             1080
Frame size [byte]      38016             152064            1244160 / 1036800     4147200
Data Rate [Mbit/s]     7.6 / 9.1         30.4 / 36.5       248.8 / 298.6         829.4 / 995.3
The latest codec development, H.264 (also known as MPEG-4 Part 10, H.264/AVC), has recently been finalized by a Joint Video Team (JVT) of the ITU-T and the ISO/MPEG standardization bodies. The evolving standards achieved better quality at lower bit rates, and thus better rate-distortion performance, as time progressed. Figure 3.26 sketches an overview of the video standards development to date. The H.264 video coding standard differs from its predecessors (the ITU-T H.26x video standard family and the MPEG standards MPEG-2 and MPEG-4) in providing a high compression video coding layer (VCL) for storage optimization as well as a network abstraction layer (NAL) for the packetization of the encoded bit stream according to transmission requirements [95]. An overview of these layers is given in Figure 3.27. The network abstraction layer varies according to the underlying network type (e.g., 802.3, 802.11x, UMTS, and others).
Fig. 3.26: Video coding standards of ITU–T, ISO/MPEG and the Joint Video Team (JVT).
Fig. 3.27: Block diagram of an H.26L coder.
To handle abrupt changes in the bit stream and the loss of parts of pictures or structures, the H.264 standard provides the possibility of refreshing the pictures on a macroblock level. Additionally, refresh frames (intra picture refresh) are used to stop the prediction process of frames that reference lost or erroneous frames. Furthermore, the standard provides the possibility to switch between several differently encoded streams, avoiding the high computational effort (and thus high power consumption) for encoding and decoding that is typically associated with transcoding. The stream switching functionality allows for non-real-time encoding and real-time, bandwidth-based selection of streams encoded with different quantization and/or GoP settings. The motion estimation is performed over multiple reference frames (see the H.263++ standard, Annex U, long term memory prediction) and works beyond the picture boundaries as given by unrestricted motion vectors. The H.264 video coding standard includes several additional features and novelties, for which we refer the interested reader to [36].
Part II
Video Traces and Statistics
4 Metrics and Statistics for Video Traces
In this chapter, we review the statistical definitions and methods used in the analysis of the generated video traces. We refer the interested reader to [96, 97] for details on the statistical properties. Let $N$ denote the number of video frames in a given trace. Let $t_n$, $n = 0, \ldots, N-1$, denote the frame period (display time) of frame $n$. Let $T_n$, $n = 1, \ldots, N$, denote the cumulative display time up to (and including) frame $n-1$, i.e., $T_n = \sum_{k=0}^{n-1} t_k$ (and define $T_0 = 0$). Let $X_n$, $n = 0, \ldots, N-1$, denote the frame size (number of bits) of the encoded (compressed) video frame $n$. Let $Q_n^Y$, $n = 0, \ldots, N-1$, denote the quality (in terms of the Peak Signal to Noise Ratio (PSNR)) of the luminance component of the encoded (and subsequently decoded) video frame $n$ (in dB). Similarly, let $Q_n^U$ and $Q_n^V$, $n = 0, \ldots, N-1$, denote the qualities of the two chrominance components hue (U) and saturation (V) of the encoded video frame $n$ (in dB).
4.1 Video Frame Size

The (arithmetic) sample mean $\bar{X}$ of a frame size trace is estimated as
$\bar{X} = \frac{1}{N} \sum_{n=0}^{N-1} X_n$.   (4.1)
The sample variance $S_X^2$ of a frame size trace is estimated as
$S_X^2 = \frac{1}{N-1} \sum_{n=0}^{N-1} (X_n - \bar{X})^2$.   (4.2)
A computationally more convenient expression for $S_X^2$ is
$S_X^2 = \frac{1}{N-1} \left[ \sum_{n=0}^{N-1} X_n^2 - \frac{1}{N} \left( \sum_{n=0}^{N-1} X_n \right)^2 \right]$.   (4.3)
The coefficient of variation $CoV_X$ of the frame size trace is defined as
$CoV_X = \frac{S_X}{\bar{X}}$.   (4.4)
The maximum frame size $X_{\max}$ is defined as
$X_{\max} = \max_{0 \le n \le N-1} X_n$.   (4.5)
We define the aggregated frame size trace with aggregation level $a$ as
$X_n^{(a)} = \frac{1}{a} \sum_{j=na}^{(n+1)a-1} X_j$,  for $n = 0, \ldots, N/a - 1$,   (4.6)
i.e., the aggregated frame size trace is obtained by averaging the original frame size trace $X_n$, $n = 0, \ldots, N-1$, over non-overlapping blocks of length $a$. We define the GoP size trace as
$Y_m = \sum_{n=mG}^{(m+1)G-1} X_n$,  for $m = 0, \ldots, N/G - 1$,   (4.7)
where $G$ denotes the number of frames in a GoP (typically $G = 12$). Note that $Y_m = G \cdot X_m^{(G)}$.

4.1.1 Autocorrelation

The autocorrelation function [98] can be used to detect non-randomness in data, or to identify an appropriate time series model if the data are not random. One basic assumption is that the observations are equispaced. The autocorrelation is expressed as a correlation coefficient, referred to as the autocorrelation coefficient (acc). Instead of calculating the correlation between two different variables, such as size and quality, the correlation is calculated for the values of the same variable at positions $n$ and $n+k$. When the autocorrelation is used to detect non-randomness, usually only the first (lag $k = 1$) autocorrelation is of interest. When the autocorrelation is used to identify an appropriate time series model, the autocorrelations are usually plotted for a range of lags $k$. The autocorrelation coefficient $\rho_X(k)$ for lag $k$, $k = 0, 1, \ldots, N-1$, is estimated as
$\rho_X(k) = \frac{1}{N-k} \sum_{n=0}^{N-k-1} \frac{(X_n - \bar{X})(X_{n+k} - \bar{X})}{S_X^2}$.   (4.8)
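The autocorrelation coefficient of (4.8) can be computed directly from a frame size trace; the following Python sketch is one possible implementation (the function name is ours).

```python
import numpy as np

def acc(x, k):
    """Autocorrelation coefficient of the frame size trace x at lag k, cf. (4.8)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    xbar = x.mean()
    s2 = np.sum((x - xbar) ** 2) / (n - 1)                    # sample variance, cf. (4.2)
    return np.sum((x[:n - k] - xbar) * (x[k:] - xbar)) / ((n - k) * s2)
```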
$S_X^2 = \frac{1}{N-1} \sum_{n=0}^{N-1} (X_n - \bar{X})^2$;
foreach a = 12, 24, 48, 96, ... do
    $M = N/a$;
    $X_n^{(a)} = \frac{1}{a} \sum_{j=na}^{(n+1)a-1} X_j$, for n = 0, ..., M−1;
    $S_X^{2(a)} = \frac{1}{M-1} \sum_{n=0}^{M-1} (X_n^{(a)} - \bar{X})^2$;
    plot point $\left( \log_{10} a,\ \log_{10}(S_X^{2(a)}/S_X^2) \right)$;
end
Algorithm 1: Algorithm for determining the variance–time plot.
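A possible Python rendering of Algorithm 1, including the slope-based Hurst estimate $H = \mathrm{slope}/2 + 1$ used in the variance–time test described next, is sketched below. The aggregation levels and the $a \ge 192$ cutoff follow the text; the function names are ours.

```python
import numpy as np

def variance_time_points(x, levels=(12, 24, 48, 96, 192, 384, 768)):
    """Points (a, S_X^2(a)/S_X^2) of the variance-time plot, cf. Algorithm 1."""
    x = np.asarray(x, dtype=float)
    s2 = x.var(ddof=1)                               # sample variance of the trace
    pts = []
    for a in levels:
        m = len(x) // a
        xa = x[:m * a].reshape(m, a).mean(axis=1)    # aggregated trace X^(a)
        pts.append((a, xa.var(ddof=1) / s2))         # normalized variance
    return pts

def hurst_from_vt(pts, min_a=192):
    """Least-squares slope of the log-log plot over large a; H = slope/2 + 1."""
    sel = [(a, v) for a, v in pts if a >= min_a]
    la = np.log10([a for a, _ in sel])
    lv = np.log10([v for _, v in sel])
    slope = np.polyfit(la, lv, 1)[0]
    return slope / 2 + 1
```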
4.1.2 Variance–Time Test

The variance–time plot [99, 100, 101] is obtained by plotting the normalized variance of the aggregated trace $S_X^{2(a)}/S_X^2$ as a function of the aggregation level ("time") $a$ in a log–log plot, as detailed in Algorithm 1. Traces without long range dependence eventually (for large $a$) decrease linearly with a slope of $-1$ in the variance–time plot. Traces with long range dependence, on the other hand, eventually decrease linearly with a flatter slope, i.e., a slope larger than $-1$. We consider aggregation levels that are multiples of the GoP size (12 frames) to avoid the effect of the intra-GoP correlations. For reference purposes we plot a line with slope $-1$ starting at the origin. For the estimation of the Hurst parameter we estimate the slope of the linear part of the variance–time plot using a least squares fit. We consider the aggregation levels $a \ge 192$ in this estimation, since our variance–time plots are typically linear for these aggregation levels. The Hurst parameter is then estimated as $H = \mathrm{slope}/2 + 1$.

4.1.3 R/S Statistic

We use the R/S statistic [99, 100, 102] to investigate the long range dependence characteristics of the generated traces. The R/S statistic provides a heuristic graphical approach for estimating the Hurst parameter $H$. Roughly speaking, for long range dependent stochastic processes the R/S statistic is characterized by $E[R(n)/S(n)] \sim c\, n^H$ as $n \to \infty$ (where $c$ is some positive finite constant). The Hurst parameter $H$ is estimated as the slope of a log–log plot of the R/S statistic. More formally, the rescaled adjusted range statistic (for short, R/S statistic) is plotted according to the algorithm given in Algorithm 2.
foreach d = 12, 24, 48, 96, ... do
    $I = K + 1 - \frac{dK}{N}$;
    foreach i = 1, ..., I do
        $t_i = (i-1)\frac{N}{K} + 1$;
        $\bar{X}(t_i, d) = \frac{1}{d} \sum_{j=0}^{d-1} X_{t_i+j}^{(a)}$;
        $S^2(t_i, d) = \frac{1}{d} \sum_{j=0}^{d-1} \left[ X_{t_i+j}^{(a)} - \bar{X}(t_i, d) \right]^2$;
        $W(t_i, k) = \sum_{j=0}^{k-1} X_{t_i+j}^{(a)} - k\,\bar{X}(t_i, d)$;
        $R(t_i, d) = \max\{0, \max_{1 \le k \le d} W(t_i, k)\} - \min\{0, \min_{1 \le k \le d} W(t_i, k)\}$;
        plot point $\left( \log d,\ \log \frac{R(t_i, d)}{S(t_i, d)} \right)$;
    end
end
Algorithm 2: Algorithm for the R/S statistic plot.

The R/S statistic $R(t_i, d)/S(t_i, d)$ is computed for logarithmically spaced values of the lag $d$, starting with $d = 12$ (to avoid the effect of intra-GoP correlations). For each lag value $d$, as many as $K$ samples of R/S are computed by considering different starting points $t_i$; we set $K = 10$ in our analysis. The starting points must satisfy $(t_i - 1) + d \le N$; hence the actual number of samples $I$ is less than $K$ for large lags $d$. Plotting $\log[R(t_i, d)/S(t_i, d)]$ as a function of $\log d$ gives the rescaled adjusted range plot (also referred to as the pox diagram of R/S). A typical pox diagram starts with a transient zone representing the short range dependence characteristics of the trace. The plot then settles down and fluctuates around a straight "street" of slope $H$. If the plot exhibits this asymptotic behavior, the asymptotic Hurst exponent $H$ is estimated from the street's slope using a least squares fit. To verify the robustness of the estimate, we repeat this procedure for each trace for different aggregation levels $a \ge 1$. The Hurst parameter, or self-similarity parameter, $H$, is a key measure of self-similarity [103, 104]. $H$ is a measure of the persistence of a statistical phenomenon and of the length of the long range dependence of a stochastic process. A Hurst parameter of $H = 0.5$ indicates the absence of self-similarity, whereas values approaching $H = 1$ indicate a high degree of persistence, i.e., long-range dependence.

4.1.4 Periodogram

We estimate the Hurst parameter $H$ using heuristic least squares regression in the spectral domain; see [99, Sec. 4.6] for details. This approach relies
on the periodogram $I(\lambda)$ as an approximation of the spectral density, which near the origin satisfies
$\log I(\lambda_k) \approx \log c_f + (1 - 2H) \log \lambda_k + \log \xi_k$.   (4.9)
To estimate the Hurst parameter $H$ we plot the periodogram in a log–log plot, as detailed in Algorithm 3.

$M = N/a$;
foreach n = 0, 1, ..., M−1 do
    $X_n^{(a)} = \frac{1}{a} \sum_{j=na}^{(n+1)a-1} X_j$;
    $Z_n^{(a)} = \log_{10} X_n^{(a)}$;
end
foreach k = 1, 2, ..., (M−1)/2 do
    $\lambda_k = \frac{2\pi k}{M}$;
    $I(\lambda_k) = \frac{1}{2\pi M} \left| \sum_{n=0}^{M-1} Z_n^{(a)} e^{-jn\lambda_k} \right|^2$;
    plot point $(\log_{10} \lambda_k,\ \log_{10} I(\lambda_k))$;
end
Algorithm 3: Algorithm for periodogram.
(Note that the expression inside the $|\cdot|$ corresponds to the Fourier transform coefficient at frequency $\lambda_k$, which can be efficiently evaluated using Fast Fourier Transform techniques.) For the Hurst parameter estimation we define
$y_k = \log_{10} I(\lambda_k)$,  $x_k = \log_{10} \lambda_k$,   (4.10)
$\beta_0 = \log_{10} c_f - 0.577215$,  $\beta_1 = 1 - 2H$,   (4.11)
$e_k = \log_{10} \xi_k + 0.577215$.   (4.12)
With these definitions we can rewrite (4.9) as
$y_k = \beta_0 + \beta_1 x_k + e_k$.   (4.13)
We estimate $\beta_0$ and $\beta_1$ from the samples $(x_k, y_k)$, $k = 1, 2, \ldots, 0.7 \cdot (N/a - 2)/2 := K$, using least squares regression, i.e.,
$\beta_1 = \frac{K \sum_{k=1}^{K} x_k y_k - \sum_{k=1}^{K} x_k \sum_{k=1}^{K} y_k}{K \sum_{k=1}^{K} x_k^2 - \left( \sum_{k=1}^{K} x_k \right)^2}$   (4.14)
and
$\beta_0 = \frac{\sum_{k=1}^{K} y_k - \beta_1 \sum_{k=1}^{K} x_k}{K}$.   (4.15)
The Hurst parameter is then estimated as $H = (1 - \beta_1)/2$. We plot the periodogram (along with the fitted line $y = \beta_0 + \beta_1 x$) and estimate the Hurst parameter in this fashion for the aggregation levels a = 12, 24, 48, 96, 192, 300, 396, 504, 600, 696, and 792.

4.1.5 Logscale Diagram

We jointly estimate the scaling parameters $\alpha$ and $c_f$ using the wavelet-based approach of Veitch and Abry [105], where $\alpha$ and $c_f$ characterize the spectral density
$f_X(\lambda) \sim c_f |\lambda|^{-\alpha}$,  $|\lambda| \to 0$.   (4.16)
The estimation is based on the logscale diagram, which is a plot of $\log_2(\mu_j)$ as a function of $\log_2 j$, where
$\mu_j = \frac{1}{n_j} \sum_{k=1}^{n_j} |d_X(j,k)|^2$   (4.17)
is the sample variance of the wavelet coefficients $d_X(j,k)$, $k = 1, \ldots, n_j$, at octave $j$. The number of available wavelet coefficients at octave $j$ is essentially $n_j = N/2^j$. We plot the logscale diagram for octaves 1 through 14 using the code provided by Veitch and Abry [105]. We use the Daubechies 3 wavelet to eliminate linear and quadratic trends [106]. We use the automated choosenewj1 approach [105] to determine the range of scales (octaves) for the estimation of the scaling parameters.

4.1.6 Multiscale Diagram

We investigate the multifractal scaling properties [105, 106, 107, 108, 109, 110, 111, 112, 113, 114] using the wavelet-based framework [109]. In this framework the qth order scaling exponent $\alpha_q$ is estimated based on the qth order logscale diagram, i.e., a plot of
$\log_2(\mu_j^{(q)}) = \log_2 \left( \frac{1}{n_j} \sum_{k=1}^{n_j} |d_X(j,k)|^q \right)$   (4.18)
as a function of $\log_2 j$. The multiscale diagram is then obtained by plotting $\zeta(q) = \alpha_q - q/2$ as a function of $q$. A variation of the multiscale diagram, the
so-called linear multiscale diagram, is obtained by plotting $h_q = \alpha_q/q - 1/2$ as a function of $q$. We employ the multiscaling Matlab code provided by Abry and Veitch [105]. We employ the Daubechies 3 wavelet. We use the L2 norm, sigtype 1, and the q vector [0.5, 1, 1.5, 2, 2.5, 3, 3.5, 4]. We use the automated newchoosej1 approach from Abry and Veitch's logscale diagram Matlab code [105] to determine the range of scales (octaves) for the estimation of the scaling parameters.
4.2 Video Frame Quality

Consider a video sequence with $N$ frames (pictures), each of dimension $D_x \times D_y$ pixels. Let $I(n, x, y)$, $n = 0, \ldots, N-1$; $x = 1, \ldots, D_x$; $y = 1, \ldots, D_y$, denote the luminance (gray-level, or Y component) value of the pixel at location $(x, y)$ in video frame $n$. The Mean Squared Error (MSE) is defined as the mean of the squared differences between the luminance values of the video frames in two video sequences $I$ and $\tilde{I}$. Specifically, the MSE for an individual video frame $n$ is defined as
$M_n = \frac{1}{D_x \cdot D_y} \sum_{x=1}^{D_x} \sum_{y=1}^{D_y} \left[ I(n,x,y) - \tilde{I}(n,x,y) \right]^2$.   (4.19)
The mean MSE for a sequence of $N$ video frames is
$\bar{M} = \frac{1}{N} \sum_{n=0}^{N-1} M_n$.   (4.20)
The RMSE is defined as the square root of the MSE,
$RMSE = \sqrt{MSE}$.   (4.21)
The Peak Signal to Noise Ratio (PSNR) in decibels (dB) is generally defined as $PSNR = 10 \cdot \log_{10}(p^2/MSE)$, where $p$ denotes the maximum luminance value of a pixel (255 in 8-bit pictures). We define the quality (in dB) of a video frame $n$ as
$Q_n = 10 \cdot \log_{10} \frac{p^2}{M_n}$.   (4.22)
We define the average quality (in dB) of a video sequence consisting of $N$ frames as
$\bar{Q} = 10 \cdot \log_{10} \frac{p^2}{\bar{M}}$.   (4.23)
Note that in this definition of the average quality, the averaging is conducted with the MSE values and the video quality is given in terms of the PSNR (in dB).
We also define an alternative average quality (in dB) of a video sequence as
$\bar{Q} = \frac{1}{N} \sum_{n=0}^{N-1} Q_n$,   (4.24)
where the averaging is conducted over the PSNR values directly. We now define natural extensions of the above quality metrics. We define the MSE sample variance $S_M^2$ of a sequence of $N$ video frames as
$S_M^2 = \frac{1}{N-1} \sum_{n=0}^{N-1} \left( M_n - \bar{M} \right)^2$,   (4.25)
and the MSE standard deviation $S_M$ as
$S_M = \sqrt{S_M^2}$.   (4.26)
We define the quality standard deviation $S_Q$ of a video sequence as
$S_Q = 10 \cdot \log_{10} \frac{p^2}{S_M}$.   (4.27)
We define the coefficient of quality variation $CoQV$ of a video sequence as
$CoQV = \frac{S_Q}{\bar{Q}}$.   (4.28)
We define an alternative quality standard deviation as
$S_Q = \sqrt{ \frac{1}{N-1} \sum_{n=0}^{N-1} \left( Q_n - \bar{Q} \right)^2 }$,   (4.29)
and the corresponding alternative coefficient of quality variation as
$CoQV = \frac{S_Q}{\bar{Q}}$.   (4.30)
We define the quality range (in dB) of a video sequence as
$Q_{\min}^{\max} = \max_{0 \le n \le N-1} Q_n - \min_{0 \le n \le N-1} Q_n$.   (4.31)
We estimate the MSE autocorrelation coefficient $\rho_M(k)$ for lag $k$, $k = 0, \ldots, N-1$, as
$\rho_M(k) = \frac{1}{N-k} \sum_{n=0}^{N-k-1} \frac{(M_n - \bar{M})(M_{n+k} - \bar{M})}{S_M^2}$.   (4.32)
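The frame-level quality metrics translate directly into code. The following Python sketch (illustrative function names of our own) computes the per-frame MSE and PSNR of (4.19) and (4.22) and contrasts the two averaging conventions of (4.23) and (4.24).

```python
import numpy as np

def frame_mse(orig, recon):
    """MSE between the luminance planes of an original and a reconstructed frame, cf. (4.19)."""
    d = orig.astype(float) - recon.astype(float)
    return np.mean(d ** 2)

def frame_psnr(mse, p=255.0):
    """Frame quality in dB, cf. (4.22)."""
    return 10.0 * np.log10(p ** 2 / mse)

def avg_quality_mse(mses, p=255.0):
    """Average quality computed from the mean MSE, cf. (4.23)."""
    return 10.0 * np.log10(p ** 2 / np.mean(mses))

def avg_quality_psnr(mses, p=255.0):
    """Alternative average quality: mean of the per-frame PSNR values, cf. (4.24)."""
    return np.mean([frame_psnr(m, p) for m in mses])
```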
While the above definitions focus on the qualities at the level of individual video frames, we also define, as extensions, qualities for aggregates (groups) of $a$ frames (with the GoP being a special case of frame aggregation with $a = G$, where typically $G = 12$). Let $M_m^{(a)}$, $m = 0, \ldots, N/a-1$, denote the MSE of the mth group of frames, defined as
$M_m^{(a)} = \frac{1}{a} \sum_{n=ma}^{(m+1)a-1} M_n$.   (4.33)
Let $Q_m^{(a)}$, $m = 0, \ldots, N/a - 1$, denote the corresponding PSNR quality (in dB), defined as
$Q_m^{(a)} = 10 \cdot \log_{10} \frac{p^2}{M_m^{(a)}}$.   (4.34)
We define the MSE sample variance $S_M^{2(a)}$ of a sequence of groups of $a$ frames each as
$S_M^{2(a)} = \frac{1}{N/a - 1} \sum_{n=0}^{N/a-1} \left( M_n^{(a)} - \bar{M} \right)^2$,   (4.35)
and the corresponding MSE standard deviation $S_M^{(a)}$ as
$S_M^{(a)} = \sqrt{S_M^{2(a)}}$.   (4.36)
We define the quality standard deviation $S_Q^{(a)}$ of a sequence of groups of $a$ frames each as
$S_Q^{(a)} = 10 \cdot \log_{10} \frac{p^2}{S_M^{(a)}}$.   (4.37)
We define the coefficient of quality variation $CoQV^{(a)}$ of a sequence of groups of $a$ frames each as
$CoQV^{(a)} = \frac{S_Q^{(a)}}{\bar{Q}}$.   (4.38)
We define the alternative quality standard deviation for groups of $a$ frames each as
$S_Q^{(a)} = \sqrt{ \frac{1}{N/a - 1} \sum_{n=0}^{N/a-1} \left( Q_n^{(a)} - \bar{Q} \right)^2 }$,   (4.39)
where $Q_n^{(a)} = \frac{1}{a} \sum_{m=na}^{(n+1)a-1} Q_m$. We define the corresponding alternative coefficient of quality variation as
$CoQV^{(a)} = \frac{S_Q^{(a)}}{\bar{Q}}$.   (4.40)
We define the quality range (in dB) of a sequence of groups of $a$ frames each as
$Q_{\min}^{\max(a)} = \max_{0 \le n \le N/a-1} Q_n^{(a)} - \min_{0 \le n \le N/a-1} Q_n^{(a)}$.   (4.41)
We estimate the MSE autocorrelation coefficient for groups of $a$ frames, $\rho_M^{(a)}(k)$, for lag $k$, $k = 0, a, 2a, \ldots, N/a - 1$ frames, as
$\rho_M^{(a)}(k) = \frac{1}{N/a - k} \sum_{n=0}^{N/a-k-1} \frac{(M_n^{(a)} - \bar{M})(M_{n+k}^{(a)} - \bar{M})}{S_M^{2(a)}}$.   (4.42)
4.3 Correlation between Video Frame Sizes and Qualities

We define the covariance between the frame size and the MSE frame quality as
$S_{XM} = \frac{1}{N-1} \sum_{n=0}^{N-1} (X_n - \bar{X})(M_n - \bar{M})$,   (4.43)
and the size–MSE quality correlation coefficient as
$\rho_{XM} = \frac{S_{XM}}{S_X \cdot S_M}$.   (4.44)
We define the covariance between the frame size and the (PSNR) frame quality as
$S_{XQ} = \frac{1}{N-1} \sum_{n=0}^{N-1} (X_n - \bar{X})(Q_n - \bar{Q})$,   (4.45)
and the size–quality correlation coefficient as
$\rho_{XQ} = \frac{S_{XQ}}{S_X \cdot S_Q}$.   (4.46)
Similar to the above frame-level definitions, we define the covariance between the aggregated frame sizes $X_n^{(a)}$, $n = 0, \ldots, N/a-1$, and the aggregated MSE qualities $M_n^{(a)}$, $n = 0, \ldots, N/a-1$, as
$S_{XM}^{(a)} = \frac{1}{N/a - 1} \sum_{n=0}^{N/a-1} (X_n^{(a)} - \bar{X})(M_n^{(a)} - \bar{M})$,   (4.47)
and the corresponding correlation coefficient as
$\rho_{XM}^{(a)} = \frac{S_{XM}^{(a)}}{S_X^{(a)} \cdot S_M^{(a)}}$.   (4.48)
We define the covariance between the aggregated frame sizes $X_n^{(a)}$, $n = 0, \ldots, N/a-1$, and the aggregated (PSNR) qualities $Q_n^{(a)}$, $n = 0, \ldots, N/a-1$, as
$S_{XQ}^{(a)} = \frac{1}{N/a - 1} \sum_{n=0}^{N/a-1} (X_n^{(a)} - \bar{X})(Q_n^{(a)} - \bar{Q})$,   (4.49)
and the corresponding correlation coefficient as
$\rho_{XQ}^{(a)} = \frac{S_{XQ}^{(a)}}{S_X^{(a)} \cdot S_Q^{(a)}}$.   (4.50)
4.4 Additional Metrics for FGS Encodings

The base layer (BL) and the FGS enhancement layer (EL) of the video are VBR-encoded, with instantaneous bit rates $r_b(t)$ and $r_e(t)$ during frame period $t$, $t = 1, \ldots, N$. According to the FGS property, the enhancement layer can be truncated anywhere before decoding. We denote any part of the EL which is to be added to the BL as an EL substream. We say that an EL substream is encoded at rate $C(t) \in [0, r_e(t)]$ when the last $T \cdot (r_e(t) - C(t))$ bits of each frame $t$, $t = 1, \ldots, N$, have been removed from the original EL bitstream. The BL Group of Pictures (GoP) is composed of 12 images throughout our study, and its pattern is fixed to IBBPBBPBBPBB. We suppose that the video is partitioned into consecutive scenes. Let $S$ denote the total number of scenes in a given video of length $N$ frames. Let $s$, $s = 1, \ldots, S$, denote the scene index and $N_s$ the length (in number of images) of scene number $s$ (note that $\sum_{s=1}^{S} N_s = N$). Let $Q_t(C)$, $t = 1, \ldots, N$, denote the quality of the tth decoded image when the EL is encoded with rate $C$. Let $Q_t^b = Q_t(0)$ denote the quality of the same image when only the BL is decoded. We define $Q_t^e(C) = Q_t(C) - Q_t^b$ as the improvement (increase) in quality which is achieved when decoding the EL, as well as the BL, of frame $t$ encoded with rate $C$. The different statistics for the individual video frame qualities are calculated as given above. We denote the total size of frame $t$ by $X_t(C) = X_t^b + X_t^e(C)$ when the EL is encoded with rate $C$.
Let $X_t^{e_i}$, $i = 1, \ldots, 8$, $t = 1, \ldots, N$, denote the size of EL bit plane $i$ of frame $t$, and let $Y_t^{e_i}$, $i = 1, \ldots, 8$, $t = 1, \ldots, N$, denote the aggregate size of the bit planes $1, \ldots, i$ (i.e., $Y_t^{e_i} = \sum_{j=1}^{i} X_t^{e_j}$). Let $Q_{s,n}(C)$, $s = 1, \ldots, S$, $n = 1, \ldots, N_s$, denote the quality of the nth decoded video frame of scene $s$ when the EL is encoded with rate $C$. As for $Q_t(C)$, we denote the quality of frame $n$ within scene $s$ when only the BL is decoded by $Q_{s,n}^b = Q_{s,n}(0)$, and the improvement in quality achieved when decoding the EL by $Q_{s,n}^e(C) = Q_{s,n}(C) - Q_{s,n}^b$. The Rate-Distortion (RD) characteristics of each image $n$ within scene $s$ are obtained by plotting the curves $Q_{s,n}(C)$. The mean and sample variance of the quality of the images within scene $s$, $s = 1, \ldots, S$, are estimated as follows:
$\bar{Q}_s(C) = \frac{1}{N_s} \sum_{n=1}^{N_s} Q_{s,n}(C)$,   (4.51)
$\sigma_{Q_s}^2(C) = \frac{1}{N_s - 1} \sum_{n=1}^{N_s} \left[ Q_{s,n}(C) - \bar{Q}_s(C) \right]^2$.   (4.52)
The coefficient of quality variation of scene $s$, $s = 1, \ldots, S$, is given by
$CoV_s = \frac{S_{Q_s}(C)}{\bar{Q}_s(C)}$.   (4.53)
For each scene $s$, we also denote the total size of image $n$ by $X_{s,n}(C) = X_{s,n}^b + X_{s,n}^e(C)$ when the EL is encoded with rate $C$. We estimate the mean $\bar{X}_s(C)$, sample variance $\sigma_{X_s}^2(C)$, and autocorrelation coefficient $\rho_{X_s}(C, k)$ of the sequence of total image sizes $X_{s,n}(C)$, $n = 1, \ldots, N_s$, in the same way as for the image qualities. We denote the mean, variance, and autocorrelation of the BL and EL frame sizes as $\bar{X}_s^b(C)$, $\sigma_{X_s^b}^2(C)$, $\rho_{X_s^b}(C, k)$ and $\bar{X}_s^e(C)$, $\sigma_{X_s^e}^2(C)$, $\rho_{X_s^e}(C, k)$, respectively. We monitor the length (in video frames) of the successive scenes $N_s$, $s = 1, \ldots, S$. We denote the mean and variance of $N_s$ as $\bar{N} = N/S$ and $\sigma_N^2$. The mean quality of all individual images of a scene is denoted as $\bar{Q}_s(C)$.

Let $\Theta_s(C)$ be the total quality of video scene number $s$, $s = 1, \ldots, S$, when the EL has been coded at rate $C$ for all images of the scene. Similar to the measure of quality of the individual images of a given scene, we define $\Theta_s(C) = \Theta_s^b + \Theta_s^e(C)$, where $\Theta_s^b = \Theta_s(0)$ denotes the total quality of scene $s$ when only the BL is decoded, and $\Theta_s^e(C)$ the improvement in quality achieved by the EL coded at rate $C$. We analyze the mean, variance, and autocorrelation coefficients of the scene qualities, as defined by:
$\bar{\Theta}(C) = \frac{1}{S} \sum_{s=1}^{S} \Theta_s(C)$,   (4.54)
$\sigma_{\Theta}^2(C) = \frac{1}{S-1} \sum_{s=1}^{S} \left[ \Theta_s(C) - \bar{\Theta}(C) \right]^2$,   (4.55)
$\rho_{\Theta}(C, k) = \frac{1}{S-k} \sum_{s=1}^{S-k} \frac{[\Theta_s(C) - \bar{\Theta}(C)][\Theta_{s+k}(C) - \bar{\Theta}(C)]}{\sigma_{\Theta}^2(C)}$.   (4.56)
For each scene $s$, the rate-distortion characteristics are obtained by plotting the curves $\Theta_s(C)$. The mean and variance of the scenes' qualities give an overall indication of the perceived quality of the whole video. However, the variance of the scene quality does not capture the differences in quality between successive video scenes, which degrade the perceived quality. To capture this, we introduce a new metric, called variability, which is defined as:
$V(C) = \frac{1}{S-1} \sum_{s=2}^{S} |\Theta_s(C) - \Theta_{s-1}(C)|$.   (4.57)
Note: In order to account for differences in the length of the successive scenes, we can also weight scenes according to their respective frame length. Let $\Theta_s'(C)$ denote the weighted measure of scene quality, expressed as:
$\Theta_s'(C) = \frac{N_s}{\bar{N}} \Theta_s(C)$.   (4.58)
We can define the mean and variance of the weighted quality as:
$\bar{\Theta}'(C) = \frac{1}{S} \sum_{s=1}^{S} \Theta_s'(C) = \frac{1}{S} \sum_{s=1}^{S} \frac{N_s}{\bar{N}} \Theta_s(C) = \frac{1}{N} \sum_{s=1}^{S} N_s \cdot \Theta_s(C)$,   (4.59)
$\sigma_{\Theta'}^2(C) = \frac{1}{S-1} \sum_{s=1}^{S} \left( \Theta_s'(C) - \bar{\Theta}'(C) \right)^2 = \frac{1}{S-1} \left[ \sum_{s=1}^{S} \left( \Theta_s'(C) \right)^2 - S \left( \bar{\Theta}'(C) \right)^2 \right]$.   (4.60)
Recalling that $\bar{X}_s(C)$ denotes the mean size of the frames within scene $s$, the correlation coefficient between the mean frame size $\bar{X}_s(C)$ of a scene and the total quality $\Theta_s(C)$ of a scene is estimated as:
$\rho_{\bar{X},\Theta}(C) = \frac{1}{S-1} \sum_{s=1}^{S} \frac{(\bar{X}_s(C) - \bar{X}(C))(\Theta_s(C) - \bar{\Theta}(C))}{S_{\bar{X}_s}(C) \cdot S_{\Theta}(C)}$,   (4.61)
where $\bar{X}(C)$ denotes the mean of the successive mean frame sizes of all scenes composing the video ($\bar{X}(C) = \frac{1}{S} \sum_{s=1}^{S} \bar{X}_s(C)$). Finally, we denote the correlation coefficient between the BL quality and the total (BL+EL) quality of a scene by $\rho_{\Theta^b,\Theta}(C)$.
4.5 Additional Metric for MDC Encodings

As MDC introduces an encoding overhead, we extend the metrics with respect to this overhead. The encoding overhead is defined as the amount of data by which the split streams of one video sequence are increased in comparison to the single stream. The encoding overhead is different from the network overhead that comes on top of each descriptor. The MDC overhead $OH$ is calculated by summing all frame sizes $X_{n,j}$ over the $N$ frames of the $J$ descriptors, dividing by the sum of the frame sizes $X_i$ over all $N \cdot J$ frames of the single stream, and subtracting 1:
$OH = \frac{\sum_{j=1}^{J} \sum_{n=1}^{N} X_{n,j}}{\sum_{i=1}^{J \cdot N} X_i} - 1$.   (4.62)
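The overhead of (4.62) can be computed directly from the frame size traces of the descriptors and of the corresponding single-stream encoding, for example with the following sketch (hypothetical function name and input layout).

```python
def mdc_overhead(desc_sizes, single_sizes):
    """MDC encoding overhead OH, cf. (4.62): desc_sizes[j][n] is the size of frame n
    in descriptor j; single_sizes lists the frame sizes of the single-stream encoding."""
    total_mdc = sum(sum(frames) for frames in desc_sizes)
    return total_mdc / sum(single_sizes) - 1.0
```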
5 Video Trace Generation
In this chapter, we describe the generation and the structure of the video traces. We first give a general overview of our experimental setup. We then discuss the different studied types of encodings, including the specific settings of the encoder parameters. Finally, we describe the structures of the video traces and define the quantities recorded in the traces.1
5.1 Overview of Video Trace Generation and Evaluation Process

We illustrate the overall setup of the video trace generation and evaluation in Figure 5.1. The general setup for creating video traces is to generate uncompressed (raw) YUV files in a sub-sampling and storage format that can be used by the (typically software) encoder. The commonly used uncompressed format is planar YUV 4:2:0, as described in detail in Section 2.2. The original (unencoded) video frames in the YUV format are stored and used as input to the video encoder, which is typically the reference encoder software. The encoder is then used to encode the source video in non-real-time, i.e., offline. This avoids potential bottlenecks during the encoding process. To evaluate different encoder settings and the resulting video frame sizes and qualities, the encoding parameters are varied to obtain several different encodings for the same source video sequence. The result of the encoding is the encoded (compressed) video bit stream in the respective video coding format and the video trace file. The video trace file is obtained either by modifying the encoder software directly or by using the encoder output for further processing. The video trace file is then used to evaluate the characteristics of the encoded video.
To avoid any conflict with copyright laws, we emphasize that all image processing, encoding, and analysis was done for scientific purposes only. The encoded video sequences have no audio stream and are not publicly available. We make only the frame size traces available to researchers.
Fig. 5.1: Overview of the video trace generation and evaluation setup.

In addition to evaluating different encoder configurations and their impact on the compressed video characteristics, using a broad variety of source video sequences is necessary. The video traffic and quality characteristics vary greatly for different video content. Thus, covering a wide range of video genres with a large variety of semantic video content is important. The evaluated video sequences are typically 30–60 minutes long. The publicly available short video test sequences, which are commonly used in the development of video compression standards, are typically only several hundred video frames in length. Evaluation of video content with longer duration is thus complementary to the evaluation of these short video sequences. To give an overview of the trace file generation, especially for scalable video encodings, we use the following notation. Let $N$ denote the number of video frames in a given trace. Let $t_n$, $n = 0, \ldots, N-1$, denote the frame period (display time) of frame $n$. Let $T_n$, $n = 1, \ldots, N$, denote the cumulative display time up to (and including) frame $n-1$, i.e., $T_n = \sum_{k=0}^{n-1} t_k$ (and define $T_0 = 0$). Let $X_n$, $n = 0, \ldots, N-1$, denote the frame size (number of bits) of the encoded (compressed) video frame $n$. Let $Q_n^Y$, $n = 0, \ldots, N-1$, denote the quality (in terms of the Peak Signal to Noise Ratio (PSNR)) of the luminance component of the encoded (and subsequently decoded) video frame $n$ (in dB). Similarly, let $Q_n^U$ and $Q_n^V$, $n = 0, \ldots, N-1$, denote the qualities of the two chrominance components hue (U) and saturation (V) of the encoded video frame $n$ (in dB).

5.1.1 Video Source VHS

To obtain the uncompressed video from traditional video tapes, each of the studied video sequences was played from a VHS tape using a video cassette recorder (VCR). The (uncompressed) YUV video frames were captured using a PC video capture card and the bttvgrab (version 0.15.10) software [115]. We stored the uncompressed video frames on hard disk. We grabbed the YUV information at the National Television Standards Committee (NTSC) frame rate of 30 frames per second. We illustrate the generation of the YUV video files in Figure 5.2. The studied video sequences were captured in the QCIF (176 × 144 pixel) resolution and in the CIF (352 × 288 pixel) resolution.
Fig. 5.2: Overview of YUV creation from VHS video sources.

All the video capturing was done in the planar YUV 4:2:0 format with quantization into 8 bits. We note that the video capture was conducted on a high performance system (dual Intel Pentium III 933 MHz processors with 1 GB RAM and an 18 GByte high-speed SCSI hard disc) and that bttvgrab is a high-quality video capture software and was the only freely available video grabbing software. To avoid frame drops due to buffer build-up when capturing long video sequences, we captured the 60 minute (108,000 frames) QCIF sequences in two segments of 30 minutes (54,000 frames) each. With this strategy, we did not experience any frame drops when capturing video in the QCIF format. We did experience a few frame drops when capturing video in the larger CIF format. In order to have a full half hour (54,000 frames) of digital CIF video for our encoding experiments and statistical analysis, we filled the gaps by duplicating the video frame preceding the dropped frame(s). For early video traces, the video was split into 15 minute parts due to hard disc restrictions. We believe that the introduced error is negligible since the total number of dropped frames is small compared to the 54,000 frames in half an hour of video and the number of consecutive frame drops is typically less than 10–20. The file size of one hour of uncompressed QCIF video is 4,105,728,000 byte. Due to the larger size of the CIF video format, we restricted the length of the video sequences in CIF format to 30 minutes, which accounts for a file size of 8,211,456,000 byte.

5.1.2 Video Source DVD

To complement the studies conducted by creating the source video by capturing video frames from a VCR, videos were additionally captured from DVD. Some of the video content was identical, but no actual alignment of the position within the captured movies took place. Although the source video on the DVD is already encoded with the MPEG-2 video compression standard, and therefore potential artifacts and other visual degradations may be present, the motivation for using DVD video as a source is that the VCR videos also suffer from a certain degree of added noise and visual imperfections as part of the playback and capturing process. We illustrate the generation of the YUV files from DVD sources in Figure 5.3. The DVD video was converted using the ffmpeg [116] software encoder/decoder to generate the YUV source video files. Using this approach, we generated QCIF and CIF video sequences that have the same durations as the VHS-originated video sequences.
Fig. 5.3: Overview of YUV creation from DVD video sources.

5.1.3 Video Source YUV Test Sequences

For some evaluations, YUV 4:2:0 test sequences were used. Although these short video sequences are used in studies of encoder/decoder performance and video quality issues, they are typically only several hundred video frames long (several seconds in duration). For studies that focus on the network delivery of encoded video and on long-duration video transmission and its characteristics, these sequences are too short to give insight into streaming full-length video to clients. These short sequences, however, give important insights into the content dependency of encoded video, as the individual video sequences in most cases contain individual shots or scenes.

5.1.4 Video Source Pre-Encoded Video

We developed an approach to use pre-encoded video content, which is shared on the Internet between users, for the video trace generation. The advantage of this approach is that the entire grabbing and encoding process (including the choice of encoder parameter settings) has already been done by different users, who appear to be satisfied with the quality of the video content after encoding. This type of video content is shared among users in the fixed wired Internet, but it is also an appropriate type of content for streaming video in WLAN networks. We illustrate the trace generation for the pre-encoded videos in Figure 5.4.
Fig. 5.4: Overview of trace generation from pre-encoded video sources.
5.2 MDC Trace Generation

MDC divides a single-stream raw video sequence into multiple streams either by exploiting quantizers or by using a frame-based approach. The latter puts consecutive frames into the generated streams in a round-robin fashion. In this work, MDC splits the video stream into multiple descriptors by a frame-based approach using a splitter entity.
Fig. 5.5: Splitter and encoding chain for J = 3.
An illustration of the splitting and encoding process is given in Figure 5.5 for J = 3. The splitter takes the raw video sequence and splits it into J sub-sequences (J > 1), such that the i-th sub-sequence contains pictures i, J + i, 2J + i, and so on. Once the split sequences are ready, each stream is fed into a standard video encoder, such as H.263, H.264, or MPEG-4. Within this book, we focus on H.264 encoded MDC streams for the trace generation. Using bit stream parsers, the encoded streams are evaluated and traces are generated. Consequently, the main difference in terms of traffic between standard single-layer video coding and the MDC technique used in this study stems from the splitting and merging operations, as sketched below. The relationship between the various encoding types and how frames rely on each other in a typical frame sequence for MDC with three descriptors is illustrated in Figure 5.6 for a GoP of 12 frames. The properties of the encoded sub-streams, such as the amount of data and the robustness, are measured and evaluated. We are interested in the overhead that arises from the splitting and encoding process. Obviously, the inter-frame differences increase with larger J, which in turn results in smaller compression gains, as the video encoder has more data to work on. Furthermore, we investigate how sensitive the overhead is to the encoder's settings in terms of quantization parameters.
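A minimal sketch of the frame-based splitting step is given below; the function name is hypothetical, and in practice the splitter operates on the raw YUV frames before each sub-sequence is handed to the encoder.

```python
def split_round_robin(frames, J):
    """Split a sequence of raw frames into J sub-sequences; sub-sequence i (1-based)
    receives pictures i, J+i, 2J+i, ... as illustrated in Figure 5.5."""
    return [frames[i::J] for i in range(J)]

# Example with J = 3: frames 0, 3, 6, ... form descriptor 1, frames 1, 4, 7, ...
# descriptor 2, and frames 2, 5, 8, ... descriptor 3; each sub-sequence is then
# fed into a standard (e.g., H.264) encoder.
```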
5.3 Evaluation of MPEG-4 Encodings

In this section we describe in detail the studied types of video encoding (compression). All encodings were conducted with the Microsoft version of the MPEG-4 reference (software) encoder [117], which has been standardized by MPEG in Part 5 — Reference Software of the standard. Using this standardized reference encoder, we study several different types of encodings, which are controlled by the parameters of the encoder. We refer to a particular type of encoding as an encoding mode.
Fig. 5.6: Frame-based construction of multiple descriptions for a GoP of twelve video frames.

In Table 5.1, we provide an overview of all the studied encoding modes for the MPEG-4 video encodings, together with the quality levels and the corresponding quantization parameter settings. The three main categories of studied encoding modes are single-layer (non-scalable) encoding, temporal scalable encoding, and spatial scalable encoding. All studied encoding modes have in common that the number of video objects is set to one, i.e., we do not study object segmentation. We also note that we do not employ reversible variable length coding (RVLC), which achieves increased error resilience at the expense of slightly smaller compression ratios. We found that in the reference software RVLC is currently implemented only for single-layer encodings (as well as for the base layer of scalable encodings). To allow for a comparison of the traffic and quality characteristics of scalable encodings, we conduct all encodings without RVLC. For similar reasons we consistently use the decoded frames (rather than the YUV source) for motion estimation (by setting Motion.Use.Source.For.ME.Enable[0] = 0). Also, throughout we employ the H.263 quantization matrix. We generate two types of video traces: verbose traces and terse traces. The verbose traces give the following quantities (in this order): frame number $n$, cumulative display time $T_n$, frame type (I, P, or B), frame size $X_n$ (in bit), luminance quality $Q_n^Y$ (in dB), hue quality $Q_n^U$ (in dB), and saturation quality $Q_n^V$ (in dB).
Table 5.1: Overview of the different encoding modes used for the evaluation of MPEG-4 video encodings.

Scalability Mode     Quality Level / Target Bit Rate     Base Layer Settings              Enhancement Layer Settings
Single Layer         High                                qI = 4,  qP = 4,  qB = 4         -
                     High-Medium                         qI = 10, qP = 10, qB = 10        -
                     Medium                              qI = 10, qP = 14, qB = 16        -
                     Medium-Low                          qI = 24, qP = 24, qB = 24        -
                     Low                                 qI = 30, qP = 30, qB = 30        -
                     64 kbps / 128 kbps / 256 kbps       qBlock = [1 ... 31]              -
Temporal Scalable    High, High-Medium, Medium,          same base layer settings         qP = 14, qB = 16
                     Medium-Low, Low,                    as for Single Layer
                     64 / 128 / 256 kbps
Spatial Scalable     High, High-Medium, Medium,          same base layer settings         qP = 14, qB = 16
                     Medium-Low, Low,                    as for Single Layer
                     64 / 128 / 256 kbps
These quantities are given in ASCII format with one video frame per line. Recall that in our single-layer (non-scalable) encodings and our temporal scalable encodings, we use the GoP pattern with 3 P frames between 2 successive I frames and 2 B frames between successive (I)P and P(I) frames. With this GoP pattern, the decoder needs both the preceding I (or P) frame and the succeeding P (or I) frame for decoding a B frame. Therefore, the encoder emits the frames in the order IPBBPBBPBBPBBIBBP..., and we also arrange the frames in this order in the verbose trace file. Note that due to this ordering, line 0 of the verbose trace gives the characteristics of frame number n = 0, line 1 gives frame number n = 3, lines 2 and 3 give frames 1 and 2, line 4 gives frame 6, lines 5 and 6 give frames 4 and 5, and so on.
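To work with such a verbose trace, the encoder-order records can be brought back into display order simply by sorting on the frame number. The following sketch assumes whitespace-separated columns in the order just listed; it is an illustration, not the tool used to generate the published traces.

# Sketch: read a verbose trace (encoder order, IPBB...) and re-order it into
# display order (IBBP...) by sorting on the frame number carried in column 0.

def read_verbose_trace(path):
    records = []
    with open(path) as f:
        for line in f:
            cols = line.split()
            if not cols:
                continue
            n = int(cols[0])          # frame number
            t = float(cols[1])        # cumulative display time
            ftype = cols[2]           # I, P, or B
            size = int(cols[3])       # frame size in bit
            psnr_y = float(cols[4])   # luminance quality in dB
            records.append((n, t, ftype, size, psnr_y))
    return records

def to_display_order(records):
    return sorted(records, key=lambda r: r[0])

def write_terse_trace(records, path):
    """Write frame size and luminance PSNR per line, in display order."""
    with open(path, "w") as f:
        for n, t, ftype, size, psnr_y in to_display_order(records):
            f.write(f"{size} {psnr_y}\n")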
In the terse traces, on the other hand, the video frames are ordered in strictly increasing frame numbers. Specifically, line n, n = 0, ..., N - 1, of a given terse trace gives the frame size Xn and the luminance quality QYn. We remark that for simplicity we do not provide the cumulative display time of frame number N - 1, which would result in an additional line number N in the trace. We also note that for our encodings with spatial scalability, which use the GoP pattern with 11 P frames between successive I frames and no bi-directionally predicted (B) frames, the frames are ordered in strictly increasing order of the frame numbers in both the verbose and the terse trace files.
For the two-layer encodings with temporal and spatial scalability, we generate verbose and terse traces for both the base layer and the enhancement layer. The base layer traces give the sizes and the PSNR values for the (decoded) base layer (see Sections 5.3.2 and 5.3.3 for details). The enhancement layer traces give the sizes of the encoded video frames in the enhancement layer and the improvement in the PSNR quality obtained by adding the enhancement layer to the base layer (i.e., the difference in quality between the aggregate (base + enhancement layer) video stream and the base layer video stream). In summary, the base layer traces give the traffic and quality of the base layer video stream, while the enhancement layer traces give the enhancement layer traffic and the quality improvement obtained by adding the enhancement layer to the base layer.

5.3.1 Single-Layer Encoding

The Group of Pictures (GoP) pattern for single layer encodings is set to IBBPBBPBBPBBIBBP..., i.e., there are three P frames between successive I frames and two B frames between successive P (I) frames. We conduct single-layer encodings both without rate control and with rate control. For the encodings without rate control, the quantization parameters are fixed throughout the encoding. We consider the five quality levels defined in Table 5.1. The encodings with rate control employ the TM5 rate control scheme [118], which adjusts the quantization parameters on a macroblock basis. We conduct encodings with the target bit rates 64 kbps, 128 kbps, and 256 kbps that are given in Table 5.1 as well.
The frame sizes and frame qualities for the single-layer encodings are obtained directly from the software encoder. During the encoding, the MPEG-4 encoding software internally computes the frame sizes and the PSNR values for the Y, U, and V components. We have augmented the encoding software such that it writes this data along with the frame numbers and frame types directly to a verbose trace. We have verified the accuracy of the internal computation of the frame sizes and the PSNR values by the software encoder. To verify the accuracy of the frame size computation, we compared the sum of the frame sizes in the trace with the file size (in bit) of the encoded video (bit stream).
We found that the file size of the encoded video is typically on the order of 100 byte larger than the sum of the frame sizes. This discrepancy is due to some MPEG-4 system headers, which are not captured in the frame sizes written to the trace. Given that the file size of the encoded video is on the order of several Mbytes and that individual encoded frames are typically on the order of several kbytes, this discrepancy is negligible. To verify the accuracy of the PSNR computation, we decoded the encoded video and computed the PSNR by comparing the original (uncompressed) video frames with the encoded and subsequently decoded video frames. We found that the internally computed PSNR values for the Y, U, and V components perfectly match the PSNR values obtained by comparing original and decoded video frames.
We note that the employed MPEG-4 software encoder is limited to encoding segments with a YUV file size no larger than about 2 GBytes. Therefore, we encoded the 108,000 frame QCIF sequences in two segments of 54,000 frames (4500 GoPs with 12 frames per GoP) each and the 54,000 frame CIF sequences in four segments of 13,500 frames each. The verbose traces for the individual segments were merged to obtain the 108,000 frame QCIF trace and the 54,000 frame CIF trace. When encoding the 4500th GoP of a segment, the last two B frames of that GoP are bi-directionally predicted from the third P frame of the 4500th GoP and the I frame of the 4501st GoP. Since the 4501st GoP is not encoded in the same run as the preceding GoPs, our traces were missing the last two B frames of each 54,000 frame segment. To fix this, we inserted two B frames at the end of each segment of 53,998 (actually encoded) frames. We set the size of the inserted B frames to the average size of the actually encoded B frames in the 4500th GoP. We believe that this procedure results in a negligible error.
We provide an exemplary verbose trace excerpt for the QCIF version of Silence of the Lambs encoded with the high encoding mode settings (as detailed in Table 5.1) in Table 5.2. As the encoder has to encode the referenced frames (I or P frames) before the frames that reference them (P or B frames), the order of frames in the trace differs from the actual display order and display time of the frames. This frame order is also referred to as encoder frame order. The corresponding terse video trace is given in Table 5.3. In the terse video trace, the video frames are in the order of their display, which is referred to as display order.
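The segment fix described above can be illustrated with a short sketch. The per-frame record format (frame type and size only) is our own simplification, and the sizes in the toy example are arbitrary.

# Sketch of the segment fix: each encoded segment yields 53,998 frames, so two
# B frames are appended with the average size of the B frames actually encoded
# in the segment's last (truncated) GoP.

def append_missing_b_frames(segment, gop_len=12):
    """segment: list of (frame_type, size_in_bit) tuples for one encoded segment."""
    last_gop = segment[-(gop_len - 2):]            # final, truncated GoP
    b_sizes = [size for ftype, size in last_gop if ftype == "B"]
    avg_b = sum(b_sizes) // len(b_sizes) if b_sizes else 0
    return segment + [("B", avg_b), ("B", avg_b)]

# Toy example with arbitrary sizes; the merged trace is the concatenation of
# the fixed segments.
segment_a = [("I", 100000), ("P", 60000), ("B", 30000), ("B", 32000), ("P", 58000)]
merged = append_missing_b_frames(segment_a)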
Table 5.2: Verbose trace example for Silence of the Lambs (using QCIF resolution and high encoding mode) with frames in encoder order (IPBB...).

Frame      Time        Frame Type   Frame Size    PSNR Y        PSNR U        PSNR V
Number n   Tn [s]      [I,P,B]      Xn [bit]      QYn [dB]      QUn [dB]      QVn [dB]
...        ...         ...          ...           ...           ...           ...
105        3.500000    P            78664         35.677898     39.408901     40.597801
103        3.433333    B            67848         35.643799     39.554600     40.645302
104        3.466667    B            33904         35.317600     39.431999     40.696800
108        3.600000    I            100160        38.570099     40.661301     42.064499
106        3.533333    B            74624         35.662800     40.160400     41.433800
107        3.566667    B            72272         35.677399     40.292301     41.666199
...        ...         ...          ...           ...           ...           ...
Table 5.3: Terse trace example for Silence of the Lambs (using QCIF resolution and high encoding mode) with frames in display frame order (IBBP...).

Frame Size    PSNR Y
Xn [bit]      QYn [dB]
...           ...
67848         35.643799
33904         35.317600
78664         35.677898
74624         35.662800
72272         35.677399
100160        38.570099
...           ...
5.3.2 Temporal Scalable Encoding

In the considered temporal scalable encodings, the I and P frames constitute the base layer, while the B frames constitute the enhancement layer. We note that encodings with different assignments of frame types to the different layers are possible (and are supported by the reference encoder); we chose to place the encoded I and P frames in the base layer and the encoded B frames in the enhancement layer to fix ideas. By arranging the frames among the layers in this particular way, the allocation of traffic to the base layer and the enhancement layer is controlled by varying the number of B frames between successive I(P) and P(I) frames. We initially conduct encodings with two B frames between successive I(P) and P(I) frames (i.e., in the MPEG terminology, we set the source sampling rate to three for the base layer and to one for the enhancement layer). We again conduct encodings without rate control and with rate control. For the encodings without rate control, we use the fixed sets of quantization parameter settings defined in Table 5.1. Note that with the adopted scalable encoding types, the quantization parameters of the I and P frames determine the size (in bit) and the quality of the frames in the base layer, while the quantization parameter of the B frames determines the size and quality of the enhancement layer frames.
Table 5.4: Verbose base layer trace example for temporal scalable encoded Silence of the Lambs (using QCIF resolution and high encoding mode) with frames in encoder order (IPBB...).

Frame      Time        Frame Type   Frame Size    PSNR Y        PSNR U        PSNR V
Number n   Tn [s]      [I,P,B]      Xn^b [bit]    Qn^{b,Y} [dB] Qn^{b,U} [dB] Qn^{b,V} [dB]
...        ...         ...          ...           ...           ...           ...
105        3.500000    P            78472         35.677898     39.408901     40.597801
103        3.433333    B            0             19.969000     38.445301     39.988998
104        3.466667    B            0             16.181400     38.470200     40.140400
108        3.600000    I            100160        38.570099     40.661301     42.064499
106        3.533333    B            0             17.067699     39.083698     40.480000
107        3.566667    B            0             15.055800     39.221500     40.551998
...        ...         ...          ...           ...           ...           ...
For the temporal scalable encodings with rate control, we use the TM5 rate control scheme to control the bit rate of the base layer to a pre-specified target bit rate (64 kbps, 128 kbps, and 256 kbps are used). The B frames in the enhancement layer are open-loop encoded (i.e., without rate control); throughout, we set their quantization parameter to 16 (which corresponds to the medium quality level, see Table 5.1). The temporal scalable encodings are conducted both for video in the QCIF format and for video in the CIF format.
The frame sizes of both the encoded video frames in the base layer (I and P frames with the adopted encoding modes, see the beginning of this section) and the encoded video frames in the enhancement layer (i.e., in our case the encoded B frames) are obtained from the frame sizes computed internally by the encoder. We provide an excerpt from the verbose base layer trace for the QCIF format of Silence of the Lambs (using the high encoding mode for temporal scalable video from Table 5.1) in Table 5.4. Note that the base layer traces (both verbose and terse) give the sizes of the frames in the base layer and contain a zero for each frame that belongs to the enhancement layer. The corresponding enhancement layer trace is given in Table 5.5. The enhancement layer traces give the sizes of the frames in the enhancement layer (and correspondingly contain a zero for each frame in the base layer) as well as the quality improvement obtained by adding the enhancement layer to the base layer, as detailed in the following.
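The complementary structure of the base and enhancement layer traces can be sketched as follows; this is a simplified illustration under the frame-type convention described above, with an assumed record format.

# Sketch of the complementary base/enhancement layer traces for temporal
# scalability: I and P frames carry base layer data (enhancement size 0),
# B frames carry enhancement layer data (base size 0).

def split_temporal_layers(frames):
    """frames: list of (frame_type, size_in_bit) tuples."""
    base, enh = [], []
    for ftype, size in frames:
        if ftype in ("I", "P"):
            base.append((ftype, size))
            enh.append((ftype, 0))
        else:  # B frame
            base.append((ftype, 0))
            enh.append((ftype, size))
    return base, enh

frames = [("I", 100160), ("B", 67848), ("B", 33904), ("P", 78472)]
base_trace, enh_trace = split_temporal_layers(frames)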
Table 5.5: Verbose enhancement layer trace example for temporal scalable encoded Silence of the Lambs (using QCIF resolution and high encoding mode) with frames in encoder order (IPBB...).

Frame      Time        Frame Type   Frame Size    PSNR Y        PSNR U        PSNR V
Number n   Tn [s]      [I,P,B]      Xn^e [bit]    Qn^{e,Y} [dB] Qn^{e,U} [dB] Qn^{e,V} [dB]
...        ...         ...          ...           ...           ...           ...
105        3.500000    P            0             0.000000      0.000000      0.000000
103        3.433333    B            67848         15.674799     1.109299      0.656303
104        3.466667    B            33912         19.136200     0.961800      0.556400
108        3.600000    I            0             0.000000      0.000000      0.000000
106        3.533333    B            74624         18.595100     1.076702      0.953800
107        3.566667    B            72280         20.621598     1.070801      1.114201
...        ...         ...          ...           ...           ...           ...
Formally, we let Xn^b, n = 0, ..., N - 1, denote the frame sizes in the base layer stream, and let Xn^e, n = 0, ..., N - 1, denote the frame sizes in the enhancement layer stream. The video frame qualities (PSNR values) for the base layer, which we denote by Qn^{b,Y}, Qn^{b,U}, and Qn^{b,V}, n = 0, ..., N - 1, are determined as follows. The qualities of frames that are in the base layer (I and P frames with our settings) are obtained by comparing the decoded base layer frames with the corresponding original (uncompressed) video frames. To determine the qualities of the frames in the enhancement layer, which are missing in the base layer, we adopt a simple interpolation policy (which is typically used in rate-distortion studies, see, e.g., [119]). With this interpolation policy, the "gaps" in the base layer are filled by repeating the last (decoded) base layer frame, that is, the base layer stream I1 P1 P2 P3 I2 P4 ... is interpolated to I1 I1 I1 P1 P1 P1 P2 P2 P2 P3 P3 P3 I2 I2 I2 P4 P4 P4 .... The base layer PSNR values are then obtained by comparing this interpolated decoded frame sequence with the original YUV frame sequence. The detailed calculation is outlined later in the form of the offset distortion traces in Chapter 9.
The improvements in the video quality (PSNR) achieved by adding the enhancement layer, which we denote by Qn^{e,Y}, Qn^{e,U}, and Qn^{e,V}, n = 0, ..., N - 1, are determined as follows. For the base layer frames, which correspond to "gaps" in the enhancement layer, there is no improvement when adding the enhancement layer. Consequently, for the base layer frames, zeros are recorded for the quality improvement of the Y, U, and V components in the enhancement layer trace. To determine the quality improvement for the enhancement layer frames, we obtain the PSNR of the aggregate (base + enhancement layer) stream from the encoder. We then record the differences between these PSNR values and the corresponding Qn^{b,Y}, Qn^{b,U}, and Qn^{b,V} values in the enhancement layer trace. To continue the previous example outlining our encoding approach, adding the enhancement layer stream to the base layer stream resolves the gaps that made the interpolation (i.e., the repetition of base layer frames) necessary to obtain the quality values for the base-layer-only stream. More formally, adding the enhancement layer to the base layer resolves the previously shown interpolation to I1 B1 B2 P1 B3 B4 P2 B5 B6 P3 B7 B8 I2 B9 B10 P4 ....
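The frame-repetition policy can be illustrated with a short sketch; this is not the actual evaluation code, and it assumes the decoded frames are available as 8-bit luminance arrays.

# Sketch of the frame-repetition ("interpolation") policy for the base layer
# quality: gaps left by missing B frames are filled with the last decoded base
# layer frame before computing the PSNR against the original frames.
import numpy as np

def psnr(ref, rec):
    mse = np.mean((ref.astype(np.float64) - rec.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(255.0 ** 2 / mse)

def base_layer_psnr(originals, decoded_base):
    """originals: list of all original frames; decoded_base: dict mapping
    frame number -> decoded base layer frame (only I and P frames present;
    frame 0 is an I frame and hence always available)."""
    qualities, last = [], None
    for n, orig in enumerate(originals):
        if n in decoded_base:
            last = decoded_base[n]
        qualities.append(psnr(orig, last))  # repeat last base layer frame in gaps
    return qualities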
5.3.3 Spatial Scalable Encoding

In our study of spatial scalable encodings, we focus on video in the CIF format. In contrast to temporal scalability, here every encoded video frame has a base layer component and an enhancement layer component. Decoding the base layer gives the video in the QCIF format, whereas decoding both layers gives the video in the CIF format. We note that the base layer QCIF video may be up-sampled and displayed in the CIF format; this up-sampling results in a coarse-grained, low-quality CIF format video. For the spatial scalable encoding, we set the GoP structure for the base layer to IPPPPPPPPPPPIPP.... The corresponding GoP structure for the enhancement layer is PBBBBBBBBBBBPBB..., where, by the convention of spatial scalable encodings, each P frame in the enhancement layer is encoded with respect to the corresponding I frame in the base layer and each B frame in the enhancement layer is encoded with respect to the corresponding P frame in the base layer. Each P frame in the base layer is forward predicted from the preceding I(P) frame. For the spatial scalable encoding without rate control, the quantization parameters of the different frame types (I, P, and B) are fixed according to the quality levels defined in Table 5.1. For the encodings with rate control, we use the TM5 rate control algorithm to keep the bit rate of the base layer at a pre-specified target bit rate of 64 kbps, 128 kbps, or 256 kbps, as given in Table 5.1. The quantization parameters of the enhancement layer frames are set to fixed values corresponding to the settings used for the medium quality level (14 for P frames, 16 for B frames).
With spatial scalable encoding, each encoded frame has both a base layer component and an enhancement layer component. We let Xn^b and Xn^e, n = 0, ..., N - 1, denote the sizes (in bit) of the base layer component and the enhancement layer component of frame n, respectively. Both components are obtained from the frame sizes computed internally by the encoder. The verbose base layer trace gives two different qualities for each video frame: the QCIF qualities Qn^{b,qcif,Y}, Qn^{b,qcif,U}, and Qn^{b,qcif,V}, as well as the CIF qualities Qn^{b,cif,Y}, Qn^{b,cif,U}, and Qn^{b,cif,V}. The QCIF qualities are obtained by comparing the decoded base layer stream with the downsampled (from CIF to QCIF) original video stream. The CIF qualities are obtained as follows: the base layer stream is decoded and upsampled (from QCIF to CIF), and this CIF video stream is then compared with the original CIF video stream. The terse base layer trace gives only the size (in bit) of the base layer component Xn^b and the luminance CIF quality Qn^{b,cif,Y} for each frame n, n = 0, ..., N - 1. We provide an excerpt from the verbose base layer trace for Silence of the Lambs (using the high encoding mode for spatial scalable video from Table 5.1) in Table 5.6.
Table 5.6: Verbose base layer trace example for spatial scalable encoded Silence of the Lambs (using CIF resolution and high encoding mode) with frames in encoder order (IPBB...).

                                              PSNR, downsampled original (QCIF)           PSNR, upsampled base layer (CIF)
Frame      Time        Size         Frame     Y              U              V             Y              U              V
Number n   Tn [s]      Xn^b [bit]   Type      Qn^{b,qcif,Y}  Qn^{b,qcif,U}  Qn^{b,qcif,V}  Qn^{b,cif,Y}   Qn^{b,cif,U}   Qn^{b,cif,V}  [dB]
...        ...         ...          ...       ...            ...            ...           ...            ...            ...
105        3.500000    19856        P         35.580200      41.295101      42.352600     24.658100      41.168301      42.271900
103        3.433333    17256        B         35.700401      41.737000      42.357899     24.430300      41.318100      42.285301
104        3.466667    18072        B         35.608601      41.421700      42.488701     24.654301      41.273800      42.312599
108        3.600000    68760        I         37.844799      41.892799      42.766201     25.010599      41.185001      42.463299
106        3.533333    15800        B         35.868301      41.521099      42.656399     24.555700      41.332199      42.485100
107        3.566667    12960        B         36.015900      41.180599      42.370998     24.731899      41.087002      42.177200
...        ...         ...          ...       ...            ...            ...           ...            ...            ...
The verbose enhancement layer trace gives the quality improvements achieved through the enhancement layer with respect to the base layer CIF qualities Qn^{b,cif,Y}, Qn^{b,cif,U}, and Qn^{b,cif,V}, n = 0, ..., N - 1. These quality improvements are obtained as follows. The aggregate video stream is decoded (in the CIF format) and compared with the original CIF format video stream to obtain the PSNR values of the aggregate stream. The quality improvements are then obtained by subtracting the base layer CIF qualities Qn^{b,cif,Y}, Qn^{b,cif,U}, and Qn^{b,cif,V} from the corresponding PSNR values of the aggregate stream. We show the verbose enhancement layer trace excerpt corresponding to the base layer excerpt in Table 5.7. We note that due to limitations of the employed decoder, we could extract only the first N/4 - 1 frames of an encoded N frame sequence for the calculation of the aggregate PSNR values.
5.4 Evaluation of H.264 Encodings

We used the H.264 reference encoder JM2 version 3.6, which is publicly available (for more recent releases refer to [120]). The purpose of our study is to generate and statistically evaluate the frame sizes of the encoded video streams, matching closely the configurations used in the MPEG-4 single layer video studies outlined in Section 5.3. We thus disabled some of the more advanced encoder features, which additionally were still in development at the time of our study. The disabled features included the slice mode, which provides error resilience by coding a fixed number of macroblocks or a fixed number of bytes per slice. We used only the CABAC technique to remove inter-symbol correlation. The network abstraction layer was also not used, as were restrictions to the search range. We were therefore only using the basic features, such as inter-, intra-, and bi-directional prediction and motion estimation. Additionally, we used a fixed GoP and a fixed motion vector resolution setting for the prediction modes. The result is a setup very close to the most basic encoding settings used in previous video trace file generation processes such as [121].
We did not specify a target bit rate, since rate-adaptive encoding is not available in the encoder version under consideration. Instead, we used static quality levels (quantization scale factors q), which we set for all three frame types to 1, 5, 10, 15, 20, 25, 30, 35, 40, 45, and 51. For ease of comparison with the already existing video trace files (H.261, H.263, and MPEG-4, see [121]), we used the GoP structure IBBPBBPBBPBBI.... Note that the encoder has to encode the referenced frames first; thus the resulting frame sequence is IPBBPBBPBBIBBP.... We used the freely available and widely used YUV testing sequences in our experiments. An overview of the evaluated sequences is given in Table 5.8. For each of the studied quality levels, we encoded the YUV files into the H.264 bit stream off-line (thus there was no frame drop during the encoding). The resulting encoder status output was parsed to generate the traces. For each quantization level and test sequence we generated a terse and a verbose trace file. The traces were then used for the statistical analysis of the video traffic. The verbose trace shown in Table 5.9 gives for each frame the type (I, P, or B), the play-out time (frame number divided by the frame rate of 30 frames/s, given in ms), and the frame size in byte.
Table 5.7: Verbose enhancement layer trace example for spatial scalable encoded Silence of the Lambs (using CIF resolution and high encoding mode) with frames in encoder order (IPBB...).

                                               PSNR                                        PSNR, improvement from Qn^{b,cif}
Frame      Time        Size         Frame      Y              U              V             Y            U            V
Number n   Tn [s]      Xn^el [bit]  Type       Qn^{el,Y} [dB] Qn^{el,U} [dB] Qn^{el,V} [dB] [dB]        [dB]         [dB]
...        ...         ...          ...        ...            ...            ...           ...          ...          ...
103        3.433333    70304        B          36.300098      42.007599      43.047699     11.869799    0.689499     0.762398
104        3.466667    66496        B          36.209400      41.992802      43.108299     11.555099    0.719002     0.795700
105        3.500000    67856        B          36.309898      41.954201      42.918701     11.651798    0.785900     0.646801
106        3.533333    70848        B          36.279099      42.087502      42.960899     11.723398    0.755302     0.475800
107        3.566667    67688        B          36.298100      41.895599      42.816399     11.566200    0.808598     0.639198
108        3.600000    119760       P          36.365601      41.982498      43.001598     11.355001    0.797497     0.538300
...        ...         ...          ...        ...            ...            ...           ...          ...          ...
Table 5.8: Overview of the evaluated sequences for the H.264 video encodings.

Video Sequence          Number of Frames   Format
Carphone                382                QCIF
Claire                  494                QCIF
Container               300                QCIF
Foreman                 400                QCIF
Grandma                 870                QCIF
Mobile                  300                CIF
Mother and Daughter     961                QCIF
News                    300                QCIF
Paris                   1000               CIF
Salesman                449                QCIF
Silent                  300                QCIF
Tempete                 260                CIF
Table 5.9: Verbose trace example for H.264 encoding of the Foreman sequence (using QCIF resolution and quantization scale 5) with frames in encoder order (IPBB...).

Frame      Frame Type   Frame Time   Frame Size
Number n   [I,P,B]      Tn [ms]      Xn [bit]
...        ...          ...          ...
0          I            0.0          18466
3          P            100.0        12494
1          B            33.3         10048
2          B            66.7         10121
...        ...          ...          ...
The terse trace gives only the sequence of frame sizes in byte, as exemplarily shown in Table 5.10 for the Foreman test sequence. We note that in the encodings the last GoP is incomplete, since the last two B frames would reference a frame that is not available.
5.5 Evaluation of MPEG-4 FGS Encodings

In our evaluation, the Microsoft MPEG-4 reference software encoder/decoder with FGS functionality was used [117]. The video sequences were separated into scenes, whereby a scene can be defined as a sequence of video frames with similar characteristics. If there is a significant difference between two consecutive video frames, a scene change can be assumed. For these cases, a scene description trace file was generated, as shown in Table 5.11 for Silence of the Lambs in the CIF format.
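A simple scene change detector of this kind can be sketched as follows. The mean-absolute-difference criterion and the threshold value are illustrative assumptions, not the criterion used for the published scene traces.

# Minimal sketch of scene change detection from consecutive frame differences:
# a scene boundary is assumed when the mean absolute luminance difference
# exceeds a threshold. The threshold value is an illustrative assumption.
import numpy as np

def scene_last_frames(luma_frames, threshold=30.0):
    """Return the last frame number of each detected scene (frames numbered from 0)."""
    boundaries = []
    for n in range(1, len(luma_frames)):
        diff = np.mean(np.abs(luma_frames[n].astype(np.int16) -
                              luma_frames[n - 1].astype(np.int16)))
        if diff > threshold:
            boundaries.append(n - 1)         # previous frame ends the scene
    boundaries.append(len(luma_frames) - 1)  # last frame closes the final scene
    return boundaries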
Table 5.10: Terse trace example for H.264 encoding of the Foreman sequence (using QCIF resolution and quantization scale 5) with frames in encoder order (IPBB...).

Frame Size
Xn [bit]
...
18466
12494
10048
10121
...

Table 5.11: Scene description trace example for the Silence of the Lambs video sequence.

Scene   Last Frame Number n
...     ...
1       841
2       1481
3       1662
4       1721
5       2052
...     ...

We note that the first frame in a video is n = 0 to obtain the start reference for the first identified scene. First, we encode the video using two different sets of quantization parameters for the base layer. This gives compressed base layer bit streams of high quality (with quantization parameters for the different frame types matching the high encoding mode from Table 5.1) and medium quality (with quantization parameters for the different frame types matching the medium encoding mode from Table 5.1), as well as the associated enhancement layer bit streams. The Group of Pictures (GoP) structure of the base layer is set to IBBPBBPBBPBB.... An example base layer trace for the Silence of the Lambs video is given in Table 5.12. From the encoding process we furthermore obtain the sizes of the enhancement layer bit planes. The values for the different bitplanes of each video frame's enhancement layer are provided in the bitplane trace, as given in Table 5.13. Combining the base layer and the enhancement layer results in a quality improvement over the base layer only.
Table 5.12: Base layer trace example for fine grain scalable encoded Silence of the Lambs.

Frame      Frame Type   Size         PSNR Y        PSNR U        PSNR V
Number n   [I,P,B]      Xn^b [bit]   Qn^{b,Y} [dB] Qn^{b,U} [dB] Qn^{b,V} [dB]
...        ...          ...          ...           ...           ...
103        P            70904        36.2605       42.0972       43.2431
104        B            64496        36.3238       42.1504       43.1072
105        B            66048        36.3208       42.0671       43.1487
106        P            68640        36.241        41.9804       43.0543
107        B            59568        36.4295       42.1252       43.074
108        B            53376        36.5074       41.9577       42.9171
...        ...          ...          ...           ...           ...
Table 5.13: Bitplane trace for the enhancement layer of fine grain scalable encoded Silence of the Lambs.

Frame        Bitplane Size [bit]
Number n     1        2         3     4     5     6     7     8
...          ...      ...       ...   ...   ...   ...   ...   ...
103          54296    145376    0     0     0     0     0     0
104          52592    145480    0     0     0     0     0     0
105          52576    145472    0     0     0     0     0     0
106          54208    145904    0     0     0     0     0     0
107          50480    145280    0     0     0     0     0     0
108          48816    145488    0     0     0     0     0     0
...          ...      ...       ...   ...   ...   ...   ...   ...
To determine the quality improvement due to the FGS enhancement layer, the enhancement layer is cut at the increasing and equally spaced bit rates C = 0, 200, 400, ..., 2000 kbps. Combining the rate-restricted enhancement layer and the base layer results in the FGS enhancement layer trace, which provides the video frame qualities Qn(C) at each enhancement layer bit rate C. An example of such an FGS enhancement layer trace is given in Table 5.14.
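One plausible way to cut the enhancement layer at a rate C is to give each frame a bit budget of C divided by the frame rate and to include bitplanes in order until the budget is exhausted. The sketch below illustrates this idea; the per-frame budgeting and the 30 frames/s rate are assumptions, not necessarily the exact cutting rule of the reference software.

# Sketch of cutting the FGS enhancement layer at bit rate C: each frame gets a
# budget of C / fps bits, and bitplanes are included in order until the budget
# is used up (the last included bitplane is truncated).

def truncate_enhancement(bitplane_sizes_per_frame, rate_bps, fps=30.0):
    """bitplane_sizes_per_frame: list of lists of bitplane sizes (in bit) per frame.
    Returns the number of enhancement layer bits kept for each frame."""
    budget = rate_bps / fps
    kept = []
    for planes in bitplane_sizes_per_frame:
        remaining, used = budget, 0
        for size in planes:
            take = min(size, remaining)
            used += take
            remaining -= take
            if remaining <= 0:
                break
        kept.append(int(used))
    return kept

kept_bits = truncate_enhancement([[54296, 145376, 0], [52592, 145480, 0]],
                                 rate_bps=200_000)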
5.6 Evaluation of Wavelet Video Traces

First, the raw YUV frames are used as the input to the encoder software. The encoder software produces an intra-frame encoded video stream. This encoded video stream is then truncated at 10 different bit rate budgets, providing 10 individual streams at bit rates of 25, 75, 100, 300, 600, 800, 1000, 1200, 1400, and 1600 kbps.
Table 5.14: Enhancement layer trace example for fine grain scalable encoded Silence of the Lambs with the enhancement layer truncated at bit rate C = 200 kbps.

Frame      PSNR Y        PSNR U        PSNR V
Number n   QYn(C) [dB]   QUn(C) [dB]   QVn(C) [dB]
...        ...           ...           ...
103        36.6571       40.3034       40.6082
104        36.8797       40.2874       40.6511
105        36.7369       40.3096       40.7647
106        36.7385       40.1868       40.5847
107        37.0906       40.5567       41.0077
108        37.0827       40.4665       41.0811
...        ...           ...           ...
During the truncation, the truncating software also provides the frame sizes of the individual sub-streams, described in Section 3.4 and illustrated in Figure 3.23. Finally, the individual encoded streams are passed through the decoder, which produces the decoded video frames in the YUV format. Additionally, the decoder software produces the trace file, which contains the frame number, the aggregated frame size, and the PSNR of the decoded frame compared to the original frame. Note that the aggregated frame size is 10 bytes larger than the sum of the individual sub-stream sizes. This is because the aggregated frame carries an overhead of 10 bytes to record the 5 individual sub-stream sizes, i.e., 2 bytes per sub-stream. We illustrate the format of the combined video trace for the wavelet-encoded Star Wars video at 800 kbps in Table 5.15.

Table 5.15: Combined trace file format for wavelet video encoding of Star Wars with 800 kbps bit rate.

Frame      Size         PSNR Y      PSNR U      PSNR V
Number n   Xn [Byte]    QYn [dB]    QUn [dB]    QVn [dB]
...        ...          ...         ...         ...
100        3670         39          40          43
101        3679         38          40          43
102        3688         39          40          43
103        3713         39          40          42
104        3609         39          40          43
105        3694         39          40          42
...        ...          ...         ...         ...
Table 5.16: Substream trace file format for wavelet video encoding of Star Wars with 800 kbps bit rate.

Frame        Sub-stream Sizes [Byte]
Number n     Stream 1   Stream 2   Stream 3   Stream 4   Stream 5
...          ...        ...        ...        ...        ...
100          107        324        641        1166       1422
101          107        319        648        1185       1410
102          106        322        641        1211       1398
103          108        325        654        1199       1417
104          109        320        647        1164       1359
105          108        325        639        1186       1426
...          ...        ...        ...        ...        ...
The corresponding trace per sub-stream, i.e., in the applied scenario all 5 sub-streams, is given in Table 5.16.
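The relation between the aggregated frame size and the sub-stream sizes can be checked directly from the traces; the following sketch reproduces the 10-byte overhead for the five sub-streams of a frame (values taken from Tables 5.15 and 5.16).

# Sketch of the relationship between the aggregated frame size and the five
# sub-stream sizes: the aggregate carries an extra 2 bytes of size information
# per sub-stream (10 bytes for 5 sub-streams).

HEADER_BYTES_PER_SUBSTREAM = 2

def aggregated_frame_size(substream_sizes):
    """substream_sizes: sizes of the individual sub-streams of one frame, in byte."""
    return sum(substream_sizes) + HEADER_BYTES_PER_SUBSTREAM * len(substream_sizes)

# Frame 100 of the Star Wars trace at 800 kbps:
print(aggregated_frame_size([107, 324, 641, 1166, 1422]))  # -> 3670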
5.7 Evaluation of Pre-Encoded Content

The typical approach used in video trace generation and evaluation is to study the impact of different video encoding parameters on the video traffic and quality characteristics. This type of investigation is very time consuming, because (i) the entire grabbing and encoding process takes a long time and (ii) the diversity of encoding standards, encoders, and encoder settings requires multiple repetitions of the encoding process to capture the impact of the variety of parameters. Furthermore, we face the problem that numerous new or modified video encoders are emerging. As an example, current video players support about 100 different video codecs and their derivatives. The most important encoders are DivX;-) (including DIV3, DIV4, DIV5, DIV6, MP43, etc.), Windows Media Video 7/8/9, and the RealPlayer (including RV 20/30/40).
We illustrated the video trace generation for pre-encoded video content with the help of the modified MPlayer tool in Figure 5.4 at the beginning of this chapter. The already encoded video sequences were fed into the MPlayer tool [122] version 0.90 by Árpád Gereöffy. The tool is based on the libmpg3 library and is an advancement of the mpg12play and avip tools. Major modifications to the source code were made such that the MPlayer tool plays the video sequence and simultaneously prints, for each frame, the frame number, the play-out time, the video frame size, the audio frame size, and a cumulative bit size into the raw trace files. An excerpt of a raw trace file obtained with this approach is given in Table 5.17. By means of this approach we avoid having to write a parser for each video codec.
Table 5.17: Raw trace file format for pre-encoded video encoding of Kiss Of The Dragon.

Frame      Time        Video Size   Audio Size   Cumulative Size
Number n   Tn          Xn [Byte]    [Byte]       [Byte]
...        ...         ...          ...          ...
103        4.254259    9854         640          455246
104        4.295967    9930         640          465816
105        4.337676    10054        640          476510
106        4.379384    10126        640          487276
107        4.421093    5805         640          493721
108        4.462801    5830         640          500191
...        ...         ...          ...          ...
Using these raw video traces, we create the verbose and terse video traces in the same format presented for the H.264 traces, see Section 5.4. The trace files were then used for the statistical analysis of the video data. We measured that the video file size is always slightly larger than the sum of the frame sizes produced by the video and audio encoders. To explain this fact, we first note that the video sequences are mostly distributed in the AVI format. Simply put, the AVI format is a container, and the container information makes the file size larger than the sum of the video and audio data. We do not include this overhead in our trace files. In the case of multimedia streaming, the video and audio information is packetized into RTP packets, and the RTP header contains all the information needed for the play-out process at the receiver. Therefore, we assume that the additional container information is not needed and hence do not include it in the trace file.
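Converting such a raw trace into a terse video-only trace amounts to extracting the video frame size column. The sketch below assumes whitespace-separated columns in the order of Table 5.17 and uses hypothetical file names.

# Sketch: convert a raw MPlayer trace (frame number, play-out time, video size,
# audio size, cumulative size) into a terse video-only trace of frame sizes.

def raw_to_terse(raw_path, terse_path):
    with open(raw_path) as src, open(terse_path, "w") as dst:
        for line in src:
            cols = line.split()
            if len(cols) < 5:
                continue                       # skip malformed lines
            video_size_byte = int(cols[2])     # video frame size in byte
            dst.write(f"{video_size_byte}\n")  # audio/container overhead is dropped

raw_to_terse("kiss_of_the_dragon.raw", "kiss_of_the_dragon.terse")  # hypothetical files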
5.8 Evaluation of MDC Encodings

The video sequences used for the MDC encoding, together with their lengths in frames and a short description, are given in Table 5.18 for the QCIF and CIF formats. For each video sequence, the encodings were done for J ≤ 20. For J = 1, the video trace has a frame spacing of 40 ms, as shown in the example given in Table 5.19. Increasing J to 10, the spacing increases to 400 ms, as given in the trace file illustrated in Table 5.20. The other nine disjoint descriptors also have a spacing of 400 ms, but with different offset values (multiples of 40 ms).
Table 5.18: YUV QCIF/CIF video sequences from [123].

Video Sequence Name     Video Format    Frames   Information
bridge-close            QCIF and CIF    2000     Charles Bridge.
bridge-far              QCIF and CIF    2101     Charles Bridge far view.
highway                 QCIF and CIF    2000     Driving over highway.
carphone                QCIF            382      Man talking at the phone.
claire                  QCIF            494      Female talking to the camera.
container               QCIF            300      Ship leaving the harbor.
foreman                 QCIF            400      Man speaking to the camera.
grandma                 QCIF            870      Grandma in front of the camera.
mother and daughter     QCIF            961      Mom and daughter speaking.
news                    QCIF            300      News studio and two speakers.
salesman                QCIF            449      Salesman in his office.
silent                  QCIF            300      Woman doing sign language.
mobile                  CIF             300      Train is moving.
paris                   CIF             1065     Two people talking to each other.
tempete                 CIF             260      Moving camera.
Table 5.19: Video trace example for MDC encoded traces for a single descriptor, J = 1.

Frame      Frame Type   Time       Size
Number n   [I,P,B]      Tn [ms]    Xn [bit]
0          I            0.0        172720
1          P            40.0       39352
2          P            80.0       35936
3          P            120.0      33672
4          P            160.0      35016
...        ...          ...        ...
Table 5.20: Video trace example for MDC encoded traces for one of ten descriptors, J = 10.

Frame      Frame Type   Time       Size
Number n   [I,P,B]      Tn [ms]    Xn [bit]
0          I            0.0        172720
1          P            400.0      61152
2          P            800.0      65168
3          P            1200.0     61064
4          P            1600.0     59576
...        ...          ...        ...
6 Statistical Results from Video Traces
In this chapter we present an overview of findings from our extensive library of video traces, which is publicly available at [40]. The detailed notations and statistical definitions used in this chapter were presented in Chapter 4, and Chapter 5 provided the details on the generation and evaluation process for the different traces. This chapter, on the other hand, focuses on the results obtained from the various traces.
6.1 Video Trace Statistics for MPEG-4 Encoded Video

In this section we present an overview of our publicly available library of MPEG-4 traces of heterogeneous and scalable encoded video [40]. The traces evaluated here have been generated from over 15 videos of one hour each, which have been encoded into a single layer at heterogeneous qualities and into two layers using the temporal scalability and spatial scalability modes of MPEG-4. For a compact representation of our results, we present the aggregated findings from a subset of our video trace library [40] and the individual evaluation of the Silence of the Lambs video sequence to illustrate our findings by an example.

6.1.1 Examples from Silence of the Lambs

Single Layer Encodings

In the following, we present exemplary results for the analysis of the frame size traces of the Silence of the Lambs video sequence, encoded in the variety of encoding modes introduced in Table 5.1. We use this example to provide the reader with an example of how to interpret the results that are presented in overview form for the excerpt of the video trace library in the following sections and that are presented in detail in [124] for single layer encodings, in [125] for temporal scalable encodings, and in [126] for spatial scalable encodings.

© [2004] IEEE. Reprinted, with permission, from: P. Seeling, M. Reisslein, and B. Kulapala. Network Performance Evaluation with Frame Size and Quality Traces of Single-Layer and Two-Layer Video: A Tutorial. IEEE Communications Surveys and Tutorials, Vol. 6, No. 3, pp. 58–78, 3rd quarter 2004.
© [2004] IEEE. Reprinted, with permission, from: B. Kulapala, P. Seeling, and M. Reisslein. Comparison of Traffic and Quality Characteristics of Rate-Controlled Wavelet and DCT Video. In Proc. IEEE International Conference on Computer Communications and Networks (ICCCN), pp. 247–252, Chicago, IL, October 2004.
Fig. 6.1: Single layer frame size (Xn) as a function of the frame number (n) for Silence of the Lambs with different encoding modes from Table 5.1.

We illustrate the frame sizes Xn in byte as a function of the frame number n in Figure 6.1 for Silence of the Lambs encoded in the different encoding modes. In these frame size plots, we observe large variations. In some periods the frame sizes are very large, whereas in other periods the frame sizes are smaller. These different periods correspond to different content in the video. We also observe that even with rate control employed, these periods of large frame sizes, and thus high traffic volume, remain due to the content dependency.
Fig. 6.2: Single layer average frame size of one GoP (Ym/12) as a function of the GoP number (m) for Silence of the Lambs with different encoding modes from Table 5.1.

As illustrated in Figure 6.2, these variations are reduced when the frame sizes are averaged (smoothed) over the period of one GoP (12 frames with the encoding parameters outlined in Chapter 5). We additionally observe that even with smoothing, some variability remains. From the rate-controlled encodings in Figure 6.2, we further observe that for the lower target bit rates, the TM5 rate control algorithm is not able to achieve the target bit rates, as indicated by the spikes in the averaged frame sizes. Observing the frame size histograms illustrated in Figure 6.3, we see that the rate-controlled encodings have narrower histograms than the quantizer-controlled encodings. This is explained by the rate control algorithm trying to match the given target bit rate for the different encoding modes.
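The GoP-level smoothing used in Figure 6.2 can be expressed compactly; the following sketch averages the frame sizes over non-overlapping groups of 12 frames (the GoP length used in our encodings).

# Sketch of GoP-level smoothing: frame sizes are summed over non-overlapping
# groups of a = 12 frames (one GoP) and divided by the GoP length to obtain
# the average frame size per GoP (Y_m / 12 in the notation of Chapter 4).

def gop_averages(frame_sizes, gop_len=12):
    averages = []
    for start in range(0, len(frame_sizes) - gop_len + 1, gop_len):
        gop = frame_sizes[start:start + gop_len]
        averages.append(sum(gop) / gop_len)
    return averages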
Fig. 6.3: Single layer frame size histograms for Silence of the Lambs with different encoding modes from Table 5.1.

We observe that the high bandwidth rate-controlled encoding (256 kbps) and, at the opposite extreme, the low quantizer-controlled encoding both exhibit double peaks, whereas the other encodings do not. We illustrate the frame size distributions for these two outstanding encoding modes, low and 256 kbps, for the individual frame types (I, P, B) in Figure 6.4. With the higher allowed bandwidth, the TM5 algorithm is able to allocate the bits better on a per-frame basis, which results in individual single peaks of the histograms for the different frame types. The overlapping of the three histograms in turn produces the multiple peaks that we observed in Figure 6.3. In particular, we see from the histograms by frame type that the first peak observed in Figure 6.3 is due to the small sizes of the B frames (which are also large in number), the second, smaller peak in Figure 6.3 is due to the sizes of the P frames (which are medium in number), and the third, flat peak between frame sizes of 3000 and 4000 byte in Figure 6.3 is due to the I frame size distribution (whereby we note that the I frames are smallest in number).
Fig. 6.4: Single layer frame size histograms by frame type for Silence of the Lambs in the low and 256 kbps encoding modes from Table 5.1.

For the quantizer-controlled encodings, on the other hand, the peaks for the different frame types become less pronounced and, due to the content dependency, the frame size histograms for the individual frame types exhibit multiple, spread peaks themselves. In turn, the combination of the different frame size histograms no longer results in a single, defined peak, but rather in the spread multiple-peak region observed in Figure 6.3.
The different numbers of the three frame types, in turn, determine which of the characteristics observed in Figure 6.4 becomes most dominant in Figure 6.3. For our example, we note that the combination of the multiple peaks in the P and B frame size histograms is clearly visible, whereas the peak of the I frame histogram from Figure 6.4 is only barely visible in Figure 6.3.
Fig. 6.5: Autocorrelation coefficient (ACC) ρX (k) for frame sizes as function of the lag in frames (k) Silence of the Lambs with different encoding modes from Table 5.1.
We illustrate the autocorrelation coefficient (ACC) ρX(k) of the frame sizes as a function of the lag k in frames in Figure 6.5 for Silence of the Lambs with the different encoding modes. For the encodings without rate control, the frame size ACC is represented by a train of spikes superimposed on a slowly decaying curve. For the high quality encoding, the decay is slower than for the lower quality encodings, whereas the decay is not clearly visible for the rate-controlled encodings. The spikes in the plots can be explained as follows. The largest spikes occur at lags that are multiples of the GoP length (in our case 12 frames) and represent the correlation of the always very large I frames at these lags. The smaller spikes represent the correlation of I and P frames and of P and P frames, since P frames are typically between the sizes of I and B frames. The smallest correlation is observed between the B frames and the other frame types. This relationship is largely independent of the encoding mode, i.e., it depends highly on the selected GoP structure and little on the actual video content. The general level of the ACC, as well as the superimposed decaying curve, on the other hand, are indicators for the self-similarity of the generated video traffic and its long-range dependence. We observe that the ACCs of the rate-controlled encodings are all around zero and only exhibit a very slight decay for the 64 kbps encoding. The quantizer-controlled encodings, on the other hand, exhibit a slowly decaying ACC, which is at a higher level for higher quality. This can be explained as follows. When the encoding is of very low quality, the frame sizes become more random and, due to the loss in the encoding process, less content dependent. As the quality of the encoding increases, the content dependency increases as well, as the particular features of the video content are preserved by the compression. For frames that are close together, the content is similar, and so the encoded frame sizes exhibit a higher correlation. With rate control enabled, however, the content dependency is not as high, as the rate control algorithm allocates the bit budget using a fixed bit budget allocation formula.
We illustrate the corresponding autocorrelation coefficients (ACC) ρY(k) at the GoP level (a = 12) in Figure 6.6. We observe that the spikes visible at the frame level are no longer present. This is due to the aggregation of the different frame types into a complete GoP, which removes the typical frame size differences that originate from the GoP structure in terms of frame types. We additionally observe that for the quantizer-controlled encodings, the decrease in the autocorrelation coefficient is slightly lower than exponential and the ACC remains above zero, indicating that the GoP size process is approaching a memoryless behaviour. For the rate-controlled encodings, we observe an immediate and sharp drop in the ACC (except for the 64 kbps encoding) and that the ACC remains around zero as the lag k increases.
We now illustrate the R/S plots for different encodings of the Silence of the Lambs video in Figure 6.7. The Hurst parameters estimated from the R/S plots illustrated in Figure 6.7 and from aggregated frame size traces with different aggregation levels a are given in Table 6.1. For the Hurst parameter H as a measure of the long-range dependence of the video frame sizes at different aggregation levels a, we note that there is a general trend of decreasing values of H with increasing aggregation levels a, as previously studied in detail, see, e.g., [100]. In general, we observe that the Hurst parameters for the
Fig. 6.6: Autocorrelation coefficient (ACC) for GoP sizes ρY (k) as function of the lag k (in GoPs) for Silence of the Lambs with different encoding modes from Table 5.1. rate-controlled encodings are smaller than those obtained for the quantizercontrolled encodings. We illustrate the periodogram plots at the GoP aggregation level (a = 12) for different encodings of the Silence of the Lambs video in Figure 6.8. We present the corresponding Hurst parameters H that were obtained from the periodogram for different aggregation levels a ≥ 12 in Table 6.2. We observe that for the Hurst parameter H estimation from the periodogram plots, we obtain similar insights to those from the R/S plots.
Fig. 6.7: R/S plots for the frame sizes of Silence of the Lambs with different encoding modes from Table 5.1.

For the different encodings of the Silence of the Lambs video, we illustrate the variance time plots used for the estimation of the Hurst parameter in Figure 6.9. We present the corresponding Hurst parameters H that were obtained from the variance time plots in Table 6.3. We observe that the Hurst parameter H estimates from the variance time plots tend to be smaller than those obtained with the other methods. The logscale diagrams for Silence of the Lambs are given in Figure 6.10, and the Hurst parameters estimated from the logscale diagrams are given in Table 6.4. Some of the estimated Hurst parameters are above one, which should be viewed with caution, as the Hurst parameter is only defined up to one.
Table 6.1: Hurst parameter values obtained from the R/S plots for Silence of the Lambs.

                                         Aggregation level a
Encoding Mode   1      12     24     48     96     192    300    396    504    600    696    792
High            0.977  0.905  0.889  0.890  0.911  0.923  0.903  0.898  0.805  0.795  0.828  0.752
Medium          0.858  0.912  0.895  0.889  0.881  0.888  0.887  0.853  0.831  0.806  0.808  0.776
Low             0.871  0.892  0.878  0.876  0.887  0.881  0.842  0.883  0.852  0.820  0.825  0.856
64 kbps         0.661  0.682  0.661  0.643  0.645  0.632  0.659  0.618  0.659  0.654  0.591  0.583
128 kbps        0.258  0.433  0.436  0.428  0.421  0.503  0.523  0.594  0.553  0.566  0.487  0.599
256 kbps        0.182  0.324  0.345  0.349  0.352  0.372  0.362  0.404  0.671  0.711  0.995  0.799
Table 6.2: Hurst parameter values obtained from the periodogram plots for Silence of the Lambs.

                                         Aggregation level a
Encoding Mode   12     24     48     96      192     300    396    504     600    696     792
High            1.203  1.250  1.168  1.061   1.018   1.011  1.006  1.084   1.040  1.074   1.120
Medium          1.053  1.131  1.106  1.034   1.008   1.030  1.006  1.034   1.089  1.061   1.123
Low             0.995  1.072  1.036  0.984   0.942   0.966  0.916  0.906   0.971  0.980   1.039
64 kbps         0.890  0.911  0.917  0.877   0.867   0.872  0.799  0.799   0.863  0.926   1.005
128 kbps        0.723  0.835  0.972  1.039   0.951   0.631  0.329  0.198   0.127  -0.032  0.004
256 kbps        0.365  0.416  0.379  -0.093  -0.132  0.011  0.014  -0.257  0.093  0.329   0.202
Table 6.3: Hurst parameter values obtained from the variance time plots for Silence of the Lambs.

Encoding Mode   High    Medium   Low     64 kbps   128 kbps   256 kbps
H               0.909   0.895    0.866   0.763     -0.748     -0.190
Table 6.4: Hurst parameter values obtained from the logscale plots for Silence of the Lambs.

Encoding Mode   High    Medium   Low     64 kbps   128 kbps   256 kbps
H               1.002   1.064    0.659   -0.085    -0.124     -0.458
One explanation for the overestimation is that the employed logscale estimation is based on the assumption of a Gaussian time series, whereas the frame size traces are typically non-Gaussian. We present the multiscale diagrams for Silence of the Lambs in Figure 6.11. Table 6.5 gives the multiscaling parameter αq for the orders q = 0.5, 1, 1.5, 2, 2.5, 3, 3.5, and 4. We observe that the scaling parameters tend to increase with increasing q (with the exception of the low quality encoding). Note that the Hurst parameter estimate is given by H = α2/2 for the employed estimation with a c norm of one. We observe again that a number of estimates are around one or exceed one. The number of these "suspicious" H estimates, however, is smaller than with the logscale plot estimation. This may be due to the fact that the multiscale estimation does not assume a Gaussian time series.
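The frame-level statistics discussed in this section can be recomputed from the traces. The sketch below shows the autocorrelation coefficient and a variance-time estimate of the Hurst parameter; the chosen aggregation levels are illustrative, and this is not the estimation software used for the reported values.

# Sketch of two statistics used in this section: the autocorrelation
# coefficient rho_X(k) of the frame size process and a variance-time estimate
# of the Hurst parameter H.
import numpy as np

def acc(x, k):
    """Autocorrelation coefficient of the frame size sequence x at lag k."""
    x = np.asarray(x, dtype=np.float64)
    if k == 0:
        return 1.0
    return float(np.corrcoef(x[:-k], x[k:])[0, 1])

def hurst_variance_time(x, levels=(1, 3, 12, 48, 96, 300, 792)):
    """Estimate H from the slope of log10(variance of aggregated means) vs log10(a)."""
    x = np.asarray(x, dtype=np.float64)
    log_a, log_var = [], []
    for a in levels:
        m = len(x) // a
        if m < 2:
            continue                            # not enough blocks at this level
        means = x[:m * a].reshape(m, a).mean(axis=1)
        log_a.append(np.log10(a))
        log_var.append(np.log10(means.var()))
    slope = np.polyfit(log_a, log_var, 1)[0]    # slope = 2H - 2 for a self-similar process
    return 1.0 + slope / 2.0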
Fig. 6.8: Periodogram plots for the frame sizes of Silence of the Lambs with different encoding modes from Table 5.1.

6.1.2 Videos and Encoder Modes for Evaluated MPEG-4 Video Traces

For our statistical overview of single layer and temporal scalable videos, we consider the traces of the videos in Table 6.6. All considered videos are 60 minutes long, corresponding to 108,000 frames, and are in the QCIF format. For the spatial scalable encodings (see Section 6.1.5), only 30 minutes (54,000 frames) of the videos in the CIF format are considered. We consider the encodings without rate control with the fixed quantization scales in Table 5.1.
Fig. 6.9: Variance time plots for different aggregation levels a of frame sizes of Silence of the Lambs with different encoding modes from Table 5.1.

For the rate control encodings we consider TM5 [118] rate control with the target bit rate settings summarized in Table 5.1 as well. The base layer of the considered temporal scalable encoding gives a basic video quality by providing a frame rate of 10 frames per second. Adding the enhancement layer improves the video quality by providing the (original) frame rate of 30 frames per second. With the considered spatial scalable encoding, the base layer provides video frames that are one fourth of the original size (at the original frame rate), i.e., the number of pixels in the video frames is cut in half in both the horizontal and vertical direction. (These quarter size
frames can be up-sampled to give a coarse grained video with the original size.) Adding the enhancement layer to the base layer gives the video frames in the original size (format).

Fig. 6.10: Logscale plots for frame sizes of Silence of the Lambs with different encoding modes from Table 5.1.

For each video and scalability mode we have generated traces for videos encoded without rate control and for videos encoded with rate control. For the encodings without rate control we keep the quantization parameters fixed, which produces nearly constant quality video (for both the base layer and the aggregate (base + enhancement layer) stream, respectively) but highly variable video traffic. For the encodings with rate control we employ the TM5
rate control, which strives to keep the bit rate around a target bit rate by varying the quantization parameters, and thus the video quality. We apply rate control only to the base layer of scalable encodings and encode the enhancement layer with fixed quantization parameters. Thus, the bit rate of the base layer is close to a constant bit rate, while the bit rate of the enhancement layer is highly variable. This approach is motivated by networking schemes that provide constant bit rate transport with very stringent quality of service for the base layer and variable bit rate transport with less stringent quality of service for the enhancement layer.

Fig. 6.11: Multiscale diagrams for Silence of the Lambs with different encoding modes from Table 5.1.
Table 6.5: Multiscaling parameter values αq for orders q obtained for Silence of the Lambs.

Encoding Mode   q = 0.5   q = 1    q = 1.5   q = 2    q = 2.5   q = 3    q = 3.5   q = 4
High             0.503     1.001    1.487     1.961    2.424     2.877    3.322     3.758
Medium           0.548     1.092    1.621     2.136    2.622     3.085    3.536     3.980
Low              0.455     0.826    1.094     1.243    1.290     1.270    1.214     1.143
64 kbps          0.083    -0.067   -0.266    -0.466   -0.662    -0.853   -1.041    -1.227
128 kbps        -0.077    -0.229   -0.557    -1.091   -1.739    -2.429   -3.126    -3.820
256 kbps        -0.096    -0.186   -0.349    -0.664   -1.107    -1.602   -2.110    -2.620
6.1.3 Single Layer Encoded Video

In this section we give an overview of the video traffic and quality statistics of the single layer encodings, which are studied in greater detail in [124]. In Table 6.7, we give an overview of the elementary frame size and bit rate statistics. We consider the average frame size X, the coefficient of variation CoVX (defined as the standard deviation of the frame size normalized by the mean frame size), the peak-to-mean ratio of the frame sizes Xmax/X, the mean and peak bit rates, as well as the average PSNR quality Q and the coefficient of quality variation CoQV. We note that the PSNR does not completely capture the many facets of video quality. However, analyzing a large number of videos subjectively becomes impractical. Moreover, recent studies have found that the PSNR is as good a measure of video quality as other, more sophisticated objective quality metrics [127]. As the PSNR is well-defined only for the luminance (Y) component [128] and since the human visual system is more sensitive to small changes in the luminance, we focus on the luminance PSNR values.

For a compact presentation of the findings from this subset of our trace library, we report in Table 6.7, for each metric, the minimum, mean, and maximum over the set of videos given in Table 6.6. This presentation, which we adopt for most tables in this chapter, conveys the main characteristics of the different encoding and scalability modes. However, it does not convey the impact of the different video genres and content features on the video traffic and quality, for which we refer to [124].

Focusing for now on the encodings without rate control, we observe that the coefficient of variation CoVX and the peak-to-mean ratio Xmax/X increase as the quantization scale increases (i.e., as the video quality decreases), indicating that the video traffic becomes more variable. As the quality decreases further, the coefficient of variation and peak-to-mean ratio decrease. In other words, we observe a peak ("hump") of the coefficient of variation and peak-to-mean ratio of the frame sizes for intermediate video quality. From Table 6.7 we observe a similar hump phenomenon for the coefficient of variation and the peak-to-mean ratios of the GoP sizes, which we denote by Y. These observations extend earlier studies [129], which considered a smaller range of the quantization scale and uncovered only an increasing trend in the coefficient of variation and the peak-to-mean ratio for increasing quantization scales (i.e., decreasing video quality).
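The elementary traffic and quality metrics used throughout this chapter can be computed with a few lines of code. The following is a minimal Python sketch that assumes the frame sizes (in byte) and luminance PSNR values (in dB) are already available as arrays; the frame rate of 30 frames per second and the GoP length of 12 frames are illustrative assumptions rather than properties of every trace.

```python
import numpy as np

def frame_statistics(frame_sizes_byte, psnr_db, frames_per_sec=30.0, gop_len=12):
    """Elementary frame size, bit rate, GoP, and quality statistics of a trace."""
    x = np.asarray(frame_sizes_byte, dtype=float)
    q = np.asarray(psnr_db, dtype=float)

    mean_x = x.mean()                      # average frame size (byte)
    cov_x = x.std() / mean_x               # coefficient of variation of frame sizes
    peak_to_mean_x = x.max() / mean_x      # peak-to-mean ratio of frame sizes

    mean_bit_rate = 8 * mean_x * frames_per_sec   # bit/s
    peak_bit_rate = 8 * x.max() * frames_per_sec  # bit/s (largest frame in one frame period)

    # GoP sizes Y: sums of the frame sizes in non-overlapping groups of gop_len frames
    n_gops = len(x) // gop_len
    y = x[:n_gops * gop_len].reshape(n_gops, gop_len).sum(axis=1)
    cov_y = y.std() / y.mean()
    peak_to_mean_y = y.max() / y.mean()

    mean_q = q.mean()                      # average luminance PSNR (dB)
    coqv = q.std() / mean_q                # coefficient of quality variation

    return dict(mean_frame_size=mean_x, cov_x=cov_x, peak_to_mean_x=peak_to_mean_x,
                mean_bit_rate=mean_bit_rate, peak_bit_rate=peak_bit_rate,
                cov_y=cov_y, peak_to_mean_y=peak_to_mean_y,
                mean_psnr=mean_q, coqv=coqv)
```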
Table 6.6: Overview of studied video sequences in QCIF format; see Table 5.1 for details on the settings of the encoding modes.

Class          Video                  Genre
Movies         Citizen Kane           Drama
               Die Hard I             Action
               Jurassic Park I        Action
               Silence Of The Lambs   Drama
               Star Wars IV           Sci-fi
               Star Wars V            Sci-fi
               The Firm               Drama
               The Terminator I       Action
               Total Recall           Action
Cartoons       Aladdin                Cartoon
               Cinderella             Cartoon
Sports         Baseball               Game 7 of the 2001 World Series
               Snowboarding           Snowboarding Competition
TV Sequences   Tonight Show           Late Night Show

Encoding modes (see Table 5.1): each sequence was encoded in the low, medium, and high modes; a subset of the sequences was additionally encoded in the medium–low and medium–high modes.
Table 6.7: Overview of frame statistics of single-layer traces (QCIF). Enc. Mode High
Medium – High Medium
Medium – Low Low
64 kbps
128 kbps
256 kbps
Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max
Frame Size Bit Rate GoP Size Mean CoV Peak/M. Mean Peak CoV Peak/M. ¯ X Xmax Ymax X max X CoVX CoVY ¯ ¯ T T X Y [kbyte] [Mbps] [Mbps] 1.881 0.399 4.115 0.451 3.108 0.284 2.606 3.204 0.604 6.348 0.769 4.609 0.425 4.136 5.483 0.881 8.735 1.316 6.31 0.709 7.367 0.613 1.017 9.345 0.147 1.93 0.536 6.087 0.738 1.146 12.819 0.177 2.202 0.645 6.754 0.949 1.36 16.303 0.228 2.398 0.803 7.902 0.333 1.173 10.688 0.08 1.586 0.438 3.642 0.55 1.489 16.453 0.132 2.045 0.547 6.03 0.874 2.128 25.386 0.21 2.708 0.77 12.268 0.23 1.033 11.466 0.055 0.775 0.447 4.498 0.273 1.206 15.438 0.065 0.992 0.546 5.405 0.327 1.547 19.468 0.078 1.272 0.747 6.148 0.194 0.82 7.67 0.047 0.522 0.383 3.02 0.282 0.943 11.357 0.067 0.742 0.441 4.642 0.392 1.374 17.289 0.094 1.104 0.671 8.35 0.267 0.806 8.398 0.064 0.774 0.354 2.991 0.297 1.022 48.328 0.0714 3.353 0.411 9.563 0.384 1.494 82.72 0.092 5.488 0.46 18.51 0.534 1.066 17.749 0.128 2.274 0.089 2.626 0.534 1.189 28.135 0.128 3.606 0.143 4.776 0.535 1.401 50.883 0.128 6.52 0.277 9.691 1.067 0.904 6.89 0.256 1.765 0.03 1.395 1.067 1.000 9.841 0.256 2.521 0.0431 1.65 1.067 1.106 13.086 0.256 3.352 0.072 2.387
Frame Quality Mean CoV ¯ CoQV Q [dB] 25.052 0.162 36.798 0.326 37.674 0.67 30.782 0.353 31.705 0.56 32.453 0.907 28.887 0.465 30.29 1.017 31.888 3.685 26.535 0.438 27.539 0.824 28.745 1.099 25.177 0.434 26.584 0.712 28.446 1.618 25.052 0.446 26.624 0.746 28.926 1.585 26.12 0.641 28.998 1.197 31.795 3.021 28.461 0.639 31.414 1.432 33.824 5.307
While the origins of this hump phenomenon are under investigation in ongoing work, a detailed analysis of the different factors influencing this "hump" behavior is given in [28], together with implications of this phenomenon for statistical multiplexing. Additional guidelines for networking studies are detailed in Chapter 8.

Next, we observe that the encodings with rate control with target bit rates of 64 and 128 kbps tend to have significantly larger coefficients of variation than the encodings without rate control. This is primarily because the employed TM5 rate control algorithm allocates target bit rates to each of the frame types (I, P, and B) and thus provides effective rate control at the GoP time scale, with potentially large variations of the individual frame sizes. Even with TM5 rate control, however, there are some small variations in the GoP sizes, see Table 6.7. These variations are mostly due to relatively few outliers, resulting in the quite significant peak-to-mean ratio, yet very small coefficient of variation. (As a side note, we remark that the 128 kbps and 256 kbps target bit rates are met perfectly (in the long run average), while the 64 kbps target is not always met. This is because the employed encoder does not allow for quantization scales larger than 30,30,30, which gives an average bit rate above 64 kbps for some videos.) Both the typically very large frame size variations with rate control and the residual variation at the larger GoP time scale need to be taken into consideration in networking studies.
To assess the long range dependence properties of the encoded videos, we determined the Hurst parameter of the frame size traces using the R/S plot, the periodogram, the variance-time plot, and the logscale diagram, see [130] for details. We have found that the encodings without rate control generally do exhibit long range dependence, with the Hurst parameter typically ranging between 0.75 and 0.95. The encodings with rate control typically do not exhibit long range dependence (except for the cases where the 64 kbps target bit rate could not be reached due to the quantization scale being limited to at most 30). In stark contrast to the behavior of the variability (CoVX and Xmax/X) observed above, the Hurst parameter estimates are roughly the same when comparing different quality levels.

We have also investigated the multifractal scaling characteristic of the video traffic using the wavelet based multiscale diagram, see Chapter 4. We found that the linear multiscale diagram generally does not differ significantly from a horizontal line. This indicates that the video traffic is mono-fractal, i.e., does not exhibit a significant multi-fractal behavior.

6.1.4 Temporal Scalable Encoded Video

Base Layer

Table 6.8 summarizes the frame size and quality statistics of the base layer of the temporal scalable encoded video. Recall that in the considered temporal scalable encodings, the I and P frames constitute the base layer and the
Table 6.8: Overview of frame statistics for the base layer of temporal scalability (QCIF resolution). Frame Size Bit Rate Aggregated (3) GoP Size CoV Peak/M. Mean Peak CoV Peak/M. CoV Peak/M. b(3) b b b X X Ymax ¯ b(3) b Xmax X max max CoV b X CoV b CoV T T X Y ¯b ¯b ¯b X X X Y [kbyte] [Mbps] [Mbps] 0.895 1.54 9.68 0.215 3.124 0.351 3.227 0.281 2.437 1.458 1.6878 12.897 0.35 4.363 0.522 4.3 0.395 3.536 2.316 1.994 18.463 0.556 6.285 0.812 6.154 0.668 5.762 0.349 1.96 16.47 0.084 1.918 0.783 5.49 0.486 4.596 0.4245 2.135 22.033 0.102 2.179 0.919 7.345 0.57 5.513 0.539 2.405 28.651 0.129 2.398 1.123 9.551 0.708 7.532 0.224 2.038 16.478 0.054 1.586 0.848 5.493 0.375 3.138 0.3727 2.292 23.818 0.089 2.0349 1.037 7.939 0.49 4.837 0.567 2.872 37.791 0.136 2.708 1.443 12.597 0.686 8.617 0.146 1.987 19.051 0.035 0.784 0.806 6.351 0.414 3.896 0.16425 2.163 25.88 0.0393 1.002 0.939 8.627 0.500 4.989 0.197 2.533 33.329 0.047 1.272 1.213 11.111 0.665 6.776 0.11 1.797 13.74 0.026 0.556 0.64 4.58 0.352 2.639 0.1574 1.912 20.058 0.038 0.736 0.743 6.687 0.418 4.152 0.211 2.37 30.309 0.051 1.104 1.098 10.104 0.622 7.139 0.267 1.782 24.886 0.064 1.594 0.626 8.296 0.138 3.286 0.267 1.883 42.52 0.064 2.723 0.716 14.173 0.209 6.016 0.267 2.051 70.436 0.064 4.511 0.857 23.479 0.338 12.126 0.534 1.645 10.29 0.128 1.318 0.486 3.43 0.045 1.417 0.534 1.705 12.629 0.128 1.617 0.548 4.21 0.082 1.737 0.534 1.819 18.772 0.128 2.404 0.661 6.257 0.138 2.613 1.067 1.518 8.504 0.256 2.177 0.318 2.835 0.021 1.231 1.067 1.546 10.125 0.256 2.593 0.359 3.375 0.038 1.397 1.067 1.617 11.664 0.256 2.987 0.453 3.888 0.064 1.722 Mean
Enc. Mode High Min Mean Max Medium Min – Mean High Max Medium Min Mean Max Medium Min – Mean Low Max Low Min Mean Max 64 kbps Min Mean Max 128 kbps Min Mean Max 256 kbps Min Mean Max
Frame Quality Mean CoV ¯ b CoQV b Q [dB] 20.944 24.28 27.623 24.437 25.386 26.809 20.797 23.804 27.047 23.422 24.264 25.067 20.279 22.842 25.828 20.35 23.364 26.853 20.688 23.842 27.292 20.842 24.088 27.508
2.292 3.167 4.731 2.406 2.865 3.402 2.172 2.76 3.85 0.848 1.805 2.859 1.494 2.157 2.673 1.875 2.473 3.434 2.102 2.796 4.127 2.218 2.992 4.577
B frames constitute the enhancement layer. With the IBBPBBPBBPBBPBBIBB... GoP structure, the frame sizes X^b_{3k+1} and X^b_{3k+2}, k = 0, . . . , N/3−1, are zero, as these correspond to gaps in the base layer frame sequence. We observe for the encodings without rate control that the temporal base layer traffic is significantly more variable than the corresponding single layer traffic. The peak-to-mean ratio X^b_max/X^b of the base layer frame sizes is roughly 1.5 to 2 times larger than the corresponding Xmax/X of the single layer traces (from Table 6.7). This larger variability of the base layer of the temporal scalable encoding is due to the fact that the frames missing in the base layer are counted as zeros in the frame size analysis, i.e., the frame size analysis considers a scenario where each frame is transmitted during its frame period of 33 msec and nothing is transmitted during the periods of the skipped frames. To overcome the large variabilities of the base layer we consider averaging three base layer frames (i.e., an I or P frame and the subsequent two missing frames of size zero) and denote the averaged base layer frame size by X^{b(3)}. We observe that with this averaging (smoothing), which is equivalent to spreading the transmission of each base layer frame over three frame periods (100 msec), the peak-to-mean ratio X^{b(3)}_max/X^{b(3)} is typically one half to two thirds of the corresponding X^b_max/X^b. This indicates that the I and P frames are relatively less variable in size compared to the B frames, which is intuitive as B frames can cover the entire range from being completely intra-coded (e.g., when a scene change occurs at that frame) to being completely inter-coded.

For the encodings with rate control, we observe from Table 6.8 in comparison with Table 6.7 that the smoothed (over three frames or a GoP) base layers are significantly less variable than the corresponding single layer encodings. This is again primarily due to the generally smaller variability of the I and P frames in the base layer. The peak bit rates of the 128 and 256 kbps base layers with GoP smoothing are typically less than 200 kbps and 300 kbps, respectively. This enables the transport of the base layer with rate control over reliable constant bit rate network "pipes", provisioned for instance using the guaranteed services paradigm [131]. We note, however, that even the rate-controlled base layers smoothed over GoPs require some over-provisioning since the peak rates are larger than the average bit rates. In more detailed studies [125], we have found that the excursions above (and below) the average bit rate are typically short-lived. Therefore, any of the common smoothing algorithms (e.g., [132, 133]) should be able to reduce the peak rates of the GoP streams to rates very close to the mean bit rate with a moderately sized smoothing buffer. In addition, we note that the TM5 rate control employed in our encodings is a basic rate control scheme which is standardized and widely used. More sophisticated and refined rate control schemes (e.g., [134]) may further reduce the variability of the traffic. In summary, we recommend to employ our traces obtained with TM5 rate control in scenarios where the video traffic is smoothed over the individual frames in a GoP (which incurs a delay of about 0.4 sec) or use some other smoothing algorithm.
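As a minimal illustration of this smoothing, the following Python sketch spreads each base layer frame over three frame periods, i.e., it computes the averaged base layer frame sizes X^{b(3)}. The assumed trace layout (one base layer frame size per frame period, with zeros at the skipped B frame positions) is an illustration, not the exact format of the trace files.

```python
import numpy as np

def smooth_base_layer(frame_sizes_byte, period=3):
    """Average the base layer trace over 'period' frame slots (an I or P frame plus
    the two zero-size slots of the skipped B frames), i.e., compute X^{b(period)}."""
    x = np.asarray(frame_sizes_byte, dtype=float)
    n = len(x) // period * period
    # non-overlapping averages over 'period' consecutive frame slots
    x_smoothed = x[:n].reshape(-1, period).mean(axis=1)
    # repeat so the smoothed trace again has one entry per frame period
    return np.repeat(x_smoothed, period)

# Peak-to-mean ratio before and after smoothing (hypothetical file name):
# x_b = np.loadtxt("base_layer_trace.dat")
# x_s = smooth_base_layer(x_b)
# print(x_b.max() / x_b.mean(), x_s.max() / x_s.mean())
```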
Now turning to the video frame PSNR quality, we observe that the average quality Q is significantly lower and the variability in the quality significantly larger compared to the single layer encoding. This severe drop in quality and increase in quality variation are due to decoding only every third frame and displaying it in place of the missing two B frames. The reduction in quality with respect to the single layer encoding is not as severe for the rate-controlled encodings, which now can allocate the full target bit rate to the I and P frames.

Enhancement Layer

The main observations from the enhancement layer traffic statistics in Table 6.9 are a very pronounced hump in the variability and a relatively large variability, even when smoothing the two B frames over three frame periods or over a GoP. For the enhancement layers corresponding to the base layers with rate control, we observe that the average enhancement layer bit rate decreases as the target bit rate of the base layer increases. This is to be expected as the higher bit rate base layer contains a more accurate encoding of the video, leaving less information to be encoded in the enhancement layer. We also observe that the enhancement layers of the rate-controlled base layers tend to have a somewhat higher variability than the medium encoding mode
Table 6.9: Overview of frame statistics of the enhancement layers of temporal scalability. Mean
Frame Size Bit Rate CoV Peak/M. Mean Peak
e
Enc. Mode High Min Mean Max Medium Min – Mean High Max Medium Min Mean Max Medium Min – Mean Low Max Low Min Mean Max 64 kbps Min Mean Max 128 kbps Min Mean Max 256 kbps Min Mean Max
e X CoVX [kbyte] 0.914 0.801 1.748 0.951 3.172 1.175 0.262 1.13 0.311 1.277 0.407 1.439 0.101 1.093 0.176 1.361 0.317 1.773 0.082 1.103 0.106 1.233 0.127 1.486 0.073 0.978 0.122 1.096 0.183 1.353 0.153 0.985 0.293 1.269 0.547 1.601 0.119 1.088 0.208 1.323 0.388 1.547 0.11 1.078 0.181 1.276 0.32 1.53
e Xmax ¯e X
4.885 9.872 15.765 15.736 20.121 23.71 14.714 25.136 37.224 12.393 20.181 28.648 9.637 17.295 24.727 9.535 16.185 26.351 13.012 21.845 31.076 14.599 24.153 35.494
¯e X T
e Xmax T
[Mbps] [Mbps] 0.219 2.368 0.42 3.92 0.761 6.138 0.063 1.05 0.075 1.484 0.098 1.738 0.024 0.531 0.042 1.035 0.076 1.778 0.02 0.31 0.026 0.5 0.031 0.594 0.018 0.226 0.03 0.511 0.044 0.829 0.037 0.678 0.07 1.078 0.131 1.801 0.029 0.616 0.05 1.059 0.093 1.804 0.026 0.561 0.043 1.037 0.077 1.807
Aggregated (3) CoV Peak/M. e(3)
CoVX
0.305 0.491 0.765 0.687 0.841 1.018 0.669 0.905 1.319 0.669 0.804 1.061 0.544 0.665 0.937 0.557 0.848 1.166 0.669 0.886 1.103 0.652 0.823 1.063
e(3) Xmax ¯e X
3.219 6.096 9.738 10.238 12.562 15.07 9.688 14.976 22.732 7.15 10.825 14.029 5.639 9.86 15.923 6.129 9.879 17.543 8.544 13.295 20.409 9.672 14.683 22.745
GoP Size CoV Peak/M. CoVYe 0.291 0.462 0.757 0.62 0.793 0.992 0.619 0.811 1.258 0.556 0.715 0.986 0.49 0.562 0.828 0.53 0.817 1.142 0.634 0.833 1.062 0.608 0.746 0.995
e Ymax ¯e Y
2.759 4.83 8.831 7.907 9.234 10.166 5.223 9.977 20.066 5.74 7.251 8.683 3.905 6.057 11.155 4.521 7.831 16.43 5.998 10.288 19.154 5.131 10.168 18.692
single layer encoding, which uses the same quantization parameters as the enhancement layer of the rate-controlled base layer.

Aggregate (Base + Enhancement Layer) Stream

Table 6.10 gives the traffic and quality statistics of the aggregate (base + enhancement layer) streams with temporal scalability. We observe that for the encodings without rate control the aggregate stream statistics are approximately equal to the corresponding statistics of the single layer encodings (in Table 6.7). Indeed, we have verified that, for encodings without rate control, extracting the I and P frames out of a single layer encoding is equivalent to the base layer of a temporal scalable encoding. Extracting the B frames out of a single layer encoding gives a stream equivalent to the enhancement layer of a temporal scalable encoding. This is to be expected since temporal scalable encoding adds essentially no overhead. The situation is fundamentally different for the temporal scalable encodings with rate control, where the rate-controlled base layer and the open-loop encoded enhancement layer are aggregated. If rate control is employed for the base layer encoding, the obtained base layer is very different from the I and P frame sequence of a single layer encoding (both when the single layer is encoded with and without rate
Table 6.10: Overview of frame statistics of the aggregate (base + enhancement layer) stream with temporal scalability. Mean b+e
Enc. Mode High Min Mean Max Medium Min – Mean High Max Medium Min Mean Max Medium Min – Mean Low Max Low Min Mean Max 64 kbps Min Mean Max 128 kbps Min Mean Max 256 kbps Min Mean Max
Frame Size Bit Rate CoV P eak/M. Mean Peak
b+e X CoVX [kbyte] 1.881 0.399 3.163 0.626 5.488 0.881 0.61 1.021 0.735 1.15 0.946 1.363 0.332 1.174 0.549 1.497 0.877 2.139 0.228 1.044 0.270 1.219 0.324 1.565 0.191 0.833 0.28 0.954 0.391 1.39 0.42 0.583 0.56 0.893 0.814 1.229 0.652 0.817 0.742 1.131 0.921 1.394 1.176 1.049 1.248 1.245 1.387 1.391
b+e Xmax ¯ b+e X
4.097 6.493 8.884 9.382 12.728 16.371 10.659 16.498 25.477 11.569 15.753 19.627 8.208 11.585 17.926 13.026 20.469 32.596 7.495 9.304 11.642 6.561 8.698 10.578
¯ b+e X T
b+e Xmax T
[Mbps] [Mbps] 0.451 3.606 0.759 4.575 1.317 6.174 0.146 1.918 0.176 2.179 0.227 2.398 0.08 1.586 0.132 2.045 0.21 2.708 0.055 0.784 0.065 1.002 0.078 1.272 0.046 0.556 0.067 0.753 0.094 1.104 0.101 1.594 0.134 2.723 0.195 4.511 0.157 1.357 0.178 1.656 0.221 2.404 0.282 2.177 0.3 2.593 0.333 2.987
CoV
GoP Frame Quality Peak/M. Mean CoV
CoVYb+e 0.284 0.443 0.709 0.538 0.646 0.802 0.445 0.550 0.77 0.455 0.552 0.749 0.395 0.449 0.673 0.359 0.422 0.473 0.176 0.228 0.319 0.076 0.109 0.168
b+e Ymax ¯ b+e Y
2.707 4.319 7.372 6.072 6.783 7.928 3.731 6.07 12.348 4.443 5.434 6.13 3.076 4.685 8.442 3.031 4.807 7.982 2.18 3.569 6.43 1.552 2.356 4.032
¯ b+e CoQV b+e Q [dB] 35.996 0.162 36.803 0.321 37.676 0.620 30.786 0.353 31.709 0.561 32.459 0.914 28.893 0.418 30.302 0.614 31.892 1.207 26.538 0.438 27.542 0.832 28.748 1.127 25.17 0.394 26.586 0.564 28.438 1.033 26.655 0.566 28.713 0.783 31.351 1.439 28.207 0.572 30.56 0.77 32.973 1.126 29.695 0.507 32.196 0.713 34.316 0.954
control). Similarly, the enhancement layer obtained from an actual temporal scalable encoding with a rate-controlled base layer is quite different from the B frame sequence of a single layer encoding, even though the enhancement layer of the temporal scalable encoding is coded with fixed quantization parameters.

6.1.5 Spatial Scalable Encoded Video

In this section we give an overview of the video traffic and quality statistics of spatial scalable encoded video, which are studied in detail in [126]. In the considered spatial scalable encoding the base layer provides the video in the QCIF format. Adding the enhancement layer to the base layer gives the video in the CIF format. Table 6.11 gives an overview of the videos that have been studied for spatial scalability.

Base Layer

Table 6.12 gives an overview of the frame size and quality statistics of the base layers of the spatial scalable encodings. Focusing for now on the encodings without rate control, we observe again a hump in the coefficients of variation and peak-to-mean ratios of both the frame sizes and (somewhat less pronounced) the GoP sizes. Comparing these base layers, which provide the video in the QCIF format, with the single layer QCIF video in Table 6.7, we observe that the frame size, bit rate, and GoP size statistics are roughly the same. The observed differences are primarily due to considering a different set of videos in the spatial scalability study. A comparison for the individual videos [135] reveals that the traffic statistics of the QCIF base layer are typically almost identical to the corresponding statistics of the single layer QCIF encodings.

Next, consider the frame qualities of the base layer in Table 6.12. These qualities are obtained by up-sampling the QCIF base layer frames to the CIF format and comparing these CIF frames with the original CIF frames. We observe that the PSNR qualities of these up-sampled base layer frames are quite low compared to the single layer QCIF frames; in fact, the mean frame qualities are quite similar to the PSNR qualities of the temporal base layer.

The traffic characteristics of the base layers with rate control are generally similar to the corresponding traffic statistics of the single layer encodings. In particular, the rate-controlled base layers exhibit quite significant traffic variability even at the GoP level (and in particular for small bit rates), which may require substantial over-provisioning or smoothing to reliably transmit the base layer. This is in contrast to the base layer of the temporal scalable encoding, which exhibited smaller traffic variability at the GoP level. The primary reason for this phenomenon is that, as noted in Section 6.1.4, the temporal base layer dedicates the entire target bit rate to the less variable (when viewed at the GoP level) I and P frames.
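The up-sampling based quality evaluation described above can be sketched as follows. This is a minimal illustration assuming 8-bit luminance frames stored as numpy arrays and simple pixel replication for the up-sampling; it is not the exact filtering used to generate the reported trace qualities.

```python
import numpy as np

def luminance_psnr(reference, reconstructed, max_value=255.0):
    """PSNR (dB) of the luminance component between two equally sized frames."""
    ref = np.asarray(reference, dtype=float)
    rec = np.asarray(reconstructed, dtype=float)
    mse = np.mean((ref - rec) ** 2)
    if mse == 0:
        return float("inf")
    return 10.0 * np.log10(max_value ** 2 / mse)

def base_layer_quality(qcif_frame, cif_frame):
    """Up-sample a QCIF base layer frame (by pixel replication) to CIF size and
    compare it with the original CIF frame, as done for the base layer qualities."""
    upsampled = np.repeat(np.repeat(qcif_frame, 2, axis=0), 2, axis=1)
    return luminance_psnr(cif_frame, upsampled)
```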
Table 6.11: Overview of studied video sequences in CIF format.

Class                      Video                      Genre                      Quantization Scale Settings (from Table 5.1)
Movies                     Silence Of The Lambs       Drama                      Low, Medium–Low, Medium, Medium–High, High
                           The Terminator I           Action                     Low, Medium–Low, Medium, Medium–High, High
Sports                     Snowboarding               Snowboarding Competition   Low, Medium–Low, Medium, Medium–High, High
Lecture and Surveillance   Lecture Martin Reisslein   Lecture                    Low, Medium–Low, Medium, Medium–High, High
                           Parking Lot Cam            Surveillance               Low, Medium–Low, Medium, Medium–High, High
Table 6.12: Overview of frame statistics for the base layer of spatial scalability (CIF). Mean
Frame Size CoV Peak/M.
b
b X CoVX Enc. Mode [kbyte] High Min 1.868 0.463 Mean 3.589 0.629 Max 5.962 0.831 Medium Min 0.494 0.782 – Mean 1.089 1.044 High Max 1.957 1.390 Medium Min 0.338 1.216 Mean 0.687 1.541 Max 1.196 2.183 Medium Min 0.233 0.859 – Mean 0.391 1.139 Low Max 0.612 1.615 Low Min 0.201 0.786 Mean 0.321 1.045 Max 0.461 1.423 64 kbps Min 0.267 0.773 Mean 0.340 1.224 Max 0.446 2.107 128 kbps Min 0.534 1.042 Mean 0.534 1.308 Max 0.535 1.772 256 kbps Min 1.067 0.890 Mean 1.067 1.122 Max 1.067 1.494
b Xmax ¯b X
3.167 5.632 8.849 4.523 9.473 15.602 6.620 13.798 22.825 5.702 10.251 15.354 5.922 9.278 13.817 5.888 15.823 32.089 11.533 23.467 46.579 9.256 11.051 14.739
Bit Rate Mean Peak ¯b X T
Xmax T
[Mbps] [Mbps] 0.448 3.396 0.861 4.186 1.431 5.468 0.119 1.670 0.262 1.999 0.470 2.486 0.081 1.608 0.165 1.852 0.287 2.034 0.056 0.708 0.094 0.830 0.147 0.917 0.048 0.553 0.077 0.646 0.111 0.717 0.064 0.543 0.082 1.160 0.107 2.088 0.128 1.478 0.128 3.009 0.128 5.977 0.256 2.371 0.256 2.831 0.256 3.775
GoP Size CoV Peak/M. CoVYb 0.245 0.421 0.658 0.322 0.563 0.922 0.299 0.530 0.819 0.252 0.470 0.638 0.212 0.417 0.551 0.144 0.371 0.545 0.039 0.217 0.515 0.033 0.049 0.081
b Ymax ¯b Y
2.348 3.512 6.820 3.197 5.041 11.549 3.279 4.966 11.032 2.989 4.496 8.880 2.753 4.006 7.032 2.704 4.146 7.036 1.427 3.741 4.754 1.300 1.607 2.410
Frame Quality Mean CoV ¯ b CoQV b Q [dB] 19.465 0.883 23.557 1.055 27.858 1.258 19.414 0.890 23.383 1.063 27.507 1.268 19.385 0.895 23.301 1.067 27.386 1.274 19.105 0.914 22.829 1.085 26.678 1.301 18.940 0.924 22.591 1.093 26.384 1.313 18.902 0.925 22.686 1.086 26.659 1.315 18.959 0.904 23.060 1.074 27.360 1.309 19.310 0.891 23.367 1.063 27.641 1.279
Enhancement Layer

From the summary of the statistics of the enhancement layer of the spatial scalable encodings in Table 6.13 we first observe for the encodings with fixed quantization scales that the mean frame sizes and bit rates of the enhancement layer are roughly three times larger than the corresponding base layer frame sizes and bit rates. This is to be expected as the enhancement layer stream increases the frame format from one quarter of the CIF format to the full CIF format. Next, we observe that the coefficient of variation of the frame sizes and the GoP sizes of the enhancement layer exhibit the hump behavior. The peak-to-mean ratio of the frame sizes, on the other hand, only increases with increasing quantization scales (i.e., decreasing video quality) and thus does not exhibit the hump behavior. This effect is the subject of ongoing studies. Another noteworthy observation is that the GoP size variability of the enhancement layer is significantly larger than for the base layer (or the single layer QCIF video), especially for larger quantization scales. This indicates that the enhancement layer is typically more difficult to accommodate in packet switched networks.

Next, we turn to the enhancement layers corresponding to the base layers encoded with rate control. These enhancement layers are encoded with the fixed quantization scales corresponding to the medium encoding mode in Table 5.1. Similar to the encodings with temporal scalability we observe that
Table 6.13: Overview of frame statistics of the enhancement layers of spatial scalability. Mean e
Enc. Mode High
Medium – High Medium
Medium – Low Low
64 kbps
128 kbps
256 kbps
Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max
X [kbyte] 5.765 10.451 17.793 1.386 2.869 5.280 0.693 1.480 2.698 0.464 0.931 1.559 0.373 0.729 1.152 0.776 1.679 2.981 0.704 1.602 2.965 0.676 1.484 2.797
Frame Size CoV Peak/M. e CoVX
0.378 0.506 0.757 0.639 0.833 1.247 0.793 1.001 1.423 0.772 0.919 1.218 0.728 0.881 1.103 0.822 1.037 1.369 0.831 1.041 1.506 0.815 1.046 1.556
e Xmax ¯e X
3.928 5.965 8.654 6.492 12.056 16.844 9.354 17.652 25.621 9.770 20.141 29.009 11.456 21.918 31.906 8.661 15.589 22.801 8.678 16.945 25.145 9.142 18.077 27.201
Bit Rate Mean Peak ¯e X T
[Mbps] 1.384 2.508 4.270 0.333 0.689 1.267 0.166 0.355 0.647 0.111 0.223 0.374 0.090 0.175 0.276 0.186 0.403 0.716 0.169 0.385 0.712 0.162 0.356 0.671
e Xmax T
[Mbps] 10.147 13.210 16.773 5.601 6.891 8.227 4.218 5.114 6.056 3.233 3.770 4.539 2.859 3.294 3.910 4.245 5.182 6.197 4.226 5.173 6.175 4.204 5.144 6.137
GoP Size CoV Peak/M. CoVYe 0.235 0.402 0.658 0.319 0.596 1.001 0.358 0.671 1.085 0.300 0.621 0.916 0.273 0.589 0.819 0.374 0.649 1.068 0.379 0.698 1.201 0.355 0.714 1.197
e Ymax ¯e Y
2.844 3.555 6.182 3.330 5.461 10.585 3.647 6.436 12.425 3.951 6.304 10.941 3.958 6.228 9.969 3.648 6.211 12.221 3.952 6.736 13.949 4.249 7.161 15.102
the average enhancement layer traffic decreases as the target bit rate for the base layer increases. We also observe that the variability of the enhancement layers corresponding to the rate-controlled base layers is slightly higher than the variability of the enhancement layer of the encoding with the fixed medium encoding mode quantization scales.

Aggregate (Base + Enhancement Layer) Stream

In Table 6.14 we summarize the traffic and quality statistics of the aggregate spatial scalable stream, which gives the video in the CIF format. For comparison we provide in Table 6.15 the traffic and quality statistics of single layer CIF format encodings of the videos. For the encodings without rate control, we observe that the aggregate spatial scalable video tends to have larger average frame and GoP sizes and bit rates as well as lower PSNR quality. This is primarily due to the overhead of spatial scalable encodings. In a more detailed study [126], we determined this overhead to be around 20% by comparing the bit rates of aggregate spatial scalable and single-layer encodings with essentially the same average PSNR quality. Aside from this overhead the statistics of the aggregate spatial scalable encodings and the corresponding single layer CIF encodings are quite similar. Note, however, that the frame sizes and bit rates of the spatial scalable encodings with rate control are significantly larger than the corresponding frame sizes and bit rates of the single layer CIF encodings.
Table 6.14: Overview of frame statistics of the aggregate (base + enhancement layer) stream with spatial scalability (CIF). Mean b+e
Enc. Mode High Min Mean Max Medium Min – Mean High Max Medium Min Mean Max Medium Min – Mean Low Max Low Min Mean Max 64 kbps Min Mean Max 128 kbps Min Mean Max 256 kbps Min Mean Max
Frame Size Bit Rate CoV Peak/M. Mean Peak
b+e X CoVX [kbyte] 7.633 0.394 14.040 0.509 23.754 0.730 1.880 0.653 3.958 0.836 7.237 1.165 1.058 0.837 2.167 1.068 3.893 1.370 0.698 0.756 1.322 0.903 2.171 1.045 0.575 0.728 1.051 0.845 1.613 0.913 1.043 0.805 2.020 1.012 3.428 1.338 1.238 0.773 2.136 0.957 3.500 1.263 1.743 0.704 2.551 0.846 3.864 1.069
b+e Xmax ¯ b+e X
3.681 5.403 8.573 5.626 10.041 16.070 8.134 13.540 22.175 8.340 14.732 19.949 8.906 15.610 21.171 8.139 12.892 17.688 8.260 11.802 15.463 8.300 9.900 11.251
¯ b+e X T
b+e Xmax T
[Mbps] 1.832 3.370 5.701 0.451 0.950 1.737 0.254 0.520 0.934 0.167 0.317 0.521 0.138 0.252 0.387 0.250 0.485 0.823 0.297 0.513 0.840 0.418 0.612 0.927
[Mbps] 10.585 16.286 20.983 5.986 7.954 9.771 4.264 5.911 7.601 3.303 4.078 4.742 2.920 3.507 4.173 4.308 5.448 6.696 4.311 5.504 6.937 4.283 5.921 8.434
CoV
GoP Frame Quality Peak/M. Mean CoV
CoVYb+e 0.235 0.404 0.656 0.318 0.582 0.975 0.330 0.618 0.992 0.281 0.569 0.823 0.248 0.529 0.714 0.291 0.577 0.875 0.217 0.507 0.712 0.140 0.381 0.481
b+e Ymax ¯ b+e Y
2.747 3.507 6.338 3.196 5.287 10.839 3.465 5.911 11.981 3.489 5.704 10.299 3.483 5.426 8.990 3.494 5.614 10.565 3.243 5.163 9.236 2.407 4.217 6.710
¯ b+e CoQV b+e Q [dB] 30.679 0.913 35.994 1.170 37.846 1.307 30.553 1.105 32.493 1.174 33.990 1.278 27.840 1.072 30.350 1.155 32.398 1.268 25.216 1.058 28.116 1.151 30.571 1.280 24.007 1.050 27.080 1.149 29.744 1.286 27.752 1.071 30.231 1.157 32.299 1.273 27.762 1.057 30.375 1.149 32.645 1.271 27.868 1.049 30.580 1.143 32.988 1.261
Table 6.15: Overview of frame statistics of the single layer stream (CIF). Enc. Mode High
Medium – High Medium
Medium – Low Low
64 kbps
128 kbps
256 kbps
Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max Min Mean Max
Frame Size Bit Rate GoP Size Mean CoV Peak/M. Mean Peak CoV Peak/M. ¯ X Xmax Ymax X max CoV X CoVX Y ¯ ¯ T T X Y [kbyte] [Mbps] [Mbps] 6.419 0.402 4.150 1.541 9.649 0.221 2.759 11.289 0.542 5.727 2.709 14.099 0.388 3.629 17.832 0.742 8.271 4.280 17.760 0.620 6.290 1.596 0.710 6.422 0.383 5.434 0.311 3.664 3.329 0.943 10.379 0.799 7.149 0.561 5.401 5.546 1.221 14.506 1.331 8.548 0.914 10.163 1.074 0.970 8.888 0.258 4.458 0.291 3.741 2.172 1.296 13.092 0.521 6.012 0.550 5.502 3.411 1.915 18.782 0.819 7.277 0.835 9.653 0.706 0.790 8.647 0.170 3.079 0.252 3.580 1.382 0.975 12.628 0.332 3.846 0.498 5.112 1.900 1.336 18.159 0.456 4.618 0.651 7.654 0.657 0.733 9.193 0.158 2.733 0.215 3.346 1.201 0.881 12.408 0.288 3.364 0.446 4.642 1.530 1.156 17.327 0.367 4.078 0.569 6.333 0.653 0.720 9.221 0.157 2.675 0.210 3.322 1.184 0.865 12.297 0.284 3.294 0.440 4.584 1.497 1.126 17.072 0.359 3.968 0.562 6.208 0.653 0.720 9.221 0.157 2.674 0.211 3.322 1.184 0.865 12.295 0.284 3.294 0.440 4.584 1.497 1.126 17.065 0.359 3.968 0.562 6.207 1.067 0.722 9.618 0.256 3.457 0.101 2.280 1.303 1.024 20.731 0.313 6.095 0.401 5.098 1.497 1.741 49.493 0.359 13.131 0.562 9.908
Frame Quality Mean CoV ¯ CoQV Q [dB] 37.025 1.100 37.654 1.189 38.303 1.232 30.989 0.935 32.867 1.120 34.337 1.243 29.305 0.925 31.585 1.087 33.423 1.176 25.975 0.978 28.896 1.112 31.384 1.248 24.849 1.002 27.965 1.116 30.677 1.255 24.708 1.002 27.846 1.116 30.591 1.257 24.708 1.002 27.847 1.116 30.595 1.257 24.711 1.001 28.642 1.093 31.626 1.256
This is because the fixed target bit rate is allocated to the QCIF sized base layer in the spatial scalable encodings whereas it is allocated to the full CIF sized video in the single layer encodings.
6.2 Video Trace Statistics for H.264 Video Trace Files

The following results are only intended to give a first impression of the capabilities of the H.264 video coding standard. Our discussion focuses on the video test sequence Paris. A screenshot giving an impression of the content (an ongoing discussion with some movements of the persons and some objects) of the Paris sequence is given in Figure 6.12. Results from evaluations of additional video test sequences are provided on our web page [40]. Table 6.16 provides an overview of the basic statistics of the Paris traces for different quantization scale settings q. We also evaluated the traces at an aggregation level of a = 12 frames, i.e., at the GoP level, see Table 6.16 as well. This fixed-length moving average analysis gives a more stationary impression of the video trace since the frame type differences are smoothed out.
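The per frame type and GoP level statistics of the kind reported in Table 6.16 can be computed along the following lines. This minimal Python sketch assumes the trace is already available as a list of (frame type, frame size in byte) pairs and a 12-frame GoP, which may differ from the exact layout of the trace files.

```python
import numpy as np

def trace_statistics(frames, gop_len=12):
    """frames: sequence of (frame_type, size_byte) pairs, e.g. ('I', 94414), ('B', 47621), ...
    Returns single-frame and GoP-level statistics similar to Table 6.16."""
    sizes = np.array([s for _, s in frames], dtype=float)
    types = [t for t, _ in frames]

    stats = {
        "X_min": sizes.min(), "X_max": sizes.max(), "X_mean": sizes.mean(),
        "X_var": sizes.var(), "CoV": sizes.std() / sizes.mean(),
        "peak_to_mean": sizes.max() / sizes.mean(),
    }
    # mean frame size per frame type (I, P, B)
    for ft in ("I", "P", "B"):
        sel = sizes[[i for i, t in enumerate(types) if t == ft]]
        stats["X_mean_" + ft] = sel.mean() if len(sel) else float("nan")

    # GoP level: sum the sizes of gop_len consecutive frames
    n = len(sizes) // gop_len * gop_len
    gop = sizes[:n].reshape(-1, gop_len).sum(axis=1)
    stats.update({"X_min_GoP": gop.min(), "X_max_GoP": gop.max(),
                  "X_mean_GoP": gop.mean(), "CoV_GoP": gop.std() / gop.mean(),
                  "peak_to_mean_GoP": gop.max() / gop.mean()})
    return stats
```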
Fig. 6.12: Screenshot of the Paris video sequence in CIF format.
Table 6.16: Single frame and GoP statistics for different quantization scales q for the Paris video sequence.

q                    1             5             10            15            20           25           30           35          40          45         51
Xmin [byte]          43390         29037         12061         3930          1288         418          119          67          29          25         16
Xmax [byte]          95525         81139         62578         46474         33824        23919        15746        9840        5763        3214       1448
X̄ [byte]             54345.28      39572.05      22062.51      11395.94      6331.72      3827.69      2145.86      1182.10     647.40      348.34     161.30
X̄ I-frame [byte]     94414.24      80066.83      61447.85      45408.61      32699.18     22964.62     14945.32     9248.70     5445.10     3035.90    1377.35
X̄ P-frame [byte]     58793.27      43068.29      24494.70      11899.60      5928.88      3337.22      1748.51      854.76      421.38      208.58     64.75
X̄ B-frame [byte]     47621.87      33152.20      16182.01      6916.99       3157.31      1598.14      680.67       287.57      127.13      61.83      44.16
S²X                  174399635.34  172302917.00  158433906.59  112674047.03  65962417.27  34444608.41  15328149.06  6054760.37  2134816.65  668818.27  136146.83
CoV                  0.24          0.33          0.57          0.93          1.28         1.53         1.82         2.08        2.26        2.35       2.29
Peak to mean         1.76          2.05          2.84          4.08          5.34         6.25         7.34         8.32        8.90        9.23       8.98
Xmin,GoP [byte]      93578         79234         60610         44535         31944        22312        14461        8896        5273        2955       1348
Xmax,GoP [byte]      721737        539976        324552        179212        101868       61343        33712        17741       9496        5035       2362
X̄GoP [byte]          650782.23     473844.23     264149.30     136385.82     75734.05     45763.58     25640.10     14112.82    7725.02     4154.18    1924.11
S²X,GoP              512264567.35  432694471.18  366798439.65  199274339.93  74414914.80  26030350.12  6610073.94   1420858.69  319171.88   60188.64   10792.15
CoVGoP               0.03          0.04          0.07          0.10          0.11         0.11         0.10         0.08        0.07        0.06       0.05
Peak to mean (GoP)   1.11          1.14          1.23          1.31          1.35         1.34         1.31         1.26        1.23        1.21       1.23
Fig. 6.13: Impact of changing background dynamics on encoded video frame sizes for H.264 encoded video sequence Carphone with quantization scale q = 25 and aggregation level a = 12.
The frame sizes reflect the video content and its dynamic behavior, as with any block- and motion vector-based encoding process. The frame sizes are generally larger if the movie content is more dynamic and richer in texture. As can be seen in the frame traces of the Carphone video sequence, the frame size rises around frame index n = 150. This is due to a shift of the landscape in the background, visible through the car window (see Figure 6.13). Before frame index n = 150, the view is a clear sky, only occasionally interrupted by moving objects (e.g., lanterns, street signs); after frame index n = 150, the view is a forest with a rich texture. Figure 6.13 gives an impression of the changing backgrounds and the resulting frame sizes for a GoP aggregation level of a = 12 and a quantization scale q = 25 for the Carphone video sequence. In general, the encoded frame sizes are larger when smaller quantization parameters are used (which in turn gives a higher video quality). These factors are interdependent, i.e., higher dynamics paired with finer quantization results in larger encoded frame sizes, and vice versa.

We illustrate the frame size traces for quantization scales of q = 1, 15, and 31 in Figure 6.14 for the Paris video sequence. We observe that the plots clearly illustrate the large differences in size between the different encoded video frame types I, P, and B. We also note that the video frame sizes decrease very quickly as the applied quantization scale q is increased. We note that the applied quantization scales from q = 1 to q = 31, which we illustrate here,
represent only a part of the quantization scale range that is allowed in the H.264 standard, which is q = 1, . . . , 51. The GoP-smoothed traces in Figure 6.14 give a clearer impression of the video content and the resulting video traffic dynamics. We observe that the plots do not indicate any large or dynamic changes with increasing frame index n. This is due to the test sequences used, which typically have only little dynamic change in their content and depict individual scenes or shots.

Fig. 6.14: Frame sizes Xn and GoP level smoothed frame sizes (a = 12) as a function of the frame index n for H.264 encoded test sequence Paris with different quantization scales q = 1, 15, 31.

The study of the impact of dynamic
changes of the video content on the video traffic requires longer videos with a typical movie length.

Fig. 6.15: Frame size histograms for H.264 encoded test sequence Paris with different quantization scales q = 1, 15, 31.

The distribution of the frame sizes gives clues about the requirements for stochastic modeling of the encoded video traffic. Frame size histograms or probability distributions allow us to make observations concerning the variability of the encoded data and the requirements for the real-time transport of the data over a combination of wired and wireless networks. In the following we present the probability density function p as a function of the frame size for the Paris sequence in Figure 6.15. We observe for all the different quality levels a large spread of the frame sizes, which additionally indicates a pronounced tail. The overall distribution may very roughly be seen as Gaussian, although this should be viewed with caution due to the limited length of the evaluated test sequence. We observe that the distribution spreads out more for smaller quantization parameters. This is expected when comparing the differences in the frame sizes of the different frame types (which normally tend to be large for I-frames, intermediate for P-frames, and small for B-frames). With lower fidelity (i.e., higher quantization), the differentiation between these types regarding the frame size decreases due to the more forcefully applied quantization. The
viewable result is characterized by a total loss of clear differences between objects, colors, and so forth. Figure 6.16 gives an overview of the quantization effects on the quality of the encoded video. (We note that the individual images were scaled down to fit on a single page.) We observe that the larger quantization scales result in significantly visible loss of quality and also result in a reduction of the PSNR values used as the objective video quality metric.

We illustrate the autocorrelation function for individual video frames for the Paris video sequence in Figure 6.17. The autocorrelation function for the single frame aggregation level shows the similarity within a GoP, whereas higher aggregation levels give an indication of the long-term self-similarity. We observe from Figure 6.17 that there are large spikes spaced 12 frames apart, which are superimposed on a slowly decaying curve. These are due to the repetitive GoPs, which contain 12 frames each. Thus, for a lag of k = 12 frames, I frames correlate with I frames, P frames with P frames, and B frames with B frames. The intermediate spikes that are spaced three frames apart are due to the correlations between I and P frames and of I or P frames with B frames. We observe that the intermediate spikes decrease with the fidelity of the encoded bit stream. This appears to be due to the wider spread of the frame size distribution for larger quantization parameters. We additionally illustrate the autocorrelation coefficient for the GoP-level aggregation of a = 12 frames in Figure 6.17 for the Paris sequence. We observe from Figure 6.17 that the GoP-based autocorrelation tends to fall off slower than an exponential, suggesting the presence of long-range dependencies. We additionally observe that the autocorrelation coefficient drops faster and lower for higher quantization scales q.

The Hurst parameter, or self-similarity parameter, H, is a key measure of self-similarity. A Hurst parameter of H = 0.5 indicates the absence of self-similarity, whereas H approaching 1 indicates persistence, i.e., the presence of long-range dependence. The H parameter can be estimated from a graphical interpolation of the R/S plot. We illustrate the R/S plots for the Paris sequence on the frame level in Figure 6.18. For the single frame level, we observe that the derived Hurst parameters are all below the 0.5 level, not indicating a long-range dependency within the generated video traffic. In Figure 6.18, we additionally illustrate the R/S plots for the Paris sequence on the GoP aggregation level of a = 12 frames. We observe that, in contrast to the results on the frame level, on a GoP basis the Hurst parameters H stay well above 0.5, indicating the presence of long-range dependency within the generated video traffic. We note, however, that due to the limited number of samples for the calculation of the Hurst parameter, this has to be seen with some caution. We note that we applied the 4σ-test [136] to eliminate all outlying residuals for a better estimation of the Hurst parameter.

We illustrate the variance time plots for the Paris video sequence in Figure 6.19. If no long-range dependency is present, the slope of the resulting variance time plot is −1. For slopes larger than −1, a dependency is present. For simple reference, we plot a reference line with a slope of −1 in the figures.
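The variance time analysis sketched above can be reproduced along the following lines. This is a minimal Python sketch, assuming the frame sizes are available as a numpy array; it is not the exact procedure (e.g., it omits the 4σ outlier test) used to generate the plots.

```python
import numpy as np

def variance_time_hurst(frame_sizes, agg_levels=(10, 20, 50, 100, 200)):
    """Estimate the Hurst parameter from the variance time plot:
    the variance of the aggregated series scales as a^(2H - 2), so the slope
    beta of log10(var) over log10(a) gives H = 1 + beta / 2."""
    x = np.asarray(frame_sizes, dtype=float)
    log_a, log_var = [], []
    for a in agg_levels:
        n = len(x) // a * a
        agg = x[:n].reshape(-1, a).mean(axis=1)   # aggregated (block-averaged) series
        log_a.append(np.log10(a))
        log_var.append(np.log10(agg.var()))
    beta = np.polyfit(log_a, log_var, 1)[0]       # slope of the variance time plot
    return 1.0 + beta / 2.0                       # slope -1 gives H = 0.5 (no dependency)
```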
Fig. 6.16: Quantization effect for H.264 encoded test sequence Paris with different quantization scales: (a) q = 40, PSNR for this frame Q = 27.43 dB; (b) q = 45, PSNR for this frame Q = 24.29 dB; (c) q = 51, PSNR for this frame Q = 20.39 dB.
Fig. 6.17: Autocorrelation coefficients (ACC) for individual video frames and for GoP level aggregation (a = 12) for H.264 encoded test sequence Paris with different quantization scales q = 1, 15, 31.

Our plots in Figure 6.19 indicate a certain degree of long term dependency since the estimated slope is larger than −1. We illustrate the periodogram plots for the Paris video sequence in Figure 6.20 for the single frame and 3 frame (a = 3) aggregation levels. We observe that for the single frame aggregation level, the estimated Hurst parameters are above those obtained from the R/S plots. We also note that for an aggregation level of a = 3 frames, the estimated Hurst parameters turn
negative (albeit in the correct absolute range), which has to be seen with caution. Overall we note from the periodogram estimation of the Hurst parameter that the produced video traffic exhibits a long-range dependency, which increases as the quantization scale q increases. This can be explained by the loss of information due to the encoding process and the single-scene source video.

Fig. 6.18: Single frame and GoP aggregation level (a = 12) R/S plots for H.264 encoded test sequence Paris with different quantization scales q = 1, 15, 31 (frame level: H = 0.440, 0.338, 0.232; GoP level: H = 0.823, 0.773, 0.709, respectively).
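A periodogram based estimate of the Hurst parameter, as used for Figure 6.20, can be sketched as follows. This minimal Python version fits the low-frequency part of the periodogram and illustrates the principle rather than the exact estimator used for the reported values.

```python
import numpy as np

def periodogram_hurst(frame_sizes, low_freq_fraction=0.1):
    """Estimate H from the slope of log10(I(lambda_k)) over log10(lambda_k):
    for long-range dependent traffic I(lambda) ~ lambda^(1 - 2H) near the origin,
    so a fitted slope s gives H = (1 - s) / 2."""
    x = np.asarray(frame_sizes, dtype=float)
    x = x - x.mean()
    n = len(x)
    spec = np.abs(np.fft.rfft(x)) ** 2 / n           # periodogram I(lambda_k)
    freqs = np.arange(1, len(spec)) * 2 * np.pi / n  # angular frequencies lambda_k > 0
    spec = spec[1:]
    m = max(2, int(low_freq_fraction * len(freqs)))  # use only the lowest frequencies
    slope = np.polyfit(np.log10(freqs[:m]), np.log10(spec[:m]), 1)[0]
    return (1.0 - slope) / 2.0
```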
Fig. 6.19: Variance time plots for H.264 encoded test sequence Paris with different quantization scales q = 1, 15, 31 (estimated H = 0.823, 0.773, 0.709, respectively).
6.3 Video Trace Statistics for Pre-Encoded Video

For our measurements we collected over 100 pre-encoded sequences on the web. We focused on different actual movies and TV series. A subset of all investigated sequences is given in Tables 6.17 and 6.18. The video sequences given in Table 6.17 are used for the statistical evaluation, while the sequences in Table 6.18 are listed because of specific characteristics found. The tables give the sequence name and video and audio information. The video information includes the codec type, the format, frame rate, and data rate. We found a large variety of video codecs, such as DX50, DIV4, DIV3, XVID, RV20, RV30, DIVX, and MPEG1. The video format ranges from very small (160x120) to large (640x352). The frame rate ranges from 23.98 to 29.97 frames/sec. In the following, we present results obtained for the movie Stealing Harvard and for episode 20 from season one of Friends in greater detail. The encoding details for Stealing Harvard and Friends 1x20 can be found in Tables 6.17 and 6.18, respectively. We illustrate the frame size traces for both videos in Figure 6.21. We observe that both sequences exhibit short periods of high video traffic, i.e., periods in which spikes are clearly visible in the plots presented in Figure 6.21. To match the MPEG-4 encodings presented previously,
Fig. 6.20: Single frame and 3 frame aggregation level (a = 3) periodogram plots for H.264 encoded test sequence Paris with different quantization scales q = 1, 15, 31.

we additionally evaluate the aggregation of multiple frames, as given in Figure 6.22 for an aggregation level of a = 12 frames. We observe that with aggregation over multiple frames the spikes in the video traffic are reduced, but the video traffic still exhibits a very high variability. In addition, we observe that for some video encodings, the available video encoding tools may provide simple optimizations to match an overall size of the video traffic, either in a single-pass or a two-pass approach. While some content dependency cannot be
Table 6.17: Excerpt of investigated movie details as obtained from the MPlayer output.

Movie Name           Codec  Format    Frame Rate  Video Data Rate  Audio Data Rate
                            [pixel]   [1/s]       [kbit/s]         [kbit/s]
Bully 1              DX50   576x432   25.00       1263.8           128.0
Bully 3              DX50   512x384   25.00        988.6           128.0
Hackers              DIV4   720x576   23.98        794.8            96.0
LOTR II (CD1)        XVID   640x272   23.98        966.0            80.0
LOTR II (CD2)        XVID   640x272   23.98        965.2            80.0
Oceans 11            DIV3   544x224   23.98        707.7           128.0
Robin Hood Disney    DIV3   512x384   23.98       1028.9            96.0
Serving Sara         XVID   560x304   23.98        831.2           128.0
Stealing Harvard     XVID   640x352   23.98        989.1           128.0
Final Fantasy        DIV3   576x320   23.98        823.9           128.0
Tomb Raider          DIV3   576x240   23.98        820.3           128.0
Roughnecks           DIV3   352x272   29.97        849.1           128.0
Kiss Of The Dragon   DIV3   640x272   23.98        846.6           128.0
Table 6.18: Excerpt of investigated TV series details as obtained from the MPlayer output.

Series Name    Codec  Format    Frame Rate  Video Data Rate  Audio Data Rate
                      [pixel]   [1/s]       [kbit/s]         [kbit/s]
Friends 1x20   DIV3   400x288   25.00        751.6           128.0
Friends 4x03   DIV3   512x384   25.00       1015.1           128.0
Friends 4x04   DIV3   640x480   25.00        747.4            64.1
Friends 9x13   DIV3   320x240   29.97        498.2           128.0
Friends 9x14   DIVX   352x240   29.97        589.7            56.0
Dilbert 1x06   MPEG1  160x120   29.97        192.0            64.0
Dilbert 2x03   DIV3   220x150   29.99        129.4            32.0
Dilbert 2x04   RV30   220x148   30.00        132.0            32.0
Dilbert 2x05   RV20   320x240   19.00        179.0            44.1
Fig. 6.21: Frame sizes (in bytes) for the pre-encoded movie Stealing Harvard and the TV series episode from Friends (season one, episode 20).

removed with these encoding approaches, other variations should be smoothed out, similar to the effect of the TM5 rate control algorithm employed in our MPEG-4 encodings. Thus, we can conclude that most of the videos evaluated are of VBR nature, with some general limits for the quantization scale settings to allow a match
Fig. 6.22: Aggregated frame sizes (aggregation level a = 12) for the pre-encoded movie Stealing Harvard and the TV series episode from Friends (season one, episode 20).

to a given size (one-pass approach) or an additional optimization (two-pass approach). In Table 6.19 we give an overview of the frame statistics for the evaluated pre-encoded movies. The table presents the mean frame size X̄, the coefficient of variation CoVX, and the peak-to-mean ratio of the frame sizes. Furthermore, the mean and peak bit rates are given. We note that the data rates given in Table 6.17 are based on the output of the MPlayer tool, while the data rates given below are an output of our evaluation tool. We observe that the video traffic obtained from the presented movie encodings is highly variable, with peak-to-mean ratios of the frame sizes in the range from approximately 9 to about 25 for most of the video streams. In our video trace studies of MPEG-4 encodings, we typically found that the peak-to-mean ratios of the frame sizes were higher for videos encoded with rate control for small target bit rates and lower for videos encoded with higher target bit rates.
Table 6.19: Overview of frame statistics of traces.

Evaluated Movie      Mean X̄   CoVX     Peak/Mean  Mean Bit Rate  Peak Bit Rate
                     [Byte]   SX/X̄     Xmax/X̄     X̄/T [Mbit/s]   Xmax/T [Mbit/s]
Bully 1              6319     1.27     22.30      2.02           45.1
Bully 3              4943     1.24     19.38      1.58           30.66
Hackers              4150     0.63     25.18      1.38           34.86
LOTR II (CD1)        5037     0.6       9.02      1.68           15.16
LOTR II (CD2)        5032     0.60      9.65      1.68           16.2
Oceans 11            3695     0.76     11.78      1.23           14.53
Robin Hood Disney    5364     0.75     14.99      1.79           26.82
Serving Sara         4334     0.67     13.46      1.45           19.46
Stealing Harvard     5157     0.6       9.15      1.72           15.75
Final Fantasy        4295     0.75     11.6       1.43           16.63
Tomb Raider          4290     0.76     12.92      1.43           18.5
Roughnecks           3541     0.57     12.67      0.95           11.98
Kiss Of The Dragon   4414     0.62      9.56      1.47           14.08
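The frame statistics in Table 6.19 follow directly from the frame size trace. The sketch below (a minimal example under the assumption that the trace is available as a numpy array of frame sizes in bytes and that the frame rate is known) computes the mean frame size, the coefficient of variation, the peak-to-mean ratio, and the mean and peak bit rates as defined in Chapter 4.

```python
import numpy as np

def frame_statistics(frame_sizes_byte, frame_rate_hz):
    """Compute the basic frame size and bit rate statistics of a trace."""
    x = np.asarray(frame_sizes_byte, dtype=float)
    mean = x.mean()
    cov = x.std(ddof=0) / mean          # coefficient of variation S_X / X_bar
    peak_to_mean = x.max() / mean       # X_max / X_bar
    t = 1.0 / frame_rate_hz             # frame period T in seconds
    mean_bit_rate = 8.0 * mean / t      # X_bar / T, in bit/s
    peak_bit_rate = 8.0 * x.max() / t   # X_max / T, in bit/s
    return {
        "mean_byte": mean,
        "cov": cov,
        "peak_to_mean": peak_to_mean,
        "mean_bit_rate_mbps": mean_bit_rate / 1e6,
        "peak_bit_rate_mbps": peak_bit_rate / 1e6,
    }

# Hypothetical usage for a pre-encoded trace at 23.98 frames/s:
# stats = frame_statistics(np.loadtxt("stealing_harvard_frame_sizes.txt"), 23.98)
```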
Fig. 6.23: Frame size distribution for the pre-encoded movie Stealing Harvard and the TV series episode from Friends (season one, episode 20).
Fig. 6.24: Frame size autocorrelation coefficients (ACC) for the pre-encoded movie Stealing Harvard and the TV series episode from Friends (season one, episode 20).

We also note that, similarly, for higher-quality quantizer-controlled video encodings we found lower peak-to-mean ratios, whereas we found higher peak-to-mean ratios for lower-quality encodings. In Figure 6.23 we illustrate the frame size distributions for the movie Stealing Harvard and the Friends episode. We observe that both distributions are heavy-tailed. We also observe that the Friends episode's frame sizes are more easily fitted with a tailed Gaussian distribution, whereas Stealing Harvard's frame size distribution has a very pronounced peak for very small frame sizes and only after that assumes a more Gaussian form. This corroborates other findings, whereby the assumption of Gaussian traffic from encoded video sources is a very rough approximation at best. We now look at the self-similarity and long-range dependency characteristics of the pre-encoded movie and TV episode, starting with the frame size autocorrelation coefficient in Figure 6.24. We observe for both evaluated traces an initial sharp drop of the autocorrelation coefficient, followed by a sharp increase and then a slow decay with increasing frame lag k. We observe that the general level of the ACC values is higher for the movie
(a) Movie: Stealing Harvard, H ≈ 0.966.
(b) TV series episode: Friends 1x20, H ≈ 0.874.
Fig. 6.25: Frame size R/S plots for the pre-encoded movie Stealing Harvard and the TV series episode from Friends (season one, episode 20).

Table 6.20: Hurst parameters H estimated from the pox diagram of R/S as a function of the aggregation level a.

                              Aggregation Level a
Evaluated Movie        1      12     50     100    200    400    800
Bully 1                0.884  0.861  0.838  0.842  0.821  0.784  0.655
Bully 3                0.870  0.861  0.856  0.889  0.908  0.940  1.030
Hackers                0.503  0.517  0.513  0.531  0.520  0.486  0.619
LOTR II (CD1)          0.960  0.879  0.848  0.847  0.866  0.809  0.750
LOTR II (CD2)          0.976  0.876  0.894  0.926  0.934  0.864  0.816
Oceans 11              0.917  0.844  0.818  0.809  0.787  0.756  0.736
Robin Hood Disney      0.815  0.826  0.806  0.798  0.810  0.784  0.808
Serving Sara           0.936  0.853  0.849  0.839  0.821  0.790  0.740
Stealing Harvard       0.966  0.895  0.853  0.813  0.785  0.700  0.675
Final Fantasy          0.916  0.833  0.779  0.769  0.752  0.733  0.726
Tomb Raider            0.908  0.849  0.852  0.850  0.843  0.800  0.731
Roughnecks             0.647  0.650  0.650  0.631  0.633  0.690  0.771
Kiss Of The Dragon     0.902  0.852  0.808  0.809  0.802  0.780  0.774
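The following sketch outlines the rescaled range (R/S) computation underlying Table 6.20 (a simplified illustration under the assumption that the trace is a numpy array of frame sizes; the pox-diagram regression used for the book's traces may select block sizes and weight the fit differently): the series is first aggregated at level a, the R/S statistic is averaged over blocks of increasing length d, and the slope of log(R/S) over log(d) estimates H.

```python
import numpy as np

def rs_statistic(x, d):
    """Average rescaled range R/S over non-overlapping blocks of length d."""
    x = np.asarray(x, dtype=float)
    ratios = []
    for i in range(len(x) // d):
        block = x[i * d:(i + 1) * d]
        dev = np.cumsum(block - block.mean())   # cumulative deviations from the block mean
        r = dev.max() - dev.min()               # range R
        s = block.std(ddof=0)                   # standard deviation S
        if s > 0:
            ratios.append(r / s)
    return np.mean(ratios)

def hurst_rs(x, agg=1):
    """Hurst estimate from the slope of log10(R/S) over log10(d) for an aggregated trace."""
    x = np.asarray(x, dtype=float)
    if agg > 1:
        n_blocks = len(x) // agg
        x = x[:n_blocks * agg].reshape(n_blocks, agg).mean(axis=1)
    block_lens = np.unique(np.logspace(1, np.log10(len(x) // 4), 15).astype(int))
    log_d = np.log10(block_lens)
    log_rs = np.log10([rs_statistic(x, d) for d in block_lens])
    return np.polyfit(log_d, log_rs, 1)[0]      # slope of the pox-diagram fit = H
```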
Stealing Harvard than for the TV series episode Friends 1x20. This indicates that, in general, there is a higher level of correlation between frames in the movie than in the TV series episode. We present the R/S plots for the movie Stealing Harvard and the TV series episode Friends 1x20 in Figure 6.25. As estimated from the diagrams, the Hurst parameter for the movie is larger than the Hurst parameter estimated for the TV series episode. In turn, we derive that the video traffic for the movie has a higher level of long-range dependency than the video traffic for the TV series episode. This finding also corroborates the observations made from the decay and level of the autocorrelation coefficient (ACC) in Figure 6.24 on the individual frame level. We present the Hurst parameters H for the evaluated pre-encoded movies estimated from the R/S plots in Table 6.20. We observe that for most movies the Hurst parameter, even at high aggregation levels, is quite large, indicating long-range dependency of the video traffic of the pre-encoded movies. The only exception is the movie Hackers, where the Hurst parameter is around H ≈ 0.5, indicating a lower level of long-range dependency of the video traffic. We additionally observe that for
(a) Movie: Stealing Harvard, H ≈ 0.878.
(b) TV series episode: Friends 1x20, H ≈ 0.865.
Fig. 6.26: Periodogram plots for aggregated frame sizes (a = 12) for the pre-encoded movie Stealing Harvard and the TV series episode from Friends (season one, episode 20).

most movies the Hurst parameter declines with higher levels of aggregation. We provide the periodogram plots and the Hurst parameters H estimated from the periodogram plots for the pre-encoded movie and TV series episode in Figure 6.26. We observe that the estimated Hurst parameters are closer to each other than those obtained from the R/S plots at the frame level. We additionally observe that the Hurst parameter estimated for the movie, compared to the parameter estimated for the TV series episode, still indicates a higher level of long-range dependency for the movie. It is very hard to compare our former traces with the traces generated with the presented pre-encoded video approach. This is due to the different video formats, utilized encoders, encoder settings, and synchronization issues regarding the content of the evaluated videos. In Figure 6.27, we illustrate the aggregated frame sizes for an aggregation level of a = 500 video frames for the pre-encoded and the MPEG-4 encoded (at the low quality encoding mode as given in Table 5.1) video Robin Hood Disney. We observe that the overall level of the pre-encoded video sizes is larger than the level for the low quality MPEG-4 encoding. We also observe that the two sequences are not synchronized, yet display similar characteristics at different aggregate numbers. We emphasize that these similarities are observed even though the videos were encoded completely independently (using different encoders applied to the sequences grabbed from a VCR with our previous approach and by someone posting a DIV3 encoding on the web with our pre-encoded approach). These similar characteristics are content dependent; different means of encoding only vary their intensity and overall behavior.
Fig. 6.27: Video traffic comparison for Robin Hood Disney between the preencoded video trace and MPEG-4 encoded video trace for an aggregation level of a = 500 video frames.
6.4 Video Trace Statistics for Wavelet Encoded Video

6.4.1 Analysis of Video Traffic

Table 6.21 gives the mean X̄, the coefficient of variation CoVX, and the peak-to-mean ratio Xmax/X̄ of the frame sizes, as well as the mean bit rates X̄/T and the peak bit rates Xmax/T, as defined in Chapter 4. From Table 6.21 we observe that the CoVX increases as the encoded video rate increases from very low bit rates to medium bit rates, and then the CoVX decreases as the encoded video rate increases further from the medium rate to very high rates. For example, for the video sequence Football with Commercials in Table 6.21, we observe that the CoVX is 0.183 at 25 kbps and increases to 0.292 at 300 kbps. Then it starts to decrease back to 0.216 at 1600 kbps, causing a hump-like behavior. The causes for this phenomenon and its implications on channel utilization and buffer requirements will be explored in future work. This same phenomenon has been observed in [130] for MPEG-4 video traces. We observe from Table 6.21 that the peak-to-mean ratio of the frame sizes exhibits a similar hump behavior. Table 6.22 gives the mean Ȳ, the coefficient of variation CoVY, and the peak-to-mean ratio Ymax/Ȳ of the GoP sizes, as well as the mean bit rates Ȳ/(GT) and the peak bit rates Ymax/(GT), as defined in Chapter 4. We observe that the CoVY is smaller at the GoP level compared to the frame level depicted in Table 6.21. Here, too, we observe
Table 6.21: Overview of frame size statistics for wavelet encoded videos The Terminator, The Lady and The Tramp, Football with Commercials, and Tonight Show with Commercials.
The Terminator
The Lady And The Tramp
Football with Commercials
Tonight Show with Commercials
Target Rate [kbps] 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600
Compression Ratio YUV:3D-EZBC 367.724 121.982 91.392 30.434 15.212 11.408 9.126 7.604 6.518 5.703 367.757 121.982 91.365 30.434 15.212 11.408 9.126 7.605 6.518 5.703 367.653 121.979 91.425 30.434 15.212 11.408 9.126 7.605 6.518 5.703 367.754 121.987 91.426 30.433 15.212 11.408 9.126 7.605 6.518 5.703
Mean ¯ [kbyte] X 0.103 0.312 0.416 1.249 2.499 3.332 4.166 4.999 5.833 6.666 0.103 0.312 0.416 1.249 2.499 3.332 4.166 4.999 5.832 6.666 0.103 0.312 0.416 1.249 2.499 3.332 4.166 4.999 5.832 6.666 0.103 0.312 0.416 1.249 2.499 3.332 4.166 4.999 5.833 6.666
Frame Size CoVX Peak/Mean ¯ ¯ SX /X Xmax /X 0.144 1.944 0.265 3.831 0.293 5.753 0.312 5.483 0.296 4.850 0.281 3.985 0.263 3.948 0.247 3.377 0.225 2.940 0.197 3.022 0.123 2.119 0.222 2.445 0.239 2.483 0.239 2.441 0.214 2.141 0.195 2.154 0.175 1.899 0.161 1.867 0.145 1.764 0.125 1.627 0.183 2.679 0.280 2.519 0.291 2.434 0.292 2.382 0.286 2.497 0.276 2.316 0.262 2.315 0.249 2.180 0.232 2.030 0.216 1.904 0.135 2.012 0.254 3.225 0.267 3.093 0.280 3.521 0.259 3.012 0.241 2.516 0.219 2.487 0.203 2.239 0.186 1.990 0.168 1.954
Bit Rate Mean Peak ¯ Xmax /T [Mbps] X/T [Mbps] 0.025 0.048 0.075 0.287 0.100 0.574 0.300 1.644 0.600 2.909 0.800 3.187 1.000 3.947 1.200 4.051 1.400 4.116 1.600 4.834 0.025 0.053 0.075 0.183 0.100 0.248 0.300 0.732 0.600 1.284 0.800 1.722 1.000 1.898 1.200 2.239 1.400 2.470 1.600 2.604 0.025 0.066 0.075 0.188 0.100 0.243 0.300 0.714 0.600 1.498 0.800 1.852 1.000 2.315 1.200 2.616 1.400 2.842 1.600 3.046 0.025 0.050 0.075 0.241 0.100 0.309 0.300 1.056 0.600 1.807 0.800 2.012 1.000 2.486 1.200 2.686 1.400 2.786 1.600 3.125
the hump phenomenon of increasing CoVY from low bit rates to mid bit rates and then decreasing from mid bit rates to high bit rates. Next, we provide plots to illustrate the video traffic characteristics and statistical characteristics of the following video sequences: (a) The Terminator encoded at 25 kbps, (b) The Terminator encoded at 100 kbps, (c) The Lady and The Tramp encoded at 300 kbps, (d) The Lady and The Tramp encoded at 800 kbps, (e) Football with Commercials encoded at 1000 kbps, and (f) Football with Commercials encoded at 1600 kbps. The video sequences were chosen from the three different genres action, cartoon, and a TV talk show with commercials to give a representation of different video content. Figure 6.28 illustrates the behavior of the frame sizes (in bytes) as a function of the frame index n. We observe that The Terminator encoded at 100 kbps is smoother than The Terminator encoded at 25 kbps. However, The Lady and The Tramp encoded at 300 kbps shows more variations than The Terminator at 100 kbps. By visual inspection of Figure 6.28, Football with Commercials encoded at 1000 kbps and at 1600 kbps exhibits almost the same variations at both rates, but due to the different bit rates the traces are obviously centered at the corresponding frame sizes. For all bit rate encodings, we observed that some parts of the trace had higher variations than others, which correspond to different scenes of the video sequence. Next, we observe the behavior
Table 6.22: Overview of GoP size statistics for wavelet encoded videos The Terminator, The Lady and The Tramp, Football with Commercials and Tonight Show with Commercials. Video
The Terminator
The Lady And The Tramp
Football with Commercials
Tonight Show with Commercials
Target Rate [kbps] 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600
Mean Y¯ [kbyte] 1.654 4.986 13.407 19.983 39.980 53.311 66.643 79.974 93.305 106.637 1.571 4.637 12.924 18.434 36.832 49.097 61.363 73.627 85.892 98.157 1.654 4.986 6.652 19.983 39.980 53.312 66.643 79.974 93.305 106.637 1.654 4.986 6.652 19.983 39.980 53.311 66.643 79.974 93.306 106.637
GoP Size CoVY Peak/Mean SY /Y¯ Ymax /Y¯ 0.133 1.763 0.248 2.351 0.604 3.876 0.294 2.837 0.278 2.506 0.264 2.328 0.247 2.206 0.232 2.091 0.211 1.926 0.185 1.744 0.212 2.123 0.337 2.522 0.626 3.639 0.373 2.551 0.361 2.237 0.351 2.195 0.340 2.048 0.334 2.012 0.327 1.887 0.318 1.718 0.172 1.820 0.266 2.151 0.277 2.185 0.278 2.314 0.272 2.458 0.262 2.296 0.249 2.293 0.237 2.163 0.220 1.950 0.205 1.889 0.126 1.950 0.240 2.919 0.253 2.988 0.265 3.392 0.244 2.935 0.227 2.465 0.206 2.440 0.191 2.200 0.176 1.934 0.159 1.920
Bit Rate Mean Peak ¯ Y /(Gt) [Mbps] Ymax /(Gt) [Mbps] 0.025 0.044 0.075 0.176 0.201 0.780 0.300 0.850 0.600 1.503 0.800 1.862 1.000 2.206 1.200 2.508 1.400 2.696 1.600 2.790 0.024 0.050 0.070 0.175 0.194 0.705 0.277 0.705 0.552 1.236 0.736 1.616 0.920 1.885 1.104 2.222 1.288 2.432 1.472 2.530 0.025 0.045 0.075 0.161 0.100 0.218 0.300 0.694 0.600 1.474 0.800 1.836 1.000 2.292 1.200 2.595 1.400 2.729 1.600 3.021 0.025 0.048 0.075 0.218 0.100 0.298 0.300 1.017 0.600 1.760 0.800 1.971 1.000 2.439 1.200 2.639 1.400 2.706 1.600 3.072
of the GoP sizes as a function of the GoP index m, illustrated in Figure 6.29. In Section 3.4, we described the behavior of the MC-3DEZBC encoder's rate control, which gives insight into the large variations observed in Figure 6.29. In contrast to the frame level, here we observe a much smoother plot, due to the fact that taking the aggregate of the frame sizes over a GoP smoothes out the variations somewhat. We have observed this behavior for different aggregation levels, which are not shown here due to space constraints. At the GoP level we still observe different variations along the trace due to the different scenes of the video. Figure 6.30 illustrates the histograms of the frame sizes. We observe a single peak with a relatively smooth slope, in contrast to the MPEG-4 traces, where a double peak was observed [124]. At the GoP level the
Fig. 6.28: Frame size Xn as a function of the frame index n for wavelet encoded QCIF videos The Terminator, The Lady And The Tramp, and Football with Commercials.

histograms are much smoother relative to the frame level histograms, as illustrated in Figure 6.31. Figure 6.32 illustrates the autocorrelation coefficient as a function of the frame lag k (in frames). For the frame level autocorrelation coefficient, we observe a smoothly decaying curve. This is in contrast to the spiky autocorrelation coefficient behavior observed for MPEG-4 encodings due to the three different frame types I, P, and B. The decay of the autocorrelation coefficient, however, is less than exponential and indicates that close frames are correlated with respect to their frame sizes. Only for distant frames, with a lag k > 140, do we observe that the autocorrelation coefficient is
Fig. 6.29: GoP sizes Ym as a function of the index m for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

close to zero. In Figure 6.33, we observe a different type of behavior of the autocorrelation coefficient as a function of the lag k (in GoPs). For the GoP level, we observe that the autocorrelation coefficient drops sharply and exponentially below zero, then slowly approaches zero and remains around zero. This behavior indicates that there is nearly no correlation between distant GoPs. For closer GoPs, there is only little correlation, which becomes negative; in turn, this could be an indicator of the rate-control algorithm. We illustrate
Fig. 6.30: Frame size histograms for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

the R/S pox plots for The Terminator, The Lady and The Tramp and Football with Commercials in Figure 6.34 for the GoP aggregation level a = 16. In addition to the aggregation level illustrated, Table 6.23 provides the Hurst parameter H determined with the R/S method, calculated for aggregation levels of a = 1, 2, 4, 16, 24, 32, 64, 128, 256, 400, 480, 560, 640 and 800. We provide the periodogram plots for the same sequences in Figure 6.35. The Hurst parameter H estimated with the periodogram as a function of the aggregation
Fig. 6.31: GoP size histograms for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

level a is given in Table 6.24. The variance time plots for the three evaluated movies The Terminator, The Lady and The Tramp and Football with Commercials are illustrated in Figure 6.36. The corresponding Table 6.25 gives the Hurst parameter estimated using the variance time plot. Additionally, Table 6.25 provides the values of the scaling parameters cf and α (the latter shown as H = (1 + α)/2) estimated from the logscale diagram, which is given in Figure 6.37. Overall, we note that the estimates for the Hurst parameter
Fig. 6.32: Frame size autocorrelation coefficients (ACC) for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

H typically decrease as the aggregation level increases from a = 1 to around a = 200 and are then more or less stable [100]. We make similar observations here. The pox plot of R/S for a = 1 and the periodogram for a ≤ 64 give H estimates larger than 0.5, which usually indicates the presence of long range dependence in the video traffic. However, the H estimates obtained for larger aggregation levels a are all below 0.5, which indicates that there is no long range dependence in the
Fig. 6.33: GoP size autocorrelation coefficients for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

traffic. All in all, our investigations indicate that there is no significant long range dependence in the video traffic of wavelet encoded video. In Figure 6.37 we illustrate the logscale diagrams, which show a general trend of an increasing curve for lower octaves j and a decreasing trend for higher octaves j.
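The logscale diagram referenced above can be approximated with a standard discrete wavelet decomposition. The sketch below is a simplified illustration assuming the PyWavelets package and a Daubechies wavelet; the traces in this book were produced with a dedicated Abry–Veitch-style estimator, so the exact octave weighting and confidence intervals will differ. It computes the per-octave energies y_j as the log2 of the mean squared detail coefficients and reads off α from the slope, with H = (1 + α)/2 as used in Table 6.25.

```python
import numpy as np
import pywt  # PyWavelets, assumed to be installed

def logscale_diagram(x, wavelet="db3", octaves=(3, 8)):
    """Approximate logscale-diagram estimate of the scaling exponent alpha and of H."""
    x = np.asarray(x, dtype=float)
    coeffs = pywt.wavedec(x, wavelet)      # [cA_J, cD_J, ..., cD_1]
    details = coeffs[1:][::-1]             # reorder to cD_1 (finest) ... cD_J (coarsest)
    j_vals, y_vals = [], []
    for j, d in enumerate(details, start=1):
        if octaves[0] <= j <= octaves[1] and len(d) > 1:
            j_vals.append(j)
            y_vals.append(np.log2(np.mean(d ** 2)))   # y_j = log2(mean |d_{j,k}|^2)
    # Slope of y_j over the octave j estimates alpha; H = (1 + alpha) / 2
    alpha = np.polyfit(j_vals, y_vals, 1)[0]
    return alpha, (1.0 + alpha) / 2.0
```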
Fig. 6.34: POX plots of R/S for aggregation level a = 16 for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

6.4.2 Analysis of Video Quality

In this section we analyze the video quality aspects of the wavelet encoded video traces. Our main focus is on the PSNR and MSE values, defined in Chapter 4. For the PSNR values we only take into account the luminance component of the video traces, since the human visual system is more
Table 6.23: Hurst parameters estimated from pox diagram of R/S statistics as a function of the aggregation level a. Video
Target Rate [kbps] The 25 Terminator 75 100 300 600 800 1000 1200 1400 1600 The Lady 25 And 75 The Tramp 100 300 600 800 1000 1200 1400 1600 Football 25 with 75 Commercials 100 300 600 800 1000 1200 1400 1600 Tonight 25 Show 75 with 100 Commercials 300 600 800 1000 1200 1400 1600
Aggregation level a [frames] 1 0.697 0.673 0.674 0.665 0.664 0.667 0.671 0.678 0.683 0.682 0.723 0.704 0.697 0.693 0.690 0.690 0.679 0.685 0.683 0.687 0.693 0.669 0.674 0.693 0.698 0.693 0.694 0.691 0.696 0.694 0.703 0.703 0.712 0.700 0.696 0.693 0.691 0.693 0.691 0.691
12 0.521 0.498 0.496 0.484 0.480 0.479 0.477 0.481 0.478 0.477 0.493 0.466 0.465 0.465 0.456 0.447 0.441 0.443 0.447 0.452 0.473 0.456 0.467 0.495 0.503 0.500 0.503 0.499 0.503 0.505 0.499 0.500 0.499 0.494 0.499 0.504 0.505 0.502 0.506 0.509
24 0.511 0.489 0.479 0.462 0.460 0.464 0.459 0.464 0.465 0.461 0.450 0.434 0.429 0.429 0.424 0.415 0.411 0.413 0.415 0.417 0.447 0.424 0.434 0.465 0.470 0.468 0.472 0.467 0.469 0.472 0.484 0.476 0.473 0.464 0.479 0.484 0.483 0.480 0.484 0.486
48 0.480 0.463 0.457 0.443 0.431 0.430 0.429 0.432 0.433 0.431 0.410 0.396 0.393 0.396 0.386 0.377 0.370 0.373 0.377 0.381 0.408 0.383 0.395 0.430 0.442 0.442 0.444 0.441 0.444 0.446 0.453 0.454 0.453 0.437 0.449 0.459 0.462 0.459 0.466 0.470
96 0.424 0.412 0.406 0.384 0.367 0.366 0.366 0.370 0.371 0.369 0.346 0.345 0.342 0.346 0.339 0.331 0.324 0.330 0.335 0.335 0.348 0.310 0.316 0.357 0.378 0.377 0.384 0.379 0.387 0.394 0.416 0.417 0.413 0.405 0.407 0.412 0.410 0.402 0.409 0.415
192 0.364 0.364 0.346 0.327 0.318 0.323 0.320 0.325 0.325 0.330 0.306 0.296 0.303 0.307 0.301 0.288 0.282 0.292 0.296 0.301 0.293 0.266 0.273 0.304 0.317 0.312 0.319 0.314 0.318 0.328 0.360 0.387 0.375 0.371 0.373 0.379 0.370 0.358 0.365 0.368
300 0.307 0.273 0.266 0.256 0.254 0.268 0.260 0.268 0.268 0.260 0.308 0.269 0.255 0.248 0.241 0.234 0.238 0.240 0.250 0.274 0.253 0.236 0.246 0.278 0.283 0.274 0.285 0.279 0.283 0.294 0.292 0.321 0.305 0.318 0.312 0.337 0.330 0.335 0.339 0.334
396 0.326 0.300 0.294 0.270 0.251 0.259 0.261 0.271 0.274 0.266 0.270 0.237 0.245 0.254 0.264 0.264 0.276 0.281 0.302 0.325 0.248 0.202 0.208 0.238 0.248 0.237 0.241 0.234 0.240 0.238 0.338 0.300 0.279 0.251 0.237 0.271 0.281 0.269 0.287 0.301
504 0.287 0.226 0.227 0.245 0.244 0.246 0.230 0.244 0.230 0.207 0.278 0.256 0.252 0.259 0.248 0.246 0.246 0.240 0.259 0.260 0.203 0.217 0.230 0.294 0.298 0.275 0.275 0.259 0.254 0.251 0.336 0.299 0.292 0.342 0.373 0.382 0.386 0.365 0.391 0.403
600 0.268 0.264 0.250 0.221 0.217 0.232 0.227 0.243 0.254 0.249 0.259 0.224 0.262 0.254 0.252 0.237 0.230 0.237 0.241 0.235 0.251 0.248 0.267 0.318 0.317 0.311 0.301 0.284 0.295 0.288 0.370 0.395 0.370 0.359 0.329 0.317 0.301 0.274 0.270 0.265
696 0.351 0.310 0.296 0.301 0.285 0.291 0.284 0.289 0.295 0.301 0.264 0.257 0.289 0.297 0.286 0.268 0.263 0.256 0.247 0.226 0.237 0.196 0.193 0.245 0.266 0.259 0.271 0.262 0.271 0.272 0.357 0.330 0.322 0.323 0.332 0.345 0.354 0.361 0.368 0.384
792 0.262 0.233 0.323 0.247 0.217 0.215 0.206 0.195 0.191 0.184 0.348 0.255 0.245 0.256 0.254 0.275 0.270 0.265 0.288 0.312 0.202 0.195 0.190 0.181 0.202 0.202 0.203 0.195 0.205 0.215 0.299 0.318 0.327 0.265 0.212 0.213 0.212 0.206 0.200 0.225
sensitive to the luminance component than to the chrominance components. We denote Qn for QYn, and Mn for p²/10^(Qn/10) for convenience. Table 6.26 gives the average quality Q̄, the coefficient of quality variation CoQV, the alternative coefficient of quality variation CoQV, and the quality range Qmax min for the video frames, while at the GoP aggregation level it gives the coefficients of variation CoQV(G) and CoQV(G) and the quality range Qmax(G) min. We observe that the average video quality Q̄ is low, around 18−20 dB, for the 25 kbps video, while for the 1600 kbps video the average video quality Q̄ is around 39−40 dB. As we observed in Table 6.21, the CoQV shows a hump-like behavior: it increases for the low bit rates, comes to a peak around the mid bit rates, and gradually decreases again for the higher bit rates. The CoQV, on the other hand, shows a gradually decreasing trend when the bit
Fig. 6.35: Periodogram plots for aggregation level a = 16 for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

rate is increased. We observe that the quality range Qmax min decreases with increasing bit rate as well. Football with Commercials shows a much larger Qmax min than the other videos. Next, at the GoP level we observe similar results from Table 6.26. The CoQV(G) shows the hump-like behavior, while the CoQV(G) and the Qmax(G) min decrease with increasing video bit rates. The CoQV(G) we observe is relatively smaller than the CoQV.
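The quality metrics discussed here follow from the per-frame luminance PSNR values Qn recorded in the traces. The sketch below (a minimal example under the assumption that the trace provides Qn in dB and that the peak pixel value is p = 255; the quality range is taken here as max minus min, one plausible reading of the book's notation) converts PSNR to MSE via Mn = p²/10^(Qn/10) and computes the average quality and a coefficient of quality variation.

```python
import numpy as np

PEAK = 255.0  # peak luminance value p for 8-bit video

def psnr_to_mse(q_db):
    """Convert per-frame PSNR values Qn (in dB) to MSE values Mn = p^2 / 10^(Qn/10)."""
    q_db = np.asarray(q_db, dtype=float)
    return PEAK ** 2 / np.power(10.0, q_db / 10.0)

def quality_statistics(q_db):
    """Average quality, a coefficient of quality variation, and a quality range for a trace."""
    q_db = np.asarray(q_db, dtype=float)
    return {
        "mean_quality_db": q_db.mean(),
        # std over mean of the PSNR values; the book also reports an alternative definition
        "coqv": q_db.std(ddof=0) / q_db.mean(),
        # quality range, here taken as max - min of the per-frame PSNR (an assumption)
        "quality_range_db": q_db.max() - q_db.min(),
    }
```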
Table 6.24: Hurst parameters estimated from periodogram as a function of the aggregation level a. Video
Target Rate [kbps] The 25 Terminator 75 100 300 600 800 1000 1200 1400 1600 The Lady 25 And 75 The Tramp 100 300 600 800 1000 1200 1400 1600 Football 25 with 75 Commercials 100 300 600 800 1000 1200 1400 1600 Tonight 25 Show 75 with 100 Commercials 300 600 800 1000 1200 1400 1600
Aggregation level a [frames] 12 1.065 1.080 1.075 1.073 1.066 1.058 1.057 1.052 1.041 1.027 1.113 1.158 1.149 1.150 1.136 1.124 1.131 1.125 1.117 1.118 1.055 1.038 1.026 1.012 1.008 1.010 1.013 1.018 1.011 1.013 1.020 1.022 1.014 0.981 0.951 0.955 0.972 0.976 0.983 1.001
24 0.973 0.998 1.015 1.002 1.000 0.990 0.986 0.983 0.968 0.950 1.024 1.041 1.021 1.016 0.994 0.984 0.986 0.979 0.972 0.986 0.977 0.962 0.961 0.973 0.971 0.976 0.975 0.974 0.969 0.967 0.934 0.949 0.941 0.917 0.883 0.877 0.875 0.899 0.899 0.908
48 0.893 0.930 0.931 0.930 0.924 0.924 0.916 0.913 0.907 0.893 0.965 0.997 1.002 0.995 0.974 0.965 0.980 0.977 0.968 0.975 0.939 0.920 0.916 0.918 0.911 0.903 0.901 0.917 0.906 0.899 0.875 0.899 0.898 0.885 0.845 0.838 0.820 0.850 0.844 0.843
96 0.706 0.746 0.762 0.747 0.742 0.737 0.725 0.733 0.738 0.719 0.752 0.786 0.767 0.755 0.723 0.697 0.710 0.695 0.681 0.678 0.740 0.704 0.713 0.737 0.755 0.750 0.750 0.742 0.737 0.730 0.695 0.734 0.735 0.736 0.709 0.707 0.694 0.680 0.677 0.668
192 0.454 0.507 0.526 0.511 0.490 0.488 0.484 0.480 0.473 0.457 0.448 0.494 0.498 0.482 0.469 0.457 0.462 0.438 0.417 0.425 0.469 0.434 0.449 0.498 0.511 0.498 0.489 0.486 0.472 0.474 0.420 0.474 0.488 0.481 0.461 0.477 0.470 0.465 0.465 0.483
300 0.108 0.166 0.170 0.178 0.195 0.220 0.227 0.234 0.215 0.174 0.171 0.205 0.176 0.148 0.118 0.087 0.082 0.074 0.059 0.072 0.068 0.057 0.037 0.174 0.205 0.191 0.185 0.166 0.146 0.128 0.137 0.211 0.226 0.247 0.215 0.208 0.189 0.195 0.196 0.187
396 -0.059 -0.002 -0.002 0.003 -0.022 -0.032 -0.062 -0.015 -0.050 -0.098 -0.033 0.063 0.008 -0.038 -0.095 -0.113 -0.126 -0.128 -0.154 -0.159 -0.181 -0.146 -0.119 -0.067 -0.018 -0.005 -0.006 -0.016 -0.033 -0.044 -0.191 -0.039 -0.092 -0.038 -0.056 -0.058 -0.088 -0.079 -0.070 -0.106
504 -0.210 -0.158 -0.222 -0.148 -0.136 -0.098 -0.107 -0.092 -0.108 -0.155 -0.132 -0.120 -0.056 -0.078 -0.115 -0.154 -0.164 -0.175 -0.193 -0.211 -0.214 -0.158 -0.192 -0.133 -0.152 -0.157 -0.163 -0.172 -0.178 -0.195 -0.288 -0.101 -0.107 -0.071 -0.044 -0.051 -0.063 -0.048 -0.049 -0.040
600 -0.176 -0.239 -0.196 -0.157 -0.132 -0.172 -0.173 -0.187 -0.215 -0.230 -0.202 -0.108 -0.063 -0.060 -0.091 -0.116 -0.127 -0.130 -0.142 -0.159 -0.134 -0.197 -0.123 -0.201 -0.157 -0.137 -0.132 -0.132 -0.142 -0.189 -0.099 -0.043 -0.086 -0.124 -0.158 -0.149 -0.130 -0.090 -0.102 -0.090
696 -0.248 -0.172 -0.239 -0.150 -0.101 -0.092 -0.111 -0.115 -0.120 -0.180 -0.166 -0.237 -0.175 -0.152 -0.119 -0.118 -0.097 -0.097 -0.103 -0.111 -0.088 -0.157 -0.208 -0.192 -0.162 -0.193 -0.209 -0.230 -0.241 -0.261 -0.229 -0.249 -0.301 -0.158 -0.130 -0.079 -0.079 -0.077 -0.068 -0.059
792 -0.306 -0.354 -0.307 -0.314 -0.303 -0.292 -0.306 -0.309 -0.332 -0.373 -0.174 -0.122 -0.087 -0.059 -0.068 -0.063 -0.066 -0.054 -0.028 -0.070 -0.263 -0.240 -0.227 -0.169 -0.071 -0.111 -0.147 -0.158 -0.179 -0.178 -0.335 -0.216 -0.242 -0.090 -0.171 -0.210 -0.217 -0.233 -0.251 -0.235
Figure 6.38 illustrates the behavior of the video quality in PSNR as a function of the frame index n. We observe a relatively high variance of the video quality for the low bit rate videos, while the quality tends to smooth out as the bit rate is increased. Different sections of the trace tend to have different variations and average video quality, which correspond to the different scenes in the video sequence. We observed that the variations of the quality for the same bit rate also differ between videos, due to the different content of the video genres. Figure 6.39 shows the histograms of the frame qualities. We observe that the histograms are wider for the low bit rate videos and narrower for the high bit rate videos. This is due to the fact that with large bit budgets the encoder can consistently encode frames with little loss, while at lower bit rates more detailed, complicated frames have a lower PSNR.
Fig. 6.36: Variance time plots for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

The Terminator encoded at 100 kbps behaves much differently, exhibiting a jagged histogram, in contrast to the other bit rates, which show smoother single-peak histograms. Figure 6.40 and Figure 6.41 show the autocorrelation coefficient as a function of the lag k (in frames) and the lag k (in GoPs), respectively. In Figure 6.40 we observe that the autocorrelation function is smooth and decays slowly,
Table 6.25: Hurst parameters estimated from variance time plot, scaling parameters estimated from logscale diagram. Video
The Terminator
The Lady And The Tramp
Football with Commercials
Tonight Show with Commercials
Target Rate [kbps] 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600
VT H -0.007 0.014 0.029 0.047 0.047 0.036 0.041 0.022 0.021 0.026 0.022 0.028 0.033 0.041 0.063 0.067 0.092 0.096 0.097 0.086 -0.111 -0.077 -0.071 -0.042 -0.040 -0.047 -0.048 -0.044 -0.050 -0.062 -0.190 -0.174 -0.154 -0.374 -0.432 -0.421 -0.403 -0.382 -0.373 -0.348
Logscale Diagram cf 6696783360.000 134226712199168.000 361141272576.000 322789810634752.000 478617935020032.000 2104900845568.000 3280063430656.000 689201872896.000 377870319616.000 875160141824.000 213082080.000 489060224.000 22928542.000 19194778.000 9321051.000 10888958.000 820040.312 718594.750 495879.500 442595.625 6687762.500 17504038.000 23999492.000 36904152.000 24528310.000 13327088.000 15617054.000 12771494.000 3192834.500 4051244.250 230368864.000 675199.625 748491.125 165650.844 213499472.000 120589.367 156895.969 174308.781 73974.336 55982.273
α -2.684 -4.100 -3.159 -4.190 -4.255 -3.392 -3.450 -3.251 -3.173 -3.283 -2.201 -2.325 -1.936 -1.912 -1.824 -1.848 -1.556 -1.544 -1.502 -1.484 -1.759 -1.907 -1.955 -2.000 -1.944 -1.867 -1.884 -1.863 -1.669 -1.697 -2.258 -1.486 -1.493 -1.295 -2.186 -1.560 -1.587 -1.600 -1.501 -1.460
H -0.842 -1.550 -1.080 -1.595 -1.627 -1.196 -1.225 -1.125 -1.086 -1.141 -0.601 -0.662 -0.468 -0.456 -0.412 -0.424 -0.278 -0.272 -0.251 -0.242 -0.380 -0.453 -0.478 -0.500 -0.472 -0.434 -0.442 -0.431 -0.334 -0.349 -0.629 -0.243 -0.246 -0.148 -0.593 -0.280 -0.294 -0.300 -0.250 -0.230
which is again in contrast to the MPEG-4 encodings [124]. At the GoP level, in Figure 6.41, we observe a relatively sharper, less smooth decay. Figures 6.42 and 6.43 illustrate the scatter plots of the frame quality as a function of the video frame size and of the video GoP size, respectively. Here, we note that larger frames do not necessarily have a high video quality. We observe that the frame quality levels tend to disperse horizontally for higher bit rates, while at lower bit rates the frame qualities tend to stay closer to the mean.
Fig. 6.37: Logscale diagrams for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

6.4.3 Correlation Between Frame Sizes and Qualities

Table 6.27 gives the size–MSE quality correlation coefficient ρXM and the size–PSNR quality correlation coefficient ρXQ, as well as the corresponding correlation coefficients ρXM(G) and ρXQ(G) for the GoP aggregation. For the frame level we initially observe from Table 6.27 that the coefficient of size–MSE correlation ρXM decreases as the bit rate is increased. The coefficient of size–quality correlation ρXQ, on the other hand, increases toward zero (i.e., its magnitude decreases) for increasing bit rates.
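The size–quality correlation coefficients reported in Table 6.27 are plain Pearson correlations between the per-frame sizes and the per-frame MSE or PSNR values. A minimal sketch, assuming the traces are available as aligned numpy arrays:

```python
import numpy as np

def size_quality_correlations(frame_sizes, psnr_db, peak=255.0):
    """Pearson correlation of frame sizes with MSE (rho_XM) and with PSNR (rho_XQ)."""
    x = np.asarray(frame_sizes, dtype=float)
    q = np.asarray(psnr_db, dtype=float)
    m = peak ** 2 / np.power(10.0, q / 10.0)   # per-frame MSE derived from the PSNR
    rho_xm = np.corrcoef(x, m)[0, 1]
    rho_xq = np.corrcoef(x, q)[0, 1]
    return rho_xm, rho_xq
```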
Table 6.26: Overview of quality statistics of single–layer traces for wavelet encoded video. Video The Terminator
The Lady And The Tramp
Football With Commercials
Tonight Show With Commercials
Target Bit Rate 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600
¯ Q 19.256 22.965 24.576 28.847 33.126 35.189 36.889 38.339 39.643 40.781 18.449 21.196 23.509 25.681 28.737 30.224 31.531 32.728 33.923 35.130 18.490 21.796 22.730 27.592 31.862 33.886 35.552 36.957 38.094 39.224 18.427 20.413 21.014 24.053 27.044 28.631 30.015 31.315 32.475 33.646
Frame Level CoQV CoQV 0.529 0.128 0.638 0.120 0.793 0.154 0.729 0.100 0.720 0.081 0.688 0.070 0.641 0.062 0.587 0.055 0.531 0.049 0.482 0.044 0.395 0.106 0.432 0.093 0.673 0.111 0.476 0.074 0.468 0.060 0.449 0.053 0.439 0.049 0.423 0.045 0.410 0.043 0.405 0.041 0.443 0.139 0.477 0.124 0.484 0.121 0.530 0.105 0.527 0.088 0.502 0.078 0.469 0.069 0.433 0.064 0.415 0.060 0.408 0.056 0.401 0.104 0.383 0.094 0.381 0.093 0.396 0.087 0.402 0.076 0.379 0.068 0.372 0.064 0.351 0.059 0.353 0.058 0.354 0.056
Qmax min 23.910 22.960 29.360 24.590 24.280 24.390 22.130 20.880 20.650 20.880 18.430 17.050 22.190 19.350 21.290 20.530 20.210 20.570 19.560 18.320 68.090 64.750 63.870 58.750 54.300 52.120 50.340 48.650 46.920 45.300 29.000 27.080 25.910 21.470 18.190 18.250 17.910 17.430 17.780 17.940
CoQC (G) 0.518 0.629 0.785 0.725 0.717 0.686 0.640 0.585 0.530 0.481 0.389 0.428 0.671 0.474 0.467 0.448 0.439 0.422 0.409 0.404 0.431 0.469 0.477 0.525 0.524 0.499 0.466 0.430 0.413 0.406 0.390 0.374 0.374 0.392 0.400 0.378 0.370 0.350 0.352 0.353
GoP level CoQV (G) 0.128 0.122 0.162 0.102 0.082 0.071 0.062 0.054 0.048 0.043 0.106 0.093 0.114 0.074 0.060 0.054 0.049 0.045 0.044 0.041 0.130 0.121 0.118 0.106 0.089 0.078 0.069 0.063 0.059 0.056 0.099 0.091 0.091 0.086 0.076 0.068 0.064 0.059 0.058 0.057
max(G)
Qmin 22.706 21.070 23.602 19.102 19.749 18.635 18.689 17.526 17.354 16.174 17.131 15.822 15.852 13.011 11.655 10.864 10.835 10.707 10.780 10.304 33.760 30.405 29.641 24.114 20.235 18.141 18.858 17.834 16.552 16.456 21.654 19.417 18.276 15.778 15.589 15.750 15.485 14.699 14.336 14.365
This behavior is anticipated due to the inverse relationship between the PSNR and the MSE. For the bit rates under observation, ρXQ stays negative. We see a similar trend at the GoP level, where ρXM(G) decreases and ρXQ(G) increases toward zero for increasing bit rates.

6.4.4 Comparison Between Wavelet and MPEG-4 Encoded Video

In this section, we compare the statistical video characteristics of the MC-3DEZBC wavelet encodings with those obtained from the MPEG-4 intra encodings. We focus the presentation of our results on the movie The Terminator, as these results are representative. Table 6.28 gives the basic statistics and the compression ratio (i.e., the amount of data for the uncompressed frames compared to the mean compressed frame size) for The Terminator encodings with
Fig. 6.38: Video frame quality Qn (in dB) as a function of the frame index n for wavelet encoded QCIF videos The Terminator, The Lady And The Tramp, and Football with Commercials.

different target bit rates. We observe that the wavelet encoder achieves a better match of the lower target bit rates than the MPEG encoder, which fails to match the lower target bit rates. For target bit rates from 25 kbps to 100 kbps the MPEG-4 encodings result in similar mean frame sizes of approximately 0.5 kbyte. The target bandwidth of 100 kbps is thus exceeded by approx. 26%. This behavior is due to the maximum quantization scale of 31 available in the reference encoder implementation. With this bound on the quantization
Fig. 6.39: Histograms of video frame quality Qn (in dB) for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

scale, the TM5 algorithm is unable to match the lower target bit rates. With data rates higher than 100 kbps the compression ratios for both coding modes become very close. For 25 and 75 kbps, the CoVX and the peak-to-mean ratio are identical for MPEG-4. For the encoding with 100 kbps target rate, we observe that the peak-to-mean ratio for the MPEG encoding is no longer identical to that of the two lower target bit rates while the CoVX is, which corroborates our previous reasoning in favor of the CoVX as a robust measure of
Fig. 6.40: MSE autocorrelation coefficient ρM(k) as a function of the lag k (in frames) for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

the traffic variability. From Table 6.28 we additionally observe that the coefficient of variation increases as the encoded video rate increases, reaches a peak, and decreases as the encoded video rate increases further, building a hump of variability. This result is present for both MPEG and wavelet encodings. The trend is much clearer, however, for the wavelet encodings. Figure 6.44 illustrates this characteristic of the coefficient of variation for both movies.
Fig. 6.41: MSE autocorrelation coefficient ρM(G)(k) as a function of the lag k (in GoPs) for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

We observe that for the wavelet encodings the peak is located at 300 kbps. For the MPEG-4 encodings, the peak is located at 1 Mbps. We furthermore observe that the level of variability depends on the content (i.e., the encoded movie) as well as on the encoding type. The MPEG-4 encodings tend to have a higher variability compared to the wavelet encodings, and The Terminator encodings exhibit higher variability than the additionally evaluated The Lady and The
Fig. 6.42: Scatter plots of frame size and frame quality for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

Tramp encodings. To study general characteristics without short-term effects, we average over non-overlapping blocks of a frames for an aggregation level a. In Figure 6.45 we exemplarily illustrate the aggregated frame size trace for The Terminator with a target bit rate of 300 kbps and an aggregation level of a = 792. We observe that the TM5 rate control algorithm used for the MPEG-4 encoding produces a generally close fit to the target bit rate with
Fig. 6.43: Scatter plots of GoP size and average GoP quality for wavelet encoded QCIF videos The Terminator, The Lady and The Tramp, and Football with Commercials.

a limited number of exceptions. The TM5 algorithm matches target bit rates at the Group of Pictures (GoP) level. We note that the GoP length in our study equals a single frame. The TM5 algorithm therefore tries to match the target bit rate for individual frames. For higher aggregation levels the resulting average aggregated frame sizes therefore typically exhibit lower variability than the individual frame sizes, as can be observed by comparing Figures 6.44 and 6.45.
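The aggregation used throughout this comparison is plain block averaging over non-overlapping windows of a frames. A minimal sketch (assuming a numpy frame size array; frames beyond the last complete block are simply dropped):

```python
import numpy as np

def aggregate_trace(frame_sizes, a):
    """Average frame sizes over non-overlapping blocks of a frames (aggregation level a)."""
    x = np.asarray(frame_sizes, dtype=float)
    n_blocks = len(x) // a
    return x[:n_blocks * a].reshape(n_blocks, a).mean(axis=1)

# Hypothetical usage, e.g. for the a = 792 aggregation shown in Figure 6.45:
# agg = aggregate_trace(np.loadtxt("terminator_300kbps_frame_sizes.txt"), 792)
```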
Table 6.27: Correlation between quality and traffic for single–layer wavelet traces. Video The Terminator
The Lady And The Tramp
Football With Commercials
Tonight Show With Commercials
Target Bit Rate 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600 25 75 100 300 600 800 1000 1200 1400 1600
Frame Level ρXM ρXQ 0.389 -0.481 0.390 -0.484 0.302 -0.322 0.279 -0.382 0.195 -0.286 0.148 -0.224 0.115 -0.172 0.072 -0.107 0.034 -0.069 0.027 -0.075 0.371 -0.414 0.395 -0.425 0.241 -0.271 0.289 -0.315 0.184 -0.210 0.128 -0.146 0.080 -0.093 0.030 -0.028 -0.017 0.023 -0.025 0.017 0.493 -0.505 0.471 -0.508 0.439 -0.484 0.356 -0.419 0.293 -0.359 0.262 -0.329 0.233 -0.301 0.194 -0.261 0.162 -0.229 0.125 -0.206 0.540 -0.554 0.548 -0.537 0.509 -0.512 0.322 -0.382 0.195 -0.258 0.147 -0.194 0.101 -0.144 0.059 -0.095 0.013 -0.050 -0.012 -0.038
GoP level (G)
ρXM 0.399 0.382 0.292 0.270 0.187 0.141 0.109 0.066 0.028 0.019 0.390 0.397 0.237 0.284 0.179 0.124 0.077 0.028 -0.017 -0.025 0.501 0.465 0.429 0.347 0.285 0.254 0.224 0.187 0.155 0.118 0.546 0.545 0.499 0.309 0.186 0.140 0.095 0.056 0.013 -0.012
(G)
ρXQ -0.483 -0.464 -0.301 -0.356 -0.260 -0.198 -0.144 -0.078 -0.035 -0.034 -0.426 -0.421 -0.263 -0.306 -0.201 -0.138 -0.086 -0.021 0.028 0.022 -0.472 -0.460 -0.436 -0.381 -0.326 -0.298 -0.270 -0.232 -0.201 -0.179 -0.518 -0.502 -0.474 -0.348 -0.235 -0.176 -0.131 -0.084 -0.043 -0.031
The MC-3DEZBC encoder, on the other hand, produces more variable video frame sizes, but matches the target bit rate over longer time scales. As a result, the traffic produced by the MC-3DEZBC encoder accurately fits the target bit rate overall, but is more variable over shorter time scales. In Figure 6.46 we plot the frame size autocorrelation coefficients as a function of the lag k for a target bit rate of 300 kbps. The autocorrelations of the MC-3DEZBC encodings drop sharply and are reduced to very small values for higher lags k. The autocorrelation coefficient for The Lady and The Tramp encoded in MPEG-4, however, only drops off sharply at the beginning and levels out around 0.2. This outcome indicates that there is some correlation between relatively distant frame sizes for The Lady and The Tramp MPEG-4 encoding. The autocorrelation for The Terminator encoded in MPEG-4, however,
Table 6.28: Overview of frame statistics for The Terminator encoded with the wavelet-based MC-3DEZBC and the DCT-based MPEG-4 encoder.

Video Encoder         Target Rate  Compress. Ratio  Mean X̄    CoVX     Peak/Mean
                      [kbps]       YUV:Enc          [kbyte]   SX/X̄     Xmax/X̄
MC-3DEZBC (Wavelet)   25           367.696          0.103     0.198     2.979
                      75           121.979          0.312     0.322     3.911
                      100           91.421          0.416     0.334     3.826
                      300           30.434          1.249     0.340     4.173
                      600           15.212          2.499     0.321     3.336
                      800           11.408          3.332     0.307     3.096
                      1000           9.126          4.166     0.297     2.867
                      1200           7.605          4.999     0.284     2.766
                      1400           6.518          5.832     0.272     2.642
                      1600           5.704          6.665     0.259     2.435
MPEG-4 (DCT)          25            74.186          0.512     0.319     3.101
                      75            74.183          0.512     0.319     3.101
                      100           74.149          0.513     0.319     4.061
                      300           30.399          1.251     0.338    12.455
                      600           15.203          2.501     0.474     6.229
                      800           11.403          3.334     0.623     7.134
                      1000           8.257          4.604     0.884     5.300
                      1200           7.602          5.001     0.826     4.879
                      1400           6.516          5.834     0.763     4.182
                      1600           6.362          5.975     0.809     4.501
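The compression ratio column relates the uncompressed YUV frame size to the mean encoded frame size. A small sketch (assuming QCIF 4:2:0 source material, i.e., 176x144 luminance pixels plus two quarter-size chrominance planes):

```python
def compression_ratio(mean_encoded_frame_byte, width=176, height=144):
    """Ratio of the uncompressed YUV 4:2:0 frame size to the mean encoded frame size."""
    yuv_frame_byte = width * height * 3 / 2   # 1 byte luminance + 0.5 byte chrominance per pixel
    return yuv_frame_byte / mean_encoded_frame_byte

# Example: The Terminator at 300 kbps with a mean frame size of 1.249 kbyte
# gives 176*144*1.5 / 1249, roughly 30.4, consistent with the YUV:Enc column in Table 6.28.
print(compression_ratio(1249.0))
```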
Fig. 6.44: Coefficient of variation as a function of the target bit rate for wavelet 3D-EZBC and MPEG-4 encodings.
Fig. 6.45: Aggregated frame size trace with aggregation level of a = 792 frames for target bit rate 300 kbps for The Terminator encodings with 3D-EZBC and MPEG-4.

seems to be more neutral, producing only minimally autocorrelated frame sizes. We now compare the quality of the 3D-EZBC and MPEG-4 encodings based on the peak signal to noise ratio (PSNR). The basic video quality statistics for The Terminator are given in Table 6.29. We begin our observation at the target bit rate of 100 kbps and up, as for the lower target bit rates the bounded MPEG-4 quantization scale setting does not allow for a fair comparison. We observe that the average video quality for the MPEG encoded video sequences is always lower than for the 3D-EZBC encodings. Earlier comparison studies in [137], where only the lowest target bit rates were evaluated, showed a difference of approximately 0.5 dB in favor of DCT-based video encodings based on the PSNR of the luminance component. In contrast, we find that the quality difference increases with the target bit rate and even reaches a significant difference of more than 7 dB, but in favor of the wavelet-based encodings. We also find that the video quality of the wavelet-based encodings is always higher than for the DCT-based MPEG-4 encodings for target bit rates higher than 100 kbps. Our results indicate that the quality difference between wavelet and MPEG encoded video increases faster than linearly with the target bit rate. Note that for network simulation studies the quality to bit rate relationship cannot simply be scaled. Our results furthermore show that for higher target bit rates, the wavelet-based 3D-EZBC clearly outperforms the DCT-based MPEG-4 encoding.
Fig. 6.46: Autocorrelation as a function of the lag k for wavelet 3D-EZBC and MPEG-4 encodings.

Table 6.29: Overview of quality statistics for The Terminator encoded with the wavelet-based MC-3DEZBC and the DCT-based MPEG-4 encoder.

Video Encoder         Target Rate [kbps]   Q̄       CoQV     Qmax min
MC-3DEZBC (Wavelet)   25                   25.13   0.145    72.570
                      75                   29.76   0.123    69.070
                      100                  31.01   0.118    68.030
                      300                  36.86   0.102    62.480
                      600                  41.65   0.090    57.940
                      800                  43.93   0.083    55.480
                      1000                 45.83   0.077    53.670
                      1200                 47.39   0.071    51.700
                      1400                 48.72   0.065    49.930
                      1600                 49.84   0.058    49.160
MPEG-4 (DCT)          25                   30.18   0.081    29.688
                      75                   30.18   0.081    37.507
                      100                  30.19   0.081    37.454
                      300                  35.41   0.134    38.628
                      600                  39.18   0.139    65.120
                      800                  40.41   0.160    65.120
                      1000                 41.25   0.188    65.594
                      1200                 41.90   0.190    64.356
                      1400                 43.12   0.191    66.059
                      1600                 42.55   0.221    66.039
Figure 6.47 illustrates the average video quality for the two encoding methods and the two evaluated movies. The average qualities for both encoding methods increase over the whole target bit rate scale, although the marginal return in terms of quality decreases with increasing target bit rates (i.e., for higher target bit rates, an increase in the bit rate results in a less than linear
Fig. 6.47: Average video quality as a function of the encoding target bit rate for wavelet 3D-EZBC and MPEG-4 encoded movies.
Fig. 6.48: Coefficient of quality variation for wavelet 3D-EZBC and MPEG-4 encoded movies.
increase in quality). From Table 6.29 we observe that the variation of the video quality CoQV increases over the whole quantization scale for the MPEG-4 encoded The Terminator, whereas the CoQV decreases over the whole quantization scale for the 3D-EZBC encoding. We illustrate this characteristic of the CoQV in Figure 6.48. The quality range Qmax min follows the same trend, decreasing in value for the 3D-EZBC encoding while increasing in value for the MPEG-4 encoding. For the transmission of video, the encoded video quality and the video traffic have to be taken into account. We use the coefficient of correlation as a measure of (linear) dependency to determine the correlation between video traffic and video quality. We start by comparing the correlation of mean frame sizes and mean frame qualities for target bit rates greater than or equal to 100 kbps. For The Lady and The Tramp as well as The Terminator, we obtain a correlation of 0.9 between quality and size for the 3D-EZBC and MPEG-4 encodings. This indicates a strong correlation between the quality and the size of the encoded video frames for different target bit rates. The correlation between the coefficient of variation of the frame sizes CoVX and the coefficient of quality variation CoQV, also calculated starting from the 100 kbps target bit rate, is similarly pronounced and above 0.85 for both considered video encoding methods and both evaluated video sequences. These findings indicate that frame quality and frame size are strongly dependent. In addition, we observe that video quality variability and video traffic variability are highly correlated for the 3D-EZBC and MPEG-4 encodings.
6.5 Video Trace Statistics for MPEG-4 FGS Encoded Video

Here we present the analysis of a short video clip of 828 frames encoded in CIF format, obtained by concatenating the test sequences Coastguard, Foreman, and Table in this order. We segmented (by hand) the resulting clip into 4 scenes (T1 = 1, T2 = 301, T3 = 601, T4 = 732) corresponding to the 4 shots of the video (the table sequence is composed of 2 shots). The (I, P, B) quantizers used to encode the base layer were fixed to (4, 4, 4) and (10, 14, 16) for the high and low quality versions of the base layer, respectively. Figures 6.49 and 6.50 show the quality of the successive video frames Q(t) for the 4 scenes, when only the base layer is decoded and when a substream of the FGS enhancement layer (at rate C = 3 Mbps) is added to the base layer before decoding. We make the following observations for both low and high base layer qualities. First, the average video frame quality changes from one scene to another for the base layer–only stream and also when a constant rate enhancement layer is added (this is confirmed for all enhancement layer rates in Figures 6.59 and 6.61). This trend in the image quality time series suggests analyzing the quality statistics of each scene separately [138].
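In the FGS evaluations that follow, a substream at rate C is obtained by truncating each frame's enhancement-layer bit planes to a per-frame bit budget. The following sketch illustrates this cutting step (a simplified model assuming per-frame enhancement-layer sizes in bits and a fixed frame rate; the actual traces also account for bit plane boundaries and headers):

```python
import numpy as np

def cut_enhancement_layer(el_sizes_bit, rate_c_bps, frame_rate_hz):
    """Truncate the FGS enhancement layer of each frame to a per-frame budget of C/F bits."""
    el = np.asarray(el_sizes_bit, dtype=float)
    budget = rate_c_bps / frame_rate_hz   # bits available per frame at cutting rate C
    return np.minimum(el, budget)         # transmitted enhancement-layer bits per frame

# Hypothetical usage: cut a 3 Mbps substream from a 30 frames/s enhancement-layer trace
# sent_bits = cut_enhancement_layer(np.loadtxt("clip_el_sizes_bits.txt"), 3e6, 30.0)
```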
Fig. 6.49: Image PSNR Q as a function of image number t for “Clip” encoded with low quality base layer.
Fig. 6.50: Image PSNR Q as a function of image number t for “Clip” encoded with high quality base layer.
For a given scene, we see that for the BL there are significant differences in the quality achieved for successive images. Most of these differences are introduced by the different types of BL images (I, P, B); the frames with the highest quality correspond to I frames. When adding a part of the EL (at rate C = 3 Mbps in the figures), we see that these differences are still present, even if they have changed in magnitude. Therefore, this suggests distinguishing between the different types of images in order to study the RD characteristics of the FGS EL. We additionally notice that scenes 2 and 3 feature high variations of the average quality for a given frame type within the same scene. Scene 2 corresponds to the Foreman sequence, in which the camera pans from the foreman's face to the building. A better scene segmentation tool would have segmented scene 2 into two different scenes, since the foreman's face and the building have different complexities. These observations indicate that the variations in video frame quality after decoding the base layer and any enhancement substream are mainly due to the base layer encoding process (single-layer encoder and quantization parameters used). Figures 6.51 and 6.52 show the aggregate size of the FGS enhancement layer bit planes and Figures 6.53 and 6.54 illustrate the size of the BL frames. We observe that, in general, I frames have fewer bit planes than P or B frames and the total number of bits for the enhancement layer frames is larger for
Fig. 6.51: Aggregate size of the enhancement layer bit planes Y ei as a function of frame number t for “Clip” encoded with low quality base layer.
Fig. 6.52: Aggregate size of the enhancement layer bit planes Y ei as a function of frame number t for “Clip” encoded with high quality base layer.
Fig. 6.53: Size of base layer frames X b as a function of image number t for “Clip” encoded with low quality base layer.
Fig. 6.54: Size of base layer frames X b as a function of image number t for “Clip” encoded with high quality base layer. P and B frame types than for I frame types. This is because I frames have a higher base layer–only quality. Therefore, fewer bit planes and fewer bits are required to code the enhancement layer of I frames. For the same reason, when comparing high and low base layer qualities, we observe that the enhancement layer corresponding to the high quality base layer needs, for most video frames, fewer bit planes than the enhancement layer corresponding to the low quality base layer. When comparing the average size of enhancement layer frames for all scenes with the average size of the corresponding base layer frames, we see that the larger the average base layer frame size, the larger the average enhancement layer frame size. This can be explained by the different complexities of the scenes. For example, we observe that it requires fewer bits to code I frames in the first part of scene 2 than to code I frames in scene 1, meaning that the complexity of the scene 1 video frames is larger than that of the video frames in scene 2. Therefore, the average number of bits required to code the enhancement layer of scene 1 video frames is larger than for the first part of scene 2. We plot in Figures 6.55, 6.56, 6.57, and 6.58 the RD function Qes(C) (the quality improvement brought by the enhancement layer as a function of the FGS encoding rate) for different types of images within the same GoP. Note that some RD functions feature a few outliers (at low FGS bit rates). The plots
Fig. 6.55: Improvement in PSNR Qe as function of the FGS bitrate C for successive I and B images in scene 1 of “Clip” encoded with low quality base layer.
Fig. 6.56: Improvement in PSNR Qe as function of the FGS bitrate C for successive I and B images in scene 1 of “Clip” encoded with high quality base layer.
Fig. 6.57: Improvement in PSNR Qe as function of the FGS bitrate C for successive B and P images in scene 2 of “Clip” encoded with low quality base layer.
Fig. 6.58: Improvement in PSNR Qe as function of the FGS bitrate C for successive B and P images in scene 2 of “Clip” encoded with high quality base layer.
Fig. 6.59: Average image quality by scene Q̄s as a function of the FGS bit rate C for all scenes of “Clip” encoded with low quality base layer. confirm that the RD functions of the enhancement layer depend on the type of image of the BL and the particular scene. (i) We first see that the RD functions are different for each bit plane, indicating that the bit planes have different characteristics. (ii) Also, the maximum gain in quality for the same amount of EL data added to the BL, i.e., Qe((k + 1) · c) − Qe(k · c), for k = 1, . . . , m − 1, is always achieved as we get closer to the end of a bit plane. This may be due to the bit plane headers. Indeed, the more bits in a given bit plane after truncation, the smaller the share of the bit plane header in the total data for this bit plane. Figures 6.60 and 6.62 give the variance of the image quality for the different scenes of the video for both low and high BL qualities. Scene 2 is the scene with the largest variance, because of the variations in average image quality from the beginning to the end of the scene. We see that, for a given scene, the variance in quality changes with the FGS rate. These fluctuations can be explained by the different bit-plane RD functions of the different types of frames within a given scene: for the same FGS cutting rate C, the gain in quality Qe(C) is different for I, P, and B pictures. Finally, Figures 6.63 and 6.64 give the autocorrelation function of the image quality for the base layer and the FGS rates C = 1 Mbps and C = 3 Mbps. We observe periodic spikes which correspond to the GoP pattern. We also see that, at small lags, there are high correlations in quality for the different types of pictures at all FGS rates. In particular we see that, although at FGS rates C = 1 Mbps and C = 3 Mbps the variance in quality is for most
Fig. 6.60: Standard deviation of image quality by scene σQs as a function of the FGS bitrate C for all scenes of “Clip” encoded with low quality base layer.
Fig. 6.61: Average image quality by scene Q̄s as a function of the FGS bitrate C for all scenes of “Clip” encoded with high quality base layer.
Fig. 6.62: Standard deviation of image quality by scene σQs as a function of the FGS bitrate C for all scenes of “Clip” encoded with high quality base layer.
Fig. 6.63: Total autocorrelation in image quality ρQ for “Clip” encoded with low quality base layer.
Fig. 6.64: Total autocorrelation in image quality ρQ for “Clip” encoded with high quality base layer. scenes higher than for the base layer only (see Figures 6.60 and 6.62), the autocorrelation in quality is slightly higher at small lags when adding the enhancement layer to the base layer.
6.6 Video Trace Statistics for MDC Encoded Video For the overhead calculation of the H.264 encoded video streams we used a quantization parameter of 31 as the default parameter. Later we also investigate the impact of this parameter by varying its value between 1 and 51. We use a group of pictures (GoP) structure with one I frame and eleven P frames. In Figure 6.65 the overhead of selected H.264 video sequences in the QCIF format is given for different numbers of sub–streams. The overhead increases with larger numbers of descriptors. The largest overhead increase occurs when going from one descriptor to two. In Figure 6.66 the overhead is given for all six H.264 encoded CIF video sequences for different numbers of sub–streams. Table 6.30 presents the mean frame size values versus the number of descriptors for the six video sequences in the CIF format. The QCIF format results are given in Table 6.31, which presents the mean frame size values versus the number of descriptors for 12 video sequences. The video content plays a crucial role for the mean frame size values. If the video has relatively low motion, an increasing number of descriptors does not increase the mean frame
Fig. 6.65: Overhead of selected H.264 video sequences in the QCIF format for different number of sub–streams.
Fig. 6.66: Overhead of selected H.264 video sequences in the CIF format for different number of sub–streams.
Table 6.30: Mean frame size values (in bit) for the CIF video sequences.

Descriptors J  Bridge–Close  Bridge–Far   Mobile     Paris    Tempete   Highway
 1              13028.7       2697.2     44444.8    17353.0   32471.4    6745.8
 2              16270.7       3040.9     50888.9    20265.0   36827.1    8067.1
 3              17769.6       3267.5     55049.0    22026.6   40113.6    8799.9
 4              18281.9       3425.8     58989.0    23435.0   42368.7    9489.9
 5              18832.6       3483.0     60368.8    24221.1   44581.8    9021.5
 6              18952.8       3605.9     65383.0    25031.9   46092.2   11363.6
 7              19188.1       3607.7     67084.5    25821.8   48817.0   11694.3
 8              19252.9       3617.5     71311.1    26755.0   48816.0    9939.0
 9              19612.0       3685.2     70803.8    26599.9   51178.5   11413.5
10              19641.0       3735.7     73501.6    27116.7   52878.4   11655.4
11              19935.2       3729.9     76772.4    27371.8   51725.5   11890.6
12              19738.9       3804.6     79851.5    28605.2   53467.8   12084.4
13              19995.1       3856.9     76627.1    28653.8   54710.4   12150.8
14              20026.3       3874.4     79530.2    29440.6   56774.6   12246.5
15              20481.8       3867.5     82563.6    28808.4   58449.8   12596.3
16              20318.1       3872.9     86784.8    29858.7   60229.0   12999.2
17              20363.0       4015.8     88745.8    30850.0   61271.4   12883.4
18              20610.1       3939.7     91273.5    29662.7   63293.7   13081.8
19              20523.0       4082.0     93227.7    30308.0   65137.8   13229.8
20              20725.7       3955.5     93888.0    31164.5   66157.5   13352.1
size values dramatically, as in the case of the bridge–far video sequence. However, if the video has high motion activity, as in the case of the highway video sequence, the mean frame size values increase dramatically with the increasing number of descriptors. As seen in Tables 6.30 and 6.31, the mean frame size values increase with an increasing number of descriptors except at some points. For example, the mean frame size value of the bridge–far video sequence in the QCIF format decreases from 833.8 to 815.8 when the number of descriptors is increased from 3 to 4. This effect can be explained by the used GoP structure. Changing the number of descriptors also changes the ratio of used I frames, which has a slight impact on short video sequences. In Figure 6.67 the overhead for the container video sequence in the QCIF format is given for different quantization values between 1 and 51. In Figure 6.68 we also present the bandwidth requirements for the container video sequence in the QCIF format for the same quantization values. In Figures 6.69 and 6.70 the overhead and the bandwidth requirements are given for the foreman video sequence.
Table 6.31: Mean frame size values (in bit) for the QCIF video sequences.

# of descriptors  bridge–far   claire   grandma   highway   bridge–close   container
 1                  807.0      1483.8    1944.3    2448.2      2977.2        2229.3
 2                  825.9      1773.6    2224.6    2886.4      3601.9        2588.2
 3                  833.8      2019.7    2445.1    3112.5      3909.0        2908.0
 4                  815.8      2216.6    2626.0    3350.0      4029.1        3128.3
 5                  830.2      2388.4    2720.7    3532.0      4167.1        3185.2
 6                  865.0      2418.2    2878.1    3669.7      4242.1        3617.1
 7                  851.7      2531.2    2968.5    3740.1      4251.8        3724.3
 8                  843.6      2646.4    2941.9    3897.5      4271.7        4049.5
 9                  872.0      2671.5    2968.5    4008.3      4380.6        3807.0
10                  878.8      2937.7    3193.7    4136.1      4443.4        4186.9
11                  872.9      2850.3    3220.3    4200.2      4532.0        4540.7
12                  868.2      2961.1    3185.0    4223.2      4440.3        4789.1
13                  899.0      2957.6    3360.4    4268.7      4522.3        4363.8
14                  904.4      2920.0    3480.7    4287.9      4572.0        4569.1
15                  891.2      3047.5    3368.8    4459.0      4677.0        4868.8
16                  864.7      3028.2    3441.0    4577.0      4605.8        5042.2
17                  926.5      3072.5    3648.3    4566.1      4612.2        5231.5
18                  918.6      3108.1    3346.8    4617.3      4756.5        5445.5
19                  948.1      3267.3    3397.6    4639.0      4671.3        5643.7
20                  888.5      3067.3    3703.2    4699.6      4734.0        5715.2

# of descriptors  mthr dotr   salesman   silent     news     carphone   foreman
 1                 2391.7      2832.2    3530.8    3368.4     4719.0     4408.7
 2                 2910.9      3309.6    4246.5    4229.4     5672.4     5446.9
 3                 3297.4      3688.2    4743.0    4809.2     6284.2     6298.8
 4                 3618.3      3997.8    5048.9    5176.4     6728.2     6904.9
 5                 3823.1      4145.9    5092.1    5291.2     7085.2     7410.4
 6                 4184.4      4446.4    5545.9    5858.8     7420.5     7843.6
 7                 4384.9      4679.1    5671.6    5980.5     7708.7     8020.4
 8                 4546.8      4746.8    5997.8    6252.3     7827.7     8529.2
 9                 4727.2      4972.0    5857.6    6231.5     8067.8     8464.9
10                 4782.0      4842.1    6097.8    6384.0     8508.4     9188.6
11                 5140.9      5177.6    6242.0    6762.0     8583.5     8970.2
12                 5196.9      5490.5    6589.4    7048.9     8682.0     8910.7
13                 5485.0      5144.4    6065.7    6570.4     9015.1     9415.7
14                 5482.7      5331.0    6198.4    6912.0     9043.2     9951.4
15                 5543.2      5617.1    6489.6    6988.0     9196.1    10442.1
16                 5523.0      5752.5    6551.1    7240.4     9003.1    10340.4
17                 5834.4      5821.2    6573.1    7463.5     8997.4     9725.2
18                 6114.4      5292.6    6773.0    7695.5     9166.0     9890.5
19                 6170.8      5408.6    7080.0    7936.0     9575.2     9922.2
20                 5871.0      5582.5    7043.7    7844.8     9825.6    10216.4
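As a rough illustration of how overhead curves such as those in Figures 6.65 and 6.66 relate to the mean frame sizes in Tables 6.30 and 6.31, the following sketch (in Python) computes the relative increase in mean frame size over the single-descriptor encoding; note that this particular definition of the overhead is an assumption made here for illustration, and the dictionary repeats only a few values from Table 6.31.

# Mean frame sizes (in bit) versus the number of descriptors, taken from Table 6.31.
mean_frame_size = {
    "bridge-far": {1: 807.0, 2: 825.9, 10: 878.8, 20: 888.5},
    "foreman":    {1: 4408.7, 2: 5446.9, 10: 9188.6, 20: 10216.4},
}

def mdc_overhead(sizes_by_descriptors):
    # Relative mean frame size increase over the single-descriptor encoding
    # (assumed definition of the MDC overhead, for illustration only).
    base = sizes_by_descriptors[1]
    return {j: (size - base) / base for j, size in sizes_by_descriptors.items()}

for name, sizes in mean_frame_size.items():
    overhead = mdc_overhead(sizes)
    print(name, {j: f"{100 * o:.1f}%" for j, o in overhead.items()})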
Fig. 6.67: Overhead for the container video sequence in the QCIF format for different quantization values.
Fig. 6.68: Bandwidth requirements for the container video sequence in the QCIF format for different quantization values.
Fig. 6.69: Overhead for the foreman video sequence in the QCIF format for different quantization values.
Fig. 6.70: Bandwidth requirements for the foreman video sequence in the QCIF format for different quantization values.
Part III
Applications for Video Traces
7 IP Overhead Considerations for Video Services
7.1 Introduction and Motivation So far we have focused on the generation and evaluation of video traffic. This is sufficient for the reader interested primarily in the video encoding process and the related traffic characteristics. However, if video packets are conveyed over Internet Protocol (IP) based networks, the traffic characteristics change due to the so-called IP overhead. This overhead is generated by each protocol layer involved in the transport and is added to each video packet. The additional information can significantly impact the overall traffic characteristics. Which protocol layers are involved in the transport depends on the scenario; in this chapter, the most commonly used IP protocol layers for real-time streaming and full-download scenarios are introduced. The goal is to make the reader familiar with the potential overhead within the IP world. In the early 1980s the International Organization for Standardization (ISO) specified the Open Systems Interconnection (OSI) layering model [139]. Even though the IP world does not follow these design rules to the full extent, the concept of protocol layering is still applied in the IP protocol world. Layering is motivated by the fact that layered protocols can be reused for different platforms and that testing becomes much easier than for a layer-less entity. On the other hand, the price for the layering is additional overhead, which each layer adds to the payload to communicate with its counterpart protocol layer. As the focus of this book is on video services, the following IP protocol stack is envisioned: The highest layer, corresponding to the application layer, hosts the video coder and decoder. As we have learned in Chapter 3, the video coder outputs video packets of a certain size and type. These packets are forwarded to the next lower protocol layer. As given in Figure 7.2, two protocol stacks can be used, namely the TCP/IP or the RTP/UDP/IP stack. While TCP is used for full download in conjunction with the File Transfer Protocol (FTP) or also for some niche streaming services, RTP/UDP is designed for real-time transport and used by most of the streaming services.
Fig. 7.1: Example of protocol overhead for RTP/UDP/IP. TCP as well as RTP/UDP use IP as the underlying network layer protocol. The common denominator IP then uses any link level communication system to transport the IP datagram. Without going into detail, the overhead that has to be added to each video layer, including the RTP/UDP or TCP overhead, is 40 byte for IP version 4 and 60 byte for IP version 6. In some wired as well as wireless networks, however, header compression schemes are applied to decrease this overhead down to a few bytes. The main reasons to apply header compression schemes are the reduced delay and the reduced bandwidth requirements. In the following, we explain the IP-based overhead in detail for different communication scenarios, including the signaling domain, and at the end of this chapter we show how header compression schemes reduce the overhead significantly at the cost of robustness.
Fig. 7.2: Data and signaling plane with IP protocols (session layer: SAP for session discovery, SIP for media selection, RTSP for media control, RTP/RTCP for QoS; transport layer: TCP/UDP; network layer: IP, with IGMP for multicast and RSVP for QoS).
7.2 Data Plane 7.2.1 Real Time Protocol (RTP) and User Datagram Protocol (UDP) The key enabling protocol for multimedia streaming over IP networks is the Real Time Protocol (RTP) specified in RFC 1889. RTP works in combination with the Real Time Control Protocol (RTCP), which is introduced later in Section 7.3.5. These two protocols run on top of the User Datagram Protocol (UDP). As UDP is the underlying protocol for RTP, we discuss these two protocols together with respect to the IP overhead. For the quick reader: each incoming video packet gets a 12 byte RTP header and an eight byte UDP header. In the remainder of this section, the RTP and UDP protocols and the related overhead are described in more detail. The Real Time Protocol (RTP), as specified in RFC 1889, is a transport mechanism for real-time data. It consists of two components: RTP and RTCP, the RTP Control Protocol. Both RTP and RTCP typically run over UDP, but can use any other packet-oriented transport protocol. In a multimedia session, audio and video streams are separated and transmitted over different RTP sessions using different UDP port addresses. In case of multi-layered or multiple description coded video, each layer is conveyed over a separate RTP/UDP/IP session, as given in Figure 7.3. Thus, the overhead has to be taken into account multiple times. The RTP header has a fixed structure and is always 12 bytes long. The User Datagram Protocol (UDP), described in RFC 768, is designed to support unicast and multicast services between applications. No guarantees
Fig. 7.3: Example of multi-layered multimedia transport over RTP/UDP/IP. about packet delivery or ordering can be given. Applications communicate via the UDP protocol by specifying source and destination ports. The resulting header format therefore contains only a two byte source port and a two byte destination port. The remaining 4 bytes consist of a two byte datagram length field and a two byte checksum, resulting in a total overhead of 8 byte for the UDP header. Using the video traces for simulation purposes with RTP/UDP is quite straightforward. As these protocols are not responsive to the underlying transport, the only important impact in terms of traffic characterization is the overhead of 20 bytes, composed of the 12 byte for RTP (without the payload type) and the eight byte for UDP. 7.2.2 Transmission Control Protocol (TCP) In contrast to the RTP/UDP scenario, the Transmission Control Protocol (TCP), given in RFC 793, is responsive to the underlying communication system and, besides the 20 byte TCP protocol overhead per packet, the TCP behavior has to be taken into account with its retransmission strategy and flow control. Most simulation tools nowadays offer a standard compliant TCP version, such as NS2. Nevertheless, TCP requires much more care in the performance evaluation than the RTP/UDP suite.
Fig. 7.4: IP fragmentation example. 7.2.3 Internet Protocol (IP) The Internet Protocol (IP), as defined in RFC 791, describes the packet-switched communication between IP end systems. The protocol is designed to convey IP datagrams from an IP source towards an IP destination. To achieve this goal, an IP header is added to the payload. This header information is 20 byte for IP version 4 and 40 byte for IP version 6. The full header format is given in RFC 791, Section 3.1. The most important capability with regard to the video traces is the possibility to fragment larger IP datagrams into smaller portions. Fragmentation takes place when a maximum size referred to as the maximum transfer unit (MTU) is exceeded. The MTU size is set to 1500 bytes in Ethernet-based communication systems. In case the IP segment plus header is larger than the MTU size, the segment is cut off at the MTU threshold. The remaining information of the segment is combined with a new IP header, as shown in Figure 7.4. Assuming an Ethernet-based MTU of 1500 bytes and an IP version 4 header of 20 byte, IP would split a 3000 byte segment into two fragments of 1480 byte (plus 20 byte header each) and a remainder of 40 byte (plus 20 byte header).
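The fragmentation arithmetic above can be sketched as follows (a minimal Python illustration assuming IPv4 with a 20 byte header and no options); it reproduces the 3000 byte example for an Ethernet MTU of 1500 bytes.

def ip_fragments(segment_size, mtu=1500, ip_header=20):
    # Split a transport-layer segment into IP fragment payload sizes.
    # Note: real IP fragmentation aligns payloads to 8-byte boundaries;
    # this sketch ignores that detail for clarity.
    max_payload = mtu - ip_header
    payloads = []
    remaining = segment_size
    while remaining > 0:
        chunk = min(remaining, max_payload)
        payloads.append(chunk)
        remaining -= chunk
    return payloads

print(ip_fragments(3000))   # [1480, 1480, 40] -> datagrams of 1500, 1500, and 60 byte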
7.3 Signaling Overhead In parallel to the data plane, signaling messages have to be exchanged to establish and to maintain the session. Here, the most commonly used signaling protocols are briefly explained. Unfortunately, the traffic generated by the signaling entities is hard to predict, as it depends on user interactions, variations in the channel
quality, and other factors. Only for RTCP is the maximum allowed traffic specified, namely 5% of the overall session traffic. Nevertheless, here the most important signaling protocols, namely SAP, SIP, SDP, RTSP, and RTCP, are introduced. The Session Announcement Protocol (SAP), in conjunction with the SIP and/or RTSP protocols, initiates the streaming of a video. SAP announces both multicast and unicast sessions to a group of users. SIP initiates multimedia sessions in a client/server manner. An open issue is how the client retrieves the destination address; possible solutions are that the address is well known or that it is provided by SAP. RTSP is simply a “remote control” used to control unicast streams in a server/client manner. SIP, SAP, and RTSP use the Session Description Protocol (SDP) to describe the media content and run either over the TCP or the UDP protocol. Note that SDP is missing in the figure, as it is not a real protocol, but rather a textual description language, similar in style to HTTP. 7.3.1 Session Description Protocol (SDP) SDP is used to describe a multimedia session. The SDP message contains a textual coding that describes the session; more specifically, it gives 1. the transport protocol used to convey the data, 2. a type field to distinguish the media (video, audio, etc.), and 3. the media format (MPEG-4, etc.). Furthermore, the SDP message may contain the duration of the session, security information (encryption keys), and the session name in addition to the subject information (e.g., Arielle © Disney). SDP messages can be carried in any protocol, including HTTP, SIP, RTSP, and SAP. Originally, SDP was designed for the support of multicast sessions. The information relating to the multicast session was conveyed using SAP. More recently, SDP is also used in combination with SIP and RTSP. 7.3.2 Session Announcement Protocol (SAP) The SAP is used for advertising multicast sessions. In brief, SAP discovers ongoing multicast sessions and seeks out the relevant information to set up a session. (In case of a unicast session the setup information might be exchanged or known by the participants.) Once all the information required for initiating a session is known, SIP is used to initiate the session. 7.3.3 Session Initiation Protocol (SIP) Signaling protocols are needed to create sessions between two or more entities. For this purpose the H.323 and the SIP protocols have been standardized by two different standardization committees. H.323 was standardized by the
ITU. The IETF proposed the Session Initiation Protocol (SIP) specified in RFC 3261. In contrast to other signaling protocols, SIP is text-based, like SDP. SIP is a client/server-oriented protocol and is able to create, modify, and terminate sessions with one or multiple participants. Multi-party conferencing is enabled through IP multicast or a mesh of unicast connections. Clients generate requests and transmit them to a SIP proxy. The proxy in turn typically contacts a SIP registrar to obtain the user’s current IP address. Users register with the SIP registrar whenever they start up a SIP application on a device, e.g., PDA, laptop, etc. This allows the SIP registrar to keep track of the user’s current IP address. With SIP it is thus possible to reach users that are on the move, making SIP very relevant for wireless streaming. Using the INVITE request, a connection is set up. To release an existing connection, a BYE request is used. Besides these two requests, further requests are OPTIONS, STATUS, ACK, CANCEL, and REGISTER. SIP reuses HTTP header fields to ease the integration of SIP servers with web servers. In the SIP terminology the client is called a user agent. A host can simultaneously operate as client and as server. The call identifiers used in SIP include the current IP addresses of the users wishing to communicate and the type of media encoding used, e.g., MPEG-4 in the case of video. 7.3.4 Real Time Streaming Protocol (RTSP) RTSP may be thought of as a “remote control” for media streaming. More specifically, it is used to implement interactive features known from the VCR, such as pause and fast-forward. RTSP has many additional functionalities and has been adopted by RealNetworks. RTSP exchanges its messages over an underlying transport protocol, such as TCP or UDP. The RTSP messages are ASCII text and very similar to HTTP messages. RTSP uses out-of-band signaling to control the streaming. 7.3.5 Real Time Control Protocol (RTCP) The companion control protocol for RTP is RTCP. It is introduced together with RTP in RFC 1889. Sender and receiver exchange RTCP packets to periodically exchange QoS information. Five types of messages exist:
1. Sender Reports (SR)
2. Receiver Reports (RR)
3. Source Descriptions (SDES)
4. Application Specific Information (APP)
5. Session Termination Packets (BYE)
Each report type serves a different function. The SR report is sent by any host that generates RTP packets. The SR includes the amount of data sent so far, as well as some timing information for the synchronization process. Hosts that receive RTP streams generate the Receiver Report. This
report includes information about the loss rate and the delay jitter of the RTP packets received so far. In addition, the timestamp of the last received SR and the delay since its reception are included. This allows the sender to estimate the delay and jitter between sender and receiver. The rate of the RTCP packets is adjusted depending on the number of users per multicast group. In general, RTCP provides the following services:
1. QoS monitoring and congestion control: This is the primary function of RTCP. RTCP provides feedback to an application about the quality of data distribution. The control information is useful to the senders, the receivers, and third-party monitors. The sender can adjust its transmission based on the receiver report feedback. The receivers can determine whether congestion is local, regional, or global. Network managers can evaluate the network performance for multicast distribution.
2. Source identification: In RTP data packets, sources are identified by randomly generated 32-bit identifiers. These identifiers are not convenient for human users. RTCP SDES (source description) packets contain textual information called canonical names as globally unique identifiers of the session participants. This information may include the user’s name, telephone number, email address, and other information.
3. Inter-media synchronization: RTCP sender reports contain an indication of real time and the corresponding RTP timestamp. This can be used for inter-media synchronization, such as lip synchronization in video.
4. Control information scaling: RTCP packets are sent periodically among participants. When the number of participants increases, it is necessary to balance between getting up-to-date control information and limiting the control traffic. In order to scale up to large multicast groups, RTCP has to prevent the control traffic from overwhelming network resources. RTP limits the control traffic to at most 5% of the overall session traffic. This is enforced by adjusting the RTCP generation rate according to the number of participants.
7.4 Header Compression Schemes As mentioned above, the IP overhead for TCP/IP or RTP/UDP/IP is 40 and 60 byte for IP version 4 and IP version 6, respectively. As this overhead can occupy a large portion of the whole information packet, especially for small video formats or layered coding, header compression schemes were introduced to reduce this kind of overhead and to increase the spectral efficiency. State-of-the-art header compression schemes reduce the overhead to a tenth of its original size or even less. Header compression schemes were originally designed for wired networks, focusing on TCP/IP. Further developments include the support of multimedia services and signaling. With RFC 3095 a new header compression scheme
dedicated to wireless links was introduced, namely Robust Header Compression (RoHC). Without explaining the header compression schemes in detail, the focus here lies on the potential compression, the main mechanism used to compress the IP header, and the implications of using header compression. The most important implication is the trade-off between robustness against error propagation and compression gain. The header compression schemes can be classified into two main categories, namely header compression without and with feedback. Since the first header compression mechanisms, targeting the wired domain, did not consider channel errors, there was no need for a feedback channel. With the introduction of header compression in the wireless domain, feedback became more important to handle wireless channel errors. These errors are critical and need to be detected as soon as possible. The general concept of header compression is based on a compressor at the sender side and a decompressor at the receiver side. The compressor has two options: to send a packet without compression (with full header information) or compressed. The compression exploits redundancy within the header fields, called intra-packet redundancy, and redundancy between consecutive packets, referred to as inter-packet redundancy. The compressor sends the very first packet uncompressed to the receiver and subsequently only informs the receiver about the changes, the so-called deltas. These deltas are used at the receiver side to reconstruct the full header information based on the last received packet. This system is stable as long as no packet error occurs. Once a packet error occurs, the header information of the ongoing transmission is lost and therefore all subsequent packets are lost, even though the packets following the packet error were received correctly. To achieve at least some robustness, an uncompressed packet is sent in every k-th slot. The parameter k can be chosen very small to achieve more robustness or very large to achieve more compression. It is natural to adapt the parameter k to the GoP size for video transport to keep the error propagation as low as possible. From the standpoint of overhead, the compressed header information can be assumed to be a tenth of the original header information. Note that one out of k packets has to contain the full header information. For feedback-based header compression the overhead depends on the channel characteristics and it is therefore hard to provide deterministic values for the header sizes. In this case the full header compression scheme needs to be implemented as described, e.g., in RFC 3095.
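As a back-of-the-envelope illustration of the robustness/compression trade-off described above, the following sketch (in Python) computes the average per-packet header size when every k-th packet carries the full header and the other packets carry a compressed header assumed to be a tenth of the full size; the concrete numbers are assumptions for illustration only.

def average_header_size(full_header=40, compression_ratio=0.1, k=12):
    # Average header size per packet when one out of k packets carries the
    # full header and the remaining k-1 packets carry a compressed header.
    compressed = full_header * compression_ratio
    return (full_header + (k - 1) * compressed) / k

# Example: 40 byte RTP/UDP/IPv4 headers, full header once per 12-frame GoP.
for k in (2, 6, 12, 24):
    print(f"k={k:2d}: {average_header_size(k=k):.1f} byte average header")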
7.5 Short Example for Overhead Calculation To make the reader familiar with the IP overhead calculation, a short example is given assuming an RTP/UDP/IP protocol stack with the following video trace (frame number and frame size in byte):
1  4300
2  1200
3  1000
To the first video frame of 4300 byte, the RTP and UDP headers are added (4320 byte). With an MTU size of 1500 byte and an IP header of 20 byte, the 4320 byte are split into two fragments of size 1480 byte and one fragment of 1360 byte. To each fragment a 20 byte IP header is added. Thus, even though the video frame is only 4300 byte long, the output of the IP layer is 4380 byte. The second and third video frames are smaller than the MTU, so that the RTP, UDP, and IP headers are added only once, resulting in an extra 40 byte for each of these video frames. As a worst case scenario, 5% additional traffic for the signaling with RTCP can be assumed.
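The calculation above can be sketched for an arbitrary frame size trace as follows (a minimal Python illustration of the RTP/UDP/IPv4 overhead with Ethernet fragmentation as used in the example; the 5% RTCP allowance is applied at the end as the worst case).

RTP, UDP, IPV4, MTU = 12, 8, 20, 1500

def bytes_on_wire(frame_size):
    # RTP/UDP/IPv4 bytes emitted for one video frame, including IP fragmentation.
    segment = frame_size + RTP + UDP            # RTP packet handed down to IP
    max_payload = MTU - IPV4                    # 1480 byte payload per IP datagram
    fragments = -(-segment // max_payload)      # ceiling division
    return segment + fragments * IPV4

trace = [4300, 1200, 1000]                      # frame sizes in byte from Section 7.5
print([bytes_on_wire(s) for s in trace])        # [4380, 1240, 1040]
total = sum(bytes_on_wire(s) for s in trace)
print("with 5% RTCP allowance:", round(total * 1.05, 1), "byte")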
8 Using Video Traces for Network Simulations
We now provide guidelines on how to employ video traces for meaningful network performance evaluations. We focus primarily on how to employ the traces in simulations. For general instructions on how to conduct simulations, we refer to the standard simulation textbooks, e.g., [96, 140]. Our focus throughout this chapter is on the aspects that are unique to simulations employing video traces. Although we focus on simulations, our discussions apply analogously for using traces as a basis for traffic modeling. In the first section of this chapter, we discuss how to generate network traffic from traces for simulations. We then discuss in Section 8.2 how to meaningfully analyze and interpret the outcomes of the simulations.
8.1 Generating Traffic from Traces Typically, the first step in evaluating the performance of a network for video using traces is to generate an appropriate traffic (load) from the traces. For this traffic generation there are a number of issues to consider. These issues range from picking and preparing the video streams (traces) to the packetization of the video frames. We first address the issues at the stream level and then turn to the issues at the level of individual video frames and packets. 8.1.1 Stream Level Issues Selecting the Videos (Titles) At the stream level one first needs to select the videos (titles) to be used in the evaluation. Generally, it is advisable to select as many different videos as possible (available) to cover the different video genres and video content features © [2004] IEEE. Reprinted, with permission, from: P. Seeling, M. Reisslein, and B. Kulapala. Network Performance Evaluation with Frame Size and Quality Traces of Single-Layer and Two-Layer Video: A Tutorial. IEEE Communications Surveys and Tutorials, Vol. 6, No. 3, p. 58–78, 3rd quarter 2004.
likely to be encountered in the considered networking scenario. Selecting an appropriate mix of videos is important as the video traffic characteristics vary widely according to the video content. Let M denote the number of different videos selected for a given evaluation study. Composing the Workload Next, one needs to decide how to compose the workload from the selected set of videos. The main consideration in composing the workload is typically whether or not the networking protocol or mechanism under evaluation exploits localities of reference. A video caching mechanism, for instance, relies on localities of reference and strives to improve the network performance by caching the most frequently requested videos. A scheduling mechanism for a router output port, on the other hand, typically does not exploit any locality of reference. Thus, for evaluations of protocols and mechanisms that exploit localities of reference the workload should be composed according to the appropriate distribution. For example, studies of streaming media servers, e.g., [141], indicate that the video popularity follows a Zipf distribution [142]. More specifically, if there are M videos available, with video 1 being the most popular and video M being the least popular, then the probability that a given request is for the mth most popular video is

K / m^ζ,   m = 1, . . . , M,        (8.1)

where

K = 1 / (1 + 1/2^ζ + · · · + 1/M^ζ).        (8.2)
The Zipf distribution is characterized by the parameter ζ ≥ 0. The larger ζ, the more localized the Zipf distribution, i.e., the more popular is the most popular video. Requests for streaming videos were in an initial measurement study found to be distributed according to a Zipf distribution with ζ around 0.5 [141]. It has been observed that the requests for movies in video rental stores and video-on-demand systems are well described by a Zipf distribution with ζ in the vicinity of one [143]. Furthermore, studies of web caches indicate that requests for HTML documents and images follow approximately a Zipf distribution with ζ in the vicinity of one [144]. It is therefore reasonable to expect that requests for streaming videos generally follow a Zipf distribution with ζ in the range between 0.5 and 1. If locality of reference plays no role in the studied network protocol, it is reasonable to select the videos according to a discrete uniform distribution U[1, M], i.e., each video is selected with equal probability 1/M to satisfy a client request. This uniform random video selection ensures that the traffic patterns in the selected mix of M videos are roughly uniformly “experienced” by the protocol under study.
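A minimal sketch (in Python) of drawing video requests according to the Zipf distribution of Equations (8.1) and (8.2), or uniformly when locality of reference is irrelevant; the parameter values are illustrative only.

import random

def zipf_probabilities(M, zeta):
    # Request probabilities K / m^zeta, m = 1, ..., M, per Equations (8.1) and (8.2).
    weights = [1.0 / (m ** zeta) for m in range(1, M + 1)]
    K = 1.0 / sum(weights)
    return [K * w for w in weights]

def draw_video(M, zeta=None, rng=random):
    # Draw a video index in 1..M: Zipf if zeta is given, otherwise uniform U[1, M].
    if zeta is None:
        return rng.randint(1, M)
    return rng.choices(range(1, M + 1), weights=zipf_probabilities(M, zeta))[0]

# Example: 10 videos, zeta = 0.7 (within the 0.5..1 range suggested above).
print([draw_video(10, zeta=0.7) for _ in range(5)])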
Select Encoding Mode The next step in setting up a simulation study is typically the selection of the appropriate encoding mode(s) for the individual videos. The choice of the appropriate encoding mode, i.e., single layer or scalable encoded video, without or with rate control, depends largely on the particular protocol or mechanisms under study. We provide here a few general guidelines and considerations. Generally, one should avoid scaling the video traces. By scaling, we refer to the process of multiplying the size of each individual video frame by a constant to adjust the average bit rate of the video trace to some desired level. Generally, scaling does not provide valid traces for the desired average bit rate. To see this, consider scaling a trace for the single layer 4,4,4 encoded video with high quality and bit rate, see Table 6.7, to smaller bit rates. To obtain the average bit rate of the trace of the 30,30,30 encoded video, for instance, one would need to divide the size of every frame in the 4,4,4 trace by approximately ten. The thus scaled 4,4,4 trace would have the average bit rate of a 30,30,30 trace, but the variability (CoV and peak-to-mean ratio) of the scaled 4,4,4 trace would still be the same as for the original 4,4,4 trace. The variability of the 4,4,4 trace, however, is quite different from the variability of a 30,30,30 trace, as is evident from Table 6.7. It is therefore generally recommended to avoid scaling the traces. Nevertheless, for some evaluations it may be desirable and convenient to employ traces for rate-controlled video with a different bit rate than available. For other evaluations, it may be convenient to employ traces for different open-loop encoded videos with the same average bit rate at some prespecified level. With scaling, each open-loop encoded video (title) contributes equally to the system utilization, which makes it easy to maintain a prespecified constant utilization with a mix of different videos. For these reasons it may be necessary to scale traces before employing them in network simulations. In such situations it is recommended to use the trace with the average bit rate closest to the desired bit rate so that the scaling factor is as close to one as possible. Constant Utilization Simulation Scenario We conclude this discussion of the stream level issues by outlining the trace usage in two streaming scenarios, which frequently arise in networking studies. First, we outline a “constant utilization” scenario. Suppose we wish to examine the performance of a multiplexer, scheduler, or similar system that is fed by several streams at a specific long run average utilization level. Furthermore, suppose that we wish to examine the system performance for open-loop VBR encoded video titles and have scaled the closest traces to a common average bit rate X/T. Let J denote the number of simultaneous video streams required to achieve a desired level of system utilization J · X/(C · T), where C denotes the capacity of the system. For each of the J video streams we uniformly randomly select one of the M traces. For each selected trace we independently draw a
starting phase (frame) from a discrete uniform distribution U[1, N] over the N frames in the trace. The video frames are then processed according to the protocol or mechanism under study from the starting frame onward. One option is to continue this simulation until all N frames have been processed. (Note that due to the random starting frame the end of the trace may be reached before processing all N frames. When the end of a trace is reached, the trace is “wrapped around”, i.e., the processing continues from the beginning of the trace.) Once all N frames have been processed, we immediately randomly select a new trace and a random starting phase for the newly selected trace for each of the J streams. Thus there are always J streams in progress. There are a number of variations of the outlined constant utilization simulation, which may be appropriate depending on the protocol under study. One variation is to not continue the simulation after all N frames of a trace have been processed, but to draw a random independent stream duration (bounded by N) instead. With this approach one can study the effect of new streams starting up and the stream duration (lifetime) by varying the distribution used to draw the random stream duration. Another variation is to employ the original unscaled traces to achieve a constant utilization. This is achieved by fixing the composition J1, J2, . . . , JM of the streams such that it achieves a specific utilization Σ_{m=1}^{M} Jm · Xm/(T · C). With this approach the videos are not chosen randomly. Instead, there are always Jm streams with video m ongoing. For each stream a random uniform start phase into the corresponding trace is selected. When all the frames of a given trace have been processed or a stream’s lifetime expires, the same video is immediately started up, but with a new independent random starting phase. Thus, with this approach the number of ongoing streams of each video title is deterministic, but the traffic is random due to the random starting phases. The advantage of this approach is that it avoids the scaling of the videos and allows for studies with streams with heterogeneous average bit rates. We conclude this discussion of the constant utilization approaches by noting that they are appropriate to examine performance metrics at the packet and burst level time scale, such as packet loss and delay. However, the constant utilization approaches are not suitable for examining call level metrics, such as call blocking probabilities. Therefore, we outline next a “varying utilization” simulation scenario which is appropriate for call level evaluations, as they are required for call admission control and caching mechanisms, for instance. Varying Utilization Simulation Scenario To illustrate the “varying utilization” simulation scenario, suppose we wish to examine the performance of a call admission or caching mechanism that processes incoming requests for video streams. Depending on the current system load, cache contents, and traffic characteristics of the currently supported streams and the requested stream, the new request is either granted or denied.
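A minimal sketch (in Python) of the per-stream trace playback used in these scenarios: uniform random trace selection, a uniformly drawn starting phase, and wraparound at the trace end. The tiny example traces are placeholders, not actual trace data.

import random

def stream_frames(trace, lifetime=None, rng=random):
    # Yield frame sizes from a trace, starting at a random phase and wrapping
    # around at the trace end, for `lifetime` frames (default: one full trace).
    N = len(trace)
    start = rng.randrange(N)                    # random starting phase in [0, N-1]
    duration = lifetime if lifetime is not None else N
    for i in range(duration):
        yield trace[(start + i) % N]            # wraparound at the end of the trace

# Placeholder traces (frame sizes in byte); in practice these are the M selected traces.
traces = [[800, 300, 200, 900, 250], [1500, 400, 350, 1600, 380, 360]]
J = 3                                           # number of simultaneous streams
streams = [stream_frames(random.choice(traces)) for _ in range(J)]
print([sum(next(s) for s in streams) for _ in range(5)])   # offered bytes per frame period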
Suppose that we have selected a set of M video traces for the evaluation. To run the simulation, we need to generate client requests according to some stochastic model. The Poisson process, where the time between successive arrivals is exponentially distributed, is generally a good model for request arrivals. For each new client request we draw independently the video (e.g., according to a uniform or Zipf distribution), the starting phase, and the lifetime (duration) of the stream. Whenever the end of a stream lifetime is reached, the stream is simply removed from consideration, freeing up the system resources it occupied. The distribution of the lifetime (for which the exponential distribution is generally a good choice) and the request arrival process are adjusted to achieve the desired load level of the system. To illustrate the load level adjustment, consider a system with capacity C bit/sec to which requests for (scaled) video streams with an average bit rate of X/T arrive, and suppose each accepted video stream consumes the bandwidth X/T of the available bandwidth C. The stability limit of such a system is Jmax = C · T/X streams. Let L denote the mean of the lifetime distribution in frame periods and let ρ denote the mean request arrival rate in requests per frame period. Then the long run average fraction of calls (requests) that can be accepted is given by

(1/ρ) / (L/Jmax).        (8.3)
To see this, note that 1/ρ is the average spacing between request arrivals in frame periods, and L/Jmax is the average spacing in frame periods between call departures (streams reaching the end of their lifetime) when the system is fully loaded. We considered scaled video streams for this illustrative calculation of the load level, because some mechanisms may give preference to requests according to the average bit rate of the requested stream. With such a preferential granting of requests, the average of the average bit rates of the currently supported streams may be quite different from the average of the average bit rates of the stream requests. In concluding this discussion of the “varying utilization” simulation scenario, we point out one subtle issue with the average bit rates of the streams. The average bit rate of an original or scaled trace is calculated over all N frames of the trace. When generating a video stream from a trace by drawing (i) a starting phase from a discrete uniform distribution U [1, N ] over all frames in the trace, and (ii) a random lifetime, the average bit rate of a given thus generated stream may be quite different from the average bit rate of the trace. In particular, the average stream bit rate may be quite different from the average trace bit rate if the lifetime is relatively short compared to the length of the trace. This is because a short lifetime may “sample” a part of the trace that has unusual characteristics compared to the overall trace. (It should also be noted that in the opposite extreme with a lifetime significantly longer than the trace, and wraparound whenever the end of the trace is reached, the generated stream contains duplicate traffic patterns.) One way to enforce a desired average bit rate for each individual stream generated from a trace is to
scale the randomly selected video trace segment (from the starting phase onward until the end of the stream lifetime). Such per-stream scaling, however, is computationally demanding and as noted above may falsify the true variability characteristics. On the other hand, by generating many (short) streams from a given trace (without any per-stream scaling), the average bit rate of the streams converges to the average bit rate of the trace. It is recommended to keep these subtleties in mind when designing and evaluating a simulation study employing video traces. 8.1.2 Frame/Packet Level Issues In this section, we discuss the issues arising at the level of individual video frames and network packets (e.g., IP packets, data link layer frames). Since many of these frame and packet level issues relate closely to the video playout at the (client) receivers, we first take a brief look at the playout process. Receiver Playout Process To start the playout process of a typical MPEG video sequence with the GoP pattern IBBPBBPBBPBB (which we consider without loss of generality throughout this discussion), the decoder needs the first I and P frames before it can decode the first B frame. For this reason the frames are emitted in the codec sequence IPBB. . . by the encoder and are also transmitted in this order in practical systems, as noted in Chapter 5. To better understand the start of the playout process, consider the scenario in Figure 8.1 where the reception of the first I frame commences at time zero and is completed at time T , which denotes the frame period of the video. Each subsequent frame takes T seconds for reception. The decoding of the first B frame commences at time 3T and we suppose for illustration that the decoding of a frame takes δ seconds. Thus, the first B frame is available for display at time 3T +δ, allowing us to commence the playback by displaying the first I frame at time 2T + δ. It is straightforward to verify with a similar argument that the playback can
Fig. 8.1: Start of video playout: The first I and P frame are required to decode the first B frame.
commence at time 3T +δ if the frames are transmitted in the display sequence IBBP. . . A few remarks on the sketched playout process are in order. First, it is relevant for networking studies to note that the client suffers playout starvation when it wants to start the decoding of a video frame but has not yet fully received that frame. The client may employ error concealment techniques [145] to conceal the missing video information. The simplest technique is to continue displaying the last fully and on-time received frame. There is a range of more sophisticated techniques that attempt to decode partially received frames or extrapolate the missing frame from preceding frames. The second relevant point is that for many networking studies it may be preferable to simulate the transmission of frames in the IBBP. . . order, because the GoPs are successively transmitted with this frame order. With the IPBB order, on the other hand, the I frame of the second GoP is transmitted before the last two B frames of the first GoP. Consequently, there are a combined total of 9 P and B frames transmitted between the first two I frames and a total of 11 P and B frames between all successive I frames. This may lead to difficulties for mechanisms that smooth the video frames in individual GoPs and also for mechanisms that exploit specific alignments of the I frames in the supported streams. In addition, it should be noted that for many networking studies it may be appropriate to consider start-up delays introduced by the networking protocol under study in isolation from the playout commencement delay due to the MPEG encoder (illustrated in Figure 8.1). For such studies, it may very well be appropriate to assume that the first frame (I frame) is decoded and displayed at a time governed by the network protocol and the subsequent frame (B frame, when using the IBBP ordering) is independently decoded and then displayed when the frame period of the I frame expires. With such a simulation, the playout commencement delay due to the MPEG frame encoder order is added to the network-introduced start-up delay and possibly other delay components (e.g., server delay) to give the total start-up delay experienced by the user. Packetization As given in Chapter 7, video traffic is typically transported in Real Time Protocol (RTP) [146] packets through networks. In addition to the video payload, an RTP packet carries a 12 byte RTP header, an 8 byte UDP header, and a 20 byte IPv4 header/40 byte IPv6 header. (When TCP is used for the video transport a 20 byte TCP header is used instead of the UDP header.) The packetization, i.e., the packaging of the video data into packets, is typically governed by RFCs. The packetization of MPEG-4 encoded video into RTP packets, for instance, is described in RFC 3016 [147], which we use as a basis for our discussion of packetization. Generally, it is recommended that a given RTP packet carries data from only one video frame, such that the loss of an RTP packet will
affect only one video frame. The amount of video data in an RTP packet should be adjusted such that the complete RTP packet (consisting of video data plus headers) is no larger than the maximum transfer unit (MTU) on the path through the network to avoid fragmentation in the network (except for wireless links, which may perform fragmentation of the RTP packet carried over the wired network). In case the video frames are small, it is permitted to carry multiple consecutive video frames in one RTP packet. We note that the packet headers may contribute significantly to the total traffic, especially when low bit rate video streams are transmitted with tight real-time constraints that prohibit the grouping of multiple frames into one RTP packet. Header compression schemes have been proposed to limit the waste of bandwidth due to protocol headers in such situations, see e.g., [148]. It should also be noted that with scalable (layered) encoded video, each layer is typically packetized independently to allow for the different treatment of the layers in the network (e.g., at the IP level). Furthermore, we note that the video traces reflect only the video data — a typical video presentation, however, consists of video and audio. The bit rate of the encoded audio is in many scenarios negligible compared to the bit rate of the encoded video, see e.g., [149]. The encoded audio stream, however, is typically packetized independently from the video. This packetized audio stream may make a significant contribution to the total (video + audio) traffic, especially in the absence of header compression schemes. Packet Transmission The final packet level issue we would like to address is the transmission of the individual packets. First, consider the simple case where one packet carries a complete video frame. Depending on the overall simulation setup, the packet may be sent at once, which may be appropriate for a packet-level simulation that keeps track of the individual packets, but not the individual bits. For a fluid traffic simulation running at the granularity of frame periods, on the other hand, it may be appropriate to transmit a packet of size S bit at the constant bit rate S/T bit/sec over the duration of one frame period of length T. If a single video frame is packetized into multiple packets, it may be appropriate to space out the transmission instants of the individual packets equally over one frame period in a packet level simulation. In a fluid simulation, on the other hand, the aggregate size of all the packets would be transmitted at a constant bit rate over one frame period. Finally, consider the case when multiple video frames are packetized into a single packet. Depending on the simulation scenario, it may be preferable to transmit this single packet over one frame period (e.g., in a real-time scenario), or to transmit it over as many frame periods as there are video frames in the packet (e.g., in a non-real-time scenario).
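A minimal sketch (in Python) of the packetization and transmission spacing just described: a video frame is split into RTP packets no larger than the MTU, and the packet emission instants are spread evenly over one frame period. The header sizes and the frame period are illustrative assumptions.

RTP_UDP_IPV4 = 12 + 8 + 20     # header bytes per RTP packet (IPv4, no options)
MTU = 1500                     # bytes
T = 1.0 / 30.0                 # frame period in seconds (assuming 30 frames/sec)

def packetize(frame_size):
    # Split one video frame into RTP packet payload sizes that fit the MTU.
    max_payload = MTU - RTP_UDP_IPV4
    sizes, remaining = [], frame_size
    while remaining > 0:
        sizes.append(min(remaining, max_payload))
        remaining -= max_payload
    return sizes

def emission_times(frame_index, num_packets):
    # Spread the packets of frame `frame_index` evenly over its frame period.
    start = frame_index * T
    return [start + i * T / num_packets for i in range(num_packets)]

payloads = packetize(4300)
print(payloads)                             # [1460, 1460, 1380]
print(emission_times(0, len(payloads)))     # evenly spaced within [0, T)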
8.2 Simulation Output Data Analysis In this section we discuss how to analyze the output of a simulation involving video traces in order to draw meaningful conclusions about the networking system, protocol, or mechanisms under study. We focus again on the unique issues arising in simulations with video traces and we refer to the standard textbooks, e.g., [96, 140], for general instructions on the analysis of simulation output data. In this section we first discuss the video-related performance metrics obtained from simulations, and then discuss how to obtain statistically valid estimates of the performance metrics of interest. 8.2.1 Performance Metrics in Video Trace Simulations Loss Probability A typically considered metric in video networking is the starvation (loss) probability, which comes in two main forms. The frame starvation probability is the long run fraction of video frames that miss their decoding (playout) deadline, i.e., are not completely delivered to the receiver by the time the receiver needs them to start the decoding. The frame starvation probability may be estimated for individual clients or for the complete system under study. The information loss probability is the long run fraction of encoding information (bits) that misses its decoding (playout) deadline. The information loss probability has a finer granularity than the frame loss probability because a partially delivered frame is considered as one lost frame toward the frame loss probability (irrespective of how much data of that frame was delivered/not delivered in time), whereas the information loss probability counts only the fraction of the frame’s information in bits that were not delivered in time. As an illustrative example, consider the transmission of 10 frames — each of size 240 bit — to a client, and suppose only 120 bit of the first frame are delivered on-time (and the other 120 bit arrive after the decoding deadline). Also suppose the remaining 9 frames are all completely delivered ahead of their respective decoding deadlines. Then, the frame loss probability is 1/10 = 10%, whereas the information loss probability is 120/(10 · 240) = 5%. We note that in this example and throughout this discussion so far on the loss probability, we have ignored the dependencies between the encoded video frames. Specifically, in a typical video encoding, the I frame in a GoP is required to decode all other P and B frames in the GoP (as well as the B frames in the preceding GoP encoded with respect to the I frame starting the GoP under consideration). Thus, the loss of an I frame is essentially equivalent to the loss of all the frames in the GoP under consideration (as well as some frames in the preceding GoP). Similarly, a given P frame is required to decode all the successive P frames in the same GoP as well as the B frames encoded with respect to these P frames. Thus, the loss of a P frame is equivalent to the loss of all these dependent frames.
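The two loss metrics can be computed from a simulation log as sketched below (in Python); the example reproduces the illustrative 10-frame case from the text and ignores the inter-frame decoding dependencies discussed above.

def loss_probabilities(frames):
    # frames: list of (frame_size_in_bit, bits_delivered_on_time).
    # Returns (frame loss probability, information loss probability).
    lost_frames = sum(1 for size, ontime in frames if ontime < size)
    lost_bits = sum(size - ontime for size, ontime in frames)
    total_bits = sum(size for size, _ in frames)
    return lost_frames / len(frames), lost_bits / total_bits

# 10 frames of 240 bit each; only 120 bit of the first frame arrive on time.
frames = [(240, 120)] + [(240, 240)] * 9
print(loss_probabilities(frames))   # (0.1, 0.05)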
The information loss probability is mainly motivated by error concealment and error resilience techniques [145] that allow for the decoding of partially received video frames. Error resilience techniques are currently a subject of intense research efforts and more advances in this area are to be expected. The deployment of these techniques may be affected by the required computational effort and energy, which are often limited in wireless devices.

Video Quality

The frame loss probability and the information loss probability are convenient performance metrics for video networking, as they can be directly obtained from network simulation with video traces. However, these loss probabilities are to a large extent "network" metrics and provide only limited insight into the video quality perceived by the user. It is certainly true that a smaller loss probability corresponds in general to a higher video quality. However, it is difficult to quantify this relationship. This is because the rate-distortion curves of encoders relate only the bit rates of completely received streams (layers) to the corresponding PSNR video quality. (We should keep in mind, however, that the PSNR provides only a limited, albeit widely used, characterization of the video quality; see Chapter 6.) If a part of a stream (layer) is lost, the video quality can no longer be obtained from the encoder rate-distortion curve. In general, experiments with actual encoders, decoders, and video data are required to obtain the video quality after lossy network transport. There are, however, scenarios in which it is possible to obtain the approximate PSNR video quality after lossy network transport. One such scenario is the network transport of layered encoded video with priority for the base layer, i.e., the enhancement layer data is dropped before the base layer data when congestion arises. First, consider temporal scalable encoded video in this context. If an enhancement layer frame is completely received (and all the frames that are used as encoding references are also completely received), then the PSNR quality of the frame is obtained by adding the base layer PSNR quality of the frame (from the base layer trace) and the enhancement layer PSNR quality improvement of the frame (from the enhancement layer trace). If all the referenced frames are completely received and (a part of or) all of the enhancement layer is lost, then one can (conservatively) approximate the quality of the frame by the PSNR quality of the base layer trace. If a part or all of a frame that serves as a reference frame for the encoding of other frame(s) is lost, e.g., a P frame (in the base layer) of the encoding considered in Figure 3.16, then all frames that depend on the (partially) lost reference frame are affected. The quantitative impact of such a loss can currently only be determined from experiments with the actual video if sophisticated error concealment or error recovery techniques are employed. In case a basic error concealment scheme is employed, such as re-display of the last successfully received video frame, offset distortion traces can be used, as we describe in detail in Chapter 9.
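For the temporal scalability case just described, the per-frame quality after transport can be assembled directly from the base layer and enhancement layer traces. The sketch below illustrates this bookkeeping; the field names are illustrative, and it assumes, as stated above, that all reference frames were received:

    # Sketch: approximate per-frame PSNR for temporal scalable video, assuming
    # all reference frames are received. Trace field names are illustrative.
    def frame_psnr(bl_psnr, el_improvement, el_received):
        """bl_psnr: base layer PSNR from the base layer trace [dB].
        el_improvement: PSNR improvement from the enhancement layer trace [dB].
        el_received: True if the enhancement layer frame arrived completely."""
        if el_received:
            return bl_psnr + el_improvement
        # (Part of) the enhancement layer is lost: conservatively fall back
        # to the base layer quality.
        return bl_psnr

    print(frame_psnr(32.0, 4.5, True))   # 36.5 dB
    print(frame_psnr(32.0, 4.5, False))  # 32.0 dB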
Similarly, if a part (or all) of the base layer or the enhancement layer is lost, scalable offset distortion traces can be employed, as described in detail in Chapter 9 as well. Another scenario in which one can assess the video quality of the received video after lossy network transport is transcoding (also referred to as the cropping scenario [150]). In this scenario, single layer encoded video is transported through a network. Whenever congestion arises, the video is transcoded [54] to a lower quality (corresponding to a larger quantization scale), so that the transcoded video fits into the available bandwidth. This scenario can be (approximately) simulated using the single-layer video traces by switching to the trace of a lower quality encoding of the same video. To conclude this section on the video quality as a performance metric in video trace simulations, we note that the received video quality is generally maximized by maximizing the average frame quality and minimizing the quality variations. More specifically, the received video quality is maximized by maximizing the qualities of the individual video frames and minimizing the variations in quality between consecutive video frames.

8.2.2 Estimating Performance Metrics

As with any simulation, a key consideration when simulating a network mechanism or protocol using video traces is the statistical validity of the obtained results. We refer the reader to standard simulation texts, e.g., [96, 140], for general instructions on how to obtain statistically meaningful simulation results and focus here primarily on the aspects unique to simulation using video traces. Video traces in general, and the constant utilization and varying utilization simulation scenarios outlined in Section 8.1.1 in particular, lend themselves both to terminating simulations and to steady state simulations. In terminating simulations, several independent simulation runs are performed and the estimates of the metrics of interest are obtained by averaging the metric estimates obtained from the individual runs. A terminating simulation of the constant utilization scenario can be conducted by running several simulations, as outlined in Section 8.1.1. Each simulation is started with independently randomly selected traces, starting phases (and possibly stream lifetimes). The advantage of this terminating simulation approach is that the individual simulation runs are independent and thus the classical Student-t or Normal distribution based statistics can be used to evaluate the confidence intervals around the estimated sample means. The disadvantage of the terminating simulation approach is that each simulation run needs to be "warmed up" sufficiently to remove the initial transient. While this is not a problem for system simulations that do not require any warm-up, e.g., the simulation of a bufferless multiplexer for a constant utilization, the warm-up may be a significant problem for systems that require a warm-up, e.g., buffered multiplexers. This problem of warming up simulations
driven by self-similar input is to the best of our knowledge an open problem. We therefore only note that it is widely expected that the transient period is longer when driving simulations with self-similar input traffic and that the conventional methods, e.g., [151], may underestimate the required warm-up period. One way to mitigate this warm-up problem is to start up the entire system in steady state (in case it is known) or at least to start up the traffic load of the system at (or close to) the steady state load. Next, we consider steady state simulations where a single (typically very long) simulation run is considered and the metrics of interest are typically obtained by averaging metric estimates obtained during independent observation periods (usually referred to as batches). A steady state simulation with video traces can be conducted by running one long constant utilization simulation or one long varying utilization simulation, as outlined in Section 8.1.1. The advantage of the steady state simulation is that the warm-up period (during which the system is not observed) is incurred only once. The challenge of the steady state simulation of systems with video traces is that due to the long range dependence in the video traffic, the metric estimates of successive (non-overlapping) observation periods (batches) are typically somewhat correlated. The problem of estimating confidence intervals from these batches has received some initial interest, e.g., see the studies [152] and [153], to which we refer for details on the estimation methods. We note that a simple heuristic to obtain uncorrelated batches despite long-range dependent video traffic is to separate successive observation periods (batches) such that they are (approximately) independent. More specifically, the heuristic is to run the constant utilization or varying utilization simulation and to truncate the distribution of the stream duration at a specific value ∆. Then, separating successive batches by at least ∆ will ensure that none of the video streams that contribute to the traffic load during a given batch contributes to the traffic load during the next batch. This ensures that the successive batches are independent, provided the system under study has only little “memory”. This heuristic provides a simple way to obtain statistically meaningful performance metrics at the expense of increased simulation duration.
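A minimal sketch of this batch-separation heuristic is given below; the truncation value Δ, the batch length, the warm-up period, and the simulation horizon are illustrative parameters rather than values from the text:

    # Sketch: schedule observation batches separated by at least DELTA so that
    # no stream contributes to two successive batches (heuristic from the text).
    # All numerical values are illustrative.
    DELTA = 600.0        # truncated maximum stream duration in seconds
    BATCH_LEN = 1800.0   # length of one observation period (batch) in seconds
    WARM_UP = 3600.0     # single warm-up period before the first batch

    def batch_windows(total_sim_time):
        """Return (start, end) tuples of the observation windows."""
        windows = []
        start = WARM_UP
        while start + BATCH_LEN <= total_sim_time:
            windows.append((start, start + BATCH_LEN))
            # Leave a gap of at least DELTA before the next batch begins.
            start += BATCH_LEN + DELTA
        return windows

    print(batch_windows(8 * 3600.0))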
9 Incorporating Transmission Errors into Simulations Using Video Traces
In the previous chapter, we described how to use video traces in networking simulations to generate typical video traffic scenarios. In this chapter, we extend the utilization of video traces by showing how to incorporate transmission errors when determining the video quality. For video networking research, the encoded video can be represented in several forms, such as
• The actual encoded bit stream, which typically is large in size, copyright protected, requires expertise in encoding/decoding, and cannot be easily exchanged among researchers.
• Video traces, which carry the information of the encoded video bit stream, but not the actual encoded information, and are thus freely exchangeable among researchers.
• Video traffic models, which typically try to capture statistical properties of a certain genre of videos and are based on video traces. Additionally, models are typically limited to providing the networking researcher with a model for a specific genre of video (e.g., sports videos, news videos).
Video traces thus present an appealing opportunity for networking researchers, as results can be conveniently reproduced and exchanged among researchers. At the same time, video traces are typically smaller in size than encoded video and can be used in simulation environments without much effort.

© [2005] IEEE. Reprinted, with permission, from: P. Seeling, M. Reisslein, and F.H.P. Fitzek. Offset Distortion Traces for Trace-Based Evaluation of Video Quality after Network Transport. In Proc. International Conference on Computer Communications and Networks (ICCCN), Pages 375–380, San Diego, CA, October 2005.
© [2006] IEEE. Reprinted, with permission, from: P. Seeling, M. Reisslein, and F.H.P. Fitzek. Layered Offset Distortion Traces for Trace-Based Evaluation of Video Quality after Network Transport. In Proc. IEEE Consumer Communications and Networking Conference (CCNC), Vol. 1, Pages 292–296, Las Vegas, NV, January 2006.

Video traces
typically contain information about the encoded video frames, such as frame number and frame size, as well as the distortion or quality of the individual encoded video frames in comparison to the original and uncompressed video frames. The objective video quality is typically measured in terms of the root mean square error (RMSE) and the peak signal to noise ratio (PSNR), which is computed from the RMSE. We refer to the RMSE as distortion and to the PSNR as quality throughout this chapter. The information and frame loss probabilities in simulations, which are defined as the amount of data and the long run fraction of frames that miss their playout deadline at the receiver, can typically be determined in an easy fashion, see Chapter 8 for more details. These metrics, however, are not suitable to determine the video quality that is perceived at the receiving client. While the video traces we introduced in Chapter 5 contain information about individual video frames, such as frame size and frame distortion or quality, this information about individual frames cannot be extended to capture the losses that occur due to lossy network transport mechanisms. When video frames are not decodeable — either because they were not received in time or because they were damaged during network transport — the most basic and common approach is for the decoder to display the last successfully received and decoded frame until a new frame is correctly received and decoded. Video encoding mechanisms typically introduce a dependency among consecutive video frames, so that in most cases of individual frame losses, several video frames are lost for the decoding process due to inter-frame dependencies. The loss and subsequent re-display of the last successfully decoded video frame cannot be accommodated using the traditional video traces, as they do not contain this information. Offset distortion video traces, on the other hand, complement the traditional video traces in providing the information needed to determine the video distortion or quality of non-decodeable video frames [154, 155]. The video quality as perceived at the receiving client can then be calculated by elementary statistics.
9.1 Video Encoding and Decoding

In this section we briefly review, for a video stream consisting of N frames, (i) the commonly used video encoding schemes, which were described in detail in Chapter 3, (ii) the resulting inter-frame dependencies created by the encoding process, and (iii) the resulting error spreading in case individual frames are not available to the decoder (either due to network delay or erroneous transmission).

9.1.1 Single Layer and Temporal Scalable Encoding

The most popular video coding standards apply the DCT transform to parts of a video frame. To increase the compression efficiency, the temporal correlation of subsequent video frames is exploited by motion estimation and motion compensation techniques. Applying motion estimation and compensation techniques results in inter-frame dependencies.
Fig. 9.1: Popular video coding scheme with inter-frame dependencies [154].

To illustrate the inter-frame dependencies created by the encoding mechanisms, we consider without loss of generality a video sequence encoded with the IPPP... encoding pattern, as illustrated in Figure 9.1. The I frames are intra-coded and rely on no other frame, whereas the forward predicted P frames rely on the previous I or P frames. We note that in addition to I and P frames, bidirectionally predicted B frames can be used as well. Frames of the B type rely on the previous and the following I or P frames. This adds to the inter-frame dependencies and has to be taken into account when using a trace-based approach, as outlined below for temporal scalability. Without loss of generality, we assume that in case an individual frame is lost, all subsequent frames that rely on the lost frame cannot be decoded. For each frame that is not available to the decoder, the decoder displays the last successfully received frame. In the example illustrated in Figure 9.1, we assume that the P frame 5 cannot be decoded. Subsequently, frames 6 and 7 in the illustrated example cannot be decoded, as they rely on the availability of frame 5 at the decoder. Thus, the error from frame 5 spreads to the following frames in this example, and the decoder would re-display frame 4 as a replacement for frames 5, 6, and 7. We assume without loss of generality that the error spreading from an unavailable frame at the decoder (e.g., due to transmission errors or transmission delays) extends to subsequent frames until the decoder receives a new I frame serving as reference. This can be achieved either by following a fixed GoP structure and a limited GoP length, so that at fixed intervals a frame is encoded as an I frame, or by assuming a feedback channel from the decoder to the encoder that notifies the encoder to encode a frame as an I frame.

Using a temporal scalability encoding mode, we assume that the B frames of the single layer encoding constitute the enhancement layer (EL), while the I and P frames form the base layer (BL). For an example, we consider a temporal scalability scheme with an IBBPBBPBB... GoP pattern. With this GoP structure, the enhancement layer consists of all the B frames. As no other frames rely on the B frames in the enhancement layer, the enhancement layer can easily be added or dropped for the decoder. In the example illustrated in Figure 9.2, the base layer consists of I and P frames; the reception of the base layer gives a third of the original frame rate at the decoder, and reception of the base and enhancement layer provides the original frame rate.
Fig. 9.2: Temporal scalable video with inter-frame dependencies and different error spreading possibilities [155].

The enhancement layer B frames are encoded with respect to the preceding I or P frame and the succeeding I or P frame in the base layer. As illustrated in Figure 9.2, the loss of a base layer (reference) frame results in the loss of the referencing frames in the enhancement layer. Simultaneously, the loss of a frame in the base layer spreads to the following frames in the base layer until a new I frame is received (either by a resynchronization request from the decoder to the encoder, or by the correct reception of an I frame at the beginning of the next GoP) and the reference at the decoder has been updated. The example illustrated in Figure 9.2 shows how the P frame at position 7 is not available at the decoder. As the previous two B frames of the enhancement layer at positions 5 and 6 rely on the availability of the P frame in the base layer at position 7, they cannot be decoded. In turn, the decoder re-displays frame 4 in place of frames 5, 6, and 7. In the same way, the following frames of the base layer cannot be decoded until a new reference (I) frame of the base layer can be sent to the decoder. In turn, the following frames of the enhancement layer also cannot be decoded until the base layer has been updated with a new reference. In Algorithm 4, we provide an overview of the decoding algorithm for the single layer and temporal scalable encodings.

9.1.2 Spatial and SNR Scalable Video

Spatial scalable encoded video provides a low resolution base layer version of the encoded video for the decoder. With one or more enhancement layers available to the decoder, the resolution of the decoded video is higher. To fix ideas here, we assume that the base layer provides a QCIF resolution and one enhancement layer in addition to the base layer provides the (original) CIF version of the encoded video.
The enhancement layer frames can be encoded with respect to the corresponding base layer frames only, which we assume for our evaluations, or with respect to the corresponding base layer frame and the previous enhancement layer frame, utilizing motion estimation and compensation techniques. For additional enhancement layers, the same mechanisms described here apply analogously. We illustrate the inter-frame dependencies for the considered spatial scalable video encoding scheme with P frames only in Figure 9.3. For the decoder at the receiver, three distinct cases for the decoding exist:
1. Neither base nor enhancement layer frames are available and the decoder re-displays the last (possibly upsampled) frame.
2. The base layer frame is available and the decoder displays an upsampled version of the base layer frame.
3. Both base and enhancement layer frames are available and the decoder displays the original resolution.
In the example outlined in Figure 9.3, the first four frames are received with their base and enhancement layer encodings. For frame 5, only the base layer is received and the decoder subsequently displays the upsampled version. For the remaining two frames 6 and 7, no data is available and the decoder re-displays the upsampled frame 5 for both frames.
Fig. 9.3: Spatial scalable video with inter-frame dependencies and different error spreading possibilities [155].

SNR scalable coding provides a highly compressed and distorted version of the unencoded video frame in the base layer. With the enhancement layer information added, additional detail is available and the distortion of the decoded frame is reduced. For SNR scalable encodings, the mechanisms outlined above apply analogously. In particular, instead of displaying different resolutions, the decoder displays different qualities of the encoded frame. Thus, in the example provided in Figure 9.3, the decoder would display a high quality CIF-sized version of the video frames 1–4, a low quality version of frame 5, and re-display frame 5 in place of frames 6 and 7. We describe the decoder's behavior in Algorithm 5 for spatial and SNR scalable video.

Algorithm 5: Decoding and display algorithm for spatial scalable and SNR scalable video.
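The decode-and-display decision structure of Algorithm 5 can be sketched as follows in Python. The availability flags and helper names are illustrative, the reference checking is simplified to the P-frame-only example of Figure 9.3, and the reset of the error spreading at the next I frame is omitted for brevity:

    # Sketch of the decode-and-display logic for spatial (or SNR) scalable
    # video, simplified to the P-frame-only example of Figure 9.3.
    def display_sequence(frames):
        """frames: list of dicts with booleans 'bl' and 'el' indicating whether
        the base and enhancement layer of the frame were received in time.
        Returns, per display slot, which frame version the decoder shows."""
        displayed = []
        last_shown = None            # frame currently on screen
        bl_broken = False            # base layer reference chain is broken
        for n, f in enumerate(frames):       # n is 0-based here
            if f["bl"] and not bl_broken:
                if f["el"]:
                    shown = (n, "BL+EL")             # full resolution / quality
                else:
                    shown = (n, "upsampled BL")      # case 2 of the list above
                last_shown = shown
            else:
                # Case 1: nothing decodable; re-display the last shown frame.
                bl_broken = bl_broken or not f["bl"]
                shown = last_shown
            displayed.append(shown)
        return displayed

    # Example of Figure 9.3: four complete frames, one base-layer-only frame,
    # and two frames lost entirely.
    frames = [{"bl": True, "el": True}] * 4 + \
             [{"bl": True, "el": False}] + [{"bl": False, "el": False}] * 2
    print(display_sequence(frames))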
9.2 Video Quality after Network Transport

To determine the video frame quality or video sequence quality, subjective tests or objective metrics can be used. The subjective video quality can be determined using experiments resulting in mean opinion scores (MOS) [156]. This approach, however, requires test subjects and is therefore not practical for use in mostly automated networking research projects. Objective metrics, on the other hand, are calculated using either the source video or the encoded video. A video quality metric that requires only the encoded video bit stream is, e.g., the video quality metric (VQM) [157]. The RMSE and the PSNR have been used continuously in video encoding and networking research to determine the video frame and video sequence quality. These two metrics compare the source video and the encoded video frame by frame to determine the distortion or quality for each video frame individually. The quality for a video stream can be determined from the individual video frame quality values using elementary statistics. Typically it is assumed that the video stream quality is maximized if the quality of the individual frames is maximized and the variability of the quality among the frames of a video is minimized [41].
To quantitatively determine the impact of video frame losses on the video quality, either the video bit stream can be used, or a low quality value can be used to approximate the deteriorated quality after a frame loss [158]. Assuming a low quality value, however, is a static approach that does not take the differences in content into account. For a more thorough approach using the video bit stream, the impact of transport losses was studied for MPEG-2 in [18] and a tool was presented in [19]. In [21], the authors study the impact of differentiated QoS transport mechanisms on the video quality. For networking research, the loss of video data or video frames is typically determinable without much effort, either by experiments or by simulation. For the determination of the video quality in an environment without losses or delays during network transport, the video quality can also be determined in a fairly easy manner utilizing video traces. The video quality is typically measured in terms of the distortion, as the root mean squared error (RMSE), and in terms of the quality, as the peak signal to noise ratio (PSNR), between the original and the encoded and subsequently decoded individual video frames. The unencoded video frames are typically represented in the YUV 4:2:0 format usable for MPEG-2 and MPEG-4 encodings, whereby an 8-bit value is assigned to each pixel's luminance value and an 8-bit value each to the two chrominance components of a block of 4 pixels. Typically, only the luminance component is taken into account for the video frame quality evaluations, as the human eye is most sensitive to this component [128]. Both RMSE and PSNR are referential video quality metrics; they both require the original video frames in addition to the decoded video frames to determine the video quality. At the same time, these metrics allow for a trace-based video quality evaluation without the actual bit stream and are easily automated. We now briefly review the video quality metrics previously introduced in Chapter 4. Let us assume a video with X, Y as resolution in pixels (e.g., for QCIF X = 144, Y = 176 and for CIF X = 288, Y = 352) which consists of N video frames encoded with a quantization scale q. We denote an individual pixel's luminance value in the nth original video frame at position (x, y) as F_n^q(x, y) and its encoded and subsequently decoded counterpart by f_n^q(x, y). We calculate the video frame distortion as the RMSE over all the luminance differences of an individual frame as

RMSE_n^q = \sqrt{ \frac{1}{XY} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \left[ F_n^q(x,y) - f_n^q(x,y) \right]^2 }.    (9.1)

The video frame quality as PSNR can be calculated from the RMSE as

Q_n^q = 20 \log_{10} \frac{255}{RMSE_n^q}.    (9.2)

We calculate the average video quality or video stream quality as

\bar{Q}^q = \frac{1}{N} \sum_{n=1}^{N} Q_n^q    (9.3)

and the variability of the video frame qualities, measured as the standard deviation, as

\sigma^q = \sqrt{ \frac{1}{N-1} \sum_{n=1}^{N} \left( Q_n^q - \bar{Q}^q \right)^2 }.    (9.4)

To obtain a more useful variability metric taking the average video frame quality into account, we additionally calculate the coefficient of variation of the video frame qualities as

CoV^q = \frac{\sigma^q}{\bar{Q}^q}.    (9.5)
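These stream-level statistics are straightforward to compute from the per-frame trace entries; the sketch below, with illustrative RMSE values, mirrors Equations (9.2)-(9.5):

    import math

    # Sketch: stream-level quality statistics from per-frame RMSE values,
    # mirroring Equations (9.2)-(9.5). The RMSE values are illustrative.
    rmse = [4.2, 5.0, 3.8, 6.1, 4.9]

    def psnr(r):                       # Equation (9.2)
        return 20.0 * math.log10(255.0 / r)

    q = [psnr(r) for r in rmse]
    n = len(q)
    q_mean = sum(q) / n                                              # Equation (9.3)
    sigma = math.sqrt(sum((x - q_mean) ** 2 for x in q) / (n - 1))   # Equation (9.4)
    cov = sigma / q_mean                                             # Equation (9.5)

    print(q_mean, sigma, cov)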
We calculate the corresponding distortion metrics in an analogous manner. The video stream quality is generally maximized if the quality of the individual frames is maximized and the variability of the quality among the frames of a video stream is minimized [41]. The video frame qualities of the encoded video frames are readily available to networking researchers in video traces and can be employed in research without much effort. For lossy network transport mechanisms, or when delay is introduced into the delivery of the video stream, the decoder may not be able to receive video frames of the base and/or enhancement layer(s) in time or at all. Typically, no individual distortion or quality value can be assigned to these lost video frames and only a rough approximation, e.g., less than 20 dB [158], can be made. In order to facilitate networking research that includes the video frame quality after network transport, additional information is needed to determine the quality of the video frames that are not available to the decoder, either because the frames themselves are lost or because their references are broken. In the following, we determine the video qualities for basic error handling schemes at the decoder using the offset distortion approaches outlined in [154, 155].

9.2.1 Single Layer and Temporal Scalable Video

For single layer and temporal scalable video, the decoder can either decode an individual video frame or not. In case the video decoder is unable to decode a video frame, the last successfully decoded video frame is re-displayed at the client, as outlined in the examples in Section 9.1.1. For the successfully decoded and displayed video frames, the video quality can be determined by Equation (4.22). For video frames that are not decoded, the offset distortion can be used to determine the video quality. The offset video distortion of the encoded and decoded re-displayed video frame with respect to the original unencoded video frame can be determined as follows. Let n denote the position of the last successfully decoded video frame and let d denote the offset of the
video frame under consideration with respect to video frame n. The offset distortion is then calculated in analogy to Equation (9.1) as

RMSE_n^q(d) = \sqrt{ \frac{1}{XY} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \left[ F_{n+d}^q(x,y) - f_n^q(x,y) \right]^2 }.    (9.6)

The corresponding video frame quality can be calculated similarly to Equation (4.22) as

Q_n^q(d) = 20 \log_{10} \frac{255}{RMSE_n^q(d)}.    (9.7)

We note that with this approach, the original Equations (4.21) and (4.22) for the video frame distortion and video frame quality are given as RMSE_n^q(0) and Q_n^q(0), respectively. With this approach, the video quality after lossy network transport can be determined on a per-frame basis as outlined in Algorithm 6.

Algorithm 6: Calculation of the video frame quality values for single layer and temporal scalable video.
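A compact way to apply this per-frame procedure in a simulator, in the spirit of Algorithm 6, is sketched below; the trace containers are placeholders for however the conventional and offset distortion traces are loaded (Section 9.5 describes the actual trace formats):

    import math

    # Sketch: per-frame quality after transport for single layer video.
    # 'quality_trace' and 'offset_trace' are placeholder containers that a
    # simulator would fill from the trace files; the values are illustrative.
    quality_trace = {0: 36.1, 1: 35.8, 2: 36.0, 3: 35.5}         # Q_n^q(0) in dB
    offset_trace = {0: {1: 14.2, 2: 21.7, 3: 30.4}}              # RMSE_n^q(d)

    def displayed_quality(n_last_decoded, d):
        """Quality of the frame shown d positions after the last decoded frame n."""
        if d == 0:
            return quality_trace[n_last_decoded]                 # decoded normally
        rmse = offset_trace[n_last_decoded][d]                   # re-displayed frame
        return 20.0 * math.log10(255.0 / rmse)                   # Equation (9.7)

    # Frames 1-3 lost: frame 0 is re-displayed in their place.
    print([displayed_quality(0, d) for d in range(4)])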
9.2.2 Spatial Scalable Video

Spatial scalable video adds upsampled video frames on top of the re-display of video frames in the decoding process, as outlined in Section 9.1.2. The upsampling introduces an additional distortion (i.e., loss of quality) to the video displayed by the decoder. Thus, additional metrics are needed in order to accommodate the upsampling of base layer video frames and the re-display of the upsampled base layer (BL) video frames. Let f_{UP}^{n,q} denote the encoded (at quantization scale q) and upsampled base layer frame n. Let furthermore F_{EL}^{n+d,q} denote the original unencoded enhancement layer frame in full resolution at offset d. The distortion caused by upsampling the base layer frame n and re-displaying it instead of the full resolution combination of base and enhancement layer frames is calculated as

RMSE_{UP}^{n,q}(d) = \sqrt{ \frac{1}{XY} \sum_{x=0}^{X-1} \sum_{y=0}^{Y-1} \left[ F_{EL}^{n+d,q}(x,y) - f_{UP}^{n,q}(x,y) \right]^2 }.    (9.8)

Note that in case of d = 0, the distortion is calculated for the upsampling of the base layer frame only and is used to determine the distortion caused by displaying the upsampled base layer frame instead of the full resolution base and enhancement layer frame combination. The corresponding video frame quality can be calculated similarly to Equation (9.7) as

Q_{UP}^{n,q}(d) = 20 \log_{10} \frac{255}{RMSE_{UP}^{n,q}(d)}.    (9.9)

For spatial scalable encoded video, we thus need to differentiate between the base layer resolution (BL) distortion and quality, the base and enhancement layer resolution (EL) distortion and quality, and the upsampled base layer (UP) distortion and quality. The combination of all the different video qualities for spatial scalable video is given in Algorithm 7.

9.2.3 SNR Scalable Video

The determination of the video frame qualities after (lossy) network transport for SNR scalable encoded video can be done in analogy to the presented metrics and algorithm for spatial scalable encoded video in Section 9.2.2 with only minor modifications. In particular, rather than determining the upsampling distortion and quality of the base layer video frame, the difference in video distortions and qualities is already given by the encoding process and/or its output. The two metrics that are required to determine the video quality at the receiving client are the offset distortion and quality for re-displayed base layer frames at offsets d compared to the unencoded enhancement layer frames, in analogy to Equations (9.8) and (9.9).
Algorithm 7: Calculation of the video frame quality values for spatial scalable video.
9.3 Video Offset Distortion

In this section we examine the offset distortion values in greater detail. For the encodings evaluated throughout this section, we utilize the MPEG-4 reference encoder. We note that the video offset distortion trace data is publicly available at [40].
Fig. 9.4: Video frame offset distortion for frames n = 867, 5586, and 9089 from the Jurassic Park I video sequence encoded with a quantization scale of q = 4 and a target bit rate of 256kbps.

9.3.1 Comparison of Rate-Controlled and Non-Rate-Controlled Video Encoding for Single-Layer Video

The single layer video encodings we evaluate in this section are of QCIF resolution and encoded using an IBBPBBPBBPBBI... GoP pattern. For encodings that are quantization scale-controlled, the same quantization scale q is used for all different frame types. For encodings that are rate-controlled, the TM5 rate control algorithm [118] with the specified target bit rate is applied. We illustrate the video frame offset distortion for frames n = 867, 5586, and 9089 from the video sequence Jurassic Park I in Figure 9.4 and the corresponding video frame offset quality values in Figure 9.5. We observe the typical inverse relationship between the distortion and quality values. Similar to the findings in [154], we observe that an approximation of the offset distortion or quality by a fixed value is not advisable, as the individual frames exhibit (i) generally different levels of distortion (and hence quality) and (ii) a different behavior of the offset distortion or quality with increasing offset d. Importantly, we observe that for the evaluated frames, the offset distortion values are nearly identical for the rate-controlled and non-rate-controlled versions of the encoded video. This is significant, as the non-rate-controlled version is a very high quality version, whereas the rate-controlled version is controlled by the TM5 algorithm, which dynamically adjusts the quantization scale q depending on the encoded video frame size and the remaining target bandwidth for a GoP.
The content of the video frames (and the content of the frames under evaluation due to the offset d) is important. Frame 867 (and the subsequent frames) shows people moving behind bushes in the dark, frame 5586 shows a reflection of a person in water, and frame 9089 shows a group of people brushing off an artifact in a desert environment. We additionally observe that the relationship between the rate-controlled and quantization scale-controlled encodings changes as the offset d increases. For a small offset d, the quality of the quantization scale-controlled encoding is higher for frames 867 and 5586, whereas with increasing offset, the quality of the rate-controlled encoding for each frame is higher. Furthermore, we note that the sizes of the encoded macroblocks are different for the two versions depicted in Figures 9.4 and 9.5.

Fig. 9.5: Video frame offset quality for frames n = 867, 5586, and 9089 from the Jurassic Park I video sequence encoded with a quantization scale of q = 4 and a target bit rate of 256kbps.

To further evaluate the different behaviors with increasing offsets and the closeness of the quantization scale-controlled and rate-controlled versions, we now evaluate the video offset qualities for the two video frames 867 and 5586 in greater detail. In Figures 9.6 and 9.7, we illustrate the offset quality values for both video frames encoded at different quantization scales and target bit rates. We observe that for both frames, the shape of the offset distortion curve does not change with the different levels of quantization or the different applied target bit rates.
Fig. 9.6: Video frame offset quality for frame n = 867 from the Jurassic Park I video sequence encoded with different quantization scales q and target bit rates.

We additionally note that for both illustrated video frames, the order of the individual curves changes with increasing offset d. The frames have a higher quality for closer offsets and higher quality encodings. As the offset increases, this relationship is inverted. Two mechanisms are responsible for this behavior, namely (i) the encoding of the video frames and (ii) the change in content of subsequent video frames. For close offsets, the contents of the frames n + d following the frame n under consideration are closely related and thus the differences between the frames are small. With high quality encodings (by application of either a low quantization scale q or a high target bit rate), the small differences in content and the better quality from the encoding result in a low offset distortion (and thus high offset quality) for the higher quality encodings. For lower quality encodings, the differences are larger, as the distortions from the encoding process, and not from the offset comparison, are large. As the offset d increases, the differences in content of the video frames increase. With a higher quality encoding, the video frames have a higher level of detail. In turn, the differences between consecutive video frames are finer, whereas the differences in lower quality encodings are smaller due to the decreased level of detail. In turn, a higher quality encoding yields a lower offset quality (and higher offset distortion) at the larger offsets. Content dependency is also the reason why the different encoding settings turn out to be very close to one another.
Fig. 9.7: Video frame offset quality for frame n = 5586 from the Jurassic Park I video sequence encoded with different quantization scales q and target bit rates.

We note that the curves for the target bit rate 64kbps and the quantization scale q = 30 are very close to one another, as are the pairs 128kbps and q = 10, and 256kbps and q = 4. This closeness is a result of the rate-distortion behavior of the encoder and the TM5 bandwidth matching. For low target bandwidths, the rate control has to assign a very high quantization scale, resulting in the first pair, 64kbps and q = 30. As the target bandwidth increases, the rate control algorithm can assign lower quantization scales to the encoding, which results in the pair 256kbps and q = 4. We also note that in previous findings, the TM5 algorithm was not found to match the lower target bit rates effectively [158]. This inability of the rate control, due to the limited range of quantization scales available to the TM5 rate control algorithm, adds to the closeness of the offset qualities for the rate-controlled encodings, especially for the two lower target bit rates 64kbps and 128kbps. To approximate unknown target bandwidths or quantization scales, it is beneficial to look at the combinations that are given by the offset quality figures and the bandwidths of the encoded full video sequences. From [40], we derive the averaged bandwidths of the encoded videos as given in Table 9.1. Comparing the different average bit rates of the encoded video, we observe that encodings with similar average bit rates are very close to one another in terms of their offset quality values depicted in Figures 9.6 and 9.7. For unknown quantization scales or target bit rates, the offset distortion of known quantization scales or target bit rates can be used to derive a close approximation of unknown offset distortions by comparison of the average encoded bit rates.
Table 9.1: Encoded bandwidths for different quantization scales q and target bit rates for the video sequence Jurassic Park I from [40].

Quantization scale q /      Average bit rate [Mbps]
target bit rate
q = 4                       0.78435
q = 10                      0.22787
q = 24                      0.07845
q = 30                      0.06645
64kbps                      0.06632
128kbps                     0.12813
256kbps                     0.25613

9.3.2 Comparison of Rate-Controlled and Non-Rate-Controlled Video Encoding for Scalable Video

For the evaluation of scalable video encodings, we consider spatial scalable video encodings with a QCIF base layer resolution and a CIF enhancement layer resolution. We utilize an IBBPBBPBBPBBI... GoP pattern in the base layer and a PBB structure in the enhancement layer. For encodings that are quantization scale-controlled, the same quantization scale q is used in the base and enhancement layers. For rate-controlled encodings, the TM5 rate control algorithm is applied to the base layer with the given target bit rate, while the enhancement layer is encoded using a quantization scale q = 14 for P and q = 16 for B frame types. We illustrate the upsampled base layer's offset qualities for frames n = 1895, 2550, and 5370 from the Terminator I video sequence, encoded with a quantization scale of q = 4 and a target bit rate of 256kbps, in Figure 9.8. We note that for the different video frames, the upsampled base layer offset qualities are very close for both the quantization scale-controlled and the bit rate-controlled encodings, and the corresponding curves are on top of each other. This can be explained as follows. Using the low resolution base layer, the rate control algorithm is able to assign a low quantization scale q to the encoded base layer, which is close to the q = 4 illustrated in Figure 9.8 as well. We additionally observe that, as in the single layer case, the offset quality is content dependent and thus different for each frame n and offset d. Similar to the single layer case, we illustrate the upsampled base layer's offset quality for a variety of quantization scales q and target bit rates, exemplarily for frame n = 2550, in Figure 9.9. We observe that for all different quantization scales and target bit rates, the different offset quality values are very close to one another.
Fig. 9.8: Video frame offset qualities for frames n = 1895, 2550, and 5370 from the Terminator I video sequence encoded with a quantization scale of q = 4 and a target bit rate of 256kbps.

This closeness of the upsampled base layer offset qualities can be used to approximate unknown offset qualities at other quantization scales or target bit rates. We illustrate the offset quality for the enhancement layer frame n = 2550 in Figure 9.10. We observe that the enhancement layer offset quality closely follows the upsampled base layer offset quality values. In addition, the different enhancement layer offset quality values are close together, such that an approximation of unknown offset qualities can be made for unknown quantization scale or target bit rate settings. We illustrate the difference in the offset qualities for the upsampled base layer and the enhancement layer for frame n = 2550 in Figure 9.11. For the difference between the two offset qualities, we observe a decrease with the offset d for all different encoding settings. We additionally note that for the different encoding modes, the differences for the quantization scale-controlled encodings are close for all offsets, whereas the differences for the rate-controlled encoding modes are further apart and only seem to converge at large offsets d. Thus, when approximating the upsampled base layer qualities by means of the qualities obtained for the individual layers, the significant difference, especially for small offsets d and rate-controlled encodings, has to be taken into account.
Fig. 9.9: Video frame offset quality for the upsampled base layer frame n = 2550 from the Terminator I video sequence encoded with different quantization scales q and target bit rates.
9.4 Perceptual Considerations for Offset Distortions or Qualities

The RMSE and the RMSE-based PSNR metrics are based on the comparison between the individual video frames of the unencoded source video and the received and decoded video. These metrics therefore do not take into account the flow of the consecutive video frames as they are displayed at the receiver. This can lead to a slowdown of the growth of the offset distortion, or even a decrease, when consecutive frames have only little correlation, as illustrated in Figure 9.7 for the offset quality of frame 5586 from the Jurassic Park I video sequence. This behavior, although consistent with the definition of these metrics, is not consistent with the intuitive value of these video frames to the receiving client. In order to derive a more suitable metric for comparing the source video and the received and decoded video with errors and offset distortions, it is necessary to take into account the impact of re-displaying the current frame multiple times on the perceived video quality. We propose an adjustment of the RMSE and PSNR metrics that takes the number of consecutive displays of a video frame into account yet does not require more information than is stored in video traces. In the following, we therefore present an approximation of perceptual considerations based on the RMSE (and PSNR) values that are readily available in offset distortion video traces.
Fig. 9.10: Video frame offset quality for the enhancement layer frame n = 2550 from the Terminator I video sequence encoded with different quantization scales q and target bit rates.

We consider a client that receives the encoded video and is presented with multiple re-displays of the same video frame. The client thus sees a sum of distortions that originate from the last successfully decoded video frame n at d = 0 and the following re-displays of frame n at the offsets d ≥ 1. We use this approach as the basis to calculate the perceptually adjusted RMSE (pRMSE). In particular, we define

pRMSE_n^q(d) = \frac{1}{d+1} \sum_{\delta=0}^{d} RMSE_n^q(\delta),    (9.10)

where the sum of distortions seen by the client is averaged over the number of displays of the frame. The average offset distortion thus presents the upper bound for the perceptually adjusted RMSE values as the offset d increases. We determine the perceptually adjusted quality (pQ) for each frame and offset as

pQ_n^q(d) = 20 \cdot \log_{10} \frac{255}{pRMSE_n^q(d)}.    (9.11)

We illustrate the traditional versus the perceptually adjusted video frame offset distortion and quality values in Figures 9.12 and 9.13 for the frames 867, 5586, and 9089 from the Jurassic Park I video sequence encoded with quantization scale q = 4 for all frame types.
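Because the perceptually adjusted values are simple running averages of the trace entries, they can be derived on the fly. The sketch below, using illustrative offset RMSE values, follows Equations (9.10) and (9.11):

    import math

    # Sketch: perceptually adjusted offset distortion and quality, following
    # Equations (9.10) and (9.11). The offset RMSE values are illustrative.
    rmse_offsets = [4.0, 12.5, 18.0, 24.0, 29.5]   # RMSE_n^q(0), RMSE_n^q(1), ...

    def p_rmse(d):
        # Average of the distortions seen over the d+1 consecutive displays.
        return sum(rmse_offsets[:d + 1]) / (d + 1)

    def p_quality(d):
        return 20.0 * math.log10(255.0 / p_rmse(d))

    for d in range(len(rmse_offsets)):
        print(d, round(p_rmse(d), 2), round(p_quality(d), 2))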
Fig. 9.11: Difference in the video frame offset qualities (upsampled base layer and enhancement layer) for frame n = 2550 from the Terminator I video sequence encoded with different quantization scales q and target bit rates.

We observe that the perceptual adjustment results in a smooth rise of the video frame offset distortion (and a smooth decline of the video frame offset quality) for the evaluated frames and offsets. Comparing the originally obtained video frame qualities Q_n^4(d) and the perceptually adjusted video frame qualities pQ_n^4(d), we observe that the perceptual quality values obtained at small offsets are generally higher than their traditionally calculated counterparts. This reflects that for a small number of re-displayed video frames, e.g., one or two frames, the perceived quality would not experience a large degradation. For frames that are further away, the perceptually adjusted quality values can be smaller than the traditionally calculated quality values. This reflects that the more video frames are not correctly received, the lower the perceived quality for the client.
9.5 Using Video Offset Distortion Traces

Traditional video traces provide the frame time, frame number, frame type, video quality per frame, and additional information [158]. Offset distortion traces provide the distortion (RMSE) information per frame as a function of the offset d, as given in Equation (9.6).
Fig. 9.12: Video frame offset distortion (original and perceptually adjusted) for the frames 867, 5586, and 9089 from the Jurassic Park I video sequence encoded with quantization scale q = 4.

Table 9.2: Exemplary offset distortion trace for Star Wars IV, encoded with q = 4.

Frame #   d=1          d=2          d=3          d=4          d=5          d=6          ...
0         79.50700667  79.62517641  80.16716058  162.5373498  122.5335215  122.3341171  ...
1         79.50866246  80.05223989  162.4254677  122.4373796  122.2356938  82.56895003  ...
2         80.21261549  162.5813237  122.5847470  122.3799119  82.71486860  82.83361557  ...
...       ...          ...          ...          ...          ...          ...          ...
In order to facilitate managing the additional amount of data per frame, the offset distortion trace format has to be easily accessible. We thus propose a table format in which each frame n constitutes a line and each offset d constitutes a column in the row indexed by n. Note that for each individual encoding setup (e.g., a different quantization scale q or target bit rate), a new offset distortion trace has to be generated. We illustrate an exemplary offset distortion trace, as can be found at [40], in Table 9.2 for the first 3 video frames and values of d up to 6. We note that the trace format is the same for all the different offset distortions calculated and presented in this chapter, and that, given the relation between the offset distortion and the offset quality in Equation (9.7), only the offset distortion values are included in the traces.
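Reading such a trace into a simulation is a one-liner per row. The sketch below assumes a comma-separated file with the layout of Table 9.2; the file name is a placeholder, not the actual trace file name:

    import csv
    import math

    # Sketch: load an offset distortion trace with the layout of Table 9.2
    # (frame number followed by RMSE values for d = 1, 2, ...). The file name
    # and exact delimiter are placeholders for the actual trace files.
    def load_offset_trace(path):
        trace = {}
        with open(path, newline="") as f:
            for row in csv.reader(f):
                frame = int(row[0])
                trace[frame] = [float(v) for v in row[1:]]
        return trace

    def offset_quality(trace, n, d):
        """PSNR of re-displaying frame n at offset d >= 1, per Equation (9.7)."""
        rmse = trace[n][d - 1]          # column d = 1 is stored at index 0
        return 20.0 * math.log10(255.0 / rmse)

    # Example (hypothetical file name):
    # trace = load_offset_trace("StarWarsIV_q04_offset.csv")
    # print(offset_quality(trace, 0, 4))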
Fig. 9.13: Video frame offset quality (original and perceptually adjusted) for the frames 867, 5586, and 9089 from the Jurassic Park I video sequence encoded with quantization scale q = 4.

9.5.1 Assessing the Video Quality After Network Transport Using Video Traces

To determine the video quality after potentially lossy network transport, both the conventional traces and the offset distortion traces are required. In Algorithm 8, we provide the function to determine the video frame quality values for the algorithms presented in Section 9.2 and the traces available at [40]. In particular, we consider the terse video traces from [40], as they contain the information for frame n in row n (they are in frame order), whereas the verbose traces use an index for each frame (they are in encoding order). Both trace formats can be used, but the difference in order and the requirement for additional indexing when using the verbose video traces have to be kept in mind.
switch d do
    case d = 0
        /* Traditional traces already provide the video quality values. */
        if single layer then
            Q_n^q(0) = terse trace(row n);
        else if base and enhancement layer then
            Q_EL^{n,q}(0) = terse aggregated trace(row n);
        else
            Q_UP^{n,q}(0) = terse trace for BL(row n);
    case d ≥ 1
        /* Offset distortion traces provide the video distortion values. */
        switch video encoding mode do
            case single layer
                Q_n^q(d) = 20 · log10( 255 / offset trace(row n, col. d) );
            case temporal scalable
                Q_n^q(d) = 20 · log10( 255 / offset trace(row n, col. d) );
                /* For temporal scalable encodings, the single layer offset traces can be used. */
            case spatial scalable
                if base and enhancement layer then
                    Q_n^q(d) = 20 · log10( 255 / offset trace EL(row n, col. d) );
                else
                    Q_UP^{n,q}(d) = 20 · log10( 255 / upsampled BL offset trace(row n, col. d) );

Algorithm 8: Function to derive the video frame quality values for single and scalable video encodings after network transport from the video traces at [40].
9.5.2 Available Tools

Video offset distortion traces have become available to the networking research community for several different video encodings; see [40] for some available traces. The availability of offset distortion traces alone may not be sufficient for researchers that have access to unencoded video bit streams or for video researchers that are interested in working with the commonly used video test sequences. For those researchers, we have made the software tools used in the generation of the offset distortion traces available for download at [40]. The tools generate the suggested output in a comma-separated values (CSV) format that allows for easy access to the values in spreadsheet programs and for further processing. The tools require the input video to be in the YUV 4:2:0 format. We describe the usage of these tools in greater detail in Chapter 10.
9.6 Offset Distortion Influence on Simulation Results

In this section, we compare the actual video quality obtained with the offset distortion traces introduced in this chapter with the approximation approach outlined in [158], where the quality for re-displayed video frames was fixed at Q_n^q(d) = 20 dB, d = 1, 2, ....
Fig. 9.14: Comparison of approximation of quality of loss-affected frames (Q-20) with actual quality obtained from offset distortion trace (Q): Average video stream qualities for the Foreman video sequence encoded with quantization scale q = 3 as a function of the bit error rate for different offsets d.

9.6.1 Single Layer

We consider sending the Foreman video sequence encoded with a quantization scale q = 3 over an error-prone link. We utilize the MPEG-4 reference software and encode the video sequence in simple profile, single layer mode. The link is modeled using uncorrelated bit errors with different error probabilities. We consider the error probability for the size of the video frames (in bits) only and include no protocol overhead for comparison. We utilize the elementary IPPP... GoP pattern and assume that after each erroneous frame, d − 1 more frames are lost before the sender can update the receiver by sending an I frame. Without loss of generality, we assume that the I frame has the same distortion values as the P frame otherwise sent at that position. We report our results for a 99% confidence interval of the mean quality. We illustrate the effect of different bit error rates on the video stream quality Q^q for different offsets d in Figure 9.14. We observe that only for very low bit error rates does the approximation with Q = 20 dB (Q-20) result in a close fit to the value obtained by the framewise exact PSNR calculation (Q) using the offset distortion traces. As the bit error rate increases, the difference between the approximation and the actual quality obtained with the offset distortion traces becomes larger, reaching differences between 2 dB and 4 dB, which are quite significant.
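This comparison can be reproduced in a few lines once the frame sizes and the offset distortion traces are loaded. The sketch below uses hypothetical frame sizes and trace-derived PSNR values, models uncorrelated bit errors per frame as in the text, and contrasts the fixed 20 dB approximation with the trace-based quality:

    import math, random

    # Sketch: average stream quality under uncorrelated bit errors, comparing
    # the fixed 20 dB approximation (Q-20) with offset-distortion-trace values.
    # Frame sizes (bit) and per-frame quality values are illustrative placeholders.
    random.seed(1)
    frame_sizes = [8000, 3000, 3200, 3100] * 25        # 100 frames, IPPP... pattern
    psnr_ok = 36.0                                      # quality of intact frames (assumed)
    offset_psnr = {1: 28.0, 2: 25.0, 3: 23.5}           # trace-based re-display quality (assumed)

    def run(ber, d, use_trace):
        qualities, skip = [], 0
        for size in frame_sizes:
            if skip > 0:
                # Frame lost due to error spreading from an earlier erroneous frame.
                qualities.append(offset_psnr[d - skip + 1] if use_trace else 20.0)
                skip -= 1
                continue
            frame_hit = random.random() < 1.0 - (1.0 - ber) ** size
            if frame_hit:
                qualities.append(offset_psnr[1] if use_trace else 20.0)
                skip = d - 1          # d-1 further frames lost before the I frame update
            else:
                qualities.append(psnr_ok)
        return sum(qualities) / len(qualities)

    for ber in (1e-6, 1e-5, 1e-4):
        print(ber, round(run(ber, 2, True), 2), round(run(ber, 2, False), 2))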
Fig. 9.15: Comparison of approximation of quality of loss-affected frames (Q-20) with actual quality obtained from offset distortion trace (Q): Coefficient of video frame quality variation for the Foreman video sequence encoded with quantization scale q = 3 as a function of the bit error rate for different offsets d.

We conclude that the approximation does not capture the effect of different offsets d well and in turn results in a large deviation from the correct simulation outcomes that can be obtained using the offset distortion traces. Figure 9.15 shows the calculated coefficients of video quality variation CoV^q for different bit error rates. We note that the variability of the video frame qualities increases with the bit error rate for all different settings and metrics under consideration. We observe that in terms of capturing the variability of the quality of the video frames, the approximation approach results in an overly high estimate of the video quality variability. We further illustrate the effect of different fixed frame intervals needed to update the receiver (offsets d) on the calculated video stream quality in Figure 9.16 for a fixed bit error rate of 10^-4. We observe that for a given bit error rate, the video stream quality as a function of the offset (or the number of frames needed to allow the receiver to receive an I frame) has an asymptotic behavior toward 20 dB for the approximation approach. The actual quality obtained with the offset distortion traces, however, continues to decline more rapidly as a function of the offset than the approximation. We also note a significant difference between the approximation and the actual quality values of close to 4 dB in case only the erroneous frame is affected (d = 1).
Fig. 9.16: Comparison of approximation of quality of loss-affected frames (Q-20) with actual quality obtained from offset distortion trace (Q): Average video stream qualities for the Foreman video sequence encoded with quantization scale q = 3 at different offsets for bit error rate 10^-4.

The continuing decline with larger offsets and the higher quality at smaller offsets can only be obtained using the actual quality values from the offset distortion traces.
Figure 9.17 illustrates the calculated coefficients of (relative) video quality variation CoV^q as a function of the offsets d. We observe that with larger offsets d the video quality variability increases for the actual quality. We also note that the approximation approach not only fails to capture the video quality variability correctly, but also drops with increasing offsets. In the region of smaller offsets d, the approximation approach results in too high variation estimates. Only for an offset of d = 6 do the approximation approach and the offset distortion trace based approach result in approximately similar simulation results for the variability of the video frame qualities. For larger offsets, the approximation approach greatly underestimates the video frame variabilities.

9.6.2 Spatial Scalable Video
In this section, we evaluate the general impact of the scalable offset distortion on simulation results using the video sequence News encoded with the official Microsoft MPEG-4 reference encoder [39].
Fig. 9.17: Comparison of approximation of quality of loss-affected frames (Q-20) with actual quality obtained from offset distortion trace (Q): Coefficient of video frame quality variation for the Foreman video sequence encoded with quantization scale q = 3 at different offsets for bit error rate 10^-4.

We use a QCIF-sized base layer and a CIF-sized enhancement layer. We employ the GoP pattern IPPP... with a GoP length of 24 frames and a frame rate of 24 frames per second (i.e., we start a new GoP every second). The enhancement layer is encoded only with respect to the base layer (i.e., we do not enable motion estimation and compensation between enhancement layer frames) in our evaluation, but we note that different encoding schemes can be evaluated in a similar manner. Without loss of generality, we evaluate lossless transmission of the encoded video over a bandwidth-limited link to illustrate the difference between a rough approximation using a low value of Q_n^q = 20 dB or RMSE_n^q = 25.5 for the video quality and the actual value determined by the scalable offset distortion trace. We consider the standard layered video streaming approach, where the base layer is transmitted before the enhancement layer, in combination with the typically used RTP/UDP/IP protocol encapsulation.
We illustrate the influence on the video distortion for the enhancement layer (EL) resolution in Figure 9.18 for the News sequence encoded with a quantization scale parameter of q = 9. We observe that the enhancement layer distortion is significantly higher for both the approximation and the calculation of the distortion values over a wide range of evaluated bandwidths. The reason for this behavior is that the encoded base layer requires 73.2 kbps on average, whereas the enhancement layer requires 1038.9 kbps on average.
Fig. 9.18: Mean distortion from approximation and calculation for spatial scalable video streaming of the News video sequence.

In addition to this effect, the intra-coded I frames in the base layer and the larger enhancement layer frames require packetization into multiple IP packets, which adds further protocol overhead. Only if most of the base layer can be transmitted successfully does the approximated distortion for the enhancement layer approach the level of the calculated distortion, as we can determine the distortion for the upsampling from the current traces. Importantly, we note that the approximation of the distortion with a fixed value results in a too high estimate of the distortion. A simple approximation with fixed values is thus not desirable.
For the variability of the video distortion, we illustrate the standard deviation for the two layers in Figure 9.19. We observe that the approximation of the distortion values decreases the variability significantly compared to the actual variability. The approximation does not capture the behavior of the calculated variability; instead, the variability is approximated as too low for very low bit rates and as too high as the bit rate increases.
Overall, we find that using an approximation value instead of calculated values increases the error introduced in trace-based video quality estimation quite significantly. We conclude that utilizing the scalable video distortion and quality traces that we are currently including in our video trace library [40] results in an accurate estimation of the video quality after (lossy) network transport.
Fig. 9.19: Standard deviation of the distortion from approximation and calculation for spatial scalable video streaming of the News video sequence.
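To give a concrete sense of the packetization overhead mentioned above, the following sketch counts the RTP/UDP/IP packets needed per encoded frame. The 1500-byte MTU and the 40-byte combined RTP/UDP/IPv4 header are common but assumed values, and the function names are our own.

```python
MTU = 1500                 # assumed link MTU in bytes
HEADERS = 12 + 8 + 20      # RTP + UDP + IPv4 header bytes per packet
PAYLOAD = MTU - HEADERS

def packets_per_frame(frame_bytes):
    # Number of IP packets needed to carry one encoded frame.
    return max(1, -(-frame_bytes // PAYLOAD))   # ceiling division

def bits_on_wire(frame_bytes):
    # Encoded frame size plus per-packet protocol overhead, in bits.
    return (frame_bytes + packets_per_frame(frame_bytes) * HEADERS) * 8

print(packets_per_frame(12000), bits_on_wire(12000))   # a 12000-byte I frame: 9 packets, 98880 bits
```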
9.7 Error-Prone and Lost MDC Descriptors
To evaluate the video quality, two different error models are taken into account. We distinguish between lost descriptors and error-prone descriptors (combinations of both are also possible). If a complete descriptor is lost, we assume that all of its frames are lost completely. This is a realistic scenario for the support of heterogeneous terminals and for multi-hop networks. Heterogeneous terminals may listen only to a subset of descriptors due to their bandwidth limitations [89, 159]. In the case of multi-hop networks, if one of the hops in the transmission is completely down, then the descriptors transmitted via that hop are completely lost. In the case of error-prone descriptors, bit errors with a given probability are added to the encoded descriptors.
To evaluate the impact of error-prone and lost MDC descriptors, we extend our splitter setup by a new entity called the merger, as given in Figure 9.20. Each received descriptor is decoded individually and afterwards merged into a single YUV stream. The received YUV stream can then be compared with the original YUV stream. To calculate the PSNR values in dependency of the percentage of successfully received descriptors, we randomly select a subset of descriptors and feed these to the merger.
9.7 Error-Prone and Lost MDC Descriptors D1
H.26x Decoder
..... J+1
1
D2
H.26x Decoder
..... J+2
2
D3
H.26x Decoder
..... J+3
3
J+3 J+2 J+1
225
J
.....
3
2
1
J
.....
3
1
1
Fig. 9.20: Merger for J=3. D1
Fig. 9.21: Merger for J=3, where one descriptor is missing.

In case descriptors are lost, frames will be missing in the video sequence. As a very simple error concealment, we freeze the last successfully received frame until we receive a new frame. Freezing is performed by making a copy of the last received frame for the next frame until an update frame is received. This procedure is shown in Figure 9.21 for the example of a missing descriptor 2. In this case, frames 1, J + 1, . . . will each be displayed two times. In case the descriptors are error-prone, they are decoded and merged afterwards.
In Figure 9.22, PSNR measurements versus the percentage of successfully received sub-streams for a total of J = 10 descriptors are given. In this figure, it is assumed that all the frames belonging to the successfully received descriptors are received without any error from the propagation through the channel. The results are obtained using six different video sequences. The Foreman video sequence includes the highest motion, whereas the Claire and Container video sequences include relatively lower motion than Foreman. As we can observe in the figure, if the motion in a given video sequence is high, the slope of the PSNR degradation curve is also high. For example, for a loss of 60% out of J = 10 descriptors, the PSNR degradation is 6 dB for Foreman, 2 dB for Claire, and 1 dB for the Container video sequence. Thus, we conclude that the more motion a given video sequence contains, the more important it is to receive as many descriptors as possible for a given J. Figures 9.23 and 9.24 present PSNR values versus the percentage of successfully received descriptors for the Container and Foreman video sequences for 2, 4, 8, and 16 descriptors.
Fig. 9.22: PSNR values for different numbers of descriptors versus the percentage of received sub-streams for various video sequences.

Encoding overhead versus the number of descriptors is additionally plotted for each video sequence, as it represents the additional bandwidth required by MDC compared to the bandwidth required by single stream video. The encoding overhead increases with the number of descriptors and depends on the video content. The received descriptors are assumed to arrive with a certain Bit Error Probability (BEP) at the IP level of 10^-4 (bad channel), 10^-5 (medium channel), or 10^-6 (good channel). To obtain these results, we encoded the descriptors with H.263 and added a given BEP. We then assumed a subset of all descriptors to be received, decoded them, and merged them together. We repeated the measurements until we reached a relative error of 1% for a 99% confidence interval. As in the case explained in Figure 9.22, the effect of lost descriptors again depends on the relative motion in the video sequence. Since the Foreman video sequence has relatively higher motion than the Container video sequence, the effect of lost descriptors is more visible in Figure 9.24 than in Figure 9.23. As is evident in Figures 9.23 and 9.24, bit errors affect the video quality. Therefore, it is of utmost importance to consider the channel characteristics when deciding on the best optimization solution in case cross-layer optimization is exploited with MDC. Interested readers may refer to our study in [159].
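The merging with frame freezing described at the beginning of this section can be sketched as follows. The frame numbering (descriptor j carries frames j, j+J, j+2J, ...) follows Figures 9.20 and 9.21, while the data structures and names are our own illustrative choices, not the actual splitter/merger implementation.

```python
def merge_descriptors(descriptors, received, num_frames):
    """Merge J descriptor sub-streams into one display sequence with freezing.

    descriptors -- list of J dicts mapping frame number -> decoded YUV frame;
                   descriptor j (0-based) holds frames j+1, j+1+J, j+1+2J, ...
    received    -- list of J booleans (False: descriptor lost)
    num_frames  -- length of the original sequence
    """
    J = len(descriptors)
    merged, last_good = [], None
    for n in range(1, num_frames + 1):
        j = (n - 1) % J                      # descriptor carrying frame n
        frame = descriptors[j].get(n) if received[j] else None
        if frame is None:
            frame = last_good                # freeze the last received frame
        else:
            last_good = frame
        merged.append(frame)
    return merged
```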
Fig. 9.23: PSNR values for the Container video sequence versus percentage of received descriptors for J = 2, 4, 8, and 16.
Fig. 9.24: PSNR values for the Foreman video sequence versus percentage of received descriptors for J = 2, 4, 8, and 16.
10 Tools for Working with Video Traces
In this chapter of the book, we introduce the reader to a variety of tools that are useful in the utilization of video traces. The tools we describe include interfaces to network simulators, tools to generate MPEG-4 encoded video frame size traces, and tools to evaluate the video quality (in terms of the RMSE and PSNR) with and without consideration of errors and error concealment.
10.1 Using Video Traces with Network Simulators
In the previous parts and chapters of this book, we introduced video traces and results that were obtained from these traces. To use video traces for networking research, different ways of simulating multimedia traffic, and particularly video traffic, exist. The evaluation of different video coding standards (e.g., H.263 [38, 129, 160], MPEG-4 [57, 129, 160, 161], or H.264 [43, 46]), the resulting video traffic, and the resulting requirements for networks have attracted great interest in the research community. As encoded video traffic depends on the content [162, 163], the encoding standard, and the encoder settings, generalized video source models are the subject of ongoing research. In order to facilitate network performance evaluation, quality of service (QoS) categorization, and service design, the prospective video traffic for transmission systems has to be evaluated. The difficulty of modeling the behavior of video sequences with different encoding modes and different contents is driving the utilization of network simulators for performance analysis purposes. A large portion of research in the field of video traffic analysis is thus based on video traces.
The variety of different video stream characteristics, user behaviors, and properties of the transporting network is huge. Models of video sources such as presented in [4, 6, 112, 163, 164, 165] can greatly improve the possibilities to test final systems with regard to some real-life applications. However, these models require several to several tens of parameters. The creation of such a model is therefore only possible if the video itself has been statistically evaluated before. To evaluate an encoded video sequence, the video trace files (i.e., the video frame sizes and qualities) are utilized.
Fig. 10.1: Overview of video traffic modeling vs. video trace approach.

However, these video traces can also be used directly in different network simulators. The main purpose of using network simulators with video traces is to facilitate the layout of the network with respect to its physical form, QoS, and other parameters. As modeling the video sources always requires the original trace to be fully evaluated first, a simpler approach is to incorporate these traces directly into the network simulator. A further argument for incorporating the video traces directly into network simulators is the vast number of video source models, each with a specific point of view. The definition of an interface between the network simulator and the video trace is necessary. An overview of the necessary steps for the incorporation of video sources is illustrated in Figure 10.1. Clearly, the direct utilization of video traces in network simulators is the fastest method to incorporate video sources into existing network models. It is thus necessary to have clearly defined interfaces to facilitate a timely incorporation into existing models. In the following, we explain how to use existing interfaces for the most popular network simulators. These interfaces were designed by various contributors. All of these interfaces are made available through our video trace site and its mirrors [40].
Video trace files were generated as early as 1995, started by Rose [25]. As the encoder generations moved onward to the currently widespread MPEG-4 standard [161], the trace files generated also started to differ. As the frame rate per second is usually fixed within a single encoded video stream, the frame numbers or the playout time can be regarded as redundant. The trace files presented on our web page [40] use different column positions to present the values needed for the different simulator interfaces. The presented trace files also differ with respect to the information presented. This results from the evolutionary development in the generation of video encoding standards and related software encoders. Additionally, the depth of research in the video coding domain has increased. In the beginning of the research based on video frame traces, the focus was solely on the network performance. Today, the issues of quality and their correlation to the video frame sizes are also of importance, driven by the need for differential QoS. For utilization of the trace files in the simulators described below, some of the interfaces have to be adapted. It is therefore necessary to redefine the routines of the interfaces that read the trace files. The groups of frame traces presented in Table 10.1 use the same column layout.
Table 10.1: Different video trace file formats.
Group                         Entries in file
MPEG-4 (new)                  frame#  time  type  size [byte]  PSNR-Y  PSNR-U  PSNR-V
MPEG-4 FGS                    frame#  type  size [bit]  PSNR-Y  PSNR-U  PSNR-V
H.26L                         frame#  type  time  size [byte]
MPEG-4 (old), H.263, H.261    time  type  size [byte]
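As an illustration of how such a trace file can be consumed, the following sketch parses the MPEG-4 (new) column layout of Table 10.1 into per-frame records. The whitespace-separated layout and the field conversions are assumptions derived from the table; the actual simulator interfaces are described below.

```python
def read_trace(path):
    """Parse a verbose single-layer trace in the MPEG-4 (new) layout of Table 10.1."""
    frames = []
    with open(path) as f:
        for line in f:
            if not line.strip() or line.startswith("#"):
                continue                      # skip empty and comment lines
            num, time, ftype, size, py, pu, pv = line.split()[:7]
            frames.append({
                "frame": int(num),
                "time": float(time),          # playout time
                "type": ftype,                # I, P, or B
                "bytes": int(size),
                "psnr_y": float(py), "psnr_u": float(pu), "psnr_v": float(pv),
            })
    return frames
```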
Please note that up to now, we only incorporate the single layer verbose trace files. As most of the interfaces are programs written in different scripting languages, the positions to change are easy to find and in all cases involve only a single or a few lines. The Omnet++ interface is an exception, since it automatically recognizes the two different groups of MPEG-4 trace file formats.
Several simulator packages for utilization in network performance analysis exist. Of most interest in this field are discrete event simulators. In the following, we introduce the interfaces by which the video trace files can be used in these simulators as video sources. Three major simulator distributions are covered, with an exemplary demonstrator for each of them. Starting from the demonstrators incorporating the trace-based sources, advanced configurations can easily be derived.

10.1.1 NS II
The interface for using the video traces with the Network Simulator ns-2 [166] was written by Michael Savoric from the Technical University of Berlin, Germany. The example provided at [40] uses the trace file model as provided with the different groups of video traces. The Network Animator (NAM) package is used to generate a visual representation of the frame sending mechanism. As the interface was written for an earlier version of the Network Simulator, a few warnings may appear in newer versions. Nevertheless, the provided TCL scripts are tested to work with the version to date. With some experience in TCL scripting, the interface can easily be modified to work with additional trace file versions, such as fine grain scalability. In this demonstrator, all links between the entities shown are set to a capacity of 1 Mbit/s. The running simulator and visualization package is illustrated in Figure 10.2. For a successful compilation of the demonstration package, the following steps are required:
1. Extract the package: tar xvzf nsexample.tar.gz.
2. Change into the created directory: cd nsExample.
3. Invoke the demonstration: ns video trace type.tcl, where type denotes the group of trace files introduced above. (Please note that it may take some time for the script to parse the trace file and create the ns input file.)
Fig. 10.2: NS-2 trace-based video source with NAM sample output.

10.1.2 Omnet++
The interface for the Omnet++ simulator was written by Luca Signorin from the University of Ferrara, Italy. The interface is capable of detecting the different video trace file formats and of feeding the data into the Omnet++ simulator accordingly. The trace files above use different columns for the representation of the data. The Omnet++ simulator package is available from [167]. The current stable version 2.2 is the version the interface was written for and tested with (note that the currently available version 2.3 beta is not compatible with the interface). The following steps are necessary for a successful compilation of the interface and demonstration package:
1. Extract the downloaded file: tar xvzf OMNET++VideoInterface.tar.gz.
2. Change into the created directory: cd OMNET++VideoInterface.
Fig. 10.3: Omnet++ visualisation window for provided example simulation.

3. Edit the Makefile and change the following paths to the Omnet++ directory according to your setup:
   NEDC=<your omnet dir>/src/nedc/nedcwrapper.sh
   OMNETPP_INCL_DIR=<your omnet dir>/include
   OMNETPP_LIB_DIR=<your omnet dir>/lib
4. Call make.
5. Invoke ./OMNET++VideoInterface.
To speed up the simulation and skip the auto detection, it is possible to set the "type video" parameter in the omnetpp.ini file. The output and main Omnet++ windows are shown in Figures 10.3 and 10.4. With minor adjustments to the header and source files, this example can easily be expanded to suit different needs.

10.1.3 Ptolemy II
A third network simulation package, Ptolemy [168], was originally used for simulation in the beginning of the video trace project. The interface was written by Frank Fitzek during his time at the Technical University of Berlin. At that time, only trace files of the formats H.261, H.263, and MPEG (old) were used. We note that with respect to Table 10.2, the trace files for H.264 should work as they have the same format as the traces obtained from the previous H.26x coders. In case newer versions of trace files with different formats are used, minor changes to the source code have to be made. The interface is mainly based on the PTOLEMY star DEVideo.pl, which is derived from the DERepeatStar domain of the Ptolemy simulator package. A set of parameters as in Table 10.2 can be used to adapt the interface to the users' needs.
Fig. 10.4: Omnet++ main window for provided example simulation.

Table 10.2: Parameters for Ptolemy interface.
Parameter     Type         Info                                          Default
CodingType    StringState  MPEG4 and H263x is supported                  H263
TraceList     StringState  file name where trace files are specified     H263TraceList.dat
InputDir      StringState  location of trace files and this->TraceList   $TRACE_PT_DIR/
framescale    FloatState   scaling the frame size                        1.0
Offset        FloatState   uniform random phase for sequence start       0.0
maxFrameSize  IntState     larger frames will be skipped                 1000.0
Duration      IntState     uniform random trace duration in sec          1000.0
DEBUG         IntState     different DEBUG level for stdout              3
The CodingType parameter specifies which codec is used; this is important for reading the trace data properly. The TraceList parameter points to a file in which all trace file names that should be used for the simulation are specified. Example:
Verbose ARDNews 64.dat       0.245
Verbose ARDTalk 64.dat       0.745
Verbose Aladdin 64.dat       1
Verbose BoulevardBio 64.dat  1
Verbose DieFirma 64.dat      1
In addition to the name of the video trace file, a time scale factor has to be provided, since the video trace files differ in their length. The factor multiplied by 60 minutes gives the actual video trace length. The files themselves have to be stored in InputDir. In case the video frames have to be scaled, the parameter framescale has to be set properly. The star randomly chooses a video trace file
out of the trace file set given in TraceList and starts generating frames with a random offset (where the maximum offset is given by Offset). This is done to avoid synchronization issues between different terminals. Frames that are larger than maxFrameSize will be skipped (this option was not used to date, but might be helpful for other researchers). The Duration parameter sets the duration of the video trace such that the offset and the duration are chosen randomly but never exceed the video trace length. If the trace data is fully consumed, the star chooses a new trace file with new values for the offset and the duration. The DEBUG level is used to check the proper functionality of the star. It should be deactivated for simulations, as it produces a lot of output.
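The trace selection behavior of the star can be summarized in a short sketch; the tuple layout of the trace list and the helper name are illustrative assumptions, not the actual DEVideo.pl code.

```python
import random

def pick_trace(trace_list, max_offset_s, max_duration_s):
    """Pick a trace, a random start offset, and a random playback duration."""
    name, scale = random.choice(trace_list)     # (trace file name, length scale factor)
    length_s = scale * 60 * 60                  # scale factor times 60 minutes, in seconds
    offset = random.uniform(0.0, min(max_offset_s, length_s))
    duration = random.uniform(0.0, min(max_duration_s, length_s - offset))
    return name, offset, duration
```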
10.2 The VideoMeter Tool for Linux
In order to simplify the evaluation of the quality of video after compression and potentially lossy network transport, we developed the VideoMeter tool for the Linux operating system. To make the tool independent of the video compression scheme used, the tool takes only concatenated YUV video frames as input. The individual YUV frames have to be in the YUV 4:2:0 format. We chose this format due to its wide use in the video coding community as well as its independence from any video compression scheme. If we refer to frames in this section, we always refer to the uncompressed frames in YUV format. Figure 10.5 illustrates the different spaces of either compressed video or YUV 4:2:0 which the VideoMeter utilizes. Items that may introduce errors are drawn in red.
The tool supports the display of up to three video sequences that can be played back and their quality differences studied. We refer to the three sequences as the original, encoded, and transmitted video sequences. The currently supported video formats are QCIF (176x144 pels) and CIF (352x288 pels). A screenshot of the tool with all possible options enabled is given in Figure 10.6. In addition to the display of the YUV streams, the difference pictures for the Y component can be calculated and displayed to make the errors visible. These errors are also expressed in terms of the PSNR (and hence given in dB). The PSNR values and difference pictures are calculated for (i) the original and encoded videos (the encoding difference), (ii) the encoded and the transmitted videos (the transmission difference), as well as (iii) the original and the transmitted videos (the complete difference). Both the difference picture calculation and the PSNR calculation are done on a pixel basis and may thus cause a heavy processing load for slower systems. However, a pixel-based color space conversion has to be done in any case to display the YUV data on the RGB-based terminal screen. Therefore, the additional computational overhead for calculating the PSNR from the difference pictures is relatively low, and the calculation is done whenever the difference picture is displayed.
Fig. 10.5: Overview of YUV and an arbitrary video compression scheme in the VideoMeter context.

Disabling the PSNR switch only disables its graphical representation, which uses some additional CPU time. We illustrate the information window of the VideoMeter tool in Figure 10.7. The information window shows some basic statistics about the currently displayed sequences. Additionally, the playout modes are shown, as are the video sequence input files. The information window further shows the current PSNR values and the smoothed (i.e., mean) PSNR from the beginning of the playout.
As errors occur, the compared streams may differ by one or several pictures, as illustrated in Figure 10.8. This behavior is common to decoders that drop frames that were too damaged during transmission, or to rate-adaptation schemes of the encoder. Frame drops would render a PSNR calculation unsuitable, since corresponding pictures of the streams would no longer be compared to each other, i.e., the synchronization would be lost. To ensure the comparison of the corresponding frames in the face of these frame drops, we employ the widely used technique of freezing, which we explained in greater detail for the offset distortion traces in Chapter 9. With freezing, the last successfully decoded video frame stays in the display memory until the next video frame is successfully decoded and displayed. Since our tool does not include any "knowledge" about the decoding status, a freeze file (see Section 10.2.2 for details) has to be provided. The freeze file holds the numbers of the frames that have to be frozen for comparison and playout (i.e., the frames that are assumed to be not decodeable).
Fig. 10.6: Main window with displayed video sequences of the VideoMeter tool.

10.2.1 VideoMeter Usage
The VideoMeter tool is invoked from the command line as:
videometer [opt.switches] -f1 yuv-file [-f2 ...] [-f3 ...]
The command line switches are used to enable or disable the different features. The tool can be used as a mere YUV player, if only a single file is given and PSNR and difference picture generation are disabled. In the following, we explain the usage of the different switches in greater detail:
-l        Loops the playback. The number of looped playbacks is displayed in the info window.
Fig. 10.7: Information window of the VideoMeter tool, showing the actual frame and target playout rate, start and end frame for playout, the number of loops, the number of frames in the freeze file, the current and mean PSNR values, the input files, and color delimiters for the typical PSNR range.
-nodiff   Disables the output of the difference pictures. If PSNR is enabled, the -nodiff switch is ignored and the PSNR values are calculated.
-PSNR     Shows a graphical representation of the last 20 PSNR values at the bottom of the main window. In addition, the current PSNR values and the average PSNR from the beginning of the playback are shown in the info window.
-fps n    The program will target the playout rate specified by n. If the processing takes longer, it plays as fast as it can. The default value is 25 for PAL-standard frame rate playback speed.
-quiet    Disables the info window. You can also simply minimize it during playback.
-s n      The playback starts with frame n of the file f1. The default start frame is the first frame in the file.
Fig. 10.8: Methodology for PSNR calculation for video transmission over wireless links.

-e n      The playback ends with frame n of the file f1. The default end is the last frame in the file.
-CIF      Specifies that the input YUV sequence is in CIF format. The default format is QCIF.
-fr2 name Specifies the file that holds the numbers of the missing frames in file f2.
-fr3 name Specifies the file that holds the numbers of the missing frames in file f3.
10.2.2 Freeze File
The freeze file contains the frames that are to be frozen for the specified stream during the playback for the difference picture and PSNR calculation. Freeze files can be provided for both the encoded and the transmitted YUV video sequences. In the first case, a rate-adapting encoder setting may force a frame drop, whereas in the latter case, the damaged frame may be unrecoverable and will have to be discarded by the decoder software.
Fig. 10.9: Example for missing frames in YUV video streams.

In both cases, the resulting YUV streams will have fewer frames than the original video sequence. In the following figures, red colored items represent errors and losses, while green items represent fixed or concealed errors. Figure 10.9 illustrates an example of several losses in the YUV files. The encoded file misses frame n due to rate adaptation of the encoder, while the transmitted file additionally suffers a second frame loss of frame n + 1 due to transmission errors. The result is a loss of synchronization of the streams from the first error onwards. As another effect, the resulting PSNR comparison is no longer valid in terms of comparing the quality losses thereafter. To fix this lost synchronization, we incorporate the frame freezing error concealment technique, described in greater detail in Chapter 9 for video traces rather than actual video. Figure 10.10 illustrates both the freeze file and the resulting comparisons of the YUV frames. The synchronization is regained by keeping the last successfully received and decoded frame (prior to a missing or not decodeable frame) in memory.
The freeze file should be in plain text format, providing the numbers of the missing frames, starting from 0 for the first frame of the sequence. Each frame number should be on a new line. No freeze file can be provided for the original sequence, since we assume that this is the video sequence without any missing frames. For the previous example outlined in Figures 10.9 and 10.10, the freeze file would have the contents given in Table 10.3.

Fig. 10.10: Example for freeze file utilization to regain synchronization between the different YUV video streams.

Table 10.3: Exemplary freeze file content for the example outlined in Figures 10.9 and 10.10.
Freeze File 1 (Encoding losses)    Freeze File 2 (Transmission losses)
n                                  n
                                   n+1
...                                ...
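A minimal sketch of how such a freeze file can be applied when pairing frames for the PSNR comparison is given below. The file layout (one missing frame number per line, counted from 0) follows the description above; the function names are illustrative, and we assume for simplicity that frozen frames are present as copies in the compared stream.

```python
def load_freeze_file(path):
    """Read the set of missing frame numbers (one per line, starting from 0)."""
    with open(path) as f:
        return {int(line) for line in f if line.strip()}

def frame_mapping(num_frames, missing):
    """Map each original frame number to the frame actually compared/displayed.

    A missing frame is replaced (frozen) by the last frame that was not missing.
    Entry n is None if no frame has been received yet at position n.
    """
    mapping, last_good = [], None
    for n in range(num_frames):
        if n not in missing:
            last_good = n
        mapping.append(last_good)
    return mapping
```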
10.3 RMSE and PSNR Calculator
This program calculates the difference between the individual Y-component pixels of a source video sequence (Source YUV) and an encoded video sequence (Enc. YUV).
Both videos have to be available in the uncompressed YUV 4:2:0 format in either the QCIF or the CIF video resolution. The program calculates the RMSE and, from the RMSE, the PSNR values as given in Chapter 4 between frames 1, 2, . . . , n of the two video sequences. The length (in case the two files have different numbers of frames) is determined by the original video sequence. The main program window allows the user to select the uncompressed YUV files and a comma-separated file to be created by the program, as illustrated in Figure 10.11. Pressing "Start" invokes the calculation of the values and creates the CSV output file. The values in the output file are stored as illustrated in Table 10.4. We note that this tool requires the Microsoft .NET Framework version 1.1.

Table 10.4: CSV file format for the RMSE and PSNR calculator tool.
Frame Number | RMSE | PSNR
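The per-frame computation that this tool performs can be sketched as follows for QCIF. The frame layout follows YUV 4:2:0 (a luminance plane of width*height bytes plus two quarter-size chrominance planes), and the peak value of 255 corresponds to 8-bit samples; this is our own illustration, not the tool's source code.

```python
import math

WIDTH, HEIGHT = 176, 144                 # QCIF; use 352, 288 for CIF
Y_SIZE = WIDTH * HEIGHT
FRAME_SIZE = Y_SIZE * 3 // 2             # YUV 4:2:0: Y plane plus two quarter-size planes

def rmse_psnr_to_csv(org_path, enc_path, csv_path):
    """Write frame number, RMSE, and PSNR of the Y component to a CSV file."""
    with open(org_path, "rb") as org, open(enc_path, "rb") as enc, open(csv_path, "w") as out:
        out.write("Frame Number,RMSE,PSNR\n")
        n = 0
        while True:
            f_org, f_enc = org.read(FRAME_SIZE), enc.read(FRAME_SIZE)
            if len(f_org) < FRAME_SIZE or len(f_enc) < FRAME_SIZE:
                break                    # stop at the end of the shorter sequence
            y_org, y_enc = f_org[:Y_SIZE], f_enc[:Y_SIZE]
            mse = sum((a - b) ** 2 for a, b in zip(y_org, y_enc)) / Y_SIZE
            rmse = math.sqrt(mse)
            psnr = 10 * math.log10(255 ** 2 / mse) if mse > 0 else float("inf")
            out.write(f"{n},{rmse:.4f},{psnr:.2f}\n")
            n += 1
```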
Fig. 10.11: Screenshot of the RMSE and PSNR calculator; currently the QCIF and CIF video resolutions are supported.

Table 10.5: Output file format for the MPEG-4 Frame Size Parser tool.
Frame | Type | Size [byte]
10.4 MPEG-4 Frame Size Parser
For MPEG-4 encodings in the simple profile, with single or multiple layers, the program parses the encoded video bit stream for the frame start delimiter 0000 0000 0000 0001 as outlined in the MPEG-4 standard. The bit stream therefore has to be in the reference ISO/MPEG format in order to be parsed correctly. In particular, there should be no further encapsulation into formats such as AVI, RTP, etc. The tool parses the encoded bit stream and provides the frame type with its occurrence in the compressed bit stream and the size of the frame in byte. The main program window illustrated in Figure 10.12 allows the user to select the compressed file for parsing. For output, a comma- or tab-separated file to be created by the program has to be selected. Pressing "Start" invokes the parser and creates the output file with the selected delimiter in the format given in Table 10.5. We note that this tool requires the Microsoft .NET Framework version 1.1.
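A rough sketch of such a start-code based parser is given below. We assume here the MPEG-4 VOP start code 0x000001B6 and read the frame type from the two bits following it (00 = I, 01 = P, 10 = B); these details are our reading of the MPEG-4 syntax and of the tool's description, not its actual source code.

```python
VOP_START = b"\x00\x00\x01\xb6"
VOP_TYPES = {0: "I", 1: "P", 2: "B"}

def parse_frame_sizes(path):
    """Yield (frame number, type, size in bytes) for an MPEG-4 elementary stream."""
    data = open(path, "rb").read()
    # Collect the offsets of all VOP start codes in the bit stream.
    offsets, pos = [], data.find(VOP_START)
    while pos != -1:
        offsets.append(pos)
        pos = data.find(VOP_START, pos + 1)
    for n, start in enumerate(offsets):
        end = offsets[n + 1] if n + 1 < len(offsets) else len(data)
        coding_type = data[start + 4] >> 6          # two bits after the start code
        yield n, VOP_TYPES.get(coding_type, "?"), end - start
```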
Fig. 10.12: Screenshot of the MPEG-4 parser.

Table 10.6: Output file format for the single layer Offset Distortion Calculator tool.
Frame Number | RMSE_n(d = 1) | RMSE_n(d = ...) | RMSE_n(d = d_max)
10.5 Offset Distortion Calculators
For details on how the offset distortion is calculated and how it can be used in networking research, we refer the interested reader to Chapter 9. Due to the relationship between the RMSE and PSNR, the two presented tools only provide the RMSE values, from which the PSNR values can be easily calculated as shown in Chapters 4 and 9. We note that the tools presented in this section require the Microsoft .NET Framework version 1.1.

10.5.1 Single Layers
The single layer Offset Distortion Calculator tool can be used to calculate the offset distortion values for the single layer, temporal scalable, and SNR scalable encodings. In the program window illustrated in Figure 10.13, the user has to select two YUV video sequences in the YUV 4:2:0 format. The program requires the unencoded original video (Org.) and the encoded and subsequently decoded (Enc.) video sequences. The output file will be written into a standard comma-separated value file (CSV) that allows easy access to the values in popular spreadsheet programs. Currently, the program supports calculation of the video offset distortion values as RMSE values for the QCIF and CIF resolutions, which must be selected by the user. The offset distortion values are calculated for each individual frame and are stored in the CSV output file as given in Table 10.6. The user can select the maximum number of offsets d_max for which the distortion values are calculated for each frame.
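The core loop of such an offset distortion calculation can be sketched as follows. We assume here, following our reading of Chapter 9, that RMSE_n(d) compares the encoded and decoded frame n, held frozen on the display, against the original frame n + d - 1; the helper names are illustrative and this is not the tool's actual code.

```python
import math

def rmse(y_a, y_b):
    """RMSE between two luminance planes given as byte strings of equal length."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(y_a, y_b)) / len(y_a))

def offset_distortion(org_frames, enc_frames, d_max):
    """RMSE_n(d) for d = 1..d_max: decoded frame n frozen in place of original n+d-1."""
    rows = []
    for n, y_enc in enumerate(enc_frames):
        row = []
        for d in range(1, d_max + 1):
            if n + d - 1 < len(org_frames):
                row.append(rmse(org_frames[n + d - 1], y_enc))
            else:
                row.append(None)            # offset runs past the end of the sequence
        rows.append(row)
    return rows
```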
Fig. 10.13: Screenshot of the Offset Distortion Calculator tool for individual layers; currently the QCIF and CIF video resolutions are supported.
Fig. 10.14: Screenshot of the video offset distortion calculator for the upsampled base layer frames in spatial scalability encodings.
10.5.2 Spatial Scalability
The spatial scalable Offset Distortion Calculator tool can be used to calculate the offset distortion values for the upsampled and re-displayed base layer frames of spatial scalable encodings. Currently, the program supports a QCIF frame resolution for the base layer and a CIF frame resolution for the enhancement layer. The base layer frames are upsampled without further processing, e.g., no filters are used to smooth out imperfections. In the program window illustrated in Figure 10.14, the user has to select two YUV video sequences in the YUV 4:2:0 format. The output file will be written into a standard comma-separated value file (CSV). The original unencoded enhancement layer is supposed to be in the CIF resolution (Org. EL). The encoded and subsequently decoded base layer is supposed to be in QCIF resolution (Enc. BL). The offset distortion values are calculated for each individual frame and are stored in the CSV output file as given in Table 10.7. The user can select the maximum number of offsets d_max for which the distortion values are calculated for each frame, similar to the single layer tool.

Table 10.7: Output file format for the spatial scalability Offset Distortion Calculator tool.
Frame Number | RMSE_n^UP(d = 1) | RMSE_n^UP(d = ...) | RMSE_n^UP(d = d_max)
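The plain pixel-replication upsampling described above (QCIF to CIF, no smoothing filter) can be sketched as follows; as before, this is an illustration under our assumptions rather than the tool's code.

```python
def upsample_2x(y_plane, width, height):
    """Double a luminance plane in both dimensions by pixel replication (no filtering)."""
    up = bytearray(4 * width * height)
    for row in range(height):
        src = y_plane[row * width:(row + 1) * width]
        dst_row = bytearray()
        for pix in src:
            dst_row += bytes((pix, pix))                                    # replicate horizontally
        up[2 * row * 2 * width:(2 * row + 1) * 2 * width] = dst_row
        up[(2 * row + 1) * 2 * width:(2 * row + 2) * 2 * width] = dst_row  # replicate vertically
    return bytes(up)

# A 176x144 (QCIF) Y plane becomes a 352x288 (CIF) plane that can be compared
# against the original enhancement layer frames with an RMSE computation as above.
```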
11 Outlook
There are a number of important developments under way in the areas of video compression and multimedia networking that will likely increase the importance of video traces for networking research in the near future. One important area of new developments is the efficient non-scalable video coding up to High Definition Television (HDTV) resolution in conjunction with increasing access network speeds. In particular, the fidelity range extensions (FRExt) of the H.264/AVC video codec allow for the efficient encoding of HDTV video [169, 170], while new networking technologies, such as Ethernet Passive Optical Networks (EPONs) [171, 172, 173, 174, 175, 176, 177, 178, 179] employed as access networks, provide high bandwidths that make Internet Protocol based Television (IPTV) in the HDTV format possible. The extended H.264/AVC codec allows for the efficient encoding of video up to HDTV resolution through a number of advanced coding techniques, such as multiple reference frames and generalized B-pictures that exploit redundancies over longer time horizons, as well as intra picture prediction of macroblocks that exploits spatial redundancies in a given video frame. With these advanced coding techniques, the extended H.264/AVC codec generates traffic that is very different from the older MPEG codecs: the average bit rate is often cut in half, but the burstiness is significantly increased due to the much larger scope for predictive encoding [169] and will pose unprecedented challenges for network transport. With the help of an extended trace library including traces of long videos up to HD resolution coded with the new H.264/AVC codec, efficient network transport mechanisms for these new encodings can be developed and assessed. For networking research at the frame level, video traces with the same structures as presented in this book can be employed for the new encodings.
Another important new development is the set of novel error resilience techniques introduced with the H.264/AVC codec. These novel error-resilience features, along with error-concealment mechanisms at the decoder, can significantly improve the video quality over error-prone (e.g., wireless) networks. As described in this book, we have designed the offset distortion concept for
estimating the received video quality after lossy network transport (a capability that previously required experiments with video bit streams). The offset distortion concept does not presently consider the specific error-resilience features of video codecs and is limited to error concealment by copying the last successfully received frame. The offset distortion concept, however, provides a useful basis for designing extended traces that account for the specific error-resilience features of codecs and error-concealment features of decoders, and we will make such extended traces available through the video trace library presented in this book.
Overall, the video trace concept has greatly facilitated networking research for over a decade now. A large range of video transport studies that previously required video coding expertise and equipment and were computationally very demanding can be conducted with video traces. No video coding expertise or equipment is needed; instead, trace files representing the video can be conveniently incorporated into networking studies. As the field of video coding advances, so will video traces, so as to allow the networking research community to keep up with the new video codecs and their characteristics and to account for the latest video codecs in networking research.
List of Abbreviations
4G        Fourth-generation wireless networks
4MV       Four motion vectors per macroblock mode
AC        Higher order coefficients from the DCT
ACC       AutoCorrelation Coefficient
ACF       AutoCorrelation Function
ACK       Acknowledgement
APP       APPlication specific information in RTCP
ASCII     American Standard Code for Information Interchange
AVC       Advanced Video Coding
B (frame) Bi-directionally predicted video frame
BL        Base Layer
BMA       Block Matching Algorithm
bps       Bits per second
BYE       Session termination packets in RTCP
CABAC     Context-Adaptive Binary Arithmetic Coder
CBR       Constant Bit Rate
CCIR      Now part of the ITU-R
CDN       Content Distribution Network
CH        Compressed Header
CIF       Common Intermediate Format (352 x 288 pixels)
CoQV      Coefficient of Quality Variation
CoV       Coefficient of Variation
CPU       Central Processing Unit
CSV       Comma-Separated Value
dB        Decibel
DC        Lowest order coefficient from the DCT
DCT       Discrete Cosine Transform
DVB-H     Digital Video Broadcasting - Handheld
DVB-T     Digital Video Broadcasting - Terrestrial
DVD       Digital Versatile Disc
EL        Enhancement Layer
EOB       End Of Block
EZBC      Embedded Zerotree Block Coding
FGS       Fine Granularity Scalability
fps       Frames per second
FTP       File Transfer Protocol
GB        Gigabyte
GoP       Group of Pictures
H.26x     Video standard family of the ITU
H.32x     Definition of protocols to provide audio-visual communication by the ITU-T
HDTV      High Definition TeleVision
HTML      HyperText Markup Language
HTTP      HyperText Transfer Protocol
I (frame) Intra-coded video frame
IETF      Internet Engineering Task Force
IGMP      Internet Group Management Protocol
IP        Internet Protocol
ISO       International Organization for Standardization
ITU       International Telecommunication Union
ITU-R     ITU Radiocommunication Sector
ITU-T     ITU Telecommunication Standardization Sector
kbps      Kilobits per second
MB        MacroBlock
Mbps      Megabits per second
MC        Motion Compensation
MDC       Multiple Description Coding
MIMO      Multiple Input Multiple Output
MOS       Mean Opinion Score
MPEG      Moving Picture Experts Group
MPEG-x    Video standard family of the MPEG
MSE       Mean Squared Error
MTU       Maximum Transfer Unit
MV        Motion Vector
NAL       Network Adaptation Layer
NTSC      National Television System(s) Committee
OFDM      Orthogonal Frequency-Division Multiplexing
OH        OverHead
OSI       Open Systems Interconnection
P (frame) Predicted video frame
PAL       Phase Alternation Line
PDA       Personal Digital Assistant
PFGS      Progressive Fine Grain Scalability
pRMSE     Perceptually adjusted RMSE
PSNR      Peak Signal-to-Noise Ratio
QCIF      Quarter Common Intermediate Format
QoS       Quality of Service
R/S       Range/Scale
RAM       Random Access Memory
RD        Rate-Distortion
RFC       Request For Comment
RGB       Red, Green, Blue colorspace
RMSE      Rooted Mean Squared Error
RoHC      Robust Header Compression
RR        Receiver Reports in RTCP
RSVP      Resource ReSerVation Protocol
RTCP      RTP Control Protocol
RTP       Real-time Transport Protocol
RTSP      Real-time Streaming Protocol
RVLC      Reversible Variable Length Coding
SAP       Session Announcement Protocol
SCSI      Small Computer System Interface
SDES      Source Descriptions in RTCP
SDP       Session Description Protocol
sec       Second
SIP       Session Initiation Protocol
SNR       Signal-to-Noise Ratio
SR        Sender Reports in RTCP
Super8    Super-8 mm film
TCP       Transfer Control Protocol
TM5       Test Model 5 rate control algorithm
TV        TeleVision
UDP       User Datagram Protocol
UMTS      Universal Mobile Telecommunications System
VBR       Variable Bit Rate
VCL       Video Coding Layer
VCR       Video Cassette Recorder
VHS       Video Home System
VLC       Variable Length Coding
VQM       Video Quality Metric
YIQ       Color space formerly used in the NTSC television standard
YUV       Colorspace defined by the luminance component (Y) and the two chrominance components hue (U) and intensity (V)
Acknowledgements
The work this book is based on is supported in part by the National Science Foundation under Grant No. Career ANI-0133252 and Grant No. ANI-0136774, in part by the State of Arizona through the IT301 initiative, and in part by a matching grant and a special pricing grant from Sun Microsystems. We thank Werner Fitzek for making the photos presented in Figures 2.1 and 2.2. We would like to thank a great number of people who contributed to this project:
• Ahmed Salih, Technical University of Berlin, Germany
• Alia Sanders, Arizona State University, United States
• Andrea Carpenter, Arizona State University, United States
• Ben Abda Moncef, Technical University of Berlin, Germany
• Beshan Kulapala, Arizona State University, United States
• Birgit Vetter, Technical University of Berlin, Germany
• Bo Nygaard Bai, Aalborg University, Denmark
• Charles Whitlatch, Arizona State University, United States
• Dimitri Tseronis, Technical University of Berlin, Germany
• Felix Carroll, Arizona State University, United States
• Finn Hybjerg Hansen, Aalborg University, Denmark
• Geert Van der Auweira, Arizona State University, United States
• Henrik Benner, Aalborg University, Denmark
• Jeremy Lassetter, Arizona State University, United States
• Jerome Damers, Arizona State University, United States
• Josh Weber, Arizona State University, United States
• Kai Ellies, Technical University of Berlin, Germany
• Khaled Athamneh, Technical University of Berlin, Germany
• Kyle Williams, Arizona State University, United States
• Laura Main, Arizona State University, United States
• Lina Karam, Arizona State University, United States
• Mike Hains, Arizona State University, United States
• Miroslawa Malach, Technical University of Berlin, Germany
• Mohammad Kandil, Technical University of Berlin, Germany
• Ole Benner, Aalborg University, Denmark
• Osama Lotfallah, Arizona State University, United States
• Per Mejdal Rasmussen, Aalborg University, Denmark
• Philippe de Cuetos, ENST, Paris, France
• Prashant David, Arizona State University, United States
• Sampath Ratnam, Arizona State University, United States
• Sethuraman Panchanathan, Arizona State University, United States
• Svend Erik Volsgaard, Aalborg University, Denmark
• Tatiana K. Madsen, Aalborg University, Denmark
• Thomas Kroener, Technical University of Berlin, Germany
• Torben H. Knudsen, Aalborg University, Denmark
• Trang Nguyen, Arizona State University, United States
• Tyler Barnett, Arizona State University, United States
• Victorin Kagoue Mougoue, Technical University of Berlin, Germany
• Zephyr Magezi, Arizona State University, United States
References
1. M. Dai and D. Loguinov, “Wavelet and time-domain modeling of multi-layer VBR video traffic,” in Proc. of Packet Video, Irvine, CA, Dec. 2004. 2. ——, “Analysis and modeling of MPEG-4 and H.264 multi-layer video traffic,” in Proc. of IEEE INFOCOM, Miami, FL, Mar. 2005, pp. 2257–2267. 3. I. Dalgic and F. A. Tobagi, “Characterization of quality and traffic for various video encoding schemes and various encoder control schemes,” Stanford University, Departments of Electrical Engineering and Computer Science, Tech. Rep. CSL–TR–96–701, Aug. 1996. 4. M. Frey and S. Nguyen-Quang, “A gamma–based framework for modeling variable–rate MPEG video sources: The GOP GBAR model,” IEEE/ACM Transactions on Networking, vol. 8, no. 6, pp. 710–719, Dec. 2000. 5. M. Garrett and W. Willinger, “Analysis, modeling and generation of self– similar VBR video traffic,” in Proceedings of ACM Sigcomm, London, UK, Sept. 1994, pp. 269–280. 6. D. P. Heyman and T. V. Lakshman, “Source models for VBR broadcast video traffic,” IEEE/ACM Transactions on Networking, vol. 4, pp. 40–48, Jan. 1996. 7. M. Krunz and S. Tripathi, “On the characterization of VBR MPEG streams,” in Proceedings of ACM SIGMETRICS, Seattle, WA, June 1997, pp. 192–202. 8. C. H. Liew, C. K. Kodikara, and A. M. Kondoz, “MPEG-encoded variable bit-rate video traffic modelling,” IEE Proceedings Communications, vol. 152, no. 5, pp. 749–756, Oct. 2005. 9. D. Lucantoni, M. Neuts, and A. Reibman, “Methods for performance evaluation of VBR video traffic models,” IEEE/ACM Transactions on Networking, vol. 2, no. 2, pp. 176–180, Apr. 1994. 10. U. K. Sarkar, S. Ramakrishnan, and D. Sarkar, “Modeling full-length video using markov-modulated gamma-based framework,” IEEE/ACM Transactions on Networking, vol. 11, no. 4, pp. 638–649, 2003. 11. ——, “Study of long duration MPEG-trace segmentation methods for developing frame size based traffic models,” Computer Networks, vol. 44, no. 2, pp. 177–188, 2004. 12. ITU-500-R, “Recommendation BT.500–8 — Methodology for the subjective assessment of the quality of television pictures,” 1998.
13. A. Basso, I. Dalgic, F. Tobagi, and C. Lambrecht, “Study of MPEG–2 coding performance based on a perceptual quality metric,” in Proceeding of Picture Coding Symposium, Melbourne, Australia, Mar. 1996. 14. A. Webster, C. Jones, M. Pinson, S. Voran, and S. Wolf, “An objective video quality assessment system based on human perception,” in Proceedings of SPIE Human Vision, Visual Processing and Digital Display, Vol. 1913, 1993, pp. 15– 26. 15. S. Winkler, “A perceptual distortion metric for digital color video,” in Proceedings of SPIE Human Vision and Electronic Imaging, Vol. 3644, Jan. 1999, pp. 175–184. 16. R. Aravind, M. Civanlar, and A. Reibman, “Packet loss resilience of MPEG–2 scalable video coding algorithms,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 6, no. 5, pp. 426–435, Oct. 1996. 17. H. Liu and M. E. Zarki, “Performance of H.263 video transmission over wireless networks using hybrid ARQ,” IEEE Journal on Selected Areas in Communications, vol. 15, no. 9, pp. 1775–1786, Dec. 1997. 18. W. Luo and M. ElZarki, “Analysis of error concealment schemes for MPEG–2 video transmission over ATM based networks,” in Proceedings of SPIE Visual Communications and Image Processing 1995, Taiwan, May 1995, pp. 102–108. 19. ——, “MPEG2Tool: A toolkit for the study of MPEG–2 video transmission over ATM based networks,” Department of Electrical Engineering, University of Pennsylvania, Tech. Rep., 1996. 20. R. Puri, K. Lee, K. Ramachandran, and V. Bharghavan, “An integrated source transcoding and congestion control paradigm for video streaming in the internet,” IEEE Transactions on Multimedia, vol. 3, no. 1, pp. 18–32, Mar. 2001. 21. J. Shin, J. W. Kim, and C.-C. J. Kuo, “Quality–of–service mapping mechanism for packet video in differentiated services network,” IEEE Transactions on Multimedia, vol. 3, no. 2, pp. 219–231, June 2001. 22. W.-C. Feng, Buffering Techniques for Delivery of Compressed Video in Video– on–Demand Systems. Kluwer Academic Publisher, 1997. 23. M. W. Garret, “Contributions toward real-time services on packet networks,” Ph.D. dissertation, Columbia University, May 1993. 24. M. Krunz, R. Sass, and H. Hughes, “Statistical characteristics and multiplexing of MPEG streams,” in Proceedings of IEEE Infocom ’95, April 1995, pp. 455– 462. 25. O. Rose, “Statistical properties of MPEG video traffic and their impact on traffic modelling in ATM systems,” University of Wuerzburg, Institute of Computer Science, Tech. Rep. 101, Feb. 1995. 26. F. Fitzek and M. Reisslein, “A prefetching protocol for continuous media streaming in wireless environments,” IEEE Journal on Selected Areas in Communications, vol. 19, no. 10, pp. 2015–2028, Oct. 2001. 27. ——, “MPEG-4 and H.263 video traces for network performance evaluation,” IEEE Network, vol. 15, no. 6, pp. 40–54, November/December 2001, video traces available at http://trace.eas.asu.edu. 28. P.Seeling and M. Reisslein, “The rate variability-distortion (vd) curve of encoded video and its impact on statistical multiplexing,” IEEE Transactions on Broadcasting, vol. 51, no. 4, pp. 473–492, Dec. 2005. 29. K. Jack, Video Demystified: A Handbook for the Digital Engineer, 2nd ed. San Diego, CA: HighText Interactive, 1996.
30. M. Ghanbari, Video Coding – An Introduction to Standard Codecs. The Institution of Electrical Engineers, 1999. 31. A.N. Netravali and B.G. Haskell, Digital Pictures: Representation, Compression, and Standards. Plenum Press, 1995, ch. 3. 32. T. Sikora, “MPEG Digital Video Coding Standards,” in Digital Electronics Consumer Handbook. McGraw Hill, 1997. 33. Z. Xiong, K. Ramchandran, M.T. Orchard, and Y.–Q. Zhang, “A Comparative Study of DCT– and Wavelet–Based Image Coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 9, no. 5, August 1999. 34. W. Chen, C. Smith, and S. Fralick, “A fast computational algorithm for the discrete cosine transform,” IEEE Transcript on Communications, pp. 1004– 1009, Sept. 1977. 35. I. Dinstein, K. Rose, and A. Heimann, “Variable block–size transform image coder,” IEEE Transactions on Communications, pp. 2073–2078, Nov. 1990. 36. Joint Video Team of ITU-T and ISO/IEC JTC 1, “Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 — ISO/IEC 14496–10 AVC),” 2004. 37. M. J. Riley and I. E. Richardson, Digital Video Communications. Artech House, 1997. 38. D. S. Turaga and T. Chen, Compressed Video over Networks, ser. The Signal Processing Series. New York, NY: Marcel Dekker, Sept. 2000, ch. Fundamentals of Video Coding: H.263 as an Example, pp. 3–34. 39. Microsoft, “Mpeg-4 visual codec version 2.5.0,” Aug. 2004. 40. “Video traces for network performance evaluation,” Online Website. [Online]. Available: http://www.eas.asu.edu/trace 41. T. Lakshman, A. Ortega, and A. Reibman, “VBR video: Tradeoffs and potentials,” Proceedings of the IEEE, vol. 86, no. 5, pp. 952–973, May 1998. 42. A. Ortega, Compressed Video Over Networks. Marcel Dekker, Sept. 2000, ch. Variable Bit Rate Video Coding, pp. 343–383. 43. K. Dovstam, “Video coding in h.26l,” Ph.D. dissertation, Royal Institute of Technology, Stockholm, Sweden, 2000. 44. D. Lelewer and D. Hirschberg, “Data compression,” ACM Computing Surveys (CSUR), vol. 19, no. 3, pp. 261–296, 1987. 45. P. Howard and J. Vitter, “Analysis of arithmetic coding for data compression,” Information Processing and Management, vol. 28, no. 6, pp. 749–764, 1992. 46. T. Wiegand, “H.26L Test Model Long–Term Number 9 (TML-9) draft0,” ITUT Study Group 16, Dec. 2001. 47. Y. Shi and H. Sun, Image and Video Compression for Multimedia Engineering. CRC Press, 2000. 48. T. Koga, K. Linuma, A. Hirano, Y. Iijima, and T.Ishiguro, “Motion compensated interframe coding for video conferencing,” in Proceddings of National Telecommunication Conference (NTC), vol. G5. 3, New Orleans, LA, Dec. 1981, pp. 1–5. 49. M. Bierling, “Displacement estimation by hierarchical block matching,” Visual Communications and Image Processing, vol. 1001, pp. 942–951, 1988. 50. MPEG–1, “Coding of moving pictures and associated audio for digital storage media at up to 1.5 mbps,” ISO/IEC 11172, 1993. 51. R. Rejaie, D. Estrin, and M. Handley, “Quality adaptation for congestion controlled video playback over the internet,” in Proceedings of ACM SIGCOMM, Cambridge, MA, Sept. 1999, pp. 189–200.
258
References
52. D. Saparilla and K. W. Ross, “Optimal streaming of layered video,” in Proceedings of IEEE INFOCOM, Tel Aviv, Israel, Mar. 2000, pp. 737–746. 53. G. Reyes, A. R. Reibman, S.-F. Chang, and J. Chuang, “Error–resilient transcoding for video over wireless channels,” IEEE Journal on Selected Areas in Communications, vol. 18, no. 6, pp. 1063–1074, June 2000. 54. T. Shanableh and M. Ghanbari, “Heterogeneous video transcoding to lower spatio-temporal resolutions and different encoding formats,” IEEE Transactions on Multimedia, vol. 2, no. 2, pp. 101–110, June 2000. 55. M. Ghanbari, “Layered coding,” in Compressed Video over Networks, M.-T. Sun and A. R. Reibman, Eds. Marcel Dekker, 2001, pp. 251–308. 56. MPEG–2, “Generic coding of moving pictures and associated audio information,” ISO/IEC 13818–2, 1994, draft International Standard. 57. F. Pereiera and E. Touradj, The MPEG–4 Book. Upper Saddle River, NJ: Prentice Hall, 2002. 58. M. Reisslein, J. Lassetter, S. Ratnam, O. Lotfallah, F. Fitzek, and S. Panchanathan, “Traffic and quality characterization of scalable encoded video: A large-scale trace-based study,” Arizona State University, Dept. of Electrical Eng., Tech. Rep., Dec. 2002. 59. W. Li, “Overview of Fine Granularity Scalability in MPEG–4 Video Standard,” IEEE Trans. on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 301–317, March 2001. 60. ISO/IEC JTC1/SC29/WG11 Information Technology - Generic Coding of Audio-Visual Objects : Visual ISO/IEC 14496–2 / Amd X, December 1999. 61. H. Radha, M. van der Schaar, and Y. Chen, “The mpeg-4 fine-grained scalable video coding method for multimedia streaming over ip,” IEEE Transactions on Multimedia, vol. 3, no. 1, pp. 53–68, Mar. 2001. 62. C. Buchner, T. Stockhammer, D. Marpe, G. Blattermann, and G. Heising, “Progressive texture video coding,” in Proceedings of IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2001, pp. 1813– 1816. 63. H.-C. Huang, C.-N. Wang, and T. Chiang, “A robust fine granularity scalability using trellis-based predictive leak,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 6, pp. 372–385, June 2002. 64. R. Kalluri and M. Schaar, “Fine granular scalability for H.26L-based video streaming,” in Proceedings of IEEE International Conference on Consumer Electronics (ICCE), 2002, pp. 346–347. 65. E. Lin, C. Podilchuk, A. Jacquin, and E. Delp, “A hybrid embedded video codec using base layer information for enhancement layer coding,” in Proceedings of IEEE International Conference on Image Processing (ICIP), 2001, pp. 1005– 1008. 66. A. Luthra, R. Gandhi, K. Panusopone, K. Mckoen, D. Baylon, and L. Wang, “Performance of MPEG-4 profiles used for streaming video,” in Proceedings of Workshop and Exhibition on MPEG-4, 2001, pp. 103–106. 67. S. Parthasaraty and H. Radha, “Optimal rate control methods for fine granularity scalable video,” in Proceedings of IEEE International Conference on Image Processing (ICIP), vol. 2, Barcelona, Spain, Sept. 2003, pp. 805–808. 68. R. Rajendran, M. Schaart, and S.-F. Chang, “FGS+: optimizing the joint SNRtemporal video quality in MPEG-4 fine grained scalable coding,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), 2002, pp. 445–448.
References
259
69. A. R. Reibman, U. Bottou, and A. Basso, “Dct-based scalable video coding with drift,” in Proceedings of IEEE International Conference on Image Processing (ICIP), vol. 2, Thessaloniki, Greece, Oct. 2001, pp. 989–992. 70. M. Schaar and Y.-T. Lin, “Content-based selective enhancement for streaming video,” in Proceedings of IEEE International Conference on Image Processing (ICIP), 2001, pp. 977–980. 71. M. Schaar and H. Radha, “A hybrid temporal-SNR fine-granular scalability for internet video,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 318–331, Mar. 2001. 72. Q. Wang, Z. Xiong, F. Wu, and S. Li, “Optimal rate allocation for progressive fine granularity scalable video coding,” IEEE Signal Processing Letters, vol. 9, no. 2, pp. 33–39, Feb. 2002. 73. F. Wu, S. Li, and Y.-Q. Zhang, “A framework for efficient progressive fine granularity scalable video coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 332–344, Mar. 2001. 74. X. Zhao, Y. He, S. Yang, and Y.Zhong, “‘rate allocation of equal image quality for MPEG-4 FGS video streaming,” in Proceedings of Packet Video Workshop, Pittsburgh, PA, Apr. 2002. 75. ISO/IEC JTC1/SC29/WG11 N4791, “Report on mpeg–4 visual fine granularity scalability tools verification tests,” May 2002. 76. T. Kim and M. Ammar, “Optimal quality adaptation for MPEG-4 fine-grained scalable video,” in Proceedings of IEEE Infocom, San Francisco, CA, Apr. 2003, pp. 641–651. 77. Q. Zhang, W. Zhu, and Y.-Q. Zhang, “Resource allocation for multimedia streaming over the internet,” IEEE Transactions on Multimedia, vol. 3, no. 3, pp. 339–355, Sept. 2001. 78. D. Wu, Y. T. Hou, W. Zhu, Y.-Q. Zhang, and J. M. Peha, “Streaming video over the internet: Approaches and directions,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 11, no. 3, pp. 1–20, Mar. 2001. 79. V. Goyal, “Multiple description coding: Compression meets the network,” IEEE Signal Processing Magazine, vol. 18, pp. 74–91, Sept. 2001. 80. S. Ekmekci and T. Sikora, “Unbalanced quantized multiple description video transmission using path diversity,” in IS&T/SPIE’s Electronic Imaging 2003, 2003, santa Clara, CA. 81. N. Gogate, D. M. Chung, S. S. Panwar, and Y. Wang, “Supporting image and video applications in a multihop radio environment using path diversity and multiple description coding,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 12, no. 9, pp. 777–792, Sept. 2002. 82. N. Gogate and S. S. Panwar, “Supporting video/image applications in a mobile multihop radio environment using route diversity,” in Proceedings of IEEE International Conference on Communications, vol. 3, Vancouver, Canada, June 1999, pp. 1701–1706. 83. Z. Ji, Q. Zhang, W. Zhu, J. Lu, and Y.-Q. Zhang, “Video broadcasting over mimo-ofdm systems,” in Proceedings of IEEE International Symposium on Circuits and Systems, vol. 2, May 2003, pp. 844–847. 84. Y. Altunba¸sak, N. Kamacı, and R. M. Mersereau, “Multiple description coding with multiple transmit and receive antennas for wireless channels: The case of digital modulation,” in Proceedings of IEEE Global Telecommunications Conference (GLOBECOM), vol. 6, San Antonio, TX, Nov. 2001, pp. 3272–3276.
260
References
85. S. Lin, Y. Wang, S. Mao, and S. Panwar, “Video transport over ad-hoc networks using multiple paths,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), vol. 1, Scottsdale, AZ, May 2002, pp. 57–60. 86. M. Pereira, M. Antonini, and M. Barlaud, “Multiple description video coding for umts,” in Proceedings of Picture Coding Symposium, Saint Malo, France, Apr. 2003. 87. R. Puri, K. W. Lee, K. Ramchandran, and V. Bharghavan, “An integrated source transcoding and congestion control paradigm for video streaming in the internet,” IEEE Transactions on Multimedia, vol. 3, no. 1, pp. 18–32, Apr. 2001. 88. J. G. Apostolopoulos, W. Tan, and S. J. Wee, “Performance of a multiple description streaming media content delivery network,” in Proceedings of IEEE International Conference on Image Processing (ICIP), vol. 2, Rochester, NY, Sept. 2002, pp. 189–192. 89. F. H. Fitzek, H. Yomo, P. Popovski, R. Prasad, and M. Katz, “Source descriptor selection schemes for multiple description coded services in 4g wireless communication systems,” in Proceedings of The First IEEE International Workshop on Multimedia Systems and Networking (WMSN05) in conjunction with The 24th IEEE International Performance Computing and Communications Conference (IPCCC 2005), Phoenix, AZ, Apr. 2005. 90. R. Prasad and S. Hara, Multicarrier Techniques for 4G Mobile Communications, ser. Universal Personal Communications. Artech House, 2003. 91. D. S. Taubman and M. Marcellin, JPEG2000 Image compression fundamentals standards and practice. Kluwer Academic Publishers, July 2002. 92. S.-T. Hsiang and J. Woods, “Embedded video coding using invertible motion compensated 3-d subband/wavelet filter bank,” Signal Processing: Image Communications, vol. 16, pp. 705–724, May 2001. 93. S.-T. Hsiang, “Highly scalable subband/wavelet image and video coding,” Ph.D. dissertation, Rensselaer Polytechnic Institute, NY, May 2002. 94. S.-T. Hsiang and J. W. Woods, “Embedded image coding using zeroblocks of subband/wavelet coefficients and context modeling,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), vol. 3, Geneva, Switzerland, May 2000, pp. 662–665. 95. G. Sullivan, T. Wiegand, and T. Stockhammer, “Using the draft h.26l video coding standard for mobile applications,” in IEEE International Conference on Image Processing (ICIP), vol. 3, Thessaloniki, Greece, Oct. 2001, pp. 573–576. 96. A. M. Law and W. D. Kelton, Simulation, Modeling and Analysis, 3rd ed. McGraw Hill, 2000. 97. C. Chatfield, The Analysis of Time Series: An Intoduction, 4th ed. Chapman and Hall, 1989. 98. G. Box and G. Jenkins, Time Series Analysis: Forecasting and Control. Holden-Day, 1976. 99. J. Beran, Statistics for Long–Memory Processes, J. Beran, Ed. London: Chapman and Hall, 1994. 100. J. Beran, R. Sherman, M. S. Taqqu, and W. Willinger, “Long–range dependence in variable–bit–rate video traffic,” IEEE Transactions on Communications, vol. 43, no. 2/3/4, pp. 1566–1579, February/March/April 1995. 101. M. Krunz, “On the limitations of the variane–time test for inference of long– range dependence,” in Proceedings of IEEE Infocom 2001, Anchorage, Alaska, Apr. 2001, pp. 1254–11 260.
References
261
102. B. B. Mandelbrot and M. S. Taqqu, “Robust R/S analysis of long–run serial correlations,” in Proceedings of 42nd Session ISI, Vol. XLVIII, Book 2, 1979, pp. 69–99. 103. H. Hurst, “Long–Term Storage Capacity of Reservoirs,” Proc. American Society of Civil Engineering, vol. 76, no. 11, 1950. 104. W. E. Leland, M. S. Taqq, W. Willinger, and D. V. Wilson, “On the selfsimilar nature of Ethernet traffic,” in ACM SIGCOMM, D. P. Sidhu, Ed., San Francisco, California, 1993, pp. 183–193. 105. D. Veitch and P. Abry, “A wavelet based joint estimator of the parameters of long–range dependence,” IEEE Transactions on Information Theory, vol. 45, no. 3, pp. 878–897, Apr. 1999. [Online]. Available: ˜ http://www.emulab.ee.mu.oz.au/darryl 106. P. Abry and D. Veitch, “Wavelet analysis of long–range–dependent traffic,” IEEE Transactions on Information Theory, vol. 44, no. 1, pp. 2–15, Jan. 1998. 107. P. Abry, D. Veitch, and P. Flandrin, “Long–range–dependence: Revisiting aggregation with wavelets,” Journal of Time Series Analysis, vol. 19, no. 3, pp. 253–266, May 1998. 108. M. Roughan, D. Veitch, and P. Abry, “Real–time estimation of the parameters of long–range dependence,” IEEE/ACM Transactions on Networking, vol. 8, no. 4, pp. 467–478, Aug. 2000. 109. P. Abry, P. Flandrin, M. S. Taqqu, and D. Veitch, Self Similar Network Traffic Analysis and Performance Evaluation. New York, NY: Wiley, 2000, ch. Wavelets for the analysis, estimation and synthesis of scaling data, pp. 39–88. 110. ——, Long Range Dependence: Theory and Applications. Birkhauser, 2002, ch. Self–Similarity and Long–Range Dependence through the wavelet lens. 111. A. Feldmann, A. C. Gilbert, W. Willinger, and T. Kurtz, “The changing nature of network traffic: Scaling phenomena,” Computer Communciation Review, vol. 28, no. 2, Apr. 1998. 112. J. Gao and I. Rubin, “Multiplicative multifractal modeling of long–range– dependent network traffic,” International Journal of Communication Systems, vol. 14, pp. 783–801, 2001. 113. A. C. Gilbert, W. Willinger, , and A. Feldmann, “Scaling analysis of conservative cascades, with applications to network traffic,” IEEE Transactions on Information Theory, vol. 45, no. 3, pp. 971–991, Apr. 1999. 114. R. H. Riedi, M. S. Crouse, V. J. Ribeiro, and R. G. Baraniuk, “A multifractal wavelet model with applications to network traffic,” IEEE Transactions on Information Theory, vol. 45, no. 3, pp. 992–1018, Apr. 1999. 115. J. Walter, “bttvgrab.” [Online]. Available: http://www.garni.ch/bttvgrab/ 116. F. Bellard, “FFMPEG - multimedia system ver. 0.4.9-pre1,” Online, Feb. 2005. [Online]. Available: http://ffmpeg.sourceforge.net/index.php 117. I. 14496, “Video Reference Software, Microsoft–FDAM1–2.3–001213.” 118. T. M. E. Committee, “MPEG–2 Video Test Model 5, ISO/IEC JTC1/SC29WG11 MPEG93/457,” Apr. 1993. 119. Q. Zhang, W. Zhu, and Y.-Q. Zhang, “Resource allocation for multimedia streaming over the internet,” IEEE Transactions on Multimedia, vol. 3, no. 3, pp. 339–355, Sept. 2001. 120. C. Suehring, “H.26l software coordination,” http://bs.hhi.de/ suehring/tml/. 121. F. H. Fitzek and M. Reisslein, “Mpeg–4 and h.263 video traces for network performance evaluation,” Technical University of Berlin, Tech. Rep., 2000, tKN– 00–06.
262
References
122. A. Gereoffy, “Mplayer,” Online, May 2003, version 0.90. [Online]. Available: http://www.mplayerhq.hu 123. F. H. P. Fitzek and M. Reisslein, “Video traces for network performance evaluation: Yuv 4:2:0 video sequences,” http://trace.eas.asu.edu/yuv/yuv.html. 124. M. Reisslein, J. Lassetter, S. Ratnam, O. Lotfallah, F. H. Fitzek, and S. Panchanathan, “Traffic and quality characterization of scalable encoded video: A large-scale trace-based study, part 2: Statistical analysis of single-layer encoded video,” Arizona State University, Dept. of Electrical Engineering, Technical Report, Dec. 2002. 125. ——, “Traffic and quality characterization of scalable encoded video: A largescale trace-based study, part 3: Statistical analysis of temporal scalable encoded video,” Arizona State University, Dept. of Electrical Engineering, Technical Report, Dec. 2002. 126. ——, “Traffic and quality characterization of scalable encoded video: A largescale trace-based study, part 4: Statistical analysis of spatial scalable encoded video,” Arizona State University, Dept. of Electrical Engineering, Technical Report, Aug. 2003. 127. A. M. et.al., “Video quality experts group: Current results and future directions,” in Proc. SPIE Visual Communications and Image Processing, vol. 4067, Perth, Australia, June 2000, pp. 742–753. 128. S. Winkler, “Vision models and quality metrics for image processing applications,” Ph.D. dissertation, EPFL, Switzerland, 2000. 129. F. Fitzek and M. Reisslein, “MPEG–4 and H.263 video traces for network performance evaluation,” IEEE Network, vol. 15, no. 6, pp. 40–54, November/December 2001. [Online]. Available: http://www.eas.asu.edu/trace 130. M. Reisslein, J. Lassetter, S. Ratnam, O. Lotfallah, F. H. Fitzek, and S. Panchanathan, “Traffic and quality characterization of scalable encoded video: A large-scale trace-based study, part 1: Overview and definitions,” Arizona State University, Dept. of Electrical Engineering, Technical Report, Dec. 2002. 131. S. Shenker, C. Partridge, and R. Guerin, “Specification of Guaranteed Quality of Service,” RFC 2212 (Proposed Standard), Sept. 1997. [Online]. Available: http://www.ietf.org/rfc/rfc2212.txt 132. W. Feng and J. Rexford, “A comparison of bandwidth smoothing techniques for the transmission of prerecorded compressed video,” in Proceedings of IEEE Infocom, Kobe, Japan, Apr. 1997, pp. 58–67. 133. M. Krunz, “Bandwidth allocation strategies for transporting variable–bit–rate video traffic,” IEEE Communications Magazine, vol. 37, no. 1, pp. 40–46, Jan. 1999. 134. Z. He and S. K. Mitra, “A unified rate-distortion analysis framework for transform coding,” IEEE Transaction on Circuits and Systems for Video Technology, vol. 11, no. 12, pp. 1221–1236, Dec. 2001. 135. M. Reisslein, J. Lassetter, S. Ratnam, O. Lotfallah, F. H. Fitzek, and S. Panchanathan, “Traffic and quality characterization of scalable encoded video: A large-scale trace-based study,” Arizona State University, Dept. of Electrical Engineering, Technical Report Series, Dec. 2002. 136. L. Sachs, Angewandte Statistik. Springer, 2002. 137. Z. Xiong, K. Ramchandran, M. T. Orchard, and Y.-Q. Zhang, “A comparative study of dct and wavelet based coding,” in Proceedings of IEEE International Symposium on Circuits and Systems (ISCAS), vol. 4, Monterey, CA, May 1998, pp. 273–276.
References
263
138. C. Chatfield, The Analysis of Time Series: An Intoduction, 4th ed. Chapman and Hall, 1989. 139. I. 7498–1, “Information technology - open systems interconnection - basic reference model: The basic model, second edition,” Nov. 1994. 140. G. S. Fishman, Principles of Discrete Event Simulation. Wiley, 1991. 141. M. Chesire, A. Wolman, G. M. Voelker, and H. M. Levy, “Measurement and analysis of a streaming media workload,” in Proceedings of USITS, San Francisco, CA, Mar. 2001. 142. G. K. Zipf, Human Behavior and Principle of Least Effort: An Introduction to Human Ecology. Cambridge, MA: Addison–Wesley, 1949. 143. A. Dan, D. Sitaram, and P. Shahabuddin, “Dynamic batching policies for an on-demand video server,” Multimedia Systems, vol. 4, no. 3, pp. 112–121, 1996. 144. L. Breslau, P. Cao, L. Fan, G. Philips, and S. Shenker, “Web caching and zipflike distributions: Evidence and implications,” in Proc. of IEEE INFOCOM, New York, NY, Mar. 1999, pp. 126–134. 145. Y. Wang and Q. Zhu, “Error control and concealment for video communication: A review,” Proceedings of the IEEE, vol. 86, no. 5, pp. 974–997, May 1998. 146. A.-V. T. W. Group, H. Schulzrinne, S. Casner, R. Frederick, and V. Jacobson, “RTP: A Transport Protocol for Real-Time Applications,” RFC 1889 (Proposed Standard), Jan. 1996, obsoleted by RFC 3550. [Online]. Available: http://www.ietf.org/rfc/rfc1889.txt 147. Y. Kikuchi, T. Nomura, S. Fukunaga, Y. Matsui, and H. Kimata, “RTP Payload Format for MPEG-4 Audio/Visual Streams,” RFC 3016 (Proposed Standard), Nov. 2000. [Online]. Available: http://www.ietf.org/rfc/rfc3016.txt 148. F. Fitzek, S. Hendrata, P. Seeling, and M. Reisslein, Wireless Internet – Header Compression Schemes for Wireless Internet Access, ser. Electrical Engineering & Applied Signal Processing Series. CRC Press, Mar. 2004, ISBN 0849316316 Chapter 10, pp. 1–24. 149. F. Fitzek, M. Zorzi, P. Seeling, and M. Reisslein, “Video and audio trace files of pre-encoded video content for network performance measurements,” in Proc. of IEEE Consumer Communications and Networking Conference (CCNC), Las Vegas, NV, Jan. 2004, pp. 245–250. 150. N. Duffield, K. Ramakrishnan, and A. Reibman, “Issues of quality and multiplexing when smoothing rate adaptive video,” IEEE Transactions on Multimedia, vol. 1, no. 4, pp. 53–68, Dec. 1999. 151. L. Schruben, “Initialization bias in simulation output,” Operations Research, vol. 30, pp. 569–590, 1982. 152. J. Beran, Statistics for Long–Memory Process, 1st ed. Boca Raton, FL: Chapman and Hall/CRC, 1994. 153. A. Suarez-Gonzales, J. Lopez-Ardao, C. Lopez-Garcia, M. Rodriguez-Perez, M. Fernandez-Veiga, and M. E. Sousa-Vieira, “A batch means procedure for mean value estimation of processes exhibiting long range dependence,” in Proc. of the 2002 Winter Simulation Conference, San Diego, CA, Dec. 2002, pp. 456– 464. 154. P. Seeling, M. Reisslein, and F. Fitzek, “Offset distortion traces for tracebased evaluation of video quality after network transport,” in Proc. of IEEE Int. Conference on Computer Communications and Networks (ICCCN), San Diego, CA, Oct. 2005, pp. 375–380.
264
References
155. ——, “Layered video coding offset distortion traces for trace-based evaluation of video quality after network transport,” in Proc. of IEEE Consumer Communications and Networking Conference CCNC, Las Vegas, NV, Jan. 2006, pp. 292–296. 156. ITU-T Recommendation P.800.1, “Mean opinion score (MOS) terminology,” Mar. 2003. 157. M. Pinson and S. Wolf, “Low bandwidth reduced reference video quality monitoring system,” in First International Workshop on Video Processing and Quality Metrics for Consumer Electronics, Scottsdale, AZ, Jan. 2005. 158. P. Seeling, M. Reisslein, and B. Kulapala, “Network performance evaluation with frame size and quality traces of single-layer and two-layer video: A tutorial,” IEEE Communications Surveys and Tutorials, vol. 6, no. 3, pp. 58–78, Third Quarter 2004. 159. F. H. Fitzek, B. Can, H. C. Nguyen, M. I. Rahman, R. Prasad, and C. Koo, “Cross-layer optimization of OFDM systems for 4g wireless communications,” in Proceedings of the 9th International OFDM-Workshop, Dresden, Germany, Sept. 2004, pp. 27–31. 160. F. Fitzek and M. Reisslein, “MPEG–4 and H.263 traces for network performance evaluation,” Technical University Berlin, Dept. of Electrical Eng., Germany, Tech. Rep., Oct. 2000. [Online]. Available: http://trace.eas.asu.edu 161. R. K. (Editor), “Overview of the MPEG–4 standard, ISO/IEC 14496,” May/June 2000. 162. M. Wu, R. A. Joyce, H.-S. Wong, L. Guan, and S.-Y. Kung, “Dynamic resource allocation via video content and short–term traffic statistics,” IEEE Transactions on Multimedia, vol. 3, no. 2, pp. 186–199, June 2001. 163. P. Bocheck and S. Chang, “A content based video traffic model using camera operations,” in Proceedings of IEEE International Conference on Image Processing (ICIP), vol. 2, Lausanne, Switzerland, Sept. 1996, pp. 817–820. 164. K. Chandra and A. R. Reibman, “Modeling one– and two–layer variable bit rate video,” IEEE/ACM Transactions on Networking, vol. 7, no. 3, pp. 398– 413, June 1999. 165. H. Liu, N. Ansari, and Y. Q. Shi, “A simple model for MPEG video traffic,” in Proceedings of IEEE International Conference on Multimedia and Expo (ICME), New York, NY, 2000, pp. I–553–556. 166. “The network simulator — ns–2.” [Online]. Available: www.isi.edu/nsnam/ns/ index.html 167. A. Varga, “Omnet++,” IEEE Network Interactive, vol. 16, no. 4, 2002. [Online]. Available: www.hit.bme.hu/phd/vargaa/omnetpp 168. J. Buck, E. L. S. Ha, and D. Messerschmitt, “Ptolemy: A Framework for Simulating and Prototyping Heterogeneous Systems,” Int. Journal of Computer Simulation, vol. 4, pp. 155–182, Apr. 1994. [Online]. Available: ptolemy.eecs.berkeley.edu/index.htm 169. D. Marpe, T. Wiegand, and S. Gordon, “H.264/MPEG-4 AVC fidelity range extensions: Tools, profiles, performance, and application areas,” in Proc. IEEE Int. Conf. on Image Proc. (ICIP), Genova, Italy, Sept. 2005, pp. 593–596. 170. G. J. Sullivan, P. Topiwala, and A. Luthra, “The H.264/AVC advanced video coding standard: Overview and introduction to the fidelity range extensions,” in Proceedings of SPIE Conference on Applications of Digital Image Processing XXVII, Special Session on Advances in the New Emerging Standard: H.264/AVC, Aug. 2004, pp. 454–474.
References
265
171. C. H. Foh, L. Andrew, E. Wong, and M. Zukerman, “FULL-RCMA: A high utilization epon,” IEEE Journal on Selected Areas in Communications, vol. 22, no. 8, pp. 1514–1524, Oct. 2004. 172. Y.-L. Hsueh, M. S. Rogge, S. Yamamoto, and L. G. Kazovsky, “A highly flexible and efficient passive optical network employing dynamic wavelength allocation,” IEEE/OSA Journal of Lightwave Technology, vol. 23, no. 1, pp. 277–286, Jan. 2005. 173. G. Kramer, B. Mukherjee, and G. Pesavento, “IPACT: A Dynamic Protocol for an Ethernet PON (EPON),” IEEE Communications Magazine, vol. 40, no. 2, pp. 74–80, Feb. 2002. 174. G. Kramer, B. Mukherjee, Y. Ye, S. Dixit, and R. Hirth, “Supporting differentiated classes of service in ethernet passive optical networks,” Journal of Optical Networking, vol. 1, no. 8, pp. 280–298, Aug. 2002. 175. Y. Luo and N. Ansari, “Bandwidth Allocation for Multiservice Access on EPONs,” IEEE Communications Magazine, vol. 43, no. 2, pp. S16–S21, Feb. 2005. 176. M. Ma, Y. Zhu, and T. Cheng, “A bandwidth guaranteed polling mac protocol for ethernet passive optical networks,” in Proceedings of IEEE INFOCOM, Mar. 2003, pp. 22–31, san Francisco, CA. 177. M. P. McGarry, M. Maier, and M. Reisslein, “Ethernet PONs: A survey of dynamic bandwidth allocation (DBA) algorithms,” IEEE Communications Magazine, vol. 42, no. 8, pp. S8–S15, Aug. 2004. 178. M. P. McGarry, M. Reisslein, and M. Maier, “WDM ethernet passive optical networks (EPONs),” IEEE Communications Magazine, vol. 44, no. 2, pp. S18– S25, Feb. 2006. 179. C. Xiao, B. Bing, and G. K. Chang, “An efficient reservation mac protocol with preallocation for high-speed wdm passive optical networks,” in Proc. of IEEE INFOCOM, Miami, FL, Mar. 2005.
Index
4G, 35
aggregation, 46, 53
arithmetic coding, 25
autocorrelation, 46, 52, 54
B frame, 26, 28, 191
base layer, 29
  fine granular scalable, 33
  SNR scalable, 31
  spatial, 31
  temporal, 30
block, 15, 25
block matching algorithm, 26
block scanning, 18
blockiness, 22
call admission, 186
coefficient of variation, 46, 52, 53
color space, 11
  conversion, 12
constant bit rate, 22
constant utilization simulation, 185
content dependency, 111
context–adaptive binary arithmetic coding, 25
correlation coefficient, 55
covariance, 54
data partitioning, 30
description, 35
discrete cosine transformation, 19
encoding overhead, 58, 165
end of block, 25
enhancement layer, 29
  fine granular scalable, 33
  SNR scalable, 31
  spatial, 31
  temporal, 30
  truncation, 33
fine granular scalability, 33
flicker fusion rate, 7
flip book, 7
frame format, 10, 38
  CIF, 11
  ITU-R/CCIR-601, 10
  QCIF, 11
frame loss probability, 191
frame rate, 8
frame type, 26
generate video traffic, 183
group of pictures, 15, 29, 188, 191, 196
H.264, 19, 27, 39
H.264 results, 109
  autocorrelation, 114
  histograms, 113
  periodogram, 116
  R/S statistics, 114
  variance time statistics, 114
header compression, 180
high definition television, 38
Huffman coding, 25
hump phenomenon, 97, 125, 126
Hurst parameter, 47–49
I frame, 26, 28, 191
information loss probability, 191
inter-frame coding, 25
interlaced video, 10
internet protocol, 177
intra-coding, 17
IP overhead, 173
layers, 29
locality of reference, 184
logscale diagram, 50
loss probability, 191
macroblock, 15, 25
MDC results, 165
  encoding overhead, 165, 167
  frame size statistics, 165
mean, 45, 51
mean squared error, 51
metrics, 45
  additional for MDC, 58
  frame size, 45
    aggregation, 46
    autocorrelation, 46
    coefficient of variation, 46
    logscale diagram, 50
    mean, 45
    multiscale diagram, 50
    periodogram, 48
    R/S statistic, 47
    variance, 45
    variance–time test, 47
  video quality, 51
    mean squared error, 51
    peak signal to noise ratio, 51
    rooted mean squared error, 51
motion compensation, 26
motion estimation, 26
motion vector, 26
  unrestricted, 27
moving images, 7
MPEG-4 FGS results, 153
  bit planes, 155
  quality, 153
  quality autocorrelation, 162
  quality improvement, 157
  quality variation, 162
MPEG-4 frame size parser, 242
MPEG-4 results, 93
  single layer, 97
    frame sizes, 97
  spatial scalable, 104
    base layer, 104
    combined layers, 107
    enhancement layer, 106
  temporal scalable, 100
    base layer, 100
    combined layers, 103
    enhancement layer, 102
  videos, 98
multiple description coding, 35
  encoding overhead, 58
  splitter, 63
  video quality after network transport, 224
multiscale diagram, 50
network simulation, 183
  call admission, 186
  constant utilization simulation, 185
  data analysis, 191
  encoding mode selection, 185
  frame level issues, 188
  frame loss probability, 191
  generate video traffic, 183
  incorporation of transmission errors, 195
  information loss probability, 191
  packet transmission, 190
  performance metric estimation, 193
  performance metrics, 191
  quality after transport, 218
  receiver playout, 188
  scaling video traces, 185
  steady state simulation, 194
  terminating simulation, 193
  varying utilization simulation, 186
  video packetization, 189
  video quality, 192
  video quality maximization, 193
  video selection, 183
  workload composition, 184
network simulations
  offset distortion impact, 218
  software tools, 229
  video quality for multiple descriptors, 224
network simulator — ns–2, 231
NTSC, 10
object scalability, 32
offset distortion, 195
  algorithm
    single layer, 204
    spatial scalable, 206
    temporal scalable, 204
    video quality after network transport, 218
  approximation, 210
  example results, 206
  impact on simulation results, 218
  offset distortion calculator tool, 243
  perceptual considerations, 214
  single layer, 203
  SNR scalable, 205
  spatial scalable, 205
  temporal scalable, 203
  trace format, 216
  utilization, 215
offset distortion calculator, 243
Omnet++, 232
OSI layering model, 173
P frame, 26, 28, 191
packet transmission, 190
packetization, 189
PAL, 10
peak signal to noise ratio, 51, 202
performance metrics, 191
periodogram, 48
pixel, 10, 15
pre-encoded results, 118
  autocorrelation, 122
  frame size distribution, 122
  frame size trace, 118
  frame statistics, 121
  periodogram, 124
  R/S statistics, 123
  trace comparison, 124
priority break point, 30
progressive video, 10
protocol overhead, 173
  calculation, 182
  IP, 177
  RTCP, 179
  RTP, 175
  RTSP, 179
  SAP, 178
  SDP, 178
  signaling, 177
  SIP, 178
  TCP, 176
protocol stack, 173
Ptolemy II, 233
pyramid coding, 31
quantization, 20
quantization matrix, 22
quantization scale, 22
quantizer, 21
R/S statistic, 47
real time control protocol, 179
real time protocol, 175
real time streaming protocol, 179
receiver playout, 188, 196
reference frame, 26
RGB, 11
RMSE and PSNR calculator, 240
robust header compression, 181
rooted mean squared error, 51, 202
run–level coding, 25
scalable video coding, 29
  data partitioning, 30
  fine granular scalability, 33
  multiple description coding, 35
  object scalability, 32
  SNR scalability, 31
  spatial scalability, 31
  temporal scalability, 30
scaling video traces, 185
scene, 14
scene metrics, 56
selection of encoding mode, 185
session announcement protocol, 178
session description protocol, 178
session initiation protocol, 178
shot, 14
shutter, 8
silence of the lambs example results, 83
  aggregated frame size autocorrelation, 89
  aggregated frame sizes, 84
  frame size, 84
  frame size autocorrelation, 88
  frame size histograms, 85
  frame type size histograms, 86
  logscale diagrams, 91
  multiscale diagrams, 92
  periodograms, 90
  R/S statistics, 89
  variance time statistics, 91
simulation output analysis, 191
simulation performance metric estimation, 193
slice, 15
SNR scalability, 31
software tools, 229
  MPEG-4 frame size parser, 242
  network simulators, 229
    ns–2, 231
    Omnet++, 232
    Ptolemy, 233
  offset distortion calculator, 243
  RMSE and PSNR calculator, 240
  videometer, 235
spatial scalability, 31
splitter, 63
starting phase, 186
starvation probability, 191
statistical results, 83
  H.264, 109
    autocorrelation, 114
    histograms, 113
    periodogram, 116
    R/S statistics, 114
    variance time statistics, 114
  MDC, 165
    encoding overhead, 165, 167
    frame size statistics, 165
  MPEG-4, 93
    single layer, 97
    spatial scalable, 104
    temporal scalable, 100
    videos, 98
  MPEG-4 FGS, 153
    bit planes, 155
    quality, 153
    quality autocorrelation, 162
    quality improvement, 157
    quality variation, 162
  pre-encoded, 118
    autocorrelation, 122
    frame size distribution, 122
    frame size trace, 118
    frame statistics, 121
    periodogram, 124
    R/S statistics, 123
    trace comparison, 124
  silence of the lambs example, 83
  wavelets, 125
    aggregated frame size statistics, 125
    autocorrelation, 128
    comparison with MPEG-4, 141
    correlation, 139
    frame size statistics, 125
    logscale statistics, 133
    periodogram, 130
    quality autocorrelation, 139
    quality histograms, 137
    R/S statistics, 130
    size histograms, 127
    variance time statistics, 131
    video quality, 135
    video quality trace, 137
steady state simulation, 194
subband, 35
Super8, 10
temporal scalability, 30
terminating simulation, 193
trace comparison, 124
transcoding, 30, 193
transform coefficients, 19
transmission control protocol, 176
transmission errors, 195
variable bit rate, 22, 185
variable length coding, 24
  end of block, 25
  run–level coding, 25
variance, 45, 52, 53
variance–time test, 47
varying utilization simulation, 186
video coding standards, 38
video decoding, 196
  algorithm
    single layer, 199
    SNR scalable, 201
    spatial scalable, 201
    temporal scalable, 199
  single layer, 196
  SNR scalable, 199
  spatial scalable, 198
  temporal scalable, 197
video encoding
  constant bit rate, 22
  inter-frame coding, 25
  intra-coding
    block scanning, 18
    discrete cosine transformation, 19
    quantization, 20
    quantization matrix, 22
    quantization scale, 22
    variable length coding, 24
    zig–zag scanning, 23
  scalable video coding, 29
    data partitioning, 30
    fine granular scalability, 33
    multiple description coding, 35
    object scalability, 32
    SNR scalability, 31
    spatial scalability, 31
    temporal scalability, 30
  variable bit rate, 22
video evaluation
  H.264, 73
  MDC, 80
  MPEG-4
    encoding modes, 65
    overview, 63
    single layer, 66
    spatial scalable, 71
    temporal scalable, 67
  MPEG-4 FGS, 75
  pre-encoded, 79
  wavelets, 77
video frame, 8, 15
video frame size and quality correlation, 54
video hierarchy, 14
video on demand system, 184
video quality, 192, 200, 235
  after network transport, 200
video sequence, 14
video streaming, 173
  IP, 177
  RTCP, 179
  RTP, 175
  RTSP, 179
  SAP, 178
  SDP, 178
  signaling, 177
  SIP, 178
  TCP, 176
video trace format
  different formats in overview, 231
  H.264
    terse, 76
    verbose, 75
  MDC
    multiple descriptors, 81
    single descriptor, 81
  MPEG-4 FGS
    base layer, 77
    enhancement layer, 77
    scene description, 76
    truncated enhancement layer, 78
  MPEG-4 single layer
    terse, 68
    verbose, 68
  MPEG-4 spatial scalable
    verbose base layer, 72
    verbose enhancement layer, 74
  MPEG-4 temporal scalable
    verbose base layer, 69
    verbose enhancement layer, 70
  offset distortion, 216
  pre-encoded
    raw trace, 80
  wavelets
    combined, 78
    substream, 79
video trace generation, 59
  for MDC video, 62
  for pre-encoded video, 62
  from DVD, 61
  from VHS, 60
  from YUV sequences, 62
  overview, 59
videometer, 235
wavelet encoding, 35
wavelet results, 125
  aggregated frame size statistics, 125
  autocorrelation, 128
  comparison with MPEG-4, 141
  correlation, 139
  frame size statistics, 125
  logscale statistics, 133
  periodogram, 130
  quality autocorrelation, 139
  quality histograms, 137
  R/S statistics, 130
  size histograms, 127
  variance time statistics, 131
  video quality, 135
  video quality trace, 137
wavelet transform, 35
workload composition, 184
YUV, 11
  formats, 12
  frame size, 14
  packed, 14
  planar, 14
  sub–sampling, 12
zig–zag scanning, 23
Zipf distribution, 184