Rate-Distortion Optimizations for Motion Estimation in Low-Bitrate Video Coding
Dzung T. Hoang^1  Philip M. Long^2  Jeffrey Scott Vitter^3
CS-1995-16
Department of Computer Science, Duke University, Durham, North Carolina 27708-0129. June 19, 1995.
1 Affiliated with Brown University. Supported in part by an NSF Graduate Fellowship and by Air Force Office of Strategic Research grant F49620-92-J-0515.
2 Support was provided in part by Air Force Office of Strategic Research grant F49620-92-J-0515. Current address: Research Triangle Institute, 3040 Cornwallis Road, P.O. Box 12194, Research Triangle Park, NC 27709.
3 Support was provided in part by Air Force Office of Strategic Research grant F49620-92-J-0515, Army Research Office grant DAAH04-93-G-0076, and by an associate membership in CESDIS.
Abstract

We present and compare methods for choosing motion vectors for motion-compensated video coding. Our primary focus is on videophone and videoconferencing applications, where very low bit rates are necessary, where motion is usually limited, and where frames must be coded in the order they are generated. We provide evidence, using established benchmark videos typical of these applications, that choosing motion vectors explicitly to minimize rate, subject to implicit constraints on distortion, yields better rate-distortion tradeoffs than minimizing notions of prediction error. Minimizing a linear combination of rate and distortion results in further rate-distortion improvements. Using a heuristic function of the prediction error and the motion vector codelength results in compression performance comparable to the more computationally intensive coders while running much faster. We incorporate these ideas into coders that operate within the p×64 standard.

1 Introduction
The typically strong correlation between successive frames of a video sequence makes video highly compressible, since the pixels of the previous frame can be used to predict the intensities of the current frame. The difference between the predicted and the true frame is often small and can be encoded efficiently, for example, by a lossy transform coder using the two-dimensional discrete cosine transform (2D DCT). Improved compression is readily obtained by first estimating what portions of the current frame correspond to moving objects and then transmitting motion vectors that tell the decoder where to look in the previous frame for predictions of the intensity of each pixel in the current frame. The most popular method for estimating these motion vectors originated with Jain and Jain [6] and is called block matching. In their approach, the current frame is divided into blocks (usually 8×8 or 16×16) whose pixels are assigned the same motion vector. The motion vector for a given block B is usually obtained by (approximately) minimizing, from among candidates ~v within a limited search area, some norm of the difference between B and the prediction obtained from ~v. The mean squared error (MSE) is a commonly used difference measure, although, for example, the mean absolute difference (MAD) is often substituted because it can be implemented efficiently. Block-matching motion compensation is combined with DCT transform coding of residuals in hybrid video coding systems such as H.261 (also known informally as the p×64 standard) [1, 8] and MPEG [2]. In previous work on motion compensation for video coding, motion vectors are chosen to minimize prediction error, and this task is often separated from other components of the video coding system that control the rate and distortion. Most of the emphasis of research in motion estimation has been on speeding up the motion search without sacrificing too much rate/distortion performance.
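The full-search block-matching procedure just described is straightforward to sketch. The following is a minimal illustration using the MAD criterion over a ±7 search window; the array layout and function names are illustrative, not taken from any particular coder.

```python
import numpy as np

def mad(block, ref):
    """Mean absolute difference between two equally sized blocks."""
    return np.mean(np.abs(block.astype(np.int32) - ref.astype(np.int32)))

def full_search(cur, prev, bx, by, bsize=16, search=7):
    """Full-search block matching: return the motion vector (dy, dx)
    minimizing the MAD between the block at (by, bx) in `cur` and the
    displaced block in `prev`, over a +/- `search` window."""
    h, w = prev.shape
    block = cur[by:by + bsize, bx:bx + bsize]
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = by + dy, bx + dx
            if y < 0 or x < 0 or y + bsize > h or x + bsize > w:
                continue  # candidate block falls outside the previous frame
            err = mad(block, prev[y:y + bsize, x:x + bsize])
            if best is None or err < best:
                best, best_mv = err, (dy, dx)
    return best_mv, best
```

Exhaustive search over a ±7 window evaluates up to 225 candidates per block, which is why much of the motion-estimation literature concerns faster (suboptimal) search patterns.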
In this paper, we are concerned primarily with video coding at low bitrates for applications such as videophone and videoconferencing, where the bitrate is typically limited to 64 kbps or less. At such low bitrates, the coding of motion vectors takes up a significant portion of the bandwidth. We investigate the use of cost measures that more directly estimate the effect of the choice of motion vector on the total codelength and reconstruction distortion. Our experimental results show that using these measures yields substantially better rate-distortion tradeoffs. Initially, our emphasis is on codelength and quality, not on computation time, in order to determine the limits on the compressibility of video. We first develop and present computationally intensive coders that attempt to explicitly optimize for rate and distortion. Insights from these implementations lead to faster coders that minimize an efficiently computed heuristic function. We implemented and tested our motion estimation algorithms within an experimental implementation of the p×64 standard. The p×64 standard is intended for applications like videophone and videoconferencing, where very low bitrates are required, not much motion is present, and frames
are to be transmitted essentially as they are generated. Unlike the case of the MPEG standard, we cannot first compress a subsampling of frames and then use frames both before and after a given frame to predict it. Our experimental results are for benchmark videos typical of the type for which the p×64 standard was intended: they consist of a "head-and-shoulders" view of a single speaker. Using block matching, we create only a crude, but concise, model of the motion. For video coding, we do not necessarily want to find the "correct" motion vectors, in contrast to a goal of research in optic flow, for example [9]. If a motion vector field that does not correspond to the actual motion in the scene yields the shortest description, that is sufficient for purposes of compression. However, an accurate motion field is desirable for motion interpolation, where a non-coded frame is interpolated from two successively coded frames by performing motion compensation using an interpolated motion field. In the next section, we describe the PVRG implementation of the p×64 standard, and then show how to modify the PVRG implementation, while remaining within the p×64 standard, to choose motion vectors that more directly minimize codelength and distortion. For comparable quality, at the level roughly required for transmitting a benchmark QCIF video sequence at 2,000 bits per frame, explicit bit minimization reduces the bitrate by about 11% on average. In the p×64 standard, two binary decisions must be made from time to time (for details, see Section 2). In the base PVRG implementation, heuristics based on prediction error are used to make these decisions. When the explicit bit minimization philosophy is also applied to make the coding decisions, the improvement becomes a significant 27%. If instead of minimizing the bitrate we minimize a combination of rate and distortion, we observe a bitrate reduction of 30% for the same level of quality.
We then describe a heuristic function that gives compression performance comparable to explicitly minimizing bitrate, while running much faster. Experimental results are presented in Sections 4.4 and 5.3. To the best of our knowledge, ours is the first work investigating the effect of minimizing codelength subject to quality constraints (implicit and explicit) to choose motion vectors, although the importance of coding motion vectors at very low bitrates is acknowledged in [7].
2 Overview of the p×64 Standard
We first provide a brief overview of key components of the p×64 standard. The p×64 standard specifies a three-component color system as the format for the video data. The three components are a luminance band Y and two chrominance bands CB and CR. Since the human visual system is more sensitive to the luminance component and less sensitive to the chrominance components, CB and CR are subsampled by a factor of 4 compared to Y. The image is composed of Groups of Blocks (GOB), which are further decomposed into macroblocks (MB), each consisting of four 8×8 Y blocks, one 8×8 CB block, and one 8×8 CR block. Motion prediction and compensation are performed by treating each macroblock as an atomic entity; that is, there is one motion vector per macroblock. Figure 1 shows a block diagram of the p×64 coder. At a high level, the basic process is as follows. The macroblocks are scanned in lexicographic order. For each macroblock M, the encoder chooses a motion vector ~v (how this is done is left unspecified in the standard), and the difference between ~v and the motion vector for the previous macroblock is transmitted, using a static Huffman code. For each 8×8 block B contained in M, a lossy version of the block of prediction errors obtained by using ~v to predict B is then transmitted. This is done by applying a 2D DCT to the block of prediction errors, quantizing the transform coefficients, and sending the result using a run-length/Huffman coder, where the coefficients are scanned in a zigzag order.
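The zigzag scan order mentioned above can be generated programmatically. The sketch below pairs it with a plain uniform quantizer purely for illustration; the actual p×64 quantizer (with its dead zone and special INTRA DC handling) is more involved.

```python
import numpy as np

def zigzag_order(n=8):
    """Return the (row, col) visiting order for an n x n zigzag scan."""
    order = []
    for s in range(2 * n - 1):          # anti-diagonals, top-left to bottom-right
        diag = [(i, s - i) for i in range(n) if 0 <= s - i < n]
        if s % 2 == 0:
            diag.reverse()              # even diagonals run bottom-left to top-right
        order.extend(diag)
    return order

def quantize_and_scan(dct_block, q):
    """Uniformly quantize an 8x8 coefficient block (a simplification of the
    standard's quantizer) and emit the coefficients in zigzag order."""
    quantized = np.round(dct_block / q).astype(int)
    return [quantized[r, c] for r, c in zigzag_order(8)]
```

Scanning in zigzag order concentrates the (typically larger) low-frequency coefficients at the front, so the trailing high-frequency zeros collapse into long runs for the run-length coder.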
Figure 1: Block diagram of the p×64 source coder [1]. (T: Transform; Q: Quantizer; P: Picture Memory with motion-compensated variable delay; F: Loop Filter; CC: Coding Control. Side information: p: flag for INTRA/INTER; t: flag for transmitted or not; qz: quantizer indication; q: quantizing index for transform coefficients; v: motion vector; f: switching on/off of the loop filter.)

As indicated in Figure 1, the encoder has the option of changing certain aspects of the above process. First, the encoder might simply not transmit the current macroblock; the decoder is then assumed to use the corresponding macroblock in the previous frame in its place. If transmitted, the macroblock can be transform coded with motion compensation (interframe coding) or without (intraframe coding). If motion compensation is used, there is an option to apply a linear filter to the previous decoded frame before using it for prediction. The p×64 standard does not specify how to make these decisions.
3 PVRG Implementation of p×64
As a basis for comparison of the different motion estimation schemes presented in this paper, we use the "vanilla" p×64 coder supplied by the Portable Video Research Group (PVRG)^1. In the base PVRG implementation, a motion vector ~v is determined for each macroblock M by means of standard full-search block matching. Only the luminance blocks are compared to determine the best match, with the MAD being used as the measure of prediction error. Several heuristics are used to make the coding decisions. The variance VP of the prediction errors for the luminance blocks in M obtained by using ~v is compared against the variance VY of the luminance blocks in M to determine whether to perform intraframe or interframe coding. If interframe motion compensation mode is selected, the decision of whether to use motion compensation with
1 As of the publication date, the source code for this implementation can be obtained via anonymous ftp from havefun.stanford.edu.
a zero motion vector or the estimated motion vector is made by comparing the MAD of motion compensation with zero motion against that with the estimated motion vector. In the former case, no motion vector needs to be sent. The default decision diagrams are shown in Figure 2. The loop filter in interframe mode is enabled if VP is below a certain threshold. The decision of whether to transmit a transform-coded block is made individually for each block in a macroblock by considering the sum of absolute values of the quantized transform coefficients. If the sum falls below a preset threshold, the block is not transmitted.

Figure 2: Default decision diagrams for coding control [5]. (a) Intraframe/interframe decision diagram: variance of the motion-compensated block vs. variance of the block prediction error, with decision boundary near y = x. (b) Motion vector decision diagram: MAD with the estimated motion vector vs. MAD with the zero motion vector, with decision boundary near y = x/1.1.

Video coders for videophone and videoconferencing often have to operate with fixed bandwidth limitations. Since the p×64 standard specifies entropy coding, resulting in a variable bitrate, some form of rate control is required for operation on bandwidth-limited channels. For example, if the coder's output exceeds the channel capacity, then frames could be dropped or the quality decreased in order to meet the bandwidth limitations. On the other hand, if the coder's output is well below the channel's capacity, the quality and/or frame rate can be increased. A simple technique for rate control uses a virtual buffer model in a feedback loop in which the buffer's "fullness" controls the level of quantization performed by the lossy transform coder. The PVRG coder uses this approach. The quantization step size Q is determined from the buffer fullness BF using the equation:
Q = min(⌊80·BF + 1⌋, 31),
where Q has an integral range of [1, 31] and BF has a real-valued range of [0, 1]. This feedback function is plotted in Figure 3.
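The feedback function above is a one-liner; a direct transcription:

```python
import math

def quant_step(buffer_fullness):
    """Feedback function mapping virtual-buffer fullness BF in [0, 1]
    to a quantization step size Q in [1, 31]: Q = min(floor(80*BF + 1), 31)."""
    return min(math.floor(80 * buffer_fullness + 1), 31)
```

Note that the function saturates at Q = 31 once BF reaches 0.375, so only the lower part of the buffer range actually modulates the step size.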
4 Explicit Minimization Algorithms
In this section, we describe and compare the performance of three coders that conform to the p×64 standard. The first coder is the same as the PVRG coder, except that motion vectors are chosen in order to minimize (locally) the number of bits used to code the motion vectors as well as the residuals. In the second coder, certain binary decisions made in the first algorithm using heuristics based on error are instead also made to minimize total codelength. In the third,
motion vectors and coding decisions are chosen to minimize a linear combination of codelength and distortion.

Figure 3: Feedback function controlling quantization step size based on buffer fullness.

In the PVRG coder, motion estimation is performed (independent of the coding decisions) to minimize the MAD of the prediction errors. The rationale for this is that minimizing the MSE (approximated with the MAD) of the prediction errors is equivalent to minimizing the variance of the 2D DCT coefficients of the prediction errors, which tends to result in more coefficients being quantized to zero. However, minimizing the variance of the DCT coefficients does not necessarily lead to a minimum coding of the quantized coefficients, especially since the quantized coefficients are then Huffman and run-length encoded. Furthermore, since coding decisions are made independently, the effect of motion estimation on codelength is made even more indirect.

4.1 Algorithm M1

In Algorithm M1, motion estimation is performed explicitly to minimize (locally) the codelength of each macroblock. To compute the codelength, we make the same coding decisions as the PVRG coder and invoke the appropriate encoding subroutines for each choice of motion vector within the search area, picking the motion vector that results in the minimum codelength for the entire macroblock. The computed codelength includes the coding of the transform coefficients for the luminance blocks^2, the motion vector, and all other side information. When choosing the motion vector to minimize the coding of the current macroblock, we use the fact that the motion vectors for previous macroblocks (in scan order) have been determined in order to compute the codelength. However, since the choice of a motion vector for the current macroblock affects the codelength of future macroblocks, this is a greedy minimization procedure, and we may not obtain a globally minimal codelength.

Since we are explicitly attempting to minimize the codelength, we are almost assured of higher prediction error than if we had just minimized the prediction error. The higher prediction error is likely to result in higher reconstruction distortion. Instead of attempting to deal with quality directly, we rely on the transform coder and quantizer to deliver a desired level of quality; that is, the M1 coder may require a finer quantization step size to deliver the same quality as the PVRG coder. As we will see in the results section, for low bitrates bit minimization does indeed result in consistently better rate-distortion curves.

2 The transform coding of the chrominance blocks could be included as well. However, we chose not to in order to make a fair comparison to the base PVRG coder. This is also the policy for the other coders described in this paper.

4.2 Algorithm M2

In Algorithm M1, the decisions of whether to use a filter and whether to use motion compensation are made the same way as in the PVRG p×64 implementation. In Algorithm M2, however, these decisions are also made to minimize codelength: all three combinations of the decisions are tried, and the one resulting in the smallest codelength is used. Here, even more than with M1, we rely on the transform coder and the quantizer to code the prediction errors with adequate quality. Our hope is that the gain in compression efficiency will offset the decrease in reconstruction quality for a given quantization step size; that is, to achieve a certain quality level, we may be able to use finer quantization and still get improvements in compression. Since M2 is able to make decisions on how to code each macroblock, it is able to take into account the coding of side information in minimizing the codelength. For low bitrates, where the percentage of side information is significant compared to the coding of motion vectors and transform coefficients, one would expect that M2 will be able to reduce the codelength of side information. As with M1, M2 does not explicitly consider the effect of the motion vector on the resulting distortion. However, experimental results show better rate-distortion performance compared to the base PVRG coder.

4.3 Algorithm RD

With Algorithms M1 and M2, we minimize the codelength without regard to distortion and then choose the quantization step size to achieve the desired distortion level. This is not always the best policy. There may be cases where the choice of motion vector and coding decisions that minimizes codelength results in a relatively high distortion, whereas another choice would have a slightly higher codelength but substantially lower distortion. In terms of rate-distortion tradeoff, the second choice may be better.

Since the ultimate goal is better rate-distortion performance, we expect further improvements if we minimize a combination of codelength and distortion. M1 and M2 call encoder routines in the minimization steps. By adding calls to decoder routines, we can compute the resulting distortion. We incorporated this idea into Algorithm RD. Algorithm RD minimizes a linear combination of codelength and distortion^3. Let B(~v, ~c) denote the number of bits to code the current macroblock using motion vector ~v and coding decisions ~c. Similarly, let D(~v, ~c) be the resulting mean squared error. RD minimizes the objective function

    C_RD(~v, ~c) = B(~v, ~c) + λ·D(~v, ~c).    (1)

The choice of the Lagrange parameter λ depends on the rate-distortion curve for the particular input video. A good choice is to set λ equal to the negative of the slope of the line tangent to the rate-distortion curve at the operating point. This can be determined, for example, by preprocessing a portion of the input video to estimate the rate-distortion curve.

3 This is a standard rate-distortion optimization procedure based on Lagrange multipliers.

4.4 Experimental Results

We performed experiments using 49 frames of the "Miss America" sequence and 30 frames of the "Claire" sequence, both in QCIF (176×144) format sampled at 10 frames per second. These are "head and shoulders" sequences typical of the type found in videophone and videoconferencing applications.
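The selection step of Algorithm RD (Equation 1) can be sketched as a simple argmin over candidate (motion vector, coding decision) pairs. Here `code_macroblock` is a hypothetical stand-in for the encoder-plus-decoder routines, returning the bit count and reconstruction MSE for a candidate:

```python
def rd_choose(candidates, code_macroblock, lam):
    """Greedy rate-distortion choice in the spirit of Algorithm RD:
    pick the (motion vector, coding decisions) pair minimizing
    B(v, c) + lambda * D(v, c).  `code_macroblock(v, c)` must return
    a (bits, mse) pair for the candidate."""
    best_cost, best_choice = float('inf'), None
    for v, c in candidates:
        bits, mse = code_macroblock(v, c)
        cost = bits + lam * mse
        if cost < best_cost:
            best_cost, best_choice = cost, (v, c)
    return best_choice, best_cost
```

Setting `lam = 0` recovers pure bit minimization (the M2 behavior); larger values of λ trade bits for reconstruction quality.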
4.4.1 Rate-Distortion Characteristics
To evaluate the rate-distortion performance of the various motion estimation algorithms, we ran them with various fixed quantization step sizes Q, without rate control. For RD, we determine the values for λ from the slope of the rate-distortion curve for the M2 coder. The search region used for block matching is ±7 in both directions. The averaged rate-distortion curves achieved by varying Q are plotted for low bitrates in Figures 4 and 5. As indicated in the plots, M1 performs moderately better than the PVRG implementation, and M2 and RD significantly better. For instance, transmitting the QCIF Miss America sequence at 2,000 bits per frame would result in an average PSNR of 35 dB. For the same distortion, the M1, M2, and RD coders would require approximately 11%, 27%, and 30% less bandwidth, respectively. Tables 1-8 show the bitrate consumption of the coders in more detail. The tabulated figures are averages for interframe coding and do not include the initial intracoded frame. At low bitrates, the PVRG coder spends more bits on motion vectors than on transform coefficients. M1, M2, and RD achieve significant reductions in the codelength of the motion vectors. Also, since M2 and RD are able to make coding control decisions to reduce the codelength, further gains are achieved in the coding of side information. At higher bitrates, both M2 and RD achieve significant rate reduction with the transform coefficients. Furthermore, M2 and RD are able to trade off additional motion-vector coding for less transform-coefficient coding. The improvement of RD over M2 is minimal in these experiments. However, experiments on CIF (352×288) format sequences at higher bitrates show noticeably better results (see Figure 6). The diminishing returns at very low bitrates are due to the relatively high overhead for coding side information in the p×64 standard.
For example, with the Miss America sequence, RD is able to reduce the DCT coding by about 10% over M2, but due to the relatively high cost of coding motion vectors and side information, the total improvement is less.
4.4.2 Rate Control
We performed additional experiments using rate control to achieve a target bitrate of 16 kbps. The virtual buffer feedback rate control mechanism described in Section 3 is used, with the buffer size set to 8 kbits. The average PSNR for each coded frame is plotted for the Miss America and Claire sequences in Figures 7 and 8, respectively. All the coders start with the same quantization step size for the initial intracoded frame.
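The virtual-buffer feedback loop used in these experiments can be sketched as follows. `frame_bits_at_q` is a hypothetical stand-in for the encoder (bits produced when a frame is coded at step size q), and the drain-and-clamp details are illustrative rather than taken from the PVRG source:

```python
import math

def simulate_rate_control(frame_bits_at_q, n_frames, channel_bits_per_frame,
                          buffer_size):
    """Sketch of virtual-buffer feedback rate control.  The buffer gains
    the coded bits of each frame, drains one frame's worth of channel
    capacity, and its fullness BF sets the next step size via
    Q = min(floor(80*BF + 1), 31)."""
    fullness, q_trace = 0.5, []
    for _ in range(n_frames):
        q = min(math.floor(80 * fullness + 1), 31)
        q_trace.append(q)
        bits = frame_bits_at_q(q)
        # Accumulate coded bits, drain the per-frame channel capacity,
        # and clamp the level to the physical buffer range.
        level = max(0.0, fullness * buffer_size + bits - channel_bits_per_frame)
        fullness = min(1.0, level / buffer_size)
    return q_trace
```

With a 16 kbps channel at 10 frames per second, `channel_bits_per_frame` would be 1600 and `buffer_size` 8000, matching the experimental setup above.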
5 Heuristic Algorithms
While Algorithms M1, M2, and RD generally exhibit better rate-distortion performance than the base PVRG coder, they are computationally intensive. The additional computation is in the explicit evaluation of the rate (and, in the case of RD, the distortion). To address the computational complexity, we introduce Algorithms H1 and H2, which minimize a heuristic function of the prediction error and the motion vector codelength, both of which can be computed efficiently. The idea is that the prediction error (MSE, MAD, or a similar measure) can be used to estimate the rate and distortion for transform coding. The motion vector codelength (readily available with a table lookup) is included since it is significant when coding at low bitrates. For H1, coding control is performed using the same decision rules used in the PVRG and M1 coders. With H2, coding control is performed to minimize rate in the same manner as is done in M2.
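The heuristic search reduces to an argmin over candidate vectors. A sketch, with hypothetical callables standing in for the error measure, the motion-vector codelength lookup, and the fitted heuristic H (detailed in Section 5.1):

```python
def h1_choose(candidates, mad, mv_bits, H, q):
    """Heuristic motion search in the spirit of Algorithm H1: choose the
    motion vector minimizing C_H(v, Q) = H(MAD(v), Q) + B(v), where
    `mad(v)` computes the prediction error, `mv_bits(v)` looks up the
    motion-vector codelength, and `H(err, q)` estimates the cost of
    transform-coding the prediction error at step size q."""
    return min(candidates, key=lambda v: H(mad(v), q) + mv_bits(v))
```

Unlike M1 and RD, each candidate is scored with one error computation and one table lookup rather than a full encode (and decode) of the macroblock, which is where the speedup comes from.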
Figure 4: Averaged rate-distortion curve (MSE vs. average bits/frame) for the QCIF Miss America sequence, for the PVRG, M1, M2, and RD coders.
Figure 5: Averaged rate-distortion curve (MSE vs. average bits/frame) for the QCIF Claire sequence, for the PVRG, M1, M2, and RD coders.
Table 1: Coding Miss America sequence with PVRG coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31      1687.83    87.2917       576     1024.54  33.1775
 28      1705.83    117.521   559.854     1028.46  33.5985
 24      1767.77      176.5   555.396     1035.88  34.1942
 20      1853.75    269.854   540.438     1043.46  34.8846
 16      2040.25    459.771   515.062     1065.42  35.8919
 12      2480.77    884.333   494.271     1102.17  37.2402
  8       3585.4    1977.73   435.812     1171.85  39.3037
Table 2: Coding Miss America sequence using M1 coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31       1313.6    86.6042    229.75      997.25   32.245
 28      1348.65    113.083   228.729     1006.83  32.7477
 24      1408.17    170.458   222.062     1015.65   33.516
 20      1499.58    254.375   221.333     1023.88  34.3269
 16      1719.94    457.104   211.333      1051.5  35.3213
 12      2147.85    859.542   199.542     1088.77  36.8958
  8      3290.83    1949.85   179.104     1161.88  39.1048
Table 3: Coding Miss America sequence using M2 coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31      900.896    73.8542   248.125     578.917  31.8192
 28      937.667    98.0417   247.521     592.104  32.4433
 24      1008.21    141.812   253.083     613.312   33.111
 20         1130    226.271   256.083     647.646  33.9965
 16      1340.21    391.729   251.125     697.354  35.1646
 12      1764.17    734.458   250.375     779.333  36.7285
  8      2822.38    1670.94    247.25     904.188  39.0923
Table 4: Coding Miss America sequence using RD coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31      879.333    63.3542   249.854     566.125  31.8756
 28      920.354    85.8542   253.146     581.354  32.4188
 24      986.875    129.292   254.729     602.854  33.1948
 20      1096.21    208.396   253.562      634.25   34.105
 16      1298.44    366.438   249.375     682.625  35.2135
 12       1715.1    708.396   246.479     760.229  36.7519
  8       2763.9    1636.88   242.938     884.083  39.0858
Table 5: Coding Claire sequence with PVRG coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31      1547.28    83.8621   383.862     1079.55  30.5452
 28      1544.72    101.241   362.931     1080.55  31.1097
 24      1597.21    150.241   366.931     1080.03  31.9059
 20      1672.31    245.655   344.103     1082.55  32.8897
 16      1842.34    412.069   340.724     1089.55  34.2483
 12      2168.48    761.379   308.103        1099  35.7203
  8      2941.93     1541.9   280.448     1119.59  38.1193
Table 6: Coding Claire sequence with M1 coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31      1361.93    99.7586   191.862     1070.31  29.6931
 28      1389.38    131.069   186.759     1071.55  30.5431
 24      1427.41    169.759   183.862     1073.79  31.4224
 20      1515.97    259.103   176.759      1080.1  32.5579
 16      1665.97    410.655   169.552     1085.76  34.0238
 12      2000.93    745.276   160.207     1095.45  35.4817
  8      2810.45    1555.86   139.483      1115.1  38.0403
Table 7: Coding Claire sequence with M2 coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31       980.31    78.4483   204.724     697.138  29.5866
 28      1020.38    104.586   204.724     711.069  30.2531
 24      1056.07    124.793   195.586      735.69  31.3438
 20      1158.52    206.862   193.931     757.724  32.4472
 16      1338.69    364.138   189.345     785.207  33.8834
 12      1665.83    669.966   179.448     816.414  35.5045
  8      2488.55    1444.66   165.586      878.31  38.0028
Table 8: Coding Claire sequence with RD coder.

  Q   Bits/Frame   DCT Bits   MV Bits   Side Bits     PSNR
 31      975.483    74.0345   204.483     696.966  29.6652
 28      1019.21    105.759   201.586     711.862   30.381
 24       1050.1     124.31   193.448     732.345  31.3707
 20      1158.72    210.586   192.621     755.517   32.519
 16      1317.69    350.241   186.586     780.862  33.9886
 12      1652.45    660.897       179     812.552  35.5586
  8      2484.83    1440.76   163.138     880.931  38.0597
Figure 6: Averaged rate-distortion curve (MSE vs. average bits/frame) for the CIF Miss America sequence, for the PVRG, M1, M2, and RD coders.
5.1 Algorithm H1

Let Ẽ(~v) denote a measure of the prediction error that results from using motion vector ~v to code the current macroblock. For example, the error measure could be defined as Ẽ(~v) = ⟨MAD(~v), DC(~v)⟩, where MAD(~v) is the mean absolute prediction error and DC(~v) is the average prediction error. Let B(~v) denote the number of bits to code the motion vector ~v. Algorithm H1 chooses ~v to minimize the objective function C_H(~v, Q) defined as

    C_H(~v, Q) = H(Ẽ(~v), Q) + B(~v),    (2)

where Q is the quantization step size^4. Intuitively, the function H can be thought of as providing an estimate of the number of bits to code the prediction error. As we will discuss later, it can also be used to estimate a combination of the rate and distortion for coding the prediction error. The choice of error measure Ẽ and heuristic function H are parameters to the algorithm. In our investigations, we used MAD as the error measure, for computational reasons. We also looked at using the MSE, but this did not give any clear advantages over the MAD. It is also possible to define Ẽ to be a function of several variables. For the rest of this paper, we report only on the use of MAD for Ẽ and denote Ẽ(~v) by ε for convenience, where the dependence on ~v is implicit. We examined several choices for H and describe them below. As mentioned above, we can use H to estimate the number of bits used to transform-code the prediction error. To get an idea of what function to use, we gathered experimental data on the relationship between the MAD and DCT coded bits per macroblock for a range of motion vectors. Fixing the quantization step size Q at various values, the data was generated by running the RD

4 Generally, H depends on the quantization step size Q as well as ~v. For simplicity, we sometimes assume that we are coding with a fixed Q. With rate control, Q will necessarily vary. In this case, we appeal to the more general formulation of H.
Figure 7: Distortion for coding Miss America sequence at 16 kbps with rate control.

Figure 8: Distortion for coding Claire sequence at 16 kbps with rate control.
coder on two frames of the QCIF Miss America sequence and outputting the MAD and DCT coded bits per macroblock for each choice of motion vector. The results are histogrammed and shown as density plots in Figure 9. These plots suggest the following forms for H :
H ( ) = c 1 + c2 ; H ( ) = c1 log( + 1) + c2 ; H ( ) = c1 log( + 1) + c2 + c3 :
i
(3) (4) (5)
The parameters c could be determined o ine by trialanderror or by standard curve tting techniques or online using adaptive techniques such as WidrowHo and recursive curve tting. The above forms assume a xed Q. In general, H also depends on Q; however, when using H to estimate the motion motion for a particular macroblock, Q is held constant to either a preset value or to a value determined by the rate control mechanism. We could do a surface t for H ( ; Q). However, determining the appropriate functional form for such a surface t would be a more involved task. Instead, we treat the t parameters c as functions of Q. Since there are a small number (31) of possible values for Q, we can store the parameters in a lookup table, for instance. We can also consider modelling the reconstruction distortion as a function of prediction error. Again, we use the RD coder to generate experimental data for distortion versus MAD. The resulting density plots are shown in Figure 10. The plots are somewhat similar to the ones relating DCT bits to MAD. Again, we can consider Equations 3{5 to model the distortion. As with the RD coder, we can consider jointly optimizing the heuristic estimates of rate and distortion with the following cost function: ~ v ~ v CH (~ ; Q) = HR(E (~ ); Q) + HD(E (~ ); Q) + B (~ ); v v (6) where HR is the model for rate, HD is the model for distortion, and is the Lagrange parameter. If we use the same functional form to model both rate and distortion, the combined heuristic function, H = HR + HD, would have the same form in the case of Equations 3{5. In this case, we can perform curve tting once for the combined heuristic by training on the statistic R + D, where R is the DCT bits for a macroblock and D is the reconstruction distortion for the macroblock. As in RD, the parameter can be determined from the ratedistortion curve, for example.
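The per-Q curve fitting described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the training pairs are synthetic stand-ins for the (MAD, coded-bits) statistics collected from the coder, and `fit_heuristic` is a hypothetical helper name.

```python
import numpy as np

def fit_heuristic(mad, bits, form="linear"):
    """Least-squares fit of a heuristic H(eps) to (MAD, coded-bits) pairs.

    form = "linear":  H(eps) = c1*eps + c2                    (Eq. 3)
    form = "log":     H(eps) = c1*log(eps+1) + c2             (Eq. 4)
    form = "loglin":  H(eps) = c1*log(eps+1) + c2*eps + c3    (Eq. 5)
    Returns the coefficient vector c.
    """
    eps = np.asarray(mad, dtype=float)
    if form == "linear":
        A = np.column_stack([eps, np.ones_like(eps)])
    elif form == "log":
        A = np.column_stack([np.log(eps + 1), np.ones_like(eps)])
    else:  # "loglin"
        A = np.column_stack([np.log(eps + 1), eps, np.ones_like(eps)])
    c, *_ = np.linalg.lstsq(A, np.asarray(bits, dtype=float), rcond=None)
    return c

# One coefficient set per quantizer step size Q, kept in a lookup table.
rng = np.random.default_rng(0)
table = {}
for Q in range(1, 32):                        # H.261 allows Q = 1..31
    mad = rng.uniform(1, 35, 200)             # synthetic MAD prediction errors
    bits = 12.0 * mad + 40.0 / Q              # synthetic bit counts (stand-in data)
    table[Q] = fit_heuristic(mad, bits, "linear")
```

At coding time, the coder would index `table` by the current Q and evaluate the fitted form, avoiding any per-block retraining.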
5.2 Algorithm H2
As with M1, the H1 coder uses a fixed coding control. As with M2, we can consider trying out all the coding-control choices and choosing the one that results in the fewest coded bits. We apply this modification to H1 and call the resulting coder H2. Since H2 has to try out three coding-control choices, it will be about three times slower than H1; however, H2 can easily take advantage of parallel hardware. For the H1 and H2 coders, we used the same test sequences and followed the same procedures described in Section 4.4. We tested the different forms for the heuristic function given in Equations 3-5.
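The kind of heuristic motion search that H1 and H2 build on can be sketched as follows. This is an assumed reconstruction, not the paper's code: `mv_code_length` is a hypothetical stand-in for the H.261 motion-vector code length B(v), and the coefficients `c` are assumed to come from the per-Q curve fit.

```python
import numpy as np

def mad(block, pred):
    """Mean absolute difference between a block and its prediction."""
    return float(np.mean(np.abs(block.astype(float) - pred.astype(float))))

def mv_code_length(vx, vy):
    """Rough stand-in for the motion-vector code length B(v): smaller
    displacements get shorter codes, as in H.261's variable-length MV table."""
    return abs(vx) + abs(vy) + 2

def heuristic_search(cur, ref, bx, by, n=8, radius=4, c=(12.0, 5.0)):
    """Pick the motion vector minimizing H(MAD) + B(v), using the linear
    heuristic H(eps) = c1*eps + c2 (Eq. 3) with assumed coefficients."""
    c1, c2 = c
    best, best_cost = (0, 0), float("inf")
    block = cur[by:by + n, bx:bx + n]
    for vy in range(-radius, radius + 1):
        for vx in range(-radius, radius + 1):
            y, x = by + vy, bx + vx
            if y < 0 or x < 0 or y + n > ref.shape[0] or x + n > ref.shape[1]:
                continue  # candidate falls outside the search area
            cost = c1 * mad(block, ref[y:y + n, x:x + n]) + c2 + mv_code_length(vx, vy)
            if cost < best_cost:
                best, best_cost = (vx, vy), cost
    return best

# usage: a block shifted by (1, 2) in the reference should be recovered.
rng = np.random.default_rng(1)
ref = rng.integers(0, 256, (32, 32), dtype=np.uint8)
cur = np.roll(ref, shift=(-2, -1), axis=(0, 1))   # current frame = shifted reference
```

Because B(v) enters the cost directly, ties in prediction error are broken toward cheaper motion vectors, which is the rate-oriented behavior the heuristic coders aim for.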
5.3 Experimental Results
Figure 9: Density plots of DCT coding bits vs. MAD prediction error for the first intercoded frame of the Miss America sequence at various levels of quantization.
Figure 10: Density plots of MSE reconstruction distortion vs. MAD prediction error for the first intercoded frame of the Miss America sequence at various levels of quantization.
5.3.1 Curve Fitting

To determine the coefficients for the heuristic functions, we used linear least-squares regression. For each value of the quantizer step size Q, we trained a set of coefficients and stored them in a lookup table. For the heuristic functions given in Equations 3-5, we performed the curve fitting on the R + λD statistic, as discussed earlier.

5.3.2 Rate-Distortion Characteristics
We ran H1 and H2 using the various heuristic functions on the Miss America and Claire sequences, varying the quantization step size Q. The resulting averaged rate-distortion curves are plotted in Figures 11-14. In particular, the bit-rate consumption of the H1 and H2 coders using the linear heuristic function is tabulated for both test sequences in Tables 9-12. Compared with the results in Tables 1-8, the H1 and H2 coders use fewer bits to code motion vectors while spending more bits to code DCT coefficients. As expected, the heuristic coders generate more bits than the explicit-minimization coders for the same quantization step size. On the other hand, since the heuristic coders incorporate the prediction error in the cost function, they result in less reconstruction distortion. This tradeoff between rate and distortion is favorable for the Claire sequence but not for the Miss America sequence. For both test sequences, however, the heuristic coders perform better than the base PVRG coder. For the Claire sequence, H1 performs slightly better than M1, and H2 better than M2. For the Miss America sequence, the performance of H1 lies between PVRG and M1, while the performance of H2 is competitive with M2 when using the linear fit.
5.3.3 Rate Control
As in Section 4.4.2, we ran the heuristic coders with rate control targeted to 16 kbps. Comparative plots of the resulting average PSNR are shown in Figures 15-18. These results show that the heuristic coders perform as well as the explicit-minimization coders when rate control is used.
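One step of a feedback rate controller of the general kind used in such experiments could look like the following sketch. The thresholds and step sizes here are illustrative assumptions, not the PVRG scheme or the paper's actual rate-control mechanism.

```python
def adjust_q(q, bits_produced, bits_target, q_min=1, q_max=31):
    """One step of a simple feedback rate control (hypothetical sketch):
    coarsen Q when the last frame overshot the bit budget, refine it when
    the frame came in well under budget, and leave Q alone inside a deadband."""
    if bits_produced > 1.1 * bits_target:
        q = min(q_max, q + 2)          # over budget: quantize more coarsely
    elif bits_produced < 0.9 * bits_target:
        q = max(q_min, q - 1)          # under budget: spend bits on quality
    return q

# e.g. targeting 16 kbps at 10 frames/s gives a budget of 1600 bits/frame
```

In a real coder this update would run once per frame (or per group of blocks), with Q clamped to the 1-31 range that H.261 allows.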
6 Conclusion and Future Work
We have shown that choosing motion vectors and coding control to (greedily) minimize the code length, a combination of code length and distortion, or computationally efficient approximations thereof yields substantial improvements in rate-distortion performance when coding video at low rates.

In general, the Lagrange parameter λ in Equation 1 depends on characteristics of the block being coded and could vary as the statistics of the input sequence vary. In this paper, we considered only the simple case of a fixed λ. An online adaptation of λ to track variations in the input sequence is certainly possible and would result in more robust coders. Since λ controls rate to some extent, it could be used in conjunction with the quantization step size in performing rate control. Preliminary investigations along these lines show promising improvements.

We have thus far investigated only a limited number of functional forms for the heuristic function H. These were suggested by visual examination of the histograms; perhaps some sort of theoretical analysis would suggest alternative forms. Another possibility would be to use techniques from nonparametric statistics (see [3]), where one estimates a functional relationship without choosing a form for the hypothesis a priori, instead implicitly using smoothness assumptions on the relationship to be modelled. At a glance, it does not seem as if this would work, since the most popular methods for nonparametric regression require
Figure 11: Comparison of averaged rate-distortion curves for coding the Miss America sequence with the H1 coder, with parameters determined by least-squares curve fitting.
Figure 12: Comparison of averaged rate-distortion curves for coding the Claire sequence with the H1 coder, with parameters determined by least-squares curve fitting.
Figure 13: Comparison of averaged rate-distortion curves for coding the Miss America sequence with the H2 coder, with parameters determined by least-squares curve fitting.
Figure 14: Comparison of averaged rate-distortion curves for coding the Claire sequence with the H2 coder, with parameters determined by least-squares curve fitting.
Table 9: Coding Miss America sequence using H1LIN coder.

     Q   Bits/Frame   DCT Bits   MV Bits   Side Bits      PSNR
    31      1399.52    160.621   188.069     1050.83   32.3676
    28      1409.41    168.379   195.448     1045.59   32.9559
    24      1453.24    202.483   203.793     1046.97   33.7169
    20      1528.97    278.621   206.793     1043.55   34.6541
    16      1744.07    470.552   207.552     1065.97   35.6986
    12      2177.86    877.931   202.793     1097.14   37.1852
     8      3296.07    1943.52   183.069     1169.48   39.3141
Table 10: Coding Miss America sequence using H2LIN coder.

     Q   Bits/Frame   DCT Bits   MV Bits   Side Bits      PSNR
    31          933    130.104   192.833     610.062   31.9721
    28      950.062    137.625   200.021     612.417   32.7019
    24      1019.69    180.604   208.229     630.854   33.5
    20      1135.73    260.021   215.729     659.979   34.429
    16      1347.85    425.729   215.396     706.729   35.5227
    12      1775.65        774   215.958     785.688   36.9613
     8      2860.17    1734.21   212.208      913.75   39.2417
Table 11: Coding Claire sequence with H1LIN coder.

     Q   Bits/Frame   DCT Bits   MV Bits   Side Bits      PSNR
    31      1369.34    101.552   187.379     1080.41   30.3479
    28      1381.59    116.379   186.552     1078.66   30.8797
    24         1413    147.483   186.379     1079.14   31.7386
    20      1518.83     255.69   182.034      1081.1   32.759
    16      1683.07        419   175.517     1088.55   34.1155
    12      2027.17    761.552       167     1098.62   35.661
     8      2810.03    1544.24   148.103     1117.69   38.0693
Table 12: Coding Claire sequence with H2LIN coder.

     Q   Bits/Frame   DCT Bits   MV Bits   Side Bits      PSNR
    31      981.966    88.8276   186.759     706.379   29.8383
    28      1017.69    114.655   184.172     718.862   30.4376
    24      1061.79    139.103   184.069     738.621   31.4659
    20       1174.1    226.586   185.103     762.414   32.5917
    16      1347.34    377.345   178.414     791.586   33.9697
    12      1673.41     682.31   172.241     818.862   35.5955
     8      2500.79    1463.69   154.207     882.897   38.0831
Figure 15: Distortion for coding Miss America sequence at 16kbps with rate control. Comparison of H1 heuristic with M1 coder.
Figure 16: Distortion for coding Miss America sequence at 16kbps with rate control. Comparison of H2 heuristic with M2 and RD coders.
Figure 17: Distortion for coding Claire sequence at 16kbps with rate control. Comparison of H1 heuristic with M1 coder.
Figure 18: Distortion for coding Claire sequence at 16kbps with rate control. Comparison of H2 heuristic with M2 and RD coders.
a lot of space to represent their estimated functions and a fair amount of time to evaluate them. However, since the number of values of, say, the MAD that are likely to be observed in practice is fairly small, the function H obtained could be stored cheaply and evaluated quickly by simply using a ROM.

We are also interested in examining whether other quickly computed statistics are better indicators of the value of a particular motion vector, or whether they could fruitfully be combined with the MAD. For example, the maximum prediction error for a block can be computed almost for free if the MAD is already being computed. As discussed earlier, other reasonable candidates include the DC transform coefficient and possibly a handful of other transform coefficients.

For computational efficiency, it would be convenient if the parameters used in any parametric method could be fixed independently of the quantization step size, with good results. Preliminary indications are that this is the case for some of the heuristic functions considered in this paper.

Finally, it would be interesting to determine how well the methods of this paper work in conjunction with video compression methods other than the H.261 standard. The upcoming H.263 standard is similar enough to H.261 that it seems clear these methods will work well with H.263. An example of the use of the techniques of this paper in a coder where the motion field is encoded using a quadtree is described in [4].
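The ROM idea above — precomputing H for the small range of MAD values seen in practice so evaluation reduces to a table lookup — might be sketched as follows, with hypothetical fitted coefficients.

```python
import numpy as np

def build_h_table(c1, c2, max_mad=255):
    """Precompute H(eps) = c1*log(eps+1) + c2 (Eq. 4) for every integer MAD
    value, so coding-time evaluation is a single indexed read (ROM-style)
    instead of a logarithm per candidate motion vector."""
    eps = np.arange(max_mad + 1, dtype=float)
    return c1 * np.log(eps + 1) + c2

h_rom = build_h_table(40.0, 10.0)   # hypothetical fitted coefficients
# evaluation is just h_rom[int(mad)] rather than recomputing the log each time
```

The same table-per-Q layout used for the fitted coefficients applies here: one such array per quantizer step size, or a single shared array if the parameters turn out to be usable independently of Q.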
References
[1] CCITT. Video codec for audiovisual services at p×64 kbit/s, 1990. Study Group XV, Report R 37.
[2] D. Le Gall. MPEG: A video compression standard for multimedia applications. Communications of the ACM, 34(4):46-58, Apr. 1991.
[3] W. Haerdle. Smoothing Techniques. Springer-Verlag, 1991.
[4] D. T. Hoang, P. M. Long, and J. S. Vitter. Explicit bit-minimization for motion-compensated video coding. In Proceedings of the 1994 IEEE Data Compression Conference, pages 175-184, Snowbird, UT, 1994.
[5] A. C. Hung. PVRG-P64 Codec 1.1, 1993. Available from Stanford University by anonymous ftp.
[6] J. R. Jain and A. K. Jain. Displacement measurement and its application in interframe coding. IEEE Transactions on Communications, COM-29(12):1799-1808, 1981.
[7] H. Li, A. Lundmark, and R. Forchheimer. Image sequence coding at very low bit rates: A review. IEEE Transactions on Image Processing, 3(5):589-609, Sep. 1994.
[8] M. Liou. Overview of the p×64 kbit/s video coding standard. Communications of the ACM, 34(4):60-63, Apr. 1991.
[9] A. Singh. Optic Flow Computation. IEEE Computer Society Press, 1991.