
Shot Boundary Detection Using Macroblock Prediction Type Information

S. De Bruyne¹, K. De Wolf¹, W. De Neve¹, P. Verhoeve², and R. Van de Walle³


¹ Ghent University - IBBT, ELIS, Multimedia Lab, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium
² Televic, Leo Bekaertlaan 1, B-8870 Izegem, Belgium
³ Ghent University - IBBT - IMEC, ELIS, Multimedia Lab, Gaston Crommenlaan 8 bus 201, B-9050 Ledeberg-Ghent, Belgium

Abstract The increasing availability and use of digital video have led to a high demand for efficient video analysis techniques. The starting point in video browsing and retrieval systems is the low-level analysis of video content, especially the segmentation of video content into shots. In this paper, we propose a method for automatic video indexing based on the macroblock prediction type information of MPEG-4 Visual compressed video bitstreams with varying GOP structures. This method exploits the decisions made by the encoder in the motion estimation phase, which result in specific characteristics of the macroblock prediction type information when shot boundaries occur. By working on compressed-domain information, full frame decoding of MPEG-4 Visual bitstreams is avoided. Hence, fast segmentation can be achieved compared to metrics in the uncompressed domain.

1 Introduction

Recent advances in multimedia compression technology, combined with the significant increase in computer performance and the growth of the Internet, have resulted in the widespread use and availability of digital video. As a consequence, many terabytes of video data are stored in large video databases. These databases are often insufficiently cataloged and are only accessible by sequential scanning of the sequences. This has resulted in a growing demand for new technologies and tools for the efficient indexing, browsing, and retrieval of digital video data. Shot boundary detection has been generally accepted as the necessary prerequisite step to achieve automatic video content analysis. A shot is defined as a sequence of frames continuously captured from the same camera. According to whether the transition between consecutive shots is abrupt or not, boundaries are classified as cuts or gradual transitions, respectively [1].

Most of the methods to perform shot boundary detection have been developed in the uncompressed domain [1]. In this domain, many features such as color and edges can be exploited, resulting in a high prediction accuracy. In the compressed domain, on the other hand, full decompression can be avoided by using compressed-domain features only [2, 3], which makes real-time processing possible. Since most video data are stored in compressed formats for efficiency of storage and transmission, we focus on methods in the compressed domain. Up to now, most algorithms in this domain have focused on MPEG-2 coded video. To go one step further, this paper takes a closer look at an algorithm for bitstreams compliant with the Advanced Simple Profile of MPEG-4 Visual. In this paper, we propose a method that is based on the encoder's search for the best prediction. This process selects for each macroblock (MB) the prediction type that obtains the most efficient encoding, which results in specific patterns in the MB type information when successive frames have dissimilar contents, such as consecutive frames belonging to different shots. The outline of this paper is as follows. In Section 2, background information on MPEG-4 Visual compressed video sequences is provided. The actual algorithm for shot boundary detection is elaborated in Section 3. Section 4 discusses some performance results obtained by our method, while Section 5 concludes this paper.

2 High-level Overview of MPEG-4 Visual

Compressed video sequences conforming to the MPEG-4 Visual specification [4] consist of three kinds of frames. I-VOPs (Video Object Planes) are coded without any reference to other frames, P-VOPs are coded using motion-compensated prediction from a previous frame, and B-VOPs use bidirectional motion-compensated prediction, meaning that a previous as well as a future frame can be used.

[Fig. 1: Possible positions of a cut in a frame triplet]

[Fig. 2: Example of a metric containing two shot boundaries]
These I- and P-VOPs are denoted as reference frames, as they can be used for the prediction of P- and B-VOPs. All these frames can be embedded in Groups Of Pictures (GOPs) corresponding to the structure described by the regular expression IB*(PB*)*. Each frame is further divided into MBs, which contain information about the type of temporal prediction and the corresponding motion vector(s) (MVs) used for motion compensation. The possible prediction types are intra coded (without any prediction from other frames) and inter coded (using motion-compensated prediction from one or more previously encoded frames). The latter is subdivided into forward, backward, and bidirectionally referenced prediction. Depending on the frame type to which the MBs belong, one can choose from one or more prediction types. In particular, I-VOPs contain only intra coded MBs, P-VOPs consist of intra coded and forward referenced MBs, and B-VOPs contain forward, backward, and bidirectionally referenced MBs. Our method, based on an algorithm for MPEG-2 video sequences [3], makes use of this prediction information to locate possible shot boundaries.

3 Shot Boundary Detection

3.1 Proposed Method

Within a video sequence, a continuous strong inter-frame correlation is present as long as no significant changes occur. Due to the high similarity within a shot, most of the MBs belonging to B-VOPs are coded using bidirectional prediction, because a B-VOP bears a strong resemblance to the past as well as to the future reference frame. These MBs can be encoded with a specific bidirectional mode, i.e. the direct mode. This mode utilizes the MVs of the corresponding reference frames as a prediction of its own MV(s) and provides higher compression when the motion in successive frames is similar. On the other hand, when a B-VOP has a high resemblance to only one of its reference frames, it mainly refers to this frame by forward or backward prediction and uses hardly any of the information in the other reference frame. It is clear that the amounts of the different kinds of MB prediction types can be used to define a metric for locating possible shot boundaries. One could expect that shot boundaries always occur at I-VOPs, but it should be mentioned that this observation alone is not sufficient to detect shot boundaries.

This is due to the fact that I-VOPs are often used as random access points in a video and therefore occur more often than shot boundaries. First, we consider a specific (most commonly used) GOP structure to explain the underlying idea. This GOP structure [IBBPBBPBB] can be split into groups of three frames having the form of a triplet IBB or PBB. In what follows, the reference frames I and P will be denoted as Ri, the front bidirectional frame as Bi, and the rear bidirectional frame as bi. According to this convention, the video sequence can be analyzed as a group of triplets of the form R1 B2 b3 R4 B5 b6 R7 B8 b9. Fig. 1 visualizes the three possible locations of a shot boundary (i.e., a cut) in a frame triplet. In the first case (a), one assumes that the front bidirectional frame Bi is the first frame with a different content. Since both bidirectional frames Bi and bi+1 have hardly any resemblance to the previous reference frame Ri-1 and a close resemblance to the following reference frame Ri+2, most of their MBs will be backward referenced. If the content change occurs at the rear reference frame Ri (b), this new information cannot be used by the bidirectional frames Bi-2 and bi-1. Therefore, most of these MBs will use forward prediction. Finally, if the content change occurs at bi (c), Bi-1 will be strongly predicted forward from the first reference frame Ri-2, while bi will be mainly predicted backward from the rear reference frame Ri+1. In case the content remains similar, none of the patterns above takes place, so that the major part of the MBs in the B-VOPs will be bidirectionally predicted. Based on these assumptions, a metric can be defined for the visual frame difference by analyzing the percentage of MBs in a frame that are forward and/or backward referenced. Let α_T(i) be the number of forward referenced MBs and β_T(i) the number of backward referenced MBs of a given frame with index i and frame type T. The frame difference metric δ(i) can be defined as follows [3]:

$$
\delta(i) =
\begin{cases}
\beta_B(i) + \beta_b(i+1), & \text{if } i \text{ is a B-frame (a)}\\
\alpha_B(i-2) + \alpha_b(i-1), & \text{if } i \text{ is an R-frame (b)}\\
\alpha_B(i-1) + \beta_b(i), & \text{if } i \text{ is a b-frame (c)}
\end{cases}
$$
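To make the case analysis concrete, the following sketch shows how δ(i) could be computed for the fixed R B b triplet structure, assuming the per-frame MB prediction-type counts have already been extracted from the bitstream. The FrameInfo container, its field names, and the way triplet positions are inferred from neighbouring frame types are our own illustrative choices, not part of the paper or of any particular decoder API.

```python
from dataclasses import dataclass

@dataclass
class FrameInfo:
    """Per-frame MB prediction-type counts (illustrative container)."""
    frame_type: str  # 'I', 'P', or 'B'
    forward: int     # forward referenced MBs   (alpha)
    backward: int    # backward referenced MBs  (beta)
    intra: int       # intra coded MBs          (gamma)

def is_reference(frame: FrameInfo) -> bool:
    return frame.frame_type in ('I', 'P')

def triplet_metric(frames: list, i: int) -> int:
    """delta(i) for the fixed IBB/PBB triplet structure of Section 3.1.

    The front B-frame directly follows a reference frame; the rear b-frame
    follows another B-frame. Boundary frames of the sequence are not handled.
    """
    if is_reference(frames[i]):
        # case (b): cut at the reference frame -> the preceding B and b frames
        # could not use the new content and are mostly forward predicted
        return frames[i - 2].forward + frames[i - 1].forward
    if is_reference(frames[i - 1]):
        # case (a): cut at the front B-frame -> B_i and b_{i+1} mostly refer
        # backward to the following reference frame
        return frames[i].backward + frames[i + 1].backward
    # case (c): cut at the rear b-frame -> B_{i-1} is predicted forward,
    # b_i is predicted backward
    return frames[i - 1].forward + frames[i].backward
```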

Peaks in δ(i) represent strong and abrupt changes in the video content. In Fig. 2, an example of the metric is shown containing two peaks, at frames 34 and 59, which correspond to two shot boundaries in the video sequence. To achieve automatic shot boundary detection, the results are compared with a predefined constant threshold or with an adaptive threshold based on the mean and the variation of δ(i) for the surrounding frames.
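The paper does not specify the adaptive threshold further; the sketch below is one plausible reading of it, flagging frame i as a cut when δ(i) clearly exceeds the local statistics of its neighbourhood. The window size, the scaling factor k, and the minimum-value guard are assumed parameters chosen here for illustration.

```python
import statistics

def detect_cuts(delta, window=10, k=3.0, min_value=0.1):
    """Return frame indices where delta[i] exceeds an adaptive threshold
    built from the mean and standard deviation of neighbouring values.
    All parameter values are illustrative, not taken from the paper."""
    cuts = []
    for i in range(len(delta)):
        lo, hi = max(0, i - window), min(len(delta), i + window + 1)
        neighbours = delta[lo:i] + delta[i + 1:hi]
        if not neighbours:
            continue
        mean = statistics.mean(neighbours)
        spread = statistics.pstdev(neighbours)
        if delta[i] > max(min_value, mean + k * spread):
            cuts.append(i)
    return cuts
```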


3.2 Generalized Approach

Most video sequences available at the moment do not satisfy the above-mentioned structure. For example, when there is a lot of motion in consecutive frames, it is hard to find a proper prediction from the reference frames. Intra coding only exists for reference frames, and as a consequence, all MBs in a B-VOP need to be coded using prediction. Therefore, the encoder often prefers to use more reference frames instead of bidirectional frames. On the other hand, when there is hardly any difference between successive frames, the encoder can prefer to encode more than two consecutive frames as B-VOPs in order to increase the compression rate. Moreover, not all sequences have the same resolution, and as a consequence different thresholds would be needed. To overcome these problems, the method needs to be extended to cope with all sorts of encoded video sequences, under the restriction that the encoded sequences contain B-VOPs. When taking a closer look at the algorithm above, two cases can be distinguished, namely when the frame is a reference frame or a bidirectional frame. In the case of a reference frame, δ(i) is obtained by counting the number of forward predicted MBs of the bidirectional frames lying between the current reference frame and the preceding reference frame. However, when there are no B-VOPs present between these two reference frames, this approach does not work. Therefore, the algorithm needs to be adjusted so that the value of δ(i) corresponds to the number of intra coded MBs in the current frame. In the case of bidirectional frames, the value of δ(i) is obtained by taking the sum of all forward referenced MBs of the preceding bidirectional frames and the backward referenced MBs of the current and following bidirectional frames between the previous and next reference frames. Furthermore, the obtained results need to be scaled by dividing δ(i) by the number of bidirectional frames and the number of MBs in a frame. Let α(i), β(i), and γ(i) be the number of forward referenced, backward referenced, and intra coded MBs, respectively, of a given frame with index i. Further assume that #mb is the total number of MBs in a frame and n the number of bidirectional frames between the previous reference frame Rf with frame index f and the current or following reference frame Rr with index r. The extended difference metric δ(i) can then be defined as:

$$
\delta(i) =
\begin{cases}
\dfrac{1}{n \cdot \#mb} \displaystyle\sum_{j=f+1}^{f+n} \alpha(j), & \text{if } i \text{ is an R-frame and } i-1 \text{ a B-frame}\\[2ex]
\dfrac{1}{\#mb}\, \gamma(i), & \text{if } i \text{ and } i-1 \text{ are R-frames}\\[2ex]
\dfrac{1}{n \cdot \#mb} \left( \displaystyle\sum_{j=f+1}^{i-1} \alpha(j) + \displaystyle\sum_{j=i}^{f+n} \beta(j) \right), & \text{if } i \text{ is a B-frame}
\end{cases}
$$

Due to the applied division, δ(i) is contained in the interval [0, 1] and consequently represents the probability of a shot boundary at position i. As a result, a constant threshold can be chosen for the automatic shot boundary detection of various kinds of video sequences. For gradual changes, a similar approach is adopted.
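Below is a sketch of how this generalized metric could be computed, reusing the FrameInfo container introduced earlier. The way the surrounding reference frames Rf and Rr are located, and the lack of handling for the very first and last frames of the sequence, are simplifications made here for illustration.

```python
def extended_metric(frames, i, num_mb):
    """Generalized delta(i) of Section 3.2 for arbitrary GOP structures.

    frames: list of FrameInfo; num_mb: total number of MBs per frame (#mb).
    """
    if is_reference(frames[i]) and is_reference(frames[i - 1]):
        # no B-VOPs between two reference frames: fall back on intra coded MBs
        return frames[i].intra / num_mb

    # locate the previous reference frame R_f ...
    f = i - 1
    while not is_reference(frames[f]):
        f -= 1
    # ... and the current or following reference frame R_r
    r = i
    while not is_reference(frames[r]):
        r += 1
    n = r - f - 1  # number of bidirectional frames in between

    if is_reference(frames[i]):
        # reference frame preceded by B-VOPs: forward MBs of those B-VOPs
        total = sum(frames[j].forward for j in range(f + 1, f + n + 1))
    else:
        # bidirectional frame: forward MBs before i, backward MBs from i onwards
        total = (sum(frames[j].forward for j in range(f + 1, i)) +
                 sum(frames[j].backward for j in range(i, f + n + 1)))
    return total / (n * num_mb)
```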

4 Experimental Results

In order to evaluate the proposed method, an MPEG-4 Visual compliant decoder (XviD) was adjusted to support shot boundary detection. First, the performance of the algorithm was evaluated on several kinds of video. Afterwards, the influence of the bitrate and of the motion estimation parameters on the encoder side was examined, and the results were compared to algorithms in the uncompressed domain. In the test phase, four sequences with a frame size around 352x192 were selected. The first one is a part of the movie Drive, containing 64 shot boundaries, in which shots with lots of object and camera motion alternate with dialogs. Shrek 2, Return of the Jedi, and Troy are all movie trailers brimming with all kinds of shot changes, containing 93, 51, and 36 shot boundaries, respectively. Especially Jedi and Troy are a real challenge since they are full of motion, gradual changes, special effects, variations in light intensity, et cetera.

4.1 Performance

To evaluate the performance of the algorithm, a comparison based on the number of missed detections (MDs) and false alarms (FAs) is used:

$$
\text{Recall} = \frac{\text{Detects}}{\text{Detects} + \text{MDs}} \qquad
\text{Precision} = \frac{\text{Detects}}{\text{Detects} + \text{FAs}}
$$

In Table 1, the performance of the proposed algorithm is presented for the above-mentioned video sequences coded at a bitrate around 680 kbit/s.

Table 1 Performance of the algorithm

            Drive   Shrek 2   Jedi   Troy
detects       64       88      50     29
MD             0        5       1      7
FA             2       10       9      1
recall       100%      95%     98%    81%
precision     97%      90%     85%    97%
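As a minimal cross-check of the figures in Table 1, recall and precision follow directly from the detect, MD, and FA counts; the small helper below is purely illustrative.

```python
def recall_precision(detects, missed, false_alarms):
    """Recall and precision as defined in Section 4.1."""
    recall = detects / (detects + missed)
    precision = detects / (detects + false_alarms)
    return recall, precision

# Example with the Drive column of Table 1: 64 detects, 0 MDs, 2 FAs
print(recall_precision(64, 0, 2))  # -> (1.0, 0.969...), i.e. 100% / ~97%
```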

For these test results, the major part of the missed detections is caused by gradual changes, since these changes are spread over an unknown number of frames and, in some cases, there is hardly any difference between two consecutive frames. This is a problem which most shot boundary detection algorithms have to cope with. The false alarms have various causes. Sudden changes in light intensity, such as lightning, camera flashlights, and explosions, often lead to false alarms. This is due to the fact that the current image cannot be predicted from previous reference frames since the luminance differs strongly. Uniform black shots also cause problems, since the encoder prefers forward prediction in the case of black frames. It should be possible to solve this problem by taking a look at the DC coefficients.
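The paper only hints at this DC-coefficient check. One possible reading is to treat a frame whose average luma DC value is very small as (near-)black and to suppress detections on such frames; the threshold and the suppression strategy below are assumptions, not something described in the paper.

```python
def is_near_black(dc_luma, max_mean_dc=16):
    """Heuristic: flag a frame as (near-)black when the mean of its luma DC
    coefficients is below a small threshold. The threshold is illustrative."""
    return sum(dc_luma) / len(dc_luma) < max_mean_dc
```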

Table 2 Influence of the bitrate on Jedi and Troy

bitrate (kbit/s)   detects   MD   FA   recall   precision
Jedi 290             49       2    8    96%       84%
Jedi 680             50       1    9    98%       85%
Jedi 980             50       1   11    98%       82%
Troy 290             29       7    0    81%      100%
Troy 680             29       7    1    81%       97%
Troy 980             30       6    1    83%       97%

When a shot contains lots of movement, originating from objects or the camera, false alarms will also often occur. Due to this motion, successive frames will have less similarity and it will be more difficult for the encoder to find a good prediction. This leads to lots of intra coded macroblocks, and therefore the structure of the macroblock type information in successive frames bears resemblance to that of gradual changes. When taking a closer look at the test results, it also stands out that nearly all cuts were detected. This implies that the performance for simple video sequences, such as news programs and drama soaps, can be expected to be very high.

4.2 Influence of the Bitrate

The influence of the bitrate chosen at the encoder side on the performance of the algorithm is shown in Table 2. These results show that the recall and the precision for different bitrates are alike. Nevertheless, one can see that the number of false alarms slightly increases and the number of missed detections decreases when the bitrate rises. This is due to the fact that the encoder prefers more reference frames and intra coded MBs in shots with a lot of motion when the bitrate is higher. The results for other sequences are similar.

4.3 Influence of the Parameters for Motion Estimation

The influence of the parameters for motion estimation at the encoder side is shown in Table 3. These parameters determine the complexity of the search window, the sub-pixel motion search precision, et cetera. Based on these parameter values, several tests were carried out for low, medium, and high complexity motion estimation. From these results, one can conclude that the complexity of the encoder's search for the best prediction has only a minor impact on the performance.

4.4 Comparison with Methods in the Uncompressed Domain

In the past, methods based on features in the uncompressed domain were investigated in our research group, in particular global color histograms and edge detection algorithms using Sobel filtering techniques. The comparison for two test sequences is given in Table 4. Our algorithm outperforms edge detection, but the algorithm based on histograms obtains even better results. It is obvious that algorithms in the uncompressed domain achieve a higher accuracy since they have more features at their disposal. The great advantage of the compressed domain, on the other hand, is the fast segmentation; in particular, this method is faster than real-time. When comparing the frame rates, we notice that our algorithm is a factor 5.66 faster than the color histogram method, and faster by an even larger factor than edge detection. On a regular desktop, our algorithm achieves a frame rate of 320 frames per second.

Table 3 Influence of motion estimation on Jedi and Troy

complexity      detects   MD   FA   recall   precision
Jedi low          50       1    9    98%       85%
Jedi medium       50       1    9    98%       85%
Jedi high         50       1    9    98%       85%
Troy low          28       6    1    82%       97%
Troy medium       28       7    1    80%       97%
Troy high         27       7    2    79%       93%

Table 4 Comparison with the uncompressed domain

                   Shrek 2               Troy
                   recall   precision    recall   precision
histograms          90%      100%         88%      100%
edge detection      78%       94%         70%       92%
macroblocks         95%       90%         79%       93%

5 Conclusion

In this paper, we discussed an algorithm for automatic shot boundary detection based on macroblock prediction type information, a feature that is available in the compressed domain. Formulas for a generic GOP structure were presented, which are very useful in practice but receive little attention in other papers. The measurements illustrate that this algorithm performs well, keeping in mind that the approach is quite consistent for different encoder parameter settings and far more rapid than methods in the uncompressed domain.

Acknowledgement: The research described in this paper was funded by Ghent University, the Interdisciplinary Institute for Broadband Technology (IBBT), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), the Belgian Federal Science Policy Office (BFSPO), and the European Union.

References
1. U. Gargi, R. Kasturi, and S. Strayer, "Performance Characterization of Video-Shot-Change Detection Methods," IEEE Transactions on Circuits and Systems for Video Technology, vol. 10, pp. 1-13, 2000.
2. S. Pei and Y. Chou, "Efficient MPEG Compressed Video Analysis Using Macroblock Type Information," IEEE Transactions on Multimedia, vol. 1, pp. 321-333, 1999.
3. J. Calic and E. Izquierdo, "Towards Real-Time Shot Detection in the MPEG Compressed Domain," Proceedings of the Workshop on Image Analysis for Multimedia Interactive Services, 2001.
4. ISO/IEC JTC 1, "Coding of Audio-Visual Objects - Part 2: Visual," ISO/IEC 14496-2 (MPEG-4 Visual version 1), April 1999.
