
IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 22, NO. 1, JANUARY 2012 67

VSYNC: Bandwidth-Efficient and Distortion-Tolerant Video File Synchronization
Hao Zhang, Member, IEEE, Chuohao Yeo, and Kannan Ramchandran, Fellow, IEEE

Abstract: We introduce video-sync (VSYNC), a video file synchronization system that efficiently uses a bidirectional communications link to maintain up-to-date video sources at remote ends to a desired resolution and distortion level. By automatically detecting and transmitting only the differences between video files, VSYNC is able to avoid unnecessary re-transmission of the entire video when there are only minor differences between video copies. A hierarchical hashing scheme is designed to allow synchronization to within some user-defined distortion, while being rate-efficient and computationally tractable. Distributed video coding is used to realize further rate savings when transmitting video updates. VSYNC is bandwidth-efficient and is useful in many scenarios including video backup, video sharing, and video authentication applications. Experimental results show that rate-savings ranging from 2× to 10× can be obtained by VSYNC with about 10% of the frames being edited, compared to re-transmitting the compressed video or using a file synchronization utility such as rsync.

Index Terms: rsync, video coding, video file synchronization, video hash, VSYNC.

Manuscript received October 5, 2010; revised February 8, 2011, April 15, 2011 and April 26, 2011; accepted April 27, 2011. Date of publication June 2, 2011; date of current version January 6, 2012. This paper was recommended by Associate Editor B. Yan.
H. Zhang and K. Ramchandran are with the Department of Electrical Engineering and Computer Science, University of California at Berkeley, Berkeley, CA 94720 USA (e-mail: zhanghao@eecs.berkeley.edu; kannanr@eecs.berkeley.edu).
C. Yeo is with the Department of Signal Processing, Institute for Infocomm Research, 138632, Singapore (e-mail: zuohao@eecs.berkeley.edu).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TCSVT.2011.2158336
1051-8215/$26.00 © 2011 IEEE

I. Introduction

Proliferation of video content continues to severely burden the Internet. It has been estimated that video will dominate more than 60% of the entire consumer Internet traffic by 2013 [1]. However, much video content information that is transmitted may be redundant, since many copies of the same video exist on many different places or even on the same server. Oftentimes, these different versions have only minor differences, e.g., due to typical edits including deletion/insertion, rearrangement, and/or transcoding. Clearly, completely retransmitting a similar video wastes bandwidth, and it would be desirable to have an automatic scheme that can detect and transmit just the differences.

There are many general file synchronization tools that are used to maintain the same file on two or more remote locations while aiming to minimize communication bandwidth. Such tools include rsync [2] and zsync [3], which are open-source utilities in the Unix/Linux operating systems. These conventional utilities are efficient in reducing the bandwidth for exact bit-level synchronization of general data files. However, video files with the same content may be stored in different bitstream formats on different locations, and when sent over the network may be encoded in various ways in order to reduce transmission redundancy as well as to meet the receiver's device characteristics. Therefore, these standard synchronization protocols are unsuitable for video data because they fail to capture similarity at the content level.

This motivates us to investigate protocols that can patch and update the differences between video files, which we call video synchronization. There can be multiple versions of similar video sources that users wish to synchronize to a single version. Different videos might also have different video parameters such as resolution and quality. This calls for a protocol that is video-centric and that can synchronize video files to satisfy predefined distortion constraints. In this paper, we introduce video-sync (VSYNC), a video-centric video file synchronization protocol that can detect and send the changes in an automated fashion.

Video file synchronization has a wide range of applications, including video file sharing, video data backup, online video editing, and video forensics. Consider the following use cases.
1) A marketing professional at a company headquarters prepares a product demo video and sends an MPEG-compressed version of it to some decision makers in other company offices around the world. He later does some minor editing to the original video according to the committee's discussions, and wants to update everybody's copy at hand. Retransmitting the entire video file which has only small changes is both expensive and unnecessary, especially in situations where the video file is large and the communication cost is high.
2) It may be desirable to allow mobile devices to always have access to the most recent copy of video files located on a remote server, in a similar fashion to the Sync function in Windows Media Player [4] which is designed for local and exact bit-level synchronization where bandwidth is not a concern. However, minimizing transmission rate is of particular interest when there are tight bandwidth constraints, such as over wireless networks.
3) In video authentication, subscribers may want to find out if their copies of rented or purchased video have been tampered with and if so, would want to detect,

identify, and fix the tampered content using minimum transmission overhead.

In all of the above cases, the videos on the two ends of the communication link need not have the same resolution or encoding scheme. It would be highly desirable to have a protocol that automatically detects the changes and synchronizes the video files under a user-defined distortion constraint while using minimal data transfer bandwidth, taking advantage of the fact that both parties already have some version of the video data.

It is worth noting that no assumption has been made about the book-keeping of the modification history of one video file with respect to another similar video file. In practice, such book-keeping may not always be available or feasible. For example, in video authentication, the history of changes of the tampered content is usually not available nor reversible. In video editing, different users may also have different video editing software which makes book-keeping difficult. It is desirable to design a tool that can automatically detect and update the differences without knowledge of what the differences are.

A. Problem Setup

To formulate the problem setup, a two-user case with grayscale videos is considered as shown in Fig. 1. We will use this setup for clarity of exposition in this paper, although it can be easily extended to synchronization among multiple users with color videos in a distributed fashion. Consider user $U_1$, who has video $V_{\text{orig}}$, and wants to synchronize with user $U_2$ who has video $V_{\text{upd}}$, to get video $\hat{V}_{\text{upd}}$. In general, $V_{\text{orig}}$, $V_{\text{upd}}$, and $\hat{V}_{\text{upd}}$ may have different resolution and quality, and the goal is to obtain $\hat{V}_{\text{upd}}$ at the desired resolution and quality given $V_{\text{orig}}$ and $V_{\text{upd}}$ using minimal total transmission rate $R_1 + R_2$. The same notation in this setup will be used throughout the paper unless stated otherwise.

Fig. 1. Typical VSYNC application setup. Initially, users $U_1$ and $U_2$, respectively, have video files $V_{\text{orig}}$ and $V_{\text{upd}}$, which can be similar in content. User $U_1$ then wants to synchronize his/her video to $V_{\text{upd}}$ at a specified resolution and quality to get $\hat{V}_{\text{upd}}$. Bidirectional communications is present, and the goal is to minimize the total transmission rate $R_1 + R_2$ while satisfying the required distortion constraint between $V_{\text{upd}}$ and $\hat{V}_{\text{upd}}$.

B. Contributions

In this paper, a novel video synchronization protocol VSYNC is described for the setup depicted in Fig. 1. The major contributions of VSYNC are: 1) it can automatically synchronize videos to within some targeted distortion constraint, e.g., mean-squared-error (MSE), with high probability, without prior knowledge of the differences between videos; 2) our use of hierarchical hashes to verify the synchronization, where the hash rate threshold can be systematically and rigorously determined as a function of the targeted distortion; and 3) the system is invariant to the choice of resolution or codec choice, i.e., it can synchronize video sources encoded with different encoding parameters. To the best of our knowledge, this paper is the first attempt to address the synchronization problem for video data.

The idea of VSYNC was first introduced by Zhang et al. [5]. A distributed video coding (DVC) approach is proposed to further reduce the transmission rate in their later work [6]. The framework is then applied to performing synchronization across heterogeneous mobile platforms [7]. All of these previous works assumed that the vectors of pixel values of any underlying pairs of compared units, e.g., frames, have the same $\ell_2$ norm. Peak signal-to-noise ratio (PSNR) performance can only be guaranteed when the variances (or contrasts) of the underlying videos are normalized. In this paper, we provide a holistic view of the VSYNC protocol and complete the work by proposing solutions that guarantee the PSNR performance even when such variances are different.

Experimental results show that compared to baseline methods, such as re-transmission using H.264 and synchronization using rsync, VSYNC offers substantial transmission savings in typical usage scenarios for various target distortion constraints. The proposed system is offered as a promising candidate to improve bandwidth efficiency in video file sharing, video file backup, video authentication, and other related applications.

II. Related Work

A. Rsync and Zsync Protocols

We will use the notation introduced earlier in Fig. 1, except that video data files are treated as general data files in these protocols. When an update request is initiated using rsync, user $U_1$ splits the file $V_{\text{orig}}$ into non-overlapping chunks of fixed size, computes a weak hash and a strong hash for each chunk and sends them to $U_2$ over the communications link [2]. The weak hash is the Adler-32 rolling checksum and the strong hash is the MD4 hash. User $U_2$ then computes the weak hashes for every overlapping chunk, i.e., chunks that overlap with each other by a certain size (usually one byte), of the same size in the updated file $V_{\text{upd}}$ in a computationally efficient manner by taking advantage of the recursive nature of the Adler-32 checksum calculation, and compares them with the weak hashes sent by user $U_1$ as an initial check for matches. $U_2$ further verifies chunks matched by the weak hash check by checking their MD4 hashes. $U_2$ then sends the data in $V_{\text{upd}}$ that are not part of any chunk match with $V_{\text{orig}}$ along with information on where to merge these blocks into the recipient's version to construct $\hat{V}_{\text{upd}}$. The bit-stream of $\hat{V}_{\text{upd}}$ will be identical to that of $V_{\text{upd}}$ after the synchronization. zsync works in a similar way, but the hash checks will be performed by $U_1$ instead of $U_2$ [3].

It is worth noting that the hashes used in rsync and zsync are generated on the bit-level of the file, thus a single bit change within a chunk in the file will change its corresponding hash.
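The weak/strong matching step described above can be sketched in a few lines. This is a minimal illustration, not the rsync implementation: the function names (`adler_parts`, `roll`, `find_matches`) are hypothetical, SHA-256 from the standard library stands in for the MD4 strong hash, and weak-hash collisions between chunks of the old file are ignored for brevity. The point is the recursive update in `roll`, which lets the receiver of the hashes slide a window byte by byte at constant cost per step.

```python
import hashlib
import zlib

MOD = 65521  # Adler-32 modulus (largest prime below 2**16)


def adler_parts(block: bytes) -> tuple[int, int]:
    """Compute the two running sums (A, B) of Adler-32 from scratch."""
    a, b = 1, 0
    for byte in block:
        a = (a + byte) % MOD
        b = (b + a) % MOD
    return a, b


def roll(a: int, b: int, out_byte: int, in_byte: int, n: int) -> tuple[int, int]:
    """Slide an n-byte window one byte to the right in O(1) time."""
    a_new = (a - out_byte + in_byte) % MOD
    b_new = (b - n * out_byte + a_new - 1) % MOD
    return a_new, b_new


def find_matches(old: bytes, new: bytes, n: int) -> dict[int, int]:
    """Map byte offsets in `new` to indices of matching n-byte chunks of `old`."""
    # Holder of the old file: hash every non-overlapping chunk.
    table = {}
    for i in range(len(old) // n):
        chunk = old[i * n:(i + 1) * n]
        # SHA-256 is an assumed stand-in for the strong (MD4) hash.
        table.setdefault(zlib.adler32(chunk), (i, hashlib.sha256(chunk).digest()))

    # Holder of the new file: check every overlapping window.
    matches = {}
    if len(new) < n:
        return matches
    a, b = adler_parts(new[:n])
    for off in range(len(new) - n + 1):
        hit = table.get((b << 16) | a)  # cheap weak check first
        if hit and hashlib.sha256(new[off:off + n]).digest() == hit[1]:
            matches[off] = hit[0]       # confirmed by the strong hash
        if off + n < len(new):
            a, b = roll(a, b, new[off], new[off + n], n)
    return matches
```

Any byte range of the new file that is not covered by a match would then be sent literally, together with the merge positions. Because both hashes operate on raw bytes, re-encoding the same visual content defeats every match, which is the limitation the text points out.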

When applied to synchronization of video files, such content-agnostic approaches would be very inefficient.

B. Image and Video Hashing

There are various image and video hashing techniques in the literature, many of which are designed for tampering detection. Fridrich and Goljan [8] proposed a tampering detection method that is robust to JPEG compression and additive noise. Discrete cosine transform (DCT) blocks of an image are projected onto random vectors followed by adaptive thresholding to get binary hashes. Roy and Sun [9] proposed an image hash based on two parts. One part consists of quantized projections of an image's scale-invariant feature transform features onto random hyperplanes. The other part consists of both the orientation of the image and the block-wise histogram statistics. Lin and Chang [10] proposed a hashing technique for image authentication which can prevent malicious manipulations but allow JPEG lossy compression. The authentication signature is based on the invariance of the relationships between DCT coefficients at the same position in separate blocks of an image. Lin et al. introduced a rate-efficient image tamper detection technique [11] similar in spirit to a secure biometric hash proposed by Draper et al. [12], where blocks of the query image are projected onto random vectors and quantized, and then syndrome encoded using a suitable low-density-parity-check (LDPC) code [13], [14]. The syndromes and a cryptographic hash of the quantized projections are sent over the network. The same projections and cryptographic hashes are also obtained of the original image, and the received syndrome bits are decoded using the projections of the original image as side information. If the decoding is successful and the cryptographic hashes match, a successful match will be declared. In Lin's work [11], tolerable differences between the original image and the query image are assumed to be due to quantization errors, and the authors showed empirically that the difference between the unquantized projection coefficients follow a Gaussian distribution after compression. However, the Gaussian assumption may break down after quantization, right before LDPC coding is applied and therefore the thresholds used need to be experimentally determined.

While the above-mentioned hashing algorithms are robust to certain kinds of compression distortion and sensitive to content-based visual attacks, their focus is on tamper detection rather than file synchronization without accountability for transmission efficiency. Furthermore, only empirical performance of the hashing algorithms was given and no quantifiable relationship between the similarity of hashes and that of the underlying images/videos has been established.

Our work of VSYNC differs from previous works in several aspects. First, VSYNC is a bidirectional protocol that aims at synchronizing video files at remote ends in a rate-efficient and computation-efficient manner, rather than detecting image tampering. Second, a hierarchical hashing scheme is designed to learn the differences and similarities between the video data, which can be analytically quantified and mapped to the Hamming distance between their respective hashes. Third, while we use channel coding to reduce the hash rate, we also use an extra MSE check after the channel code decoding process to validate if the underlying content satisfies the required distortion. This estimated MSE is then used to apply a DVC approach to estimate and transmit the necessary additional bits. This avoids unnecessary overhead and further improves transmission efficiency.

III. System Overview

We first present an overview of the VSYNC system as illustrated in Fig. 2. Define a group of frames (GOF) as a series of $F$ frames. Each frame of size $N_1$ by $N_2$ is divided into $q$ by $q$ macroblocks (MBs). Also define an MB tube as the set of co-located MBs within a GOF, as shown in Fig. 3. When an update request is initiated, both $V_{\text{orig}}$ and $V_{\text{upd}}$ are first resized to the same resolution as $\hat{V}_{\text{upd}}$ if necessary.¹

¹ For example, if the required resolution of $\hat{V}_{\text{upd}}$ is 640 × 480, and the original resolution of $V_{\text{orig}}$ and $V_{\text{upd}}$ is 320 × 240 and 1280 × 960, respectively, then $V_{\text{orig}}$ should be upsampled by a factor of 2 and $V_{\text{upd}}$ should be downsampled by a factor of 2.

Fig. 2. VSYNC Protocol. In step 1, user $U_1$ generates weak hashes for the GOFs and strong hashes for the MB tubes and sends them to user $U_2$. User $U_2$ verifies the hashes in step 2 and generates new content and assembly instructions in step 3. In step 4, user $U_1$ decodes the new content and assembles the updated video.

Fig. 3. Illustration of GOF and MB tube. A GOF is a consecutive series of frames. An MB tube is a collection of co-located MBs within a GOF.

In step 1 in Fig. 2, user $U_1$ computes a weak hash of every non-overlapping GOF of $V_{\text{orig}}$ and a strong hash for each MB tube across each GOF. In step 2, user $U_2$ computes the weak hash for every overlapping GOF of $V_{\text{upd}}$ and compares it with that sent by $U_1$, where the overlapping GOFs are GOFs that overlap with each other by a fixed number of frames. Such weak hash check helps locate for each GOF in $V_{\text{orig}}$ all the possible matching GOFs in $V_{\text{upd}}$. Whenever the weak hash check

is verified, $U_2$ then verifies the match of the corresponding GOF pairs by checking a strong hash of each corresponding pair of MB tubes across the two GOFs, which helps detect localized edits and avoid re-transmission of unmodified blocks. In step 3, all the GOFs and MB tubes that do not pass both hash checks are encoded using a suitable scheme discussed in Section V. This together with the assembly instructions, i.e., information on where to merge these new data into the $V_{\text{orig}}$, is transmitted to user $U_1$, who uses it to construct $\hat{V}_{\text{upd}}$, as shown in step 4. The details of the hashing algorithms, new data encoding schemes, and parameter configurations are presented in the subsequent sections. Table I lists the key notations used in this paper.

TABLE I
Key Notations

Notation                                      Definition
$V_{\text{orig}}$                             Video to be updated
$V_{\text{upd}}$                              Destination (updated) video
$F$                                           No. of frames in a GOF
$N_1 \times N_2$                              Dimensions of a frame
$q \times q$                                  Dimensions of an MB
$\mathbf{v}_1$ ($\mathbf{v}_2$)               Sampled pixel values of GOF in $V_{\text{orig}}$ ($V_{\text{upd}}$)
$\boldsymbol{\mu}_1$ ($\boldsymbol{\mu}_2$)   Pixel values of MB tube in $V_{\text{orig}}$ ($V_{\text{upd}}$)
$K_w$ ($K_s$)                                 No. of projections for weak (strong) hash
$\mathbf{w}_1$ ($\mathbf{w}_2$)               Weak hash of $V_{\text{orig}}$ ($V_{\text{upd}}$)
$\mathbf{s}_1$ ($\mathbf{s}_2$)               Strong hash of $V_{\text{orig}}$ ($V_{\text{upd}}$)
$\mathbf{s}_{1e}$ ($\mathbf{s}_{2e}$)         Encoded strong hash of $V_{\text{orig}}$ ($V_{\text{upd}}$)
$T_M$ ($T_P$)                                 Required MSE (PSNR) threshold
$\theta_w$ ($\theta_s$)                       Threshold angle for weak (strong) hash
$p_w$ ($p_s$)                                 Threshold flipping probability for weak (strong) hash
$R(p_s)$                                      Rate of LDPC code with parameter $p_s$

Note: we use bold type to denote vectors.

IV. Hierarchical Hashing Scheme

We apply a hierarchical hashing scheme, namely, a weak hash check followed by a strong hash check, to find the matches. The weak hash check is computationally efficient and is used to quickly find candidate matches while the strong hash check follows up on the candidate matches to verify each match at the cost of more computations. This top-to-bottom hierarchical hashing structure is used to provide a scalable way to find matching content by trading off between accuracy and complexity. In this section, we will first state the distortion criteria used, and introduce the distance-preserving random projections technique that is used as the basis of the weak and strong hashes. Then, we will explain how the weak and strong hashes are computed and checked.

A. Distortion Criteria

Unlike general file synchronization tools such as rsync and zsync, VSYNC does not aim at maintaining exact copies of the video data, but rather keeping the visual content to within some user-defined distortion. In this paper, we use the MSE (or PSNR) between the pixel values in videos as the distortion criteria. Denote by $T_M$ the threshold in MSE, the corresponding threshold for PSNR is $T_P = 20 \log_{10} \frac{255}{\sqrt{T_M}}$, where 255 is the largest pixel value of an 8-bit grayscaled video. To quantify similarity, we say that if two content units of the same length, e.g., the corresponding GOFs, frames or MBs in $V_{\text{orig}}$ and $V_{\text{upd}}$, are within the threshold $T_M$ in terms of relative MSE of their pixel values (or within $T_P$ in terms of their relative PSNR), then these two content units should pass the similarity check and hence no re-transmission is needed for synchronization; otherwise, the content units are said to fail the similarity check and the corresponding content unit in $V_{\text{upd}}$ should be re-transmitted. Since the validation process during synchronization is performed by hashing, it would be desirable to control the maximum distortion of the target video by adjusting the corresponding hash rate for a given threshold. To achieve this, we will first introduce a random projections idea studied by Yeo et al. [15] as follows.

B. Distance-Preserving Random Projection

Consider two $n$-dimensional vectors $\mathbf{z}_1$ and $\mathbf{z}_2$, each projected onto $K$ independent and identically distributed random hyperplanes $P_1, P_2, \ldots, P_K$ followed by 1-bit quantization depending on the sign of the projected values to generate $K$-bit binary hashes $h(\mathbf{z}_1)$ and $h(\mathbf{z}_2)$, respectively. Extending the result of Yeo et al. [15], it can be shown that the probability that $\mathbf{z}_1$ and $\mathbf{z}_2$ have different quantization bits for each hyperplane is

$$p = \frac{1}{\pi} \arccos \frac{\langle \mathbf{z}_1, \mathbf{z}_2 \rangle}{\|\mathbf{z}_1\| \, \|\mathbf{z}_2\|} \tag{1}$$

where $\langle \mathbf{z}_1, \mathbf{z}_2 \rangle$ is the inner-product between $\mathbf{z}_1$ and $\mathbf{z}_2$. Fig. 4 illustrates this relationship graphically.

Fig. 4. Angle subtended by the rays from the origin to $\mathbf{z}_1$ and $\mathbf{z}_2$ can be found using simple trigonometry to be $\arccos \frac{\langle \mathbf{z}_1, \mathbf{z}_2 \rangle}{\|\mathbf{z}_1\| \, \|\mathbf{z}_2\|}$. If a hyperplane orientation is chosen uniformly at random, then the probability of the hyperplane separating $\mathbf{z}_1$ and $\mathbf{z}_2$ is just $p = \frac{1}{\pi} \arccos \frac{\langle \mathbf{z}_1, \mathbf{z}_2 \rangle}{\|\mathbf{z}_1\| \, \|\mathbf{z}_2\|}$.

The Hamming distance $d_H(h(\mathbf{z}_1), h(\mathbf{z}_2))$ between the two $K$-bit hashes $h(\mathbf{z}_1)$ and $h(\mathbf{z}_2)$ can be used to estimate the flipping probability $p$ with $\hat{p} = \frac{d_H(h(\mathbf{z}_1), h(\mathbf{z}_2))}{K}$. An estimate of the Euclidean distance between $\mathbf{z}_1$ and $\mathbf{z}_2$ is then given by

$$\hat{d}_E(\mathbf{z}_1, \mathbf{z}_2) = \sqrt{\|\mathbf{z}_1\|^2 + \|\mathbf{z}_2\|^2 - 2 \|\mathbf{z}_1\| \, \|\mathbf{z}_2\| \cos(\pi \hat{p})}. \tag{2}$$

The random projection approach provides a promising checksum for videos because it is distance-preserving. Using the properties described above, we will be able to relate the MSE and thus the PSNR of the content to the Hamming distance between these binarized random projection hashes [16].

C. Weak Hash Check

The goal of the weak hash is to locate possible matching GOFs in $V_{\text{upd}}$ for every non-overlapping GOF in $V_{\text{orig}}$. The hash needs to be fast to check, and has to have a small false negative rate, i.e., the probability that two similar content units will fail the weak hash check should be low. To achieve these

goals, $F N_1 N_2$ pixels for each GOF are uniformly randomly sub-sampled. The sub-sampling is done to reduce the computation complexity and the sampling fraction, denoted by $f$, can be chosen to trade off between complexity and accuracy. The sub-sampling can be coordinated between user $U_1$ and user $U_2$ by some pre-agreed common pseudo-randomness on the hash generation process for both the GOF and the MB tubes within each GOF. Denote by $\mathbf{v}_1$ and $\mathbf{v}_2$ respectively the vector of randomly sampled pixels of the GOFs in $V_{\text{orig}}$ and $V_{\text{upd}}$, and by $N_w = F N_1 N_2 f$ the number of sampled pixels. $\mathbf{v}_1$ and $\mathbf{v}_2$ are then projected onto $K_w$ random hyperplanes $P_k \in \mathbb{R}^{N_w}$, $k \in \{1, \ldots, K_w\}$. Each projected value is then quantized into one bit depending on its sign. Denote by $\mathbf{w}_1$ and $\mathbf{w}_2$ the vectors of the quantized values. To make the weak hash check computationally light, only the Hamming distance $d_H(\mathbf{w}_1, \mathbf{w}_2)$ between $\mathbf{w}_1$ and $\mathbf{w}_2$ is checked. The relationship between $d_H(\mathbf{w}_1, \mathbf{w}_2)$ and the distortion tolerance $T_M$ is discussed in the sequel.

Following the idea used in (1), we know that the entries in $\mathbf{w}_1$ and $\mathbf{w}_2$ will be either 0 or 1 and that each pair of corresponding entry bits in the vectors, denoted by $w_1[i]$ and $w_2[i]$, $i = 1, 2, \ldots, K_w$, should have a probability of differing to be $\Pr(w_1[i] \neq w_2[i]) = \frac{1}{\pi} \arccos \frac{\langle \mathbf{v}_1, \mathbf{v}_2 \rangle}{\|\mathbf{v}_1\| \, \|\mathbf{v}_2\|}$, which can be estimated using $\frac{d_H(\mathbf{w}_1, \mathbf{w}_2)}{K_w}$.

Given $\mathbf{v}_1$ of length $N_w$, any vector that is within an MSE of $T_M$ from it should lie within the circle of radius $\sqrt{N_w T_M}$ as shown in Fig. 5. The largest possible angle $\theta_w$ that any vector within the circle can have with respect to $\mathbf{v}_1$ is formed between $\mathbf{v}_1$ and the vector that is tangent to the circle. It can be seen that vectors that form an angle larger than $\theta_w$ with respect to $\mathbf{v}_1$ will fail the MSE requirement and therefore should not pass the weak hash check. The threshold $\theta_w$ can be obtained as

$$\theta_w = \arcsin \frac{\sqrt{N_w T_M}}{\|\mathbf{v}_1\|}. \tag{3}$$

Fig. 5. Demonstration of the choice of the thresholding angle for the weak hash. Any vector that is within $T_M$ from vector $\mathbf{v}_1$ lies in the circle shown in the figure. The vector in the circle that has the largest angle with respect to $\mathbf{v}_1$ is the one that is tangent to the circle. If any vector $\mathbf{v}_2$ forms an angle larger than $\theta_w$, it will be rejected.

Since $\frac{d_H(\mathbf{w}_1, \mathbf{w}_2)}{K_w}$ provides a reasonably good estimate of the flipping probability of any random hyperplane bisecting the vectors $\mathbf{v}_1$ and $\mathbf{v}_2$, we should let the corresponding two GOFs pass the weak hash check if $\frac{d_H(\mathbf{w}_1, \mathbf{w}_2)}{K_w} < \frac{\theta_w}{\pi}$, otherwise they fail. One threshold $\theta_w$ for each GOF will be piggy-backed in the hash bits to be transmitted. Since only one threshold per GOF is transmitted, the overhead is small.

In summary, the weak hash check works as follows. User $U_1$, who wants to synchronize his/her video $V_{\text{orig}}$ to $V_{\text{upd}}$ up to MSE distortion $T_M$, will generate and transmit the weak hash to user $U_2$. Specifically, for each non-overlapping GOF in $V_{\text{orig}}$, a fraction $f$ of all the pixels are randomly sampled to get $\mathbf{v}_1$. The $\ell_2$ norm of $\mathbf{v}_1$ is computed and used with $T_M$ to obtain the threshold angle $\theta_w$. The sampled values are projected onto $K_w$ random hyperplanes and quantized into 1 bit depending on the sign of the projected values to get the weak hash bits $\mathbf{w}_1$. These hash bits and $\theta_w$ are sent to $U_2$. User $U_2$ performs the same weak hash generation process to get $\mathbf{w}_2$ for the overlapping GOFs in $V_{\text{upd}}$, and computes the Hamming distance $d_H(\mathbf{w}_1, \mathbf{w}_2)$ between $\mathbf{w}_1$ and $\mathbf{w}_2$ to perform the weak hash check.

D. Strong Hash Check

As was discussed in the previous section, it is desirable for the weak hash to have a small false negative rate. Indeed, only vectors that form angles larger than $\theta_w$ will not pass the MSE requirement. However, this nature also makes the weak hash vulnerable in detecting other possibilities that violate the MSE criteria, i.e., it can have a large false positive rate. For example, those vectors that form an angle less than $\theta_w$ but lie outside the circle, as shown in Fig. 5, will not be rejected by the weak hash check. Since the weak hash is generated by only a sub-sampled set of the pixels, it is also possible that some changes in certain areas cannot be detected. For example, if the only difference between $V_{\text{orig}}$ and $V_{\text{upd}}$ is that $V_{\text{upd}}$ has a logo across the upper left corner of the entire video and the size of the logo only accounts for a small fraction of the video frames, then the weak hash is likely to miss it. Therefore, another hash check that is able to detect such cases is needed to verify the potential matches returned by the weak hash check.

With these guidelines in mind, we design the strong hash to check on a finer granularity in the GOFs, i.e., on an MB tube basis. To form the strong hash of an MB tube, all the pixel values of each MB tube are formed into a vector which is projected onto $K_s$ random hyperplanes and quantized depending on its sign. Denote by $\boldsymbol{\mu}_1$ ($\boldsymbol{\mu}_2$) the vectors of the pixel values of the MB tube in $V_{\text{orig}}$ ($V_{\text{upd}}$), and by $\mathbf{s}_1$ ($\mathbf{s}_2$) the vectors of the quantized bits of the projections in $V_{\text{orig}}$ ($V_{\text{upd}}$).

The angle threshold $\theta_s$ for the strong hash can be computed in a similar fashion to that in the weak hash. Denote by $N_s = F q^2$ the number of pixels in an MB tube. Given the MSE threshold, $T_M$, the corresponding angle threshold for any MB tube $\boldsymbol{\mu}_1$ is given by $\theta_s = \arcsin \frac{\sqrt{N_s T_M}}{\|\boldsymbol{\mu}_1\|}$. To reduce the number of thresholds that need to be transmitted, we select $L + 1$ representative thresholds $\theta_s^l$, $l = 0, 1, \ldots, L$, which can be pre-designed. We say that $\theta_s$ is in class $C_l$ if $\theta_s^{l-1} < \theta_s \le \theta_s^l$, $l = 1, 2, \ldots, L$, and the representative threshold value for class $C_l$ is $\theta_s^l$. Only the class index $l$ is transmitted. When $\mathbf{s}_1$ belongs to class $C_l$, the cross-over probability $p$ between the entry bits in $\mathbf{s}_1$ and $\mathbf{s}_2$ should satisfy $p \le p_s^l = \frac{\theta_s^l}{\pi}$ if the underlying MB tubes satisfy the MSE constraint $T_M$. A small false negative rate is guaranteed because any $\boldsymbol{\mu}_1$ that gives a $\theta_s$ in any class $C_l$ will always be checked against the largest value $\theta_s^l$ in that class.

However, sending out the raw bits in the strong hash as is can be wasteful. We therefore apply a channel coding concept to reduce the hash rate needed, as elaborated below.
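The random-projection hash, the threshold angle, and the Hamming-distance test described in this section can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: all function names are hypothetical, the hyperplane normals are drawn as i.i.d. Gaussian vectors (one common way to obtain a uniformly random orientation), and a shared integer seed stands in for the pre-agreed common pseudo-randomness between the two users.

```python
import math
import random


def random_projection_hash(vec, num_bits, seed):
    """Sign-quantize projections of vec onto num_bits random hyperplanes."""
    rng = random.Random(seed)  # same seed on both sides => same hyperplanes
    bits = []
    for _ in range(num_bits):
        normal = [rng.gauss(0.0, 1.0) for _ in vec]
        projection = sum(n * x for n, x in zip(normal, vec))
        bits.append(1 if projection >= 0 else 0)
    return bits


def hamming_distance(h1, h2):
    """Number of positions in which two equal-length hashes differ."""
    return sum(a != b for a, b in zip(h1, h2))


def threshold_angle(norm_v1, num_samples, mse_threshold):
    """theta = arcsin(sqrt(N * T_M) / ||v1||), clipped to pi/2."""
    ratio = math.sqrt(num_samples * mse_threshold) / norm_v1
    return math.asin(min(1.0, ratio))


def passes_hash_check(h1, h2, theta):
    """Pass when the empirical flip rate d_H / K is below theta / pi."""
    return hamming_distance(h1, h2) / len(h1) < theta / math.pi


def estimate_euclidean(norm1, norm2, h1, h2):
    """Law-of-cosines distance estimate from the empirical flip rate."""
    p_hat = hamming_distance(h1, h2) / len(h1)
    squared = norm1 ** 2 + norm2 ** 2 - 2 * norm1 * norm2 * math.cos(math.pi * p_hat)
    return math.sqrt(max(0.0, squared))
```

As a rough sanity check, two unit vectors separated by 60 degrees should disagree on about one third of their hash bits, since the flipping probability is the angle divided by $\pi$. The same helpers apply to the strong hash by passing the MB tube vector and $N_s$ in place of the sampled GOF vector and $N_w$; the syndrome coding of the strong hash is not covered by this sketch.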

In particular, we model the correlation between $\mathbf{s}_1$ and $\mathbf{s}_2$ by a binary symmetric channel with some cross-over probability $p$ that is determined by the distortion between the underlying MB tubes $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$. Instead of $\mathbf{s}_1$ being directly sent from user $U_1$ to user $U_2$, only the syndrome bits of $\mathbf{s}_1$ are sent using a suitable binary channel code [17]. If $\mathbf{s}_1$ belongs to class $C_l$, we choose a channel code such that it is able to correct a flipping probability of $p_s^l = \frac{\theta_s^l}{\pi}$ and rely on the decoding success to perform our hypothesis test.

Here, we choose to use LDPC codes as channel codes [11], [12], and so the syndrome bits are just the parity bits. Let $P \in GF(2)^{K_s \times n_s}$ be the parity check matrix of the corresponding LDPC code for class $l$ with $n_s$ syndrome bits, i.e., rate $R(p_s^l) = 1 - \frac{n_s}{K_s}$, such that it is able to correct up to $K_s p_s^l$ bit errors. The encoded strong hash $\mathbf{s}_{1e}$ is then computed by $\mathbf{s}_{1e} = P^T \mathbf{s}_1$. The receiving user $U_2$ would then use an LDPC belief propagation (BP) decoder to decode the syndrome bits using the corresponding $\mathbf{s}_2$ as the side-information. Following Körner and Marton [18], we let $U_2$ first compute the encoded bits $\mathbf{s}_{2e}$ of $\mathbf{s}_2$ and its modulo-2 sum with $\mathbf{s}_{1e}$. Denote by $\mathbf{z} = \mathbf{s}_1 \oplus \mathbf{s}_2$ and by $\mathbf{z}_e = \mathbf{s}_{1e} \oplus \mathbf{s}_{2e}$, we have

$$\mathbf{z}_e = \mathbf{s}_{1e} \oplus \mathbf{s}_{2e} = P^T \mathbf{s}_1 \oplus P^T \mathbf{s}_2 = P^T (\mathbf{s}_1 \oplus \mathbf{s}_2) = P^T \mathbf{z}. \tag{4}$$

It is easy to see that $\mathbf{z}$ should be Bernoulli with some parameter $p$. Since $P^T$ is designed to correct up to a flipping probability $p_s^l$, user $U_2$ should be able to decode $\mathbf{z}$ from $\mathbf{z}_e$ if $p \le p_s^l$ and fail otherwise. Therefore, the hypothesis testing problem of whether the uncoded strong hash satisfies the Hamming distance check is recast as a decodability problem. It is also worth noting that the rate of the LDPC code will be close to the theoretical lower bound $H(p_s)$ when $K_s$ is large, where $H(\cdot)$ is the entropy function. It is also for this reason that we group several MB tubes into categories depending on their class $C_l$ so that the projected bit values can be concatenated to increase length. As a result of applying the syndrome encoding, large rate savings compared to directly sending the binary bits can be obtained. This is especially true when the desired video distortion is low and the corresponding threshold $p_s^l$ is small. However, LDPC BP decoding requires more computations, so we only apply the strong hash checks to GOF pairs that pass the weak hash check.

To verify if the content units satisfy the MSE threshold $T_M$, user $U_2$ also needs to check the MSE between $\boldsymbol{\mu}_1$ and $\boldsymbol{\mu}_2$ in addition to the LDPC decodability check. Using (2), we can obtain an estimate of their Euclidean distance as follows:

$$d_E(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) = \sqrt{\|\boldsymbol{\mu}_1\|^2 + \|\boldsymbol{\mu}_2\|^2 - 2 \|\boldsymbol{\mu}_1\| \, \|\boldsymbol{\mu}_2\| \cos\left(\pi \frac{d_H(\mathbf{s}_1, \mathbf{s}_2)}{K_s}\right)} \tag{5}$$

where $d_H(\mathbf{s}_1, \mathbf{s}_2)$ is the Hamming distance returned by the LDPC decoder (the $\ell_1$ norm of $\mathbf{z}$ in this case). However, $\|\boldsymbol{\mu}_1\|$ is unknown at user $U_2$. Suppose $\mathbf{s}_1$ comes from an MB tube that belongs to class $C_l$ and thus has a threshold angle $\theta_s^l$, we know from Fig. 5 and the classification criteria that

$$\underline{u}_1 = \frac{\sqrt{N_s T_M}}{\sin \theta_s^{l}} \le \|\boldsymbol{\mu}_1\| < \frac{\sqrt{N_s T_M}}{\sin \theta_s^{l-1}} = \bar{u}_1. \tag{6}$$

Combining the above equation with (5), we have

$$d_E(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) < \max(d_1, d_2) = d_{Em}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) \tag{7}$$

where

$$d_1 = \sqrt{\bar{u}_1^2 + \|\boldsymbol{\mu}_2\|^2 - 2 \bar{u}_1 \|\boldsymbol{\mu}_2\| \cos\left(\pi \frac{d_H(\mathbf{s}_1, \mathbf{s}_2)}{K_s}\right)};$$

$$d_2 = \sqrt{\underline{u}_1^2 + \|\boldsymbol{\mu}_2\|^2 - 2 \underline{u}_1 \|\boldsymbol{\mu}_2\| \cos\left(\pi \frac{d_H(\mathbf{s}_1, \mathbf{s}_2)}{K_s}\right)}.$$

Therefore, $d_{Em}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) \le \sqrt{N_s T_M}$ provides a sufficient condition for the MSE requirement $d_E(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) \le \sqrt{N_s T_M}$. Since $\theta_s^l$, $N_s$, $T_M$ and $K_s$ are given, and $\|\boldsymbol{\mu}_2\|^2$ and $d_H(\mathbf{s}_1, \mathbf{s}_2)$ can be computed at $U_2$, the MSE check can be easily performed.

In summary, the strong hash check works as follows. For each pair of GOFs that pass the weak hash check, user $U_2$ will classify the MB tubes for the GOF in video $V_{\text{upd}}$ in the same way as is instructed by user $U_1$ and perform the corresponding LDPC decoding. If the decoding fails, $U_2$ will mark the MB tubes and request retransmission. Otherwise, the decoded Hamming distance is used to perform the MSE check by verifying $d_{Em}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2) \le \sqrt{N_s T_M}$.

Fig. 6. Bit-plane view of a block of 64 coefficients. Bit-planes are arranged in increasing order with 0 corresponding to the least significant bit.

V. Rate-Efficient Update

For GOFs that fail the weak hash check, user $U_2$ should inform user $U_1$ that these GOFs in video $V_{\text{orig}}$ are obsolete and should be deleted. User $U_2$ will also send both the new GOFs in video $V_{\text{upd}}$ that none of the GOFs in video $V_{\text{orig}}$ match and the MB tubes that fail the strong hash decoding process. Users can choose appropriate encoding schemes, e.g., H.264, according to the application requirements to encode the GOFs and MBs.

If an MB tube passes the strong hash check (and hence also passes the weak hash check) but fails the MSE check, it is still likely that the corresponding MB tubes are highly correlated since the decoded Hamming distance passed the angular threshold. Therefore, the amount of rate needed to update the MBs can be reduced if user $U_2$ is able to exploit this correlation. We will apply a DVC technique to do so, such that client $U_2$ transmits just enough information of $\boldsymbol{\mu}_2$ to user $U_1$ to perform incremental updates to $\boldsymbol{\mu}_2$ using $\boldsymbol{\mu}_1$ as side-information. The maximum MSE $d_{Em}(\boldsymbol{\mu}_1, \boldsymbol{\mu}_2)$ described in (7) is used to estimate the rate needed to update the corresponding
sin sl sin sl1 MB tube to a quality of no less than TM .
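The MSE check built on (5) to (7) can be illustrated with a short sketch. It is only an illustration under stated assumptions, not the authors' implementation: the function and argument names are hypothetical, the norm of the original MB tube is assumed to lie between sqrt(Ns TM)/sin(theta_class) and sqrt(Ns TM)/sin(theta_prev) for the two threshold angles bounding its class, and a lower threshold angle of 0 is treated as leaving that norm unbounded above.

```python
import math


def distance_estimate(norm1, norm2, d_hamming, k_s):
    """Squared Euclidean distance estimated from a Hamming distance, as in (5)."""
    angle = math.pi * d_hamming / k_s  # angle implied by the fraction of differing bits
    return norm1 ** 2 + norm2 ** 2 - 2.0 * norm1 * norm2 * math.cos(angle)


def norm_bounds(n_s, t_m, theta_class, theta_prev):
    """Range of the unknown norm for a tube whose class lies between two threshold angles."""
    radius = math.sqrt(n_s * t_m)
    u_low = radius / math.sin(theta_class)
    # Assumption: a previous threshold of 0 leaves the norm unbounded above.
    u_high = radius / math.sin(theta_prev) if theta_prev > 0 else math.inf
    return u_low, u_high


def mse_upper_bound(norm2, d_hamming, k_s, n_s, t_m, theta_class, theta_prev):
    """Larger of the estimate (5) evaluated at the two ends of the norm range, as in (7)."""
    u_low, u_high = norm_bounds(n_s, t_m, theta_class, theta_prev)
    d1 = distance_estimate(u_low, norm2, d_hamming, k_s)
    if math.isinf(u_high):
        d2 = math.inf  # avoid inf - inf when the upper end is unbounded
    else:
        d2 = distance_estimate(u_high, norm2, d_hamming, k_s)
    return max(d1, d2)


def passes_mse_check(norm2, d_hamming, k_s, n_s, t_m, theta_class, theta_prev):
    """Sufficient condition: the upper bound itself does not exceed Ns * TM."""
    bound = mse_upper_bound(norm2, d_hamming, k_s, n_s, t_m, theta_class, theta_prev)
    return bound <= n_s * t_m
```

Because the estimate is a convex quadratic in the unknown norm, checking the two endpoints is enough to bound it over the whole range, which is why the receiver never needs the exact norm of the original tube.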

Fig. 7. Partitioning of the quantization lattice into levels. $X_i$ is the source, $\hat{X}_i$ is the quantized codeword, and $Y_i$ is the side-information. The number of levels in the partition tree depends on the effective correlation between $\hat{X}_i$ and $Y_i$.

The DVC update procedure is based on the PRISM DVC scheme introduced by Puri et al. [19]. A 2-D DCT is applied to the first MB in the MB tube $\nu_2$. The transformed coefficients are zig-zag scanned into a vector, $X_i$, $i = 1, 2, \ldots, q^2$. A scalar quantizer is then applied to the coefficients to obtain $\hat{X}_i$, with a quantization step size chosen based on the desired quality $T_M$ (or $T_P$). Denote also by $Y_i$ the zig-zag scanned DCT coefficients of the first MB in $\nu_1$. Assume additive noise between $X_i$ and $Y_i$, i.e., $Y_i = X_i + N_i$ where $N_i$ represents the correlation noise. The correlation between $X_i$ and $Y_i$ can then be interpreted in terms of the number of most significant bit-planes of the quantized version $\hat{X}_i$ that can be inferred from the side-information $Y_i$. This is shown in Fig. 6, where white colored bits represent the predictable bits while the remaining least significant bit-planes shown in black and grey color are innovations which need to be encoded and transmitted. Starting from the least significant bit-plane of $\hat{X}_i$, each successive bit-plane identifies a partition of codewords containing $\hat{X}_i$. The number of least significant bits that need to be transmitted for the update is given by the minimum tree-depth for which the distance between successive codewords in the partition is greater than twice the effective noise magnitude between $\hat{X}_i$ and $Y_i$. The effective noise magnitude [19] is the sum of the variance characteristics of $N_i$, which is obtained by offline training, and the quantization step size. This would enable correct decoding of $\hat{X}_i$ at the decoder using $Y_i$. Fig. 7 demonstrates an example where the effective noise magnitude is 2 and the necessary tree depth is 2.

Clearly, one needs to know the correlation between each pair of $X_i$ and $Y_i$ to infer the number of syndrome bits that corresponds to the tree depth mentioned above. In this paper, we use the upper bound MSE $d_{Em}(\nu_1, \nu_2)$ in (7) between the pair of MB tubes $\nu_1$ and $\nu_2$ to infer such correlation. In particular, we classify the correlation into 16 classes, which are separated by a set of 15 thresholds $T_j$ ($j = 1, 2, \ldots, 15$). Class $j$ is chosen when $T_{j-1} \le d_{Em}(\nu_1, \nu_2) < T_j$. Each class $j$ is associated with a block correlation noise whose variance statistics were determined based on $d_{Em}(\nu_1, \nu_2)$ using offline training [19]. However, the offline training procedure does not guarantee precise determination of the correlation noise between the blocks. We treat this uncertainty by jointly encoding the most significant bit-planes of individual coefficients that need to be communicated to the decoder with a coset channel code. These bit-planes correspond to the grey color in Fig. 6. The remaining least significant bit-planes, shown in black color, are encoded using a suitable entropy code. The details of these codes can be found in [19].

It is worth noting that only the I-frames of the MB tubes are updated using the above approach. The subsequent P-frames are still encoded using standard video encoding schemes that exploit temporal correlation. This is because, in most cases, temporal correlation is stronger than spatial correlation and rate savings are most evident for I-frames that would have to be intra-coded otherwise.

VI. Video Assembly

User $U_2$ will send all the necessary update information and the assembly instructions to $U_1$ to assemble a version of the updated video, $\hat{V}_{\text{upd}}$. The assembly instructions include which original frames and MB tubes should be kept and deleted, what the new content is and where it is to be added, which encoding and decoding scheme is to be used, and any necessary rearrangement of the original GOFs and frames. The instructions usually cost negligible overhead compared to the video content itself and can be included as headers of the re-transmitted data. Since some of the original reference frames in $V_{\text{orig}}$ may no longer exist after the synchronization process, user $U_1$ might need to re-encode some of the frames. To reduce computational overhead, it is possible to design a procedure that uses as much compressed information as possible from the original video that $U_1$ has, and only re-encodes frames that contain modified MBs and GOFs that have frames deleted.

VII. Experimental Results

In this section, we evaluate VSYNC and compare it with direct transmission using H.264 and synchronization using rsync in terms of the rate-distortion (RD) performance. The following parameters are chosen: GOF size $F = 15$ frames, MB size $q \times q = 16 \times 16$, number of weak hash projections per GOF $K_w = 256$, number of strong hash projections per MB tube $K_s = 256$. For simplicity, $L = 2$ classes are chosen for the quantization of the strong hash thresholds. This corresponds to $L + 1 = 3$ thresholds $0 = \theta_s^0 < \theta_s^1 < \theta_s^2$. Since $\theta_s = \arcsin\left(\sqrt{N_s T_M} / \|\nu_1\|\right)$, we choose the average $\ell_2$ norm and the smallest $\ell_2$ norm of the MB tubes $\nu_1$ in each GOF to compute the corresponding thresholds $\theta_s^1$ and $\theta_s^2$ respectively. Since the quantization of thresholds represents only a small overhead to the strong hash, a finer quantization will reduce overall hash rate. The actual weak (strong) thresholds and the rate of the LDPC codes are computed directly from the desired PSNRs as discussed in the previous sections.

For simplicity, we assume $V_{\text{orig}}$ is a compressed version (to some PSNR) of some original source video $V_{\text{src}}$, and $V_{\text{upd}}$ is an edited version of $V_{\text{src}}$ and has the same quality as $V_{\text{src}}$. The goal is to synchronize $V_{\text{orig}}$ to $V_{\text{upd}}$ to get $\hat{V}_{\text{upd}}$ such that the relative PSNR of $\hat{V}_{\text{upd}}$ to $V_{\text{upd}}$ is the same as that of $V_{\text{orig}}$ to $V_{\text{src}}$. We test several edit cases and show the performance in terms of the transmission rate (kb/s) needed to realize the synchronization for different relative PSNRs of $V_{\text{orig}}$ to $V_{\text{src}}$ (or $\hat{V}_{\text{upd}}$ to $V_{\text{upd}}$).
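As a rough illustration of how the strong hash thresholds in this setup could be derived from a desired PSNR, the sketch below first converts a target PSNR into an MSE threshold and then evaluates the arcsin relation for the average and the smallest MB tube norm in a GOF. It is a sketch under assumptions rather than the authors' code: the function names are hypothetical, the PSNR conversion assumes 8-bit samples with peak value 255, Ns is taken to be the number of pixels in an MB tube (16 x 16 x 15 = 3840 for the parameters above), norms are assumed positive, and the arcsin argument is clamped to 1 for tubes with very small norm.

```python
import math


def mse_from_psnr(psnr_db, peak=255.0):
    """MSE corresponding to a target PSNR in dB (8-bit peak is an assumption)."""
    return peak ** 2 / (10.0 ** (psnr_db / 10.0))


def threshold_angle(norm, n_s, t_m):
    """Angle theta = arcsin(sqrt(Ns * TM) / norm), with the ratio clamped to 1."""
    ratio = math.sqrt(n_s * t_m) / norm
    return math.asin(min(1.0, ratio))


def strong_hash_thresholds(tube_norms, n_s, t_m):
    """Three thresholds for L = 2 classes: 0, then angles from the average and smallest norm."""
    average_norm = sum(tube_norms) / len(tube_norms)
    smallest_norm = min(tube_norms)
    theta_1 = threshold_angle(average_norm, n_s, t_m)   # larger norm gives the smaller angle
    theta_2 = threshold_angle(smallest_norm, n_s, t_m)  # smaller norm gives the larger angle
    return 0.0, theta_1, theta_2


if __name__ == "__main__":
    t_m = mse_from_psnr(40.0)
    print(strong_hash_thresholds([1000.0, 2000.0, 3000.0], 16 * 16 * 15, t_m))
```

Using the average norm for the first threshold and the smallest norm for the second keeps the thresholds ordered, since a smaller norm always maps to a larger angle.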

Fig. 8. RD performance of synchronization using VSYNC, rsync, and H.264 for video Akiyo and News.

Fig. 9. Edit process for video Motorcycle.

Fig. 10. RD performance curves for VSYNC, rsync, and H.264 for video Motorcycle.

Fig. 11. Tampering of video Foreman by adding a text banner to the upper-left corner of the entire video.

Fig. 12. Detected tampered regions (black) using VSYNC.

A. Addition of New Frames

The test videos used are Akiyo (176 × 144, 25 f/s, 12 s, 300 frames) and News (176 × 144, 25 f/s, 12 s, 300 frames). Video $V_{\text{orig}}$ is a compressed version of Akiyo, and video $V_{\text{upd}}$ is obtained by adding the first 25 frames of video News to the beginning of video Akiyo. It can be seen from Fig. 8 that VSYNC is able to save a considerable amount of transmission by sending only the newly added frames, thereby utilizing all the original overlapping content, while rsync again fails to capture the content-level structure of the video file and thus is as expensive as re-transmitting the entire video using H.264.

B. Re-Arrangement and Deletion

In this case, the Motorcycle (VGA) video [20] (640 × 480, 25 f/s, 12 s) is used. This is a 12-s clip (from the first 12 s) extracted and cropped from YouTube. $V_{\text{orig}}$ is a compressed version of Motorcycle using H.264 with group of picture size 15. $V_{\text{upd}}$ is an edited version of Motorcycle using the edit process illustrated in Fig. 9: the original last 4 s (100 frames) are rearranged to the beginning and the original sixth to eighth seconds (75 frames) are deleted. The user wants to update the video to the new version at the same resolution as the original video $V_{\text{orig}}$.

Fig. 10 demonstrates the performance comparisons of VSYNC, rsync, and H.264 for each user. The y-axis represents the desired quality of $\hat{V}_{\text{upd}}$ in PSNR (dB) compared to $V_{\text{upd}}$, and the x-axis shows the rate (kb/s) needed to realize the synchronization. In this case, the edit only involves rearranging and deleting frames, i.e., no new content is added. A genie-aided system should be able to inform the users to only rearrange the content. Rsync is unable to exploit the strong similarity between videos, and thus is as expensive as re-transmitting the entire video using H.264. This is wasteful since no new content has been added to the original video. VSYNC automatically detects the changes and the only overhead is the transmitted hash. All the overlapping content was used and the deletion of frames was successfully detected, which avoided unnecessary transmission.

It is worth noting that the hash rate can be lower for higher PSNR requirements. This is because higher PSNR results in a lower flipping probability which gives a smaller rate for the LDPC code. The overall hash overhead is just a small fraction of the actual video rate.

C. Localized Tampering Detection and Correction

We test the ability of VSYNC to capture and correct video tampering. In this case, we let $V_{\text{upd}}$ be the original Foreman (176 × 144, 25 f/s, 12 s, 300 frames) video, and $V_{\text{orig}}$ be the compressed and tampered version of it. The tampering is to add a small text banner of size 52 × 32 to the upper left corner of the entire video, as shown in Fig. 11. The goal is to detect and correct such tampering using minimal data transfer.

In this case, the weak hash was first able to align the matching GOFs correctly, and the strong hash successfully detected the tampering of the sub-frame regions, as shown in Fig. 12. The video is divided into 16 × 16 MB tubes as depicted by the white lines in the figure, and the black regions represent the regions that fail the strong hash check.

The RD curves that correspond to correcting the tampered regions using VSYNC, rsync, and H.264 are demonstrated in Fig. 13. It can be seen that rsync results in re-transmission of the entire new video, as expected, due to the drastic change in the underlying encoded bit-stream. On the contrary, VSYNC

transmits only the data necessary to correct the tampered regions while utilizing relevant content from $V_{\text{orig}}$.

Fig. 13. RD performance of tampering detection and correction using VSYNC, rsync, and H.264 for video Foreman.

To see the benefit of having a layered weak/strong hash architecture, we also compare the system's performance to when only weak hash or strong hash is used in the Foreman case. Table II shows the computation time, precision and recall for the following cases: 1) only weak hash is used; 2) only strong hash is used; and 3) both weak hash and strong hash are used. The computation time refers to the CPU time needed to perform the entire VSYNC protocol during the synchronization process, normalized with respect to when only weak hash is used. Precision is the percentage of MB tube matches found that are correct, while recall is the percentage of total actual MB tube matches that are found. It can be seen that while the weak hash has a high recall of 100%, its precision is very low. Using only the weak hash gives many false positives due to mis-alignment of GOFs and it is unable to detect localized edits. The strong hash is able to both align the GOFs correctly and detect the tampering, but it requires much more computational time due to the LDPC decoding complexity. The hierarchical approach, however, achieves a balanced tradeoff between accuracy and complexity.

TABLE II
Comparison of Performance in Computational Complexity and Accuracy Among Approaches Using 1) Weak Hash Only; 2) Strong Hash Only; and 3) Weak + Strong Hash

Hash          Time (ratio)   Precision (%)   Recall (%)
Weak          1              13.9            100
Strong        105            100             100
Weak+strong   70             100             100

VIII. Conclusion and Future Work

In this paper, VSYNC, a novel video file synchronization protocol, was proposed to minimize data transfer in synchronizing video files across a bidirectional communications link. A hierarchical hashing scheme was designed to efficiently check possible matches across videos and only transmit the modified content. In addition to the novelty of efficient video content synchronization, the key contribution of VSYNC with regard to prior work was our distance preserving hash, i.e., we can quantify the distortion between video frames using the Hamming distance between hashes and thus can determine a suitable threshold given a target distortion acceptance. The hash check process also aided the application of distributed coding schemes to reduce hash rates.

The work presented can be extended in many ways. We are currently working on including chrominance information in the protocol, to enable VSYNC to handle color videos. The chrominance is also helpful in the hash verification process by utilizing color features. We are also working on generalizing the framework into a large-scale video synchronization system by extending the hashing structure. In application scenarios where one wants to update a database of video files with a new incoming video, the first job would be to find if similar video files already exist. We can construct visual words as another layer of hash to reduce the search space, and only perform the VSYNC algorithm described in the paper on a much smaller set of videos. Other possible future work includes synchronizing videos taken by multiple cameras with different capturing angles, and securing the synchronization process. A fully designed system that automatically synchronizes videos at remote ends using secured network protocols would be highly desirable in a wide range of applications.

References

[1] Cisco Visual Networking Index: Forecast and Methodology, 2008–2013 [Online]. Available: http://www.cisco.com/en/US/solutions/collateral/ns341/ns525/ns537/ns705/ns827/white_paper_c11-481360_ns827_Networking_Solutions_White_Paper.html
[2] A. Tridgell and P. Mackerras. (1998, Nov.). The rsync Algorithm [Online]. Available: http://rsync.samba.org
[3] C. Phipps. (2005, Mar.). ZSYNC: Optimised rsync over HTTP [Online]. Available: http://zsync.moria.org.uk
[4] Available: http://www.microsoft.com/windowsxp/using/windowsmediaplayer/getstarted/sync.mspx
[5] H. Zhang, C. Yeo, and K. Ramchandran, "VSYNC: A novel video file synchronization protocol," in Proc. 16th ACM Int. Conf. Multimedia, 2008, pp. 757–760.
[6] H. Zhang, C. Yeo, and K. Ramchandran, "Rate efficient remote video file synchronization," in Proc. IEEE Int. Conf. Acous., Speech Signal Process., Apr. 2009, pp. 1845–1848.
[7] H. Zhang, C. Yeo, and K. Ramchandran, "Remote video file synchronization for heterogeneous mobile clients," Proc. SPIE Conf. Series, vol. 7443, p. 74430F, Sep. 2009.
[8] J. Fridrich and M. Goljan, "Robust hash functions for digital watermarking," in Proc. Int. Conf. Inform. Technol.: Coding Comput., 2000, pp. 178–183.
[9] S. Roy and Q. Sun, "Robust hash for detecting and localizing image tampering," in Proc. IEEE Int. Conf. Image Process., vol. 6. Sep.–Oct. 2007, pp. 117–120.
[10] C. Lin and S. Chang, "A robust image authentication method distinguishing JPEG compression from malicious manipulation," IEEE Trans. Circuits Syst. Video Technol., vol. 11, no. 2, pp. 153–168, Feb. 2001.
[11] Y. Lin, D. Varodayan, and B. Girod, "Image authentication and tampering localization using distributed source coding," in Proc. IEEE 9th Workshop Multimedia Signal Process., Oct. 2007, pp. 393–396.
[12] S. Draper, A. Khisti, E. Martinian, A. Vetro, and J. Yedidia, "Using distributed source coding to secure fingerprint biometrics," in Proc. IEEE Int. Conf. Acous., Speech Signal Process., vol. 2. Jan. 2007, pp. 129–132.
[13] R. Gallager, Low Density Parity Check Codes, Monograph. Cambridge, MA: MIT Press, 1963.
[14] T. Richardson and R. Urbanke, "The capacity of low-density parity-check codes under message-passing decoding," IEEE Trans. Inform. Theory, vol. 47, no. 2, pp. 599–618, Feb. 2001.
[15] C. Yeo, P. Ahammad, H. Zhang, and K. Ramchandran, "Rate-constrained distributed distance testing and its applications," in Proc. ICASSP, Apr. 2009, pp. 809–812.
[16] C. Yeo, P. Ahammad, and K. Ramchandran, "Rate-efficient visual correspondences using random projections," in Proc. IEEE Int. Conf. Image Process., Jan. 2008, pp. 217–220.

[17] D. Slepian and J. Wolf, "Noiseless coding of correlated information sources," IEEE Trans. Inform. Theory, vol. 19, no. 4, pp. 471–480, Jul. 1973.
[18] J. Körner and K. Marton, "How to encode the modulo-two sum of binary sources (corresp.)," IEEE Trans. Inform. Theory, vol. 25, no. 2, pp. 219–221, Mar. 1979.
[19] R. Puri, A. Majumdar, and K. Ramchandran, "PRISM: A video coding paradigm with motion estimation at the decoder," IEEE Trans. Image Process., vol. 16, no. 10, pp. 2436–2448, Oct. 2007.
[20] Available: http://www.youtube.com/watch?v=ESVLfrKr_Zo&hd=1&feature=hd

Hao Zhang (S'06–M'09) received the B.E. degree in electronic engineering from Tsinghua University, Beijing, China, in 2006, and the M.A. degree in statistics and the M.S. degree in electrical engineering and computer sciences from the University of California (UC) at Berkeley, Berkeley, in 2009. He is currently pursuing the Ph.D. degree in electrical engineering and computer sciences at UC Berkeley.
From 2005 to 2006, he was a Research Assistant with the Center for Intelligent Image and Document Processing, Tsinghua University. He was an Engineering Intern with Cisco Systems, San Jose, CA, in 2007. Since 2007, he has been a Graduate Student Researcher with the Berkeley Audio Visual Signal Processing and Communication Systems Laboratory, UC Berkeley. His current research interests include image and video processing and communications, distributed source coding, computer vision, and machine learning.
Mr. Zhang was a recipient of the U.S. Vodafone Foundation Fellowship from 2006 to 2008. He received a Best Student Paper Award at ACM MM 2008, a Best Paper Finalist in ICASSP 2009, and a Best Paper Finalist in ICIP 2010.

Chuohao Yeo received the B.S. degree in electrical science and engineering and the M.E. degree in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, in 2002, and the Ph.D. degree in electrical engineering from the Department of Electrical Engineering and Computer Sciences, University of California at Berkeley, Berkeley.
From 2001 to 2002, he was a Research Assistant with the Research Laboratory of Electronics, MIT. He was a Research Engineer with the Institute for Infocomm Research, Singapore, in 2004, and was an Engineering Intern with Omnivision Technologies, Sunnyvale, CA, in 2005. His current research interests include image and video processing and communications, distributed source coding, computer vision, and machine learning.
Dr. Yeo was a recipient of the Singapore Government Public Service Commission Overseas Merit Scholarship from 1998 to 2002. Since 2004, he has been receiving Singapore's Agency for Science, Technology, and Research Overseas Graduate Scholarship. He received a Best Student Paper Award at SPIE VCIP in 2007.

Kannan Ramchandran (S'92–M'93–SM'98–F'05) received the Ph.D. degree in electrical engineering from Columbia University, New York, NY, in 1993.
Since 1999, he has been a Professor with the Department of Electrical Engineering and Computer Science, University of California (UC) at Berkeley, Berkeley. From 1993 to 1999, he was a Faculty Member with the Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign (UIUC), Urbana. Prior to that, he was with AT&T Bell Laboratories, Murray Hill, NJ, from 1984 to 1990. His current research interests include distributed signal processing and coding for networks, robust and scalable video delivery over wireless and peer-to-peer networks, robust distributed storage, multi-user information theory, media and information-theoretic security, and multiscale statistical image processing and modeling.
Dr. Ramchandran received the Eli Jury Award in 1993 from Columbia University for his doctoral thesis, the National Science Foundation CAREER Award in 1997, the Office of Naval Research and Army Research Office Young Investigator Awards in 1996 and 1997, respectively, the Henry Magnusky Scholar Award at UIUC, and the Okawa Foundation Prize from the Department of Electrical Engineering and Computer Science at UC Berkeley in 2001. He was the co-recipient of the two Senior Best Paper Awards from the IEEE Signal Processing Society in 1993 and 1997. He has been a co-author on several Best Paper and Best Student Paper Awards in leading conferences in his field including the IEEE Packet Video Workshop in 2000, the IEEE/ACM Information Processing in Sensor Networks in 2005, the IEEE MMSP in 2007 (finalist), the ACM Multimedia in 2008, the IEEE International Conference on Image Processing in 2008, and the IEEE ICASSP in 2009 (finalist). He was also the recipient of the Outstanding Teaching Award from the Department of Electrical Engineering and Computer Science at UC Berkeley in 2009. He serves on numerous technical program committees for premier conferences in image, video, and signal processing, communications, and information theory.
