
Audio Watermarking Techniques

Hyoung Joong Kim Department of Control and Instrumentation Engineering Kangwon National University Chunchon 200-701, Korea

This paper surveys audio watermarking schemes. The state of the art of current watermarking schemes and their implementation techniques are briefly summarized. The schemes are classified into five categories: quantization, spread-spectrum, two-set, replica, and self-marking schemes. The advantages and disadvantages of each scheme are discussed, and synchronization schemes are also surveyed.
[Figure 1 diagram. Labels: Blind Watermarking, Non-Blind Watermarking; Quantization, s[k] = Q(x[k] + d); Spread-Spectrum, s[k] = x[k] + w[k]; Two-Set, s[k] = x[k] + d; Replica, s[k] = x[k] + x[k-d].]

Figure 1: Typical audio watermarking schemes.

Non-blind watermarking schemes are theoretically interesting but of limited practical use, since they need the unwatermarked audio for watermark detection and therefore require double the storage capacity and double the communication bandwidth. Of course, non-blind schemes may be useful as a copyright verification mechanism in a copyright dispute (and may even be necessary; see (Craver et al. 1998) for inversion attacks). Blind watermarking schemes, on the other hand, can detect and extract watermarks without the unwatermarked audio, and thus require only half the storage capacity and half the bandwidth of non-blind schemes. Hence, only blind audio watermarking schemes are considered in this chapter. Needless to say, blind watermarking methods need self-detection mechanisms for detecting watermarks without the unwatermarked audio.

Audio watermarks are special signals embedded into digital audio. These signals are extracted by detection mechanisms and decoded. Audio watermarking schemes rely on the imperfection of the human auditory system. However, the human ear is much more sensitive than the other sensory organs, so good audio watermarking schemes are difficult to design (Kim et al. 2003). Even though current watermarking techniques are far from perfect, audio watermarking schemes have been applied widely during the last decade, and they have become highly sophisticated in terms of robustness and imperceptibility (Bender et al. 1996) (Cox et al. 2002) (Cox and Miller 2002). Robustness and imperceptibility are important requirements of watermarking, yet they conflict with each other.

This paper presents five basic audio watermarking schemes (see Figure 1). The first is quantization-based watermarking, which quantizes the sample values to create valid and invalid sample values. The second is the spread-spectrum method, based on the similarity between the watermarked audio and a pseudo-random sequence. The third is the two-set method, based on differences between two or more sets, which includes the patchwork scheme. The fourth is the replica method, which uses a close copy of the original audio and includes the replica modulation scheme. The last is the self-marking scheme. Of course, many more schemes and variants are available. For example, time-base modulation (Foote and Adcock 2002) is theoretically interesting, but it is a non-blind watermarking scheme. An audio watermarking scheme that encodes compressed audio data (Nahrstedt and Qiao 1998) does not embed a real watermarking signal into the raw audio. Furthermore, no psychoacoustic model is available in the compressed domain to adjust the watermark to ensure inaudibility.

Synchronization is important for detecting watermarks, especially when the audio has been attacked. Most audio watermarking schemes are position-based, i.e., watermarks are embedded into specific positions and detected from those positions. Thus, a shift in positions caused by an attack makes such detection schemes fail. The main purpose of synchronization schemes is to find the shifted positions. Several synchronization schemes are surveyed in this article. In audio watermarking, time-scaling and pitch-scaling attacks are among the most difficult to manage. A brief idea for handling these attacks, proposed by (Tachibana et al. 2001), is summarized.

Quantization Method

A scalar quantization scheme quantizes a sample value $x$ and assigns a new value to the sample based on the quantized sample value. In other words, the watermarked sample value $y$ is represented as follows:

$$y = \begin{cases} q(x, D) + D/4 & \text{if } b = 1 \\ q(x, D) - D/4 & \text{otherwise} \end{cases} \qquad (1)$$

where $q(\cdot)$ is a quantization function and $D$ is the quantization step. The quantization function is given by $q(x, D) = [x/D] \cdot D$, where $[x]$ rounds $x$ to the nearest integer. The concept of the simplest quantization scheme in Equation (1) is illustrated in Figure 2. A sample value $x$ is quantized to $q(x, D)$, the black circle in the figure, which is called the anchor. If the watermarking bit $b$ is 1, the anchor is moved to the white circle; otherwise, it is moved to the cross, which stands for the watermarking bit 0. For example, let $D$ be 8 and $x$ be 81. Then $q(81, 8) = 80$. If $b = 1$, then $y = 82$; otherwise, $y = 78$. As shown in the figure, the distance between anchors is $D$.

[Figure 2 diagram. Labels: D, D/2, x, anchor q(x, D), quantized value to 0, quantized value to 1.]

Figure 2: A simple quantization scheme.

Detection is the inverse process of embedding. The detection process is summarized as follows:

$$b = \begin{cases} 1 & \text{if } 0 < y - q(x, D) < D/4 \\ 0 & \text{if } -D/4 < y - q(x, D) < 0 \end{cases}$$

This scheme is simple to implement. It is robust against noise attacks as long as the noise is below the margin $D/4$. In other words, if the additive noise is larger than $D/4$, the quantized value is perturbed so much that the detector misinterprets the watermarking bit. The robustness can be enhanced if dither modulation (Chen and Wornell 1999) is used. This scheme is formulated as follows:

$$y_m = q(x + d_m, D) - d_m,$$

where $m$ is an index and $d_m$ is the $m$-th dither vector. For example, let $d_1 = 2$, $d_2 = 0$, $x = 8$, and $D = 4$. Then $y_1 = 10$ and $y_2 = 8$. The detection procedure estimates the distances and detects the watermarking index as follows:

$$b = \begin{cases} m = 1 & \text{if } e(y_1, d_1) < e(y_1, d_2) \\ m = 2 & \text{if } e(y_2, d_2) < e(y_2, d_1) \end{cases} \qquad (2)$$

where $e(y_i, d_j) = |y_i - q(y_i + d_j, D) + d_j|$. From Equation (2), it is possible to detect the watermark index. In the above example, $e(y_1, d_1) = 0$ and $e(y_1, d_2) = 2$. Thus, it is clear that $y_1$ is close to $d_1$. Similarly, $y_2$ is close to $d_2$. This procedure can be extended to dither vectors.

3 Spread-Spectrum Method

The spread-spectrum watermarking scheme is an example of the correlation method, which embeds a pseudo-random sequence and detects the watermark by calculating the correlation between the pseudo-random noise sequence and the watermarked audio signal. The spread-spectrum scheme is the most popular scheme and has been studied well in the literature (Boney et al. 1996) (Cox et al. 1996) (Cvejic et al. 2001) (Kirovski and Malvar 2001) (Kim 2000) (Lee and Ho 2000) (Seok et al. 2002) (Swanson et al. 1998). This method is easy to implement but has some serious disadvantages: it requires time-consuming psycho-acoustic shaping to reduce audible noise, and it is susceptible to time-scale modification attacks. (Of course, the use of psycho-acoustic models is not limited to spread-spectrum techniques.) The basic idea of this scheme and its implementation techniques are described below.

Basic Idea

This scheme spreads a pseudo-random sequence across the audio signal. The wideband noise can be spread into either the time-domain signal or the transform-domain signal, no matter what transform is used. Frequently used transforms include the DCT (Discrete Cosine Transform), DFT (Discrete Fourier Transform), and DWT (Discrete Wavelet Transform). The binary watermark message $v \in \{0, 1\}$, or its equivalent bipolar variable $b \in \{-1, +1\}$, is modulated by a pseudo-random sequence $r(n)$ generated by means of a secret key. Then the modulated watermark $w(n) = b\,r(n)$ is scaled according to the required energy of the audio signal $s(n)$. The scaling factor $\alpha$ controls the trade-off between robustness and inaudibility of the watermark. The modulated watermark $w(n)$ is equal to either $r(n)$ or $-r(n)$, depending on whether $v = 1$ or $v = 0$. The modulated signal is then added to the original audio to produce the watermarked audio $x(n)$:

$$x(n) = s(n) + \alpha w(n).$$

[Figure 3 block diagram. Blocks and signals: audio signal s(n), power spectrum estimation, psycho-acoustic model, watermark shaping filter (optional part), pseudo-random sequence r(n), message b(n), scaling, watermarked audio x(n).]

Figure 3: A typical embedder of the spread-spectrum watermarking scheme.

The detection scheme uses linear correlation. Because the pseudo-random sequence $r(n)$ is known and can be regenerated by means of the secret key, watermarks are detected by using the correlation between $x(n)$ and $r(n)$:

$$c = \frac{1}{N} \sum_{i=1}^{N} x(i) r(i), \qquad (3)$$

where $N$ denotes the length of the signal. Equation (3) yields the correlation sum of two components, given in Equation (4).
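The embedding rule and the correlation of Equation (3) can be illustrated with a minimal Python sketch. It is not taken from any of the cited papers: the function names, the use of Python's `random` module as the keyed generator, the white-noise stand-in for the host audio, and the decision by the sign of $c$ are all assumptions made for illustration, and no psycho-acoustic shaping is applied.

```python
import random

def pn_sequence(key, n):
    """Bipolar pseudo-random sequence r(n) in {-1, +1}, regenerated from a key.
    Using random.Random as the keyed generator is an assumption of this sketch."""
    rng = random.Random(key)
    return [rng.choice((-1, 1)) for _ in range(n)]

def embed(s, bit, key, alpha):
    """x(n) = s(n) + alpha * b * r(n), with b = +1 for bit 1 and b = -1 for bit 0."""
    b = 1 if bit == 1 else -1
    r = pn_sequence(key, len(s))
    return [si + alpha * b * ri for si, ri in zip(s, r)]

def correlate(x, key):
    """Linear correlation c = (1/N) * sum of x(i) * r(i), as in Equation (3)."""
    r = pn_sequence(key, len(x))
    return sum(xi * ri for xi, ri in zip(x, r)) / len(x)

def detect(x, key, tau=0.0):
    """Decide bit 1 when the correlation exceeds the threshold tau (assumed 0 here)."""
    return 1 if correlate(x, key) > tau else 0

if __name__ == "__main__":
    host_rng = random.Random(1)
    s = [host_rng.gauss(0.0, 1.0) for _ in range(4096)]  # stand-in host audio
    for bit in (0, 1):
        x = embed(s, bit, key=42, alpha=0.5)
        print(bit, detect(x, key=42), correlate(x, key=42))
```

With a white-noise host, the host contribution to $c$ shrinks roughly as $1/\sqrt{N}$, so the sign of $c$ follows the embedded bit. Real audio is correlated, which is why detectors usually preprocess the signal first.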

Substituting $x(n) = s(n) + \alpha b\,r(n)$ into Equation (3) gives

$$c = \frac{1}{N} \sum_{i=1}^{N} s(i) r(i) + \frac{1}{N} \sum_{i=1}^{N} \alpha b\, r^2(i). \qquad (4)$$

The first term in Equation (4) is assumed to have a small magnitude. If the two signals $s(n)$ and $r(n)$ were independent, the first term would vanish. In practice, however, this is not the case. Thus, the watermarked audio is preprocessed as shown in Figure 4 in order to make this assumption valid. One possible solution is filtering out $s(n)$ from $x(n)$. Preprocessing methods include high-pass filtering (Hartung and Girod 1998) (Haitsma et al. 2000), linear predictive coding (Seok et al. 2002), and filtering by a whitening filter (Kim 2000). Such preprocessing allows the second term in Equation (4) to have a much larger magnitude and the first term to almost vanish. If the first term has a similar or larger magnitude than the second term, the detection result will be erroneous.

[Figure 4 block diagram. Blocks and signals: watermarked audio x(n), filter, pseudo-random sequence r(n), filtered pseudo-random sequence ~r(n), correlator, output c.]

Figure 4: A typical preprocessing block for the detector of the spread-spectrum watermarking scheme.

Based on a hypothesis test using the correlation value $c$ and a predefined threshold $\tau$, the detector outputs

$$m = \begin{cases} 1 & \text{if } c > \tau \\ 0 & \text{if } c \le \tau \end{cases}$$

The typical value of $\tau$ is 0. The detection threshold has a direct effect on both the false positive and false negative probabilities. A false positive is a type of error in which the detector incorrectly determines that a watermark is present in unwatermarked audio. A false negative, on the other hand, is a type of error in which the detector fails to detect a watermark in watermarked audio.

Pseudo-Random Sequence

A pseudo-random sequence has statistical properties similar to those of a truly random signal, but it can be exactly regenerated with knowledge of privileged information (see Section 2.1). A good pseudo-random sequence has a good correlation property such that any two different sequences are almost mutually orthogonal. Thus, the cross-correlation value between them is very low, while the auto-correlation value is moderately large. The most popular pseudo-random sequence is the maximum length sequence (also known as the M-sequence). This is a binary sequence $r(n) \in \{0, 1\}$ of length $N = 2^m - 1$, where $m$ is the size of the linear feedback shift register. This sequence has very good auto-correlation and cross-correlation properties. If we map the binary sequence $r(n) \in \{0, 1\}$ into the bipolar sequence $r(n) \in \{-1, +1\}$, the auto-correlation of the M-sequence, with the index $i - k$ taken modulo $N$, is given as follows:

$$\frac{1}{N} \sum_{i=0}^{N-1} r(i) r(i - k) = \begin{cases} 1 & \text{if } k = 0 \\ -1/N & \text{otherwise} \end{cases} \qquad (5)$$
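The two-valued auto-correlation in Equation (5) can be checked numerically. The sketch below generates one period of an M-sequence with a linear feedback shift register of size $m = 4$ and evaluates the periodic auto-correlation. The function names are hypothetical, and the recurrence $a[n] = a[n-1] \oplus a[n-4]$ is one valid choice of feedback taps picked for illustration, not one prescribed by the text.

```python
def m_sequence(taps, m):
    """One period (2**m - 1 bits) of a binary LFSR sequence.

    taps lists the delays k in the recurrence a[n] = XOR of a[n - k].
    The taps must correspond to a primitive polynomial for the period
    to be maximal; (1, 4) with m = 4 is such a choice.
    """
    seq = [1] + [0] * (m - 1)  # any non-zero initial register state works
    while len(seq) < 2 ** m - 1:
        bit = 0
        for k in taps:
            bit ^= seq[-k]
        seq.append(bit)
    return seq

def to_bipolar(seq):
    """Map {0, 1} to {-1, +1}."""
    return [2 * b - 1 for b in seq]

def periodic_autocorr(r, k):
    """(1/N) * sum of r(i) * r(i - k), with the index taken modulo N."""
    n = len(r)
    return sum(r[i] * r[(i - k) % n] for i in range(n)) / n

if __name__ == "__main__":
    r = to_bipolar(m_sequence((1, 4), 4))
    for k in range(len(r)):
        print(k, periodic_autocorr(r, k))
```

For this length-15 sequence, the value at $k = 0$ is 1 and every other shift gives $-1/15$, which matches Equation (5).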

The M-sequences have two disadvantages. First, the length of an M-sequence, which is called the chip rate, is strictly limited to $2^m - 1$. Thus, it is impossible to get, for example, nine-chip sequences. The typical length of pseudo-random sequences is 1,023 (Cvejic et al. 2001) or 2,047. There is always a trade-off between the length of the pseudo-random sequence and robustness. However, very short sequences, such as length 7, are also used (Liu et al. 2002). Second, the number of different M-sequences is also limited once the size $m$ is determined. It has been shown that the M-sequence is not secure in the cryptographic sense. Thus, not all pseudo-random sequences in use are M-sequences. Sometimes, a non-binary and consequently real-valued pseudo-random sequence $r(n) \in \mathbb{R}$ with Gaussian distribution (Cox et al. 1996) is used. A non-binary chaotic sequence (Bassia et al. 2001) is also used. As long as the sequences are non-binary, their correlation characteristics are very good. However, since we have to use integer sequences (obtained by processing $r(n)$) due to finite precision, the correlation properties become less promising.
Watermark Shaping

A pseudo-random sequence or noise carelessly added to an audio signal can cause unpleasant audible sound, whatever watermarking scheme is used. Simply reducing the strength $\alpha$ of the pseudo-random sequence cannot be the final solution: because human ears are very sensitive, especially when the sound energy is very low, even a very small noise with a small value of $\alpha$ can be heard. Moreover, a small $\alpha$ makes the spread-spectrum scheme less robust. One solution to ensure inaudibility is watermark shaping based on the psycho-acoustic model (Arnold and Schilz 2002) (Bassia et al. 2001) (Boney et al. 1996) (Cvejic et al. 2001) (Cvejic and Seppänen 2002). Interestingly enough, watermark shaping can also enhance robustness, since the strength can be increased sufficiently as long as the noise stays below the margin.

Psycho-acoustic models for audio compression exploit frequency and temporal masking effects to ensure inaudibility by shaping the quantization noise according to the masking threshold. The psycho-acoustic model depicts the human auditory system as a frequency analyzer with a set of 25 bandpass filters (also known as critical bands). The intensity of a single sound, expressed in decibels [dB], that is required for it to be heard in the absence of another sound is known as the quiet curve (Cvejic et al. 2001) or the threshold of audibility (Rossing et al. 2002). Figure 5 shows the quiet curve. In this case, the threshold in quiet is equal to the so-called minimum masking threshold. However, masking effects can raise the minimum masking threshold. A sound lying in the frequency or temporal neighborhood of another sound affects the characteristics of the neighboring sound; this phenomenon is known as masking. The sound that does the masking is called the masker, and the sound that is masked is called the maskee. The psycho-acoustic model analyzes the input signal $s(n)$ in order to calculate the minimum masking threshold $T$. Figure 6 shows inaudible and audible watermark signals. The audible watermark signal can be transformed into an inaudible signal by applying watermark shaping based on the psycho-acoustic model.

[Figure 5 plot: sound pressure level (dB) versus frequency (Hz, 20 to 20,000), showing the threshold of audibility (quiet curve) and the masking curve.]

Figure 5: A typical curve for masking. Noise below the solid line or bold line is inaudible. The bold line is moved upward by taking masking effects into consideration.

[Figure 6 plot: power spectral density (dB) versus frequency (Hz), showing the original audio signal, the audible watermark signal, and the inaudible watermark signal.]

Figure 6: An example of noise shaping. Audible noise (dotted line) is transformed into inaudible noise (broken line).

The frequency masking procedure is given as follows:

1. Calculate the power spectrum.
2. Locate the tonal (sinusoid-like) and non-tonal (noise-like) components.
3. Decimate the maskers to eliminate all irrelevant maskers.
4. Compute the individual masking thresholds.
5. Determine the minimum masking threshold in each subband.

This minimum masking threshold defines the frequency response of the shaping filter, which shapes the watermark. The filtered watermark signal is scaled in order to embed the watermark noise below the masking threshold. The shaped signal below the masking threshold is hardly audible. In addition, the noise energy of the pseudo-random sequence can be increased as much as possible in order to maximize robustness. The noise is inaudible as long as its power is below the masking threshold $T$. Temporal masking effects are also utilized for watermark shaping.

Watermark shaping is a time-consuming task, especially when we try to exploit the masking effects frame by frame in real time, because the watermark shaping filter coefficients are computed based on the psycho-acoustic model. In this case, we have to use the Fourier transform and the inverse Fourier transform and follow the five steps described above. Needless to say, the detection rate then increases, since the robustness of the watermark increases. However, since this is too time-consuming, a watermark shaping filter computed from the quiet curve can be used instead. Since this filter exploits only the minimum noise level, it is not optimal in terms of the watermark strength $\alpha$, which results in a strong reduction of the robustness. Of course, instead of maximizing the masking threshold, we can increase the length of the pseudo-random sequence for robustness. However, this reduces the embedding message capacity.
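The final scaling step, placing the watermark power just below the masking threshold in each subband, can be sketched as follows. This is only an illustration of the scaling arithmetic: the masking thresholds are assumed to come from a psycho-acoustic model that is not implemented here, and the function names, the per-subband power representation, and the 3 dB safety margin are hypothetical choices.

```python
import math

def shaping_gains(watermark_power, masking_threshold_db, margin_db=3.0):
    """Amplitude gain per subband that puts the watermark margin_db below T.

    watermark_power: linear power of the unshaped watermark in each subband.
    masking_threshold_db: minimum masking threshold T per subband, in dB
    (assumed to be supplied by a psycho-acoustic model).
    """
    gains = []
    for p_w, t_db in zip(watermark_power, masking_threshold_db):
        target_power = 10.0 ** ((t_db - margin_db) / 10.0)
        gains.append(math.sqrt(target_power / p_w))
    return gains

def shaped_power_db(watermark_power, gains):
    """Power of the shaped watermark per subband, in dB."""
    return [10.0 * math.log10(p * g * g) for p, g in zip(watermark_power, gains)]

if __name__ == "__main__":
    power = [4.0, 0.25, 1.0]       # made-up unshaped watermark powers
    threshold = [30.0, 12.0, -5.0]  # made-up masking thresholds in dB
    gains = shaping_gains(power, threshold)
    print(gains)
    print(shaped_power_db(power, gains))
```

Subbands with a high threshold receive a large gain, which is how shaping raises the watermark energy where the host masks it and lowers it elsewhere.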

Figure 7: Seven example waveforms for sinusoidal modulation watermarking. By the courtesy of Dr. Zheng Liu.


Sinusoidal Modulation

Another solution is sinusoidal modulation, which is based on the orthogonality between sinusoidal signals with different frequencies (Liu et al. 2002):

$$\frac{1}{N} \sum_{i=0}^{N-1} \sin\left(\frac{2\pi i m}{N}\right) \sin\left(\frac{2\pi i n}{N}\right) = \begin{cases} 1/2 & \text{if } m = n \\ 0 & \text{otherwise} \end{cases}$$

for integer frequency indices $0 < m, n < N/2$. Based on this property, the sinusoidally modulated watermark can be generated by weighting sinusoids of different frequencies with the elements of a pseudo-random sequence and adding them (Liu et al. 2002) as follows:

$$w(n) = \sum_{i=0}^{N-1} b_i \alpha_i \sin(2\pi f_i n).$$
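The orthogonality property and the construction of the sinusoidally modulated watermark can be sketched in a few lines. The sketch assumes integer frequency bins over a block of `length` samples, so that $f_i$ corresponds to `f / length`, and it recovers each coefficient $b_i$ by projecting the watermark alone onto the matching sinusoid. The function names and parameter values are hypothetical, and the host audio is left out to keep the arithmetic visible.

```python
import math

def sinusoid(m, length):
    """sin(2*pi*m*i/length) for i = 0 .. length-1, with m an integer bin."""
    return [math.sin(2.0 * math.pi * m * i / length) for i in range(length)]

def inner(u, v):
    """Normalized inner product (1/N) * sum of u(i) * v(i)."""
    return sum(a * b for a, b in zip(u, v)) / len(u)

def sinusoidal_watermark(bits, freqs, gains, length):
    """w(n) = sum over i of b_i * alpha_i * sin(2*pi*f_i*n/length)."""
    w = [0.0] * length
    for b, f, a in zip(bits, freqs, gains):
        for n, v in enumerate(sinusoid(f, length)):
            w[n] += b * a * v
    return w

def recover_bits(w, freqs):
    """The sign of the projection onto each sinusoid gives b_i."""
    return [1 if inner(w, sinusoid(f, len(w))) > 0 else -1 for f in freqs]

if __name__ == "__main__":
    bits = [1, -1, -1, 1, 1, -1, 1]   # bipolar pseudo-random coefficients b_i
    freqs = [3, 4, 5, 6, 7, 8, 9]     # seven distinct bins below length / 2
    w = sinusoidal_watermark(bits, freqs, [0.1] * 7, 64)
    print(recover_bits(w, freqs))
```

Because the sinusoids are mutually orthogonal, each projection isolates a single term $b_i \alpha_i / 2$, and the seven coefficients do not interfere with each other.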

Note that the watermark signal modulated by the elements of the pseudo-random sequence $b_i$ keeps the same correlation characteristics as the pseudo-random sequence in Equation (5). The coefficient $b_i$ is an element of a bipolar pseudo-random sequence, and $\alpha_i$ is a scaling factor for the $i$-th sinusoidal component with frequency $f_i$. For example, Figure 7 shows seven waveforms for the sinusoidally modulated watermarking scheme. Seven sinusoids are linearly combined with different coefficients $b_i$.

This sinusoidal modulation method has the following advantages. First, watermark embedding and detection can be done simply in the time domain, so the embedding complexity is relatively low. Second, the length of the pseudo-random sequence is very short. Third, the embedded sinusoids always start from zero and end on zero, which minimizes the chance of block noise. Of course, this scheme also needs psycho-acoustic shaping for inaudibility. However, since the number of sinusoids is quite small, the just noticeable difference (see Figure 8) for them can be decided in the frequency domain by audibility experiments.

[Figure 8 plot. Labels: just noticeable difference, minimum inaudible amplitude.]

Figure 8: Just noticeable differences for sinusoidal modulation.

Two-Set Method

A blind watermarking scheme can be devised by making two sets different. For example, if two sets are different, then we can conclude that a watermark is present. Such decisions are made by hypothesis tests, typically based on the difference of means between the two sets. Making two sets of audio blocks have different energies can also be a good solution for blind watermarking. Patchwork (Arnold 2000) (Bender et al. 1996) (Yeo and Kim 2003) also belongs to this category. Of course, depending on the application, we can exploit the differences between two or more sets.

Patchwork Scheme

The original patchwork scheme embeds a special statistic into an original signal (Bender et al. 1996). The two major steps in the scheme are: (i) choose two patches pseudo-randomly, and (ii) add a small constant value $d$ to the samples of one patch $A$ and subtract the same value $d$ from the samples of the other patch $B$. Mathematically speaking,

$$a_i^* = a_i + d, \qquad b_i^* = b_i - d,$$

where $a_i$ and $b_i$ are samples of the patchwork sets $A$ and $B$, respectively. Thus, the original sample values have to be slightly modified. The detection process starts with the subtraction of the sample values between the two patches. Then $E[\bar{a}^* - \bar{b}^*]$, the expected value of the difference of the sample means, is used to decide whether the samples contain watermark information or not, where $\bar{a}^*$ and $\bar{b}^*$ are the sample means of the individual samples $a_i^*$ and $b_i^*$, respectively. Since two patches are used rather than one, the scheme can detect the embedded watermarks without the original signal, which makes it a blind watermarking scheme.

Patchwork has some inherent drawbacks. Note that

$$E[\bar{a}^* - \bar{b}^*] = E[(\bar{a} + d) - (\bar{b} - d)] = E[\bar{a} - \bar{b}] + 2d,$$

where $\bar{a}$ and $\bar{b}$ are the sample means of the individual samples $a_i$ and $b_i$, respectively. The patchwork scheme assumes that $E[\bar{a}^* - \bar{b}^*] = 2d$, due to the prior assumption that random sampling ensures equal expected values, such that $E[\bar{a} - \bar{b}] = 0$. However, the actual difference of the sample means, $\bar{a} - \bar{b}$, is not always zero in practice. Although the distribution of the random variable $\bar{a}^* - \bar{b}^*$ is shifted to the right, as shown in Figure 9, the probability of a wrong detection still remains (see the area smaller than 0 in the watermarked distribution). The performance of the patchwork scheme depends on the distance between the two sample means and on $d$, which affects inaudibility.

[Figure 9 plot. Labels: unwatermarked distribution, watermarked distribution.]

Figure 9: A comparison of the unwatermarked and watermarked distributions of the mean difference.

Furthermore, the patchwork scheme was originally designed for images. The original patchwork scheme has been applied to spatial-domain image data (Bender et al. 1996) (or, equivalently, time-domain audio data). However, time-domain embedding is vulnerable even to weak attacks and modifications. Thus, the patchwork scheme can be implemented in the transform domain (Arnold 2000) (Bassia et al. 2001) (Yeo and Kim 2003). These implementations have enhanced the original patchwork algorithm. First, the mean and variance of the sample values are computed in order to detect the watermarks. Second, the new algorithms assume that the distribution of the sample values is normal. Third, they try to decide the value $d$ adaptively.

The Modified Patchwork Algorithm (MPA) (Yeo and Kim 2003) is described below:

1. Generate two sets $A = \{a_i\}$ and $B = \{b_i\}$ randomly. Calculate the sample means $\bar{a} = \frac{1}{N} \sum_{i=1}^{N} a_i$ and $\bar{b} = \frac{1}{N} \sum_{i=1}^{N} b_i$, respectively, and the pooled sample standard error

$$S = \sqrt{\frac{\sum_{i=1}^{N} (a_i - \bar{a})^2 + \sum_{i=1}^{N} (b_i - \bar{b})^2}{N(N - 1)}}.$$

2. The embedding function presented below introduces an adaptive value change,

$$a_i^* = a_i + \operatorname{sign}(\bar{a} - \bar{b}) \sqrt{C}\, S/2, \qquad b_i^* = b_i - \operatorname{sign}(\bar{a} - \bar{b}) \sqrt{C}\, S/2, \qquad (6)$$

where $C$ is a constant and sign is the sign function. This function makes the large value set larger and the small value set smaller, so that the distance between the two sample means is always bigger than $d = \sqrt{C}\, S$, as shown in Figure 10.

3. Finally, replace the selected elements $a_i$ and $b_i$ by $a_i^*$ and $b_i^*$.

[Figure 10 plot. Labels: unwatermarked distribution, watermarked distribution.]

Figure 10: A comparison of the unwatermarked and watermarked distributions of the mean difference by the modified patchwork algorithm.

Since the proposed embedding function (6) introduces relative distance changes between the two sets, a natural test statistic for deciding whether or not the watermark is embedded should concern the distance between the means of $A$ and $B$. The decoding process is as follows:

1. Calculate the test statistic

$$T^2 = \frac{(\bar{a} - \bar{b})^2}{S^2}.$$

2. Compare $T^2$ with the threshold $\tau$ and decide that the watermark is embedded if $T^2 > \tau$ and that no watermark is embedded otherwise.

The multiplicative patchwork scheme (Yeo and Kim 2003) provides a new way of patchwork embedding. Most embedding schemes are additive, such as $x = s + w$, while multiplicative embedding schemes have the form $x = s(1 + w)$. Additive schemes shift the average, while multiplicative schemes change the variance. The detection scheme exploits this fact.

Amplitude Modification

This method embeds a watermark by changing the energies of two or three blocks. The energy of each block of length $N$ is defined and calculated as

$$E = \sum_{i=1}^{N} |s(i)|.$$

The energy is high when the amplitude of the signal is large. Assume that two consecutive blocks are used to embed a watermark. We can make the two blocks $A$ and $B$ have the same or different energies by modifying the amplitude of each block. Let $E_A$ and $E_B$ denote the energies of blocks $A$ and $B$, respectively. If $E_A \ge E_B + \Delta$ for a threshold $\Delta$, then, for example, we conclude that the watermark message $m = 0$ is embedded. If $E_A \le E_B - \Delta$, then we conclude that the watermark message $m = 1$ is embedded. Otherwise, no watermark is embedded.

However, this method has a serious problem. If block $A$ has much more energy than block $B$ and the watermark message to be embedded is 0, there is no problem at all. Otherwise, we have to make $E_A$ smaller than $E_B$. When the energy gap is wide, the resulting artifact becomes obvious and unnatural enough to be noticed. Unfortunately, this scheme can turn a forte part into a piano part, or vice versa. This problem can be moderated by using three blocks (Lie and Chang 2001) or more. By using multiple blocks, such artifacts can be reduced slightly by distributing the burden across the other blocks.

Replica Method

The original signal can be used as an audio watermark. Echo hiding is a good example. Replica modulation also embeds part of the original signal in the frequency domain as a watermark. Thus, replica modulation embeds a replica, i.e., a properly modulated original signal, as a watermark. The detector can also generate the replica from the watermarked audio and calculate the correlation. The most significant advantage of this method is its high immunity to synchronization attacks.

5.1 Echo Hiding

Echo hiding embeds data into an original audio signal by introducing an echo in the time domain such that

$$x(n) = s(n) + \alpha s(n - d). \qquad (7)$$

For simplicity, a single echo is added above (see Figure 11). However, multiple echoes can be added (Bender et al. 1996). Binary messages are embedded by echoing the original signal with one of two delays, either a $d_0$ sample delay or a $d_1$ sample delay. Extraction of the embedded message involves the detection of the delay $d$. The autocepstrum or cepstrum detects the delay $d$. Cepstrum analysis duplicates the cepstrum impulses every $d$ samples. The magnitudes of the impulses representing the echoes are small relative to the original audio. The solution to this problem is to take the auto-correlation of the cepstrum (Gruhl et al. 1996).

[Figure 11 diagram. Labels: original signal (amplitude 1), echo signal, echo amplitude, delay offset.]

Figure 11: Kernels for echo hiding.

A double echo (Oh et al. 2001), such as

$$x(n) = s(n) + \alpha s(n - d) - \alpha s(n - d - \delta),$$

can reduce the perceptual signal distortion and enhance robustness. The typical value of $\delta$ is less than three or four samples.

Echo hiding is usually imperceptible and sometimes makes the sound richer. Synchronization methods frequently adopt this method for coarse synchronization. A disadvantage of echo hiding is its high complexity, due to the cepstrum or autocepstrum computation during detection. In addition, anybody can detect the echo without any prior knowledge, which gives malicious attackers a clue. This is another disadvantage of echo hiding. Blind echo removal is partially successful (Petitcolas et al. 1998). A time-spread echo (Ko et al. 2002) can reduce the possibility of such attacks. Another way of evading blind attacks is auto-correlation modulation (Petrovic et al. 1999), which obtains the watermark signal $w(n)$ from the echoed signal $x(n)$ in Equation (7). This method is more sophisticated and is elaborated in replica modulation.

A double echo hiding scheme (Kim and Choi 2003),

$$x(n) = s(n) + \alpha s(n - d) + \alpha s(n + d),$$

is now available. The virtual echo $s(n + d)$ violates causality. However, it is possible to embed virtual echoes by delaying the echo-embedding process by $d$ samples. These twin echoes make the cepstrum peak higher than that of a single echo with the same echo strength $\alpha$. Thus, double echoes can enhance the detection rate due to the higher peak, or enhance imperceptibility by reducing $\alpha$ accordingly.
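The single-echo embedding of Equation (7), with one of two delays selecting the bit, can be sketched as follows. Note the substitution: instead of the cepstrum or autocepstrum described above, this sketch detects the delay with the plain auto-correlation of the watermarked signal. That simplification is adequate only because the stand-in host is white noise; it is not claimed to work on real audio. The function names and parameter values are hypothetical.

```python
import random

def embed_echo(s, bit, d0, d1, alpha):
    """x(n) = s(n) + alpha * s(n - d), with d = d1 for bit 1 and d = d0 for bit 0."""
    d = d1 if bit == 1 else d0
    return [s[n] + (alpha * s[n - d] if n >= d else 0.0) for n in range(len(s))]

def autocorr(x, lag):
    """(1/N) * sum of x(n) * x(n - lag) over the valid range of n."""
    return sum(x[n] * x[n - lag] for n in range(lag, len(x))) / len(x)

def detect_echo(x, d0, d1):
    """Pick the candidate delay with the larger auto-correlation.

    This replaces cepstrum-based detection and assumes a nearly white host.
    """
    return 1 if autocorr(x, d1) > autocorr(x, d0) else 0

if __name__ == "__main__":
    rng = random.Random(5)
    s = [rng.gauss(0.0, 1.0) for _ in range(4000)]  # stand-in host audio
    for bit in (0, 1):
        x = embed_echo(s, bit, d0=50, d1=80, alpha=0.5)
        print(bit, detect_echo(x, d0=50, d1=80))
```

The sketch also shows the weakness noted above: the detector needs no secret key, only the two candidate delays, so anyone who guesses them can find the echo.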

Replica Modulation

Replica modulation (Petrovic 2001) is a novel watermarking scheme that embeds a replica, i.e., a modified version of the original signal. The three replica modulation methods are the frequency-shift, phase-shift, and amplitude-shift schemes. The frequency-shift method transforms $s(n)$ into the frequency domain, copies a fraction of the low-frequency components in a certain range (for example, from 1 kHz to 4 kHz), modulates them (by moving them 20 Hz, for example, with a proper scaling factor), inserts them back into the original components (to cover the range from 1020 Hz to 4020 Hz), and transforms inversely to the time domain to generate the watermark signal $w(n)$. Since the frequency components are shifted and added in the frequency domain, we call it a frequency-domain echo to contrast it with the time-domain echo, where the replica is obtained by a time shift of the original (or a portion of it). Such a modulated signal $w(n)$ is a replica. This replica can be used as a carrier in much the same manner as the PN sequence in spread-spectrum techniques. Thus, the watermarked signal has the following form:

$$x(n) = s(n) + w(n).$$

As long as the components are invariant against modifications, the replica in the frequency domain can be generated from the watermarked signal. The watermark signal $w(n)$ can be generated from the watermarked signal $x(n)$ by processing it according to the embedding process. Then the correlation between $x(n)$ and $w(n)$ is computed as follows

$$c = \frac{1}{N} \sum_{i=1}^{N} s(i) w(i) + \frac{1}{N} \sum_{i=1}^{N} w(i) w(i) \qquad (8)$$

to detect the watermark. As long as we use a frequency band whose lower cut-off is much larger than the frequency shift, and the correlation is computed over an integer number of frequency-shift periods, the correlation between $s(n)$ and $w(n)$ in Equation (8) is very small. On the other hand, the spectrum of the product $w(n)w(n)$ has a strong DC component, and thus $c$ contains a term given by the mean value of $w(n)w(n)$, i.e., it contains the scaled auxiliary signal in the last term of Equation (8).

Note that the frequency shift is just one way to generate a replica. A combination of frequency shift, phase shift, and amplitude shift makes it more difficult for a malicious attacker to derive a clue from the replica modulation, and makes the correlation value between $s(n)$ and $w(n)$ even smaller. The main advantage in comparison to the PN sequence is that chip synchronization is not needed during detection, which makes replica modulation immune to synchronization attacks. When an attacker makes a jitter attack (e.g., cuts out a small portion of audio and splices the signal) against PN sequence techniques, synchronization is a must. In contrast, replica modulation is free from synchronization, since the replica and the original give the same correlation before and after cutting and splicing. Of course, time-scaling attacks can affect bit and packet synchronization, but this is a much smaller problem than chip synchronization. Pitch-scaling (Shin et al. 2002) is a variant of replica modulation in which the length of the audio remains unchanged, but the harmonics are either expanded or contracted accordingly.

[Figure 12 diagram. Panels: (a) original signal; (b) time-scale modified signal. Labels: bit "1" is embedded (gentle slope); bit "0" is embedded (steep slope).]

Figure 12: The concept of the time-scale modification watermarking scheme. Messages, either bit 0 or bit 1, can be embedded by changing the slopes between two successive extrema.

2001). Time-scale modication refers to the process of either compressing or expanding the time-scale of audio. Basic idea of the time-scale modication watermarking is to change the time-scale between two extrema (successive maximum and minimum pair) of the audio signal (see Figure 12). The intervals between two extrema are partitioned to N segments of equal amplitude. We can change the slope of the signal in certain amplitude interval(s) according to the bits we want to embed, which changes the timescale. For example, the steep slope and gentle slope stand bits 0 and 1 or vice versa, respectively. Advanced time-scale modication watermarking scheme (Mansour and Tewk 2001) can survive time-scale modication attack.


Salient Features

Salient features are special and noticeable signal to the embedders, but common signal to the attackers. They may be either natural or articial. However, in either case they must be robust against attacks. So far those features are extracted or made empirically. The salient features can be used especially for synchronization or for robust watermarking, for example, against time-scale modication attack.

Self-Marking Method


Watermark detection starts by alignment of watermarked block with detector. Losing synchronization causes false detection. Time-scale or frequency-scale modication makes the detector lose synchronization. Thus, most serious and malicious attack is probably the desynchronization. All the watermarking algorithms assume that any detector be synchronized before detection. Brute-force search is computationally infeasible. Thus, we need fast and exact synchronization algorithms. Some watermarking schemes such as replica modulation or echo hiding are rather robust against certain type of desynchronization attacks. Such schemes can be used as a baseline method 6.1 Time-Scale Modication for coarse synchronization. Synchronization code can Time-scale modication is a challenging attack and be used to synchronize the onset of the watermarked can be used for watermarking (Mansour and Tewk block. Self-marking method embeds watermark by leaving self-evident marks into the signal. This method embeds special signal into the audio, or change signal shapes in time domain or frequency domain. Time-scale modication method (Mansour and Tewk 2001) and many schemes based on the salient features (Wu et al. 2000) belong to this category. Clumsy self-marking method, for example, embedding a peak into frequency domain, is prone to attack since it is easily noticeable. 11

However, designing a refined synchronization scheme is not simple. Clever attackers also try to devise sophisticated methods for desynchronization. Thus, a synchronization scheme should be both fast and robust against attacks. There are two synchronization problems. The first is to align the starting point of a watermarked block. This problem arises under attacks such as cropping or inserting redundant samples. For example, a garbage clip can be added to the beginning of the audio intentionally or unintentionally. Some MP3 encoders unintentionally add around 1,000 samples, which makes an innocent decoder fail to detect the exact watermark. The second is time-scale and frequency-scale modification, done intentionally by malicious attackers or unintentionally by audio systems (Petrovic et al. 1999); either way, it is very difficult to cope with.

Time-scale modification is a time-domain attack that periodically adds fake samples to the target audio, periodically deletes samples (Petitcolas et al. 1998), or uses sophisticated time-scaling schemes (Arfib 2002) (Dutilleux 2002) that preserve pitch. Thus, the audio length may be increased or decreased. On the other hand, frequency-scale modification (or pitch scaling) adjusts frequencies and then applies time-scale modification to keep the length unchanged. This attack can be implemented by sophisticated audio signal processing techniques (Arfib 2002) (Dutilleux 2002). Aperiodic modification is more difficult to manage.

There are many audio features, such as brightness, zero-crossing rate, pitch, beat, frequency centroid, and so on. Some of them can be used for synchronization as long as they are invariant under attacks. Feature analysis has been studied well in the speech processing literature, while very few studies are available in audio processing.

Recently, a precise synchronization scheme that is efficient and reliable against time-scaling and pitch-scaling attacks has been presented (Tachibana et al. 2001). For robustness, this scheme calculates and manipulates the magnitudes of segmented areas in the time-frequency plane using short-term DFTs. The detector correlates the magnitudes with a pseudo-random array that corresponds to two-dimensional areas in the time-frequency plane. The purpose of the 2-D array is to allow detection as long as at least one plane of information survives, under the assumption that attacking the watermark in both planes at the same time is hardly feasible. Manipulation of magnitudes (which is similar to amplitude modification) is useful since magnitudes are less influenced than phases under attack. This scheme defends quite well against time-scaling and pitch-scaling attacks.


Coarse Alignment

Fine alignment is the final goal of synchronization. However, such alignment is not simple. Thus, coarse synchronization is needed to locate candidate positions quickly and effectively. Once such positions are identified, fine synchronization mechanisms are used for exact synchronization. Thus, a coarse alignment scheme should be simple and fast. The combination of energy and zero-crossings is a good example of a coarse alignment scheme. The total energy and the number of zero-crossings of each block are calculated, with a sliding window used to confine the block. If the two measures meet predefined criteria, we can conclude that the block is close to the target block for synchronization. This conclusion rests on the assumption that the energy and the number of zero-crossings are invariant under attacks. For example, a block with low energy and a large number of zero-crossings may be a good clue. The number of zero-crossings is closely related to frequency: a large number of zero-crossings implies that the audio contains high-frequency components. Energy computation is simple to implement: taking the absolute value of each sample and summing them gives the energy of the block. Counting the number of sign changes from positive to negative and vice versa gives the number of zero-crossings. Echo hiding can also be used for coarse synchronization. For example, if evidence of an echo is found, the block is close to a synchronization point. Unfortunately, echo detection is considerably costly in terms of computational complexity. Replica modulation is rather robust against desynchronization attacks.
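The energy and zero-crossing test described above can be sketched as follows. This is only a minimal illustration: the function names, the thresholds `max_energy` and `min_crossings`, the hop size, and the toy signal are assumptions made for the example and do not come from any of the cited schemes. Energy is computed as the sum of absolute sample values, as in the text, and a zero sample is treated as non-negative.

```python
def block_energy(block):
    """Energy measure used in the text: sum of absolute sample values."""
    return sum(abs(x) for x in block)


def zero_crossings(block):
    """Count sign changes between consecutive samples (0 counts as non-negative)."""
    return sum(1 for a, b in zip(block, block[1:]) if (a < 0) != (b < 0))


def coarse_candidates(signal, block_len, hop, max_energy, min_crossings):
    """Slide a window over the signal and return the start indices of blocks
    whose energy is low and whose zero-crossing count is high."""
    starts = []
    for start in range(0, len(signal) - block_len + 1, hop):
        block = signal[start:start + block_len]
        if (block_energy(block) <= max_energy
                and zero_crossings(block) >= min_crossings):
            starts.append(start)
    return starts


# Toy signal: loud constant segments around a quiet, rapidly alternating one.
signal = [0.5] * 8 + [0.1, -0.1] * 4 + [0.5] * 8
candidates = coarse_candidates(signal, block_len=8, hop=4,
                               max_energy=1.0, min_crossings=5)
```

In this toy case only the block starting at sample 8 satisfies both criteria; a fine synchronization step would then search around that position.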


Synchronization Code

The synchronization code in the time domain based on the Barker code (Huang et al. 2002) is a notable idea. The Barker code (with bit length 12, for example, given as 111110011010) can be used for synchronization since it has a special autocorrelation function with a sharp peak at zero lag. To embed the Barker code successively, this method sets the lowest 13 bits of a sample to 1100000000000 when the bit to embed is 1, and to 0100000000000 otherwise, regardless of the sample value. For example, the 16-bit sample value 1000000011111111 is forcibly changed into 1001100000000000 to embed bit 1 in the time domain. This method is claimed to achieve the best resistance to additive noise while keeping sufficient inaudibility.
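The bit-forcing rule and the autocorrelation property mentioned above can be checked with a short sketch. It assumes that samples are handled as unsigned 16-bit patterns and that one code bit is written per sample; the function names are hypothetical and this is not the implementation of Huang et al.

```python
SYNC_CODE = "111110011010"       # 12-bit code quoted in the text
LOW_BITS = 13                    # number of low-order bits overwritten
MASK = (1 << LOW_BITS) - 1
PATTERN_ONE = 0b1100000000000    # written when the code bit is 1
PATTERN_ZERO = 0b0100000000000   # written when the code bit is 0


def autocorrelation(code):
    """Aperiodic autocorrelation of the code mapped to +1/-1 chips."""
    chips = [1 if c == "1" else -1 for c in code]
    n = len(chips)
    return [sum(chips[i] * chips[i + lag] for i in range(n - lag))
            for lag in range(n)]


def embed_bit(sample, bit):
    """Force the lowest 13 bits of a 16-bit sample to the pattern for `bit`."""
    pattern = PATTERN_ONE if bit else PATTERN_ZERO
    return (sample & 0xFFFF & ~MASK) | pattern


def extract_bit(sample):
    """Read the bit back: bit 12 is set only in the pattern for 1."""
    return (sample >> (LOW_BITS - 1)) & 1


def embed_code(samples, code=SYNC_CODE):
    """Embed one code bit per sample, in order (an assumption of this sketch)."""
    return [embed_bit(s, int(c)) for s, c in zip(samples, code)]


acf = autocorrelation(SYNC_CODE)              # peak of 12 at lag 0
marked = embed_bit(0b1000000011111111, 1)     # the example from the text
```

Because the two patterns differ in the top two of the 13 overwritten bits, additive noise must be large to flip the decision, which is consistent with the robustness claim; the price is that all lower bits of the marked samples are discarded.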

Sequence A against Sequence B: (a) exact match (15 matches); (b) one chip off (3 matches); (c) one chip off with the chip rate extended by 3 (15 matches).

Figure 13: The concept of redundant-chip coding. The right figure is an extended version of the center figure with a chip rate of 3. Correlation is calculated only in the areas marked with dotted lines.

Salient Point Extraction

Salient point extraction without changing the original signal (Wu et al. 2000) is also a good scheme. The basic idea of this scheme is to extract salient points as locations where the audio signal energy climbs fast to a peak value. This approach works well for simple audio clips played by few instruments. However, it has two disadvantages with more complex audio clips. First, the overall energy variation becomes ambiguous for complex audio in which many musical instruments are played together, so the stability of the salient points decreases. Second, it is difficult to define appropriate thresholds for all pieces of music. A high threshold value is suitable for audio with sharp energy variation, but the same value applied to complex audio would yield very few salient points. Thus, audio content analysis (Wu et al. 2000) parses complex audio into several simpler ones so that the stability of the salient points can be improved and the same threshold can be applied to all audio clips.

In order to avoid such complex operations, special shaping of the audio signal is also useful for coarse synchronization. This approach intentionally modifies the signal shape to create salient points that are sufficiently invariant under malicious modifications. An example is choosing a fast-climbing signal portion and marking it with a special sawtooth shape. Such artificial marking may generate audible high-frequency noise. Careful shaping can reduce the noise to a hardly audible level.

Redundant-Chip Coding

A pseudo-random sequence is a good tool for watermarking. As mentioned, correlation is effective for detecting a watermark as long as perfect synchronization is achieved. When the pseudo-random sequence is exactly aligned, its correlation approaches Equation (5). Figure 13-(a) depicts perfect synchronization between two copies of a 15-chip pseudo-random sequence (an M-sequence could be used, but is not in this example). Its normalized autocorrelation is 1. However, if the sequences are misaligned by one chip, as shown in (b), the autocorrelation falls to 3/15. This problem can be solved by redundant-chip coding (Kirovski and Malvar 2001). Figure 13-(c) shows a chip rate expanded by 3. Now, misalignment by one chip does not matter. During the detection phase, only the central sample of each expanded chip is used for computing the correlation. The central chips are marked by broken lines in Figure 13-(c). By using such redundant-chip encoding with expansion by R chips, correct detection is possible up to a misalignment of R/2 chips. Of course, this method enhances robustness at the cost of embedding capacity.
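A small sketch can illustrate why using only the central sample of each expanded chip tolerates misalignment. The 15-chip sequence below is an arbitrary illustrative choice (an assumption, not the sequence of Figure 13), the function names are hypothetical, and the shift is measured in samples of the expanded signal with cyclic wrap-around for simplicity.

```python
def expand(chips, R):
    """Repeat every chip R times (redundant-chip coding)."""
    return [c for c in chips for _ in range(R)]


def central_correlation(received, chips, R, shift=0):
    """Normalized correlation using only the central sample of each
    expanded chip; `shift` models a misalignment in samples (cyclic)."""
    total = 0
    for i, c in enumerate(chips):
        idx = (R * i + R // 2 + shift) % len(received)
        total += c * received[idx]
    return total / len(chips)


# Illustrative 15-chip +1/-1 sequence (an assumption for this example).
bits = "000100110101111"
chips = [1 if b == "1" else -1 for b in bits]

R = 3
signal = expand(chips, R)
aligned = central_correlation(signal, chips, R, shift=0)
one_off = central_correlation(signal, chips, R, shift=1)
two_off = central_correlation(signal, chips, R, shift=2)

# Without expansion (R = 1), a one-sample shift already hits the next chip.
plain_one_off = central_correlation(chips, chips, 1, shift=1)
```

With R = 3, a shift of one sample in either direction still reads every central sample from the correct chip, so the normalized correlation stays at 1, while a shift of two samples reads the neighbouring chip and the correlation drops, just as it does for the unexpanded sequence shifted by one.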


Beat-Scaling Transform

The beat, the salient periodicity of a music signal, is one of the fundamental characteristics of audio. A serious change of beat can spoil the music. Thus, the beat must remain almost invariant under attacks. In this context, the beat can be a very important marker for synchronization. The beat-scaling transform (Kirovski and Attias 2002) can be used to synchronize the watermark detector with the location of the watermark in an audio clip. The beat-scaling transform method calculates the average beat period in the clip and identifies the location of each beat as accurately as possible. Next, the audio clip is scaled (i.e., stretched or shortened) such that the length of each beat period is constant and equal to the average beat period rounded to the nearest multiple of a certain block of samples. The scaled clip is watermarked and then scaled back to its original tempo. As long as the beat remains unchanged, watermarks can be detected from the scaled beat periods. Beat detection algorithms are presented in (Goto and Muraoka 1999) (Scheirer 1998). Of course, in this case synchronization relies on the accuracy of the beat detection algorithm.
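The scaling step can be sketched as follows, assuming the beat positions are already known from a beat tracker. The function names, the block size, and the beat positions are hypothetical, and plain linear interpolation stands in for whatever time-scaling method a real system would use (linear resampling also changes pitch, which a practical scheme would avoid). Samples before the first beat and after the last beat are ignored in this sketch.

```python
def target_beat_length(beats, block):
    """Average beat period rounded to the nearest multiple of `block`."""
    periods = [b - a for a, b in zip(beats, beats[1:])]
    average = sum(periods) / len(periods)
    return max(block, int(average / block + 0.5) * block)


def resample_linear(segment, new_len):
    """Stretch or shorten a segment to new_len samples by linear interpolation."""
    out = []
    for j in range(new_len):
        pos = j * len(segment) / new_len
        i = int(pos)
        frac = pos - i
        nxt = segment[min(i + 1, len(segment) - 1)]
        out.append((1 - frac) * segment[i] + frac * nxt)
    return out


def beat_scale(signal, beats, block):
    """Rescale every beat period to the common target length."""
    target = target_beat_length(beats, block)
    scaled = []
    for a, b in zip(beats, beats[1:]):
        scaled.extend(resample_linear(signal[a:b], target))
    return scaled, target


signal = [float(n) for n in range(300)]
beats = [0, 98, 201, 300]        # hypothetical beat positions in samples
scaled, target = beat_scale(signal, beats, block=16)
```

Here the beat periods of 98, 103, and 99 samples average to 100, which rounds to 96 (six blocks of 16), so every beat period in the scaled clip has the same block-aligned length. After watermarking, each period would be resampled back to its original length.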

Conclusions

Available studies on audio watermarking are far fewer than those on image or video watermarking. However, during the last decade audio watermarking studies have also increased considerably. Those studies have contributed much to the progress of audio watermarking technologies. This paper surveyed those studies and classified them into five categories: quantization scheme, spread-spectrum scheme, two-set scheme, replica scheme, and self-marking scheme. The quantization scheme is not so robust against attacks, but it is easy to implement. The spread-spectrum scheme requires psychoacoustic adaptation for inaudible noise embedding. This adaptation is rather time-consuming. Of course, most audio watermarking schemes need psychoacoustic modelling for inaudibility. Another disadvantage of the spread-spectrum scheme is the difficulty of synchronization. On the other hand, the replica method is effective for synchronization. However, echo hiding is vulnerable to attack. Replica modulation (Petrovic 2001) is more secure than echo hiding. Among two-set schemes, the modified patchwork algorithm (Yeo and Kim 2003) is also very elaborate. The self-marking method can be used especially for synchronization or for robust watermarking, for example against time-scale modification attacks. These five seminal approaches have improved watermarking schemes remarkably. However, more sophisticated technologies are required, and they are expected to be achieved in the next decade. Synchronization schemes are also very important. This article briefly surveyed the basic ideas for synchronization.

Acknowledgments

This work was in part supported by the Brain Korea 21 Project, Kangwon National University. The authors thank Prof. D. Ghose of the Indian Institute of Science for comments. The authors also thank Dr. Rade Petrovic of Verance Inc., Mr. Michael Arnold of Fraunhofer-Gesellschaft, and Dr. Fabien A. P. Petitcolas of Microsoft for their kind personal communications and review. The authors also thank Taehoon Kim, Kangwon National University, for implementing various schemes and providing useful information.

References

Arfib, D., Keiler, F., and Zölzer, U. (2002), Time-frequency Processing, in DAFX: Digital Audio Effects, edited by U. Zölzer, John Wiley and Sons, pp. 237-297.

Arnold, M. (2000), Audio watermarking: features, applications and algorithms, IEEE International Conference on Multimedia and Expo, vol. 2, pp. 1013-1016.

Arnold, M. (2001), Audio Watermarking: Burying information in the data, Dr. Dobb's Journal, vol. 11, pp. 21-28.

Arnold, M., and Schilz, K. (2002), Quality evaluation of watermarked audio tracks, SPIE Electronic Imaging, vol. 4675, pp. 91-101.

Bassia, P., Pitas, I., and Nikolaidis, N. (2001), Robust audio watermarking in the time domain, IEEE Transactions on Multimedia, vol. 3, pp. 232-241.

Bender, W., Gruhl, D., Morimoto, N., and Lu, A. (1996), Techniques for data hiding, IBM Systems Journal, vol. 35, pp. 313-336.

Boeuf, J., and Stern, J.P. (2001), An analysis of one of the SDMI audio watermarks, Proceedings: Information Hiding, pp. 407-423.

Boney, L., Tewfik, A. H., and Hamdy, K. N. (1996), Digital watermarks for audio signal, International Conference on Multimedia Computing and Systems, Hiroshima, Japan, pp. 473-480.

Chen, B., and Wornell, G.W. (1999), Dither modulation: A new approach to digital watermarking and information embedding, Proceedings of the SPIE: Security and Watermarking of Multimedia Contents, vol. 3657, pp. 342-353.

Cox, I.J., Kilian, J., Leighton, F.T., and Shamoon, T. (1996), Secure Spread Spectrum Watermarking for Multimedia, IEEE Trans. Image Processing, vol. 6, pp. 1673-1687.

Cox, I.J., Miller, M.I., and Bloom, J.A. (2002), Digital Watermarking, Morgan Kaufmann Publishers.

Cox, I.J., and Miller, M.I. (2002), The first 50 years of electronic watermarking, Journal of Applied Signal Processing, vol. 2, pp. 126-132.

Craver, S. A., Memon, N., Yeo, B.-L., and Yeung, M. M. (1998), Resolving Rightful Ownerships with Invisible Watermarking Techniques: Limitations, Attacks, and Implication, IEEE Journal on Selected Areas in Communications, vol. 16, no. 4, pp. 573-586.

Craver, S. A., Wu, M., Liu, B., Stubblefield, A., Swartzlander, B., Wallach, D. S., Dean, D., and Felten, E. W. (2001), Reading between the lines: Lessons from the SDMI challenge, USENIX Security Symposium.

Craver, S., Liu, B., and Wolf, W. (2002), Detectors for echo hiding systems, Information Hiding, Lecture Notes in Computer Science, vol. 2578, pp. 247-257.

Cvejic, N., Keskinarkaus, A., and Seppänen, T. (2001), Audio watermarking using m-sequences and temporal masking, IEEE Workshops on Applications of Signal Processing to Audio and Acoustics, New Paltz, New York, pp. 227-230.

Cvejic, N., and Seppänen, T. (2002), Improving audio watermarking scheme using psychoacoustic watermark filtering, IEEE International Conference on Signal Processing and Information Technology, Cairo, Egypt, pp. 169-172.

Dutilleux, P., de Poli, C., and Zölzer, U. (2002), Time-frequency Processing, in DAFX: Digital Audio Effects, edited by U. Zölzer, John Wiley and Sons, pp. 201-236.

Foote, J., and Adcock, J. (2002), Time base modulation: A new approach to watermarking audio and images, e-print.

Goto, M., and Muraoka, Y. (1999), Real-time beat tracking for drumless audio signals, Speech Communication, vol. 27, nos. 3-4, pp. 331-335.

Gruhl, D., Lu, A., and Bender, W. (1996), Echo Hiding, Pre-Proceedings: Information Hiding, Cambridge, UK, pp. 295-316.

Haitsma, J., van der Veen, M., Kalker, T., and Bruekers, F. (2000), Audio watermarking for monitoring and copy protection, ACM Multimedia Workshop, Marina del Rey, California, pp. 119-122.

Hartung, F., and Girod, B. (1998), Watermarking of uncompressed and compressed video, Signal Processing, vol. 66, pp. 283-301.

Hsieh, C.-T., and Tsou, P.-Y. (2002), Blind cepstrum domain audio watermarking based on time energy features, IEEE International Conference on Digital Signal Processing, vol. 2, pp. 705-708.

Huang, J., Wang, Y., and Shi, Y. Q. (2002), A blind audio watermarking algorithm with self-synchronization, IEEE International Conference on Circuits and Systems, vol. 3, pp. 627-630.

Kim, H. (2000), Stochastic model based audio watermark and whitening filter for improved detection, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 4, pp. 1971-1974.

Kim, H.J., Choi, Y.H., Seok, J., and Hong, J. (2003), Audio watermarking techniques, Intelligent Watermarking Techniques: Theory and Applications, World Scientific Publishing (to appear).

Kim, H.J., and Choi, Y.H. (2003), A novel echo hiding algorithm, IEEE Transactions on Circuits and Systems for Video Technology, (to appear).

Kirovski, D., and Malvar, H. (2001), Robust spread-spectrum audio watermarking, IEEE International Conference on Acoustics, Speech, and Signal Processing, Salt Lake City, UT, pp. 1345-1348.

Kirovski, D., and Attias, H. (2002), Audio watermark robustness to desynchronization via beat detection, Information Hiding, Lecture Notes in Computer Science, vol. 2578, pp. 160-175.

Ko, B.-S., Nishimura, R., and Suzuki, Y. (2002), Time-spread echo method for digital audio watermarking using pn sequences, IEEE International Conference on Acoustic, Speech, and Signal Processing, vol. 2, pp. 2001-2004.

Lee, S.K., and Ho, Y.S. (2000), Digital audio watermarking in the cepstrum domain, IEEE Transactions on Consumer Electronics, vol. 46, no. 3, pp. 744-750.

Lie, W.-N., and Chang, L.-C. (2001), Robust and high-quality time-domain audio watermarking subject to psychoacoustic masking, IEEE International Symposium on Circuits and Systems, vol. 2, pp. 45-48.

Liu, Z., Kobayashi, Y., Sawato, S., and Inoue, A. (2002), A robust audio watermarking method using sine function patterns based on pseudo-random sequences, Proceedings of Pacific Rim Workshop on Digital Steganography 2002, pp. 167-173.

Mansour, M. F., and Tewfik, A. H. (2001), Time-scale invariant audio data embedding, International Conference on Multimedia and Expo.

Mansour, M. F., and Tewfik, A. H. (2001), Audio watermarking by time-scale modification, International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1353-1356.

Nahrstedt, K., and Qiao, L. (1998), Non-invertible watermarking methods for MPEG video and audio, ACM Multimedia and Security Workshop, Bristol, U.K., pp. 93-98.

Oh, H.O., Seok, J.W., Hong, J.W., and Youn, D.H. (2001), New echo embedding technique for robust and imperceptible audio watermarking, IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, pp. 1341-1344.

Petitcolas, F.A.P., Anderson, R.J., and Kuhn, M.G. (1998), Attacks on copyright marking system, Information Hiding, Lecture Notes in Computer Science, vol. 1525, pp. 218-238.

Petrovic, R., Winograd, J.M., Jemili, K., and Metois, E. (1999), Data hiding within audio signals, International Conference on Telecommunications in Modern Satellite, Cable, and Broadcasting Service, vol. 1, pp. 88-95.

Petrovic, R. (2001), Audio signal watermarking based on replica modulation, International Conference on Telecommunications in Modern Satellite, Cable, and Broadcasting Service, vol. 1, pp. 227-234.

Rossing, T.D., Moore, F.R., and Wheeler, P.A. (2002), The Science of Sound, 3rd ed., Addison-Wesley, San Francisco.

Scheirer, E. (1998), Tempo and beat analysis of acoustic musical signals, Journal of the Acoustical Society of America, vol. 103, pp. 588-601.

Seok, J., Hong, J., and Kim, J. (2002), A novel audio watermarking algorithm for copyright protection of digital audio, ETRI Journal, vol. 24, pp. 181-189.

Shin, S., Kim, O., Kim, J., and Choi, J. (2002), A robust audio watermarking algorithm using pitch scaling, IEEE International Conference on Digital Signal Processing, pp. 701-704.

Swanson, M., Zhu, B., Tewfik, A., and Boney, L. (1998), Robust audio watermarking using perceptual masking, Signal Processing, vol. 66, pp. 337-355.

Tachibana, R., Shimizu, S., Kobayashi, S., and Nakamura, T. (2001), An audio watermarking method robust against time- and frequency-fluctuation, Proceedings of the SPIE: Security and Watermarking of Multimedia Contents, vol. 4314, pp. 104-115.

Wu, C.-P., Su, P.-C., and Kuo, C.-C. J. (2000), Robust and efficient digital audio watermarking using audio content analysis, Security and Watermarking of Multimedia Contents, SPIE, vol. 3971, pp. 382-392.

Wu, M., Craver, S. A., Felten, E. W., and Liu, B. (2001), Analysis of attacks on SDMI audio watermarks, IEEE International Conference on Acoustic, Speech, and Signal Processing, pp. 1369-1372.

Yeo, I.-K., and Kim, H.J. (2003), Modified patchwork algorithm: A novel audio watermarking scheme, IEEE Transactions on Speech and Audio Processing, vol. 11, (to appear).

Yeo, I.-K., and Kim, H.J. (2003), Generalized patchwork algorithm for image watermarking scheme, ACM Multimedia Systems, (to appear).