

Chapter 3.

Text Document Watermarking

3.1 Introduction
To embrace the Internet as a communication and content distribution medium, it is necessary to secure Internet content by incorporating digital watermarking methods. Digital watermarking methods for images, audio, and video are already in place and are quite effective. Image watermarking exploits the redundancy in images and the limitations of the human visual system (HVS). Similar properties have been exploited for audio and video watermarking, where the watermark is embedded imperceptibly in audio and video frames.

Text is the most extensively used medium of communication over the Internet. The major component of websites, books, newspapers, articles, and legal documents is simply plain text. Plain text therefore requires the utmost protection from copyright violators. A number of digital watermarking algorithms have been proposed in the past for images, audio, and video; however, the digital watermarking algorithms available for plain text are inadequate and ineffective.

Digital watermarking is the process of embedding a unique digital watermark in digital content to protect it from illegal copying and copyright violations. The process of embedding a digital watermark in, and extracting it from, a digital text document so as to uniquely identify the original copyright owner of that text is called digital text watermarking. Text watermarking abides by the same principles as image, audio, or video watermarking. The watermark should remain resilient to random tampering attacks, undetectable to anybody but the original owner/author of the text, as well as easily and


Copyright Protection of Plain Text using Digital Watermarking

fully automatically reproducible by the watermark extraction algorithm. The main concern in text watermarking is that plain text contains less redundant information than images, audio, and video that could be used for secret communication, as in steganography, or for watermarking. Text watermarking techniques should implant unique and invisible watermarks in text documents that remain intact after diverse tampering attacks of insertion, deletion, and re-ordering.

Digital watermarking solutions for text make it easy to send and receive text over the Internet, intranets, extranets, and facsimile. Documents can be evaluated for text confidentiality and copyright protection. Digital text watermarking techniques can also detect any tampering, making the text tamper-evident.

This chapter is organized as follows. Section 3.2 briefly describes the application areas of digital text watermarking. Section 3.3 states the reasons why watermarking text is difficult. Section 3.4 describes the possible attacks on text, their volume, and their nature. Section 3.5 reviews previous approaches to text watermarking, and Section 3.6 discusses the drawbacks of these approaches. Section 3.7 summarizes the discussion.

3.2 Applications
Text watermarking can be used for a large number of real-world applications. With the increasing and widespread use of the Internet for information sharing, text watermarking has gained importance. The emerging concepts of digital libraries, e-business, e-learning, e-government, and e-books have made text watermarking a necessity. Legal documents, certificates, web sites, business plans, books, articles, poetry, company documents, confidential content, SMS, and emails can be protected by text watermarking algorithms. Applications of text watermarking include authentication, copyright protection, copy prevention, covert communication, tamper detection, and fingerprinting.


3.2.1 Authentication
For authentication, fragile watermarks can be used to detect any tampering of a text document. If the watermark is detected, the text document is genuine; if not, the text has been tampered with and cannot be trusted. Authenticating text is essential, especially when it is used for legal purposes. In sensitive communication, e.g. in defense applications and in business communication, it is extremely important to authenticate text messages and to check their reliability and completeness.

3.2.2 Copyright Protection


Text watermarking can be used to protect the intellectual property rights of plain text. It is necessary to protect the copyright of web content, e-books, research papers, journal articles, poetry, quotes, and other documents containing plain text. The content owner/author can embed a watermark representing the copyright information of the data. This watermark can later be extracted to prove ownership if a conflicting copyright claim arises. This can be very helpful in settling copyright disputes in court. It is probably the most prominent use of digital text watermarking.

3.2.3 Copy Prevention


Text watermarking algorithms can also prevent illegal copying and dissemination of text. The watermark information can directly control a digital recording/copying device, which can be a printer or simply a copy-and-paste command. The embedded key can represent a copy-permission bit stream that is detected by the recording software, which then decides whether the copying procedure should go on (if it is allowed) or not (if it is prohibited by the content owner).

3.2.4 Covert Communication


The transmission of private data, which can be plain text or an image, is another application of text watermarking. Covert communication in this sense means implanting a strategic/secret message into innocuous-looking text in such a way that no unauthorized person can detect it while the intended recipient can retrieve it. The text watermarking algorithms proposed in this thesis can be used for covert communication as well.

3.2.5 Tamper Detection


Recent text watermarking algorithms can also identify the type, nature, and volume of the tampering made by attackers in the original text. It thus sometimes becomes possible to infer the intentions of attackers. Plagiarism problems faced by researchers can be addressed by efficient tamper detection algorithms.

3.2.6 Fingerprinting
In order to trace the source of illegal copies, the text author/owner can embed different watermarking keys in the copies supplied to different users. For the owner, embedding a unique, serial-number-like watermark is a good way to detect users who break their license agreement by copying the protected data and supplying it to a third party. Publishing companies can use such fingerprint watermarks to detect copyright violators.
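The bookkeeping behind fingerprinting can be sketched as follows. This is a minimal illustration only: the registry contents, the 16-bit width, and the function names are assumptions, and the step that actually embeds the bits in the text is deliberately left out, since any of the embedding methods reviewed later in this chapter could be used.

```python
# Hypothetical registry of licensees, keyed by the serial number embedded in each copy.
registry = {1001: "alice", 1002: "bob", 1003: "carol"}

def serial_to_bits(serial, width=16):
    """Fixed-width, most-significant-bit-first bit list for a serial number."""
    return [(serial >> i) & 1 for i in reversed(range(width))]

def bits_to_serial(bits):
    """Inverse of serial_to_bits."""
    value = 0
    for bit in bits:
        value = (value << 1) | bit
    return value

def trace_leak(extracted_bits):
    """Return the licensee whose serial matches the bits recovered from a
    leaked copy, or None if the serial is unknown."""
    return registry.get(bits_to_serial(extracted_bits))
```

Given the bits extracted from an illegal copy, `trace_leak` identifies which user's copy was redistributed, provided the watermark survived intact.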

3.3 Why Is Text Watermarking Difficult?


Plain text, being the simplest mode of information, poses various challenges when it comes to copyright protection. Text has limited capacity for watermark embedding since it has far less redundancy than images, audio, and video. Its binary nature with clear demarcation between foreground and background, its block/line/word patterning, and its semantics, structure, style, and language rules are some of the prominent properties of text that need to be addressed in any text watermarking algorithm. Besides, the inherent requirements of a generic watermarking scheme, such as imperceptibility, robustness, and security, also need to be satisfied.


Any transformation of text should preserve its meaning, fluency, grammaticality, and value. The meaning of the text is its value, and it should be preserved through watermarking so as not to disturb the communication. Fluency is required to convey the meaning of the text clearly, most importantly in literary writing. The embedding process should comply with the grammar rules of the language in order to preserve the readability of the text. Preserving the style of the author is very important in some domains, such as literary writing or news channels [37]. The sensitive nature of some documents, such as legal documents, poetry, and quotes, does not allow arbitrary semantic transformations, because in these forms of text a simple transformation sometimes destroys both the semantic connotation and the value of the text.

3.4 Attacks
The cyber community is not very enthusiastic about text watermarking technologies. The reasons might be undisclosed watermarking methods and a lack of robustness against attacks. An attacker may be able to carry out a partial attack even when a complete attack is not possible, so it is necessary to analyze each type of attack. Watermark attacks include unauthorized insertion, unauthorized detection, and unauthorized deletion. These attacks, their volume, and their nature are described as follows:

3.4.1 Types
Text generally encounters reproduction, synonym substitution, reformatting, paraphrasing, and syntactic transformation attacks. All these attacks can be placed in the following categories: unauthorized insertion, unauthorized detection, unauthorized deletion, re-ordering, and a combination of all of these.

i. Unauthorized Insertion

Under this form of attack, words and sentences are added to the text to make it look different and sometimes to carry another message or watermark belonging to the attacker. An attacker sometimes inserts text into the original to add information. This kind of attack happens when an attacker wants to add false information, for example in legal documents and cases. Such attacks can be countered by incorporating a certifying authority in the watermarking architecture, which timestamps the content in the name of the author with the current date and time. Whenever a dispute over a copyright claim arises, this timestamp is used to identify the author who registered the content first.

ii. Unauthorized Detection

In some applications, the ability to detect the watermark should be restricted. It is conceivable that an adversary's ability merely to identify whether or not a mark is present in a given Work will threaten the security of a watermarking system [38].

iii. Unauthorized Deletion

A deletion attack means random deletion of words and sentences from the original text. The attacker deletes some information to distract the reader and hide the identity of the original author/owner of the text. Security against unauthorized deletion is required in all watermarking applications. It is necessary to prevent an attacker from recovering the original, but it is more important to prevent removal of the watermark from the text. The watermark should survive even if the attacker makes a number of alterations to the text, and it should remain detectable by the extraction algorithm.

iv. Re-ordering

The attacker shuffles and reorders the words and sentences of the text to make it look different and to destroy the watermark. The attacker may also rephrase the text and replace certain words with their synonyms. The intention is generally to destroy the writing style, the connotation, and sometimes the meaning of the text.
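When evaluating a watermarking scheme, the insertion, deletion, and re-ordering attacks described above can be simulated at word level. The following is a minimal sketch for testing purposes; the function names and the sample sentence are illustrative assumptions, and real attacks also operate on sentences and may paraphrase.

```python
import random

def insertion_attack(words, extra, rng):
    """Insert each word of `extra` at a random position."""
    out = list(words)
    for word in extra:
        out.insert(rng.randrange(len(out) + 1), word)
    return out

def deletion_attack(words, count, rng):
    """Delete `count` randomly chosen words."""
    drop = set(rng.sample(range(len(words)), count))
    return [w for i, w in enumerate(words) if i not in drop]

def reordering_attack(words, rng):
    """Shuffle the word order."""
    out = list(words)
    rng.shuffle(out)
    return out

rng = random.Random(0)  # fixed seed so that experiments are repeatable
text = "the watermark should survive random tampering of the text".split()
inserted = insertion_attack(text, ["really", "fake"], rng)
deleted = deletion_attack(text, 3, rng)
reordered = reordering_attack(text, rng)
```

A robustness experiment would run the extraction algorithm on `inserted`, `deleted`, and `reordered` and check whether the watermark is still recovered.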

3.4.2 Volume
The volume of an attack depends on the attacker's intention. If the attacker wants to add some information to the text or delete some from it, the volume of the attack will be low. However, if the attacker wants to use some part of the text in his/her own text, the volume of the attack will be high.

3.4.3 Nature
A combined insertion, deletion, and re-ordering attack is termed a tampering attack. Tampering can be made at any random location in the text document, and in two ways: localized tampering and dispersed tampering.

i. Localized

Localized tampering means insertion or deletion of words or sentences at a single location in the text. This location can be at the beginning, at the end, or anywhere else in the text, depending on the attacker's intended use.

ii. Dispersed

Dispersed tampering means insertion and deletion of sentences and words at multiple locations in the original text. Attackers trying to make the text look different tamper with it in a dispersed manner. This kind of attack generally occurs in research plagiarism and literary writing.
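The distinction between localized and dispersed tampering, together with the volume of an attack, can be illustrated by comparing an original text with a tampered copy. The sketch below uses Python's standard `difflib` module at word level; the function name `tampering_profile` and the rule "one changed region means localized" are simplifying assumptions, not a method from the cited literature.

```python
import difflib

def tampering_profile(original, tampered):
    """Return (nature, volume) of the word-level changes between two texts."""
    a, b = original.split(), tampered.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    changes = [op for op in matcher.get_opcodes() if op[0] != "equal"]
    # Volume: number of words inserted, deleted, or replaced.
    volume = sum(max(i2 - i1, j2 - j1) for _, i1, i2, j1, j2 in changes)
    if not changes:
        nature = "none"
    elif len(changes) == 1:
        nature = "localized"   # a single changed region
    else:
        nature = "dispersed"   # changes at multiple locations
    return nature, volume

original = "a b c d e f g h"
local = tampering_profile(original, "a b c X Y d e f g h")
spread = tampering_profile(original, "a X b c d e Y f g h")
```

Here `local` is `("localized", 2)` and `spread` is `("dispersed", 2)`: the same volume of insertion, applied at one location or at two.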

3.5 Literature Review


Text watermarking is an area of research that emerged after the development of the Internet and communication technologies. The first reported effort to protect the copyright of text was made in 1994 by Brassil et al. [14][15], in connection with the Secure Electronic Publishing Trial for an issue of the IEEE Journal on Selected Areas in Communications that was scheduled to be published. Over 1,200 users registered in the first month, and each copy of each paper was registered and watermarked to identify its recipient [39]. Text watermarking is currently a very active research area, with a number of researchers working on text watermarking for English as well as Persian, Turkish, Korean, Urdu, and Arabic.


Previous work on digital text watermarking can be classified into the following categories: an image-based approach, a syntactic approach, and a semantic approach. A description of each category and the corresponding work follows.

3.5.1 Image-based Approach


In this approach to digital text watermarking, an image of the text document is used to embed the watermark. Text is difficult to watermark because of its simplicity, sensitivity, and low capacity for watermark embedding. The initial attempts at text watermarking treated text as an image, embedding the watermark in the layout and appearance of the text image. Brassil et al. proposed a few methods to watermark a text document by using its image [12-14]. The first method proposed by Brassil et al. was the line-shift coding algorithm, which alters the document image by moving lines upward or downward depending on the binary signal (watermark) to be inserted, as shown in Figure 3.1.
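The idea of line-shift coding can be sketched on a list of line baseline positions instead of a real page image. This is a simplified illustration under stated assumptions, not the algorithm of [12-14]: it assumes that only odd-indexed lines carry bits while even-indexed lines stay fixed, that a smaller y value is higher on the page, and that detection is non-blind. The names `embed_line_shift`, `detect_line_shift`, and `delta` are hypothetical.

```python
def embed_line_shift(baselines, bits, delta=1.0):
    """Shift one carrier line per watermark bit: up for 1, down for 0."""
    marked = list(baselines)
    carriers = range(1, len(baselines), 2)  # odd-indexed lines carry the bits
    for idx, bit in zip(carriers, bits):
        marked[idx] += -delta if bit else delta  # smaller y = higher on the page
    return marked

def detect_line_shift(original, marked):
    """Non-blind detection: compare each carrier line with the original."""
    bits = []
    for idx in range(1, len(original), 2):
        diff = marked[idx] - original[idx]
        if diff != 0:
            bits.append(1 if diff < 0 else 0)
    return bits

baselines = [100.0 + 14.0 * i for i in range(9)]  # nine lines, 14-unit leading
watermark = [1, 0, 1, 1]
marked = embed_line_shift(baselines, watermark)
recovered = detect_line_shift(baselines, marked)
```

Because the watermark lives only in the layout, retyping the text or passing it through OCR resets the baselines and removes it.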

Figure 3.1 Line shift coding [16]

The detection algorithm is non-blind, i.e. the original document must be available. The second method was the word-shift coding algorithm, which moves words horizontally within the text, thus expanding spaces to embed the watermark. This algorithm can operate in both non-blind and blind modes. The third method is the feature coding algorithm, which slightly modifies features such as the pixels of characters or the length of the end lines in characters to encode watermark bits in the text. All these proposed techniques discourage unauthorized distribution by embedding a unique codeword in each document. Among the three methods, line-shift coding is the most robust under diverse attacks, but it can also be easily defeated.

Maxemchuk et al. [39][40][41] analyzed the performance of the above-mentioned methods. Correlation-based and centroid-based detection methods [42] were also suggested: the former treats a profile as a discrete-time signal and looks for the direction of shift, while the latter uses the distances between the centroids of adjacent profile blocks to detect the watermark. Low et al. [42][43] further analyzed the efficiency of these methods. Huang and Yan [43] proposed an algorithm based on the average inter-word distance in each line, in which the distances are adjusted according to a sine wave of a specific phase and frequency. Feature-level and pixel-level algorithms were also developed, which mark documents by modifying stroke features such as width or serif [45]. An algorithm that utilizes a gray-scale image of the text was also developed [46], as was an algorithm that watermarks a text document image using an edge direction histogram [47].

Young-Won Kim et al. proposed a text watermarking algorithm based on word classification and inter-word space statistics [48]. In this approach, all the words in a text document are classified according to some text features; adjacent words are then grouped into segments, and each segment is classified according to the class labels of the words within it. The information is encoded by modifying some statistics of the inter-word spaces of the segments belonging to the same class. Several advantages over the conventional word-shift algorithms are discussed. Adnan M. Alattar et al. proposed an algorithm [49] to watermark electronic text documents containing justified paragraphs and irregular line spacing.
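Word-shift coding, as described above, can be sketched on a list of inter-word space widths for one line. The sketch assumes, for illustration only, that the spaces are initially uniform and that each bit is carried by a pair of adjacent spaces: moving the word between them widens one space and narrows the other, so the line length is unchanged and the bit can be read blind. The function names and `delta` are hypothetical, and this is not the exact scheme of any cited paper.

```python
def embed_word_shift(spaces, bits, delta=0.5):
    """Embed one bit per pair of adjacent spaces by moving the word between them."""
    marked = list(spaces)
    for k, bit in enumerate(bits):
        left, right = 2 * k, 2 * k + 1
        shift = delta if bit else -delta
        marked[left] += shift    # word moved right: left space grows
        marked[right] -= shift   # and the right space shrinks by the same amount
    return marked

def detect_word_shift(marked, n_bits):
    """Blind detection: a wider left space reads as 1, a wider right space as 0."""
    return [1 if marked[2 * k] > marked[2 * k + 1] else 0 for k in range(n_bits)]

spaces = [4.0] * 8          # eight uniform spaces, enough for four bits
bits = [1, 0, 0, 1]
marked_spaces = embed_word_shift(spaces, bits)
decoded = detect_word_shift(marked_spaces, len(bits))
```

On justified text, where spaces are not uniform to begin with, this blind comparison would fail, which is one reason non-blind detection is also considered.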
Algorithms that exploit the printed text document to identify the source printer were also developed [50]. These methods use print quality defects, such as the banding features of a printed text document, as an intrinsic signature of the printer. These features can identify the specific make and model of the device that created the document. Cox et al. [51] described a number of applications of digital watermarking and their common properties, such as robustness, tamper resistance, fidelity, computational cost, and false positive rate. They observed that these properties vary greatly depending on the application. They also described seven applications of watermarking: broadcast monitoring, owner identification, proof of ownership, authentication, transactional watermarks, copy control, and covert communication.

Yang and Kot [52] proposed a method for watermarking text document images to authenticate the owner or authorized user. The method uses the integrated inter-character and inter-word spaces for watermark embedding. An overlapping component of size three is utilized, whereby the relationship between the left and right spaces of a character is employed for watermark embedding. The integrity of the document can be ensured by comparing the hash value of the character components of the document before and after watermark embedding, an idea that can be applied to other line-shifting and word-shifting methods as well. Chao et al. [53] proposed a steganographic method to embed secret information into text files by making slight modifications to scattered inter-word spaces of text formatted with the popular typesetting tool TeX. Qadir and Ahmad [54] suggested an intelligent encoding scheme that alters neither the syntax nor the layout of the document, thus providing a layout/format-independent technique in which information within the text is manipulated to hide certain information. Abdullah and Wahab [55] presented a text watermarking scheme targeting an object-based environment, in which each text string is treated as a separate object with its own set of properties. Taking advantage of the z-ordering of objects, the watermark is applied along the z-axis, causing no loss of fidelity in the text.

Villan et al. [56] analyzed the theoretical and practical aspects of text data hiding in printed documents. Mikkilineni et al. [57][58][59] worked to enhance the data hiding and watermark embedding capacity of printed paper documents. Micic et al. [60] proposed an algorithm for the authentication of text documents using digital watermarking, in which text document images were compared to evaluate changes. Xingming Sun and his team proposed a component-based digital watermarking algorithm for Chinese texts [61]. Li and Dong [62] proposed an algorithm for Chinese text watermarking based on the structure of Chinese characters. Another text watermarking algorithm using eigenvalues was also proposed [63]. Culnane et al. [64] proposed a binary text watermarking algorithm using continuous line embedding. Zhou et al. [65] presented a zero-watermarking algorithm for content authentication of Chinese text documents.

3.5.2 Syntactic Approach


Text is made up of characters, words, and sentences, and sentences have different syntactic structures. Applying syntactic transformations to the text structure to embed a watermark has been one of the approaches to text watermarking. Mikhail J. Atallah et al. first proposed a natural language watermarking scheme using the syntactic structure of text [17][18][66], in which a syntactic tree is built and transformations are applied to it to embed the watermark while preserving the inherent properties of the text. They developed techniques for embedding a robust watermark in text by combining a number of information assurance and security techniques with the advances and resources of natural language processing. For watermark embedding, they used manipulations of the TMR (text meaning representation), such as grafting, pruning, and substitution. These methods are resistant to many attacks but change the text to a large extent. Hence they cannot be applied to text of a sensitive nature such as poetry, legal documents, transcripts, and contracts. Natural language processing (NLP) techniques are used to analyze the syntactic and semantic structure of the text while performing any transformation to embed the watermark bits.


Figure 3.2 Syntactic sentence level watermarking [68]

Kankanhalli and Hau [67] proposed a method to watermark electronic text documents using the ASCII characters and punctuation in the text. Hassan et al. proposed a natural language watermarking algorithm that performs morpho-syntactic alterations to the text [68]. The text is first transformed into a syntactic tree diagram in which the hierarchies and functional dependencies are made explicit, and the watermark is then embedded. The watermarking process is shown in Figure 3.2. The authors stated that agglutinative languages like Turkish are easier to watermark than English. Watermarking solutions for languages such as Turkish, Korean, Arabic, and Urdu are efficient since these languages provide room for watermark embedding. However, the syntactic solutions for English are not adequate. Hassan et al. also proposed 21 syntactic tools for text watermarking [69], and Mi-Young Kim [70] recently proposed an algorithm for text watermarking using syntactic analysis of plain text. Kim [71] also proposed a natural language watermarking algorithm for Korean using adverbial displacement. Helge Hoehn proposed a natural language watermarking algorithm that relies on semantic and syntactic transformations of the original text content rather than on modifying the text [72].


Murphy and Vogel [73] presented three natural language marking algorithms using shallow parsing techniques, lexical substitution, and swapping. They also analyzed the significance of automated and reversible syntactic transformations for hiding data in plain text [74].

3.5.3 Semantic Approach


Semantic watermarking schemes use the semantic structure of text to embed the watermark. Text content, verbs, nouns, words and their spellings, acronyms, sentence structure, grammar rules, etc. have been exploited to insert a watermark in text, but none of these schemes proved resilient, and they degrade the quality of the text to a large extent. Atallah et al. were the first to propose semantic watermarking schemes, in 2000 [17][18][75][76]. Later, the synonym substitution method was proposed, in which the watermark is embedded by replacing certain words with their synonyms [19]. Xingming et al. proposed a noun-verb based technique for text watermarking [77], which exploits nouns and verbs in a sentence parsed with a grammar parser using semantic networks. Mercan et al. proposed a sentence-based text watermarking algorithm [78] that relies on multiple features of each sentence and exploits the notion of orthogonality between features. Later, Mercan et al. proposed a text watermarking algorithm that uses idiosyncrasies to embed the watermark [20]. This algorithm makes clever use of the typing errors, acronyms, and abbreviations that are common in cursory text such as emails, blogs, chat, and SMS. Algorithms were also developed to watermark text using the linguistic semantic phenomenon of presupposition [21][22] by observing discourse meanings and representations. A presupposition is implicit information that is considered well known. Presuppositions are identified, and transformations such as passivization, topicalization, extraposition, and preposing are then applied to embed the watermark in the text.
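The synonym substitution idea mentioned above can be sketched with a tiny synonym table in which the first word of each pair encodes bit 0 and the second encodes bit 1. The table, the function names, and the restriction to lowercase text without punctuation are assumptions made for illustration; the method of [19] relies on proper linguistic resources rather than a hand-written list.

```python
# Hypothetical synonym table: (word encoding 0, word encoding 1).
SYNONYM_PAIRS = [("big", "large"), ("quick", "fast"),
                 ("begin", "start"), ("buy", "purchase")]

BIT_OF = {}    # carrier word -> bit it encodes
PARTNER = {}   # carrier word -> its synonym
for zero, one in SYNONYM_PAIRS:
    BIT_OF[zero], BIT_OF[one] = 0, 1
    PARTNER[zero], PARTNER[one] = one, zero

def embed(text, bits):
    """Replace carrier words so that they spell out `bits`, in reading order."""
    words = text.split()
    remaining = list(bits)
    for i, word in enumerate(words):
        if remaining and word in BIT_OF:
            bit = remaining.pop(0)
            if BIT_OF[word] != bit:
                words[i] = PARTNER[word]
    return " ".join(words)

def extract(text):
    """Read one bit from every carrier word found in the text."""
    return [BIT_OF[w] for w in text.split() if w in BIT_OF]

cover = "we begin with a quick test of the big idea"
marked_text = embed(cover, [1, 0, 1])
# marked_text == "we start with a quick test of the large idea"
```

An attacker who randomly swaps synonyms in `marked_text` changes the extracted bits, which illustrates why such schemes are fragile against synonym substitution attacks.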


Text pruning and grafting algorithms were also developed in the past. An algorithm based on text meaning representation (TMR) strings has recently been proposed [79]. Shirali-Shahreza et al. [80] proposed a method for the secret exchange of information through SMS by using abbreviation text steganography based on the invented language of SMS texting. They also proposed a method for steganography in English text that exploits the fact that some English words are spelled differently in the US and the UK: the US and UK spellings of such words are substituted for one another in order to hide data [81]. Later, Rafat [82] proposed an enhanced method for SMS steganography using the SMS-texting language, which removes the static nature of the word-abbreviation list and introduces computationally lightweight XOR encryption. Das [83] proposed an enhanced buyer-seller watermarking protocol based on a public-key encryption standard, which is secure and flexible. Jonathan et al. [84] and Robert [85] provided studies and surveys of digital watermarking techniques for text, image, and video documents. Zhang et al. [86] explored the application of text watermarking in digital reading, in which copyright holders are compensated for any copyright violation.
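Two of the ingredients mentioned above, hiding bits in US/UK spelling variants and applying a light XOR step to the payload, can be combined in a small sketch. The combination, the spelling list, the key, and all function names are assumptions for illustration; this is not the method of [81] or [82].

```python
from itertools import cycle

# Hypothetical carrier list: the US spelling encodes 0, the UK spelling encodes 1.
SPELLINGS = [("color", "colour"), ("center", "centre"),
             ("favorite", "favourite"), ("gray", "grey")]
US_TO_UK = dict(SPELLINGS)
UK_TO_US = {uk: us for us, uk in SPELLINGS}

def xor_bits(bits, key_bits):
    """XOR the payload with a repeating key; applying it twice restores the payload."""
    return [b ^ k for b, k in zip(bits, cycle(key_bits))]

def hide(text, bits):
    """Choose the US or UK spelling of each carrier word according to the next bit."""
    out, queue = [], list(bits)
    for word in text.split():
        us = UK_TO_US.get(word, word)  # normalise carrier words to US spelling
        if us in US_TO_UK and queue:
            word = US_TO_UK[us] if queue.pop(0) else us
        out.append(word)
    return " ".join(out)

def reveal(text):
    """Read 0 for each US spelling and 1 for each UK spelling."""
    bits = []
    for word in text.split():
        if word in US_TO_UK:
            bits.append(0)
        elif word in UK_TO_US:
            bits.append(1)
    return bits

secret, key = [1, 1, 0, 1], [1, 0]
cover_text = "my favorite color is gray near the center"
stego = hide(cover_text, xor_bits(secret, key))
recovered_secret = xor_bits(reveal(stego), key)
```

The XOR step only scrambles the bit pattern; a spell checker that normalises the text to one spelling convention would still erase the hidden data.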

3.6 Drawbacks
Text watermarking algorithms using binary text images are not robust against re-typing and text reproduction attacks. With the increasing and efficient use of OCR (optical character recognition) nowadays, these methods fail completely. OCR destroys the changes made by shifting lines and words, and the changes made to the document margins, fonts, serifs, and other features of the text. The watermark can also be easily destroyed by simply copying the text and pasting it into Notepad.

Text watermarking using syntactic structure combined with natural language processing algorithms is an efficient approach, but research progress in NLP is very slow. Syntactic sentence paraphrasing can make a sentence unnatural. Syntactic techniques also require high-performance syntactic analyzers. The transformations applied using NLP algorithms are, most of the time, non-reversible.


Semantic text watermarking techniques significantly improve the information hiding capacity of English text by modifying the granularity of meaning of individual terms/sentences, but semantic text watermarking schemes are very conceptual and impractical. Synonym-based techniques are not resilient to random synonym substitution attacks. There may be cases where the wrong words get selected for synonym substitution. Moreover, synonym-based methods require a large synonym dictionary and a huge collocation database. The sensitive nature of some text, such as legal documents, poetry, and quotations, does not allow random semantic transformations, because the semantic connotation as well as the value of the text must be preserved under any transformation. In addition, text watermarking based on semantics is language dependent, and language is not static. Language changes with the passage of time, and hence the security and copyright solution provided by semantics-based digital watermarking has limited strength. Semantic techniques for digital watermarking use natural language processing algorithms to analyze text meaning and to perform transformations. NLP is an immature area of research; hence, text watermarking using semantics does not provide a practical and complete text watermarking solution.

3.7 Summary
In this chapter, text document watermarking is described in detail. The application areas, the challenges, and the possible attacks are also described. It is observed that the text watermarking methods proposed so far for English text lack robustness, integrity, accuracy, and generality. Also, the amount of work done on text watermarking is very limited and specific. Text watermarking algorithms using binary text images are not robust against text reproduction attacks and have limited applicability. Similarly, text watermarking using the syntactic and semantic structure of text is not robust against random tampering attacks. These algorithms are specific to an application area and/or a language, with limited applicability and usability. The previous techniques are computationally expensive and non-robust. Text, being an important medium of information exchange, requires complete protection. Text that encounters massive insertion, deletion, and re-ordering attacks needs to be protected from copyright violators. Therefore, efficient and practical text watermarking algorithms are required.
