

IEEE TRANSACTIONS ON SYSTEMS, MAN, AND CYBERNETICS, PART C: APPLICATIONS AND REVIEWS, VOL. 38, NO. 3, MAY 2008

A New Model for Secure Dissemination of XML Content


Ashish Kundu, Student Member, IEEE, and Elisa Bertino, Fellow, IEEE
Abstract: The paper proposes an approach to content dissemination that exploits the structural properties of an Extensible Markup Language (XML) document object model in order to provide efficient dissemination while at the same time assuring content integrity and confidentiality. Our approach is based on the notion of encrypted postorder numbers, which support the integrity and confidentiality requirements of XML content and facilitate efficient identification, extraction, and distribution of selected content portions. By using this notion, we develop a structure-based routing scheme that prevents information leaks in XML data dissemination and assures that content is delivered to users according to the access control policies, that is, policies specifying which users can receive which portions of the contents. Our proposed dissemination approach further enhances such structure-based, policy-based routing by combining it with multicast in order to achieve high efficiency in terms of bandwidth usage and speed of data delivery, thereby enhancing scalability. Our dissemination approach thus represents an efficient and secure mechanism for use in applications such as publish-subscribe systems for XML documents. The publish-subscribe model restricts the consumer and document source information to the routers with which they register. Our framework facilitates dissemination of contents with varying degrees of confidentiality and integrity requirements in a mix of trusted and untrusted networks, which is prevalent in current settings across enterprise networks and the web. Also, it does not require the routers to be aware of any security policy, in the sense that the routers do not need to implement any policy related to access control.

Index Terms: Encryption, Extensible Markup Language (XML), postorder traversal, preorder traversal, publish-subscribe, security, structure-based routing, trees.

Manuscript received December 12, 2006; revised April 20, 2007. This work was supported in part by the National Science Foundation under Grant 0430274 and in part by the sponsors of CERIAS. This paper was recommended by Guest Editors P. Hung, M. Alesky, and Z. Milosevic. The authors are with the Center for Education and Research in Information Assurance and Security (CERIAS) and the Department of Computer Science, Purdue University, West Lafayette, IN 47907 USA (e-mail: ashishk@cs.purdue.edu; bertino@cs.purdue.edu). Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org. Digital Object Identifier 10.1109/TSMCC.2008.919213

1094-6977/$25.00 © 2008 IEEE

Authorized licensed use limited to: GOVERNMENT COLLEGE OF TECHNOLOGY. Downloaded on January 22, 2009 at 02:31 from IEEE Xplore. Restrictions apply.

I. INTRODUCTION

THE PROBLEM of content dissemination in enterprise settings as well as web settings has been widely investigated, and various dissemination techniques have been proposed [1], [2]. Recently, however, the transformation and growth of enterprise networks into frameworks more dynamic than a passive repository of content, as well as the increasing ubiquity of services, have contributed to the significance and complexity of this problem. The evolution of the Extensible Markup Language (XML) [3] and its influence on data models toward XML-ization has made the document object model (DOM) [4] a de facto standard for content representation. Enterprise computing paradigms have been using the XML DOM as the primary standard for data representation. Web services in intraenterprise and interenterprise networks are being adopted as the components for distributed computing; these web services are primarily XML-based services. Recent developments in content-network appliances provide technology that can be used at the network level to efficiently filter and distribute contents to interested parties in a possibly very large distributed system.

Efficiency and scalability must, however, be provided while at the same time assuring the security of contents and the privacy of the parties acquiring and disseminating contents. It is useless to provide high-bandwidth content distribution systems if the integrity of the disseminated contents is not assured or the ownership of the contents is not protected. Such problems are further complicated when dealing with contents encoded in XML because, owing to the hierarchical organization of the content, different confidentiality and integrity requirements may exist for different portions of the same content. Data that a consumer is not authorized to access, but that belongs to the complete data set, is called extraneous data. Flow of extraneous data to a consumer may leak information, even when this data is encrypted. In particular, extraneous data is prone to off-line dictionary attacks, even by a legitimate consumer who can exploit contextual knowledge from the data elements it has access to. Therefore, it is important that extraneous data, even if encrypted with keys that the consumer does not have, be removed from the content before its delivery.

We thus need a dissemination approach, specifically tailored to XML, that addresses the issues of security, privacy, and scalability in a holistic manner. Relevant requirements for such a dissemination approach include the following.

1) Access control: A consumer must be provided with only the data set that it is permitted to access.
2) Data integrity: Not only must the integrity of the received data be verifiable by the consumer, but any compromise of the data must also be precisely determined.

An XML-based data instance, when represented using the DOM, has an underlying tree [5] structure. Each node of such a tree refers to an XML entity. A tree is a nonlinear, acyclic structure with a set of nodes and edges; there is a special node called the root with no incoming edges, and every other node has exactly one incoming edge. Fig. 1(a) shows a tree, the abstract representation of an XML document. Section II-A elaborates on the XML data model.

Fig. 1. (a) Tree. Abstract representation of an XML document. (b) Postorder numbers associated with each node. (c) Encrypted postorder numbers associated with each node.

In this paper, we develop an approach that addresses the problem of content dissemination in enterprise and cross-enterprise networks; our solution satisfies the outlined requirements in a holistic manner. The proposed dissemination model exploits various structural properties of the XML DOM in order to support access control, integrity, and privacy requirements. The use of structural properties favors the efficiency and scalability of the dissemination framework. Our solution is based on the simple notion of postorder numbering [5] and its properties. By using this notion, we develop a novel content routing scheme called structure-based routing of XML data. This routing scheme prevents information leaks and at the same time improves the efficiency and scalability of the structure-based dissemination model. A key feature of our approach is that it directly takes into account access control policies, that is, policies specifying which entity can access which portion of the contents, so that contents are disseminated according to these policies. The resulting dissemination model is a multicast model for XML dissemination (Fig. 2) that, based on the content structure and access control policies, builds an overlay topology. Moreover, we exploit the properties of postorder numbers (PONs) for integrity assurance. Our technique allows consumers to verify the integrity of the data they receive and, when data have been tampered with, to determine the affected portions of the data.

Fig. 2. Security requirements in XML dissemination.

In what follows, we refer to XML documents as documents, trees, data, or content trees. A subtree refers to a subdocument in terms of the DOM. An element in the DOM refers to a vertex or a node in the corresponding tree representation. The terms user and consumer are used synonymously.

A. Main Contributions

The main contributions of this paper can be summarized as follows:

1) extensions of the notion of postorder numbering to derive a family of secure structural identifiers called encrypted PONs that overcome the vulnerabilities of postorder numbering while preserving all its desirable properties;
2) applications of encrypted postorder numbering to the verification of content integrity and to the prevention of information leaks in order to assure data confidentiality;
3) a novel method for structure-based routing that can be used in nontrusted domains for dissemination of sensitive content.

B. Outline of the Paper

Section II presents some simple observations on XML data models and PONs. Section III introduces the notion of encrypted PONs. Document encoding and encryption techniques using encrypted PONs are defined in Section IV. Based on the structural encoding of the documents, structure-based routing and a dissemination model for XML documents are proposed in Section V. Section VI analyzes the proposed dissemination model with respect to the requirements described in Section I. Section VII discusses the related work, and Section VIII concludes the paper.

II. SOME SIMPLE OBSERVATIONS

In this section, we discuss the properties of XML data and postorder numbers.

A. XML Data Model

The DOM is the commonly used model for representing XML-based languages [4]. The DOM organizes data as a rooted tree. In what follows, document refers to such a tree and document-root refers to its root. Moreover, element refers to an intermediate node in the tree. The content of a node includes attr, documenttype, and DOMimplementation [6]. Each node that is one of the following (text, CDATAsection, processinginstruction, comment) is a leaf node in the tree. An entity refers to a nonroot node in the tree.

Let D be an XML data instance organized according to the DOM representation. Let T(V, E) be a tree representing the document D; V and E denote the set of nodes (vertices) and the set of edges of D, respectively. Let x be a node in V. Let $D_x$ denote the subtree of D rooted at x. Some or all of the nodes in an XML data instance contain content. The content of a node x is referred to as $content_x$; it contains only the content specific to x and not that of other nodes. The relation between parent and child nodes is represented by directed edges, directed from parents to children. In what follows, ancestor(x) denotes the set of ancestors of x.

The dissemination of a document exploits the following structural properties in order to meet the requirements of secure and scalable dissemination of XML data.

1) XML data is order-preserving, that is, nodes x and y have an order among them in D.
2) The unit of data access is the subtree representation of a subdocument. The smallest unit is a node.
3) Any element and its corresponding subdocument are accessible by themselves or through a subtree rooted at any of their ancestors.


These properties are crucial in ensuring that the structure-based routing extracts the correct subdocument and routes it to the correct consumer; they are reflected by postorder numbers, discussed in the next section.

B. Postorder Numbers

PONs are the numbers assigned to the nodes of a tree according to the postorder traversal of the tree [5]. Let $p_x$ denote the PON of node $x \in V$, where V is the set of vertices of an XML document D. A number is assigned to a node only when each of its children has been assigned a PON. The children of a node are assigned PONs in left-to-right order. The highest PON is |V| and the lowest is 1. If z is the parent of x and x is the last child of z to be visited in the postorder traversal before z is traversed, then $p_z = p_x + 1$. Fig. 1(b) shows the PONs assigned to each node of the tree in Fig. 1(a). We now present some important properties of PONs.

1) (PON-I): $p_x$ uniquely refers to x and the subdocument $D_x$ in D.
2) (PON-II): Let z be the parent of x. Then, $p_x < p_z$.
3) (PON-III): Let $p^{x}_{\mathrm{lowest}}$ be the lowest PON of any element in the subdocument $D_x$; let u be a descendant of x. Then, $p^{x}_{\mathrm{lowest}} \le p_u \le p_x$.
4) (PON-IV): Let x and y be left and right children of z. Then, $p_x < p_y$.

The first property assures that we can identify and extract a specific subdocument in a document. The second property is the basis for reasoning about the relation between a parent and a child. The third property imposes a lower and an upper bound on the possible PON of any element in a subtree. It is useful in identifying whether a new node has been added to the document and which one it is. The fourth property is used to determine whether any siblings have been swapped in the received subdocument.

However, the general notion of the PON has some drawbacks with respect to security. The PON of a node x indicates the number of nodes in the subdocument $D_x$. This is undesirable, especially when the consumer is permitted to access only a subset of the subdocument $D_x$. The consumer may be able to infer additional and possibly sensitive information regarding the size of the document. This violates the confidentiality and privacy requirements that the dissemination model aims to meet. Moreover, PONs are predictable, given their values and the distance between consecutive numbers. They can thus be generated by an attacker, because they always lie in the range from 1 to the total number of elements in the document. This makes the data vulnerable to tampering. Therefore, we need a notion, equivalent to the notion of PON, that overcomes the aforementioned drawbacks and still exhibits all the good properties of PONs. The next section describes such a notion and proposes a mechanism to encode the XML data such that information leaks are prevented.

III. ENCRYPTED POSTORDER NUMBERS

The notion of the encrypted postorder number (EPON) is derived from the general notion of PON and overcomes the security-related flaws of solutions based on the use of PONs.
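As a baseline for what EPONs must preserve, the plain postorder numbering of Section II-B can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Node` class, the helper names, and the five-node tree are assumptions introduced here, and the tree is not the one in Fig. 1.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node; `pon` is filled in by assign_pons() (hypothetical type)."""
    name: str
    children: list["Node"] = field(default_factory=list)
    pon: int = 0


def assign_pons(root: Node) -> int:
    """Number nodes 1..|V| in postorder: children left to right, then the parent."""
    counter = 0

    def visit(node: Node) -> None:
        nonlocal counter
        for child in node.children:
            visit(child)
        counter += 1
        node.pon = counter

    visit(root)
    return counter


def lowest_pon(node: Node) -> int:
    """Lowest PON in the subtree rooted at `node`, i.e. that of its leftmost leaf."""
    while node.children:
        node = node.children[0]
    return node.pon


# Hypothetical tree: z has children x and y; x has children a and b.
a, b, y = Node("a"), Node("b"), Node("y")
x = Node("x", [a, b])
z = Node("z", [x, y])
total = assign_pons(z)

for n in (a, b, x, y, z):
    print(n.name, n.pon)
```

On this tree, properties PON-II to PON-IV can be checked directly, and the size of the subtree at x is recoverable as `x.pon - lowest_pon(x) + 1`, which is the kind of inference the drawbacks above are concerned with.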

A. Computation

Let $\{p_1, p_2, \ldots, p_n\}$ be a set of PONs for an XML data instance. Each $p_i$, $i = 1, \ldots, n$, is combined with a unique random number $r_i$. The combined values are then encrypted by using an order-preserving encryption function [7]. The resulting set of numbers is the set of EPONs, and each of them is an EPON specific to a data node in the XML document. The EPON for node x is denoted by $e_x$. By combination, we mean addition, concatenation, or some other possible combination operation. The random values are chosen so that the combined values preserve the order of the PONs. The encryption process encrypts these numbers in such a way that the ordering among the entities is preserved. The random values associated with the PONs follow a strictly increasing order, with the lowest random value being associated with the lowest PON.

Let x, y, and z be nodes such that x and y are children of z; let $p_x$, $p_y$, and $p_z$ be their PONs, respectively. The random values would be $r_x$, $r_y$, and $r_z$, respectively. By definition of PONs (Section II), $p_z > p_x$ and $p_z > p_y$. Accordingly, $r_z > r_x$ and $r_z > r_y$, that is, the order of the PONs is preserved by the random numbers. $r_x$ and $r_y$ should be chosen so that no other relation can be established between them.

B. Properties of EPON

EPONs preserve the order of PONs; therefore, the properties characterizing EPONs are identical to those of PONs. We refer to the properties of EPONs as EPON-properties.

1) (EPON-I): $e_x$ uniquely determines the location of x in D, and $e_x$ uniquely determines the subdocument $D_x$.
2) (EPON-II): Let z be the parent of x. Then, $e_x < e_z$.
3) (EPON-III): Let $e^{x}_{\mathrm{lowest}}$ be the lowest EPON of any element in the subdocument $D_x$; let u be a descendant of x. Then, $e^{x}_{\mathrm{lowest}} \le e_u \le e_x$.
4) (EPON-IV): Let x and y be left and right children of z. Then, $e_x < e_y$.

EPONs can be computed as follows. While traversing the content tree in postorder and assigning PON $p_x$ to node x, a random number $r_x$ is generated such that, if node y is the node visited just prior to x, then $r_x \ge r_y$. Once all nodes have been visited, all the combined values (of $p_x$ and $r_x$) are encrypted with the order among them being preserved. The algorithm is presented as follows.

Traverse the content tree in postorder.
Let the current node be x.
Let $p_x$ be the PON for x.
Generate a random number $r_x$ such that, if node y is the last node visited prior to x, then $r_x \ge r_y$.
If all the nodes have been visited, then $\forall x$, encrypt the combination of $p_x$ and $r_x$ with their order preserved.

IV. DOCUMENT ENCODING AND ENCRYPTION

In this section, we introduce the notion of structural identifiers for nodes in an XML document.
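Before moving on to structural identifiers, the EPON computation of Section III-A can be sketched as below. Several things here are assumptions made only for illustration: combination is done by addition, the random values are made strictly increasing by adding a positive random gap at each step, and `toy_order_preserving_encrypt` is a placeholder affine map standing in for the order-preserving encryption function of [7]. The placeholder is strictly increasing but offers no security.

```python
import secrets


def toy_order_preserving_encrypt(value: int, key: tuple[int, int]) -> int:
    """Placeholder only: a strictly increasing affine map, NOT a secure cipher."""
    slope, offset = key  # hypothetical key format; slope must be >= 1
    return slope * value + offset


def compute_epons(num_nodes: int, key: tuple[int, int], max_gap: int = 1000) -> dict[int, int]:
    """Map each PON 1..num_nodes to an EPON while preserving the PON order."""
    combined = {}
    r = 0
    for pon in range(1, num_nodes + 1):      # nodes in postorder visit order
        r += 1 + secrets.randbelow(max_gap)  # r_x exceeds r_y of the node visited before x
        combined[pon] = pon + r              # combination by addition (one of the allowed choices)
    # Encrypt only after every node has been visited, as in the algorithm above.
    return {pon: toy_order_preserving_encrypt(c, key) for pon, c in combined.items()}


epons = compute_epons(num_nodes=5, key=(7, 13))
values = [epons[p] for p in range(1, 6)]
print(values)
```

Because the gaps are random, consecutive EPONs no longer differ by a fixed amount, so the difference between a node's EPON and the lowest EPON in its subtree does not directly reveal the subtree size.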


TABLE I ENCODING OF XML TREE IN FIG. 1(a)

A. Structural Identifier

Let z be a node; its structural identifier, referred to as $S_z$, is defined as a pair $(e_z, e^{z}_{\mathrm{lowest}})$, where $e_z$ is the EPON associated with z, and $e^{z}_{\mathrm{lowest}}$ is the lowest EPON for any node in the set of descendant nodes of z. The structural identifier is unique for each node in a document. $S_z$ not only uniquely identifies the document element referred to as z, but also identifies the subtree of z. The second component of $S_z$ facilitates the unique identification of those elements that belong to the subtree.

B. Integrity Identifier

The content of a node includes the attributes of an XML element, but does not include any of its descendants. In the definition of an integrity identifier, the content of a node is bound to the node using its structural identifier. A hash of the concatenation of the content and the structural identifier is generated. The resulting hash value, referred to as the local hash (LH), is the integrity identifier of the node x, denoted by $I_x$. Thus, $I_x = H(S_x, content_x)$, where $S_x$ is the structural identifier of x, $content_x$ is the content of x, and H is a one-way collision-resistant hash function.

C. Document Encoding

The well-defined structural entity in hierarchically organized content such as a document is intuitively a subtree. Each node x in a document has encoding information $C_x$ defined as a tuple: $C_x = (S_x, I_x)$, where $S_x$ is the structural identifier and $I_x$ is the integrity identifier of x.

1) Properties of Encoding Tuples: Each node x with parent z in a content tree is encoded with the tuple $(C_x, S_z)$. If x is the root, then its encoding is $C_x$. Such an encoding tuple facilitates the verification of the structural integrity of the content. It also facilitates data manipulation operations based on the structural identifier: content identification, extraction, and composition. The encoding tuple of a node also contains the structural identifier of its parent. The parent identifiers are used to verify that data have not been compromised and to detect swapping of data elements.

Table I shows the encoding of each node in the XML tree in Fig. 1(a). For simplicity, the integrity identifier is not enumerated, as it involves a hash value.

Lemma 4.1: Let x and z be nodes in an XML document such that z is an ancestor of x; let $e_x$ and $e_z$ be their respective EPONs. Let $e^{x}_{\mathrm{lowest}}$ and $e^{z}_{\mathrm{lowest}}$ be the lowest EPONs in the subtrees rooted at x and z, respectively. Then $e^{x}_{\mathrm{lowest}} \ge e^{z}_{\mathrm{lowest}}$.

Proof: z is an ancestor of x in the XML document. Therefore, $D_x \subseteq D_z$. The lemma follows from the properties of EPONs.
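The identifiers of Sections IV-A to IV-C can be sketched as follows. This is an illustrative sketch under stated assumptions: SHA-256 stands in for the hash function H, the byte serialization of $(S_x, content_x)$ is invented here, a leaf is taken to have its own EPON as its lowest EPON, and the EPON values in the example tree are made up rather than taken from Table I.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Node:
    """Hypothetical content-tree node carrying a precomputed EPON."""
    epon: int
    content: str = ""
    children: list["Node"] = field(default_factory=list)


def structural_id(node: Node) -> tuple[int, int]:
    """S_x = (e_x, lowest EPON in the subtree of x); a leaf uses its own EPON (assumption)."""
    lowest = node.epon
    stack = list(node.children)
    while stack:
        n = stack.pop()
        lowest = min(lowest, n.epon)
        stack.extend(n.children)
    return (node.epon, lowest)


def integrity_id(sid: tuple[int, int], content: str) -> str:
    """I_x = H(S_x, content_x), with SHA-256 as H and an assumed serialization."""
    return hashlib.sha256(f"{sid[0]}|{sid[1]}|{content}".encode("utf-8")).hexdigest()


def encode(root: Node) -> dict[int, tuple]:
    """Map each EPON to (C_x, S_z) with C_x = (S_x, I_x); S_z is None for the root."""
    table = {}

    def visit(node: Node, parent_sid) -> None:
        sid = structural_id(node)
        table[node.epon] = ((sid, integrity_id(sid, node.content)), parent_sid)
        for child in node.children:
            visit(child, sid)

    visit(root, None)
    return table


# Hypothetical tree with made-up EPONs: 90 -> (40 -> (11, 25), 62).
root = Node(90, "root", [Node(40, "x", [Node(11, "a"), Node(25, "b")]), Node(62, "y")])
table = encode(root)
for epon, ((sid, _), parent_sid) in sorted(table.items()):
    print(epon, sid, parent_sid)
```

Recomputing `structural_id` for every node makes this sketch quadratic in the worst case; a single postorder pass that propagates the lowest EPON upward would avoid that, at the cost of a slightly longer listing.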

Fig. 3. Three subtrees of the content tree are shared with three consumers: consumer 1, 2, and 3.

Lemma 4.1 defines the basis for the use of the structural identifier of a node in the encoding tuple of each of its child nodes.

D. Document Encryption

Document encoding is followed by document encryption. Since our dissemination technique delivers only the content that is accessible to a user, we do not need a hierarchical encryption scheme as proposed in [8]. Each encoded node is encrypted using a key that is shared between the producer and the consumer. If the content routers are trusted, the key may be shared between the routers and the consumers. After encryption, each document node x is represented as $(S_x, S_z, E^{x}_{s})$, where $S_z$ is the structural identifier of z, the parent of x, and $E^{x}_{s}$ is the value resulting from the encryption of x.

V. STRUCTURE-BASED ROUTING

We propose a multicast-based approach to disseminate XML-based data among the consumers. Fig. 3 shows multiple consumer requests to access an XML tree. Consumer 1 has access to subtree $T_1$, consumer 2 has access to $T_2$, and consumer 3 has access to $T_3$. For dissemination of the subtrees among the various consumers, a multicast topology based on the structure of the tree is proposed. The multicast topology is built dynamically and asynchronously using a publish-subscribe methodology. The publish-subscribe-based multicast network uses structure-based routing.

Structure-based routing involves the following entities: the document source is the document producer or a trusted owner of the document, has full access to the original document, and is the root of the multicast overlay network; the publisher publishes the data to a set of subscribers; the subscriber subscribes to the data and sends its request to a router-based publisher; the router routes the specific portion of the data to consumers and other routers. A router is both a publisher and a subscriber. The document source is a publisher. A consumer is a subscriber. A consumer is said to be associated with a router for a specific document if it has subscribed to that document through that router. For simplicity of discussion, we assume one document source; however, the proposed solution can handle multiple document sources. A parent router of another router is one from which the latter receives some content. A child router is defined conversely.


We assume that documents are identified by a valid uniform resource identifier (URI) [9] or any other naming scheme suitable for the enterprise. The owner of the document can itself carry out publishing or can delegate the publishing functionality to one or more other entities. The publishing routers propagate this information to their neighbor routers.

Let $D_z$ and $D_q$ be subdocuments such that q is a descendant of z. Thus, $D_q \subseteq D_z$. Let R refer to any router that is reachable from another router $R_z$. $D_z$ is the maximal structural block at a router $R_z$ if and only if $R_z$ or any router R reachable from $R_z$ has only those consumers that have access to only $D_z$ or $D_q$, for any q that is a descendant of z in D. Each router is aware of the maximal structural block that it is responsible for routing collectively to all of its subscribers (consumers and routers).

a) Example: Let D be represented by the tree T [T is shown in Fig. 1(a)]. Let $R_1$ be a router. Consumers $u_1$ and $u_2$ have subscribed to $R_1$ for document D. $u_1$ and $u_2$ have access to the subdocuments represented by $T_z$ and $T_x$, respectively. Let $R_2$ be a router reachable from $R_1$. It has a subscriber $u_3$, which has access to the subdocument $T_x$. Therefore, the maximal structural block of $R_1$ is $T_z$ and that of $R_2$ is $T_x$. If a new consumer subscribes to $R_1$ with access to T itself, then the maximal structural block of $R_1$ becomes T.

The multicast topology with root at router $R_z$ disseminates content to a set of consumers collectively such that none of them has access or has subscribed to a subtree $D_m$ of D where $D_z \subset D_m$. $D_z$ can be identified through the EPONs. Let $e_z$ and $e_m$ be the EPONs of $D_z$ and $D_m$, respectively; then, $e_z < e_m$, by definition of EPON. Router $R_z$ routes (or publishes) only $D_z$ and its subtrees. $R_z$ identifies its maximal structural block through its EPON $e_z$, which we call the publishing EPON (PEPON). A PEPON is the EPON of a maximal structural block being published from the corresponding router. For router $R_z$, $e_z$ is a PEPON.
Routing is carried out on a multicast topology that takes the form of either a tree or a directed acyclic graph (DAG). In Fig. 5, the PEPONs for the two routers (between the consumer and the producer) are 43 and 96.

Access permissions on the content for a consumer are expressed on a node in the document. Access permissions for a consumer u on a document d, denoted by $L^u$, are represented as an allowedset defined as {$S_x$ | the consumer has access to node x and $S_x$ is the structural signature of x}.

A. Content Routers

By content routers, we refer to content distributors or brokers. Such a router is an application-level router that routes documents. In what follows, the notation {x+} or {⟨x⟩+} denotes a nonempty set of elements of type x. Every router R is aware of the following information:

1) {⟨c-id, c-credentials, document URI, permissions, callback-address⟩+}, where c-id is the id of the consumer subscribed to the document accessible at document URI, c-credentials includes the parameters needed for authentication of c-id, and permissions is the set of the nodes that a consumer has access to in the document (allowedset). The callback-address method provides a mechanism to deliver content to the consumer in case of asynchronous subscription.
2) {⟨parent router, {$S_x$+}⟩+}, where x is the root of the maximal structural block the parent router receives and $S_x$ is its structural identifier; $e_x$ in $S_x$ is the PEPON of the router.
3) {⟨child router, {$S_y$+}⟩+}, where a child router is a router that has subscribed to a subdocument with PEPON $S_y$ from R. The list of child routers with their specific PEPONs is stored at R.

b) Example: Consider Fig. 4. The router that routes the tree with PEPON 43 has two consumers, consumer 1 and consumer 2. It has neither a parent router nor a child router. The information stored at this router is shown in Table II.

Fig. 4. Routing of three subtrees to consumers using EPONs.

B. Dissemination Network

A link in the document dissemination network is between two content routers and might involve intermediate network routers. In this section, we discuss the development of the dissemination network that uses the structural identifier.

1) Subscription: The subscription process is initiated by a consumer. Upon success, the process returns to the consumer a set of structural signatures for the nodes in the document that the consumer has access to. This set is the allowedset for the consumer. A consumer determines which router to join for a specific document. A router R, upon receiving a request for subscription to a document, performs the consumer subscription if the consumer is authorized.

2) Link Setup: If a router R does not already have a known path to the document publisher to satisfy the request, it sends subscription requests to some or all of the other routers it is aware of (neighbor routers). Among many possible protocols, we propose a three-way handshake protocol to establish a subscription link between two routers. Suppose that the router R receives positive responses from $R_1$ and $R_2$. Based on the document properties and the path length from the document source to each of them, R then determines which one to choose and notifies the router(s) accordingly. Several criteria can be used for such a selection.


TABLE II INFORMATION AT THE ROUTER FOR PEPON 43 IN FIG. 5

Outline of Link Setup Protocol

1) The consumer sends the subscription request for a document, including its consumer id, credentials, and callback method, to a router R.
2) R authenticates the consumer and determines the list of signatures of the content nodes that the consumer has access to.
3) The router determines the set of subdocuments (subtrees) from the set of signatures as follows: it sorts the signatures based on the EPON in the signature; if $S_x$ is the signature for x, then the sorting parameter is $e_x$. Let the sorted set be Q, and let the set of subtrees be initially empty. Let the signature with the highest EPON in Q be $S_z$. Remove from Q each signature $S_x$, including $S_z$, such that $e_x \le e_z$ and $e_x \ge e^{z}_{\mathrm{lowest}}$; this set of signatures forms one subtree, which is added to the set of subtrees. Repeat this process until Q becomes empty.
4) If the list of accessible subtrees includes a subtree with a root having EPON $e_z$ (z being the document element) that is subsumed by the content tree served by this router, then the request is processed successfully. This is determined by checking whether $e^{h}_{\mathrm{lowest}} \le e^{z}_{\mathrm{lowest}} \le e_z \le e_h$ (Lemma 4.1), where h is the root of the content tree served by this router.
5) Otherwise, the router R sends a subscription request for the subtree rooted at $e_z$ to some or all of its neighboring routers. If the access permissions include multiple subtrees, the subscription request includes each of these subtrees.
6) Upon receiving a request from R, a router $R_i$ checks if there exists a PEPON $e_x$. If so, it returns $e_x$ with success as a response to R; otherwise, it recursively repeats the link setup procedure from $R_i$ for $e_z$ to all its neighbors.
7) Upon receiving the responses, the router R selects a parent router; the router registers the consumer and sends the response back to the client.
8) Each router determines if the new node(s) and the existing node(s) can be combined to form a complete subtree of the document. If so, it replaces all the nodes stored in the database by the lowest common ancestor of these subtrees.

C. Content Publishing

The document publication process varies based on the recipient (router or consumer). Content is published as follows. Router R receives a set of document nodes N from its ancestor (if the topology is a tree) or ancestors (if the topology is a DAG). If R has a nonempty set of consumers for the document, it forwards the document to the consumers based on their permissions. If there is a nonempty set of routers that are subscribers for some nodes in this document, then R forwards the document to these routers based on their requirements.
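Returning to steps 3) and 4) of the link setup outline, the grouping of signatures into subtrees and the subsumption check can be sketched as follows. This is one possible reading of the outline rather than the authors' code: signatures are modeled as `(e_x, e_lowest)` pairs, the function names are hypothetical, and the EPON values are made up.

```python
Signature = tuple[int, int]  # (e_x, lowest EPON in the subtree of x)


def partition_into_subtrees(signatures: list[Signature]) -> list[tuple[Signature, list[Signature]]]:
    """Step 3: repeatedly take the highest EPON as a root and pull out its subtree."""
    q = sorted(signatures, key=lambda s: s[0])
    subtrees = []
    while q:
        e_z, z_lowest = q[-1]  # signature with the highest EPON left in Q
        members = [s for s in q if z_lowest <= s[0] <= e_z]
        q = [s for s in q if not (z_lowest <= s[0] <= e_z)]
        subtrees.append(((e_z, z_lowest), members))
    return subtrees


def is_subsumed(requested: Signature, served: Signature) -> bool:
    """Step 4: the requested subtree lies inside the subtree served by the router."""
    e_z, z_lowest = requested
    e_h, h_lowest = served
    return h_lowest <= z_lowest <= e_z <= e_h


# Hypothetical allowedset: one subtree rooted at EPON 40 and a separate leaf 62.
allowed = [(40, 11), (11, 11), (25, 25), (62, 62)]
groups = partition_into_subtrees(allowed)
for root_sig, members in groups:
    print(root_sig, members)

print(is_subsumed((40, 11), (90, 11)))  # router serves the subtree rooted at EPON 90
```

The list comprehensions rescan Q on every round, which keeps the sketch short; since Q is sorted, the members of each subtree form a contiguous suffix and could be sliced off instead.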

1) Content Delivery To Consumers: For each subscribed consumer, the router determines its access permissions for the associated document. The router identies to which received content subtrees, the allowed nodes (included in allowedset) belong. This is carried out by matching the EPON ex of each Sx allowedset with the EPON of each of the roots of the received subtrees.The subtrees specic for the consumer in are then extracted from the identied content. The router then forward the subtree to the consumer after encrypting it by using the encryption technique in place, if any. In our running example, Fig. 4 shows how EPONs are used for routing of subtrees. 2) Content Delivery To Routers: The process of forwarding the document to a router is as follows. For each router in its subscriber set, a router determines the node(s) it is registered for. It identies and extracts these document nodes from the respective subtrees. They are then encrypted and sent to the subscribing router. The next section discusses the technique used for identication and extraction. 3) Content Identication and Extraction: Content identication and extraction is carried out at each router that has at least one subscriber. Each router has a list of the content subtrees it receives for a given document. The list is essentially a list of signatures of the roots of these subtrees (maximal structural blocks) that contain the EPONs (PEPONs) of these roots. The router also keeps track of the list of signatures of the roots of the subtrees, each of its subscribers (consumers or routers) has access to (Section V-A). The identication step determines the belongs-to relation among each of the content roots accessible to each consumer and the content subtrees it receives. An important property of EPONs is reported here from Section III-B. Simple EPON propertyany node y that belongs to the subtree rooted at node x is such that the ey < ex , where ey and ex are the EPONs of y and x, respectively. 
The EPON of y, $e_y$, is also greater than or equal to $e_{lowest}$, as defined in the structural identifier $S_x$ of x. The identification technique uses the simple EPON property to verify the belongs-to relation between the received content and the subscribed content. The worst-case complexity of the identification step is O(mn), where m is the number of received content subtrees and n is the number of subscribers at a given router. During the extraction step, a depth-first traversal [5] is carried out to locate the root of the subscribed content. The EPON of the subscribed content root is compared with the EPON of each visited node. If these EPONs match, the corresponding subtree is extracted. The worst-case complexity of the extraction procedure is the same as that of depth-first search, O(v + e), where v is the number of nodes and e is the number of edges in the received subtree.

D. Document Verification

The security requirements for secure dissemination of XML content are twofold (Section I): maintaining confidentiality by not sending extraneous data to a consumer (preventing information leaks) and facilitating precise verification of integrity. In this section, we focus on integrity verification of the received content at the consumer side.
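As a rough illustration of the identification and extraction steps described under Content Identification and Extraction, the following Python sketch may help. It is not taken from the paper: the names `Node`, `belongs_to`, `identify`, and `extract` are hypothetical, and plain integers stand in for EPONs on the assumption that the encryption preserves order, so the comparisons behave as they would on real EPONs.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Structural identifier (e_x, e_lowest): the node's EPON and the lowest EPON
# in its subtree. Plain integers are an assumed stand-in for encrypted values.
Signature = Tuple[int, int]


@dataclass
class Node:
    epon: int
    lowest: int
    children: List["Node"] = field(default_factory=list)

    @property
    def signature(self) -> Signature:
        return (self.epon, self.lowest)


def belongs_to(sig: Signature, root: Node) -> bool:
    """Simple EPON property; equality with the upper bound covers the root itself."""
    return root.lowest <= sig[0] <= root.epon


def identify(allowed: List[Signature], received: List[Node]) -> Dict[Signature, Node]:
    """Map each allowed root to the received subtree that contains it: O(mn)."""
    located: Dict[Signature, Node] = {}
    for sig in allowed:
        for root in received:
            if belongs_to(sig, root):
                located[sig] = root
                break
    return located


def extract(root: Node, target_epon: int) -> Optional[Node]:
    """Depth-first search for the node whose EPON equals target_epon."""
    stack = [root]
    while stack:
        node = stack.pop()
        if node.epon == target_epon:
            return node
        stack.extend(node.children)
    return None


# Two received subtrees, numbered in postorder over a seven-node document.
left = Node(3, 1, [Node(1, 1), Node(2, 2)])
right = Node(6, 4, [Node(4, 4), Node(5, 5)])
located = identify([(5, 5), (3, 1)], [left, right])
subtree = extract(located[(5, 5)], 5)
```

Here the subscriber's allowed roots `(5, 5)` and `(3, 1)` are located in the right and left received subtrees, respectively, and the depth-first search then returns the subtree to forward.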

In order to precisely detect any integrity violations, the following checks must be executed at the consumer side:
1) (N-I) whether nodes have been dropped;
2) (N-II) whether the order of the nodes has been changed;
3) (N-III) whether the content of a node has been compromised;
4) (N-IV) whether some nodes have been added in an unauthorized manner;
5) (N-V) whether the content of one node has been replaced with the content of another node.

Let u be a consumer receiving a set of nodes $R^u$ from the router. Let $R^u = \{S_x \mid S_x$ is the structural identifier of node x received by u$\}$. Consumer u receives a list, denoted by $L^u$, of signatures of each permitted node during its subscription phase. The consumer validates all the nodes received against the nodes it expects to receive by matching their EPONs. Let r be a received node, $r \in R^u$. Let the structural signature of r be $(e_r, e^r_{lowest})$. Let s be a signature in $L^u$ such that $s = (e_x, e^x_{lowest})$. The matching of signatures is carried out as follows: $\forall s \in L^u\ \exists r \in R^u : (e_r, e^r_{lowest}) = (e_x, e^x_{lowest})$; i.e., if for each node in $L^u$ there is a node in $R^u$ with an identical structural signature, then all the permitted nodes have been received. If there is some node s that u has access to but that does not match any r in $R^u$, then s has been dropped (N-I verified partially).

Then the consumer carries out a postorder traversal on every subtree representation with root at $r \in R^u$, after the nodes are decrypted. Let x be the currently visited node. The local hash of x, denoted by $H_x$, is computed as $H(S_x, content_x)$. After the decryption, x has the following encoding: $(S_x, S_z, C_x, S_z)$. $H_x$ is compared with $I_x$ in $C_x$. If there is a mismatch, then the content integrity has been violated (N-III verified). If the contents of two nodes have been swapped, this will also be detected because the integrity identifier of the content of a node is bound to its structural identifier (Section IV-B) (N-V verified). Otherwise, the process continues as follows.
The outer $S_x$ must match the $S_x$ in $C_x$. If not, then this node is discarded as compromised and an integrity violation is noted. If the outer $S_z$ is not the same as the inner $S_z$, then a violation is detected, but the node is not yet discarded. The inner $S_z$ is compared with the inner $S_w$ of the received parent node w of x. If they do not match, then x is discarded (N-I is verified completely). $e^x_{lowest}$ is compared with $e^w_{lowest}$. If $e^x_{lowest} < e^w_{lowest}$, then the integrity of the structure of the subtree has been violated. If $e_x > e_w$, then this is a case of reordering. If $e_x$ is found not to be within the bounds $[e^w_{lowest}, e_w]$, a new node has been added (N-IV verified). The verification algorithm also checks whether $e_x$ is less than the EPON of any node visited earlier during this traversal. This is done by comparing the two factors of the structural identifier $S_x$. The absence of any such occurrence ensures that there is no change in the original order among nodes (N-II verified).

The verification process is efficient and simple. It uses the basic techniques of post- and preorder traversal and hash computation. Therefore, the computation is not expensive, nor is the implementation of such a technique complex. The cost of verification is linear in the size of the content received, because the postorder traversal combined with the preorder processing on each subtree verifies the integrity of the content.
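The consumer-side checks above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the paper's algorithm. SHA-256 is an assumed choice for the hash function H. Integers stand in for EPONs. The nested encoding with outer and inner identifiers is collapsed into a single signature per node. The names `Received`, `make_node`, `dropped`, and `verify` are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import List, Optional, Set, Tuple

Signature = Tuple[int, int]  # (e_x, e_lowest); integers stand in for EPONs


def local_hash(sig: Signature, content: str) -> str:
    """H(S_x, content_x), with SHA-256 as an assumed choice of H."""
    return hashlib.sha256(f"{sig[0]}|{sig[1]}|{content}".encode("utf-8")).hexdigest()


@dataclass
class Received:
    sig: Signature
    content: str
    integrity: str  # I_x as computed by the document source
    children: List["Received"] = field(default_factory=list)


def make_node(epon: int, lowest: int, content: str, children=()) -> Received:
    sig = (epon, lowest)
    return Received(sig, content, local_hash(sig, content), list(children))


def dropped(expected: Set[Signature], roots: List[Received]) -> Set[Signature]:
    """N-I (partial): permitted roots with no matching received signature."""
    return expected - {r.sig for r in roots}


def verify(root: Received) -> List[str]:
    """Postorder traversal returning the violation codes that were detected."""
    found: List[str] = []
    last: Optional[int] = None  # EPON of the previously visited node

    def visit(node: Received, parent: Optional[Received]) -> None:
        nonlocal last
        for child in node.children:
            visit(child, node)
        e, low = node.sig
        if local_hash(node.sig, node.content) != node.integrity:
            found.append("N-III/N-V")  # content altered or swapped
        if parent is not None:
            pe, plow = parent.sig
            if low < plow or not (plow <= e <= pe):
                found.append("N-IV")  # outside the parent's bounds
        if last is not None and e <= last:
            found.append("N-II")  # postorder EPONs must increase
        last = e

    visit(root, None)
    return found
```

Because the hash binds the content to the structural identifier, swapping the content and integrity value of two nodes makes both hash checks fail, and reversing two siblings breaks the increasing order of EPONs seen during the traversal.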

E. Update Management

This section discusses updates to documents (content and structure) in the context of structure-based routing. In the case of changes that are structurally invariant, only the data inside a document node changes. Thus, only the local hash of the node changes. Only the updates of the changed nodes, along with their signatures, are forwarded to the routers. Structural changes have to be reflected in the mapping from user credentials to accessible nodes and their signatures. Therefore, the services that implement the mapping function from user credentials to structural identifiers need to be notified of the new EPONs. If the mapping is held in a distributed hash table, then the document source updates the hash table. The routers are also notified of the modifications. The routers and consumers holding the document are notified of the removal of a subtree. In the case of addition of a new subtree, the original structure of the document is not affected. Therefore, the update is propagated to all the routers that have consumers with access permission to the new subtree. In the case of interchanges, the changes need to be propagated to the routers and consumers that are registered for any updated node or an ancestor of that updated node.

VI. DISCUSSION

In this section, we discuss the requirements mentioned in Section I for document dissemination and show that our proposed dissemination model addresses all the security requirements of a dissemination model.

A. Requirements Satisfaction

Integrity: We introduced the notion of EPONs in order to support all the integrity requirements. In Section V-D, we developed techniques based on this notion for content verification and validation.

Access Control and Confidentiality: The structure-based routing scheme ensures that a consumer is delivered only the portion of data that it has access to.
The notion of maximal structural blocks at routers ensures that a router has access only to as much data as its consumers collectively have access to. Our postorder-numbering-based integrity check technique is parallel to the Merkle hash algorithm [10]. The Merkle hash technique requires the hash values of the subtrees that are not accessible to the consumer to be forwarded as well, so that the consumer can verify the document integrity by computing and matching the final hash value of the complete original tree. Our technique exploits the properties of postorder numbering for the same goal and thus avoids sending the hash values of the subtrees that are not accessible to the consumer, thereby preventing the leakage of data. This is an indirect information leak that is prevented by our framework.

1) Efficiency of Structure-Based Routing: The simple notion of PONs in the context of XML data provides powerful and sound principles for content identification and efficient extraction with linear time complexity. The framework uses an efficient content routing mechanism based on the content structure. The cost of routing is, in the worst case, linear in the document size.
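For contrast with the Merkle hash technique [10] mentioned above, the following Python sketch shows why that technique has to reveal hash values of inaccessible subtrees. To recompute the root hash from one accessible subtree, the verifier needs the hashes of every sibling subtree along the path to the root. The sketch is illustrative only. SHA-256, the hashing of content followed by child hashes, and the names `MNode`, `merkle_hash`, and `auxiliary_subtrees` are assumptions, not definitions from the paper or from [10].

```python
import hashlib
from dataclasses import dataclass, field
from typing import List


@dataclass
class MNode:
    name: str
    content: str
    children: List["MNode"] = field(default_factory=list)


def merkle_hash(node: MNode) -> str:
    """Hash of the node's content followed by the hashes of its children."""
    h = hashlib.sha256(node.content.encode("utf-8"))
    for child in node.children:
        h.update(bytes.fromhex(merkle_hash(child)))
    return h.hexdigest()


def path_to(root: MNode, target: str) -> List[MNode]:
    """Nodes from root down to the node named target, or [] if absent."""
    if root.name == target:
        return [root]
    for child in root.children:
        sub = path_to(child, target)
        if sub:
            return [root] + sub
    return []


def auxiliary_subtrees(root: MNode, target: str) -> List[str]:
    """Names of sibling subtrees whose hashes must accompany the target."""
    path = path_to(root, target)
    needed: List[str] = []
    for parent, on_path in zip(path, path[1:]):
        needed.extend(c.name for c in parent.children if c is not on_path)
    return needed


doc = MNode("root", "r", [
    MNode("a", "x", [MNode("b", "secret-1"), MNode("c", "secret-2")]),
    MNode("d", "y", [MNode("e", "z"), MNode("f", "w")]),
])
extra = auxiliary_subtrees(doc, "b")
```

A consumer permitted to see only `b` would still need the hashes of `d` and `c` under this scheme. That is the indirect leak the EPON-based bounds check is designed to avoid.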

Fig. 5. Summary of the Merkle hash technique, the selective XML dissemination technique, and the technique proposed in this paper.

The multicast topology based on the structure-based routing for the dissemination model is acyclic. The multicast reduces the network usage, while the cycle-free property ensures that the number of router hops is finite and proportional to the height of the document tree. By the pigeonhole principle applied to the number of content nodes and the number of consumers in a large dissemination network, there are overlaps in content accessibility and subscription among consumers. Given that each path from the document source to a consumer contains a monotonically decreasing sequence of EPONs of the document as PEPONs of the underlying routers, and that a parent router never has a smaller EPON than a given router's PEPON, a cycle cannot occur. This makes the multicast topology more efficient in terms of bandwidth usage and dissemination speed.

The path from the source of a document to a consumer contains a list of routers. Let the sequence of routers be $R_1 \to R_2 \to \cdots \to R_i \to R_{i+1} \to \cdots \to R_n$. Let the publishing EPON for the consumer at each $R_i$ be $\epsilon_i$. The following observations are crucial for the topological efficiency.
1) For each $i < j$, $1 \le i, j \le n$, $\epsilon_i \ge \epsilon_j$; that is, the PEPONs have a monotonically decreasing order among them in such a path. This is because a router creates a link to another router during the subscription process if and only if the router has access to the required subtree or a larger subtree from the specific document.
2) Due to the monotonicity property, the sizes of the subtrees being transmitted along the path $R_1 \to R_2 \to \cdots \to R_n$ also decrease monotonically. In the worst case, all subscribers along the path have access to the complete document; in practice, however, most subscribers have access to a subset of the document. Therefore, the cost of transmitting the content from the source to a consumer is less than the cost incurred in a common star/broadcast topology or is, in the worst case, equal to it.
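The monotonicity observation can be checked mechanically. The Python sketch below is a hypothetical illustration: the router names, the `parent` map, and the functions `path_from_source` and `pepons_non_increasing` are assumptions made for the example, and integers again stand in for PEPONs. It walks the parent links from a consumer's router back to the document source and tests that the PEPONs never increase along the resulting path.

```python
from typing import Dict, List, Optional


def path_from_source(router: str, parent: Dict[str, Optional[str]]) -> List[str]:
    """Follow parent links up to the source; return the path source-first."""
    path: List[str] = []
    seen = set()
    current: Optional[str] = router
    while current is not None:
        if current in seen:
            raise ValueError("cycle detected in router topology")
        seen.add(current)
        path.append(current)
        current = parent[current]
    return list(reversed(path))


def pepons_non_increasing(pepons: List[int]) -> bool:
    """True when every router's PEPON is at most that of its predecessor."""
    return all(a >= b for a, b in zip(pepons, pepons[1:]))


# Hypothetical three-router path R1 -> R2 -> R3 with PEPONs 7, 6, 5.
parent = {"R1": None, "R2": "R1", "R3": "R2"}
pepon = {"R1": 7, "R2": 6, "R3": 5}
route = path_from_source("R3", parent)
ok = pepons_non_increasing([pepon[r] for r in route])
```

In this example `route` lists the routers from the source to the consumer's router, and `ok` confirms that each downstream router receives a subtree no larger than the one received by its parent.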
Therefore, such a model is efficient in terms of network resource usage and speed of dissemination, and is thus more scalable.

VII. RELATED WORK

The main research efforts related to our paper are in the area of secure dissemination of XML data and in the area of secure publish-subscribe systems. Fig. 5 presents a summary of the Merkle hash technique for integrity verification of trees [10], the selective dissemination of XML [8], and the solution proposed in this paper.

In the first area, the only approach supporting access control in both pull- and push-based distribution of data has been proposed by Bertino and Ferrari [8]. Such an approach relies on encrypting different portions of the data with different keys and then distributing the keys to data consumers according to the access control policies. Bertino et al. [11] have also investigated the problem of integrity of XML data by using the notion of Merkle hash. Those approaches, however, have major drawbacks: they are not scalable and do not remove extraneous data from the contents. These drawbacks are fully addressed by the approach proposed in this paper.

In the second area, several approaches have been proposed to address efficiency issues concerning publish-subscribe systems [12]-[19]. Most approaches (e.g., [13], [16], and [17]) use a spanning tree structure for event routing. In order to reduce the matching that has to be performed by brokers from the root to the leaves, several optimization techniques have been proposed. Virtual groups are used to reduce the matching performed by brokers [20]. However, security issues in content-based publish-subscribe systems have not been investigated. The only exceptions are the approach by Srivatsa and Liu [26], which, however, focuses only on resiliency, and the approach by Opyrchal and Prakash [12], which, however, is very inefficient and is not flexible with respect to access control policies. In contrast, our approach [29] addresses a larger spectrum of security requirements while being at the same time efficient and scalable.

VIII. CONCLUSION AND FUTURE WORK

The paper shows how the structural properties of the XML DOM can be exploited in order to address issues in data security and dissemination. We have applied the simple notion of PONs to solve some of the important challenges in data security, especially data integrity and secure data dissemination.
By using the structural properties of XML content in conjunction with the properties of the PONs, we have proposed: 1) a technique to verify the integrity of the distributed content; 2) a technique that facilitates maintaining data confidentiality; and 3) a novel structure-based routing of XML content. We introduced the notion of EPONs in order to support the integrity and confidentiality requirements of XML content as well as to facilitate efficient identification, extraction, and distribution of subsets of the content. The structure-based routing scheme uses the notion of EPONs to prevent information leaks in XML data dissemination. We proposed a dissemination model for XML content that combines multicast and structure-based routing in order to improve efficiency in terms of bandwidth usage and speed of data delivery, thereby favoring scalability. The dissemination model, combined with techniques for data integrity verification and confidentiality, provides a secure publish-subscribe paradigm for XML documents. The publish-subscribe model restricts information about consumers and document sources to the routers with which they register. Such an approach to XML content dissemination satisfies the requirements of integrity, confidentiality, and privacy preservation in a holistic manner. Structure-based routing provides a modular and flexible model for security enforcement in data distribution. Flexibility
in security enforcement is known to be an important requirement in secure system design and implementation. Depending on the degree of trust in the network, integrity checks may or may not be enforced. Moreover, the framework facilitates dissemination of contents with varying degrees of confidentiality and integrity in a mix of trusted and untrusted networks, which is prevalent in current settings across enterprise networks and the web. It facilitates the control and enforcement of access control policies at a single point, the document source, which is very difficult to achieve in content-based routing and other publish-subscribe systems. On the other hand, content-based routing can be easily emulated using the structure-based routing presented in this paper. We plan to further investigate PONs and other structural properties of hierarchical data models, and to apply them to various security, management, and engineering issues. Exploring the implementation of various access control policies and integrity models on such a data dissemination model would also be interesting.

REFERENCES
[1] M. Altinel and M. J. Franklin, "Efficient filtering of XML documents for selective dissemination of information," in Proc. VLDB Conf., 2000, pp. 53-64.
[2] A. Crespo, O. Buyukkokten, and H. Garcia-Molina, "Query merging: Improving query subscription processing in a multicast environment," IEEE Trans. Knowl. Data Eng., vol. 15, no. 1, pp. 174-191, 2003.
[3] Extensible Markup Language (XML) [Online]. Available: http://www.w3.org/XML/
[4] W3C Document Object Model (DOM) [Online]. Available: http://www.w3.org/DOM/
[5] T. H. Cormen, C. E. Leiserson, R. L. Rivest, and C. Stein, Introduction to Algorithms. Cambridge, MA: MIT Press, 2001.
[6] W3C Document Object Model Core [Online]. Available: http://www.w3.org/TR/2004/REC-DOM-Level-3-Core-20040407/core.html#ID1590626202
[7] R. Agrawal, J. Kiernan, R. Srikant, and Y. Xu, "Order preserving encryption for numeric data," in Proc. 2004 ACM SIGMOD Int. Conf. Manag. Data, pp. 563-574.
[8] E. Bertino and E. Ferrari, "Secure and selective dissemination of XML documents," ACM Trans. Inf. Syst. Secur., vol. 5, no. 3, pp. 290-331, 2002.
[9] Naming and addressing: URIs, URLs, ... [Online]. Available: http://www.w3.org/Addressing/
[10] R. Merkle, "Secrecy, authentication, and public key systems," Ph.D. dissertation, Dept. Elect. Eng., Stanford Univ., CA, 1979.
[11] E. Bertino, B. Carminati, E. Ferrari, B. M. Thuraisingham, and A. Gupta, "Selective and authentic third-party distribution of XML documents," IEEE Trans. Knowl. Data Eng., vol. 16, no. 10, pp. 1263-1278, Oct. 2004.
[12] L. Opyrchal and A. Prakash, "Secure distribution of events in content-based publish-subscribe systems," presented at the 10th USENIX Security Symp., Washington, DC, 2001.
[13] A. K. Datta, M. Gradinariu, M. Raynal, and G. Simon, "Anonymous publish-subscribe in p2p networks," presented at the Int. Parallel Distrib. Process. Symp., Nice, France, 2003.
[14] G. Banavar, T. Chandra, B. Mukherjee, and J. Nagarajarao, "An efficient multicast protocol for content-based publish-subscribe systems," in Proc. 19th IEEE Int. Conf. Distrib. Comput. Syst., 1999, pp. 262-272.
[15] M. Aguilera, R. Strom, D. Sturman, M. Astley, and T. Chandra, "Matching events in a content-based subscription system," presented at the 18th ACM Symp. Principles Distrib. Comput., Atlanta, GA, 1999.
[16] A. Riabov, Z. Liu, J. L. Wolf, P. S. Yu, and L. Zhang, "Clustering algorithms for content-based publication-subscription systems," in Proc. 22nd IEEE Int. Conf. Distrib. Comput. Syst., 2002, pp. 133-142.
[17] F. Cao and J. Singh, "Efficient event routing in content-based publish-subscribe service networks," in Proc. IEEE INFOCOM 2004, pp. 929-940.
[18] A. Carzaniga and A. L. Wolf, "Forwarding in a content-based network," in Proc. ACM SIGCOMM, Karlsruhe, Germany, Aug. 2003, pp. 163-174.
[19] C. Wang, A. Carzaniga, D. Evans, and A. L. Wolf, "Security issues and requirements for internet-scale publish-subscribe systems," in Proc. Hawaii Int. Conf. Syst. Sci., 2002, p. 303.
[20] P. Costa and G. P. Picco, "Semi-probabilistic content-based publish-subscribe," in Proc. 25th IEEE Int. Conf. Distrib. Comput. Syst., 2005, pp. 575-585.
[21] G. Cugola, D. Frey, A. L. Murphy, and G. P. Picco, "Minimizing the reconfiguration overhead in content-based publish-subscribe," in Proc. 19th ACM Symp. Appl. Comput., 2004, pp. 1134-1140.
[22] A. Carzaniga, M. J. Rutherford, and A. L. Wolf, "A routing scheme for content-based networking," in Proc. IEEE INFOCOM, 2004, pp. 918-928.
[23] G. P. Picco, G. Cugola, and A. L. Murphy, "Efficient content-based event dispatching in presence of topological reconfigurations," in Proc. 23rd Int. Conf. Distrib. Comput. Syst., 2003, pp. 234-243.
[24] H. Zhou and S. Singh, "Content based multicast (CBM) in ad hoc networks," in Proc. MobiHoc, 2000, pp. 51-60.
[25] P. Costa, M. Migliavacca, G. P. Picco, and G. Cugola, "Epidemic algorithms for reliable content-based publish-subscribe: An evaluation," in Proc. 24th IEEE Int. Conf. Distrib. Comput. Syst., 2004, pp. 552-561.
[26] M. Srivatsa and L. Liu, "Securing publish-subscribe overlay services with EventGuard," in Proc. 12th ACM Conf. Comput. Commun. Security, 2005, pp. 289-298.
[27] A. Carzaniga, D. S. Rosenblum, and A. L. Wolf, "Design and evaluation of a wide-area event notification service," ACM Trans. Comput. Syst., vol. 19, no. 3, pp. 332-383, 2001.
[28] R. Zhang and Y. C. Hu, "Hyper: A hybrid approach to efficient content-based publish-subscribe," presented at the Int. Conf. Distrib. Comput. Syst., Las Vegas, NV, 2005.
[29] A. Kundu and E. Bertino, "Secure dissemination of XML content using structure-based routing," in Proc. 10th IEEE Int. Enterprise Distrib. Object Comput. Conf. (EDOC'06), 2006, pp. 153-164.

Ashish Kundu (S'06) is currently working toward the Ph.D. degree in the Department of Computer Science, Purdue University, West Lafayette, IN. His primary research interests lie in security and privacy issues in data distribution, and in language-based security issues in a distributed context. He was previously a Research Staff Member at IBM India Research Laboratory, Delhi. Mr. Kundu is a Student Member of the ACM. He is also an Honorary Life Member of Upsilon Pi Epsilon. He is a coauthor of a paper that received the best student paper award at IEEE EDOC '06. He has also received the IBM Bravo Award for his technical contributions. He has served on the program committees of IEEE EDOC '07 and '08.

Elisa Bertino (SM'03-F'02) is currently a Professor of computer sciences at Purdue University, West Lafayette, IN, and serves as Research Director of the Center for Education and Research in Information Assurance and Security (CERIAS). Previously, she was a faculty member at the Department of Computer Science and Communication, University of Milan, where she directed the DB&SEC Laboratory. She has been a Visiting Researcher at the IBM Research Laboratory (now Almaden), San Jose, at the Microelectronics and Computer Technology Corporation, at Rutgers University, and at Telcordia Technologies. Her main research interests include security, privacy, digital identity management systems, database systems, distributed systems, and multimedia systems. In those areas, she has published more than 250 papers in all major refereed journals, and in proceedings of international conferences and symposia. She is a coauthor of the books Object-Oriented Database Systems: Concepts and Architectures (Addison-Wesley International, 1993), Indexing Techniques for Advanced Database Systems (Kluwer Academic, 1997), Intelligent Database Systems (Addison-Wesley International, 2001), and Security for Web Services and Service Oriented Architectures (Springer, 2007).
She is a Co-Editor-in-Chief of the Very Large Database Systems (VLDB) Journal. She also serves on the editorial boards of several scientific journals, including the ACM Transactions on Information and System Security, ACM Transactions on Web, Acta Informatica, the Parallel and Distributed Database Journal, the Journal of Computer Security, Data & Knowledge Engineering, and Science
of Computer Programming. She has been a Consultant to several companies on data management systems and applications and has given several courses to industry. Her research has been sponsored by several organizations and companies, including the U.S. National Science Foundation, the U.S. Air Force Office for Sponsored Research, the I3P Consortium, the European Union (under the 5th and 6th IST research programmes), IBM, Microsoft, and the Italian Telecom. Dr. Bertino is a Fellow of the ACM and has been named a Golden Core Member for her service to the IEEE Computer Society. She received the 2002 IEEE Computer Society Technical Achievement Award "for outstanding contributions to database systems and database security and advanced data management systems" and the 2005 IEEE Computer Society Tsutomu Kanai Award "for pioneering and innovative research contributions to secure distributed systems." She has served as a Program Committee member of several international conferences, such as ACM SIGMOD, VLDB, and ACM OOPSLA, as Program Co-Chair of the 1998 IEEE International Conference on Data Engineering (ICDE), and as Program Chair of the 2000 European Conference on Object-Oriented Programming (ECOOP 2000), of the 7th ACM Symposium on Access Control Models and Technologies (SACMAT 2002), of the EDBT 2004 Conference, and of the IEEE Policy 2007 Workshop. She is an Associate Editor of IEEE INTERNET COMPUTING and IEEE SECURITY AND PRIVACY.
