
A Progress Report

On
Insider Collusion Attack on Privacy-Preserving Kernel-Based Data
Mining Systems
Bachelor of Engineering
in
Computer Science & Engineering
Submitted by

Sneha Shinde
Vaishali Singan
Shruti Ghotkar
Vaishali Jikar

Guided by

Prof. T. M. Belle

Department of Computer Science & Engineering

Shri Shankarprasad Agnihotri College of Engineering, Wardha

2016-17


Declaration

I hereby declare that the work presented in the report titled "Insider Collusion Attack on Privacy-Preserving
Kernel-Based Data Mining Systems", submitted herein, has been carried out by me in the Department of Computer
Science & Engineering of Shri Shankarprasad Agnihotri College of Engineering, Wardha. The work is
original and has not been submitted earlier, as a whole or in part, for the award of any degree / diploma at this
or any other Institution / University.

I also hereby assign to Shri Shankarprasad Agnihotri College of Engineering, Wardha all rights under
copyright that may exist in and to the above work and any revised or expanded derivative works based on the
work as mentioned. Other work copied from references, manuals etc. is disclaimed.

Sneha Shinde

Vaishali Singan

Shruti Ghotkar

Vaishali Jikar

Date :
ACKNOWLEDGEMENT

I have immense pleasure in expressing my gratitude towards my seminar guide, Prof. T. M. Belle, for her
valuable guidance and inspiration. She continuously supervised my work with utmost care.
I would also like to thank Prof. Bawane for always being supportive, encouraging and extremely
helpful.

I also express my sincere thanks to Prof. D. B. Dandekar, Head of Department, CSE, SSPACE,
Wardha.
INDEX

Chapter no. Title

Abstract

1. Introduction

2. Literature Survey and Review
2.1 A Novel Method For Privacy In Big Data Mining
2.2 The Malicious Insider Threat In Cloud
2.3 Data Security in Big Data

3. Methodology
3.1 Architecture of proposed system
3.2 Method for proposed system
3.3 Algorithm

4. Tools/Platforms

5. Design/Implementation
5.1 Data allocation module
5.2 Fake objects module
5.3 Optimization module
5.4 Data distributor module
5.5 Fake agent module

6. Future Scope

7. Snapshots

8. References
Figure Index

Fig. no. Fig. Name

Fig. 1.1 The Attacking Scenario: An insider within the hospital helps the outside attackers (the SVM server) to launch the attack

Fig. 2.1 Data Mining Process

Fig. 2.2 Opportunities for prevention, detection and response for an insider attack

Fig. 2.3 Result and Discussions on data security

Fig. 3.1 Architecture of Proposed System


Abstract

In this paper, we consider a new insider threat to the privacy-preserving work of distributed kernel-based
data mining (DKBDM), such as the distributed support vector machine. Among several known data-breaching
problems, those associated with insider attacks have been rising significantly, making this one of the
fastest growing types of security breaches. Once considered a negligible concern, insider attacks have risen to
be one of the top three central data violations. Insider-related research involving the distribution of kernel-based
data mining is limited, resulting in substantial vulnerabilities in designing protection against
collaborative organizations. Prior works often fall short by addressing a multifactorial model that is more
limited in scope and implementation than addressing insiders within an organization colluding with outsiders.
A faulty system allows collusion to go unnoticed when an insider shares data with an outsider, who can then
recover the original data from message transmissions (intermediary kernel values) among organizations. This
attack requires only accessibility to a few data entries within the organizations rather than requiring the
encrypted administrative privileges typically found in the distribution of data mining scenarios. To the best of
our knowledge, we are the first to explore this new insider threat in DKBDM. We also analytically
demonstrate the minimum amount of insider data necessary to launch the insider attack. Finally, we follow up
by introducing several proposed privacy-preserving schemes to counter the described attack. Data leakage
happens every day when confidential business information such as customer or patient data, company secrets
and budget information is leaked out. When such information is leaked, companies are at
serious risk. Most often the data are leaked from the agents' side, so a company has to be very careful while
distributing such data to its agents.

Keywords : Data hiding, data leakage, fake object, insider attack, kernel, Privacy preserving data mining.
Chapter 1

INTRODUCTION

Data-breaching problems related to insider attacks are one of the fastest growing attack types.
According to the 2015 Verizon Data Breach Investigations Report, attacks from insider misuse have
risen significantly, from 8% in 2013 to 20.6% in 2015. This near-triple rate of increase is astonishing when
one considers that this rise has taken place over a span of only two years. As a result of this rapid increase,
insider attacks are now among the top three types of data breaches.
Insider attacks arise not from system security errors but from staff inside the company's enterprise
data security circles. A company may deploy security solutions to protect its perimeter yet still find it difficult
to prevent an insider attack. Many data mining applications store huge amounts of personal information;
therefore, extensive research has primarily focused on dealing with potential privacy breaches. One prime area
of research in preserving privacy is the Support Vector Machine (SVM). SVM is a very popular data mining
methodology used mainly with the kernel trick to map data into a higher dimensional feature space and thereby
achieve better mining precision.
With privacy protection in mind, Vaidya et al. provided a state-of-the-art privacy-preserving
distributed SVM scheme to securely merge kernels. Their proposal encoded and hid the kernel values in a
noisy mixture during transmission such that the original data cannot be recovered even if these distributed
organizations colluded. To the best of our knowledge, no prior work has considered a robust pragmatic model
in which "insiders within organizations" collude with outsiders. Such a pragmatic model considers the insider
as the key player in sharing data with an attacker, who can then recover the original data from the intermediary
kernel values of the SVM model. This attack is more realistic because the attacker needs only to obtain a few
data entries rather than the entire database from an organization to successfully recover the rest of the private
data.

Fig. 1.1 The Attacking Scenario: An insider within the hospital helps the outside attackers (the SVM server)
to launch the attack.
Chapter 2

LITERATURE SURVEY AND REVIEW


2.1 Title: A Novel Method For Privacy In Big Data Mining

Author: Nasrin Irshad Hussain, Bharadwaj Choudhury, Sandip Rakshit

Date: 16 October 2014

Description: A novel method is proposed for preserving privacy in mining big databases. The goal is to
mine the data while preserving data privacy and confidentiality.

2.1.1 Privacy protection on raw data

In the first step, the customers' data are encrypted to safeguard them from unauthorized use and to
preserve privacy. The data are encrypted with the RSA algorithm using the public key. After
encryption, the data are safely stored in the database. Everyone except authorized employees can
see only the encrypted data.
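
The public-key step described above can be illustrated with a toy example. This is only a sketch of textbook RSA with deliberately tiny primes; the primes, the exponent and the function names are illustrative assumptions and are not taken from the surveyed paper. A real system would rely on a vetted cryptographic library with padding and much larger keys.

```python
# Toy textbook RSA: illustration only, NOT secure (tiny primes, no padding).
p, q = 61, 53                  # assumed small primes, chosen for readability
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: modular inverse (Python 3.8+)


def encrypt(m: int) -> int:
    """Encrypt an integer 0 <= m < n with the public key (e, n)."""
    return pow(m, e, n)


def decrypt(c: int) -> int:
    """Decrypt a ciphertext with the private key (d, n)."""
    return pow(c, d, n)


record = 65                    # a customer value encoded as a small integer
stored = encrypt(record)       # what is kept in the database
restored = decrypt(stored)     # what an authorized employee recovers
```

Only the holder of the private exponent d can turn the stored value back into the record, which is the property the surveyed method relies on.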
2.1.2 Data Mining process

When an organization wants to apply data mining to its big data to gain knowledge, an
authorized employee of the organization takes the required data from the database for mining. Only
with the private key can the required raw data be decrypted. After decryption, a data mining technique called
clustering is used. Clustering is efficient for big data mining. It puts all data into clusters and
extracts knowledge from them. For clustering, a rule system is used and the clusters are built based on
these rules. In this way, the authorized employee obtains the analyzed table for data
mining without violating the privacy policy.

2.1.3 Knowledge Gain

After data mining, the knowledge discovery process ends. The patterns gained, which can help to improve
the business, are published in the public layer of the model. Here all
employees can see the mined knowledge and work according to it. Preserving privacy is
the main aim of the work. If organizations or companies follow this type of model, it will be very
beneficial for them, and they can assure their customers that their details are important to them
and are fully protected. Customers' data are the wealth of a company, so preserving
the privacy of these data is very important.

Implementation :

Fig. 2.1 Data Mining Process


2.2 Title: The Malicious Insider Threat In Cloud

Author: Atulay Mahajan, Sangeeta Sharma

Date: March-April 2015

Description: Detection of malicious insider threats and conception of strategies to solve them.

2.2.1 Collecting Data From Literature

To identify which areas of cloud computing security need more research, the challenges are found first.
A literature review is used to find all available data relevant to a particular research area and to
collect information that answers the research questions. Based on the information gathered from the literature
review, an analysis was employed to develop general explanations. This helped to identify the key concepts,
terms and resources used by other researchers. These data are used to develop alternative designs or to find
out the need for further research.
2.2.2 Studying The Context of The Problem
To study the problem, it is suggested that it be examined in two distinct contexts:
i. Insider threat in the cloud provider: The insider is a malicious employee working for the
cloud provider. He/she could cause a great deal of damage to both the provider and its customers.
ii. Insider threat in the cloud outsourcer: The insider is an employee of an organization which has
outsourced part or all of its infrastructure to the cloud. Though responsibilities may be different,
there are few elementary differences between a rogue administrator at the cloud provider and a rogue
administrator within the customer organization; both insiders have root access to systems and data,
and both may employ similar types of attacks to steal information.
2.2.3 Countermeasures
Client side
1. Confidentiality /Integrity
2. Availability
Provider Side
1. Separation of Duties
2. Logging
3. Legal Binding
4. Insider Detection Models

Implementation :

Fig. 2.2 Opportunities for prevention, detection and response for an insider attack
2.3 Title: Data Security in Big Data
Author: Er. Jyoti Arora and Misha Gupta

Date: October 18, 2015

Description: Safeguard data by identifying the legitimate user and the attacker.

With the size of data increasing by 40% globally every year, data mining and its consequences have both
become important topics of research. Tech giants such as Google and Facebook look for patterns
in data to enhance the user experience. This use of data mining has also had negative outcomes:
people have reported leakage of their private information and content. Privacy control, along with
advances in data mining models, has therefore become an important topic of research. The authors chose to
contribute to this research because the issue is very important and touches millions of lives. After studying
the topic and its various factors, they found that there is a need for an Internet model that prevents the
exposure of users' private data and sensitive information at all user levels, i.e. data provider, data
collector, data miner and decision maker. On the basis of this problem, they set the following objectives.
The data-mining model will provide extensions that prevent exposure at every user level. The extensions
will cover data such as:
Phone numbers
Personal shopping details
Address
Family details
These are the privacy fields. The system will be designed to restrict access to this sensitive
information, with the restriction feature available to all four types of users. When the user's end
provides personal information or browsing content, a prevention field is set. This ensures that
information in the private fields is not shared with anyone except the user and the service provider.
Because the desired privacy level varies from person to person, every user of the model can set his own
privacy level: he can prevent any of his information from being shared or extracted, or he can allow all of
his information to be shared. This is implemented for all four user types to make the security more robust.
The Internet model will be used on a local university connection and among friends to survey the success
rate of this approach on the Internet. With the help of this live information, the authors plan to improve
the system and set the future scope of the research.

IMPLEMENTATION:

Fig. 2.3 Result and Discussions on data security


Chapter 3
METHODOLOGY

Fig. 3.1 Architecture of Proposed System


3.1 ATTACK STEP 1: KERNEL COLLECTION

In this step, the malicious outsider collects as many kernel values as possible from the victim system. For
example, in Vaidya's PPSVM system, the server is able to collect all the private kernel matrices directly by
following procedure 2 of the system. There are many more ways to achieve the goal of collecting kernel values;
we will not address them all in this paper. Another example is illustrated by Navia-Vázquez's distributed
semiparametric support vector machine (DSSVM) [18], in which kernels are used as basic units to reduce the
amount of information required for communication. In [18], every client needs to send the pair
$\{R_m, r_m\}$ to the other clients for distributed learning, where $R_m$ and $r_m$ are two intermediate
matrices consisting of kernel values that reduce the communication load. Note that $r_m = K_m^T W_m y_m$,
where $K_m$ is the kernel matrix of client m, $W_m$ is client m's part of the entire weighting matrix W, and
$y_m$ is client m's part of the whole y-value matrix y. The two parameters W and y are public to all clients.
Thus, the client receiving $\{R_m, r_m\}$ could invert $r_m$ to obtain the kernel matrix $K_m$ based on the
two public parameters W and y. If one of the clients is an attacker, he can obtain all the values of the
kernel matrix of the previous client.

3.2 ATTACK STEP 2: KERNEL-AND-INSIDER-DATA (KID) LINKING

In this step, an insider colludes (shares his own data) with one of the outsiders, and the outsider then
searches his collection of kernel values for those composed from the insider's data. The idea of the
outsider's search stems from the fact that kernel values are composed on the basis of the insider's data and
the other (non-insider) data, as shown in the following equation.

$$\mathrm{Kernel}_{ij} = (\mathrm{InsiderData}_i)^T (\mathrm{NotInsiderData}_j)$$

To perform this search efficiently, the following principles are considered:

Principle 1: Because the kernel matrix is symmetric, consider only vertical and
horizontal kernel lines.

Principle 2: Merge the kernel lines for the same axis of the index, because they all represent the same index.

Principle 3: Remove the kernel lines representing the indices of the other insiders' data.
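
As a small illustration of the kernel equation and of Principle 1, the sketch below builds a linear kernel matrix over a made-up data set and extracts the kernel line of one insider record. The linear kernel, the toy records and the function names are assumptions chosen for illustration only.

```python
def dot(a, b):
    """Linear kernel: inner product of two equal-length records."""
    return sum(x * y for x, y in zip(a, b))


def kernel_matrix(records):
    """K[i][j] is the kernel value of record i and record j."""
    return [[dot(a, b) for b in records] for a in records]


# Toy data set (hypothetical values).
records = [
    [1, 2],   # index 0: private
    [3, 1],   # index 1: private
    [2, 5],   # index 2: insider data, known to the attacker
    [4, 4],   # index 3: private
]
K = kernel_matrix(records)

# Principle 1: K is symmetric, so row t and column t hold the same values
# and one kernel line per index is enough.
is_symmetric = all(K[i][j] == K[j][i] for i in range(4) for j in range(4))

# Kernel line of the insider: every entry mixes the insider record with
# exactly one other record.
insider = 2
kernel_line = [K[i][insider] for i in range(len(K)) if i != insider]
print(is_symmetric, kernel_line)   # True [12, 11, 28]
```

Each value on the kernel line is the inner product of the known insider record with one unknown record, which is what the later recovery step exploits.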

3.3 ATTACK STEP 3: DATA RECOVERY

In step 2, the outsider learned which kernel values were composed from which insider's data. In this step, the
outsider recovers the remaining private data inside these kernels, each of which is composed of one insider's
data and one private data. For example, in Fig. 8 (b), all the elements $K_{i3}$, $i \neq 3$, which are
$K_{13}$, $K_{23}$, and $K_{43}$ on the orange kernel line, are composed of one insider data (insider-data A,
whose index is 3) and one unknown private data (the data whose index is i). The question is: how can we
retrieve the unknown data from the kernel value $K_{i3}$, $i \neq 3$? This is the main focus of attack step 3.
We will continue to use the same example of Vaidya's PPSVM system [4], [5] used above to help introduce our
idea. Suppose that in steps 1 and 2, the malicious outsider has successfully collected n insiders' data,
$S_j$, where $j = 1, \dots, n$, and has also collected the kernel values, $K_{ij}$, composed of the insider's
data, $S_j$, and a non-insider's data, $D_i$. Assume that in total there are p non-insiders' data values in
$D_i$, where $i = 1, \dots, p$. $S_j$ and $D_i$ are all vectors with m elements, and $S_j(k)$ is the k-th
element of $S_j$, where $k = 1, \dots, m$. This is also true for $D_i(k)$. The goal of attack step 3 is to
deduce all the non-insiders' data, $D_1, \dots, D_p$. First, the outsider uses the following method to deduce
one non-insider's data value, $D_u$, and then proceeds to deduce all the other unknown non-insiders' data
values, one by one, in the same way.
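
To make the recovery idea concrete, consider the simplest case of a linear kernel, where each collected value $K_{uj} = S_j^T D_u$ is one linear equation in the m unknown elements of $D_u$. With m linearly independent insider vectors, the unknown record follows from solving that linear system. The sketch below is only an illustration under these assumptions (linear kernel, exactly m independent insiders); the helper names and numbers are hypothetical, and other kernels would need a different derivation.

```python
from fractions import Fraction


def dot(a, b):
    """Linear kernel: inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def solve_linear_system(A, b):
    """Solve A x = b by Gauss-Jordan elimination in exact arithmetic.

    A is a square list of rows. Raises ValueError if A is singular.
    """
    n = len(A)
    # Augmented matrix [A | b] with exact fractions.
    M = [[Fraction(v) for v in row] + [Fraction(rhs)]
         for row, rhs in zip(A, b)]
    for col in range(n):
        # Find a row at or below the diagonal with a non-zero pivot.
        pivot = next((r for r in range(col, n) if M[r][col] != 0), None)
        if pivot is None:
            raise ValueError("insider vectors are linearly dependent")
        M[col], M[pivot] = M[pivot], M[col]
        pv = M[col][col]
        M[col] = [v / pv for v in M[col]]          # normalize pivot row
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[r][n] for r in range(n)]


# Hypothetical toy data with m = 3 elements per record.
insiders = [[1, 2, 0], [0, 1, 3], [2, 0, 1]]   # S_1, S_2, S_3 (known)
secret = [7, 4, 9]                              # D_u (unknown to attacker)

# Kernel values the outsider collected and linked in steps 1 and 2.
observed = [dot(s, secret) for s in insiders]   # K_u1, K_u2, K_u3

recovered = solve_linear_system(insiders, observed)
```

Here `recovered` equals the hidden record, which shows why a handful of insider records can be enough when the kernel is linear.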
Algorithm 1: Kernel-and-Insider-Data-Linking Attack
Require: m × m kernel matrix KM, a total of m data records x_1, ..., x_m, and a total of n insiders' data
s_1, ..., s_n
for k = 1 ... n do
    {Compute K1 and K2, where K1 is the kernel value of (s_k, s_p), p ≠ k, 1 ≤ p ≤ n, and K2 is the
    kernel value of (s_k, s_q), q ≠ k, q ≠ p, 1 ≤ q ≤ n}
    Let KC1 = [], KC2 = [], l1 = 0, l2 = 0, IndexCand = [], Index = []
    for i = 1 ... m do //Search for values equal to K1 and K2 in KM
        for j = 1 ... m do
            if KM(i, j) = K1 then
                KC1(l1++) = (i, j)
            else if KM(i, j) = K2 then
                KC2(l2++) = (i, j)
            end if
        end for
    end for
    for u = 1 ... max(l1) do //Apply Principles 1 & 2 to kernel lines
        for v = 1 ... max(l2) do
            if KC1(u)[1] ≠ KC2(v)[1] & KC1(u)[2] = KC2(v)[2] then
                if no element of the array IndexCand(k) = KC1(u)[2] then
                    Insert the element KC1(u)[2] into the array IndexCand(k)
                end if
            end if
        end for
    end for
end for
for k = 1 ... n do //Apply Principle 3 to kernel lines
    if #elements of IndexCand(k) = 1 then
        Index(k) = the element of IndexCand(k)
    end if
end for
for k = 1 ... n do
    if #elements of IndexCand(k) > 1 then
        Delete all elements of IndexCand(k) that have been assigned to another Index
        Index(k) = a randomly chosen element from the remaining elements of IndexCand(k)
    end if
end for
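
The linking idea of Algorithm 1 can be sketched in Python as follows. This is a simplified illustration rather than a definitive implementation: it assumes a linear kernel with exact integer values, at least three colluding insiders, and it searches for K1 and K2 independently instead of with an else-if. All names and toy values are hypothetical.

```python
import random


def dot(a, b):
    """Linear kernel used for this illustration."""
    return sum(x * y for x, y in zip(a, b))


def link_insiders(KM, insiders, kernel=dot, rng=random):
    """Guess the row/column index in KM of each insider record."""
    m, n = len(KM), len(insiders)
    if n < 3:
        raise ValueError("this sketch needs at least three insiders")
    candidates = []
    for k in range(n):
        # Two other insiders p and q give two known kernel values.
        p, q = [t for t in range(n) if t != k][:2]
        K1 = kernel(insiders[k], insiders[p])
        K2 = kernel(insiders[k], insiders[q])
        KC1 = [(i, j) for i in range(m) for j in range(m) if KM[i][j] == K1]
        KC2 = [(i, j) for i in range(m) for j in range(m) if KM[i][j] == K2]
        cand = []
        # A column holding K1 and K2 in different rows is a candidate
        # kernel line for insider k (Principles 1 and 2).
        for i1, j1 in KC1:
            for i2, j2 in KC2:
                if i1 != i2 and j1 == j2 and j1 not in cand:
                    cand.append(j1)
        candidates.append(cand)

    index = [None] * n
    for k, cand in enumerate(candidates):       # unambiguous matches
        if len(cand) == 1:
            index[k] = cand[0]
    for k, cand in enumerate(candidates):       # Principle 3 for the rest
        if len(cand) > 1:
            taken = {v for v in index if v is not None}
            remaining = [c for c in cand if c not in taken]
            index[k] = rng.choice(remaining) if remaining else None
    return index


# Toy data: five records, three of which belong to colluding insiders.
records = [[1, 0, 3], [3, 1, 1], [5, 0, 1], [2, 2, 5], [1, 6, 0]]
KM = [[dot(a, b) for b in records] for a in records]
insiders = [records[3], records[1], records[4]]   # order unknown to KM
linked = link_insiders(KM, insiders)
print(linked)   # [3, 1, 4]
```

In this toy matrix all kernel values are distinct, so every insider has exactly one candidate index and the random tie-break is never used.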
Chapter 4

Tools/Platforms

4.1 Software Requirement:

a) Language : ASP.NET and C# .NET
b) Development platform : Visual Studio 2010
c) Operating system : Windows
d) Database : MS SQL Server 2005 or higher

4.2 Hardware Requirement:

a) Processor : Pentium 4 or above
b) RAM : 256 MB or more
c) Hard disk space : 500 MB to 1 GB
Chapter 5
Design and implementation
5.1 Data Allocation Module
The goal of this module is to increase the chances of detecting agents that leak data. The main focus of our
project is the data allocation problem: how can the distributor intelligently give data to agents in order to
improve the chances of detecting a guilty agent? The admin can send files to authenticated users, and users can
edit their account details. The agent receives the secret key details by e-mail.

5.2 Fake Object Module


The distributor creates and adds fake objects to the data that he distributes to agents. Fake objects are objects
generated by the distributor in order to increase the chances of detecting agents that leak data. The distributor
may be able to add fake objects to the distributed data in order to improve his effectiveness in detecting guilty
agents. Our use of fake objects is inspired by the use of trace records in mailing lists. If a wrong secret key
is given to download a file, a duplicate (fake) file is opened instead, and the fake details are also sent by
e-mail. For example, the fake object details are displayed.
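
The fake-object idea can be sketched as follows: every agent receives the real records plus a few fake records that exist only in that agent's copy, so a fake record found in a leak points back to its recipient. This is a minimal illustration and not the project's C# implementation; the function names and the record format are assumptions.

```python
def allocate(real_records, agents, fakes_per_agent=2):
    """Give each agent the real records plus fake records unique to it.

    Returns the per-agent allocation and a lookup table that maps every
    fake record back to the agent who received it.
    """
    allocation, fake_owner = {}, {}
    for agent in agents:
        fakes = [f"FAKE-{agent}-{i}" for i in range(fakes_per_agent)]
        for fake in fakes:
            fake_owner[fake] = agent
        allocation[agent] = list(real_records) + fakes
    return allocation, fake_owner


def suspects(leaked, fake_owner):
    """Agents whose fake records appear in the leaked data."""
    return sorted({fake_owner[r] for r in leaked if r in fake_owner})


allocation, fake_owner = allocate(["r1", "r2", "r3"], ["alice", "bob"])

# A leak that contains one of bob's fake records implicates bob.
leaked = ["r1", "r3", "FAKE-bob-0"]
print(suspects(leaked, fake_owner))   # ['bob']
```

A leak that contains only real records implicates nobody in this sketch, which is why the distributor also needs a way to estimate the likelihood of guilt.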

5.3 Optimization Module


In the optimization module, the distributor's data allocation to agents has one constraint and one objective.
The distributor's constraint is to satisfy the agents' requests, by providing them with the number of objects
they request or with all available objects that satisfy their conditions. His objective is to be able to detect
an agent who leaks any portion of his data. Users can lock and unlock files for security.

5.4 Data Distributor Module


A data distributor has given sensitive data to a set of supposedly trusted agents (third parties). Some of the
data is leaked and found in an unauthorized place (e.g., on the web or on somebody's laptop). The distributor
must assess the likelihood that the leaked data came from one or more agents, as opposed to having been
independently gathered by other means. The admin can view which file is being leaked, as well as the fake
users' details.

5.5 Fake Agent Module


To compute this likelihood, we need an estimate for the probability that values in the leaked set S can be
guessed by the target. For instance, say some of the objects in the distributor's data set T are e-mails of
individuals. We can conduct an experiment and ask a person with approximately the expertise and resources of
the target to find the e-mails of, say, 100 individuals. If this person can find, say, 90 e-mails, then we can
reasonably guess that the probability of finding one e-mail is 0.9. On the other hand, if the objects in
question are bank account numbers, the person may only discover, say, 20, leading to an estimate of 0.2. We
call this estimate $p_t$, the probability that object t can be guessed by the target. To simplify the formulas
that we present in the rest of the paper, we assume that all T objects have the same $p_t$, which we call p.
Our equations can be easily generalized to diverse $p_t$ values, though they become cumbersome to display.
Next, we make two assumptions regarding the relationship among the various leakage events. The first assumption
simply states that an agent's decision to leak an object is not related to other objects.
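
The role of p can be illustrated with a small calculation. The report does not state the resulting formula, so the sketch below assumes the guilt model commonly used in the data leakage detection literature: each leaked object was either guessed by the target (probability p) or leaked by exactly one of the agents holding it, each being equally likely, and objects are treated independently. The function names and the toy allocation are hypothetical.

```python
def estimate_p(found, asked):
    """Estimate p from an experiment, e.g. 90 of 100 e-mails found."""
    return found / asked


def guilt_probability(leaked, allocations, agent, p):
    """Probability that `agent` leaked at least one object in `leaked`.

    Assumed model (not stated in this report): for each leaked object t
    held by the agent, the chance that this agent leaked it is
    (1 - p) / (number of agents holding t).
    """
    prob_innocent = 1.0
    for obj in leaked:
        holders = [a for a, objs in allocations.items() if obj in objs]
        if agent in holders:
            prob_innocent *= 1 - (1 - p) / len(holders)
    return 1 - prob_innocent


# Toy allocation: agent A1 holds t1 and t2, agent A2 holds only t1.
allocations = {"A1": {"t1", "t2"}, "A2": {"t1"}}
leaked = {"t1", "t2"}

p = estimate_p(20, 100)     # hard-to-guess objects such as account numbers
guilt_a1 = guilt_probability(leaked, allocations, "A1", p)
guilt_a2 = guilt_probability(leaked, allocations, "A2", p)
```

With these toy numbers A1 comes out far more suspicious than A2, because only A1 received t2 and a low p makes an independent guess unlikely.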
Chapter 6
FUTURE SCOPE

In this project we use HTML, which can run on different platforms, but we currently run it on a single
platform, the server that we created. In the future we may add support for other programming languages
such as PHP.
Chapter 7
Snapshots
7.1 Step 1 : Start Visual Studio on the system.

7.2 Step 2 : Select the file from the database.

7.3 Step 3 : The system is ready to work.
7.4 Step 4 : The admin can log in to the system.
7.5 Step 5 : The admin can register and view the distributors.
7.6 Step 6 : A distributor can log in to the system with his/her authority.
7.7 Step 7 : The distributor provides the secret code and distributes the files to the agent.
7.8 Step 8 : The agent can register himself/herself by using the secret code.
References

[1] Insider Collusion Attack on Privacy-Preserving Kernel-Based Data Mining Systems.

[2] C. Bhatt and R. Sharma, "Data leakage detection," Int. J. Comput. Sci. Inf. Technol. (IJCSIT),
vol. 5, no. 2, pp. 2556-2558, 2014.

[3] P. S. Wang, S.-W. Chen, C.-H. Kuo, C.-M. Tu, and F. Lai, "An intelligent dietary planning mobile system
with privacy-preserving mechanism," in Proc. IEEE Int. Conf. Consum. Electron. (ICCE), Jun. 2015,
pp. 336-337.

[4] 2015 Verizon Data Breach Investigations Report, Verizon, Bedminster, NJ, USA, 2015.

[5] S. Hartley, Over 20 Million Attempts to Hack Into Health Database. Auckland, New Zealand: The New
Zealand Herald, 2014.

[6] C.-C. Chang and C.-J. Lin, "LIBSVM: A library for support vector machines," ACM Trans. Intell. Syst.
Technol., vol. 2, no. 3, Apr. 2011, Art. no. 27. [Online]. Available:
http://www.csie.ntu.edu.tw/~cjlin/libsvm

[7] J. Vaidya, H. Yu, and X. Jiang, "Privacy-preserving SVM classification," Knowl. Inf. Syst., vol. 14,
no. 2, pp. 161-178, Feb. 2008.

[8] L. Xu, C. Jiang, J. Wang, J. Yuan, and Y. Ren, "Information security in big data: Privacy and data
mining," IEEE Access, vol. 2, pp. 1149-1176, Oct. 2014.

[9] F. Liu, W. K. Ng, and W. Zhang, "Encrypted SVM for outsourced data mining," in Proc. IEEE Int. Conf.
Cloud Comput. (CLOUD), Jun./Jul. 2015, pp. 1085-1092.

[10] D. Vizár and S. Vaudenay, "Cryptanalysis of chosen symmetric homomorphic schemes," Studia Sci. Math.
Hungarica, vol. 52, no. 2, pp. 288-306, 2015.

[11] M. Gönen and E. Alpaydın, "Multiple kernel learning algorithms," J. Mach. Learn. Res., vol. 12,
pp. 2211-2268, Jul. 2011.

[12] S. Goryczka, L. Xiong, and B. C. M. Fung, "m-privacy for collaborative data publishing," IEEE Trans.
Knowl. Data Eng., vol. 26, no. 10, pp. 2520-2533, Oct. 2014.

[13] K.-P. Lin and M.-S. Chen, "Privacy-preserving outsourcing support vector machines with random
transformation," in Proc. ACM SIGKDD Int. Conf. Knowl. Discovery Data Mining, 2010, pp. 363-372.

[14] S. R. M. Oliveira and O. R. Zaïane, "Privacy preserving clustering by data transformation," J. Inf. Data
Manage., vol. 1, no. 1, p. 37, 2010.