1. Introduction
Privacy preserving in data mining means hiding the output knowledge of data mining by using several methods when this output knowledge is valuable and private. There have been two types of privacy concerning data mining. The first type of privacy, called input privacy, is that the data is manipulated in such a way that the mining results are not affected or are minimally affected [1].
2. Problem Definition
As mentioned earlier, almost all of the studies proposed to hide association rules were carried out on binary-valued data. However, in the real world, data usually contains quantitative values. For example, in a patient's blood test, many attributes can be found. Rather than the mere presence of an attribute, however, its quantity in the blood is what matters for the determination of illness. Everyone has cholesterol, but this alone does not mean that one is sick or not; the only criterion for the determination of illness is a surplus in the cholesterol quantity. With previously presented data mining methods, whether they use binary values or quantitative values, it is possible to extract association rules that threaten personal privacy. Especially in medical institutions, there are databases including very extensive information about patients. Possible ill-purposed usage of those databases threatens the personal privacy of patients. There are some recent examples of this type of usage. For example, Kaiser, one of the most important medical institutes of the United States, sent 858 e-mail messages by mistake. Those messages contained the IDs of users, their questions, and the answers regarding their illnesses, and all of those messages were sent to wrong receivers (Washington Post, 10 August 2000). In another example, Global Health Trax, an online firm selling health products, disclosed names, home phone numbers, bank account numbers and credit card data through their website by mistake (MSNBC, 19 January 2000).
Input: (1) a source database D, (2) a min_support, (3) a min_confidence, (4) a set of predicting items X.
Output: a transformed database from which the critical rules cannot be mined.
1. Fuzzify the database: D → F.
2. In the fuzzified database F, calculate the support value of every item f ∈ F.
3. If every f(support) < min_support, EXIT. // there is no rule
4. Find the large 2-itemsets in F.
5. For each of X's large 2-itemsets:
6.   Calculate the support value of the rule, U = min(U_L, U_R).
7.   If U(support) > min_support:
8.     Calculate U(confidence).
9.     If U(confidence) > min_confidence:
10.      For each T_L of the rule:
11.        If T_L < 0.5 and T_L < T_R:
12.          T_L = 1 − T_L
13.        end // if
14.      end // for each
15.    end // if
16.    Re-calculate the confidence value of the rule U.
17.    If U(confidence) > min_confidence:
18.      Remove all T_L changes and return them to their initial form.
19.    end // if
20. end // for each
21. Prune the database: F → D.
22. end.
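The hiding loop in the pseudocode above can be sketched in Python. This is a minimal illustration, assuming the fuzzified database is a list of dicts mapping region labels (e.g. "Ao", "Bz") to membership values, with lhs/rhs standing for the T_L/T_R sides of one candidate rule:

```python
# Sketch of the hiding loop (pseudocode steps 10-19): complement small
# LHS memberships, then roll back if the rule is still not hidden.
def hide_rule(fdb, lhs, rhs, min_confidence):
    def support(region):
        return sum(t.get(region, 0.0) for t in fdb)

    def rule_support():
        # min t-norm: a transaction supports the rule with the smaller
        # of its LHS and RHS memberships
        return sum(min(t.get(lhs, 0.0), t.get(rhs, 0.0)) for t in fdb)

    changed = []                        # (transaction, old value) for step 18
    for t in fdb:
        tl, tr = t.get(lhs, 0.0), t.get(rhs, 0.0)
        if tl < 0.5 and tl < tr:        # step 11
            t[lhs] = 1.0 - tl           # step 12: T_L = 1 - T_L
            changed.append((t, tl))
    sup = support(lhs)                  # step 16: re-calculate confidence
    confidence = rule_support() / sup if sup else 0.0
    if confidence > min_confidence:     # steps 17-18: undo, hiding failed
        for t, old in changed:
            t[lhs] = old
        return False
    return True
```

The function restores the original memberships and returns False when complementing the qualifying T_L values is not enough to push the confidence below the threshold.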
The algorithm extracts the association rules that satisfy the user-specified minimum support and confidence values as follows.

STEP 1: Transform the quantitative value v_j of each item i_j into a fuzzy set represented as (f_j1/R_j1 + f_j2/R_j2 + ... + f_jl/R_jl) by using the given membership functions for item quantities, where l is the number of regions for i_j.

STEP 2: Calculate the count of each attribute region (linguistic term) R_jk in the transaction data as:

count_jk = Σ_{i=1}^{n} f_jk^(i)

STEP 3: Check whether count_jk/n of each R_jk is larger than or equal to the predefined minimum support value. If R_jk satisfies this condition, put it in the set of large 1-itemsets (L1).

STEP 4: Join the large itemsets in L1 to generate the candidate set C2. Two regions belonging to the same item cannot simultaneously exist in an itemset in C2.

STEP 5: Calculate the fuzzy value of each itemset I = (I_1, I_2) in each transaction as f_I = min_{j=1,2} f_Ij. Then calculate the fuzzy count of I as:

count_I = Σ_{i=1}^{n} f_I^(i)

STEP 6: According to the user-specified minimum confidence value, rules are extracted. The confidence value of a rule Ao → Bo is computed as:

Confidence(Ao → Bo) = Support(Ao ∪ Bo) / Support(Ao)

STEP 7: Critical rules are determined. Then, in order to hide these rules, their confidence values are decreased by one of two strategies. The first is to increase the support count of Ao, i.e., the LHS of the rule, without increasing the support count of Ao ∪ Bo. The second is to decrease the support count of the itemset Ao ∪ Bo.

An Example:

In this section, an example is given to demonstrate the proposed algorithm. This simple example shows how the proposed algorithm can be used to hide critical fuzzy association rules from a set of quantitative transaction data.

Table 1. The set of 5 quantitative transaction data

TID  T1  T2  T3  T4  T5
A    10   3   6   7  11
B     5  11   3   5   4
C     8   6   9   8   7
D     3  14  13  12  10
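The fuzzification and the support/confidence computations of the steps above can be sketched in Python. The triangular membership functions below, peaked at 5, 10 and 15 with half-width 5, are an assumption inferred from the fuzzy values in Table 2; the paper's actual membership functions are not reproduced here:

```python
# Sketch of STEPs 1-6: fuzzify quantities into the regions z, o, b and
# compute fuzzy support and confidence with the min t-norm.
PEAKS = {"z": 5.0, "o": 10.0, "b": 15.0}   # assumed region centres

def tri(v, peak, width=5.0):
    """Triangular membership: 1 at peak, 0 beyond peak +/- width."""
    return max(0.0, 1.0 - abs(v - peak) / width)

def fuzzify_row(row):
    """STEP 1: {"A": 10, "B": 5} -> {"Az": 0.0, "Ao": 1.0, "Ab": 0.0, ...}."""
    return {item + region: tri(v, peak)
            for item, v in row.items()
            for region, peak in PEAKS.items()}

def support(fdb, itemset):
    """STEPs 2/5: fuzzy count of an itemset such as ["Ao", "Bz"]."""
    return sum(min(t[r] for r in itemset) for t in fdb)

def confidence(fdb, lhs, rhs):
    """STEP 6: Support(LHS u RHS) / Support(LHS)."""
    return support(fdb, lhs + rhs) / support(fdb, lhs)
```

Applied to the A and B columns of Table 1, this sketch yields Support(AoBz) = 2.4 and Confidence(Ao → Bz) = 100%, matching the example below.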
In this example, each item has three fuzzy regions: z, o and b. Thus, three fuzzy membership values are produced for each item according to the predefined membership functions. Note that, for simplicity, the same set of membership functions is used for all the items. The minimum support and confidence values are set at 2.2 and 75%, respectively. The fuzzification of the transaction data is given in Table 2. If we take Ao → Bz (Support = 2.4 and Confidence = 100%) as a critical rule, our aim is to decrease the value of Confidence(AoBz), either by decreasing the value of Support(AoBz) or by increasing the value of Support(Ao). In order to increase the value of Support(Ao), we subtract Ao from 1 whenever Ao is lower than Bz.
Table 2. The fuzzified transaction data

      T1   T2   T3   T4   T5   Count
Az    0    0.6  0.8  0.6  0    2.0
Ao    1    0    0.2  0.4  0.8  2.4
Ab    0    0    0    0    0.2  0.2
Bz    1    0    0.6  1    0.8  3.4
Bo    0    0.8  0    0    0    0.8
Bb    0    0.2  0    0    0    0.2
Cz    0.4  0.8  0.2  0.4  0.6  2.4
Co    0.6  0.2  0.8  0.6  0.4  2.6
Cb    0    0    0    0    0    0
Dz    0.6  0    0    0    0    0.6
Do    0    0.2  0.4  0.6  1    2.2
Db    0    0.8  0.6  0.4  0    1.8
Confidence(AoBz) = Support(AoBz) / Support(Ao) = 2.4 / 2.4 = 100%
To hide the critical fuzzy association rule Ao → Bz, the fuzzy value of Ao in transaction T3 is modified from 0.2 to 1 − 0.2 = 0.8. The new database is shown in Table 4.
Table 3. Fuzzy values of items Ao and Bz

      T1  T2  T3   T4   T5   Count
Ao    1   0   0.2  0.4  0.8  2.4
Bz    1   0   0.6  1    0.8  3.4

Table 4. Fuzzy values of items Ao and Bz after the modification of T3

      T1  T2  T3   T4   T5   Count
Ao    1   0   0.8  0.4  0.8  3.0
Bz    1   0   0.6  1    0.8  3.4
Confidence(AoBz) = Support(AoBz) / Support(Ao) = 2.8 / 3.0 ≈ 93.3%
If this is not enough to hide the relevant rule, then in the transactions where Ao equals 0, we also subtract the value of Ao from 1. The procedure is denoted in Table 5 and Table 6. Thus we obtain a smaller confidence value, Confidence(AoBz) = 2.8 / 4.0 = 70%; since this value is lower than the minimum confidence value, the rule is hidden from the user.

After modifying the values, as can easily be seen in Table 7, the sum of the fuzzy values in transactions T2 and T3 exceeds 1. In order to handle this problem, the region whose fuzzy support value was modified is subtracted from 1 and the result is written as the value of the other fuzzy region. For example, since Az + Ao = 0.8 + 0.8 = 1.6 in T3, Ao is subtracted from 1 and the result, 1 − 0.8 = 0.2, is written as the value of Az. Finally, according to the values in Table 8, the transformed database is shown in Table 9.
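The consistency repair described above might be sketched as follows; the dict layout and the treatment of more than one remaining active region are assumptions:

```python
# Repair an item's region memberships after the hiding step: if they sum
# past 1, each untouched active region is rewritten as the complement of
# the modified one (e.g. Az becomes 1 - Ao = 1 - 0.8 = 0.2).
def repair(regions, modified):
    """regions: one item's memberships, e.g. {"z": 0.8, "o": 0.8, "b": 0.0};
    modified: the region whose value was changed by the hiding step."""
    if sum(regions.values()) > 1.0:
        for r, mu in list(regions.items()):
            if r != modified and mu > 0:
                regions[r] = 1.0 - regions[modified]
    return regions
```

A row whose memberships already sum to at most 1 is returned unchanged.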
Table 5. The fuzzified database after the modification of T3

      T1   T2   T3   T4   T5   Count
Az    0    0.6  0.8  0.6  0    2.0
Ao    1    0    0.8  0.4  0.8  3.0
Ab    0    0    0    0    0.2  0.2
Bz    1    0    0.6  1    0.8  3.4
Bo    0    0.8  0    0    0    0.8
Bb    0    0.2  0    0    0    0.2
Cz    0.4  0.8  0.2  0.4  0.6  2.4
Co    0.6  0.2  0.8  0.6  0.4  2.6
Cb    0    0    0    0    0    0
Dz    0.6  0    0    0    0    0.6
Do    0    0.2  0.4  0.6  1    2.2
Db    0    0.8  0.6  0.4  0    1.8
Table 9. The transformed database

TID  T1  T2  T3  T4  T5
A    10  10   9   7  11
B     5  11   3   5   4
C     8   6   9   8   7
D     3  14  13  12  10
5. Experimental Results
In order to characterize the proposed algorithm numerically, we carried out several experiments and observed their output effects. For the experiments, a computer with an Intel Centrino 1.6 GHz processor and 512 MB of RAM, running the Windows XP operating system, was used. We applied the proposed algorithm to the Wisconsin Breast Cancer database from the UCI Machine Learning Repository [10]. The database consists of 10 attributes, one of which is categorical; we therefore ignored this categorical attribute. We conducted three different experiments. The first experiment shows the relationship between the number of total and hidden rules and the number of transactions. In this experiment, the minimum support values are taken as 24, 48, 62 and 74, respectively, while the minimum confidence value is set at 70%. The results are depicted in Fig. 2. The second experiment finds the number of total and hidden rules for different values of minimum support on a data set of 200 transactions. As can easily be seen from Fig. 3, the number of rules decreases as the minimum support value increases. The final experiment finds the number of total and hidden rules for different values of minimum confidence. The results are reported in Fig. 4, which demonstrates that the number of hidden rules rises quickly with increasing minimum confidence value, while the total number of rules decreases slowly.
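The parameter sweep of these experiments could be scripted as below; `mine_and_hide` is a hypothetical stand-in that only filters a precomputed rule list, not the paper's actual algorithm:

```python
# Sketch of the experimental sweep: count total and hidden rules while
# varying minimum support at a fixed minimum confidence of 70%.
def mine_and_hide(rules, min_support, min_confidence):
    # rules: dicts with "support", "confidence" and a "critical" flag
    total = [r for r in rules if r["support"] >= min_support]
    hidden = [r for r in total
              if r["critical"] and r["confidence"] >= min_confidence]
    return len(total), len(hidden)

def sweep(rules, supports=(24, 48, 62, 74), min_confidence=0.70):
    # one (total, hidden) pair per minimum support value
    return {s: mine_and_hide(rules, s, min_confidence) for s in supports}
```

Plotting the resulting pairs against the support values reproduces the shape of the curves discussed above.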
Fig. 2. Number of total and hidden rules versus the number of transactions.

Fig. 3. Number of total and hidden rules versus the minimum support value.

Fig. 4. Number of total and hidden rules versus the minimum confidence value.
Acknowledgements
This study was supported by Fırat University, Scientific Research Projects Office, under Grant No. FUBAP1476.
6. References
1. V. Verykios, E. Bertino, I. G. Fovino, L. P. Provenza, Y. Saygın and Y. Theodoridis, "State-of-the-art in Privacy Preserving Data Mining", SIGMOD Record, Vol. 33, No. 1, pp. 50-57, March 2004.
2. S. L. Wang, B. Parikh and A. Jafari, "Hiding Informative Association Rule Sets", Expert Systems with Applications, Vol. 33, No. 2, pp. 316-323, August 2007.
3. S. L. Wang and A. Jafari, "Using unknowns for hiding sensitive predictive association rules", IEEE International Conference on Information Reuse and Integration, pp. 223-228, 15-17 August 2005.
4. Y. Saygın, V. Verykios and C. Clifton, "Using unknowns to prevent discovery of association rules", SIGMOD Record, Vol. 30, No. 4, pp. 45-54, December 2001.
5. S. Oliveira and O. Zaiane, "Privacy preserving frequent itemset mining", IEEE International Conference on Data Mining, pp. 43-54, November 2002.
6. T. P. Hong, C. S. Kuo and S. C. Chi, "Mining association rules from quantitative data", Intelligent Data Analysis, Vol. 3, No. 5, pp. 363-376, 1999.
7. M. Kaya, R. Alhajj, F. Polat and A. Arslan, "Efficient Automated Mining of Fuzzy Association Rules", Proc. of DEXA, 2002.
8. L. A. Zadeh, "Fuzzy Sets", Information and Control, Vol. 8, pp. 338-353, 1965.
9. R. Agrawal, T. Imielinski and A. Swami, "Mining Association Rules between Sets of Items in Large Databases", Proc. of the ACM SIGMOD International Conference on Management of Data, Washington DC, May 1993.
10. http://mlearn.ics.uci.edu/databases/breast-cancerwisconsin/breast-cancer-wisconsin.data