PCA Techniques
Amal Hadri
LASTID Laboratory, Faculty of Science, Ibn Tofail University, Kenitra, Morocco.
amal.hadri2009@gmail.com

Khalid Chougdali
GREST Research Group, National School of Applied Sciences (ENSA), Kenitra, Morocco.
chougdali@gmail.com

Rajae Touahni
LASTID Laboratory, Faculty of Science, Ibn Tofail University, Kenitra, Morocco.
rtouahni@hotmail.com
Abstract-- One of the most common problems in the field of network intrusion detection is the tremendous amount of redundant and irrelevant information used to build an intrusion detection system. To overcome this problem, we have used and compared two dimensionality reduction methods, namely PCA and Fuzzy PCA, which allow us to keep just the most relevant information from the network traffic data. Then, we have applied the K-Nearest Neighbour algorithm to classify the test samples of connections into a normal or attack category. The experiments were conducted using the KDDcup99 dataset. The results obtained reveal that the Fuzzy PCA method outperforms PCA in detecting U2R and DoS (Denial of Service) attacks.

Keywords-- Dimension reduction; PCA; Fuzzy PCA; Network Security; Intrusion Detection System (IDS)

I. INTRODUCTION

Nowadays, there are several mechanisms and techniques to improve the robustness and security of computer networks. Among these techniques is the IDS (Intrusion Detection System), which can be used to detect abnormal or suspicious activities as well as targeted attacks. In other words, these mechanisms can detect any attempt to violate the security policy. Generally, intrusion detection systems are classified as either misuse-based (signature-based) or anomaly-based. The misuse-based approach identifies abnormal or suspicious behavior by comparing it to known attack signatures. For this purpose, a dataset of attack signatures is required. This approach provides good detection of well-known attacks; however, it cannot detect new or unfamiliar intrusions. On the other hand, the anomaly-based approach, introduced by Anderson [1] and Denning [2], attempts to determine the “normal” model or behavior and generates an alarm if the variation between a given observation and the normal behavior surpasses a defined threshold.

Our purpose is to make anomaly-based IDS more efficient by using techniques that reduce the high-dimensional data obtained from network traffic before applying any anomaly-based algorithm.

To address the problem of high dimensionality, a common approach is to identify the most relevant features associated with all connection records without unduly compromising the quality of the classification. The most commonly used approach, which has proven to be efficient in many application areas [3], [4], [5], [6], is Principal Component Analysis (PCA), which allows us to define the “eigenvectors” (or principal components, PCs) of the covariance matrix of the connection records distribution [7]. These eigenvectors can be considered as a set of features used to characterize the variation between all connection records. Each connection can be described by the eigenvectors corresponding to the largest eigenvalues, which capture the most variance within the set of connection records.

Unfortunately, PCA, like any other multivariate statistical method, is sensitive to outliers, missing data, and poor linear correlation between variables due to poorly distributed variables. As a result, data transformations have a large impact upon PCA [8].

To alleviate the drawbacks of PCA, one of the most illuminating methods is FPCA [9], [10] (Fuzzy Principal Component Analysis). The main goal of this technique is to fuzzify the input data, using the Fuzzy C-Means algorithm, in order to reduce the influence of outliers, and then to reformulate PCA into FPCA.

This paper is organized as follows: the second section concisely describes the two dimensionality reduction techniques, PCA and FPCA. The third section presents the approach of our system, and we will present the experimental methodology and discuss the results in
A. PCA :
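As a rough illustration of the covariance matrix in equation (3) of this section, the following NumPy sketch may help. It assumes that the M connection records are stored as the rows of an array `X`, that θ_i is the i-th mean-centered record, and that A = [θ_1 ... θ_M]; these definitions are not visible in this excerpt, so they are assumptions, as are the function names `pca_fit` and `pca_transform` and the synthetic data.

```python
import numpy as np

def pca_fit(X, n_components):
    """Fit PCA on X of shape (M, n): one record per row (assumed layout)."""
    mean = X.mean(axis=0)
    A = (X - mean).T                       # n x M matrix whose columns are theta_i
    M = X.shape[0]
    C = (A @ A.T) / M                      # n x n covariance matrix, equation (3)
    eigvals, eigvecs = np.linalg.eigh(C)   # eigh: symmetric input, ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:n_components]
    return mean, eigvecs[:, order], eigvals[order]

def pca_transform(X, mean, components):
    """Project records onto the retained principal components."""
    return (X - mean) @ components

# Synthetic data standing in for preprocessed connection records.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) * np.array([5.0, 2.0, 1.0, 0.5, 0.1])
mean, components, eigvals = pca_fit(X, n_components=2)
Z = pca_transform(X, mean, components)
print(Z.shape, eigvals)
```

Keeping only the eigenvectors with the largest eigenvalues retains the directions of greatest variance; the variance of each projected column equals the corresponding eigenvalue.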
$$C_{n \times n} = \frac{1}{M} \sum_{i=1}^{M} \theta_i \theta_i^{T} = \frac{1}{M} A A^{T} \qquad (3)$$

III. PROPOSED APPROACH OF OUR SYSTEM
Each attack type falls into one of four main categories:

- DOS: denial-of-service, e.g. syn flood;
- R2L: unauthorized access from a remote machine, e.g. guessing password;
- U2R: unauthorized access to local superuser (root) privileges, e.g. various "buffer overflow" attacks;
- Probing: surveillance and other probing, e.g. port scanning.

The test dataset consists of 311,029 connections, and it includes some attack types that do not exist in the training dataset. The datasets contain a total of 24 training attack types, with an additional 14 types in the test data only.

In this paper we work with the 10% dataset.

Step 2: Data preprocessing

The main goal of this step is to bring the attributes into a standard format before applying any dimensionality reduction. To that end, we have converted the discrete attribute values of the dataset into continuous values, following the idea used in [3]. Suppose a discrete attribute i has m possible values. We assign m coordinates to this attribute, one for every possible value. The coordinate corresponding to the attribute's value is set to 1 and the remaining coordinates are set to 0. As an illustration, consider the protocol type attribute, which can take one of the discrete values tcp, udp or icmp. Following the idea presented above, there will be 3 coordinates for this attribute. As a result, if a connection record has tcp (resp. udp or icmp) as its protocol type, then the corresponding coordinates will be (1,0,0) (resp. (0,1,0) or (0,0,1)). With this conversion, each connection record in the datasets is represented by 128 coordinates (3 different values for the protocol_type attribute, 11 different values for the flag attribute, 70 possible values for the service attribute, and 0 or 1 for the remaining 6 discrete attributes) in place of 41 attributes.

In this stage we use the two dimensionality reduction techniques PCA and FPCA in order to reduce the high dimensionality of the data (for the training and testing

Finally, a KNN [17] classifier is used to check whether these test network connections are normal or abnormal.

IV. EXPERIMENTAL METHODOLOGY AND RESULTS

This section presents the experiments and the results obtained when we implement the dimensionality reduction techniques PCA and FPCA presented above.

We have used, as a training sample, 1900 normal connections, 900 DOS, 900 Probing, 900 R2L, and 52 U2R connections randomly selected from the 10% training dataset (KDDcup99). For the testing sample we have randomly selected, from the test dataset, 900 normal connections, 900 DOS, 900 Probing, 900 R2L, and 52 U2R.

Denote the intrusions successfully classified as TP (true positives), the normal connections correctly predicted as TN (true negatives), the normal connections wrongly classified as FP (false positives), and the intrusions wrongly classified as FN (false negatives).

To evaluate performance we use four measures: detection rate (also called recall) DR, false positive rate FPR, precision, and F-measure. To get more realistic results, we have calculated the average of these measures using 10-fold cross validation:

$$DR = \frac{TP}{TP + FN} \times 100 \qquad (8)$$

$$FPR = \frac{FP}{FP + TN} \times 100 \qquad (9)$$

$$\mathrm{Precision} = \frac{TP}{TP + FP} \times 100 \qquad (10)$$

$$F\text{-}\mathrm{measure} = \frac{2 \times TP}{2 \times TP + FP + FN} \times 100 \qquad (11)$$

A powerful IDS should have a high DR, Precision and F-measure and a low FPR. To begin, we conducted two experiments in order to determine the parameters that give the maximum detection rate (and likewise the best F-measure and Precision). In the first one, we fixed the number of principal components at two and varied the number of nearest neighbors widely. As shown in Fig.1, k=2 nearest neighbors gives the optimal results, with the maximum detection rate, Precision and F-measure and the minimum FPR. The main objective of this experiment is to find the best number of principal components (PCs), which can considerably enhance the detection rate (DR).
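The measures of equations (8) to (11) and a sweep over the number of nearest neighbors can be sketched as follows. This is only an illustration under stated assumptions: labels are encoded as 1 for intrusion and 0 for normal, the data are synthetic 2-D points standing in for records projected on two principal components, ties in the vote are resolved as "intrusion", and the names `ids_measures` and `knn_predict` are hypothetical. It does not reproduce the 10-fold cross validation or the KDDcup99 samples used in the paper.

```python
import numpy as np

def ids_measures(y_true, y_pred):
    """Return DR, FPR, Precision and F-measure in percent (equations 8-11).

    Labels use an assumed binary encoding: 1 = intrusion, 0 = normal.
    """
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = int(np.sum((y_true == 1) & (y_pred == 1)))
    tn = int(np.sum((y_true == 0) & (y_pred == 0)))
    fp = int(np.sum((y_true == 0) & (y_pred == 1)))
    fn = int(np.sum((y_true == 1) & (y_pred == 0)))

    def pct(num, den):
        # Guard against an empty denominator (0.0 by convention in this sketch).
        return 100.0 * num / den if den else 0.0

    return {
        "DR": pct(tp, tp + fn),
        "FPR": pct(fp, fp + tn),
        "Precision": pct(tp, tp + fp),
        "F-measure": pct(2 * tp, 2 * tp + fp + fn),
    }

def knn_predict(X_train, y_train, X_test, k):
    """Plain k-nearest-neighbour vote using Euclidean distance."""
    X_train = np.asarray(X_train, dtype=float)
    X_test = np.asarray(X_test, dtype=float)
    y_train = np.asarray(y_train)
    dists = np.linalg.norm(X_test[:, None, :] - X_train[None, :, :], axis=2)
    nearest = np.argsort(dists, axis=1)[:, :k]
    votes = y_train[nearest]
    # Ties (possible for even k) are resolved as "intrusion"; this is an assumption.
    return (votes.mean(axis=1) >= 0.5).astype(int)

# Synthetic 2-D points standing in for records projected on two PCs.
rng = np.random.default_rng(0)
normal = rng.normal(loc=0.0, scale=1.0, size=(100, 2))
attack = rng.normal(loc=3.0, scale=1.0, size=(100, 2))
X = np.vstack([normal, attack])
y = np.array([0] * 100 + [1] * 100)
idx = rng.permutation(len(y))
train, test = idx[:150], idx[150:]

for k in (1, 2, 3, 5, 7):
    scores = ids_measures(y[test], knn_predict(X[train], y[train], X[test], k))
    print(k, scores)
```

Comparing the printed dictionaries across values of k mirrors the selection procedure described above: the preferred k is the one with high DR, Precision and F-measure and low FPR.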
In our second experiment, we fixed the number of nearest neighbors at two and varied the number of principal components in order to find the number of principal components that gives the best detection rate. As shown in Fig.2, the first and second principal components give the optimal results.

In our third experiment, we evaluated the efficiency of Fuzzy PCA in the intrusion detection field. For that, we fixed the number of principal components and the number of nearest neighbors at two and searched for the degree of membership M that gives the best results. As illustrated in Fig.3, M = 9 gives the best results.
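To give an intuition for how a fuzziness parameter changes the weighting of samples, the sketch below computes standard Fuzzy C-Means memberships for a fixed set of cluster centers. It assumes that the parameter M above plays the role of the Fuzzy C-Means fuzzifier exponent m, which this excerpt does not confirm; the function name `fcm_memberships`, the data and the centers are hypothetical, and this is not the authors' FPCA formulation.

```python
import numpy as np

def fcm_memberships(X, centers, m):
    """Fuzzy C-Means membership of each sample in each cluster (m > 1).

    u[i, j] = d[i, j]^(-2/(m-1)) / sum_k d[i, k]^(-2/(m-1)),
    where d[i, j] is the Euclidean distance from sample i to center j.
    """
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    d = np.maximum(d, 1e-12)  # avoid division by zero when a sample sits on a center
    inv = d ** (-2.0 / (m - 1.0))
    return inv / inv.sum(axis=1, keepdims=True)

# Two illustrative samples and two fixed centers (hypothetical values).
X = np.array([[1.0, 0.0], [3.5, 0.0]])
centers = np.array([[0.0, 0.0], [4.0, 0.0]])
for m in (2, 5, 9):
    print(m, fcm_memberships(X, centers, m).round(3))
```

With a larger exponent the memberships move closer to uniform, so no single sample is assigned to a cluster with near-total weight. In a fuzzy PCA setting, memberships of this kind would typically be used to down-weight samples that fit the data poorly, such as outliers.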
In accordance with the two experiments mentioned above, we fixed the number of nearest neighbors and the number of PCs at their optimal values in order to compute the detection rate for every type of attack.

In the next experiments, we compare the two methods, PCA and Fuzzy PCA. As shown in Fig.4 and Fig.5, FPCA outperforms PCA at the first and second principal components in detecting attacks. However, PCA gives a lower FPR than FPCA.
To get more realistic results, we compared the detection rates of every type of attack for PCA and FPCA, as shown in Table II. The detection rates of FPCA for DOS and U2R attacks are overall better than those of PCA.

with PCA, the FPCA method has a worse false alarm rate even if it has a better detection rate.