Network Traffic Classification Using Multiclass Classifier

Prabhjot Kaur¹(✉), Prashant Chaudhary¹, Anchit Bijalwan¹, and Amit Awasthi²

¹ Department of Computer Science and Engineering, Uttaranchal University, Dehradun, India
info.prabh@gmail.com, kmahi1985@gmail.com, anchit.bijalwan@gmail.com
² University of Petroleum and Energy Studies, Dehradun, India
aawasthi@ddn.upes.ac.in

Abstract. This paper classifies network traffic in order to separate normal traffic from
anomalous traffic. Because network attacks fall into several classes, a multiclass model
is implemented to categorize the attacks within the anomalous traffic. A Support Vector
Machine (SVM), a supervised machine learning method, is used for the multiclass
classification. The analysis is performed on the widely used KDD Cup 99 dataset. The
dataset is first preprocessed in three steps and then analyzed with the multiclass
classifier. The results indicate that multiclass classification performs reasonably well
on this dataset.

Keywords: Multiclass classification · Normal traffic · Anomalous traffic

1 Introduction

Humans have always aspired to develop techniques that can replace human effort to a
great extent. In the current era, machine learning and deep learning are superseding other
techniques. Machine learning is needed wherever a machine can be trained from data
instead of being programmed explicitly. It has empowered many domains, such as web
search, text recognition, speech recognition, medicine (for example, protein structure
estimation), network traffic analysis and prediction, and intrusion detection. Network
traffic analysis is one of the emerging domains. An attack can be predicted from the
current network traffic flow, which can help stop intruders before they actually attack
the network. This can be done with machine learning by training a model on network
traffic. There are three categories of machine learning: supervised, unsupervised and
semi-supervised [1]. This paper focuses on the Support Vector Machine (SVM), a supervised
machine learning technique, for network traffic classification. Network traffic
classification using SVM can follow two approaches: binary (two-way) classification and
multiclass classification [2]. The first approach simply classifies network traffic as
either normal or anomalous. The second approach can be applied using two sub-approaches,
i.e. (a) mapping the multiple classes to individual binary classes;

© Springer Nature Singapore Pte Ltd. 2018


M. Singh et al. (Eds.): ICACDS 2018, CCIS 905, pp. 208–217, 2018.
https://doi.org/10.1007/978-981-13-1810-8_21

(b) solving the multiclass problem directly. In this paper, the first sub-approach is used
for multiclass traffic classification [2].
A classifier is an algorithm that implements classification [3]. Classification
techniques can be applied either actively, to data collected on site, or passively, to an
already built dataset. Network traffic collection tools are widely available, for example
Iris, NetIntercept, tcpdump, Snort and Bro [4, 5]. Online repositories of network traffic
datasets are also widely available for analysis [6]. Network traffic files are generally
stored in packet capture format (.pcap), which can subsequently be converted to the
format required for analysis. These files consist of features that characterize the type
of traffic. To classify network traffic, the most relevant features are selected from the
full feature set, and classification is then performed using the reduced feature set.
Reducing the features may lessen the computation time and positively affect the accuracy
of the classifier [7]. Various approaches to feature selection exist: wrapper and filter
methods [7], correlation-based feature selection (CFS) [8], the INTERACT algorithm [9],
the consistency-based filter [10], the Gini covariance method [11], information gain
attribute evaluation etc. [12]. The wrapper method aims to select the feature subset with
high predictive power that optimizes the classifier, whereas the filter method selects the
best possible feature subset from the dataset irrespective of the classifier. CFS aims to
select features that are highly correlated with the class and least correlated with one
another. INTERACT inspects the contribution of each individual feature to the whole
dataset and how its removal affects consistency. The contribution is computed from
symmetrical uncertainty (SU), which normalizes information gain (IG) by entropy [13].
Information gain measures how much information about the class is obtained from a
particular feature. The Gini covariance method checks the variability of each feature and
assigns ranks using a spatial ranking method. Features within a particular threshold
value are selected and those beyond it are rejected. Information gain attribute
evaluation determines the best possible features or attributes in the dataset.
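The relationship between entropy, information gain and symmetrical uncertainty can be illustrated with a minimal sketch. It assumes discrete feature values and the commonly used definition SU = 2·IG / (H(feature) + H(class)); the toy records and all function names are hypothetical and are not taken from the authors' code.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())


def information_gain(feature, labels):
    """IG(class; feature) = H(class) - H(class | feature)."""
    total = len(labels)
    conditional = 0.0
    for value in set(feature):
        subset = [lab for f, lab in zip(feature, labels) if f == value]
        conditional += (len(subset) / total) * entropy(subset)
    return entropy(labels) - conditional


def symmetrical_uncertainty(feature, labels):
    """SU = 2 * IG / (H(feature) + H(class)), a value between 0 and 1."""
    denominator = entropy(feature) + entropy(labels)
    if denominator == 0:
        return 0.0
    return 2.0 * information_gain(feature, labels) / denominator


# Hypothetical toy records: protocol type versus traffic class.
protocol = ["tcp", "tcp", "udp", "udp", "icmp", "icmp"]
label = ["normal", "normal", "normal", "attack", "attack", "attack"]

ig = information_gain(protocol, label)
su = symmetrical_uncertainty(protocol, label)
print(f"IG = {ig:.3f} bits, SU = {su:.3f}")
```

A feature with a higher IG or SU value tells more about the class, so ranking features by such a score and keeping the top-ranked ones is one way to obtain a reduced feature set.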
Traditional binary classifiers work well with known patterns, and their accuracy is
fairly good. Their drawback, however, is an inability to detect novel patterns in the
data. For anomaly detection in wireless sensor networks, this limitation has been
addressed by a modified version of SVM that classifies unknown traffic [14].

1.1 Related Work


Numerous studies have analyzed traffic using the KDD Cup’99 dataset [6]. A
computationally efficient technique, the novel multilevel hierarchical Kohonen net,
focuses on reducing the feature set and the network size. A subset of the KDD Cup’99 data
consisting of a combination of normal and anomalous traffic records is selected to train
the classifier, while the test data used to evaluate it contain more attacks than are
available in the training set [15]. A novel approach to intrusion detection based on
evolutionary neural networks has been proposed over the same KDD dataset. By learning
system-call sequences, this approach takes far less time to find better-performing neural
networks than conventional neural network approaches [16].
Another technique applied to the KDD Cup dataset is a modified and improved version of
the C4.5 decision tree classifier. In this method, new rules are derived by evaluating the
network traffic data and are then applied to detect intrusions in real time [17].
Another technique, applied to NSL-KDD, a modified version of the KDD’99 dataset, aims to
decrease the false alarm rate and increase the detection rate by optimizing a weighted
average function [18]. A novel technique named Density Peaks Nearest Neighbors (DPNN) is
applied to the KDD’99 Cup dataset to yield improved accuracy over the SVM method. This
approach detects unknown attacks, achieving a sub-category accuracy improvement of 15% on
probe attacks and an overall efficiency improvement of 20.688% [23]. A deep auto-encoder
technique has been applied to the KDD’99 Cup dataset by constructing multilayer neurons,
showing improved accuracy over traditional attack identification techniques [24]. In
[25], a two-step procedure was performed on the KDD’99 Cup dataset: feature reduction
using three different techniques, i.e. gain ratio, mutual information and correlation,
followed by scoring with Naïve Bayes, random forest, AdaBoost, SVM, bagging, kNN and
stacking. The results showed that SVM gave the maximum performance, with a score of
99.91%, closely followed by the random forest algorithm with a score of 99.89% [25].

1.2 Data Set: KDD Cup 99


The full train dataset consists of 4,898,431 records, of which 972,781 are normal
records and 3,925,650 are attack records. A vast number of these records are redundant;
after redundancy removal, the total, normal and attack record counts become 1,074,992,
812,814 and 262,178 respectively [19]. The 10% train dataset consists of 494,021
records, of which 97,278 are normal records and 396,743 are attack records. The test
dataset consists of 311,027 records, of which 60,591 are normal records and 250,436 are
attack records. The test dataset also contains a vast number of redundant records; after
redundancy removal, the total, normal and attack record counts become 77,289, 47,911 and
29,378 respectively. Two invalid records were found in the test dataset, record numbers
136,489 and 136,497, which contain the unacceptable value ICMP for the service feature;
these two records were therefore removed from the test dataset [19]. The KDD Cup 99
dataset includes four categories of attacks, which are further divided into twenty-two
subcategories, as shown in Fig. 1. The four classes of attacks present in the train
dataset are Denial of Service (DoS), User to Root (U2R), Remote to Local (R2L) and Probe.
A DoS attack denies legitimate users access to the machine, either by flooding the
network with excess traffic or by overutilizing the system resources. In U2R, an
unauthorized user gains root access to the system, thereby attaining all rights of the
superuser. R2L deals with gaining local access to the machine from a remote location by
exploiting some vulnerability. A Probe attack gathers information about the network or
system in order to circumvent its security controls [19]. The subcategories of these
attacks are depicted in Fig. 1, together with the number of attacks of each type present
in the train and test dataset files. The counts shown are those obtained after redundancy
has been removed from both the train and test datasets. The test dataset contains an
unknown traffic category as well. The total numbers of records after redundancy removal
in the train and test datasets are therefore 1,074,992 and 77,289 respectively.
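The grouping of the twenty-two training-set attack labels into the four categories can be sketched as a lookup table. The label names below follow the publicly documented KDD Cup 99 taxonomy rather than a listing given in this paper, and the helper name to_category and the "unknown" fallback for test-only attack labels are assumptions made for illustration.

```python
# Grouping of the 22 training-set attack labels into four categories,
# following the publicly documented KDD Cup 99 taxonomy (not Fig. 1 itself).
ATTACK_CATEGORIES = {
    "DoS": ["back", "land", "neptune", "pod", "smurf", "teardrop"],
    "U2R": ["buffer_overflow", "loadmodule", "perl", "rootkit"],
    "R2L": ["ftp_write", "guess_passwd", "imap", "multihop", "phf", "spy",
            "warezclient", "warezmaster"],
    "Probe": ["ipsweep", "nmap", "portsweep", "satan"],
}

# Invert the table so that each raw label points to its category.
LABEL_TO_CATEGORY = {
    label: category
    for category, labels in ATTACK_CATEGORIES.items()
    for label in labels
}


def to_category(raw_label):
    """Map a raw label such as 'smurf.' to its category.

    Labels in the raw files end with a period. Attack labels that occur
    only in the test set are mapped to 'unknown' (hypothetical fallback).
    """
    name = raw_label.rstrip(".")
    if name == "normal":
        return "normal"
    return LABEL_TO_CATEGORY.get(name, "unknown")


print(to_category("smurf."), to_category("normal."), to_category("mailbomb."))
```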

Fig. 1. Train and Test network traffic data statistics (KDD Cup’99)

1.3 Support Vector Machine


SVM is one of the most widely used classification techniques. A decade ago it was
typically used for binary classification; with the advent of its variants, however,
multiclass classification is now in frequent use. A hyperplane needs to be selected in
such a way that it precisely separates the two classes of data. The wider the margin
around the hyperplane, the better. The margin is determined by the points closest to the
hyperplane, known as support vectors. In the context of network traffic data, traffic can
be either normal or anomalous, which is a binary classification problem. Multiple
subclasses of anomalous traffic can be determined using multiclass SVM. Binary
classification is easy to implement, as the classifier needs to learn only whether the
traffic is normal or anomalous. To perform multiclass classification, certain
decomposition strategies need to be considered, i.e. one versus one (OvO) and one versus
rest (OvR). In OvR, one binary classifier per class is trained to distinguish that class
from the remaining set of classes. In OvO, a binary classifier is trained for every pair
of classes, and the pairwise decisions are combined [20]. There are many variants of SVM,
such as least squares SVM, ν-SVM, nearly-isotonic SVM, bounded SVM, NPSVM and twin SVM,
but this paper focuses on the multiclass categorization property of SVM [18].
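The difference between the two decomposition strategies can be made concrete with a small sketch. It uses scikit-learn's OneVsOneClassifier and OneVsRestClassifier wrappers around a binary SVC on synthetic four-class data; the synthetic data stand in for traffic records and are an assumption, not the dataset or code used in this paper.

```python
from sklearn.datasets import make_classification
from sklearn.multiclass import OneVsOneClassifier, OneVsRestClassifier
from sklearn.svm import SVC

# Synthetic four-class data as a stand-in for labeled traffic records.
X, y = make_classification(
    n_samples=400, n_features=10, n_informative=6, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=0,
)

# OvO trains one binary SVM for every pair of classes.
ovo = OneVsOneClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)
# OvR trains one binary SVM per class against all remaining classes.
ovr = OneVsRestClassifier(SVC(kernel="rbf", C=1.0)).fit(X, y)

n_ovo = len(ovo.estimators_)  # k * (k - 1) / 2 pairwise classifiers
n_ovr = len(ovr.estimators_)  # k one-vs-rest classifiers
print(n_ovo, n_ovr)
```

For k classes, OvO therefore needs k(k − 1)/2 binary models, each trained on two classes only, while OvR needs k models, each trained on all records.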

2 Methodology

To carry out the analysis, a formal sequence of steps has been followed. In general, it
consists of four steps: data selection, preprocessing, analysis and result evaluation
[21], as shown in Fig. 2. This generic flow model is followed in nearly every data
analysis domain, although the steps may vary with the analysis requirements. Based on
the generic model, the sequence of steps followed in this paper is shown in Fig. 3. The
four steps are data selection, data preprocessing, analysis and results, each of which
consists of sub-steps. Data selection may involve either a primary dataset collected
first-hand or a secondary dataset selected from an online repository. Data preprocessing
is subdivided into three parts: (1) removing redundancy, (2) feature selection and
(3) data transformation. The data analysis step involves extracting the relevant
information from a vast amount of data. Researchers may use different methods for data
analysis; in this paper, the supervised machine learning technique SVM has been used for
multiclass classification. The final step is obtaining the results and accuracy.

Fig. 2. Generic flow model

2.1 Experimental Setup


The experiment was run on a computer with an Intel Core i5-5200U CPU @ 2.20 GHz and
8.00 GB RAM, running the Linux operating system (Ubuntu 16.04 LTS). Python 3.6 was used
for programming, with scikit-learn and supporting libraries such as pandas and NumPy [22].

Fig. 3. Proposed step line for data analysis

2.2 Data Selection


In the data selection step, KDD Cup 99 is selected for analysis. This dataset is
described briefly in Sect. 1.2. The first step can be either data collection or data
selection. Data can be collected by deploying network traffic collection tools such as
tcpdump, NetIntercept, Snort and Bro [4, 5]. Collecting data in this way is called primary
data collection. The collected data are stored as datasets with a specific extension such
as .pcap. Such datasets are often made publicly available to researchers. Selecting a
dataset from these publicly available datasets is called secondary dataset selection [6].
The data are selected based on the researcher’s area of interest.

2.3 Data Preprocessing


Data preprocessing means cleaning the data and making them ready for further
handling. In this step, the data are adjusted to meet the input requirements of the model.
The KDD Cup 99 dataset includes many redundant rows in both the training and testing
datasets. The steps followed to preprocess data can vary, and there is no generic
sequence of steps for data preprocessing. In this paper, a three-step process is followed:
(1) removing redundancy, (2) feature selection and (3) data transformation [21]. The KDD
Cup 99 dataset comes in two versions: the complete dataset and 10 percent of the complete
dataset. The sizes of the complete train and test datasets are 4,898,431 rows × 41
features and 311,031 rows × 41 features respectively, whereas the size of the 10 percent
train dataset is 494,021 rows × 41 features. Both versions contain redundant rows. After
redundancy removal, the 10 percent train dataset contains 145,586 records. Data
redundancy may bias the classifier towards frequently occurring records. Therefore, as the
first step of data preprocessing, the redundancy in the training and testing datasets has
been removed using a Python script. The second step of data preprocessing is feature
selection. Two widely used methods are used in combination to rank the features.
These methods are information gain and Gini covariance [11]. Across all 41 features, the
values obtained with the information gain method range from 0.080 to 2.014 and those
obtained with the Gini covariance method range from 0.011 to 0.483. A rank is assigned to
each feature based on the combined values of both methods, and the 26 highest-ranked
features are selected for further analysis. For the 26 selected features, the information
gain values range from 0.214 to 2.014 and the Gini covariance values from 0.035 to 0.483.
The third step of data preprocessing is data transformation, which involves two subtasks:
dataset file format conversion and symbolic conversion. The first subtask converts the
dataset files into a format required by the machine learning model. Python with
scikit-learn is used in this paper, and the data are supplied in CSV (comma-separated
values) format; all the dataset files are therefore converted to .csv format. The second
subtask replaces the symbolic values with numeric values. Python code has been written
to perform this conversion in the train and test datasets. The data preprocessing step
thus prepares the data for analysis in the subsequent steps. The authors have selected a
subset of the train set consisting of a few attack types from all four categories.
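The redundancy removal and symbolic conversion steps described above can be sketched with pandas. The authors' own scripts are not reproduced in the paper, so the miniature table, the column subset and the integer coding scheme below are assumptions chosen for illustration; the feature selection step is omitted.

```python
import pandas as pd

# Hypothetical miniature of the KDD records; the real files have 41 features.
raw = pd.DataFrame(
    {
        "duration": [0, 0, 0, 2, 0],
        "protocol_type": ["tcp", "tcp", "icmp", "udp", "icmp"],
        "service": ["http", "http", "ecr_i", "domain_u", "ecr_i"],
        "src_bytes": [181, 181, 1032, 44, 1032],
        "label": ["normal.", "normal.", "smurf.", "normal.", "smurf."],
    }
)

# Step 1: remove redundant (exactly duplicated) rows.
deduplicated = raw.drop_duplicates().reset_index(drop=True)

# Step 3: replace each symbolic value with an integer code, column by column.
encoded = deduplicated.copy()
mappings = {}
for column in ["protocol_type", "service"]:
    categories = sorted(encoded[column].unique())
    mappings[column] = {value: code for code, value in enumerate(categories)}
    encoded[column] = encoded[column].map(mappings[column])

# Serialize the numeric table as CSV text (pass a path to write a file instead).
csv_text = encoded.to_csv(index=False)
print(csv_text)
```

Keeping the mappings built from the train set and reusing them for the test set ensures that the same symbol receives the same code in both files.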

2.4 Analysis
Data analysis is the process of determining the relevant information by data modeling.
In this paper, the authors have used the Support Vector Machine (SVM), a supervised
machine learning technique, to model the network traffic data. SVM can be implemented
for both binary and multiclass classification; multiclass SVM has been used in this
paper. It has been implemented in Python with the scikit-learn library. The classifier
used is the Support Vector Classifier (SVC), which takes a set of parameters. The most
relevant is kernel, which can take values such as rbf and linear; the default kernel is
the Radial Basis Function (rbf). Other parameters include C (set to 1.0), cache_size,
coef0, class_weight, degree, gamma, decision_function_shape and verbose. The parameter
decision_function_shape can take either of two values: ovr or ovo. With
decision_function_shape set to One vs. One, the percentage accuracies obtained per
category [DoS, U2R, R2L, normal] are 100, 66.66, 96 and 98.12. With One vs. Rest, they
are 100, 60, 96 and 98.53. The results improve slightly when the analysis is performed on
the reduced feature dataset: with One vs. One, the accuracies per category [DoS, U2R,
R2L, normal] are 100, 67.6, 96.1 and 98.12, and with One vs. Rest, they are 100, 60.37,
96 and 98.79. The above values are obtained using Eq. (1). Computation time has decreased
substantially because the analysis is performed on the reduced feature set.
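A minimal sketch of such a classifier setup is given below, using the parameters named above (kernel, C and decision_function_shape). The synthetic 26-feature data, the train/test split and the variable names are assumptions and do not reproduce the experiment. According to the scikit-learn documentation, SVC trains pairwise (one-vs-one) models internally, and decision_function_shape determines the shape of the returned decision values.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the preprocessed traffic records (assumption).
X, y = make_classification(
    n_samples=600, n_features=26, n_informative=10, n_redundant=0,
    n_classes=4, n_clusters_per_class=1, random_state=42,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=42
)

decision_shapes = {}
predictions = {}
for shape in ("ovo", "ovr"):
    clf = SVC(kernel="rbf", C=1.0, decision_function_shape=shape)
    clf.fit(X_train, y_train)
    # ovo: one column per class pair; ovr: one column per class.
    decision_shapes[shape] = clf.decision_function(X_test).shape
    predictions[shape] = clf.predict(X_test)

print(decision_shapes)
```

To train genuinely separate one-vs-rest models, a binary SVC can instead be wrapped in sklearn.multiclass.OneVsRestClassifier.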

3 Results and Discussion

The results reported in the analysis section are calculated using the simple accuracy
formula:

$$\text{Accuracy} = \frac{\text{CorrectPrediction}}{\text{NumOfTestingSamplePerCatogry}} \times 100 \tag{1}$$

NumOfTestingSamplePerCatogry denotes the number of testing samples per category. For
experimental purposes, a subset of the complete dataset is selected for analysis, and the
list NumOfTestingSamplePerCatogry holds the number of records of each category in the
selected subset. The output of the machine learning model is compared with the actual
data labels, and the number of correct predictions obtained per category is stored in the
list named CorrectPrediction.
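The per-category calculation of Eq. (1) can be sketched as follows. The label lists are hypothetical and serve only to show the arithmetic; samples_per_category and correct_per_category play the roles of NumOfTestingSamplePerCatogry and CorrectPrediction.

```python
from collections import Counter

categories = ["DoS", "U2R", "R2L", "normal"]

# Hypothetical actual and predicted labels, for illustration only.
actual = ["DoS", "DoS", "U2R", "U2R", "U2R", "R2L", "R2L",
          "normal", "normal", "normal"]
predicted = ["DoS", "DoS", "U2R", "normal", "U2R", "R2L", "R2L",
             "normal", "normal", "DoS"]

# Number of testing samples per category.
samples_per_category = Counter(actual)
# Number of correct predictions per category.
correct_per_category = Counter(a for a, p in zip(actual, predicted) if a == p)

# Eq. (1): percentage of correctly predicted samples in each category.
accuracy = {
    c: 100.0 * correct_per_category[c] / samples_per_category[c]
    for c in categories
}
print(accuracy)
```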
However, the authors felt that the accuracy measure can be improved if four quantities are
duly considered: True Positives (TPs), True Negatives (TNs), False Positives (FPs) and
False Negatives (FNs). True positives are attack records correctly labeled as attacks,
which is the ideal case, and the focus remains on maximizing TPs. True negatives are
normal network traffic records correctly labeled as normal. False positives are normal
records labeled as attacks. False negatives are attack traffic records labeled as normal
traffic [18]. The measurement terms are therefore [18, 25]:

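$$\text{Accuracy} = \frac{\text{TPs} + \text{TNs}}{\text{TPs} + \text{TNs} + \text{FPs} + \text{FNs}} \tag{2}$$

$$\text{ErrorRate} = \frac{\text{FNs} + \text{FPs}}{\text{TPs} + \text{TNs} + \text{FPs} + \text{FNs}} \tag{3}$$

$$\text{Precision} = \frac{\text{TPs}}{\text{TPs} + \text{FPs}} \tag{4}$$

$$\text{Recall} = \frac{\text{TPs}}{\text{TPs} + \text{FNs}} \tag{5}$$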

Eq. (2) is preferred over Eq. (1) for calculating the accuracy of the proposed machine
learning model. The error rate, precision and recall are given by Eqs. (3), (4) and (5)
respectively.
Using Python, the values of TPs, TNs, FPs and FNs are calculated and subsequently
substituted into Eqs. (2) to (5) to obtain the values of the different metrics. The
accuracy, error rate, precision and recall obtained per category [DoS, U2R, R2L, normal]
are [1.0, 0.55, 1.0, 0.99], [0, 0.44, 0, 0.0020], [1.0, 1.0, 1.0, 1.0] and [1.0, 0.99,
0.2, 1.0] respectively. The emphasis is on maximizing TPs and minimizing FNs.
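One possible way to obtain per-category values from Eqs. (2) to (5) is to treat each category in turn as the positive class and all other categories as negative. The paper does not state how its counts were derived, so this one-vs-rest reading, the function names and the labels below are assumptions.

```python
def binary_counts(actual, predicted, positive):
    """Count TPs, TNs, FPs and FNs with `positive` as the positive class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    tn = sum(a != positive and p != positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    return tp, tn, fp, fn


def metrics(tp, tn, fp, fn):
    """Evaluate Eqs. (2) to (5) from the four counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,                      # Eq. (2)
        "error_rate": (fn + fp) / total,                    # Eq. (3)
        "precision": tp / (tp + fp) if tp + fp else 0.0,    # Eq. (4)
        "recall": tp / (tp + fn) if tp + fn else 0.0,       # Eq. (5)
    }


# Hypothetical actual and predicted labels, for illustration only.
actual = ["DoS", "DoS", "U2R", "U2R", "U2R", "R2L", "R2L",
          "normal", "normal", "normal"]
predicted = ["DoS", "DoS", "U2R", "normal", "U2R", "R2L", "R2L",
             "normal", "normal", "DoS"]

u2r_counts = binary_counts(actual, predicted, "U2R")
u2r = metrics(*u2r_counts)
print(u2r_counts, u2r)
```

In this toy case one U2R record is missed, so recall drops to 2/3 while precision stays at 1.0, which shows why accuracy alone can hide false negatives in a rare class.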

4 Conclusion and Future Scope

In this paper, the KDD Cup dataset has been analyzed using multiclass SVM, a supervised
machine learning technique. First, the data are preprocessed by removing the redundant
rows, substituting numeric values in columns consisting of text data and reducing the
features with an appropriate feature selection technique. The dataset is then converted
into the format required by the classification technique, and a subset of the train
dataset is selected to train the classifier. The results obtained using the reduced
feature set showed a slight improvement over the results obtained with the full
41-feature analysis, while the computation time improved significantly. Furthermore,
accuracy has been derived using two different approaches, which show greater variability
in the accuracy of R2L attacks. The overall analysis thus helps to understand and apply
the multiclass problem to a fair extent. A technical limitation of the current work is
the handling of large datasets: it is computationally expensive to process the complete
dataset in one go, so subsets of the dataset are selected for analysis. As future scope,
cross-validation can be performed by taking various folds of the dataset. Learner rules
can be derived, which can subsequently be used in intrusion detection systems. The SVM
multiclass model can also be applied with different kernel functions to check its
accuracy on the given dataset.

References
1. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press, London
England (2006)
2. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Feature selection and
classification in multiple class datasets: an application to KDD Cup 99 dataset. Expert Syst.
Appl. 38, 5947–5957 (2011)
3. Kwon, D., Kim, H., Kim, J., Suh, S.C., Kim, I., Kim, K.J.: A survey of deep learning-based
network anomaly detection. Cluster Comput. 20, 1–13 (2017)
4. Pilli, E.S., Joshi, R.C., Niyogi, R.: Network forensic frameworks: survey and research
challenges. Digital Invest. 7, 14–27 (2010)
5. Kaur, P., Bijalwan, A., Joshi, R.C., Awasthi, A.: Network forensic process model and
framework: an alternative scenario. In: Singh, R., Choudhury, S., Gehlot, A. (eds.)
Intelligent Communication, Control and Devices. AISC, vol. 624, pp. 493–502. Springer,
Singapore (2018). https://doi.org/10.1007/978-981-10-5903-2_50
6. KDD Cup 1999 Data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
7. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324
(1997)
8. Doshi, M., Chaturvedi, S.K.: Correlation based feature selection (CFS) technique to predict
student performance. Int. J. Comput. Netw. Com. (IJNC) 6(3), 197–206 (2014)
9. Zhao, Z., Liu, H.: Searching for interacting features. In: Proceedings of international joint
conference on artificial intelligence, 1156–1167 (2007)
10. Dash, M., Liu, H.: Consistency-based search in feature selection. Artif. Intell. 151, 155–176
(2003)
11. Sang, Y., Dang, X., Sang, H.: Symmetric Gini Covariance and Correlation version. Can.
J. Stat. 44(3), 1–20 (2016)
12. Bajaj, K., Arora, A.: Dimension reduction in intrusion detection features using discrimi-
native machine learning approach. Int. J. Comput. Sci. 10(4), 324–328 (2013)
13. Forman, G.: An extensive empirical study of feature selection metrics for text classification.
J. Mach. Learn. Res. 3, 1289–1305 (2003)
14. Shilton, A., Rajasegarar, S., Palaniswami, M.: Combined multiclass classification and
anomaly detection for large-scale wireless sensor networks. In: IEEE Eighth International
Conference on Intelligent Sensors, Sensor Networks and Information Processing,
pp. 491–496. IEEE Press, New York (2013)
15. Sarasamma, S., Zhu, Q., Huff, J.: Hierarchical Kohonen net for anomaly detection in
network security. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 35(2), 302–312 (2005)
16. Han, S.J., Cho, S.B.: Evolutionary neural networks for anomaly detection based on the
behavior of a program. IEEE Trans. Syst. Man Cybern. 36(3), 559–570 (2005)
17. Rajeswari, L.P., Arputharaj, K.: An active rule approach for network intrusion detection with
enhanced C4.5 algorithm. Int. J. Commun. Netw. Syst. Sci. 4, 285–385 (2008)
18. Bamakan, S.M.H., Wang, H., Yingjie, T., Shi, Y.: An effective intrusion detection
framework based on MCLP/SVM optimized by time-varying chaos particle swarm
optimization. Neurocomputing 199, 90–102 (2016)
19. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99
data set. In: Proceedings of the 2009 IEEE Symposium on Computational Intelligence in
Security and Defense Applications. IEEE Press, New York (2009)
20. Yukinawa, N., Oba, S., Kato, K., Ishii, S.: Optimal aggregation of binary classifiers for
multi-class cancer diagnosis using gene expression profiles. IEEE/ACM Trans Comput.
Biol. Bioinform. 6(2), 333–343 (2009)
21. Singh, R., Kumar, H., Singla, R.K.: Analysis of feature selection techniques for network
traffic dataset. In: 2013 International Conference on Machine Intelligence and Research
Advancement, pp. 42–46 (2013)
22. Scikit-learn: machine learning in Python.
http://scikit-learn.org/stable/auto_examples/svm/plot_rbf_parameters.html
23. Li, L., Zhang, H., Peng, H., Yang, Y.: Nearest neighbors based density peaks approach to
intrusion detection. Chaos, Solitons Fractals 110, 33–40 (2018)
24. Farahnakian, F., Heikkonen, J.: A deep auto-encoder based approach for intrusion detection
system. In: 20th International Conference on Advanced Communication Technology
(ICACT), pp. 178–183 (2018)
25. Kushwaha, P., Buckchash, H., Raman, B.: Anomaly based intrusion detection using filter
based feature selection on KDD-CUP 99. In: 2017 IEEE Region 10 Conference (TENCON),
Malaysia (2017)
