1 Introduction
Humans have always aspired to develop techniques that can replace human effort to a
great extent. In the present era, machine learning and deep learning are superseding other
techniques. Machine learning is needed wherever a machine can be trained from data
instead of being explicitly programmed. It has empowered many domains such as web
search, text recognition, speech recognition, medicine (e.g., protein structure estimation),
network traffic analysis and prediction, and intrusion detection. Network traffic analysis
is one of the emerging domains: an attack can be predicted from the current network
traffic flow, and intruders can be stopped before they actually attack the network. This
can be done by training a machine learning model on network data. There are three
categories of machine learning: supervised, unsupervised and semi-supervised [1]. This
paper focuses on the Support Vector Machine (SVM), a supervised machine learning
technique, for network traffic classification. Network traffic classification using SVM
can follow two approaches: binary (two-way) classification and multi-class classification
[2]. The first approach simply classifies the traffic as normal or anomalous. The second
approach can be applied using two sub-approaches, i.e. (a) mapping multiple classes to
individual binary classes; (b) directly solving the multi-class problem. In this paper, the
first sub-approach is used for multi-class traffic classification [2].
A classifier is an algorithmic technique used to implement classification [3].
Classification techniques can be applied either actively, to data collected on site, or
passively, to an already built dataset. Network traffic collection tools such as Iris,
NetIntercept, tcpdump, Snort and Bro are widely available [4, 5], as are online stores of
network traffic datasets for analysis [6]. Network traffic files are generally stored in
packet capture (.pcap) format, which can subsequently be converted to the format
required for analysis. These files consist of features describing the type of traffic. For
classification of network traffic, the most relevant features are selected out of the full
feature set, and classification is then performed on the reduced feature set. Reducing the
features may lessen the computation time and positively affect the accuracy of the
learning technique [7]. Various models have been proposed for feature selection: the
wrapper and filter methods [7], correlation-based feature selection (CFS) [8], the
INTERACT algorithm [9], the consistency-based filter [10], the Gini covariance method
[11], information gain attribute evaluation [12], etc. The wrapper method aims to select
the feature subset with the highest predictive power that optimizes the classifier, whereas
the filter method selects the best possible feature subset from the dataset irrespective of
classifier optimization. CFS aims to select features that are highly correlated with the
class and least correlated with the remaining features. INTERACT inspects the
contribution of each individual feature to the whole dataset and how its removal affects
consistency; the contribution is computed from the ratio of information gain (IG) to
entropy, known as symmetrical uncertainty (SU) [13]. Information gain aims to
determine the maximum information obtained from a particular feature. The Gini
covariance method checks the variability of each feature and assigns ranks using a
spatial ranking method; features within a particular threshold are selected and the rest are
rejected. Information gain attribute evaluation determines the best possible feature
(attribute) in the dataset.
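To make the information gain criterion concrete, the following is a minimal sketch (not code from the paper) that computes IG for a discrete feature as the reduction in label entropy after splitting on that feature; the toy labels and feature values are invented for illustration:

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy H(Y) of a label sequence."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """IG(Y; X) = H(Y) - H(Y | X): reduction in label entropy
    after splitting the data on a discrete feature."""
    n = len(labels)
    # Group labels by feature value, then take the weighted entropy of each group.
    groups = {}
    for x, y in zip(feature_values, labels):
        groups.setdefault(x, []).append(y)
    conditional = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - conditional

# Toy example: a perfectly informative feature vs. an uninformative one.
labels  = ["normal", "normal", "attack", "attack"]
useful  = ["tcp", "tcp", "udp", "udp"]   # splits the classes exactly
useless = ["a", "b", "a", "b"]           # tells us nothing about the class
print(information_gain(useful, labels))   # 1.0
print(information_gain(useless, labels))  # 0.0
```

A feature whose values align exactly with the class labels attains the maximum IG (here 1 bit), while a feature independent of the labels attains 0; ranking features by such scores is the basis of the selection step described above.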
Traditional binary classifiers work well with known patterns, and their accuracy is
fairly good. However, their drawback is an inability to detect novel patterns in the data.
This limitation has been addressed for anomaly detection in wireless sensor networks by
using a modified version of SVM for unknown traffic classification [14].
Another technique applied to the KDD Cup dataset is a modified and improved version
of the C4.5 decision tree classifier; in this method, new rules are derived by evaluating
the network traffic data and are then applied to detect intrusions in real time [17].
Another technique, applied to NSL-KDD (a modified version of the KDD'99 dataset),
aims to decrease the false-alarm rate and increase the detection rate by optimizing a
weighted average function [18]. A novel technique named density peaks nearest
neighbors (DPNN) has been applied to the KDD'99 Cup dataset and yields improved
accuracy over the SVM method; this approach detects unknown attacks, improving the
per-category accuracy by 15% on probe attacks and the overall efficiency by 20.688%
[23]. Other authors used a deep auto-encoder on the KDD'99 Cup dataset, constructing
multiple layers of neurons and showing improved accuracy over traditional attack
identification techniques [24]. A further work performed a two-step process on the
KDD'99 Cup dataset: feature reduction using three different techniques (gain ratio,
mutual information and correlation), followed by generating analysis scores using Naïve
Bayes, random forest, AdaBoost, SVM, bagging, kNN and stacking. Their results
showed the maximum performance from SVM with a 99.91% score, with random forest
close behind at 99.89% [25].
Fig. 1. Train and Test network traffic data statistics (KDD Cup’99)
SVM constructs a hyperplane in such a way that it precisely separates the two classes of
data. The wider the margin of the hyperplane, the better. The margin is determined by
the points closest to the hyperplane, known as support vectors. In the context of network
traffic data, traffic is either normal or anomalous, which is a binary classification
problem; multiple subclasses of anomalous traffic can be determined using multi-class
SVM. Binary classification is easy to implement, as the classifier only needs to learn
whether the traffic is normal or anomalous. To perform multi-class classification, certain
decompositions need to be considered, namely one-versus-one (OvO) and one-versus-
rest (OvR). In OvR, each class is separated from the others by a binary classifier that
distinguishes it from the remaining set of classes. In OvO, each class forms a pair with
every other class, and a classifier learns from each pairwise relationship [20]. There are
many variants of SVM, such as least squares SVM, v-SVM, nearly isotonic SVM,
bounded SVM, NPSVM and twin SVM, but this paper focuses on the multi-class
categorization property of SVM [18].
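The OvR and OvO decompositions described above can be sketched as follows; this is a minimal illustration (not code from the paper) showing how a k-class traffic problem breaks into binary sub-problems, with the class names chosen for illustration:

```python
from itertools import combinations

classes = ["normal", "dos", "probe", "r2l"]

# OvR: one binary problem per class -> k classifiers,
# each separating one class from all the rest.
ovr_problems = [(c, "rest") for c in classes]

# OvO: one binary problem per pair of classes -> k*(k-1)/2 classifiers,
# each trained only on the data of its two classes.
ovo_problems = list(combinations(classes, 2))

print(len(ovr_problems))  # 4
print(len(ovo_problems))  # 6
```

For k = 4 traffic categories, OvR needs 4 binary classifiers while OvO needs 6; OvO trains more classifiers, but each one sees less data, which is the practical trade-off between the two schemes.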
2 Methodology
Two feature-ranking methods are used: information gain and Gini covariance [11]. The
numeric values obtained using the information gain method and the Gini covariance
method lie in the ranges 2.014–0.080 and 0.483–0.011 respectively over all 41 features.
Based on the combined values of both methods, a rank is assigned to each feature, and
the 26 highest-ranked features are selected for further analysis. For these 26 selected
features, the value ranges for the information gain and Gini covariance methods are
2.014–0.214 and 0.483–0.035 respectively. The third step of data preprocessing is data
transformation, which involves two tasks: dataset file format conversion and symbolic
value conversion. The first subtask converts the dataset files into the format required by
the machine learning model. Python with the scikit-learn library is used in this paper for
data conversion; scikit-learn accepts data in CSV (comma-separated value) format for
further analysis, so all dataset files are converted to .csv format. The second subtask
replaces symbolic values with numeric values; Python code has been written for this
conversion in the train and test datasets. The data preprocessing step thus prepares the
data for analysis in the subsequent steps. The authors have selected a subset of the train
set consisting of a few attack types from all four categories.
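The symbolic-to-numeric conversion can be sketched as below. This is an illustrative version, not the paper's actual code: the column names follow the KDD Cup feature set, but the numeric codes assigned here are assumptions chosen for the example:

```python
import csv
import io

# Illustrative mappings for two of the KDD Cup symbolic columns;
# the actual codes used in the paper are not specified.
SYMBOLIC_MAPS = {
    "protocol_type": {"tcp": 0, "udp": 1, "icmp": 2},
    "flag": {"SF": 0, "REJ": 1, "S0": 2},
}

def convert_row(row, header):
    """Replace symbolic values with their numeric codes; leave other columns as-is."""
    return [SYMBOLIC_MAPS.get(col, {}).get(val, val)
            for col, val in zip(header, row)]

# A tiny in-memory stand-in for one of the converted .csv dataset files.
raw = "duration,protocol_type,flag,label\n0,tcp,SF,normal\n12,udp,REJ,neptune\n"
reader = csv.reader(io.StringIO(raw))
header = next(reader)
converted = [convert_row(r, header) for r in reader]
print(converted)  # [['0', 0, 0, 'normal'], ['12', 1, 1, 'neptune']]
```

In practice the same pass would be run over both the train and test files so that identical symbols always map to identical codes.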
2.4 Analysis
Data analysis is the process of determining the relevant information by data modeling.
In this paper, the authors have used the Support Vector Machine (SVM) supervised
machine learning technique for modeling the network traffic data. Since SVM can be
implemented for both binary and multiclass classification, multiclass SVM has been
used in this paper. It has been implemented in Python with the scikit-learn library. The
classifier used, the Support Vector Classifier (SVC), requires a set of parameters. The
most relevant is the kernel, which can take values such as rbf, linear, etc.; the default
kernel is the radial basis function (rbf). Other parameters include C = 1.0, cache_size,
coef0, class_weight, kernel, degree, gamma, decision_function_shape, verbose, etc. The
parameter decision_function_shape can take either of two values: ovr or ovo. On the full
feature set, the per-category results [DoS, U2R, R2L, normal] are 100, 66.66, 96 and
98.12 with the One-vs-One value of decision_function_shape, and 100, 60, 96 and 98.53
with One-vs-Rest. However, the results improve slightly when the analysis is performed
on the reduced-feature dataset: with One-vs-One the per-category results are 100, 67.6,
96.1 and 98.12, and with One-vs-Rest they are 100, 60.37, 96 and 98.79. The above
values are obtained using Eq. (1). The computation time decreased substantially because
the analysis is performed on the reduced feature set.
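The classifier configuration described above can be sketched as follows. This is a hedged illustration using random toy data in place of the preprocessed KDD Cup feature matrix; the parameter names (C, kernel, gamma, decision_function_shape) are scikit-learn's SVC parameters as named in the text:

```python
import numpy as np
from sklearn.svm import SVC

# Toy stand-in for the preprocessed KDD Cup data: 60 samples, 5 numeric
# features, 4 classes standing in for DoS, U2R, R2L and normal.
rng = np.random.default_rng(42)
X_train = rng.normal(size=(60, 5))
y_train = rng.integers(0, 4, size=60)

# SVC with the parameters discussed in the text; switch
# decision_function_shape to "ovo" for the One-vs-One variant.
clf = SVC(C=1.0, kernel="rbf", gamma="scale",
          decision_function_shape="ovr")
clf.fit(X_train, y_train)
preds = clf.predict(X_train)
print(preds.shape)  # (60,)
```

With decision_function_shape="ovr" the decision function returns one score per class, whereas "ovo" exposes one score per class pair; the fitted predictions are a single label per sample in either case.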
Network Traffic Classification Using Multiclass Classifier 215
The results reported in the analysis above are calculated using the simple accuracy
formula:

Accuracy = Number of correct predictions / Total number of predictions (1)

Eq. (2), expressed in terms of true positives (TP), true negatives (TN), false positives
(FP) and false negatives (FN), is preferred over Eq. (1) when calculating the accuracy of
the proposed machine learning model:

Accuracy = (TP + TN) / (TP + TN + FP + FN) (2)

The error rate, precision and recall are given by Eqs. (3), (4) and (5) respectively:

Error rate = (FP + FN) / (TP + TN + FP + FN) (3)

Precision = TP / (TP + FP) (4)

Recall = TP / (TP + FN) (5)
Using Python, the values of TP, TN, FP and FN are calculated and substituted into
Eqs. (2) to (5) to obtain the different metrics. The accuracy, error rate, precision and
recall obtained per category [DoS, U2R, R2L, normal] are [1.0, 0.55, 1.0, 0.99],
[0, 0.44, 0, 0.0020], [1., 1., 1., 1.] and [1., 0.99, 0.2, 1.] respectively. The emphasis,
however, is on maximizing TPs and minimizing FNs.
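The per-class metric computation of Eqs. (2)–(5) can be sketched as below; the label lists are toy values for illustration, not the paper's results:

```python
def per_class_metrics(y_true, y_pred, cls):
    """Accuracy, error rate, precision and recall for one class,
    treating that class as positive and all others as negative."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p != cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,                      # Eq. (2)
        "error_rate": (fp + fn) / total,                    # Eq. (3)
        "precision": tp / (tp + fp) if tp + fp else 0.0,    # Eq. (4)
        "recall": tp / (tp + fn) if tp + fn else 0.0,       # Eq. (5)
    }

y_true = ["dos", "dos", "normal", "r2l", "normal"]
y_pred = ["dos", "normal", "normal", "r2l", "normal"]
print(per_class_metrics(y_true, y_pred, "dos"))
# {'accuracy': 0.8, 'error_rate': 0.2, 'precision': 1.0, 'recall': 0.5}
```

Repeating this for each of the four categories (DoS, U2R, R2L, normal) yields the metric vectors reported above; note that accuracy and error rate always sum to 1.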
In this paper, the KDD Cup dataset has been analyzed using the multiclass SVM
supervised machine learning technique. First, data preprocessing is performed by
removing redundant rows, substituting numeric values for columns containing text data,
and reducing the features by applying appropriate feature selection techniques.
216 P. Kaur et al.
References
1. Chapelle, O., Schölkopf, B., Zien, A.: Semi-Supervised Learning. MIT Press, London
England (2006)
2. Bolón-Canedo, V., Sánchez-Maroño, N., Alonso-Betanzos, A.: Feature selection and
classification in multiple class datasets: an application to KDD Cup 99 dataset. Expert Syst.
Appl. 38, 5947–5957 (2011)
3. Kwon, D., Kim, H., Kim, J., Suh, S.C., Kim, I., Kim, K.J.: A survey of deep learning-based
network anomaly detection. Cluster Comput. 20, 1–13 (2017)
4. Pilli, E.S., Joshi, R.C., Niyogi, R.: Network forensic frameworks: survey and research
challenges. Digital Invest. 7, 14–27 (2010)
5. Kaur, P., Bijalwan, A., Joshi, R.C., Awasthi, A.: Network forensic process model and
framework: an alternative scenario. In: Singh, R., Choudhury, S., Gehlot, A. (eds.)
Intelligent Communication, Control and Devices. AISC, vol. 624, pp. 493–502. Springer,
Singapore (2018). https://doi.org/10.1007/978-981-10-5903-2_50
6. KDD Cup 1999 Data. http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html
7. Kohavi, R., John, G.H.: Wrappers for feature subset selection. Artif. Intell. 97, 273–324
(1997)
8. Doshi, M., Chaturvedi, S.K.: Correlation based feature selection (CFS) technique to predict
student performance. Int. J. Comput. Netw. Commun. (IJCNC) 6(3), 197–206 (2014)
9. Zhao, Z., Liu, H.: Searching for interacting features. In: Proceedings of the International
Joint Conference on Artificial Intelligence, pp. 1156–1167 (2007)
10. Dash, M., Liu, H.: Consistency-based search in feature selection. Artif. Intell. 151, 155–176
(2003)
11. Sang, Y., Dang, X., Sang, H.: Symmetric Gini covariance and correlation. Can.
J. Stat. 44(3), 1–20 (2016)
12. Bajaj, K., Arora, A.: Dimension reduction in intrusion detection features using discrimi-
native machine learning approach. Int. J. Comput. Sci. 10(4), 324–328 (2013)
13. Forman, G.: An extensive empirical study of feature selection metrics for text classification.
J. Mach. Learn. Res. 3, 1289–1305 (2003)
14. Shilton, A., Rajasegarar, S., Palaniswami, M.: Combined multiclass classification and
anomaly detection for large-scale wireless sensor networks. In: IEEE Eighth International
Conference on Intelligent Sensors, Sensor Networks and Information Processing, pp. 491–
496. IEEE Press, New York (2013)
15. Sarasamma, S., Zhu, Q., Huff, J.: Hierarchical Kohonen net for anomaly detection in
network security. IEEE Trans. Syst. Man Cybern. Part B (Cybern.) 35(2), 302–312 (2005)
16. Han, S.J., Cho, S.B.: Evolutionary neural networks for anomaly detection based on the
behavior of a program. IEEE Trans. Syst. Man Cybern. 36(3), 559–570 (2005)
17. Rajeswari, L.P., Arputharaj, K.: An active rule approach for network intrusion detection with
enhanced C4.5 algorithm. Int. J. Commun. Netw. Syst. Sci. 4, 285–385 (2008)
18. Bamakan, S.M.H., Wang, H., Yingjie, T., Shi, Y.: An effective intrusion detection
framework based on MCLP/SVM optimized by time-varying chaos particle swarm
optimization. Neurocomputing 199, 90–102 (2016)
19. Tavallaee, M., Bagheri, E., Lu, W., Ghorbani, A.A.: A detailed analysis of the KDD CUP 99
data set. In: Proceedings of the 2009 IEEE Symposium on Computational Intelligence in
Security and Defense Applications. IEEE Press, New York (2009)
20. Yukinawa, N., Oba, S., Kato, K., Ishii, S.: Optimal aggregation of binary classifiers for
multi-class cancer diagnosis using gene expression profiles. IEEE/ACM Trans. Comput.
Biol. Bioinform. 6(2), 333–343 (2009)
21. Singh, R., Kumar, H., Singla, R.K.: Analysis of feature selection techniques for network
traffic dataset. In: 2013 International Conference on Machine Intelligence and Research
Advancement, pp. 42–46 (2013)
22. Scikit-learn: machine learning in Python. http://scikit-learn.org/stable/auto_examples/svm/
plot_rbf_parameters.html
23. Li, L., Zhang, H., Peng, H., Yang, Y.: Nearest neighbors based density peaks approach to
intrusion detection. Chaos, Solitons Fractals 110, 33–40 (2018)
24. Farahnakian, F., Heikkonen J.: A deep auto-encoder based approach for intrusion detection
system. In: 20th International Conference on Advanced Communication Technology
(ICACT), pp. 178–183 (2018)
25. Kushwaha, P., Buckchash, H., Raman, B.: Anomaly based intrusion detection using filter
based feature selection on KDD-CUP 99. In: 2017 IEEE Region 10 Conference (TENCON),
Malaysia (2017)