
MALWARE DETECTION USING MACHINE LEARNING
Project Guide: Dr. IndirapriyaDarshini

Team Members:
Mohammed Uzair Mohiuddin (16881A1227)
Mohammed Taqueeuddin (16881A1226)
Challa Srikanth (16881A1211)
Thonnupuri Koushik (16881A1256)
MALWARE

What Is Malware?
• Malware is a program written with malicious
intent, designed to gain access to an
information system without the administrator's
authorization.
• Put more simply, malware is software that
helps an attacker carry out malicious or
criminal intentions.
• Malware is one of the most pressing threats that
companies and users face every day.
• Whether it arrives as a phishing email or as an
exploit delivered through the browser, malware
combined with multiple evasion methods and
other security vulnerabilities means that
today's defense systems often cannot keep up.
• Frameworks such as Veil and Shellter are
readily available; professionals use them
when conducting penetration testing, and they
are known to be quite effective.
What Is Cleanware?
Cleanware is software whose activity is not
considered malicious. It is important to separate
malware from cleanware, to ensure that an
unknown file is not malicious.
Examples of situations where this matters:
• Opening attachments in an email.
• Connecting an external hard drive or a USB stick to
your system.
• Background detection on the computer.
• Downloading a file.
Machine Learning: A Brief Introduction
Machine learning is a field that combines several
areas of mathematics (mainly statistics,
probability, and linear algebra) with
computation (algorithms, data processing,
numerical calculation).
• It is used to gain insight from data: detecting
fraud and spam, and recommending movies,
meals, and products to buy. Amazon,
Facebook, and Google are just a few of the
hundreds of companies that use machine
learning to improve their products.
• Machine learning can be split into two major
approaches: supervised learning and unsupervised
learning. In the first, the data we work with is
labeled; in the second, it is unlabeled. Malware
detection can be approached with either, but we
will focus on supervised learning, since our goal
is to classify files.
• Classification is a subdomain of supervised
learning. It can be either binary (malware / not
malware) or multi-class (cat / dog / pig / llama…).
Malware detection therefore falls under binary
classification.
The Problem Statement
• Our goal is to teach a computer, more specifically an
artificial neural network (ANN), to detect Windows malware
without relying on any explicit signature database that
we'd need to create. Instead, the network simply ingests the
dataset of malicious files we want to be able to detect
and learns from it to distinguish malicious code from
clean code, both inside the dataset itself and, most
importantly, when processing new, unseen samples.
Our only knowledge is which of those files are
malicious and which are not, but not what specifically
makes them so; we'll let the ANN do the rest.
Dataset Description
• To do this, I've collected approximately
200,000 Windows PE samples, divided evenly into
malicious (with 10+ detections on VirusTotal) and
clean (known and with 0 detections on
VirusTotal).
• Training and testing the model on the very
same dataset wouldn't make much sense (it
could perform extremely well on the training set
yet fail to generalize to new samples), so this
dataset will be automatically divided by ergo
into 3 subsets:
• A training set, with 70% of the samples, used for
training the model.
• A validation set, with 15% of the samples, used to
benchmark the model at each training epoch.
• A test set, with 15% of the samples, used to
benchmark the model after training.
• Needless to say, the number of (correctly labeled)
samples in your dataset is key to the model's
accuracy, that is, its ability to correctly separate the
two classes and generalize to unseen samples: the more
you use in your training process, the better.
• Ideally, the dataset should also be periodically
updated with newer samples and the model
retrained, to keep its accuracy high over
time even as new unique samples appear in the
wild (namely: wget + crontab + ergo).
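The 70% / 15% / 15% split described above can be sketched in plain Python. This is only an illustration of the idea, not ergo's actual implementation; the function name split_dataset, the seed, and the sample file names are assumptions made for the example.

```python
import random

def split_dataset(samples, train=0.70, validation=0.15, seed=42):
    """Shuffle the samples and split them into training, validation and test sets."""
    shuffled = list(samples)
    # A fixed seed makes the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * validation)
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])  # whatever remains is the test set

# Hypothetical dataset: (file name, label) pairs, 1 = malicious, 0 = clean.
samples = [(f"sample_{i}.exe", i % 2) for i in range(1000)]
train_set, validation_set, test_set = split_dataset(samples)
print(len(train_set), len(validation_set), len(test_set))
```

Shuffling before splitting matters when the samples are stored grouped by class; without it, one subset could end up containing only malicious or only clean files.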
Workflow of the Model
• First, we need to collect malware samples and clean samples. We
cannot work with fewer than 10k samples of each, and it is advisable
to use even more.
• We need to extract meaningful features from our samples. These
features will be the basis of our study.
• After extracting these features, we need to process all our samples
to build a dataset, which can be a database file or a CSV file. This
makes it easier to turn the data into vectors, since the algorithms
work by performing computations on vectors.
• Lastly, we need metrics. For binary classification there are many
metrics to benchmark the performance of an algorithm
(ROC/AUC, confusion matrix…). We will use a confusion matrix,
since it reports the true positives and true negatives as
well as the false positives and false negatives.
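The confusion matrix mentioned in the last step can be computed by hand, as in the sketch below. The labels are made up for illustration, and the convention 1 = malware, 0 = clean is an assumption of this example.

```python
def confusion_matrix(y_true, y_pred):
    """Count TP, TN, FP and FN for binary labels (1 = malware, 0 = clean)."""
    counts = {"TP": 0, "TN": 0, "FP": 0, "FN": 0}
    for truth, predicted in zip(y_true, y_pred):
        if predicted == 1:
            counts["TP" if truth == 1 else "FP"] += 1
        else:
            counts["TN" if truth == 0 else "FN"] += 1
    return counts

# Made-up ground truth and predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 1, 0, 0, 1]

cm = confusion_matrix(y_true, y_pred)
# Rates are derived from the counts.
false_positive_rate = cm["FP"] / (cm["FP"] + cm["TN"])
false_negative_rate = cm["FN"] / (cm["FN"] + cm["TP"])
print(cm, false_positive_rate, false_negative_rate)
```

For malware detection both error types matter: a false negative lets a malicious file through, while a false positive blocks a legitimate one.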
Machine Learning and Classification
The idea of machine learning is to let the
algorithm learn the best parameters from
the data by itself in order to make
good predictions.
There are many different applications; in
our case, we will use a machine
learning algorithm to classify binaries
as legitimate or malicious.
Steps to be Followed:
• Extract as many features as we can from
binaries to build a good training dataset.
• The features have to be integers or floats to be
usable by the algorithms.
• Identify the best features for the algorithm:
we should select the information that best
differentiates legitimate files from
malware.
• Choose a classification algorithm.
• Test the efficiency of the algorithm and identify
the false positive and false negative rates.
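As a small illustration of the first two steps, the sketch below turns the raw bytes of a file into a numeric feature vector. The two features chosen here (file size and byte entropy) and the function names are assumptions made for the example; they are not the features extracted in this project.

```python
import math
from collections import Counter

def byte_entropy(data: bytes) -> float:
    """Shannon entropy of a byte string, in bits per byte (between 0 and 8)."""
    if not data:
        return 0.0
    total = len(data)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(data).values())

def extract_features(data: bytes) -> list:
    """Map raw file bytes to numbers (an integer and a float) usable by an algorithm."""
    return [len(data), byte_entropy(data)]

# Every byte value appears equally often, so the entropy is at its maximum.
features = extract_features(bytes(range(256)) * 4)
print(features)
```

High entropy often indicates compressed or encrypted content, which is one reason entropy-style features are commonly used when analyzing binaries.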
Decision Tree
Random Forest
• Ensemble learning method
• Uses output of multiple decision trees
Naive Bayes
Linear Regression
How Does Machine Learning Work?
• Training
• Feature Extraction -> Learning Algorithms -> Generate Classifier
• Testing
• Feature Extraction -> Classifier -> Classifier Result
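The two phases above map onto the fit / predict pattern in scikit-learn. The sketch below assumes scikit-learn is installed and uses synthetic data from make_classification in place of features extracted from real files; the choice of a Random Forest and all parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic feature vectors and labels standing in for extracted features.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Training: features -> learning algorithm -> classifier.
classifier = RandomForestClassifier(n_estimators=50, random_state=0)
classifier.fit(X_train, y_train)

# Testing: features -> classifier -> classifier result.
predictions = classifier.predict(X_test)
accuracy = classifier.score(X_test, y_test)
print(predictions[:10], accuracy)
```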
PYTHON LIBRARIES
Feature Selection
The idea of feature selection is to reduce the 54
extracted features to a smaller set of features that
are the most relevant for differentiating legitimate
binaries from malware.
To explore the data, we will use pandas and then
scikit-learn, which relies heavily on the NumPy library.
A first, manual way of doing this is to look at the
different values and see whether there is a difference between the
two groups.
For instance, we can take the File Alignment parameter
(which defines the alignment of sections and is 0x200 bytes
by default) and compare its values across the two groups.
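A manual check of this kind could look like the following pandas sketch. The column names FileAlignment and legitimate, and the values in the table, are made up for illustration; the real dataset is not reproduced here.

```python
import pandas as pd

# Tiny made-up table: one row per file, legitimate = 1 for clean files, 0 for malware.
data = pd.DataFrame({
    "FileAlignment": [512, 512, 512, 4096, 512, 4096, 4096, 4096],
    "legitimate":    [1,   1,   1,   1,    0,   0,    0,    0],
})

# Count how often each FileAlignment value occurs within each group.
counts = data.groupby("legitimate")["FileAlignment"].value_counts()
print(counts)
```

If the distribution of values differs clearly between the two groups, the feature is a candidate for the classifier.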
But there is a more efficient way of selecting features:
some algorithms have been developed to identify the
most informative features and reduce the dimensionality
of the dataset (see the scikit-learn page on feature selection).
In our case, we will use tree-based feature selection.
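Tree-based feature selection can be sketched with scikit-learn's ExtraTreesClassifier and SelectFromModel. The data below is synthetic (54 columns to mirror the number of features mentioned above), so the number of features kept here will not match the project's result, and the parameter values are assumptions.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import ExtraTreesClassifier
from sklearn.feature_selection import SelectFromModel

# Synthetic stand-in for the 54 extracted features (not the real dataset).
X, y = make_classification(n_samples=1000, n_features=54,
                           n_informative=10, random_state=0)

# Fit a forest of randomized trees; it exposes one importance score per feature.
forest = ExtraTreesClassifier(n_estimators=50, random_state=0).fit(X, y)

# Keep only the features whose importance is above the mean importance (the default).
selector = SelectFromModel(forest, prefit=True)
X_reduced = selector.transform(X)
selected = selector.get_support(indices=True)
print(X_reduced.shape[1], "features kept, column indices:", selected)
```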
In this case, the algorithm selected 14 important features
among the 54. We can see that SectionsMaxEntropy is indeed
selected, but other features (like the Machine value) are,
surprisingly, also good parameters for this classification.
Selection of the Classification Algorithm
• The best classification algorithm depends on
many factors, and some algorithms (like SVM)
have many parameters and can be
difficult to configure correctly. So I used a
simple approach:
• I tested several of them with default parameters
and selected the best one (I left out SVM
because it does not handle datasets with
many features well).
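The comparison described above can be sketched as follows. The three candidates are taken from the algorithms named in these slides; the data is synthetic, and random_state is set only to make the run reproducible, so the scores say nothing about the real dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the selected features.
X, y = make_classification(n_samples=1000, n_features=14, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0)

# Candidate algorithms, otherwise left at their default parameters.
candidates = {
    "DecisionTree": DecisionTreeClassifier(random_state=0),
    "RandomForest": RandomForestClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
}

# Train each candidate and record its accuracy on the held-out test set.
scores = {name: clf.fit(X_train, y_train).score(X_test, y_test)
          for name, clf in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "best:", best)
```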
Testing the Efficiency of the Model
Output of the Model
User Interface for Malware Detection
HOME PAGE
Uploading the file for Malware Detection
Output as Legitimate or Malicious
