
DEEPSIGN

Deep Learning for Automatic Malware Signature Generation and Classification


PROBLEM STATEMENT

• New malware programs are growing exponentially (e.g., on average 160,000 new malware
samples appeared every day in 2013)
• Anti-virus solutions do not effectively and efficiently detect, analyze, and generate
signatures for malware
• Methods for automatic malware signature generation target specific aspects of malware
(e.g., exploitation of a particular vulnerability in the Windows operating system)
• Variants of a malware program are not easily recognized by these methods
• It is difficult to generate signatures that can prevent new, zero-day attacks
INTRODUCTION

• The method presented in this paper aims to generate malware signatures that are invariant
to modifications in the program
• It does not rely on a specific aspect of the malware; it represents the malware's general behavior
• It relies on training a DEEP BELIEF NETWORK (DBN), which has proven successful at
generating invariant representations in many challenging domains
• The authors used a dataset of 1,800 samples (six major categories of malware with 300
variants in each category), provided by C4 Security
SOME RELATED WORKS

Some methods that worked to improve malware signature generation process:

• Autograph
• Honeycomb
• PAYL sensor
• Nemean architecture (semantic-aware Network Intrusion Detection System)
• Statistical model for undecidable viral detection
• AMD (Semantics-Aware Malware Detection)
• Polygraph
• EarlyBird
• Netspy
• Auto-Sign
SOME RELATED WORKS (CONT.)

• Autograph records source and destination of connections attempted from outside the
network
• Honeycomb analyzes the traffic on the honeypot and uses the longest common substring (LCS) to
generate signatures and measure similarities in packet payloads
• The PAYL sensor monitors the flow of information in the network and tries to detect
malicious attacks using anomaly detection
• The Nemean architecture is a semantic-aware Network Intrusion Detection System (NIDS)
which normalizes packets from individual sessions in the network and renders semantic
context. A signature generation component clusters similar sessions and generates signatures
for each cluster
SOME RELATED WORKS (CONT.)

• Polygraph generates content based signatures that use several substring signatures
(tokens), to expand the detection of malware variants
• EarlyBird sifts through the invariant portion of a worm's content that will appear
frequently on the network as it spreads or attempts to spread
• Auto-Sign generates a list of signatures for a malware by splitting its executable into
segments of equal size. For each segment a signature is generated, and the list of
signatures is subsequently ranked
LIMITATIONS OF MENTIONED APPROACHES

• Mostly rely on specific behavior of malware:


• Specific network activity
• Specific substring in executable

• Less accurate on larger malware programs (e.g., Andromeda)


• Some of them are resilient to small modifications, but malware can evade them using different
techniques (e.g., encrypting the executable)
TERMINOLOGIES (DEEP BELIEF NETWORK)

• DBNs can be viewed as a composition of simple, unsupervised networks,
where each sub-network's hidden layer serves as the visible layer for the
next
• A DBN is composed of multiple layers of hidden units, with connections between
the layers but not between units within each layer
• When trained on a set of examples in an unsupervised way, a DBN can
learn to probabilistically reconstruct its inputs. The layers then act as
feature detectors on inputs
• After this learning step, a DBN can be further trained in a supervised way
to perform classification
TERMINOLOGIES (DENOISING AUTOENCODERS)

• An auto-encoder is an artificial neural network used for
unsupervised learning of efficient codings
• The aim of an auto-encoder is to learn a representation (encoding)
for a set of data, typically for the purpose of dimensionality
reduction
• In de-noising auto-encoders, each time a sample is given to the
network, a small portion of it is corrupted by adding noise (or more
often by zeroing values) and then fed to the input layer of the
network, which aims to generate the uncorrupted version of the input
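The corruption step described above can be sketched in a few lines; this is a minimal illustration of "masking noise" (zeroing a random fraction of the input), not the paper's implementation:

```python
import random

def corrupt(x, noise=0.2, rng=None):
    """Zero out roughly a fraction `noise` of the input vector -- the
    'masking noise' commonly used by denoising autoencoders."""
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < noise else v for v in x]

x = [1.0] * 10
x_tilde = corrupt(x, noise=0.2)
# the network is then trained to reconstruct the clean x from x_tilde
```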
TERMINOLOGIES (DROPOUT)

• During training, neural network nodes are randomly
switched off once in a while so that they do not interact
with the rest of the network
• With dropout, the learned weights of the nodes become
somewhat less sensitive to the weights of the other
nodes and learn to decide somewhat more on their own
• In general, dropout helps the network to generalize
better and increases accuracy, since the (possibly
somewhat dominating) influence of a single node is
decreased
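As a sketch of the mechanism (the "inverted dropout" scaling shown here is one common convention, not necessarily the paper's exact variant):

```python
import random

def dropout(activations, p=0.5, training=True, rng=None):
    """Inverted dropout: during training, zero each unit with probability p
    and scale the survivors by 1/(1-p) so the expected activation is
    unchanged; at inference time the layer passes values through untouched."""
    if not training or p == 0.0:
        return list(activations)
    rng = rng or random.Random(0)
    return [0.0 if rng.random() < p else a / (1.0 - p) for a in activations]
```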
PROPOSED SIGNATURE GENERATION METHOD

• The method in the paper consists of the following parts:
• Program behavior as a binary vector
• Training a deep belief network
• The method is performed in the following way:
• The program is run in a sandbox
• The sandbox log file is converted to a binary bit-string
• The bit-string is fed to the deep neural network
• The deep neural network produces a vector (the signature) at its output layer
PROGRAM AS BINARY VECTOR

• Behavior is recorded by running program in a sandbox


• It logs the different activities performed by that program (e.g., API function calls and their
parameters, files created or deleted, and websites and ports accessed)
• For converting the sandbox-generated text file, one common method from natural
language processing is unigram (1-gram) extraction
• For instance, find the 5,000 most frequent words in the text (i.e., the
dictionary), and then for each text sample check which of these 5,000 words are present.
Thus, each text sample is represented as a 5,000-bit string
PROGRAM AS BINARY VECTOR

• The method follows these simple steps to convert sandbox files to fixed-size
inputs to the neural network:
• Extract all unigrams from each sandbox file in the dataset
• Remove the unigrams which appear in all files (they contain no information)
• Count the frequency of each unigram
• Select the top 20,000 unigrams with the highest frequency
• Convert each sandbox file to a 20,000-bit string, by checking whether each of the 20,000
unigrams appears in it
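The steps above can be sketched as follows; the toy tokens and the tie-breaking rule for equally frequent unigrams are assumptions for illustration:

```python
from collections import Counter

def build_dictionary(tokenized_files, top_k=20000):
    """Select the top_k most frequent unigrams, after dropping any unigram
    that appears in every file (such unigrams carry no information)."""
    doc_freq, total = Counter(), Counter()
    for tokens in tokenized_files:
        doc_freq.update(set(tokens))   # in how many files each unigram appears
        total.update(tokens)           # overall frequency
    candidates = [w for w in total if doc_freq[w] < len(tokenized_files)]
    candidates.sort(key=lambda w: (-total[w], w))   # highest frequency first
    return candidates[:top_k]

def to_bit_vector(tokens, dictionary):
    """Mark which dictionary unigrams are present in one sandbox file."""
    present = set(tokens)
    return [1 if w in present else 0 for w in dictionary]

files = [["open", "read", "connect"], ["open", "write"], ["open", "connect"]]
vocab = build_dictionary(files, top_k=4)  # "open" is dropped: it is in every file
vec = to_bit_vector(files[0], vocab)
```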
TRAINING A DEEP BELIEF NETWORK

• The deep belief network (DBN) is created by training a deep stack
of denoising autoencoders
• In a denoising autoencoder, each time a sample is given to the
network, a small portion of it is corrupted by adding noise (or
more often by zeroing values)
• That is, given an input X, it is first corrupted to x̃ and then given
to the input layer of the network. The objective of the
network at the output layer remains to reconstruct the original X
TRAINING A DEEP BELIEF NETWORK

• When training is complete, the decoder layer is discarded, and the output of the hidden layer is
treated as the input to a new autoencoder stacked on top of the previous one
• The autoencoders are trained similarly, making up a total of eight layers
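A minimal sketch of this greedy layer-wise stacking, assuming sigmoid units, squared-error loss, plain gradient descent, and a tiny stand-in for the paper's 20,000-…-30 architecture (the hyperparameters here are illustrative, not the paper's):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_dae(X, n_hidden, noise=0.2, lr=0.1, epochs=50, seed=0):
    """Train one denoising autoencoder with masking noise and return
    the encoder weights; the decoder is used only during training."""
    rng = np.random.default_rng(seed)
    n_in = X.shape[1]
    W = rng.normal(0, 0.1, (n_in, n_hidden));  b = np.zeros(n_hidden)
    W2 = rng.normal(0, 0.1, (n_hidden, n_in)); b2 = np.zeros(n_in)
    for _ in range(epochs):
        X_tilde = X * (rng.random(X.shape) > noise)  # corrupt the input
        H = sigmoid(X_tilde @ W + b)                 # encode
        R = sigmoid(H @ W2 + b2)                     # reconstruct clean X
        dR = (R - X) * R * (1 - R)                   # output-layer gradient
        dH = (dR @ W2.T) * H * (1 - H)               # hidden-layer gradient
        W2 -= lr * H.T @ dR;       b2 -= lr * dR.sum(0)
        W  -= lr * X_tilde.T @ dH; b  -= lr * dH.sum(0)
    return W, b

def stack_encoders(X, layer_sizes):
    """Greedy layer-wise training: each trained encoder's hidden output
    becomes the input to the next autoencoder; decoders are discarded."""
    encoders, H = [], X
    for n_hidden in layer_sizes:
        W, b = train_dae(H, n_hidden)
        encoders.append((W, b))
        H = sigmoid(H @ W + b)
    return encoders, H

X = (np.random.default_rng(1).random((16, 40)) > 0.5).astype(float)
encoders, signatures = stack_encoders(X, [20, 10, 5])
```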
IMPLEMENTATION AND EXPERIMENTAL RESULTS

• The six malware categories used are Zeus, Carberp, SpyEye, Cidox, Andromeda and
DarkComet
• Each of the 1,800 programs in the dataset is run in the Cuckoo sandbox
• A deep denoising autoencoder was trained, consisting of eight layers (20,000-5,000-2,500-
1,000-500-250-100-30), with layer-wise training
• Dropout was used to regularize the network and prevent overfitting
SIGNATURE GENERATION PROCESS STEPS
EXPERIMENTAL RESULTS

• All 1,800 vectors of size 20,000 were passed through the DBN and converted to 30-dimensional
representations (signatures)
• The visualization is generated using the t-distributed stochastic neighbor embedding (t-SNE)
algorithm
• Variants of the same malware family are mostly clustered together in the signature space,
demonstrating that the signatures produced by the DBN indeed capture invariant representations of
malware
• Training this network on the 1,200 training samples (using input noise = 0.2, dropout = 0.5,
and learning rate = 0.001), and predicting on the 600 test samples, results in 98.6% accuracy on the
test data, a relatively substantial improvement over other methods
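The classification step on top of the 30-dimensional signatures can be illustrated with a toy nearest-centroid classifier on synthetic data; this is a stand-in for intuition only — the family labels and the classifier used here are not the paper's:

```python
import numpy as np

def fit_centroids(X, y):
    """One centroid (mean signature) per malware family."""
    classes = sorted(set(y))
    return classes, np.array([X[np.asarray(y) == c].mean(axis=0) for c in classes])

def predict(X, classes, centroids):
    """Assign each signature to the family with the nearest centroid."""
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
    return [classes[i] for i in d.argmin(axis=1)]

# synthetic, well-separated 30-dim "signatures" for two families
rng = np.random.default_rng(0)
X_train = np.vstack([rng.normal(0, 0.1, (20, 30)), rng.normal(1, 0.1, (20, 30))])
y_train = ["zeus"] * 20 + ["carberp"] * 20
classes, cents = fit_centroids(X_train, y_train)
preds = predict(X_train, classes, cents)
```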
EXPERIMENTAL RESULTS (CONT.)
CONCLUSION

• Current approaches for malware signature generation use specific aspects of malware
• New malware variants easily evade detection by modifying small parts of their code
• Unsupervised deep learning is a powerful method for generating high-level invariant
representations in domains beyond computer vision, language processing, and speech
recognition, and can be applied successfully to challenging domains such as malware
signature generation
