Structure Analysis
Elizabeth Walkup
Stanford University
ewalkup@stanford.edu
based statistics of various structures. There are other structural features within the database that could be used, but these were chosen to maximize information gain while decreasing model size. Using only structural features, the top model had a testing error of 7.3691%.

I.1 Mach-O File Structure

Mach-O executables[6] have three main parts: a mach header, load commands, and data. The data section is made up of the actual executable code, which can be compressed or encrypted, so no features are gathered from there. The mach header contains information about the platform the executable was built for, as well as the file type (dynamic library, etc.). The majority of the structural features come from the load commands. These contain headers for each code segment in the data, as well as information about import libraries and how to load the file in memory. Each segment header also contains information about a substructure within the segment called a section.

II. Import Features

Every executable file imports files and libraries to use within its program. These cannot be encrypted, so their names can be extracted from the binary. However, there are literally thousands of names that could be present, which is not easy to incorporate into a model. As a compromise, feature selection was performed on models built solely from import features, and the top features were then incorporated back into the main set of features. When these import features were added, the testing error of the top model decreased from 7.3691% to 2.7682%, an improvement of nearly 5 percentage points.

III. Feature Selection

Two methods were used to trim features: forward search selection and information gain ranking. These two algorithms had many top features in common, particularly when used to choose import features. By combining the top results from these two methods, thousands of possible import features were reduced to 60 dynamic library (dylib) imports and 117 function imports. Combined with the 50 structural features, there were 227 features for all of the models. Some of the top features are shown in Table 1.

Table 1: Top Ten Features by Information Gain

IV. Models

For the sake of brevity, only the top five models will be described, although over 20 were tested.

I. Rotation Forest

Rotation Forest[1] builds an ensemble (or forest) of decision trees, in this case C4.5 trees. The feature set is split into K disjoint subsets. In this case, K was 75, which corresponds to roughly 3 features per set. For each classifier, principal component analysis (PCA) is run using the subset of features chosen. The coefficients found using PCA are put into a sparse rotation matrix, which is rearranged to match the chosen features. This new matrix multiplied by the training data comprises the model. To classify a new data point, the confidence for each individual model is calculated, and the output class is the one with the highest confidence. This model had a confidence factor of 0.25.

II. Random Forest

Random Forest[2], similar to Rotation Forest, is an ensemble of different decision tree models. However, instead of using a confidence measure, the different models vote on the output class. The training sets for these models are chosen randomly, but with the same distribution. In this case, there were ten decision tree models, and the models used the full feature set.

III. PART

PART[3] is a rule-based algorithm. The data set is divided into several new sets. A partial pruned C4.5 decision tree is built based on each set. The leaf of this tree with the largest data coverage (meaning it classifies the most instances) is made into a rule. A leaf must cover at least two instances to be considered for a rule.

IV. LMT

LMT[4] stands for Logistic Model Tree. This is a basic decision tree except that instead of
having classes at the leaf nodes, it has logistic regression functions. A node must cover at least fifteen instances to have child nodes. The functions at the leaf nodes model the class probability:

Pr(G = j | X = x) = e^{F_j(x)} / Σ_{k=1}^{J} e^{F_k(x)}    (1)

F_j(x) = α_0^j + Σ_{v ∈ V_T} α_v^j · v    (2)

where V_T is a subset of all the features, the α's are weights, and J is the number of classes.

V. IBk

IBk[5] is an instance-based learning implementation of k-Nearest Neighbors (in this case k = 1). For each new data point, it calculates a similarity to each stored instance; the data point is then assigned the class of the instance with the highest similarity.

V. Results

Many different supervised learning algorithms were evaluated to see which one could classify the data the best. The algorithms were trained on the full 227 attributes. The top ten results are listed in Table 2. A false positive is a goodware sample that was classified as malware, and a false negative is a malware sample that was classified as goodware.

Some interesting facts that can be gleaned from the data itself:

• Goodware tends to import more libraries than malware
• Malware tends to have more segments than goodware and to use more varied segment flags
• Malware and goodware have the same distribution of load command sizes
• Malware is more likely to use Java libraries
• Malware is more likely to use HTTP stream functions (probably for control/exfiltration)
• Goodware uses more memory access controls

VI. Discussion

The ensemble methods had the best performance, perhaps due to the complex nature of the input features or the varied data set. Since all of the import features were nominal features, decision tree models were particularly well suited for classification.

The fact that the testing error can be reduced to a few percent means that there are actually tangible differences in static structure between malware and goodware. This was expected going into the project, because substantial results have been produced using a similar approach with Windows portable executable (PE) files and malware.
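The LMT leaf-node model in Equations (1) and (2) amounts to a softmax over per-class linear scores. A minimal sketch in Python (the feature names and weights below are purely illustrative, not taken from the trained models):

```python
import math

def class_probabilities(x, alphas):
    """Evaluate Pr(G = j | X = x) as in Eq. (1).

    x      -- dict mapping feature name to value
    alphas -- one dict per class: intercept under key "_0", plus
              weights alpha_v^j for the feature subset V_T (Eq. (2))
    """
    scores = []
    for a in alphas:
        # F_j(x) = alpha_0^j + sum over v in V_T of alpha_v^j * x[v]
        f_j = a["_0"] + sum(w * x[v] for v, w in a.items() if v != "_0")
        scores.append(f_j)
    z = sum(math.exp(f) for f in scores)          # softmax denominator
    return [math.exp(f) / z for f in scores]

# Toy two-class example (malware vs. goodware) with made-up weights.
alphas = [
    {"_0": 0.1, "num_segments": 0.8, "num_dylib_imports": -0.3},  # malware
    {"_0": 0.2, "num_segments": -0.5, "num_dylib_imports": 0.4},  # goodware
]
probs = class_probabilities({"num_segments": 3.0, "num_dylib_imports": 10.0}, alphas)
```

The probabilities always sum to 1; in the full LMT, each leaf of the tree carries its own set of α weights.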
Table 2: Top Ten Supervised Learning Algorithms Results
Algorithm Training Error Test Error False Positive False Negative ROC Area
Rotation Forest 0.2653% 2.7682% 0.02 0.046 0.994
Random Forest 0.1768% 4.1522% 0.035 0.057 0.99
PART 0.9726% 4.4983% 0.025 0.092 0.949
LMT 0.9726% 5.1903% 0.025 0.115 0.98
IBk 0% 5.1903% 0.035 0.092 0.944
SMO 3.8019% 6.2284% 0.04 0.115 0.923
FT Tree 1.4147% 6.5744% 0.035 0.138 0.921
J48/C4.5 Tree 1.1494% 6.5744% 0.05 0.103 0.92
Regression 3.4483% 9.3426% 0.05 0.195 0.966
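The false positive and false negative columns in Table 2 follow the definitions given in the Results section (a false positive is goodware classified as malware; a false negative is malware classified as goodware). A small sketch of how such rates fall out of a confusion matrix (the counts below are hypothetical, not the project's actual data):

```python
def error_rates(true_pos, false_neg, false_pos, true_neg):
    """Compute per-class error rates from confusion-matrix counts.

    true_pos/false_neg count malware samples (the positive class);
    false_pos/true_neg count goodware samples.
    """
    total = true_pos + false_neg + false_pos + true_neg
    fp_rate = false_pos / (false_pos + true_neg)   # goodware flagged as malware
    fn_rate = false_neg / (true_pos + false_neg)   # malware that was missed
    test_error = (false_pos + false_neg) / total   # overall misclassification
    return fp_rate, fn_rate, test_error

# Hypothetical test split: 105 malware and 195 goodware samples.
fp_rate, fn_rate, test_error = error_rates(
    true_pos=100, false_neg=5, false_pos=4, true_neg=191)
```

Note that the two rates are normalized by different class totals, which is why a model can have a low overall test error while its false negative rate is still several percent.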
However, the results are not quite what would be considered "production level" for malware detection. As a general rule, the goal is to keep false positives below 0.001% and false negatives below 1%. At best, among these models the false positive rate was 2% and the false negative rate was 4.6%. False positives are especially important to keep low, because customers will not use an antivirus that blocks their legitimate programs.

The models built in this project are still useful as triaging tools. In cybersecurity, incident response teams at businesses are swamped with email security alerts every day, especially with regard to suspicious email attachments. Running these models on the applicable executables can allow the incident response team to focus only on those that have a high probability of being malware. The conclusion, then, is that static analysis shows substantial promise, even if it is not yet refined enough to be deployed as a standalone antivirus.

VII. Future Work

The one major roadblock to this project is a lack of malware samples. The 420 samples used here took weeks to obtain, whereas thousands of goodware samples can be gathered in a day. Expanding the sample set would probably cut down on the generalization error. One possible approach would be to catch malware using a honeypot setup, though that often comes with its own set of legal issues.

There are also more features that can be used within the Mach-O files. Adding more features could possibly increase the accuracy of the classification, but it comes at the price of a larger model.

This static analysis approach could also be used to try to classify malware into "families" (for example, all polymorphic Flashback executables), which is an open problem in the security industry.

References

[1] J. Rodriguez and L. Kuncheva, "Rotation Forest: A New Classifier Ensemble Method", IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 10, pp. 1619-1630, Oct. 2006.

[2] L. Breiman, "Random Forests", University of California, Jan. 2001.

[3] E. Frank and I. H. Witten, "Generating Accurate Rulesets Without Global Optimization", University of Waikato.

[4] N. Landwehr, M. Hall, and E. Frank, "Logistic Model Trees", 14th European Conference on Machine Learning, Jun. 2004.

[5] D. Aha, D. Kibler, and M. Albert, "Instance-Based Learning Algorithms", Machine Learning, vol. 6, pp. 37-66, 1991.

[6] "OS X ABI Mach-O File Format Reference". Mac Developer Library. https://developer.apple.com/library/mac/documentation/DeveloperTools/Conceptual/MachORuntime/index.html.
[7] "Weka". The University of Waikato. http://www.cs.waikato.ac.nz/ml/weka/.

[9] "VirusTotal". https://www.virustotal.com/.