Академический Документы
Профессиональный Документы
Культура Документы
P. Charland
DRDC – Valcartier Research Centre
With the revolution in information technology, the dependence of the Canadian Armed Forces
(CAF) on their information systems continues to grow. While information systems-based assets
confer a distinct advantage, they also make the CAF vulnerable if adversaries interfere with those.
Unfortunately, the technology required to disrupt and damage an information system through
malicious software (malware) is far less sophisticated and expensive than the amount of
investment required to create the system. To understand and mitigate this threat, reverse
engineering has to be performed to analyze malware. However, software reverse engineering is a
manually intensive and time-consuming process. The learning curve to master it is quite steep and
once mastered, the process is hindered when anti-reverse engineering techniques are used. This
results in the very few available reverse engineers being quickly saturated. This Scientific Report
describes new approaches to accelerate the reverse engineering process of malware. The goal is to
reduce redundant analysis efforts by automating the identification of code fragments which
reuse (i) previously analyzed assembly code or (ii) open source code publicly available.
With today’s proliferation of malware, new analysis techniques which go beyond existing best
practices are needed to accelerate the reverse engineering process. The intended audience for this
report are thus software reverse engineers and malware analysts, as well as managers and decision
makers of organizations which have to reverse engineer third party code. The techniques
presented in this report will contribute to provide the CAF with a core capability to respond more
effectively and rapidly to cyber threats and lead to a change in their defence capabilities. To
defend against cyber attacks, the CAF currently rely on antivirus and intrusion detection systems
(IDS). These approaches consist of looking for symptoms such as increased network traffic or
anomalous signals, and try to fight the problems that arise. Although they can be effective in
some cases, they do not help in understanding what the real problem is and are ineffective when
attacks morph. Providing a new capability for malware reverse engineering, as it is proposed in
this report, will allow the CAF to uncover vulnerabilities that malware exploit. This will also give
them the required knowledge to generate defences against future malicious attacks to pre-empt
them, even when they morph. This new capability will introduce a paradigm shift in the practices
of the CAF in the cyber environment. They will move from a strictly defensive approach, which
has shown its limits, to a proactive approach. This proposed approach has also security potential
for the Royal Canadian Mounted Police (RCMP), the Communications Security Establishment
Canada (CSEC), the Canadian Security Intelligence Service (CSIS), and Public Safety (PS) Canada.
DRDC-RDDC-2015-R027 i
This page intentionally left blank.
ii DRDC-RDDC-2015-R027
Résumé ……..
Avec la prolifération actuelle des logiciels malveillants, de nouvelles techniques d'analyse allant
au-delà des meilleures pratiques existantes de rétro-ingénierie sont nécessaires pour accélérer le
processus. L’audience visée par ce rapport sont donc les rétro-ingénieurs logiciels et analystes de
maliciels, ainsi que les gestionnaires et décideurs d’organisations qui ont à faire la rétro-
ingénierie de code tiers. Les techniques présentées dans ce rapport contribueront à fournir aux
FAC une capacité de base pour répondre plus efficacement et rapidement aux menaces
cybernétiques et mènera à un changement dans leurs capacités de défense. Pour se prémunir
contre les cyber-attaques, les FAC comptent actuellement sur des antivirus et des systèmes de
détection d'intrusions. Ces approches consistent à chercher des symptômes, tels qu’une
augmentation du trafic réseau ou des signes anormaux, et essayer de lutter contre les problèmes
qui surgissent. Bien qu'elles puissent être efficaces dans certains cas, elles n'aident pas à
comprendre ce qu’est le véritable problème et sont inefficaces lorsque les attaques se
transforment. Fournir une nouvelle capacité de rétro-ingénierie de logiciels malveillants, tel que
proposée dans le présent rapport, permettra aux FAC de découvrir les vulnérabilités que les
logiciels malveillants exploitent. Ceci leur fournira aussi les connaissances requises afin de
générer des mesures défensives contre de futures attaques malveillantes afin de les contrer, même
lorsqu’elles se transforment. Cette nouvelle capacité introduira un changement de paradigme dans
les pratiques des FAC dans le cyber-environnement. Celles-ci passeront d'une approche
strictement défensive, dont les limites ont été exposées, à une approche proactive. L’approche
proposée a aussi un potentiel pour la sécurité, notamment la Gendarmerie royale du Canada
(GRC), le Centre de la sécurité des télécommunications Canada (CSTC), le Service canadien du
renseignement de sécurité (SCRS) et la Sécurité publique (SP) Canada.
DRDC-RDDC-2015-R027 iii
This page intentionally left blank.
iv DRDC-RDDC-2015-R027
Table of contents
DRDC-RDDC-2015-R027 v
List of figures
vi DRDC-RDDC-2015-R027
List of tables
DRDC-RDDC-2015-R027 vii
This page intentionally left blank.
viii DRDC-RDDC-2015-R027
1 Introduction
Canadians – individuals, industry and government – are embracing the many advantages that
cyberspace offers, and the economy and quality of life are the better for it. But their increasing
reliance on cyber technologies makes them more vulnerable to those who attack Canada’s digital
infrastructure to undermine the national security, economic prosperity, and way of life [1]. The
first decade of the 21st century saw a dramatic change in the nature of cybercrime. Once the
province of teenage boys spreading graffiti for kicks and notoriety, hackers today are organized,
financially motivated gangs [2]. In the past, virus writers displayed offensive images and bragged
about the malware they had written; now hackers target companies and governments to steal
intellectual property, build complex networks of compromised computers, and rob individuals of
their identities [2]. It is a common scenario that the only piece of evidence of a targeted attack is
the malicious executable code itself. Analyzing malicious code (malware) requires reverse
engineering, as malware authors seldom do the favor of providing the source code of
their creations.
During the last few years, the sophistication of malware has considerably evolved and has thus
complicated the reverse engineering process. While malware used to consist of small programs
written mostly in assembly, which spread by infecting other executable files, today’s malware
programs are written using high-level languages, come in many forms (e.g., botnets, rootkits,
malicious document files), and each new variant improves on the previous ones, by adding new
capabilities and fixing bugs. Also, as developing stealthy and persistent malware requires a high
degree of technical complexity, it is quite common for code fragments to be reused between
different malware.
The fact that malware authors share source code among them [3, 4], have adopted a versioning
approach, and use evasion techniques to bypass antivirus detection have resulted in a proliferation
of malware. According to Panda Security, in 2013 alone, there were 30 million new malware
strains in circulation, at an average of 82,000 per day [5]. This results in the very few available
reverse engineers being quickly saturated, as in-depth malware analysis is a tedious and time-
consuming task. Reverse engineers should thus leverage the code reuse in the production of
malware and be able to correlate different malware programs to identify the similarities between
them and thereby, the code fragments they share. This would prevent reverse engineers from
reanalyzing the code fragments of a new malware, which have already been analyzed in a
previous context and therefore, reduce redundant analysis efforts.
Although software reverse engineering came to an age in 1990 [6], it is only in recent years that
its importance and visibility have arisen. However, despite a growing community, it is still
DRDC-RDDC-2015-R027 1
perceived by many as a dark art. This report will thus serve to illustrate our vision of how, by
leveraging the open source code repositories publicly available and assembly code fragments
previously analyzed, the entry barrier new analysts face could be reduced.
The remainder of this Scientific Report is organized as follows: Chapter 2 presents the assembly
to source code matching process and its associated usage scenarios. Chapter 3 provides
background information on code clone detection, the research area on which the identification of
reused code fragments is built, as well as the different types of assembly code clones which
should be detected by the proposed approach. Finally, Chapter 4 provides conclusions and
future work.
2 DRDC-RDDC-2015-R027
2 Assembly to source code matching
Malware authors reuse code from all sources including legitimate ones, such as open source code.
For example, both the Conficker worm and the Waledac bot used an open source implementation
for their cryptographic functions [7, 8]. In the case of the Waledac bot, this represented 25% of its
code [8]. Also, the much-discussed FLAME malware contained publicly available open source
code packages such as SQLite and LUA [9].
Another example of source code reuse for malware creation is the Citadel Trojan. Citadel is based
on the leaked source code of the Zeus crimeware kit [10]. Other malware (e.g., Gameover Zeus,
Ice IX, LICAT, and Murofet) were also created after the source code of Zeus was made
public [10].
2.1 RE-Google
Some people have compared reverse engineering to solving a jigsaw puzzle [11]. You first start
by finding the corner pieces, then the frame, and after that, you work your way forward from
there. Using this analogy, the corner pieces for reverse engineering are strings, constants, and
function names. Strings contain human readable hints about a given functionality. Specific
constants can give additional clues and can sometimes be used to even identify certain types of
algorithms. Function names of imported functions from shared libraries (e.g., DLL) can reveal
information about the performed actions. However, a lot of experience is needed to know, for
example, the significance of a constant in a given context or what the combination of imported
functions might results in.
The meaning of strings, constants, and function names could be obtained by searching them on
publicly available open source code repositories. This was the idea behind the RE-Google IDA
Pro plug-in [11], which has proved to be very valuable to find algorithms and code excerpts
containing such information on Google Code. The ability to efficiently recognize the open source
origin for a given assembly code fragment is desirable, both in order to enhance the productivity
of a reverse engineer, as well as to reduce the odds of common libraries leading to false
correlation between otherwise unrelated code bases.
RE-Google [11], a proof of concept plug-in for the IDA Pro disassembler [12], extracts constants,
library names, and strings contained in a disassembled binary, and uses them to search for code
on Google Code [13]. The links to the top ten source code files found are inserted as comments
into the assembly code listing. Reviewing these files frequently provides enlightening insights
into the functionality of the code fragment in question and saves considerable time. However,
RE-Google uses the Google Code Search Data Application Programming Interface (API), which
is no longer available, making this plug-in non-functional.
DRDC-RDDC-2015-R027 3
GitHub [17], Krugle [18], the Open Hub Code Search [19], and searchcode [20]. The two selected
options were the Open Hub Code Search and GitHub.
The Open Hub Code Search, by Black Duck Software [21], with its 21,372,664,482 lines of open
source code, claims to be the world's largest and most comprehensive code search engine [19]. It
is the result of the merge in 2012 with Koders, another open source code search engine acquired
by Black Duck Software in 2008.
Through the Open Hub Code Search, Black Duck Software wants to fill the gap left by the
shutdown of Google Code Search in 2012. Figure 1 shows a screenshot of the results for the
search term sort filtered for the C++ programming language.
2.2.2 GitHub
GitHub is a web-based hosting service for software development projects using Git [22] at its
heart. Git is a free and open source distributed version control system. GitHub claims to be the
largest code host on the planet with over 18.9 million repositories [23]. One advantage that
GitHub has over the Open Hub Code Search is its robust API, which allows the integration of
4 DRDC-RDDC-2015-R027
third-party tools or applications [23]. Figure 2 displays a snapshot of the returned results by
GitHub, considering only the C++ projects, for the search term sort.
DRDC-RDDC-2015-R027 5
Figure 3: Citadel sub_42514F function in IDA Pro.
6 DRDC-RDDC-2015-R027
Figure 4: Sert.cpp file excerpt from GitHub.
Figure 4 displays a subset of the file Sert.cpp found on the GitHub Carberp1 repository. All the
calls to the Windows API functions (coloured in pink) in Figure 3 are also present in Figure 4, as
shown in Table 1. The presence of the letter W appended at the end of the function name
CertOpenSystemStoreW in the disassembly and not in the source code listing is due to the fact
that there are two versions of CertOpenSystemStore. One for ASCII strings (ending with an A)
and one for Unicode (ending with a W). In the present case, the CertOpenSystemStore function
was compiled for Unicode.
1
https://github.com/hzeroo/Carberp/blob/6d449afaa5fd0d0935255d2fac7c7f6689e8486b/source%20-
%20absource/pro/all%20source/sert/sert.cpp
DRDC-RDDC-2015-R027 7
Table 1: Correspondence between assembly and source code.
IDA Pro virtual Source code line
Function name
address number
Figure 3 shows that IDA Pro was also able to extract the string “MY” at the address 00425150,
which is passed as a parameter to the function CertOpenSystemStoreW. This string is also
present in the file Sert.cpp. It is initialized at line 51 (Figure 5) and its pointer is passed as a
parameter to the CertOpenSystemStore function at line 172 (Figure 4).
This example illustrates how being able to match assembly with its corresponding source code
greatly accelerates the reverse engineering process. The latter is at a higher level of abstraction
and is thus easier to understand. In this case, it serves to clearly illustrate how the different
Windows Certificate Store functions used for cryptography and present in Citadel are related.
8 DRDC-RDDC-2015-R027
Figure 5: Sert.cpp file excerpt on GitHub.
DRDC-RDDC-2015-R027 9
2.4 Close matching
With close matching, although the exact corresponding source code cannot be found on a public
repository, the fact that certain assembly code features (e.g., strings, constants) are matched can
provide additional information. This can sometimes reveal the performed actions of a function.
This is best illustrated with the following example. Figure 6 displays the assembly code listing of
a malicious Secure Shell (SSH) client. IDA Pro was able to extract the string x11-req at the
address 806721A. This string was also found on the Black Duck Open Hub Code Search, in the
channels.c2 file of the MirOS Project.
10 DRDC-RDDC-2015-R027
Figure 7: channels.c file excerpt on the Black Duck Open Hub Code Search.
The MirOS project is a secure operating system from the BSD family for 32-bit i386 and SPARC
systems [24]. Its file channel.c comes from the OpenBSD project [25]. The fact that the string
x11-req was retrieved on the Black Duck Open Hub Code Search in the channel.c file allows
the reverse engineer to know that the disassembled code displayed in Figure 6 is used for
initiating X11connection forwarding. This is mentioned in the comments at the beginning of the
file (Figure 8).
DRDC-RDDC-2015-R027 11
Figure 8: Comments in the channels.c file on the Black Duck Open Hub Code Search.
12 DRDC-RDDC-2015-R027
Figure 9: Video capture capability discovered in Citadel.
Although contextual matching is far from being the ideal scenario, the high-level information it
provides for a function saves the reverse engineer from manually analyzing it. With Citadel
having approximately 800 functions, any function for which the reverse engineer will not have to
deal with assembly code analysis is a time saver.
DRDC-RDDC-2015-R027 13
3 Code clone detection
During the last few years, the sophistication of malware has considerably evolved and has thus
complicated the reverse engineering process. While malware used to consist of small programs
written mostly in assembly, which spread by infecting other executable files, today’s malware
programs are written using high-level languages, come in many forms (e.g., botnets, rootkits,
malicious document files), and each new variant improves on the previous ones, by adding new
capabilities and fixing bugs. Also, as developing stealthy and persistent malware requires a high
degree of technical complexity, it is quite common for code fragments to be reused between
different malware.
The fact that malware authors share source code among them [1, 4], have adopted a versioning
approach, and use evasion techniques to bypass antivirus detection have resulted in a proliferation
of malware. Since retrieving the open source origin of a malware code fragment is not always
possible, reverse engineers should thus leverage the code reuse in the production of malware and
be able to correlate different malware programs to identify the similarities between them and
thereby, the code fragments they share. This would prevent them from reanalyzing the code
fragments of a new malware, which have already been analyzed in a previous context.
The problem of correlating different code fragments is closely related to the research area of
clone detection. Clone detection is a technique to identify duplicate code fragments in a code
base. Traditionally, it has been used to decrease the code size by consolidating it and thus,
facilitate program comprehension and software maintenance. This need stems from the fact that
reusing code fragments by copying and pasting them, with or without modifications, is a common
scenario in software development and can be detrimental to software maintenance and evolution.
For example, if a bug is found in a code fragment, then all similar code fragments must also be
verified for the presence of this bug.
As clone detection is an important problem, it has been studied extensively and numerous clone
detection algorithms exist. Depending on the code level analysis used, they can be classified
within the following categories: text-based [26, 27, 28], token-based [29, 30, 31, 32], tree-based
[33, 34, 35], metrics-based [36, 37], and graph-based [38, 39, 40]. However, most existing clone
detection algorithms operate on source code and these solutions are not directly applicable to
assembly code. One important application of clone detection on binary code is the detection of
copyright infringements. For example, closed source software should not contain open source
code released under the GNU General Public License (GPL).
14 DRDC-RDDC-2015-R027
validate their precision, recall, and scalability. Also, unlike source code clone detection, no
independent experimental study has been conducted to compare the performance of the assembly
code clone detection methods.
A code fragment is any sequence of assembly code instructions, with or without comments, at any
granularity level (e.g., function, basic block). A code fragment is a clone of another code
fragment if they are similar according to a given definition of similarity [56]. In clone detection
(or search), code fragments can be similar based on their program text (textual similarity) or
functionality (functional similarity). In the literature, code clones have been classified into
Type I, II, III (textual similarity), and IV (functional similarity) [57]. The results of clone
detection take the form of clone pairs. A pair of code fragments is called a clone pair if there
exists a clone-relation between them (i.e., a clone pair is a pair of code fragments which are
identical or similar to each other) [57].
A Type I clone is when two or more code fragments are identical except for variations in
whitespace, layout, and comments. In the example of Figure 10, the only difference between the
two code fragments is the presence of the Memory comment indicated in red at line 1.
1 push eax ; Memory push eax
2 call ds:_aligned_free call ds:_aligned_free
3 and dword ptr [esi], 0 and dword ptr [esi], 0
4 pop ecx pop ecx
Figure 10: Type I clone example.
DRDC-RDDC-2015-R027 15
3.2.2 Type II clones
Type II clones are structurally and syntactically identical code fragments except for variations in
identifiers, literals, types, layout, and comments. In Figure 11, the only difference between the
two code fragments is that for some instructions, they use different constants, variable names, and
labels. For example, for the assembly code instruction at line 5, the instruction on the left uses the
constant 24h and the variable name var_C, while its corresponding instruction on the right uses
the constant 20h and the variable name InBuffer. Similar differences also apply for the
instructions at line 15. For the jnz instruction at line 17, the loc_10001A97 label is used on the
left, while the loc_10001493 label is used for the instruction on its right.
1 push edi ; Size push edi ; Size
2 call _malloc call _malloc
3 mov edx, eax mov edx, eax
4 mov ecx, edi mov ecx, edi
5 mov [esp+24h+var_C], edx mov [esp+20h+InBuffer], edx
6 mov edi, edx mov edi, edx
7 mov edx, ecx mov edx, ecx
8 xor eax,eax xor eax, eax
9 shr ecx, 2 shr ecx, 2
10 rep stosd rep stosd
11 mov ecx, edx mov ecx, edx
12 add esp, 4 add esp, 4
13 and ecx, 3 and ecx, 3
14 rep stosb rep stosb
15 mov eax, [esp+20h+var_C] mov eax, [esp+1Ch+InBuffer]
16 test eax, eax test eax, eax
17 jnz loc_10001A97 jnz loc_10001493
18 mov eax, [ebx] mov eax, [ebx]
19 push eax push eax
Figure 11: Type II clone example.
A Type III clone is a Type II clone with further modifications. Statements can be changed, added,
or removed, in addition to variations in identifiers, literals, types, layout and comments. In the
example of Figure 12, the order of the two instructions at lines 3 and 4 was inverted.
1 mov esi, [ebp+arg_0] mov esi, [ebp+arg_0]
2 mov edx, [esi+214h] mov edx, [esi+214h]
3 mov edi, [esi+220h] mov [ebp+var_4], edx
4 mov [ebp+var_4], edx mov edi, [esi+220h]
5 cmp [esi+21Ch], edi cmp [esi+21Ch], edi
6 jl short loc_76641044 jl short loc_76641044
7 lea ebx, [edx+edi*8] lea ebx, [edx+edi*8]
Figure 12: Type III clone example.
16 DRDC-RDDC-2015-R027
3.2.4 Type IV clones
A Type IV clone, also known as a semantic clone, occurs when two or more code fragments
perform the same computation, but using different syntactic variants. In the example of Figure 13,
the two code fragments carry out the same function, i.e., compute the length of a string. However,
as it can be seen, their implementation differs significantly.
1 strlen1 proc near strlen3 proc near
2
3 arg_0 = dword ptr 4 arg_0 = dword ptr 4
4
5 mov eax, [esp+arg_0] push edi
6 mov edi, [esp+4+arg_0]
7 loc_401004: xor ecx, ecx
8 cmp byte ptr [eax], 0 not ecx
9 jz short done xor al, al
10 inc eax cld
11 jmp short loc_401004 repne scasb
12 not ecx
13 done: lea eax, [ecx-1]
14 sub eax, [esp+arg_0] pop edi
15 retn retn
16 strlen1 endp strlen3 endp
Figure 13: Type IV clone example.
Type I Clone
Exact Clone
Type II Clone
DRDC-RDDC-2015-R027 17
The first usage scenario which should be supported is the detection of exact clones. Figure 14
displays an example of an exact clone detected in both Citadel (on the left) and Zeus (on the
right), with their differences highlighted. When compared with their corresponding instructions in
Zeus, some Citadel instructions use different registers (e.g., ebx and edi at address 40ADEE and
40ADEF), labels (e.g., loc_40B135 at address 40ADFA), and function names (e.g., sub_433D74
at address 40AE59).
Figure 14: Exact clone detected in both Citadel (left) and Zeus (right).
18 DRDC-RDDC-2015-R027
The second usage scenario to be supported is illustrated in Figure 15. In this example, an inexact
clone was identified in both Citadel (on the left) and Zeus (on the right). This clone is related to
the RC4 function used for encrypting the command and control (C&C) network traffic between
the bot and the C&C server.
Figure 15: Inexact clone detected in Citadel (left) and Zeus (right).
DRDC-RDDC-2015-R027 19
Figure 16: Zeus code fragment.
20 DRDC-RDDC-2015-R027
4 Conclusions and future work
As software reverse engineering is a manually intensive and time-consuming process, this report
illustrates scenarios on how to accelerate it by partially automating some of its aspects. The
objective is to leverage existing sources of information available, namely (i) public open source
code repositories and (ii) previously analyzed assembly code fragments. The first approach aims
at saving time by providing the significance of a function’s strings, constants, and imported
functions, without having the reverse engineer analyze the underlying assembly code. The second
attempts to reduce redundant analysis efforts by detecting code clones of a target executable.
Future work will consist of developing prototypes to support the different scenarios explained in
the present report. These prototypes will allow matching assembly with its corresponding source
code, as well as detecting clones of assembly code fragments. Empirical studies will also be
conducted to measure their precision, recall, scalability, and performance using malware samples.
The results will be measured against the most representative approaches to compare assembly code.
DRDC-RDDC-2015-R027 21
This page intentionally left blank.
22 DRDC-RDDC-2015-R027
References .....
[1] Public Safety Canada, Canada’s Cyber Security Strategy: For a Stronger and More
Prosperous Canada, Government of Canada, 2010.
[2] Sophos, Security Threat Report: 2010, Sophos Group, Abingdon, United Kingdom,
Jan. 2010.
[3] PSTP08-0107eSec Combating Robot Networks and Their Controllers: A Study for the Public
Security and Technical Program (PSTP), CPQ0229.1.36.4, Bell Canada, 2010.
[4] D. Babic, D. Reynaud, and D. Song, “Malware Analysis with Tree Automata Inference,”
Proc. of the 23rd Int’l Conf. on Computer Aided Verification (CAV ‘11), Snowbird, Utah,
Jul. 2011, pp. 116-131.
[7] P. Porras, H. Saidi, and V. Yegneswaran, “An Analysis of Conficker,” Mar. 2009;
http://mtc.sri.com/Conficker/.
[8] B. Stock, et al., “Walowdac – Analysis of a Peer-to-Peer Botnet,” Proc. of the 2009 European
Conf. on Computer Network Defense (EC2ND '09), Milan, Italy, Nov. 2009, pp. 13-20.
[9] sKyWIper Analysis Team, sKyWIper (a.k.a. Flame a.k.a. Flamer): A Complex Malware for
Targeted Attacks, tech. report, Department of Telecommunications, Budapest Univ. of
Technology and Economics, 2012.
[10] Analysis Team, Malware Analysis: Citadel, tech. report, AhnLab Security Emergency
response Center, 2012.
[14] Antelink, “Antepedia Open Source Search Engine - Stay aware of updates & security issues
of open source projects,” Accessed Jan. 2015; http://www.antepedia.com/.
[15] Microsoft, “CodePlex - Open Source Project Hosting,” Accessed Jan. 2015;
https://www.codeplex.com/.
DRDC-RDDC-2015-R027 23
[16] GrepCode, “GrepCode.com - Java Source Code Search 2.0,” Accessed Jan. 2015;
http://grepcode.com/.
[17] GitHub, “GitHub · Build software better, together.,” Accessed Jan. 2015;
https://github.com/.
[18] Aragon Consulting Group, “Home | krugle - software development productivity,” Accessed
Jan. 2015; http://www.krugle.com/.
[19] Black Duck Software, “Open Hub Code Search,” Accessed Jan. 2015;
http://code.openhub.net/.
[20] B.E. Boyter, “searchcode | source code search engine,” Accessed Jan. 2015;
https://searchcode.com/.
[24] Black Duck Software, “The MirOS Project Open Source Project on Open Hub,” Accessed
Jan. 2015; https://www.openhub.net/p/mirbsd.
[26] J.H. Johnson, “Identifying Redundancy in Source Code Using Fingerprints,” Proc. of the
1993 Conf. of the Centre for Advanced Studies on Collaborative Research: Software Eng.,
Toronto, Ont., Sept. 1993, pp. 171-183.
[27] A. Marcus and J.I. Maletic, “Identification of High-level Concept Clones in Source Code,”
Proc. of the 16th IEEE Int’l Conf. on Automated Software Eng. (ASE ‘01), Coronado, Calif.,
Nov. 2001, pp. 107-114.
[28] J. Ji, et al., “Source Code Similarity Detection Using Adaptive Local Alignment of
Keywords,” Proc. of the 8th Int’l Conf. on Parallel and Distributed Computing,
Applications and Technologies (PDCAT ‘07), Adelaide, Australia, Dec. 2007, pp. 179-180.
[29] B.S. Baker, “On Finding Duplication and Near-Duplication in Large Software Systems,”
Proc. of the 2nd Working Conf. on Reverse Eng. (WCRE ‘95), Toronto, Ont., Jul. 1995, pp.
86-95.
24 DRDC-RDDC-2015-R027
[31] H.A. Basit, et al., “Efficient Token Based Clone Detection with Flexible Tokenization,”
Proc. of the 6th Joint Meeting of the European Software Eng. Conf. and the ACM
SIGSOFT Symp. on the Foundations of Software Eng., Dubrovnik, Croatia, Sept. 2007,
pp. 513-516.
[32] B. Hummel, et al., “Index-based Code Clone Detection: Incremental, Distributed, Scalable,”
Proc. of the 2010 IEEE Int’l Conf. on Software Maintenance, Timisoara, Romania, Sept.
2010, pp. 1-9.
[33] I.D. Baxter et al., “Clone Detection Using Abstract Syntax Trees,” Proc. of the Int’l Conf.
on Software Maintenance (ICSM '98), Bethesda, Md., Nov. 1998, pp. 368-377.
[34] V. Wahler, et al., “Clone Detection in Source Code by Frequent Itemset Techniques,” Proc.
of the 4th IEEE Int’l Workshop on Source Code Analysis and Manipulation (SCAM ‘04),
Chicago, Ill., Sept. 2004, pp. 128-135.
[35] W.S. Evans, C.W. Fraser, and F. Ma, “Clone Detection via Structural Abstraction,” Proc. of
the 14th Working Conf. on Reverse Eng. (WCRE ’07), Vancouver, B.C., Oct. 2007,
pp. 150-159.
[36] K.A. Kontogiannis, et al., “Pattern Matching for Clone and Concept Detection,” J. of
Automated Software Engineering, vol. 3, no. 1-2, Jun. 1996, pp. 77-108.
[37] J. Mayrand, C. Leblanc, and E. Merlo, “Experiment on the Automatic Detection of Function
Clones in a Software System Using Metrics,” Proc. of the 1996 Int’l Conf. on Software
Maintenance (ICSM ’96), Monterey, Calif., Nov. 1996, pp. 244-253.
[38] J. Krinke, “Identifying Similar Code with Program Dependence Graphs,” Proc. of the 8th
Working Conf. on Reverse Eng. (WCRE ’01), Stuttgart, Germany, Oct. 2001, pp. 301-309.
[39] R. Komondoor and S. Horwitz, “Using Slicing to Identify Duplication in Source Code,”
Proc. of the 8th Int’l Symp. on Static Analysis (SAS 2001), Paris, France, Jul. 2001,
pp. 40-56.
[40] C. Liu, et al., “GPLAG: Detection of Software Plagiarism by Program Dependence Graph
Analysis,” Proc. of the 12th ACM SIGKDD Int’l Conf. on Knowledge Discovery and Data
Mining (KDD ’06), Philadelphia, Pa., Aug. 2006, pp. 872-881.
[41] J. Jang and D. Brumley, BitShred: Fast, Scalable Code Reuse Detection in Binary Code,
tech. report CMU-CyLab-10-006, CyLab, Carnegie Mellon Univ., 2009.
[42] J. Jang, D. Brumley, and S. Venkataraman, BitShred: Fast, Scalable Malware Triage, tech.
report CMU-CyLab-10-022, CyLab, Carnegie Mellon Univ., 2010.
[43] M.E. Karim, et al., “Malware Phylogeny Generation using Permutations of Code,” J. in
Computer Virology, vol. 1, no. 1, 2005, pp. 13-23.
[44] A. Schulman, “Finding Binary Clones with Opstrings and Function Digests, Doctor Dobb's
J., vol. 30, no. 9, Sept. 2005, pp. 64-70.
DRDC-RDDC-2015-R027 25
[45] A. Walenstein, et al., “Exploiting Similarity Between Variants to Defeat Malware: “Vilo”
Method for Comparing and Searching Binary Programs,” Proc. of BlackHat DC 2007.
[47] A. Saebjornsen, et al., “Detecting Code Clones in Binary Executables,” Proc. of the 18th
Int’l Symp. on Software Testing and Analysis (ISSTA '09), Chicago, Ill., Jul. 2009,
pp. 117-128.
[48] E. Carrera and G. Erdélyi, “Digital Genome Mapping - Advanced Binary Malware
Analysis,” Proc. of the Virus Bulletin Conf., Chicago, Ill., Sept. 2004, pp. 187-197.
[49] Halvar Flake, “More Fun with Graphs,” Black Hat Federal, 2003.
[50] M.E. Karim, et al., “Malware Phylogeny Generation using Permutations of Code,” J. in
Computer Virology, vol. 1, no. 1, 2005, pp. 13-23.
[51] I. Briones and A. Gomez, “Graphs, Entropy and Grid Computing: Automatic Comparison
of Malware,” Proc. of the Virus Bulletin Conf., Ottawa, Ont., Oct. 2008, pp. 187-197.
[52] S.S. Anju, et al., “Malware Detection Using Assembly Code and Control Flow Graph
Optimization,” Proc. of the 1st Amrita ACM-W Celebration on Women in Computing in
India (A2CWiC '10), Tamilnadu, India, Sept. 2010, pp. 65:1-65:4.
[53] T. Dullien, et al., “Automated Attacker Correlation for Malicious Code,” Proc. of the NATO
RTO Information Assurance and Cyber Defence, Tallinn, Estonia, Dec. 2010.
[54] P.M. Comparetti, et al., “Identifying Dormant Functionality in Malware Programs,” Proc.
of the 2010 IEEE Symp. on Security and Privacy (SP ’10), Oakland, Calif., May 2010,
pp. 61-76.
[55] Z. Wang, K. Pierce, and S. McFarling, “BMAT - A Binary Matching Tool for Stale Profile
Propagation,” J. of Instruction-Level Parallelism, vol. 2, 2000.
[56] C.K. Roy, J.R. Cordy, and R. Koschke, “Comparison and Evaluation of code Clone
Detection Techniques and Tools: A Qualitative Approach,” Science of Computer
Programming, vol. 74, no. 7, May 2009, pp. 470-495.
[57] C.K. Roy and J.R. Cordy, A Survey on Software Clone Detection Research, tech. report
2007-541, School of Computing, Queen's Univ., Kingston, Ont., 2007.
26 DRDC-RDDC-2015-R027
List of symbols/abbreviations/acronyms/initialisms
DRDC-RDDC-2015-R027 27
This page intentionally left blank.
28 DRDC-RDDC-2015-R027
DOCUMENT CONTROL DATA
(Security markings for the title, abstract and indexing annotation must be entered when the document is Classified or Designated)
1. ORIGINATOR (The name and address of the organization preparing the document. 2a. SECURITY MARKING
Organizations for whom the document was prepared, e.g. Centre sponsoring a (Overall security marking of the document including
contractor's report, or tasking agency, are entered in section 8.) special supplemental markings if applicable.)
Software fingerprinting for automated assembly code analysis: Project overview (U)
4. AUTHORS (last name, followed by initials – ranks, titles, etc. not to be used)
Charland, P
5. DATE OF PUBLICATION 6a. NO. OF PAGES 6b. NO. OF REFS
(Month and year of publication of document.) (Total containing information, (Total cited in document.)
including Annexes, Appendices,
etc.)
March 2015
40 57
7. DESCRIPTIVE NOTES (The category of the document, e.g. technical report, technical note or memorandum. If appropriate, enter the type of report,
e.g. interim, progress, summary, annual or final. Give the inclusive dates when a specific reporting period is covered.)
Scientific Report
8. SPONSORING ACTIVITY (The name of the department project office or laboratory sponsoring the research and development – include address.)
15bz18
10a. ORIGINATOR'S DOCUMENT NUMBER (The official document 10b. OTHER DOCUMENT NO(s). (Any other numbers which may be
number by which the document is identified by the originating assigned this document either by the originator or by the sponsor.)
activity. This number must be unique to this document.)
DRDC-RDDC-2015-R027
11. DOCUMENT AVAILABILITY (Any limitations on further dissemination of the document, other than those imposed by security classification.)
Unlimited
12. DOCUMENT ANNOUNCEMENT (Any limitation to the bibliographic announcement of this document. This will normally correspond to the
Document Availability (11). However, where further distribution (beyond the audience specified in (11) is possible, a wider announcement
audience may be selected.))
Unlimited
DRDC-RDDC-2015-R027 29
13. ABSTRACT (A brief and factual summary of the document. It may also appear elsewhere in the body of the document itself. It is highly desirable that
the abstract of classified documents be unclassified. Each paragraph of the abstract shall begin with an indication of the security classification of the
information in the paragraph (unless the document itself is unclassified) represented as (S), (C), (R), or (U). It is not necessary to include here abstracts in
both official languages unless the text is bilingual.)
With the revolution in information technology, the dependence of the Canadian Armed Forces
(CAF) on their information systems continues to grow. While information systems-based assets
confer a distinct advantage, they also make the CAF vulnerable if adversaries interfere with
those. Unfortunately, the technology required to disrupt and damage an information system
through malicious software (malware) is far less sophisticated and expensive than the amount of
investment required to create the system. To understand and mitigate this threat, reverse
engineering has to be performed to analyze malware. However, software reverse engineering is
a manually intensive and time-consuming process. The learning curve to master it is quite steep
and once mastered, the process is hindered when anti-reverse engineering techniques are used.
This results in the very few available reverse engineers being quickly saturated. This Scientific
Report describes new approaches to accelerate the reverse engineering process of malware. The
goal is to reduce redundant analysis efforts by automating the identification of code fragments
which reuse (i) previously analyzed assembly code or (ii) open source code publicly available.
---------------------------------------------------------------------------------------------------------------
Software reverse engineering; malware analysis; assembly to source code matching; clone
detection
30 DRDC-RDDC-2015-R027