Writing Ultra-Small Windows Executables

(April 17, 2017)
How small can a valid and useful Win32 executable be? There
are already a few tutorials on this topic, but they either
no longer work on modern Windows versions or only cover the
most basic "do nothing but return zero" program. The goal
here is to do something genuinely useful: a console
application that writes the contents of the Windows
clipboard to standard output.
The target application

The target application scenario is a little more than a
simple "Hello World" application, but it is still very
basic Win32 API 101, with only a few calls to kernel32.dll
and user32.dll functions and very little algorithmic stuff
in between. Still, it's undeniably useful, e.g. when you
want to filter some text from an editor through (Win32
versions of) sed, grep, tr, cut, awk or similar
command-line tools. Or (and this is what I use it for very
frequently) to quickly change the directory in the console
to something else you happen to have open in Explorer.
Besides copying the path from Explorer, you would normally
need to type cd /d in the command prompt window, followed
by a right-click and Enter. (The awkward /d parameter is
important; otherwise the current drive letter wouldn't be
changed as well.) That's cumbersome; I use a small batch
file (called fcd.cmd in my case) which resides in a
directory somewhere in the PATH and does this
automatically:
@for /f "usebackq tokens=*" %%a in (`getclip`) do @cd /d %%a
It calls the getclip.exe helper program to get the
clipboard's contents and crafts a cd /d command with it
(using cmd.exe's byzantine syntax to make the backticks
work their usual magic). I had this helper program already,
but it was part of a GnuWin32 installation that requires a
bunch of obscure DLLs to work; for my new Windows
installation, I didn't want to use this cruft again.
However, there's no simple alternative either: Windows
ships with clip.exe, but this only works in the opposite
direction, putting data from a pipe into the clipboard, not
getting it out. A quick internet search only came up with
solutions that had even more ridiculous dependencies. So I
decided to shave a yak, er, take matters into my own hands
and write my own simple, small implementation of
getclip.exe. Couldn't be that hard.
The naïve C implementation

The obvious first try is to write a C program, like this:
#include <windows.h>
#include <stdio.h>
#include <string.h>
int main(void) {
    if (!OpenClipboard(NULL)) {
        ExitProcess(1);
    }
    HANDLE hData = GetClipboardData(CF_TEXT);
    if (!hData) {
        CloseClipboard();
        ExitProcess(1);
    }
    const char *str = (const char*)GlobalLock(hData);
    if (!str) {
        CloseClipboard();
        ExitProcess(1);
    }
    fwrite((const void*)str, 1, strlen(str), stdout);
    GlobalUnlock(hData);
    CloseClipboard();
    return 0;
}
Note the use of fwrite instead of puts to avoid adding an
additional newline at the end of the output. Other than
that, it's fairly basic stuff: opening the clipboard,
requesting the data as plain 8-bit text (CF_TEXT), mapping
the data into our address space, writing it to stdout and
releasing all the resources we acquired.
Using Visual Studio 2017 (configured to optimize for size)
and the default Windows 8.1 SDK, this gives us an
executable of 76800 bytes. Yes, that's almost 77 kilobytes!
We could get this down to 8704 bytes by linking dynamically
against the C library DLL, but that's cheating: this way,
the user would need a bunch of DLLs in the multi-megabyte
range to run it. We can do better than that.
That pesky C library

Looking closely at the code, it becomes obvious that this
"C library tax" is not really necessary for this program:
The only required library functions are fwrite and strlen;
everything else is just plain Win32 API calls into
kernel32.dll and user32.dll. fwrite to stdout can be
trivially replaced by GetStdHandle and WriteFile, and
strlen doesn't even need a substitute because it's inlined
by the compiler anyway. So let's just get rid of the C
library altogether and link with /NODEFAULTLIB. In doing
so, we lose the luxury of having a main function that has a
working heap and gets the command line parsed into argc and
argv, but we don't need that anyway. We can instead make
our main function be mainCRTStartup, which is the default
entry point of console-mode Windows executables, and leave
it by calling ExitProcess. The whole program turns into
this:
#include <windows.h>
#include <string.h>
int mainCRTStartup(void) {
    if (!OpenClipboard(NULL)) {
        ExitProcess(1);
    }
    HANDLE hData = GetClipboardData(CF_TEXT);
    if (!hData) {
        CloseClipboard();
        ExitProcess(1);
    }
    const char *str = (const char*) GlobalLock(hData);
    if (!str) {
        CloseClipboard();
        ExitProcess(1);
    }
    DWORD dummy;
    WriteFile(GetStdHandle(STD_OUTPUT_HANDLE),
              (const void*)str, strlen(str), &dummy, NULL);
    GlobalUnlock(hData);
    CloseClipboard();
    ExitProcess(0);
}
That's not too many changes, but the result is quite
impressive: we're down to 3072 bytes (3 KiB), with no
dependencies other than the two mandatory DLLs! At this
point, further optimization isn't reasonable: We're already
below the page size, filesystem cluster size and (modern)
disk sector size, all of which are 4 KiB. If we shrink
anything below these 4 KiB, we won't save any memory,
storage space or load time. So there we have it: we pushed
it as far as it makes sense!

But no, we won't leave it at that! (At least, I won't.) It
might be a purely sporting exercise by now, but if we
started the quest to make a getclip implementation as small
as possible, we might just as well finish it! So let's take
the next step ...
Assembly to the rescue

The actual code is simple enough to forego the comfort zone
of C programming altogether and write it straight in
assembly, like this (NASM/YASM syntax):
global _mainCRTStartup
extern _ExitProcess@4
extern _OpenClipboard@4
extern _CloseClipboard@0
extern _GetClipboardData@4
extern _GlobalLock@4
extern _GlobalUnlock@4
extern _GetStdHandle@4
extern _WriteFile@20
section .text
_mainCRTStartup:
; set up stack frame for *lpBytesWritten

push ebp
sub esp, 4
; if (!OpenClipboard(NULL)) ExitProcess(1);
push 0
call _OpenClipboard@4
or eax, eax
jz error2
; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;

push 1 ; CF_TEXT
call _GetClipboardData@4
or eax, eax
jz error
push eax ; save hData for GlobalUnlock at the end
; char* str = GlobalLock(hData); if (!str) fail;

push eax
call _GlobalLock@4
or eax, eax
jz error
; strlen(str)
mov ecx, eax
strlen_loop:
mov dl, [ecx]
or dl, dl
jz strlen_end
inc ecx
jmp strlen_loop
strlen_end:
sub ecx, eax
; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
push 0 ; lpOverlapped = NULL
lea edx, [ebp-4] ; put nBytesWritten on the stack
push edx
push ecx ; nNumberOfBytesToWrite = strlen(str)
push eax ; lpBuffer = str
push -11 ; hFile = ...
call _GetStdHandle@4 ; ... GetStdHandle(STD_OUTPUT_HANDLE)
push eax
call _WriteFile@20
; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
call _GlobalUnlock@4 ; hData is already on the stack
call _CloseClipboard@0
push 0
call _ExitProcess@4
error:
call _CloseClipboard@0
error2:
push 1
call _ExitProcess@4
Assembling this and linking it with Microsoft's link.exe
generates an executable of 2560 bytes. That might sound a
bit disappointing (a mere 512 bytes saved by writing
everything in assembly, come on!), but in fact it's more or
less expected: Code generated by a good C compiler is
usually already very tight (it might even be better than my
attempt; I didn't check that, though), and once it is told
to omit all C library dependencies, there's not much
additional cruft left that a compiler would produce but an
assembler wouldn't.

However, a closer look at the generated executable shows
lots of zeroes and all kinds of PE sections, including
relocation information (which is not needed at all for
non-ASLR executables) and (empty) debug information. There
are no linker options (that I know of) that get rid of
this, so we need to dig even deeper ...
Constructing PE files by hand

The perhaps not easiest, but certainly most thorough way to
stop any interference from the linker is not to use one and
to write all the PE headers and sections directly in the
assembler. Unfortunately, the PE format is not very simple
and full of idiosyncrasies, so it takes some effort until a
working binary emerges:
bits 32
BASE equ 0x00400000
ALIGNMENT equ 512
SECTALIGN equ 4096
%define ROUND(v, a) (((v + a - 1) / a) * a)

%define ALIGNED(v) (ROUND(v, ALIGNMENT))
%define RVA(obj) (obj - BASE)
section header progbits start=0 vstart=BASE
mz_hdr:
dw "MZ" ; DOS magic
times 0x3a db 0 ; [UNUSED] DOS header
dd RVA(pe_hdr) ; address of PE header
pe_hdr:
dw "PE",0 ; PE magic + 2 padding bytes
dw 0x014c ; i386 architecture
dw 2 ; two sections
dd 0 ; [UNUSED] timestamp
dd 0 ; [UNUSED] symbol table pointer
dd 0 ; [UNUSED] symbol count
dw OPT_HDR_SIZE ; optional header size
dw 0x0102 ; characteristics: 32-bit, executable
opt_hdr:
dw 0x010b ; optional header magic
db 13,37 ; [UNUSED] linker version
dd ALIGNED(S_TEXT_SIZE) ; [UNUSED] code size
dd ALIGNED(S_IDATA_SIZE) ; [UNUSED] size of initialized data
dd 0 ; [UNUSED] size of uninitialized data
dd RVA(section..text.vstart) ; entry point address
dd RVA(section..text.vstart) ; [UNUSED] base of code
dd RVA(section..idata.vstart) ; [UNUSED] base of data
dd BASE ; image base
dd SECTALIGN ; section alignment
dd ALIGNMENT ; file alignment
dw 4,0 ; [UNUSED] OS version
dw 0,0 ; [UNUSED] image version
dw 4,0 ; subsystem version
dd 0 ; [UNUSED] Win32 version
dd RVA(the_end) ; size of image
dd ALIGNED(ALL_HDR_SIZE) ; size of headers
dd 0 ; [UNUSED] checksum
dw 3 ; subsystem = console
dw 0 ; [UNUSED] DLL characteristics
dd 0x00100000 ; [UNUSED] maximum stack size
dd 0x00001000 ; initial stack size
dd 0x00100000 ; maximum heap size
dd 0x00001000 ; [UNUSED] initial heap size
dd 0 ; [UNUSED] loader flags
dd 16 ; number of data directory entries
dd 0,0 ; no export table
dd RVA(import_table) ; import table address
dd IMPORT_TABLE_SIZE ; import table size
times 14 dd 0,0 ; no other entries in the data directories
OPT_HDR_SIZE equ $ - opt_hdr
sect_hdr_text:
db ".text",0,0,0 ; section name
dd ALIGNED(S_TEXT_SIZE) ; virtual size
dd RVA(section..text.vstart) ; virtual address
dd ALIGNED(S_TEXT_SIZE) ; file size
dd section..text.start ; file position
dd 0,0 ; no relocations or debug info
dw 0,0 ; no relocations or debug info
dd 0x60000020 ; flags: code, readable, executable
sect_hdr_idata:
db ".idata",0,0 ; section name
dd ALIGNED(S_IDATA_SIZE) ; virtual size
dd RVA(section..idata.vstart) ; virtual address
dd ALIGNED(S_IDATA_SIZE) ; file size
dd section..idata.start ; file position
dd 0,0 ; no relocations or debug info
dw 0,0 ; no relocations or debug info
dd 0xC0000040 ; flags: data, readable, writeable
ALL_HDR_SIZE equ $ - $$
;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;
section .text progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
s_text:
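; set up stack frame for *lpBytesWritten
push ebp
sub esp, 4
; if (!OpenClipboard(NULL)) ExitProcess(1);
push 0
call [OpenClipboard]
or eax, eax
jz error2
; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
push 1 ; CF_TEXT
call [GetClipboardData]
or eax, eax
jz error
push eax ; save hData for GlobalUnlock at the end
; char* str = GlobalLock(hData); if (!str) fail;
push eax
call [GlobalLock]
or eax, eax
jz error
; strlen(str)
mov ecx, eax
strlen_loop:
mov dl, [ecx]
or dl, dl
jz strlen_end
inc ecx
jmp strlen_loop
strlen_end:
sub ecx, eax
; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
push 0 ; lpOverlapped = NULL
lea edx, [ebp-4] ; put nBytesWritten on the stack
push edx
push ecx ; nNumberOfBytesToWrite = strlen(str)
push eax ; lpBuffer = str
push -11 ; hFile = ...
call [GetStdHandle] ; ... GetStdHandle(STD_OUTPUT_HANDLE)
push eax
call [WriteFile]
; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
call [GlobalUnlock] ; hData is already on the stack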
call [CloseClipboard]
push 0
call [ExitProcess]
error:
call [CloseClipboard]
error2:
push 1
call [ExitProcess]
S_TEXT_SIZE equ $ - s_text
;;;;;;;;;;;;;;;;;;;; .idata ;;;;;;;;;;;;;;;;;
section .idata progbits follows=.text align=ALIGNMENT vstart=BASE+SECTALIGN*2
s_idata:
import_table:
; import of kernel32.dll
dd 0 ; [UNUSED] read-only IAT
dd 0 ; [UNUSED] timestamp
dd 0 ; [UNUSED] forwarder chain
dd RVA(N_kernel32) ; library name
dd RVA(IAT_kernel32) ; IAT pointer
; import of user32.dll
dd 0 ; [UNUSED] read-only IAT
dd 0 ; [UNUSED] timestamp
dd 0 ; [UNUSED] forwarder chain
dd RVA(N_user32) ; library name
dd RVA(IAT_user32) ; IAT pointer
; terminator (empty item)
times 5 dd 0
IMPORT_TABLE_SIZE: equ $ - import_table
IAT_kernel32:
ExitProcess: dd RVA(H_ExitProcess)
GlobalLock: dd RVA(H_GlobalLock)
GlobalUnlock: dd RVA(H_GlobalUnlock)
GetStdHandle: dd RVA(H_GetStdHandle)
WriteFile: dd RVA(H_WriteFile)
dd 0
IAT_user32:
OpenClipboard: dd RVA(H_OpenClipboard)
CloseClipboard: dd RVA(H_CloseClipboard)
GetClipboardData: dd RVA(H_GetClipboardData)
dd 0
align 4, db 0
N_kernel32: db "kernel32.dll",0
align 4, db 0
N_user32: db "user32.dll",0
align 2, db 0
H_OpenClipboard: db 0,0,"OpenClipboard",0
align 2, db 0
H_GetClipboardData: db 0,0,"GetClipboardData",0
align 2, db 0
H_GlobalLock: db 0,0,"GlobalLock",0
align 2, db 0
H_GetStdHandle: db 0,0,"GetStdHandle",0
align 2, db 0
H_WriteFile: db 0,0,"WriteFile",0
align 2, db 0
H_GlobalUnlock: db 0,0,"GlobalUnlock",0
align 2, db 0
H_CloseClipboard: db 0,0,"CloseClipboard",0
align 2, db 0
H_ExitProcess: db 0,0,"ExitProcess",0
S_IDATA_SIZE equ $ - s_idata
align ALIGNMENT, db 0
the_end:
That's a pretty standard "by the book" implementation of a
PE file: Code and import tables are nicely segregated into
separate sections, the sections have their default
alignment, all headers are spelled out in full, and fields
which are not used by the loader nevertheless have sensible
values or at least the usual dummy values (i.e. zero). The
only thing that's missing is a proper DOS stub, so if
anybody ever tries to run this on real DOS, it will crash
and burn.

So what does it give us? The result is 1536 bytes of finest
hand-crafted code. Not too bad, but not quite satisfying
either. The elephant in the room is the 512-byte alignment
of the sections in the file, which causes a lot of empty
space: Can't we just turn that down to, like, nothing?
Unfortunately, we really can't: Windows 10's loader insists
on a file alignment of 512 bytes; any attempt to decrease
it results in the message "This app can't be executed on
this PC". It's not even possible to strip the padding at
the end of the last section. (WINE accepts all of that
without flinching, but that's not at all our target
platform.)
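To see where the 1536 bytes come from, the padding arithmetic can be sketched in a few lines of C. This is only an illustration, not part of getclip: round_up and pe_file_size are made-up helper names, and the section sizes in the usage note are placeholders (all that matters is that each one is between 1 and 512 bytes). The 392 bytes for the headers follow from the listing above: 64 bytes of DOS header, 24 bytes of PE signature and file header, 224 bytes of optional header and two 40-byte section headers.

```c
#include <assert.h>
#include <stddef.h>

/* Round v up to the next multiple of a (same idea as the ROUND macro). */
size_t round_up(size_t v, size_t a)
{
    assert(a > 0);
    return (v + a - 1) / a * a;
}

/* Size of a PE file in which the header block and every section are
 * each padded to the file alignment. All sizes are in bytes. */
size_t pe_file_size(size_t header_bytes, const size_t *section_bytes,
                    size_t section_count, size_t file_alignment)
{
    size_t total = round_up(header_bytes, file_alignment);
    for (size_t i = 0; i < section_count; i++)
        total += round_up(section_bytes[i], file_alignment);
    return total;
}
```

With a file alignment of 512, pe_file_size(392, sizes, 2, 512) gives 1536 for any two section sizes of up to 512 bytes each: three blocks of 512 bytes, a good part of which is padding.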
Merging sections
Even with Windows being so uncooperative, we still have one
trick up our sleeves: We can just put both the code and the
import tables into a combined section. That's not a common
thing to do (code/data separation exists for a reason), but
on our quest to make the file smaller, we take what we can
get. The modifications are quite small, so here's just a
diff:
@@ -34,3 +42,3 @@
- dw 2 ; two sections
+ dw 1 ; one section
@@ -44,8 +52,8 @@
- dd ALIGNED(S_TEXT_SIZE) ; [UNUSED] code size
- dd ALIGNED(S_IDATA_SIZE) ; [UNUSED] size of initialized data
+ dd ALIGNED(S_SECT_SIZE) ; [UNUSED] code size
+ dd ALIGNED(S_SECT_SIZE) ; [UNUSED] size of initialized data
 dd 0 ; [UNUSED] size of uninitialized data
- dd RVA(section..text.vstart) ; entry point address
- dd RVA(section..text.vstart) ; [UNUSED] base of code
- dd RVA(section..idata.vstart) ; [UNUSED] base of data
+ dd RVA(section.getclip.vstart); entry point address
+ dd RVA(section.getclip.vstart); [UNUSED] base of code
+ dd RVA(section.getclip.vstart); [UNUSED] base of data
@@ -74,20 +82,11 @@
-sect_hdr_text:
- db ".text",0,0,0 ; section name
- dd ALIGNED(S_TEXT_SIZE) ; virtual size
- dd RVA(section..text.vstart) ; virtual address
- dd ALIGNED(S_TEXT_SIZE) ; file size
- dd section..text.start ; file position
+sect_hdr:
+ db "getclip",0 ; section name
+ dd ALIGNED(S_SECT_SIZE) ; virtual size
+ dd RVA(section.getclip.vstart); virtual address
+ dd ALIGNED(S_SECT_SIZE) ; file size
+ dd section.getclip.start ; file position
- dd 0x60000020 ; flags: code, readable, executable
+ dd 0xE0000060 ; flags: code + data, readable, writeable, executable
-sect_hdr_idata:
- db ".idata",0,0 ; section name
- dd ALIGNED(S_IDATA_SIZE) ; virtual size
- dd RVA(section..idata.vstart) ; virtual address
- dd ALIGNED(S_IDATA_SIZE) ; file size
- dd section..idata.start ; file position
- dd 0,0 ; no relocations or debug info
- dw 0,0 ; no relocations or debug info
- dd 0xC0000040 ; flags: data, readable, writeable
@@ -97,4 +96,4 @@
-section .text progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
-s_text:
+section getclip progbits follows=header align=ALIGNMENT vstart=BASE+SECTALIGN*1
+the_section:
@@ -157,9 +156,5 @@
-S_TEXT_SIZE equ $ - s_text
-
;;;;;;;;;;;;;;;;;;;; .idata ;;;;;;;;;;;;;;;;;
-section .idata progbits follows=.text align=ALIGNMENT vstart=BASE+SECTALIGN*2
-s_idata:
-
+ align 4, ret
@@ -215,3 +210,3 @@
-S_IDATA_SIZE equ $ - s_idata
+S_SECT_SIZE equ $ - the_section
The result is (predictably) 1024 bytes, i.e. exactly 1 KiB.
Within the constraints of the Windows loader, it's not
possible to go below that: We need at least one
pseudo-section for the header and one section for actual
code and data, and both of them need to occupy at least a
full 512 bytes.
Going sectionless
As this whole section business works against us, can we
possibly live without it? Windows will load at least the
header part of the executable into memory anyway, and if we
sneak the actual code and import table data into there, we
should be fine. In fact, this used to work in the past, but
at least Windows 10 version 1703 (and very likely earlier
versions as well) simply ignores import tables that are
not contained in a section. As a result, the pointers to
the function names in the Import Address Table are not
replaced by the functions' entry point addresses: the
program will load just fine, but it will crash shortly
thereafter when it tries to call the first API function.

So if we want to go down the "sectionless PE" route, we
need to find an alternative way to load our imports. But
how can we do that? Even LoadLibrary and GetProcAddress
would need to be imported from kernel32.dll somehow ... or
do they? In fact, kernel32.dll (and ntdll.dll) are already
loaded, by default, by Windows' PE loader! We just need to
find the addresses somehow. This can be done with some
pointer chasing: The FS segment register points to the
Thread Environment Block (TEB), which contains a pointer to
the Process Environment Block (PEB), which contains a
pointer to the PE loader data, which contains a
doubly-linked circular list of loader data table entries,
one for each loaded DLL, each of which contains a pointer
to the DLL's base address. Phew. But as complicated as that
sounds, it's just six simple MOV instructions. The complex
part is what comes after that.
Because right now, we have a pointer to the base address of
a DLL that's supposed to be kernel32.dll. But we need
function pointers, not DLL base addresses, and we can't
just call GetProcAddress yet (because we don't know its
address). The only thing we can do is re-implement
GetProcAddress by parsing the PE header, looking for the
export tables, searching these for the desired function
name, and using the ultra-complicated three-step lookup
procedure (that doesn't even work as intended; I got
consistent off-by-one errors when implementing it according
to the spec) to get the actual address. That's a lot of
code, but there's no way around that.
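The three-step lookup can be modelled in portable C to make the moving parts visible. This is a sketch under simplifying assumptions, not the real thing: struct export_tables and lookup_export are hypothetical names, the tables hold ready-made pointers and values instead of RVAs that would have to be added to the DLL's base address, and the search is a plain linear scan. One detail worth knowing: according to Microsoft's current PE format documentation, the entries of the ordinal table are already unbiased (zero-based) indexes into the address table, so the ordinal base is not subtracted on this lookup path. That may well be the source of the off-by-one errors mentioned above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Simplified, hypothetical view of the three export tables of a DLL. */
struct export_tables {
    uint32_t name_count;            /* number of named exports           */
    const char *const *names;       /* step 1: list of function names    */
    const uint16_t *name_ordinals;  /* step 2: parallel to names; each   */
                                    /* entry is a zero-based index into  */
                                    /* the functions table               */
    const uint32_t *functions;      /* step 3: function address table    */
};

/* Returns the address-table entry for `wanted`, or 0 if the name is
 * not exported. */
uint32_t lookup_export(const struct export_tables *t, const char *wanted)
{
    assert(t != NULL && wanted != NULL);
    for (uint32_t i = 0; i < t->name_count; i++) {
        if (strcmp(t->names[i], wanted) == 0) {
            uint16_t index = t->name_ordinals[i]; /* no ordinal base here */
            return t->functions[index];
        }
    }
    return 0;
}
```

In a real DLL, the value found in the address table is again an RVA, so the DLL's base address has to be added before the result can be called.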
Having implemented a poor man's GetProcAddress, note that
we no longer need the real thing: We can directly look for
LoadLibrary in the loaded DLLs (one of which is always
kernel32.dll), load user32.dll with it and then use our own
look-up function for all other required API calls as well.
In fact, I went so far as to have a wrapper function that
takes the base address of a DLL and the function name,
looks the function up and calls it directly.
One nice side-effect of going sectionless is that Windows
now allows us to set the file alignment to an arbitrarily
low value, because it isn't really interested in any
alignment stuff in this case. (It does check that the
section alignment is equal to the file alignment, but
that's fine with us.)

There is one additional pitfall on Windows 7 64-bit (I
believe I didn't see this on 32-bit Windows 7, but I'm not
sure). It seems that its loader is not fully ignoring the
section table as it ought to: if the DWORD where the file
offset of the first section is stored is negative, the
executable can't be run. In effect, this means that the
byte at offset 23 (decimal) after the optional header must
not be 0x80 or greater. That's quite a restriction, because
we're going to put code there and we don't want to juggle
the instructions around until we have found an arrangement
that works! Fortunately, we can circumvent this: The
"optional header size" field does not really store the size
of the optional header (the optional header has a fixed
size after all, only determined by the number of data
directory entries, which is stored explicitly). No, what
the "optional header size" field actually encodes is the
offset of the section table, relative to the optional
header's start. So we simply need to choose a value such
that the DWORD at offset [optional header start + optional
header size + 20] is guaranteed to be less than 0x80000000.
One good candidate is the "image base" field, which
defaults to 0x400000 and is located at offset 28 inside the
optional header, so we put down 8 as the optional header
size and we're set!
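The arithmetic behind this trick is easy to double-check. The following C fragment is just a sanity check with made-up names (first_section_raw_pointer_offset, raw_pointer_is_acceptable); the constants 20 and 28 are the field offsets named in the text, and the "negative DWORD" rule is the Windows 7 behaviour described above, not something taken from official documentation.

```c
#include <assert.h>
#include <stdint.h>

enum {
    SECTION_RAW_POINTER_OFFSET = 20, /* file offset field in a section header */
    IMAGE_BASE_OFFSET          = 28  /* image base field in the optional header */
};

/* Offset, relative to the start of the optional header, of the DWORD
 * that the loader reads as the first section's file offset. */
uint32_t first_section_raw_pointer_offset(uint32_t optional_header_size)
{
    assert(optional_header_size <= 0xFFFF); /* the field is 16 bits wide */
    return optional_header_size + SECTION_RAW_POINTER_OFFSET;
}

/* Nonzero if the DWORD is non-negative when read as a signed value. */
int raw_pointer_is_acceptable(uint32_t dword_value)
{
    return dword_value < 0x80000000u;
}
```

With an optional header size of 8, the first function returns 28, which is exactly IMAGE_BASE_OFFSET, and the image base 0x00400000 passes the second check.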
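bits 32
BASE equ 0x00400000
ALIGNMENT equ 4
SECTALIGN equ 4
%define ROUND(v, a) (((v + a - 1) / a) * a)
%define ALIGNED(v) (ROUND(v, ALIGNMENT))
%define RVA(obj) (obj - BASE)
org BASE
mz_hdr:
dw "MZ" ; DOS magic
times 0x3a db 0 ; [UNUSED] DOS header
dd RVA(pe_hdr) ; address of PE header
pe_hdr:
dw "PE",0 ; PE magic + 2 padding bytes
dw 0x014c ; i386 architecture
dw 0 ; no sections
dd 0 ; [UNUSED] timestamp
dd 0 ; [UNUSED] symbol table pointer
dd 0 ; [UNUSED] symbol count
dw 8 ; optional header size
dw 0x0102 ; characteristics: 32-bit, executable
opt_hdr:
dw 0x010b ; optional header magic
db 13,37 ; [UNUSED] linker version
dd RVA(the_end) ; [UNUSED] code size
dd RVA(the_end) ; [UNUSED] size of initialized data
dd 0 ; [UNUSED] size of uninitialized data
dd RVA(main) ; entry point address
dd RVA(main) ; [UNUSED] base of code
dd RVA(main) ; [UNUSED] base of data
dd BASE ; image base
dd SECTALIGN ; section alignment
dd ALIGNMENT ; file alignment
dw 4,0 ; [UNUSED] OS version
dw 0,0 ; [UNUSED] image version
dw 4,0 ; subsystem version
dd 0 ; [UNUSED] Win32 version
dd RVA(the_end) ; size of image
dd ALIGNED(ALL_HDR_SIZE) ; size of headers
dd 0 ; [UNUSED] checksum
dw 3 ; subsystem = console
dw 0 ; [UNUSED] DLL characteristics
dd 0x00100000 ; [UNUSED] maximum stack size
dd 0x00001000 ; initial stack size
dd 0x00100000 ; maximum heap size
dd 0x00001000 ; [UNUSED] initial heap size
dd 0 ; [UNUSED] loader flags
dd 16 ; number of data directory entries
times 16 dd 0,0 ; no entries in the data directories
ALL_HDR_SIZE equ $ - $$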
;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;
main:
; set up stack frame for local variables
push ebp
%define DummyVar ebp-4
%define kernel32base ebp-8
%define user32base ebp-12
sub esp, 12
; locate the loader data tables where the loaded DLLs are managed
mov eax, [fs:0x30] ; get PEB pointer from TEB
mov eax, [eax+0x0C] ; get PEB_LDR_DATA pointer from PEB
mov eax, [eax+0x14] ; go to first LDR_DATA_TABLE_ENTRY
mov eax, [eax] ; move two entries further, because the
mov eax, [eax] ; third is typically kernel32.dll
try_next_lib:
push eax ; save LDR_DATA_TABLE_ENTRY pointer
mov ebx, [eax+0x10] ; load base address of the library
mov esi, N_LoadLibrary
call find_import ; load LoadLibrary from there (if present)
or eax, eax ; found?
jnz kernel32_found
pop eax ; restore LDR_DATA_TABLE_ENTRY pointer
mov eax, [eax] ; go to next LDR_DATA_TABLE_ENTRY
jmp try_next_lib
find_import: ; FUNCTION that finds procedure [esi] in library at base [ebx]
mov edx, [ebx+0x3c] ; get PE header pointer (w/ RVA translation)
add edx, ebx
cmp word [edx], "PE" ; is it a PE header?
jne find_import_fail
mov eax, [edx+0x74] ; check if data directory is present
or eax, eax
jz find_import_fail
mov edx, [edx+0x78] ; get export table pointer RVA
or edx, edx ; check if export table is present
jz find_import_fail
add edx, ebx ; get absolute address of export table
push edx ; store the export table address for later
mov ecx, [edx+0x18] ; ecx = number of named functions
mov edx, [edx+0x20] ; edx = address-of-names list (w/ RVA translation)
add edx, ebx
name_loop:
dec ecx ; pre-decrement counter and check if we're done
js find_import_fail1
push esi ; store the desired function name's pointer (we will clobber it)
mov edi, [edx] ; load function name (w/ RVA translation)
add edi, ebx
cmp_loop:
lodsb ; load a byte of the two strings into AL, AH
mov ah, [edi] ; and increase the pointers
inc edi
cmp al, ah ; identical bytes?
jne next_name ; if not, this is not the correct name
or al, al ; zero byte reached?
jnz cmp_loop ; if not, we need to compare more
; if we arrive here, we have a match!
pop esi ; restore the name pointer (though we don't use it any longer)
pop edx ; restore the export table address
sub ecx, [edx+0x18] ; turn the negative counter ECX into a positive one
neg ecx
dec ecx
mov eax, [edx+0x24] ; get address of ordinal table (w/ RVA translation)
add eax, ebx
movzx ecx, word [eax+ecx*2] ; load ordinal from table
;sub ecx, [edx+0x10] ; subtract ordinal base
mov eax, [edx+0x1C] ; get address of function address table (w/ RVA translation)
add eax, ebx
mov eax, [eax+ecx*4] ; load function address (w/ RVA translation)
add eax, ebx
ret
next_name:
pop esi ; restore the name pointer
add edx, 4 ; advance to next list item
jmp name_loop
find_import_fail1:
pop eax ; we still had one dword on the stack
find_import_fail:
xor eax, eax
ret
call_import: ; FUNCTION that finds procedure [esi] in library at base [ebx] and calls it
call find_import
or eax, eax ; found?
jz critical_error ; if not, we're screwed
jmp eax ; but if so, call the function
; back to the main program ...

kernel32_found:
; we found kernel32 (ebx) and LoadLibraryA (eax), so we can load user32.dll
mov [kernel32base], ebx ; store kernel32's base address
push N_user32
call eax ; call LoadLibraryA
or eax, eax ; check the result
jz error2
mov [user32base], eax ; store user32's base address
; if (!OpenClipboard(NULL)) ExitProcess(1);
push 0
mov ebx, eax ; user32 base address was still in eax
mov esi, N_OpenClipboard
call call_import
or eax, eax
jz error2
; HANDLE hData = GetClipboardData(CF_TEXT); if (!hData) fail;
push 1 ; CF_TEXT
; mov ebx, [user32base]
mov esi, N_GetClipboardData
call call_import
or eax, eax
jz error
push eax ; save hData for GlobalUnlock at the end
; char* str = GlobalLock(hData); if (!str) fail;
push eax
mov ebx, [kernel32base]
mov esi, N_GlobalLock
call call_import
or eax, eax
jz error
; strlen(str)
mov ecx, eax
strlen_loop:
mov dl, [ecx]
or dl, dl
jz strlen_end
inc ecx
jmp strlen_loop
strlen_end:
sub ecx, eax
; WriteFile(GetStdHandle(STD_OUTPUT_HANDLE), ...)
push 0 ; lpOverlapped = NULL
lea edx, [DummyVar] ; lpBytesWritten
push edx
push ecx ; nNumberOfBytesToWrite = strlen(str)
push eax ; lpBuffer = str
push -11 ; hFile = ...
; mov ebx, [kernel32base]
mov esi, N_GetStdHandle
call call_import ; ... GetStdHandle(STD_OUTPUT_HANDLE)
push eax
mov esi, N_WriteFile
call call_import
; GlobalUnlock(hData); CloseClipboard(); ExitProcess(0);
mov esi, N_GlobalUnlock
call call_import ; hData is already on the stack
mov ebx, [user32base]
mov esi, N_CloseClipboard
call call_import
push 0
jmp exit
error:
mov ebx, [user32base]
mov esi, N_CloseClipboard
call call_import
error2:
push 1
exit:
mov ebx, [kernel32base]
mov esi, N_ExitProcess
jmp call_import
critical_error:
ret
N_user32: db "user32.dll",0
N_LoadLibrary: db "LoadLibraryA", 0
N_OpenClipboard: db "OpenClipboard",0
N_GetClipboardData: db "GetClipboardData",0
N_GlobalLock: db "GlobalLock",0
N_GetStdHandle: db "GetStdHandle",0
N_WriteFile: db "WriteFile",0
N_GlobalUnlock: db "GlobalUnlock",0
N_CloseClipboard: db "CloseClipboard",0
N_ExitProcess: db "ExitProcess",0
the_end:
That's quite a lot of work, but at least we can save
another 25% and get down to 768 bytes. This comes at the
expense of runtime performance, though, because our
homegrown GetProcAddress implementation is not nearly as
efficient as Windows' original one: We simply scan all
function names (of which there are over 1600 in
kernel32.dll), while the proper loader uses binary search
to speed things up. But we're talking about a few hundred
microseconds here; loading and running an executable at all
takes an order of magnitude more time than that.
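For comparison, here is roughly what a binary search over a sorted name list looks like in C. It is a sketch with a hypothetical name (find_name_sorted), working on a plain array of C strings rather than on a real export table, and it assumes the names are sorted the way strcmp orders them. For about 1600 names, it needs at most 11 string comparisons instead of up to 1600.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of `wanted` in the sorted array `names`,
 * or -1 if it is not present. */
long find_name_sorted(const char *const *names, size_t count,
                      const char *wanted)
{
    assert(names != NULL && wanted != NULL);
    size_t lo = 0, hi = count;            /* search window is [lo, hi) */
    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        int cmp = strcmp(names[mid], wanted);
        if (cmp == 0)
            return (long)mid;
        if (cmp < 0)
            lo = mid + 1;                 /* wanted sorts after names[mid]  */
        else
            hi = mid;                     /* wanted sorts before names[mid] */
    }
    return -1;
}
```

The price is code size: the halving logic takes more instructions than the simple decrement-and-compare loop, which is exactly what we are trying to avoid here.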
Import by hash
Of the 768 bytes in the sectionless version, 118 bytes
(15%!) are spent on function names. That seems a little
excessive, doesn't it? After all, we're not really
interested in the names themselves; we just use them to
find the functions' addresses. As a first try, we could
limit the length of the stored strings by only comparing
the first, say, 7 characters. We wouldn't be able to tell
LoadLibraryA from its Unicode cousin LoadLibraryW this way,
but since the names are guaranteed to be alphabetically
sorted in export tables, we would hit LoadLibraryA first
anyway. However, we can't use fewer than 7 significant
bytes, because otherwise e.g. GlobalLock would be too
unspecific and we would get GlobalAddAtomA instead.

But 7 bytes per import is still quite some data, and the
whole approach is a forward-compatibility time bomb,
because future versions of Windows could add new functions
to our two DLLs with catastrophic effect. So, truncating
names is not the best path to follow. However, there's a
much more powerful alternative: Hashing! As said, we're not
interested in the names, not even in parts of them. A
machine-readable mapping that can uniquely identify the
proper function name without actually knowing it is
sufficient; bonus points if it's easy to compute. (For our
purposes, we don't need a cryptographically strong hash or
anything fancy, we just want to tell a few function names
apart!)
Long story short, such mappings exist. In our example,
we'll use a simple "rotate-and-xor" hash. The algorithm
uses a 32-bit accumulator register. For each character of
the function name, two operations are performed (in any
order): The character's ASCII code is XORed into the
register (addition would be possible as well), and the
register is rotated by a fixed (and ideally prime) number
of bits. This can be computed in two x86 instructions per
character, and is able to map all names of the two DLLs in
question (and also various others I tested with) into
32-bit hashes without any collisions. Another nice property
is that the hash can be computed in reverse: We can store
the start value of the accumulator, and a match is detected
when, after processing all characters of a function name,
the accumulator becomes zero. (We could live without that,
but it simplifies the implementation a tiny bit.)
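The hash constants have to be computed ahead of time, outside of the executable. The following C sketch shows one way to do that, assuming the variant used in this article's assembly code: XOR first, then rotate left by 7 bits, with the terminating zero byte included. The function names (hash_name, hash_consume) are made up for this illustration. hash_name runs the algorithm backwards from zero to obtain the start value, and hash_consume mirrors what the lookup loop does at runtime.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

static uint32_t rol32(uint32_t v, unsigned n) { return (v << n) | (v >> (32 - n)); }
static uint32_t ror32(uint32_t v, unsigned n) { return (v >> n) | (v << (32 - n)); }

/* Runtime side: feed every byte of `name` (including the final NUL)
 * into the accumulator. A result of 0 means "this is the name". */
uint32_t hash_consume(uint32_t acc, const char *name)
{
    assert(name != NULL);
    const unsigned char *p = (const unsigned char *)name;
    unsigned char c;
    do {
        c = *p++;
        acc = rol32(acc ^ c, 7);
    } while (c != 0);
    return acc;
}

/* Build-time side: start from 0 and undo the steps from the last byte
 * (the NUL) to the first, which yields the start value to store. */
uint32_t hash_name(const char *name)
{
    assert(name != NULL);
    uint32_t acc = 0;
    for (size_t i = strlen(name) + 1; i-- > 0; )
        acc = ror32(acc, 7) ^ (unsigned char)name[i];
    return acc;
}
```

For example, hash_name("LoadLibraryA") evaluates to 0x01364564, and feeding that value and the same name into hash_consume brings the accumulator back to zero.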
This modification can be applied to the existing
implementation quite easily, so here's again just a diff:
@@ -95,5 +89,5 @@
- mov esi, N_LoadLibrary
+ mov esi, 0x01364564 ; hash of "LoadLibraryA"
call find_import ; load LoadLibrary from there (if present)
@@ -123,15 +117,16 @@
cmp_loop:
- lodsb ; load a byte of the two strings into AL, AH
- mov ah, [edi] ; and increase the pointers
- inc edi
- cmp al, ah ; identical bytes?
- jne next_name ; if not, this is not the correct name
- or al, al ; zero byte reached?
- jnz cmp_loop ; if not, we need to compare more
+ movzx eax, byte [edi] ; load a byte of the name ...
+ inc edi ; ... and advance the pointer
+ xor esi, eax ; apply xor-and-rotate
+ rol esi, 7
+ or eax, eax ; last byte?
+ jnz cmp_loop ; if not, process another byte
+ or esi, esi ; result hash match?
+ jnz next_name ; if not, this is not the correct name
@@ -180,5 +175,5 @@
push 0
- mov esi, N_OpenClipboard
+ mov esi, 0xFC7956AD ; hash of "OpenClipboard"
call call_import
@@ -188,5 +183,5 @@
push 1 ; CF_TEXT
- mov esi, N_GetClipboardData
+ mov esi, 0x0C473D74 ; hash of "GetClipboardData"
call call_import
or eax, eax
@@ -197,5 +192,5 @@
- mov esi, N_GlobalLock
+ mov esi, 0x4A88F58C ; hash of "GlobalLock"
call call_import
@@ -221,18 +216,18 @@
- mov esi, N_GetStdHandle
+ mov esi, 0xEACA71C2 ; hash of "GetStdHandle"
 call call_import ; ... GetStdHandle(STD_OUTPUT_HANDLE)
push eax
- mov esi, N_WriteFile
+ mov esi, 0x3FD1C30F ; hash of "WriteFile"
call call_import

- mov esi, N_GlobalUnlock
+ mov esi, 0xC3907A85 ; hash of "GlobalUnlock"
call call_import ; hData is already on the stack

- mov esi, N_CloseClipboard
+ mov esi, 0x1D84425E ; hash of "CloseClipboard"
call call_import
@@ -242,5 +237,5 @@
error:
- mov esi, N_CloseClipboard
+ mov esi, 0x1D84425E ; hash of "CloseClipboard"
call call_import
@@ -248,5 +243,5 @@
exit:
- mov esi, N_ExitProcess
+ mov esi, 0x665640AC ; hash of "ExitProcess"
jmp call_import
critical_error:
@@ -254,13 +249,4 @@
-N_LoadLibrary: db "LoadLibraryA", 0
-N_OpenClipboard: db "OpenClipboard",0
-N_GetClipboardData: db "GetClipboardData",0
-N_GlobalLock: db "GlobalLock",0
-N_GetStdHandle: db "GetStdHandle",0
-N_WriteFile: db "WriteFile",0
-N_GlobalUnlock: db "GlobalUnlock",0
-N_CloseClipboard: db "CloseClipboard",0
-N_ExitProcess: db "ExitProcess",0
The result is 656 bytes, 112 bytes less than the version
without import-by-hash. It's not quite the optimal amount
of savings (which would be 118 bytes, the size of the name
strings) because the comparison grew a little bit, but
it's still quite an impressive result.
Header trickery
Before our short excursion into the land of hashes, we
worked hard on bypassing the alignment limits, but there's
still a lot of space spent in the PE headers. One trivial
thing is to remove the data directory, as we don't even
have table-based imports any more. But that's not all:
Fortunately, there are many fields in the headers that
aren't evaluated by the Windows loader, and we can put
other stuff there. The largest part of this is the 64-byte
DOS header at the beginning, of which only the first two
bytes (the "MZ" signature) and the last four bytes (the
offset of the PE header) are important. We can actually
move ("collapse") the PE header inside the DOS header, all
the way down to offset 4 (which is the minimum alignment
requirement). In this case, the PE header location field of
the DOS header coincides with the section alignment field
of the PE header, so we get a section (and file) alignment
of 4: perfect!
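The coincidence is easy to check with a bit of offset arithmetic. The sketch below uses the field offsets of the PE32 layout (4-byte signature, 20-byte COFF file header, SectionAlignment at offset 32 of the optional header); the constant and function names are made up for this illustration.

```python
E_LFANEW_OFFSET = 0x3C         # last dword of the 64-byte DOS header
PE_HEADER_OFFSET = 4           # where the collapsed PE header starts
PE_SIGNATURE_SIZE = 4          # "PE\0\0"
COFF_HEADER_SIZE = 20          # machine, section count, timestamp, ...
SECTION_ALIGNMENT_IN_OPT = 32  # offset of SectionAlignment in the optional header


def section_alignment_file_offset(pe_offset):
    """File offset of the SectionAlignment field for a given PE header offset."""
    optional_header = pe_offset + PE_SIGNATURE_SIZE + COFF_HEADER_SIZE
    return optional_header + SECTION_ALIGNMENT_IN_OPT


if __name__ == "__main__":
    offset = section_alignment_file_offset(PE_HEADER_OFFSET)
    print(hex(offset), offset == E_LFANEW_OFFSET)  # prints 0x3c True
```

With the PE header at offset 4, the same dword is read once as "PE header is at offset 4" and once as "section alignment is 4".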
Runs of other unused fields in the header can be used to
hold the last remaining string ("user32.dll") and even
code. The latter is a bit complicated, because the code
sequence must fit into the slot of unused fields, and if
you're unlucky, it might grow when moving into the header:
a jump that used to fit into the two-byte short form needs
the longer near form if the distance between jump site and
target has become too large. I didn't manage to fit a lot
of code into the headers, but at least there's something.
The offset corresponds to the MSB of the resource table's
size entry in the data directory. The solution for this is
rather surprising: The SizeOfHeaders
The following dump is what the headers now look like. The
main part is the same, except that the blocks that have
been moved into the headers (N_user32, next_name and parts of
main) are now obviously gone:
mz_hdr:
dw "MZ" ; DOS magic
dw "kj" ; filler to align the PE header
pe_hdr:
dw 0 ; no sections
N_user32: db "user32.dll",0,0 ; 12 bytes of data collapsed into the header
;dd 0 ; [UNUSED-12] timestamp
;dd 0 ; [UNUSED] symbol table pointer
;dd 0 ; [UNUSED] symbol count
executable
opt_hdr:
main_part_1: ; 12 bytes of main entry point + 2 bytes of jump
jmp main_part_2
align 4, db 0
;db 13,37 ; [UNUSED-14] linker version
;dd RVA(the_end) ; [UNUSED] code size
;dd RVA(the_end) ; [UNUSED] size of initialized data
;dd 0 ; [UNUSED] size of uninitialized data
dd RVA(main_part_1) ; entry point address
main_part_2: ; another 6 bytes of code + 2 bytes of jump
push ebp
sub esp, 12
mov eax, [eax] ; go to where ntdll.dll typically is
jmp main_part_3
align 4, db 0
;dd RVA(main) ; [UNUSED-8] base of code
;dd RVA(main) ; [UNUSED] base of data
dd SECTALIGN ; section alignment (collapsed with the
; PE header offset in the DOS header)
next_name: ; we interrupt again for a few bytes of code from the loader
jmp name_loop
align 4, db 0
;dw 4,0 ; [UNUSED-8] OS version
;dw 0,0 ; [UNUSED] image version
dd 0 ; [UNUSED-4] Win32 version
dd RVA(opt_hdr) ; size of headers (must be small enough
; so that entry point inside header is accepted)
dd 0 ; [UNUSED-4] checksum
dw 0 ; [UNUSED-6] DLL characteristics
dd 0x00100000 ; maximum stack size
dd 0x00001000 ; initial heap size
dd 0 ; [UNUSED-4] loader flags
(= none!)
;;;;;;;;;;;;;;;;;;;; .text ;;;;;;;;;;;;;;;;;
main_part_3:
mov eax, [eax] ; go to where kernel32.dll typically is
try_next_lib:
; (from here on, not much has changed)
With this, we're at 436 bytes, a whopping 33% less than
before! The downside is that the header declarations in the
source code have become quite unreadable by now, and that
we're no longer forward compatible: A future version of
Windows might decide that the OS version listed in the EXE
file is now totally relevant and may thus not want to
execute files made for version 33630.1068.
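An odd version number like that appears because whatever bytes sit in the version slot are read as two little-endian 16-bit words. A small sketch, assuming the four bytes 5E 83 2C 04 (chosen here only because they decode to the number quoted above; they are not taken from the actual executable):

```python
import struct


def decode_version(raw):
    """Interpret 4 header bytes as (major, minor) little-endian words."""
    major, minor = struct.unpack("<HH", raw)
    return major, minor


if __name__ == "__main__":
    # Assumed example bytes; any code placed in the slot is read the same way.
    major, minor = decode_version(bytes([0x5E, 0x83, 0x2C, 0x04]))
    print(f"{major}.{minor}")  # prints 33630.1068
```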
Unsafe optimizations
All along the way, we were cautious not to remove any
checks and clean exits in case of failure. But we're
already relying on a few details of the PE loader that are
unlikely to change soon, but are not carved in stone
either. So why not go full YOLO and strip off all the
safety nets? We could assume that
- kernel32.dll always is the third image loaded (after
  our own executable and ntdll.dll).
- the kernel32.dll image is a proper PE image with all
  headers and data directory entries in their usual places.
- all imported functions actually exist.
- uninitialization (GlobalUnlock, CloseClipboard) is not
  necessary, because the system cleans up our mess
  anyway when the process exits.
- GlobalLock is a no-operation that can be omitted
  completely, because the HGLOBAL that is returned by
  GetClipboardData is already a bona fide pointer.
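The first assumption is what lets the entry code reach kernel32.dll with two plain mov eax, [eax] instructions: each one follows the forward link of the loader's module list once. A toy model in Python, assuming the usual load order (the class and helper names are invented for illustration, and a real list entry has many more fields):

```python
class ModuleEntry:
    """One node of a simplified, circular module list."""

    def __init__(self, name):
        self.name = name
        self.flink = self  # forward link, patched by build_load_order()


def build_load_order(names):
    """Link the entries in load order and return the first one."""
    entries = [ModuleEntry(n) for n in names]
    for current, following in zip(entries, entries[1:] + entries[:1]):
        current.flink = following
    return entries[0]


def third_image(first):
    """Follow the forward link twice, like two 'mov eax, [eax]'."""
    return first.flink.flink


if __name__ == "__main__":
    head = build_load_order(["getclip.exe", "ntdll.dll", "kernel32.dll"])
    print(third_image(head).name)  # prints kernel32.dll
```

If anything else were loaded in between, the same two steps would land on the wrong image, which is exactly the risk accepted here.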
This allows us to rip out a good chunk of code. For
example, we don't need to separate find_import and
call_import any longer, because we'll no longer check
whether a function exists; if we want to look up a
function, we're always going to call it as well.
Furthermore, the order of the loader and main code has been
shuffled around a bit to make jumps as short as possible,
and the code snippets used to fill the unused header fields
are slightly different ones:
bits 32
BASE equ 0x00400000
ALIGNMENT equ 4
SECTALIGN equ 4

org BASE
mz_hdr:
dw "MZ" ; DOS magic
dw "kj" ; filler to align the PE header
pe_hdr:
dw 0 ; no sections
N_user32: db "user32.dll",0,0 ; 12 bytes of data collapsed into the header
;dd 0 ; [UNUSED-12] timestamp
;dd 0 ; [UNUSED] symbol table pointer
;dd 0 ; [UNUSED] symbol count
executable
opt_hdr:
main_part_1: ; 12 bytes of main entry point + 2 bytes of jump
jmp main_part_2
align 4, db 0
;db 13,37 ; [UNUSED-14] linker version
;dd RVA(the_end) ; [UNUSED] code size
;dd RVA(the_end) ; [UNUSED] size of initialized data
;dd 0 ; [UNUSED] size of uninitialized data
dd RVA(main_part_1) ; entry point address
push ebp
sub esp, 12
mov eax, [eax] ; go to where ntdll.dll typically is
jmp main_part_3
align 4, db 0
;dd RVA(main) ; [UNUSED-8] base of code
;dd RVA(main) ; [UNUSED] base of data
dd SECTALIGN ; section alignment (collapsed with the
; PE header offset in the DOS header)
mov eax, [eax] ; go to where kernel32.dll typically is
jmp main_part_4
align 4, db 0
;dw 4,0 ; [UNUSED-8] OS version
;dw 0,0 ; [UNUSED] image version
dd 0 ; [UNUSED-4] Win32 version
dd RVA(opt_hdr) ; size of headers (must be small enough
; so that entry point inside header is accepted)
dd 0 ; [UNUSED-4] checksum
dw 0 ; [UNUSED-2] DLL characteristics
dd 0x00100000 ; maximum stack size
dd 0x00001000 ; initial heap size
dd 0 ; [UNUSED-4] loader flags
(= none!)
main_part_4:
mov [kernel32base], ebx ; store kernel32's base address
mov esi, 0x01364564 ; hash of "LoadLibraryA"
push N_user32 ; we want to load user32.dll
call call_import ; call LoadLibraryA
mov [user32base], eax ; store user32's base address
push 0
mov esi, 0xFC7956AD ; hash of "OpenClipboard"
call call_import
or eax, eax
jz error

push 1 ; CF_TEXT
mov esi, 0x0C473D74 ; hash of "GetClipboardData"
call call_import
or eax, eax
jz error
; strlen(str)
mov ecx, eax
strlen_loop:
mov dl, [ecx]
or dl, dl
jz strlen_end
inc ecx
jmp strlen_loop
strlen_end:
sub ecx, eax
lea edx, [DummyVar] ; lpBytesWritten
push edx
mov esi, 0xEACA71C2 ; hash of "GetStdHandle"
push eax
mov esi, 0x3FD1C30F ; hash of "WriteFile"
call call_import
; ExitProcess(0);
push 0
jmp exit
error:
push 1
exit:
mov esi, 0x665640AC ; hash of "ExitProcess"
; fall-through into call_import
call_import: ; FUNCTION that calls procedure [esi] in library at base [ebx]
mov edx, [ebx+0x3c] ; get PE header pointer (w/ RVA translation)
add edx, ebx
mov edx, [edx+0x78] ; get export table pointer RVA (w/ RVA translation)
add edx, ebx
push edx ; store the export table address for later
mov ecx, [edx+0x18] ; ecx = number of named functions
mov edx, [edx+0x20] ; edx = address-of-names list (w/ RVA translation)
add edx, ebx
name_loop:
push esi ; store the desired function name's hash (we will clobber it)
mov edi, [edx] ; load function name (w/ RVA translation)
add edi, ebx
cmp_loop:
movzx eax, byte [edi] ; load a byte of the name ...
inc edi ; ... and advance the pointer
xor esi, eax ; apply xor-and-rotate
rol esi, 7
or eax, eax ; last byte?
jnz cmp_loop ; if not, process another byte
or esi, esi ; result hash match?
jnz next_name ; if not, this is not the correct name
pop esi ; restore the name pointer (though we don't use it any longer)
pop edx ; restore the export table address
sub ecx, [edx+0x18] ; turn the negative counter ECX into a positive one
neg ecx
mov eax, [edx+0x24] ; get address of ordinal table (w/ RVA translation)
add eax, ebx
movzx ecx, word [eax+ecx*2] ; load ordinal from table
;sub ecx, [edx+0x10] ; subtract ordinal base
mov eax, [edx+0x1C] ; get address of function address table (w/ RVA translation)
add eax, ebx
mov eax, [eax+ecx*4] ; load function address (w/ RVA translation)
add eax, ebx
jmp eax ; jump to the target function
next_name:
dec ecx ; decrease counter
jmp name_loop
the_end:
The final result with this is 316 bytes, another 27% less
than before!
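The percentages quoted for the last steps follow directly from the file sizes. A quick check in Python, using only the sizes mentioned in this part of the article (768 bytes is derived from 656 being "112 bytes less" than the version without import-by-hash; the step labels are informal):

```python
def reduction_percent(before, after):
    """Size reduction from 'before' to 'after' in percent."""
    return (before - after) * 100.0 / before


# (label, size in bytes); the first size is 656 + 112
STEPS = [
    ("without import-by-hash", 768),
    ("import-by-hash", 656),
    ("header trickery", 436),
    ("unsafe optimizations", 316),
]

if __name__ == "__main__":
    for (_, before), (label, after) in zip(STEPS, STEPS[1:]):
        saved = reduction_percent(before, after)
        print(f"{label}: {after} bytes, {saved:.1f}% smaller")
```

The quoted 33% and 27% are these values rounded down.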
Conclusion
This concludes our journey into size optimization. At this
point, we're 240 times smaller than the naïve first C
implementation, and even if we consider our first serious
optimization step (the C implementation without C library)
as the starting point, we're still almost 10 times smaller.
But admittedly, the amount of effort necessary for this is
extremely high and hardly justified ;)
You can download all the source files of this little
experiment if you're interested.
I'm not going to claim that my implementation is the
smallest possible, most efficient or best-on-any-other-axis
one. I'm not a seasoned sizecoder at that low level
(usually I stop at the "get rid of the C library" step).
What also concerns me is that I had to implement the export
table parser differently from all documentation I could
find on the subject (including Microsoft's official PE
specification) by not subtracting the base ordinal from the
value in the name ordinal table to get the function address
table index. So if you have any explanations or improvement
ideas, let me know.
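One likely explanation for the ordinal question: current revisions of Microsoft's PE format documentation describe the entries of the name ordinal table as unbiased (zero-based) indices into the export address table, while earlier revisions appear to have shown a subtraction at that step. The ordinal base would then only matter when a function is requested by its ordinal number. A sketch of both lookups on a made-up export table (all names, ordinals and addresses below are invented):

```python
class ExportTable:
    """Simplified model of a PE export directory (made-up data)."""

    def __init__(self, ordinal_base, addresses, names, name_ordinals):
        self.ordinal_base = ordinal_base    # first exported ordinal number
        self.addresses = addresses          # export address table
        self.names = names                  # name pointer table
        self.name_ordinals = name_ordinals  # parallel to names, unbiased

    def by_name(self, name):
        """Name -> address: the index from the ordinal table is used as-is."""
        position = self.names.index(name)
        return self.addresses[self.name_ordinals[position]]

    def by_ordinal(self, ordinal):
        """Ordinal number -> address: here the base is subtracted."""
        return self.addresses[ordinal - self.ordinal_base]


if __name__ == "__main__":
    table = ExportTable(
        ordinal_base=5,
        addresses=[0x1000, 0x1040, 0x1080],
        names=["Alpha", "Beta"],
        name_ordinals=[2, 0],
    )
    print(hex(table.by_name("Alpha")))  # prints 0x1080
    print(hex(table.by_ordinal(7)))     # prints 0x1080, the same function
```

Under this reading, subtracting the base from a name ordinal table entry would select the wrong slot of the address table, which matches the commented-out sub ecx, [edx+0x10] in the listing.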
Update (2017-09-09): As a commenter pointed out, some of
the executables didn't run on Windows 7 x64. I figured out
what the issue was and updated the post and the download
file accordingly; see the last paragraph before the code
sample in the "going sectionless" chapter for details.
