1-C
The first artifact we are going to discuss is the Windows registry.

The Windows registry is essentially a database that was originally meant for storing configuration information.
However, programs can store arbitrary data in the registry, in essence, using it for whatever purpose they see fit.
Some malware even hides part of itself inside the registry.

The registry contains a wealth of information, as it is used by almost every aspect of the operating system.

The registry itself is a hierarchical (tree-like) structure, similar to a file system. The registry is divided into
keys and subkeys (similar to directories and subdirectories). Each key can have zero or more values (similar to
how a directory can have zero or more files). Each value (like each file) has content. At a very high level, keys
and subkeys are grouped together into parts of the registry called hives. Each hive can have its own file on
disk, although some hives are virtual and exist only at run time.

The registry that contains most of the configuration information for the system is found in the
%SystemRoot%\System32\Config directory, and is represented as a series of files.

Each user also has their own portion of the registry, found under the user’s profile directory in a file called
NTUSER.dat. The format for registry files is the same regardless of whether they are system or user registries.

2-C
This screenshot shows the relations of hives, keys, subkeys, and values.

On the left side we can see several hives, such as HKEY_CLASSES_ROOT, HKEY_CURRENT_USER, and
HKEY_LOCAL_MACHINE.

Inside each hive are keys and subkeys, such as HARDWARE, DEVICEMAP, and VIDEO.

Each key (or subkey) has zero or more values, which appear on the right side of the screen. Notice that each
value has a name, type, and data. This is similar to how files have names, types, and content.

3-C
This slide describes the general layout for how a registry is stored on disk. While the official format has never
been released by Microsoft, there has been significant reverse engineering work, to the point where most of the
on-disk structures are understood.

The first concept for reading registry hive files is that the hive is broken into 4096-byte chunks called blocks.
If new space is needed in the hive, an entire block is allocated. This is similar to clusters on a file system.

The first block of the hive is known as the base block and holds various global information for the
hive. To keep track of keys, values, subkeys, etc. the registry uses a container called a cell. Each cell contains
one piece of information (e.g. either a key, a value, a subkey list, etc.). A field at the beginning of the cell
describes the type of information inside the cell. The registry uses the term “cell index” to describe the offset
of a particular cell inside the hive file. The cell index is relative to the first bin (which starts directly after the
base block). This means when translating cell indices to file offsets, you will need to add 0x1000 (4096).
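
The translation described above can be sketched in a few lines of Python. This is only an illustration; the function names are our own and are not part of any registry tool.

```python
BASE_BLOCK_SIZE = 0x1000  # the base block occupies the first 4096 bytes


def cell_index_to_file_offset(cell_index: int) -> int:
    """Cell indices are relative to the first bin, which follows the base block."""
    return cell_index + BASE_BLOCK_SIZE


def file_offset_to_cell_index(file_offset: int) -> int:
    """Inverse mapping; offsets inside the base block have no cell index."""
    if file_offset < BASE_BLOCK_SIZE:
        raise ValueError("offset lies inside the base block")
    return file_offset - BASE_BLOCK_SIZE


print(hex(cell_index_to_file_offset(0x20)))  # prints 0x1020
```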

For efficiency purposes, if a hive has to expand to accommodate a new cell, it allocates a unit called a bin. A
bin is a container that can hold zero or more cells, and is always a multiple of the block size.

At the top of the hive is the root key (cell). The registry keeps a list of the subkeys for a particular key in a list
called a SubKey list. Likewise, the registry keeps track of the values for a particular key in a list called a
Value List.

A SubKey list is a list of cell indices (offsets into the hive file) of the various subkey cells. Similarly, a Value
List is a list of cell indices of the various values for a key.

4-C
We will now examine the structures of keys, subkeys, and values as they are represented in hive files. The arrow
in the regedit window shows the current location of the data structure we are analyzing.

The first cell we will examine is a key cell. The first four bytes describe the size of the cell as a negative number
(in two’s complement). The next two bytes are a signature, which tells us the type of cell we are examining.
Following that are two bytes of flags that describe some additional details about the key.

After the flags field is an 8-byte field that describes the last time the key (or one of its values) was written to.
Four bytes later is a 4-byte field that describes the cell index of the parent key.

The next field is a 4-byte field that is the number of subkeys in the current key. Four bytes later is the index to the
cell that contains the SubKey list. Four bytes later is the number of values in the key, followed by a 4-byte field
that is the cell index of the Value List.

The size of the name of the key is at 0x4C (76) bytes from the start of the cell. Two bytes later the name starts.
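
As a rough sketch, the key cell fields above could be read with Python's struct module as shown below. The offsets come from reading "N bytes later" as a gap of N bytes after the previous field, which is consistent with the name size sitting at 0x4C. Treating the 8-byte timestamp as a FILETIME and decoding the name as single-byte text are assumptions, and the function names are hypothetical.

```python
import struct
from datetime import datetime, timedelta, timezone


def filetime_to_datetime(filetime: int) -> datetime:
    """FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_key_cell(cell: bytes) -> dict:
    """Parse a key cell; every offset is measured from the start of the cell."""
    size, signature, flags, last_written = struct.unpack_from("<i2sHQ", cell, 0)
    (parent_index,) = struct.unpack_from("<I", cell, 0x14)
    (subkey_count,) = struct.unpack_from("<I", cell, 0x18)
    (subkey_list_index,) = struct.unpack_from("<I", cell, 0x20)
    value_count, value_list_index = struct.unpack_from("<II", cell, 0x28)
    (name_length,) = struct.unpack_from("<H", cell, 0x4C)
    raw_name = cell[0x50:0x50 + name_length]
    return {
        "cell_size": -size,  # the size is stored as a negative number
        "signature": signature,
        "flags": flags,
        "last_written": filetime_to_datetime(last_written),  # assumed FILETIME
        "parent_index": parent_index,
        "subkey_count": subkey_count,
        "subkey_list_index": subkey_list_index,
        "value_count": value_count,
        "value_list_index": value_list_index,
        "name": raw_name.decode("latin-1"),  # assumption: single-byte name
    }
```

Remember that the parent, SubKey list, and Value List fields are cell indices, so 0x1000 must be added before seeking to them in the hive file.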

5-C
This slide shows the SubKey list and subkeys for the key on the previous slide.

At offset 0x6DE98 is the start of the SubKey list. The first four bytes are the size of the cell as a negative
number (in two’s complement notation). The next two bytes are the signature of the cell. Since this is a SubKey
list cell, the next two bytes tell how many entries are in this list.

Following the header is a list of entries, where each entry is composed of two 4-byte fields. The first field is a
cell index of the key cell that describes the subkey. The second field is a hash of the name of the key.
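
A minimal sketch of walking a SubKey list cell laid out as described above follows. The function name is hypothetical, and the code does not interpret the hash, since the notes do not describe the hash algorithm.

```python
import struct


def parse_subkey_list(cell: bytes) -> list:
    """Return (key_cell_index, name_hash) pairs from a SubKey list cell."""
    size, signature, entry_count = struct.unpack_from("<i2sH", cell, 0)
    entries = []
    for i in range(entry_count):
        # Entries start right after the 8-byte header; each is two 4-byte fields.
        key_cell_index, name_hash = struct.unpack_from("<II", cell, 8 + i * 8)
        entries.append((key_cell_index, name_hash))
    return entries
```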

The first (and only) subkey is at offset 0x06DE40. Here we see the standard cell header: a 4-byte size field
followed by a signature. In this case, it is a cell that describes a key, so it has the flags, modification time, parent,
subkey count, subkey list cell index, value count, and value list index fields that we saw on the previous slide.

6-C
This slide shows the binary layout for the value list and value cells for the key described two slides ago.

The Value List cell starts at offset 0x60B70. The list is composed of a 4-byte size value, followed by one or
more 4-byte entries listing the cell indices of the value cells that belong to the key.

The Value cell has the same first six bytes as the previous SubKey and Key cells (size as a negative number
followed by the signature). The third field is two bytes and is the size of the name of the value. The next field is
four bytes long and describes the size of the data associated with the value. The following field is also four
bytes long, and is the cell index of the cell that contains the data associated with the value.

The next field is four bytes long and describes the type of data associated with the value. The following two
bytes are a flags field: a value of 0 means the name of the value is in 16-bit little endian Unicode; otherwise
the name is in ASCII. Two bytes later is the name of the value.

The cell that contains the data for the value has a four byte size, followed by the data.
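
The value cell layout can be sketched the same way. The offsets below follow the field order given above (with the name starting at byte 24, i.e. two bytes after the flags field), and the function name is hypothetical.

```python
import struct


def parse_value_cell(cell: bytes) -> dict:
    """Parse a value cell; offsets are measured from the start of the cell."""
    (size, signature, name_length, data_size,
     data_cell_index, data_type, flags) = struct.unpack_from("<i2sHIIIH", cell, 0)
    raw_name = cell[24:24 + name_length]
    if flags == 0:
        name = raw_name.decode("utf-16-le")  # flags of 0: 16-bit Unicode name
    else:
        name = raw_name.decode("ascii", errors="replace")
    return {
        "cell_size": -size,  # the size is stored as a negative number
        "signature": signature,
        "name": name,
        "data_size": data_size,
        "data_cell_index": data_cell_index,  # add 0x1000 for the file offset
        "data_type": data_type,
    }
```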

7-C
Here is a list of useful registry keys.

The list of commands a user types into the Start Menu -> Run box can be found under the RunMRU key.

To see the URLs typed into the address bar of Internet Explorer, go to the TypedURLs key.

The user assist keys keep track of programs run from the Windows shell, including the number of times they have
been run, and the last time a program was run. The entries are ROT-13 “encrypted”. ROT-13 is a 13-character
shift, wrapping around when the letter ‘z’ is hit.
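
Decoding ROT-13 needs no special tooling. As a small sketch, Python's standard rot_13 codec can be used; the value name below is a made-up example for illustration, not taken from a real hive.

```python
import codecs


def decode_userassist_name(value_name: str) -> str:
    """ROT-13 is its own inverse, so decoding and encoding are the same step."""
    return codecs.decode(value_name, "rot_13")


# Hypothetical ROT-13 encoded value name.
print(decode_userassist_name("HRZR_EHACNGU:P:\\Jvaqbjf\\abgrcnq.rkr"))
# prints UEME_RUNPATH:C:\Windows\notepad.exe
```

Only the letters A-Z and a-z are shifted; digits, punctuation, and path separators pass through unchanged.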

The slide above lists a number of locations to look for programs that run at start up. These are useful keys to
examine, especially in scenarios dealing with malware.

8-C
One tool to help analyze the registry is called RegRipper.

RegRipper is a Perl script developed by Harlan Carvey. RegRipper is designed to extract useful information
from hive files in an automated fashion. It is plug-in based, where each plug-in extracts various types of useful
information from a hive file. The actual plugins that run are dependent on the type of hive file you are
examining. Since RegRipper is plugin based and open source, you can develop your own plugins.

To use RegRipper, use the “-r” option to specify the path to the hive file, and the “-f” option to specify the type
of hive file.

Even though RegRipper analyzes Windows artifacts, it runs on Windows, Linux, and Mac OS X.

9-C
The Windows shell allows a user to have shortcut (a.k.a. link or lnk) files. A shortcut file is a file that has
information used to access another file (or shell object). It is a form of a pointer.

There is a great deal of useful information inside lnk files. Some of the more useful information
includes the type of drive the target file/object is on (e.g. a fixed drive, removable, remote, etc.). Also included
is the full path of the target file, including drive letters, volume labels, and serial numbers for locally attached
volumes. For remote volumes (shares), the full server share path and a drive letter (if the share was mounted)
are recorded.

Additionally, if the target is in a “special” or “known folder” (e.g. the Printers folder), this information is
stored in the lnk file.

Beyond the information already listed, there is usually at least one copy of the target file’s metadata
(timestamps, size, and attributes) inside an lnk file. If the target file is an executable, any command line
arguments are also stored inside the lnk file.

10 - C
This is the general layout of an lnk file. With the exception of the header, every part of the LNK file is optional.
Apart from the header, there are no fixed offsets for any structure.

The header of the LNK file contains information about the target file (including attributes and timestamps) as
well as describing what other sections of the LNK file exist.

The PIDL section, if it exists, will follow the header. It contains a shell path (a PIDL) to the target file. We will
discuss PIDLs in more detail shortly.

After the PIDL section (if it exists) is the LinkInfo section. This section describes the path (including name)
and volume-specific information for the target file. If the target file is stored on a local drive, the drive type,
serial number, and volume label are stored in an optional structure called the Volume ID. The Volume ID
structure exists inside the LinkInfo section. If the target file resides on a remote disk, information including the
share path, and the device name or letter is stored in an optional structure called the
CommonNetworkRelativeLink (CNRL). If the CNRL exists, it will be inside the LinkInfo section.

After the LinkInfo section is the StringData section, which is optional, and can contain up to five strings. The
strings can describe the name of the target, the relative path to the target (from the directory the lnk file resides
in), the working directory for the target, any command line options (for executable targets) and the location of
an icon associated with the target.

The last section (which is also optional) is the ExtraData section, and it can contain a number of different pieces
of information. Of notable interest is the PropertyStore data block, which can contain arbitrary information, and
the TrackerInformation data block, which can be used to track files across multiple Windows systems (if copied
over a network share).

11 - C
This slide shows the layout of a ShellLinkHeader. This is the header at the start of an lnk file.

The first four bytes describe the size of the header, and per Microsoft must always be the value 0x4C. The
next 16 bytes are a CLSID (Microsoft’s term for a GUID) for lnk files. The next four bytes are flags
describing the various sections that exist in the lnk file.

The next field is a 4-byte field that describes the attributes of the target, followed by the creation time of the
target as an 8-byte FILETIME timestamp. The next two fields are 8 bytes each, and are the access and
modification timestamps for the target (again in FILETIME format).

After the modification time is the size of the target as a 4-byte field, followed by a 4-byte field describing the
location for an icon associated with the lnk file. The show command is a 4-byte field that describes how a
console window should be displayed (if a console window is used). The next two bytes describe the hotkey
combination used to activate the shortcut; the remaining bytes of the header are reserved.
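
The header fields can be read in one pass with Python's struct module. The sketch below assumes the fields are packed back to back in the order given, which places the 2-byte hotkey field at offset 0x40, followed by reserved bytes up to the 0x4C header size. The function and key names are our own.

```python
import struct
from datetime import datetime, timedelta, timezone


def filetime_to_datetime(filetime: int) -> datetime:
    """FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_shell_link_header(data: bytes) -> dict:
    """Parse the fixed-size header found at the start of an lnk file."""
    (header_size, clsid, link_flags, attributes, created, accessed, modified,
     target_size, icon_index, show_command, hotkey) = struct.unpack_from(
        "<I16sIIQQQIIIH", data, 0)
    if header_size != 0x4C:
        raise ValueError("not a shell link header (size field is not 0x4C)")
    return {
        "clsid": clsid,
        "link_flags": link_flags,  # which optional sections are present
        "attributes": attributes,
        "created": filetime_to_datetime(created),
        "accessed": filetime_to_datetime(accessed),
        "modified": filetime_to_datetime(modified),
        "target_size": target_size,
        "icon_index": icon_index,
        "show_command": show_command,
        "hotkey": hotkey,
    }
```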

12 - C
A few slides ago, we mentioned a PIDL. Invariably, as you conduct more computer forensic exams, you will
run into PIDLs. A PIDL is a pointer to an “item id list”. In effect, it is a shell path.

To understand what a shell path is, we need to understand what the Windows shell is. In essence, the Windows
shell is like a virtual file system. Just like a regular file system has folders and files, so does the Windows shell.
However, unlike a regular file system, the Windows shell also has virtual folders, folders that either do not exist,
or have a different layout, on disk. Examples of virtual folders are the Recycle bin (which has a different on-
disk layout than what is presented by the graphical interface) or the printers folder (which doesn’t actually exist
on disk).

So a PIDL is like a file path, but for the Windows shell.

Unlike a regular file path, which is composed of text names, a PIDL is composed of binary segments called shell
item IDs (SHITEMID structures). For the most part, these structures are undocumented by Microsoft.

The first two bytes of a PIDL are the size of the PIDL, followed by a list of SHITEMID structures. The first
two bytes of SHITEMID structures denote the size of the individual SHITEMID structure. The meaning of the
bytes that follow the size field are up to the specific folder or shell extension that the SHITEMID structure
belongs to.
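
Splitting a PIDL into its SHITEMID structures might look like the sketch below. It assumes that each SHITEMID size field counts its own two bytes, that the leading size covers everything after itself, and that a zero size field ends the list; the function name is hypothetical.

```python
import struct


def split_item_id_list(data: bytes) -> list:
    """Return the body (bytes after the size field) of each SHITEMID."""
    (total_size,) = struct.unpack_from("<H", data, 0)
    end = 2 + total_size
    items = []
    pos = 2
    while pos + 2 <= end:
        (item_size,) = struct.unpack_from("<H", data, pos)
        if item_size == 0:  # assumed terminator: an item with size zero
            break
        if item_size < 2 or pos + item_size > end:
            raise ValueError("corrupt SHITEMID size at offset %d" % pos)
        items.append(data[pos + 2:pos + item_size])
        pos += item_size
    return items
```

Interpreting each body is a separate problem, since its meaning depends on the folder or shell extension that created it.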

The reason this information is useful to an examiner is that it often contains an additional set of timestamps
for the files and folders in the path to the target. Beyond this, there is usually a second set of timestamps for the
target (the first set residing in the ShellLinkHeader structure).

13 - C
This slide shows the layout of a SHITEMID structure that describes a file. Even though the format is not
documented, there has been reverse engineering work, and parts of the structure of PIDLs are understood.

Following the size field, is a one byte type field, which describes the type of data found in the rest of the
SHITEMID structure. One byte later is a four byte field that describes the size of the file/folder the
SHITEMID structure describes. The modification time of the file/folder is contained in the next four bytes,
followed by the attributes as a 2-byte field.

After the attributes is the name of the file/folder as a NULL terminated ASCII string. One byte beyond the end
of the ASCII string is the start of the “additional information” structure. Six bytes after the size field is the
creation time of the file/folder as a 4-byte DOS date and timestamp. The next four bytes are the access time of
the corresponding file/folder (again in DOS date/timestamp format).

The two bytes that follow the access timestamp describe the location (relative to the start of the additional
information structure) of the unicode version of the name of the file/folder. The name of the file/folder is
represented as a NULL terminated 16-bit little endian unicode string. The two bytes that follow the unicode
name are the offset of the additional information field.
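
The 4-byte DOS date and timestamps mentioned above pack a date and a time into two 16-bit words. A minimal decoding sketch follows; it assumes the date word is stored first and the time word second, which should be verified against known data before relying on it.

```python
import struct
from datetime import datetime


def dos_datetime_to_datetime(raw: bytes) -> datetime:
    """Decode a 4-byte DOS date/time (assumed order: date word, then time word)."""
    date_word, time_word = struct.unpack("<HH", raw)
    year = 1980 + (date_word >> 9)       # 7 bits: years since 1980
    month = (date_word >> 5) & 0x0F      # 4 bits
    day = date_word & 0x1F               # 5 bits
    hour = time_word >> 11               # 5 bits
    minute = (time_word >> 5) & 0x3F     # 6 bits
    second = (time_word & 0x1F) * 2      # 5 bits, stored in 2-second units
    # An all-zero field is not a valid date and raises ValueError here.
    return datetime(year, month, day, hour, minute, second)
```

The result is a naive datetime on purpose: the DOS format carries no time zone, and seconds are only accurate to two-second resolution.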

14 - C
Many commercial forensic tools will parse lnk files. A free (and open source) PERL script called lslnk.pl
will also parse some of the metadata out of an lnk file. The lslnk.pl script was developed by Harlan Carvey.

This slide shows an example of running lslnk.pl against a shortcut file.

15 - C
The Internet Explorer browser maintains a lot of information about a user’s browsing habits. Information that IE
keeps track of, and that can be incredibly useful to an investigation, includes a user’s cookies, cached copies of web
pages and downloaded files, as well as a list of the user’s browsing history.

The information is kept under a number of subdirectories under the user’s profile directory.

Most of the relevant information is held in index.dat files. These files use an undocumented binary format
that has the same general structure (regardless of location), but the meaning of the data contained within varies
based on what the index.dat is keeping track of (cache vs. cookies vs. history).

It’s important to understand that not *every* entry in an index.dat file is required to have come from the IE
browser. The caching functionality that IE uses is actually an operating system API, and any program is free to
use the API.

As a result, you may also see entries in index.dat files from programs other than IE. Unfortunately, there is no
direct way to tell what program created which entry.

16 - C
This slide shows the general layout of an index.dat file. The index.dat file is divided up into equal sized chunks
called blocks. Each block is 128 bytes long. When allocating new space inside the index.dat file, the operating
system will allocate at least one block of space.

At the start of the file is a header, which contains amongst other things, the version of the index.dat file, the size of
the file, the start of the HASH table, the total number of blocks as well as how many are used. Information such as
the current and maximum size of the cache, and the directories which hold any files are also in the header. The
offset of the 1st LEAK record (which we will discuss below) is also found in the header.

Entries in index.dat files can be “grouped” together, for accounting purposes. The offset of the 1st group record is
also found in the header. After the header is the allocation bitmap, which describes the status of each block. After
the allocation bitmap is a series of records. There are a number of different types of records.

• Group: This is used to keep track of which URL entries belong to which groups.
• HASH: This is used to quickly locate a specific URL or REDR entry. A HASH record is composed of a
series of two-field entries. The first is a hash of the URL, and the second is the block offset of the
corresponding URL or REDR entry.
• URL: This record describes information about URLs that were visited, and if there is any cached version of
the file downloaded from the URL.
• REDR: This record describes when a user is redirected to another page.
• LEAK: This record describes cached files that could not be properly removed when the cache scavenger
last ran.

17 - C
This slide and the next slide show the layout of a URL record type. The single record was divided across two
screens for readability purposes.

The first four bytes of the record are a signature, followed by a 4-byte field describing the number of 128-byte
blocks the record occupies. The next two fields are 8-bytes each and are FILETIME timestamps. The meaning
of these (and the following) timestamp fields varies based on the location of the index.dat file.
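
The start of a record can be sketched as follows. Only the first 24 bytes are parsed (signature, block count, and the two FILETIME fields), because those offsets are unambiguous from the description; the function and key names are our own.

```python
import struct
from datetime import datetime, timedelta, timezone

BLOCK_SIZE = 128  # index.dat allocation unit


def filetime_to_datetime(filetime: int) -> datetime:
    """FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_record_start(record: bytes) -> dict:
    """Parse the signature, length, and first two timestamps of a record."""
    signature, block_count, first_ft, second_ft = struct.unpack_from(
        "<4sIQQ", record, 0)
    return {
        "signature": signature,
        "record_length": block_count * BLOCK_SIZE,  # bytes occupied in the file
        "first_filetime": filetime_to_datetime(first_ft),
        "second_filetime": filetime_to_datetime(second_ft),
    }
```

The two timestamps are deliberately left unnamed here, since their meaning depends on which index.dat file the record came from.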

The next four bytes describe the Primary DOS datetime timestamp. Four bytes later starts the low 32-bits of the
size of a cached file. Four bytes later is the offset of a group record, if the entry is associated with a group.
Four bytes later is the size of the header of the record. The next field is also four bytes long and describes the
offset of the URL the record describes.

After the URL offset field, is a one byte field that is an index used to determine the cache subdirectory a cached
file resides in. Three bytes later is a 4-byte field describing the offset (relative to the start of the record) of the
name of a cached file.

The next four bytes describe the type of entry, followed by two 4-byte fields describing the location and size of
HTTP headers that were cached when the URL was last retrieved. After these fields is a four byte field
describing the offset (relative to the start of the record) of the extension of the cached file.

18 - C
This slide continues where the previous slide left off.

The next four bytes are the secondary DOS datetime timestamp, followed by a 4-byte field describing the number
of times the entry has been retrieved, followed by a 4-byte field describing the number of times the record has
been used by the user.

The next four bytes are the tertiary DOS datetime timestamp, followed by four bytes describing the high 32-bits
of the size of the cached file.

The URL is a NULL terminated ASCII string, as is the name of the cached file. The HTTP headers are also ASCII,
but not necessarily NULL terminated.

19 - C
This table describes the meanings of the various timestamps found in the URL and LEAK records of index.dat
files.

It is important to note that most of the times in the FILETIME structures are stored in UTC, while all of the
DOS datetime timestamps are stored in local time.

Sadly, forensic analysis tools still make time conversion errors when parsing index.dat, potentially leading to
incorrect conclusions. This is why it is *vital* that you test and understand both how your tools work, and what
their limitations are.

20 - C
All of your commercial forensic analysis tools will parse index.dat files. One free (and open source) tool to
accomplish the same task is pasco. To run pasco simply provide it with the full path to the index.dat file you
wish to examine.

21 - C
When a user deletes a file from Windows explorer, or by right clicking on it, the file is not actually deleted.
Instead it is moved to the Recycle bin, a directory that holds the files before they are actually deleted. If a user
deletes a file from the command prompt or via an API call, the file is deleted right away, and not placed in the
recycle bin.

The location of the recycle bin varies by version of Windows. Under Windows 2000 and later, the recycle bin
directory contains a sub directory for each user that deletes a file. You can use this information to determine
which account deleted a file.

It is possible to navigate to the Recycle bin directly from the Windows shell. However the folder presented to
you is *not* the same as what you see from examining the directory structure directly. Instead the Windows
shell uses a virtual folder to display the information. What You See Is Not What You Get.

Some of the more useful information from the Recycle bin includes the original path, name, and size of the
deleted file. The time the file was deleted is recorded as well.

22 - C
If you examine the files in the recycle bin directories, you will notice that the files have different names than
they did prior to deletion. The renaming happens to avoid collisions if you delete two files with the same
name.

Under versions of Windows prior to Vista, the files are renamed as D, followed by the original drive letter,
then a number, then . (dot) and the original extension of the file. For instance, Dc12.exe could be the 12th deleted
file on the C drive, and was originally a .exe file.
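
As a small illustration of the pre-Vista naming scheme, the pieces of a renamed file can be pulled apart with a regular expression. The pattern and function name below are our own sketch, not part of any tool.

```python
import re

# D, then the drive letter, then the entry number, then an optional extension.
PRE_VISTA_NAME = re.compile(r"^D([A-Za-z])(\d+)(\..*)?$")


def parse_recycled_name(name: str):
    """Return the drive, number, and extension encoded in a name, or None."""
    match = PRE_VISTA_NAME.match(name)
    if match is None:
        return None
    return {
        "drive": match.group(1).upper(),
        "number": int(match.group(2)),
        "extension": match.group(3) or "",
    }


print(parse_recycled_name("Dc12.exe"))
```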

The information that maps the deleted file name back to the original path and file name, and contains the
deletion time is in a file called INFO or INFO2. The file contains a simple header, followed by fixed sized
entries.

In Windows Vista and later, the naming scheme is different. The original files are renamed as $R followed by
a hash, . (dot), and then the original extension. Instead of a single file containing the name, path, and deletion
time information, there is an individual file for each deleted file. The information for the deleted files is in a
file named $I followed by the same hash as the $R file, and then . (dot) and the original extension.

23 - C
An INFO2 file is composed of a header followed by several entries.

This slide shows the header of an INFO2 file. The first four bytes are the version of the INFO2 file. 8 bytes
later is a two-byte value describing the size of each entry in the INFO2 file. The size of an INFO2 entry is fixed,
and this value is 0x0320 (800) bytes.

24 - C
This slide shows the layout of an entry in an INFO2 file. Part of the entry has been removed for readability
purposes.

At the start of the entry is the original name and path of the deleted file as a NULL terminated ASCII string, up
to a maximum of 260 characters. If the value in this field is less than 260 characters, it is padded with NULL
bytes.

The next field is four bytes long and describes the deletion entry number. This is the same number that you
find in the name of the deleted file. The next four bytes describe the drive letter the file was deleted from. A =
0, B = 1, C = 2, and so on.

The next 8 bytes are the time the file was deleted in FILETIME format. Following this field are four bytes
denoting the size of the deleted file. Finally, there is a NULL terminated, 16-bit little endian Unicode version of the
file name. This field is 520 bytes long, and is padded with NULLs if the Unicode version of the name is less
than 520 bytes.

Incidentally, if the first byte of the ASCII version of the file name is NULL (0x00), then it means the
corresponding file does not exist in the recycle bin directory.
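
Because every entry is a fixed 800 bytes (260 + 4 + 4 + 8 + 4 + 520), parsing one is straightforward. The sketch below follows the field order above; the function and key names are hypothetical.

```python
import struct
from datetime import datetime, timedelta, timezone

ENTRY_SIZE = 800


def filetime_to_datetime(filetime: int) -> datetime:
    """FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_info2_entry(entry: bytes) -> dict:
    """Parse one fixed-size INFO2 entry."""
    if len(entry) != ENTRY_SIZE:
        raise ValueError("an INFO2 entry must be exactly 800 bytes")
    ascii_name = entry[:260]
    number, drive, deleted, size = struct.unpack_from("<IIQI", entry, 260)
    unicode_name = entry[280:800].decode("utf-16-le", errors="replace")
    return {
        "original_path": unicode_name.split("\x00", 1)[0],
        "entry_number": number,               # matches the number in the Dc12.exe style name
        "drive_letter": chr(ord("A") + drive),  # A = 0, B = 1, C = 2, ...
        "deleted": filetime_to_datetime(deleted),
        "size": size,
        # A NULL first byte in the ASCII name means the file has left the bin.
        "still_in_recycle_bin": ascii_name[0] != 0,
    }
```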

25 - C
This slide shows the layout of information inside a $I file.

The layout is much simpler than an INFO2 file. The first four bytes are the version. Four bytes later is an
8-byte field describing the size of the deleted file. The next field is the deletion time in FILETIME format
(8 bytes long). Finally, there is the original name and path of the file as a NULL terminated, 16-bit little
endian Unicode string.
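
A $I file can be parsed with a handful of struct calls. This sketch uses the offsets implied above (version at 0, size at 8, deletion time at 16, name at 24), and the function name is our own.

```python
import struct
from datetime import datetime, timedelta, timezone


def filetime_to_datetime(filetime: int) -> datetime:
    """FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_dollar_i(data: bytes) -> dict:
    """Parse the contents of a $I file from a Vista-or-later recycle bin."""
    (version,) = struct.unpack_from("<I", data, 0)
    size, deleted = struct.unpack_from("<QQ", data, 8)
    name = data[24:].decode("utf-16-le", errors="replace")
    return {
        "version": version,
        "size": size,
        "deleted": filetime_to_datetime(deleted),
        "original_path": name.split("\x00", 1)[0],  # stop at the terminator
    }
```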

26 - C
One free and open source tool to parse INFO2 files (and there is a version for $I files) is called rifiuti. To run
rifiuti, simply provide the full path to the INFO2 file.

27 - C
The Prefetch is a mechanism used by Windows to reduce the time it takes a program to load required code pages
into memory, when the program is first started.

The cached code pages are found in the Prefetch directory, which is under the system root directory. Even if the
executable file is run from a USB or remote drive, there is a good chance prefetch files will be created. The
prefetch directory can have a maximum of 128 files before old ones are deleted to make space for new cache
files.

The files in the prefetch directory are named after the original executable’s name, followed by –
(dash) and then a hash of the path the file was run from.

Some of the useful information from a prefetch cache file includes the creation time of the prefetch file
(suggesting when the program was first run), the time the executable was last run, how many times the
executable has been run, the files opened within the first 10 seconds of program execution, and metadata for
volumes referenced in the first 10 seconds of program execution.

28 - C
This slide shows the header information from a prefetch cache file. At offset 0x64 is the offset to the file paths
referenced by the program during the first 10 seconds of execution. The next four bytes describe the size of the
file paths.

The next four bytes are the offset of volume metadata for volumes accessed within the first 10 seconds of
program execution.

Eight bytes later is the last time the program was run, in FILETIME format. Sixteen bytes later is a 4-byte field
describing the number of times the program has been executed.
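
Reading these header fields could look like the sketch below. The offsets are derived from the description above (0x64, 0x68, 0x6C, then 0x78 for the last run time and 0x90 for the run count); other prefetch format versions may place these fields elsewhere, so treat the constants as assumptions tied to the file shown on the slide. The function name is hypothetical.

```python
import struct
from datetime import datetime, timedelta, timezone


def filetime_to_datetime(filetime: int) -> datetime:
    """FILETIME counts 100-nanosecond ticks since 1601-01-01 UTC."""
    epoch = datetime(1601, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=filetime // 10)


def parse_prefetch_header(data: bytes) -> dict:
    """Extract the header fields discussed above plus the file path strings."""
    paths_offset, paths_size, volumes_offset = struct.unpack_from("<III", data, 0x64)
    (last_run,) = struct.unpack_from("<Q", data, 0x78)
    (run_count,) = struct.unpack_from("<I", data, 0x90)
    raw_paths = data[paths_offset:paths_offset + paths_size]
    # File paths are NULL terminated 16-bit little endian strings.
    paths = [p for p in raw_paths.decode("utf-16-le", errors="replace").split("\x00") if p]
    return {
        "file_paths": paths,
        "volumes_offset": volumes_offset,
        "last_run": filetime_to_datetime(last_run),
        "run_count": run_count,
    }
```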

29 - C
Here is an example of an entry in the file paths segment. Each file path is a NULL terminated 16-bit, little
endian, unicode string.

This example comes from a copy of the prefetch file for Powerpoint. I had double clicked on the file
ForensicsIntensive.pptx. It’s also interesting to note that case is *not* preserved.

30 - C
This slide shows volume information for the same prefetch file.

The first four bytes are the offset to the volume paths, followed by four bytes that tell how many volume path
strings there are.

The next eight bytes are the creation time of the volume in FILETIME format. The next four bytes are the
serial number of the volume. This can be useful for tying a specific USB device to the execution of a program.

The volume path is a NULL terminated, 16-bit, little endian, unicode string describing the Windows path to the
volume.

31 - C
Microsoft Office documents are a gold mine for forensic examiners. The amount of information stored
inside Office documents can literally be mind boggling. For instance, the creation, modification, and last
print timestamps, as well as the recent authors of a document and locations the document was saved are all
found inside Office documents. These are just a few of the fields!

Understanding Microsoft Office documents is difficult, due to their hideously complex nature. Microsoft
has released the specification for the Office document file formats. The specification for Microsoft Word
alone is over 700 pages.

The fundamental concept of how Microsoft Office documents are laid out is much closer to a file system
(with files and directories) rather than just one large monolithic format.

To understand the metadata inside an Office document, we need to examine the 3 levels that exist inside
any given Office document. They are the Compound File Binary level, the OLE property set level, and
then any application (Word/Powerpoint/Excel) specific metadata.

32 - C
The OLE compound file binary (CFB) format provides the underlying structure used to hold Office documents
and metadata. It is a filesystem-in-a-file, and has a very similar structure to FAT file systems. The CFB format
is not particular to Office documents, and is used by other Windows system artifacts.

When working with CFB files, they are divided up into chunks called sectors, which are either 512 or 4096
bytes in size. The first sector of a CFB file contains a header, which describes the layout of the rest of the CFB
file. This is similar to how a boot sector describes the layout of a file system.

Inside a CFB file there are 3 types of allocation tables. The FAT keeps track of the status and relation of
sectors (just like a FAT in a FAT file system). The double-indirect FAT (DIFAT) is used to describe the
locations of the sectors that contain the regular FAT. The third type is called the Mini FAT, and is used
to keep track of small streams. These small streams have their own separate storage mechanism.

Just like a FAT file system, the CFB has directory entries, which describe either streams (analogous to files) or
storages (analogous to directories). Unlike FAT file systems, however, there is only one type of directory entry,
and it is larger than 32 bytes.

33 - C
This slide diagrams the general layout of a CFB file. With the exception of the header and first part of the
DIFAT, all of the metadata information can exist anywhere in the file.

The header in the first sector of the CFB file describes the location of the root directory, as well as the start and
size of the FAT and DIFATs. The header also includes the maximum size of a stream (file) that can fit in the
Mini stream.

The DIFAT sectors describe the location of the FAT sectors, and are chained together. The last entry in a
DIFAT sector is the address of the next DIFAT sector.

The FAT sectors function the same way as they do in a FAT file system, describing sector chains and sector
status.
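
Following a sector chain through the FAT can be sketched as below. The FAT is modelled as a plain Python list already read from the FAT sectors, the end-of-chain marker 0xFFFFFFFE is the value used by the CFB format, and the offset helper assumes sector 0 begins immediately after the header sector. The function names are our own.

```python
ENDOFCHAIN = 0xFFFFFFFE  # FAT entry value that marks the last sector of a chain


def sector_offset(sector_number: int, sector_size: int = 512) -> int:
    """File offset of a sector; sector 0 starts right after the header sector."""
    return (sector_number + 1) * sector_size


def follow_chain(fat: list, start_sector: int) -> list:
    """Return the ordered sector numbers of the stream starting at start_sector."""
    chain = []
    seen = set()
    sector = start_sector
    while sector != ENDOFCHAIN:
        if sector >= len(fat) or sector in seen:
            raise ValueError("corrupt FAT chain at sector %d" % sector)
        seen.add(sector)
        chain.append(sector)
        sector = fat[sector]  # each FAT entry names the next sector in the chain
    return chain


# A made-up FAT holding a fragmented stream: 0 -> 1 -> 4 -> 2.
print(follow_chain([1, 4, ENDOFCHAIN, ENDOFCHAIN, 2], 0))  # prints [0, 1, 4, 2]
```

The loop check matters in practice: a damaged or deliberately malformed file can contain a chain that points back at itself.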

The Mini stream is used to hold streams (files) that are less than the cutoff size (from the header). The Mini
FAT is used to keep track of the mini sectors (found in the Mini stream) in the same manner that the regular
FAT keeps track of the regular sectors.

The Root stream is the name of the root directory. The slide above shows that the Root stream contains entries
for streams called “0Table”, the Mini stream, and “SummaryInformation”.

The streams need not be contiguous. For example, the 0Table stream (on the slide above) is fragmented.

34 - C
This slide shows what the CFB layout on the previous slide would look like, if it were mapped to a file system.

The Root directory contains a directory entry called “Root Entry”. The stream associated with the “Root Entry”
directory entry is the Mini stream (used to hold very small streams.)

The 0Table and SummaryInformation directory entries would point to their respective streams, just as
directory entries in a FAT file system named “0Table” and “SummaryInformation” would point to their
respective files.

35 - C
The next level of metadata in an Office document is the OLE property set. OLE property sets are a generic
method for storing arbitrary data in OLE CFB files.

The metadata (the OLE properties) are stored in a structure called a PropertySetStream. A PropertySetStream is
a stream from a CFB that contains a collection of properties. The collection of properties is known as a
PropertySet. There can be a maximum of two PropertySet structures per PropertySetStream.

Inside a PropertySet structure, there are several different types of properties, such as integers, strings,
timestamps, etc. There are also “special” properties, such as the CodePage property, which is used to determine
the encoding of string properties. Also the Dictionary property can be used to provide names for the properties
in a PropertySet. Properties are stored in a generic container called a TypedPropertyValue structure.

The meaning of the properties (metadata) in a PropertySet varies based on the type of PropertySet. Each type of
PropertySet is identified by a unique GUID. Additionally, some property sets have requirements on the name of
the stream they reside in.

There are three “well known” (i.e. documented) property sets. They are:
• SummaryInformation: to describe summary information (e.g. keywords, title) about a document
• DocumentSummaryInformation: to describe summary information that is specific to documents (e.g.
number of paragraphs, headings, etc.)
• UserDefined: to describe application-specific information, such as the hyperlinks in a Word document.

36 - C
This slide shows graphically how PropertySetStreams and PropertySets are laid out.

The PropertySetStream structure contains a header, and one or two PropertySet structures. The header at the
start of the PropertySetStream structure describes the location and number of PropertySet structures.

Each PropertySet structure contains a header, and zero or more properties. The header in the PropertySet
structure describes the location and a numeric identifier for each of the properties. The properties themselves
contain the metadata.
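
The two-level layout described above can be sketched in Python as follows. This is a minimal sketch under the assumption that the field sizes match the published OLE property set specification (a 28-byte stream header, then a 16-byte GUID and 4-byte offset per PropertySet); parse_property_set_stream and the dictionary keys are hypothetical names.

```python
import struct


def parse_property_set_stream(data: bytes) -> list:
    """List each PropertySet with its (property id, offset) table."""
    # Stream header: byte order, version, system id, CLSID, set count.
    byte_order, _version, _system_id, _clsid, num_sets = struct.unpack_from(
        "<HHI16sI", data, 0)
    if byte_order != 0xFFFE:
        raise ValueError("unexpected byte-order mark")
    if num_sets not in (1, 2):
        raise ValueError("a PropertySetStream holds one or two PropertySets")

    sets = []
    for i in range(num_sets):
        # Each set is announced by its GUID and its offset in the stream.
        fmtid, set_offset = struct.unpack_from("<16sI", data, 28 + 20 * i)
        # PropertySet header: total size, then the number of properties.
        _size, num_props = struct.unpack_from("<II", data, set_offset)
        # Property offsets are relative to the start of the PropertySet.
        props = [struct.unpack_from("<II", data, set_offset + 8 + 8 * j)
                 for j in range(num_props)]
        sets.append({"fmtid": fmtid, "offset": set_offset, "properties": props})
    return sets
```

Each (identifier, offset) pair locates one TypedPropertyValue: adding the pair's offset to the PropertySet's own offset gives the position of the property within the stream.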

37 - C
This slide shows a TypedPropertyValue (TPV) structure that contains a string.

The first two bytes denote the type of data in the TPV structure. In this case it is a CodePageString structure
(i.e. a string that is encoded with the code page described by the special CodePage property).

Two bytes later is a 4-byte field that describes the size of the string. The next field is the NULL-terminated
string, followed by NULL bytes to pad the TPV structure to a multiple of 4 bytes.
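
The layout of a string-valued TPV can be sketched in Python as follows. This is an illustrative sketch only: it assumes the type code for a CodePageString is 0x001E, and that the CodePage property names a single-byte code page (cp1252 is used as a stand-in default). The name parse_codepage_string is hypothetical.

```python
import struct

VT_LPSTR = 0x001E  # assumed type code for a CodePageString


def parse_codepage_string(data: bytes, offset: int = 0, codepage: str = "cp1252"):
    """Return (text, bytes_consumed) for a string-valued TPV structure."""
    # 2-byte type, 2 bytes of padding, then the 4-byte string size.
    vtype, _padding, size = struct.unpack_from("<HHI", data, offset)
    if vtype != VT_LPSTR:
        raise ValueError("not a CodePageString property")
    raw = data[offset + 8: offset + 8 + size]
    # Drop the NULL terminator (and anything after it) before decoding.
    text = raw.split(b"\x00", 1)[0].decode(codepage)
    # The structure is padded with NULL bytes to a multiple of 4 bytes.
    consumed = 8 + ((size + 3) & ~3)
    return text, consumed
```

In practice the codepage argument would come from the special CodePage property of the same PropertySet rather than from a default.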

38 - C
The last level of metadata in an Office document is the application specific metadata. For this course we will
focus on the most commonly used Office application, Microsoft Word.

A Word document is not a single stream. Instead, the information that comprises a Word document is spread
across several different streams. All of the streams, however, are in the same CFB file.

The stream named WordDocument contains most of the content of a Word document (if it is not encrypted)
and the File Information Block (FIB).

The streams named either 0Table or 1Table contain tables of strings referenced by the WordDocument stream.

The stream named Data contains various FIB related information.

The streams \x05SummaryInformation and \x05DocumentSummaryInformation contain the well known
SummaryInformation and DocumentSummaryInformation property sets. The stream named encryption
contains the content of encrypted Word documents.

The FIB is the most important structure in a Word document. The FIB describes the overall layout and
structure of the Word document. Each new version of Word has extended the FIB by appending new fields on
the end. The latest version of the FIB is a data structure with a few hundred fields.

39 - C
This slide shows the overall layout of Word specific metadata.

In the WordDocument stream, the FIB has a number of important pieces of information. Whether the document
is encrypted and/or obfuscated is denoted in the FIB. The string tables will be in a stream named either
0Table or 1Table. Some Word documents have both. The stream that contains the valid string tables (0Table or
1Table) is also determined by a field in the FIB.

There are several fields in the FIB which describe metadata useful to the investigator. These fields are the
offsets of string tables held in the 0Table or 1Table stream.

Some of the useful strings referenced by the FIB include:
• The last 10 users who saved the document, and where it was saved to.
• Any associated strings.
• Bookmarks in the document.
• External files that are referenced (e.g. if a document is composed of sub documents.)
• The authors of revision information.

40 - C
This slide shows how a string table is laid out.

A string table is composed of a header, followed by a series of data elements. The data elements contain the
strings, and a space to hold “extra data”. The purpose and meaning of the extra data fields is tied to the meaning
of the data the string table describes.

The first two bytes of the string table tell an examiner how to interpret the strings. If they are the value 0xFFFF,
then the strings are 16-bit, little-endian Unicode strings. Otherwise they are strings encoded in the code page of
the system.

The next field (which will be the first field if the first two bytes are *not* 0xFFFF) is the number of data/extra
data pairs in the string table. This field is also two bytes long. The following field is two bytes long and
describes the size of the extra data for each data/extra data pair.

The next part of the string table is a list of the data/extra data pairs. The first two bytes are the size of the string.
The next field is the string itself, and then any extra data.
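
The layout described above can be sketched in Python as follows. This is a minimal sketch that handles only the 0xFFFF (Unicode) variant, and it assumes the per-string size field counts 16-bit characters rather than bytes. The function name parse_string_table is hypothetical.

```python
import struct


def parse_string_table(data: bytes) -> list:
    """Return a list of (string, extra_data) pairs from a Unicode string table."""
    # Header: 0xFFFF marker, number of pairs, size of each extra-data field.
    marker, count, extra_size = struct.unpack_from("<HHH", data, 0)
    if marker != 0xFFFF:
        # Code-page string tables are left out of this sketch.
        raise NotImplementedError("only the 0xFFFF (Unicode) variant is handled")

    pos = 6
    entries = []
    for _ in range(count):
        # Assumption: the size is a count of 16-bit characters.
        (nchars,) = struct.unpack_from("<H", data, pos)
        pos += 2
        text = data[pos:pos + 2 * nchars].decode("utf-16-le")
        pos += 2 * nchars
        extra = data[pos:pos + extra_size]
        pos += extra_size
        entries.append((text, extra))
    return entries
```

To use it, an examiner would first read the table's offset and length from the FIB, slice that range out of the 0Table or 1Table stream, and pass the slice to parse_string_table().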

41 - C
A free and open source tool for extracting some of the more common metadata from a Microsoft Word document
is wmd.pl. This Perl script was developed by Harlan Carvey.

To run wmd.pl, use the “-i” option to provide the path to the Word document to examine.

42 - C
Thank you for attending today’s class. If you have any questions or suggestions, feel free to e-mail me at
mmurr@codeforensics.net. Alternatively, you can also find me on Twitter at http://www.twitter.com/mikemurr.

I also have a blog about digital forensics at http://www.forensicblog.org

43 - C
