
White Paper

by
Morgan Weinberg, President
Table of Contents

Introduction ....................................................................................................................................................................1
Key Taiyen Volatile Heap Features ............................................................................................................................2
Key Taiyen Persistent Heap Features ........................................................................................................................3
Paper Overview ..........................................................................................................................................................4
Background ....................................................................................................................................................................4
Fragmentation/Availability...........................................................................................................................................4
Basic Allocation Strategies .........................................................................................................................................5
Locality of Reference..................................................................................................................................................6
Multithreading Issues..................................................................................................................................................7
NUMA Support ...........................................................................................................................................................9
Taiyen Libraries..............................................................................................................................................................9
Some Unix Distributions .............................................................................................................................................9
Some Windows Distributions ......................................................................................................................................9
Notation ....................................................................................................................................................................10
Volatile Heap Manager.................................................................................................................................................10
Heap Structure .........................................................................................................................................................10
Static Blocks .............................................................................................................................................................11
Dynamic Blocks ........................................................................................................................................................12
Large Dynamic Blocks ..............................................................................................................................................13
The Testing Regimen...................................................................................................................................................14
The Memory Tester ..................................................................................................................................................14
Volatile Heap Performance Tests .............................................................................................................................17
Volatile Heap Test Results ...........................................................................................................................................19
SMP Scalability.........................................................................................................................................................19
Page Efficiency.........................................................................................................................................................21
Single-Threaded Performance..................................................................................................................................22
Dynamic Block Performance ....................................................................................................................................23
Block Access Speed.................................................................................................................................................26
Heap Compression ......................................................................................................................................................27
Heap Decompression ...............................................................................................................................................27
Bounds Checking .........................................................................................................................................................27
Memory Leak Detection ...............................................................................................................................................28
Persistent Heap Manager ............................................................................................................................................29
Heap Structure .........................................................................................................................................................29
Snapshot Heap.........................................................................................................................................................30
Transactional Heap ..................................................................................................................................................32
Archive Heap ............................................................................................................................................................33
Log Entries and Transactions ...................................................................................................................................33
Restoration from System Failure ..............................................................................................................................35
Embedded Log Entries .............................................................................................................................................36
Rollback....................................................................................................................................................................38
Undo/Redo ...............................................................................................................................................................39
Automatic Snapshots................................................................................................................................................43
ACID Transactions ...................................................................................................................................................44
Ease of Use ..............................................................................................................................................................45
Reliability ..................................................................................................................................................................46
Persistent Heap Performance...................................................................................................................................47
Transaction Engine...................................................................................................................................................50
Multiple Heaps .............................................................................................................................................................52
Conclusions..................................................................................................................................................................52

© 2005 Solid State Storage Inc. All rights reserved.

Introduction
Modern C/C++ programs often place enormous demands on computer memory. The most severe case involves
server-based software that is expected to run for months and perhaps years at a stretch. Those components
collectively responsible for memory management must be able to handle continuous memory churning where blocks
of varying sizes are continually allocated, resized, and freed; all while ensuring data integrity, durability, consistency,
and performance. The task is further complicated by the fact that most modern software employs multithreading, with
multiple threads engaging the memory manager at once. On symmetric multiprocessing (SMP) machines these
threads will likely be spread amongst multiple processors, introducing further complexity because concurrency and
CPU cache utilization now become primary concerns.

Operating systems do an excellent job of managing memory, but only at the page level. When a process allocates a
block from the OS, it gets a page or a page multiple. The same goes for freeing memory. There's no way to return
256 bytes back to the OS. The options are to return a page, a page multiple, or nothing. On most 32-bit systems, a
page is 4 KB, and on most 64-bit systems, 8 KB. The OS component that handles memory is generally called the
virtual memory manager (VMM). One critical feature of a VMM is its ability to effectively extend the amount of
installed RAM on a machine by employing the machine's persistent storage (generally, one or more disk drives) to
cache pages that are infrequently used. The VMM continually monitors the memory demands of all of the running
processes and if the RAM becomes overloaded, it takes a handful of the least accessed pages and writes them to
disk. The area in virtual memory where this data used to reside, but is now on disk, is called a page frame. If a
process tries to access the page frame, the VMM issues a page fault and reads the page back into memory. VMMs
use a variety of algorithms to balance the memory demands of running processes. The Windows VMM, for example,
is governed by working set theory.

Given the coarse page-sized granularity of VMM allocation, it is clear that some other component must reside on top
of the VMM to handle the allocation of blocks of varying sizes. Generally, this is accomplished by what is termed an
allocator or heap, which is just a block of pages allocated from the VMM that is apportioned according to a collection
of algorithms designed to handle varying and disparate allocation requirements. Essentially, the heap must be able to
subdivide pages into smaller blocks, coalesce pages into larger blocks, resize those blocks, reincorporate blocks that
are no longer needed by the application, and when possible return pages to the VMM. To be effective in modern
computing environments, the heap must be fully re-entrant, allowing multiple threads to concurrently allocate, resize
and free those blocks.
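
To make these responsibilities concrete, the sketch below outlines a generic heap interface. This is our own
illustration of the operations enumerated above, not the Taiyen API.

// A generic heap's core responsibilities (illustrative interface only, not the Taiyen API).
#include <cstddef>

class IHeap
{
public:
    virtual ~IHeap() = default;
    virtual void* Allocate(std::size_t size) = 0;               // sub-allocate pages into smaller blocks
    virtual void* Resize(void* block, std::size_t newSize) = 0; // grow or shrink a block, moving it if necessary
    virtual void  Free(void* block) = 0;                        // reincorporate the block for future allocations
    virtual void  ReleaseUnusedPages() = 0;                     // return idle pages to the VMM when possible
};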

Other features can be imagined as well, such as heap persistence (the ability to save and restore a heap from disk)
and perhaps heap transactions to guarantee data consistency. To organize levels of responsibility, we envision a
simple data-management stack (DMS) with layers of responsibility and their governing components, as shown below:

Layer   Data Management Stack (DMS)    Governing Component(s)            Product

  6     Application Layer
  5     Data Organization Layer        Database Manager                  Objedata
  4     Access Control Layer           Object Manager                    Objedata
  3     Persistence Control Layer      Persistent Heap                   Taiyen Memory Manager
  2     Page Organization Layer        Volatile Heap                     Taiyen Memory Manager
  1     Virtual Memory Layer           Virtual Memory Manager (VMM)      Unix, Windows, etc.
  0     Physical Layer                 CPU, RAM, cache, TLB, bus, etc.   Pentium, etc.

Figure 1

In this model, each layer is responsible for a specific set of services. The physical layer provides the hardware
necessary to support virtual memory and translate virtual memory addresses to physical addresses. The virtual
memory layer is responsible for providing pages and blocks of pages to the layers above, and moving pages to and
from the paging files according to the demands of the running processes. The page organization layer is responsible
for sub-allocating pages and coalescing pages to provide memory blocks of varying size to the layers above. The
persistence control layer is responsible for making the blocks provided by the page organization layer persistent and
for ensuring data consistency. The access control layer is responsible for managing data access and the provisioning
of resources in a multi-user environment. This layer also provides security. The data organization (or data
architecture) layer provides the logic necessary to tie memory blocks together to form a cohesive database. This is
the layer where a C++ class library or a relational engine, for example, would reside.

As is typical with a software stack, the layers are encapsulated: higher layers inherit the services of lower layers,
but never vice versa. This decouples the layers, so no layer ever has to be concerned with the layer above it, and it
allows the needs of each layer to be served by distinct components that operate autonomously, making it easier to
isolate bugs and produce more reliable software. Layer boundaries are drawn to suit the data-
management needs of a wide variety of applications. Some applications such as single-user games may only require
the facilities offered by layer two, whereas other complex multi-user transactional systems may utilize every layer.

The Taiyen Memory Manager is comprised of volatile and persistent heap managers, and as the figure shows, serves
two layers in the stack. The volatile heap manager, which offers unmatched performance and productivity features,
serves as a replacement for similar components found in the C runtime library (CRT), operating systems, and third-
party libraries. The persistent heap manager, on the other hand, has no direct parallel, but it can serve as a
compelling alternative to archives and object databases. Building upon the Taiyen system is Objedata, a follow-on
product that will be covered in a separate paper. We have Taiyen builds for both 32- and 64-bit versions of Unix and
Windows, with builds for Linux, Solaris, and other popular distributions being
added over time. The system comes in the form of static libraries so applications can link in just the functions they
need. This also eliminates the need to distribute yet another DLL, which avoids the versioning problems associated
therewith and allows more flexibility in terms of the interface. The system provides over 80 API calls.

Key Taiyen Volatile Heap Features

• The system is fully scalable on SMP and SMP NUMA machines, tests faster than Hoard (an allocator developed
at the Universities of Texas and Massachusetts), and tests more than double the speed of mtmalloc (Sun
Microsystems' multithreaded offering). The system offers complete protection against the false sharing of cache
lines, a problem that can degrade the performance of multithreaded applications.

• The system is robust, which we define as the ability to handle a wide variety of block sizes and a
dynamically varying average block size. This is in contrast to allocators such as Hoard that are only capable of
handling a tight range of block sizes.

• The system's allocator is also extremely fast when only one accessing thread is being used and is often twice the
speed of both Microsoft's and Sun's stock allocators.

• To maximize the access speed of allocated blocks, the system aligns all blocks optimally with respect to cache
lines and pages, thereby improving locality. Other allocators misalign over 99% of the allocated blocks, causing
blocks to touch extra cache lines resulting in increased cache misses. On Itanium systems, for instance, aligned
blocks can be accessed up to 50% faster than those that are unaligned.

• The system has native support for dynamic blocks, and a related feature that allows large multi-page blocks to
be moved within memory very rapidly through a proprietary technique called page table entry (PTE) redirection.
This allows the reallocation of large dynamic blocks up to 100 times faster than competing systems, and
facilitates the system's constant efforts to produce a less fragmented memory space.

• The system supports debugging features such as multi-threaded memory leak detection and bounds checking.

• The system supports heap compression/decompression to and from buffers and files. Heaps can be relocated,
resized, and even defragmented in the process.

Key Taiyen Persistent Heap Features

• The system provides orthogonal persistence, making it exceptionally easy to use. Any block that is allocated within the
persistent heap is automatically made persistent. Other systems require objects to be derived from a base class and
to be explicitly saved on an individual basis; conversely, those objects have to be explicitly
and individually loaded from disk. When a persistent heap is restored, all of the blocks are implicitly loaded at once, but
to maximize restoration speed they are only loaded virtually. A block is only made present in memory when it is
referenced, and this process is also automatic.

• Anything that can be stored in a volatile heap can be stored in a persistent heap, which includes pointers and the myriad
of pointer-dependent structures such as linked lists and B-trees. Other systems require the use of cumbersome and
buggy reference objects to link objects together. When a persistent heap is restored, it is always restored to its original
address, thereby preserving the validity of the pointer references therein.

• The system supports three persistent heap types: archive, snapshot, and transactional. The archive heap is designed for
information retrieval. The snapshot heap is designed for single-user desktop applications, and the transactional heap is
designed for multi-user server applications. The primary difference between the snapshot and transactional heaps is the
speed at which data is saved and the complexity of usage. As might be expected, data is saved in a snapshot heap by
performing snapshots. Snapshots are fast because they only flush newly modified data to disk, and they are superior to
the typical file save because they are fail-safe. If a system failure occurs during the snapshot, the heap will always be
restorable to its state as of the last snapshot. While snapshots are highly efficient, they are not nearly the fastest means
of preserving heap data. This title goes to the heap transaction, which can be made on the order of microseconds. Heap
transactions are global in nature and upon closure, effectively save the entire contents of the heap in one instant. Like
snapshots, they are also fail-safe. If the system fails, the transactional heap will always be restorable to its state as of the
last transaction.

• The drawback of using a transactional heap is increased complexity because it requires interaction with a transaction
log; but as far as logs go, this one is extremely easy to use. A heap transaction is composed of one or more log entries,
and there are two types of log entries: data and procedure. A data log entry copies a block of data and its pointer to the
log, whereas a procedure entry copies a function pointer and its parameter list, as well as some ancillary data, to the
Procedure entries allow the logging of complex operations such as the inversion of a matrix that would otherwise entail
copying large blocks of data. The process of making entries in the log is easy because the log never has to be checked
for overflow. From the perspective of the user, the log can be considered to be of infinite size. Log entries are also made
quickly because they are all buffered, and no entries are written to disk until the entries are terminated, which closes the
transaction. Actually, to maximize transaction speed, the log is specifically designed to reside on a solid state drive
(SSD) as opposed to a disk drive. This is possible because the log remains of fixed size for the life of the heap. When an
SSD is employed, the system becomes a potent transaction engine that takes a choppy input data stream and converts
it into smooth, contiguous bursts of output that are readily received by disk drives.

• When a rollback buffer is attached to a transactional heap, the system supports automatic rollback. When rollback is
invoked, the system will execute the mirror operations necessary to undo the operations constituting the current
transaction. During a rollback, the system also makes the appropriate entries in the transaction log, so that it doesn't fall
out of congruence with the heap. In a related feature, snapshot heaps allow the attachment of both undo and redo
buffers, which enable multiple levels of undo and redo, a feature required by most modern desktop applications.

• All heap transactions meet the definition given by the acronym ACID – atomic, consistent, isolated, and durable.

• The system gives the highest priority to data integrity, and its routines make liberal use of checksums and other means
of verification to ensure such integrity.

• A persistent heap is easy to administer. A transactional heap uses only three files, with one being the aforementioned
transaction log. Two files back a snapshot heap, while an archive heap needs only one. Heap restoration is easy
because the same function that is used to restore a heap from a normal shutdown also restores it from system failure;
and the function works on all three heap types. The function examines the attached files to determine the state of the
heap at the point of closure and then chooses the appropriate reconstruction strategy. For example, it might determine
that transactions in the log need to be processed to bring the heap up to date.

• The persistent heap manager is extremely fast. Firstly, it inherits all of the volatile heap features such as high allocation
speed and PTE redirection. Secondly, its own services are highly optimized. Tests of a transactional heap on a 2.8 GHz
PC with an Ultra ATA-133 drive yielded the following results: the log entry speed averaged four million log entries/s.
The transaction speed, with an average of 8.9 log entries per transaction, averaged 5,294 transactions/s. The snapshot
speed averaged 12.8 MB/s. The heap restoration speed averaged 1.63 GB/s. And heap compression and
decompressions were also fast at 44.2 MB/s and 40.8 MB/s, respectively.

• The system leverages the power of sparse files, so when a 512 GB file is used to back a 512 GB heap, the actual
amount of disk space that is consumed correlates to the amount of data in the heap and not the size of the heap. In
other words, if this heap were to contain only a megabyte worth of data, roughly a megabyte worth of disk space would
be consumed.

• A prime advantage of the persistent heap is that it is agnostic to data semantics and organization. All of the services it
provides such as transaction processing and rollback will work for any type of data organization. This means that a wide
variety of software from basic desktop applications to complex database management systems (DBMSs) can be built
upon this layer. The data organization could easily take the form of objects, tables and indices for a relational DBMS, or
whatever the developer dreams up.

The Taiyen system supports the ability to create multiple volatile and persistent heaps within the same process address
space. It manages all of its own memory and never passes an allocation request off to another allocator.

Paper Overview
The first section of this paper covers the background of heap management. It covers the primary issue involved,
which is fragmentation, and introduces the concept of availability. Availability plays a principal role in the architecture
of the Taiyen system. Other topics such as allocation strategies, locality of reference, and multithreading issues are
also covered. The next section outlines the Taiyen libraries. The following section covers the volatile heap manager
and such topics as heap structure, static blocks, dynamic blocks, and large dynamic blocks. The section after this
goes into the testing regimen used to generate the performance metrics quoted herein and covers the memory tester
in detail. The subsequent section presents the resulting volatile heap test results with specific attention paid to SMP
scalability, page efficiency, single-threaded performance, dynamic block performance, and block access speed. The
following sections cover heap compression and decompression, bounds checking, leak detection, and, finally, the
persistent heap manager. This last section outlines persistent heap structure in general, and archive, snapshot, and
transactional heaps in particular. It also covers log entries, transactions, restoration from system failure, automatic
snapshots, embedded log entries, rollback, undo/redo, ACID transactions, ease of use, and reliability. Lastly, the ability
of the transactional heap to function as a high performance transaction engine is argued with performance figures to
support the case.

Background
Fragmentation/Availability
Fragmentation is defined as the existence of sparsely allocated blocks spread across the heap’s address space.
Fragmentation occurs because blocks in a heap may be freed in random order, unlike the deallocation of blocks on a
stack, for instance, where the last block to be allocated will always be the first to be freed. The effects of
fragmentation include poor page utilization (or page efficiency) and a loss of availability. We define page efficiency as
the total number of allocated bytes in a heap divided by the aggregate number of committed pages backing it. We
define availability as a measurement of a heap's ability to accommodate allocation requests as a percentage of its
ability, were all of its free blocks coalesced together, and is given by the equation below:

For a series of N free blocks that are available for allocation, ordered from the largest to the smallest, (F1, F2, F3,
…FN), and O is the total number of free blocks in the heap, the heap’s availability, A is as follows:


A = 1 - \frac{1}{\sum_{m=1}^{O} F_m} \sum_{m=1}^{N-1} \left[ \left( \sum_{n=m}^{N} F_n \right) - F_m \right]          (Eq. 1)

To illustrate the concept of availability, consider the four heaps in Figure Two:

Heap 1:  [ free 10 ][ B ][ B ]                       A = 1 = 100%

Heap 2:  [ free 6 ][ B ][ free 4 ][ B ]              A = 1 - 1/10(10 - 6) = 60%

Heap 3:  [ free 6 ][ B ][ free 3 ][ B ][ free 1 ]    A = 1 - 1/10((10 - 6) + (4 - 3)) = 50%

Heap 4:  [ free 6 ][ B ][ free 2 ][ B ][ free 2 ]    A = 1 - 1/10((10 - 6) + (4 - 2)) = 40%

Figure 2 (B denotes an allocated block; numbers are free-block sizes)

Availability may be thought of as the inverse of fragmentation. The first heap contains one large free block and two
stored blocks that are clumped together. Assume, in the figure, that all free blocks are available for allocation. The
heap is completely unfragmented and therefore has an availability of 100%. The second heap contains the same
amount of aggregate free space as the first, but because it is fragmented, its availability drops to 60%. It is clearly
less conducive to allocation than the first heap. Heaps three and four share the same six-element free block as heap
two, but show smaller availabilities. Why? Because they are more severely fragmented. Once the six-element block is
occupied, heap two is capable of accommodating a four-element block, heap three is capable of accommodating a
three-element block, and heap four, only a two-element block. In this way, while the total free space in each heap is
the same, they are progressively less able to handle allocation requests. By this formula, the worst availability is
displayed by a heap that contains the maximum number of possible free blocks, which is intuitively what one would
expect. Notice that while availability will increase with an increase in average free block size, two heaps with the
same average free block sizes may have different availabilities, as demonstrated by heaps three and four. Heap
managers that strive to maximize availability maximize the probability that currently committed pages will be utilized
for future allocations, maximize page efficiency, and reduce stress on the paging system.
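
As a worked example, the small, self-contained C++ sketch below (our own illustration, not part of the Taiyen API)
evaluates Eq. 1 and reproduces the four percentages from Figure 2.

// Evaluates Eq. 1 for a set of free blocks passed in descending order (F1 >= F2 >= ... >= FN).
// Here every free block is assumed to be available for allocation, so O == N.
#include <cstdio>
#include <vector>

double Availability(const std::vector<double>& f)
{
    double total = 0.0;
    for (double x : f) total += x;                  // denominator: total free space
    if (total == 0.0 || f.size() < 2) return 1.0;   // a single free block (or none) gives 100%

    double penalty = 0.0;
    const std::size_t n = f.size();
    for (std::size_t m = 0; m + 1 < n; ++m)         // outer sum: m = 1 .. N-1
    {
        double tail = 0.0;
        for (std::size_t k = m; k < n; ++k) tail += f[k];   // inner sum: n = m .. N
        penalty += tail - f[m];
    }
    return 1.0 - penalty / total;
}

int main()
{
    std::printf("%.0f%%\n", 100.0 * Availability({10}));        // heap 1: 100%
    std::printf("%.0f%%\n", 100.0 * Availability({6, 4}));      // heap 2: 60%
    std::printf("%.0f%%\n", 100.0 * Availability({6, 3, 1}));   // heap 3: 50%
    std::printf("%.0f%%\n", 100.0 * Availability({6, 2, 2}));   // heap 4: 40%
}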

Because heaps have no control over which blocks are freed and when, much of the art of heap management comes
down to the manner in which the free blocks that are left behind are apportioned for future allocations. This goes for
the first block to be allocated in a heap where, essentially, the heap is one large free block and the nth block where
the heap may have a million free blocks from which to choose. The implications are huge. For instance, in a
multithreaded environment allocating two blocks on the same cache line can introduce false sharing where two
threads running on different CPUs touch the same cache line causing the system's cache coherency mechanism to
become overburdened.

Basic Allocation Strategies


A variety of strategies to apportion free blocks have been utilized with varying goals. A common one, the best-fit
method, given an allocation request of X bytes, scans the free blocks to find the one offering the best fit. An
optimization of this algorithm maintains a collection of lists where each list points to a series of free blocks of similar
size, allowing a free block of a desired size to be apprehended more quickly than scanning one large doubly linked
list. The best-fit algorithm, which is employed by most stock heap managers, does a good job at minimizing
fragmentation, thereby maximizing availability. It is also adept at handling dynamic environments where block sizes
vary widely. Implementations do differ on details such as whether and when neighboring free blocks are coalesced.
Coalescing free blocks obviously increases availability, but it does so at the expense of speed, since coalescing
takes work.
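
The sketch below illustrates the segregated free-list optimization just described; it is our own illustration, not
Taiyen's internal design. Each map key is a free-block size, and the mapped value is a chain of free blocks of that
size (a real allocator would typically bucket a range of sizes per list rather than exact sizes).

#include <cstddef>
#include <map>

struct FreeBlock
{
    std::size_t size;
    FreeBlock*  next;   // free blocks of the same size are chained together
};

// Best fit: jump straight to the smallest size that can satisfy the request
// instead of scanning one large doubly linked list of every free block.
FreeBlock* best_fit(std::map<std::size_t, FreeBlock*>& bins, std::size_t request)
{
    auto it = bins.lower_bound(request);                  // first size >= request
    return (it != bins.end()) ? it->second : nullptr;     // nullptr: no fit, grow the heap
}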

Another strategy uses an aggregation algorithm where the heap is subdivided into mini-heaps with each specializing
in a particular block size. This essentially pre-fragments the heap and immediately reduces availability. The
motivation behind this algorithm is speed. Given an allocation request, apprehending a target mini-heap is a quick
event, and freeing a block requires very little work. Because of the loss of availability, however, aggregation
algorithms are only suitable for allocators designed to handle a narrow band of block sizes. Hoard uses such an
algorithm, and does start consuming excess memory when confronted with size variation.

There are other allocation strategies that sacrifice availability for speed as well. One of note is the power-of-two
allocator used by Sun's mtmalloc. This allocator rounds every allocation request up to the next power of two, which is
acceptable so long as the allocation requests fit this mold. The motivation behind this approach is to speed free block
searches by reducing the number of size classes. A size class is defined as a range of block sizes that are handled
identically by the allocator. For instance, the 64-byte size class for a power-of-two allocator would include blocks from
33 bytes to 64 bytes because they would all target the same 64-byte free list. The weakness of the power-of-two
allocator is that it is memory inefficient when block sizes are not powers of two, which is often the case.
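
The rounding behind a power-of-two size class is simple, as the sketch below shows. This is our own illustration of
the idea, not mtmalloc's source; the 16-byte minimum class is an assumption.

#include <cstddef>

std::size_t round_to_pow2_class(std::size_t request)
{
    std::size_t cls = 16;                  // assumed minimum size class
    while (cls < request) cls <<= 1;       // 33..64-byte requests all land in the 64-byte class
    return cls;
}
// A 40-byte request is served from the 64-byte list, wasting 24 bytes; this internal
// fragmentation is the memory inefficiency noted above.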

The Taiyen system employs a fully coalescing allocator that uses a modified-best-fit strategy, which promotes
availability while approaching the theoretical speed of a pure aggregation algorithm. We accomplished this by
carefully engineering the code and using inline assembly in critical areas. The end result is a robust system that can
handle dynamically varying block sizes from a handful of bytes to megabytes and does so at unmatched speeds.

Locality of Reference
Spatial locality predicts that bytes in close proximity will be accessed close together in time, and temporal locality
predicts that bytes that were accessed recently will be accessed again in the near future. The principle of locality allows the
implementation of a memory hierarchy where the bulk of an application’s data can reside in low-cost yet lower-speed
storage without having to sacrifice performance in proportion. In the optimally designed system, an application will run
nearly as fast as it would if its data resided entirely in the CPU’s registers and cache, the top layer of the hierarchy.
The memory employed at this layer is static random access memory (SRAM), which is expensive but fast enough to
feed the CPU directly. The cache will generally contain multiple layers and is usually measured in megabytes. The
next layer, the main memory, consists of dynamic RAM (DRAM), which comes in a variety of architectures (SDRAM,
RDRAM, etc.) and is measured in gigabytes. DRAM is called dynamic RAM because it must be constantly refreshed
as opposed to static RAM, which retains its charge as long as voltage is applied to the circuit. But because DRAM
circuitry is simpler, it is cheaper (and slower) than SRAM, although DRAM speeds have seen dramatic increases over
the years. At the bottom of the hierarchy is the least expensive and slowest storage: disk and tape. Such storage is
measured in terabytes.

The system designer’s goal is to keep the most frequently used data as high in the hierarchy as possible to
satisfy the CPU’s voracious hunger for data. When referenced data are not found in cache (which is called a cache
miss as opposed to a cache hit), the CPU goes into a wait state and retrieves a cache line from main memory. Cache
lines are generally 64 bytes wide and even though the CPU may only need a couple of bytes to complete the
operation at hand, proximal bytes are retrieved because spatial locality teaches that they will probably be needed in
the immediate future. This is especially true when data access is sequential as is the case for a great many
operations from processing instructions to copying data to a buffer. Bringing in a new cache line will force another
one out, which one being governed by the cache’s associativity and the controller’s replacement algorithm. Typically,
caches use set associativity with a least recently used (LRU) replacement policy. Cache misses can cost on the order
of tens to hundreds of clock cycles.

If referenced data are not found in main memory (referred to as a page miss as opposed to a page hit) the CPU
issues a page fault and the VMM either turns to the paging files or a mapped file for the page in question. With spatial
locality in mind, VMMs will typically read in neighboring pages along with the faulting page. If the main memory is fully
committed, an underutilized page must be written to disk, with the page replacement algorithm being similar to that of the
CPU cache. A page miss can cost on the order of millions of clocks.

With the hardware’s reliance on the memory hierarchy, it is incumbent upon software at all layers to maximize locality
so as to minimize cache and page misses. Generally, software systems concentrate on spatial locality. This can
be accomplished through good coding practices and the careful design of data structures. For example, well-written
code will minimize loop bodies so that they may sit entirely in the instruction cache. More advanced techniques
include strip-mining, loop tiling, etc., with such optimizations usually being applied by a good compiler.

The Taiyen system promotes locality through its data alignment policies, which ensure that blocks never touch extra
cache lines and extra pages. In other words, a 64-byte block allocated in a Taiyen heap will only reside on one cache
line and a 4096-byte block will only reside on one page (assuming a 4-KB or greater page size). This will avoid cache
unit splits, which occur when accessing N cache lines worth of data causes N + 1 cache line loads. In correlation, it
will also avoid excess paging when accessing page-sized structures. In rare cases, in the interest of maximizing
availability, a sub-page block may be positioned such that it straddles a page boundary, but this should not be
deemed a significant liability as page alignment is more critical for blocks that are page multiples in size.

Note that these alignment rules don’t necessarily allow the Taiyen allocator to serve as a substitute for the CRT’s
aligned_malloc(size_t size, size_t alignment), which forces a block allocated thereby to be aligned along the power-
of-two boundary, alignment. For example, a 512-byte block is not guaranteed to reside on a 512-byte boundary. It
will, however, be aligned along a cache line. The system does guarantee power-of-two alignments for seven critical
granularities: the system word (four bytes on 32-bit systems, and eight bytes on 64-bit systems), double the system
word, the quad-word, the cache line size, double the cache line size, the page size, and the segment size, which is
defined as 16 pages. In this respect, the allocator should be able to handle most of the real-world allocations normally
passed to aligned_malloc().
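
A simple way to check whether a returned block meets a given alignment is shown below. This helper is our own
illustration; the 64-byte cache line used in the comment is an assumption about the host.

#include <cstddef>
#include <cstdint>

bool is_aligned(const void* p, std::size_t boundary)   // boundary must be a power of two
{
    return (reinterpret_cast<std::uintptr_t>(p) & (boundary - 1)) == 0;
}

// Example: a 64-byte block that starts on a 64-byte boundary occupies exactly one cache
// line; if is_aligned(p, 64) is false, the block straddles two lines.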

Because it’s impossible or at least impractical for the heap manager to monitor and predict usage patterns, the
application itself is best suited to promote inter-object locality. For example, a structure that will receive a lot of
pounding, such as a priority queue, is best implemented as a contiguous dynamic array as opposed to an unordered
linked-list where elements may wind up sharing cache lines with elements of unrelated structures. In general, care
needs to be taken when designing sparse structures such as linked-lists, which inherently show poor locality. The
problem is that traversing the list can, in the worst case, yield a cache miss and possibly a page fault with every node
reference. One solution to the quandary is to improve the locality of the nodes by placing them in a contiguous stack,
which will reduce the likelihood of cache misses and page faults. Of course, at a lower level, the responsibility of
maintaining intra-object locality rests with the heap manager, and if structures aren’t properly aligned to begin with,
efforts to improve inter-object locality will not be as fruitful as possible.
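
The "contiguous stack of nodes" idea described above can be as simple as the sketch below (our own illustration):
list nodes are carved sequentially from one array, so a traversal walks adjacent cache lines instead of chasing
pointers scattered across the heap.

#include <cstddef>
#include <vector>

struct Node
{
    int   value;
    Node* next;
};

class NodePool
{
public:
    explicit NodePool(std::size_t capacity) { nodes_.reserve(capacity); }   // reserve so node addresses stay valid

    Node* allocate(int value, Node* next)
    {
        nodes_.push_back(Node{value, next});   // nodes remain contiguous in memory
        return &nodes_.back();
    }

private:
    std::vector<Node> nodes_;
};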

Going forward, cache locality will become increasingly critical. Unlike RAM, which, in theory, can be expanded to the
extent of the address space, or even beyond, CPU cache is bound by the die size. Cache is also bound by CPU
speed, because the larger the cache, the higher the latency in the circuitry used to manage it; and as CPUs increase
in speed, the more pressure there is to reduce this latency. CPU designers address this problem by incorporating
multiple layers of cache with the smallest and fastest residing on the chip. Intel's Itanium I processor, for example,
employs three levels of cache (16 KB, 96 KB, and 4 MB). What's notable about this architecture is the latency at each
level (2, 6, and 22 clock cycles, respectively) not to mention the main memory latency of 176+ clocks. These sorts of
latencies clearly show the cost of a cache miss and are indicative of the modern CPU's dependence on locality at the
cache level.

Multithreading Issues
With the proliferation of SMP machines and multithreaded software, heap managers must be able to scale. Stock
allocators don't scale because a mutex (also called a lock) either guards the entire component, or at least too much of
it to allow full thread concurrency. The allocator is effectively serialized allowing only one thread inside at a time.
Even worse, as other threads are blocked, an extra context switch is thrown for each blocked thread resulting,
typically, in a torrent of context switches that further degrade system performance. In addition, when a thread is
blocked not only is its time-slice cut short, the clock cycles spent loading touched cache lines will have been wasted.

The primary challenge to the multithreaded heap manager is to allow full concurrency where not necessarily all
accessing threads but at least all accessing CPUs can operate in parallel. The secondary challenge is to avoid
introducing false sharing, which is where separate threads running on separate CPUs allocate blocks that share the
same cache line. This can lead to cache ping-ponging where threads on different CPUs attempt to write to the cache
line at the same time, causing one CPU to stall as it waits for its cache to be updated, and vice versa. The tertiary
challenge is to maintain good page utilization in an environment where multiple threads are allocating and freeing
memory at different rates, and where memory may be utilized according to a producer-consumer arrangement (where
one thread continually allocates memory and another is responsible for freeing it).

The most effective way to achieve concurrency is compartmentalization, where each processor gets its own sandbox
in which to operate. For example, on a quad-processor server, a compartmentalized heap would contain at least four
sandboxes, and the allocator would direct threads operating on the first processor to the first sandbox and so on.
Depending on the specific implementation, this approach will allow heap performance to scale almost linearly as
processors are applied to the application. While compartmentalization does eliminate contention between threads
operating on different processors, it usually doesn't eliminate contention between threads operating on the same
processor, unless, of course, each thread gets its own sandbox. Intra-processor thread contention, however, is
generally not of concern as threads running on the same processor obviously don't run in parallel. And it is likely that
one thread will make it through the allocator within its time-slice before the next thread comes along, resulting in low
per-sandbox contention. Furthermore, it is considered bad practice to spawn more than one thread per processor, as
this would tend to produce excess context switching. For these reasons, most implementations utilize per-processor
sandboxes as opposed to per-thread sandboxes.

Stock heaps are not compartmentalized and generally use a global lock that serializes the entire allocator. Some non-
compartmentalized heaps try to promote concurrency by eliminating the global lock and placing separate locks on
each free block list (recall the discussion on best-fit algorithms) or by using concurrent b-tree structures. Concurrency
with this approach is hit and miss (mostly miss), because it relies on threads consistently allocating blocks of different
size classes, which is unrealistic. A further weakness of non-compartmentalized heaps is that they actively introduce
false sharing by allowing threads on different CPUs to allocate from the same cache line. The one advantage of the
non-compartmentalized heap, however, is that it tends to exhibit the same page efficiency when accessed by many
threads as it does when accessed by one thread.

There are a variety of compartmentalization algorithms. Approaches differ on the granularity and means of
compartmentalization. They differ on the means by which threads are directed to their target sandboxes. They also
differ on how tightly the sandboxes are pinned to their respective CPUs, and on how they maintain availability.

The Taiyen system, which does employ compartmentalization, allows the user to specify the number of sandboxes
for which a heap will be configured. There are two basic ways to compartmentalize a span of heap address space.
The simple approach is to divide the heap into P sandboxes of equal size where P is the number of processors, but
this would severely reduce availability. The more sophisticated approach, that taken by the Taiyen system, is to allow
each sandbox to grow discontiguously by a fixed amount that is generally a fraction of the heap size. This is a more
flexible approach than the former because the sandboxes can grow individually according to the demands of the
threads operating therein. This is similar to the approach taken by Hoard. Recall from the discussion above that
Hoard uses an aggregation strategy, where the heap is divided into mini-heaps (also termed superblocks), each
taking a different size class. Hoard then segregates the mini-heaps into sandboxes and attaches them to separate
CPUs.

Some allocators like mtmalloc use a round-robin style mechanism to assign threads to sandboxes, and others try to
perform more active load balancing by steering threads with allocation requests to under-utilized sandboxes. The
problem with this or any algorithm that tries to perform an inter-sandbox optimization is that some sort of
synchronization object is required to manage the threads. In a heavily loaded system, even the fastest
synchronization object such as an interlocked-incrementer will block and cause context switching. The second
problem here is that in one case, thread A on CPU A may allocate from sandbox A; and then in the next case, thread
A may allocate from sandbox B, which is attached to CPU B. The problem is that thread A’s CPU cache contains
sandbox A's structures and not those of sandbox B. Stand by for some cache misses! The third question is why load
balance at all? If the allocation algorithm yields good page utilization, load balancing across sandboxes will not be
needed.

The Taiyen system uses per-thread information to map threads to sandboxes, namely thread-local storage (TLS).
Hoard performs a similar mapping but hashes thread IDs instead of using TLS. We chose TLS, because most
modern compilers are now providing support for static TLS, which allows stored values to be retrieved in a handful of
clocks and doesn't require calling a procedure as is needed to retrieve a thread ID. Using TLS also gives the
application more control over which thread targets which sandbox. In contrast to the assignment methods described
above, this one requires no global synchronization and is more cache friendly, because thread A will only access
sandbox A, etc.
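
A minimal sketch of TLS-based sandbox selection is shown below. This is our own illustration of the idea, not
Taiyen's implementation; the sandbox count and the first-touch assignment policy are assumptions (the application
could just as easily set the thread-local index explicitly).

#include <atomic>
#include <cstddef>
#include <cstdint>

constexpr std::size_t kNumSandboxes = 4;            // e.g., one per CPU (assumption)

thread_local std::size_t t_sandbox = SIZE_MAX;      // unset until the thread's first allocation
std::atomic<std::size_t> g_nextSandbox{0};          // used only on first touch

std::size_t sandbox_for_this_thread()
{
    if (t_sandbox == SIZE_MAX)
        t_sandbox = g_nextSandbox.fetch_add(1) % kNumSandboxes;   // assign once per thread
    return t_sandbox;                               // fast path: a plain static-TLS read
}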

Some allocators like pthread_alloc, used by the standard template library (STL), mistakenly take the concept of
coupling each thread to its respective sandbox to the extreme. Each thread is permanently pinned to its sandbox,
which is not an issue, until one thread hands a block off to another thread to be freed. The problem occurs when this
second thread happens to be running on a different CPU from the first, in which case the block is allocated from one
sandbox, handed off, and returned to a second. In a producer-consumer arrangement, where one thread may be
continually allocating blocks and a second is freeing blocks, if the threads ever wind up running on different CPUs,
the sandbox being accessed by the allocating thread will quickly run out of memory because no free blocks will ever
be returned to it. This type of arrangement can also introduce false sharing, because the free block that used to be
part of sandbox A and is now in sandbox B may share a cache line with a block that was allocated by sandbox A's
thread. As soon as sandbox B's thread allocates within this free block, false sharing will have been established.

A block freed in the Taiyen system, regardless of what thread is freeing it, is always returned to the sandbox from
which it was allocated. This prevents induced false sharing and protects against the unbounded procurement of
memory described above, which can both occur when memory is allocated and freed by separate threads. Both
mtmalloc and Hoard subscribe to the same practice of returning blocks to their original sandboxes. Even with an
allocator that can support it, the best practice is to avoid the producer-consumer arrangement altogether, and have
every thread manage its own memory. In other words, the thread that allocates a block should also free it. Most OS
schedulers will spread threads evenly amongst the CPUs and will try to keep each thread running on the CPU to
which it has affinity in order to maximize cache hits. When each thread manages its own memory, it continually
accesses the same sandbox over and over, and there is a good chance that the sandbox’s data structures will be in
cache for each go around. If a block, however, is handed off to a second thread to free, which happens to be running
on a separate CPU, there's little chance that the first sandbox's data will be in the second CPU's cache. Given the
implication, if the producer-consumer model still best suits the application, it is advisable that the producer and
consumer threads be explicitly given the same processor affinity.
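
If the producer-consumer pattern is unavoidable, pinning both threads to the same CPU keeps the shared sandbox's
structures in one cache. The sketch below uses the Windows affinity API; on Unix, pthread_setaffinity_np() plays a
similar role. The choice of CPU 0 is purely illustrative.

#include <windows.h>

void PinCurrentThreadToCpu(unsigned cpuIndex)
{
    // Bit N of the affinity mask selects logical processor N.
    SetThreadAffinityMask(GetCurrentThread(), DWORD_PTR(1) << cpuIndex);
}

// Call PinCurrentThreadToCpu(0) at the start of both the producer and the consumer thread
// procedures so they always share the same processor.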

One drawback of compartmentalization is that clearly not every free block in the heap is available to every allocating
thread, which, by definition, reduces availability. As noted, the Taiyen system avoids the biggest hit to availability that
would occur if a heap were divided equally by P processors, but even with finer grained compartmentalization,
availability suffers. Some allocators such as Hoard promote availability by specifying a minimum allocation level
within each sandbox. Basically, when consumption of a superblock drops below a certain level, it is removed from its
sandbox and sent to a global free store where it now becomes available to all of the sandboxes and thus all of the
allocating threads. The problem with this algorithm is that as soon as any partially allocated superblock enters the
free store, it opens that superblock to false sharing. To increase availability, Hoard allows unallocated superblocks to
be reset to whatever size class the application demands. In effect, the Taiyen system takes a similar approach by
returning unutilized pages to the VMM, which then become available to every sandbox in the heap (as well as other
processes running on the machine). This approach also eliminates any chance of false sharing. Furthermore,
because the Taiyen allocator uses a modified best-fit strategy, per-sandbox availability is naturally higher than it is for
aggregation strategies, so there is less demand for an extra mechanism to promote heap-wide availability. And while
a hit to availability will generally reduce page efficiency, as noted earlier, one does not directly correlate to the other. In
tests, the page efficiency of a Taiyen heap that is engaged by multiple CPUs is only a few percent lower than the page
efficiency of a heap that is engaged by one CPU.

NUMA Support
NUMA is the acronym for non-uniform memory access, which is a common architecture for SMP machines with many
processors. Each processor in an SMP machine has full access to the entire span of physical memory, but as the
number of processors grows, excess traffic on the interconnect becomes an impediment to performance. Hardware
designers have addressed this problem by dividing such SMP systems into nodes, where each node is essentially a
sub-system containing its own CPUs and memory; and where nodes are connected via a high-speed backplane. In
this arrangement, CPUs still have access to all installed memory, but can typically access local node memory
approximately three times faster than memory from another node. Thus, the key to maximizing performance on a
NUMA system is to keep memory access as local as possible. Operating systems facilitate this by allocating physical
pages locally from the node on which the allocating thread is running. Data allocated on a node will remain as such
until it is paged out. When paged back in, that data will be allocated on the node running the touching thread. The
implication is that allocations have to be made very carefully on a NUMA system to ensure good node locality.

The Taiyen model adapts well to NUMA architectures because it allows the creation of multiple volatile heaps that
can be assigned individually to each node on the host. A single heap won’t work even with its ability to segregate
thread access via sandboxes because there are simply too many internal control structures that reside on common
pages that would be accessed by different nodes (while these structures share pages, they don’t share cache lines).
The solution is to create multiple heaps; and to ensure that these heaps reside on different nodes, they must be
created by threads that are running on different nodes. The main thread can be used to create these heaps, but the
trick will be to change its CPU affinity prior to creating each heap. A simple array-based resolution mechanism can be
used in the thread procedure to direct threads to the appropriate heap. Be sure to accomplish this using local stack
structures as opposed to globals.
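
The per-node creation pattern can be sketched as follows (Windows shown). The Taiyen creation call is represented
by a stub; in a real program it would be CHeapManager_pcCreate(), whose full prototype is not reproduced here.

#include <windows.h>
#include <vector>

void* CreateNodeLocalHeap()
{
    // Stub for illustration: substitute the actual CHeapManager_pcCreate() call with the
    // desired uwHeapSize and uwNumCPUs for this node.
    return nullptr;
}

std::vector<void*> CreatePerNodeHeaps(const std::vector<DWORD_PTR>& nodeAffinityMasks)
{
    std::vector<void*> heaps;
    HANDLE self = GetCurrentThread();
    for (DWORD_PTR mask : nodeAffinityMasks)
    {
        // Move the creating thread onto the target node so the heap's internal structures
        // are faulted in from that node's local memory.
        DWORD_PTR previous = SetThreadAffinityMask(self, mask);
        heaps.push_back(CreateNodeLocalHeap());
        SetThreadAffinityMask(self, previous);     // restore the original affinity
    }
    return heaps;   // worker threads index this array by node number (the array-based resolution)
}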

The creation of multiple transactional heaps in a NUMA process is not advised, mostly because there will be no
advantage in doing so; and the disadvantage will be the excess consumption of address space. To maintain
coherency between a transactional heap and its log, no operation on that heap can touch another heap. The better
approach will be to spawn a separate process for each node, with each process containing its own transactional
heap. Good concurrency across the nodes can be realized because it’s a simple matter to pin a process to a
particular node. In this scenario, the parent process will have to balance the load across the child processes and
manage the dispersion of IO requests.

Taiyen Libraries
Some Unix Distributions
Debug
Taiyen_USPARC64_Solaris_CC_X_Dbg.a
Taiyen_USPARC64_Solaris_CC_X_OT_Dbg.a

Release
Taiyen_USPARC64_Solaris_CC_X.a
Taiyen_USPARC64_Solaris_CC_X_OT.a

Some Windows Distributions


Debug
Taiyen_IA32_Win_CL_X_Dbg.lib
Taiyen_IA32_Win_CL_X_OT_Dbg.lib

Taiyen_IA64_Win_CL_X_Dbg.lib
Taiyen_IA64_Win_CL_X_OT_Dbg.lib

Release
Taiyen_IA32_Win_CL_X.lib
Taiyen_IA32_Win_CL_X_OT.lib
Taiyen_IA32_Win_CL_X_fc.lib
Taiyen_IA32_Win_CL_X_OT_fc.lib

Taiyen_IA64_Win_CL_X.lib
Taiyen_IA64_Win_CL_OT_X.lib

Notation
Taiyen_(processor architecture)_(operating system)_(compiler)_(X)_(over-threaded yes/no)_(debug build yes/no)_(non-cdecl calling convention)
Dbg - Debug build.
X - The maximum number of CPUs for which a heap may be configured. For 32-bit builds, X will range between 2 and
32. For 64-bit builds, X will range between 2 and 64.
OT - The number of accessing threads may exceed X. For non-OT libraries, the number of accessing threads may
not exceed X.

> Windows Specific

fc – The library calling convention is fastcall. All Taiyen libraries have C linkage and unless otherwise noted, employ
the cdecl calling convention. When available, builds using the faster register-based calling convention, fastcall, are
also provided.

< Windows Specific

The debug libraries take every opportunity to verify that internal structures and allocated blocks are error-free. They
verify that parameters passed to the system functions are within acceptable limits. If a parameter is found to be
incorrect, for any reason, an assertion error message box will be invoked with an expression describing the error, the
file and line in which the error occurred, and a time stamp. The debug libraries will also routinely examine internal
structures, and in the unlikely event that an error is found in one of these, a similar message box will be invoked, but
it will advise the user to abort the process and report the error to the manufacturer. By default, the debug libraries all
perform bounds checking and leak detection.

The release libraries will only scale to the value of X in the library name. For example, Taiyen_IA64_Win_CL_64.lib
will scale to 64 CPUs. The number of CPUs for which a particular heap is configured is set upon heap creation and is
limited by this value of X. When linking to Taiyen_IA64_Win_CL_64.lib for instance, a heap may be configured for
one to 64 CPUs. In contrast, a heap created via Taiyen_IA64_Win_CL_2.lib may only be configured for one to two
CPUs. The number of CPUs for which a heap is configured actually determines how many sandboxes will be created;
a heap configured for 64 allocating CPUs will contain 64 sandboxes, and more sandboxes may certainly be created
than there are CPUs installed in the host. The over-threaded (OT) libraries use locks to allow multiple threads per sandbox. In
contrast, non-OT libraries do not use locks and therefore allow only one thread per sandbox, but are about 15%
faster than their OT counterparts. These non-OT libraries are included to boost the speed of multithreaded
applications that match the number of running threads to the number of CPUs on the host, which is generally
preferred so as to minimize context switching.

TaiyenGlobals.h contains all of the macro definitions, and global variable and function declarations needed for the
system. This header must be included before all others. The header, PublicHeapManager.h contains the volatile heap
function declarations and PublicPersistentHeapManager.h contains persistent heap specific declarations. Include
PublicHeapManager.h before PublicPersistentHeapManager.h.
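
As a minimal sketch of that ordering (the header names are those given above; nothing else is assumed):

    #include "TaiyenGlobals.h"                 /* macros, globals, and function declarations; include first */
    #include "PublicHeapManager.h"             /* volatile heap function declarations */
    #include "PublicPersistentHeapManager.h"   /* persistent heap declarations; include after PublicHeapManager.h */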

Volatile Heap Manager


Heap Structure
All operations on a volatile heap are conducted by the volatile heap manager. This includes not only its creation and
its core purpose (the allocation and freeing of objects), but many other ancillary operations as well. The structure of a
heap is shown in Figure Three, below:

Figure 3: Heap layout. The internal region occupies the low end of the heap and the user space the high end; together they span the heap space (uwHeapSize).

A heap is divided into two primary regions: The low region, referred to as the internal region, is used by the system
exclusively to store internal structures. The high region, referred to as the user space, is used only to store user data
and free block nodes. In general, the internal region will occupy approximately 3.5% of the entire heap space. The
heap manager will allocate memory within the internal region in proportion to the amount of memory allocated in the
user region. This separation improves the locality of internal structures as well as user data. It also allows the system
to keep track of blocks in the event that a buffer overrun wipes out any free block nodes in the user space. Because
overflows usually progress from low to high addresses, it is unlikely that the internal region will be corrupted by an
unruly process.

A volatile heap is created by calling CHeapManager_pcCreate(). In this function, the parameter uwHeapSize
specifies the extent of the heap space. On Unix, this heap space will generally be backed by a named file, unless the
distribution allows for anonymous mappings. On Windows, the heap space will be backed by the paging file. All of the
structures required to support a volatile heap are allocated internally within the heap space – no memory is ever
allocated from anywhere else in the process. That includes the stack, other heaps, etc. The function also takes the
parameter, uwNumCPUs, which specifies the number of sandboxes to create. The system supports the allocation of
multiple volatile heaps in the process space. This goes for persistent heaps as well.
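
As a hedged sketch, heap creation might look like the following; only the function name and the parameters uwHeapSize and uwNumCPUs come from this paper, while the handle type, the argument order, and the literal sizes are assumptions made for illustration:

    #include "TaiyenGlobals.h"
    #include "PublicHeapManager.h"

    /* Create a 128 MB volatile heap configured for four sandboxes (one per allocating CPU).
       The void * handle and the argument order are assumed, not documented here. */
    void *pcHeap = CHeapManager_pcCreate(128 * 1024 * 1024 /* uwHeapSize */,
                                         4                 /* uwNumCPUs  */);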

One difference between the Taiyen heap and those of other systems is that blocks of all sizes are allocated within the
heap space. One advantage of keeping all allocated blocks within the heap space is that in the case where multiple
heaps are used in a process, blocks are quickly resolved to their parent heaps without the need for lookup tables or
other such logic.

Static Blocks
The system supports two block types: static and dynamic. Dynamic blocks, as the name implies, are designed to be
resized. Never try to resize a static block, as the underlying structure of a static block is quite different from that of a
dynamic block, especially when the block is beyond a page in size. The structure of a static block is shown in Figure
Four, below:

Figure 4: Static block layout. The user region (uwSize) sits within the committed region.

A static block is composed of two regions: a user region and a committed region. The size of the user region is the
amount of static storage requested from the heap manager for a given block. The committed region is the actual
amount of memory used to store the user region. In many cases, to maintain alignment in the heap, a block's
committed region will be made slightly larger than its user region. For instance, a block with a user region of seven
bytes will have a committed region of eight bytes. A static block is allocated by calling CHeapManager_pmAlloc() with
the parameter, uwSize specifying the size of the user region.
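
In code, an allocation and its matching free might look like the hedged sketch below; the function names appear in this paper, but passing the heap handle as the first argument is an assumption:

    /* Request a seven-byte user region; the committed region is rounded up (here to eight
       bytes) to preserve alignment. pcHeap is the handle returned by CHeapManager_pcCreate(). */
    void *pmBlock = CHeapManager_pmAlloc(pcHeap, 7 /* uwSize */);
    /* ... populate and use the block ... */
    CHeapManager_Free(pcHeap, pmBlock);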

When using a debug library, a system word of padding is added to the low end of the user region and two system
words to the high-end. The high system word of the padding contains a block index used to help pinpoint memory
leaks. Whenever a static block is referenced in a heap function, such as when it’s freed, the system will verify that
neither of these areas is corrupted. A call to CHeapManager_uwDbgEnumerateInvalidBlocks() will explicitly scan the
entire heap for corrupted blocks. The structure of a bounds-checked static block is shown in Figure Five, below:

Figure 5: Bounds-checked static block layout. Low padding, the user region (uwSize), and high padding together occupy the committed region.

Both the debug and release builds of CHeapManager_pmAlloc() return user region pointers. As seen in Figure Five,
the lower padding added by the checked build will offset the user region with respect to the committed region by a
system word. The pointer to the committed region is called the block-base pointer.

Newly allocated blocks are not initialized to zero, the reason being that the vast majority of operations entail allocating
a block and then immediately assigning a value to it. Rarely is that value zero; and if it is, so be it. That assignment
certainly doesn't need to be automatic.

Dynamic Blocks
Figure Six shows the structure of a dynamic block. Because of their resizability, dynamic blocks are the ideal base
structure for all dynamic objects, the most common being the dynamic array, often referred to as a vector. A dynamic
block is allocated by calling CHeapManager_pmSetDynBlockSize(). This also happens to be the same call used to
resize the block, and if desired, free the block as well. The parameter, uwSize passed to this function specifies the
amount of dynamic storage requested from the heap manager for the block. Internally, this quantity is termed the user
region.

Figure 6: Dynamic block layout. The arena and the user region (uwSize) form the dynamic region; the committed region stores the dynamic region; and the committed region plus the expansion region make up the reserved region.

A dynamic block is composed of six regions: a system-word-sized arena, the user region, a dynamic region, a
committed region, a reserved region, and an expansion region. The arena, which is located at the low end of the
block, contains the size of the user region. The dynamic region is the sum of the arena and the user region. The
committed region represents the amount of memory used to store the dynamic region. The reserved region extends
beyond the committed region providing a buffer in which the user region may expand, unencumbered. The expansion
region is simply the extent of this buffer. The CPU pages covered by the expansion region will remain uncommitted
unless and until they are needed to support an expanding committed region. When a dynamic block's reserved region
is smaller than the size of a page, the expansion region as a percentage of the user region is reduced to minimize
overhead.

When a user region expands beyond the reserved region, the system may either attempt to expand the reserved
region in place, or take the opportunity to reduce fragmentation by relocating the block. Of course, if a static or
dynamic block exists immediately above the expanding block and would interfere with the expansion, the expanding block will be forced to move.

When a user region shrinks below a certain threshold, the reserved region will also shrink in concert to avoid wasting
address space for the proportionally oversized expansion buffer that would have existed otherwise. When a reserved
region shrinks, as is the case when it expands, the system may move the block as a way of reducing fragmentation.

The size of the reserved region is controlled by the system. It is set to allow the user region room for expansion
without unduly consuming address space. A careful balance is struck, weighing the potential cost in CPU cycles that
would be spent if a block had to be frequently relocated, against the drain on the heap caused by excessive
fragmentation coupled with the loss in memory efficiency caused by making expansion regions too large. Other
factors come into play such as the block’s size relative to a page. Smaller blocks can be relocated rather quickly,
because, obviously, there isn’t that much data to haul around in a smaller block’s user region.

As a dynamic block grows in size, however, its periodic relocation will become a severe drain on performance if some
strategy is not employed to either circumvent it altogether or to accelerate it. The Taiyen system has a unique and
fast way of relocating blocks that are larger than a page. Instead of performing a costly byte-by-byte move, the
system redirects the page table entries (PTEs) of the pages backing the block to its new virtual address. In essence,
each physical page in RAM is simply reassigned to a new page frame in virtual memory, effectively moving a page’s
worth of data with each assignment. Because no block data is touched during the relocation, the method can even
move a block that has been paged out to disk without having to page in any portion of that block during the move, as
would be the case for the byte-by-byte method. Tests show that PTE redirection allows multi-page blocks to be
relocated in memory two orders of magnitude faster than a byte-by-byte copy. Even with the benefit of PTE
redirection, relocation does have a cost, so multi-page dynamic blocks do have expansion buffers. The pages
touched by these buffers are uncommitted and thus only consume address space.

As is the case for static blocks, the debug builds allocate dynamic blocks with padding, although padding is only
needed at the high-end of the block. The extent of this upper padding is also two system words and is considered
part of the dynamic region. Like the static block high padding, the high system word contains a block index to help
track memory leaks. The padding is checked every time the dynamic block is referenced in a library function. The
structure of a bounds-checked dynamic block is shown in Figure Seven, below:

Figure 7: Bounds-checked dynamic block layout. The arena, the user region (uwSize), and the high padding form the dynamic region, which is stored in the committed region; the committed region plus the expansion region make up the reserved region.

Like CHeapManager_pmAlloc(), CHeapManager_pmSetDynBlockSize() always returns a user region pointer. Like static blocks, dynamic blocks are not initialized to zero when created or expanded.

Large Dynamic Blocks


A large dynamic block is a specialized form of dynamic block with one unique feature – the reserved region of a large
dynamic block will never shrink below the value set upon creation. If need be, the reserved region may expand
beyond this threshold, and it will expand and shrink according to the rules governing regular dynamic blocks, but it
will only shrink to its initial value. To create a large dynamic block, CHeapManager_pmSetDynBlockSize() is called
with the second parameter being set to HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG. The third parameter,
in this case, will set the size of the block's reserved region. The minimum size of a large dynamic block's reserved
region is a segment, which is defined as 16 pages or 64 KB for 32-bit builds and 128 KB for the 64-bit builds.

Once created, a large dynamic block may be treated as any other dynamic block. Resizing operations within its initial
reserved region will be fast, because the block will never need to be relocated, but it will monopolize address space.
In addition, if its user region shrinks well below a page in size, heap efficiency will suffer, because the untouched
portion of the page will go unused. A prime advantage of the large dynamic block is that resizings within the initial
reserved region are fail-safe, which may be critical for certain structures.
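
The hedged sketch below strings these calls together; the function name and HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG come from this paper, but the parameter layout (block pointer, flags, size), the NULL-to-create and zero-size-to-free conventions, and the literal sizes are all assumptions made for illustration:

    /* Create, expand, and free an ordinary dynamic block through the single resizing call. */
    void *pmVec = CHeapManager_pmSetDynBlockSize(NULL, 0, 1024);   /* create with uwSize = 1 KB           */
    pmVec = CHeapManager_pmSetDynBlockSize(pmVec, 0, 64 * 1024);   /* expand to 64 KB; the block may move */
    CHeapManager_pmSetDynBlockSize(pmVec, 0, 0);                   /* free by resizing to zero (assumed)  */

    /* Create a large dynamic block whose reserved region never shrinks below one segment. */
    void *pmBig = CHeapManager_pmSetDynBlockSize(NULL,
                      HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG,
                      64 * 1024 /* initial reserved region: one segment on 32-bit builds */);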

The Testing Regimen
All volatile heap tests referenced here were performed on a 32-bit Intel model S450NX server using four 550 MHz
Xeon processors, running Windows 2003 Server; a 64-bit HP i2000 workstation using dual 800 MHz Itanium I
processors, running Windows 2003 server; and a 64-bit Sun e4500 server using eight 336 MHz Ultrasparc II
processors, running Solaris 9. The comparison tests pit the Taiyen memory manager against the stock memory
managers provided by each platform's respective C runtime library (CRT). Tests were also run against Windows heap APIs such as HeapAlloc() and HeapFree(), but in every instance the Windows CRT proved faster, so only the CRT
results are referenced. Multithreading tests were also run against the 32-bit Windows build of Hoard, and Sun's
mtmalloc running on Solaris 9. Hoard was not tested on Solaris because, at the time of this writing, only a 32-bit build
existed for it, and there is not a 32-bit Taiyen build for Solaris. The Taiyen libraries used for the tests were
Taiyen_IA32_Win_CL_32_fc.lib, Taiyen_IA64_Win_CL_64.lib, and Taiyen_USPARC64_Solaris_CC_64.a.

The Memory Tester


The memory tester, which runs on both Unix and Windows, is an indispensable tool that provides a number of useful services. First, it is used internally as a powerful debugging tool for both volatile and persistent heaps. Second, it
serves as a profiling tool that can pit one memory manager against another in varying environments; and third, it
allows the performance of disparate platforms to be compared when taxed by a process that places heavy demands
on the OS and hardware, and does so equally from one machine to the next. It tests the following volatile heap
functions: the allocation of static and dynamic blocks, the resizing of dynamic blocks, and the freeing of both static
and dynamic blocks. The persistent heap functions that it tests include: compressing and decompressing a heap to
and from a file, making snapshots, making data and procedure log entries, closing transactions, recovering a heap
from a simulated system failure during a snapshot, recovering a heap from a simulated system failure during
transaction processing, and recovering a heap from congruent heap files. The code for the tester is found in
TestMemoryManager.cpp.

Figure 8: Allocation pattern. The total number of allocated bytes is plotted against the test number (1 through 16) and bounces between the lower and upper allocation limits. Rising legs follow a centerline whose slope is the current average allocation size (for example, during test 7) and falling legs a centerline whose slope is the current average deallocation size (for example, during test 13).

The tester is essentially a torture chamber for heap managers. It randomly allocates, resizes, and frees blocks in a
heap, continuously churning the memory. At specific intervals, marked by the completion of a test, the churning is
briefly paused and the heap is analyzed to see how it is performing. Figure Eight, above, shows this process
graphically. The solid line plots the typical allocation pattern generated by the tester. A channel is defined by the
lower and upper allocation limits, and once the aggregate number of stored bytes rises above the lower limit, it will
bounce between the two. The upper limit is a hard ceiling and will never be pierced. The total allocation, however,
may dip briefly below the lower limit, but upon doing so, the tester will immediately start allocating bytes. Notice that
this solid line is choppy. There’s a general bias toward allocation or deallocation, but along the way, the system is
both allocating and freeing memory; and for brief periods, the local trend may be the reverse of the bias. The upper and lower allocation limits, the slope of the bias, as well as the amount of choppiness (or noise) are all controllable.
To understand how the allocation pattern is controlled, refer to Figure Nine below.

Figure 9: Triangular distribution. Probability is plotted against block size increment for two triangular distributions: one for deallocation, peaked left of the probability axis at the average deallocation size (bounded by its lower and upper limits), and one for allocation, peaked right of the axis at the average allocation size (bounded by its lower and upper limits), each extending out to the maximum allocation size. The area of each triangle is 1.0, and Pmax = 2.0 / (9 + 1) = 20%, where 9 is the number of possible block sizes.

The figure shows a triangular probability distribution that is defined by two values: the average allocation size and the
maximum allocation size. The height of the triangle, which equals the maximum probability, Pmax, is a function of the
number of possible block sizes bound by the sides of the triangle. In the figure, there are nine possibilities. Because
there has to be a 100% probability of a size falling within the distribution, the area of the triangle must be one; and
from that, Pmax, the probability of allocating a block of average size is derived to be 20%. The probability falls off
linearly as the size deviates from this average to where the probability of allocating a block of maximum size is 4%.
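
As an illustration only (this is not the tester's code), a symmetric triangle peaked at the average size can be sampled with the standard inverse-CDF method; negative results correspond to the portion of the distribution left of the probability axis and indicate bytes to free:

    #include <cstdlib>
    #include <cmath>

    /* Sample a byte count from a triangle peaked at averageSize, spanning from
       (2 * averageSize - maximumSize) to maximumSize; the low end may be negative. */
    long SampleTriangular(long averageSize, long maximumSize)
    {
        long lower = 2 * averageSize - maximumSize;
        double u = std::rand() / (double)RAND_MAX;    /* uniform value in [0, 1]   */
        if (u < 0.5)                                  /* left half of the triangle */
            return lower + (long)((averageSize - lower) * std::sqrt(2.0 * u));
        return maximumSize - (long)((maximumSize - averageSize) * std::sqrt(2.0 * (1.0 - u)));
    }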

When the memory tester is biased towards allocating memory, the number of bytes to allocate is computed by
retrieving a random value from a triangular distribution positioned such that its peak lies to the right of the probability axis.
Notice that a portion of the right distribution in the figure intrudes into the negative area left of the probability axis. A
value retrieved from this portion indicates that so many bytes are to be freed from the heap. Referring back to Figure
Eight, imagine that the tester is currently conducting test one, which means that it is biased towards allocating
memory. Imagine further that it is retrieving values from the right distribution in Figure Nine. Some of the values will
be positive and some will be negative, but overall, memory will tend to be allocated at a rate equal to the average
allocation size, which is the slope of the centerline in Figure Eight and thus the general slope of the allocation
pattern.

Once the upper allocation limit is reached, the tester will reverse its bias towards freeing memory and values will be
chosen from a triangular distribution positioned such that its peak lies to the left of the probability axis. Imagine, in this case, that values are retrieved from the distribution to the left in Figure Nine. Values that fall to the right of the probability axis will cause that many bytes to be allocated, but the trend will be to free memory. The slope of the
centerline tracked by this leg of the allocation pattern will equal the average deallocation size. Any choppiness in the
pattern will be caused by the distribution intersecting the probability axis, with the extent thereof being proportional to
the percent of the distribution that is divided by this axis.

The average allocation and deallocation sizes are not specifiable parameters. Instead, the user chooses a range of
average sizes. Actually, he chooses average allocation size lower and upper limits, and average deallocation size
lower and upper limits (of course, exact average allocation and deallocation sizes could be specified by setting
the lower and upper limits equal to each other). The system uses these limits as follows: In Figure Eight, the slope of
the dashed line changes at test boundaries. When the trend is upwards, the slope, during a particular test, is the
current average allocation size; and conversely, when the trend is downwards, the slope, bounded by an individual
test, is the current average deallocation size. When the bias is to allocate memory, the tester, at the beginning of
each test, retrieves a random value between the average allocation size lower and upper limits, and for the duration
of the test, that becomes the current average allocation size. Likewise, when the bias is to deallocate memory, before
each test, the tester retrieves a random value between the average deallocation size lower and upper limits, which becomes the test’s current average deallocation size. Actually, both current average allocation and deallocation sizes
are computed before each test, to allow the bias to change mid-test.

The tester also allows allocation sizes to be determined via a flat probability distribution, but none of the simulations
referenced herein used this option. The triangular distribution allows the tester to focus on block sizes towards the
middle of a range while still including blocks at the edge of the range, but at a much lower frequency.

The specific heap functions called by the tester are as follows:

Operation                       Taiyen                              Other
Allocating a static block       CHeapManager_pmAlloc()              malloc()
Freeing a static block          CHeapManager_Free()                 free()
Allocating a dynamic block      CHeapManager_pmSetDynBlockSize()    malloc()
Freeing a dynamic block         CHeapManager_pmSetDynBlockSize()    free()
Expanding a dynamic block       CHeapManager_pmSetDynBlockSize()    realloc()
Shrinking a dynamic block       CHeapManager_pmSetDynBlockSize()    realloc()

Two parameters control the aggregate heap allocation:

1) Lower Allocation Limit
2) Upper Allocation Limit

Five parameters control individual allocations and deallocations:

1) Maximum Allocation Size
2) Average Allocation Size Lower Limit
3) Average Allocation Size Upper Limit
4) Average Deallocation Size Lower Limit
5) Average Deallocation Size Upper Limit

Five parameters control dynamic block operations:

1) Dynamic Block Operation %
2) Dynamic Block Size Lower Limit
3) Dynamic Block Size Upper Limit
4) Average Percent to Increase Block Size
5) Average Percent to Decrease Block Size

One parameter controls the number of worker threads used to access a heap at once:

1) Number of Worker Threads

Other parameters exist that control the byte interval at which data is written to allocated blocks and control how many
operations are completed per test, how many simulations are conducted in a row, etc. The parameters listed above,
however, are the ones of most concern.

When the dynamic block percentage is greater than zero, the tester apportions that percentage of operations to the
creation and resizing of dynamic blocks. When an operation has been tagged as dynamic, the tester tries to balance
the average percentages to increase and decrease dynamic block sizes (referred to below as percentages A and B)
with the number of bytes to be added to or taken away from the heap (referred to as X) for the operation at hand. When X
bytes are to be added to the heap, the following procedure is executed:

1) If there are fewer than DYN_BLOCK_THRESHOLD dynamic blocks allocated by the thread, a dynamic block of X bytes is allocated.
2) The tester looks for a dynamic block that, were it increased in size by A percent, would grow by exactly X bytes.
3) If a dynamic block of this size doesn’t exist, the next dynamic block lower in size is chosen. If one is found, it is expanded by A percent rather than by X bytes.
4) If no dynamic blocks of lower size are found, the tester looks for the next dynamic block of larger size. Once found, the tester will expand it by A percent or to its maximum size, whichever is less.
5) If only dynamic blocks of maximum size are found, a new dynamic block of X bytes is allocated.

When an operation has been tagged as dynamic and X bytes are to be removed from the heap, the procedure is as
follows:

1) If there are no dynamic blocks in the heap and the dynamic block operation percentage is less than 100, an attempt is made to find a static block to free. If the operation percentage is 100 or no static blocks exist, the operation is ignored.
2) The tester looks for a dynamic block that, were it decreased in size by B percent, would contract by exactly X bytes.
3) If a dynamic block of this size doesn’t exist, the next dynamic block larger in size is chosen. If one is found, it is contracted by B percent rather than by X bytes.
4) If no dynamic blocks of larger size are found, the tester looks for the next dynamic block of lower size.
5) Once a block is found, if shrinking it by B percent would reduce its size below the lower limit, the block is freed.

Using these procedures, the tester will generally allocate dynamic blocks until a critical mass is reached, and then
continually resize the same pool of blocks. Occasionally blocks in the pool are freed, and when this occurs, they are
soon replaced, which keeps the pool at roughly a constant size. When resizing dynamic blocks, the tester is biased
towards honoring percentages A and B over size request X. Similar procedures are used to handle static blocks as
well. When X bytes are to be added to the heap, a static block of X bytes is allocated. When X bytes are to be freed
from the heap, a search is made for a block of X bytes in size, and if one is not found, the tester will search half the
time for one of smaller size and half the time for one of larger size. The tester uses doubly linked lists to link blocks of the same size class. Separate lists exist for static and dynamic blocks, and when a block must be retrieved to either resize or free it, the tester doesn’t simply grab the first block off the top of a list; it retrieves a block at a random depth and randomly chooses whether to walk forwards or backwards through the list to this depth.
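
The retrieval step can be pictured with the small sketch below; it is illustrative only, and the std::list container simply stands in for the tester's internal doubly linked lists:

    #include <cstddef>
    #include <cstdlib>
    #include <iterator>
    #include <list>

    struct Block { void *pm; size_t uwSize; };

    /* Pick a block from a size-class list at a random depth, walking from either end. */
    Block *PickRandomBlock(std::list<Block> &sizeClass)
    {
        if (sizeClass.empty())
            return 0;
        size_t depth = (size_t)std::rand() % sizeClass.size();
        if (std::rand() & 1) {                              /* walk forwards from the head */
            std::list<Block>::iterator it = sizeClass.begin();
            std::advance(it, depth);
            return &*it;
        }
        std::list<Block>::iterator it = sizeClass.end();    /* walk backwards from the tail */
        std::advance(it, -(long)(depth + 1));
        return &*it;
    }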

Volatile Heap Performance Tests


To eliminate paging during the Solaris tests, the following entry was placed in /etc/system:

set tune_t_fsflushr 1048576

This sets the number of seconds between dirty page scans to 1,048,576.

Most Unix distributions don't allow anonymous mappings, and if this is the case, named files will be needed to back
volatile Taiyen heaps (named files are always needed to back persistent Taiyen heaps). If a named file is being used,
the daemon that flushes the dirty pages of mapped files must either be stopped or the interval at which it awakens
must be set high enough (or the daemon must be otherwise tuned) to avoid excess paging. On Solaris, this daemon is
fsflush, and the parameter that governs the interval between flushes is tune_t_fsflushr.

To eliminate paging during the Windows tests, the process working set size was set to the size of the heap via
SetProcessWorkingSetSize( ).
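
That call is a single line; a brief sketch (the 128 MB heap size shown is illustrative) follows:

    #include <windows.h>

    /* Pin the working set to the size of the heap so its pages are not trimmed and paged out. */
    void PinWorkingSetToHeap(void)
    {
        SIZE_T uwHeapSize = 128 * 1024 * 1024;
        SetProcessWorkingSetSize(GetCurrentProcess(), uwHeapSize, uwHeapSize);
    }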

There are two testing regimens, one for volatile heaps and one specifically for persistent heaps. The volatile heap
testing regimen involved running a series of simulations back to back, with each simulation testing a different area of
performance. The first simulation, for example, specifically tests block access speed for small blocks. The simulations
after that test single-threaded performance for static blocks of increasing size, then dynamic blocks of increasing size.
These simulations are then repeated to test multithreaded performance. The series is completed by a set of
simulations that test the resizing of large dynamic blocks. Each simulation comprised 512 tests, and each test comprised 65536 operations. The tester times each operation and each test. At the end of each simulation
all of the blocks are freed; this is a timed event as well.

The regimen works as follows: First, a heap is created and then an input file is read that contains the parameters for
all of the simulations in the series. Next, the main thread spawns a number of worker threads that enter the test
procedure and perform the heap operations. The main thread remains blocked until all of the worker threads exit the
procedure, marking the completion of one test, at which time the main thread is awakened. The thread then tabulates
a variety of statistics and writes the output to the screen, and every so often to an output file. The main thread then
loops until all of the tests in the simulation are complete. It then frees all of the blocks in the subject heap, and starts
the next simulation with a new set of parameters. The process continues until either all of the simulations are
complete or the heap blows up, whichever comes first.

We used the following input files for the three platforms on which we tested:
32-bit Windows on IA32 (4 CPUs): SimulationInputDataVolatile_4CPUs_36.h, consisting of 36 simulations.
64-bit Windows on IA64 (2 CPUs): SimulationInputDataVolatile_2CPUs_26.h, consisting of 26 simulations.
64-bit Solaris on USPARC64 (8 CPUs): SimulationInputDataVolatile_8CPUs_46.h, consisting of 46 simulations.

The following are the resulting output files. The first file contains the Taiyen results and the second the CRT results:
32-bit Windows on IA32 (4 CPUs): Taiyen_Vol_IA32_Win_Sim_1-36.txt, Windows32_Sim_1-36.txt.
64-bit Windows on IA64 (2 CPUs): Taiyen_Vol_IA64_Win_Sim_1-26.txt, Windows64_Sim_1-26.txt.
64-bit Solaris on USPARC64 (8 CPUs): Taiyen_Vol_USPARC64_Solaris_Sim_1-46.txt, Solaris64_Sim_1-46.txt.

The difference in the number of simulations stems from there being more multithreading tests on those systems that
have more CPUs; otherwise, each has identical input parameters. Figure 10 below shows the input parameters for
simulations one through seven and the last five simulations.

Columns: Simulation; Lower and Upper Allocation Limits; Maximum Allocation Size; Average Allocation Size Lower and Upper Limits; Average Deallocation Size Lower and Upper Limits; Dynamic Block Operation %; Dynamic Block Size Lower and Upper Limits; Average % to Increase and Decrease Block Size; Number of Worker Threads.

Sim | Lower | Upper  | Max    | AllocLo | AllocUp | DeallocLo | DeallocUp | Dyn% | DynLo | DynUp  | %Inc | %Dec | Threads
1   | 96 MB | 128 MB | 128 B  | 16 B    | 16 B    | 16 B      | 16 B      | 0    | N/A   | N/A    | N/A  | N/A  | 1
2   | 96 MB | 128 MB | 512 B  | 16 B    | 42 B    | 16 B      | 42 B      | 0    | N/A   | N/A    | N/A  | N/A  | 1
3   | 64 MB | 128 MB | 1 KB   | 16 B    | 85 B    | 16 B      | 85 B      | 0    | N/A   | N/A    | N/A  | N/A  | 1
4   | 32 MB | 128 MB | 4 KB   | 16 B    | 341 B   | 16 B      | 341 B     | 0    | N/A   | N/A    | N/A  | N/A  | 1
5   | 32 MB | 128 MB | 16 KB  | 16 B    | 1365 B  | 16 B      | 1365 B    | 0    | N/A   | N/A    | N/A  | N/A  | 1
6   | 32 MB | 128 MB | 64 KB  | 16 B    | 5461 B  | 16 B      | 5461 B    | 0    | N/A   | N/A    | N/A  | N/A  | 1
7   | 96 MB | 128 MB | 512 B  | 16 B    | 42 B    | 16 B      | 42 B      | 50   | 16 B  | 512 B  | 50   | 50   | 1
8   | 64 MB | 128 MB | 1 KB   | 16 B    | 85 B    | 16 B      | 85 B      | 50   | 16 B  | 1 KB   | 50   | 50   | 1
9   | 32 MB | 128 MB | 4 KB   | 16 B    | 341 B   | 16 B      | 341 B     | 50   | 16 B  | 4 KB   | 50   | 50   | 1
10  | 32 MB | 128 MB | 16 KB  | 16 B    | 1365 B  | 16 B      | 1365 B    | 50   | 16 B  | 16 KB  | 50   | 50   | 1
11  | 32 MB | 128 MB | 64 KB  | 16 B    | 5461 B  | 16 B      | 5461 B    | 50   | 16 B  | 64 KB  | 50   | 50   | 1
12  | 96 MB | 128 MB | 512 B  | 16 B    | 42 B    | 16 B      | 42 B      | 0    | N/A   | N/A    | N/A  | N/A  | 2
N-4 | 1 MB  | 128 MB | 64 KB  | 16 B    | 5461 B  | 16 B      | 5461 B    | 75   | 16 KB | 64 KB  | 150  | 50   | 1
N-3 | 1 MB  | 128 MB | 256 KB | 16 B    | 21.8 KB | 16 B      | 21.8 KB   | 75   | 16 KB | 256 KB | 150  | 50   | 1
N-2 | 1 MB  | 128 MB | 1 MB   | 16 B    | 87.4 KB | 16 B      | 87.4 KB   | 75   | 16 KB | 1 MB   | 150  | 50   | 1
N-1 | 1 MB  | 128 MB | 4 MB   | 16 B    | 35.0 KB | 16 B      | 35.0 KB   | 75   | 16 KB | 4 MB   | 150  | 50   | 1
N   | 1 MB  | 128 MB | 16 MB  | 16 B    | 1.40 MB | 16 B      | 1.40 MB   | 75   | 16 KB | 16 MB  | 150  | 50   | 1

Figure 10

This figure actually represents the 32-bit simulation input parameters. To get the 64-bit simulation parameters, simply
replace 16 B with 32 B in the table. Simulation one is distinct from the others in that it is specifically designed to test
block access speed of small blocks (mostly <= cache line size). Each block is completely populated with data upon
allocation, and then after every test, all of the blocks in the heap are scanned to get a checksum, with each block-
scan being timed. Simulations 2 - 6 are designed to test single-threaded static block allocation speed. Simulations 7 -
11 simply repeat simulations 2 - 6 with a dynamic block operation percentage of 50%. Simulations 12 - 21 repeat
simulations 2 - 11 with two worker threads instead of one; simulations 22 - 31 repeat with four threads, and so on.
The input file targeting a two-CPU system, for example, will stop at the two-thread simulations, leaving 1 - 21 plus the
last five for a total of 26. The last five simulations are designed to test the handling of large dynamic blocks. By
default, data is written to the low and high system words of each allocated block. Beyond that, the tester allows an
interval to be set whereby additional block data is written. As mentioned, the first simulation writes data to the entire
block upon allocation, so the interval it uses is a system word. Simulations 2 - (N-5), however, use an interval of a
page, and simulations (N-4) - N do not write data to the interior pages. The reason why each allocated block isn't
completely populated or why the larger blocks in the last five simulations are barely populated at all is simply to
reduce total simulation time. For instance, the Windows CRT simulation, Windows32_Sim_1-36.txt took 612 minutes
to complete as is. For the middle simulations, it was important to at least touch each block page in order to measure
page efficiency for these cases.

On IA32, the Taiyen system completed its series of 36 simulations in 140 minutes and, as noted, the Windows CRT
completed the same series in 612 minutes. On IA64, the Taiyen system completed its series of 26 simulations in 65
minutes, and the Windows CRT completed the same series in 220 minutes. On USPARC64, the Taiyen system
completed its series of 46 simulations in 110 minutes, and the Solaris CRT completed the same series in 321
minutes. The IA32 simulations took proportionally longer than those on the other machines because after each test,
page efficiencies were calculated. Page efficiency wasn't calculated on the other platforms because doing so required
a special device driver that was only available for 32-bit Windows. This is not a concern because the page
efficiency measured on one platform can be extrapolated to another.

Notably absent are results from mtmalloc and Hoard. Unfortunately, neither of these two allocators was able to make it through its respective simulation series. mtmalloc made it through simulation 11 and then caused a bus error after consuming too much disk space. Hoard only made it to simulation 6 before it committed so much physical storage that the system slowed to a crawl because of paging. As mentioned, mtmalloc is a power-of-two allocator, so right off
the bat it consumes excess memory and yields poor availability. Hoard also demonstrates poor availability as predicted by its aggregation strategy. On the other hand, all three CRT allocators as well as the Taiyen system
showed very good robustness by being able to make it through the testing regimen unscathed. What is also evident is
that mtmalloc and Hoard are specialty allocators designed for a very narrow purpose - SMP scalability for a small
range of block sizes. The Taiyen system, in contrast, also satisfies this need, but without sacrificing robustness. In
essence, it delivers the robustness of the standard general-purpose allocator at the speed of one of these specialty
allocators.

Volatile Heap Test Results


SMP Scalability

Figure 11: SMP scalability charts. Each panel plots relative operation speed for Taiyen, Solaris USPARC64, Windows IA32, and Windows IA64 against 1, 2, 4, and 8 threads. Top row: Performance for Static Blocks (<= 512 B) and Performance for Static Blocks (<= 1024 B). Bottom row: Performance for Static Blocks (<= 4096 B) and Performance for Dynamic/Static Blocks (50/50) (<= 4096 B).
Figure 11 shows the nearly linear scalability of the Taiyen system across the span of block sizes most likely to be
utilized by the typical multithreaded application. On eight threads, the Taiyen system proved to be up to 50 times
faster than the Solaris CRT and on four threads, up to 30 times faster than the Windows CRT. This is no surprise
given the serial nature of the CRT heaps. On 32-bit Windows, the Taiyen system showed a scalability of 98% on the
4-CPU host, with 100% being the perfect linear scaling demonstrated when, for example, running four threads yields
four times the speed of one thread. On 64-bit Windows, the Taiyen system showed 100% scalability, although to be
fair, the host machine had only two CPUs. On 64-bit Solaris, the Taiyen system showed up to 90% scalability on the
8-CPU host.

The tests represented by the charts in the top row closely mimic the standard performance tests conducted on
allocators - tests that concentrate on a narrow range of small block sizes. The charts on the bottom row show test
results for an increased average block size and more size variation, and for dynamic block operations. For the case
where the maximum static block size was 512 bytes, the average block size was on the order of 80 bytes, the specific
number depending on the platform and particular simulation. When the maximum static block size was 1024 bytes,
the average block size was on the order of 130 bytes. And when the maximum static block size was 4096 bytes, the
average block size was on the order of 400 bytes. Recall from the discussion of the triangular probability distribution
that the size and offset of the triangle are based on the maximum block size and average allocation/deallocation
upper and lower limits. Given the parameters used here, a significant portion of the triangle resides on the left side of the vertical axis, which means that blocks are continually allocated and freed regardless of overall bias, and that the
average block size will be significantly less than the maximum block size divided by two.

Figure 12: Performance for Static Blocks (8 threads) and Performance for Dynamic/Static Blocks (50/50) (8 threads). Each panel plots operations/sec for Taiyen and Solaris (mtmalloc) against maximum block sizes of 512 B, 1024 B, 4096 B, 16384 B, and 65536 B.
Running multithreaded tests against CRT heaps really isn't that interesting, so specific multithreaded tests were
conducted on the specialty heaps, mtmalloc and Hoard. To gauge Taiyen's performance against mtmalloc, the eight-
threaded USPARC64 simulations 32 - 41 were run back-to-back on mtmalloc, yielding the charts in Figure 12. While mtmalloc was not able to complete the full series of 46 simulations, it did complete these 10 without error. On
average, the Taiyen system proved to be over twice the speed of mtmalloc. The results of these tests are provided in
Solaris64_Sim_mtmalloc_32-41.txt.

Figure 13: Performance for Static Blocks (4 threads) and Performance for Dynamic/Static Blocks (50/50) (4 threads). Each panel plots operations/sec for Taiyen (Win32) and Hoard (Win32) against maximum block sizes of 512 B, 1024 B, 4096 B, 16384 B, and 65536 B.
Surprisingly, Hoard did not fare as well as mtmalloc. Since Hoard was not able to run all 36 IA32 simulations, an attempt was made to run the four-threaded simulations, 22 - 31, but it failed to run these back-to-back. In fact, it failed to complete the 512 tests that constitute a simulation in all but simulations 22, 27, 29, and 30. Either the simulations would lock up, most likely because of thread deadlock, or the simulation proceeded so slowly that it was impractical to let it run its full course. Simulations 26 and 31 suffered from the latter and took 21.6 and 10.4 minutes, respectively, to complete shortened runs of 256 tests. To be fair, these two simulations tested block sizes larger than those for which Hoard was designed. The maximum block size for both was 64 KB, the average block size for simulation 26 wound up being 1964 bytes, and the average block size for simulation 31 wound up as 2700 bytes. To successfully run simulations 23 - 25 and 28, the number of tests was set below the point where the tester would lock. In terms of performance, as shown in Figure 13, Hoard was almost in line with Taiyen until the maximum block size and average block size grew beyond 4096 and 400 bytes, respectively. This was to be expected, as Hoard was optimized for
small blocks. The Hoard results are provided in files Hoard_IA32_Win_Sim_22.txt - Hoard_IA32_Win_Sim_31.txt.

Page Efficiency

Figure 14: Page Efficiency for Static Blocks (IA32), left, and Page Efficiency for Dynamic/Static Blocks (50/50) (IA32), right. Each panel plots average page efficiency (%) for Taiyen (1 thread), Taiyen (4 threads), Windows (1 thread), and Windows (4 threads) against maximum block sizes of 512 B, 1024 B, 4096 B, 16384 B, and 65536 B.
Figure 14 shows the page efficiencies achieved by the Taiyen system and Windows CRT for a variety of block sizes,
and for both single and four-threaded configurations. The page efficiencies are measured by utilizing a spy device
that scans each page of the heap after each test. The current page efficiency is then computed by dividing the number of bytes stored in the heap by the aggregate capacity of the tabulated dirty pages. A dirty page is identified as such by checking bits in its page table entry (PTE). The other bits of interest in the PTE are the present bit, which identifies
whether a page is present in RAM or paged to the disk, and the accessed bit, which is set when a page is touched.
For good measure, the memory tester checks these bits as well and tabulates the present and accessed pages. If
paging occurs at any point during the tests, the accuracy of the page efficiency will be thrown off because the dirty
bits of the pages that were written to disk will have been cleared. To prevent paging, the process working set size is
set to the size of the heap. Note that this will only work on Windows, and as mentioned, on Unix, paging is prevented
by suppressing the mapped file flushing (or update) daemon, which on Solaris is fsflush. Because a Taiyen heap
stores all of its internal structures as well as the stored bytes within the heap space, the calculated page efficiency,
which accounts for these internal structures, is a true gauge of heap overhead. It is unclear whether the
measurement of the Windows CRT heap page efficiency included its internal structures, but given the amount of data
stored, either inclusion or omission would likely have a marginal effect on the net value. Page efficiencies have only
been provided for the 32-bit Windows heaps because we only wrote a spy device for this platform.
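
As a simple illustration of that calculation (this is not the spy device's code), page efficiency reduces to:

    /* Bytes stored in the heap divided by the capacity of the tabulated dirty pages,
       expressed as a percentage. */
    double PageEfficiency(unsigned long long uwBytesStored,
                          unsigned long long uwDirtyPages,
                          unsigned long long uwPageSize)
    {
        return 100.0 * (double)uwBytesStored / (double)(uwDirtyPages * uwPageSize);
    }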

The charts show that with respect to block size, the Taiyen and Windows CRT page efficiencies are in-line with each
other and interestingly, trend together. The only departure in the trends occurs in the left chart for a maximum block
size of 64 KB and in the right chart for a maximum block size of 16 KB. The congruence between the two systems, in
terms of page efficiency, is most likely a result of both systems using a best-fit free block strategy. Another factor is
that both the Taiyen system and, evidently, the CRT decommit unused pages, something that isn't done by the specialty
allocators, mtmalloc and Hoard.

In the left chart, the Taiyen page efficiency is slightly lower than Windows for the smaller block sizes and slightly
higher for the larger near page and multi-page sized blocks. This is attributable to Taiyen's strict alignment policy that
places these larger blocks along page boundaries, which tends to improve page efficiency. Conversely, the same
alignment policy that aligns smaller blocks optimally along CPU cache lines tends to reduce page efficiency because
it winds up spreading blocks out more than they would be otherwise. In the right chart, which includes dynamic block
test results, the disparity between Taiyen and the CRT is greater for the larger block sizes because of the
fundamental difference in how the two handle dynamic blocks, a topic to be covered below.

The figure also demonstrates a critical feature of the Taiyen system, which is that page efficiency does not suffer
appreciably when the heap is configured for multithreaded use. As noted earlier, the compartmentalization strategy
used by the Taiyen system to handle multithreaded access does, by its nature, reduce overall availability but not at a
severe cost to page efficiency. As the figure shows, when the heap is configured for four-CPU access, page
efficiency only drops a few percent across the board.

Interestingly, the CRT showed a similar drop-off in page efficiency when engaged by multiple threads. One would
expect a serialized heap’s page efficiency to be independent of the number of accessing threads. One possible explanation of why this wasn't the case is that, when employing multiple threads, the memory tester churns the target heap more severely than when employing a single thread. This occurs because the threads drift in and out of sync: they all start off biased towards allocating memory, but soon one thread is biased towards allocating memory while a second is biased towards freeing memory, and so on. The charts show that the effect is magnified
when the larger blocks are introduced and especially when the larger blocks are dynamic.

The Taiyen system is flexible and setting various build definitions can increase page efficiency, but at a cost to speed.
This is where careful tuning comes in. For example, the page efficiencies in the charts could have been increased by
up to 10% if the Taiyen system were configured to decommit pages more aggressively, but doing so would have cut
the speed of some multithreaded operations in half. The slowdown occurs because the VMM uses a global lock that allows only one thread access at a time, making the decommitment of pages a serialized event. To speed the process, in many cases, the Taiyen system will discard a page by clearing its dirty bit instead of decommitting the page outright. This will leave the page in the process working set while at the same time making it a prime candidate
for procurement by the VMM in the event that pages are needed elsewhere. In other cases, especially when large
multi-page blocks are freed, the unoccupied pages are fully decommitted, releasing them from the working set. The
logic is that the pages used to support the smaller sub-page blocks are much more likely to be quickly reused than
those for multi-page blocks. To reduce contention in the VMM, in many cases, the Taiyen system doesn't release just one page at a time; it releases a block of pages.
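
On Windows, one plausible mapping of these two behaviors onto the VMM is sketched below; this paper does not name the exact calls the Taiyen system uses, so the sketch is illustrative only:

    #include <windows.h>

    /* Mark the contents of pages likely to be reused soon as no longer needed: the pages stay
       committed and in the working set, but the VMM may discard rather than page them out. */
    void DiscardPages(void *pmBase, SIZE_T uwBytes)
    {
        VirtualAlloc(pmBase, uwBytes, MEM_RESET, PAGE_NOACCESS);
    }

    /* Fully decommit the pages of a freed multi-page block, releasing them from the working set. */
    void DecommitPages(void *pmBase, SIZE_T uwBytes)
    {
        VirtualFree(pmBase, uwBytes, MEM_DECOMMIT);
    }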

Single-Threaded Performance

Figure 15: Performance for Static Blocks (1 thread) and Performance for Dynamic/Static Blocks (50/50) (1 thread). Each panel plots relative operation speed for Taiyen, Solaris USPARC64, Windows IA32, and Windows IA64 against maximum block sizes of 512 B, 1024 B, 4096 B, 16384 B, and 65536 B.
Figure 15 shows the performance of the Taiyen system versus the CRT allocators on even ground - single threaded
mode. With all of the allocators likely pursuing a best-fit strategy, differences in performance are attributed to
differences in underlying algorithm and implementation. With respect to static blocks, the Taiyen system averaged
over twice the speed of the Solaris allocator, and well over four times the speed of the Windows allocators. With
respect to dynamic blocks, the Solaris allocator at one point falls in line with the Taiyen allocator. The reason for this
stems from the practice of Solaris (and that of others) of shrinking blocks in place, something the Taiyen system
doesn't always do.

At times, the Windows CRT showed a curious behavior: it would be running along fine and then all of a sudden stall
out, continue running, and then stall out again; and while it did execute the first simulation without stalling, the others
were affected. Sometimes it would stall while the tester was biased to allocate memory and sometimes it would stall
while biased to free memory, so it is unclear what the underlying cause was.

Dynamic Block Performance

Figure 16: Performance for Multi-Page Dynamic Blocks. Each panel plots operations/sec (75% dynamic) against maximum dynamic block sizes of 256 KB, 1 MB, 4 MB, and 16 MB. Upper panels: Taiyen vs. Windows IA32 and Taiyen vs. Windows IA64; lower panel: Taiyen vs. Solaris USPARC64.
Figure 16 shows the results of the large multi-page dynamic block tests. Previous figures showed dynamic block
performance for blocks less than or equal to 64 KB in size. The dynamic block operations performed by the tester
were all fairly aggressive and constitute more of a stress test than a real world simulation. For example, in the
simulations portrayed by Figure 16, dynamic blocks were expanded on average by 130 - 140% and contracted by
50%. In the other simulations, the dynamic blocks were both expanded and contracted by 50% on average. Dynamic
arrays and similar structures built upon dynamic blocks are not likely to expand and contract by these amounts on a
continual basis, but these dynamic block simulations weren't designed to test the typical scenario. They were
designed to test the more extreme case because much programming effort is often made in anticipation of this case.

For instance, what happens when a 256-MB array has to expand and is subject to be relocated in memory? This
event, while possibly unlikely, may introduce so much latency that the thought of using a contiguous 256-MB array
has to be scrapped. Sometimes, to handle this case, a segmented array is used, which is composed of a smaller
contiguous array of pointers to fixed-sized sub-arrays that may be located anywhere in memory. Expanding an array
of this type simply entails allocating one or more sub-arrays and appending their addresses to the pointer array; and
conversely, shrinking the array entails freeing one or more of these sub-blocks at the end of the pointer array. The
advantage of this algorithm is that expansions and contractions will consistently execute quickly because there is no
threat of any significant relocation. On occasion, the pointer array may need to be expanded or contracted, and
possibly relocated as a consequence, but the array will be relatively small and the relocation insignificant. The biggest
disadvantage of this approach stems from the fact that the dynamic block will not be contiguous in memory, and thus
accessing its data will not be possible through a pointer and an offset. Reading and writing each datum in the block
will require two lookups: one to the pointer array and a second to the sub-array where the datum is stored. The other
disadvantage is that a segmented array can only expand and shrink at the segment granularity. What if the segment
size is a page and the array shrinks to 64 bytes? To handle this case, additional code may be needed to consolidate
the segmented array into a contiguous array when the array size drops below a certain threshold.
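
A minimal sketch of such a segmented array (illustrative only; the segment size and names are assumed) shows the two lookups required for every element access:

    #include <cstddef>

    const size_t uwSegmentBytes = 4096;    /* one page per sub-array (assumed) */

    struct SegmentedArray {
        unsigned char **ppmSegments;       /* contiguous pointer array               */
        size_t uwSegmentCount;             /* number of fixed-size sub-arrays in use */
    };

    /* Two lookups per access: one into the pointer array, one into the sub-array. */
    inline unsigned char &At(SegmentedArray &a, size_t uwIndex)
    {
        return a.ppmSegments[uwIndex / uwSegmentBytes][uwIndex % uwSegmentBytes];
    }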

Another practice for handling dynamic arrays when using the stock allocator is to allocate a static block larger than
the size initially needed for the array, and to then reallocate the block each time the array pierces the block’s upper
bound, increasing the static block’s size by some factor with each reallocation. Logic is also added to contract the
containing block in response to a shrinking array. The motivation behind the exercise is to reduce the frequency of
relocations, which occur when a block cannot expand in place. The problem with the approach is primarily the hit to
page efficiency. Take, for example, a page-sized array that uses a two-page static block for a container. Admittedly,
committing the second page is not a problem because it won't be made present in memory until it is written. The
problem occurs when the array has expanded into the second page and shrinks back into the first, leaving the second
page committed with abandoned data. And the problem is magnified when multiple pages are involved. Secondarily,
there is a problem with the stock allocators in that they shrink all blocks in place, a problem to be touched upon
below. Thirdly, again, the logic to handle all of this requires more work from the programmer.

A key advantage of the Taiyen system's native support for dynamic blocks is that the programmer need not be
concerned with these issues. The programmer can be assured that relocating a large multi-page dynamic block will
not cripple the system. The PTE redirection mechanism was designed for this specific task and as noted, it can even
move blocks that have been paged to disk without having to read them back into memory beforehand. And because
Taiyen dynamic blocks have built-in expansion buffers, the programmer doesn't have to add these externally. He can
be confident that the system will carefully manage these buffers, striking a balance between performance and page
efficiency. Furthermore, because the system has control over the buffers, it can decommit pages that are no longer
needed.

To direct attention back to the test results, understand that the severity of the tests was meant to assure the programmer
that the Taiyen dynamic block mechanism is fully capable of handling the extreme case, which of course is inclusive
of the typical case. To best comprehend what is happening in these tests, examine some of the raw tester output
data for the case where the maximum dynamic block size was 16 MB (recall from the input data that the minimum
block size was 16 KB):

IA32
                                                          Taiyen            Windows CRT
Average number of operations per second:                  10268.775088      107.128892
Dynamic block operation %:                                 77.400316         77.400316
% of dynamic block operations that are resizings:          98.737579         98.737579
Average dynamic block size:                                5769980.038095    5769980.038095
Average % dynamic block increase:                          141.070016        141.070016
Average % dynamic block decrease:                          49.939666         49.939666
Number of dynamic block expansions with relocation:        138748            331469
Number of dynamic block contractions with relocation:      290000            0
% of dynamic block expansions in place:                    17.320844         9.7049
% of dynamic block contractions in place:                  45.006731         100.000000

IA64
                                                          Taiyen            Windows CRT
Average number of operations per second:                  11766.956211      283.272220
Dynamic block operation %:                                 77.271404         77.271404
% of dynamic block operations that are resizings:          99.579572         99.579572
Average dynamic block size:                                5022911.264286    5022911.264286
Average % dynamic block increase:                          141.497440        141.497440
Average % dynamic block decrease:                          49.946274         49.946274
Number of dynamic block expansions with relocation:        261281            340703
Number of dynamic block contractions with relocation:      246816            0
% of dynamic block expansions in place:                    23.318865         9.978
% of dynamic block contractions in place:                  43.805838         100.000000

USPARC64
                                                          Taiyen            Solaris CRT
Average number of operations per second:                  5326.738925       872.552256
Dynamic block operation %:                                 77.256705         77.824409
% of dynamic block operations that are resizings:          99.027981         97.301218
Average dynamic block size:                                5245354.166667    6512988.278571
Average % dynamic block increase:                          142.749297        141.289272
Average % dynamic block decrease:                          49.907680         49.842688
Number of dynamic block expansions with relocation:        252415            56413
Number of dynamic block contractions with relocation:      255778            0
% of dynamic block expansions in place:                    23.825461         81.988992
% of dynamic block contractions in place:                  42.211167         100.000000

In each case, about 77% of the operations were dynamic and of those operations, virtually all were resizings. The
average dynamic block size ranged between 5.0 and 6.5 MB, and on average, dynamic blocks were expanded by
141% and contracted by 50%. Notice that the allocator independent data such as dynamic block operation % is
congruent between the Taiyen system and Windows CRT, but not so with respect to the Solaris CRT. The memory
tester seeds the random number generator identically for each subject allocator so that each is tested on equal
footing, but the Solaris random number generator apparently uses a seeding method such as system time that
perhaps doesn't result in congruence. Regardless, enough operations were performed to gauge relative performance.

Besides operation speed, the most glaring difference between the Taiyen system and the CRT implementations is that the
Taiyen system does relocate dynamic blocks upon contraction. Recall that the Taiyen system considers each
dynamic block resizing as an opportunity to reduce fragmentation. This is most critical when a block is shrinking
because, as Figure 17 shows, this can lead to worst-case fragmentation. In the figure, four dynamic blocks of equal
size are contracting by equal amounts. When they contract in place, the heap is left with sparsely located blocks
separated by equally sized free blocks yielding, by mathematical definition, worst-case fragmentation. To guard
against this, the Taiyen system will always relocate shrinking blocks in this kind of scenario. Regardless of the order
in which dynamic blocks are shrunk in the figure, they will always wind up being compacted as shown.

Figure 17: Four equally sized dynamic blocks (DB) after equal contractions; in the CRT heap they remain sparsely
located with equally sized free blocks between them, while in the Taiyen heap they are relocated and compacted.

Defragmentation upon contraction is especially critical for the larger multi-page dynamic blocks where contraction in
place can lead to a severe loss of availability. This would not be practical without PTE redirection. Notice that on
IA32, the Taiyen system performed 290,000 contractions with relocation, which is nearly as many as the 331,469
expansions with relocation performed by the Windows CRT. Without PTE redirection, the operation speed of the Taiyen
system would have fallen by probably two orders of magnitude.

Notice that only between 17 and 24% of the Taiyen expansions were made in place, which shows that the expansion
buffers were not designed to accommodate 141% average dynamic block increases. For blocks of the size tested
here, the expansion buffers only consume virtual address space, not any physical storage. Conceivably, they could
be made larger with little strain on resources, especially on 64-bit systems where virtual address space is plentiful;
but, again, these tests were not meant to mimic the typical real-world case. On Windows, however, where almost no
expansions were made in place, user supplied expansion buffers, as described earlier, are absolutely required.

In contrast to Windows, the Solaris CRT made 82% of the block expansions in place, and only 56,413 relocations,
which is close to a tenth the number of relocations made by the Taiyen system at 508,193. Most likely, Solaris uses
some sort of heuristic to identify dynamic block resizing activity and tries to steer allocations clear of the address
space just above the identified dynamic blocks, effectively appending expansion buffers to these blocks. With this
approach, as dynamic blocks shrink in size, the cost of relocation and conversely the speed of PTE redirection
becomes less of a factor, which is borne out by the lower chart in Figure 16 where, for a maximum block size of 256
KB, Solaris averaged 14,190 operations/sec and the Taiyen system averaged 8,003 operations/sec. During this
simulation, Solaris made 153,823 total relocations and the Taiyen system made almost three times as many at
448,619. Both made nearly the same number of expansions in place, 112,481 for Solaris and 127,507 for the Taiyen
system. Clearly, the difference between the two comes down to how dynamic block contractions are handled.

Our contention is that the increase in fragmentation and loss of availability resulting from contracting dynamic blocks
in place does not justify the modest gain in speed. Taking into account the exaggerated resizing percentages, in real
world conditions, the disparity between the two systems would be much less. Furthermore, based on the low number
of relocations made by Solaris for the 16-MB blocks, too much address space above the dynamic blocks is protected,
with such protection also constituting a loss of availability.

If there is a need for a dynamic block that can be continually resized at the highest possible speed (which would have
to preclude the possibility of relocation), the Taiyen system offers the capability to create a large dynamic block
through the use of the flag, HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG. When operating within its
predetermined expansion buffer, a large dynamic block will always expand and shrink in place. The advantage that
this has over a static block is that, when the dynamic region is shrunk, pages that are no longer touched are
decommitted (actually, their dirty bits are cleared so they remain in the working set), and the block may expand
beyond its initial reserved region. In addition, if it needs to be relocated, it can employ the PTE redirection
mechanism.
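
For a rough illustration, the sketch below shows how such a block might be requested. The l-proc source reproduced
later in this paper suggests that HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG is passed in place of a block
pointer to CHeapManager_pmSetDynBlockSize(); treat the call form, and the sizes shown, as assumptions rather than a
documented signature.

// Hedged sketch: request a large dynamic block by passing the flag in place
// of a block pointer (an inference from the l-proc source shown later).
void *pmLargeBlock = CHeapManager_pmSetDynBlockSize(pcHeapManager,
    (void*)HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG, 16 * 1024 * 1024);

// While the block stays within its predetermined expansion buffer, every
// resizing below occurs in place.
pmLargeBlock = CHeapManager_pmSetDynBlockSize(pcHeapManager, pmLargeBlock, 64 * 1024 * 1024);
pmLargeBlock = CHeapManager_pmSetDynBlockSize(pcHeapManager, pmLargeBlock, 8 * 1024 * 1024);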

Block Access Speed

Figure 18: Block access speed (blocks/sec) for the Taiyen system versus the CRT on IA32_Win, IA64_Win, and
USparc64_Solaris.

As mentioned, all blocks allocated within the Taiyen system are optimally aligned with respect to cache lines and
pages, improving both cache efficiency and page efficiency. The tester measures the cache efficiency of each
allocated block, which is simply defined as the lowest possible number of cache lines that would be touched by a
block, were it optimally aligned, divided by the actual number. In other words, the cache efficiency of a 48-byte block
touching one cache line would be 100% and if that same block touched two cache lines, the cache efficiency would
be 50%, the smallest possible cache efficiency. The tester then computes an average cache efficiency after each
test. The tester also tabulates the percentage of blocks that are misaligned, that is: touch extra cache lines. All of the
tested allocators referenced in this document, including mtmalloc and Hoard, misaligned nearly every block.
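
The metric itself is simple enough to sketch in a few lines of C. The helper below is purely illustrative (it is not part of
the memory tester) and assumes the 64-byte cache line shared by the tested platforms.

#include <stddef.h>

#define CACHE_LINE_SIZE 64

// Illustrative sketch of the cache efficiency metric: the fewest cache lines
// the block could touch if optimally aligned, divided by the lines it actually
// touches at its current address, expressed as a percentage.
double CacheEfficiency(const void *pmBlock, size_t uwSize)
{
    size_t uwOptimalLines = (uwSize + CACHE_LINE_SIZE - 1) / CACHE_LINE_SIZE;

    size_t uwFirstLine   = (size_t)pmBlock / CACHE_LINE_SIZE;
    size_t uwLastLine    = ((size_t)pmBlock + uwSize - 1) / CACHE_LINE_SIZE;
    size_t uwActualLines = uwLastLine - uwFirstLine + 1;

    return 100.0 * (double)uwOptimalLines / (double)uwActualLines; // 100.0 == optimally aligned
}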

The first simulation on every platform, which used a maximum static block size of 128, tested the access speed of the
allocated blocks by completely populating every allocated block, and then after every test traversing the lists that
linked these blocks to retrieve a checksum while timing the event. The average block size on IA32 wound up as 33
bytes and on the 64-bit platforms, 43 bytes, so most blocks were less than the 64-byte cache line of all three
platforms. Figure 18 shows the results of these tests. The only difference between the Taiyen system and the others
was block alignment; all other factors were kept constant. The most notable difference in block access speed
occurred on IA64, where the Taiyen blocks were accessed over 50% faster than the Windows CRT blocks. This was to
be expected because of the Itanium's heavy reliance on cache. We expect that cache alignment will become more of
a factor in the future, where the widening ratio of processor to memory speed will make cache utilization more critical.

Heap Compression
The Taiyen system offers many ways to save a heap to a storage device, with compression being one of the most
straightforward. It works for both volatile and persistent heaps, and beyond simple data preservation, the
compression functions offer a host of other capabilities. There are two possible destinations for a compressed heap:
a buffer and a file. A volatile heap is compressed to a buffer by calling CHeapManager_uwCompressHeapToBuffer()
and to a file via CHeapManager_uwCompressHeapToFile(). In conjunction, a persistent heap is compressed to a
buffer via CPersistentHeapManager_uwCompressHeapToBuffer(), and to a file via
CPersistentHeapManager_uwCompressHeapToFile().

Compressing a heap entails compacting all of its sparsely stored blocks into one contiguous unit. When a debug
library is used, each block's padding is included along with its contents. Compressing a heap to a buffer or file allows
it to be copied to another process or perhaps sent across a network to another computer. Upon decompression, a
variety of a heap's parameters may be reset, allowing it to be moved to a new address, expanded, shrunk or any
combination thereof. The heap may even be defragmented in concert. Before a heap is compressed,
CHeapManager_uwGetHeapInfo() must be called to populate a CHeapInfo object that will include a checksum of the
user data, which, upon decompression, may be verified to ensure the integrity of the compression/decompression
cycle. This object as well as other parametric information will be included in the compressed heap along with the
stored blocks.

Heap Decompression
Once compressed to a buffer or file, a heap may be decompressed by either the volatile or persistent heap manager,
allowing a once volatile heap to be made persistent and vice versa. The system allows a heap to be restored to its
exact address and size, with each block being restored to its address at the time of compression. If, however, the
heap is restored to a new address or size, or it is defragmented during decompression, there is a possibility that
some or all of the stored blocks will find new addresses. This may not pose a problem, but if there are references to
the old addresses in the heap, these references will have to be corrected. The system offers a convenient means of
updating block references, should this be necessary. After a heap is decompressed, its internal structures may be
checked by the powerful function, CHeapManager_uwVerifyHeap(), which performs a detailed scan of all of the
heap's internal structures and will return immediately upon detection of an error. As mentioned, a call to
CHeapManager_uwGetHeapInfo() may be employed at this point to verify the stored checksum.

A heap that has been compressed to a buffer is restored with a call to either
CHeapManager_pcRestoreHeapFromCompressedHeapBuffer() or
CPersistentHeapManager_pcRestoreHeapFromCompressedHeapBuffer().
From a file, restoration is made via CHeapManager_pcRestoreHeapFromCompressedHeapFile()
or CPersistentHeapManager_pcRestoreHeapFromCompressedHeapFile().

A variety of parametric data as well as the embedded CHeapInfo object (combined together
in what's called a CCompressedHeapParameters object) can be extracted from a heap that was
compressed to a buffer via CHeapManager_GetCompressedHeapParametersFromBuffer() and from a heap that was
compressed to a file via CHeapManager_GetCompressedHeapParametersFromFile().
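
A minimal round trip might look like the sketch below. Only the function names come from this paper; the parameter
lists are assumptions made for illustration.

// Hedged sketch of a compress/restore round trip for a volatile heap.
CHeapInfo cHeapInfo;
CHeapManager_uwGetHeapInfo(pcHeapManager, &cHeapInfo);      // records the user-data checksum

CHeapManager_uwCompressHeapToFile(pcHeapManager, hCompressedHeapFile);

// ... later, perhaps in another process or on another computer ...
CHeapManager *pcRestoredHeap =
    CHeapManager_pcRestoreHeapFromCompressedHeapFile(hCompressedHeapFile);

CHeapManager_uwVerifyHeap(pcRestoredHeap);                  // scan the internal structures
CHeapManager_uwGetHeapInfo(pcRestoredHeap, &cHeapInfo);     // recompute and compare the checksum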

Bounds Checking
Bounds checking is enabled by simply linking to one of the debug libraries. As already described, a bounds-checked
static block contains padding above and below its user region. The lower bound padding consists of a four-byte arena
that contains the offset within the block where the upper bound padding is located. This allows the system to quickly
discern if a block has been corrupted. Given a block pointer, the system retrieves the offset and checks the upper
padding bytes at the offset to see if any have been overwritten. A bounds-checked dynamic block doesn’t require
lower padding, because it already has a built-in arena.

The system will check for bounds corruption every time a block is referenced. The debug libraries also offer the
function, CHeapManager_uwDbgEnumerateInvalidBlocks() to explicitly enumerate all of the corrupted blocks in a
heap. The first call to the function will return a tally of such blocks, which in turn is used to allocate a buffer of the size
specified by CHeapManager_uwGetEnumeratedInvalidBlockBufferSize(). A second call to the function will populate
this buffer with the addresses of the offending blocks.
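
The two-call pattern might be sketched as follows; the parameter lists are assumptions based on the description
above.

// Hedged sketch: enumerate the corrupted blocks in a heap (debug builds only).
UW uwNumInvalidBlocks = CHeapManager_uwDbgEnumerateInvalidBlocks(pcHeapManager, NULL);

if (uwNumInvalidBlocks > 0)
{
    UW uwBufferSize = CHeapManager_uwGetEnumeratedInvalidBlockBufferSize(uwNumInvalidBlocks);
    void **ppmInvalidBlocks = (void**)CHeapManager_pmAlloc(pcHeapManager, uwBufferSize);

    // The second call populates the buffer with the addresses of the offending blocks.
    CHeapManager_uwDbgEnumerateInvalidBlocks(pcHeapManager, ppmInvalidBlocks);
}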

The system will catch the case where an arena has been corrupted to the point where the offset stored there points to
an address beyond the high padding byte, where a memory dereference could easily yield an access violation. The
system cannot, however, catch the unlikely case where padding regions just happened to have been overwritten with
their original values. Nor will the system catch out of control behavior where data is written at a random offset beyond
a block's upper or lower padding.

The system does offer another measure to address the corruption of upper bound padding. The value of a padding
byte may be changed from its default value of 0xAA, by setting the global, g_uwDbgUBP to an alternate value.
Control over g_uwDbgUBP allows an application to be checked with a variety of padding values to guard against
coincidental padding corruption. This measure does not, however, address coincidental corruption of lower bound
padding, but to be realistic, the error that the system is really trying to trap here is the common one where elements
are written in an array beyond the array's upper bound.

Because the system may, in some cases, set additional bits in bounds-checked static block arenas, one should
always let the system check for bounds violations. This can be accomplished on a block-by-block basis by calling
CHeapManager_GetBlockInfo().

Memory Leak Detection


Embedded within the upper bound padding of every block will be two items: a block index that will range
between one and DBG_MAX_ALLOCATION_INDEX and the index of the allocating CPU (or sandbox).
The block index will actually be the tally of dynamic and static blocks allocated by a CPU since the heap's
inception (with the allocating block included in the count). Each allocating CPU maintains its own tally, and
when a tally exceeds DBG_MAX_ALLOCATION_INDEX, it rolls back over to zero, leaving the
next allocated block with an index of one. The tallies only move forward. In other words, freeing a
block will not decrement a tally. The functions CHeapManager_uwDbgGetTotNumAllocatedBlocks() and
CHeapManager_uwDbgGetTotNumAllocatedBlocksPerCPU() may be used to retrieve (and optionally reset) net and
per-CPU counts, respectively.

To catch memory leaks, checkpoint the software with calls to CHeapManager_uwDbgEnumerateBlocks(), which
is an extremely fast function that first returns the current count of the number of blocks in a heap.
More detailed information can be obtained by allocating a buffer of the size stipulated by
CHeapManager_uwDbgGetEnumeratedBlockBufferSize() and then recalling the function to populate an array of
CDbgBlockInfo objects. A CDbgBlockInfo object has five members: block address, size, type (static or dynamic),
allocating CPU, and index.

The information found in a CDbgBlockInfo object probably won't be enough to correct a memory leak. Generally, the
state of the system needs to have been known when the un-freed block was allocated. To this end, the system offers
the function, CHeapManager_uwDbgBreakUponAllocation(), which will break execution when a particular CPU
allocates a block of a particular index, revealing the call stack and data values at the point of allocation. This, of
course, assumes the circumstances that lead to the leak are repeatable, which is far from certain, especially with
respect to multi-threaded applications where randomness is introduced by race conditions brought about by the
thread scheduler. However, this and other related debugging functions can prove efficacious.
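
Putting the pieces together, a leak-hunting session might be sketched as follows. Only the function names and the
block-index/CPU-index concepts come from the text; the parameter lists and the indices shown are illustrative
assumptions.

// Hedged sketch: checkpoint the heap, then break on a suspect allocation.
UW uwNumBlocks = CHeapManager_uwDbgEnumerateBlocks(pcHeapManager, NULL);  // fast block count

UW uwBufferSize = CHeapManager_uwDbgGetEnumeratedBlockBufferSize(uwNumBlocks);
CDbgBlockInfo *pcBlockInfoArray = (CDbgBlockInfo*)CHeapManager_pmAlloc(pcHeapManager, uwBufferSize);
CHeapManager_uwDbgEnumerateBlocks(pcHeapManager, pcBlockInfoArray);       // per-block details

// Suppose the un-freed block turns out to be, say, index 1742 allocated by CPU 3:
// break the next run when that allocation is made to inspect the call stack.
CHeapManager_uwDbgBreakUponAllocation(pcHeapManager, 3, 1742);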

Access to the structures that maintain per-CPU tallies and allocation breakpoints is not serialized, so using this
feature requires that no more than one thread be registered per allocating CPU. This shouldn't be seen as a
detriment, as up to 32 threads on 32-bit systems and 64 on 64-bit systems may be so registered. Furthermore, this
restriction will eliminate per-sandbox race conditions.

Beyond CHeapManager_uwDbgEnumerateBlocks(), which is only available to debug builds, are two release mode
functions that provide block tallies: CHeapManager_GetHeapInfo() and CHeapManager_uwVerifyHeap(). The second
is the preferred means of counting blocks and bytes because the first also computes a checksum of the user data,
which makes it slower. If, however, the heap is damaged, CHeapManager_uwVerifyHeap() will return immediately
upon discovery of an internal error, whereas CHeapManager_GetHeapInfo() will attempt to perform a complete tally.
Note that neither function will provide individual block information.

Persistent Heap Manager
Heap Structure
All operations on a persistent heap are conducted by the persistent heap manager, which is basically an extension of
the volatile heap manager with the added functionality to make blocks stored in a heap impervious to system failure.
A persistent heap, once created, is used exactly as a volatile heap. Allocations, resizings, and deallocations are all
made using the exact same functions. When a persistent heap is closed and restored, every byte, including every
pointer reference, is restored to exactly its pre-closure state. The architecture of a persistent heap is essentially the
same as that of a volatile heap. The heap space is divided into a lower system region and the upper user space, with
the system region being made slightly larger than its volatile counterpart to accommodate the extra structures needed
to provide persistence. All of the features of the volatile heap manager, such as robustness, native dynamic block
support, PTE redirection, and optimal block alignment are available to persistent heaps. Bounds checking is provided
by the debug builds; from the perspective of the system, block padding is considered every bit as much a part of a block as its
contents.

As is the case with a volatile heap, all of the structures required to support a persistent heap are allocated internally
within the heap space - no memory is ever allocated from anywhere else in the process. This includes the stack,
other heaps, etc.

The key to a heap’s persistence is its integration with three files: a backing file, a data file, and a transaction log. The
backing and data files are referred to, collectively, as the heap files. The backing file, which is mapped, is used by the
VMM for paging, and the data file, in effect, backs the backing file. Both of these files must be at least the size of the
heap and optionally larger by a segment multiple. The size of the transaction log is flexible within a few minor
constraints. Don’t consider the use of two heap files to be an inefficient use of storage. The backing file is essentially
a named paging file for the heap, and on a server where storage is critical, the aggregate size of the paging files can
be reduced to the extent that backing files are used by running applications.

The system will make the heap files sparse upon their attachment, so all of the volumes that contain persistent heap
files must support sparseness. Sparse files work akin to virtual memory in that actual physical storage in the target
persistent media is committed on demand as writes to the media are made. The implication is that, upon creation, the
heap files for a one-GB heap will not collectively use two GB of disk space. Disk space will be committed in line with
and as blocks are allocated in the heap. Note that the files used to store compressed heaps can be located in
volumes that don't support sparseness.

> Unix Specific

Almost all Unix file systems support sparseness, and even if one of the storing volumes doesn't, or it is unclear
whether it does, there is a good chance that a persistent heap will run on it anyway. The way the Taiyen system
handles sparseness on Unix differs considerably from the way it does on Windows.

< Unix Specific

> Windows Specific

Sparse file support was added in NTFS 5.0, which shipped with Windows 2000. The FAT and FAT32 file systems do not
support sparse files, so ensure that persistent heap files are located in NTFS 5.0 or later volumes.

< Windows Specific

The use of a transaction log is optional. When omitted, the heap becomes what is termed a snapshot heap, as it will
not support transactions, but will, at least, support snapshots. When the log is included, the heap becomes fully
transactional. The use of a log requires the attachment of a buffer of fixed size, LOG_BUFFER_SIZE. In the face of
system failure, a transactional heap is guaranteed to be recoverable to the last valid transaction. And a snapshot
heap, upon recovery from system failure, is guaranteed to be recoverable to the last successful snapshot.

A persistent heap will exist in one of five states:

Heap State   Backing File   Data File   Transaction Log   Processing Mode
0            Invalid        Invalid     Invalid           Pre first snapshot
1            Valid          Invalid     Invalid           Snapshot
2            Invalid        Valid       Valid             Normal processing
3            Valid          Valid       Invalid           Heap is closed and heap files are congruent
4            Valid          N/A         N/A               Heap is being used as an archive

Heap state zero exists just after a heap is created or decompressed and is the only unrecoverable heap state. The
first snapshot will bring the heap out of state zero and into state two.

Heap state one exists briefly during a snapshot. If a system failure occurs during this state, the heap will be restorable
from the backing file.

Heap state two exists during all normal processing outside of a snapshot. If a system failure occurs during this state,
the heap will be restorable from the data file and transaction log, if attached.

Heap state three exists when a heap is closed and its backing and data files are congruent. State three is achieved
when the persistent heap manager is used to close the heap. Using the volatile heap manager to close a persistent
heap will leave it in state two.

Heap state four exists when a heap is restored from a valid backing file without the attachment of a data file or
transaction log. A heap in this state will not support transactions or fail-safe snapshots. However, these features may
be restored by attaching the two files later on; and in fact, these files may be attached while the heap is in use. After
these files are attached, an immediate snapshot must be made to bring them into congruence with the heap.

Snapshot Heap
A snapshot heap is easier to use and faster than a transactional heap because it spares the interaction with a
transaction log that would otherwise be required. The downside is that the means of backup, the snapshot, isn't
nearly as fine-grained and as fast as a transaction. If this is not a concern, such as the case for the typical desktop
application that needs to perform a save only occasionally, a snapshot heap will suffice.

The primary advantage that the snapshot heap has over other low-level means of persistence such as the standard
memory-mapped file is that the heap offers a framework on which to build. A memory-mapped file, like a conventional
file, starts off as a blank slate, and the user must provide all of the logic to organize the data therein. The use of
Taiyen heap technology to manage a span of persistent space relieves the developer from having to write this
complicated logic. It provides the ability to efficiently subdivide that space into static and dynamic regions of varying
sizes; to continually free, reallocate, and resize those regions; and to effectively manage fragmentation.

Another advantage of the snapshot heap is that snapshots are fail-safe, which is in contrast to the typical file save or
mapped file flush. If, during a snapshot, a system failure were to occur during the actual disk write, the heap would be
restorable to the previous snapshot. Contrast this to a failure during the typical file save, which would, at the very
least, leave the file corrupted with partially written data. The snapshot is also efficient in that it only saves pages that
contain fresh data. In other words, if a small portion of a heap was modified since the last snapshot, only that portion
will be flushed to disk upon the next snapshot, not the entire heap.

This is also true for mapped files by virtue of the VMM, which keeps track of dirty pages through PTEs. As will be
described below, a snapshot updates both the backing and data files. And while the backing file is mapped, the data
file isn’t, so the system must track page modifications through its own structures to know what pages to write to it.
Scanning PTEs for dirty bits is not an option because these bits are cleared when their respective pages are flushed,
and flushing can occur lazily to the backing file between snapshots. So upon execution of a snapshot, a PTE scan
wouldn’t necessarily account for every page that was modified since the previous snapshot, an accounting that is
required to properly update the data file.

The solution used by the system is to commit the pages of a persistent heap with write protection and trap all page
modifications with an exception handler. On Unix, the handler is internal and is initialized automatically, and on
Windows, the handler is accessed via the exception filter TAIYEN_GLOBAL_EXP_FILTER. Any modification of
persistent heap data must be enclosed within a try-except block, as shown below. When an attempt is made to
modify a write protected page, the exception handler will intercept the page fault; it will grant the page write access;
and it will then direct execution back to the faulting instruction. It will also update internal structures that keep track of
the heap's dirty pages. Each page granted write access will maintain such access until a snapshot is performed, at
which time the dirty pages will be flushed to the heap files and will again be write protected.

> Windows Specific

__try
{
    // Code that modifies persistent heap data goes here. The first write to a
    // write-protected page raises a fault that the filter handles, grants the
    // page write access, and then resumes at the faulting instruction.
}
__except(TAIYEN_GLOBAL_EXP_FILTER)
{}

< Windows Specific

Two structures are used to keep track of modified pages: the dirty page bit array (DiPBA) and the dirty segment bit
array (DiSBA), with the purpose of the latter to accelerate scans of the former. The DiSBA was necessary to scan
multi-terabyte heaps in short order. At 128 KB (16 pages) per segment on 64-bit systems, a DiSBA scan can cover 8-
MB per clock cycle (64 bits * 128 KB / bit = 8 MB), not counting cache-miss overhead. Access to the DiPBA is
provided via CPersistentHeapManager_GetHeapParameters(), which may be of some use to certain applications. A
current tally of dirty pages in the heap can be retrieved by calling CPersistentHeapManager_uwGetNumDirtyPages().

A snapshot heap is created by calling CPersistentHeapManager_pcCreate(), with the parameter, hTransactionLog set to
null. Handles to the backing and data files must be supplied, however. Both of these files must be configured
such that writes do not return until the storing device is physically updated, that is: created with file system buffering
disabled. Do, however, enable the device’s write cache and for critical systems, ensure that it is battery-backed. The
files must be located on volumes that support sparseness. The rest of the parameters are identical to those used to
create a volatile heap.
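
On Windows, for example, the heap files might be opened as sketched below; the flag combination is standard Win32
and is one reasonable reading of "file system buffering disabled", while the CPersistentHeapManager_pcCreate() call
itself is only described in the comment, since its full parameter list is not reproduced here.

> Windows Specific

// Hedged sketch (requires <windows.h>): open the heap files so that writes do
// not return until the storing device is physically updated. These handles,
// along with a null hTransactionLog, would then be passed to
// CPersistentHeapManager_pcCreate() to obtain a snapshot heap.
HANDLE hBackingFile = CreateFileW(L"MyHeap.bak", GENERIC_READ | GENERIC_WRITE, 0, NULL,
    OPEN_ALWAYS, FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);

HANDLE hDataFile = CreateFileW(L"MyHeap.dat", GENERIC_READ | GENERIC_WRITE, 0, NULL,
    OPEN_ALWAYS, FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);

< Windows Specific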

A snapshot is made by calling CPersistentHeapManager_uwSnapshot(). As a snapshot is entered, the state of the
heap is either zero or two. It is zero if the heap has just been created or decompressed, and two if at least one
snapshot has been made after either disposition. During a snapshot, the system first flushes its dirty pages to the
backing file. Once this is complete, the state of the heap goes to one. If a failure were to occur at any time while a
heap is in state one, the complete image of the heap now present in the backing file makes it completely restorable.
Next, the system flushes all of the dirty pages to the data file. If a transaction log is being used, the system will clear
the log as well. As soon as this is complete, the heap goes to state two.

To maintain the integrity of the backing file while the heap is in state one, all writing threads must be suspended
during the snapshot (reading threads, however, can keep running). Recall that the VMM uses the backing file for
paging, and if a thread happened to be modifying a page in the heap during the middle of a snapshot, conceivably,
the VMM could flush that newly touched page to the backing file, leaving it in an inconsistent state. Before
CPersistentHeapManager_uwSnapshot() returns, all of the pages that were flushed will have been, once again, write
protected. When successful, the function will return the number of bytes flushed to the heap files. Because the low page in
the system region will always be flushed, a successful return will at least be the page size.
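
In code, a snapshot is a single call; the signature below is assumed from the paper's naming conventions.

// Hedged sketch; the exact signature is an assumption.
UW uwBytesFlushed = CPersistentHeapManager_uwSnapshot(pcHeapManager);
// A successful return is never less than the page size, since the low
// system-region page is always flushed.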

Coming out of the snapshot, the heap is in state two, and all of the blocked writing threads may resume processing. A
complete image of the heap at the time of the just completed snapshot is now resident in the data file, which, unlike
the backing file, is completely insulated from the VMM. The only way it can possibly be modified is by executing the
next snapshot. Any failure at this point would result in a restoration from the data file.

Both snapshot and transactional heaps are closed by calling CPersistentHeapManager_uwClose(). This function will
bring the heap files into congruence by making a snapshot and will return portions of the heap files that are no longer
being utilized to the file system. When successful, the heap state coming out of this function is three. This is the
preferred method of closure because restoring a heap from congruent heap files is by far the fastest restoration
means, as it primarily entails mapping in the heap file views.

Calling the volatile heap manager's heap closure function, CHeapManager_Clear(), will leave a heap in state two and
only restorable to the previous snapshot, effectively simulating a system failure; assuming, of course, that at least one
snapshot had been made at some point before, bringing the heap out of state zero. Before closing a heap,
CHeapManager_uwGetHeapInfo() may be called to retrieve a checksum of the user data stored within the heap,
which may be used to check the data post restoration.

Within the heap's internal structure is a block of storage comprised of 16 system words called the user block, which is
available to the user for storing data like checksums. Because the block is part of the heap's internal structure, its
address is always at a fixed offset from the heap address allowing data to be retrieved from it via an index. The
functions used to set and get data from the user block are: CHeapManager_SetUserSystemWord() and
CHeapManager_uwGetUserSystemWord(), respectively.

Similarly, there are two functions called CHeapManager_SetHeapEntryAddress() and
CHeapManager_GetHeapEntryAddress() used to set and get an internal block of storage called the heap entry
address, which is simply one system word set aside to store the address of a block allocated in the heap, through
which the addresses of all other allocated blocks may be resolved. This entry block may be the lead block in a linked
list, the address of an array, or whatever the user finds the most useful. Obviously, the user block may be used to
store other block addresses as well.
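
Both facilities can be sketched in a few lines; the parameter order is an assumption.

// Hedged sketch of the user block and the heap entry address.
CHeapManager_SetUserSystemWord(pcHeapManager, 0, uwUserChecksum);   // 16 system words are available
UW uwStoredChecksum = CHeapManager_uwGetUserSystemWord(pcHeapManager, 0);

// Record the lead block of a linked list so that every other allocated block
// can be reached after the heap is restored.
CHeapManager_SetHeapEntryAddress(pcHeapManager, pmListHead);
void *pmEntryBlock = CHeapManager_GetHeapEntryAddress(pcHeapManager);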

A persistent heap is restored after being closed or after a system failure by calling
CPersistentHeapManager_pcRestoreHeapFromHeapFiles(). This function will always restore a heap to its pre-
closure address and size and likewise for each block therein. This is in contrast to heap decompression, a topic to be
covered below, which does allow for the resetting of heap address and size. When successful, the function will
provide the state of the heap at the point of closure. The possible values are as follows:

State 0 - The heap is unrecoverable. This state will only be seen if a recovery is attempted before the first successful
snapshot is executed.

State 1 - A system failure occurred during a snapshot. The data file is invalid. The system was rebuilt from the
backing file.

State 2 - A system failure occurred during normal processing. The backing file is invalid. The data file is valid. If a
transaction log was used, the system was rebuilt up to its state as of the last valid transaction. If a transaction log
wasn't used, the system was rebuilt to its state as of the last snapshot.

State 3 - Both heap files are congruent and therefore, valid.

State 4 - The heap was configured for archival use.

When a persistent heap is restored, an enormous number of checks are made to ensure that everything jibes,
and if any inconsistency is found, the restoration process will bail out. After the heap is restored, further
checks may be made through a call to CHeapManager_uwVerifyHeap() to check internal structures and
CHeapManager_uwGetHeapInfo() to check the user data. If parametric information about a heap is needed before it
is restored, call CPersistentHeapManager_uwGetHeapFileParametersFromDataFile() to retrieve a
CHeapFileParameters object.

Transactional Heap
A transactional heap allows snapshots and transactions. A heap transaction is just a fancy way of preserving a heap's
contents, much like a snapshot; except that a transaction is significantly faster. While transactions allow file saves to
be made at a much higher rate than snapshots, interaction with a transaction log does add complexity and overhead
to the process. When a persistent heap is restored, it is always restored to the last complete transaction (with the
snapshot being considered a type of transaction); any partial transaction is ignored. A persistent heap is recoverable
even after a system failure occurs during an actual device write. For most transactional systems, handling power
failures is a simple task. It’s when the disk controller blows out that the mettle of the system is tested. The Taiyen
system was specifically designed to recover from such catastrophic failure.

A transactional heap is created via the same function as is the snapshot heap. A handle to a transaction log must be
supplied, as well as the address of a transaction log buffer (to be allocated by the user), and a few other ancillary
parameters must be set. The size of the buffer must be LOG_BUFFER_SIZE. The log is to be created with the same
settings as the heap files, and again, must be located on a volume that supports sparseness.

The transaction log supports two kinds of entries: the data entry and the procedure entry. The data entry, as one
would imagine, involves copying a block of data and its address from the heap to the log. The event being logged, in
this case, might be the setting of an object’s data member. The other type, the procedure entry, is slightly more
complex. It entails populating a structure called a procedure log entry stack or PLES, which contains a procedure’s
calling convention, address, parameter list, along with some other miscellaneous elements. Once filled, the stack is
written to the log. The system automatically adds some verification elements to both entry types (such as a
checksum) to ensure that if called upon to rebuild a heap, the entries are read and invoked without error.

The procedure entry comes in handy for logging events that would be inefficient or too complex for data entries.
These include events that might affect many sparse patches of memory at once and events that might affect huge
spans of memory. For instance, the inversion of a matrix could potentially affect a huge data block. The logging of this
operation would be computationally expensive, not to mention a waste of precious log space. The better
solution is to log the event with a procedure entry, which would only require the filling out of a PLES for the function
that inverted the matrix and appending that PLES to the log. A PLES is easy to prepare programmatically, and being
only a handful of bytes in size, is quickly logged.

If a failure occurs and the heap must be rebuilt from the log, the log's transaction processor will process each data
entry by copying the stored data block to the stored address; and will process each procedure entry by calling the
procedure, optionally checking the procedure's return value against its expected result, and then optionally loading
the return value into a prescribed address. Imagine that a post system failure recovery is in progress and a matrix
inversion entry is next in line to be processed. The transaction processor will see that the entry is of the procedure
type. It will push the parameters in the PLES onto the stack and will call the function; and if tasked, it will verify the
return value. Let’s say that the next entry is of the data type. Here, the transaction processor will copy the data block
in the log to its specified heap address. In this fashion, entry by entry, the transaction processor will rebuild the heap
to its exact state as of the last valid transaction recorded before the time of failure.

For a log entry to be invoked upon rebuild, it must belong to a valid transaction, which is defined as one or more log
entries followed by a transaction terminator. This is also termed a closed transaction (or committed transaction). The
first log entry made following a terminator or made in a fresh log implicitly starts a new transaction. Log entries that
are not terminated constitute an invalid transaction, and all such entries will be ignored by the transaction processor
in the event that a heap requires restoration from the log. Unterminated entries are found when a system failure
occurs during the middle of a transaction, rendering the transaction invalid.

All log entries are initially written to the log buffer. When the entries are terminated, they are flushed as a block to the
log file. If the buffer becomes too full to support a particular entry, it will automatically be flushed and cleared to make
room for it. Any entry that exceeds half the size of the buffer or any entry that (coupled with a terminator) would cause
the log file to overflow, will force the scheduling of a snapshot, a topic to be covered below.

Archive Heap
When an existing heap is restored with the data file handle set to H_PAGING_FILE, the heap becomes configured for
archive use, which is to say that it is meant to be a data repository and not meant to be modified. Actually, the archive
heap will not prevent data modification or even the allocation of new blocks, and it will even perform snapshots. It just
can't guarantee data integrity to the level of the snapshot or transactional heap. If a system failure occurs during an
archive heap snapshot, the backing file might become permanently corrupted, which puts it at the level of the
standard file write in terms of data protection.

When a persistent heap is closed successfully, the backing and data files are essentially redundant. If there is a
desire to minimize disk usage, the data file may be deleted after a heap is closed, in which case the persistent heap
will be restored as an archive heap. Once the heap is up and running, a new data file may be subsequently attached by
calling CPersistentHeapManager_uwAttachDataFile(). To bring the data file in congruence with the heap, call
CPersistentHeapManager_InvalidateHeap() and then make a snapshot. This will allow all of the heap data to be
written to the data file in an efficient manner that doesn't waste time copying unused bytes, as would be the case
were the backing file simply copied as an image. The other reason not to copy the backing file this way and attach it
as a data file is that some shells will not preserve sparseness when copying a file, and a persistent heap will not work
with non-sparse files. In contrast, the system will make any new file attached to a persistent heap sparse.

In actuality, either the backing or data file can be attached as a backing file, which will allow an active persistent heap
to be copied and perhaps opened in another process. Imagine two backups need to be made of an active
transactional heap. First a snapshot is made to bring the data file up to date. Secondly, a new data file is attached,
the heap is invalidated, and a second snapshot is made creating the second copy. Lastly, a third data file is attached,
the heap is again invalidated and followed by another snapshot to bring it in congruence. Now the heap has no
attachment to the first two data files, and they may be used to spawn two new heaps in other processes. To make
these new heaps transactional, logs may be attached via CPersistentHeapManager_uwAttachTransactionLog().
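
The sequence reads more clearly in code; the sketch below assumes parameter lists in line with the paper's naming
conventions.

// Hedged sketch of the double-backup sequence described above.
CPersistentHeapManager_uwSnapshot(pcHeapManager);                   // bring the current data file up to date

CPersistentHeapManager_uwAttachDataFile(pcHeapManager, hDataFile2); // attach the second data file
CPersistentHeapManager_InvalidateHeap(pcHeapManager);               // force every page to be treated as dirty
CPersistentHeapManager_uwSnapshot(pcHeapManager);                   // write a complete image to it

CPersistentHeapManager_uwAttachDataFile(pcHeapManager, hDataFile3); // attach the third data file
CPersistentHeapManager_InvalidateHeap(pcHeapManager);
CPersistentHeapManager_uwSnapshot(pcHeapManager);                   // the first two data files are now free copies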

Log Entries and Transactions


A data entry is made by calling CPersistentHeapManager_uwMakeDataEntryInLog(), passing the function a block
address and size. A procedure entry is made by calling CPersistentHeapManager_uwMakeProcedureEntryInLog(),
passing it the address of a PLES. The structure of a PLES is as follows:

1) PLES size, (required)
2) calling convention flag | return flags, (required | optional)
3) procedure address, (required)
4) procedure parameter one, (optional)
5) ... (optional)
n + 3) procedure parameter n, (optional)
n + 4) return value to verify, (optional depending on the return flag used)
n + 5) address to take the procedure's return value. (optional)

A PLES can support up to 16 parameters.

The following is some sample code showing a function call and its corresponding log entry:

void *pmBlock = CHeapManager_pmAlloc(pcHeapManager, 15);

UW puwPLES[6];

puwPLES[0] = 6 * SYS_WORD_SIZE; // PLES size.
puwPLES[1] = LOG_PROCEDURE_CDECL_FASTCALL_FLAG | // Calling convention.
             LOG_VERIFY_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG; // Return flag.
puwPLES[2] = (UW)CHeapManager_pmAlloc; // Procedure address.
puwPLES[3] = (UW)pcHeapManager; // Parameter one.
puwPLES[4] = 15; // Parameter two.
puwPLES[5] = (UW)pmBlock; // Return value to verify.

UW uwError = CPersistentHeapManager_uwMakeProcedureEntryInLog(pcHeapManager, puwPLES);

The flag LOG_PROCEDURE_CDECL_FASTCALL_FLAG specifies that the cdecl calling convention is used for the
debug build and that the fastcall convention is used for the release build. Actually, to enable the fastcall calling
convention, FASTCALL must be defined on the compiler command line, or cdecl will be expected for both debug and
release builds. Note that all Unix functions are called via the cdecl convention, whereas Windows uses three calling
conventions: cdecl, fastcall, and stdcall, the last being used for the Windows API. When compiling for Unix, the
Taiyen system configures all calling convention macros to specify the cdecl calling convention. As noted, all of the
macro definitions required to use the Taiyen system are located in TaiyenGlobals.h. Various conditional compilation
definitions in this file allow fairly seamless porting between 32 and 64-bit systems, and between Unix and Windows
with minimal code differences; those few differences being limited to heap setup.

The flag LOG_VERIFY_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG will force the transaction processor
to check the return from CHeapManager_pmAlloc() against pmBlock. If the two values don't match, the processor will
cancel the heap rebuild and return
ERROR_LOG_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG_VERIFICATION_FAILED.

On top of this, there is much additional verification performed to ensure that a heap will be rebuilt properly. For
example, extra elements are added to each PLES that must be verified by the transaction processor as each entry is
processed. The first will catch the case where an entry is invoked out of order or is skipped, and the second is a
checksum of the PLES to catch file read and write errors.

Assume that a call to CHeapManager_pmAlloc() has been executed and its corresponding log entry
has been made. If a failure were to occur at this point, this entry would not be invoked upon rebuild because a
transaction hasn't been completed yet, that is: the entry still resides in the log buffer. To close the transaction,
CPersistentHeapManager_uwCloseTransaction() is called. This function will perform two tasks: it will write a special
terminator code after this entry, and then flush the entry and terminator as a sector-aligned sector-multiple block to
the log file, thereby establishing the transaction. To ensure durability, the function doesn’t return until either the
transaction is resident on the log device or an error has occurred.

A procedure log entry need not be made immediately after its corresponding heap operation is executed. This goes
for data entries as well. A procedure entry may even be logged before the procedure is called, assuming, of course,
that no return value comes into play. Data entries, on the other hand, must only be made after the data in question is
written to the heap. The major restriction on transaction formation is that log entries be made in the order in which
their corresponding operations are executed, and that all operations defining a transaction be initiated after the
transaction is started and completed before it is closed. In short, the rule is that by the time each transaction is
closed, the log must be completely coherent with the heap.

To maintain coherence, the transactional heap uses the one-writer / many-reader model, where only one thread may
modify the heap and access the log at a time. Each transaction must be contained within a log mutex (to be provided
by the user) as shown below, effectively serializing the data stream to the log. Because this data stream will have to
be serialized to make it across the device bus to the log anyway, the use of a lock for the writing threads will not
cause an appreciable performance hit when the log’s storing device can receive transactions at interconnect speeds,
which is possible for bus-attached solid state drives (SSDs). Expect more on this topic later on.

Writing thread ↓

{
Enter log mutex.
Wait for object A’s read/write auto-reset event. // Waiting for any reading threads to finish up.

// Modify object A. For example:

Make a rollback entry for the data about to be modified. // Data rollback entries are made pre data modification.
Modify a block of data.
Log the data operation. // Implicitly starting the transaction. Data log entries are made post data modification.

Call a procedure.
Log the procedure.
Make a rollback entry for the procedure.

Close the transaction. // A rollback can be performed at anytime before the transaction is closed.

Set object A’s read/write event. // Allow any pending reading threads or another writing thread in.
Leave log mutex.
}

Reading thread ↓

{
Enter object A’s thread count critical section.

If (++ thread count == 1) // Only the first reading thread needs to potentially wait.
Wait for object A’s read/write auto-reset event.

Leave object A’s thread count critical section.

// Once one reading thread has access, all others can bypass the read/write synchronization object.
// Read object A.

Enter object A’s thread count critical section.

If (-- thread count == 0) // Once all of the reading threads are out, open the object to any waiting writing thread.
Set object A’s auto-reset event.

Leave object A’s thread count critical section.


}
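
On Windows, the pattern above might be realized roughly as follows. The synchronization objects, their initialization,
and the log-entry parameter lists are illustrative assumptions; only the Taiyen function names come from the text.

> Windows Specific

// Hedged C sketch of the one-writer / many-reader pattern (requires <windows.h>).
CRITICAL_SECTION g_csLogMutex;     // serializes writers and access to the log
CRITICAL_SECTION g_csReaderCount;  // protects object A's reader count
HANDLE g_hReadWriteEvent;          // auto-reset event, created initially signaled
LONG g_lReaderCount = 0;

void WriteObjectA(CPersistentHeapManager *pcHeapManager, UW *puwField, UW uwNewValue)
{
    EnterCriticalSection(&g_csLogMutex);
    WaitForSingleObject(g_hReadWriteEvent, INFINITE);    // wait for readers to drain

    // Rollback entries are made before the data is modified.
    CPersistentHeapManager_MakeDataEntryInRollbackBuffer(pcHeapManager, puwField, SYS_WORD_SIZE);

    __try
    {
        *puwField = uwNewValue;                          // modify a block of data in the heap
    }
    __except(TAIYEN_GLOBAL_EXP_FILTER)
    {}

    // Data log entries are made post modification; closing the transaction flushes them.
    CPersistentHeapManager_uwMakeDataEntryInLog(pcHeapManager, puwField, SYS_WORD_SIZE);
    CPersistentHeapManager_uwCloseTransaction(pcHeapManager);

    SetEvent(g_hReadWriteEvent);                         // admit pending readers or the next writer
    LeaveCriticalSection(&g_csLogMutex);
}

UW ReadObjectA(const UW *puwField)
{
    EnterCriticalSection(&g_csReaderCount);
    if (++g_lReaderCount == 1)                           // only the first reader may have to wait
        WaitForSingleObject(g_hReadWriteEvent, INFINITE);
    LeaveCriticalSection(&g_csReaderCount);

    UW uwValue = *puwField;                              // read object A

    EnterCriticalSection(&g_csReaderCount);
    if (--g_lReaderCount == 0)                           // last reader out reopens the object
        SetEvent(g_hReadWriteEvent);
    LeaveCriticalSection(&g_csReaderCount);

    return uwValue;
}

< Windows Specific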

Restoration from System Failure


Continuing the example above, imagine, at this point, that a system failure has occurred. Calling
CPersistentHeapManager_pcRestoreHeapFromHeapFiles() and passing it a handle to the log will restore the heap to
this last transaction. The first step in this process is to restore the heap to its state of the last snapshot, which entails
loading the heap’s image (from the data file) as of the time of the last snapshot. Now if the heap were only a snapshot
heap, the restoration would be complete. For the transactional heap, however, the system needs to go further. It
continues on by scanning the log for a transaction terminator. If and when one is found, the transaction processor
invokes all of the log entries found upstream of this terminator. The data blocks are copied to their respective heap
addresses and the procedures are executed, with a plethora of error checking happening along the way. The system
continues on searching for terminators until the zeroed portion of the log is found. Any entries that are found without a
terminator are ignored, as it is assumed that the system failed before this last terminator was written.

In this case, the system will find the one procedure log entry for CHeapManager_pmAlloc() and its terminator, and as
part of a valid transaction, will execute this function. As mentioned, a call to CHeapManager_uwVerifyHeap() will add
confidence that the restoration occurred without error. Because the heap will have closed unexpectedly, there will be
no checksum to corroborate. The system will report that the state of the heap at the time of failure was two, and as
was the case with the snapshot heap, the heap state coming out of the restoration will be two. The transaction log will
have been zeroed in preparation for new log entries.

Embedded Log Entries
Filling out a PLES for every call to CHeapManager_pmAlloc(), for instance, will soon become very cumbersome and
will make an application overly susceptible to bugs. The alternative to this is to use what is called an embedded log
entry (ELE). An embedded log entry is an entry that is encapsulated with its corresponding procedure into what is
termed an embedded log entry procedure, or l-proc for short. The procedure called within the l-proc is called the
complement. The system offers a handful of l-procs. For example, the l-proc for
CHeapManager_pmSetDynBlockSize() is CPersistentHeapManager_pmSetDynBlockSizeELE(). Its source code is provided
here:

void *CPersistentHeapManager_pmSetDynBlockSizeELE(CPersistentHeapManager *pcHeapManager,
    void *pmBlock, UW uwSize, void *pmDestinationAddress)
{
// This function accomplishes four things:
// 1) It creates a new or resizes or frees an existing dynamic block via
// CHeapManager_pmSetDynBlockSize().
// 2) If a destination address is provided via pmDestinationAddress, the result of the call
// will be stored in pmDestinationAddress.
// 3) It makes a parallel entry in the log.
// 4) It makes mirror entries in the rollback buffer.

CBlockInfo cBlockInfo;

// Catch the cases where pmBlock may be null or set to
// HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG.
if ((UW)pmBlock >= SYS_SEGMENT_SIZE)
{
cBlockInfo.m_pmBlock = pmBlock;
CHeapManager_GetBlockInfo(pcHeapManager, &cBlockInfo);

// For dynamic blocks:
// 1) cBlockInfo.m_pmBlock will remain as set.
// 2) cBlockInfo.m_uwSize will be set to the user region.

// If the block is shrinking, populate the rollback buffer.
// Always make the last entry to be executed the first to go in the rollback buffer.

if (uwSize < cBlockInfo.m_uwSize)
{
#ifdef _ENABLE_BOUNDS_CHECKING
if (uwSize == 0 && cBlockInfo.m_uwMinReservedRegion == 0)
// The dynamic block is being freed.
{
// Preserve the upper bound padding.
cBlockInfo.m_uwSize += SYS_NODE_SIZE;
}
#endif

UW uwSizeRD = UW_ROUND_DOWN1(uwSize, SYS_WORD_SIZE_INV_MASK);

// Secondly, restore the block's contents. The previous call to
// pmExplicitSetDynBlockSize() will have restored the padding.
CPersistentHeapManager_MakeDataEntryInRollbackBuffer(pcHeapManager,
(void*)((UW)pmBlock + uwSizeRD), cBlockInfo.m_uwSize - uwSizeRD);
}
}

/////////////////////////////////////////////////////////////////////////////
// Perform and log the operation.

void *pmNewBlock;

#ifdef WINDOWS
__try
{
#endif

pmNewBlock = CHeapManager_pmSetDynBlockSize(pcHeapManager, pmBlock, uwSize);

#ifdef WINDOWS
}
__except(TAIYEN_LOCAL_EXP_FILTER)
{}
#endif

// Log the operation, regardless of whether it is successful or not.
UW puwPLES[10];

if (pmDestinationAddress)
// Load the return value into this address.
{
// Make a data entry in the rollback buffer before setting the destination address.
CPersistentHeapManager_MakeDataEntryInRollbackBuffer(pcHeapManager,
pmDestinationAddress, SYS_WORD_SIZE);

((UW*)pmDestinationAddress)[0] = (UW)pmNewBlock;

puwPLES[0] = 8 * SYS_WORD_SIZE; // PLES size. An extra element is required.
puwPLES[1] = LOG_PROCEDURE_CDECL_FASTCALL_FLAG | // Calling convention.
             LOG_VERIFY_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG | // Return flag.
             LOG_STORE_PROCEDURE_RETURN_VALUE_FLAG; // Return flag.

puwPLES[7] = (UW)pmDestinationAddress;
}
else
{
puwPLES[0] = 7 * SYS_WORD_SIZE; // PLES size.
puwPLES[1] = LOG_PROCEDURE_CDECL_FASTCALL_FLAG | // Calling convention.
             LOG_VERIFY_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG; // Return flag.
}

puwPLES[2] = (UW)CHeapManager_pmSetDynBlockSize; // Procedure address.
puwPLES[3] = (UW)pcHeapManager; // Parameter one.
puwPLES[4] = (UW)pmBlock; // Parameter two.
puwPLES[5] = uwSize; // Parameter three.
puwPLES[6] = (UW)pmNewBlock; // Return value to verify.

UW uwError = CPersistentHeapManager_uwMakeProcedureEntryInLog(pcHeapManager, puwPLES);

/////////////////////////////////////////////////////////////////////////////
// Finish populating the rollback buffer.

if (BW_CORE_HEAP_OPERATION_RETURNED_AN_ERROR(pmNewBlock) == FALSE)
// CHeapManager_pmSetDynBlockSize() did not return an error code. Note that if uwSize is
// zero, the function will return NULL, which would not be an error in this case.
{
// If pmBlock is null, pmExplicitSetDynBlockSize() will free the block, and will ignore
// all other parameters except pmNewBlock.

puwPLES[0] = 10 * SYS_WORD_SIZE; // PLES size.
puwPLES[1] = LOG_PROCEDURE_CDECL_FASTCALL_FLAG | // Calling convention.
             LOG_VERIFY_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG; // Return flag.
puwPLES[2] = (UW)pmExplicitSetDynBlockSize; // Mirror procedure address.
puwPLES[3] = (UW)pcHeapManager; // Parameter one.
puwPLES[4] = (UW)pmNewBlock; // Parameter two.
puwPLES[5] = (UW)pmBlock; // Parameter three.
puwPLES[6] = cBlockInfo.m_uwSize; // Parameter four.
puwPLES[7] = cBlockInfo.m_uwReservedRegion; // Parameter five.
puwPLES[8] = cBlockInfo.m_uwMinReservedRegion; // Parameter six.
puwPLES[9] = (UW)pmBlock; // Return value to verify.

CPersistentHeapManager_MakeProcedureEntryInRollbackBuffer(pcHeapManager, puwPLES);
}

return (void*)((UW)pmNewBlock | uwError);
}

Notice that the prototype for the l-proc has the extra parameter, pmDestinationAddress. When a block is allocated or
reallocated, its pointer has to be assigned somewhere or a memory leak will definitely result. The l-proc provides the
capability to have this pointer assigned to an address and to log the event by setting
LOG_STORE_PROCEDURE_RETURN_VALUE_FLAG in the PLES. Notice further that the l-proc also includes
rollback buffer entries.

Rollback
The rollback service, which is only available to transactional heaps, is enabled by attaching a buffer via
CPersistentHeapManager_AttachRollbackBuffer(). The rollback buffer, like the log buffer, must be aligned along a
page boundary, must be a page multiple in size, and does not have to be zeroed prior to attachment. When a rollback
buffer is attached, its state is set to two, which indicates that it is enabled. When a persistent heap is used without an
attached rollback buffer, the buffer's state is zero, indicating that it is disabled. Notice that the buffer states parallel
those of the heap: state zero is unrecoverable and state two is normal processing. To get the buffer's current state as
well as its parametric information, call CPersistentHeapManager_GetRollbackBufferInfo().

The rollback buffer is used almost identically to the transaction log. It takes both data and procedure
entries, and the procedure entries are prepared identically to those for the log – a PLES is populated and
passed to the buffer. As entries are made in the log, mirror entries must be made in the rollback buffer
that, when processed, will automatically undo the operations constituting the current unclosed transaction, leaving the
heap exactly in its state as of the last transaction. Because log entries are meant to record operations,
they are termed parallel entries. Data and procedure entries are made in the rollback buffer by calling
CPersistentHeapManager_MakeDataEntryInRollbackBuffer() and
CPersistentHeapManager_MakeProcedureEntryInRollbackBuffer(), respectively. A rollback is executed by calling
CPersistentHeapManager_uwRollback(). It's important to note that CPersistentHeapManager_uwRollback() only rolls
back unterminated entries. Whenever a transaction is closed, the rollback buffer is cleared (it is actually reset versus
being zeroed out) effectively locking the operations that constituted this transaction in the heap.
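
A minimal sketch of attaching a rollback buffer and backing out an unclosed transaction follows; the buffer sizing and
the parameter lists are assumptions.

// Hedged sketch. The rollback buffer must be page aligned and a page multiple in size.
CPersistentHeapManager_AttachRollbackBuffer(pcHeapManager, pmRollbackBuffer, uwRollbackBufferSize);

// ... make parallel entries in the log and mirror entries in the rollback buffer ...

// Something went wrong before the transaction was closed: undo the unclosed
// entries, leaving the heap exactly as of the last closed transaction.
UW uwRollbackError = CPersistentHeapManager_uwRollback(pcHeapManager);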

When a rollback is executed, the last entry made in the buffer is processed first, moving backward to the first entry.
After each entry is processed, it is automatically written to the transaction log to maintain precise consistency
between the heap and the log. It is critical to understand that entries in the transaction log that reflect the undone
operations are never erased during rollback. The log always moves forward, which is to say that during rollback, new
entries are appended to the log that effectively undo the previous entries.

The l-proc above includes three rollback buffer entries. The first entry will restore the contents of the clipped portion of
a block that is contracting or being freed. The second entry will restore the contents at pmDestinationAddress if an
address has been provided; and the third entry will restore the block to its pre-operation location via
pmExplicitSetDynBlockSize(), an internal function that allocates a dynamic block at a particular address (the system
has a parallel function called pmExplicitAlloc() that is used to restore a freed static block). The entries are made in
reverse of the order in which they will be executed upon rollback. As mentioned, data entries and procedure entries
are made identically in the rollback buffer as they are in the transaction log. The only difference occurs under the
surface, where the system appends verification data, such as checksums, to log entries to ensure data integrity.

While all Taiyen l-procs contain rollback buffer entries, there are some cases where a complex operation that is
composed of many sub-operations is best rolled back by undoing each sub-operation individually, but is, on the other
hand, best logged by making one procedure entry rather than making log entries for each sub-operation. To facilitate
this scenario, the system provides what is termed an embedded rollback entry (ERE), which embeds a procedure with
the rollback entries necessary to undo the operation, creating what is termed an r-proc. An r-proc is essentially an
l-proc without the log entry. The system offers r-proc versions of each l-proc.

The rollback buffer can overflow, but will not return an error upon this occurrence, as evidenced by the entry function
prototypes. Because rollback is considered to be a very infrequent operation, and furthermore, because the buffer is
cleared after each transaction is closed, buffer overflow is not a matter of great concern. If, however, a rollback is
attempted and fails because either the buffer had overflowed at some point or, even more unlikely, a processed
rollback entry failed, there is recourse beyond shutting down the process. A call to
CPersistentHeapManager_uwRebuild() could be used to rebuild the heap to the last closed transaction, effectively
rolling back the unterminated entries. Realize that this drastic measure will rarely if ever have to be used. The
probability of buffer overflow is remote, especially if a relatively large buffer is employed; coupling this with the low
probability that a rollback will need to be executed makes the overall probability of failure extremely small.
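
The recourse just described might be coded roughly as follows; the parameter list of
CPersistentHeapManager_uwRebuild() is not documented in this paper, so the single-argument call shown is an
assumption:

// Sketch only: fall back to a rebuild if the rollback itself fails.
UW uwError = CPersistentHeapManager_uwRollback(pcHeapManager);

if (uwError)
// The buffer overflowed at some point, or a processed rollback entry failed.
{
// Rebuild the heap to the last closed transaction (parameters assumed).
uwError = CPersistentHeapManager_uwRebuild(pcHeapManager);
}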

If the rollback buffer does overflow, its state goes immediately to zero, indicating that it is disabled. Once in state
zero, it ignores all further entries until either the transaction is closed, a rollback is attempted, or the heap is rebuilt,
at which point its state is reset to two (normal processing). Of course, if a rollback buffer is never attached, its state
always remains zero.

The rollback facility described here, which only allows the current unclosed transaction to be rolled back, was
designed for multi-user systems where previous transactions may not be disturbed. For single-user desktop
applications, where there might be a desire to undo operations all the way to the point where the heap was first
created and then perhaps subsequently redo those operations, a separate undo/redo feature is also available that
works similarly to rollback. The difference in implementation is that while rollback is designed to integrate with
transactional heaps, undo/redo is designed to integrate with snapshot heaps, which are perfectly sufficient for
desktop applications.

Undo/Redo
The undo/redo service is enabled by calling CPersistentHeapManager_AttachUndoBuffer() and supplying the
addresses of undo and redo buffers. The two buffers must be aligned along page boundaries and of equal size, and
that size must be a page multiple. The buffers do not have to be zeroed prior to attachment. The attaching function has
two other parameters of note: bwUseCircularBuffer and bwReattachPersistentBuffers. When bwUseCircularBuffer is
set to TRUE, the system will use a circular buffer, allowing new entries that could not otherwise fit in the buffer to
overwrite the oldest transactions contained therein. If there is a desire to be able to undo all of the transactions made
on a heap, set this parameter to FALSE. Beware that doing so will obviously limit the number of transactions that can
be stored in the buffer, as the system does not allow undo and redo buffers to be expanded after being attached. The
second parameter, bwReattachPersistentBuffers, when set to TRUE, tells the system that the attaching buffers may
contain valid transactions left over from a previous session. Of course, when the parameter is set to FALSE, the
buffers are considered to be virgin. This feature allows operations made during one session to be undone in the
following session.
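
A minimal attachment sketch follows. Only the two boolean parameters are documented above, so the parameter
order, the size argument, and the pmAllocatePageAlignedMemory() helper are assumptions:

// Sketch only: attach equally sized, page-aligned undo and redo buffers.
UW uwBufferSize = 256 * SYS_PAGE_SIZE; // Page multiple (SYS_PAGE_SIZE assumed).
void *pmUndoBuffer = pmAllocatePageAlignedMemory(uwBufferSize); // Hypothetical helper.
void *pmRedoBuffer = pmAllocatePageAlignedMemory(uwBufferSize);

CPersistentHeapManager_AttachUndoBuffer(pcHeapManager, pmUndoBuffer, pmRedoBuffer, uwBufferSize,
TRUE, // bwUseCircularBuffer: the oldest transactions may be overwritten.
FALSE); // bwReattachPersistentBuffers: treat the buffers as virgin.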

Data and procedure entries are made in the undo buffer just as they are in the rollback buffer and the transaction log
for that matter. A data entry is made via CPersistentHeapManager_uwMakeDataEntryInUndoBuffer() and a
procedure entry via CPersistentHeapManager_uwMakeProcedureEntryInUndoBuffer(). Unlike the rollback buffer
entry functions, these can return an error code. Actually, it is a warning code. If a particular entry can't fit in the undo
buffer, the following warning is returned: WARNING_UNDO_BUFFER_CANNOT_ACCOMODATE_ENTRY. This is a
single-chance warning designed to allow the application to issue a subsequent message to the user that the current
operation being attempted can't be undone, and to prompt whether or not to move forward. If the user desires to go
ahead, the next attempt to make the oversized entry in the buffer will go through, but no entry will be made, and the
system will disable the buffer. If, however, the user decided to back out of the current operation, the application would
call CPersistentHeapManager_uwUndoUnterminatedEntries() to undo the sub-operations made up to that point,
which constitute the current transaction. To see undo buffer entries in action, examine the following code:

void *CPersistentHeapManager_pmSetDynBlockSizeEUE(CPersistentHeapManager *pcHeapManager,
    void *pmBlock, UW uwSize, void *pmDestinationAddress)
{
// This function accomplishes three things:
// 1) It creates a new or resizes or frees an existing dynamic block via
// CHeapManager_pmSetDynBlockSize().
// 2) If a destination address is provided via pmDestinationAddress, the result of the call
// will be stored in pmDestinationAddress.
// 3) It makes mirror entries in the undo buffer.

/////////////////////////////////////////////////////////////////////////////
// Whenever an operation requires multiple entries to be made in the undo buffer, if
// possible, check for availability before adding these entries to avoid performing an
// operation that can't be subsequently recorded.

CBlockInfo cBlockInfo;
CUndoBufferEntryInfo acUndoBufferEntryInfo[3];

UW uwEntry = 0;
UW uwSizeRD;

// Catch the cases where pmBlock may be null or set to
// HEAP_ALLOCATE_LARGE_DYNAMIC_BLOCK_FLAG.
if ((UW)pmBlock >= SYS_SEGMENT_SIZE)
{
cBlockInfo.m_pmBlock = pmBlock;
CHeapManager_GetBlockInfo(pcHeapManager, &cBlockInfo);

// For dynamic blocks:
// 1) cBlockInfo.m_pmBlock will remain as set.
// 2) cBlockInfo.m_uwSize will be set to the user region.
if (uwSize < cBlockInfo.m_uwSize)
{
#ifdef _ENABLE_BOUNDS_CHECKING
if (uwSize == 0 && cBlockInfo.m_uwMinReservedRegion == 0)
// The dynamic block is being freed.
{
// Preserve the upper bound padding.
cBlockInfo.m_uwSize += SYS_NODE_SIZE;
}
#endif
uwSizeRD = UW_ROUND_DOWN1(uwSize, SYS_WORD_SIZE_INV_MASK);

acUndoBufferEntryInfo[uwEntry].m_bwDataEntry = TRUE;
acUndoBufferEntryInfo[uwEntry].m_uwSize = cBlockInfo.m_uwSize - uwSizeRD;

uwEntry++;
}
}
else
{
cBlockInfo.m_uwSize = 0;
}
if (pmDestinationAddress)
{
acUndoBufferEntryInfo[uwEntry].m_bwDataEntry = TRUE;
acUndoBufferEntryInfo[uwEntry].m_uwSize = SYS_WORD_SIZE;

uwEntry++;
}

acUndoBufferEntryInfo[uwEntry].m_bwDataEntry = FALSE;
acUndoBufferEntryInfo[uwEntry].m_uwSize = 12 * SYS_WORD_SIZE;

UW uwError = CPersistentHeapManager_uwCheckUndoBufferAvailability(pcHeapManager,
acUndoBufferEntryInfo, uwEntry + 1);

if (uwError)
{
return (void*)uwError;
}

/////////////////////////////////////////////////////////////////////////////
// If the block is shrinking, populate the undo buffer. Always make the last entry to be
// executed the first to go in the undo buffer.

if ((UW)pmBlock >= SYS_SEGMENT_SIZE && uwSize < cBlockInfo.m_uwSize)
{
// Secondly, restore the block's contents. The previous call to
// pmExplicitSetDynBlockSize() will have restored the padding.
uwError = CPersistentHeapManager_uwMakeDataEntryInUndoBuffer(pcHeapManager,
(void*)((UW)pmBlock + uwSizeRD), cBlockInfo.m_uwSize - uwSizeRD);

_ASSERT_TY(uwError == 0);

// This will prevent a data entry from being made in the redo buffer that contains
// garbage.
DisableAutoDataEntryForLastEntry(pcHeapManager);
}

/////////////////////////////////////////////////////////////////////////////
// Perform the operation.

void *pmNewBlock;

#ifdef WINDOWS
__try
{
#endif

pmNewBlock = CHeapManager_pmSetDynBlockSize(pcHeapManager, pmBlock, uwSize);

#ifdef WINDOWS
}
__except(TAIYEN_LOCAL_EXP_FILTER)
{}
#endif

if (BW_CORE_HEAP_OPERATION_RETURNED_AN_ERROR(pmNewBlock))
// An error occurred.
{
return pmNewBlock;
}
// else, CHeapManager_pmSetDynBlockSize() did not return an error code. Note that if uwSize is
// zero, the function will return NULL, which would not be an error in this case.

if (pmDestinationAddress)
// Load the return value into this address.
{
// Make a data entry in the undo buffer before setting the destination address.
uwError = CPersistentHeapManager_uwMakeDataEntryInUndoBuffer(pcHeapManager,
pmDestinationAddress, SYS_WORD_SIZE);

_ASSERT_TY(uwError == 0);

((UW*)pmDestinationAddress)[0] = (UW)pmNewBlock;
}

/////////////////////////////////////////////////////////////////////////////
// Finish populating the undo buffer.

// If pmBlock is null, pmExplicitSetDynBlockSizeEUE() will free the block, and will ignore
// all other parameters except pmNewBlock and cBlockInfo.m_uwSize.

UW puwPLES[12];

if (pmDestinationAddress)
{
puwPLES[0] = 12 * SYS_WORD_SIZE; // PLES size. An extra element is required.
puwPLES[1] = LOG_PROCEDURE_CDECL_FASTCALL_FLAG | // Calling convention.
LOG_VERIFY_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG | // Return flag.
LOG_STORE_PROCEDURE_RETURN_VALUE_FLAG; // Store-return-value flag.

puwPLES[11] = (UW)pmDestinationAddress;
}
else
{
puwPLES[0] = 11 * SYS_WORD_SIZE; // PLES size.
puwPLES[1] = LOG_PROCEDURE_CDECL_FASTCALL_FLAG | // Calling convention.
LOG_VERIFY_PROCEDURE_RETURN_IS_EQUAL_TO_VALUE_FLAG; // Return flag.
}

puwPLES[2] = (UW)pmExplicitSetDynBlockSizeEUE; // Mirror procedure address.
puwPLES[3] = (UW)pcHeapManager; // Parameter one.
puwPLES[4] = (UW)pmNewBlock; // Parameter two.
puwPLES[5] = (UW)pmBlock; // Parameter three.
puwPLES[6] = cBlockInfo.m_uwSize; // Parameter four.
puwPLES[7] = cBlockInfo.m_uwReservedRegion; // Parameter five.
puwPLES[8] = cBlockInfo.m_uwMinReservedRegion; // Parameter six.
puwPLES[9] = (UW)pmDestinationAddress; // Parameter seven.
puwPLES[10] = (UW)pmBlock; // Return value to verify.

uwError = CPersistentHeapManager_uwMakeProcedureEntryInUndoBuffer(pcHeapManager, puwPLES);

_ASSERT_TY(uwError == 0);

return pmNewBlock;
}

This function is the embedded undo entry (EUE), or u-proc for CHeapManager_pmSetDynBlockSize(). It is similar to
an l-proc and an r-proc, but with one key difference. Notice the call early in the u-proc to
CPersistentHeapManager_uwCheckUndoBufferAvailability(). Certain functions such as those returning values that
must be loaded into the PLES, like CHeapManager_pmSetDynBlockSize(), can only be logged in the undo buffer
after they're called. The problem occurs when the buffer can't accommodate such an entry, in which case an
operation will have been entered into that can't be undone, leaving the user without the option to either abandon the
current operation or proceed forward. CPersistentHeapManager_uwUndoUnterminatedEntries() couldn't be called in
this instance because the undo buffer would not have been in congruence with the heap.

The solution is to ensure that all of the undo entries needed to reverse a particular operation can fit in the undo buffer
before entering into that operation. This is accomplished by first creating an array of CUndoBufferEntryInfo objects,
with each object corresponding to the size and type of entry (data or procedure) that will be made in the buffer for the
operation at hand. The order of CUndoBufferEntryInfo objects must precisely match the order in which the entries will
be made, and if certain entries will only be made conditionally, then the corresponding objects should only be added
according to the same conditions. To illustrate, the u-proc above shows that the second CUndoBufferEntryInfo object
is only added when pmDestinationAddress is non-null, which is in congruence with the corresponding buffer entry. Once
the object array is prepared, it is passed to CPersistentHeapManager_uwCheckUndoBufferAvailability(), which, like
the entry functions, will issue a single-chance warning if sufficient availability doesn't exist. The warning could then be
passed on to the user, and if the choice was made to not proceed with the operation, a call to
CPersistentHeapManager_uwUndoUnterminatedEntries() could safely be made to reverse any previous operations
belonging to the current transaction.

The term transaction in the last sentence may be confusing. While it’s true that the undo and redo buffers
are to be attached to snapshot heaps, and while it’s also true that snapshot heaps don't support transactions, at least
not in the way that transactional heaps do, there has to be a mechanism to allow operations to be undone and redone
in a controlled manner. Imagine the process of creating a fillet in a CAD program. On the surface, one line has to be
clipped; a second line must then be clipped; and lastly, an arc must be drawn between the two. Under the surface,
there may be many sub-operations such as updating the entity database and perhaps, allocating memory for the arc.
When the user goes to undo the fillet, he is not going to want to undo each little sub-operation individually. He is
going to want to undo them as a package. In the vernacular of the persistent heap's undo/redo service, this package
is called a transaction. An undo buffer transaction is defined, similar to a log transaction, as one or more buffer
entries followed by a terminator, which is written by calling
CPersistentHeapManager_CloseUndoBufferTransaction(). Notice that this last function won't return an error code
because it requires no extra space from the buffer.
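
Sticking with the fillet example, the application-layer pattern might look like the sketch below. ClipLine() and DrawArc()
are hypothetical application routines, each of which internally makes its own undo buffer entries (via u-procs, for
example); only the terminator call is taken from the text above:

// Sketch only: package the fillet's sub-operations into one undo buffer transaction.
ClipLine(pcHeapManager, pmLine1, pmArcStart);
ClipLine(pcHeapManager, pmLine2, pmArcEnd);
DrawArc(pcHeapManager, pmArcStart, pmArcEnd);

// Terminate the package so that a single undo will reverse the entire fillet.
CPersistentHeapManager_CloseUndoBufferTransaction(pcHeapManager);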

A call to CPersistentHeapManager_uwUndo() will invoke the undo processor, which will execute entries working from
the last entry backward until either a transaction terminator or the lead entry is found. It doesn't matter whether these
entries are terminated or not; they will be undone regardless. The first undo implicitly starts an undo/redo cycle. Notice in the
u-proc above that no entries were ever made in the redo buffer. This is because the redo buffer is managed entirely
by the system and is automatically populated with redo entries as the undo entries are processed. Once an
undo/redo cycle has been started, repeated calls to CPersistentHeapManager_uwUndo() will progressively undo the
heap operations until there are no buffer entries left to execute, at which point the function will return
ERROR_UNDO_BUFFER_THERE_IS_NOTHING_TO_UNDO. Now, with a fully populated redo buffer at hand, all of
these operations can be redone by calling CPersistentHeapManager_uwRedo(). Upon processing all of the redo
transactions, the function will return ERROR_UNDO_BUFFER_THERE_IS_NOTHING_TO_REDO. These two errors
constitute the end points of the undo/redo cycle, and once an undo/redo cycle has been started, undos and redos
may be performed in any desired sequence.

Once an undo/redo cycle has been started, it is imperative that no operations be made on the heap until the cycle is
closed. The cycle is closed by calling CPersistentHeapManager_ExitUndoRedoCycle(). This function will also reset
the redo buffer and decommit the pages contained therein.
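
Putting these pieces together, an undo-everything-then-redo-everything cycle might be driven as sketched below. The
assumption that the undo and redo functions return zero on success is mine; only the function names and error codes
are taken from the text:

// Sketch only: walk all the way back through the undo buffer, then all the way forward.
UW uwError;

// Each call undoes one transaction; the first call implicitly starts the cycle.
while ((uwError = CPersistentHeapManager_uwUndo(pcHeapManager)) == 0)
;
// uwError is now ERROR_UNDO_BUFFER_THERE_IS_NOTHING_TO_UNDO.

// The redo buffer was populated automatically while the undo entries were processed.
while ((uwError = CPersistentHeapManager_uwRedo(pcHeapManager)) == 0)
;
// uwError is now ERROR_UNDO_BUFFER_THERE_IS_NOTHING_TO_REDO.

// No heap operations may be made until the cycle is closed.
CPersistentHeapManager_ExitUndoRedoCycle(pcHeapManager);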

The undo buffer will exist in one of four states. State zero indicates that the buffer is disabled. This state exists before
a buffer is attached and after an entry is made in the buffer that causes it to overflow. Recall that the buffer will issue a
single-chance warning before accepting an entry that will cause this. State one is the warning state and exists upon
issuance of the overflow warning. State two is normal processing; and state three is the state of the buffer during an
undo/redo cycle.

Four actions will bring the undo buffer into state two, with the first being the initial attachment thereof. If an old buffer
is attached, its state remains as it was at the point the heap was closed. This allows a buffer that was closed during
an undo/redo cycle (state three), for instance, to be reattached and pick up where it left off. The second action is the
clearing of the buffer, which occurs as a result of calling CPersistentHeapManager_ClearUndoBuffer(). Note that this
is the only function that will enable an attached buffer that has been disabled. The third action that will move an undo
buffer to state two is exiting an undo/redo cycle, and the fourth is calling
CPersistentHeapManager_uwUndoUnterminatedEntries().

This highlights a difference between CPersistentHeapManager_uwUndo(), which starts an undo/redo cycle and
moves the buffer to state three, and CPersistentHeapManager_uwUndoUnterminatedEntries(), which is designed to
bring a buffer out of state one and back into state two. Note that CPersistentHeapManager_uwUndo() will also bring
the buffer out of state one, but if there are no unterminated entries in the undo buffer,
CPersistentHeapManager_uwUndo() will undo the last transaction, whereas
CPersistentHeapManager_uwUndoUnterminatedEntries() will simply return having taken no action.

All of these details may seem to detract from one of the snapshot heap's main benefits, which is ease of use, but
undo/redo is certainly an optional feature, and moreover, if it is indeed needed by the application, using this
framework definitely makes the task easier than writing the necessary code from scratch. And by implementing
undo/redo below the data organization layer, this feature can readily be incorporated into a variety of desktop
applications from word processors, spreadsheets, and CAD systems, to the next generation of highly integrated and
complex software.

Automatic Snapshots
An important feature of transactional heaps is the automatic snapshot mechanism. There are two events that can
trigger an automatic snapshot (also termed a forced snapshot). The first is log overflow. As entries are added to the
log, eventually the log will fill. As soon as the system detects that an impending entry will either not leave room in the
log for a transaction terminator or will not fit in the log as is, a snapshot will be scheduled. From the user’s
perspective, the log will continue to accept entries, but in actuality, the entry that scheduled the snapshot as well as
those made thereafter are ignored. When the transaction is finally closed, instead of writing a terminator to the log,
the system will make the pending snapshot. Here, the snapshot acts as sort of an über transaction, compiling all of
the little transactions in the log into one large disk write. During the snapshot, the dirty portion of the log is zeroed and
its pointer revolves back to its low end. By virtue of the log being sparse, it can be zeroed logically without having to
physically write zeros to the media storing it. The log is always cleared while the heap is in state one. This way the
system knows not to attempt a recovery from the log were a system failure to occur while it was in the process of
being zeroed. Recall that when a failure occurs during state one, the resulting recovery is made from only the backing
file. Coming out of the snapshot, the heap is presented with a fresh log ready to receive new entries.

The second event that can cause a snapshot to be scheduled is the attempt to make a log entry that is greater than
or equal to half the log buffer size (because procedure entries are limited to 16 parameters, only data entries can be
this large). The logic here is that the operation corresponding to such an entry will have entailed writing a large
contiguous block of data to the heap, and instead of duplicating this block-write to the log, it will be more efficient to
allow the snapshot mechanism to go ahead and flush it to the heap files (where it will ultimately wind up anyway).

The upshot of the automatic snapshot mechanism is that entries may be made in the log without ever having to check
the size of an entry against available space in the log. This effectively allows the log to be treated as if it were infinite
in size, radically simplifying its use.

Of course, snapshots may also be made explicitly, perhaps by a thread that awakens at specific time intervals. The thread
may first check the number of dirty pages in the heap via CPersistentHeapManager_uwGetNumDirtyPages() and
then make a snapshot when the tally eclipses a certain threshold. To ensure that an explicitly made snapshot is never
made mid-transaction, all modifying threads must be blocked during a snapshot (note that reading threads may,
however, continue running). This is done by wrapping the snapshot within the log mutex as shown here:

{
Enter log mutex.

Snapshot.

Leave log mutex.
}
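
A watchdog thread implementing this policy might look like the sketch below. The log mutex routines, the snapshot
call, the sleep helper, and the uwDirtyPageThreshold variable are placeholders, since their actual names are not given
in this paper; only CPersistentHeapManager_uwGetNumDirtyPages() is taken from the text:

// Sketch only: force a snapshot once the dirty page tally crosses a threshold.
// EnterLogMutex(), LeaveLogMutex(), CPersistentHeapManager_Snapshot(), and SleepSeconds()
// are hypothetical placeholders.
for (;;)
{
SleepSeconds(30);

if (CPersistentHeapManager_uwGetNumDirtyPages(pcHeapManager) >= uwDirtyPageThreshold)
{
EnterLogMutex(pcHeapManager); // Blocks all modifying threads.

CPersistentHeapManager_Snapshot(pcHeapManager);

LeaveLogMutex(pcHeapManager); // Reading threads were never blocked.
}
}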

The automatic-snapshot process is illustrated in Figure 19 below, which shows a series of transactions being made
on a heap and the ensuing snapshot that occurs upon the exhaustion of log space. In the figure, all of these
transactions have only dirtied three pages. When the snapshot is made, the three pages are flushed to disk; once to
the backing file and once to the data file. Note that the VMM may have already written one or more of the pages to
the backing file prior to the snapshot, in which case a portion of the flush to this file will be ignored. Because the VMM
has no connection to the data file, there is no way that this file could have been updated. In practice, it is expected
that the aggregate number of log entry bytes will exceed the aggregate number of dirty page bytes, as predicted by
the 80/20 rule, which states that 80% of the time is spent modifying 20% of the data. The quotient used to track this
ratio is called the DPPL, which stands for dirty page bytes per log entry bytes.

[Figure omitted: a transactional heap receiving a series of data and procedure log entries that dirty three of its pages,
followed by a snapshot that zeroes the transaction log and flushes the dirty pages to the disk backing and data files.]

Figure 19

ACID Transactions
Persistent heap data comes in two states: transient and consistent. Imagine inserting a node into a doubly-linked list
and then walking the list to find a certain value. Only when all of the touched pointers are set is the structure in a
consistent state that can safely be accessed. In a single-threaded application, the thread that performs the insertion is
the thread that will subsequently walk the list to access its contents. The two operations are sequential, so there is no
chance of accessing any transient data that may in turn raise an exception. In a multi-threaded application, however,
multiple threads may be vying to access this list at the same time, so protections must be built in to prevent these
threads from seeing the structure before it is consistent. Such protections are essential in transactional systems, but
other attributes beyond consistency are desirable as well. Atomicity is also critical. The classic example is the transfer
of funds from one account to another. While the first account is debited, the second is credited by the same amount,
and this must be an all-or-nothing operation: either both accounts are updated or neither is. If there is any error during
the course of the operation, any changes must be able to be rolled back. These and other critical transaction
properties are found in the acronym ACID, which stands for atomic, consistent, isolated, and durable.

[Figure omitted: the user data checksum of a hypothetical heap plotted over time, distinguishing transient and
consistent states across heap transactions, a rollback or rebuild, and a system failure with restoration.]

Figure 20

Employment of the one-writer / many-reader model allows heap transactions to meet these four criteria. Atomicity is
achieved as an object’s updating thread blocks all reading threads until the update is complete. From the perspective
of the reading threads, the heap’s data state only changes when each transaction is closed. The rollback mechanism
is atomic as well because when invoked, it completely restores the heap’s user data to its state as of the previous
transaction. This is demonstrated in Figure 20, which plots the checksum of the user data in a hypothetical heap. The
dashed line represents the actual checksum, whereas the solid line represents the data state as seen by the querying
threads, which moves atomically from one state to the next with each closed transaction, represented by the vertical
lines.

Consistency is achieved as each transaction moves the heap atomically from one consistent state to the next. The
model also ensures that upon closure of each transaction, the heap is completely congruent with the log. This is
critical in case the transaction is closed via an automatic snapshot, in which case the heap will need to be in a
consistent state prior to exiting the log mutex. Because the application riding on top of a transactional heap
determines when transactions are closed, some responsibility for maintaining consistency falls to the application
layer.

Isolation is achieved by allowing only one thread to modify the heap at a time, ensuring that transactions are made
serially, and by allowing the modifications made by one transaction to be visible to other transactions only after the
first transaction either commits or aborts. Concurrency can be achieved by utilizing multiple transactional heaps,
although with one stipulation: the heaps must be completely isolated from one another – no transaction can straddle
two or more heaps. The high degree of isolation imposed here prevents dirty reads where a read of incomplete data
or an abandoned transaction can wind up as part of a second transaction, thereby introducing data inconsistency.

Durability is achieved through use of the transaction log and fail-safe snapshots, which make transactional heaps
impervious to system failure, and guarantee that a transactional heap will always restore to its state as of the last
closed transaction. Durability is assured because the log entries comprising a transaction are committed to the log’s
storing device before the next transaction is started. This applies to snapshots as well, which do not return until both
the backing and data files have been updated.

Ease of Use
Recall from the data-management stack that the persistent heap operates at the persistence control layer, well below
the data organization and application layers, making it an encapsulated and extensible component. Furthermore, the
persistent heap is data semantics and organization agnostic, making it easy to use as well. The term persistent heap
itself implies simplicity. One immediately thinks of a heap wherein allocated blocks are somehow made persistent,
and no special code is ever required to achieve this. This is sometimes referred to as orthogonal persistence, which
allows the programmer to treat structures the same way regardless of whether they exist in volatile or persistent
domains. Any conceivable data architecture can be built upon a persistent heap and make full use of its services. The
desired architecture could be relational, completely object oriented, or whatever else the developer conjures.

To illustrate, compare a persistent heap to parallel technologies such as archives and object databases, which
require more integration with the application and impose more semantic constraints. To make an object persistent
using MFC, that object has to be derived from the CObject class and serialized through a CArchive; similarly, for an
object database, it must be derived from the d_Object class. Storing and loading objects to and from the archive or
database require that specific serialization and
deserialization methods be called for each object, and the details involved can get quite cumbersome. MFC, for
example, performs serialization either via operators or the method, Serialize(). Operators are used when the
database (or archive) needs to dynamically reconstruct an object based on information stored in a CRuntimeClass
object, and Serialize() is used when the class of the object is known beforehand and memory has been allocated for
it. Sometimes the method, WriteClass() is used to store version and class information of a base class during
serialization of a derived class. Other times the MapObject() method is used to place objects in a map that are not
really serialized but that must be available for subobjects to reference. If this already sounds confusing, just spend a
couple of hours with an MFC guide trying to decipher it.

Use of a persistent heap requires no such contortions. Objects in a persistent heap are managed globally, never
individually, which is to say that no objects ever need to be individually serialized or deserialized. When data is saved
via a snapshot, no attention is paid to which object or objects the data belongs to. The system is only concerned with
the flushing of dirty pages, and the system always knows which pages are dirty by virtue of the page protection
mechanism. Likewise, when a heap is restored, all of the objects are implicitly loaded at once, but only virtually.
Objects are physically loaded into RAM only upon being referenced, and courtesy of the VMM, this is completely
automatic and under the surface; and the objects are loaded to precisely the addresses where they left off. The VMM
doesn't finish its work there. It constantly monitors memory conditions, and it returns infrequently used objects to the
backing file as necessary. Heap transactions also have global scope. Recall that when a transaction is closed,
effectively every object in the heap is saved in one instant. The upshot is that no object in a persistent heap ever
needs to be derived from an archive or similar class to gain persistence. It only needs to be allocated in the heap. In
fact, objects in the strictest sense don't need to be used at all because the system provides C interfaces, which
means that even plain old structs can easily be made persistent.

Another feature of the persistent heap that simplifies its use is that it stores memory pointers. Pointers are the most
efficient and straightforward way for objects to reference other objects. Object databases don't allow pointers to
reference persistent objects and require that references be made through instances of the reference class,
d_Ref<T>, which can be quite cumbersome. By allowing pointers, the same code used to create and manage
structures in a volatile heap can be used in a persistent heap, which in turn minimizes bugs. This also maximizes
speed by avoiding the overhead associated with reference objects.

Now extra code is required to enable transactions and rollback in a persistent heap, but this will be the case for any
transactional framework. However, these features are straightforward to implement because the system offers a
logical, systematic, and intuitive interface. Transactions can take any form imaginable. Data and procedure entries
allow almost any type of operation to be logged in an uncomplicated way in both the log and the rollback buffer; and
never having to check for overflow in either structure greatly simplifies the process. Encapsulation of this process is
possible through the use of l-procs and r-procs, which offer a methodical way to construct intricate and complex
transactions. For desktop applications, multiple undo and redo is equally easy to implement, especially when u-procs
are employed.

The administration of a persistent heap is also effortless. There is only one function that is ever used to restore a
persistent heap, CPersistentHeapManager_pcRestoreHeapFromHeapFiles(). Just pass in the handles of the two
heap files and the transaction log, and it will examine the disposition of these files to determine what state the heap
was in at the time of closure and apply the appropriate reconstruction method. Furthermore, heap restoration is
always very fast, as will be shown below. Some systems spend excessive amounts of time rebuilding internal
structures after a failure. This is not the case for the Taiyen system, which was designed to thrive in harsh computing
environments such as those associated with military systems, where system failure has to be anticipated as likely.
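
Restoration really does reduce to one call, roughly as sketched below; the exact parameter order of
CPersistentHeapManager_pcRestoreHeapFromHeapFiles() and the file handle variables are assumptions:

// Sketch only: restore a transactional heap from its two heap files and its transaction log.
// The function examines the files' disposition and picks the appropriate reconstruction method.
CPersistentHeapManager *pcHeapManager =
CPersistentHeapManager_pcRestoreHeapFromHeapFiles(hBackingFile, hDataFile, hTransactionLog);

if (pcHeapManager == NULL)
// None of the reconstruction methods could produce a consistent heap.
{
// Handle the failure.
}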

Reliability
The volatile and persistent heaps described herein have undergone thousands of hours of testing, and have
been attacked from all angles. The Taiyen system was designed with many facilities, both internal and external, to aid
in the testing and debugging process. Internally, the debug (or checked) build uses countless assertions to verify
internal structures, and special assertions to catch and diagnose multithreading bugs. Also, the system can be built in
both debug and release modes with a heap statistics module that maintains exhaustive statistics on a slew of metrics.
When the module is enabled, the memory tester described earlier compares its own statistics with those of a test
heap to verify consistency. The heap's checksum and verification functions have also proved invaluable for
scrutinizing both volatile and persistent heaps, especially post decompression and restoration. These functions are
available to the user and should be used liberally. The backbone of the testing regimen has been the memory tester,
but numerous other specialized testing applications have been developed. One of note is a program that reads in a
script that loads a series of blocks in a heap, performs a specified operation, and then compares the disposition of the
heap with that predicted by the script. To aid debugging, the scripts mirror CAD drawings that show the specified
operations graphically. This program was especially useful for testing unlikely scenarios that could take years for a
random number generator to reproduce.

Interestingly, it is the testing regimen that should define the boundaries of a component. As the number of inter-
related features grows, the size of the proper testing program grows geometrically because the number of
permutations to be checked grows geometrically as well. In the case of the Taiyen system, probably 80% of the
development time was spent in testing versus coding the actual binary. The limits of the system were not set so much
by what was possible to code, but by what was possible, economically, to test (at least to the extent where reliable
components can be delivered).

Persistent Heap Performance
The memory tester can be configured to test snapshot and transactional heaps. It uses the power of heap
persistence to resume an old simulation that was halted either on purpose or because of a crash or system failure.
This is a great debugging aid because it allows a simulation or series of simulations to run for thousands of hours; but
if a bug occurs that shuts the simulation down, instead of having to restart it from scratch, it can be restarted at a
point just prior to failure, allowing the faulty code to be pinpointed quickly. The tester always reloads the heap in a
state that was verified to be stable and seeds the random number generator precisely to reproduce the exact
condition that led to failure.

During a persistent heap test, allocations, resizings, and deallocations are made just as they are during a volatile
heap test. When the heap is transactional, the tester makes a log entry for each heap operation, as well as the other
ancillary data and procedure entries needed to maintain coherence between the heap and log. To test transaction
speed, the tester, given a maximum number N, will write a terminator after a random number of entries between one
and N. In the performance tests covered below, N was set to 16, yielding an expected average of eight entries per
transaction.

After a specified number of tests, the tester performs a restoration of some sort, cycling through the list below:

1) The heap is restored from the backing file simulating a system failure during a snapshot - heap state one.

2) The heap is restored from the data file simulating a system failure during normal processing - heap state two.

3) The heap is restored from congruent heap files simulating a normal heap closure - heap state three.

4) The heap is compressed, cleared, and then restored from one of the compressed heap files. The tester uses two
   compressed heap files and alternates between the two in case the system were to fail while the heap was being
   compressed, in which case the heap could still be restored from the previous file.

For the tests cited here, we used a 2.8 GHz Pentium 4 PC running Windows XP, with the heap files and log located
on a Maxtor 6Y200PO Ultra ATA-133 drive. Five simulations were conducted on each platform and all used the same
input file, SimulationInputDataPersistent2.h, with the output being directed to Taiyen_Per_IA32_Win_Sim_1-4.txt.

Figure 21 shows some of the input parameters:

Sim  Lower Alloc.  Upper Alloc.  Max. Alloc.  Avg. Alloc. Size    Avg. Dealloc. Size  Dyn. Block  Dyn. Block Size     Avg. % to      Avg. % to      Worker
     Limit         Limit         Size         Lower / Upper       Lower / Upper       Op. %       Lower / Upper       Increase Size  Decrease Size  Threads

1    32 MB         64 MB         1 KB         16 B / 85 B         16 B / 85 B         10          16 B / 1 KB         50             50             1
2    16 MB         64 MB         4 KB         16 B / 341 B        16 B / 341 B        10          16 B / 4 KB         50             50             1
3    1 MB          64 MB         16 KB        16 B / 1365 B       16 B / 1365 B       10          16 B / 16 KB        50             50             1
4    1 MB          64 MB         64 KB        16 B / 5461 B       16 B / 5461 B       10          16 B / 64 KB        50             50             1
5    1 MB          64 MB         256 KB       16 B / 21.8 KB      16 B / 21.8 KB      10          16 B / 256 KB       50             50             1

Figure 21

These parameters are similar to those used for the volatile heap tests with a couple of notable differences. One,
every simulation includes dynamic block operations but at a relatively low percentage, 10%. Two, all of the
simulations were run on one thread. As mentioned, access to the log has to be serialized to maintain coherence with
the heap, so there was no point in utilizing multiple threads for these tests.

For all of the tests, the maximum number of operations between transaction terminations was 16, and the number of
tests between heap restoration was also 16. The log size for each simulation was 128 MB, and the log buffer size
was LOG_BUFFER_SIZE (128 KB on 32-bit platforms and 256 KB on 64-bit platforms).

The following are the results for simulation one:

Average cache efficiency: 100.000000


Percent of blocks that are misaligned: 0.000000

Current percent of address space used: 44.367427

Current number of operations per second: 1,977,878.010424


Average number of operations per second: 1,838,701.534810
Maximum number of operations per second: 2,985,814.933781

Current number of tests per second: 11.863248


Average number of tests per second: 0.846547
Maximum number of tests per second: 13.888964

Current number of allocated bytes: 59,548,952


Average number of allocated bytes: 46,047,440.781250
Maximum number of allocated bytes: 67,108,864
Current average allocation size per operation: 16
Current average deallocation size per operation: 16
Average allocation size per operation: 16.000000
Average deallocation size per operation: 16.000000

Dynamic block operation percentage: 9.773631


Percent of dynamic block operations that are resizings: 88.917837
Total number of dynamic block allocations: 93,545
Total number of freed dynamic blocks: 88,174
Current number of allocated dynamic blocks: 5,371
Current number of dynamic block bytes: 682,508
Current average dynamic block size: 127.072798
Minimum dynamic block size: 16
Average dynamic block size: 87.268335
Maximum dynamic block size: 128
Average percent dynamic block increase: 48.400719
Average percent dynamic block decrease: 47.915475

Number of dynamic block expansions in place: 548,769


Number of dynamic block expansions with relocation: 321,109
Number of dynamic block contractions in place: 247,749
Number of dynamic block contractions with relocation: 340,397
Percent of dynamic block expansions in place: 63.085743
Percent of dynamic block expansions with relocation: 36.914257
Percent of dynamic block contractions in place: 42.123724
Percent of dynamic block contractions with relocation: 57.876276

Total number of static block allocations: 8,768,362


Total number of freed static blocks: 6,369,109
Current number of allocated static blocks: 2,399,253
Current number of static block bytes: 58,866,444
Current average static block size: 24.535322
Minimum static block size: 16
Average static block size: 33.854126
Maximum static block size: 128

--------------------------------------------

Number of log entries: 12,807,351.000000


Number of log transactions: 1,441,968.000000
Average number of log entries per transaction: 8.881855
Average transaction size: 274.521254
Average transaction flushing efficiency: 35.081203
Number of forced snapshots: 0
Number of timed snapshots: 12
Average snapshot size: 72615253.333333
Average dirty page bytes per log entry bytes: 0.692030

Number of heap restorations: 16

Average log entry speed (log entries per second): 3,986,259.995762


Average transaction termination speed (transaction terminations per second): 5,357.543433
Average transaction speed (transactions per second): 5,294.343605
Average transaction speed (bytes per second): 1,453,409.847199
Average snapshot speed (bytes per second): 12,787,703.611579
Average sustained throughput (transactions per second): 4,908.287731
Average sustained throughput (bytes per second): 1,347,429.304278

Average heap restoration from backing file speed (bytes per second): 9,978,499.730237
Average heap restoration from data file speed (bytes per second): 1,418,307.172064
Average heap restoration from congruent heap files speed (bytes per second): 1,625,675,118.906685
Average heap compression speed to buffer (bytes per second): 543,986,452.160653
Average heap decompression speed from buffer (bytes per second): 163,992,872.510485
Average heap compression speed to file (bytes per second): 44,198,439.568796
Average heap decompression speed from file (bytes per second): 40,802,358.469421

Notice the exceptional transaction speed of nearly 5,300 per second with an average of approximately nine log
entries per transaction. The tests were run without enabling file system buffering for the log, which means that each
transaction was written directly to the drive. The drive’s 8-MB write cache, however, was enabled, which allows the
drive to signal that a write is complete once the data stream encompassing that write is resident in cache, and not
necessarily on disk. In contrast, when the write cache is disabled, streamed bytes must be resident on the actual disk
media before the drive can accept the next transaction, which will severely hamper I/O speed (IOS). For example,
when the test at hand was run with the write cache disabled, transaction speed was reduced to a dismal average of
116 per second, even with all writes being completely sequential. This, of course, illuminates the principal bottleneck
in disk drives – mechanical latency, which is measured by the two quantities: average seek time and rotational
latency. Actually, the total random access time includes other second order factors and is given by the following sum:
(average seek time) + (average rotational latency) + (transfer time) + (controller overhead).

Based on the performance data for the Maxtor drive, the 116/s-transaction speed is predictable. The published
numbers declare an average seek time ≤ 9.4 ms, an average rotational latency = 4.18 ms, and a controller overhead
< .3 ms. Assuming that the log occupies contiguous sectors and tracks in the drive, and given that writes to the log
are sequential, the seek time can be ignored. With an average of 856 sectors per track (or 438,272 bytes per track)
and an average transaction size of 275 bytes, each track holds roughly 1,600 transactions, which means that the
drive performs relatively few seeks as the log is populated. And when a seek is performed, it is only to the next track
over (a track-to-track seek takes .9 ms on this drive). With transfer times in the tens of MBs per second and a
controller overhead of only .3 ms, the dominant factor is clearly rotational latency. The average latency of 4.18 ms
would predict a transaction speed of nearly 240/s, but interestingly, because each transaction is made next to the one
before it, the drive may have to wait one full revolution before the disk head is in the correct position to make the next
write, resulting in a worst case latency of 8.36 ms (1/.00836 ≈ 116). There might be ways to optimize the process by
spacing writes such that the target sector always leads the disk head by a few degrees, but this is a moot point.

As the test shows, employing the write cache eliminates nearly all of the drive latency. As mentioned, the log buffer is
flushed when a transaction is terminated, so it’s actually the termination speed that specifically measures the storage
IOS. (Note that the transaction speed is calculated by adding the log entry overhead to the transaction termination
speed, but at nearly four million entries per second, this overhead is minimal.) At 5,358 terminations per second, the
latency calculates to .187 ms, which can almost entirely be attributed to controller overhead. Throughput is
maximized because sequential writes allow a continuous stream of bytes to pass through the write head as the platters
spin underneath. The average transaction flushing efficiency of 35% shows the actual amount of transaction data
written to disk as a percentage of each log-buffer flush. Because these flushes bypass the file system cache, they are
aligned along sector boundaries and are sector multiples in size. Given a disk sector size of 512 bytes and an average
transaction size of 275 bytes, a 35% flushing efficiency is in line with expectations: a 275-byte transaction occupies one
or two 512-byte sectors, so flushes of 512 to 1,024 bytes put the expected efficiency between roughly 27% and 54%.

The extra bytes written per flush only trivially affect IOS. The transaction throughput at 1.45 MB/s is well below the
drive’s maximum sustained transfer rate of 37 to 67 MB/s, which varies depending on where the write head is with
respect to the disk’s outer diameter, where throughput is maximized. To gain further insight, the same test was run
with a sector size of 1,024 bytes, which lowered the flushing efficiency to 21.2%. The resulting speed of 5,317 transaction
terminations per second reinforces the understanding that for these relatively small transaction sizes, latency
(specifically, controller latency) is the predominant factor limiting IOS.

Another useful metric is the average number of dirty page bytes per log entry bytes (DPPL), which turned out to be
around .7 for this test. The DPPL is an indicator of the locality of heap operations where the lower the ratio, the higher
the locality. Imagine resetting the same byte repeatedly. This would rapidly populate the log while dirtying only one
page, which would yield a DPPL approaching zero. The test referenced here, because it continually churns the
memory by allocating, resizing, and freeing blocks of random sizes, renders a higher DPPL than would be expected
by a real world system.

Notice that the average compression and decompression speeds of 44.2 and 40.8 MB/s fall within the drive’s range
of maximum sustained transfer rates. Snapshots, which entail flushes to two files instead of one, should theoretically
be half the speed of compressions but fall short at about 57% of this mark. The reason is that snapshots are not
completely sequential, as compressions and decompressions are. Because the snapshot algorithm does coalesce
neighboring dirty pages into single block writes, snapshots are largely sequential (and with a smaller DPPL ratio,
even more so), but even a small amount of sparseness in a series of disk writes can induce latency overhead. On the
other hand, any sparseness in the snapshot could’ve been mitigated had the drive used a cache larger than 8 MB,
which was completely overwhelmed by the average snapshot size of 72.6 MB.

Nevertheless, an average snapshot speed of 12.8 MB/s is respectable, and by no means is the intended storage a
$50.00 ATA drive. One would expect even a modestly priced system to connect to U320 SCSI or Fibre Channel
RAID, where snapshot speeds are measured in hundreds of MB per second. To avoid situations where a pending
snapshot becomes so large so as to noticeably affect system availability, the current dirty page count as given by
CPersistentHeapManager_uwGetNumDirtyPages() could force a snapshot merely based on the dirty page tally
breaching a certain threshold. Note that snapshot speed can also be increased by lazily and continuously writing dirty
pages to the backing file such that when each snapshot occurs, most or all of the flush is directed towards the data
file, thereby reducing the total number of bytes flushed per snapshot.

The restoration speed from congruent heap files at 1.61 GB/s is extremely fast as it primarily entails mapping in the
backing file’s views. This is the most efficient way of loading persistent data because disk access is kept to a
minimum – pages are only read from disk as they are touched. Backing and data file restorations, with speeds
averaging 10.0 and 1.5 MB/s, respectively while decent, aren’t nearly as fast because large disk reads are necessary
to load a consistent heap image. Data file restorations are further hindered by having to process the transactions
found in the log.

Transaction Engine
A potential shortcoming of the one-writer / many-reader model is that transactions cannot be made concurrently on
SMP platforms. We knew this from the outset, but the question is: given the interaction with a log that sits on a shared
bus, to what extent can concurrency exist throughout the entire transaction process? Will multiple writing threads all
block each other as they wait their turn to access the log or respective logs?

Using concurrent writing threads can yield performance gains when the latency of the log device is many times that of
the interconnect, as it is for disk drives and even disk arrays, in which case transaction throughput could be increased
by spreading transactions across multiple logs residing on multiple devices. One problem with the approach is that
unless the modifying operations are completely disconnected, some rather intricate code may be required to ensure
isolated transactions. This wouldn’t be practical for a transactional heap, because it would imply that the heap would
require knowledge of the application logic, which would violate the rules of encapsulation. Secondly and more
importantly, as the speed of the log device approaches that of the interconnect, any advantage gained by distributing
transactions will be negated as the interconnect now becomes the bottleneck. In this case, writing thread concurrency
will show little to no net performance gain.

This is an important insight, which teaches that, given the correct hardware configuration, the one-writer / many-
reader model can be used without a performance trade-off. From the application development perspective, this model
is highly desirable because it offers a simple and clean way to manage resources in a multi-client environment. It
allows heap-layer encapsulation, guarantees transaction isolation for every operation, and reduces the likelihood of
synchronization bugs, which are the most difficult to find and correct. CPU utilization won’t suffer either with this
model because multiple CPUs will be engaged to process queries and to accelerate individual queries to the extent
that they are parallelizable.

Of course, this is predicated on the log’s device being able to receive transactions at interconnect speeds, but this
doesn’t mean that the device necessarily has to match the interconnect’s bandwidth. The reason is that the log feed
is non-pipelined, which is to say that each transfer must be completed before the next one can be initiated. For the
relatively small transaction sizes expected in practice (<= 512 B), this can result in net transaction speeds
significantly lower than the interconnect’s bandwidth because of non-transfer latencies. This means that even on the
new breed of ultra-high throughput interconnects such as PCI-X 266 and PCI Express, an SSD using low-cost DRAM
(as opposed to SRAM) will be able to handle the feed.

For example, recent tests published by Intel of an eight-lane PCI Express implementation with an average payload
size of 2,048 bytes showed non-pipelined transfer rates averaging 2.2 GB/s. But when the transfer size averaged 512
bytes, the transfer rate fell to nearly 1.2 GB/s, showing that non-transfer latency factors noticeably for these smaller
payloads. Still, that figure is impressive and higher throughputs are possible, because PCI Express can scale to 16
lanes, predicting small transaction throughputs of roughly 2.4 GB/s. The challenge for the next generation of SSD will
be to receive transactions at a rate measured in millions per second at throughputs of GB per second. Such a unit will
definitely have to be directly attached to the interconnect and consist of battery-backed high-speed DRAM. At these
speeds, it will also be a challenge for the host’s CPUs and software to keep pace. For most high-demand
applications, the typical bus-attached PCI-64 SSD, which is capable of an IOS measured in the 100,000s per second,
will suffice.

The log can be placed on an SSD because it can be made relatively small with respect to the heap and remains of
fixed size throughout its life. This is by virtue of the snapshot mechanism, which flushes the pages dirtied since the
last snapshot and clears the log, paving the way for new transactions. With the heap files placed on disk-based bulk
storage, which can sustain a sequential throughput of 320 MB/s on U320 SCSI RAID, for example, (but an IOS of
only tens of thousands per second) an SSD-based log could dramatically increase the system’s transaction speed at
a minimal cost percentage-wise. When so configured, the system can convert a choppy input data stream into a
smooth, bursty feed that leverages the throughput capabilities of the disk array and its interconnect, as illustrated in
Figure 22 below.

[Figure omitted: a transactional heap converting a choppy input data stream into a smooth, bursty output.]

Figure 22

Another advantage of moving the I/O burden to an SSD is that cheaper and less power-hungry RAID can be used
without incurring a performance hit. Typically, RAID systems require hundreds of drives and gigabytes of cache to
achieve a sustained IOS comparable to that of a mid-range SSD. But relatively few drives can sustain the high
sequential throughputs needed for snapshots. With internal transfer rates in the 50 MB/s, seven drives could
theoretically saturate one U320 SCSI channel. The upshot is that the ideal storage for heap files are RAID systems
with fewer larger drives as opposed to many small drives, which promises to reduce hardware costs, maintenance
costs, and power costs for both the storage and environmental systems.

The throughput capability of the system will depend upon log speed and snapshot speed. Equation two below
defines the net throughput, TP, in bytes per unit time as the number of transaction bytes, Tb, in the log (at the time of
a snapshot) divided by the sum of the average transaction time, Tt, and the average snapshot time, St. The two
times are added together because writing threads are suspended during snapshots. Recall that snapshots must
occur within the log mutex.

$$TP = \frac{T_b}{T_t + S_t} \qquad \text{(Eq. 2)}$$
The number of bytes flushed during a snapshot is equal to (DPPL • Tb), and dividing this product by the snapshot
speed, Ss, yields the snapshot time. Equation three substitutes this quantity for St.

$$TP = \frac{T_b}{T_t + \dfrac{\mathrm{DPPL} \cdot T_b}{S_s}} \qquad \text{(Eq. 3)}$$
Dividing the numerator and denominator by Tb and substituting the transaction speed, Ts, for the quotient (Tb / Tt)
yields the final throughput equation below.

$$TP = \frac{1}{\dfrac{1}{T_s} + \dfrac{\mathrm{DPPL}}{S_s}} \qquad \text{(Eq. 4)}$$

This equation provides good insight into transactional heap performance. As expected, maximizing transaction and
snapshot speeds is key, but what may have been unclear is the extent to which DPPL plays into the equation.
Clearly, the smaller the DPPL the better. Two factors combine to reduce DPPL: locality and time. If modifications are
confined to relatively few pages, the DPPL, by definition, will continue to shrink as entries are added to the log. But
allowing the dirty page count to grow beyond a certain point means that any upcoming snapshot may be large
enough to noticeably delay writing threads. Most applications will want to keep snapshots under a second, so if a
100 MB/s snapshot speed can be achieved with a decent RAID system, the snapshot trigger might be set at 50 MB of
dirty pages. If the DPPL for a particular application averaged 0.2, a log of at least 250 MB (50 MB / 0.2) would be
needed to avoid automatic snapshots. Notice how the DPPL determines the minimum log size: the lower the DPPL,
the larger the requisite log. Now imagine that the log storage can sustain a transactional throughput of ten MB/s. In
this case, the heap could sustain a net throughput of 9.8 MB/s (1 / (1/10 + 0.2/100)) with snapshots occurring
roughly every 25 seconds (250 MB / 9.8 MB/s).
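
To make this arithmetic easy to repeat for other hardware profiles, the relationship can be packaged into a small
helper. The sketch below is not part of the Taiyen API; the function and variable names are invented here, and it
simply evaluates Eq. 4 and the minimum-log-size relationship for the figures used above.

    #include <cstdio>

    // Eq. 4: net throughput from transaction speed Ts, snapshot speed Ss, and the
    // dirty-page-bytes-per-log-byte ratio DPPL. All speeds are in MB/s.
    // (Hypothetical helper for illustration only; not part of the Taiyen API.)
    static double NetThroughput(double Ts, double Ss, double DPPL)
    {
        return 1.0 / (1.0 / Ts + DPPL / Ss);
    }

    // Smallest log that avoids automatic snapshots for a given snapshot trigger
    // (MB of dirty pages), since the dirty-page total grows as DPPL * log bytes.
    static double MinLogSizeMB(double snapshotTriggerMB, double DPPL)
    {
        return snapshotTriggerMB / DPPL;
    }

    int main()
    {
        const double Ts = 10.0;        // log (transaction) throughput, MB/s
        const double Ss = 100.0;       // snapshot throughput, MB/s
        const double DPPL = 0.2;       // dirty-page bytes per log byte
        const double triggerMB = 50.0; // snapshot trigger

        std::printf("net throughput: %.1f MB/s\n", NetThroughput(Ts, Ss, DPPL));   // ~9.8
        std::printf("minimum log size: %.0f MB\n", MinLogSizeMB(triggerMB, DPPL)); // 250
        return 0;
    }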

In the test results above, the sustained throughput of 1.34 MB/s is only slightly below the log throughput of 1.44 MB/s,
indicating that on the hardware used for this test, transaction speed has a long way to go before snapshot speed
factors appreciably into the equation. To simulate the use of an SSD, the same test was run with file buffering enabled
for the log. The resulting transaction speed was more than ten times the non-buffered figure, at 56,639
transactions/s, which yielded a net throughput of 8.5 MB/s, or 30,972 transactions/s once snapshot time is included.
To put this figure in perspective, it equates to over 1.8 million transactions per minute. At the time of this writing, the
world record transaction speed was around one million per minute.

Multiple Heaps
The Taiyen system provides for the use of multiple heaps. Because of the volatile heap's ability to scale with multiple
processors, using more than one in a process will probably be unnecessary unless the application will run on a
NUMA platform. As described earlier, using multiple volatile heaps is a way to achieve scalability on a NUMA system.
There is more likely to be a use for multiple persistent heaps, and one can imagine many cases where it is desirable to
have one volatile and at least one persistent heap in a process.

There are only a handful of constraints regarding the use of multiple heaps:

1) All heaps must be created within a contiguous block of address space, called the heap space.
2) A heap must be registered before it can be used.
3) Upon closure, a heap should be unregistered.

Configuring the heap space for multiple heaps is a three-step process: first, the individual heaps are created or
restored; next, a handful of global variables are set; and finally, the heaps are registered, one at a time, by calling
RegisterHeap(). Note that each persistent heap requires its own heap files and transaction log, and to maintain the
integrity of their respective logs, each heap must be insulated from the others. Any interdependence, such as one link
in a list residing in one heap while the next resides in another, could lead to a catastrophic loss of coherence between
a heap and its log.
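
The sequence can be sketched as follows. Apart from RegisterHeap() and UnregisterHeap(), every identifier in this
sketch (the heap classes, the creation calls, the example sizes, and the globals describing the heap space) is
hypothetical and stands in for whatever the actual API provides.

    #include <cstddef>

    // Hypothetical stand-ins for the Taiyen declarations; only RegisterHeap() and
    // UnregisterHeap() are named in this paper.
    struct VolatileHeap;
    struct PersistentHeap;
    VolatileHeap*   CreateVolatileHeap(void* base, std::size_t size);
    PersistentHeap* CreatePersistentHeap(void* base, std::size_t size,
                                         const char* heapFile, const char* logFile);
    void RegisterHeap(void* heap);
    void UnregisterHeap(void* heap);
    extern void*       g_heapSpaceBase;   // illustrative globals describing the heap space
    extern std::size_t g_heapSpaceSize;

    constexpr std::size_t VOL_HEAP_SIZE = std::size_t(1) << 30;   // example sizes only
    constexpr std::size_t PER_HEAP_SIZE = std::size_t(1) << 30;

    void ConfigureHeapSpace(void* base /* start of one contiguous reserved region */)
    {
        // Step 1: create (or restore) each heap inside the contiguous heap space.
        // The heaps must remain insulated from one another: no block in one heap
        // may point into another, or a heap and its log could lose coherence.
        VolatileHeap*   vheap = CreateVolatileHeap(base, VOL_HEAP_SIZE);
        PersistentHeap* pheap = CreatePersistentHeap(
            static_cast<char*>(base) + VOL_HEAP_SIZE, PER_HEAP_SIZE, "app.heap", "app.log");

        // Step 2: set the handful of globals that describe the heap space.
        g_heapSpaceBase = base;
        g_heapSpaceSize = VOL_HEAP_SIZE + PER_HEAP_SIZE;

        // Step 3: register each heap, one at a time.
        RegisterHeap(vheap);
        RegisterHeap(pheap);

        // ... use the heaps ...

        // Upon closure, unregister each heap before destroying it.
        UnregisterHeap(pheap);
        UnregisterHeap(vheap);
    }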

Once the heap space is configured, the macros PM_GET_HEAP_FROM_ADDRESS(),
PC_GET_PER_HEAP_FROM_ADDRESS(), and PC_GET_VOL_HEAP_FROM_ADDRESS() may be used to quickly
resolve the address of a stored block to that of the heap in which it is stored. The last two macros are versions of the
first, with the returned pointer cast to the appropriate heap class. Because all of the blocks stored by a heap reside
within its user space, no special look-up tables are required to aid in this resolution.

The most efficient configuration is achieved when the common denominator between all heap sizes is maximized. If
all of the heaps are 2^n bytes in size and are aligned on boundaries that are multiples of 2^n, the owning heap's
address may be resolved from the address of any block it stores simply by clearing the least significant n bits of that
address. This type of resolution is markedly faster than that given by PM_GET_HEAP_FROM_ADDRESS().
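
A minimal sketch of the bit-clearing resolution appears below, assuming 2^n-byte heaps on 2^n boundaries. The shift
value and function name are illustrative, and PM_GET_HEAP_FROM_ADDRESS() remains the general-purpose
alternative when the layout does not meet this constraint.

    #include <cstdint>

    // Assumed layout: every heap is 2^kHeapShift bytes and aligned on a
    // 2^kHeapShift boundary (e.g., 1-GB heaps when kHeapShift == 30).
    constexpr unsigned       kHeapShift = 30;
    constexpr std::uintptr_t kHeapMask  = ~((std::uintptr_t(1) << kHeapShift) - 1);

    // Resolve the owning heap's base address from the address of a stored block
    // by clearing the least significant kHeapShift bits. Returned as void* since
    // the real heap classes are defined by the Taiyen headers.
    inline void* HeapFromBlockAddress(const void* block)
    {
        return reinterpret_cast<void*>(
            reinterpret_cast<std::uintptr_t>(block) & kHeapMask);
    }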

If, during the course of a process, registered heaps are destroyed and new ones are created and registered in their
place, unregister each destroyed heap by calling UnregisterHeap().

Conclusions
The Taiyen system was devised to serve a wide spectrum of applications. Application data can be segregated into
two basic types, volatile and persistent, and applications themselves into two basic types, single-user and multi-user.
Almost all types of application can potentially benefit from the improved volatile data management offered by the system,
which includes performance and productivity features such as high availability, scalability, robustness, optimal data
alignment, PTE redirection, native dynamic block support, compression/decompression, bounds checking, and leak
detection. Performance gains will vary according to how data-intensive and allocation-intensive the application is, with
those applications that produce the most data churn receiving the largest boost.

Data-hungry desktop applications such as CAD/CAE, video and audio editing, math and scientific analysis, financial
analysis, etc. will benefit from the snapshot heap's powerful feature set. Because snapshot heaps employ file
mapping, files load almost instantaneously. Saves (snapshots) are fast and efficient as well, because only modified
pages are flushed to disk, and this comes without having to write any specialized code. More importantly, saves
are fail-safe, and recoveries from failure are painless. By inheriting volatile heap functionality, persistent heaps
manage data churn more effectively than competing solutions and yield excellent page utilization and, through sparse
files, excellent disk utilization. These features, along with features like built-in undo/redo, promise to improve developer
productivity as well and allow the creation of ever richer and more powerful applications.

High-demand server applications such as DBMSs, transaction coordinators, etc. will benefit from the transaction
throughput capabilities of transactional heaps, built-in rollback, ACID transactions, and the features inherited from
snapshot and volatile heaps. First and foremost, the Taiyen system is a performance component, and it is in the
server space where performance counts the most. The system, capable of unparalleled transaction speeds on mid-
range hardware, promises to maximize transactions per second per dollar, in turn lowering IT infrastructure and
operating costs and increasing the accessibility of transactional systems to an ever wider audience. The economic
feasibility of many nascent technologies such as micro-payments and RFID depends on the ability to sustain low per-
transaction costs. But beyond performance, transactional heaps are more adaptable and extensible than other
transactional systems, which typically shoehorn the developer into a particular data model. They offer a semantically
agnostic architecture on which any imaginable application can be built, using a variety of methodologies.

To assist the developer, we have also prepared a thorough API guide, demonstration programs, and eight highly
detailed user guides that progressively take the user from the most basic features to the most advanced.
