Arolett - Thesis (Describes The Use of BCEL API Inside Dstm2)

A Java Implementation of the Rochester Software Transactional Memory Library
Aaron Rolett September 30, 2007
Abstract
Recent advances in chip design have necessitated the development of a new programming model to simplify multithreaded programming. Transactional Memory (TM) replaces lock-based thread synchronization with a new interface that approximates global lock semantics. In this paper, an implementation of RSTM in Java is compared and contrasted with the C++ RSTM implementation as well as Sun's own DSTM2 library implementation. The version of RSTM-Java evaluated in this paper provides Sun's DSTM2 API. The additional overhead introduced into the RSTM library by running it in Java with the Sun DSTM2 API is also discussed.
1 The Case for Transactional Memory
In 1965, Gordon Moore predicted that the number of transistors on a chip would double about every two years [3]. For the past 42 years, Moore's law has more or less held true. However, as clock speeds and the number of transistors on a chip have increased, so has the unwanted heat dissipation. Several years ago, chip engineers reached a point where they could not increase the speed of a chip and still effectively dispel the heat with air convection cooling. Their solution was to place multiple, slightly less powerful, chips or cores together on one silicon die. These chips have become popular, and it is now common for consumer and server machines to have 2, 4 or even 8 cores.
The increased availability of multiprocessor machines means that programmers must now write multithreaded code to fully utilize a computer's resources. Traditionally, multithreaded code has used locks to protect data shared between threads. Under the lock-based model, a thread trying to access a shared data structure must first acquire a lock that prevents other threads from concurrently modifying the structure. When implemented correctly, a lock-based programming model prevents data races. However, lock-based programming models are notoriously difficult to implement correctly and often force the programmer to choose between ease of implementation and runtime efficiency. Under a coarse-grain locking scheme, one lock protects a large data structure or a whole group of objects. This approach is highly inefficient because it creates false contention between threads. The alternative, fine-grain locking, protects each data structure individually and provides much better performance. However, the advantage of the increased performance is partially offset by the disadvantage of the increased code complexity.
1.1 Problems with the Lock-Based Programming Model
Besides the aforementioned problems, lock-based code suffers from several other problems, including:
Deadlock: Occurs when two or more threads form a circular chain where each thread is waiting for a lock held by the next thread in the chain. Once the chain exists, none of the threads that are part of it will ever make progress because each thread in the chain will be waiting indefinitely for a lock.
Priority Inversion: Occurs when a lower priority thread holds a lock that a higher priority thread needs to continue execution. The higher priority thread will not be able to continue until the lower priority thread releases the lock.
Preemption: When a thread holding a lock is preempted, other threads which need that lock will not be able to run until the preempted thread is rescheduled and has a chance to release the resource. This can have a very negative impact on performance since it often takes tens of milliseconds for a thread to be rescheduled.
Spurious thread failure: Consider the case where two threads are both waiting for the same lock. Thread B is waiting for thread A to release the lock. As a result, thread B is not doing useful work. Thread A experiences a failure and dies. Now thread B will wait indefinitely for a lock that will never be released.
1.2 A New Model: Software Transactional Memory
Transactional Memory (TM), which was originally proposed by Herlihy and Moss [7] as a hardware mechanism, aims to provide the performance of fine-grain locks while approximating the simpler interface of a coarse-grain locking model. In essence, the programmer marks blocks of code that will be executed in parallel as atomic, and the underlying system ensures that these atomic blocks linearize [9] in a consistent order.
1.2.1 Non-blocking Software Transactional Memory
To avoid many of the common problems in lock-based programming models (e.g. deadlock, priority inversion, preemption and spurious thread failure), many software transactional memory (STM) libraries are carefully designed to be non-blocking algorithms. This class of algorithms is further subdivided into [5]:
Wait-Freedom: Wait-freedom guarantees that every thread will make progress in a bounded number of steps.
Lock-Freedom: In a lock-free algorithm, some thread in the system makes progress in a bounded number of steps.
Obstruction-Freedom: Obstruction-freedom ensures that if all but one thread in the system stop, then the remaining thread, and therefore the system as a whole, will make progress. While an obstruction-free algorithm is deadlock free, it may be subject to livelock.
A common library design decision is to use an obstruction-free algorithm and then use an out-of-band contention manager to combat livelock. This allows for a simpler non-blocking algorithm which often performs better than algorithms with stronger guarantees. However, a non-blocking algorithm often requires significant overhead, including multiple copies of the data which are updated simultaneously. When designing a non-blocking STM system, it is important to understand and, when possible, minimize this overhead.
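As a minimal sketch of the lock-free guarantee described above (not code taken from RSTM or DSTM2), the following counter retries a compare-and-swap until it succeeds. The class name LockFreeCounter and the helper runDemo are hypothetical names chosen for illustration. A failed CAS implies that some other thread's CAS succeeded, so the system as a whole always makes progress even though an individual thread may retry.

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical illustration class; not part of RSTM or DSTM2.
class LockFreeCounter {
    private final AtomicInteger value = new AtomicInteger(0);

    // Lock-free increment: if the CAS fails, another thread's CAS succeeded,
    // so some thread in the system has made progress.
    int increment() {
        while (true) {
            int current = value.get();
            int next = current + 1;
            if (value.compareAndSet(current, next)) {
                return next;
            }
            // Lost the race: re-read the value and retry.
        }
    }

    int get() {
        return value.get();
    }

    // Starts `threads` threads that each call increment() `perThread` times
    // and returns the final counter value.
    static int runDemo(int threads, int perThread) {
        LockFreeCounter counter = new LockFreeCounter();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            workers[t] = new Thread(() -> {
                for (int i = 0; i < perThread; i++) {
                    counter.increment();
                }
            });
            workers[t].start();
        }
        try {
            for (Thread w : workers) {
                w.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return counter.get();
    }
}
```

No thread ever blocks while holding a lock here, so a preempted or failed thread cannot prevent the others from completing their increments.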
2 Rochester Software Transactional Memory
The Rochester Software Transactional Memory (RSTM) is an obstruction-free, object-based STM that was designed to reduce much of the overhead traditionally associated with non-blocking transactional memory systems [10]. Before reviewing the internal workings of the RSTM library, it is useful to examine how a programmer interacts with the RSTM-C++ API (see footnote 1). All transactionally shared data objects are accessed through a wrapper class of type Shared<T> and are required to inherit from Object<T>, where T is the class of the object to be shared. A transaction is always bracketed by BEGIN_TRANSACTION ... END_TRANSACTION macros. Inside a transaction, a call to Open_RO() or Open_RW() on the wrapper object allows the programmer to access a shared object. That shared object may then be read or modified. Shared data objects cannot be accessed outside of a transaction, and objects opened inside a transaction cannot be cached and used outside of it.
Footnote 1: The API mentioned in this paper is not the most current interface to the library. A new API that wraps the Open_RO and Open_RW commands is in active development. However, the older API is used in the RSTM-Java port and is more directly relevant to this paper.
Internally, RSTM maintains a set of thread-specific transaction descriptors. Each descriptor contains internal meta-data about the state of a transaction. A transaction can be in one of three states: active, committed or aborted. An active transaction is a transaction which is in the process of modifying shared data. These modifications are not visible to the rest of the system until the transaction enters the committed state. Moving from the active to the committed or aborted state is an atomic operation. If the transaction commits, all changes it made become visible to the rest of the world and no future changes are possible. If the transaction aborts, none of the changes become visible and the transaction must try again. An active transaction can only make changes to a shared object that it currently owns. Ownership of an object is acquired either eagerly, when the programmer calls Open_RW(), or lazily, right before a transaction attempts to commit. Ownership acquisition provides a hook which permits the library to detect conflicts between in-flight transactions. If a conflict is detected, an external contention manager instructs the transactions involved on how to proceed.
To enable a Shared<T> object to be modified in a transaction without making the changes visible to the other threads, the Shared<T> maintains a chain of two cloned objects. As seen in Figure 1, the chain contains an old and a new version of the object. Under an eager acquire system, writers try to acquire ownership of the object they are modifying when the object is opened. This is done in several stages: (1) read the Shared<T> pointer to the new clone; (2) determine which of the two clones is the correct one to use; (3) create a copy of the current data object, initialize its ownership field with a pointer to the current transaction, and place a pointer to the object found in step 2 in the copy's next pointer; (4) use a Compare-and-Swap (CAS) instruction to atomically update the Shared<T>'s object pointer to the newly created object; (5) add the created object to the transaction's write list so that it can be cleaned up if the transaction aborts [10].
The correct version of the object is determined by looking at the pointer to the new data object and at the new data object itself. The low bit of the pointer to the new data object indicates whether the object is currently owned by a transaction. If the bit is 0, then the new data object is the most current version. However, if the bit is 1, then the system looks at the new data object's Owner and Status fields to determine the correct object. If the current owner is committed, then the new data object is the most current version. However, if the owning transaction is active or aborted, then the old object is the most recent object.
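The version-selection rule described above can be sketched in Java as follows. This is an illustration under stated assumptions, not the RSTM source: the types Status, Descriptor, DataObject and VersionResolver are hypothetical, and the low bit of the pointer is modeled as a plain boolean parameter named ownedBit.

```java
// Hypothetical types for illustration only; they are not the RSTM sources.
enum Status { ACTIVE, COMMITTED, ABORTED }

class Descriptor {
    volatile Status status = Status.ACTIVE;
}

class DataObject {
    final int payload;       // stand-in for the user's data fields
    final Descriptor owner;  // transaction that installed this clone (null if none)
    final DataObject next;   // the older version in the two-element chain

    DataObject(int payload, Descriptor owner, DataObject next) {
        this.payload = payload;
        this.owner = owner;
        this.next = next;
    }
}

class VersionResolver {
    // ownedBit models the low bit of the header's pointer to the new data object.
    static DataObject current(DataObject newData, boolean ownedBit) {
        if (!ownedBit) {
            return newData;              // bit is 0: the new version is current
        }
        if (newData.owner.status == Status.COMMITTED) {
            return newData;              // owner committed: its changes are visible
        }
        return newData.next;             // owner active or aborted: use the old version
    }
}
```

Because an aborted or still-active owner leaves the old version in force, a reader never observes the uncommitted contents of the new clone.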
In addition to the eager acquire method described above, a transaction can delay acquiring objects for writing until immediately before the transaction commits. This lazy acquire method follows almost exactly the same methodology as eager acquire, with one exception. In lazy acquire, the transaction does not attempt to acquire ownership of an object immediately. Instead, the object is placed in a lazy write list, and immediately before the transaction commits it attempts to acquire all objects that it has written to.
When a transaction requests read-only access to an object, it does not need to obtain ownership of that object. Instead, it simply needs to notice, or receive notification, if the object has changed between the point at which it was read and the moment that the transaction attempts to commit. Allowing a transaction to read an object without acquiring ownership allows for increased parallelism within the system. For example, transaction A may read an object which transaction B is modifying. Transaction A may commit as long as it does so before transaction B commits.
RSTM provides two possible methods for notification when an object changes. The first method, called visible readers, makes all reads of an object visible to all transactions. This allows a transaction that successfully commits to abort all other doomed transactions. The second method, called invisible readers, requires each transaction to maintain a private list of all the objects it has read and the state of each object at the time it was read. Then, whenever the transaction tries to commit or open a new object, it ensures that all the objects it has read thus far are still valid. This step is called validation.
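A minimal sketch of invisible-reader validation, assuming a simplified header that publishes only a reference to its current version. The names SharedBox and InvisibleReadSet are hypothetical and do not come from RSTM. Each read records the version that was seen, and validation checks that every recorded version is still the current one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical stand-in for a shared object header: it publishes the current version.
class SharedBox {
    final AtomicReference<Object> currentVersion;

    SharedBox(Object initial) {
        currentVersion = new AtomicReference<>(initial);
    }
}

// Private, per-transaction read list used by an invisible reader.
class InvisibleReadSet {
    // One entry per object read: the header and the version seen at open time.
    private static final class Entry {
        final SharedBox box;
        final Object seen;

        Entry(SharedBox box, Object seen) {
            this.box = box;
            this.seen = seen;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Records the version observed so that it can be re-checked later.
    Object openReadOnly(SharedBox box) {
        Object seen = box.currentVersion.get();
        entries.add(new Entry(box, seen));
        return seen;
    }

    // Validation: every object read so far must still hold the version that was read.
    boolean validate() {
        for (Entry e : entries) {
            if (e.box.currentVersion.get() != e.seen) {
                return false;
            }
        }
        return true;
    }
}
```

The cost of this scheme is visible in the loop: each validation walks the whole read list, which is the price paid for not announcing reads to other transactions.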
[Figure 1 diagram. Labels: object header (Clean Bit, New Data, Visible Readers); transaction descriptor (ACTIVE); data object, new version (Owner, Old Data); data object, old version.]
Figure 1: RSTM meta-data. Visible Readers are located inside the object header as a bitmap. An unlimited number of invisible readers together with up to 32 visible reader transactions can all simultaneously read an object.
3 The DSTM2 API
Another TM library, DSTM2 [6], was developed by researchers from Sun's Scalable Synchronization Research Group. DSTM2 builds on the group's earlier TM library, DSTM [8]. Its two main goals were to simplify the TM programming interface and to provide a library backend that facilitates researchers' experiments with new kinds of TM implementations. Using the DSTM2 interface is quite easy. First, the programmer designs an interface which meets certain conditions. Then, the interface is passed to a Transactional Factory, which implements an anonymous class and returns an atomic object factory. This factory creates instances of the newly implemented class on demand [6]. These instances are used in transactions to provide data-race-free code. Interfaces passed to the transactional factory must consist of one or more pairs of getter and setter methods. The names given to these pairs of methods are important: the Transactional Factory uses them to determine how to implement the given methods. All method pairs must be of the form:
    T getFieldName();
    void setFieldName(T newValue);
Once the programmer has designed an interface for the shared data, creating instances of an anonymous class and using them in a transaction is quite simple. It is important to note that the library provides a custom implementation of the Thread class through which the programmer interacts with the library. Two of the key methods are Thread.makeFactory(Interface) and Thread.doIt(Callable<T>). The first method is used to generate an atomic factory for the given interface and the second allows the programmer to actually run a transaction.
To fully appreciate the API as it is described above, we look at how a programmer would use it to implement a linked list. The first step is to implement an interface for the shared data, in this case a linked list node (LLNode). This interface, as seen in Figure 2, contains two fields, an integer and a reference to the next LLNode in the list. Notice that this LLNode interface implicitly creates the two fields by providing two getter/setter method pairs in the definition.
    @atomic
    public interface LLNode {
        int getValue();
        void setValue(int value);
        LLNode getNext();
        void setNext(LLNode next);
    }
Figure 2: Interface for a Linked List Node
Second, we create an atomic object factory for the interface using:
    Factory<LLNode> factory = Thread.makeFactory(LLNode.class);
This factory provides a create method which is used to instantiate instances of the new anonymous class. The factory is then used to create a root node for the linked list:
    LLNode root = factory.create();
Then, several threads are created and inside each thread concurrent insert transactions are run. The code for the insert method is quite standard and can be found in Figure 3. Running the transaction consists of creating a Callable and passing it to Thread.doIt() as follows:
    Thread.doIt(new Callable<Void>() {
        public Void call() {
            mylist.insert(value);
            return null;  // a Callable<Void> must return a value
        }
    });
Although it is possible for a programmer to call methods on transactional objects outside of a transaction, the program is considered incorrect if this is done while there are any in-flight transactions.
    public class List {
        static Factory<LLNode> factory = Thread.makeFactory(LLNode.class);
        // Root node of the list, created through the factory as described in the text.
        LLNode root = factory.create();
        public void insert(int i) {
            LLNode node = factory.create();
            node.setValue(i);
            node.setNext(root.getNext());
            root.setNext(node);
        }
        ...
    }
Figure 3: Linked List insert method
To support the API described above, the DSTM2 library uses the Apache BCEL library [4] to dynamically create class implementations for the interfaces passed to the Transactional Object Factory. However, bytecode engineering with the BCEL requires a solid understanding of Java bytecode and takes longer to implement than the equivalent Java code. To remove the need for TM implementers to write their own Java bytecode, DSTM2 provides the implementer with several hooks in the form of a custom adapter class. This adapter class requires two methods, makeGetter(...) and makeSetter(...), which are used to generate new closures defining the behaviors of the get() and set(...) methods implemented by the atomic object factories. An instance of the adapter is created for every object generated by the atomic object factory, and that adapter stores the object-specific meta-data that it needs to generate getter and setter method implementations. While the adapter class does create some additional overhead, the DSTM2 authors felt that this was an acceptable penalty because it means that TM implementers no longer have to write bytecode themselves.
The DSTM2 API provides a great simplification over some earlier STM APIs such as DSTM [8] and an older API for RSTM [10]. Both of those library APIs required the programmer to explicitly open transactional objects for reading and writing.
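To illustrate the idea of deriving an implementation from getter/setter names alone, here is a minimal sketch that uses the standard java.lang.reflect.Proxy class instead of BCEL. This is a swapped-in technique chosen because it needs no bytecode library; it is not how DSTM2 generates classes, and it provides no transactional behavior. The names Node and ProxyFactory are hypothetical.

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

// Hypothetical interface following the getX/setX naming rule (mirrors Figure 2).
interface Node {
    int getValue();
    void setValue(int value);
    Node getNext();
    void setNext(Node next);
}

// Hypothetical factory: builds an implementation of an interface from its method names.
class ProxyFactory<T> {
    private final Class<T> type;

    ProxyFactory(Class<T> type) {
        this.type = type;
    }

    T create() {
        // One map per instance plays the role of the generated fields.
        Map<String, Object> fields = new HashMap<>();
        InvocationHandler handler = (proxy, method, args) -> {
            String name = method.getName();
            if (name.startsWith("set") && args != null && args.length == 1) {
                fields.put(name.substring(3), args[0]);
                return null;
            }
            if (name.startsWith("get") && args == null) {  // args is null for no-arg methods
                Object stored = fields.get(name.substring(3));
                return stored != null ? stored : defaultValue(method.getReturnType());
            }
            throw new UnsupportedOperationException(name);
        };
        Object instance = Proxy.newProxyInstance(
                type.getClassLoader(), new Class<?>[] { type }, handler);
        return type.cast(instance);
    }

    // Default for a field that was never set (only a few primitive types are handled).
    private static Object defaultValue(Class<?> t) {
        if (t == int.class) return 0;
        if (t == long.class) return 0L;
        if (t == boolean.class) return false;
        if (t == double.class) return 0.0;
        return null;  // reference types
    }
}
```

Every call through such a proxy goes through a reflective dispatch, which is one reason a generated class with real fields is the faster design.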
4 RSTM-Java Port
4.1 Goals of the RSTM-Java Port
The original goal of the RSTM-Java port was to implement the RSTM library in Java and leverage Sun's DSTM2 frontend to provide a clean TM API to the programmer. Because of licensing restrictions on the Sun DSTM2 library code, one of the main design decisions was to keep the RSTM-Java library implementation free of all Sun-licensed code so that it could be freely distributed. Then, if the user obtains a Sun license, the user could easily download the library and work with a much cleaner interface.
4.2 Problems with the Original Goal
Several design features in the DSTM2 library make it very difficult to develop a system that is capable of running both with and without Sun's DSTM2 system. There are several problems with the custom adapter class that make it difficult to use while satisfying the initial goals of the project. One such issue relates to the way RSTM is implemented. Every time an object is acquired through either lazy or eager acquire under RSTM, a clone of the original object is created and atomically swapped into the Shared<T> object data pointer using a CAS operation. Since this occurs every time an object is opened for writing, the speed of object creation and of the clone method is of extreme importance. If the adapter class were used to avoid bytecode engineering, a clone method would not be provided by the factory. Without a clone method, new clone objects would have to use run-time reflection to copy the older objects. According to a Java developerWorks article [1], Java 1.4 field accesses using runtime reflection take over 1000 times as long as direct accesses. This cost would have dramatically slowed down the system. A second problem with using a custom adapter class for cloned objects is the frequency with which cloned objects are generated. Since every call to Open_RW() causes a new cloned object to be created, using an adapter-based factory means that every Open_RW() would create two new objects, i.e. a new adapter and a new clone object. Given the frequency with which these objects would be created, this was considered an unacceptable cost. The final problem with using a Sun BCEL factory for cloned objects is that the cloned object factory is required for RSTM library execution. One of the design goals was the ability to run the RSTM system without the DSTM2 frontend. Since cloned objects are a key part of the library, the system would not have run without Sun's factories.
One possible solution that was not used is to allow library users who do not have access to the frontend to provide their own full class implementation with a clone method instead of just a simple interface.
Another key issue when porting the library is that the Sun DSTM2 library manages much of the transactional meta-data automatically for the TM author. The system takes care of creating transactional descriptors and of committing, aborting and validating them. The DSTM2 library provides TM authors with specific callback hooks that are invoked on certain system actions. These hooks are called on validation, commit and abort. Additional hooks are being added to a new version that will also be called at the beginning of a transaction and immediately before a transaction tries to commit. The latter is needed to support lazy acquire transactions, where ownership of objects written by the transaction is only acquired directly before the transaction attempts to commit.
While the callbacks and meta-data provide an easy-to-use interface for TM authors, using them while maintaining a system separable from DSTM2 would have been difficult. The basic solution would have required the RSTM library to provide a replication of DSTM2's meta-data TM interface. At that point, there would have been little benefit derived from the use of the library.
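The difference between a reflective copy and a generated clone method, as discussed in this section, can be sketched as follows. This is only an illustration: Point and ReflectiveCopier are hypothetical names, and no timing is claimed here. The reflective path looks up and accesses each field through java.lang.reflect.Field on every copy, whereas the direct path compiles down to plain field reads and writes.

```java
import java.lang.reflect.Field;
import java.lang.reflect.Modifier;

// Hypothetical shared data class with a hand-written (or generated) copy method.
class Point {
    int x;
    int y;

    Point() { }

    Point(int x, int y) {
        this.x = x;
        this.y = y;
    }

    // Direct copy: plain field reads and writes, no reflection involved.
    Point copy() {
        return new Point(x, y);
    }
}

class ReflectiveCopier {
    // Reflective copy: works for any class with a no-argument constructor,
    // but pays for a reflective access on every field of every copy.
    static <T> T copy(T source, Class<T> type) {
        try {
            T target = type.getDeclaredConstructor().newInstance();
            for (Field f : type.getDeclaredFields()) {
                if (Modifier.isStatic(f.getModifiers())) {
                    continue;  // static state is not part of the instance
                }
                f.setAccessible(true);
                f.set(target, f.get(source));
            }
            return target;
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Both methods produce an equivalent shallow copy; the design question raised in the text is only which one is cheap enough to run on every write-open.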
4.3 The Solution
Given the issues described above, the simplest method for porting the RSTM library to Java was to create an implementation that provided the Sun DSTM2 library interface but did not leverage any of Sun's code. This solution allows users of the RSTM-Java library to leverage code written to work with DSTM2 while imposing fewer restrictions on the distribution of the library.
4.4 Implementation
The RSTM-Java port consists of two main components: a backend port of the C++ RSTM library to support transactions, and a frontend designed to provide the DSTM2 library interface.
4.4.1 The RSTM-Java Backend
Most of the RSTM-Java backend implementation is a straightforward port of the C++ library. Although Java does not give the programmer direct access to the CAS instruction, it provides several wrapper classes that allow certain variable types to be updated atomically. Among these are the AtomicMarkableReference and AtomicInteger classes. Even though the syntax is slightly different, the AtomicMarkableReference class provides a drop-in replacement for the pointer in the object header that points to the object clone. The AtomicInteger class is used to store the visible readers bitmap for each shared object header.
Another modification dealt with the way all shared data objects are constructed. In the C++ version of RSTM, all of the shared objects are required to inherit from the Object<T> class. The Object<T> base class provides a pointer to the next shared object in the clone chain and a pointer to the descriptor of the transaction that currently owns the object. This works well in a language such as C++ where multiple inheritance is supported. However, because Java only supports single inheritance, another solution had to be found. Since we want to support the Sun DSTM2 API in the new library, the library would have access to an interface provided by the user. The natural solution was to create a specialized factory which returned an object containing getter and setter implementations for all of the methods inside of the interface, as well as a clone method, a next object field and an owner field. The getter and setter methods do nothing more than set and return the specified field.
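A minimal sketch of how AtomicMarkableReference can stand in for a pointer whose low bit is used as a flag. The Header class and its method names are hypothetical, and the cleanUp step is an assumption about how the mark would be cleared; it is not taken from the RSTM-Java sources. Only the java.util.concurrent.atomic calls are real library API.

```java
import java.util.concurrent.atomic.AtomicMarkableReference;

// Hypothetical object header for illustration; not the RSTM-Java implementation.
class Header<T> {
    // The boolean mark plays the role of the low bit of the C++ pointer (true = owned).
    private final AtomicMarkableReference<T> newData;

    Header(T initial) {
        newData = new AtomicMarkableReference<>(initial, false);
    }

    // Installs a freshly made clone and sets the mark in a single atomic step.
    // Fails if the header no longer holds the expected, unowned version.
    boolean acquire(T expected, T clone) {
        return newData.compareAndSet(expected, clone, false, true);
    }

    // Assumed clean-up step: clears the mark while leaving the reference unchanged.
    boolean cleanUp(T clone) {
        return newData.compareAndSet(clone, clone, true, false);
    }

    T version() {
        return newData.getReference();
    }

    boolean isOwned() {
        return newData.isMarked();
    }
}
```

Note that compareAndSet compares references by identity, which matches the pointer comparison a CAS would perform in the C++ library.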
4.4.2 The Factories
One detail not mentioned in the preceding section pertains to the implementation of the custom bytecode engineering factories. Unlike the DSTM2 library, the Javassist bytecode engineering library [2] was used for all factory implementations. Javassist's main advantage over the BCEL is a built-in Java source-to-bytecode compiler that greatly simplifies the factory implementation. When given an interface, the factory follows the steps listed below to generate a cloneable anonymous class for the library:
1. Use Java runtime reflection to determine the methods supplied by the given interface. Store the method names and fields for future use.
2. Create a new class using the Javassist library. For each setter method in the interface, create a field in the class and a setter method implementation. For each getter method, create a method implementation that returns the appropriate field.
3. Add additional fields and methods for the Owner and Next fields.
4. Finally, use the interface description to create a clone method.
Almost exactly the same steps are followed to create the factory used in the frontend to provide the DSTM2 API interface. The DSTM2 API requires that the library provide the programmer with an implementation of their interface that can safely be used inside a transaction. This is accomplished by creating a second factory with which the programmer directly interacts. The second factory is very similar to the cloneable object factory, with a few notable exceptions. For example, the extra fields and clone methods are not added, and the objects generated hold onto a Shared object and use that object to call Open_RW() and Open_RO(). While the actual creation of the factory using runtime reflection is quite expensive, it only happens twice for each object type: once for the frontend factory creation and a second time for the cloneable object factory creation.
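Step 1 of the list above can be sketched with plain Java reflection. This is an illustrative sketch rather than the RSTM-Java factory code: InterfaceScanner and LLNodeShape are hypothetical names, and the class generation itself (steps 2 through 4, done with Javassist in the text) is deliberately left out.

```java
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical interface following the getter/setter naming rule.
interface LLNodeShape {
    int getValue();
    void setValue(int value);
    LLNodeShape getNext();
    void setNext(LLNodeShape next);
}

class InterfaceScanner {
    // Returns field name -> field type for every getX/setX pair in the interface.
    static Map<String, Class<?>> fields(Class<?> iface) {
        if (!iface.isInterface()) {
            throw new IllegalArgumentException(iface + " is not an interface");
        }
        Map<String, Class<?>> getters = new TreeMap<>();
        Map<String, Class<?>> setters = new TreeMap<>();
        for (Method m : iface.getMethods()) {
            String name = m.getName();
            boolean longEnough = name.length() > 3;
            if (longEnough && name.startsWith("get") && m.getParameterCount() == 0) {
                getters.put(name.substring(3), m.getReturnType());
            } else if (longEnough && name.startsWith("set")
                    && m.getParameterCount() == 1
                    && m.getReturnType() == void.class) {
                setters.put(name.substring(3), m.getParameterTypes()[0]);
            } else {
                throw new IllegalArgumentException("not a getter or setter: " + name);
            }
        }
        // Each field needs both a getter and a setter of the same type.
        if (!getters.equals(setters)) {
            throw new IllegalArgumentException("unpaired getter or setter in " + iface);
        }
        return getters;
    }
}
```

Because this scan runs only when a factory is created, its reflective cost is paid once per interface rather than once per object.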
5 Results and Discussion
In this section, the performance of RSTM-Java is compared to the performance of the DSTM2 library as well as the C++ version of RSTM and a coarse-grain lock (CGL) implementation that provides the DSTM2 API. All benchmarks were run on a SunFire 6800 containing 16 1.2 GHz UltraSPARC III processors. The DSTM2, RSTM-Java and CGL tests were run under the Java 1.6.0 VM, while the C++ version of RSTM was compiled using GCC with optimization level -O3. Throughput was measured for 1 to 28 threads over a period of 10 seconds for each thread. All measurements for the Java and C++ versions of RSTM used invisible readers and eager acquire with the Polka contention manager. The results reported here were all averaged over a set of 3 test runs.
5.1 Benchmarks
Several benchmarks were used to evaluate the performance of the different STM libraries. All of the benchmarks maintain integer sets, and each active thread performs a 1:1:1 mixture of insert, contains and delete operations.
5.1.1 Linked List with Early Release
This benchmark maintains a list of sorted integers, and each transaction traverses the list and then performs the appropriate action (inserting, deleting or reporting the existence of a node). The list is first traversed in read-only mode, early releasing nodes that are no longer needed. Once the appropriate spot is found in the list, the transaction opens the nodes in read-write mode and performs the requested operation. All values in the linked list are limited to the range 0...255.
5.1.2 Red-Black Tree
The red-black tree (RBTree) has each transaction search for a node using read-only opens down the tree. After the correct node is located, both the correct node and the nodes that must be modified for the RBTree balancing process are opened for writing.
5.1.3 Hash Table
In the Hash Table, 256 buckets are created. Each bucket provides a linked list for overflow protection. Values between 0 and 255 are randomly inserted and removed. The hash table takes the input integer modulo the number of buckets to find the storage bucket.
5.1.4 Counter
Counter maintains a single number that all transactions try to increment by 1 simultaneously.
5.2 Performance
Speedup graphs for the benchmarks described above can be found in Figures 4 through 7. These figures all use a log scale on the y axis.
5.2.1 Overhead of the RSTM-Java Port
The C++ version of RSTM performs many more transactions per second than the Java version. While some of the slowdown can simply be attributed to Java running bytecode that must be interpreted, another portion of it must be attributed to implementation issues due to the lack of pointers in Java as well as the desire to support Sun's DSTM2 API.
[Figures 4 through 7: throughput plots. Each plots transactions/second (log scale) against the number of threads (0 to 28).]
Figure 4: Hash Table. Series: RSTM-Java, RSTM-C++, CGL, DSTM2 ofree.
Figure 5: Linked List with Early Release. Series: RSTM-Java, RSTM-C++, CGL. Note: There are no DSTM2 numbers on this graph because the publicly released copy of the library did not support early release at the time of this writing.
Figure 6: Red-Black Tree. Series: RSTM-Java, RSTM-C++, CGL, DSTM2 ofree.
Figure 7: Counter. Series: RSTM-Java, RSTM-C++, CGL, DSTM2 ofree, RSTM-Java no API. This figure contains an extra set of data, RSTM-Java no API, which represents the number of transactions per second when the frontend factory is removed and the wrapper objects are accessed directly.
One reason for the slowdown of RSTM-Java as compared with RSTM-C++ is the extra function calls that occur for each access to a transactional object. Under RSTM-C++, a programmer calls Open_RO() and Open_RW() directly on Shared<T> objects. With RSTM-Java, the programmer creates a frontend object that is responsible for calling Open_RO() and Open_RW() in its getter and setter methods. This simplifies the programming interface but results in an additional function call on every object access and, usually, several times more calls into the transactional framework to open an object during the transaction. These extra calls occur for several reasons: (1) the user gets a value from the object multiple times instead of caching the value the first time; (2) the user reads a value and then modifies it; (3) the user writes a value multiple times in a transaction. In cases two and three, the programmer cannot avoid the problem because of the way the API is structured. Unlike RSTM-Java, where a programmer reading and then writing a value implicitly opens the object for reading and then for writing, RSTM-C++ allows the programmer to simply open the object for writing once and then read and modify it as many times as needed. Similarly, the only way to modify a value in RSTM-Java is to use the setter method, which implicitly calls Open_RW() on every invocation.
A second extra cost incurred by the RSTM-Java implementation relates to the RSTM algorithm's use of a clean bit in the Shared<T> object header. As stated previously, this bit is used to determine whether the object in question is currently owned by a transaction. Under C++, the bit can be atomically updated by directly modifying the least significant bit of the pointer and using a CAS operation to swap in the new value. Java does not have pointers, but the AtomicMarkableReference class in java.util.concurrent.atomic provides similar semantics: it pairs a reference with a boolean mark that can be read and updated atomically together, and the mark stands in for the least significant bit of the pointer. However, the class uses a boxed implementation of reference/boolean pairs, which introduces an unneeded extra level of indirection. Since the AtomicMarkableReference is updated twice for every successfully acquired object, the additional cost of the indirection is quite high. It should be possible to remove the need for the clean bit from the RSTM library by determining the ownership of the object based on the newer data object's next pointer. However, attempts to implement this change have introduced a race condition into the code. Future work will hopefully eliminate this unnecessary overhead.
Finally, the API call used to run a new transaction, Thread.doIt(Callable<T>), forces a new Callable object to be created at the start of every transaction. This object is only needed for the duration of the transaction, so in addition to the overhead of creating the Callable, the API puts increased pressure on the Java garbage collector.
5.2.2 The DSTM2 API Overhead
For comparison purposes, a version of the counter benchmark was implemented without using the frontend factory. Instead, the benchmark holds onto a Shared<T> and explicitly calls Open_RO() and Open_RW(). Note that this does not eliminate the cost of creating a new Callable<T> for each transaction. From the results in Figure 7, it is clear that the number of transactions per second almost doubles without this extra overhead.
5.2.3 Coarse-Grain Locks
Coarse-grained locks provide significant performance gains over DSTM2 and RSTM-Java. Coarse-grained locks are already known to outperform the C++ version of RSTM at low thread counts [10]; the Java port of RSTM additionally incurs several extra levels of indirection when working with objects in a transaction.
5.2.4 DSTM2 Library Performance
The DSTM2 obstruction-free, locator-based factory is the least efficient factory as measured by the benchmarks described above. Part of this slowdown is due to the extra flexibility provided by Sun's library implementation. Because all frontend factory methods are created by having the Adapter class return a Callable<T>, the Callable<T> has to use reflection to invoke the desired method. Additionally, the locator implementation uses visible readers, which showed degraded performance in the RSTM-C++ benchmarks. Finally, the choice of contention manager often has a large effect on STM performance, and experimentation with different contention managers may improve DSTM2 library performance. The DSTM2 paper [6] does not mention the contention manager used for its measurements, so all DSTM2 benchmarks were run using the default exponential backoff manager.
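For readers unfamiliar with exponential backoff, the following is a minimal sketch of the general technique: after each conflict a transaction waits a random time drawn from a window that doubles up to a cap. The class ExponentialBackoff, its method names, and the window sizes are assumptions made for illustration; this is not the DSTM2 backoff manager and does not reproduce its interface or parameters.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical backoff policy; illustrates the technique only.
final class ExponentialBackoff {
    private final long minWindowNanos;
    private final long maxWindowNanos;
    private int conflicts = 0; // conflicts seen since the last reset

    ExponentialBackoff(long minWindowNanos, long maxWindowNanos) {
        if (minWindowNanos <= 0 || maxWindowNanos < minWindowNanos) {
            throw new IllegalArgumentException("need 0 < min <= max");
        }
        this.minWindowNanos = minWindowNanos;
        this.maxWindowNanos = maxWindowNanos;
    }

    // Current window: the minimum doubled once per conflict, capped at max.
    long windowNanos() {
        long window = minWindowNanos;
        for (int i = 0; i < conflicts && window < maxWindowNanos; i++) {
            // Doubling is clamped to the cap, which also avoids overflow.
            window = (window > maxWindowNanos / 2) ? maxWindowNanos : window * 2;
        }
        return window;
    }

    // Called on a conflict: pick a random delay in [0, window), then widen.
    long nextDelayNanos() {
        long delay = ThreadLocalRandom.current().nextLong(windowNanos());
        conflicts++;
        return delay;
    }

    // Called after a commit so the next transaction starts with a small window.
    void reset() { conflicts = 0; }
}

final class BackoffDemo {
    public static void main(String[] args) {
        ExponentialBackoff backoff = new ExponentialBackoff(100, 1000);
        for (int i = 0; i < 5; i++) {
            System.out.println("window=" + backoff.windowNanos());
            backoff.nextDelayNanos(); // a caller would wait this long before retrying
        }
    }
}
```

A caller would typically pause for the returned delay (for example with LockSupport.parkNanos) before retrying. The randomization keeps conflicting transactions from retrying in lockstep.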
Conclusions
RSTM was successfully ported from C++ to Java and provides Sun's DSTM2 interface. However, that interface, together with the overhead of using an AtomicMarkableReference to keep track of the current version of the object, greatly reduces the overall performance of RSTM-Java. The cost of providing the DSTM2 API in a library is quite high because of the extra calls into the transactional memory library as well as the extra cost of each call. Compiler support could mitigate much of this cost by reducing the frontend overhead and eliminating multiple unnecessary calls into the TM system. It could also eliminate the need for a new Callable<T> on each transaction.
Additionally, the new RSTM-Java library outperforms the Sun DSTM2 library. Some of the performance loss in Sun's library can be attributed to the cost of the flexibility the library provides. Another portion of the loss comes from its visible-reader, locator-based TM implementation. The RSTM-Java port is fully functional, although additional work is needed to further reduce the overhead associated with the port and improve overall performance.
References
[1] http://www-128.ibm.com/developerworks/java/library/j-dyn0603/.
[2] Javassist Byte-Code Engineering Library. http://www.csg.is.titech.ac.jp/~chiba/javassist/.
[3] Moore's Law. http://www.intel.com/technology/mooreslaw/.
[4] Apache Software Foundation. Byte-Code Engineering Library. http://jakarta.apache.org/bcel/manual.html.
[5] Maurice Herlihy, Victor Luchangco, and Mark Moir. Obstruction-free synchronization: Double-ended queues as an example. In ICDCS '03: Proceedings of the 23rd International Conference on Distributed Computing Systems, page 522, Washington, DC, USA, 2003. IEEE Computer Society.
[6] Maurice Herlihy, Victor Luchangco, and Mark Moir. A Flexible Framework for Implementing Software Transactional Memory. In OOPSLA '06: Proceedings of the 21st annual ACM SIGPLAN conference on Object-oriented programming systems, languages, and applications, pages 253–262, New York, NY, USA, 2006. ACM Press.
[7] Maurice Herlihy and J. Eliot B. Moss. Transactional Memory: Architectural Support for Lock-Free Data Structures. In ISCA '93: Proceedings of the 20th annual international symposium on Computer architecture, pages 289–300, New York, NY, USA, 1993. ACM Press.
[8] Maurice P. Herlihy, Victor Luchangco, Mark Moir, and William N. Scherer III. Software Transactional Memory for Dynamic-sized Data Structures. In Proceedings of the 22nd Annual ACM Symposium on Principles of Distributed Computing, July 2003.
[9] Maurice P. Herlihy and Jeannette M. Wing. Linearizability: A Correctness Condition for Concurrent Objects. ACM Trans. Program. Lang. Syst., 12(3):463–492, 1990.
[10] Virendra J. Marathe, Michael F. Spear, Christopher Heriot, Athul Acharya, David Eisenstat, William N. Scherer III, and Michael L. Scott. Lowering the Overhead of Nonblocking Software Transactional Memory. In Proceedings of the 1st ACM SIGPLAN Workshop on Languages, Compilers, and Hardware Support for Transactional Computing, June 2006.