
Exploiting Parallelism in Multicore Processors through Dynamic Optimizations

Abhimanyu Khosla, M.Tech CSE

What is TLS?

Thread-Level Speculation (TLS): parallel execution of potentially dependent threads, i.e. speculative threads.

Why Dynamic?

Static thread management analyses extensive profile information (probability-based data and control dependence profiling). It can extract coarse-grained parallelism (threads of several thousand instructions). However, it is difficult to estimate the performance impact statically, even when extensive information is available. Why?

Problems with Static Thread Management

Profiling information cannot accurately predict the costs of speculation, synchronization and other overheads. The performance impact of speculative threads depends on the underlying hardware configuration. The behavior of speculative threads is input-dependent.

Speculative threads experience phase behavior.

Experimental Infrastructure for Dynamic Optimizations

Speculative Thread Execution Model.
Architectural Support.
Compilation Infrastructure.
Simulation Infrastructure.

Speculative Thread Execution Model

Allows the compiler to parallelize a sequential program without first proving independence among the extracted threads.
The underlying hardware keeps track of each memory access to detect dependence violations at run time.

TLS empowers the compiler to parallelize programs that were previously non-parallelizable.

Without TLS, the compiler is forced to give up parallelizing code whose memory addresses are unknown at compile time.
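A loop of the following shape illustrates the problem. It is a hypothetical example, not the code from the original slide: the arrays `a`, `src` and `dst` and the helper `has_cross_iteration_dependence` are assumptions made for illustration. Because the indices are only known at run time, whether an iteration reads a location written by an earlier iteration depends on the input.

```python
def update(a, src, dst):
    """Sequential loop: a[dst[i]] = a[src[i]] + 1 for each i."""
    for i in range(len(src)):
        a[dst[i]] = a[src[i]] + 1
    return a


def has_cross_iteration_dependence(src, dst):
    """Return True if some iteration reads a location that an
    earlier iteration wrote (a read-after-write dependence).

    Only read-after-write is checked here; this is a simplification.
    """
    written = set()
    for s, d in zip(src, dst):
        if s in written:
            return True
        written.add(d)
    return False


# Independent iterations: safe to run in parallel.
print(has_cross_iteration_dependence([0, 1], [2, 3]))
# Iteration 1 reads a[2], which iteration 0 writes: a dependence.
print(has_cross_iteration_dependence([0, 2], [2, 3]))
```

A static compiler cannot run this check, since `src` and `dst` are unknown at compile time; under TLS the iterations are run speculatively and the hardware catches the second case when it actually occurs.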

Architectural Support for Speculation (CMP)

Each core has a private first-level (L1) cache, and all cores share an L2 cache.

The STAMPede approach is used to support TLS.

Extends the cache coherence protocol with two new states:

- Speculatively Shared (SpS).
- Speculatively Exclusive (SpE).
- Transitions to and from these states.

If a cache line is speculatively loaded, it enters the SpS or SpE state.

Each speculative thread is assigned a unique ID.

The thread ID of the sender is piggybacked on all invalidation messages. If an invalidation message arrives from a logically earlier thread for a cache line in the SpS or SpE state, the receiving thread is squashed and re-executed.
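The squash rule can be sketched as follows. This is a minimal model under stated assumptions, not the STAMPede implementation: the class `SpeculativeThread`, its methods, and the convention that a lower thread ID means a logically earlier thread are all hypothetical.

```python
from dataclasses import dataclass, field

SPECULATIVE_STATES = {"SpS", "SpE"}


@dataclass
class SpeculativeThread:
    thread_id: int  # assumption: lower ID = logically earlier thread
    line_state: dict = field(default_factory=dict)  # cache line -> state
    squashed: bool = False

    def speculative_load(self, line, exclusive=False):
        """A speculatively loaded line enters the SpE or SpS state."""
        self.line_state[line] = "SpE" if exclusive else "SpS"

    def on_invalidation(self, line, sender_id):
        """Handle an invalidation message carrying the sender's thread ID."""
        state = self.line_state.get(line)
        if sender_id < self.thread_id and state in SPECULATIVE_STATES:
            # A logically earlier thread wrote a line this thread has
            # already read speculatively: squash and re-execute.
            self.squashed = True
            self.line_state.clear()  # simplification: drop speculative state
        return self.squashed


t = SpeculativeThread(thread_id=2)
t.speculative_load(0x40)
print(t.on_invalidation(0x40, sender_id=3))  # from a later thread
print(t.on_invalidation(0x40, sender_id=1))  # from an earlier thread
```

Invalidations from logically later threads, and for lines that were not speculatively loaded, leave the thread running in this sketch.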

Compilation Infrastructure

Built on the Open64 compiler.

Extended to extract speculative threads from loops. To decide dynamically where speculative threads should be spawned, the compiler has to create a different executable in which every loop is parallelized.

Simulation Infrastructure

Based on a trace-driven, out-of-order superscalar processor simulator.
TG: trace generation, based on the Pin instrumentation tool.

AS: architectural simulation, based on SimpleScalar.

TG instruments all instructions to extract the instruction address, registers used, opcode, etc.

AS reads the trace file and translates the code generated by the compiler into Alpha-like code.
The pipeline is based on SimpleScalar.

Wattch models power consumption.

Orion models interconnect power consumption. CACTI models the caches.

Performance Estimation

Cycles for TLS are broken down into six segments:
- Busy: cycles spent graduating non-TLS instructions.
- ExeStall: cycles stalled due to lack of ILP.
- iFetch: cycles stalled due to instruction fetch penalty.
- dCache: cycles stalled due to data cache misses.
- Squash: cycles stalled due to speculation failures.
- Others: cycles spent on various TLS overheads.

Deriving sequential execution time from the TLS cycle breakdown:

PSEQ = TLS - Squash - Others
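A small worked example of this derivation, assuming the predicted sequential time is the TLS total minus the Squash and Others segments. The function name `predicted_sequential_cycles` and the cycle counts are made up for illustration; they are not measurements.

```python
def predicted_sequential_cycles(breakdown):
    """Estimate sequential cycles from a TLS cycle breakdown.

    Assumption: PSEQ = TLS - Squash - Others, where TLS is the sum of
    the six segments.
    """
    tls_total = sum(breakdown.values())
    return tls_total - breakdown["Squash"] - breakdown["Others"]


# Invented numbers, for illustration only.
breakdown = {
    "Busy": 400,
    "ExeStall": 150,
    "iFetch": 50,
    "dCache": 200,
    "Squash": 120,
    "Others": 80,
}
print(predicted_sequential_cycles(breakdown))  # 1000 - 120 - 80 = 800
```

Under this assumption the estimate equals Busy + ExeStall + iFetch + dCache, that is, the segments a sequential run would also incur.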

Runtime Support

Performance profiling with hardware performance monitors.
Decision making for TLS.

Performance Profile with Hardware Monitors

Hardware performance monitors are programmed to attribute execution cycles to the following categories. Examining the instruction at the head of the reorder buffer (ROB) gives some clue to the cause of a stall.
- Busy: cycles spent graduating instructions.
- ExeStall: cycles stalled due to instruction execution delays.
- iFetch: cycles stalled due to instruction fetch penalty.
- dCache: cycles stalled due to data cache misses.
- Useful Instruction: number of non-TLS instructions committed.
- ThreadCount: number of threads committed.
- Total: cycles elapsed since the beginning of the TLS invocation.
- dCacheServe: for each data cache miss, the number of cycles needed to serve the miss.

Counters are maintained per core. A counter is called aggregated if its value is accumulated across all the cores.


Total is incremented on every clock cycle. In a given cycle:
- If the ROB is empty, the iFetch counter is incremented.
- If the instruction at the head of the ROB is able to graduate, the Busy counter is incremented.
- If the instruction stalled at the head of the ROB is a memory operation, the dCacheServe counter is incremented.
- If the stalled instruction is a TLS management instruction, such as a thread creation/commit instruction or a synchronization instruction, no counter is incremented.
- Otherwise, the ExeStall counter is incremented.

When a non-TLS-management instruction commits, the Useful Instruction counter is incremented.
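The per-cycle attribution rules can be sketched as a priority chain. This is an illustrative software model of what the hardware monitors would do: the function `attribute_cycle` and the dictionary keys `can_graduate`, `is_memory_op` and `is_tls_mgmt` are hypothetical names, and only the stall attribution is covered.

```python
from collections import Counter


def attribute_cycle(counters, rob_head):
    """Attribute one clock cycle to a counter.

    rob_head is None when the ROB is empty; otherwise it is a dict
    describing the instruction at the head of the ROB (hypothetical keys).
    """
    counters["Total"] += 1
    if rob_head is None:
        counters["iFetch"] += 1
    elif rob_head["can_graduate"]:
        counters["Busy"] += 1
    elif rob_head["is_memory_op"]:
        counters["dCacheServe"] += 1
    elif rob_head["is_tls_mgmt"]:
        pass  # stalled TLS management instruction: no counter incremented
    else:
        counters["ExeStall"] += 1
    return counters


counters = Counter()
attribute_cycle(counters, None)
attribute_cycle(counters, {"can_graduate": True, "is_memory_op": False, "is_tls_mgmt": False})
attribute_cycle(counters, {"can_graduate": False, "is_memory_op": True, "is_tls_mgmt": False})
attribute_cycle(counters, {"can_graduate": False, "is_memory_op": False, "is_tls_mgmt": True})
attribute_cycle(counters, {"can_graduate": False, "is_memory_op": False, "is_tls_mgmt": False})
print(dict(counters))
```

The order of the checks matters: a cycle is charged to exactly one category, and cycles stalled on TLS management instructions are counted in Total only.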

Aggregating The Counters

When a thread is spawned to a core, the counters on that core are reset.

When a thread commits (only the non-speculative thread is allowed to commit), all the aggregated counters are forwarded to the next non-speculative thread and ThreadCount is incremented. When a speculative thread becomes non-speculative, it aggregates the forwarded counters with its own counters.
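A minimal sketch of this hand-off, assuming each core keeps its own counters plus a slot for the forwarded ones. The class `CoreCounters` and its method names are hypothetical, and incrementing ThreadCount before forwarding is an assumption about the ordering.

```python
from collections import Counter


class CoreCounters:
    def __init__(self):
        self.own = Counter()        # counters collected on this core
        self.forwarded = Counter()  # aggregated counters from the committer

    def spawn(self):
        """A thread is spawned to this core: reset its counters."""
        self.own = Counter()
        self.forwarded = Counter()

    def become_non_speculative(self):
        """Aggregate the forwarded counters with this core's own counters."""
        self.own.update(self.forwarded)  # Counter.update adds counts
        self.forwarded = Counter()

    def commit(self, next_core):
        """Commit the non-speculative thread and forward its counters."""
        self.own["ThreadCount"] += 1  # assumption: counted before forwarding
        next_core.forwarded = Counter(self.own)


a, b = CoreCounters(), CoreCounters()
a.spawn()
b.spawn()
a.own["Busy"] = 10
b.own["Busy"] = 7
a.commit(b)                   # a holds the non-speculative thread
b.become_non_speculative()
print(b.own["Busy"], b.own["ThreadCount"])
```

After the last committed thread, the aggregated counters describe the whole TLS invocation rather than a single core.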

Counting Cycles for Data Cache Misses

Estimating the cache performance of sequential execution from parallel execution is a complex task. Consider the following scenarios:
- A data item used by a thread is actually brought into the cache by another speculative thread which fails.
- A data item in the L1 cache is invalidated by a message from a speculative thread.
- A data item is needed by two threads running on two different cores, causing two cache misses.

Decision Making

To decide which loops to parallelize speculatively, a performance table is maintained for the candidate loops.
Each entry in the table contains two fields:

- a saturating counter, which is incremented if the TLS execution outperforms the predicted sequential execution and decremented otherwise.
- a performance profile summary, which contains the cumulative difference in execution time between the TLS execution and the estimated sequential execution.

After a candidate loop is executed in TLS mode, the main thread updates the table by adding the difference between the TLS execution and the predicted sequential execution time to the performance summary.
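The table update can be sketched as below. The names `LoopEntry`, `update` and `prefers_tls`, the 2-bit counter range, its initial value, and the decision threshold are all assumptions for illustration; the slides do not specify them.

```python
from dataclasses import dataclass

COUNTER_MIN, COUNTER_MAX = 0, 3  # assumption: 2-bit saturating counter


@dataclass
class LoopEntry:
    counter: int = 2          # assumption: start weakly in favor of TLS
    cumulative_diff: int = 0  # sum of (TLS cycles - predicted sequential cycles)


def update(table, loop_id, tls_cycles, pseq_cycles):
    """Update the entry for loop_id after one TLS execution of the loop."""
    entry = table.setdefault(loop_id, LoopEntry())
    entry.cumulative_diff += tls_cycles - pseq_cycles
    if tls_cycles < pseq_cycles:   # TLS outperformed the prediction
        entry.counter = min(entry.counter + 1, COUNTER_MAX)
    else:
        entry.counter = max(entry.counter - 1, COUNTER_MIN)
    return entry


def prefers_tls(entry, threshold=2):
    """Hypothetical policy: keep using TLS while the counter is high."""
    return entry.counter >= threshold


table = {}
update(table, "loop1", 900, 1000)   # TLS faster by 100 cycles
update(table, "loop1", 800, 1000)   # TLS faster by 200 cycles
update(table, "loop1", 1200, 1000)  # TLS slower by 200 cycles
print(table["loop1"])
```

The saturating counter reacts to recent behavior, which suits loops with phase changes, while the cumulative difference records the overall gain or loss in cycles.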

Performance Evaluation

A dynamic performance tuning method is required to:

- identify the loops that can take maximum advantage of TLS.
- select the right loop nesting level to parallelize.

Dynamic Performance Tuning Policies


Quantitative + Static
Quantitative + StaticHint