
Hardware Multithreading

A multithreading processor is able to pursue two or more threads of control in parallel within the processor pipeline.

The contexts of two or more threads are often stored in separate on-chip register sets. Formally speaking, CMT (Chip Multi-Threading) is a processor technology that allows multiple hardware threads of execution (also known as strands) on the same chip, through multiple cores per chip, multiple threads per core, or a combination of both. Let's look at the various techniques that enable hardware multithreading.

1. Multiple Cores per Chip

CMP (Chip Multi-Processing, a.k.a. multicore) is a processor technology that combines multiple processors (a.k.a. cores) on the same chip (see Figure 2 (b)). The idea is very similar to SMP, but implemented within a single chip. [10] is the most famous paper about this technology.

2. Multiple Threads per Core

2.1 Vertical Multithreading

Instructions can be issued only from a single thread in any given CPU cycle.
- Interleaved Multithreading (a.k.a. Fine-Grained Multithreading): instructions from a different thread are fetched and fed into the execution pipeline(s) at each processor cycle, so a context switch happens every CPU cycle (see Figure 1 (b)).
- Blocked Multithreading (a.k.a. Coarse-Grained Multithreading): instructions of the current thread are executed successively until an event occurs in that thread that may cause latency; this delay event induces a context switch (see Figure 1 (c)).

2.2 Horizontal Multithreading

Instructions can be issued from multiple threads in any given cycle. This is so-called simultaneous multithreading (SMT): instructions are simultaneously issued from multiple threads to the execution units of a superscalar processor. Thus, the wide superscalar instruction issue is combined with the multiple-context approach (see Figure 2 (a)).

Unused instruction slots, which arise from latencies during the pipelined execution of single-threaded programs by a contemporary microprocessor, are filled by instructions of other threads within a multithreaded processor. The execution units are multiplexed among those thread contexts that are loaded in the register sets.

Underutilization of a superscalar processor due to missing instruction-level parallelism can be overcome by simultaneous multithreading, where a processor can issue multiple instructions from multiple threads in each cycle. Simultaneous multithreaded processors combine the multithreading technique with a wide-issue superscalar processor to utilize a larger part of the issue bandwidth by issuing instructions from different threads simultaneously.

Notes:

Superpipelining: an extreme pipelining processor technology, where the instruction pipeline is divided into an extremely large number (usually 8+) of pipeline stages.

Superscalar (a.k.a. multiple issue): a processor technology where multiple instructions can be issued to the instruction execution units in each cycle.
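The issue-bandwidth argument above can be made concrete with a tiny calculation. This is only a sketch: the issue width and per-thread IPC figures below are illustrative assumptions, not values taken from the text.

# Back-of-the-envelope model of issue-slot utilization (illustrative numbers only).
issue_width = 4                # a 4-wide superscalar processor
ipc_per_thread = [1.6, 1.4]    # assumed average instructions per cycle each thread can supply

single_thread_util = ipc_per_thread[0] / issue_width
smt_util = min(sum(ipc_per_thread), issue_width) / issue_width

print(f"single-thread utilization: {single_thread_util:.0%}")  # 40% of issue slots used
print(f"2-thread SMT utilization:  {smt_util:.0%}")            # 75% of issue slots used

This is only an upper bound for SMT: in a real machine the threads also compete for caches and functional units, so the combined throughput is usually lower than the sum of the per-thread figures.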

Types of multithreading

Block multi-threading

Concept

The simplest type of multi-threading occurs when one thread runs until it is blocked by an event that normally would create a long-latency stall. Such a stall might be a cache miss that has to access off-chip memory, which might take hundreds of CPU cycles for the data to return. Instead of waiting for the stall to resolve, a threaded processor would switch execution to another thread that was ready to run. Only when the data for the previous thread had arrived would the previous thread be placed back on the list of ready-to-run threads. For example:

1. Cycle i:   instruction j from thread A is issued
2. Cycle i+1: instruction j+1 from thread A is issued
3. Cycle i+2: instruction j+2 from thread A is issued, a load instruction which misses in all caches
4. Cycle i+3: thread scheduler invoked, switches to thread B
5. Cycle i+4: instruction k from thread B is issued
6. Cycle i+5: instruction k+1 from thread B is issued
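A minimal software sketch of the blocked (coarse-grained) policy shown above: one thread issues until it hits a long-latency event (modelled here as a cache-missing load), then the scheduler switches to another ready thread. The instruction streams and the "MISS" marker are invented for illustration.

# Blocked (coarse-grained) multithreading: switch only on a long-latency event.
# Each thread is just a list of instruction names; "MISS" marks a load that
# misses in all caches (an assumption made up for this sketch).
threads = {
    "A": ["j", "j+1", "j+2 MISS", "j+3"],
    "B": ["k", "k+1", "k+2"],
}

def run_blocked(threads):
    order = list(threads)           # ready-to-run thread order
    pc = {t: 0 for t in threads}    # per-thread "program counter"
    current, cycle = 0, 0
    while any(pc[t] < len(threads[t]) for t in threads):
        tid = order[current]
        if pc[tid] >= len(threads[tid]):           # thread finished: pick the next one
            current = (current + 1) % len(order)
            continue
        instr = threads[tid][pc[tid]]
        pc[tid] += 1
        print(f"cycle {cycle}: issue {instr} from thread {tid}")
        cycle += 1
        if "MISS" in instr:                        # long-latency event induces a switch
            current = (current + 1) % len(order)
            print(f"cycle {cycle}: thread scheduler invoked, switch to thread {order[current]}")
            cycle += 1

run_blocked(threads)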

Conceptually, it is similar to cooperative multi-tasking used in real-time operating systems, in which tasks voluntarily give up execution time when they need to wait upon some type of event.

Terminology

This type of multithreading is known as Block, Cooperative or Coarse-grained multithreading.

Hardware cost

The goal of multi-threading hardware support is to allow quick switching between a blocked thread and another thread ready to run. To achieve this goal, the hardware cost is to replicate the program-visible registers as well as some processor control registers (such as the program counter). Switching from one thread to another thread means the hardware switches from using one register set to another. Such additional hardware has these benefits:

- The thread switch can be done in one CPU cycle.
- It appears to each thread that it is executing alone and not sharing any hardware resources with any other threads. This minimizes the amount of software changes needed within the application as well as the operating system to support multithreading.

In order to switch efficiently between active threads, each active thread needs to have its own register set. For example, to quickly switch between two threads, the register hardware needs to be instantiated twice.

Examples

Many families of microcontrollers and embedded processors have multiple register banks to allow quick context switching for interrupts. Such schemes can be considered a type of block multithreading among the user program thread and the interrupt threads.
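The hardware cost described above (one register set per active thread, so a switch is just a change of bank selector) can be sketched in software. The two-bank layout and register count below are assumptions for illustration only.

# Replicated program-visible state: one register bank (plus PC) per hardware thread.
# A "context switch" only changes which bank index is active -- nothing is copied,
# which is why the switch can complete in a single cycle.
class MultithreadedRegFile:
    def __init__(self, num_threads, num_regs=32):
        self.banks = [[0] * num_regs for _ in range(num_threads)]  # one register set per thread
        self.pcs = [0] * num_threads                               # replicated program counter
        self.active = 0                                            # currently selected thread

    def switch_to(self, thread_id):
        self.active = thread_id       # single-cycle switch: only a selector changes

    def read(self, reg):
        return self.banks[self.active][reg]

    def write(self, reg, value):
        self.banks[self.active][reg] = value

rf = MultithreadedRegFile(num_threads=2)
rf.write(5, 42)        # thread 0 writes r5
rf.switch_to(1)        # switch threads without saving or restoring anything
rf.write(5, 7)         # thread 1 has its own r5
rf.switch_to(0)
print(rf.read(5))      # prints 42 -- thread 0's state was preserved untouched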

Interleaved multi-threading

1. Cycle i+1: an instruction from thread B is issued
2. Cycle i+2: an instruction from thread C is issued
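A minimal sketch of the fine-grained interleave shown in the cycle example above: the issue slot rotates to a different thread every cycle, so consecutive pipeline stages rarely hold dependent instructions from the same thread. The thread names and instruction streams are invented for illustration.

# Interleaved (fine-grained) multithreading: round-robin, one thread per cycle.
from itertools import cycle

threads = {
    "A": ["j", "j+1", "j+2"],
    "B": ["k", "k+1", "k+2"],
    "C": ["m", "m+1", "m+2"],
}
pc = {t: 0 for t in threads}
rotation = cycle(threads)          # A, B, C, A, B, C, ...

for clock in range(9):
    tid = next(rotation)           # context switch every CPU cycle
    if pc[tid] < len(threads[tid]):
        print(f"cycle {clock}: issue {threads[tid][pc[tid]]} from thread {tid}")
        pc[tid] += 1
    else:
        print(f"cycle {clock}: bubble (thread {tid} has nothing ready)")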

The purpose of this type of multithreading is to remove all data-dependency stalls from the execution pipeline. Since one thread is relatively independent from other threads, there is less chance of one instruction in one pipe stage needing an output from an older instruction in the pipeline. Conceptually, it is similar to pre-emptive multi-tasking used in operating systems. One can make the analogy that the time slice given to each active thread is one CPU cycle.

Terminology

This type of multithreading was first called barrel processing, in which the staves of a barrel represent the pipeline stages and their executing threads. Interleaved, pre-emptive, fine-grained or time-sliced multithreading are more modern terminology.

Hardware costs

In addition to the hardware costs discussed in the Block type of multithreading, interleaved multithreading has an additional cost of each pipeline stage tracking the thread ID of the instruction it is processing. Also, since there are more threads being executed concurrently in the pipeline, shared resources such as caches and TLBs need to be larger to avoid thrashing between the different threads.

Simultaneous multi-threading

Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar CPUs with hardware multithreading. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures.

Concept

The most advanced type of multi-threading applies to superscalar processors. A normal superscalar processor issues multiple instructions from a single thread every CPU cycle. In simultaneous multi-threading (SMT), the superscalar processor can issue instructions from multiple threads every CPU cycle. Recognizing that any single thread has a limited amount of instruction-level parallelism, this type of multithreading tries to exploit parallelism available across multiple threads to decrease the waste associated with unused issue slots. For example:

1. Cycle i:   instructions j and j+1 from thread A and instruction k from thread B are all simultaneously issued
2. Cycle i+1: instruction j+2 from thread A, instruction k+1 from thread B, and instruction m from thread C are all simultaneously issued
3. Cycle i+2: instruction j+3 from thread A and instructions m+1 and m+2 from thread C are all simultaneously issued
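A minimal sketch of the SMT issue policy in the example above: each cycle, up to issue_width instructions are drawn from whichever threads have instructions ready, so slots one thread cannot fill are given to another. The issue width, thread streams, and the simple round-robin slot-filling order are assumptions for illustration.

# Simultaneous multithreading: fill the issue slots of one cycle from several threads.
issue_width = 3
threads = {
    "A": ["j", "j+1", "j+2", "j+3"],
    "B": ["k", "k+1"],
    "C": ["m", "m+1", "m+2"],
}
pc = {t: 0 for t in threads}

def ready(tid):
    return pc[tid] < len(threads[tid])

cycle = 0
while any(ready(t) for t in threads):
    issued = []
    progressed = True
    # Take one instruction from each ready thread in turn until this cycle's
    # issue slots are full or no thread has anything left to offer.
    while len(issued) < issue_width and progressed:
        progressed = False
        for tid in threads:
            if len(issued) < issue_width and ready(tid):
                issued.append(f"{threads[tid][pc[tid]]} ({tid})")
                pc[tid] += 1
                progressed = True
    print(f"cycle {cycle}: simultaneously issue " + ", ".join(issued))
    cycle += 1

Real SMT front-ends use smarter fetch and issue heuristics (for example, favouring the thread with the fewest instructions in flight), but the slot-sharing idea is the same.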

Terminology

To distinguish the other types of multithreading from SMT, the term temporal multithreading is used to denote when instructions from only one thread can be issued at a time.

Hardware costs

In addition to the hardware costs discussed for interleaved multithreading, SMT has the additional cost of each pipeline stage tracking the thread ID of each instruction being processed. Again, shared resources such as caches and TLBs have to be sized for the large number of active threads being processed.

Fixed interleave (CDC 6600 PPUs, 1965)
- Each of N threads executes one instruction every N cycles.
- If a thread is not ready to go in its slot, a pipeline bubble is inserted.
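A minimal sketch of the fixed-interleave scheme: each of N threads owns every N-th issue slot, and if the owning thread is not ready in its slot the hardware inserts a bubble rather than giving the slot away. The readiness pattern below is invented for illustration.

# Fixed interleave (CDC 6600 PPU style): thread t owns the cycles where cycle % N == t.
num_threads = 4

def thread_ready(tid, cycle):
    # Stand-in for real readiness (e.g. waiting on memory); the pattern is made up.
    return not (tid == 2 and cycle < 8)

for cycle in range(8):
    tid = cycle % num_threads          # the slot's owner is fixed by position
    if thread_ready(tid, cycle):
        print(f"cycle {cycle}: issue one instruction from thread {tid}")
    else:
        print(f"cycle {cycle}: bubble (thread {tid} not ready)")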

Software-controlled interleave (TI ASC PPUs, 1971)
- The OS allocates S pipeline slots amongst N threads.
- Hardware performs a fixed interleave over the S slots, executing whichever thread is in that slot.
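A minimal sketch of the software-controlled scheme: the OS writes thread IDs into a table of S slots (so one thread can get more than one slot), and the hardware simply cycles through the table. The slot-table contents below are an assumption for illustration.

# Software-controlled interleave (TI ASC PPU style): the OS decides the mix by
# filling S slots with thread IDs; the hardware blindly interleaves over the slots.
slot_table = ["A", "A", "B", "A", "C", "B", "A", "D"]   # S = 8 slots, chosen by the OS
S = len(slot_table)

for cycle in range(12):
    tid = slot_table[cycle % S]        # hardware: fixed interleave over the S slots
    print(f"cycle {cycle}: slot {cycle % S} -> execute thread {tid}")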

Hardware-controlled thread scheduling (HEP, 1982)
- Hardware keeps track of which threads are ready to go.
- It picks the next thread to execute based on a hardware priority scheme.
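A minimal sketch of hardware-controlled scheduling: each cycle the hardware looks at which threads are ready and issues from the highest-priority one. The ready bits and the priority order below are invented for illustration.

# Hardware-controlled thread scheduling (HEP style): pick the highest-priority
# ready thread every cycle; priorities and ready bits here are made up.
priority_order = ["T0", "T1", "T2", "T3"]          # T0 is most important
ready = {"T0": False, "T1": True, "T2": True, "T3": False}

for cycle in range(4):
    runnable = [t for t in priority_order if ready[t]]
    if runnable:
        chosen = runnable[0]                        # hardware priority scheme
        print(f"cycle {cycle}: issue from {chosen}")
    else:
        print(f"cycle {cycle}: bubble (no thread ready)")
    if cycle == 1:
        ready["T0"] = True                          # e.g. T0's memory reply arrives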
