
Understanding pipelining performance

The original Pentium 4 was a radical design for a number of reasons, but perhaps its most striking and controversial feature was its extraordinarily deep pipeline. At over 20 stages, the Pentium 4's pipeline was almost twice as deep as the pipelines of the P4's competitors. Recently Prescott, the 90nm successor to the Pentium 4, took pipelining to the next level by adding another 10 stages onto the Pentium 4's already unusually long pipeline. Intel's strategy of deepening the Pentium 4's pipeline, a practice that Intel calls "hyperpipelining", has paid off in terms of performance, but it is not without its drawbacks.

In previous articles on the Pentium 4 and Prescott, I've referred to the drawbacks associated with deep pipelines, and I've even tried to explain these drawbacks within the context of larger technical articles on Netburst and other topics. In the present series of articles, I want to devote some serious time to explaining pipelining, its effect on microprocessor performance, and its potential downsides. I'll take you through a basic introduction to the concept of pipelining, and then I'll explain what's required to make pipelining successful and what pitfalls face deeply pipelined designs like Prescott. By the end of the article, you should have a clear grasp of exactly how pipeline depth is related to microprocessor performance on different types of code.

Pipelining Introduction
Let us break down our microprocessor into 5 distinct activities, which generally correspond to 5 distinct pieces of hardware:

1. Instruction Fetch (IF)
2. Instruction Decode (ID)
3. Execution (EX)
4. Memory Read/Write (MEM)
5. Result Writeback (WB)

Any given instruction will only require one of these modules at a time, generally in this order. The following timing diagram of the multi-cycle processor will show this in more detail:

This is all well and good, but at any given moment 4 out of the 5 units are inactive and could likely be used for other things.
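The idle-unit observation above can be sketched with a few lines of code. This is a minimal illustrative model (the instruction names and stage timings are invented): a multi-cycle processor runs each instruction through the five units one at a time, so in any cycle exactly one unit is busy and four sit idle.

```python
# Hypothetical sketch of a multi-cycle (non-pipelined) processor:
# each instruction occupies one unit per cycle, in order.
STAGES = ["IF", "ID", "EX", "MEM", "WB"]

def multi_cycle_trace(n_instructions):
    """Return, for each cycle, the single (instruction, unit) pair that is
    busy; the other four units are idle in that cycle."""
    trace = []
    for i in range(n_instructions):
        for stage in STAGES:
            trace.append((f"instr{i}", stage))
    return trace

trace = multi_cycle_trace(2)
# Two instructions take 2 x 5 = 10 cycles, one busy unit per cycle.
print(len(trace))  # 10
```

With five units and one busy per cycle, utilization is stuck at 20%, which is exactly the waste pipelining sets out to recover.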

Pipelining Philosophy
Pipelining is concerned with the following tasks:

- Use multi-cycle methodologies to reduce the amount of computation in a single cycle.
- Shorter computations per cycle allow for faster clock cycles.
- Overlapping instructions allows all components of a processor to be operating on a different instruction.
- Throughput is increased by having instructions complete more frequently.

We will talk about how to make these things happen in the remainder of the chapter.

Pipelining Hardware

Given our multicycle processor, what if we wanted to overlap our execution, so that up to 5 instructions could be processed at the same time? Let's contract our timing diagram a little bit to show this idea:

As this diagram shows, each element in the processor is active in every cycle, and the instruction rate of the processor has been increased fivefold! The question now is, what additional hardware do we need in order to perform this task? We need to add storage registers between each pipeline stage to store the partial results between cycles, and we also need to reintroduce the redundant hardware from the single-cycle CPU. We can continue to use a single memory module (for instructions and data), so long as we restrict memory read operations to the first half of the cycle and memory write operations to the second half of the cycle (or vice versa). We can save time on the memory access by calculating the memory addresses in the previous stage.

The registers would need to hold the data from the pipeline at that point, and also the necessary control codes to operate the remainder of the pipeline. Our resultant processor design will look similar to this:

If we have 5 instructions, we can show them in our pipeline using different colors. In the diagram below, white corresponds to a NOP, and the different colors correspond to other instructions in the pipeline. Each stage, the instructions shift forward through the pipeline.
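The shift-forward behavior described above can be sketched in code. This is an illustrative model only (the instruction names and the register representation are invented): each pipeline register latches one instruction's partial state, and every clock tick shifts everything one stage forward while a new instruction enters at the front.

```python
# Hypothetical sketch: pipeline registers between the 5 stages, modeled as a
# list of slots. Each tick shifts every instruction one stage forward.
def tick(pipeline_regs, next_instruction=None):
    """Shift all latched instructions forward one stage, latch a new
    instruction into the first slot, and return whatever leaves WB."""
    retired = pipeline_regs[-1]
    # Shift from the back so nothing is overwritten before it moves.
    for i in range(len(pipeline_regs) - 1, 0, -1):
        pipeline_regs[i] = pipeline_regs[i - 1]
    pipeline_regs[0] = next_instruction
    return retired

regs = [None] * 5                         # one slot per pipeline stage
program = ["add", "sub", "lw", "sw", "and"]
retired = []
for cycle in range(10):                   # 5 cycles to fill + 5 to retire all
    incoming = program[cycle] if cycle < len(program) else None
    done = tick(regs, incoming)
    if done is not None:
        retired.append(done)
print(retired)  # ['add', 'sub', 'lw', 'sw', 'and']
```

Note that instructions retire in program order, one per cycle once the pipe is full, which is exactly the behavior the colored-box diagram illustrates. The `None` slots play the role of the white NOP boxes.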

Superpipelining

Superpipelining is the technique of increasing the pipeline depth in order to reduce the latency of individual stages and thereby increase the clock speed. If the ALU takes three times longer than any other module, we can divide the ALU into three separate stages, which will reduce the amount of time wasted waiting on the shorter stages. The problem here is that we need to find a way to subdivide our stages into shorter stages, and we also need to construct more complicated control units to operate the pipeline and prevent all the possible hazards. It is not uncommon for modern high-end processors to have more than 20 pipeline stages.

Example: Intel Pentium 4

The Intel Pentium 4 processor is a recent example of a super-pipelined processor. This diagram shows a Pentium 4 pipeline with 20 stages.
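The arithmetic behind splitting a slow stage is worth making concrete. This is a back-of-the-envelope sketch with invented delay numbers: the clock period is set by the slowest stage, so cutting a 3x-longer ALU stage into three equal pieces lets the whole pipeline clock three times faster.

```python
# Illustrative sketch with made-up delays (in nanoseconds): the cycle time
# of a pipeline is dictated by its slowest stage.
def cycle_time(stage_delays):
    return max(stage_delays)

before = [1.0, 1.0, 3.0, 1.0, 1.0]             # the 3.0 ns ALU dominates
after  = [1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0]   # ALU split into 3 x 1.0 ns
print(cycle_time(before), cycle_time(after))   # 3.0 1.0 -> 3x the clock rate
```

In practice the split stages are not perfectly equal and each new pipeline register adds latch overhead, so the real gain is somewhat less than 3x, which is part of the superpipelining trade-off described above.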

Instruction pipeline
An instruction pipeline is a technique used in the design of computers and other digital electronic devices to increase their instruction throughput (the number of instructions that can be executed in a unit of time). The fundamental idea is to split the processing of a computer instruction into a series of independent steps, with storage at the end of each step. This allows the computer's control circuitry to issue instructions at the processing rate of the slowest step, which is much faster than the time needed to perform all steps at once. The term pipeline refers to the fact that each step is carrying data at once (like water), and each step is connected to the next (like the links of a pipe). The origin of pipelining is thought to be either the ILLIAC II project or the IBM Stretch project, though a simple version was used earlier in the Z1 in 1939 and the Z3 in 1941.[1]

The IBM Stretch project proposed the terms Fetch, Decode, and Execute, which became common usage. Most modern CPUs are driven by a clock. The CPU consists internally of logic and memory (flip-flops). When the clock signal arrives, the flip-flops take their new values, and the logic then requires a period of time to decode the new values. Then the next clock pulse arrives, the flip-flops again take their new values, and so on. By breaking the logic into smaller pieces and inserting flip-flops between the pieces of logic, the delay before the logic gives valid outputs is reduced. In this way the clock period can be reduced. For example, the classic RISC pipeline is broken into five stages with a set of flip-flops between each stage:

1. Instruction fetch
2. Instruction decode and register fetch
3. Execute
4. Memory access
5. Register write back

When a programmer (or compiler) writes assembly code, they assume that each instruction is executed before execution of the subsequent instruction begins. This assumption is invalidated by pipelining. When this causes a program to behave incorrectly, the situation is known as a hazard. Various techniques for resolving hazards, such as forwarding and stalling, exist. A non-pipelined architecture is inefficient because some CPU components (modules) are idle while another module is active during the instruction cycle. Pipelining does not completely eliminate idle time in a CPU, but making those modules work in parallel improves program execution significantly. Processors with pipelining are organized internally into stages that can work semi-independently on separate jobs. Each stage is organized and linked into a 'chain', so each stage's output is fed to the next stage until the job is done. This organization of the processor allows overall processing time to be significantly reduced.

A deeper pipeline means that there are more stages in the pipeline, and therefore fewer logic gates in each stage. This generally means that the processor's frequency can be increased as the cycle time is lowered, because there are fewer components in each stage, so the propagation delay of each stage is decreased.[2] Unfortunately, not all instructions are independent. In a simple pipeline, completing an instruction may require 5 stages. To operate at full performance, this pipeline will need to run 4 subsequent independent instructions while the first is completing. If the 4 subsequent instructions depend on the output of the first, the pipeline control logic must insert a stall, or wasted clock cycle, into the pipeline until the dependency is resolved. Fortunately, techniques such as forwarding can significantly reduce the cases where stalling is required. While pipelining can in theory increase performance over an unpipelined core by a factor of the number of stages (assuming the clock frequency also scales with the number of stages), in reality most code does not allow for ideal execution.
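The cost of a dependency can be sketched with a toy stall counter. This assumes a classic 5-stage pipeline in which, with forwarding, an ALU result is usable by the very next instruction, while without forwarding the consumer must wait until the producer's write-back completes (with the register file written in the first half of a cycle and read in the second); the function and its numbers are illustrative, not any particular processor's behavior.

```python
# Toy model of RAW-hazard stalls in a 5-stage pipeline (assumed timings:
# forwarded results usable immediately; unforwarded results usable only
# after write-back, i.e. the producer must be 3 instructions ahead).
def stalls_for_dependency(distance, forwarding):
    """distance: how many instructions separate producer and consumer
    (1 = back-to-back). Returns the bubbles control logic must insert."""
    needed_gap = 1 if forwarding else 3
    return max(0, needed_gap - distance)

print(stalls_for_dependency(1, forwarding=False))  # 2 bubbles
print(stalls_for_dependency(1, forwarding=True))   # 0 bubbles
```

This is why forwarding matters so much in practice: it turns the common back-to-back dependency from two wasted cycles into none, leaving stalls mainly for cases like load-use delays.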

Advantages and Disadvantages


Pipelining does not help in all cases. There are several possible disadvantages. An instruction pipeline is said to be fully pipelined if it can accept a new instruction every clock cycle. A pipeline that is not fully pipelined has wait cycles that delay the progress of the pipeline.

Advantages of Pipelining:
1. The cycle time of the processor is reduced, thus increasing the instruction issue rate in most cases.
2. Some combinational circuits, such as adders or multipliers, can be made faster by adding more circuitry. If pipelining is used instead, it can save circuitry vs. a more complex combinational circuit.

Disadvantages of Pipelining:
1. A non-pipelined processor executes only a single instruction at a time. This prevents branch delays (in effect, every branch is delayed) and problems with serial instructions being executed concurrently. Consequently the design is simpler and cheaper to manufacture.
2. The instruction latency in a non-pipelined processor is slightly lower than in a pipelined equivalent. This is because extra flip-flops must be added to the data path of a pipelined processor.
3. A non-pipelined processor has a stable instruction bandwidth. The performance of a pipelined processor is much harder to predict and may vary more widely between different programs.
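The latency and throughput points above can be made concrete with a small calculation. The numbers here are invented for illustration: a fixed amount of combinational logic is divided among the stages, and each pipeline register adds a small latch overhead, so deeper pipelines buy throughput at the cost of slightly higher per-instruction latency.

```python
# Back-of-the-envelope sketch with made-up delays: splitting 10 ns of logic
# into stages, where each pipeline register adds 0.5 ns of latch overhead.
LOGIC_DELAY = 10.0   # ns of combinational logic for one whole instruction
LATCH_DELAY = 0.5    # ns of overhead per pipeline register (assumed)

def pipelined(n_stages):
    cycle = LOGIC_DELAY / n_stages + LATCH_DELAY
    latency = cycle * n_stages      # time for ONE instruction to finish
    throughput = 1.0 / cycle        # instructions per ns, in steady state
    return latency, throughput

lat1, thr1 = pipelined(1)   # unpipelined: 10.5 ns latency
lat5, thr5 = pipelined(5)   # 5 stages: 12.5 ns latency, ~4.2x throughput
print(lat1, lat5, thr5 / thr1)
```

Note the speedup is 4.2x rather than the ideal 5x, and single-instruction latency actually rises from 10.5 ns to 12.5 ns: exactly the trade-off listed under the disadvantages.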
Examples

Generic pipeline

Generic 4-stage pipeline; the colored boxes represent instructions independent of each other

To the right is a generic pipeline with four stages:

1. Fetch
2. Decode
3. Execute
4. Write-back (for lw and sw, memory is accessed after the execute stage)

The top gray box is the list of instructions waiting to be executed; the bottom gray box is the list of instructions that have been completed; and the middle white box is the pipeline. Execution is as follows:
Time	Execution
0	Four instructions are waiting to be executed
1	The green instruction is fetched from memory
2	The green instruction is decoded; the purple instruction is fetched from memory
3	The green instruction is executed (the actual operation is performed); the purple instruction is decoded; the blue instruction is fetched
4	The green instruction's results are written back to the register file or memory; the purple instruction is executed; the blue instruction is decoded; the red instruction is fetched
5	The green instruction is completed; the purple instruction is written back; the blue instruction is executed; the red instruction is decoded
6	The purple instruction is completed; the blue instruction is written back; the red instruction is executed
7	The blue instruction is completed; the red instruction is written back
8	The red instruction is completed
9	All instructions are executed
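The cycle-by-cycle trace above follows a simple pattern that a short sketch can reproduce (the stage and color names come from the example; the function itself is illustrative): the instruction issued at position i occupies stage (cycle - 1 - i) during a given cycle.

```python
# Sketch of the 4-stage generic pipeline trace: each instruction enters one
# cycle after the previous one and advances one stage per cycle.
STAGES = ["Fetch", "Decode", "Execute", "Write-back"]
instructions = ["green", "purple", "blue", "red"]   # issued one per cycle

def occupancy(cycle):
    """Map each busy stage to its instruction during a given cycle (1-based)."""
    busy = {}
    for i, name in enumerate(instructions):
        stage_index = cycle - 1 - i
        if 0 <= stage_index < len(STAGES):
            busy[STAGES[stage_index]] = name
    return busy

print(occupancy(1))  # {'Fetch': 'green'}
print(occupancy(4))  # all four stages busy, green in Write-back, red in Fetch
```

Cycle 4 is the steady state where every stage is busy; by cycle 8 the pipeline has drained and `occupancy(8)` is empty, matching the table's "red instruction is completed" row.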

Mathematical pipelines

Mathematical or arithmetic pipelines differ from instruction pipelines in that, when mathematically processing large arrays or vectors, a particular mathematical operation, such as a multiply, is repeated many thousands of times. In this environment, an instruction need only kick off an event whereby the arithmetic logic unit (which is pipelined) takes over and begins its series of calculations. Most of these circuits can be found today in math processors and in the math processing sections of CPUs like the Intel Pentium line.

History

Math processing (super-computing) began in earnest in the late 1970s with vector processors and array processors. These were usually very large, bulky super-computing machines that needed special environments and super-cooling of the cores. One of the early supercomputers was the Cyber series built by Control Data Corporation. Its main architect was Seymour Cray, who later resigned from CDC to head up Cray Research. Cray developed the XMP line of supercomputers, using pipelining for both multiply and add/subtract functions. Later, Star Technologies took pipelining to another level by adding parallelism (several pipelined functions working in parallel), developed by their engineer, Roger Chen. In 1984, Star Technologies made another breakthrough with the pipelined divide circuit, developed by James Bradley. By the mid-1980s, super-computing had taken off with offerings from many different companies around the world. Today, most of these circuits can be found embedded inside most microprocessors.

How is pipelining achieved in the 8086 microprocessor?

The execution unit (EU) tells the bus interface unit (BIU) from where to fetch instructions as well as where to read data. The EU gets the opcode of an instruction from an instruction queue; the EU then decodes and executes it. The BIU and EU operate independently: while the EU is executing an instruction, the BIU fetches further instruction codes from memory and stores them in the queue. This type of overlapping operation of the BIU and EU functional units of a microprocessor is called pipelining.

Pipelining in Microcontrollers and Microprocessors


A few important characteristics and features of the pipeline concept:

- Processes more than one instruction at a time, and doesn't wait for one instruction to complete before starting the next. Fetch, decode, execute, and write stages are executed in parallel.
- As soon as one stage completes, it passes on the result to the next stage and then begins working on another instruction.
- The performance of a pipelined system depends on the time it takes for any one stage to be completed, not on the total time for all stages as with non-pipelined designs.
- Each instruction takes 1 clock cycle for each stage, so the processor can accept 1 new instruction per clock. Pipelining doesn't improve the latency of instructions (each instruction still requires the same amount of time to complete), but it does improve the overall throughput.
- Sometimes pipelined instructions take more than one clock to complete a stage. When that happens, the processor has to stall and not accept new instructions until the slow instruction has moved on to the next stage.
- A pipelined processor can stall for a variety of reasons, including delays in reading information from memory, a poor instruction set design, or dependencies between instructions.
- Memory speed issues are commonly solved using caches. A cache is a section of fast memory placed between the processor and slower memory. When the processor wants to read a location in main memory, that location is also copied into the cache; subsequent references to that location can come from the cache, which will return a result much more quickly than main memory.
- Dependencies: since each instruction takes some amount of time to store its result, and several instructions are being handled at the same time, later instructions may have to wait for the results of earlier instructions to be stored. However, a simple rearrangement of the instructions in a program (called instruction scheduling) can often remove these performance limitations from RISC programs.
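The instruction-scheduling idea in the last point can be sketched concretely. This toy example uses invented instruction tuples and assumes a one-cycle load-use delay (a value loaded by `lw` is not ready for the very next instruction): moving an independent instruction between a load and its consumer hides the delay.

```python
# Illustrative sketch of instruction scheduling. Each instruction is a
# hypothetical (opcode, destination, sources) tuple; we assume a 1-cycle
# stall whenever an instruction uses the result of the load just before it.
def count_load_use_stalls(program):
    """Count one-cycle stalls caused by using a value in the instruction
    immediately after the load that produces it."""
    stalls = 0
    for prev, cur in zip(program, program[1:]):
        prev_op, prev_dest, _ = prev
        _, _, cur_srcs = cur
        if prev_op == "lw" and prev_dest in cur_srcs:
            stalls += 1
    return stalls

naive = [
    ("lw",  "r1", ()),            # r1 = load
    ("add", "r2", ("r1", "r3")),  # uses r1 immediately -> stall
    ("lw",  "r4", ()),
    ("sub", "r5", ("r4", "r6")),  # uses r4 immediately -> stall
]
scheduled = [naive[0], naive[2], naive[1], naive[3]]  # interleave the loads
print(count_load_use_stalls(naive), count_load_use_stalls(scheduled))  # 2 0
```

The reordered program computes exactly the same results, but by separating each load from its consumer it runs two cycles faster on this model, which is the essence of what a scheduling compiler does.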
