7 Pipelined Processor
As a production-management method, the assembly line is very effective at improving efficiency. It first appeared in automobile factories and has since spread to many industries. In processor design we borrow the same idea, the pipeline, to improve performance. Today we'll take a look at how a pipelined processor is designed.
Do you remember this old friend? He once showed us his excellent cooking skills; now let's watch him cook again. His work divides into several steps: first washing, then cutting, then cooking, and finally plating. Assume each step takes 1 minute, so one dish takes the chef 4 minutes, and cooking four dishes takes 16 minutes. Now the restaurant he runs is doing very well and has many customers, but people complain that the service is too slow. What should we do? I happen to be the owner of this restaurant, so I have to think about how to solve this problem. First of all, I found that the chef already works as fast as any chef I could afford, so I want to change the production-management model instead. What we have now is, of course, a non-pipelined mode of operation. How can we turn it into a pipelined one?
Let's first look at the appliances in the kitchen. We have a sink for washing vegetables, knives and boards for cutting, cookware for cooking, and containers for plating the finished dishes. At any moment the chef uses only one of them while the other three sit idle. So on the hardware side, the kitchen is well suited to pipelined operation. As I said just now, though, mine is a small business: I can't afford four chefs to do these four jobs separately. And on the other hand, if I could afford four such chefs, I might as well have them each cook a complete dish at the same time; there would be no need for the trouble of a pipeline. So what do I do? I let the chef go and hire 4 people with the same money. The chef was an all-rounder who could do anything; each of the four new hires will do only one job. Now that the staff and equipment are in place, I want the four steps to run as a pipeline: after each worker finishes the job at hand, he hands the result to the next stage for further processing. Such a handover needs unified timing, so there must be a conductor, and we don't need to hire anyone extra for this; a bugler will do. With the old chef, a dish was finished every four minutes, so the bugle would have sounded every four minutes to start the next dish. Now, assuming each step takes one minute just as it did for the chef, the bugler blows every minute instead. Each time the workers hear the bugle, they hand the result of their work to the next stage; of course, each must make sure his own work is finished before the bugle sounds. This bugle is just like the clock in a CPU, and set up this way, our clock frequency has increased to 4 times the original. Let's watch how our new kitchen works.
The orders have now been delivered, and the work begins.
First, the ingredients for the first dish are sent to the washing station; at this moment all the later stages are idle. After one minute, the washing is done.
The ingredients for the first dish move on to the cutting station, while the ingredients for the second dish enter washing. A minute later, the first dish has been cut and the second has been washed.
When the next bugle sounds, the first dish enters the cooking station, the second the cutting station, and the third the washing station; then another minute passes.
Now four dishes are on the assembly line at once, and every stage has begun working, so we can say the pipeline is full. What came before was the process of filling the pipeline. Another minute passes.
The first dish has finished every stage and is served. From then on, a finished dish comes out every minute.
The second dish.
The third dish.
Then the fourth. After 7 minutes, all four dishes are done. Because our current order has only four dishes, after the pipeline filled up we then went through the process of draining it. Of course, if the restaurant has many guests and orders keep streaming in, the pipeline can stay full all the time and send out a dish every minute.
Let's analyze the performance of this assembly line. It now takes us seven minutes to make four dishes, less than two minutes per dish on average, and once the line is full we can finish one dish per minute. With the non-pipelined approach we could only serve a dish every four minutes. So as long as we keep the pipeline full, performance can reach four times the original, and our investment in hardware resources has not changed significantly. One caveat: although the pipeline produces a dish every minute, any single dish still takes 4 minutes from start to finish; that time has not been shortened. With the kitchen example behind us, let's see how pipelining works on a real processor structure.
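The kitchen arithmetic generalizes: with k equal-length stages, a full pipeline finishes n items in k + (n - 1) stage-times, instead of the n * k a single worker needs. A minimal Python sketch (the function names are my own):

```python
def pipelined_time(n_items, n_stages, stage_minutes):
    """Fill the pipeline (n_stages steps), then finish one item per step."""
    return (n_stages + n_items - 1) * stage_minutes

def sequential_time(n_items, n_stages, stage_minutes):
    """One worker does every stage of every item back to back."""
    return n_items * n_stages * stage_minutes

# Four dishes, four 1-minute stages, as in the example.
print(pipelined_time(4, 4, 1))   # 7 minutes
print(sequential_time(4, 4, 1))  # 16 minutes
```

Note that for a single item (n = 1) both formulas give the same 4 minutes, matching the observation that pipelining does not speed up one dish on its own.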
This is the single-cycle processor we built before; it can correctly run a number of MIPS instructions.
The execution of MIPS instructions can be divided into 5 steps.
The first step is instruction fetch.
The second step is decode.
The third step is execute.
The fourth step is memory access.
The fifth step is write back.
Let's look at these steps alongside the datapath diagram.
Compared with the earlier structure diagram we have made one small change: the internal structure of the IFU is drawn in somewhat more detail.
Against this diagram, the fetch stage means accessing the instruction memory with the value of the PC to obtain the instruction encoding; it also has to generate the updated PC value.
In the decode stage, not only is the instruction encoding split into fields, but the values of the required registers are also read out of the register file.
The third step is carried out mainly in the ALU. For an arithmetic or logic instruction it performs the corresponding operation; for a memory-access instruction it calculates the memory address.
The fourth step is memory access. For a load instruction the corresponding data is read out of the data memory; for a store instruction the data is written into the data memory; other instructions do nothing substantial in this step.
The last step is write back, in which the instruction's result is written to the specified register in the register file.
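As a quick recap of the stage descriptions above, here is a small hypothetical Python table (the class names and the table itself are my own shorthand, not part of the MIPS specification) marking which of the five steps do substantive work for each kind of instruction:

```python
# Which stages do real work, per instruction class (an illustrative sketch):
# IF = fetch, ID = decode, EX = execute, MEM = memory access, WB = write back.
STAGE_WORK = {
    "alu":   ["IF", "ID", "EX", "WB"],         # no memory access needed
    "load":  ["IF", "ID", "EX", "MEM", "WB"],  # reads the data memory
    "store": ["IF", "ID", "EX", "MEM"],        # writes memory, nothing to write back
}

for kind, stages in STAGE_WORK.items():
    print(kind, "->", stages)
```

Every class still flows through all five pipeline slots in order; the table only shows where each one does useful work.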
Note that although we divide execution into these five steps, the division is only for ease of description. All signals must stay stable for the entire execution of the instruction. Take the address signal sent from the PC register to the instruction memory: if it changed before the instruction finished executing, the instruction word output by the memory would change, the register file would select differently numbered registers and send out different values, the ALU might perform a different operation, and the instruction could produce a wrong result. So, for a single-cycle processor, all signals must remain stable throughout the execution of the instruction.
If we want to pipeline this processor, we also find that the hardware resources used in these different stages are basically independent of each other. If we save the instruction encoding that the instruction memory outputs, we can update the PC register early and use the new value to fetch a new instruction from the instruction memory; while that new instruction is being fetched, the encoding of the instruction just fetched can be split into its bit fields, and the register file can send out the contents of the corresponding registers. So, just as in the pipeline analysis above, if we want to make full use of these hardware resources, we need to divide execution into several stages.
To split the circuit structure accordingly, we insert registers between the stages. These are called pipeline registers, and they hold all the information the previous stage passes to the next.
Take the boundary between fetch and decode as an example. We connect the output of the instruction memory to a register (2 in the figure); on a rising clock edge, the encoding output by the instruction memory is saved there. After that rising edge, even if the address input of the instruction memory (1) changes, the changed memory output will no longer be stored in this register (2). So at this point we can access the instruction memory with a new PC and obtain the binary encoding of the next instruction. Meanwhile, the encoding of the previous instruction is already sitting on the output of this pipeline register (2), where the following circuit cuts it into its bit fields. One of those fields, rs, goes to the register file, selects the corresponding register, and places its contents on a bus signal. That signal is in turn connected to another pipeline register (3). On the next rising clock edge, the value of the rs register needed by the current instruction is saved into that pipeline register, and at the same moment the binary encoding of the next instruction is saved into this pipeline register (2). Then, after the short clock-to-Q delay, the instruction word seen by the decode stage becomes that of the second instruction. Very soon, the rs value coming out of the register file changes as well, but that doesn't matter: the register value needed by the first instruction has already been saved in this pipeline register (3), and from there it is sent on to the ALU input (4). So, by inserting pipeline registers, we have basically turned this single-cycle processor into a pipelined processor.
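The latching behavior described here can be sketched as a toy Python model (all names are mine; this is not the real circuit). On every clock edge the pipeline register copies its input to its output, so the decode stage always works on the instruction fetched one cycle earlier, while fetch moves on to a new one:

```python
class PipelineReg:
    """Edge-triggered register: the output changes only on tick()."""
    def __init__(self):
        self.d = None   # combinational input from the previous stage
        self.q = None   # stable output seen by the next stage

    def tick(self):     # rising clock edge
        self.q = self.d

program = ["lw", "add", "sw"]          # hypothetical instruction stream
if_id = PipelineReg()                  # register (2): fetch -> decode
pc = 0
for cycle in range(4):
    decoding = if_id.q                 # decode sees last cycle's fetch
    if_id.d = program[pc] if pc < len(program) else None
    pc += 1                            # fetch already moves to the next PC
    if_id.tick()                       # clock edge: hand the word over
    print(f"cycle {cycle}: decoding {decoding}")
```

Changing `if_id.d` after `tick()` no longer affects `if_id.q` until the next edge, which is exactly why the fetch stage is free to present a new address while decode works undisturbed.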
Let's do a simple performance analysis of the pipelined processor. Suppose we are going to execute these three instructions. We draw a timeline starting at time zero, marked every 200 ps.
On a single-cycle processor, executing an instruction takes five steps: fetch, decode, execute, memory access, and write back. Assuming each phase takes exactly 200 ps, executing this instruction takes a total of 1000 ps; only then can we execute the second instruction, which takes another 1000 ps, and then the third. So each instruction takes 1000 ps, and the single-cycle processor has to set its clock cycle to 1000 ps. Seen from the outside, this processor completes one instruction every 1000 ps.
On the pipelined processor we execute the same three instructions. The first instruction goes through the same five steps and still takes 1000 ps to finish. The difference is that after 200 ps, when the first instruction has finished fetching and entered the decode stage, we can start fetching the second instruction; that is, the second instruction has already begun executing. Similarly, 200 ps later the first instruction finishes decoding and enters the execute stage, the second has just finished fetching and can enter decode, and the third can begin fetching. So although a single instruction still takes a total of 1000 ps on a pipelined processor, a new instruction can start every 200 ps, and once the pipeline is full an instruction completes every 200 ps. Such a pipelined processor can therefore set its clock cycle to 200 ps, so its clock frequency is 5 times that of the single-cycle processor.
Of course, this is the ideal scenario; real-world gains aren't as large. One reason is that the newly inserted pipeline registers bring delays of their own. Let's assume each of these registers has a latency of 50 ps and see how the processor's performance changes.
That was the performance analysis without the pipeline register latency. If we add it and execute the same instructions, a new instruction can only start every 250 ps, so the clock cycle has to be set to 250 ps. And each individual instruction now takes 1250 ps (5 clock cycles of 250 ps each) to complete, which is slower than on the single-cycle processor just now.
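The numbers in this comparison can be reproduced with a few lines of Python (the constant names are mine; the 200 ps stage delay and 50 ps register delay come from the text):

```python
STAGE_PS = 200                           # delay of each of the five stages
REG_PS = 50                              # delay added by a pipeline register
N_STAGES = 5

single_cycle_clk = N_STAGES * STAGE_PS   # clock must cover all five stages
ideal_pipe_clk = STAGE_PS                # ideal pipeline: one stage per clock
real_pipe_clk = STAGE_PS + REG_PS        # each cycle also pays the register
real_latency = N_STAGES * real_pipe_clk  # one instruction, start to finish

print(single_cycle_clk)   # 1000 ps
print(ideal_pipe_clk)     # 200 ps
print(real_pipe_clk)      # 250 ps
print(real_latency)       # 1250 ps
```

Once the pipeline is full, throughput is one instruction per 250 ps, still four times the single-cycle machine, even though each instruction's own latency has grown from 1000 ps to 1250 ps.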
So for a pipelined processor, the execution time of a whole program can be shortened because the processing units work in parallel. But pipelining does not shorten the execution time of a single instruction; on the contrary, it lengthens it. What pipelining really does is increase instruction throughput, which reduces the overall execution time of the program and improves system performance.
We now understand the basic principles of pipelining and have analyzed the general framework of a five-stage pipeline, which was also the architecture of early pipelined processors. Pipelined processor design has since seen many developments and changes, which we'll explore further in the next section.