Most of us have come across basic pipelines, such as the classic 3-stage or 5-stage designs, while exploring the web or during a bachelor's degree. These basic pipelines are also called in-order pipelines, because instructions are executed in the order in which they are fetched by the fetch stage.
Although in-order pipelines do have advantages over a conventional single-cycle design, frequent pipeline hazards limit their speedup, so the upgrade alone is often not worthwhile. This is where a concept called multiple issue comes into play. Multiple issue essentially replicates the pipeline's datapath n times so that up to n instructions can be processed concurrently. Such a processor is called a superscalar processor, since it processes multiple instructions at a time (note that this is not the same as a multicore processor, which replicates entire cores). Can this approach really improve performance? Let us find out by comparing a single-issue in-order processor with a 4-issue in-order processor.
As the figure above indicates, a multiple-issue in-order pipeline can increase performance by up to 37.5% compared to a standard single-issue pipeline. However, a 37.5% gain is a poor return for a 4-issue design that requires four times the functional units, along with the accompanying growth in complexity and area.
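To make the figure's point concrete, here is a minimal sketch that counts execution cycles for an in-order pipeline of configurable issue width. It is a toy model built on assumptions: execution latency is one cycle, a result written in cycle c is usable from cycle c+1, and the (destination, sources) instruction encoding and sample program are invented for illustration.

```python
def run_inorder(program, width):
    """Cycle count for an in-order pipeline issuing up to `width`
    instructions per cycle. Issue stops at the first instruction
    whose operands are not yet ready (the in-order constraint)."""
    ready_at = {}            # register -> first cycle its value is usable
    pc, cycle = 0, 0
    while pc < len(program):
        cycle += 1
        issued = 0
        while pc < len(program) and issued < width:
            dest, srcs = program[pc]
            if any(ready_at.get(s, 0) > cycle for s in srcs):
                break        # stall: all younger instructions wait too
            ready_at[dest] = cycle + 1   # result usable next cycle
            pc += 1
            issued += 1
    return cycle

# (destination, (sources)): a dependence chain plus independent work
prog = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",)),
        ("r4", ()), ("r5", ()), ("r6", ("r3",)), ("r7", ("r4", "r5"))]

print("1-issue:", run_inorder(prog, 1), "cycles")   # 7 cycles
print("4-issue:", run_inorder(prog, 4), "cycles")   # 4 cycles
```

On this dependence-heavy toy program, widening the pipeline from 1-issue to 4-issue shortens execution from 7 cycles to 4: a useful but far-from-4x improvement, mirroring the modest gain in the figure.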
What if there were a technique to rearrange instructions prior to execution? When a processor does not necessarily execute instructions in the order they were fetched, we call it out-of-order processing. This can be done in two different ways. First, software (a compiler) can reorder instructions statically, before they are stored in instruction memory. Second, an intricate on-chip hardware reordering unit can track all dependences and execute each instruction as soon as its operands are available.
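The first option can be illustrated with a minimal compile-time list-scheduling sketch. Everything here is an assumption made for illustration: the (destination, sources) encoding is invented, each register is written at most once, and the only goal is to avoid placing a consumer immediately after its producer. Real compilers schedule over much richer machine models.

```python
def list_schedule(program):
    """Greedy static scheduler: emit the oldest ready instruction,
    preferring one that does not read the register produced by the
    instruction emitted just before it (avoids back-to-back stalls)."""
    written = {dest for dest, _ in program}   # registers defined here
    done, out, last_dest = set(), [], None
    pending = list(program)
    while pending:
        pick = None
        for ins in pending:
            dest, srcs = ins
            if not all(s in done or s not in written for s in srcs):
                continue                      # a producer not emitted yet
            if last_dest not in srcs:
                pick = ins                    # ready and no stall: take it
                break
            pick = pick or ins                # fallback: ready but stalls
        pending.remove(pick)
        out.append(pick)
        done.add(pick[0])
        last_dest = pick[0]
    return out

# Before: t2 sits right behind t1, and t4 right behind t3 (two stalls).
prog = [("t1", ("a",)), ("t2", ("t1",)), ("t3", ("b",)), ("t4", ("t3",))]
print(list_schedule(prog))    # t1, t3, t2, t4: independent work fills the gaps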
The figure above contrasts a 4-issue in-order processor with a 4-issue out-of-order processor. The out-of-order pipeline performs up to 133% faster than the in-order one, a gap that strongly favors adopting the out-of-order approach.
Now let us look at how an out-of-order pipeline is designed. The figure above shows an on-chip instruction reordering unit. Instructions flow through the fetch and decode stages and are then stored in an instruction pool, which is shared by all the issue slots. Each cycle, the hardware chooses a set of instructions from the pool that have no dependences among themselves and sends them to the execution units in parallel; the results are finally written back. An instruction whose dependences are unresolved simply waits in the pool. The pool must be large enough to give the hardware a good likelihood of finding independent instructions to reorder, yet not so large that it slows the processor down or wastes power. In contemporary CPUs the instruction pool holds roughly 64 to 256 entries and is typically built out of SRAM.
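The second option, hardware reordering from a shared pool, can be sketched in the same toy style. The model below rests on assumptions: a 4-wide issue, one-cycle execution latency, a pool topped up in fetch order, and registers already renamed so each is written exactly once; register renaming itself, memory dependences, and precise exceptions are all ignored.

```python
WIDTH, POOL_SIZE = 4, 8      # issue width and instruction-pool capacity

def run_out_of_order(program):
    """Each cycle: top up the pool in fetch order, then issue up to
    WIDTH pooled instructions whose operands are ready, in any order."""
    written = {dest for dest, _ in program}
    ready_at, pool, fetched, cycle = {}, [], 0, 0
    while fetched < len(program) or pool:
        cycle += 1
        # fetch/decode keeps the shared pool topped up, in program order
        while fetched < len(program) and len(pool) < POOL_SIZE:
            pool.append(program[fetched])
            fetched += 1
        # select any instructions whose operands are available this cycle
        issued = [ins for ins in pool
                  if all(s not in written or ready_at.get(s, 1 << 30) <= cycle
                         for s in ins[1])][:WIDTH]
        for ins in issued:
            pool.remove(ins)
            ready_at[ins[0]] = cycle + 1   # result usable next cycle
    return cycle

# Two dependence chains plus independent work, in unhelpful fetch order.
prog = [("r1", ("a",)), ("r2", ("r1",)), ("r3", ("r2",)),
        ("r4", ("b",)), ("r5", ("r4",)), ("r6", ("c",))]
print("out-of-order:", run_out_of_order(prog), "cycles")   # 3 cycles
```

Each cycle the selector fills its issue slots with whatever is ready, regardless of program order, which is exactly the role the instruction pool plays in the figure. On this toy program the out-of-order model finishes in 3 cycles, while the 4-issue in-order model from the earlier sketch needs 4, because the r1-r2-r3 chain at the head of the stream blocks younger, ready instructions.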
References:
Smruti R. Sarangi, Next-Gen Computer Architecture: Till the End of Silicon, Version 2.0. https://www.cse.iitd.ac.in/~srsarangi/advbook/index.html