02.1 (conspect) Pipeline and branch prediction

Architecture as it is.

All previous lectures showed that computers evolved reactively: as problems appeared, technical capabilities grew to meet them. At some point the demand for performance outgrew what a single machine could offer. Simply buying many computers (n computers) does not increase the usable power n times, and besides, computers are expensive.

How to increase a computer's power.

There are several ways to increase the power of a computer. One is the clock frequency (one micro-operation is performed per clock cycle): doubling the frequency doubles the rate of operations, but temperature and power consumption grow faster than linearly, so this is not profitable.
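A standard first-order model (added here for reference, not given in the lecture) shows why. Dynamic power in CMOS logic scales as

    P_{dyn} \approx \alpha \cdot C \cdot V^2 \cdot f

where \alpha is the switching activity, C the switched capacitance, V the supply voltage and f the clock frequency. Raising f usually also requires raising V, so power grows roughly as f^3: doubling the frequency far more than doubles the heat to be dissipated.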

Moore's law (an empirical observation rather than a scientific law, but for some reason it works): every two years the power of computers doubles. It follows that any given level of performance becomes reachable fairly quickly. The problem of slow switching of the processor's transistors is solved very simply: make them smaller. Progress does not stop, feature sizes are now measured in nanometers, and scientists are trying to make them even smaller. The problem is that the layer thickness will soon be equivalent to several atoms.

Another, more knowledge-intensive way is to optimize the computer's actions themselves.

One interesting direction is to optimize the logic. There is the so-called memory wall: the processor cannot simply compute several operations ahead, because the data may not have arrived from memory yet. Overcoming the memory wall is very important, because ideally each operation takes exactly one cycle (as many operations, so many cycles), and stalls on memory break that.

Another way: take the processor and improve its structure. The number of processor cycles per unit of time is the processor's clock frequency.

There is a separate problem: access to memory. External memory is quite slow compared with registers, but there is a lot of it. It turns out that you can optimize not the processor itself but its interaction with memory.
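A small illustration (my example, not from the lecture) of optimizing the interaction with memory: accumulating in a local variable lets the compiler keep the value in a register instead of going through memory on every iteration.

    #include <stddef.h>

    /* Writing into *out each iteration forces a load and a store of
       memory every time (the compiler must assume *out may alias a). */
    void sum_slow(const int *a, size_t n, int *out) {
        *out = 0;
        for (size_t i = 0; i < n; i++)
            *out += a[i];
    }

    /* Accumulating in a local variable keeps the running sum in a
       register; memory is touched only once at the end. */
    void sum_fast(const int *a, size_t n, int *out) {
        int acc = 0;
        for (size_t i = 0; i < n; i++)
            acc += a[i];
        *out = acc;
    }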

The very first and simplest idea that comes to mind: just put in two processors at once, but this is pointless. Adding a separate node creates a lot of problems, so it is better to add many devices inside one processor. But then a data dependency appears: an instruction may use data that is calculated by another instruction, so the two cannot be executed in parallel. The same thing happens with memory when we use indirect addressing.
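A minimal sketch of both kinds of dependency (my example): a read-after-write dependency between arithmetic instructions, and an indirect load whose address is itself the result of a previous load.

    /* (2) reads the result of (1): a read-after-write dependency, so the
       two cannot execute in parallel. Likewise (4) cannot even form its
       address until the load in (3) completes. */
    int dependent(int x, int y, const int *p, const int *q, int i) {
        int a = x + y;    /* (1) */
        int b = a * 2;    /* (2) depends on (1) */
        int c = p[i];     /* (3) load */
        int d = q[c];     /* (4) indirect: address depends on (3) */
        return b + d;
    }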

Data dependence.

A control dependence is a situation where we cannot figure out which instruction we need to execute next.

For example, one instruction depends on another that decides which of two numbers is smaller; until we have compared them, we cannot continue. The same is true for a jump with indirect addressing.
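A sketch of both cases (my example, not from the lecture): a conditional branch whose direction depends on a comparison, and an indirect jump through a function pointer loaded from memory.

    /* The next instruction is unknown until the comparison is resolved;
       the call target is unknown until the pointer is read from memory. */
    int control_dep(int a, int b, int (*handlers[])(int), int k) {
        int m;
        if (a < b)                 /* control dependence on the comparison */
            m = a;
        else
            m = b;
        return handlers[k](m);     /* indirect jump: target read from memory */
    }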

It turns out that there is no way to get rid of dependencies entirely, but they can be minimized.

Dependence minimization.

What can we do with the processor to make everything work faster? The easiest way is to increase the number of computing devices and teach the processor to perform vector operations.

For example, take the addition operation but make it much more voluminous: add many arithmetic units (say, 128), and then we can introduce vector registers in which several operands are packed and added at once.

When we use a vector, it is very important that the vector is fully packed: if it is only partially filled, computing power is wasted. Classical algorithms do not respond well to vectorization. Another advantage is that code can be reordered to suit the vector units: reordering can be applied to the instruction stream, finding which instructions are dependent and which are not.
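A minimal sketch of a vector addition with a scalar tail for the incompletely packed case (my example; assumes an x86 machine with SSE2 and the <emmintrin.h> intrinsics):

    #include <emmintrin.h>
    #include <stddef.h>

    /* c[i] = a[i] + b[i]: four 32-bit integers are added per instruction.
       Elements that do not fill a whole 128-bit vector are handled by a
       scalar tail loop -- the "partially packed" case, where vector
       hardware would be wasted. */
    void vec_add(const int *a, const int *b, int *c, size_t n) {
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
            __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
            _mm_storeu_si128((__m128i *)(c + i), _mm_add_epi32(va, vb));
        }
        for (; i < n; i++)
            c[i] = a[i] + b[i];
    }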

There is also on-the-fly optimization, when the code is optimized automatically right before execution.

Superscalar operations.

Superscalar execution allows multiple instructions to be executed simultaneously. Sooner or later it runs into either a data constraint or a control constraint.

Another problem is that it is difficult to achieve full load of a superscalar processor.
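An illustration of why full load is hard (my example): a single accumulator forms a serial dependency chain, while several independent accumulators give a superscalar core independent instructions it can issue in parallel.

    #include <stddef.h>

    /* Every addition depends on the previous one: a chain the core
       cannot parallelize, however many ALUs it has. */
    long sum_chain(const long *a, size_t n) {
        long s = 0;
        for (size_t i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent chains: up to four additions can be in flight
       at once on a sufficiently wide core. */
    long sum_ilp(const long *a, size_t n) {
        long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        size_t i = 0;
        for (; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }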

One of the simplest implementations is called VLIW (Very Long Instruction Word). The idea is that we multiply all the functional units of the processor and write the program so that all of them work, everything is loaded. The idea is good, but there are almost no good compilers for VLIW.

The GPU does this better. For efficient loading, several parallel instructions can be launched at once; even if they are dependent, a free unit can be given one of the possible continuations to compute speculatively, and the result that turns out to be unneeded is thrown away.

When we compute both alternatives in parallel, we need two different sets of registers. It is also necessary to predict which of the branches is more likely, to reduce the amount of discarded work. If the prediction is correct, there will be a gain in performance.
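A common software-level counterpart of this idea (my example, not from the lecture): replacing a hard-to-predict branch with arithmetic, so there is nothing to mispredict and no speculative work to throw away.

    /* Branchy version: a mispredicted "if" costs a pipeline flush. */
    int max_branch(int a, int b) {
        if (a > b)
            return a;
        return b;
    }

    /* Branchless version: computed with a mask; compilers often turn
       code like this into a conditional-move instruction. */
    int max_branchless(int a, int b) {
        int mask = -(a > b);           /* all ones if a > b, else zero */
        return (a & mask) | (b & ~mask);
    }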

Pipeline.

JIT optimization takes ready-made code and optimizes it further, using information available only at run time. You can also teach the processor to rename registers: instructions that conflict only through register names can then run in parallel, but branches remain a problem.
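A sketch of what renaming removes (my illustrative example at the C level, where variables stand in for registers): a false dependency that exists only because a name is reused, and that disappears once each write gets its own physical register.

    /* (3) rewrites r, conflicting with (2) only through the name r.
       After renaming, (3)-(4) use a different physical register and
       can run alongside (1)-(2). */
    int renamed(int a, int b, int d, int e) {
        int r = a + b;    /* (1) */
        int c = r * 2;    /* (2) reads r */
        r = d + e;        /* (3) write-after-read conflict on the name r */
        int f = r * 3;    /* (4) */
        return c + f;
    }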

When executing an instruction, we split it into parts so that the corresponding parts of consecutive instructions can run in parallel. This method is called a pipeline.

The 5 pipeline stages:

  1. IF - Instruction Fetch
  2. ID - Instruction Decode
  3. EX - Execute
  4. MEM - Memory access
  5. WB - Register write back
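A standard textbook estimate (added here, not from the lecture): with k stages and n instructions, an ideal pipeline finishes in

    T_{pipe} = k + (n - 1) \quad \text{cycles, versus} \quad T_{seq} = k \cdot n

so for large n the speedup approaches k. The data and control dependencies described above are exactly what keeps real pipelines below this bound.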
