Lecture 1:

Introduction to application optimization using Intel® performance tools


Registers and memory

Processor registers have the smallest access time, so the number of available registers affects microprocessor performance.

Register spilling – a shortage of registers causes heavy data exchange between the registers and the application stack.

IA-32e

The EM64T technology added extra registers (in 64-bit mode the number of general-purpose registers grew from 8 to 16).

Nowadays memory access is much slower than computation.

There are two characteristics describing the properties of memory:

  • Response time (latency) – the number of processor cycles required to deliver data from a memory unit.
  • Bandwidth – the number of items that can be transferred between the processor and memory per cycle.

There are two possible performance improvement strategies – reduce the response time, or prefetch the necessary memory in advance.

Reduced memory access time is achieved with a cache system (a small amount of fast memory located on the processor).

Memory blocks are preloaded into the cache.

If the requested address is in the cache, there is a "hit" and data access is greatly accelerated.

Otherwise there is a "cache miss" and additional time is needed. In this case a block of memory is read into the cache during one or more bus cycles; this is called filling a cache line. (The cache line size is 64 bytes.)

There are different kinds of cache:

  • fully associative cache (each memory block can be placed anywhere in the cache)
  • direct mapping (each memory block can be loaded into exactly one place)
  • various hybrid schemes (sectored caches, set-associative caches)
    • Set-associative access: the least significant bits of the address determine the cache set a memory block can be loaded into; a set holds a few cache lines, and the mapping within a set is associative.

The quality of memory access is a main key to performance.

Modern computing architectures contain a complicated cache hierarchy.

Nehalem (Core i7):

  • L1 – latency 4 cycles
  • L2 – latency 11 cycles
  • L3 – latency 38 cycles
  • Main memory – latency > 100 cycles

The proactive memory access mechanism is implemented as hardware prefetching based on the history of cache misses: the processor tries to detect independent streams of data and prefetch them.

There is also a special set of instructions that allows a program to induce the processor to load the specified memory into the cache (software prefetching).
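
Below is a minimal C sketch of software prefetching. It uses the _mm_prefetch intrinsic from xmmintrin.h; the prefetch distance of 16 elements is an illustrative assumption that has to be tuned for a particular loop and machine.

    #include <xmmintrin.h>   /* _mm_prefetch, _MM_HINT_T0 */

    /* Scale an array; while element i is processed, ask the processor
       to pull the data needed 16 iterations ahead into the cache. */
    void scale(const float *a, float *b, long n, float k)
    {
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                _mm_prefetch((const char *)&a[i + 16], _MM_HINT_T0);
            b[i] = k * a[i];
        }
    }

For a simple sequential pattern like this one the hardware prefetcher usually does the job by itself; software prefetching pays off mainly on irregular access patterns that the hardware cannot detect.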

The principle of locality. The quality of prefetching

Locality of reference makes it possible to reuse variables and related data.

There is a difference between temporal locality – reuse of the same data and resources – and spatial locality – use of data located nearby in memory.

The caching mechanism uses the principle of temporal locality: before a new cache line is loaded into the cache, some other line has to be freed, and the cache mechanism evicts the one with the oldest access time.
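
A C sketch of exploiting temporal locality (the array and chunk sizes are illustrative assumptions): instead of making two full passes over an array that does not fit in cache, both operations are applied chunk by chunk while the data is still resident.

    #define N     (1L << 22)   /* 4M floats = 16 MB, larger than a typical L3 */
    #define CHUNK (1L << 13)   /* 8K floats = 32 KB, roughly L1-sized */

    /* Poor temporal locality: by the time the second pass starts, the
       beginning of the array has already been evicted from the cache. */
    void two_passes(float *a)
    {
        for (long i = 0; i < N; i++) a[i] *= 2.0f;
        for (long i = 0; i < N; i++) a[i] += 1.0f;
    }

    /* Better temporal locality: both operations are applied to one
       cache-sized chunk before moving on to the next. */
    void chunked_passes(float *a)
    {
        for (long c = 0; c < N; c += CHUNK) {
            for (long i = c; i < c + CHUNK; i++) a[i] *= 2.0f;
            for (long i = c; i < c + CHUNK; i++) a[i] += 1.0f;
        }
    }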

The prefetching engine uses the principle of spatial locality: it tries to detect the pattern of memory accesses and pre-load into the cache the memory that will be needed soon. The unit of preloaded memory (a cache line) is 64 bytes. Thus, with good spatial locality (data used together in a calculation is located nearby in memory), fewer cache lines have to be loaded into the cache.
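
A classic C illustration of spatial locality (the matrix size is an arbitrary assumption): summing a matrix row by row walks consecutive addresses and uses all 64 bytes of each loaded cache line, while summing it column by column touches a new cache line on almost every access.

    #define N 2048
    static double m[N][N];   /* C stores this array row by row */

    /* Good spatial locality: the inner loop walks consecutive addresses. */
    double sum_by_rows(void)
    {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += m[i][j];
        return s;
    }

    /* Poor spatial locality: consecutive accesses are N * 8 bytes apart,
       so only 8 of the 64 bytes of each loaded line are used. */
    double sum_by_columns(void)
    {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += m[i][j];
        return s;
    }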

One well-known performance problem is "cache aliasing": an unfortunate memory layout of the objects participating in a calculation causes useful cache lines to be repeatedly displaced by other needed addresses.
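
One way such aliasing arises, sketched in C (all sizes here are illustrative assumptions): when the stride between consecutive accesses is a large power of two, the accessed elements map to the same small group of cache sets and keep displacing each other long before the cache capacity is exhausted.

    #define ROWS 1024
    #define COLS 1024               /* one row = 4096 bytes, a power of two */
    static float t[ROWS][COLS];

    /* Walking down a column: consecutive accesses are 4096 bytes apart,
       so they compete for the same cache sets. */
    float sum_column(int j)
    {
        float s = 0.0f;
        for (int i = 0; i < ROWS; i++)
            s += t[i][j];
        return s;
    }

    /* A common remedy: pad each row so that successive rows fall into
       different cache sets (the pad of 16 floats is an assumption). */
    static float t_padded[ROWS][COLS + 16];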


Fig. 1.2.

Table. Pipeline

    tick | Instruction fetch | Register fetch | Instruction decode | Execution | Data fetch | Write back
      0  | instr. 1          | -              | -                  | -         | -          | -
      1  | instr. 2          | instr. 1       | -                  | -         | -          | -
      2  | instr. 3          | instr. 2       | instr. 1           | -         | -          | -
      3  | instr. 4          | instr. 3       | instr. 2           | instr. 1  | -          | -
      4  | instr. 5          | instr. 4       | instr. 3           | instr. 2  | instr. 1   | -
      5  | instr. 6          | instr. 5       | instr. 4           | instr. 3  | instr. 2   | instr. 1
      6  | instr. 7          | instr. 6       | instr. 5           | instr. 4  | instr. 3   | instr. 2

The quality of pipelining and instruction-level parallelism

Pipelining assumes that successive instructions are processed simultaneously during execution, but at different stages of the pipeline.

Typical instruction execution can be divided into the following steps:

  • instruction fetch – IF;
  • instruction decoding / register selection – ID;
  • execution / calculation of effective memory addresses – EX;
  • memory access – MEM;
  • storing the result (write back) – WB.

Pipelining improves processor throughput, but if an instruction depends on the results of previous instructions, there will be delays. Thus the benefit of pipelining depends on the level of instruction parallelism in the code.
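
The effect can be seen in a simple reduction, sketched in C below. In the first loop every addition depends on the previous one, so the pipeline waits on the dependence chain; splitting the sum into four independent accumulators (an illustrative unrolling factor) gives the pipeline independent additions to overlap. Note that this reassociates floating-point addition, which can slightly change rounding.

    /* One serial dependence chain: each add waits for the previous one. */
    double sum_dep(const double *a, long n)
    {
        double s = 0.0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* Four independent chains: the adds can occupy different pipeline
       stages (or different execution units) at the same time. */
    double sum_ilp(const double *a, long n)
    {
        double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
        long i;
        for (i = 0; i + 4 <= n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++)      /* remainder */
            s0 += a[i];
        return (s0 + s1) + (s2 + s3);
    }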

The quality of prediction

Instructions may depend on data and on control logic (data dependence and control flow dependence).

The efficiency of the pipeline is limited by conditional branches in the instruction flow: after a conditional branch, the following instructions are not known until the condition has been evaluated. Should the pipeline be stalled?

The branch predictor is designed to solve this problem.

The predictor selects one possible path and continues fetching and processing instructions along it.

All speculatively processed instructions are kept in the pipeline storage. If the predictor's assumption was correct, they are all marked as valid; otherwise a "branch misprediction" has occurred: the pipeline storage has to be flushed and new instructions fetched.

There are static and dynamic predictors:

  • A static predictor uses simple fixed rules;
    • Trivial prediction – a forward branch is predicted not taken, and a backward jump is predicted taken;
  • A dynamic predictor collects statistics on every branch and bases its choice on this information.
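
A small C sketch of the cost of misprediction: on random input the condition below is taken about half the time and is frequently mispredicted; rewriting the choice as arithmetic removes the conditional jump from the hot loop. Whether this wins in practice depends on the data and on the compiler, which may apply the same transformation itself.

    /* Hard to predict when a[] holds random values around the threshold. */
    long count_branch(const int *a, long n, int limit)
    {
        long cnt = 0;
        for (long i = 0; i < n; i++)
            if (a[i] < limit)
                cnt++;
        return cnt;
    }

    /* Branchless version: the comparison yields 0 or 1, no jump needed. */
    long count_branchless(const int *a, long n, int limit)
    {
        long cnt = 0;
        for (long i = 0; i < n; i++)
            cnt += (a[i] < limit);
        return cnt;
    }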

There is also branch target prediction, which predicts the destination addresses of jumps (including unconditional ones).
