Published: 12.07.2012 | Access: free | Students: 355 / 24 | Rating: 4.00 / 4.20 | Duration: 11:07:00
Specialties: Programmer
Lecture 1:

Introduction to application optimization using Intel® performance tools


Superscalarity

A superscalar processor is a processor that can execute multiple instructions per clock cycle. To do this, it has several execution units.

The superscalar technique has several identifying characteristics:

  • Instructions are issued from a sequential instruction stream
  • Dedicated hardware detects data dependences between instructions at run time
  • The CPU accepts multiple instructions per clock cycle

Modern general-purpose CPUs are typically both superscalar and pipelined.

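Each execution unit is specialized for particular kinds of operations. The CPU is used most effectively when the instruction stream mixes different instruction types and has a high level of instruction-level parallelism, so that several execution units can be kept busy at once.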
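
The effect of instruction-level parallelism can be illustrated with a small sketch in C. The function names `sum_single` and `sum_unrolled` are hypothetical and chosen only for this example. The first loop forms one serial dependency chain; the second keeps four independent partial sums, which gives a superscalar core more instructions it may issue in the same cycle. Whether this is actually faster depends on the compiler and the CPU, and with general floating-point data the two versions may differ in rounding.

```c
#include <stddef.h>

/* One accumulator: every addition depends on the previous one,
   so the additions form a single serial dependency chain. */
double sum_single(const double *x, size_t n)
{
    double s = 0.0;
    for (size_t i = 0; i < n; ++i)
        s += x[i];
    return s;
}

/* Four accumulators: the four additions in one iteration do not
   depend on each other, so they may be issued to different
   execution units in the same clock cycle. */
double sum_unrolled(const double *x, size_t n)
{
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += x[i];
        s1 += x[i + 1];
        s2 += x[i + 2];
        s3 += x[i + 3];
    }
    for (; i < n; ++i)      /* remaining elements when n is not a multiple of 4 */
        s0 += x[i];
    return (s0 + s1) + (s2 + s3);
}
```

An optimizing compiler may perform a similar transformation on its own when it is allowed to reorder floating-point operations.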

Fig. 1.3. The Intel Core Microarchitecture Pipeline Functionality

Fig. 1.4. Simplified processor model

Vector instructions and Vectorization

A typical vector instruction performs an elementary operation on two fixed-length sequences of elements held in memory or in vector registers:

C(1:n) = A(1:n) + B(1:n)

Fortran array sections are a convenient notation for vector operations.

Vectorization is the process of converting scalar calculations, in which an operation is performed on a pair of scalar operands, into a vector representation, in which a single operation is performed on a pair of vector operands. Each vector contains several scalar operands.
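
As a minimal sketch, the array-section assignment above corresponds to the following scalar loop in C. The function name `vector_add` is an assumption made for this example. The `restrict` qualifiers (C99) promise the compiler that the arrays do not overlap, which is one of the conditions a vectorizing compiler usually needs before it can replace the loop with vector instructions.

```c
#include <stddef.h>

/* Scalar form of C(1:n) = A(1:n) + B(1:n).
   Each iteration is independent of the others, so a vectorizing
   compiler may process several elements with one vector instruction. */
void vector_add(float *restrict c, const float *restrict a,
                const float *restrict b, size_t n)
{
    for (size_t i = 0; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

Whether the loop is actually vectorized depends on the compiler and its optimization options; compilers typically offer a report that shows which loops were vectorized.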

The Pentium III processor of the x86 family introduced SSE (Streaming SIMD Extensions). SSE added eight 128-bit registers (XMM0-XMM7) and 70 new instructions, including instructions for single-precision floating-point arithmetic.

SSE2, SSE3, SSSE3, SSE4 (SSE4.1 and SSE4.2) and AVX are further extensions of SSE.
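
The same addition can be written by hand with SSE intrinsics. This is only a sketch under stated assumptions: the function name `add_sse` is hypothetical, and the `__SSE__` macro is the one that GCC and Clang define when SSE is enabled (other compilers may use a different macro). One 128-bit XMM register holds four single-precision values, so each `_mm_add_ps` performs four additions. On targets without SSE the code falls back to the plain scalar loop.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_add_ps, ... */
#endif

void add_sse(float *c, const float *a, const float *b, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    /* Process four floats per iteration in 128-bit registers.
       The unaligned load/store forms work for any address. */
    for (; i + 4 <= n; i += 4) {
        __m128 va = _mm_loadu_ps(a + i);
        __m128 vb = _mm_loadu_ps(b + i);
        _mm_storeu_ps(c + i, _mm_add_ps(va, vb));
    }
#endif
    /* Scalar loop for the remaining elements
       (or for all elements when SSE is not available). */
    for (; i < n; ++i)
        c[i] = a[i] + b[i];
}
```

In practice an optimizing compiler can often generate equivalent code from the plain loop, so hand-written intrinsics are usually kept for cases where automatic vectorization fails.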

Look ahead and out-of-order execution

Modern x86 microprocessors have advanced mechanisms that look ahead in the instruction stream and identify instructions that can be executed in parallel. If the look-ahead buffer contains enough instructions that can be processed together, then the processor pipeline works with maximum effectiveness.

As a result, instructions may be executed in an order different from program order (out-of-order execution).

Implementing out-of-order mechanisms makes the processor architecture more complicated and increases energy consumption. Some Intel processors do not support out-of-order execution (for example, Itanium and early Atom models). For such processors, instruction scheduling at compile time is a key factor in achieving good performance.

Fig. 1.5. The Intel NetBurst Microarchitecture

Parallelization and multi-core

Multitasking is a technique in which multiple tasks, also known as processes, share the resources of a microprocessor.

Multithreaded computers have hardware support for executing multiple threads efficiently. Threads belong to a process and share its memory. Multithreading makes it possible to divide a calculation into several parts that are processed in parallel.

Hyper-Threading Technology allows one core to interleave the instruction streams of different threads or processes in order to improve instruction-level parallelism.

Pentium 4 - Core i7

Multi-core: a microprocessor contains several cores, each a superscalar pipeline with its own computational resources; the cores share the system bus, memory and upper-level caches.

Multiprocessor systems contain several processors.

Multiprocessor and multi-core systems make it possible to increase application performance by running multiple threads.
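
Dividing a calculation among threads can be sketched with an OpenMP directive. The function name `parallel_sum` is an assumption for this example. When the program is built with OpenMP support enabled, the loop iterations are distributed among threads and the per-thread partial sums are combined by the `reduction` clause. When OpenMP is not enabled, the pragma is ignored and the loop runs serially with the same result for this kind of data.

```c
/* Sum of an array, with the loop iterations shared among threads.
   reduction(+:sum) gives each thread a private partial sum and
   adds the partial sums together at the end of the loop. */
double parallel_sum(const double *x, long n)
{
    double sum = 0.0;
    long i;
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < n; ++i)
        sum += x[i];
    return sum;
}
```

The speedup depends on the number of cores, on the amount of work per thread and on memory bandwidth; for very small arrays the cost of starting threads can outweigh the gain.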

Main characteristics of an application that affect its performance

  • Computational efficiency
  • Efficient memory usage
  • Correct branch prediction
  • Efficient use of vector instructions
  • Effectiveness of parallelization
  • Level of instruction-level parallelism
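
Memory usage is one of the characteristics above that is easy to demonstrate. The sketch below (with the hypothetical function names `sum_by_rows` and `sum_by_cols`) computes the same sum over a matrix stored row by row in a flat array. Both functions return the same value, but they visit memory in a different order, and on large matrices the column-wise version usually runs noticeably slower because it makes poor use of the cache.

```c
#include <stddef.h>

/* Row-by-row traversal of a row-major matrix: consecutive iterations
   touch adjacent addresses, so each fetched cache line is fully used. */
double sum_by_rows(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t i = 0; i < rows; ++i)
        for (size_t j = 0; j < cols; ++j)
            s += m[i * cols + j];
    return s;
}

/* Column-by-column traversal of the same data: consecutive iterations
   are cols elements apart, which leads to more cache misses
   when the matrix does not fit in the cache. */
double sum_by_cols(const double *m, size_t rows, size_t cols)
{
    double s = 0.0;
    for (size_t j = 0; j < cols; ++j)
        for (size_t i = 0; i < rows; ++i)
            s += m[i * cols + j];
    return s;
}
```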

Performance measuring

What factors affect the performance of a specific program?

  • Compiler quality
  • Performance of the computer system
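
Both factors show up when the same code is timed after being built with different compilers or options, or run on different machines. A minimal timing sketch using the standard `clock()` function follows; `sample_work` and `cpu_seconds` are hypothetical names introduced only for this illustration, and the workload itself is arbitrary.

```c
#include <time.h>

static volatile double sink;   /* the stored result keeps the loop from being removed */

/* Arbitrary workload used only for illustration. */
void sample_work(void)
{
    double s = 0.0;
    for (int i = 1; i <= 1000000; ++i)
        s += 1.0 / i;
    sink = s;
}

/* Returns the CPU time, in seconds, consumed by `repeats` calls of work(). */
double cpu_seconds(void (*work)(void), int repeats)
{
    clock_t start = clock();
    for (int r = 0; r < repeats; ++r)
        work();
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

A call such as `cpu_seconds(sample_work, 10)` gives one measurement; repeating the measurement several times and comparing the results helps to reduce the influence of other activity on the system.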

Consumers need criteria for evaluating computer system performance:

  • A representative set of typical tasks
  • A universal testing scheme
  • Independence from microprocessor (MP) manufacturers

SPEC (Standard Performance Evaluation Corporation, spec.org) is a non-profit organization that creates, supports and maintains a standard set of tests for comparing the performance of different computer systems. It develops and publishes standard suites for performance measurement.

CPU2006 is a suite designed to measure processor performance. It can be used to compare how the same programs run on different computer systems.

OMP2001 measures performance on tests that use the OpenMP standard for shared-memory parallel processing.

Optimizing compiler role

A compiler translates the entire source program into an equivalent program in machine code or assembly language.

Does the compiler play a role in the competition for microprocessor (MP) performance?

  • The compiler is used to test and debug the functionality of a new MP.
  • Performance gains of a new computer system that come from a new instruction set or a larger number of registers can be demonstrated only with an optimizing compiler that supports these innovations.
  • The compiler can compensate for the architects' mistakes.