OpenMP Day 2
WorkSharing, Schedule,
Synchronization and
OMP best practices
✓ What is OPENMP?
✓ Fork/Join Programming model
✓ OPENMP Core Elements
✓ #pragma omp parallel OR Parallel construct
✓ run time variables
✓ environment variables
✓ data scoping (private, shared…)
✓ compile and run an OpenMP program in C++ and Fortran
✓ work sharing constructs
#pragma omp for
sections
tasks
➢ schedule clause
➢ synchronization
Recap of Day 1
Work Sharing: sections
The SECTIONS directive is a non-iterative
work-sharing construct.
It specifies that the enclosed
section(s) of code are to be divided
among the threads in the team.
Each SECTION is executed ONCE by a
thread in the team.
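A minimal C sketch of the SECTIONS construct (illustrative only; the array names are assumptions, not the deck's own example). Each section is a structured block executed exactly once, by whichever thread picks it up:

#include <omp.h>

void init_all(double *x, double *y, double *z, int n)
{
    #pragma omp parallel sections
    {
        #pragma omp section
        for (int i = 0; i < n; i++) x[i] = 0.0;   /* executed once, by one thread */

        #pragma omp section
        for (int i = 0; i < n; i++) y[i] = 1.0;   /* executed once, by one thread */

        #pragma omp section
        for (int i = 0; i < n; i++) z[i] = 2.0;   /* executed once, by one thread */
    }   /* implicit barrier at the end of the SECTIONS construct */
}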
Work Sharing: sections
OpenMP: lastprivate Clause
• Creates a private memory location for each thread.
• Does not initialize the private variable.
• The value from the sequentially last iteration of the associated loop, or from the
lexically last section construct, is copied back to the original list item.
!$OMP DO PRIVATE(I) LASTPRIVATE(B)
DO i = 1, 1000
  B = i
ENDDO
!$OMP END DO
! value of B here is 1000
!$OMP SECTIONS LASTPRIVATE(B)
!$OMP SECTION
  B = 2
!$OMP SECTION
  B = 4
!$OMP SECTION
  D = 6
!$OMP END SECTIONS
Work Sharing: tasks
#pragma omp task [clauses]
• Tasks make it possible to parallelize irregular problems
(unbounded loops & recursive algorithms)
• A task has: code to execute, a data environment (it owns
its data), internal control variables, and an assigned
thread that executes the code on that data
• Each encountering thread packages a new instance of
a task (code and data)
• Some thread in the team executes the task at some
later time
Work Sharing: tasks
#pragma omp taskwait
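A minimal sketch of tasks plus taskwait, using the classic recursive Fibonacci (an assumed example, not the deck's own code). Each call packages its two recursive calls as tasks; taskwait makes the parent wait for its children before combining the results:

#include <stdio.h>
#include <omp.h>

long fib(int n)
{
    long a, b;
    if (n < 2) return n;            /* a real code would also cut off task creation for small n */

    #pragma omp task shared(a)      /* child task 1 */
    a = fib(n - 1);

    #pragma omp task shared(b)      /* child task 2 */
    b = fib(n - 2);

    #pragma omp taskwait            /* wait for the two child tasks above */
    return a + b;
}

int main(void)
{
    long result;
    #pragma omp parallel
    {
        #pragma omp single          /* one thread creates the root of the task tree */
        result = fib(25);
    }
    printf("fib(25) = %ld\n", result);
    return 0;
}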
Work Sharing: single
!$OMP SINGLE [clause ...]
PRIVATE (list)
FIRSTPRIVATE (list)
block
!$OMP END SINGLE [ NOWAIT ]
#pragma omp single [clause ...] newline private (list)
firstprivate (list) nowait structured_block
• The SINGLE directive specifies that the enclosed code
is to be executed by only one thread in the team.
• May be useful when dealing with sections of code that
are not thread safe (such as I/O)
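A minimal sketch of SINGLE in C (assumed example): every thread does the parallel work, but the I/O block is executed by only one of them, with an implicit barrier afterwards unless nowait is given:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();   /* all threads execute this */

        #pragma omp single                /* only one thread executes the block (e.g. I/O) */
        printf("Report printed by thread %d\n", tid);
        /* implicit barrier here, unless 'nowait' is added to the single */
    }
    return 0;
}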
Schedule Clause
How is the work divided among threads?
Directives for work distribution
Schedule Clause: Types
A schedule kind is passed to an OpenMP loop schedule clause:
• Provides a hint for how iterations of the corresponding
OpenMP loop should be assigned to threads in the team of
the OpenMP region surrounding the loop.
• Five kinds of schedules for an OpenMP loop:
static
dynamic
guided
auto
runtime
• The OpenMP implementation and/or runtime defines how to
assign chunks to threads of a team, given the kind of schedule
specified as a hint.
STATIC: Iterations of a loop are divided into chunks of size ceiling(iterations/threads). Each
thread is assigned a separate chunk.
STATIC, N: Iterations of a loop are divided into chunks of size N. Each chunk is assigned to a
thread in round-robin fashion. N >= 1 (integer expression)
DYNAMIC: Iterations of a loop are divided into chunks of size 1.
Chunks are assigned to threads on a first-come, first-served basis as threads become available.
This continues until all work is completed.
DYNAMIC, N: Same as above, but all chunks are of size N.
GUIDED: Chunks are made progressively smaller until a chunk size of one is reached. The first
chunk is of size ceiling(iterations/threads). Remaining chunks are of
size ceiling(iterations_remaining/threads). Chunks are assigned to threads on a first-come,
first-served basis as threads become available. This continues until all work is completed.
GUIDED, N: Minimum chunk size is N.
AUTO: Delegates the scheduling decision to the compiler and/or runtime system.
RUNTIME: Scheduling policy is determined at run time, via the OMP_SCHEDULE environment
variable or the omp_set_schedule() runtime routine.
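A small sketch of how the schedule clause is written in C (the loop bodies are placeholders):

#include <omp.h>

void scale(double *a, int n)
{
    /* static,4 : iterations handed out in fixed chunks of 4, round-robin */
    #pragma omp parallel for schedule(static, 4)
    for (int i = 0; i < n; i++)
        a[i] *= 2.0;

    /* dynamic,8 : chunks of 8 handed out first-come, first-served;
       useful when iteration costs are very uneven */
    #pragma omp parallel for schedule(dynamic, 8)
    for (int i = 0; i < n; i++)
        a[i] += 1.0;

    /* runtime : kind and chunk taken from OMP_SCHEDULE / omp_set_schedule() */
    #pragma omp parallel for schedule(runtime)
    for (int i = 0; i < n; i++)
        a[i] -= 1.0;
}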
Schedule Clause
OpenMP: Synchronization
• The programmer needs finer control over how variables are shared.
• The programmer must ensure that threads do not interfere with each other so
that the output does not depend on how the individual threads are scheduled.
• In particular, the programmer must manage threads so that they read the
correct values of a variable and that multiple threads do not try to write to a
variable at the same time.
• Data dependencies and Task Dependencies
• MASTER, CRITICAL, BARRIER, FLUSH, TASKWAIT, ORDERED, NOWAIT
Data Dependencies
OpenMP assumes that there is NO data
dependency across the units of work running in parallel.
When the omp parallel directive is placed around
a code block, it is the programmer’s
responsibility to make sure data dependencies are
ruled out.
Synchronization Constructs
1) Mutual Exclusion (Data Dependencies)
Critical Sections : Protect access to shared & modifiable data,
allowing ONLY ONE thread to enter it at a given time
#pragma omp critical
#pragma omp atomic – special case of critical, less overhead
Locks
Only one thread
updates this at a
time
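A minimal sketch contrasting atomic and critical (assumed example): atomic protects a single memory update cheaply, while critical protects an arbitrary block, one thread at a time:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    long sum = 0;
    int max_seen = 0;

    #pragma omp parallel for
    for (int i = 0; i < 1000; i++) {
        #pragma omp atomic            /* single memory update: low overhead */
        sum += i;

        #pragma omp critical          /* arbitrary block: only one thread at a time */
        {
            if (i > max_seen)
                max_seen = i;
        }
    }
    printf("sum=%ld max=%d\n", sum, max_seen);
    return 0;
}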
Synchronization Constructs
To impose order constraints and protect shared data.
Achieved by Mutual Exclusion & Barriers
2) Barriers (Task Dependencies)
Implicit : Sync points exist at the end of
parallel – necessary barrier – can’t be removed
for – can be removed by using the nowait clause
sections – can be removed by using the nowait clause
single – can be removed by using the nowait clause
Explicit : Must be used when ordering is required
#pragma omp barrier
each thread waits until all threads arrive at the barrier
Explicit Barrier
Implicit Barrier at end
of parallel region
No Barrier
nowait cancels barrier
creation
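A minimal sketch of implicit barriers, nowait and an explicit barrier (assumed example): the two loops are independent, so their implicit barriers can be dropped; one explicit barrier is then needed before the results are combined:

#include <omp.h>

void stages(double *a, double *b, int n)
{
    #pragma omp parallel
    {
        #pragma omp for nowait        /* nowait removes this loop's implicit barrier */
        for (int i = 0; i < n; i++)
            a[i] = i;

        #pragma omp for nowait        /* b[] is independent of a[], so no wait needed */
        for (int i = 0; i < n; i++)
            b[i] = 2 * i;

        #pragma omp barrier           /* explicit barrier: required before a[] and b[] are used together */

        #pragma omp single
        a[0] += b[n - 1];
    }   /* implicit barrier at the end of the parallel region cannot be removed */
}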
Synchronization: Barrier
OPENMP Synchronization: review
PRAGMA DESCRIPTION
#pragma omp taskwait
!$OMP TASKWAIT
Specifies a wait on the completion of child tasks generated
since the beginning of the current task
#pragma omp critical
!$OMP CRITICAL
!$OMP END CRITICAL
Code within the block or pragma is only executed on one
thread at a time.
#pragma omp atomic
!$OMP ATOMIC
!$OMP END ATOMIC
Provides a mini-CRITICAL section: a specific memory location
must be updated atomically (atomic statements)
#pragma omp barrier
!$OMP BARRIER
Synchronizes all threads in a team; all threads pause at the
barrier until all threads have executed the barrier.
OPENMP Synchronization: review
PRAGMA DESCRIPTION
#pragma omp for ordered
[clauses...] (loop region) #pragma
omp ordered structured_block
Used within a DO / for loop
Iterations of the enclosed loop will be executed in the
same order as if they were executed on a serial
processor. Threads will need to wait before executing
their chunk of iterations if previous iterations haven't
completed yet.
#pragma omp flush (list) Synchronization point at which all threads have the
same view of memory for all shared objects.
FLUSH is implied for
barrier
parallel - upon entry and exit
critical - upon entry and exit
ordered - upon entry and exit
for - upon exit
sections - upon exit
single - upon exit
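A minimal sketch of the ordered construct from the table above (assumed example): the loop body runs in parallel, but the printf inside the ordered block is executed in serial iteration order:

#include <stdio.h>
#include <omp.h>

int main(void)
{
    #pragma omp parallel for ordered
    for (int i = 0; i < 8; i++) {
        double x = (double)i * i;     /* runs in parallel, in any order */

        #pragma omp ordered           /* runs in the original loop order */
        printf("i=%d x=%f\n", i, x);
    }
    return 0;
}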
Performance in OPENMP
Programs
Performance in OPENMP programs
Parallelization might not change speed much, or it might even break the code!
Must understand application and use wisely
Performance of single threaded code
Percentage of code that is run in parallel and scalability
CPU utilization, effective data sharing, data locality and load balancing
Amount of synchronization and communication
Overhead to create, resume, manage, suspend, destroy and synchronize
threads
Memory conflicts due to shared memory or falsely shared memory
Performance limitations of shared resources, e.g. memory, bus bandwidth, CPU
execution units
Computing Efficiency of Parallel Code
Speedup = Tserial / Tparallel
Efficiency = Tserial / (P × Tparallel), where P = number of processors
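For example (hypothetical numbers): if Tserial = 80 s and Tparallel = 25 s on P = 4 processors,
Speedup = 80 / 25 = 3.2 and Efficiency = 3.2 / 4 = 0.8, i.e. 80%.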
Key Steps in Parallel Algorithms
• Divide a computation into smaller computations
• Assign them to different processors for parallel execution
• The process of dividing a computation into smaller parts, some or all of which
may potentially be executed in parallel, is called decomposition
• The number and size of tasks into which a problem is decomposed
determines the granularity of the decomposition.
• Decomposition into a large number of small tasks is called fine-grained and a
decomposition into a small number of large tasks is called coarse-grained
• Decomposition for matrix-vector multiplication is fine-grained because each
of a large number of tasks performs a single dot-product.
• A coarse-grained decomposition of the same problem would use 4 tasks, where
each task computes n/4 of the entries of the output vector of length n.
• The mechanism by which tasks are assigned to processes for execution is
called mapping.
Data decomposition & mapping to Processors
OR
Task decomposition & mapping to Processors
Objective
All tasks complete in the shortest amount of elapsed time
How to achieve the objective?
Reduce overheads:
Time spent in inter-process interaction/ Overheads of data sharing
between processes.
Time that some processes may spend being idle.
Uneven load distribution may cause some processes to finish earlier than
others.
Unfinished tasks mapped onto a process could be waiting for tasks
mapped onto other processes to finish in order to satisfy the constraints
imposed by the task-dependency graph.
Load Balancing: Gaussian Elimination
When eliminating a column, processors to the left of the current column are idle
Each processor is active for only part of the computation
Conversion of a Matrix into its Upper Triangular Equivalent
Simple Data Partitioning method for parallel processing –
1D vertical strip partitioning
Each process owns N/P columns of data
The shaded region in the figure represents the outstanding work in successive k iterations
Several Data & Task Decomposition and mapping techniques (Beyond
the scope of this talk)
Some simple techniques to avoid overheads.
Parallel Overhead
The amount of time required to coordinate parallel threads,
as opposed to doing useful work.
Thread start-up time
Synchronization
Software overhead imposed by parallel compilers, libraries,
tools, operating system, etc.
Thread termination time
Overheads of Parallel Directives
Overheads of Scheduling
Optimize the use of barriers
Prefer Atomic to Critical &
Avoid Large critical regions
Maximize Parallel Regions
Single parallel region enclosing all work sharing for loops.
Avoid Parallel Regions in inner loops
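A minimal sketch of both guidelines (assumed example: a simple time-stepped update). The first version recreates a parallel region inside the outer loop on every step; the second creates one enclosing parallel region and keeps only the worksharing loop inside it:

#include <omp.h>

/* Costly: a parallel region is created and destroyed in every time step */
void update_slow(double *a, int n, int steps)
{
    for (int t = 0; t < steps; t++) {
        #pragma omp parallel for
        for (int i = 0; i < n; i++)
            a[i] = 0.5 * a[i] + 1.0;
    }
}

/* Better: one enclosing parallel region; only the worksharing loop is inside */
void update_fast(double *a, int n, int steps)
{
    #pragma omp parallel
    for (int t = 0; t < steps; t++) {
        #pragma omp for
        for (int i = 0; i < n; i++)
            a[i] = 0.5 * a[i] + 1.0;
        /* implicit barrier of the 'for' keeps the time steps in order */
    }
}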
Caching issues in multicore performance
Memory
Architecture
Caching issues in multicore performance
Caching issues in multicore performance
An array A[N], where N is large. A[1] is accessed by one processor and A[2]
by another
False Sharing
False Sharing
False Sharing
When threads on different processors modify variables that reside on the same cache line.
This invalidates the cache line and forces a memory update to maintain cache coherency.
Potential false sharing on the array sum_local.
“place” data on different blocks OR Reduce block size
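A sketch of the sum_local pattern the slide refers to (variable names assumed from the slide; not necessarily the exact code behind it). Each thread updates only its own element, but neighbouring elements of sum_local sit on the same cache line, so the line ping-pongs between cores:

#include <omp.h>
#define NUM_THREADS 4

double dot(const double *a, const double *b, int n)
{
    double sum = 0.0;
    double sum_local[NUM_THREADS] = {0.0};     /* adjacent elements share a cache line */

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = omp_get_thread_num();

        #pragma omp for
        for (int i = 0; i < n; i++)
            sum_local[me] += a[i] * b[i];      /* repeated writes -> false sharing */

        #pragma omp atomic
        sum += sum_local[me];
    }
    return sum;
}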
False Sharing: Solution 1
Adding a schedule clause with chunksize that ensures that 2 threads
do not step over the same cache line
#pragma omp for schedule(static,chunkSize)
False Sharing: Solution 2
Array padding and memory alignment to reduce false sharing. This works
because successive Array elements are forced onto different cache lines, so
less (or no) cache line conflicts exist
sum_local[NUM_THREADS][cacheline];
sum_local[me][0]
Use compiler directives to force
individual variable alignment.
__declspec(align(n)) (n =64)
(64 byte boundary) to align the
individual variables on cache
line boundaries.
__declspec (align(64)) int
thread1_global_variable;
__declspec (align(64)) int
thread2_global_variable;
False Sharing: Solution 2
Array of data structures
• Pad the structure to the end of a
cache line to ensure that the array
elements begin on a cache line
boundary.
• If you cannot ensure that the array
is aligned on a cache line boundary,
pad the data structure to twice the
size of a cache line.
• If the array is dynamically allocated,
increase the allocation size and
adjust the pointer to align with a
cache line boundary.
Padding a data structure to a cache line boundary
Ensuring the array is also aligned using the
compiler
__declspec (align(n)) [ n = 64 (64 byte boundary)]
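__declspec(align(n)) is Microsoft-specific. A portable sketch of the same padding/alignment idea using C11 alignas, assuming a 64-byte cache line, so each per-thread partial sum occupies its own line:

#include <stdalign.h>
#include <omp.h>

#define NUM_THREADS 4
#define CACHE_LINE  64                         /* assumed cache-line size in bytes */

typedef struct {
    alignas(CACHE_LINE) double value;          /* struct size rounds up to one full line */
} padded_sum_t;

double dot_padded(const double *a, const double *b, int n)
{
    padded_sum_t sum_local[NUM_THREADS];
    double sum = 0.0;

    #pragma omp parallel num_threads(NUM_THREADS)
    {
        int me = omp_get_thread_num();
        sum_local[me].value = 0.0;

        #pragma omp for
        for (int i = 0; i < n; i++)
            sum_local[me].value += a[i] * b[i];   /* no cache line shared with neighbours */

        #pragma omp atomic
        sum += sum_local[me].value;
    }
    return sum;
}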
False Sharing: Solution 3
Use of private variables
Note: Shared data that is read-only in a loop does not lead to false sharing.
ThreadLocalSum
private(ThreadLocalSum)
ThreadLocalSum
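A minimal sketch of Solution 3: the running sum lives in a private variable (ThreadLocalSum), so each thread writes only to its own stack and touches shared memory once at the end:

#include <omp.h>

double dot_private(const double *a, const double *b, int n)
{
    double sum = 0.0;

    #pragma omp parallel
    {
        double ThreadLocalSum = 0.0;          /* private: lives on each thread's stack */

        #pragma omp for
        for (int i = 0; i < n; i++)
            ThreadLocalSum += a[i] * b[i];    /* no shared writes inside the loop */

        #pragma omp atomic                    /* one shared update per thread */
        sum += ThreadLocalSum;
    }
    return sum;
    /* Equivalent and simpler: #pragma omp parallel for reduction(+:sum) */
}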
One large global memory block – Shared
False Sharing?
Make sure each individual block starts and ends at a cache line
boundary
Separate blocks each local to its own core (i.e. private)
No false sharing but detailed code to identify where each private
block begins and ends.
False Sharing
Cache hits and misses
A cache could be KBs or MBs in size, but data is transferred in much
smaller units. The typical size of a cache line is 64 bytes.
When CPU asks for a value from the memory
If the value is already in the cache -> Cache Hit
Value is not in the cache, has to be fetched from the memory ->
Cache Miss
• Compulsory (cold start or process migration): first access to
a block in memory; impossible to avoid
• Capacity: Cache cannot hold all blocks accessed by the program
• Conflict (collision): – Multiple memory locations map to same
cache location
Cache hits and misses
Coherence Misses: Misses caused by coherence traffic with other processors
Also known as communication misses, because they represent data moving
between processors working together on a parallel program
For some parallel programs, coherence misses can dominate total misses
Spatial Locality
“If you need one memory address’s contents now, then you will probably also
need the contents of some of the memory locations around it soon.”
Temporal Locality
“If you need one memory address’s contents now, then you will probably also
need its contents again soon.”
Cache hits and misses
Sequential memory order
Jump in memory order
Cache hits and misses
Where Cache Coherence Really
Matters: Matrix Multiply
Code simplicity!
Blindly marches
through memory
(how does this
affect the
cache?)
This is a problem in
a C/C++ program
because B is not
accessed with unit stride
Where Cache
Coherence
Really Matters:
Matrix Multiply
Where Cache Coherence Really
Matters: Matrix Multiply
Loop order: i, k, j
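A sketch of the loop interchange the slides point at (assumed sizes): with the i, j, k order the inner loop walks B down a column (stride N, poor cache use); with i, k, j the innermost accesses to B and C are unit stride:

#include <omp.h>
#define N 512

/* Naive i, j, k order: B[k][j] is walked down a column (stride N) */
void matmul_ijk(double A[N][N], double B[N][N], double C[N][N])
{
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++)
                C[i][j] += A[i][k] * B[k][j];   /* C assumed zero-initialized */
}

/* i, k, j order: B and C are walked along rows (unit stride) */
void matmul_ikj(double A[N][N], double B[N][N], double C[N][N])
{
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++) {
            double aik = A[i][k];
            for (int j = 0; j < N; j++)
                C[i][j] += aik * B[k][j];       /* C assumed zero-initialized */
        }
}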
Block Dense Matrix Multiplication
Usually size of matrices (N) much larger than number of processors (p).
Divide the matrix into s² submatrices.
Each submatrix has N/s x N/s elements.
Block Dense Matrix Multiplication
Cache hits and misses
Cache misses can be calculated mathematically using the cache block and line sizes
and the sizes of the objects in the code.
Performance tools such as perf can also identify cache hits
and misses.
perf comes with Linux platforms.
$ perf stat -e task-clock,cycles,instructions,cache-references,cache-misses
./stream_c.exe
Exercise : Run perf on Matrix Vector multiplication code
Loop Unrolling
Loop Fusion
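The two titles above refer to standard loop transformations that also reduce per-iteration and scheduling overhead in OpenMP loops; a minimal sketch of each (assumed examples):

#include <omp.h>

/* Loop fusion: two loops over the same range become one,
   so each a[i] is reused while it is still in cache. */
void fused(double *a, double *b, const double *c, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        a[i] = 2.0 * a[i];        /* was loop 1 */
        b[i] = a[i] + c[i];       /* was loop 2 */
    }
}

/* Loop unrolling: the body is repeated to cut loop/branch overhead
   (n is assumed to be a multiple of 4, for brevity). */
void unrolled(double *a, int n)
{
    #pragma omp parallel for
    for (int i = 0; i < n; i += 4) {
        a[i]     *= 2.0;
        a[i + 1] *= 2.0;
        a[i + 2] *= 2.0;
        a[i + 3] *= 2.0;
    }
}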
OpenMP Parallel Programming
▪ Start with a parallelizable algorithm
Loop level parallelism /tasks
▪ Implement Serially : Optimized Serial Program
▪ Test, Debug & Time to solution
▪ Annotate the code with parallelization and Synchronization
directives
▪ Remove Race Conditions, False Sharing
▪ Test and Debug
▪ Measure speed-up (T-serial/T-parallel)