CSC 447: Parallel Programming for Multi-Core and Cluster Systems
Performance Analysis
Instructor: Haidar M. Harmanani
Spring 2020
Outline
§ Performance scalability
§ Analytical performance measures
§ Amdahl’s law and Gustafson-Barsis’ law
Performance
§ In computing, performance is defined by two factors
– Computational requirements (what needs to be done)
– Computing resources (what it costs to do it)
§ Computational problems translate to requirements
§ Computing resources interplay and tradeoff
Resources: Time, Energy, Hardware … and ultimately Money
Performance ~ 1 / (Resources for solution)
Measuring Performance
§ Performance itself is a measure of how well the computational
requirements can be satisfied
§ We evaluate performance to understand the relationships
between requirements and resources
– Decide how to change “solutions” to target objectives
§ Performance measures reflect decisions about how and how
well “solutions” are able to satisfy the computational
requirements
§ When measuring performance, it is important to understand
exactly what you are measuring and how you are measuring it
Scalability
§ A program can scale up to use many processors
– What does that mean?
§ How do you evaluate scalability?
§ How do you evaluate scalability goodness?
§ Comparative evaluation
– If we double the number of processors, what should we expect?
– Is scalability linear?
§ Use parallel efficiency measure
– Is efficiency retained as problem size increases?
§ Apply performance metrics
Performance and Scalability
§ Evaluation
– Sequential runtime (Tseq) is a function of problem size and
architecture
– Parallel runtime (Tpar) is a function of problem size and parallel
architecture and the number of processors used in the execution
– Parallel performance affected by algorithm + architecture
§ Scalability
– Ability of parallel algorithm to achieve performance gains
proportional to the number of processors and the size of the
problem
Performance Metrics and Formulas
§ T1 is the execution time on a single processor
§ Tp is the execution time on a p processor system
§ S(p) (Sp) is the speedup
§ E(p) (Ep) is the efficiency
§ Cost(p) (Cp) is the cost
§ Parallel algorithm is cost-optimal
– Parallel cost equals sequential time (Cp = p × Tp = T1, i.e., Ep = 100%)
S(p) = T1 / Tp
Efficiency = Sp / p
Cost = p × Tp
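For concreteness, here is a minimal Python sketch of these three metrics (the function name and example numbers are illustrative, not from the slides):

```python
def parallel_metrics(t1, tp, p):
    """Return speedup, efficiency, and cost given T1, Tp, and p."""
    speedup = t1 / tp          # S(p) = T1 / Tp
    efficiency = speedup / p   # E(p) = S(p) / p
    cost = p * tp              # Cost(p) = p * Tp
    return speedup, efficiency, cost

# Example: 100 s serially, 30 s on 4 processors
s, e, c = parallel_metrics(100.0, 30.0, 4)
print(f"S = {s:.2f}, E = {e:.0%}, Cost = {c:.0f} processor-seconds")
# S = 3.33, E = 83%, Cost = 120 processor-seconds
```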
Speed-Up
§ Provides a measure of application performance with respect to a given program and platform
§ Can also be cast in terms of computational steps
o Can extend time complexity to parallel computations
§ Use the fastest known sequential algorithm for running on
a single processor
What is a “good” speedup?
§ Hopefully, S(n) > 1
§ Linear speedup:
– S(n) = n
– Parallel program considered perfectly scalable
§ Superlinear speedup:
– S(n) > n
– Can this happen?
Defining Speed-Up
§ We need more information to evaluate speedup:
– What problem size? Worst case time? Average case time?
– What do we count as work?
o Parallel computation, communication, overhead?
– What serial algorithm and what machine should we use for the
numerator?
o Can the algorithms used for the numerator and the denominator be different?
Common Definitions of Speed-Up
§ Common definitions of Speedup:
– Serial machine is one processor of parallel machine and serial algorithm is
interleaved version of parallel algorithm
– Serial algorithm is fastest known serial algorithm for running on a serial processor
– Serial algorithm is the fastest known serial algorithm running on one processor of the parallel machine
S(n) = T(1) / T(n)
S(n) = Ts / T(n)
S(n) = T'(1) / T(n)
Can speedup be superlinear?
§ Speedup CANNOT be superlinear:
– Let M be a parallel machine with n processors
– Let T(X) be the time it takes to solve a problem on M with X
processors
– Speedup definition:
o Suppose a parallel algorithm A solves an instance I of a problem in t time units
§ Then A can solve the same problem in n × t units of time on M through time slicing
§ The best serial time for I will be no bigger than n × t
§ Hence speedup cannot be greater than n.
S(n) = T(1) / T(n) ≤ (n × t) / t = n
Can speedup be superlinear?
§ Speedup CAN be superlinear:
– Let M be a parallel machine with n processors
– Let T(X) be the time it takes to solve a problem on M with X processors
– Speedup definition:
– Serial version of the algorithm may involve more overhead than the parallel
version of the algorithm
o E.g. A=B+C on a SIMD machine with A,B,C matrices vs. loop overhead on a serial machine
– Hardware characteristics may favor parallel algorithm
o E.g. if all data can be decomposed in main memories of parallel processors vs. needing
secondary storage on serial processor to retain all data
– “work” may be counted differently in serial and parallel algorithms
S(n) = Ts / T(n)
Speedup Factor
§ Maximum speedup is usually n with n processors (linear
speedup).
§ Possible to get superlinear speedup (greater than n) but
usually a specific reason such as:
– Extra memory in multiprocessor system
– Nondeterministic algorithm
Maximum Speedup: Amdahl’s law
§ f = fraction of program (algorithm) that is serial and cannot be parallelized
– Data setup
– Reading/writing to a single disk file
§ Speedup factor is given by:
Ts = f Ts + (1 − f) Ts
Tp = f Ts + (1 − f) Ts / n
S(n) = Ts / (f Ts + (1 − f) Ts / n) = n / (1 + (n − 1) f)
lim (n → ∞) S(n) = 1 / f
The above equation is known as Amdahl’s Law.
Note that as n → ∞, the maximum speedup is limited to 1/f.
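A small Python sketch of Amdahl’s bound (the function name and checks are mine); it reproduces the 4%-serial, 16-processor example worked below:

```python
def amdahl_speedup(f, n):
    """Amdahl's law: speedup on n processors when a fraction f is serial."""
    return n / (1 + (n - 1) * f)

# 4% serial code on 16 processors (worked as Example 1 below)
print(amdahl_speedup(0.04, 16))   # 10.0
# Maximum speedup as n -> infinity is 1/f
print(1 / 0.04)                   # 25.0
```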
Bounds on Speedup
[Figure: (a) on one processor the execution time ts consists of a serial section f·ts plus parallelizable sections (1 − f)·ts; (b) on p processors the parallelizable part takes (1 − f)·ts / p, giving parallel time tp]
Speedup Against Number of
Processors
§ Even with an infinite number of processors, the maximum speedup is limited to 1/f.
§ Example: With only 5% of
computation being serial,
maximum speedup is 20,
irrespective of number of
processors.
[Figure: speedup vs. number of processors p (axes from 4 to 20) for serial fractions f = 0%, 5%, 10%, and 20%]
Example of Amdahl’s Law (1)
§ Suppose that a calculation has a 4% serial portion. What is the speedup limit on 16 processors?
– 16/(1 + (16 – 1)*.04) = 10
– What is the maximum speedup?
o 1/0.04 = 25
Example of Amdahl’s Law (2)
§ 95% of a program’s execution time occurs inside a loop
that can be executed in parallel. What is the maximum
speedup we should expect from a parallel version of the
program executing on 8 CPUs?
ψ ≤ 1 / (0.05 + (1 − 0.05) / 8) ≅ 5.9
Example of Amdahl’s Law (3)
§ 20% of a program’s execution time is spent within
inherently sequential code. What is the limit to the
speedup achievable by a parallel version of the program?
lim (p → ∞) 1 / (0.2 + (1 − 0.2) / p) = 1 / 0.2 = 5
Illustration of Amdahl Effect
[Figure: speedup vs. number of processors for problem sizes n = 100, n = 1,000, and n = 10,000; larger problem sizes sustain higher speedup]
Amdahl’s Law and Scalability
§ Scalability
– Ability of parallel algorithm to achieve performance gains proportional
to the number of processors and the size of the problem
§ When does Amdahl’s Law apply?
– When the problem size is fixed
– Strong scaling (as p → ∞, Sp = S∞ → 1/f)
– Speedup bound is determined by the degree of sequential execution
time in the computation, not # processors!!!
– Perfect efficiency is hard to achieve
§ See original paper by Amdahl on course webpage
Variants of Speedup: Efficiency
§ Efficiency: E(n) = S(n)/n * 100%
§ Efficiency measures the fraction of time that processors
are being used on the computation.
– A program with linear speedup is 100% efficient.
§ Using efficiency:
– A program attains 89% efficiency with a serial fraction of 2%.
Approximately how many processors are being used according to
Amdahl’s law?
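One way to work this exercise (a sketch of mine, assuming Amdahl’s model, under which E(p) = 1 / (1 + (p − 1)·f) and therefore p = 1 + (1/E − 1)/f):

```python
def processors_for_efficiency(eff, f):
    """Under Amdahl's law, E(p) = 1 / (1 + (p - 1) * f); solve for p."""
    return 1 + (1 / eff - 1) / f

print(round(processors_for_efficiency(0.89, 0.02), 1))   # 7.2, i.e. about 7 processors
```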
Efficiency
Efficiency = Sequential execution time / (Processors used × Parallel execution time)
= Speedup / Processors used
Limitations of Speedup
§ Conventional notions of speedup don't always provide a reasonable
measure of performance
§ Questionable assumptions:
– "work" in conventional definitions of speedup is defined by operation count
o communication more expensive than computation on current high-performance computers
– best serial algorithm defines the least work necessary
o for some languages on some machines, serial algorithm may do more work -- (loop operations
vs. data parallel for example)
– good performance for many users involves fast time on a sufficiently large
problem; faster time on a smaller problem (better speedup) is less interesting
– traditional speedup measures assume a "flat memory approximation”, i.e. all
memory accesses take the same amount of time
“Flat Memory Approximation”
§ “Flat memory Approximation” – all accesses to memory
take the same amount of time
– in practice, accesses to information in cache, main memory and
peripheral memory take very different amounts of time.
[Figure: time per memory access vs. amount of data accessed: flat while fully cached, then increasing through main memory and virtual memory]
Another Perspective
§ We often use faster computers to solve larger problem
instances
§ Let’s treat time as a constant and allow problem size to
increase with number of processors
Limitations of Speedup
§ Gustafson challenged Amdahl's assumption that the proportion of a
program given to serial computations (f) and the proportion of a
program given to parallel computations remains the same over all
problem sizes.
– For example, if the serial part is a loop initialization and it can be executed in
parallel over the size of the input list, then the serial initialization becomes a
smaller proportion of the overall calculation as the problem size grows larger.
§ Gustafson defined two “more relevant” notions of speedup
– Scaled speedup
– Fixed-time speedup
o (usual version he called fixed-size speedup)
Gustafson-Barsis’ Law
§ Begin with parallel execution time
§ Estimate sequential execution time to solve same problem
§ Problem size is an increasing function of p
§ Predicts scaled speedup
Gustafson’s Law
Amdahl’s law: fix execution time on a single processor
◦ s + p = serial part + parallelizable part = 1 (normalized serial time)
◦ (s = same as f previously)
◦ Assume problem fits in memory of serial computer
◦ Fixed-size speedup:
S_fixed-size = (s + p) / (s + p/n) = 1 / (s + (1 − s)/n)
Gustafson’s Law: fix execution time on a parallel computer (multiple processors)
◦ s + p = serial part + parallelizable part = 1 (normalized parallel time)
◦ s + n·p = serial time on a single processor
◦ Assume problem fits in memory of parallel computer
◦ Scaled speedup:
S_scaled = (s + n·p) / (s + p) = n + (1 − n)·s
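A minimal Python sketch of the scaled-speedup formula (names are mine); it reproduces Example 1 below (10 processors, 3% serial time):

```python
def scaled_speedup(s, n):
    """Gustafson's law: scaled speedup on n processors, where s is the
    serial fraction of the (normalized) parallel execution time."""
    return n + (1 - n) * s

# Example 1 below: 10 processors, 3% serial time
print(round(scaled_speedup(0.03, 10), 2))   # 9.73
```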
Scaled Speedup
§ Scaling implies that problem size can increase with number of
processors
– Gustafson’s law gives measure of how much
§ Scaled speedup is derived by fixing the parallel execution time
– Amdahl fixed the problem size → this fixes the serial execution time
– Amdahl’s law may be too conservative for high-performance computing.
§ Interesting consequence of scaled speedup: there is no bound to speedup as n → ∞; speedup can easily become superlinear!
§ In practice, unbounded scalability is unrealistic as quality of answer
will reach a point where no further increase in problem size may be
justified
Meaning of Scalability Function
§ To maintain efficiency when increasing p, we must
increase n
§ Maximum problem size limited by available memory,
which is linear in p
§ Scalability function shows how memory usage per
processor must grow to maintain efficiency
§ A constant scalability function means the parallel system is perfectly scalable
Interpreting Scalability Function
[Figure: memory needed per processor vs. number of processors for scalability functions C, C log p, C p, and C p log p; curves at or below the memory-size line can maintain efficiency, curves above it cannot]
Gustafson-Barsis’ Law and Scalability
§ Scalability
– Ability of parallel algorithm to achieve performance gains proportional
to the number of processors and the size of the problem
§ When does Gustafson’s Law apply?
– When the problem size can increase as the number of processors
increases
– Weak scaling (Sp = 1 + (p-1)fpar )
– Speedup function includes the number of processors!!!
– Can maintain or increase parallel efficiency as the problem scales
§ See original paper by Gustafson on course webpage
Using Gustafson’s Law
§ Given a scaled speedup of 20 on 32 processors, what is the
serial fraction from Amdahl’s law? What is the serial
fraction from Gustafson’s Law?
S_scaled = (s + n·p) / (s + p) = n + (1 − n)·s
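One way to work this exercise (my own sketch; it simply inverts each law for the serial fraction, which the slide leaves open):

```python
def serial_fraction_gustafson(speedup, n):
    """Invert S_scaled = n + (1 - n) * s for the serial fraction s."""
    return (n - speedup) / (n - 1)

def serial_fraction_amdahl(speedup, n):
    """Invert S = n / (1 + (n - 1) * f) for the serial fraction f."""
    return (n / speedup - 1) / (n - 1)

print(round(serial_fraction_gustafson(20, 32), 3))   # 0.387
print(round(serial_fraction_amdahl(20, 32), 3))      # 0.019
```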
Example 1
§ An application running on 10 processors spends 3% of its
time in serial code. What is the scaled speedup of the
application?
Execution on 1 CPU takes 10 times as long…
…except 9 do not have to execute serial code
ψ = 10 + (1 − 10)(0.03) = 10 − 0.27 = 9.73
Example 2
§ What is the maximum fraction of a program’s parallel
execution time that can be spent in serial code if it is to
achieve a scaled speedup of 7 on 8 processors?
7 = 8 + (1− 8)s ⇒ s ≈ 0.14
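As a quick check, the same inversion used above, s = (n - S)/(n - 1), gives:

```python
# s = (n - S)/(n - 1) with S = 7, n = 8
print(round((8 - 7) / (8 - 1), 2))   # 0.14
```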
Why Are Parallel Applications Not Scalable?
Critical Paths
◦ Dependencies between computations spread
across processors
Bottlenecks
◦ One processor holds things up
Algorithmic overhead
◦ Some things just take more effort to do in
parallel
Communication overhead
◦ Spending increasing proportion of time on
communication
Load Imbalance
◦ Makes all processors wait for the “slowest” one
◦ Dynamic behavior
Speculative loss
◦ Do A and B in parallel, but B is ultimately not
needed
Critical Paths
§ Long chain of dependence
– Main limitation on performance
– Resistance to performance improvement
§ Diagnostic
– Performance stagnates to a (relatively) fixed value
– Critical path analysis
§ Solution
– Eliminate long chains if possible
– Shorten chains by removing work from critical path
Bottlenecks
§ How to detect?
– One processor A is busy while others wait
– Data dependency on the result produced by A
§ Typical situations:
– N-to-1 reduction / computation / 1-to-N broadcast
– One processor assigning job in response to requests
§ Solution techniques:
– More efficient communication
– Hierarchical schemes for master-slave
§ Program may not show ill effects for a long time
§ Shows up when scaling
Algorithmic Overhead
§ Different sequential algorithms to solve the same problem
§ All parallel algorithms are sequential when run on 1 processor
§ All parallel algorithms introduce additional operations
– Parallel overhead
§ What should be the starting point for a parallel algorithm?
– Best sequential algorithm might not parallelize at all
– Or, it does not parallelize well (e.g., not scalable)
§ What to do?
– Choose algorithmic variants that minimize overhead
– Use two-level algorithms
§ Performance is the rub
– Are you achieving better parallel performance?
– Must compare with the best sequential algorithm
What is the maximum parallelism
possible?
§ Depends on application,
algorithm, program
– Data dependencies in execution
– Parallelism varies!
[Figure: parallelism profile (“parallel signature”) of a 512-point FFT]
Embarrassingly Parallel Computations
§ An embarrassingly parallel computation is one that can be obviously
divided into completely independent parts that can be executed
simultaneously
– In a truly embarrassingly parallel computation there is no interaction between
separate processes
– In a nearly embarrassingly parallel computation results must be distributed
and collected/combined in some way
§ Embarrassingly parallel computations have potential to achieve
maximal speedup on parallel platforms
– If it takes T time sequentially, there is the potential to achieve T/P time
running in parallel with P processors
– What might prevent this from always being the case?
Embarrassingly Parallel Computations
[Figure: the input data is divided among processes, each of which produces its part of the results independently]
No or very little communication between processes
Each process can do its tasks without any interaction with other processes
Examples
◦ Numerical integration
◦ Mandelbrot set
◦ Monte Carlo methods
Calculating π with Monte Carlo
Consider a circle of unit radius
Place the circle inside a square box with side of 2 in
The ratio of the circle area to the square area is:
(π × 1 × 1) / (2 × 2) = π / 4
Monte Carlo Calculation of π
§ Randomly choose a number of points in the square
§ For each point p, determine if p is inside the circle
§ The ratio of points in the circle to points in the square will give an approximation of π/4
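As an illustration, here is a minimal sequential Python sketch of this estimator; the names and structure are mine, not from the lecture. Each worker in a parallel version could run it on its own share of points with a different seed, and only the final counts need to be combined, which is why the computation is nearly embarrassingly parallel:

```python
import random

def estimate_pi(num_points, seed=0):
    """Estimate pi by sampling points uniformly in the 2x2 square centered
    at the origin and counting how many fall inside the unit circle."""
    rng = random.Random(seed)
    inside = 0
    for _ in range(num_points):
        x = rng.uniform(-1.0, 1.0)
        y = rng.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            inside += 1
    return 4.0 * inside / num_points

print(estimate_pi(1_000_000))   # close to 3.1416 for large sample counts
```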
Using Programs to Measure Machine
Performance
§ Speedup measures performance of an individual program
on a particular machine
– Speedup cannot be used to
o Compare different algorithms on the same computer
o Compare the same algorithm on different computers
§ Benchmarks are representative programs which can be
used to compare performance of machines
Benchmarks used for Parallel
Machines
§ The Perfect Club
§ The Livermore Loops
§ The NAS Parallel Benchmarks
§ The SPEC Benchmarks
§ The “PACKS” (Linpack, LAPACK, ScaLAPACK, etc.)
§ ParkBENCH
§ SLALOM, HINT
Limitations and Pitfalls of
Benchmarks
§ Benchmarks cannot address questions you did not ask
§ Specific application benchmarks will not tell you about
the performance of other applications without proper
analysis
§ General benchmarks will not tell you all the details about
the performance of your specific application
§ One should understand the benchmark itself to
understand what it tells us
Benefits of Benchmarks
§ Popular benchmarks keep vendors attuned to
applications
§ Benchmarks can give useful information about the
performance of systems on particular kinds of programs
§ Benchmarks help in exposing performance bottlenecks of
systems at the technical and applications level
Ad

More Related Content

What's hot (20)

顔画像からの個人顔識別
顔画像からの個人顔識別顔画像からの個人顔識別
顔画像からの個人顔識別
epcnt19
 
Instruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics Trade
Instruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics TradeInstruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics Trade
Instruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics Trade
Optics-Trade
 
自然科学の統計学2.2 slideshare
自然科学の統計学2.2 slideshare自然科学の統計学2.2 slideshare
自然科学の統計学2.2 slideshare
wada, kazumi
 
Digital communication
Digital communicationDigital communication
Digital communication
meashi
 
【技術解説20】 ミニバッチ確率的勾配降下法
【技術解説20】 ミニバッチ確率的勾配降下法【技術解説20】 ミニバッチ確率的勾配降下法
【技術解説20】 ミニバッチ確率的勾配降下法
Ruo Ando
 
Data Communication & Computer network: Shanon fano coding
Data Communication & Computer network: Shanon fano codingData Communication & Computer network: Shanon fano coding
Data Communication & Computer network: Shanon fano coding
Dr Rajiv Srivastava
 
Mini Project Communication Link Simulation Digital Modulation Techniques Lec...
Mini Project Communication Link Simulation  Digital Modulation Techniques Lec...Mini Project Communication Link Simulation  Digital Modulation Techniques Lec...
Mini Project Communication Link Simulation Digital Modulation Techniques Lec...
University of Hertfordshire, School of Electronic Communications and Electrical Engineering
 
Amplitude modulation
Amplitude modulationAmplitude modulation
Amplitude modulation
avocado1111
 
APCS 程式設計實作題 物品堆疊
APCS 程式設計實作題 物品堆疊APCS 程式設計實作題 物品堆疊
APCS 程式設計實作題 物品堆疊
艾鍗科技
 
(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...
(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...
(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...
Yoshiki Yamamoto
 
Ee6378 bandgap reference
Ee6378 bandgap referenceEe6378 bandgap reference
Ee6378 bandgap reference
ssuser2038c9
 
Signals and systems analysis using transform methods and matlab 3rd edition r...
Signals and systems analysis using transform methods and matlab 3rd edition r...Signals and systems analysis using transform methods and matlab 3rd edition r...
Signals and systems analysis using transform methods and matlab 3rd edition r...
Adelaide789
 
220206 transformer interpretability beyond attention visualization
220206 transformer interpretability beyond attention visualization220206 transformer interpretability beyond attention visualization
220206 transformer interpretability beyond attention visualization
taeseon ryu
 
Quadrature Amplitude Modulation. QAM Transmitter.ppt
Quadrature Amplitude Modulation. QAM Transmitter.pptQuadrature Amplitude Modulation. QAM Transmitter.ppt
Quadrature Amplitude Modulation. QAM Transmitter.ppt
Stefan Oprea
 
0136023126 ism 06(1)
0136023126 ism 06(1)0136023126 ism 06(1)
0136023126 ism 06(1)
aryaanuj1
 
Ece523 folded cascode design
Ece523 folded cascode designEce523 folded cascode design
Ece523 folded cascode design
Karthik Rathinavel
 
BDD/ZDDを用いたグラフ列挙索引化技法
BDD/ZDDを用いたグラフ列挙索引化技法BDD/ZDDを用いたグラフ列挙索引化技法
BDD/ZDDを用いたグラフ列挙索引化技法
soranoana dev
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
Linaro
 
第2回勉強会スライド
第2回勉強会スライド第2回勉強会スライド
第2回勉強会スライド
koturn 0;
 
Digital to Analog Conversion
Digital to Analog ConversionDigital to Analog Conversion
Digital to Analog Conversion
Soumen Santra
 
顔画像からの個人顔識別
顔画像からの個人顔識別顔画像からの個人顔識別
顔画像からの個人顔識別
epcnt19
 
Instruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics Trade
Instruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics TradeInstruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics Trade
Instruction Manual | Burris Oracle X Rangefinder Crossbow Scope | Optics Trade
Optics-Trade
 
自然科学の統計学2.2 slideshare
自然科学の統計学2.2 slideshare自然科学の統計学2.2 slideshare
自然科学の統計学2.2 slideshare
wada, kazumi
 
Digital communication
Digital communicationDigital communication
Digital communication
meashi
 
【技術解説20】 ミニバッチ確率的勾配降下法
【技術解説20】 ミニバッチ確率的勾配降下法【技術解説20】 ミニバッチ確率的勾配降下法
【技術解説20】 ミニバッチ確率的勾配降下法
Ruo Ando
 
Data Communication & Computer network: Shanon fano coding
Data Communication & Computer network: Shanon fano codingData Communication & Computer network: Shanon fano coding
Data Communication & Computer network: Shanon fano coding
Dr Rajiv Srivastava
 
Amplitude modulation
Amplitude modulationAmplitude modulation
Amplitude modulation
avocado1111
 
APCS 程式設計實作題 物品堆疊
APCS 程式設計實作題 物品堆疊APCS 程式設計實作題 物品堆疊
APCS 程式設計實作題 物品堆疊
艾鍗科技
 
(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...
(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...
(ICML2020 K.Kato et al. fujitsu) Rate distortion optimization guided autoenco...
Yoshiki Yamamoto
 
Ee6378 bandgap reference
Ee6378 bandgap referenceEe6378 bandgap reference
Ee6378 bandgap reference
ssuser2038c9
 
Signals and systems analysis using transform methods and matlab 3rd edition r...
Signals and systems analysis using transform methods and matlab 3rd edition r...Signals and systems analysis using transform methods and matlab 3rd edition r...
Signals and systems analysis using transform methods and matlab 3rd edition r...
Adelaide789
 
220206 transformer interpretability beyond attention visualization
220206 transformer interpretability beyond attention visualization220206 transformer interpretability beyond attention visualization
220206 transformer interpretability beyond attention visualization
taeseon ryu
 
Quadrature Amplitude Modulation. QAM Transmitter.ppt
Quadrature Amplitude Modulation. QAM Transmitter.pptQuadrature Amplitude Modulation. QAM Transmitter.ppt
Quadrature Amplitude Modulation. QAM Transmitter.ppt
Stefan Oprea
 
0136023126 ism 06(1)
0136023126 ism 06(1)0136023126 ism 06(1)
0136023126 ism 06(1)
aryaanuj1
 
BDD/ZDDを用いたグラフ列挙索引化技法
BDD/ZDDを用いたグラフ列挙索引化技法BDD/ZDDを用いたグラフ列挙索引化技法
BDD/ZDDを用いたグラフ列挙索引化技法
soranoana dev
 
Q4.11: NEON Intrinsics
Q4.11: NEON IntrinsicsQ4.11: NEON Intrinsics
Q4.11: NEON Intrinsics
Linaro
 
第2回勉強会スライド
第2回勉強会スライド第2回勉強会スライド
第2回勉強会スライド
koturn 0;
 
Digital to Analog Conversion
Digital to Analog ConversionDigital to Analog Conversion
Digital to Analog Conversion
Soumen Santra
 

Similar to Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis (20)

ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
johnsmith96441
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
EUDAT
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
Mr SMAK
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE
 
3. Potential Benefits, Limits and Costs of Parallel Programming.pdf
3. Potential Benefits, Limits and Costs of Parallel Programming.pdf3. Potential Benefits, Limits and Costs of Parallel Programming.pdf
3. Potential Benefits, Limits and Costs of Parallel Programming.pdf
MohamedAymen14
 
Parallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and DisadvantagesParallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and Disadvantages
Murtadha Alsabbagh
 
Unit 1.2 Parallel Programming in HPC.pptx
Unit 1.2 Parallel Programming in HPC.pptxUnit 1.2 Parallel Programming in HPC.pptx
Unit 1.2 Parallel Programming in HPC.pptx
sayalee7
 
comp422-534-2020-Lecture2-ConcurrencyDecomposition.pdf
comp422-534-2020-Lecture2-ConcurrencyDecomposition.pdfcomp422-534-2020-Lecture2-ConcurrencyDecomposition.pdf
comp422-534-2020-Lecture2-ConcurrencyDecomposition.pdf
AliMohaghegh8
 
High Performance Engineering - 01-intro.pdf
High Performance Engineering - 01-intro.pdfHigh Performance Engineering - 01-intro.pdf
High Performance Engineering - 01-intro.pdf
ss63261
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
Shah Zaib
 
Evaluation of morden computer & system attributes in ACA
Evaluation of morden computer &  system attributes in ACAEvaluation of morden computer &  system attributes in ACA
Evaluation of morden computer & system attributes in ACA
Pankaj Kumar Jain
 
A Study on Task Scheduling in Could Data Centers for Energy Efficacy
A Study on Task Scheduling in Could Data Centers for Energy Efficacy A Study on Task Scheduling in Could Data Centers for Energy Efficacy
A Study on Task Scheduling in Could Data Centers for Energy Efficacy
Ehsan Sharifi
 
1.1 Introduction.pptx about the design thinking of the engineering students
1.1 Introduction.pptx about the design thinking of the engineering students1.1 Introduction.pptx about the design thinking of the engineering students
1.1 Introduction.pptx about the design thinking of the engineering students
HrushikeshDandu
 
Performance measures
Performance measuresPerformance measures
Performance measures
Divya Tiwari
 
Computer Organization Design ch2Slides.ppt
Computer Organization Design ch2Slides.pptComputer Organization Design ch2Slides.ppt
Computer Organization Design ch2Slides.ppt
rajesshs31r
 
Distributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz SikanderDistributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz Sikander
rogerz1234567
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET Journal
 
Auslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack FreeAuslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack Free
shan05kpa
 
Auslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack FreeAuslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack Free
mohsinrazakpa92
 
Wondershare Recoverit 13.5.12.11 Free Download
Wondershare Recoverit 13.5.12.11 Free DownloadWondershare Recoverit 13.5.12.11 Free Download
Wondershare Recoverit 13.5.12.11 Free Download
alihamzakpa086
 
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptxICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
ICS 2410.Parallel.Sytsems.Lecture.Week 3.week5.pptx
johnsmith96441
 
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
High Performance & High Throughput Computing - EUDAT Summer School (Giuseppe ...
EUDAT
 
Lecture 3
Lecture 3Lecture 3
Lecture 3
Mr SMAK
 
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurationsbyteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE's expertise across NVIDIA architectures and configurations
byteLAKE
 
3. Potential Benefits, Limits and Costs of Parallel Programming.pdf
3. Potential Benefits, Limits and Costs of Parallel Programming.pdf3. Potential Benefits, Limits and Costs of Parallel Programming.pdf
3. Potential Benefits, Limits and Costs of Parallel Programming.pdf
MohamedAymen14
 
Parallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and DisadvantagesParallel Algorithms Advantages and Disadvantages
Parallel Algorithms Advantages and Disadvantages
Murtadha Alsabbagh
 
Unit 1.2 Parallel Programming in HPC.pptx
Unit 1.2 Parallel Programming in HPC.pptxUnit 1.2 Parallel Programming in HPC.pptx
Unit 1.2 Parallel Programming in HPC.pptx
sayalee7
 
comp422-534-2020-Lecture2-ConcurrencyDecomposition.pdf
comp422-534-2020-Lecture2-ConcurrencyDecomposition.pdfcomp422-534-2020-Lecture2-ConcurrencyDecomposition.pdf
comp422-534-2020-Lecture2-ConcurrencyDecomposition.pdf
AliMohaghegh8
 
High Performance Engineering - 01-intro.pdf
High Performance Engineering - 01-intro.pdfHigh Performance Engineering - 01-intro.pdf
High Performance Engineering - 01-intro.pdf
ss63261
 
Parallel Computing - Lec 6
Parallel Computing - Lec 6Parallel Computing - Lec 6
Parallel Computing - Lec 6
Shah Zaib
 
Evaluation of morden computer & system attributes in ACA
Evaluation of morden computer &  system attributes in ACAEvaluation of morden computer &  system attributes in ACA
Evaluation of morden computer & system attributes in ACA
Pankaj Kumar Jain
 
A Study on Task Scheduling in Could Data Centers for Energy Efficacy
A Study on Task Scheduling in Could Data Centers for Energy Efficacy A Study on Task Scheduling in Could Data Centers for Energy Efficacy
A Study on Task Scheduling in Could Data Centers for Energy Efficacy
Ehsan Sharifi
 
1.1 Introduction.pptx about the design thinking of the engineering students
1.1 Introduction.pptx about the design thinking of the engineering students1.1 Introduction.pptx about the design thinking of the engineering students
1.1 Introduction.pptx about the design thinking of the engineering students
HrushikeshDandu
 
Performance measures
Performance measuresPerformance measures
Performance measures
Divya Tiwari
 
Computer Organization Design ch2Slides.ppt
Computer Organization Design ch2Slides.pptComputer Organization Design ch2Slides.ppt
Computer Organization Design ch2Slides.ppt
rajesshs31r
 
Distributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz SikanderDistributed Convex Optimization Thesis - Behroz Sikander
Distributed Convex Optimization Thesis - Behroz Sikander
rogerz1234567
 
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CLIRJET- Latin Square Computation of Order-3 using Open CL
IRJET- Latin Square Computation of Order-3 using Open CL
IRJET Journal
 
Auslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack FreeAuslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack Free
shan05kpa
 
Auslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack FreeAuslogics Video Grabber 1.0.0.7 Crack Free
Auslogics Video Grabber 1.0.0.7 Crack Free
mohsinrazakpa92
 
Wondershare Recoverit 13.5.12.11 Free Download
Wondershare Recoverit 13.5.12.11 Free DownloadWondershare Recoverit 13.5.12.11 Free Download
Wondershare Recoverit 13.5.12.11 Free Download
alihamzakpa086
 
Ad

More from Shah Zaib (9)

MPI - 4
MPI - 4MPI - 4
MPI - 4
Shah Zaib
 
MPI - 3
MPI - 3MPI - 3
MPI - 3
Shah Zaib
 
MPI - 2
MPI - 2MPI - 2
MPI - 2
Shah Zaib
 
MPI - 1
MPI - 1MPI - 1
MPI - 1
Shah Zaib
 
Parallel Computing - Lec 5
Parallel Computing - Lec 5Parallel Computing - Lec 5
Parallel Computing - Lec 5
Shah Zaib
 
Parallel Computing - Lec 4
Parallel Computing - Lec 4Parallel Computing - Lec 4
Parallel Computing - Lec 4
Shah Zaib
 
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3
Shah Zaib
 
Parallel Computing - Lec 2
Parallel Computing - Lec 2Parallel Computing - Lec 2
Parallel Computing - Lec 2
Shah Zaib
 
Mpi collective communication operations
Mpi collective communication operationsMpi collective communication operations
Mpi collective communication operations
Shah Zaib
 
Parallel Computing - Lec 5
Parallel Computing - Lec 5Parallel Computing - Lec 5
Parallel Computing - Lec 5
Shah Zaib
 
Parallel Computing - Lec 4
Parallel Computing - Lec 4Parallel Computing - Lec 4
Parallel Computing - Lec 4
Shah Zaib
 
Parallel Computing - Lec 3
Parallel Computing - Lec 3Parallel Computing - Lec 3
Parallel Computing - Lec 3
Shah Zaib
 
Parallel Computing - Lec 2
Parallel Computing - Lec 2Parallel Computing - Lec 2
Parallel Computing - Lec 2
Shah Zaib
 
Mpi collective communication operations
Mpi collective communication operationsMpi collective communication operations
Mpi collective communication operations
Shah Zaib
 
Ad

Recently uploaded (15)

Intro to Windows Presentation for CSS NC-2.pptx
Intro to Windows Presentation for CSS NC-2.pptxIntro to Windows Presentation for CSS NC-2.pptx
Intro to Windows Presentation for CSS NC-2.pptx
HelenAvila17
 
AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]
AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]
AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]
FarhanSEO
 
Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...
Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...
Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...
ChaudharyBharatDagur
 
ENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdf
ENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdfENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdf
ENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdf
edget1
 
TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...
TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...
TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...
Olimex Bulgaria
 
ideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdf
ideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdfideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdf
ideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdf
MichaelDexterBalanta1
 
DU Meter Crack 8.01 + Serial Key Download [Latest 2025]
DU Meter Crack 8.01 + Serial Key Download [Latest 2025]DU Meter Crack 8.01 + Serial Key Download [Latest 2025]
DU Meter Crack 8.01 + Serial Key Download [Latest 2025]
FarhanSEO
 
Concavity_Presentation_Updated.pptx rana
Concavity_Presentation_Updated.pptx ranaConcavity_Presentation_Updated.pptx rana
Concavity_Presentation_Updated.pptx rana
ranamumtaz383
 
Parmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptx
Parmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptxParmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptx
Parmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptx
rahulrajbanshi981052
 
Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]
Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]
Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]
FarhanSEO
 
chapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdf
chapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdfchapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdf
chapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdf
edget1
 
Py-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.ppt
Py-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.ppt
Py-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.ppt
v65176016
 
Developing Euro 20 IoT Smart Home Server
Developing Euro 20  IoT Smart Home ServerDeveloping Euro 20  IoT Smart Home Server
Developing Euro 20 IoT Smart Home Server
Olimex Bulgaria
 
Unidad Pedagogica 3ro-4to.documento090904
Unidad Pedagogica 3ro-4to.documento090904Unidad Pedagogica 3ro-4to.documento090904
Unidad Pedagogica 3ro-4to.documento090904
maylingcastro9
 
Musicfy lolMusicfy lolMusicfy lolMusicfy lol
Musicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lol
Musicfy lolMusicfy lolMusicfy lolMusicfy lol
bilalshah786104
 
Intro to Windows Presentation for CSS NC-2.pptx
Intro to Windows Presentation for CSS NC-2.pptxIntro to Windows Presentation for CSS NC-2.pptx
Intro to Windows Presentation for CSS NC-2.pptx
HelenAvila17
 
AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]
AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]
AnyTrans for iOS 8.9.2.20220609 Full Cracked Download [Latest]
FarhanSEO
 
Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...
Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...
Linear Accelerators: Principles, Components, Mechanism of Action, and Their V...
ChaudharyBharatDagur
 
ENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdf
ENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdfENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdf
ENSA_Module_12 - Network Troubleshooting.pdfchapterchapter-11.pdf.pdf
edget1
 
TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...
TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...
TuxCon 2025 Experiments with ESP-NOW protocol and comparison to Zigbee, WiFi,...
Olimex Bulgaria
 
ideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdf
ideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdfideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdf
ideapad_330S_14IKB_Shhjjkjhghhvbhjjjpec.pdf
MichaelDexterBalanta1
 
DU Meter Crack 8.01 + Serial Key Download [Latest 2025]
DU Meter Crack 8.01 + Serial Key Download [Latest 2025]DU Meter Crack 8.01 + Serial Key Download [Latest 2025]
DU Meter Crack 8.01 + Serial Key Download [Latest 2025]
FarhanSEO
 
Concavity_Presentation_Updated.pptx rana
Concavity_Presentation_Updated.pptx ranaConcavity_Presentation_Updated.pptx rana
Concavity_Presentation_Updated.pptx rana
ranamumtaz383
 
Parmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptx
Parmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptxParmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptx
Parmila_nsnsnjnsnsnnwDevi_Rajbanshi.pptx
rahulrajbanshi981052
 
Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]
Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]
Bandicut Video Cutter 3.1.3.454 Crack Full Version [Latest]
FarhanSEO
 
chapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdf
chapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdfchapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdf
chapter-11.pdfchapter-11.pdfchapter-11.pdfchapter-chapter-11.pdf.pdf
edget1
 
Py-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.ppt
Py-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.ppt
Py-Slides-1.pptPy-Slides-1.pptPy-Slides-1.pptPy-Slides-1.ppt
v65176016
 
Developing Euro 20 IoT Smart Home Server
Developing Euro 20  IoT Smart Home ServerDeveloping Euro 20  IoT Smart Home Server
Developing Euro 20 IoT Smart Home Server
Olimex Bulgaria
 
Unidad Pedagogica 3ro-4to.documento090904
Unidad Pedagogica 3ro-4to.documento090904Unidad Pedagogica 3ro-4to.documento090904
Unidad Pedagogica 3ro-4to.documento090904
maylingcastro9
 
Musicfy lolMusicfy lolMusicfy lolMusicfy lol
Musicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lolMusicfy lol
Musicfy lolMusicfy lolMusicfy lolMusicfy lol
bilalshah786104
 

Parallel Programming for Multi- Core and Cluster Systems - Performance Analysis

  • 1. CSC 447: Parallel Programming for Multi- Core and Cluster Systems Performance Analysis Instructor: Haidar M. Harmanani Spring 2020 Outline § Performance scalability § Analytical performance measures § Amdahl’s law and Gustafson-Barsis’ law Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 2
  • 2. Performance § In computing, performance is defined by 2 factors – Computational requirements (what needs to be done) – Computing resources (what it costs to do it) § Computational problems translate to requirements § Computing resources interplay and tradeoff Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 3 Time Energy … and ultimately MoneyHardware Performance ~ 1 Resources for solution Measuring Performance § Performance itself is a measure of how well the computational requirements can be satisfied § We evaluate performance to understand the relationships between requirements and resources – Decide how to change “solutions” to target objectives § Performance measures reflect decisions about how and how well “solutions” are able to satisfy the computational requirements § When measuring performance, it is important to understand exactly what you are measuring and how you are measuring it Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 4
  • 3. Scalability § A program can scale up to use many processors – What does that mean? § How do you evaluate scalability? § How do you evaluate scalability goodness? § Comparative evaluation – If double the number of processors, what to expect? – Is scalability linear? § Use parallel efficiency measure – Is efficiency retained as problem size increases? § Apply performance metrics Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 5 Performance and Scalability § Evaluation – Sequential runtime (Tseq) is a function of problem size and architecture – Parallel runtime (Tpar) is a function of problem size and parallel architecture and the number of processors used in the execution – Parallel performance affected by algorithm + architecture § Scalability – Ability of parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 6
  • 4. Performance Metrics and Formulas § T1 is the execution time on a single processor § Tp is the execution time on a p processor system § S(p) (Sp) is the speedup § E(p) (Ep) is the efficiency § Cost(p) (Cp) is the cost § Parallel algorithm is cost-optimal – Parallel time = sequential time (Cp = T1 , Ep = 100%) Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 7 S( p) = T1 Tp Efficiency = Sp p Cost = p ´ Tp Speed-Up § Provides a measure of application performance with respect to a given program platform § Can also be cast in terms of computational steps o Can extend time complexity to parallel computations § Use the fastest known sequential algorithm for running on a single processor Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 8
  • 5. What is a “good” speedup? § Hopefully, S(n) > 1 § Linear speedup: – S(n) = n – Parallel program considered perfectly scalable § Superlinear speedup: – S(n) > n – Can this happen? Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 9 Defining Speed-Up § We need more information to evaluate speedup: – What problem size? Worst case time? Average case time? – What do we count as work? o Parallel computation, communication, overhead? – What serial algorithm and what machine should we use for the numerator? o Can the algorithms used for the numerator and the denominator be different? Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 10
  • 6. Common Definitions of Speed-Up § Common definitions of Speedup: – Serial machine is one processor of parallel machine and serial algorithm is interleaved version of parallel algorithm – Serial algorithm is fastest known serial algorithm for running on a serial processor – Serial algorithm is fastest known serial algorithm running on a one processor of the parallel machine Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 11 )( )1( )( nT T nS = )( )( nT T nS s = )( )1(' )( nT T nS = Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 12 Can speedup be superlinear?
  • 7. Can speedup be superlinear? § Speedup CANNOT be superlinear: – Let M be a parallel machine with n processors – Let T(X) be the time it takes to solve a problem on M with X processors – Speedup definition: o Suppose a parallel algorithm A solves an instance I of a problem in t time units § Then A can solve the same problem in n x t units of time on M through time slicing § The best serial time for I will be no bigger than n x t § Hence speedup cannot be greater than n. Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 13 )( )1( )( nT T nS = S(n) = T (1) T (n) ≤ nt t = n Can speedup be superlinear? § Speedup CAN be superlinear: – Let M be a parallel machine with n processors – Let T(X) be the time it takes to solve a problem on M with X processors – Speedup definition: – Serial version of the algorithm may involve more overhead than the parallel version of the algorithm o E.g. A=B+C on a SIMD machine with A,B,C matrices vs. loop overhead on a serial machine – Hardware characteristics may favor parallel algorithm o E.g. if all data can be decomposed in main memories of parallel processors vs. needing secondary storage on serial processor to retain all data – “work” may be counted differently in serial and parallel algorithms Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 14 )( )( nT T nS s =
  • 8. Speedup Factor § Maximum speedup is usually n with n processors (linear speedup). § Possible to get superlinear speedup (greater than n) but usually a specific reason such as: – Extra memory in multiprocessor system – Nondeterministic algorithm Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 15 Maximum Speedup: Amdahl’s law § f = fraction of program (algorithm) that is serial and cannot be parallelized – Data setup – Reading/writing to a single disk file § Speedup factor is given by: Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 16 Ts = fTs + (1− f )Ts Tp = fTs + (1− f )Ts n S(n) = Ts fTs + (1− f )Ts n = n 1+ (n −1) f limn−>∞ = 1 f The above equation is known as Amdahl’s Law Note that as n ® ¥, the maximum speedup is limited to 1/f.
  • 9. Bounds on Speedup Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 17 Serial section Parallelizable sections (a) One processor (b) Multiple processors fts (1 - f)ts ts (1 - f)ts /ptp p processors Speedup Against Number of Processors § Even with infinite number of processors, maximum speedup limited to 1/f . § Example: With only 5% of computation being serial, maximum speedup is 20, irrespective of number of processors. Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 18 4 8 12 16 20 4 8 12 16 20 f = 20% f = 10% f = 5% f = 0% Number of processors , p
  • 10. Example of Amdahl’s Law (1) § Suppose that a calculation has a 4% serial portion, what is the limit of speedup on 16 processors? – 16/(1 + (16 – 1)*.04) = 10 – What is the maximum speedup? o 1/0.04 = 25 Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 19 Example of Amdahl’s Law (2) § 95% of a program’s execution time occurs inside a loop that can be executed in parallel. What is the maximum speedup we should expect from a parallel version of the program executing on 8 CPUs? Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 20 ψ ≤ 1 0.05+ (1− 0.05) / 8 ≅ 5.9
  • 11. Example of Amdahl’s Law (3) § 20% of a program’s execution time is spent within inherently sequential code. What is the limit to the speedup achievable by a parallel version of the program? Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 21 lim p→∞ 1 0.2 + (1− 0.2) / p = 1 0.2 = 5 Illustration of Amdahl Effect Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 22 n = 100 n = 1,000 n = 10,000 Speedup Processors
  • 12. Amdahl’s Law and Scalability § Scalability – Ability of parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem § When does Amdahl’s Law apply? – When the problem size is fixed – Strong scaling (p®∞, Sp = S∞ ® 1 / f ) – Speedup bound is determined by the degree of sequential execution time in the computation, not # processors!!! – Perfect efficiency is hard to achieve § See original paper by Amdahl on course webpage Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 23 Variants of Speedup: Efficiency § Efficiency: E(n) = S(n)/n * 100% § Efficiency measures the fraction of time that processors are being used on the computation. – A program with linear speedup is 100% efficient. § Using efficiency: – A program attains 89% efficiency with a serial fraction of 2%. Approximately how many processors are being used according to Amdahl’s law? Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 24
  • 13. Efficiency Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 25 usedProcessors Speedup Efficiency timeexecutionParallelusedProcessors timeexecutionSequential Efficiency = ´ = Limitations of Speedup § Conventional notions of speedup don't always provide a reasonable measure of performance § Questionable assumptions: – "work" in conventional definitions of speedup is defined by operation count o communication more expensive than computation on current high-performance computers – best serial algorithm defines the least work necessary o for some languages on some machines, serial algorithm may do more work -- (loop operations vs. data parallel for example) – good performance for many users involves fast time on a sufficiently large problem; faster time on a smaller problem (better speedup) is less interesting – traditional speedup measures assume a "flat memory approximation”, i.e. all memory accesses take the same amount of time Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 26
  • 14. “Flat Memory Approximation” § “Flat memory Approximation” – all accesses to memory take the same amount of time – in practice, accesses to information in cache, main memory and peripheral memory take very different amounts of time. Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 27 Fully cached Virtual Memory Main Memory Time per access Access Another Perspective § We often use faster computers to solve larger problem instances § Let’s treat time as a constant and allow problem size to increase with number of processors Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 28
  • 15. Limitations of Speedup § Gustafson challenged Amdahl's assumption that the proportion of a program given to serial computations (f) and the proportion of a program given to parallel computations remains the same over all problem sizes. – For example, if the serial part is a loop initialization and it can be executed in parallel over the size of the input list, then the serial initialization becomes a smaller proportion of the overall calculation as the problem size grows larger. § Gustafson defined two “more relevant” notions of speedup – Scaled speedup – Fixed-time speedup o (usual version he called fixed-size speedup) Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 29 Gustafson-Barsis’s Law § Begin with parallel execution time § Estimate sequential execution time to solve same problem § Problem size is an increasing function of p § Predicts scaled speedup Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 30
Gustafson’s Law
Amdahl’s law (fixed-size speedup): fix the execution time on a single processor
◦ s + p = serial part + parallelizable part = 1 (normalized serial time)
◦ (s is the same as f previously)
◦ Assume the problem fits in the memory of the serial computer
◦ S_fixed-size = (s + p) / (s + p/n) = 1 / (s + (1 − s)/n)
Gustafson’s law (scaled speedup): fix the execution time on a parallel computer (multiple processors)
◦ s + p = serial part + parallelizable part = 1 (normalized parallel time)
◦ s + n·p = time to run the same problem serially on a single processor
◦ Assume the problem fits in the memory of the parallel computer
◦ S_scaled = (s + n·p) / (s + p) = n + (1 − n)·s
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 31
Scaled Speedup
§ Scaling implies that the problem size can increase with the number of processors
– Gustafson’s law gives a measure of how much
§ Scaled speedup is derived by fixing the parallel execution time
– Amdahl fixed the problem size → fixed serial execution time
– Amdahl’s law may be too conservative for high-performance computing
§ Interesting consequence of scaled speedup: there is no bound on speedup as n → ∞, and speedup can easily become superlinear!
§ In practice, unbounded scalability is unrealistic, as the quality of the answer will reach a point where no further increase in problem size is justified
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 32
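The contrast between the two formulas is easy to see numerically. A minimal sketch (Python assumed; not from the slides) that evaluates both speedups for the same serial fraction s and processor count n:

```python
def fixed_size_speedup(s, n):
    """Amdahl (fixed-size): S = 1 / (s + (1 - s)/n)."""
    return 1.0 / (s + (1.0 - s) / n)

def scaled_speedup(s, n):
    """Gustafson (scaled): S = n + (1 - n) * s."""
    return n + (1.0 - n) * s

s = 0.05  # serial fraction (normalized)
for n in (8, 64, 1024):
    print(n, round(fixed_size_speedup(s, n), 1), round(scaled_speedup(s, n), 1))
# Amdahl's speedup saturates near 1/s = 20, while Gustafson's grows almost linearly with n.
```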
Meaning of Scalability Function
§ To maintain efficiency when increasing p, we must increase n
§ The maximum problem size is limited by available memory, which is linear in p
§ The scalability function shows how memory usage per processor must grow to maintain efficiency
§ If the scalability function is a constant, the parallel system is perfectly scalable
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 33
Interpreting Scalability Function
[Figure: memory needed per processor vs. number of processors for scalability functions C, C log p, Cp, and Cp log p; curves that stay at or below the available memory size per processor can maintain efficiency, curves that grow past it cannot.]
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 34
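To see what those curves imply, here is a small sketch (Python assumed; the constant C and the per-processor memory budget are made-up illustrative numbers) that tabulates the per-processor memory demanded by each scalability function:

```python
import math

C = 1.0        # illustrative constant factor (say, GB)
budget = 8.0   # illustrative memory available per processor (GB)

curves = {
    "C":         lambda p: C,
    "C log p":   lambda p: C * math.log2(p),
    "C p":       lambda p: C * p,
    "C p log p": lambda p: C * p * math.log2(p),
}

for p in (2, 16, 128, 1024):
    needed = {name: round(f(p), 1) for name, f in curves.items()}
    within = [name for name, m in needed.items() if m <= budget]
    print(p, needed, "within budget:", within)
# C stays flat and C log p grows slowly, while C p and C p log p
# quickly exceed any fixed per-processor memory budget.
```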
Gustafson-Barsis’ Law and Scalability
§ Scalability
– Ability of a parallel algorithm to achieve performance gains proportional to the number of processors and the size of the problem
§ When does Gustafson’s Law apply?
– When the problem size can increase as the number of processors increases
– Weak scaling (Sp = 1 + (p − 1)·f_par, where f_par is the parallel fraction)
– The speedup bound grows with the number of processors
– Can maintain or increase parallel efficiency as the problem scales
§ See original paper by Gustafson on course webpage
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 35
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 36
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 37
Using Gustafson’s Law
§ Given a scaled speedup of 20 on 32 processors, what is the serial fraction from Amdahl’s law? What is the serial fraction from Gustafson’s law?
S_scaled = (s + n·p) / (s + p) = n + (1 − n)·s
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 38
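One way to answer that exercise (a sketch in Python, which is assumed rather than taken from the slides) is to invert each law for its serial fraction:

```python
def serial_fraction_amdahl(S, p):
    """Invert Amdahl's law S = 1 / (f + (1 - f)/p) for the serial fraction f."""
    return (p / S - 1.0) / (p - 1.0)

def serial_fraction_gustafson(S, p):
    """Invert Gustafson's law S = p + (1 - p) * s for the serial fraction s."""
    return (p - S) / (p - 1.0)

print(serial_fraction_amdahl(20, 32))     # ~0.019: about 2% of the hypothetical serial run
print(serial_fraction_gustafson(20, 32))  # ~0.387: about 39% of the parallel run
```

The two answers differ because Amdahl’s fraction is measured against the (much longer) serial execution, while Gustafson’s fraction is measured against the parallel execution.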
Example 1
§ An application running on 10 processors spends 3% of its time in serial code. What is the scaled speedup of the application?
Execution on 1 CPU would take about 10 times as long, except that 9 of the 10 projected shares do not have to execute the serial code:
S_scaled = 10 + (1 − 10)(0.03) = 10 − 0.27 = 9.73
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 39
Example 2
§ What is the maximum fraction of a program’s parallel execution time that can be spent in serial code if it is to achieve a scaled speedup of 7 on 8 processors?
7 = 8 + (1 − 8)s ⇒ s ≈ 0.14
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 40
Why Aren’t Parallel Applications Scalable?
Critical Paths
◦ Dependencies between computations spread across processors
Bottlenecks
◦ One processor holds things up
Algorithmic overhead
◦ Some things just take more effort to do in parallel
Communication overhead
◦ Spending an increasing proportion of time on communication
Load Imbalance
◦ Makes all processors wait for the “slowest” one
◦ Dynamic behavior
Speculative loss
◦ Do A and B in parallel, but B is ultimately not needed
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 41
Critical Paths
§ Long chain of dependences
– Main limitation on performance
– Resistance to performance improvement
§ Diagnostic
– Performance stagnates at a (relatively) fixed value
– Critical path analysis
§ Solution
– Eliminate long chains if possible
– Shorten chains by removing work from the critical path
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 42
Bottlenecks
§ How to detect?
– One processor A is busy while the others wait
– Data dependency on the result produced by A
§ Typical situations:
– N-to-1 reduction / computation / 1-to-N broadcast
– One processor assigning jobs in response to requests
§ Solution techniques:
– More efficient communication
– Hierarchical schemes for master-slave coordination
§ The program may not show ill effects for a long time
§ Shows up when scaling
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 43
Algorithmic Overhead
§ Different sequential algorithms can solve the same problem
§ All parallel algorithms are sequential when run on 1 processor
§ All parallel algorithms introduce additional operations
– Parallel overhead
§ What should be the starting point for a parallel algorithm?
– The best sequential algorithm might not parallelize at all
– Or it does not parallelize well (e.g., is not scalable)
§ What to do?
– Choose algorithmic variants that minimize overhead
– Use two-level algorithms
§ Performance is the rub
– Are you achieving better parallel performance?
– You must compare with the best sequential algorithm
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 44
What is the maximum parallelism possible?
§ Depends on the application, algorithm, and program
– Data dependencies in the execution
– Parallelism varies!
[Figure: 512-point FFT parallel signature – the available parallelism varies over the course of the execution.]
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 45
Embarrassingly Parallel Computations
§ An embarrassingly parallel computation is one that can be obviously divided into completely independent parts that can be executed simultaneously
– In a truly embarrassingly parallel computation there is no interaction between separate processes
– In a nearly embarrassingly parallel computation results must be distributed and collected/combined in some way
§ Embarrassingly parallel computations have the potential to achieve maximal speedup on parallel platforms
– If it takes T time sequentially, there is the potential to achieve T/P time running in parallel with P processors
– What would keep this from always being the case?
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 46
Embarrassingly Parallel Computations
[Figure: input data divided among independent processes, each producing part of the results.]
No or very little communication between processes
Each process can do its tasks without any interaction with the other processes
Examples
◦ Numerical integration
◦ Mandelbrot set
◦ Monte Carlo methods
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 47
Calculating π with Monte Carlo
Consider a circle of unit radius
Place the circle inside a square box with side of 2 in
The ratio of the circle area to the square area is: (π · 1 · 1) / (2 · 2) = π/4
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 48
Monte Carlo Calculation of π
§ Randomly choose a number of points in the square
§ For each point, determine if it lies inside the circle
§ The ratio of points in the circle to points in the square gives an approximation of π/4 (a minimal parallel sketch follows below)
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 49
Using Programs to Measure Machine Performance
§ Speedup measures the performance of an individual program on a particular machine
– Speedup cannot be used to
o Compare different algorithms on the same computer
o Compare the same algorithm on different computers
§ Benchmarks are representative programs which can be used to compare the performance of machines
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 50
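The Monte Carlo π calculation above is a textbook embarrassingly parallel computation: each worker samples points independently, and the only interaction is summing the counts at the end. A minimal sketch (Python with the standard multiprocessing module is assumed; this is not code from the course):

```python
import random
from multiprocessing import Pool

def count_hits(n_points):
    """Sample n_points uniformly in the square [-1, 1] x [-1, 1] and count those inside the unit circle."""
    random.seed()  # reseed so forked workers do not share one random stream
    hits = 0
    for _ in range(n_points):
        x, y = random.uniform(-1.0, 1.0), random.uniform(-1.0, 1.0)
        if x * x + y * y <= 1.0:
            hits += 1
    return hits

if __name__ == "__main__":
    n_procs, n_total = 4, 4_000_000
    with Pool(n_procs) as pool:
        # Embarrassingly parallel: workers never communicate until the final sum.
        hits = sum(pool.map(count_hits, [n_total // n_procs] * n_procs))
    print("pi ~", 4.0 * hits / n_total)
```

Since the workers share nothing, the speedup of the sampling loop should approach P on P processors; the final reduction and process startup are the only serial overheads.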
Benchmarks Used for Parallel Machines
§ The Perfect Club
§ The Livermore Loops
§ The NAS Parallel Benchmarks
§ The SPEC Benchmarks
§ The “PACKS” (Linpack, LAPACK, ScaLAPACK, etc.)
§ ParkBENCH
§ SLALOM, HINT
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 51
Limitations and Pitfalls of Benchmarks
§ Benchmarks cannot address questions you did not ask
§ Specific application benchmarks will not tell you about the performance of other applications without proper analysis
§ General benchmarks will not tell you all the details about the performance of your specific application
§ You need to understand the benchmark itself to understand what it tells you
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 52
Benefits of Benchmarks
§ Popular benchmarks keep vendors attuned to applications
§ Benchmarks can give useful information about the performance of systems on particular kinds of programs
§ Benchmarks help expose performance bottlenecks of systems at both the technical and application levels
Spring 2020 CSC 447: Parallel Programming for Multi-Core and Cluster Systems 53