Dense Matrix Algorithms
Ananth Grama, Anshul Gupta,
George Karypis, and Vipin Kumar
To accompany the text “Introduction to Parallel Computing”,
Addison Wesley, 2003.
Topic Overview
• Matrix-Vector Multiplication
• Matrix-Matrix Multiplication
• Solving a System of Linear Equations
Matrix Algorithms: Introduction
• Due to their regular structure, parallel computations
involving matrices and vectors readily lend themselves to
data-decomposition.
• Typical algorithms rely on input, output, or intermediate
data decomposition.
• Most algorithms use one- and two-dimensional block,
cyclic, and block-cyclic partitionings.
Matrix-Vector Multiplication
• We aim to multiply a dense n x n matrix A with an n x 1
vector x to yield the n x 1 result vector y.
• The serial algorithm requires n2 multiplications and
additions.
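For reference, a minimal serial version of this computation (a plain NumPy sketch; the names are illustrative, not from the text) makes the n² multiply-add structure explicit:

import numpy as np

def matvec_serial(A, x):
    # y[i] = sum over j of A[i, j] * x[j]: n^2 multiplications and additions
    n = A.shape[0]
    y = np.zeros(n)
    for i in range(n):
        for j in range(n):
            y[i] += A[i, j] * x[j]
    return y

A = np.random.rand(4, 4)
x = np.random.rand(4)
assert np.allclose(matvec_serial(A, x), A @ x)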
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• The n x n matrix is partitioned among n processors, with
each processor storing one complete row of the matrix.
• The n x 1 vector x is distributed such that each process
owns one of its elements.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Multiplication of an n x n matrix with an n x 1 vector using
rowwise block 1-D partitioning. For the one-row-per-process
case, p = n.
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Since each process starts with only one element of x ,
an all-to-all broadcast is required to distribute all the
elements to all the processes.
• Process Pi now computes y[i] = Σj A[i, j] · x[j] (the sum running over j = 0, …, n − 1).
• The all-to-all broadcast and the computation of y[i] both
take time Θ(n) . Therefore, the parallel time is Θ(n) .
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
• Consider now the case when p < n and we use block 1-D
partitioning.
• Each process initially stores n/p complete rows of the
matrix and a portion of the vector of size n/p.
• The all-to-all broadcast takes place among p processes
and involves messages of size n/p.
• This is followed by n/p local dot products.
• Thus, the parallel run time of this procedure is
TP = n²/p + ts log p + tw n. This is cost-optimal (a sketch of this formulation follows).
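A minimal mpi4py sketch of the block 1-D formulation, assuming n is divisible by p and generating the per-process data locally for illustration (file name and sizes are assumptions, not from the text):

# run with e.g.: mpiexec -n 4 python rowwise_matvec.py
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
p, rank = comm.Get_size(), comm.Get_rank()

n = 8 * p                             # assume n is a multiple of p
local_A = np.random.rand(n // p, n)   # n/p complete rows of A
local_x = np.random.rand(n // p)      # this process's n/p elements of x

# All-to-all broadcast of the vector: afterwards every process holds all of x.
x = np.empty(n)
comm.Allgather(local_x, x)

# n/p local dot products of length n each give this process's part of y.
local_y = local_A @ x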
Matrix-Vector Multiplication:
Rowwise 1-D Partitioning
Scalability Analysis:
• We know that T0 = pTP − W; therefore, T0 = ts p log p + tw n p.
• For isoefficiency, we have W = KT0, where K = E/(1 – E)
for desired efficiency E.
• From this, we have W = O(p2) (from the tw term).
• There is also a bound on isoefficiency because of
concurrency. In this case, p < n, therefore, W = n2 =
Ω(p2).
• Overall isoefficiency is W = O(p2).
Matrix-Vector Multiplication:
2-D Partitioning
• The n x n matrix is partitioned among n2 processors such
that each processor owns a single element.
• The n x 1 vector x is distributed only in the last column of
n processors.
Matrix-Vector Multiplication: 2-D Partitioning
Matrix-vector multiplication with block 2-D partitioning. For the
one-element-per-process case, p = n2 if the matrix size is n x n .
Matrix-Vector Multiplication:
2-D Partitioning
• We must first align the vector with the matrix
appropriately.
• The first communication step for the 2-D partitioning
aligns the vector x along the principal diagonal of the
matrix.
• The second step copies the vector elements from each
diagonal process to all the processes in the
corresponding column using n simultaneous broadcasts
among all processors in the column.
• Finally, the result vector is computed by performing an
all-to-one reduction along the rows (all three phases are traced in the sketch below).
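The three phases (alignment, column broadcasts, row reductions) can be traced serially on a virtual n x n grid; the following NumPy sketch is purely illustrative, with xval[i, j] standing for the copy of a vector element held by process Pi,j:

import numpy as np

n = 4
A = np.random.rand(n, n)
x = np.random.rand(n)
xval = np.zeros((n, n))

# Phase 1 (alignment): x[i], initially in the last column of row i,
# is sent to the diagonal process P(i, i).
for i in range(n):
    xval[i, i] = x[i]

# Phase 2 (column broadcasts): P(j, j) broadcasts x[j] to all of column j.
for j in range(n):
    xval[:, j] = xval[j, j]

# Local multiply, then phase 3: an all-to-one reduction along each row
# accumulates y[i] = sum over j of A[i, j] * x[j].
y = (A * xval).sum(axis=1)
assert np.allclose(y, A @ x)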
Matrix-Vector Multiplication:
2-D Partitioning
• Three basic communication operations are used in this
algorithm: one-to-one communication to align the vector
along the main diagonal, one-to-all broadcast of each
vector element among the n processes of each column,
and all-to-one reduction in each row.
• Each of these operations takes Θ(log n) time and the
parallel time is Θ(log n) .
• The cost (process-time product) is Θ(n2 log n) ; hence,
the algorithm is not cost-optimal.
Matrix-Vector Multiplication:
2-D Partitioning
• When using fewer than n² processors, each process
owns an (n/√p) x (n/√p) block of the matrix.
• The vector is distributed in portions of n/√p elements in
the last process-column only.
• In this case, the message sizes for the alignment,
broadcast, and reduction are all n/√p.
• The computation is a product of an (n/√p) x (n/√p)
submatrix with a vector of length n/√p.
Matrix-Vector Multiplication:
2-D Partitioning
• The first alignment step takes time ts + tw n/√p.
• The broadcast and reduction each take time (ts + tw n/√p) log √p.
• Local matrix-vector products take time n²/p.
• The total time is TP ≈ n²/p + ts log p + tw (n/√p) log p.
Matrix-Vector Multiplication:
2-D Partitioning
• Scalability Analysis:
• T0 = pTP − W = ts p log p + tw n √p log p.
• Equating T0 with W, term by term, for isoefficiency, the tw
term is dominant, giving W = O(p log² p).
• The isoefficiency due to concurrency is O(p).
• The overall isoefficiency is O(p log² p) (due to the
network bandwidth).
• For cost optimality, we need pTP = O(n²). For this,
we have p = O(n²/log² n).
Matrix-Matrix Multiplication
• Consider the problem of multiplying two n x n dense,
square matrices A and B to yield the product matrix C =A
x B.
• The serial complexity is O(n3).
• We do not consider better serial algorithms (such as Strassen's
method), although these can be used as serial kernels
in the parallel algorithms.
• A useful concept in this case is called block operations.
In this view, an n x n matrix A can be regarded as a q x q
array of blocks Ai,j (0 ≤ i, j < q) such that each block is an
(n/q) x (n/q) submatrix.
• In this view, we perform q3 matrix multiplications, each
involving (n/q) x (n/q) matrices.
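The block view is easy to verify numerically; a small NumPy sketch with illustrative sizes (n = 8, q = 4) performs the q³ block multiplications explicitly:

import numpy as np

n, q = 8, 4
b = n // q                      # each block is (n/q) x (n/q)
A, B = np.random.rand(n, n), np.random.rand(n, n)
C = np.zeros((n, n))

# q^3 multiplications of (n/q) x (n/q) blocks
for i in range(q):
    for j in range(q):
        for k in range(q):
            C[i*b:(i+1)*b, j*b:(j+1)*b] += (
                A[i*b:(i+1)*b, k*b:(k+1)*b] @ B[k*b:(k+1)*b, j*b:(j+1)*b])

assert np.allclose(C, A @ B)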
Matrix-Matrix Multiplication
• Consider two n x n matrices A and B partitioned into p
blocks Ai,j and Bi,j (0 ≤ i, j < √p) of size (n/√p) x (n/√p)
each.
• Process Pi,j initially stores Ai,j and Bi,j and computes block
Ci,j of the result matrix.
• Computing submatrix Ci,j requires all submatrices Ai,k
and Bk,j for 0 ≤ k < √p.
• All-to-all broadcast blocks of A along rows and B along
columns.
• Perform local submatrix multiplication.
Matrix-Matrix Multiplication
• The two broadcasts take time 2(ts log √p + tw n²/√p).
• The computation requires √p multiplications of
(n/√p) x (n/√p) sized submatrices.
• The parallel run time is approximately
TP = n³/p + ts log p + 2 tw n²/√p.
• The algorithm is cost optimal and the isoefficiency is
O(p1.5) due to bandwidth term tw and concurrency.
• Major drawback of the algorithm is that it is not memory
optimal.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In this algorithm, we schedule the computations of the
processes of the ith row such that, at any given time,
each process is using a different block Ai,k.
• These blocks can be systematically rotated among the
processes after every submatrix multiplication so that
every process gets a fresh Ai,k after each rotation.
Matrix-Matrix Multiplication:
Cannon's Algorithm
Communication steps in Cannon's algorithm on 16 processes.
Matrix-Matrix Multiplication:
Cannon's Algorithm
• Align the blocks of A and B in such a way that each
process multiplies its local submatrices. This is done by
shifting all submatrices Ai,j to the left (with wraparound)
by i steps and all submatrices Bi,j up (with wraparound)
by j steps.
• Perform local block multiplication.
• Each block of A moves one step left and each block of B
moves one step up (again with wraparound).
• Perform next block multiplication, add to partial result,
repeat until all blocks have been multiplied.
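The shift pattern is easiest to see in a serial simulation; the sketch below emulates a √p x √p grid of blocks with NumPy (np.roll plays the role of the wraparound shifts; sizes are illustrative and there is no actual message passing):

import numpy as np

n, sp = 8, 4                     # sp = sqrt(p) blocks per dimension
b = n // sp
A, B = np.random.rand(n, n), np.random.rand(n, n)

# Ab[i, j] is the b x b block A(i,j) held by virtual process P(i,j).
Ab = A.reshape(sp, b, sp, b).swapaxes(1, 2).copy()
Bb = B.reshape(sp, b, sp, b).swapaxes(1, 2).copy()
Cb = np.zeros_like(Ab)

# Alignment: shift block-row i of A left by i, block-column j of B up by j.
for i in range(sp):
    Ab[i] = np.roll(Ab[i], -i, axis=0)
for j in range(sp):
    Bb[:, j] = np.roll(Bb[:, j], -j, axis=0)

# sqrt(p) compute-and-shift steps.
for _ in range(sp):
    for i in range(sp):
        for j in range(sp):
            Cb[i, j] += Ab[i, j] @ Bb[i, j]
    Ab = np.roll(Ab, -1, axis=1)   # every A block one step left (wraparound)
    Bb = np.roll(Bb, -1, axis=0)   # every B block one step up (wraparound)

C = Cb.swapaxes(1, 2).reshape(n, n)
assert np.allclose(C, A @ B)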
Matrix-Matrix Multiplication:
Cannon's Algorithm
• In the alignment step, since the maximum distance over
which a block shifts is √p − 1, the two shift operations
require a total of 2(ts + tw n²/p) time.
• Each of the √p single-step shifts in the compute-and-
shift phase of the algorithm takes ts + tw n²/p time.
• The computation time for multiplying √p matrices of size
(n/√p) x (n/√p) is n³/p.
• The parallel time is approximately:
TP = n³/p + 2√p ts + 2 tw n²/√p.
• The cost-optimality and isoefficiency of the algorithm are
identical to those of the first algorithm, except that this
version is memory optimal.
Matrix-Matrix Multiplication:
DNS Algorithm
• Uses a 3-D partitioning.
• Visualize the matrix multiplication algorithm as a cube:
matrices A and B come in along two orthogonal faces and
the result C comes out along the third orthogonal face.
• Each internal node in the cube represents a single add-
multiply operation (and thus the Θ(n³) complexity).
• DNS algorithm partitions this cube using a 3-D block
scheme.
Matrix-Matrix Multiplication:
DNS Algorithm
• Assume an n x n x n mesh of processors.
• Move the columns of A and rows of B and perform
broadcast.
• Each processor computes a single add-multiply.
• This is followed by an accumulation along the C
dimension.
• Since each add-multiply takes constant time and the
accumulation and broadcast each take Θ(log n) time, the
total runtime is Θ(log n).
• This is not cost optimal. It can be made cost optimal by
using n / log n processors along the direction of
accumulation.
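The cube view can be checked with a tiny serial emulation: each virtual process (i, j, k) performs one multiplication, and the accumulation runs along the k direction (NumPy, purely illustrative):

import numpy as np

n = 4
A, B = np.random.rand(n, n), np.random.rand(n, n)

# prod[i, j, k] is the single product computed by virtual process (i, j, k).
prod = np.zeros((n, n, n))
for i in range(n):
    for j in range(n):
        for k in range(n):
            prod[i, j, k] = A[i, k] * B[k, j]

# Accumulation along the k direction yields C = A B.
C = prod.sum(axis=2)
assert np.allclose(C, A @ B)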
Matrix-Matrix Multiplication:
DNS Algorithm
The communication steps in the DNS algorithm while
multiplying 4 x 4 matrices A and B on 64 processes.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n3 processors.
• Assume that the number of processes p is equal to q3 for
some q < n.
• The two matrices are partitioned into blocks of size (n/q)
x(n/q).
• Each matrix can thus be regarded as a q x q two-
dimensional square array of blocks.
• The algorithm follows from the previous one, except, in
this case, we operate on blocks rather than on individual
elements.
Matrix-Matrix Multiplication:
DNS Algorithm
Using fewer than n3 processors.
• The first one-to-one communication step is performed for
both A and B, and takes ts + tw (n/q)² time for each matrix.
• The two one-to-all broadcasts take (ts + tw (n/q)²) log q
time for each matrix.
• The reduction takes time (ts + tw (n/q)²) log q.
• Multiplication of (n/q) x (n/q) submatrices takes
(n/q)³ = n³/p time.
• The parallel time is approximated by:
TP ≈ n³/p + ts log p + tw (n²/p^(2/3)) log p.
• The isoefficiency function is Θ(p log³ p).
Solving a System of Linear Equations
• Consider the problem of solving linear equations of the
kind:
a0,0 x0 + a0,1 x1 + … + a0,n−1 xn−1 = b0
a1,0 x0 + a1,1 x1 + … + a1,n−1 xn−1 = b1
⋮
an−1,0 x0 + an−1,1 x1 + … + an−1,n−1 xn−1 = bn−1
• This is written as Ax = b, where A is an n x n matrix with
A[i, j] = ai,j, b is an n x 1 vector [ b0, b1, … , bn−1 ]T, and x is
the solution vector.
Solving a System of Linear Equations
Two steps in the solution are: reduction to triangular form, and
back-substitution. The triangular form is:
x0 + u0,1 x1 + u0,2 x2 + … + u0,n−1 xn−1 = y0
x1 + u1,2 x2 + … + u1,n−1 xn−1 = y1
⋮
xn−1 = yn−1
We write this as: Ux = y .
A commonly used method for transforming a given matrix
into an upper-triangular matrix is Gaussian elimination.
Gaussian Elimination
Serial Gaussian Elimination
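The serial program itself is not reproduced in this extract; the NumPy sketch below follows the structure described on the next slide (a division step that normalizes the pivot row, followed by elimination from the rows below it, with no pivoting):

import numpy as np

def gaussian_eliminate(A, b):
    # Reduce Ax = b, in place, to a unit upper-triangular system Ux = y:
    # A is overwritten with U and b with y.
    n = A.shape[0]
    for k in range(n):
        # Division step: normalize row k by the pivot A[k, k].
        A[k, k+1:] /= A[k, k]
        b[k] /= A[k, k]
        A[k, k] = 1.0
        # Elimination step: subtract A[i, k] times row k from each row i > k.
        for i in range(k + 1, n):
            A[i, k+1:] -= A[i, k] * A[k, k+1:]
            b[i] -= A[i, k] * b[k]
            A[i, k] = 0.0
    return A, b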
Gaussian Elimination
• The computation has three nested loops - in the kth
iteration of the outer loop, the algorithm performs (n-k)2
computations. Summing from k = 1..n, we have roughly
(n3/3) multiplications-subtractions.
A typical computation in Gaussian elimination.
Parallel Gaussian Elimination
• Assume p = n with each row assigned to a processor.
• The first step of the algorithm normalizes the row. This is
a serial operation and takes time (n-k) in the kth
iteration.
• In the second step, the normalized row is broadcast to all
the processors. This takes (ts + tw (n − k − 1)) log n time.
• Each processor can independently eliminate this row
from its own. This requires (n-k-1) multiplications and
subtractions.
• The total parallel time can be computed by summing
from k = 1 … n − 1 as TP = (3/2) n(n − 1) + ts n log n + (1/2) tw n(n − 1) log n.
• The formulation is not cost optimal because of the tw
term.
Parallel Gaussian Elimination
Gaussian elimination steps during the iteration corresponding to k = 3
for an 8 x 8 matrix partitioned rowwise among eight processes.
Parallel Gaussian Elimination:
Pipelined Execution
• In the previous formulation, the (k+1)st iteration starts
only after all the computation and communication for the
kth iteration is complete.
• In the pipelined version, there are three steps -
normalization of a row, communication, and elimination.
These steps are performed in an asynchronous fashion.
• A processor Pk waits to receive and eliminate all rows
prior to k.
• Once it has done this, it forwards its own row to
processor Pk+1.
Parallel Gaussian Elimination:
Pipelined Execution
Pipelined Gaussian elimination on a 5 x 5 matrix partitioned
with one row per process.
Parallel Gaussian Elimination:
Pipelined Execution
• The total number of steps in the entire pipelined
procedure is Θ(n).
• In any step, either O(n) elements are communicated
between directly-connected processes, or a division step
is performed on O(n) elements of a row, or an elimination
step is performed on O(n) elements of a row.
• The parallel time is therefore O(n2) .
• This is cost optimal.
Parallel Gaussian Elimination:
Pipelined Execution
The communication in the Gaussian elimination iteration
corresponding to k = 3 for an 8 x 8 matrix distributed among
four processes using block 1-D partitioning.
Parallel Gaussian Elimination:
Block 1D with p < n
• The above algorithm can be easily adapted to the case
when p < n.
• In the kth iteration, a processor with all rows belonging to
the active part of the matrix performs (n − k − 1)n/p
multiplications and subtractions.
• In the pipelined version, for n > p, computation dominates
communication.
• The parallel run time is approximately n³/p.
• While the algorithm is cost optimal, the cost of the parallel
algorithm is higher than the sequential run time by a factor
of 3/2.
Parallel Gaussian Elimination:
Block 1D with p < n
Computation load on different processes in block and cyclic
1-D partitioning of an 8 x 8 matrix on four processes during the
Gaussian elimination iteration corresponding to k = 3.
Parallel Gaussian Elimination:
Block 1D with p < n
• The load imbalance problem can be alleviated by using a
cyclic mapping.
• In this case, other than processing of the last p rows,
there is no load imbalance.
• This corresponds to a cumulative load imbalance
overhead of O(n2p) (instead of O(n3) in the previous
case).
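The effect of the two mappings on load balance is visible from the owner function alone; a small illustrative comparison (n = 8, p = 4, iteration k = 3, matching the figure above):

n, p, k = 8, 4, 3

def owner_block(i):
    return i // (n // p)        # rows assigned in contiguous chunks

def owner_cyclic(i):
    return i % p                # row i on process i mod p

active = range(k + 1, n)        # rows still active in iteration k
for name, owner in (("block", owner_block), ("cyclic", owner_cyclic)):
    counts = [sum(1 for i in active if owner(i) == q) for q in range(p)]
    print(name, counts)
# block  -> [0, 0, 2, 2]   (low-numbered processes sit idle)
# cyclic -> [1, 1, 1, 1]   (work stays balanced until the last p rows)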
Parallel Gaussian Elimination:
2-D Mapping
• Assume an n x n matrix A mapped onto an n x n mesh
of processors.
• Each update of the partial matrix can be thought of as a
scaled rank-one update (scaling by the pivot element).
• In the first step, the pivot is broadcast to the row of
processors.
• In the second step, each processor locally updates its
value. For this it needs the corresponding value from the
pivot row, and the scaling value from its own row.
• This requires two broadcasts, each of which takes log n
time.
• This results in a non-cost-optimal algorithm.
Parallel Gaussian Elimination:
2-D Mapping
Various steps in the Gaussian elimination iteration
corresponding to k = 3 for an 8 x 8 matrix on 64
processes arranged in a logical two-dimensional mesh.
Parallel Gaussian Elimination:
2-D Mapping with Pipelining
• We pipeline along two dimensions. First, the pivot value is pipelined
along the row. Then the scaled pivot row is pipelined down.
• Processor Pi,j (not on the pivot row) performs the elimination step
A[i, j] := A[i, j] − A[i, k] A[k, j] as soon as A[i, k] and A[k, j] are
available.
• The computation and communication for each iteration moves
through the mesh from top-left to bottom-right as a "front."
• After the front corresponding to a certain iteration passes through a
process, the process is free to perform subsequent iterations.
• Multiple fronts that correspond to different iterations are active
simultaneously.
Parallel Gaussian Elimination:
2-D Mapping with Pipelining
• If each step (division, elimination, or communication) is
assumed to take constant time, the front moves a single
step in this time. The front takes Θ(n) time to reach Pn−1,n−1.
• Once the front has progressed past a diagonal
processor, the next front can be initiated. In this way, the
last front passes the bottom-right corner of the matrix
Θ(n) steps after the first one.
• The parallel time is therefore O(n) , which is cost-optimal.
2-D Mapping with Pipelining
Pipelined Gaussian elimination for a 5 x 5 matrix with 25 processors.
Parallel Gaussian Elimination:
2-D Mapping with Pipelining and p < n
• In this case, a processor containing a completely active
part of the matrix performs n²/p multiplications and
subtractions, and communicates n/√p words along its
row and its column.
• The computation dominates communication for n >> p.
• The total parallel run time of this algorithm is (2n2/p) x n,
since there are n iterations. This is equal to 2n3/p.
• This is three times the serial operation count!
Parallel Gaussian Elimination:
2-D Mapping with Pipelining and p < n
The communication steps in the Gaussian elimination
iteration corresponding to k = 3 for an 8 x 8 matrix on 16
processes of a two-dimensional mesh.
Parallel Gaussian Elimination:
2-D Mapping with Pipelining and p < n
Computational load on different processes in block and cyclic
2-D mappings of an 8 x 8 matrix onto 16 processes during
the Gaussian elimination iteration corresponding to k = 3.
Parallel Gaussian Elimination:
2-D Cyclic Mapping
• The idling in the block mapping can be alleviated using a
cyclic mapping.
• The maximum difference in computational load between
any two processes in any iteration is that of one row and
one column update.
• This contributes Θ(n√p) to the overhead function. Since
there are n iterations, the total overhead is Θ(n²√p).
Gaussian Elimination
with Partial Pivoting
• For numerical stability, one generally uses partial
pivoting.
• In the kth iteration, we select a column i (called the pivot
column) such that A[k, i] is the largest in magnitude
among all A[k, j] with k ≤ j < n.
• The k th and the i th columns are interchanged.
• Simple to implement with row-partitioning and does not
add overhead since the division step takes the same
time as computing the max.
• Column-partitioning, however, requires a global
reduction, adding a log p term to the overhead.
• Pivoting precludes the use of pipelining.
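The pivot-column selection described above amounts to a one-line search plus a column interchange; a minimal NumPy sketch (note that column interchanges permute the unknowns, so the final solution must be permuted back):

import numpy as np

def pivot_columns(A, k):
    # Pick column i (k <= i < n) with the largest |A[k, i]|, then swap
    # columns k and i, as in partial pivoting over columns.
    i = k + int(np.argmax(np.abs(A[k, k:])))
    A[:, [k, i]] = A[:, [i, k]]
    return i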
Gaussian Elimination with Partial Pivoting:
2-D Partitioning
• Partial pivoting restricts use of pipelining, resulting in
performance loss.
• This loss can be alleviated by restricting pivoting to
specific columns.
• Alternately, we can use faster algorithms for broadcast.
Solving a Triangular System:
Back-Substitution
• The upper triangular matrix U undergoes back-
substitution to determine the vector x.
A serial algorithm for back-substitution.
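The serial program is not reproduced in this extract; the sketch below is consistent with the unit upper-triangular system Ux = y described earlier (the inner update is the substitution step the slides later refer to as line 7):

import numpy as np

def back_substitute(U, y):
    # Solve the unit upper-triangular system Ux = y using roughly
    # n^2/2 multiplications and subtractions.
    n = U.shape[0]
    x = np.zeros(n)
    y = y.copy()
    for k in range(n - 1, -1, -1):
        x[k] = y[k]                   # diagonal of U is 1 after elimination
        for i in range(k - 1, -1, -1):
            y[i] -= x[k] * U[i, k]    # substitution step
    return x

# Example use with the elimination sketch shown earlier:
# U, y = gaussian_eliminate(A.astype(float), b.astype(float)); x = back_substitute(U, y)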
Solving a Triangular System:
Back-Substitution
• The algorithm performs approximately n2/2 multiplications and
subtractions.
• Since complexity of this part is asymptotically lower, we should
optimize the data distribution for the factorization part.
• Consider a rowwise block 1-D mapping of the n x n matrix U with
vector y distributed uniformly.
• The value of the variable solved at a step can be pipelined back.
• Each step of a pipelined implementation requires a constant amount
of time for communication and Θ(n/p) time for computation.
• The parallel run time of the entire algorithm is Θ(n2/p).
Solving a Triangular System:
Back-Substitution
• If the matrix is partitioned by using 2-D partitioning on a
logical √p x √p mesh of processes, and the elements of
the vector are distributed along one of the columns of the
process mesh, then only the √p processes containing
the vector perform any computation.
• Using pipelining to communicate the appropriate
elements of U to the process containing the
corresponding elements of y for the substitution step
(line 7), the algorithm can be executed in Θ(n²/√p) time.
• While this is not cost optimal, since this does not
dominate the overall computation, the cost optimality is
determined by the factorization.