Parallel and Distributed Computing
OUTLINE
GPU Architecture
INTRODUCTION
A microprocessor for processing graphics
Capable of handling millions of instructions per second
A GPU is a graphics card that relieves the CPU of much of the graphics-processing load
INTRODUCTION
A GPU is built for performing parallel operations, which is why it has many parallel execution units.
Many supercomputers use GPUs for high processing performance.
CPU vs GPU
A CPU consists of four to eight cores, while a
GPU consists of hundreds of smaller cores.
This massively parallel architecture is what gives the GPU its
high compute performance.
SINGLE-CORE VS MULTICORE PROCESSOR
INTRODUCTION
Today's NVIDIA GPUs have been upgraded to 128 cores on a single chip,
hence consuming less power while delivering high performance.
Each core can handle 8 threads (128 × 8 = 1024 threads in total).
Modern GPUs are used not only for graphics and video coding but
also in HPC.
GPUs are designed to handle large numbers of floating-point
operations in parallel.
COMPONENTS OF GPU
Graphics processor
Graphics co-processor
Graphics accelerator
Frame buffer
Memory
Graphics BIOS
Digital-to-analog converter
Display connector
Computer connector
Components of CPU:
Control unit (CU)
Arithmetic logic unit (ALU)
Registers
Cache
Clock
ALUs and fetch/decode logic run at
high speed, consume little power, and
require little hardware to build.
In contrast, the rest of the execution unit requires a huge
number of transistors to build the cache, which
may occupy 50% of the total chip area and is therefore
expensive.
It may also be the main energy-consuming
element.
Introduction to GPU Evolution
A GPU comprises tens to thousands of cores.
To build a GPU we need a slimmer core design than the CPU's.
For this, all complex and large units should be
removed from the general-purpose CPU core.
Basic idea of the GPU: have many (hundreds or thousands of) simpler
or weaker processing units that execute the same
instructions simultaneously on different data.
Vector addition of two floating-point vectors, each containing 128 elements
Introduction to GPU Evolution
Instead of using one CPU core, we can use two
such cores.
We will then be able to execute two instructions in
parallel.
Hence throughput is increased.
Two instructions executing in parallel on two CPU cores
We can achieve even higher performance by further replicating the ALUs,
instead of replicating the complete CPU core.
The fetch/decode logic remains shared among the ALUs.
All ALUs execute the same operation on different input data.
One such core can add eight vector elements in parallel.
Parallel and Distributed Computing Chapter 8
Modern GPU
A modern GPU contains hundreds of simple
processing elements suited to computations that can
be run in parallel.
This typically involves arithmetic on large data sets,
e.g. vectors, matrices, and images, where the same operations
can be performed across millions of data elements at
the same time.
Modern GPU
To exploit GPU parallelism, the programmer should
partition the program into thousands of threads and
schedule them among the compute units.
Memory Hierarchy on GPU
A GPU has five memory regions accessible from a
single work-item:
1. Registers
2. Local Memory
3. Texture Memory
4. Constant Memory
5. Global Memory
Memory Hierarchy on GPU
Registers:
Registers are the first and most preferable level.
Each work-item has dedicated registers.
There may be 16K, 32K, or 64K registers available to the work-items.
Memory Hierarchy on GPU
Global Memory:
Also called graphics dynamic memory; it achieves high bandwidth,
but it has high latency compared to the other memories.
It is called global because it is accessed from both the GPU and the CPU.
The GTX 780 GPU has 3 GB of global memory implemented in GDDR5.
It is used for transfers between host and device.
It is the largest-capacity memory, but with high latency.
Host and Device
The main CPU is the host, while the other processors, such as
the GPU, are named devices.
Memory Hierarchy on GPU
There are also two additional memories that are accessible by all
work-items: constant memory and texture memory.
Constant Memory: resides in device memory (cached) and stores
constants and program arguments.
Constant memory has two special properties: first, it is cached, and
second, it supports broadcasting a single value to all work-items.
This broadcast takes place in just a single cycle.
Work-items have only read access to this region; however, the host is
permitted both read and write access.
Memory Hierarchy on GPU
Texture Memory: When all reads in a work-group are
physically adjacent, using texture memory can reduce
memory traffic and increase performance compared
to global memory.
However, texture memory is much slower than
constant memory.
Work-Item / Work-Group
Work-items (WI) are actually threads.
A work-group is the unit of work scheduled onto a compute unit.
Work-items are organized into work-groups;
hence a work-group can also be defined as a set of work-items.
All work-items in a work-group are able to share local memory.
Work-groups execute independently of each other.
A work-group executes on a compute unit, and its work-items are mapped
to the compute unit's processing elements (PEs).
OpenCL
A framework used for writing programs that execute
across heterogeneous platforms consisting of CPUs,
GPUs, or other accelerator hardware.
It defines a C-based programming language that is used to
write compute kernels, which look much like C functions.
One significant drawback: it is not easy to learn.
OpenCL Kernel
Code that gets executed on a GPU device is called a
kernel in OpenCL.
The body of a kernel function implements the
computation to be completed by all work-items.
When writing kernels in OpenCL, we must declare
memory with a specific address-space qualifier to indicate
in which memory the data will reside, e.g.
__global, __constant, __local, or, by default,
private memory within a kernel (see the sketch below).
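A minimal sketch of how these qualifiers appear in a kernel signature; the kernel name, parameter names, and computation are illustrative, not taken from the slides:

// Illustrative OpenCL C kernel showing the address-space qualifiers.
__kernel void scale_and_offset(__global float *data,     // global memory: read/write, visible to all work-items
                               __constant float *coeffs, // constant memory: read-only, cached, broadcast to all work-items
                               __local float *scratch,   // local memory: shared within one work-group (unused here, shown only for the qualifier)
                               int n)                    // scalar arguments and locals are private by default
{
    int gid = get_global_id(0);   // private variable (default address space)
    if (gid < n)
        data[gid] = coeffs[0] * data[gid] + coeffs[1];
}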
Heterogeneous system
Also called the platform model.
It consists of a single host connected to one or more OpenCL
devices, e.g. FPGA accelerators, DSPs, GPUs, or even CPUs.
OpenCL Devices
An OpenCL device comprises several compute
units. Each compute unit comprises tens or
hundreds of processing elements.
Execution Model
The OpenCL execution model defines how kernels execute, i.e.
the NDRange (N-Dimensional Range) execution model.
The host program invokes a kernel over an index space called
an NDRange.
The NDRange defines the total number of work-items that execute
in parallel.
Programming in OpenCL
Sample C code for vector addition on a
single-core CPU (see the sketch below)
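The code listing on the original slide is an image and is not reproduced in this transcript; a minimal single-core C version of the 128-element vector addition might look like the sketch below (the function and array names are illustrative):

#include <stddef.h>

/* Sequential vector addition: one loop iteration per element. */
void vecadd(const float *a, const float *b, float *c, size_t n)
{
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

/* Usage: vecadd(a, b, c, 128); adds two 128-element vectors on one CPU core. */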
Kernel Function
To execute the vector addition function on a GPU device, we must
write it as a kernel function.
Each thread on the GPU device will then execute the same
kernel function.
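A hedged sketch of the corresponding OpenCL kernel (the kernel name "vecadd" and argument names are illustrative and reused in the host-code sketches below): the loop disappears, and each work-item adds only the element selected by its global ID.

// Each of the N work-items executes this kernel body once.
__kernel void vecadd(__global const float *a,
                     __global const float *b,
                     __global float *c)
{
    int gid = get_global_id(0);   // this work-item's position in the NDRange
    c[gid] = a[gid] + b[gid];     // one element per work-item, no loop
}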
Parallel and Distributed Computing Chapter 8
HOST CODE
The first step is to code the host application, which runs on a user's
computer and dispatches kernels to the connected devices.
The host application can be coded in C/C++.
OpenCL supports a wide range of heterogeneous platforms.
Before executing a kernel function, the host program for a
heterogeneous system must carry out the following steps.
Steps to execute Kernel Function
1. Discover the OpenCL devices; the platform consists of one or
more devices capable of executing OpenCL kernels.
2. Probe the characteristics of these devices so that the kernel
functions can adapt to specific features.
3. Read the source program and compile the kernels that will run on
the selected devices.
4. Set up the memory objects that will hold the data for the
computation.
5. Run the kernels on the selected devices.
6. Collect the final results from the devices.
Steps to execute Kernel Function
These steps are accomplished through the following series of calls:
1. Prepare and initialize data on the host.
2. Discover and initialize the devices.
3. Create a context.
4. Create a command queue.
5. Create the program object for the context.
6. Build the OpenCL program.
7. Create device buffers.
8. Write host data to the device buffers.
9. Create and compile the kernel.
10. Set the kernel arguments.
11. Set the execution model and enqueue the kernel for execution.
12. Read the output buffer back to the host.
Discover & Initialize the devices
The cl_int clGetDeviceIDs()
function is used to discover and initialize
the devices.
This function returns the number of available devices in
num_devices.
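A minimal sketch of the call; the platform is obtained first with clGetPlatformIDs(), since a platform handle is required, and the variable names and the limit of 8 devices are illustrative:

#include <CL/cl.h>

cl_platform_id platform;
cl_uint num_devices = 0;
cl_device_id devices[8];

/* Pick the first available platform. */
clGetPlatformIDs(1, &platform, NULL);

/* Discover up to 8 GPU devices; num_devices receives how many were found. */
cl_int err = clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU,
                            8, devices, &num_devices);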
Discover & Initialize the devices
To get device information, we use
cl_int clGetDeviceInfo().
This function can return the maximum number of compute
units, the maximum work-group size, the device type,
the size of memory, and so on.
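For example, a few typical queries, assuming devices[0] from the previous sketch:

cl_uint compute_units;
size_t max_wg_size;
cl_ulong global_mem;

/* Query a few device properties. */
clGetDeviceInfo(devices[0], CL_DEVICE_MAX_COMPUTE_UNITS,
                sizeof(compute_units), &compute_units, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_MAX_WORK_GROUP_SIZE,
                sizeof(max_wg_size), &max_wg_size, NULL);
clGetDeviceInfo(devices[0], CL_DEVICE_GLOBAL_MEM_SIZE,
                sizeof(global_mem), &global_mem, NULL);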
Create a context
The clCreateContext() function is used for
creating a context, i.e. the environment that manages objects
such as command queues, programs, and kernel
objects.
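A minimal sketch, continuing with the device found above; the command-queue creation (step 4 of the call series) is included here for completeness, using the OpenCL 1.x style clCreateCommandQueue() call:

cl_int err;

/* Create a context containing the single device we selected. */
cl_context context = clCreateContext(NULL,          /* default properties */
                                     1, &devices[0],
                                     NULL, NULL,     /* no error callback  */
                                     &err);

/* Create a command queue for this context and device. */
cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, &err);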
Build a program executable
The clBuildProgram() function is used to build
(compile and link) a program from source
code.
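A sketch assuming the kernel source (e.g. the vecadd kernel above) is held in a C string; the placeholder "..." stands for that source text, and context, devices, and err come from the previous sketches:

/* 'source' holds the OpenCL C source code of the kernels. */
const char *source = "...";   /* kernel source string (placeholder) */
cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);

/* Compile and link the program for our device; the build-options string is empty. */
err = clBuildProgram(program, 1, &devices[0], "", NULL, NULL);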
Write host data to device buffer
The clEnqueueWriteBuffer() function is used to
write data from host memory to a device
buffer.
This function provides the data for processing
on the device.
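A sketch for the two input vectors of the 128-element vector addition; N, host_a, host_b, host_c and the buffer names are illustrative, and context, queue, and err come from the previous sketches:

#define N 128
float host_a[N], host_b[N], host_c[N];

/* Device buffers for the two inputs and the output. */
cl_mem buf_a = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(float), NULL, &err);
cl_mem buf_b = clCreateBuffer(context, CL_MEM_READ_ONLY,  N * sizeof(float), NULL, &err);
cl_mem buf_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, N * sizeof(float), NULL, &err);

/* Blocking writes: copy host data into the device buffers. */
clEnqueueWriteBuffer(queue, buf_a, CL_TRUE, 0, N * sizeof(float), host_a, 0, NULL, NULL);
clEnqueueWriteBuffer(queue, buf_b, CL_TRUE, 0, N * sizeof(float), host_b, 0, NULL, NULL);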
Enqueue the kernel for execution
OpenCL always executes kernels in parallel, i.e. the same
kernel executes on different data items.
Each kernel execution in OpenCL is called a work-item.
Each work-item is responsible for executing the kernel
once on its assigned portion of the data set.
Thus it is the programmer's responsibility to tell OpenCL
how many work-items are needed to process all the data (see the sketch below).
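A sketch of these final steps for the 128-element vector addition, reusing the illustrative names from the earlier sketches (program, queue, buf_a/buf_b/buf_c, host_c, N, and the kernel name "vecadd"): create the kernel object, set its arguments, launch 128 work-items, and read the result back.

/* Create the kernel object and bind the three buffer arguments. */
cl_kernel kernel = clCreateKernel(program, "vecadd", &err);
clSetKernelArg(kernel, 0, sizeof(cl_mem), &buf_a);
clSetKernelArg(kernel, 1, sizeof(cl_mem), &buf_b);
clSetKernelArg(kernel, 2, sizeof(cl_mem), &buf_c);

/* One work-item per element: a 1-dimensional NDRange of size N.
   Passing NULL for the local size lets the runtime pick the work-group size. */
size_t global_size = N;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL, 0, NULL, NULL);

/* Blocking read: copy the result back to the host. */
clEnqueueReadBuffer(queue, buf_c, CL_TRUE, 0, N * sizeof(float), host_c, 0, NULL, NULL);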
Occupancy
Occupancy is the ratio of active work-groups per
compute unit to the maximum number it can support.
We should always keep occupancy high in order
to hide latency when executing instructions.
A compute unit should have a ready work-group
to execute in every cycle, as this is the only way
to keep the hardware busy.
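A minimal worked example, assuming a hypothetical compute unit that can hold at most 16 resident work-groups, of which 8 are currently active:

\[
\text{occupancy} = \frac{\text{active work-groups per compute unit}}{\text{maximum resident work-groups per compute unit}} = \frac{8}{16} = 50\%
\]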