Lecture 13: Graphics Processing Units (GPUs)
Multimedia Extensions (aka SIMD extensions)
• Very short vectors added to existing ISAs for microprocessors
• Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b
– Lincoln Labs TX-2 from 1957 had 36b datapath split into 2x18b or 4x9b
– Newer designs have wider registers
» 128b for PowerPC Altivec, Intel SSE2/3/4
» 256b for Intel AVX
• Single instruction operates on all elements within register
[Figure: a 64b register viewed as 1x64b, 2x32b, 4x16b, or 8x8b elements; a 4x16b add sums four pairs of 16b elements with one instruction]
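The 4x16b add can be imitated in plain C to show what "one instruction operates on all elements" means. This is only a software sketch of the idea, not how any particular ISA implements it; the function names `pack_4x16`, `lane_4x16`, and `add_4x16` are made up for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Pack four 16-bit values into one 64-bit word, lane 0 in the low bits. */
uint64_t pack_4x16(uint16_t l0, uint16_t l1, uint16_t l2, uint16_t l3)
{
    return (uint64_t)l0 | ((uint64_t)l1 << 16) |
           ((uint64_t)l2 << 32) | ((uint64_t)l3 << 48);
}

/* Extract one 16-bit lane (0..3) from a packed word. */
uint16_t lane_4x16(uint64_t v, int lane)
{
    assert(lane >= 0 && lane < 4);
    return (uint16_t)(v >> (16 * lane));
}

/* Add four independent 16-bit lanes held in one 64-bit word.
   Carries are kept from crossing lane boundaries, so every lane
   wraps around modulo 2^16 on its own. */
uint64_t add_4x16(uint64_t a, uint64_t b)
{
    const uint64_t HIGH = 0x8000800080008000ULL; /* top bit of each lane */
    /* Add the low 15 bits of every lane; no carry can leave a lane. */
    uint64_t low = (a & ~HIGH) + (b & ~HIGH);
    /* Fold the top bits back in with XOR (an add that drops the carry-out). */
    return low ^ ((a ^ b) & HIGH);
}
```

An ordinary 64-bit add would let an overflow in one lane spill into its neighbour; a SIMD extension cuts the carry chain at the lane boundaries in hardware, which is what the masking above emulates.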
Types of Parallelism
• Instruction-Level Parallelism (ILP)
– Execute independent instructions from one instruction stream in
parallel (pipelining, superscalar, VLIW)
• Thread-Level Parallelism (TLP)
– Execute independent instruction streams in parallel (multithreading,
multiple cores)
• Data-Level Parallelism (DLP)
– Execute multiple operations of the same type in parallel
(vector/SIMD execution)
• Which is easiest to program?
• Which is most flexible form of parallelism?
– i.e., can be used in more situations
• Which is most efficient?
– i.e., greatest tasks/second/area, lowest energy/task
Resurgence of DLP
• Convergence of application demands and technology
constraints drives architecture choice
• New applications, such as graphics, machine vision,
speech recognition, machine learning, etc. all require
large numerical computations that are often trivially
data parallel
• SIMD-based architectures (vector-SIMD, subword-
SIMD, SIMT/GPUs) are most efficient way to execute
these algorithms
DLP important for conventional CPUs too
• Prediction for x86 processors,
from Hennessy & Patterson,
5th edition
– Note: Educated guess, not Intel
product plans!
• TLP: +2 cores / 2 years
• DLP: 2x width / 4 years
• DLP will account for more
mainstream parallelism growth
than TLP in next decade.
– SIMD: single-instruction multiple-data (DLP)
– MIMD: multiple-instruction multiple-data (TLP)
Graphics Processing Units (GPUs)
• Original GPUs were dedicated fixed-function devices
for generating 3D graphics (mid-late 1990s) including
high-performance floating-point units
– Provide workstation-like graphics for PCs
– User could configure graphics pipeline, but not really program it
• Over time, more programmability added (2001-2005)
– E.g., New language Cg for writing small programs run on each
vertex or each pixel, also Windows DirectX variants
– Massively parallel (millions of vertices or pixels per frame) but very
constrained programming model
• Some users noticed they could do general-purpose
computation by mapping input and output data to
images, and computation to vertex and pixel shading
computations
– Incredibly difficult programming model as had to use graphics
pipeline model for general computation
General-Purpose GPUs (GP-GPUs)
• In 2006, Nvidia introduced GeForce 8800 GPU supporting
a new programming language: CUDA
– “Compute Unified Device Architecture”
– Subsequently, broader industry pushing for OpenCL, a vendor-neutral
version of same ideas.
• Idea: Take advantage of GPU computational performance
and memory bandwidth to accelerate some kernels for
general-purpose computing
• Attached processor model: Host CPU issues data-
parallel kernels to GP-GPU for execution
• This lecture has a simplified version of Nvidia CUDA-style
model and only considers GPU execution for
computational kernels, not graphics
– Would probably need another course to describe graphics processing
Simplified CUDA Programming Model
• Computation performed by a very large number of
independent small scalar threads (CUDA threads or
microthreads) grouped into thread blocks.
// C version of DAXPY loop.
void daxpy(int n, double a, double *x, double *y)
{
  for (int i = 0; i < n; i++)
    y[i] = a*x[i] + y[i];
}

// CUDA version.
// Piece run on host processor.
int nblocks = (n+255)/256;            // 256 CUDA threads/block, rounded up
daxpy<<<nblocks,256>>>(n, 2.0, x, y);

// Piece run on GP-GPU. A kernel launched with <<<...>>> is declared __global__.
__global__
void daxpy(int n, double a, double *x, double *y)
{
  int i = blockIdx.x*blockDim.x + threadIdx.x;  // global element index
  if (i < n) y[i] = a*x[i] + y[i];              // guard for the last block
}
Programmer’s View of Execution
[Figure: a grid drawn as a row of thread blocks, blockIdx 0, blockIdx 1, ..., up to the last of the (n+255)/256 blocks; each block contains threadIdx 0, threadIdx 1, ..., threadIdx 255]
• Create enough blocks to cover input vector (Nvidia calls this ensemble of blocks a Grid, can be 2-dimensional)
• Conditional (i<n) turns off unused threads in last block
• blockDim = 256 (programmer can choose)
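The mapping from (block, thread) pairs to vector elements can be checked on an ordinary CPU. The sketch below replaces the kernel launch with two nested loops; it assumes a one-dimensional grid and the slide's 256 threads per block, and the names `daxpy_thread` and `daxpy_grid` are invented for this illustration.

```c
#include <assert.h>

enum { BLOCK_DIM = 256 };  /* CUDA threads per block, as on the slide */

/* Work done by one CUDA thread, identified by (block_idx, thread_idx). */
static void daxpy_thread(int block_idx, int thread_idx,
                         int n, double a, const double *x, double *y)
{
    int i = block_idx * BLOCK_DIM + thread_idx;  /* global element index */
    if (i < n)               /* turns off unused threads in the last block */
        y[i] = a * x[i] + y[i];
}

/* Sequential stand-in for daxpy<<<nblocks,256>>>(n, a, x, y).
   Returns the number of thread blocks that were created. */
int daxpy_grid(int n, double a, const double *x, double *y)
{
    assert(n >= 0);
    int nblocks = (n + BLOCK_DIM - 1) / BLOCK_DIM;  /* round up */
    for (int b = 0; b < nblocks; b++)
        for (int t = 0; t < BLOCK_DIM; t++)
            daxpy_thread(b, t, n, a, x, y);
    return nblocks;
}
```

For n = 1000 this creates 4 blocks, that is 1024 threads, and the i<n test switches off the last 24 of them. On a GPU the two loops disappear: every (b, t) pair runs as its own CUDA thread.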
Hardware Execution Model
• GPU is built from multiple parallel cores, each core
contains a multithreaded SIMD processor with multiple
lanes but with no scalar processor
• CPU sends whole “grid” over to GPU, which distributes
thread blocks among cores (each thread block executes on
one core)
– Programmer unaware of number of cores
[Figure: GPU containing Core 0, Core 1, ..., Core 15, each with Lane 0, Lane 1, ..., Lane 15, attached to GPU Memory; the host CPU has its own CPU Memory]
“Single Instruction, Multiple Thread”
• GPUs use a SIMT model, where individual scalar
instruction streams for each CUDA thread are
grouped together for SIMD execution on hardware
(Nvidia groups 32 CUDA threads into a warp)
[Figure: scalar instruction stream (ld x; mul a; ld y; add; st y) executed in SIMD fashion across the warp's µthreads µT0 to µT7]
Implications of SIMT Model
• All “vector” loads and stores are scatter-gather, as
individual µthreads perform scalar loads and stores
– GPU adds hardware to dynamically coalesce individual µthread
loads and stores to mimic vector loads and stores
• Every µthread has to perform stripmining calculations
redundantly (“am I active?”) as there is no scalar
processor equivalent
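Why coalescing matters can be seen by counting how many aligned memory segments the scalar accesses of one warp touch. This is a rough sketch only: the 128-byte segment size is an assumption chosen for illustration, and the function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

enum { WARP_SIZE = 32, SEGMENT_BYTES = 128 };  /* segment size is an assumption */

/* Count the distinct aligned segments touched by the 32 addresses of one
   warp. Fewer segments means the scalar accesses merge into fewer
   memory transactions. */
int segments_touched(const uintptr_t addr[WARP_SIZE])
{
    assert(addr != NULL);
    uintptr_t seen[WARP_SIZE];
    int count = 0;
    for (int t = 0; t < WARP_SIZE; t++) {
        uintptr_t seg = addr[t] / SEGMENT_BYTES;
        int found = 0;
        for (int k = 0; k < count; k++)
            if (seen[k] == seg) { found = 1; break; }
        if (!found)
            seen[count++] = seg;
    }
    return count;
}

/* Addresses generated when microthread t loads element t*stride of an
   array of doubles that starts at byte address base. */
int segments_for_stride(uintptr_t base, size_t stride)
{
    uintptr_t addr[WARP_SIZE];
    for (int t = 0; t < WARP_SIZE; t++)
        addr[t] = base + (uintptr_t)t * stride * sizeof(double);
    return segments_touched(addr);
}
```

Under these assumptions a unit-stride load of doubles from an aligned base touches 2 segments, while a stride of 16 elements touches 32, one per µthread, so nothing is left to coalesce.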
Conditionals in SIMT model
• Simple if-then-else are compiled into predicated
execution, equivalent to vector masking
• More complex control flow compiled into branches
• How to execute a vector of branches?
[Figure: scalar instruction stream (tid=threadid; if (tid >= n) skip; call func1; add; st y; skip:) executed in SIMD fashion across the warp's µthreads µT0 to µT7]
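Predicated execution of a simple if-then-else can be sketched in C, with each loop standing in for one SIMD instruction across the warp. The 8-wide warp follows the figure; the function name and the particular if-then-else are made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum { WARP_SIZE = 8 };   /* 8 microthreads, as drawn in the figure */

/* Per-thread source:  y[t] = (x[t] < 0) ? -x[t] : 2 * x[t];
   Predicated form: every lane executes both arms, and a per-lane
   predicate selects which result is written back. */
void predicated_select(const int x[WARP_SIZE], int y[WARP_SIZE])
{
    assert(x != NULL && y != NULL);
    int pred[WARP_SIZE];       /* 1 where the "then" arm applies */
    int then_val[WARP_SIZE];
    int else_val[WARP_SIZE];

    /* Each loop below models one SIMD instruction issued to the warp. */
    for (int t = 0; t < WARP_SIZE; t++) pred[t] = (x[t] < 0);
    for (int t = 0; t < WARP_SIZE; t++) then_val[t] = -x[t];
    for (int t = 0; t < WARP_SIZE; t++) else_val[t] = 2 * x[t];
    for (int t = 0; t < WARP_SIZE; t++)
        y[t] = pred[t] ? then_val[t] : else_val[t];
}
```

No branch is executed, so the warp never splits. The price is that both arms are computed in every lane, which is why this form only suits short if-then-else bodies.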
Branch divergence
• Hardware tracks which µthreads take or don’t take
branch
• If all go the same way, then keep going in SIMD
fashion
• If not, create mask vector indicating taken/not-taken
• Keep executing not-taken path under mask, push
taken branch PC+mask onto a hardware stack and
execute later
• When can execution of µthreads in warp
reconverge?
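The mask-and-stack mechanism can be emulated for a single if/else. The sketch below assumes an 8-wide warp and assumes the paths reconverge at the first instruction after the if/else; the toy program, the `StackEntry` type, and the function name are illustrative inventions, not a description of Nvidia hardware.

```c
#include <assert.h>
#include <stdint.h>

enum { WARP_SIZE = 8, STACK_DEPTH = 2 };

typedef struct { int pc; uint8_t mask; } StackEntry;  /* deferred path */

/* Emulates one warp running this toy program:
       pc 0:  if (x[t] is odd) branch to pc 2
       pc 1:  y[t] = x[t] / 2;       (not-taken path, then jump to pc 3)
       pc 2:  y[t] = 3 * x[t] + 1;   (taken path)
       pc 3:  reconvergence point (assumed)
   Returns the number of issue slots spent on pc 1 and pc 2. */
int run_divergent_warp(const int x[WARP_SIZE], int y[WARP_SIZE])
{
    StackEntry stack[STACK_DEPTH];
    int sp = 0, issued = 0, pc;
    uint8_t active = 0xFF;              /* all 8 microthreads start active */

    /* pc 0: evaluate the branch condition in every lane. */
    uint8_t taken = 0;
    for (int t = 0; t < WARP_SIZE; t++)
        if (x[t] % 2 != 0)
            taken |= (uint8_t)(1u << t);
    uint8_t not_taken = (uint8_t)(active & ~taken);

    if (taken == 0) {
        pc = 1;                         /* all fall through: stay SIMD */
    } else if (not_taken == 0) {
        pc = 2;                         /* all branch: stay SIMD */
    } else {                            /* divergence */
        assert(sp < STACK_DEPTH);
        stack[sp].pc = 2;               /* push taken PC + mask for later */
        stack[sp].mask = taken;
        sp++;
        active = not_taken;             /* run not-taken path under mask */
        pc = 1;
    }

    for (;;) {
        for (int t = 0; t < WARP_SIZE; t++)
            if ((active >> t) & 1)
                y[t] = (pc == 1) ? x[t] / 2 : 3 * x[t] + 1;
        issued++;
        if (sp == 0)
            break;                      /* nothing deferred: reconverge */
        sp--;                           /* pop and run the deferred path */
        pc = stack[sp].pc;
        active = stack[sp].mask;
    }
    return issued;
}
```

A warp whose µthreads all go the same way spends one issue slot on the body; a divergent warp spends two, each with some lanes masked off.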
Warps are multithreaded on core
• One warp of 32 µthreads is a
single thread in the hardware
• Multiple warp threads are
interleaved in execution on a
single core to hide latencies
(memory and functional unit)
• A single thread block can
contain multiple warps (up to
512 µT max in CUDA), all
mapped to single core
• Can have multiple blocks
executing on one core
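How interleaving warps hides latency can be illustrated with a toy cycle-by-cycle model of one core. The model is a deliberate simplification and an assumption of this sketch: every instruction has the same fixed latency, the core issues at most one instruction per cycle, and warps are scanned round-robin. The name `busy_cycles` is made up.

```c
#include <assert.h>

enum { MAX_WARPS = 64 };

/* Each warp issues one instruction, then must wait `latency` cycles
   before it can issue again. Every cycle the core issues from at most
   one ready warp, scanning round-robin from the warp after the last
   one that issued. Returns how many of `cycles` cycles issued. */
int busy_cycles(int nwarps, int latency, int cycles)
{
    assert(nwarps >= 1 && nwarps <= MAX_WARPS);
    assert(latency >= 0 && cycles >= 0);
    int ready_at[MAX_WARPS] = {0};  /* first cycle each warp may issue */
    int next = 0, busy = 0;

    for (int c = 0; c < cycles; c++) {
        for (int k = 0; k < nwarps; k++) {
            int w = (next + k) % nwarps;
            if (ready_at[w] <= c) {
                ready_at[w] = c + 1 + latency;  /* stalled until then */
                next = (w + 1) % nwarps;
                busy++;
                break;                          /* one issue per cycle */
            }
        }
    }
    return busy;
}
```

With a latency of 3 cycles, one warp keeps the core busy 25 cycles out of 100, two warps 50, and four or more warps all 100: in this model the latency is fully hidden once the core holds latency + 1 warps.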
[Nvidia, 2010]
GPU Memory Hierarchy
[Nvidia, 2010]
SIMT
• Illusion of many independent threads
• But for efficiency, programmer must try and keep
µthreads aligned in a SIMD fashion
– Try to do unit-stride loads and stores so memory coalescing kicks
in
– Avoid branch divergence so most instruction slots execute useful
work and are not masked off
Nvidia Fermi GF100 GPU
[Nvidia, 2010]
Fermi “Streaming Multiprocessor” Core
Fermi Dual-Issue Warp Scheduler
Apple A5X Processor for iPad v3 (2012)
• 12.90mm x 12.79mm
• 45nm technology
[Source: Chipworks, 2012]
Historical Retrospective, Cray-2 (1985)
• 243MHz ECL logic
• 2GB DRAM main memory (128 banks of 16MB each)
– Bank busy time 57 clocks!
• Local memory of 128KB/core
• 1 foreground + 4 background vector processors
[Figure: Foreground CPU and four cores, each drawn as a lane with its own Local Memory, all connected to Shared Memory]
GPU Future
• High-end desktops have separate GPU chip, but
trend towards integrating GPU on same die as CPU
(already in laptops, tablets and smartphones)
– Advantage is shared memory with CPU, no need to transfer data
– Disadvantage is reduced memory bandwidth compared to dedicated
smaller-capacity specialized memory system
» Graphics DRAM (GDDR) versus regular DRAM (DDR3)
• Will GP-GPU survive? Or will improvements in CPU
DLP make GP-GPU redundant?
– On same die, CPU and GPU should have same memory bandwidth
– GPU might have more FLOPS as needed for graphics anyway
23
Ad

More Related Content

What's hot (20)

Hardware Multi-Threading
Hardware Multi-ThreadingHardware Multi-Threading
Hardware Multi-Threading
babuece
 
Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6
Ismail Mukiibi
 
Computer architecture
Computer architecture Computer architecture
Computer architecture
Ashish Kumar
 
Lect18
Lect18Lect18
Lect18
Vin Voro
 
Aca 2
Aca 2Aca 2
Aca 2
parbhatverma
 
18 parallel processing
18 parallel processing18 parallel processing
18 parallel processing
dilip kumar
 
27 multicore
27 multicore27 multicore
27 multicore
ssuser47ae65
 
Parallel processing Concepts
Parallel processing ConceptsParallel processing Concepts
Parallel processing Concepts
Army Public School and College -Faisal
 
Multiprocessor Architecture for Image Processing
Multiprocessor Architecture for Image ProcessingMultiprocessor Architecture for Image Processing
Multiprocessor Architecture for Image Processing
mayank.grd
 
Array Processor
Array ProcessorArray Processor
Array Processor
Anshuman Biswal
 
Chapter 2 pc
Chapter 2 pcChapter 2 pc
Chapter 2 pc
Hanif Durad
 
Shared-Memory Multiprocessors
Shared-Memory MultiprocessorsShared-Memory Multiprocessors
Shared-Memory Multiprocessors
Salvatore La Bua
 
Superscalar Architecture_AIUB
Superscalar Architecture_AIUBSuperscalar Architecture_AIUB
Superscalar Architecture_AIUB
Nusrat Mary
 
Parallel processing
Parallel processingParallel processing
Parallel processing
Syed Zaid Irshad
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Ahmed kasim
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
Sarwan ali
 
Multiprocessor architecture and programming
Multiprocessor architecture and programmingMultiprocessor architecture and programming
Multiprocessor architecture and programming
Raul Goycoolea Seoane
 
Multicore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiMulticore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash Prajapati
Ankit Raj
 
Computer Architecture
Computer ArchitectureComputer Architecture
Computer Architecture
Shahriar Parvez
 
Memory protection unit
Memory protection unit Memory protection unit
Memory protection unit
GlobalLogic Ukraine
 
Hardware Multi-Threading
Hardware Multi-ThreadingHardware Multi-Threading
Hardware Multi-Threading
babuece
 
Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6Advanced computer architecture lesson 5 and 6
Advanced computer architecture lesson 5 and 6
Ismail Mukiibi
 
Computer architecture
Computer architecture Computer architecture
Computer architecture
Ashish Kumar
 
18 parallel processing
18 parallel processing18 parallel processing
18 parallel processing
dilip kumar
 
Multiprocessor Architecture for Image Processing
Multiprocessor Architecture for Image ProcessingMultiprocessor Architecture for Image Processing
Multiprocessor Architecture for Image Processing
mayank.grd
 
Shared-Memory Multiprocessors
Shared-Memory MultiprocessorsShared-Memory Multiprocessors
Shared-Memory Multiprocessors
Salvatore La Bua
 
Superscalar Architecture_AIUB
Superscalar Architecture_AIUBSuperscalar Architecture_AIUB
Superscalar Architecture_AIUB
Nusrat Mary
 
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...Multithreading: Exploiting Thread-Level  Parallelism to Improve Uniprocessor ...
Multithreading: Exploiting Thread-Level Parallelism to Improve Uniprocessor ...
Ahmed kasim
 
Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler Chip Multithreading Systems Need a New Operating System Scheduler
Chip Multithreading Systems Need a New Operating System Scheduler
Sarwan ali
 
Multiprocessor architecture and programming
Multiprocessor architecture and programmingMultiprocessor architecture and programming
Multiprocessor architecture and programming
Raul Goycoolea Seoane
 
Multicore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash PrajapatiMulticore processor by Ankit Raj and Akash Prajapati
Multicore processor by Ankit Raj and Akash Prajapati
Ankit Raj
 

Similar to Graphics processing uni computer archiecture (20)

lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
ssuser413a98
 
Graphics Processing unit ppt
Graphics Processing unit pptGraphics Processing unit ppt
Graphics Processing unit ppt
VictorAbhinav
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
Ofer Rosenberg
 
Cuda Architecture
Cuda ArchitectureCuda Architecture
Cuda Architecture
Piyush Mittal
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
jtsagata
 
Gpu perf-presentation
Gpu perf-presentationGpu perf-presentation
Gpu perf-presentation
GiannisTsagatakis
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
John Holden
 
hjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppt
hjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppthjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppt
hjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppt
SagnikMondal32
 
Graphics Processing Unit (GPU) system.ppt
Graphics Processing Unit (GPU) system.pptGraphics Processing Unit (GPU) system.ppt
Graphics Processing Unit (GPU) system.ppt
TeddyIswahyudi1
 
cuda.ppt
cuda.pptcuda.ppt
cuda.ppt
dawoodsarfraz
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
Tigabu Yaya
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
ARUNACHALAM468781
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Facultad de Informática UCM
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
CastLabKAIST
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
George Markomanolis
 
GIST AI-X Computing Cluster
GIST AI-X Computing ClusterGIST AI-X Computing Cluster
GIST AI-X Computing Cluster
Jax Jargalsaikhan
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
Shree Kumar
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
lecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptxlecture11_GPUArchCUDA01.pptx
lecture11_GPUArchCUDA01.pptx
ssuser413a98
 
Graphics Processing unit ppt
Graphics Processing unit pptGraphics Processing unit ppt
Graphics Processing unit ppt
VictorAbhinav
 
Newbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universeNewbie’s guide to_the_gpgpu_universe
Newbie’s guide to_the_gpgpu_universe
Ofer Rosenberg
 
GPGPU Computation
GPGPU ComputationGPGPU Computation
GPGPU Computation
jtsagata
 
Gpu with cuda architecture
Gpu with cuda architectureGpu with cuda architecture
Gpu with cuda architecture
Dhaval Kaneria
 
Monte Carlo G P U Jan2010
Monte  Carlo  G P U  Jan2010Monte  Carlo  G P U  Jan2010
Monte Carlo G P U Jan2010
John Holden
 
hjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppt
hjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppthjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppt
hjjyjtjrtjrthjrtjr6usfgnfgngdngnrthrthrth.ppt
SagnikMondal32
 
Graphics Processing Unit (GPU) system.ppt
Graphics Processing Unit (GPU) system.pptGraphics Processing Unit (GPU) system.ppt
Graphics Processing Unit (GPU) system.ppt
TeddyIswahyudi1
 
lecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdflecture_GPUArchCUDA02-CUDAMem.pdf
lecture_GPUArchCUDA02-CUDAMem.pdf
Tigabu Yaya
 
gpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsngpuprogram_lecture,architecture_designsn
gpuprogram_lecture,architecture_designsn
ARUNACHALAM468781
 
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGAMaking the most out of Heterogeneous Chips with CPU, GPU and FPGA
Making the most out of Heterogeneous Chips with CPU, GPU and FPGA
Facultad de Informática UCM
 
AI Accelerators for Cloud Datacenters
AI Accelerators for Cloud DatacentersAI Accelerators for Cloud Datacenters
AI Accelerators for Cloud Datacenters
CastLabKAIST
 
Utilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmapUtilizing AMD GPUs: Tuning, programming models, and roadmap
Utilizing AMD GPUs: Tuning, programming models, and roadmap
George Markomanolis
 
An Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptxAn Introduction to CUDA-OpenCL - University.pptx
An Introduction to CUDA-OpenCL - University.pptx
AnirudhGarg35
 
Computing using GPUs
Computing using GPUsComputing using GPUs
Computing using GPUs
Shree Kumar
 
Vpu technology &gpgpu computing
Vpu technology &gpgpu computingVpu technology &gpgpu computing
Vpu technology &gpgpu computing
Arka Ghosh
 
Ad

More from Haris456 (10)

Hazards Computer Architecture
Hazards Computer ArchitectureHazards Computer Architecture
Hazards Computer Architecture
Haris456
 
Pipelining of Processors Computer Architecture
Pipelining of  Processors Computer ArchitecturePipelining of  Processors Computer Architecture
Pipelining of Processors Computer Architecture
Haris456
 
Computer Architecture Vector Computer
Computer Architecture Vector ComputerComputer Architecture Vector Computer
Computer Architecture Vector Computer
Haris456
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer Architecture
Haris456
 
Computer Architecture Instruction-Level paraallel processors
Computer Architecture Instruction-Level paraallel processorsComputer Architecture Instruction-Level paraallel processors
Computer Architecture Instruction-Level paraallel processors
Haris456
 
Pipeline Computer Architecture
Pipeline Computer ArchitecturePipeline Computer Architecture
Pipeline Computer Architecture
Haris456
 
Addressing mode Computer Architecture
Addressing mode  Computer ArchitectureAddressing mode  Computer Architecture
Addressing mode Computer Architecture
Haris456
 
Ca lecture 03
Ca lecture 03Ca lecture 03
Ca lecture 03
Haris456
 
Instruction Set Architecture
Instruction  Set ArchitectureInstruction  Set Architecture
Instruction Set Architecture
Haris456
 
Computer Architecture
Computer ArchitectureComputer Architecture
Computer Architecture
Haris456
 
Hazards Computer Architecture
Hazards Computer ArchitectureHazards Computer Architecture
Hazards Computer Architecture
Haris456
 
Pipelining of Processors Computer Architecture
Pipelining of  Processors Computer ArchitecturePipelining of  Processors Computer Architecture
Pipelining of Processors Computer Architecture
Haris456
 
Computer Architecture Vector Computer
Computer Architecture Vector ComputerComputer Architecture Vector Computer
Computer Architecture Vector Computer
Haris456
 
Computer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer ArchitectureComputer Memory Hierarchy Computer Architecture
Computer Memory Hierarchy Computer Architecture
Haris456
 
Computer Architecture Instruction-Level paraallel processors
Computer Architecture Instruction-Level paraallel processorsComputer Architecture Instruction-Level paraallel processors
Computer Architecture Instruction-Level paraallel processors
Haris456
 
Pipeline Computer Architecture
Pipeline Computer ArchitecturePipeline Computer Architecture
Pipeline Computer Architecture
Haris456
 
Addressing mode Computer Architecture
Addressing mode  Computer ArchitectureAddressing mode  Computer Architecture
Addressing mode Computer Architecture
Haris456
 
Ca lecture 03
Ca lecture 03Ca lecture 03
Ca lecture 03
Haris456
 
Instruction Set Architecture
Instruction  Set ArchitectureInstruction  Set Architecture
Instruction Set Architecture
Haris456
 
Computer Architecture
Computer ArchitectureComputer Architecture
Computer Architecture
Haris456
 
Ad

Recently uploaded (20)

Buy vs. Build: Unlocking the right path for your training tech
Buy vs. Build: Unlocking the right path for your training techBuy vs. Build: Unlocking the right path for your training tech
Buy vs. Build: Unlocking the right path for your training tech
Rustici Software
 
Adobe InDesign Crack FREE Download 2025 link
Adobe InDesign Crack FREE Download 2025 linkAdobe InDesign Crack FREE Download 2025 link
Adobe InDesign Crack FREE Download 2025 link
mahmadzubair09
 
sequencediagrams.pptx software Engineering
sequencediagrams.pptx software Engineeringsequencediagrams.pptx software Engineering
sequencediagrams.pptx software Engineering
aashrithakondapalli8
 
Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...
Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...
Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...
OnePlan Solutions
 
Do not let staffing shortages and limited fiscal view hamper your cause
Do not let staffing shortages and limited fiscal view hamper your causeDo not let staffing shortages and limited fiscal view hamper your cause
Do not let staffing shortages and limited fiscal view hamper your cause
Fexle Services Pvt. Ltd.
 
How to Troubleshoot 9 Types of OutOfMemoryError
How to Troubleshoot 9 Types of OutOfMemoryErrorHow to Troubleshoot 9 Types of OutOfMemoryError
How to Troubleshoot 9 Types of OutOfMemoryError
Tier1 app
 
GC Tuning: A Masterpiece in Performance Engineering
GC Tuning: A Masterpiece in Performance EngineeringGC Tuning: A Masterpiece in Performance Engineering
GC Tuning: A Masterpiece in Performance Engineering
Tier1 app
 
wAIred_LearnWithOutAI_JCON_14052025.pptx
wAIred_LearnWithOutAI_JCON_14052025.pptxwAIred_LearnWithOutAI_JCON_14052025.pptx
wAIred_LearnWithOutAI_JCON_14052025.pptx
SimonedeGijt
 
Adobe Media Encoder Crack FREE Download 2025
Adobe Media Encoder  Crack FREE Download 2025Adobe Media Encoder  Crack FREE Download 2025
Adobe Media Encoder Crack FREE Download 2025
zafranwaqar90
 
Digital Twins Software Service in Belfast
Digital Twins Software Service in BelfastDigital Twins Software Service in Belfast
Digital Twins Software Service in Belfast
julia smits
 
Medical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk ScoringMedical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk Scoring
ICS
 
Best HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRMBest HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRM
accordHRM
 
[gbgcpp] Let's get comfortable with concepts
[gbgcpp] Let's get comfortable with concepts[gbgcpp] Let's get comfortable with concepts
[gbgcpp] Let's get comfortable with concepts
Dimitrios Platis
 
Programs as Values - Write code and don't get lost
Programs as Values - Write code and don't get lostPrograms as Values - Write code and don't get lost
Programs as Values - Write code and don't get lost
Pierangelo Cecchetto
 
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by AjathMobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Ajath Infotech Technologies LLC
 
Beyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraftBeyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraft
Dmitrii Ivanov
 
Download MathType Crack Version 2025???
Download MathType Crack  Version 2025???Download MathType Crack  Version 2025???
Download MathType Crack Version 2025???
Google
 
Download 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-ActivatedDownload 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-Activated
Web Designer
 
How I solved production issues with OpenTelemetry
How I solved production issues with OpenTelemetryHow I solved production issues with OpenTelemetry
How I solved production issues with OpenTelemetry
Cees Bos
 
AEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural MeetingAEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural Meeting
jennaf3
 
Buy vs. Build: Unlocking the right path for your training tech
Buy vs. Build: Unlocking the right path for your training techBuy vs. Build: Unlocking the right path for your training tech
Buy vs. Build: Unlocking the right path for your training tech
Rustici Software
 
Adobe InDesign Crack FREE Download 2025 link
Adobe InDesign Crack FREE Download 2025 linkAdobe InDesign Crack FREE Download 2025 link
Adobe InDesign Crack FREE Download 2025 link
mahmadzubair09
 
sequencediagrams.pptx software Engineering
sequencediagrams.pptx software Engineeringsequencediagrams.pptx software Engineering
sequencediagrams.pptx software Engineering
aashrithakondapalli8
 
Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...
Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...
Surviving a Downturn Making Smarter Portfolio Decisions with OnePlan - Webina...
OnePlan Solutions
 
Do not let staffing shortages and limited fiscal view hamper your cause
Do not let staffing shortages and limited fiscal view hamper your causeDo not let staffing shortages and limited fiscal view hamper your cause
Do not let staffing shortages and limited fiscal view hamper your cause
Fexle Services Pvt. Ltd.
 
How to Troubleshoot 9 Types of OutOfMemoryError
How to Troubleshoot 9 Types of OutOfMemoryErrorHow to Troubleshoot 9 Types of OutOfMemoryError
How to Troubleshoot 9 Types of OutOfMemoryError
Tier1 app
 
GC Tuning: A Masterpiece in Performance Engineering
GC Tuning: A Masterpiece in Performance EngineeringGC Tuning: A Masterpiece in Performance Engineering
GC Tuning: A Masterpiece in Performance Engineering
Tier1 app
 
wAIred_LearnWithOutAI_JCON_14052025.pptx
wAIred_LearnWithOutAI_JCON_14052025.pptxwAIred_LearnWithOutAI_JCON_14052025.pptx
wAIred_LearnWithOutAI_JCON_14052025.pptx
SimonedeGijt
 
Adobe Media Encoder Crack FREE Download 2025
Adobe Media Encoder  Crack FREE Download 2025Adobe Media Encoder  Crack FREE Download 2025
Adobe Media Encoder Crack FREE Download 2025
zafranwaqar90
 
Digital Twins Software Service in Belfast
Digital Twins Software Service in BelfastDigital Twins Software Service in Belfast
Digital Twins Software Service in Belfast
julia smits
 
Medical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk ScoringMedical Device Cybersecurity Threat & Risk Scoring
Medical Device Cybersecurity Threat & Risk Scoring
ICS
 
Best HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRMBest HR and Payroll Software in Bangladesh - accordHRM
Best HR and Payroll Software in Bangladesh - accordHRM
accordHRM
 
[gbgcpp] Let's get comfortable with concepts
[gbgcpp] Let's get comfortable with concepts[gbgcpp] Let's get comfortable with concepts
[gbgcpp] Let's get comfortable with concepts
Dimitrios Platis
 
Programs as Values - Write code and don't get lost
Programs as Values - Write code and don't get lostPrograms as Values - Write code and don't get lost
Programs as Values - Write code and don't get lost
Pierangelo Cecchetto
 
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by AjathMobile Application Developer Dubai | Custom App Solutions by Ajath
Mobile Application Developer Dubai | Custom App Solutions by Ajath
Ajath Infotech Technologies LLC
 
Beyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraftBeyond the code. Complexity - 2025.05 - SwiftCraft
Beyond the code. Complexity - 2025.05 - SwiftCraft
Dmitrii Ivanov
 
Download MathType Crack Version 2025???
Download MathType Crack  Version 2025???Download MathType Crack  Version 2025???
Download MathType Crack Version 2025???
Google
 
Download 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-ActivatedDownload 4k Video Downloader Crack Pre-Activated
Download 4k Video Downloader Crack Pre-Activated
Web Designer
 
How I solved production issues with OpenTelemetry
How I solved production issues with OpenTelemetryHow I solved production issues with OpenTelemetry
How I solved production issues with OpenTelemetry
Cees Bos
 
AEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural MeetingAEM User Group DACH - 2025 Inaugural Meeting
AEM User Group DACH - 2025 Inaugural Meeting
jennaf3
 

Graphics processing uni computer archiecture

  • 1. Lecture 13: Graphics Processing Units (GPUs)
  • 2. 2 Multimedia Extensions (aka SIMD extensions) • Very short vectors added to existing ISAs for microprocessors • Use existing 64-bit registers split into 2x32b or 4x16b or 8x8b – Lincoln Labs TX-2 from 1957 had 36b datapath split into 2x18b or 4x9b – Newer designs have wider registers » 128b for PowerPC Altivec, Intel SSE2/3/4 » 256b for Intel AVX • Single instruction operates on all elements within register 16b 16b 16b 16b 32b 32b 64b 8b 8b 8b 8b 8b 8b 8b 8b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b 16b + + + +4x16b adds
  • 3. Types of Parallelism • Instruction-Level Parallelism (ILP) – Execute independent instructions from one instruction stream in parallel (pipelining, superscalar, VLIW) • Thread-Level Parallelism (TLP) – Execute independent instruction streams in parallel (multithreading, multiple cores) • Data-Level Parallelism (DLP) – Execute multiple operations of the same type in parallel (vector/SIMD execution) • Which is easiest to program? • Which is most flexible form of parallelism? – i.e., can be used in more situations • Which is most efficient? – i.e., greatest tasks/second/area, lowest energy/task 3
  • 4. Resurgence of DLP
      • Convergence of application demands and technology constraints drives architecture choice
      • New applications, such as graphics, machine vision, speech recognition, and machine learning, all require large numerical computations that are often trivially data parallel
      • SIMD-based architectures (vector-SIMD, subword-SIMD, SIMT/GPUs) are the most efficient way to execute these algorithms
  • 5. DLP important for conventional CPUs too
      • Prediction for x86 processors, from Hennessy & Patterson, 5th edition
          – Note: Educated guess, not Intel product plans!
      • TLP: 2+ cores / 2 years
      • DLP: 2x width / 4 years
      • DLP will account for more mainstream parallelism growth than TLP in next decade.
          – SIMD: single-instruction multiple-data (DLP)
          – MIMD: multiple-instruction multiple-data (TLP)
  • 6. Graphics Processing Units (GPUs)
      • Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-late 1990s) including high-performance floating-point units
          – Provide workstation-like graphics for PCs
          – User could configure graphics pipeline, but not really program it
      • Over time, more programmability added (2001-2005)
          – E.g., new language Cg for writing small programs run on each vertex or each pixel, also Windows DirectX variants
          – Massively parallel (millions of vertices or pixels per frame) but very constrained programming model
      • Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex and pixel shading computations
          – Incredibly difficult programming model, since general computation had to be expressed through the graphics pipeline model
  • 7. General-Purpose GPUs (GP-GPUs)
      • In 2006, Nvidia introduced GeForce 8800 GPU supporting a new programming language: CUDA
          – “Compute Unified Device Architecture”
          – Subsequently, broader industry pushing for OpenCL, a vendor-neutral version of same ideas.
      • Idea: Take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing
      • Attached processor model: Host CPU issues data-parallel kernels to GP-GPU for execution
      • This lecture has a simplified version of Nvidia CUDA-style model and only considers GPU execution for computational kernels, not graphics
          – Would probably need another course to describe graphics processing
  • 8. Simplified CUDA Programming Model
      • Computation performed by a very large number of independent small scalar threads (CUDA threads or microthreads) grouped into thread blocks.

        // C version of DAXPY loop.
        void daxpy(int n, double a, double *x, double *y)
        {
            for (int i = 0; i < n; i++)
                y[i] = a*x[i] + y[i];
        }

        // CUDA version.
        // Piece run on host processor.
        int nblocks = (n+255)/256;             // 256 CUDA threads/block, rounded up
        daxpy<<<nblocks,256>>>(n, 2.0, x, y);  // launch one CUDA thread per element

        // Piece run on GP-GPU. A kernel launched with <<<...>>> is declared __global__.
        __global__ void daxpy(int n, double a, double *x, double *y)
        {
            int i = blockIdx.x*blockDim.x + threadIdx.x;  // global element index
            if (i < n) y[i] = a*x[i] + y[i];              // threads past the end do nothing
        }
  • 9. Programmer’s View of Execution
      [Figure: a row of thread blocks labelled blockIdx 0, blockIdx 1, and so on, (n+255)/256 blocks in total, each containing threadId 0, threadId 1, ..., threadId 255]
      • Create enough blocks to cover input vector (Nvidia calls this ensemble of blocks a Grid, can be 2-dimensional)
      • Conditional (i<n) turns off unused threads in last block
      • blockDim = 256 (programmer can choose)
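The block-and-thread indexing described on slides 8 and 9 can be checked on a host CPU by replacing the kernel launch with two nested loops. This is only a sequential stand-in under stated assumptions: `num_blocks`, `daxpy_thread`, and `daxpy_grid` are hypothetical names, and a real GPU runs the threads in parallel in no fixed order.

```c
#include <assert.h>

enum { BLOCK_DIM = 256 };   /* threads per block, as chosen on the slide */

/* Blocks needed to cover n elements: ceiling of n / BLOCK_DIM. */
int num_blocks(int n)
{
    assert(n >= 0);
    return (n + BLOCK_DIM - 1) / BLOCK_DIM;
}

/* Work done by one CUDA thread, identified by its block and thread index. */
static void daxpy_thread(int block_idx, int thread_idx,
                         int n, double a, const double *x, double *y)
{
    int i = block_idx * BLOCK_DIM + thread_idx;   /* global element index */
    if (i < n)                                    /* guard for the last block */
        y[i] = a * x[i] + y[i];
}

/* Sequential emulation of daxpy<<<nblocks, 256>>>(n, a, x, y).
   Returns how many threads were created but switched off by the guard. */
int daxpy_grid(int n, double a, const double *x, double *y)
{
    int nblocks = num_blocks(n);
    for (int b = 0; b < nblocks; b++)
        for (int t = 0; t < BLOCK_DIM; t++)
            daxpy_thread(b, t, n, a, x, y);
    return nblocks * BLOCK_DIM - n;
}
```

With n = 300 this creates two blocks (512 threads), of which 212 in the last block fail the `i < n` test and do no work, which is the cost of rounding the grid up to a whole number of blocks.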
  • 10. GPU Hardware Execution Model
      • GPU is built from multiple parallel cores, each core contains a multithreaded SIMD processor with multiple lanes but with no scalar processor
      • CPU sends whole “grid” over to GPU, which distributes thread blocks among cores (each thread block executes on one core)
          – Programmer unaware of number of cores
      [Figure: CPU with CPU memory; GPU with GPU memory and Core 0 through Core 15, each with Lane 0 through Lane 15]
  • 11. “Single Instruction, Multiple Thread”
      • GPUs use a SIMT model, where individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp)
      [Figure: scalar instruction stream (ld x, mul a, ld y, add, st y) executed in SIMD fashion across the µthreads µT0 to µT7 of a warp]
  • 12. Implications of SIMT Model
      • All “vector” loads and stores are scatter-gather, as individual µthreads perform scalar loads and stores
          – GPU adds hardware to dynamically coalesce individual µthread loads and stores to mimic vector loads and stores
      • Every µthread has to perform stripmining calculations redundantly (“am I active?”) as there is no scalar processor equivalent
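To see why coalescing matters, one can count how many aligned memory segments the 32 scalar loads of a warp touch for a given access stride. The sketch below is a rough model only: the 128-byte segment size is an assumption made for illustration, not a figure from the slides, and `warp_segments` is a hypothetical helper rather than part of any GPU API.

```c
#include <assert.h>
#include <stddef.h>

enum { WARP_SIZE = 32, SEGMENT_BYTES = 128 };   /* segment size is assumed */

/* Count the distinct aligned segments touched when thread t of a warp
   loads element (first_index + t * stride) of an array whose elements
   are elem_bytes wide. Fewer segments means better coalescing. */
int warp_segments(size_t first_index, size_t stride, size_t elem_bytes)
{
    size_t seen[WARP_SIZE];
    int count = 0;

    /* Elements must tile a segment exactly so none straddles a boundary. */
    assert(elem_bytes > 0 && SEGMENT_BYTES % elem_bytes == 0);

    for (int t = 0; t < WARP_SIZE; t++) {
        size_t byte_addr = (first_index + (size_t)t * stride) * elem_bytes;
        size_t segment = byte_addr / SEGMENT_BYTES;

        int found = 0;
        for (int k = 0; k < count; k++)
            if (seen[k] == segment)
                found = 1;
        if (!found)
            seen[count++] = segment;
    }
    return count;
}
```

Under these assumptions, a warp reading consecutive doubles (stride 1) touches 2 segments, while a stride of 16 doubles puts every µthread in its own segment, 32 in total, so the hardware has nothing to merge.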
  • 13. Conditionals in SIMT model
      • Simple if-then-else are compiled into predicated execution, equivalent to vector masking
      • More complex control flow compiled into branches
      • How to execute a vector of branches?
      [Figure: scalar instruction stream (tid=threadid; if (tid >= n) skip; call func1; add; st y; skip:) executed in SIMD fashion across the µthreads µT0 to µT7 of a warp]
  • 14. Branch divergence
      • Hardware tracks which µthreads take or don’t take branch
      • If all go the same way, then keep going in SIMD fashion
      • If not, create mask vector indicating taken/not-taken
      • Keep executing not-taken path under mask, push taken branch PC+mask onto a hardware stack and execute later
      • When can execution of µthreads in warp reconverge?
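The cost of divergence can be illustrated by counting instruction issues for a warp that meets a two-way branch. The following is a simplified sketch, assuming a single branch with no nesting, so the deferred taken path is just a second step rather than an entry on a real hardware stack; `taken_mask`, `run_branch`, and `slot_count_t` are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

enum { WARP_SIZE = 32 };

typedef struct {
    int issued;   /* warp-wide instructions issued */
    int useful;   /* lane slots that did real work (at most issued * 32) */
} slot_count_t;

/* Number of set bits in a 32-bit mask. */
static int popcount32(uint32_t m)
{
    int c = 0;
    while (m) { m &= m - 1; c++; }
    return c;
}

/* Evaluate "tid >= n" for every lane; bit t set means lane t takes the branch. */
uint32_t taken_mask(int first_tid, int n)
{
    uint32_t mask = 0;
    for (int t = 0; t < WARP_SIZE; t++)
        if (first_tid + t >= n)
            mask |= (uint32_t)1 << t;
    return mask;
}

/* Run one warp through a branch whose two sides have the given lengths. */
slot_count_t run_branch(uint32_t taken, int len_not_taken, int len_taken)
{
    slot_count_t s = {0, 0};
    uint32_t not_taken = ~taken;

    assert(len_not_taken >= 0 && len_taken >= 0);

    /* Not-taken path runs first under its mask; skipped if no lane needs it. */
    if (not_taken) {
        s.issued += len_not_taken;
        s.useful += len_not_taken * popcount32(not_taken);
    }
    /* Taken path was deferred (PC + mask saved) and runs afterwards. */
    if (taken) {
        s.issued += len_taken;
        s.useful += len_taken * popcount32(taken);
    }
    return s;
}
```

When all 32 lanes agree, only one side is issued and every slot is useful. When the warp splits 20/12 across paths of 5 and 3 instructions, 8 instructions are issued but only 136 of the 256 lane slots do work, which is why the later slides advise avoiding divergence.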
  • 15. Warps are multithreaded on core
      • One warp of 32 µthreads is a single thread in the hardware
      • Multiple warp threads are interleaved in execution on a single core to hide latencies (memory and functional unit)
      • A single thread block can contain multiple warps (up to 512 µT max in CUDA), all mapped to single core
      • Can have multiple blocks executing on one core
      [Nvidia, 2010]
  • 16. GPU Memory Hierarchy [Nvidia, 2010]
  • 17. SIMT
      • Illusion of many independent threads
      • But for efficiency, programmer must try to keep µthreads aligned in a SIMD fashion
          – Try to do unit-stride loads and stores so memory coalescing kicks in
          – Avoid branch divergence so most instruction slots execute useful work and are not masked off
  • 18. Nvidia Fermi GF100 GPU [Nvidia, 2010]
  • 20. Fermi Dual-Issue Warp Scheduler
  • 21. Apple A5X Processor for iPad v3 (2012)
      • 12.90mm x 12.79mm
      • 45nm technology
      [Source: Chipworks, 2012]
  • 22. Historical Retrospective, Cray-2 (1985)
      • 243MHz ECL logic
      • 2GB DRAM main memory (128 banks of 16MB each)
          – Bank busy time 57 clocks!
      • Local memory of 128KB/core
      • 1 foreground + 4 background vector processors
      [Figure: foreground CPU and shared memory, plus four background cores, each drawn with a lane and a local memory]
  • 23. GPU Future
      • High-end desktops have separate GPU chip, but trend towards integrating GPU on same die as CPU (already in laptops, tablets and smartphones)
          – Advantage is shared memory with CPU, no need to transfer data
          – Disadvantage is reduced memory bandwidth compared to dedicated smaller-capacity specialized memory system
              » Graphics DRAM (GDDR) versus regular DRAM (DDR3)
      • Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant?
          – On same die, CPU and GPU should have same memory bandwidth
          – GPU might have more FLOPS as needed for graphics anyway