Data-Level Parallelism
CS4342 Advanced Computer Architecture
Dilum Bandara
Dilum.Bandara@uom.lk
Slides adapted from “Computer Architecture, A Quantitative Approach” by
John L. Hennessy and David A. Patterson, 5th Edition, 2012, Morgan
Kaufmann Publishers
Outline
 Vector architectures
 SIMD instruction set for multimedia
 GPUs
2
SIMD Architectures
 Exploit significant data-level parallelism for
 Matrix-oriented scientific computing
 Media-oriented image & sound processors
 SIMD is more energy efficient than MIMD
 Only needs to fetch 1 instruction per data operation
 Makes SIMD attractive for personal mobile devices
 SIMD allows programmer to continue to think
sequentially
3
SIMD Parallelism
1. Vector architectures
2. SIMD extensions
3. Graphics Processor Units (GPUs)
 For x86 processors
 Expect 2 additional cores per chip per year
 SIMD width to double every 4 years
 Potential speedup from SIMD to be 2x that from MIMD
4
Potential Speedup
5
1. Vector Architectures
 Idea
 Read sets of data elements into “vector registers”
 Operate on those registers
 Disperse results back into memory
 Registers are controlled by compiler
 Used to hide memory latency
 Leverage memory bandwidth
6
Vector Architecture – VMIPS
7
Vector MIPS
VMIPS
 Loosely based on Cray-1
 Vector registers
 Each register holds a 64-element, 64 bits/element vector
 Register file has 16 read ports & 8 write ports
 Vector functional units
 Fully pipelined
 Data & control hazards are detected
 Vector load-store unit
 Fully pipelined
 1 word per clock cycle after initial latency
 Scalar registers
 32 general-purpose registers
 32 floating-point registers
8
VMIPS Instructions
 ADDVV.D – add 2 vectors
 ADDVS.D – add vector to a scalar
 LV/SV – vector load & vector store from/to an
address
9
Example – DAXPY
 Y = aX + Y
L.D F0,a ; load scalar a
LV V1,Rx ; load vector X
MULVS.D V2,V1,F0 ; vector-scalar multiply
LV V3,Ry ; load vector Y
ADDVV.D V4,V2,V3 ; add
SV Ry,V4 ; store result
 Requires 6 instructions vs. almost 600 for MIPS
 Assuming 64 elements per vector
10
DAXPY – Double precision a×X plus Y
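For reference, a minimal C sketch of the loop this VMIPS code vectorizes (function and variable names are illustrative):

/* DAXPY in plain C: one multiply & one add per element.
   n = 64 matches the VMIPS vector length. */
void daxpy(double a, double *x, double *y, int n) {
  for (int i = 0; i < n; i = i + 1)
    y[i] = a * x[i] + y[i];
}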
Vector Execution Time
 Execution time depends on 3 factors
 Length of operand vectors
 Structural hazards
 Data dependencies
 VMIPS functional units consume 1 element per
clock cycle
 Execution time ≈ vector length (in clock cycles)
11
Definitions
 Convoy
 Set of vector instructions that could potentially execute
together
 Free of structural hazards
 Sequences with read-after-write hazards can be in the
same convoy via chaining
 Chaining
 Allows a vector operation to start as soon as individual
elements of its vector source operand become available
12
Convoy
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
 Convoys
1 LV MULVS.D
2 LV ADDVV.D
3 SV
13
Definitions – Chime
 Unit of time to execute 1 convoy
 m convoys execute in m chimes
 For vector length of n, requires m × n clock cycles
14
Convoy
LV V1,Rx ;load vector X
MULVS.D V2,V1,F0 ;vector-scalar multiply
LV V3,Ry ;load vector Y
ADDVV.D V4,V2,V3 ;add two vectors
SV Ry,V4 ;store the sum
 Convoys
1 LV MULVS.D
2 LV ADDVV.D
3 SV
 3 chimes, 2 FP ops per result, cycles per FLOP = 1.5
 For 64-element vectors, requires 64 × 3 = 192 clock
cycles
15
Challenges
 Start up time
 Latency of vector functional unit
 Assume same as Cray-1
 Floating-point add  6 clock cycles
 Floating-point multiply  7 clock cycles
 Floating-point divide  20 clock cycles
 Vector load  12 clock cycles
16
Improvements
1. > 1 element per clock cycle
 Multiple Lanes
2. Vectors whose length is not 64
 Vector Length Register
3. IF statements in vector code
 Vector Mask Registers
4. Memory system optimizations to support vector
processors
 Memory Banks
5. Multi-dimensional matrices
 Stride
6. Sparse matrices
7. Programming a vector computer
17
1. Multiple Lanes
 Multiple hardware lanes
 Element n of vector register A is “hardwired” to
element n of vector register B
18
2. Vector Length Register
 Vector length not known at compile time?
 Vector Length Register (VLR)
 Use strip mining for vectors over maximum length
19
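A strip-mining sketch in C, assuming a maximum vector length MVL of 64 (names are illustrative): the inner loop stands for one vector iteration whose length is loaded into the VLR.

#define MVL 64 /* maximum vector length (assumed) */
void strip_mined_daxpy(double a, double *x, double *y, int n) {
  for (int low = 0; low < n; low = low + MVL) {
    int vl = (n - low < MVL) ? (n - low) : MVL; /* value set in VLR */
    for (int i = low; i < low + vl; i = i + 1)  /* one vector iteration */
      y[i] = a * x[i] + y[i];
  }
}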
3. Vector Mask Registers
for (i = 0; i < 64; i=i+1)
if (X[i] != 0)
X[i] = X[i] – Y[i];
 Use vector mask register to “disable” elements
LV V1,Rx ;load vector X into V1
LV V2,Ry ;load vector Y
L.D F0,#0 ;load FP zero into F0
SNEVS.D V1,F0 ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1,V1,V2 ;subtract under vector mask
SV Rx,V1 ;store the result in X
20
4. Memory Banks
 Designed to support high bandwidth for vector
loads & stores
 Spread accesses across multiple banks
 Control bank addresses independently
 Load or store non-sequential words
 Support multiple vector processors sharing the same
memory
21
Source: www.hardwareanalysis.com/content/image/10220/
5. Stride
for (i = 0; i < 100; i=i+1)
  for (j = 0; j < 100; j=j+1) {
    A[i][j] = 0.0;
    for (k = 0; k < 100; k=k+1)
      A[i][j] = A[i][j] + B[i][k] * D[k][j];
  }
 Row-major vs. column-major
 Vectorize multiplication of rows of B with columns of D
 Use non-unit stride
 Bank conflict (stall) occurs when same bank is hit faster
than bank busy time
22
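To make the stride concrete, a sketch (names are illustrative): with row-major storage of double D[100][100], consecutive elements of a column are 100 elements (800 bytes) apart, so a vector load of a column would use a non-unit stride of 100 elements.

double B[100][100], D[100][100];
double dot_row_col(int i, int j) {
  double sum = 0.0;
  for (int k = 0; k < 100; k = k + 1)
    sum = sum + B[i][k] * D[k][j]; /* B: stride 1; D: stride 100 elements */
  return sum;
}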
2. SIMD Extensions
 Media applications operate on data types
narrower than native word size
 E.g., four 8-bit operations on a 32-bit system
23
Why SIMD?
 Needs only minor changes to hardware
 Requires little extra state compared to vector
architectures
 Less state to save & restore on a context switch
 No need for high memory bandwidth compared
to vector processors
 No need to deal with page faults when used with
virtual memory
24
SIMD Extensions
 Limitations compared to vector instructions
 Number of data operands is encoded into the opcode
 Leads to many instructions
 No sophisticated addressing modes
 Strided, scatter-gather
 No mask registers for conditional execution
25
SIMD Implementations
 Intel MMX (1996)
 Eight 8-bit integer ops or four 16-bit integer ops
 Streaming SIMD Extensions (SSE) (1999)
 Eight 16-bit integer ops
 Four 32-bit integer/FP ops or two 64-bit integer/FP ops
 Advanced Vector Extensions (AVX) (2010)
 Four 64-bit integer/FP ops
 Relies on programmer & libraries rather than compiler
 Operands must be in consecutive & aligned memory
locations
26
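A sketch of single-precision aX+Y written with SSE compiler intrinsics in C, four floats per instruction (assumes n is a multiple of 4; unaligned loads/stores are used here for safety, though aligned operands are faster):

#include <xmmintrin.h>
void saxpy_sse(float a, const float *x, float *y, int n) {
  __m128 va = _mm_set1_ps(a);          /* broadcast scalar a to 4 lanes */
  for (int i = 0; i < n; i = i + 4) {
    __m128 vx = _mm_loadu_ps(&x[i]);   /* 4 elements of X */
    __m128 vy = _mm_loadu_ps(&y[i]);   /* 4 elements of Y */
    vy = _mm_add_ps(_mm_mul_ps(va, vx), vy);
    _mm_storeu_ps(&y[i], vy);          /* 4 results back to memory */
  }
}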
Graphical Processing Units (GPUs)
 Originally designed to accelerate the large number of
computations performed in graphics rendering
 Offloaded numerically intensive computation from
CPU
 GPUs grew with demand for high-performance
graphics
 Eventually GPUs became more powerful than CPUs
for many computations
 Cost-power-performance advantage
27
GPUs
 GPU is typically an add-in card installed into a
PCI Express slot
 Market leaders: NVIDIA, Intel, AMD (ATI)
 Example NVIDIA GPUs at UoM
28
GeForce GTX 480 Tesla 2070
Example Specifications
29
                                  GTX 480        Tesla 2070      Tesla K80
Peak double-precision FP perf.    650 Gigaflops  515 Gigaflops   2.91 Teraflops
Peak single-precision FP perf.    1.3 Teraflops  1.03 Teraflops  8.74 Teraflops
CUDA cores                        480            448             4992
Frequency of CUDA cores           1.40 GHz       1.15 GHz        560/875 MHz
Memory size (GDDR5)               1536 MB        6 GB            24 GB
Memory bandwidth                  177.4 GB/sec   150 GB/sec      480 GB/sec
ECC memory                        No             Yes             Yes
CPU vs. GPU Architecture
30
GPU devotes more transistors to computation
Basic Idea
 Heterogeneous execution model
 CPU – host
 GPU – device
 Develop a C-like programming language for GPU
 Unify all forms of GPU parallelism as threads
 Handle each data element in a separate thread  no
need for synchronization
 Programming model is “Single Instruction Multiple
Thread (SIMT)”
31
Programming GPUs
 CUDA language for Nvidia GPU products
 Compute Unified Device Architecture
 Based on C
 nvcc compiler
 Lots of tools for analysis, debug, profile, …
 OpenCL - Open Computing Language
 Based on C
 Supports GPU & CPU programming
 Lots of active research
 e.g., automatic code generation for GPUs
32
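A minimal CUDA sketch (names are illustrative): each thread handles one data element, computing its global index from its block and thread IDs, so no inter-thread synchronization is needed.

__global__ void mul_kernel(const float *x, const float *y,
                           float *z, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; /* global element index */
  if (i < n)                                     /* guard for partial block */
    z[i] = x[i] * y[i];
}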
Threads & Blocks
 A thread is associated with
each data element
 Threads are organized into
blocks
 Blocks are organized into a
grid
 GPU hardware handles
thread management, not
applications or OS
33
Grids, Blocks, & Threads
34
 Grid of size 6 (3x2
blocks)
 Each block has 12
threads (4x3)
NVIDIA GPU Architecture
 Similarities to vector machines
 Works well with data-level parallel problems
 Scatter-gather transfers
 Mask registers
 Large register files
 Differences
 No scalar processor
 Uses multithreading to hide memory latency
 Has many functional units, as opposed to a few
deeply pipelined units like a vector processor
35
Example – Multiply 2 Vectors of
Length 8192
 Code that works over all elements is the grid
 Thread blocks break this down into manageable sizes
 512/1024 threads per block
 SIMD instruction executes 32 elements at a time
 Thus, grid size = 16/8 blocks
 Block is analogous to a strip-mined vector loop with
vector length of 32
 Block is assigned to a multithreaded SIMD processor by
thread block scheduler
 Current-generation GPUs (Fermi) have 7-15
multithreaded SIMD processors (a.k.a. streaming
multiprocessors, SMs)
36
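A host-side sizing sketch for this 8192-element example, using the mul_kernel sketched earlier: 8192 / 512 = 16 blocks (or 8192 / 1024 = 8 blocks).

/* Launch sketch; d_x, d_y, d_z are assumed device pointers
   already allocated with cudaMalloc. */
void launch_mul(const float *d_x, const float *d_y, float *d_z) {
  int n = 8192;
  int threadsPerBlock = 512;                                /* or 1024 */
  int blocks = (n + threadsPerBlock - 1) / threadsPerBlock; /* 16 (or 8) */
  mul_kernel<<<blocks, threadsPerBlock>>>(d_x, d_y, d_z, n);
}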
Multithreaded SIMD Processor
37
Threads of SIMD Instructions
 Each has its own PC
 Thread scheduler uses scoreboard to dispatch
 Keeps track of up to 48 threads of SIMD
instructions
 Hides memory latency
 No data dependencies between threads!
 Thread block scheduler schedules blocks to
SIMD processors
 Within each SIMD processor
 32 SIMD lanes
 Wide & shallow compared to vector processors
38
Registers
 Each multithreaded SIMD processor has 32,768 registers
 Divided into lanes
 Each SIMD thread is limited to 64 registers
 SIMD thread has up to
 64 vector registers of 32 32-bit elements
 32 vector registers of 32 64-bit elements
 Fermi has 16 physical SIMD lanes, each containing
2048 registers
39
Conditional Branching
 Like vector architectures, GPU branch hardware uses
internal masks
 Also uses
 Branch synchronization stack
 Entries consist of masks for each SIMD lane
 i.e. which threads commit their results (all threads execute)
 Instruction markers to manage when a branch diverges into
multiple execution paths
 Push on divergent branch
 …and when paths converge
 Act as barriers
 Pops stack
 Per-thread-lane 1-bit predicate register, specified by
programmer
40
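A small CUDA sketch of a divergent branch, mirroring the earlier X[i] != 0 example: lanes of a warp that fail the test are masked off while the others execute, so both sides of the IF cost time.

__global__ void cond_sub(float *x, const float *y, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n && x[i] != 0.0f) /* may diverge within a 32-thread warp */
    x[i] = x[i] - y[i];
}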
NVIDIA GPU Memory Structure
GRID Architecture
41
Grid
• A group of threads all running
the same kernel
• Can run multiple grids at once
Block
• Grids are composed of blocks
• Each block is a logical unit containing a number of
coordinating threads & some amount of shared memory
NVIDIA GPU Memory Structures
(Cont.)
42
NVIDIA GPU Memory Structures
(Cont.)
 Each SIMD lane has a private section of off-chip DRAM
 Called private memory
 Contains stack frame, spilled registers, & private
variables
 Each multithreaded SIMD processor also has
local memory
 Shared by SIMD lanes / threads within a block
 Memory shared by SIMD processors is GPU
Memory
 Host can read & write GPU memory
43
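A sketch of the memory the slides call local memory, i.e., CUDA's __shared__ memory: each block stages data in on-chip memory visible to all of its threads, with __syncthreads() as the barrier. Assumes blocks of exactly 256 threads (a power of two) and an input of gridDim.x × 256 elements.

__global__ void block_sum(const float *x, float *block_sums) {
  __shared__ float buf[256];       /* local memory, shared by one block */
  int tid = threadIdx.x;
  buf[tid] = x[blockIdx.x * blockDim.x + tid];
  __syncthreads();                 /* wait until all loads complete */
  for (int s = blockDim.x / 2; s > 0; s = s / 2) { /* tree reduction */
    if (tid < s)
      buf[tid] = buf[tid] + buf[tid + s];
    __syncthreads();
  }
  if (tid == 0)
    block_sums[blockIdx.x] = buf[0]; /* one partial sum per block */
}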
Fermi Architecture Innovations
 Each SIMD processor has
 2 SIMD thread schedulers, 2 instruction dispatch units
 16 SIMD lanes (SIMD width = 32, chime = 2 cycles),
16 load-store units, 4 special function units
 Thus, 2 threads of SIMD instructions are scheduled every 2
clock cycles
 Fast double precision
 Caches for GPU memory
 64-bit addressing & unified address space
 Error correcting codes
 Faster context switching
 Faster atomic instructions
44
Multithreaded Dual SIMD Processor
of a Fermi GPU
45
Summary
 Vector architectures
 High performance
 High cost
 SIMD instruction set for multimedia
 Simple extensions
 Low cost
 Low performance
 GPUs
 High performance
 Low cost
 SIMD only
46
Editor's Notes

  • #16: 192 cycles / 128 FLOPs = 1.5 cycles per FLOP
  • #41: Predicate register – can be used to tell whether to execute an instruction given a condition
  • #44: Spilled registers – a "spilled variable" is a variable kept in main memory rather than in a CPU register. The operation of moving a variable from a register to memory is called spilling.
  • #46: 2 parts run separately