Optimizing the Graphics
Pipeline with Compute
Graham Wihlidal
Sr. Rendering Engineer, Frostbite
Acronyms
 Optimizations and algorithms presented are AMD GCN-centric [1][8]
VGT Vertex Grouper \ Tessellator
PA Primitive Assembly
CP Command Processor
IA Input Assembly
SE Shader Engine
CU Compute Unit
LDS Local Data Share
HTILE Hi-Z Depth Compression
GCN Graphics Core Next
SGPR Scalar General-Purpose Register
VGPR Vector General-Purpose Register
ALU Arithmetic Logic Unit
SPI Shader Processor Interpolator
Optimizing the Graphics Pipeline with Compute, GDC 2016
libEdge
Xbox One: 12 CU * 64 ALU * 2 FLOPs = 1,536 ALU ops / cy
1,536 ALU ops / 2 engines = 768 ALU ops per triangle
768 ALU ops / 2 ALU per cy = 384 instruction limit
PS4: 18 CU * 64 ALU * 2 FLOPs = 2,304 ALU ops / cy
2,304 ALU ops / 2 engines = 1,152 ALU ops per triangle
1,152 ALU ops / 2 ALU per cy = 576 instruction limit
Fury X: 64 CU * 64 ALU * 2 FLOPs = 8,192 ALU ops / cy
8,192 ALU ops / 4 engines = 2,048 ALU ops per triangle
2,048 ALU ops / 2 ALU per cy = 1,024 instruction limit
Can anyone here cull a triangle in less than 384
instructions on Xbox One?
… I sure hope so ☺
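The per-triangle budget can be recomputed from the CU and engine counts. This is a minimal sketch: `instruction_budget` is a hypothetical helper, and its defaults (64 ALUs per CU, 2 FLOPs per ALU, 2 ALU ops per instruction slot) are simply the constants used in the slide arithmetic.

```python
def instruction_budget(compute_units, geometry_engines,
                       alu_per_cu=64, flops_per_alu=2, alu_ops_per_instruction=2):
    """Return (ALU ops per clock, ALU ops per triangle, instruction limit)."""
    alu_ops_per_clock = compute_units * alu_per_cu * flops_per_alu
    # The slides share the ALU budget between the geometry engines,
    # treating each engine as consuming one triangle per clock.
    alu_ops_per_triangle = alu_ops_per_clock // geometry_engines
    instruction_limit = alu_ops_per_triangle // alu_ops_per_instruction
    return alu_ops_per_clock, alu_ops_per_triangle, instruction_limit


if __name__ == "__main__":
    for cus, engines in [(12, 2), (18, 2), (64, 4)]:
        print(cus, "CU:", instruction_budget(cus, engines))
```

The instruction limit is the point where culling in compute would cost as much ALU time as the fixed-function primitive rate allows, so a culling shader well under that count leaves headroom.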
Motivation – Death By 1000 Draws
 DirectX 12 promised millions of draws!
 Great CPU performance advancements
 Low overhead
 Power in the hands of (experienced) developers
 Console hardware is a fixed target
 GPU still chokes on tiny draws
 Common to see 2nd half of base pass barely utilizing the GPU
 Lots of tiny details or distant objects – most are Hi-Z culled
 Still have to run mostly empty vertex wavefronts
 More draws not necessarily a good thing
Motivation – Death By 1000 Draws
Motivation – Primitive Rate
 Wildly optimistic to assume we get close to 2 prims per cy – Getting 0.9 prim / cy
 If you are doing anything useful, you will be bound elsewhere in the pipeline
 You need good balance and lucky scheduling between the VGTs and PAs
 Depth of FIFO between VGT and PA
 Need positions of a VS back in < 4096 cy, or reduces primitive rate
 Some games hit close to peak perf (95+% range) in shadow passes
 Usually slower regions in there due to large triangles
 Coarse raster only does 1 super-tile per clock
 Triangles with bounding rectangle larger than 32x32?
 Multi-cycle on coarse raster, reduces primitive rate
Motivation – Primitive Rate
 Benchmarks that get 2 prims / cy (around 1.97) have these characteristics:
 VS reads nothing
 VS writes only SV_Position
 VS always outputs 0.0f for position - Trivially cull all primitives
 Index buffer is all 0s - Every vertex is a cache hit
 Every instance is a multiple of 64 vertices – Less likely to have unfilled VS waves
 No PS bound – No parameter cache usage
 Requires that nothing after VS causes a stall
 Parameter size <= 4 * PosSize
 Pixels drain faster than they are generated
 No scissoring occurs
 PA can receive work faster than VS can possibly generate it
 Often see tessellation achieve peak VS primitive throughput; one SE at a time
Motivation – Opportunity
 Coarse cull on CPU, refine on GPU
 Latency between CPU and GPU prevents optimizations
 GPGPU Submission!
 Depth-aware culling
 Tighten shadow bounds \ sample distribution shadow maps [21]
 Cull shadow casters without contribution [4]
 Cull hidden objects from color pass
 VR late-latch culling
 CPU submits conservative frustum and GPU refines
 Triangle and cluster culling
 Covered by this presentation
Motivation – Opportunity
 Maps directly to graphics pipeline
 Offload tessellation hull shader work
 Offload entire tessellation pipeline! [16][17]
 Procedural vertex animation (wind, cloth, etc.)
 Reusing results between multiple passes & frames
 Maps indirectly to graphics pipeline
 Bounding volume generation
 Pre-skinning
 Blend shapes
 Generating GPU work from the GPU [4] [13]
 Scene and visibility determination
 Treat your draws as data!
 Pre-build
 Cache and reuse
 Generate on GPU
Culling Overview
Culling Overview
Scene
 Consists of:
 Collection of meshes
 Specific view
 Camera, light, etc.
Culling Overview
Batch
 Configurable subset of meshes in
a scene
 Meshes within a batch share the
same shader and strides
(vertex/index)
 Near 1:1 with DirectX 12 PSO
(Pipeline State Object)
Culling Overview
Mesh Section
 Represents an indexed draw call
(triangle list)
 Has its own:
 Vertex buffer(s)
 Index buffer
 Primitive count
 Etc.
Culling Overview
Work Item
 Optimal number of triangles for
processing in a wavefront
 AMD GCN has 64 threads per
wavefront
 Each culling thread processes 1
triangle
 Work item processes 256 triangles
Culling Overview
[Diagram: a Scene is split into Batches, each Batch into Mesh Sections, and each Mesh Section into Work Items. Every Work Item runs Culling and produces Draw Args. Draw Call Compaction (No Zero Size Draws) removes the empty Draw Args before Multi Draw Indirect.]
Mapping Mesh ID to MultiDraw ID
 Indirect draws no longer know the mesh section or instance they came from
 Important for loading various constants, etc.
 A DirectX 12 trick is to create a custom command signature
 Allows for parsing a custom indirect arguments buffer format
 We can store the mesh section id along with each draw argument block
 PC drivers use compute shader patching
 Xbox One has custom command processor microcode support
 OpenGL has gl_DrawID which can be used for this
 SPI loads StartInstanceLocation into a reserved SGPR and adds it to SV_InstanceID
 A fallback approach can be an instancing buffer with a step rate of 1 which maps from instance id to draw id
Mapping Mesh ID to MultiDraw ID
Mesh Section Id
Draw Args
Index Count Per Instance
Instance Count
Start Index Location
Base Vertex Location
Start Instance Location
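One way to picture the custom argument block is as a packed record: a mesh section id followed by the five indexed-draw fields listed above. The sketch below is CPU-side only and assumes every field is 32 bits wide with a signed base vertex; `DRAW_RECORD`, `pack_draw` and `unpack_draws` are hypothetical names, not part of any graphics API.

```python
import struct

# Assumed layout: mesh section id, then IndexCountPerInstance, InstanceCount,
# StartIndexLocation, BaseVertexLocation (signed), StartInstanceLocation.
DRAW_RECORD = struct.Struct("<IIIIiI")


def pack_draw(mesh_section_id, index_count, instance_count,
              start_index, base_vertex, start_instance):
    """Serialize one draw record into little-endian bytes."""
    return DRAW_RECORD.pack(mesh_section_id, index_count, instance_count,
                            start_index, base_vertex, start_instance)


def unpack_draws(buffer):
    """Yield (mesh_section_id, draw_args) for each record in the buffer."""
    for offset in range(0, len(buffer), DRAW_RECORD.size):
        fields = DRAW_RECORD.unpack_from(buffer, offset)
        yield fields[0], fields[1:]


if __name__ == "__main__":
    args = pack_draw(7, 768, 1, 0, 0, 0) + pack_draw(9, 96, 1, 768, -4, 0)
    for section, draw in unpack_draws(args):
        print(section, draw)
```

Because the id travels with the arguments, compaction can move or drop records freely and each surviving draw still knows which mesh section's constants to load.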
De-Interleaved Vertex Buffers
P0 P1 P2 P3 …
N0 N1 N2 N3 …
TC0 TC1 TC2 TC3 …
Draw Call
P0 N0 TC0 P1 N1 TC1 P2 N2 TC2 …
Draw Call
Do This!
De-Interleaved vertex buffers are optimal on GCN architectures
They also make compute processing easier!
De-Interleaved Vertex Buffers
 Helpful for minimizing state changes for compute processing
 Constant vertex position stride
 Cleaner separation of volatile vs. non-volatile data
 Lower memory usage overall
 More optimal for regular GPU rendering
 Evict cache lines as quickly as possible!
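As a small illustration of the layout change, the sketch below splits an interleaved vertex list into separate position, normal and texture coordinate streams. It assumes each vertex is a `(position, normal, texcoord)` tuple; `deinterleave` is a hypothetical helper for illustration.

```python
def deinterleave(vertices):
    """Split [(position, normal, texcoord), ...] into three parallel streams."""
    positions = [v[0] for v in vertices]
    normals = [v[1] for v in vertices]
    texcoords = [v[2] for v in vertices]
    return positions, normals, texcoords


if __name__ == "__main__":
    interleaved = [((0, 0, 0), (0, 0, 1), (0.0, 0.0)),
                   ((1, 0, 0), (0, 0, 1), (1.0, 0.0)),
                   ((0, 1, 0), (0, 0, 1), (0.0, 1.0))]
    positions, normals, texcoords = deinterleave(interleaved)
    # A culling pass only needs to touch `positions`, with a constant stride.
    print(positions)
```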
Cluster Culling
Cluster Culling
 Generate triangle clusters using spatially coherent bucketing in spherical coordinates
 Optimize each triangle cluster to be cache coherent
 Generate optimal bounding cone of each cluster [19]
 Project normals on to the unit sphere
 Calculate minimum enclosing circle
 Diameter is the cone angle
 Center is projected back to Cartesian for cone normal
 Store cone in 8:8:8:8 SNORM
 Cull if dot(cone.Normal, -view) < -sin(cone.angle)
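The cone test itself is a single dot product. The sketch below applies the inequality exactly as written on the slide, and assumes `view` is a unit vector pointing from the camera toward the cluster and `cone_angle` is in radians. `cone_culled` and `pack_snorm8` are hypothetical helpers; the quantizer only illustrates 8-bit SNORM storage and does not widen the cone to stay conservative.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cone_culled(cone_normal, cone_angle, view):
    """Slide test: cull if dot(cone.Normal, -view) < -sin(cone.angle)."""
    neg_view = tuple(-c for c in view)
    return dot(cone_normal, neg_view) < -math.sin(cone_angle)


def pack_snorm8(value):
    """Quantize a value in [-1, 1] to a signed 8-bit integer in [-127, 127]."""
    clamped = max(-1.0, min(1.0, value))
    return int(round(clamped * 127.0))


if __name__ == "__main__":
    view = (0.0, 0.0, 1.0)  # camera looks down +Z toward the cluster
    print(cone_culled((0.0, 0.0, 1.0), math.radians(20), view))   # faces away
    print(cone_culled((0.0, 0.0, -1.0), math.radians(20), view))  # faces camera
```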
Cluster Culling
 64 is convenient on consoles
 Opens up intrinsic optimizations
 Not optimal, as the CP bottlenecks on too many draws
 Not LDS bound
 256 seems to be the sweet spot
 More vertex reuse
 Fewer atomic operations
 Larger than 256?
 2x VGTs alternate back and forth (256 triangles)
 Vertex re-use does not survive the flip
Cluster Culling
 Coarse reject clusters of triangles [4]
 Cull against:
 View (Bounding Cone)
 Frustum (Bounding Sphere)
 Hi-Z Depth (Screen Space Bounding Box)
 Be careful of perspective distortion! [22]
 Spheres become ellipsoids under projection
Draw Compaction
Compaction
At 133us - Efficiency drops as we hit a string of empty draws
At 151us - 10us of idle time
Compaction
Count = Min(MaxCommandCount, pCountBuffer)
Compaction
 Parallel Reduction
 Keep > 0 Count Args
We can do better!
Compaction
 Parallel prefix sum to the rescue!
0 0 0 1 2 3 3 4 4 5 5 6 7 8 8 9
Compaction
 __XB_Ballot64
 Produce a 64 bit mask
 Each bit is an evaluated predicate per wavefront thread
0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
__XB_Ballot64(threadId & 1)
Compaction
1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1
1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0
1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0
__XB_Ballot64(indexCount > 0)
&
=
Thread 5 Execution Mask
Thread 5
Population Count “popcnt” = 3
Compaction
 V_MBCNT_LO_U32_B32 [5]
 Masked bit count of the lower 32 threads (0-31)
 V_MBCNT_HI_U32_B32 [5]
 Masked bit count of the upper 32 threads (32-63)
 For each thread, returns the # of active threads which come before it.
Compaction
1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1
0 0 0 1 2 3 3 4 4 5 5 6 7 8 8 9
__XB_MBCNT64(__XB_Ballot64(indexCount > 0))
0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
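The ballot and masked bit count can be emulated on the CPU to see how each surviving lane finds its output slot. This is a sketch over 16 lanes using the predicate pattern from the example above; `ballot` and `mbcnt` are plain Python stand-ins written for illustration, not the console intrinsics, and the index counts are made-up values that only matter as zero or non-zero.

```python
def ballot(predicates):
    """Pack one boolean per lane into an integer mask (lane 0 is bit 0)."""
    mask = 0
    for lane, flag in enumerate(predicates):
        if flag:
            mask |= 1 << lane
    return mask


def mbcnt(mask, lane):
    """Count the set bits in `mask` that belong to lanes below `lane`."""
    below = mask & ((1 << lane) - 1)
    return bin(below).count("1")


if __name__ == "__main__":
    # Hypothetical per-lane index counts; zero means the draw is empty.
    index_counts = [3, 0, 0, 6, 3, 9, 0, 3, 0, 3, 0, 6, 3, 3, 0, 3]
    mask = ballot(count > 0 for count in index_counts)
    for lane, count in enumerate(index_counts):
        if count > 0:
            print("lane", lane, "-> slot", mbcnt(mask, lane))
```

Only lanes whose predicate is true write anything, so the value computed for a culled lane is never used.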
Compaction
 No more barriers!
 Atomic to sync multiple
wavefronts
 Read lane to replicate
global slot to all threads
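Across several wavefronts, each wave needs only one reservation against a shared counter. The sketch below simulates that serially: `global_counter` stands in for a counter bumped by one atomic add per wave, and `base` for the value that would be broadcast to every lane. `compact_draws` is a hypothetical helper, and on real hardware the order in which waves reserve space is not guaranteed the way it is in this loop.

```python
def compact_draws(index_counts, wave_size=64):
    """Return the indices of non-empty draws, compacted wave by wave."""
    output = []
    global_counter = 0  # stand-in for a counter updated with one atomic per wave
    for start in range(0, len(index_counts), wave_size):
        wave = index_counts[start:start + wave_size]
        keep = [count > 0 for count in wave]
        survivors = sum(keep)  # population count of the ballot mask
        # One lane reserves space; the returned base is shared with all lanes.
        base = global_counter
        global_counter += survivors
        output.extend([None] * survivors)
        lane_offset = 0  # plays the role of the masked bit count
        for lane in range(len(wave)):
            if keep[lane]:
                output[base + lane_offset] = start + lane
                lane_offset += 1
    return output


if __name__ == "__main__":
    print(compact_draws([3, 0, 0, 6, 0, 9], wave_size=2))
```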
Triangle Culling
Per-Triangle Culling
 Each thread in a wavefront processes 1 triangle
 Cull masks are balloted and counted to determine compaction index
 Maintain vertex reuse across a wavefront
 Maintain vertex reuse across all wavefronts - ds_ordered_count [5][15]
 +0.1ms for ~3906 work items – use wavefront limits
Per-Triangle Culling
For Each Triangle:
Unpack Index and Vertex Data (16 bit)
Orientation and Zero Area Culling (2DH)
Scalar Branch (!culled)
Perspective Divide (xyz/w)
Small Primitive Culling (NDC)
Scalar Branch (!culled)
Frustum Culling (NDC)
Scalar Branch (!culled)
Depth Culling – Hi-Z (NDC)
Count Number of Surviving Indices (__XB_BALLOT64)
Compact Index Stream (Preserving Ordering) (__XB_MBCNT64)
Reserve Output Space for Surviving Indices (__XB_GdsOrderedCount, Optional)
Write out Surviving Indices (16 bit)
Per-Triangle Culling
 Without ballot
 Compiler generates two tests for most if-statements
 1) One or more threads enter the if-statement
 2) Optimization where no threads enter the if-statement
 With ballot (or high level any/all/etc.), or if branching on a scalar value (__XB_MakeUniform)
 Compiler only generates case #2
 Skips extra control flow logic to handle divergence
 Use ballot to force uniform branching and avoid divergence
 No harm letting all threads execute the full sequence of culling tests
Orientation Culling
Triangle Orientation and Zero Area (2DH)
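A common way to test orientation in 2D homogeneous coordinates [3] is the sign of the determinant of the three (x, y, w) vertex rows, which works before the perspective divide. The sketch below is one such formulation, not necessarily the one used in the talk. It assumes counter-clockwise front faces produce a positive determinant, and that sign flips with the winding and viewport conventions in use. `det3` and `orientation_culled` are hypothetical helpers.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as three row tuples."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def orientation_culled(v0, v1, v2):
    """v0..v2 are clip-space (x, y, z, w) vertices; True means culled.

    Assumed convention: front faces give a positive determinant, so a
    non-positive value is back-facing or has zero area.
    """
    rows = [(v[0], v[1], v[3]) for v in (v0, v1, v2)]
    return det3(rows) <= 0.0


if __name__ == "__main__":
    a, b, c = (0, 0, 0, 1), (1, 0, 0, 1), (0, 1, 0, 1)
    print(orientation_culled(a, b, c))  # front-facing under the assumed winding
    print(orientation_culled(a, c, b))  # reversed winding
```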
Patch Orientation Culling
Small Primitive Culling
Rasterizer Efficiency
16 pixels / clock
100% Efficiency
1 pixel / clock
6.25% Efficiency
12 pixels / clock
75% Efficiency
Vi
Vj
Small Primitive Culling (NDC)
 This triangle is not culled because it encloses a
pixel center
any(round(min) == round(max))
Small Primitive Culling (NDC)
 This triangle is culled because it does not
enclose a pixel center
any(round(min) == round(max))
Small Primitive Culling (NDC)
 This triangle is culled because it does not
enclose a pixel center
any(round(min) == round(max))
Small Primitive Culling (NDC)
 This triangle is not culled because the bounding
box min and max snap to different coordinates
 This triangle should be culled, but accounting
for this case is not worth the cost
any(round(min) == round(max))
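The `any(round(min) == round(max))` test can be tried on the CPU. The sketch below assumes the bounding box is given in normalized [0, 1] screen coordinates, that pixel centers sit at half-integer positions, and that rounding means round-half-up (written as `floor(x + 0.5)`). `small_primitive_culled` is a hypothetical helper.

```python
import math


def small_primitive_culled(bbox_min, bbox_max, viewport_size):
    """True when the box snaps to the same pixel coordinate on either axis."""
    for axis in range(2):
        lo = bbox_min[axis] * viewport_size[axis]
        hi = bbox_max[axis] * viewport_size[axis]
        # Equal rounded values mean no pixel center lies between lo and hi.
        if math.floor(lo + 0.5) == math.floor(hi + 0.5):
            return True
    return False


if __name__ == "__main__":
    viewport = (100, 100)
    # Spans x in [10.6, 11.4] pixels: misses the centers at 10.5 and 11.5.
    print(small_primitive_culled((0.106, 0.502), (0.114, 0.537), viewport))
    # Spans [10.4, 10.6] x [50.4, 50.6]: encloses the center at (10.5, 50.5).
    print(small_primitive_culled((0.104, 0.504), (0.106, 0.506), viewport))
```

The test works on the bounding box rather than the triangle, which is why a thin diagonal triangle whose box straddles a pixel center survives even though it covers no sample.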
Frustum Culling
Frustum Culling (NDC)
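[Figure: the visible region spans (0,0) to (1,1). A triangle's bounding box (Min, Max) lies outside it, and the triangle is culled, when Max.X < 0, Min.X > 1, Max.Y < 0, or Min.Y > 1.]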
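The four comparisons on this slide translate directly into code. The sketch below assumes vertex positions have already been divided by w and remapped so the visible region is [0, 1] on both axes; `bounding_box` and `frustum_culled` are hypothetical helpers, and triangles crossing the near plane would need separate handling that is not shown.

```python
def bounding_box(points):
    """Return (min, max) corners of 2D points."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    return (min(xs), min(ys)), (max(xs), max(ys))


def frustum_culled(bbox_min, bbox_max):
    """True when the box lies entirely outside the [0, 1] x [0, 1] region."""
    return (bbox_min[0] > 1.0 or bbox_min[1] > 1.0 or
            bbox_max[0] < 0.0 or bbox_max[1] < 0.0)


if __name__ == "__main__":
    inside = bounding_box([(0.2, 0.2), (0.4, 0.2), (0.3, 0.5)])
    outside = bounding_box([(1.2, 0.2), (1.4, 0.2), (1.3, 0.5)])
    print(frustum_culled(*inside), frustum_culled(*outside))
```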
Depth Culling
Depth Tile Culling (NDC)
 Another available culling approach is to do manual depth testing
 Perform an LDS optimized parallel reduction [9], storing out the conservative depth
value for each tile
16x16 Tiles
Depth Tile Culling (NDC)
 ~41us on XB1 @ 1080p
 Bypasses LDS storage
 Bandwidth bound
 Shared with our light tile
culling
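What the tile reduction computes can be shown with a serial loop. This sketch is not the LDS-optimized parallel reduction the slides refer to. It assumes a row-major depth list where larger values are farther away, so the conservative value per tile is the maximum. `tile_max_depth` is a hypothetical helper.

```python
def tile_max_depth(depth, width, height, tile=16):
    """Return the farthest depth per tile, row-major, for a row-major buffer."""
    tiles_x = (width + tile - 1) // tile
    tiles_y = (height + tile - 1) // tile
    result = [0.0] * (tiles_x * tiles_y)
    for y in range(height):
        for x in range(width):
            t = (y // tile) * tiles_x + (x // tile)
            result[t] = max(result[t], depth[y * width + x])
    return result


if __name__ == "__main__":
    width, height = 32, 16
    depth = [0.25] * (width * height)
    depth[3 * width + 20] = 0.9  # one far pixel inside the second tile
    print(tile_max_depth(depth, width, height))
```

A triangle whose nearest depth is farther than the stored value for every tile it touches can be rejected without reading the full-resolution depth buffer.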
Depth Pyramid Culling (NDC)
 Another approach to depth culling is a hierarchical Z pyramid [10][11][23]
 Populate the Hi-Z pyramid after depth laydown
 Construct a mip-mapped screen resolution texture
 Culling can be done by comparing the depth of a bounding volume with the depth stored in the Hi-Z
pyramid
int mipMapLevel = min(ceil(log2(max(longestEdge, 1.0f))), levels - 1);
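The mip selection line above can be checked on the CPU. The sketch mirrors that expression and adds an occlusion comparison, assuming the longest edge is measured in pixels, larger depth means farther, and each pyramid texel stores the farthest depth of the region it covers. `hiz_mip_level` and `hiz_occluded` are hypothetical helpers.

```python
import math


def hiz_mip_level(longest_edge, levels):
    """Pick the mip where the bounding rectangle covers only a few texels."""
    return int(min(math.ceil(math.log2(max(longest_edge, 1.0))), levels - 1))


def hiz_occluded(nearest_depth, pyramid_samples):
    """True when the volume's nearest depth is behind every sampled texel."""
    return all(nearest_depth > sample for sample in pyramid_samples)


if __name__ == "__main__":
    for edge in (0.5, 16.0, 17.0, 5000.0):
        print(edge, "->", hiz_mip_level(edge, levels=11))
    print(hiz_occluded(0.8, [0.5, 0.6, 0.55, 0.7]))
```

Clamping the edge length to at least 1 keeps the logarithm non-negative, and clamping to `levels - 1` keeps very large rectangles on the coarsest mip.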
AMD GCN HTILE
 Depth acceleration meta data called HTILE [6][7]
 Every group of 8x8 pixels has a 32bit meta data block
 Can be decoded manually in a shader and used for 1 test -> 64 pixel rejection
 Avoids slow hardware decompression or resummarize
 Avoids losing Hi-Z on later depth enabled render passes
DEPTH HTILE
AMD GCN HTILE
AMD GCN HTILE
DS_SWIZZLE_B32 [5]
V_READLANE_B32 [5]
AMD GCN HTILE
 Manually encode; skip the resummarize on half resolution depth!
 HTILE encodes both near and far depth for each 8x8 pixel tile.
 Stencil Enabled = 14 bit near value, and a 6 bit delta towards far plane
 Stencil Disabled = MinMax depth encoded in 2x 14 bit UNORM pairs
Software Z
 One problem with using depth for culling is availability
 Many engines do not have a full Z pre-pass
 Restricts asynchronous compute scheduling
 Wait for Z buffer laydown
 You can load the Hi-Z pyramid with software Z!
 In Frostbite since Battlefield 3 [12]
 Done on the CPU for the upcoming GPU frame
 No latency
 You can prime HTILE!
 Full Z pre-pass
 Minimal cost
Batching and Perf
Batching
 Fixed memory budget of N buffers * 128k triangles
 128k triangles = 384k indices = 768 KB
 3 MB of memory usage, for up to 524288 surviving triangles in flight
[Diagram: two sets of four output buffers, each buffer holding 128k triangles (768KB), with each set feeding a Render]
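The memory budget follows from the index size. This sketch assumes 16-bit indices, three per triangle, and four output buffers (the count that gives the 3 MB figure); `batch_memory` is a hypothetical helper.

```python
def batch_memory(num_buffers, triangles_per_buffer=128 * 1024, index_bytes=2):
    """Return (bytes per buffer, total bytes, triangles in flight)."""
    indices_per_buffer = triangles_per_buffer * 3
    bytes_per_buffer = indices_per_buffer * index_bytes
    return (bytes_per_buffer,
            bytes_per_buffer * num_buffers,
            triangles_per_buffer * num_buffers)


if __name__ == "__main__":
    per_buffer, total, triangles = batch_memory(4)
    print(per_buffer // 1024, "KB per buffer,",
          total // (1024 * 1024), "MB total,", triangles, "triangles")
```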
Batching
Mesh Section (20k tri)
Mesh Section (34k tri)
Mesh Section (4k tri)
Mesh Section (20k tri)
Mesh Section (70k tri)
Culling (434k triangles)
…
434k / 512k capacity
Output #0
Output #1
Output #2
Output #0
Output #1
Output #2
Render #1
Render #0
Output #3
Output #3
Culling (546k triangles)
Batching
Mesh Section (20k tri)
Mesh Section (34k tri)
Mesh Section (4k tri)
Mesh Section (20k tri)
Mesh Section (70k tri)
…
546k / 512k capacity
Output #0
Output #1
Output #2
Render #0,0
Output #0
Output #1
Output #2
Render #1
Output #0
Render #0,1
Output #3
Output #3
Batching
[Timeline: Dispatch #0 runs alone (Startup Cost). Render #0 then overlaps Dispatch #1, Render #1 overlaps Dispatch #2, Render #2 overlaps Dispatch #3, and Render #3 finishes the sequence.]
 Overlapping culling and render on the graphics pipe is great
 But there is a high startup cost for dispatch #0 (no graphics to overlap)
 If only there were something we could use….
Batching
 Asynchronous compute to the rescue!
 We can launch the dispatch work alongside other GPU work in the frame
 Water simulation, physics, cloth, virtual texturing, etc.
 This can slow down “Other GPU Stuff” a bit, but overall frame is faster!
 Just be careful about what you schedule culling with
 We wait on lightweight label operations to ensure that dispatch and render are pipelined correctly
[Timeline: Dispatch #0 now overlaps Other GPU Stuff. Render #0 overlaps Dispatch #1, Render #1 overlaps Dispatch #2, Render #2 overlaps Dispatch #3, and Render #3 finishes the sequence.]
Performance
443,429 triangles @ 1080p
171 unique PSOs
Performance
Filter        Exclusively Culled    Inclusively Culled
Orientation   46%   204,006         46%   204,006
Depth*        42%   187,537         20%    90,251
Small*        30%   128,705          8%    37,606
Frustum*       8%    35,182          4%    16,162
* Scene Dependent

Processed    100%   443,429
Culled        78%   348,025
Rendered      22%    95,404
Performance
No Tessellation
443,429 triangles @ 1080p, 171 unique PSOs, No Cluster Culling

Platform      Base      Synchronous (Cull / Draw / Total)    Asynchronous (Cull / Draw / Total)
XB1 (DRAM)    5.47ms    0.24ms / 4.54ms / 4.78ms             0.26ms / 4.56ms / 4.56ms
PS4 (GDDR5)   4.56ms    0.13ms / 3.76ms / 3.89ms             0.15ms / 3.80ms / 3.80ms
PC (Fury X)   0.79ms    0.06ms / 0.47ms / 0.53ms             0.06ms / 0.47ms / 0.47ms
Performance
Tessellation Factor 1-7 (Adaptive Phong)
443,429 triangles @ 1080p, 171 unique PSOs, No Cluster Culling

Platform      Base      Synchronous (Cull / Draw / Total)    Asynchronous (Cull / Draw / Total)
XB1 (DRAM)    19.3ms    0.24ms / 11.1ms / 11.3ms             0.26ms / 11.2ms / 11.2ms
PS4 (GDDR5)   12.8ms    0.13ms / 8.08ms / 8.21ms             0.15ms / 8.10ms / 8.10ms
PC (Fury X)   3.01ms    0.06ms / 0.64ms / 0.70ms             0.06ms / 0.64ms / 0.64ms
Future Work
 Reuse results between multiple passes
 Once for all shadow cascades
 Depth, gbuffer, emissive, forward, reflection
 Cube maps – load once, cull each side
 Xbox One supports switching PSOs with ExecuteIndirect
 Single submitted batch!
 Further reduce bottlenecks
 Move more and more CPU rendering logic to GPU
 Improve asynchronous scheduling
Future Work
 Instancing optimizations
 Each instance (re)loads vertex data
 Synchronous dispatch
 Near 100% L2$ hit
 ALU bound on render - 24 VGPRs, measured occupancy of 8
 1.5 bytes bandwidth usage per triangle
 Asynchronous dispatch
 Low L2$ residency - other render work between culling and render
 VMEM bound on render
 20 bytes bandwidth usage per triangle
Future Work
 Maximize bandwidth and throughput
 Load data into LDS chunks, bandwidth amplification
 Partition data into per-chunk index buffers
 Evaluate all instances
 More tuning of wavefront limits and CU masking
Hardware Tessellation
Hardware Tessellation
Input Assembler
Vertex (Local) Shader
Hull Shader
Tessellator
Domain (Vertex) Shader
Rasterizer
Pixel Shader
Output Merger
• Tessellation Factors
• Silhouette Orientation
• Back Face Culling
• Frustum Culling
• Coarse Culling (Hi-Z)
Hardware Tessellation
Input Assembler
Vertex (Local) Shader
Hull (Pass-through) Shader
Tessellator
Domain (Vertex) Shader
Rasterizer
Pixel Shader
Output Merger
Load Final Factors
Mesh Data
• Tessellation Factors
• Silhouette Orientation
• Back Face Culling
• Frustum Culling
• Coarse Culling (Hi-Z)
Compute Shader
Hardware Tessellation
Mesh Data
Compute Shader
Structured Work Queue #1
(Patches with factor [1…1])
Tessellation Factors
Structured Work Queue #2
(Patches with factor [2…7])
Tessellation Factors
Structured Work Queue #3
(Patches with factor [8…N])
Tessellation Factors
Patches with factor 0 (culled) are not
processed further, and do not get
inserted to any work queue.
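The routing of patches into the three queues can be sketched as a simple bucketing step. This assumes one integer tessellation factor per patch, with 0 meaning culled; `bucket_patches` and the queue names are hypothetical.

```python
def bucket_patches(factors):
    """Route patch ids into work queues by tessellation factor."""
    queues = {"factor_1": [], "factor_2_to_7": [], "factor_8_plus": []}
    for patch_id, factor in enumerate(factors):
        if factor <= 0:
            continue  # culled patches are not inserted into any queue
        if factor == 1:
            queues["factor_1"].append(patch_id)
        elif factor <= 7:
            queues["factor_2_to_7"].append(patch_id)
        else:
            queues["factor_8_plus"].append(patch_id)
    return queues


if __name__ == "__main__":
    print(bucket_patches([0, 1, 3, 7, 8, 16, 1, 0]))
```

Splitting by expansion factor lets each queue be drawn with the path that suits it, for example skipping the tessellator entirely for factor 1.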
Hardware Tessellation
Structured Work Queue #1
(Patches with factor [1…1])
Tessellation Factors
Structured Work Queue #2
(Patches with factor [2…7])
Tessellation Factors
Structured Work Queue #3
(Patches with factor [8…N])
Tessellation Factors
Compute Shader
Patch SubD 1 -> 4
Tessellation Factor 1/4
Tessellated Draw
Non-Tessellated Draw
Low Expansion Factor
GCN Friendly ☺
High Expansion Factor
GCN Unfriendly ☹
No Expansion Factor
Avoid Tessellator!
Summary
 Small and inefficient draws are a problem
 Compute and graphics are friends
 Use all the available GPU resources
 Asynchronous compute is extremely powerful
 Lots of cool GCN instructions available
 Check out AMD GPUOpen GeometryFX [20]
MAKE RASTERIZATION
GREAT AGAIN!
Acknowledgements
 Matthäus Chajdas (@NIV_Anteru)
 Ivan Nevraev (@Nevraev)
 Alex Nankervis
 Sébastien Lagarde (@SebLagarde)
 Andrew Goossen
 James Stanard (@JamesStanard)
 Martin Fuller (@MartinJIFuller)
 David Cook
 Tobias “GPU Psychiatrist” Berghoff (@TobiasBerghoff)
 Christina Coffin (@ChristinaCoffin)
 Alex “I Hate Polygons” Evans (@mmalex)
 Rob Krajcarski
 Jaymin “SHUFB 4 LIFE” Kessler (@okonomiyonda)
 Tomasz Stachowiak (@h3r2tic)
 Andrew Lauritzen (@AndrewLauritzen)
 Nicolas Thibieroz (@NThibieroz)
 Johan Andersson (@repi)
 Alex Fry (@TheFryster)
 Jasper Bekkers (@JasperBekkers)
 Graham Sellers (@grahamsellers)
 Cort Stratton (@postgoodism)
 David Simpson
 Jason Scanlin
 Mike Arnold
 Mark Cerny (@cerny)
 Pete Lewis
 Keith Yerex
 Andrew Butcher (@andrewbutcher)
 Matt Peters
 Sebastian Aaltonen (@SebAaltonen)
 Anton Michels
 Louis Bavoil (@LouisBavoil)
 Yury Uralsky
 Sebastien Hillaire (@SebHillaire)
 Daniel Collin (@daniel_collin)
References
 [1] “The AMD GCN Architecture – A Crash Course” – Layla Mah
 [2] “Clipping Using Homogenous Coordinates” – Jim Blinn, Martin Newell
 [3] "Triangle Scan Conversion using 2D Homogeneous Coordinates“ - Marc Olano, Trey Greer
 [4] “GPU-Driven Rendering Pipelines” – Ulrich Haar, Sebastian Aaltonen
 [5] “Southern Islands Series Instruction Set Architecture” – AMD
 [6] “Radeon Southern Islands Acceleration” – AMD
 [7] “Radeon Evergreen / Northern Islands Acceleration” - AMD
 [8] “GCN Architecture Whitepaper” - AMD
 [9] “Optimizing Parallel Reduction In CUDA” – Mark Harris
 [10] “Hierarchical-Z Map Based Occlusion Culling” – Daniel Rákos
 [11] “Hierarchical Z-Buffer Occlusion Culling” – Nick Darnell
 [12] “Culling the Battlefield: Data Oriented Design in Practice” – Daniel Collin
 [13] “The Rendering Pipeline – Challenges & Next Steps” – Johan Andersson
 [14] “GCN Performance Tweets” – AMD
 [15] “Learning from Failure: … Abandoned Renderers For Dreams PS4 …” – Alex Evans
 [16] “Patch Based Occlusion Culling For Hardware Tessellation” - Matthias Nießner, Charles Loop
 [17] “Tessellation In Call Of Duty: Ghosts” – Wade Brainerd
 [18] “MiniEngine Framework” – Alex Nankervis, James Stanard
 [19] “Optimal Bounding Cones of Vectors in Three Dimensions” – Gill Barequet, Gershon Elber
 [20] “GPUOpen GeometryFX” – AMD
 [21] “Sample Distribution Shadow Maps” – Andrew Lauritzen
 [22] “2D Polyhedral Bounds of a Clipped, Perspective-Projected 3D Sphere” – Mara and McGuire
 [23] “Practical, Dynamic Visibility for Games” - Stephen Hill
Thank You!
graham@frostbite.com
Questions?
Twitter - @gwihlidal
“If you’ve been struggling with a
tough ol’ programming problem all
day, maybe go for a walk. Talk to a
tree. Trust me, it helps.”
- Bob Ross, Game Dev
Instancing Optimizations
 Can do a fast bitonic sort of the instancing buffer for optimal
front-to-back order
 Utilize DS_SWIZZLE_B32
 Swizzles input thread data based on offset mask
 Data sharing within 32 consecutive threads
 Only 32 bit, so can efficiently sort 32 elements
 You could do clustered sorting
 Sort each cluster’s instances (within a thread)
 Sort the 32 clusters
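A bitonic sorting network over 32 keys can be sketched on the CPU to show the compare-exchange pattern. Each step pairs element `i` with element `i XOR j`, which is the kind of fixed lane exchange a swizzle within 32 threads could provide; that mapping is an assumption of this sketch and is not spelled out on the slides. `bitonic_sort` is a hypothetical helper and the keys stand in for per-instance depth.

```python
def bitonic_sort(values):
    """Sort a list in place, ascending; the length must be a power of two."""
    n = len(values)
    if n & (n - 1) != 0:
        raise ValueError("length must be a power of two")
    k = 2
    while k <= n:          # size of the bitonic sequences being merged
        j = k // 2
        while j > 0:       # distance between compared elements
            for i in range(n):
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    if (values[i] > values[partner]) == ascending:
                        values[i], values[partner] = values[partner], values[i]
            j //= 2
        k *= 2
    return values


if __name__ == "__main__":
    keys = [(i * 37) % 32 for i in range(32)]  # a shuffled permutation of 0..31
    print(bitonic_sort(keys))
```

Every pass does the same comparisons regardless of the data, which is what makes the network a good fit for lockstep execution across a wavefront.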
Ad

More Related Content

What's hot (20)

Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)
Tiago Sousa
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
AMD Developer Central
 
Moving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based RenderingMoving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based Rendering
Electronic Arts / DICE
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
guest11b095
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility buffer
Wolfgang Engel
 
Physically Based Sky, Atmosphere and Cloud Rendering in Frostbite
Physically Based Sky, Atmosphere and Cloud Rendering in FrostbitePhysically Based Sky, Atmosphere and Cloud Rendering in Frostbite
Physically Based Sky, Atmosphere and Cloud Rendering in Frostbite
Electronic Arts / DICE
 
Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666
Tiago Sousa
 
Frostbite on Mobile
Frostbite on MobileFrostbite on Mobile
Frostbite on Mobile
Electronic Arts / DICE
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
repii
 
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
Philip Hammer
 
Physically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in FrostbitePhysically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in Frostbite
Electronic Arts / DICE
 
Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)
Tiago Sousa
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics Technology
Tiago Sousa
 
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
repii
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2
Philip Hammer
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
mistercteam
 
Screen Space Reflections in The Surge
Screen Space Reflections in The SurgeScreen Space Reflections in The Surge
Screen Space Reflections in The Surge
Michele Giacalone
 
CryENGINE 3 Rendering Techniques
CryENGINE 3 Rendering TechniquesCryENGINE 3 Rendering Techniques
CryENGINE 3 Rendering Techniques
Tiago Sousa
 
Lighting the City of Glass
Lighting the City of GlassLighting the City of Glass
Lighting the City of Glass
Electronic Arts / DICE
 
Terrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable SystemTerrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable System
Electronic Arts / DICE
 
Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)Graphics Gems from CryENGINE 3 (Siggraph 2013)
Graphics Gems from CryENGINE 3 (Siggraph 2013)
Tiago Sousa
 
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil PerssonLow-level Shader Optimization for Next-Gen and DX11 by Emil Persson
Low-level Shader Optimization for Next-Gen and DX11 by Emil Persson
AMD Developer Central
 
Moving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based RenderingMoving Frostbite to Physically Based Rendering
Moving Frostbite to Physically Based Rendering
Electronic Arts / DICE
 
A Bit More Deferred Cry Engine3
A Bit More Deferred   Cry Engine3A Bit More Deferred   Cry Engine3
A Bit More Deferred Cry Engine3
guest11b095
 
Triangle Visibility buffer
Triangle Visibility bufferTriangle Visibility buffer
Triangle Visibility buffer
Wolfgang Engel
 
Physically Based Sky, Atmosphere and Cloud Rendering in Frostbite
Physically Based Sky, Atmosphere and Cloud Rendering in FrostbitePhysically Based Sky, Atmosphere and Cloud Rendering in Frostbite
Physically Based Sky, Atmosphere and Cloud Rendering in Frostbite
Electronic Arts / DICE
 
Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666Siggraph2016 - The Devil is in the Details: idTech 666
Siggraph2016 - The Devil is in the Details: idTech 666
Tiago Sousa
 
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
Parallel Graphics in Frostbite - Current & Future (Siggraph 2009)
repii
 
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
The Rendering Technology of 'Lords of the Fallen' (Game Connection Europe 2014)
Philip Hammer
 
Physically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in FrostbitePhysically Based and Unified Volumetric Rendering in Frostbite
Physically Based and Unified Volumetric Rendering in Frostbite
Electronic Arts / DICE
 
Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)Rendering Technologies from Crysis 3 (GDC 2013)
Rendering Technologies from Crysis 3 (GDC 2013)
Tiago Sousa
 
Secrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics TechnologySecrets of CryENGINE 3 Graphics Technology
Secrets of CryENGINE 3 Graphics Technology
Tiago Sousa
 
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
Terrain Rendering in Frostbite using Procedural Shader Splatting (Siggraph 2007)
repii
 
Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2Bindless Deferred Decals in The Surge 2
Bindless Deferred Decals in The Surge 2
Philip Hammer
 
Dx11 performancereloaded
Dx11 performancereloadedDx11 performancereloaded
Dx11 performancereloaded
mistercteam
 
Screen Space Reflections in The Surge
Screen Space Reflections in The SurgeScreen Space Reflections in The Surge
Screen Space Reflections in The Surge
Michele Giacalone
 
CryENGINE 3 Rendering Techniques
CryENGINE 3 Rendering TechniquesCryENGINE 3 Rendering Techniques
CryENGINE 3 Rendering Techniques
Tiago Sousa
 
Terrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable SystemTerrain in Battlefield 3: A Modern, Complete and Scalable System
Terrain in Battlefield 3: A Modern, Complete and Scalable System
Electronic Arts / DICE
 



Optimizing the Graphics Pipeline with Compute, GDC 2016

  • 1. Optimizing the Graphics Pipeline with Compute Graham Wihlidal Sr. Rendering Engineer, Frostbite
  • 2. Acronyms  Optimizations and algorithms presented are AMD GCN-centric [1][8] VGT Vertex Grouper Tessellator PA Primitive Assembly CP Command Processor IA Input Assembly SE Shader Engine CU Compute Unit LDS Local Data Share HTILE Hi-Z Depth Compression GCN Graphics Core Next SGPR Scalar General-Purpose Register VGPR Vector General-Purpose Register ALU Arithmetic Logic Unit SPI Shader Processor Interpolator
  • 7. 12 CU * 64 ALU * 2 FLOPs 1,536 ALU ops / cy 18 CU * 64 ALU * 2 FLOPs 2,304 ALU ops / cy 64 CU * 64 ALU * 2 FLOPs 8,192 ALU ops / cy
  • 8. 1,536 ALU ops / 2 engines 768 ALU ops per triangle 2,304 ALU ops / 2 engines 1,017 ALU ops per triangle 8,192 ALU ops / 4 engines 2,048 ALU ops per triangle
  • 9. 768 ALU ops / 2 ALU per cy = 384 instruction limit 1,017 ALU ops / 2 ALU per cy = 508 instruction limit 2,048 ALU ops / 2 ALU per cy = 1024 instruction limit
  • 10. Can anyone here cull a triangle in less than 384 instructions on Xbox One? … I sure hope so ☺
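The budget on slides 7 to 9 is plain arithmetic, sketched below. The function name and parameters are hypothetical; the constants (64 ALUs per CU, 2 ops per ALU per clock, one triangle per engine per clock) are taken from the slides. Only the 12 CU and 64 CU rows are reproduced: the same arithmetic on the 18 CU row gives 1,152 and 576 rather than the 1,017 and 508 printed on the slide, and the slide does not say what accounts for the difference.

```python
# Per-triangle ALU budget, following the arithmetic on slides 7-9.
# Assumptions: 64 ALUs per CU, 2 ops per ALU per clock, and one triangle
# per engine per clock. Function and parameter names are hypothetical.

def instruction_budget(compute_units, engines):
    """Return (ALU ops per clock, ALU ops per triangle, instruction limit)."""
    alu_ops_per_clock = compute_units * 64 * 2
    alu_ops_per_triangle = alu_ops_per_clock // engines
    instruction_limit = alu_ops_per_triangle // 2
    return alu_ops_per_clock, alu_ops_per_triangle, instruction_limit

print(instruction_budget(compute_units=12, engines=2))  # (1536, 768, 384)
print(instruction_budget(compute_units=64, engines=4))  # (8192, 2048, 1024)
```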
  • 11. Motivation – Death By 1000 Draws  DirectX 12 promised millions of draws!  Great CPU performance advancements  Low overhead  Power in the hands of (experienced) developers  Console hardware is a fixed target  GPU still chokes on tiny draws  Common to see 2nd half of base pass barely utilizing the GPU  Lots of tiny details or distant objects – most are Hi-Z culled  Still have to run mostly empty vertex wavefronts  More draws not necessarily a good thing
  • 12. Motivation – Death By 1000 Draws
  • 13. Motivation – Primitive Rate  Wildly optimistic to assume we get close to 2 prims per cy – Getting 0.9 prim / cy  If you are doing anything useful, you will be bound elsewhere in the pipeline  You need good balance and lucky scheduling between the VGTs and PAs  Depth of FIFO between VGT and PA  Need positions of a VS back in < 4096 cy, or reduces primitive rate  Some games hit close to peak perf (95+% range) in shadow passes  Usually slower regions in there due to large triangles  Coarse raster only does 1 super-tile per clock  Triangles with bounding rectangle larger than 32x32?  Multi-cycle on coarse raster, reduces primitive rate
  • 14. Motivation – Primitive Rate  Benchmarks that get 2 prims / cy (around 1.97) have these characteristics:  VS reads nothing  VS writes only SV_Position  VS always outputs 0.0f for position - Trivially cull all primitives  Index buffer is all 0s - Every vertex is a cache hit  Every instance is a multiple of 64 vertices – Less likely to have unfilled VS waves  No PS bound – No parameter cache usage  Requires that nothing after VS causes a stall  Parameter size <= 4 * PosSize  Pixels drain faster than they are generated  No scissoring occurs  PA can receive work faster than VS can possibly generate it  Often see tessellation achieve peak VS primitive throughput; one SE at a time
  • 15. Motivation – Opportunity  Coarse cull on CPU, refine on GPU  Latency between CPU and GPU prevents optimizations  GPGPU Submission!  Depth-aware culling  Tighten shadow bounds sample distribution shadow maps [21]  Cull shadow casters without contribution [4]  Cull hidden objects from color pass  VR late-latch culling  CPU submits conservative frustum and GPU refines  Triangle and cluster culling  Covered by this presentation
  • 16. Motivation – Opportunity  Maps directly to graphics pipeline  Offload tessellation hull shader work  Offload entire tessellation pipeline! [16][17]  Procedural vertex animation (wind, cloth, etc.)  Reusing results between multiple passes & frames  Maps indirectly to graphics pipeline  Bounding volume generation  Pre-skinning  Blend shapes  Generating GPU work from the GPU [4] [13]  Scene and visibility determination  Treat your draws as data!  Pre-build  Cache and reuse  Generate on GPU
  • 18. Culling Overview Scene  Consists of:  Collection of meshes  Specific view  Camera, light, etc.
  • 19. Culling Overview Batch  Configurable subset of meshes in a scene  Meshes within a batch share the same shader and strides (vertex/index)  Near 1:1 with DirectX 12 PSO (Pipeline State Object)
  • 20. Culling Overview Mesh Section  Represents an indexed draw call (triangle list)  Has its own:  Vertex buffer(s)  Index buffer  Primitive count  Etc.
  • 21. Culling Overview Work Item  Optimal number of triangles for processing in a wavefront  AMD GCN has 64 threads per wavefront  Each culling thread processes 1 triangle  Work item processes 256 triangles
  • 22. Culling Overview [Diagram: Scene → Batch → Mesh Section → Work Item; each work item is culled and produces Draw Args; Draw Call Compaction (No Zero Size Draws) leaves the surviving Draw Args for Multi Draw Indirect]
  • 23. Mapping Mesh ID to MultiDraw ID  Indirect draws no longer know the mesh section or instance they came from  Important for loading various constants, etc.  A DirectX 12 trick is to create a custom command signature  Allows for parsing a custom indirect arguments buffer format  We can store the mesh section id along with each draw argument block  PC drivers use compute shader patching  Xbox One has custom command processor microcode support  OpenGL has gl_DrawId which can be used for this  SPI Loads StartInstanceLocation into reserved SGPR and adds to SV_InstanceID  A fallback approach can be an instancing buffer with a step rate of 1 which maps from instance id to draw id
  • 24. Mapping Mesh ID to MultiDraw ID [Diagram: custom indirect argument block = Mesh Section Id, followed by Draw Args: Index Count Per Instance, Instance Count, Start Index Location, Base Vertex Location, Start Instance Location]
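As a rough illustration of the argument block on slide 24, the sketch below packs one record on the CPU side. The field order follows the slide; the 32-bit widths follow the DirectX 12 indexed draw arguments (with Base Vertex Location signed). The helper names and the choice of a single 32-bit mesh section id are assumptions, not something the slides specify.

```python
import struct

# Hypothetical record: a 32-bit mesh section id followed by the five 32-bit
# fields of an indexed draw argument block (Base Vertex Location is signed).
RECORD = struct.Struct("<IIIIiI")

def pack_draw(mesh_section_id, index_count, instance_count=1,
              start_index=0, base_vertex=0, start_instance=0):
    """Pack one indirect argument record as little-endian bytes."""
    return RECORD.pack(mesh_section_id, index_count, instance_count,
                       start_index, base_vertex, start_instance)

def unpack_draw(blob):
    """Return the six fields of a packed record as a tuple."""
    return RECORD.unpack(blob)

record = pack_draw(mesh_section_id=7, index_count=768, start_index=1536)
print(RECORD.size)          # 24
print(unpack_draw(record))  # (7, 768, 1, 1536, 0, 0)
```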
  • 25. De-Interleaved Vertex Buffers [Diagram: de-interleaved streams feeding a draw call (P0 P1 P2 P3 …, N0 N1 N2 N3 …, TC0 TC1 TC2 TC3 …) versus an interleaved stream (P0 N0 TC0 P1 N1 TC1 P2 N2 TC2 …); the de-interleaved layout is marked "Do This!"] De-interleaved vertex buffers are optimal on GCN architectures. They also make compute processing easier!
  • 26. De-Interleaved Vertex Buffers  Helpful for minimizing state changes for compute processing  Constant vertex position stride  Cleaner separation of volatile vs. non-volatile data  Lower memory usage overall  More optimal for regular GPU rendering  Evict cache lines as quickly as possible!
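A minimal sketch of the layout change, assuming each interleaved vertex is a (position, normal, texcoord) tuple; the function name is hypothetical. With positions in their own stream, a culling pass can read them at a constant stride without touching normals or texture coordinates.

```python
def deinterleave(vertices):
    """Split [(position, normal, texcoord), ...] into three separate streams."""
    positions = [v[0] for v in vertices]
    normals = [v[1] for v in vertices]
    texcoords = [v[2] for v in vertices]
    return positions, normals, texcoords

interleaved = [
    ((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), (0.0, 0.0)),
    ((1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 0.0)),
    ((0.0, 1.0, 0.0), (0.0, 0.0, 1.0), (0.0, 1.0)),
]
positions, normals, texcoords = deinterleave(interleaved)
print(positions)  # [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
```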
  • 28. Cluster Culling  Generate triangle clusters using spatially coherent bucketing in spherical coordinates  Optimize each triangle cluster to be cache coherent  Generate optimal bounding cone of each cluster [19]  Project normals on to the unit sphere  Calculate minimum enclosing circle  Diameter is the cone angle  Center is projected back to Cartesian for cone normal  Store cone in 8:8:8:8 SNORM  Cull if dot(cone.Normal, -view) < -sin(cone.angle)
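The cone test at the end of slide 28 can be written out directly. This is a sketch under stated assumptions: `view` is taken to be a unit vector pointing from the camera toward the cluster, the cone angle is in radians, and the 8:8:8:8 SNORM packing is left out. The function names are hypothetical.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def cone_culled(cone_normal, cone_angle, view):
    """Slide 28 test: cull if dot(cone.Normal, -view) < -sin(cone.angle).

    Assumptions: `view` is a unit vector from the camera toward the cluster
    and `cone_angle` is in radians.
    """
    neg_view = tuple(-c for c in view)
    return dot(cone_normal, neg_view) < -math.sin(cone_angle)

view = (0.0, 0.0, 1.0)
print(cone_culled((0.0, 0.0, -1.0), 0.5, view))  # False: cluster faces the camera
print(cone_culled((0.0, 0.0, 1.0), 0.5, view))   # True: cluster faces away
```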
  • 29. Cluster Culling  64 is convenient on consoles  Opens up intrinsic optimizations  Not optimal, as the CP bottlenecks on too many draws  Not LDS bound  256 seems to be the sweet spot  More vertex reuse  Fewer atomic operations  Larger than 256?  2x VGTs alternate back and forth (256 triangles)  Vertex re-use does not survive the flip
  • 30. Cluster Culling  Coarse reject clusters of triangles [4]  Cull against:  View (Bounding Cone)  Frustum (Bounding Sphere)  Hi-Z Depth (Screen Space Bounding Box)  Be careful of perspective distortion! [22]  Spheres become ellipsoids under projection
  • 32. Compaction At 133us: efficiency drops as we hit a string of empty draws. At 151us: 10us of idle time.
  • 34. Compaction  Parallel Reduction  Keep > 0 Count Args We can do better!
  • 35. Compaction  Parallel prefix sum to the rescue! 0 0 0 1 2 3 3 4 4 5 5 6 7 8 8 9
  • 36. Compaction  __XB_Ballot64  Produce a 64 bit mask  Each bit is an evaluated predicate per wavefront thread. Example: __XB_Ballot64(threadId & 1) = 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1
  • 37. Compaction __XB_Ballot64(indexCount > 0) = 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1; & Thread 5 Execution Mask = 1 1 1 1 1 0 0 0 0 0 0 0 0 0 0 0; result = 1 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0; Thread 5 Population Count “popcnt” = 3
  • 38. Compaction  V_MBCNT_LO_U32_B32 [5]  Masked bit count of the lower 32 threads (0-31)  V_MBCNT_HI_U32_B32 [5]  Masked bit count of the upper 32 threads (32-63)  For each thread, returns the # of active threads which come before it.
  • 39. Compaction __XB_MBCNT64(__XB_Ballot64(indexCount > 0)); thread: 0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15; ballot: 1 0 0 1 1 1 0 1 0 1 0 1 1 1 0 1; result: 0 0 0 1 2 3 3 4 4 5 5 6 7 8 8 9
  • 40. Compaction  No more barriers!  Atomic to sync multiple wavefronts  Read lane to replicate global slot to all threads
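The ballot and masked-bit-count steps on slides 36 to 40 can be emulated on the CPU to see how each surviving lane gets its output slot. This is a sketch only: `ballot`, `mbcnt`, and `compact` are hypothetical stand-ins for the intrinsics, and `global_count` plays the role of the atomic counter that synchronizes wavefronts. Only lanes whose predicate is true use their slot, so the values computed for the other lanes do not matter.

```python
def ballot(predicates):
    """Emulate a wave ballot: bit i is set when lane i's predicate is true."""
    mask = 0
    for lane, predicate in enumerate(predicates):
        if predicate:
            mask |= 1 << lane
    return mask

def mbcnt(mask, lane):
    """Emulate a masked bit count: set bits of `mask` in lanes below `lane`."""
    return bin(mask & ((1 << lane) - 1)).count("1")

def compact(index_counts, output, global_count):
    """Write lanes with index_count > 0 into `output`, preserving lane order.

    `output` is a dict keyed by slot. Returns the updated global count.
    """
    mask = ballot(count > 0 for count in index_counts)
    base = global_count  # on the GPU: one atomic add per wavefront
    for lane, count in enumerate(index_counts):
        if count > 0:
            output[base + mbcnt(mask, lane)] = (lane, count)
    return global_count + bin(mask).count("1")

# The 16-lane example from slides 37 and 39.
predicates = [1, 0, 0, 1, 1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 0, 1]
mask = ballot(predicates)
print(mbcnt(mask, 5))  # 3, the "popcnt" for thread 5 on slide 37
```

Surviving lanes receive consecutive slots with no barrier inside the wavefront; the only cross-wavefront cost is the single counter update.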
  • 42. Per-Triangle Culling  Each thread in a wavefront processes 1 triangle  Cull masks are balloted and counted to determine compaction index  Maintain vertex reuse across a wavefront  Maintain vertex reuse across all wavefronts - ds_ordered_count [5][15]  +0.1ms for ~3906 work items – use wavefront limits
  • 43. Per-Triangle Culling [Flowchart, For Each Triangle. Stages shown: Unpack Index and Vertex Data (16 bit); Orientation and Zero Area Culling (2DH); Small Primitive Culling (NDC); Frustum Culling (NDC); Count Number of Surviving Indices; Compact Index Stream (Preserving Ordering); Reserve Output Space for Surviving Indices; Write out Surviving Indices (16 bit); Depth Culling – Hi-Z (NDC); Perspective Divide (xyz/w). Scalar Branch (!culled) appears three times between stages. Intrinsics noted: __XB_GdsOrderedCount (Optional), __XB_MBCNT64, __XB_BALLOT64]
  • 44. Per-Triangle Culling  Without ballot  Compiler generates two tests for most if-statements  1) One or more threads enter the if-statement  2) Optimization where no threads enter the if-statement  With ballot (or high level any/all/etc.), or when branching on a scalar value (__XB_MakeUniform)  Compiler only generates case #2  Skips extra control flow logic to handle divergence  Use ballot to force uniform branching and avoid divergence  No harm letting all threads execute the full sequence of culling tests
  • 46. Triangle Orientation and Zero Area (2DH)
  • 51. Rasterizer Efficiency 16 pixels / clock = 100% efficiency; 1 pixel / clock = 6.25% efficiency; 12 pixels / clock = 75% efficiency
  • 53. Small Primitive Culling (NDC)  This triangle is not culled because it encloses a pixel center any(round(min) == round(max))
  • 54. Small Primitive Culling (NDC)  This triangle is culled because it does not enclose a pixel center any(round(min) == round(max))
  • 55. Small Primitive Culling (NDC)  This triangle is culled because it does not enclose a pixel center any(round(min) == round(max))
  • 56. Small Primitive Culling (NDC)  This triangle is not culled because the bounding box min and max snap to different coordinates  This triangle should be culled, but accounting for this case is not worth the cost any(round(min) == round(max))
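The test quoted on slides 53 to 56, any(round(min) == round(max)), can be sketched as follows. The assumptions are mine: the bounding box is already in pixel units, pixel centers sit at half-integer coordinates, and halves round up (the slides do not state the rounding mode). The function names are hypothetical.

```python
import math

def snap(value):
    """Round to the nearest integer; halves round up (assumed rounding mode)."""
    return math.floor(value + 0.5)

def small_primitive_culled(bbox_min, bbox_max):
    """Cull when min and max snap to the same coordinate on any axis.

    bbox_min and bbox_max are (x, y) in pixel units (an assumption).
    """
    return any(snap(lo) == snap(hi) for lo, hi in zip(bbox_min, bbox_max))

print(small_primitive_culled((0.6, 0.6), (1.4, 1.4)))  # True: no pixel center inside
print(small_primitive_culled((0.4, 0.4), (0.6, 0.6)))  # False: box spans center (0.5, 0.5)
```

As slide 56 notes, the test is conservative: a box that snaps to different coordinates is kept even when the triangle inside it misses every pixel center.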
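  • 61. Frustum Culling (NDC) [Diagram: unit square with corners (0,0), (1,0), (0,1), (1,1) and triangle bounding boxes (Min, Max) lying outside it] Cull when Min.Y > 1, Max.X < 0, Min.X > 1, or Max.Y < 0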
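The four comparisons on slide 61 amount to a bounding box rejection against the unit square. A minimal sketch, assuming the triangle's bounding box has already been mapped to the 0..1 range shown on the slide; the function name is hypothetical and the depth planes are not covered.

```python
def frustum_culled(bbox_min, bbox_max):
    """Cull when the bounding box lies entirely outside the [0, 1] square."""
    min_x, min_y = bbox_min
    max_x, max_y = bbox_max
    return min_x > 1.0 or min_y > 1.0 or max_x < 0.0 or max_y < 0.0

print(frustum_culled((1.2, 0.1), (1.5, 0.4)))   # True: entirely right of the view
print(frustum_culled((-0.2, 0.1), (0.3, 0.4)))  # False: straddles the left edge
```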
  • 65. Depth Tile Culling (NDC)  Another available culling approach is to do manual depth testing  Perform an LDS optimized parallel reduction [9], storing out the conservative depth value for each tile 16x16 Tiles
  • 66. Depth Tile Culling (NDC)  ~41us on XB1 @ 1080p  Bypasses LDS storage  Bandwidth bound  Shared with our light tile culling
  • 67. Depth Pyramid Culling (NDC)  Another approach to depth culling is a hierarchical Z pyramid [10][11][23]  Populate the Hi-Z pyramid after depth laydown  Construct a mip-mapped screen resolution texture  Culling can be done by comparing the depth of a bounding volume with the depth stored in the Hi-Z pyramid int mipMapLevel = min(ceil(log2(max(longestEdge, 1.0f))), levels - 1);
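The mip selection expression on slide 67 translates directly. In this sketch `longest_edge` is assumed to be the longest edge of the screen-space bounding box in pixels and `levels` the number of mip levels in the Hi-Z pyramid; the function name is hypothetical.

```python
import math

def hiz_mip_level(longest_edge, levels):
    """min(ceil(log2(max(longestEdge, 1.0))), levels - 1) from slide 67."""
    return min(math.ceil(math.log2(max(longest_edge, 1.0))), levels - 1)

print(hiz_mip_level(0.25, 10))    # 0: sub-pixel boxes use the finest level
print(hiz_mip_level(16.0, 10))    # 4
print(hiz_mip_level(5000.0, 10))  # 9: clamped to the coarsest level
```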
  • 68. AMD GCN HTILE  Depth acceleration meta data called HTILE [6][7]  Every group of 8x8 pixels has a 32bit meta data block  Can be decoded manually in a shader and used for 1 test -> 64 pixel rejection  Avoids slow hardware decompression or resummarize  Avoids losing Hi-Z on later depth enabled render passes DEPTH HTILE
  • 70. AMD GCN HTILE DS_SWIZZLE_B32 [5] V_READLANE_B32 [5]
  • 71. AMD GCN HTILE  Manually encode; skip the resummarize on half resolution depth!  HTILE encodes both near and far depth for each 8x8 pixel tile.  Stencil Enabled = 14 bit near value, and a 6 bit delta towards far plane  Stencil Disabled = MinMax depth encoded in 2x 14 bit UNORM pairs
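For the stencil-disabled case, slide 71 says the tile's min and max depth are stored as two 14-bit UNORM values. The sketch below shows only the quantization step, rounding the minimum down and the maximum up so the stored range never excludes a real depth. That conservative rounding is an assumption of this example, and the actual HTILE bit layout is deliberately not modeled because the slides do not give it. All names are hypothetical.

```python
import math

UNORM14_MAX = (1 << 14) - 1  # 16383

def encode_minmax_unorm14(tile_depths):
    """Quantize a tile's depth range to two 14-bit values, conservatively.

    Assumes depths are in [0, 1]. Min rounds down and max rounds up so the
    encoded range always contains every input depth.
    """
    lo = math.floor(min(tile_depths) * UNORM14_MAX)
    hi = math.ceil(max(tile_depths) * UNORM14_MAX)
    return lo, hi

def decode_unorm14(value):
    """Map a 14-bit UNORM value back to a depth in [0, 1]."""
    return value / UNORM14_MAX

lo, hi = encode_minmax_unorm14([0.25, 0.5, 0.75])
print(lo, hi)  # 4095 12288
```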
  • 72. Software Z  One problem with using depth for culling is availability  Many engines do not have a full Z pre-pass  Restricts asynchronous compute scheduling  Wait for Z buffer laydown  You can load the Hi-Z pyramid with software Z!  In Frostbite since Battlefield 3 [12]  Done on the CPU for the upcoming GPU frame  No latency  You can prime HTILE!  Full Z pre-pass  Minimal cost
  • 77. Batching  Fixed memory budget of N buffers * 128k triangles  128k triangles = 384k indices = 768 KB  3 MB of memory usage, for up to 524288 surviving triangles in flight [Diagram: two groups of four 128k-triangle (768KB) output buffers, each group followed by a Render]
  • 78. Batching [Diagram: mesh sections (20k, 34k, 4k, 20k, 70k tri, …) go through Culling (434k triangles), 434k / 512k capacity; Output #0 to #3 feed Render #0, and a second set of Output #0 to #3 feeds Render #1]
  • 79. Batching [Diagram: mesh sections (20k, 34k, 4k, 20k, 70k tri, …) go through Culling (546k triangles), 546k / 512k capacity; the outputs now feed Render #0,0 and Render #0,1 in addition to Render #1]
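The memory figures on slide 77 follow from 16-bit indices and three indices per triangle. A short sketch of the arithmetic, with hypothetical names; the buffer count of 4 is inferred from the slide's 3 MB and 524288 totals rather than stated outright.

```python
TRIANGLES_PER_BUFFER = 128 * 1024
INDICES_PER_TRIANGLE = 3
BYTES_PER_INDEX = 2  # 16-bit indices

def buffer_bytes():
    """Size in bytes of one output index buffer."""
    return TRIANGLES_PER_BUFFER * INDICES_PER_TRIANGLE * BYTES_PER_INDEX

def budget(buffer_count):
    """Return (total bytes, surviving triangles in flight) for N buffers."""
    return buffer_count * buffer_bytes(), buffer_count * TRIANGLES_PER_BUFFER

print(buffer_bytes() // 1024)  # 768 (KB per buffer)
print(budget(4))               # (3145728, 524288), i.e. 3 MB
```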
  • 80. Batching [Timeline: Dispatch #0 to #3 and Render #0 to #3, with Dispatch #0 marked "Startup Cost"]  Overlapping culling and render on the graphics pipe is great  But there is a high startup cost for dispatch #0 (no graphics to overlap)  If only there were something we could use….
  • 81. Batching  Asynchronous compute to the rescue!  We can launch the dispatch work alongside other GPU work in the frame  Water simulation, physics, cloth, virtual texturing, etc.  This can slow down “Other GPU Stuff” a bit, but overall frame is faster!  Just be careful about what you schedule culling with  We use wait on lightweight label operations to ensure that dispatch and render are pipelined correctly [Timeline: Dispatch #0 to #3 and Render #0 to #3 alongside "Other GPU Stuff"]
  • 82. Performance 443,429 triangles @ 1080p 171 unique PSOs
  • 83. Performance Filter: exclusively culled / inclusively culled. Orientation: 46% (204,006) / 46% (204,006). Depth*: 42% (187,537) / 20% (90,251). Small*: 30% (128,705) / 8% (37,606). Frustum*: 8% (35,182) / 4% (16,162). Processed: 100% (443,429). Culled: 78% (348,025). Rendered: 22% (95,404). * Scene Dependent
  • 84. Performance (No Tessellation; 443,429 triangles @ 1080p, 171 unique PSOs, No Cluster Culling). XB1 (DRAM): Base 5.47ms; Synchronous Cull 0.24ms, Draw 4.54ms, Total 4.78ms; Asynchronous Cull 0.26ms, Draw 4.56ms, Total 4.56ms. PS4 (GDDR5): Base 4.56ms; Synchronous Cull 0.13ms, Draw 3.76ms, Total 3.89ms; Asynchronous Cull 0.15ms, Draw 3.80ms, Total 3.80ms. PC (Fury X): Base 0.79ms; Synchronous Cull 0.06ms, Draw 0.47ms, Total 0.53ms; Asynchronous Cull 0.06ms, Draw 0.47ms, Total 0.47ms.
  • 85. Performance (Tessellation Factor 1-7, Adaptive Phong; 443,429 triangles @ 1080p, 171 unique PSOs, No Cluster Culling). XB1 (DRAM): Base 19.3ms; Synchronous Cull 0.24ms, Draw 11.1ms, Total 11.3ms; Asynchronous Cull 0.26ms, Draw 11.2ms, Total 11.2ms. PS4 (GDDR5): Base 12.8ms; Synchronous Cull 0.13ms, Draw 8.08ms, Total 8.21ms; Asynchronous Cull 0.15ms, Draw 8.10ms, Total 8.10ms. PC (Fury X): Base 3.01ms; Synchronous Cull 0.06ms, Draw 0.64ms, Total 0.70ms; Asynchronous Cull 0.06ms, Draw 0.64ms, Total 0.64ms.
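The totals on slide 83 can be cross-checked from the inclusive column. In this sketch the inclusive counts are read as the triangles each filter removes when the filters run one after another; that reading is an assumption, though it is consistent with the counts summing to the slide's culled total. Variable names are hypothetical.

```python
processed = 443_429

# "Inclusively culled" counts from slide 83, assumed to be per-filter
# removals when the filters run in sequence.
inclusive = {
    "orientation": 204_006,
    "depth": 90_251,
    "small": 37_606,
    "frustum": 16_162,
}

culled = sum(inclusive.values())
rendered = processed - culled

print(culled, round(100 * culled / processed))      # 348025 78
print(rendered, round(100 * rendered / processed))  # 95404 22
```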
  • 86. Future Work  Reuse results between multiple passes  Once for all shadow cascades  Depth, gbuffer, emissive, forward, reflection  Cube maps – load once, cull each side  Xbox One supports switching PSOs with ExecuteIndirect  Single submitted batch!  Further reduce bottlenecks  Move more and more CPU rendering logic to GPU  Improve asynchronous scheduling
  • 87. Future Work
      • Instancing optimizations
          • Each instance (re)loads vertex data
          • Synchronous dispatch
              • Near 100% L2$ hit
              • ALU bound on render: 24 VGPRs, measured occupancy of 8
              • 1.5 bytes bandwidth usage per triangle
          • Asynchronous dispatch
              • Low L2$ residency: other render work between culling and render
              • VMEM bound on render
              • 20 bytes bandwidth usage per triangle
  • 88. Future Work
      • Maximize bandwidth and throughput
          • Load data into LDS chunks, bandwidth amplification
          • Partition data into per-chunk index buffers
          • Evaluate all instances
      • More tuning of wavefront limits and CU masking
  • 90. Hardware Tessellation
      Pipeline: Input Assembler → Vertex (Local) Shader → Hull Shader → Tessellator → Domain (Vertex) Shader → Rasterizer → Pixel Shader → Output Merger
      • Tessellation Factors
      • Silhouette Orientation
      • Back Face Culling
      • Frustum Culling
      • Coarse Culling (Hi-Z)
  • 91. Hardware Tessellation
      Pipeline: Input Assembler → Vertex (Local) Shader → Hull (Pass-through) Shader → Tessellator → Domain (Vertex) Shader → Rasterizer → Pixel Shader → Output Merger
      Hull (Pass-through) Shader: Load Final Factors
      Mesh Data → Compute Shader:
      • Tessellation Factors
      • Silhouette Orientation
      • Back Face Culling
      • Frustum Culling
      • Coarse Culling (Hi-Z)
  • 92. Hardware Tessellation
      Mesh Data → Compute Shader → Tessellation Factors, written to structured work queues:
      • Structured Work Queue #1: patches with factor [1…1]
      • Structured Work Queue #2: patches with factor [2…7]
      • Structured Work Queue #3: patches with factor [8…N]
      Patches with factor 0 (culled) are not processed further and are not inserted into any work queue.
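The binning on this slide can be sketched on the CPU as follows. This is an illustration only: the talk performs this step in a compute shader, and the function name `bin_patches`, the use of integer factors, and the dictionary standing in for the three structured work queues are assumptions made for the example.

```python
def bin_patches(tess_factors):
    """Sort patch indices into three work queues by tessellation factor.

    Queue 1 holds factor 1, queue 2 holds factors 2 to 7, queue 3 holds
    factors of 8 and above. Factor 0 means the patch was culled, so it is
    dropped and never inserted into any queue.
    """
    queues = {1: [], 2: [], 3: []}
    for patch_index, factor in enumerate(tess_factors):
        if factor <= 0:
            continue  # culled patch: not processed further
        if factor == 1:
            queues[1].append(patch_index)
        elif factor <= 7:
            queues[2].append(patch_index)
        else:
            queues[3].append(patch_index)
    return queues


# Hypothetical per-patch factors, as a compute pass might have produced them.
queues = bin_patches([0, 1, 3, 7, 8, 16])
print(queues)
```

Keeping the queues separate lets each range of factors be drawn with a path suited to it instead of sending every patch through the tessellator.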
  • 93. Hardware Tessellation
      • Structured Work Queue #1 (patches with factor [1…1]) → Non-Tessellated Draw: no expansion factor, avoid the tessellator!
      • Structured Work Queue #2 (patches with factor [2…7]) → Tessellated Draw: low expansion factor, GCN friendly
      • Structured Work Queue #3 (patches with factor [8…N]) → Compute Shader (Patch SubD 1 -> 4, Tessellation Factor 1/4): high expansion factor, GCN unfriendly
  • 94. Summary
      • Small and inefficient draws are a problem
      • Compute and graphics are friends
      • Use all the available GPU resources
      • Asynchronous compute is extremely powerful
      • Lots of cool GCN instructions available
      • Check out AMD GPUOpen GeometryFX [20]
  • 96. Acknowledgements: Matthäus Chajdas (@NIV_Anteru), Ivan Nevraev (@Nevraev), Alex Nankervis, Sébastien Lagarde (@SebLagarde), Andrew Goossen, James Stanard (@JamesStanard), Martin Fuller (@MartinJIFuller), David Cook, Tobias “GPU Psychiatrist” Berghoff (@TobiasBerghoff), Christina Coffin (@ChristinaCoffin), Alex “I Hate Polygons” Evans (@mmalex), Rob Krajcarski, Jaymin “SHUFB 4 LIFE” Kessler (@okonomiyonda), Tomasz Stachowiak (@h3r2tic), Andrew Lauritzen (@AndrewLauritzen), Nicolas Thibieroz (@NThibieroz), Johan Andersson (@repi), Alex Fry (@TheFryster), Jasper Bekkers (@JasperBekkers), Graham Sellers (@grahamsellers), Cort Stratton (@postgoodism), David Simpson, Jason Scanlin, Mike Arnold, Mark Cerny (@cerny), Pete Lewis, Keith Yerex, Andrew Butcher (@andrewbutcher), Matt Peters, Sebastian Aaltonen (@SebAaltonen), Anton Michels, Louis Bavoil (@LouisBavoil), Yury Uralsky, Sebastien Hillaire (@SebHillaire), Daniel Collin (@daniel_collin)
  • 97. References
      [1] “The AMD GCN Architecture – A Crash Course” – Layla Mah
      [2] “Clipping Using Homogeneous Coordinates” – Jim Blinn, Martin Newell
      [3] “Triangle Scan Conversion using 2D Homogeneous Coordinates” – Marc Olano, Trey Greer
      [4] “GPU-Driven Rendering Pipelines” – Ulrich Haar, Sebastian Aaltonen
      [5] “Southern Islands Series Instruction Set Architecture” – AMD
      [6] “Radeon Southern Islands Acceleration” – AMD
      [7] “Radeon Evergreen / Northern Islands Acceleration” – AMD
      [8] “GCN Architecture Whitepaper” – AMD
      [9] “Optimizing Parallel Reduction In CUDA” – Mark Harris
      [10] “Hierarchical-Z Map Based Occlusion Culling” – Daniel Rákos
      [11] “Hierarchical Z-Buffer Occlusion Culling” – Nick Darnell
      [12] “Culling the Battlefield: Data Oriented Design in Practice” – Daniel Collin
      [13] “The Rendering Pipeline – Challenges & Next Steps” – Johan Andersson
      [14] “GCN Performance Tweets” – AMD
      [15] “Learning from Failure: … Abandoned Renderers For Dreams PS4 …” – Alex Evans
      [16] “Patch Based Occlusion Culling For Hardware Tessellation” – Matthias Nießner, Charles Loop
      [17] “Tessellation In Call Of Duty: Ghosts” – Wade Brainerd
      [18] “MiniEngine Framework” – Alex Nankervis, James Stanard
      [19] “Optimal Bounding Cones of Vectors in Three Dimensions” – Gill Barequet, Gershon Elber
      [20] “GPUOpen GeometryFX” – AMD
      [21] “Sample Distribution Shadow Maps” – Andrew Lauritzen
      [22] “2D Polyhedral Bounds of a Clipped, Perspective-Projected 3D Sphere” – Mara and McGuire
      [23] “Practical, Dynamic Visibility for Games” – Stephen Hill
  • 98. Thank You! Questions?
      graham@frostbite.com
      Twitter: @gwihlidal
      “If you’ve been struggling with a tough ol’ programming problem all day, maybe go for a walk. Talk to a tree. Trust me, it helps.” – Bob Ross, Game Dev
  • 99. Instancing Optimizations
      • Can do a fast bitonic sort of the instancing buffer for optimal front-to-back order
      • Utilize DS_SWIZZLE_B32
          • Swizzles input thread data based on offset mask
          • Data sharing within 32 consecutive threads
      • Only 32 bit, so can efficiently sort 32 elements
      • You could do clustered sorting
          • Sort each cluster’s instances (within a thread)
          • Sort the 32 clusters
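For readers unfamiliar with bitonic sorting, the sketch below runs the network on 32 keys, with a Python list standing in for 32 GPU lanes. It is a CPU illustration under stated assumptions, not the talk's shader code: reading a partner lane at index `i ^ j` stands in for a cross-lane exchange such as the one DS_SWIZZLE_B32 provides, and the keys are imagined to be per-instance depths so that ascending order means front to back. The function name `bitonic_sort_32` is hypothetical.

```python
def bitonic_sort_32(keys):
    """Sort 32 keys ascending with a bitonic network, one key per 'lane'."""
    assert len(keys) == 32
    lanes = list(keys)
    n = len(lanes)
    k = 2  # size of the bitonic sequences being merged
    while k <= n:
        j = k // 2  # distance to the partner lane in this step
        while j >= 1:
            # Every lane reads its partner's value first, as a cross-lane
            # exchange would, then all lanes update in lockstep.
            partner = [lanes[i ^ j] for i in range(n)]
            for i in range(n):
                ascending = (i & k) == 0   # direction of this lane's block
                lower = (i & j) == 0       # lower index of the compared pair
                if ascending == lower:
                    lanes[i] = min(lanes[i], partner[i])
                else:
                    lanes[i] = max(lanes[i], partner[i])
            j //= 2
        k *= 2
    return lanes


# Hypothetical depth keys: a fixed permutation of 0..31.
depths = [(i * 7) % 32 for i in range(32)]
front_to_back = bitonic_sort_32(depths)
print(front_to_back)
```

The network needs no data-dependent branching between lanes, which is why it maps well onto a fixed group of 32 threads exchanging values.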