ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

Automatic Generation of High-Order Finite-Difference Code with
Temporal Blocking for Extreme-Scale Many-Core Systems
ESPM2 2018
Nov. 12th, Dallas
Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto,
Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori,
Miyuki Tsubouchi, Jun Makino

Abstract
 For an explicit finite-difference scheme applied to computational
fluid dynamics, we achieved 4.78 PFlops, 21.5% of peak performance,
on a large-scale PEZY-SC2 based system with very low B/F
(bytes per flop) by using temporal blocking
 The achieved efficiency is comparable to recent work on very
high B/F systems
 To achieve this high efficiency on a low B/F machine,
we developed
 A framework for explicit stencil computation that generates
the boilerplate code for MPI and the device kernel code with
temporal blocking
 A finite-difference scheme suitable for temporal blocking

Table of Contents
 Introduction
 Explicit stencil computation
 Temporal blocking
 About PEZY-SC2
 Details of our work
 Code generation framework: Formura
 Optimization for PEZY-SC2
 Benchmark results
 Performance on large-scale systems (Gyoukou)
 Discussion and summary

Explicit Stencil Computation
 Explicit stencil computation is a simple but very important
HPC application
 It is used for simulating weather, earthquakes, the interior of
the sun, etc.
 Optimizing stencil computation is therefore very important
Source: Riken

Efficiency of Recent Stencil Computation
 The efficiency of explicit methods on recent HPC
hardware is not high enough
 Even the best-case efficiency on the K computer is ≅ 20%;
many other cases reach only ≅ 10%
 This low efficiency is caused either by the processor
architecture or by limited memory bandwidth
 We try to solve the memory bandwidth problem,
which does not depend on the architecture
 Over the past decades, the B/F of HPC systems has
decreased dramatically
 This trend seems likely to continue

Relative Performance Trend
 Green: FLOPS vs memory bandwidth (4.5x/decade)
 Red: FLOPS vs network latency (~30x/decade)
 This trend seems likely to continue
Source: John D. McCalpin

PEZY-SC2
 Many-core MIMD processor
 1984 individual RISC cores
 2.8 TFlops peak DP performance (@700MHz)
 4ch DDR4 DRAM, 64GB, 80GB/s
 ⇒ B/F ≒ 0.03
 cf. K computer: 0.5, Tesla V100: 0.12, TaihuLight: 0.04
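
The B/F figure above follows directly from the quoted bandwidth and peak performance. A minimal sketch of that arithmetic is given here; the function and constant names are illustrative only and do not come from the slides.

```python
# Bytes-per-flop (B/F) computed from the figures quoted on this slide.
# Names are hypothetical; only the two numbers come from the slide.
PEAK_DP_FLOPS = 2.8e12   # 2.8 TFlops peak double precision
MEM_BANDWIDTH = 80e9     # 80 GB/s DDR4 memory bandwidth


def bytes_per_flop(bandwidth_bytes_per_s, peak_flops):
    """Memory bandwidth divided by peak floating-point rate."""
    return bandwidth_bytes_per_s / peak_flops


bf_pezy_sc2 = bytes_per_flop(MEM_BANDWIDTH, PEAK_DP_FLOPS)
print(f"PEZY-SC2 B/F = {bf_pezy_sc2:.3f}")
```

Rounded to two decimals, the result agrees with the ≒ 0.03 stated above.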

PEZY-SC2 Architecture
 The chip consists of 8 prefectures
 Each prefecture contains 16 cities
 Each city contains 16 processor elements (PEs)
 8 × 16 × 16 − 64 (redundancy) = 1984 PEs
 The PEs in each city share an L2 cache

Gyoukou
 Supercomputer installed at JAMSTEC, Japan
 (Available until April 2018)
 Peak 28.2 PFlops (full nodes)
 4th on the Top500 list (Nov 2017)
 10000 PEZY-SC2s + 1250 Xeon D (1 per 8 SC2s)
 World's largest number of MIMD processor cores
(≒ 20M, cf. TaihuLight ≒ 11M)
 Suitable for testing whether the code can scale to
exa-scale systems

Temporal Blocking (TB)
 One solution for running explicit methods on low B/F systems
 With TB, multiple time steps are computed on a working array
while it stays in cache, which reduces the required B/F as long
as the working array fits in the processor cache
[Figure: DRAM, cache, and network; read from memory, computation
for one time step, write to memory]
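
The idea can be illustrated with a small 1D example. The sketch below is not the scheme used in this work: it assumes a 3-point averaging stencil with periodic boundaries and uses the simple overlapped (redundant halo) form of temporal blocking, chosen only because it is easy to check against the step-by-step version. All function names are hypothetical.

```python
# Minimal 1D illustration of temporal blocking (overlapped-halo variant).
# Assumptions: 3-point averaging stencil, periodic domain, pure Python lists.

def step_periodic(u):
    """One explicit time step over the whole periodic array."""
    n = len(u)
    return [(u[(i - 1) % n] + u[i] + u[(i + 1) % n]) / 3.0 for i in range(n)]


def naive(u, nt):
    """Advance nt steps, sweeping the whole array once per step."""
    for _ in range(nt):
        u = step_periodic(u)
    return u


def temporally_blocked(u, nt, block):
    """Advance nt steps block by block, reading each block only once."""
    n = len(u)
    out = [0.0] * n
    halo = nt  # stencil radius 1: one extra cell per side per time step
    for start in range(0, n, block):
        stop = min(start + block, n)
        # "Read from memory": the block plus its halo, with periodic wrap.
        work = [u[i % n] for i in range(start - halo, stop + halo)]
        # nt steps on the cached working array; it shrinks by one cell
        # per side per step, so the halo cells are redundant computation.
        for _ in range(nt):
            work = [(work[j - 1] + work[j] + work[j + 1]) / 3.0
                    for j in range(1, len(work) - 1)]
        # "Write to memory": exactly the cells of this block remain.
        out[start:stop] = work
    return out


u0 = [float((7 * i) % 11) for i in range(40)]
reference = naive(u0, 4)
blocked = temporally_blocked(u0, 4, block=8)
max_diff = max(abs(a - b) for a, b in zip(reference, blocked))
print("max difference:", max_diff)
```

Each block is read once and written once for `nt` time steps instead of once per step, which is where the reduction in required B/F comes from; the price is the redundant work in the halo.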

Various Methods of TB
A variation of this is used for inter-node communication
This is used for in-node computation

Details of Our TB Calculation
 Inter-node communication
 Each node sends data in one direction only
 Each node receives data from one direction only
 Simple overlapping of communication and computation

Details of Our TB Calculation
 Computation starts from the right-most block
 The upper-right part of the parallelogram uses dummy data to
equalize all loop lengths
 The gray part holds results that are not needed
 This method adds only a small amount of extra computation

SL4TH3 Scheme
 Fourth-order accuracy
 Number of stencil = 2
 Flops per cell per step ≒ 2800
 Required B/F ≒ 0.05 w/o TB

Formura: a Framework for TB
 From a stencil description written in the Formura DSL,
Formura generates optimized distributed-parallel code for
large-scale parallel computers
 In this work, we added support for the TB method to
Formura and developed a device kernel code generator
for PEZY-SC2
[Figure: Formura DSL → (formura) → MPI driver code and TB kernel
code → (gcc/mpicc) → Executable]

Code generation by Formura
[Figure: input equation (some additional configuration files are
also needed); a very small part of the generated C code; output
(zoomed in)]

Code Generation
 Formura generates:
 Driver code that distributes TB across MPI processes
 Optimized kernel code for node-local computation
 For a new accelerator (or any other processor), we can
add a backend by modifying the code that calculates the
temporal blocking steps
 Typically, the major device-specific optimizations are
the block layout for data-access locality and thread
scheduling

Optimization for PEZY-SC2
 Decide the block size
 The block must be smaller than the LLC
 Parallelism should be close to the number of PEs for load balancing
 44 × 44 × 44 is the best block size
 44³ × 10 (variables/cell) × 8 bytes = 6.50MB
 6.50MB × 2 (for overlapping read and write) < 32MB = LLC size
 44² (parallelism) = 1936 ≒ 1984 (number of PEs)
 Total of 880³ cells per node (880 = 44 × 20) < 64GB
 Assign adjacent cells to PEs that share an L2 cache
 Reduce the instruction size of the inner-most loop
 PEZY-SC2's L1 I-cache size = 4KB (= 1024 ops)
 SL4TH3 requires 2800 ops/cell
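
The block-size constraints above are plain arithmetic and can be checked in a few lines. This is only a sketch: the variable names are illustrative, and it assumes the slide's "MB" and "GB" are binary units (2^20 and 2^30 bytes), which is what makes 44³ × 80 bytes come out as 6.50.

```python
# Check of the block-size figures on this slide.
# Assumption: MB/GB are binary units (MiB/GiB); names are hypothetical.
BLOCK_EDGE = 44          # cells per block edge
VARS_PER_CELL = 10       # variables per cell
BYTES_PER_VALUE = 8      # double precision
BLOCKS_PER_EDGE = 20     # blocks per node edge (880 = 44 * 20)
NUM_PES = 1984           # processor elements per PEZY-SC2
LLC_BYTES = 32 * 2**20   # 32 MB last-level cache
DRAM_BYTES = 64 * 2**30  # 64 GB DRAM per PEZY-SC2

bytes_per_cell = VARS_PER_CELL * BYTES_PER_VALUE
block_bytes = BLOCK_EDGE**3 * bytes_per_cell
double_buffered_bytes = 2 * block_bytes      # overlapped read and write
parallelism = BLOCK_EDGE**2                  # independent work items per block
node_edge = BLOCK_EDGE * BLOCKS_PER_EDGE
node_bytes = node_edge**3 * bytes_per_cell

print(f"block size      : {block_bytes / 2**20:.2f} MB")
print(f"double buffered : {double_buffered_bytes / 2**20:.2f} MB, "
      f"fits in LLC: {double_buffered_bytes < LLC_BYTES}")
print(f"parallelism     : {parallelism} of {NUM_PES} PEs")
print(f"node working set: {node_bytes / 2**30:.1f} GB, "
      f"fits in DRAM: {node_bytes < DRAM_BYTES}")
```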

Benchmark Results
 Conditions
 SL4TH3 scheme
 Optimized backend for PEZY-SC2
 8000 PEZY-SC2s (20 × 20 × 20 layout) on Gyoukou
 ≒ 16M cores
 Total of 17600³ cells (17600 = 880 × 20)
 Performance results
 4.78 PFlops
 21.5% efficiency (22.2 PFlops theoretical peak)
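
The headline numbers are consistent with one another, as the short check below shows. It uses only figures quoted on the slides (the 1984 PEs per chip comes from the PEZY-SC2 slide); the variable names are illustrative.

```python
# Consistency check of the benchmark figures quoted on the slides.
ACHIEVED_PFLOPS = 4.78
PEAK_PFLOPS = 22.2            # theoretical peak of the 8000 chips used
CHIPS_PER_EDGE = 20           # 20 x 20 x 20 layout
PES_PER_CHIP = 1984
CELLS_PER_CHIP_EDGE = 880     # 880^3 cells per PEZY-SC2

num_chips = CHIPS_PER_EDGE**3
total_cores = num_chips * PES_PER_CHIP
global_edge = CELLS_PER_CHIP_EDGE * CHIPS_PER_EDGE
total_cells = global_edge**3
efficiency = ACHIEVED_PFLOPS / PEAK_PFLOPS

print(f"chips      : {num_chips}")
print(f"cores      : {total_cores / 1e6:.1f}M")
print(f"grid       : {global_edge}^3 = {total_cells:.3e} cells")
print(f"efficiency : {efficiency * 100:.1f}%")
```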

Effect of Temporal Blocking
Required B/F by NT (size per node = 880³)

NT   Redundant calculation by TB   Required B/F
1    1.4%                          0.058
2    2.7%                          0.029
3    4.0%                          0.020
4    5.4%                          0.015
5    6.7%                          0.012
6    8.0%                          0.010
7    9.2%                          0.009
8    10.5%                         0.008

[Figure: Calculation speed (GF) by NT (= time step parameter)]
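
The required B/F column can be approximated with a simple traffic model. The model below is an assumption, not something stated in the slides: each cell's 10 double-precision variables are read once and written once per NT time steps, each step costs about 2800 flops per cell, and the traffic is scaled by the redundant-calculation fraction from the table. Under these assumptions the model happens to reproduce the tabulated values to three decimals.

```python
# Hypothetical traffic model for the "Required B/F" column.
# Assumption (not from the slides): one read and one write of every
# variable per NT steps, scaled by the redundant-calculation fraction.
VARS_PER_CELL = 10
BYTES_PER_VALUE = 8
FLOPS_PER_CELL_STEP = 2800

# Redundant-calculation fractions copied from the table above.
REDUNDANCY = {1: 0.014, 2: 0.027, 3: 0.040, 4: 0.054,
              5: 0.067, 6: 0.080, 7: 0.092, 8: 0.105}


def required_bf(nt):
    """Estimated bytes moved per useful flop for nt blocked time steps."""
    traffic_bytes = 2 * VARS_PER_CELL * BYTES_PER_VALUE  # read + write
    useful_flops = FLOPS_PER_CELL_STEP * nt
    return traffic_bytes / useful_flops * (1.0 + REDUNDANCY[nt])


model = {nt: round(required_bf(nt), 3) for nt in sorted(REDUNDANCY)}
for nt, bf in model.items():
    print(f"NT={nt}: required B/F ~ {bf:.3f}")
```

In this model the required B/F falls roughly as 1/NT, while the redundant work grows roughly linearly with NT, which is the trade-off the table shows.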

Comparison with Other Studies
 This work achieves very high efficiency
 The result is comparable to those obtained on a very high B/F (= 0.5) system
[Figure: Efficiency (%) and device B/F for Yashiro et al., Yang et al.,
Hotta et al., and this work]

Weak Scaling
 Communication is completely hidden by computation
 Thus, even though the actual communication time
increases as the number of nodes increases, the weak
scaling of the performance is good
[Figure: communication time and total time]

Future Work
 Other schemes
 HLLD
 Other applications
 Tsunami (shallow-water equation)
 Reaction-diffusion system
 Further performance improvement

Conclusion
 We achieved 4.78 PFlops, 21.5% of peak performance,
with a fluid simulation code on a large-scale
PEZY-SC2 based system
 We developed an automatic code generation framework
for TB, a scheme suitable for it, and a backend for the
PEZY-SC2 accelerator
 The achieved efficiency is comparable to other work on
high B/F systems
