Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems
ESPM2 2018
Nov.12th, Dallas
Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto,
Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori,
Miyuki Tsubouchi, Jun Makino
Abstract
- For an explicit finite-difference scheme applied to computational fluid dynamics, we achieved 4.78 PFlops (21.5% of peak performance) by using temporal blocking on a large-scale PEZY-SC2-based system, which has a very low B/F (bytes per flop)
- The achieved efficiency is comparable to recent work on very high B/F systems
- To achieve this efficiency on a low B/F machine, we developed:
  - A framework for explicit stencil computation that generates the MPI boilerplate code and the device kernel code with temporal blocking
  - A finite-difference scheme suitable for temporal blocking
Table of Contents
- Introduction
- Explicit stencil computation
- Temporal blocking
- About PEZY-SC2
- Details of our work
- Code generation framework: Formura
- Optimization for PEZY-SC2
- Benchmark results
- Performance on large-scale systems (Gyoukou)
- Discussion and summary
Introduction
Explicit Stencil Computation
- Explicit stencil computation is a simple but very important HPC application
- It is used to simulate weather, earthquakes, the interior of the sun, etc.
- Optimizing stencil computation is therefore very important
Source: Riken
Efficiency of Recent Stencil Computation
- The efficiency of explicit methods on recent HPC hardware is not high enough
- Even the best-case efficiency on the K computer is ≅ 20%; many other cases reach only ≅ 10%
- This low efficiency is caused either by the processor architecture or by memory bandwidth
- We address the memory bandwidth problem, which does not depend on the architecture
- Over the past decades, the B/F of HPC systems has dropped dramatically
- This trend seems likely to continue
Relative Performance Trend
- Green: FLOPS vs memory bandwidth (4.5x/decade)
- Red: FLOPS vs network latency (~30x/decade)
- This trend seems likely to continue
Source: John D. McCalpin
PEZY-SC2
- Many-core MIMD processor
- 1984 individual RISC cores
- 2.8 TFlops peak DP performance (@700 MHz)
- 4ch DDR4 DRAM, 64 GB, 80 GB/s
- ⇒ B/F ≒ 0.03
- cf. K computer: 0.5, Tesla V100: 0.12, TaihuLight: 0.04
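The B/F value above is the memory bandwidth divided by the peak floating-point rate. A minimal sketch of that arithmetic, using only the two numbers quoted on this slide (the function name is illustrative, not from the talk):

```python
def bytes_per_flop(bandwidth_bytes_per_s: float, peak_flops: float) -> float:
    """Machine balance B/F: bytes of memory traffic available per flop."""
    return bandwidth_bytes_per_s / peak_flops


# PEZY-SC2 figures from the slide: 80 GB/s DRAM bandwidth, 2.8 TFlops peak DP.
pezy_sc2_bf = bytes_per_flop(80e9, 2.8e12)
print(f"PEZY-SC2 B/F = {pezy_sc2_bf:.3f}")
```

The result is a little under 0.03, which is the rounded value shown on the slide.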
PEZY-SC2 Architecture
- The chip consists of 8 prefectures
- Each prefecture contains 16 cities
- Each city contains 16 processor elements (PEs)
- 8 × 16 × 16 − 64 (redundancy) = 1984 PEs
- The PEs in each city share an L2 cache
Gyoukou
- Supercomputer installed at JAMSTEC, Japan
- (Available until April 2018)
- Peak 28.2 PFlops (all nodes)
- 4th on the Top500 list (Nov 2017)
- 10000 PEZY-SC2s + 1250 Xeon Ds (1 per 8 SC2s)
- World's largest number of MIMD processor cores (≒ 20M; cf. TaihuLight ≒ 11M)
- Suitable for testing whether code can scale to exascale systems
Details of the work
Temporal Blocking (TB)
- One solution for running explicit methods on low B/F systems
- With TB, multiple time steps are computed on a working array while it stays in cache, which reduces the required B/F as long as the working array fits in the processor cache
[Figure: data flow in temporal blocking. Labels: DRAM, Cache, Network, "Computation for one time step", "Read from memory", "Write to memory"]
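The idea can be sketched in one dimension. The example below is not the parallelogram scheme used in this work. It uses a simpler variant (overlapped tiles with a redundantly computed halo) on a made-up 3-point stencil, and all function names are illustrative. It shows that loading a block once, advancing it NT steps locally, and writing it back once gives the same result as NT full sweeps, at the cost of some redundant updates.

```python
def step(u):
    """Apply one 3-point stencil update; the result is 2 cells shorter than u."""
    return [0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
            for i in range(1, len(u) - 1)]


def naive(u, nt):
    """Advance a periodic 1D field nt steps, sweeping the whole array each step."""
    for _ in range(nt):
        u = step([u[-1]] + u + [u[0]])  # pad with periodic neighbours
    return u


def temporally_blocked(u, nt, block):
    """Advance the same field nt steps, one block at a time.

    Each block is loaded once with a halo of nt cells per side, updated nt
    times locally (the tile shrinks by one cell per side per step), and
    written back once. Returns the new field and the number of cell updates.
    """
    n = len(u)
    out = [0.0] * n
    updates = 0
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Single "read from memory": the block plus its halo, with periodic wrap.
        tile = [u[i % n] for i in range(start - nt, stop + nt)]
        for _ in range(nt):
            tile = step(tile)
            updates += len(tile)
        out[start:stop] = tile  # single "write to memory"
    return out, updates


field = [float((7 * i) % 11) for i in range(64)]
reference = naive(field, nt=4)
blocked, updates = temporally_blocked(field, nt=4, block=16)
redundant = updates / (len(field) * 4) - 1.0
print("results match:", blocked == reference)
print(f"redundant updates: {redundant:.1%}")
```

In this variant each block performs nt × (block + nt − 1) updates instead of nt × block, so the relative overhead shrinks as the block grows compared with NT.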
Various Methods of TB
[Figure: several TB methods. Annotations mark the method used for in-node computation and the method of which a variation is used for inter-node communication]
Details of Our TB Calculation
- Inter-node communication
  - Each node sends data in one direction
  - Each node receives data from one direction
  - Simple overlapping of communication and computation
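One way a one-directional exchange can work is to skew the blocks in time. The sketch below is a single-process simulation with no MPI, in which Python lists stand in for nodes. The slides do not spell out the exact scheme, so the skew direction, the halo width of 2 × NT cells, the 3-point stencil, and all names are assumptions made for illustration. Each node receives cells from its right-hand neighbour only, and after NT steps the region it owns has shifted NT cells to the right.

```python
def step(u):
    """Apply one 3-point stencil update; the result is 2 cells shorter than u."""
    return [0.25 * u[i - 1] + 0.5 * u[i] + 0.25 * u[i + 1]
            for i in range(1, len(u) - 1)]


def naive(u, nt):
    """Reference: advance a periodic 1D field nt steps with full sweeps."""
    for _ in range(nt):
        u = step([u[-1]] + u + [u[0]])
    return u


def one_direction_round(parts, nt):
    """Advance every node's slice nt steps using data from the right neighbour only.

    parts[r] is node r's slice of a periodic field (each slice needs at
    least 2 * nt cells). Node r's result covers its old range shifted nt
    cells to the right.
    """
    nodes = len(parts)
    result = []
    for rank in range(nodes):
        # "Receive" 2 * nt cells from the right-hand neighbour only.
        halo = parts[(rank + 1) % nodes][:2 * nt]
        tile = parts[rank] + halo
        for _ in range(nt):
            tile = step(tile)
        result.append(tile)
    return result


nt, nodes, local = 3, 4, 12
field = [float((5 * i) % 13) for i in range(nodes * local)]
parts = [field[r * local:(r + 1) * local] for r in range(nodes)]
after = one_direction_round(parts, nt)
flat = [x for part in after for x in part]
reference = naive(field, nt)
shifted = reference[nt:] + reference[:nt]  # ownership moved nt cells to the right
print("matches shifted reference:", flat == shifted)
```

Because the halo for a whole round is a single message from a single neighbour, the transfer for the next round can in principle proceed while the current round is being computed.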
Details of Our TB Calculation
- Computation starts from the right-most block
- The upper-right part of the parallelogram uses dummy data so that all loop lengths are equal
- The gray part holds results that are not needed
- This method adds only a small amount of extra computation
SL4TH3 Scheme
- Fourth-order accuracy
- Number of stencils = 2
- Flops per cell per step ≒ 2800
- Required B/F ≒ 0.05 without TB
[Figure: input differential equation]
Formura: a Framework for TB
- From a stencil description written in the Formura DSL, Formura generates optimized distributed-parallel code for large-scale parallel computers
- In this work, we added support for the TB method to Formura and developed a device kernel code generator for PEZY-SC2
[Figure: build flow. Formura DSL → (formura) → MPI driver code + TB kernel code → (gcc/mpicc) → Executable]
Code Generation by Formura
[Figure: the input equation (some additional configuration files are also needed), a small part of the generated C code, and the output (zoomed in)]
Code Generation
- Formura generates:
  - Driver code for distributing TB over MPI
  - Optimized kernel code for node-local computation
- For a new accelerator (or any other processor), a backend can be added by modifying the code that computes the temporal blocking steps
- Typically, the major per-device optimizations are the blocking layout for data-access locality and thread scheduling
Optimization for PEZY-SC2
- Decide the block size
  - The block must be smaller than the LLC
  - Parallelism should be close to the number of PEs for load balancing
  - 44 × 44 × 44 is the best block size
    - 44³ × 10 (variables/cell) × 8 bytes = 6.50 MB
    - 6.50 MB × 2 (to overlap reads and writes) < 32 MB = LLC size
    - 44² (parallelism) = 1936 ≒ 1984 (number of PEs)
    - Total of 880³ cells per node (880 = 44 × 20) < 64 GB
- Allocate adjacent cells to PEs that share an L2 cache
- Decrease the instruction size of the innermost loop
  - PEZY-SC2's L1 I-cache size = 4 KB (= 1024 ops)
  - SL4TH3 requires 2800 ops/cell
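The block-size constraints above can be checked with a few lines of arithmetic. This sketch uses only the numbers on the slide. Two readings are assumptions: "MB" for the block and LLC sizes is taken as MiB (2²⁰ bytes), which is what reproduces 6.50, and the 64 GB of DRAM is taken as decimal gigabytes.

```python
MIB = 1024 ** 2

block_edge = 44            # cells per block edge
vars_per_cell = 10         # variables stored per cell
bytes_per_value = 8        # double precision
llc_bytes = 32 * MIB       # LLC size from the slide (read as MiB, an assumption)
num_pes = 1984             # processor elements per PEZY-SC2
blocks_per_edge = 20       # 44 x 20 = 880 cells per node edge
dram_bytes = 64e9          # 64 GB per node (decimal GB assumed)

block_bytes = block_edge ** 3 * vars_per_cell * bytes_per_value
parallelism = block_edge ** 2
node_edge = block_edge * blocks_per_edge
node_bytes = node_edge ** 3 * vars_per_cell * bytes_per_value

print(f"block size         : {block_bytes / MIB:.2f} MiB")
print(f"two blocks fit LLC : {2 * block_bytes < llc_bytes}")
print(f"parallelism / PEs  : {parallelism} / {num_pes}")
print(f"node data          : {node_bytes / 1e9:.1f} GB, fits DRAM: {node_bytes < dram_bytes}")
```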
Results
Benchmark Results
- Conditions
  - SL4TH3 scheme
  - Optimized backend for PEZY-SC2
  - 8000 PEZY-SC2s (20 × 20 × 20 layout) on Gyoukou
    - ≒ 16M cores
  - Total of 17600³ cells (17600 = 880 × 20)
- Performance results
  - 4.78 PFlops
  - 21.5% efficiency (22.2 PFlops theoretical peak)
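The headline figures follow directly from the conditions listed above. A short sketch of the arithmetic, using only values quoted in this deck (variable names are illustrative):

```python
chips_per_edge = 20
chips = chips_per_edge ** 3          # 8000 PEZY-SC2s in a 20 x 20 x 20 layout
pes_per_chip = 1984
cells_per_chip_edge = 880
achieved_pflops = 4.78
peak_pflops = 22.2                   # theoretical peak quoted on the slide

cores = chips * pes_per_chip
global_edge = cells_per_chip_edge * chips_per_edge
efficiency = achieved_pflops / peak_pflops

print(f"cores      : {cores / 1e6:.1f} M")
print(f"grid       : {global_edge}^3 cells")
print(f"efficiency : {efficiency:.1%}")
```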
Effect of Temporal Blocking
Required B/F by NT (size per node = 880³)

NT   Redundant calculation by TB   Required B/F
1    1.4%                          0.058
2    2.7%                          0.029
3    4.0%                          0.020
4    5.4%                          0.015
5    6.7%                          0.012
6    8.0%                          0.010
7    9.2%                          0.009
8    10.5%                         0.008

[Figure: calculation speed by NT. Axes: GF vs NT (= time step parameter)]
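The roughly 1/NT trend in the required B/F column can be approximated with a simple model. The sketch below is not the authors' formula. It assumes that each cell's 10 double-precision variables are read once and written once per NT steps, and it ignores halo traffic and redundant work, so its estimates come out slightly below the table values.

```python
VARS_PER_CELL = 10            # from the block-size slide
BYTES_PER_VALUE = 8           # double precision
FLOPS_PER_CELL_STEP = 2800    # SL4TH3, from the scheme slide


def required_bf(nt: int) -> float:
    """Estimated DRAM bytes per flop when nt steps share one read and one write."""
    bytes_per_cell = 2 * VARS_PER_CELL * BYTES_PER_VALUE  # assumed: one read + one write
    return bytes_per_cell / (FLOPS_PER_CELL_STEP * nt)


# Required B/F values copied from the table above, indexed by NT.
table = {1: 0.058, 2: 0.029, 3: 0.020, 4: 0.015,
         5: 0.012, 6: 0.010, 7: 0.009, 8: 0.008}

for nt, listed in table.items():
    print(f"NT={nt}: estimate {required_bf(nt):.4f}, table {listed:.3f}")
```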
Comparison with Other Studies
- This work achieves very high efficiency
- The result is comparable to that obtained on a very high B/F (= 0.5) system
[Figure: chart comparing Yashiro et al., Yang et al., Hotta et al., and this work. Axes: efficiency (%), 0 to 30, and device B/F, 0 to 0.6]
Weak Scaling
- Communication is completely hidden by computation
- Thus, even though the actual communication time increases as the number of nodes grows, the weak scaling of the performance is very good
[Figure: weak scaling. Series: communication time, total time]
Future Work
- Other schemes
  - HLLD
- Other applications
  - Tsunami (shallow-water equations)
  - Reaction-diffusion systems
- Further performance improvement
Conclusion
- We achieved 4.78 PFlops (21.5% of peak performance) with a fluid simulation code on a large-scale PEZY-SC2-based system
- We developed an automatic code generation framework for TB, a scheme suitable for it, and a backend for the PEZY-SC2 accelerator
- Our achieved efficiency is comparable to other work on high B/F systems
Ad

More Related Content

What's hot (20)

Flexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmeticFlexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmetic
Ieee Xpert
 
B1030610
B1030610B1030610
B1030610
IJERD Editor
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless ApplicationDesign Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
Design Radix-4 64-Point Pipeline FFT/IFFT Processor for Wireless Application
International Journal of Engineering Inventions www.ijeijournal.com
 
A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...
Ieee Xpert
 
Graph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate designGraph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate design
Ieee Xpert
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)
elliando dias
 
Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)
IISRT
 
High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...
Ieee Xpert
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition
艾鍗科技
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
Marina Kolpakova
 
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGAHigh-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
JAYAPRAKASH JPINFOTECH
 
Ad4103173176
Ad4103173176Ad4103173176
Ad4103173176
IJERA Editor
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
Netronome
 
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERSHIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERS
Lalitha Gosukonda
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
SciCompIIT
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Netronome
 
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISALec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Hsien-Hsin Sean Lee, Ph.D.
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontroller
Rouyun Pan
 
Flexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmeticFlexible dsp accelerator architecture exploiting carry save arithmetic
Flexible dsp accelerator architecture exploiting carry save arithmetic
Ieee Xpert
 
International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)International Journal of Engineering and Science Invention (IJESI)
International Journal of Engineering and Science Invention (IJESI)
inventionjournals
 
A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...A high performance fir filter architecture for fixed and reconfigurable appli...
A high performance fir filter architecture for fixed and reconfigurable appli...
Ieee Xpert
 
Graph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate designGraph based transistor network generation method for supergate design
Graph based transistor network generation method for supergate design
Ieee Xpert
 
Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)Learning Erlang (from a Prolog dropout's perspective)
Learning Erlang (from a Prolog dropout's perspective)
elliando dias
 
Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)Iisrt swathi priya(26 30)
Iisrt swathi priya(26 30)
IISRT
 
High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...High performance pipelined architecture of elliptic curve scalar multiplicati...
High performance pipelined architecture of elliptic curve scalar multiplicati...
Ieee Xpert
 
Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH Optimization of Collective Communication in MPICH
Optimization of Collective Communication in MPICH
Lino Possamai
 
TinyML - 4 speech recognition
TinyML - 4 speech recognition TinyML - 4 speech recognition
TinyML - 4 speech recognition
艾鍗科技
 
Code GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principleCode GPU with CUDA - Device code optimization principle
Code GPU with CUDA - Device code optimization principle
Marina Kolpakova
 
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGAHigh-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
High-Speed and Low-Latency ECC Processor Implementation Over GF(2m) on FPGA
JAYAPRAKASH JPINFOTECH
 
BPF Hardware Offload Deep Dive
BPF Hardware Offload Deep DiveBPF Hardware Offload Deep Dive
BPF Hardware Offload Deep Dive
Netronome
 
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERSHIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR  TCM DECODERS
HIGH-SPEED LOW-POWER VITERBI DECODER DESIGN FOR TCM DECODERS
Lalitha Gosukonda
 
Xian He Sun Data-Centric Into
Xian He Sun Data-Centric IntoXian He Sun Data-Centric Into
Xian He Sun Data-Centric Into
SciCompIIT
 
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Disaggregation a Primer: Optimizing design for Edge Cloud & Bare Metal applic...
Netronome
 
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISALec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Lec4 Computer Architecture by Hsien-Hsin Sean Lee Georgia Tech -- ISA
Hsien-Hsin Sean Lee, Ph.D.
 
Tensorflow lite for microcontroller
Tensorflow lite for microcontrollerTensorflow lite for microcontroller
Tensorflow lite for microcontroller
Rouyun Pan
 

Similar to ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems (20)

Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...
Joshua Mora
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGene
vsachde
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
AMD Developer Central
 
Nilesh ranpura systemmodelling
Nilesh ranpura systemmodellingNilesh ranpura systemmodelling
Nilesh ranpura systemmodelling
Obsidian Software
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
Ericsson
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
Wim Vanderbauwhede
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
LEGATO project
 
BURA Supercomputer
BURA SupercomputerBURA Supercomputer
BURA Supercomputer
SIMTEC Software and Services
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5
Steen Larsen
 
Melp codec optimization using DSP kit
Melp codec optimization using DSP kitMelp codec optimization using DSP kit
Melp codec optimization using DSP kit
sohaibaslam207
 
Crypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M ProcessorsCrypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M Processors
Hannes Tschofenig
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
Slide_N
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
inside-BigData.com
 
7 eti pres
7 eti pres7 eti pres
7 eti pres
Raymond Kung
 
Threading Successes 06 Allegorithmic
Threading Successes 06   AllegorithmicThreading Successes 06   Allegorithmic
Threading Successes 06 Allegorithmic
guest40fc7cd
 
DSP_Assign_1
DSP_Assign_1DSP_Assign_1
DSP_Assign_1
Joseph Chandler
 
Anegdotic Maxeler (Romania)
  Anegdotic Maxeler (Romania)  Anegdotic Maxeler (Romania)
Anegdotic Maxeler (Romania)
Valentina Emilia Balas
 
NWU and HPC
NWU and HPCNWU and HPC
NWU and HPC
Wilhelm van Belkum
 
Cryptologic Applications of the PlayStation 3: Cell SPEED
Cryptologic Applications of the PlayStation 3: Cell SPEEDCryptologic Applications of the PlayStation 3: Cell SPEED
Cryptologic Applications of the PlayStation 3: Cell SPEED
Slide_N
 
Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...Performance analysis of 3D Finite Difference computational stencils on Seamic...
Performance analysis of 3D Finite Difference computational stencils on Seamic...
Joshua Mora
 
pMatlab on BlueGene
pMatlab on BlueGenepMatlab on BlueGene
pMatlab on BlueGene
vsachde
 
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
CC-4005, Performance analysis of 3D Finite Difference computational stencils ...
AMD Developer Central
 
Nilesh ranpura systemmodelling
Nilesh ranpura systemmodellingNilesh ranpura systemmodelling
Nilesh ranpura systemmodelling
Obsidian Software
 
Conference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environmentConference Paper: Universal Node: Towards a high-performance NFV environment
Conference Paper: Universal Node: Towards a high-performance NFV environment
Ericsson
 
On the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC ApplicationsOn the Capability and Achievable Performance of FPGAs for HPC Applications
On the Capability and Achievable Performance of FPGAs for HPC Applications
Wim Vanderbauwhede
 
In datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unitIn datacenter performance analysis of a tensor processing unit
In datacenter performance analysis of a tensor processing unit
Jinwon Lee
 
LEGaTO: Software Stack Runtimes
LEGaTO: Software Stack RuntimesLEGaTO: Software Stack Runtimes
LEGaTO: Software Stack Runtimes
LEGATO project
 
Steen_Dissertation_March5
Steen_Dissertation_March5Steen_Dissertation_March5
Steen_Dissertation_March5
Steen Larsen
 
Melp codec optimization using DSP kit
Melp codec optimization using DSP kitMelp codec optimization using DSP kit
Melp codec optimization using DSP kit
sohaibaslam207
 
Crypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M ProcessorsCrypto Performance on ARM Cortex-M Processors
Crypto Performance on ARM Cortex-M Processors
Hannes Tschofenig
 
Intro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPCIntro to Cell Broadband Engine for HPC
Intro to Cell Broadband Engine for HPC
Slide_N
 
Large-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC WorkloadsLarge-Scale Optimization Strategies for Typical HPC Workloads
Large-Scale Optimization Strategies for Typical HPC Workloads
inside-BigData.com
 
Threading Successes 06 Allegorithmic
Threading Successes 06   AllegorithmicThreading Successes 06   Allegorithmic
Threading Successes 06 Allegorithmic
guest40fc7cd
 
Cryptologic Applications of the PlayStation 3: Cell SPEED
Cryptologic Applications of the PlayStation 3: Cell SPEEDCryptologic Applications of the PlayStation 3: Cell SPEED
Cryptologic Applications of the PlayStation 3: Cell SPEED
Slide_N
 
Ad

More from Hideyuki Tanaka (8)

Xpath in-lens
Xpath in-lensXpath in-lens
Xpath in-lens
Hideyuki Tanaka
 
IdrisでWebアプリを書く
IdrisでWebアプリを書くIdrisでWebアプリを書く
IdrisでWebアプリを書く
Hideyuki Tanaka
 
Yesod勉強会
Yesod勉強会Yesod勉強会
Yesod勉強会
Hideyuki Tanaka
 
C++コミュニティーの中心でC++をDISる
C++コミュニティーの中心でC++をDISるC++コミュニティーの中心でC++をDISる
C++コミュニティーの中心でC++をDISる
Hideyuki Tanaka
 
関数プログラミング入門
関数プログラミング入門関数プログラミング入門
関数プログラミング入門
Hideyuki Tanaka
 
Ad

Recently uploaded (20)

Approach to Upper GASTRO INTESTINAL Bleed.pptx
Approach to Upper GASTRO INTESTINAL Bleed.pptxApproach to Upper GASTRO INTESTINAL Bleed.pptx
Approach to Upper GASTRO INTESTINAL Bleed.pptx
PrabakaranNatarajan10
 
ANTI URINARY TRACK INFECTION AGENT MC III
ANTI URINARY TRACK INFECTION AGENT MC IIIANTI URINARY TRACK INFECTION AGENT MC III
ANTI URINARY TRACK INFECTION AGENT MC III
HRUTUJA WAGH
 
Freshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and FactorsFreshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and Factors
mytriplemonlineshop
 
Transgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative BiolabsTransgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative Biolabs
Creative-Biolabs
 
AP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of LifeAP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of Life
mseileenlinden
 
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Helena Celeste Mata Rico
 
ART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan College
ART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan CollegeART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan College
ART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan College
Agin Tom
 
Seismic evidence of liquid water at the base of Mars' upper crust
Seismic evidence of liquid water at the base of Mars' upper crustSeismic evidence of liquid water at the base of Mars' upper crust
Seismic evidence of liquid water at the base of Mars' upper crust
Sérgio Sacani
 
External Application in Homoeopathy- Definition,Scope and Types.
External Application  in Homoeopathy- Definition,Scope and Types.External Application  in Homoeopathy- Definition,Scope and Types.
External Application in Homoeopathy- Definition,Scope and Types.
AdharshnaPatrick
 
Preparation of Experimental Animals.pptx
Preparation of Experimental Animals.pptxPreparation of Experimental Animals.pptx
Preparation of Experimental Animals.pptx
klynct
 
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
A Massive Black Hole 0.8kpc from the Host Nucleus Revealed by the Offset Tida...
Sérgio Sacani
 
Secondary metabolite ,Plants and Health Care
Secondary metabolite ,Plants and Health CareSecondary metabolite ,Plants and Health Care
Secondary metabolite ,Plants and Health Care
Nistarini College, Purulia (W.B) India
 
An upper limit to the lifetime of stellar remnants from gravitational pair pr...
An upper limit to the lifetime of stellar remnants from gravitational pair pr...An upper limit to the lifetime of stellar remnants from gravitational pair pr...
An upper limit to the lifetime of stellar remnants from gravitational pair pr...
Sérgio Sacani
 
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptxA CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
A CASE OF MULTINODULAR GOITRE,clinical presentation and management.pptx
ANJALICHANDRASEKARAN
 
university of arizona ~ favor's college candidate project.pptx
university of arizona ~ favor's college candidate project.pptxuniversity of arizona ~ favor's college candidate project.pptx
university of arizona ~ favor's college candidate project.pptx
favoranamelechi107
 
Freud e sua Historia na Psicanalise Psic
Freud e sua Historia na Psicanalise PsicFreud e sua Historia na Psicanalise Psic
Freud e sua Historia na Psicanalise Psic
StefannyGoffi1
 
Top 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptxTop 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptx
alexbagheriam
 
Brief Presentation on Garment Washing.pdf
Brief Presentation on Garment Washing.pdfBrief Presentation on Garment Washing.pdf
Brief Presentation on Garment Washing.pdf
BharathKumar556689
 
Evidence for a polar circumbinary exoplanet orbiting a pair of eclipsing brow...
Evidence for a polar circumbinary exoplanet orbiting a pair of eclipsing brow...Evidence for a polar circumbinary exoplanet orbiting a pair of eclipsing brow...
Evidence for a polar circumbinary exoplanet orbiting a pair of eclipsing brow...
Sérgio Sacani
 
MC III Prodrug Medicinal Chemistry III PPT
MC III Prodrug Medicinal Chemistry III PPTMC III Prodrug Medicinal Chemistry III PPT
MC III Prodrug Medicinal Chemistry III PPT
HRUTUJA WAGH
 
Approach to Upper GASTRO INTESTINAL Bleed.pptx
Approach to Upper GASTRO INTESTINAL Bleed.pptxApproach to Upper GASTRO INTESTINAL Bleed.pptx
Approach to Upper GASTRO INTESTINAL Bleed.pptx
PrabakaranNatarajan10
 
ANTI URINARY TRACK INFECTION AGENT MC III
ANTI URINARY TRACK INFECTION AGENT MC IIIANTI URINARY TRACK INFECTION AGENT MC III
ANTI URINARY TRACK INFECTION AGENT MC III
HRUTUJA WAGH
 
Freshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and FactorsFreshwater Biome Types, Characteristics and Factors
Freshwater Biome Types, Characteristics and Factors
mytriplemonlineshop
 
Transgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative BiolabsTransgenic Mice in Cancer Research - Creative Biolabs
Transgenic Mice in Cancer Research - Creative Biolabs
Creative-Biolabs
 
AP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of LifeAP 2024 Unit 1 Updated Chemistry of Life
AP 2024 Unit 1 Updated Chemistry of Life
mseileenlinden
 
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Helena Celeste Mata Rico
 
ART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan College
ART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan CollegeART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan College
ART.pdf. Agin Tom, clinical Psychology, Prajyoti Niketan College
Agin Tom
 

ESPM2 2018 - Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems

  • 1. Automatic Generation of High-Order Finite-Difference Code with Temporal Blocking For Extreme-Scale Many-Core Systems ESPM2 2018 Nov.12th, Dallas Hideyuki Tanaka*, Youhei Ishihara, Ryo Sakamoto, Takashi Nakamura, Yasuyuki Kimura, Keigo Nitadori, Miyuki Tsubouchi, Jun Makino
  • 2. Abstract  For an explicit finite-difference scheme applied to computational fluid dynamics, we achieved 4.78 PFlops, 21.5% of peak performance, by using temporal blocking on a large-scale PEZY-SC2 based system, which has a very low B/F  The achieved efficiency is comparable to recent work on very high B/F systems  To achieve this high efficiency on a low B/F machine, we developed  A framework for explicit stencil computation which generates the boilerplate code for MPI and device kernel code with temporal blocking  A finite-difference scheme suitable for temporal blocking
  • 3. Table of Contents  Introduction  Explicit stencil computation  Temporal blocking  About PEZY-SC2  Details of our work  Code generation framework: Formura  Optimization for PEZY-SC2  Benchmark results  Performance on large-scale systems (Gyoukou)  Discussion and summary
  • 5. Explicit Stencil Computation  Explicit stencil computation is a simple but very important application of HPC  It is used for simulating weather, earthquakes, the interior of the sun, etc.  Optimizing stencil computation is very important Source: Riken
  • 6. Efficiency of Recent Stencil Computation  Efficiency of explicit methods on recent HPC hardware is not high enough  Even the best-case efficiency on the K computer is ≅ 20%, and many other cases reach only ≅ 10%  This low efficiency is caused either by processor architecture or by memory bandwidth  We try to solve the memory bandwidth problem, which does not depend on the architecture  In the past decades, the B/F of HPC systems has been reduced dramatically  This trend seems likely to continue
  • 7. Relative Performance Trend  Green: FLOPS vs memory bandwidth (4.5x/decade)  Red: FLOPS vs network latency (~30x/decade)  This seems to continue Source: John D. McCalpin
  • 8. PEZY-SC2  Many-core MIMD processor  1984 individual RISC cores  2.8 TFlops peak DP performance (@700 MHz)  4ch DDR4 DRAM, 64 GB, 80 GB/s  ⇒ B/F ≒ 0.03  cf. K computer: 0.5, Tesla V100: 0.12, TaihuLight: 0.04
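The B/F figure on this slide is simply memory bandwidth divided by peak floating-point rate. A minimal sketch of that arithmetic, using only the two numbers quoted on the slide (the variable names are illustrative):

```python
# Bytes-per-flop (B/F) ratio from the figures quoted on the PEZY-SC2 slide.
peak_flops = 2.8e12     # 2.8 TFlops peak double-precision performance
mem_bandwidth = 80e9    # 80 GB/s DRAM bandwidth

bf_ratio = mem_bandwidth / peak_flops
print(f"B/F = {bf_ratio:.3f}")  # close to the 0.03 quoted on the slide
```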
  • 9. PEZY-SC2 Architecture  The chip consists of 8 prefectures  Each prefecture contains 16 cities  Each city contains 16 processor elements (PEs)  8×16×16 − 64 (redundancy) = 1984 PEs  Each city shares an L2 cache
  • 10. Gyoukou  Supercomputer installed at JAMSTEC, Japan  (Available until April 2018)  Peak 28.2 PFlops (full nodes)  Top500 4th (Nov 2017)  10000 PEZY-SC2s + 1250 Xeon D (1 per 8 SC2s)  World’s largest number of MIMD processor cores (≒ 20M, cf. TaihuLight ≒ 11M)  Suitable for testing whether the code can scale to exa-scale systems
  • 12. Temporal Blocking (TB)  One solution for explicit methods on low B/F systems  With TB, multiple time steps are calculated on a working array before it is written back, so the required B/F is reduced when the working array fits in the processor cache [Figure: data moves between DRAM, cache, and network; each block is read from memory, computed for several time steps in cache, and written back to memory]
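The idea can be illustrated with a small one-dimensional sketch. This is not the scheme used in the talk: it assumes a 3-point averaging stencil, periodic boundaries, and a simple overlapped (redundant-halo) form of temporal blocking, and all function names are hypothetical. Each block is loaded once together with a halo of `nt` cells per side, advanced `nt` steps locally, and written back once.

```python
import numpy as np

def step(u):
    """One explicit 3-point averaging step; the two end cells are consumed."""
    return 0.25 * u[:-2] + 0.5 * u[1:-1] + 0.25 * u[2:]

def naive(u, nt):
    """Reference: nt full sweeps over a periodic array (nt passes over memory)."""
    for _ in range(nt):
        padded = np.concatenate((u[-1:], u, u[:1]))
        u = step(padded)
    return u

def temporally_blocked(u, nt, block):
    """Advance nt steps block by block, touching main memory once per block."""
    n = len(u)
    out = np.empty_like(u)
    for start in range(0, n, block):
        stop = min(start + block, n)
        # Load the block plus a halo of nt cells per side (periodic wrap).
        idx = np.arange(start - nt, stop + nt) % n
        work = u[idx]
        # Each local step shrinks the working array by one cell per side,
        # so after nt steps exactly the block itself remains.
        for _ in range(nt):
            work = step(work)
        out[start:stop] = work
    return out

u0 = np.random.default_rng(0).random(64)
print(np.allclose(naive(u0, 4), temporally_blocked(u0, 4, 16)))
```

The halo cells are recomputed by neighbouring blocks, which is the kind of redundant calculation that grows with the number of blocked time steps.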
  • 13. Various Methods of TB [Figure labels: a variation of this is used for inter-node communication; this is used for in-node computation]
  • 14. Detail of Our TB Calculation  Inter-node communication  Each node sends data in one direction  Each node receives data from one direction  Simple communication-computation overlapping
  • 15. Details of Our TB Calculation  Computation starts from the right-most block  The upper-right part of the parallelogram uses dummy data to equalize all loop lengths  The gray part holds unnecessary results  This method adds only a little extra computation
  • 16. SL4TH3 Scheme  Fourth order accuracy  Number of stencil = 2  Flop per cell per step ≒ 2800  Required B/F ≒ 0.05 w/o TB
  • 18. Formura: a Framework for TB  From a description of a stencil written in the Formura DSL, optimized distributed parallel code for large-scale parallel computers is generated  In this work, we added support for the TB method and developed a device kernel code generator for PEZY-SC2 [Figure: workflow from the Formura DSL, via formura, to MPI driver code and TB kernel code, then via gcc/mpicc to the executable]
  • 19. Code generation by Formura [Figure: input equation (needs some more configuration files); a very small part of the generated C code; zoomed-in output]
  • 20. Code Generation  Formura generates:  Driver code that distributes TB over MPI  Optimized kernel code for node-local computations  For a new accelerator (or any other processor), we can add a backend by modifying the code that calculates temporal blocking steps  Typically, the major optimizations for each device are the blocking layout for data-access locality and thread scheduling
  • 21. Optimization for PEZY-SC2  Decide the block size  Size of block is smaller than LLC size  Parallelism close to number of PEs for load-balancing  44×44×44 is the best block size  44^3 × 10 (variables/cell) × 8 Byte = 6.50 MB  6.50 MB × 2 (for overlapping read and write) < 32 MB = LLC size  44^2 (parallelism) = 1936 ≒ 1984 (number of PEs)  Total 880^3 (880 = 44×20) cells per node < 64 GB  Allocate adjacent cells in PEs which share L2  Decrease inner-most loop instruction size  PEZY-SC2’s L1 I-Cache size = 4 KB (= 1024 ops)  SL4TH3 requires 2800 ops/cell
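The block-size constraints on this slide can be checked with a few lines of arithmetic. The sketch below uses only the numbers quoted on the slide; it assumes that "MB" there means 2^20 bytes (which is what reproduces 6.50), and the variable names are illustrative.

```python
BLOCK = 44               # block edge length, from the slide
VARS_PER_CELL = 10       # variables per cell, from the slide
BYTES_PER_VAR = 8        # double precision
MIB = 1024 ** 2          # assumption: the slide's "MB" is 2**20 bytes
LLC_BYTES = 32 * MIB     # last-level cache size, from the slide
NUM_PES = 1984           # processor elements per chip, from the slide

block_bytes = BLOCK ** 3 * VARS_PER_CELL * BYTES_PER_VAR
double_buffered = 2 * block_bytes      # one buffer read while another is written
parallelism = BLOCK ** 2               # independent work items per block
node_cells = (BLOCK * 20) ** 3         # 880^3 cells per node
node_bytes = node_cells * VARS_PER_CELL * BYTES_PER_VAR

print(f"block size      : {block_bytes / MIB:.2f} MiB")
print(f"double buffered : {double_buffered / MIB:.2f} MiB, fits LLC: {double_buffered < LLC_BYTES}")
print(f"parallelism     : {parallelism} of {NUM_PES} PEs")
print(f"per-node data   : {node_bytes / 1e9:.1f} GB")
```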
  • 23. Benchmark Results  Conditions  SL4TH3 scheme  Optimized backend for PEZY-SC2  8000 PEZY-SC2s (20×20×20 layout) on Gyoukou  ≒ 16M cores  Total 17600^3 cells (17600 = 880×20)  Performance results  4.78 PFlops  21.5% efficiency (22.2 PFlops theoretical peak)
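The quoted efficiency follows from the chip count and per-chip peak. In the sketch below, the rate of 2 double-precision flops per PE per cycle is an assumption chosen because it reproduces the 2.8 TFlops per-chip peak at 700 MHz; it is not stated on the slides.

```python
PES_PER_CHIP = 1984      # from the slides
CLOCK_HZ = 700e6         # 700 MHz, from the slides
FLOPS_PER_CYCLE = 2      # assumption: reproduces the quoted 2.8 TFlops per chip
CHIPS = 8000             # 20 x 20 x 20 layout
ACHIEVED_FLOPS = 4.78e15

peak_flops = PES_PER_CHIP * CLOCK_HZ * FLOPS_PER_CYCLE * CHIPS
efficiency = ACHIEVED_FLOPS / peak_flops
total_cores = PES_PER_CHIP * CHIPS

print(f"theoretical peak: {peak_flops / 1e15:.1f} PFlops")
print(f"efficiency      : {efficiency * 100:.1f}%")
print(f"cores           : {total_cores / 1e6:.1f}M")
```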
  • 24. Effect of Temporal Blocking  Required B/F by NT (size per node = 880^3):
      NT   Redundant calculation by TB   Required B/F
      1    1.4%                          0.058
      2    2.7%                          0.029
      3    4.0%                          0.020
      4    5.4%                          0.015
      5    6.7%                          0.012
      6    8.0%                          0.010
      7    9.2%                          0.009
      8    10.5%                         0.008
     [Figure: calculation speed (GF) by NT, where NT is the time step parameter]
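The required B/F values in this table can be approximated with a very rough traffic model. The sketch below assumes each cell is read once and written once per NT time steps, using the 10 variables per cell, 8 bytes per variable, and 2800 flops per cell per step quoted elsewhere in the slides. It ignores halo traffic and redundant computation, so it only tracks the table approximately; the function name is hypothetical.

```python
FLOP_PER_CELL_STEP = 2800    # SL4TH3, from the slides
BYTES_PER_CELL = 10 * 8      # 10 variables per cell, 8 bytes each, from the slides

def required_bf(nt):
    """Rough model: one read and one write of every cell per nt time steps."""
    traffic = 2 * BYTES_PER_CELL    # assumption: halo traffic is neglected
    return traffic / (FLOP_PER_CELL_STEP * nt)

# Required B/F values quoted on the slide for NT = 1..8.
slide_bf = [0.058, 0.029, 0.020, 0.015, 0.012, 0.010, 0.009, 0.008]

for nt, quoted in enumerate(slide_bf, start=1):
    print(f"NT={nt}: model {required_bf(nt):.4f}, slide {quoted:.3f}")
```

The model falls slightly below the quoted values at every NT, which is consistent with it leaving out the halo data that each block must also load.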
  • 25. Comparison with Other Studies  This work achieves very high efficiency  The result is comparable to those on a very high B/F (= 0.5) system [Figure: efficiency (%) and device B/F for Yashiro et al., Yang et al., Hotta et al., and this work]
  • 26. Weak Scaling  The communication is completely hidden by computation  Thus, even though the actual communication time increases as the number of nodes increases, weak scaling of the performance is good [Figure: communication time and total time]
  • 28. Future Work  Other schemes  HLLD  Other applications  Tsunami (shallow-water equation)  Reaction-diffusion system  Further performance improvement
  • 29. Conclusion  We have achieved 4.78 PFlops, 21.5% of peak performance, with the fluid simulation code on the large-scale PEZY-SC2 based system  We developed an automatic code generation framework for TB, a scheme suitable for it, and a backend for the PEZY-SC2 accelerator  Our achieved efficiency is comparable to other work on high B/F systems