GPU acceleration of a non-hydrostatic ocean model with a multigrid Poisson/Helmholtz solver
Takateru Yamagishi (1), Yoshimasa Matsumura (2)
(1) Research Organization for Information Science and Technology
(2) Institute of Low Temperature Science, Hokkaido University
6th International Workshop on Advances in High-Performance Computational Earth Sciences: Applications & Frameworks
Table of Contents
Motivation
Numerical ocean model ‘kinaco’
GPU implementation and Optimization
Evaluation and validation
Summary
Motivation
Significance of numerical ocean modelling
Global climate, weather, marine resources, etc.
High computational performance of GPUs
Enables finer and more detailed representation, longer simulations, and more experiment cases
Previous studies
Bleichrodt et al. (2012), Milakov et al. (2013), Werkhoven et al. (2013), Xu et al. (2015)
They demonstrated high performance, but were limited to experimental studies
We aim at realistic and practical studies
Non-hydrostatic numerical ocean model ‘kinaco’
Formation of Antarctic bottom water in the southern Weddell Sea
We aim to accelerate this model on the GPU
Basic equation of dynamics in kinaco
3D Navier-Stokes equation
Fluid dynamics
Poisson/Helmholtz equation
$\Delta p = f, \quad (\Delta + \lambda)\, h = 0$ (p, f, and λ stand in for symbols lost from the slide)
Discretization
Stencil access to the 6 adjacent grid points
Solving a system of equations: Ax=b
Sparse matrix-vector multiplication
An efficient solver for Ax=b is required
CG method with multigrid preconditioner (MGCG)
Fast and scalable iterative method
Matsumura and Hasumi (2008)
Preconditioner: multigrid method
Solves the equation on grids of various resolutions
[Figure: multigrid method]
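The structure of a preconditioned CG solver for Ax=b can be sketched as follows. This is a minimal illustration in plain Fortran 90, not the kinaco solver: the operator is a hypothetical 1D Helmholtz-like tridiagonal matrix, and a Jacobi (diagonal) preconditioner stands in for the multigrid preconditioner. The names `pcg_sketch`, `apply_a`, `apply_precond`, and `pcg` are assumptions made for this example.

```fortran
module pcg_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! y = A x for an assumed 1D operator: (2 + c(i)) on the diagonal, -1 next to it.
  subroutine apply_a(c, x, y)
    real(dp), intent(in)  :: c(:), x(:)
    real(dp), intent(out) :: y(:)
    integer :: i, n
    n = size(x)
    do i = 1, n
      y(i) = (2.0_dp + c(i)) * x(i)
      if (i > 1) y(i) = y(i) - x(i-1)
      if (i < n) y(i) = y(i) - x(i+1)
    end do
  end subroutine apply_a

  ! z = M^{-1} r. Stand-in: Jacobi (diagonal) preconditioner, NOT multigrid.
  subroutine apply_precond(c, r, z)
    real(dp), intent(in)  :: c(:), r(:)
    real(dp), intent(out) :: z(:)
    z = r / (2.0_dp + c)
  end subroutine apply_precond

  ! Preconditioned CG. x holds the initial guess on entry and the solution on
  ! exit; n_iter returns the number of iterations performed.
  subroutine pcg(c, b, x, tol, max_iter, n_iter)
    real(dp), intent(in)    :: c(:), b(:), tol
    integer,  intent(in)    :: max_iter
    real(dp), intent(inout) :: x(:)
    integer,  intent(out)   :: n_iter
    real(dp) :: r(size(b)), z(size(b)), p(size(b)), q(size(b))
    real(dp) :: rz, rz_new, alpha, bnorm

    call apply_a(c, x, q)
    r = b - q                      ! initial residual
    call apply_precond(c, r, z)
    p = z
    rz = dot_product(r, z)
    bnorm = sqrt(dot_product(b, b))
    n_iter = 0
    do while (n_iter < max_iter .and. sqrt(dot_product(r, r)) > tol * bnorm)
      call apply_a(c, p, q)        ! the sparse matrix-vector product
      alpha = rz / dot_product(p, q)
      x = x + alpha * p
      r = r - alpha * q
      call apply_precond(c, r, z)  ! the preconditioner is applied once per iteration
      rz_new = dot_product(r, z)
      p = z + (rz_new / rz) * p
      rz = rz_new
      n_iter = n_iter + 1
    end do
  end subroutine pcg
end module pcg_sketch
```

Calling `pcg` with `x` set to zero and a relative tolerance such as `1.0e-10_dp` returns the solution in `x`. Swapping the body of `apply_precond` for a multigrid cycle would give the MGCG structure without changing the outer loop.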
Implementation on the GPU
CUDA Fortran
kinaco is written in Fortran 90
CUDA features are available, almost the same as in CUDA C
Follows the original structure of the CPU code
Good performance relative to the CPU is achieved
We aimed at further acceleration!
Optimization of the MGCG solver
Cost of the MGCG solver: 21% of the total simulation time
Mainly consists of sparse matrix-vector multiplication
Optimization
1. Memory access
2. Hiding latency by thread-level/instruction-level parallelism
3. Mixed-precision preconditioner for MGCG
Memory access in CPU kernel
DO k=1, n3
  DO j=1, n2
    DO i=1, n1
      out(i,j,k) = a(-3,i,j,k) * x(i,  j,  k-1) &
                 + a(-2,i,j,k) * x(i,  j-1,k  ) &
                 + a(-1,i,j,k) * x(i-1,j,  k  ) &
                 + a( 0,i,j,k) * x(i,  j,  k  ) &
                 + a( 1,i,j,k) * x(i+1,j,  k  ) &
                 + a( 2,i,j,k) * x(i,  j+1,k  ) &
                 + a( 3,i,j,k) * x(i,  j,  k+1)
    END DO
  END DO
END DO
Sparse matrix-vector kernel in the CPU code
[Figure: location of the matrix coefficients a(-3,i,j,k) to a(3,i,j,k), indexed -3 to 3, on the stencil]
A CPU thread loads the coefficients of array ‘a’ in a cache line.
Memory access in GPU kernel
Array layout changed: a(-3:3,i,j,k) -> a(i,j,k,-3:3)
Before, a(-3:3,i,j,k): each GPU thread accesses array “a” at a stride of 7 elements.
thread(id): a(-3,i,j,k)   thread(id+1): a(-3,i+1,j,k)   thread(id+2): a(-3,i+2,j,k)
After, a(i,j,k,-3:3): coalesced access to array “a”.
thread(id): a(i,j,k,-3)   thread(id+1): a(i+1,j,k,-3)   thread(id+2): a(i+2,j,k,-3)
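The difference between the two layouts can be made concrete by computing the column-major memory offsets that Fortran assigns to each element. The sketch below is plain Fortran 90 written for this note. The names `layout_sketch`, `offset_stencil_first`, `offset_stencil_last`, and `to_stencil_last` are hypothetical, and the copy routine is only one possible way to convert between the layouts.

```fortran
module layout_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! 0-based column-major offset of a(m,i,j,k) when declared a(-3:3, n1, n2, n3).
  integer function offset_stencil_first(m, i, j, k, n1, n2) result(off)
    integer, intent(in) :: m, i, j, k, n1, n2
    off = (m + 3) + 7 * ((i - 1) + n1 * ((j - 1) + n2 * (k - 1)))
  end function offset_stencil_first

  ! 0-based column-major offset of a(i,j,k,m) when declared a(n1, n2, n3, -3:3).
  integer function offset_stencil_last(i, j, k, m, n1, n2, n3) result(off)
    integer, intent(in) :: i, j, k, m, n1, n2, n3
    off = (i - 1) + n1 * ((j - 1) + n2 * ((k - 1) + n3 * (m + 3)))
  end function offset_stencil_last

  ! Copy coefficients from the CPU layout to the layout used for coalescing.
  subroutine to_stencil_last(a_cpu, a_gpu)
    real(dp), intent(in)  :: a_cpu(-3:, :, :, :)
    real(dp), intent(out) :: a_gpu(:, :, :, -3:)
    integer :: m
    do m = -3, 3
      a_gpu(:, :, :, m) = a_cpu(m, :, :, :)
    end do
  end subroutine to_stencil_last
end module layout_sketch
```

With the stencil index first, stepping from i to i+1 moves 7 elements in memory, so neighbouring threads read addresses 7 elements apart. With the stencil index last, the same step moves 1 element, so neighbouring threads read neighbouring addresses.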
Hiding latency by thread-level/instruction-level parallelism
Hiding latency = doing other operations while waiting
Thread-level parallelism
Switch between threads to hide latency
Instruction-level parallelism (Volkov, 2010)
One thread with several independent operations
Comparison of the two types of parallelism
Case 1: Thread-level parallelism
i = threadidx%x + blockdim%x * (blockidx%x-1)
j = threadidx%y + blockdim%y * (blockidx%y-1)
k = threadidx%z + blockdim%z * (blockidx%z-1)
out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
           + a(i,j,k,-2) * x(i,  j-1,k  ) &
           + a(i,j,k,-1) * x(i-1,j,  k  ) &
           + a(i,j,k, 0) * x(i,  j,  k  ) &
           + a(i,j,k, 1) * x(i+1,j,  k  ) &
           + a(i,j,k, 2) * x(i,  j+1,k  ) &
           + a(i,j,k, 3) * x(i,  j,  k+1)
Set as many threads as possible (i, j, k)
• 3D (i, j, k) threads are set
• One thread for one grid point
Hide latency by switching among many threads
Case 2: Instruction-level parallelism
Independent operations are repeated
i = threadidx%x + blockdim%x * (blockidx%x-1)
j = threadidx%y + blockdim%y * (blockidx%y-1)
DO k=1, n3
  out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
             + a(i,j,k,-2) * x(i,  j-1,k  ) &
             + a(i,j,k,-1) * x(i-1,j,  k  ) &
             + a(i,j,k, 0) * x(i,  j,  k  ) &
             + a(i,j,k, 1) * x(i+1,j,  k  ) &
             + a(i,j,k, 2) * x(i,  j+1,k  ) &
             + a(i,j,k, 3) * x(i,  j,  k+1)
END DO
Hide latency with independent instructions
• 2D (i, j) threads are set
• One thread for one column (i, j)
Case 2 is faster
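The one-thread-per-column mapping of Case 2 can be emulated on a CPU to check the index arithmetic. The sketch below is plain Fortran 90, not CUDA Fortran: the dummy arguments `tx`, `ty`, `bx`, `by`, `bdx`, `bdy` stand in for `threadidx`, `blockidx`, and `blockdim`, and nested loops stand in for the kernel launch. The halo of width 1 around `x` and the guard against threads outside the grid are assumptions added for this example; the slides do not show how the real kernel handles boundaries.

```fortran
module column_kernel_sketch
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains

  ! Work of one emulated GPU thread in the one-thread-per-column mapping.
  ! All thread and block indices are 1-based, as in CUDA Fortran.
  subroutine column_thread(tx, ty, bx, by, bdx, bdy, n1, n2, n3, a, x, out)
    integer,  intent(in)    :: tx, ty, bx, by, bdx, bdy, n1, n2, n3
    real(dp), intent(in)    :: a(n1, n2, n3, -3:3)
    real(dp), intent(in)    :: x(0:n1+1, 0:n2+1, 0:n3+1)  ! assumed halo of width 1
    real(dp), intent(inout) :: out(n1, n2, n3)
    integer :: i, j, k

    i = tx + bdx * (bx - 1)
    j = ty + bdy * (by - 1)
    if (i > n1 .or. j > n2) return  ! threads past the grid edge do nothing

    do k = 1, n3  ! the iterations do not depend on each other
      out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
                 + a(i,j,k,-2) * x(i,  j-1,k  ) &
                 + a(i,j,k,-1) * x(i-1,j,  k  ) &
                 + a(i,j,k, 0) * x(i,  j,  k  ) &
                 + a(i,j,k, 1) * x(i+1,j,  k  ) &
                 + a(i,j,k, 2) * x(i,  j+1,k  ) &
                 + a(i,j,k, 3) * x(i,  j,  k+1)
    end do
  end subroutine column_thread

  ! Emulated launch: enough bdx-by-bdy blocks to cover the (n1, n2) plane.
  subroutine launch_columns(bdx, bdy, n1, n2, n3, a, x, out)
    integer,  intent(in)    :: bdx, bdy, n1, n2, n3
    real(dp), intent(in)    :: a(n1, n2, n3, -3:3)
    real(dp), intent(in)    :: x(0:n1+1, 0:n2+1, 0:n3+1)
    real(dp), intent(inout) :: out(n1, n2, n3)
    integer :: tx, ty, bx, by

    do by = 1, (n2 + bdy - 1) / bdy
      do bx = 1, (n1 + bdx - 1) / bdx
        do ty = 1, bdy
          do tx = 1, bdx
            call column_thread(tx, ty, bx, by, bdx, bdy, n1, n2, n3, a, x, out)
          end do
        end do
      end do
    end do
  end subroutine launch_columns
end module column_kernel_sketch
```

Because each pass through the `k` loop writes a different `out(i,j,k)` and reads only `a` and `x`, the passes are independent, which is the property instruction-level parallelism relies on.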
Mixed precision for multigrid preconditioning
Low precision: makes better use of GPU resources
Preconditioning: low precision is sufficient
GPU: performance deteriorates on coarse grids
[Figure: multigrid method]
Number of CG iterations is unchanged with and without mixed precision
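The idea of running only the preconditioner in low precision can be sketched as follows: the residual is demoted to single precision, the preconditioner works entirely in single precision, and the result is promoted back for the double-precision CG loop. This is plain Fortran 90 under stated assumptions. A few weighted-Jacobi sweeps on a hypothetical 1D operator stand in for the multigrid cycle, and the names `mixed_precision_sketch`, `precond_single`, and `precond_double` are made up for this example.

```fortran
module mixed_precision_sketch
  implicit none
  integer, parameter :: sp = kind(1.0e0)  ! single precision
  integer, parameter :: dp = kind(1.0d0)  ! double precision
contains

  ! Approximate z = A^{-1} r in SINGLE precision for the assumed 1D operator
  ! (A x)(i) = d(i)*x(i) - x(i-1) - x(i+1). Weighted-Jacobi sweeps stand in
  ! for the multigrid cycle (an assumption, not kinaco's scheme).
  subroutine precond_single(d, r, z, n_sweeps)
    real(dp), intent(in)  :: d(:), r(:)
    real(dp), intent(out) :: z(:)
    integer,  intent(in)  :: n_sweeps
    real(sp) :: ds(size(r)), rs(size(r)), zs(size(r)), res(size(r))
    real(sp), parameter :: omega = 2.0_sp / 3.0_sp
    integer :: i, n, s

    n  = size(r)
    ds = real(d, sp)   ! demote the inputs once
    rs = real(r, sp)
    zs = 0.0_sp
    do s = 1, n_sweeps
      do i = 1, n      ! residual of the current guess, all in single precision
        res(i) = rs(i) - ds(i) * zs(i)
        if (i > 1) res(i) = res(i) + zs(i-1)
        if (i < n) res(i) = res(i) + zs(i+1)
      end do
      zs = zs + omega * res / ds
    end do
    z = real(zs, dp)   ! promote the result for the double-precision CG loop
  end subroutine precond_single

  ! The same sweeps carried out in double precision, for comparison.
  subroutine precond_double(d, r, z, n_sweeps)
    real(dp), intent(in)  :: d(:), r(:)
    real(dp), intent(out) :: z(:)
    integer,  intent(in)  :: n_sweeps
    real(dp) :: res(size(r))
    real(dp), parameter :: omega = 2.0_dp / 3.0_dp
    integer :: i, n, s

    n = size(r)
    z = 0.0_dp
    do s = 1, n_sweeps
      do i = 1, n
        res(i) = r(i) - d(i) * z(i)
        if (i > 1) res(i) = res(i) + z(i-1)
        if (i < n) res(i) = res(i) + z(i+1)
      end do
      z = z + omega * res / d
    end do
  end subroutine precond_double
end module mixed_precision_sketch
```

Only the approximate correction `z` is computed in single precision. The outer CG loop still evaluates the true residual in double precision, so the final accuracy is set by the CG stopping criterion rather than by the preconditioner's precision.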
Evaluation: experimental setting
CPU (Fujitsu SPARC64 VIIIfx) vs GPU (NVIDIA K20c)
1 CPU vs 1 GPU
Study of baroclinic instability
Visbeck et al. (1996)
Forcing: Coriolis force, temperature forcing
Structured, isotropic domain
Size: (256, 256, 32)
Time step: 2 min
Simulation time: 5 hours (150 steps) and 5 days (3600 steps)
Performance
Elapsed time [s]: CPU vs GPU, 5 hours (150 steps)

                           CPU     GPU_1   GPU_2   GPU_3   Speedup (GPU_3)
all components             174.2   42.6    39.2    37.3    4.7
Poisson/Helmholtz solver   36.8    15.8    12.4    10.5    3.5
others                     137.4   26.9    26.8    26.8    5.1

CPU  : original CPU code
GPU_1: basic, typical implementation on the GPU
GPU_2: GPU_1 + memory optimization and latency hiding
GPU_3: GPU_2 + mixed-precision preconditioning
The GPU achieved a 4.7x speedup over the CPU
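The speedup figures can be checked directly from the elapsed times reported above, taking the CPU time divided by the GPU_3 time. The solver share of the total run time follows from the same numbers:

```latex
\begin{align*}
\text{all components:} \quad & \frac{174.2}{37.3} \approx 4.7 \\
\text{Poisson/Helmholtz solver:} \quad & \frac{36.8}{10.5} \approx 3.5 \\
\text{others:} \quad & \frac{137.4}{26.8} \approx 5.1 \\
\text{solver share, CPU:} \quad & \frac{36.8}{174.2} \approx 21\% \\
\text{solver share, GPU\_3:} \quad & \frac{10.5}{37.3} \approx 28\%
\end{align*}
```

The solver share rises from about 21% on the CPU to about 28% in GPU_3 because the other components speed up more (5.1x) than the solver does (3.5x).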
Surface ocean current/velocity field
[Figure panels: GPU_3, GPU_2, CPU]
Good reproduction of growing meanders due to baroclinic instability
Temperature at the cross section
[Figure panels: CPU, GPU_2, GPU_2]
Good reproduction of vertical convection of water
Summary and future work
Numerical ocean model on the GPU (K20c) vs the CPU (SPARC64 VIIIfx)
4.7x faster than the CPU
Errors due to the GPU implementation are not significant for oceanic studies
Future work
Application of mixed precision to other kernels
MPI implementation
Realistic experiments
 
Biochemistry Lesson_Molecular Polarity.ppt
Biochemistry Lesson_Molecular Polarity.pptBiochemistry Lesson_Molecular Polarity.ppt
Biochemistry Lesson_Molecular Polarity.ppt
ErPri1
 
Cordaitales - Yudhvir Singh Checked[1].pptx gymnosperms
Cordaitales - Yudhvir Singh Checked[1].pptx gymnospermsCordaitales - Yudhvir Singh Checked[1].pptx gymnosperms
Cordaitales - Yudhvir Singh Checked[1].pptx gymnosperms
ReetikaMakkar
 
Antimalarial drug Medicinal Chemistry III
Antimalarial drug Medicinal Chemistry IIIAntimalarial drug Medicinal Chemistry III
Antimalarial drug Medicinal Chemistry III
HRUTUJA WAGH
 
2. peptic ulcer (1) (1) for Pharm D .pptx
2. peptic ulcer (1) (1) for Pharm D .pptx2. peptic ulcer (1) (1) for Pharm D .pptx
2. peptic ulcer (1) (1) for Pharm D .pptx
fafyfskhan251kmf
 
Chapter-10-Light-reflection-and-refraction.ppt
Chapter-10-Light-reflection-and-refraction.pptChapter-10-Light-reflection-and-refraction.ppt
Chapter-10-Light-reflection-and-refraction.ppt
uniyaladiti914
 
External Application in Homoeopathy- Definition,Scope and Types.
External Application  in Homoeopathy- Definition,Scope and Types.External Application  in Homoeopathy- Definition,Scope and Types.
External Application in Homoeopathy- Definition,Scope and Types.
AdharshnaPatrick
 
Green Synthesis of Gold Nanoparticles.pptx
Green Synthesis of Gold Nanoparticles.pptxGreen Synthesis of Gold Nanoparticles.pptx
Green Synthesis of Gold Nanoparticles.pptx
Torskal Nanoscience
 
Forestry_Exit_Exam_Wollega University_Gimbi Campus.pdf
Forestry_Exit_Exam_Wollega University_Gimbi Campus.pdfForestry_Exit_Exam_Wollega University_Gimbi Campus.pdf
Forestry_Exit_Exam_Wollega University_Gimbi Campus.pdf
ChalaKelbessa
 
Electroencephalogram_ wave components_Aignificancr
Electroencephalogram_ wave components_AignificancrElectroencephalogram_ wave components_Aignificancr
Electroencephalogram_ wave components_Aignificancr
klynct
 
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Chaos and Psychology: Modeling the Human Mind through Nonlinear Dynamical Sys...
Helena Celeste Mata Rico
 
Top 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptxTop 10 Biotech Startups for Beginners.pptx
Top 10 Biotech Startups for Beginners.pptx
alexbagheriam
 
Freud e sua Historia na Psicanalise Psic
Freud e sua Historia na Psicanalise PsicFreud e sua Historia na Psicanalise Psic
Freud e sua Historia na Psicanalise Psic
StefannyGoffi1
 
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.pptSULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
SULPHONAMIDES AND SULFONES Medicinal Chemistry III.ppt
HRUTUJA WAGH
 
Funakoshi_ZymoResearch_2024-2025_catalog
Funakoshi_ZymoResearch_2024-2025_catalogFunakoshi_ZymoResearch_2024-2025_catalog
Funakoshi_ZymoResearch_2024-2025_catalog
fu7koshi
 
university of arizona ~ favor's college candidate project.pptx
university of arizona ~ favor's college candidate project.pptxuniversity of arizona ~ favor's college candidate project.pptx
university of arizona ~ favor's college candidate project.pptx
favoranamelechi107
 

GPU acceleration of a non-hydrostatic ocean model with a multigrid Poisson/Helmholtz solver

  • 1. GPU acceleration of a non-hydrostatic ocean model with a multigrid Poisson/Helmholtz solver. Takateru Yamagishi (1), Yoshimasa Matsumura (2). (1) Research Organization for Information Science and Technology; (2) Institute of Low Temperature Science, Hokkaido University. 6th International Workshop on Advances in High-Performance Computational Earth Sciences: Applications & Frameworks.
  • 2. Table of Contents: Motivation; Numerical ocean model ‘kinaco’; GPU implementation and optimization; Evaluation and validation; Summary.
  • 3. Motivation. Significance of numerical ocean modelling: global climate, weather, marine resources, etc. The GPU's high computational performance enables explicit and detailed representation, long simulations, and many experiment cases. Previous studies: Bleichrodt et al. (2012), Milakov et al. (2013), Werkhoven et al. (2013), Xu et al. (2015). They showed high performance but were limited to experimental studies. We aim at realistic and practical studies.
  • 4. Non-hydrostatic numerical ocean model ‘kinaco’. Example application: formation of Antarctic bottom water in the southern Weddell Sea. We aim to accelerate this model on the GPU.
  • 5. Basic equation of dynamics in kinaco. 3D Navier-Stokes equation (fluid dynamics). Poisson/Helmholtz equation: ∆ = , (∆ + )ℎ = 0. Discretization: stencil access to the 6 adjacent grid points. Solving systems of equations Ax=b: sparse matrix-vector multiplication. An efficient solver for Ax=b is required.
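Written out for one grid point, the sparse matrix-vector product takes the seven-term form used by the kernels later in the deck (slides 9, 12 and 13). The superscript labels on the coefficients are a notational assumption; the slides index them as a(m,i,j,k) with m running from -3 to 3.

```latex
\[
(Ax)_{i,j,k} =
    a^{(-3)}_{i,j,k}\, x_{i,j,k-1}
  + a^{(-2)}_{i,j,k}\, x_{i,j-1,k}
  + a^{(-1)}_{i,j,k}\, x_{i-1,j,k}
  + a^{(0)}_{i,j,k}\, x_{i,j,k}
  + a^{(1)}_{i,j,k}\, x_{i+1,j,k}
  + a^{(2)}_{i,j,k}\, x_{i,j+1,k}
  + a^{(3)}_{i,j,k}\, x_{i,j,k+1}
\]
```

Each row of A therefore holds at most seven nonzeros: the grid point itself and its 6 neighbours.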
  • 6. CG method with multigrid preconditioner (MGCG). Fast and scalable iterative method (Matsumura and Hasumi, 2008). Preconditioner: multigrid method, which solves the equation on grids of various resolutions. [Figure: multigrid method]
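To make the structure of MGCG concrete, here is a minimal sketch in Python rather than the model's CUDA Fortran. It is not kinaco's solver: it assumes a 1D Poisson problem with zero Dirichlet boundaries, a weighted-Jacobi smoother, and one V-cycle per CG iteration as the preconditioner. All function names (`apply_A`, `jacobi`, `v_cycle`, `mgcg`) are hypothetical.

```python
def apply_A(x):
    """Matrix-free y = A x for A = tridiag(-1, 2, -1) / h^2 with h = 1/(n+1)."""
    n = len(x)
    h2 = (1.0 / (n + 1)) ** 2
    return [(2.0 * x[i]
             - (x[i - 1] if i > 0 else 0.0)
             - (x[i + 1] if i < n - 1 else 0.0)) / h2
            for i in range(n)]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def jacobi(x, b, sweeps=2, omega=2.0 / 3.0):
    """Weighted-Jacobi smoothing; the diagonal of A is 2 / h^2."""
    n = len(x)
    h2 = (1.0 / (n + 1)) ** 2
    for _ in range(sweeps):
        Ax = apply_A(x)
        x = [x[i] + omega * (h2 / 2.0) * (b[i] - Ax[i]) for i in range(n)]
    return x


def v_cycle(b):
    """One multigrid V-cycle from a zero initial guess; n must be 2**k - 1."""
    n = len(b)
    if n == 1:
        return [b[0] / 8.0]          # exact solve: h = 1/2, so A = [8]
    x = jacobi([0.0] * n, b)         # pre-smoothing
    Ax = apply_A(x)
    r = [b[i] - Ax[i] for i in range(n)]
    nc = (n - 1) // 2
    # Full-weighting restriction of the residual to the coarse grid.
    rc = [0.25 * r[2 * i] + 0.5 * r[2 * i + 1] + 0.25 * r[2 * i + 2]
          for i in range(nc)]
    ec = v_cycle(rc)                 # coarse-grid correction (recursive)
    # Linear interpolation of the correction back to the fine grid.
    e = [0.0] * n
    for i in range(nc):
        e[2 * i] += 0.5 * ec[i]
        e[2 * i + 1] += ec[i]
        e[2 * i + 2] += 0.5 * ec[i]
    x = [x[i] + e[i] for i in range(n)]
    return jacobi(x, b)              # post-smoothing keeps the cycle symmetric


def mgcg(b, tol=1e-10, max_iter=50):
    """Preconditioned CG where the preconditioner is one V-cycle."""
    n = len(b)
    x = [0.0] * n
    r = list(b)
    z = v_cycle(r)
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        Ap = apply_A(p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = v_cycle(r)
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


solution, iterations = mgcg([1.0] * 63)
print("CG iterations:", iterations)
```

The V-cycle uses the same number of pre- and post-smoothing sweeps so that the preconditioner stays symmetric, which CG requires.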
  • 7. Implementation on the GPU. CUDA Fortran: kinaco is written in Fortran 90, and CUDA features are available in almost the same form as in CUDA C. The port follows the original structure of the CPU code. Good performance vs the CPU is achieved. We aimed at further acceleration!
  • 8. Optimization of the MGCG solver. The cost of the MGCG solver: 21% of the total simulation. It mainly consists of sparse matrix-vector multiplication. Optimization: (1) memory access; (2) hiding latency by thread-level/instruction-level parallelism; (3) mixed precision preconditioner of MGCG.
  • 9. Memory access in CPU kernel. Sparse matrix-vector kernel in the CPU code:
        DO k=1, n3
        DO j=1, n2
        DO i=1, n1
          out(i,j,k) = a(-3,i,j,k) * x(i,  j,  k-1) &
                     + a(-2,i,j,k) * x(i,  j-1,k  ) &
                     + a(-1,i,j,k) * x(i-1,j,  k  ) &
                     + a( 0,i,j,k) * x(i,  j,  k  ) &
                     + a( 1,i,j,k) * x(i+1,j,  k  ) &
                     + a( 2,i,j,k) * x(i,  j+1,k  ) &
                     + a( 3,i,j,k) * x(i,  j,  k+1)
        END DO
        END DO
        END DO
    [Figure: location of the matrix coefficients a(-3,i,j,k) ~ a(3,i,j,k) around a grid point, labelled -3 to 3]
    The CPU thread loads array ‘a’ by cache line; the seven coefficients a(-3:3,i,j,k) of a grid point are contiguous in memory.
  • 10. Memory access in GPU kernel. With the CPU layout a(-3:3,i,j,k), thread(id), thread(id+1), thread(id+2) read a(-3,i,j,k), a(-3,i+1,j,k), a(-3,i+2,j,k): each GPU thread accesses array “a” at intervals of 7 elements. After changing the layout from a(-3:3,i,j,k) to a(i,j,k,-3:3), the same threads read a(i,j,k,-3), a(i+1,j,k,-3), a(i+2,j,k,-3): coalesced access to array “a”.
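The effect of the layout change can be checked with plain index arithmetic. The sketch below is in Python rather than CUDA Fortran, and the helper names are hypothetical. It assumes Fortran column-major storage and array bounds a(-3:3, 1:n1, 1:n2, 1:n3) for the original layout and a(1:n1, 1:n2, 1:n3, -3:3) for the transposed one.

```python
def offset_coef_first(m, i, j, k, n1, n2):
    """Linear offset of a(m,i,j,k) in a(-3:3, 1:n1, 1:n2, 1:n3), column-major."""
    return (m + 3) + 7 * ((i - 1) + n1 * ((j - 1) + n2 * (k - 1)))


def offset_coef_last(i, j, k, m, n1, n2, n3):
    """Linear offset of a(i,j,k,m) in a(1:n1, 1:n2, 1:n3, -3:3), column-major."""
    return (i - 1) + n1 * ((j - 1) + n2 * ((k - 1) + n3 * (m + 3)))


def thread_strides(n1=256, n2=256, n3=32, j=1, k=1, m=-3):
    """Distance in elements between the addresses read by threads i and i+1."""
    first = [offset_coef_first(m, i, j, k, n1, n2) for i in (1, 2)]
    last = [offset_coef_last(i, j, k, m, n1, n2, n3) for i in (1, 2)]
    return first[1] - first[0], last[1] - last[0]


stride_original, stride_transposed = thread_strides()
print(stride_original, stride_transposed)  # prints: 7 1
```

A stride of 1 means that neighbouring threads touch neighbouring elements, which is the condition for coalesced loads.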
  • 11. Hide latency by thread-level/instruction-level parallelism. Hiding latency = doing other operations while waiting on a latency. Thread-level parallelism: switch threads to hide latency. Instruction-level parallelism (Volkov, 2010): one thread with several independent operations. Comparison of the two kinds of parallelism.
  • 12. Case 1: Thread-level parallelism. Launch as many threads as possible, one per (i, j, k):
        i = threadidx%x + blockdim%x * (blockidx%x-1)
        j = threadidx%y + blockdim%y * (blockidx%y-1)
        k = threadidx%z + blockdim%z * (blockidx%z-1)
        out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
                   + a(i,j,k,-2) * x(i,  j-1,k  ) &
                   + a(i,j,k,-1) * x(i-1,j,  k  ) &
                   + a(i,j,k, 0) * x(i,  j,  k  ) &
                   + a(i,j,k, 1) * x(i+1,j,  k  ) &
                   + a(i,j,k, 2) * x(i,  j+1,k  ) &
                   + a(i,j,k, 3) * x(i,  j,  k+1)
    • 3D (i, j, k) threads are set
    • One thread for one grid point
    Hide latency by switching among many threads.
  • 13. Case 2: Instruction-level parallelism. Independent operations are repeated within each thread:
        i = threadidx%x + blockdim%x * (blockidx%x-1)
        j = threadidx%y + blockdim%y * (blockidx%y-1)
        DO k=1, n3
          out(i,j,k) = a(i,j,k,-3) * x(i,  j,  k-1) &
                     + a(i,j,k,-2) * x(i,  j-1,k  ) &
                     + a(i,j,k,-1) * x(i-1,j,  k  ) &
                     + a(i,j,k, 0) * x(i,  j,  k  ) &
                     + a(i,j,k, 1) * x(i+1,j,  k  ) &
                     + a(i,j,k, 2) * x(i,  j+1,k  ) &
                     + a(i,j,k, 3) * x(i,  j,  k+1)
        END DO
    • 2D (i, j) threads are set
    • One thread for one column (i, j)
    Hide latency with independent instructions. Case 2 is faster.
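The two cases differ only in how grid points are assigned to threads. The serial Python emulation below (not CUDA Fortran, and with hypothetical function names) illustrates that work decomposition on a small grid. It assumes a zero halo around x and random coefficients, and it says nothing about actual GPU latency or timing.

```python
import random

N1, N2, N3 = 4, 3, 5  # small illustrative grid, not the (256, 256, 32) domain

# Stencil index m -> (di, dj, dk), matching the kernels on the slides.
OFFSETS = {-3: (0, 0, -1), -2: (0, -1, 0), -1: (-1, 0, 0), 0: (0, 0, 0),
           1: (1, 0, 0), 2: (0, 1, 0), 3: (0, 0, 1)}


def make_fields(seed=0):
    """Random coefficients a(i,j,k,m) and a vector x padded with a zero halo."""
    rng = random.Random(seed)
    x = {(i, j, k): 0.0 for i in range(N1 + 2)
         for j in range(N2 + 2) for k in range(N3 + 2)}
    a = {}
    for i in range(1, N1 + 1):
        for j in range(1, N2 + 1):
            for k in range(1, N3 + 1):
                x[(i, j, k)] = rng.random()
                for m in OFFSETS:
                    a[(i, j, k, m)] = rng.random()
    return a, x


def stencil_point(a, x, i, j, k):
    """Work for one grid point: seven multiply-adds."""
    return sum(a[(i, j, k, m)] * x[(i + di, j + dj, k + dk)]
               for m, (di, dj, dk) in OFFSETS.items())


def case1(a, x):
    """Case 1: one logical thread per grid point (i, j, k)."""
    out, threads = {}, 0
    for k in range(1, N3 + 1):
        for j in range(1, N2 + 1):
            for i in range(1, N1 + 1):
                threads += 1
                out[(i, j, k)] = stencil_point(a, x, i, j, k)
    return out, threads


def case2(a, x):
    """Case 2: one logical thread per column (i, j), looping over k."""
    out, threads = {}, 0
    for j in range(1, N2 + 1):
        for i in range(1, N1 + 1):
            threads += 1
            for k in range(1, N3 + 1):   # independent iterations in one thread
                out[(i, j, k)] = stencil_point(a, x, i, j, k)
    return out, threads


a, x = make_fields()
out1, threads1 = case1(a, x)
out2, threads2 = case2(a, x)
print(threads1, threads2, out1 == out2)
```

Case 2 launches n3 times fewer threads, and each thread carries n3 independent stencil evaluations that the hardware can overlap.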
  • 14. Mixed precision for multigrid preconditioning. Low precision makes better use of GPU resources. Preconditioning: low precision is sufficient. GPU: performance deteriorates on the coarse grids of the multigrid method. The number of iterations in the CG method is unchanged with or without mixed precision.
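The idea of keeping the CG iteration in double precision while applying the preconditioner in single precision can be sketched as follows. This is an illustration in Python, not the kinaco implementation: it swaps the multigrid preconditioner for a simple Jacobi (diagonal) one, uses an assumed 1D tridiagonal test matrix, and emulates single precision by round-tripping values through `struct`. All names are hypothetical.

```python
import struct


def to_single(v):
    """Round a Python float (double) to the nearest IEEE single-precision value."""
    return struct.unpack("f", struct.pack("f", v))[0]


def apply_A(diag, x):
    """y = A x for A = tridiag(-1, diag, -1), in double precision."""
    n = len(x)
    return [diag[i] * x[i]
            - (x[i - 1] if i > 0 else 0.0)
            - (x[i + 1] if i < n - 1 else 0.0)
            for i in range(n)]


def jacobi_precond(diag, r, single):
    """z = D^{-1} r, optionally with inputs and result rounded to single."""
    if not single:
        return [r[i] / diag[i] for i in range(len(r))]
    return [to_single(to_single(r[i]) / to_single(diag[i]))
            for i in range(len(r))]


def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))


def pcg(diag, b, single_precond, tol=1e-12, max_iter=500):
    """Preconditioned CG; only the preconditioner changes precision."""
    n = len(b)
    x = [0.0] * n
    r = list(b)
    z = jacobi_precond(diag, r, single_precond)
    p = list(z)
    rz = dot(r, z)
    b_norm = dot(b, b) ** 0.5
    for it in range(1, max_iter + 1):
        Ap = apply_A(diag, p)
        alpha = rz / dot(p, Ap)
        x = [x[i] + alpha * p[i] for i in range(n)]
        r = [r[i] - alpha * Ap[i] for i in range(n)]
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, it
        z = jacobi_precond(diag, r, single_precond)
        rz_new = dot(r, z)
        p = [z[i] + (rz_new / rz) * p[i] for i in range(n)]
        rz = rz_new
    return x, max_iter


n = 50
diag = [2.0 + 0.1 * (i + 1) for i in range(n)]   # strictly diagonally dominant
b = [1.0] * n
x_double, it_double = pcg(diag, b, single_precond=False)
x_mixed, it_mixed = pcg(diag, b, single_precond=True)
print("iterations (double, mixed):", it_double, it_mixed)
```

Because the residual and the solution update stay in double precision, the rounding in the preconditioner affects only the search direction, not the accuracy the iteration can reach.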
  • 15. Evaluation, experimental setting. CPU (Fujitsu SPARC64 VIIIfx) vs GPU (NVIDIA K20c): 1 CPU vs 1 GPU. Study of baroclinic instability, Visbeck et al. (1996). Forcing: Coriolis force, temperature forcing. Structured, isotropic domain of size (256, 256, 32). Time step: 2 min. Simulation time: 5 hours (150 steps) and 5 days (3600 steps).
  • 16. Performance. Elapsed time [s], CPU vs GPU, for the 5-hour run (150 steps):
                                    CPU   GPU_1   GPU_2   GPU_3   Speedup (GPU_3)
      all components              174.2    42.6    39.2    37.3   4.7
      Poisson/Helmholtz solver     36.8    15.8    12.4    10.5   3.5
      others                      137.4    26.9    26.8    26.8   5.1
    CPU: original CPU code
    GPU_1: basic and typical implementation on the GPU
    GPU_2: GPU_1 + memory optimization and latency hiding
    GPU_3: GPU_2 + mixed precision preconditioning
    The GPU achieved a 4.7 times speedup vs the CPU.
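The speedup column and the solver's share of the run time follow directly from the elapsed times on slide 16. A short Python check, with hypothetical helper names and the numbers copied from that slide:

```python
# Elapsed times in seconds for the 5-hour (150-step) run, from slide 16.
ELAPSED = {
    "all components": {"CPU": 174.2, "GPU_1": 42.6, "GPU_2": 39.2, "GPU_3": 37.3},
    "Poisson/Helmholtz solver": {"CPU": 36.8, "GPU_1": 15.8, "GPU_2": 12.4, "GPU_3": 10.5},
    "others": {"CPU": 137.4, "GPU_1": 26.9, "GPU_2": 26.8, "GPU_3": 26.8},
}


def speedup(component, variant="GPU_3"):
    """CPU time divided by the time of the given GPU variant."""
    times = ELAPSED[component]
    return times["CPU"] / times[variant]


def solver_share(variant):
    """Fraction of the total elapsed time spent in the Poisson/Helmholtz solver."""
    return (ELAPSED["Poisson/Helmholtz solver"][variant]
            / ELAPSED["all components"][variant])


for name in ELAPSED:
    print(f"{name}: x{speedup(name):.1f}")
print(f"solver share on the CPU: {solver_share('CPU'):.0%}")
```

The solver's share on the CPU comes out at 21%, which agrees with the cost quoted on slide 8.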
  • 17. Surface ocean current/velocity field. [Figure panels: CPU, GPU_2, GPU_3] Good reproduction of growing meanders due to baroclinic instability.
  • 18. Temperature at the cross section. Good reproduction of vertical convection of water. [Figure panels: CPU, GPU_2, GPU_2]
  • 19. Summary and future works. Numerical ocean model on the GPU (K20c) vs the CPU (SPARC64 VIIIfx): 4.7 times faster than the CPU. The errors due to the GPU implementation are not significant for oceanic studies. Further works: application of mixed precision to other kernels; MPI implementation; realistic experiments.