www.univa.com
Ian Lumb
Solutions Architect
SUSE, Booth #1681
SC17, Denver, CO
Managing Containerized HPC and AI Workloads on TSUBAME3.0
Copyright © Univa Corporation, 2017. All Rights Reserved. Internal Use Only.
TSUBAME 3.0 - Compute Node Overview
A compute-node:
■ 256 GB DDR4 RAM
■ 2 TB SSDs
■ 2x 14 cores
■ 4x GPUs
■ 4x HFI (100 Gbps each)
⇒ This is what they call a “fat compute node”
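To make the "fat node" figures concrete, the short sketch below simply divides the slide's numbers into per-GPU shares. It is arithmetic only: the variable names are invented for illustration, and the even split between GPUs is an assumption rather than a statement about how the node is physically wired.

```python
# Per-node figures taken from the slide above.
sockets = 2
cores_per_socket = 14
gpus = 4
hfis = 4
ram_gb = 256

total_cores = sockets * cores_per_socket   # all CPU cores in one node

# Assumption: resources are shared out evenly among the GPUs.
cores_per_gpu = total_cores // gpus
gpus_per_socket = gpus // sockets
hfis_per_gpu = hfis // gpus
ram_per_gpu_gb = ram_gb // gpus

print(f"{total_cores} cores per node")
print(f"{cores_per_gpu} cores, {hfis_per_gpu} HFI and {ram_per_gpu_gb} GB RAM per GPU")
print(f"{gpus_per_socket} GPUs per socket")
```

Under that even split, each GPU is paired with seven cores and one HFI, which is the granularity a scheduler has to reason about when it places jobs on parts of a node.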
TSUBAME 3.0 - The Challenges
12.2 PetaFLOPS within only 20 racks or 540 compute nodes
➢ It is the smallest >10 PFLOPS machine in the world
➢ Wasted/unreachable resources (parts of a node) have a much bigger impact on such a “small” cluster
➢ Performance is also highly dependent on job placement due to additional resources, such as GPUs and HFI devices (the closer, the better)
➢ It needs smart and flexible partitioning to ensure high utilization
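The impact of stranded resources can be roughed out from the figures on this slide. The sketch below is a back-of-the-envelope calculation, not a measurement: the four GPUs per node come from the compute-node overview, the per-node peak is a plain average over all nodes, and `stranded_share` is a hypothetical helper introduced only for illustration.

```python
PEAK_PFLOPS = 12.2    # from the slide
RACKS = 20            # from the slide
NODES = 540           # from the slide
GPUS_PER_NODE = 4     # from the compute-node overview

nodes_per_rack = NODES / RACKS
# Average peak per node, assuming the peak is spread evenly over all nodes.
peak_tflops_per_node = PEAK_PFLOPS * 1000 / NODES


def stranded_share(stranded_gpus_per_node: int) -> float:
    """Fraction of all GPUs left unusable if every node strands this many."""
    return stranded_gpus_per_node / GPUS_PER_NODE


print(f"{nodes_per_rack:.0f} nodes per rack")
print(f"about {peak_tflops_per_node:.1f} TFLOPS peak per node")
print(f"{stranded_share(1):.0%} of GPUs idle if each node strands one GPU")
```

With so few nodes, each one carries a large slice of the machine's peak, so a placement that leaves even one GPU per node unreachable removes a quarter of the GPU capacity.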
TSUBAME 3.0 - UGE Enhancements
▪ Core Bindings
  ▪ Enhanced PE support and strategies
▪ RSMAPs
  ▪ Enhanced PE support and chaining
▪ Docker
  ▪ Define unique but known container hostnames
  ▪ Configure InfiniBand device in the container
  ▪ Map all job users into the container
  ▪ Provide execution host and Docker container hostnames to the job
Putting it all together …
qsub -l docker,docker_images="*ubuntu:14.04*" \
     -l gpu=1,hfi=1,hosts=1 \
     -xd '--device=/dev/gpu${gpu(0)}:/dev/gpu,--device=/dev/hfi${hfi(0)}:/dev/hfi' \
     -xd '--hostname ${hosts(0)}' \
     -binding one_socket_balanced:4 \
     -pe rr 4 jobscript.sh
No matter the host OS, the application gets whatever OS it needs (if they run their own Docker repo, the image can even be prepared however they need it).
Each PE task will get 1 GPU and 1 HFI device (both with the same ID, i.e. in the same “location”) and a unique hostname.
No matter which devices are granted, the application only sees /dev/gpu and /dev/hfi inside the container and can use them directly without any performance penalty!
Even if the RSMAP would occupy 7 cores per GPU, we only want 4 per PE task, leaving room for other jobs that do not need a GPU or HFI. Also, we only go on one socket per host.
Container gets a unique, known (!) hostname.
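The placeholder substitution described above can be sketched in a few lines. This is not Univa Grid Engine code: `expand`, the `granted` dictionary, and the hostname value are hypothetical and only imitate how a placeholder such as `${gpu(0)}` might be replaced by the ID that was granted for one PE task.

```python
import re

# Hypothetical helper, NOT part of Univa Grid Engine. It only imitates how a
# placeholder like ${gpu(0)} could be replaced by a granted resource ID.
PLACEHOLDER = re.compile(r"\$\{(\w+)\((\d+)\)\}")


def expand(template, granted):
    """Replace each ${name(i)} with the i-th granted ID of resource `name`."""
    def lookup(match):
        name, index = match.group(1), int(match.group(2))
        return granted[name][index]
    return PLACEHOLDER.sub(lookup, template)


# Made-up grant for one PE task: GPU 2, HFI 2 and a container hostname.
granted = {"gpu": ["2"], "hfi": ["2"], "hosts": ["container-host-01"]}

device_opts = expand(
    "--device=/dev/gpu${gpu(0)}:/dev/gpu,--device=/dev/hfi${hfi(0)}:/dev/hfi",
    granted,
)
hostname_opt = expand("--hostname ${hosts(0)}", granted)

print(device_opts)   # host devices with ID 2 mapped to fixed in-container paths
print(hostname_opt)
```

Whatever IDs end up in the grant, the right-hand side of each device mapping stays `/dev/gpu` and `/dev/hfi`, which is why the job script never needs to know which physical devices it received.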