1
Storage Infrastructure
for HPC
Gabriel Mateescu
mateescu@acm.org
Overview
• Data-intensive science
• Architecture of Parallel Storage
• Parallel File Systems
– GPFS, Lustre, PanFS
• Data Striping
• Scale-out NAS and pNFS
• IO acceleration
2
The 4th paradigm of science
• Experiment
• Theory: models
• Computational Science: simulations
• Data-intensive science
– Unifies theory, experiment, and simulation
– Digital information processed by software
– Capture, curation, and analysis of data
– Creates a data explosion effect
3
Data Explosion 1
• Explosion of data volume
– The amount of data doubles every two years
– Number of files grows faster
• Challenges:
– Disk bandwidth growth lags compute bandwidth
growth
– Data management: migration to appropriate
performance tier, replication, backup, compression
– Capacity provisioning
4
Data Explosion 2
• Turning data into actionable insights
requires solving all these challenges
– Enough storage capacity
– Data placement and migration
– Data transfer bandwidth
– Data discovery
• New technology needed to handle massive
data sizes and file counts
– Access, preservation and movement of data
requires high-performance, scalable storage 5
Early days of HPC storage
6
[Diagram: four compute nodes (Node 0 through Node 3), each with its own local file system holding its own file (File 0 through File 3)]
• One file per compute node
• Hard to manage; data stage-in and stage-out needed
Parallel and shared storage
7
[Diagram: four compute nodes all accessing files A through D on a shared and parallel file system]
• All compute nodes can access all files
• Multiple compute nodes can access the same file concurrently
Parallel Storage
• Parallel storage system
– Aggregate a large number of storage devices to
provide a system whose devices can be
accessed concurrently by many clients
– Ideally, the throughput of the system is the sum
of the throughput of the storage devices
• Parallel file system
– Global namespace on top of the storage
system: all clients see the same filenames
– Global address space: all clients see the same
address space of a given file 8
Network Attached Storage
[Diagram: client nodes connect over a 10GE or InfiniBand interconnect fabric to file system servers; each server has a RAID controller fronting its storage devices]
Directly Attached Storage
[Diagram: compute nodes run the file system server themselves, reach RAID controllers and storage devices over a storage interconnect fabric (10GE, FCoE, InfiniBand), and talk to each other over the cluster interconnect fabric (10GE, InfiniBand)]
Scale-out NAS (SoNAS)
[Diagram: client nodes reach the file system servers over a WAN interconnect; the servers access RAID controllers and storage devices over a storage interconnect fabric (10GE, FCoE, InfiniBand)]
Parallel File System vs SoNAS
• Parallel file system
– Provides high throughput to one file by striping
the file across several storage devices
– Client nodes may also be file system servers
• Scale-out NAS (SoNAS)
– Parallel File System + Parallel Access Protocol
– File system servers typically not on the LAN of
the compute nodes (clients)
12
LUN and RAID
• A LUN is a logical volume made out of multiple physical disks
• Typically, a LUN is built as a RAID array
– RAID offers redundancy and/or concurrency
• There are several RAID types
– RAID0: striping
– RAID6: striping and two parity blocks
• 8 data disks + 2 parity disks
• Parity blocks are distributed across the 10 disks
13
Striping
• RAID stripe: a sequence of blocks that contains one block from each disk of a LUN
– Stripe width = number of disks per LUN
– Stripe depth = amount of data per disk in one stripe
– Stripe size = Stripe width × Stripe depth
• File system stripe: a sequence of blocks (segments) that contains one block from each LUN
– Stripe width = number of LUNs
– Stripe depth, aka block size
14
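To make the two stripe levels concrete, below is a minimal Python sketch (with illustrative sizes, not taken from the slides) that maps a file byte offset first to a LUN (file-system striping) and then to a data disk inside that LUN (RAID striping), assuming simple round-robin placement.

```python
# Map a file offset to its LUN (file-system stripe) and to the data disk
# inside that LUN (RAID stripe). All sizes are illustrative.
FS_STRIPE_DEPTH = 4 * 1024 * 1024   # file-system block size per LUN (4 MiB)
NUM_LUNS = 8                        # file-system stripe width
RAID_STRIPE_DEPTH = 512 * 1024      # data written to one disk (512 KiB)
DATA_DISKS_PER_LUN = 8              # RAID6 8+2P: 8 data disks

def locate(offset: int) -> tuple[int, int]:
    """Return (lun, data_disk_within_lun) holding the byte at `offset`."""
    fs_block = offset // FS_STRIPE_DEPTH
    lun = fs_block % NUM_LUNS                  # round-robin across LUNs
    offset_in_block = offset % FS_STRIPE_DEPTH
    raid_block = offset_in_block // RAID_STRIPE_DEPTH
    disk = raid_block % DATA_DISKS_PER_LUN     # round-robin across data disks
    return lun, disk

for off in (0, 5 * 1024 * 1024, 40 * 1024 * 1024):
    print(off, locate(off))
```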
Scaling
• Capacity scaling
– cores/node, memory/node, node count
– storage size, network switches
• Performance scaling
– GFlops, Instructions/cycle, Memory bandwidth
– IO throughput: large or small file, metadata
• IO scaling requires a balanced system
architecture to avoid bottlenecks
15
Scaling Bottlenecks
16
Storage wall
• As the system size (CPUs, memory,
interconnect, the number of compute nodes)
increases, providing scalable IO throughput
becomes very expensive
• Ken Batcher, recipient of the Seymour Cray award, put it this way:
– A supercomputer is a device for turning
compute-bound problems into IO-bound
problems
17
IBM GPFS (1)
18
• General Parallel File System
– Supports both architectures
• network-attached: software or hardware RAID
• directly-attached
– Network Shared Disk
• Cluster-wide naming
• Access to data
– Full POSIX semantics
• Atomicity of concurrent read and write operations
GPFS Directly Attached Storage
[Diagram: compute nodes run both a GPFS client and a GPFS NSD server, connect to each other over the cluster interconnect fabric (10GE, InfiniBand), and reach RAID controllers and storage devices over a storage interconnect fabric (10GE, FCoE, InfiniBand)]
GPFS Network Attached Storage
[Diagram: compute nodes running GPFS clients connect over a 10GE or InfiniBand fabric to storage nodes; each storage node runs an NSD server that exports its storage devices as NSDs]
HA for Network Attached Storage
[Diagram: two storage nodes serving two storage arrays]
• If a storage node fails, the load on the other storage node doubles
• Tolerates failure of one out of two nodes
Triad HA
[Diagram: three storage nodes serving three storage arrays]
• If a storage node fails, the load on the other two storage nodes grows by 50% (see the sketch below)
• Tolerates failure of two out of three nodes
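The load argument on these two HA slides amounts to spreading the failed node's share over the survivors; a one-line calculation makes the 100% vs. 50% figures explicit.

```python
# Fractional load increase on each surviving node of an HA group of N nodes
# when one node fails and its work is spread over the remaining N - 1 nodes.
def load_growth_after_failure(group_size: int) -> float:
    return 1.0 / (group_size - 1)

print(load_growth_after_failure(2))  # pair:  1.0 -> load doubles (+100%)
print(load_growth_after_failure(3))  # triad: 0.5 -> load grows by 50%
```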
IBM GPFS (2)
23
• Nodeset: group of nodes that operate on the
same file systems
• GPFS management servers
– cluster data server: one or two per cluster
• cluster configuration and file system information
– file system manager: one per file system
• disk space allocation, token management server
– configuration manager: one per nodeset
• Selects file system manager
– metanode: one per opened file
• handles updates to metadata
GPFS Scaling
• GPFS meta-nodes
– Each directory is assigned to a metanode that
manages it, e.g., locking
– Each file is assigned to a metanode that
manages it, e.g., locking
• The meta-node may become a bottleneck
– One file per task: puts pressure on the directory
meta-node for large jobs, unless a directory
hierarchy is created
– One shared file: puts pressure on the file meta-
node for large jobs 24
GPFS Striping (1)
• GPFS-level striping: spread the blocks of a
file across all LUNs
– Stripe width = number of LUNs
– GPFS block size = block stored in a LUN
• RAID-level striping
– Assume RAID6 with 8+2P, block-level striping
– Stripe width is 8 (8 + 2P)
– Stripe depth is the size of a block written to one
disk; a multiple of the sector size, e.g., 512 KiB
– Stripe size = Stripe depth × Stripe width = 8 ×
512 KiB = 4 MiB 25
GPFS Striping (2)
• GPFS block size
– equal to the RAID stripe size = 4 MiB
• Stripe width impacts aggregate bandwidth
– GPFS Stripe width equal to number of LUNs
maximizes throughput per file
– RAID Stripe Width of 8 (8+2P) for RAID6
balances performance and fault tolerance
• Applications should write blocks that are
– multiple of the GPFS block size and aligned
with the GPFS blocks
26
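A small sketch of the alignment advice above: writes that are sized and aligned to a multiple of the file-system block size map to whole GPFS blocks and avoid partial-block updates. The 4 MiB value is the block size derived on this slide; the file path is illustrative only.

```python
# Issue writes whose size is a multiple of the file-system block size and
# whose offsets are block-aligned (4 MiB here, the value derived above).
GPFS_BLOCK = 4 * 1024 * 1024

def write_aligned(path: str, data: bytes, block: int = GPFS_BLOCK) -> None:
    assert len(data) % block == 0, "pad the buffer to a whole number of blocks"
    with open(path, "wb") as f:
        for start in range(0, len(data), block):
            f.write(data[start:start + block])   # each write covers whole blocks

# Example: a 16 MiB buffer written as four aligned 4 MiB requests.
write_aligned("/tmp/aligned_demo.dat", bytes(16 * 1024 * 1024))
```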
Impact of IO Block Size
27
[Chart: throughput (MB/sec) vs. IO size (bytes) for a 1 TB SAS Seagate Barracuda ES2 disk]
Handling Small Files
• Small files do not benefit from GPFS striping
• Techniques used for small files
– Read-ahead: pre-fetch the next disk block
– Write behind: buffer writes
• These are used by other parallel file
systems as well
– For example, Panasas PanFS
28
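A toy sketch of the write-behind idea mentioned above (not the GPFS implementation): many small writes are accumulated in memory and flushed to the file system as a few larger requests. The threshold and file path are illustrative.

```python
# Toy write-behind buffer: coalesce many small writes into one large flush.
class WriteBehindBuffer:
    def __init__(self, path: str, flush_threshold: int = 4 * 1024 * 1024):
        self._file = open(path, "wb")
        self._chunks: list[bytes] = []
        self._buffered = 0
        self._threshold = flush_threshold

    def write(self, data: bytes) -> None:
        self._chunks.append(data)
        self._buffered += len(data)
        if self._buffered >= self._threshold:
            self.flush()

    def flush(self) -> None:
        if self._chunks:
            self._file.write(b"".join(self._chunks))  # one large write
            self._chunks.clear()
            self._buffered = 0

    def close(self) -> None:
        self.flush()
        self._file.close()

buf = WriteBehindBuffer("/tmp/small_writes_demo.dat", flush_threshold=64 * 1024)
for _ in range(1000):
    buf.write(b"x" * 512)   # 1000 small writes become roughly 8 larger flushes
buf.close()
```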
Lustre file system 1
• Has the network-attached architecture
• Object-based storage
– Uses storage objects instead of blocks
– Storage objects are units of storage that have
variable size, e.g., an entire data structure or
database table
– File layout gives the placement of objects rather
than blocks
• The user can set the stripe width, the stripe depth, and the file layout, as sketched below 29
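Lustre exposes these layout controls through the `lfs` utility. A minimal sketch follows, assuming the standard `lfs setstripe` options `-c` (stripe count) and `-S` (stripe size); it is wrapped in Python only to stay consistent with the other examples, and the path is hypothetical.

```python
# Set the striping of a Lustre directory (or new file) before writing to it.
# Assumes the standard `lfs setstripe -c <count> -S <size>` interface.
import subprocess

def set_stripe(path: str, stripe_count: int, stripe_size: str) -> None:
    subprocess.run(
        ["lfs", "setstripe", "-c", str(stripe_count), "-S", stripe_size, path],
        check=True,
    )

# Example: stripe a results directory across 8 OSTs with 4 MiB stripes.
set_stripe("/lustre/project/results", stripe_count=8, stripe_size="4m")
```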
Lustre Architecture
[Diagram: Lustre clients connect over a 10GE or InfiniBand fabric to the Metadata Server (MDS) with its Metadata Target and to Object Storage Servers (OSS), each serving several Object Storage Targets (OSTs)]
Lustre file system 2
• Metadata server (MDS)
– Manages file metadata and the global namespace
• Object storage server (OSS)
– Is the software that fulfills requests from clients
and gets/stores data to one or more Object
Storage Targets (OSTs)
– An OST is a logical unit (LUN), which can consist of one or more disk drives (RAID)
• Management Server (MGS)
– can be co-located with MDS/MDT 31
Parallel NFS (pNFS)
• pNFS allows clients to access storage
directly and in parallel
– Separation of data and metadata
– Direct access to the data servers
– Out-of-band metadata access
• Storage access protocols:
– File: NFS v4.1
– Object: object-based storage devices (OSD)
– Block: iSCSI, FCoE
32
pNFS architecture
[Diagram: pNFS clients access the NFS data servers directly for data and the NFS metadata server (MDS) out of band]
pNFS over Lustre
[Diagram: pNFS clients reach the NFS MDS, backed by the Lustre MDS, and the NFS data servers, backed by the Lustre OSSes]
Panasas Solution (1)
• SoNAS based on
– PanFS: Panasas ActiveScale file system
– pNFS or DirectFlow: Parallel access protocol
• Architecture
– Director Blade: MDS and management
– Storage Blade: storage nodes
• Disk: 2 or 3 TB/disk, 75 MB/s, one or two disks
• SSD (some models): 32 GB SLC
• CPU + Cache
– Shelf = 1 director blade + 10 storage blades 35
pNFS over PanFS
[Diagram: pNFS clients reach the NFS MDS, running on the Director Blade, and the NFS data servers, running on the Storage Blades]
Panasas Solution (2)
• Feeds and Speeds
– Shelf: 10 storage blades + 1 director blade
• Disk Size = 10 * 6 TB = 60 TB
• Disk Throughput: 10 * (2 * 75 MB/s) = 1.5 GB/s
– Rack: 10 shelves
• Size = 600 TB
• Throughput: 15 GB/s
– System: 10 racks
• Size = 6 PB
• Throughput: 150 GB/s
37
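The capacity and throughput roll-up above can be written as a short calculation; the per-blade figures (two disks of 3 TB and 75 MB/s each) are taken from this slide.

```python
# Capacity/throughput roll-up for the Panasas configuration on this slide.
disks_per_blade, tb_per_disk, mbps_per_disk = 2, 3, 75
blades_per_shelf, shelves_per_rack, racks = 10, 10, 10

shelf_tb = blades_per_shelf * disks_per_blade * tb_per_disk             # 60 TB
shelf_gbps = blades_per_shelf * disks_per_blade * mbps_per_disk / 1000  # 1.5 GB/s
print(f"shelf:  {shelf_tb} TB, {shelf_gbps} GB/s")
print(f"rack:   {shelf_tb * shelves_per_rack} TB, "
      f"{shelf_gbps * shelves_per_rack} GB/s")
print(f"system: {shelf_tb * shelves_per_rack * racks / 1000} PB, "
      f"{shelf_gbps * shelves_per_rack * racks} GB/s")
```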
Data vs Computation
Movement
38
• Consider a Lustre cluster with
–100 compute nodes (CNs), each with 1
TB local storage, 80 MB/s per local disk
–10 OSS and 10 OSTs/OSS,
–1TB/OST, 80 MB/s per OST
– 4x SDR InfiniBand network with 8 Gbps of usable bandwidth, i.e., 1 GB/s
Lustre cluster
[Diagram: compute nodes with Lustre clients and local disks (80 MB/s each) connect over a 1 GB/s InfiniBand fabric to the Object Storage Servers (OSS) with their OSTs and to the Metadata Server (MDS) with its Metadata Target]
MapReduce /Lustre
40
• Compute Nodes access data from
Lustre
• Disk throughput per OSS = 10 * 80
MB/s = 800 MB/s
–InfiniBand has 1 GB/s, so it can sustain
this throughput
• Aggregate disk throughput
–10 * 800 MB/s = 8 GB/s
MapReduce on Lustre vs
HDFS
41
• MapReduce/HDFS:
– Compute nodes use local disks
– Per compute-node throughput is 80 MB/s
– Aggregate disk throughput is 100 * 80 MB/s =
8 GB/s
• Aggregate throughput is the same, 8 GB/s
– The interconnect fabric provides enough
bandwidth for the disks
• MapReduce/Lustre is competitive with
MapReduce/HDFS for latency-tolerant work
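The bandwidth comparison on the last three slides, restated as a calculation with the numbers given there.

```python
# Aggregate disk bandwidth: Lustre OSSes vs. HDFS-style local disks.
MBPS = 80                          # per OST and per local disk, from the slides

lustre_oss, osts_per_oss = 10, 10
hdfs_nodes = 100
ib_link_gbps = 1.0                 # 4x SDR InfiniBand, ~1 GB/s per link

oss_bw = osts_per_oss * MBPS / 1000           # 0.8 GB/s per OSS
lustre_aggregate = lustre_oss * oss_bw        # 8 GB/s
hdfs_aggregate = hdfs_nodes * MBPS / 1000     # 8 GB/s

assert oss_bw <= ib_link_gbps      # each OSS link can carry its disks' traffic
print(lustre_aggregate, hdfs_aggregate)       # both 8.0 GB/s
```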
Data & Compute Trends
• Compute power: 90% per year
• Data volume: 40-60% per year
• Disk capacity: 50% per year
• Disk bandwidth: 15% per year
• Balancing the compute and disk
throughput requires the number of
disks to grow faster than the number of
compute nodes
42
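A back-of-the-envelope sketch of the balance argument above: if compute grows ~90% per year while per-disk bandwidth grows ~15% per year, the number of disks needed to keep IO in step with compute grows roughly as the ratio of the two factors.

```python
# Disks needed to keep IO bandwidth in step with compute growth.
compute_growth, disk_bw_growth = 1.90, 1.15   # per-year factors from the slide

disks = 1.0   # relative number of disks needed, normalized to year 0
for year in range(1, 6):
    disks *= compute_growth / disk_bw_growth
    print(f"year {year}: ~{disks:.1f}x the disks of year 0")
```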
IO Acceleration
43
• Disk bandwidth does not keep up with
memory and network bandwidth
• Hide low disk bandwidth using fast
buffering of IO data
–IO forwarding
–SSDs
Data Staging
44
• Data staging
–IO forwarding or SSDs
• IO forwarding hides disk bandwidth by
– Buffering the IO generated by an application on a staging machine, which frees memory on the supercomputer for the simulation
– Overlapping computation on the
supercomputer with IO on the staging
machine
45
Benefits of IO forwarding (1)
• Consider a machine with 1 PB of RAM that reaches the peak performance of 1 PFlop/sec when the operational intensity is >= 2 Flop/B
• Consider an application with operational intensity 1 Flop/B that uses 1 PB of RAM, executes 600 PFlop/iteration, and dumps each iterate to disk
• Because the application's operational intensity is below the machine's balance point, it runs far below peak (0.1 PFlop/sec here), so the time per iteration is
Tcomp = (600 PFlop/iteration) / (0.1 PFlop/sec) = 6000 sec
Benefits of IO forwarding (2)
46
• We can hide almost all the IO time if we can
– Copy 1 PB to a staging machine in Tfwd << Tcomp
– Write the 1 PB from the staging machine to disk in (Tcomp – Tfwd) ~ Tcomp
• Assume the staging machine has 64 K nodes, each with a 4x QDR port (4 GB/sec per port); then
Throughput = 64 K * 4 GB/sec = 256 TB/sec
Tfwd = 1024 TB / (64 K * 4 GB/sec) = 4 sec << Tcomp
• So the required disk bandwidth is
BW = (1 PB) / (6000 sec) = 166 GB/sec << 256 TB/sec
• Similar benefit for SSDs
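The forwarding and disk-bandwidth arithmetic from these two slides, written out as a short calculation (using 1 PB = 1024 TB for the forwarding step, as the slide does).

```python
# IO-forwarding arithmetic from the two slides above.
Tcomp = 600 / 0.1                              # 600 PFlop / 0.1 PFlop/s = 6000 s

staging_nodes = 64 * 1024                      # 64 K nodes
port_GBps = 4                                  # 4x QDR InfiniBand, ~4 GB/s/port
stage_TBps = staging_nodes * port_GBps / 1024  # 256 TB/s aggregate
Tfwd = 1024 / stage_TBps                       # 1 PB = 1024 TB -> 4 s << Tcomp

disk_GBps = 1e15 / Tcomp / 1e9                 # ~166 GB/s needed to drain 1 PB
print(Tcomp, Tfwd, int(disk_GBps))             # 6000.0  4.0  166
```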
SSD Metadata Store
• MDS is a bottleneck for metadata-
intensive operations
• Use SSD for the metadata store
• IBM GPFS with SSD for metadata store
– eight NSD servers with four 1.8 TB, 1.25 GB/s PCIe-attached SSDs; two GPFS clients
– Processes the 6.5 TB of metadata for a file system with 10 billion files in 43 min
– Enables timely policy-driven data management
47
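A quick sanity check of the scan rate implied by the figures above (10 billion files in 43 minutes).

```python
# Implied metadata scan rate for the GPFS + SSD configuration above.
files, minutes = 10e9, 43
print(f"{files / (minutes * 60):,.0f} files scanned per second")  # ~3,875,969
```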
Conclusion
• Parallel storage has evolved similarly
to parallel computation
– Scale by adding disk drives, networking, CPU,
and memory/cache
• Parallel file systems provide direct and
parallel access to storage
– Striping across and within storage nodes
• Staging to SSDs or to another machine hides the low disk bandwidth
48