Navigating the maze of cloud instance types to optimise price/performance
Introduction
As organisations increasingly move their workloads into the cloud, costs can spiral out of control unless the organisation, and the individual teams and developers within it, are smart about how they provision compute for their workloads.
This article presents observations from running a TensorFlow machine learning (ML) training application across different GPU configurations, and examines the cost and performance dynamics of using Spot versus on-demand instances. It concludes with an overview of how YellowDog, a cloud-native workload manager, can help navigate the maze of cloud instance type options to deliver optimal price/performance.
Test setup
For this test case, the TensorFlow model and accompanying data were packaged in a Docker container for execution on a single GPU, requiring 26GB of GPU memory.
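As an aside, constraining a containerised TensorFlow job to a single GPU and verifying the device before training is straightforward; the snippet below is a minimal, illustrative sketch rather than the actual test harness used here:

```python
# Minimal sketch: restrict TensorFlow to a single GPU and inspect it.
# Illustrative only; not the actual test setup from this article.
import tensorflow as tf

gpus = tf.config.list_physical_devices("GPU")
tf.config.set_visible_devices(gpus[0], "GPU")  # expose only the first GPU

details = tf.config.experimental.get_device_details(gpus[0])
print("GPU:", details.get("device_name"))
print("Compute capability:", details.get("compute_capability"))
```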
Looking first at AWS, the closest option that fitted the memory profile was a large 8x GPU instance type (AWS p4d.24xlarge: 8x A100, 40GB each). Using such an instance type wouldn't be cost-efficient, though, given that only one of the eight GPUs would be exercised by the workload. A similar situation was found at Oracle, with instance types matching the memory profile typically oriented around larger multi-GPU machines (e.g., OCI BM.GPU4.8: 8x A100, 40GB each).
For the testing, it was therefore decided to utilise single-GPU instance types provided by Google Cloud: an A100 instance type (a2-highgpu-1g, A100 40GB SXM4) was used as a reference point, and an H100 instance type (a3-highgpu-1g, H100 80GB HBM3) was used to explore the impact on cost/performance of a higher-spec GPU.
Test results
To set a baseline, the workload was run on the A100 instance, giving a completion time of 7hrs13mins. The higher-performance H100 instance naturally fared better, completing the workload in 5hrs21mins, a ~35% speed-up over the A100.
Whilst the H100 time was notably faster than that of the A100, the differential was lower than might have been expected. For example, external benchmarking using ResNet-50 (an ML model for image classification) typically shows a ~60% uplift in performance for the H100 compared to an A100.
This smaller-than-expected uplift was possibly due in part to the workload utilising NVIDIA's PTX virtual machine: PTX provides portability, allowing the code to run on any NVIDIA GPU, but at the expense of optimisation, since the PTX code must be JIT-compiled into hardware-specific instructions at runtime.
Having said that, even if the code were optimised and compiled specifically for the H100, the expected ~60% uplift would not justify the current 2.4x price premium: the H100 instance cost ~$68 to complete the workload compared to the A100's ~$29.
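A quick back-of-envelope check using the figures above makes the gap plain (illustrative arithmetic only, not live pricing):

```python
# Sanity-check the price/performance gap using the article's figures.
a100_hours = 7 + 13 / 60   # 7hrs13mins on the A100
h100_hours = 5 + 21 / 60   # 5hrs21mins on the H100

speedup = a100_hours / h100_hours   # ~1.35x observed
premium = 68 / 29                   # ~2.4x cost premium
print(f"Observed speed-up: {speedup:.2f}x vs price premium: {premium:.1f}x")
# Even the best-case ~1.6x (ResNet-50) uplift falls well short of 2.4x.
```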
Part of this cost differential, though, is down to the H100 instance type including a substantial amount of GPU memory (80GB), far more than the target workload needs (26GB), and typically intended for larger instance types sporting multiple GPUs.
Multi-GPU option
If the workload were modified to utilise multiple GPUs rather than just one (e.g., using TensorFlow's tf.distribute.MirroredStrategy, sketched below), this would improve completion time whilst also being considerably more cost-effective.
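A minimal sketch of what such a change might look like follows; the model and batch sizes are illustrative placeholders rather than the tested workload:

```python
# Data-parallel training across all visible GPUs with MirroredStrategy.
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Scale the global batch size with the GPU count so each replica
# sees the same per-GPU batch as the single-GPU run.
per_gpu_batch = 32
global_batch = per_gpu_batch * strategy.num_replicas_in_sync

with strategy.scope():
    # Variables created inside the scope are mirrored across the GPUs.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )

# dataset = ...  # a tf.data pipeline batched with global_batch
# model.fit(dataset, epochs=10)
```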
For instance, the ResNet-50 benchmark on a machine using 8x H100 GPUs runs ~10.8x faster than a single GPU machine, which translates to an estimated reduction in workload completion time from 5hrs21mins to only ~30mins.
Naturally the cost for this instance type also increases, albeit more modestly at ~7.8x. What's most interesting, though, is that whilst the cost/hr of the instance goes up, the time to complete the workload decreases, bringing the overall workload cost down from ~$68 to ~$49. Essentially, the customer gets their workload processed more quickly whilst also saving money.
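The arithmetic behind this, again using the figures above (the hourly rates are derived from the article's costs, not quoted list prices):

```python
# Derive the 8x H100 workload cost from the single-GPU run.
single_cost = 68.0                 # USD, single H100, 5hrs21mins
single_hours = 5 + 21 / 60
hourly_rate = single_cost / single_hours   # ~$12.7/hr

multi_rate = hourly_rate * 7.8     # ~7.8x cost for the 8x instance
multi_hours = single_hours / 10.8  # ~10.8x speed-up (ResNet-50 scaling)
print(f"8x H100 workload cost: ${multi_rate * multi_hours:.0f}")  # ~$49
```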
This might seem counterintuitive, but by enabling customers to complete their workloads faster, the Cloud Service Provider (CSP) frees up the machine for other paying customers, thereby maximising utilisation and increasing revenue. And in the case of quant research or Monte Carlo analysis, for example, the faster the customer can obtain results, the faster they can iterate, updating and rerunning the model and in doing so consuming (and paying for) more cloud compute: a win-win for all.
To some extent, this may be one of the reasons why AWS and Oracle don't offer single-GPU instance types using the A100 or H100. Both GPUs are high-performance, expensive, and designed for large-scale workloads such as high-performance computing (HPC) and AI training that typically benefit from multiple GPUs. CSPs can allocate A100s/H100s much more effectively by grouping them into multi-GPU instance types rather than spreading them across single-GPU machines.
When running the workload on a single-GPU machine, the Google A2 instance type was clearly the most cost-effective. With 8x GPU machines, however, the choice between A2 and A3 becomes a little more nuanced: the A2 A100 instance type costs roughly half that of the A3 H100 for the workload ($27 vs $49), but the A3 completes the workload ~1.7x faster (30mins vs 50mins).
In absolute terms, the A2 A100 still provides the most ‘bang for the buck’, achieving a performance/$/hr score that is 53% higher than the A3 (the H100 still being priced at a premium to the older A100 due to demand). But ultimately the decision comes down to whether the business wishes to prioritise time to results over cost, and whether the additional outlay is deemed to be worthwhile.
Either way, optimising workloads to utilise multiple GPUs (and compiling for the specific GPU) provides material improvements in workload completion time and cost efficiency.
Lower-end GPU option
For smaller workloads and general-purpose GPU tasks, AWS offers lower-grade single-GPU options such as the A10G-based G5 instance types. For example, the g5.xlarge comes with a single NVIDIA A10G GPU with 24GB of memory, which falls just short of the target workload requirement (26GB).
The A10G performance is roughly half that of the A100 (ResNet-50), so workload completion times are likely to be considerably longer (~14.5hrs), although this drop in performance is offset somewhat by a reduction in cost (~$18).
Spot instance option
If budget is the primary driver, another way of reducing cost is to use Spot instances rather than on-demand: A100 instances (single- or multi-GPU) enjoy a ~66% discount, whilst for H100 instances the discount is slightly lower at ~61%.
The downside of course is the risk of these instances being pre-empted which could result in the entire workload needing to be restarted from scratch. The cost savings from using Spot therefore need to be weighed up against the risk of pre-emption and the knock-on increase in completion time which would push up the total cost.
Taking the single-GPU A100 instance as an example, a single pre-emption event would double completion time to ~14.5hrs, equivalent to that of the A10G, and push up the Spot cost for the workload to $19 in comparison to the A10G on-demand instance cost of $18.
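Deriving that figure from the on-demand numbers (illustrative arithmetic):

```python
# Derive the pre-empted Spot cost from the on-demand figures above.
on_demand_cost = 29.0                      # single A100, 7hrs13mins
spot_cost = on_demand_cost * (1 - 0.66)    # ~66% Spot discount -> ~$9.9
preempted_cost = spot_cost * 2             # one restart doubles the run
print(f"Spot cost after one pre-emption: ${preempted_cost:.2f}")
# ~$19.7, in line with the article's ~$19.
```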
This pre-emption impact can be minimised to some degree if the workload is able to incorporate checkpoints (e.g., in the case of ML training, using tf.train.Checkpoint; a minimal sketch follows below). Careful selection of instance types from regions that are less likely to be pre-empted is another option, but one that depends on in-depth knowledge of a marketplace that's constantly changing. And finally, as has already been shown, moving to a higher-spec H100 in a multi-GPU configuration reduces completion time considerably (down to ~30mins) and, in doing so, the risk of pre-emption.
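A minimal sketch of checkpoint-based resumption with tf.train.Checkpoint; the directory and save cadence are illustrative choices:

```python
# Fault-tolerant training: resume from the latest checkpoint after
# a Spot pre-emption rather than restarting from scratch.
import tensorflow as tf

model = tf.keras.Sequential([tf.keras.layers.Dense(10)])
optimizer = tf.keras.optimizers.Adam()

checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, directory="/ckpts",
                                     max_to_keep=3)

# On start-up, restore the latest checkpoint if one exists.
checkpoint.restore(manager.latest_checkpoint)
if manager.latest_checkpoint:
    print("Resumed from", manager.latest_checkpoint)

# Inside the training loop, save periodically, e.g.:
# if step % 500 == 0:
#     manager.save(checkpoint_number=step)
```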
Takeaways
Optimum use of cloud resources is dependent on a number of factors and trade-offs. At a simplistic level, cost can be reduced by switching to an older GPU albeit with the downside of a slower workload completion time. Having said that, such an approach was precluded in the testing due to the large memory requirement of the test workload (26GB). A wider range of instance types would become available if this memory requirement could be reduced, for instance by quantising the input data down to a smaller bit-depth. The ML model itself could also be compressed through quantising the model parameters and activation weights, or through model pruning.
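One practical route to a smaller training-time memory footprint is mixed precision, which stores activations in float16 whilst keeping float32 master weights; a related (though not identical) technique to the quantisation approaches mentioned above. A minimal sketch:

```python
# Mixed-precision training: float16 activations, float32 master weights.
# Roughly halves activation memory and exploits A100/H100 tensor cores.
import tensorflow as tf

tf.keras.mixed_precision.set_global_policy("mixed_float16")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(128, activation="relu"),
    # Keep the output layer in float32 for numerically stable logits.
    tf.keras.layers.Dense(10, dtype="float32"),
])
```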
If the priority is to accelerate workload completion, this can be achieved by upgrading to a newer high-performance GPU and/or switching to multi-GPU instances if this can be supported by the workload. In addition to using multi-GPU instances for accelerating individual workloads, such instances could also be used for running an ML model multiple times simultaneously with different input data sets to explore a broad problem space more quickly.
For workloads needing to optimally balance time vs cost where the priority is still to complete the workload quickly, an interesting strategy would be to combine the use of higher-spec multi-GPU instances able to process the workload faster with Spot pricing to reduce cost. Depending on workload size, and whether or not the workload lends itself to the incorporation of checkpoints, such an approach can deliver high performance at low cost and with minimal risk.
Finding and then using such instances though can be a challenge, and especially so if it requires forming relationships with new CSPs.
YellowDog can help, providing a number of services and capabilities to ease the burden for customers and both simplify and optimise use of cloud resources.
YellowDog Insights, for example, provides detailed price and availability information on all 10,000 global cloud instance types offered by the major cloud providers, whilst Insights Pro benchmarks each instance type using QuantLib and SysBench, enabling them to be ranked by price and/or performance.
With YellowDog Cloud, YellowDog provide a cross-CSP managed service that delivers compute resources optimally matched to customer needs.
Through the relationships YellowDog have forged with each CSP, combined with Insights' coverage of the entire market of instance types, prices and availability, YellowDog are able to find and employ the instances best suited to a customer's workload shape and business objectives.
Cost is minimised by ensuring the right instance type is selected to avoid over-provisioning, and by finding the lowest price for that instance type across CSPs. For example, in the case of an 8x A100 40GB instance type, the price varies considerably across Google, AWS and OCI: between $24 and $39/hr at the time of writing. YellowDog can find and employ the cheapest instance at any given point in time and, for Spot instances, can find those least likely to be pre-empted, thereby improving resiliency.
With YellowDog Cloud, YellowDog can bring these benefits to customers without the customer needing to manage multiple CSP accounts, or move their accounts between CSPs. Equally, YellowDog Cloud reduces the engineering burden by enabling customers to integrate once with YellowDog rather than needing to rebuild their cloud integrations based on the proprietary SDKs, APIs and tools provided by each different CSP.
And it's not just limited to optimal compute orchestration. YellowDog takes a holistic view across both compute and data to determine the best strategy for workload placement based on data considerations such as container size, transport costs between regions, CSP egress fees, and use of external storage buckets.
In short, with YellowDog Cloud, the customer can concentrate on their business and leave the complexity of cloud compute management to YellowDog. And for those customers needing a more hands-on approach to managing their cloud compute resources and workload management, YellowDog Platform exposes the powerful scheduler and dynamic orchestration engine used by YellowDog Cloud via a comprehensive set of APIs and SDKs for customers to use directly.