KWOK: A Lightweight Kubernetes Simulator to Empower AI Batch Scheduling
In AI, batch scheduling is the task of efficiently managing and allocating large-scale computational resources for long-running concurrent workloads such as distributed training, model tuning, and inference. However, the native Kubernetes scheduler does not provide the job-level features that batch scheduling needs, such as group scheduling, prioritisation, preemption, and retrying. A more robust resource management solution is therefore required to improve the performance and efficiency of batch scheduling.
01
MCAD, An Open Source Resource Management Solution
MCAD (Multi-Cluster App Dispatcher) is an open-source resource management solution proposed by the IBM research team. It can manage batch workloads in a single cluster or across multiple clusters, offering advanced resource management options such as queues, packaging, quotas, and cross-cluster scheduling. MCAD can also work with other scheduling plugins, such as Coscheduler, to implement group scheduling, which allows a set of related Pods to be scheduled and run simultaneously.
To test and optimize MCAD's performance, the IBM research team used KWOK, a lightweight Kubernetes simulator that creates pods and nodes and simulates their behavior as if they were real Kubernetes resources, with a low resource footprint. Annotations and affinities determine which nodes and pods KWOK simulates, and the Stage API lets it simulate node and pod lifecycle phases such as creation, readiness, and completion. KWOK also provides a command-line tool, kwokctl, which creates and manages KWOK-managed clusters and can save and restore a cluster's etcd data.
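To make the annotation and affinity mechanism more concrete, here is a minimal Python sketch that builds one simulated node and one pod pinned to it, then prints them as JSON that could be piped to kubectl. The annotation key `kwok.x-k8s.io/node`, the `type: kwok` label, and the matching taint follow the examples in the KWOK documentation as I understand them; the resource sizes, the object names, and the helper functions `fake_node` and `fake_pod` are hypothetical.

```python
import json

# Assumed conventions (taken from the KWOK documentation examples):
# the annotation marks a node as simulated; the label and taint let pods
# opt in to running on simulated nodes only.
KWOK_NODE_KEY = "kwok.x-k8s.io/node"
NODE_LABEL = {"type": "kwok"}


def fake_node(name: str) -> dict:
    """Build a Node manifest that KWOK is expected to manage."""
    resources = {"cpu": "32", "memory": "256Gi", "pods": "110"}  # hypothetical sizes
    return {
        "apiVersion": "v1",
        "kind": "Node",
        "metadata": {
            "name": name,
            "annotations": {KWOK_NODE_KEY: "fake"},
            "labels": dict(NODE_LABEL),
        },
        "spec": {
            "taints": [
                {"key": KWOK_NODE_KEY, "value": "fake", "effect": "NoSchedule"}
            ]
        },
        "status": {"capacity": dict(resources), "allocatable": dict(resources)},
    }


def fake_pod(name: str) -> dict:
    """Build a Pod manifest that only schedules onto the simulated nodes."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "affinity": {
                "nodeAffinity": {
                    "requiredDuringSchedulingIgnoredDuringExecution": {
                        "nodeSelectorTerms": [
                            {
                                "matchExpressions": [
                                    {"key": "type", "operator": "In", "values": ["kwok"]}
                                ]
                            }
                        ]
                    }
                }
            },
            # Tolerate the taint placed on simulated nodes.
            "tolerations": [
                {"key": KWOK_NODE_KEY, "operator": "Exists", "effect": "NoSchedule"}
            ],
            # The image is never pulled, because no container actually runs.
            "containers": [{"name": "fake-container", "image": "fake-image"}],
        },
    }


manifest = {
    "apiVersion": "v1",
    "kind": "List",
    "items": [fake_node("kwok-node-0"), fake_pod("fake-pod-0")],
}
print(json.dumps(manifest, indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, since kubectl accepts JSON as well as YAML.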
Using KWOK, the IBM research team locally simulated a cluster with a large number of nodes and pods to test the performance of different versions or commits of MCAD and its synergy with the Coscheduler plugin. They found that using MCAD together with Coscheduler significantly reduced the average completion time, scheduling latency, and dispatch time of batch jobs, compared to using only the native Kubernetes scheduler or only the Coscheduler plugin. They also found that KWOK uses significantly less memory than Kubemark, a kubelet emulator that does not run containers.
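The three metrics named above can be computed from per-job timestamps. The sketch below is only an illustration: the metric definitions (dispatch time as submission to dispatch, scheduling latency as dispatch to pod scheduling, completion time as submission to completion) are my assumptions, not the IBM team's stated definitions, and the `JobRecord` type and the sample numbers are made up for the example.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class JobRecord:
    """Timestamps for one batch job, in seconds since an arbitrary start."""
    submitted: float   # job handed to the dispatcher
    dispatched: float  # dispatcher released the job to the cluster
    scheduled: float   # all pods of the job bound to nodes
    completed: float   # job finished


def summarize(jobs: List[JobRecord]) -> Dict[str, float]:
    """Average the three metrics over all jobs (assumed definitions)."""
    return {
        "avg_dispatch_time": mean(j.dispatched - j.submitted for j in jobs),
        "avg_scheduling_latency": mean(j.scheduled - j.dispatched for j in jobs),
        "avg_completion_time": mean(j.completed - j.submitted for j in jobs),
    }


# Made-up sample records, purely to exercise the function.
sample_jobs = [
    JobRecord(submitted=0.0, dispatched=1.0, scheduled=3.0, completed=63.0),
    JobRecord(submitted=0.0, dispatched=2.0, scheduled=5.0, completed=65.0),
]

for metric, value in summarize(sample_jobs).items():
    print(f"{metric}: {value:.2f} s")
```

Running the same summary against two scheduler configurations on an identical simulated cluster gives a like-for-like comparison of the kind described above.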
02
KWOK's Benefits, Objectives, and Limitations
KWOK is a toolkit that allows you to set up a cluster of thousands of nodes in seconds. All nodes are simulated, so the entire cluster has a very low resource footprint and you can easily run it on your laptop. KWOK, whose full name is Kubernetes WithOut Kubelet, provides two tools: kwok and kwokctl. kwok is responsible for simulating the lifecycle of dummy nodes, pods, and other Kubernetes API resources. kwokctl is a CLI tool designed to simplify the creation and management of clusters where the nodes are simulated by kwok.
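As a rough illustration of the kwokctl workflow, the sketch below assembles the commands that create and delete a cluster. The `create cluster` and `delete cluster` subcommands and the `--name` flag follow the KWOK documentation as I understand it; the cluster name `demo` and the helpers `kwokctl_cmd` and `run` are hypothetical. By default the script only prints the commands and does not execute anything.

```python
import shutil
import subprocess
from typing import List, Optional


def kwokctl_cmd(action: str, cluster: str) -> List[str]:
    """Return the kwokctl argument list for creating or deleting a cluster."""
    if action not in ("create", "delete"):
        raise ValueError(f"unsupported action: {action}")
    return ["kwokctl", action, "cluster", f"--name={cluster}"]


def run(cmd: List[str], dry_run: bool = True) -> Optional[int]:
    """Print the command; execute it only if dry_run is off and kwokctl is installed."""
    print("$ " + " ".join(cmd))
    if dry_run or shutil.which(cmd[0]) is None:
        return None
    return subprocess.run(cmd, check=True).returncode


# Dry run: shows what would be executed for a hypothetical cluster named "demo".
for action in ("create", "delete"):
    run(kwokctl_cmd(action, "demo"))
```

Passing `dry_run=False` executes the commands on a machine where kwokctl is on the PATH. Between the two commands, any Kubernetes client such as kubectl can talk to the simulated cluster.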
KWOK has several advantages:
1. Speed: You can create and delete clusters and nodes almost immediately, without waiting for launch or configuration.
2. Compatibility: KWOK can work with any tool or client that conforms to the Kubernetes API, such as kubectl, helm, and kui.
3. Portability: KWOK has no specific hardware or software requirements. You can run it with pre-built images as long as you have Docker, Podman, or nerdctl installed. Binaries that are easy to install are also available for all platforms.
4. Flexibility: You can configure different node types, labels, taints, capacities, and conditions. You can also configure different pod behaviors and phases to test different scenarios and edge cases.
5. Performance: You can simulate thousands of nodes on your laptop without significantly consuming CPU or memory resources.
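The flexibility point above can be sketched as a small generator that renders pools of simulated nodes with different labels, taints, and capacities. This is a sketch under assumptions: the `kwok.x-k8s.io/node` annotation and `type: kwok` label follow the KWOK documentation examples, while the `NodePool` type, the pool names, the resource sizes, and the GPU taint are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class NodePool:
    """A group of identical simulated nodes (hypothetical helper type)."""
    name: str
    count: int
    resources: Dict[str, str]
    labels: Dict[str, str] = field(default_factory=dict)
    taints: List[dict] = field(default_factory=list)


def render_nodes(pool: NodePool) -> List[dict]:
    """Render one Node manifest per replica in the pool."""
    nodes = []
    for i in range(pool.count):
        nodes.append({
            "apiVersion": "v1",
            "kind": "Node",
            "metadata": {
                "name": f"{pool.name}-{i}",
                # Assumed annotation that marks the node as simulated by kwok.
                "annotations": {"kwok.x-k8s.io/node": "fake"},
                "labels": {"type": "kwok", **pool.labels},
            },
            "spec": {"taints": [dict(t) for t in pool.taints]},
            "status": {
                "capacity": dict(pool.resources),
                "allocatable": dict(pool.resources),
            },
        })
    return nodes


pools = [
    NodePool("cpu", 3, {"cpu": "32", "memory": "128Gi", "pods": "110"},
             labels={"pool": "cpu"}),
    NodePool("gpu", 2, {"cpu": "64", "memory": "512Gi", "pods": "110",
                        "nvidia.com/gpu": "8"},
             labels={"pool": "gpu"},
             taints=[{"key": "nvidia.com/gpu", "value": "present",
                      "effect": "NoSchedule"}]),
]

nodes = [node for pool in pools for node in render_nodes(pool)]
for node in nodes:
    print(node["metadata"]["name"], node["status"]["capacity"])
```

Raising `count` to the thousands changes only the size of the generated list, which is what makes large heterogeneous test clusters cheap to describe.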
KWOK can be used for various purposes:
1. Learning: You can use KWOK to learn the concepts and features of Kubernetes without worrying about resource wastage or other consequences.
2. Development: You can use KWOK to develop new features or tools for Kubernetes without accessing a real cluster or needing other components.
3. Testing: You can measure how your application or controller scales with different numbers of nodes and pods. You can generate a high workload on your cluster by creating many pods or services with different resource requests or limits. You can simulate node failures or network partitions by changing node conditions or randomly deleting nodes. You can also test how your controller interacts with other Kubernetes components or features by enabling different feature gates or API versions.
However, KWOK is not intended to be a complete replacement for other tools, and it has some limitations that you should be aware of:
1. Functionality: KWOK is not a kubelet and may exhibit different behavior in terms of pod lifecycle management, volume mounting, and device plugins. Its main purpose is to simulate updates to node and pod phases.
2. Accuracy: KWOK does not accurately reflect the performance or behavior of real nodes under various workloads or environments.
03
Summary
In general, KWOK is a tool for simulating Kubernetes clusters that can help you better understand and use Kubernetes, and that makes developing and testing against Kubernetes easier. It helps developers and researchers in the AI field test and optimize the performance and efficiency of batch scheduling, as well as its synergy with other scheduling plugins, with the ultimate goal of managing and allocating large-scale computational resources as efficiently as possible.
KWOK is an open source project and welcomes contributions and feedback from readers.
More information about KWOK and MCAD can be found at:
1. KWOK official website: Home | KWOK
2. MCAD Project: GitHub - project-codeflare/multi-cluster-app-dispatcher: Holistic job manager on Kubernetes
3. Best Practice: Improving Large Scale Batch Scheduling Performance with MCAD and KWOK
1) Read PDF:
2) Watch video:
I hope this article helps you better understand the role of KWOK in the AI field. If you have any questions or suggestions, please feel free to leave them in the comment section.
We look forward to your participation and feedback!