Published on June 25, 2024
Google’s AI Hypercomputer tooling enables workloads to align their consumption of in-demand cloud resources to their specific usage requirements. By assembling the right pieces for your Ray, Kubeflow, Argo Workflows, Flux, PyTorch, or Airflow jobs, you can consume the underlying GCP resources effectively without being impacted by the broader supply constraints in this market.
Overview
The AI Hypercomputer is not a specific product you can provision in Google Cloud, but rather a validated and supported architecture for consumption of in-demand GPU and TPU hardware in a way that aligns incentives between customers and the cloud provider. By setting appropriate parameters and using validated integrations, training workloads see lower disruption, better utilization, and a more understandable scheduling cadence.
There are two consumption modes, calendar (scheduled) and flex start, provided by the Dynamic Workload Scheduler. Released in late 2023, this Google Cloud capability enables you to schedule the needed accelerators in close physical proximity for more efficient network communication.
This capability highlights some of the work that Google has done to bring recent work from the Borg ecosystem to bear for Google Kubernetes Engine users.
The Characters
In this post, we will talk about a number of components that enable data scientists to run large-scale GPU and TPU workloads on GKE.
- GKE node pools provisioned with the Dynamic Workload Scheduler to enable as-available or scheduled consumption
- Kueue to queue and admit workloads in a way that aligns with the scheduling characteristics of DWS
- JobSet, an API that manages a group of parallel Jobs as a single unit
- Ray and JAX as the Python frameworks developers use to interact with these resources once they are scheduled
Components of the Hypercomputer
The AI Hypercomputer design features a set of layers, aligned to the standard software stack, assembled to build a validated architecture on GCP. These layers are:
- Performance Optimized Hardware - transparent access to near-proximity TPUs and GPUs for compute, object storage, and the Jupiter data center network
- The open software stack - utilizing open source technologies like Kubernetes, Kueue, TensorFlow, and PyTorch alongside capabilities for multislice training and multihost inference
- Flexible consumption - utilizing Dynamic Workload Scheduler and GKE native scheduling features to create optimized consumption of Google Cloud resources
Cluster Setup
Let’s walk through setting up a Kubernetes cluster so that workloads can utilize these capabilities of the AI Hypercomputer. We will walk through the following steps:
- Deploy and configure the node pools
- Deploy Kueue
- Deploy JobSet controller
- Prepare reservations
Given an existing Kubernetes cluster, we need to add a node pool utilizing the Dynamic Workload Scheduler.
resource "google_container_node_pool" "dws" {
name = "dws-node-pool"
cluster = google_container_cluster.primary.id
location = "us-central2"
initial_node_count = 0
node_locations = ["us-central2-b"] # limited zonal support for `ct` machine class
node_config {
machine_type = "ct4p-hightpu-4t"
service_account = google_service_account.default.email
oauth_scopes = [
"https://www.googleapis.com/auth/cloud-platform"
]
workload_metadata_config {
mode = "GKE_METADATA"
}
}
placement_policy {
type = "COMPACT"
tpu_topology = "2x2x2"
}
queued_provisioning {
enabled = true
}
autoscaling {
min_node_count = 0
}
management { # Disable management actions while the machine are running
auto_repair = false
auto_upgrade = false
}
}
This provisions ct4p-hightpu-4t machines in a 2x2x2 topology, giving us 8 TPU chips spread across 2 VMs in a multi-host configuration. We disable any actions that the underlying GKE controller would take to disrupt these nodes using the management block. Our goal is to grab this hardware as it becomes available to us and to fully utilize the provisioned machines while the job is running.
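Once the Terraform has been applied, it is worth confirming that queued provisioning and the compact placement policy took effect on the pool. A quick check, using the node pool name from the sketch above and a placeholder cluster name:
# Describe the DWS node pool; substitute your own cluster name
gcloud container node-pools describe dws-node-pool \
  --cluster <your-cluster-name> \
  --region us-central2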
To deploy pods that will utilize this node pool, we will use Kueue to trigger the ProvisioningRequest API call made by the GKE autoscaling controller. Kueue can be installed by selecting the latest release version and running:
VERSION=v0.7.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
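Before moving on, it helps to confirm the Kueue controller came up cleanly. Assuming the default namespace and deployment name from the release manifests:
# Wait for the controller deployment to report Available, then list its pods
kubectl -n kueue-system wait --for=condition=Available deployment/kueue-controller-manager --timeout=5m
kubectl -n kueue-system get pods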
In the configuration of Kueue, we need to override the kueue-manager-config ConfigMap to specify that we will be using the JobSet resource type:
apiVersion: v1
kind: ConfigMap
metadata:
  name: kueue-manager-config
  namespace: kueue-system
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    namespace: kueue-system
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: :8080
      # enableClusterQueueResources: true
    webhook:
      port: 9443
    manageJobsWithoutQueueName: true
    internalCertManagement:
      enable: true
      webhookServiceName: kueue-webhook-service
      webhookSecretName: kueue-webhook-server-cert
    waitForPodsReady:
      enable: true
      timeout: 10m
    integrations:
      frameworks:
      - "jobset.x-k8s.io/jobset"
Now, let’s create a ResourceFlavor aligned to the TPU topology and accelerator we selected in our node pool:
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "vlp-8"
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x2
Now, we create a ClusterQueue that can be selected by individual namespaces.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-tpu-queue"
spec:
  namespaceSelector: {}
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: "vlp-8"
      resources:
      - name: "google.com/tpu"
        nominalQuota: 8
Now, a LocalQueue enables workloads in a specific namespace to use the ClusterQueue.
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: app-ns
  name: tpu-queue
spec:
  clusterQueue: cluster-tpu-queue
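With the ResourceFlavor, ClusterQueue, and LocalQueue defined, apply them and confirm the queues exist. The file names here are placeholders for wherever you saved the manifests above:
kubectl apply -f resource-flavor.yaml -f cluster-queue.yaml -f local-queue.yaml
# The ClusterQueue is cluster-scoped; the LocalQueue lives in the application namespace
kubectl get clusterqueue cluster-tpu-queue
kubectl -n app-ns get localqueue tpu-queue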
Now that Kueue is deployed to the cluster and our application namespace has a LocalQueue, we can deploy jobs that build on the suspended Job mechanism (the suspend job KEP) through the JobSet controller.
First, install JobSet by selecting a version of the controller and deploying it to your GKE cluster:
VERSION=v0.5.2
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
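As with Kueue, a quick sanity check confirms the controller and its CRD are in place, assuming the default jobset-system namespace from the release manifests:
# Confirm the JobSet CRD is registered and the controller pods are running
kubectl get crd jobsets.jobset.x-k8s.io
kubectl -n jobset-system get pods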
Now, we can deploy JobSets that will request the accelerators when they become available in GKE:
apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-job
  namespace: app-ns
  labels:
    kueue.x-k8s.io/queue-name: tpu-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 1
  replicatedJobs:
  - name: slice
    replicas: 3
    template:
      spec:
        parallelism: 2
        completions: 2
        backoffLimit: 0
        template:
          spec:
            nodeSelector:
              cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
              cloud.google.com/gke-tpu-topology: 2x2x2
            containers:
            - name: jax-tpu
              image: jax-custom-image:latest
              ports:
              - containerPort: 8471
              - containerPort: 8080
              command:
              - entrypoint.sh
              resources:
                limits:
                  google.com/tpu: 4
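Assuming this manifest is saved as multislice-jobset.yaml (a placeholder name), it is submitted like any other Kubernetes resource:
kubectl apply -f multislice-jobset.yaml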
Kueue then prepares the workloads to be submitted through the ProvisioningRequest API:
$ kubectl get workloads
NAME                    QUEUE       ADMITTED BY         AGE
jobset-multislice-job   tpu-queue   cluster-tpu-queue   10m
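Behind the scenes, the GKE cluster autoscaler turns the admitted workload into a queued provisioning request and scales the TPU slice up once capacity is granted. Two ways to watch that happen, assuming the node pool name from the Terraform sketch above:
# The ProvisioningRequest is created in the workload's namespace
kubectl -n app-ns get provisioningrequests
# TPU nodes appear in the DWS node pool once capacity is granted
kubectl get nodes -l cloud.google.com/gke-nodepool=dws-node-pool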
Once this JobSet has been admitted, the pods are scheduled across the requested TPU topology and can utilize the underlying compute, networking, and storage capabilities accessible to the provisioned GCP hardware.
Concerns
This deployment strategy still involves many moving parts and components that must be maintained to run your jobs effectively in the GKE environment. For those who want to manage less of the workload environment themselves, Google’s Vertex AI may be a better fit for submitting PyTorch or TensorFlow jobs directly.
There are additional steps you can take to minimize disruption while your GKE jobs are running. For example, you can limit the potential for control plane (API server) upgrades by configuring a maintenance window.
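As a sketch of that approach, a recurring maintenance window keeps upgrades outside of your training runs; the cluster name, times, and recurrence below are placeholders to adjust for your own schedule:
# Allow maintenance only on weekend mornings (UTC)
gcloud container clusters update <your-cluster-name> \
  --region us-central2 \
  --maintenance-window-start "2024-06-29T04:00:00Z" \
  --maintenance-window-end "2024-06-29T08:00:00Z" \
  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU"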
TPU resources can also only be requested in a limited set of GKE regions, as noted in Google’s documentation. Even though you generally want to run a job within a single availability zone to take advantage of the location-aware placement capabilities, this regional lockdown means the lower-cost hardware is far harder to consume than a standard GKE GPU or CPU node pool.
Learn More
GCP has published a significant amount of information on the frameworks and capabilities of their TPU processors at recent events. There was a recent talk by Ishan Sharma at Cloud Field Day 20 at the Google Campus that addressed a number of questions from analysts and practitioners about Google’s approach to rolling out these capabilities for the broader GCP ecosystem.
You can also find a number of use cases online at the GCP Solutions page for the AI Hypercomputer.
We’d love to talk to you if you are running large-scale AI workloads on GKE! Just say hello at hello@superorbital.io and we look forward to learning more about your use case.