Assembling the GKE AI Hypercomputer Blueprint

Utilizing the Google AI Hypercomputer blueprints to enable highly efficient access to the latest Google hardware for the most critical ML workloads.

James McShane
Director of Engineering Services
Teaching my kids Kubernetes with Phippy and Zee

Published on June 25, 2024


Google’s AI Hypercomputer tooling enables workloads to align their consumption of in-demand cloud resources with their specific usage requirements. By assembling the right pieces for your Ray, Kubeflow, Argo Workflows, Flux, PyTorch, or Airflow jobs, you can consume the underlying GCP resources efficiently without being blocked by the broader supply constraints in this market.

Overview

The AI Hypercomputer is not a specific product you can provision in Google Cloud, but rather a validated and supported architecture for consumption of in-demand GPU and TPU hardware in a way that aligns incentives between customers and the cloud provider. By setting appropriate parameters and using validated integrations, training workloads see lower disruption, better utilization, and a more understandable scheduling cadence.

There are two modes of this consumption, calendar (scheduled) and flex start, provided by the Dynamic Workload Scheduler. Released in late 2023, this Google Cloud API lets you provision the needed accelerators in close physical proximity for increased efficiency of network communication.

This capability highlights some of the work that Google has done to bring recent work from the Borg ecosystem to bear for Google Kubernetes Engine users.

The Characters

In this post, we will talk about a number of components that enable data scientists to run large-scale GPU and TPU workloads on GKE.

Components of the Hypercomputer

The AI Hypercomputer design features a set of layers, aligned to the standard software stack, assembled to build a validated architecture on GCP. These layers are:

  • Performance Optimized Hardware - transparent access to near-proximity TPUs and GPUs for compute, object storage, and the Jupiter data center network
  • The open software stack - utilizing open source technologies like Kubernetes, Kueue, TensorFlow, and PyTorch alongside capabilities for multislice training and multihost inference
  • Flexible consumption - utilizing Dynamic Workload Scheduler and GKE native scheduling features to create optimized consumption of Google Cloud resources

Cluster Setup

Let’s walk through setting up a Kubernetes cluster so that workloads can utilize these capabilities of the AI Hypercomputer. We will walk through the following steps:

  1. Deploy and configure the node pools
  2. Deploy Kueue
  3. Deploy JobSet controller
  4. Prepare reservations

Given an existing Kubernetes cluster, we need to add a node pool that uses the Dynamic Workload Scheduler.

resource "google_container_node_pool" "dws" {
  name       = "dws-node-pool"
  cluster    = google_container_cluster.primary.id
  location   = "us-central2"
  initial_node_count = 0
  node_locations = ["us-central2-b"] # limited zonal support for `ct` machine class
  node_config {
    machine_type = "ct4p-hightpu-4t"
    service_account = google_service_account.default.email
    oauth_scopes    = [
      "https://www.googleapis.com/auth/cloud-platform"
    ]
    workload_metadata_config {
      mode = "GKE_METADATA"
    }    
  }
  placement_policy {
    type = "COMPACT"
    tpu_topology = "2x2x2"
  }
  queued_provisioning {
    enabled = true
  }
  autoscaling {
    min_node_count = 0
    max_node_count = 2 # a 2x2x2 v4 topology is 8 chips across two 4-chip ct4p hosts
  }
  management { # Disable management actions while the machines are running
    auto_repair = false
    auto_upgrade = false
  }
}

This provisions ct4p-hightpu-4t machines in a 2x2x2 topology, giving us 8 TPU chips across 2 VMs in a multi-host configuration. Using the management block, we disable any actions the underlying GKE controller would take that could disrupt these nodes. Our goal is to pick up this hardware as it becomes available and fully utilize the provisioned machines while the job is running.
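
Once applied, you can confirm the node pool was created with queued provisioning enabled and zero nodes running (the cluster name below is a placeholder for your own):

gcloud container node-pools describe dws-node-pool \
  --cluster <your-cluster-name> \
  --location us-central2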

To deploy pods that will utilize this node pool, we will use Kueue to trigger the ProvisioningRequest API call handled by the GKE cluster autoscaler. Kueue can be installed by selecting the latest release version and running:

VERSION=v0.7.0
kubectl apply --server-side -f https://github.com/kubernetes-sigs/kueue/releases/download/$VERSION/manifests.yaml
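
Before moving on, it is worth waiting for the controller to come up; this assumes the upstream manifests install a kueue-controller-manager deployment in the kueue-system namespace:

kubectl -n kueue-system wait deploy/kueue-controller-manager \
  --for=condition=Available --timeout=5m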

In the configuration of Kueue, we need to override the kueue-manager-config ConfigMap to specify that we will be using the JobSet resource type:

apiVersion: v1
kind: ConfigMap
metadata:
  name: kueue-manager-config
  namespace: kueue-system
data:
  controller_manager_config.yaml: |
    apiVersion: config.kueue.x-k8s.io/v1beta1
    kind: Configuration
    namespace: kueue-system
    health:
      healthProbeBindAddress: :8081
    metrics:
      bindAddress: :8080
      # enableClusterQueueResources: true
    webhook:
      port: 9443
    manageJobsWithoutQueueName: true
    internalCertManagement:
      enable: true
      webhookServiceName: kueue-webhook-service
      webhookSecretName: kueue-webhook-server-cert
    waitForPodsReady:
      enable: true
      timeout: 10m
    integrations:
      frameworks:
      - "jobset.x-k8s.io/jobset"

Now, let’s create a ResourceFlavor aligned to the TPU topology and accelerator we selected in our node pool:

apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: "vlp-8"
spec:
  nodeLabels:
    cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
    cloud.google.com/gke-tpu-topology: 2x2x2
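
These nodeLabels match the labels GKE applies to TPU slice nodes. Once the Dynamic Workload Scheduler actually provisions capacity, a quick sanity check is to list the nodes carrying them (nodes will only appear while a job holds the slice):

kubectl get nodes \
  -l cloud.google.com/gke-tpu-accelerator=tpu-v4-podslice,cloud.google.com/gke-tpu-topology=2x2x2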

Now, we create a ClusterQueue that can be selected by individual namespaces.

apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: "cluster-tpu-queue"
spec:
  namespaceSelector: {}
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["google.com/tpu"]
    flavors:
    - name: "vlp-8"
      resources:
      - name: "google.com/tpu"
        nominalQuota: 8

Now, a LocalQueue enables workloads in a specific namespace to use the ClusterQueue.

apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  namespace: app-ns
  name: tpu-queue
spec:
  clusterQueue: cluster-tpu-queue
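
With both queue objects applied, you can verify that the ClusterQueue and LocalQueue exist and are wired together:

kubectl get clusterqueue cluster-tpu-queue
kubectl -n app-ns get localqueue tpu-queue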

Now that Kueue is deployed to the cluster and our application namespace has a LocalQueue, we can deploy jobs that build on the upstream suspended Job KEP via the JobSet controller.

First, install the JobSet controller by selecting a release version and deploying it to your GKE cluster:

VERSION=v0.5.2
kubectl apply --server-side -f https://github.com/kubernetes-sigs/jobset/releases/download/$VERSION/manifests.yaml
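
As with Kueue, wait for the controller to become available; this assumes the release manifests install a jobset-controller-manager deployment in the jobset-system namespace:

kubectl -n jobset-system wait deploy/jobset-controller-manager \
  --for=condition=Available --timeout=5m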

Now, we can deploy JobSets that will request the accelerators when they become available in GKE:

apiVersion: jobset.x-k8s.io/v1alpha2
kind: JobSet
metadata:
  name: multislice-job
  namespace: app-ns
  labels:
    kueue.x-k8s.io/queue-name: tpu-queue
  annotations:
    alpha.jobset.sigs.k8s.io/exclusive-topology: cloud.google.com/gke-nodepool
spec:
  failurePolicy:
    maxRestarts: 1
  replicatedJobs:
    - name: slice
      replicas: 1 # a single 2x2x2 slice, matching the one node pool provisioned above
      template:
        spec:
          parallelism: 2
          completions: 2
          backoffLimit: 0
          template:
            spec:
              nodeSelector:
                cloud.google.com/gke-tpu-accelerator: tpu-v4-podslice
                cloud.google.com/gke-tpu-topology: 2x2x2
              containers:
              - name: jax-tpu
                image: jax-custom-image:latest
                ports:
                - containerPort: 8471
                - containerPort: 8080
                command:
                - entrypoint.sh
                resources:
                  limits:
                    google.com/tpu: 4
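
Applying this manifest creates the JobSet in a suspended state, where it waits for Kueue to admit it (the filename below is just for illustration):

kubectl apply -f multislice-jobset.yaml
kubectl -n app-ns get jobset multislice-job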

Kueue then prepares the workloads to be submitted through the ProvisioningRequest API:

$ kubectl get workloads -n app-ns
NAME                   QUEUE      ADMITTED BY        AGE
jobset-multislice-job  tpu-queue  cluster-tpu-queue  10m

Once this JobSet has been admitted, the pods are scheduled across the requested TPU topology and can utilize the underlying compute, networking, and storage capabilities of the provisioned GCP hardware.
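
While waiting for capacity, you can also watch the ProvisioningRequest objects created on the workload's behalf and the resulting pods, assuming your GKE version exposes the ProvisioningRequest API:

kubectl -n app-ns get provisioningrequests
kubectl -n app-ns get pods -o wide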

Concerns

This deployment strategy still consists of many moving parts and components that must be maintained to run your jobs effectively in the GKE environment. For teams that want less control over their workload environment, Google’s Vertex AI may be a better fit for submitting PyTorch or TensorFlow jobs directly.

There are additional steps you can take to minimize disruption while your GKE jobs are running. For example, you can limit the potential for API server upgrades by configuring a maintenance window.
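
As a sketch, a recurring weekend-only maintenance window can be set on an existing cluster like this (cluster name and times are placeholders):

gcloud container clusters update <your-cluster-name> \
  --location us-central2 \
  --maintenance-window-start "2024-07-06T00:00:00Z" \
  --maintenance-window-end "2024-07-06T04:00:00Z" \
  --maintenance-window-recurrence "FREQ=WEEKLY;BYDAY=SA,SU"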

You are also significantly limited in the regions where TPU resources can be requested in GKE, as shown in the documentation. Even though you generally want to run a job within a single availability zone to take advantage of the location-aware capabilities, this regional lockdown makes the lower-cost hardware far harder to use than a standard GKE GPU or CPU node pool.
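
One rough way to see where the TPU v4 machine family used above is actually offered is to list the ct4p machine types and their zones (the GKE documentation remains the authoritative source):

gcloud compute machine-types list \
  --filter="name~^ct4p" \
  --format="value(name,zone)"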

Learn More

GCP has published a significant amount of information on the frameworks and capabilities of their TPU processors at recent events. There was a recent talk by Ishan Sharma at Cloud Field Day 20 at the Google Campus that addressed a number of questions from analysts and practitioners about Google’s approach to rolling out these capabilities for the broader GCP ecosystem.

You can also find a number of use cases online at the GCP Solutions page for the AI Hypercomputer.

We’d love to talk to you if you are running large-scale AI workloads on GKE! Just say hello at hello@superorbital.io; we look forward to learning more about your use case.
