Published on February 09, 2024
This is part of our series on Cluster API and how it can be a solution for managing large numbers of clusters at scale. For the first part of this series, see Cluster API: A Deep Dive On Declarative Cluster Lifecycle Management.
At SuperOrbital, we often encounter challenging projects where customers want to quickly scale their Kubernetes clusters, deploy their workloads on them, and have complete control over the management of these clusters without overburdening their DevOps team. Most of them make use of AWS’s EKS offering, which simplifies the Kubernetes management aspect. However, copy-pasting Terraform configuration files over and over for every cluster, or refactoring large modules to try to DRY the code, can become a pain, as can the lengthy terraform apply sessions needed to keep the state of all the clusters up-to-date. This is where CAPI comes in handy with the AWS infrastructure provider: Cluster API Provider AWS, also known as CAPA!
What is CAPA?
CAPA is the CAPI infrastructure provider for AWS; it allows users to deploy CAPI-managed clusters on AWS infrastructure. Its list of features includes (but is not limited to):
- Fully-featured Kubernetes clusters on EC2 instances
- No need to faff around with the network configuration for control planes since CAPA will do that all for you
- Support for managing EKS clusters
- Ability to separate the clusters into different AWS accounts, even for the management cluster
- Cost savings through support for Spot instances
- Best practices for HA, such as the ability to deploy a cluster’s nodes across different availability zones by default.
The only prerequisite for CAPA is access to an administrative AWS account, which is used to bootstrap the IAM roles CAPA needs to create and manage all the resources for these clusters.
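If you're curious what that bootstrap step looks like, the standard CAPA tooling can create those IAM resources for you via a CloudFormation stack. A minimal sketch, assuming clusterawsadm is installed and administrative credentials are exported in your shell:

# Create (or update) the CloudFormation stack containing the IAM roles and
# policies CAPA needs. Run this once per AWS account with admin credentials.
export AWS_REGION=us-east-1
export AWS_ACCESS_KEY_ID="<ADMIN ACCESS KEY>"
export AWS_SECRET_ACCESS_KEY="<ADMIN SECRET KEY>"

clusterawsadm bootstrap iam create-cloudformation-stack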
Creating the Management Cluster
The process for deploying CAPA can be a point of friction, as it requires an existing cluster and manually executing a series of commands one after another. To help you get your feet wet and try out CAPA’s capabilities, we at SuperOrbital built the capa-bootstrap Terraform configuration, which automates provisioning a single-node management cluster with everything needed to start creating and managing production-ready clusters in AWS! The bootstrapper creates a single EC2 instance, sets it up as a cluster, and installs all the CAPI and CAPA controllers so that it can serve as our management cluster.
WARNING: For a production-ready CAPA installation, it’s recommended that the management cluster have multiple nodes to ensure CAPA is always available, since any downtime means that no cluster or node can be created, modified, or deleted. For the sake of this tutorial, capa-bootstrap installs CAPI and CAPA on a single-node cluster. Additionally, as the management cluster itself is now a critical piece of infrastructure, be sure to back it up using your normal process. The capa-bootstrap tool is meant only for educational purposes, and production usage is highly discouraged!
Clone the capa-bootstrap repository and cd into it:
$ git clone https://github.com/superorbital/capa-bootstrap.git
Cloning into 'capa-bootstrap'...
remote: Enumerating objects: 232, done.
remote: Counting objects: 100% (232/232), done.
remote: Compressing objects: 100% (83/83), done.
remote: Total 232 (delta 126), reused 229 (delta 123), pack-reused 0
Receiving objects: 100% (232/232), 40.25 KiB | 2.87 MiB/s, done.
Resolving deltas: 100% (126/126), done.
$ cd capa-bootstrap/
Set the required variables aws_secret_key and aws_access_key by creating a .tfvars file (using the provided example file) or by exporting them as environment variables:
export TF_VAR_aws_access_key="<MY ACCESS KEY>"
export TF_VAR_aws_secret_key="<MY SECRET KEY>"
Review and modify any other optional variables, such as the instance type and the Kubernetes version for the management cluster if desired.
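If you prefer a file over environment variables, the same two values can live in a .tfvars file. A minimal sketch with only the two required variables named above; check the repository's variables.tf for any other variable names and their defaults:

# Write a minimal terraform.tfvars with the two required variables.
cat > terraform.tfvars <<'EOF'
aws_access_key = "<MY ACCESS KEY>"
aws_secret_key = "<MY SECRET KEY>"
EOF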
Execute terraform init, followed by terraform apply:
$ terraform apply
data.aws_ami.latest_ubuntu: Reading...
data.aws_ami.latest_ubuntu: Read complete after 0s [id=ami-04ab94c703fb30101]
Terraform used the selected providers to generate the following execution plan. Resource actions are indicated with the following symbols:
+ create
Terraform will perform the following actions:
# aws_instance.capa_server will be created
+ resource "aws_instance" "capa_server" {
+ ami = "ami-0c7217cdde317cfec"
+ arn = (known after apply)
+ associate_public_ip_address = (known after apply)
+ availability_zone = (known after apply)
+ cpu_core_count = (known after apply)
+ cpu_threads_per_core = (known after apply)
+ disable_api_stop = (known after apply)
+ disable_api_termination = (known after apply)
+ ebs_optimized = (known after apply)
+ get_password_data = false
+ host_id = (known after apply)
+ host_resource_group_arn = (known after apply)
+ iam_instance_profile = (known after apply)
+ id = (known after apply)
+ instance_initiated_shutdown_behavior = (known after apply)
+ instance_lifecycle = (known after apply)
+ instance_state = (known after apply)
+ instance_type = "m5a.large"
+ ipv6_address_count = (known after apply)
+ ipv6_addresses = (known after apply)
+ key_name = (known after apply)
+ monitoring = (known after apply)
+ outpost_arn = (known after apply)
+ password_data = (known after apply)
+ placement_group = (known after apply)
+ placement_partition_number = (known after apply)
+ primary_network_interface_id = (known after apply)
+ private_dns = (known after apply)
+ private_ip = (known after apply)
+ public_dns = (known after apply)
+ public_ip = (known after apply)
+ secondary_private_ips = (known after apply)
+ security_groups = (known after apply)
+ source_dest_check = true
+ spot_instance_request_id = (known after apply)
+ subnet_id = (known after apply)
+ tags = {
+ "Name" = "superorbital-quickstart-capa-server"
+ "Owner" = "capa-bootstrap"
}
+ tags_all = {
+ "Name" = "superorbital-quickstart-capa-server"
+ "Owner" = "capa-bootstrap"
}
+ tenancy = (known after apply)
+ user_data = (known after apply)
+ user_data_base64 = (known after apply)
+ user_data_replace_on_change = false
+ vpc_security_group_ids = (known after apply)
...<SNIPPED>...
Review the plan and accept!
Plan: 10 to add, 0 to change, 0 to destroy.
Changes to Outputs:
+ capa_node_ip = (known after apply)
Do you want to perform these actions?
Terraform will perform the actions described above.
Only 'yes' will be accepted to approve.
Enter a value: yes
After a successful terraform apply, you should have a cluster running CAPA!
tls_private_key.global_key: Creating...
tls_private_key.global_key: Creation complete after 0s [id=235a6d5c450ee1dd91714d4b4f68bd0639e5c59f]
local_file.ssh_public_key_openssh: Creating...
local_sensitive_file.ssh_private_key_pem: Creating...
local_file.ssh_public_key_openssh: Creation complete after 0s [id=2c4b12d7db822a33a5f52c8485a1390976543a89]
local_sensitive_file.ssh_private_key_pem: Creation complete after 0s [id=34e9aab08bdc9ff2becd847de034699154104190]
aws_key_pair.capa_bootstrap_key_pair: Creating...
aws_security_group.capa_bootstrap_sg_allowall: Creating...
aws_key_pair.capa_bootstrap_key_pair: Creation complete after 0s [id=superorbital-quickstart-capa-bootstrap-20240126213038814400000001]
aws_security_group.capa_bootstrap_sg_allowall: Creation complete after 3s [id=sg-0bac29de0faa35c77]
aws_instance.capa_server: Creating...
aws_instance.capa_server: Still creating... [10s elapsed]
aws_instance.capa_server: Still creating... [20s elapsed]
aws_instance.capa_server (remote-exec): Waiting for cloud-init to complete...
aws_instance.capa_server: Still creating... [30s elapsed]
aws_instance.capa_server (remote-exec): Completed cloud-init!
aws_instance.capa_server: Creation complete after 35s [id=i-0e2961b43b81829d1]
module.capa.ssh_resource.install_k3s: Creating...
module.capa.ssh_resource.install_k3s: Still creating... [10s elapsed]
module.capa.ssh_resource.install_k3s: Creation complete after 11s [id=474554630196844793]
module.capa.ssh_resource.retrieve_config: Creating...
module.capa.ssh_resource.install_capa: Creating...
module.capa.ssh_resource.retrieve_config: Creation complete after 1s [id=6493639830213559672]
module.capa.local_file.kube_config_server_yaml: Creating...
module.capa.local_file.kube_config_server_yaml: Creation complete after 0s [id=d57ac94d6db2c9bd6d6f7bcd08e5024e4f79c833]
module.capa.ssh_resource.install_capa: Still creating... [10s elapsed]
module.capa.ssh_resource.install_capa: Still creating... [20s elapsed]
module.capa.ssh_resource.install_capa: Still creating... [30s elapsed]
module.capa.ssh_resource.install_capa: Still creating... [40s elapsed]
module.capa.ssh_resource.install_capa: Still creating... [50s elapsed]
module.capa.ssh_resource.install_capa: Creation complete after 52s [id=2560992061155103173]
Apply complete! Resources: 10 added, 0 changed, 0 destroyed.
Outputs:
capa_node_ip = "3.81.60.22"
Now that the management cluster is created, we can check if the Pods are running with kubectl get pods. The kubeconfig for the management cluster will be placed in your current directory by Terraform.
$ kubectl --kubeconfig capa-management.kubeconfig get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
cert-manager cert-manager-cainjector-c778d44d8-nl57j 1/1 Running 0 20m
cert-manager cert-manager-7d75f47cc5-zjptm 1/1 Running 0 20m
cert-manager cert-manager-webhook-55d76f97bb-cwg6n 1/1 Running 0 20m
kube-system coredns-6799fbcd5-xrk5c 1/1 Running 0 20m
kube-system local-path-provisioner-84db5d44d9-qnmpg 1/1 Running 0 20m
kube-system helm-install-traefik-crd-zrvch 0/1 Completed 0 20m
kube-system metrics-server-67c658944b-wblth 1/1 Running 0 20m
kube-system svclb-traefik-27bfa6a0-gnmlx 2/2 Running 0 20m
kube-system helm-install-traefik-4bdww 0/1 Completed 1 20m
kube-system traefik-f4564c4f4-6g6q5 1/1 Running 0 20m
capi-system capi-controller-manager-855f9f859-w8c4r 1/1 Running 0 20m
capi-kubeadm-bootstrap-system capi-kubeadm-bootstrap-controller-manager-75b968db86-cgvcv 1/1 Running 0 20m
capi-kubeadm-control-plane-system capi-kubeadm-control-plane-controller-manager-75758b5479-2f4zg 1/1 Running 0 20m
capa-system capa-controller-manager-6b48bcb87c-cb9wt 1/1 Running 0 20m
One last thing to note is that if you wish to poke into the cluster directly, you can SSH into the EC2 instance where the management cluster is running, using the public/private SSH key pair that was created in the same directory.
$ ls | grep id_rsa
id_rsa
id_rsa.pub
$ ssh -i id_rsa ubuntu@<NODE IP ADDRESS>
Creating a Managed Cluster
With CAPA up and running, there are two ways of creating managed clusters: using the clusterctl command to generate manifests and apply them directly to the cluster, or creating the objects in the cluster with a dedicated Kubernetes client. The latter method is better suited for a future post where we talk about adding APIs on top of CAPI, so today we’ll focus on applying manifests.
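For reference, the clusterctl route looks roughly like this. Treat it as a sketch: the cluster name is illustrative, the environment variables shown are the ones CAPA's default templates typically expect, and the flags and values should be adjusted for your own versions and account:

# Variables consumed by CAPA's default cluster templates (adjust as needed).
export AWS_REGION=us-east-1
export AWS_SSH_KEY_NAME=default
export AWS_CONTROL_PLANE_MACHINE_TYPE=t3.medium
export AWS_NODE_MACHINE_TYPE=t3.medium

# Render a cluster manifest from the management cluster's templates and apply it.
clusterctl generate cluster aws-cluster-2 \
  --kubeconfig capa-management.kubeconfig \
  --infrastructure aws \
  --kubernetes-version v1.28.3 \
  --control-plane-machine-count=3 \
  --worker-machine-count=3 > aws-cluster-2.yaml

kubectl --kubeconfig capa-management.kubeconfig apply -f aws-cluster-2.yaml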
The capa-bootstrap directory already has two example manifests for creating a simple cluster on AWS using EC2 instances with a CAPI-managed control plane and a simple EKS cluster with an AWS-managed control plane. Let’s take a look at the YAML defined for the CAPI-managed control plane cluster:
# Namespace (1)
apiVersion: v1
kind: Namespace
metadata:
name: aws-cluster-1
---
# Cluster definition (2)
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata:
name: aws-cluster-1
namespace: aws-cluster-1
spec:
clusterNetwork:
pods:
cidrBlocks:
- 192.168.0.0/16 # (5)
controlPlaneRef:
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
name: aws-cluster-1-control-plane
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
name: aws-cluster-1
---
# AWSCluster definition (2)
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSCluster
metadata:
name: aws-cluster-1
namespace: aws-cluster-1
spec:
region: us-east-1
sshKeyName: default
---
# KubeadmControlPlane definition (3)
apiVersion: controlplane.cluster.x-k8s.io/v1beta1
kind: KubeadmControlPlane
metadata:
name: aws-cluster-1-control-plane
namespace: aws-cluster-1
spec:
machineTemplate:
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
name: aws-cluster-1-control-plane
replicas: 3 # (4)
version: v1.28.3 # (6)
Here, we’re defining a namespace (1), a Cluster object that is configured as an “AWSCluster”-type cluster (2), and the configuration of our desired control plane for this cluster (3). We have complete control over the number of control plane replicas (4), the pod CIDR (5), and the Kubernetes version (6) that will be deployed for this control plane.
# AWSMachineTemplate (control plane)
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
name: aws-cluster-1-control-plane
namespace: aws-cluster-1
spec:
template:
spec:
iamInstanceProfile: control-plane.cluster-api-provider-aws.sigs.k8s.io
instanceType: t3.medium # (9)
sshKeyName: default
---
# MachineDeployment
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
name: aws-cluster-1-md-0
namespace: aws-cluster-1
spec:
clusterName: aws-cluster-1
replicas: 3 # (7)
selector:
matchLabels: null
template:
spec:
bootstrap:
configRef:
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
name: aws-cluster-1-md-0
clusterName: aws-cluster-1
infrastructureRef:
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
name: aws-cluster-1-md-0
version: v1.28.3 # (10)
---
# AWSMachineTemplate (worker nodes)
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
name: aws-cluster-1-md-0
namespace: aws-cluster-1
spec:
template:
spec:
iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
instanceType: t3.medium # (9)
sshKeyName: default
---
# KubeadmConfigTemplate (8)
apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
kind: KubeadmConfigTemplate
metadata:
name: aws-cluster-1-md-0
namespace: aws-cluster-1
spec:
template:
spec:
joinConfiguration:
nodeRegistration:
kubeletExtraArgs:
cloud-provider: aws
name: '{{ ds.meta_data.local_hostname }}'
We have a similar degree of configurability for the nodes in the cluster themselves, including the number of replicas in your MachineDeployments (7), how the kubelet is bootstrapped (8), and the instance type used for each node (9). All of these manifests can be modified to suit the needs of your managed cluster – maybe you’d like your cluster to run an older version of Kubernetes than the one used in the manifest (10), or maybe a single MachineDeployment is not enough for your use case because you want to mix ARM and x86 workloads in a single cluster.
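As a rough sketch of that last scenario (this is not one of the repo’s example manifests; the names, replica count, and Graviton instance type are all illustrative, and you’d additionally need to point the template at an arm64 AMI, which is omitted here), an extra worker pool for ARM nodes could look something like this:

kubectl --kubeconfig capa-management.kubeconfig apply -f - <<'EOF'
# Hypothetical second worker pool for ARM (Graviton) nodes.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: aws-cluster-1-md-arm
  namespace: aws-cluster-1
spec:
  clusterName: aws-cluster-1
  replicas: 2
  selector:
    matchLabels: null
  template:
    spec:
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: aws-cluster-1-md-0   # reuse the existing bootstrap template
      clusterName: aws-cluster-1
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: AWSMachineTemplate
        name: aws-cluster-1-md-arm
      version: v1.28.3
---
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: AWSMachineTemplate
metadata:
  name: aws-cluster-1-md-arm
  namespace: aws-cluster-1
spec:
  template:
    spec:
      iamInstanceProfile: nodes.cluster-api-provider-aws.sigs.k8s.io
      instanceType: m6g.medium       # arm64 instance type; requires an arm64 AMI
      sshKeyName: default
EOF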
Creating these clusters is as simple as applying these manifests, which creates all the objects the controllers need in order to provision the cloud resources. Doing so for the EKS cluster shows the following output:
$ kubectl --kubeconfig capa-management.kubeconfig apply -f examples/eks-cluster-1.yaml
cluster.cluster.x-k8s.io/eks-cluster-1 created
awsmanagedcluster.infrastructure.cluster.x-k8s.io/eks-cluster-1 created
awsmanagedcontrolplane.controlplane.cluster.x-k8s.io/eks-cluster-1-control-plane created
machinedeployment.cluster.x-k8s.io/eks-cluster-1-md-0 created
awsmachinetemplate.infrastructure.cluster.x-k8s.io/eks-cluster-1-md-0 created
eksconfigtemplate.bootstrap.cluster.x-k8s.io/eks-cluster-1-md-0 created
The cluster is now being created, so sit back and relax while CAPI and CAPA do the hard work! For details on the progress of the cluster creation, you can check the events on the Cluster object itself, or check the status of the various components (control plane, machines, and cluster) in the logs of the controllers running on the management cluster.
$ kubectl --kubeconfig capa-management.kubeconfig describe cluster -n eks-cluster-1 eks-cluster-1
...
Status:
Conditions:
Last Transition Time: 2023-09-21T19:10:55Z
Message: 4 of 10 completed
Reason: RouteTableReconciliationFailed
Severity: Warning
Status: False
Type: Ready
Last Transition Time: 2023-09-21T19:07:35Z
Message: Waiting for control plane provider to indicate the control plane has been initialized
Reason: WaitingForControlPlaneProviderInitialized
Severity: Info
Status: False
Type: ControlPlaneInitialized
Last Transition Time: 2023-09-21T19:10:55Z
Message: 4 of 10 completed
Reason: RouteTableReconciliationFailed
Severity: Warning
Status: False
Type: ControlPlaneReady
Last Transition Time: 2023-09-21T19:07:35Z
Status: True
Type: InfrastructureReady
Infrastructure Ready: true
Observed Generation: 1
Phase: Provisioning
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal Provisioning 8m37s (x2 over 8m37s) cluster-controller Cluster eks-cluster-1 is Provisioning
Normal InfrastructureReady 8m37s cluster-controller Cluster eks-cluster-1 InfrastructureReady is now true
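If you’d rather watch progress from the controller side, the logs and intermediate objects are useful too. A few commands we find handy (the deployment name matches the capa-system pod listed earlier):

# Follow the CAPA controller logs while the cluster reconciles.
kubectl --kubeconfig capa-management.kubeconfig -n capa-system \
  logs deploy/capa-controller-manager -f

# Inspect the intermediate CAPI objects created for the new cluster.
kubectl --kubeconfig capa-management.kubeconfig get machinedeployments,machines -A
kubectl --kubeconfig capa-management.kubeconfig -n eks-cluster-1 get awsmanagedcontrolplane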
Once the cluster is created, congratulations! We can see the status of our created clusters with the kubectl get clusters command on our management cluster:
$ kubectl get clusters -A
NAMESPACE NAME PHASE AGE VERSION
development-aws aws-cluster-1 Provisioned 105m
production-eks bravo Provisioned 104m
development-eks eks-cluster-1 Provisioned 13m
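Day-2 changes work the same way: you edit or scale the objects on the management cluster and the controllers reconcile the infrastructure to match. As a sketch, growing a worker pool (using the namespace and names from the manifest we walked through earlier) could look like this:

# MachineDeployments support the scale subresource, so kubectl scale works directly.
kubectl --kubeconfig capa-management.kubeconfig -n aws-cluster-1 \
  scale machinedeployment aws-cluster-1-md-0 --replicas=5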
The last thing to note is that each cluster type (AWS EC2 or EKS) has a different kind of kubeconfig file available. EKS-flavored clusters have two kubeconfig secrets: one for users, and one reserved for CAPI’s own administrative use:
$ kubectl get secrets -n development-eks | grep kubeconfig
eks-cluster-1-user-kubeconfig cluster.x-k8s.io.secret 1 9m32s
eks-cluster-1-kubeconfig cluster.x-k8s.io.secret 1 9m32s
For EKS, only the user kubeconfig should be used to access the managed cluster:
$ kubectl get secrets -n development-eks eks-cluster-1-user-kubeconfig -o jsonpath='{.data.value}' | base64 -d > eks-cluster-1.kubeconfig
$ kubectl --kubeconfig eks-cluster-1.kubeconfig get nodes
NAME STATUS ROLES AGE VERSION
ip-10-0-127-104.ec2.internal Ready <none> 3m v1.22.17-eks-0a21954
$ kubectl --kubeconfig eks-cluster-1.kubeconfig get pods -A
NAMESPACE NAME READY STATUS RESTARTS AGE
kube-system aws-node-68bnp 1/1 Running 0 3m16s
kube-system coredns-7f5998f4c-gl24x 1/1 Running 0 12m
kube-system coredns-7f5998f4c-jqjjp 1/1 Running 0 12m
kube-system kube-proxy-4fxtj 1/1 Running 0 3m16s
The AWS EC2 clusters only have a single admin kubeconfig that’s maintained by CAPI, so care must be taken when using it, as it is equivalent to having root access on the managed cluster.
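For completeness, retrieving that admin kubeconfig follows the same pattern as the EKS example above, since CAPI stores it in a <cluster-name>-kubeconfig secret; clusterctl also has a helper for this. The namespace and names below are taken from the earlier manifest:

# Pull the CAPI-managed admin kubeconfig for the EC2-based cluster.
kubectl --kubeconfig capa-management.kubeconfig -n aws-cluster-1 \
  get secret aws-cluster-1-kubeconfig -o jsonpath='{.data.value}' | base64 -d > aws-cluster-1.kubeconfig

# Or, equivalently, with clusterctl:
clusterctl --kubeconfig capa-management.kubeconfig get kubeconfig aws-cluster-1 -n aws-cluster-1 > aws-cluster-1.kubeconfig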
Extra Credit: Bootstrap + Pivot
OK, this is pretty advanced, and fairly mind-bendy. If the idea of using Terraform to manage the cluster bothers you when you’re trying to go all in on CAPI, you can have CAPI manage itself! First, you’d use CAPI to create a single Kubernetes cluster (see all the previous steps up until now). Then you can promote this new cluster to be your management cluster by pivoting the installation to it from the original single-node cluster. You can now safely destroy the cluster provisioned by Terraform.
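Mechanically, the pivot is driven by clusterctl. A rough sketch, assuming the new workload cluster’s kubeconfig has been saved as new-management.kubeconfig and that both clusters run matching CAPI/CAPA versions:

# Install the CAPI and CAPA controllers on the soon-to-be management cluster.
clusterctl init --kubeconfig new-management.kubeconfig --infrastructure aws

# Move the Cluster API objects (Clusters, Machines, related secrets, ...) from
# the original single-node cluster to the new one, one namespace at a time.
clusterctl move --kubeconfig capa-management.kubeconfig \
  --to-kubeconfig new-management.kubeconfig -n aws-cluster-1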
While CAPI and CAPA can now manage their own management cluster just like the other managed clusters, this removes Terraform from the cluster lifecycle, and deleting these management resources entirely becomes more of a challenge.
For more information on how to do this and how it works, see the documentation in the CAPI book.
What’s next?
We’ve seen how CAPI and CAPA are capable of managing our infrastructure, but what about the workloads in these clusters? For our next post, we’ll be going over a few options that we have available to easily deploy and update our workloads in the managed clusters.
Subscribe (yes, we still ❤️ RSS) or join our mailing list below to stay updated!