Published on November 06, 2024
A recent client engagement asked us to upgrade an operator from helm-operator to a Go controller built with the kubebuilder framework, in order to enable more sophisticated use cases than simple helm charts and templating can provide. The operator lets application authors self-manage deployment of a webapp, and it integrates with other operators to provide ingress, secrets management, and autoscaling.
What is Helm Operator?
helm-operator is a project from Operator SDK that simplifies operator development by putting most of the business logic of producing downstream resources into the templating of an embedded helm chart. It’s convenient for very simple operators that only need to apply resources in response to the upstream resource. The upstream resource is made available as chart values so that fields of the custom resource can affect the rendered output.
Sounds Great! What’s the Problem?
Well, this simplicity comes at the cost of not allowing anything more sophisticated. For example, it cannot react to the status of the downstream resources, and it cannot reach “Level 5” of the operator capability levels defined by Operator SDK.
The webapp operator used the hybrid helm operator approach, which runs the helm reconciler within a controller-runtime controller manager. With a hybrid operator, the helm reconciler can also be configured with a translator to transform the resource into a more suitable form, or to apply complex transformations that would be difficult to express in helm templates. The hybrid approach is intended either for mixing with controller-runtime reconcilers, or as a transition path toward replacing helm-operator entirely.
This post is about how we achieved that replacement, and some of the challenges along the way.
The Game Plan
Since helm-operator was already being invoked as a controller added to a controller-runtime controller Manager, it made sense to mount a second controller for the same resource, and gradually move resources from being managed by helm to being managed by the Go controller. Transitioning one downstream resource at a time would allow us to deploy to development and non-production clusters first to shake out issues with the process and reveal unanticipated problems with the migration. We carefully planned the order of resources to minimize risk and impact to running services, starting with ConfigMaps, and ending with the Deployment.
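In controller-runtime terms, that second controller is just another reconciler registered on the same manager under a distinct name. Here is a minimal sketch; the WebApp type, package path, and controller name are illustrative stand-ins for the real project, not the client's actual code.

package controllers

import (
    "context"

    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    webappv1alpha1 "example.com/webapp-operator/api/v1alpha1" // hypothetical API package path
)

// WebAppReconciler is the new Go controller that gradually takes over
// downstream resources from the helm reconciler.
type WebAppReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Per-resource reconciliation (behind feature flags) goes here.
    return ctrl.Result{}, nil
}

// SetupWithManager registers this controller on the same manager that already
// hosts the helm reconciler. Both controllers watch the same kind, so this one
// gets a distinct name to avoid a controller-name collision.
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        Named("webapp-go").
        For(&webappv1alpha1.WebApp{}).
        Complete(r)
}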
Challenges
The simplest part of making the transition was translating each templated resource in the chart into code that fills out the corresponding Go struct and submits it to the api-server via the controller-runtime client. Beyond that mostly mechanical transformation, there were a few other details to work out.
Migrating object ownership
The helm reconciler uses helm as a library, and as such behaves very similarly to invoking helm from the command line. When updating a helm release, if the new version of the chart removes resources compared to the installed manifest, then helm will delete those resources. This creates a challenge for migrating the resources from helm over to the new controller. While some resources may be more-or-less harmless to delete and recreate, others like Deployments would cause workloads to restart across the whole cluster. Recreating an Ingress would disrupt service as the cloud load balancer is recreated, which takes minutes.
Controlling helm: Fortunately, helm honors an annotation, helm.sh/resource-policy: keep, which, when applied to a chart resource, prevents helm from deleting that resource if it is removed from the release, or if the release is deleted. Helm-operator also normally applies ownerReferences to chart resources to control garbage collection: when the upstream resource is deleted, the owner references on the downstream resources would ensure they are cleaned up. Applying the annotation also disables adding the owner references.
This requires two phases to roll out. An initial release must update the helm chart to apply the annotation. Then a second rollout will migrate the resource by removing it from the chart and allowing the Go controller to adopt the resource. The new Go controller will also apply owner references using controllerutil.SetControllerReference(). But note that between the rollouts, the downstream resource may not have any ownerReferences applied. In practice, we found that when resource-policy: keep is added, helm-operator does not remove the owner references that it added when the annotation was not present. However, any new webapps created between the rollout phases would not have owner references.
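As a rough sketch of the adoption step, building on a WebAppReconciler like the one above (the ConfigMap name is illustrative), the Go controller re-applies the owner reference and updates the object:

// Additional imports beyond the earlier sketch:
//   corev1 "k8s.io/api/core/v1"
//   "k8s.io/apimachinery/pkg/types"
//   "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

// adoptConfigMap re-applies the controller owner reference to a ConfigMap that
// helm originally created, so garbage collection works again.
func (r *WebAppReconciler) adoptConfigMap(ctx context.Context, webapp *webappv1alpha1.WebApp) error {
    var cm corev1.ConfigMap
    key := types.NamespacedName{Namespace: webapp.Namespace, Name: webapp.Name + "-config"} // illustrative name
    if err := r.Get(ctx, key, &cm); err != nil {
        // Nothing to adopt yet; the create path handles this case.
        return client.IgnoreNotFound(err)
    }
    // Re-establish the owner reference that helm-operator stopped adding once
    // helm.sh/resource-policy: keep was applied to the chart resource.
    if err := controllerutil.SetControllerReference(webapp, &cm, r.Scheme); err != nil {
        return err
    }
    return r.Update(ctx, &cm)
}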
Finalizer for deletion: Because resource-policy: keep stops owner references from being applied, another problem arises: after the first phase of the rollout (where the annotation is applied, but helm is still managing resources), the downstream resources will not be deleted when the webapp is deleted. To address this, we had the new Go controller apply a finalizer to the webapp resource. When a webapp is deleted, the controller code for the finalizer deletes the downstream resources in place of garbage collection.
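A rough sketch of that finalizer flow, again building on the WebAppReconciler above; the finalizer name and the deleteDownstreamResources helper are illustrative, and controllerutil is controller-runtime's helper package:

const webappFinalizer = "webapp.example.com/cleanup" // illustrative name

func (r *WebAppReconciler) reconcileFinalizer(ctx context.Context, webapp *webappv1alpha1.WebApp) (ctrl.Result, error) {
    if webapp.GetDeletionTimestamp().IsZero() {
        // Not being deleted: make sure the finalizer is present so we get a
        // chance to clean up downstream resources that lack ownerReferences.
        if controllerutil.AddFinalizer(webapp, webappFinalizer) {
            return ctrl.Result{}, r.Update(ctx, webapp)
        }
        return ctrl.Result{}, nil
    }
    // Being deleted: delete the downstream resources ourselves, standing in
    // for garbage collection, then release the finalizer so deletion proceeds.
    if controllerutil.ContainsFinalizer(webapp, webappFinalizer) {
        if err := r.deleteDownstreamResources(ctx, webapp); err != nil { // hypothetical helper
            return ctrl.Result{}, err
        }
        controllerutil.RemoveFinalizer(webapp, webappFinalizer)
        return ctrl.Result{}, r.Update(ctx, webapp)
    }
    return ctrl.Result{}, nil
}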
Contingencies for rollback
Any large migration like this comes with risk of unforeseen problems and mistakes, so rollbacks must be considered within the plan.
Feature flags: We decided to add a feature flag for each downstream resource that directs it to be managed by the Go controller. When off, helm-operator continues to include the resource in the chart, and the Go controller will not manage creating or updating the resource. However, because we have prevented helm from deleting anything, the Go controller always deletes resources that are no longer needed due to a configuration change in the webapp, or—via the finalizer—due to deleting the webapp. That is, the create-and-update code is behind the feature flags, but the delete and finalizer code is not.
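In sketch form, with illustrative flag names and assuming a Flags field added to the WebAppReconciler above, that gating looks something like this:

// FeatureFlags gates only create/update per downstream resource; flag names
// and wiring are illustrative.
type FeatureFlags struct {
    ManageConfigMap  bool
    ManageIngress    bool
    ManageDeployment bool
}

func (r *WebAppReconciler) reconcileConfigMap(ctx context.Context, webapp *webappv1alpha1.WebApp) error {
    if !r.Flags.ManageConfigMap {
        // Helm still owns create/update for this resource. Deleting unneeded
        // resources and the finalizer path are intentionally NOT gated here.
        return nil
    }
    // ... build the desired ConfigMap and create or update it ...
    return nil
}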
Equivalent controllers: The Go controller should, initially, precisely match the output of the helm controller. New behaviors can come in subsequent releases. Ensuring the controllers have equivalent output was taken on by the unit tests. Existing unit tests in the project, using EnvTest, had pretty good coverage of the expected output for a given input manifest. I adapted the tests to be run twice, with each controller enabled.
Testing roll-forward and -back: Because accidentally deleting resources would cause a major disruption, I added tests of roll-forward and roll-back of each feature flag to ensure each object is not deleted by mistake in the process. A couple of techniques, illustrated in the sketch after this list, facilitate these tests:
- Recording and checking the UID of an object ensures that it is indeed the same object, and not an object that has been deleted and recreated with the same object key (name, namespace, apiVersion, and kind).
- Adding a label, “reconciled-by”, to the downstream resources that is set uniquely by the two controllers allows detecting when each controller has reconciled after changing feature flags and restarting the controller-manager.
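Putting those two techniques together, one such test looks roughly like the following. It assumes the usual kubebuilder envtest suite (a shared k8sClient) plus a hypothetical enableGoController helper that flips the feature flag and restarts the controller-manager.

func TestConfigMapSurvivesRollForward(t *testing.T) {
    ctx := context.Background()
    key := types.NamespacedName{Namespace: "default", Name: "my-webapp-config"}

    var before corev1.ConfigMap
    if err := k8sClient.Get(ctx, key, &before); err != nil {
        t.Fatal(err)
    }

    enableGoController(t) // hypothetical: flip the flag, restart the manager

    // Wait for the Go controller to stamp its reconciled-by label.
    var after corev1.ConfigMap
    deadline := time.Now().Add(30 * time.Second)
    for {
        if err := k8sClient.Get(ctx, key, &after); err == nil &&
            after.Labels["reconciled-by"] == "webapp-go-controller" {
            break
        }
        if time.Now().After(deadline) {
            t.Fatal("timed out waiting for the Go controller to reconcile")
        }
        time.Sleep(time.Second)
    }

    // Same UID means the object was adopted in place, not deleted and recreated.
    if after.UID != before.UID {
        t.Fatalf("ConfigMap was recreated: UID %s != %s", after.UID, before.UID)
    }
}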
Much like a cuckoo, we tricked helm into adopting our children.
Tricking Helm: After a rollback, helm needs to resume control of the downstream resources. Helm normally does not want to trample resources that it did not create. We can trick Helm into adopting resources created by the Go controller by applying the labels and annotations that it adds to chart resources to mark them as helm managed. Adoption requires a label ("app.kubernetes.io/managed-by": "Helm") and two annotations ("meta.helm.sh/release-name" and "meta.helm.sh/release-namespace"). Helm-operator names the helm release after the upstream resource’s metadata.name, so the controller assigns these annotations with the webapp’s name and namespace.
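In the Go controller, that amounts to a small helper along these lines; the helper name is ours, while the label and annotation keys are the ones helm checks during adoption:

// addHelmAdoptionMetadata stamps the labels and annotations helm looks for
// when adopting an existing object into a release.
func addHelmAdoptionMetadata(obj client.Object, webapp *webappv1alpha1.WebApp) {
    labels := obj.GetLabels()
    if labels == nil {
        labels = map[string]string{}
    }
    labels["app.kubernetes.io/managed-by"] = "Helm"
    obj.SetLabels(labels)

    annotations := obj.GetAnnotations()
    if annotations == nil {
        annotations = map[string]string{}
    }
    // helm-operator names the release after the upstream resource, so the
    // webapp's name and namespace identify the release to adopt into.
    annotations["meta.helm.sh/release-name"] = webapp.Name
    annotations["meta.helm.sh/release-namespace"] = webapp.Namespace
    obj.SetAnnotations(annotations)
}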
Argo sync-policy
In the customer’s environment, every WebApp resource is deployed by an Argo Application. Argo is configured to automatically prune resources that are no longer part of the Application manifest. Argo labels the resources it creates with argocd.argoproj.io/instance. If a resource has this label, but is not part of the current manifest, then Argo will prune the resource. Pruned resources are by default deleted with foreground propagation, so downstream resources should be deleted as well. For example, if a Deployment is pruned, then its ReplicaSets and Pods will be cleaned up by garbage collection. Additionally, if a resource is owned (has an ownerReferences entry), then it will not be pruned.
The webapp operator copies labels from the webapp resource to most downstream resources. Since the webapp is labeled with argocd.argoproj.io/instance, but its downstream resources are not part of the manifest, those downstream resources are subject to pruning, unless they have ownerReferences. Normally, a controller should own its downstream resources. But when we added helm.sh/resource-policy: keep to resources, that also caused helm-operator to stop adding ownerReferences to those resources [1]. In combination with the propagated instance labels, that makes them vulnerable to pruning.
During the window after rolling out the release where the helm chart is changed to add resource-policy: keep, but before enabling the new Go controller (which would re-add ownerReferences), there’s a risk that Argo may come along and prune the very resources we are trying so hard not to allow to be deleted. Fortunately, Argo has an annotation that can be applied to a resource to disable pruning of that resource: "argocd.argoproj.io/sync-options": "Prune=false". So we added that to the list of annotations added to downstream resources by both the helm chart and the Go controller.
In a future update to the webapp operator, we can avoid this problem more simply by being more selective about what labels are copied from parent resources to child resources. Lesson learned: It’s not a good practice to blindly copy all labels to child resources.
Cleaning up
After successfully rolling out the feature flags to the various non-prod and production clusters, it was time to remove the now-redundant helm chart and helm-operator from the operator codebase. Those roll-forward-and-back tests had served their purpose and were retired. All those labels and annotations? The operator can stop adding those too. And the duty of the finalizer code to delete the downstream resources can be re-assumed by the kubernetes garbage collector now that ownerReferences are back in place.
Looking at the clusters, though, there is now an (empty) helm release for every webapp. The last step was to write a script to helm uninstall those releases after verifying each is indeed empty with helm get manifest. While we’re at it, we don’t need those labels and annotations anymore. As a one-time change, it’s simpler to remove them, and the metadata.finalizers entry, in the cleanup script than to modify the controller to remove them.
Conclusion and Summary
In total, the transition from helm-operator to an equivalent Go controller was aided by six labels and annotations on downstream resources, plus a finalizer on the upstream webapp resource, summarized in the listing below. Most of those can be removed along with the feature flags and helm charts once the transition is complete and stable. The rollback plan did, unfortunately, need to be enacted, but it succeeded in mitigating the impact of bugs in the new Go controller.
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: default
  name: my-webapp-config
  labels:
    app.kubernetes.io/managed-by: Helm # Trick helm into adopting this in case of roll-back
    reconciled-by: webapp-go-controller # Keep track of which controller last reconciled
  annotations:
    helm.sh/resource-policy: keep # Tell helm not to delete this when removed from the chart.
    meta.helm.sh/release-name: my-webapp # Inform helm which release this should be adopted into.
    meta.helm.sh/release-namespace: default
    argocd.argoproj.io/sync-options: Prune=false # Don't let Argo prune this!
  ownerReferences:
    - apiVersion: example.com/v1alpha1 # Retain ownerReferences
      kind: WebApp
      blockOwnerDeletion: true
      controller: true
      name: my-webapp
...
[1] While helm-operator won’t add the owner reference anymore, neither will it delete the existing entries. In practice, this means this problem only affected webapp resources that were newly created (or recreated) during the rollout window.