Published on November 06, 2024
A recent client engagement asked us to upgrade an operator from helm-operator to a Go controller built with the kubebuilder framework, in order to enable more sophisticated use cases than simple helm charts and templating can provide. The operator lets application authors self-manage deployment of a webapp, and it integrates with other operators to provide ingress, secrets management, and autoscaling.
What is Helm Operator?
helm-operator is a project from Operator SDK that simplifies operator development by putting most of the business logic of producing downstream resources into the templating of an embedded helm chart. It’s convenient for very simple operators that only need to apply resources in response to the upstream resource. The upstream resource is made available as chart values so that fields of the custom resource can affect the rendered output.
Sounds Great! What’s the Problem?
Well, this simplicity comes at the cost of not allowing anything more sophisticated. For example, it cannot react to the status of the downstream resources, and it cannot reach “Level 5” of the operator capability levels defined by Operator SDK.
The webapp operator used the hybrid helm operator approach, which runs the helm reconciler within a controller-runtime controller manager. With a hybrid operator, the helm reconciler can also be configured with a translator to transform the resource into a more suitable form, or to apply complex transformations that would be difficult to express in helm templates. The hybrid approach is intended either for mixing with controller-runtime reconcilers, or as a transition path toward replacing helm-operator entirely.
This post is about how we achieved that replacement, and some of the challenges along the way.
The Game Plan
Since helm-operator was already being invoked as a controller added to a controller-runtime controller Manager, it made sense to mount a second controller for the same resource, and gradually move resources from being managed by helm to being managed by the Go controller. Transitioning one downstream resource at a time would allow us to deploy to development and non-production clusters first to shake out issues with the process and reveal unanticipated problems with the migration. We carefully planned the order of resources to minimize risk and impact to running services, starting with ConfigMaps, and ending with the Deployment.
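In controller-runtime terms, that second controller is just another reconciler registered on the same manager under a distinct name. Here is a minimal sketch; the WebApp type, package path, and controller name are illustrative stand-ins for the real project, not the client's actual code.

package controllers

import (
    "context"

    "k8s.io/apimachinery/pkg/runtime"
    ctrl "sigs.k8s.io/controller-runtime"
    "sigs.k8s.io/controller-runtime/pkg/client"

    webappv1alpha1 "example.com/webapp-operator/api/v1alpha1" // hypothetical API package path
)

// WebAppReconciler is the new Go controller that gradually takes over
// downstream resources from the helm reconciler.
type WebAppReconciler struct {
    client.Client
    Scheme *runtime.Scheme
}

func (r *WebAppReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // Per-resource reconciliation (behind feature flags) goes here.
    return ctrl.Result{}, nil
}

// SetupWithManager registers this controller on the same manager that already
// hosts the helm reconciler. Both controllers watch the same kind, so this one
// gets a distinct name to avoid a controller-name collision.
func (r *WebAppReconciler) SetupWithManager(mgr ctrl.Manager) error {
    return ctrl.NewControllerManagedBy(mgr).
        Named("webapp-go").
        For(&webappv1alpha1.WebApp{}).
        Complete(r)
}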
Challenges
The simplest part of making the transition was translating each templated resource in the chart into code that fills out the corresponding Go struct and submits it to the api-server via the controller-runtime client. Beyond that mostly mechanical transformation, there were a few other details to work out.
Migrating object ownership
The helm reconciler uses helm as a library, and as such behaves very similarly to invoking helm from the command line. When updating a helm release, if the new version of the chart removes resources compared to the installed manifest, then helm will delete those resources. This creates a challenge for migrating the resources from helm over to the new controller. While some resources may be more-or-less harmless to delete and recreate, others like Deployments would cause workloads to restart across the whole cluster. Recreating an Ingress would disrupt service as the cloud load balancer is recreated, which takes minutes.
Controlling helm: Fortunately, helm honors an annotation, helm.sh/resource-policy: keep, which, when applied to a chart resource, prevents helm from deleting that resource if it is removed from the release, or if the release is deleted. Helm-operator also normally applies ownerReferences to chart resources to control garbage collection: when the upstream resource is deleted, the owner references on the downstream resources would ensure they are cleaned up. Applying the annotation also disables adding the owner references.
This requires two phases to roll out. An initial release must update the helm chart to apply the annotation. Then a second rollout will migrate the resource by removing it from the chart and allowing the Go controller to adopt the resource. The new Go controller will also apply owner references using controllerutil.SetControllerReference(). But note that between the rollouts, the downstream resource may not have any ownerReferences applied. In practice, we found that when resource-policy: keep is added, helm-operator does not remove the owner references that it added when the annotation was not present. However, any new webapps created between the rollout phases would not have owner references.
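As a rough sketch of the adoption step, building on a WebAppReconciler like the one above (the ConfigMap name is illustrative), the Go controller re-applies the owner reference and updates the object:

// Additional imports beyond the earlier sketch:
//   corev1 "k8s.io/api/core/v1"
//   "k8s.io/apimachinery/pkg/types"
//   "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"

// adoptConfigMap re-applies the controller owner reference to a ConfigMap that
// helm originally created, so garbage collection works again.
func (r *WebAppReconciler) adoptConfigMap(ctx context.Context, webapp *webappv1alpha1.WebApp) error {
    var cm corev1.ConfigMap
    key := types.NamespacedName{Namespace: webapp.Namespace, Name: webapp.Name + "-config"} // illustrative name
    if err := r.Get(ctx, key, &cm); err != nil {
        // Nothing to adopt yet; the create path handles this case.
        return client.IgnoreNotFound(err)
    }
    // Re-establish the owner reference that helm-operator stopped adding once
    // helm.sh/resource-policy: keep was applied to the chart resource.
    if err := controllerutil.SetControllerReference(webapp, &cm, r.Scheme); err != nil {
        return err
    }
    return r.Update(ctx, &cm)
}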
Finalizer for deletion: Because resource-policy: keep stops owner references from being applied, another problem arises: after the first phase of the rollout (where the annotation is applied, but helm is still managing resources), the downstream resources will not be deleted when the webapp is deleted. To address this, we had the new Go controller apply a finalizer to the webapp resource. When a webapp is deleted, the controller code for the finalizer deletes the downstream resources in place of garbage collection.
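A rough sketch of that finalizer flow, again building on the WebAppReconciler above; the finalizer name and the deleteDownstreamResources helper are illustrative, and controllerutil is controller-runtime's helper package:

const webappFinalizer = "webapp.example.com/cleanup" // illustrative name

func (r *WebAppReconciler) reconcileFinalizer(ctx context.Context, webapp *webappv1alpha1.WebApp) (ctrl.Result, error) {
    if webapp.GetDeletionTimestamp().IsZero() {
        // Not being deleted: make sure the finalizer is present so we get a
        // chance to clean up downstream resources that lack ownerReferences.
        if controllerutil.AddFinalizer(webapp, webappFinalizer) {
            return ctrl.Result{}, r.Update(ctx, webapp)
        }
        return ctrl.Result{}, nil
    }
    // Being deleted: delete the downstream resources ourselves, standing in
    // for garbage collection, then release the finalizer so deletion proceeds.
    if controllerutil.ContainsFinalizer(webapp, webappFinalizer) {
        if err := r.deleteDownstreamResources(ctx, webapp); err != nil { // hypothetical helper
            return ctrl.Result{}, err
        }
        controllerutil.RemoveFinalizer(webapp, webappFinalizer)
        return ctrl.Result{}, r.Update(ctx, webapp)
    }
    return ctrl.Result{}, nil
}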
Contingencies for rollback
Any large migration like this comes with risk of unforeseen problems and mistakes, so rollbacks must be considered within the plan.
Feature flags: We decided to add a feature flag for each downstream resource that directs it to be managed by the Go controller. When off, helm-operator continues to include the resource in the chart, and the Go controller will not manage creating or updating the resource. However, because we have prevented helm from deleting anything, the Go controller always deletes resources that are no longer needed due to a configuration change in the webapp, or—via the finalizer—due to deleting the webapp. That is, the create-and-update code is behind the feature flags, but the delete and finalizer code is not.
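In sketch form, with illustrative flag names and assuming a Flags field added to the WebAppReconciler above, that gating looks something like this:

// FeatureFlags gates only create/update per downstream resource; flag names
// and wiring are illustrative.
type FeatureFlags struct {
    ManageConfigMap  bool
    ManageIngress    bool
    ManageDeployment bool
}

func (r *WebAppReconciler) reconcileConfigMap(ctx context.Context, webapp *webappv1alpha1.WebApp) error {
    if !r.Flags.ManageConfigMap {
        // Helm still owns create/update for this resource. Deleting unneeded
        // resources and the finalizer path are intentionally NOT gated here.
        return nil
    }
    // ... build the desired ConfigMap and create or update it ...
    return nil
}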
Equivalent controllers: The Go controller should, initially, precisely match the output of the helm controller. New behaviors can come in subsequent releases. Ensuring the controllers have equivalent output was taken on by the unit tests. Existing unit tests in the project, using EnvTest, had pretty good coverage of the expected output for a given input manifest. I adapted the tests to be run twice, with each controller enabled.
Testing roll-forward and -back: Because accidentally deleting resources would cause a major disruption, I added tests of roll-forward and roll-back of each feature flag to ensure each object is not deleted by mistake in the process. A couple of techniques, illustrated in the sketch after this list, facilitate these tests:
- Recording and checking the UID of an object ensures that it is indeed the same object, and not an object that has been deleted and recreated with the same object key (name, namespace, apiVersion, and kind).
- Adding a label, “reconciled-by”, to the downstream resources that is set uniquely by the two controllers allows detecting when each controller has reconciled after changing feature flags and restarting the controller-manager.
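Putting those two techniques together, one such test looks roughly like the following. It assumes the usual kubebuilder envtest suite (a shared k8sClient) plus a hypothetical enableGoController helper that flips the feature flag and restarts the controller-manager.

func TestConfigMapSurvivesRollForward(t *testing.T) {
    ctx := context.Background()
    key := types.NamespacedName{Namespace: "default", Name: "my-webapp-config"}

    var before corev1.ConfigMap
    if err := k8sClient.Get(ctx, key, &before); err != nil {
        t.Fatal(err)
    }

    enableGoController(t) // hypothetical: flip the flag, restart the manager

    // Wait for the Go controller to stamp its reconciled-by label.
    var after corev1.ConfigMap
    deadline := time.Now().Add(30 * time.Second)
    for {
        if err := k8sClient.Get(ctx, key, &after); err == nil &&
            after.Labels["reconciled-by"] == "webapp-go-controller" {
            break
        }
        if time.Now().After(deadline) {
            t.Fatal("timed out waiting for the Go controller to reconcile")
        }
        time.Sleep(time.Second)
    }

    // Same UID means the object was adopted in place, not deleted and recreated.
    if after.UID != before.UID {
        t.Fatalf("ConfigMap was recreated: UID %s != %s", after.UID, before.UID)
    }
}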
Much like a cuckoo, we tricked helm into adopting our children.
Tricking Helm: After a rollback, helm needs to resume control of the downstream resources. Helm normally does not want to trample resources that it did not create. We can trick Helm into adopting resources created by the Go controller by applying the labels and annotations that it adds to chart resources to mark them as helm managed. Adoption requires a label ("app.kubernetes.io/managed-by": "Helm") and two annotations ("meta.helm.sh/release-name" and "meta.helm.sh/release-namespace"). Helm-operator names the helm release after the upstream resource’s metadata.name, so the controller assigns these annotations with the webapp’s name and namespace.
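In the Go controller, that amounts to a small helper along these lines; the helper name is ours, while the label and annotation keys are the ones helm checks during adoption:

// addHelmAdoptionMetadata stamps the labels and annotations helm looks for
// when adopting an existing object into a release.
func addHelmAdoptionMetadata(obj client.Object, webapp *webappv1alpha1.WebApp) {
    labels := obj.GetLabels()
    if labels == nil {
        labels = map[string]string{}
    }
    labels["app.kubernetes.io/managed-by"] = "Helm"
    obj.SetLabels(labels)

    annotations := obj.GetAnnotations()
    if annotations == nil {
        annotations = map[string]string{}
    }
    // helm-operator names the release after the upstream resource, so the
    // webapp's name and namespace identify the release to adopt into.
    annotations["meta.helm.sh/release-name"] = webapp.Name
    annotations["meta.helm.sh/release-namespace"] = webapp.Namespace
    obj.SetAnnotations(annotations)
}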
Argo sync-policy
In the customer’s environment, every WebApp resource is deployed by an Argo Application. Argo is configured to automatically prune resources that are no longer part of the Application manifest. Argo labels the resources it creates with argocd.argoproj.io/instance. If a resource has this label, but is not part of the current manifest, then Argo will prune the resource. Pruned resources are by default deleted with foreground propagation, so downstream resources should be deleted as well. For example, if a Deployment is pruned, then its ReplicaSets and Pods will be cleaned up by garbage collection. Additionally, if a resource is owned (has an ownerReferences entry), then it will not be pruned.
The webapp operator copies labels from the webapp resource to most downstream resources. Since the webapp is labeled with argocd.argoproj.io/instance, but its downstream resources are not part of the manifest, those downstream resources are subject to pruning, unless they have ownerReferences. Normally, a controller should own its downstream resources. But when we added helm.sh/resource-policy: keep to resources, that also caused helm-operator to stop adding ownerReferences to those resources [1]. In combination with the propagated instance labels, that makes them vulnerable to pruning.
During the window after rolling out the release where the helm chart is changed to add resource-policy: keep, but before enabling the new Go controller (which would re-add ownerReferences), there’s a risk that Argo may come along and prune the very resources we are trying so hard not to allow to be deleted. Fortunately, Argo has an annotation that can be applied to a resource to disable pruning of that resource: "argocd.argoproj.io/sync-options": "Prune=false". So we added that to the list of annotations added to downstream resources by both the helm chart and the Go controller.
In a future update to the webapp operator, we can avoid this problem more simply by being more selective about what labels are copied from parent resources to child resources. Lesson learned: It’s not a good practice to blindly copy all labels to child resources.
Cleaning up
After successfully rolling out the feature flags to the various non-prod and production clusters, it was time to remove the now-redundant helm chart and helm-operator from the operator codebase. Those roll-forward-and-back tests had served their purpose and were retired. All those labels and annotations? The operator can stop adding those too. And the duty of the finalizer code to delete the downstream resources can be re-assumed by the kubernetes garbage collector now that ownerReferences are back in place.
Looking at the clusters, though, there is now an (empty) helm release for every webapp. The last step was to write a script to helm uninstall those releases after verifying each is indeed empty with helm get manifest. While we’re at it, we don’t need those labels and annotations anymore. As a one-time change, it’s simpler to remove them, and the metadata.finalizers entry, in the cleanup script than to modify the controller to remove them.
Conclusion and Summary
In total, the transition from helm-operator to an equivalent Go controller was aided by six labels and annotations on downstream resources, plus a finalizer on the upstream webapp resource, summarized in the listing below. Most of those can be removed along with the feature flags and helm charts once the transition is complete and stable. The rollback plan did, unfortunately, need to be enacted, but it succeeded in mitigating the impact of bugs in the new Go controller.
---
apiVersion: v1
kind: ConfigMap
metadata:
  namespace: default
  name: my-webapp-config
  labels:
    app.kubernetes.io/managed-by: Helm # Trick helm into adopting this in case of roll-back
    reconciled-by: webapp-go-controller # Keep track of which controller last reconciled
  annotations:
    helm.sh/resource-policy: keep # Tell helm not to delete this when removed from the chart.
    meta.helm.sh/release-name: my-webapp # Inform helm which release this should be adopted into.
    meta.helm.sh/release-namespace: default
    argocd.argoproj.io/sync-options: Prune=false # Don't let Argo prune this!
  ownerReferences:
    - apiVersion: example.com/v1alpha1 # Retain ownerReferences
      kind: WebApp
      blockOwnerDeletion: true
      controller: true
      name: my-webapp
...
[1] While helm-operator won’t add the owner reference anymore, neither will it delete the existing entries. In practice, this means this problem only affected webapp resources that were newly created (or recreated) during the rollout window.