Joel Takvorian
Kubernetes CRD: the versioning joy

(The tribulations of a Kubernetes operator developer)

I am a developer of the Network Observability operator for Kubernetes / OpenShift.

A few days ago, we released our 1.6 version -- which I hope you will try and appreciate, but this isn't the point here. I want to talk about an issue that was reported to us soon after the release.

(Screenshot: OLM Console page in OpenShift showing an error during the operator upgrade)

The error says: risk of data loss updating "flowcollectors.flows.netobserv.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD

What's that? It was a first for the team. The error is reported by OLM (the Operator Lifecycle Manager).

Investigating

Indeed, we used to serve a v1alpha1 version of our CRD. And indeed, we are now removing it. But we didn't do it abruptly. We thought we followed all the guidelines of an API versioning lifecycle. I think we did, except for one detail.

Let's rewind and recap the timeline:

  • v1alpha1 was the first version, introduced in our operator 1.0
  • in 1.2, we introduced a new v1beta1. It was the new preferred version, but the storage version was still v1alpha1. Both versions were still served, and a conversion webhook allowed converting from one to the other.
  • in 1.3, v1beta1 became the stored version. At this point, after an upgrade, every instance of our resource in etcd is in version v1beta1, right? (spoiler: it's more complicated)
  • in 1.5, we introduced a v1beta2 and flagged v1alpha1 as deprecated.
  • in 1.6, we made v1beta2 the storage version and removed v1alpha1 (a sketch of how these served / storage flags look in a CRD spec follows this list).
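Here is that sketch: roughly what the versions stanza of a CRD looks like around the 1.5 stage. It is a simplified illustration (schemas and most fields omitted), not our actual manifest, but served, storage and deprecated are the standard apiextensions.k8s.io/v1 fields:

apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata:
  name: flowcollectors.flows.netobserv.io
spec:
  group: flows.netobserv.io
  # names, scope, schemas... omitted for brevity
  versions:
  - name: v1alpha1
    served: true
    deprecated: true   # flagged as deprecated in 1.5
    storage: false
  - name: v1beta1
    served: true
    storage: true      # still the storage version at this point
  - name: v1beta2
    served: true
    storage: false     # becomes the storage version in 1.6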

And BOOM!

A few users complained about the error message mentioned above:

risk of data loss updating "flowcollectors.flows.netobserv.io": new CRD removes version v1alpha1 that is listed as a stored version on the existing CRD

And they are stuck: OLM won't let them proceed any further. Their only way out is to entirely remove the operator and the CRD, and reinstall.

In fact, only some early adopters of NetObserv have been seeing this, and we didn't see it when testing the upgrade prior to releasing. So what happened? I spent the last couple of days trying to clear the fog.

When users installed an old version (<= 1.2), the CRD recorded the storage version in its status:

kubectl get crd flowcollectors.flows.netobserv.io -ojsonpath='{.status.storedVersions}'
["v1alpha1"]

Later on, when users upgrade to 1.3, the new storage version becomes v1beta1. So, this is certainly what now appears in the CRD status. This is certainly what now appears in the CRD status, right? (Padme style)

kubectl get crd flowcollectors.flows.netobserv.io -ojsonpath='{.status.storedVersions}'
["v1alpha1","v1beta1"]

Why is it keeping v1alpha1? Oh, I know! Upgrading the operator did not necessarily change anything in the custom resources. Only resources that are written after the upgrade get persisted by the apiserver in the new storage version; different versions may coexist in etcd, which is why the status.storedVersions field is an array and not a single string. That makes sense.

Surely, I can do a dummy edit of my custom resources to make sure they are stored in the new version: the apiserver will replace each stored object with a new one, using the updated storage version. Let's do this.
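For instance, adding and then removing a throwaway annotation is enough to force the apiserver to persist the object again (the annotation key below is made up, any write would do):

# Any no-op write works; here, add then remove a dummy annotation
kubectl annotate flowcollector cluster dummy-migration=true
kubectl annotate flowcollector cluster dummy-migration-

Then check again: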

kubectl get crd flowcollectors.flows.netobserv.io -ojsonpath='{.status.storedVersions}'
["v1alpha1","v1beta1"]

Hmm...
So, I am now almost sure I don't have any v1alpha1 remaining in my cluster, but the CRD doesn't tell me that. What I learned is that the CRD status is not a source of truth for what's in etcd.

Here's what the doc says:

storedVersions lists all versions of CustomResources that were ever persisted. Tracking these versions allows a migration path for stored versions in etcd. The field is mutable so a migration controller can finish a migration to another version (ensuring no old objects are left in storage), and then remove the rest of the versions from this list. Versions may not be removed from spec.versions while they exist in this list.

But how do we ensure no old objects are left in storage? While poking around, I haven't found any simple way to inspect which custom resources are in etcd, and in which version. It seems like no component wants to be responsible for that in the core kube ecosystem. It is like a black box.

  • The apiserver? It deals with incoming requests, but it doesn't actively keep track of (or statistics on) what's in etcd.

There is actually a metric (a gauge) showing how many objects of each kind the apiserver has in storage. It is called apiserver_storage_objects:
(Graph showing the Prometheus metric)
But it tells nothing about the version -- and even if it did, it would probably not be reliable: as far as I understand, it is generated from the requests the apiserver handles rather than from an active view of what's in etcd (see the example right after this list).

  • etcd itself? It is a binary store, it knows nothing about the business meaning of what comes in and out.

  • And let's not even talk about OLM, which is probably even further from knowing that.
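For what it's worth, you can look at that metric even without a Prometheus stack, by scraping the apiserver's /metrics endpoint directly; the resource label value here is an assumption on my side (plural dot group):

kubectl get --raw /metrics | grep 'apiserver_storage_objects{resource="flowcollectors.flows.netobserv.io"}'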

If you, reader, can shed some light on how you would do that, i.e. how you would ensure that no deprecated version of a custom resource is still lying around somewhere in a cluster, I would love to hear from you -- don't hesitate to let me know!

There's the etcdctl tool, which lets you interact with etcd directly, if you know exactly what you're looking for and how it is stored. But expecting our users to do this as part of an upgrade? Meh...
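For the curious, here is roughly what that poking around could look like. It assumes direct access to an etcd member and its client certificates, and it assumes the usual /registry/<group>/<plural>/... key layout for custom resources; treat it as exploration, not as an upgrade procedure:

# List the keys of our custom resources (certificate paths vary per distribution)
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  get --prefix --keys-only /registry/flows.netobserv.io/flowcollectors/

# Custom resources are stored as JSON, so dumping one of the keys found above
# and grepping for apiVersion shows which version it is stored in
ETCDCTL_API=3 etcdctl \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/path/to/ca.crt --cert=/path/to/client.crt --key=/path/to/client.key \
  get /registry/flows.netobserv.io/flowcollectors/cluster | grep '"apiVersion"'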

Kube Storage Version Migrator

Actually, it turns out the kube community has a go-to option for this whole issue. It's called the Kube Storage Version Migrator (SVM). I guess in some flavours of Kubernetes, it might be enabled by default and trigger for any custom resource. In OpenShift, the trigger for automatic migration is not enabled, so it is up to the operator developers (or the users) to generate the migration requests.

In our case, this is what the migration request looks like:

apiVersion: migration.k8s.io/v1alpha1
kind: StorageVersionMigration
metadata:
  name: migrate-flowcollector-v1alpha1
spec:
  resource:
    group: flows.netobserv.io
    resource: flowcollectors 
    version: v1alpha1

Under the hood, SVM just rewrites the custom resources without any modification, which makes the apiserver run a conversion (possibly via your webhooks, if you have some) and store them in the new storage version.

To make sure the resources have really been modified, we can check their resourceVersion before and after applying the StorageVersionMigration:

# Before
$ kubectl get flowcollector cluster -ojsonpath='{.metadata.resourceVersion}'
53114

# Apply
$ kubectl apply -f ./migrate-flowcollector-v1alpha1.yaml

# After
$ kubectl get flowcollector cluster -ojsonpath='{.metadata.resourceVersion}'
55111

# Did it succeed?
$ kubectl get storageversionmigration.migration.k8s.io/migrate-flowcollector-v1alpha1 -o yaml
# [...]
  conditions:
  - lastUpdateTime: "2024-07-04T07:53:12Z"
    status: "True"
    type: Succeeded
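Rather than polling the YAML by hand, a kubectl wait on that Succeeded condition should also work (assuming a reasonably recent kubectl):

kubectl wait --for=condition=Succeeded --timeout=2m \
  storageversionmigration.migration.k8s.io/migrate-flowcollector-v1alpha1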

Then, all you have to do is trust SVM and the apiserver to have effectively rewritten every resource stored in a deprecated version into the new storage version.

Unfortunately, we're not entirely done yet.

kubectl get crd flowcollectors.flows.netobserv.io -ojsonpath='{.status.storedVersions}'
["v1alpha1","v1beta1"]

Yes, the CRD status isn't updated. It seems this is not something SVM does for us, so OLM will still block the upgrade. We need to manually edit the CRD status and remove the deprecated version -- now that we're 99.9% sure it's not there (I don't like the other 0.1% much).
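In practice, that manual edit can be done by patching the CRD's status subresource, something along these lines (this assumes a kubectl recent enough to support --subresource):

# Keep only the version(s) actually left in storage
kubectl patch crd flowcollectors.flows.netobserv.io \
  --subresource=status --type=merge \
  -p '{"status":{"storedVersions":["v1beta1"]}}'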

Revisited lifecycle

To repeat the versioning timeline, here is what it seems we should have done:

  • v1alpha1 was the first version, introduced in our operator 1.0
  • in 1.2, we introduced a new v1beta1. Storage version is still v1alpha1
  • in 1.3, v1beta1 becomes the stored version.
    • ⚠️ The operator should check the CRD status and, if needed, create a StorageVersionMigration, and then update the CRD status to remove the old storage version ⚠️
  • in 1.5, v1beta2 is introduced, and we flag v1alpha1 as deprecated
  • in 1.6, v1beta2 is the new storage version; we run through the StorageVersionMigration steps again (so we're safe when v1beta1 is removed later), and we remove v1alpha1
  • Everything works like a charm, hopefully.

As an aside, in our case with NetObserv, this whole convoluted scenario probably results from a false alarm, the initial OLM error being a false positive: our FlowCollector resource manages workload installation and has a status that reports the deployments' readiness. On upgrade, new images are used and pods are redeployed, so the FlowCollector status changes. The resource therefore had to be rewritten in the new storage version, v1beta1, before the deprecated version was removed. The users who hit this issue could simply have removed v1alpha1 from the CRD status manually, and that's it.

While one could argue that OLM is too conservative here, blocking an upgrade that should pass since all the resources in storage are actually fine, in its defense, it probably has no simple way to know that. And ending up with resources in etcd that can no longer be read is certainly a scenario we really don't want to run into. This is something that operator developers have to deal with.

I hope this article will help others avoid the same mistake. This error is quite tricky to spot, as it can reveal itself long after the fact.
