CNI plugin migration for a running Kubernetes cluster, especially in production environment, is never a trivial task. One simple strategy is Blue/Green deployment, which provisions a new cluster with target CNI and moving workloads from current cluster to the new one. This approach ensures the least downtime(still need extra setup, e.g., loadbalancer, to ensure availability) and ignores discrepancy between CNIs. However, what if in-place migration is necessary? This is not common but challenging senario, and mostly needed when operating a cluster with stateful or unmanaged workloads. For instance, a platform provides computing resource by Kubernetes but operators have no control to services deployed on it. If operators want to migrate CNI without letting users aware/involve, in-place migration will still be required.
In-place can be categorized into 2 different types, replacement and live-migration. Replacement means removing previous CNI and install a new one, which means network connection disruption between removal/installation and unpredictable results; by contrary, live-migration keeps network connectivity during migration to make lease impacts to workloads. Most modern CNI plugins are without or with limited support of live-migration, e.g., OVN-Kubernetes, Kube-Router and Flannel. Though some CNIs support live-migration, they work under specific conditions, e.g., Calico supports migrating from Flannel and Canal but not from other CNI plugins. The lack of support of live-migration comes from the design of CNI plugin. The CNI specification defines how CNI plugins chosen and operations should be supported by container runtime, but doesn’t provide interfaces to support connectivity for Pods running on different CNI plugins. Some users might consider to use Multus. This CNI plugin can run as an extra layer between different CNIs, which makes supporting live-migration become possible. However, in my experience, mixing network interfaces of difference CNIs and related IP table rules don’t always work correctly (depends on target CNI plugins), which makes Multus not a reliable solution.
Currently, Cilium looks the only one officially supports live-migration from (possibly) any CNI plugins. It supports migration mode which allows Pods/Services running under different CNI can still communicate with each other.

Under the workflow, users can migrate one node a time to Cilium and won’t lose connectivity with workloads on it until the whole cluster migrated. Unlike Multus, no additional network interfaces will be created. It means Pods don’t need to handle network interface selection by themselves, which avoid possible additional route table/IP table/DNS setup in them. More details can be found in this article.
My Scenario: Leaving Calico Link to heading
My use case is migrating a cluster with Calico (precisely, Flannel + Canal) to other CNI plugins. The reason was Calico’s binary didn’t meet some security compliance, so I needed to find other solutions satisfy that and support live-migration.
Setup Link to heading
In my tests, I used:
- Kind with Canal + Flannel to simulate source cluster
- Goldpinger to monitor connectivity between Pods.
Calico -> Multus -> Any Other CNIs Link to heading
This was the first solution I tested, which I didn’t successfully make Calico work with any other CNIs via Multus successfully. For instance, Flannel, loopback or macvlan. I also tested other combinations without Calico like Flannel + Flannel and macvlan + macvlan, but still unable to make Pods communicate with each other via extra IP from secondary CNI. Under the situation, it could cause more problems for DNS resolution, e.g., connection between Pods to Services via DNS. There could be some mis-configuration, but I decided no to spend more time and effort on it as the result could be hacky and fragile.
Calico -> Flannel Link to heading
This the simplest and workable solution. All I needed to do was removing Canal containers from all Canal Pods, which made Flannel as the only containers. However, as Fannel doesn’t support Network Policies and advanced observability, I still moved forward to other CNIs.
Calico -> Cilium Link to heading
To migrate from Calico to Cilium, I simply followed the official document and things worked like a charm. No network connectivity affected during migration, and I didn’t need to worry about multiple network interface and DNS resolution issues. The difficult part was Network Policies. Originally, clusters I worked on used Calico GlobalNetworkPolicy. Thought Cilium supports CiliumClusterwideNetworkPolicy, as lack of documentation, it took me some time to trace source code about supported selectors and syntax rewrite them. AI tools helped, but as it seemed not too much developers did that, non-existed syntax and selectors were included in results which requires manual fixes. Also, few things need be aware of between CiliumClusterwideNetworkPolicy and GlobalNetworkPolicy:
- Cilium doesn’t support logged drop. To monitor dropped traffic, Cilium relies on Hubble to capture them
- It still necessary to write validations to ensure traffics are controlled expected. For instance, adding script to create Pods with expected selectors to ensure packets will be dropped/sent as expect in test pipelines/e2e tests.
- By default, node selectors (
fromNodeandtoNode) in CiliumClusterwideNetworkPolicy are not enabled. Users need to enable it via setting theenable-node-selector-labelsincilium-configConfigMap ornodeSelectorLabelsin Helm config totrue. Besides, in addition to restart theciliumDeployment, users also need to restart thecilium-operatorDaemonSet to make this change take effect. Otherwise, they’ll keep seeing CiliumClusterwideNetworkPolicies are marked as invalid inkubectl get ciliumclusterwidenetworkpolicyandfromNode and toNode selectors not enabledrelated errors fromkubectl describe ciliumclusterwidenetworkpolicy. The cause is CiliumClusterwideNetworkPolicy is the CRD managed bycilium-operator.
Further Readings Link to heading
If you’re interested in related topics: