Canary deployment tests a new version with a small subset of production traffic before full rollout. Named after the historical practice of using canary birds to detect toxic gases in coal mines, this pattern exposes a limited “canary” instance to production traffic so that issues surface before they affect all users.

Operational Pattern

Gradual Exposure - Rather than instantly switching all traffic (blue-green) or incrementally replacing all instances (rolling deployment), canary deployment maintains most traffic on the stable version while routing a small percentage to the new version.

Risk Reduction - By limiting initial exposure, canary deployments contain the blast radius of defects. A critical bug in the new version affects only 5% of requests instead of causing a complete outage or requiring emergency rollback.

Metric-Based Validation - Canary deployments should include automated validation: error rates, latency percentiles, business metrics. If canary metrics deviate significantly from baseline, automatically abort the rollout.

Progressive Rollout - If canary instances perform well, gradually increase the traffic percentage: 5% → 25% → 50% → 75% → 100%. At each stage, validate metrics before proceeding.

Implementation Approaches

Kubernetes doesn’t provide native canary deployment resources. Several approaches exist, trading implementation complexity against capabilities:

Replica-Based Canary

Use Deployment replica counts to approximate traffic percentages:

# Stable version - 9 replicas
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-stable
spec:
  replicas: 9
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: stable
    spec:
      containers:
      - name: app
        image: myapp:v1
---
# Canary version - 1 replica
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-canary
spec:
  replicas: 1
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
        version: canary
    spec:
      containers:
      - name: app
        image: myapp:v2

The Service selects both Deployments using the app: myapp label, so roughly 10% of traffic (1 Pod in 10) reaches the canary.
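
A minimal Service to that effect (the port number is illustrative):

apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp        # matches both stable and canary Pods
  ports:
  - port: 80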

Limitations: the split is coarse-grained (10% is the minimum with 10 total replicas), it doesn’t account for per-Pod load differences, and fine splits such as 1% would require impractically many replicas.

Ingress-Based Canary

Some ingress controllers support traffic splitting. With the NGINX Ingress Controller, for example, a second “canary” Ingress is created alongside the primary Ingress that serves the stable version:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-canary
  annotations:
    nginx.ingress.kubernetes.io/canary: "true"
    nginx.ingress.kubernetes.io/canary-weight: "10"
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-canary
            port:
              number: 80

This approach requires an ingress controller that supports traffic splitting (NGINX, Traefik, etc.) but enables precise percentage control.
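
For completeness, the canary annotations are weighted against a primary Ingress serving the same host and path and routing to the stable version. A sketch, assuming a myapp-stable Service that selects Pods labeled version: stable:

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp
spec:
  rules:
  - host: myapp.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: myapp-stable
            port:
              number: 80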

Service Mesh Canary

Service meshes (Istio, Linkerd) provide sophisticated traffic management. With Istio, for example, a VirtualService can split traffic by weight and route by request header:

apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
  - myapp
  http:
  - match:
    - headers:
        user-type:
          exact: internal
    route:
    - destination:
        host: myapp
        subset: canary
  - route:
    - destination:
        host: myapp
        subset: stable
      weight: 90
    - destination:
        host: myapp
        subset: canary
      weight: 10

This enables header-based routing (internal users get the canary), weighted percentage splits, and, combined with tools such as Flagger, metric-driven routing decisions.
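
The stable and canary subsets referenced by the VirtualService are defined in a DestinationRule that maps them to the version labels used on the Pods (a minimal sketch):

apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
  - name: stable
    labels:
      version: stable
  - name: canary
    labels:
      version: canary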

Validation and Metrics

Effective canary deployment requires automated validation:

Error Rates - Compare HTTP 5xx rates between canary and stable. A significant increase in canary errors should abort the rollout (an example query appears below).

Latency Percentiles - P50, P95, P99 latency. Canary shouldn’t show degraded latency profiles.

Business Metrics - Conversion rates, successful transactions, user engagement. Technical metrics may pass while business impact is negative.

Log Analysis - Automated scanning for error patterns, exceptions, or unexpected log volumes.

Integration Points - Monitor external service calls, database query performance, cache hit rates. Issues often manifest in integration layers.
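
As one way to automate the error-rate comparison, Flagger (introduced in the next section) accepts custom queries via a MetricTemplate. The sketch below assumes the application exposes an http_requests_total counter to Prometheus; the metric name, labels, and Prometheus address are assumptions about the environment:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: error-rate
spec:
  provider:
    type: prometheus
    address: http://prometheus.monitoring:9090
  query: |
    100 * sum(rate(http_requests_total{namespace="{{ namespace }}", pod=~"{{ target }}-.*", status=~"5.."}[{{ interval }}]))
    /
    sum(rate(http_requests_total{namespace="{{ namespace }}", pod=~"{{ target }}-.*"}[{{ interval }}]))

During analysis, the query is evaluated against the canary workload and the rollout is rolled back if the result falls outside a configured threshold range.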

Progressive Rollout Workflow

A typical canary rollout proceeds through stages:

Stage 1: Initial Canary (5%) - Deploy canary instances, route minimal traffic. Monitor for 10-30 minutes. Any issues trigger immediate abort.

Stage 2: Expanded Canary (25%) - If metrics are healthy, increase traffic. Continue monitoring. This exposes more edge cases while limiting risk.

Stage 3: Balanced Canary (50%) - Equal traffic split. Validates canary handles full production load patterns.

Stage 4: Dominant Canary (75%) - Canary is now the primary version. Validates no issues with sustained majority traffic.

Stage 5: Complete Rollout (100%) - Remove stable version, canary becomes the new stable baseline.

Each stage includes soak time for metrics collection and validation gates for automated progression or rollback.
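
As a concrete illustration, this staged progression can be expressed declaratively. The sketch below uses the Argo Rollouts Rollout resource (covered in the next section); the weights and 10-minute soak times are illustrative:

apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 10
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: app
        image: myapp:v2
  strategy:
    canary:
      steps:
      - setWeight: 5
      - pause: {duration: 10m}   # soak time before the next stage
      - setWeight: 25
      - pause: {duration: 10m}
      - setWeight: 50
      - pause: {duration: 10m}
      - setWeight: 75
      - pause: {duration: 10m}
      # after the final step the Rollout shifts to 100% and completes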

Automated Rollout Tools

Manual canary management is tedious and error-prone. Production implementations typically use automation:

Flagger - Kubernetes operator automating progressive delivery. Integrates with service meshes, ingress controllers, and metric providers (Prometheus, Datadog). Automatically progresses through canary stages or rolls back based on metrics (an example appears at the end of this section).

Argo Rollouts - Extends Deployments with a Rollout CRD supporting canary strategies. Integrates with metric providers for automated analysis and decision-making.

Spinnaker - Deployment orchestration platform with canary support across multiple cloud providers and Kubernetes.

Knative - Serverless platform for Kubernetes with built-in traffic splitting and progressive rollout capabilities.

These tools handle the complexity of traffic shifting, metric collection, analysis, and automated rollback, making canary deployment practical for production use.
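
To give a flavor of the configuration involved, the sketch below is a Flagger Canary resource targeting the myapp Deployment. The intervals, thresholds, and step weights are illustrative, and Flagger additionally needs a mesh or ingress provider configured:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: myapp
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: myapp
  service:
    port: 80
  analysis:
    interval: 1m          # how often metrics are evaluated
    threshold: 5          # failed checks before automatic rollback
    stepWeight: 10        # traffic increase per step
    maxWeight: 50         # promote after reaching 50% with healthy metrics
    metrics:
    - name: request-success-rate
      thresholdRange:
        min: 99           # abort if success rate drops below 99%
      interval: 1m
    - name: request-duration
      thresholdRange:
        max: 500          # abort if latency exceeds 500 ms
      interval: 1m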

Trade-offs and Considerations

Implementation Complexity - Canary deployment is significantly more complex than rolling deployment. Requires traffic management infrastructure (ingress controller or service mesh) and metric collection/analysis pipelines.

Observability Requirements - Effective canary requires comprehensive monitoring. Must distinguish canary vs. stable traffic in metrics, track version-specific error rates, and make automated decisions based on this data.

Stateful Applications - Canary works best with stateless applications. Stateful apps require careful handling of data written by canary instances - what happens if you roll back after canary has written to the database?

Cost Overhead - Running parallel versions consumes additional resources, though less than blue-green (which maintains 100% redundancy). Canary typically adds 5-25% overhead during rollout.

User Experience Inconsistency - During canary deployment, different users experience different versions. This can cause confusion (“It works for me but not my colleague”). Session stickiness can help but complicates traffic management.

Relationship to Primitives

Canary deployment builds on several Kubernetes concepts:

Deployments - Multiple Deployments manage stable and canary versions. Each can use rolling updates internally.

Services - Services provide the load balancing foundation, though advanced canary requires capabilities beyond basic Service resources.

Labels - Label selectors differentiate stable vs. canary Pods, enabling traffic management and metric segmentation.

Namespaces - Some organizations use separate namespaces for canary deployments to isolate resources and quotas.

The pattern demonstrates how sophisticated deployment strategies require composition of Kubernetes primitives with external tooling (ingress controllers, service meshes, metric systems).

Comparison to Other Strategies

vs Rolling Deployment - Rolling updates all instances gradually. Canary keeps most instances stable while testing new version with limited traffic. Rolling is simpler; canary provides better risk control.

vs Blue-Green - Blue-green does instant 100% cutover. Canary gradually shifts traffic with validation at each stage. Blue-green is simpler but higher risk; canary is progressive but more complex.

vs Recreate - Recreate has downtime and no gradual validation. Canary has zero downtime and progressive risk reduction. Recreate is simplest; canary is most sophisticated.