The Batch Job pattern leverages the Kubernetes Job resource to execute finite, isolated units of work in a distributed environment. Unlike Deployments and ReplicaSets designed for long-running processes, Jobs manage workloads that run to completion and terminate.
The Job Resource
Jobs create one or more Pods and ensure they successfully complete their work. Once the expected number of Pods terminate successfully, the Job completes and no additional Pods start.
Persistence Creates Cleanup Debt
Jobs deliberately don’t self-destruct after completion, creating a management burden where success generates operational overhead—the very feature that enables debugging also causes cluster pollution.
Persistence and Tracking - Unlike bare Pods, Jobs are persisted tasks that survive cluster restarts. After completion, both the Job resource and its Pods remain for examination—you can inspect container logs to understand execution results. This persistence enables audit trails and debugging workflows.
Enhanced Resiliency - Jobs build fault tolerance into Pod execution. If a node fails or a Pod is evicted, the Job controller creates a replacement Pod and the scheduler places it on a healthy node. A bare Pod would simply remain failed, requiring manual intervention.
Cleanup Management - Configure `.spec.ttlSecondsAfterFinished` to automatically delete a finished Job (and its Pods) after a specified duration, preventing cluster clutter while keeping logs available during the retention window.
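As an illustrative sketch (the name and image are hypothetical), a Job that removes itself five minutes after finishing:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: nightly-report           # hypothetical name
spec:
  ttlSecondsAfterFinished: 300   # delete the Job and its Pods 5 minutes after it finishes
  template:
    spec:
      containers:
      - name: report
        image: report-generator:v1   # hypothetical image
      restartPolicy: OnFailure
```

During those 300 seconds the Job and its Pods remain available for `kubectl logs` inspection; afterward the TTL controller removes them.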
```mermaid
graph TB
    Job[Job Resource] -->|Creates| Pod1[Pod 1]
    Job -->|Creates| Pod2[Pod 2]
    Job -->|Creates| PodN[Pod N]
    Pod1 -->|Success| Complete{All Completions<br/>Reached?}
    Pod2 -->|Success| Complete
    PodN -->|Failure| Retry[Reschedule Pod]
    Retry --> PodN
    Complete -->|Yes| Done[Job Complete]
    Complete -->|No| Wait[Create More Pods]
    Wait --> Job
    style Job fill:#e1f5ff
    style Done fill:#d4edda
    style Retry fill:#fff3cd
```
Configuration Requirements
Restart Policy - Job definitions must specify a restart policy of either `OnFailure` or `Never`. The default `Always` policy used for long-running processes is invalid for Jobs: the workload must run to completion rather than restart indefinitely.
Completion Count - `.spec.completions` specifies how many Pods must successfully complete. This allows expressing workloads like "process these 100 files" by setting completions to 100.
Parallelism - `.spec.parallelism` controls how many Pod replicas run simultaneously. Setting this higher doesn't guarantee that level of concurrency: actual parallelism depends on resource availability and scheduler capacity.
Job Types and Patterns
Single Pod Jobs
The simplest pattern omits both `.spec.completions` and `.spec.parallelism` (each defaults to 1). The Job completes when the single Pod terminates successfully. This suits one-off tasks like database migrations or ad-hoc data processing.
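A minimal sketch of such a single Pod Job, using a hypothetical migration image:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: db-migration       # hypothetical name
spec:
  # completions and parallelism omitted: both default to 1
  template:
    spec:
      containers:
      - name: migrate
        image: schema-migrator:v1   # hypothetical image that applies migrations and exits
      restartPolicy: Never
```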
Fixed Completion Count Jobs
Setting `.spec.completions` to a specific number (e.g., 5) creates a Job that runs exactly that many Pods to completion. This is optimal when the total work items are known in advance—for example, processing a fixed set of data shards or running a batch of simulation tasks.
```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: data-processing
spec:
  completions: 100   # Must complete 100 tasks
  parallelism: 10    # Run 10 at a time
  template:
    spec:
      containers:
      - name: processor
        image: data-processor:v1
        resources:
          requests:
            cpu: "500m"
            memory: "512Mi"
      restartPolicy: OnFailure
```
Work Queue Jobs
First Success Stops New Pods
In work queue Jobs, one Pod exiting successfully signals that the work is done: Kubernetes creates no further Pods, and the Job completes once the remaining workers also terminate. Each worker must therefore detect the empty queue and exit on its own—a counterintuitive design where individual completion signals collective shutdown.
Leave `.spec.completions` unset and set `.spec.parallelism` higher than one to create worker Pods that coordinate through an external work queue. Pods must determine when the queue is exhausted and exit successfully to signal Job completion.
This pattern distributes work without knowing the total item count upfront.
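A hypothetical work queue Job might look like this; the worker image is assumed to pull items from a shared queue and exit successfully once the queue is empty:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: queue-workers        # hypothetical name
spec:
  # completions left unset: the first successful exit marks the work as done
  parallelism: 5             # five workers draining the same external queue
  template:
    spec:
      containers:
      - name: worker
        image: queue-consumer:v1   # hypothetical image; exits 0 when the queue is drained
      restartPolicy: OnFailure
```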
Indexed Jobs
Set `.spec.completionMode` to `Indexed` with a fixed completion count to distribute work without an external queue. Each Pod receives a unique index (0 through `completions - 1`) via the `JOB_COMPLETION_INDEX` environment variable.
Applications use this index to select their portion of work—Pod 0 processes files 0-99, Pod 1 processes files 100-199, and so on. This eliminates coordination overhead while maintaining clear work partitioning.
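A sketch of an Indexed Job, assuming a hypothetical worker image that reads `JOB_COMPLETION_INDEX` to select its shard:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: indexed-processing   # hypothetical name
spec:
  completionMode: Indexed
  completions: 3             # indexes 0, 1, 2 are each completed exactly once
  parallelism: 3
  template:
    spec:
      containers:
      - name: worker
        image: shard-processor:v1   # hypothetical image; maps its index to a file range
      restartPolicy: Never
```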
```mermaid
graph LR
    subgraph "Indexed Job Pattern"
        Job[Job<br/>completions: 3<br/>mode: Indexed]
        Job -->|index=0| Pod0[Pod 0<br/>Process<br/>items 0-33]
        Job -->|index=1| Pod1[Pod 1<br/>Process<br/>items 34-66]
        Job -->|index=2| Pod2[Pod 2<br/>Process<br/>items 67-99]
    end
    Pod0 --> Results[(Results)]
    Pod1 --> Results
    Pod2 --> Results
    style Job fill:#e1f5ff
    style Results fill:#d4edda
```
Workload Management Strategy
Jobs enable resource-efficient workload management by running applications only when needed, for the duration required. This contrasts with long-running abstractions like Deployments that consume resources continuously.
One Job Per Work Item - Creates individual Jobs for each task unit, providing high tracking granularity and independent lifecycle management. This approach suits complex tasks requiring detailed monitoring and failure isolation but incurs per-Job overhead.
One Job For All Work - Encapsulates all work items within a single Job, reducing Kubernetes resource overhead. Work distribution happens internally—often via batch frameworks like Spring Batch—making the Job a container for the batch process rather than an individual task manager.
The choice depends on observability needs, failure isolation requirements, and operational complexity tolerance. Individual Jobs provide better visibility but complicate cluster management at scale.
Integration with Batch Frameworks
For sophisticated batch processing, combine the Job primitive with external batch frameworks. The Job handles lifecycle management, scheduling, and resiliency while the batch framework manages work distribution, transaction boundaries, and processing logic.
Spring Batch - Provides chunk-oriented processing, transaction management, and restart capabilities. A Spring Batch application runs inside a Job Pod, leveraging Job resiliency for the overall process while Spring Batch handles detailed processing concerns.
Apache Beam - Offers portable batch and stream processing. Beam pipelines execute within Job Pods, using Kubernetes for resource management while Beam handles data processing semantics.
This separation of concerns allows the Job to focus on platform integration—resource allocation, placement, container lifecycle—while frameworks handle domain-specific batch logic.
Relationship to Other Patterns
Periodic Job (CronJob) - Builds on the Batch Job pattern by scheduling Jobs on a time-based schedule. CronJobs are to Jobs what cron is to shell scripts—a temporal orchestration layer.
Deployments - Handle long-running services with continuous availability requirements. Jobs handle finite workloads with completion semantics. Choose Deployments for APIs and web services, Jobs for batch processing and data pipelines.
Managed Lifecycle - Jobs implement lifecycle management but prioritize completion over continuous operation. Job Pods must handle SIGTERM for graceful shutdown but focus on completing work rather than serving ongoing requests.
Best Practices
Set Appropriate Timeouts - Configure `.spec.activeDeadlineSeconds` to prevent runaway Jobs from consuming resources indefinitely. This deadline applies to total Job execution time, not individual Pod lifetime.
Handle Idempotency - Jobs may retry failed Pods, so ensure work is idempotent. Processing the same data multiple times due to retries should not corrupt results.
Choose Restart Policy Carefully - `OnFailure` restarts failed containers in place on the same node; `Never` leaves the Pod failed and lets the Job controller create a new Pod. Use `OnFailure` for transient failures (network issues), `Never` for persistent failures requiring fresh Pod placement.
Resource Profiling - Define accurate resource requests and limits. Batch workloads often need different profiles than services—higher CPU during processing bursts, moderate memory for data buffering.
Monitor Completion Status - Jobs don’t disappear after completion. Monitor and clean up completed Jobs to prevent cluster resource exhaustion from accumulated Job objects.
Backoff Limits - Set `.spec.backoffLimit` to control retry behavior. The default (6 retries) suits most workloads, but long-running batch processes may need higher limits to tolerate transient infrastructure issues.
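A hypothetical spec fragment combining both safeguards, with values chosen purely for illustration:

```yaml
spec:
  activeDeadlineSeconds: 3600   # fail the entire Job if it runs longer than one hour
  backoffLimit: 10              # tolerate up to 10 Pod failures before marking the Job failed
```

Note that `activeDeadlineSeconds` takes precedence: once the deadline passes, the Job fails even if retries remain within the backoff limit.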
Resource Efficiency
Jobs Redefine Kubernetes' Value Proposition
By enabling finite workloads that consume resources only during execution, Jobs transform Kubernetes from a service orchestrator into a general-purpose computing substrate—the paradigm shift from “always-on” to “on-demand” unlocks new economic models.
This makes Kubernetes suitable for diverse workloads beyond long-running services:
Cost Management - Run batch workloads on spot instances or during off-peak hours, terminating resources when work completes.
Multi-Tenancy - Mix batch Jobs with service Deployments in the same cluster, using QoS classes to prioritize service Pods during resource contention.
Capacity Planning - Jobs with defined resource profiles enable accurate capacity planning. Knowing a batch Job needs 50 Pods with 2 CPU each allows provisioning appropriate cluster capacity.
The Job abstraction transforms Kubernetes from a service orchestration platform into a general-purpose distributed computing substrate capable of managing both continuous services and finite batch workloads with unified operational primitives.