Sidekiq provides several mechanisms for reliable job processing through its Pro and Enterprise tiers. These features address the fundamental challenge in distributed systems: ensuring jobs execute despite network failures, process crashes, and concurrency conflicts.
Unique Jobs (Pro/Enterprise)
Unique Jobs prevent duplicate job enqueueing using Redis locks. A job with unique_for: 10.minutes
won’t be enqueued again if a job with identical arguments is already pending or executing.
class ImportJob
  include Sidekiq::Job

  sidekiq_options unique_for: 10.minutes

  def perform(user_id)
    # Only one import per user every 10 minutes
  end
end
The uniqueness lock uses a Redis key combining the job class and arguments:
unique:ImportJob:user_id:123 -> expires in 600 seconds
Three uniqueness policies control when locks are created and released:
- until_executing: Lock from enqueue to job start
- until_executed: Lock from enqueue to job completion (default)
- until_timeout: Lock for a fixed duration regardless of execution
The until_executing policy enables “at-most-once” semantics—once a job starts, duplicates can be enqueued. This suits idempotent jobs where re-execution is safe but double-execution wastes resources.
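Conceptually, the lock is a set-if-absent with a TTL. The in-memory sketch below (a hypothetical `UniqueLock` class, not Sidekiq's actual implementation, which uses Redis) illustrates the `until_executed` policy: acquire on enqueue, release on completion, expire after the TTL.

```ruby
# In-memory sketch of a uniqueness lock with TTL semantics.
# The real implementation stores the lock in Redis with an expiry.
class UniqueLock
  def initialize
    @locks = {} # key -> expiry time
  end

  # Returns true if the lock was acquired (job may be enqueued),
  # false if an unexpired lock already exists (duplicate suppressed).
  def acquire(job_class, args, ttl, now: Time.now)
    key = "unique:#{job_class}:#{args.join(':')}"
    expiry = @locks[key]
    return false if expiry && expiry > now # still locked
    @locks[key] = now + ttl
    true
  end

  # until_executed: release the lock when the job finishes.
  def release(job_class, args)
    @locks.delete("unique:#{job_class}:#{args.join(':')}")
  end
end

lock = UniqueLock.new
lock.acquire("ImportJob", [123], 600) # first enqueue wins
lock.acquire("ImportJob", [123], 600) # duplicate suppressed
```

Different arguments produce a different key, so `ImportJob` for another user is unaffected.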
Batches (Pro)
Batches group related jobs and trigger callbacks when all jobs complete. This solves the fan-out/fan-in problem where you need to wait for N async jobs before continuing.
batch = Sidekiq::Batch.new
batch.on(:success, NotificationJob)
batch.on(:complete, CleanupJob) # Fires even if jobs failed

batch.jobs do
  1000.times do |i|
    ProcessRecordJob.perform_async(i)
  end
end
Batch state is tracked in Redis using counters and sets:
batch:BID:pending -> 1000 # Remaining jobs
batch:BID:total -> 1000 # Initial count
batch:BID:failures -> Set of failed JIDs
batch:BID:complete -> timestamp
When the pending counter reaches zero, Sidekiq enqueues the success callback. The complete callback always fires, regardless of failures. This enables cleanup logic that must run even when things fail.
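The bookkeeping can be sketched in memory (a hypothetical `BatchState` class; real Sidekiq Pro keeps these counters in Redis and enqueues the callbacks as jobs):

```ruby
# Sketch of batch accounting: decrement pending per finished job,
# fire :complete when pending hits zero, :success only if nothing failed.
class BatchState
  attr_reader :callbacks_fired

  def initialize(total)
    @pending = total
    @failures = [] # failed JIDs
    @callbacks_fired = []
  end

  # Called once per finished job.
  def job_finished(jid, success:)
    @failures << jid unless success
    @pending -= 1
    return unless @pending.zero?
    @callbacks_fired << :complete # always fires
    @callbacks_fired << :success if @failures.empty?
  end
end
```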
Nested batches enable complex workflows:
outer_batch = Sidekiq::Batch.new
outer_batch.on(:success, SendEmailJob)

outer_batch.jobs do
  5.times do |i|
    inner_batch = Sidekiq::Batch.new
    inner_batch.jobs do
      10.times { ProcessChunkJob.perform_async(i) }
    end
  end
end
The outer batch waits for all inner batches to complete, creating a tree of dependencies.
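One way to model that tree is to count each child batch as a pending unit of its parent, bubbling completion upward. This is a simplified sketch (hypothetical `NestedBatch` class), not Sidekiq Pro's actual data model:

```ruby
# Nested-batch accounting sketch: a parent treats each child batch as
# one pending unit and completes only when all children have completed.
class NestedBatch
  attr_reader :complete

  def initialize(parent = nil)
    @parent = parent
    @pending = 0
    @complete = false
    parent.add_child if parent
  end

  def add_child
    @pending += 1
  end

  def add_jobs(n)
    @pending += n
  end

  def job_finished
    @pending -= 1
    finish if @pending.zero?
  end

  private

  def finish
    @complete = true
    @parent.job_finished if @parent # bubble completion up the tree
  end
end
```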
Rate Limiting (Enterprise)
Rate limiters control job throughput using various algorithms. This prevents overwhelming downstream services or staying within API rate limits.
Concurrent limiter: Restricts how many jobs can execute simultaneously:
limiter = Sidekiq::Limiter.concurrent("api-service", 5)

limiter.within_limit do
  # Max 5 jobs call this service at once
  external_api.call
end
Implementation uses a Redis set to track active job IDs:
limiter:concurrent:api-service -> Set of active JIDs (max 5)
When 5 jobs are already holding slots, further attempts raise Sidekiq::Limiter::OverLimit, and the job is automatically rescheduled after a backoff.
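A minimal in-memory sketch of the idea (hypothetical `ConcurrentLimiter` and `OverLimit` classes; the real limiter tracks active JIDs in a Redis set and handles the reschedule for you):

```ruby
# Concurrent limiter sketch: a bounded set of active job IDs.
class OverLimit < StandardError; end

class ConcurrentLimiter
  def initialize(name, size)
    @name = name
    @size = size
    @active = {} # jid -> true, stands in for the Redis set
    @mutex = Mutex.new
  end

  def within_limit(jid)
    @mutex.synchronize do
      raise OverLimit, "#{@name}: #{@size} jobs active" if @active.size >= @size
      @active[jid] = true # claim a slot
    end
    yield
  ensure
    @mutex.synchronize { @active.delete(jid) } # release the slot
  end
end
```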
Window limiter: Allows N operations per time window:
limiter = Sidekiq::Limiter.window("shopify-api", 100, :minute)

limiter.within_limit do
  # Max 100 calls per minute
  shopify_api.create_order
end
Uses a sorted set where scores are timestamps:
limiter:window:shopify-api -> Sorted set of timestamps
Expired timestamps are pruned, and new attempts check if the set size < limit.
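The prune-then-check logic can be sketched with a plain array of timestamps standing in for the sorted set (hypothetical `WindowLimiter` class):

```ruby
# Sliding-window limiter sketch. In Redis, pruning is a
# ZREMRANGEBYSCORE and the size check is a ZCARD.
class WindowLimiter
  def initialize(limit, window_seconds)
    @limit = limit
    @window = window_seconds
    @timestamps = []
  end

  def allow?(now = Time.now.to_f)
    # Drop timestamps that have fallen out of the window.
    @timestamps.reject! { |t| t <= now - @window }
    return false if @timestamps.size >= @limit
    @timestamps << now
    true
  end
end
```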
Leaky bucket: Smooths burst traffic by allowing X requests upfront, then throttling to X per period:
limiter = Sidekiq::Limiter.leaky("external-api", 60, :minute)
This allows bursts up to 60 requests, then throttles to 60/minute sustained. Perfect for APIs that allow bursts but have sustained rate limits.
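The classic leaky-bucket algorithm behind this can be sketched as a level that drains at a steady rate (hypothetical `LeakyBucket` class; Sidekiq Enterprise's internals may differ):

```ruby
# Leaky-bucket sketch: the bucket drains at capacity/period per second;
# a request fits only while the level is below capacity. This permits
# an initial burst up to `capacity`, then a steady sustained rate.
class LeakyBucket
  def initialize(capacity, period_seconds)
    @capacity = capacity
    @leak_rate = capacity.to_f / period_seconds # drained per second
    @level = 0.0
    @last = 0.0
  end

  def allow?(now)
    # Drain for the time elapsed since the last check.
    @level = [@level - (now - @last) * @leak_rate, 0.0].max
    @last = now
    return false if @level >= @capacity
    @level += 1.0
    true
  end
end
```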
Reliable Fetch (Pro)
Reliable fetch (super_fetch) provides “at-least-once” delivery by moving in-progress jobs into a private working set in Redis. If a process crashes, another process can recover and re-enqueue the lost jobs.
When super_fetch pulls a job, it:
1. Atomically moves the job from the queue to a working set
2. Executes the job
3. Removes the job from the working set on success
If step 2 fails (process crash), the job remains in the working set. A recovery process periodically scans for stale jobs (>5 minutes in working set) and re-enqueues them.
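The fetch/ack/recover cycle can be sketched in memory (a hypothetical `ReliableQueue` class; real super_fetch moves the job atomically within Redis):

```ruby
# Sketch of the reliable-fetch dance: an array stands in for the queue,
# a hash of fetch times stands in for the working set.
class ReliableQueue
  STALE_AFTER = 300 # seconds a job may sit in the working set

  def initialize
    @queue = []
    @working = {} # job -> time it was fetched
  end

  def push(job)
    @queue.push(job)
  end

  # Step 1: atomically move the job from queue to working set.
  def fetch(now)
    job = @queue.shift
    @working[job] = now if job
    job
  end

  # Step 3: acknowledge success by removing from the working set.
  def ack(job)
    @working.delete(job)
  end

  # Recovery: re-enqueue jobs stuck in the working set too long.
  def recover_stale(now)
    stale = @working.select { |_, t| now - t > STALE_AFTER }.keys
    stale.each { |job| @working.delete(job); @queue.push(job) }
    stale
  end
end
```

A crashed worker never calls `ack`, so its job eventually crosses the staleness threshold and gets recovered.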
This contrasts with standard Sidekiq, where a job in flight during a hard crash is simply lost. The trade-off is additional Redis operations per job—acceptable for critical workflows but overkill for cheap, easily re-run jobs.
Atomic Scheduling
The Atomic Scheduler (Pro 7.2+) eliminates race conditions in the scheduled job poller. Multiple Sidekiq processes poll the scheduled set, but only one should enqueue each job.
The scheduler uses a Lua script for atomic check-and-set:
local job = redis.call("zrange", "schedule", "-inf", now, "byscore", "limit", 0, 1)
if job[1] then
  local removed = redis.call("zrem", "schedule", job[1])
  if removed == 1 then
    redis.call("lpush", queue, job[1])
    return job[1]
  end
end
The zrem returns 0 if another process already removed the job, preventing duplicate enqueuing. This atomic operation replaces the previous check-then-remove pattern that had a race window.
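The guard works because a delete reports whether it actually removed something, so only one of N concurrent pollers wins. A small Ruby simulation of that property (an in-memory set with a mutex stands in for Redis):

```ruby
# Four "pollers" race to claim the same scheduled job; the delete
# returns truthy only for the single thread that removed it, so
# exactly one enqueue happens -- the same guarantee zrem gives.
require "set"

schedule = Set.new(["job-a"])
mutex = Mutex.new
enqueued = []

pollers = 4.times.map do
  Thread.new do
    mutex.synchronize do
      # Set#delete? returns nil unless this call removed the member.
      enqueued << "job-a" if schedule.delete?("job-a")
    end
  end
end
pollers.each(&:join)
```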
Transaction-Aware Client
The transaction-aware client ensures jobs aren’t enqueued until the current database transaction commits:
User.transaction do
  user = User.create!(email: "test@example.com")
  WelcomeEmailJob.perform_async(user.id)
  # Email job only enqueues if transaction commits
end
Without this, the job could run before the transaction commits, failing to find the user record. The client hooks into ActiveRecord’s transaction callbacks to defer Redis operations until after commit.
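The deferral itself is simple to sketch: buffer enqueue calls during the transaction and flush them only on commit. This toy `DeferredClient` is hypothetical; the real client wires into ActiveRecord's callbacks rather than exposing commit/rollback methods:

```ruby
# Sketch of a transaction-aware enqueue buffer.
class DeferredClient
  def initialize(queue)
    @queue = queue # stands in for Redis
    @buffer = []
  end

  def perform_async(job, *args)
    @buffer << [job, args] # held back, invisible to workers
  end

  def commit
    @buffer.each { |job, args| @queue << [job, args] }
    @buffer.clear
  end

  def rollback
    @buffer.clear # a rolled-back transaction enqueues nothing
  end
end
```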
See Sidekiq Architecture for how these reliability features integrate with the core job processing flow.