Guide

Running RustQueue in production

Durability tradeoffs, managed workers, retries, crash recovery, and graceful shutdown — without a message broker.

Foundation

Crash-only design

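With the default redb backend, every job is persisted to storage before it is acknowledged, so no in-memory state is lost on a crash. You can kill -9 the process at any point and no committed work will be silently dropped. The faster backends listed under Durability modes relax this guarantee in exchange for throughput.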

Delivery is at-least-once: a job being processed when the process died will be retried automatically. Handlers should be idempotent, or use the unique_key field to deduplicate on re-delivery.
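
As a rough illustration of what an idempotent handler can look like, the sketch below skips a side effect when its deduplication key has already been seen. SentLog and run_once are hypothetical names invented for this example and are not part of RustQueue. The in-memory set only demonstrates the idea; a real handler would keep the keys in durable storage so they survive the crash that causes the re-delivery.

```rust
use std::collections::HashSet;

/// Hypothetical record of side effects already performed, keyed by a
/// caller-chosen deduplication key (for example, the job's unique_key).
#[derive(Default)]
struct SentLog {
    seen: HashSet<String>,
}

impl SentLog {
    /// Runs `effect` only the first time `key` is seen.
    /// Returns true if the effect ran, false if it was skipped as a duplicate.
    fn run_once<F: FnOnce()>(&mut self, key: &str, effect: F) -> bool {
        if self.seen.contains(key) {
            return false; // re-delivery: the work was already done
        }
        effect();
        self.seen.insert(key.to_string());
        true
    }
}
```

Calling run_once twice with the same key performs the effect once and returns false the second time, which is the behavior a handler needs under at-least-once delivery.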

Storage

Durability modes

Choose the backend that matches your durability/throughput tradeoff.

| Backend | Durability | Throughput | Crash loss window | When to use |
|---|---|---|---|---|
| redb (default) | ACID, fsync per write | ~314 push/s raw | None (zero data loss) | Default production choice. Correct and zero-config. |
| Buffered redb (.with_write_coalescing) | Batched fsync | ~19 K ops/s | One flush interval | High-write workloads where small loss windows are acceptable. |
| Hybrid (RustQueue::hybrid) | In-memory hot path + periodic snapshot | 300 K+ ops/s | Up to snapshot_interval | Maximum throughput; explicitly trade durability for speed. |
| In-memory (RustQueue::memory) | None (lost on drop) | Fastest | Total | Tests and ephemeral workloads only. |

// Safest: ACID redb
let rq = RustQueue::redb("./jobs.db")?.build()?;

// Faster writes, tiny loss window:
use rustqueue::storage::BufferedRedbConfig;
let rq = RustQueue::redb("./jobs.db")?
    .with_write_coalescing(BufferedRedbConfig::default())
    .build()?;

// Maximum throughput, explicit durability tradeoff:
let rq = RustQueue::hybrid("./jobs.db")?.build()?;
Workers

Workers

run_worker is the managed entry point for processing jobs. It replaces a hand-rolled pull / ack / fail loop and handles operational details automatically:

  • Auto-acks a job when the handler returns Ok(()).
  • Auto-fails a job (triggering retry/DLQ logic) when the handler returns Err(e).
  • Auto-heartbeats the in-flight job so stall detection does not reclaim a healthy long-running job.
  • Starts housekeeping automatically on first call — no separate setup required.
use rustqueue::RustQueue;

#[tokio::main]
async fn main() -> anyhow::Result<()> {
    let rq = RustQueue::redb("./jobs.db")?.build()?;

    rq.run_worker("emails", |job| async move {
        println!("sending email: {}", job.data["to"]);
        // return Ok to ack, Err to fail (→ retry or DLQ)
        Ok::<(), String>(())
    })
    .await?;

    Ok(())
}

run_worker runs until Ctrl-C and finishes the in-flight job before returning.

Embedding in a web server

Use run_worker_with_shutdown to stop the worker when a server shuts down rather than on Ctrl-C:

use tokio::sync::oneshot;

let (tx, rx) = oneshot::channel::<()>();
// In your server shutdown handler: tx.send(()).ok();

tokio::spawn(async move {
    rq.run_worker_with_shutdown("emails", handler, async {
        rx.await.ok();
    })
    .await
    .expect("worker failed");
});
CPU-bound handlers: A handler that never .awaits will block the tokio thread and prevent heartbeats from firing. After stall_timeout the job will be reclaimed. Wrap CPU-intensive sections with tokio::task::spawn_blocking, or raise stall_timeout.
Reliability

Retries & DLQ

Every job carries retry configuration. The defaults:

| Setting | Default |
|---|---|
| max_attempts | 3 |
| backoff | Exponential |
| backoff_delay_ms | 1 000 ms |

When a job fails (handler returns Err, or stall detection fires), the engine applies backoff and re-queues — unless attempt ≥ max_attempts, at which point the job moves to the dead-letter queue.

| Strategy | Delay formula | Example (base = 1 s) |
|---|---|---|
| Fixed | base | 1 s, 1 s, 1 s |
| Linear | base × attempt | 1 s, 2 s, 3 s |
| Exponential (default) | base × 2^(attempt-1) | 1 s, 2 s, 4 s |
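
The delay formulas in the table above can be written as a small function. This is a minimal sketch for checking the arithmetic, not the engine's implementation: the Backoff enum and backoff_delay_ms function are hypothetical names, and attempts are assumed to be numbered from 1, as the example column implies.

```rust
/// Backoff strategies from the table above (illustrative enum).
#[derive(Clone, Copy, Debug, PartialEq)]
enum Backoff {
    Fixed,
    Linear,
    Exponential,
}

/// Delay in milliseconds before the retry that follows attempt `attempt`
/// (1-based). Saturates at u64::MAX instead of overflowing.
fn backoff_delay_ms(strategy: Backoff, base_ms: u64, attempt: u32) -> u64 {
    let attempt = attempt.max(1);
    match strategy {
        Backoff::Fixed => base_ms,
        Backoff::Linear => base_ms.saturating_mul(u64::from(attempt)),
        Backoff::Exponential => {
            // base × 2^(attempt-1); checked_shl returns None for shifts >= 64.
            let factor = 1u64.checked_shl(attempt - 1).unwrap_or(u64::MAX);
            base_ms.saturating_mul(factor)
        }
    }
}
```

With the default base of 1 000 ms, the exponential strategy yields 1 000, 2 000, and 4 000 ms for attempts 1 through 3, matching the last row of the table.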

Inspecting the DLQ

let dead = rq.get_dlq_jobs("emails", 50).await?;
for job in dead {
    println!("{}: {:?}", job.id, job.last_error);
}
Operations

Housekeeping

The housekeeping loop is a background tokio task (default tick: 1 s) that drives retries, schedules, and crash recovery. run_worker starts it automatically. If you use pull/ack directly, start it explicitly:

// Call once after building; safe to call multiple times (idempotent).
rq.start_housekeeping()?;
Required for correctness: without housekeeping, delayed jobs never become runnable, stalled jobs stay stuck forever, and schedules never fire.

What it does each tick

  • Promotes Delayed → Waiting when the delay expires.
  • Fires cron and interval schedules that are due.
  • Fails jobs that have exceeded timeout_ms.
  • Detects stalled Active jobs (no heartbeat within stall_timeout).
  • Refreshes queue depth metrics.
  • Cleans up expired completed/failed/DLQ jobs (every ~1 min).
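
To make two of the steps above concrete, here is a simplified model of a single tick that promotes due delayed jobs and reclaims stalled active ones. The State enum and tick function are hypothetical and only mirror the state names used in this guide. The sketch assumes timestamps in milliseconds and re-queues stalled jobs directly, whereas the real engine applies the retry and backoff rules first.

```rust
/// Simplified job states; the names mirror the ones used in this guide.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Delayed { run_at_ms: u64 },
    Waiting,
    Active { last_heartbeat_ms: u64 },
}

/// One illustrative tick: promote due delayed jobs and re-queue stalled ones.
/// Returns the number of (promoted, stalled) jobs.
fn tick(jobs: &mut [State], now_ms: u64, stall_timeout_ms: u64) -> (usize, usize) {
    let (mut promoted, mut stalled) = (0, 0);
    for state in jobs.iter_mut() {
        match *state {
            State::Delayed { run_at_ms } if run_at_ms <= now_ms => {
                *state = State::Waiting;
                promoted += 1;
            }
            State::Active { last_heartbeat_ms }
                if now_ms.saturating_sub(last_heartbeat_ms) > stall_timeout_ms =>
            {
                // No heartbeat within the timeout: treat the job as stalled.
                *state = State::Waiting;
                stalled += 1;
            }
            _ => {}
        }
    }
    (promoted, stalled)
}
```

Under this model a stall is only noticed when a tick runs, so detection can lag the expiry of stall_timeout by up to roughly one tick interval.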

Tuning

use std::time::Duration;

let rq = RustQueue::redb("./jobs.db")?
    .stall_timeout(Duration::from_secs(60))    // default: 30 s
    .tick_interval(Duration::from_millis(500))  // default: 1 s
    .build()?;
Lifecycle

Graceful shutdown

run_worker catches Ctrl-C and finishes the in-flight job before returning. For programmatic shutdown (e.g., Axum shutdown signal):

use tokio::sync::watch;

// The receiver must be `mut`: watch::Receiver::changed takes &mut self.
let (shutdown_tx, mut shutdown_rx) = watch::channel(false);

rq.run_worker_with_shutdown("work", handler, async move {
    shutdown_rx.changed().await.ok();
})
.await?;

// Elsewhere, to trigger shutdown:
shutdown_tx.send(true).ok();

Jobs that have not yet been pulled remain in Waiting state and will be picked up on the next startup. No work is lost.
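
The drain behavior described above can be modeled without any async machinery. The following sketch is not how run_worker is implemented; run_until_shutdown is a hypothetical, synchronous stand-in showing why checking the shutdown signal only between jobs lets the in-flight job finish while unpulled jobs stay queued.

```rust
use std::collections::VecDeque;
use std::sync::atomic::{AtomicBool, Ordering};

/// Illustrative worker loop: the shutdown flag is checked only between jobs,
/// so a job that has started always runs to completion.
/// Returns the IDs of the jobs that were processed.
fn run_until_shutdown<F>(
    waiting: &mut VecDeque<u32>,
    shutdown: &AtomicBool,
    mut handler: F,
) -> Vec<u32>
where
    F: FnMut(u32),
{
    let mut done = Vec::new();
    while !shutdown.load(Ordering::SeqCst) {
        // The sketch stops on an empty queue; a real worker would wait instead.
        let job = match waiting.pop_front() {
            Some(job) => job,
            None => break,
        };
        handler(job);
        done.push(job);
    }
    done
}
```

If the flag is set while job 2 is running, job 2 still completes and job 3 remains in the waiting queue for the next startup.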

Tutorial

Crash-recovery walkthrough

The crash_recovery example demonstrates the full lifecycle: push a job, crash mid-processing, restart, and watch the job complete automatically.

  1. Start the worker (Terminal 1)

    Uses a short stall_timeout of 3 s so you don't wait long.

    cargo run --example crash_recovery -- worker
  2. Push a job (Terminal 2)

    cargo run --example crash_recovery -- push

    The worker picks it up and logs progress steps.

  3. Simulate a crash

    While the job is mid-run, kill the worker process:

    kill -9 <pid printed at startup>

    The job is now Active with no heartbeat. The database still holds it.

  4. Restart and watch recovery

    cargo run --example crash_recovery -- worker

    After ~3 s, housekeeping detects no heartbeat, fails the job, and re-queues it. The new worker picks it up and completes it. No manual intervention. No data loss. No message broker.

Note: each crash consumes one retry attempt. With max_attempts=3 (the default), a job killed three times will land in the DLQ. Raise max_attempts for workloads that may experience frequent infrastructure interruptions.
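
The arithmetic behind this note can be checked with a short sketch. It assumes attempts are numbered from 1 and applies the rule stated earlier in this guide (re-queue unless attempt ≥ max_attempts). Outcome, after_failure, and crashes_until_dlq are hypothetical names used only for illustration.

```rust
/// Where a job goes after a failed attempt, following the rule in this guide:
/// re-queue unless `attempt >= max_attempts`.
#[derive(Debug, PartialEq)]
enum Outcome {
    Retry { next_attempt: u32 },
    DeadLetter,
}

fn after_failure(attempt: u32, max_attempts: u32) -> Outcome {
    if attempt >= max_attempts {
        Outcome::DeadLetter
    } else {
        Outcome::Retry { next_attempt: attempt + 1 }
    }
}

/// Number of crashes a job absorbs before landing in the DLQ,
/// assuming every attempt ends in a crash.
fn crashes_until_dlq(max_attempts: u32) -> u32 {
    let mut attempt = 1;
    let mut crashes = 0;
    loop {
        crashes += 1;
        match after_failure(attempt, max_attempts) {
            Outcome::DeadLetter => return crashes,
            Outcome::Retry { next_attempt } => attempt = next_attempt,
        }
    }
}
```

With max_attempts set to 3, the third consecutive crash moves the job to the DLQ; raising the limit to 5 allows five crashes before that happens.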