bug: nack() is never called — transient handler failures silently drop events #3

Closed
opened 2026-06-10 09:12:59 +00:00 by GKaszewski · 0 comments
Owner

Problem

WorkerService::dispatch() always calls ack() after running all handlers, even when a handler returns an error. The nack() method is fully implemented in all queue adapters (sqlite, postgres, nats) with exponential backoff and dead-letter support, but it is never called anywhere.

Result: if a handler fails transiently (TMDB down, search index unavailable, etc.), the event is consumed and lost — no retry.

Root cause

crates/application/src/worker.rs dispatch():

async fn dispatch(handlers: Arc<Vec<Arc<dyn EventHandler>>>, envelope: EventEnvelope) {
    for handler in handlers.iter() {
        if let Err(e) = handler.handle(&envelope.event).await {
            tracing::warn!("event handler error (non-fatal): {e}");
        }
    }
    if let Err(e) = envelope.ack().await {   // always acks, never nacks
        tracing::error!("ack failed: {e}");
    }
}

Complication

The worker uses a fan-out model — every event is dispatched to all handlers. Naively nacking on any handler error would re-run all handlers, causing duplicate processing on the ones that already succeeded. Handlers would need to be idempotent for this to be safe.

Options

  1. Make handlers idempotent + nack on any error — simplest fix, requires auditing each handler for idempotency
  2. Route each event type to exactly one handler — removes the fan-out problem, nack becomes safe but changes dispatch architecture
  3. Per-handler ack tracking — most correct, most complex

Affected files

  • crates/application/src/worker.rs
  • crates/adapters/sqlite-event-queue/src/lib.rs (nack implemented but unused)
  • crates/adapters/postgres-event-queue/src/lib.rs (same)
  • crates/adapters/nats/src/consumer.rs (same)
## Problem `WorkerService::dispatch()` always calls `ack()` after running all handlers, even when a handler returns an error. The `nack()` method is fully implemented in all queue adapters (sqlite, postgres, nats) with exponential backoff and dead-letter support, but it is never called anywhere. Result: if a handler fails transiently (TMDB down, search index unavailable, etc.), the event is consumed and lost — no retry. ## Root cause `crates/application/src/worker.rs` `dispatch()`: ```rust async fn dispatch(handlers: Arc<Vec<Arc<dyn EventHandler>>>, envelope: EventEnvelope) { for handler in handlers.iter() { if let Err(e) = handler.handle(&envelope.event).await { tracing::warn!("event handler error (non-fatal): {e}"); } } if let Err(e) = envelope.ack().await { // always acks, never nacks tracing::error!("ack failed: {e}"); } } ``` ## Complication The worker uses a fan-out model — every event is dispatched to all handlers. Naively nacking on any handler error would re-run all handlers, causing duplicate processing on the ones that already succeeded. Handlers would need to be idempotent for this to be safe. ## Options 1. **Make handlers idempotent + nack on any error** — simplest fix, requires auditing each handler for idempotency 2. **Route each event type to exactly one handler** — removes the fan-out problem, nack becomes safe but changes dispatch architecture 3. **Per-handler ack tracking** — most correct, most complex ## Affected files - `crates/application/src/worker.rs` - `crates/adapters/sqlite-event-queue/src/lib.rs` (nack implemented but unused) - `crates/adapters/postgres-event-queue/src/lib.rs` (same) - `crates/adapters/nats/src/consumer.rs` (same)
Sign in to join this conversation.
No Label
1 Participants
Notifications
Due Date
No due date set.
Dependencies

No dependencies set.

Reference: GKaszewski/movies-diary#3