The hard part of retrying a dropped message is knowing when not to

At-least-once has a blind spot

Our product talks to small-business owners over WhatsApp. A message comes in, we persist it, and a background worker decides what to reply. That worker is built to be safe to run twice — every reply is recorded in a ledger before it's sent, so a retry that re-runs an already-answered message sends nothing. "At-least-once, deduplicated" is the textbook answer, and we had it.
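As a rough illustration of that record-then-send pattern, here is a minimal sketch. It assumes a hypothetical `reply_ledger` table keyed by message id and uses an in-memory SQLite database as a stand-in for the real store; the names `make_ledger`, `handle`, and `send` are illustrative, not our production code.

```python
import sqlite3
from typing import Callable

def make_ledger() -> sqlite3.Connection:
    # Hypothetical ledger: at most one row per inbound message id.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE reply_ledger (message_id TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn

def handle(
    conn: sqlite3.Connection,
    message_id: str,
    body: str,
    send: Callable[[str, str], None],
) -> bool:
    """Record the reply first, then send it. Returns False on a duplicate run."""
    cur = conn.execute(
        "INSERT OR IGNORE INTO reply_ledger (message_id, body) VALUES (?, ?)",
        (message_id, body),
    )
    conn.commit()
    if cur.rowcount == 0:
        # The primary key already existed: an earlier run claimed this message.
        return False
    send(message_id, body)
    return True
```

Calling `handle` twice with the same `message_id` sends once; the second call finds the ledger row and returns without sending.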

Then we asked a quieter question: what protects the message that never reached the worker at all?

The ledger protects work a task actually picked up. But there's a step before that: the inbound is saved to the database, and then a task is enqueued onto the broker (Redis, in our case) for the worker to grab. If the process dies in the gap between "saved" and "enqueued" — or the broker itself restarts and loses what was queued — no task is ever created. The ledger has nothing to deduplicate, because nothing ran. The message just sits there, answered by no one.
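To make the gap concrete, here is a small simulation. It uses a plain dict in place of the database and a `deque` in place of the Redis queue, and a made-up `crash` flag stands in for the process dying between the two steps; none of these names come from the real system.

```python
from collections import deque

class CrashBeforeEnqueue(Exception):
    """Stands in for the process dying between 'saved' and 'enqueued'."""

def receive(db: dict, queue: deque, message_id: str, text: str, crash: bool = False) -> None:
    db[message_id] = {"text": text, "reply": None}  # step 1: inbound persisted
    if crash:
        raise CrashBeforeEnqueue(message_id)         # the process dies in the gap
    queue.append(message_id)                         # step 2: task enqueued

db, queue = {}, deque()
receive(db, queue, "m1", "hello")
try:
    receive(db, queue, "m2", "are you open today?", crash=True)
except CrashBeforeEnqueue:
    pass

# m2 is saved but has no task, so no worker will ever pick it up.
orphans = [mid for mid in db if mid not in queue and db[mid]["reply"] is None]
# orphans == ["m2"]
```

The worker's ledger never enters the picture for `m2`, because no task exists to run.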

So there are really two different failure gaps, and they need two different fixes:

Enqueued but not yet processed: the task was queued, then the broker restarted and dropped it. Fixed by turning on broker persistence (Redis's append-only file), so a restart reloads the queue from disk instead of forgetting it.
Never enqueued: the task was never created. Persistence can't help; there's nothing on disk to replay. This needs something that looks at the outcome ("this message has no reply") and acts.

Neither layer covers the other. We added both. The second one is where it got interesting.

The sweep is easy. The silence is hard.

The fix for the second gap is a periodic job — a "reconcile sweep." Every few minutes it asks: are there inbound messages with no reply? If so, re-enqueue them. The mechanism is genuinely about thirty lines: a scheduled task, a query, a re-dispatch.
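A naive version of that query-and-re-dispatch step might look like the sketch below. The `inbound` and `replies` tables, their columns, and the `enqueue` callback are assumptions made for illustration, with in-memory SQLite standing in for the real database.

```python
import sqlite3
from typing import Callable

def naive_sweep(conn: sqlite3.Connection, enqueue: Callable[[int], None]) -> int:
    """Re-enqueue every inbound message that has no reply row. No other checks."""
    rows = conn.execute(
        """
        SELECT i.id FROM inbound AS i
        LEFT JOIN replies AS r ON r.inbound_id = i.id
        WHERE r.inbound_id IS NULL
        ORDER BY i.id
        """
    ).fetchall()
    for (message_id,) in rows:
        enqueue(message_id)
    return len(rows)

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE inbound (id INTEGER PRIMARY KEY, body TEXT);
    CREATE TABLE replies (inbound_id INTEGER REFERENCES inbound(id), body TEXT);
    INSERT INTO inbound (id, body) VALUES (1, 'hi'), (2, 'stop'), (3, 'price?');
    INSERT INTO replies (inbound_id, body) VALUES (1, 'hello!');
    """
)
requeued = []
count = naive_sweep(conn, requeued.append)
# requeued == [2, 3]: every unanswered message, including the one that said 'stop'
```

Nothing in the query asks why a message has no reply.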

We almost shipped exactly that. It would have caused quiet, user-visible mistakes.

Because "this message has no reply" is true in a lot of situations where replying again is the wrong move:

The customer opted out. They said stop. There's no reply because we're deliberately silent — and a sweep that "helpfully" re-sends would message someone who asked us not to.
A human operator took the conversation over. When one of our team steps in, the bot goes quiet on purpose. To the sweep, that looks identical to a dropped message — no bot reply present. Re-enqueueing it means the bot starts talking over the human mid-conversation.
The message is already being handled. A worker has picked it up and the reply is in flight, but the sweep's query ran before the record was visible to it: a race. Re-enqueueing risks a double message.
The message is three days old. WhatsApp only lets you send a free-form reply within 24 hours of the customer's last message. A "recovery" send outside that window doesn't recover anything; it gets rejected.

The mechanism doesn't know any of this. It just sees "no reply." So the real work of the sweep isn't the retry — it's encoding every reason we might be silent on purpose, and refusing to act on those.

Mirror the gates that were already there

The insight that made it tractable: the live reply path already knows when to stay quiet. Before the bot answers anything, it asks the same questions: is this contact opted out, has a human taken over, is the conversation escalated? Those gates already existed and were already trusted.

So the sweep doesn't invent its own judgment. It mirrors the same silence gates the bot uses in real time, plus the two the bot doesn't need but the sweep does: skip anything already answered (the race), and skip anything outside the 24-hour window (the external constraint). What's left after all those filters is the best candidate for a genuinely dropped message — and that is the only thing we re-enqueue.

inbound with no reply
  → opted out?            stop. (we're silent on purpose)
  → operator took over?   stop. (don't talk over a human)
  → escalated?            stop.
  → already answered?     stop. (it was a race)
  → older than 24h?       stop. (window closed)
  → otherwise             re-enqueue — this one really was dropped
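The decision chain above can be sketched as one function that reuses the live path's gates and then adds the two sweep-only checks. Everything here is illustrative: the `Conversation` and `Inbound` shapes, the field names, and the reason strings are assumptions, and the window check uses the message's own timestamp as a simplification (a real check would measure from the customer's most recent message).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional

REPLY_WINDOW = timedelta(hours=24)  # WhatsApp's free-form reply window

@dataclass
class Conversation:
    opted_out: bool = False
    operator_took_over: bool = False
    escalated: bool = False

@dataclass
class Inbound:
    received_at: datetime
    answered: bool = False

def silence_reason(conv: Conversation) -> Optional[str]:
    """Gates shared with the live reply path: reasons the bot is quiet on purpose."""
    if conv.opted_out:
        return "opted_out"
    if conv.operator_took_over:
        return "operator_took_over"
    if conv.escalated:
        return "escalated"
    return None

def sweep_decision(msg: Inbound, conv: Conversation, now: datetime) -> str:
    """Return 're-enqueue', or the reason the sweep leaves this message alone."""
    reason = silence_reason(conv)
    if reason is not None:
        return reason
    if msg.answered:
        return "already_answered"   # sweep-only gate: the race
    if now - msg.received_at > REPLY_WINDOW:
        return "window_closed"      # sweep-only gate: the external constraint
    return "re-enqueue"
```

Keeping `silence_reason` as a single function called by both the live path and the sweep means a new reason to stay quiet only has to be added in one place. Returning the reason, not just a boolean, also makes it possible to log why a message was skipped.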

What we took away

"Just retry it" is one of those phrases that sounds like a small task and isn't. Retrying is trivial when the only actor is your own code and the only outcome is success or failure. The moment there are humans in the loop — operators, people who opted out — and external constraints — a messaging window you don't control — the retry has to be as careful about not acting as it is about acting.

The thirty lines that re-send the message are the easy part. The list of reasons not to is the feature.
