
The Fix

pip install celery==5.6.0

Based on closed celery/celery issue #8786, with the fixing PR and commit linked below.

Production note: This usually shows up under retries/timeouts. Treat it as a side-effect risk until you can verify behavior with a canary + real traffic.

From the fix commit (contributor list update plus changes to celery/app/builtins.py):

@@ -305,3 +305,4 @@
 Marc Bresson, 2024/09/02
 Colin Watson, 2025/03/01
 Lucas Infante, 2025/05/15
+Diego Margoni, 2025/07/01

diff --git a/celery/app/builtins.py b/celery/app/builtins.py
index 1a79c40932d..66fb94a29b2 100644

Why This Fix Works in Production

  • Trigger: the caller blocks inside self.drain_events_until(...) waiting on results that never arrive.
  • Mechanism: when a chord's body is a group and a task in its header fails, the chord never resolves, so any .get() on the result hangs until the backend timeout fires.
  • Why the fix works: the upstream patch corrects chord completion handling for group bodies whose header tasks fail (first fixed release: 5.6.0).
Production impact:
  • If left unfixed, callers waiting on these results hang until the timeout fires, stalling workers and any downstream work that depends on the chord's result.

Why This Breaks in Prod

  • Shows up under Python 3.9 in real deployments (not just unit tests).
  • The chord timeout behavior fails when chord bodies are groups and encounter failures in the header
  • Surfaces as: celery.exceptions.TimeoutError: The operation timed out. (full traceback in the Error Message section below).

Proof / Evidence

  • GitHub issue: #8786
  • Fix PR: https://github.com/celery/celery/pull/9788
  • First fixed release: 5.6.0
  • Reproduced locally: No (not executed)
  • Last verified: 2026-02-09
  • Confidence: 0.75
  • Did this fix it?: Yes (upstream fix exists)
  • Own content ratio: 0.40

Discussion

High-signal excerpts from the issue thread (symptoms, repros, edge-cases).

“Fix pushed to main and will be released in Celery v5.4 @kevinjdolan”
@Nusnus · 2024-01-17 · confirmation · source
“Note: included in the test cases, similar behavior is observed when you have a task that calls self.replace() with a group (though sometimes it is…”
@kevinjdolan · 2024-01-10 · source
“Including an updates test suite here to include some test with chords (as I understand groups are converted to chords behind the scenes when unrolling…”
@kevinjdolan · 2024-01-10 · source
“Wow that was fast! Excited to get rid of my workaround!”
@kevinjdolan · 2024-01-17 · source

Failure Signature (Search String)

  • for _ in self.drain_events_until(
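To catch this early, a simple log scan for the failure signature can feed an alert before a stuck caller is noticed by hand. A minimal sketch (function name is ours, not from Celery):

```python
# Scan worker/application logs for the failure signature so the hang is
# caught by monitoring rather than by a blocked caller.
SIGNATURE = "for _ in self.drain_events_until("


def find_signature_lines(log_text: str, signature: str = SIGNATURE) -> list:
    """Return 1-based line numbers where the failure signature appears."""
    return [
        lineno
        for lineno, line in enumerate(log_text.splitlines(), start=1)
        if signature in line
    ]
```

Feed it tailed log chunks and alert when it returns a non-empty list.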

Error Message

Stack trace
error.txt
Traceback (most recent call last):
  File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 287, in _wait_for_pending
    for _ in self.drain_events_until(
  File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 52, in drain_events_until
    raise socket.timeout()
TimeoutError

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/Users/nusnus/dev/GitHub/celery/celery/result.py", line 705, in get
    return (self.join_native if self.supports_native_join else self.join)(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/nusnus/dev/GitHub/celery/celery/result.py", line 827, in join_native
    for task_id, meta in self.iter_native(timeout, interval, no_ack,
  File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 172, in iter_native
    for _ in self._wait_for_pending(result, no_ack=no_ack, **kwargs):
  File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 293, in _wait_for_pending
    raise TimeoutError('The operation timed out.')
celery.exceptions.TimeoutError: The operation timed out.

Traceback (most recent call last):
  File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 287, in _wait_for_pending
    for _ in self.drain_events_u
... (truncated) ...

Minimal Reproduction

repro.py
def test_group_chain_2_fail_1(self, celery_setup: CeleryTestSetup):
    queue = celery_setup.worker.worker_queue
    sig = chain(
        group(
            identity.si("a").set(queue=queue),
            fail.si("b").set(queue=queue),
        ),
        group(
            identity.si("c").set(queue=queue),
            identity.si("d").set(queue=queue),
        ),
    )
    result = sig.apply_async()
    with pytest.raises(ExpectedException):
        result.get(timeout=RESULT_TIMEOUT)

Environment

  • Python: 3.9

What Broke

Tasks hang indefinitely when a group in a chord fails, causing timeouts.

Why It Broke

When a chord's body is a group and a task in its header fails, the chord never completes; callers waiting on the result hang until the backend timeout fires instead of receiving the header task's real exception.

Fix Options (Details)

Option A — Upgrade to the fixed release (safe default, recommended)

pip install celery==5.6.0

When NOT to use: if your chord bodies are never groups, this bug likely does not affect you, and the upgrade is optional (though still recommended).

Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.

Option C — Workaround (temporary)

The issue reporter used a local workaround while waiting for the fix, and shared a zip of files, runnable via docker-compose and Python unittest, that reproduces the problem; see issue #8786 for the workaround details.

When NOT to use: this applies only when chord bodies are groups; it does not address other kinds of result-wait hangs.

Use only if you cannot change versions today. Treat this as a stopgap and remove once upgraded.
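If you must stay on an affected version, one defensive pattern (ours, not from the issue thread) is to bound every result wait explicitly, so a hung chord surfaces as a handled timeout instead of a stuck process. In real code catch celery.exceptions.TimeoutError; the built-in TimeoutError is used here so the sketch is self-contained:

```python
# Hedge: helper name and defaults are our invention, not a Celery API.
def get_with_deadline(async_result, timeout_s=30.0, default=None):
    """Wait for a task result, returning `default` instead of hanging forever.

    `async_result` is anything with a Celery-style .get(timeout=...) method.
    """
    try:
        return async_result.get(timeout=timeout_s)
    except TimeoutError:
        # Under the bug, a failed header task inside a group-bodied chord
        # lands here instead of raising the task's real exception.
        # Log / alert at this point so the hang is visible.
        return default
```

Wire the `default` (or a raised sentinel) into your error handling so downstream work does not proceed on a missing result.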

Option D — Guard side-effects with OnceOnly (guardrail for side-effects)

Mitigate duplicate external side-effects under retries/timeouts/agent loops by gating the operation before calling external systems.

  • Place OnceOnly between your code/agent and real side-effects (Stripe, emails, CRM, APIs).
  • Use a stable key per side-effect (e.g., customer_id + action + idempotency_key).
  • Fail-safe: configure fail-open vs fail-closed based on blast radius and spend risk.
  • This does NOT fix data corruption; it only prevents duplicate side-effects.
onceonly.py
import os

from onceonly import OnceOnly

once = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"], fail_open=True)


def handle_webhook(event_id):
    # Stable idempotency key per real side-effect.
    # Use a request id / job id / webhook delivery id / Stripe event id, etc.
    key = f"stripe:webhook:{event_id}"
    res = once.check_lock(key=key, ttl=3600)
    if res.duplicate:
        return {"status": "already_processed"}
    # Safe to execute the side-effect exactly once.
    handle_event(event_id)

See OnceOnly SDK

When NOT to use: Do not use this to hide logic bugs or data corruption. Use it to block duplicate external side-effects and enforce tool permissions/spend caps.

Fix reference: https://github.com/celery/celery/pull/9788

First fixed release: 5.6.0

Last verified: 2026-02-09. Validate in your environment.


When NOT to Use This Fix

  • This fix targets chords whose body is a group; if your chord bodies are never groups, the issue likely does not apply to you.
  • Do not use this to hide logic bugs or data corruption. Use it to block duplicate external side-effects and enforce tool permissions/spend caps.

Verify Fix

verify
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
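Beyond re-running the reproduction, a cheap guard is to gate deployment on the installed Celery version. A sketch (names and parsing are ours; for production, prefer a proper version parser such as packaging.version):

```python
# Fail fast at startup if the deployed Celery predates the first fixed
# release (5.6.0) for this chord-timeout issue.
FIXED_IN = (5, 6, 0)


def parse_version(version: str) -> tuple:
    """Best-effort parse of 'MAJOR.MINOR.PATCH[suffix]' into an int tuple."""
    parts = []
    for token in version.split(".")[:3]:
        digits = "".join(ch for ch in token if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)


def has_chord_timeout_fix(installed: str) -> bool:
    """True if `installed` is at or past the first fixed release."""
    return parse_version(installed) >= FIXED_IN


# In your app startup:
#   import celery
#   assert has_chord_timeout_fix(celery.__version__), "upgrade to >= 5.6.0"
```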


Prevention

  • Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
  • Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
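The second bullet can be sketched as a plain retry decorator that records attempt count and failure reason, so retry spikes show up in monitoring before they become timeouts (names are ours, not a Celery API; Celery's own autoretry_for plus the task_retry signal covers the same ground inside tasks):

```python
import functools
import logging

log = logging.getLogger("retries")


def instrumented_retry(max_attempts=3, retry_on=(Exception,)):
    """Retry the wrapped function up to `max_attempts` times, logging each
    attempt's number and failure reason for alerting on retry spikes."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except retry_on as exc:
                    log.warning(
                        "retry fn=%s attempt=%d/%d reason=%r",
                        fn.__name__, attempt, max_attempts, exc,
                    )
                    if attempt == max_attempts:
                        raise
        return wrapper
    return decorator
```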

Version Compatibility Table

Version    Status
5.6.0      Fixed

Related Issues

No related fixes found.

Sources

We don’t republish the full GitHub discussion text. Use the links above for context.