The Fix
pip install celery==5.6.0
Based on the closed celery/celery issue #8786; the fixing PR and commit are linked below.
Production note: This usually shows up under retries/timeouts. Treat it as a side-effect risk until you can verify behavior with a canary + real traffic.
The upstream fix modifies celery/app/builtins.py (the PR also adds Diego Margoni to CONTRIBUTORS) and adds canvas integration tests, shown under Minimal Reproduction below.
Why This Fix Works in Production
- Trigger: a chord whose body is a group, with a failing task in the chord header.
- Mechanism: after the header failure the chord result is never resolved, so callers block inside drain_events_until until the result timeout fires.
- Why the fix works: per the linked PR, the header failure now propagates to the chord result, so result.get() raises promptly instead of timing out (first fixed release: 5.6.0).
- If left unfixed, this can cause silent data inconsistencies that propagate (bad cache entries, incorrect downstream decisions).
Why This Breaks in Prod
- Reproduced under Python 3.9 in real deployments (not just unit tests).
- The chord timeout path misbehaves when the chord body is a group and a task in the header fails: the result is never resolved.
- Surfaces as celery.exceptions.TimeoutError ("The operation timed out.") once result.get(timeout=...) expires; without a timeout, the call hangs indefinitely.
Proof / Evidence
- GitHub issue: #8786
- Fix PR: https://github.com/celery/celery/pull/9788
- First fixed release: 5.6.0
- Reproduced locally: No (not executed)
- Last verified: 2026-02-09
- Confidence: 0.75
- Did this fix it?: Yes (upstream fix exists)
- Own content ratio: 0.40
Discussion
High-signal excerpts from the issue thread (symptoms, repros, edge-cases).
“Fix pushed to main and will be released in Celery v5.4 @kevinjdolan”
“Note: included in the test cases, similar behavior is observed when you have a task that calls self.replace() with a group (though sometimes it is…”
“Including an updates test suite here to include some test with chords (as I understand groups are converted to chords behind the scenes when unrolling…”
“Wow that was fast! Excited to get rid of my workaround!”
Failure Signature (Search String)
- for _ in self.drain_events_until(
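To check whether this failure is present in your logs, grep for the signature above. A minimal sketch; the log file here is created inline for illustration, so substitute the path to your real worker log:

```shell
# Stand-in log file for illustration; point grep at your actual worker log.
log=$(mktemp)
echo '  for _ in self.drain_events_until(' > "$log"
# Count occurrences of the failure signature.
count=$(grep -c 'drain_events_until' "$log")
echo "$count"
rm -f "$log"
```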
Error Message (Stack Trace)
---------------------------
Traceback (most recent call last):
File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 287, in _wait_for_pending
for _ in self.drain_events_until(
File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 52, in drain_events_until
raise socket.timeout()
TimeoutError
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/nusnus/dev/GitHub/celery/celery/result.py", line 705, in get
return (self.join_native if self.supports_native_join else self.join)(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/Users/nusnus/dev/GitHub/celery/celery/result.py", line 827, in join_native
for task_id, meta in self.iter_native(timeout, interval, no_ack,
File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 172, in iter_native
for _ in self._wait_for_pending(result, no_ack=no_ack, **kwargs):
File "/Users/nusnus/dev/GitHub/celery/celery/backends/asynchronous.py", line 293, in _wait_for_pending
raise TimeoutError('The operation timed out.')
celery.exceptions.TimeoutError: The operation timed out.
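The trace shows a low-level socket.timeout being converted into celery.exceptions.TimeoutError inside an except block, which is why the chained "During handling of the above exception" section appears. A minimal pure-Python sketch of that pattern; the function names mirror celery's for readability but this is illustrative, not celery's actual code:

```python
import socket

def drain_events_until():
    # Stand-in for the low-level wait: it signals expiry via socket.timeout.
    raise socket.timeout()

def wait_for_pending():
    try:
        drain_events_until()
    except socket.timeout:
        # Re-raising inside the except block produces the chained
        # "During handling of the above exception" traceback.
        raise TimeoutError('The operation timed out.')

try:
    wait_for_pending()
except TimeoutError as exc:
    print(f"{type(exc).__name__}: {exc}")
```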
Minimal Reproduction
# Excerpted from a test class in celery's integration suite; identity/fail,
# CeleryTestSetup, ExpectedException and RESULT_TIMEOUT come from celery's
# test fixtures.
import pytest
from celery import chain, group

def test_group_chain_2_fail_1(self, celery_setup: CeleryTestSetup):
    queue = celery_setup.worker.worker_queue
    sig = chain(
        group(
            identity.si("a").set(queue=queue),
            fail.si("b").set(queue=queue),
        ),
        group(
            identity.si("c").set(queue=queue),
            identity.si("d").set(queue=queue),
        ),
    )
    result = sig.apply_async()
    with pytest.raises(ExpectedException):
        result.get(timeout=RESULT_TIMEOUT)
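The repro depends on result.get(timeout=...) raising instead of blocking forever. The same bounded-wait discipline, sketched without Celery using concurrent.futures (the task and timeout values here are illustrative):

```python
import concurrent.futures
import time

def slow_task():
    # Stand-in for a task whose result never arrives in time.
    time.sleep(0.5)
    return "done"

with concurrent.futures.ThreadPoolExecutor(max_workers=1) as pool:
    fut = pool.submit(slow_task)
    try:
        # Bounded wait, analogous to result.get(timeout=RESULT_TIMEOUT).
        fut.result(timeout=0.05)
        outcome = "completed"
    except concurrent.futures.TimeoutError:
        outcome = "timed out"

print(outcome)
```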
Environment
- Python: 3.9
What Broke
Tasks hang indefinitely when a group in a chord fails, causing timeouts.
Why It Broke
When two groups are chained, Celery upgrades the chain into a chord behind the scenes (as noted in the issue discussion). If a task in the chord header fails while the body is a group, the chord is never resolved, so waiting on the result hangs until the timeout expires.
Fix Options (Details)
Option A — Upgrade to fixed release (safe default, recommended)
pip install celery==5.6.0
Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.
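Before relying on Option A, it can help to assert that the installed version is at least the first fixed release. A minimal sketch using a naive three-part version compare; parse_version and is_affected are helpers defined here, not a Celery API, and pre-release tags are ignored:

```python
FIRST_FIXED = (5, 6, 0)

def parse_version(v: str) -> tuple:
    # Naive parse: take the first three numeric components only.
    return tuple(int(part) for part in v.split(".")[:3])

def is_affected(installed: str) -> bool:
    # True if the installed version predates the first fixed release.
    return parse_version(installed) < FIRST_FIXED

print(is_affected("5.5.3"))  # True
print(is_affected("5.6.0"))  # False
```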
Option C — Workaround (temporary)
The reporter mentions having used a workaround ("Excited to get rid of my workaround!"), but no concrete snippet was captured here; see the linked issue #8786 thread for details.
Use only if you cannot change versions today. Treat this as a stopgap and remove once upgraded.
Option D — Guard side-effects with OnceOnly (guardrail for side-effects)
Mitigate duplicate external side-effects under retries/timeouts/agent loops by gating the operation before calling external systems.
- Place OnceOnly between your code/agent and real side-effects (Stripe, emails, CRM, APIs).
- Use a stable key per side-effect (e.g., customer_id + action + idempotency_key).
- Fail-safe: configure fail-open vs fail-closed based on blast radius and spend risk.
- This does NOT fix data corruption; it only prevents duplicate side-effects.
Example snippet (illustrative; handle_event is a placeholder for your own side-effecting function):
import os
from onceonly import OnceOnly

once = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"], fail_open=True)

def process_stripe_webhook(event_id: str):
    # Stable idempotency key per real side-effect.
    # Use a request id / job id / webhook delivery id / Stripe event id, etc.
    key = f"stripe:webhook:{event_id}"
    res = once.check_lock(key=key, ttl=3600)
    if res.duplicate:
        return {"status": "already_processed"}
    # Safe to execute the side-effect exactly once.
    return handle_event(event_id)
Fix reference: https://github.com/celery/celery/pull/9788
First fixed release: 5.6.0
Last verified: 2026-02-09. Validate in your environment.
When NOT to Use This Fix
- Do not use this fix if the chord body is not a group.
- Do not use this to hide logic bugs or data corruption. Use it to block duplicate external side-effects and enforce tool permissions/spend caps.
Verify Fix
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
Prevention
- Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
- Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
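The retry-instrumentation point above can be sketched as a decorator that logs attempt count and reason on every failed attempt; the logger name, delay policy, and decorator itself are assumptions for illustration, not a Celery feature:

```python
import functools
import logging
import time

log = logging.getLogger("retries")

def instrument_retries(max_attempts: int = 3, base_delay: float = 0.0):
    # Logs every failed attempt (count + reason) so a spike in retries
    # can be turned into an alert via log-based metrics.
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return fn(*args, **kwargs)
                except Exception as exc:
                    log.warning("retry fn=%s attempt=%d reason=%r",
                                fn.__name__, attempt, exc)
                    if attempt == max_attempts:
                        raise
                    time.sleep(base_delay * attempt)
        return wrapper
    return decorator

calls = {"n": 0}

@instrument_retries(max_attempts=3)
def flaky():
    # Fails twice, then succeeds, to exercise the retry path.
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("transient upstream slowdown")
    return "ok"

print(flaky(), calls["n"])
```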
Version Compatibility Table
| Version | Status |
|---|---|
| < 5.6.0 | Reported affected (issue #8786) |
| 5.6.0 | Fixed |
Related Issues
No related fixes found.
Sources
We don’t republish the full GitHub discussion text. Use the links above for context.