The Fix
pip install celery==5.1.2
Based on closed celery/celery issue #7836 (fix PR/commit linked below).
Production note: Watch p95/p99 latency and retry volume; timeouts can turn into retry storms and duplicate side-effects.
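One way to blunt the retry-storm risk noted above is capped, jittered backoff between retries. A minimal stdlib sketch of the pattern (illustrative only; the function name and defaults are hypothetical, not part of Celery's API):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff delay in seconds.

    Capping and randomizing the delay spreads retries out over time,
    so a burst of timeouts does not turn into a synchronized retry storm.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Celery's own `retry_backoff`, `retry_backoff_max`, and `retry_jitter` task options implement a similar policy for auto-retried tasks.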
Excerpt from the linked upstream diff:
@@ -278,19 +278,24 @@ def mark_as_retry(self, task_id, exc, traceback=None,
     def chord_error_from_stack(self, callback, exc=None):
-        # need below import for test for some crazy reason
-        from celery import group  # pylint: disable
         app = self.app
Why This Fix Works in Production
- Trigger: CI: test_mutable_errback_called_by_chord_from_group_fail_multiple fails often
- Mechanism: the chord errback handling incorrectly assumed old-style errbacks
- Why the fix works: it corrects errback handling when chords fail, so both new- and old-style errbacks are invoked appropriately (first fixed release: 5.1.2).
- If left unfixed, tail latency can spike under load and surface as timeouts/retries, amplifying incident impact.
Why This Breaks in Prod
- The errback handling for chords incorrectly assumed old-style errbacks
- Production symptom (often without a traceback): CI: test_mutable_errback_called_by_chord_from_group_fail_multiple fails often
Proof / Evidence
- GitHub issue: #7836
- Fix attempt PR: celery/celery#7837 (per the issue thread; PR https://github.com/celery/celery/pull/6814 is the change that introduced the flaky test)
- First fixed release: 5.1.2
- Reproduced locally: No (not executed)
- Last verified: 2026-02-09
- Confidence: 0.85
- Did this fix it?: Yes (upstream fix exists)
- Own content ratio: 0.37
Discussion
High-signal excerpts from the issue thread (symptoms, repros, edge-cases).
“I have a fix attempt at #7837, you're welcome to check it out @woutdenolf.”
“Hey @woutdenolf :wave:, Thank you for opening an issue”
“The test was introduced in https://github.com/celery/celery/pull/6814. @maybe-sybr Any ideas?”
“I've seen it as well for some time now. I suspect the problem lies here. Let me check it out.”
Failure Signature (Search String)
- CI: test_mutable_errback_called_by_chord_from_group_fail_multiple fails often
- I see this test fail often in CI
Error Message
Signature-only (no traceback captured)
Minimal Reproduction
=================================== FAILURES ===================================
_ test_chord.test_mutable_errback_called_by_chord_from_group_fail_multiple[errback_old_style] [Error propagates from body group] _
self = <celery.backends.redis.ResultConsumer object at 0x7f6132ec6710>
result = <GroupResult: 8d8dfee3-a5fc-4c82-8a89-fc1d10beef7d [aef15b21-9fa3-4862-b37c-3d0c04fb1577, 0d02d76a-82de-4788-944f-fb3e...dc3, 68473988-4cef-4f6e-b556-df28d8b22560, 16ef8f73-7e94-4cb6-8d95-dc38d1b52416, d88e2c60-e2dc-420c-9e57-76e07f27f6aa]>
timeout = 60, on_interval = None, on_message = None
kwargs = {'interval': 0.5, 'no_ack': True}, prev_on_m = None, _ = None
def _wait_for_pending(self, result,
timeout=None, on_interval=None, on_message=None,
**kwargs):
self.on_wait_for_pending(result, timeout=timeout, **kwargs)
prev_on_m, self.on_message = self.on_message, on_message
try:
for _ in self.drain_events_until(
result.on_ready, timeout=timeout,
on_interval=on_interval):
celery/backends/asynchronous.py:287:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <celery.backends.asynchronous.Drainer object at 0x7f6132ec6560>
p = <promise@0x7f6132f0bd90>, timeout = 60, interval = 1, on_interval = None
wait = <bound method ResultConsumer.drain_events of <celery.backends.redis.ResultConsumer object at 0x7f6132ec6710>>
def drain_events_until(self, p, timeout=None, interval=1, on_interval=None, wait=None):
wait = wait or self.result_consumer.drain_events
time_start = time.monotonic()
while 1:
# Total time spent may exceed a single call to wait()
if timeout and time.monotonic() - time_start >= timeout:
raise socket.timeout()
E TimeoutError
redis_connection.delete(fail_sig_id)
with subtests.test(msg="Error propagates from body group"):
res = chord_sig.delay()
sleep(1)
with pytest.raises(ExpectedException):
res.get(timeout=TIMEOUT)
t/integration/test_canvas.py:2521:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
celery/result.py:675: in get
return (self.join_native if self.supports_native_join else self.join)(
celery/result.py:797: in join_native
for task_id, meta in self.iter_native(timeout, interval, no_ack,
celery/backends/asynchronous.py:172: in iter_native
for _ in self._wait_for_pending(result, no_ack=no_ack, **kwargs):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <celery.backends.redis.ResultConsumer object at 0x7f6132ec6710>
result = <GroupResult: 8d8dfee3-a5fc-4c82-8a89-fc1d10beef7d [aef15b21-9fa3-4862-b37c-3d0c04fb1577, 0d02d76a-82de-4788-944f-fb3e...dc3, 68473988-4cef-4f6e-b556-df28d8b22560, 16ef8f73-7e94-4cb6-8d95-dc38d1b52416, d88e2c60-e2dc-420c-9e57-76e07f27f6aa]>
timeout = 60, on_interval = None, on_message = None
kwargs = {'interval': 0.5, 'no_ack': True}, prev_on_m = None, _ = None
def _wait_for_pending(self, result,
timeout=None, on_interval=None, on_message=None,
**kwargs):
self.on_wait_for_pending(result, timeout=timeout, **kwargs)
prev_on_m, self.on_message = self.on_message, on_message
try:
for _ in self.drain_events_unti
... (truncated) ...
What Broke
Tests fail intermittently in CI due to timeouts when handling errbacks.
Why It Broke
The errback handling for chords incorrectly assumed old-style errbacks.
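For context on the old-style vs new-style distinction: a Celery `link_error` errback that accepts `(request, exc, traceback)` is new-style, while an old-style errback receives only the id of the failed task. A stdlib-only sketch of arity-based dispatch (illustrative of the bug class; this is not Celery's actual implementation):

```python
import inspect

def call_errback(errback, task_id, exc, tb):
    """Dispatch an errback using the calling convention its signature implies.

    New-style errbacks take (request, exc, traceback); old-style ones take
    only the failed task's id. The bug described above comes from always
    assuming one convention regardless of the errback's signature.
    """
    n_params = len(inspect.signature(errback).parameters)
    if n_params >= 3:
        # new-style: pass the full failure context (task_id stands in
        # for the request object in this sketch)
        return errback(task_id, exc, tb)
    # old-style: id only
    return errback(task_id)
```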
Fix Options (Details)
Option A — Upgrade to fixed release (safe default, recommended)
pip install celery==5.1.2
Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.
Fix reference: celery/celery#7837 (fix attempt per the issue thread; the flaky test itself came from https://github.com/celery/celery/pull/6814)
First fixed release: 5.1.2
Last verified: 2026-02-09. Validate in your environment.
When NOT to Use This Fix
- This fix should not be used if the errback logic is fundamentally different.
Verify Fix
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
Prevention
- Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
- Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
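The first prevention bullet, applied to a polling loop like `drain_events_until` from the traceback above: make the deadline explicit and unit-test it. A minimal sketch (hypothetical helper, not Celery code; the injectable clock/sleep are there so the timeout path can be tested without real waiting):

```python
import time

def drain_until(predicate, timeout=60.0, interval=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it is true, or raise after an explicit deadline.

    Total time spent may exceed a single interval; the deadline is checked
    against a monotonic clock on every iteration.
    """
    start = clock()
    while not predicate():
        if timeout is not None and clock() - start >= timeout:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)
```

In a unit test, pass a fake clock and a no-op sleep to drive the loop past the deadline instantly.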
Version Compatibility Table
| Version | Status |
|---|---|
| < 5.1.2 | Affected |
| 5.1.2 | Fixed |
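To act on the table programmatically, a naive check against the first-fixed-release claim (assumes plain `X.Y.Z` version strings; pre-release suffixes like `5.2.0b1` would need `packaging.version` instead):

```python
def carries_fix(version: str, first_fixed=(5, 1, 2)) -> bool:
    """Return True if a plain X.Y.Z Celery version is >= the first fixed release.

    Tuple comparison handles multi-digit components correctly
    (e.g. 5.1.10 > 5.1.2, which a string comparison would get wrong).
    """
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= first_fixed
```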
Related Issues
No related fixes found.
Sources
We don’t republish the full GitHub discussion text. Use the links above for context.