The Fix
pip install celery==5.1.2
Based on closed celery/celery issue #7836 (fix PR/commit linked below).
Production note: Watch p95/p99 latency and retry volume; timeouts can turn into retry storms and duplicate side-effects.
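One way to blunt the retry-storm risk noted above is capped, jittered backoff between retries. A minimal stdlib sketch of the pattern (illustrative only; the function name and defaults are hypothetical, not part of Celery's API):

```python
import random

def backoff_delay(attempt: int, base: float = 0.5, cap: float = 30.0) -> float:
    """Full-jitter exponential backoff delay in seconds.

    Capping and randomizing the delay spreads retries out over time,
    so a burst of timeouts does not turn into a synchronized retry storm.
    """
    return random.uniform(0.0, min(cap, base * (2 ** attempt)))
```

Celery's own `retry_backoff`, `retry_backoff_max`, and `retry_jitter` task options implement a similar policy for auto-retried tasks.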
Excerpt from the linked upstream diff:
@@ -278,19 +278,24 @@ def mark_as_retry(self, task_id, exc, traceback=None,
     def chord_error_from_stack(self, callback, exc=None):
-        # need below import for test for some crazy reason
-        from celery import group  # pylint: disable
         app = self.app
Why This Fix Works in Production
- Trigger: CI: test_mutable_errback_called_by_chord_from_group_fail_multiple fails often
- Mechanism: the chord errback handling incorrectly assumed old-style errbacks
- Why the fix works: it corrects errback handling when chords fail, so both new- and old-style errbacks are invoked appropriately (first fixed release: 5.1.2).
- If left unfixed, tail latency can spike under load and surface as timeouts/retries, amplifying incident impact.
Why This Breaks in Prod
- The errback handling for chords incorrectly assumed old-style errbacks
- Production symptom (often without a traceback): CI: test_mutable_errback_called_by_chord_from_group_fail_multiple fails often
Proof / Evidence
- GitHub issue: #7836
- Fix attempt PR: celery/celery#7837 (per the issue thread; PR https://github.com/celery/celery/pull/6814 is the change that introduced the flaky test)
- First fixed release: 5.1.2
- Reproduced locally: No (not executed)
- Last verified: 2026-02-09
- Confidence: 0.85
- Did this fix it?: Yes (upstream fix exists)
- Own content ratio: 0.37
Discussion
High-signal excerpts from the issue thread (symptoms, repros, edge-cases).
“I have a fix attempt at #7837, you're welcome to check it out @woutdenolf.”
“Hey @woutdenolf :wave:, Thank you for opening an issue”
“The test was introduced in https://github.com/celery/celery/pull/6814. @maybe-sybr Any ideas?”
“I've seen it as well for some time now. I suspect the problem lies here. Let me check it out.”
Failure Signature (Search String)
- CI: test_mutable_errback_called_by_chord_from_group_fail_multiple fails often
- I see this test fail often in CI
Error Message
Signature-only (no traceback captured)
Minimal Reproduction
=================================== FAILURES ===================================
_ test_chord.test_mutable_errback_called_by_chord_from_group_fail_multiple[errback_old_style] [Error propagates from body group] _
self = <celery.backends.redis.ResultConsumer object at 0x7f6132ec6710>
result = <GroupResult: 8d8dfee3-a5fc-4c82-8a89-fc1d10beef7d [aef15b21-9fa3-4862-b37c-3d0c04fb1577, 0d02d76a-82de-4788-944f-fb3e...dc3, 68473988-4cef-4f6e-b556-df28d8b22560, 16ef8f73-7e94-4cb6-8d95-dc38d1b52416, d88e2c60-e2dc-420c-9e57-76e07f27f6aa]>
timeout = 60, on_interval = None, on_message = None
kwargs = {'interval': 0.5, 'no_ack': True}, prev_on_m = None, _ = None
def _wait_for_pending(self, result,
timeout=None, on_interval=None, on_message=None,
**kwargs):
self.on_wait_for_pending(result, timeout=timeout, **kwargs)
prev_on_m, self.on_message = self.on_message, on_message
try:
for _ in self.drain_events_until(
result.on_ready, timeout=timeout,
on_interval=on_interval):
celery/backends/asynchronous.py:287:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <celery.backends.asynchronous.Drainer object at 0x7f6132ec6560>
p = <promise@0x7f6132f0bd90>, timeout = 60, interval = 1, on_interval = None
wait = <bound method ResultConsumer.drain_events of <celery.backends.redis.ResultConsumer object at 0x7f6132ec6710>>
def drain_events_until(self, p, timeout=None, interval=1, on_interval=None, wait=None):
wait = wait or self.result_consumer.drain_events
time_start = time.monotonic()
while 1:
# Total time spent may exceed a single call to wait()
if timeout and time.monotonic() - time_start >= timeout:
raise socket.timeout()
E TimeoutError
redis_connection.delete(fail_sig_id)
with subtests.test(msg="Error propagates from body group"):
res = chord_sig.delay()
sleep(1)
with pytest.raises(ExpectedException):
res.get(timeout=TIMEOUT)
t/integration/test_canvas.py:2521:
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
celery/result.py:675: in get
return (self.join_native if self.supports_native_join else self.join)(
celery/result.py:797: in join_native
for task_id, meta in self.iter_native(timeout, interval, no_ack,
celery/backends/asynchronous.py:172: in iter_native
for _ in self._wait_for_pending(result, no_ack=no_ack, **kwargs):
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
self = <celery.backends.redis.ResultConsumer object at 0x7f6132ec6710>
result = <GroupResult: 8d8dfee3-a5fc-4c82-8a89-fc1d10beef7d [aef15b21-9fa3-4862-b37c-3d0c04fb1577, 0d02d76a-82de-4788-944f-fb3e...dc3, 68473988-4cef-4f6e-b556-df28d8b22560, 16ef8f73-7e94-4cb6-8d95-dc38d1b52416, d88e2c60-e2dc-420c-9e57-76e07f27f6aa]>
timeout = 60, on_interval = None, on_message = None
kwargs = {'interval': 0.5, 'no_ack': True}, prev_on_m = None, _ = None
def _wait_for_pending(self, result,
timeout=None, on_interval=None, on_message=None,
**kwargs):
self.on_wait_for_pending(result, timeout=timeout, **kwargs)
prev_on_m, self.on_message = self.on_message, on_message
try:
for _ in self.drain_events_unti
... (truncated) ...
What Broke
Tests fail intermittently in CI due to timeouts when handling errbacks.
Why It Broke
The errback handling for chords incorrectly assumed old-style errbacks.
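For context on the old-style vs new-style distinction: a Celery `link_error` errback that accepts `(request, exc, traceback)` is new-style, while an old-style errback receives only the id of the failed task. A stdlib-only sketch of arity-based dispatch (illustrative of the bug class; this is not Celery's actual implementation):

```python
import inspect

def call_errback(errback, task_id, exc, tb):
    """Dispatch an errback using the calling convention its signature implies.

    New-style errbacks take (request, exc, traceback); old-style ones take
    only the failed task's id. The bug described above comes from always
    assuming one convention regardless of the errback's signature.
    """
    n_params = len(inspect.signature(errback).parameters)
    if n_params >= 3:
        # new-style: pass the full failure context (task_id stands in
        # for the request object in this sketch)
        return errback(task_id, exc, tb)
    # old-style: id only
    return errback(task_id)
```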
Fix Options (Details)
Option A — Upgrade to fixed release (safe default, recommended)
pip install celery==5.1.2
Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.
Fix reference: celery/celery#7837 (fix attempt per the issue thread; the flaky test itself came from https://github.com/celery/celery/pull/6814)
First fixed release: 5.1.2
Last verified: 2026-02-09. Validate in your environment.
When NOT to Use This Fix
- This fix should not be used if the errback logic is fundamentally different.
Verify Fix
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
Prevention
- Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
- Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
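The first prevention bullet, applied to a polling loop like `drain_events_until` from the traceback above: make the deadline explicit and unit-test it. A minimal sketch (hypothetical helper, not Celery code; the injectable clock/sleep are there so the timeout path can be tested without real waiting):

```python
import time

def drain_until(predicate, timeout=60.0, interval=0.5,
                clock=time.monotonic, sleep=time.sleep):
    """Poll predicate() until it is true, or raise after an explicit deadline.

    Total time spent may exceed a single interval; the deadline is checked
    against a monotonic clock on every iteration.
    """
    start = clock()
    while not predicate():
        if timeout is not None and clock() - start >= timeout:
            raise TimeoutError(f"condition not met within {timeout}s")
        sleep(interval)
```

In a unit test, pass a fake clock and a no-op sleep to drive the loop past the deadline instantly.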
Version Compatibility Table
| Version | Status |
|---|---|
| < 5.1.2 | Affected |
| 5.1.2 | Fixed |
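To act on the table programmatically, a naive check against the first-fixed-release claim (assumes plain `X.Y.Z` version strings; pre-release suffixes like `5.2.0b1` would need `packaging.version` instead):

```python
def carries_fix(version: str, first_fixed=(5, 1, 2)) -> bool:
    """Return True if a plain X.Y.Z Celery version is >= the first fixed release.

    Tuple comparison handles multi-digit components correctly
    (e.g. 5.1.10 > 5.1.2, which a string comparison would get wrong).
    """
    parts = tuple(int(p) for p in version.split(".")[:3])
    return parts >= first_fixed
```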
Related Issues
No related fixes found.
Sources
We don’t republish the full GitHub discussion text. Use the links above for context.