The Fix
pip install celery==4.4.0rc5
Based on closed celery/celery issue #5844; fix in PR https://github.com/celery/celery/pull/5843
Production note: This shows up when tasks fail with acks_late=True and acks_on_failure_or_timeout=False. Treat the upgrade as a behavior change for failed tasks until you have verified it with a canary and real traffic.
@@ -510,6 +510,10 @@ def on_failure(self, exc_info, send_failed_event=True, return_ok=False):
             elif ack:
                 self.acknowledge()
+            else:
+                # supporting the behaviour where a task failed and
+                # need to be removed from prefetched local queue
Why This Fix Works in Production
- Trigger: a task declared with acks_late=True and acks_on_failure_or_timeout=False fails on a worker that consumes from SQS.
- Mechanism: the worker neither acknowledged nor rejected the failed task, so the message stayed in the worker's prefetched local queue and the worker stopped consuming tasks.
- Why the fix works: the worker's on_failure handler gained an else branch that rejects the failed task and removes it from the prefetched local queue, so the worker keeps consuming (first fixed release: 4.4.0rc5).
- If left unfixed: workers stop consuming after a failure and tasks back up in the queue.
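The mechanism above can be illustrated with a toy model. This is a minimal sketch, not Celery's or kombu's real code: `ToyWorker`, `prefetch_limit`, and `reject_on_failure` are hypothetical names, and the model assumes that a fetched message occupies a prefetch slot until it is either acknowledged or rejected.

```python
from collections import deque


class ToyWorker:
    """Toy model of a worker with a prefetch window (not Celery's real code)."""

    def __init__(self, prefetch_limit, reject_on_failure):
        self.prefetch_limit = prefetch_limit
        self.reject_on_failure = reject_on_failure
        self.unsettled = 0  # messages fetched but neither acked nor rejected
        self.processed = 0

    def consume(self, queue):
        # Fetch only while the prefetch window has a free slot.
        while queue and self.unsettled < self.prefetch_limit:
            task = queue.popleft()
            self.unsettled += 1
            try:
                task()
            except Exception:
                if self.reject_on_failure:
                    self.unsettled -= 1  # rejecting frees the slot
                # Otherwise the slot stays occupied, as in the bug.
            else:
                self.unsettled -= 1  # acknowledging frees the slot
            self.processed += 1
        return self.processed


def failing_task():
    return 1 / 0  # always raises ZeroDivisionError


def run(reject_on_failure):
    queue = deque([failing_task] * 3)
    worker = ToyWorker(prefetch_limit=1, reject_on_failure=reject_on_failure)
    return worker.consume(queue)


print(run(reject_on_failure=False))  # 1: the worker stalls after the first failure
print(run(reject_on_failure=True))   # 3: every queued task is attempted
```

With a prefetch limit of 1, as in the reproduction's worker_prefetch_multiplier = 1, a single unsettled failure is enough to block the toy worker.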
Why This Breaks in Prod
- With acks_late=True and acks_on_failure_or_timeout=False, a failed task was neither acknowledged nor rejected, so the worker stopped consuming tasks from SQS
- Production symptom (often without a traceback): the worker stops consuming tasks from SQS after a task fails
Proof / Evidence
- GitHub issue: #5844
- Fix PR: https://github.com/celery/celery/pull/5843
- First fixed release: 4.4.0rc5
- Reproduced locally: No (not executed)
- Last verified: 2026-02-09
- Confidence: 0.85
- Did this fix it?: Yes (upstream fix exists)
Discussion
High-signal excerpts from the issue thread (symptoms, repros, edge-cases).
“can you send a PR with a test? did you try celery==4.4.0rc4?”
“Tried with celery==4.4.0rc4, reproduced test case: https://github.com/galCohen88/kombu/pull/1/files”
“Figured it out PR: https://github.com/celery/celery/pull/5843”
“I would need a little guidance, as I'm not sure how I can access QoS attributes (https://github.com/celery/kombu/blob/master/kombu/transport/virtual/base.py#L182) from Task context”
Failure Signature (Search String)
- SQS backend will stop consuming tasks after failures
Error Message
Signature-only (no traceback captured)
Minimal Reproduction
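from celery import Celery, Task

# Assumption: the broker URL is illustrative; the issue concerns the SQS transport.
app = Celery("repro", broker="sqs://")
app.conf.worker_prefetch_multiplier = 1

class BaseAsyncJobTask(Task):
    """Placeholder for the reporter's custom base class, which the issue does not show."""

@app.task(base=BaseAsyncJobTask, acks_late=True, acks_on_failure_or_timeout=False)
def dead_letter_q_task():
    return 1 / 0  # always raises ZeroDivisionError

# Enqueue three failing tasks; on affected versions the worker stops consuming after failures.
for _ in range(3):
    dead_letter_q_task.delay()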
What Broke
Workers stopped consuming tasks after a failure, leading to task backlog.
Why It Broke
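With acks_late=True and acks_on_failure_or_timeout=False, the worker neither acknowledged nor rejected a failed task, so the message stayed in the prefetched local queue and the worker stopped consuming tasks from SQS.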
Fix Options (Details)
Option A: Upgrade to fixed release (safe default, recommended)
pip install celery==4.4.0rc5
Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.
Fix reference: https://github.com/celery/celery/pull/5843
First fixed release: 4.4.0rc5
Last verified: 2026-02-09. Validate in your environment.
When NOT to Use This Fix
- Do not use this fix if your application relies on tasks being retried after failure.
Verify Fix
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
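Before re-running, it can help to confirm that the environment actually contains the first fixed release. The following is a minimal sketch under stated assumptions: `parse_version` and `has_fix` are hypothetical helpers, and the parser only understands `X.Y.Z` and `X.Y.ZrcN` version strings.

```python
import re

FIRST_FIXED = "4.4.0rc5"


def parse_version(text):
    """Parse 'X.Y.Z' or 'X.Y.ZrcN' into a sortable tuple.

    Simplified: handles only final releases and rc pre-releases.
    """
    match = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?", text)
    if match is None:
        raise ValueError(f"unsupported version string: {text!r}")
    major, minor, patch, rc = match.groups()
    # A final release sorts after every release candidate of the same version.
    rc_rank = float("inf") if rc is None else int(rc)
    return (int(major), int(minor), int(patch), rc_rank)


def has_fix(installed):
    """Return True if the installed version is at or after the first fixed release."""
    return parse_version(installed) >= parse_version(FIRST_FIXED)


print(has_fix("4.4.0rc4"))  # False
print(has_fix("4.4.0rc5"))  # True
```

In a real deployment the argument would typically be `celery.__version__`. For version strings outside this simplified pattern, a full PEP 440 parser such as the `packaging` library is the safer choice.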
Prevention
- Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
- Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
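The second bullet can be sketched as a small sliding-window counter. This is an illustrative, framework-independent sketch: `FailureMonitor`, its parameters, and the threshold values are hypothetical, and wiring it to a real worker is not shown.

```python
from collections import Counter, deque


class FailureMonitor:
    """Counts task failures and retries by reason over a sliding time window."""

    def __init__(self, window_seconds, threshold):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self.events = deque()  # (timestamp, task_name, reason)

    def record(self, timestamp, task_name, reason):
        self.events.append((timestamp, task_name, reason))
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            self.events.popleft()

    def counts(self):
        """Return a Counter keyed by (task_name, reason) for the current window."""
        return Counter((name, reason) for _, name, reason in self.events)

    def is_spiking(self):
        """Return True when the window holds at least `threshold` events."""
        return len(self.events) >= self.threshold


monitor = FailureMonitor(window_seconds=60, threshold=3)
for ts in (0, 10, 20):
    monitor.record(ts, "dead_letter_q_task", "ZeroDivisionError")
print(monitor.is_spiking())  # True
```

In a Celery application, `record` could be called from handlers connected to the `task_failure` and `task_retry` signals in `celery.signals`, with the alert fired whenever `is_spiking()` returns True.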
Version Compatibility Table
| Version | Status |
|---|---|
| 4.4.0rc4 | Affected (reproduced in the issue thread) |
| 4.4.0rc5 | Fixed |
Related Issues
No related fixes found.
Sources
We don’t republish the full GitHub discussion text. Use the links above for context.