
The Fix

pip install celery==4.4.0rc5

Based on closed celery/celery issue #5299; the fix PR and commit are linked under Proof / Evidence below.

Production note: This usually shows up under retries/timeouts. Treat it as a side-effect risk until you can verify behavior with a canary + real traffic.


Why This Fix Works in Production

  • Trigger: <built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
  • Mechanism: the worker process spins at 100% CPU because the epoll loop keeps handling events for file descriptors that are no longer valid.
  • Why the fix works: the upstream PR adds a regression test for the connection-loss scenario and changes the code to handle bad file descriptors more safely (first fixed release: 4.4.0rc5).
Production impact:
  • If left unfixed, tail latency can spike under load and surface as timeouts/retries (amplifying incident impact).

Why This Breaks in Prod

  • Shows up under Python 3.6 in real deployments (not just unit tests).
  • The worker process consumes 100% CPU due to excessive event handling in the epoll loop.
  • Surfaces as a traceback through kombu's epoll unregister path (see Error Message below).
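The busy-loop mechanism can be sketched outside celery with epoll directly. This is a minimal illustration, not celery code, and it is Linux-only because `select.epoll` only exists on Linux:

```python
import os
import select

# Minimal sketch of the busy-loop mechanism (not celery code; Linux-only).
# Once the far end of a registered pipe is closed, epoll reports EPOLLHUP
# on every poll() call, so a loop that never unregisters the dead fd spins
# at 100% CPU instead of blocking until the timeout.
r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)
os.close(w)  # peer disappears -- the "connection loss" case

for _ in range(3):  # a real event loop would spin here forever
    events = ep.poll(timeout=1)  # returns immediately with EPOLLHUP set
    print(events)

ep.unregister(r)  # what the fix ensures: dead fds leave the loop
ep.close()
os.close(r)
```

Each `poll()` call returns instantly with an `EPOLLHUP` event for the dead descriptor, which is why the worker never sleeps.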

Proof / Evidence

  • GitHub issue: #5299
  • Fix PR: https://github.com/celery/celery/pull/5499
  • First fixed release: 4.4.0rc5
  • Reproduced locally: No (not executed)
  • Last verified: 2026-02-09
  • Confidence: 0.75
  • Did this fix it?: Yes (upstream fix exists)
  • Own content ratio: 0.29

Discussion

High-signal excerpts from the issue thread (symptoms, repros, edge-cases).

“Unfortunately it does not help”
@tuky · 2019-05-15 · source
“So, on_inqueue_close is being called successfully, but everytime, at least once afterwards, on_process_alive is being called and the line https://github.com/celery/celery/blob/e7ae4290ef044de4ead45314d8fe2b190e497322/celery/concurrency/asynpool.py#L1083 adds t”
@tuky · 2019-05-20 · source
“> Can you please verify? I'm not sure, I understand. Do you want me to verify, if _join_exited_workers is being called?”
@tuky · 2019-05-21 · source
“I analysed the loop with lots of log statements, but all i can say, is that it most of the time ran into https://github.com/celery/kombu/blob/master/kombu/asynchronous/hub.py#L362, which…”
@tuky · 2019-05-21 · source

Failure Signature (Search String)

  • <built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
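To check whether a deployment is hitting this, the signature can be searched for in worker logs or gdb dumps. The log path below is a placeholder; point it at wherever your workers log:

```shell
# Search worker logs / gdb py-bt dumps for the busy-loop signature.
# /var/log/celery/ is a placeholder path; adjust to your deployment.
grep -R "unregister of select.epoll" /var/log/celery/
```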

Error Message

Stack trace
error.txt
Traceback (most recent call first):
  <built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
  File "$ENV/lib/python3.6/site-packages/kombu/utils/eventio.py", line 75, in unregister
    self._epoll.unregister(fd)
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 243, in _unregister
    self.poller.unregister(fd)
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 160, in _remove_from_loop
    self._unregister(fd)
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 181, in remove
    self._remove_from_loop(fd)
  File "$ENV/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 721, in <listcomp>
    [hub_remove(fd) for fd in diff(active_writes)]
  File "$ENV/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 721, in on_poll_start
    [hub_remove(fd) for fd in diff(active_writes)]
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 295, in create_loop
    tick_callback()
  <built-in method next of module object at remote 0x7fadcaf37638>
  File "$ENV/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "$ENV/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "$ENV/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    s ... (truncated) ...
Stack trace
error.txt
Traceback (most recent call last):
  File ".../kombu/asynchronous/hub.py", line 243, in _unregister
    self.poller.unregister(fd)
  File ".../kombu/utils/eventio.py", line 78, in unregister
    self._epoll.unregister(fd)
FileNotFoundError: [Errno 2] No such file or directory
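The FileNotFoundError traceback can be reproduced in isolation: on Linux, unregistering a file descriptor that is not (or no longer) in the epoll interest set fails with ENOENT, which Python surfaces as exactly this exception. A minimal sketch, not celery code:

```python
import os
import select

# Minimal sketch (not celery code; Linux-only): epoll_ctl(EPOLL_CTL_DEL)
# on an fd that is not in the interest set fails with ENOENT, which
# Python raises as FileNotFoundError -- the same error kombu hits when
# it tries to unregister an fd that was already removed.
r, w = os.pipe()
ep = select.epoll()
try:
    ep.unregister(r)  # fd is open but was never registered -> ENOENT
except FileNotFoundError as exc:
    print(exc)  # [Errno 2] No such file or directory
finally:
    ep.close()
    os.close(r)
    os.close(w)
```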

Minimal Reproduction

No runnable reproduction script was published; the issue provides the environment report below (output of `celery report` on the affected deployment):

software -> celery:4.2.0 (windowlicker) kombu:4.2.2-post1 py:3.6.6
            billiard:3.5.0.5 sqs:N/A
platform -> system:Linux arch:64bit, ELF
            kernel version:3.13.0-139-generic imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:sqs results:disabled

broker_url: 'sqs://localhost//'
include: [...]
worker_hijack_root_logger: False
task_serializer: 'json'
result_expires: 3600
accept_content: ['json']
result_serializer: 'json'
timezone: 'Europe/Berlin'
enable_utc: True
broker_transport_options: {
    'polling_interval': 1,
    'region': 'eu-west-1',
    'visibility_timeout': 10860}
task_ignore_result: True
task_acks_late: True
worker_prefetch_multiplier: 1
worker_max_tasks_per_child: 10
worker_pool: 'celery.concurrency.prefork:TaskPool'
task_time_limit: 10800
worker_enable_remote_control: False
worker_send_task_events: False
task_default_queue: 'celery'

Environment

  • Python: 3.6

What Broke

Workers experience high CPU usage, leading to performance degradation and potential task delays.

Why It Broke

The worker process consumes 100% CPU due to excessive event handling in the epoll loop

Fix Options (Details)

Option A — Upgrade to a fixed release (safe default, recommended)

pip install celery==4.4.0rc5

When NOT to use: Do not apply this fix if using a different broker or concurrency model that does not involve epoll.

Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.

Fix reference: https://github.com/celery/celery/pull/5499

First fixed release: 4.4.0rc5

Last verified: 2026-02-09. Validate in your environment.
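If the upgrade cannot be rolled out everywhere at once, a deploy-time guard can at least refuse versions known to busy-loop. The helper below is hypothetical (not part of celery); it orders simple `X.Y.ZrcN` strings the way pip does for these cases:

```python
import re

# Hypothetical deploy-time guard (not part of celery): order versions like
# "4.4.0rc5" so a deployment can refuse releases without the epoll fix.
# A release candidate sorts before its final release, matching PEP 440
# ordering for these simple version strings.
def _key(version: str):
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?", version)
    if m is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, rc = m.groups()
    # Final releases get an "infinite" rc number so that 4.4.0rc5 < 4.4.0.
    return (int(major), int(minor), int(patch),
            float("inf") if rc is None else int(rc))

def has_epoll_fix(celery_version: str) -> bool:
    return _key(celery_version) >= _key("4.4.0rc5")

print(has_epoll_fix("4.2.0"))     # False: affected by the busy loop
print(has_epoll_fix("4.4.0rc5"))  # True: first fixed release
```

For anything beyond these simple version strings, the `packaging` library's `Version` class is the robust way to compare PEP 440 versions.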



Verify Fix

verify
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.


Prevention

  • Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
  • Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
  • Alert on workers pinned near 100% CPU while processing no tasks; that is the signature of this busy loop.

Version Compatibility Table

Version     Status
4.4.0rc5    Fixed

Related Issues

No related fixes found.

Sources

We don’t republish the full GitHub discussion text. Use the links above for context.