The Fix
pip install celery==4.4.0rc5
Based on closed celery/celery issue #5299; the fix PR and commit are linked under Proof / Evidence below.
Production note: This usually shows up under retries/timeouts. Treat it as a side-effect risk until you can verify behavior with a canary + real traffic.
Why This Fix Works in Production
- Trigger: the worker's event loop repeatedly calls `epoll.unregister` on stale file descriptors (gdb frame: `<built-in method unregister of select.epoll object at remote 0x7fadac9b8600>`)
- Mechanism: each tick of the loop re-attempts the failed unregister, so the worker spins at 100% CPU instead of blocking in poll
- Why the fix works: the fix PR adds a regression test for the connection-loss scenario and changes the teardown path to tolerate bad or already-removed file descriptors instead of erroring on them (first fixed release: 4.4.0rc5).
- If left unfixed, tail latency can spike under load and surface as timeouts/retries, amplifying incident impact.
Why This Breaks in Prod
- Shows up under Python 3.6 in real deployments (not just unit tests).
- The worker pins a core at 100% CPU because the epoll loop keeps reprocessing the same stale file descriptors.
- Surfaces as: a gdb backtrace ("Traceback (most recent call first)") cycling through `epoll.unregister`; see Error Message below.
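The busy loop hinges on `epoll.unregister` failing for a file descriptor that is no longer in the interest set. A minimal stdlib-only sketch of that underlying error (Linux only; this is not Celery code):

```python
import os
import select

# Register a pipe fd with epoll, remove it, then remove it again.
# The second removal hits EPOLL_CTL_DEL with ENOENT, which Python
# raises as FileNotFoundError -- the same exception seen in the
# kombu traceback below.
ep = select.epoll()
r, w = os.pipe()
ep.register(r, select.EPOLLIN)

ep.unregister(r)          # first removal succeeds
try:
    ep.unregister(r)      # fd no longer registered
    err = None
except FileNotFoundError as exc:
    err = exc.errno       # 2 (ENOENT)

print(err)

ep.close()
os.close(r)
os.close(w)
```

If a loop retries this call every tick instead of giving up, it never blocks and burns a full core.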
Proof / Evidence
- GitHub issue: #5299
- Fix PR: https://github.com/celery/celery/pull/5499
- First fixed release: 4.4.0rc5
- Reproduced locally: No (not executed)
- Last verified: 2026-02-09
- Confidence: 0.75
- Did this fix it?: Yes (upstream fix exists)
- Own content ratio: 0.29
Discussion
High-signal excerpts from the issue thread (symptoms, repros, edge-cases).
“Unfortunately it does not help”
“So, on_inqueue_close is being called successfully, but everytime, at least once afterwards, on_process_alive is being called and the line https://github.com/celery/celery/blob/e7ae4290ef044de4ead45314d8fe2b190e497322/celery/concurrency/asynpool.py#L1083 adds t”
“> Can you please verify? I'm not sure, I understand. Do you want me to verify, if _join_exited_workers is being called?”
“I analysed the loop with lots of log statements, but all i can say, is that it most of the time ran into https://github.com/celery/kombu/blob/master/kombu/asynchronous/hub.py#L362, which…”
Failure Signature (Search String)
- <built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
Stack trace (gdb)
-----------------
Traceback (most recent call first):
<built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
File "$ENV/lib/python3.6/site-packages/kombu/utils/eventio.py", line 75, in unregister
self._epoll.unregister(fd)
File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 243, in _unregister
self.poller.unregister(fd)
File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 160, in _remove_from_loop
self._unregister(fd)
File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 181, in remove
self._remove_from_loop(fd)
File "$ENV/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 721, in <listcomp>
[hub_remove(fd) for fd in diff(active_writes)]
File "$ENV/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 721, in on_poll_start
[hub_remove(fd) for fd in diff(active_writes)]
File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 295, in create_loop
tick_callback()
<built-in method next of module object at remote 0x7fadcaf37638>
File "$ENV/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
next(loop)
File "$ENV/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
c.loop(*c.loop_args())
File "$ENV/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
s
... (truncated) ...
Stack trace (Python)
--------------------
Traceback (most recent call last):
File ".../kombu/asynchronous/hub.py", line 243, in _unregister
self.poller.unregister(fd)
File ".../kombu/utils/eventio.py", line 78, in unregister
self._epoll.unregister(fd)
FileNotFoundError: [Errno 2] No such file or directory
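The hardening in the fix amounts to tolerating "already gone" file descriptors during teardown. A simplified illustration of that idea (`safe_unregister` is a hypothetical helper, not Celery's actual code):

```python
import errno
import os
import select

def safe_unregister(poller, fd):
    """Unregister fd, swallowing 'already gone' conditions (illustrative only)."""
    try:
        poller.unregister(fd)
    except FileNotFoundError:
        pass  # ENOENT: fd not in the interest set (e.g. double removal)
    except OSError as exc:
        if exc.errno != errno.EBADF:
            raise  # EBADF means the fd was already closed; anything else is real

ep = select.epoll()
r, w = os.pipe()
ep.register(r, select.EPOLLIN)
safe_unregister(ep, r)  # normal removal
safe_unregister(ep, r)  # double removal: tolerated instead of raising
ep.close()
os.close(r)
os.close(w)
```

With the error swallowed, the loop's fd-cleanup pass completes instead of failing and retrying forever.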
Minimal Reproduction
No standalone reproduction script was provided; the reporter's `celery report` output below captures the failing environment.
software -> celery:4.2.0 (windowlicker) kombu:4.2.2-post1 py:3.6.6
billiard:3.5.0.5 sqs:N/A
platform -> system:Linux arch:64bit, ELF
kernel version:3.13.0-139-generic imp:CPython
loader -> celery.loaders.app.AppLoader
settings -> transport:sqs results:disabled
broker_url: 'sqs://localhost//'
include: [...]
worker_hijack_root_logger: False
task_serializer: 'json'
result_expires: 3600
accept_content: ['json']
result_serializer: 'json'
timezone: 'Europe/Berlin'
enable_utc: True
broker_transport_options: {
'polling_interval': 1,
'region': 'eu-west-1',
'visibility_timeout': 10860}
task_ignore_result: True
task_acks_late: True
worker_prefetch_multiplier: 1
worker_max_tasks_per_child: 10
worker_pool: 'celery.concurrency.prefork:TaskPool'
task_time_limit: 10800
worker_enable_remote_control: False
worker_send_task_events: False
task_default_queue: 'celery'
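For reference, the report above corresponds roughly to a Celery app configured as follows. This is a sketch reconstructed from the report: the app name is a placeholder, the `include` list is omitted (as in the report), and all setting values come from the issue.

```python
from celery import Celery

# Hypothetical reconstruction of the reporter's configuration.
app = Celery("repro", broker="sqs://localhost//")
app.conf.update(
    task_serializer="json",
    result_serializer="json",
    accept_content=["json"],
    timezone="Europe/Berlin",
    enable_utc=True,
    broker_transport_options={
        "polling_interval": 1,
        "region": "eu-west-1",
        "visibility_timeout": 10860,
    },
    task_ignore_result=True,
    task_acks_late=True,
    worker_prefetch_multiplier=1,
    worker_max_tasks_per_child=10,
    task_time_limit=10800,
    worker_enable_remote_control=False,
    worker_send_task_events=False,
    task_default_queue="celery",
)
```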
Environment
- Python: 3.6
What Broke
Workers experience high CPU usage, leading to performance degradation and potential task delays.
Why It Broke
The worker's epoll event loop repeatedly tries to unregister file descriptors that are already gone; the calls fail, the loop retries on the next tick, and the process spins at 100% CPU instead of blocking in poll.
Fix Options (Details)
Option A — Upgrade to fixed release (safe default, recommended)
pip install celery==4.4.0rc5
Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.
Fix reference: https://github.com/celery/celery/pull/5499
First fixed release: 4.4.0rc5
Last verified: 2026-02-09. Validate in your environment.
When NOT to Use This Fix
- Do not apply this fix if using a different broker or concurrency model that does not involve epoll.
Verify Fix
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
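After upgrading, confirm the installed version is at or past the fixed release. A rough stdlib-only comparison helper (hypothetical; for real projects prefer `packaging.version`, which implements PEP 440 correctly, including rc ordering):

```python
import re

def at_least(installed: str, required: str) -> bool:
    """Crude numeric compare treating '.', '-' and 'rc' as separators.
    Good enough for tags like '4.2.0' vs '4.4.0rc5'; NOT full PEP 440
    (e.g. it does not order a final release above its release candidates)."""
    def key(version: str):
        return tuple(int(p) for p in re.split(r"[.\-]|rc", version) if p.isdigit())
    return key(installed) >= key(required)

print(at_least("4.2.0", "4.4.0rc5"))     # False: still on an affected version
print(at_least("4.4.0rc5", "4.4.0rc5"))  # True: at the first fixed release
```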
Prevention
- Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
- Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
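The retry-instrumentation bullet can be as simple as a wrapper that records each attempt and its cause; all names here (`with_retries`, `flaky`) are hypothetical:

```python
import time

def with_retries(fn, attempts=3, delay=0.0, on_retry=None):
    """Call fn, retrying on failure; report (attempt, exception) via on_retry."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as exc:
            if on_retry is not None:
                on_retry(attempt, exc)
            if attempt == attempts:
                raise
            time.sleep(delay)

# Demo: a dependency that times out twice, then recovers.
events = []
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated dependency slowdown")
    return "ok"

result = with_retries(
    flaky, attempts=3,
    on_retry=lambda n, e: events.append((n, type(e).__name__)),
)
print(result, events)
```

Feeding `events` into a metrics pipeline and alerting on spikes gives early warning of dependency slowdowns before they become incidents.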
Version Compatibility Table
| Version | Status |
|---|---|
| 4.2.0 (reported) | Affected |
| 4.4.0rc5 | Fixed |
Related Issues
No related fixes found.
Sources
We don’t republish the full GitHub discussion text. Use the links above for context.