
The Fix

pip install celery==4.4.0rc5

Based on closed celery/celery issue #5299; the fix PR and commit are linked under Proof / Evidence below.

Production note: This usually shows up under retries/timeouts. Treat it as a side-effect risk until you can verify behavior with a canary + real traffic.


Why This Fix Works in Production

  • Trigger: <built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
  • Mechanism: the worker process spins at 100% CPU because the epoll loop keeps handling events for file descriptors that are no longer valid.
  • Why the fix works: the upstream PR adds a regression test for the connection-loss scenario and changes the code to handle bad file descriptors more safely (first fixed release: 4.4.0rc5).
Production impact:
  • If left unfixed, tail latency can spike under load and surface as timeouts/retries (amplifying incident impact).

Why This Breaks in Prod

  • Shows up under Python 3.6 in real deployments (not just unit tests).
  • The worker process consumes 100% CPU due to excessive event handling in the epoll loop.
  • Surfaces as a traceback through kombu's epoll unregister path (see Error Message below).
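The busy-loop mechanism can be sketched outside celery with epoll directly. This is a minimal illustration, not celery code, and it is Linux-only because `select.epoll` only exists on Linux:

```python
import os
import select

# Minimal sketch of the busy-loop mechanism (not celery code; Linux-only).
# Once the far end of a registered pipe is closed, epoll reports EPOLLHUP
# on every poll() call, so a loop that never unregisters the dead fd spins
# at 100% CPU instead of blocking until the timeout.
r, w = os.pipe()
ep = select.epoll()
ep.register(r, select.EPOLLIN)
os.close(w)  # peer disappears -- the "connection loss" case

for _ in range(3):  # a real event loop would spin here forever
    events = ep.poll(timeout=1)  # returns immediately with EPOLLHUP set
    print(events)

ep.unregister(r)  # what the fix ensures: dead fds leave the loop
ep.close()
os.close(r)
```

Each `poll()` call returns instantly with an `EPOLLHUP` event for the dead descriptor, which is why the worker never sleeps.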

Proof / Evidence

  • GitHub issue: #5299
  • Fix PR: https://github.com/celery/celery/pull/5499
  • First fixed release: 4.4.0rc5
  • Reproduced locally: No (not executed)
  • Last verified: 2026-02-09
  • Confidence: 0.75
  • Did this fix it?: Yes (upstream fix exists)
  • Own content ratio: 0.29

Discussion

High-signal excerpts from the issue thread (symptoms, repros, edge-cases).

“Unfortunately it does not help”
@tuky · 2019-05-15 · source
“So, on_inqueue_close is being called successfully, but everytime, at least once afterwards, on_process_alive is being called and the line https://github.com/celery/celery/blob/e7ae4290ef044de4ead45314d8fe2b190e497322/celery/concurrency/asynpool.py#L1083 adds t”
@tuky · 2019-05-20 · source
“> Can you please verify? I'm not sure, I understand. Do you want me to verify, if _join_exited_workers is being called?”
@tuky · 2019-05-21 · source
“I analysed the loop with lots of log statements, but all i can say, is that it most of the time ran into https://github.com/celery/kombu/blob/master/kombu/asynchronous/hub.py#L362, which…”
@tuky · 2019-05-21 · source

Failure Signature (Search String)

  • <built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
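To check whether a deployment is hitting this, the signature can be searched for in worker logs or gdb dumps. The log path below is a placeholder; point it at wherever your workers log:

```shell
# Search worker logs / gdb py-bt dumps for the busy-loop signature.
# /var/log/celery/ is a placeholder path; adjust to your deployment.
grep -R "unregister of select.epoll" /var/log/celery/
```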

Error Message

Stack trace
error.txt
Traceback (most recent call first):
  <built-in method unregister of select.epoll object at remote 0x7fadac9b8600>
  File "$ENV/lib/python3.6/site-packages/kombu/utils/eventio.py", line 75, in unregister
    self._epoll.unregister(fd)
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 243, in _unregister
    self.poller.unregister(fd)
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 160, in _remove_from_loop
    self._unregister(fd)
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 181, in remove
    self._remove_from_loop(fd)
  File "$ENV/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 721, in <listcomp>
    [hub_remove(fd) for fd in diff(active_writes)]
  File "$ENV/lib/python3.6/site-packages/celery/concurrency/asynpool.py", line 721, in on_poll_start
    [hub_remove(fd) for fd in diff(active_writes)]
  File "$ENV/lib/python3.6/site-packages/kombu/asynchronous/hub.py", line 295, in create_loop
    tick_callback()
  <built-in method next of module object at remote 0x7fadcaf37638>
  File "$ENV/lib/python3.6/site-packages/celery/worker/loops.py", line 91, in asynloop
    next(loop)
  File "$ENV/lib/python3.6/site-packages/celery/worker/consumer/consumer.py", line 596, in start
    c.loop(*c.loop_args())
  File "$ENV/lib/python3.6/site-packages/celery/bootsteps.py", line 119, in start
    s ... (truncated) ...
Stack trace
error.txt
Traceback (most recent call last):
  File ".../kombu/asynchronous/hub.py", line 243, in _unregister
    self.poller.unregister(fd)
  File ".../kombu/utils/eventio.py", line 78, in unregister
    self._epoll.unregister(fd)
FileNotFoundError: [Errno 2] No such file or directory
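The FileNotFoundError traceback can be reproduced in isolation: on Linux, unregistering a file descriptor that is not (or no longer) in the epoll interest set fails with ENOENT, which Python surfaces as exactly this exception. A minimal sketch, not celery code:

```python
import os
import select

# Minimal sketch (not celery code; Linux-only): epoll_ctl(EPOLL_CTL_DEL)
# on an fd that is not in the interest set fails with ENOENT, which
# Python raises as FileNotFoundError -- the same error kombu hits when
# it tries to unregister an fd that was already removed.
r, w = os.pipe()
ep = select.epoll()
try:
    ep.unregister(r)  # fd is open but was never registered -> ENOENT
except FileNotFoundError as exc:
    print(exc)  # [Errno 2] No such file or directory
finally:
    ep.close()
    os.close(r)
    os.close(w)
```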

Minimal Reproduction

No runnable reproduction script was published; the issue provides the environment report below (output of `celery report` on the affected deployment):

software -> celery:4.2.0 (windowlicker) kombu:4.2.2-post1 py:3.6.6
            billiard:3.5.0.5 sqs:N/A
platform -> system:Linux arch:64bit, ELF
            kernel version:3.13.0-139-generic imp:CPython
loader   -> celery.loaders.app.AppLoader
settings -> transport:sqs results:disabled

broker_url: 'sqs://localhost//'
include: [...]
worker_hijack_root_logger: False
task_serializer: 'json'
result_expires: 3600
accept_content: ['json']
result_serializer: 'json'
timezone: 'Europe/Berlin'
enable_utc: True
broker_transport_options: {
    'polling_interval': 1,
    'region': 'eu-west-1',
    'visibility_timeout': 10860}
task_ignore_result: True
task_acks_late: True
worker_prefetch_multiplier: 1
worker_max_tasks_per_child: 10
worker_pool: 'celery.concurrency.prefork:TaskPool'
task_time_limit: 10800
worker_enable_remote_control: False
worker_send_task_events: False
task_default_queue: 'celery'

Environment

  • Python: 3.6

What Broke

Workers experience high CPU usage, leading to performance degradation and potential task delays.

Why It Broke

The worker process consumes 100% CPU due to excessive event handling in the epoll loop

Fix Options (Details)

Option A — Upgrade to a fixed release (safe default, recommended)

pip install celery==4.4.0rc5

When NOT to use: Do not apply this fix if using a different broker or concurrency model that does not involve epoll.

Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.

Fix reference: https://github.com/celery/celery/pull/5499

First fixed release: 4.4.0rc5

Last verified: 2026-02-09. Validate in your environment.
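If the upgrade cannot be rolled out everywhere at once, a deploy-time guard can at least refuse versions known to busy-loop. The helper below is hypothetical (not part of celery); it orders simple `X.Y.ZrcN` strings the way pip does for these cases:

```python
import re

# Hypothetical deploy-time guard (not part of celery): order versions like
# "4.4.0rc5" so a deployment can refuse releases without the epoll fix.
# A release candidate sorts before its final release, matching PEP 440
# ordering for these simple version strings.
def _key(version: str):
    m = re.fullmatch(r"(\d+)\.(\d+)\.(\d+)(?:rc(\d+))?", version)
    if m is None:
        raise ValueError(f"unsupported version string: {version!r}")
    major, minor, patch, rc = m.groups()
    # Final releases get an "infinite" rc number so that 4.4.0rc5 < 4.4.0.
    return (int(major), int(minor), int(patch),
            float("inf") if rc is None else int(rc))

def has_epoll_fix(celery_version: str) -> bool:
    return _key(celery_version) >= _key("4.4.0rc5")

print(has_epoll_fix("4.2.0"))     # False: affected by the busy loop
print(has_epoll_fix("4.4.0rc5"))  # True: first fixed release
```

For anything beyond these simple version strings, the `packaging` library's `Version` class is the robust way to compare PEP 440 versions.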



Verify Fix

verify
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.


Prevention

  • Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
  • Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
  • Alert on workers pinned near 100% CPU while processing no tasks; that is the signature of this busy loop.

Version Compatibility Table

Version     Status
4.4.0rc5    Fixed

Related Issues

No related fixes found.

Sources

We don’t republish the full GitHub discussion text. Use the links above for context.