
The Fix

Refactor connection waiters to be cancellation safe, fixing a deadlock that could occur while attempting to get a new connection slot after a timeout.

Based on closed aio-libs/aiohttp issue #9670; fixed by PR #9671.

Production note: This tends to surface only under concurrency. Reproduce with load tests and watch for lock contention/cancellation paths.


Why This Fix Works in Production

  • Trigger: deadlock with TCPConnector limit after timeout
  • Mechanism: a timeout can cancel a request while its waiter future is still queued in the connector's self._waiters; the freed connection slot can then be handed to that already-cancelled waiter, so the slot is lost and later requests queue indefinitely ("sticky" timeouts)
Production impact:
  • If left unfixed, failures can be intermittent under concurrency (hard to reproduce; shows up as sporadic 5xx/timeouts).

Why This Breaks in Prod

  • Shows up under Python 3.12.7 in real deployments (not just unit tests).
  • A request cancelled by a timeout can leave a dead waiter future in TCPConnector's self._waiters; a freed connection slot handed to that waiter is lost, so new requests are queued but never actually sent
  • Production symptom (often without a traceback): deadlock with TCPConnector limit after timeout

Proof / Evidence

Discussion

High-signal excerpts from the issue thread (symptoms, repros, edge-cases).

“It would be great if you could come up with a reproducer without external dependencies as we will need to be able to create a…”
@bdraco · 2024-11-04 · source
“It looks like there are two race points the available and key in self._waiters --- could have cancelled futures there The ValueError suppression is a…”
@bdraco · 2024-11-04 · source
“Looks like its been a problem for a long time. reproducible on 3.9.5 as well”
@bdraco · 2024-11-04 · source
“I'm not sure its fixable with the current design. key in self._waiters its not atomic in respect to if the connection is available so its…”
@bdraco · 2024-11-04 · source

Failure Signature (Search String)

  • deadlock with TCPConnector limit after timeout
  • When using a limit in TCPConnector, timeouts can lead to a condition where new requests are not actually sent, resulting in "sticky" timeouts.
Copy-friendly signature
signature.txt
Failure Signature
-----------------
deadlock with TCPConnector limit after timeout

When using a limit in TCPConnector, timeouts can lead to a condition where new requests are not actually sent, resulting in "sticky" timeouts.

Error Message

Signature-only (no traceback captured)
error.txt
Error Message
-------------
deadlock with TCPConnector limit after timeout

When using a limit in TCPConnector, timeouts can lead to a condition where new requests are not actually sent, resulting in "sticky" timeouts.

Minimal Reproduction

repro.py
from fastapi import FastAPI
import asyncio
from time import time

app = FastAPI()

sleep_time = 0.3
timestamp_last_change = time()
last_recv = time()

@app.get("/getme")
async def getme():
    global sleep_time, last_recv
    if time() - last_recv > 8:
        # Go "up", forcing 0 sleep time if we don't know anymore about clients
        sleep_time = 0
    elif time() - timestamp_last_change > 2:
        if sleep_time:
            sleep_time = 0
        else:
            sleep_time = 0.3
    print(f"recv {sleep_time=:.3f}")
    last_recv = time()
    await asyncio.sleep(sleep_time)
    print(f"fin {sleep_time=:.3f}")
    return {
        "asdf": "asdf",
    }

Environment

  • Python: 3.12.7

What Broke

Requests time out but are never actually sent; once the connector's limit is consumed by lost slots, every subsequent request hangs, which presents as an outage in production environments.

Why It Broke

A timeout can cancel a request while its waiter future is still queued in TCPConnector's self._waiters. The freed connection slot may then be handed to the already-cancelled waiter, so the slot is lost and subsequent requests queue forever: the "sticky" timeout described in the issue.
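The race can be sketched with a stdlib-only toy pool. This is illustrative code for the mechanism only, not aiohttp's actual connector; ToyPool and its methods are invented for the example:

```python
import asyncio
from collections import deque

class ToyPool:
    # Toy stand-in for a limited connection pool. The release path hands
    # the freed slot to the first queued waiter without handling the case
    # where that waiter was already cancelled by a timeout.
    def __init__(self, limit: int = 1):
        self._limit = limit
        self._in_use = 0
        self._waiters: deque[asyncio.Future] = deque()

    async def acquire(self) -> None:
        if self._in_use < self._limit:
            self._in_use += 1
            return
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        await fut  # slot ownership is transferred in release()

    def release(self) -> None:
        if self._waiters:
            fut = self._waiters.popleft()
            if not fut.cancelled():
                fut.set_result(None)  # hand the slot to a live waiter
            # BUG: if the popped waiter was cancelled, the slot is handed
            # to nobody: _in_use is never decremented and no other waiter
            # is woken, so every later acquire() queues forever.
        else:
            self._in_use -= 1

async def main() -> bool:
    pool = ToyPool(limit=1)
    await pool.acquire()                    # take the only slot
    waiter = asyncio.create_task(pool.acquire())
    await asyncio.sleep(0)                  # let the waiter enqueue itself
    waiter.cancel()                         # simulate a client-side timeout
    await asyncio.sleep(0)
    pool.release()                          # freed slot is lost here
    late = asyncio.create_task(pool.acquire())
    done, _pending = await asyncio.wait({late}, timeout=0.1)
    late.cancel()
    return bool(done)                       # False: the pool is deadlocked

if __name__ == "__main__":
    print("late acquire completed:", asyncio.run(main()))
```

The "late acquire completed: False" result mirrors the production symptom: the pool still has capacity on paper, but no request ever gets a slot again.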

Fix Options (Details)

Option A — Apply the official fix

Refactor connection waiters to be cancellation safe, fixing a deadlock that could occur while attempting to get a new connection slot after a timeout.
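A minimal sketch of what "cancellation safe" means here, using the same toy-pool shape (illustrative only, not the actual aiohttp patch): the release path skips waiters that a timeout already cancelled, so the freed slot always reaches a live waiter or returns to the pool.

```python
import asyncio
from collections import deque

class FixedPool:
    # Toy pool whose release() is cancellation-safe: waiters already
    # cancelled by a timeout are skipped instead of being handed the
    # freed slot (illustrative sketch, not aiohttp's real code).
    def __init__(self, limit: int = 1):
        self._limit = limit
        self._in_use = 0
        self._waiters: deque[asyncio.Future] = deque()

    async def acquire(self) -> None:
        if self._in_use < self._limit:
            self._in_use += 1
            return
        fut = asyncio.get_running_loop().create_future()
        self._waiters.append(fut)
        await fut  # slot ownership is transferred in release()

    def release(self) -> None:
        while self._waiters:
            fut = self._waiters.popleft()
            if not fut.cancelled():
                fut.set_result(None)  # slot goes to a live waiter
                return
            # cancelled waiters are skipped; keep looking for a live one
        self._in_use -= 1  # nobody is waiting: return the slot to the pool

async def main() -> bool:
    pool = FixedPool(limit=1)
    await pool.acquire()                    # take the only slot
    waiter = asyncio.create_task(pool.acquire())
    await asyncio.sleep(0)                  # let the waiter enqueue itself
    waiter.cancel()                         # simulate a client-side timeout
    await asyncio.sleep(0)
    pool.release()                          # cancelled waiter is skipped
    late = asyncio.create_task(pool.acquire())
    done, _pending = await asyncio.wait({late}, timeout=0.1)
    return bool(done)                       # True: no deadlock

if __name__ == "__main__":
    print("late acquire completed:", asyncio.run(main()))
```

With the same timeout-then-release sequence as before, the late request now completes, which is the behavioral change the official fix delivers.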

When NOT to use: This fix should not be used if the application requires strict timeout handling without cancellation.

Fix reference: https://github.com/aio-libs/aiohttp/pull/9671

Last verified: 2026-02-09. Validate in your environment.


When NOT to Use This Fix

  • This fix should not be used if the application requires strict timeout handling without cancellation.

Verify Fix

verify
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.


Prevention

  • Add a stress test that runs high-concurrency workloads and fails on thread dumps / blocked locks.
  • Enable watchdog dumps in prod (faulthandler, thread dump endpoint) to capture deadlocks quickly.
  • Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
  • Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
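The watchdog bullet above can be covered with the stdlib faulthandler module; a minimal sketch:

```python
import faulthandler
import sys

# Arm a watchdog: if the process is still running after `timeout` seconds,
# dump every thread's stack to stderr; repeat=True re-arms it so a
# long-lived deadlock keeps producing evidence in the logs.
faulthandler.dump_traceback_later(timeout=30, repeat=True, file=sys.stderr)

# ... run the application ...

# Disarm during clean shutdown so a normal exit does not dump stacks.
faulthandler.cancel_dump_traceback_later()
```

In a real service you would arm this once at startup; the periodic stack dumps make a wedged connector pool visible without needing to reproduce the race locally.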

Related Issues

No related fixes found.

Sources

We don’t republish the full GitHub discussion text. Use the links above for context.