The Fix
pip install redis==5.3.0
Based on closed redis/redis-py issue #3130; the fix PR and commit are linked below.
Production note: Watch p95/p99 latency and retry volume; timeouts can turn into retry storms and duplicate side-effects.
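As a concrete starting point for that monitoring, here is a minimal, framework-free sketch (the names `timed`, `percentile`, and `latencies_ms` are illustrative, not part of redis-py) that records per-call latency and computes p95/p99 from the samples:

```python
import statistics
import time

latencies_ms = []  # in production, ship these to your metrics system instead

def timed(fn):
    """Wrap a callable and record its wall-clock latency in milliseconds."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return wrapper

def percentile(q):
    """q-th percentile (1-99) of the recorded latency samples."""
    return statistics.quantiles(latencies_ms, n=100)[q - 1]
```

Wrap the pipeline call (e.g. `timed(pipeline.execute)()`) and alert when `percentile(99)` drifts toward your timeout budget, or when retry volume spikes.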
```diff
@@ -2185,7 +2185,7 @@ def _send_cluster_commands(
         try:
             connection = get_connection(redis_node, c.args)
-        except ConnectionError:
+        except (ConnectionError, TimeoutError):
             for n in nodes.values():
```
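If you cannot upgrade immediately, a similar guard can be approximated at the call site. This is a workaround sketch, not upstream-endorsed code: `execute_with_recovery` is a hypothetical helper, and it assumes a `RedisCluster` client whose `nodes_manager.initialize()` rebuilds the slot table (the same recovery step the fix performs internally).

```python
# Workaround sketch for redis-py versions before 5.3.0: catch TimeoutError
# at the call site and rebuild the cluster topology so the client recovers.
try:
    from redis.exceptions import ConnectionError, TimeoutError
except ImportError:
    pass  # fall back to the built-in exceptions if redis-py is absent

def execute_with_recovery(client, keys, retries=3):
    """Run a GET pipeline, refreshing the slot table on connection/timeout errors."""
    last_err = None
    for _ in range(retries):
        try:
            pipe = client.pipeline()
            for key in keys:
                pipe.get(key)
            return pipe.execute()
        except (ConnectionError, TimeoutError) as err:
            last_err = err
            # Rebuild the node/slot mapping before retrying.
            client.nodes_manager.initialize()
    raise last_err
```

Note that retrying a pipeline can re-execute commands that already reached the server; for non-idempotent operations, gate the retry (see Option D below).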
Why This Fix Works in Production
- Trigger: TimeoutError during ClusterPipeline makes the client unrecoverable
- Mechanism: The client does not recover from TimeoutError, causing connection pool exhaustion
- Why the fix works: The patch catches TimeoutError (in addition to ConnectionError) when getting a connection inside a pipeline and reinitializes the node/slot table, so the client can recover instead of wedging the pool (first fixed release: 5.3.0).
- If left unfixed, this can cause silent data inconsistencies that propagate (bad cache entries, incorrect downstream decisions).
Why This Breaks in Prod
- Shows up under Python 3.8 in real deployments (not just unit tests).
- The client does not recover from TimeoutError, causing connection pool exhaustion
- Surfaces as: TimeoutError during ClusterPipeline makes the client unrecoverable
Proof / Evidence
- GitHub issue: #3130
- Fix PR: https://github.com/redis/redis-py/pull/3513
- First fixed release: 5.3.0
- Reproduced locally: No (not executed)
- Last verified: 2026-02-08
- Confidence: 0.70
- Did this fix it?: Yes (upstream fix exists)
- Own content ratio: 0.64
Discussion
No high-signal excerpts (symptoms, repros, edge cases) were captured from the issue thread; the only archived comment is the stale-bot notice. See the linked issue for the full discussion.
Failure Signature (Search String)
- TimeoutError during ClusterPipeline makes the client unrecoverable
Error Message
TimeoutError during ClusterPipeline makes the client unrecoverable
Minimal Reproduction
```python
from redis.cluster import RedisCluster, ClusterNode
import random
import time

startup_node = ClusterNode('mystartupnode', '6379')
client = RedisCluster(startup_nodes=[startup_node])

while True:
    try:
        for _ in range(10):
            pipeline = client.pipeline()
            for key in [f"key-{random.randint(10000, 11000)}" for _ in range(50)]:
                pipeline.get(key)
            pipeline.execute()
    except Exception as error:
        print("Failure ", error)
        time.sleep(1)
```
Environment
- Python: 3.8
What Broke
TimeoutError leads to unrecoverable client state and connection pool blockage.
Why It Broke
The client does not recover from TimeoutError, causing connection pool exhaustion
Fix Options (Details)
Option A — Upgrade to fixed release (safe default, recommended)
pip install redis==5.3.0
Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.
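A startup guard can make the upgrade enforceable across deployments. This is a sketch (the helper names are ours, not redis-py's) that reads the installed version via `importlib.metadata` and fails fast if an affected release is running:

```python
import importlib.metadata

def parse_version(raw):
    """'5.3.0' -> (5, 3, 0); non-numeric suffixes are stripped loosely."""
    parts = []
    for piece in raw.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def assert_min_redis(min_version=(5, 3, 0)):
    """Fail fast at startup if an affected redis-py release is deployed."""
    raw = importlib.metadata.version("redis")  # raises if redis-py is absent
    if parse_version(raw) < min_version:
        raise RuntimeError(
            f"redis-py {raw} is affected by issue #3130; upgrade to >= 5.3.0"
        )
```

Call `assert_min_redis()` once at process startup so a bad pin is caught in CI or on deploy, not under production load.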
Option D — Guard side-effects with OnceOnly (guardrail for side-effects)
Mitigate duplicate external side-effects under retries/timeouts/agent loops by gating the operation before calling external systems.
- Place OnceOnly between your code/agent and real side-effects (Stripe, emails, CRM, APIs).
- Use a stable key per side-effect (e.g., customer_id + action + idempotency_key).
- Fail-safe: configure fail-open vs fail-closed based on blast radius and spend risk.
- This does NOT fix data corruption; it only prevents duplicate side-effects.
Example snippet:

```python
from onceonly import OnceOnly
import os

once = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"], fail_open=True)

def process_stripe_webhook(event_id):
    # Stable idempotency key per real side-effect.
    # Use a request id / job id / webhook delivery id / Stripe event id, etc.
    key = f"stripe:webhook:{event_id}"
    res = once.check_lock(key=key, ttl=3600)
    if res.duplicate:
        return {"status": "already_processed"}
    # Safe to execute the side-effect exactly once.
    handle_event(event_id)  # your business logic
```
Fix reference: https://github.com/redis/redis-py/pull/3513
First fixed release: 5.3.0
Last verified: 2026-02-08. Validate in your environment.
When NOT to Use This Fix
- Do not use this fix if the application requires strict error handling for TimeoutError.
- Do not use this to hide logic bugs or data corruption. Use it to block duplicate external side-effects and enforce tool permissions/spend caps.
Verify Fix
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
Prevention
- Add a stress test that runs high-concurrency workloads and fails on thread dumps / blocked locks.
- Enable watchdog dumps in prod (faulthandler, thread dump endpoint) to capture deadlocks quickly.
- Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
- Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
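For the last item, a minimal sketch of retry instrumentation (the names `with_retries` and `retry_reasons` are illustrative; wire the counter into your metrics system):

```python
import time
from collections import Counter

retry_reasons = Counter()  # e.g. exported as a metrics counter in production

def with_retries(fn, attempts=3, backoff_s=0.0):
    """Call fn, recording each failed attempt's exception type for alerting."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as err:
            retry_reasons[type(err).__name__] += 1
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)
```

An alert on a sudden spike in `retry_reasons["TimeoutError"]` would surface this class of bug before the connection pool is exhausted.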
Version Compatibility Table
| Version | Status |
|---|---|
| 5.3.0 | Fixed |
Related Issues
No related fixes found.
Sources
We don’t republish the full GitHub discussion text. Use the links above for context.