The Fix
pip install redis==5.3.0
Based on closed redis/redis-py issue #3130; the fix PR and commit are linked below.
Production note: Watch p95/p99 latency and retry volume; timeouts can turn into retry storms and duplicate side-effects.
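As a concrete starting point for that monitoring, here is a minimal, framework-free sketch (the names `timed`, `percentile`, and `latencies_ms` are illustrative, not part of redis-py) that records per-call latency and computes p95/p99 from the samples:

```python
import statistics
import time

latencies_ms = []  # in production, ship these to your metrics system instead

def timed(fn):
    """Wrap a callable and record its wall-clock latency in milliseconds."""
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        try:
            return fn(*args, **kwargs)
        finally:
            latencies_ms.append((time.perf_counter() - start) * 1000.0)
    return wrapper

def percentile(q):
    """q-th percentile (1-99) of the recorded latency samples."""
    return statistics.quantiles(latencies_ms, n=100)[q - 1]
```

Wrap the pipeline call (e.g. `timed(pipeline.execute)()`) and alert when `percentile(99)` drifts toward your timeout budget, or when retry volume spikes.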
```diff
@@ -2185,7 +2185,7 @@ def _send_cluster_commands(
         try:
             connection = get_connection(redis_node, c.args)
-        except ConnectionError:
+        except (ConnectionError, TimeoutError):
             for n in nodes.values():
```
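If you cannot upgrade immediately, a similar guard can be approximated at the call site. This is a workaround sketch, not upstream-endorsed code: `execute_with_recovery` is a hypothetical helper, and it assumes a `RedisCluster` client whose `nodes_manager.initialize()` rebuilds the slot table (the same recovery step the fix performs internally).

```python
# Workaround sketch for redis-py versions before 5.3.0: catch TimeoutError
# at the call site and rebuild the cluster topology so the client recovers.
try:
    from redis.exceptions import ConnectionError, TimeoutError
except ImportError:
    pass  # fall back to the built-in exceptions if redis-py is absent

def execute_with_recovery(client, keys, retries=3):
    """Run a GET pipeline, refreshing the slot table on connection/timeout errors."""
    last_err = None
    for _ in range(retries):
        try:
            pipe = client.pipeline()
            for key in keys:
                pipe.get(key)
            return pipe.execute()
        except (ConnectionError, TimeoutError) as err:
            last_err = err
            # Rebuild the node/slot mapping before retrying.
            client.nodes_manager.initialize()
    raise last_err
```

Note that retrying a pipeline can re-execute commands that already reached the server; for non-idempotent operations, gate the retry (see Option D below).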
Why This Fix Works in Production
- Trigger: TimeoutError during ClusterPipeline makes the client unrecoverable
- Mechanism: The client does not recover from TimeoutError, causing connection pool exhaustion
- Why the fix works: The patch catches TimeoutError (in addition to ConnectionError) when getting a connection inside a pipeline and reinitializes the node/slot table, so the client can recover instead of wedging the pool (first fixed release: 5.3.0).
- If left unfixed, this can cause silent data inconsistencies that propagate (bad cache entries, incorrect downstream decisions).
Why This Breaks in Prod
- Shows up under Python 3.8 in real deployments (not just unit tests).
- The client does not recover from TimeoutError, causing connection pool exhaustion
- Surfaces as: TimeoutError during ClusterPipeline makes the client unrecoverable
Proof / Evidence
- GitHub issue: #3130
- Fix PR: https://github.com/redis/redis-py/pull/3513
- First fixed release: 5.3.0
- Reproduced locally: No (not executed)
- Last verified: 2026-02-08
- Confidence: 0.70
- Did this fix it?: Yes (upstream fix exists)
- Own content ratio: 0.64
Discussion
No high-signal excerpts (symptoms, repros, edge cases) were captured from the issue thread; the only archived comment is the stale-bot notice. See the linked issue for the full discussion.
Failure Signature (Search String)
- TimeoutError during ClusterPipeline makes the client unrecoverable
Error Message
TimeoutError during ClusterPipeline makes the client unrecoverable
Minimal Reproduction
```python
from redis.cluster import RedisCluster, ClusterNode
import random
import time

startup_node = ClusterNode('mystartupnode', '6379')
client = RedisCluster(startup_nodes=[startup_node])

while True:
    try:
        for _ in range(10):
            pipeline = client.pipeline()
            for key in [f"key-{random.randint(10000, 11000)}" for _ in range(50)]:
                pipeline.get(key)
            pipeline.execute()
    except Exception as error:
        print("Failure ", error)
        time.sleep(1)
```
Environment
- Python: 3.8
What Broke
TimeoutError leads to unrecoverable client state and connection pool blockage.
Why It Broke
The client does not recover from TimeoutError, causing connection pool exhaustion
Fix Options (Details)
Option A — Upgrade to fixed release (safe default, recommended)
pip install redis==5.3.0
Use when you can deploy the upstream fix. It is usually lower-risk than long-lived workarounds.
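A startup guard can make the upgrade enforceable across deployments. This is a sketch (the helper names are ours, not redis-py's) that reads the installed version via `importlib.metadata` and fails fast if an affected release is running:

```python
import importlib.metadata

def parse_version(raw):
    """'5.3.0' -> (5, 3, 0); non-numeric suffixes are stripped loosely."""
    parts = []
    for piece in raw.split(".")[:3]:
        digits = "".join(ch for ch in piece if ch.isdigit())
        parts.append(int(digits) if digits else 0)
    return tuple(parts)

def assert_min_redis(min_version=(5, 3, 0)):
    """Fail fast at startup if an affected redis-py release is deployed."""
    raw = importlib.metadata.version("redis")  # raises if redis-py is absent
    if parse_version(raw) < min_version:
        raise RuntimeError(
            f"redis-py {raw} is affected by issue #3130; upgrade to >= 5.3.0"
        )
```

Call `assert_min_redis()` once at process startup so a bad pin is caught in CI or on deploy, not under production load.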
Option D — Guard side-effects with OnceOnly (guardrail for side-effects)
Mitigate duplicate external side-effects under retries/timeouts/agent loops by gating the operation before calling external systems.
- Place OnceOnly between your code/agent and real side-effects (Stripe, emails, CRM, APIs).
- Use a stable key per side-effect (e.g., customer_id + action + idempotency_key).
- Fail-safe: configure fail-open vs fail-closed based on blast radius and spend risk.
- This does NOT fix data corruption; it only prevents duplicate side-effects.
Example snippet:

```python
from onceonly import OnceOnly
import os

once = OnceOnly(api_key=os.environ["ONCEONLY_API_KEY"], fail_open=True)

def process_stripe_webhook(event_id):
    # Stable idempotency key per real side-effect.
    # Use a request id / job id / webhook delivery id / Stripe event id, etc.
    key = f"stripe:webhook:{event_id}"
    res = once.check_lock(key=key, ttl=3600)
    if res.duplicate:
        return {"status": "already_processed"}
    # Safe to execute the side-effect exactly once.
    handle_event(event_id)  # your business logic
```
Fix reference: https://github.com/redis/redis-py/pull/3513
First fixed release: 5.3.0
Last verified: 2026-02-08. Validate in your environment.
When NOT to Use This Fix
- Do not use this fix if the application requires strict error handling for TimeoutError.
- Do not use this to hide logic bugs or data corruption. Use it to block duplicate external side-effects and enforce tool permissions/spend caps.
Verify Fix
Re-run the minimal reproduction on your broken version, then apply the fix and re-run.
Prevention
- Add a stress test that runs high-concurrency workloads and fails on thread dumps / blocked locks.
- Enable watchdog dumps in prod (faulthandler, thread dump endpoint) to capture deadlocks quickly.
- Make timeouts explicit and test them (unit + integration) to avoid silent behavior changes.
- Instrument retries (attempt count + reason) and alert on spikes to catch dependency slowdowns.
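For the last item, a minimal sketch of retry instrumentation (the names `with_retries` and `retry_reasons` are illustrative; wire the counter into your metrics system):

```python
import time
from collections import Counter

retry_reasons = Counter()  # e.g. exported as a metrics counter in production

def with_retries(fn, attempts=3, backoff_s=0.0):
    """Call fn, recording each failed attempt's exception type for alerting."""
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception as err:
            retry_reasons[type(err).__name__] += 1
            if attempt == attempts:
                raise
            time.sleep(backoff_s * attempt)
```

An alert on a sudden spike in `retry_reasons["TimeoutError"]` would surface this class of bug before the connection pool is exhausted.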
Version Compatibility Table
| Version | Status |
|---|---|
| 5.3.0 | Fixed |
Related Issues
No related fixes found.
Sources
We don’t republish the full GitHub discussion text. Use the links above for context.