From: Sasha Levin <sashal@kernel.org>
To: patches@lists.linux.dev, stable@vger.kernel.org
Cc: "Håkon Bugge" <haakon.bugge@oracle.com>,
"Allison Henderson" <allison.henderson@oracle.com>,
"Jakub Kicinski" <kuba@kernel.org>,
"Sasha Levin" <sashal@kernel.org>,
netdev@vger.kernel.org, linux-rdma@vger.kernel.org,
rds-devel@oss.oracle.com
Subject: [PATCH AUTOSEL 6.19-5.10] net/rds: Clear reconnect pending bit
Date: Sat, 14 Feb 2026 16:22:42 -0500 [thread overview]
Message-ID: <20260214212452.782265-17-sashal@kernel.org> (raw)
In-Reply-To: <20260214212452.782265-1-sashal@kernel.org>
From: Håkon Bugge <haakon.bugge@oracle.com>
[ Upstream commit b89fc7c2523b2b0750d91840f4e52521270d70ed ]
When canceling the reconnect worker, care must be taken to reset the
reconnect-pending bit. If the reconnect worker has not yet been
scheduled before it is canceled, the reconnect-pending bit will stay
on forever.
Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
Link: https://patch.msgid.link/20260203055723.1085751-6-achender@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Sasha Levin <sashal@kernel.org>
---
LLM Generated explanations, may be completely bogus:
### 3. BUG MECHANISM — CLEAR AND CRITICAL
Now the full picture is clear:
**The flow:**
1. `rds_queue_reconnect()` at line 138 sets `RDS_RECONNECT_PENDING` and
queues the delayed work.
2. `rds_connect_worker()` (the worker function) at line 173 clears
`RDS_RECONNECT_PENDING` when it runs.
3. `rds_conn_path_connect_if_down()` at line 911 uses
`test_and_set_bit(RDS_RECONNECT_PENDING, ...)` — if the bit is
already set, it returns without queuing work, trusting that a
reconnect is already pending.
**The bug:**
In `rds_conn_shutdown()`, `cancel_delayed_work_sync()` cancels the
queued worker. If the worker hadn't started yet, it never runs
`clear_bit(RDS_RECONNECT_PENDING, ...)`. The bit stays set. Then when
`rds_conn_path_connect_if_down()` is later called (e.g., when trying to
send data), `test_and_set_bit` finds the bit already set and skips
queuing — **forever**. The connection can never reconnect.
This is a **permanent connection failure** bug — once triggered, the RDS
connection path is effectively dead until the system is rebooted or the
module is reloaded.
### 4. CLASSIFICATION
- **Bug type:** State corruption / logic bug leading to permanent loss
of network connectivity
- **Severity:** HIGH — RDS is used in production Oracle RAC clusters and
RDMA-based environments
- **Trigger:** Race between shutdown and reconnect scheduling —
realistic in production with network flaps
### 5. SCOPE AND RISK
- **Change size:** 1 functional line (plus 1 blank line) — extremely
small and surgical
- **Files changed:** 1 file (`net/rds/connection.c`)
- **Risk of regression:** Very low — `clear_bit` is idempotent. If the
worker already ran and cleared the bit, clearing it again is harmless.
If the worker didn't run, this is the correct fix.
- **The fix is placed correctly:** After `cancel_delayed_work_sync()`
guarantees the worker won't run, and before `rds_queue_reconnect()`
which will set the bit again if needed.
### 6. STABLE KERNEL CRITERIA
- **Obviously correct:** Yes — the logic is straightforward and well-
explained
- **Fixes a real bug:** Yes — permanent loss of RDS connectivity
- **Important issue:** Yes — affects network reliability for RDS users
(Oracle, RDMA environments)
- **Small and contained:** Yes — 1 line in 1 file
- **No new features:** Correct — purely a bug fix
- **No new APIs:** Correct
### 7. DEPENDENCY CHECK
This fix is self-contained. It only adds a `clear_bit()` call after an
existing `cancel_delayed_work_sync()` call. The `RDS_RECONNECT_PENDING`
flag and the surrounding code have been in the kernel for a long time.
This should apply cleanly to any stable tree that has the RDS subsystem.
### Conclusion
This is a textbook stable backport candidate: a tiny, obviously correct,
one-line fix for a serious bug (permanent connection failure) in
networking code used in production environments. The fix has zero risk
of regression due to the idempotent nature of `clear_bit`, and the bug
mechanism is clearly explained and verified through code analysis.
**YES**
net/rds/connection.c | 2 ++
1 file changed, 2 insertions(+)
diff --git a/net/rds/connection.c b/net/rds/connection.c
index ad8027e6f54ef..dbfea6fa11260 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -429,6 +429,8 @@ void rds_conn_shutdown(struct rds_conn_path *cp)
* to the conn hash, so we never trigger a reconnect on this
* conn - the reconnect is always triggered by the active peer. */
cancel_delayed_work_sync(&cp->cp_conn_w);
+
+ clear_bit(RDS_RECONNECT_PENDING, &cp->cp_flags);
rcu_read_lock();
if (!hlist_unhashed(&conn->c_hash_node)) {
rcu_read_unlock();
--
2.51.0
next prev parent reply other threads:[~2026-02-14 21:25 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
[not found] <20260214212452.782265-1-sashal@kernel.org>
2026-02-14 21:22 ` [PATCH AUTOSEL 6.19-5.10] myri10ge: avoid uninitialized variable use Sasha Levin
2026-02-14 21:22 ` [PATCH AUTOSEL 6.19-6.1] net: mctp-i2c: fix duplicate reception of old data Sasha Levin
2026-02-14 21:22 ` [PATCH AUTOSEL 6.19-6.12] net: wwan: mhi: Add network support for Foxconn T99W760 Sasha Levin
2026-02-14 21:22 ` Sasha Levin [this message]
2026-02-14 21:22 ` [PATCH AUTOSEL 6.19-6.12] ipv6: annotate data-races over sysctl.flowlabel_reflect Sasha Levin
2026-02-14 21:22 ` [PATCH AUTOSEL 6.19-5.15] ipv6: exthdrs: annotate data-race over multiple sysctl Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-5.10] octeontx2-af: Workaround SQM/PSE stalls by disabling sticky Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-5.10] vmw_vsock: bypass false-positive Wnonnull warning with gcc-16 Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-5.15] ipv6: annotate data-races in ip6_multipath_hash_{policy,fields}() Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-6.6] ipv4: igmp: annotate data-races around idev->mr_maxdelay Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-5.10] net/rds: No shortcut out of RDS_CONN_ERROR Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-6.18] ipv6: annotate data-races in net/ipv6/route.c Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-6.12] bnxt_en: Allow ntuple filters for drops Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-6.18] ptp: ptp_vmclock: add 'VMCLOCK' to ACPI device match Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-5.10] ipv4: fib: Annotate access to struct fib_alias.fa_state Sasha Levin
2026-02-14 21:23 ` [PATCH AUTOSEL 6.19-6.12] net: sfp: add quirk for Lantech 8330-265D Sasha Levin
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20260214212452.782265-17-sashal@kernel.org \
--to=sashal@kernel.org \
--cc=allison.henderson@oracle.com \
--cc=haakon.bugge@oracle.com \
--cc=kuba@kernel.org \
--cc=linux-rdma@vger.kernel.org \
--cc=netdev@vger.kernel.org \
--cc=patches@lists.linux.dev \
--cc=rds-devel@oss.oracle.com \
--cc=stable@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox