* [PATCH net-next v2 0/4] net/rds: RDS-TCP reconnect and fanout improvements
@ 2026-02-08 5:37 Allison Henderson
2026-02-08 5:37 ` [PATCH net-next v2 1/4] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
` (3 more replies)
0 siblings, 4 replies; 5+ messages in thread
From: Allison Henderson @ 2026-02-08 5:37 UTC (permalink / raw)
To: netdev
Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
linux-rdma, allison.henderson
Hi all,
This is subset 4 of the larger RDS-TCP patch series I posted last
Oct. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out of sequence messages. I've broken it down into
smaller sets to make reviews more manageable.
In this set, we address some reconnect issues occurring during connection
teardowns, and also move connection fanout operations to a background
worker.
The entire set can be viewed in the rfc here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
Questions, comments, flames appreciated!
Thanks,
Allison
Change Log
v2:
[PATCH net-next v2 1/4] net/rds: Refactor __rds_conn_create for
blocking transport cleanup
- NEW
[PATCH net-next v2 2/4] net/rds: Delegate fan-out to a background
worker
- Added syzbot report link
Allison Henderson (2):
net/rds: Refactor __rds_conn_create for blocking transport cleanup
net/rds: rds_sendmsg should not discard payload_len
Gerd Rausch (1):
net/rds: Delegate fan-out to a background worker
Greg Jumper (1):
net/rds: Use proper peer port number even when not connected
net/rds/connection.c | 32 ++++++++++++----------
net/rds/send.c | 6 +++--
net/rds/tcp.c | 3 +++
net/rds/tcp.h | 7 ++---
net/rds/tcp_connect.c | 2 ++
net/rds/tcp_listen.c | 63 ++++++++++++++++++++++++++++++++-----------
6 files changed, 76 insertions(+), 37 deletions(-)
--
2.43.0
* [PATCH net-next v2 1/4] net/rds: Refactor __rds_conn_create for blocking transport cleanup
From: Allison Henderson @ 2026-02-08 5:37 UTC (permalink / raw)
To: netdev
Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
linux-rdma, allison.henderson
The next patch will delegate fanout operations to a background worker,
which requires cancel_work_sync() during connection cleanup. However,
the error path of __rds_conn_create() currently calls
trans->conn_free() while holding rds_conn_lock (a spinlock) and
rcu_read_lock(). That is an atomic context, and cancel_work_sync()
may sleep, so it must not be called there.
To avoid this, refactor the error/race paths to defer
trans->conn_free() calls until after locks are released. This allows
transport cleanup functions to perform blocking operations safely.
This patch moves the cp_transport_data cleanup to the 'out:' label
where it runs outside the critical section, after the connection has
been freed from the slab and cannot be accessed by racing threads.
Signed-off-by: Allison Henderson <achender@kernel.org>
---
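The deferral described above can be sketched in userspace C. This is only an illustrative model under stated assumptions: `struct path`, `table_lock`, `blocking_free()` and `create_conn()` are hypothetical stand-ins for `struct rds_conn_path`, `rds_conn_lock`, `trans->conn_free()` and `__rds_conn_create()`, and the `lock_held` flag is bookkeeping that is only meaningful because the model is single-threaded.

```c
#include <assert.h>
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical stand-in for struct rds_conn_path; not the kernel type. */
struct path {
	void *transport_data;		/* models cp_transport_data */
};

static pthread_mutex_t table_lock = PTHREAD_MUTEX_INITIALIZER;
static bool lock_held;			/* bookkeeping for this single-threaded model */
static struct path *installed;		/* models the conn already in the hash table */
static int frees_under_lock;
static int frees_outside_lock;

/* Models trans->conn_free(): it may block, so it must run unlocked. */
static void blocking_free(void *data)
{
	if (lock_held)
		frees_under_lock++;
	else
		frees_outside_lock++;
	free(data);
}

/* Returns true if our paths were installed, false if we lost the race. */
static bool create_conn(size_t npaths)
{
	struct path *paths = calloc(npaths, sizeof(*paths));
	struct path *free_paths = NULL;
	size_t i;

	assert(paths);
	for (i = 0; i < npaths; i++) {
		paths[i].transport_data = malloc(16);
		assert(paths[i].transport_data);
	}

	pthread_mutex_lock(&table_lock);
	lock_held = true;
	if (installed)
		free_paths = paths;	/* lost the race: only record what to free */
	else
		installed = paths;
	lock_held = false;
	pthread_mutex_unlock(&table_lock);

	/* Cleanup runs here, after the lock is dropped, so it may block. */
	if (free_paths) {
		for (i = 0; i < npaths; i++)
			if (free_paths[i].transport_data)
				blocking_free(free_paths[i].transport_data);
		free(free_paths);
	}
	return free_paths == NULL;
}
```

Calling `create_conn(2)` twice makes the second call lose the race. Its two per-path buffers are then released only after `table_lock` has been dropped, which is the ordering the patch establishes at the `out:` label.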
net/rds/connection.c | 32 ++++++++++++++++++--------------
1 file changed, 18 insertions(+), 14 deletions(-)
diff --git a/net/rds/connection.c b/net/rds/connection.c
index 185f73b01694..695ab7446178 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -170,6 +170,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
struct hlist_head *head = rds_conn_bucket(laddr, faddr);
struct rds_transport *loop_trans;
struct rds_conn_path *free_cp = NULL;
+ struct rds_transport *free_trans = NULL;
unsigned long flags;
int ret, i;
int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
@@ -305,7 +306,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
if (parent) {
/* Creating passive conn */
if (parent->c_passive) {
- trans->conn_free(conn->c_path[0].cp_transport_data);
+ free_trans = trans;
free_cp = conn->c_path;
kmem_cache_free(rds_conn_slab, conn);
conn = parent->c_passive;
@@ -321,18 +322,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
found = rds_conn_lookup(net, head, laddr, faddr, trans,
tos, dev_if);
if (found) {
- struct rds_conn_path *cp;
- int i;
-
- for (i = 0; i < npaths; i++) {
- cp = &conn->c_path[i];
- /* The ->conn_alloc invocation may have
- * allocated resource for all paths, so all
- * of them may have to be freed here.
- */
- if (cp->cp_transport_data)
- trans->conn_free(cp->cp_transport_data);
- }
+ free_trans = trans;
free_cp = conn->c_path;
kmem_cache_free(rds_conn_slab, conn);
conn = found;
@@ -349,9 +339,23 @@ static struct rds_connection *__rds_conn_create(struct net *net,
out:
if (free_cp) {
- for (i = 0; i < npaths; i++)
+ for (i = 0; i < npaths; i++) {
+ /*
+ * The trans->conn_alloc call may have allocated
+ * resources for the cp paths, which will need to
+ * be freed before freeing cp itself. We do this here
+ * so it runs outside the rds_conn_lock spinlock
+ * and rcu_read_lock section, because conn_free()
+ * may call cancel_work_sync() which
+ * can sleep. free_trans is only set in the
+ * race-loss paths where conn_alloc() succeeded.
+ */
+ if (free_trans && free_cp[i].cp_transport_data)
+ free_trans->conn_free
+ (free_cp[i].cp_transport_data);
if (free_cp[i].cp_wq != rds_wq)
destroy_workqueue(free_cp[i].cp_wq);
+ }
kfree(free_cp);
}
--
2.43.0
* [PATCH net-next v2 2/4] net/rds: Delegate fan-out to a background worker
From: Allison Henderson @ 2026-02-08 5:37 UTC (permalink / raw)
To: netdev
Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
linux-rdma, allison.henderson
From: Gerd Rausch <gerd.rausch@oracle.com>
Delegate fan-out to a background worker in order to allow
kernel_getpeername() to acquire a lock on the socket.
This has become necessary since the introduction of
commit 9dfc685e0262d ("inet: remove races in inet{6}_getname()").
The socket is already locked in the context in which
kernel_getpeername() used to be called, by either
rds_tcp_recv_path() or tcp_v{4,6}_rcv(),
so taking the lock again causes a deadlock.
Luckily, the fan-out need not happen in-context nor fast,
so we can easily just do the same in a background worker.
Also, while we're doing this, we get rid of the unused
struct members "t_conn_w", "t_send_w", "t_down_w" & "t_recv_w".
Reported-by: syzbot+ci858e84e8400d24b3@syzkaller.appspotmail.com
Link: https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55
Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
---
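The delegation pattern can be modelled in a few lines of userspace C. This is a rough sketch rather than the kernel implementation: `struct work`, `queue_work_model()`, `run_pending()`, `get_peer_port()` and the port value 4242 are all hypothetical, the "workqueue" is a single pending flag, and the socket lock is a plain boolean that reports `-EDEADLK` where the real code would hang.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Userspace equivalent of the kernel's container_of() helper. */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/* Hypothetical one-slot model of a work item. */
struct work {
	void (*fn)(struct work *w);
	bool pending;
};

/* Hypothetical stand-in for struct rds_tcp_connection. */
struct tcp_conn {
	int peer_port;
	struct work fan_out_w;
};

static bool sock_locked;	/* models the non-recursive socket lock */

/* Models kernel_getpeername(): it needs to take the socket lock itself. */
static int get_peer_port(void)
{
	int port;

	if (sock_locked)
		return -EDEADLK;	/* the real code would deadlock here */
	sock_locked = true;
	port = 4242;			/* arbitrary example value */
	sock_locked = false;
	return port;
}

/* Worker body: runs later, when the caller's lock is no longer held. */
static void fan_out_worker(struct work *w)
{
	struct tcp_conn *tc = container_of(w, struct tcp_conn, fan_out_w);

	tc->peer_port = get_peer_port();
}

static void queue_work_model(struct work *w)
{
	w->pending = true;
}

static void run_pending(struct work *w)
{
	if (!w->pending)
		return;
	w->pending = false;
	w->fn(w);
}

/* Called with the socket lock held: only queue, never look up the peer. */
static void slots_available(struct tcp_conn *tc, bool fan_out)
{
	if (fan_out)
		queue_work_model(&tc->fan_out_w);
}
```

In this model, calling `get_peer_port()` while `sock_locked` is set fails, whereas `slots_available()` followed by `run_pending()` after the lock is released fills in `peer_port`. The `cancel_work_sync()` calls added by the patch have no counterpart in the sketch.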
net/rds/tcp.c | 3 +++
net/rds/tcp.h | 7 ++----
net/rds/tcp_connect.c | 2 ++
net/rds/tcp_listen.c | 54 +++++++++++++++++++++++++++++++------------
4 files changed, 46 insertions(+), 20 deletions(-)
diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 45484a93d75f..02f8f928c20b 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -358,6 +358,8 @@ static void rds_tcp_conn_free(void *arg)
rdsdebug("freeing tc %p\n", tc);
+ cancel_work_sync(&tc->t_fan_out_w);
+
spin_lock_irqsave(&rds_tcp_conn_lock, flags);
if (!tc->t_tcp_node_detached)
list_del(&tc->t_tcp_node);
@@ -384,6 +386,7 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
tc->t_tinc = NULL;
tc->t_tinc_hdr_rem = sizeof(struct rds_header);
tc->t_tinc_data_rem = 0;
+ INIT_WORK(&tc->t_fan_out_w, rds_tcp_fan_out_w);
init_waitqueue_head(&tc->t_recv_done_waitq);
conn->c_path[i].cp_transport_data = tc;
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 39c86347188c..9ecb0b6b658a 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -44,11 +44,7 @@ struct rds_tcp_connection {
size_t t_tinc_hdr_rem;
size_t t_tinc_data_rem;
- /* XXX error report? */
- struct work_struct t_conn_w;
- struct work_struct t_send_w;
- struct work_struct t_down_w;
- struct work_struct t_recv_w;
+ struct work_struct t_fan_out_w;
/* for info exporting only */
struct list_head t_list_item;
@@ -90,6 +86,7 @@ void rds_tcp_state_change(struct sock *sk);
struct socket *rds_tcp_listen_init(struct net *net, bool isv6);
void rds_tcp_listen_stop(struct socket *sock, struct work_struct *acceptor);
void rds_tcp_listen_data_ready(struct sock *sk);
+void rds_tcp_fan_out_w(struct work_struct *work);
void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out);
int rds_tcp_accept_one(struct rds_tcp_net *rtn);
void rds_tcp_keepalive(struct socket *sock);
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index b77c88ffb199..6954b8c479f1 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -115,6 +115,8 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
if (cp->cp_index > 0 && cp->cp_conn->c_npaths < 2)
return -EAGAIN;
+ cancel_work_sync(&tc->t_fan_out_w);
+
mutex_lock(&tc->t_conn_path_lock);
if (rds_conn_path_up(cp)) {
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 6fb5c928b8fd..8fb8f7d26683 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -123,27 +123,20 @@ rds_tcp_accept_one_path(struct rds_connection *conn, struct socket *sock)
return NULL;
}
-void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
+void rds_tcp_fan_out_w(struct work_struct *work)
{
- struct rds_tcp_connection *tc;
- struct rds_tcp_net *rtn;
- struct socket *sock;
+ struct rds_tcp_connection *tc = container_of(work,
+ struct rds_tcp_connection,
+ t_fan_out_w);
+ struct rds_connection *conn = tc->t_cpath->cp_conn;
+ struct rds_tcp_net *rtn = tc->t_rtn;
+ struct socket *sock = tc->t_sock;
int sport, npaths;
- if (rds_destroy_pending(conn))
- return;
-
- tc = conn->c_path->cp_transport_data;
- rtn = tc->t_rtn;
- if (!rtn)
- return;
-
- sock = tc->t_sock;
-
/* During fan-out, check that the connection we already
* accepted in slot#0 carried the proper source port modulo.
*/
- if (fan_out && conn->c_with_sport_idx && sock &&
+ if (conn->c_with_sport_idx && sock &&
rds_addr_cmp(&conn->c_laddr, &conn->c_faddr) > 0) {
/* cp->cp_index is encoded in lowest bits of source-port */
sport = rds_tcp_get_peer_sport(sock);
@@ -167,6 +160,37 @@ void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
rds_tcp_accept_work(rtn);
}
+void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
+{
+ struct rds_conn_path *cp0;
+ struct rds_tcp_connection *tc;
+ struct rds_tcp_net *rtn;
+
+ if (rds_destroy_pending(conn))
+ return;
+
+ cp0 = conn->c_path;
+ tc = cp0->cp_transport_data;
+ rtn = tc->t_rtn;
+ if (!rtn)
+ return;
+
+ if (fan_out)
+ /* Delegate fan-out to a background worker in order
+ * to allow "kernel_getpeername" to acquire a lock
+ * on the socket.
+ * The socket is already locked in this context
+ * by either "rds_tcp_recv_path" or "tcp_v{4,6}_rcv",
+ * depending on the origin of the dequeue-request.
+ */
+ queue_work(cp0->cp_wq, &tc->t_fan_out_w);
+ else
+ /* Fan-out either already happened or is unnecessary.
+ * Just go ahead and attempt to accept more connections
+ */
+ rds_tcp_accept_work(rtn);
+}
+
int rds_tcp_accept_one(struct rds_tcp_net *rtn)
{
struct socket *listen_sock = rtn->rds_tcp_listen_sock;
--
2.43.0
* [PATCH net-next v2 3/4] net/rds: Use proper peer port number even when not connected
From: Allison Henderson @ 2026-02-08 5:37 UTC (permalink / raw)
To: netdev
Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
linux-rdma, allison.henderson
From: Greg Jumper <greg.jumper@oracle.com>
The function rds_tcp_get_peer_sport() should return the peer port of a
socket, even when the socket is not currently connected, so that RDS
can reliably determine the MPRDS "lane" corresponding to the port.
rds_tcp_get_peer_sport() calls kernel_getpeername() to get the port
number; however, when paths between endpoints frequently drop and
reconnect, kernel_getpeername() can return -ENOTCONN, causing
rds_tcp_get_peer_sport() to return an error, and ultimately causing
RDS to use the wrong lane for a port when reconnecting to a peer.
This patch modifies rds_tcp_get_peer_sport() to directly call the
socket-specific get-name function (inet_getname() in this case) that
kernel_getpeername() also calls. The socket-specific function offers
an additional argument which, when set to a value greater than 1,
causes the function to return the socket's peer name even when the
socket is not connected, which in turn allows rds_tcp_get_peer_sport()
to return the correct port number.
Signed-off-by: Greg Jumper <greg.jumper@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
---
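To illustrate why a failed port lookup matters for lane selection, here is a small sketch. It assumes, based on the in-tree comments about the path index being encoded in the lowest bits of the source port and checked "modulo", that the lane is the peer port modulo the number of paths. `lane_for_peer_port()` is a hypothetical helper, not a function in net/rds.

```c
#include <assert.h>
#include <errno.h>

/*
 * Hypothetical helper: map a peer source port to an MPRDS lane.
 * A negative sport models an error from the port lookup (for example
 * -ENOTCONN) and is passed through, since no lane can be derived from it.
 */
static int lane_for_peer_port(int sport, int npaths)
{
	assert(npaths > 0);
	if (sport < 0)
		return sport;
	return sport % npaths;	/* assumption: lane = port modulo path count */
}
```

With 8 paths, port 40001 would map to lane 1, while a lookup that failed with `-ENOTCONN` yields no lane at all, which is the situation this patch avoids by asking for the peer name with `peer=2`.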
net/rds/tcp_listen.c | 9 ++++++++-
1 file changed, 8 insertions(+), 1 deletion(-)
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 8fb8f7d26683..db4938fd1672 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -67,7 +67,14 @@ rds_tcp_get_peer_sport(struct socket *sock)
} saddr;
int sport;
- if (kernel_getpeername(sock, &saddr.addr) >= 0) {
+ /* Call the socket's getname() function (inet_getname() in this case)
+ * with a final argument greater than 1 to get the peer's port
+ * regardless of whether the socket is currently connected.
+ * Using peer=2 will get the peer port even during reconnection states
+ * (TCPF_CLOSE, TCPF_SYN_SENT). This avoids -ENOTCONN while
+ * inet_dport still contains the correct peer port.
+ */
+ if (sock->ops->getname(sock, &saddr.addr, 2) >= 0) {
switch (saddr.addr.sa_family) {
case AF_INET:
sport = ntohs(saddr.sin.sin_port);
--
2.43.0
* [PATCH net-next v2 4/4] net/rds: rds_sendmsg should not discard payload_len
From: Allison Henderson @ 2026-02-08 5:37 UTC (permalink / raw)
To: netdev
Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
linux-rdma, allison.henderson
From: Allison Henderson <allison.henderson@oracle.com>
Commit 3db6e0d172c9 ("rds: use RCU to synchronize work-enqueue with
connection teardown") modifies rds_sendmsg() to avoid enqueueing work
while a teardown is in progress. However, it also changed the return
value of rds_sendmsg() to that of rds_send_xmit() instead of
payload_len. This means the user may incorrectly receive errno values
when it should have simply received a payload of 0 while the peer
attempts to reconnect. So this patch corrects the teardown handling
code to take the out error path only in that case, thus restoring the
original payload_len return value.
Signed-off-by: Allison Henderson <achender@kernel.org>
---
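The intended return-value behaviour can be summarised with a small model. This is a sketch under assumptions, not the kernel function: `sendmsg_result()` is hypothetical, and the `-ENOMEM`/`-EAGAIN` condition and the `-ENETUNREACH` value are taken from my reading of the surrounding code in net/rds/send.c, which the hunk below does not show.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the tail of rds_sendmsg() after this patch.
 * xmit_ret models the return value of rds_send_xmit();
 * destroy_pending models rds_destroy_pending().
 */
static long sendmsg_result(size_t payload_len, int xmit_ret,
			   bool destroy_pending)
{
	int ret = xmit_ret;

	if (ret == -ENOMEM || ret == -EAGAIN) {	/* assumed retry condition */
		ret = 0;
		if (destroy_pending)
			ret = -ENETUNREACH;	/* assumed teardown error */
		/* else: the real code queues cp_send_w to retry later */

		if (ret)
			return ret;		/* the "goto out" error path */
	}

	/* Any other rds_send_xmit() result no longer reaches the caller. */
	return (long)payload_len;
}
```

Under these assumptions, only a send that hits a connection being torn down returns an error. Every other case, including a transient transmit error, returns `payload_len` as it did before commit 3db6e0d172c9.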
net/rds/send.c | 6 ++++--
1 file changed, 4 insertions(+), 2 deletions(-)
diff --git a/net/rds/send.c b/net/rds/send.c
index 6e96f108473e..a1039e422a38 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -1431,9 +1431,11 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
else
queue_delayed_work(cpath->cp_wq, &cpath->cp_send_w, 1);
rcu_read_unlock();
+
+ if (ret)
+ goto out;
}
- if (ret)
- goto out;
+
rds_message_put(rm);
for (ind = 0; ind < vct.indx; ind++)
--
2.43.0