* [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements
@ 2026-02-23 22:19 Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
0 siblings, 2 replies; 7+ messages in thread
From: Allison Henderson @ 2026-02-23 22:19 UTC (permalink / raw)
To: netdev
Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
linux-rdma, allison.henderson
Hi all,
This is subset 4 of the larger RDS-TCP patch series I posted last
October. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out-of-sequence messages. I've broken it down into
smaller sets to make reviews more manageable.
In this set, we address some reconnect issues occurring during connection
teardowns, and also move connection fanout operations to a background
worker.
The entire set can be viewed in the RFC here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
Questions, comments, flames appreciated!
Thanks,
Allison
Change Log
v2:
[PATCH net-next v2 1/4] net/rds: Refactor __rds_conn_create for
blocking transport cleanup
- NEW
[PATCH net-next v2 2/4] net/rds: Delegate fan-out to a background
worker
- Added syzbot report link
v3:
[PATCH net-next v3 1/4] net/rds: Fix NULL pointer dereference in
rds_tcp_accept_one
- NEW
- Fixes syzbot bug
https://syzkaller.appspot.com/bug?extid=96046021045ffe6d7709
[PATCH net-next v3 2/4] net/rds: Refactor __rds_conn_create for
blocking transport cleanup
- Moved syzbot report link from "Delegate fan-out to a background
worker" to this patch
- This patch fixes the syzbot bug flagged in the next patch
https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55
[PATCH net-next v3 3/4] net/rds: Delegate fan-out to a background
worker
- Moved syzbot report link to previous patch
[PATCH net-next v3 4/4] net/rds: Use proper peer port number even
when not connected
- Addressed AI review comment
[PATCH net-next v2 4/4] net/rds: rds_sendmsg should not discard
payload_len
- Removed
- Deferred to net instead of net-next
v4:
[PATCH net-next v4 2/4] net/rds: Refactor __rds_conn_create for
blocking transport cleanup
- Removed Reported-by: tag
[PATCH net-next v4 3/4] net/rds: Delegate fan-out to a background
worker
- Expanded commit header to clarify ordered queue mechanics and
cancel_work_sync() usage
v5:
[PATCH net-next v4 1/4] net/rds: Fix NULL pointer dereference in
rds_tcp_accept_one
- Removed
- Deferred to net instead of net-next
[PATCH net v5 1/2] net/rds: Refactor __rds_conn_create for blocking
transport cleanup
- Clarified commit header and comments per AI review
[PATCH net-next v4 4/4] net/rds: Use proper peer port number even
when not connected
- Removed
- Superseded by
"net/rds: fix recursive lock in rds_tcp_conn_slots_available"
Allison Henderson (1):
net/rds: Refactor __rds_conn_create for blocking transport cleanup
Gerd Rausch (1):
net/rds: Delegate fan-out to a background worker
net/rds/connection.c | 32 ++++++++++++++-----------
net/rds/tcp.c | 3 +++
net/rds/tcp.h | 7 ++----
net/rds/tcp_connect.c | 2 ++
net/rds/tcp_listen.c | 54 +++++++++++++++++++++++++++++++------------
5 files changed, 64 insertions(+), 34 deletions(-)
--
2.43.0
^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup
  2026-02-23 22:19 [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements Allison Henderson
@ 2026-02-23 22:19 ` Allison Henderson
  2026-02-24 16:16   ` [net-next,v5,1/2] " Simon Horman
  2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
  1 sibling, 1 reply; 7+ messages in thread

From: Allison Henderson @ 2026-02-23 22:19 UTC (permalink / raw)
  To: netdev
  Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
      linux-rdma, allison.henderson

The next patch will delegate fanout operations to a background worker,
which adds cancel_work_sync() to rds_tcp_conn_free(). This is called
during connection cleanup and requires the operation to be able to
block. However, the error path of __rds_conn_create() currently calls
trans->conn_free() while holding rds_conn_lock (a spinlock) and
rcu_read_lock, which creates an atomic context where cancel_work_sync()
cannot sleep.

To avoid this, refactor the error/race paths to defer trans->conn_free()
calls until after the locks are released at the out: label. Code after
the out: label already runs outside the locks because
destroy_workqueue() can sleep, so we extend this pattern to transport
cleanup as well. This way, connections that "lose" the race are safe to
clean up outside the critical section: they were never added to the
hashtable, and are therefore inaccessible to other threads.
Link: https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55
Signed-off-by: Allison Henderson <achender@kernel.org>
---
 net/rds/connection.c | 32 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 185f73b01694..804e928da223 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -170,6 +170,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
 	struct rds_transport *loop_trans;
 	struct rds_conn_path *free_cp = NULL;
+	struct rds_transport *free_trans = NULL;
 	unsigned long flags;
 	int ret, i;
 	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
@@ -305,7 +306,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	if (parent) {
 		/* Creating passive conn */
 		if (parent->c_passive) {
-			trans->conn_free(conn->c_path[0].cp_transport_data);
+			free_trans = trans;
 			free_cp = conn->c_path;
 			kmem_cache_free(rds_conn_slab, conn);
 			conn = parent->c_passive;
@@ -321,18 +322,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		found = rds_conn_lookup(net, head, laddr, faddr, trans,
 					tos, dev_if);
 		if (found) {
-			struct rds_conn_path *cp;
-			int i;
-
-			for (i = 0; i < npaths; i++) {
-				cp = &conn->c_path[i];
-				/* The ->conn_alloc invocation may have
-				 * allocated resource for all paths, so all
-				 * of them may have to be freed here.
-				 */
-				if (cp->cp_transport_data)
-					trans->conn_free(cp->cp_transport_data);
-			}
+			free_trans = trans;
 			free_cp = conn->c_path;
 			kmem_cache_free(rds_conn_slab, conn);
 			conn = found;
@@ -349,9 +339,23 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 
 out:
 	if (free_cp) {
-		for (i = 0; i < npaths; i++)
+		for (i = 0; i < npaths; i++) {
+			/*
+			 * The trans->conn_alloc call may have allocated
+			 * resources for the cp paths, which will need to
+			 * be freed before freeing cp itself. We do this here
+			 * after the out: label so it runs outside the
+			 * rds_conn_lock spinlock and rcu_read_lock section,
+			 * since both destroy_workqueue() and conn_free can
+			 * block. The local free_trans pointer is only set in
+			 * the race-loss paths where conn_alloc() succeeded.
+			 */
+			if (free_trans && free_cp[i].cp_transport_data)
+				free_trans->conn_free
+					(free_cp[i].cp_transport_data);
 			if (free_cp[i].cp_wq != rds_wq)
 				destroy_workqueue(free_cp[i].cp_wq);
+		}
 		kfree(free_cp);
 	}
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [net-next,v5,1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup
  2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
@ 2026-02-24 16:16   ` Simon Horman
  2026-02-25  6:44     ` Allison Henderson
  0 siblings, 1 reply; 7+ messages in thread

From: Simon Horman @ 2026-02-24 16:16 UTC (permalink / raw)
  To: achender
  Cc: Simon Horman, linux-kselftest, kuba, linux-rdma,
      allison.henderson, edumazet, netdev, rds-devel, pabeni

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see
https://netdev-ai.bots.linux.dev/ai-local.html
---
net/rds: Refactor __rds_conn_create for blocking transport cleanup

The next patch will delegate fanout operations to a background worker,
which adds cancel_work_sync() to rds_tcp_conn_free(). This is called
during connection cleanup and requires the operation to be able to
block. However, the error path of __rds_conn_create() currently calls
trans->conn_free() while holding rds_conn_lock (spinlock) and
rcu_read_lock, which creates an atomic context where cancel_work_sync()
cannot sleep.

Link: https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55

> diff --git a/net/rds/connection.c b/net/rds/connection.c
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -170,6 +170,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
>  	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
>  	struct rds_transport *loop_trans;
>  	struct rds_conn_path *free_cp = NULL;
> +	struct rds_transport *free_trans = NULL;
>  	unsigned long flags;
>  	int ret, i;
>  	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
> @@ -305,7 +306,7 @@
>  	if (parent) {
>  		/* Creating passive conn */
>  		if (parent->c_passive) {
> -			trans->conn_free(conn->c_path[0].cp_transport_data);
> +			free_trans = trans;
>  			free_cp = conn->c_path;
>  			kmem_cache_free(rds_conn_slab, conn);
>  			conn = parent->c_passive;

[ ... ]

> @@ -321,18 +322,7 @@
>  		found = rds_conn_lookup(net, head, laddr, faddr, trans,
>  					tos, dev_if);
>  		if (found) {
> -			struct rds_conn_path *cp;
> -			int i;
> -
> -			for (i = 0; i < npaths; i++) {
> -				cp = &conn->c_path[i];
> -				/* The ->conn_alloc invocation may have
> -				 * allocated resource for all paths, so all
> -				 * of them may have to be freed here.
> -				 */
> -				if (cp->cp_transport_data)
> -					trans->conn_free(cp->cp_transport_data);
> -			}
> +			free_trans = trans;
>  			free_cp = conn->c_path;
>  			kmem_cache_free(rds_conn_slab, conn);
>  			conn = found;

Does this create a use-after-free in other threads that iterate the
transport list?

Looking at the race-loss path when rds_conn_lookup() finds an existing
connection, the old code called trans->conn_free() before
kmem_cache_free(conn). Now the patch defers conn_free() to after
locks are released at the out: label, but kmem_cache_free(conn) still
executes inside the critical section (line 327 after this change).

This creates a window where rds_tcp_connection objects remain on
rds_tcp_conn_list with tc->t_cpath->cp_conn pointing to the
already-freed conn object.

The timeline would be:

CPU 0 (__rds_conn_create race-loss)
  spin_lock_irqsave(&rds_conn_lock)
  found = rds_conn_lookup()             // finds existing conn
  free_trans = trans
  free_cp = conn->c_path
  kmem_cache_free(rds_conn_slab, conn)  // conn freed
  conn = found
  spin_unlock_irqrestore(&rds_conn_lock)
  rcu_read_unlock()
  // ... later at out: label
  free_trans->conn_free(...)            // removes tc from list, but too late

CPU 1 (rds_tcp_kill_sock)
  spin_lock_irq(&rds_tcp_conn_lock)
  list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node)
    c_net = read_pnet(&tc->t_cpath->cp_conn->c_net)  // UAF here!

The commit message states "connections that 'lose' the race are safe to
clean up outside the critical section since they were never added to
the hashtable, and therefore are inaccessible to other threads."

However, while the rds_connection was never added to the RDS hashtable,
the transport data (tc) WAS added to rds_tcp_conn_list during
rds_tcp_conn_alloc(). Looking at that function in net/rds/tcp.c:

rds_tcp_conn_alloc() {
	// ... allocate tc structures ...
	spin_lock_irq(&rds_tcp_conn_lock);
	for (i = 0; i < RDS_MPATH_WORKERS; i++) {
		tc = conn->c_path[i].cp_transport_data;
		tc->t_tcp_node_detached = false;
		list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
	}
	spin_unlock_irq(&rds_tcp_conn_lock);
}

So the tc objects are accessible to other threads via the transport
list, even though the conn itself never made it into the RDS hashtable.

Functions that iterate rds_tcp_conn_list include:
- rds_tcp_kill_sock() at net/rds/tcp.c:610
- rds_tcp_destroy_conns()
- rds_tcp_sysctl_reset()

All of these dereference tc->t_cpath->cp_conn during iteration.

In the old code, conn_free() ran BEFORE kmem_cache_free(conn), so the tc
was removed from the list before conn was freed. No UAF window existed.

Could the kmem_cache_free(rds_conn_slab, conn) be moved to after
conn_free() at the out: label, or should conn_free() be called before
kmem_cache_free() inside the critical section for the race-loss paths?

> @@ -349,9 +339,23 @@
> 
>  out:
>  	if (free_cp) {
> -		for (i = 0; i < npaths; i++)
> +		for (i = 0; i < npaths; i++) {
> +			/*
> +			 * The trans->conn_alloc call may have allocated
> +			 * resources for the cp paths, which will need to
> +			 * be freed before freeing cp itself. We do this here
> +			 * after the out: label so it runs outside the
> +			 * rds_conn_lock spinlock and rcu_read_lock section,
> +			 * since both destroy_workqueue() and conn_free can
> +			 * block. The local free_trans pointer is only set in
> +			 * the race-loss paths where conn_alloc() succeeded.
> +			 */
> +			if (free_trans && free_cp[i].cp_transport_data)
> +				free_trans->conn_free
> +					(free_cp[i].cp_transport_data);
>  			if (free_cp[i].cp_wq != rds_wq)
>  				destroy_workqueue(free_cp[i].cp_wq);
> +		}
>  		kfree(free_cp);
>  	}

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [net-next,v5,1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup
  2026-02-24 16:16   ` [net-next,v5,1/2] " Simon Horman
@ 2026-02-25  6:44     ` Allison Henderson
  0 siblings, 0 replies; 7+ messages in thread

From: Allison Henderson @ 2026-02-25 6:44 UTC (permalink / raw)
  To: horms@kernel.org, achender@kernel.org
  Cc: pabeni@redhat.com, rds-devel@oss.oracle.com, edumazet@google.com,
      linux-kselftest@vger.kernel.org, kuba@kernel.org,
      linux-rdma@vger.kernel.org, netdev@vger.kernel.org

On Tue, 2026-02-24 at 16:16 +0000, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see
> https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net/rds: Refactor __rds_conn_create for blocking transport cleanup
> 
> The next patch will delegate fanout operations to a background worker,
> which adds cancel_work_sync() to rds_tcp_conn_free(). This is called
> during connection cleanup and requires the operation to be able to
> block. However, the error path of __rds_conn_create() currently calls
> trans->conn_free() while holding rds_conn_lock (spinlock) and
> rcu_read_lock, which creates an atomic context where cancel_work_sync()
> cannot sleep.
> 
> Link: https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55
> 
> > diff --git a/net/rds/connection.c b/net/rds/connection.c
> > --- a/net/rds/connection.c
> > +++ b/net/rds/connection.c
> > @@ -170,6 +170,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
> >  	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
> >  	struct rds_transport *loop_trans;
> >  	struct rds_conn_path *free_cp = NULL;
> > +	struct rds_transport *free_trans = NULL;
> >  	unsigned long flags;
> >  	int ret, i;
> >  	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
> > @@ -305,7 +306,7 @@
> >  	if (parent) {
> >  		/* Creating passive conn */
> >  		if (parent->c_passive) {
> > -			trans->conn_free(conn->c_path[0].cp_transport_data);
> > +			free_trans = trans;
> >  			free_cp = conn->c_path;
> >  			kmem_cache_free(rds_conn_slab, conn);
> >  			conn = parent->c_passive;
> 
> [ ... ]
> 
> > @@ -321,18 +322,7 @@
> >  		found = rds_conn_lookup(net, head, laddr, faddr, trans,
> >  					tos, dev_if);
> >  		if (found) {
> > -			struct rds_conn_path *cp;
> > -			int i;
> > -
> > -			for (i = 0; i < npaths; i++) {
> > -				cp = &conn->c_path[i];
> > -				/* The ->conn_alloc invocation may have
> > -				 * allocated resource for all paths, so all
> > -				 * of them may have to be freed here.
> > -				 */
> > -				if (cp->cp_transport_data)
> > -					trans->conn_free(cp->cp_transport_data);
> > -			}
> > +			free_trans = trans;
> >  			free_cp = conn->c_path;
> >  			kmem_cache_free(rds_conn_slab, conn);
> >  			conn = found;
> 
> Does this create a use-after-free in other threads that iterate the
> transport list?
> 
> Looking at the race-loss path when rds_conn_lookup() finds an existing
> connection, the old code called trans->conn_free() before
> kmem_cache_free(conn). Now the patch defers conn_free() to after
> locks are released at the out: label, but kmem_cache_free(conn) still
> executes inside the critical section (line 327 after this change).
> 
> This creates a window where rds_tcp_connection objects remain on
> rds_tcp_conn_list with tc->t_cpath->cp_conn pointing to the
> already-freed conn object.
> 
> The timeline would be:
> 
> CPU 0 (__rds_conn_create race-loss)
>   spin_lock_irqsave(&rds_conn_lock)
>   found = rds_conn_lookup()             // finds existing conn
>   free_trans = trans
>   free_cp = conn->c_path
>   kmem_cache_free(rds_conn_slab, conn)  // conn freed
>   conn = found
>   spin_unlock_irqrestore(&rds_conn_lock)
>   rcu_read_unlock()
>   // ... later at out: label
>   free_trans->conn_free(...)            // removes tc from list, but too late
> 
> CPU 1 (rds_tcp_kill_sock)
>   spin_lock_irq(&rds_tcp_conn_lock)
>   list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node)
>     c_net = read_pnet(&tc->t_cpath->cp_conn->c_net)  // UAF here!
> 
> The commit message states "connections that 'lose' the race are safe to
> clean up outside the critical section since they were never added to
> the hashtable, and therefore are inaccessible to other threads."
> 
> However, while the rds_connection was never added to the RDS hashtable,
> the transport data (tc) WAS added to rds_tcp_conn_list during
> rds_tcp_conn_alloc(). Looking at that function in net/rds/tcp.c:
> 
> rds_tcp_conn_alloc() {
> 	// ... allocate tc structures ...
> 	spin_lock_irq(&rds_tcp_conn_lock);
> 	for (i = 0; i < RDS_MPATH_WORKERS; i++) {
> 		tc = conn->c_path[i].cp_transport_data;
> 		tc->t_tcp_node_detached = false;
> 		list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
> 	}
> 	spin_unlock_irq(&rds_tcp_conn_lock);
> }
> 
> So the tc objects are accessible to other threads via the transport
> list, even though the conn itself never made it into the RDS hashtable.
> 
> Functions that iterate rds_tcp_conn_list include:
> - rds_tcp_kill_sock() at net/rds/tcp.c:610
> - rds_tcp_destroy_conns()
> - rds_tcp_sysctl_reset()
> 
> All of these dereference tc->t_cpath->cp_conn during iteration.
> 
> In the old code, conn_free() ran BEFORE kmem_cache_free(conn), so the tc
> was removed from the list before conn was freed. No UAF window existed.
> 
> Could the kmem_cache_free(rds_conn_slab, conn) be moved to after
> conn_free() at the out: label, or should conn_free() be called before
> kmem_cache_free() inside the critical section for the race-loss paths?

I think it probably needs to go after the out: label. Will update.
Thank you!

Allison

> > @@ -349,9 +339,23 @@
> > 
> >  out:
> >  	if (free_cp) {
> > -		for (i = 0; i < npaths; i++)
> > +		for (i = 0; i < npaths; i++) {
> > +			/*
> > +			 * The trans->conn_alloc call may have allocated
> > +			 * resources for the cp paths, which will need to
> > +			 * be freed before freeing cp itself. We do this here
> > +			 * after the out: label so it runs outside the
> > +			 * rds_conn_lock spinlock and rcu_read_lock section,
> > +			 * since both destroy_workqueue() and conn_free can
> > +			 * block. The local free_trans pointer is only set in
> > +			 * the race-loss paths where conn_alloc() succeeded.
> > +			 */
> > +			if (free_trans && free_cp[i].cp_transport_data)
> > +				free_trans->conn_free
> > +					(free_cp[i].cp_transport_data);
> >  			if (free_cp[i].cp_wq != rds_wq)
> >  				destroy_workqueue(free_cp[i].cp_wq);
> > +		}
> >  		kfree(free_cp);
> >  	}

^ permalink raw reply	[flat|nested] 7+ messages in thread
* [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker
  2026-02-23 22:19 [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements Allison Henderson
  2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
@ 2026-02-23 22:19 ` Allison Henderson
  2026-02-25 16:33   ` Fernando Fernandez Mancera
  1 sibling, 1 reply; 7+ messages in thread

From: Allison Henderson @ 2026-02-23 22:19 UTC (permalink / raw)
  To: netdev
  Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
      linux-rdma, allison.henderson

From: Gerd Rausch <gerd.rausch@oracle.com>

Delegate fan-out to a background worker in order to allow
kernel_getpeername() to acquire a lock on the socket.

This has become necessary since the introduction of
commit 9dfc685e0262d ("inet: remove races in inet{6}_getname()").

The socket is already locked in the context in which
kernel_getpeername() used to get called, by either
"rds_tcp_recv_path" or "tcp_v{4,6}_rcv", therefore
causing a deadlock.

Luckily, the fan-out need not happen in-context nor fast,
so we can easily just do the same in a background worker.

Also, while we're doing this, we get rid of the unused
struct members "t_conn_w", "t_send_w", "t_down_w" & "t_recv_w".

The fan-out work and the shutdown worker (cp_down_w) are both
queued on the same ordered workqueue (cp0->cp_wq), so they
cannot execute concurrently. We only need cancel_work_sync()
in rds_tcp_conn_free() and rds_tcp_conn_path_connect() because
those run from outside the ordered workqueue.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
---
 net/rds/tcp.c         |  3 +++
 net/rds/tcp.h         |  7 ++----
 net/rds/tcp_connect.c |  2 ++
 net/rds/tcp_listen.c  | 54 +++++++++++++++++++++++++++++++------------
 4 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 45484a93d75f..02f8f928c20b 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -358,6 +358,8 @@ static void rds_tcp_conn_free(void *arg)
 
 	rdsdebug("freeing tc %p\n", tc);
 
+	cancel_work_sync(&tc->t_fan_out_w);
+
 	spin_lock_irqsave(&rds_tcp_conn_lock, flags);
 	if (!tc->t_tcp_node_detached)
 		list_del(&tc->t_tcp_node);
@@ -384,6 +386,7 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 		tc->t_tinc = NULL;
 		tc->t_tinc_hdr_rem = sizeof(struct rds_header);
 		tc->t_tinc_data_rem = 0;
+		INIT_WORK(&tc->t_fan_out_w, rds_tcp_fan_out_w);
 		init_waitqueue_head(&tc->t_recv_done_waitq);
 
 		conn->c_path[i].cp_transport_data = tc;
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 39c86347188c..9ecb0b6b658a 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -44,11 +44,7 @@ struct rds_tcp_connection {
 	size_t			t_tinc_hdr_rem;
 	size_t			t_tinc_data_rem;
 
-	/* XXX error report? */
-	struct work_struct	t_conn_w;
-	struct work_struct	t_send_w;
-	struct work_struct	t_down_w;
-	struct work_struct	t_recv_w;
+	struct work_struct	t_fan_out_w;
 
 	/* for info exporting only */
 	struct list_head	t_list_item;
@@ -90,6 +86,7 @@ void rds_tcp_state_change(struct sock *sk);
 struct socket *rds_tcp_listen_init(struct net *net, bool isv6);
 void rds_tcp_listen_stop(struct socket *sock, struct work_struct *acceptor);
 void rds_tcp_listen_data_ready(struct sock *sk);
+void rds_tcp_fan_out_w(struct work_struct *work);
 void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out);
 int rds_tcp_accept_one(struct rds_tcp_net *rtn);
 void rds_tcp_keepalive(struct socket *sock);
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index b77c88ffb199..6954b8c479f1 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -115,6 +115,8 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 	if (cp->cp_index > 0 && cp->cp_conn->c_npaths < 2)
 		return -EAGAIN;
 
+	cancel_work_sync(&tc->t_fan_out_w);
+
 	mutex_lock(&tc->t_conn_path_lock);
 
 	if (rds_conn_path_up(cp)) {
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 6fb5c928b8fd..8fb8f7d26683 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -123,27 +123,20 @@ rds_tcp_accept_one_path(struct rds_connection *conn, struct socket *sock)
 	return NULL;
 }
 
-void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
+void rds_tcp_fan_out_w(struct work_struct *work)
 {
-	struct rds_tcp_connection *tc;
-	struct rds_tcp_net *rtn;
-	struct socket *sock;
+	struct rds_tcp_connection *tc = container_of(work,
+						     struct rds_tcp_connection,
+						     t_fan_out_w);
+	struct rds_connection *conn = tc->t_cpath->cp_conn;
+	struct rds_tcp_net *rtn = tc->t_rtn;
+	struct socket *sock = tc->t_sock;
 	int sport, npaths;
 
-	if (rds_destroy_pending(conn))
-		return;
-
-	tc = conn->c_path->cp_transport_data;
-	rtn = tc->t_rtn;
-	if (!rtn)
-		return;
-
-	sock = tc->t_sock;
-
 	/* During fan-out, check that the connection we already
 	 * accepted in slot#0 carried the proper source port modulo.
 	 */
-	if (fan_out && conn->c_with_sport_idx && sock &&
+	if (conn->c_with_sport_idx && sock &&
 	    rds_addr_cmp(&conn->c_laddr, &conn->c_faddr) > 0) {
 		/* cp->cp_index is encoded in lowest bits of source-port */
 		sport = rds_tcp_get_peer_sport(sock);
@@ -167,6 +160,37 @@ void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
 	rds_tcp_accept_work(rtn);
 }
 
+void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
+{
+	struct rds_conn_path *cp0;
+	struct rds_tcp_connection *tc;
+	struct rds_tcp_net *rtn;
+
+	if (rds_destroy_pending(conn))
+		return;
+
+	cp0 = conn->c_path;
+	tc = cp0->cp_transport_data;
+	rtn = tc->t_rtn;
+	if (!rtn)
+		return;
+
+	if (fan_out)
+		/* Delegate fan-out to a background worker in order
+		 * to allow "kernel_getpeername" to acquire a lock
+		 * on the socket.
+		 * The socket is already locked in this context
+		 * by either "rds_tcp_recv_path" or "tcp_v{4,6}_rcv",
+		 * depending on the origin of the dequeue-request.
+		 */
+		queue_work(cp0->cp_wq, &tc->t_fan_out_w);
+	else
+		/* Fan-out either already happened or is unnecessary.
+		 * Just go ahead and attempt to accept more connections
+		 */
+		rds_tcp_accept_work(rtn);
+}
+
 int rds_tcp_accept_one(struct rds_tcp_net *rtn)
 {
 	struct socket *listen_sock = rtn->rds_tcp_listen_sock;
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker
  2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
@ 2026-02-25 16:33   ` Fernando Fernandez Mancera
  2026-02-26 17:06     ` Allison Henderson
  0 siblings, 1 reply; 7+ messages in thread

From: Fernando Fernandez Mancera @ 2026-02-25 16:33 UTC (permalink / raw)
  To: Allison Henderson, netdev
  Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
      linux-rdma, allison.henderson

On 2/23/26 11:19 PM, Allison Henderson wrote:
> From: Gerd Rausch <gerd.rausch@oracle.com>
> 
> Delegate fan-out to a background worker in order to allow
> kernel_getpeername() to acquire a lock on the socket.
> 
> This has become necessary since the introduction of
> commit 9dfc685e0262d ("inet: remove races in inet{6}_getname()").
> 
> The socket is already locked in the context in which
> kernel_getpeername() used to get called, by either
> "rds_tcp_recv_path" or "tcp_v{4,6}_rcv", therefore
> causing a deadlock.
> 
> Luckily, the fan-out need not happen in-context nor fast,
> so we can easily just do the same in a background worker.
> 
> Also, while we're doing this, we get rid of the unused
> struct members "t_conn_w", "t_send_w", "t_down_w" & "t_recv_w".
> 
> The fan-out work and the shutdown worker (cp_down_w) are both
> queued on the same ordered workqueue (cp0->cp_wq), so they
> cannot execute concurrently. We only need cancel_work_sync()
> in rds_tcp_conn_free() and rds_tcp_conn_path_connect() because
> those run from outside the ordered workqueue.
> 
> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> Signed-off-by: Allison Henderson <achender@kernel.org>
> ---
>  net/rds/tcp.c         |  3 +++
>  net/rds/tcp.h         |  7 ++----
>  net/rds/tcp_connect.c |  2 ++
>  net/rds/tcp_listen.c  | 54 +++++++++++++++++++++++++++++++------------
>  4 files changed, 46 insertions(+), 20 deletions(-)
> 

Isn't this change kind of dangerous since 021fd0f87004 ("net/rds: fix
recursive lock in rds_tcp_conn_slots_available") [1]? Why is
kernel_getpeername() needed when only the peer source port is required
for the operation?

Thanks,
Fernando.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker
  2026-02-25 16:33   ` Fernando Fernandez Mancera
@ 2026-02-26 17:06     ` Allison Henderson
  0 siblings, 0 replies; 7+ messages in thread

From: Allison Henderson @ 2026-02-26 17:06 UTC (permalink / raw)
  To: fmancera@suse.de, achender@kernel.org, netdev@vger.kernel.org
  Cc: linux-rdma@vger.kernel.org, rds-devel@oss.oracle.com,
      horms@kernel.org, edumazet@google.com,
      linux-kselftest@vger.kernel.org, kuba@kernel.org,
      pabeni@redhat.com

On Wed, 2026-02-25 at 17:33 +0100, Fernando Fernandez Mancera wrote:
> On 2/23/26 11:19 PM, Allison Henderson wrote:
> > From: Gerd Rausch <gerd.rausch@oracle.com>
> > 
> > Delegate fan-out to a background worker in order to allow
> > kernel_getpeername() to acquire a lock on the socket.
> > 
> > This has become necessary since the introduction of
> > commit 9dfc685e0262d ("inet: remove races in inet{6}_getname()").
> > 
> > The socket is already locked in the context in which
> > kernel_getpeername() used to get called, by either
> > "rds_tcp_recv_path" or "tcp_v{4,6}_rcv", therefore
> > causing a deadlock.
> > 
> > Luckily, the fan-out need not happen in-context nor fast,
> > so we can easily just do the same in a background worker.
> > 
> > Also, while we're doing this, we get rid of the unused
> > struct members "t_conn_w", "t_send_w", "t_down_w" & "t_recv_w".
> > 
> > The fan-out work and the shutdown worker (cp_down_w) are both
> > queued on the same ordered workqueue (cp0->cp_wq), so they
> > cannot execute concurrently. We only need cancel_work_sync()
> > in rds_tcp_conn_free() and rds_tcp_conn_path_connect() because
> > those run from outside the ordered workqueue.
> > 
> > Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> > Signed-off-by: Allison Henderson <achender@kernel.org>
> > ---
> >  net/rds/tcp.c         |  3 +++
> >  net/rds/tcp.h         |  7 ++----
> >  net/rds/tcp_connect.c |  2 ++
> >  net/rds/tcp_listen.c  | 54 +++++++++++++++++++++++++++++++------------
> >  4 files changed, 46 insertions(+), 20 deletions(-)
> > 
> 
> Isn't this change kind of dangerous since 021fd0f87004 ("net/rds: fix
> recursive lock in rds_tcp_conn_slots_available") [1]? Why is
> kernel_getpeername() needed when only the peer source port is required
> for the operation?

Thanks for the catch. You're right, I missed that your fix would
displace both patches. I will drop this series for now.

Thank you!
Allison

> 
> Thanks,
> Fernando.

^ permalink raw reply	[flat|nested] 7+ messages in thread
end of thread, other threads: [~2026-02-26 17:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-02-23 22:19 [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
2026-02-24 16:16   ` [net-next,v5,1/2] " Simon Horman
2026-02-25  6:44     ` Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
2026-02-25 16:33   ` Fernando Fernandez Mancera
2026-02-26 17:06     ` Allison Henderson