* [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements
@ 2026-02-23 22:19 Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
0 siblings, 2 replies; 7+ messages in thread
From: Allison Henderson @ 2026-02-23 22:19 UTC (permalink / raw)
To: netdev
Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
linux-rdma, allison.henderson
Hi all,
This is subset 4 of the larger RDS-TCP patch series I posted last
October. The greater series aims to correct multiple rds-tcp issues that
can cause dropped or out-of-sequence messages. I've broken it down into
smaller sets to make reviews more manageable.
In this set, we address some reconnect issues occurring during connection
teardowns, and also move connection fanout operations to a background
worker.
The entire set can be viewed in the RFC here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
Questions, comments, flames appreciated!
Thanks,
Allison
Change Log
v2:
[PATCH net-next v2 1/4] net/rds: Refactor __rds_conn_create for
blocking transport cleanup
- NEW
[PATCH net-next v2 2/4] net/rds: Delegate fan-out to a background
worker
- Added syzbot report link
v3:
[PATCH net-next v3 1/4] net/rds: Fix NULL pointer dereference in
rds_tcp_accept_one
- NEW
- Fixes syzbot bug
https://syzkaller.appspot.com/bug?extid=96046021045ffe6d7709
[PATCH net-next v3 2/4] net/rds: Refactor __rds_conn_create for
blocking transport cleanup
- Moved syzbot report link from "Delegate fan-out to a background
worker" to this patch
- This patch fixes the syzbot bug flagged in the next patch
https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55
[PATCH net-next v3 3/4] net/rds: Delegate fan-out to a background
worker
- Moved syzbot report link to previous patch
[PATCH net-next v3 4/4] net/rds: Use proper peer port number even
when not connected
- Addressed AI review comment
[PATCH net-next v2 4/4] net/rds: rds_sendmsg should not discard
payload_len
- Removed
- Deferred to net instead of net-next
v4:
[PATCH net-next v4 2/4] net/rds: Refactor __rds_conn_create for
blocking transport cleanup
- Removed Reported-by: tag
[PATCH net-next v4 3/4] net/rds: Delegate fan-out to a background
worker
- Expanded commit header to clarify ordered queue mechanics and
cancel_work_sync() usage
v5:
[PATCH net-next v4 1/4] net/rds: Fix NULL pointer dereference in
rds_tcp_accept_one
- Removed
- Deferred to net instead of net-next
[PATCH net v5 1/2] net/rds: Refactor __rds_conn_create for blocking
transport cleanup
- Clarified commit header and comments per AI review
[PATCH net-next v4 4/4] net/rds: Use proper peer port number even
when not connected
- Removed
- Superseded by
"net/rds: fix recursive lock in rds_tcp_conn_slots_available"
Allison Henderson (1):
net/rds: Refactor __rds_conn_create for blocking transport cleanup
Gerd Rausch (1):
net/rds: Delegate fan-out to a background worker
net/rds/connection.c | 32 ++++++++++++++-----------
net/rds/tcp.c | 3 +++
net/rds/tcp.h | 7 ++----
net/rds/tcp_connect.c | 2 ++
net/rds/tcp_listen.c | 54 +++++++++++++++++++++++++++++++------------
5 files changed, 64 insertions(+), 34 deletions(-)
--
2.43.0
^ permalink raw reply	[flat|nested] 7+ messages in thread

* [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup
  2026-02-23 22:19 [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements Allison Henderson
@ 2026-02-23 22:19 ` Allison Henderson
  2026-02-24 16:16   ` [net-next,v5,1/2] " Simon Horman
  2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
  1 sibling, 1 reply; 7+ messages in thread

From: Allison Henderson @ 2026-02-23 22:19 UTC (permalink / raw)
  To: netdev
  Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
      linux-rdma, allison.henderson

The next patch will delegate fanout operations to a background worker,
which adds cancel_work_sync() to rds_tcp_conn_free(). This is called
during connection cleanup and requires the operation to be able to
block. However, the error path of __rds_conn_create() currently calls
trans->conn_free() while holding rds_conn_lock (a spinlock) and
rcu_read_lock, which creates an atomic context where cancel_work_sync()
cannot sleep.

To avoid this, refactor the error/race paths to defer trans->conn_free()
calls until after the locks are released at the out: label. Code after
the out: label already runs outside the locks because
destroy_workqueue() can sleep, so we extend this pattern to transport
cleanup as well. This way, connections that "lose" the race are safe to
clean up outside the critical section: they were never added to the
hashtable, and are therefore inaccessible to other threads.
Link: https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55
Signed-off-by: Allison Henderson <achender@kernel.org>
---
 net/rds/connection.c | 32 ++++++++++++++++++-------------
 1 file changed, 18 insertions(+), 14 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 185f73b01694..804e928da223 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -170,6 +170,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
 	struct rds_transport *loop_trans;
 	struct rds_conn_path *free_cp = NULL;
+	struct rds_transport *free_trans = NULL;
 	unsigned long flags;
 	int ret, i;
 	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
@@ -305,7 +306,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 	if (parent) {
 		/* Creating passive conn */
 		if (parent->c_passive) {
-			trans->conn_free(conn->c_path[0].cp_transport_data);
+			free_trans = trans;
 			free_cp = conn->c_path;
 			kmem_cache_free(rds_conn_slab, conn);
 			conn = parent->c_passive;
@@ -321,18 +322,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		found = rds_conn_lookup(net, head, laddr, faddr, trans,
 					tos, dev_if);
 		if (found) {
-			struct rds_conn_path *cp;
-			int i;
-
-			for (i = 0; i < npaths; i++) {
-				cp = &conn->c_path[i];
-				/* The ->conn_alloc invocation may have
-				 * allocated resource for all paths, so all
-				 * of them may have to be freed here.
-				 */
-				if (cp->cp_transport_data)
-					trans->conn_free(cp->cp_transport_data);
-			}
+			free_trans = trans;
 			free_cp = conn->c_path;
 			kmem_cache_free(rds_conn_slab, conn);
 			conn = found;
@@ -349,9 +339,23 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 
 out:
 	if (free_cp) {
-		for (i = 0; i < npaths; i++)
+		for (i = 0; i < npaths; i++) {
+			/*
+			 * The trans->conn_alloc call may have allocated
+			 * resources for the cp paths, which will need to
+			 * be freed before freeing cp itself. We do this here
+			 * after the out: label so it runs outside the
+			 * rds_conn_lock spinlock and rcu_read_lock section,
+			 * since both destroy_workqueue() and conn_free can
+			 * block. The local free_trans pointer is only set in
+			 * the race-loss paths where conn_alloc() succeeded.
+			 */
+			if (free_trans && free_cp[i].cp_transport_data)
+				free_trans->conn_free
+					(free_cp[i].cp_transport_data);
 			if (free_cp[i].cp_wq != rds_wq)
 				destroy_workqueue(free_cp[i].cp_wq);
+		}
 		kfree(free_cp);
 	}
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [net-next,v5,1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup
  2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
@ 2026-02-24 16:16   ` Simon Horman
  2026-02-25  6:44     ` Allison Henderson
  0 siblings, 1 reply; 7+ messages in thread

From: Simon Horman @ 2026-02-24 16:16 UTC (permalink / raw)
  To: achender
  Cc: Simon Horman, linux-kselftest, kuba, linux-rdma,
      allison.henderson, edumazet, netdev, rds-devel, pabeni

This is an AI-generated review of your patch. The human sending this
email has considered the AI review valid, or at least plausible.

For local reproduction steps see
https://netdev-ai.bots.linux.dev/ai-local.html
---
net/rds: Refactor __rds_conn_create for blocking transport cleanup

The next patch will delegate fanout operations to a background worker,
which adds cancel_work_sync() to rds_tcp_conn_free(). This is called
during connection cleanup and requires the operation to be able to
block. However, the error path of __rds_conn_create() currently calls
trans->conn_free() while holding rds_conn_lock (spinlock) and
rcu_read_lock, which creates an atomic context where cancel_work_sync()
cannot sleep.

Link: https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55

> diff --git a/net/rds/connection.c b/net/rds/connection.c
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -170,6 +170,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
>  	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
>  	struct rds_transport *loop_trans;
>  	struct rds_conn_path *free_cp = NULL;
> +	struct rds_transport *free_trans = NULL;
>  	unsigned long flags;
>  	int ret, i;
>  	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
> @@ -305,7 +306,7 @@
>  	if (parent) {
>  		/* Creating passive conn */
>  		if (parent->c_passive) {
> -			trans->conn_free(conn->c_path[0].cp_transport_data);
> +			free_trans = trans;
>  			free_cp = conn->c_path;
>  			kmem_cache_free(rds_conn_slab, conn);
>  			conn = parent->c_passive;

[ ... ]

> @@ -321,18 +322,7 @@
>  		found = rds_conn_lookup(net, head, laddr, faddr, trans,
>  					tos, dev_if);
>  		if (found) {
> -			struct rds_conn_path *cp;
> -			int i;
> -
> -			for (i = 0; i < npaths; i++) {
> -				cp = &conn->c_path[i];
> -				/* The ->conn_alloc invocation may have
> -				 * allocated resource for all paths, so all
> -				 * of them may have to be freed here.
> -				 */
> -				if (cp->cp_transport_data)
> -					trans->conn_free(cp->cp_transport_data);
> -			}
> +			free_trans = trans;
>  			free_cp = conn->c_path;
>  			kmem_cache_free(rds_conn_slab, conn);
>  			conn = found;

Does this create a use-after-free in other threads that iterate the
transport list?

Looking at the race-loss path when rds_conn_lookup() finds an existing
connection, the old code called trans->conn_free() before
kmem_cache_free(conn). Now the patch defers conn_free() to after
locks are released at the out: label, but kmem_cache_free(conn) still
executes inside the critical section (line 327 after this change).

This creates a window where rds_tcp_connection objects remain on
rds_tcp_conn_list with tc->t_cpath->cp_conn pointing to the
already-freed conn object.

The timeline would be:

CPU 0 (__rds_conn_create race-loss)
  spin_lock_irqsave(&rds_conn_lock)
  found = rds_conn_lookup()             // finds existing conn
  free_trans = trans
  free_cp = conn->c_path
  kmem_cache_free(rds_conn_slab, conn)  // conn freed
  conn = found
  spin_unlock_irqrestore(&rds_conn_lock)
  rcu_read_unlock()
  // ... later at out: label
  free_trans->conn_free(...)            // removes tc from list, but too late

CPU 1 (rds_tcp_kill_sock)
  spin_lock_irq(&rds_tcp_conn_lock)
  list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node)
    c_net = read_pnet(&tc->t_cpath->cp_conn->c_net)  // UAF here!

The commit message states "connections that 'lose' the race are safe to
clean up outside the critical section since they were never added to
the hashtable, and therefore are inaccessible to other threads."

However, while the rds_connection was never added to the RDS hashtable,
the transport data (tc) WAS added to rds_tcp_conn_list during
rds_tcp_conn_alloc(). Looking at that function in net/rds/tcp.c:

rds_tcp_conn_alloc() {
	// ... allocate tc structures ...
	spin_lock_irq(&rds_tcp_conn_lock);
	for (i = 0; i < RDS_MPATH_WORKERS; i++) {
		tc = conn->c_path[i].cp_transport_data;
		tc->t_tcp_node_detached = false;
		list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
	}
	spin_unlock_irq(&rds_tcp_conn_lock);
}

So the tc objects are accessible to other threads via the transport
list, even though the conn itself never made it into the RDS hashtable.

Functions that iterate rds_tcp_conn_list include:
- rds_tcp_kill_sock() at net/rds/tcp.c:610
- rds_tcp_destroy_conns()
- rds_tcp_sysctl_reset()

All of these dereference tc->t_cpath->cp_conn during iteration.

In the old code, conn_free() ran BEFORE kmem_cache_free(conn), so the tc
was removed from the list before conn was freed. No UAF window existed.

Could the kmem_cache_free(rds_conn_slab, conn) be moved to after
conn_free() at the out: label, or should conn_free() be called before
kmem_cache_free() inside the critical section for the race-loss paths?

> @@ -349,9 +339,23 @@
> 
>  out:
>  	if (free_cp) {
> -		for (i = 0; i < npaths; i++)
> +		for (i = 0; i < npaths; i++) {
> +			/*
> +			 * The trans->conn_alloc call may have allocated
> +			 * resources for the cp paths, which will need to
> +			 * be freed before freeing cp itself. We do this here
> +			 * after the out: label so it runs outside the
> +			 * rds_conn_lock spinlock and rcu_read_lock section,
> +			 * since both destroy_workqueue() and conn_free can
> +			 * block. The local free_trans pointer is only set in
> +			 * the race-loss paths where conn_alloc() succeeded.
> +			 */
> +			if (free_trans && free_cp[i].cp_transport_data)
> +				free_trans->conn_free
> +					(free_cp[i].cp_transport_data);
>  			if (free_cp[i].cp_wq != rds_wq)
>  				destroy_workqueue(free_cp[i].cp_wq);
> +		}
>  		kfree(free_cp);
>  	}

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [net-next,v5,1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup
  2026-02-24 16:16   ` [net-next,v5,1/2] " Simon Horman
@ 2026-02-25  6:44     ` Allison Henderson
  0 siblings, 0 replies; 7+ messages in thread

From: Allison Henderson @ 2026-02-25 6:44 UTC (permalink / raw)
  To: horms@kernel.org, achender@kernel.org
  Cc: pabeni@redhat.com, rds-devel@oss.oracle.com, edumazet@google.com,
      linux-kselftest@vger.kernel.org, kuba@kernel.org,
      linux-rdma@vger.kernel.org, netdev@vger.kernel.org

On Tue, 2026-02-24 at 16:16 +0000, Simon Horman wrote:
> This is an AI-generated review of your patch. The human sending this
> email has considered the AI review valid, or at least plausible.
> 
> For local reproduction steps see
> https://netdev-ai.bots.linux.dev/ai-local.html
> ---
> net/rds: Refactor __rds_conn_create for blocking transport cleanup
> 
> The next patch will delegate fanout operations to a background worker,
> which adds cancel_work_sync() to rds_tcp_conn_free(). This is called
> during connection cleanup and requires the operation to be able to
> block. However, the error path of __rds_conn_create() currently calls
> trans->conn_free() while holding rds_conn_lock (spinlock) and
> rcu_read_lock, which creates an atomic context where cancel_work_sync()
> cannot sleep.
> 
> Link: https://ci.syzbot.org/series/1a5ef180-c02c-401d-9df7-670b18570a55
> 
> > diff --git a/net/rds/connection.c b/net/rds/connection.c
> > --- a/net/rds/connection.c
> > +++ b/net/rds/connection.c
> > @@ -170,6 +170,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
> >  	struct hlist_head *head = rds_conn_bucket(laddr, faddr);
> >  	struct rds_transport *loop_trans;
> >  	struct rds_conn_path *free_cp = NULL;
> > +	struct rds_transport *free_trans = NULL;
> >  	unsigned long flags;
> >  	int ret, i;
> >  	int npaths = (trans->t_mp_capable ? RDS_MPATH_WORKERS : 1);
> > @@ -305,7 +306,7 @@
> >  	if (parent) {
> >  		/* Creating passive conn */
> >  		if (parent->c_passive) {
> > -			trans->conn_free(conn->c_path[0].cp_transport_data);
> > +			free_trans = trans;
> >  			free_cp = conn->c_path;
> >  			kmem_cache_free(rds_conn_slab, conn);
> >  			conn = parent->c_passive;
> 
> [ ... ]
> 
> > @@ -321,18 +322,7 @@
> >  		found = rds_conn_lookup(net, head, laddr, faddr, trans,
> >  					tos, dev_if);
> >  		if (found) {
> > -			struct rds_conn_path *cp;
> > -			int i;
> > -
> > -			for (i = 0; i < npaths; i++) {
> > -				cp = &conn->c_path[i];
> > -				/* The ->conn_alloc invocation may have
> > -				 * allocated resource for all paths, so all
> > -				 * of them may have to be freed here.
> > -				 */
> > -				if (cp->cp_transport_data)
> > -					trans->conn_free(cp->cp_transport_data);
> > -			}
> > +			free_trans = trans;
> >  			free_cp = conn->c_path;
> >  			kmem_cache_free(rds_conn_slab, conn);
> >  			conn = found;
> 
> Does this create a use-after-free in other threads that iterate the
> transport list?
> 
> Looking at the race-loss path when rds_conn_lookup() finds an existing
> connection, the old code called trans->conn_free() before
> kmem_cache_free(conn). Now the patch defers conn_free() to after
> locks are released at the out: label, but kmem_cache_free(conn) still
> executes inside the critical section (line 327 after this change).
> 
> This creates a window where rds_tcp_connection objects remain on
> rds_tcp_conn_list with tc->t_cpath->cp_conn pointing to the
> already-freed conn object.
> 
> The timeline would be:
> 
> CPU 0 (__rds_conn_create race-loss)
>   spin_lock_irqsave(&rds_conn_lock)
>   found = rds_conn_lookup()             // finds existing conn
>   free_trans = trans
>   free_cp = conn->c_path
>   kmem_cache_free(rds_conn_slab, conn)  // conn freed
>   conn = found
>   spin_unlock_irqrestore(&rds_conn_lock)
>   rcu_read_unlock()
>   // ... later at out: label
>   free_trans->conn_free(...)            // removes tc from list, but too late
> 
> CPU 1 (rds_tcp_kill_sock)
>   spin_lock_irq(&rds_tcp_conn_lock)
>   list_for_each_entry_safe(tc, _tc, &rds_tcp_conn_list, t_tcp_node)
>     c_net = read_pnet(&tc->t_cpath->cp_conn->c_net)  // UAF here!
> 
> The commit message states "connections that 'lose' the race are safe to
> clean up outside the critical section since they were never added to
> the hashtable, and therefore are inaccessible to other threads."
> 
> However, while the rds_connection was never added to the RDS hashtable,
> the transport data (tc) WAS added to rds_tcp_conn_list during
> rds_tcp_conn_alloc(). Looking at that function in net/rds/tcp.c:
> 
> rds_tcp_conn_alloc() {
> 	// ... allocate tc structures ...
> 	spin_lock_irq(&rds_tcp_conn_lock);
> 	for (i = 0; i < RDS_MPATH_WORKERS; i++) {
> 		tc = conn->c_path[i].cp_transport_data;
> 		tc->t_tcp_node_detached = false;
> 		list_add_tail(&tc->t_tcp_node, &rds_tcp_conn_list);
> 	}
> 	spin_unlock_irq(&rds_tcp_conn_lock);
> }
> 
> So the tc objects are accessible to other threads via the transport
> list, even though the conn itself never made it into the RDS hashtable.
> 
> Functions that iterate rds_tcp_conn_list include:
> - rds_tcp_kill_sock() at net/rds/tcp.c:610
> - rds_tcp_destroy_conns()
> - rds_tcp_sysctl_reset()
> 
> All of these dereference tc->t_cpath->cp_conn during iteration.
> 
> In the old code, conn_free() ran BEFORE kmem_cache_free(conn), so the tc
> was removed from the list before conn was freed. No UAF window existed.
> 
> Could the kmem_cache_free(rds_conn_slab, conn) be moved to after
> conn_free() at the out: label, or should conn_free() be called before
> kmem_cache_free() inside the critical section for the race-loss paths?

I think it probably needs to go after the out: label. Will update.
Thank you!

Allison

> > @@ -349,9 +339,23 @@
> > 
> >  out:
> >  	if (free_cp) {
> > -		for (i = 0; i < npaths; i++)
> > +		for (i = 0; i < npaths; i++) {
> > +			/*
> > +			 * The trans->conn_alloc call may have allocated
> > +			 * resources for the cp paths, which will need to
> > +			 * be freed before freeing cp itself. We do this here
> > +			 * after the out: label so it runs outside the
> > +			 * rds_conn_lock spinlock and rcu_read_lock section,
> > +			 * since both destroy_workqueue() and conn_free can
> > +			 * block. The local free_trans pointer is only set in
> > +			 * the race-loss paths where conn_alloc() succeeded.
> > +			 */
> > +			if (free_trans && free_cp[i].cp_transport_data)
> > +				free_trans->conn_free
> > +					(free_cp[i].cp_transport_data);
> >  			if (free_cp[i].cp_wq != rds_wq)
> >  				destroy_workqueue(free_cp[i].cp_wq);
> > +		}
> >  		kfree(free_cp);
> >  	}

^ permalink raw reply	[flat|nested] 7+ messages in thread
* [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker
  2026-02-23 22:19 [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements Allison Henderson
  2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
@ 2026-02-23 22:19 ` Allison Henderson
  2026-02-25 16:33   ` Fernando Fernandez Mancera
  1 sibling, 1 reply; 7+ messages in thread

From: Allison Henderson @ 2026-02-23 22:19 UTC (permalink / raw)
  To: netdev
  Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
      linux-rdma, allison.henderson

From: Gerd Rausch <gerd.rausch@oracle.com>

Delegate fan-out to a background worker in order to allow
kernel_getpeername() to acquire a lock on the socket.

This has become necessary since the introduction of
commit 9dfc685e0262d ("inet: remove races in inet{6}_getname()").

The socket is already locked in the context in which
kernel_getpeername() used to get called, by either
"rds_tcp_recv_path" or "tcp_v{4,6}_rcv", therefore
causing a deadlock.

Luckily, the fan-out need not happen in-context nor fast,
so we can easily just do the same in a background worker.

Also, while we're doing this, we get rid of the unused
struct members "t_conn_w", "t_send_w", "t_down_w" & "t_recv_w".

The fan-out work and the shutdown worker (cp_down_w) are both
queued on the same ordered workqueue (cp0->cp_wq), so they
cannot execute concurrently. We only need cancel_work_sync()
in rds_tcp_conn_free() and rds_tcp_conn_path_connect() because
those run from outside the ordered workqueue.

Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
Signed-off-by: Allison Henderson <achender@kernel.org>
---
 net/rds/tcp.c         |  3 +++
 net/rds/tcp.h         |  7 ++----
 net/rds/tcp_connect.c |  2 ++
 net/rds/tcp_listen.c  | 54 +++++++++++++++++++++++++++++++------------
 4 files changed, 46 insertions(+), 20 deletions(-)

diff --git a/net/rds/tcp.c b/net/rds/tcp.c
index 45484a93d75f..02f8f928c20b 100644
--- a/net/rds/tcp.c
+++ b/net/rds/tcp.c
@@ -358,6 +358,8 @@ static void rds_tcp_conn_free(void *arg)
 
 	rdsdebug("freeing tc %p\n", tc);
 
+	cancel_work_sync(&tc->t_fan_out_w);
+
 	spin_lock_irqsave(&rds_tcp_conn_lock, flags);
 	if (!tc->t_tcp_node_detached)
 		list_del(&tc->t_tcp_node);
@@ -384,6 +386,7 @@ static int rds_tcp_conn_alloc(struct rds_connection *conn, gfp_t gfp)
 		tc->t_tinc = NULL;
 		tc->t_tinc_hdr_rem = sizeof(struct rds_header);
 		tc->t_tinc_data_rem = 0;
+		INIT_WORK(&tc->t_fan_out_w, rds_tcp_fan_out_w);
 		init_waitqueue_head(&tc->t_recv_done_waitq);
 
 		conn->c_path[i].cp_transport_data = tc;
diff --git a/net/rds/tcp.h b/net/rds/tcp.h
index 39c86347188c..9ecb0b6b658a 100644
--- a/net/rds/tcp.h
+++ b/net/rds/tcp.h
@@ -44,11 +44,7 @@ struct rds_tcp_connection {
 	size_t			t_tinc_hdr_rem;
 	size_t			t_tinc_data_rem;
 
-	/* XXX error report? */
-	struct work_struct	t_conn_w;
-	struct work_struct	t_send_w;
-	struct work_struct	t_down_w;
-	struct work_struct	t_recv_w;
+	struct work_struct	t_fan_out_w;
 
 	/* for info exporting only */
 	struct list_head	t_list_item;
@@ -90,6 +86,7 @@ void rds_tcp_state_change(struct sock *sk);
 struct socket *rds_tcp_listen_init(struct net *net, bool isv6);
 void rds_tcp_listen_stop(struct socket *sock, struct work_struct *acceptor);
 void rds_tcp_listen_data_ready(struct sock *sk);
+void rds_tcp_fan_out_w(struct work_struct *work);
 void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out);
 int rds_tcp_accept_one(struct rds_tcp_net *rtn);
 void rds_tcp_keepalive(struct socket *sock);
diff --git a/net/rds/tcp_connect.c b/net/rds/tcp_connect.c
index b77c88ffb199..6954b8c479f1 100644
--- a/net/rds/tcp_connect.c
+++ b/net/rds/tcp_connect.c
@@ -115,6 +115,8 @@ int rds_tcp_conn_path_connect(struct rds_conn_path *cp)
 	if (cp->cp_index > 0 && cp->cp_conn->c_npaths < 2)
 		return -EAGAIN;
 
+	cancel_work_sync(&tc->t_fan_out_w);
+
 	mutex_lock(&tc->t_conn_path_lock);
 
 	if (rds_conn_path_up(cp)) {
diff --git a/net/rds/tcp_listen.c b/net/rds/tcp_listen.c
index 6fb5c928b8fd..8fb8f7d26683 100644
--- a/net/rds/tcp_listen.c
+++ b/net/rds/tcp_listen.c
@@ -123,27 +123,20 @@ rds_tcp_accept_one_path(struct rds_connection *conn, struct socket *sock)
 	return NULL;
 }
 
-void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
+void rds_tcp_fan_out_w(struct work_struct *work)
 {
-	struct rds_tcp_connection *tc;
-	struct rds_tcp_net *rtn;
-	struct socket *sock;
+	struct rds_tcp_connection *tc = container_of(work,
+						     struct rds_tcp_connection,
+						     t_fan_out_w);
+	struct rds_connection *conn = tc->t_cpath->cp_conn;
+	struct rds_tcp_net *rtn = tc->t_rtn;
+	struct socket *sock = tc->t_sock;
 	int sport, npaths;
 
-	if (rds_destroy_pending(conn))
-		return;
-
-	tc = conn->c_path->cp_transport_data;
-	rtn = tc->t_rtn;
-	if (!rtn)
-		return;
-
-	sock = tc->t_sock;
-
 	/* During fan-out, check that the connection we already
 	 * accepted in slot#0 carried the proper source port modulo.
 	 */
-	if (fan_out && conn->c_with_sport_idx && sock &&
+	if (conn->c_with_sport_idx && sock &&
 	    rds_addr_cmp(&conn->c_laddr, &conn->c_faddr) > 0) {
 		/* cp->cp_index is encoded in lowest bits of source-port */
 		sport = rds_tcp_get_peer_sport(sock);
@@ -167,6 +160,37 @@ void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
 	rds_tcp_accept_work(rtn);
 }
 
+void rds_tcp_conn_slots_available(struct rds_connection *conn, bool fan_out)
+{
+	struct rds_conn_path *cp0;
+	struct rds_tcp_connection *tc;
+	struct rds_tcp_net *rtn;
+
+	if (rds_destroy_pending(conn))
+		return;
+
+	cp0 = conn->c_path;
+	tc = cp0->cp_transport_data;
+	rtn = tc->t_rtn;
+	if (!rtn)
+		return;
+
+	if (fan_out)
+		/* Delegate fan-out to a background worker in order
+		 * to allow "kernel_getpeername" to acquire a lock
+		 * on the socket.
+		 * The socket is already locked in this context
+		 * by either "rds_tcp_recv_path" or "tcp_v{4,6}_rcv",
+		 * depending on the origin of the dequeue-request.
+		 */
+		queue_work(cp0->cp_wq, &tc->t_fan_out_w);
+	else
+		/* Fan-out either already happened or is unnecessary.
+		 * Just go ahead and attempt to accept more connections
+		 */
+		rds_tcp_accept_work(rtn);
+}
+
 int rds_tcp_accept_one(struct rds_tcp_net *rtn)
 {
 	struct socket *listen_sock = rtn->rds_tcp_listen_sock;
-- 
2.43.0

^ permalink raw reply related	[flat|nested] 7+ messages in thread
* Re: [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker
  2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
@ 2026-02-25 16:33   ` Fernando Fernandez Mancera
  2026-02-26 17:06     ` Allison Henderson
  0 siblings, 1 reply; 7+ messages in thread

From: Fernando Fernandez Mancera @ 2026-02-25 16:33 UTC (permalink / raw)
  To: Allison Henderson, netdev
  Cc: linux-kselftest, pabeni, edumazet, rds-devel, kuba, horms,
      linux-rdma, allison.henderson

On 2/23/26 11:19 PM, Allison Henderson wrote:
> From: Gerd Rausch <gerd.rausch@oracle.com>
> 
> Delegate fan-out to a background worker in order to allow
> kernel_getpeername() to acquire a lock on the socket.
> 
> This has become necessary since the introduction of
> commit 9dfc685e0262d ("inet: remove races in inet{6}_getname()").
> 
> The socket is already locked in the context in which
> kernel_getpeername() used to get called, by either
> "rds_tcp_recv_path" or "tcp_v{4,6}_rcv", therefore
> causing a deadlock.
> 
> Luckily, the fan-out need not happen in-context nor fast,
> so we can easily just do the same in a background worker.
> 
> Also, while we're doing this, we get rid of the unused
> struct members "t_conn_w", "t_send_w", "t_down_w" & "t_recv_w".
> 
> The fan-out work and the shutdown worker (cp_down_w) are both
> queued on the same ordered workqueue (cp0->cp_wq), so they
> cannot execute concurrently. We only need cancel_work_sync()
> in rds_tcp_conn_free() and rds_tcp_conn_path_connect() because
> those run from outside the ordered workqueue.
> 
> Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> Signed-off-by: Allison Henderson <achender@kernel.org>
> ---
>  net/rds/tcp.c         |  3 +++
>  net/rds/tcp.h         |  7 ++----
>  net/rds/tcp_connect.c |  2 ++
>  net/rds/tcp_listen.c  | 54 +++++++++++++++++++++++++++++++------------
>  4 files changed, 46 insertions(+), 20 deletions(-)
> 

Isn't this change kind of dangerous since 021fd0f87004 ("net/rds: fix
recursive lock in rds_tcp_conn_slots_available") [1]? Why is
kernel_getpeername() needed when only the peer source port is required
for the operation?

Thanks,
Fernando.

^ permalink raw reply	[flat|nested] 7+ messages in thread
* Re: [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker
  2026-02-25 16:33   ` Fernando Fernandez Mancera
@ 2026-02-26 17:06     ` Allison Henderson
  0 siblings, 0 replies; 7+ messages in thread

From: Allison Henderson @ 2026-02-26 17:06 UTC (permalink / raw)
  To: fmancera@suse.de, achender@kernel.org, netdev@vger.kernel.org
  Cc: linux-rdma@vger.kernel.org, rds-devel@oss.oracle.com,
      horms@kernel.org, edumazet@google.com,
      linux-kselftest@vger.kernel.org, kuba@kernel.org,
      pabeni@redhat.com

On Wed, 2026-02-25 at 17:33 +0100, Fernando Fernandez Mancera wrote:
> On 2/23/26 11:19 PM, Allison Henderson wrote:
> > From: Gerd Rausch <gerd.rausch@oracle.com>
> > 
> > Delegate fan-out to a background worker in order to allow
> > kernel_getpeername() to acquire a lock on the socket.
> > 
> > This has become necessary since the introduction of
> > commit 9dfc685e0262d ("inet: remove races in inet{6}_getname()").
> > 
> > The socket is already locked in the context in which
> > kernel_getpeername() used to get called, by either
> > "rds_tcp_recv_path" or "tcp_v{4,6}_rcv", therefore
> > causing a deadlock.
> > 
> > Luckily, the fan-out need not happen in-context nor fast,
> > so we can easily just do the same in a background worker.
> > 
> > Also, while we're doing this, we get rid of the unused
> > struct members "t_conn_w", "t_send_w", "t_down_w" & "t_recv_w".
> > 
> > The fan-out work and the shutdown worker (cp_down_w) are both
> > queued on the same ordered workqueue (cp0->cp_wq), so they
> > cannot execute concurrently. We only need cancel_work_sync()
> > in rds_tcp_conn_free() and rds_tcp_conn_path_connect() because
> > those run from outside the ordered workqueue.
> > 
> > Signed-off-by: Gerd Rausch <gerd.rausch@oracle.com>
> > Signed-off-by: Allison Henderson <achender@kernel.org>
> > ---
> >  net/rds/tcp.c         |  3 +++
> >  net/rds/tcp.h         |  7 ++----
> >  net/rds/tcp_connect.c |  2 ++
> >  net/rds/tcp_listen.c  | 54 +++++++++++++++++++++++++++++++------------
> >  4 files changed, 46 insertions(+), 20 deletions(-)
> > 
> 
> Isn't this change kind of dangerous since 021fd0f87004 ("net/rds: fix
> recursive lock in rds_tcp_conn_slots_available") [1]? Why is
> kernel_getpeername() needed when only the peer source port is required
> for the operation?

Thanks for the catch. You're right, I missed that your fix would
displace both patches. I will drop this series for now.

Thank you!
Allison

> 
> Thanks,
> Fernando.

^ permalink raw reply	[flat|nested] 7+ messages in thread
end of thread, other threads: [~2026-02-26 17:06 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed
-- links below jump to the message on this page --)
2026-02-23 22:19 [PATCH net-next v5 0/2] net/rds: RDS-TCP reconnect and fanout improvements Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 1/2] net/rds: Refactor __rds_conn_create for blocking transport cleanup Allison Henderson
2026-02-24 16:16   ` [net-next,v5,1/2] " Simon Horman
2026-02-25  6:44     ` Allison Henderson
2026-02-23 22:19 ` [PATCH net-next v5 2/2] net/rds: Delegate fan-out to a background worker Allison Henderson
2026-02-25 16:33   ` Fernando Fernandez Mancera
2026-02-26 17:06     ` Allison Henderson