* [PATCH net-next v2 0/2] net/rds: RDS-TCP bug fix collection, subset 1: Work queue scalability
@ 2025-11-17 20:23 Allison Henderson
  2025-11-17 20:23 ` [PATCH net-next v2 1/2] net/rds: Add per cp work queue Allison Henderson
  2025-11-17 20:23 ` [PATCH net-next v2 2/2] net/rds: Give each connection its own workqueue Allison Henderson
  0 siblings, 2 replies; 5+ messages in thread
From: Allison Henderson @ 2025-11-17 20:23 UTC (permalink / raw)
  To: netdev; +Cc: achender, pabeni, edumazet, rds-devel, kuba, horms, linux-rdma

From: Allison Henderson <allison.henderson@oracle.com>

Hi all,

This is subset 1 of the RDS-TCP bug fix collection series I posted last
week.  The greater series aims to correct multiple rds-tcp bugs that
can cause dropped or out-of-sequence messages.  The set was starting to
get a bit large, so I've broken it down into smaller sets to make
reviews more manageable.

In this subset, we focus on work queue scalability.  Message queues
are refactored to operate in parallel across multiple connections,
which improves response times and avoids timeouts.

The entire set can be viewed in the RFC here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/

Questions, comments, flames appreciated!
Thanks!
Allison

Change Log:
rfc->v1
 - Fixed lkp warnings and cleaned up whitespace
 - Split out the workqueue changes as a subset

v1->v2
 [PATCH 1/2] net/rds: Add per cp work queue
   - Checkpatch nits
 [PATCH 2/2] net/rds: Give each connection its own workqueue
   - Checkpatch nits
   - Updated commit message with workqueue overhead accounting

Allison Henderson (2):
  net/rds: Add per cp work queue
  net/rds: Give each connection its own workqueue

 net/rds/connection.c | 16 ++++++++++++++--
 net/rds/ib_recv.c    |  2 +-
 net/rds/ib_send.c    |  2 +-
 net/rds/rds.h        |  1 +
 net/rds/send.c       |  9 +++++----
 net/rds/tcp_recv.c   |  2 +-
 net/rds/tcp_send.c   |  2 +-
 net/rds/threads.c    | 16 ++++++++--------
 8 files changed, 32 insertions(+), 18 deletions(-)

-- 
2.43.0



* [PATCH net-next v2 1/2] net/rds: Add per cp work queue
  2025-11-17 20:23 [PATCH net-next v2 0/2] net/rds: RDS-TCP bug fix collection, subset 1: Work queue scalability Allison Henderson
@ 2025-11-17 20:23 ` Allison Henderson
  2025-11-17 20:23 ` [PATCH net-next v2 2/2] net/rds: Give each connection its own workqueue Allison Henderson
  1 sibling, 0 replies; 5+ messages in thread
From: Allison Henderson @ 2025-11-17 20:23 UTC (permalink / raw)
  To: netdev; +Cc: achender, pabeni, edumazet, rds-devel, kuba, horms, linux-rdma

From: Allison Henderson <allison.henderson@oracle.com>

This patch adds a per connection path (cp) workqueue which can be
initialized and used independently of the globally shared rds_wq.

This patch is the first in a series that aims to address TCP ACK
timeouts during the TCP socket shutdown sequence.

This initial refactoring lays the groundwork needed to alleviate
queue congestion during heavy reads and writes.  The independently
managed queues will allow shutdowns and reconnects to respond more
quickly, before the peer(s) time out waiting for the proper ACKs.

Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 net/rds/connection.c |  5 +++--
 net/rds/ib_recv.c    |  2 +-
 net/rds/ib_send.c    |  2 +-
 net/rds/rds.h        |  1 +
 net/rds/send.c       |  9 +++++----
 net/rds/tcp_recv.c   |  2 +-
 net/rds/tcp_send.c   |  2 +-
 net/rds/threads.c    | 16 ++++++++--------
 8 files changed, 21 insertions(+), 18 deletions(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index 68bc88cce84ec..dc7323707f450 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -269,6 +269,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		__rds_conn_path_init(conn, &conn->c_path[i],
 				     is_outgoing);
 		conn->c_path[i].cp_index = i;
+		conn->c_path[i].cp_wq = rds_wq;
 	}
 	rcu_read_lock();
 	if (rds_destroy_pending(conn))
@@ -884,7 +885,7 @@ void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
 		rcu_read_unlock();
 		return;
 	}
-	queue_work(rds_wq, &cp->cp_down_w);
+	queue_work(cp->cp_wq, &cp->cp_down_w);
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(rds_conn_path_drop);
@@ -909,7 +910,7 @@ void rds_conn_path_connect_if_down(struct rds_conn_path *cp)
 	}
 	if (rds_conn_path_state(cp) == RDS_CONN_DOWN &&
 	    !test_and_set_bit(RDS_RECONNECT_PENDING, &cp->cp_flags))
-		queue_delayed_work(rds_wq, &cp->cp_conn_w, 0);
+		queue_delayed_work(cp->cp_wq, &cp->cp_conn_w, 0);
 	rcu_read_unlock();
 }
 EXPORT_SYMBOL_GPL(rds_conn_path_connect_if_down);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 4248dfa816ebf..357128d34a546 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -457,7 +457,7 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
 	    (must_wake ||
 	    (can_wait && rds_ib_ring_low(&ic->i_recv_ring)) ||
 	    rds_ib_ring_empty(&ic->i_recv_ring))) {
-		queue_delayed_work(rds_wq, &conn->c_recv_w, 1);
+		queue_delayed_work(conn->c_path->cp_wq, &conn->c_recv_w, 1);
 	}
 	if (can_wait)
 		cond_resched();
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 4190b90ff3b18..e35bbb6ffb689 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -419,7 +419,7 @@ void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits)
 
 	atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits);
 	if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags))
-		queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+		queue_delayed_work(conn->c_path->cp_wq, &conn->c_send_w, 0);
 
 	WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384);
 
diff --git a/net/rds/rds.h b/net/rds/rds.h
index a029e5fcdea72..b35afa2658cc4 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -118,6 +118,7 @@ struct rds_conn_path {
 
 	void			*cp_transport_data;
 
+	struct workqueue_struct	*cp_wq;
 	atomic_t		cp_state;
 	unsigned long		cp_send_gen;
 	unsigned long		cp_flags;
diff --git a/net/rds/send.c b/net/rds/send.c
index 0b3d0ef2f008b..3e3d028bc21ee 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -458,7 +458,8 @@ int rds_send_xmit(struct rds_conn_path *cp)
 			if (rds_destroy_pending(cp->cp_conn))
 				ret = -ENETUNREACH;
 			else
-				queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+				queue_delayed_work(cp->cp_wq,
+						   &cp->cp_send_w, 1);
 			rcu_read_unlock();
 		} else if (raced) {
 			rds_stats_inc(s_send_lock_queue_raced);
@@ -1380,7 +1381,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
 		if (rds_destroy_pending(cpath->cp_conn))
 			ret = -ENETUNREACH;
 		else
-			queue_delayed_work(rds_wq, &cpath->cp_send_w, 1);
+			queue_delayed_work(cpath->cp_wq, &cpath->cp_send_w, 1);
 		rcu_read_unlock();
 	}
 	if (ret)
@@ -1470,10 +1471,10 @@ rds_send_probe(struct rds_conn_path *cp, __be16 sport,
 	rds_stats_inc(s_send_queued);
 	rds_stats_inc(s_send_pong);
 
-	/* schedule the send work on rds_wq */
+	/* schedule the send work on cp_wq */
 	rcu_read_lock();
 	if (!rds_destroy_pending(cp->cp_conn))
-		queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+		queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 1);
 	rcu_read_unlock();
 
 	rds_message_put(rm);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 7997a19d1da30..b7cf7f451430d 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -327,7 +327,7 @@ void rds_tcp_data_ready(struct sock *sk)
 	if (rds_tcp_read_sock(cp, GFP_ATOMIC) == -ENOMEM) {
 		rcu_read_lock();
 		if (!rds_destroy_pending(cp->cp_conn))
-			queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
+			queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 0);
 		rcu_read_unlock();
 	}
 out:
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 7d284ac7e81a5..4e82c9644aa6a 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -201,7 +201,7 @@ void rds_tcp_write_space(struct sock *sk)
 	rcu_read_lock();
 	if ((refcount_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf &&
 	    !rds_destroy_pending(cp->cp_conn))
-		queue_delayed_work(rds_wq, &cp->cp_send_w, 0);
+		queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 0);
 	rcu_read_unlock();
 
 out:
diff --git a/net/rds/threads.c b/net/rds/threads.c
index 1f424cbfcbb47..639302bab51ef 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -89,8 +89,8 @@ void rds_connect_path_complete(struct rds_conn_path *cp, int curr)
 	set_bit(0, &cp->cp_conn->c_map_queued);
 	rcu_read_lock();
 	if (!rds_destroy_pending(cp->cp_conn)) {
-		queue_delayed_work(rds_wq, &cp->cp_send_w, 0);
-		queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
+		queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 0);
+		queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 0);
 	}
 	rcu_read_unlock();
 	cp->cp_conn->c_proposed_version = RDS_PROTOCOL_VERSION;
@@ -140,7 +140,7 @@ void rds_queue_reconnect(struct rds_conn_path *cp)
 		cp->cp_reconnect_jiffies = rds_sysctl_reconnect_min_jiffies;
 		rcu_read_lock();
 		if (!rds_destroy_pending(cp->cp_conn))
-			queue_delayed_work(rds_wq, &cp->cp_conn_w, 0);
+			queue_delayed_work(cp->cp_wq, &cp->cp_conn_w, 0);
 		rcu_read_unlock();
 		return;
 	}
@@ -151,7 +151,7 @@ void rds_queue_reconnect(struct rds_conn_path *cp)
 		 conn, &conn->c_laddr, &conn->c_faddr);
 	rcu_read_lock();
 	if (!rds_destroy_pending(cp->cp_conn))
-		queue_delayed_work(rds_wq, &cp->cp_conn_w,
+		queue_delayed_work(cp->cp_wq, &cp->cp_conn_w,
 				   rand % cp->cp_reconnect_jiffies);
 	rcu_read_unlock();
 
@@ -203,11 +203,11 @@ void rds_send_worker(struct work_struct *work)
 		switch (ret) {
 		case -EAGAIN:
 			rds_stats_inc(s_send_immediate_retry);
-			queue_delayed_work(rds_wq, &cp->cp_send_w, 0);
+			queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 0);
 			break;
 		case -ENOMEM:
 			rds_stats_inc(s_send_delayed_retry);
-			queue_delayed_work(rds_wq, &cp->cp_send_w, 2);
+			queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 2);
 			break;
 		default:
 			break;
@@ -228,11 +228,11 @@ void rds_recv_worker(struct work_struct *work)
 		switch (ret) {
 		case -EAGAIN:
 			rds_stats_inc(s_recv_immediate_retry);
-			queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
+			queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 0);
 			break;
 		case -ENOMEM:
 			rds_stats_inc(s_recv_delayed_retry);
-			queue_delayed_work(rds_wq, &cp->cp_recv_w, 2);
+			queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 2);
 			break;
 		default:
 			break;
-- 
2.43.0



* [PATCH net-next v2 2/2] net/rds: Give each connection its own workqueue
  2025-11-17 20:23 [PATCH net-next v2 0/2] net/rds: RDS-TCP bug fix collection, subset 1: Work queue scalability Allison Henderson
  2025-11-17 20:23 ` [PATCH net-next v2 1/2] net/rds: Add per cp work queue Allison Henderson
@ 2025-11-17 20:23 ` Allison Henderson
  2025-11-20 10:43   ` Paolo Abeni
  1 sibling, 1 reply; 5+ messages in thread
From: Allison Henderson @ 2025-11-17 20:23 UTC (permalink / raw)
  To: netdev; +Cc: achender, pabeni, edumazet, rds-devel, kuba, horms, linux-rdma

From: Allison Henderson <allison.henderson@oracle.com>

RDS was written to require ordered workqueues for "cp->cp_wq":
Work is executed in the order scheduled, one item at a time.

If these workqueues are shared across connections,
then work executed on behalf of one connection blocks work
scheduled for a different and unrelated connection.

Luckily we don't need to share these workqueues.
While it obviously makes sense to limit the number of
workers (processes) that ought to be allocated on a system,
a workqueue that doesn't have a rescue worker attached
has a tiny footprint compared to the connection as a whole:
a workqueue costs ~900 bytes, including the workqueue_struct,
pool_workqueue, workqueue_attrs, wq_node_nr_active and the
node_nr_active flex array, while an RDS/IB connection
totals ~5 MBytes.

So we're getting a significant performance gain
(90% of connections fail over in under 3 seconds vs. 40%)
for less than 0.02% overhead.

RDS doesn't even benefit from the additional rescue workers:
of all the reasons that RDS blocks workers, allocation under
memory pressure is the least of our concerns. And even if RDS
were stalling due to the memory-reclaim process, the work
executed by the rescue workers is highly unlikely to free up
any memory. If anything, they might try to allocate even more.

By giving each connection its own workqueues, we allow RDS
to better utilize the unbound workers that the system
has available.

Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
 net/rds/connection.c | 13 ++++++++++++-
 1 file changed, 12 insertions(+), 1 deletion(-)

diff --git a/net/rds/connection.c b/net/rds/connection.c
index dc7323707f450..dcb554e10531f 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -269,7 +269,15 @@ static struct rds_connection *__rds_conn_create(struct net *net,
 		__rds_conn_path_init(conn, &conn->c_path[i],
 				     is_outgoing);
 		conn->c_path[i].cp_index = i;
-		conn->c_path[i].cp_wq = rds_wq;
+		conn->c_path[i].cp_wq = alloc_ordered_workqueue(
+						"krds_cp_wq#%lu/%d", 0,
+						rds_conn_count, i);
+		if (!conn->c_path[i].cp_wq) {
+			while (--i >= 0)
+				destroy_workqueue(conn->c_path[i].cp_wq);
+			conn = ERR_PTR(-ENOMEM);
+			goto out;
+		}
 	}
 	rcu_read_lock();
 	if (rds_destroy_pending(conn))
@@ -471,6 +479,9 @@ static void rds_conn_path_destroy(struct rds_conn_path *cp)
 	WARN_ON(work_pending(&cp->cp_down_w));
 
 	cp->cp_conn->c_trans->conn_free(cp->cp_transport_data);
+
+	destroy_workqueue(cp->cp_wq);
+	cp->cp_wq = NULL;
 }
 
 /*
-- 
2.43.0



* Re: [PATCH net-next v2 2/2] net/rds: Give each connection its own workqueue
  2025-11-17 20:23 ` [PATCH net-next v2 2/2] net/rds: Give each connection its own workqueue Allison Henderson
@ 2025-11-20 10:43   ` Paolo Abeni
  2025-11-22  0:36     ` Allison Henderson
  0 siblings, 1 reply; 5+ messages in thread
From: Paolo Abeni @ 2025-11-20 10:43 UTC (permalink / raw)
  To: Allison Henderson, netdev; +Cc: edumazet, rds-devel, kuba, horms, linux-rdma

On 11/17/25 9:23 PM, Allison Henderson wrote:
> From: Allison Henderson <allison.henderson@oracle.com>
> 
> RDS was written to require ordered workqueues for "cp->cp_wq":
> Work is executed in the order scheduled, one item at a time.
> 
> If these workqueues are shared across connections,
> then work executed on behalf of one connection blocks work
> scheduled for a different and unrelated connection.
> 
> Luckily we don't need to share these workqueues.
> While it obviously makes sense to limit the number of
> workers (processes) that ought to be allocated on a system,
> a workqueue that doesn't have a rescue worker attached
> has a tiny footprint compared to the connection as a whole:
> a workqueue costs ~900 bytes, including the workqueue_struct,
> pool_workqueue, workqueue_attrs, wq_node_nr_active and the
> node_nr_active flex array, while an RDS/IB connection
> totals ~5 MBytes.

The above accounting still looks incorrect to me. AFAICS
pool_workqueue/cpu_pwq is per-CPU data. On recent hosts it will
require 64K or more.

Also it looks like it would be a WQ per path, up to 8 WQs per
connection.

> So we're getting a significant performance gain
> (90% of connections fail over in under 3 seconds vs. 40%)
> for less than 0.02% overhead.
> 
> RDS doesn't even benefit from the additional rescue workers:
> of all the reasons that RDS blocks workers, allocation under
> memory pressure is the least of our concerns. And even if RDS
> were stalling due to the memory-reclaim process, the work
> executed by the rescue workers is highly unlikely to free up
> any memory. If anything, they might try to allocate even more.
> 
> By giving each connection its own workqueues, we allow RDS
> to better utilize the unbound workers that the system
> has available.
> 
> Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
> Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
> ---
>  net/rds/connection.c | 13 ++++++++++++-
>  1 file changed, 12 insertions(+), 1 deletion(-)
> 
> diff --git a/net/rds/connection.c b/net/rds/connection.c
> index dc7323707f450..dcb554e10531f 100644
> --- a/net/rds/connection.c
> +++ b/net/rds/connection.c
> @@ -269,7 +269,15 @@ static struct rds_connection *__rds_conn_create(struct net *net,
>  		__rds_conn_path_init(conn, &conn->c_path[i],
>  				     is_outgoing);
>  		conn->c_path[i].cp_index = i;
> -		conn->c_path[i].cp_wq = rds_wq;
> +		conn->c_path[i].cp_wq = alloc_ordered_workqueue(
> +						"krds_cp_wq#%lu/%d", 0,
> +						rds_conn_count, i);

This has a reasonable chance of failure under memory pressure; what
about falling back to rds_wq usage instead of shutting down the
connection entirely?
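
Something along these lines (a completely untested sketch, just to
illustrate the fallback):

	conn->c_path[i].cp_wq =
		alloc_ordered_workqueue("krds_cp_wq#%lu/%d", 0,
					rds_conn_count, i);
	/* Fall back to the shared ordered queue instead of failing
	 * the whole connection under memory pressure.
	 */
	if (!conn->c_path[i].cp_wq)
		conn->c_path[i].cp_wq = rds_wq;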

/P



* Re: [PATCH net-next v2 2/2] net/rds: Give each connection its own workqueue
  2025-11-20 10:43   ` Paolo Abeni
@ 2025-11-22  0:36     ` Allison Henderson
  0 siblings, 0 replies; 5+ messages in thread
From: Allison Henderson @ 2025-11-22  0:36 UTC (permalink / raw)
  To: achender@kernel.org, netdev@vger.kernel.org, pabeni@redhat.com
  Cc: rds-devel@oss.oracle.com, horms@kernel.org, edumazet@google.com,
	kuba@kernel.org, linux-rdma@vger.kernel.org

On Thu, 2025-11-20 at 11:43 +0100, Paolo Abeni wrote:
> On 11/17/25 9:23 PM, Allison Henderson wrote:
> > From: Allison Henderson <allison.henderson@oracle.com>
> > 
> > RDS was written to require ordered workqueues for "cp->cp_wq":
> > Work is executed in the order scheduled, one item at a time.
> > 
> > If these workqueues are shared across connections,
> > then work executed on behalf of one connection blocks work
> > scheduled for a different and unrelated connection.
> > 
> > Luckily we don't need to share these workqueues.
> > While it obviously makes sense to limit the number of
> > workers (processes) that ought to be allocated on a system,
> > a workqueue that doesn't have a rescue worker attached
> > has a tiny footprint compared to the connection as a whole:
> > a workqueue costs ~900 bytes, including the workqueue_struct,
> > pool_workqueue, workqueue_attrs, wq_node_nr_active and the
> > node_nr_active flex array, while an RDS/IB connection
> > totals ~5 MBytes.
> 
> The above accounting still looks incorrect to me. AFAICS
> pool_workqueue/cpu_pwq is per-CPU data. On recent hosts it will
> require 64K or more.


I think that's true of bound (per-CPU) queues, but for an unbound queue with only one worker, it should just be one?  If we
look at this call stack: __rds_conn_create -> __alloc_workqueue -> alloc_and_link_pwqs

We see this in alloc_and_link_pwqs: 

static int alloc_and_link_pwqs(struct workqueue_struct *wq)
{
	if (!(wq->flags & WQ_UNBOUND)) {
		/* ... per-CPU pool_workqueues allocated here ... */
		return 0;
	}

	if (wq->flags & __WQ_ORDERED) {
		struct pool_workqueue *dfl_pwq;

		ret = apply_workqueue_attrs_locked(wq, ordered_wq_attrs[highpri]);
		/* there should only be single pwq for ordering guarantee */
		dfl_pwq = rcu_access_pointer(wq->dfl_pwq);
		...
	}
	...
}

I just realized that in my last response I mentioned the kmem_cache_alloc_node call in this function, but that appears in
the !(wq->flags & WQ_UNBOUND) code path, which doesn't apply in our case.  So apologies for the confusion there.  We
should end up in the __WQ_ORDERED path, since alloc_ordered_workqueue sets "WQ_UNBOUND | __WQ_ORDERED".
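
For reference, the macro in include/linux/workqueue.h is roughly the
following on recent kernels (older kernels also OR in
__WQ_ORDERED_EXPLICIT):

	#define alloc_ordered_workqueue(fmt, flags, args...)		\
		alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED |	\
				(flags), 1, ##args)

The max_active of 1 is what gives the one-item-at-a-time ordering
guarantee.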

Then in apply_workqueue_attrs_locked -> apply_wqattrs_commit, we have:

	ctx->dfl_pwq = install_unbound_pwq(ctx->wq, -1, ctx->dfl_pwq);

The -1 is the cpu parameter, and cpu < 0 assigns the default pool_workqueue (dfl_pwq) used by unbound queues.

Let me know if this helps or if there's a different code path you're seeing that I've missed? 

> Also it looks like it would be a WQ per path, up to 8 WQs per connection.

Yes, that's true; it should say "connection path", not just "connection".  So in the text above, how about:


"A workqueue costs ~900 bytes, including the workqueue_struct,
pool_workqueue, workqueue_attrs, wq_node_nr_active and the
node_nr_active flex array.  Each connection can have up to 8 
(RDS_MPATH_WORKERS) paths for a worst case of ~7 KBytes per 
connection.  While an RDS/IB connection totals only ~5 MBytes."
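
(For the record, that's 8 * ~900 bytes = ~7 KBytes, which is roughly
0.14% of a ~5 MByte connection, so still a rounding error per
connection.)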

Let me know if that sounds ok, or if you think there should be
more detail in the breakdown?

We can also rename the patch to "Give each connection path its own workqueue"

> > So we're getting a significant performance gain
> > (90% of connections fail over in under 3 seconds vs. 40%)
> > for less than 0.02% overhead.
> > 
> > RDS doesn't even benefit from the additional rescue workers:
> > of all the reasons that RDS blocks workers, allocation under
> > memory pressure is the least of our concerns. And even if RDS
> > were stalling due to the memory-reclaim process, the work
> > executed by the rescue workers is highly unlikely to free up
> > any memory. If anything, they might try to allocate even more.
> > 
> > By giving each connection its own workqueues, we allow RDS
And here, how about:

"By giving each connection path its own workqueue, ..."

?

> > to better utilize the unbound workers that the system
> > has available.
> > 
> > Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
> > Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
> > ---
> >  net/rds/connection.c | 13 ++++++++++++-
> >  1 file changed, 12 insertions(+), 1 deletion(-)
> > 
> > diff --git a/net/rds/connection.c b/net/rds/connection.c
> > index dc7323707f450..dcb554e10531f 100644
> > --- a/net/rds/connection.c
> > +++ b/net/rds/connection.c
> > @@ -269,7 +269,15 @@ static struct rds_connection *__rds_conn_create(struct net *net,
> >  		__rds_conn_path_init(conn, &conn->c_path[i],
> >  				     is_outgoing);
> >  		conn->c_path[i].cp_index = i;
> > -		conn->c_path[i].cp_wq = rds_wq;
> > +		conn->c_path[i].cp_wq = alloc_ordered_workqueue(
> > +						"krds_cp_wq#%lu/%d", 0,
> > +						rds_conn_count, i);
> This has a reasonable chance of failure under memory pressure, what
> about falling back to rds_wq usage instead of shutting down the
> connection entirely?
Sure, we can add it as a fallback; it just means a little extra handling in rds_conn_path_destroy to make sure we don't
tear down rds_wq.
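
e.g. something like this (an untested sketch on top of this series):

	/* cp_wq may now be the shared rds_wq fallback; only destroy
	 * workqueues this connection path actually owns.
	 */
	if (cp->cp_wq && cp->cp_wq != rds_wq)
		destroy_workqueue(cp->cp_wq);
	cp->cp_wq = NULL;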

I hope this has helped some? Let me know what you think or if you think the queue accounting needs more digging.  Thanks
for the reviews!

Allison

> 
> /P
> 


