* [PATCH net-next v1 0/2] net/rds: RDS-TCP bug fix collection, subset 1: Work queue scalability
@ 2025-10-29 17:46 Allison Henderson
2025-10-29 17:46 ` [PATCH net-next v1 1/2] net/rds: Add per cp work queue Allison Henderson
2025-10-29 17:46 ` [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue Allison Henderson
0 siblings, 2 replies; 7+ messages in thread
From: Allison Henderson @ 2025-10-29 17:46 UTC
To: netdev; +Cc: allison.henderson
From: Allison Henderson <allison.henderson@oracle.com>
Hi all,
This is subset 1 of the RDS-TCP bug fix collection series I posted last
week. The full series aims to correct multiple RDS-TCP bugs that
can cause dropped or out-of-sequence messages. The set was starting to
get a bit large, so I've broken it down into smaller subsets to make
reviews more manageable.
In this subset, we focus on work queue scalability. Message queues
are refactored to operate in parallel across multiple connections,
which improves response times and avoids timeouts.
The entire set can be viewed in the RFC here:
https://lore.kernel.org/netdev/20251022191715.157755-1-achender@kernel.org/
Questions, comments, flames appreciated!
Thanks!
Allison
Change Log:
rfc->v1
- Fixed lkp warnings and cleaned up whitespace
- Split out the workqueue changes as a subset
Allison Henderson (2):
net/rds: Add per cp work queue
net/rds: Give each connection its own workqueue
net/rds/connection.c | 15 +++++++++++++--
net/rds/ib_recv.c | 2 +-
net/rds/ib_send.c | 2 +-
net/rds/rds.h | 1 +
net/rds/send.c | 8 ++++----
net/rds/tcp_recv.c | 2 +-
net/rds/tcp_send.c | 2 +-
net/rds/threads.c | 16 ++++++++--------
8 files changed, 30 insertions(+), 18 deletions(-)
--
2.43.0
* [PATCH net-next v1 1/2] net/rds: Add per cp work queue
2025-10-29 17:46 [PATCH net-next v1 0/2] net/rds: RDS-TCP bug fix collection, subset 1: Work queue scalability Allison Henderson
@ 2025-10-29 17:46 ` Allison Henderson
2025-10-29 17:46 ` [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue Allison Henderson
1 sibling, 0 replies; 7+ messages in thread
From: Allison Henderson @ 2025-10-29 17:46 UTC
To: netdev; +Cc: allison.henderson
From: Allison Henderson <allison.henderson@oracle.com>
This patch adds a per connection path (cp) workqueue which can be
initialized and used independently of the globally shared rds_wq.
This patch is the first in a series that aims to address TCP ACK
timeouts during the TCP socket shutdown sequence.
This initial refactoring lays the groundwork needed to alleviate
queue congestion during heavy reads and writes. The independently
managed queues will allow shutdowns and reconnects to respond more
quickly, before the peer(s) time out waiting for the expected ACKs.
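As a sketch of the pattern this introduces (not a hunk from the patch),
each call site moves from the global queue to the path-local one; since
cp_wq is initialized to rds_wq here, behavior is unchanged until the
next patch:

	/* before: every path contends on the shared global workqueue */
	queue_delayed_work(rds_wq, &cp->cp_send_w, 0);

	/* after: each path queues onto its own cp_wq, which this patch
	 * still points at rds_wq; the next patch makes it a dedicated
	 * per-path queue */
	queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 0);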
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
net/rds/connection.c | 5 +++--
net/rds/ib_recv.c | 2 +-
net/rds/ib_send.c | 2 +-
net/rds/rds.h | 1 +
net/rds/send.c | 8 ++++----
net/rds/tcp_recv.c | 2 +-
net/rds/tcp_send.c | 2 +-
net/rds/threads.c | 16 ++++++++--------
8 files changed, 20 insertions(+), 18 deletions(-)
diff --git a/net/rds/connection.c b/net/rds/connection.c
index 68bc88cce84ec..dc7323707f450 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -269,6 +269,7 @@ static struct rds_connection *__rds_conn_create(struct net *net,
__rds_conn_path_init(conn, &conn->c_path[i],
is_outgoing);
conn->c_path[i].cp_index = i;
+ conn->c_path[i].cp_wq = rds_wq;
}
rcu_read_lock();
if (rds_destroy_pending(conn))
@@ -884,7 +885,7 @@ void rds_conn_path_drop(struct rds_conn_path *cp, bool destroy)
rcu_read_unlock();
return;
}
- queue_work(rds_wq, &cp->cp_down_w);
+ queue_work(cp->cp_wq, &cp->cp_down_w);
rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(rds_conn_path_drop);
@@ -909,7 +910,7 @@ void rds_conn_path_connect_if_down(struct rds_conn_path *cp)
}
if (rds_conn_path_state(cp) == RDS_CONN_DOWN &&
!test_and_set_bit(RDS_RECONNECT_PENDING, &cp->cp_flags))
- queue_delayed_work(rds_wq, &cp->cp_conn_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_conn_w, 0);
rcu_read_unlock();
}
EXPORT_SYMBOL_GPL(rds_conn_path_connect_if_down);
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index 4248dfa816ebf..357128d34a546 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -457,7 +457,7 @@ void rds_ib_recv_refill(struct rds_connection *conn, int prefill, gfp_t gfp)
(must_wake ||
(can_wait && rds_ib_ring_low(&ic->i_recv_ring)) ||
rds_ib_ring_empty(&ic->i_recv_ring))) {
- queue_delayed_work(rds_wq, &conn->c_recv_w, 1);
+ queue_delayed_work(conn->c_path->cp_wq, &conn->c_recv_w, 1);
}
if (can_wait)
cond_resched();
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 4190b90ff3b18..e35bbb6ffb689 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -419,7 +419,7 @@ void rds_ib_send_add_credits(struct rds_connection *conn, unsigned int credits)
atomic_add(IB_SET_SEND_CREDITS(credits), &ic->i_credits);
if (test_and_clear_bit(RDS_LL_SEND_FULL, &conn->c_flags))
- queue_delayed_work(rds_wq, &conn->c_send_w, 0);
+ queue_delayed_work(conn->c_path->cp_wq, &conn->c_send_w, 0);
WARN_ON(IB_GET_SEND_CREDITS(credits) >= 16384);
diff --git a/net/rds/rds.h b/net/rds/rds.h
index 5b1c072e2e7ff..11fa304f2164a 100644
--- a/net/rds/rds.h
+++ b/net/rds/rds.h
@@ -118,6 +118,7 @@ struct rds_conn_path {
void *cp_transport_data;
+ struct workqueue_struct *cp_wq;
atomic_t cp_state;
unsigned long cp_send_gen;
unsigned long cp_flags;
diff --git a/net/rds/send.c b/net/rds/send.c
index 0b3d0ef2f008b..ed8d84a74c34e 100644
--- a/net/rds/send.c
+++ b/net/rds/send.c
@@ -458,7 +458,7 @@ int rds_send_xmit(struct rds_conn_path *cp)
if (rds_destroy_pending(cp->cp_conn))
ret = -ENETUNREACH;
else
- queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+ queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 1);
rcu_read_unlock();
} else if (raced) {
rds_stats_inc(s_send_lock_queue_raced);
@@ -1380,7 +1380,7 @@ int rds_sendmsg(struct socket *sock, struct msghdr *msg, size_t payload_len)
if (rds_destroy_pending(cpath->cp_conn))
ret = -ENETUNREACH;
else
- queue_delayed_work(rds_wq, &cpath->cp_send_w, 1);
+ queue_delayed_work(cpath->cp_wq, &cpath->cp_send_w, 1);
rcu_read_unlock();
}
if (ret)
@@ -1470,10 +1470,10 @@ rds_send_probe(struct rds_conn_path *cp, __be16 sport,
rds_stats_inc(s_send_queued);
rds_stats_inc(s_send_pong);
- /* schedule the send work on rds_wq */
+ /* schedule the send work on cp_wq */
rcu_read_lock();
if (!rds_destroy_pending(cp->cp_conn))
- queue_delayed_work(rds_wq, &cp->cp_send_w, 1);
+ queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 1);
rcu_read_unlock();
rds_message_put(rm);
diff --git a/net/rds/tcp_recv.c b/net/rds/tcp_recv.c
index 7997a19d1da30..b7cf7f451430d 100644
--- a/net/rds/tcp_recv.c
+++ b/net/rds/tcp_recv.c
@@ -327,7 +327,7 @@ void rds_tcp_data_ready(struct sock *sk)
if (rds_tcp_read_sock(cp, GFP_ATOMIC) == -ENOMEM) {
rcu_read_lock();
if (!rds_destroy_pending(cp->cp_conn))
- queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 0);
rcu_read_unlock();
}
out:
diff --git a/net/rds/tcp_send.c b/net/rds/tcp_send.c
index 7d284ac7e81a5..4e82c9644aa6a 100644
--- a/net/rds/tcp_send.c
+++ b/net/rds/tcp_send.c
@@ -201,7 +201,7 @@ void rds_tcp_write_space(struct sock *sk)
rcu_read_lock();
if ((refcount_read(&sk->sk_wmem_alloc) << 1) <= sk->sk_sndbuf &&
!rds_destroy_pending(cp->cp_conn))
- queue_delayed_work(rds_wq, &cp->cp_send_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 0);
rcu_read_unlock();
out:
diff --git a/net/rds/threads.c b/net/rds/threads.c
index 1f424cbfcbb47..639302bab51ef 100644
--- a/net/rds/threads.c
+++ b/net/rds/threads.c
@@ -89,8 +89,8 @@ void rds_connect_path_complete(struct rds_conn_path *cp, int curr)
set_bit(0, &cp->cp_conn->c_map_queued);
rcu_read_lock();
if (!rds_destroy_pending(cp->cp_conn)) {
- queue_delayed_work(rds_wq, &cp->cp_send_w, 0);
- queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 0);
}
rcu_read_unlock();
cp->cp_conn->c_proposed_version = RDS_PROTOCOL_VERSION;
@@ -140,7 +140,7 @@ void rds_queue_reconnect(struct rds_conn_path *cp)
cp->cp_reconnect_jiffies = rds_sysctl_reconnect_min_jiffies;
rcu_read_lock();
if (!rds_destroy_pending(cp->cp_conn))
- queue_delayed_work(rds_wq, &cp->cp_conn_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_conn_w, 0);
rcu_read_unlock();
return;
}
@@ -151,7 +151,7 @@ void rds_queue_reconnect(struct rds_conn_path *cp)
conn, &conn->c_laddr, &conn->c_faddr);
rcu_read_lock();
if (!rds_destroy_pending(cp->cp_conn))
- queue_delayed_work(rds_wq, &cp->cp_conn_w,
+ queue_delayed_work(cp->cp_wq, &cp->cp_conn_w,
rand % cp->cp_reconnect_jiffies);
rcu_read_unlock();
@@ -203,11 +203,11 @@ void rds_send_worker(struct work_struct *work)
switch (ret) {
case -EAGAIN:
rds_stats_inc(s_send_immediate_retry);
- queue_delayed_work(rds_wq, &cp->cp_send_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 0);
break;
case -ENOMEM:
rds_stats_inc(s_send_delayed_retry);
- queue_delayed_work(rds_wq, &cp->cp_send_w, 2);
+ queue_delayed_work(cp->cp_wq, &cp->cp_send_w, 2);
break;
default:
break;
@@ -228,11 +228,11 @@ void rds_recv_worker(struct work_struct *work)
switch (ret) {
case -EAGAIN:
rds_stats_inc(s_recv_immediate_retry);
- queue_delayed_work(rds_wq, &cp->cp_recv_w, 0);
+ queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 0);
break;
case -ENOMEM:
rds_stats_inc(s_recv_delayed_retry);
- queue_delayed_work(rds_wq, &cp->cp_recv_w, 2);
+ queue_delayed_work(cp->cp_wq, &cp->cp_recv_w, 2);
break;
default:
break;
--
2.43.0
* [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue
2025-10-29 17:46 [PATCH net-next v1 0/2] net/rds: RDS-TCP bug fix collection, subset 1: Work queue scalability Allison Henderson
2025-10-29 17:46 ` [PATCH net-next v1 1/2] net/rds: Add per cp work queue Allison Henderson
@ 2025-10-29 17:46 ` Allison Henderson
2025-11-04 14:57 ` Paolo Abeni
1 sibling, 1 reply; 7+ messages in thread
From: Allison Henderson @ 2025-10-29 17:46 UTC
To: netdev; +Cc: allison.henderson
From: Allison Henderson <allison.henderson@oracle.com>
RDS was written to require ordered workqueues for "cp->cp_wq":
Work is executed in the order scheduled, one item at a time.
If these workqueues are shared across connections,
then work executed on behalf of one connection blocks work
scheduled for a different and unrelated connection.
Luckily we don't need to share these workqueues.
While it obviously makes sense to limit the number of
workers (processes) that ought to be allocated on a system,
a workqueue that doesn't have a rescue worker attached
has a tiny footprint compared to the connection as a whole:
A workqueue costs ~800 bytes, while an RDS/IB connection
totals ~5 MBytes.
So we're getting a significant performance gain
(90% of connections fail over in under 3 seconds vs. 40%)
for less than 0.02% overhead.
RDS doesn't even benefit from the additional rescue workers:
of all the reasons that RDS blocks workers, allocation under
memory pressure is the least of our concerns.
And even if RDS were stalling due to the memory-reclaim process,
the work executed by the rescue workers is highly unlikely
to free up any memory.
If anything, they might try to allocate even more.
By giving each connection its own workqueues, we allow RDS
to better utilize the unbound workers that the system
has available.
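To make the head-of-line blocking concrete, consider a hypothetical
two-connection scenario (illustration only; cp_a and cp_b are made-up
names, and the shared rds_wq is ordered as described above):

	/* shared ordered queue: a slow send for connection A delays an
	 * unrelated shutdown for connection B queued right behind it */
	queue_delayed_work(rds_wq, &cp_a->cp_send_w, 0);  /* runs first */
	queue_work(rds_wq, &cp_b->cp_down_w);             /* must wait  */

	/* per-path queues: the same two items run concurrently, while
	 * each path still sees FIFO ordering of its own work */
	queue_delayed_work(cp_a->cp_wq, &cp_a->cp_send_w, 0);
	queue_work(cp_b->cp_wq, &cp_b->cp_down_w);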
Signed-off-by: Somasundaram Krishnasamy <somasundaram.krishnasamy@oracle.com>
Signed-off-by: Allison Henderson <allison.henderson@oracle.com>
---
net/rds/connection.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)
diff --git a/net/rds/connection.c b/net/rds/connection.c
index dc7323707f450..ac555f02c045e 100644
--- a/net/rds/connection.c
+++ b/net/rds/connection.c
@@ -269,7 +269,14 @@ static struct rds_connection *__rds_conn_create(struct net *net,
__rds_conn_path_init(conn, &conn->c_path[i],
is_outgoing);
conn->c_path[i].cp_index = i;
- conn->c_path[i].cp_wq = rds_wq;
+ conn->c_path[i].cp_wq = alloc_ordered_workqueue("krds_cp_wq#%lu/%d", 0,
+ rds_conn_count, i);
+ if (!conn->c_path[i].cp_wq) {
+ while (--i >= 0)
+ destroy_workqueue(conn->c_path[i].cp_wq);
+ conn = ERR_PTR(-ENOMEM);
+ goto out;
+ }
}
rcu_read_lock();
if (rds_destroy_pending(conn))
@@ -471,6 +478,9 @@ static void rds_conn_path_destroy(struct rds_conn_path *cp)
WARN_ON(work_pending(&cp->cp_down_w));
cp->cp_conn->c_trans->conn_free(cp->cp_transport_data);
+
+ destroy_workqueue(cp->cp_wq);
+ cp->cp_wq = NULL;
}
/*
--
2.43.0
* Re: [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue
2025-10-29 17:46 ` [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue Allison Henderson
@ 2025-11-04 14:57 ` Paolo Abeni
2025-11-04 21:23 ` Allison Henderson
0 siblings, 1 reply; 7+ messages in thread
From: Paolo Abeni @ 2025-11-04 14:57 UTC
To: Allison Henderson, netdev; +Cc: allison.henderson
On 10/29/25 6:46 PM, Allison Henderson wrote:
> From: Allison Henderson <allison.henderson@oracle.com>
>
> RDS was written to require ordered workqueues for "cp->cp_wq":
> Work is executed in the order scheduled, one item at a time.
>
> If these workqueues are shared across connections,
> then work executed on behalf of one connection blocks work
> scheduled for a different and unrelated connection.
>
> Luckily we don't need to share these workqueues.
> While it obviously makes sense to limit the number of
> workers (processes) that ought to be allocated on a system,
> a workqueue that doesn't have a rescue worker attached
> has a tiny footprint compared to the connection as a whole:
> A workqueue costs ~800 bytes, while an RDS/IB connection
> totals ~5 MBytes.
Still, a workqueue per connection feels like overkill. Have you considered
moving to WQ_PERCPU for rds_wq? Why does that not fit?
Thanks,
Paolo
* Re: [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue
2025-11-04 14:57 ` Paolo Abeni
@ 2025-11-04 21:23 ` Allison Henderson
2025-11-06 10:52 ` Paolo Abeni
0 siblings, 1 reply; 7+ messages in thread
From: Allison Henderson @ 2025-11-04 21:23 UTC
To: achender@kernel.org, netdev@vger.kernel.org, pabeni@redhat.com
On Tue, 2025-11-04 at 15:57 +0100, Paolo Abeni wrote:
> On 10/29/25 6:46 PM, Allison Henderson wrote:
> > From: Allison Henderson <allison.henderson@oracle.com>
> >
> > RDS was written to require ordered workqueues for "cp->cp_wq":
> > Work is executed in the order scheduled, one item at a time.
> >
> > If these workqueues are shared across connections,
> > then work executed on behalf of one connection blocks work
> > scheduled for a different and unrelated connection.
> >
> > Luckily we don't need to share these workqueues.
> > While it obviously makes sense to limit the number of
> > workers (processes) that ought to be allocated on a system,
> > a workqueue that doesn't have a rescue worker attached
> > has a tiny footprint compared to the connection as a whole:
> > A workqueue costs ~800 bytes, while an RDS/IB connection
> > totals ~5 MBytes.
>
> Still, a workqueue per connection feels like overkill. Have you considered
> moving to WQ_PERCPU for rds_wq? Why does that not fit?
>
> Thanks,
>
> Paolo
>
Hi Paolo
I hadn't thought of WQ_PERCPU before, so I did some digging on it. In our case, though, we need FIFO behavior per
connection, so if we switched to per-CPU queues, we'd have to pin each connection to a CPU to get the right behavior.
And that brings back head-of-line blocking, since all the items on that queue would have to share that CPU even if the
other CPUs are idle. So it wouldn't quite be an equivalent solution for what we're trying to do here. I hope that
makes sense? Let me know what you think.
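Just to illustrate what the pinning would look like if rds_wq were
per-CPU as you suggest (a hypothetical sketch, not proposed code; the
hash key is made up):

	/* hypothetical: derive a stable CPU from the connection so all
	 * of its work items stay FIFO on that one CPU's queue */
	int cpu = hash_ptr(cp->cp_conn, 8) % num_online_cpus();

	queue_delayed_work_on(cpu, rds_wq, &cp->cp_send_w, 0);

Every connection that hashes to the same CPU gets serialized behind the
others again, which is exactly the blocking we're trying to remove.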
Thank you,
Allison
* Re: [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue
2025-11-04 21:23 ` Allison Henderson
@ 2025-11-06 10:52 ` Paolo Abeni
2025-11-08 1:24 ` Allison Henderson
0 siblings, 1 reply; 7+ messages in thread
From: Paolo Abeni @ 2025-11-06 10:52 UTC
To: Allison Henderson, achender@kernel.org, netdev@vger.kernel.org
On 11/4/25 10:23 PM, Allison Henderson wrote:
> On Tue, 2025-11-04 at 15:57 +0100, Paolo Abeni wrote:
>> On 10/29/25 6:46 PM, Allison Henderson wrote:
>>> From: Allison Henderson <allison.henderson@oracle.com>
>>>
>>> RDS was written to require ordered workqueues for "cp->cp_wq":
>>> Work is executed in the order scheduled, one item at a time.
>>>
>>> If these workqueues are shared across connections,
>>> then work executed on behalf of one connection blocks work
>>> scheduled for a different and unrelated connection.
>>>
>>> Luckily we don't need to share these workqueues.
>>> While it obviously makes sense to limit the number of
>>> workers (processes) that ought to be allocated on a system,
>>> a workqueue that doesn't have a rescue worker attached
>>> has a tiny footprint compared to the connection as a whole:
>>> A workqueue costs ~800 bytes, while an RDS/IB connection
>>> totals ~5 MBytes.
>>
>> Still, a workqueue per connection feels like overkill. Have you considered
>> moving to WQ_PERCPU for rds_wq? Why does that not fit?
>>
>> Thanks,
>>
>> Paolo
>>
> Hi Paolo
>
> I hadn't thought of WQ_PERCPU before, so I did some digging on it. In our case, though, we need FIFO behavior per
> connection, so if we switched to per-CPU queues, we'd have to pin each connection to a CPU to get the right behavior.
> And that brings back head-of-line blocking, since all the items on that queue would have to share that CPU even if the
> other CPUs are idle. So it wouldn't quite be an equivalent solution for what we're trying to do here. I hope that
> makes sense? Let me know what you think.
Still, the workqueue per connection gives significantly more overhead than
your estimate above. I guess ~800 bytes is sizeof(struct workqueue_struct)?
Please note that such a struct contains several dynamically allocated
pointers, among them per-CPU ones: the overall amount of memory used is
significantly greater than your estimate. You should provide a more
accurate one.
Much more importantly, using a workqueue per connection provides a
scalability gain only to the extent that each workqueue uses a
different pool and thus creates additional kthread(s). I haven't dived
into the workqueue implementation, but I think this is not the case. My
current guesstimate is that you measure some gain because the per-
connection WQ actually creates (or just uses) a single pool different
from rds_wq's.
Please double check the above.
Out of sheer ignorance, I suspect/hope that replacing the current
workqueue with alloc_ordered_workqueue() (possibly UNBOUND?!?) will give
the same scalability improvement at no cost.
/P
* Re: [PATCH net-next v1 2/2] net/rds: Give each connection its own workqueue
2025-11-06 10:52 ` Paolo Abeni
@ 2025-11-08 1:24 ` Allison Henderson
0 siblings, 0 replies; 7+ messages in thread
From: Allison Henderson @ 2025-11-08 1:24 UTC
To: achender@kernel.org, netdev@vger.kernel.org, pabeni@redhat.com
On Thu, 2025-11-06 at 11:52 +0100, Paolo Abeni wrote:
> On 11/4/25 10:23 PM, Allison Henderson wrote:
> > On Tue, 2025-11-04 at 15:57 +0100, Paolo Abeni wrote:
> > > On 10/29/25 6:46 PM, Allison Henderson wrote:
> > > > From: Allison Henderson <allison.henderson@oracle.com>
> > > >
> > > > RDS was written to require ordered workqueues for "cp->cp_wq":
> > > > Work is executed in the order scheduled, one item at a time.
> > > >
> > > > If these workqueues are shared across connections,
> > > > then work executed on behalf of one connection blocks work
> > > > scheduled for a different and unrelated connection.
> > > >
> > > > Luckily we don't need to share these workqueues.
> > > > While it obviously makes sense to limit the number of
> > > > workers (processes) that ought to be allocated on a system,
> > > > a workqueue that doesn't have a rescue worker attached
> > > > has a tiny footprint compared to the connection as a whole:
> > > > A workqueue costs ~800 bytes, while an RDS/IB connection
> > > > totals ~5 MBytes.
> > >
> > > Still, a workqueue per connection feels like overkill. Have you considered
> > > moving to WQ_PERCPU for rds_wq? Why does that not fit?
> > >
> > > Thanks,
> > >
> > > Paolo
> > >
> > Hi Paolo
> >
> > I hadn't thought of WQ_PERCPU before, so I did some digging on it. In our case, though, we need FIFO behavior per
> > connection, so if we switched to per-CPU queues, we'd have to pin each connection to a CPU to get the right behavior.
> > And that brings back head-of-line blocking, since all the items on that queue would have to share that CPU even if the
> > other CPUs are idle. So it wouldn't quite be an equivalent solution for what we're trying to do here. I hope that
> > makes sense? Let me know what you think.
>
> Still, the workqueue per connection gives significantly more overhead than
> your estimate above. I guess ~800 bytes is sizeof(struct workqueue_struct)?
>
> Please note that such a struct contains several dynamically allocated
> pointers, among them per-CPU ones: the overall amount of memory used is
> significantly greater than your estimate. You should provide a more
> accurate one.
>
Sure, I've done some digging around in the workqueue allocation code for us:
First, here's the size of workqueue_struct itself (about 320 bytes):
achender@workbox:~/mnt/net-next_coverage$ pahole -C workqueue_struct vmlinux | grep size
/* size: 320, cachelines: 5, members: 24 */
Then add the other members that get allocated alongside it:
achender@workbox:~/mnt/net-next_coverage$ pahole -C pool_workqueue vmlinux | grep size
/* size: 512, cachelines: 8, members: 15 */
achender@workbox:~/mnt/net-next_coverage$ pahole -C workqueue_attrs | grep size
/* size: 24, cachelines: 1, members: 3 */
achender@workbox:~/mnt/net-next_coverage$ pahole -C wq_node_nr_active vmlinux | grep size
/* size: 32, cachelines: 1, members: 4 */
Also, there is at least one 8-byte pointer in the node_nr_active flex array at the bottom of the workqueue_struct.
That brings us to 320 + 512 + 24 + 32 + 8 = 896 bytes on a single-node x86_64 system. The node_nr_active flex array is
as large as the number of nodes, so in a multi-node environment the total grows by (32 + 8) bytes per node. So the
~800-byte hand wave is a little low. If you agree with the above accounting, we can include the breakdown in the
commit message, or just bump up the ballpark figure if that's too wordy.
> Much more importantly, using a workqueue per connection provides a
> scalability gain only to the extent that each workqueue uses a
> different pool and thus creates additional kthread(s). I haven't dived
> into the workqueue implementation, but I think this is not the case. My
> current guesstimate is that you measure some gain because the per-
> connection WQ actually creates (or just uses) a single pool different
> from rds_wq's.
>
> Please double check the above.
Sure, so I don't think they get their own pool; they're allocated a share of an existing pool. In
__rds_conn_create, we have this call stack where the pool_workqueue gets linked to the queue:
__rds_conn_create -> __alloc_workqueue -> alloc_and_link_pwqs -> kmem_cache_alloc_node
They do, however, get their own kworker. If we look at alloc_ordered_workqueue in workqueue.h, we get this:
#define alloc_ordered_workqueue(fmt, flags, args...) \
	alloc_workqueue(fmt, WQ_UNBOUND | __WQ_ORDERED | (flags), 1, ##args)
So the queues are unbound and ordered, and the 1 is the max_active limit; they are allowed to spawn at most one
kworker for that queue.
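Concretely, that means the allocation in patch 2 expands to roughly this
(just tracing the macro above):

	conn->c_path[i].cp_wq =
		alloc_workqueue("krds_cp_wq#%lu/%d",
				WQ_UNBOUND | __WQ_ORDERED, /* unbound + ordered */
				1,                         /* max_active = 1   */
				rds_conn_count, i);

i.e. an unbound queue limited to one in-flight work item, which is what
gives each path its FIFO behavior.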
>
> Out of sheer ignorance, I suspect/hope that replacing the current
> workqueue with alloc_ordered_workqueue() (possibly UNBOUND?!?) will give
> the same scalability improvement at no cost.
>
> /P
>
No worries. From looking at the alloc_ordered_workqueue macro, it looks like the queues are unbound by default. But if
we had only one unbound queue with one worker, then we'd still have the head-of-line blocking we had before.
It looks to me like ordered workqueues imply having only one worker (trying to set more than that on an ordered queue
will just error out in workqueue_set_max_active). And without the ordering, we lose the FIFO behavior we need. I
think the performance bump we are seeing isn't so much from more workers or pools; it's just from getting away from
the cross-connection serialization.
I hope that helps? Let me know if I missed anything or if you think the commit message should be amended in any way.
Thank you!
Allison