[PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"

All of lore.kernel.org
 help / color / mirror / Atom feed

* [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"
@ 2025-01-03  1:00 cel
  2025-01-03  1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel
  2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton
  0 siblings, 2 replies; 11+ messages in thread
From: cel @ 2025-01-03  1:00 UTC (permalink / raw)
  To: Neil Brown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Jakub Kicinski

From: Chuck Lever <chuck.lever@oracle.com>

I noticed that a handful of NFSv3 fstests were taking an
unexpectedly long time to run. Troubleshooting showed that the
server's TCP window closed and never re-opened, which caused the
client to trigger an RPC retransmit timeout after 180 seconds.

The client's recovery action was to establish a fresh connection
and retransmit the timed-out requests. This worked, but it adds a
long delay.

I tracked the problem to the commit that attempted to reduce the
rate at which the network layer delivers TCP socket data_ready
callbacks. Under most circumstances this change worked as expected,
but for NFSv3, which has no session or other type of throttling, it
can overwhelm the receiver on occasion.

I'm sure I could tweak the lowat settings, but the small benefit
doesn't seem worth the bother. Just revert it.

Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages")
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 net/sunrpc/svcsock.c | 12 +-----------
 1 file changed, 1 insertion(+), 11 deletions(-)

diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 95397677673b..cb3bd12f5818 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1083,9 +1083,6 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
 	/* If we have more data, signal svc_xprt_enqueue() to try again */
 	svsk->sk_tcplen = 0;
 	svsk->sk_marker = xdr_zero;
-
-	smp_wmb();
-	tcp_set_rcvlowat(svsk->sk_sk, 1);
 }
 
 /**
@@ -1175,17 +1172,10 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
 		goto err_delete;
 	if (len == want)
 		svc_tcp_fragment_received(svsk);
-	else {
-		/* Avoid more ->sk_data_ready() calls until the rest
-		 * of the message has arrived. This reduces service
-		 * thread wake-ups on large incoming messages. */
-		tcp_set_rcvlowat(svsk->sk_sk,
-				 svc_sock_reclen(svsk) - svsk->sk_tcplen);
-
+	else
 		trace_svcsock_tcp_recv_short(&svsk->sk_xprt,
 				svc_sock_reclen(svsk),
 				svsk->sk_tcplen - sizeof(rpc_fraghdr));
-	}
 	goto err_noclose;
 error:
 	if (len != -EAGAIN)
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-03  1:00 [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" cel
@ 2025-01-03  1:00 ` cel
  2025-01-03 14:21   ` Jeff Layton
  2025-01-03 22:50   ` NeilBrown
  2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton
  1 sibling, 2 replies; 11+ messages in thread
From: cel @ 2025-01-03  1:00 UTC (permalink / raw)
  To: Neil Brown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Tejun Heo

From: Chuck Lever <chuck.lever@oracle.com>

Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem
complaining about how long the filecache laundrette was running.
This resulted in switching from using the system_wq for the
laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use
system_unbound_wq for nfsd_file_gc_worker()").

However, I've seen the laundrette running for multiple milliseconds
on some workloads, delaying other work. For the purpose of
scheduling fairness, perhaps a better choice would be to process
filecache disposal queues on the system_long_wq instead.

Cc: Tejun Heo <tj@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
 fs/nfsd/filecache.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a1cdba42c4fa..91a535c2dede 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -112,7 +112,7 @@ static void
 nfsd_file_schedule_laundrette(void)
 {
 	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
-		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
+		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
 				   NFSD_LAUNDRETTE_DELAY);
 }
 
-- 
2.47.0


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"
  2025-01-03  1:00 [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" cel
  2025-01-03  1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel
@ 2025-01-03 14:11 ` Jeff Layton
  2025-01-04  0:44   ` NeilBrown
  1 sibling, 1 reply; 11+ messages in thread
From: Jeff Layton @ 2025-01-03 14:11 UTC (permalink / raw)
  To: cel, Neil Brown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Jakub Kicinski

On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> I noticed that a handful of NFSv3 fstests were taking an
> unexpectedly long time to run. Troubleshooting showed that the
> server's TCP window closed and never re-opened, which caused the
> client to trigger an RPC retransmit timeout after 180 seconds.
> 
> The client's recovery action was to establish a fresh connection
> and retransmit the timed-out requests. This worked, but it adds a
> long delay.
> 
> I tracked the problem to the commit that attempted to reduce the
> rate at which the network layer delivers TCP socket data_ready
> callbacks. Under most circumstances this change worked as expected,
> but for NFSv3, which has no session or other type of throttling, it
> can overwhelm the receiver on occasion.
> 
> I'm sure I could tweak the lowat settings, but the small benefit
> doesn't seem worth the bother. Just revert it.
> 
> Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages")
> Cc: Jakub Kicinski <kuba@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  net/sunrpc/svcsock.c | 12 +-----------
>  1 file changed, 1 insertion(+), 11 deletions(-)
> 
> diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> index 95397677673b..cb3bd12f5818 100644
> --- a/net/sunrpc/svcsock.c
> +++ b/net/sunrpc/svcsock.c
> @@ -1083,9 +1083,6 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
>  	/* If we have more data, signal svc_xprt_enqueue() to try again */
>  	svsk->sk_tcplen = 0;
>  	svsk->sk_marker = xdr_zero;
> -
> -	smp_wmb();
> -	tcp_set_rcvlowat(svsk->sk_sk, 1);
>  }
>  
>  /**
> @@ -1175,17 +1172,10 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
>  		goto err_delete;
>  	if (len == want)
>  		svc_tcp_fragment_received(svsk);
> -	else {
> -		/* Avoid more ->sk_data_ready() calls until the rest
> -		 * of the message has arrived. This reduces service
> -		 * thread wake-ups on large incoming messages. */
> -		tcp_set_rcvlowat(svsk->sk_sk,
> -				 svc_sock_reclen(svsk) - svsk->sk_tcplen);

Could we instead clamp this to the window size so that we at least only
do a wakeup for each window-size chunk?

> -
> +	else
>  		trace_svcsock_tcp_recv_short(&svsk->sk_xprt,
>  				svc_sock_reclen(svsk),
>  				svsk->sk_tcplen - sizeof(rpc_fraghdr));
> -	}
>  	goto err_noclose;
>  error:
>  	if (len != -EAGAIN)

Reverting for now is fine though.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-03  1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel
@ 2025-01-03 14:21   ` Jeff Layton
  2025-01-03 22:50   ` NeilBrown
  1 sibling, 0 replies; 11+ messages in thread
From: Jeff Layton @ 2025-01-03 14:21 UTC (permalink / raw)
  To: cel, Neil Brown, Olga Kornievskaia, Dai Ngo, Tom Talpey
  Cc: linux-nfs, Chuck Lever, Tejun Heo

On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem
> complaining about how long the filecache laundrette was running.
> This resulted in switching from using the system_wq for the
> laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use
> system_unbound_wq for nfsd_file_gc_worker()").
> 
> However, I've seen the laundrette running for multiple milliseconds
> on some workloads, delaying other work. For the purpose of
> scheduling fairness, perhaps a better choice would be to process
> filecache disposal queues on the system_long_wq instead.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/filecache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..91a535c2dede 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -112,7 +112,7 @@ static void
>  nfsd_file_schedule_laundrette(void)
>  {
>  	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
> -		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
> +		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
>  				   NFSD_LAUNDRETTE_DELAY);
>  }
>  

TIL that there is system_long_wq! Seems like a reasonable thing to do
since this is generally low-priority work.

Reviewed-by: Jeff Layton <jlayton@kernel.org>

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-03  1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel
  2025-01-03 14:21   ` Jeff Layton
@ 2025-01-03 22:50   ` NeilBrown
  2025-01-03 22:53     ` Tejun Heo
  1 sibling, 1 reply; 11+ messages in thread
From: NeilBrown @ 2025-01-03 22:50 UTC (permalink / raw)
  To: cel
  Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Tejun Heo

On Fri, 03 Jan 2025, cel@kernel.org wrote:
> From: Chuck Lever <chuck.lever@oracle.com>
> 
> Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem
> complaining about how long the filecache laundrette was running.
> This resulted in switching from using the system_wq for the
> laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use
> system_unbound_wq for nfsd_file_gc_worker()").
> 
> However, I've seen the laundrette running for multiple milliseconds
> on some workloads, delaying other work. For the purpose of
> scheduling fairness, perhaps a better choice would be to process
> filecache disposal queues on the system_long_wq instead.
> 
> Cc: Tejun Heo <tj@kernel.org>
> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> ---
>  fs/nfsd/filecache.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..91a535c2dede 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -112,7 +112,7 @@ static void
>  nfsd_file_schedule_laundrette(void)
>  {
>  	if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
> -		queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
> +		queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette,
>  				   NFSD_LAUNDRETTE_DELAY);

This doesn't seem right to me.
"unbound_wq" is not bound to an CPU.  It just gets allocated that thread
which shouldn't interfere with any other work (except for normal
scheduling contention) until the queue limit is reach, which is unlikely
to happen often.

"long_wq" IS bound to a CPU and is concurrency managed so it could
interfere with other things.  The idea is that work items scheduled
there are not in a hurry and could take longer.  But I dont' think they
should take very long...

I think the problem is elsewhere.
list_lru_walk() can hold a spin lock for as long as it takes to walk
nr_to_walk objects.  This will tie up the CPU no matter which work queue
it is running on.

I think that instead of passing "list_lru_count()" we should pass some
constant like 1024.

cnt = list_lru_count()
while (cnt > 0) {
   num = min(cnt, 1024);
   list_lru_walk(...., num);
   cond_sched()
   cnt -= num;
}

Then run it from system_wq.

list_lru_shrink is most often called as list_lru_shrink_walk() from a
shrinker, and the pattern there is essentially that above.  A count is
taken, possibly scaled down, then the shrinker is called in batches.

NeilBrown

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-03 22:50   ` NeilBrown
@ 2025-01-03 22:53     ` Tejun Heo
  2025-01-04 15:37       ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: Tejun Heo @ 2025-01-03 22:53 UTC (permalink / raw)
  To: NeilBrown
  Cc: cel, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs, Chuck Lever

On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> On Fri, 03 Jan 2025, cel@kernel.org wrote:
...
> I think that instead of passing "list_lru_count()" we should pass some
> constant like 1024.
> 
> cnt = list_lru_count()
> while (cnt > 0) {
>    num = min(cnt, 1024);
>    list_lru_walk(...., num);
>    cond_sched()
>    cnt -= num;
> }
> 
> Then run it from system_wq.
> 
> list_lru_shrink is most often called as list_lru_shrink_walk() from a
> shrinker, and the pattern there is essentially that above.  A count is
> taken, possibly scaled down, then the shrinker is called in batches.

BTW, there's nothing wrong with taking some msecs or even tens of msecs
running on system_unbound_wq, so the current state may be fine too.

Thanks.

-- 
tejun

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"
  2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton
@ 2025-01-04  0:44   ` NeilBrown
  0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2025-01-04  0:44 UTC (permalink / raw)
  To: Jeff Layton
  Cc: cel, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs,
	Chuck Lever, Jakub Kicinski

On Sat, 04 Jan 2025, Jeff Layton wrote:
> On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote:
> > From: Chuck Lever <chuck.lever@oracle.com>
> > 
> > I noticed that a handful of NFSv3 fstests were taking an
> > unexpectedly long time to run. Troubleshooting showed that the
> > server's TCP window closed and never re-opened, which caused the
> > client to trigger an RPC retransmit timeout after 180 seconds.
> > 
> > The client's recovery action was to establish a fresh connection
> > and retransmit the timed-out requests. This worked, but it adds a
> > long delay.
> > 
> > I tracked the problem to the commit that attempted to reduce the
> > rate at which the network layer delivers TCP socket data_ready
> > callbacks. Under most circumstances this change worked as expected,
> > but for NFSv3, which has no session or other type of throttling, it
> > can overwhelm the receiver on occasion.
> > 
> > I'm sure I could tweak the lowat settings, but the small benefit
> > doesn't seem worth the bother. Just revert it.
> > 
> > Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages")
> > Cc: Jakub Kicinski <kuba@kernel.org>
> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
> > ---
> >  net/sunrpc/svcsock.c | 12 +-----------
> >  1 file changed, 1 insertion(+), 11 deletions(-)
> > 
> > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
> > index 95397677673b..cb3bd12f5818 100644
> > --- a/net/sunrpc/svcsock.c
> > +++ b/net/sunrpc/svcsock.c
> > @@ -1083,9 +1083,6 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
> >  	/* If we have more data, signal svc_xprt_enqueue() to try again */
> >  	svsk->sk_tcplen = 0;
> >  	svsk->sk_marker = xdr_zero;
> > -
> > -	smp_wmb();
> > -	tcp_set_rcvlowat(svsk->sk_sk, 1);
> >  }
> >  
> >  /**
> > @@ -1175,17 +1172,10 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
> >  		goto err_delete;
> >  	if (len == want)
> >  		svc_tcp_fragment_received(svsk);
> > -	else {
> > -		/* Avoid more ->sk_data_ready() calls until the rest
> > -		 * of the message has arrived. This reduces service
> > -		 * thread wake-ups on large incoming messages. */
> > -		tcp_set_rcvlowat(svsk->sk_sk,
> > -				 svc_sock_reclen(svsk) - svsk->sk_tcplen);
> 
> Could we instead clamp this to the window size so that we at least only
> do a wakeup for each window-size chunk?

tcp_set_rcvlowat() already imposes and upper limit - based on configured
queue space.  So I don't think we need to clamp.

I note that the calculation is wrong. sk_tcplen includes the size of the
4byte length header, but svc_sock_reclen() does not.  Other places there
these two numbers are compared, sizeof(rpc_fraghdr) is subtracted from
sk_tcplen.
So there is a corner case where the size passed to tcp_set_rcvlowat() is
negative.  The two numbers in the subtraction are u32 and the function
expects "int".
I didn't dig enough to understand how an negative would be handled.

> 
> > -
> > +	else
> >  		trace_svcsock_tcp_recv_short(&svsk->sk_xprt,
> >  				svc_sock_reclen(svsk),
> >  				svsk->sk_tcplen - sizeof(rpc_fraghdr));
> > -	}
> >  	goto err_noclose;
> >  error:
> >  	if (len != -EAGAIN)
> 
> Reverting for now is fine though.

Agreed.  I would expect modern network cards to deliver fairly large
packets to the application already.  Still, it would be nice to
understand what is really happening.  Setting rcvlowat shouldn't be able
to cause the window size to go to zero.

NeilBrown


> 
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-03 22:53     ` Tejun Heo
@ 2025-01-04 15:37       ` Chuck Lever
  2025-01-05 22:28         ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2025-01-04 15:37 UTC (permalink / raw)
  To: Tejun Heo, NeilBrown
  Cc: cel, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey,
	linux-nfs

On 1/3/25 5:53 PM, Tejun Heo wrote:
> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
> ...
>> I think that instead of passing "list_lru_count()" we should pass some
>> constant like 1024.
>>
>> cnt = list_lru_count()
>> while (cnt > 0) {
>>     num = min(cnt, 1024);
>>     list_lru_walk(...., num);
>>     cond_sched()
>>     cnt -= num;
>> }
>>
>> Then run it from system_wq.
>>
>> list_lru_shrink is most often called as list_lru_shrink_walk() from a
>> shrinker, and the pattern there is essentially that above.  A count is
>> taken, possibly scaled down, then the shrinker is called in batches.
> 
> BTW, there's nothing wrong with taking some msecs or even tens of msecs
> running on system_unbound_wq, so the current state may be fine too.

My thinking was that this work is low priority, so there should be
plenty of opportunity to set it aside for a few moments and handle
higher priority work. Maybe not worrisome on systems with a high core
count, but on small SMP (eg VM guests), I've found that tasks like this
can be rude neighbors.

We could do this by adding a cond_resched() call site in the loop,
or take Neil's suggestion of breaking up the free list across multiple
work items that handle one or just a few file releases each.

-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-04 15:37       ` Chuck Lever
@ 2025-01-05 22:28         ` NeilBrown
  2025-01-06  1:39           ` Chuck Lever
  0 siblings, 1 reply; 11+ messages in thread
From: NeilBrown @ 2025-01-05 22:28 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Tejun Heo, cel, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs

On Sun, 05 Jan 2025, Chuck Lever wrote:
> On 1/3/25 5:53 PM, Tejun Heo wrote:
> > On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> >> On Fri, 03 Jan 2025, cel@kernel.org wrote:
> > ...
> >> I think that instead of passing "list_lru_count()" we should pass some
> >> constant like 1024.
> >>
> >> cnt = list_lru_count()
> >> while (cnt > 0) {
> >>     num = min(cnt, 1024);
> >>     list_lru_walk(...., num);
> >>     cond_sched()
> >>     cnt -= num;
> >> }
> >>
> >> Then run it from system_wq.
> >>
> >> list_lru_shrink is most often called as list_lru_shrink_walk() from a
> >> shrinker, and the pattern there is essentially that above.  A count is
> >> taken, possibly scaled down, then the shrinker is called in batches.
> > 
> > BTW, there's nothing wrong with taking some msecs or even tens of msecs
> > running on system_unbound_wq, so the current state may be fine too.
> 
> My thinking was that this work is low priority, so there should be
> plenty of opportunity to set it aside for a few moments and handle
> higher priority work. Maybe not worrisome on systems with a high core
> count, but on small SMP (eg VM guests), I've found that tasks like this
> can be rude neighbors.

None of the different work queues are expected to "set aside" the work
for more than a normal scheduling time slice.
system_long_wq was created "so that flush_scheduled_work() doesn't take
so long" but flush_scheduled_work() is never called now (and would
generate a warning if it was) so system_long_wq should really be
removed. 

system_unbound_wq uses threads that are not bound to a CPU so the
scheduler can move them around.  That is most suitable for something
that might run for a long time.

system_wq is bound to the CPU that schedules the work, but it can run
multiple work items - potentially long running ones - without trouble by
helping the scheduler share the CPU among them.  This requires that they
can sleep of course.

I haven't seen the actually symptoms that resulted in the various
changes to this code, but that last point is I think the key one.  The
work item, like all kernel code, needs to have scheduler points if it
might run for a longish time.  list_lru_walk() doesn't contain any
scheduling points so giving a large "nr_to_walk" is always a bad idea.

> 
> We could do this by adding a cond_resched() call site in the loop,
> or take Neil's suggestion of breaking up the free list across multiple
> work items that handle one or just a few file releases each.

I guess I should send a proper patch.

NeilBrown

> 
> -- 
> Chuck Lever
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-05 22:28         ` NeilBrown
@ 2025-01-06  1:39           ` Chuck Lever
  2025-01-06  2:47             ` NeilBrown
  0 siblings, 1 reply; 11+ messages in thread
From: Chuck Lever @ 2025-01-06  1:39 UTC (permalink / raw)
  To: NeilBrown
  Cc: Tejun Heo, cel, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs

On 1/5/25 5:28 PM, NeilBrown wrote:
> On Sun, 05 Jan 2025, Chuck Lever wrote:
>> On 1/3/25 5:53 PM, Tejun Heo wrote:
>>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
>>>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
>>> ...
>>>> I think that instead of passing "list_lru_count()" we should pass some
>>>> constant like 1024.
>>>>
>>>> cnt = list_lru_count()
>>>> while (cnt > 0) {
>>>>      num = min(cnt, 1024);
>>>>      list_lru_walk(...., num);
>>>>      cond_sched()
>>>>      cnt -= num;
>>>> }
>>>>
>>>> Then run it from system_wq.
>>>>
>>>> list_lru_shrink is most often called as list_lru_shrink_walk() from a
>>>> shrinker, and the pattern there is essentially that above.  A count is
>>>> taken, possibly scaled down, then the shrinker is called in batches.
>>>
>>> BTW, there's nothing wrong with taking some msecs or even tens of msecs
>>> running on system_unbound_wq, so the current state may be fine too.
>>
>> My thinking was that this work is low priority, so there should be
>> plenty of opportunity to set it aside for a few moments and handle
>> higher priority work. Maybe not worrisome on systems with a high core
>> count, but on small SMP (eg VM guests), I've found that tasks like this
>> can be rude neighbors.
> 
> None of the different work queues are expected to "set aside" the work
> for more than a normal scheduling time slice.
> system_long_wq was created "so that flush_scheduled_work() doesn't take
> so long" but flush_scheduled_work() is never called now (and would
> generate a warning if it was) so system_long_wq should really be
> removed.
> 
> system_unbound_wq uses threads that are not bound to a CPU so the
> scheduler can move them around.  That is most suitable for something
> that might run for a long time.

That's not my understanding: unbound selects a CPU to run the work
item on, and it can be (and usually is) not the same CPU as the one
that invoked queue_work(). Then it isn't rescheduled. The work items
are expected to complete quickly; work item termination is the typical
reschedule point. My understanding, as always, could be stale.

But that's neither here nor there: I've dropped the patch to replace
system_unbound_wq with system_long_wq.


> system_wq is bound to the CPU that schedules the work, but it can run
> multiple work items - potentially long running ones - without trouble by
> helping the scheduler share the CPU among them.  This requires that they
> can sleep of course.
> 
> I haven't seen the actually symptoms that resulted in the various
> changes to this code, but that last point is I think the key one.  The
> work item, like all kernel code, needs to have scheduler points if it
> might run for a longish time.  list_lru_walk() doesn't contain any
> scheduling points so giving a large "nr_to_walk" is always a bad idea.

That might be a overstated... I don't see other consumers that are so
concerned about rescheduling. But then it's not clear if they are
dealing with lengthy LRU lists.


>> We could do this by adding a cond_resched() call site in the loop,
>> or take Neil's suggestion of breaking up the free list across multiple
>> work items that handle one or just a few file releases each.
> 
> I guess I should send a proper patch.

More comments in reply to that patch. Generally speaking, I'm
comfortable chopping up the work as you propose, we just have to
address some details.


-- 
Chuck Lever

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again
  2025-01-06  1:39           ` Chuck Lever
@ 2025-01-06  2:47             ` NeilBrown
  0 siblings, 0 replies; 11+ messages in thread
From: NeilBrown @ 2025-01-06  2:47 UTC (permalink / raw)
  To: Chuck Lever
  Cc: Tejun Heo, cel, Jeff Layton, Olga Kornievskaia, Dai Ngo,
	Tom Talpey, linux-nfs

On Mon, 06 Jan 2025, Chuck Lever wrote:
> On 1/5/25 5:28 PM, NeilBrown wrote:
> > On Sun, 05 Jan 2025, Chuck Lever wrote:
> >> On 1/3/25 5:53 PM, Tejun Heo wrote:
> >>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote:
> >>>> On Fri, 03 Jan 2025, cel@kernel.org wrote:
> >>> ...
> >>>> I think that instead of passing "list_lru_count()" we should pass some
> >>>> constant like 1024.
> >>>>
> >>>> cnt = list_lru_count()
> >>>> while (cnt > 0) {
> >>>>      num = min(cnt, 1024);
> >>>>      list_lru_walk(...., num);
> >>>>      cond_sched()
> >>>>      cnt -= num;
> >>>> }
> >>>>
> >>>> Then run it from system_wq.
> >>>>
> >>>> list_lru_shrink is most often called as list_lru_shrink_walk() from a
> >>>> shrinker, and the pattern there is essentially that above.  A count is
> >>>> taken, possibly scaled down, then the shrinker is called in batches.
> >>>
> >>> BTW, there's nothing wrong with taking some msecs or even tens of msecs
> >>> running on system_unbound_wq, so the current state may be fine too.
> >>
> >> My thinking was that this work is low priority, so there should be
> >> plenty of opportunity to set it aside for a few moments and handle
> >> higher priority work. Maybe not worrisome on systems with a high core
> >> count, but on small SMP (eg VM guests), I've found that tasks like this
> >> can be rude neighbors.
> > 
> > None of the different work queues are expected to "set aside" the work
> > for more than a normal scheduling time slice.
> > system_long_wq was created "so that flush_scheduled_work() doesn't take
> > so long" but flush_scheduled_work() is never called now (and would
> > generate a warning if it was) so system_long_wq should really be
> > removed.
> > 
> > system_unbound_wq uses threads that are not bound to a CPU so the
> > scheduler can move them around.  That is most suitable for something
> > that might run for a long time.
> 
> That's not my understanding: unbound selects a CPU to run the work
> item on, and it can be (and usually is) not the same CPU as the one
> that invoked queue_work(). Then it isn't rescheduled. The work items
> are expected to complete quickly; work item termination is the typical
> reschedule point. My understanding, as always, could be stale.

If the work item will never schedule, then there isn't much point in
using a work queue.  The code can be run inline, or with a timer.

> 
> But that's neither here nor there: I've dropped the patch to replace
> system_unbound_wq with system_long_wq.
> 
> 
> > system_wq is bound to the CPU that schedules the work, but it can run
> > multiple work items - potentially long running ones - without trouble by
> > helping the scheduler share the CPU among them.  This requires that they
> > can sleep of course.
> > 
> > I haven't seen the actually symptoms that resulted in the various
> > changes to this code, but that last point is I think the key one.  The
> > work item, like all kernel code, needs to have scheduler points if it
> > might run for a longish time.  list_lru_walk() doesn't contain any
> > scheduling points so giving a large "nr_to_walk" is always a bad idea.
> 
> That might be a overstated... I don't see other consumers that are so
> concerned about rescheduling. But then it's not clear if they are
> dealing with lengthy LRU lists.

There are very few callers of list_lru_walk().
shrink_dcache_sb() is the closest analogue and it uses batches of 1024.
xfs_buftarg_drain() uses batches of LONG_MAX !!!

Mostly the lru is walked in a shrinker callback, and the shrinker
manages the batch size.

Thanks,
NeilBrown

> 
> 
> >> We could do this by adding a cond_resched() call site in the loop,
> >> or take Neil's suggestion of breaking up the free list across multiple
> >> work items that handle one or just a few file releases each.
> > 
> > I guess I should send a proper patch.
> 
> More comments in reply to that patch. Generally speaking, I'm
> comfortable chopping up the work as you propose, we just have to
> address some details.
> 
> 
> -- 
> Chuck Lever
> 


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2025-01-06  2:48 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz follow: Atom feed
-- links below jump to the message on this page --
2025-01-03  1:00 [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" cel
2025-01-03  1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel
2025-01-03 14:21   ` Jeff Layton
2025-01-03 22:50   ` NeilBrown
2025-01-03 22:53     ` Tejun Heo
2025-01-04 15:37       ` Chuck Lever
2025-01-05 22:28         ` NeilBrown
2025-01-06  1:39           ` Chuck Lever
2025-01-06  2:47             ` NeilBrown
2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton
2025-01-04  0:44   ` NeilBrown

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.