* [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages"
@ 2025-01-03 1:00 cel
2025-01-03 1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel
2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton
0 siblings, 2 replies; 11+ messages in thread
From: cel @ 2025-01-03 1:00 UTC (permalink / raw)
To: Neil Brown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey
Cc: linux-nfs, Chuck Lever, Jakub Kicinski
From: Chuck Lever <chuck.lever@oracle.com>
I noticed that a handful of NFSv3 fstests were taking an
unexpectedly long time to run. Troubleshooting showed that the
server's TCP window closed and never re-opened, which caused the
client to trigger an RPC retransmit timeout after 180 seconds.
The client's recovery action was to establish a fresh connection
and retransmit the timed-out requests. This worked, but it adds a
long delay.
I tracked the problem to the commit that attempted to reduce the
rate at which the network layer delivers TCP socket data_ready
callbacks. Under most circumstances this change worked as expected,
but for NFSv3, which has no session or other type of throttling, it
can overwhelm the receiver on occasion.
I'm sure I could tweak the lowat settings, but the small benefit
doesn't seem worth the bother. Just revert it.
Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages")
Cc: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
---
net/sunrpc/svcsock.c | 12 +-----------
1 file changed, 1 insertion(+), 11 deletions(-)
diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c
index 95397677673b..cb3bd12f5818 100644
--- a/net/sunrpc/svcsock.c
+++ b/net/sunrpc/svcsock.c
@@ -1083,9 +1083,6 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk)
/* If we have more data, signal svc_xprt_enqueue() to try again */
svsk->sk_tcplen = 0;
svsk->sk_marker = xdr_zero;
-
- smp_wmb();
- tcp_set_rcvlowat(svsk->sk_sk, 1);
}
/**
@@ -1175,17 +1172,10 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp)
goto err_delete;
if (len == want)
svc_tcp_fragment_received(svsk);
- else {
- /* Avoid more ->sk_data_ready() calls until the rest
- * of the message has arrived. This reduces service
- * thread wake-ups on large incoming messages. */
- tcp_set_rcvlowat(svsk->sk_sk,
- svc_sock_reclen(svsk) - svsk->sk_tcplen);
-
+ else
trace_svcsock_tcp_recv_short(&svsk->sk_xprt,
svc_sock_reclen(svsk),
svsk->sk_tcplen - sizeof(rpc_fraghdr));
- }
goto err_noclose;
error:
if (len != -EAGAIN)
--
2.47.0
^ permalink raw reply related [flat|nested] 11+ messages in thread* [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-03 1:00 [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" cel @ 2025-01-03 1:00 ` cel 2025-01-03 14:21 ` Jeff Layton 2025-01-03 22:50 ` NeilBrown 2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton 1 sibling, 2 replies; 11+ messages in thread From: cel @ 2025-01-03 1:00 UTC (permalink / raw) To: Neil Brown, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey Cc: linux-nfs, Chuck Lever, Tejun Heo From: Chuck Lever <chuck.lever@oracle.com> Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem complaining about how long the filecache laundrette was running. This resulted in switching from using the system_wq for the laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use system_unbound_wq for nfsd_file_gc_worker()"). However, I've seen the laundrette running for multiple milliseconds on some workloads, delaying other work. For the purpose of scheduling fairness, perhaps a better choice would be to process filecache disposal queues on the system_long_wq instead. Cc: Tejun Heo <tj@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com> --- fs/nfsd/filecache.c | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c index a1cdba42c4fa..91a535c2dede 100644 --- a/fs/nfsd/filecache.c +++ b/fs/nfsd/filecache.c @@ -112,7 +112,7 @@ static void nfsd_file_schedule_laundrette(void) { if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags)) - queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette, + queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette, NFSD_LAUNDRETTE_DELAY); } -- 2.47.0 ^ permalink raw reply related [flat|nested] 11+ messages in thread
* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-03 1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel @ 2025-01-03 14:21 ` Jeff Layton 2025-01-03 22:50 ` NeilBrown 1 sibling, 0 replies; 11+ messages in thread From: Jeff Layton @ 2025-01-03 14:21 UTC (permalink / raw) To: cel, Neil Brown, Olga Kornievskaia, Dai Ngo, Tom Talpey Cc: linux-nfs, Chuck Lever, Tejun Heo On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote: > From: Chuck Lever <chuck.lever@oracle.com> > > Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem > complaining about how long the filecache laundrette was running. > This resulted in switching from using the system_wq for the > laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use > system_unbound_wq for nfsd_file_gc_worker()"). > > However, I've seen the laundrette running for multiple milliseconds > on some workloads, delaying other work. For the purpose of > scheduling fairness, perhaps a better choice would be to process > filecache disposal queues on the system_long_wq instead. > > Cc: Tejun Heo <tj@kernel.org> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- > fs/nfsd/filecache.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index a1cdba42c4fa..91a535c2dede 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -112,7 +112,7 @@ static void > nfsd_file_schedule_laundrette(void) > { > if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags)) > - queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette, > + queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette, > NFSD_LAUNDRETTE_DELAY); > } > TIL that there is system_long_wq! Seems like a reasonable thing to do since this is generally low-priority work. Reviewed-by: Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-03 1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel 2025-01-03 14:21 ` Jeff Layton @ 2025-01-03 22:50 ` NeilBrown 2025-01-03 22:53 ` Tejun Heo 1 sibling, 1 reply; 11+ messages in thread From: NeilBrown @ 2025-01-03 22:50 UTC (permalink / raw) To: cel Cc: Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Tejun Heo On Fri, 03 Jan 2025, cel@kernel.org wrote: > From: Chuck Lever <chuck.lever@oracle.com> > > Youzhong Yang <youzhong@gmail.com> noticed the workqueue subsystem > complaining about how long the filecache laundrette was running. > This resulted in switching from using the system_wq for the > laundrette to system_unbound_wq (see commit 4b84551a35e3 ("nfsd: use > system_unbound_wq for nfsd_file_gc_worker()"). > > However, I've seen the laundrette running for multiple milliseconds > on some workloads, delaying other work. For the purpose of > scheduling fairness, perhaps a better choice would be to process > filecache disposal queues on the system_long_wq instead. > > Cc: Tejun Heo <tj@kernel.org> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- > fs/nfsd/filecache.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c > index a1cdba42c4fa..91a535c2dede 100644 > --- a/fs/nfsd/filecache.c > +++ b/fs/nfsd/filecache.c > @@ -112,7 +112,7 @@ static void > nfsd_file_schedule_laundrette(void) > { > if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags)) > - queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette, > + queue_delayed_work(system_long_wq, &nfsd_filecache_laundrette, > NFSD_LAUNDRETTE_DELAY); This doesn't seem right to me. "unbound_wq" is not bound to an CPU. It just gets allocated that thread which shouldn't interfere with any other work (except for normal scheduling contention) until the queue limit is reach, which is unlikely to happen often. "long_wq" IS bound to a CPU and is concurrency managed so it could interfere with other things. The idea is that work items scheduled there are not in a hurry and could take longer. But I dont' think they should take very long... I think the problem is elsewhere. list_lru_walk() can hold a spin lock for as long as it takes to walk nr_to_walk objects. This will tie up the CPU no matter which work queue it is running on. I think that instead of passing "list_lru_count()" we should pass some constant like 1024. cnt = list_lru_count() while (cnt > 0) { num = min(cnt, 1024); list_lru_walk(...., num); cond_sched() cnt -= num; } Then run it from system_wq. list_lru_shrink is most often called as list_lru_shrink_walk() from a shrinker, and the pattern there is essentially that above. A count is taken, possibly scaled down, then the shrinker is called in batches. NeilBrown ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-03 22:50 ` NeilBrown @ 2025-01-03 22:53 ` Tejun Heo 2025-01-04 15:37 ` Chuck Lever 0 siblings, 1 reply; 11+ messages in thread From: Tejun Heo @ 2025-01-03 22:53 UTC (permalink / raw) To: NeilBrown Cc: cel, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote: > On Fri, 03 Jan 2025, cel@kernel.org wrote: ... > I think that instead of passing "list_lru_count()" we should pass some > constant like 1024. > > cnt = list_lru_count() > while (cnt > 0) { > num = min(cnt, 1024); > list_lru_walk(...., num); > cond_sched() > cnt -= num; > } > > Then run it from system_wq. > > list_lru_shrink is most often called as list_lru_shrink_walk() from a > shrinker, and the pattern there is essentially that above. A count is > taken, possibly scaled down, then the shrinker is called in batches. BTW, there's nothing wrong with taking some msecs or even tens of msecs running on system_unbound_wq, so the current state may be fine too. Thanks. -- tejun ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-03 22:53 ` Tejun Heo @ 2025-01-04 15:37 ` Chuck Lever 2025-01-05 22:28 ` NeilBrown 0 siblings, 1 reply; 11+ messages in thread From: Chuck Lever @ 2025-01-04 15:37 UTC (permalink / raw) To: Tejun Heo, NeilBrown Cc: cel, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs On 1/3/25 5:53 PM, Tejun Heo wrote: > On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote: >> On Fri, 03 Jan 2025, cel@kernel.org wrote: > ... >> I think that instead of passing "list_lru_count()" we should pass some >> constant like 1024. >> >> cnt = list_lru_count() >> while (cnt > 0) { >> num = min(cnt, 1024); >> list_lru_walk(...., num); >> cond_sched() >> cnt -= num; >> } >> >> Then run it from system_wq. >> >> list_lru_shrink is most often called as list_lru_shrink_walk() from a >> shrinker, and the pattern there is essentially that above. A count is >> taken, possibly scaled down, then the shrinker is called in batches. > > BTW, there's nothing wrong with taking some msecs or even tens of msecs > running on system_unbound_wq, so the current state may be fine too. My thinking was that this work is low priority, so there should be plenty of opportunity to set it aside for a few moments and handle higher priority work. Maybe not worrisome on systems with a high core count, but on small SMP (eg VM guests), I've found that tasks like this can be rude neighbors. We could do this by adding a cond_resched() call site in the loop, or take Neil's suggestion of breaking up the free list across multiple work items that handle one or just a few file releases each. -- Chuck Lever ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-04 15:37 ` Chuck Lever @ 2025-01-05 22:28 ` NeilBrown 2025-01-06 1:39 ` Chuck Lever 0 siblings, 1 reply; 11+ messages in thread From: NeilBrown @ 2025-01-05 22:28 UTC (permalink / raw) To: Chuck Lever Cc: Tejun Heo, cel, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs On Sun, 05 Jan 2025, Chuck Lever wrote: > On 1/3/25 5:53 PM, Tejun Heo wrote: > > On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote: > >> On Fri, 03 Jan 2025, cel@kernel.org wrote: > > ... > >> I think that instead of passing "list_lru_count()" we should pass some > >> constant like 1024. > >> > >> cnt = list_lru_count() > >> while (cnt > 0) { > >> num = min(cnt, 1024); > >> list_lru_walk(...., num); > >> cond_sched() > >> cnt -= num; > >> } > >> > >> Then run it from system_wq. > >> > >> list_lru_shrink is most often called as list_lru_shrink_walk() from a > >> shrinker, and the pattern there is essentially that above. A count is > >> taken, possibly scaled down, then the shrinker is called in batches. > > > > BTW, there's nothing wrong with taking some msecs or even tens of msecs > > running on system_unbound_wq, so the current state may be fine too. > > My thinking was that this work is low priority, so there should be > plenty of opportunity to set it aside for a few moments and handle > higher priority work. Maybe not worrisome on systems with a high core > count, but on small SMP (eg VM guests), I've found that tasks like this > can be rude neighbors. None of the different work queues are expected to "set aside" the work for more than a normal scheduling time slice. system_long_wq was created "so that flush_scheduled_work() doesn't take so long" but flush_scheduled_work() is never called now (and would generate a warning if it was) so system_long_wq should really be removed. system_unbound_wq uses threads that are not bound to a CPU so the scheduler can move them around. That is most suitable for something that might run for a long time. system_wq is bound to the CPU that schedules the work, but it can run multiple work items - potentially long running ones - without trouble by helping the scheduler share the CPU among them. This requires that they can sleep of course. I haven't seen the actually symptoms that resulted in the various changes to this code, but that last point is I think the key one. The work item, like all kernel code, needs to have scheduler points if it might run for a longish time. list_lru_walk() doesn't contain any scheduling points so giving a large "nr_to_walk" is always a bad idea. > > We could do this by adding a cond_resched() call site in the loop, > or take Neil's suggestion of breaking up the free list across multiple > work items that handle one or just a few file releases each. I guess I should send a proper patch. NeilBrown > > -- > Chuck Lever > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-05 22:28 ` NeilBrown @ 2025-01-06 1:39 ` Chuck Lever 2025-01-06 2:47 ` NeilBrown 0 siblings, 1 reply; 11+ messages in thread From: Chuck Lever @ 2025-01-06 1:39 UTC (permalink / raw) To: NeilBrown Cc: Tejun Heo, cel, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs On 1/5/25 5:28 PM, NeilBrown wrote: > On Sun, 05 Jan 2025, Chuck Lever wrote: >> On 1/3/25 5:53 PM, Tejun Heo wrote: >>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote: >>>> On Fri, 03 Jan 2025, cel@kernel.org wrote: >>> ... >>>> I think that instead of passing "list_lru_count()" we should pass some >>>> constant like 1024. >>>> >>>> cnt = list_lru_count() >>>> while (cnt > 0) { >>>> num = min(cnt, 1024); >>>> list_lru_walk(...., num); >>>> cond_sched() >>>> cnt -= num; >>>> } >>>> >>>> Then run it from system_wq. >>>> >>>> list_lru_shrink is most often called as list_lru_shrink_walk() from a >>>> shrinker, and the pattern there is essentially that above. A count is >>>> taken, possibly scaled down, then the shrinker is called in batches. >>> >>> BTW, there's nothing wrong with taking some msecs or even tens of msecs >>> running on system_unbound_wq, so the current state may be fine too. >> >> My thinking was that this work is low priority, so there should be >> plenty of opportunity to set it aside for a few moments and handle >> higher priority work. Maybe not worrisome on systems with a high core >> count, but on small SMP (eg VM guests), I've found that tasks like this >> can be rude neighbors. > > None of the different work queues are expected to "set aside" the work > for more than a normal scheduling time slice. > system_long_wq was created "so that flush_scheduled_work() doesn't take > so long" but flush_scheduled_work() is never called now (and would > generate a warning if it was) so system_long_wq should really be > removed. > > system_unbound_wq uses threads that are not bound to a CPU so the > scheduler can move them around. That is most suitable for something > that might run for a long time. That's not my understanding: unbound selects a CPU to run the work item on, and it can be (and usually is) not the same CPU as the one that invoked queue_work(). Then it isn't rescheduled. The work items are expected to complete quickly; work item termination is the typical reschedule point. My understanding, as always, could be stale. But that's neither here nor there: I've dropped the patch to replace system_unbound_wq with system_long_wq. > system_wq is bound to the CPU that schedules the work, but it can run > multiple work items - potentially long running ones - without trouble by > helping the scheduler share the CPU among them. This requires that they > can sleep of course. > > I haven't seen the actually symptoms that resulted in the various > changes to this code, but that last point is I think the key one. The > work item, like all kernel code, needs to have scheduler points if it > might run for a longish time. list_lru_walk() doesn't contain any > scheduling points so giving a large "nr_to_walk" is always a bad idea. That might be a overstated... I don't see other consumers that are so concerned about rescheduling. But then it's not clear if they are dealing with lengthy LRU lists. >> We could do this by adding a cond_resched() call site in the loop, >> or take Neil's suggestion of breaking up the free list across multiple >> work items that handle one or just a few file releases each. > > I guess I should send a proper patch. More comments in reply to that patch. Generally speaking, I'm comfortable chopping up the work as you propose, we just have to address some details. -- Chuck Lever ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again 2025-01-06 1:39 ` Chuck Lever @ 2025-01-06 2:47 ` NeilBrown 0 siblings, 0 replies; 11+ messages in thread From: NeilBrown @ 2025-01-06 2:47 UTC (permalink / raw) To: Chuck Lever Cc: Tejun Heo, cel, Jeff Layton, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs On Mon, 06 Jan 2025, Chuck Lever wrote: > On 1/5/25 5:28 PM, NeilBrown wrote: > > On Sun, 05 Jan 2025, Chuck Lever wrote: > >> On 1/3/25 5:53 PM, Tejun Heo wrote: > >>> On Sat, Jan 04, 2025 at 09:50:32AM +1100, NeilBrown wrote: > >>>> On Fri, 03 Jan 2025, cel@kernel.org wrote: > >>> ... > >>>> I think that instead of passing "list_lru_count()" we should pass some > >>>> constant like 1024. > >>>> > >>>> cnt = list_lru_count() > >>>> while (cnt > 0) { > >>>> num = min(cnt, 1024); > >>>> list_lru_walk(...., num); > >>>> cond_sched() > >>>> cnt -= num; > >>>> } > >>>> > >>>> Then run it from system_wq. > >>>> > >>>> list_lru_shrink is most often called as list_lru_shrink_walk() from a > >>>> shrinker, and the pattern there is essentially that above. A count is > >>>> taken, possibly scaled down, then the shrinker is called in batches. > >>> > >>> BTW, there's nothing wrong with taking some msecs or even tens of msecs > >>> running on system_unbound_wq, so the current state may be fine too. > >> > >> My thinking was that this work is low priority, so there should be > >> plenty of opportunity to set it aside for a few moments and handle > >> higher priority work. Maybe not worrisome on systems with a high core > >> count, but on small SMP (eg VM guests), I've found that tasks like this > >> can be rude neighbors. > > > > None of the different work queues are expected to "set aside" the work > > for more than a normal scheduling time slice. > > system_long_wq was created "so that flush_scheduled_work() doesn't take > > so long" but flush_scheduled_work() is never called now (and would > > generate a warning if it was) so system_long_wq should really be > > removed. > > > > system_unbound_wq uses threads that are not bound to a CPU so the > > scheduler can move them around. That is most suitable for something > > that might run for a long time. > > That's not my understanding: unbound selects a CPU to run the work > item on, and it can be (and usually is) not the same CPU as the one > that invoked queue_work(). Then it isn't rescheduled. The work items > are expected to complete quickly; work item termination is the typical > reschedule point. My understanding, as always, could be stale. If the work item will never schedule, then there isn't much point in using a work queue. The code can be run inline, or with a timer. > > But that's neither here nor there: I've dropped the patch to replace > system_unbound_wq with system_long_wq. > > > > system_wq is bound to the CPU that schedules the work, but it can run > > multiple work items - potentially long running ones - without trouble by > > helping the scheduler share the CPU among them. This requires that they > > can sleep of course. > > > > I haven't seen the actually symptoms that resulted in the various > > changes to this code, but that last point is I think the key one. The > > work item, like all kernel code, needs to have scheduler points if it > > might run for a longish time. list_lru_walk() doesn't contain any > > scheduling points so giving a large "nr_to_walk" is always a bad idea. > > That might be a overstated... I don't see other consumers that are so > concerned about rescheduling. But then it's not clear if they are > dealing with lengthy LRU lists. There are very few callers of list_lru_walk(). shrink_dcache_sb() is the closest analogue and it uses batches of 1024. xfs_buftarg_drain() uses batches of LONG_MAX !!! Mostly the lru is walked in a shrinker callback, and the shrinker manages the batch size. Thanks, NeilBrown > > > >> We could do this by adding a cond_resched() call site in the loop, > >> or take Neil's suggestion of breaking up the free list across multiple > >> work items that handle one or just a few file releases each. > > > > I guess I should send a proper patch. > > More comments in reply to that patch. Generally speaking, I'm > comfortable chopping up the work as you propose, we just have to > address some details. > > > -- > Chuck Lever > ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" 2025-01-03 1:00 [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" cel 2025-01-03 1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel @ 2025-01-03 14:11 ` Jeff Layton 2025-01-04 0:44 ` NeilBrown 1 sibling, 1 reply; 11+ messages in thread From: Jeff Layton @ 2025-01-03 14:11 UTC (permalink / raw) To: cel, Neil Brown, Olga Kornievskaia, Dai Ngo, Tom Talpey Cc: linux-nfs, Chuck Lever, Jakub Kicinski On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote: > From: Chuck Lever <chuck.lever@oracle.com> > > I noticed that a handful of NFSv3 fstests were taking an > unexpectedly long time to run. Troubleshooting showed that the > server's TCP window closed and never re-opened, which caused the > client to trigger an RPC retransmit timeout after 180 seconds. > > The client's recovery action was to establish a fresh connection > and retransmit the timed-out requests. This worked, but it adds a > long delay. > > I tracked the problem to the commit that attempted to reduce the > rate at which the network layer delivers TCP socket data_ready > callbacks. Under most circumstances this change worked as expected, > but for NFSv3, which has no session or other type of throttling, it > can overwhelm the receiver on occasion. > > I'm sure I could tweak the lowat settings, but the small benefit > doesn't seem worth the bother. Just revert it. > > Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages") > Cc: Jakub Kicinski <kuba@kernel.org> > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > --- > net/sunrpc/svcsock.c | 12 +----------- > 1 file changed, 1 insertion(+), 11 deletions(-) > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c > index 95397677673b..cb3bd12f5818 100644 > --- a/net/sunrpc/svcsock.c > +++ b/net/sunrpc/svcsock.c > @@ -1083,9 +1083,6 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk) > /* If we have more data, signal svc_xprt_enqueue() to try again */ > svsk->sk_tcplen = 0; > svsk->sk_marker = xdr_zero; > - > - smp_wmb(); > - tcp_set_rcvlowat(svsk->sk_sk, 1); > } > > /** > @@ -1175,17 +1172,10 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > goto err_delete; > if (len == want) > svc_tcp_fragment_received(svsk); > - else { > - /* Avoid more ->sk_data_ready() calls until the rest > - * of the message has arrived. This reduces service > - * thread wake-ups on large incoming messages. */ > - tcp_set_rcvlowat(svsk->sk_sk, > - svc_sock_reclen(svsk) - svsk->sk_tcplen); Could we instead clamp this to the window size so that we at least only do a wakeup for each window-size chunk? > - > + else > trace_svcsock_tcp_recv_short(&svsk->sk_xprt, > svc_sock_reclen(svsk), > svsk->sk_tcplen - sizeof(rpc_fraghdr)); > - } > goto err_noclose; > error: > if (len != -EAGAIN) Reverting for now is fine though. Reviewed-by: Jeff Layton <jlayton@kernel.org> ^ permalink raw reply [flat|nested] 11+ messages in thread
* Re: [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" 2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton @ 2025-01-04 0:44 ` NeilBrown 0 siblings, 0 replies; 11+ messages in thread From: NeilBrown @ 2025-01-04 0:44 UTC (permalink / raw) To: Jeff Layton Cc: cel, Olga Kornievskaia, Dai Ngo, Tom Talpey, linux-nfs, Chuck Lever, Jakub Kicinski On Sat, 04 Jan 2025, Jeff Layton wrote: > On Thu, 2025-01-02 at 20:00 -0500, cel@kernel.org wrote: > > From: Chuck Lever <chuck.lever@oracle.com> > > > > I noticed that a handful of NFSv3 fstests were taking an > > unexpectedly long time to run. Troubleshooting showed that the > > server's TCP window closed and never re-opened, which caused the > > client to trigger an RPC retransmit timeout after 180 seconds. > > > > The client's recovery action was to establish a fresh connection > > and retransmit the timed-out requests. This worked, but it adds a > > long delay. > > > > I tracked the problem to the commit that attempted to reduce the > > rate at which the network layer delivers TCP socket data_ready > > callbacks. Under most circumstances this change worked as expected, > > but for NFSv3, which has no session or other type of throttling, it > > can overwhelm the receiver on occasion. > > > > I'm sure I could tweak the lowat settings, but the small benefit > > doesn't seem worth the bother. Just revert it. > > > > Fixes: 2b877fc53e97 ("SUNRPC: Reduce thread wake-up rate when receiving large RPC messages") > > Cc: Jakub Kicinski <kuba@kernel.org> > > Signed-off-by: Chuck Lever <chuck.lever@oracle.com> > > --- > > net/sunrpc/svcsock.c | 12 +----------- > > 1 file changed, 1 insertion(+), 11 deletions(-) > > > > diff --git a/net/sunrpc/svcsock.c b/net/sunrpc/svcsock.c > > index 95397677673b..cb3bd12f5818 100644 > > --- a/net/sunrpc/svcsock.c > > +++ b/net/sunrpc/svcsock.c > > @@ -1083,9 +1083,6 @@ static void svc_tcp_fragment_received(struct svc_sock *svsk) > > /* If we have more data, signal svc_xprt_enqueue() to try again */ > > svsk->sk_tcplen = 0; > > svsk->sk_marker = xdr_zero; > > - > > - smp_wmb(); > > - tcp_set_rcvlowat(svsk->sk_sk, 1); > > } > > > > /** > > @@ -1175,17 +1172,10 @@ static int svc_tcp_recvfrom(struct svc_rqst *rqstp) > > goto err_delete; > > if (len == want) > > svc_tcp_fragment_received(svsk); > > - else { > > - /* Avoid more ->sk_data_ready() calls until the rest > > - * of the message has arrived. This reduces service > > - * thread wake-ups on large incoming messages. */ > > - tcp_set_rcvlowat(svsk->sk_sk, > > - svc_sock_reclen(svsk) - svsk->sk_tcplen); > > Could we instead clamp this to the window size so that we at least only > do a wakeup for each window-size chunk? tcp_set_rcvlowat() already imposes and upper limit - based on configured queue space. So I don't think we need to clamp. I note that the calculation is wrong. sk_tcplen includes the size of the 4byte length header, but svc_sock_reclen() does not. Other places there these two numbers are compared, sizeof(rpc_fraghdr) is subtracted from sk_tcplen. So there is a corner case where the size passed to tcp_set_rcvlowat() is negative. The two numbers in the subtraction are u32 and the function expects "int". I didn't dig enough to understand how an negative would be handled. > > > - > > + else > > trace_svcsock_tcp_recv_short(&svsk->sk_xprt, > > svc_sock_reclen(svsk), > > svsk->sk_tcplen - sizeof(rpc_fraghdr)); > > - } > > goto err_noclose; > > error: > > if (len != -EAGAIN) > > Reverting for now is fine though. Agreed. I would expect modern network cards to deliver fairly large packets to the application already. Still, it would be nice to understand what is really happening. Setting rcvlowat shouldn't be able to cause the window size to go to zero. NeilBrown > > Reviewed-by: Jeff Layton <jlayton@kernel.org> > ^ permalink raw reply [flat|nested] 11+ messages in thread
end of thread, other threads:[~2025-01-06 2:48 UTC | newest] Thread overview: 11+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2025-01-03 1:00 [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" cel 2025-01-03 1:00 ` [PATCH v1 2/2] NFSD: Change the filecache laundrette workqueue again cel 2025-01-03 14:21 ` Jeff Layton 2025-01-03 22:50 ` NeilBrown 2025-01-03 22:53 ` Tejun Heo 2025-01-04 15:37 ` Chuck Lever 2025-01-05 22:28 ` NeilBrown 2025-01-06 1:39 ` Chuck Lever 2025-01-06 2:47 ` NeilBrown 2025-01-03 14:11 ` [PATCH v1 1/2] Revert "SUNRPC: Reduce thread wake-up rate when receiving large RPC messages" Jeff Layton 2025-01-04 0:44 ` NeilBrown
This is an external index of several public inboxes, see mirroring instructions on how to clone and mirror all data and code used by this external index.