From: Chuck Lever <chuck.lever@oracle.com>
To: NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
linux-nfs@vger.kernel.org,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Subject: Re: [PATCH] nfsd: add scheduling point in nfsd_file_gc()
Date: Wed, 8 Jan 2025 17:27:59 -0500 [thread overview]
Message-ID: <51c156f5-e70c-4a75-873b-68fed4fad998@oracle.com> (raw)
In-Reply-To: <173636890368.22054.15435316321445899208@noble.neil.brown.name>

On 1/8/25 3:41 PM, NeilBrown wrote:
> On Thu, 09 Jan 2025, Chuck Lever wrote:
>> On 1/7/25 6:01 PM, NeilBrown wrote:
>>> On Tue, 07 Jan 2025, Chuck Lever wrote:
>>>> On 1/5/25 10:02 PM, NeilBrown wrote:
>>>>> On Mon, 06 Jan 2025, Chuck Lever wrote:
>>>>>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>>>
>>>>>>> + unsigned long num_to_scan = min(cnt, 1024UL);
>>>>>>
>>>>>> I see long delays with fewer than 1024 items on the list. I might
>>>>>> drop this number by one or two orders of magnitude. And make it a
>>>>>> symbolic constant.
>>>>>
>>>>> In that case I seriously wonder if this is where the delays are coming
>>>>> from.
>>>>>
>>>>> nfsd_file_dispose_list_delayed() does take and drop a spinlock
>>>>> repeatedly (though it may not always be the same lock) and call
>>>>> svc_wake_up() repeatedly - although the head of the queue might already
>>>>> be woken. We could optimise that to detect runs with the same nn and
>>>>> only take the lock once, and only wake_up once.
>>>>>
>>>>>>
>>>>>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>>>>>> that relate to this new cap? Should that also be a symbolic constant?
>>>>>
>>>>> I don't think they relate.
>>>>> The trade-off with "8" is:
>>>>> a bigger number might block an nfsd thread for longer,
>>>>> forcing serialising when the work can usefully be done in parallel.
>>>>> a smaller number might needlessly wake lots of threads
>>>>> to share out a tiny amount of work.
>>>>>
>>>>> The 1024 is simply about "don't hold a spinlock for too long".
>>>>
>>>> By that, I think you mean list_lru_walk() takes &l->lock for the
>>>> duration of the scan? For a long scan, that would effectively block
>>>> adding or removing LRU items for quite some time.
>>>>
>>>> So here's a typical excerpt from a common test:
>>>>
>>>> kworker/u80:7-206 [003] 266.985735: nfsd_file_unhash: ...
>>>>
>>>> kworker/u80:7-206 [003] 266.987723: nfsd_file_gc_removed: 1309
>>>> entries removed, 2972 remaining
>>>>
>>>> nfsd-1532 [015] 266.988626: nfsd_file_free: ...
>>>>
>>>> Here, the nfsd_file_unhash record marks the beginning of the LRU
>>>> walk, and the nfsd_file_gc_removed record marks the end. The
>>>> timestamps indicate the walk took two milliseconds.
>>>>
>>>> The nfsd_file_free record above marks the last disposal activity.
>>>> That takes almost a millisecond, but as far as I can tell, it
>>>> does not hold any locks for long.
>>>>
>>>> This seems to me like a strong argument for cutting the scan size
>>>> down to no more than 32-64 items. Ideally spin locks are supposed
>>>> to be held only for simple operations (eg, list_add); this seems a
>>>> little outside that window (hence your remark that "a large
>>>> nr_to_walk is always a bad idea" -- I now see what you meant).
>>>
>>> This is useful - thanks.
>>> So the problem seems to be that holding the list_lru lock while
>>> scanning the whole list can block all incoming NFSv3 for a noticeable
>>> amount of time - 2 msecs above. That makes perfect sense, and as you
>>> say it suggests that the lack of scheduling points isn't really the
>>> issue.
>>>
>>> This confirms for me that the list_lru approach is not a good fit for
>>> this problem. I have written a patch which replaces it with a pair of
>>> simple lists, as I described in my cover letter.
>>
>> Before proceeding with replacement of the LRU, is there interest in
>> addressing this issue in LTS kernels as well? If so, then IMO the
>> better approach would be to take a variant of your narrower fix for
>> v6.14, and then visit the deeper LRU changes for v6.15ff.
>
> That is probably reasonable. You could take the first patch, drop the
> 1024 to 64 (or less if testing suggests that is still too high), and
> maybe drop the cond_resched().

I will make it so. Enjoy the rest of your leave!
--
Chuck Lever
Thread overview: 9+ messages
2025-01-05 23:11 [PATCH intro] nfsd: add scheduling point in nfsd_file_gc() NeilBrown
2025-01-05 23:11 ` [PATCH] " NeilBrown
2025-01-06 1:55 ` Chuck Lever
2025-01-06 3:02 ` NeilBrown
2025-01-06 13:57 ` Chuck Lever
2025-01-07 23:01 ` NeilBrown
2025-01-08 13:39 ` Chuck Lever
2025-01-08 20:41 ` NeilBrown
2025-01-08 22:27 ` Chuck Lever [this message]