public inbox for linux-nfs@vger.kernel.org
From: Chuck Lever <chuck.lever@oracle.com>
To: NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
	linux-nfs@vger.kernel.org,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Subject: Re: [PATCH] nfsd: add scheduling point in nfsd_file_gc()
Date: Wed, 8 Jan 2025 08:39:28 -0500	[thread overview]
Message-ID: <c205384d-cf12-4dde-8875-e826e4a7c2f6@oracle.com> (raw)
In-Reply-To: <173629091812.22054.10406068450776372683@noble.neil.brown.name>

On 1/7/25 6:01 PM, NeilBrown wrote:
> On Tue, 07 Jan 2025, Chuck Lever wrote:
>> On 1/5/25 10:02 PM, NeilBrown wrote:
>>> On Mon, 06 Jan 2025, Chuck Lever wrote:
>>>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>
>>>>> +		unsigned long num_to_scan = min(cnt, 1024UL);
>>>>
>>>> I see long delays with fewer than 1024 items on the list. I might
>>>> drop this number by one or two orders of magnitude. And make it a
>>>> symbolic constant.
>>>
>>> In that case I seriously wonder if this is where the delays are coming
>>> from.
>>>
>>> nfsd_file_dispose_list_delayed() does take and drop a spinlock
>>> repeatedly (though it may not always be the same lock) and call
>>> svc_wake_up() repeatedly - although the head of the queue might already
>>> be woken.  We could optimise that to detect runs with the same nn and
>>> only take the lock once, and only wake_up once.
>>>
>>>>
>>>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>>>> that relate to this new cap? Should that also be a symbolic constant?
>>>
>>> I don't think they relate.
>>> The trade-off with "8" is:
>>>     a bigger number might block an nfsd thread for longer,
>>>       forcing serialising when the work can usefully be done in parallel.
>>>     a smaller number might needlessly wake lots of threads
>>>       to share out a tiny amount of work.
>>>
>>> The 1024 is simply about "don't hold a spinlock for too long".
>>
>> By that, I think you mean list_lru_walk() takes &l->lock for the
>> duration of the scan? For a long scan, that would effectively block
>> adding or removing LRU items for quite some time.
>>
>> So here's a typical excerpt from a common test:
>>
>> kworker/u80:7-206   [003]   266.985735: nfsd_file_unhash: ...
>>
>> kworker/u80:7-206   [003]   266.987723: nfsd_file_gc_removed: 1309
>> entries removed, 2972 remaining
>>
>> nfsd-1532  [015]   266.988626: nfsd_file_free: ...
>>
>> Here, the nfsd_file_unhash record marks the beginning of the LRU
>> walk, and the nfsd_file_gc_removed record marks the end. The
>> timestamps indicate the walk took two milliseconds.
>>
>> The nfsd_file_free record above marks the last disposal activity.
>> That takes almost a millisecond, but as far as I can tell, it
>> does not hold any locks for long.
>>
>> This seems to me like a strong argument for cutting the scan size
>> down to no more than 32-64 items. Ideally spin locks are supposed
>> to be held only for simple operations (eg, list_add); this seems a
>> little outside that window (hence your remark that "a large
>> nr_to_walk is always a bad idea" -- I now see what you meant).
> 
> This is useful - thanks.
> So the problem seems to be that holding the list_lru lock while scanning
> the whole list can block all incoming NFSv3 for a noticeable amount of
> time - 2 msecs above.  That makes perfect sense, and as you say it
> suggests that the lack of scheduling points isn't really the issue.
> 
> This confirms for me that the list_lru approach is not a good fit for
> this problem.  I have written a patch which replaces it with a pair of
> simple lists as I described in my cover letter.

Before proceeding with replacement of the LRU, is there interest in
addressing this issue in LTS kernels as well? If so, then IMO the
better approach would be to take a variant of your narrower fix for
v6.14, and then visit the deeper LRU changes for v6.15ff.


> It is a bit large and needs careful review.  In particular I haven't
> given thought to the tracepoints which might need to be moved or changed
> or discarded.

Some of those tracepoints are specific to the state machine that is
part of the list_lru scan. Those should be removed if we replace that
state machine.


-- 
Chuck Lever


Thread overview: 9+ messages
2025-01-05 23:11 [PATCH intro] nfsd: add scheduling point in nfsd_file_gc() NeilBrown
2025-01-05 23:11 ` [PATCH] " NeilBrown
2025-01-06  1:55   ` Chuck Lever
2025-01-06  3:02     ` NeilBrown
2025-01-06 13:57       ` Chuck Lever
2025-01-07 23:01         ` NeilBrown
2025-01-08 13:39           ` Chuck Lever [this message]
2025-01-08 20:41             ` NeilBrown
2025-01-08 22:27               ` Chuck Lever
