From: Chuck Lever <chuck.lever@oracle.com>
To: NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
linux-nfs@vger.kernel.org,
Olga Kornievskaia <okorniev@redhat.com>,
Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Subject: Re: [PATCH] nfsd: add scheduling point in nfsd_file_gc()
Date: Wed, 8 Jan 2025 17:27:59 -0500 [thread overview]
Message-ID: <51c156f5-e70c-4a75-873b-68fed4fad998@oracle.com> (raw)
In-Reply-To: <173636890368.22054.15435316321445899208@noble.neil.brown.name>

On 1/8/25 3:41 PM, NeilBrown wrote:
> On Thu, 09 Jan 2025, Chuck Lever wrote:
>> On 1/7/25 6:01 PM, NeilBrown wrote:
>>> On Tue, 07 Jan 2025, Chuck Lever wrote:
>>>> On 1/5/25 10:02 PM, NeilBrown wrote:
>>>>> On Mon, 06 Jan 2025, Chuck Lever wrote:
>>>>>> On 1/5/25 6:11 PM, NeilBrown wrote:
>>>>
>>>>>>> + unsigned long num_to_scan = min(cnt, 1024UL);
>>>>>>
>>>>>> I see long delays with fewer than 1024 items on the list. I might
>>>>>> drop this number by one or two orders of magnitude. And make it a
>>>>>> symbolic constant.
>>>>>
>>>>> In that case I seriously wonder if this is where the delays are coming
>>>>> from.
>>>>>
>>>>> nfsd_file_dispose_list_delayed() does take and drop a spinlock
>>>>> repeatedly (though it may not always be the same lock) and call
>>>>> svc_wake_up() repeatedly - although the head of the queue might already
>>>>> be woken. We could optimise that to detect runs with the same nn and
>>>>> only take the lock once, and only wake_up once.
>>>>>
>>>>>>
>>>>>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>>>>>> that relate to this new cap? Should that also be a symbolic constant?
>>>>>
>>>>> I don't think they relate.
>>>>> The trade-off with "8" is:
>>>>> a bigger number might block an nfsd thread for longer,
>>>>> forcing serialising when the work can usefully be done in parallel.
>>>>> a smaller number might needlessly wake lots of threads
>>>>> to share out a tiny amount of work.
>>>>>
>>>>> The 1024 is simply about "don't hold a spinlock for too long".
>>>>
>>>> By that, I think you mean list_lru_walk() takes &l->lock for the
>>>> duration of the scan? For a long scan, that would effectively block
>>>> adding or removing LRU items for quite some time.
>>>>
>>>> So here's a typical excerpt from a common test:
>>>>
>>>> kworker/u80:7-206 [003] 266.985735: nfsd_file_unhash: ...
>>>>
>>>> kworker/u80:7-206 [003] 266.987723: nfsd_file_gc_removed: 1309
>>>> entries removed, 2972 remaining
>>>>
>>>> nfsd-1532 [015] 266.988626: nfsd_file_free: ...
>>>>
>>>> Here, the nfsd_file_unhash record marks the beginning of the LRU
>>>> walk, and the nfsd_file_gc_removed record marks the end. The
>>>> timestamps indicate the walk took two milliseconds.
>>>>
>>>> The nfsd_file_free record above marks the last disposal activity.
>>>> That takes almost a millisecond, but as far as I can tell, it
>>>> does not hold any locks for long.
>>>>
>>>> This seems to me like a strong argument for cutting the scan size
>>>> down to no more than 32-64 items. Ideally spin locks are supposed
>>>> to be held only for simple operations (eg, list_add); this seems a
>>>> little outside that window (hence your remark that "a large
>>>> nr_to_walk is always a bad idea" -- I now see what you meant).
>>>
>>> This is useful - thanks.
>>> So the problem seems to be that holding the list_lru lock while
>>> scanning the whole list can block all incoming NFSv3 for a noticeable
>>> amount of time - 2 msecs above. That makes perfect sense, and as you
>>> say it suggests that the lack of scheduling points isn't really the
>>> issue.
>>>
>>> This confirms for me that the list_lru approach is not a good fit for
>>> this problem. I have written a patch which replaces it with a pair of
>>> simple lists, as I described in my cover letter.
>>
>> Before proceeding with replacement of the LRU, is there interest in
>> addressing this issue in LTS kernels as well? If so, then IMO the
>> better approach would be to take a variant of your narrower fix for
>> v6.14, and then visit the deeper LRU changes for v6.15ff.
>
> That is probably reasonable. You could take the first patch, drop the
> 1024 to 64 (or less if testing suggests that is still too high), and
> maybe drop the cond_resched().

I will make it so. Enjoy the rest of your leave!
--
Chuck Lever
Thread overview: 9+ messages
2025-01-05 23:11 [PATCH intro] nfsd: add scheduling point in nfsd_file_gc() NeilBrown
2025-01-05 23:11 ` [PATCH] " NeilBrown
2025-01-06 1:55 ` Chuck Lever
2025-01-06 3:02 ` NeilBrown
2025-01-06 13:57 ` Chuck Lever
2025-01-07 23:01 ` NeilBrown
2025-01-08 13:39 ` Chuck Lever
2025-01-08 20:41 ` NeilBrown
2025-01-08 22:27 ` Chuck Lever [this message]