public inbox for linux-nfs@vger.kernel.org
From: Chuck Lever <chuck.lever@oracle.com>
To: NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
	linux-nfs@vger.kernel.org,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Subject: Re: [PATCH] nfsd: add scheduling point in nfsd_file_gc()
Date: Mon, 6 Jan 2025 08:57:50 -0500
Message-ID: <e8eda0b1-c55f-4f86-8f7e-3c6a50dacac8@oracle.com>
In-Reply-To: <173613252284.22054.16371856139892298093@noble.neil.brown.name>

On 1/5/25 10:02 PM, NeilBrown wrote:
> On Mon, 06 Jan 2025, Chuck Lever wrote:
>> On 1/5/25 6:11 PM, NeilBrown wrote:

>>> +		unsigned long num_to_scan = min(cnt, 1024UL);
>>
>> I see long delays with fewer than 1024 items on the list. I might
>> drop this number by one or two orders of magnitude. And make it a
>> symbolic constant.
> 
> In that case I seriously wonder if this is where the delays are coming
> from.
> 
> nfsd_file_dispose_list_delayed() does take and drop a spinlock
> repeatedly (though it may not always be the same lock) and call
> svc_wake_up() repeatedly - although the head of the queue might already
> be woken.  We could optimise that to detect runs with the same nn and
> only take the lock once, and only wake_up once.
> 
>>
>> There's another naked integer (8) in nfsd_file_net_dispose() -- how does
>> that relate to this new cap? Should that also be a symbolic constant?
> 
> I don't think they relate.
> The trade-off with "8" is:
>    a bigger number might block an nfsd thread for longer,
>      forcing serialising when the work can usefully be done in parallel.
>    a smaller number might needlessly wake lots of threads
>      to share out a tiny amount of work.
> 
> The 1024 is simply about "don't hold a spinlock for too long".

By that, I think you mean list_lru_walk() takes &l->lock for the
duration of the scan? For a long scan, that would effectively block
adding or removing LRU items for quite some time.

So here's a typical excerpt from a common test:

kworker/u80:7-206   [003]   266.985735: nfsd_file_unhash: ...

kworker/u80:7-206   [003]   266.987723: nfsd_file_gc_removed: 1309 entries removed, 2972 remaining

nfsd-1532  [015]   266.988626: nfsd_file_free: ...

Here, the nfsd_file_unhash record marks the beginning of the LRU
walk, and the nfsd_file_gc_removed record marks the end. The
timestamps indicate the walk took two milliseconds.

The nfsd_file_free record above marks the last disposal activity.
That takes almost a millisecond, but as far as I can tell, it
does not hold any locks for long.

This seems to me like a strong argument for cutting the scan size
down to no more than 32-64 items. Ideally spinlocks are supposed
to be held only for simple operations (e.g., list_add); this seems a
little outside that window (hence your remark that "a large
nr_to_walk is always a bad idea" -- I now see what you meant).

IMHO the patch description should single out that purpose: We want to
significantly reduce the maximum amount of time that list_lru_walk()
blocks foreground LRU activity such as list_lru_add_obj().

The cond_resched() in this case might be gravy.


>>> +		ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>> +				    &dispose, num_to_scan);
>>> +		trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> +		nfsd_file_dispose_list_delayed(&dispose);
>>
>> I need to go back and review the function traces to see where the
>> delays add up -- to make sure rescheduling here, rather than at some
>> other point, is appropriate. It probably is, but my memory fails me
>> these days.
> 
> I would like to see those function traces too.

Here's my reproducer:

1. On the client: Set up xfstests to use NFSv3
2. On the server: "sudo trace-cmd record -e nfsd &"
3. On the client: Run xfstests with "sudo ./check -nfs generic/750"


-- 
Chuck Lever


Thread overview: 9+ messages
2025-01-05 23:11 [PATCH intro] nfsd: add scheduling point in nfsd_file_gc() NeilBrown
2025-01-05 23:11 ` [PATCH] " NeilBrown
2025-01-06  1:55   ` Chuck Lever
2025-01-06  3:02     ` NeilBrown
2025-01-06 13:57       ` Chuck Lever [this message]
2025-01-07 23:01         ` NeilBrown
2025-01-08 13:39           ` Chuck Lever
2025-01-08 20:41             ` NeilBrown
2025-01-08 22:27               ` Chuck Lever
