From: Chuck Lever <chuck.lever@oracle.com>
To: Dave Chinner <david@fromorbit.com>, NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
	linux-nfs@vger.kernel.org,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Subject: Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
Date: Tue, 28 Jan 2025 11:05:57 -0500
Message-ID: <08fb33c7-d495-4b88-a28f-53521377f3bc@oracle.com>
In-Reply-To: <154547d0-fafb-4b5f-a071-6cb697f6d9ed@oracle.com>
On 1/28/25 9:27 AM, Chuck Lever wrote:
> On 1/28/25 1:37 AM, Dave Chinner wrote:
>> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
>>> [
>>> davec added to cc in case I've said something incorrect about list_lru
>>>
>>> Changes in this version:
>>> - no _bh locking
>>> - add name for a magic constant
>>> - remove unnecessary race-handling code
>>> - give a more meaningful name for a lock for /proc/lock_stat
>>> - minor cleanups suggested by Jeff
>>>
>>> ]
>>>
>>> The nfsd filecache currently uses list_lru for tracking files recently
>>> used in NFSv3 requests which need to be "garbage collected" when they
>>> have become idle - unused for 2-4 seconds.
>>>
>>> I do not believe list_lru is a good tool for this. It does not allow
>>> the timeout which filecache requires so we have to add a timeout
>>> mechanism which holds the list_lru lock while the whole list is scanned
>>> looking for entries that haven't been recently accessed. When the list
>>> is largish (even a few hundred) this can noticeably block new requests,
>>> which need the lock to remove a file in order to access it.
>>
>> Looks entirely like a trivial implementation bug in how the list_lru
>> is walked in nfsd_file_gc().
>>
>> static void
>> nfsd_file_gc(void)
>> {
>>     LIST_HEAD(dispose);
>>     unsigned long ret;
>>
>>     ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>                         &dispose, list_lru_count(&nfsd_file_lru));
>>                                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>     trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>     nfsd_file_dispose_list_delayed(&dispose);
>> }
>>
>> i.e. the list_lru_walk() has been told to walk the entire list in a
>> single lock hold if nothing blocks it.
>>
>> We've known this for a long, long time, and it's something we've
>> handled for a long time with shrinkers, too. Here's the typical way
>> of doing a full list aging and GC pass in one go without excessively
>> long lock holds:
>>
>> {
>>     long nr_to_scan = list_lru_count(&nfsd_file_lru);
>>     LIST_HEAD(dispose);
>>
>>     while (nr_to_scan > 0) {
>>         long batch = min(nr_to_scan, 64);
>>
>>         list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>                       &dispose, batch);
>>
>>         if (list_empty(&dispose))
>>             break;
>>         dispose_list(&dispose);
>>         nr_to_scan -= batch;
>>     }
>> }
>
> The above is in fact similar to what we're planning to push first so
> that it can be cleanly backported to LTS kernels:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=9caea737d2cdfe2d194e225c1924090c1d68c25f

I've rebased that branch. Here's a more permanent link:

https://lore.kernel.org/all/20250109142438.18689-2-cel@kernel.org/

But note that the batch size in the patch committed to my tree is 16
items, not 32.

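To make the lock-hold behaviour concrete, here is a minimal userspace
sketch of a batched walk. It is not the nfsd code and not the list_lru
API: the struct names, GC_BATCH, and the lock_holds/max_per_hold counters
are made up for illustration, and a counter stands in for the spinlock
that a real walk would take.

```c
#include <assert.h>
#include <stddef.h>

#define GC_BATCH 16	/* hypothetical name; 16 is the batch size mentioned above */

struct item {
	struct item *next;
	int idle;		/* nonzero: eligible for disposal */
};

struct lru {
	struct item *head, *tail;
	size_t count;
	size_t lock_holds;	/* how many times the "lock" was taken */
	size_t max_per_hold;	/* most items visited during a single hold */
};

static void lru_add_tail(struct lru *l, struct item *it)
{
	it->next = NULL;
	if (l->tail)
		l->tail->next = it;
	else
		l->head = it;
	l->tail = it;
	l->count++;
}

static struct item *lru_pop_head(struct lru *l)
{
	struct item *it = l->head;

	if (!it)
		return NULL;
	l->head = it->next;
	if (!l->head)
		l->tail = NULL;
	l->count--;
	return it;
}

/* Visit at most nr items in one lock hold; return how many went to *dispose. */
static size_t lru_walk(struct lru *l, struct item **dispose, size_t nr)
{
	size_t visited = 0, removed = 0;

	assert(dispose != NULL);
	l->lock_holds++;		/* stands in for taking the list lock */
	while (visited < nr && l->head) {
		struct item *it = lru_pop_head(l);

		visited++;
		if (it->idle) {
			it->next = *dispose;	/* hand over for disposal */
			*dispose = it;
			removed++;
		} else {
			lru_add_tail(l, it);	/* rotate to the far end */
		}
	}
	if (visited > l->max_per_hold)
		l->max_per_hold = visited;
	/* the lock would be dropped here */
	return removed;
}

/* One full aging pass, split into GC_BATCH-sized lock holds. */
static size_t lru_gc(struct lru *l, struct item **dispose)
{
	size_t nr_to_scan = l->count, removed = 0;

	while (nr_to_scan > 0) {
		size_t batch = nr_to_scan < GC_BATCH ? nr_to_scan : GC_BATCH;

		removed += lru_walk(l, dispose, batch);
		nr_to_scan -= batch;
	}
	return removed;
}
```

With 40 entries queued, lru_gc() takes the stand-in lock three times and
never visits more than 16 entries in a single hold, so a request waiting
for the lock gets a chance between batches.
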
>> And we don't need two lists to separate recently referenced vs
>> gc candidates because we have a referenced bit in the nf->nf_flags.
>> i.e. nfsd_file_lru_cb() does:
>>
>> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>                  void *arg)
>> {
>>     ....
>>     /* If it was recently added to the list, skip it */
>>     if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
>>         trace_nfsd_file_gc_referenced(nf);
>>         return LRU_ROTATE;
>>     }
>>     .....
>>
>> Which moves recently referenced entries to the far end of the list,
>> resulting in all the reclaimable objects congregating at the end of
>> the list that is walked first by list_lru_walk().
>
> My concern (which I haven't voiced yet) about having two lists is that
> it will increase memory traffic over the current single atomic bit
> operation.
>
>
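
For anyone not familiar with the referenced-bit idiom, a minimal
userspace sketch follows. It uses C11 atomics in place of the kernel's
test_and_clear_bit(); FILE_REFERENCED, struct file_entry, the WALK_*
values and the helper names are hypothetical stand-ins, not the nfsd or
list_lru definitions.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define FILE_REFERENCED (1UL << 0)	/* hypothetical flag bit */

/* Hypothetical stand-ins for the walk callback's return values. */
enum walk_result { WALK_REMOVE, WALK_ROTATE };

struct file_entry {
	atomic_ulong flags;
};

/* Atomically clear 'bit' and report whether it had been set. */
static bool test_and_clear_flag(atomic_ulong *flags, unsigned long bit)
{
	return (atomic_fetch_and(flags, ~bit) & bit) != 0;
}

/* Called on every access: a single atomic OR, no list manipulation. */
static void mark_referenced(struct file_entry *f)
{
	atomic_fetch_or(&f->flags, FILE_REFERENCED);
}

/*
 * Called by the GC walk: a recently used entry gets one more pass
 * (rotate), an entry that was not touched since the last pass goes away.
 */
static enum walk_result gc_visit(struct file_entry *f)
{
	if (test_and_clear_flag(&f->flags, FILE_REFERENCED))
		return WALK_ROTATE;
	return WALK_REMOVE;
}
```

An entry that is marked and then visited twice is rotated on the first
visit and removed on the second, which is where the "idle for one to two
GC intervals" behaviour comes from.
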
>> IOWs, a batched walk like above resumes the walk exactly where it
>> left off, because it is always either reclaiming or rotating the
>> object at the head of the list.
>>
>>> This patch removes the list_lru and instead uses 2 simple linked lists.
>>> When a file is accessed it is removed from whichever list it is on,
>>> then added to the tail of the first list. Every 2 seconds the second
>>> list is moved to the "freeme" list and the first list is moved to the
>>> second list. This avoids any need to walk a list to find old entries.
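
For comparison, the two-list scheme described above can be sketched in
userspace as below. This is only an illustration of the description, not
code from the patch series: struct node, struct gc_lists, gc_touch() and
gc_tick() are hypothetical names, and the locking the real lists would
need is left out.

```c
#include <assert.h>
#include <stddef.h>

/* Minimal circular doubly linked list, in the style of list_head. */
struct node {
	struct node *prev, *next;
};

static void list_init(struct node *h)
{
	h->prev = h->next = h;
}

static int list_is_empty(const struct node *h)
{
	return h->next == h;
}

/* Unlink n and leave it self-pointing; safe on an initialised, unlinked node. */
static void list_del_node(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void list_add_tail_node(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move every entry of src to the tail of dst in O(1), leaving src empty. */
static void list_splice_tail_all(struct node *src, struct node *dst)
{
	if (list_is_empty(src))
		return;
	src->next->prev = dst->prev;
	dst->prev->next = src->next;
	src->prev->next = dst;
	dst->prev = src->prev;
	list_init(src);
}

struct gc_lists {
	struct node recent;	/* touched since the last tick */
	struct node older;	/* touched during the previous interval */
	struct node freeme;	/* idle for at least one full interval */
};

static void gc_init(struct gc_lists *g)
{
	list_init(&g->recent);
	list_init(&g->older);
	list_init(&g->freeme);
}

/* On access: drop the entry from whichever list holds it, queue it as recent. */
static void gc_touch(struct gc_lists *g, struct node *n)
{
	list_del_node(n);
	list_add_tail_node(n, &g->recent);
}

/* Every interval: older -> freeme, recent -> older. No list is walked. */
static void gc_tick(struct gc_lists *g)
{
	list_splice_tail_all(&g->older, &g->freeme);
	list_splice_tail_all(&g->recent, &g->older);
}
```

Each tick is two constant-time splices, but every access now performs
list pointer updates rather than setting one bit.
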
>>
>> Yup, that's exactly what the current code does via the laundrette
>> work that schedules nfsd_file_gc() to run every two seconds.
>>
>>> These lists are per-netns rather than global because the freeme list
>>> is per-netns, since the actual freeing is done in nfsd threads, which
>>> are per-netns.
>>
>> The list_lru is actually multiple lists - it is a per-numa node list
>> and so moving to global scope linked lists per netns is going to
>> reduce scalability and increase lock contention on large machines.
>>
>> I also don't see any perf numbers, scalability analysis, latency
>> measurement, CPU profiles, etc showing the problems with using list_lru
>> for the GC function, nor any improvement this new code brings.
>>
>> i.e. It's kinda hard to make any real comment on "I do not believe
>> list_lru is a good tool for this" when there are no actual
>> measurements provided to back the statement one way or the other...
>
> True, it would be good to get some comparative metrics; in particular
> looking at spin lock contention and memory traffic.
>
>
--
Chuck Lever