public inbox for linux-nfs@vger.kernel.org
From: Chuck Lever <chuck.lever@oracle.com>
To: Dave Chinner <david@fromorbit.com>, NeilBrown <neilb@suse.de>
Cc: Jeff Layton <jlayton@kernel.org>,
	linux-nfs@vger.kernel.org,
	Olga Kornievskaia <okorniev@redhat.com>,
	Dai Ngo <Dai.Ngo@oracle.com>, Tom Talpey <tom@talpey.com>
Subject: Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
Date: Tue, 28 Jan 2025 11:05:57 -0500
Message-ID: <08fb33c7-d495-4b88-a28f-53521377f3bc@oracle.com>
In-Reply-To: <154547d0-fafb-4b5f-a071-6cb697f6d9ed@oracle.com>

On 1/28/25 9:27 AM, Chuck Lever wrote:
> On 1/28/25 1:37 AM, Dave Chinner wrote:
>> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
>>> [
>>> davec added to cc in case I've said something incorrect about list_lru
>>>
>>> Changes in this version:
>>>    - no _bh locking
>>>    - add name for a magic constant
>>>    - remove unnecessary race-handling code
>>>    - give a lock a more meaningful name for /proc/lock_stat
>>>    - minor cleanups suggested by Jeff
>>>
>>> ]
>>>
>>> The nfsd filecache currently uses list_lru for tracking files recently
>>> used in NFSv3 requests which need to be "garbage collected" when they
>>> have become idle - unused for 2-4 seconds.
>>>
>>> I do not believe list_lru is a good tool for this.  It does not allow
>>> the timeout which filecache requires so we have to add a timeout
>>> mechanism which holds the list_lru lock while the whole list is scanned
>>> looking for entries that haven't been recently accessed.  When the list
>>> is largish (even a few hundred) this can noticeably block new requests,
>>> which need the lock to remove a file in order to access it.
>>
>> Looks entirely like a trivial implementation bug in how the list_lru
>> is walked in nfsd_file_gc().
>>
>> static void
>> nfsd_file_gc(void)
>> {
>>          LIST_HEAD(dispose);
>>          unsigned long ret;
>>
>>          ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>                              &dispose, list_lru_count(&nfsd_file_lru));
>>                           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>>          trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>          nfsd_file_dispose_list_delayed(&dispose);
>> }
>>
>> i.e. the list_lru_walk() has been told to walk the entire list in a
>> single lock hold if nothing blocks it.
>>
>> We've known this for a long, long time, and it's something we've
>> handled for a long time with shrinkers, too. Here's the typical way
>> of doing a full list aging and GC pass in one go without excessively
>> long lock holds:
>>
>> {
>>     long nr_to_scan = list_lru_count(&nfsd_file_lru);
>>     LIST_HEAD(dispose);
>>
>>     while (nr_to_scan > 0) {
>>         long batch = min(nr_to_scan, 64);
>>
>>         list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>>                 &dispose, batch);
>>
>>         if (list_empty(&dispose))
>>             break;
>>         dispose_list(&dispose);
>>         nr_to_scan -= batch;
>>     }
>> }
> 
> The above is in fact similar to what we're planning to push first so
> that it can be cleanly backported to LTS kernels:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=9caea737d2cdfe2d194e225c1924090c1d68c25f

I've rebased that branch. Here's a more permanent link:

https://lore.kernel.org/all/20250109142438.18689-2-cel@kernel.org/

But note that the batch size in the patch committed to my tree is 16
items, not the 64 used in the example above.
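
For readers who want to experiment with the batching idea outside the
kernel, here is a minimal userspace sketch. It is not the patch linked
above and does not use the real list_lru API: struct lru, lru_walk(),
lru_gc() and GC_BATCH are hypothetical names invented for illustration,
and max_hold merely counts entries visited per simulated lock hold.
Unlike the quoted example, it does not stop early when a batch frees
nothing.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define GC_BATCH 16	/* hypothetical name; entries visited per "lock hold" */

struct entry {
	struct entry *prev, *next;
	bool referenced;	/* stands in for NFSD_FILE_REFERENCED */
};

struct lru {
	struct entry head;	/* sentinel; head.next is the oldest entry */
	size_t count;
	size_t max_hold;	/* longest simulated lock hold, in entries */
};

static void lru_init(struct lru *l)
{
	l->head.prev = l->head.next = &l->head;
	l->head.referenced = false;
	l->count = 0;
	l->max_hold = 0;
}

static void lru_add_tail(struct lru *l, struct entry *e)
{
	e->prev = l->head.prev;
	e->next = &l->head;
	l->head.prev->next = e;
	l->head.prev = e;
	l->count++;
}

static void lru_del(struct lru *l, struct entry *e)
{
	assert(l->count > 0);
	e->prev->next = e->next;
	e->next->prev = e->prev;
	l->count--;
}

/*
 * Visit at most nr entries from the head in one simulated lock hold.
 * A referenced entry has its flag cleared and is rotated to the tail;
 * an unreferenced entry is unlinked. Returns the number unlinked.
 */
static size_t lru_walk(struct lru *l, size_t nr)
{
	size_t visited = 0, removed = 0;

	while (visited < nr && l->count > 0) {
		struct entry *e = l->head.next;

		visited++;
		lru_del(l, e);
		if (e->referenced) {
			e->referenced = false;
			lru_add_tail(l, e);
		} else {
			removed++;
		}
	}
	if (visited > l->max_hold)
		l->max_hold = visited;
	return removed;
}

/* One full aging pass, split into batches of at most GC_BATCH entries. */
static size_t lru_gc(struct lru *l)
{
	size_t remaining = l->count, total = 0;

	while (remaining > 0) {
		size_t batch = remaining < GC_BATCH ? remaining : GC_BATCH;

		total += lru_walk(l, batch);
		remaining -= batch;
	}
	return total;
}
```

With 40 entries of which every other one is referenced, the first
lru_gc() pass removes the 20 unreferenced entries in three holds of at
most 16 entries each, and a second pass removes the remaining 20.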


>> And we don't need two lists to separate recently referenced vs
>> gc candidates because we have a referenced bit in the nf->nf_flags.
>> i.e.  nfsd_file_lru_cb() does:
>>
>> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>                   void *arg)
>> {
>> ....
>>          /* If it was recently added to the list, skip it */
>>          if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
>>                  trace_nfsd_file_gc_referenced(nf);
>>                  return LRU_ROTATE;
>>          }
>> .....
>>
>> Which moves recently referenced entries to the far end of the list,
>> resulting in all the reclaimable objects congregating at the end of
>> the list that is walked first by list_lru_walk().
> 
> My concern (which I haven't voiced yet) about having two lists is that
> it will increase memory traffic over the current single atomic bit
> operation.
> 
> 
>> IOWs, a batched walk like above resumes the walk exactly where it
>> left off, because it is always either reclaiming or rotating the
>> object at the head of the list.
>>
>>> This patch removes the list_lru and instead uses 2 simple linked lists.
>>> When a file is accessed it is removed from whichever list it is on,
>>> then added to the tail of the first list.  Every 2 seconds the second
>>> list is moved to the "freeme" list and the first list is moved to the
>>> second list.  This avoids any need to walk a list to find old entries.
>>
>> Yup, that's exactly what the current code does via the laundrette
>> work that schedules nfsd_file_gc() to run every two seconds.
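
For comparison, the two-list scheme described in the cover letter can
be sketched in userspace as follows. This is an illustration of the
description quoted above, not code from the patch series: struct
gc_lists, gc_touch() and gc_tick() are hypothetical names, locking is
omitted entirely, and it assumes an entry is never touched once it has
reached the freeme list.

```c
#include <stdbool.h>
#include <stddef.h>

struct node {
	struct node *prev, *next;
};

static void list_init(struct node *h)
{
	h->prev = h->next = h;
}

static bool list_is_empty(const struct node *h)
{
	return h->next == h;
}

/* Unlink n from whatever list it is on and leave it self-linked. */
static void list_unlink(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void list_append(struct node *h, struct node *n)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move every node on src to the tail of dst, leaving src empty. */
static void list_splice_all(struct node *dst, struct node *src)
{
	if (list_is_empty(src))
		return;
	src->next->prev = dst->prev;
	dst->prev->next = src->next;
	src->prev->next = dst;
	dst->prev = src->prev;
	list_init(src);
}

static size_t list_len(const struct node *h)
{
	size_t n = 0;

	for (const struct node *p = h->next; p != h; p = p->next)
		n++;
	return n;
}

struct gc_lists {
	struct node recent;	/* touched since the last tick */
	struct node older;	/* touched during the previous interval */
	struct node freeme;	/* idle for a full interval; ready to free */
};

static void gc_init(struct gc_lists *g)
{
	list_init(&g->recent);
	list_init(&g->older);
	list_init(&g->freeme);
}

/* On access: n must have been list_init()ed or be on recent/older. */
static void gc_touch(struct gc_lists *g, struct node *n)
{
	list_unlink(n);
	list_append(&g->recent, n);
}

/* Periodic tick: age both lists by one step without walking entries. */
static void gc_tick(struct gc_lists *g)
{
	list_splice_all(&g->freeme, &g->older);
	list_splice_all(&g->older, &g->recent);
}
```

If gc_tick() runs every 2 seconds, an entry that is not touched again
reaches freeme between 2 and 4 seconds after its last access, which
matches the idle window given in the cover letter. Each gc_touch()
writes to the neighbouring list nodes, where the list_lru approach sets
a single bit.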
>>
>>> These lists are per-netns rather than global because the freeme list
>>> is per-netns, since the actual freeing is done in nfsd threads, which
>>> are per-netns.
>>
>> The list_lru is actually multiple lists - it is a per-numa node list
>> and so moving to global scope linked lists per netns is going to
>> reduce scalability and increase lock contention on large machines.
>>
>> I also don't see any perf numbers, scalability analysis, latency
>> measurement, CPU profiles, etc showing the problems with using list_lru
>> for the GC function, nor any improvement this new code brings.
>>
>> i.e. It's kinda hard to make any real comment on "I do not believe
>> list_lru is a good tool for this" when there are no actual
>> measurements provided to back the statement one way or the other...
> 
> True, it would be good to get some comparative metrics; in particular
> looking at spin lock contention and memory traffic.
> 
> 


-- 
Chuck Lever

Thread overview: 23+ messages
2025-01-27  1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
2025-01-27  1:20 ` [PATCH 1/7] nfsd: filecache: remove race handling NeilBrown
2025-01-27 13:42   ` Jeff Layton
2025-01-27  1:20 ` [PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() NeilBrown
2025-01-27  1:20 ` [PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and nfsd_file_shrinker to be per-net NeilBrown
2025-01-27  1:20 ` [PATCH 4/7] nfsd: filecache: change garbage collection list management NeilBrown
2025-01-27 14:15   ` Jeff Layton
2025-01-27  1:20 ` [PATCH 5/7] nfsd: filecache: document the arbitrary limit on file-disposes-per-loop NeilBrown
2025-01-27 14:40   ` Jeff Layton
2025-01-27  1:20 ` [PATCH 6/7] nfsd: filecache: change garbage collection to a timer NeilBrown
2025-01-27 14:39   ` Jeff Layton
2025-01-27  1:20 ` [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name NeilBrown
2025-01-27 14:29   ` Chuck Lever
2025-01-27 14:40   ` Jeff Layton
2025-01-28  6:37 ` [PATCH 0/7] nfsd: filecache: change garbage collection lists Dave Chinner
2025-01-28 14:27   ` Chuck Lever
2025-01-28 16:05     ` Chuck Lever [this message]
2025-01-29 21:34   ` NeilBrown
2025-02-06  2:21     ` Dave Chinner
2025-02-06  3:04       ` NeilBrown
2025-02-06 14:35         ` Chuck Lever
2025-02-05 23:04 ` NeilBrown
2025-02-06  3:02   ` Chuck Lever
