* [PATCH 0/7] nfsd: filecache: change garbage collection lists
@ 2025-01-27 1:20 NeilBrown
2025-01-27 1:20 ` [PATCH 1/7] nfsd: filecache: remove race handling NeilBrown
` (8 more replies)
0 siblings, 9 replies; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
[
davec added to cc in case I've said something incorrect about list_lru
Changes in this version:
- no _bh locking
- add name for a magic constant
- remove unnecessary race-handling code
- give a more meaningful name to a lock for /proc/lock_stat
- minor cleanups suggested by Jeff
]
The nfsd filecache currently uses list_lru for tracking files recently
used in NFSv3 requests which need to be "garbage collected" when they
have become idle - unused for 2-4 seconds.
I do not believe list_lru is a good tool for this. It does not provide
the timeout which the filecache requires, so we have to add a timeout
mechanism which holds the list_lru lock while the whole list is scanned
looking for entries that haven't been accessed recently. When the list
is largish (even a few hundred entries) this can noticeably block new
requests, which need the lock in order to remove a file before accessing it.
This patch removes the list_lru and instead uses 2 simple linked lists.
When a file is accessed it is removed from whichever list it is on,
then added to the tail of the first list. Every 2 seconds the second
list is moved to the "freeme" list and the first list is moved to the
second list. This avoids any need to walk a list to find old entries.
These lists are per-netns rather than global because the freeme list is
per-netns: the actual freeing is done by nfsd threads, which are
per-netns.
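As a rough userspace sketch of the scheme described above (plain C with
illustrative names - none of these are the actual nfsd symbols), the
aging is a pair of O(1) list splices per tick, so the lock is never held
while a list is walked:

```c
#include <assert.h>

/* Minimal doubly-linked list in the spirit of the kernel's list_head.
 * All names are illustrative; this is not the nfsd code itself. */
struct node { struct node *prev, *next; };

static void list_init(struct node *h) { h->prev = h->next = h; }
static int list_empty(const struct node *h) { return h->next == h; }

static void list_del_init(struct node *n)
{
	n->prev->next = n->next;
	n->next->prev = n->prev;
	n->prev = n->next = n;
}

static void list_add_tail(struct node *n, struct node *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/* Move all of "src" to the tail of "dst" with a constant number of
 * pointer updates, like the kernel's list_splice_tail_init(). */
static void list_splice_tail_init(struct node *src, struct node *dst)
{
	if (list_empty(src))
		return;
	src->next->prev = dst->prev;
	dst->prev->next = src->next;
	src->prev->next = dst;
	dst->prev = src->prev;
	list_init(src);
}

static struct node recent, older, freeme;

/* On every access, move the file to the tail of "recent". */
static void nf_touch(struct node *nf)
{
	list_del_init(nf);
	list_add_tail(nf, &recent);
}

/* Every ~2 seconds: entries unused for a whole tick become freeable. */
static void gc_tick(void)
{
	list_splice_tail_init(&older, &freeme);	/* older -> ready to free */
	list_splice_tail_init(&recent, &older);	/* recent -> one tick older */
}
```

A file touched between ticks always survives at least one full tick; a
file untouched for two ticks lands on "freeme", giving the 2-4 second
idle window without ever scanning a list.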
Thanks,
NeilBrown
[PATCH 1/7] nfsd: filecache: remove race handling.
[PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in
[PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and
[PATCH 4/7] nfsd: filecache: change garbage collection list
[PATCH 5/7] nfsd: filecache: document the arbitrary limit on
[PATCH 6/7] nfsd: filecache: change garbage collection to a timer.
[PATCH 7/7] nfsd: filecache: give disposal lock a unique class name.
^ permalink raw reply [flat|nested] 23+ messages in thread
* [PATCH 1/7] nfsd: filecache: remove race handling.
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
@ 2025-01-27 1:20 ` NeilBrown
2025-01-27 13:42 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() NeilBrown
` (7 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
The race that this code tries to protect against is not interesting.
The code is problematic as we access the "nf" after we have given our
reference to the lru system. While the lru takes 2+ seconds to free
things, it is still poor form.
The only interesting race I can find would be with
nfsd_file_close_inode_sync().
This is the only place that really doesn't want the file to stay on the
LRU when unhashed (which is the direct consequence of the race).
However for the race to happen, some other thread must own a reference
to a file and be putting it while nfsd_file_close_inode_sync() is trying
to close all files for an inode. If this is possible, that other thread
could simply call nfsd_file_put() a little bit later and the result
would be the same: not all files are closed when
nfsd_file_close_inode_sync() completes.
If this was really a problem, we would need to wait in close_inode_sync
for the other references to be dropped. We probably don't want to do
that.
So it is best to simply remove this code.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 16 +++-------------
1 file changed, 3 insertions(+), 13 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index e342b2e76882..f5a92ac3f16f 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -371,20 +371,10 @@ nfsd_file_put(struct nfsd_file *nf)
/* Try to add it to the LRU. If that fails, decrement. */
if (nfsd_file_lru_add(nf)) {
- /* If it's still hashed, we're done */
- if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
- nfsd_file_schedule_laundrette();
- return;
- }
-
- /*
- * We're racing with unhashing, so try to remove it from
- * the LRU. If removal fails, then someone else already
- * has our reference.
- */
- if (!nfsd_file_lru_remove(nf))
- return;
+ nfsd_file_schedule_laundrette();
+ return;
}
+
}
if (refcount_dec_and_test(&nf->nf_ref))
nfsd_file_free(nf);
--
2.47.1
* [PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync()
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
2025-01-27 1:20 ` [PATCH 1/7] nfsd: filecache: remove race handling NeilBrown
@ 2025-01-27 1:20 ` NeilBrown
2025-01-27 1:20 ` [PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and nfsd_file_shrinker to be per-net NeilBrown
` (6 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
nfsd_file_close_inode_sync() contains an exact copy of
nfsd_file_dispose_list().
This patch removes it and calls nfsd_file_dispose_list() instead.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index f5a92ac3f16f..549969d4aa7c 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -681,17 +681,12 @@ nfsd_file_close_inode(struct inode *inode)
void
nfsd_file_close_inode_sync(struct inode *inode)
{
- struct nfsd_file *nf;
LIST_HEAD(dispose);
trace_nfsd_file_close(inode);
nfsd_file_queue_for_close(inode, &dispose);
- while (!list_empty(&dispose)) {
- nf = list_first_entry(&dispose, struct nfsd_file, nf_gc);
- list_del_init(&nf->nf_gc);
- nfsd_file_free(nf);
- }
+ nfsd_file_dispose_list(&dispose);
}
static int
--
2.47.1
* [PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and nfsd_file_shrinker to be per-net
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
2025-01-27 1:20 ` [PATCH 1/7] nfsd: filecache: remove race handling NeilBrown
2025-01-27 1:20 ` [PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() NeilBrown
@ 2025-01-27 1:20 ` NeilBrown
2025-01-27 1:20 ` [PATCH 4/7] nfsd: filecache: change garbage collection list management NeilBrown
` (5 subsequent siblings)
8 siblings, 0 replies; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
The final freeing of nfsd files is done by per-net nfsd threads (which
call nfsd_file_net_dispose()) so it makes sense to move more of the
freeing infrastructure to be per-net - into struct nfsd_fcache_disposal.
This is a step towards replacing the list_lru with simple lists which
each share the per-net lock in nfsd_fcache_disposal and will require
less list walking.
As the net is always shut down before there is any chance that the rest
of the filecache is shut down, we can remove the tests on
NFSD_FILE_CACHE_UP.
For the filecache stats file, which assumes a global lru, we keep a
separate counter which includes all files in all netns lrus.
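That separate counter relies on the usual per-CPU trick: increments and
decrements may land on different CPUs, but summing all the (wrapping)
unsigned per-CPU deltas at read time yields the true total. A
hypothetical userspace sketch (names are illustrative, not the kernel's):

```c
#include <assert.h>

#define NR_CPUS 4	/* arbitrary for this sketch */

/* Stand-in for DEFINE_PER_CPU(unsigned long, nfsd_file_lru_total). */
static unsigned long lru_total[NR_CPUS];

static void lru_total_inc(int cpu) { lru_total[cpu]++; }
static void lru_total_dec(int cpu) { lru_total[cpu]--; }

/* The stats file sums every CPU's delta.  A single CPU's slot may have
 * wrapped below zero (it only ever saw decrements), but unsigned
 * wraparound cancels out in the sum. */
static unsigned long lru_total_sum(void)
{
	unsigned long sum = 0;
	int i;

	for (i = 0; i < NR_CPUS; i++)
		sum += lru_total[i];
	return sum;
}
```

This is why the stats file can report an lru count without any global
lru existing: each netns only ever touches its local lists, and the
per-CPU counters are cheap on the hot paths.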
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 128 +++++++++++++++++++++++++-------------------
1 file changed, 72 insertions(+), 56 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 549969d4aa7c..55f69bcde500 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -60,18 +60,20 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_allocations);
static DEFINE_PER_CPU(unsigned long, nfsd_file_releases);
static DEFINE_PER_CPU(unsigned long, nfsd_file_total_age);
static DEFINE_PER_CPU(unsigned long, nfsd_file_evictions);
+static DEFINE_PER_CPU(unsigned long, nfsd_file_lru_total);
struct nfsd_fcache_disposal {
spinlock_t lock;
+ struct list_lru file_lru;
struct list_head freeme;
+ struct delayed_work filecache_laundrette;
+ struct shrinker *file_shrinker;
};
static struct kmem_cache *nfsd_file_slab;
static struct kmem_cache *nfsd_file_mark_slab;
-static struct list_lru nfsd_file_lru;
static unsigned long nfsd_file_flags;
static struct fsnotify_group *nfsd_file_fsnotify_group;
-static struct delayed_work nfsd_filecache_laundrette;
static struct rhltable nfsd_file_rhltable
____cacheline_aligned_in_smp;
@@ -109,11 +111,18 @@ static const struct rhashtable_params nfsd_file_rhash_params = {
};
static void
-nfsd_file_schedule_laundrette(void)
+nfsd_file_schedule_laundrette(struct nfsd_fcache_disposal *l)
{
- if (test_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags))
- queue_delayed_work(system_unbound_wq, &nfsd_filecache_laundrette,
- NFSD_LAUNDRETTE_DELAY);
+ queue_delayed_work(system_unbound_wq, &l->filecache_laundrette,
+ NFSD_LAUNDRETTE_DELAY);
+}
+
+static void
+nfsd_file_schedule_laundrette_net(struct net *net)
+{
+ struct nfsd_net *nn = net_generic(net, nfsd_net_id);
+
+ nfsd_file_schedule_laundrette(nn->fcache_disposal);
}
static void
@@ -318,11 +327,14 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
}
-
static bool nfsd_file_lru_add(struct nfsd_file *nf)
{
+ struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
+ struct nfsd_fcache_disposal *l = nn->fcache_disposal;
+
set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
- if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
+ if (list_lru_add_obj(&l->file_lru, &nf->nf_lru)) {
+ this_cpu_inc(nfsd_file_lru_total);
trace_nfsd_file_lru_add(nf);
return true;
}
@@ -331,7 +343,11 @@ static bool nfsd_file_lru_add(struct nfsd_file *nf)
static bool nfsd_file_lru_remove(struct nfsd_file *nf)
{
- if (list_lru_del_obj(&nfsd_file_lru, &nf->nf_lru)) {
+ struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
+ struct nfsd_fcache_disposal *l = nn->fcache_disposal;
+
+ if (list_lru_del_obj(&l->file_lru, &nf->nf_lru)) {
+ this_cpu_dec(nfsd_file_lru_total);
trace_nfsd_file_lru_del(nf);
return true;
}
@@ -371,7 +387,7 @@ nfsd_file_put(struct nfsd_file *nf)
/* Try to add it to the LRU. If that fails, decrement. */
if (nfsd_file_lru_add(nf)) {
- nfsd_file_schedule_laundrette();
+ nfsd_file_schedule_laundrette_net(nf->nf_net);
return;
}
@@ -525,12 +541,14 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
if (!refcount_dec_and_test(&nf->nf_ref)) {
trace_nfsd_file_gc_in_use(nf);
list_lru_isolate(lru, &nf->nf_lru);
+ this_cpu_dec(nfsd_file_lru_total);
return LRU_REMOVED;
}
/* Refcount went to zero. Unhash it and queue it to the dispose list */
nfsd_file_unhash(nf);
list_lru_isolate(lru, &nf->nf_lru);
+ this_cpu_dec(nfsd_file_lru_total);
list_add(&nf->nf_gc, head);
this_cpu_inc(nfsd_file_evictions);
trace_nfsd_file_gc_disposed(nf);
@@ -538,18 +556,18 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
}
static void
-nfsd_file_gc(void)
+nfsd_file_gc(struct nfsd_fcache_disposal *l)
{
- unsigned long remaining = list_lru_count(&nfsd_file_lru);
+ unsigned long remaining = list_lru_count(&l->file_lru);
LIST_HEAD(dispose);
unsigned long ret;
while (remaining > 0) {
unsigned long num_to_scan = min(remaining, NFSD_FILE_GC_BATCH);
- ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
+ ret = list_lru_walk(&l->file_lru, nfsd_file_lru_cb,
&dispose, num_to_scan);
- trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
+ trace_nfsd_file_gc_removed(ret, list_lru_count(&l->file_lru));
nfsd_file_dispose_list_delayed(&dispose);
remaining -= num_to_scan;
}
@@ -558,32 +576,36 @@ nfsd_file_gc(void)
static void
nfsd_file_gc_worker(struct work_struct *work)
{
- nfsd_file_gc();
- if (list_lru_count(&nfsd_file_lru))
- nfsd_file_schedule_laundrette();
+ struct nfsd_fcache_disposal *l = container_of(
+ work, struct nfsd_fcache_disposal, filecache_laundrette.work);
+ nfsd_file_gc(l);
+ if (list_lru_count(&l->file_lru))
+ nfsd_file_schedule_laundrette(l);
}
static unsigned long
nfsd_file_lru_count(struct shrinker *s, struct shrink_control *sc)
{
- return list_lru_count(&nfsd_file_lru);
+ struct nfsd_fcache_disposal *l = s->private_data;
+
+ return list_lru_count(&l->file_lru);
}
static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
+ struct nfsd_fcache_disposal *l = s->private_data;
+
LIST_HEAD(dispose);
unsigned long ret;
- ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
+ ret = list_lru_shrink_walk(&l->file_lru, sc,
nfsd_file_lru_cb, &dispose);
- trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru));
+ trace_nfsd_file_shrinker_removed(ret, list_lru_count(&l->file_lru));
nfsd_file_dispose_list_delayed(&dispose);
return ret;
}
-static struct shrinker *nfsd_file_shrinker;
-
/**
* nfsd_file_cond_queue - conditionally unhash and queue a nfsd_file
* @nf: nfsd_file to attempt to queue
@@ -763,29 +785,10 @@ nfsd_file_cache_init(void)
goto out_err;
}
- ret = list_lru_init(&nfsd_file_lru);
- if (ret) {
- pr_err("nfsd: failed to init nfsd_file_lru: %d\n", ret);
- goto out_err;
- }
-
- nfsd_file_shrinker = shrinker_alloc(0, "nfsd-filecache");
- if (!nfsd_file_shrinker) {
- ret = -ENOMEM;
- pr_err("nfsd: failed to allocate nfsd_file_shrinker\n");
- goto out_lru;
- }
-
- nfsd_file_shrinker->count_objects = nfsd_file_lru_count;
- nfsd_file_shrinker->scan_objects = nfsd_file_lru_scan;
- nfsd_file_shrinker->seeks = 1;
-
- shrinker_register(nfsd_file_shrinker);
-
ret = lease_register_notifier(&nfsd_file_lease_notifier);
if (ret) {
pr_err("nfsd: unable to register lease notifier: %d\n", ret);
- goto out_shrinker;
+ goto out_err;
}
nfsd_file_fsnotify_group = fsnotify_alloc_group(&nfsd_file_fsnotify_ops,
@@ -798,17 +801,12 @@ nfsd_file_cache_init(void)
goto out_notifier;
}
- INIT_DELAYED_WORK(&nfsd_filecache_laundrette, nfsd_file_gc_worker);
out:
if (ret)
clear_bit(NFSD_FILE_CACHE_UP, &nfsd_file_flags);
return ret;
out_notifier:
lease_unregister_notifier(&nfsd_file_lease_notifier);
-out_shrinker:
- shrinker_free(nfsd_file_shrinker);
-out_lru:
- list_lru_destroy(&nfsd_file_lru);
out_err:
kmem_cache_destroy(nfsd_file_slab);
nfsd_file_slab = NULL;
@@ -860,13 +858,38 @@ nfsd_alloc_fcache_disposal(void)
if (!l)
return NULL;
spin_lock_init(&l->lock);
+ INIT_DELAYED_WORK(&l->filecache_laundrette, nfsd_file_gc_worker);
INIT_LIST_HEAD(&l->freeme);
+ l->file_shrinker = shrinker_alloc(0, "nfsd-filecache");
+ if (!l->file_shrinker) {
+ pr_err("nfsd: failed to allocate nfsd_file_shrinker\n");
+ kfree(l);
+ return NULL;
+ }
+ l->file_shrinker->count_objects = nfsd_file_lru_count;
+ l->file_shrinker->scan_objects = nfsd_file_lru_scan;
+ l->file_shrinker->seeks = 1;
+ l->file_shrinker->private_data = l;
+
+ /* if file_lru is not zeroed it can trigger a bug: ->key is the problem */
+ memset(&l->file_lru, 0, sizeof(l->file_lru));
+ if (list_lru_init(&l->file_lru)) {
+ pr_err("nfsd: failed to init nfsd_file_lru\n");
+ shrinker_free(l->file_shrinker);
+ kfree(l);
+ return NULL;
+ }
+
+ shrinker_register(l->file_shrinker);
return l;
}
static void
nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
{
+ cancel_delayed_work_sync(&l->filecache_laundrette);
+ shrinker_free(l->file_shrinker);
+ list_lru_destroy(&l->file_lru);
nfsd_file_dispose_list(&l->freeme);
kfree(l);
}
@@ -919,14 +942,7 @@ nfsd_file_cache_shutdown(void)
return;
lease_unregister_notifier(&nfsd_file_lease_notifier);
- shrinker_free(nfsd_file_shrinker);
- /*
- * make sure all callers of nfsd_file_lru_cb are done before
- * calling nfsd_file_cache_purge
- */
- cancel_delayed_work_sync(&nfsd_filecache_laundrette);
__nfsd_file_cache_purge(NULL);
- list_lru_destroy(&nfsd_file_lru);
rcu_barrier();
fsnotify_put_group(nfsd_file_fsnotify_group);
nfsd_file_fsnotify_group = NULL;
@@ -944,6 +960,7 @@ nfsd_file_cache_shutdown(void)
per_cpu(nfsd_file_releases, i) = 0;
per_cpu(nfsd_file_total_age, i) = 0;
per_cpu(nfsd_file_evictions, i) = 0;
+ per_cpu(nfsd_file_lru_total, i) = 0;
}
}
@@ -1297,8 +1314,6 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
struct bucket_table *tbl;
struct rhashtable *ht;
- lru = list_lru_count(&nfsd_file_lru);
-
rcu_read_lock();
ht = &nfsd_file_rhltable.ht;
count = atomic_read(&ht->nelems);
@@ -1309,6 +1324,7 @@ int nfsd_file_cache_stats_show(struct seq_file *m, void *v)
mutex_unlock(&nfsd_mutex);
for_each_possible_cpu(i) {
+ lru += per_cpu(nfsd_file_lru_total, i);
hits += per_cpu(nfsd_file_cache_hits, i);
acquisitions += per_cpu(nfsd_file_acquisitions, i);
allocations += per_cpu(nfsd_file_allocations, i);
--
2.47.1
* [PATCH 4/7] nfsd: filecache: change garbage collection list management.
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
` (2 preceding siblings ...)
2025-01-27 1:20 ` [PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and nfsd_file_shrinker to be per-net NeilBrown
@ 2025-01-27 1:20 ` NeilBrown
2025-01-27 14:15 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 5/7] nfsd: filecache: document the arbitrary limit on file-disposes-per-loop NeilBrown
` (4 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
The nfsd filecache currently uses list_lru for tracking files recently
used in NFSv3 requests which need to be "garbage collected" when they
have become idle - unused for 2-4 seconds.
I do not believe list_lru is a good tool for this. It does not provide
the timeout which the filecache requires, so we have to add a timeout
mechanism which holds the list_lru lock while the whole list is scanned
looking for entries that haven't been accessed recently. When the list
is largish (even a few hundred entries) this can block new requests,
which need the lock in order to remove a file before accessing it.
This patch removes the list_lru and instead uses 2 simple linked lists.
When a file is accessed it is removed from whichever list it is on,
then added to the tail of the first list. Every 2 seconds the second
list is moved to the "freeme" list and the first list is moved to the
second list. This avoids any need to walk a list to find old entries.
Previously a file would be unhashed before being moved to the freeme
list. We don't do that any more. The freeme list is much like the
other two lists (recent and older) in that they all hold a reference to
the file and the file is still hashed. When the nfsd thread processes
the freeme list it now uses the new nfsd_file_release_list() which uses
nfsd_file_cond_queue() to unhash and drop the refcount.
We no longer have a precise count of the size of the lru (recent +
older) as we don't know how big "older" is when it is moved to "freeme".
However the shrinker can cope with an approximation. So we keep a
count of number in the lru and when "older" is moved to "freeme" we
divide that count by 2. When we remove anything from the lru we
decrement that counter but ensure it never goes negative. Naturally
when we add to the lru we increase the counter.
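A minimal sketch of that approximate counter (illustrative names, not
the kernel's; the real code keeps this under the same lock as the
lists, which is elided here):

```c
#include <assert.h>

/* Approximate count of entries on "recent" plus "older". */
static unsigned long num_gc;

static void lru_count_add(void)
{
	num_gc += 1;
}

static void lru_count_del(void)
{
	/* Never let the estimate go negative. */
	if (num_gc > 0)
		num_gc -= 1;
}

/* On each 2-second tick "older" (of unknown length) moves to "freeme",
 * so guess that half of the tracked entries just left. */
static void lru_count_tick(void)
{
	num_gc /= 2;
}
```

The estimate only feeds the shrinker's count_objects callback, which
tolerates imprecision; an exact count would require walking the lists,
defeating the point of the change.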
For the filecache stats file, which assumes a global lru, we keep a
separate counter which includes all files in all netns in recent or
older or freeme.
We discard the nf_gc linkage in an nfsd_file and only use nf_lru.
We discard NFSD_FILE_REFERENCED.
This patch drops the nfsd_file_gc_removed() trace point. I couldn't
think of useful information to provide.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 218 +++++++++++++++++++++-----------------------
fs/nfsd/filecache.h | 4 +-
fs/nfsd/trace.h | 4 -
3 files changed, 107 insertions(+), 119 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 55f69bcde500..1e90da507152 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -34,7 +34,6 @@
#include <linux/file.h>
#include <linux/pagemap.h>
#include <linux/sched.h>
-#include <linux/list_lru.h>
#include <linux/fsnotify_backend.h>
#include <linux/fsnotify.h>
#include <linux/seq_file.h>
@@ -64,10 +63,13 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_lru_total);
struct nfsd_fcache_disposal {
spinlock_t lock;
- struct list_lru file_lru;
- struct list_head freeme;
+ struct list_head recent; /* have been used in last 0-2 seconds */
+ struct list_head older; /* haven't been used in last 0-2 seconds */
+ struct list_head freeme; /* ready to be discarded */
+ unsigned long num_gc; /* Approximate size of recent plus older */
struct delayed_work filecache_laundrette;
struct shrinker *file_shrinker;
+ struct nfsd_net *nn;
};
static struct kmem_cache *nfsd_file_slab;
@@ -227,7 +229,6 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
this_cpu_inc(nfsd_file_allocations);
INIT_LIST_HEAD(&nf->nf_lru);
- INIT_LIST_HEAD(&nf->nf_gc);
nf->nf_birthtime = ktime_get();
nf->nf_file = NULL;
nf->nf_cred = get_current_cred();
@@ -332,12 +333,16 @@ static bool nfsd_file_lru_add(struct nfsd_file *nf)
struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
struct nfsd_fcache_disposal *l = nn->fcache_disposal;
- set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
- if (list_lru_add_obj(&l->file_lru, &nf->nf_lru)) {
+ spin_lock(&l->lock);
+ if (list_empty(&nf->nf_lru)) {
+ list_add_tail(&nf->nf_lru, &l->recent);
+ l->num_gc += 1;
this_cpu_inc(nfsd_file_lru_total);
trace_nfsd_file_lru_add(nf);
+ spin_unlock(&l->lock);
return true;
}
+ spin_unlock(&l->lock);
return false;
}
@@ -346,11 +351,17 @@ static bool nfsd_file_lru_remove(struct nfsd_file *nf)
struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
struct nfsd_fcache_disposal *l = nn->fcache_disposal;
- if (list_lru_del_obj(&l->file_lru, &nf->nf_lru)) {
+ spin_lock(&l->lock);
+ if (!list_empty(&nf->nf_lru)) {
+ list_del_init(&nf->nf_lru);
this_cpu_dec(nfsd_file_lru_total);
+ if (l->num_gc > 0)
+ l->num_gc -= 1;
trace_nfsd_file_lru_del(nf);
+ spin_unlock(&l->lock);
return true;
}
+ spin_unlock(&l->lock);
return false;
}
@@ -430,12 +441,26 @@ nfsd_file_dispose_list(struct list_head *dispose)
struct nfsd_file *nf;
while (!list_empty(dispose)) {
- nf = list_first_entry(dispose, struct nfsd_file, nf_gc);
- list_del_init(&nf->nf_gc);
+ nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
+ list_del_init(&nf->nf_lru);
nfsd_file_free(nf);
}
}
+static void
+nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose);
+
+static void
+nfsd_file_release_list(struct list_head *dispose)
+{
+ LIST_HEAD(dispose2);
+ struct nfsd_file *nf, *nf2;
+
+ list_for_each_entry_safe(nf, nf2, dispose, nf_lru)
+ nfsd_file_cond_queue(nf, &dispose2);
+ nfsd_file_dispose_list(&dispose2);
+}
+
/**
* nfsd_file_dispose_list_delayed - move list of dead files to net's freeme list
* @dispose: list of nfsd_files to be disposed
@@ -448,13 +473,13 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
{
while(!list_empty(dispose)) {
struct nfsd_file *nf = list_first_entry(dispose,
- struct nfsd_file, nf_gc);
+ struct nfsd_file, nf_lru);
struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
struct nfsd_fcache_disposal *l = nn->fcache_disposal;
struct svc_serv *serv;
spin_lock(&l->lock);
- list_move_tail(&nf->nf_gc, &l->freeme);
+ list_move_tail(&nf->nf_lru, &l->freeme);
spin_unlock(&l->lock);
/*
@@ -486,90 +511,32 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
int i;
spin_lock(&l->lock);
- for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
- list_move(l->freeme.next, &dispose);
+ for (i = 0; i < 8 && !list_empty(&l->freeme); i++) {
+ struct nfsd_file *nf = list_first_entry(
+ &l->freeme, struct nfsd_file, nf_lru);
+
+ /*
+ * Don't throw out files that are still
+ * undergoing I/O or that have uncleared errors
+ * pending.
+ */
+ if (nfsd_file_check_writeback(nf)) {
+ trace_nfsd_file_gc_writeback(nf);
+ list_move(&nf->nf_lru, &l->recent);
+ l->num_gc += 1;
+ } else {
+ trace_nfsd_file_gc_disposed(nf);
+ list_move(&nf->nf_lru, &dispose);
+ this_cpu_inc(nfsd_file_evictions);
+ }
+ }
spin_unlock(&l->lock);
if (!list_empty(&l->freeme))
/* Wake up another thread to share the work
* *before* doing any actual disposing.
*/
svc_wake_up(nn->nfsd_serv);
- nfsd_file_dispose_list(&dispose);
- }
-}
-
-/**
- * nfsd_file_lru_cb - Examine an entry on the LRU list
- * @item: LRU entry to examine
- * @lru: controlling LRU
- * @arg: dispose list
- *
- * Return values:
- * %LRU_REMOVED: @item was removed from the LRU
- * %LRU_ROTATE: @item is to be moved to the LRU tail
- * %LRU_SKIP: @item cannot be evicted
- */
-static enum lru_status
-nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
- void *arg)
-{
- struct list_head *head = arg;
- struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
-
- /* We should only be dealing with GC entries here */
- WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags));
-
- /*
- * Don't throw out files that are still undergoing I/O or
- * that have uncleared errors pending.
- */
- if (nfsd_file_check_writeback(nf)) {
- trace_nfsd_file_gc_writeback(nf);
- return LRU_SKIP;
- }
-
- /* If it was recently added to the list, skip it */
- if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
- trace_nfsd_file_gc_referenced(nf);
- return LRU_ROTATE;
- }
-
- /*
- * Put the reference held on behalf of the LRU. If it wasn't the last
- * one, then just remove it from the LRU and ignore it.
- */
- if (!refcount_dec_and_test(&nf->nf_ref)) {
- trace_nfsd_file_gc_in_use(nf);
- list_lru_isolate(lru, &nf->nf_lru);
- this_cpu_dec(nfsd_file_lru_total);
- return LRU_REMOVED;
- }
-
- /* Refcount went to zero. Unhash it and queue it to the dispose list */
- nfsd_file_unhash(nf);
- list_lru_isolate(lru, &nf->nf_lru);
- this_cpu_dec(nfsd_file_lru_total);
- list_add(&nf->nf_gc, head);
- this_cpu_inc(nfsd_file_evictions);
- trace_nfsd_file_gc_disposed(nf);
- return LRU_REMOVED;
-}
-
-static void
-nfsd_file_gc(struct nfsd_fcache_disposal *l)
-{
- unsigned long remaining = list_lru_count(&l->file_lru);
- LIST_HEAD(dispose);
- unsigned long ret;
-
- while (remaining > 0) {
- unsigned long num_to_scan = min(remaining, NFSD_FILE_GC_BATCH);
-
- ret = list_lru_walk(&l->file_lru, nfsd_file_lru_cb,
- &dispose, num_to_scan);
- trace_nfsd_file_gc_removed(ret, list_lru_count(&l->file_lru));
- nfsd_file_dispose_list_delayed(&dispose);
- remaining -= num_to_scan;
+ nfsd_file_release_list(&dispose);
}
}
@@ -578,9 +545,19 @@ nfsd_file_gc_worker(struct work_struct *work)
{
struct nfsd_fcache_disposal *l = container_of(
work, struct nfsd_fcache_disposal, filecache_laundrette.work);
- nfsd_file_gc(l);
- if (list_lru_count(&l->file_lru))
+
+ spin_lock(&l->lock);
+ list_splice_init(&l->older, &l->freeme);
+ list_splice_init(&l->recent, &l->older);
+ /* We don't know how many were moved to 'freeme' and don't want
+ * to waste time counting - guess a half.
+ */
+ l->num_gc /= 2;
+ if (!list_empty(&l->freeme))
+ svc_wake_up(l->nn->nfsd_serv);
+ if (!list_empty(&l->older) || !list_empty(&l->recent))
nfsd_file_schedule_laundrette(l);
+ spin_unlock(&l->lock);
}
static unsigned long
@@ -588,22 +565,40 @@ nfsd_file_lru_count(struct shrinker *s, struct shrink_control *sc)
{
struct nfsd_fcache_disposal *l = s->private_data;
- return list_lru_count(&l->file_lru);
+ return l->num_gc;
}
static unsigned long
nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
{
struct nfsd_fcache_disposal *l = s->private_data;
+ struct nfsd_file *nf;
+ int scanned = 0;
+
+ spin_lock(&l->lock);
+ while (scanned < sc->nr_to_scan &&
+ (nf = list_first_entry_or_null(&l->older,
+ struct nfsd_file, nf_lru)) != NULL) {
+ list_del_init(&nf->nf_lru);
+ list_add_tail(&nf->nf_lru, &l->freeme);
+ if (l->num_gc > 0)
+ l->num_gc -= 1;
+ scanned += 1;
+ }
+ if (scanned > 0)
+ svc_wake_up(l->nn->nfsd_serv);
- LIST_HEAD(dispose);
- unsigned long ret;
+ trace_nfsd_file_shrinker_removed(scanned, l->num_gc);
- ret = list_lru_shrink_walk(&l->file_lru, sc,
- nfsd_file_lru_cb, &dispose);
- trace_nfsd_file_shrinker_removed(ret, list_lru_count(&l->file_lru));
- nfsd_file_dispose_list_delayed(&dispose);
- return ret;
+ while (scanned < sc->nr_to_scan &&
+ (nf = list_first_entry_or_null(&l->recent,
+ struct nfsd_file, nf_lru)) != NULL) {
+ list_del_init(&nf->nf_lru);
+ list_add_tail(&nf->nf_lru, &l->older);
+ scanned += 1;
+ }
+ spin_unlock(&l->lock);
+ return scanned;
}
/**
@@ -616,7 +611,6 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
*/
static void
nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose)
- __must_hold(RCU)
{
int decrement = 1;
@@ -634,7 +628,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose)
/* If refcount goes to 0, then put on the dispose list */
if (refcount_sub_and_test(decrement, &nf->nf_ref)) {
- list_add(&nf->nf_gc, dispose);
+ list_add(&nf->nf_lru, dispose);
trace_nfsd_file_closing(nf);
}
}
@@ -859,7 +853,10 @@ nfsd_alloc_fcache_disposal(void)
return NULL;
spin_lock_init(&l->lock);
INIT_DELAYED_WORK(&l->filecache_laundrette, nfsd_file_gc_worker);
+ INIT_LIST_HEAD(&l->recent);
+ INIT_LIST_HEAD(&l->older);
INIT_LIST_HEAD(&l->freeme);
+ l->num_gc = 0;
l->file_shrinker = shrinker_alloc(0, "nfsd-filecache");
if (!l->file_shrinker) {
pr_err("nfsd: failed to allocate nfsd_file_shrinker\n");
@@ -871,15 +868,6 @@ nfsd_alloc_fcache_disposal(void)
l->file_shrinker->seeks = 1;
l->file_shrinker->private_data = l;
- /* if file_lru is not zeroed it can trigger a bug: ->key is the problem */
- memset(&l->file_lru, 0, sizeof(l->file_lru));
- if (list_lru_init(&l->file_lru)) {
- pr_err("nfsd: failed to init nfsd_file_lru\n");
- shrinker_free(l->file_shrinker);
- kfree(l);
- return NULL;
- }
-
shrinker_register(l->file_shrinker);
return l;
}
@@ -889,8 +877,12 @@ nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
{
cancel_delayed_work_sync(&l->filecache_laundrette);
shrinker_free(l->file_shrinker);
- list_lru_destroy(&l->file_lru);
- nfsd_file_dispose_list(&l->freeme);
+ nfsd_file_release_list(&l->recent);
+ WARN_ON_ONCE(!list_empty(&l->recent));
+ nfsd_file_release_list(&l->older);
+ WARN_ON_ONCE(!list_empty(&l->older));
+ nfsd_file_release_list(&l->freeme);
+ WARN_ON_ONCE(!list_empty(&l->freeme));
kfree(l);
}
@@ -909,6 +901,8 @@ nfsd_file_cache_start_net(struct net *net)
struct nfsd_net *nn = net_generic(net, nfsd_net_id);
nn->fcache_disposal = nfsd_alloc_fcache_disposal();
+ if (nn->fcache_disposal)
+ nn->fcache_disposal->nn = nn;
return nn->fcache_disposal ? 0 : -ENOMEM;
}
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index a09a851d510e..02059b6c5e9a 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -42,15 +42,13 @@ struct nfsd_file {
struct net *nf_net;
#define NFSD_FILE_HASHED (0)
#define NFSD_FILE_PENDING (1)
-#define NFSD_FILE_REFERENCED (2)
-#define NFSD_FILE_GC (3)
+#define NFSD_FILE_GC (2)
unsigned long nf_flags;
refcount_t nf_ref;
unsigned char nf_may;
struct nfsd_file_mark *nf_mark;
struct list_head nf_lru;
- struct list_head nf_gc;
struct rcu_head nf_rcu;
ktime_t nf_birthtime;
};
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index ad2c0c432d08..efa683541ed5 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -1038,7 +1038,6 @@ DEFINE_CLID_EVENT(confirmed_r);
__print_flags(val, "|", \
{ 1 << NFSD_FILE_HASHED, "HASHED" }, \
{ 1 << NFSD_FILE_PENDING, "PENDING" }, \
- { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
{ 1 << NFSD_FILE_GC, "GC" })
DECLARE_EVENT_CLASS(nfsd_file_class,
@@ -1314,9 +1313,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_add);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_add_disposed);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
-DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
-DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
@@ -1345,7 +1342,6 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
), \
TP_ARGS(removed, remaining))
-DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
TRACE_EVENT(nfsd_file_close,
--
2.47.1
^ permalink raw reply related [flat|nested] 23+ messages in thread
* [PATCH 5/7] nfsd: filecache: document the arbitrary limit on file-disposes-per-loop
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
` (3 preceding siblings ...)
2025-01-27 1:20 ` [PATCH 4/7] nfsd: filecache: change garbage collection list management NeilBrown
@ 2025-01-27 1:20 ` NeilBrown
2025-01-27 14:40 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 6/7] nfsd: filecache: change garbage collection to a timer NeilBrown
` (3 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
Rather than having the bare number "8", use a named constant and explain
the tradeoffs that led to the choice.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 18 +++++++++++++++++-
1 file changed, 17 insertions(+), 1 deletion(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 1e90da507152..7264faa57280 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -493,6 +493,21 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
}
}
+/*
+ * Disposing of files can involve non-trivial work, and they
+ * can appear in batches. So we don't want to try handling them
+ * all in one thread - if there are lots it would be better to allow
+ * several nfsd threads to handle them in parallel.
+ * On average one RPC request can create at most 1 file to be disposed
+ * so handling one each time around the nfsd loop should keep the list
+ * under control. However there are often benefits of batching so
+ * 2 at a time will likely be more efficient than 1. 4 more so.
+ * We need to choose a number which will often handle all the files,
+ * but will allow other threads to help when the list gets long.
+ * The current choice is:
+ */
+#define NFSD_FILE_DISPOSE_BATCH 8
+
/**
* nfsd_file_net_dispose - deal with nfsd_files waiting to be disposed.
* @nn: nfsd_net in which to find files to be disposed.
@@ -511,7 +526,8 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
int i;
spin_lock(&l->lock);
- for (i = 0; i < 8 && !list_empty(&l->freeme); i++) {
+ for (i = 0; i < NFSD_FILE_DISPOSE_BATCH &&
+ !list_empty(&l->freeme); i++) {
struct nfsd_file *nf = list_first_entry(
&l->freeme, struct nfsd_file, nf_lru);
--
2.47.1
* [PATCH 6/7] nfsd: filecache: change garbage collection to a timer.
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
` (4 preceding siblings ...)
2025-01-27 1:20 ` [PATCH 5/7] nfsd: filecache: document the arbitrary limit on file-disposes-per-loop NeilBrown
@ 2025-01-27 1:20 ` NeilBrown
2025-01-27 14:39 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name NeilBrown
` (2 subsequent siblings)
8 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
Garbage collection never sleeps and no longer walks a list, so it runs
quickly and requires only a spinlock.
This means we don't need to use a workqueue, we can use a simple timer
instead.
Rather than taking the lock in the timer callback, which would require
using _bh locking, simply test a flag and wake an nfsd thread. That
thread checks the flag and ages the lists when needed.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 43 ++++++++++++++++++++++++-------------------
1 file changed, 24 insertions(+), 19 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 7264faa57280..eb95a53f806f 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -67,10 +67,12 @@ struct nfsd_fcache_disposal {
struct list_head older; /* haven't been used in last 0-2 seconds */
struct list_head freeme; /* ready to be discarded */
unsigned long num_gc; /* Approximate size of recent plus older */
- struct delayed_work filecache_laundrette;
+ struct timer_list timer;
struct shrinker *file_shrinker;
struct nfsd_net *nn;
+ unsigned long flags;
};
+#define NFSD_FCACHE_DO_AGE BIT(0) /* time to age the lists */
static struct kmem_cache *nfsd_file_slab;
static struct kmem_cache *nfsd_file_mark_slab;
@@ -115,8 +117,8 @@ static const struct rhashtable_params nfsd_file_rhash_params = {
static void
nfsd_file_schedule_laundrette(struct nfsd_fcache_disposal *l)
{
- queue_delayed_work(system_unbound_wq, &l->filecache_laundrette,
- NFSD_LAUNDRETTE_DELAY);
+ if (!timer_pending(&l->timer))
+ mod_timer(&l->timer, jiffies + NFSD_LAUNDRETTE_DELAY);
}
static void
@@ -521,6 +523,19 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
{
struct nfsd_fcache_disposal *l = nn->fcache_disposal;
+ if (test_and_clear_bit(NFSD_FCACHE_DO_AGE, &l->flags)) {
+ spin_lock(&l->lock);
+ list_splice_init(&l->older, &l->freeme);
+ list_splice_init(&l->recent, &l->older);
+ /* We don't know how many were moved to 'freeme' and don't want
+ * to waste time counting - guess a half. This is only used
+ * for the shrinker which doesn't need complete precision.
+ */
+ l->num_gc /= 2;
+ if (!list_empty(&l->older) || !list_empty(&l->recent))
+ mod_timer(&l->timer, jiffies + NFSD_LAUNDRETTE_DELAY);
+ spin_unlock(&l->lock);
+ }
if (!list_empty(&l->freeme)) {
LIST_HEAD(dispose);
int i;
@@ -557,23 +572,13 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
}
static void
-nfsd_file_gc_worker(struct work_struct *work)
+nfsd_file_gc_worker(struct timer_list *t)
{
struct nfsd_fcache_disposal *l = container_of(
- work, struct nfsd_fcache_disposal, filecache_laundrette.work);
+ t, struct nfsd_fcache_disposal, timer);
- spin_lock(&l->lock);
- list_splice_init(&l->older, &l->freeme);
- list_splice_init(&l->recent, &l->older);
- /* We don't know how many were moved to 'freeme' and don't want
- * to waste time counting - guess a half.
- */
- l->num_gc /= 2;
- if (!list_empty(&l->freeme))
- svc_wake_up(l->nn->nfsd_serv);
- if (!list_empty(&l->older) || !list_empty(&l->recent))
- nfsd_file_schedule_laundrette(l);
- spin_unlock(&l->lock);
+ set_bit(NFSD_FCACHE_DO_AGE, &l->flags);
+ svc_wake_up(l->nn->nfsd_serv);
}
static unsigned long
@@ -868,7 +873,7 @@ nfsd_alloc_fcache_disposal(void)
if (!l)
return NULL;
spin_lock_init(&l->lock);
- INIT_DELAYED_WORK(&l->filecache_laundrette, nfsd_file_gc_worker);
+ timer_setup(&l->timer, nfsd_file_gc_worker, 0);
INIT_LIST_HEAD(&l->recent);
INIT_LIST_HEAD(&l->older);
INIT_LIST_HEAD(&l->freeme);
@@ -891,7 +896,7 @@ nfsd_alloc_fcache_disposal(void)
static void
nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
{
- cancel_delayed_work_sync(&l->filecache_laundrette);
+ del_timer_sync(&l->timer);
shrinker_free(l->file_shrinker);
nfsd_file_release_list(&l->recent);
WARN_ON_ONCE(!list_empty(&l->recent));
--
2.47.1
* [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name.
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
` (5 preceding siblings ...)
2025-01-27 1:20 ` [PATCH 6/7] nfsd: filecache: change garbage collection to a timer NeilBrown
@ 2025-01-27 1:20 ` NeilBrown
2025-01-27 14:29 ` Chuck Lever
2025-01-27 14:40 ` Jeff Layton
2025-01-28 6:37 ` [PATCH 0/7] nfsd: filecache: change garbage collection lists Dave Chinner
2025-02-05 23:04 ` NeilBrown
8 siblings, 2 replies; 23+ messages in thread
From: NeilBrown @ 2025-01-27 1:20 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
There are at least three locks in the kernel which are initialised as
spin_lock_init(&l->lock);
This makes them hard to differentiate in /proc/lock_stat.
For the lock in nfsd/filecache.c introduce a variable with a more
descriptive name so we can:
spin_lock_init(&nfsd_fcache_disposal->lock);
and easily identify this in /proc/lock_stat.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 5 +++--
1 file changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index eb95a53f806f..af95bc381753 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -867,12 +867,13 @@ __nfsd_file_cache_purge(struct net *net)
static struct nfsd_fcache_disposal *
nfsd_alloc_fcache_disposal(void)
{
- struct nfsd_fcache_disposal *l;
+ struct nfsd_fcache_disposal *l, *nfsd_fcache_disposal;
l = kmalloc(sizeof(*l), GFP_KERNEL);
if (!l)
return NULL;
- spin_lock_init(&l->lock);
+ nfsd_fcache_disposal = l;
+ spin_lock_init(&nfsd_fcache_disposal->lock);
timer_setup(&l->timer, nfsd_file_gc_worker, 0);
INIT_LIST_HEAD(&l->recent);
INIT_LIST_HEAD(&l->older);
--
2.47.1
* Re: [PATCH 1/7] nfsd: filecache: remove race handling.
2025-01-27 1:20 ` [PATCH 1/7] nfsd: filecache: remove race handling NeilBrown
@ 2025-01-27 13:42 ` Jeff Layton
0 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2025-01-27 13:42 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Mon, 2025-01-27 at 12:20 +1100, NeilBrown wrote:
> The race that this code tries to protect against is not interesting.
> The code is problematic as we access the "nf" after we have given our
> reference to the lru system. While that takes 2+ seconds to free things
> it is still poor form.
>
> The only interesting race I can find would be with
> nfsd_file_close_inode_sync();
> This is the only place that really doesn't want the file to stay on the
> LRU when unhashed (which is the direct consequence of the race).
>
> However for the race to happen, some other thread must own a reference
> to a file and be putting in while nfsd_file_close_inode_sync() is trying
> to close all files for an inode. If this is possible, that other thread
> could simply call nfsd_file_put() a little bit later and the result
> would be the same: not all files are closed when
> nfsd_file_close_inode_sync() completes.
>
> If this was really a problem, we would need to wait in close_inode_sync
> for the other references to be dropped. We probably don't want to do
> that.
>
> So it is best to simply remove this code.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 16 +++-------------
> 1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index e342b2e76882..f5a92ac3f16f 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -371,20 +371,10 @@ nfsd_file_put(struct nfsd_file *nf)
>
> /* Try to add it to the LRU. If that fails, decrement. */
> if (nfsd_file_lru_add(nf)) {
> - /* If it's still hashed, we're done */
> - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> - nfsd_file_schedule_laundrette();
> - return;
> - }
> -
> - /*
> - * We're racing with unhashing, so try to remove it from
> - * the LRU. If removal fails, then someone else already
> - * has our reference.
> - */
> - if (!nfsd_file_lru_remove(nf))
> - return;
> + nfsd_file_schedule_laundrette();
> + return;
> }
> +
> }
> if (refcount_dec_and_test(&nf->nf_ref))
> nfsd_file_free(nf);
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 4/7] nfsd: filecache: change garbage collection list management.
2025-01-27 1:20 ` [PATCH 4/7] nfsd: filecache: change garbage collection list management NeilBrown
@ 2025-01-27 14:15 ` Jeff Layton
0 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2025-01-27 14:15 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Mon, 2025-01-27 at 12:20 +1100, NeilBrown wrote:
> The nfsd filecache currently uses list_lru for tracking files recently
> used in NFSv3 requests which need to be "garbage collected" when they
> have become idle - unused for 2-4 seconds.
>
> I do not believe list_lru is a good tool for this. It does not allow the
> timeout which filecache requires so we have to add a timeout mechanism
> which holds the list_lru lock while the whole list is scanned looking for
> entries that haven't been recently accessed. When the list is largish
> (even a few hundred) this can block new requests which need the lock to
> remove a file to access it.
>
> This patch removes the list_lru and instead uses 2 simple linked lists.
> When a file is accessed it is removed from whichever list it is on,
> then added to the tail of the first list. Every 2 seconds the second
> list is moved to the "freeme" list and the first list is moved to the
> second list. This avoids any need to walk a list to find old entries.
>
> Previously a file would be unhashed before being moved to the freeme
> list. We don't do that any more. The freeme list is much like the
> other two lists (recent and older) in that they all hold a reference to
> the file and the file is still hashed. When the nfsd thread processes
> the freeme list it now uses the new nfsd_file_release_list() which uses
> nfsd_file_cond_queue() to unhash and drop the refcount.
>
> We no longer have a precise count of the size of the lru (recent +
> older) as we don't know how big "older" is when it is moved to "freeme".
> However the shrinker can cope with an approximation. So we keep a
> count of number in the lru and when "older" is moved to "freeme" we
> divide that count by 2. When we remove anything from the lru we
> decrement that counter but ensure it never goes negative. Naturally
> when we add to the lru we increase the counter.
>
> For the filecache stats file, which assumes a global lru, we keep a
> separate counter which includes all files in all netns in recent or
> older or freeme.
>
> We discard the nf_gc linkage in an nfsd_file and only use nf_lru.
> We discard NFSD_FILE_REFERENCED.
>
> This patch drops the nfsd_file_gc_removed() trace point. I couldn't
> think of useful information to provide.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 218 +++++++++++++++++++++-----------------------
> fs/nfsd/filecache.h | 4 +-
> fs/nfsd/trace.h | 4 -
> 3 files changed, 107 insertions(+), 119 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 55f69bcde500..1e90da507152 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -34,7 +34,6 @@
> #include <linux/file.h>
> #include <linux/pagemap.h>
> #include <linux/sched.h>
> -#include <linux/list_lru.h>
> #include <linux/fsnotify_backend.h>
> #include <linux/fsnotify.h>
> #include <linux/seq_file.h>
> @@ -64,10 +63,13 @@ static DEFINE_PER_CPU(unsigned long, nfsd_file_lru_total);
>
> struct nfsd_fcache_disposal {
> spinlock_t lock;
> - struct list_lru file_lru;
> - struct list_head freeme;
> + struct list_head recent; /* have been used in last 0-2 seconds */
> + struct list_head older; /* haven't been used in last 0-2 seconds */
> + struct list_head freeme; /* ready to be discarded */
> + unsigned long num_gc; /* Approximate size of recent plus older */
It looks like the above fields should cover one cacheline on x86_64.
Given that you're planning to change these fields to be manipulated
directly in IRQ context, maybe we should ensure that those fields are
all part of the same cacheline?
> struct delayed_work filecache_laundrette;
> struct shrinker *file_shrinker;
> + struct nfsd_net *nn;
> };
>
> static struct kmem_cache *nfsd_file_slab;
> @@ -227,7 +229,6 @@ nfsd_file_alloc(struct net *net, struct inode *inode, unsigned char need,
>
> this_cpu_inc(nfsd_file_allocations);
> INIT_LIST_HEAD(&nf->nf_lru);
> - INIT_LIST_HEAD(&nf->nf_gc);
> nf->nf_birthtime = ktime_get();
> nf->nf_file = NULL;
> nf->nf_cred = get_current_cred();
> @@ -332,12 +333,16 @@ static bool nfsd_file_lru_add(struct nfsd_file *nf)
> struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
> struct nfsd_fcache_disposal *l = nn->fcache_disposal;
>
> - set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> - if (list_lru_add_obj(&l->file_lru, &nf->nf_lru)) {
> + spin_lock(&l->lock);
> + if (list_empty(&nf->nf_lru)) {
> + list_add_tail(&nf->nf_lru, &l->recent);
> + l->num_gc += 1;
> this_cpu_inc(nfsd_file_lru_total);
> trace_nfsd_file_lru_add(nf);
> + spin_unlock(&l->lock);
> return true;
> }
> + spin_unlock(&l->lock);
> return false;
> }
>
> @@ -346,11 +351,17 @@ static bool nfsd_file_lru_remove(struct nfsd_file *nf)
> struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
> struct nfsd_fcache_disposal *l = nn->fcache_disposal;
>
> - if (list_lru_del_obj(&l->file_lru, &nf->nf_lru)) {
> + spin_lock(&l->lock);
> + if (!list_empty(&nf->nf_lru)) {
> + list_del_init(&nf->nf_lru);
> this_cpu_dec(nfsd_file_lru_total);
> + if (l->num_gc > 0)
> + l->num_gc -= 1;
> trace_nfsd_file_lru_del(nf);
> + spin_unlock(&l->lock);
> return true;
> }
> + spin_unlock(&l->lock);
> return false;
> }
>
> @@ -430,12 +441,26 @@ nfsd_file_dispose_list(struct list_head *dispose)
> struct nfsd_file *nf;
>
> while (!list_empty(dispose)) {
> - nf = list_first_entry(dispose, struct nfsd_file, nf_gc);
> - list_del_init(&nf->nf_gc);
> + nf = list_first_entry(dispose, struct nfsd_file, nf_lru);
> + list_del_init(&nf->nf_lru);
> nfsd_file_free(nf);
> }
> }
>
> +static void
> +nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose);
> +
> +static void
> +nfsd_file_release_list(struct list_head *dispose)
> +{
> + LIST_HEAD(dispose2);
> + struct nfsd_file *nf, *nf2;
> +
> + list_for_each_entry_safe(nf, nf2, dispose, nf_lru)
> + nfsd_file_cond_queue(nf, &dispose2);
> + nfsd_file_dispose_list(&dispose2);
> +}
> +
> /**
> * nfsd_file_dispose_list_delayed - move list of dead files to net's freeme list
> * @dispose: list of nfsd_files to be disposed
> @@ -448,13 +473,13 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> {
> while(!list_empty(dispose)) {
> struct nfsd_file *nf = list_first_entry(dispose,
> - struct nfsd_file, nf_gc);
> + struct nfsd_file, nf_lru);
> struct nfsd_net *nn = net_generic(nf->nf_net, nfsd_net_id);
> struct nfsd_fcache_disposal *l = nn->fcache_disposal;
> struct svc_serv *serv;
>
> spin_lock(&l->lock);
> - list_move_tail(&nf->nf_gc, &l->freeme);
> + list_move_tail(&nf->nf_lru, &l->freeme);
> spin_unlock(&l->lock);
>
> /*
> @@ -486,90 +511,32 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
> int i;
>
> spin_lock(&l->lock);
> - for (i = 0; i < 8 && !list_empty(&l->freeme); i++)
> - list_move(l->freeme.next, &dispose);
> + for (i = 0; i < 8 && !list_empty(&l->freeme); i++) {
> + struct nfsd_file *nf = list_first_entry(
> + &l->freeme, struct nfsd_file, nf_lru);
> +
> + /*
> + * Don't throw out files that are still
> + * undergoing I/O or that have uncleared errors
> + * pending.
> + */
> + if (nfsd_file_check_writeback(nf)) {
> + trace_nfsd_file_gc_writeback(nf);
> + list_move(&nf->nf_lru, &l->recent);
> + l->num_gc += 1;
> + } else {
> + trace_nfsd_file_gc_disposed(nf);
> + list_move(&nf->nf_lru, &dispose);
> + this_cpu_inc(nfsd_file_evictions);
> + }
> + }
> spin_unlock(&l->lock);
> if (!list_empty(&l->freeme))
> /* Wake up another thread to share the work
> * *before* doing any actual disposing.
> */
> svc_wake_up(nn->nfsd_serv);
> - nfsd_file_dispose_list(&dispose);
> - }
> -}
> -
> -/**
> - * nfsd_file_lru_cb - Examine an entry on the LRU list
> - * @item: LRU entry to examine
> - * @lru: controlling LRU
> - * @arg: dispose list
> - *
> - * Return values:
> - * %LRU_REMOVED: @item was removed from the LRU
> - * %LRU_ROTATE: @item is to be moved to the LRU tail
> - * %LRU_SKIP: @item cannot be evicted
> - */
> -static enum lru_status
> -nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> - void *arg)
> -{
> - struct list_head *head = arg;
> - struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> -
> - /* We should only be dealing with GC entries here */
> - WARN_ON_ONCE(!test_bit(NFSD_FILE_GC, &nf->nf_flags));
> -
> - /*
> - * Don't throw out files that are still undergoing I/O or
> - * that have uncleared errors pending.
> - */
> - if (nfsd_file_check_writeback(nf)) {
> - trace_nfsd_file_gc_writeback(nf);
> - return LRU_SKIP;
> - }
> -
> - /* If it was recently added to the list, skip it */
> - if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
> - trace_nfsd_file_gc_referenced(nf);
> - return LRU_ROTATE;
> - }
> -
> - /*
> - * Put the reference held on behalf of the LRU. If it wasn't the last
> - * one, then just remove it from the LRU and ignore it.
> - */
> - if (!refcount_dec_and_test(&nf->nf_ref)) {
> - trace_nfsd_file_gc_in_use(nf);
> - list_lru_isolate(lru, &nf->nf_lru);
> - this_cpu_dec(nfsd_file_lru_total);
> - return LRU_REMOVED;
> - }
> -
> - /* Refcount went to zero. Unhash it and queue it to the dispose list */
> - nfsd_file_unhash(nf);
> - list_lru_isolate(lru, &nf->nf_lru);
> - this_cpu_dec(nfsd_file_lru_total);
> - list_add(&nf->nf_gc, head);
> - this_cpu_inc(nfsd_file_evictions);
> - trace_nfsd_file_gc_disposed(nf);
> - return LRU_REMOVED;
> -}
> -
> -static void
> -nfsd_file_gc(struct nfsd_fcache_disposal *l)
> -{
> - unsigned long remaining = list_lru_count(&l->file_lru);
> - LIST_HEAD(dispose);
> - unsigned long ret;
> -
> - while (remaining > 0) {
> - unsigned long num_to_scan = min(remaining, NFSD_FILE_GC_BATCH);
> -
> - ret = list_lru_walk(&l->file_lru, nfsd_file_lru_cb,
> - &dispose, num_to_scan);
> - trace_nfsd_file_gc_removed(ret, list_lru_count(&l->file_lru));
> - nfsd_file_dispose_list_delayed(&dispose);
> - remaining -= num_to_scan;
> + nfsd_file_release_list(&dispose);
> }
> }
>
> @@ -578,9 +545,19 @@ nfsd_file_gc_worker(struct work_struct *work)
> {
> struct nfsd_fcache_disposal *l = container_of(
> work, struct nfsd_fcache_disposal, filecache_laundrette.work);
> - nfsd_file_gc(l);
> - if (list_lru_count(&l->file_lru))
> +
> + spin_lock(&l->lock);
> + list_splice_init(&l->older, &l->freeme);
> + list_splice_init(&l->recent, &l->older);
> + /* We don't know how many were moved to 'freeme' and don't want
> + * to waste time counting - guess a half.
> + */
> + l->num_gc /= 2;
> + if (!list_empty(&l->freeme))
> + svc_wake_up(l->nn->nfsd_serv);
> + if (!list_empty(&l->older) || !list_empty(&l->recent))
> nfsd_file_schedule_laundrette(l);
> + spin_unlock(&l->lock);
> }
>
> static unsigned long
> @@ -588,22 +565,40 @@ nfsd_file_lru_count(struct shrinker *s, struct shrink_control *sc)
> {
> struct nfsd_fcache_disposal *l = s->private_data;
>
> - return list_lru_count(&l->file_lru);
> + return l->num_gc;
> }
>
> static unsigned long
> nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
> {
> struct nfsd_fcache_disposal *l = s->private_data;
> + struct nfsd_file *nf;
> + int scanned = 0;
> +
> + spin_lock(&l->lock);
> + while (scanned < sc->nr_to_scan &&
> + (nf = list_first_entry_or_null(&l->older,
> + struct nfsd_file, nf_lru)) != NULL) {
> + list_del_init(&nf->nf_lru);
> + list_add_tail(&nf->nf_lru, &l->freeme);
> + if (l->num_gc > 0)
> + l->num_gc -= 1;
> + scanned += 1;
> + }
> + if (scanned > 0)
> + svc_wake_up(l->nn->nfsd_serv);
>
> - LIST_HEAD(dispose);
> - unsigned long ret;
> + trace_nfsd_file_shrinker_removed(scanned, l->num_gc);
>
> - ret = list_lru_shrink_walk(&l->file_lru, sc,
> - nfsd_file_lru_cb, &dispose);
> - trace_nfsd_file_shrinker_removed(ret, list_lru_count(&l->file_lru));
> - nfsd_file_dispose_list_delayed(&dispose);
> - return ret;
> + while (scanned < sc->nr_to_scan &&
> + (nf = list_first_entry_or_null(&l->recent,
> + struct nfsd_file, nf_lru)) != NULL) {
> + list_del_init(&nf->nf_lru);
> + list_add_tail(&nf->nf_lru, &l->older);
> + scanned += 1;
> + }
> + spin_unlock(&l->lock);
> + return scanned;
> }
>
> /**
> @@ -616,7 +611,6 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
> */
> static void
> nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose)
> - __must_hold(RCU)
> {
> int decrement = 1;
>
> @@ -634,7 +628,7 @@ nfsd_file_cond_queue(struct nfsd_file *nf, struct list_head *dispose)
>
> /* If refcount goes to 0, then put on the dispose list */
> if (refcount_sub_and_test(decrement, &nf->nf_ref)) {
> - list_add(&nf->nf_gc, dispose);
> + list_add(&nf->nf_lru, dispose);
> trace_nfsd_file_closing(nf);
> }
> }
> @@ -859,7 +853,10 @@ nfsd_alloc_fcache_disposal(void)
> return NULL;
> spin_lock_init(&l->lock);
> INIT_DELAYED_WORK(&l->filecache_laundrette, nfsd_file_gc_worker);
> + INIT_LIST_HEAD(&l->recent);
> + INIT_LIST_HEAD(&l->older);
> INIT_LIST_HEAD(&l->freeme);
> + l->num_gc = 0;
> l->file_shrinker = shrinker_alloc(0, "nfsd-filecache");
> if (!l->file_shrinker) {
> pr_err("nfsd: failed to allocate nfsd_file_shrinker\n");
> @@ -871,15 +868,6 @@ nfsd_alloc_fcache_disposal(void)
> l->file_shrinker->seeks = 1;
> l->file_shrinker->private_data = l;
>
> - /* if file_lru is not zeroed it can trigger a bug: ->key is the problem */
> - memset(&l->file_lru, 0, sizeof(l->file_lru));
> - if (list_lru_init(&l->file_lru)) {
> - pr_err("nfsd: failed to init nfsd_file_lru\n");
> - shrinker_free(l->file_shrinker);
> - kfree(l);
> - return NULL;
> - }
> -
> shrinker_register(l->file_shrinker);
> return l;
> }
> @@ -889,8 +877,12 @@ nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
> {
> cancel_delayed_work_sync(&l->filecache_laundrette);
> shrinker_free(l->file_shrinker);
> - list_lru_destroy(&l->file_lru);
> - nfsd_file_dispose_list(&l->freeme);
> + nfsd_file_release_list(&l->recent);
> + WARN_ON_ONCE(!list_empty(&l->recent));
> + nfsd_file_release_list(&l->older);
> + WARN_ON_ONCE(!list_empty(&l->older));
> + nfsd_file_release_list(&l->freeme);
> + WARN_ON_ONCE(!list_empty(&l->freeme));
> kfree(l);
> }
>
> @@ -909,6 +901,8 @@ nfsd_file_cache_start_net(struct net *net)
> struct nfsd_net *nn = net_generic(net, nfsd_net_id);
>
> nn->fcache_disposal = nfsd_alloc_fcache_disposal();
> + if (nn->fcache_disposal)
> + nn->fcache_disposal->nn = nn;
> return nn->fcache_disposal ? 0 : -ENOMEM;
> }
>
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index a09a851d510e..02059b6c5e9a 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -42,15 +42,13 @@ struct nfsd_file {
> struct net *nf_net;
> #define NFSD_FILE_HASHED (0)
> #define NFSD_FILE_PENDING (1)
> -#define NFSD_FILE_REFERENCED (2)
> -#define NFSD_FILE_GC (3)
> +#define NFSD_FILE_GC (2)
> unsigned long nf_flags;
> refcount_t nf_ref;
> unsigned char nf_may;
>
> struct nfsd_file_mark *nf_mark;
> struct list_head nf_lru;
> - struct list_head nf_gc;
> struct rcu_head nf_rcu;
> ktime_t nf_birthtime;
> };
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index ad2c0c432d08..efa683541ed5 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -1038,7 +1038,6 @@ DEFINE_CLID_EVENT(confirmed_r);
> __print_flags(val, "|", \
> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> - { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> { 1 << NFSD_FILE_GC, "GC" })
>
> DECLARE_EVENT_CLASS(nfsd_file_class,
> @@ -1314,9 +1313,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_add);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_add_disposed);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> -DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> -DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>
> DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> @@ -1345,7 +1342,6 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> ), \
> TP_ARGS(removed, remaining))
>
> -DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>
> TRACE_EVENT(nfsd_file_close,
This new method does seem a lot cleaner.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name.
2025-01-27 1:20 ` [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name NeilBrown
@ 2025-01-27 14:29 ` Chuck Lever
2025-01-27 14:40 ` Jeff Layton
1 sibling, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2025-01-27 14:29 UTC (permalink / raw)
To: NeilBrown, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On 1/26/25 8:20 PM, NeilBrown wrote:
> There are at least three locks in the kernel which are initialised as
>
> spin_lock_init(&l->lock);
>
> This makes them hard to differentiate in /proc/lock_stat.
>
> For the lock in nfsd/filecache.c introduce a variable with a more
> descriptive name so we can:
>
> spin_lock_init(&nfsd_fcache_disposal->lock);
>
> and easily identify this in /proc/lock_stat.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index eb95a53f806f..af95bc381753 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -867,12 +867,13 @@ __nfsd_file_cache_purge(struct net *net)
> static struct nfsd_fcache_disposal *
> nfsd_alloc_fcache_disposal(void)
> {
> - struct nfsd_fcache_disposal *l;
> + struct nfsd_fcache_disposal *l, *nfsd_fcache_disposal;
>
> l = kmalloc(sizeof(*l), GFP_KERNEL);
> if (!l)
> return NULL;
> - spin_lock_init(&l->lock);
> + nfsd_fcache_disposal = l;
> + spin_lock_init(&nfsd_fcache_disposal->lock);
> timer_setup(&l->timer, nfsd_file_gc_worker, 0);
> INIT_LIST_HEAD(&l->recent);
> INIT_LIST_HEAD(&l->older);
My concern was there would be multiple dynamically allocated locks
in order to make the locking fine-grained, but I guess that was a
mistaken assumption.
No objection to this approach.
--
Chuck Lever
* Re: [PATCH 6/7] nfsd: filecache: change garbage collection to a timer.
2025-01-27 1:20 ` [PATCH 6/7] nfsd: filecache: change garbage collection to a timer NeilBrown
@ 2025-01-27 14:39 ` Jeff Layton
0 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2025-01-27 14:39 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Mon, 2025-01-27 at 12:20 +1100, NeilBrown wrote:
> garbage collection never sleeps and no longer walks a list so it runs
> quickly only requiring a spinlock.
>
> This means we don't need to use a workqueue, we can use a simple timer
> instead.
>
> Rather than taking the lock in the timer callback, which would require
> using _bh locking, simply test a flag and wake an nfsd thread. That
> thread checks the flag and ages the lists when needed.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 43 ++++++++++++++++++++++++-------------------
> 1 file changed, 24 insertions(+), 19 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 7264faa57280..eb95a53f806f 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -67,10 +67,12 @@ struct nfsd_fcache_disposal {
> struct list_head older; /* haven't been used in last 0-2 seconds */
> struct list_head freeme; /* ready to be discarded */
> unsigned long num_gc; /* Approximate size of recent plus older */
Dang, adding flags would push this into a second cacheline.
> - struct delayed_work filecache_laundrette;
> + struct timer_list timer;
> struct shrinker *file_shrinker;
> struct nfsd_net *nn;
> + unsigned long flags;
> };
> +#define NFSD_FCACHE_DO_AGE BIT(0) /* time to age the lists */
>
> static struct kmem_cache *nfsd_file_slab;
> static struct kmem_cache *nfsd_file_mark_slab;
> @@ -115,8 +117,8 @@ static const struct rhashtable_params nfsd_file_rhash_params = {
> static void
> nfsd_file_schedule_laundrette(struct nfsd_fcache_disposal *l)
> {
> - queue_delayed_work(system_unbound_wq, &l->filecache_laundrette,
> - NFSD_LAUNDRETTE_DELAY);
> + if (!timer_pending(&l->timer))
> + mod_timer(&l->timer, jiffies + NFSD_LAUNDRETTE_DELAY);
> }
>
> static void
> @@ -521,6 +523,19 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
> {
> struct nfsd_fcache_disposal *l = nn->fcache_disposal;
>
> + if (test_and_clear_bit(NFSD_FCACHE_DO_AGE, &l->flags)) {
> + spin_lock(&l->lock);
> + list_splice_init(&l->older, &l->freeme);
> + list_splice_init(&l->recent, &l->older);
> + /* We don't know how many were moved to 'freeme' and don't want
> + * to waste time counting - guess a half. This is only used
> + * for the shrinker which doesn't need complete precision.
> + */
> + l->num_gc /= 2;
> + if (!list_empty(&l->older) || !list_empty(&l->recent))
> + mod_timer(&l->timer, jiffies + NFSD_LAUNDRETTE_DELAY);
> + spin_unlock(&l->lock);
> + }
> if (!list_empty(&l->freeme)) {
> LIST_HEAD(dispose);
> int i;
> @@ -557,23 +572,13 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
> }
>
> static void
> -nfsd_file_gc_worker(struct work_struct *work)
> +nfsd_file_gc_worker(struct timer_list *t)
> {
> struct nfsd_fcache_disposal *l = container_of(
> - work, struct nfsd_fcache_disposal, filecache_laundrette.work);
> + t, struct nfsd_fcache_disposal, timer);
>
> - spin_lock(&l->lock);
> - list_splice_init(&l->older, &l->freeme);
> - list_splice_init(&l->recent, &l->older);
> - /* We don't know how many were moved to 'freeme' and don't want
> - * to waste time counting - guess a half.
> - */
> - l->num_gc /= 2;
> - if (!list_empty(&l->freeme))
> - svc_wake_up(l->nn->nfsd_serv);
> - if (!list_empty(&l->older) || !list_empty(&l->recent))
> - nfsd_file_schedule_laundrette(l);
> - spin_unlock(&l->lock);
> + set_bit(NFSD_FCACHE_DO_AGE, &l->flags);
> + svc_wake_up(l->nn->nfsd_serv);
Disregard my earlier comment about the cacheline. It still wouldn't
hurt to do, but since you're not actually taking the lock inside the
timer callback in this version, it's not as big a deal.
This seems better.
> }
>
> static unsigned long
> @@ -868,7 +873,7 @@ nfsd_alloc_fcache_disposal(void)
> if (!l)
> return NULL;
> spin_lock_init(&l->lock);
> - INIT_DELAYED_WORK(&l->filecache_laundrette, nfsd_file_gc_worker);
> + timer_setup(&l->timer, nfsd_file_gc_worker, 0);
> INIT_LIST_HEAD(&l->recent);
> INIT_LIST_HEAD(&l->older);
> INIT_LIST_HEAD(&l->freeme);
> @@ -891,7 +896,7 @@ nfsd_alloc_fcache_disposal(void)
> static void
> nfsd_free_fcache_disposal(struct nfsd_fcache_disposal *l)
> {
> - cancel_delayed_work_sync(&l->filecache_laundrette);
> + del_timer_sync(&l->timer);
> shrinker_free(l->file_shrinker);
> nfsd_file_release_list(&l->recent);
> WARN_ON_ONCE(!list_empty(&l->recent));
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 5/7] nfsd: filecache: document the arbitrary limit on file-disposes-per-loop
2025-01-27 1:20 ` [PATCH 5/7] nfsd: filecache: document the arbitrary limit on file-disposes-per-loop NeilBrown
@ 2025-01-27 14:40 ` Jeff Layton
0 siblings, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2025-01-27 14:40 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Mon, 2025-01-27 at 12:20 +1100, NeilBrown wrote:
> Rather than having the bare number "8", use a named constant and explain
> the tradeoffs that led to the choice.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 18 +++++++++++++++++-
> 1 file changed, 17 insertions(+), 1 deletion(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 1e90da507152..7264faa57280 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -493,6 +493,21 @@ nfsd_file_dispose_list_delayed(struct list_head *dispose)
> }
> }
>
> +/*
> + * Disposing of files can involve non-trivial work, and they
> + * can appear in batches. So we don't want to try handling them
> + * all in one thread - if there are lots it would be better to allow
> + * several nfsd threads to handle them in parallel.
> + * On average one RPC request can create at most 1 file to be disposed
> + * so handling one each time around the nfsd loop should keep the list
> + * under control. However there are often benefits of batching so
> + * 2 at a time will likely be more efficient than 1. 4 more so.
> + * We need to choose a number which will often handle all the files,
> + * but will allow other threads to help when the list gets long.
> + * The current choice is:
> + */
> +#define NFSD_FILE_DISPOSE_BATCH 8
> +
> /**
> * nfsd_file_net_dispose - deal with nfsd_files waiting to be disposed.
> * @nn: nfsd_net in which to find files to be disposed.
> @@ -511,7 +526,8 @@ void nfsd_file_net_dispose(struct nfsd_net *nn)
> int i;
>
> spin_lock(&l->lock);
> - for (i = 0; i < 8 && !list_empty(&l->freeme); i++) {
> + for (i = 0; i < NFSD_FILE_DISPOSE_BATCH &&
> + !list_empty(&l->freeme); i++) {
> struct nfsd_file *nf = list_first_entry(
> &l->freeme, struct nfsd_file, nf_lru);
>
Thanks for doing this.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name.
2025-01-27 1:20 ` [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name NeilBrown
2025-01-27 14:29 ` Chuck Lever
@ 2025-01-27 14:40 ` Jeff Layton
1 sibling, 0 replies; 23+ messages in thread
From: Jeff Layton @ 2025-01-27 14:40 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Mon, 2025-01-27 at 12:20 +1100, NeilBrown wrote:
> There are at least three locks in the kernel which are initialised as
>
> spin_lock_init(&l->lock);
>
> This makes them hard to differentiate in /proc/lock_stat.
>
> For the lock in nfsd/filecache.c introduce a variable with a more
> descriptive name so we can:
>
> spin_lock_init(&nfsd_fcache_disposal->lock);
>
> and easily identify this in /proc/lock_stat.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 5 +++--
> 1 file changed, 3 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index eb95a53f806f..af95bc381753 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -867,12 +867,13 @@ __nfsd_file_cache_purge(struct net *net)
> static struct nfsd_fcache_disposal *
> nfsd_alloc_fcache_disposal(void)
> {
> - struct nfsd_fcache_disposal *l;
> + struct nfsd_fcache_disposal *l, *nfsd_fcache_disposal;
>
> l = kmalloc(sizeof(*l), GFP_KERNEL);
> if (!l)
> return NULL;
> - spin_lock_init(&l->lock);
> + nfsd_fcache_disposal = l;
> + spin_lock_init(&nfsd_fcache_disposal->lock);
> timer_setup(&l->timer, nfsd_file_gc_worker, 0);
> INIT_LIST_HEAD(&l->recent);
> INIT_LIST_HEAD(&l->older);
Weird, but ok.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
` (6 preceding siblings ...)
2025-01-27 1:20 ` [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name NeilBrown
@ 2025-01-28 6:37 ` Dave Chinner
2025-01-28 14:27 ` Chuck Lever
2025-01-29 21:34 ` NeilBrown
2025-02-05 23:04 ` NeilBrown
8 siblings, 2 replies; 23+ messages in thread
From: Dave Chinner @ 2025-01-28 6:37 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> [
> davec added to cc in case I've said something incorrect about list_lru
>
> Changes in this version:
> - no _bh locking
> - add name for a magic constant
> - remove unnecessary race-handling code
> - give a more meaningful name for a lock for /proc/lock_stat
> - minor cleanups suggested by Jeff
>
> ]
>
> The nfsd filecache currently uses list_lru for tracking files recently
> used in NFSv3 requests which need to be "garbage collected" when they
> have become idle - unused for 2-4 seconds.
>
> I do not believe list_lru is a good tool for this. It does not allow
> the timeout which filecache requires so we have to add a timeout
> mechanism which holds the list_lru lock while the whole list is scanned
> looking for entries that haven't been recently accessed. When the list
> is largish (even a few hundred) this can block new requests noticeably
> which need the lock to remove a file to access it.
Looks entirely like a trivial implementation bug in how the list_lru
is walked in nfsd_file_gc().
static void
nfsd_file_gc(void)
{
LIST_HEAD(dispose);
unsigned long ret;
ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
&dispose, list_lru_count(&nfsd_file_lru));
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
nfsd_file_dispose_list_delayed(&dispose);
}
i.e. the list_lru_walk() has been told to walk the entire list in a
single lock hold if nothing blocks it.
We've known this for a long, long time, and it's something we've
handled for a long time with shrinkers, too. Here's the typical way
of doing a full list aging and GC pass in one go without excessively
long lock holds:
{
long nr_to_scan = list_lru_count(&nfsd_file_lru);
LIST_HEAD(dispose);
while (nr_to_scan > 0) {
long batch = min(nr_to_scan, 64);
list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
&dispose, batch);
if (list_empty(&dispose))
break;
dispose_list(&dispose);
nr_to_scan -= batch;
}
}
And we don't need two lists to separate recently referenced vs
gc candidates because we have a referenced bit in the nf->nf_flags.
i.e. nfsd_file_lru_cb() does:
nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
void *arg)
{
....
/* If it was recently added to the list, skip it */
if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
trace_nfsd_file_gc_referenced(nf);
return LRU_ROTATE;
}
.....
Which moves recently referenced entries to the far end of the list,
resulting in all the reclaimable objects congregating at the end of
the list that is walked first by list_lru_walk().
IOWs, a batched walk like above resumes the walk exactly where it
left off, because it is always either reclaiming or rotating the
object at the head of the list.
> This patch removes the list_lru and instead uses 2 simple linked lists.
> When a file is accessed it is removed from whichever list it is on,
> then added to the tail of the first list. Every 2 seconds the second
> list is moved to the "freeme" list and the first list is moved to the
> second list. This avoids any need to walk a list to find old entries.
Yup, that's exactly what the current code does via the laundrette
> work that schedules nfsd_file_gc() to run every two seconds.
> These lists are per-netns rather than global as the freeme list is
> per-netns as the actual freeing is done in nfsd threads which are
> per-netns.
The list_lru is actually multiple lists - it is a per-numa node list
and so moving to global scope linked lists per netns is going to
reduce scalability and increase lock contention on large machines.
I also don't see any perf numbers, scalability analysis, latency
measurement, CPU profiles, etc showing the problems with using list_lru
for the GC function, nor any improvement this new code brings.
i.e. It's kinda hard to make any real comment on "I do not believe
list_lru is a good tool for this" when there is no actual
measurements provided to back the statement one way or the other...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-01-28 6:37 ` [PATCH 0/7] nfsd: filecache: change garbage collection lists Dave Chinner
@ 2025-01-28 14:27 ` Chuck Lever
2025-01-28 16:05 ` Chuck Lever
2025-01-29 21:34 ` NeilBrown
1 sibling, 1 reply; 23+ messages in thread
From: Chuck Lever @ 2025-01-28 14:27 UTC (permalink / raw)
To: Dave Chinner, NeilBrown
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey
On 1/28/25 1:37 AM, Dave Chinner wrote:
> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
>> [
>> davec added to cc in case I've said something incorrect about list_lru
>>
>> Changes in this version:
>> - no _bh locking
>> - add name for a magic constant
>> - remove unnecessary race-handling code
>> - give a more meaningful name for a lock for /proc/lock_stat
>> - minor cleanups suggested by Jeff
>>
>> ]
>>
>> The nfsd filecache currently uses list_lru for tracking files recently
>> used in NFSv3 requests which need to be "garbage collected" when they
>> have become idle - unused for 2-4 seconds.
>>
>> I do not believe list_lru is a good tool for this. It does not allow
>> the timeout which filecache requires so we have to add a timeout
>> mechanism which holds the list_lru lock while the whole list is scanned
>> looking for entries that haven't been recently accessed. When the list
>> is largish (even a few hundred) this can block new requests noticeably
>> which need the lock to remove a file to access it.
>
> Looks entirely like a trivial implementation bug in how the list_lru
> is walked in nfsd_file_gc().
>
> static void
> nfsd_file_gc(void)
> {
> LIST_HEAD(dispose);
> unsigned long ret;
>
> ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> &dispose, list_lru_count(&nfsd_file_lru));
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> nfsd_file_dispose_list_delayed(&dispose);
> }
>
> i.e. the list_lru_walk() has been told to walk the entire list in a
> single lock hold if nothing blocks it.
>
> We've known this for a long, long time, and it's something we've
> handled for a long time with shrinkers, too. Here's the typical way
> of doing a full list aging and GC pass in one go without excessively
> long lock holds:
>
> {
> long nr_to_scan = list_lru_count(&nfsd_file_lru);
> LIST_HEAD(dispose);
>
> while (nr_to_scan > 0) {
> long batch = min(nr_to_scan, 64);
>
> list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> &dispose, batch);
>
> if (list_empty(&dispose))
> break;
> dispose_list(&dispose);
> nr_to_scan -= batch;
> }
> }
The above is in fact similar to what we're planning to push first so
that it can be cleanly backported to LTS kernels:
https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=9caea737d2cdfe2d194e225c1924090c1d68c25f
> And we don't need two lists to separate recently referenced vs
> gc candidates because we have a referenced bit in the nf->nf_flags.
> i.e. nfsd_file_lru_cb() does:
>
> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> void *arg)
> {
> ....
> /* If it was recently added to the list, skip it */
> if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
> trace_nfsd_file_gc_referenced(nf);
> return LRU_ROTATE;
> }
> .....
>
> Which moves recently referenced entries to the far end of the list,
>> resulting in all the reclaimable objects congregating at the end of
> the list that is walked first by list_lru_walk().
My concern (which I haven't voiced yet) about having two lists is that
it will increase memory traffic over the current single atomic bit
operation.
> IOWs, a batched walk like above resumes the walk exactly where it
> left off, because it is always either reclaiming or rotating the
> object at the head of the list.
>
>> This patch removes the list_lru and instead uses 2 simple linked lists.
>> When a file is accessed it is removed from whichever list it is on,
>> then added to the tail of the first list. Every 2 seconds the second
>> list is moved to the "freeme" list and the first list is moved to the
>> second list. This avoids any need to walk a list to find old entries.
>
> Yup, that's exactly what the current code does via the laundrette
> work that schedules nfsd_file_gc() to run every two seconds.
>
>> These lists are per-netns rather than global as the freeme list is
>> per-netns as the actual freeing is done in nfsd threads which are
>> per-netns.
>
> The list_lru is actually multiple lists - it is a per-numa node list
> and so moving to global scope linked lists per netns is going to
> reduce scalability and increase lock contention on large machines.
>
> I also don't see any perf numbers, scalability analysis, latency
> measurement, CPU profiles, etc showing the problems with using list_lru
> for the GC function, nor any improvement this new code brings.
>
> i.e. It's kinda hard to make any real comment on "I do not believe
> list_lru is a good tool for this" when there is no actual
> measurements provided to back the statement one way or the other...
True, it would be good to get some comparative metrics; in particular
looking at spin lock contention and memory traffic.
--
Chuck Lever
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-01-28 14:27 ` Chuck Lever
@ 2025-01-28 16:05 ` Chuck Lever
0 siblings, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2025-01-28 16:05 UTC (permalink / raw)
To: Dave Chinner, NeilBrown
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey
On 1/28/25 9:27 AM, Chuck Lever wrote:
> On 1/28/25 1:37 AM, Dave Chinner wrote:
>> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
>>> [
>>> davec added to cc in case I've said something incorrect about list_lru
>>>
>>> Changes in this version:
>>> - no _bh locking
>>> - add name for a magic constant
>>> - remove unnecessary race-handling code
>>> - give a more meaningful name for a lock for /proc/lock_stat
>>> - minor cleanups suggested by Jeff
>>>
>>> ]
>>>
>>> The nfsd filecache currently uses list_lru for tracking files recently
>>> used in NFSv3 requests which need to be "garbage collected" when they
>>> have become idle - unused for 2-4 seconds.
>>>
>>> I do not believe list_lru is a good tool for this. It does not allow
>>> the timeout which filecache requires so we have to add a timeout
>>> mechanism which holds the list_lru lock while the whole list is scanned
>>> looking for entries that haven't been recently accessed. When the list
>>> is largish (even a few hundred) this can block new requests noticeably
>>> which need the lock to remove a file to access it.
>>
>> Looks entirely like a trivial implementation bug in how the list_lru
>> is walked in nfsd_file_gc().
>>
>> static void
>> nfsd_file_gc(void)
>> {
>> LIST_HEAD(dispose);
>> unsigned long ret;
>>
>> ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>> &dispose, list_lru_count(&nfsd_file_lru));
>> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>
>> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>> nfsd_file_dispose_list_delayed(&dispose);
>> }
>>
>> i.e. the list_lru_walk() has been told to walk the entire list in a
>> single lock hold if nothing blocks it.
>>
>> We've known this for a long, long time, and it's something we've
>> handled for a long time with shrinkers, too. Here's the typical way
>> of doing a full list aging and GC pass in one go without excessively
>> long lock holds:
>>
>> {
>> long nr_to_scan = list_lru_count(&nfsd_file_lru);
>> LIST_HEAD(dispose);
>>
>> while (nr_to_scan > 0) {
>> long batch = min(nr_to_scan, 64);
>>
>> list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
>> &dispose, batch);
>>
>> if (list_empty(&dispose))
>> break;
>> dispose_list(&dispose);
>> nr_to_scan -= batch;
>> }
>> }
>
> The above is in fact similar to what we're planning to push first so
> that it can be cleanly backported to LTS kernels:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/cel/linux.git/commit/?h=nfsd-testing&id=9caea737d2cdfe2d194e225c1924090c1d68c25f
I've rebased that branch. Here's a more permanent link:
https://lore.kernel.org/all/20250109142438.18689-2-cel@kernel.org/
But note that the batch size in the patch committed to my tree is 16
items, not 32.
>> And we don't need two lists to separate recently referenced vs
>> gc candidates because we have a referenced bit in the nf->nf_flags.
>> i.e. nfsd_file_lru_cb() does:
>>
>> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>> void *arg)
>> {
>> ....
>> /* If it was recently added to the list, skip it */
>> if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
>> trace_nfsd_file_gc_referenced(nf);
>> return LRU_ROTATE;
>> }
>> .....
>>
>> Which moves recently referenced entries to the far end of the list,
>> resulting in all the reclaimable objects congregating at the end of
>> the list that is walked first by list_lru_walk().
>
> My concern (which I haven't voiced yet) about having two lists is that
> it will increase memory traffic over the current single atomic bit
> operation.
>
>
>> IOWs, a batched walk like above resumes the walk exactly where it
>> left off, because it is always either reclaiming or rotating the
>> object at the head of the list.
>>
>>> This patch removes the list_lru and instead uses 2 simple linked lists.
>>> When a file is accessed it is removed from whichever list it is on,
>>> then added to the tail of the first list. Every 2 seconds the second
>>> list is moved to the "freeme" list and the first list is moved to the
>>> second list. This avoids any need to walk a list to find old entries.
>>
>> Yup, that's exactly what the current code does via the laundrette
>> work that schedules nfsd_file_gc() to run every two seconds.
>>
>>> These lists are per-netns rather than global as the freeme list is
>>> per-netns as the actual freeing is done in nfsd threads which are
>>> per-netns.
>>
>> The list_lru is actually multiple lists - it is a per-numa node list
>> and so moving to global scope linked lists per netns is going to
>> reduce scalability and increase lock contention on large machines.
>>
>> I also don't see any perf numbers, scalability analysis, latency
>> measurement, CPU profiles, etc showing the problems with using list_lru
>> for the GC function, nor any improvement this new code brings.
>>
>> i.e. It's kinda hard to make any real comment on "I do not believe
>> list_lru is a good tool for this" when there is no actual
>> measurements provided to back the statement one way or the other...
>
> True, it would be good to get some comparative metrics; in particular
> looking at spin lock contention and memory traffic.
>
>
--
Chuck Lever
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-01-28 6:37 ` [PATCH 0/7] nfsd: filecache: change garbage collection lists Dave Chinner
2025-01-28 14:27 ` Chuck Lever
@ 2025-01-29 21:34 ` NeilBrown
2025-02-06 2:21 ` Dave Chinner
1 sibling, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-01-29 21:34 UTC (permalink / raw)
To: Dave Chinner
Cc: Chuck Lever, Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Tue, 28 Jan 2025, Dave Chinner wrote:
> On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> > [
> > davec added to cc in case I've said something incorrect about list_lru
> >
> > Changes in this version:
> > - no _bh locking
> > - add name for a magic constant
> > - remove unnecessary race-handling code
> > - give a more meaningful name for a lock for /proc/lock_stat
> > - minor cleanups suggested by Jeff
> >
> > ]
> >
> > The nfsd filecache currently uses list_lru for tracking files recently
> > used in NFSv3 requests which need to be "garbage collected" when they
> > have become idle - unused for 2-4 seconds.
> >
> > I do not believe list_lru is a good tool for this. It does not allow
> > the timeout which filecache requires so we have to add a timeout
> > mechanism which holds the list_lru lock while the whole list is scanned
> > looking for entries that haven't been recently accessed. When the list
> > is largish (even a few hundred) this can block new requests noticeably
> > which need the lock to remove a file to access it.
>
> Looks entirely like a trivial implementation bug in how the list_lru
> is walked in nfsd_file_gc().
>
> static void
> nfsd_file_gc(void)
> {
> LIST_HEAD(dispose);
> unsigned long ret;
>
> ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> &dispose, list_lru_count(&nfsd_file_lru));
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>
> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> nfsd_file_dispose_list_delayed(&dispose);
> }
>
> i.e. the list_lru_walk() has been told to walk the entire list in a
> single lock hold if nothing blocks it.
>
> We've known this for a long, long time, and it's something we've
> handled for a long time with shrinkers, too. Here's the typical way
> of doing a full list aging and GC pass in one go without excessively
> long lock holds:
"Typical"?? Can you please point me to an existing example?
>
> {
> long nr_to_scan = list_lru_count(&nfsd_file_lru);
> LIST_HEAD(dispose);
>
> while (nr_to_scan > 0) {
> long batch = min(nr_to_scan, 64);
>
> list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> &dispose, batch);
>
> if (list_empty(&dispose))
> break;
> dispose_list(&dispose);
> nr_to_scan -= batch;
> }
> }
>
> And we don't need two lists to separate recently referenced vs
> gc candidates because we have a referenced bit in the nf->nf_flags.
> i.e. nfsd_file_lru_cb() does:
>
> nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> void *arg)
> {
> ....
> /* If it was recently added to the list, skip it */
> if (test_and_clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags)) {
> trace_nfsd_file_gc_referenced(nf);
> return LRU_ROTATE;
> }
> .....
>
> Which moves recently referenced entries to the far end of the list,
> > resulting in all the reclaimable objects congregating at the end of
> the list that is walked first by list_lru_walk().
>
> IOWs, a batched walk like above resumes the walk exactly where it
> left off, because it is always either reclaiming or rotating the
> object at the head of the list.
"rotating the object" to the head of the per-node list, not the head of
the whole list_lru (except in the trivial single-node case).
list_lru_walk() iterates over the multiple node lists in a fixed order.
Suppose you have 4 nodes, each with 32 items, all of them referenced, and
a batch size of 64.
The first batch will process the 32 items on the first list, clearing the
referenced bit and moving them to the end of that list, then continue
through those 32 again, freeing them all. The second batch will do the
same for the second list. The last two lists won't be touched.
list_lru_walk() is only ever used (correctly) for clearing out a
list_lru. It should probably be replaced by a function with a more apt
name which is targeted at this: no spinlocks, no return value from the
call-back.
Yes, we could change the above code to use list_lru_walk_node and wrap a
for loop around the whole thing, but then I wonder if list_lru is really
giving us anything of value.
Walking a linked list just to set a bit in every entry is a lot more work
than a few list manipulations on 5 cache-lines.
>
> > This patch removes the list_lru and instead uses 2 simple linked lists.
> > When a file is accessed it is removed from whichever list it is on,
> > then added to the tail of the first list. Every 2 seconds the second
> > list is moved to the "freeme" list and the first list is moved to the
> > second list. This avoids any need to walk a list to find old entries.
>
> Yup, that's exactly what the current code does via the laundrette
> > work that schedules nfsd_file_gc() to run every two seconds.
>
> > These lists are per-netns rather than global as the freeme list is
> > per-netns as the actual freeing is done in nfsd threads which are
> > per-netns.
>
> The list_lru is actually multiple lists - it is a per-numa node list
> and so moving to global scope linked lists per netns is going to
> reduce scalability and increase lock contention on large machines.
Possibly we could duplicate the filecache_disposal structure across
svc_pools instead of net namespaces. That would fix the scalability
issue. We should probably also avoid removing and re-adding the file in
the lru on every access, as that likely causes more spinlock
contention. We would need to adjust the ageing mechanism, but I suspect
it could be made to work.
But I really wanted to make it correct first, performant later. And I
still think that means not using list_lru.
>
> I also don't see any perf numbers, scalability analysis, latency
> measurement, CPU profiles, etc showing the problems with using list_lru
> for the GC function, nor any improvement this new code brings.
>
> i.e. It's kinda hard to make any real comment on "I do not believe
> list_lru is a good tool for this" when there is no actual
> measurements provided to back the statement one way or the other...
It's not about measurements, it's about behavioural correctness. Yes, I
should have spelled that out better. Thanks for encouraging me to do
so.
NeilBrown
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
` (7 preceding siblings ...)
2025-01-28 6:37 ` [PATCH 0/7] nfsd: filecache: change garbage collection lists Dave Chinner
@ 2025-02-05 23:04 ` NeilBrown
2025-02-06 3:02 ` Chuck Lever
8 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-02-05 23:04 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
Hi Chuck,
what are your current thoughts on merging this series?
One of my thoughts is that I now realise that
Commit 0758a7212628 ("nfsd: drop the lock during filecache LRU scans")
in nfsd-next is bad. If there are multiple nodes (and so multiple
sublists in the list_lru) and if there are more than a few dozen files in
the lru, then that patch results in the first sublist being completely
freed before anything is done to the next.
I think the best fix for backporting to -stable is to wrap a
for_each_node_state(nid, ...) loop around the while loop and use
list_lru_count_node() and list_lru_walk_node().
I could send a SQUASH patch for that and rebase this series on it.
Thanks,
NeilBrown
On Mon, 27 Jan 2025, NeilBrown wrote:
> [
> davec added to cc in case I've said something incorrect about list_lru
>
> Changes in this version:
> - no _bh locking
> - add name for a magic constant
> - remove unnecessary race-handling code
> - give a more meaningful name for a lock for /proc/lock_stat
> - minor cleanups suggested by Jeff
>
> ]
>
> The nfsd filecache currently uses list_lru for tracking files recently
> used in NFSv3 requests which need to be "garbage collected" when they
> have become idle - unused for 2-4 seconds.
>
> I do not believe list_lru is a good tool for this. It does not allow
> the timeout which filecache requires so we have to add a timeout
> mechanism which holds the list_lru lock while the whole list is scanned
> looking for entries that haven't been recently accessed. When the list
> is largish (even a few hundred) this can block new requests noticeably
> which need the lock to remove a file to access it.
>
> This patch removes the list_lru and instead uses 2 simple linked lists.
> When a file is accessed it is removed from whichever list it is on,
> then added to the tail of the first list. Every 2 seconds the second
> list is moved to the "freeme" list and the first list is moved to the
> second list. This avoids any need to walk a list to find old entries.
>
> These lists are per-netns rather than global as the freeme list is
> per-netns as the actual freeing is done in nfsd threads which are
> per-netns.
>
> Thanks,
> NeilBrown
>
> [PATCH 1/7] nfsd: filecache: remove race handling.
> [PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in
> [PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and
> [PATCH 4/7] nfsd: filecache: change garbage collection list
> [PATCH 5/7] nfsd: filecache: document the arbitrary limit on
> [PATCH 6/7] nfsd: filecache: change garbage collection to a timer.
> [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name.
>
>
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-01-29 21:34 ` NeilBrown
@ 2025-02-06 2:21 ` Dave Chinner
2025-02-06 3:04 ` NeilBrown
0 siblings, 1 reply; 23+ messages in thread
From: Dave Chinner @ 2025-02-06 2:21 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Thu, Jan 30, 2025 at 08:34:15AM +1100, NeilBrown wrote:
> On Tue, 28 Jan 2025, Dave Chinner wrote:
> > On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> > > The nfsd filecache currently uses list_lru for tracking files recently
> > > used in NFSv3 requests which need to be "garbage collected" when they
> > > have become idle - unused for 2-4 seconds.
> > >
> > > I do not believe list_lru is a good tool for this. It does not allow
> > > the timeout which filecache requires so we have to add a timeout
> > > mechanism which holds the list_lru lock while the whole list is scanned
> > > looking for entries that haven't been recently accessed. When the list
> > > is largish (even a few hundred) this can noticeably block new requests,
> > > which need the lock to remove a file in order to access it.
> >
> > Looks entirely like a trivial implementation bug in how the list_lru
> > is walked in nfsd_file_gc().
> >
> > static void
> > nfsd_file_gc(void)
> > {
> > LIST_HEAD(dispose);
> > unsigned long ret;
> >
> > ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > &dispose, list_lru_count(&nfsd_file_lru));
> > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> >
> > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > nfsd_file_dispose_list_delayed(&dispose);
> > }
> >
> > i.e. the list_lru_walk() has been told to walk the entire list in a
> > single lock hold if nothing blocks it.
> >
> > We've known this for a long, long time, and it's something we've
> > handled for a long time with shrinkers, too. Here's the typical way
> > of doing a full list aging and GC pass in one go without excessively
> > long lock holds:
>
> "Typical"?? Can you please point me to an existing example?
Off the top of my head: shrink_dcache_sb().
However, *every single shrinker implementation in the kernel* uses
this algorithm whether they use list-lru or not.
i.e. this "iterate list in small batches to minimise lock hold
latency" algorithm has been used by the shrinker infrastructure
since the 2.5.x days.
list-lru was designed explicitly for use with shrinker walk
algorithms, so -typical- usage is to walk in small batches unless
it is known that there are no concurrent accesses to the list
possible. (e.g. during teardown).
Just because you don't know about how list_lru is typically used,
doesn't mean that there aren't typical uses...
> > IOWs, a batched walk like above resumes the walk exactly where it
> > left off, because it is always either reclaiming or rotating the
> > object at the head of the list.
>
> "rotating the object" to the head of the per-node list, not the head of
> the whole list_lru (except in the trivial single-node case).
Yup. That's because shrinkers are numa node specific. i.e. memory
reclaim is not global, it is per-node and we have one list_lru list
per node. IOWs, the structure of list-lru is intended to be optimal
for NUMA aware memory reclaim algorithms.
Most important VFS and filesystem caches ceased to have global LRU
ordering a -long- time ago. Scalability to really large machines is
far more important than strict global LRU maintenance.
> list_lru_walk() iterates over the multiple node lists in a fixed order.
> Suppose you have 4 nodes, each with 32 items, all of them referenced, and
> a batch size of 64.
> The first batch will process the 32 items on the first list clearing the
> referenced bit and moving them to the end of that list. It then continues
> through those 32 again, freeing them all. The second batch will do the
> same for the second list. The last two lists won't be touched.
>
> list_lru_walk() is only ever used (correctly) for clearing out a
> list_lru. It should probably be replaced by a function with a more apt
> name which is targeted at this: no spinlocks, no return value from the
> call-back.
>
> Yes, we could change the above code to use list_lru_walk_node and wrap a
> for loop around the whole thing, but then I wonder if list_lru is really
> giving us anything of value.
Apart from scalability and the ability for memory reclaim to do sane
per-node object reclaim? What about the fact that anyone familiar
with list-lru doesn't need to look at how the lists are implemented
to know how the code behaves?
Using common infrastructure, even when it's not an exact perfect
fit, holds a lot more value to the wider community than a one-off
special snowflake implementation that only one or two people
understand....
> Walking a linked list just to set a bit in every entry is a lot more work
> than a few list manipulations in 5 cache-lines.
Maybe so, but the cache gc isn't a performance critical path.
> > > This patch removes the list_lru and instead uses 2 simple linked lists.
> > > When a file is accessed it is removed from whichever list it is on,
> > > then added to the tail of the first list. Every 2 seconds the second
> > > list is moved to the "freeme" list and the first list is moved to the
> > > second list. This avoids any need to walk a list to find old entries.
> >
> > Yup, that's exactly what the current code does via the laundrette
> > work that schedules nfsd_file_gc() to run every two seconds.
> >
> > > These lists are per-netns rather than global as the freeme list is
> > > per-netns as the actual freeing is done in nfsd threads which are
> > > per-netns.
> >
> > The list_lru is actually multiple lists - it is a per-numa node list
> > and so moving to global scope linked lists per netns is going to
> > reduce scalability and increase lock contention on large machines.
>
> Possibly we could duplicate the filecache_disposal structure across
> svc_pools instead of net namespaces. That would fix the scalability
> issue. Probably we should avoid removing and re-adding the file in
> the lru for every access as that probably causes more spinlock
> contention. We would need to adjust the ageing mechanism but I suspect
> it could be made to work.
Typical usage of list-lru is lazy removal. i.e. we only add it to
the LRU list if it's not already there, and only reclaim/gc removes
it from the list. This is how the inode and dentry caches have
worked since before list_lru even existed, and it's a pattern that
is replicated across many subsystems that use LRUs for gc
purposes...
IOWs, the "object in cache" lookup fast path should never touch
the LRU at all.
> > i.e. It's kinda hard to make any real comment on "I do not believe
> > list_lru is a good tool for this" when there is no actual
> > measurements provided to back the statement one way or the other...
>
> It's not about measurements, it's about behavioural correctness. Yes, I
> should have spelled that out better. Thanks for encouraging me to do
> so.
So you are saying that you don't care that the existing code can
easily be fixed, that your one-off solution won't scale as well and
is less functional from memory reclaim POV, and that the new
implementation is less maintainable than using generic
infrastructure to do the same work.
If that's the approach you are taking, then I don't know why you
asked me to point out all the stuff about list_lru that you didn't
seem to know about in the first place...
-Dave.
--
Dave Chinner
david@fromorbit.com
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-02-05 23:04 ` NeilBrown
@ 2025-02-06 3:02 ` Chuck Lever
0 siblings, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2025-02-06 3:02 UTC (permalink / raw)
To: NeilBrown, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
Howdy Neil -
On 2/5/25 6:04 PM, NeilBrown wrote:
>
> Hi Chuck,
> what are your current thoughts on merging this series?
I think NFSD is better off trying to make list_lru work, as that is a
broadly shared mechanism that has had a lot of review and testing. So
far there hasn't been a lot of evidence that the proposed replacement
is significantly more efficient; it's just different. I generally agree
with Dave's sentiment:
> Using common infrastructure, even when it's not an exact perfect
> fit, holds a lot more value to the wider community than a one-off
> special snowflake implementation that only one or two people
> understand....
LRU for the NFSD filecache is a case where IMO code re-use /is/
valuable.
One of my peeves about the filecache is that it is buried inside of NFSD
so deeply that it is well-nigh impossible to build unit testing for it.
I'd rather not add more special cases in this area of the code unless
there is a palpable benefit.
> One of my thoughts is that I now realise that
>
> Commit 0758a7212628 ("nfsd: drop the lock during filecache LRU scans")
>
> in nfsd-next is bad. If there are multiple nodes (and so multiple
> sublists in the list_lru) and if there are more than a few dozen files in
> the lru, then that patch results in the first sublist being completely
> freed before anything is done to the next.
> I think the best fix for backporting to -stable is to wrap a
> for_each_node_state(nid) loop around the while loop, using
> list_lru_count_node() and list_lru_walk_node().
>
> I could send a SQUASH patch for that and rebase this series on it.
Yeah, proper NUMA sensitivity is important, I feel. Please post a full
replacement. nfsd-testing is a topic branch, so the current revision of
0758a7212628 can be replaced wholesale.
> Thanks,
> NeilBrown
--
Chuck Lever
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-02-06 2:21 ` Dave Chinner
@ 2025-02-06 3:04 ` NeilBrown
2025-02-06 14:35 ` Chuck Lever
0 siblings, 1 reply; 23+ messages in thread
From: NeilBrown @ 2025-02-06 3:04 UTC (permalink / raw)
To: Dave Chinner
Cc: Chuck Lever, Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Thu, 06 Feb 2025, Dave Chinner wrote:
> On Thu, Jan 30, 2025 at 08:34:15AM +1100, NeilBrown wrote:
> > On Tue, 28 Jan 2025, Dave Chinner wrote:
> > > On Mon, Jan 27, 2025 at 12:20:31PM +1100, NeilBrown wrote:
> > > > The nfsd filecache currently uses list_lru for tracking files recently
> > > > used in NFSv3 requests which need to be "garbage collected" when they
> > > > have become idle - unused for 2-4 seconds.
> > > >
> > > > I do not believe list_lru is a good tool for this. It does not allow
> > > > the timeout which filecache requires so we have to add a timeout
> > > > mechanism which holds the list_lru lock while the whole list is scanned
> > > > looking for entries that haven't been recently accessed. When the list
> > > > is largish (even a few hundred) this can noticeably block new requests,
> > > > which need the lock to remove a file in order to access it.
> > >
> > > Looks entirely like a trivial implementation bug in how the list_lru
> > > is walked in nfsd_file_gc().
> > >
> > > static void
> > > nfsd_file_gc(void)
> > > {
> > > LIST_HEAD(dispose);
> > > unsigned long ret;
> > >
> > > ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > > &dispose, list_lru_count(&nfsd_file_lru));
> > > ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> > >
> > > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > nfsd_file_dispose_list_delayed(&dispose);
> > > }
> > >
> > > i.e. the list_lru_walk() has been told to walk the entire list in a
> > > single lock hold if nothing blocks it.
> > >
> > > We've known this for a long, long time, and it's something we've
> > > handled for a long time with shrinkers, too. Here's the typical way
> > > of doing a full list aging and GC pass in one go without excessively
> > > long lock holds:
> >
> > "Typical"?? Can you please point me to an existing example?
>
> Off the top of my head: shrink_dcache_sb().
shrink_dcache_sb() contains:
} while (list_lru_count(&sb->s_dentry_lru) > 0);
i.e. it removes *everything* from the list_lru.
There are 5 callers of list_lru_walk() in the kernel.
binder_shrink_scan() appears to be buggy. Other _scan functions
use list_lru_shrink_walk which is per-node as you say.
binder_selftest_free_page() calls until the list is empty
shrink_dcache_sb() calls until the list is empty
nfsd_file_gc() walks the whole list but does NOT necessarily want the
list to be empty
xfs_buftarg_drain() calls until the list is empty
So three walk until the list is empty, one should use
list_lru_shrink_walk(), and nfsd wants something quite different.
>
> However, *every single shrinker implementation in the kernel* uses
> this algorithm whether they use list-lru or not.
Yes, but we aren't doing shrinking here. We are doing ageing. We want
a firm upper limit to how long things remain cached after the last use.
>
> > list_lru_walk() iterates over the multiple node lists in a fixed order.
> > Suppose you have 4 nodes, each with 32 items, all of them referenced, and
> > a batch size of 64.
> > The first batch will process the 32 items on the first list clearing the
> > referenced bit and moving them to the end of that list. It then continues
> > through those 32 again, freeing them all. The second batch will do the
> > same for the second list. The last two lists won't be touched.
> >
> > list_lru_walk() is only ever used (correctly) for clearing out a
> > list_lru. It should probably be replaced by a function with a more apt
> > name which is targeted at this: no spinlocks, no return value from the
> > call-back.
> >
> > Yes, we could change the above code to use list_lru_walk_node and wrap a
> > for loop around the whole thing, but then I wonder if list_lru is really
> > giving us anything of value.
>
> Apart from scalability and the ability for memory reclaim to do sane
> per-node object reclaim? What about the fact that anyone familiar
> with list-lru doesn't need to look at how the lists are implemented
> to know how the code behaves?
What an odd thing to say... how does one become familiar except by
looking at the code?
I accept that my current approach loses per-node object reclaim. I
don't know how important that is. I believe the primary cost of nfsd
leaving files on the lru is not the memory usage, but the fact that the
file is held open while not in use.
As I said before, nfsd already has some per-node awareness. Linking the
file ageing into that would be easy enough.
>
> Using common infrastructure, even when it's not an exact perfect
> fit, holds a lot more value to the wider community than a one-off
> special snowflake implementation that only one or two people
> understand....
While there is certainly truth in that, it is also true that using
common infrastructure in a non-idiomatic way can cause confusion.
Sometimes things that are different should look different.
From my perspective the nfsd filecache lru is primarily about ageing, not
about garbage collection.
>
> > Walking a linked list just to set a bit in every entry is a lot more work
> > than a few list manipulations in 5 cache-lines.
>
> Maybe so, but the cache gc isn't a performance critical path.
And yet it was the "gc", which was really "aging", which highlighted the
problem for us.
Normal list_lru never walks the entire list (unless it is short) except
when freeing everything. With the "reference" bit approach, nfsd needs
to walk the entire lru frequently. That can make the aging more of a
performance issue.
dcache uses DCACHE_REFERENCED effectively, but in a different way. It
is only cleared by the shrinker for those entries that are on the lru
before the ones that can be freed - it doesn't want to clear the
DCACHE_REFERENCED bits on everything in the lru (which has it set).
nfsd DOES want to clear the referenced bit on everything for the aging.
>
> > > > This patch removes the list_lru and instead uses 2 simple linked lists.
> > > > When a file is accessed it is removed from whichever list it is on,
> > > > then added to the tail of the first list. Every 2 seconds the second
> > > > list is moved to the "freeme" list and the first list is moved to the
> > > > second list. This avoids any need to walk a list to find old entries.
> > >
> > > Yup, that's exactly what the current code does via the laundrette
> > > work that schedules nfsd_file_gc() to run every two seconds.
> > >
> > > > These lists are per-netns rather than global as the freeme list is
> > > > per-netns as the actual freeing is done in nfsd threads which are
> > > > per-netns.
> > >
> > > The list_lru is actually multiple lists - it is a per-numa node list
> > > and so moving to global scope linked lists per netns is going to
> > > reduce scalability and increase lock contention on large machines.
> >
> > Possibly we could duplicate the filecache_disposal structure across
> > svc_pools instead of net namespaces. That would fix the scalability
> > issue. Probably we should avoid removing and re-adding the file in
> > the lru for every access as that probably causes more spinlock
> > contention. We would need to adjust the ageing mechanism but I suspect
> > it could be made to work.
>
> Typical usage of list-lru is lazy removal. i.e. we only add it to
> the LRU list if it's not already there, and only reclaim/gc removes
> it from the list. This is how the inode and dentry caches have
> worked since before list_lru even existed, and it's a pattern that
> is replicated across many subsystems that use LRUs for gc
> purposes...
>
> IOWs, the "object in cache" lookup fast path should never touch
> the LRU at all.
Yes, I agree. Whichever way we go, this is a change we need to make to
the nfsd filecache - not to remove on each access.
>
> > > i.e. It's kinda hard to make any real comment on "I do not believe
> > > list_lru is a good tool for this" when there is no actual
> > > measurements provided to back the statement one way or the other...
> >
> > It's not about measurements, it's about behavioural correctness. Yes, I
> > should have spelled that out better. Thanks for encouraging me to do
> > so.
>
> So you are saying that you don't care that the existing code can
> easily be fixed, that your one-off solution won't scale as well and
> is less functional from memory reclaim POV, and that the new
> implementation is less maintainable than using generic
> infrastructure to do the same work.
No, I'm saying that it isn't clear to me that the existing code can be
easily fixed. Certainly there are easy improvements. But list_lru is
designed for an important access pattern which is different from what
the nfsd filecache needs.
>
> If that's the approach you are taking, then I don't know why you
> asked me to point out all the stuff about list_lru that you didn't
> seem to know about in the first place...
It is always useful to have intelligent review, even when one doesn't
agree with all of it. Thanks!
NeilBrown
* Re: [PATCH 0/7] nfsd: filecache: change garbage collection lists
2025-02-06 3:04 ` NeilBrown
@ 2025-02-06 14:35 ` Chuck Lever
0 siblings, 0 replies; 23+ messages in thread
From: Chuck Lever @ 2025-02-06 14:35 UTC (permalink / raw)
To: NeilBrown, Dave Chinner
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey
On 2/5/25 10:04 PM, NeilBrown wrote:
> On Thu, 06 Feb 2025, Dave Chinner wrote:
>>
>> However, *every single shrinker implementation in the kernel* uses
>> this algorithm whether they use list-lru or not.
>
> Yes, but we aren't doing shrinking here. We are doing ageing.
The NFSD filecache uses this mechanism for both aging and shrinking.
What might be appropriate is to provide a list_lru API for aging. That
would have the benefit of using an existing common mechanism but make
it more appropriate for NFSD's usage scenario.
> We want
> a firm upper limit to how long things remain cached after the last use.
Do we need a firm limit? The age limit is not about correctness, is it?
What we don't want is a remote file access via NFSv3 to keep a file open
indefinitely. The current 2 second value is not a deep requirement, it's
merely a value that seems to work well.
That aspect is a little different than inode/dentry caching, where it
can be valuable (and at least, won't be harmful) to leave the item in
the cache.
--
Chuck Lever
end of thread, other threads:[~2025-02-06 14:35 UTC | newest]
Thread overview: 23+ messages (download: mbox.gz / follow: Atom feed)
2025-01-27 1:20 [PATCH 0/7] nfsd: filecache: change garbage collection lists NeilBrown
2025-01-27 1:20 ` [PATCH 1/7] nfsd: filecache: remove race handling NeilBrown
2025-01-27 13:42 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 2/7] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() NeilBrown
2025-01-27 1:20 ` [PATCH 3/7] nfsd: filecache: move globals nfsd_file_lru and nfsd_file_shrinker to be per-net NeilBrown
2025-01-27 1:20 ` [PATCH 4/7] nfsd: filecache: change garbage collection list management NeilBrown
2025-01-27 14:15 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 5/7] nfsd: filecache: document the arbitrary limit on file-disposes-per-loop NeilBrown
2025-01-27 14:40 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 6/7] nfsd: filecache: change garbage collection to a timer NeilBrown
2025-01-27 14:39 ` Jeff Layton
2025-01-27 1:20 ` [PATCH 7/7] nfsd: filecache: give disposal lock a unique class name NeilBrown
2025-01-27 14:29 ` Chuck Lever
2025-01-27 14:40 ` Jeff Layton
2025-01-28 6:37 ` [PATCH 0/7] nfsd: filecache: change garbage collection lists Dave Chinner
2025-01-28 14:27 ` Chuck Lever
2025-01-28 16:05 ` Chuck Lever
2025-01-29 21:34 ` NeilBrown
2025-02-06 2:21 ` Dave Chinner
2025-02-06 3:04 ` NeilBrown
2025-02-06 14:35 ` Chuck Lever
2025-02-05 23:04 ` NeilBrown
2025-02-06 3:02 ` Chuck Lever