* [PATCH 1/6] nfsd: filecache: remove race handling.
2025-02-07 5:15 [PATCH 0/6] nfsd: filecache: various fixes NeilBrown
@ 2025-02-07 5:15 ` NeilBrown
2025-02-10 23:33 ` Dave Chinner
2025-02-07 5:15 ` [PATCH 2/6] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() NeilBrown
` (4 subsequent siblings)
5 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2025-02-07 5:15 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
The race that this code tries to protect against is not interesting.
The code is problematic as we access the "nf" after we have given our
reference to the lru system. While that take 2+ seconds to free things
it is still poor form.
The only interesting race I can find would be with
nfsd_file_close_inode_sync();
This is the only place that really doesn't want the file to stay on the
LRU when unhashed (which is the direct consequence of the race).
However for the race to happen, some other thread must own a reference
to a file and be putting in while nfsd_file_close_inode_sync() is trying
to close all files for an inode. If this is possible, that other thread
could simply call nfsd_file_put() a little bit later and the result
would be the same: not all files are closed when
nfsd_file_close_inode_sync() completes.
If this was really a problem, we would need to wait in close_inode_sync
for the other references to be dropped. We probably don't want to do
that.
So it is best to simply remove this code.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 16 +++-------------
1 file changed, 3 insertions(+), 13 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index a1cdba42c4fa..b13255bcbb96 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -371,20 +371,10 @@ nfsd_file_put(struct nfsd_file *nf)
/* Try to add it to the LRU. If that fails, decrement. */
if (nfsd_file_lru_add(nf)) {
- /* If it's still hashed, we're done */
- if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
- nfsd_file_schedule_laundrette();
- return;
- }
-
- /*
- * We're racing with unhashing, so try to remove it from
- * the LRU. If removal fails, then someone else already
- * has our reference.
- */
- if (!nfsd_file_lru_remove(nf))
- return;
+ nfsd_file_schedule_laundrette();
+ return;
}
+
}
if (refcount_dec_and_test(&nf->nf_ref))
nfsd_file_free(nf);
--
2.47.1
^ permalink raw reply related [flat|nested] 25+ messages in thread* Re: [PATCH 1/6] nfsd: filecache: remove race handling.
2025-02-07 5:15 ` [PATCH 1/6] nfsd: filecache: remove race handling NeilBrown
@ 2025-02-10 23:33 ` Dave Chinner
2025-02-12 22:16 ` NeilBrown
0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2025-02-10 23:33 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Fri, Feb 07, 2025 at 04:15:11PM +1100, NeilBrown wrote:
> The race that this code tries to protect against is not interesting.
> The code is problematic as we access the "nf" after we have given our
> reference to the lru system. While that take 2+ seconds to free things
> it is still poor form.
>
> The only interesting race I can find would be with
> nfsd_file_close_inode_sync();
> This is the only place that really doesn't want the file to stay on the
> LRU when unhashed (which is the direct consequence of the race).
>
> However for the race to happen, some other thread must own a reference
> to a file and be putting in while nfsd_file_close_inode_sync() is trying
> to close all files for an inode. If this is possible, that other thread
> could simply call nfsd_file_put() a little bit later and the result
> would be the same: not all files are closed when
> nfsd_file_close_inode_sync() completes.
>
> If this was really a problem, we would need to wait in close_inode_sync
> for the other references to be dropped. We probably don't want to do
> that.
>
> So it is best to simply remove this code.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 16 +++-------------
> 1 file changed, 3 insertions(+), 13 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index a1cdba42c4fa..b13255bcbb96 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -371,20 +371,10 @@ nfsd_file_put(struct nfsd_file *nf)
>
> /* Try to add it to the LRU. If that fails, decrement. */
> if (nfsd_file_lru_add(nf)) {
> - /* If it's still hashed, we're done */
> - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> - nfsd_file_schedule_laundrette();
> - return;
> - }
> -
> - /*
> - * We're racing with unhashing, so try to remove it from
> - * the LRU. If removal fails, then someone else already
> - * has our reference.
> - */
> - if (!nfsd_file_lru_remove(nf))
> - return;
> + nfsd_file_schedule_laundrette();
> + return;
Why do we need to start the background gc on demand? When the nfsd
subsystem is under load it is going to be calling
nfsd_file_schedule_laundrette() all the time. However, gc is almost
always going to be queued/running already.
i.e. the above code results in adding the overhead of an atomic
NFSD_FILE_CACHE_UP bit check and then a call to queue_delayed_work()
on an already queued item. This will disables interrupts, make an
atomic bit check that sees the work is queued, re-enable interrupts
and return.
IMO, there is no need to do this unnecessary work on every object
that is added to the LRU. Changing the gc worker to always run
every 2s and check if it has work to do like so:
static void
nfsd_file_gc_worker(struct work_struct *work)
{
- nfsd_file_gc();
- if (list_lru_count(&nfsd_file_lru))
- nfsd_file_schedule_laundrette();
+ if (list_lru_count(&nfsd_file_lru))
+ nfsd_file_gc();
+ nfsd_file_schedule_laundrette();
}
means that nfsd_file_gc() will be run the same way and have the same
behaviour as the current code. When the system it idle, it does a
list_lru_count() check every 2 seconds and goes back to sleep.
That's going to be pretty much unnoticable on most machines that run
NFS servers.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 1/6] nfsd: filecache: remove race handling.
2025-02-10 23:33 ` Dave Chinner
@ 2025-02-12 22:16 ` NeilBrown
2025-02-13 15:02 ` Chuck Lever
0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2025-02-12 22:16 UTC (permalink / raw)
To: Dave Chinner
Cc: Chuck Lever, Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Tue, 11 Feb 2025, Dave Chinner wrote:
> On Fri, Feb 07, 2025 at 04:15:11PM +1100, NeilBrown wrote:
> > The race that this code tries to protect against is not interesting.
> > The code is problematic as we access the "nf" after we have given our
> > reference to the lru system. While that take 2+ seconds to free things
> > it is still poor form.
> >
> > The only interesting race I can find would be with
> > nfsd_file_close_inode_sync();
> > This is the only place that really doesn't want the file to stay on the
> > LRU when unhashed (which is the direct consequence of the race).
> >
> > However for the race to happen, some other thread must own a reference
> > to a file and be putting in while nfsd_file_close_inode_sync() is trying
> > to close all files for an inode. If this is possible, that other thread
> > could simply call nfsd_file_put() a little bit later and the result
> > would be the same: not all files are closed when
> > nfsd_file_close_inode_sync() completes.
> >
> > If this was really a problem, we would need to wait in close_inode_sync
> > for the other references to be dropped. We probably don't want to do
> > that.
> >
> > So it is best to simply remove this code.
> >
> > Reviewed-by: Jeff Layton <jlayton@kernel.org>
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> > fs/nfsd/filecache.c | 16 +++-------------
> > 1 file changed, 3 insertions(+), 13 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index a1cdba42c4fa..b13255bcbb96 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -371,20 +371,10 @@ nfsd_file_put(struct nfsd_file *nf)
> >
> > /* Try to add it to the LRU. If that fails, decrement. */
> > if (nfsd_file_lru_add(nf)) {
> > - /* If it's still hashed, we're done */
> > - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
> > - nfsd_file_schedule_laundrette();
> > - return;
> > - }
> > -
> > - /*
> > - * We're racing with unhashing, so try to remove it from
> > - * the LRU. If removal fails, then someone else already
> > - * has our reference.
> > - */
> > - if (!nfsd_file_lru_remove(nf))
> > - return;
> > + nfsd_file_schedule_laundrette();
> > + return;
>
> Why do we need to start the background gc on demand? When the nfsd
> subsystem is under load it is going to be calling
> nfsd_file_schedule_laundrette() all the time. However, gc is almost
> always going to be queued/running already.
>
> i.e. the above code results in adding the overhead of an atomic
> NFSD_FILE_CACHE_UP bit check and then a call to queue_delayed_work()
> on an already queued item. This will disables interrupts, make an
> atomic bit check that sees the work is queued, re-enable interrupts
> and return.
I don't think we really need the NFSD_FILE_CACHE_UP test - if we have a
file to put, then the cache must be up.
And we could use delayed_work_pending() to avoid the irq disable.
That is still one atomic bit test though.
>
> IMO, there is no need to do this unnecessary work on every object
> that is added to the LRU. Changing the gc worker to always run
> every 2s and check if it has work to do like so:
>
> static void
> nfsd_file_gc_worker(struct work_struct *work)
> {
> - nfsd_file_gc();
> - if (list_lru_count(&nfsd_file_lru))
> - nfsd_file_schedule_laundrette();
> + if (list_lru_count(&nfsd_file_lru))
> + nfsd_file_gc();
> + nfsd_file_schedule_laundrette();
> }
>
> means that nfsd_file_gc() will be run the same way and have the same
> behaviour as the current code. When the system it idle, it does a
> list_lru_count() check every 2 seconds and goes back to sleep.
> That's going to be pretty much unnoticable on most machines that run
> NFS servers.
When serving a v4 only load we don't need the timer at all... Maybe
that doesn't matter.
I would rather use a timer instead of a delayed work as the task doesn't
require sleeping, and we have a pool of nfsd threads to do any work that
might be needed. So a timer that checks if work is needed then sets a
flag and wakes an nfsd would be even lower impact that a delayed work
which needs to wake a wq thread every time even if there is nothing to
do.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 1/6] nfsd: filecache: remove race handling.
2025-02-12 22:16 ` NeilBrown
@ 2025-02-13 15:02 ` Chuck Lever
0 siblings, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2025-02-13 15:02 UTC (permalink / raw)
To: NeilBrown, Dave Chinner
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey
[ responding to both Neil and Dave ... ]
On 2/12/25 5:16 PM, NeilBrown wrote:
> On Tue, 11 Feb 2025, Dave Chinner wrote:
>> On Fri, Feb 07, 2025 at 04:15:11PM +1100, NeilBrown wrote:
>>> The race that this code tries to protect against is not interesting.
>>> The code is problematic as we access the "nf" after we have given our
>>> reference to the lru system. While that take 2+ seconds to free things
>>> it is still poor form.
>>>
>>> The only interesting race I can find would be with
>>> nfsd_file_close_inode_sync();
>>> This is the only place that really doesn't want the file to stay on the
>>> LRU when unhashed (which is the direct consequence of the race).
>>>
>>> However for the race to happen, some other thread must own a reference
>>> to a file and be putting in while nfsd_file_close_inode_sync() is trying
>>> to close all files for an inode. If this is possible, that other thread
>>> could simply call nfsd_file_put() a little bit later and the result
>>> would be the same: not all files are closed when
>>> nfsd_file_close_inode_sync() completes.
>>>
>>> If this was really a problem, we would need to wait in close_inode_sync
>>> for the other references to be dropped. We probably don't want to do
>>> that.
>>>
>>> So it is best to simply remove this code.
>>>
>>> Reviewed-by: Jeff Layton <jlayton@kernel.org>
>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>> ---
>>> fs/nfsd/filecache.c | 16 +++-------------
>>> 1 file changed, 3 insertions(+), 13 deletions(-)
>>>
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index a1cdba42c4fa..b13255bcbb96 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -371,20 +371,10 @@ nfsd_file_put(struct nfsd_file *nf)
>>>
>>> /* Try to add it to the LRU. If that fails, decrement. */
>>> if (nfsd_file_lru_add(nf)) {
>>> - /* If it's still hashed, we're done */
>>> - if (test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
>>> - nfsd_file_schedule_laundrette();
>>> - return;
>>> - }
>>> -
>>> - /*
>>> - * We're racing with unhashing, so try to remove it from
>>> - * the LRU. If removal fails, then someone else already
>>> - * has our reference.
>>> - */
>>> - if (!nfsd_file_lru_remove(nf))
>>> - return;
>>> + nfsd_file_schedule_laundrette();
>>> + return;
>>
>> Why do we need to start the background gc on demand?
This call is an original feature of the filecache -- I'm guessing it
was done to ensure that a final put closes the file quickly. But over
time, the filecache has grown explicit code that handles that, so I
think it is likely this call is no longer needed.
>> When the nfsd
>> subsystem is under load it is going to be calling
>> nfsd_file_schedule_laundrette() all the time. However, gc is almost
>> always going to be queued/running already.
>>
>> i.e. the above code results in adding the overhead of an atomic
>> NFSD_FILE_CACHE_UP bit check and then a call to queue_delayed_work()
>> on an already queued item. This will disables interrupts, make an
>> atomic bit check that sees the work is queued, re-enable interrupts
>> and return.
>
> I don't think we really need the NFSD_FILE_CACHE_UP test - if we have a
> file to put, then the cache must be up.
I can change 1/6 to remove the call to nfsd_file_schedule_laundrette().
Then these questions become moot.
> And we could use delayed_work_pending() to avoid the irq disable.
> That is still one atomic bit test though.
>
>>
>> IMO, there is no need to do this unnecessary work on every object
>> that is added to the LRU. Changing the gc worker to always run
>> every 2s and check if it has work to do like so:
>>
>> static void
>> nfsd_file_gc_worker(struct work_struct *work)
>> {
>> - nfsd_file_gc();
>> - if (list_lru_count(&nfsd_file_lru))
>> - nfsd_file_schedule_laundrette();
>> + if (list_lru_count(&nfsd_file_lru))
>> + nfsd_file_gc();
>> + nfsd_file_schedule_laundrette();
>> }
>>
>> means that nfsd_file_gc() will be run the same way and have the same
>> behaviour as the current code. When the system it idle, it does a
>> list_lru_count() check every 2 seconds and goes back to sleep.
>> That's going to be pretty much unnoticable on most machines that run
>> NFS servers.
I can add a patch to this series that makes this swap.
> When serving a v4 only load we don't need the timer at all... Maybe
> that doesn't matter.
>
> I would rather use a timer instead of a delayed work as the task doesn't
> require sleeping, and we have a pool of nfsd threads to do any work that
> might be needed. So a timer that checks if work is needed then sets a
> flag and wakes an nfsd would be even lower impact that a delayed work
> which needs to wake a wq thread every time even if there is nothing to
> do.
Moving the laundrette work to nfsd threads seems sensible, but let's
leave that for another patch series.
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 2/6] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync()
2025-02-07 5:15 [PATCH 0/6] nfsd: filecache: various fixes NeilBrown
2025-02-07 5:15 ` [PATCH 1/6] nfsd: filecache: remove race handling NeilBrown
@ 2025-02-07 5:15 ` NeilBrown
2025-02-07 5:15 ` [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc() NeilBrown
` (3 subsequent siblings)
5 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-07 5:15 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
nfsd_file_close_inode_sync() contains an exactly copy of
nfsd_file_dispose_list().
This patch removes it and calls nfsd_file_dispose_list() instead.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 7 +------
1 file changed, 1 insertion(+), 6 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index b13255bcbb96..7dc20143c854 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -666,17 +666,12 @@ nfsd_file_close_inode(struct inode *inode)
void
nfsd_file_close_inode_sync(struct inode *inode)
{
- struct nfsd_file *nf;
LIST_HEAD(dispose);
trace_nfsd_file_close(inode);
nfsd_file_queue_for_close(inode, &dispose);
- while (!list_empty(&dispose)) {
- nf = list_first_entry(&dispose, struct nfsd_file, nf_gc);
- list_del_init(&nf->nf_gc);
- nfsd_file_free(nf);
- }
+ nfsd_file_dispose_list(&dispose);
}
static int
--
2.47.1
^ permalink raw reply related [flat|nested] 25+ messages in thread* [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()
2025-02-07 5:15 [PATCH 0/6] nfsd: filecache: various fixes NeilBrown
2025-02-07 5:15 ` [PATCH 1/6] nfsd: filecache: remove race handling NeilBrown
2025-02-07 5:15 ` [PATCH 2/6] nfsd: filecache: use nfsd_file_dispose_list() in nfsd_file_close_inode_sync() NeilBrown
@ 2025-02-07 5:15 ` NeilBrown
2025-02-07 14:43 ` Chuck Lever
2025-02-10 13:46 ` Jeff Layton
2025-02-07 5:15 ` [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT NeilBrown
` (2 subsequent siblings)
5 siblings, 2 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-07 5:15 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
list_lru_walk() is only useful when the aim is to remove all elements
from the list_lru. It will repeated visit rotated element of the first
per-node sublist before proceeding to subsrequent sublists.
This patch changes to use list_lru_walk_node() and list_lru_count_node()
on each individual node.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 10 +++++++---
1 file changed, 7 insertions(+), 3 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 7dc20143c854..04588c03bdfe 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -532,10 +532,14 @@ static void
nfsd_file_gc(void)
{
LIST_HEAD(dispose);
- unsigned long ret;
+ unsigned long ret = 0;
+ int nid;
- ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
- &dispose, list_lru_count(&nfsd_file_lru));
+ for_each_node_state(nid, N_NORMAL_MEMORY) {
+ unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
+ ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
+ &dispose, &nr);
+ }
trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
nfsd_file_dispose_list_delayed(&dispose);
}
--
2.47.1
^ permalink raw reply related [flat|nested] 25+ messages in thread* Re: [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()
2025-02-07 5:15 ` [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc() NeilBrown
@ 2025-02-07 14:43 ` Chuck Lever
2025-02-09 23:16 ` NeilBrown
2025-02-10 13:46 ` Jeff Layton
1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2025-02-07 14:43 UTC (permalink / raw)
To: NeilBrown, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On 2/7/25 12:15 AM, NeilBrown wrote:
> list_lru_walk() is only useful when the aim is to remove all elements
> from the list_lru.
I think I get where this is going. Can the description cite some API
documentation that comports with this claim?
> It will repeated visit rotated element of the first
> per-node sublist before proceeding to subsrequent sublists.
s/repeated/repeatedly, s/element/elements, and s/subsrequent/subsequent.
> This patch changes to use list_lru_walk_node() and list_lru_count_node()
> on each individual node.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 7dc20143c854..04588c03bdfe 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -532,10 +532,14 @@ static void
> nfsd_file_gc(void)
> {
> LIST_HEAD(dispose);
> - unsigned long ret;
> + unsigned long ret = 0;
> + int nid;
>
> - ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> - &dispose, list_lru_count(&nfsd_file_lru));
> + for_each_node_state(nid, N_NORMAL_MEMORY) {
> + unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> + &dispose, &nr);
> + }
> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
Nit: Maybe this trace point should be placed in the loop and changed to
record the nid as well?
> nfsd_file_dispose_list_delayed(&dispose);
Wondering if maybe the nfsd_file_dispose_list_delayed() call should also
be moved inside the for loop to break up the disposal processing per
node as well.
> }
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()
2025-02-07 14:43 ` Chuck Lever
@ 2025-02-09 23:16 ` NeilBrown
0 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-09 23:16 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Dave Chinner
On Sat, 08 Feb 2025, Chuck Lever wrote:
> On 2/7/25 12:15 AM, NeilBrown wrote:
> > list_lru_walk() is only useful when the aim is to remove all elements
> > from the list_lru.
>
> I think I get where this is going. Can the description cite some API
> documentation that comports with this claim?
That would require someone to write some documentation, because there
isn't any - not for list_lru_walk(). I wouldn't recommend writing any
either. It should be replaced with something with a name like
"list_lru_destroy()" which is given a list_lru_ and a callback (no
count). The callback can skip or isolate or if the disposal list gets
lock, drop the lock, dispose the lists, and return LRU_RETRY.
>
>
> > It will repeated visit rotated element of the first
> > per-node sublist before proceeding to subsrequent sublists.
>
> s/repeated/repeatedly, s/element/elements, and s/subsrequent/subsequent.
>
Yep.
>
> > This patch changes to use list_lru_walk_node() and list_lru_count_node()
> > on each individual node.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> > fs/nfsd/filecache.c | 10 +++++++---
> > 1 file changed, 7 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 7dc20143c854..04588c03bdfe 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -532,10 +532,14 @@ static void
> > nfsd_file_gc(void)
> > {
> > LIST_HEAD(dispose);
> > - unsigned long ret;
> > + unsigned long ret = 0;
> > + int nid;
> >
> > - ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> > - &dispose, list_lru_count(&nfsd_file_lru));
> > + for_each_node_state(nid, N_NORMAL_MEMORY) {
> > + unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > + &dispose, &nr);
> > + }
> > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>
> Nit: Maybe this trace point should be placed in the loop and changed to
> record the nid as well?
Maybe.
>
>
> > nfsd_file_dispose_list_delayed(&dispose);
>
> Wondering if maybe the nfsd_file_dispose_list_delayed() call should also
> be moved inside the for loop to break up the disposal processing per
> node as well.
>
I don't think it would make much difference. The "disposal" here is
just moving the items onto different lists.
NeilBrown
>
> > }
>
>
> --
> Chuck Lever
>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc()
2025-02-07 5:15 ` [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc() NeilBrown
2025-02-07 14:43 ` Chuck Lever
@ 2025-02-10 13:46 ` Jeff Layton
1 sibling, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-02-10 13:46 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Fri, 2025-02-07 at 16:15 +1100, NeilBrown wrote:
> list_lru_walk() is only useful when the aim is to remove all elements
> from the list_lru. It will repeated visit rotated element of the first
> per-node sublist before proceeding to subsrequent sublists.
>
> This patch changes to use list_lru_walk_node() and list_lru_count_node()
> on each individual node.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 7dc20143c854..04588c03bdfe 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -532,10 +532,14 @@ static void
> nfsd_file_gc(void)
> {
> LIST_HEAD(dispose);
> - unsigned long ret;
> + unsigned long ret = 0;
> + int nid;
>
> - ret = list_lru_walk(&nfsd_file_lru, nfsd_file_lru_cb,
> - &dispose, list_lru_count(&nfsd_file_lru));
> + for_each_node_state(nid, N_NORMAL_MEMORY) {
> + unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> + &dispose, &nr);
> + }
> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> nfsd_file_dispose_list_delayed(&dispose);
> }
It might be worthwhile to make this a generic helper in list_lru.h:
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-07 5:15 [PATCH 0/6] nfsd: filecache: various fixes NeilBrown
` (2 preceding siblings ...)
2025-02-07 5:15 ` [PATCH 3/6] nfsd: filecache: use list_lru_walk_node() in nfsd_file_gc() NeilBrown
@ 2025-02-07 5:15 ` NeilBrown
2025-02-07 14:52 ` Chuck Lever
2025-02-10 14:01 ` Jeff Layton
2025-02-07 5:15 ` [PATCH 5/6] nfsd: filecache: don't repeatedly add/remove files on the lru list NeilBrown
2025-02-07 5:15 ` [PATCH 6/6] nfsd: filecache: drop the list_lru lock during lock gc scans NeilBrown
5 siblings, 2 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-07 5:15 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
The filecache lru is walked in 2 circumstances for 2 different reasons.
1/ When called from the shrinker we want to discard the first few
entries on the list, ignoring any with NFSD_FILE_REFERENCED set
because they should really be at the end of the LRU as they have been
referenced recently. So those ones are ROTATED.
2/ When called from the nfsd_file_gc() timer function we want to discard
anything that hasn't been used since before the previous call, and
mark everything else as unused at this point in time.
Using the same flag for both of these can result in some unexpected
outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
nfsd_file_gc() will think the file hasn't been used in a while, while
really it has.
I think it is easier to reason about the behaviour if we instead have
two flags.
NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
put it there when convenient"
NFSD_FILE_RECENT means "this has been used recently - since the last
run of nfsd_file_gc()
When either caller finds an NFSD_FILE_REFERENCED entry, that entry
should be moved to the end of the LRU and the flag cleared. This can
safely happen at any time. The actual order on the lru might not be
strictly least-recently-used, but that is normal for linux lrus.
The shrinker callback can ignore the "recent" flag. If it ends up
freeing something that is "recent" that simply means that memory
pressure is sufficient to limit the acceptable cache age to less than
the nfsd_file_gc frequency.
The gc caller should primarily focus on NFSD_FILE_RECENT. It should
free everything that doesn't have this flag set, and should clear the
flag on everything else. When it clears the flag it is convenient to
clear the "REFERENCED" flag and move to the end of the LRU too.
With this, calls from the shrinker do not prematurely age files. It
will focus only on freeing those that are least recently used.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 21 +++++++++++++++++++--
fs/nfsd/filecache.h | 1 +
fs/nfsd/trace.h | 3 +++
3 files changed, 23 insertions(+), 2 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 04588c03bdfe..9faf469354a5 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
}
-
static bool nfsd_file_lru_add(struct nfsd_file *nf)
{
set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
+ set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
trace_nfsd_file_lru_add(nf);
return true;
@@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
return LRU_REMOVED;
}
+static enum lru_status
+nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
+ void *arg)
+{
+ struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
+
+ if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
+ /* "REFERENCED" really means "should be at the end of the LRU.
+ * As we are putting it there we can clear the flag
+ */
+ clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
+ trace_nfsd_file_gc_aged(nf);
+ return LRU_ROTATE;
+ }
+ return nfsd_file_lru_cb(item, lru, arg);
+}
+
static void
nfsd_file_gc(void)
{
@@ -537,7 +554,7 @@ nfsd_file_gc(void)
for_each_node_state(nid, N_NORMAL_MEMORY) {
unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
- ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
+ ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
&dispose, &nr);
}
trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index d5db6b34ba30..de5b8aa7fcb0 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -38,6 +38,7 @@ struct nfsd_file {
#define NFSD_FILE_PENDING (1)
#define NFSD_FILE_REFERENCED (2)
#define NFSD_FILE_GC (3)
+#define NFSD_FILE_RECENT (4)
unsigned long nf_flags;
refcount_t nf_ref;
unsigned char nf_may;
diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
index ad2c0c432d08..9af723eeb2b0 100644
--- a/fs/nfsd/trace.h
+++ b/fs/nfsd/trace.h
@@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
{ 1 << NFSD_FILE_HASHED, "HASHED" }, \
{ 1 << NFSD_FILE_PENDING, "PENDING" }, \
{ 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
+ { 1 << NFSD_FILE_RECENT, "RECENT" }, \
{ 1 << NFSD_FILE_GC, "GC" })
DECLARE_EVENT_CLASS(nfsd_file_class,
@@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
+DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
@@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
TP_ARGS(removed, remaining))
DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
+DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
TRACE_EVENT(nfsd_file_close,
--
2.47.1
^ permalink raw reply related [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-07 5:15 ` [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT NeilBrown
@ 2025-02-07 14:52 ` Chuck Lever
2025-02-09 23:23 ` NeilBrown
2025-02-10 14:01 ` Jeff Layton
1 sibling, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2025-02-07 14:52 UTC (permalink / raw)
To: NeilBrown, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On 2/7/25 12:15 AM, NeilBrown wrote:
> The filecache lru is walked in 2 circumstances for 2 different reasons.
>
> 1/ When called from the shrinker we want to discard the first few
> entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> because they should really be at the end of the LRU as they have been
> referenced recently. So those ones are ROTATED.
>
> 2/ When called from the nfsd_file_gc() timer function we want to discard
> anything that hasn't been used since before the previous call, and
> mark everything else as unused at this point in time.
>
> Using the same flag for both of these can result in some unexpected
> outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> nfsd_file_gc() will think the file hasn't been used in a while, while
> really it has.
>
> I think it is easier to reason about the behaviour if we instead have
> two flags.
>
> NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> put it there when convenient"
> NFSD_FILE_RECENT means "this has been used recently - since the last
> run of nfsd_file_gc()
>
> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> should be moved to the end of the LRU and the flag cleared. This can
> safely happen at any time. The actual order on the lru might not be
> strictly least-recently-used, but that is normal for linux lrus.
>
> The shrinker callback can ignore the "recent" flag. If it ends up
> freeing something that is "recent" that simply means that memory
> pressure is sufficient to limit the acceptable cache age to less than
> the nfsd_file_gc frequency.
>
> The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> free everything that doesn't have this flag set, and should clear the
> flag on everything else. When it clears the flag it is convenient to
> clear the "REFERENCED" flag and move to the end of the LRU too.
>
> With this, calls from the shrinker do not prematurely age files. It
> will focus only on freeing those that are least recently used.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> fs/nfsd/filecache.h | 1 +
> fs/nfsd/trace.h | 3 +++
> 3 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 04588c03bdfe..9faf469354a5 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> }
>
> -
> static bool nfsd_file_lru_add(struct nfsd_file *nf)
> {
> set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> trace_nfsd_file_lru_add(nf);
> return true;
> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> return LRU_REMOVED;
> }
>
> +static enum lru_status
> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> + void *arg)
> +{
> + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> +
> + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> + /* "REFERENCED" really means "should be at the end of the LRU.
> + * As we are putting it there we can clear the flag
> + */
> + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> + trace_nfsd_file_gc_aged(nf);
> + return LRU_ROTATE;
> + }
> + return nfsd_file_lru_cb(item, lru, arg);
> +}
> +
> static void
> nfsd_file_gc(void)
> {
> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>
> for_each_node_state(nid, N_NORMAL_MEMORY) {
> unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> &dispose, &nr);
> }
> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index d5db6b34ba30..de5b8aa7fcb0 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -38,6 +38,7 @@ struct nfsd_file {
> #define NFSD_FILE_PENDING (1)
> #define NFSD_FILE_REFERENCED (2)
> #define NFSD_FILE_GC (3)
> +#define NFSD_FILE_RECENT (4)
> unsigned long nf_flags;
> refcount_t nf_ref;
> unsigned char nf_may;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index ad2c0c432d08..9af723eeb2b0 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
> { 1 << NFSD_FILE_GC, "GC" })
>
> DECLARE_EVENT_CLASS(nfsd_file_class,
> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>
> DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> TP_ARGS(removed, remaining))
>
> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>
> TRACE_EVENT(nfsd_file_close,
The other patches in this series look like solid improvements. This one
could be as well, but it will take me some time to understand it.
I am generally in favor of replacing the logic that removes and adds
these items with a single atomic bitop, and I'm happy to see NFSD stick
with the use of an existing LRU facility while documenting its unique
requirements ("nfsd_file_gc_aged" and so on).
I would still prefer the backport to be lighter -- looks like the key
changes are 3/6 and 6/6. Is there any chance the series can be
reorganized to facilitate backporting? I have to ask, and the answer
might be "no", I realize.
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-07 14:52 ` Chuck Lever
@ 2025-02-09 23:23 ` NeilBrown
2025-02-10 0:50 ` Chuck Lever
0 siblings, 1 reply; 25+ messages in thread
From: NeilBrown @ 2025-02-09 23:23 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Dave Chinner
On Sat, 08 Feb 2025, Chuck Lever wrote:
> On 2/7/25 12:15 AM, NeilBrown wrote:
> > The filecache lru is walked in 2 circumstances for 2 different reasons.
> >
> > 1/ When called from the shrinker we want to discard the first few
> > entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > because they should really be at the end of the LRU as they have been
> > referenced recently. So those ones are ROTATED.
> >
> > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > anything that hasn't been used since before the previous call, and
> > mark everything else as unused at this point in time.
> >
> > Using the same flag for both of these can result in some unexpected
> > outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > nfsd_file_gc() will think the file hasn't been used in a while, while
> > really it has.
> >
> > I think it is easier to reason about the behaviour if we instead have
> > two flags.
> >
> > NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > put it there when convenient"
> > NFSD_FILE_RECENT means "this has been used recently - since the last
> > run of nfsd_file_gc()
> >
> > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > should be moved to the end of the LRU and the flag cleared. This can
> > safely happen at any time. The actual order on the lru might not be
> > strictly least-recently-used, but that is normal for linux lrus.
> >
> > The shrinker callback can ignore the "recent" flag. If it ends up
> > freeing something that is "recent" that simply means that memory
> > pressure is sufficient to limit the acceptable cache age to less than
> > the nfsd_file_gc frequency.
> >
> > The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> > free everything that doesn't have this flag set, and should clear the
> > flag on everything else. When it clears the flag it is convenient to
> > clear the "REFERENCED" flag and move to the end of the LRU too.
> >
> > With this, calls from the shrinker do not prematurely age files. It
> > will focus only on freeing those that are least recently used.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> > fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > fs/nfsd/filecache.h | 1 +
> > fs/nfsd/trace.h | 3 +++
> > 3 files changed, 23 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 04588c03bdfe..9faf469354a5 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > }
> >
> > -
> > static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > {
> > set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > trace_nfsd_file_lru_add(nf);
> > return true;
> > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > return LRU_REMOVED;
> > }
> >
> > +static enum lru_status
> > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > + void *arg)
> > +{
> > + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > +
> > + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > + /* "REFERENCED" really means "should be at the end of the LRU.
> > + * As we are putting it there we can clear the flag
> > + */
> > + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > + trace_nfsd_file_gc_aged(nf);
> > + return LRU_ROTATE;
> > + }
> > + return nfsd_file_lru_cb(item, lru, arg);
> > +}
> > +
> > static void
> > nfsd_file_gc(void)
> > {
> > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> >
> > for_each_node_state(nid, N_NORMAL_MEMORY) {
> > unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > &dispose, &nr);
> > }
> > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > index d5db6b34ba30..de5b8aa7fcb0 100644
> > --- a/fs/nfsd/filecache.h
> > +++ b/fs/nfsd/filecache.h
> > @@ -38,6 +38,7 @@ struct nfsd_file {
> > #define NFSD_FILE_PENDING (1)
> > #define NFSD_FILE_REFERENCED (2)
> > #define NFSD_FILE_GC (3)
> > +#define NFSD_FILE_RECENT (4)
> > unsigned long nf_flags;
> > refcount_t nf_ref;
> > unsigned char nf_may;
> > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > index ad2c0c432d08..9af723eeb2b0 100644
> > --- a/fs/nfsd/trace.h
> > +++ b/fs/nfsd/trace.h
> > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> > { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> > { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> > + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
> > { 1 << NFSD_FILE_GC, "GC" })
> >
> > DECLARE_EVENT_CLASS(nfsd_file_class,
> > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> >
> > DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> > TP_ARGS(removed, remaining))
> >
> > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> >
> > TRACE_EVENT(nfsd_file_close,
>
> The other patches in this series look like solid improvements. This one
> could be as well, but it will take me some time to understand it.
>
> I am generally in favor of replacing the logic that removes and adds
> these items with a single atomic bitop, and I'm happy to see NFSD stick
> with the use of an existing LRU facility while documenting its unique
> requirements ("nfsd_file_gc_aged" and so on).
>
> I would still prefer the backport to be lighter -- looks like the key
> changes are 3/6 and 6/6. Is there any chance the series can be
> reorganized to facilitate backporting? I have to ask, and the answer
> might be "no", I realize.
I'm going with "no".
To be honest, I was hoping that the complexity displayed here needed
to work around the assumptions of list_lru what don't match our needs
would be sufficient to convince you that list_lru isn't worth pursuing.
I see that didn't work.
So I am no longer invested in this patch set. You are welcome to use it
if you wish and to make any changes that you think are suitable, but I
don't think it is a good direction to go and will not be offering any
more code changes to support the use of list_lru here.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-09 23:23 ` NeilBrown
@ 2025-02-10 0:50 ` Chuck Lever
2025-02-10 2:31 ` NeilBrown
0 siblings, 1 reply; 25+ messages in thread
From: Chuck Lever @ 2025-02-10 0:50 UTC (permalink / raw)
To: NeilBrown
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Dave Chinner
On 2/9/25 6:23 PM, NeilBrown wrote:
> On Sat, 08 Feb 2025, Chuck Lever wrote:
>> On 2/7/25 12:15 AM, NeilBrown wrote:
>>> The filecache lru is walked in 2 circumstances for 2 different reasons.
>>>
>>> 1/ When called from the shrinker we want to discard the first few
>>> entries on the list, ignoring any with NFSD_FILE_REFERENCED set
>>> because they should really be at the end of the LRU as they have been
>>> referenced recently. So those ones are ROTATED.
>>>
>>> 2/ When called from the nfsd_file_gc() timer function we want to discard
>>> anything that hasn't been used since before the previous call, and
>>> mark everything else as unused at this point in time.
>>>
>>> Using the same flag for both of these can result in some unexpected
>>> outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
>>> nfsd_file_gc() will think the file hasn't been used in a while, while
>>> really it has.
>>>
>>> I think it is easier to reason about the behaviour if we instead have
>>> two flags.
>>>
>>> NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
>>> put it there when convenient"
>>> NFSD_FILE_RECENT means "this has been used recently - since the last
>>> run of nfsd_file_gc()
>>>
>>> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
>>> should be moved to the end of the LRU and the flag cleared. This can
>>> safely happen at any time. The actual order on the lru might not be
>>> strictly least-recently-used, but that is normal for linux lrus.
>>>
>>> The shrinker callback can ignore the "recent" flag. If it ends up
>>> freeing something that is "recent" that simply means that memory
>>> pressure is sufficient to limit the acceptable cache age to less than
>>> the nfsd_file_gc frequency.
>>>
>>> The gc caller should primarily focus on NFSD_FILE_RECENT. It should
>>> free everything that doesn't have this flag set, and should clear the
>>> flag on everything else. When it clears the flag it is convenient to
>>> clear the "REFERENCED" flag and move to the end of the LRU too.
>>>
>>> With this, calls from the shrinker do not prematurely age files. It
>>> will focus only on freeing those that are least recently used.
>>>
>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>> ---
>>> fs/nfsd/filecache.c | 21 +++++++++++++++++++--
>>> fs/nfsd/filecache.h | 1 +
>>> fs/nfsd/trace.h | 3 +++
>>> 3 files changed, 23 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>> index 04588c03bdfe..9faf469354a5 100644
>>> --- a/fs/nfsd/filecache.c
>>> +++ b/fs/nfsd/filecache.c
>>> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
>>> mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>>> }
>>>
>>> -
>>> static bool nfsd_file_lru_add(struct nfsd_file *nf)
>>> {
>>> set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>> + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
>>> if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
>>> trace_nfsd_file_lru_add(nf);
>>> return true;
>>> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>> return LRU_REMOVED;
>>> }
>>>
>>> +static enum lru_status
>>> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
>>> + void *arg)
>>> +{
>>> + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
>>> +
>>> + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
>>> + /* "REFERENCED" really means "should be at the end of the LRU.
>>> + * As we are putting it there we can clear the flag
>>> + */
>>> + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>> + trace_nfsd_file_gc_aged(nf);
>>> + return LRU_ROTATE;
>>> + }
>>> + return nfsd_file_lru_cb(item, lru, arg);
>>> +}
>>> +
>>> static void
>>> nfsd_file_gc(void)
>>> {
>>> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>>>
>>> for_each_node_state(nid, N_NORMAL_MEMORY) {
>>> unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
>>> - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
>>> + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
>>> &dispose, &nr);
>>> }
>>> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
>>> index d5db6b34ba30..de5b8aa7fcb0 100644
>>> --- a/fs/nfsd/filecache.h
>>> +++ b/fs/nfsd/filecache.h
>>> @@ -38,6 +38,7 @@ struct nfsd_file {
>>> #define NFSD_FILE_PENDING (1)
>>> #define NFSD_FILE_REFERENCED (2)
>>> #define NFSD_FILE_GC (3)
>>> +#define NFSD_FILE_RECENT (4)
>>> unsigned long nf_flags;
>>> refcount_t nf_ref;
>>> unsigned char nf_may;
>>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>>> index ad2c0c432d08..9af723eeb2b0 100644
>>> --- a/fs/nfsd/trace.h
>>> +++ b/fs/nfsd/trace.h
>>> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
>>> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
>>> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
>>> { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
>>> + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
>>> { 1 << NFSD_FILE_GC, "GC" })
>>>
>>> DECLARE_EVENT_CLASS(nfsd_file_class,
>>> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
>>> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>>>
>>> DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
>>> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
>>> TP_ARGS(removed, remaining))
>>>
>>> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
>>> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
>>> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>>>
>>> TRACE_EVENT(nfsd_file_close,
>>
>> The other patches in this series look like solid improvements. This one
>> could be as well, but it will take me some time to understand it.
>>
>> I am generally in favor of replacing the logic that removes and adds
>> these items with a single atomic bitop, and I'm happy to see NFSD stick
>> with the use of an existing LRU facility while documenting its unique
>> requirements ("nfsd_file_gc_aged" and so on).
>>
>> I would still prefer the backport to be lighter -- looks like the key
>> changes are 3/6 and 6/6. Is there any chance the series can be
>> reorganized to facilitate backporting? I have to ask, and the answer
>> might be "no", I realize.
>
> I'm going with "no".
> To be honest, I was hoping that the complexity displayed here needed
> to work around the assumptions of list_lru what don't match our needs
> would be sufficient to convince you that list_lru isn't worth pursuing.
> I see that didn't work.
Fair enough.
> So I am no longer invested in this patch set. You are welcome to use it
> if you wish and to make any changes that you think are suitable, but I
> don't think it is a good direction to go and will not be offering any
> more code changes to support the use of list_lru here.
If I may observe, you haven't offered a compelling explanation of why an
imperfect fit between list_lru and the filecache adds more technical
debt than does the introduction of a bespoke LRU mechanism.
I'm open to that argument, but I need stronger rationale (or performance
data) to back it up. So far I can agree that the defect rate in this
area is somewhat abnormal, but that seems to be because we don't
understand how to use the list_lru API to its best advantage.
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-10 0:50 ` Chuck Lever
@ 2025-02-10 2:31 ` NeilBrown
2025-02-10 14:25 ` Jeff Layton
2025-02-10 14:26 ` Chuck Lever
0 siblings, 2 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-10 2:31 UTC (permalink / raw)
To: Chuck Lever
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Dave Chinner
On Mon, 10 Feb 2025, Chuck Lever wrote:
> On 2/9/25 6:23 PM, NeilBrown wrote:
> > On Sat, 08 Feb 2025, Chuck Lever wrote:
> >> On 2/7/25 12:15 AM, NeilBrown wrote:
> >>> The filecache lru is walked in 2 circumstances for 2 different reasons.
> >>>
> >>> 1/ When called from the shrinker we want to discard the first few
> >>> entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> >>> because they should really be at the end of the LRU as they have been
> >>> referenced recently. So those ones are ROTATED.
> >>>
> >>> 2/ When called from the nfsd_file_gc() timer function we want to discard
> >>> anything that hasn't been used since before the previous call, and
> >>> mark everything else as unused at this point in time.
> >>>
> >>> Using the same flag for both of these can result in some unexpected
> >>> outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> >>> nfsd_file_gc() will think the file hasn't been used in a while, while
> >>> really it has.
> >>>
> >>> I think it is easier to reason about the behaviour if we instead have
> >>> two flags.
> >>>
> >>> NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> >>> put it there when convenient"
> >>> NFSD_FILE_RECENT means "this has been used recently - since the last
> >>> run of nfsd_file_gc()
> >>>
> >>> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> >>> should be moved to the end of the LRU and the flag cleared. This can
> >>> safely happen at any time. The actual order on the lru might not be
> >>> strictly least-recently-used, but that is normal for linux lrus.
> >>>
> >>> The shrinker callback can ignore the "recent" flag. If it ends up
> >>> freeing something that is "recent" that simply means that memory
> >>> pressure is sufficient to limit the acceptable cache age to less than
> >>> the nfsd_file_gc frequency.
> >>>
> >>> The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> >>> free everything that doesn't have this flag set, and should clear the
> >>> flag on everything else. When it clears the flag it is convenient to
> >>> clear the "REFERENCED" flag and move to the end of the LRU too.
> >>>
> >>> With this, calls from the shrinker do not prematurely age files. It
> >>> will focus only on freeing those that are least recently used.
> >>>
> >>> Signed-off-by: NeilBrown <neilb@suse.de>
> >>> ---
> >>> fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> >>> fs/nfsd/filecache.h | 1 +
> >>> fs/nfsd/trace.h | 3 +++
> >>> 3 files changed, 23 insertions(+), 2 deletions(-)
> >>>
> >>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> >>> index 04588c03bdfe..9faf469354a5 100644
> >>> --- a/fs/nfsd/filecache.c
> >>> +++ b/fs/nfsd/filecache.c
> >>> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> >>> mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> >>> }
> >>>
> >>> -
> >>> static bool nfsd_file_lru_add(struct nfsd_file *nf)
> >>> {
> >>> set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> >>> + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> >>> if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> >>> trace_nfsd_file_lru_add(nf);
> >>> return true;
> >>> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> >>> return LRU_REMOVED;
> >>> }
> >>>
> >>> +static enum lru_status
> >>> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> >>> + void *arg)
> >>> +{
> >>> + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> >>> +
> >>> + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> >>> + /* "REFERENCED" really means "should be at the end of the LRU.
> >>> + * As we are putting it there we can clear the flag
> >>> + */
> >>> + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> >>> + trace_nfsd_file_gc_aged(nf);
> >>> + return LRU_ROTATE;
> >>> + }
> >>> + return nfsd_file_lru_cb(item, lru, arg);
> >>> +}
> >>> +
> >>> static void
> >>> nfsd_file_gc(void)
> >>> {
> >>> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> >>>
> >>> for_each_node_state(nid, N_NORMAL_MEMORY) {
> >>> unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> >>> - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> >>> + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> >>> &dispose, &nr);
> >>> }
> >>> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> >>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> >>> index d5db6b34ba30..de5b8aa7fcb0 100644
> >>> --- a/fs/nfsd/filecache.h
> >>> +++ b/fs/nfsd/filecache.h
> >>> @@ -38,6 +38,7 @@ struct nfsd_file {
> >>> #define NFSD_FILE_PENDING (1)
> >>> #define NFSD_FILE_REFERENCED (2)
> >>> #define NFSD_FILE_GC (3)
> >>> +#define NFSD_FILE_RECENT (4)
> >>> unsigned long nf_flags;
> >>> refcount_t nf_ref;
> >>> unsigned char nf_may;
> >>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> >>> index ad2c0c432d08..9af723eeb2b0 100644
> >>> --- a/fs/nfsd/trace.h
> >>> +++ b/fs/nfsd/trace.h
> >>> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> >>> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> >>> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> >>> { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> >>> + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
> >>> { 1 << NFSD_FILE_GC, "GC" })
> >>>
> >>> DECLARE_EVENT_CLASS(nfsd_file_class,
> >>> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> >>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> >>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> >>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> >>> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> >>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> >>>
> >>> DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> >>> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> >>> TP_ARGS(removed, remaining))
> >>>
> >>> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> >>> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> >>> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> >>>
> >>> TRACE_EVENT(nfsd_file_close,
> >>
> >> The other patches in this series look like solid improvements. This one
> >> could be as well, but it will take me some time to understand it.
> >>
> >> I am generally in favor of replacing the logic that removes and adds
> >> these items with a single atomic bitop, and I'm happy to see NFSD stick
> >> with the use of an existing LRU facility while documenting its unique
> >> requirements ("nfsd_file_gc_aged" and so on).
> >>
> >> I would still prefer the backport to be lighter -- looks like the key
> >> changes are 3/6 and 6/6. Is there any chance the series can be
> >> reorganized to facilitate backporting? I have to ask, and the answer
> >> might be "no", I realize.
> >
> > I'm going with "no".
> > To be honest, I was hoping that the complexity displayed here needed
> > to work around the assumptions of list_lru what don't match our needs
> > would be sufficient to convince you that list_lru isn't worth pursuing.
> > I see that didn't work.
>
> Fair enough.
>
>
> > So I am no longer invested in this patch set. You are welcome to use it
> > if you wish and to make any changes that you think are suitable, but I
> > don't think it is a good direction to go and will not be offering any
> > more code changes to support the use of list_lru here.
>
> If I may observe, you haven't offered a compelling explanation of why an
> imperfect fit between list_lru and the filecache adds more technical
> debt than does the introduction of a bespoke LRU mechanism.
>
> I'm open to that argument, but I need stronger rationale (or performance
> data) to back it up. So far I can agree that the defect rate in this
> area is somewhat abnormal, but that seems to be because we don't
> understand how to use the list_lru API to its best advantage.
>
I would characterise the cause of the defect rate differently.
I would say it is because we are treating this as an lru-style problem
when it isn't an lru-style problem. list_lru is great for lrus. That
isn't what we have.
What we have is a desire to keep files open between consecutive IO
requests without any clear indication of when we have seen the last in a
series of IO requests. So we make a policy decision "keep files open
until there have been no IOs for 2 seconds - then close them".
This is a good and sensible policy that nothing to do with the "LRU"
concept.
We implement this policy by keeping all unused files on a list, set a
flag every time the file is used, clearing the flag on a timer tick
(every 2 seconds) and closing anything which still has the flag cleared
2 seconds later.
Still nothing in this description that is at all related to LRU
concepts.
Now we decide that it would be good the add a shinker - fair enough as
we don't *need* these to remain. How should the shrinker choose files
to close? It probably doesn't matter beyond avoiding files that still
have the not-timed-out flag set.
But we try to also impose an LRU disciple over the list, and we use
list_lru.
The interfaces for list_lru() are well documented but the intent is
not. Most users of list_lru (gfs2/quota might be an exception) only
explicitly delete things from the lru when it is time to discard them
completely. They rely on the shrinker to detect things that are in use
again, and to remove them. And possibly to detect things that have been
referenced and to rotate them. But if the shrinker doesn't run because
there isn't much memory pressure they are just left alone.
This is what list_lru is optimised for - for shrinker driven scanning
which skips or removes or rotates things that can't or shouldn't
be freed, and frees others. You would expect to normally only scan a
small fraction of the list, because realistically you want to keep most
of them.
For filecache we don't want to keep them very long. So I think it
matters a lot less what we choose for shrinking. I'm tempted to suggest
we don't bother with the shrinker. Old files will be discarded soon
anyway if they aren't used, and slowness in memory allocation (due to
any memory pressure) will naturally slow down the addition of new files
to the cache. So the cache will shrink naturally.
I'm not 100% certain of that, but I do think that the needs of the
shrinker should not dominate the design as they currently do.
Note that maybe we *don't* need to close files so quickly. Maybe we
could discard the whole timer thing, and then it would make sense to use
list_lru(). What is the cost of keeping them open?
All I can think of is that it affects unlink. An unlinked file won't be
removed while there is a reference to the inode. Maybe we should
address that by taking a lease on the file while it is in the
filecache?? When the lease is broken, we discard the file from the
cache.
If that could work (which might involve creating a new internal lease
type that is only broken on unlink), then we could remove the timeout
and leave files in the cache indefinitely. Then it would make perfect
sense to use list_lru() because the problem would start to look exactly
like an LRU problem. But I don't think that is what we have today.
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-10 2:31 ` NeilBrown
@ 2025-02-10 14:25 ` Jeff Layton
2025-02-12 22:39 ` NeilBrown
2025-02-10 14:26 ` Chuck Lever
1 sibling, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2025-02-10 14:25 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Mon, 2025-02-10 at 13:31 +1100, NeilBrown wrote:
> On Mon, 10 Feb 2025, Chuck Lever wrote:
> > On 2/9/25 6:23 PM, NeilBrown wrote:
> > > On Sat, 08 Feb 2025, Chuck Lever wrote:
> > > > On 2/7/25 12:15 AM, NeilBrown wrote:
> > > > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > > >
> > > > > 1/ When called from the shrinker we want to discard the first few
> > > > > entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > > > because they should really be at the end of the LRU as they have been
> > > > > referenced recently. So those ones are ROTATED.
> > > > >
> > > > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > > > anything that hasn't been used since before the previous call, and
> > > > > mark everything else as unused at this point in time.
> > > > >
> > > > > Using the same flag for both of these can result in some unexpected
> > > > > outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > > > really it has.
> > > > >
> > > > > I think it is easier to reason about the behaviour if we instead have
> > > > > two flags.
> > > > >
> > > > > NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > > > put it there when convenient"
> > > > > NFSD_FILE_RECENT means "this has been used recently - since the last
> > > > > run of nfsd_file_gc()
> > > > >
> > > > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > > > should be moved to the end of the LRU and the flag cleared. This can
> > > > > safely happen at any time. The actual order on the lru might not be
> > > > > strictly least-recently-used, but that is normal for linux lrus.
> > > > >
> > > > > The shrinker callback can ignore the "recent" flag. If it ends up
> > > > > freeing something that is "recent" that simply means that memory
> > > > > pressure is sufficient to limit the acceptable cache age to less than
> > > > > the nfsd_file_gc frequency.
> > > > >
> > > > > The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> > > > > free everything that doesn't have this flag set, and should clear the
> > > > > flag on everything else. When it clears the flag it is convenient to
> > > > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > > >
> > > > > With this, calls from the shrinker do not prematurely age files. It
> > > > > will focus only on freeing those that are least recently used.
> > > > >
> > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > ---
> > > > > fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > > > fs/nfsd/filecache.h | 1 +
> > > > > fs/nfsd/trace.h | 3 +++
> > > > > 3 files changed, 23 insertions(+), 2 deletions(-)
> > > > >
> > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > index 04588c03bdfe..9faf469354a5 100644
> > > > > --- a/fs/nfsd/filecache.c
> > > > > +++ b/fs/nfsd/filecache.c
> > > > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > > mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > > }
> > > > >
> > > > > -
> > > > > static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > > {
> > > > > set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > > > > if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > > > > trace_nfsd_file_lru_add(nf);
> > > > > return true;
> > > > > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > return LRU_REMOVED;
> > > > > }
> > > > >
> > > > > +static enum lru_status
> > > > > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > + void *arg)
> > > > > +{
> > > > > + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > > > +
> > > > > + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > > > > + /* "REFERENCED" really means "should be at the end of the LRU.
> > > > > + * As we are putting it there we can clear the flag
> > > > > + */
> > > > > + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > + trace_nfsd_file_gc_aged(nf);
> > > > > + return LRU_ROTATE;
> > > > > + }
> > > > > + return nfsd_file_lru_cb(item, lru, arg);
> > > > > +}
> > > > > +
> > > > > static void
> > > > > nfsd_file_gc(void)
> > > > > {
> > > > > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> > > > >
> > > > > for_each_node_state(nid, N_NORMAL_MEMORY) {
> > > > > unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > > > > - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > > > > + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > > > > &dispose, &nr);
> > > > > }
> > > > > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > > > > index d5db6b34ba30..de5b8aa7fcb0 100644
> > > > > --- a/fs/nfsd/filecache.h
> > > > > +++ b/fs/nfsd/filecache.h
> > > > > @@ -38,6 +38,7 @@ struct nfsd_file {
> > > > > #define NFSD_FILE_PENDING (1)
> > > > > #define NFSD_FILE_REFERENCED (2)
> > > > > #define NFSD_FILE_GC (3)
> > > > > +#define NFSD_FILE_RECENT (4)
> > > > > unsigned long nf_flags;
> > > > > refcount_t nf_ref;
> > > > > unsigned char nf_may;
> > > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > > index ad2c0c432d08..9af723eeb2b0 100644
> > > > > --- a/fs/nfsd/trace.h
> > > > > +++ b/fs/nfsd/trace.h
> > > > > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > > > > { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> > > > > { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> > > > > { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> > > > > + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
> > > > > { 1 << NFSD_FILE_GC, "GC" })
> > > > >
> > > > > DECLARE_EVENT_CLASS(nfsd_file_class,
> > > > > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > > > > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> > > > >
> > > > > DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > > > > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> > > > > TP_ARGS(removed, remaining))
> > > > >
> > > > > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > > > > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > > > > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> > > > >
> > > > > TRACE_EVENT(nfsd_file_close,
> > > >
> > > > The other patches in this series look like solid improvements. This one
> > > > could be as well, but it will take me some time to understand it.
> > > >
> > > > I am generally in favor of replacing the logic that removes and adds
> > > > these items with a single atomic bitop, and I'm happy to see NFSD stick
> > > > with the use of an existing LRU facility while documenting its unique
> > > > requirements ("nfsd_file_gc_aged" and so on).
> > > >
> > > > I would still prefer the backport to be lighter -- looks like the key
> > > > changes are 3/6 and 6/6. Is there any chance the series can be
> > > > reorganized to facilitate backporting? I have to ask, and the answer
> > > > might be "no", I realize.
> > >
> > > I'm going with "no".
> > > To be honest, I was hoping that the complexity displayed here needed
> > > to work around the assumptions of list_lru what don't match our needs
> > > would be sufficient to convince you that list_lru isn't worth pursuing.
> > > I see that didn't work.
> >
> > Fair enough.
> >
> >
> > > So I am no longer invested in this patch set. You are welcome to use it
> > > if you wish and to make any changes that you think are suitable, but I
> > > don't think it is a good direction to go and will not be offering any
> > > more code changes to support the use of list_lru here.
> >
> > If I may observe, you haven't offered a compelling explanation of why an
> > imperfect fit between list_lru and the filecache adds more technical
> > debt than does the introduction of a bespoke LRU mechanism.
> >
> > I'm open to that argument, but I need stronger rationale (or performance
> > data) to back it up. So far I can agree that the defect rate in this
> > area is somewhat abnormal, but that seems to be because we don't
> > understand how to use the list_lru API to its best advantage.
> >
>
> I would characterise the cause of the defect rate differently.
> I would say it is because we are treating this as an lru-style problem
> when it isn't an lru-style problem. list_lru is great for lrus. That
> isn't what we have.
>
> What we have is a desire to keep files open between consecutive IO
> requests without any clear indication of when we have seen the last in a
> series of IO requests. So we make a policy decision "keep files open
> until there have been no IOs for 2 seconds - then close them".
> This is a good and sensible policy that nothing to do with the "LRU"
> concept.
>
> We implement this policy by keeping all unused files on a list, set a
> flag every time the file is used, clearing the flag on a timer tick
> (every 2 seconds) and closing anything which still has the flag cleared
> 2 seconds later.
>
> Still nothing in this description that is at all related to LRU
> concepts.
>
> Now we decide that it would be good the add a shinker - fair enough as
> we don't *need* these to remain. How should the shrinker choose files
> to close? It probably doesn't matter beyond avoiding files that still
> have the not-timed-out flag set.
>
> But we try to also impose an LRU disciple over the list, and we use
> list_lru.
> The interfaces for list_lru() are well documented but the intent is
> not. Most users of list_lru (gfs2/quota might be an exception) only
> explicitly delete things from the lru when it is time to discard them
> completely. They rely on the shrinker to detect things that are in use
> again, and to remove them. And possibly to detect things that have been
> referenced and to rotate them. But if the shrinker doesn't run because
> there isn't much memory pressure they are just left alone.
>
> This is what list_lru is optimised for - for shrinker driven scanning
> which skips or removes or rotates things that can't or shouldn't
> be freed, and frees others. You would expect to normally only scan a
> small fraction of the list, because realistically you want to keep most
> of them.
>
> For filecache we don't want to keep them very long. So I think it
> matters a lot less what we choose for shrinking. I'm tempted to suggest
> we don't bother with the shrinker. Old files will be discarded soon
> anyway if they aren't used, and slowness in memory allocation (due to
> any memory pressure) will naturally slow down the addition of new files
> to the cache. So the cache will shrink naturally.
>
> I'm not 100% certain of that, but I do think that the needs of the
> shrinker should not dominate the design as they currently do.
>
> Note that maybe we *don't* need to close files so quickly. Maybe we
> could discard the whole timer thing, and then it would make sense to use
> list_lru(). What is the cost of keeping them open?
>
> All I can think of is that it affects unlink. An unlinked file won't be
> removed while there is a reference to the inode. Maybe we should
> address that by taking a lease on the file while it is in the
> filecache?? When the lease is broken, we discard the file from the
> cache.
It may also affect other applications trying to take out leases. The
filecache has the nfsd_file_lease_notifier that tells it when someone
is trying to take out a lease on a file. That happens then it will try
to close the file first.
>
> If that could work (which might involve creating a new internal lease
> type that is only broken on unlink), then we could remove the timeout
> and leave files in the cache indefinitely. Then it would make perfect
> sense to use list_lru() because the problem would start to look exactly
> like an LRU problem. But I don't think that is what we have today.
>
The filecache already sets a fsnotify_mark on the inode to watch for
its i_nlink to go to 0, and then removes it from the cache when that
happens. I think we could keep these files open for quite a bit longer
if we chose to do so.
One thing that Chuck has brought up a few times is that maybe we should
consider making v4 not use the filecache at all. If that simplifies
things then that might be a good thing to consider as well.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-10 14:25 ` Jeff Layton
@ 2025-02-12 22:39 ` NeilBrown
2025-02-13 0:08 ` Chuck Lever
2025-02-13 11:27 ` Jeff Layton
0 siblings, 2 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-12 22:39 UTC (permalink / raw)
To: Jeff Layton
Cc: Chuck Lever, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Dave Chinner
On Tue, 11 Feb 2025, Jeff Layton wrote:
> On Mon, 2025-02-10 at 13:31 +1100, NeilBrown wrote:
> > On Mon, 10 Feb 2025, Chuck Lever wrote:
> > > On 2/9/25 6:23 PM, NeilBrown wrote:
> > > > On Sat, 08 Feb 2025, Chuck Lever wrote:
> > > > > On 2/7/25 12:15 AM, NeilBrown wrote:
> > > > > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > > > >
> > > > > > 1/ When called from the shrinker we want to discard the first few
> > > > > > entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > > > > because they should really be at the end of the LRU as they have been
> > > > > > referenced recently. So those ones are ROTATED.
> > > > > >
> > > > > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > > > > anything that hasn't been used since before the previous call, and
> > > > > > mark everything else as unused at this point in time.
> > > > > >
> > > > > > Using the same flag for both of these can result in some unexpected
> > > > > > outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > > > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > > > > really it has.
> > > > > >
> > > > > > I think it is easier to reason about the behaviour if we instead have
> > > > > > two flags.
> > > > > >
> > > > > > NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > > > > put it there when convenient"
> > > > > > NFSD_FILE_RECENT means "this has been used recently - since the last
> > > > > > run of nfsd_file_gc()
> > > > > >
> > > > > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > > > > should be moved to the end of the LRU and the flag cleared. This can
> > > > > > safely happen at any time. The actual order on the lru might not be
> > > > > > strictly least-recently-used, but that is normal for linux lrus.
> > > > > >
> > > > > > The shrinker callback can ignore the "recent" flag. If it ends up
> > > > > > freeing something that is "recent" that simply means that memory
> > > > > > pressure is sufficient to limit the acceptable cache age to less than
> > > > > > the nfsd_file_gc frequency.
> > > > > >
> > > > > > The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> > > > > > free everything that doesn't have this flag set, and should clear the
> > > > > > flag on everything else. When it clears the flag it is convenient to
> > > > > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > > > >
> > > > > > With this, calls from the shrinker do not prematurely age files. It
> > > > > > will focus only on freeing those that are least recently used.
> > > > > >
> > > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > > ---
> > > > > > fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > > > > fs/nfsd/filecache.h | 1 +
> > > > > > fs/nfsd/trace.h | 3 +++
> > > > > > 3 files changed, 23 insertions(+), 2 deletions(-)
> > > > > >
> > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > > index 04588c03bdfe..9faf469354a5 100644
> > > > > > --- a/fs/nfsd/filecache.c
> > > > > > +++ b/fs/nfsd/filecache.c
> > > > > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > > > mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > > > }
> > > > > >
> > > > > > -
> > > > > > static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > > > {
> > > > > > set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > > > > > if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > > > > > trace_nfsd_file_lru_add(nf);
> > > > > > return true;
> > > > > > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > > return LRU_REMOVED;
> > > > > > }
> > > > > >
> > > > > > +static enum lru_status
> > > > > > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > > + void *arg)
> > > > > > +{
> > > > > > + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > > > > +
> > > > > > + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > > > > > + /* "REFERENCED" really means "should be at the end of the LRU.
> > > > > > + * As we are putting it there we can clear the flag
> > > > > > + */
> > > > > > + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > + trace_nfsd_file_gc_aged(nf);
> > > > > > + return LRU_ROTATE;
> > > > > > + }
> > > > > > + return nfsd_file_lru_cb(item, lru, arg);
> > > > > > +}
> > > > > > +
> > > > > > static void
> > > > > > nfsd_file_gc(void)
> > > > > > {
> > > > > > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> > > > > >
> > > > > > for_each_node_state(nid, N_NORMAL_MEMORY) {
> > > > > > unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > > > > > - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > > > > > + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > > > > > &dispose, &nr);
> > > > > > }
> > > > > > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > > > > > index d5db6b34ba30..de5b8aa7fcb0 100644
> > > > > > --- a/fs/nfsd/filecache.h
> > > > > > +++ b/fs/nfsd/filecache.h
> > > > > > @@ -38,6 +38,7 @@ struct nfsd_file {
> > > > > > #define NFSD_FILE_PENDING (1)
> > > > > > #define NFSD_FILE_REFERENCED (2)
> > > > > > #define NFSD_FILE_GC (3)
> > > > > > +#define NFSD_FILE_RECENT (4)
> > > > > > unsigned long nf_flags;
> > > > > > refcount_t nf_ref;
> > > > > > unsigned char nf_may;
> > > > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > > > index ad2c0c432d08..9af723eeb2b0 100644
> > > > > > --- a/fs/nfsd/trace.h
> > > > > > +++ b/fs/nfsd/trace.h
> > > > > > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > > > > > { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> > > > > > { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> > > > > > { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> > > > > > + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
> > > > > > { 1 << NFSD_FILE_GC, "GC" })
> > > > > >
> > > > > > DECLARE_EVENT_CLASS(nfsd_file_class,
> > > > > > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > > > > > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> > > > > >
> > > > > > DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > > > > > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> > > > > > TP_ARGS(removed, remaining))
> > > > > >
> > > > > > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > > > > > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > > > > > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> > > > > >
> > > > > > TRACE_EVENT(nfsd_file_close,
> > > > >
> > > > > The other patches in this series look like solid improvements. This one
> > > > > could be as well, but it will take me some time to understand it.
> > > > >
> > > > > I am generally in favor of replacing the logic that removes and adds
> > > > > these items with a single atomic bitop, and I'm happy to see NFSD stick
> > > > > with the use of an existing LRU facility while documenting its unique
> > > > > requirements ("nfsd_file_gc_aged" and so on).
> > > > >
> > > > > I would still prefer the backport to be lighter -- looks like the key
> > > > > changes are 3/6 and 6/6. Is there any chance the series can be
> > > > > reorganized to facilitate backporting? I have to ask, and the answer
> > > > > might be "no", I realize.
> > > >
> > > > I'm going with "no".
> > > > To be honest, I was hoping that the complexity displayed here needed
> > > > to work around the assumptions of list_lru what don't match our needs
> > > > would be sufficient to convince you that list_lru isn't worth pursuing.
> > > > I see that didn't work.
> > >
> > > Fair enough.
> > >
> > >
> > > > So I am no longer invested in this patch set. You are welcome to use it
> > > > if you wish and to make any changes that you think are suitable, but I
> > > > don't think it is a good direction to go and will not be offering any
> > > > more code changes to support the use of list_lru here.
> > >
> > > If I may observe, you haven't offered a compelling explanation of why an
> > > imperfect fit between list_lru and the filecache adds more technical
> > > debt than does the introduction of a bespoke LRU mechanism.
> > >
> > > I'm open to that argument, but I need stronger rationale (or performance
> > > data) to back it up. So far I can agree that the defect rate in this
> > > area is somewhat abnormal, but that seems to be because we don't
> > > understand how to use the list_lru API to its best advantage.
> > >
> >
> > I would characterise the cause of the defect rate differently.
> > I would say it is because we are treating this as an lru-style problem
> > when it isn't an lru-style problem. list_lru is great for lrus. That
> > isn't what we have.
> >
> > What we have is a desire to keep files open between consecutive IO
> > requests without any clear indication of when we have seen the last in a
> > series of IO requests. So we make a policy decision "keep files open
> > until there have been no IOs for 2 seconds - then close them".
> > This is a good and sensible policy that nothing to do with the "LRU"
> > concept.
> >
> > We implement this policy by keeping all unused files on a list, set a
> > flag every time the file is used, clearing the flag on a timer tick
> > (every 2 seconds) and closing anything which still has the flag cleared
> > 2 seconds later.
> >
> > Still nothing in this description that is at all related to LRU
> > concepts.
> >
> > Now we decide that it would be good the add a shinker - fair enough as
> > we don't *need* these to remain. How should the shrinker choose files
> > to close? It probably doesn't matter beyond avoiding files that still
> > have the not-timed-out flag set.
> >
> > But we try to also impose an LRU disciple over the list, and we use
> > list_lru.
> > The interfaces for list_lru() are well documented but the intent is
> > not. Most users of list_lru (gfs2/quota might be an exception) only
> > explicitly delete things from the lru when it is time to discard them
> > completely. They rely on the shrinker to detect things that are in use
> > again, and to remove them. And possibly to detect things that have been
> > referenced and to rotate them. But if the shrinker doesn't run because
> > there isn't much memory pressure they are just left alone.
> >
> > This is what list_lru is optimised for - for shrinker driven scanning
> > which skips or removes or rotates things that can't or shouldn't
> > be freed, and frees others. You would expect to normally only scan a
> > small fraction of the list, because realistically you want to keep most
> > of them.
> >
> > For filecache we don't want to keep them very long. So I think it
> > matters a lot less what we choose for shrinking. I'm tempted to suggest
> > we don't bother with the shrinker. Old files will be discarded soon
> > anyway if they aren't used, and slowness in memory allocation (due to
> > any memory pressure) will naturally slow down the addition of new files
> > to the cache. So the cache will shrink naturally.
> >
> > I'm not 100% certain of that, but I do think that the needs of the
> > shrinker should not dominate the design as they currently do.
> >
> > Note that maybe we *don't* need to close files so quickly. Maybe we
> > could discard the whole timer thing, and then it would make sense to use
> > list_lru(). What is the cost of keeping them open?
> >
> > All I can think of is that it affects unlink. An unlinked file won't be
> > removed while there is a reference to the inode. Maybe we should
> > address that by taking a lease on the file while it is in the
> > filecache?? When the lease is broken, we discard the file from the
> > cache.
>
>
> It may also affect other applications trying to take out leases. The
> filecache has the nfsd_file_lease_notifier that tells it when someone
> is trying to take out a lease on a file. That happens then it will try
> to close the file first.
It wouldn't have to be an FL_LEASE lease. We could invent a new thing:
FL_CACHED which gets added to the locks list and triggers a notification
whenever anyone tries to break leases.
The nfsd_file_lease_notifier is interesting... I wasn't aware of that.
>
> >
> > If that could work (which might involve creating a new internal lease
> > type that is only broken on unlink), then we could remove the timeout
> > and leave files in the cache indefinitely. Then it would make perfect
> > sense to use list_lru() because the problem would start to look exactly
> > like an LRU problem. But I don't think that is what we have today.
> >
>
> The filecache already sets a fsnotify_mark on the inode to watch for
> its i_nlink to go to 0, and then removes it from the cache when that
> happens. I think we could keep these files open for quite a bit longer
> if we chose to do so.
Chuck mentioned that holding he v3 files open longer could interfere
with v4 DENY_READ etc opens. But I think they only every test for other
v4 opens. A v3 (or local) open never blocks a v4 DENY open. Does that
match your understanding?
Do you know of any other reason that we currently time out files after
2-4 seconds?
>
> One thing that Chuck has brought up a few times is that maybe we should
> consider making v4 not use the filecache at all. If that simplifies
> things then that might be a good thing to consider as well.
As v4 stores the file with the open state it shouldn't need the cache.
Maybe the filecache makes life a bit easier for localio?
I don't know that it would make the garbage-collection/shrinking any
simpler but it does superficially seem like an unnecessary indirection.
Do you know of any specific benefit it brings v4?
Thanks,
NeilBrown
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-12 22:39 ` NeilBrown
@ 2025-02-13 0:08 ` Chuck Lever
2025-02-13 11:27 ` Jeff Layton
1 sibling, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2025-02-13 0:08 UTC (permalink / raw)
To: NeilBrown, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On 2/12/25 5:39 PM, NeilBrown wrote:
> On Tue, 11 Feb 2025, Jeff Layton wrote:
>> One thing that Chuck has brought up a few times is that maybe we should
>> consider making v4 not use the filecache at all. If that simplifies
>> things then that might be a good thing to consider as well.
>
> As v4 stores the file with the open state it shouldn't need the cache.
> Maybe the filecache makes life a bit easier for localio?
>
> I don't know that it would make the garbage-collection/shrinking any
> simpler but it does superficially seem like an unnecessary indirection.
> Do you know of any specific benefit it brings v4?
One benefit is that the fs/nfsd/vfs.c APIs can all deal with nfsd_file's
no matter which NFS version is in use.
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-12 22:39 ` NeilBrown
2025-02-13 0:08 ` Chuck Lever
@ 2025-02-13 11:27 ` Jeff Layton
1 sibling, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-02-13 11:27 UTC (permalink / raw)
To: NeilBrown
Cc: Chuck Lever, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Dave Chinner
On Thu, 2025-02-13 at 09:39 +1100, NeilBrown wrote:
> On Tue, 11 Feb 2025, Jeff Layton wrote:
> > On Mon, 2025-02-10 at 13:31 +1100, NeilBrown wrote:
> > > On Mon, 10 Feb 2025, Chuck Lever wrote:
> > > > On 2/9/25 6:23 PM, NeilBrown wrote:
> > > > > On Sat, 08 Feb 2025, Chuck Lever wrote:
> > > > > > On 2/7/25 12:15 AM, NeilBrown wrote:
> > > > > > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > > > > > >
> > > > > > > 1/ When called from the shrinker we want to discard the first few
> > > > > > > entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > > > > > because they should really be at the end of the LRU as they have been
> > > > > > > referenced recently. So those ones are ROTATED.
> > > > > > >
> > > > > > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > > > > > anything that hasn't been used since before the previous call, and
> > > > > > > mark everything else as unused at this point in time.
> > > > > > >
> > > > > > > Using the same flag for both of these can result in some unexpected
> > > > > > > outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > > > > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > > > > > really it has.
> > > > > > >
> > > > > > > I think it is easier to reason about the behaviour if we instead have
> > > > > > > two flags.
> > > > > > >
> > > > > > > NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > > > > > put it there when convenient"
> > > > > > > NFSD_FILE_RECENT means "this has been used recently - since the last
> > > > > > > run of nfsd_file_gc()
> > > > > > >
> > > > > > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > > > > > should be moved to the end of the LRU and the flag cleared. This can
> > > > > > > safely happen at any time. The actual order on the lru might not be
> > > > > > > strictly least-recently-used, but that is normal for linux lrus.
> > > > > > >
> > > > > > > The shrinker callback can ignore the "recent" flag. If it ends up
> > > > > > > freeing something that is "recent" that simply means that memory
> > > > > > > pressure is sufficient to limit the acceptable cache age to less than
> > > > > > > the nfsd_file_gc frequency.
> > > > > > >
> > > > > > > The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> > > > > > > free everything that doesn't have this flag set, and should clear the
> > > > > > > flag on everything else. When it clears the flag it is convenient to
> > > > > > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > > > > > >
> > > > > > > With this, calls from the shrinker do not prematurely age files. It
> > > > > > > will focus only on freeing those that are least recently used.
> > > > > > >
> > > > > > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > > > > > ---
> > > > > > > fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > > > > > fs/nfsd/filecache.h | 1 +
> > > > > > > fs/nfsd/trace.h | 3 +++
> > > > > > > 3 files changed, 23 insertions(+), 2 deletions(-)
> > > > > > >
> > > > > > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > > > > > index 04588c03bdfe..9faf469354a5 100644
> > > > > > > --- a/fs/nfsd/filecache.c
> > > > > > > +++ b/fs/nfsd/filecache.c
> > > > > > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > > > > > mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > > > > > }
> > > > > > >
> > > > > > > -
> > > > > > > static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > > > > > {
> > > > > > > set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > > + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> > > > > > > if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> > > > > > > trace_nfsd_file_lru_add(nf);
> > > > > > > return true;
> > > > > > > @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > > > return LRU_REMOVED;
> > > > > > > }
> > > > > > >
> > > > > > > +static enum lru_status
> > > > > > > +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> > > > > > > + void *arg)
> > > > > > > +{
> > > > > > > + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> > > > > > > +
> > > > > > > + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> > > > > > > + /* "REFERENCED" really means "should be at the end of the LRU.
> > > > > > > + * As we are putting it there we can clear the flag
> > > > > > > + */
> > > > > > > + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > > > > > + trace_nfsd_file_gc_aged(nf);
> > > > > > > + return LRU_ROTATE;
> > > > > > > + }
> > > > > > > + return nfsd_file_lru_cb(item, lru, arg);
> > > > > > > +}
> > > > > > > +
> > > > > > > static void
> > > > > > > nfsd_file_gc(void)
> > > > > > > {
> > > > > > > @@ -537,7 +554,7 @@ nfsd_file_gc(void)
> > > > > > >
> > > > > > > for_each_node_state(nid, N_NORMAL_MEMORY) {
> > > > > > > unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> > > > > > > - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> > > > > > > + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> > > > > > > &dispose, &nr);
> > > > > > > }
> > > > > > > trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> > > > > > > diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> > > > > > > index d5db6b34ba30..de5b8aa7fcb0 100644
> > > > > > > --- a/fs/nfsd/filecache.h
> > > > > > > +++ b/fs/nfsd/filecache.h
> > > > > > > @@ -38,6 +38,7 @@ struct nfsd_file {
> > > > > > > #define NFSD_FILE_PENDING (1)
> > > > > > > #define NFSD_FILE_REFERENCED (2)
> > > > > > > #define NFSD_FILE_GC (3)
> > > > > > > +#define NFSD_FILE_RECENT (4)
> > > > > > > unsigned long nf_flags;
> > > > > > > refcount_t nf_ref;
> > > > > > > unsigned char nf_may;
> > > > > > > diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> > > > > > > index ad2c0c432d08..9af723eeb2b0 100644
> > > > > > > --- a/fs/nfsd/trace.h
> > > > > > > +++ b/fs/nfsd/trace.h
> > > > > > > @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> > > > > > > { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> > > > > > > { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> > > > > > > { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> > > > > > > + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
> > > > > > > { 1 << NFSD_FILE_GC, "GC" })
> > > > > > >
> > > > > > > DECLARE_EVENT_CLASS(nfsd_file_class,
> > > > > > > @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> > > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> > > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> > > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> > > > > > > +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> > > > > > > DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
> > > > > > >
> > > > > > > DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> > > > > > > @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> > > > > > > TP_ARGS(removed, remaining))
> > > > > > >
> > > > > > > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> > > > > > > +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> > > > > > > DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
> > > > > > >
> > > > > > > TRACE_EVENT(nfsd_file_close,
> > > > > >
> > > > > > The other patches in this series look like solid improvements. This one
> > > > > > could be as well, but it will take me some time to understand it.
> > > > > >
> > > > > > I am generally in favor of replacing the logic that removes and adds
> > > > > > these items with a single atomic bitop, and I'm happy to see NFSD stick
> > > > > > with the use of an existing LRU facility while documenting its unique
> > > > > > requirements ("nfsd_file_gc_aged" and so on).
> > > > > >
> > > > > > I would still prefer the backport to be lighter -- looks like the key
> > > > > > changes are 3/6 and 6/6. Is there any chance the series can be
> > > > > > reorganized to facilitate backporting? I have to ask, and the answer
> > > > > > might be "no", I realize.
> > > > >
> > > > > I'm going with "no".
> > > > > To be honest, I was hoping that the complexity displayed here needed
> > > > > to work around the assumptions of list_lru what don't match our needs
> > > > > would be sufficient to convince you that list_lru isn't worth pursuing.
> > > > > I see that didn't work.
> > > >
> > > > Fair enough.
> > > >
> > > >
> > > > > So I am no longer invested in this patch set. You are welcome to use it
> > > > > if you wish and to make any changes that you think are suitable, but I
> > > > > don't think it is a good direction to go and will not be offering any
> > > > > more code changes to support the use of list_lru here.
> > > >
> > > > If I may observe, you haven't offered a compelling explanation of why an
> > > > imperfect fit between list_lru and the filecache adds more technical
> > > > debt than does the introduction of a bespoke LRU mechanism.
> > > >
> > > > I'm open to that argument, but I need stronger rationale (or performance
> > > > data) to back it up. So far I can agree that the defect rate in this
> > > > area is somewhat abnormal, but that seems to be because we don't
> > > > understand how to use the list_lru API to its best advantage.
> > > >
> > >
> > > I would characterise the cause of the defect rate differently.
> > > I would say it is because we are treating this as an lru-style problem
> > > when it isn't an lru-style problem. list_lru is great for lrus. That
> > > isn't what we have.
> > >
> > > What we have is a desire to keep files open between consecutive IO
> > > requests without any clear indication of when we have seen the last in a
> > > series of IO requests. So we make a policy decision "keep files open
> > > until there have been no IOs for 2 seconds - then close them".
> > > This is a good and sensible policy that nothing to do with the "LRU"
> > > concept.
> > >
> > > We implement this policy by keeping all unused files on a list, set a
> > > flag every time the file is used, clearing the flag on a timer tick
> > > (every 2 seconds) and closing anything which still has the flag cleared
> > > 2 seconds later.
> > >
> > > Still nothing in this description that is at all related to LRU
> > > concepts.
> > >
> > > Now we decide that it would be good the add a shinker - fair enough as
> > > we don't *need* these to remain. How should the shrinker choose files
> > > to close? It probably doesn't matter beyond avoiding files that still
> > > have the not-timed-out flag set.
> > >
> > > But we try to also impose an LRU disciple over the list, and we use
> > > list_lru.
> > > The interfaces for list_lru() are well documented but the intent is
> > > not. Most users of list_lru (gfs2/quota might be an exception) only
> > > explicitly delete things from the lru when it is time to discard them
> > > completely. They rely on the shrinker to detect things that are in use
> > > again, and to remove them. And possibly to detect things that have been
> > > referenced and to rotate them. But if the shrinker doesn't run because
> > > there isn't much memory pressure they are just left alone.
> > >
> > > This is what list_lru is optimised for - for shrinker driven scanning
> > > which skips or removes or rotates things that can't or shouldn't
> > > be freed, and frees others. You would expect to normally only scan a
> > > small fraction of the list, because realistically you want to keep most
> > > of them.
> > >
> > > For filecache we don't want to keep them very long. So I think it
> > > matters a lot less what we choose for shrinking. I'm tempted to suggest
> > > we don't bother with the shrinker. Old files will be discarded soon
> > > anyway if they aren't used, and slowness in memory allocation (due to
> > > any memory pressure) will naturally slow down the addition of new files
> > > to the cache. So the cache will shrink naturally.
> > >
> > > I'm not 100% certain of that, but I do think that the needs of the
> > > shrinker should not dominate the design as they currently do.
> > >
> > > Note that maybe we *don't* need to close files so quickly. Maybe we
> > > could discard the whole timer thing, and then it would make sense to use
> > > list_lru(). What is the cost of keeping them open?
> > >
> > > All I can think of is that it affects unlink. An unlinked file won't be
> > > removed while there is a reference to the inode. Maybe we should
> > > address that by taking a lease on the file while it is in the
> > > filecache?? When the lease is broken, we discard the file from the
> > > cache.
> >
> >
> > It may also affect other applications trying to take out leases. The
> > filecache has the nfsd_file_lease_notifier that tells it when someone
> > is trying to take out a lease on a file. That happens then it will try
> > to close the file first.
>
> It wouldn't have to be an FL_LEASE lease. We could invent a new thing:
> FL_CACHED which gets added to the locks list and triggers a notification
> whenever anyone tries to break leases.
>
> The nfsd_file_lease_notifier is interesting... I wasn't aware of that.
>
>
> >
> > >
> > > If that could work (which might involve creating a new internal lease
> > > type that is only broken on unlink), then we could remove the timeout
> > > and leave files in the cache indefinitely. Then it would make perfect
> > > sense to use list_lru() because the problem would start to look exactly
> > > like an LRU problem. But I don't think that is what we have today.
> > >
> >
> > The filecache already sets a fsnotify_mark on the inode to watch for
> > its i_nlink to go to 0, and then removes it from the cache when that
> > happens. I think we could keep these files open for quite a bit longer
> > if we chose to do so.
>
> Chuck mentioned that holding he v3 files open longer could interfere
> with v4 DENY_READ etc opens. But I think they only every test for other
> v4 opens. A v3 (or local) open never blocks a v4 DENY open. Does that
> match your understanding?
>
Yes, that's correct. nfsd's share/deny locking doesn't extend past
other NFSv4 files, so that shouldn't be a problem.
> Do you know of any other reason that we currently time out files after
> 2-4 seconds?
>
No. The basic idea was to keep files open for "a little while" so that
when we're doing v3 READ/WRITE operations we can avoid having to do the
entire open/close. ~2-4s was considered long enough for that "little
while". We could easily keep them open longer, particularly since we
have some mechanisms to close them when needed for competing access
(leases and unlink activity, mostly).
> >
> > One thing that Chuck has brought up a few times is that maybe we should
> > consider making v4 not use the filecache at all. If that simplifies
> > things then that might be a good thing to consider as well.
>
> As v4 stores the file with the open state it shouldn't need the cache.
> Maybe the filecache makes life a bit easier for localio?
>
> I don't know that it would make the garbage-collection/shrinking any
> simpler but it does superficially seem like an unnecessary indirection.
> Do you know of any specific benefit it brings v4?
>
None that I can think of.
Originally, I had thought that we would keep both v2/3 and v4 files in
the cache and that they would be shared. It hasn't really worked out
that way, so we could drop v4 files from the filecache altogether.
The only real "problem" is that struct nfsd_file is referenced in a lot
of nfsv4 code, so we'd need to do a big conversion back to it using
regular struct file pointers instead.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-10 2:31 ` NeilBrown
2025-02-10 14:25 ` Jeff Layton
@ 2025-02-10 14:26 ` Chuck Lever
1 sibling, 0 replies; 25+ messages in thread
From: Chuck Lever @ 2025-02-10 14:26 UTC (permalink / raw)
To: NeilBrown
Cc: Jeff Layton, linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey,
Dave Chinner
On 2/9/25 9:31 PM, NeilBrown wrote:
> On Mon, 10 Feb 2025, Chuck Lever wrote:
>> On 2/9/25 6:23 PM, NeilBrown wrote:
>>> On Sat, 08 Feb 2025, Chuck Lever wrote:
>>>> On 2/7/25 12:15 AM, NeilBrown wrote:
>>>>> The filecache lru is walked in 2 circumstances for 2 different reasons.
>>>>>
>>>>> 1/ When called from the shrinker we want to discard the first few
>>>>> entries on the list, ignoring any with NFSD_FILE_REFERENCED set
>>>>> because they should really be at the end of the LRU as they have been
>>>>> referenced recently. So those ones are ROTATED.
>>>>>
>>>>> 2/ When called from the nfsd_file_gc() timer function we want to discard
>>>>> anything that hasn't been used since before the previous call, and
>>>>> mark everything else as unused at this point in time.
>>>>>
>>>>> Using the same flag for both of these can result in some unexpected
>>>>> outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
>>>>> nfsd_file_gc() will think the file hasn't been used in a while, while
>>>>> really it has.
>>>>>
>>>>> I think it is easier to reason about the behaviour if we instead have
>>>>> two flags.
>>>>>
>>>>> NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
>>>>> put it there when convenient"
>>>>> NFSD_FILE_RECENT means "this has been used recently - since the last
>>>>> run of nfsd_file_gc()
>>>>>
>>>>> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
>>>>> should be moved to the end of the LRU and the flag cleared. This can
>>>>> safely happen at any time. The actual order on the lru might not be
>>>>> strictly least-recently-used, but that is normal for linux lrus.
>>>>>
>>>>> The shrinker callback can ignore the "recent" flag. If it ends up
>>>>> freeing something that is "recent" that simply means that memory
>>>>> pressure is sufficient to limit the acceptable cache age to less than
>>>>> the nfsd_file_gc frequency.
>>>>>
>>>>> The gc caller should primarily focus on NFSD_FILE_RECENT. It should
>>>>> free everything that doesn't have this flag set, and should clear the
>>>>> flag on everything else. When it clears the flag it is convenient to
>>>>> clear the "REFERENCED" flag and move to the end of the LRU too.
>>>>>
>>>>> With this, calls from the shrinker do not prematurely age files. It
>>>>> will focus only on freeing those that are least recently used.
>>>>>
>>>>> Signed-off-by: NeilBrown <neilb@suse.de>
>>>>> ---
>>>>> fs/nfsd/filecache.c | 21 +++++++++++++++++++--
>>>>> fs/nfsd/filecache.h | 1 +
>>>>> fs/nfsd/trace.h | 3 +++
>>>>> 3 files changed, 23 insertions(+), 2 deletions(-)
>>>>>
>>>>> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
>>>>> index 04588c03bdfe..9faf469354a5 100644
>>>>> --- a/fs/nfsd/filecache.c
>>>>> +++ b/fs/nfsd/filecache.c
>>>>> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
>>>>> mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
>>>>> }
>>>>>
>>>>> -
>>>>> static bool nfsd_file_lru_add(struct nfsd_file *nf)
>>>>> {
>>>>> set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>>>> + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
>>>>> if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
>>>>> trace_nfsd_file_lru_add(nf);
>>>>> return true;
>>>>> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
>>>>> return LRU_REMOVED;
>>>>> }
>>>>>
>>>>> +static enum lru_status
>>>>> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
>>>>> + void *arg)
>>>>> +{
>>>>> + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
>>>>> +
>>>>> + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
>>>>> + /* "REFERENCED" really means "should be at the end of the LRU.
>>>>> + * As we are putting it there we can clear the flag
>>>>> + */
>>>>> + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
>>>>> + trace_nfsd_file_gc_aged(nf);
>>>>> + return LRU_ROTATE;
>>>>> + }
>>>>> + return nfsd_file_lru_cb(item, lru, arg);
>>>>> +}
>>>>> +
>>>>> static void
>>>>> nfsd_file_gc(void)
>>>>> {
>>>>> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>>>>>
>>>>> for_each_node_state(nid, N_NORMAL_MEMORY) {
>>>>> unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
>>>>> - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
>>>>> + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
>>>>> &dispose, &nr);
>>>>> }
>>>>> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
>>>>> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
>>>>> index d5db6b34ba30..de5b8aa7fcb0 100644
>>>>> --- a/fs/nfsd/filecache.h
>>>>> +++ b/fs/nfsd/filecache.h
>>>>> @@ -38,6 +38,7 @@ struct nfsd_file {
>>>>> #define NFSD_FILE_PENDING (1)
>>>>> #define NFSD_FILE_REFERENCED (2)
>>>>> #define NFSD_FILE_GC (3)
>>>>> +#define NFSD_FILE_RECENT (4)
>>>>> unsigned long nf_flags;
>>>>> refcount_t nf_ref;
>>>>> unsigned char nf_may;
>>>>> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
>>>>> index ad2c0c432d08..9af723eeb2b0 100644
>>>>> --- a/fs/nfsd/trace.h
>>>>> +++ b/fs/nfsd/trace.h
>>>>> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
>>>>> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
>>>>> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
>>>>> { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
>>>>> + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
>>>>> { 1 << NFSD_FILE_GC, "GC" })
>>>>>
>>>>> DECLARE_EVENT_CLASS(nfsd_file_class,
>>>>> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
>>>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
>>>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
>>>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
>>>>> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
>>>>> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>>>>>
>>>>> DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
>>>>> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
>>>>> TP_ARGS(removed, remaining))
>>>>>
>>>>> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
>>>>> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
>>>>> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>>>>>
>>>>> TRACE_EVENT(nfsd_file_close,
>>>>
>>>> The other patches in this series look like solid improvements. This one
>>>> could be as well, but it will take me some time to understand it.
>>>>
>>>> I am generally in favor of replacing the logic that removes and adds
>>>> these items with a single atomic bitop, and I'm happy to see NFSD stick
>>>> with the use of an existing LRU facility while documenting its unique
>>>> requirements ("nfsd_file_gc_aged" and so on).
>>>>
>>>> I would still prefer the backport to be lighter -- looks like the key
>>>> changes are 3/6 and 6/6. Is there any chance the series can be
>>>> reorganized to facilitate backporting? I have to ask, and the answer
>>>> might be "no", I realize.
>>>
>>> I'm going with "no".
>>> To be honest, I was hoping that the complexity displayed here needed
>>> to work around the assumptions of list_lru what don't match our needs
>>> would be sufficient to convince you that list_lru isn't worth pursuing.
>>> I see that didn't work.
>>
>> Fair enough.
>>
>>
>>> So I am no longer invested in this patch set. You are welcome to use it
>>> if you wish and to make any changes that you think are suitable, but I
>>> don't think it is a good direction to go and will not be offering any
>>> more code changes to support the use of list_lru here.
>>
>> If I may observe, you haven't offered a compelling explanation of why an
>> imperfect fit between list_lru and the filecache adds more technical
>> debt than does the introduction of a bespoke LRU mechanism.
>>
>> I'm open to that argument, but I need stronger rationale (or performance
>> data) to back it up. So far I can agree that the defect rate in this
>> area is somewhat abnormal, but that seems to be because we don't
>> understand how to use the list_lru API to its best advantage.
>>
>
> I would characterise the cause of the defect rate differently.
> I would say it is because we are treating this as an lru-style problem
> when it isn't an lru-style problem. list_lru is great for lrus. That
> isn't what we have.
>
> What we have is a desire to keep files open between consecutive IO
> requests without any clear indication of when we have seen the last in a
> series of IO requests. So we make a policy decision "keep files open
> until there have been no IOs for 2 seconds - then close them".
> This is a good and sensible policy that nothing to do with the "LRU"
> concept.
>
> We implement this policy by keeping all unused files on a list, set a
> flag every time the file is used, clearing the flag on a timer tick
> (every 2 seconds) and closing anything which still has the flag cleared
> 2 seconds later.
>
> Still nothing in this description that is at all related to LRU
> concepts.
>
> Now we decide that it would be good the add a shinker - fair enough as
> we don't *need* these to remain. How should the shrinker choose files
> to close? It probably doesn't matter beyond avoiding files that still
> have the not-timed-out flag set.
>
> But we try to also impose an LRU disciple over the list, and we use
> list_lru.
> The interfaces for list_lru() are well documented but the intent is
> not. Most users of list_lru (gfs2/quota might be an exception) only
> explicitly delete things from the lru when it is time to discard them
> completely. They rely on the shrinker to detect things that are in use
> again, and to remove them. And possibly to detect things that have been
> referenced and to rotate them. But if the shrinker doesn't run because
> there isn't much memory pressure they are just left alone.
>
> This is what list_lru is optimised for - for shrinker driven scanning
> which skips or removes or rotates things that can't or shouldn't
> be freed, and frees others. You would expect to normally only scan a
> small fraction of the list, because realistically you want to keep most
> of them.
>
> For filecache we don't want to keep them very long. So I think it
> matters a lot less what we choose for shrinking. I'm tempted to suggest
> we don't bother with the shrinker. Old files will be discarded soon
> anyway if they aren't used, and slowness in memory allocation (due to
> any memory pressure) will naturally slow down the addition of new files
> to the cache. So the cache will shrink naturally.
>
> I'm not 100% certain of that, but I do think that the needs of the
> shrinker should not dominate the design as they currently do.
I'm willing to consider removing the shrinker, but IMO we can't have it
both ways: if we remove the shrinker because items age out quickly, then
we can't then leave them in the cache indefinitely.
I would need to be convinced that the filecache cannot grow without
bound -- there can't be any pathological cases here, and competing
namespaces can't starve each other.
> Note that maybe we *don't* need to close files so quickly. Maybe we
> could discard the whole timer thing, and then it would make sense to use
> list_lru(). What is the cost of keeping them open?
The reason to age these items quickly is because we don't want open
NFSv3 files to block conflicting NFSv4 opens. Again, I think that
contraindicates leaving items in the filecache (or, at least, leaving
them open) indefinitely.
> All I can think of is that it affects unlink. An unlinked file won't be
> removed while there is a reference to the inode. Maybe we should
> address that by taking a lease on the file while it is in the
> filecache?? When the lease is broken, we discard the file from the
> cache.
>
> If that could work (which might involve creating a new internal lease
> type that is only broken on unlink), then we could remove the timeout
> and leave files in the cache indefinitely. Then it would make perfect
> sense to use list_lru() because the problem would start to look exactly
> like an LRU problem. But I don't think that is what we have today.
IMO, filecache does utilize the LRU characteristic of list_lru: the
shrinker uses that characteristic to select files that may be closed
early. Early on, ISTR there was a cap on the number of items that could
be cached -- again, that kind of mechanism would make direct use of an
LRU to evict only older items when the cache is full. So there are
technical rationales for the current code structure, but they might have
become historical as the code evolved.
--
Chuck Lever
^ permalink raw reply [flat|nested] 25+ messages in thread
* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-07 5:15 ` [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT NeilBrown
2025-02-07 14:52 ` Chuck Lever
@ 2025-02-10 14:01 ` Jeff Layton
2025-02-10 23:57 ` Dave Chinner
1 sibling, 1 reply; 25+ messages in thread
From: Jeff Layton @ 2025-02-10 14:01 UTC (permalink / raw)
To: NeilBrown, Chuck Lever
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
On Fri, 2025-02-07 at 16:15 +1100, NeilBrown wrote:
> The filecache lru is walked in 2 circumstances for 2 different reasons.
>
> 1/ When called from the shrinker we want to discard the first few
> entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> because they should really be at the end of the LRU as they have been
> referenced recently. So those ones are ROTATED.
>
> 2/ When called from the nfsd_file_gc() timer function we want to discard
> anything that hasn't been used since before the previous call, and
> mark everything else as unused at this point in time.
>
> Using the same flag for both of these can result in some unexpected
> outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> nfsd_file_gc() will think the file hasn't been used in a while, while
> really it has.
>
> I think it is easier to reason about the behaviour if we instead have
> two flags.
>
> NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> put it there when convenient"
> NFSD_FILE_RECENT means "this has been used recently - since the last
> run of nfsd_file_gc()
>
> When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> should be moved to the end of the LRU and the flag cleared. This can
> safely happen at any time. The actual order on the lru might not be
> strictly least-recently-used, but that is normal for linux lrus.
>
> The shrinker callback can ignore the "recent" flag. If it ends up
> freeing something that is "recent" that simply means that memory
> pressure is sufficient to limit the acceptable cache age to less than
> the nfsd_file_gc frequency.
>
> The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> free everything that doesn't have this flag set, and should clear the
> flag on everything else. When it clears the flag it is convenient to
> clear the "REFERENCED" flag and move to the end of the LRU too.
>
> With this, calls from the shrinker do not prematurely age files. It
> will focus only on freeing those that are least recently used.
>
> Signed-off-by: NeilBrown <neilb@suse.de>
> ---
> fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> fs/nfsd/filecache.h | 1 +
> fs/nfsd/trace.h | 3 +++
> 3 files changed, 23 insertions(+), 2 deletions(-)
>
> diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> index 04588c03bdfe..9faf469354a5 100644
> --- a/fs/nfsd/filecache.c
> +++ b/fs/nfsd/filecache.c
> @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> }
>
> -
> static bool nfsd_file_lru_add(struct nfsd_file *nf)
> {
> set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
Technically, I don't think you need the REFERENCED bit at all. This is
the only place it's set, and below this is calling list_lru_add_obj().
That returns false if the object was already on a per-node LRU.
Instead of that, you could add a list_lru helper that will rotate the
object to the end of its nodelist if it's already on one. OTOH, that
might mean more cross NUMA-node accesses to the spinlocks than we get
by using a flag and doing this at GC time.
> if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
> trace_nfsd_file_lru_add(nf);
> return true;
> @@ -528,6 +528,23 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
> return LRU_REMOVED;
> }
>
> +static enum lru_status
> +nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
> + void *arg)
> +{
> + struct nfsd_file *nf = list_entry(item, struct nfsd_file, nf_lru);
> +
> + if (test_and_clear_bit(NFSD_FILE_RECENT, &nf->nf_flags)) {
> + /* "REFERENCED" really means "should be at the end of the LRU.
> + * As we are putting it there we can clear the flag
> + */
> + clear_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> + trace_nfsd_file_gc_aged(nf);
> + return LRU_ROTATE;
> + }
> + return nfsd_file_lru_cb(item, lru, arg);
> +}
> +
> static void
> nfsd_file_gc(void)
> {
> @@ -537,7 +554,7 @@ nfsd_file_gc(void)
>
> for_each_node_state(nid, N_NORMAL_MEMORY) {
> unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
> - ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_lru_cb,
> + ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
> &dispose, &nr);
> }
> trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
> diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
> index d5db6b34ba30..de5b8aa7fcb0 100644
> --- a/fs/nfsd/filecache.h
> +++ b/fs/nfsd/filecache.h
> @@ -38,6 +38,7 @@ struct nfsd_file {
> #define NFSD_FILE_PENDING (1)
> #define NFSD_FILE_REFERENCED (2)
> #define NFSD_FILE_GC (3)
> +#define NFSD_FILE_RECENT (4)
> unsigned long nf_flags;
> refcount_t nf_ref;
> unsigned char nf_may;
> diff --git a/fs/nfsd/trace.h b/fs/nfsd/trace.h
> index ad2c0c432d08..9af723eeb2b0 100644
> --- a/fs/nfsd/trace.h
> +++ b/fs/nfsd/trace.h
> @@ -1039,6 +1039,7 @@ DEFINE_CLID_EVENT(confirmed_r);
> { 1 << NFSD_FILE_HASHED, "HASHED" }, \
> { 1 << NFSD_FILE_PENDING, "PENDING" }, \
> { 1 << NFSD_FILE_REFERENCED, "REFERENCED" }, \
> + { 1 << NFSD_FILE_RECENT, "RECENT" }, \
> { 1 << NFSD_FILE_GC, "GC" })
>
> DECLARE_EVENT_CLASS(nfsd_file_class,
> @@ -1317,6 +1318,7 @@ DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_lru_del_disposed);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_in_use);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_writeback);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_referenced);
> +DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_aged);
> DEFINE_NFSD_FILE_GC_EVENT(nfsd_file_gc_disposed);
>
> DECLARE_EVENT_CLASS(nfsd_file_lruwalk_class,
> @@ -1346,6 +1348,7 @@ DEFINE_EVENT(nfsd_file_lruwalk_class, name, \
> TP_ARGS(removed, remaining))
>
> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_removed);
> +DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_gc_recent);
> DEFINE_NFSD_FILE_LRUWALK_EVENT(nfsd_file_shrinker_removed);
>
> TRACE_EVENT(nfsd_file_close,
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-10 14:01 ` Jeff Layton
@ 2025-02-10 23:57 ` Dave Chinner
2025-02-11 11:38 ` Jeff Layton
0 siblings, 1 reply; 25+ messages in thread
From: Dave Chinner @ 2025-02-10 23:57 UTC (permalink / raw)
To: Jeff Layton
Cc: NeilBrown, Chuck Lever, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Mon, Feb 10, 2025 at 09:01:41AM -0500, Jeff Layton wrote:
> On Fri, 2025-02-07 at 16:15 +1100, NeilBrown wrote:
> > The filecache lru is walked in 2 circumstances for 2 different reasons.
> >
> > 1/ When called from the shrinker we want to discard the first few
> > entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > because they should really be at the end of the LRU as they have been
> > referenced recently. So those ones are ROTATED.
> >
> > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > anything that hasn't been used since before the previous call, and
> > mark everything else as unused at this point in time.
> >
> > Using the same flag for both of these can result in some unexpected
> > outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > nfsd_file_gc() will think the file hasn't been used in a while, while
> > really it has.
> >
> > I think it is easier to reason about the behaviour if we instead have
> > two flags.
> >
> > NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > put it there when convenient"
> > NFSD_FILE_RECENT means "this has been used recently - since the last
> > run of nfsd_file_gc()
> >
> > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > should be moved to the end of the LRU and the flag cleared. This can
> > safely happen at any time. The actual order on the lru might not be
> > strictly least-recently-used, but that is normal for linux lrus.
> >
> > The shrinker callback can ignore the "recent" flag. If it ends up
> > freeing something that is "recent" that simply means that memory
> > pressure is sufficient to limit the acceptable cache age to less than
> > the nfsd_file_gc frequency.
> >
> > The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> > free everything that doesn't have this flag set, and should clear the
> > flag on everything else. When it clears the flag it is convenient to
> > clear the "REFERENCED" flag and move to the end of the LRU too.
> >
> > With this, calls from the shrinker do not prematurely age files. It
> > will focus only on freeing those that are least recently used.
> >
> > Signed-off-by: NeilBrown <neilb@suse.de>
> > ---
> > fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > fs/nfsd/filecache.h | 1 +
> > fs/nfsd/trace.h | 3 +++
> > 3 files changed, 23 insertions(+), 2 deletions(-)
> >
> > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > index 04588c03bdfe..9faf469354a5 100644
> > --- a/fs/nfsd/filecache.c
> > +++ b/fs/nfsd/filecache.c
> > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > }
> >
> > -
> > static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > {
> > set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
>
> Technically, I don't think you need the REFERENCED bit at all. This is
> the only place it's set, and below this is calling list_lru_add_obj().
> That returns false if the object was already on a per-node LRU.
>
> Instead of that, you could add a list_lru helper that will rotate the
> object to the end of its nodelist if it's already on one. OTOH, that
> might mean more cross NUMA-node accesses to the spinlocks than we get
> by using a flag and doing this at GC time.
No, please don't.
Per-object reference bits are required to enable lazy LRU rotation.
The LRU lists are -hot- objects; touching them every time we touch
an object on the LRU is prohibitively expensive because of exclusive
lock/cacheline contention. Hence we defer operations like rotation
to a context where we already have the list locked and cached
exclusively for some other reason (i.e. memory reclaim).
This is the same reason we use lazy removal from LRUs - it avoids
LRU list manipulations every time a hot cached object is accessed
and/or dropped.
IOWs, removing the per-object NFSD_FILE_REFERENCED bit will undo one
of the necessary the optimisations that allow hot caches LRU
management to work efficiently with minimal overhead.
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 25+ messages in thread* Re: [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT
2025-02-10 23:57 ` Dave Chinner
@ 2025-02-11 11:38 ` Jeff Layton
0 siblings, 0 replies; 25+ messages in thread
From: Jeff Layton @ 2025-02-11 11:38 UTC (permalink / raw)
To: Dave Chinner
Cc: NeilBrown, Chuck Lever, linux-nfs, Olga Kornievskaia, Dai Ngo,
Tom Talpey
On Tue, 2025-02-11 at 10:57 +1100, Dave Chinner wrote:
> On Mon, Feb 10, 2025 at 09:01:41AM -0500, Jeff Layton wrote:
> > On Fri, 2025-02-07 at 16:15 +1100, NeilBrown wrote:
> > > The filecache lru is walked in 2 circumstances for 2 different reasons.
> > >
> > > 1/ When called from the shrinker we want to discard the first few
> > > entries on the list, ignoring any with NFSD_FILE_REFERENCED set
> > > because they should really be at the end of the LRU as they have been
> > > referenced recently. So those ones are ROTATED.
> > >
> > > 2/ When called from the nfsd_file_gc() timer function we want to discard
> > > anything that hasn't been used since before the previous call, and
> > > mark everything else as unused at this point in time.
> > >
> > > Using the same flag for both of these can result in some unexpected
> > > outcomes. If the shrinker callback clears NFSD_FILE_REFERENCED then the
> > > nfsd_file_gc() will think the file hasn't been used in a while, while
> > > really it has.
> > >
> > > I think it is easier to reason about the behaviour if we instead have
> > > two flags.
> > >
> > > NFSD_FILE_REFERENCED means "this should be at the end of the LRU, please
> > > put it there when convenient"
> > > NFSD_FILE_RECENT means "this has been used recently - since the last
> > > run of nfsd_file_gc()
> > >
> > > When either caller finds an NFSD_FILE_REFERENCED entry, that entry
> > > should be moved to the end of the LRU and the flag cleared. This can
> > > safely happen at any time. The actual order on the lru might not be
> > > strictly least-recently-used, but that is normal for linux lrus.
> > >
> > > The shrinker callback can ignore the "recent" flag. If it ends up
> > > freeing something that is "recent" that simply means that memory
> > > pressure is sufficient to limit the acceptable cache age to less than
> > > the nfsd_file_gc frequency.
> > >
> > > The gc caller should primarily focus on NFSD_FILE_RECENT. It should
> > > free everything that doesn't have this flag set, and should clear the
> > > flag on everything else. When it clears the flag it is convenient to
> > > clear the "REFERENCED" flag and move to the end of the LRU too.
> > >
> > > With this, calls from the shrinker do not prematurely age files. It
> > > will focus only on freeing those that are least recently used.
> > >
> > > Signed-off-by: NeilBrown <neilb@suse.de>
> > > ---
> > > fs/nfsd/filecache.c | 21 +++++++++++++++++++--
> > > fs/nfsd/filecache.h | 1 +
> > > fs/nfsd/trace.h | 3 +++
> > > 3 files changed, 23 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
> > > index 04588c03bdfe..9faf469354a5 100644
> > > --- a/fs/nfsd/filecache.c
> > > +++ b/fs/nfsd/filecache.c
> > > @@ -318,10 +318,10 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
> > > mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
> > > }
> > >
> > > -
> > > static bool nfsd_file_lru_add(struct nfsd_file *nf)
> > > {
> > > set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
> > > + set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
> >
> > Technically, I don't think you need the REFERENCED bit at all. This is
> > the only place it's set, and below this is calling list_lru_add_obj().
> > That returns false if the object was already on a per-node LRU.
> >
> > Instead of that, you could add a list_lru helper that will rotate the
> > object to the end of its nodelist if it's already on one. OTOH, that
> > might mean more cross NUMA-node accesses to the spinlocks than we get
> > by using a flag and doing this at GC time.
>
> No, please don't.
>
> Per-object reference bits are required to enable lazy LRU rotation.
> The LRU lists are -hot- objects; touching them every time we touch
> an object on the LRU is prohibitively expensive because of exclusive
> lock/cacheline contention. Hence we defer operations like rotation
> to a context where we already have the list locked and cached
> exclusively for some other reason (i.e. memory reclaim).
>
> This is the same reason we use lazy removal from LRUs - it avoids
> LRU list manipulations every time a hot cached object is accessed
> and/or dropped.
>
> IOWs, removing the per-object NFSD_FILE_REFERENCED bit will undo one
> of the necessary the optimisations that allow hot caches LRU
> management to work efficiently with minimal overhead.
>
Yep, that was the point of my "OTOH" comment. Keeping the REFERENCED
flag is better from a "let's minimize cacheline invalidations"
standpoint.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 25+ messages in thread
* [PATCH 5/6] nfsd: filecache: don't repeatedly add/remove files on the lru list
2025-02-07 5:15 [PATCH 0/6] nfsd: filecache: various fixes NeilBrown
` (3 preceding siblings ...)
2025-02-07 5:15 ` [PATCH 4/6] nfsd: filecache: introduce NFSD_FILE_RECENT NeilBrown
@ 2025-02-07 5:15 ` NeilBrown
2025-02-07 5:15 ` [PATCH 6/6] nfsd: filecache: drop the list_lru lock during lock gc scans NeilBrown
5 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-07 5:15 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
There is no need to remove a file from the lru every time we access it,
and then add it back. It is sufficient to set the REFERENCED flag every
time we put the file. The order in the lru of REFERENCED files is
largely irrelevant as they will all be moved to the end.
With this patch, files are only added when they are allocated (if
want_gc) and they are only removed by the list_lru_(shrink_)walk
callback or when forcibly removing a file.
This should reduce contention on the list_lru spinlock(s) and reduce
memory traffic a little.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 51 +++++++++++++++------------------------------
1 file changed, 17 insertions(+), 34 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index 9faf469354a5..d1ce0bc86ff7 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -318,15 +318,14 @@ nfsd_file_check_writeback(struct nfsd_file *nf)
mapping_tagged(mapping, PAGECACHE_TAG_WRITEBACK);
}
-static bool nfsd_file_lru_add(struct nfsd_file *nf)
+static void nfsd_file_lru_add(struct nfsd_file *nf)
{
- set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
- set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
- if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru)) {
+ refcount_inc(&nf->nf_ref);
+ if (list_lru_add_obj(&nfsd_file_lru, &nf->nf_lru))
trace_nfsd_file_lru_add(nf);
- return true;
- }
- return false;
+ else
+ WARN_ON(1);
+ nfsd_file_schedule_laundrette();
}
static bool nfsd_file_lru_remove(struct nfsd_file *nf)
@@ -362,20 +361,10 @@ nfsd_file_put(struct nfsd_file *nf)
if (test_bit(NFSD_FILE_GC, &nf->nf_flags) &&
test_bit(NFSD_FILE_HASHED, &nf->nf_flags)) {
- /*
- * If this is the last reference (nf_ref == 1), then try to
- * transfer it to the LRU.
- */
- if (refcount_dec_not_one(&nf->nf_ref))
- return;
-
- /* Try to add it to the LRU. If that fails, decrement. */
- if (nfsd_file_lru_add(nf)) {
- nfsd_file_schedule_laundrette();
- return;
- }
-
+ set_bit(NFSD_FILE_REFERENCED, &nf->nf_flags);
+ set_bit(NFSD_FILE_RECENT, &nf->nf_flags);
}
+
if (refcount_dec_and_test(&nf->nf_ref))
nfsd_file_free(nf);
}
@@ -510,13 +499,12 @@ nfsd_file_lru_cb(struct list_head *item, struct list_lru_one *lru,
}
/*
- * Put the reference held on behalf of the LRU. If it wasn't the last
- * one, then just remove it from the LRU and ignore it.
+ * Put the reference held on behalf of the LRU if it is the last
+ * reference, else rotate.
*/
- if (!refcount_dec_and_test(&nf->nf_ref)) {
+ if (!refcount_dec_if_one(&nf->nf_ref)) {
trace_nfsd_file_gc_in_use(nf);
- list_lru_isolate(lru, &nf->nf_lru);
- return LRU_REMOVED;
+ return LRU_ROTATE;
}
/* Refcount went to zero. Unhash it and queue it to the dispose list */
@@ -1046,16 +1034,8 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
nf = nfsd_file_lookup_locked(net, current_cred(), inode, need, want_gc);
rcu_read_unlock();
- if (nf) {
- /*
- * If the nf is on the LRU then it holds an extra reference
- * that must be put if it's removed. It had better not be
- * the last one however, since we should hold another.
- */
- if (nfsd_file_lru_remove(nf))
- refcount_dec(&nf->nf_ref);
+ if (nf)
goto wait_for_construction;
- }
new = nfsd_file_alloc(net, inode, need, want_gc);
if (!new) {
@@ -1149,6 +1129,9 @@ nfsd_file_do_acquire(struct svc_rqst *rqstp, struct net *net,
*/
if (status != nfs_ok || inode->i_nlink == 0)
nfsd_file_unhash(nf);
+ else if (want_gc)
+ nfsd_file_lru_add(nf);
+
clear_and_wake_up_bit(NFSD_FILE_PENDING, &nf->nf_flags);
if (status == nfs_ok)
goto out;
--
2.47.1
^ permalink raw reply related [flat|nested] 25+ messages in thread* [PATCH 6/6] nfsd: filecache: drop the list_lru lock during lock gc scans
2025-02-07 5:15 [PATCH 0/6] nfsd: filecache: various fixes NeilBrown
` (4 preceding siblings ...)
2025-02-07 5:15 ` [PATCH 5/6] nfsd: filecache: don't repeatedly add/remove files on the lru list NeilBrown
@ 2025-02-07 5:15 ` NeilBrown
5 siblings, 0 replies; 25+ messages in thread
From: NeilBrown @ 2025-02-07 5:15 UTC (permalink / raw)
To: Chuck Lever, Jeff Layton
Cc: linux-nfs, Olga Kornievskaia, Dai Ngo, Tom Talpey, Dave Chinner
Under a high NFSv3 load with lots of different files being accessed,
the LRU list of garbage-collectable files can become quite long.
Asking list_lru_scan_node() to scan the whole list can result in a long
period during which a spinlock is held, blocking the addition of new LRU
items.
So ask list_lru_scan_node() to scan only a few entries at a time, and
repeat until the scan is complete.
If the shrinker runs between two consecutive calls of
list_lru_scan_node() it could invalidate the "remaining" counter which
could lead to premature freeing. So add a spinlock to avoid that.
Signed-off-by: NeilBrown <neilb@suse.de>
---
fs/nfsd/filecache.c | 27 ++++++++++++++++++++++++---
fs/nfsd/filecache.h | 6 ++++++
2 files changed, 30 insertions(+), 3 deletions(-)
diff --git a/fs/nfsd/filecache.c b/fs/nfsd/filecache.c
index d1ce0bc86ff7..54df5e23f119 100644
--- a/fs/nfsd/filecache.c
+++ b/fs/nfsd/filecache.c
@@ -533,6 +533,13 @@ nfsd_file_gc_cb(struct list_head *item, struct list_lru_one *lru,
return nfsd_file_lru_cb(item, lru, arg);
}
+/* If the shrinker runs between calls to list_lru_walk_node() in
+ * nfsd_file_gc(), the "remaining" count will be wrong. This could
+ * result in premature freeing of some files. This may not matter much
+ * but is easy to fix with this spinlock which temporarily disables
+ * the shrinker.
+ */
+static DEFINE_SPINLOCK(nfsd_gc_lock);
static void
nfsd_file_gc(void)
{
@@ -540,11 +547,21 @@ nfsd_file_gc(void)
unsigned long ret = 0;
int nid;
+ spin_lock(&nfsd_gc_lock);
for_each_node_state(nid, N_NORMAL_MEMORY) {
- unsigned long nr = list_lru_count_node(&nfsd_file_lru, nid);
- ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
- &dispose, &nr);
+ unsigned long remaining = list_lru_count_node(&nfsd_file_lru, nid);
+
+ while (remaining > 0) {
+ unsigned long nr = min(remaining, NFSD_FILE_GC_BATCH);
+ remaining -= nr;
+ ret += list_lru_walk_node(&nfsd_file_lru, nid, nfsd_file_gc_cb,
+ &dispose, &nr);
+ if (nr)
+ /* walk aborted early */
+ remaining = 0;
+ }
}
+ spin_unlock(&nfsd_gc_lock);
trace_nfsd_file_gc_removed(ret, list_lru_count(&nfsd_file_lru));
nfsd_file_dispose_list_delayed(&dispose);
}
@@ -569,8 +586,12 @@ nfsd_file_lru_scan(struct shrinker *s, struct shrink_control *sc)
LIST_HEAD(dispose);
unsigned long ret;
+ if (!spin_trylock(&nfsd_gc_lock))
+ return SHRINK_STOP;
+
ret = list_lru_shrink_walk(&nfsd_file_lru, sc,
nfsd_file_lru_cb, &dispose);
+ spin_unlock(&nfsd_gc_lock);
trace_nfsd_file_shrinker_removed(ret, list_lru_count(&nfsd_file_lru));
nfsd_file_dispose_list_delayed(&dispose);
return ret;
diff --git a/fs/nfsd/filecache.h b/fs/nfsd/filecache.h
index de5b8aa7fcb0..5865f9c72712 100644
--- a/fs/nfsd/filecache.h
+++ b/fs/nfsd/filecache.h
@@ -3,6 +3,12 @@
#include <linux/fsnotify_backend.h>
+/*
+ * Limit the time that the list_lru_one lock is held during
+ * an LRU scan.
+ */
+#define NFSD_FILE_GC_BATCH (16UL)
+
/*
* This is the fsnotify_mark container that nfsd attaches to the files that it
* is holding open. Note that we have a separate refcount here aside from the
--
2.47.1
^ permalink raw reply related [flat|nested] 25+ messages in thread