From: Horst Birthelmer <horst@birthelmer.de>
To: Mateusz Guzik <mjguzik@gmail.com>
Cc: Horst Birthelmer <horst@birthelmer.com>,
Miklos Szeredi <miklos@szeredi.hu>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>, Jan Kara <jack@suse.cz>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Horst Birthelmer <hbirthelmer@ddn.com>
Subject: Re: Re: [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper
Date: Sun, 17 May 2026 11:42:37 +0200
Message-ID: <agmK0xmOVL5TLxdy@fedora.fritz.box>
In-Reply-To: <5afacskoalmd2u6s525dosvyrtr3j66ajd5m4p2ylymtlgytkz@excrdfpndx37>
On Sun, May 17, 2026 at 11:15:04AM +0200, Mateusz Guzik wrote:
> On Sat, May 16, 2026 at 04:52:54PM +0200, Horst Birthelmer wrote:
> > From: Horst Birthelmer <hbirthelmer@ddn.com>
> >
> > The dcache only shrinks under memory pressure, which is rarely reached
> > on machines with ample RAM, so cached negative dentries can accumulate
> > without bound. Give administrators a soft cap they can set,
> > and a background worker that prefers negative dentries when reclaiming.
> >
> > Two new sysctls under /proc/sys/fs/:
> >
> > dentry-limit -- soft cap on nr_dentry. 0 (default)
> > disables the feature; behaviour is then
> > identical to before.
> > dentry-limit-interval-ms -- pacing for the worker while still over
> > the cap. Default 1000, minimum 1.
> >
> > When the cap is exceeded, a delayed_work runs in two phases:
> >
> > 1. iterate_supers() draining only negative dentries from every LRU.
> > Positive entries are rotated past so the walk makes progress.
> > DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed
> > cap should evict even hot negatives before any positive entry.
> > 2. If still over the cap, iterate_supers() again with the same
> > isolate callback the memory-pressure shrinker uses.
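For illustration, here is a minimal userspace model of the worker pass described in the quoted changelog. It is not the patch code. struct lru_entry, phase1_negative(), phase2_regular() and reaper_pass() are hypothetical names, the LRU is a plain array ordered oldest first, and the second-chance handling of referenced entries in phase 2 only approximates the regular shrinker's isolate callback.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one LRU entry; not the kernel's struct dentry. */
struct lru_entry {
	bool negative;   /* no inode attached */
	bool referenced; /* stands in for DCACHE_REFERENCED */
	int count;       /* stands in for the refcount; > 0 means in use */
	bool evicted;
};

/* Phase 1: evict unused negative entries, oldest first. The referenced
 * bit is deliberately ignored; positive entries are passed over. */
static void phase1_negative(struct lru_entry *lru, size_t n,
			    size_t *nr, size_t limit)
{
	for (size_t i = 0; i < n && *nr > limit; i++) {
		struct lru_entry *d = &lru[i];

		if (d->evicted || d->count > 0 || !d->negative)
			continue;
		d->evicted = true;
		(*nr)--;
	}
}

/* Phase 2: rough approximation of the regular shrinker policy: skip
 * in-use entries, give referenced ones a second chance (clear the bit
 * and leave them for a later pass), evict the rest. */
static void phase2_regular(struct lru_entry *lru, size_t n,
			   size_t *nr, size_t limit)
{
	for (size_t i = 0; i < n && *nr > limit; i++) {
		struct lru_entry *d = &lru[i];

		if (d->evicted || d->count > 0)
			continue;
		if (d->referenced) {
			d->referenced = false;
			continue;
		}
		d->evicted = true;
		(*nr)--;
	}
}

/* One worker pass. Returns the delay in ms until the next pass, or 0 if
 * the worker does not need to be rescheduled. */
static unsigned int reaper_pass(struct lru_entry *lru, size_t n,
				size_t *nr, size_t limit,
				unsigned int interval_ms)
{
	assert(n == 0 || lru != NULL);
	if (limit == 0 || *nr <= limit)
		return 0; /* dentry-limit == 0: feature disabled */
	phase1_negative(lru, n, nr, limit);
	if (*nr > limit)
		phase2_regular(lru, n, nr, limit);
	if (*nr <= limit)
		return 0;
	/* Still over the cap: come back after dentry-limit-interval-ms,
	 * which the changelog says has a minimum of 1. */
	return interval_ms > 0 ? interval_ms : 1;
}
```

In this model a referenced negative entry is evicted in phase 1 right away, while a referenced positive entry survives one more pass, which is exactly the trade-off the changelog describes for hot negatives.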
> >
> > Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
> > ---
> > There was a discussion at LSFMM about servers with too many cached
> > negative dentries.
> > That gave me the idea of letting the system administrator cap the
> > number of dentries in general, if needed.
> >
>
> I wrote about the negative entries problem here:
>
> https://lore.kernel.org/linux-fsdevel/f7bp3ggliqbb7adyysonxgvo6zn76mo4unroagfcuu3bfghynu@7wkgqkfb5c43/#t
>
> The mechanism as suggested here will end up evicting *useful* negative
> entries. Granted, they will be recreated soon enough so it's not a
> tragedy but it still is an avoidable perf loss.
>
> What is needed in the long run is a mechanism which aggressively
> recycles stale negative entries and recognizes which ones should be
> saved for the time being.
>
> Below some magic threshold you just allocate a new negative entry.
>
> All new entries would get a grace period where they need to get hits and
> prove useful OR get whacked. If you are at or above the threshold and
> are allocating a new entry, you can whack the oldest negative one which
> did not make it.
>
> This is just one idea; what is not up for debate is the discrepancy
> between the small subset of negative entries with tons of hits and the
> ones which get virtually no traffic at all.
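A minimal userspace model of the grace-period idea above, under stated assumptions: negative entries live in a fixed-size table, time is an abstract tick counter, and an entry "makes it" if it collects at least min_hits hits before its grace period runs out. struct neg_cache, neg_alloc(), neg_hit() and the parameters threshold, grace and min_hits are hypothetical. Refusing to cache the new negative entry when nothing can be recycled is also an assumption of the model, not something proposed in the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define NEG_MAX 8 /* hypothetical capacity of the model table */

struct neg_entry {
	bool used;
	unsigned long born; /* tick at which the entry was created */
	unsigned int hits;  /* lookups answered by this entry */
};

struct neg_cache {
	struct neg_entry e[NEG_MAX];
	size_t nr;             /* entries in use */
	size_t threshold;      /* the "magic threshold", <= NEG_MAX */
	unsigned long grace;   /* grace period in ticks */
	unsigned int min_hits; /* hits needed to survive the grace period */
};

/* True once the grace period is over and the entry did not collect
 * enough hits, i.e. it "did not make it". */
static bool failed_probation(const struct neg_cache *c,
			     const struct neg_entry *e, unsigned long now)
{
	return now - e->born >= c->grace && e->hits < c->min_hits;
}

/* Record a lookup that was answered by this negative entry. */
static void neg_hit(struct neg_cache *c, int slot)
{
	c->e[slot].hits++;
}

/* Create a negative entry at time "now". Returns its slot, or -1 if the
 * table is at the threshold and every entry is either still in its
 * grace period or has proven useful. */
static int neg_alloc(struct neg_cache *c, unsigned long now)
{
	int slot = -1;

	assert(c->threshold <= NEG_MAX);
	if (c->nr < c->threshold) {
		/* Below the threshold: just allocate. */
		for (int i = 0; i < NEG_MAX && slot < 0; i++)
			if (!c->e[i].used)
				slot = i;
		assert(slot >= 0);
		c->nr++;
	} else {
		/* At the threshold: recycle the oldest entry that failed. */
		for (int i = 0; i < NEG_MAX; i++) {
			const struct neg_entry *e = &c->e[i];

			if (e->used && failed_probation(c, e, now) &&
			    (slot < 0 || e->born < c->e[slot].born))
				slot = i;
		}
		if (slot < 0)
			return -1;
	}
	c->e[slot] = (struct neg_entry){ .used = true, .born = now };
	return slot;
}
```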
I'm trying not to focus that much on negative dentries, since they have
no relevance for FUSE; limiting them was just a nice side effect and a
bit of 'while I'm at it' logic.
I'm more interested in throwing out the unused ones.
You are completely right that this could remove fresh and useful
negative dentries.
>
> Whatever the mechanism, it will have to take advantage of it.
>
> > This is somewhat related to [1] where it would address the same
> > symptoms but in a more unobtrusive way, by just garbage collecting
> > the negative and then the unused cache entries.
> >
> > The other effect I have seen regarding this is that FUSE
> > will not forget inodes (no FORGET call to the FUSE server)
> > until much later, even after the last reference has been closed.
> >
> > In a FUSE server that mirrors the kernel's cached inodes in user space,
> > because it has to keep a lot of private data for every node,
> > this puts unnecessary memory strain on that userspace entity,
> > especially if the memory of its cgroup is limited.
>
> I don't know anything about how FUSE works. In this context I presume
> you have a mount point backed by FUSE and the problematic memory usage
> stems from inodes created against such a mount point.
>
Correct.
> This would suggest you would be better served with a mechanism which
> allows userspace to cull some number of dentries for a given mount
> point, maybe even with an optional preference for negative entries if
> that's considered better for given fs.
>
As I mentioned in the other post, I kind of did this just for FUSE (not
triggered by user space, though, but by a limit negotiated with user
space during init) and was told that this kind of limit would be useful
in the VFS.
> Or to put it differently, I would look into exposing sb shrinkers to
> root instead of rolling with a global scan.
This would be a cool idea.
>
> > +static enum lru_status dentry_lru_isolate_negative(struct list_head *item,
> > + struct list_lru_one *lru, void *arg)
> > +{
> > + struct list_head *freeable = arg;
> > + struct dentry *dentry = container_of(item, struct dentry, d_lru);
> > +
> > + if (!spin_trylock(&dentry->d_lock))
> > + return LRU_SKIP;
>
> If anything of the sort is to land, you definitely want to pre-check
> d_count and d_is_negative without the lock.
Probably.
Still, I think a held lock is a good indicator that we can just move on.
Thanks for your time,
Horst
Thread overview: 5+ messages
2026-05-16 14:52 [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer
2026-05-16 23:09 ` Matthew Wilcox
2026-05-17 7:57 ` Horst Birthelmer
2026-05-17 9:15 ` Mateusz Guzik
2026-05-17 9:42 ` Horst Birthelmer [this message]