Linux Documentation
* [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper
@ 2026-05-16 14:52 Horst Birthelmer
  2026-05-16 23:09 ` Matthew Wilcox
  2026-05-17  9:15 ` Mateusz Guzik
  0 siblings, 2 replies; 5+ messages in thread
From: Horst Birthelmer @ 2026-05-16 14:52 UTC (permalink / raw)
  To: Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro,
	Christian Brauner, Jan Kara
  Cc: linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer

From: Horst Birthelmer <hbirthelmer@ddn.com>

The dcache only shrinks under memory pressure, which is rarely reached
on machines with ample RAM, so cached negative dentries can accumulate
without bound.  Give administrators a soft cap they can set,
and a background worker that prefers negative dentries when reclaiming.

Two new sysctls under /proc/sys/fs/:

  dentry-limit             -- soft cap on nr_dentry.  0 (default)
                              disables the feature; behaviour is then
                              identical to before.
  dentry-limit-interval-ms -- pacing for the worker while still over
                              the cap.  Default 1000, minimum 1.

When the cap is exceeded, a delayed_work runs in two phases:

  1. iterate_supers() draining only negative dentries from every LRU.
     Positive entries are rotated past so the walk makes progress.
     DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed
     cap should evict even hot negatives before any positive entry.
  2. If still over the cap, iterate_supers() again with the same
     isolate callback the memory-pressure shrinker uses.

Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
---
There was a discussion at LSFMM about servers with too many cached
negative dentries.
That gave me the idea of limiting the number of dentries in general,
whenever the system administrator needs such a limit.

This is somewhat related to [1]: it addresses the same symptoms,
but in a less obtrusive way, by just garbage collecting
the negative and then the unused cache entries.

The other effect I have seen is that FUSE does not forget inodes
(no FORGET call to the FUSE server) until long after the last
reference has been closed.

In a FUSE server that mirrors the kernel's cached inodes in user space,
because it has to keep a lot of private data for every node,
this puts an unnecessary memory strain on that userspace process,
especially if the memory of its cgroup is limited.

[1]: https://lore.kernel.org/linux-fsdevel/20260331012925.74840-1-raven@themaw.net/
---
Changes in v2:
- get_nr_dentry() is guarded by #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS);
  move the reaper code inside that #if block
- Link to v1: https://lore.kernel.org/r/20260514-limit-dentries-cache-v1-1-431b9eb0c530@ddn.com
---
 Documentation/admin-guide/sysctl/fs.rst |  28 +++++
 fs/dcache.c                             | 207 ++++++++++++++++++++++++++++++++
 2 files changed, 235 insertions(+)

diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 9b7f65c3efd8..0229aea45d85 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -38,6 +38,34 @@ requests.  ``aio-max-nr`` allows you to change the maximum value
 ``aio-max-nr`` does not result in the
 pre-allocation or re-sizing of any kernel data structures.
 
+dentry-limit
+------------
+
+Soft cap on the total number of dentries allocated system-wide (i.e. on
+``nr_dentry`` from ``dentry-state``).  A value of ``0`` (the default)
+disables the feature and the dcache grows or shrinks only under memory
+pressure as before.
+
+When set to a non-zero value, a background worker is woken whenever
+the live dentry count exceeds the limit. The worker walks every
+superblock's LRU and prefers to evict negative dentries first; if it
+cannot get back under the limit using negative entries alone it falls
+back to the same LRU policy used by the memory-pressure shrinker.
+
+The limit is *soft*: allocations never fail because of it, and brief
+overshoots while the worker catches up are expected. Set the cap a
+comfortable margin above your steady-state working set.
+
+dentry-limit-interval-ms
+------------------------
+
+How often, in milliseconds, the ``dentry-limit`` worker re-runs while
+``nr_dentry`` is still above the cap. Defaults to ``1000`` (one
+second); the minimum accepted value is ``1``. Smaller values trim the
+cache more aggressively at the cost of more CPU spent walking LRUs;
+larger values let temporary spikes ride out before any work is done.
+Has no effect when ``dentry-limit`` is ``0``.
+
 dentry-negative
 ----------------------------
 
diff --git a/fs/dcache.c b/fs/dcache.c
index 2c61aeea41f4..196f842845ed 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -145,6 +145,26 @@ static DEFINE_PER_CPU(long, nr_dentry_negative);
 static int dentry_negative_policy;
 
 #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+/*
+ * Soft cap on the total number of dentries. When non-zero and exceeded,
+ * a background worker prunes unused dentries (preferring negative ones)
+ * until we are back under the limit. Zero (the default) disables the
+ * feature entirely; the fast path in __d_alloc() only pays the cost of
+ * a READ_ONCE and a branch in that case.
+ */
+static unsigned long sysctl_dentry_limit __read_mostly;
+static unsigned int sysctl_dentry_limit_interval_ms __read_mostly = 1000;
+static unsigned long dentry_limit_last_kick;
+
+static void dentry_limit_kick(void);
+
+/* Forward decls: the helpers used by the reaper live further down. */
+static void d_lru_isolate(struct list_lru_one *lru, struct dentry *dentry);
+static void d_lru_shrink_move(struct list_lru_one *lru, struct dentry *dentry,
+			      struct list_head *list);
+static enum lru_status dentry_lru_isolate(struct list_head *item,
+			      struct list_lru_one *lru, void *arg);
+
 /* Statistics gathering. */
 static struct dentry_stat_t dentry_stat = {
 	.age_limit = 45,
@@ -171,6 +191,161 @@ static long get_nr_dentry(void)
 	return sum < 0 ? 0 : sum;
 }
 
+#define DENTRY_LIMIT_BATCH	1024UL
+
+static void dentry_limit_worker_fn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(dentry_limit_work, dentry_limit_worker_fn);
+
+/*
+ * Variant of dentry_lru_isolate() that only frees negative dentries.
+ * DCACHE_REFERENCED is intentionally not honoured here: the whole point
+ * of an admin-imposed cap on negatives is that even frequently-looked-up
+ * negative entries should be evicted before any positive dentry.
+ * Positive entries are rotated to the tail so the walk continues to
+ * make progress without disturbing their LRU position.
+ */
+static enum lru_status dentry_lru_isolate_negative(struct list_head *item,
+		struct list_lru_one *lru, void *arg)
+{
+	struct list_head *freeable = arg;
+	struct dentry *dentry = container_of(item, struct dentry, d_lru);
+
+	if (!spin_trylock(&dentry->d_lock))
+		return LRU_SKIP;
+
+	/* Same handling as dentry_lru_isolate() for in-use entries. */
+	if (dentry->d_lockref.count) {
+		d_lru_isolate(lru, dentry);
+		spin_unlock(&dentry->d_lock);
+		return LRU_REMOVED;
+	}
+
+	if (!d_is_negative(dentry)) {
+		spin_unlock(&dentry->d_lock);
+		return LRU_ROTATE;
+	}
+
+	d_lru_shrink_move(lru, dentry, freeable);
+	spin_unlock(&dentry->d_lock);
+	return LRU_REMOVED;
+}
+
+struct dentry_limit_ctx {
+	long over;		/* remaining dentries to evict */
+	list_lru_walk_cb isolate;
+};
+
+static void dentry_limit_prune_sb(struct super_block *sb, void *arg)
+{
+	struct dentry_limit_ctx *ctx = arg;
+	unsigned long walked = 0;
+	unsigned long budget;
+
+	if (ctx->over <= 0)
+		return;
+
+	/*
+	 * Walk up to one full pass of this superblock's LRU, in
+	 * DENTRY_LIMIT_BATCH-sized chunks. The loop matters mainly for
+	 * phase 1: dentry_lru_isolate_negative() returns LRU_ROTATE for
+	 * positive dentries, which still counts against list_lru_walk()'s
+	 * nr_to_walk. A single batch can therefore finish having freed
+	 * nothing when positives crowd the head of the LRU, and without
+	 * the inner loop the worker would have to wait a full
+	 * dentry-limit-interval-ms before retrying, possibly never reaching
+	 * the negatives buried behind a long run of positives.
+	 *
+	 * The budget is snapshot at entry so a filesystem allocating
+	 * dentries faster than we drain them can't keep us spinning here
+	 * forever; freshly added dentries are picked up on the next
+	 * worker invocation.
+	 *
+	 * Phase 2 normally exits much sooner: its isolate callback frees
+	 * any non-referenced dentry, so ctx->over typically hits zero
+	 * inside the first batch. The worst-case over-eviction is one
+	 * batch past the cap, which is within the soft semantics of
+	 * fs.dentry-limit.
+	 */
+	budget = list_lru_count(&sb->s_dentry_lru);
+
+	while (ctx->over > 0 && walked < budget) {
+		LIST_HEAD(dispose);
+		unsigned long nr;
+		long freed;
+
+		nr = min(DENTRY_LIMIT_BATCH, budget - walked);
+		freed = list_lru_walk(&sb->s_dentry_lru, ctx->isolate,
+				      &dispose, nr);
+		shrink_dentry_list(&dispose);
+
+		ctx->over -= freed;
+		walked += nr;
+
+		cond_resched();
+	}
+}
+
+static void dentry_limit_worker_fn(struct work_struct *work)
+{
+	struct dentry_limit_ctx ctx;
+	unsigned long limit = READ_ONCE(sysctl_dentry_limit);
+	unsigned int ms;
+	long nr;
+
+	if (!limit)
+		return;
+
+	nr = get_nr_dentry();
+	if (nr <= (long)limit)
+		return;
+
+	ctx.over = nr - (long)limit;
+
+	/* Phase 1: drain negative dentries across every superblock. */
+	ctx.isolate = dentry_lru_isolate_negative;
+	iterate_supers(dentry_limit_prune_sb, &ctx);
+
+	/* Phase 2: still over? Apply the ordinary LRU policy. */
+	if (ctx.over > 0) {
+		ctx.isolate = dentry_lru_isolate;
+		iterate_supers(dentry_limit_prune_sb, &ctx);
+	}
+
+	/*
+	 * Re-arm while still above the limit. Re-read the sysctls in
+	 * case the admin raised the cap or disabled the feature during
+	 * the walk.
+	 */
+	limit = READ_ONCE(sysctl_dentry_limit);
+	if (!limit || get_nr_dentry() <= (long)limit)
+		return;
+
+	ms = READ_ONCE(sysctl_dentry_limit_interval_ms);
+	queue_delayed_work(system_unbound_wq, &dentry_limit_work,
+			   msecs_to_jiffies(ms));
+}
+
+static void dentry_limit_kick(void)
+{
+	unsigned long limit = READ_ONCE(sysctl_dentry_limit);
+	unsigned long now;
+
+	if (!limit)
+		return;
+	if (delayed_work_pending(&dentry_limit_work))
+		return;
+
+	now = jiffies;
+	if (time_before(now, READ_ONCE(dentry_limit_last_kick) + HZ / 10))
+		return;
+	WRITE_ONCE(dentry_limit_last_kick, now);
+
+	if (get_nr_dentry() <= (long)limit)
+		return;
+
+	queue_delayed_work(system_unbound_wq, &dentry_limit_work, 0);
+}
+
 static long get_nr_dentry_unused(void)
 {
 	int i;
@@ -199,6 +374,20 @@ static int proc_nr_dentry(const struct ctl_table *table, int write, void *buffer
 	return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
 }
 
+/*
+ * Writing fs.dentry-limit should give prompt feedback to admins
+ * lowering the cap, so kick the worker on every successful write.
+ */
+static int proc_dentry_limit(const struct ctl_table *table, int write,
+			     void *buffer, size_t *lenp, loff_t *ppos)
+{
+	int ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+
+	if (write && !ret)
+		dentry_limit_kick();
+	return ret;
+}
+
 static const struct ctl_table fs_dcache_sysctls[] = {
 	{
 		.procname	= "dentry-state",
@@ -207,6 +396,21 @@ static const struct ctl_table fs_dcache_sysctls[] = {
 		.mode		= 0444,
 		.proc_handler	= proc_nr_dentry,
 	},
+	{
+		.procname	= "dentry-limit",
+		.data		= &sysctl_dentry_limit,
+		.maxlen		= sizeof(sysctl_dentry_limit),
+		.mode		= 0644,
+		.proc_handler	= proc_dentry_limit,
+	},
+	{
+		.procname	= "dentry-limit-interval-ms",
+		.data		= &sysctl_dentry_limit_interval_ms,
+		.maxlen		= sizeof(sysctl_dentry_limit_interval_ms),
+		.mode		= 0644,
+		.proc_handler	= proc_douintvec_minmax,
+		.extra1		= SYSCTL_ONE,
+	},
 	{
 		.procname	= "dentry-negative",
 		.data		= &dentry_negative_policy,
@@ -1868,6 +2072,9 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
 	}
 
 	this_cpu_inc(nr_dentry);
+#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
+	dentry_limit_kick();
+#endif
 
 	return dentry;
 }

---
base-commit: 6916d5703ddf9a38f1f6c2cc793381a24ee914c6
change-id: 20260513-limit-dentries-cache-63685729672b

Best regards,
-- 
Horst Birthelmer <hbirthelmer@ddn.com>



* Re: [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper
  2026-05-16 14:52 [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer
@ 2026-05-16 23:09 ` Matthew Wilcox
  2026-05-17  7:57   ` Horst Birthelmer
  2026-05-17  9:15 ` Mateusz Guzik
  1 sibling, 1 reply; 5+ messages in thread
From: Matthew Wilcox @ 2026-05-16 23:09 UTC (permalink / raw)
  To: Horst Birthelmer
  Cc: Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro,
	Christian Brauner, Jan Kara, linux-doc, linux-kernel,
	linux-fsdevel, Horst Birthelmer

On Sat, May 16, 2026 at 04:52:54PM +0200, Horst Birthelmer wrote:
> There was a discussion at LSFMM about servers with too many cached
> negative dentries.
> That gave me the idea to keep the dentries in general limited
> if the system administrator needs it to.

I feel you should link to the dozens of previous attempts at this kind
of thing to show that you're aware that this has been tried before and
you're doing something meaningfully different.


* Re: Re: [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper
  2026-05-16 23:09 ` Matthew Wilcox
@ 2026-05-17  7:57   ` Horst Birthelmer
  0 siblings, 0 replies; 5+ messages in thread
From: Horst Birthelmer @ 2026-05-17  7:57 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Christian Brauner, Jan Kara, linux-doc,
	linux-kernel, linux-fsdevel, Horst Birthelmer

On Sun, May 17, 2026 at 12:09:26AM +0100, Matthew Wilcox wrote:
> On Sat, May 16, 2026 at 04:52:54PM +0200, Horst Birthelmer wrote:
> > There was a discussion at LSFMM about servers with too many cached
> > negative dentries.
> > That gave me the idea to keep the dentries in general limited
> > if the system administrator needs it to.
> 
> I feel you should link to the dozens of previous attempts at this kind
> of thing to show that you're aware that this has been tried before and
> you're doing something meaningfully different.
> 

Hi Matthew,

thanks for looking at this.

- The first limitation of dentries I could find was a patch for Linux 2.6.7
which introduced the vfs_cache_pressure option. [1]
This is still in use today, but it does not limit dentries as such. It only
shifts the balance of where pressure gets released, and you have to get into
a pressure situation for it to matter at all.
In my case, by the time we get into pressure the fuse server could already be
in heavy trouble (we have had OOM events for the cgroup due to this).

- In 2011 there was an attempt to limit dentries per container. [2] [3]
Here Dave Chinner made the point that the dentry cache is usually not the problem
but the inode cache, which is exactly what we see as well, since the fuse server
has to keep a lot of private data for every cached inode. However, we only have
LRU information for the dentries, so the best way we can keep this under
control is to limit the number of dentries to an acceptable amount.

- There was an entire series by Waiman Long starting around 2017. [4]
Here even you were part of the discussion, and I think the ideas are very similar.
I'm just more worried about unused dentries (I merely prefer negative ones on
reclaim). There are several different attempts in this context.

- Then there was the one I mentioned in the cover letter [5],
which tried to modify the caching to limit the excessive traversal when
there are so many entries.
I'm trying to solve the same symptoms, but not with that approach at all. I am
not concerned with the cache structures.

- There was [6] by Gautham Ananthakrishna.
Here the focus was on the memory used for the dentries. That is not my concern
in this patch. I'm just trying not to keep dentries, and indirectly inodes,
unnecessarily in the kernel and, as a consequence, in the fuse server in user space.

- Currently we have the possibility to disable negative dentries completely
via /proc/sys/fs/dentry-negative.
I am completely agnostic to this. If an admin disables negative dentries, I try
to get under the limit by freeing unused ones; if not, that is no problem either.

I'm sure I have missed some where the limitation of dentries was a
secondary effect. I have searched for patches for fs/dcache.c that had anything
to do with dentries.

--
As a conclusion, I think I have an uncommon perspective on the cache entries,
since I don't usually work on vfs but argue from the perspective of a fuse server,
where the kernel makes us waste resources. This hurts way more in the FUSE context
than in a 'normal' file system.
I have looked at the dentry cache only because people told me that this
has to be solved in the vfs (and I agree). I actually have a somewhat hacky patch
that does this from fuse and only for the fuse sb.

This patch starts a worker when we pass the configured limit. The worker frees
the negative dentries first, then continues with the unused dentries based on
the LRU data.
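
To make the two phases concrete, here is a minimal userspace model. It is
only a sketch and not the kernel code: struct entry and the functions
evict_negative(), evict_lru() and reap() are hypothetical names, the LRU is
a plain array, and in-use entries are simply skipped instead of being taken
off the list. It assumes the usual second-chance handling of the referenced
bit in the second phase.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of one LRU entry; this is not struct dentry. */
struct entry {
	bool present;     /* still cached */
	bool negative;    /* models d_is_negative() */
	bool in_use;      /* models a non-zero reference count */
	bool referenced;  /* models DCACHE_REFERENCED */
};

/* Phase 1: evict unused negative entries, ignoring the referenced bit. */
static long evict_negative(struct entry *lru, size_t n, long over)
{
	long freed = 0;

	for (size_t i = 0; i < n && freed < over; i++) {
		struct entry *e = &lru[i];

		if (!e->present || e->in_use || !e->negative)
			continue;
		e->present = false;
		freed++;
	}
	return freed;
}

/* Phase 2: ordinary policy; a referenced entry gets a second chance. */
static long evict_lru(struct entry *lru, size_t n, long over)
{
	long freed = 0;

	for (size_t i = 0; i < n && freed < over; i++) {
		struct entry *e = &lru[i];

		if (!e->present || e->in_use)
			continue;
		if (e->referenced) {
			e->referenced = false;
			continue;
		}
		e->present = false;
		freed++;
	}
	return freed;
}

/* Evict up to (nr - limit) entries; a limit of 0 disables the cap. */
static long reap(struct entry *lru, size_t n, long nr, long limit)
{
	long over, freed;

	if (limit == 0 || nr <= limit)
		return 0;

	over = nr - limit;
	freed = evict_negative(lru, n, over);
	if (freed < over)
		freed += evict_lru(lru, n, over - freed);
	return freed;
}
```

In this model a referenced negative entry is gone after the first phase,
while a referenced positive entry survives one more pass of the second.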

What I'm trying to achieve is to keep only actually used entries when we are over
an arbitrary limit, while not interfering too much with the work that is done
by the kernel. Then there is the point that shrink_dentry_list() has only been there
since 2019, so older approaches did not have that possibility.

The short version: This is nothing new, just a new combination of already existing
solutions that could be useful.

[1] https://www.kernel.org/pub/linux/kernel/people/akpm/patches/2.6/2.6.7/2.6.7-mm1/broken-out/vfs-shrinkage-tuning.patch
[2] https://lwn.net/Articles/441164/
[3] https://lore.kernel.org/all/4DBFF1AD.90303@parallels.com/
[4] https://lore.kernel.org/all/1500298773-7510-1-git-send-email-longman@redhat.com/
[5] https://lore.kernel.org/linux-fsdevel/20260331012925.74840-1-raven@themaw.net/
[6] https://lore.kernel.org/all/1611235185-1685-1-git-send-email-gautham.ananthakrishna@oracle.com/


* Re: [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper
  2026-05-16 14:52 [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer
  2026-05-16 23:09 ` Matthew Wilcox
@ 2026-05-17  9:15 ` Mateusz Guzik
  2026-05-17  9:42   ` Horst Birthelmer
  1 sibling, 1 reply; 5+ messages in thread
From: Mateusz Guzik @ 2026-05-17  9:15 UTC (permalink / raw)
  To: Horst Birthelmer
  Cc: Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro,
	Christian Brauner, Jan Kara, linux-doc, linux-kernel,
	linux-fsdevel, Horst Birthelmer

On Sat, May 16, 2026 at 04:52:54PM +0200, Horst Birthelmer wrote:
> From: Horst Birthelmer <hbirthelmer@ddn.com>
> 
> The dcache only shrinks under memory pressure, which is rarely reached
> on machines with ample RAM, so cached negative dentries can accumulate
> without bound.  Give administrators a soft cap they can set,
> and a background worker that prefers negative dentries when reclaiming.
> 
> Two new sysctls under /proc/sys/fs/:
> 
>   dentry-limit             -- soft cap on nr_dentry.  0 (default)
>                               disables the feature; behaviour is then
>                               identical to before.
>   dentry-limit-interval-ms -- pacing for the worker while still over
>                               the cap.  Default 1000, minimum 1.
> 
> When the cap is exceeded, a delayed_work runs in two phases:
> 
>   1. iterate_supers() draining only negative dentries from every LRU.
>      Positive entries are rotated past so the walk makes progress.
>      DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed
>      cap should evict even hot negatives before any positive entry.
>   2. If still over the cap, iterate_supers() again with the same
>      isolate callback the memory-pressure shrinker uses.
> 
> Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
> ---
> There was a discussion at LSFMM about servers with too many cached
> negative dentries.
> That gave me the idea to keep the dentries in general limited
> if the system administrator needs it to.
> 

I wrote about the negative entries problem here:

https://lore.kernel.org/linux-fsdevel/f7bp3ggliqbb7adyysonxgvo6zn76mo4unroagfcuu3bfghynu@7wkgqkfb5c43/#t

The mechanism as suggested here will end up evicting *useful* negative
entries. Granted, they will be recreated soon enough so it's not a
tragedy but it still is an avoidable perf loss.

What is needed in the long run is a mechanism which aggressively
recycles stale negative entries and recognizes which ones should be
saved for the time being.

Below some magic threshold you just allocate a new negative entry.

All new entries would get a grace period where they need to get hits and
prove useful OR get whacked. If you are at or above the threshold and
are allocating a new entry, you can whack the oldest negative one which
did not make it.
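
Purely as an illustration, the allocation rule could be modelled in
userspace roughly as follows. Everything here is an assumption made for the
sketch: GRACE_TICKS, struct neg_entry, pick_victim() and alloc_negative()
are hypothetical names, time is an abstract tick counter, and the choice to
still hand out a free slot when no failed entry exists is just one possible
policy.

```c
#include <stdbool.h>
#include <stddef.h>

#define GRACE_TICKS 10	/* hypothetical length of the grace period */

/* Hypothetical model of a negative entry. */
struct neg_entry {
	bool present;
	unsigned long born;	/* tick at which the entry was created */
	unsigned long hits;	/* lookups that found this entry */
};

/*
 * Pick the oldest entry whose grace period expired without a single hit.
 * Returns its index, or -1 if every entry is still in its grace period
 * or has proven useful. Assumes now >= born for all entries.
 */
static int pick_victim(const struct neg_entry *tbl, size_t n,
		       unsigned long now)
{
	int victim = -1;

	for (size_t i = 0; i < n; i++) {
		const struct neg_entry *e = &tbl[i];

		if (!e->present || e->hits > 0)
			continue;
		if (now - e->born < GRACE_TICKS)
			continue;
		if (victim < 0 || e->born < tbl[victim].born)
			victim = (int)i;
	}
	return victim;
}

/*
 * Allocate a slot for a new negative entry. Below the threshold a free
 * slot is used. At or above it, a failed entry is recycled if one
 * exists; otherwise a free slot is still used when available.
 * Returns the slot index, or -1 if no slot could be found.
 */
static int alloc_negative(struct neg_entry *tbl, size_t n,
			  size_t threshold, unsigned long now)
{
	size_t live = 0;
	int slot = -1;

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].present)
			live++;
		else if (slot < 0)
			slot = (int)i;
	}

	if (live >= threshold) {
		int victim = pick_victim(tbl, n, now);

		if (victim >= 0)
			slot = victim;
	}
	if (slot < 0)
		return -1;

	tbl[slot] = (struct neg_entry){ .present = true, .born = now };
	return slot;
}
```

An entry with at least one hit is never picked here, which is the part
that preserves the small set of hot negative entries.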

This is just one idea; what is not up for debate is the discrepancy
between the small subset of negative entries with tons of hits and the ones
which get virtually no traffic at all.

Whatever the mechanism it will have to take advantage of it.

> This is somewhat related to [1] where it would address the same
> symptoms but in a more unobtrusive way, by just garbage collecting
> the negative and then the unused cache entries.
> 
> The other effect I have seen regarding this is that FUSE
> will not forget inodes (no FORGET call to the FUSE server)
> even after the latest reference has been closed until much later.
> 
> In a FUSE server that mirrors the kernel cached inodes in user space
> because it has to keep a lot of private data for every node
> this puts an unnecessary memory strain on that userspace entity
> especially if the memory is limited for its cgroup.

I don't know anything about how FUSE works. In this context I presume
you have a mount point backed by FUSE and the problematic memory usage
stems from inodes created against such a mount point.

This would suggest you would be better served with a mechanism which
allows userspace to cull some number of dentries for a given mount
point, maybe even with an optional preference for negative entries if
that's considered better for given fs. 

Or to put it differently, I would look into exposing sb shrinkers to
root instead of rolling with a global scan.

> +static enum lru_status dentry_lru_isolate_negative(struct list_head *item,
> +		struct list_lru_one *lru, void *arg)
> +{
> +	struct list_head *freeable = arg;
> +	struct dentry *dentry = container_of(item, struct dentry, d_lru);
> +
> +	if (!spin_trylock(&dentry->d_lock))
> +		return LRU_SKIP;

If anything of the sort is to land, you definitely want to pre-check
d_count and d_is_negative without the lock.
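
A rough userspace sketch of such a pre-check, using C11 atomics in place of
the kernel primitives. All names (struct obj, obj_trylock(), obj_unlock(),
isolate_negative(), enum walk_status) are hypothetical, and the exact
condition is an assumption: an entry that looks unused and positive is
rotated without touching the lock, and everything else is rechecked under
the lock because the unlocked reads are racy.

```c
#include <stdatomic.h>
#include <stdbool.h>

enum walk_status { WALK_SKIP, WALK_ROTATE, WALK_ISOLATED, WALK_FREED };

/* Hypothetical stand-in for a dentry; field names are illustrative. */
struct obj {
	atomic_flag lock;	/* models d_lock as a trylock-only spinlock */
	atomic_int count;	/* models the reference count */
	atomic_bool negative;	/* models d_is_negative() */
};

static bool obj_trylock(struct obj *o)
{
	return !atomic_flag_test_and_set_explicit(&o->lock,
						  memory_order_acquire);
}

static void obj_unlock(struct obj *o)
{
	atomic_flag_clear_explicit(&o->lock, memory_order_release);
}

static enum walk_status isolate_negative(struct obj *o)
{
	/*
	 * Racy pre-check without the lock: an unused positive entry is
	 * only ever rotated, so there is no reason to bounce its lock.
	 */
	if (atomic_load_explicit(&o->count, memory_order_relaxed) == 0 &&
	    !atomic_load_explicit(&o->negative, memory_order_relaxed))
		return WALK_ROTATE;

	if (!obj_trylock(o))
		return WALK_SKIP;

	/* Recheck under the lock; the state may have changed meanwhile. */
	if (atomic_load_explicit(&o->count, memory_order_relaxed) != 0) {
		obj_unlock(o);
		return WALK_ISOLATED;	/* in use: drop from the LRU only */
	}
	if (!atomic_load_explicit(&o->negative, memory_order_relaxed)) {
		obj_unlock(o);
		return WALK_ROTATE;
	}

	obj_unlock(o);
	return WALK_FREED;	/* unused negative entry: safe to evict */
}
```

With this ordering a lock held by someone else only causes a skip for
entries that could actually be isolated or freed.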


* Re: Re: [PATCH v2] dcache: add fs.dentry-limit sysctl with negative-first reaper
  2026-05-17  9:15 ` Mateusz Guzik
@ 2026-05-17  9:42   ` Horst Birthelmer
  0 siblings, 0 replies; 5+ messages in thread
From: Horst Birthelmer @ 2026-05-17  9:42 UTC (permalink / raw)
  To: Mateusz Guzik
  Cc: Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan,
	Alexander Viro, Christian Brauner, Jan Kara, linux-doc,
	linux-kernel, linux-fsdevel, Horst Birthelmer

On Sun, May 17, 2026 at 11:15:04AM +0200, Mateusz Guzik wrote:
> On Sat, May 16, 2026 at 04:52:54PM +0200, Horst Birthelmer wrote:
> > From: Horst Birthelmer <hbirthelmer@ddn.com>
> > 
> > The dcache only shrinks under memory pressure, which is rarely reached
> > on machines with ample RAM, so cached negative dentries can accumulate
> > without bound.  Give administrators a soft cap they can set,
> > and a background worker that prefers negative dentries when reclaiming.
> > 
> > Two new sysctls under /proc/sys/fs/:
> > 
> >   dentry-limit             -- soft cap on nr_dentry.  0 (default)
> >                               disables the feature; behaviour is then
> >                               identical to before.
> >   dentry-limit-interval-ms -- pacing for the worker while still over
> >                               the cap.  Default 1000, minimum 1.
> > 
> > When the cap is exceeded, a delayed_work runs in two phases:
> > 
> >   1. iterate_supers() draining only negative dentries from every LRU.
> >      Positive entries are rotated past so the walk makes progress.
> >      DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed
> >      cap should evict even hot negatives before any positive entry.
> >   2. If still over the cap, iterate_supers() again with the same
> >      isolate callback the memory-pressure shrinker uses.
> > 
> > Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
> > ---
> > There was a discussion at LSFMM about servers with too many cached
> > negative dentries.
> > That gave me the idea to keep the dentries in general limited
> > if the system administrator needs it to.
> > 
> 
> I wrote about the negative entries problem here:
> 
> https://lore.kernel.org/linux-fsdevel/f7bp3ggliqbb7adyysonxgvo6zn76mo4unroagfcuu3bfghynu@7wkgqkfb5c43/#t
> 
> The mechanism as suggested here will end up evicting *useful* negative
> entries. Granted, they will be recreated soon enough so it's not a
> tragedy but it still is an avoidable perf loss.
> 
> What is needed in the long run is a mechanism which aggressively
> recycles stale negative entries and recognizes which ones should be
> saved for the time being.
> 
> Below some magic threshold you just allocate a new negative entry.
> 
> All new entries would get a grace period where they need to get hits and
> prove useful OR get whacked. If you are at or above the threshold and
> are allocating a new entry, you can whack the oldest negative one which
> did not make it.
> 
> This is just one idea, what is not up for debate is the discrepancy
> > between small subset of negative entries with tons of hits vs the ones
> which get virtually no traffic at all.

I'm trying not to focus too much on the negative dentries, since they have
no relevance for fuse. Handling them as well was just a nice side effect
and a bit of 'while you're at it' logic.
I'm more interested in throwing out the unused ones.

You are completely right in your analysis that this could remove fresh
and useful negative dentries.

> 
> Whatever the mechanism it will have to take advantage of it.
> 
> > This is somewhat related to [1] where it would address the same
> > symptoms but in a more unobtrusive way, by just garbage collecting
> > the negative and then the unused cache entries.
> > 
> > The other effect I have seen regarding this is that FUSE
> > will not forget inodes (no FORGET call to the FUSE server)
> > even after the latest reference has been closed until much later.
> > 
> > In a FUSE server that mirrors the kernel cached inodes in user space
> > because it has to keep a lot of private data for every node
> > this puts an unnecessary memory strain on that userspace entity
> > especially if the memory is limited for its cgroup.
> 
> I don't know anything about how FUSE works. In this context I presume
> you have a mount point backed by FUSE and the problematic memory usage
> stems from inodes created against such a mount point.
> 

correct

> This would suggest you would be better served with a mechanism which
> allows userspace to cull some number of dentries for a given mount
> point, maybe even with an optional preference for negative entries if
> that's considered better for given fs. 
> 

As I mentioned in the other post, I kinda did this (not triggered by user
space, though, just by a limit negotiated during init with user space) 
just for fuse and was told that this kind of limit would be useful in vfs.

> Or to put it differently, I would look into exposing sb shrinkers to
> root instead of rolling with a global scan.

This would be a cool idea.

> 
> > +static enum lru_status dentry_lru_isolate_negative(struct list_head *item,
> > +		struct list_lru_one *lru, void *arg)
> > +{
> > +	struct list_head *freeable = arg;
> > +	struct dentry *dentry = container_of(item, struct dentry, d_lru);
> > +
> > +	if (!spin_trylock(&dentry->d_lock))
> > +		return LRU_SKIP;
> 
> If anything of the sort is to land, you definitely want to pre-check
> d_count and d_is_negative without the lock.

Probably ...
I still think that a held lock is a good indicator that we can just move on.

Thanks for your time,
Horst
