* [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
@ 2026-05-14 15:13 Horst Birthelmer
2026-05-15 15:09 ` kernel test robot
` (2 more replies)
0 siblings, 3 replies; 17+ messages in thread
From: Horst Birthelmer @ 2026-05-14 15:13 UTC (permalink / raw)
To: Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro,
Christian Brauner, Jan Kara
Cc: linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer
From: Horst Birthelmer <hbirthelmer@ddn.com>
The dcache only shrinks under memory pressure, which is rarely reached
on machines with ample RAM, so cached negative dentries can accumulate
without bound. Give administrators a soft cap they can set,
and a background worker that prefers negative dentries when reclaiming.
Two new sysctls under /proc/sys/fs/:
dentry-limit -- soft cap on nr_dentry. 0 (default)
disables the feature; behaviour is then
identical to before.
dentry-limit-interval-ms -- pacing for the worker while still over
the cap. Default 1000, minimum 1.
When the cap is exceeded, a delayed_work runs in two phases:
1. iterate_supers() draining only negative dentries from every LRU.
Positive entries are rotated past so the walk makes progress.
DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed
cap should evict even hot negatives before any positive entry.
2. If still over the cap, iterate_supers() again with the same
isolate callback the memory-pressure shrinker uses.
Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com>
---
There was a discussion at LSFMM about servers with too many cached
negative dentries.
That gave me the idea to keep the dentries in general limited
if the system administrator needs it to.
This is somewhat related to [1] where it would address the same
symptoms but in a more unobtrusive way, by just garbage collecting
the negative and then the unused cache entries.
The other effect I have seen regarding this is that FUSE
will not forget inodes (no FORGET call to the FUSE server)
even after the latest reference has been closed until much later.
In a FUSE server that mirrors the kernel cached inodes in user space
because it has to keep a lot of private data for every node
this puts an unnecessarry memory strain on that userspace entity
especially if the memory is limited for its cgroup.
[1]: https://lore.kernel.org/linux-fsdevel/20260331012925.74840-1-raven@themaw.net/
---
Documentation/admin-guide/sysctl/fs.rst | 28 +++++
fs/dcache.c | 197 ++++++++++++++++++++++++++++++++
2 files changed, 225 insertions(+)
diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst
index 9b7f65c3efd8..0229aea45d85 100644
--- a/Documentation/admin-guide/sysctl/fs.rst
+++ b/Documentation/admin-guide/sysctl/fs.rst
@@ -38,6 +38,34 @@ requests. ``aio-max-nr`` allows you to change the maximum value
``aio-max-nr`` does not result in the
pre-allocation or re-sizing of any kernel data structures.
+dentry-limit
+------------
+
+Soft cap on the total number of dentries allocated system-wide (i.e. on
+``nr_dentry`` from ``dentry-state``). A value of ``0`` (the default)
+disables the feature and the dcache grows or shrinks only under memory
+pressure as before.
+
+When set to a non-zero value, a background worker is woken whenever
+the live dentry count exceeds the limit. The worker walks every
+superblock's LRU and prefers to evict negative dentries first; if it
+cannot get back under the limit using negative entries alone it falls
+back to the same LRU policy used by the memory-pressure shrinker.
+
+The limit is *soft*: allocations never fail because of it, and brief
+overshoots while the worker catches up are expected. Set the cap a
+comfortable margin above your steady-state working set.
+
+dentry-limit-interval-ms
+------------------------
+
+How often, in milliseconds, the ``dentry-limit`` worker re-runs while
+``nr_dentry`` is still above the cap. Defaults to ``1000`` (one
+second); the minimum accepted value is ``1``. Smaller values trim the
+cache more aggressively at the cost of more CPU spent walking LRUs;
+larger values let temporary spikes ride out before any work is done.
+Has no effect when ``dentry-limit`` is ``0``.
+
dentry-negative
----------------------------
diff --git a/fs/dcache.c b/fs/dcache.c
index 2c61aeea41f4..4959d2c011c0 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -144,6 +144,19 @@ static DEFINE_PER_CPU(long, nr_dentry_unused);
static DEFINE_PER_CPU(long, nr_dentry_negative);
static int dentry_negative_policy;
+/*
+ * Soft cap on the total number of dentries. When non-zero and exceeded,
+ * a background worker prunes unused dentries (preferring negative ones)
+ * until we are back under the limit. Zero (the default) disables the
+ * feature entirely; the fast path in __d_alloc() only pays the cost of
+ * a READ_ONCE and a branch in that case.
+ */
+static unsigned long sysctl_dentry_limit __read_mostly;
+static unsigned int sysctl_dentry_limit_interval_ms __read_mostly = 1000;
+static unsigned long dentry_limit_last_kick;
+
+static void dentry_limit_kick(void);
+
#if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS)
/* Statistics gathering. */
static struct dentry_stat_t dentry_stat = {
@@ -199,6 +212,20 @@ static int proc_nr_dentry(const struct ctl_table *table, int write, void *buffer
return proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
}
+/*
+ * Writing fs.dentry-limit should give prompt feedback to admins
+ * lowering the cap, so kick the worker on every successful write.
+ */
+static int proc_dentry_limit(const struct ctl_table *table, int write,
+ void *buffer, size_t *lenp, loff_t *ppos)
+{
+ int ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos);
+
+ if (write && !ret)
+ dentry_limit_kick();
+ return ret;
+}
+
static const struct ctl_table fs_dcache_sysctls[] = {
{
.procname = "dentry-state",
@@ -207,6 +234,21 @@ static const struct ctl_table fs_dcache_sysctls[] = {
.mode = 0444,
.proc_handler = proc_nr_dentry,
},
+ {
+ .procname = "dentry-limit",
+ .data = &sysctl_dentry_limit,
+ .maxlen = sizeof(sysctl_dentry_limit),
+ .mode = 0644,
+ .proc_handler = proc_dentry_limit,
+ },
+ {
+ .procname = "dentry-limit-interval-ms",
+ .data = &sysctl_dentry_limit_interval_ms,
+ .maxlen = sizeof(sysctl_dentry_limit_interval_ms),
+ .mode = 0644,
+ .proc_handler = proc_douintvec_minmax,
+ .extra1 = SYSCTL_ONE,
+ },
{
.procname = "dentry-negative",
.data = &dentry_negative_policy,
@@ -1325,6 +1367,160 @@ static enum lru_status dentry_lru_isolate_shrink(struct list_head *item,
return LRU_REMOVED;
}
+#define DENTRY_LIMIT_BATCH 1024UL
+
+static void dentry_limit_worker_fn(struct work_struct *work);
+static DECLARE_DELAYED_WORK(dentry_limit_work, dentry_limit_worker_fn);
+
+/*
+ * Variant of dentry_lru_isolate() that only frees negative dentries.
+ * DCACHE_REFERENCED is intentionally not honoured here: the whole point
+ * of an admin-imposed cap on negatives is that even frequently-looked-up
+ * negative entries should be evicted before any positive dentry.
+ * Positive entries are rotated to the tail so the walk continues to
+ * make progress without disturbing their LRU position.
+ */
+static enum lru_status dentry_lru_isolate_negative(struct list_head *item,
+ struct list_lru_one *lru, void *arg)
+{
+ struct list_head *freeable = arg;
+ struct dentry *dentry = container_of(item, struct dentry, d_lru);
+
+ if (!spin_trylock(&dentry->d_lock))
+ return LRU_SKIP;
+
+ /* Same handling as dentry_lru_isolate() for in-use entries. */
+ if (dentry->d_lockref.count) {
+ d_lru_isolate(lru, dentry);
+ spin_unlock(&dentry->d_lock);
+ return LRU_REMOVED;
+ }
+
+ if (!d_is_negative(dentry)) {
+ spin_unlock(&dentry->d_lock);
+ return LRU_ROTATE;
+ }
+
+ d_lru_shrink_move(lru, dentry, freeable);
+ spin_unlock(&dentry->d_lock);
+ return LRU_REMOVED;
+}
+
+struct dentry_limit_ctx {
+ long over; /* remaining dentries to evict */
+ list_lru_walk_cb isolate;
+};
+
+static void dentry_limit_prune_sb(struct super_block *sb, void *arg)
+{
+ struct dentry_limit_ctx *ctx = arg;
+ unsigned long walked = 0;
+ unsigned long budget;
+
+ if (ctx->over <= 0)
+ return;
+
+ /*
+ * Walk up to one full pass of this superblock's LRU, in
+ * DENTRY_LIMIT_BATCH-sized chunks. The loop matters mainly for
+ * phase 1: dentry_lru_isolate_negative() returns LRU_ROTATE for
+ * positive dentries, which still counts against list_lru_walk()'s
+ * nr_to_walk. A single batch can therefore finish having freed
+ * nothing when positives crowd the head of the LRU, and without
+ * the inner loop the worker would have to wait a full
+ * dentry-limit-interval-ms before retrying never reaching the
+ * negatives buried behind a long run of positives.
+ *
+ * The budget is snapshot at entry so a filesystem allocating
+ * dentries faster than we drain them can't keep us spinning here
+ * forever; freshly added dentries are picked up on the next
+ * worker invocation.
+ *
+ * Phase 2 normally exits much sooner: its isolate callback frees
+ * any non-referenced dentry, so ctx->over typically hits zero
+ * inside the first batch. The worst-case over-eviction is one
+ * batch past the cap, which is within the soft semantics of
+ * fs.dentry-limit.
+ */
+ budget = list_lru_count(&sb->s_dentry_lru);
+
+ while (ctx->over > 0 && walked < budget) {
+ LIST_HEAD(dispose);
+ unsigned long nr;
+ long freed;
+
+ nr = min(DENTRY_LIMIT_BATCH, budget - walked);
+ freed = list_lru_walk(&sb->s_dentry_lru, ctx->isolate,
+ &dispose, nr);
+ shrink_dentry_list(&dispose);
+
+ ctx->over -= freed;
+ walked += nr;
+
+ cond_resched();
+ }
+}
+
+static void dentry_limit_worker_fn(struct work_struct *work)
+{
+ struct dentry_limit_ctx ctx;
+ unsigned long limit = READ_ONCE(sysctl_dentry_limit);
+ unsigned int ms;
+ long nr;
+
+ if (!limit)
+ return;
+
+ nr = get_nr_dentry();
+ if (nr <= (long)limit)
+ return;
+
+ ctx.over = nr - (long)limit;
+
+ /* Phase 1: drain negative dentries across every superblock. */
+ ctx.isolate = dentry_lru_isolate_negative;
+ iterate_supers(dentry_limit_prune_sb, &ctx);
+
+ /* Phase 2: still over? Apply the ordinary LRU policy. */
+ if (ctx.over > 0) {
+ ctx.isolate = dentry_lru_isolate;
+ iterate_supers(dentry_limit_prune_sb, &ctx);
+ }
+
+ /*
+ * Re-arm while still above the limit. Re-read the sysctls in
+ * case the admin raised the cap or disabled the feature during
+ * the walk.
+ */
+ limit = READ_ONCE(sysctl_dentry_limit);
+ if (!limit || get_nr_dentry() <= (long)limit)
+ return;
+
+ ms = READ_ONCE(sysctl_dentry_limit_interval_ms);
+ queue_delayed_work(system_unbound_wq, &dentry_limit_work,
+ msecs_to_jiffies(ms));
+}
+
+static void dentry_limit_kick(void)
+{
+ unsigned long limit = READ_ONCE(sysctl_dentry_limit);
+ unsigned long now;
+
+ if (!limit)
+ return;
+ if (delayed_work_pending(&dentry_limit_work))
+ return;
+
+ now = jiffies;
+ if (time_before(now, READ_ONCE(dentry_limit_last_kick) + HZ / 10))
+ return;
+ WRITE_ONCE(dentry_limit_last_kick, now);
+
+ if (get_nr_dentry() <= (long)limit)
+ return;
+
+ queue_delayed_work(system_unbound_wq, &dentry_limit_work, 0);
+}
/**
* shrink_dcache_sb - shrink dcache for a superblock
@@ -1868,6 +2064,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
}
this_cpu_inc(nr_dentry);
+ dentry_limit_kick();
return dentry;
}
---
base-commit: 5d6919055dec134de3c40167a490f33c74c12581
change-id: 20260513-limit-dentries-cache-63685729672b
Best regards,
--
Horst Birthelmer <hbirthelmer@ddn.com>
^ permalink raw reply related [flat|nested] 17+ messages in thread* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-14 15:13 [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer @ 2026-05-15 15:09 ` kernel test robot 2026-05-16 6:55 ` Horst Birthelmer 2026-05-15 15:09 ` kernel test robot 2026-05-17 23:55 ` NeilBrown 2 siblings, 1 reply; 17+ messages in thread From: kernel test robot @ 2026-05-15 15:09 UTC (permalink / raw) To: Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara Cc: oe-kbuild-all, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer Hi Horst, kernel test robot noticed the following build errors: [auto build test ERROR on 5d6919055dec134de3c40167a490f33c74c12581] url: https://github.com/intel-lab-lkp/linux/commits/Horst-Birthelmer/dcache-add-fs-dentry-limit-sysctl-with-negative-first-reaper/20260515-154600 base: 5d6919055dec134de3c40167a490f33c74c12581 patch link: https://lore.kernel.org/r/20260514-limit-dentries-cache-v1-1-431b9eb0c530%40ddn.com patch subject: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper config: openrisc-randconfig-r073-20260515 (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/config) compiler: or1k-linux-gcc (GCC) 10.5.0 smatch: v0.5.0-9185-gbcc58b9c reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202605152333.0pOd2zJR-lkp@intel.com/ All errors (new ones prefixed by >>): fs/dcache.c: In function 'dentry_limit_worker_fn': >> fs/dcache.c:1474:7: error: implicit declaration of function 'get_nr_dentry'; did you mean 'retain_dentry'? [-Werror=implicit-function-declaration] 1474 | nr = get_nr_dentry(); | ^~~~~~~~~~~~~ | retain_dentry cc1: some warnings being treated as errors vim +1474 fs/dcache.c 1463 1464 static void dentry_limit_worker_fn(struct work_struct *work) 1465 { 1466 struct dentry_limit_ctx ctx; 1467 unsigned long limit = READ_ONCE(sysctl_dentry_limit); 1468 unsigned int ms; 1469 long nr; 1470 1471 if (!limit) 1472 return; 1473 > 1474 nr = get_nr_dentry(); 1475 if (nr <= (long)limit) 1476 return; 1477 1478 ctx.over = nr - (long)limit; 1479 1480 /* Phase 1: drain negative dentries across every superblock. */ 1481 ctx.isolate = dentry_lru_isolate_negative; 1482 iterate_supers(dentry_limit_prune_sb, &ctx); 1483 1484 /* Phase 2: still over? Apply the ordinary LRU policy. */ 1485 if (ctx.over > 0) { 1486 ctx.isolate = dentry_lru_isolate; 1487 iterate_supers(dentry_limit_prune_sb, &ctx); 1488 } 1489 1490 /* 1491 * Re-arm while still above the limit. Re-read the sysctls in 1492 * case the admin raised the cap or disabled the feature during 1493 * the walk. 1494 */ 1495 limit = READ_ONCE(sysctl_dentry_limit); 1496 if (!limit || get_nr_dentry() <= (long)limit) 1497 return; 1498 1499 ms = READ_ONCE(sysctl_dentry_limit_interval_ms); 1500 queue_delayed_work(system_unbound_wq, &dentry_limit_work, 1501 msecs_to_jiffies(ms)); 1502 } 1503 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-15 15:09 ` kernel test robot @ 2026-05-16 6:55 ` Horst Birthelmer 2026-05-16 10:33 ` Stafford Horne 0 siblings, 1 reply; 17+ messages in thread From: Horst Birthelmer @ 2026-05-16 6:55 UTC (permalink / raw) To: kernel test robot Cc: Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara, oe-kbuild-all, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Fri, May 15, 2026 at 11:09:54PM +0800, kernel test robot wrote: > Hi Horst, > > kernel test robot noticed the following build errors: > > [auto build test ERROR on 5d6919055dec134de3c40167a490f33c74c12581] > > url: https://github.com/intel-lab-lkp/linux/commits/Horst-Birthelmer/dcache-add-fs-dentry-limit-sysctl-with-negative-first-reaper/20260515-154600 > base: 5d6919055dec134de3c40167a490f33c74c12581 > patch link: https://lore.kernel.org/r/20260514-limit-dentries-cache-v1-1-431b9eb0c530%40ddn.com > patch subject: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper > config: openrisc-randconfig-r073-20260515 (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/config) > compiler: or1k-linux-gcc (GCC) 10.5.0 > smatch: v0.5.0-9185-gbcc58b9c > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/reproduce) > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > the same patch/commit), kindly add following tags > | Reported-by: kernel test robot <lkp@intel.com> > | Closes: https://lore.kernel.org/oe-kbuild-all/202605152333.0pOd2zJR-lkp@intel.com/ > > All errors (new ones prefixed by >>): > > fs/dcache.c: In function 'dentry_limit_worker_fn': > >> fs/dcache.c:1474:7: error: implicit declaration of function 'get_nr_dentry'; did you mean 'retain_dentry'? [-Werror=implicit-function-declaration] > 1474 | nr = get_nr_dentry(); > | ^~~~~~~~~~~~~ > | retain_dentry > cc1: some warnings being treated as errors > > > vim +1474 fs/dcache.c > ... > > -- > 0-DAY CI Kernel Test Service > https://github.com/intel/lkp-tests/wiki This is puzzling to me get_nr_dentry() is defined in line 178 in the same file and first used in line 209 and has been there since 2013. Builds fine applied to tag v7.1-rc3 and to the current master with gcc and clang. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-16 6:55 ` Horst Birthelmer @ 2026-05-16 10:33 ` Stafford Horne 2026-05-16 14:15 ` Horst Birthelmer 0 siblings, 1 reply; 17+ messages in thread From: Stafford Horne @ 2026-05-16 10:33 UTC (permalink / raw) To: Horst Birthelmer Cc: kernel test robot, Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara, oe-kbuild-all, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Sat, May 16, 2026 at 08:55:16AM +0200, Horst Birthelmer wrote: > On Fri, May 15, 2026 at 11:09:54PM +0800, kernel test robot wrote: > > Hi Horst, > > > > kernel test robot noticed the following build errors: > > > > [auto build test ERROR on 5d6919055dec134de3c40167a490f33c74c12581] > > > > url: https://github.com/intel-lab-lkp/linux/commits/Horst-Birthelmer/dcache-add-fs-dentry-limit-sysctl-with-negative-first-reaper/20260515-154600 > > base: 5d6919055dec134de3c40167a490f33c74c12581 > > patch link: https://lore.kernel.org/r/20260514-limit-dentries-cache-v1-1-431b9eb0c530%40ddn.com > > patch subject: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper > > config: openrisc-randconfig-r073-20260515 (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/config) > > compiler: or1k-linux-gcc (GCC) 10.5.0 > > smatch: v0.5.0-9185-gbcc58b9c > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/reproduce) > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > the same patch/commit), kindly add following tags > > | Reported-by: kernel test robot <lkp@intel.com> > > | Closes: https://lore.kernel.org/oe-kbuild-all/202605152333.0pOd2zJR-lkp@intel.com/ > > > > All errors (new ones prefixed by >>): > > > > fs/dcache.c: In function 'dentry_limit_worker_fn': > > >> fs/dcache.c:1474:7: error: implicit declaration of function 'get_nr_dentry'; did you mean 'retain_dentry'? [-Werror=implicit-function-declaration] > > 1474 | nr = get_nr_dentry(); > > | ^~~~~~~~~~~~~ > > | retain_dentry > > cc1: some warnings being treated as errors > > > > > > vim +1474 fs/dcache.c > > > ... > > > > -- > > 0-DAY CI Kernel Test Service > > https://github.com/intel/lkp-tests/wiki > > This is puzzling to me get_nr_dentry() is defined in line 178 in the same file and first used in line 209 > and has been there since 2013. > > Builds fine applied to tag v7.1-rc3 and to the current master with gcc and clang. Hi They are protected in: #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) With #endif on line 247. In the rand config as least I see: # CONFIG_PROC_FS is not set -Stafford ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Re: Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-16 10:33 ` Stafford Horne @ 2026-05-16 14:15 ` Horst Birthelmer 0 siblings, 0 replies; 17+ messages in thread From: Horst Birthelmer @ 2026-05-16 14:15 UTC (permalink / raw) To: Stafford Horne Cc: kernel test robot, Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara, oe-kbuild-all, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Sat, May 16, 2026 at 11:33:48AM +0100, Stafford Horne wrote: > On Sat, May 16, 2026 at 08:55:16AM +0200, Horst Birthelmer wrote: > > On Fri, May 15, 2026 at 11:09:54PM +0800, kernel test robot wrote: > > > Hi Horst, > > > > > > kernel test robot noticed the following build errors: > > > > > > [auto build test ERROR on 5d6919055dec134de3c40167a490f33c74c12581] > > > > > > url: https://github.com/intel-lab-lkp/linux/commits/Horst-Birthelmer/dcache-add-fs-dentry-limit-sysctl-with-negative-first-reaper/20260515-154600 > > > base: 5d6919055dec134de3c40167a490f33c74c12581 > > > patch link: https://lore.kernel.org/r/20260514-limit-dentries-cache-v1-1-431b9eb0c530%40ddn.com > > > patch subject: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper > > > config: openrisc-randconfig-r073-20260515 (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/config) > > > compiler: or1k-linux-gcc (GCC) 10.5.0 > > > smatch: v0.5.0-9185-gbcc58b9c > > > reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260515/202605152333.0pOd2zJR-lkp@intel.com/reproduce) > > > > > > If you fix the issue in a separate patch/commit (i.e. not just a new version of > > > the same patch/commit), kindly add following tags > > > | Reported-by: kernel test robot <lkp@intel.com> > > > | Closes: https://lore.kernel.org/oe-kbuild-all/202605152333.0pOd2zJR-lkp@intel.com/ > > > > > > All errors (new ones prefixed by >>): > > > > > > fs/dcache.c: In function 'dentry_limit_worker_fn': > > > >> fs/dcache.c:1474:7: error: implicit declaration of function 'get_nr_dentry'; did you mean 'retain_dentry'? [-Werror=implicit-function-declaration] > > > 1474 | nr = get_nr_dentry(); > > > | ^~~~~~~~~~~~~ > > > | retain_dentry > > > cc1: some warnings being treated as errors > > > > > > > > > vim +1474 fs/dcache.c > > > > > ... > > > > > > -- > > > 0-DAY CI Kernel Test Service > > > https://github.com/intel/lkp-tests/wiki > > > > This is puzzling to me get_nr_dentry() is defined in line 178 in the same file and first used in line 209 > > and has been there since 2013. > > > > Builds fine applied to tag v7.1-rc3 and to the current master with gcc and clang. > > Hi > > They are protected in: > > #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) > > With #endif on line 247. You are right, of course. Thank you! I will send a corrected version. > > In the rand config as least I see: > # CONFIG_PROC_FS is not set > > -Stafford > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-14 15:13 [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer 2026-05-15 15:09 ` kernel test robot @ 2026-05-15 15:09 ` kernel test robot 2026-05-17 23:55 ` NeilBrown 2 siblings, 0 replies; 17+ messages in thread From: kernel test robot @ 2026-05-15 15:09 UTC (permalink / raw) To: Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara Cc: llvm, oe-kbuild-all, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer Hi Horst, kernel test robot noticed the following build errors: [auto build test ERROR on 5d6919055dec134de3c40167a490f33c74c12581] url: https://github.com/intel-lab-lkp/linux/commits/Horst-Birthelmer/dcache-add-fs-dentry-limit-sysctl-with-negative-first-reaper/20260515-154600 base: 5d6919055dec134de3c40167a490f33c74c12581 patch link: https://lore.kernel.org/r/20260514-limit-dentries-cache-v1-1-431b9eb0c530%40ddn.com patch subject: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper config: s390-randconfig-002-20260515 (https://download.01.org/0day-ci/archive/20260515/202605152329.WHnEvZt7-lkp@intel.com/config) compiler: clang version 23.0.0git (https://github.com/llvm/llvm-project 5bac06718f502014fade905512f1d26d578a18f3) reproduce (this is a W=1 build): (https://download.01.org/0day-ci/archive/20260515/202605152329.WHnEvZt7-lkp@intel.com/reproduce) If you fix the issue in a separate patch/commit (i.e. not just a new version of the same patch/commit), kindly add following tags | Reported-by: kernel test robot <lkp@intel.com> | Closes: https://lore.kernel.org/oe-kbuild-all/202605152329.WHnEvZt7-lkp@intel.com/ All errors (new ones prefixed by >>): >> fs/dcache.c:1474:7: error: call to undeclared function 'get_nr_dentry'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 1474 | nr = get_nr_dentry(); | ^ fs/dcache.c:1474:7: note: did you mean 'retain_dentry'? fs/dcache.c:835:20: note: 'retain_dentry' declared here 835 | static inline bool retain_dentry(struct dentry *dentry, bool locked) | ^ fs/dcache.c:1519:6: error: call to undeclared function 'get_nr_dentry'; ISO C99 and later do not support implicit function declarations [-Wimplicit-function-declaration] 1519 | if (get_nr_dentry() <= (long)limit) | ^ 2 errors generated. vim +/get_nr_dentry +1474 fs/dcache.c 1463 1464 static void dentry_limit_worker_fn(struct work_struct *work) 1465 { 1466 struct dentry_limit_ctx ctx; 1467 unsigned long limit = READ_ONCE(sysctl_dentry_limit); 1468 unsigned int ms; 1469 long nr; 1470 1471 if (!limit) 1472 return; 1473 > 1474 nr = get_nr_dentry(); 1475 if (nr <= (long)limit) 1476 return; 1477 1478 ctx.over = nr - (long)limit; 1479 1480 /* Phase 1: drain negative dentries across every superblock. */ 1481 ctx.isolate = dentry_lru_isolate_negative; 1482 iterate_supers(dentry_limit_prune_sb, &ctx); 1483 1484 /* Phase 2: still over? Apply the ordinary LRU policy. */ 1485 if (ctx.over > 0) { 1486 ctx.isolate = dentry_lru_isolate; 1487 iterate_supers(dentry_limit_prune_sb, &ctx); 1488 } 1489 1490 /* 1491 * Re-arm while still above the limit. Re-read the sysctls in 1492 * case the admin raised the cap or disabled the feature during 1493 * the walk. 1494 */ 1495 limit = READ_ONCE(sysctl_dentry_limit); 1496 if (!limit || get_nr_dentry() <= (long)limit) 1497 return; 1498 1499 ms = READ_ONCE(sysctl_dentry_limit_interval_ms); 1500 queue_delayed_work(system_unbound_wq, &dentry_limit_work, 1501 msecs_to_jiffies(ms)); 1502 } 1503 -- 0-DAY CI Kernel Test Service https://github.com/intel/lkp-tests/wiki ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-14 15:13 [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer 2026-05-15 15:09 ` kernel test robot 2026-05-15 15:09 ` kernel test robot @ 2026-05-17 23:55 ` NeilBrown 2026-05-18 2:55 ` Ian Kent 2026-05-18 7:01 ` Horst Birthelmer 2 siblings, 2 replies; 17+ messages in thread From: NeilBrown @ 2026-05-17 23:55 UTC (permalink / raw) To: Horst Birthelmer Cc: Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Fri, 15 May 2026, Horst Birthelmer wrote: > From: Horst Birthelmer <hbirthelmer@ddn.com> > > The dcache only shrinks under memory pressure, which is rarely reached > on machines with ample RAM, so cached negative dentries can accumulate > without bound. Give administrators a soft cap they can set, > and a background worker that prefers negative dentries when reclaiming. > > Two new sysctls under /proc/sys/fs/: > > dentry-limit -- soft cap on nr_dentry. 0 (default) > disables the feature; behaviour is then > identical to before. Is a system-wide cap really a suitable tool? What guidance would you give to sysadmins who are considering setting a number? Is there a better approach? According to the email you linked, a problem arises when a directory has a great many negative children. Code which walks the list of children (such as fsnotify) while holding a lock can suffer unpredictable delays and result in long lock-hold times. So maybe a limit on negative dentries for any parent is what we really want. That would be clumsy to implement I imagine. But what if we move dentries to the end of the list when they become negative, and to the start of the list when they become positive? Then code which walks the child list could simply abort on the first negative. I doubt that would be quite as easy as it sounds, but it would at least be more focused on the observed symptom rather than some whole-system number which only vaguely correlates with the observed symptom. Maybe a completely different approach: change children-walking code to drop and retake the lock (with appropriate validation) periodically. What too would address the specific symptom. Thanks for attempting to resolve this issue, but I'm not convinced that you have found a good solution yet. NeilBrown > dentry-limit-interval-ms -- pacing for the worker while still over > the cap. Default 1000, minimum 1. > > When the cap is exceeded, a delayed_work runs in two phases: > > 1. iterate_supers() draining only negative dentries from every LRU. > Positive entries are rotated past so the walk makes progress. > DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed > cap should evict even hot negatives before any positive entry. > 2. If still over the cap, iterate_supers() again with the same > isolate callback the memory-pressure shrinker uses. > > Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> > --- > There was a discussion at LSFMM about servers with too many cached > negative dentries. > That gave me the idea to keep the dentries in general limited > if the system administrator needs it to. > > This is somewhat related to [1] where it would address the same > symptoms but in a more unobtrusive way, by just garbage collecting > the negative and then the unused cache entries. > > The other effect I have seen regarding this is that FUSE > will not forget inodes (no FORGET call to the FUSE server) > even after the latest reference has been closed until much later. > > In a FUSE server that mirrors the kernel cached inodes in user space > because it has to keep a lot of private data for every node > this puts an unnecessarry memory strain on that userspace entity > especially if the memory is limited for its cgroup. > > [1]: https://lore.kernel.org/linux-fsdevel/20260331012925.74840-1-raven@themaw.net/ > --- > Documentation/admin-guide/sysctl/fs.rst | 28 +++++ > fs/dcache.c | 197 ++++++++++++++++++++++++++++++++ > 2 files changed, 225 insertions(+) > > diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst > index 9b7f65c3efd8..0229aea45d85 100644 > --- a/Documentation/admin-guide/sysctl/fs.rst > +++ b/Documentation/admin-guide/sysctl/fs.rst > @@ -38,6 +38,34 @@ requests. ``aio-max-nr`` allows you to change the maximum value > ``aio-max-nr`` does not result in the > pre-allocation or re-sizing of any kernel data structures. > > +dentry-limit > +------------ > + > +Soft cap on the total number of dentries allocated system-wide (i.e. on > +``nr_dentry`` from ``dentry-state``). A value of ``0`` (the default) > +disables the feature and the dcache grows or shrinks only under memory > +pressure as before. > + > +When set to a non-zero value, a background worker is woken whenever > +the live dentry count exceeds the limit. The worker walks every > +superblock's LRU and prefers to evict negative dentries first; if it > +cannot get back under the limit using negative entries alone it falls > +back to the same LRU policy used by the memory-pressure shrinker. > + > +The limit is *soft*: allocations never fail because of it, and brief > +overshoots while the worker catches up are expected. Set the cap a > +comfortable margin above your steady-state working set. > + > +dentry-limit-interval-ms > +------------------------ > + > +How often, in milliseconds, the ``dentry-limit`` worker re-runs while > +``nr_dentry`` is still above the cap. Defaults to ``1000`` (one > +second); the minimum accepted value is ``1``. Smaller values trim the > +cache more aggressively at the cost of more CPU spent walking LRUs; > +larger values let temporary spikes ride out before any work is done. > +Has no effect when ``dentry-limit`` is ``0``. > + > dentry-negative > ---------------------------- > > diff --git a/fs/dcache.c b/fs/dcache.c > index 2c61aeea41f4..4959d2c011c0 100644 > --- a/fs/dcache.c > +++ b/fs/dcache.c > @@ -144,6 +144,19 @@ static DEFINE_PER_CPU(long, nr_dentry_unused); > static DEFINE_PER_CPU(long, nr_dentry_negative); > static int dentry_negative_policy; > > +/* > + * Soft cap on the total number of dentries. When non-zero and exceeded, > + * a background worker prunes unused dentries (preferring negative ones) > + * until we are back under the limit. Zero (the default) disables the > + * feature entirely; the fast path in __d_alloc() only pays the cost of > + * a READ_ONCE and a branch in that case. > + */ > +static unsigned long sysctl_dentry_limit __read_mostly; > +static unsigned int sysctl_dentry_limit_interval_ms __read_mostly = 1000; > +static unsigned long dentry_limit_last_kick; > + > +static void dentry_limit_kick(void); > + > #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) > /* Statistics gathering. */ > static struct dentry_stat_t dentry_stat = { > @@ -199,6 +212,20 @@ static int proc_nr_dentry(const struct ctl_table *table, int write, void *buffer > return proc_doulongvec_minmax(table, write, buffer, lenp, ppos); > } > > +/* > + * Writing fs.dentry-limit should give prompt feedback to admins > + * lowering the cap, so kick the worker on every successful write. > + */ > +static int proc_dentry_limit(const struct ctl_table *table, int write, > + void *buffer, size_t *lenp, loff_t *ppos) > +{ > + int ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos); > + > + if (write && !ret) > + dentry_limit_kick(); > + return ret; > +} > + > static const struct ctl_table fs_dcache_sysctls[] = { > { > .procname = "dentry-state", > @@ -207,6 +234,21 @@ static const struct ctl_table fs_dcache_sysctls[] = { > .mode = 0444, > .proc_handler = proc_nr_dentry, > }, > + { > + .procname = "dentry-limit", > + .data = &sysctl_dentry_limit, > + .maxlen = sizeof(sysctl_dentry_limit), > + .mode = 0644, > + .proc_handler = proc_dentry_limit, > + }, > + { > + .procname = "dentry-limit-interval-ms", > + .data = &sysctl_dentry_limit_interval_ms, > + .maxlen = sizeof(sysctl_dentry_limit_interval_ms), > + .mode = 0644, > + .proc_handler = proc_douintvec_minmax, > + .extra1 = SYSCTL_ONE, > + }, > { > .procname = "dentry-negative", > .data = &dentry_negative_policy, > @@ -1325,6 +1367,160 @@ static enum lru_status dentry_lru_isolate_shrink(struct list_head *item, > return LRU_REMOVED; > } > > +#define DENTRY_LIMIT_BATCH 1024UL > + > +static void dentry_limit_worker_fn(struct work_struct *work); > +static DECLARE_DELAYED_WORK(dentry_limit_work, dentry_limit_worker_fn); > + > +/* > + * Variant of dentry_lru_isolate() that only frees negative dentries. > + * DCACHE_REFERENCED is intentionally not honoured here: the whole point > + * of an admin-imposed cap on negatives is that even frequently-looked-up > + * negative entries should be evicted before any positive dentry. > + * Positive entries are rotated to the tail so the walk continues to > + * make progress without disturbing their LRU position. > + */ > +static enum lru_status dentry_lru_isolate_negative(struct list_head *item, > + struct list_lru_one *lru, void *arg) > +{ > + struct list_head *freeable = arg; > + struct dentry *dentry = container_of(item, struct dentry, d_lru); > + > + if (!spin_trylock(&dentry->d_lock)) > + return LRU_SKIP; > + > + /* Same handling as dentry_lru_isolate() for in-use entries. */ > + if (dentry->d_lockref.count) { > + d_lru_isolate(lru, dentry); > + spin_unlock(&dentry->d_lock); > + return LRU_REMOVED; > + } > + > + if (!d_is_negative(dentry)) { > + spin_unlock(&dentry->d_lock); > + return LRU_ROTATE; > + } > + > + d_lru_shrink_move(lru, dentry, freeable); > + spin_unlock(&dentry->d_lock); > + return LRU_REMOVED; > +} > + > +struct dentry_limit_ctx { > + long over; /* remaining dentries to evict */ > + list_lru_walk_cb isolate; > +}; > + > +static void dentry_limit_prune_sb(struct super_block *sb, void *arg) > +{ > + struct dentry_limit_ctx *ctx = arg; > + unsigned long walked = 0; > + unsigned long budget; > + > + if (ctx->over <= 0) > + return; > + > + /* > + * Walk up to one full pass of this superblock's LRU, in > + * DENTRY_LIMIT_BATCH-sized chunks. The loop matters mainly for > + * phase 1: dentry_lru_isolate_negative() returns LRU_ROTATE for > + * positive dentries, which still counts against list_lru_walk()'s > + * nr_to_walk. A single batch can therefore finish having freed > + * nothing when positives crowd the head of the LRU, and without > + * the inner loop the worker would have to wait a full > + * dentry-limit-interval-ms before retrying never reaching the > + * negatives buried behind a long run of positives. > + * > + * The budget is snapshot at entry so a filesystem allocating > + * dentries faster than we drain them can't keep us spinning here > + * forever; freshly added dentries are picked up on the next > + * worker invocation. > + * > + * Phase 2 normally exits much sooner: its isolate callback frees > + * any non-referenced dentry, so ctx->over typically hits zero > + * inside the first batch. The worst-case over-eviction is one > + * batch past the cap, which is within the soft semantics of > + * fs.dentry-limit. > + */ > + budget = list_lru_count(&sb->s_dentry_lru); > + > + while (ctx->over > 0 && walked < budget) { > + LIST_HEAD(dispose); > + unsigned long nr; > + long freed; > + > + nr = min(DENTRY_LIMIT_BATCH, budget - walked); > + freed = list_lru_walk(&sb->s_dentry_lru, ctx->isolate, > + &dispose, nr); > + shrink_dentry_list(&dispose); > + > + ctx->over -= freed; > + walked += nr; > + > + cond_resched(); > + } > +} > + > +static void dentry_limit_worker_fn(struct work_struct *work) > +{ > + struct dentry_limit_ctx ctx; > + unsigned long limit = READ_ONCE(sysctl_dentry_limit); > + unsigned int ms; > + long nr; > + > + if (!limit) > + return; > + > + nr = get_nr_dentry(); > + if (nr <= (long)limit) > + return; > + > + ctx.over = nr - (long)limit; > + > + /* Phase 1: drain negative dentries across every superblock. */ > + ctx.isolate = dentry_lru_isolate_negative; > + iterate_supers(dentry_limit_prune_sb, &ctx); > + > + /* Phase 2: still over? Apply the ordinary LRU policy. */ > + if (ctx.over > 0) { > + ctx.isolate = dentry_lru_isolate; > + iterate_supers(dentry_limit_prune_sb, &ctx); > + } > + > + /* > + * Re-arm while still above the limit. Re-read the sysctls in > + * case the admin raised the cap or disabled the feature during > + * the walk. > + */ > + limit = READ_ONCE(sysctl_dentry_limit); > + if (!limit || get_nr_dentry() <= (long)limit) > + return; > + > + ms = READ_ONCE(sysctl_dentry_limit_interval_ms); > + queue_delayed_work(system_unbound_wq, &dentry_limit_work, > + msecs_to_jiffies(ms)); > +} > + > +static void dentry_limit_kick(void) > +{ > + unsigned long limit = READ_ONCE(sysctl_dentry_limit); > + unsigned long now; > + > + if (!limit) > + return; > + if (delayed_work_pending(&dentry_limit_work)) > + return; > + > + now = jiffies; > + if (time_before(now, READ_ONCE(dentry_limit_last_kick) + HZ / 10)) > + return; > + WRITE_ONCE(dentry_limit_last_kick, now); > + > + if (get_nr_dentry() <= (long)limit) > + return; > + > + queue_delayed_work(system_unbound_wq, &dentry_limit_work, 0); > +} > > /** > * shrink_dcache_sb - shrink dcache for a superblock > @@ -1868,6 +2064,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) > } > > this_cpu_inc(nr_dentry); > + dentry_limit_kick(); > > return dentry; > } > > --- > base-commit: 5d6919055dec134de3c40167a490f33c74c12581 > change-id: 20260513-limit-dentries-cache-63685729672b > > Best regards, > -- > Horst Birthelmer <hbirthelmer@ddn.com> > > > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-17 23:55 ` NeilBrown @ 2026-05-18 2:55 ` Ian Kent 2026-05-18 8:19 ` Jan Kara 2026-05-18 7:01 ` Horst Birthelmer 1 sibling, 1 reply; 17+ messages in thread From: Ian Kent @ 2026-05-18 2:55 UTC (permalink / raw) To: NeilBrown, Horst Birthelmer, Amir Goldstein Cc: Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer Hi Neil, ;) I just happen to have been caught by this problem too! See below. On 18/5/26 07:55, NeilBrown wrote: > On Fri, 15 May 2026, Horst Birthelmer wrote: >> From: Horst Birthelmer <hbirthelmer@ddn.com> >> >> The dcache only shrinks under memory pressure, which is rarely reached >> on machines with ample RAM, so cached negative dentries can accumulate >> without bound. Give administrators a soft cap they can set, >> and a background worker that prefers negative dentries when reclaiming. >> >> Two new sysctls under /proc/sys/fs/: >> >> dentry-limit -- soft cap on nr_dentry. 0 (default) >> disables the feature; behaviour is then >> identical to before. > Is a system-wide cap really a suitable tool? What guidance would you > give to sysadmins who are considering setting a number? > > Is there a better approach? That's a good question. In my RFC (with a different, almost trivial, approach) to get a feel for what people thought of the idea of limiting dentry going to the LRU a number of similar comments came up. The commit e6957c99dca5 (("vfs: Delete the associated dentry when deleting a file") is something like what we want but also doesn't use a limit and so lacks a calculation to work that out. It's probably a little bit "too" aggressive. Worse still my claim in the RFC that many entries in the dcache can lead to a performance problem with long hash chains didn't hold true when I looked closer at it. For example (and TBH I've forgotten the actual numbers now but these should be close-ish), with the dcache at 12 Million entries with around 8 million negative, AFAICT, leads to an average hash chain length of around 12 which is a little too long but not terrible by any means. In my experience an average chain length of 8 or less performs really well so this can't realistically be used and by this time the problem is already evident. Leaving me at a loss for a reasonable way to calculate this. > > According to the email you linked, a problem arises when a directory has > a great many negative children. Code which walks the list of children > (such as fsnotify) while holding a lock can suffer unpredictable delays > and result in long lock-hold times. So maybe a limit on negative > dentries for any parent is what we really want. That would be clumsy to > implement I imagine. But the notion of dropping the dentry in ->d_delete() on last dput() is simple enough but did see regressions (the only other place in the VFS besides dentry_kill() that the inode is unlinked from the dentry on dput()). I wonder if the regression wwas related to the test itself deliberately recreating deleted files and if that really is normal behaviour. By itself that should prevent almost all negative dentries being retained. Although file systems could do this as well (think XFS inode recycling) it should be reasonable to require it be left to the VFS. But even that's not enough given that, in my case, there would still be around 4 million dentries in the LRU cache and in fsnotify there are directory child traversals holding the parent i_lock "spinlock" that are going to cause problems. That's all that much more puzzling when I see things like commit 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily") which looks like it implies the child flag depends entirely on the parent state (what am I missing Amir?) so why is this traversal even retained in fsnotify? > > But what if we move dentries to the end of the list when they become > negative, and to the start of the list when they become positive? Then > code which walks the child list could simply abort on the first > negative. > > I doubt that would be quite as easy as it sounds, but it would at least > be more focused on the observed symptom rather than some whole-system > number which only vaguely correlates with the observed symptom. > > Maybe a completely different approach: change children-walking code to > drop and retake the lock (with appropriate validation) periodically. > What too would address the specific symptom. Another good question. I have assumed that dropping and re-taking the lock cannot be done but this is a question I would like answered as well. Dropping and re-taking lock would require, as Miklos pointed out to me off-list, recording the list position with say a cursor, introducing unwanted complexity when it would be better to accept the cost of a single extra access to the parent flags (which I assume is one reason to set the flag in the child). > > Thanks for attempting to resolve this issue, but I'm not convinced that > you have found a good solution yet. This same sort of issue comes up again and again and I have thought about it many times without actually useful ideas and it seems like I'm not alone, ;) Ian > > NeilBrown > > > >> dentry-limit-interval-ms -- pacing for the worker while still over >> the cap. Default 1000, minimum 1. >> >> When the cap is exceeded, a delayed_work runs in two phases: >> >> 1. iterate_supers() draining only negative dentries from every LRU. >> Positive entries are rotated past so the walk makes progress. >> DCACHE_REFERENCED is ignored here on purpose -- an admin-imposed >> cap should evict even hot negatives before any positive entry. >> 2. If still over the cap, iterate_supers() again with the same >> isolate callback the memory-pressure shrinker uses. >> >> Signed-off-by: Horst Birthelmer <hbirthelmer@ddn.com> >> --- >> There was a discussion at LSFMM about servers with too many cached >> negative dentries. >> That gave me the idea to keep the dentries in general limited >> if the system administrator needs it to. >> >> This is somewhat related to [1] where it would address the same >> symptoms but in a more unobtrusive way, by just garbage collecting >> the negative and then the unused cache entries. >> >> The other effect I have seen regarding this is that FUSE >> will not forget inodes (no FORGET call to the FUSE server) >> even after the latest reference has been closed until much later. >> >> In a FUSE server that mirrors the kernel cached inodes in user space >> because it has to keep a lot of private data for every node >> this puts an unnecessarry memory strain on that userspace entity >> especially if the memory is limited for its cgroup. >> >> [1]: https://lore.kernel.org/linux-fsdevel/20260331012925.74840-1-raven@themaw.net/ >> --- >> Documentation/admin-guide/sysctl/fs.rst | 28 +++++ >> fs/dcache.c | 197 ++++++++++++++++++++++++++++++++ >> 2 files changed, 225 insertions(+) >> >> diff --git a/Documentation/admin-guide/sysctl/fs.rst b/Documentation/admin-guide/sysctl/fs.rst >> index 9b7f65c3efd8..0229aea45d85 100644 >> --- a/Documentation/admin-guide/sysctl/fs.rst >> +++ b/Documentation/admin-guide/sysctl/fs.rst >> @@ -38,6 +38,34 @@ requests. ``aio-max-nr`` allows you to change the maximum value >> ``aio-max-nr`` does not result in the >> pre-allocation or re-sizing of any kernel data structures. >> >> +dentry-limit >> +------------ >> + >> +Soft cap on the total number of dentries allocated system-wide (i.e. on >> +``nr_dentry`` from ``dentry-state``). A value of ``0`` (the default) >> +disables the feature and the dcache grows or shrinks only under memory >> +pressure as before. >> + >> +When set to a non-zero value, a background worker is woken whenever >> +the live dentry count exceeds the limit. The worker walks every >> +superblock's LRU and prefers to evict negative dentries first; if it >> +cannot get back under the limit using negative entries alone it falls >> +back to the same LRU policy used by the memory-pressure shrinker. >> + >> +The limit is *soft*: allocations never fail because of it, and brief >> +overshoots while the worker catches up are expected. Set the cap a >> +comfortable margin above your steady-state working set. >> + >> +dentry-limit-interval-ms >> +------------------------ >> + >> +How often, in milliseconds, the ``dentry-limit`` worker re-runs while >> +``nr_dentry`` is still above the cap. Defaults to ``1000`` (one >> +second); the minimum accepted value is ``1``. Smaller values trim the >> +cache more aggressively at the cost of more CPU spent walking LRUs; >> +larger values let temporary spikes ride out before any work is done. >> +Has no effect when ``dentry-limit`` is ``0``. >> + >> dentry-negative >> ---------------------------- >> >> diff --git a/fs/dcache.c b/fs/dcache.c >> index 2c61aeea41f4..4959d2c011c0 100644 >> --- a/fs/dcache.c >> +++ b/fs/dcache.c >> @@ -144,6 +144,19 @@ static DEFINE_PER_CPU(long, nr_dentry_unused); >> static DEFINE_PER_CPU(long, nr_dentry_negative); >> static int dentry_negative_policy; >> >> +/* >> + * Soft cap on the total number of dentries. When non-zero and exceeded, >> + * a background worker prunes unused dentries (preferring negative ones) >> + * until we are back under the limit. Zero (the default) disables the >> + * feature entirely; the fast path in __d_alloc() only pays the cost of >> + * a READ_ONCE and a branch in that case. >> + */ >> +static unsigned long sysctl_dentry_limit __read_mostly; >> +static unsigned int sysctl_dentry_limit_interval_ms __read_mostly = 1000; >> +static unsigned long dentry_limit_last_kick; >> + >> +static void dentry_limit_kick(void); >> + >> #if defined(CONFIG_SYSCTL) && defined(CONFIG_PROC_FS) >> /* Statistics gathering. */ >> static struct dentry_stat_t dentry_stat = { >> @@ -199,6 +212,20 @@ static int proc_nr_dentry(const struct ctl_table *table, int write, void *buffer >> return proc_doulongvec_minmax(table, write, buffer, lenp, ppos); >> } >> >> +/* >> + * Writing fs.dentry-limit should give prompt feedback to admins >> + * lowering the cap, so kick the worker on every successful write. >> + */ >> +static int proc_dentry_limit(const struct ctl_table *table, int write, >> + void *buffer, size_t *lenp, loff_t *ppos) >> +{ >> + int ret = proc_doulongvec_minmax(table, write, buffer, lenp, ppos); >> + >> + if (write && !ret) >> + dentry_limit_kick(); >> + return ret; >> +} >> + >> static const struct ctl_table fs_dcache_sysctls[] = { >> { >> .procname = "dentry-state", >> @@ -207,6 +234,21 @@ static const struct ctl_table fs_dcache_sysctls[] = { >> .mode = 0444, >> .proc_handler = proc_nr_dentry, >> }, >> + { >> + .procname = "dentry-limit", >> + .data = &sysctl_dentry_limit, >> + .maxlen = sizeof(sysctl_dentry_limit), >> + .mode = 0644, >> + .proc_handler = proc_dentry_limit, >> + }, >> + { >> + .procname = "dentry-limit-interval-ms", >> + .data = &sysctl_dentry_limit_interval_ms, >> + .maxlen = sizeof(sysctl_dentry_limit_interval_ms), >> + .mode = 0644, >> + .proc_handler = proc_douintvec_minmax, >> + .extra1 = SYSCTL_ONE, >> + }, >> { >> .procname = "dentry-negative", >> .data = &dentry_negative_policy, >> @@ -1325,6 +1367,160 @@ static enum lru_status dentry_lru_isolate_shrink(struct list_head *item, >> return LRU_REMOVED; >> } >> >> +#define DENTRY_LIMIT_BATCH 1024UL >> + >> +static void dentry_limit_worker_fn(struct work_struct *work); >> +static DECLARE_DELAYED_WORK(dentry_limit_work, dentry_limit_worker_fn); >> + >> +/* >> + * Variant of dentry_lru_isolate() that only frees negative dentries. >> + * DCACHE_REFERENCED is intentionally not honoured here: the whole point >> + * of an admin-imposed cap on negatives is that even frequently-looked-up >> + * negative entries should be evicted before any positive dentry. >> + * Positive entries are rotated to the tail so the walk continues to >> + * make progress without disturbing their LRU position. >> + */ >> +static enum lru_status dentry_lru_isolate_negative(struct list_head *item, >> + struct list_lru_one *lru, void *arg) >> +{ >> + struct list_head *freeable = arg; >> + struct dentry *dentry = container_of(item, struct dentry, d_lru); >> + >> + if (!spin_trylock(&dentry->d_lock)) >> + return LRU_SKIP; >> + >> + /* Same handling as dentry_lru_isolate() for in-use entries. */ >> + if (dentry->d_lockref.count) { >> + d_lru_isolate(lru, dentry); >> + spin_unlock(&dentry->d_lock); >> + return LRU_REMOVED; >> + } >> + >> + if (!d_is_negative(dentry)) { >> + spin_unlock(&dentry->d_lock); >> + return LRU_ROTATE; >> + } >> + >> + d_lru_shrink_move(lru, dentry, freeable); >> + spin_unlock(&dentry->d_lock); >> + return LRU_REMOVED; >> +} >> + >> +struct dentry_limit_ctx { >> + long over; /* remaining dentries to evict */ >> + list_lru_walk_cb isolate; >> +}; >> + >> +static void dentry_limit_prune_sb(struct super_block *sb, void *arg) >> +{ >> + struct dentry_limit_ctx *ctx = arg; >> + unsigned long walked = 0; >> + unsigned long budget; >> + >> + if (ctx->over <= 0) >> + return; >> + >> + /* >> + * Walk up to one full pass of this superblock's LRU, in >> + * DENTRY_LIMIT_BATCH-sized chunks. The loop matters mainly for >> + * phase 1: dentry_lru_isolate_negative() returns LRU_ROTATE for >> + * positive dentries, which still counts against list_lru_walk()'s >> + * nr_to_walk. A single batch can therefore finish having freed >> + * nothing when positives crowd the head of the LRU, and without >> + * the inner loop the worker would have to wait a full >> + * dentry-limit-interval-ms before retrying never reaching the >> + * negatives buried behind a long run of positives. >> + * >> + * The budget is snapshot at entry so a filesystem allocating >> + * dentries faster than we drain them can't keep us spinning here >> + * forever; freshly added dentries are picked up on the next >> + * worker invocation. >> + * >> + * Phase 2 normally exits much sooner: its isolate callback frees >> + * any non-referenced dentry, so ctx->over typically hits zero >> + * inside the first batch. The worst-case over-eviction is one >> + * batch past the cap, which is within the soft semantics of >> + * fs.dentry-limit. >> + */ >> + budget = list_lru_count(&sb->s_dentry_lru); >> + >> + while (ctx->over > 0 && walked < budget) { >> + LIST_HEAD(dispose); >> + unsigned long nr; >> + long freed; >> + >> + nr = min(DENTRY_LIMIT_BATCH, budget - walked); >> + freed = list_lru_walk(&sb->s_dentry_lru, ctx->isolate, >> + &dispose, nr); >> + shrink_dentry_list(&dispose); >> + >> + ctx->over -= freed; >> + walked += nr; >> + >> + cond_resched(); >> + } >> +} >> + >> +static void dentry_limit_worker_fn(struct work_struct *work) >> +{ >> + struct dentry_limit_ctx ctx; >> + unsigned long limit = READ_ONCE(sysctl_dentry_limit); >> + unsigned int ms; >> + long nr; >> + >> + if (!limit) >> + return; >> + >> + nr = get_nr_dentry(); >> + if (nr <= (long)limit) >> + return; >> + >> + ctx.over = nr - (long)limit; >> + >> + /* Phase 1: drain negative dentries across every superblock. */ >> + ctx.isolate = dentry_lru_isolate_negative; >> + iterate_supers(dentry_limit_prune_sb, &ctx); >> + >> + /* Phase 2: still over? Apply the ordinary LRU policy. */ >> + if (ctx.over > 0) { >> + ctx.isolate = dentry_lru_isolate; >> + iterate_supers(dentry_limit_prune_sb, &ctx); >> + } >> + >> + /* >> + * Re-arm while still above the limit. Re-read the sysctls in >> + * case the admin raised the cap or disabled the feature during >> + * the walk. >> + */ >> + limit = READ_ONCE(sysctl_dentry_limit); >> + if (!limit || get_nr_dentry() <= (long)limit) >> + return; >> + >> + ms = READ_ONCE(sysctl_dentry_limit_interval_ms); >> + queue_delayed_work(system_unbound_wq, &dentry_limit_work, >> + msecs_to_jiffies(ms)); >> +} >> + >> +static void dentry_limit_kick(void) >> +{ >> + unsigned long limit = READ_ONCE(sysctl_dentry_limit); >> + unsigned long now; >> + >> + if (!limit) >> + return; >> + if (delayed_work_pending(&dentry_limit_work)) >> + return; >> + >> + now = jiffies; >> + if (time_before(now, READ_ONCE(dentry_limit_last_kick) + HZ / 10)) >> + return; >> + WRITE_ONCE(dentry_limit_last_kick, now); >> + >> + if (get_nr_dentry() <= (long)limit) >> + return; >> + >> + queue_delayed_work(system_unbound_wq, &dentry_limit_work, 0); >> +} >> >> /** >> * shrink_dcache_sb - shrink dcache for a superblock >> @@ -1868,6 +2064,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name) >> } >> >> this_cpu_inc(nr_dentry); >> + dentry_limit_kick(); >> >> return dentry; >> } >> >> --- >> base-commit: 5d6919055dec134de3c40167a490f33c74c12581 >> change-id: 20260513-limit-dentries-cache-63685729672b >> >> Best regards, >> -- >> Horst Birthelmer <hbirthelmer@ddn.com> >> >> >> > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-18 2:55 ` Ian Kent @ 2026-05-18 8:19 ` Jan Kara 2026-05-18 13:39 ` Ian Kent 0 siblings, 1 reply; 17+ messages in thread From: Jan Kara @ 2026-05-18 8:19 UTC (permalink / raw) To: Ian Kent Cc: NeilBrown, Horst Birthelmer, Amir Goldstein, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer Hi Ian, On Mon 18-05-26 10:55:43, Ian Kent wrote: > On 18/5/26 07:55, NeilBrown wrote: > > On Fri, 15 May 2026, Horst Birthelmer wrote: > > According to the email you linked, a problem arises when a directory has > > a great many negative children. Code which walks the list of children > > (such as fsnotify) while holding a lock can suffer unpredictable delays > > and result in long lock-hold times. So maybe a limit on negative > > dentries for any parent is what we really want. That would be clumsy to > > implement I imagine. > > But the notion of dropping the dentry in ->d_delete() on last dput() is > simple enough but did see regressions (the only other place in the VFS > besides dentry_kill() that the inode is unlinked from the dentry on > dput()). I wonder if the regression was related to the test itself > deliberately recreating deleted files and if that really is normal > behaviour. By itself that should prevent almost all negative dentries > being retained. Although file systems could do this as well (think XFS > inode recycling) it should be reasonable to require it be left to the > VFS. > > But even that's not enough given that, in my case, there would still be > around 4 million dentries in the LRU cache and in fsnotify there are > directory child traversals holding the parent i_lock "spinlock" that are > going to cause problems. Do you mean there are very many positive children of a directory? > That's all that much more puzzling when I see things like commit > 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily") which looks > like it implies the child flag depends entirely on the parent state (what > am I missing Amir?) PARENT_WATCHED dentry flags (as the name suggests) are only caching the information whether the parent has notification marks receiving events from the child. So yes, the flag fully depends on the parent state. > so why is this traversal even retained in fsnotify? Not sure which traversal you mean but if you set watch on a parent, you have to walk all children to set PARENT_WATCHED flag so that you don't miss events on children... > > But what if we move dentries to the end of the list when they become > > negative, and to the start of the list when they become positive? Then > > code which walks the child list could simply abort on the first > > negative. > > > > I doubt that would be quite as easy as it sounds, but it would at least > > be more focused on the observed symptom rather than some whole-system > > number which only vaguely correlates with the observed symptom. > > > > Maybe a completely different approach: change children-walking code to > > drop and retake the lock (with appropriate validation) periodically. > > What too would address the specific symptom. > > Another good question. > > I have assumed that dropping and re-taking the lock cannot be done but > this is a question I would like answered as well. Dropping and re-taking > lock would require, as Miklos pointed out to me off-list, recording the > list position with say a cursor, introducing unwanted complexity when it > would be better to accept the cost of a single extra access to the parent > flags (which I assume is one reason to set the flag in the child). The parent access is actually more expensive than you might think. Based on experience with past fsnotify related performance regression I expect some 20% performance hit for small tmpfs writes if you add unconditional parent access to the write path. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-18 8:19 ` Jan Kara @ 2026-05-18 13:39 ` Ian Kent 2026-05-19 9:12 ` Jan Kara 0 siblings, 1 reply; 17+ messages in thread From: Ian Kent @ 2026-05-18 13:39 UTC (permalink / raw) To: Jan Kara Cc: NeilBrown, Horst Birthelmer, Amir Goldstein, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On 18/5/26 16:19, Jan Kara wrote: > Hi Ian, > > On Mon 18-05-26 10:55:43, Ian Kent wrote: >> On 18/5/26 07:55, NeilBrown wrote: >>> On Fri, 15 May 2026, Horst Birthelmer wrote: >>> According to the email you linked, a problem arises when a directory has >>> a great many negative children. Code which walks the list of children >>> (such as fsnotify) while holding a lock can suffer unpredictable delays >>> and result in long lock-hold times. So maybe a limit on negative >>> dentries for any parent is what we really want. That would be clumsy to >>> implement I imagine. >> But the notion of dropping the dentry in ->d_delete() on last dput() is >> simple enough but did see regressions (the only other place in the VFS >> besides dentry_kill() that the inode is unlinked from the dentry on >> dput()). I wonder if the regression was related to the test itself >> deliberately recreating deleted files and if that really is normal >> behaviour. By itself that should prevent almost all negative dentries >> being retained. Although file systems could do this as well (think XFS >> inode recycling) it should be reasonable to require it be left to the >> VFS. >> >> But even that's not enough given that, in my case, there would still be >> around 4 million dentries in the LRU cache and in fsnotify there are >> directory child traversals holding the parent i_lock "spinlock" that are >> going to cause problems. > Do you mean there are very many positive children of a directory? Didn't quantify that. The symptom is the "Spinlock held for more than ... seconds" occurring in the log. So there are certainly a lot of children in the list, but it's an assumption the ratio of positive to negative entries is roughly the same as the overall ratio in the dcache. > >> That's all that much more puzzling when I see things like commit >> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily") which looks >> like it implies the child flag depends entirely on the parent state (what >> am I missing Amir?) > PARENT_WATCHED dentry flags (as the name suggests) are only caching the > information whether the parent has notification marks receiving events from > the child. So yes, the flag fully depends on the parent state. Ok, this is something I was after, I will keep looking at the fsnotify code since there is something to find, thanks for that. > >> so why is this traversal even retained in fsnotify? > Not sure which traversal you mean but if you set watch on a parent, you > have to walk all children to set PARENT_WATCHED flag so that you don't miss > events on children... Yes, that traversal is what I'm questioning ... again thanks. I think the function name is still fsnotify_set_children_dentry_flags() in recent kernels, the subject of commit 172e422ffea2 I mentioned above. When you say miss events are you saying that accessing the parent dentry to work out if the child needs to respond to an event is quite expensive in the overall event processing context, that might make more sense to me ... or do I completely not yet understand the reasoning behind the need for the flag? > >>> But what if we move dentries to the end of the list when they become >>> negative, and to the start of the list when they become positive? Then >>> code which walks the child list could simply abort on the first >>> negative. >>> >>> I doubt that would be quite as easy as it sounds, but it would at least >>> be more focused on the observed symptom rather than some whole-system >>> number which only vaguely correlates with the observed symptom. >>> >>> Maybe a completely different approach: change children-walking code to >>> drop and retake the lock (with appropriate validation) periodically. >>> What too would address the specific symptom. >> Another good question. >> >> I have assumed that dropping and re-taking the lock cannot be done but >> this is a question I would like answered as well. Dropping and re-taking >> lock would require, as Miklos pointed out to me off-list, recording the >> list position with say a cursor, introducing unwanted complexity when it >> would be better to accept the cost of a single extra access to the parent >> flags (which I assume is one reason to set the flag in the child). > The parent access is actually more expensive than you might think. Based on > experience with past fsnotify related performance regression I expect some > 20% performance hit for small tmpfs writes if you add unconditional parent > access to the write path. That sounds like a lot for what should be a memory access of an already in memory structure since the parent must be accessed to traverse the list of child entries. I clearly don't fully understand the implications of what I'm saying but there has been mention of another context ... Nevertheless more useful information, ;) Thanks again, Ian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-18 13:39 ` Ian Kent @ 2026-05-19 9:12 ` Jan Kara 2026-05-20 7:16 ` Ian Kent 0 siblings, 1 reply; 17+ messages in thread From: Jan Kara @ 2026-05-19 9:12 UTC (permalink / raw) To: Ian Kent Cc: Jan Kara, NeilBrown, Horst Birthelmer, Amir Goldstein, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Mon 18-05-26 21:39:13, Ian Kent wrote: > On 18/5/26 16:19, Jan Kara wrote: > > Hi Ian, > > > > On Mon 18-05-26 10:55:43, Ian Kent wrote: > > > On 18/5/26 07:55, NeilBrown wrote: > > > > On Fri, 15 May 2026, Horst Birthelmer wrote: > > > > According to the email you linked, a problem arises when a directory has > > > > a great many negative children. Code which walks the list of children > > > > (such as fsnotify) while holding a lock can suffer unpredictable delays > > > > and result in long lock-hold times. So maybe a limit on negative > > > > dentries for any parent is what we really want. That would be clumsy to > > > > implement I imagine. > > > But the notion of dropping the dentry in ->d_delete() on last dput() is > > > simple enough but did see regressions (the only other place in the VFS > > > besides dentry_kill() that the inode is unlinked from the dentry on > > > dput()). I wonder if the regression was related to the test itself > > > deliberately recreating deleted files and if that really is normal > > > behaviour. By itself that should prevent almost all negative dentries > > > being retained. Although file systems could do this as well (think XFS > > > inode recycling) it should be reasonable to require it be left to the > > > VFS. > > > > > > But even that's not enough given that, in my case, there would still be > > > around 4 million dentries in the LRU cache and in fsnotify there are > > > directory child traversals holding the parent i_lock "spinlock" that are > > > going to cause problems. > > Do you mean there are very many positive children of a directory? > > Didn't quantify that. > > The symptom is the "Spinlock held for more than ... seconds" occurring in > the log. So there are certainly a lot of children in the list, but it's > an assumption the ratio of positive to negative entries is roughly the > same as the overall ratio in the dcache. OK, but that's not necessarily true. I have seen these complaints from the kernel but in all the cases I remember it was due to negative dentries accumultating in a particular directory. There are certain apps such as ElasticSearch which really do like creating huge amounts of negative dentries in one directory - they use hashes as filenames and use directory lookup instead of a DB table lookup and lookup lots of non-existent keys... > > > so why is this traversal even retained in fsnotify? > > Not sure which traversal you mean but if you set watch on a parent, you > > have to walk all children to set PARENT_WATCHED flag so that you don't miss > > events on children... > > Yes, that traversal is what I'm questioning ... again thanks. > > I think the function name is still fsnotify_set_children_dentry_flags() > in recent kernels, the subject of commit 172e422ffea2 I mentioned above. OK, thanks. > When you say miss events are you saying that accessing the parent dentry to > work out if the child needs to respond to an event is quite expensive in the > overall event processing context, that might make more sense to me ... or do > I completely not yet understand the reasoning behind the need for the flag? Close but not quite. The cost is the overhead of dget_parent() in fsnotify_parent() which is often a couple of cache cold loads and atomic instructions to find out we don't need to send any event for the current write(2) or read(2) call. It gets worse if there are many IOs happening to dentries in the same directory from multiple CPUs because instead of cache-cold loads you get a cacheline contention on the parent. > > > > But what if we move dentries to the end of the list when they become > > > > negative, and to the start of the list when they become positive? Then > > > > code which walks the child list could simply abort on the first > > > > negative. > > > > > > > > I doubt that would be quite as easy as it sounds, but it would at least > > > > be more focused on the observed symptom rather than some whole-system > > > > number which only vaguely correlates with the observed symptom. > > > > > > > > Maybe a completely different approach: change children-walking code to > > > > drop and retake the lock (with appropriate validation) periodically. > > > > What too would address the specific symptom. > > > Another good question. > > > > > > I have assumed that dropping and re-taking the lock cannot be done but > > > this is a question I would like answered as well. Dropping and re-taking > > > lock would require, as Miklos pointed out to me off-list, recording the > > > list position with say a cursor, introducing unwanted complexity when it > > > would be better to accept the cost of a single extra access to the parent > > > flags (which I assume is one reason to set the flag in the child). > > The parent access is actually more expensive than you might think. Based on > > experience with past fsnotify related performance regression I expect some > > 20% performance hit for small tmpfs writes if you add unconditional parent > > access to the write path. > > That sounds like a lot for what should be a memory access of an already in > memory structure since the parent must be accessed to traverse the list of > child entries. I clearly don't fully understand the implications of what > I'm saying but there has been mention of another context ... Parent dentry is of course in memory but often cache cold - you don't need the parent to do e.g. write(2) to an already open file. You seem to be somewhat confused about the child dentry list traversal (or maybe I'm misunderstanding) - that happens only when placing the notification mark but definitely not for each IO operation. Honza -- Jan Kara <jack@suse.com> SUSE Labs, CR ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-19 9:12 ` Jan Kara @ 2026-05-20 7:16 ` Ian Kent 2026-05-20 9:43 ` Amir Goldstein 0 siblings, 1 reply; 17+ messages in thread From: Ian Kent @ 2026-05-20 7:16 UTC (permalink / raw) To: Jan Kara Cc: NeilBrown, Horst Birthelmer, Amir Goldstein, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On 19/5/26 17:12, Jan Kara wrote: > On Mon 18-05-26 21:39:13, Ian Kent wrote: >> On 18/5/26 16:19, Jan Kara wrote: >>> Hi Ian, >>> >>> On Mon 18-05-26 10:55:43, Ian Kent wrote: >>>> On 18/5/26 07:55, NeilBrown wrote: >>>>> On Fri, 15 May 2026, Horst Birthelmer wrote: >>>>> According to the email you linked, a problem arises when a directory has >>>>> a great many negative children. Code which walks the list of children >>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays >>>>> and result in long lock-hold times. So maybe a limit on negative >>>>> dentries for any parent is what we really want. That would be clumsy to >>>>> implement I imagine. >>>> But the notion of dropping the dentry in ->d_delete() on last dput() is >>>> simple enough but did see regressions (the only other place in the VFS >>>> besides dentry_kill() that the inode is unlinked from the dentry on >>>> dput()). I wonder if the regression was related to the test itself >>>> deliberately recreating deleted files and if that really is normal >>>> behaviour. By itself that should prevent almost all negative dentries >>>> being retained. Although file systems could do this as well (think XFS >>>> inode recycling) it should be reasonable to require it be left to the >>>> VFS. >>>> >>>> But even that's not enough given that, in my case, there would still be >>>> around 4 million dentries in the LRU cache and in fsnotify there are >>>> directory child traversals holding the parent i_lock "spinlock" that are >>>> going to cause problems. >>> Do you mean there are very many positive children of a directory? >> Didn't quantify that. >> >> The symptom is the "Spinlock held for more than ... seconds" occurring in >> the log. So there are certainly a lot of children in the list, but it's >> an assumption the ratio of positive to negative entries is roughly the >> same as the overall ratio in the dcache. > OK, but that's not necessarily true. I have seen these complaints from the > kernel but in all the cases I remember it was due to negative dentries > accumultating in a particular directory. There are certain apps such as > ElasticSearch which really do like creating huge amounts of negative > dentries in one directory - they use hashes as filenames and use directory > lookup instead of a DB table lookup and lookup lots of non-existent keys... Umm ... that's a good point, I hadn't paid much attention to ENOENT result lookups, I'll need to check on the like cycle of those, I think they do get hashed. That has to be the other source of negative dentries that I've neglected ... > >>>> so why is this traversal even retained in fsnotify? >>> Not sure which traversal you mean but if you set watch on a parent, you >>> have to walk all children to set PARENT_WATCHED flag so that you don't miss >>> events on children... >> Yes, that traversal is what I'm questioning ... again thanks. >> >> I think the function name is still fsnotify_set_children_dentry_flags() >> in recent kernels, the subject of commit 172e422ffea2 I mentioned above. > OK, thanks. > >> When you say miss events are you saying that accessing the parent dentry to >> work out if the child needs to respond to an event is quite expensive in the >> overall event processing context, that might make more sense to me ... or do >> I completely not yet understand the reasoning behind the need for the flag? > Close but not quite. The cost is the overhead of dget_parent() in > fsnotify_parent() which is often a couple of cache cold loads and atomic > instructions to find out we don't need to send any event for the current > write(2) or read(2) call. It gets worse if there are many IOs happening to > dentries in the same directory from multiple CPUs because instead of > cache-cold loads you get a cacheline contention on the parent. > >>>>> But what if we move dentries to the end of the list when they become >>>>> negative, and to the start of the list when they become positive? Then >>>>> code which walks the child list could simply abort on the first >>>>> negative. >>>>> >>>>> I doubt that would be quite as easy as it sounds, but it would at least >>>>> be more focused on the observed symptom rather than some whole-system >>>>> number which only vaguely correlates with the observed symptom. >>>>> >>>>> Maybe a completely different approach: change children-walking code to >>>>> drop and retake the lock (with appropriate validation) periodically. >>>>> What too would address the specific symptom. >>>> Another good question. >>>> >>>> I have assumed that dropping and re-taking the lock cannot be done but >>>> this is a question I would like answered as well. Dropping and re-taking >>>> lock would require, as Miklos pointed out to me off-list, recording the >>>> list position with say a cursor, introducing unwanted complexity when it >>>> would be better to accept the cost of a single extra access to the parent >>>> flags (which I assume is one reason to set the flag in the child). >>> The parent access is actually more expensive than you might think. Based on >>> experience with past fsnotify related performance regression I expect some >>> 20% performance hit for small tmpfs writes if you add unconditional parent >>> access to the write path. >> That sounds like a lot for what should be a memory access of an already in >> memory structure since the parent must be accessed to traverse the list of >> child entries. I clearly don't fully understand the implications of what >> I'm saying but there has been mention of another context ... > Parent dentry is of course in memory but often cache cold - you don't need > the parent to do e.g. write(2) to an already open file. You seem to be > somewhat confused about the child dentry list traversal (or maybe I'm > misunderstanding) - that happens only when placing the notification mark > but definitely not for each IO operation. LOL, confusion is a pretty common state of mind for me! I do get your point though and I am confusing the traversal with other operations. I think this answers the question I've been asking (maybe that wasn't obvious) about the reason for the traversal (ie. the reason to maintain a flag in the child). While I have looked at the code here I haven't absorbed it and I definitely don't understand it, your continued patience is appreciated and will be beneficial when I get time to look at it a bit closer. I do still need to use a notifications mechanism to match up with Miklos's statmount implementation to get the full benefit of that in user space, if I ever get a chance to work on that again. So it sounds like it would be worth while considering a traversal that's based on taking a reference on each dentry rather than a spinlock for the duration. It would be tricky though, for obvious reasons, like children added during the traversal, added overhead of getting the next entry reference, etc. Ian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-20 7:16 ` Ian Kent @ 2026-05-20 9:43 ` Amir Goldstein 2026-05-21 0:55 ` Ian Kent 2026-05-22 4:16 ` NeilBrown 0 siblings, 2 replies; 17+ messages in thread From: Amir Goldstein @ 2026-05-20 9:43 UTC (permalink / raw) To: Ian Kent Cc: Jan Kara, NeilBrown, Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Wed, May 20, 2026 at 9:16 AM Ian Kent <raven@themaw.net> wrote: > > On 19/5/26 17:12, Jan Kara wrote: > > On Mon 18-05-26 21:39:13, Ian Kent wrote: > >> On 18/5/26 16:19, Jan Kara wrote: > >>> Hi Ian, > >>> > >>> On Mon 18-05-26 10:55:43, Ian Kent wrote: > >>>> On 18/5/26 07:55, NeilBrown wrote: > >>>>> On Fri, 15 May 2026, Horst Birthelmer wrote: > >>>>> According to the email you linked, a problem arises when a directory has > >>>>> a great many negative children. Code which walks the list of children > >>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays > >>>>> and result in long lock-hold times. So maybe a limit on negative > >>>>> dentries for any parent is what we really want. That would be clumsy to > >>>>> implement I imagine. > >>>> But the notion of dropping the dentry in ->d_delete() on last dput() is > >>>> simple enough but did see regressions (the only other place in the VFS > >>>> besides dentry_kill() that the inode is unlinked from the dentry on > >>>> dput()). I wonder if the regression was related to the test itself > >>>> deliberately recreating deleted files and if that really is normal > >>>> behaviour. By itself that should prevent almost all negative dentries > >>>> being retained. Although file systems could do this as well (think XFS > >>>> inode recycling) it should be reasonable to require it be left to the > >>>> VFS. > >>>> > >>>> But even that's not enough given that, in my case, there would still be > >>>> around 4 million dentries in the LRU cache and in fsnotify there are > >>>> directory child traversals holding the parent i_lock "spinlock" that are > >>>> going to cause problems. > >>> Do you mean there are very many positive children of a directory? > >> Didn't quantify that. > >> > >> The symptom is the "Spinlock held for more than ... seconds" occurring in > >> the log. So there are certainly a lot of children in the list, but it's > >> an assumption the ratio of positive to negative entries is roughly the > >> same as the overall ratio in the dcache. > > OK, but that's not necessarily true. I have seen these complaints from the > > kernel but in all the cases I remember it was due to negative dentries > > accumultating in a particular directory. There are certain apps such as > > ElasticSearch which really do like creating huge amounts of negative > > dentries in one directory - they use hashes as filenames and use directory > > lookup instead of a DB table lookup and lookup lots of non-existent keys... > > Umm ... that's a good point, I hadn't paid much attention to ENOENT result > > lookups, I'll need to check on the like cycle of those, I think they do get > > hashed. That has to be the other source of negative dentries that I've > > neglected ... > Yes, it has been claimed that some real life workloads create a lot of those. If we can keep those at the tail of the children list, it will be best for the fsnotify iteration, which only cares about positive dentries. > > > >>>> so why is this traversal even retained in fsnotify? > >>> Not sure which traversal you mean but if you set watch on a parent, you > >>> have to walk all children to set PARENT_WATCHED flag so that you don't miss > >>> events on children... > >> Yes, that traversal is what I'm questioning ... again thanks. > >> > >> I think the function name is still fsnotify_set_children_dentry_flags() > >> in recent kernels, the subject of commit 172e422ffea2 I mentioned above. > > OK, thanks. > > > >> When you say miss events are you saying that accessing the parent dentry to > >> work out if the child needs to respond to an event is quite expensive in the > >> overall event processing context, that might make more sense to me ... or do > >> I completely not yet understand the reasoning behind the need for the flag? > > Close but not quite. The cost is the overhead of dget_parent() in > > fsnotify_parent() which is often a couple of cache cold loads and atomic > > instructions to find out we don't need to send any event for the current > > write(2) or read(2) call. It gets worse if there are many IOs happening to > > dentries in the same directory from multiple CPUs because instead of > > cache-cold loads you get a cacheline contention on the parent. > > > >>>>> But what if we move dentries to the end of the list when they become > >>>>> negative, and to the start of the list when they become positive? Then > >>>>> code which walks the child list could simply abort on the first > >>>>> negative. > >>>>> > >>>>> I doubt that would be quite as easy as it sounds, but it would at least > >>>>> be more focused on the observed symptom rather than some whole-system > >>>>> number which only vaguely correlates with the observed symptom. > >>>>> > >>>>> Maybe a completely different approach: change children-walking code to > >>>>> drop and retake the lock (with appropriate validation) periodically. > >>>>> What too would address the specific symptom. > >>>> Another good question. > >>>> > >>>> I have assumed that dropping and re-taking the lock cannot be done but > >>>> this is a question I would like answered as well. Dropping and re-taking > >>>> lock would require, as Miklos pointed out to me off-list, recording the > >>>> list position with say a cursor, introducing unwanted complexity when it > >>>> would be better to accept the cost of a single extra access to the parent > >>>> flags (which I assume is one reason to set the flag in the child). > >>> The parent access is actually more expensive than you might think. Based on > >>> experience with past fsnotify related performance regression I expect some > >>> 20% performance hit for small tmpfs writes if you add unconditional parent > >>> access to the write path. > >> That sounds like a lot for what should be a memory access of an already in > >> memory structure since the parent must be accessed to traverse the list of > >> child entries. I clearly don't fully understand the implications of what > >> I'm saying but there has been mention of another context ... > > Parent dentry is of course in memory but often cache cold - you don't need > > the parent to do e.g. write(2) to an already open file. You seem to be > > somewhat confused about the child dentry list traversal (or maybe I'm > > misunderstanding) - that happens only when placing the notification mark > > but definitely not for each IO operation. > > LOL, confusion is a pretty common state of mind for me! > > > I do get your point though and I am confusing the traversal with other > > operations. I think this answers the question I've been asking (maybe > > that wasn't obvious) about the reason for the traversal (ie. the reason > > to maintain a flag in the child). > > > While I have looked at the code here I haven't absorbed it and I > > definitely don't understand it, your continued patience is appreciated > > and will be beneficial when I get time to look at it a bit closer. I > > do still need to use a notifications mechanism to match up with Miklos's > > statmount implementation to get the full benefit of that in user space, > > if I ever get a chance to work on that again. > > > So it sounds like it would be worth while considering a traversal that's > > based on taking a reference on each dentry rather than a spinlock for > > the duration. It would be tricky though, for obvious reasons, like > > children added during the traversal, added overhead of getting the next > > entry reference, etc. Didn't look closely, but it feels like RCU traversal should be possible if entries are added to the tail, or to the END_OF_POSITIVE location. When we discussed the "negavites at tail" at LSFMM it was said that managing the transitions positive<->negative would be challenging, but I don't know that anyone tried to look closer at this. At least for fsnotify, positive->negative transition is not a problem w.r.t skipping entry and observing entry twice during positive iteration. If negative->positive transitions inserts at END_OF_POSITIVE location, then should be fine as well? Iterators that need to iterate all children can do this under lock. Does that make sense? Thanks, Amir. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-20 9:43 ` Amir Goldstein @ 2026-05-21 0:55 ` Ian Kent 2026-05-22 4:16 ` NeilBrown 1 sibling, 0 replies; 17+ messages in thread From: Ian Kent @ 2026-05-21 0:55 UTC (permalink / raw) To: Amir Goldstein Cc: Jan Kara, NeilBrown, Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On 20/5/26 17:43, Amir Goldstein wrote: > On Wed, May 20, 2026 at 9:16 AM Ian Kent <raven@themaw.net> wrote: >> On 19/5/26 17:12, Jan Kara wrote: >>> On Mon 18-05-26 21:39:13, Ian Kent wrote: >>>> On 18/5/26 16:19, Jan Kara wrote: >>>>> Hi Ian, >>>>> >>>>> On Mon 18-05-26 10:55:43, Ian Kent wrote: >>>>>> On 18/5/26 07:55, NeilBrown wrote: >>>>>>> On Fri, 15 May 2026, Horst Birthelmer wrote: >>>>>>> According to the email you linked, a problem arises when a directory has >>>>>>> a great many negative children. Code which walks the list of children >>>>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays >>>>>>> and result in long lock-hold times. So maybe a limit on negative >>>>>>> dentries for any parent is what we really want. That would be clumsy to >>>>>>> implement I imagine. >>>>>> But the notion of dropping the dentry in ->d_delete() on last dput() is >>>>>> simple enough but did see regressions (the only other place in the VFS >>>>>> besides dentry_kill() that the inode is unlinked from the dentry on >>>>>> dput()). I wonder if the regression was related to the test itself >>>>>> deliberately recreating deleted files and if that really is normal >>>>>> behaviour. By itself that should prevent almost all negative dentries >>>>>> being retained. Although file systems could do this as well (think XFS >>>>>> inode recycling) it should be reasonable to require it be left to the >>>>>> VFS. >>>>>> >>>>>> But even that's not enough given that, in my case, there would still be >>>>>> around 4 million dentries in the LRU cache and in fsnotify there are >>>>>> directory child traversals holding the parent i_lock "spinlock" that are >>>>>> going to cause problems. >>>>> Do you mean there are very many positive children of a directory? >>>> Didn't quantify that. >>>> >>>> The symptom is the "Spinlock held for more than ... seconds" occurring in >>>> the log. So there are certainly a lot of children in the list, but it's >>>> an assumption the ratio of positive to negative entries is roughly the >>>> same as the overall ratio in the dcache. >>> OK, but that's not necessarily true. I have seen these complaints from the >>> kernel but in all the cases I remember it was due to negative dentries >>> accumultating in a particular directory. There are certain apps such as >>> ElasticSearch which really do like creating huge amounts of negative >>> dentries in one directory - they use hashes as filenames and use directory >>> lookup instead of a DB table lookup and lookup lots of non-existent keys... >> Umm ... that's a good point, I hadn't paid much attention to ENOENT result >> >> lookups, I'll need to check on the like cycle of those, I think they do get >> >> hashed. That has to be the other source of negative dentries that I've >> >> neglected ... >> > Yes, it has been claimed that some real life workloads create a lot of those. > > If we can keep those at the tail of the children list, it will be best > for the fsnotify > iteration, which only cares about positive dentries. > >>>>>> so why is this traversal even retained in fsnotify? >>>>> Not sure which traversal you mean but if you set watch on a parent, you >>>>> have to walk all children to set PARENT_WATCHED flag so that you don't miss >>>>> events on children... >>>> Yes, that traversal is what I'm questioning ... again thanks. >>>> >>>> I think the function name is still fsnotify_set_children_dentry_flags() >>>> in recent kernels, the subject of commit 172e422ffea2 I mentioned above. >>> OK, thanks. >>> >>>> When you say miss events are you saying that accessing the parent dentry to >>>> work out if the child needs to respond to an event is quite expensive in the >>>> overall event processing context, that might make more sense to me ... or do >>>> I completely not yet understand the reasoning behind the need for the flag? >>> Close but not quite. The cost is the overhead of dget_parent() in >>> fsnotify_parent() which is often a couple of cache cold loads and atomic >>> instructions to find out we don't need to send any event for the current >>> write(2) or read(2) call. It gets worse if there are many IOs happening to >>> dentries in the same directory from multiple CPUs because instead of >>> cache-cold loads you get a cacheline contention on the parent. >>> >>>>>>> But what if we move dentries to the end of the list when they become >>>>>>> negative, and to the start of the list when they become positive? Then >>>>>>> code which walks the child list could simply abort on the first >>>>>>> negative. >>>>>>> >>>>>>> I doubt that would be quite as easy as it sounds, but it would at least >>>>>>> be more focused on the observed symptom rather than some whole-system >>>>>>> number which only vaguely correlates with the observed symptom. >>>>>>> >>>>>>> Maybe a completely different approach: change children-walking code to >>>>>>> drop and retake the lock (with appropriate validation) periodically. >>>>>>> What too would address the specific symptom. >>>>>> Another good question. >>>>>> >>>>>> I have assumed that dropping and re-taking the lock cannot be done but >>>>>> this is a question I would like answered as well. Dropping and re-taking >>>>>> lock would require, as Miklos pointed out to me off-list, recording the >>>>>> list position with say a cursor, introducing unwanted complexity when it >>>>>> would be better to accept the cost of a single extra access to the parent >>>>>> flags (which I assume is one reason to set the flag in the child). >>>>> The parent access is actually more expensive than you might think. Based on >>>>> experience with past fsnotify related performance regression I expect some >>>>> 20% performance hit for small tmpfs writes if you add unconditional parent >>>>> access to the write path. >>>> That sounds like a lot for what should be a memory access of an already in >>>> memory structure since the parent must be accessed to traverse the list of >>>> child entries. I clearly don't fully understand the implications of what >>>> I'm saying but there has been mention of another context ... >>> Parent dentry is of course in memory but often cache cold - you don't need >>> the parent to do e.g. write(2) to an already open file. You seem to be >>> somewhat confused about the child dentry list traversal (or maybe I'm >>> misunderstanding) - that happens only when placing the notification mark >>> but definitely not for each IO operation. >> LOL, confusion is a pretty common state of mind for me! >> >> >> I do get your point though and I am confusing the traversal with other >> >> operations. I think this answers the question I've been asking (maybe >> >> that wasn't obvious) about the reason for the traversal (ie. the reason >> >> to maintain a flag in the child). >> >> >> While I have looked at the code here I haven't absorbed it and I >> >> definitely don't understand it, your continued patience is appreciated >> >> and will be beneficial when I get time to look at it a bit closer. I >> >> do still need to use a notifications mechanism to match up with Miklos's >> >> statmount implementation to get the full benefit of that in user space, >> >> if I ever get a chance to work on that again. >> >> >> So it sounds like it would be worth while considering a traversal that's >> >> based on taking a reference on each dentry rather than a spinlock for >> >> the duration. It would be tricky though, for obvious reasons, like >> >> children added during the traversal, added overhead of getting the next >> >> entry reference, etc. > Didn't look closely, but it feels like RCU traversal should be > possible if entries are added to the tail, or to the END_OF_POSITIVE > location. > > When we discussed the "negavites at tail" at LSFMM > it was said that managing the transitions positive<->negative > would be challenging, but I don't know that anyone tried to look closer at this. I guess that should be straight forward as long as it's done at the point of transition except if it's done by a filesystem instead of the VFS (maybe require a helper be used ...). Might be a bit harder for dentries that don't transition (ie. ENOENT lookups that start out and stay negative) might escape the needed handling. > > At least for fsnotify, positive->negative transition is not a problem > w.r.t skipping entry and observing entry twice during positive iteration. > > If negative->positive transitions inserts at END_OF_POSITIVE > location, then should be fine as well? > > Iterators that need to iterate all children can do this under lock. Only catch there is the number of positive children might be large as well. > > Does that make sense? Yep, the notion of a cursor is a good idea. Nevertheless the challenge is to identify dentries that should be discarded rather than kept at final dput() in addition to what we already have in dout() but there doesn't seem to be anything sensible to add to those checks. On a different note another possibility to identify candidates to discard on traversals might be a ttl, basically an extension of the referenced flag. The dentry d_time field could be used for that but only for negative dentries since, IIRC, nfs uses that dentry field. But now I'm not sure I'm making sense as a couple of your comments sound like they refer to a discussion I'm not aware of, ;) Ian ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-20 9:43 ` Amir Goldstein 2026-05-21 0:55 ` Ian Kent @ 2026-05-22 4:16 ` NeilBrown 2026-05-22 8:27 ` Amir Goldstein 1 sibling, 1 reply; 17+ messages in thread From: NeilBrown @ 2026-05-22 4:16 UTC (permalink / raw) To: Amir Goldstein Cc: Ian Kent, Jan Kara, Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Wed, 20 May 2026, Amir Goldstein wrote: > On Wed, May 20, 2026 at 9:16 AM Ian Kent <raven@themaw.net> wrote: > > > > On 19/5/26 17:12, Jan Kara wrote: > > > On Mon 18-05-26 21:39:13, Ian Kent wrote: > > >> On 18/5/26 16:19, Jan Kara wrote: > > >>> Hi Ian, > > >>> > > >>> On Mon 18-05-26 10:55:43, Ian Kent wrote: > > >>>> On 18/5/26 07:55, NeilBrown wrote: > > >>>>> On Fri, 15 May 2026, Horst Birthelmer wrote: > > >>>>> According to the email you linked, a problem arises when a directory has > > >>>>> a great many negative children. Code which walks the list of children > > >>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays > > >>>>> and result in long lock-hold times. So maybe a limit on negative > > >>>>> dentries for any parent is what we really want. That would be clumsy to > > >>>>> implement I imagine. > > >>>> But the notion of dropping the dentry in ->d_delete() on last dput() is > > >>>> simple enough but did see regressions (the only other place in the VFS > > >>>> besides dentry_kill() that the inode is unlinked from the dentry on > > >>>> dput()). I wonder if the regression was related to the test itself > > >>>> deliberately recreating deleted files and if that really is normal > > >>>> behaviour. By itself that should prevent almost all negative dentries > > >>>> being retained. Although file systems could do this as well (think XFS > > >>>> inode recycling) it should be reasonable to require it be left to the > > >>>> VFS. > > >>>> > > >>>> But even that's not enough given that, in my case, there would still be > > >>>> around 4 million dentries in the LRU cache and in fsnotify there are > > >>>> directory child traversals holding the parent i_lock "spinlock" that are > > >>>> going to cause problems. > > >>> Do you mean there are very many positive children of a directory? > > >> Didn't quantify that. > > >> > > >> The symptom is the "Spinlock held for more than ... seconds" occurring in > > >> the log. So there are certainly a lot of children in the list, but it's > > >> an assumption the ratio of positive to negative entries is roughly the > > >> same as the overall ratio in the dcache. > > > OK, but that's not necessarily true. I have seen these complaints from the > > > kernel but in all the cases I remember it was due to negative dentries > > > accumultating in a particular directory. There are certain apps such as > > > ElasticSearch which really do like creating huge amounts of negative > > > dentries in one directory - they use hashes as filenames and use directory > > > lookup instead of a DB table lookup and lookup lots of non-existent keys... > > > > Umm ... that's a good point, I hadn't paid much attention to ENOENT result > > > > lookups, I'll need to check on the like cycle of those, I think they do get > > > > hashed. That has to be the other source of negative dentries that I've > > > > neglected ... > > > > Yes, it has been claimed that some real life workloads create a lot of those. > > If we can keep those at the tail of the children list, it will be best > for the fsnotify > iteration, which only cares about positive dentries. > > > > > > >>>> so why is this traversal even retained in fsnotify? > > >>> Not sure which traversal you mean but if you set watch on a parent, you > > >>> have to walk all children to set PARENT_WATCHED flag so that you don't miss > > >>> events on children... > > >> Yes, that traversal is what I'm questioning ... again thanks. > > >> > > >> I think the function name is still fsnotify_set_children_dentry_flags() > > >> in recent kernels, the subject of commit 172e422ffea2 I mentioned above. > > > OK, thanks. > > > > > >> When you say miss events are you saying that accessing the parent dentry to > > >> work out if the child needs to respond to an event is quite expensive in the > > >> overall event processing context, that might make more sense to me ... or do > > >> I completely not yet understand the reasoning behind the need for the flag? > > > Close but not quite. The cost is the overhead of dget_parent() in > > > fsnotify_parent() which is often a couple of cache cold loads and atomic > > > instructions to find out we don't need to send any event for the current > > > write(2) or read(2) call. It gets worse if there are many IOs happening to > > > dentries in the same directory from multiple CPUs because instead of > > > cache-cold loads you get a cacheline contention on the parent. > > > > > >>>>> But what if we move dentries to the end of the list when they become > > >>>>> negative, and to the start of the list when they become positive? Then > > >>>>> code which walks the child list could simply abort on the first > > >>>>> negative. > > >>>>> > > >>>>> I doubt that would be quite as easy as it sounds, but it would at least > > >>>>> be more focused on the observed symptom rather than some whole-system > > >>>>> number which only vaguely correlates with the observed symptom. > > >>>>> > > >>>>> Maybe a completely different approach: change children-walking code to > > >>>>> drop and retake the lock (with appropriate validation) periodically. > > >>>>> What too would address the specific symptom. > > >>>> Another good question. > > >>>> > > >>>> I have assumed that dropping and re-taking the lock cannot be done but > > >>>> this is a question I would like answered as well. Dropping and re-taking > > >>>> lock would require, as Miklos pointed out to me off-list, recording the > > >>>> list position with say a cursor, introducing unwanted complexity when it > > >>>> would be better to accept the cost of a single extra access to the parent > > >>>> flags (which I assume is one reason to set the flag in the child). > > >>> The parent access is actually more expensive than you might think. Based on > > >>> experience with past fsnotify related performance regression I expect some > > >>> 20% performance hit for small tmpfs writes if you add unconditional parent > > >>> access to the write path. > > >> That sounds like a lot for what should be a memory access of an already in > > >> memory structure since the parent must be accessed to traverse the list of > > >> child entries. I clearly don't fully understand the implications of what > > >> I'm saying but there has been mention of another context ... > > > Parent dentry is of course in memory but often cache cold - you don't need > > > the parent to do e.g. write(2) to an already open file. You seem to be > > > somewhat confused about the child dentry list traversal (or maybe I'm > > > misunderstanding) - that happens only when placing the notification mark > > > but definitely not for each IO operation. > > > > LOL, confusion is a pretty common state of mind for me! > > > > > > I do get your point though and I am confusing the traversal with other > > > > operations. I think this answers the question I've been asking (maybe > > > > that wasn't obvious) about the reason for the traversal (ie. the reason > > > > to maintain a flag in the child). > > > > > > While I have looked at the code here I haven't absorbed it and I > > > > definitely don't understand it, your continued patience is appreciated > > > > and will be beneficial when I get time to look at it a bit closer. I > > > > do still need to use a notifications mechanism to match up with Miklos's > > > > statmount implementation to get the full benefit of that in user space, > > > > if I ever get a chance to work on that again. > > > > > > So it sounds like it would be worth while considering a traversal that's > > > > based on taking a reference on each dentry rather than a spinlock for > > > > the duration. It would be tricky though, for obvious reasons, like > > > > children added during the traversal, added overhead of getting the next > > > > entry reference, etc. > > Didn't look closely, but it feels like RCU traversal should be > possible if entries are added to the tail, or to the END_OF_POSITIVE > location. > > When we discussed the "negavites at tail" at LSFMM > it was said that managing the transitions positive<->negative > would be challenging, but I don't know that anyone tried to look closer at this. I had a quick look. Most users of d_sib walk from the parent->d_children with the parent ->d_lock held, so they shouldn't notice a movement in the list. The two exceptions I could find are d_walk() and the readdir code in libfs.c. I think the main problem case would be if they were holding a dentry as a cursor which transitioned when the parent d_lock is dropped and retaken. d_walk already needs to cope with a concurrent rename messing with its cursor so possibly something similar could be used to trigger a restart. libfs readdir walks from a DCACHE_DENTRY_CURSOR which will never transition and so won't move spontaneously. That is exactly as safe as walking from the parent. So I think d_walk() might need some help to avoid getting lost. It could probably simply check if its cursor changed ->d_inode between dropping ->d_lock and retaking it. If it did, then restart. It isn't clear to me that we can track a "end of positive" location. I think we would need to move negatives to the end and positives to the start. Can you see a down-side with doing that? Thanks, NeilBrown > > At least for fsnotify, positive->negative transition is not a problem > w.r.t skipping entry and observing entry twice during positive iteration. > > If negative->positive transitions inserts at END_OF_POSITIVE > location, then should be fine as well? > > Iterators that need to iterate all children can do this under lock. > > Does that make sense? > > Thanks, > Amir. > ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-22 4:16 ` NeilBrown @ 2026-05-22 8:27 ` Amir Goldstein 0 siblings, 0 replies; 17+ messages in thread From: Amir Goldstein @ 2026-05-22 8:27 UTC (permalink / raw) To: NeilBrown Cc: Ian Kent, Jan Kara, Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Fri, May 22, 2026 at 6:16 AM NeilBrown <neilb@ownmail.net> wrote: > > On Wed, 20 May 2026, Amir Goldstein wrote: > > On Wed, May 20, 2026 at 9:16 AM Ian Kent <raven@themaw.net> wrote: > > > > > > On 19/5/26 17:12, Jan Kara wrote: > > > > On Mon 18-05-26 21:39:13, Ian Kent wrote: > > > >> On 18/5/26 16:19, Jan Kara wrote: > > > >>> Hi Ian, > > > >>> > > > >>> On Mon 18-05-26 10:55:43, Ian Kent wrote: > > > >>>> On 18/5/26 07:55, NeilBrown wrote: > > > >>>>> On Fri, 15 May 2026, Horst Birthelmer wrote: > > > >>>>> According to the email you linked, a problem arises when a directory has > > > >>>>> a great many negative children. Code which walks the list of children > > > >>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays > > > >>>>> and result in long lock-hold times. So maybe a limit on negative > > > >>>>> dentries for any parent is what we really want. That would be clumsy to > > > >>>>> implement I imagine. > > > >>>> But the notion of dropping the dentry in ->d_delete() on last dput() is > > > >>>> simple enough but did see regressions (the only other place in the VFS > > > >>>> besides dentry_kill() that the inode is unlinked from the dentry on > > > >>>> dput()). I wonder if the regression was related to the test itself > > > >>>> deliberately recreating deleted files and if that really is normal > > > >>>> behaviour. By itself that should prevent almost all negative dentries > > > >>>> being retained. Although file systems could do this as well (think XFS > > > >>>> inode recycling) it should be reasonable to require it be left to the > > > >>>> VFS. > > > >>>> > > > >>>> But even that's not enough given that, in my case, there would still be > > > >>>> around 4 million dentries in the LRU cache and in fsnotify there are > > > >>>> directory child traversals holding the parent i_lock "spinlock" that are > > > >>>> going to cause problems. > > > >>> Do you mean there are very many positive children of a directory? > > > >> Didn't quantify that. > > > >> > > > >> The symptom is the "Spinlock held for more than ... seconds" occurring in > > > >> the log. So there are certainly a lot of children in the list, but it's > > > >> an assumption the ratio of positive to negative entries is roughly the > > > >> same as the overall ratio in the dcache. > > > > OK, but that's not necessarily true. I have seen these complaints from the > > > > kernel but in all the cases I remember it was due to negative dentries > > > > accumultating in a particular directory. There are certain apps such as > > > > ElasticSearch which really do like creating huge amounts of negative > > > > dentries in one directory - they use hashes as filenames and use directory > > > > lookup instead of a DB table lookup and lookup lots of non-existent keys... > > > > > > Umm ... that's a good point, I hadn't paid much attention to ENOENT result > > > > > > lookups, I'll need to check on the like cycle of those, I think they do get > > > > > > hashed. That has to be the other source of negative dentries that I've > > > > > > neglected ... > > > > > > > Yes, it has been claimed that some real life workloads create a lot of those. > > > > If we can keep those at the tail of the children list, it will be best > > for the fsnotify > > iteration, which only cares about positive dentries. > > > > > > > > > >>>> so why is this traversal even retained in fsnotify? > > > >>> Not sure which traversal you mean but if you set watch on a parent, you > > > >>> have to walk all children to set PARENT_WATCHED flag so that you don't miss > > > >>> events on children... > > > >> Yes, that traversal is what I'm questioning ... again thanks. > > > >> > > > >> I think the function name is still fsnotify_set_children_dentry_flags() > > > >> in recent kernels, the subject of commit 172e422ffea2 I mentioned above. > > > > OK, thanks. > > > > > > > >> When you say miss events are you saying that accessing the parent dentry to > > > >> work out if the child needs to respond to an event is quite expensive in the > > > >> overall event processing context, that might make more sense to me ... or do > > > >> I completely not yet understand the reasoning behind the need for the flag? > > > > Close but not quite. The cost is the overhead of dget_parent() in > > > > fsnotify_parent() which is often a couple of cache cold loads and atomic > > > > instructions to find out we don't need to send any event for the current > > > > write(2) or read(2) call. It gets worse if there are many IOs happening to > > > > dentries in the same directory from multiple CPUs because instead of > > > > cache-cold loads you get a cacheline contention on the parent. > > > > > > > >>>>> But what if we move dentries to the end of the list when they become > > > >>>>> negative, and to the start of the list when they become positive? Then > > > >>>>> code which walks the child list could simply abort on the first > > > >>>>> negative. > > > >>>>> > > > >>>>> I doubt that would be quite as easy as it sounds, but it would at least > > > >>>>> be more focused on the observed symptom rather than some whole-system > > > >>>>> number which only vaguely correlates with the observed symptom. > > > >>>>> > > > >>>>> Maybe a completely different approach: change children-walking code to > > > >>>>> drop and retake the lock (with appropriate validation) periodically. > > > >>>>> What too would address the specific symptom. > > > >>>> Another good question. > > > >>>> > > > >>>> I have assumed that dropping and re-taking the lock cannot be done but > > > >>>> this is a question I would like answered as well. Dropping and re-taking > > > >>>> lock would require, as Miklos pointed out to me off-list, recording the > > > >>>> list position with say a cursor, introducing unwanted complexity when it > > > >>>> would be better to accept the cost of a single extra access to the parent > > > >>>> flags (which I assume is one reason to set the flag in the child). > > > >>> The parent access is actually more expensive than you might think. Based on > > > >>> experience with past fsnotify related performance regression I expect some > > > >>> 20% performance hit for small tmpfs writes if you add unconditional parent > > > >>> access to the write path. > > > >> That sounds like a lot for what should be a memory access of an already in > > > >> memory structure since the parent must be accessed to traverse the list of > > > >> child entries. I clearly don't fully understand the implications of what > > > >> I'm saying but there has been mention of another context ... > > > > Parent dentry is of course in memory but often cache cold - you don't need > > > > the parent to do e.g. write(2) to an already open file. You seem to be > > > > somewhat confused about the child dentry list traversal (or maybe I'm > > > > misunderstanding) - that happens only when placing the notification mark > > > > but definitely not for each IO operation. > > > > > > LOL, confusion is a pretty common state of mind for me! > > > > > > > > > I do get your point though and I am confusing the traversal with other > > > > > > operations. I think this answers the question I've been asking (maybe > > > > > > that wasn't obvious) about the reason for the traversal (ie. the reason > > > > > > to maintain a flag in the child). > > > > > > > > > While I have looked at the code here I haven't absorbed it and I > > > > > > definitely don't understand it, your continued patience is appreciated > > > > > > and will be beneficial when I get time to look at it a bit closer. I > > > > > > do still need to use a notifications mechanism to match up with Miklos's > > > > > > statmount implementation to get the full benefit of that in user space, > > > > > > if I ever get a chance to work on that again. > > > > > > > > > So it sounds like it would be worth while considering a traversal that's > > > > > > based on taking a reference on each dentry rather than a spinlock for > > > > > > the duration. It would be tricky though, for obvious reasons, like > > > > > > children added during the traversal, added overhead of getting the next > > > > > > entry reference, etc. > > > > Didn't look closely, but it feels like RCU traversal should be > > possible if entries are added to the tail, or to the END_OF_POSITIVE > > location. > > > > When we discussed the "negavites at tail" at LSFMM > > it was said that managing the transitions positive<->negative > > would be challenging, but I don't know that anyone tried to look closer at this. > > I had a quick look. Most users of d_sib walk from the parent->d_children > with the parent ->d_lock held, so they shouldn't notice a movement in > the list. > The two exceptions I could find are d_walk() and the readdir code in > libfs.c. > I think the main problem case would be if they were holding a dentry as > a cursor which transitioned when the parent d_lock is dropped and retaken. > > d_walk already needs to cope with a concurrent rename messing with > its cursor so possibly something similar could be used to trigger a > restart. > > libfs readdir walks from a DCACHE_DENTRY_CURSOR which will never > transition and so won't move spontaneously. That is exactly as safe as > walking from the parent. > > So I think d_walk() might need some help to avoid getting lost. It could > probably simply check if its cursor changed ->d_inode between dropping > ->d_lock and retaking it. If it did, then restart. > > It isn't clear to me that we can track a "end of positive" location. I > think we would need to move negatives to the end and positives to the > start. Can you see a down-side with doing that? > One of the intentions (that may be lost in this conversations) was to avoid holding the parent inode lock over the entire children iteration in fsnotify_set_children_dentry_flags(). This function only needs to go over all positive children. That's why I thought about moving new positive to the end of positives. Maybe this works if we keep positives at tail and negatives at head and also have dentry shrinkers work on the negative dentries at the head first. Possibly, we can make sure to set DCACHE_FSNOTIFY_PARENT_WATCHED if needed when making a dentry positive, to avoid some races. NOTE that DCACHE_FSNOTIFY_PARENT_WATCHED can allow false positives. It's an optimization flag that is auto cleared with fsnotify_clear_child_dentry_flag() in case of false positive. Also, it is possible to set DCACHE_FSNOTIFY_PARENT_WATCHED *before* dentry is made positive, for example on start_creating() because whether a negative dentry has DCACHE_FSNOTIFY_PARENT_WATCHED or not is insignificant. Thanks, Amir. ^ permalink raw reply [flat|nested] 17+ messages in thread
* Re: Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper 2026-05-17 23:55 ` NeilBrown 2026-05-18 2:55 ` Ian Kent @ 2026-05-18 7:01 ` Horst Birthelmer 1 sibling, 0 replies; 17+ messages in thread From: Horst Birthelmer @ 2026-05-18 7:01 UTC (permalink / raw) To: NeilBrown Cc: Horst Birthelmer, Miklos Szeredi, Jonathan Corbet, Shuah Khan, Alexander Viro, Christian Brauner, Jan Kara, linux-doc, linux-kernel, linux-fsdevel, Horst Birthelmer On Mon, May 18, 2026 at 09:55:05AM +1000, NeilBrown wrote: > On Fri, 15 May 2026, Horst Birthelmer wrote: > > From: Horst Birthelmer <hbirthelmer@ddn.com> > > > > The dcache only shrinks under memory pressure, which is rarely reached > > on machines with ample RAM, so cached negative dentries can accumulate > > without bound. Give administrators a soft cap they can set, > > and a background worker that prefers negative dentries when reclaiming. > > > > Two new sysctls under /proc/sys/fs/: > > > > dentry-limit -- soft cap on nr_dentry. 0 (default) > > disables the feature; behaviour is then > > identical to before. > > Is a system-wide cap really a suitable tool? What guidance would you > give to sysadmins who are considering setting a number? I know it is a rhetorical question ... nevertheless It's a soft cap, so it depends on the number of open files usually floating around on the machine. It even depends on the file systems. That was actually my motivation (more than the negative entries). Some cache entries are expensive for our fuse server due to our DLM usage and private data held in user space. > Is there a better approach? After reading your thoughts and those of the others who have taken the time to revisit this, I think there is no better solution in the VFS layer. Since 2025 (commit 395b95530343e) shrink_dentry_list() is an exported symbol and that can be used for a specific file system to do its own housekeeping. This will probably be considered a misuse by some , but it would be more specific and better controllable especially from filesystems where certain cache entries are more expensive than others and/or running in user space (FUSE). > > According to the email you linked, a problem arises when a directory has > a great many negative children. Code which walks the list of children > (such as fsnotify) while holding a lock can suffer unpredictable delays > and result in long lock-hold times. So maybe a limit on negative > dentries for any parent is what we really want. That would be clumsy to > implement I imagine. > > But what if we move dentries to the end of the list when they become > negative, and to the start of the list when they become positive? Then > code which walks the child list could simply abort on the first > negative. > > I doubt that would be quite as easy as it sounds, but it would at least > be more focused on the observed symptom rather than some whole-system > number which only vaguely correlates with the observed symptom. > > Maybe a completely different approach: change children-walking code to > drop and retake the lock (with appropriate validation) periodically. > What too would address the specific symptom. > > Thanks for attempting to resolve this issue, but I'm not convinced that > you have found a good solution yet. Thanks for the clear words. I realy appreciate it! > > NeilBrown > ^ permalink raw reply [flat|nested] 17+ messages in thread
end of thread, other threads:[~2026-05-22 8:28 UTC | newest] Thread overview: 17+ messages (download: mbox.gz follow: Atom feed -- links below jump to the message on this page -- 2026-05-14 15:13 [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer 2026-05-15 15:09 ` kernel test robot 2026-05-16 6:55 ` Horst Birthelmer 2026-05-16 10:33 ` Stafford Horne 2026-05-16 14:15 ` Horst Birthelmer 2026-05-15 15:09 ` kernel test robot 2026-05-17 23:55 ` NeilBrown 2026-05-18 2:55 ` Ian Kent 2026-05-18 8:19 ` Jan Kara 2026-05-18 13:39 ` Ian Kent 2026-05-19 9:12 ` Jan Kara 2026-05-20 7:16 ` Ian Kent 2026-05-20 9:43 ` Amir Goldstein 2026-05-21 0:55 ` Ian Kent 2026-05-22 4:16 ` NeilBrown 2026-05-22 8:27 ` Amir Goldstein 2026-05-18 7:01 ` Horst Birthelmer
This is a public inbox, see mirroring instructions for how to clone and mirror all data and code used for this inbox