* bad things when too many negative dentries in a directory
@ 2025-04-11 9:40 Miklos Szeredi
2025-04-11 14:47 ` Christian Brauner
` (2 more replies)
0 siblings, 3 replies; 23+ messages in thread
From: Miklos Szeredi @ 2025-04-11 9:40 UTC (permalink / raw)
To: linux-fsdevel
Cc: Al Viro, Christian Brauner, Amir Goldstein, Jan Kara, Ian Kent
There are reports of softlockups in fsnotify if there are large numbers
of negative dentries (e.g. ~300M) in a directory. This can happen if
lots of temp files are created and removed and there's not enough
memory pressure to trigger the lru shrinker.
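For anyone who wants to watch the accumulation on a test box, here is a minimal sketch. It is not the reproducer used for the traces in this mail. It assumes that the fifth field of /proc/sys/fs/dentry-state is nr_negative (true on kernels that document that field), and the helper names churn() and read_nr_negative() are made up for illustration.

```python
import os
import tempfile


def read_nr_negative(path="/proc/sys/fs/dentry-state"):
    """Return the fifth field of dentry-state (assumed nr_negative), or None."""
    try:
        with open(path) as f:
            fields = f.read().split()
    except OSError:
        return None
    return int(fields[4]) if len(fields) >= 5 else None


def churn(directory, count):
    """Create and immediately unlink `count` uniquely named files.

    Returns the number of entries left in the directory afterwards.
    """
    for i in range(count):
        name = os.path.join(directory, f"tmp-{os.getpid()}-{i}")
        with open(name, "w"):
            pass
        os.unlink(name)
    return len(os.listdir(directory))


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        before = read_nr_negative()
        left = churn(d, 10000)
        after = read_nr_negative()
        print(f"entries left: {left}, nr_negative before/after: {before}/{after}")
```

The directory ends up empty from userspace's point of view, while the counter may still grow. The system-wide counter is noisy, so treat any delta as a rough indication only.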
These are on old kernels and some of this is possibly due to missing
172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily"), but I
managed to reproduce the softlockup on a recent kernel in
fsnotify_set_children_dentry_flags() (see end of mail).
This was with ~1.2G negative dentries. Doing "rmdir testdir"
afterwards does not trigger the softlockup detector, due to the
reschedules in shrink_dcache_parent() code, but it took 10 minutes(!)
to finish removing that empty directory.
So I wonder, do we really want negative dentries on ->d_children?
Except for shrink_dcache_parent() I don't see any uses. And it's also
a question whether shrinking negative dentries is useful or not. If
they've been around for so long that hundreds of millions of them
could accumulate and that memory wasn't needed by anybody, then it
shouldn't make a big difference if they kept hanging around. On
umount, at the latest, the lru list can be used to kill everything,
AFAICT.
I'm curious if this is the right path? Any better ideas?
Thanks,
Miklos
[96789.366007] watchdog: BUG: soft lockup - CPU#79 stuck for 26s!
[fanotify4:52805]
[96789.373396] Modules linked in: rfkill mlx5_ib ib_uverbs macsec
ib_core vfat fat mlx5_core acpi_ipmi ast ipmi_ssif arm_spe_pmu igb
mlxfw psample i2c_algo_bit tls pci_hyperv_intf ipmi_devintf
ipmi_msghandler arm_cmn arm_dmc620_pmu arm_dsu_pmu cppc_cpufreq loop
fuse nfnetlink xfs nvme crct10dif_ce ghash_ce sha2_ce sha256_arm64
nvme_core sha1_ce sbsa_gwdt nvme_auth i2c_designware_platform
i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
[96789.413624] CPU: 79 UID: 0 PID: 52805 Comm: fanotify4 Kdump: loaded
Not tainted 6.12.0-55.9.1.el10_0.aarch64 #1
[96789.423698] Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS
F31n (SCP: 2.10.20220810) 09/30/2022
[96789.432990] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
[96789.439939] pc : fsnotify_set_children_dentry_flags+0x80/0xf0
[96789.445675] lr : fsnotify_set_children_dentry_flags+0xa4/0xf0
[96789.451408] sp : ffff8000cc77b8c0
[96789.454710] x29: ffff8000cc77b8c0 x28: 0000000000000001 x27: 0000000000000000
[96789.461833] x26: ffff07ff8463dc50 x25: ffff080e6e44dc50 x24: 0000000000000001
[96789.468956] x23: ffff07ff9d94eec0 x22: ffff07fff2cf01b8 x21: ffff07ff9d94ee40
[96789.476079] x20: ffff0800eb6dff40 x19: ffff0800eb6df2c0 x18: 0000000000000014
[96789.483202] x17: 00000000cec6e315 x16: 00000000ed365140 x15: 00000000ae8684a4
[96789.490325] x14: 000000000d831309 x13: 00000000387d7ee0 x12: 0000000000000000
[96789.497448] x11: 0000000000000001 x10: 0000000000000001 x9 : ffffc3bacc1864bc
[96789.504570] x8 : 000000001007ffff x7 : ffffc3bace89a4c0 x6 : 0000000000000001
[96789.511694] x5 : 0000000008000020 x4 : 0000000000000000 x3 : 0000000000000003
[96789.518816] x2 : 0000000000000001 x1 : 0000000000000000 x0 : ffff0800eb6df358
[96789.525939] Call trace:
[96789.528373] fsnotify_set_children_dentry_flags+0x80/0xf0
[96789.533759] fsnotify_recalc_mask.part.0+0x94/0xc8
[96789.538538] fsnotify_recalc_mask+0x1c/0x40
[96789.542709] fanotify_add_mark+0x15c/0x360
[96789.546794] do_fanotify_mark+0x3c0/0x7a0
[96789.550791] __arm64_sys_fanotify_mark+0x30/0x60
[96789.555396] invoke_syscall.constprop.0+0x74/0xd0
[96789.560090] do_el0_svc+0xb0/0xe8
[96789.563393] el0_svc+0x44/0x1d0
[96789.566525] el0t_64_sync_handler+0x120/0x130
[96789.570870] el0t_64_sync+0x1a4/0x1a8
[151513.714945] INFO: task (ostnamed):77658 blocked for more than 122 seconds.
[151513.721903] Tainted: G L ------- ---
6.12.0-55.9.1.el10_0.aarch64 #1
[151513.730334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
[151513.738241] task:(ostnamed) state:D stack:0 pid:77658
tgid:77658 ppid:1 flags:0x00000205
[151513.747625] Call trace:
[151513.750146] __switch_to+0xec/0x148
[151513.753712] __schedule+0x234/0x738
[151513.757278] schedule+0x3c/0xe0
[151513.760493] schedule_preempt_disabled+0x2c/0x58
[151513.765188] rwsem_down_write_slowpath+0x1e4/0x720
[151513.770054] down_write+0xac/0xc0
[151513.773444] do_lock_mount+0x3c/0x220
[151513.777185] path_mount+0x378/0x810
[151513.780748] __arm64_sys_mount+0x158/0x2d8
[151513.784921] invoke_syscall.constprop.0+0x74/0xd0
[151513.789702] do_el0_svc+0xb0/0xe8
[151513.793093] el0_svc+0x44/0x1d0
[151513.796312] el0t_64_sync_handler+0x120/0x130
[151513.800744] el0t_64_sync+0x1a4/0x1a8
* Re: bad things when too many negative dentries in a directory
2025-04-11 9:40 bad things when too many negative dentries in a directory Miklos Szeredi
@ 2025-04-11 14:47 ` Christian Brauner
2025-04-11 15:40 ` Miklos Szeredi
2025-04-12 1:48 ` Ian Kent
2025-04-11 21:02 ` Mateusz Guzik
2025-04-20 4:49 ` Al Viro
2 siblings, 2 replies; 23+ messages in thread
From: Christian Brauner @ 2025-04-11 14:47 UTC (permalink / raw)
To: Miklos Szeredi; +Cc: linux-fsdevel, Al Viro, Amir Goldstein, Jan Kara, Ian Kent
On Fri, Apr 11, 2025 at 11:40:28AM +0200, Miklos Szeredi wrote:
> There are reports of softlockups in fsnotify if there are large numbers
> of negative dentries (e.g. ~300M) in a directory. This can happen if
> lots of temp files are created and removed and there's not enough
> memory pressure to trigger the lru shrinker.
>
> These are on old kernels and some of this is possibly due to missing
> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily"), but I
> managed to reproduce the softlockup on a recent kernel in
> fsnotify_set_children_dentry_flags() (see end of mail).
>
> This was with ~1.2G negative dentries. Doing "rmdir testdir"
> afterwards does not trigger the softlockup detector, due to the
> reschedules in shrink_dcache_parent() code, but it took 10 minutes(!)
> to finish removing that empty directory.
>
> So I wonder, do we really want negative dentries on ->d_children?
> Except for shrink_dcache_parent() I don't see any uses. And it's also
> a question whether shrinking negative dentries is useful or not. If
> they've been around for so long that hundreds of millions of them
> could accumulate and that memory wasn't needed by anybody, then it
> shouldn't make a big difference if they kept hanging around. On
> umount, at the latest, the lru list can be used to kill everything,
> AFAICT.
>
> I'm curious if this is the right path? Any better ideas?
Note that we have a new sysctl:
/proc/sys/fs/dentry-negative
that can be used to control the negative dentry policy because any
generic change that we tried to make has always resulted in unacceptable
regressions for someone's workload. Currently we only allow it to be set
to 1 (default 0). If set to 1 it will not create negative dentries
during unlink. If that's sufficient, then recommend this to users who
suffer from this problem; if not, consider adding another sensible
policy.
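A small sketch of how an admin script might check and flip that knob. The path and the 0/1 values are taken from the description above; the function names get_policy() and set_policy() are hypothetical, and writing to the real file needs root on a kernel that has the sysctl.

```python
import os

SYSCTL = "/proc/sys/fs/dentry-negative"


def get_policy(path=SYSCTL):
    """Return the current value as an int, or None if the file is absent."""
    try:
        with open(path) as f:
            return int(f.read().strip())
    except (OSError, ValueError):
        return None


def set_policy(value, path=SYSCTL):
    """Write 0 or 1; only those two values are described as valid above."""
    if value not in (0, 1):
        raise ValueError("only 0 and 1 are described as valid")
    with open(path, "w") as f:
        f.write(f"{value}\n")


if __name__ == "__main__":
    current = get_policy()
    if current is None:
        print("fs.dentry-negative is not available on this kernel")
    else:
        print(f"fs.dentry-negative = {current}")
```

A return value of None is a cheap way to tell whether the running kernel has the sysctl at all before recommending it to anyone.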
* Re: bad things when too many negative dentries in a directory
2025-04-11 14:47 ` Christian Brauner
@ 2025-04-11 15:40 ` Miklos Szeredi
2025-04-11 16:01 ` Matthew Wilcox
2025-04-14 6:28 ` Ian Kent
2025-04-12 1:48 ` Ian Kent
1 sibling, 2 replies; 23+ messages in thread
From: Miklos Szeredi @ 2025-04-11 15:40 UTC (permalink / raw)
To: Christian Brauner
Cc: linux-fsdevel, Al Viro, Amir Goldstein, Jan Kara, Ian Kent
On Fri, 11 Apr 2025 at 16:47, Christian Brauner <brauner@kernel.org> wrote:
> Note that we have a new sysctl:
>
> /proc/sys/fs/dentry-negative
>
> that can be used to control the negative dentry policy because any
> generic change that we tried to make has always resulted in unacceptable
> regressions for someone's workload. Currently we only allow it to be set
> to 1 (default 0). If set to 1 it will not create negative dentries
> during unlink. If that's sufficient, then recommend this to users who
> suffer from this problem; if not, consider adding another sensible
> policy.
Okay, I'll forward that info.
However, hundreds of millions of negative dentries can be created
rather efficiently without unlink, though this one probably doesn't
happen under normal circumstances. Allowing this to starve the
scheduler for an arbitrarily long time is not a good idea in any case,
so the fsnotify problem needs some other solution, and I suspect that
it's not to disable negative caching completely, as that would be a
major bummer.
But the idea of leaving negative dentries off d_children is
independent of caching policy. The lookup cache would work fine
without d_sib being chained, it only needs careful thought in
1) putting the dentry on d_children when it turns positive
2) taking the dentry off d_children when it turns negative.
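The two steps above can be sketched with a userspace toy model. This is not kernel code: the Directory class, its `hashed` dict (standing in for the lookup hash) and its `children` dict (standing in for d_children) are all invented for illustration, and locking, RCU and refcounting are ignored entirely.

```python
class Dentry:
    def __init__(self, name, positive):
        self.name = name
        self.positive = positive


class Directory:
    """Toy model: only positive dentries are chained on `children`."""

    def __init__(self):
        self.hashed = {}    # name -> Dentry, used by lookup
        self.children = {}  # name -> Dentry, positive entries only

    def lookup(self, name):
        d = self.hashed.get(name)
        if d is None:
            # Cache the miss as a negative dentry, but do not chain it.
            d = self.hashed[name] = Dentry(name, positive=False)
        return d

    def create(self, name):
        d = self.lookup(name)
        d.positive = True
        self.children[name] = d  # step 1: chain when turning positive
        return d

    def unlink(self, name):
        d = self.hashed[name]
        d.positive = False
        self.children.pop(name, None)  # step 2: unchain when turning negative

    def walk_children(self):
        """Cost of a d_children walk in this model."""
        return len(self.children)


if __name__ == "__main__":
    d = Directory()
    for i in range(100000):
        d.lookup(f"missing-{i}")
    d.create("a")
    print(len(d.hashed), d.walk_children())
```

In this model a walk over the children is proportional to the number of positive entries, while negative lookups are still served from the hash.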
Thanks,
Miklos
* Re: bad things when too many negative dentries in a directory
2025-04-11 15:40 ` Miklos Szeredi
@ 2025-04-11 16:01 ` Matthew Wilcox
2025-04-14 14:07 ` James Bottomley
2025-04-14 6:28 ` Ian Kent
1 sibling, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2025-04-11 16:01 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Christian Brauner, linux-fsdevel, Al Viro, Amir Goldstein,
Jan Kara, Ian Kent
On Fri, Apr 11, 2025 at 05:40:08PM +0200, Miklos Szeredi wrote:
> However, hundreds of millions of negative dentries can be created
> rather efficiently without unlink, though this one probably doesn't
> happen under normal circumstances.
Depends on your userspace. Since we don't have union directories,
consider the not uncommon case of having a search path A:B:C. Application
looks for D in directory A, doesn't find it, creates a negative dentry.
Application looks for D in directory B, creates a negative dentry.
Application looks for D in directory C, doesn't find it, so it creates it.
Now we have two negative dentries and one positive dentry.
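As a userspace sketch of that pattern (an illustration only; find_or_create() is a made-up name, and the kernel-side effect of each failed lookup is inferred from the description above rather than measured here):

```python
import os
import tempfile


def find_or_create(name, search_path):
    """Look for `name` in each directory; create it in the last one if absent.

    Returns (path, misses), where misses counts the failed lookups.
    """
    misses = 0
    for directory in search_path:
        candidate = os.path.join(directory, name)
        if os.path.exists(candidate):
            return candidate, misses
        misses += 1
    target = os.path.join(search_path[-1], name)
    with open(target, "w"):
        pass
    return target, misses


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as a, \
         tempfile.TemporaryDirectory() as b, \
         tempfile.TemporaryDirectory() as c:
        print(find_or_create("D", [a, b, c]))
```

For a search path of three directories and a fresh name, all three lookups miss. The miss in the last directory is followed by the create, which is why only the first two would be expected to linger as negative entries.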
And for some applications, the name "D" is going to be unique, so the
negative dentries have _no_ further use. The application isn't even
going to open C/D again. If there's no memory pressure, we can build
up billions of dentries. I believe the customer is currently echoing
2 to /proc/sys/vm/drop_caches every hour.
* Re: bad things when too many negative dentries in a directory
2025-04-11 9:40 bad things when too many negative dentries in a directory Miklos Szeredi
2025-04-11 14:47 ` Christian Brauner
@ 2025-04-11 21:02 ` Mateusz Guzik
2025-04-20 4:49 ` Al Viro
2 siblings, 0 replies; 23+ messages in thread
From: Mateusz Guzik @ 2025-04-11 21:02 UTC (permalink / raw)
To: Miklos Szeredi
Cc: linux-fsdevel, Al Viro, Christian Brauner, Amir Goldstein,
Jan Kara, Ian Kent
On Fri, Apr 11, 2025 at 11:40:28AM +0200, Miklos Szeredi wrote:
> There are reports of softlockups in fsnotify if there are large numbers
> of negative dentries (e.g. ~300M) in a directory. This can happen if
> lots of temp files are created and removed and there's not enough
> memory pressure to trigger the lru shrinker.
>
> These are on old kernels and some of this is possibly due to missing
> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily"), but I
> managed to reproduce the softlockup on a recent kernel in
> fsnotify_set_children_dentry_flags() (see end of mail).
>
> This was with ~1.2G negative dentries. Doing "rmdir testdir"
> afterwards does not trigger the softlockup detector, due to the
> reschedules in shrink_dcache_parent() code, but it took 10 minutes(!)
> to finish removing that empty directory.
>
I wrote about this some time ago:
https://lore.kernel.org/linux-fsdevel/f7bp3ggliqbb7adyysonxgvo6zn76mo4unroagfcuu3bfghynu@7wkgqkfb5c43/#t
Bottom line: only a small subset of negative entries is useful in the
long run.
While a great policy to tame the total count without hindering
performance is left as an exercise for the reader(tm), I outlined
something which should be *tolerable*.
* Re: bad things when too many negative dentries in a directory
2025-04-11 14:47 ` Christian Brauner
2025-04-11 15:40 ` Miklos Szeredi
@ 2025-04-12 1:48 ` Ian Kent
2025-04-12 1:56 ` Ian Kent
2025-04-12 6:31 ` Ian Kent
1 sibling, 2 replies; 23+ messages in thread
From: Ian Kent @ 2025-04-12 1:48 UTC (permalink / raw)
To: Christian Brauner, Miklos Szeredi
Cc: linux-fsdevel, Al Viro, Amir Goldstein, Jan Kara
On 11/4/25 22:47, Christian Brauner wrote:
> On Fri, Apr 11, 2025 at 11:40:28AM +0200, Miklos Szeredi wrote:
>> There are reports of softlockups in fsnotify if there are large numbers
>> of negative dentries (e.g. ~300M) in a directory. This can happen if
>> lots of temp files are created and removed and there's not enough
>> memory pressure to trigger the lru shrinker.
>>
>> These are on old kernels and some of this is possibly due to missing
>> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily"), but I
>> managed to reproduce the softlockup on a recent kernel in
>> fsnotify_set_children_dentry_flags() (see end of mail).
>>
>> This was with ~1.2G negative dentries. Doing "rmdir testdir"
>> afterwards does not trigger the softlockup detector, due to the
>> reschedules in shrink_dcache_parent() code, but it took 10 minutes(!)
>> to finish removing that empty directory.
>>
>> So I wonder, do we really want negative dentries on ->d_children?
>> Except for shrink_dcache_parent() I don't see any uses. And it's also
>> a question whether shrinking negative dentries is useful or not. If
>> they've been around for so long that hundreds of millions of them
>> could accumulate and that memory wasn't needed by anybody, then it
>> shouldn't make a big difference if they kept hanging around. On
>> umount, at the latest, the lru list can be used to kill everything,
>> AFAICT.
>>
>> I'm curious if this is the right path? Any better ideas?
> Note that we have a new sysctl:
>
> /proc/sys/fs/dentry-negative
>
> that can be used to control the negative dentry policy because any
> generic change that we tried to make has always resulted in unacceptable
> regressions for someone's workload. Currently we only allow it to be set
> to 1 (default 0). If set to 1 it will not create negative dentries
> during unlink. If that's sufficient, then recommend this to users who
> suffer from this problem; if not, consider adding another sensible
> policy.
Interesting, I wasn't sure how the negative dentries were accumulating, but
I didn't actually look at the unlink code (I'll take a look). I thought the
most likely cause was laziness in not unlinking temporary files (the file
names in question "looked" like temporary file names).
When I do look at unlink I suspect I'll find the VFS is justified in caching
these and that the responsibility lies (or should lie) with the file system
callback to unhash the dentry if it doesn't want this caching ... but the
file system always doing this is not ideal either ... maybe we need a hint
so that the relevant file system callbacks can make this decision for
themselves.
Ian
* Re: bad things when too many negative dentries in a directory
2025-04-12 1:48 ` Ian Kent
@ 2025-04-12 1:56 ` Ian Kent
2025-04-12 6:31 ` Ian Kent
1 sibling, 0 replies; 23+ messages in thread
From: Ian Kent @ 2025-04-12 1:56 UTC (permalink / raw)
To: Christian Brauner, Miklos Szeredi
Cc: linux-fsdevel, Al Viro, Amir Goldstein, Jan Kara
On 12/4/25 09:48, Ian Kent wrote:
>
> On 11/4/25 22:47, Christian Brauner wrote:
>> On Fri, Apr 11, 2025 at 11:40:28AM +0200, Miklos Szeredi wrote:
>>> There are reports of softlockups in fsnotify if there are large numbers
>>> of negative dentries (e.g. ~300M) in a directory. This can happen if
>>> lots of temp files are created and removed and there's not enough
>>> memory pressure to trigger the lru shrinker.
>>>
>>> These are on old kernels and some of this is possibly due to missing
>>> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily"), but I
>>> managed to reproduce the softlockup on a recent kernel in
>>> fsnotify_set_children_dentry_flags() (see end of mail).
>>>
>>> This was with ~1.2G negative dentries. Doing "rmdir testdir"
>>> afterwards does not trigger the softlockup detector, due to the
>>> reschedules in shrink_dcache_parent() code, but it took 10 minutes(!)
>>> to finish removing that empty directory.
>>>
>>> So I wonder, do we really want negative dentries on ->d_children?
>>> Except for shrink_dcache_parent() I don't see any uses. And it's also
>>> a question whether shrinking negative dentries is useful or not. If
>>> they've been around for so long that hundreds of millions of them
>>> could accumulate and that memory wasn't needed by anybody, then it
>>> shouldn't make a big difference if they kept hanging around. On
>>> umount, at the latest, the lru list can be used to kill everything,
>>> AFAICT.
>>>
>>> I'm curious if this is the right path? Any better ideas?
>> Note that we have a new sysctl:
>>
>> /proc/sys/fs/dentry-negative
>>
>> that can be used to control the negative dentry policy because any
>> generic change that we tried to make has always resulted in unacceptable
>> regressions for someone's workload. Currently we only allow it to be set
>> to 1 (default 0). If set to 1 it will not create negative dentries
>> during unlink. If that's sufficient, then recommend this to users who
>> suffer from this problem; if not, consider adding another sensible
>> policy.
>
> Interesting, I wasn't sure how the negative dentries were accumulating, but
> I didn't actually look at the unlink code (I'll take a look). I thought the
> most likely cause was laziness in not unlinking temporary files (the file
> names in question "looked" like temporary file names).
>
> When I do look at unlink I suspect I'll find the VFS is justified in caching
> these and that the responsibility lies (or should lie) with the file system
> callback to unhash the dentry if it doesn't want this caching ... but the
> file system always doing this is not ideal either ... maybe we need a hint
> so that the relevant file system callbacks can make this decision for
> themselves.
Crikey, I thought I had seen something like this in the VFS.
We already have a hint, DCACHE_DONTCACHE, and an exported VFS function to
handle it, d_mark_dontcache(), with several file systems using it.
I'll keep looking ...
* Re: bad things when too many negative dentries in a directory
2025-04-12 1:48 ` Ian Kent
2025-04-12 1:56 ` Ian Kent
@ 2025-04-12 6:31 ` Ian Kent
1 sibling, 0 replies; 23+ messages in thread
From: Ian Kent @ 2025-04-12 6:31 UTC (permalink / raw)
To: Christian Brauner, Miklos Szeredi
Cc: linux-fsdevel, Al Viro, Amir Goldstein, Jan Kara
On 12/4/25 09:48, Ian Kent wrote:
>
> On 11/4/25 22:47, Christian Brauner wrote:
>> On Fri, Apr 11, 2025 at 11:40:28AM +0200, Miklos Szeredi wrote:
>>> There are reports of softlockups in fsnotify if there are large numbers
>>> of negative dentries (e.g. ~300M) in a directory. This can happen if
>>> lots of temp files are created and removed and there's not enough
>>> memory pressure to trigger the lru shrinker.
>>>
>>> These are on old kernels and some of this is possibly due to missing
>>> 172e422ffea2 ("fsnotify: clear PARENT_WATCHED flags lazily"), but I
>>> managed to reproduce the softlockup on a recent kernel in
>>> fsnotify_set_children_dentry_flags() (see end of mail).
>>>
>>> This was with ~1.2G negative dentries. Doing "rmdir testdir"
>>> afterwards does not trigger the softlockup detector, due to the
>>> reschedules in shrink_dcache_parent() code, but it took 10 minutes(!)
>>> to finish removing that empty directory.
>>>
>>> So I wonder, do we really want negative dentries on ->d_children?
>>> Except for shrink_dcache_parent() I don't see any uses. And it's also
>>> a question whether shrinking negative dentries is useful or not. If
>>> they've been around for so long that hundreds of millions of them
>>> could accumulate and that memory wasn't needed by anybody, then it
>>> shouldn't make a big difference if they kept hanging around. On
>>> umount, at the latest, the lru list can be used to kill everything,
>>> AFAICT.
>>>
>>> I'm curious if this is the right path? Any better ideas?
>> Note that we have a new sysctl:
>>
>> /proc/sys/fs/dentry-negative
>>
>> that can be used to control the negative dentry policy because any
>> generic change that we tried to make has always resulted in unacceptable
>> regressions for someone's workload. Currently we only allow it to be set
>> to 1 (default 0). If set to 1 it will not create negative dentries
>> during unlink. If that's sufficient, then recommend this to users who
>> suffer from this problem; if not, consider adding another sensible
>> policy.
>
> Interesting, I wasn't sure how the negative dentries were accumulating, but
> I didn't actually look at the unlink code (I'll take a look). I thought the
> most likely cause was laziness in not unlinking temporary files (the file
> names in question "looked" like temporary file names).
>
> When I do look at unlink I suspect I'll find the VFS is justified in caching
> these and that the responsibility lies (or should lie) with the file system
> callback to unhash the dentry if it doesn't want this caching ... but the
> file system always doing this is not ideal either ... maybe we need a hint
> so that the relevant file system callbacks can make this decision for
> themselves.
But I didn't find this to be the case at all.
Assuming the customer application is behaving sensibly and calling unlink()
on files that are no longer needed (for temporary files that should be the
case), then unhashing the dentry before the final dput() will indeed result
in the dentry being discarded.
It looks like all we need is e6957c99dca5f ("vfs: Add a sysctl for automated
deletion of dentry") and that looks like it will apply cleanly to the kernel
we are concerned with.
It will be interesting to test this to see if the application is actually
behaving.
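One way to watch whether negative dentries are still piling up during such a test is to sample /proc/sys/fs/dentry-state before and after. A minimal sketch, assuming a kernel that reports the six fields nr_dentry, nr_unused, age_limit, want_pages, nr_negative and dummy (older kernels leave the fifth field unused); the helper names are made up for illustration:

```python
from typing import NamedTuple


class DentryState(NamedTuple):
    """Fields of /proc/sys/fs/dentry-state, in the order the kernel prints them."""
    nr_dentry: int    # total allocated dentries
    nr_unused: int    # dentries on the LRU (not actively referenced)
    age_limit: int
    want_pages: int
    nr_negative: int  # unused dentries that are negative
    dummy: int


def parse_dentry_state(text: str) -> DentryState:
    """Parse the whitespace-separated contents of dentry-state."""
    fields = [int(tok) for tok in text.split()]
    if len(fields) != 6:
        raise ValueError(f"expected 6 fields, got {len(fields)}")
    return DentryState(*fields)


def read_dentry_state(path: str = "/proc/sys/fs/dentry-state") -> DentryState:
    """Read and parse the live counters (Linux only)."""
    with open(path) as f:
        return parse_dentry_state(f.read())
```

Calling read_dentry_state().nr_negative before and after running the workload, with and without the sysctl set, should show whether unlink is the source of the growth.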
>
>
> Ian
>
>>
>>> Thanks,
>>> Miklos
>>>
>>>
>>> [96789.366007] watchdog: BUG: soft lockup - CPU#79 stuck for 26s!
>>> [fanotify4:52805]
>>> [96789.373396] Modules linked in: rfkill mlx5_ib ib_uverbs macsec
>>> ib_core vfat fat mlx5_core acpi_ipmi ast ipmi_ssif arm_spe_pmu igb
>>> mlxfw psample i2c_algo_bit tls pci_hyperv_intf ipmi_devintf
>>> ipmi_msghandler arm_cmn arm_dmc620_pmu arm_dsu_pmu cppc_cpufreq loop
>>> fuse nfnetlink xfs nvme crct10dif_ce ghash_ce sha2_ce sha256_arm64
>>> nvme_core sha1_ce sbsa_gwdt nvme_auth i2c_designware_platform
>>> i2c_designware_core xgene_hwmon dm_mirror dm_region_hash dm_log dm_mod
>>> [96789.413624] CPU: 79 UID: 0 PID: 52805 Comm: fanotify4 Kdump: loaded
>>> Not tainted 6.12.0-55.9.1.el10_0.aarch64 #1
>>> [96789.423698] Hardware name: GIGABYTE R272-P30-JG/MP32-AR0-JG, BIOS
>>> F31n (SCP: 2.10.20220810) 09/30/2022
>>> [96789.432990] pstate: a0400009 (NzCv daif +PAN -UAO -TCO -DIT -SSBS
>>> BTYPE=--)
>>> [96789.439939] pc : fsnotify_set_children_dentry_flags+0x80/0xf0
>>> [96789.445675] lr : fsnotify_set_children_dentry_flags+0xa4/0xf0
>>> [96789.451408] sp : ffff8000cc77b8c0
>>> [96789.454710] x29: ffff8000cc77b8c0 x28: 0000000000000001 x27:
>>> 0000000000000000
>>> [96789.461833] x26: ffff07ff8463dc50 x25: ffff080e6e44dc50 x24:
>>> 0000000000000001
>>> [96789.468956] x23: ffff07ff9d94eec0 x22: ffff07fff2cf01b8 x21:
>>> ffff07ff9d94ee40
>>> [96789.476079] x20: ffff0800eb6dff40 x19: ffff0800eb6df2c0 x18:
>>> 0000000000000014
>>> [96789.483202] x17: 00000000cec6e315 x16: 00000000ed365140 x15:
>>> 00000000ae8684a4
>>> [96789.490325] x14: 000000000d831309 x13: 00000000387d7ee0 x12:
>>> 0000000000000000
>>> [96789.497448] x11: 0000000000000001 x10: 0000000000000001 x9 :
>>> ffffc3bacc1864bc
>>> [96789.504570] x8 : 000000001007ffff x7 : ffffc3bace89a4c0 x6 :
>>> 0000000000000001
>>> [96789.511694] x5 : 0000000008000020 x4 : 0000000000000000 x3 :
>>> 0000000000000003
>>> [96789.518816] x2 : 0000000000000001 x1 : 0000000000000000 x0 :
>>> ffff0800eb6df358
>>> [96789.525939] Call trace:
>>> [96789.528373] fsnotify_set_children_dentry_flags+0x80/0xf0
>>> [96789.533759] fsnotify_recalc_mask.part.0+0x94/0xc8
>>> [96789.538538] fsnotify_recalc_mask+0x1c/0x40
>>> [96789.542709] fanotify_add_mark+0x15c/0x360
>>> [96789.546794] do_fanotify_mark+0x3c0/0x7a0
>>> [96789.550791] __arm64_sys_fanotify_mark+0x30/0x60
>>> [96789.555396] invoke_syscall.constprop.0+0x74/0xd0
>>> [96789.560090] do_el0_svc+0xb0/0xe8
>>> [96789.563393] el0_svc+0x44/0x1d0
>>> [96789.566525] el0t_64_sync_handler+0x120/0x130
>>> [96789.570870] el0t_64_sync+0x1a4/0x1a8
>>> [151513.714945] INFO: task (ostnamed):77658 blocked for more than
>>> 122 seconds.
>>> [151513.721903] Tainted: G L ------- ---
>>> 6.12.0-55.9.1.el10_0.aarch64 #1
>>> [151513.730334] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>> disables this message.
>>> [151513.738241] task:(ostnamed) state:D stack:0 pid:77658
>>> tgid:77658 ppid:1 flags:0x00000205
>>> [151513.747625] Call trace:
>>> [151513.750146] __switch_to+0xec/0x148
>>> [151513.753712] __schedule+0x234/0x738
>>> [151513.757278] schedule+0x3c/0xe0
>>> [151513.760493] schedule_preempt_disabled+0x2c/0x58
>>> [151513.765188] rwsem_down_write_slowpath+0x1e4/0x720
>>> [151513.770054] down_write+0xac/0xc0
>>> [151513.773444] do_lock_mount+0x3c/0x220
>>> [151513.777185] path_mount+0x378/0x810
>>> [151513.780748] __arm64_sys_mount+0x158/0x2d8
>>> [151513.784921] invoke_syscall.constprop.0+0x74/0xd0
>>> [151513.789702] do_el0_svc+0xb0/0xe8
>>> [151513.793093] el0_svc+0x44/0x1d0
>>> [151513.796312] el0t_64_sync_handler+0x120/0x130
>>> [151513.800744] el0t_64_sync+0x1a4/0x1a8
>
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-11 15:40 ` Miklos Szeredi
2025-04-11 16:01 ` Matthew Wilcox
@ 2025-04-14 6:28 ` Ian Kent
2025-04-14 7:17 ` Miklos Szeredi
1 sibling, 1 reply; 23+ messages in thread
From: Ian Kent @ 2025-04-14 6:28 UTC (permalink / raw)
To: Miklos Szeredi, Christian Brauner
Cc: linux-fsdevel, Al Viro, Amir Goldstein, Jan Kara
On 11/4/25 23:40, Miklos Szeredi wrote:
> On Fri, 11 Apr 2025 at 16:47, Christian Brauner <brauner@kernel.org> wrote:
>
>> Note that we have a new sysctl:
>>
>> /proc/sys/fs/dentry-negative
>>
>> that can be used to control the negative dentry policy because any
>> generic change that we tried to make has always resulted in unacceptable
>> regressions for someone's workload. Currently we only allow it to be set
>> to 1 (default 0). If set to 1 it will not create negative dentries
>> during unlink. If that's sufficient, then recommend this to users that
>> suffer from this problem; if not, consider adding another sensitive
>> policy.
> Okay, I'll forward that info.
>
> However, hundreds of millions of negative dentries can be created
> rather efficiently without unlink, though this one probably doesn't
> happen under normal circumstances. Allowing this to starve the
> scheduler for an arbitrary long time is not a good idea in any case,
> so the fsnotify problem needs some other solution, and I suspect that
> it's not to disable negative caching completely, as that would be a
> major bummer.
I know that the most recent case we have seen of this would probably
be resolved by the sysctl but this was not the first recent case we had.
Unfortunately I can't remember the details, all I remember is it was
similar but not quite the same.
In any case it's quite possible that many files can be processed, opened
and then closed, not unlinked.
So I think it's worth considering this as well.
>
> But the idea of leaving negative dentries off d_children is
> independent of caching policy. The lookup cache would work fine
> without d_sib being chained, it only needs careful thought in
>
> 1) putting the dentry on d_children when it's turned into positive
> 2) getting the dentry off d_children when it's turned into negative.
That shouldn't be too difficult to do ... sounds like a good idea to me.
Ian
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-14 6:28 ` Ian Kent
@ 2025-04-14 7:17 ` Miklos Szeredi
0 siblings, 0 replies; 23+ messages in thread
From: Miklos Szeredi @ 2025-04-14 7:17 UTC (permalink / raw)
To: Ian Kent
Cc: Christian Brauner, linux-fsdevel, Al Viro, Amir Goldstein,
Jan Kara
On Mon, 14 Apr 2025 at 08:28, Ian Kent <raven@themaw.net> wrote:
> > 1) putting the dentry on d_children when it's turned into positive
> > 2) getting the dentry off d_children when it's turned into negative.
>
> That shouldn't be too difficult to do ... sounds like a good idea to me.
I hadn't reckoned with parent pointers. While not actually
dereferenced, they are compared on cache lookup. So if the parent is
removed and a directory dentry is recreated with the same pointer, the
cache becomes corrupted.
Keeping the parent alive while any negative child dentries remain
doesn't sound too difficult, e.g. we'd need an additional refcount that
is incremented in the parent on child unlink and decremented on child
reclaim. But that's more space in struct dentry and more
complexity...
Thanks,
Miklos
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-11 16:01 ` Matthew Wilcox
@ 2025-04-14 14:07 ` James Bottomley
2025-04-14 14:30 ` Matthew Wilcox
0 siblings, 1 reply; 23+ messages in thread
From: James Bottomley @ 2025-04-14 14:07 UTC (permalink / raw)
To: Matthew Wilcox, Miklos Szeredi
Cc: Christian Brauner, linux-fsdevel, Al Viro, Amir Goldstein,
Jan Kara, Ian Kent
On Fri, 2025-04-11 at 17:01 +0100, Matthew Wilcox wrote:
> On Fri, Apr 11, 2025 at 05:40:08PM +0200, Miklos Szeredi wrote:
> > However, hundreds of millions of negative dentries can be created
> > rather efficiently without unlink, though this one probably doesn't
> > happen under normal circumstances.
>
> Depends on your userspace. Since we don't have union directories,
> consider the not uncommon case of having a search path A:B:C.
> Application looks for D in directory A, doesn't find it, creates a
> negative dentry. Application looks for D in directory B, creates a
> negative dentry. Application looks for D in directory C, doesn't find
> it, so it creates it. Now we have two negative dentries and one
> positive dentry.
If an application does an A:B:C directory search pattern it's usually
because it doesn't directly own the file location and hence suggests
that other applications would also be looking for it, which would seem
to indicate, if the search pattern gets repeated, that the two negative
dentries do serve a purpose.
> And for some applications, the name "D" is going to be unique, so the
> negative dentries have _no_ further use. The application isn't even
> going to open C/D again. If there's no memory pressure, we can build
> up billions of dentries. I believe the customer is currently echoing
> 2 to /proc/sys/vm/drop_caches every hour.
So this is an application that's the sole owner of D (i.e. sole
controller of the entire path) yet it still does a search for it, why
is that (if it's something like to update the location, it would be
better served by first looking in the default location before searching
others)? The problem is the pattern exactly matches the shared file
one above so there doesn't seem to be a heuristic way to distinguish
them.
Regards,
James
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-14 14:07 ` James Bottomley
@ 2025-04-14 14:30 ` Matthew Wilcox
2025-04-14 15:40 ` James Bottomley
0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2025-04-14 14:30 UTC (permalink / raw)
To: James Bottomley
Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel, Al Viro,
Amir Goldstein, Jan Kara, Ian Kent
On Mon, Apr 14, 2025 at 10:07:09AM -0400, James Bottomley wrote:
> On Fri, 2025-04-11 at 17:01 +0100, Matthew Wilcox wrote:
> > On Fri, Apr 11, 2025 at 05:40:08PM +0200, Miklos Szeredi wrote:
> > > However, hundreds of millions of negative dentries can be created
> > > rather efficiently without unlink, though this one probably doesn't
> > > happen under normal circumstances.
> >
> > Depends on your userspace. Since we don't have union directories,
> > consider the not uncommon case of having a search path A:B:C.
> > Application looks for D in directory A, doesn't find it, creates a
> > negative dentry. Application looks for D in directory B, creates a
> > negative dentry. Application looks for D in directory C, doesn't find
> > it, so it creates it. Now we have two negative dentries and one
> > positive dentry.
>
> If an application does an A:B:C directory search pattern it's usually
> because it doesn't directly own the file location and hence suggests
> that other applications would also be looking for it, which would seem
> to indicate, if the search pattern gets repeated, that the two negative
> dentries do serve a purpose.
Not in this case. It's doing something like looking in /etc/app.d
/usr/share/app/defaults/ and then /var/run/app/ . Don't quote me on the
exact paths, or suggest alternatives based on these names; it's been a
few years since I last looked. But I can assure you no other app is
looking at these dentries; they're looked up exactly once.
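The growth pattern is easy to model. A toy sketch, assuming a plain dict stands in for the dcache (keyed by directory and name, value None for a negative dentry) and that nothing is ever reclaimed; the function names are invented for the illustration:

```python
def lookup_or_create(dcache, search_path, name):
    """Probe each directory in order; on a total miss create the file in the last one."""
    for directory in search_path:
        key = (directory, name)
        if dcache.get(key) is not None:
            return key              # positive hit, stop searching
        dcache[key] = None          # miss: leave a negative dentry behind
    created = (search_path[-1], name)
    dcache[created] = "inode"       # the negative dentry in the last dir turns positive
    return created


def count_dentries(dcache):
    """Return (negative, positive) dentry counts."""
    negative = sum(1 for inode in dcache.values() if inode is None)
    return negative, len(dcache) - negative


if __name__ == "__main__":
    dcache = {}
    for i in range(1000):
        lookup_or_create(dcache, ["A", "B", "C"], f"D{i}")
    print(count_dentries(dcache))
```

Every unique name leaves two negative dentries and one positive one, so with no memory pressure the negative count simply tracks twice the number of files ever created.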
> > And for some applications, the name "D" is going to be unique, so the
> > negative dentries have _no_ further use. The application isn't even
> > going to open C/D again. If there's no memory pressure, we can build
> > up billions of dentries. I believe the customer is currently echoing
> > 2 to /proc/sys/vm/drop_caches every hour.
>
> So this is an application that's the sole owner of D (i.e. sole
> controller of the entire path) yet it still does a search for it, why
> is that (if it's something like to update the location, it would be
> better served by first looking in the default location before searching
> others)? The problem is the pattern exactly matches the shared file
> one above so there doesn't seem to be a heuristic way to distinguish
> them.
Everything works fine when there's memory pressure. The problem is that
negative dentry growth is only constrained by available memory; there's
no reclaim of negative dentries which haven't been looked at in seconds
or minutes.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-14 14:30 ` Matthew Wilcox
@ 2025-04-14 15:40 ` James Bottomley
2025-04-14 16:14 ` Matthew Wilcox
0 siblings, 1 reply; 23+ messages in thread
From: James Bottomley @ 2025-04-14 15:40 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel, Al Viro,
Amir Goldstein, Jan Kara, Ian Kent
On Mon, 2025-04-14 at 15:30 +0100, Matthew Wilcox wrote:
> On Mon, Apr 14, 2025 at 10:07:09AM -0400, James Bottomley wrote:
> > On Fri, 2025-04-11 at 17:01 +0100, Matthew Wilcox wrote:
> > > On Fri, Apr 11, 2025 at 05:40:08PM +0200, Miklos Szeredi wrote:
> > > > However, hundreds of millions of negative dentries can be
> > > > created rather efficiently without unlink, though this one
> > > > probably doesn't happen under normal circumstances.
> > >
> > > Depends on your userspace. Since we don't have union
> > > directories, consider the not uncommon case of having a search
> > > path A:B:C. Application looks for D in directory A, doesn't find
> > > it, creates a negative dentry. Application looks for D in
> > > directory B, creates a negative dentry. Application looks for D
> > > in directory C, doesn't find it, so it creates it. Now we have
> > > two negative dentries and one positive dentry.
> >
> > If an application does an A:B:C directory search pattern it's
> > usually because it doesn't directly own the file location and hence
> > suggests that other applications would also be looking for it,
> > which would seem to indicate, if the search pattern gets repeated,
> > that the two negative dentries do serve a purpose.
>
> Not in this case. It's doing something like looking in /etc/app.d
> /usr/share/app/defaults/ and then /var/run/app/ . Don't quote me on
> the exact paths, or suggest alternatives based on these names; it's
> been a few years since I last looked. But I can assure you no other
> app is looking at these dentries; they're looked up exactly once.
I get that that's what it's doing, and why the negative dentries are
useless since the file name is app specific; I'm just curious why an app
that knows it's the only consumer of a file places it in the last place it
looks rather than the first ... it seems to be suboptimal and difficult
for us to detect heuristically.
Regards,
James
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-14 15:40 ` James Bottomley
@ 2025-04-14 16:14 ` Matthew Wilcox
2025-04-14 17:58 ` James Bottomley
0 siblings, 1 reply; 23+ messages in thread
From: Matthew Wilcox @ 2025-04-14 16:14 UTC (permalink / raw)
To: James Bottomley
Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel, Al Viro,
Amir Goldstein, Jan Kara, Ian Kent
On Mon, Apr 14, 2025 at 11:40:36AM -0400, James Bottomley wrote:
> On Mon, 2025-04-14 at 15:30 +0100, Matthew Wilcox wrote:
> > > If an application does an A:B:C directory search pattern it's
> > > usually because it doesn't directly own the file location and hence
> > > suggests that other applications would also be looking for it,
> > > which would seem to indicate, if the search pattern gets repeated,
> > > that the two negative dentries do serve a purpose.
> >
> > Not in this case. It's doing something like looking in /etc/app.d
> > /usr/share/app/defaults/ and then /var/run/app/ . Don't quote me on
> > the exact paths, or suggest alternatives based on these names; it's
> > been a few years since I last looked. But I can assure you no other
> > app is looking at these dentries; they're looked up exactly once.
>
> I got that's what it's doing, and why the negative dentries are useless
> since the file name is app specific, I'm just curious why an app that
> knows it's the only consumer of a file places it in the last place it
> looks rather than the first ... it seems to be suboptimal and difficult
> for us to detect heuristically.
The first two are read only. One is where the package could have an
override, the second is where the local sysadmin could have an override.
The third is writable. It's not entirely insane.
Another way to solve this would be to notice "hey, this directory only has
three entries and umpteen negative entries, let's do the thing that ramfs
does to tell the dcache that it knows about all positive entries in this
directory and delete all the negative ones". I forget what flag that is.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-14 16:14 ` Matthew Wilcox
@ 2025-04-14 17:58 ` James Bottomley
2025-04-15 17:22 ` Andreas Dilger
0 siblings, 1 reply; 23+ messages in thread
From: James Bottomley @ 2025-04-14 17:58 UTC (permalink / raw)
To: Matthew Wilcox
Cc: Miklos Szeredi, Christian Brauner, linux-fsdevel, Al Viro,
Amir Goldstein, Jan Kara, Ian Kent
On Mon, 2025-04-14 at 17:14 +0100, Matthew Wilcox wrote:
[...]
> > I got that's what it's doing, and why the negative dentries are
> > useless since the file name is app specific, I'm just curious why
> > an app that knows it's the only consumer of a file places it in the
> > last place it looks rather than the first ... it seems to be
> > suboptimal and difficult for us to detect heuristically.
>
> The first two are read only. One is where the package could have an
> override, the second is where the local sysadmin could have an
> override. The third is writable. It's not entirely insane.
>
> Another way to solve this would be to notice "hey, this directory
> only has three entries and umpteen negative entries, let's do the
> thing that ramfs does to tell the dcache that it knows about all
> positive entries in this directory and delete all the negative
> ones". I forget what flag that is.
It's not a flag, it's the dentry operations for pseudo filesystems
(simple_lookup sets simple_dentry_operations which provides a d_delete
that always says don't retain). However, that's really because all
pseudo filesystems have a complete dentry cache (all visible files have
dentries), so there's no benefit caching negative lookups (and the
d_delete trick only affects negative dentries because positive ones
have a non zero refcount).
There is a DCACHE_DONTCACHE flag that dumps a dentry (regardless of
positive or negative) on final dput; I suppose that could be set in
lookup_open() on negative under some circumstances (open flag, sysctl,
etc.).
Regards,
James
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-14 17:58 ` James Bottomley
@ 2025-04-15 17:22 ` Andreas Dilger
2025-04-16 15:18 ` Miklos Szeredi
2025-04-16 15:26 ` James Bottomley
0 siblings, 2 replies; 23+ messages in thread
From: Andreas Dilger @ 2025-04-15 17:22 UTC (permalink / raw)
To: James Bottomley
Cc: Matthew Wilcox, Miklos Szeredi, Christian Brauner, linux-fsdevel,
Al Viro, Amir Goldstein, Jan Kara, Ian Kent
On Apr 14, 2025, at 11:58 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>
> On Mon, 2025-04-14 at 17:14 +0100, Matthew Wilcox wrote:
> [...]
>>> I got that's what it's doing, and why the negative dentries are
>>> useless since the file name is app specific, I'm just curious why
>>> an app that knows it's the only consumer of a file places it in the
>>> last place it looks rather than the first ... it seems to be
>>> suboptimal and difficult for us to detect heuristically.
>>
>> The first two are read only. One is where the package could have an
>> override, the second is where the local sysadmin could have an
>> override. The third is writable. It's not entirely insane.
>>
>> Another way to solve this would be to notice "hey, this directory
>> only has three entries and umpteen negative entries, let's do the
>> thing that ramfs does to tell the dcache that it knows about all
>> positive entries in this directory and delete all the negative
>> ones". I forget what flag that is.
>
> It's not a flag, it's the dentry operations for pseudo filesystems
> (simple_lookup sets simple_dentry_operations which provides a d_delete
> that always says don't retain). However, that's really because all
> pseudo filesystems have a complete dentry cache (all visible files have
> dentries), so there's no benefit caching negative lookups (and the
> d_delete trick only affects negative dentries because positive ones
> have a non zero refcount).
>
> There is a DCACHE_DONTCACHE flag that dumps a dentry (regardless of
> positive or negative) on final dput I suppose that could be set in
> lookup_open() on negative under some circumstances (open flag, sysctl,
> etc.).
Negative dentries are only useful if there are fewer of them than the
number of entries in that directory.
If the negative dentry count exceeds the actual entry count, it would
be more efficient to just cache all of the positive dentries and mark
the directory with a "full dentry list" flag that indicates all of the
names are already present in dcache and any miss is authoritative.
In essence that gives an "infinite" negative lookup cache instead of
explicitly storing all of the possible negative entries.
For directories like ~/bin, /usr/bin, /usr/lib64, etc. (or any directory)
where negative lookups are frequent, it should be possible to determine
this threshold automatically. Once the negative dentry count exceeds
the size of the directory by some factor (e.g. directory size / 16,
or the actual entry count if the filesystem knows this, it doesn't have
to be exactly correct) then a readdir could load all of the names to
fully populate the dcache, and setting the "full dentry list" flag on the
directory would allow dropping all negative dentries in that directory.
The VFS/VM should avoid dropping directories/dentries from cache in this
case, since it is saving more memory (and avoiding filesystem IO) to keep
them pinned rather than dropping them from cache. There might need to be
a matching "part of full dentry list" flag on the positive dentries to
avoid dcache shrinking of those entries (which would invalidate the premise
that the parent holds all of the possible entries in that directory), if
checking the parent's flag is too expensive.
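To make the idea concrete, here is a toy model only, not kernel code: it assumes the filesystem knows the exact entry count and uses that as the threshold, and the class and attribute names (DirCache, complete) are hypothetical stand-ins for the "full dentry list" flag.

```python
class DirCache:
    """Per-directory cache that switches to 'all names known' mode on too many misses."""

    def __init__(self, on_disk_names):
        self.on_disk = set(on_disk_names)  # stands in for the on-disk directory
        self.positive = set()              # cached positive dentries
        self.negative = set()              # cached negative dentries
        self.complete = False              # hypothetical "full dentry list" flag

    def lookup(self, name):
        if name in self.positive:
            return True
        if self.complete:
            return False                   # authoritative miss, nothing stored
        if name in self.negative:
            return False
        if name in self.on_disk:           # "disk" lookup
            self.positive.add(name)
            return True
        self.negative.add(name)
        if len(self.negative) > len(self.on_disk):
            self._populate()
        return False

    def _populate(self):
        """The readdir pass: cache every real name, then drop all negatives."""
        self.positive = set(self.on_disk)
        self.negative.clear()
        self.complete = True
```

With a three-entry directory, the fourth distinct miss trips the threshold; after that any number of misses costs no additional memory. The sketch ignores reclaim of the positive entries and any invalidation of the flag when the directory changes.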
Cheers, Andreas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-15 17:22 ` Andreas Dilger
@ 2025-04-16 15:18 ` Miklos Szeredi
2025-04-16 15:37 ` Matthew Wilcox
2025-04-16 21:41 ` Dave Chinner
2025-04-16 15:26 ` James Bottomley
1 sibling, 2 replies; 23+ messages in thread
From: Miklos Szeredi @ 2025-04-16 15:18 UTC (permalink / raw)
To: Andreas Dilger
Cc: James Bottomley, Matthew Wilcox, Christian Brauner, linux-fsdevel,
Al Viro, Amir Goldstein, Jan Kara, Ian Kent
On Tue, 15 Apr 2025 at 19:22, Andreas Dilger <adilger@dilger.ca> wrote:
> If the negative dentry count exceeds the actual entry count, it would
> be more efficient to just cache all of the positive dentries and mark
> the directory with a "full dentry list" flag that indicates all of the
> names are already present in dcache and any miss is authoritative.
> In essence that gives an "infinite" negative lookup cache instead of
> explicitly storing all of the possible negative entries.
This sounds nice in theory, but there are quite a number of things to sort out:
- The "full dir read" needs to be done in the background to avoid
large latencies, right?
- Instantiate inodes during this, or have some dentry flag indicating
that it's to be done later?
- When does the whole directory get reclaimed?
- What about revalidation in netfs? How often should a "full dir
read" get triggered?
I feel that it's just too complex.
What's wrong with just trying to get rid of the bad effects of
negative dentries, instead of getting rid of the dentries themselves?
Lack of memory pressure should mean that nobody else needs that
memory, so it should make no difference if it's used up in negative
dentries instead of being free memory. Maybe I'm missing something
fundamental?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-15 17:22 ` Andreas Dilger
2025-04-16 15:18 ` Miklos Szeredi
@ 2025-04-16 15:26 ` James Bottomley
2025-04-22 6:57 ` Andreas Dilger
1 sibling, 1 reply; 23+ messages in thread
From: James Bottomley @ 2025-04-16 15:26 UTC (permalink / raw)
To: Andreas Dilger
Cc: Matthew Wilcox, Miklos Szeredi, Christian Brauner, linux-fsdevel,
Al Viro, Amir Goldstein, Jan Kara, Ian Kent
On Tue, 2025-04-15 at 11:22 -0600, Andreas Dilger wrote:
[...]
> Negative dentries are only useful if there are fewer than the number
> of entries in that directory.
I agree with this, yes.
> If the negative dentry count exceeds the actual entry count,
Yes, but finding this number is going to be hard. We can't iterate a
directory to count them in the fast path and a directory i_size is
extremely filesystem and format dependent. However, since we only need
a rough count, perhaps having the filesystem export its average
directory entry size and simply dividing by that would give a good
enough approximation to the number?
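As a rough sketch of that arithmetic (the function names and the idea of a per-filesystem average entry size are assumptions for illustration, nothing like this is exported today):

```python
def estimate_entry_count(dir_size_bytes: int, avg_dirent_bytes: int) -> int:
    """Approximate the number of directory entries from the directory's i_size."""
    if avg_dirent_bytes <= 0:
        raise ValueError("average entry size must be positive")
    return dir_size_bytes // avg_dirent_bytes


def negatives_exceed_entries(nr_negative: int, dir_size_bytes: int,
                             avg_dirent_bytes: int) -> bool:
    """True once cached negative dentries outnumber the estimated real entries."""
    return nr_negative > estimate_entry_count(dir_size_bytes, avg_dirent_bytes)
```

For a 4096-byte directory and an assumed 32-byte average entry this gives an estimate of 128 entries, so the check would fire at the 129th negative dentry.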
> it would be more efficient to just cache all of the positive dentries
> and mark the directory with a "full dentry list" flag that indicates
> all of the names are already present in dcache and any miss is
> authoritative. In essence that gives an "infinite" negative lookup
> cache instead of explicitly storing all of the possible negative
> entries.
Practically, I think directories with that flag would probably
automatically retain positive child dentries as an addition to our
retain_dentry() logic and automatically kill negative ones.
This behaviour, though, would remove them from the shrinkers, so
probably there would have to be a global count of the number of
unshrinkable children this gives us and have that factor into the
superblock shrinkers somehow. Probably add the parent to the lru list
but make dentry_lru_isolate() always skip until the tipping point for
shrinking filled directories is reached?
> For directories like ~/bin, /usr/bin, /usr/lib64, etc. (or any
> directory) where negative lookups are frequent, it should be possible
> to determine this threshold automatically. Once the negative dentry
> count exceeds the size of the directory by some factor (e.g.
> directory size / 16, or the actual entry count if the filesystem
> knows this, it doesn't have to be exactly correct) then a readdir
> could load all of the names to fully populate the dcache and set the
> "full dentry list" flag on the directory would allow dropping all
> negative dentries in that directory.
All this supposes we have some per directory count of the negative
dentries. I think there'd be push back on adding this to struct dentry
and making it an exact count in the fast path. The next logical place
to evaluate it would be the shrinkers but then that wouldn't solve
Matthew's use case where the shrinkers don't get activated. I suppose
some flag that userspace could add to directories it identifies as hot
might be the next best thing?
> The VFS/VM should avoid dropping directories/dentries from cache in
> this case, since it is saving more memory (and avoiding filesystem
> IO) to keep them pinned rather than dropping them from cache. There
> might need to be a matching "part of full dentry list" flag on the
> positive dentries to avoid dcache shrinking of those entries (which
> would invalidate the premise that the parent holds all of the
> possible entries in that directory), if checking the parent's flag is
> too expensive.
As I said above, I think simply checking the parent flags in
retain_dentry should do. Since you don't need it to be exact and the
parent should have a positive refcount, it should be possible to do a
READ_ONCE rather than locking.
Regards,
James
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-16 15:18 ` Miklos Szeredi
@ 2025-04-16 15:37 ` Matthew Wilcox
2025-04-16 21:41 ` Dave Chinner
1 sibling, 0 replies; 23+ messages in thread
From: Matthew Wilcox @ 2025-04-16 15:37 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Andreas Dilger, James Bottomley, Christian Brauner, linux-fsdevel,
Al Viro, Amir Goldstein, Jan Kara, Ian Kent
On Wed, Apr 16, 2025 at 05:18:17PM +0200, Miklos Szeredi wrote:
> Lack of memory pressure should mean that nobody else needs that
> memory, so it should make no difference if it's used up in negative
> dentries instead of being free memory. Maybe I'm missing something
> fundamental?
You're missing two things:
- The dentry hash table is a fixed size. Long chains give poor
performance, so polluting the hash table with unused entries
has a cost.
- Eventually, we do trigger reclaim. And then we wait for hours while
the reclaiming process tries to shrink billions of entries. I think
we had a report on one machine of it taking more than 24 hours ("more
than" because the customer decided enough was enough and rebooted the
machine).
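The hash table cost is simple arithmetic: with a fixed number of buckets the average chain length grows linearly with the number of cached dentries. A back-of-the-envelope sketch, where the bucket count of 2**24 is purely an assumed figure (the real table is sized at boot from available memory):

```python
def average_chain_length(nr_dentries: int, nr_buckets: int) -> float:
    """Mean number of dentries per hash bucket, assuming a uniform hash."""
    if nr_buckets <= 0:
        raise ValueError("bucket count must be positive")
    return nr_dentries / nr_buckets


if __name__ == "__main__":
    buckets = 2 ** 24                      # assumed table size, not a measured value
    for nr in (10_000_000, 300_000_000, 1_200_000_000):
        print(nr, round(average_chain_length(nr, buckets), 1))
```

Every lookup that lands in a bucket has to walk past the unused entries chained there, which is why dentries that will never be looked up again are not free even when memory is plentiful.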
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-16 15:18 ` Miklos Szeredi
2025-04-16 15:37 ` Matthew Wilcox
@ 2025-04-16 21:41 ` Dave Chinner
1 sibling, 0 replies; 23+ messages in thread
From: Dave Chinner @ 2025-04-16 21:41 UTC (permalink / raw)
To: Miklos Szeredi
Cc: Andreas Dilger, James Bottomley, Matthew Wilcox,
Christian Brauner, linux-fsdevel, Al Viro, Amir Goldstein,
Jan Kara, Ian Kent
On Wed, Apr 16, 2025 at 05:18:17PM +0200, Miklos Szeredi wrote:
> On Tue, 15 Apr 2025 at 19:22, Andreas Dilger <adilger@dilger.ca> wrote:
>
> > If the negative dentry count exceeds the actual entry count, it would
> > be more efficient to just cache all of the positive dentries and mark
> > the directory with a "full dentry list" flag that indicates all of the
> > names are already present in dcache and any miss is authoritative.
> > In essence that gives an "infinite" negative lookup cache instead of
> > explicitly storing all of the possible negative entries.
>
> This sounds nice in theory, but there are quite a number of things to sort out:
>
> - The "full dir read" needs to be done in the background to avoid
> large latencies, right?
>
> - Instantiate inodes during this, or have some dentry flag indicating
> that it's to be done later?
>
> - When does the whole directory get reclaimed?
>
> - What about revalidation in netfs? How often should a "full dir
> read" get triggered?
>
> I feel that it's just too complex.
>
> What's wrong with just trying to get rid of the bad effects of
> negative dentries, instead of getting rid of the dentries themselves ?
>
> Lack of memory pressure should mean that nobody else needs that
> memory, so it should make no difference if it's used up in negative
> dentries instead of being free memory. Maybe I'm missing something
> fundamental?
There is no issue with the existence of huge numbers of negative
dentries. The issue is that the overhead and latency of reclaiming
hundreds of millions of tiny objects to release the memory is
prohibitive. Dentry reclaim is generally pretty slow, especially if
it is being done by a single background thread like kswapd.
FWIW, I think there is a simpler version of this "per-directory
dentry count" heuristic that might work well enough to bound the
upper maximum: apply the same heuristic to the entire dentry cache.
I'm pretty sure this has been proposed in the past, but we should
probably revisit it anyway because this problem hasn't gone away.
i.e. if the number of negative dentries exceeds the number of
positive dentries and the total number of dentries exceeds a certain
amount of memory, kick a background thread to reap some negative
dentries from the LRU. e.g. every 30s check if dentries exceed 10%
of memory and negative dentries exceed positive. If so, reap the
oldest 10% of negative dentries.
That will still allow a system with free memory to build up a -lot-
of negative dentries, but also largely bound the amount of free
memory that can be consumed by negative dentries to around 5% of
total memory.
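The periodic check described above could be sketched roughly as follows. This is a userspace illustration only, not kernel code: the struct, its fields, and the function name are hypothetical, and the 10% figures are the example values from the text rather than tuned constants.

```c
#include <stdint.h>

/* Hypothetical snapshot taken by the periodic (e.g. every 30s) check. */
struct dcache_sample {
	uint64_t total_mem_bytes;	/* total system memory */
	uint64_t dentry_mem_bytes;	/* memory held by all dentries */
	uint64_t nr_positive;		/* positive dentries */
	uint64_t nr_negative;		/* negative dentries */
};

/*
 * Return how many negative dentries to reap from the old end of the
 * LRU on this tick, or 0 if the heuristic does not fire.
 */
static uint64_t negative_reap_target(const struct dcache_sample *s)
{
	/* Condition 1: dentries exceed 10% of memory. */
	if (s->dentry_mem_bytes <= s->total_mem_bytes / 10)
		return 0;

	/* Condition 2: negative dentries exceed positive ones. */
	if (s->nr_negative <= s->nr_positive)
		return 0;

	/* Reap the oldest 10% of the negative dentries. */
	return s->nr_negative / 10;
}
```

When both conditions hold, negative dentries make up more than half of a cache that is larger than 10% of memory, which is where the "around 5%" figure comes from, assuming positive and negative dentries cost about the same amount of memory each.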
-Dave.
--
Dave Chinner
david@fromorbit.com
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-11 9:40 bad things when too many negative dentries in a directory Miklos Szeredi
2025-04-11 14:47 ` Christian Brauner
2025-04-11 21:02 ` Mateusz Guzik
@ 2025-04-20 4:49 ` Al Viro
2025-05-08 15:45 ` Miklos Szeredi
2 siblings, 1 reply; 23+ messages in thread
From: Al Viro @ 2025-04-20 4:49 UTC (permalink / raw)
To: Miklos Szeredi
Cc: linux-fsdevel, Christian Brauner, Amir Goldstein, Jan Kara,
Ian Kent
On Fri, Apr 11, 2025 at 11:40:28AM +0200, Miklos Szeredi wrote:
> Except for shrink_dcache_parent() I don't see any uses. And it's also
> a question whether shrinking negative dentries is useful or not.
One-word answer: umount.
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-16 15:26 ` James Bottomley
@ 2025-04-22 6:57 ` Andreas Dilger
0 siblings, 0 replies; 23+ messages in thread
From: Andreas Dilger @ 2025-04-22 6:57 UTC (permalink / raw)
To: James Bottomley
Cc: Matthew Wilcox, Miklos Szeredi, Christian Brauner, linux-fsdevel,
Al Viro, Amir Goldstein, Jan Kara, Ian Kent
On Apr 16, 2025, at 9:26 AM, James Bottomley <James.Bottomley@HansenPartnership.com> wrote:
>
> On Tue, 2025-04-15 at 11:22 -0600, Andreas Dilger wrote:
> [...]
>> Negative dentries are only useful if there are fewer than the number
>> of entries in that directory.
>
> I agree with this, yes.
>
>> If the negative dentry count exceeds the actual entry count,
>
> Yes, but finding this number is going to be hard. We can't iterate a
> directory to count them in the fast path and a directory i_size is
> extremely filesystem and format dependent.
This depends. Some filesystems will store the actual number of entries
in the directory, or it can be estimated based on the number of blocks
in the directory.
> However, since we only need a rough count, perhaps having the filesystem
> export its average directory entry size and simply dividing by that
> would give a good enough approximation to the number?
I would suggest adding an inode method that can be called on the directory
to request the (estimated) number of entries in a directory. If the fs
has a good idea of this it can return that number, or it can estimate
based on allocated blocks. It does not need to be exact, but provides
an upper bound on the useful number of negative dcache entries to keep.
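As a sketch of what such a method might compute, assuming the filesystem either tracks an exact count or can report allocated bytes and an average entry size. All names here are hypothetical; no such inode operation exists today.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-directory hints a filesystem could provide. */
struct dir_size_hint {
	bool has_exact;			/* fs stores the real entry count */
	uint64_t exact_entries;		/* valid only if has_exact */
	uint64_t allocated_bytes;	/* directory blocks * block size */
	uint32_t avg_entry_bytes;	/* fs-specific average entry size */
};

/*
 * Return the (estimated) number of entries in a directory. The result
 * only needs to be good enough to act as an upper bound on the useful
 * number of negative dentries.
 */
static uint64_t estimate_dir_entries(const struct dir_size_hint *h)
{
	if (h->has_exact)
		return h->exact_entries;
	if (h->avg_entry_bytes == 0)
		return 0;	/* no basis for an estimate */
	return h->allocated_bytes / h->avg_entry_bytes;
}
```

For example, a directory with one 4096-byte block and an assumed 32-byte average entry would be estimated at 128 entries.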
>> it would be more efficient to just cache all of the positive dentries
>> and mark the directory with a "full dentry list" flag that indicates
>> all of the names are already present in dcache and any miss is
>> authoritative. In essence that gives an "infinite" negative lookup
>> cache instead of explicitly storing all of the possible negative
>> entries.
>
> Practically, I think directories with that flag would probably
> automatically retain positive child dentries as an addition to our
> retain_dentry() logic and automatically kill negative ones.
>
> This behaviour, though, would remove them from the shrinkers, so
> probably there would have to be a global count of the number of
> unshrinkable children this gives us and have that factor into the
> superblock shrinkers somehow. Probably add the parent to the lru list
> but make dentry_lru_isolate() always skip until the tipping point for
> shrinking filled directories is reached?
It's true that this flag would (generally) remove the directory and its
immediate children from the dcache shrinkers. However, the point of a
shrinker is to reduce memory usage, and shrinking such a directory, so
that it can no longer guarantee that all positive dentries are cached
(and hence that no negative dentries are needed), would generally
*increase* memory usage in the end.
I could imagine that such directories would eventually be reaped, but
it should be much harder to do so. For example, every negative lookup
in such a directory should refresh it in the LRU since the parent dentry
kept a negative entry from being added to the dcache.
>> For directories like ~/bin, /usr/bin, /usr/lib64, etc. (or any
>> directory) where negative lookups are frequent, it should be possible
>> to determine this threshold automatically. Once the negative dentry
>> count exceeds the size of the directory by some factor (e.g.
>> directory size / 16, or the actual entry count if the filesystem
>> knows this, it doesn't have to be exactly correct) then a readdir
>> could load all of the names to fully populate the dcache and set the
>> "full dentry list" flag on the directory would allow dropping all
>> negative dentries in that directory.
>
> All this supposes we have some per directory count of the negative
> dentries. I think there'd be push back on adding this to struct dentry
> and making it an exact count in the fast path. The next logical place
> to evaluate it would be the shrinkers but then that wouldn't solve
> Matthew's use case where the shrinkers don't get activated. I suppose
> some flag that userspace could add to directories it identifies as hot
> might be the next best thing?
No. Kernel memory management shouldn't be dependent on userspace doing
the right thing, and no userspace would ever be taught to consistently
set such a flag.
Again, the numbers don't have to be exact, but if negative dcache is
2x the number of dir entries (or e.g. 1000 more as a directory gets
larger) then it is time to change to caching only positive entries.
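One possible reading of that rule, as a sketch: use the 2x factor for small directories and the fixed slack of 1000 once the directory is large, i.e. take whichever limit is lower. The function name and the way the two limits are combined are assumptions for illustration, not something that has been agreed on.

```c
#include <stdbool.h>
#include <stdint.h>

#define NEG_DENTRY_SLACK 1000	/* example value from the text */

/*
 * Decide whether a directory should switch from caching negative
 * dentries to caching only positive ones ("full dentry list").
 * est_entries may be a rough estimate; it does not need to be exact.
 */
static bool should_cache_positive_only(uint64_t nr_negative,
				       uint64_t est_entries)
{
	uint64_t by_factor = est_entries * 2;
	uint64_t by_slack = est_entries + NEG_DENTRY_SLACK;
	uint64_t limit = by_factor < by_slack ? by_factor : by_slack;

	return nr_negative > limit;
}
```

With these example values the two limits cross over at 1000 entries: a 10-entry directory switches once it has more than 20 negative dentries, a million-entry directory once it has more than 1,001,000.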
Having the negative dcache be directly linked to the parent would be
fine too. It doesn't make sense to cache negative dentries longer than
the parent, and an upper bound on how many negative entries can exist
on a directory avoids the need to shrink them independently.
If there is lots of memory pressure on the dcache then directories with
inactive negative dentries would eventually be reaped, and even "full
dentry list" directories would eventually come around for shrinking if
they were inactive for a long time.
>> The VFS/VM should avoid dropping directories/dentries from cache in
>> this case, since it is saving more memory (and avoiding filesystem
>> IO) to keep them pinned rather than dropping them from cache. There
>> might need to be a matching "part of full dentry list" flag on the
>> positive dentries to avoid dcache shrinking of those entries (which
>> would invalidate the premise that the parent holds all of the
>> possible entries in that directory), if checking the parent's flag is
>> too expensive.
>
> As I said above, I think simply checking the parent flags in
> retain_dentry should do. Since you don't need it to be exact and the
> parent should have a positive refcount, it should be possible to do a
> READ_ONCE rather than locking.
Cheers, Andreas
^ permalink raw reply [flat|nested] 23+ messages in thread
* Re: bad things when too many negative dentries in a directory
2025-04-20 4:49 ` Al Viro
@ 2025-05-08 15:45 ` Miklos Szeredi
0 siblings, 0 replies; 23+ messages in thread
From: Miklos Szeredi @ 2025-05-08 15:45 UTC (permalink / raw)
To: Al Viro
Cc: linux-fsdevel, Christian Brauner, Amir Goldstein, Jan Kara,
Ian Kent
On Sun, 20 Apr 2025 at 06:49, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> On Fri, Apr 11, 2025 at 11:40:28AM +0200, Miklos Szeredi wrote:
>
> > Except for shrink_dcache_parent() I don't see any uses. And it's also
> > a question whether shrinking negative dentries is useful or not.
>
> One-word answer: umount.
shrink_dcache_sb() should work fine in that situation.
The only thing it can't do is hunt down spurious references to
dentries, but that's a debug thing and not something that is needed in
production.
Am I missing something?
Thanks,
Miklos
^ permalink raw reply [flat|nested] 23+ messages in thread
end of thread, other threads:[~2025-05-08 15:46 UTC | newest]
Thread overview: 23+ messages
2025-04-11 9:40 bad things when too many negative dentries in a directory Miklos Szeredi
2025-04-11 14:47 ` Christian Brauner
2025-04-11 15:40 ` Miklos Szeredi
2025-04-11 16:01 ` Matthew Wilcox
2025-04-14 14:07 ` James Bottomley
2025-04-14 14:30 ` Matthew Wilcox
2025-04-14 15:40 ` James Bottomley
2025-04-14 16:14 ` Matthew Wilcox
2025-04-14 17:58 ` James Bottomley
2025-04-15 17:22 ` Andreas Dilger
2025-04-16 15:18 ` Miklos Szeredi
2025-04-16 15:37 ` Matthew Wilcox
2025-04-16 21:41 ` Dave Chinner
2025-04-16 15:26 ` James Bottomley
2025-04-22 6:57 ` Andreas Dilger
2025-04-14 6:28 ` Ian Kent
2025-04-14 7:17 ` Miklos Szeredi
2025-04-12 1:48 ` Ian Kent
2025-04-12 1:56 ` Ian Kent
2025-04-12 6:31 ` Ian Kent
2025-04-11 21:02 ` Mateusz Guzik
2025-04-20 4:49 ` Al Viro
2025-05-08 15:45 ` Miklos Szeredi