From: Ian Kent <raven@themaw.net>
To: Amir Goldstein <amir73il@gmail.com>
Cc: Jan Kara <jack@suse.cz>, NeilBrown <neil@brown.name>,
Horst Birthelmer <horst@birthelmer.com>,
Miklos Szeredi <miklos@szeredi.hu>,
Jonathan Corbet <corbet@lwn.net>,
Shuah Khan <skhan@linuxfoundation.org>,
Alexander Viro <viro@zeniv.linux.org.uk>,
Christian Brauner <brauner@kernel.org>,
linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
linux-fsdevel@vger.kernel.org,
Horst Birthelmer <hbirthelmer@ddn.com>
Subject: Re: [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper
Date: Thu, 21 May 2026 08:55:06 +0800 [thread overview]
Message-ID: <0c8fc005-e485-494a-8598-07b617c87b62@themaw.net> (raw)
In-Reply-To: <CAOQ4uxg-7Tsb0GWF4LN3iFBaY7uGxR5_7PwBF+GfMWtCdfi4xw@mail.gmail.com>
On 20/5/26 17:43, Amir Goldstein wrote:
> On Wed, May 20, 2026 at 9:16 AM Ian Kent <raven@themaw.net> wrote:
>> On 19/5/26 17:12, Jan Kara wrote:
>>> On Mon 18-05-26 21:39:13, Ian Kent wrote:
>>>> On 18/5/26 16:19, Jan Kara wrote:
>>>>> Hi Ian,
>>>>>
>>>>> On Mon 18-05-26 10:55:43, Ian Kent wrote:
>>>>>> On 18/5/26 07:55, NeilBrown wrote:
>>>>>>> On Fri, 15 May 2026, Horst Birthelmer wrote:
>>>>>>> According to the email you linked, a problem arises when a directory has
>>>>>>> a great many negative children. Code which walks the list of children
>>>>>>> (such as fsnotify) while holding a lock can suffer unpredictable delays
>>>>>>> and result in long lock-hold times. So maybe a limit on negative
>>>>>>> dentries for any parent is what we really want. That would be clumsy to
>>>>>>> implement I imagine.
>>>>>> But the notion of dropping the dentry in ->d_delete() on last dput() is
>>>>>> simple enough but did see regressions (the only other place in the VFS
>>>>>> besides dentry_kill() that the inode is unlinked from the dentry on
>>>>>> dput()). I wonder if the regression was related to the test itself
>>>>>> deliberately recreating deleted files and if that really is normal
>>>>>> behaviour. By itself that should prevent almost all negative dentries
>>>>>> being retained. Although file systems could do this as well (think XFS
>>>>>> inode recycling) it should be reasonable to require it be left to the
>>>>>> VFS.
>>>>>>
>>>>>> But even that's not enough given that, in my case, there would still be
>>>>>> around 4 million dentries in the LRU cache and in fsnotify there are
>>>>>> directory child traversals holding the parent i_lock "spinlock" that are
>>>>>> going to cause problems.
>>>>> Do you mean there are very many positive children of a directory?
>>>> Didn't quantify that.
>>>>
>>>> The symptom is the "Spinlock held for more than ... seconds" occurring in
>>>> the log. So there are certainly a lot of children in the list, but it's
>>>> an assumption the ratio of positive to negative entries is roughly the
>>>> same as the overall ratio in the dcache.
>>> OK, but that's not necessarily true. I have seen these complaints from the
>>> kernel but in all the cases I remember it was due to negative dentries
>>> accumultating in a particular directory. There are certain apps such as
>>> ElasticSearch which really do like creating huge amounts of negative
>>> dentries in one directory - they use hashes as filenames and use directory
>>> lookup instead of a DB table lookup and lookup lots of non-existent keys...
>> Umm ... that's a good point, I hadn't paid much attention to ENOENT result
>>
>> lookups, I'll need to check on the like cycle of those, I think they do get
>>
>> hashed. That has to be the other source of negative dentries that I've
>>
>> neglected ...
>>
> Yes, it has been claimed that some real life workloads create a lot of those.
>
> If we can keep those at the tail of the children list, it will be best
> for the fsnotify
> iteration, which only cares about positive dentries.
>
>>>>>> so why is this traversal even retained in fsnotify?
>>>>> Not sure which traversal you mean but if you set watch on a parent, you
>>>>> have to walk all children to set PARENT_WATCHED flag so that you don't miss
>>>>> events on children...
>>>> Yes, that traversal is what I'm questioning ... again thanks.
>>>>
>>>> I think the function name is still fsnotify_set_children_dentry_flags()
>>>> in recent kernels, the subject of commit 172e422ffea2 I mentioned above.
>>> OK, thanks.
>>>
>>>> When you say miss events are you saying that accessing the parent dentry to
>>>> work out if the child needs to respond to an event is quite expensive in the
>>>> overall event processing context, that might make more sense to me ... or do
>>>> I completely not yet understand the reasoning behind the need for the flag?
>>> Close but not quite. The cost is the overhead of dget_parent() in
>>> fsnotify_parent() which is often a couple of cache cold loads and atomic
>>> instructions to find out we don't need to send any event for the current
>>> write(2) or read(2) call. It gets worse if there are many IOs happening to
>>> dentries in the same directory from multiple CPUs because instead of
>>> cache-cold loads you get a cacheline contention on the parent.
>>>
>>>>>>> But what if we move dentries to the end of the list when they become
>>>>>>> negative, and to the start of the list when they become positive? Then
>>>>>>> code which walks the child list could simply abort on the first
>>>>>>> negative.
>>>>>>>
>>>>>>> I doubt that would be quite as easy as it sounds, but it would at least
>>>>>>> be more focused on the observed symptom rather than some whole-system
>>>>>>> number which only vaguely correlates with the observed symptom.
>>>>>>>
>>>>>>> Maybe a completely different approach: change children-walking code to
>>>>>>> drop and retake the lock (with appropriate validation) periodically.
>>>>>>> What too would address the specific symptom.
>>>>>> Another good question.
>>>>>>
>>>>>> I have assumed that dropping and re-taking the lock cannot be done but
>>>>>> this is a question I would like answered as well. Dropping and re-taking
>>>>>> lock would require, as Miklos pointed out to me off-list, recording the
>>>>>> list position with say a cursor, introducing unwanted complexity when it
>>>>>> would be better to accept the cost of a single extra access to the parent
>>>>>> flags (which I assume is one reason to set the flag in the child).
>>>>> The parent access is actually more expensive than you might think. Based on
>>>>> experience with past fsnotify related performance regression I expect some
>>>>> 20% performance hit for small tmpfs writes if you add unconditional parent
>>>>> access to the write path.
>>>> That sounds like a lot for what should be a memory access of an already in
>>>> memory structure since the parent must be accessed to traverse the list of
>>>> child entries. I clearly don't fully understand the implications of what
>>>> I'm saying but there has been mention of another context ...
>>> Parent dentry is of course in memory but often cache cold - you don't need
>>> the parent to do e.g. write(2) to an already open file. You seem to be
>>> somewhat confused about the child dentry list traversal (or maybe I'm
>>> misunderstanding) - that happens only when placing the notification mark
>>> but definitely not for each IO operation.
>> LOL, confusion is a pretty common state of mind for me!
>>
>>
>> I do get your point though and I am confusing the traversal with other
>>
>> operations. I think this answers the question I've been asking (maybe
>>
>> that wasn't obvious) about the reason for the traversal (ie. the reason
>>
>> to maintain a flag in the child).
>>
>>
>> While I have looked at the code here I haven't absorbed it and I
>>
>> definitely don't understand it, your continued patience is appreciated
>>
>> and will be beneficial when I get time to look at it a bit closer. I
>>
>> do still need to use a notifications mechanism to match up with Miklos's
>>
>> statmount implementation to get the full benefit of that in user space,
>>
>> if I ever get a chance to work on that again.
>>
>>
>> So it sounds like it would be worth while considering a traversal that's
>>
>> based on taking a reference on each dentry rather than a spinlock for
>>
>> the duration. It would be tricky though, for obvious reasons, like
>>
>> children added during the traversal, added overhead of getting the next
>>
>> entry reference, etc.
> Didn't look closely, but it feels like RCU traversal should be
> possible if entries are added to the tail, or to the END_OF_POSITIVE
> location.
>
> When we discussed the "negavites at tail" at LSFMM
> it was said that managing the transitions positive<->negative
> would be challenging, but I don't know that anyone tried to look closer at this.
I guess that should be straight forward as long as it's done at the point of
transition except if it's done by a filesystem instead of the VFS (maybe
require
a helper be used ...). Might be a bit harder for dentries that don't
transition
(ie. ENOENT lookups that start out and stay negative) might escape the
needed
handling.
>
> At least for fsnotify, positive->negative transition is not a problem
> w.r.t skipping entry and observing entry twice during positive iteration.
>
> If negative->positive transitions inserts at END_OF_POSITIVE
> location, then should be fine as well?
>
> Iterators that need to iterate all children can do this under lock.
Only catch there is the number of positive children might be large as well.
>
> Does that make sense?
Yep, the notion of a cursor is a good idea.
Nevertheless the challenge is to identify dentries that should be discarded
rather than kept at final dput() in addition to what we already have in
dout()
but there doesn't seem to be anything sensible to add to those checks.
On a different note another possibility to identify candidates to discard on
traversals might be a ttl, basically an extension of the referenced
flag. The
dentry d_time field could be used for that but only for negative
dentries since,
IIRC, nfs uses that dentry field.
But now I'm not sure I'm making sense as a couple of your comments sound
like
they refer to a discussion I'm not aware of, ;)
Ian
next prev parent reply other threads:[~2026-05-21 0:55 UTC|newest]
Thread overview: 17+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-05-14 15:13 [PATCH] dcache: add fs.dentry-limit sysctl with negative-first reaper Horst Birthelmer
2026-05-15 15:09 ` kernel test robot
2026-05-16 6:55 ` Horst Birthelmer
2026-05-16 10:33 ` Stafford Horne
2026-05-16 14:15 ` Horst Birthelmer
2026-05-15 15:09 ` kernel test robot
2026-05-17 23:55 ` NeilBrown
2026-05-18 2:55 ` Ian Kent
2026-05-18 8:19 ` Jan Kara
2026-05-18 13:39 ` Ian Kent
2026-05-19 9:12 ` Jan Kara
2026-05-20 7:16 ` Ian Kent
2026-05-20 9:43 ` Amir Goldstein
2026-05-21 0:55 ` Ian Kent [this message]
2026-05-22 4:16 ` NeilBrown
2026-05-22 8:27 ` Amir Goldstein
2026-05-18 7:01 ` Horst Birthelmer
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=0c8fc005-e485-494a-8598-07b617c87b62@themaw.net \
--to=raven@themaw.net \
--cc=amir73il@gmail.com \
--cc=brauner@kernel.org \
--cc=corbet@lwn.net \
--cc=hbirthelmer@ddn.com \
--cc=horst@birthelmer.com \
--cc=jack@suse.cz \
--cc=linux-doc@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=miklos@szeredi.hu \
--cc=neil@brown.name \
--cc=skhan@linuxfoundation.org \
--cc=viro@zeniv.linux.org.uk \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox