From: Steven Rostedt <rostedt@goodmis.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>,
Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
LKML <linux-kernel@vger.kernel.org>,
Linux Trace Devel <linux-trace-devel@vger.kernel.org>,
Christian Brauner <brauner@kernel.org>,
Ajay Kaher <ajay.kaher@broadcom.com>,
Geert Uytterhoeven <geert@linux-m68k.org>,
linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH] eventfs: Have inodes have unique inode numbers
Date: Sun, 28 Jan 2024 16:11:07 -0500 [thread overview]
Message-ID: <20240128161107.01af6dd2@rorschach.local.home> (raw)
In-Reply-To: <CAHk-=whKJ6dzQJX27gvL4Xug5bFRKW7_Cx4XpngMKmWxOtb+Qg@mail.gmail.com>
On Sun, 28 Jan 2024 12:53:31 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:
> On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I have to understand how the dentry lookup works. Basically, when the
> > ei gets deleted, it can't be freed until all dentries it references
> > (including its children) are no longer being accessed. Does that lookup
> > get called only when a dentry with the name doesn't already exist?
>
> Dentry lookup gets called with the parent inode locked for reading, so
> a lookup can happen in parallel with readdir and other dentry lookup.
>
> BUT.
>
> Each dentry is also "self-serializing", so you will never see a lookup
> on the same name (in the same directory) concurrently.
The above is what I wanted to know.
>
> The implementation is very much internal to the VFS layer, and it's
> all kinds of nasty, with a new potential lookup waiting for the old
> one, verifying that the old one is still usable, and maybe repeating
> it all until we find a successful previous lookup or we're the only
> dentry remaining.
>
> It's nasty code that is very much in the "Al Viro" camp, but the point
> is that any normal filesystem should treat lookups as being concurrent
> with non-creation events, but not concurrently multiples.
>
> There *is* some common code with "atomic_open()", where filesystems
> that implement that then want to know if it's the *first* lookup, or a
> use of a previously looked up dentry, and they'll use the
> "d_in_lookup()" thing to determine that. So this whole "keep track of
> which dentries are *currently* being looked up is actually exposed,
> but any normal filesystem should never care.
>
> But if you think you have that issue (tracefs does not), you really
> want to talk to Al extensively.
>
> > That is, can I assume that eventfs_root_lookup() is only called when
> > the VFS file system could not find an existing dentry and it has to
> > create one?
>
> Correct. For any _particular_ name, you should think of lookup as serialized.
>
> > If that's the case, then I can remove the ei->dentry and just add a ref
> > counter that it was accessed. Then the final dput() should call
> > eventfs_set_ei_status_free() (I hate that name and need to change it),
> > and if the ei->is_freed is set, it can free the ei.
>
> Note that the final 'dput()' will happen *after* the dentry has been
> removed, so what can happen is
>
> lookup("name", d1);
> ... lookup successful, dentry is used ..
> ... dentry at some point has no more users ..
> ... memory pressure prunes unused dentries ..
> ... dentry gets unhashed and is no longer visible ..
> lookup("name", d2);
> ... new dentry is created ..
> final dput(d1);
> .. old dentry - that wasn't accessible any more is freed ..
Actually I was mistaken. I'm looking at the final iput() not dput().
>
> and this is actually one of the *reasons* that virtual filesystems
> must not try to cache dentry pointers in their internal data
> structures. Because note how the fuilesystem saw the new lookup(d2) of
> the same name *before* it saw the >d_release(d1) of the old dentry.
>
> And the above is fundamental: we obviously cannot call '->d_release()'
> until the old dentry is all dead and buried (lockref_mark_dead() etc),
> so pretty much by definition you'll have that ordering being possible.
>
> It's extremely unlikely, of course. I'll be you'll never hit it in testing.
>
> So if if you associate some internal data structure with a dentry,
> just *what* happens when you haven't been told abotu the old dentry
> being dead when the new one happens?
>
> See why I say that it's fundamentally wrong for a filesystem to try to
> track dentries? All the operations that can use a dentry will get one
> passed down to them by the VFS layer. The filesystem has no business
> trying to remember some dentry from a previous operation, and the
> filesystem *will* get it wrong.
>
> But also note how refcounting works fine. In fact, refcounting is
> pretty much the *only* thing that works fine. So what you *should* do
> is
>
> - at lookup(), when you save your filesystem data in "->d_fsdata",
> you increment a refcount
>
> - at ->d_release(), you decrement a refcount
>
> and now you're fine. Yes, when the above (very very unlikely)
> situation happens, you'll temporarily have a refcount incremented
> twice, but that's kind of the *point* of refcounts.
>
> Side note: this is pretty much true of any kernel data structure. If
> you have a kernel data structure that isn't just used within one
> thread, it must be refcounted. But it'as *doubly* true when you save
> references to something that the VFS maintains, because you DO NOT
> CONTROL the lifetime of that entity.
>
> > The eventfs_remove_dir() can free the ei (after SRCU) if it has no
> > references, otherwise it needs to wait for the final dput() to do the
> > free.
>
> Honestly, you should just *always* do refcounting. No "free after RCU
> delay" as an alternative. Just refcount it.
>
> Now, the RCU delay may be needed if the lookup of said structure
> happens under RCU, but no, saying "I use SRCU to make sure the
> lifetime is at least X" is just broken.
>
> The refcount is what gives the lifetime. Any form of RCU-delaying
> should then be purely about non-refcounting RCU lookups that may
> happen as the thing is dying (and said lookup should *look* at the
> refcount and say "oh, this is dead, I'm not returning this".
>
> > I think the ei->dentry was required for the dir wrapper logic that we
> > removed.
>
> I think all of this was due to the bogus readdir that created dentries
> willy-nilly and without the required serialization.
>
> And that was all horribly broken. It wasn't even the above kind of
> "really subtle race that you'll never hit in practice" broken. It was
> just absolutely broken with readdir and lookup racing on the same name
> and creating an unholy dentry mess.
Hmm, if I understand the above, I could get rid of keeping around
dentry and even remove the eventfs_set_ei_status_free().
I can try something to see if it works.
-- Steve
next prev parent reply other threads:[~2024-01-28 21:11 UTC|newest]
Thread overview: 52+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-01-26 20:02 [PATCH] eventfs: Have inodes have unique inode numbers Steven Rostedt
2024-01-26 20:25 ` Linus Torvalds
2024-01-26 21:26 ` Steven Rostedt
2024-01-26 21:31 ` Linus Torvalds
2024-01-26 21:43 ` Steven Rostedt
2024-01-26 21:36 ` Linus Torvalds
2024-01-26 21:42 ` Steven Rostedt
2024-01-26 21:49 ` Linus Torvalds
2024-01-26 22:08 ` Steven Rostedt
2024-01-26 22:26 ` Linus Torvalds
2024-01-27 14:47 ` Steven Rostedt
2024-01-28 14:42 ` Steven Rostedt
2024-01-26 22:14 ` Mathieu Desnoyers
2024-01-26 22:29 ` Linus Torvalds
2024-01-26 22:41 ` Mathieu Desnoyers
2024-01-26 22:49 ` Linus Torvalds
2024-01-29 16:00 ` Mathieu Desnoyers
2024-01-29 18:58 ` Linus Torvalds
2024-01-26 22:34 ` Matthew Wilcox
2024-01-26 22:40 ` Mathieu Desnoyers
2024-01-26 22:48 ` Linus Torvalds
2024-01-26 23:04 ` Matthew Wilcox
2024-01-26 23:11 ` Linus Torvalds
2024-01-26 23:17 ` Linus Torvalds
2024-01-27 9:36 ` Andreas Schwab
2024-01-27 21:47 ` Linus Torvalds
2024-01-28 20:15 ` Steven Rostedt
2024-01-28 20:53 ` Linus Torvalds
2024-01-28 21:08 ` Linus Torvalds
2024-01-28 22:01 ` Steven Rostedt
2024-01-28 22:17 ` Linus Torvalds
2024-01-28 22:26 ` Steven Rostedt
2024-01-28 21:11 ` Steven Rostedt [this message]
2024-01-28 21:19 ` Steven Rostedt
2024-01-28 21:43 ` Linus Torvalds
2024-01-28 22:07 ` Linus Torvalds
2024-01-28 22:17 ` Steven Rostedt
2024-01-28 22:25 ` Linus Torvalds
2024-01-28 22:51 ` Steven Rostedt
2024-01-28 23:24 ` Linus Torvalds
2024-01-28 23:59 ` Steven Rostedt
2024-01-29 0:21 ` Steven Rostedt
2024-01-29 1:00 ` Linus Torvalds
2024-01-29 1:42 ` Linus Torvalds
2024-01-29 2:32 ` Steven Rostedt
2024-01-29 3:40 ` Steven Rostedt
2024-01-29 4:01 ` Linus Torvalds
2024-01-29 2:09 ` Steven Rostedt
2024-01-29 6:44 ` Amir Goldstein
2024-01-29 9:32 ` Steven Rostedt
2024-01-27 15:26 ` David Laight
2024-01-27 20:01 ` Linus Torvalds
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20240128161107.01af6dd2@rorschach.local.home \
--to=rostedt@goodmis.org \
--cc=ajay.kaher@broadcom.com \
--cc=brauner@kernel.org \
--cc=geert@linux-m68k.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-trace-devel@vger.kernel.org \
--cc=mathieu.desnoyers@efficios.com \
--cc=mhiramat@kernel.org \
--cc=torvalds@linux-foundation.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox