Re: [PATCH] eventfs: Have inodes have unique inode numbers

public inbox for linux-kernel@vger.kernel.org
 help / color / mirror / Atom feed

From: Steven Rostedt <rostedt@goodmis.org>
To: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>,
	Mathieu Desnoyers <mathieu.desnoyers@efficios.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Linux Trace Devel <linux-trace-devel@vger.kernel.org>,
	Christian Brauner <brauner@kernel.org>,
	Ajay Kaher <ajay.kaher@broadcom.com>,
	Geert Uytterhoeven <geert@linux-m68k.org>,
	linux-fsdevel <linux-fsdevel@vger.kernel.org>
Subject: Re: [PATCH] eventfs: Have inodes have unique inode numbers
Date: Sun, 28 Jan 2024 16:11:07 -0500	[thread overview]
Message-ID: <20240128161107.01af6dd2@rorschach.local.home> (raw)
In-Reply-To: <CAHk-=whKJ6dzQJX27gvL4Xug5bFRKW7_Cx4XpngMKmWxOtb+Qg@mail.gmail.com>

On Sun, 28 Jan 2024 12:53:31 -0800
Linus Torvalds <torvalds@linux-foundation.org> wrote:

> On Sun, 28 Jan 2024 at 12:15, Steven Rostedt <rostedt@goodmis.org> wrote:
> >
> > I have to understand how the dentry lookup works. Basically, when the
> > ei gets deleted, it can't be freed until all dentries it references
> > (including its children) are no longer being accessed. Does that lookup
> > get called only when a dentry with the name doesn't already exist?  
> 
> Dentry lookup gets called with the parent inode locked for reading, so
> a lookup can happen in parallel with readdir and other dentry lookup.
> 
> BUT.
> 
> Each dentry is also "self-serializing", so you will never see a lookup
> on the same name (in the same directory) concurrently.

The above is what I wanted to know.

> 
> The implementation is very much internal to the VFS layer, and it's
> all kinds of nasty, with a new potential lookup waiting for the old
> one, verifying that the old one is still usable, and maybe repeating
> it all until we find a successful previous lookup or we're the only
> dentry remaining.
> 
> It's nasty code that is very much in the "Al Viro" camp, but the point
> is that any normal filesystem should treat lookups as being concurrent
> with non-creation events, but not concurrently multiples.
> 
> There *is* some common code with "atomic_open()", where filesystems
> that implement that then want to know if it's the *first* lookup, or a
> use of a previously looked up dentry, and they'll use the
> "d_in_lookup()" thing to determine that. So this whole "keep track of
> which dentries are *currently* being looked up is actually exposed,
> but any normal filesystem should never care.
> 
> But if you think you have that issue (tracefs does not), you really
> want to talk to Al extensively.
> 
> > That is, can I assume that eventfs_root_lookup() is only called when
> > the VFS file system could not find an existing dentry and it has to
> > create one?  
> 
> Correct. For any _particular_ name, you should think of lookup as serialized.
> 
> > If that's the case, then I can remove the ei->dentry and just add a ref
> > counter that it was accessed. Then the final dput() should call
> > eventfs_set_ei_status_free() (I hate that name and need to change it),
> > and if the ei->is_freed is set, it can free the ei.  
> 
> Note that the final 'dput()' will happen *after* the dentry has been
> removed, so what can happen is
> 
>    lookup("name", d1);
>    ... lookup successful, dentry is used ..
>    ... dentry at some point has no more users ..
>    ... memory pressure prunes unused dentries ..
>    ... dentry gets unhashed and is no longer visible ..
>    lookup("name", d2);
>    ... new dentry is created ..
>    final dput(d1);
>    .. old dentry - that wasn't accessible any more is freed ..

Actually I was mistaken. I'm looking at the final iput() not dput().

> 
> and this is actually one of the *reasons* that virtual filesystems
> must not try to cache dentry pointers in their internal data
> structures. Because note how the fuilesystem saw the new lookup(d2) of
> the same name *before* it saw the >d_release(d1) of the old dentry.
> 
> And the above is fundamental: we obviously cannot call '->d_release()'
> until the old dentry is all dead and buried (lockref_mark_dead() etc),
> so pretty much by definition you'll have that ordering being possible.
> 
> It's extremely unlikely, of course. I'll be you'll never hit it in testing.
> 
> So if if you associate some internal data structure with a dentry,
> just *what* happens when you haven't been told abotu the old dentry
> being dead when the new one happens?
> 
> See why I say that it's fundamentally wrong for a filesystem to try to
> track dentries? All the operations that can use a dentry will get one
> passed down to them by the VFS layer. The filesystem has no business
> trying to remember some dentry from a previous operation, and the
> filesystem *will* get it wrong.
> 
> But also note how refcounting works fine. In fact, refcounting is
> pretty much the *only* thing that works fine. So what you *should* do
> is
> 
>  - at lookup(), when you save your filesystem data in "->d_fsdata",
> you increment a refcount
> 
>  - at ->d_release(), you decrement a refcount
> 
> and now you're fine. Yes, when the above (very very unlikely)
> situation happens, you'll temporarily have a refcount incremented
> twice, but that's kind of the *point* of refcounts.
> 
> Side note: this is pretty much true of any kernel data structure. If
> you have a kernel data structure that isn't just used within one
> thread, it must be refcounted. But it'as *doubly* true when you save
> references to something that the VFS maintains, because you DO NOT
> CONTROL the lifetime of that entity.
> 
> > The eventfs_remove_dir() can free the ei (after SRCU) if it has no
> > references, otherwise it needs to wait for the final dput() to do the
> > free.  
> 
> Honestly, you should just *always* do refcounting. No "free after RCU
> delay" as an alternative. Just refcount it.
> 
> Now, the RCU delay may be needed if the lookup of said structure
> happens under RCU, but no, saying "I use SRCU to make sure the
> lifetime is at least X" is just broken.
> 
> The refcount is what gives the lifetime. Any form of RCU-delaying
> should then be purely about non-refcounting RCU lookups that may
> happen as the thing is dying (and said lookup should *look* at the
> refcount and say "oh, this is dead, I'm not returning this".
> 
> > I think the ei->dentry was required for the dir wrapper logic that we
> > removed.  
> 
> I think all of this was due to the bogus readdir that created dentries
> willy-nilly and without the required serialization.
> 
> And that was all horribly broken.  It wasn't even the above kind of
> "really subtle race that you'll never hit in practice" broken. It was
> just absolutely broken with readdir and lookup racing on the same name
> and creating an unholy dentry mess.

Hmm, if I understand the above, I could get rid of keeping around
dentry and even remove the eventfs_set_ei_status_free().

I can try something to see if it works.

-- Steve

next prev parent reply	other threads:[~2024-01-28 21:11 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-01-26 20:02 [PATCH] eventfs: Have inodes have unique inode numbers Steven Rostedt
2024-01-26 20:25 ` Linus Torvalds
2024-01-26 21:26   ` Steven Rostedt
2024-01-26 21:31     ` Linus Torvalds
2024-01-26 21:43       ` Steven Rostedt
2024-01-26 21:36     ` Linus Torvalds
2024-01-26 21:42       ` Steven Rostedt
2024-01-26 21:49       ` Linus Torvalds
2024-01-26 22:08         ` Steven Rostedt
2024-01-26 22:26           ` Linus Torvalds
2024-01-27 14:47             ` Steven Rostedt
2024-01-28 14:42               ` Steven Rostedt
2024-01-26 22:14         ` Mathieu Desnoyers
2024-01-26 22:29           ` Linus Torvalds
2024-01-26 22:41             ` Mathieu Desnoyers
2024-01-26 22:49               ` Linus Torvalds
2024-01-29 16:00                 ` Mathieu Desnoyers
2024-01-29 18:58                   ` Linus Torvalds
2024-01-26 22:34           ` Matthew Wilcox
2024-01-26 22:40             ` Mathieu Desnoyers
2024-01-26 22:48             ` Linus Torvalds
2024-01-26 23:04               ` Matthew Wilcox
2024-01-26 23:11                 ` Linus Torvalds
2024-01-26 23:17                   ` Linus Torvalds
2024-01-27  9:36                     ` Andreas Schwab
2024-01-27 21:47         ` Linus Torvalds
2024-01-28 20:15           ` Steven Rostedt
2024-01-28 20:53             ` Linus Torvalds
2024-01-28 21:08               ` Linus Torvalds
2024-01-28 22:01                 ` Steven Rostedt
2024-01-28 22:17                   ` Linus Torvalds
2024-01-28 22:26                     ` Steven Rostedt
2024-01-28 21:11               ` Steven Rostedt [this message]
2024-01-28 21:19               ` Steven Rostedt
2024-01-28 21:43                 ` Linus Torvalds
2024-01-28 22:07                   ` Linus Torvalds
2024-01-28 22:17                     ` Steven Rostedt
2024-01-28 22:25                       ` Linus Torvalds
2024-01-28 22:51           ` Steven Rostedt
2024-01-28 23:24             ` Linus Torvalds
2024-01-28 23:59               ` Steven Rostedt
2024-01-29  0:21                 ` Steven Rostedt
2024-01-29  1:00                   ` Linus Torvalds
2024-01-29  1:42                     ` Linus Torvalds
2024-01-29  2:32                       ` Steven Rostedt
2024-01-29  3:40                         ` Steven Rostedt
2024-01-29  4:01                           ` Linus Torvalds
2024-01-29  2:09                     ` Steven Rostedt
2024-01-29  6:44                       ` Amir Goldstein
2024-01-29  9:32                         ` Steven Rostedt
2024-01-27 15:26       ` David Laight
2024-01-27 20:01         ` Linus Torvalds

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240128161107.01af6dd2@rorschach.local.home \
    --to=rostedt@goodmis.org \
    --cc=ajay.kaher@broadcom.com \
    --cc=brauner@kernel.org \
    --cc=geert@linux-m68k.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-trace-devel@vger.kernel.org \
    --cc=mathieu.desnoyers@efficios.com \
    --cc=mhiramat@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox