* [RFC PATCH v3 1/4] for_each_alias(): helper macro for iterating through dentries of given inode
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 2/4] struct dentry: make ->d_u anonymous Al Viro
` (3 subsequent siblings)
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
Most of the places using d_alias are loops iterating through all aliases of
a given inode; introduce a helper macro (for_each_alias(dentry, inode))
and convert open-coded instances of such loops to it.
They are easier to read that way, and it reduces the noise in the next steps.
You _must_ hold inode->i_lock over that loop.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
Documentation/filesystems/porting.rst | 10 ++++++++++
fs/affs/amigaffs.c | 2 +-
fs/ceph/mds_client.c | 2 +-
fs/dcache.c | 6 +++---
fs/exportfs/expfs.c | 2 +-
fs/nfs/dir.c | 2 +-
fs/notify/fsnotify.c | 2 +-
fs/ocfs2/dcache.c | 2 +-
fs/overlayfs/dir.c | 2 +-
fs/smb/client/inode.c | 2 +-
include/linux/dcache.h | 4 ++++
11 files changed, 25 insertions(+), 11 deletions(-)
diff --git a/Documentation/filesystems/porting.rst b/Documentation/filesystems/porting.rst
index 52ff1d19405b..9a9babd9ec48 100644
--- a/Documentation/filesystems/porting.rst
+++ b/Documentation/filesystems/porting.rst
@@ -1361,3 +1361,13 @@ to match what strlen() would return if it was ran on the string.
However, if the string is freely accessible for the duration of inode's
lifetime, consider using inode_set_cached_link() instead.
+
+---
+
+**recommended**
+
+If you really need to iterate through dentries for given inode, use
+for_each_alias(dentry, inode) instead of hlist_for_each_entry; better
+yet, see if any of the exported primitives could be used instead of
+the entire loop. You still need to hold ->i_lock of the inode over
+either form of manual loop.
diff --git a/fs/affs/amigaffs.c b/fs/affs/amigaffs.c
index fd669daa4e7b..91966b1f41f6 100644
--- a/fs/affs/amigaffs.c
+++ b/fs/affs/amigaffs.c
@@ -126,7 +126,7 @@ affs_fix_dcache(struct inode *inode, u32 entry_ino)
{
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
if (entry_ino == (u32)(long)dentry->d_fsdata) {
dentry->d_fsdata = (void *)inode->i_ino;
break;
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index b1746273f186..f839109fb66f 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4614,7 +4614,7 @@ static struct dentry* d_find_primary(struct inode *inode)
goto out_unlock;
}
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
spin_lock(&alias->d_lock);
if (!d_unhashed(alias) &&
(ceph_dentry(alias)->flags & CEPH_DENTRY_PRIMARY_LINK)) {
diff --git a/fs/dcache.c b/fs/dcache.c
index 7ba1801d8132..e069b6ec4ec0 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -790,7 +790,7 @@ void d_mark_dontcache(struct inode *inode)
struct dentry *de;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(de, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(de, inode) {
spin_lock(&de->d_lock);
de->d_flags |= DCACHE_DONTCACHE;
spin_unlock(&de->d_lock);
@@ -1040,7 +1040,7 @@ static struct dentry *__d_find_alias(struct inode *inode)
if (S_ISDIR(inode->i_mode))
return __d_find_any_alias(inode);
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
spin_lock(&alias->d_lock);
if (!d_unhashed(alias)) {
dget_dlock(alias);
@@ -1133,7 +1133,7 @@ void d_prune_aliases(struct inode *inode)
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias)
+ for_each_alias(dentry, inode)
d_dispose_if_unused(dentry, &dispose);
spin_unlock(&inode->i_lock);
shrink_dentry_list(&dispose);
diff --git a/fs/exportfs/expfs.c b/fs/exportfs/expfs.c
index 6c9be60a3e48..f67b3ce672fc 100644
--- a/fs/exportfs/expfs.c
+++ b/fs/exportfs/expfs.c
@@ -52,7 +52,7 @@ find_acceptable_alias(struct dentry *result,
inode = result->d_inode;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
dget(dentry);
spin_unlock(&inode->i_lock);
if (toput)
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 2402f57c8e7d..5a0bd8113e3a 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1471,7 +1471,7 @@ static void nfs_clear_verifier_file(struct inode *inode)
struct dentry *alias;
struct inode *dir;
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
spin_lock(&alias->d_lock);
dir = d_inode_rcu(alias->d_parent);
if (!dir ||
diff --git a/fs/notify/fsnotify.c b/fs/notify/fsnotify.c
index 9995de1710e5..b7198c4744e3 100644
--- a/fs/notify/fsnotify.c
+++ b/fs/notify/fsnotify.c
@@ -76,7 +76,7 @@ void fsnotify_set_children_dentry_flags(struct inode *inode)
spin_lock(&inode->i_lock);
/* run all of the dentries associated with this inode. Since this is a
* directory, there damn well better only be one item on this list */
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
struct dentry *child;
/* run all of the children of the original inode and fix their
diff --git a/fs/ocfs2/dcache.c b/fs/ocfs2/dcache.c
index c4ba968e778b..e06774fd89d8 100644
--- a/fs/ocfs2/dcache.c
+++ b/fs/ocfs2/dcache.c
@@ -145,7 +145,7 @@ struct dentry *ocfs2_find_local_alias(struct inode *inode,
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
spin_lock(&dentry->d_lock);
if (ocfs2_match_dentry(dentry, parent_blkno, skip_unhashed)) {
trace_ocfs2_find_local_alias(dentry->d_name.len,
diff --git a/fs/overlayfs/dir.c b/fs/overlayfs/dir.c
index ff3dbd1ca61f..f8dfa534b566 100644
--- a/fs/overlayfs/dir.c
+++ b/fs/overlayfs/dir.c
@@ -904,7 +904,7 @@ static void ovl_drop_nlink(struct dentry *dentry)
/* Try to find another, hashed alias */
spin_lock(&inode->i_lock);
- hlist_for_each_entry(alias, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(alias, inode) {
if (alias != dentry && !d_unhashed(alias))
break;
}
diff --git a/fs/smb/client/inode.c b/fs/smb/client/inode.c
index 888f9e35f14b..e2b4ad9bd0bd 100644
--- a/fs/smb/client/inode.c
+++ b/fs/smb/client/inode.c
@@ -1595,7 +1595,7 @@ inode_has_hashed_dentries(struct inode *inode)
struct dentry *dentry;
spin_lock(&inode->i_lock);
- hlist_for_each_entry(dentry, &inode->i_dentry, d_u.d_alias) {
+ for_each_alias(dentry, inode) {
if (!d_unhashed(dentry) || IS_ROOT(dentry)) {
spin_unlock(&inode->i_lock);
return true;
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 898c60d21c92..7f1dbc7121d7 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -615,4 +615,8 @@ void set_default_d_op(struct super_block *, const struct dentry_operations *);
struct dentry *d_make_persistent(struct dentry *, struct inode *);
void d_make_discardable(struct dentry *dentry);
+/* inode->i_lock must be held over that */
+#define for_each_alias(dentry, inode) \
+ hlist_for_each_entry(dentry, &(inode)->i_dentry, d_u.d_alias)
+
#endif /* __LINUX_DCACHE_H */
--
2.47.3
* [RFC PATCH v3 2/4] struct dentry: make ->d_u anonymous
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 1/4] for_each_alias(): helper macro for iterating through dentries of given inode Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks Al Viro
` (2 subsequent siblings)
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
Making ->d_rcu and (then) ->d_child overlapping dates back to
2006; support for anonymous unions was only added to gcc in 4.6
(2011), and the minimal gcc version wasn't bumped that far until
kernel 4.19 (2018).
These days there's no reason to keep that union named.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
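What the conversion relies on can be demonstrated with a self-contained userspace sketch (the struct names and `long` members below are stand-ins, not the real dentry fields): a named and an anonymous union produce the same layout, and dropping the name only shortens the access path.

```c
#include <stddef.h>

struct named_u {
	int d_flags;
	union {
		long d_alias;	/* stand-ins for the overlapping members */
		long d_rcu;
	} d_u;			/* accessed as p->d_u.d_alias */
};

struct anon_u {
	int d_flags;
	union {			/* anonymous: members promoted to struct scope */
		long d_alias;	/* accessed as p->d_alias */
		long d_rcu;
	};
};

/* same size, and the overlapping members sit at the same offset */
static int layout_matches(void)
{
	return sizeof(struct named_u) == sizeof(struct anon_u) &&
	       offsetof(struct named_u, d_u.d_alias) ==
	       offsetof(struct anon_u, d_alias);
}

/* union members share storage with or without the union's name */
static int members_overlap(void)
{
	struct anon_u d;

	d.d_rcu = 42;
	return d.d_alias == 42;
}
```

Anonymous unions are standard C11, so every compiler the kernel supports today accepts the unnamed form.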
---
fs/ceph/mds_client.c | 2 +-
fs/dcache.c | 48 +++++++++++++++++++++---------------------
fs/inode.c | 2 +-
fs/nfs/dir.c | 2 +-
fs/nfs/getroot.c | 2 +-
include/linux/dcache.h | 4 ++--
6 files changed, 30 insertions(+), 30 deletions(-)
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f839109fb66f..a5eb99c3c36b 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -4608,7 +4608,7 @@ static struct dentry* d_find_primary(struct inode *inode)
goto out_unlock;
if (S_ISDIR(inode->i_mode)) {
- alias = hlist_entry(inode->i_dentry.first, struct dentry, d_u.d_alias);
+ alias = hlist_entry(inode->i_dentry.first, struct dentry, d_alias);
if (!IS_ROOT(alias))
dn = dget(alias);
goto out_unlock;
diff --git a/fs/dcache.c b/fs/dcache.c
index e069b6ec4ec0..4378eb8a00bb 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -40,7 +40,7 @@
/*
* Usage:
* dcache->d_inode->i_lock protects:
- * - i_dentry, d_u.d_alias, d_inode of aliases
+ * - i_dentry, d_alias, d_inode of aliases
* dcache_hash_bucket lock protects:
* - the dcache hash table
* s_roots bl list spinlock protects:
@@ -55,7 +55,7 @@
* - d_unhashed()
* - d_parent and d_chilren
* - childrens' d_sib and d_parent
- * - d_u.d_alias, d_inode
+ * - d_alias, d_inode
*
* Ordering:
* dentry->d_inode->i_lock
@@ -341,14 +341,14 @@ static inline struct external_name *external_name(struct dentry *dentry)
static void __d_free(struct rcu_head *head)
{
- struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
+ struct dentry *dentry = container_of(head, struct dentry, d_rcu);
kmem_cache_free(dentry_cache, dentry);
}
static void __d_free_external(struct rcu_head *head)
{
- struct dentry *dentry = container_of(head, struct dentry, d_u.d_rcu);
+ struct dentry *dentry = container_of(head, struct dentry, d_rcu);
kfree(external_name(dentry));
kmem_cache_free(dentry_cache, dentry);
}
@@ -428,19 +428,19 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
static void dentry_free(struct dentry *dentry)
{
- WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ WARN_ON(!hlist_unhashed(&dentry->d_alias));
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
if (likely(atomic_dec_and_test(&p->count))) {
- call_rcu(&dentry->d_u.d_rcu, __d_free_external);
+ call_rcu(&dentry->d_rcu, __d_free_external);
return;
}
}
/* if dentry was never visible to RCU, immediate free is OK */
if (dentry->d_flags & DCACHE_NORCU)
- __d_free(&dentry->d_u.d_rcu);
+ __d_free(&dentry->d_rcu);
else
- call_rcu(&dentry->d_u.d_rcu, __d_free);
+ call_rcu(&dentry->d_rcu, __d_free);
}
/*
@@ -455,7 +455,7 @@ static void dentry_unlink_inode(struct dentry * dentry)
raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
- hlist_del_init(&dentry->d_u.d_alias);
+ hlist_del_init(&dentry->d_alias);
raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
@@ -1010,7 +1010,7 @@ static struct dentry * __d_find_any_alias(struct inode *inode)
if (hlist_empty(&inode->i_dentry))
return NULL;
- alias = hlist_entry(inode->i_dentry.first, struct dentry, d_u.d_alias);
+ alias = hlist_entry(inode->i_dentry.first, struct dentry, d_alias);
lockref_get(&alias->d_lockref);
return alias;
}
@@ -1093,9 +1093,9 @@ struct dentry *d_find_alias_rcu(struct inode *inode)
// used without having I_FREEING set, which means no aliases left
if (likely(!(inode_state_read(inode) & I_FREEING) && !hlist_empty(l))) {
if (S_ISDIR(inode->i_mode)) {
- de = hlist_entry(l->first, struct dentry, d_u.d_alias);
+ de = hlist_entry(l->first, struct dentry, d_alias);
} else {
- hlist_for_each_entry(de, l, d_u.d_alias)
+ hlist_for_each_entry(de, l, d_alias)
if (!d_unhashed(de))
break;
}
@@ -1787,7 +1787,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
INIT_HLIST_BL_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_HLIST_HEAD(&dentry->d_children);
- INIT_HLIST_NODE(&dentry->d_u.d_alias);
+ INIT_HLIST_NODE(&dentry->d_alias);
INIT_HLIST_NODE(&dentry->d_sib);
if (dentry->d_op && dentry->d_op->d_init) {
@@ -1980,7 +1980,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
if ((dentry->d_flags &
(DCACHE_LRU_LIST|DCACHE_SHRINK_LIST)) == DCACHE_LRU_LIST)
this_cpu_dec(nr_dentry_negative);
- hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+ hlist_add_head(&dentry->d_alias, &inode->i_dentry);
raw_write_seqcount_begin(&dentry->d_seq);
__d_set_inode_and_type(dentry, inode, add_flags);
raw_write_seqcount_end(&dentry->d_seq);
@@ -2004,7 +2004,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
void d_instantiate(struct dentry *entry, struct inode * inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
+ BUG_ON(!hlist_unhashed(&entry->d_alias));
if (inode) {
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
@@ -2024,7 +2024,7 @@ EXPORT_SYMBOL(d_instantiate);
*/
void d_instantiate_new(struct dentry *entry, struct inode *inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_u.d_alias));
+ BUG_ON(!hlist_unhashed(&entry->d_alias));
BUG_ON(!inode);
lockdep_annotate_inode_mutex_key(inode);
security_d_instantiate(entry, inode);
@@ -2087,7 +2087,7 @@ static struct dentry *__d_obtain_alias(struct inode *inode, bool disconnected)
spin_lock(&new->d_lock);
__d_set_inode_and_type(new, inode, add_flags);
- hlist_add_head(&new->d_u.d_alias, &inode->i_dentry);
+ hlist_add_head(&new->d_alias, &inode->i_dentry);
if (!disconnected) {
hlist_bl_lock(&sb->s_roots);
hlist_bl_add_head(&new->d_hash, &sb->s_roots);
@@ -2658,7 +2658,7 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
* we unlock the chain. All fields are stable in everything
* we encounter.
*/
- hlist_bl_for_each_entry(dentry, node, b, d_u.d_in_lookup_hash) {
+ hlist_bl_for_each_entry(dentry, node, b, d_in_lookup_hash) {
if (dentry->d_name.hash != hash)
continue;
if (dentry->d_parent != parent)
@@ -2700,7 +2700,7 @@ struct dentry *d_alloc_parallel(struct dentry *parent,
}
rcu_read_unlock();
new->d_wait = wq;
- hlist_bl_add_head(&new->d_u.d_in_lookup_hash, b);
+ hlist_bl_add_head(&new->d_in_lookup_hash, b);
hlist_bl_unlock(b);
return new;
mismatch:
@@ -2725,11 +2725,11 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
b = in_lookup_hash(dentry->d_parent, dentry->d_name.hash);
hlist_bl_lock(b);
dentry->d_flags &= ~DCACHE_PAR_LOOKUP;
- __hlist_bl_del(&dentry->d_u.d_in_lookup_hash);
+ __hlist_bl_del(&dentry->d_in_lookup_hash);
d_wait = dentry->d_wait;
dentry->d_wait = NULL;
hlist_bl_unlock(b);
- INIT_HLIST_NODE(&dentry->d_u.d_alias);
+ INIT_HLIST_NODE(&dentry->d_alias);
INIT_LIST_HEAD(&dentry->d_lru);
return d_wait;
}
@@ -2760,7 +2760,7 @@ static inline void __d_add(struct dentry *dentry, struct inode *inode,
d_set_d_op(dentry, ops);
if (inode) {
unsigned add_flags = d_flags_for_inode(inode);
- hlist_add_head(&dentry->d_u.d_alias, &inode->i_dentry);
+ hlist_add_head(&dentry->d_alias, &inode->i_dentry);
raw_write_seqcount_begin(&dentry->d_seq);
__d_set_inode_and_type(dentry, inode, add_flags);
raw_write_seqcount_end(&dentry->d_seq);
@@ -2795,7 +2795,7 @@ EXPORT_SYMBOL(d_add);
struct dentry *d_make_persistent(struct dentry *dentry, struct inode *inode)
{
- WARN_ON(!hlist_unhashed(&dentry->d_u.d_alias));
+ WARN_ON(!hlist_unhashed(&dentry->d_alias));
WARN_ON(!inode);
security_d_instantiate(dentry, inode);
spin_lock(&inode->i_lock);
@@ -3185,7 +3185,7 @@ void d_mark_tmpfile(struct file *file, struct inode *inode)
struct dentry *dentry = file->f_path.dentry;
BUG_ON(dname_external(dentry) ||
- !hlist_unhashed(&dentry->d_u.d_alias) ||
+ !hlist_unhashed(&dentry->d_alias) ||
!d_unlinked(dentry));
spin_lock(&dentry->d_parent->d_lock);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
diff --git a/fs/inode.c b/fs/inode.c
index cc12b68e021b..9e1ab333d382 100644
--- a/fs/inode.c
+++ b/fs/inode.c
@@ -754,7 +754,7 @@ void dump_mapping(const struct address_space *mapping)
return;
}
- dentry_ptr = container_of(dentry_first, struct dentry, d_u.d_alias);
+ dentry_ptr = container_of(dentry_first, struct dentry, d_alias);
if (get_kernel_nofault(dentry, dentry_ptr) ||
!dentry.d_parent || !dentry.d_name.name) {
pr_warn("aops:%ps ino:%lx invalid dentry:%px\n",
diff --git a/fs/nfs/dir.c b/fs/nfs/dir.c
index 5a0bd8113e3a..f2f1b036f2f1 100644
--- a/fs/nfs/dir.c
+++ b/fs/nfs/dir.c
@@ -1490,7 +1490,7 @@ static void nfs_clear_verifier_directory(struct inode *dir)
if (hlist_empty(&dir->i_dentry))
return;
this_parent =
- hlist_entry(dir->i_dentry.first, struct dentry, d_u.d_alias);
+ hlist_entry(dir->i_dentry.first, struct dentry, d_alias);
spin_lock(&this_parent->d_lock);
nfs_unset_verifier_delegated(&this_parent->d_time);
diff --git a/fs/nfs/getroot.c b/fs/nfs/getroot.c
index f13d25d95b85..eef0736beb67 100644
--- a/fs/nfs/getroot.c
+++ b/fs/nfs/getroot.c
@@ -54,7 +54,7 @@ static int nfs_superblock_set_dummy_root(struct super_block *sb, struct inode *i
*/
spin_lock(&d_inode(sb->s_root)->i_lock);
spin_lock(&sb->s_root->d_lock);
- hlist_del_init(&sb->s_root->d_u.d_alias);
+ hlist_del_init(&sb->s_root->d_alias);
spin_unlock(&sb->s_root->d_lock);
spin_unlock(&d_inode(sb->s_root)->i_lock);
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index 7f1dbc7121d7..f939d2ed10a3 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -128,7 +128,7 @@ struct dentry {
struct hlist_node d_alias; /* inode alias list */
struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
struct rcu_head d_rcu;
- } d_u;
+ };
};
/*
@@ -617,6 +617,6 @@ void d_make_discardable(struct dentry *dentry);
/* inode->i_lock must be held over that */
#define for_each_alias(dentry, inode) \
- hlist_for_each_entry(dentry, &(inode)->i_dentry, d_u.d_alias)
+ hlist_for_each_entry(dentry, &(inode)->i_dentry, d_alias)
#endif /* __LINUX_DCACHE_H */
--
2.47.3
* [RFC PATCH v3 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 1/4] for_each_alias(): helper macro for iterating through dentries of given inode Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 2/4] struct dentry: make ->d_u anonymous Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-04 8:07 ` [RFC PATCH v3 4/4] get rid of busy-waiting in shrink_dcache_tree() Al Viro
2026-04-09 16:51 ` [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent() Jeff Layton
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
Several functions have BUG_ON/WARN_ON sanity checks that want to verify
that a dentry is not positive, but instead of looking at ->d_inode (as we
do in all other places that check that) they look at ->d_alias.
Just use the normal helpers instead - that way we no longer even look
at ->d_alias for negative dentries.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
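The predicate itself boils down to a NULL check on ->d_inode. A minimal userspace sketch (with simplified stand-in struct definitions) of the check the patch switches to:

```c
#include <stddef.h>

struct inode { unsigned long i_ino; };	/* simplified stand-in */
struct dentry { struct inode *d_inode; };

/* mirrors the kernel helper: positive <=> ->d_inode is non-NULL */
static int d_really_is_positive(const struct dentry *dentry)
{
	return dentry->d_inode != NULL;
}
```

Checking hlist_unhashed(&dentry->d_alias) tested the same state indirectly; the direct form never dereferences ->d_alias on a negative dentry, which is what frees that slot for reuse in the next patch.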
---
fs/dcache.c | 10 +++++-----
1 file changed, 5 insertions(+), 5 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 4378eb8a00bb..616a445ec720 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -428,7 +428,7 @@ static inline void __d_clear_type_and_inode(struct dentry *dentry)
static void dentry_free(struct dentry *dentry)
{
- WARN_ON(!hlist_unhashed(&dentry->d_alias));
+ WARN_ON(d_really_is_positive(dentry));
if (unlikely(dname_external(dentry))) {
struct external_name *p = external_name(dentry);
if (likely(atomic_dec_and_test(&p->count))) {
@@ -2004,7 +2004,7 @@ static void __d_instantiate(struct dentry *dentry, struct inode *inode)
void d_instantiate(struct dentry *entry, struct inode * inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_alias));
+ BUG_ON(d_really_is_positive(entry));
if (inode) {
security_d_instantiate(entry, inode);
spin_lock(&inode->i_lock);
@@ -2024,7 +2024,7 @@ EXPORT_SYMBOL(d_instantiate);
*/
void d_instantiate_new(struct dentry *entry, struct inode *inode)
{
- BUG_ON(!hlist_unhashed(&entry->d_alias));
+ BUG_ON(d_really_is_positive(entry));
BUG_ON(!inode);
lockdep_annotate_inode_mutex_key(inode);
security_d_instantiate(entry, inode);
@@ -2795,7 +2795,7 @@ EXPORT_SYMBOL(d_add);
struct dentry *d_make_persistent(struct dentry *dentry, struct inode *inode)
{
- WARN_ON(!hlist_unhashed(&dentry->d_alias));
+ WARN_ON(d_really_is_positive(dentry));
WARN_ON(!inode);
security_d_instantiate(dentry, inode);
spin_lock(&inode->i_lock);
@@ -3185,7 +3185,7 @@ void d_mark_tmpfile(struct file *file, struct inode *inode)
struct dentry *dentry = file->f_path.dentry;
BUG_ON(dname_external(dentry) ||
- !hlist_unhashed(&dentry->d_alias) ||
+ d_really_is_positive(dentry) ||
!d_unlinked(dentry));
spin_lock(&dentry->d_parent->d_lock);
spin_lock_nested(&dentry->d_lock, DENTRY_D_LOCK_NESTED);
--
2.47.3
* [RFC PATCH v3 4/4] get rid of busy-waiting in shrink_dcache_tree()
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
` (2 preceding siblings ...)
2026-04-04 8:07 ` [RFC PATCH v3 3/4] dcache.c: more idiomatic "positives are not allowed" sanity checks Al Viro
@ 2026-04-04 8:07 ` Al Viro
2026-04-09 16:51 ` [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent() Jeff Layton
4 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-04 8:07 UTC (permalink / raw)
To: Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
If shrink_dcache_tree() runs into a potential victim that is already
dying, it must wait for that dentry to go away. To avoid busy-waiting
we need some object to wait on and a way for dentry_unlist() to see that
we need to be notified.
The obvious place for the object to wait on would be on our stack frame.
We will store a pointer to that object (struct completion_list) in victim
dentry; if there's more than one thread wanting to wait for the same
dentry to finish dying, we'll have their instances linked into a list,
with reference in dentry pointing to the head of that list.
* new object - struct completion_list. A pair of struct completion and
pointer to the next instance. That's what shrink_dcache_tree() will wait
on if needed.
* add a new member (->waiters, an opaque pointer to struct completion_list)
to struct dentry. It is defined for negative live dentries that are
not in-lookup ones, and it will remain NULL for almost all of them.
It does not conflict with ->d_rcu (defined for killed dentries), ->d_alias
(defined for positive dentries, all live) or ->d_in_lookup_hash (defined
for in-lookup dentries, all live negative). That allows all four members
to be colocated.
* make sure that all places where dentry enters the state where ->waiters
is defined (live, negative, not-in-lookup) initialize ->waiters to NULL.
* if select_collect2() runs into a dentry that is already dying, have
its caller insert a local instance of struct completion_list at the
head of the list hanging off dentry->waiters and wait for completion.
* if dentry_unlist() sees non-NULL ->waiters, have it carefully walk
through the completion_list instances in that list, calling complete()
for each.
For now struct completion_list is local to fs/dcache.c; it's obviously
dentry-agnostic, and it can be trivially lifted into linux/completion.h
if somebody finds a reason to do so...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
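The list manipulation described above can be exercised outside the kernel. The sketch below mirrors d_add_waiter()/d_complete_waiters() from this patch, with one deliberate simplification: the kernel's struct completion (sleep + wakeup) is replaced by a plain flag, since the point being illustrated is the on-stack, singly linked waiter list, not the sleeping itself.

```c
#include <stddef.h>

/* flag stand-in for the kernel's struct completion */
struct completion { int done; };

static void init_completion(struct completion *c) { c->done = 0; }
static void complete(struct completion *c) { c->done = 1; }

struct completion_list {
	struct completion_list *next;
	struct completion completion;
};

/* only the member relevant here */
struct dentry { struct completion_list *waiters; };

/* link a waiter (normally living on the waiter's stack) into the list */
static void d_add_waiter(struct dentry *dentry, struct completion_list *p)
{
	init_completion(&p->completion);
	p->next = dentry->waiters;
	dentry->waiters = p;
}

static void d_complete_waiters(struct dentry *dentry)
{
	struct completion_list *v = dentry->waiters;

	if (v) {
		dentry->waiters = NULL;
		while (v) {
			/* read ->next before complete(): in the kernel the
			 * waiter's stack frame may be gone once it is woken;
			 * the ordering is preserved here for fidelity */
			struct completion *r = &v->completion;
			v = v->next;
			complete(r);
		}
	}
}
```

In the real patch, d_add_waiter() runs under the victim's ->d_lock and the waiter then blocks in wait_for_completion(); dentry_unlist() does the walk-and-complete once the dentry is invisible to d_walk().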
---
fs/dcache.c | 79 ++++++++++++++++++++++++++++++++++++++----
include/linux/dcache.h | 18 ++++++++--
2 files changed, 88 insertions(+), 9 deletions(-)
diff --git a/fs/dcache.c b/fs/dcache.c
index 616a445ec720..0c8faeee02e2 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -456,6 +456,15 @@ static void dentry_unlink_inode(struct dentry * dentry)
raw_write_seqcount_begin(&dentry->d_seq);
__d_clear_type_and_inode(dentry);
hlist_del_init(&dentry->d_alias);
+ /*
+ * dentry becomes negative, so the space occupied by ->d_alias
+ * belongs to ->waiters now; we could use __hlist_del() instead
+ * of hlist_del_init(), if not for the stunt pulled by nfs
+ * dummy root dentries - positive dentry *not* included into
+ * the alias list of its inode. Open-coding hlist_del_init()
+ * and removing zeroing would be too clumsy...
+ */
+ dentry->waiters = NULL;
raw_write_seqcount_end(&dentry->d_seq);
spin_unlock(&dentry->d_lock);
spin_unlock(&inode->i_lock);
@@ -605,6 +614,44 @@ void d_drop(struct dentry *dentry)
}
EXPORT_SYMBOL(d_drop);
+struct completion_list {
+ struct completion_list *next;
+ struct completion completion;
+};
+
+/*
+ * shrink_dcache_tree() needs to be notified when dentry in process of
+ * being evicted finally gets unlisted. Such dentries are
+ * already with negative ->d_count
+ * already negative
+ * already not in in-lookup hash
+ * reachable only via ->d_sib.
+ *
+ * Use ->waiters for a single-linked list of struct completion_list of
+ * waiters.
+ */
+static inline void d_add_waiter(struct dentry *dentry, struct completion_list *p)
+{
+ struct completion_list *v = dentry->waiters;
+ init_completion(&p->completion);
+ p->next = v;
+ dentry->waiters = p;
+}
+
+static inline void d_complete_waiters(struct dentry *dentry)
+{
+ struct completion_list *v = dentry->waiters;
+ if (unlikely(v)) {
+ /* some shrink_dcache_tree() instances are waiting */
+ dentry->waiters = NULL;
+ while (v) {
+ struct completion *r = &v->completion;
+ v = v->next;
+ complete(r);
+ }
+ }
+}
+
static inline void dentry_unlist(struct dentry *dentry)
{
struct dentry *next;
@@ -613,6 +660,7 @@ static inline void dentry_unlist(struct dentry *dentry)
* attached to the dentry tree
*/
dentry->d_flags |= DCACHE_DENTRY_KILLED;
+ d_complete_waiters(dentry);
if (unlikely(hlist_unhashed(&dentry->d_sib)))
return;
__hlist_del(&dentry->d_sib);
@@ -1569,6 +1617,10 @@ static enum d_walk_ret select_collect2(void *_data, struct dentry *dentry)
return D_WALK_QUIT;
}
to_shrink_list(dentry, &data->dispose);
+ } else if (dentry->d_lockref.count < 0) {
+ rcu_read_lock();
+ data->victim = dentry;
+ return D_WALK_QUIT;
}
/*
* We can return to the caller if we have found some (this
@@ -1608,12 +1660,27 @@ static void shrink_dcache_tree(struct dentry *parent, bool for_umount)
data.victim = NULL;
d_walk(parent, &data, select_collect2);
if (data.victim) {
- spin_lock(&data.victim->d_lock);
- if (!lock_for_kill(data.victim)) {
- spin_unlock(&data.victim->d_lock);
+ struct dentry *v = data.victim;
+
+ spin_lock(&v->d_lock);
+ if (v->d_lockref.count < 0 &&
+ !(v->d_flags & DCACHE_DENTRY_KILLED)) {
+ struct completion_list wait;
+ // It's busy dying; have it notify us once
+ // it becomes invisible to d_walk().
+ d_add_waiter(v, &wait);
+ spin_unlock(&v->d_lock);
+ rcu_read_unlock();
+ if (!list_empty(&data.dispose))
+ shrink_dentry_list(&data.dispose);
+ wait_for_completion(&wait.completion);
+ continue;
+ }
+ if (!lock_for_kill(v)) {
+ spin_unlock(&v->d_lock);
rcu_read_unlock();
} else {
- shrink_kill(data.victim);
+ shrink_kill(v);
}
}
if (!list_empty(&data.dispose))
@@ -1787,7 +1854,7 @@ static struct dentry *__d_alloc(struct super_block *sb, const struct qstr *name)
INIT_HLIST_BL_NODE(&dentry->d_hash);
INIT_LIST_HEAD(&dentry->d_lru);
INIT_HLIST_HEAD(&dentry->d_children);
- INIT_HLIST_NODE(&dentry->d_alias);
+ dentry->waiters = NULL;
INIT_HLIST_NODE(&dentry->d_sib);
if (dentry->d_op && dentry->d_op->d_init) {
@@ -2729,7 +2796,7 @@ static wait_queue_head_t *__d_lookup_unhash(struct dentry *dentry)
d_wait = dentry->d_wait;
dentry->d_wait = NULL;
hlist_bl_unlock(b);
- INIT_HLIST_NODE(&dentry->d_alias);
+ dentry->waiters = NULL;
INIT_LIST_HEAD(&dentry->d_lru);
return d_wait;
}
diff --git a/include/linux/dcache.h b/include/linux/dcache.h
index f939d2ed10a3..19098253f2dd 100644
--- a/include/linux/dcache.h
+++ b/include/linux/dcache.h
@@ -88,6 +88,7 @@ union shortname_store {
#define d_lock d_lockref.lock
#define d_iname d_shortname.string
+struct completion_list;
struct dentry {
/* RCU lookup touched fields */
@@ -122,12 +123,23 @@ struct dentry {
struct hlist_node d_sib; /* child of parent list */
struct hlist_head d_children; /* our children */
/*
- * d_alias and d_rcu can share memory
+ * the following members can share memory - their uses are
+ * mutually exclusive.
*/
union {
- struct hlist_node d_alias; /* inode alias list */
- struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */
+ /* positives: inode alias list */
+ struct hlist_node d_alias;
+ /* in-lookup ones (all negative, live): hash chain */
+ struct hlist_bl_node d_in_lookup_hash;
+ /* killed ones: (already negative) used to schedule freeing */
struct rcu_head d_rcu;
+ /*
+ * live non-in-lookup negatives: used if shrink_dcache_tree()
+ * races with eviction by another thread and needs to wait for
+ * this dentry to get killed. Remains NULL for almost all
+ * negative dentries.
+ */
+ struct completion_list *waiters;
};
};
--
2.47.3
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-04 8:07 ` [RFC PATCH v3 " Al Viro
` (3 preceding siblings ...)
2026-04-04 8:07 ` [RFC PATCH v3 4/4] get rid of busy-waiting in shrink_dcache_tree() Al Viro
@ 2026-04-09 16:51 ` Jeff Layton
2026-04-09 19:02 ` Al Viro
4 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-09 16:51 UTC (permalink / raw)
To: Al Viro, Linus Torvalds
Cc: linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Sat, 2026-04-04 at 09:07 +0100, Al Viro wrote:
> Changes since v2:
> * separate select_data from the completion list elements - add a separate
> struct completion_list (private to fs/dcache.c at the moment, but it's completely
> independent from struct dentry and can be lifted e.g. into linux/completion.h
> if there are other potential users).
>
> Branch (currently -rc6-based) lives in
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> individual patches in followups. Please, review.
>
> Shortlog:
> Al Viro (4):
> for_each_alias(): helper macro for iterating through dentries of given inode
> struct dentry: make ->d_u anonymous
> dcache.c: more idiomatic "positives are not allowed" sanity checks
> get rid of busy-waiting in shrink_dcache_tree()
>
> Diffstat:
> Documentation/filesystems/porting.rst | 10 ++
> fs/affs/amigaffs.c | 2 +-
> fs/ceph/mds_client.c | 4 +-
> fs/dcache.c | 129 +++++++++++++++++++-------
> fs/exportfs/expfs.c | 2 +-
> fs/inode.c | 2 +-
> fs/nfs/dir.c | 4 +-
> fs/nfs/getroot.c | 2 +-
> fs/notify/fsnotify.c | 2 +-
> fs/ocfs2/dcache.c | 2 +-
> fs/overlayfs/dir.c | 2 +-
> fs/smb/client/inode.c | 2 +-
> include/linux/dcache.h | 24 ++++-
> 13 files changed, 140 insertions(+), 47 deletions(-)
>
> Series overview
>
> First a couple of cleanups to reduce the amount of noise:
>
> 1) for_each_alias(): helper macro for iterating through dentries of given inode
>
> Most of the places using d_alias are loops iterating through all aliases of
> a given inode; introduce a helper macro (for_each_alias(dentry, inode))
> and convert open-coded instances of such loops to it.
>
> They are easier to read that way, and it reduces the noise in the next steps.
>
> 2) struct dentry: make ->d_u anonymous
>
> Making ->d_rcu and (then) ->d_child overlapping dates back to 2006; support
> for anonymous unions was only added to gcc in 4.6 (2011), and the minimal gcc
> version wasn't bumped that far until kernel 4.19 (2018).
> These days there's no reason to keep that union named.
>
> Next we get rid of the aforementioned pathologies:
>
> 3) dcache.c: more idiomatic "positives are not allowed" sanity checks
>
> Several functions have BUG_ON/WARN_ON sanity checks that want to verify
> that dentry is not positive and instead of looking at ->d_inode (as we
> do in all other places that check that) they look at ->d_alias.
> Just use the normal helpers instead - that way we no longer even look
> at ->d_alias for negative dentries.
>
> Now that everything is ready for adding a new member and killing busy-wait:
>
> 4) get rid of busy-waiting in shrink_dcache_tree()
>
> If shrink_dcache_tree() runs into a potential victim that is already
> dying, it must wait for that dentry to go away. To avoid busy-waiting
> we need some object to wait on and a way for dentry_unlist() to see that
> we need to be notified.
>
> The obvious place for the object to wait on would be on our stack frame.
> We will store a pointer to that object (struct completion_list) in victim
> dentry; if there's more than one thread wanting to wait for the same
> dentry to finish dying, we'll have their instances linked into a list,
> with reference in dentry pointing to the head of that list.
>
> * new object - struct completion_list. A pair of struct completion and
> pointer to the next instance. That's what shrink_dcache_tree() will wait
> on if needed.
>
> * add a new member (->waiters, opaque pointer to struct completion_list)
> to struct dentry. It is defined for negative live dentries that are
> not in-lookup ones and it will remain NULL for almost all of them.
>
> It does not conflict with ->d_rcu (defined for killed dentries), ->d_alias
> (defined for positive dentries, all live) or ->d_in_lookup_hash (defined
> for in-lookup dentries, all live negative). That allows colocating
> all four members.
>
> * make sure that all places where dentry enters the state where ->waiters
> is defined (live, negative, not-in-lookup) initialize ->waiters to NULL.
>
> * if select_collect2() runs into a dentry that is already dying, have
> its caller insert a local instance of struct completion_list into the
> head of the list hanging off dentry->waiters and wait for completion.
>
> * if dentry_unlist() sees non-NULL ->waiters, have it carefully walk
> through the completion_list instances in that list, calling complete()
> for each.
>
> For now struct completion_list is local to fs/dcache.c; it's obviously
> dentry-agnostic, and it can be trivially lifted into linux/completion.h
> if somebody finds a reason to do so...
>
>
This all looks good, Al. Nice work. I like the new wrappers and anon
union too -- makes the code a lot more readable.
I can also confirm that this fixes the livelock we were seeing as a
precursor to this UAF [1]. I'm still not 100% sure how that livelock
leads to a UAF. Claude thinks it has to do with the livelock making the
loop outlive an RCU grace period.
I did eventually find a vmcore and was able to confirm some things:
- in the freed/reallocated dentry, most of the fields are clobbered,
but the d_lru list pointers seemed to be intact. That tells me that the
sucker was added to the shrink list after being freed and reallocated.
That means that the WARN_ON_ONCE() wouldn't have caught this anyway.
- there was some list corruption too: it looked like there were two
different (on-stack) list heads in the list. There were also some
dentries in there that were outside the shrin
Note that this was an older (v6.13-based) kernel though, and it's
possible there are other unfixed bugs in here.
I'm currently trying to reproduce the UAF using that mdelay() based
fault injection patch on something closer to mainline.
[1]: https://lore.kernel.org/linux-fsdevel/20260406-dcache-warn-v1-1-c665efbc005f@kernel.org/
I reviewed the version in your work.dcache-busy-wait branch, but you
can add this to the lot.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 16:51 ` [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent() Jeff Layton
@ 2026-04-09 19:02 ` Al Viro
2026-04-09 20:10 ` Jeff Layton
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-09 19:02 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, Apr 09, 2026 at 12:51:38PM -0400, Jeff Layton wrote:
> I can also confirm that this fixes the livelock we were seeing as a
> precursor to this UAF [1]. I'm still not 100% sure how that livelock
> leads to a UAF. Claude thinks it has to do with the livelock making the
> loop outlive an RCU grace period.
Claude is full of substance likely to cause CoC complaints when mentioned,
film at 11. I agree with the "livelock opens up some window" part,
what with the stack traces there, but I very much doubt that mechanism
has anything to do with the stuff on a shrink list staying around from
one iteration to another. The thing is, *nothing* is held between the
iterations of loop in there; data.dispose is empty at the end of the loop
body in all cases and the reference in data.victim is not retained
across the iterations either. We do have a weird RCU use pattern there,
all right (unbalanced acquire in select_collect2() paired with drop in
the caller of the caller of select_collect2()), but the scope is neither
long nor extended by the livelock - in that case the function between
shrink_dcache_tree() and select_collect2() (d_walk()) is told to drop
the locks and return to its caller immediately.
What I suspect might be getting opened up is the possibility
of d_invalidate() (e.g. from pathname resolution trying
to walk into /proc/<pid>/...) hitting in the middle of
proc_invalidate_siblings_dcache().
Incidentally, one thing to check would be whether anything on affected
boxen had things mounted under /proc/<pid>/... - that would give
some idea whether detach_mounts() (and namespace_unlock() with its
handling of the list of mountpoints) might be needed in the mix.
> I did eventually find a vmcore and was able to confirm some things:
>
> - in the freed/reallocated dentry, most of the fields are clobbered,
> but the d_lru list pointers seemed to be intact. That tells me that the
> sucker was added to the shrink list after being freed and reallocated.
> That means that the WARN_ON_ONCE() wouldn't have caught this anyway.
>
> - there was some list corruption too: it looked like there were two
> different (on-stack) list heads in the list. There were also some
> dentries in there that were outside the shrin
?
> Note that this was an older (v6.13-based) kernel though, and it's
> possible there are other unfixed bugs in here.
>
> I'm currently trying to reproduce the UAF using that mdelay() based
> fault injection patch on something closer to mainline.
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 19:02 ` Al Viro
@ 2026-04-09 20:10 ` Jeff Layton
2026-04-09 21:57 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-09 20:10 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, 2026-04-09 at 20:02 +0100, Al Viro wrote:
> On Thu, Apr 09, 2026 at 12:51:38PM -0400, Jeff Layton wrote:
>
> > I can also confirm that this fixes the livelock we were seeing as a
> > precursor to this UAF [1]. I'm still not 100% sure how that livelock
> > leads to a UAF. Claude thinks it has to do with the livelock making the
> > loop outlive an RCU grace period.
>
> Claude is full of substance likely to cause CoC complaints when mentioned,
> film at 11. I agree with the "livelock opens up some window" part,
> what with the stack traces there, but I very much doubt that mechanism
> has anything to do with the stuff on a shrink list staying around from
> one iteration to another. The thing is, *nothing* is held between the
> iterations of loop in there; data.dispose is empty at the end of the loop
> body in all cases and the reference in data.victim is not retained
> across the iterations either. We do have a weird RCU use pattern there,
> all right (unbalanced acquire in select_collect2() paired with drop in
> the caller of the caller of select_collect2()), but the scope is neither
> long nor extended by the livelock - in that case the function between
> shrink_dcache_tree() and select_collect2() (d_walk()) is told to drop
> the locks and return to its caller immediately.
>
> What I suspect might be getting opened up is the possibility
> of d_invalidate() (e.g. from pathname resolution trying
> to walk into /proc/<pid>/...) hitting in the middle of
> proc_invalidate_siblings_dcache().
>
> Incidentally, one thing to check would be whether anything on affected
> boxen had things mounted under /proc/<pid>/... - that would give
> some idea whether detach_mounts() (and namespace_unlock() with its
> handling of the list of mountpoints) might be needed in the mix.
>
● had_submounts = false. The d_invalidate code never found submounts, so
detach_mounts() was never called for this dentry. The find_submount /
namespace_unlock path was not involved in this crash.
Summary for Al: On this crashed machine, there were no mounts under
/proc/<pid>/... in any of the 49 mount namespaces checked (from
surviving tasks). The d_invalidate stack frame shows had_submounts =
false, confirming that detach_mounts() and namespace_unlock() were not
in the mix for this crash. The code path was purely
shrink_dcache_parent → shrink_dentry_list → lock_for_kill, without any
mount detachment involved.
> > I did eventually find a vmcore and was able to confirm some things:
> >
> > - in the freed/reallocated dentry, most of the fields are clobbered,
> > but the d_lru list pointers seemed to be intact. That tells me that the
> > sucker was added to the shrink list after being freed and reallocated.
> > That means that the WARN_ON_ONCE() wouldn't have caught this anyway.
> >
> > - there was some list corruption too: it looked like there were two
> > different (on-stack) list heads in the list. There were also some
> > dentries in there that were outside the
...directories being shrunk from processes exiting. Here's what it
found when I told it to validate the integrity of the d_lru list that the
realloc'ed entry was on.
More claude slop from the earlier vmcore analysis:
------------------8<------------------
### Dispose List: Two Merged Lists
The dispose list contains **55 unique entries** and **two stack-based
list heads** forming a single ring:
1. **head_A** at `0xffffc900c0e8fcc0` — crashed thread's `data.dispose`
2. **head_B** at `0xffffc900c0e7fcc0` — another thread's `data.dispose`
(exited, stack freed, exactly 0x10000 bytes apart)
```
head_A ↔ UAF ↔ mountinfo(1) ↔ swaps(/) ↔ ... ↔
head_B(STACK) ↔ status(5962) ↔ exe(5962) ↔ ... ↔
stat(8808) ↔ ... ↔ mountinfo(1) ↔ UAF ↔ head_A
```
### d_walk Escaped Its Starting Dentry
`d_walk` was called with `parent = /proc/4530/task/5964` (`data.start`
confirmed in stack frame). It should only traverse descendants of 5964.
But the dispose list contains entries from:
- `/proc/4530/task/5962/*` (151 children — sibling of 5964)
- `/proc/4530/task/6830`, `/proc/4530/task/8808` — other task entries
- `/proc/1/mountinfo`, `/proc/1/status`, `/proc/1/net` — PID 1 entries
- `/proc/sys/vm/overcommit_memory`, `/proc/sys/fs/*` — sysctl entries
- `/proc/pressure/{cpu,io,memory}` — PSI entries
- `/proc/swaps`, `/proc/cpuinfo`, `/proc/kcore` — root proc entries
------------------8<------------------
>
>
> > Note that this was an older (v6.13-based) kernel though, and it's
> > possible there are other unfixed bugs in here.
> >
> > I'm currently trying to reproduce the UAF using that mdelay() based
> > fault injection patch on something closer to mainline.
...and so far it's not trivially reproducible. I can reproduce the
stalls on a recent kernel without your patches, but whatever causes
this UAF is not easily triggerable by just stalling things out.
I think I'm going to have to hunt for more vmcores...
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 20:10 ` Jeff Layton
@ 2026-04-09 21:57 ` Al Viro
2026-04-09 22:38 ` Jeff Layton
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
0 siblings, 2 replies; 63+ messages in thread
From: Al Viro @ 2026-04-09 21:57 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, Apr 09, 2026 at 04:10:41PM -0400, Jeff Layton wrote:
> head_A ↔ UAF ↔ mountinfo(1) ↔ swaps(/) ↔ ... ↔
> head_B(STACK) ↔ status(5962) ↔ exe(5962) ↔ ... ↔
> stat(8808) ↔ ... ↔ mountinfo(1) ↔ UAF ↔ head_A
> ```
Wait a bloody minute; *two* UAF and two mountinfo there? Are the links
in that cyclic list consistent? ->next and ->prev, I mean.
If I understand the notation correctly... are there two different
dentries, both with "mountinfo" for name and PROC_I(_->d_inode)->pid
being PID 1? What are their ->d_parent pointing to? Are they hashed?
> ### d_walk Escaped Its Starting Dentry
> `d_walk` was called with `parent = /proc/4530/task/5964` (`data.start`
> confirmed in stack frame). It should only traverse descendants of 5964.
> But the dispose list contains entries from:
> - `/proc/4530/task/5962/*` (151 children — sibling of 5964)
> - `/proc/4530/task/6830`, `/proc/4530/task/8808` — other task entries
> - `/proc/1/mountinfo`, `/proc/1/status`, `/proc/1/net` — PID 1 entries
> - `/proc/sys/vm/overcommit_memory`, `/proc/sys/fs/*` — sysctl entries
> - `/proc/pressure/{cpu,io,memory}` — PSI entries
> - `/proc/swaps`, `/proc/cpuinfo`, `/proc/kcore` — root proc entries
Not at all obvious; there's a list from another thread mixed into that,
and we've no idea what had the root been for that one. For that matter,
I'd check ->d_flags on the entries, just to verify that those are
from shrink lists and not from LRU - the fact that these UAF still
have pointers to plausible dentries does not mean they are from the
same moment in time; if dentry in question had been on a different
list before getting freed... That's why the question about ->prev
and ->next consistency.
And seeing that procfs has zero callers of d_splice_alias() or d_move(),
I would expect ->d_parent on all dentries in there to be constant over
the entire lifetime; would be rather hard for d_walk() to escape in
such conditions...
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC PATCH v3 0/4] getting rid of busy-wait in shrink_dcache_parent()
2026-04-09 21:57 ` Al Viro
@ 2026-04-09 22:38 ` Jeff Layton
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
1 sibling, 0 replies; 63+ messages in thread
From: Jeff Layton @ 2026-04-09 22:38 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Thu, 2026-04-09 at 22:57 +0100, Al Viro wrote:
> On Thu, Apr 09, 2026 at 04:10:41PM -0400, Jeff Layton wrote:
>
> > head_A ↔ UAF ↔ mountinfo(1) ↔ swaps(/) ↔ ... ↔
> > head_B(STACK) ↔ status(5962) ↔ exe(5962) ↔ ... ↔
> > stat(8808) ↔ ... ↔ mountinfo(1) ↔ UAF ↔ head_A
> > ```
>
I think the above is BS. Claude says:
In the earlier analysis I walked the list both forward from
head_B (the stack entry) and backward from head_A. That was misleading
— there's only one mountinfo node. The ring passes through it once. The
duplicate in my listing was an artifact of walking the ring from two
entry points and not deduplicating.
> Wait a bloody minute; *two* UAF and two mountinfo there? Are the links
> in that cyclic list consistent? ->next and ->prev, I mean.
No:
The list was NOT a proper ring:
- FWD (head.next): LOOP at entry 1 — immediately hits a self-loop (the processed "status" dentry)
- BWD (head.prev): LOOP at entry 55 — walks 55 entries then hits a cycle
> If I understand the notation correctly... are there two different
> dentries, both with "mountinfo" for name and PROC_I(_->d_inode)->pid
> being PID 1? What are their ->d_parent pointing to? Are they hashed?
>
> > ### d_walk Escaped Its Starting Dentry
> > `d_walk` was called with `parent = /proc/4530/task/5964` (`data.start`
> > confirmed in stack frame). It should only traverse descendants of 5964.
> > But the dispose list contains entries from:
> > - `/proc/4530/task/5962/*` (151 children — sibling of 5964)
> > - `/proc/4530/task/6830`, `/proc/4530/task/8808` — other task entries
> > - `/proc/1/mountinfo`, `/proc/1/status`, `/proc/1/net` — PID 1 entries
> > - `/proc/sys/vm/overcommit_memory`, `/proc/sys/fs/*` — sysctl entries
> > - `/proc/pressure/{cpu,io,memory}` — PSI entries
> > - `/proc/swaps`, `/proc/cpuinfo`, `/proc/kcore` — root proc entries
>
> Not at all obvious; there's a list from another thread mixed into that,
> and we've no idea what had the root been for that one. For that matter,
> I'd check ->d_flags on the entries, just to verify that those are
> from shrink lists and not from LRU - the fact that these UAF still
> have pointers to plausible dentries does not mean they are from the
> same moment in time; if dentry in question had been on a different
> list before getting freed... That's why the question about ->prev
> and ->next consistency.
>
> And seeing that procfs has zero callers of d_splice_alias() or d_move(),
> I would expect ->d_parent on all dentries in there to be constant over
> the entire lifetime; would be rather hard for d_walk() to escape in
> such conditions...
Definitely not legitimately. I suspect some other sort of mem
corruption. /proc/1/mountinfo is on the list though:
mountinfo dentries on dispose list: 1
/proc/1/mountinfo (dentry 0xffff88823deef800)
d_parent: 0xffff891373c06a80 "1" -> "/" (root)
hashed: yes (d_hash.pprev=0xffff893d337264d0)
PROC_I->pid nr: 3217077
d_inode: 0xffff88ea55a44508
Claude's theory (maybe garbage):
PID 1's mountinfo being on the list is actually a strong clue — it's
a dentry that would be on the global LRU (negative or zero-count
after someone closed /proc/1/mountinfo), and drop_caches would pull
it off the LRU via prune_dcache_sb. That's likely how it ended up on
Thread B's dispose list.
Here's a cursory comparative analysis across the 3 vmcores I have so
far:
Comparative analysis across three vmcores
=========================================
Kernel               6.13.2-0_fbk5-hf2  6.13.2-0_fbk12-hf1  6.13.2-0_fbk5-hf2
Date                 Mar 31             Apr 9               Apr 2
Process              SREventBase61      tokio-runtime-w     TTLSFWorkerU151
d_invalidate target  /task/5964         /task/3555587       /task/1102
had_submounts        False              False               False
UAF slab             kmalloc-96         kmalloc-64          kmalloc-96
UAF d_flags          0x0                0x81e73ce0          0x0
UAF d_sb             0x0                0x0                 0x0
UAF d_lockref.count  0                  -14080              -771573072
Consistent findings across all three cores:
- had_submounts is False in every case. detach_mounts() and
namespace_unlock() are not involved in any of these crashes.
- No mounts under /proc/<pid>/... in any mount namespace (checked
on the od0103 core across 49 namespaces, zero proc submounts).
- All three are in the same call path:
release_task -> proc_flush_pid -> d_invalidate ->
shrink_dcache_parent -> shrink_dentry_list -> spin_lock (UAF)
- The UAF'd dentry is in a non-dentry slab in every case (kmalloc-64
or kmalloc-96), confirming the dentry was freed and its page reused
by a different slab cache before shrink_dentry_list tried to lock it.
- Different applications triggered the crash (Rust async runtime,
C++ worker, Java-like worker) -- not application-specific.
^ permalink raw reply	[flat|nested] 63+ messages in thread
* [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-09 21:57 ` Al Viro
2026-04-09 22:38 ` Jeff Layton
@ 2026-04-10 8:48 ` Al Viro
2026-04-10 11:18 ` Jeff Layton
2026-04-10 15:25 ` Linus Torvalds
1 sibling, 2 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 8:48 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
[this may or may not be the source of UAFs caught by Jeff and by Helge]
lock_for_kill() losing a race to eviction by another thread will end up
returning false (correctly - the caller should *not* evict dentry in
that case), with ->d_lock held and rcu read-critical area still unbroken,
so the caller can safely drop the locks, provided that it's done in the
right order - ->d_lock should be dropped before rcu_read_unlock().
Unfortunately, 3 of 4 callers did it the other way round and for two of them
(finish_dput() and shrink_kill()) that ended up with a possibility of UAF.
The third (shrink_dentry_list()) had been safe for other reasons (dentry being
on the shrink list => the other thread wouldn't even schedule its freeing
until observing dentry off the shrink list while holding ->d_lock), but it's
simpler to make it a general rule - in case of lock_for_kill() failure
->d_lock must be dropped before rcu_read_unlock().
UAF scenario was basically this:
dput() drops the last reference to foo/bar and evicts it.
The reference it held to foo happens to be the last one.
When trylock on ->i_lock of parent's inode fails, we
(under rcu_read_lock()) drop ->d_lock and take the locks in
the right order; while we'd been spinning on ->i_lock somebody
else comes and evicts the parent.
Noticing non-zero (negative) refcount we decide there's nothing
left to do. And had we only dropped parent's ->d_lock before
rcu_read_unlock(), everything would be fine. Unfortunately,
doing that the other way round allows rcu-scheduled freeing of
parent to proceed before we drop its ->d_lock.
The same applies if instead of final dput() of foo/bar it gets evicted
by shrink_dcache_parent() or memory pressure, again taking out the last
reference to that used to pin its parent.
Fixes: 339e9e13530b ("don't try to cut corners in shrink_lock_dentry()")
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
---
diff --git a/fs/dcache.c b/fs/dcache.c
index 0c8faeee02e2..b1c163f20db6 100644
--- a/fs/dcache.c
+++ b/fs/dcache.c
@@ -751,6 +751,8 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
*
* Return false if dentry is busy. Otherwise, return true and have
* that dentry's inode locked.
+ *
+ * On failure the caller must drop ->d_lock *before* rcu_read_unlock()
*/
static bool lock_for_kill(struct dentry *dentry)
@@ -933,8 +935,8 @@ static void finish_dput(struct dentry *dentry)
}
rcu_read_lock();
}
- rcu_read_unlock();
spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
}
/*
@@ -1193,11 +1195,12 @@ static inline void shrink_kill(struct dentry *victim)
do {
rcu_read_unlock();
victim = __dentry_kill(victim);
+ if (!victim)
+ return;
rcu_read_lock();
- } while (victim && lock_for_kill(victim));
+ } while (lock_for_kill(victim));
+ spin_unlock(&victim->d_lock);
rcu_read_unlock();
- if (victim)
- spin_unlock(&victim->d_lock);
}
void shrink_dentry_list(struct list_head *list)
@@ -1210,10 +1213,10 @@ void shrink_dentry_list(struct list_head *list)
rcu_read_lock();
if (!lock_for_kill(dentry)) {
bool can_free;
- rcu_read_unlock();
d_shrink_del(dentry);
can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
spin_unlock(&dentry->d_lock);
+ rcu_read_unlock();
if (can_free)
dentry_free(dentry);
continue;
^ permalink raw reply related	[flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
@ 2026-04-10 11:18 ` Jeff Layton
2026-04-10 11:56 ` Jeff Layton
2026-04-10 15:25 ` Linus Torvalds
1 sibling, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 11:18 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 09:48 +0100, Al Viro wrote:
> [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> lock_for_kill() losing a race to eviction by another thread will end up
> returning false (correctly - the caller should *not* evict dentry in
> that case), with ->d_lock held and rcu read-critical area still unbroken,
> so the caller can safely drop the locks, provided that it's done in the
> right order - ->d_lock should be dropped before rcu_read_unlock().
>
> Unfortunately, 3 of 4 callers did it the other way round and for two of them
> (finish_dput() and shrink_kill()) that ended up with a possibility of UAF.
> The third (shrink_dentry_list()) had been safe for other reasons (dentry being
> on the shrink list => the other thread wouldn't even schedule its freeing
> until observing dentry off the shrink list while holding ->d_lock), but it's
> simpler to make it a general rule - in case of lock_for_kill() failure
> ->d_lock must be dropped before rcu_read_unlock().
>
> UAF scenario was basically this:
> dput() drops the last reference to foo/bar and evicts it.
> The reference it held to foo happens to be the last one.
> When trylock on ->i_lock of parent's inode fails, we
> (under rcu_read_lock()) drop ->d_lock and take the locks in
> the right order; while we'd been spinning on ->i_lock somebody
> else comes and evicts the parent.
> Noticing non-zero (negative) refcount we decide there's nothing
> left to do. And had we only dropped parent's ->d_lock before
> rcu_read_unlock(), everything would be fine. Unfortunately,
> doing that the other way round allows rcu-scheduled freeing of
> parent to proceed before we drop its ->d_lock.
> The same applies if instead of final dput() of foo/bar it gets evicted
> by shrink_dcache_parent() or memory pressure, again taking out the last
> reference to that used to pin its parent.
>
> Fixes: 339e9e13530b ("don't try to cut corners in shrink_lock_dentry()")
> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> ---
> diff --git a/fs/dcache.c b/fs/dcache.c
> index 0c8faeee02e2..b1c163f20db6 100644
> --- a/fs/dcache.c
> +++ b/fs/dcache.c
> @@ -751,6 +751,8 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
> *
> * Return false if dentry is busy. Otherwise, return true and have
> * that dentry's inode locked.
> + *
> + * On failure the caller must drop ->d_lock *before* rcu_read_unlock()
> */
>
> static bool lock_for_kill(struct dentry *dentry)
> @@ -933,8 +935,8 @@ static void finish_dput(struct dentry *dentry)
> }
> rcu_read_lock();
> }
> - rcu_read_unlock();
> spin_unlock(&dentry->d_lock);
> + rcu_read_unlock();
> }
>
> /*
> @@ -1193,11 +1195,12 @@ static inline void shrink_kill(struct dentry *victim)
> do {
> rcu_read_unlock();
> victim = __dentry_kill(victim);
> + if (!victim)
> + return;
> rcu_read_lock();
> - } while (victim && lock_for_kill(victim));
> + } while (lock_for_kill(victim));
> + spin_unlock(&victim->d_lock);
> rcu_read_unlock();
> - if (victim)
> - spin_unlock(&victim->d_lock);
> }
>
> void shrink_dentry_list(struct list_head *list)
> @@ -1210,10 +1213,10 @@ void shrink_dentry_list(struct list_head *list)
> rcu_read_lock();
> if (!lock_for_kill(dentry)) {
> bool can_free;
> - rcu_read_unlock();
> d_shrink_del(dentry);
> can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
> spin_unlock(&dentry->d_lock);
> + rcu_read_unlock();
> if (can_free)
> dentry_free(dentry);
> continue;
I'm a little skeptical that this explains the UAF we've seen. It looks
like this would mostly affect the parent dentry rather than the
children on the shrink list, but parents can be children too, so we
can't rule it out.
Either way, releasing the rcu_read_lock() protecting a dentry you're
holding a lock on seems like a bad idea.
Reviewed-by: Jeff Layton <jlayton@kernel.org>
^ permalink raw reply	[flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 11:18 ` Jeff Layton
@ 2026-04-10 11:56 ` Jeff Layton
0 siblings, 0 replies; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 11:56 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 07:18 -0400, Jeff Layton wrote:
> On Fri, 2026-04-10 at 09:48 +0100, Al Viro wrote:
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> >
> > lock_for_kill() losing a race to eviction by another thread will end up
> > returning false (correctly - the caller should *not* evict dentry in
> > that case), with ->d_lock held and rcu read-critical area still unbroken,
> > so the caller can safely drop the locks, provided that it's done in the
> > right order - ->d_lock should be dropped before rcu_read_unlock().
> >
> > Unfortunately, 3 of 4 callers did it the other way round and for two of them
> > (finish_dput() and shrink_kill()) that ended up with a possibility of UAF.
> > The third (shrink_dentry_list()) had been safe for other reasons (dentry being
> > on the shrink list => the other thread wouldn't even schedule its freeing
> > until observing dentry off the shrink list while holding ->d_lock), but it's
> > simpler to make it a general rule - in case of lock_for_kill() failure
> > ->d_lock must be dropped before rcu_read_unlock().
> >
> > UAF scenario was basically this:
> > dput() drops the last reference to foo/bar and evicts it.
> > The reference it held to foo happens to be the last one.
> > When trylock on ->i_lock of parent's inode fails, we
> > (under rcu_read_lock()) drop ->d_lock and take the locks in
> > the right order; while we'd been spinning on ->i_lock somebody
> > else comes and evicts the parent.
> > Noticing non-zero (negative) refcount we decide there's nothing
> > left to do. And had we only dropped parent's ->d_lock before
> > rcu_read_unlock(), everything would be fine. Unfortunately,
> > doing that the other way round allows rcu-scheduled freeing of
> > parent to proceed before we drop its ->d_lock.
> > The same applies if instead of final dput() of foo/bar it gets evicted
> > by shrink_dcache_parent() or memory pressure, again taking out the last
> > reference to that used to pin its parent.
> >
> > Fixes: 339e9e13530b ("don't try to cut corners in shrink_lock_dentry()")
> > Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
> > ---
> > diff --git a/fs/dcache.c b/fs/dcache.c
> > index 0c8faeee02e2..b1c163f20db6 100644
> > --- a/fs/dcache.c
> > +++ b/fs/dcache.c
> > @@ -751,6 +751,8 @@ static struct dentry *__dentry_kill(struct dentry *dentry)
> > *
> > * Return false if dentry is busy. Otherwise, return true and have
> > * that dentry's inode locked.
> > + *
> > + * On failure the caller must drop ->d_lock *before* rcu_read_unlock()
> > */
> >
> > static bool lock_for_kill(struct dentry *dentry)
> > @@ -933,8 +935,8 @@ static void finish_dput(struct dentry *dentry)
> > }
> > rcu_read_lock();
> > }
> > - rcu_read_unlock();
> > spin_unlock(&dentry->d_lock);
> > + rcu_read_unlock();
> > }
> >
> > /*
> > @@ -1193,11 +1195,12 @@ static inline void shrink_kill(struct dentry *victim)
> > do {
> > rcu_read_unlock();
> > victim = __dentry_kill(victim);
> > + if (!victim)
> > + return;
> > rcu_read_lock();
> > - } while (victim && lock_for_kill(victim));
> > + } while (lock_for_kill(victim));
> > + spin_unlock(&victim->d_lock);
> > rcu_read_unlock();
> > - if (victim)
> > - spin_unlock(&victim->d_lock);
> > }
> >
> > void shrink_dentry_list(struct list_head *list)
> > @@ -1210,10 +1213,10 @@ void shrink_dentry_list(struct list_head *list)
> > rcu_read_lock();
> > if (!lock_for_kill(dentry)) {
> > bool can_free;
> > - rcu_read_unlock();
> > d_shrink_del(dentry);
> > can_free = dentry->d_flags & DCACHE_DENTRY_KILLED;
> > spin_unlock(&dentry->d_lock);
> > + rcu_read_unlock();
> > if (can_free)
> > dentry_free(dentry);
> > continue;
>
> I'm a little skeptical that this explains the UAF we've seen. It looks
> like this would mostly affect the parent dentry rather than the
> children on the shrink list, but parents can be children too, so we
> can't rule it out.
>
> Either way, releasing the rcu_read_lock() protecting a dentry you're
> holding a lock on seems like a bad idea.
>
> Reviewed-by: Jeff Layton <jlayton@kernel.org>
FWIW, here's Claude's review:
Summary
-------
The patch fixes a use-after-free caused by incorrect lock ordering
when lock_for_kill() returns false (race lost to another thread's
eviction). Three of four callers were dropping rcu_read_unlock()
before spin_unlock(&dentry->d_lock), which allowed RCU-scheduled
freeing of the dentry to proceed while d_lock was still held on
the freed object.
Changes
-------
finish_dput() (line 935-938):
Swaps the unlock order so d_lock is dropped before rcu_read_unlock().
This is reachable from dput() and d_make_discardable(). The UAF
scenario described in the commit message applies directly here:
dput() evicts foo/bar, the parent's refcount drops to zero,
lock_for_kill() fails on the parent (trylock on i_lock failed and
during the retry another thread evicted the parent), then
rcu_read_unlock() lets the parent be freed while d_lock is still
held on it.
shrink_kill() (lines 1193-1203):
Restructured to pull the !victim early return out of the while
condition. When __dentry_kill() returns NULL (parent's refcount
didn't drop to zero), the function now returns immediately without
touching RCU at all -- correct, since rcu_read_unlock() had already
been called before __dentry_kill() was entered. When lock_for_kill()
fails, 
the loop exits and d_lock is dropped before rcu_read_unlock().
This is the path most likely responsible for the production crashes
in T262643929: shrink_dentry_list() calls shrink_kill(), which
walks up to parent dentries via __dentry_kill().
shrink_dentry_list() (lines 1213-1218):
Swaps the unlock order in the lock_for_kill() failure path. As
the commit message notes, this was already safe for other reasons:
the dentry is on a shrink list (DCACHE_SHRINK_LIST set), so the
other thread's __dentry_kill() would see that flag and set
can_free = false, preventing dentry_free() from being called.
Changed for consistency with the general rule.
lock_for_kill() comment:
Documents the ordering requirement: on failure, callers must drop
d_lock before rcu_read_unlock().
Production crash relevance
--------------------------
The shrink_kill() fix directly addresses the UAF pattern observed
in three production vmcores (od0103.dkl2, od5175.prn6,
twshared90354.17.frc2). In all three, the crashed thread was in
shrink_dentry_list() -> spin_lock() on a dentry whose slab page had
been freed and reused by kmalloc-64 or kmalloc-96.
The scenario: shrink_dentry_list() kills a dentry via shrink_kill(),
__dentry_kill() returns the parent with d_lock held and refcount
decremented to zero, lock_for_kill() fails on the parent because
another thread raced in and evicted it. With the old code, the
rcu_read_unlock() in shrink_kill() allowed the parent's
call_rcu-scheduled free to complete, and the subsequent
spin_unlock(&victim->d_lock) touched freed memory.
Whether this is the full explanation for the list corruption
observed in the vmcores (LRU dentries spliced onto the dispose
list) remains an open question. The lock ordering fix prevents
the immediate UAF on the parent dentry, but the mechanism by
which entire LRU chains ended up on the dispose list may involve
additional factors.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 8:48 ` [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order Al Viro
2026-04-10 11:18 ` Jeff Layton
@ 2026-04-10 15:25 ` Linus Torvalds
2026-04-10 15:57 ` Al Viro
` (3 more replies)
1 sibling, 4 replies; 63+ messages in thread
From: Linus Torvalds @ 2026-04-10 15:25 UTC (permalink / raw)
To: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki
Cc: Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
[ Adding RCU maintainers to the participants: see
https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
and the fairly long thread associated with it for context ]
On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> [this may or may not be the source of UAFs caught by Jeff and by Helge]
Hmm.
I think this patch may indeed fix the problem, and I don't mind how it looks.
But while I think the patch looks fine, I am still quite unhappy about
it if it matters - because we have very much documented that spinlocks
in themselves are also RCU read locks:
Note that anything that disables bottom halves, preemption,
or interrupts also enters an RCU read-side critical section.
Acquiring a spinlock also enters an RCU read-side critical
sections, even for spinlocks that do not disable preemption,
as is the case in kernels built with CONFIG_PREEMPT_RT=y.
Sleeplocks do *not* enter RCU read-side critical sections.
so *if* this makes a difference, I think it's actually implying that
there's something wrong with our "rcu_read_unlock()" implementation,
and that it doesn't nest properly.
Because our documentation also makes it very clear that this should
all work as-is, and your patch should be a complete no-op. Just a few
lines later in that core RCU doc, we have
Note that RCU read-side critical sections may be nested and/or
overlapping.
so the order of the spin_unlock() and the rcu_read_unlock() *should*
be entirely immaterial, and the order of unlocking simply shouldn't
matter.
So it may indeed be that relying on the ordering of RCU read unlock
matters for the case of explicit unlocks and the implicit unlocks by a
spinlock, but that's not great.
Does the rcu_read_unlock() code perhaps only check the RCU count, not
the atomicity count? That would explain it, and looking at
__rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
case (rcu_preempt_read_exit() seems to only check
rcu_read_lock_nesting).
This is most likely dependent on kernel config options, and it may all
be unavoidable because RCU might not even have a way to tell that it's
still in a critical region without preemption counts etc.
But if it's unavoidable, we need to update the docs about this gotcha,
and we need to have some tooling scan our existing code for this
documentation change.
RCU people - what do you think?
Linus
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
@ 2026-04-10 15:57 ` Al Viro
2026-04-10 16:27 ` Boqun Feng
` (2 subsequent siblings)
3 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 15:57 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara,
Helge Deller
[belatedly Cc'ing Helge on this one]
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
FWIW, the thread starting at https://lore.kernel.org/all/adQO8B0rkR2ftusI@p100/
might also be relevant - another UAF report in the same area, this one
with more(?) reliable reproducer, parisc-specific.
> and the fairly long thread associated with it for context ]
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
2026-04-10 15:57 ` Al Viro
@ 2026-04-10 16:27 ` Boqun Feng
2026-04-10 17:31 ` Linus Torvalds
2026-04-10 17:32 ` Paul E. McKenney
2026-04-10 18:52 ` Al Viro
3 siblings, 1 reply; 63+ messages in thread
From: Boqun Feng @ 2026-04-10 16:27 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
>
> and the fairly long thread associated with it for context ]
>
> On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> Hmm.
>
> I think this patch may indeed fix the problem, and I don't mind how it looks.
>
> But while I think the patch looks fine, I am still quite unhappy about
> it if it matters - because we have very much documented that spinlocks
> in themselves are also RCU read locks:
>
> Note that anything that disables bottom halves, preemption,
> or interrupts also enters an RCU read-side critical section.
> Acquiring a spinlock also enters an RCU read-side critical
> sections, even for spinlocks that do not disable preemption,
> as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> Sleeplocks do *not* enter RCU read-side critical sections.
>
> so *if* this makes a difference, I think it's actually implying that
> there's something wrong with our "rcu_read_unlock()" implementation,
> and that it doesn't nest properly.
>
> Because our documentation also makes it very clear that this should
> all work as-is, and your patch should be a complete no-op. Just a few
> lines later in that core RCU doc, we have
>
> Note that RCU read-side critical sections may be nested and/or
> overlapping.
>
> so the order of the spin_unlock() and the rcu_read_unlock() *should*
> be entirely immaterial, and the order of unlocking simply shouldn't
> matter.
>
I agree. So there could be a lock/RCU bug here ;-)
> So it may indeed be that relying on the ordering of RCU read unlock
> matters for the case of explicit unlocks and the implicit unlocks by a
> spinlock, but that's not great.
>
> Does the rcu_read_unlock() code perhaps only check the RCU count, not
> the atomicity count? That would explain it, and looking at
> __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> case (rcu_preempt_read_exit() seems to only check
> rcu_read_lock_nesting).
>
Notice in rcu_read_unlock_special(), atomicity/preemption count is
checked in two ways:
a) if it's called with the preemption count being 0, then good, a quiescent
state will be reported to unblock the RCU grace period.
b) if it's called with preemption disabled, there are two cases:
1. use_softirq is true and either we are in a hard irq or we need an
expedited GP; the RCU softirq will be raised, but in it, rcu_core() will
guard the rcu_preempt_deferred_qs() with preempt count checking.
2. otherwise, RCU will just rely on __schedule() to call
rcu_note_context_switch() to report a QS, hence the preempt count
is also respected.
@Jeff, just want to confirm, this UAF can be reproduced without
PREEMPT_RT=y, right?
Regards,
Boqun
> This is most likely dependent on kernel config options, and it may all
> be unavoidable becasue RCU might not even have a way to tell that it's
> still in a critical region without preemption counts etc.
>
> But if it's unavoidable, we need to update the docs about this gotcha,
> and we need to have some tooling scan our existing code for this
> documentation change.
>
> RCU people - what do you think?
>
> Linus
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 16:27 ` Boqun Feng
@ 2026-04-10 17:31 ` Linus Torvalds
2026-04-10 18:11 ` Paul E. McKenney
2026-04-10 18:21 ` Jeff Layton
0 siblings, 2 replies; 63+ messages in thread
From: Linus Torvalds @ 2026-04-10 17:31 UTC (permalink / raw)
To: Boqun Feng
Cc: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 10 Apr 2026 at 09:27, Boqun Feng <boqun@kernel.org> wrote:
>
> Notice in rcu_read_unlock_special(), atomicity/preemption count is
> checked in two ways:
Yeah, I didn't go that deep ... That code is inscrutable.
Raising a softirq looks pretty strange in there, since there's
absolutely no reason to believe that the preemption-off region has
ended before the softirq is triggered. It makes sense for the "we're
in hardirq context", but it also does it for that (needs_exp &&
!irqs_were_disabled()) case
So then it depends on the softirq checking it *again*. That all seems
a bit strange.
But it looks like rcu_core_si() does do that, and then falls back to
if (rcu_preempt_need_deferred_qs())
set_need_resched_current();
and it feels like maybe the rcu_read_unlock_special() should just have
done that itself instead of playing softirq games.
Maybe it's some odd optimization, but going through the softirq code
doesn't _feel_ like it's the right thing to do.
I don't know this code well enough to judge.
Let's wait to see if Al's patch actually makes any difference. It
*shouldn't* make any difference, and maybe it doesn't. No need to
start chasing RCU bugs that may not even be there.
Jeff?
> @Jeff, just want to confirm, this UAF can be reproduced without
> PREEMPT_RT=y, right?
I haven't seen Jeff's config, but some of the syzbot reports that
*may* be related have configs that look pretty normal and didn't have
PREEMPT_RT.
Linus
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 17:31 ` Linus Torvalds
@ 2026-04-10 18:11 ` Paul E. McKenney
2026-04-10 18:21 ` Jeff Layton
1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 18:11 UTC (permalink / raw)
To: Linus Torvalds
Cc: Boqun Feng, Al Viro, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara,
Sebastian Andrzej Siewior
On Fri, Apr 10, 2026 at 10:31:32AM -0700, Linus Torvalds wrote:
> On Fri, 10 Apr 2026 at 09:27, Boqun Feng <boqun@kernel.org> wrote:
> >
> > Notice in rcu_read_unlock_special(), atomicity/preemption count is
> > checked in two ways:
>
> Yeah, I didn't go that deep ... That code is inscrutable.
We are working to make it better, but this will be a long process.
> Raising a softirq looks pretty strangein there, since there's
> absolutely no reason to believe that the preemption-off region has
> ended before the softirq is triggered. It makes sense for the "we're
> in hardirq context", but it also does it for that (needs_exp &&
> !irqs_were_disabled()) case
>
> So then it depends on the softirq checking it *again*. That all seems
> a bit strange.
>
> But it looks like rcu_core_si() does do that, and then falls back to
>
> if (rcu_preempt_need_deferred_qs())
> set_need_resched_current();
>
> and it feels like maybe the rcu_read_unlock_special() should just have
> done that itself instead of playing softirq games.
>
> Maybe it's some off optimization, but going through the softirq code
> doesn't _feel_ like it's the right thing to do.
>
> I don't know this code well enough to judge.
We could quite easily add another leg to the inner "if" statement
in rcu_read_unlock_special() so that it does raise_softirq_irqoff()
if in_hardirq() and set_need_resched_current() otherwise, as in the
(untested, probably doesn't even build) first patch at the end of this
email. My feeling is that it is not worth it, but on the other hand,
my idea of "simplicity" might or might not serve the people reading
this code.
> Let's wait to see if Al's patch actually makes any difference. It
> *shouldn't* make any difference, and maybe it doesn't. No need to
> start chasing RCU bugs that may not even be there.
Fair enough!
> Jeff?
>
> > @Jeff, just want to confirm, this UAF can be reproduced without
> > PREEMPT_RT=y, right?
>
> I haven't seen Jeff's config, but some of the syzbot reports that
> *may* be related have configs that look pretty normal and didn't have
> PREEMPT_RT.
And I am concerned about PREEMPT_RT, please see the second patch below.
But Sebastian will no doubt tell me the error of my ways. My question is
"why is it safe to exit the RCU reader before the lock is fully released".
Sebastian's answer might well be that all the valid use cases for
combining locking and RCU have a lock acquisition on the free path.
Whether I believe that or not, I have no idea. ;-)
------------------------------------------------------------------------
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 95ad967adcf3c..066b6fe6b6a3f 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -749,7 +749,10 @@ static void rcu_read_unlock_special(struct task_struct *t)
// GP in flight or a potential need to deboost.
if (rdp->defer_qs_pending != DEFER_QS_PENDING) {
rdp->defer_qs_pending = DEFER_QS_PENDING;
- raise_softirq_irqoff(RCU_SOFTIRQ);
+ if (in_hardirq())
+ raise_softirq_irqoff(RCU_SOFTIRQ);
+ else
+ set_need_resched_current();
}
} else {
// Enabling BH or preempt does reschedule, so...
------------------------------------------------------------------------
diff --git a/kernel/locking/spinlock_rt.c b/kernel/locking/spinlock_rt.c
index db1e11b45de67..fbf4deba4029c 100644
--- a/kernel/locking/spinlock_rt.c
+++ b/kernel/locking/spinlock_rt.c
@@ -79,10 +79,10 @@ void __sched rt_spin_unlock(spinlock_t *lock) __releases(RCU)
{
spin_release(&lock->dep_map, _RET_IP_);
migrate_enable();
- rcu_read_unlock();
if (unlikely(!rt_mutex_cmpxchg_release(&lock->lock, current, NULL)))
rt_mutex_slowunlock(&lock->lock);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(rt_spin_unlock);
@@ -262,17 +262,17 @@ void __sched rt_read_unlock(rwlock_t *rwlock) __releases(RCU)
{
rwlock_release(&rwlock->dep_map, _RET_IP_);
migrate_enable();
- rcu_read_unlock();
rwbase_read_unlock(&rwlock->rwbase, TASK_RTLOCK_WAIT);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(rt_read_unlock);
void __sched rt_write_unlock(rwlock_t *rwlock) __releases(RCU)
{
rwlock_release(&rwlock->dep_map, _RET_IP_);
- rcu_read_unlock();
migrate_enable();
rwbase_write_unlock(&rwlock->rwbase);
+ rcu_read_unlock();
}
EXPORT_SYMBOL(rt_write_unlock);
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 17:31 ` Linus Torvalds
2026-04-10 18:11 ` Paul E. McKenney
@ 2026-04-10 18:21 ` Jeff Layton
2026-04-10 19:19 ` Al Viro
1 sibling, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 18:21 UTC (permalink / raw)
To: Linus Torvalds, Boqun Feng
Cc: Al Viro, Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Uladzislau Rezki, linux-fsdevel,
Christian Brauner, Jan Kara, Nikolay Borisov, Max Kellermann,
Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 10:31 -0700, Linus Torvalds wrote:
> On Fri, 10 Apr 2026 at 09:27, Boqun Feng <boqun@kernel.org> wrote:
> >
> > Notice in rcu_read_unlock_special(), atomicity/preemption count is
> > checked in two ways:
>
> Yeah, I didn't go that deep ... That code is inscrutable.
>
> Raising a softirq looks pretty strangein there, since there's
> absolutely no reason to believe that the preemption-off region has
> ended before the softirq is triggered. It makes sense for the "we're
> in hardirq context", but it also does it for that (needs_exp &&
> !irqs_were_disabled()) case
>
> So then it depends on the softirq checking it *again*. That all seems
> a bit strange.
>
> But it looks like rcu_core_si() does do that, and then falls back to
>
> if (rcu_preempt_need_deferred_qs())
> set_need_resched_current();
>
> and it feels like maybe the rcu_read_unlock_special() should just have
> done that itself instead of playing softirq games.
>
> Maybe it's some off optimization, but going through the softirq code
> doesn't _feel_ like it's the right thing to do.
>
> I don't know this code well enough to judge.
>
> Let's wait to see if Al's patch actually makes any difference. It
> *shouldn't* make any difference, and maybe it doesn't. No need to
> start chasing RCU bugs that may not even be there.
>
> Jeff?
>
> > @Jeff, just want to confirm, this UAF can be reproduced without
> > PREEMPT_RT=y, right?
>
> I haven't seen Jeff's config, but some of the syzbot reports that
> *may* be related have configs that look pretty normal and didn't have
> PREEMPT_RT.
>
>
I've not been able to reproduce the UAF at all.
It was possible to reproduce the hang that we were seeing by injecting
an mdelay() just after the cond_resched() in __dentry_kill(), but Al's
other series seems to have fixed that.
The UAF seems to be dependent on those hangs somehow, but we were never
able to get one to fire from a reproducer attempt. They just
occasionally pop up when these stalls occur.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:21 ` Jeff Layton
@ 2026-04-10 19:19 ` Al Viro
2026-04-10 19:32 ` Jeff Layton
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 19:19 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, Boqun Feng, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Uladzislau Rezki,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 02:21:39PM -0400, Jeff Layton wrote:
> I've not been able to reproduce the UAF at all.
>
> It was possible to reproduce the hang that we were seeing by injecting
> an mdelay() just after the cond_resched() in __dentry_kill(), but Al's
> other series seems to have fixed that.
Elimination of busy-wait is likely to hide whatever's going on, so it
shouldn't be there when hunting for reproducer.
> The UAF seems to be dependent on those hangs somehow, but we were never
> able to get one to fire from a reproducer attempt. They just
> occasionally pop up when these stalls occur.
Not even with mdelay() stuck in there? FWIW, another UAF report in the
same general area is parisc buildd running into that when building
a specific package (webkit2gtk); Debian kernels 6.16..6.19 (at least),
no idea which config it is...
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 19:19 ` Al Viro
@ 2026-04-10 19:32 ` Jeff Layton
2026-04-10 21:13 ` Calvin Owens
0 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 19:32 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Boqun Feng, Paul E. McKenney, Frederic Weisbecker,
Neeraj Upadhyay, Joel Fernandes, Josh Triplett, Uladzislau Rezki,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 20:19 +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 02:21:39PM -0400, Jeff Layton wrote:
>
> > I've not been able to reproduce the UAF at all.
> >
> > It was possible to reproduce the hang that we were seeing by injecting
> > an mdelay() just after the cond_resched() in __dentry_kill(), but Al's
> > other series seems to have fixed that.
>
> Elimination of busy-wait is likely to hide whatever's going on, so it
> shouldn't be there when hunting for reproducer.
>
This was before that. I know Gustavo also tried to reproduce it and he
was able to hit the hang but no UAF.
> > The UAF seems to be dependent on those hangs somehow, but we were never
> > able to get one to fire from a reproducer attempt. They just
> > occasionally pop up when these stalls occur.
>
> Not even with mdelay() stuck in there? FWIW, another UAF report in the
> same general area is parisc buildd running into that when building
> a specific package (webkit2gtk); Debian kernels 6.16..6.19 (at least),
> no idea which config it is...
Yep, not even with that. One thing that Claude pointed out: doing an
mdelay() might prevent it from making any progress freeing dentries.
I'm experimenting now with one that only randomly does a delay, but so
far that hasn't turned up anything. I'll let you know if that changes.
--
Jeff Layton <jlayton@kernel.org>
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 19:32 ` Jeff Layton
@ 2026-04-10 21:13 ` Calvin Owens
2026-04-10 21:24 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Calvin Owens @ 2026-04-10 21:13 UTC (permalink / raw)
To: Jeff Layton
Cc: Al Viro, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> Yep, not even with that. One thing that claude pointed out: doing an
> mdelay() might prevent it from making any progress freeing dentries.
> I'm experimenting now with one that only randomly does a delay, but so
> far that hasn't turned up anything. I'll let you know if that changes.
Hi Jeff,
I've been poking at this one too. With the attached reproducer, the
waiter will get stuck in an infinite loop apparently in d_walk() with a
stock 7.0-rc7 (in my case with PREEMPT_LAZY):
d_walk
list_empty (inlined)
shrink_dcache_tree
d_invalidate
proc_invalidate_siblings_dcache
release_task
wait_task_zombie (inlined)
wait_consider_task
do_wait_thread (inlined)
__do_wait
do_wait
kernel_wait4
__do_sys_wait4 (inlined)
__se_sys_wait4 (inlined)
__x64_sys_wait4
do_syscall_64
entry_SYSCALL_64_after_hwframe
__internal_syscall_cancel
__syscall_cancel
__wait4 (inlined)
os_wait_impl (inlined)
os_wait
_Py_LeaveRecursiveCallTstate (inlined)
cfunction_vectorcall_NOARGS
_PyObject_VectorcallTstate (inlined)
PyObject_Vectorcall
It doesn't happen 100% of the time, but it's never taken more than three
attempts on two different machines. I've SIGKILL'd it and waited over an
hour, it really never comes back. Making NR_FORKS larger than 3 was
counterproductive on my 32-thread machines, but YMMV of course.
I haven't had time to dig more yet, but wanted to share the reproducer,
hopefully it yields something useful on your end.
I don't think this is Al's livelock, because only one thread is
spinning on an otherwise idle machine... but I apologize if it is and
it's going over my head.
Cheers,
Calvin
---
#!/usr/bin/python3
# sudo prlimit --nofile=10000000:10000000 ./repro.py
import os
NR_FORKS = 3
for _ in range(NR_FORKS):
    pid = os.fork()
    if pid == 0:
        break

for dirpath, _, filenames in os.walk("/proc"):
    for fn in filenames:
        try:
            os.open(os.path.join(dirpath, fn), os.O_RDONLY)
        except:
            pass

if pid == 0:
    os._exit(0)

while True:
    try:
        os.wait()
    except ChildProcessError:
        os._exit(0)
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 21:13 ` Calvin Owens
@ 2026-04-10 21:24 ` Al Viro
2026-04-10 22:15 ` Calvin Owens
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 21:24 UTC (permalink / raw)
To: Calvin Owens
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > Yep, not even with that. One thing that claude pointed out: doing an
> > mdelay() might prevent it from making any progress freeing dentries.
> > I'm experimenting now with one that only randomly does a delay, but so
> > far that hasn't turned up anything. I'll let you know if that changes.
>
> Hi Jeff,
>
> I've been poking at this one too. With the attached reproducer, the
> waiter will get stuck in an infinite loop apparently in d_walk() with a
> stock 7.0-rc7 (in my case with PREEMPT_LAZY):
Interesting... I'll try it in a bit, but just in case - could you check if
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
steps into the same thing on your setup?
^ permalink raw reply [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 21:24 ` Al Viro
@ 2026-04-10 22:15 ` Calvin Owens
2026-04-10 23:05 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Calvin Owens @ 2026-04-10 22:15 UTC (permalink / raw)
To: Al Viro
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Friday 04/10 at 22:24 +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> > On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > > Yep, not even with that. One thing that claude pointed out: doing an
> > > mdelay() might prevent it from making any progress freeing dentries.
> > > I'm experimenting now with one that only randomly does a delay, but so
> > > far that hasn't turned up anything. I'll let you know if that changes.
> >
> > Hi Jeff,
> >
> > I've been poking at this one too. With the attached reproducer, the
> > waiter will get stuck in an infinite loop apparently in d_walk() with a
> > stock 7.0-rc7 (in my case with PREEMPT_LAZY):
>
> Interesting... I'll try it in a bit, but just in case - could you check if
> git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> steps into the same thing on your setup?
Hi Al,
It still reproduces on that branch in the exact same way.
I'm absolutely 100% guessing... but I recall tomfoolery with the "magic"
symlinks in /proc/$/fd/* in the past... the below patch makes the infinite
spins impossible to trigger with my reproducer, does my explanation make
any sense to you? Apologies if I'm off in the weeds...
Thanks,
Calvin
---
From: Calvin Owens <calvin@wbinvd.org>
Subject: [PATCH] fs/proc: Implement d_weak_revalidate()
I think this is necessary because ND_JUMPED can happen in /proc when
the magic symlinks in /proc/$/fd/* point to files in /proc. If true,
this is a problem, because the default behavior for d_weak_revalidate()
is to assume the jumped-to dentries are always valid.
Fix by checking the in_use count in d_weak_revalidate() for procfs, the
same way it is checked in procfs d_revalidate() today.
Signed-off-by: Calvin Owens <calvin@wbinvd.org>
---
fs/proc/generic.c | 22 +++++++++++++++++-----
1 file changed, 17 insertions(+), 5 deletions(-)
diff --git a/fs/proc/generic.c b/fs/proc/generic.c
index 501889856461..359176e5eeb3 100644
--- a/fs/proc/generic.c
+++ b/fs/proc/generic.c
@@ -216,15 +216,26 @@ void proc_free_inum(unsigned int inum)
 	ida_free(&proc_inum_ida, inum - PROC_DYNAMIC_FIRST);
 }
 
+static int __proc_misc_revalidate(struct dentry *dentry)
+{
+	if (atomic_read(&PDE(d_inode(dentry))->in_use) < 0)
+		return 0; /* revalidate */
+	return 1;
+}
+
 static int proc_misc_d_revalidate(struct inode *dir, const struct qstr *name,
 				  struct dentry *dentry, unsigned int flags)
 {
 	if (flags & LOOKUP_RCU)
 		return -ECHILD;
-	if (atomic_read(&PDE(d_inode(dentry))->in_use) < 0)
-		return 0; /* revalidate */
-	return 1;
+	return __proc_misc_revalidate(dentry);
+}
+
+static int proc_misc_d_weak_revalidate(struct dentry *dentry,
+				       unsigned int flags)
+{
+	return __proc_misc_revalidate(dentry);
 }
 
 static int proc_misc_d_delete(const struct dentry *dentry)
@@ -233,8 +244,9 @@ static int proc_misc_d_delete(const struct dentry *dentry)
 }
 
 static const struct dentry_operations proc_misc_dentry_ops = {
-	.d_revalidate	= proc_misc_d_revalidate,
-	.d_delete	= proc_misc_d_delete,
+	.d_revalidate		= proc_misc_d_revalidate,
+	.d_weak_revalidate	= proc_misc_d_weak_revalidate,
+	.d_delete		= proc_misc_d_delete,
 };
 
 /*
--
2.47.3
^ permalink raw reply related [flat|nested] 63+ messages in thread
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 22:15 ` Calvin Owens
@ 2026-04-10 23:05 ` Al Viro
2026-04-10 23:30 ` Calvin Owens
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 23:05 UTC (permalink / raw)
To: Calvin Owens
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Fri, Apr 10, 2026 at 03:15:16PM -0700, Calvin Owens wrote:
> On Friday 04/10 at 22:24 +0100, Al Viro wrote:
> > On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> > > On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > > > Yep, not even with that. One thing that claude pointed out: doing an
> > > > mdelay() might prevent it from making any progress freeing dentries.
> > > > I'm experimenting now with one that only randomly does a delay, but so
> > > > far that hasn't turned up anything. I'll let you know if that changes.
> > >
> > > Hi Jeff,
> > >
> > > I've been poking at this one too. With the attached reproducer, the
> > > waiter will get stuck in an infinite loop apparently in d_walk() with a
> > > stock 7.0-rc7 (in my case with PREEMPT_LAZY):
> >
> > Interesting... I'll try it in a bit, but just in case - could you check if
> > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> > steps into the same thing on your setup?
>
> Hi Al,
>
> It still reproduces on that branch in the exact same way.
>
> I'm absolutely 100% guessing... but I recall tomfoolery with the "magic"
> symlinks in /proc/$/fd/* in the past... the below patch makes the infinite
> spins impossible to trigger with my reproducer, does my explanation make
> any sense to you? Apologies if I'm off in the weeds...
What it's doing, AFAICS, is a weird way of opening arseloads of files, mostly
the same ones again and again - the reason it needs that prlimit (there's
way fewer files in procfs when it starts) is that these threads keep running
through each other's /proc/*/fd and opening what they'd got again and again,
adding to /proc/*/fd/* as they go.
IOW, you are opening a random mix of files, both procfs and ones some processes
had opened, then exiting. That triggers invalidation of your /proc/*, with
a _lot_ of shite that needs to be taken out. It might be triggering a livelock
or a UAF somewhere in dcache or it might be something entirely different -
no idea at that point. I'll look further into that, but I wouldn't be surprised
if it turns out to be entirely unrelated. Would be easier to deal with if the
mix had been more predictable, but we have what we have...
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 23:05 ` Al Viro
@ 2026-04-10 23:30 ` Calvin Owens
2026-04-11 0:51 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Calvin Owens @ 2026-04-10 23:30 UTC (permalink / raw)
To: Al Viro
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Saturday 04/11 at 00:05 +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 03:15:16PM -0700, Calvin Owens wrote:
> > On Friday 04/10 at 22:24 +0100, Al Viro wrote:
> > > On Fri, Apr 10, 2026 at 02:13:10PM -0700, Calvin Owens wrote:
> > > > On Friday 04/10 at 15:32 -0400, Jeff Layton wrote:
> > > > > Yep, not even with that. One thing that claude pointed out: doing an
> > > > > mdelay() might prevent it from making any progress freeing dentries.
> > > > > I'm experimenting now with one that only randomly does a delay, but so
> > > > > far that hasn't turned up anything. I'll let you know if that changes.
> > > >
> > > > Hi Jeff,
> > > >
> > > > I've been poking at this one too. With the attached reproducer, the
> > > > waiter will get stuck in an infinite loop apparently in d_walk() with a
> > > > stock 7.0-rc7 (in my case with PREEMPT_LAZY):
> > >
> > > Interesting... I'll try it in a bit, but just in case - could you check if
> > > git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs.git #work.dcache-busy-wait
> > > steps into the same thing on your setup?
> >
> > Hi Al,
> >
> > It still reproduces on that branch in the exact same way.
> >
> > I'm absolutely 100% guessing... but I recall tomfoolery with the "magic"
> > symlinks in /proc/$/fd/* in the past... the below patch makes the infinite
> > spins impossible to trigger with my reproducer, does my explanation make
> > any sense to you? Apologies if I'm off in the weeds...
>
> What it's doing, AFAICS, is a weird way of opening arseloads of files, mostly
> the same ones again and again - the reason it needs that prlimit (there's
> way fewer files in procfs when it starts) is that these threads keep running
> through each other's /proc/*/fd and opening what they'd got again and again,
> adding to /proc/*/fd/* as they go.
Yes, exactly. I was trying a lot of different pathological behaviors, and
that was the first pattern I found that triggered the d_walk() spin
consistently.
Initially I made the reproducer ignore itself in /proc. But it turns out
to be much more reliable if it *doesn't*. I realized it was opening its
own children's files in /proc/*/fd/*, and the children were also opening
other /proc files, which is what made me suspect the magic symlinks.
> IOW, you are opening a random mix of files, both procfs and ones some processes
> had opened, then exiting. That triggers invalidation of your /proc/*, with
> a _lot_ of shite that needs to be taken out. It might be triggering a livelock
> or a UAF somewhere in dcache or it might be something entirely different -
> no idea at that point. I'll look further into that, but I wouldn't be surprised
> if it turns out to be entirely unrelated. Would be easier to deal with if the
> mix had been more predictable, but we have what we have...
I'm sure I can narrow down the reproducer more, I'll try a bit.
I appreciate you taking a look. I'm happy to test anything else you like
if that reproducer isn't working for you, it's very consistent for me on
two different machines. Hopefully it works for Jeff.
Thanks,
Calvin
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 23:30 ` Calvin Owens
@ 2026-04-11 0:51 ` Al Viro
0 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-11 0:51 UTC (permalink / raw)
To: Calvin Owens
Cc: Jeff Layton, Linus Torvalds, Boqun Feng, Paul E. McKenney,
Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Uladzislau Rezki, linux-fsdevel, Christian Brauner,
Jan Kara, Nikolay Borisov, Max Kellermann, Eric Sandeen,
Paulo Alcantara
On Fri, Apr 10, 2026 at 04:30:32PM -0700, Calvin Owens wrote:
> Yes exactly, I was trying a lot of different pathological behaviors and
> that was the first pattern I found that triggered the d_walk() spin
> consistently.
>
> Initially I made the reproducer ignore itself in /proc. But it turns out
> to be much more reliable if it *doesn't*. I realized it was opening its
> own children's files in /proc/*/fd/*, and the children were also opening
> other /proc files, which is what made me suspect the magic symlinks.
>
> > IOW, you are opening a random mix of files, both procfs and ones some processes
> > had opened, then exiting. That triggers invalidation of your /proc/*, with
> > a _lot_ of shite that needs to be taken out. It might be triggering a livelock
> > or a UAF somewhere in dcache or it might be something entirely different -
> > no idea at that point. I'll look further into that, but I wouldn't be surprised
> > if it turns out to be entirely unrelated. Would be easier to deal with if the
> > mix had been more predictable, but we have what we have...
>
> I'm sure I can narrow down the reproducer more, I'll try a bit.
No need... Just have parent and child share descriptor table, then let parent
wait for child and child do
	char buf[128], buf2[128];
	int fd = 0, n;

	do {
		n = fd;
		sprintf(buf, "/proc/self/fdinfo/%d", fd);
		fd = open(buf, 0);
	} while (fd >= 0);
	for (int i = 0; i <= n; i++) {
		sprintf(buf, "/proc/self/fd/%d", i);
		readlink(buf, buf2, sizeof(buf2));
	}
and that's it.
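[For reference, a self-contained userspace sketch of the recipe above (my
wrapping, not Al's code): a thread stands in for the CLONE_FILES child,
since threads share their creator's descriptor table, and the loop is
capped so the sketch stays cheap; fd_walker() and run_reproducer() are
made-up names.]

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* The child loop from the mail: each successful open of
 * /proc/self/fdinfo/<fd> adds yet another fd to the shared table,
 * then the magic symlinks in /proc/self/fd/* get traversed. */
static void *fd_walker(void *ret)
{
	char buf[128], buf2[128];
	int fd = 0, n = 0;

	do {
		n = fd;
		snprintf(buf, sizeof(buf), "/proc/self/fdinfo/%d", fd);
		fd = open(buf, O_RDONLY);
	} while (fd >= 0 && fd < 4096);	/* cap added for the sketch */
	for (int i = 0; i <= n; i++) {
		snprintf(buf, sizeof(buf), "/proc/self/fd/%d", i);
		readlink(buf, buf2, sizeof(buf2));
	}
	*(int *)ret = n;
	return NULL;
}

/* "Parent": spawn the walker, wait for it, report the highest fd seen. */
static int run_reproducer(void)
{
	pthread_t t;
	int n = 0;

	pthread_create(&t, NULL, fd_walker, &n);
	pthread_join(t, NULL);
	return n;
}
```

Each open leaves a live procfs dentry behind, so by the time the process
exits there is a large pile of them to invalidate.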
That's a nice demonstration of how nasty conditions can be arranged for
d_invalidate(); *plenty* of busy dentries in the tree (to the tune of
several millions), with some evictables scattered here and there.
If you get enough busy stuff to wade through until you find an eviction
candidate, you'll get select_collect() return D_WALK_QUIT as soon as you
get a single evictable sucker. Of course, then you need to restart
the whole thing - with the same pile of busy ones to get through. Repeat
for every single evictable left in there.
No memory corruption involved, just a fuckton of rescans ;-/
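[A back-of-the-envelope cost model of that rescan pattern, not kernel
code; rescan_cost() is a made-up name. Each pass pays for every busy
dentry before it can kill a single evictable one, so the total walk cost
grows as busy * evictable.]

```c
/* Toy model: select_collect() wades through all `busy` dentries, bails
 * out (D_WALK_QUIT) at the first evictable one, and the whole walk then
 * restarts from scratch for the next victim. */
static long rescan_cost(long busy, long evictable)
{
	long steps = 0;

	while (evictable > 0) {
		steps += busy + 1;	/* one full pass: busy + the victim */
		evictable--;		/* victim killed; restart the walk */
	}
	return steps;
}
```

With millions of busy dentries and even a handful of evictable ones
scattered among them, the product gets painful fast.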
It *is* recoverable - you just need to flush caches hard enough and that'll
unwedge the sucker.
IMO that's a very convincing argument for not using shrink_dcache_parent()
in d_invalidate().
That goes back to 2.1.65 - very early in dcache history. Back then (and
until about 2014, IIRC) it had been "try to evict the subtree, unhash
if they are all gone, fail otherwise"; these days it's worse, since
d_invalidate() is not allowed to fail.
"Rescan if we have found something and ran out of timeslice" is there since
2005 mingo's 116194f29f13 "[PATCH] sched: vfs: fix scheduling latencies in
prune_dcache() and select_parent()" - latency problems prior to that got
traded for this "rescan the same pile of busy stuff for each evictable"
pathological behaviour.
FWIW, it doesn't have to be on procfs - you can arrange for something similar
with fuse.
Joy...
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
2026-04-10 15:57 ` Al Viro
2026-04-10 16:27 ` Boqun Feng
@ 2026-04-10 17:32 ` Paul E. McKenney
2026-04-10 18:26 ` Jeff Layton
2026-04-10 18:52 ` Al Viro
3 siblings, 1 reply; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 17:32 UTC (permalink / raw)
To: Linus Torvalds
Cc: Al Viro, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, Jeff Layton,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
>
> and the fairly long thread associated with it for context ]
>
> On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> Hmm.
>
> I think this patch may indeed fix the problem, and I don't mind how it looks.
>
> But while I think the patch looks fine, I am still quite unhappy about
> it if it matters - because we have very much documented that spinlocks
> in themselves are also RCU read locks:
>
> Note that anything that disables bottom halves, preemption,
> or interrupts also enters an RCU read-side critical section.
> Acquiring a spinlock also enters an RCU read-side critical
> section, even for spinlocks that do not disable preemption,
> as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> Sleeplocks do *not* enter RCU read-side critical sections.
>
> so *if* this makes a difference, I think it's actually implying that
> there's something wrong with our "rcu_read_unlock()" implementation,
> and that it doesn't nest properly.
>
> Because our documentation also makes it very clear that this should
> all work as-is, and your patch should be a complete no-op. Just a few
> lines later in that core RCU doc, we have
>
> Note that RCU read-side critical sections may be nested and/or
> overlapping.
>
> so the order of the spin_unlock() and the rcu_read_unlock() *should*
> be entirely immaterial, and the order of unlocking simply shouldn't
> matter.
>
> So it may indeed be that relying on the ordering of RCU read unlock
> matters for the case of explicit unlocks and the implicit unlocks by a
> spinlock, but that's not great.
>
> Does the rcu_read_unlock() code perhaps only check the RCU count, not
> the atomicity count? That would explain it, and looking at
> __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> case (rcu_preempt_read_exit() seems to only check
> rcu_read_lock_nesting).
>
> This is most likely dependent on kernel config options, and it may all
> be unavoidable becasue RCU might not even have a way to tell that it's
> still in a critical region without preemption counts etc.
>
> But if it's unavoidable, we need to update the docs about this gotcha,
> and we need to have some tooling scan our existing code for this
> documentation change.
>
> RCU people - what do you think?
Adding to Boqun's email:
The https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV URL
has preemption disabled (right, Jeff?), in which case rcu_read_lock()
is just preempt_disable() and rcu_read_unlock() is just preempt_enable().
All the spinlocks also disable and enable preemption, so if you have
something like this:
	rcu_read_lock();
	do_something();
	spin_lock(...);
	do_something_else();
	rcu_read_unlock();
	do_yet_another_thing();
	spin_unlock(...);
Then there should not be any calls to schedule() between the initial
rcu_read_lock() and the final spin_unlock(). But this is just
preempt_disable() doing its job, not much to do with RCU.
And repeating Boqun's email:
Now if this is a preemptible kernel, things are more involved.
However, in this case we know that the kernel has been built with
CONFIG_PREEMPT_COUNT=y. So rcu_read_unlock_special() can sample
preempt_count(), and preempt_bh_were_disabled is then true if either
preemption or softirq are disabled. Then irqs_were_disabled is set if
interrupts were disabled on entry. If either are set, the inner pair of
"if" statements set up to defer the end of the read-side critical section.
If this is important, which it might be, I will go through this in more
detail later.
Otherwise, the call to rcu_preempt_deferred_qs_irqrestore() at the end
of rcu_read_unlock_special() immediately ends the RCU read-side critical
section.
And more adding to Boqun's email:
Now, CONFIG_PREEMPT_RT=y is a special case, because "spinlocks"
are preemptible. So __rt_spin_lock() and __rt_spin_trylock() both do
rcu_read_lock() and rt_spin_unlock() does rcu_read_unlock(). So that
should be covered via RCU nesting along with the rcu_read_unlock_special()
checks for disabled interrupts, softirq, and preemption.
And rcutorture tests random overlapping of RCU readers with all those
disabled regions. But it might still be missing something. And I
will add spinlocks to the mix in addition to the preemption disabling.
In the meantime, is there something more specific that I should be
looking at?
Thanx, Paul
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 17:32 ` Paul E. McKenney
@ 2026-04-10 18:26 ` Jeff Layton
2026-04-10 18:36 ` Paul E. McKenney
0 siblings, 1 reply; 63+ messages in thread
From: Jeff Layton @ 2026-04-10 18:26 UTC (permalink / raw)
To: paulmck, Linus Torvalds
Cc: Al Viro, Frederic Weisbecker, Neeraj Upadhyay, Joel Fernandes,
Josh Triplett, Boqun Feng, Uladzislau Rezki, linux-fsdevel,
Christian Brauner, Jan Kara, Nikolay Borisov, Max Kellermann,
Eric Sandeen, Paulo Alcantara
On Fri, 2026-04-10 at 10:32 -0700, Paul E. McKenney wrote:
> On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> > [ Adding RCU maintainers to the participants: see
> >
> > https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
> >
> > and the fairly long thread associated with it for context ]
> >
> > On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> >
> > Hmm.
> >
> > I think this patch may indeed fix the problem, and I don't mind how it looks.
> >
> > But while I think the patch looks fine, I am still quite unhappy about
> > it if it matters - because we have very much documented that spinlocks
> > in themselves are also RCU read locks:
> >
> > Note that anything that disables bottom halves, preemption,
> > or interrupts also enters an RCU read-side critical section.
> > Acquiring a spinlock also enters an RCU read-side critical
> > section, even for spinlocks that do not disable preemption,
> > as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> > Sleeplocks do *not* enter RCU read-side critical sections.
> >
> > so *if* this makes a difference, I think it's actually implying that
> > there's something wrong with our "rcu_read_unlock()" implementation,
> > and that it doesn't nest properly.
> >
> > Because our documentation also makes it very clear that this should
> > all work as-is, and your patch should be a complete no-op. Just a few
> > lines later in that core RCU doc, we have
> >
> > Note that RCU read-side critical sections may be nested and/or
> > overlapping.
> >
> > so the order of the spin_unlock() and the rcu_read_unlock() *should*
> > be entirely immaterial, and the order of unlocking simply shouldn't
> > matter.
> >
> > So it may indeed be that relying on the ordering of RCU read unlock
> > matters for the case of explicit unlocks and the implicit unlocks by a
> > spinlock, but that's not great.
> >
> > Does the rcu_read_unlock() code perhaps only check the RCU count, not
> > the atomicity count? That would explain it, and looking at
> > __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> > case (rcu_preempt_read_exit() seems to only check
> > rcu_read_lock_nesting).
> >
> > This is most likely dependent on kernel config options, and it may all
> > be unavoidable becasue RCU might not even have a way to tell that it's
> > still in a critical region without preemption counts etc.
> >
> > But if it's unavoidable, we need to update the docs about this gotcha,
> > and we need to have some tooling scan our existing code for this
> > documentation change.
> >
> > RCU people - what do you think?
>
> Adding to Boqun's email:
>
> The https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV URL
> has preemption disabled (right, Jeff?)
>
Not as familiar with this part of the kernel, but I think that's
correct:
# zgrep PREEMPT /proc/config.gz
CONFIG_PREEMPT_NONE_BUILD=y
CONFIG_ARCH_HAS_PREEMPT_LAZY=y
CONFIG_PREEMPT_NONE=y
# CONFIG_PREEMPT_VOLUNTARY is not set
# CONFIG_PREEMPT is not set
# CONFIG_PREEMPT_LAZY is not set
# CONFIG_PREEMPT_RT is not set
# CONFIG_PREEMPT_DYNAMIC is not set
CONFIG_HAVE_PREEMPT_DYNAMIC=y
CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
CONFIG_PREEMPT_NOTIFIERS=y
# CONFIG_PREEMPTIRQ_DELAY_TEST is not set
> , in which case rcu_read_lock()
> is just preempt_disable() and rcu_read_unlock() is just preempt_enable().
> All the spinlocks also disable and enable preemption, so if you have
> something like this:
>
> 	rcu_read_lock();
> 	do_something();
> 	spin_lock(...);
> 	do_something_else();
> 	rcu_read_unlock();
> 	do_yet_another_thing();
> 	spin_unlock(...);
>
> Then there should not be any calls to schedule() between the initial
> rcu_read_lock() and the final spin_unlock(). But this is just
> preempt_disable() doing its job, not much to do with RCU.
>
> And repeating Boqun's email:
>
> Now if this is a preemptible kernel, things are more involved.
> However, in this case we know that the kernel has been built with
> CONFIG_PREEMPT_COUNT=y. So rcu_read_unlock_special() can sample
> preempt_count(), and preempt_bh_were_disabled is then true if either
> preemption or softirq are disabled. Then irqs_were_disabled is set if
> interrupts were disabled on entry. If either are set, the inner pair of
> "if" statements set up to defer the end of the read-side critical section.
> If this is important, which it might be, I will go through this in more
> detail later.
>
> Otherwise, the call to rcu_preempt_deferred_qs_irqrestore() at the end
> of rcu_read_unlock_special() immediately ends the RCU read-side critical
> section.
>
> And more adding to Boqun's email:
>
> Now, CONFIG_PREEMPT_RT=y is a special case, because "spinlocks"
> are preemptible. So __rt_spin_lock() and __rt_spin_trylock() both do
> rcu_read_lock() and rt_spin_unlock() does rcu_read_unlock(). So that
> should be covered via RCU nesting along with the rcu_read_unlock_special()
> checks for disabled interrupts, softirq, and preemption.
>
> And rcutorture tests random overlapping of RCU readers with all those
> disabled regions. But it might still be missing something. And I
> will add spinlocks to the mix in addition to the preemption disabling.
>
> In the meantime, is there something more specific that I should be
> looking at?
>
> Thanx, Paul
--
Jeff Layton <jlayton@kernel.org>
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:26 ` Jeff Layton
@ 2026-04-10 18:36 ` Paul E. McKenney
0 siblings, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 18:36 UTC (permalink / raw)
To: Jeff Layton
Cc: Linus Torvalds, Al Viro, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
linux-fsdevel, Christian Brauner, Jan Kara, Nikolay Borisov,
Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 02:26:47PM -0400, Jeff Layton wrote:
> On Fri, 2026-04-10 at 10:32 -0700, Paul E. McKenney wrote:
> > On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> > > [ Adding RCU maintainers to the participants: see
> > >
> > > https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
> > >
> > > and the fairly long thread associated with it for context ]
> > >
> > > On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > > >
> > > > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> > >
> > > Hmm.
> > >
> > > I think this patch may indeed fix the problem, and I don't mind how it looks.
> > >
> > > But while I think the patch looks fine, I am still quite unhappy about
> > > it if it matters - because we have very much documented that spinlocks
> > > in themselves are also RCU read locks:
> > >
> > > Note that anything that disables bottom halves, preemption,
> > > or interrupts also enters an RCU read-side critical section.
> > > Acquiring a spinlock also enters an RCU read-side critical
> > > section, even for spinlocks that do not disable preemption,
> > > as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> > > Sleeplocks do *not* enter RCU read-side critical sections.
> > >
> > > so *if* this makes a difference, I think it's actually implying that
> > > there's something wrong with our "rcu_read_unlock()" implementation,
> > > and that it doesn't nest properly.
> > >
> > > Because our documentation also makes it very clear that this should
> > > all work as-is, and your patch should be a complete no-op. Just a few
> > > lines later in that core RCU doc, we have
> > >
> > > Note that RCU read-side critical sections may be nested and/or
> > > overlapping.
> > >
> > > so the order of the spin_unlock() and the rcu_read_unlock() *should*
> > > be entirely immaterial, and the order of unlocking simply shouldn't
> > > matter.
> > >
> > > So it may indeed be that relying on the ordering of RCU read unlock
> > > matters for the case of explicit unlocks and the implicit unlocks by a
> > > spinlock, but that's not great.
> > >
> > > Does the rcu_read_unlock() code perhaps only check the RCU count, not
> > > the atomicity count? That would explain it, and looking at
> > > __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> > > case (rcu_preempt_read_exit() seems to only check
> > > rcu_read_lock_nesting).
> > >
> > > This is most likely dependent on kernel config options, and it may all
> > > be unavoidable because RCU might not even have a way to tell that it's
> > > still in a critical region without preemption counts etc.
> > >
> > > But if it's unavoidable, we need to update the docs about this gotcha,
> > > and we need to have some tooling scan our existing code for this
> > > documentation change.
> > >
> > > RCU people - what do you think?
> >
> > Adding to Boqun's email:
> >
> > The https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV URL
> > has preemption disabled (right, Jeff?)
> >
>
> Not as familiar with this part of the kernel, but I think that's
> correct:
>
> # zgrep PREEMPT /proc/config.gz
> CONFIG_PREEMPT_NONE_BUILD=y
> CONFIG_ARCH_HAS_PREEMPT_LAZY=y
> CONFIG_PREEMPT_NONE=y
> # CONFIG_PREEMPT_VOLUNTARY is not set
> # CONFIG_PREEMPT is not set
> # CONFIG_PREEMPT_LAZY is not set
> # CONFIG_PREEMPT_RT is not set
> # CONFIG_PREEMPT_DYNAMIC is not set
> CONFIG_HAVE_PREEMPT_DYNAMIC=y
> CONFIG_HAVE_PREEMPT_DYNAMIC_CALL=y
> CONFIG_PREEMPT_NOTIFIERS=y
> # CONFIG_PREEMPTIRQ_DELAY_TEST is not set
Thank you, Jeff!
Yes, that has preemption entirely disabled, so RCU readers do nothing but
disable and enable preemption. And since preemption is disabled across
the whole region anyway, the order of the rcu_read_unlock() and the
spin_unlock() should not matter. (Famous last words!)
Thanx, Paul
> > , in which case rcu_read_lock()
> > is just preempt_disable() and rcu_read_unlock() is just preempt_enable().
> > All the spinlocks also disable and enable preemption, so if you have
> > something like this:
> >
> > 	rcu_read_lock();
> > 	do_something();
> > 	spin_lock(...);
> > 	do_something_else();
> > 	rcu_read_unlock();
> > 	do_yet_another_thing();
> > 	spin_unlock(...);
> >
> > Then there should not be any calls to schedule() between the initial
> > rcu_read_lock() and the final spin_unlock(). But this is just
> > preempt_disable() doing its job, not much to do with RCU.
> >
> > And repeating Boqun's email:
> >
> > Now if this is a preemptible kernel, things are more involved.
> > However, in this case we know that the kernel has been built with
> > CONFIG_PREEMPT_COUNT=y. So rcu_read_unlock_special() can sample
> > preempt_count(), and preempt_bh_were_disabled is then true if either
> > preemption or softirq are disabled. Then irqs_were_disabled is set if
> > interrupts were disabled on entry. If either are set, the inner pair of
> > "if" statements set up to defer the end of the read-side critical section.
> > If this is important, which it might be, I will go through this in more
> > detail later.
> >
> > Otherwise, the call to rcu_preempt_deferred_qs_irqrestore() at the end
> > of rcu_read_unlock_special() immediately ends the RCU read-side critical
> > section.
> >
> > And more adding to Boqun's email:
> >
> > Now, CONFIG_PREEMPT_RT=y is a special case, because "spinlocks"
> > are preemptible. So __rt_spin_lock() and __rt_spin_trylock() both do
> > rcu_read_lock() and rt_spin_unlock() does rcu_read_unlock(). So that
> > should be covered via RCU nesting along with the rcu_read_unlock_special()
> > checks for disabled interrupts, softirq, and preemption.
> >
> > And rcutorture tests random overlapping of RCU readers with all those
> > disabled regions. But it might still be missing something. And I
> > will add spinlocks to the mix in addition to the preemption disabling.
> >
> > In the meantime, is there something more specific that I should be
> > looking at?
> >
> > Thanx, Paul
>
> --
> Jeff Layton <jlayton@kernel.org>
* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 15:25 ` Linus Torvalds
` (2 preceding siblings ...)
2026-04-10 17:32 ` Paul E. McKenney
@ 2026-04-10 18:52 ` Al Viro
2026-04-10 19:21 ` Paul E. McKenney
2026-04-10 19:30 ` Linus Torvalds
3 siblings, 2 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 18:52 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> [ Adding RCU maintainers to the participants: see
>
> https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
>
> and the fairly long thread associated with it for context ]
>
> On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> >
> > [this may or may not be the source of UAFs caught by Jeff and by Helge]
>
> Hmm.
>
> I think this patch may indeed fix the problem, and I don't mind how it looks.
>
> But while I think the patch looks fine, I am still quite unhappy about
> it if it matters - because we have very much documented that spinlocks
> in themselves are also RCU read locks:
>
> Note that anything that disables bottom halves, preemption,
> or interrupts also enters an RCU read-side critical section.
> Acquiring a spinlock also enters an RCU read-side critical
> section, even for spinlocks that do not disable preemption,
> as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> Sleeplocks do *not* enter RCU read-side critical sections.
>
> so *if* this makes a difference, I think it's actually implying that
> there's something wrong with our "rcu_read_unlock()" implementation,
> and that it doesn't nest properly.
>
> Because our documentation also makes it very clear that this should
> all work as-is, and your patch should be a complete no-op. Just a few
> lines later in that core RCU doc, we have
>
> Note that RCU read-side critical sections may be nested and/or
> overlapping.
>
> so the order of the spin_unlock() and the rcu_read_unlock() *should*
> be entirely immaterial, and the order of unlocking simply shouldn't
> matter.
>
> So it may indeed be that relying on the ordering of RCU read unlock
> matters for the case of explicit unlocks and the implicit unlocks by a
> spinlock, but that's not great.
>
> Does the rcu_read_unlock() code perhaps only check the RCU count, not
> the atomicity count? That would explain it, and looking at
> __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> case (rcu_preempt_read_exit() seems to only check
> rcu_read_lock_nesting).
>
> This is most likely dependent on kernel config options, and it may all
> be unavoidable because RCU might not even have a way to tell that it's
> still in a critical region without preemption counts etc.
>
> But if it's unavoidable, we need to update the docs about this gotcha,
> and we need to have some tooling scan our existing code for this
> documentation change.
>
> RCU people - what do you think?
FWIW, I wonder if we would be better off with the following:
void shrink_dentry_list(struct list_head *list)
{
	while (!list_empty(list)) {
		struct dentry *dentry;

		dentry = list_entry(list->prev, struct dentry, d_lru);
		spin_lock(&dentry->d_lock);
		d_shrink_del(dentry);
		if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
			// killed while on shrink list, freeing left to us
			spin_unlock(&dentry->d_lock);
			dentry_free(dentry);
			continue;
		}
		rcu_read_lock();
		if (!lock_for_kill(dentry)) {
			spin_unlock(&dentry->d_lock);
			rcu_read_unlock();
			continue;
		}
		shrink_kill(dentry);
	}
}
because in that case both callers of shrink_kill() have the same form - the
other one is
if (!lock_for_kill(v)) {
        spin_unlock(&v->d_lock);
        rcu_read_unlock();
} else {
        shrink_kill(v);
}
and we could fold lock_for_kill() into shrink_kill(), with a nice series of
cleanups becoming possible after that, especially if we can rely upon
spinlock acting as RCU read-side critical in all cases...
Interesting...
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:52 ` Al Viro
@ 2026-04-10 19:21 ` Paul E. McKenney
2026-04-10 19:30 ` Linus Torvalds
1 sibling, 0 replies; 63+ messages in thread
From: Paul E. McKenney @ 2026-04-10 19:21 UTC (permalink / raw)
To: Al Viro
Cc: Linus Torvalds, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 07:52:43PM +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 08:25:27AM -0700, Linus Torvalds wrote:
> > [ Adding RCU maintainers to the participants: see
> >
> > https://lore.kernel.org/all/20260410084839.GA1310153@ZenIV/
> >
> > and the fairly long thread associated with it for context ]
> >
> > On Fri, 10 Apr 2026 at 01:44, Al Viro <viro@zeniv.linux.org.uk> wrote:
> > >
> > > [this may or may not be the source of UAFs caught by Jeff and by Helge]
> >
> > Hmm.
> >
> > I think this patch may indeed fix the problem, and I don't mind how it looks.
> >
> > But while I think the patch looks fine, I am still quite unhappy about
> > it if it matters - because we have very much documented that spinlocks
> > in themselves are also RCU read locks:
> >
> > Note that anything that disables bottom halves, preemption,
> > or interrupts also enters an RCU read-side critical section.
> > Acquiring a spinlock also enters an RCU read-side critical
> > sections, even for spinlocks that do not disable preemption,
> > as is the case in kernels built with CONFIG_PREEMPT_RT=y.
> > Sleeplocks do *not* enter RCU read-side critical sections.
> >
> > so *if* this makes a difference, I think it's actually implying that
> > there's something wrong with our "rcu_read_unlock()" implementation,
> > and that it doesn't nest properly.
> >
> > Because our documentation also makes it very clear that this should
> > all work as-is, and your patch should be a complete no-op. Just a few
> > lines later in that core RCU doc, we have
> >
> > Note that RCU read-side critical sections may be nested and/or
> > overlapping.
> >
> > so the order of the spin_unlock() and the rcu_read_unlock() *should*
> > be entirely immaterial, and the order of unlocking simply shouldn't
> > matter.
> >
> > So it may indeed be that relying on the ordering of RCU read unlock
> > matters for the case of explicit unlocks and the implicit unlocks by a
> > spinlock, but that's not great.
> >
> > Does the rcu_read_unlock() code perhaps only check the RCU count, not
> > the atomicity count? That would explain it, and looking at
> > __rcu_read_unlock() in kernel/rcu/tree_plugin.h that may indeed be the
> > case (rcu_preempt_read_exit() seems to only check
> > rcu_read_lock_nesting).
> >
> > This is most likely dependent on kernel config options, and it may all
> > be unavoidable because RCU might not even have a way to tell that it's
> > still in a critical region without preemption counts etc.
> >
> > But if it's unavoidable, we need to update the docs about this gotcha,
> > and we need to have some tooling scan our existing code for this
> > documentation change.
> >
> > RCU people - what do you think?
>
> FWIW, I wonder if we would be better off with the following:
>
> void shrink_dentry_list(struct list_head *list)
> {
>         while (!list_empty(list)) {
>                 struct dentry *dentry;
>
>                 dentry = list_entry(list->prev, struct dentry, d_lru);
>                 spin_lock(&dentry->d_lock);
>                 d_shrink_del(dentry);
>                 if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
>                         // killed while on shrink list, freeing left to us
>                         spin_unlock(&dentry->d_lock);
>                         dentry_free(dentry);
>                         continue;
>                 }
>                 rcu_read_lock();
>                 if (!lock_for_kill(dentry)) {
>                         spin_unlock(&dentry->d_lock);
>                         rcu_read_unlock();
>                         continue;
>                 }
>                 shrink_kill(dentry);
>         }
> }
>
> because in that case both callers of shrink_kill() have the same form - the
> other one is
> if (!lock_for_kill(v)) {
>         spin_unlock(&v->d_lock);
>         rcu_read_unlock();
> } else {
>         shrink_kill(v);
> }
> and we could fold lock_for_kill() into shrink_kill(), with a nice series of
> cleanups becoming possible after that, especially if we can rely upon
> spinlock acting as RCU read-side critical in all cases...
If there is a case where a spinlock fails to act as an RCU read-side
critical section, then we have a bug that needs to be fixed.
Thanx, Paul
> Interesting...
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 18:52 ` Al Viro
2026-04-10 19:21 ` Paul E. McKenney
@ 2026-04-10 19:30 ` Linus Torvalds
2026-04-10 20:24 ` Al Viro
1 sibling, 1 reply; 63+ messages in thread
From: Linus Torvalds @ 2026-04-10 19:30 UTC (permalink / raw)
To: Al Viro
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, 10 Apr 2026 at 11:48, Al Viro <viro@zeniv.linux.org.uk> wrote:
>
> FWIW, I wonder if we would be better off with the following:
>
> void shrink_dentry_list(struct list_head *list)
> {
>         while (!list_empty(list)) {
>                 struct dentry *dentry;
>
>                 dentry = list_entry(list->prev, struct dentry, d_lru);
>                 spin_lock(&dentry->d_lock);
>                 d_shrink_del(dentry);
>                 if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
>                         // killed while on shrink list, freeing left to us
>                         spin_unlock(&dentry->d_lock);
>                         dentry_free(dentry);
>                         continue;
>                 }
>                 rcu_read_lock();
>                 if (!lock_for_kill(dentry)) {
>                         spin_unlock(&dentry->d_lock);
>                         rcu_read_unlock();
>                         continue;
>                 }
Ugh. I despise that rcu_read_lock() in there, because it's typically
completely useless.
The reason it exists is because lock_for_kill() can drop ->d_lock, but
that's in the unlikely case that we can't just immediately get the
inode lock.
So honestly, I think that rcu_read_lock() should be inside
lock_for_kill(), rather than in the caller as a "just in case things
go down".
IOW, that code basically now not only does that typically unnecessary
rcu locking (and then unlocking), it does so *because* it knows about
the subtle internal behavior of lock_for_kill().
In contrast, if we put it inside lock_for_kill(), all we need is a
fairly straightforward comment along the lines of
"we're dropping the spinlock in order to take the inode lock, but
the caller may be depending on RCU behavior so we need to take the rcu
read lock first".
that would turn strange illogical code that also generates worse code
into straightforwardly explained code that also performs better.
Ok, so "performs better" is kind of exaggerated, in that obviously the
extra rcu_read_lock/unlock sequences aren't exactly expensive, but
still - I feel it's a win-win to just do this differently.
Or am I missing something else?
Linus
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 19:30 ` Linus Torvalds
@ 2026-04-10 20:24 ` Al Viro
2026-04-10 20:48 ` Al Viro
0 siblings, 1 reply; 63+ messages in thread
From: Al Viro @ 2026-04-10 20:24 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 12:30:13PM -0700, Linus Torvalds wrote:
> The reason it exists is because lock_for_kill() can drop ->d_lock, but
> that's in the unlikely case that we can't just immediately get the
> inode lock.
>
> So honestly, I think that rcu_read_lock() should be inside
> lock_for_kill(), rather than in the caller as a "just in case things
> go down".
Yup, in the cascade of followups I've mentioned...
> IOW, that code basically now not only does that typically unnecessary
> rcu locking (and then unlocking), it does so *because* it knows about
> the subtle internal behavior of lock_for_kill().
>
> In contrast, if we put it inside lock_for_kill(), all we need is a
> fairly straightforward comment along the lines of
>
> "we're dropping the spinlock in order to take the inode lock, but
> the caller may be depending on RCU behavior so we need to take the rcu
> read lock first".
>
> that would turn strange illogical code that also generates worse code
> into straightforwardly explained code that also performs better.
>
> Ok, so "performs better" is kind of exaggerated, in that obviously the
> extra rcu_read_lock/unlock sequences aren't exactly expensive, but
> still - I feel it's a win-win to just do this differently.
>
> Or am I missing something else?
Mostly that fast_dput() calling conventions would need to change a bit
as well.
With the above as step 1,
Step 2: fold lock_for_kill() into shrink_kill(). Result: 2 call sites of
lock_for_kill() left, shrink_kill() becomes
static inline void shrink_kill(struct dentry *victim)
{
        while (lock_for_kill(victim)) {
                rcu_read_unlock();
                victim = __dentry_kill(victim);
                if (!victim)
                        return;
                rcu_read_lock();
        }
        spin_unlock(&victim->d_lock);
        rcu_read_unlock();
}
shrink_dentry_list() and shrink_dentry_tree() calling shrink_kill()
without bothering with lock_for_kill().
Step 3: note that both call sites of lock_for_kill() are followed by the
same sequence - drop ->d_lock + rcu_read_unlock on failure,
rcu_read_unlock + __dentry_kill() on success. And these are the
only callers of __dentry_kill(). Moreover, NULL from __dentry_kill()
means "bugger off" in both callers. So add
struct dentry *dentry_kill(struct dentry *dentry)
{
        if (!lock_for_kill(dentry)) {
                spin_unlock(&dentry->d_lock);
                rcu_read_unlock();
                return NULL;
        }
        rcu_read_unlock();
        return __dentry_kill(dentry);
}
and convert both callers to that:
static inline void shrink_kill(struct dentry *victim)
{
        while ((victim = dentry_kill(victim)) != NULL)
                rcu_read_lock();
}
and
static void finish_dput(struct dentry *dentry)
{
        while ((dentry = dentry_kill(dentry)) != NULL) {
                if (retain_dentry(dentry, true)) {
                        spin_unlock(&dentry->d_lock);
                        return;
                }
                rcu_read_lock();
        }
}
Step 4: note that callers of shrink_kill() and finish_dput()
have dentry->d_lock held. We can (assuming there are no RCU
bugs) drop the lock before calling them and grab it first
thing inside. Moreover, we can shift grabbing it into
lock_for_kill(), leaving these
static inline void shrink_kill(struct dentry *victim)
{
        while ((victim = dentry_kill(victim)) != NULL)
                ;
}
and
static void finish_dput(struct dentry *dentry)
{
        while ((dentry = dentry_kill(dentry)) != NULL) {
                if (retain_dentry(dentry, true)) {
                        spin_unlock(&dentry->d_lock);
                        return;
                }
        }
}
resp., with the callers becoming
void dput(struct dentry *dentry)
{
        if (!dentry)
                return;
        might_sleep();
        rcu_read_lock();
        if (likely(fast_dput(dentry))) {
                rcu_read_unlock();
                return;
        }
        rcu_read_unlock();
        finish_dput(dentry);
}
void d_make_discardable(struct dentry *dentry)
{
        spin_lock(&dentry->d_lock);
        WARN_ON(!(dentry->d_flags & DCACHE_PERSISTENT));
        dentry->d_flags &= ~DCACHE_PERSISTENT;
        dentry->d_lockref.count--;
        finish_dput(dentry);
}
void shrink_dentry_list(struct list_head *list)
{
        while (!list_empty(list)) {
                struct dentry *dentry;

                dentry = list_entry(list->prev, struct dentry, d_lru);
                spin_lock(&dentry->d_lock);
                d_shrink_del(dentry);
                if (unlikely(dentry->d_flags & DCACHE_DENTRY_KILLED)) {
                        spin_unlock(&dentry->d_lock);
                        dentry_free(dentry);
                        continue;
                }
                shrink_kill(dentry);
        }
}
and, in shrink_dentry_list(),
spin_lock(&v->d_lock);
rcu_read_unlock();      // now held by ->d_lock
if (v->d_lockref.count < 0 &&
    !(v->d_flags & DCACHE_DENTRY_KILLED)) {
        struct completion_list wait;

        // It's busy dying; have it notify us once
        // it becomes invisible to d_walk().
        d_add_waiter(v, &wait);
        spin_unlock(&v->d_lock);
        if (!list_empty(&data.dispose))
                shrink_dentry_list(&data.dispose);
        wait_for_completion(&wait.completion);
        continue;
}
shrink_kill(v);
lock_for_kill() has grown a call of rcu_read_lock() at the very beginning;
it will be gone shortly.
Step 5: pull the followup rcu_read_unlock() (and dropping ->d_lock on
failure) into lock_for_kill(); again, that assumes no RCU bugs there:
static bool lock_for_kill(struct dentry *dentry)
{
        struct inode *inode = dentry->d_inode;

        if (unlikely(dentry->d_lockref.count)) {
                spin_unlock(&dentry->d_lock);
                return false;
        }
        if (!inode || likely(spin_trylock(&inode->i_lock)))
                return true;
        rcu_read_lock();
        do {
                spin_unlock(&dentry->d_lock);
                spin_lock(&inode->i_lock);
                spin_lock(&dentry->d_lock);
                if (likely(inode == dentry->d_inode))
                        break;
                spin_unlock(&inode->i_lock);
                inode = dentry->d_inode;
        } while (inode);
        rcu_read_unlock();
        if (likely(!dentry->d_lockref.count))
                return true;
        if (inode)
                spin_unlock(&inode->i_lock);
        spin_unlock(&dentry->d_lock);
        return false;
}
and in dentry_kill()
{
        if (!lock_for_kill(dentry))
                return NULL;
        return __dentry_kill(dentry);
}
Step 6: fold __dentry_kill() into dentry_kill().
Step 7: both callers of fast_dput() are identical:
rcu_read_lock();
if (likely(fast_dput(dentry))) {
        rcu_read_unlock();
        return;
}
rcu_read_unlock();
(in dput() and dput_to_list()). Pull rcu_read_lock() and rcu_read_unlock()
into fast_dput(); ends up with 5 lines added there. Result (with comments
skipped) is
static inline bool fast_dput(struct dentry *dentry)
{
        int ret;

        rcu_read_lock();                // added
        ret = lockref_put_return(&dentry->d_lockref);
        if (unlikely(ret < 0)) {
                spin_lock(&dentry->d_lock);
                rcu_read_unlock();      // added
                if (WARN_ON_ONCE(dentry->d_lockref.count <= 0)) {
                        spin_unlock(&dentry->d_lock);
                        return true;
                }
                dentry->d_lockref.count--;
                goto locked;
        }
        if (ret) {
                rcu_read_unlock();      // added
                return true;
        }
        if (retain_dentry(dentry, false)) {
                rcu_read_unlock();      // added
                return true;
        }
        spin_lock(&dentry->d_lock);
        rcu_read_unlock();              // added
locked:
        if (dentry->d_lockref.count || retain_dentry(dentry, true)) {
                spin_unlock(&dentry->d_lock);
                return true;
        }
        return false;
}
Callers become
void dput(struct dentry *dentry)
{
        if (!dentry)
                return;
        might_sleep();
        if (likely(fast_dput(dentry)))
                return;
        finish_dput(dentry);
}
and
void dput_to_list(struct dentry *dentry, struct list_head *list)
{
        if (likely(fast_dput(dentry)))
                return;
        to_shrink_list(dentry, list);
        spin_unlock(&dentry->d_lock);
}
resp.
^ permalink raw reply [flat|nested] 63+ messages in thread

* Re: [RFC][PATCH] make sure that lock_for_kill() callers drop the locks in safe order
2026-04-10 20:24 ` Al Viro
@ 2026-04-10 20:48 ` Al Viro
0 siblings, 0 replies; 63+ messages in thread
From: Al Viro @ 2026-04-10 20:48 UTC (permalink / raw)
To: Linus Torvalds
Cc: Paul E. McKenney, Frederic Weisbecker, Neeraj Upadhyay,
Joel Fernandes, Josh Triplett, Boqun Feng, Uladzislau Rezki,
Jeff Layton, linux-fsdevel, Christian Brauner, Jan Kara,
Nikolay Borisov, Max Kellermann, Eric Sandeen, Paulo Alcantara
On Fri, Apr 10, 2026 at 09:24:04PM +0100, Al Viro wrote:
> On Fri, Apr 10, 2026 at 12:30:13PM -0700, Linus Torvalds wrote:
>
> > The reason it exists is because lock_for_kill() can drop ->d_lock, but
> > that's in the unlikely case that we can't just immediately get the
> > inode lock.
> >
> > So honestly, I think that rcu_read_lock() should be inside
> > lock_for_kill(), rather than in the caller as a "just in case things
> > go down".
>
> Yup, in the cascade of followups I've mentioned...
>
> > IOW, that code basically now not only does that typically unnecessary
> > rcu locking (and then unlocking), it does so *because* it knows about
> > the subtle internal behavior of lock_for_kill().
> >
> > In contrast, if we put it inside lock_for_kill(), all we need is a
> > fairly straightforward comment along the lines of
> >
> > "we're dropping the spinlock in order to take the inode lock, but
> > the caller may be depending on RCU behavior so we need to take the rcu
> > read lock first".
> >
> > that would turn strange illogical code that also generates worse code
> > into straightforwardly explained code that also performs better.
> >
> > Ok, so "performs better" is kind of exaggerated, in that obviously the
> > extra rcu_read_lock/unlock sequences aren't exactly expensive, but
> > still - I feel it's a win-win to just do this differently.
> >
> > Or am I missing something else?
>
> Mostly that fast_dput() calling conventions would need to change a bit
> as well.
>
> With the above as step 1,
[snip]
FWIW, I would really like to figure out what the hell is going on with
those UAFs; livelock elimination (which is clearly needed as well -
there's an easy way to turn that livelock into a hard lock reliably eating
a CPU on an SMP box) is likely to hide whatever the bug is. And it
does look like RCU bugs are not the cause of what Jeff is occasionally
seeing - not with their config. So there's something else going on ;-/
Jeff's reports are 6.11--6.13; Helge's parisc ones are 6.16--6.19.
If they are using the stock debian parisc kernel, it's PREEMPT_NONE
config (AFAICS) and in that case RCU is not the cause there either...
Back to trying to put together a proof of correctness and see where
it breaks ;-/ If nothing else, that should give a set of assertion
checks to try and slap on top of the kernels involved...
^ permalink raw reply [flat|nested] 63+ messages in thread