All of lore.kernel.org
 help / color / mirror / Atom feed
From: Diangang Li <lidiangang@bytedance.com>
To: Al Viro <viro@zeniv.linux.org.uk>
Cc: Stephen Brennan <stephen.s.brennan@oracle.com>,
	linux-kernel@vger.kernel.org,
	Luis Chamberlain <mcgrof@kernel.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Jan Kara <jack@suse.cz>,
	linux-fsdevel@vger.kernel.org, Arnd Bergmann <arnd@arndb.de>,
	Amir Goldstein <amir73il@gmail.com>,
	Fengnan Chang <changfengnan@bytedance.com>
Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings
Date: Wed, 10 Sep 2025 15:38:55 +0800	[thread overview]
Message-ID: <20250910073855.GA59407@bytedance.com> (raw)
In-Reply-To: <YgSjo5wascR9mfnA@zeniv-ca.linux.org.uk>

On Thu, Feb 10, 2022 at 05:33:23AM +0000, Al Viro wrote:
> On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote:
> 
> > +static void sweep_negative(struct dentry *dentry)
> > +{
> > +	struct dentry *parent;
> > +
> > +	rcu_read_lock();
> > +	parent = lock_parent(dentry);
> > +	if (!parent) {
> > +		rcu_read_unlock();
> > +		return;
> > +	}
> > +
> > +	/*
> > +	 * If we did not hold a reference to dentry (as in the case of dput),
> > +	 * and dentry->d_lock was dropped in lock_parent(), then we could now be
> > +	 * holding onto a dead dentry. Be careful to check d_count and unlock
> > +	 * before dropping RCU lock, otherwise we could corrupt freed memory.
> > +	 */
> > +	if (!d_count(dentry) && d_is_negative(dentry) &&
> > +		!d_is_tail_negative(dentry)) {
> > +		dentry->d_flags |= DCACHE_TAIL_NEGATIVE;
> > +		list_move_tail(&dentry->d_child, &parent->d_subdirs);
> > +	}
> > +
> > +	spin_unlock(&parent->d_lock);
> > +	spin_unlock(&dentry->d_lock);
> > +	rcu_read_unlock();
> > +}
> 
> 	I'm not sure if it came up the last time you'd posted this series
> (and I apologize if it had and I forgot the explanation), but... consider
> the comment in dentry_unlist().  What's to prevent the race described there
> making d_walk() skip a part of tree, by replacing the "lseek moving cursor
> in just the wrong moment" with "dput moving the negative dentry right next
> to the one being killed to the tail of the list"?
> 
> 	The race in question:
> d_walk() is leaving a subdirectory.  We are here:
>         rcu_read_lock();
> ascend:
>         if (this_parent != parent) {
> 
> It isn't - we are not back to the root of tree being walked.
> At this point this_parent is the directory we'd just finished looking into.
> 
>                 struct dentry *child = this_parent;
>                 this_parent = child->d_parent;
> 
> ... and now child points to it, and this_parent points to its parent.
> 
>                 spin_unlock(&child->d_lock);
> 
> No locks held.  Another CPU gets through successful rmdir().  child gets
> unhashed and dropped.  It's off the ->d_subdirs of this_parent; its
> ->d_child.next is still pointing where it used to, and whatever it points
> to won't be physically freed until rcu_read_unlock().
> 
> Moreover, in the meanwhile this next sibling (negative, pinned) got dput().
> And had been moved to the tail of the this_parent->d_subdirs.  Since
> its ->d_child.prev does *NOT* point to child (which is off-list, about to
> be freed shortly, etc.), child->d_dchild.next is not modified - it still
> points to that (now moved) sibling.
> 
>                 spin_lock(&this_parent->d_lock);
> Got it.
> 
>                 /* might go back up the wrong parent if we have had a rename. */
>                 if (need_seqretry(&rename_lock, seq))
>                         goto rename_retry;
> 
> Nope, hadn't happened.
> 
>                 /* go into the first sibling still alive */
>                 do {
>                         next = child->d_child.next;
> ... and this is the moved sibling, now in the end of the ->d_subdirs.
> 
>                         if (next == &this_parent->d_subdirs)
>                                 goto ascend;
> 
> No, it is not - it's the last element of the list, not its anchor.
> 
>                         child = list_entry(next, struct dentry, d_child);
> 
> Our moved negative dentry.
> 
>                 } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED));
> 
> Not killed, that one.
>                 rcu_read_unlock();
>                 goto resume;
> 
> ... and since that sucker has no children, we proceed to look at it,
> ascend and now we are at the end of this_parent->d_subdirs.  And we
> ascend out of it, having entirely skipped all branches that used to
> be between the rmdir victim and the end of the parent's ->d_subdirs.
> 
> What am I missing here?  Unlike the trick we used with cursors (see
> dentry_unlist()) we can't predict who won't get moved in this case...
> 
> Note that treating "child is has DCACHE_DENTRY_KILLED" same as we do
> for rename_lock mismatches would not work unless you grab the spinlock
> component of rename_lock every time dentry becomes positive.  Which
> is obviously not feasible (it's a system-wide lock and cacheline
> pingpong alone would hurt us very badly, not to mention the contention
> issues due to the frequency of grabbing it going up by several orders
> of magnitude).

Hi Al, Stephen,

How about adding a pointer in union d_u to record the original next of negative dentry
when executing `sweep_negative`, as follows:

@@ -125,6 +126,7 @@ struct dentry {
 		struct hlist_node d_alias;	/* inode alias list */
 		struct hlist_bl_node d_in_lookup_hash;	/* only for in-lookup ones */
 	 	struct rcu_head d_rcu;
+
+		/* valid between sweep_negative and recyle_negative */
+		struct hlist_node *d_sib_backup;
 	} d_u;
 };

Then, during `d_walk`, once we find a dentry with `DCACHE_DENTRY_KILLED`, and its next
with `DCACHE_TAIL_NEGATIVE`, iterate to get the genuine next dentry in d_children.

@@ -1357,8 +1405,22 @@ static void d_walk(struct dentry *parent, void *data,
    /* might go back up the wrong parent if we have had a rename. */
    if (need_seqretry(&rename_lock, seq))
        goto rename_retry;
+
+		next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib);
+		do {
+			if (!(next && next->d_u.d_sib_backup) ||
+			    !(dentry->d_flags & DCACHE_DENTRY_KILLED) ||
+			    !(next->d_flags & DCACHE_TAIL_NEGATIVE)) {
+				dentry = next;
+				break;
+			}
+			dentry = hlist_entry_safe(next->d_u.d_sib_backup, struct dentry, d_sib);
+			next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib);
+		} while (true);
+
		/* go into the first sibling still alive */
-		hlist_for_each_entry_continue(dentry, d_sib) {
+		hlist_for_each_entry_from(dentry, d_sib) {

And since `d_children` changed from `list_head` to `hlist_head`, we cannot move a negative dentry
to the tail of d_children directly. Instead, add another `d_negative` list to cache negative
dentries. For the majority of d_children traversal, we only care about positive dentries.

@@ -118,6 +118,7 @@ struct dentry {
 	};
 	struct hlist_node d_sib;	/* child of parent list */
 	struct hlist_head d_children;	/* our children */
+	struct hlist_head d_negative;	/* cached negative dentries */

The commit `41f49be2e51a71` ("fsnotify: clear PARENT_WATCHED flags lazily") has resolved the
softlockup in `__fsnotify_parent`, but we are still seeing softlockup in `fsnotify_recalc_mask`
when adding watches with millions of negative dentries. As noted in [1], we have encountered the
same issue. In our case, the negative dentries are accumulated primarily by opening non-existent
files. The `dentry-negative` sysctl introduced in commit `e6957c99dca5f`
("vfs: Add a sysctl for automated deletion of dentry") does not seem to have any effect in our
scenario. Given this, we believe that splitting the `d_children` list into positive and negative
dentries, and skipping the negatives in `fsnotify_set_children_dentry_flags`, is still a
valuable approach.

[1] https://lore.kernel.org/linux-fsdevel/CAJfpegs+czRD1=s+o5yNoOp13xH+utQ8jQkJ9ec5283MNT_xmg@mail.gmail.com/


Regards,
Diangang

  parent reply	other threads:[~2025-09-10  7:39 UTC|newest]

Thread overview: 12+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-02-09 23:14 [PATCH v2 0/4] Fix softlockup when adding inotify watch Stephen Brennan
2022-02-09 23:14 ` [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings Stephen Brennan
2022-02-10  5:33   ` Al Viro
2022-02-16  2:24     ` Stephen Brennan
2022-02-16  3:27       ` Al Viro
2022-02-16  3:49         ` Al Viro
2022-02-16  7:43           ` Stephen Brennan
2025-09-10  7:38     ` Diangang Li [this message]
2022-02-09 23:14 ` [PATCH v2 2/4] fsnotify: stop walking child dentries if remaining tail is negative Stephen Brennan
2022-02-09 23:14 ` [PATCH v2 3/4] dcache: add action D_WALK_SKIP_SIBLINGS to d_walk() Stephen Brennan
2022-02-09 23:14 ` [PATCH v2 4/4] dcache: stop walking siblings if remaining dentries all negative Stephen Brennan
  -- strict thread matches above, loose matches on Subject: below --
2022-01-31 21:19 [PATCH v2 0/4] Fix softlockup when adding inotify watch Stephen Brennan
2022-01-31 21:19 ` [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings Stephen Brennan

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20250910073855.GA59407@bytedance.com \
    --to=lidiangang@bytedance.com \
    --cc=akpm@linux-foundation.org \
    --cc=amir73il@gmail.com \
    --cc=arnd@arndb.de \
    --cc=changfengnan@bytedance.com \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mcgrof@kernel.org \
    --cc=stephen.s.brennan@oracle.com \
    --cc=viro@zeniv.linux.org.uk \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.