From mboxrd@z Thu Jan 1 00:00:00 1970 Received: from mail-pf1-f181.google.com (mail-pf1-f181.google.com [209.85.210.181]) (using TLSv1.2 with cipher ECDHE-RSA-AES128-GCM-SHA256 (128/128 bits)) (No client certificate requested) by smtp.subspace.kernel.org (Postfix) with ESMTPS id D195D35961 for ; Wed, 10 Sep 2025 07:39:03 +0000 (UTC) Authentication-Results: smtp.subspace.kernel.org; arc=none smtp.client-ip=209.85.210.181 ARC-Seal:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757489945; cv=none; b=bWDflD7qKPZ1S3UL9r+k5A/yErDaEgLXALRED4TH0PUv6phZvZxq2Q5xQUE/t9vrkSGehzXcuLRNRDi5g5zmD9BykH/Px55hP3/L3iTMlgOCcreQ3/TioYc+tsWCe9tjgxIpDYjF/YLqpnbMs3+WUxk+0YDcIxoX8HjF8pRicHM= ARC-Message-Signature:i=1; a=rsa-sha256; d=subspace.kernel.org; s=arc-20240116; t=1757489945; c=relaxed/simple; bh=w0Jv7GNyL9/zUxDACagQ8HWDv7Fg94c5/+VONATQV2c=; h=Date:From:To:Cc:Subject:Message-ID:References:MIME-Version: Content-Type:Content-Disposition:In-Reply-To; b=rBdSz5artJ8wb0GHuncGvGwp52eCAmuY9HOkA14SZoAIJtQ+w0YEyAFpYR1xUrFUhP8nxYcnJi78en8hMNmF9zIMo9AGP3i24lfgG27cpbqP7NXozsB9D4GOiYyAOMRUsP3WxM+FEm7uCY14z4111R/A5Tm7M1cds8hHCW5/mWs= ARC-Authentication-Results:i=1; smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com; spf=pass smtp.mailfrom=bytedance.com; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b=Tfuwk8q9; arc=none smtp.client-ip=209.85.210.181 Authentication-Results: smtp.subspace.kernel.org; dmarc=pass (p=quarantine dis=none) header.from=bytedance.com Authentication-Results: smtp.subspace.kernel.org; spf=pass smtp.mailfrom=bytedance.com Authentication-Results: smtp.subspace.kernel.org; dkim=pass (2048-bit key) header.d=bytedance.com header.i=@bytedance.com header.b="Tfuwk8q9" Received: by mail-pf1-f181.google.com with SMTP id d2e1a72fcca58-7704f3c46ceso5475116b3a.2 for ; Wed, 10 Sep 2025 00:39:03 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=bytedance.com; s=google; t=1757489943; x=1758094743; darn=vger.kernel.org; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:from:to:cc:subject:date:message-id:reply-to; bh=amhdj6MOEhyly+/5boSIJUaEeiy//aQ5RS0+3yOj7Ek=; b=Tfuwk8q90E8Krcvgs3NaoiXyNjYRdIzPtzC/54X++ZNN4JsKF63GBzJLgTpQt3Coav MDcCQ/8dFdemH5rvZjyqYWocVGdTUC10F42yHRoi+tyerkMVN1mNJ5skxvt9lSR4xmBx 9QJvS5Do1uzmwbztzafRMrF2rqAcLwS0xAI4QLTFusi9OYslvauaLJDRHDOlANvO7mbb w6ZVpeN0sYCStMcGA/ouDGYWFuBNrLSOBSohbd9joeIZsfkeUUv/tlkwA75zh4J64qY5 GZzWuwnIEwoFf6tld21630I9JWiG4riEJ8fO8qD9n4/4QoLsB23wnBbiYEpKx5+30u3k wY3g== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20230601; t=1757489943; x=1758094743; h=in-reply-to:content-disposition:mime-version:references:message-id :subject:cc:to:from:date:x-gm-message-state:from:to:cc:subject:date :message-id:reply-to; bh=amhdj6MOEhyly+/5boSIJUaEeiy//aQ5RS0+3yOj7Ek=; b=hUeu1b2qN/8ai108ExJc0y2Hn+iPn0u21xy6OAdBRTVpWljR/wKqPF315dzX+iE2gK hJLdMT45QnhM+k+YRDQ5xyMo+MmCJITSIiu/C+kmgEviCTl6smvR+hAVV/vuSy8smyBq vFPWeh0EImNCcpSUU2l7cyaTsV+SShDMZ0NnwZCXolASHajLUIfn0W4sNqvycqfGQHB0 UiN4sTJHhPjm8h0DaF7rx+M5Y06eTFhAmVBjegR+ECa4D+MTsP9hlF6Uonv3ZBmLdvbW iTPosWqSKzZxWK+u2v1DtmJGDE/SlLFo4+Ic6dtlnmsXKWmCWFBTUtL2bY05dtWnd3LU VDuA== X-Forwarded-Encrypted: i=1; AJvYcCVNAppOh89YTPyAy29jy17rjNbx2TmOBfpBJ9vVd9xdY5BYdLNUNnAb+lS+rqwXx18VU5wXBJNUGg+7WTg=@vger.kernel.org X-Gm-Message-State: AOJu0Yz0r8Uw0aDo/d9ZD5EGQp2r4kMAaj28lzUnWWDcaKgV94uJq+/k wju6olqf38rlaQHzBLu0OrezNflZ6066GeJm3/5AQpf5hfpd2bzGBap4AyPFkPHN4LptQCE7Kb7 rFgrM X-Gm-Gg: ASbGncsgzXA7PncKfrjUDKlzRCo8QFjsAEoCIYEtOVnd+gINU+0Vj+kV1wa2ulno0zM rGTx7oY+tDZQyNPkhSSiQw/u7lxZQSdgI/QzhSCIg+WlxEsQ2ZY3td6KwU0V62Q5+XYD3N2+ALW BhrcfaYjcfPypypMVrN5mm3MSj1HWQATRyfD67KcFThr+PtQd1dQ2IJjoDDAznWBKgR2cv76RCO zN1w/YlQHhtEOsXI8Ukc/tedZ0t3sVbm45MAXw5zgAq2ufQPzJM7KDYZLatni0IYvHdEwnb5sqk OHU3IsppLv7HVAuBvBItQqfs98PUR8cwzG1BYPSJhd7AIdD4hjn8p7VjvVN45xtCR+uDy6LLIRj I+R5BrTf+B9TY3CimKeHVhivGmJrCv8bSJ9VUJS+VFsceGsBHCqvb0dxd X-Google-Smtp-Source: AGHT+IH9Ygo2eY4mJlDFHfCvTHSy2SH/RWavdZLm5XlF9+y7C7O+224Jc3mIRhdpemTVZaLXTkfzwQ== X-Received: by 2002:a05:6a00:3d53:b0:772:6235:7038 with SMTP id d2e1a72fcca58-7742dce0f81mr15382700b3a.10.1757489942865; Wed, 10 Sep 2025 00:39:02 -0700 (PDT) Received: from H7GWF0W104 ([139.177.225.237]) by smtp.gmail.com with ESMTPSA id d2e1a72fcca58-774662f9e95sm4259619b3a.98.2025.09.10.00.38.57 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Wed, 10 Sep 2025 00:39:02 -0700 (PDT) Date: Wed, 10 Sep 2025 15:38:55 +0800 From: Diangang Li To: Al Viro Cc: Stephen Brennan , linux-kernel@vger.kernel.org, Luis Chamberlain , Andrew Morton , Jan Kara , linux-fsdevel@vger.kernel.org, Arnd Bergmann , Amir Goldstein , Fengnan Chang Subject: Re: [PATCH v2 1/4] dcache: sweep cached negative dentries to the end of list of siblings Message-ID: <20250910073855.GA59407@bytedance.com> References: <20220209231406.187668-1-stephen.s.brennan@oracle.com> <20220209231406.187668-2-stephen.s.brennan@oracle.com> Precedence: bulk X-Mailing-List: linux-kernel@vger.kernel.org List-Id: List-Subscribe: List-Unsubscribe: MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: On Thu, Feb 10, 2022 at 05:33:23AM +0000, Al Viro wrote: > On Wed, Feb 09, 2022 at 03:14:03PM -0800, Stephen Brennan wrote: > > > +static void sweep_negative(struct dentry *dentry) > > +{ > > + struct dentry *parent; > > + > > + rcu_read_lock(); > > + parent = lock_parent(dentry); > > + if (!parent) { > > + rcu_read_unlock(); > > + return; > > + } > > + > > + /* > > + * If we did not hold a reference to dentry (as in the case of dput), > > + * and dentry->d_lock was dropped in lock_parent(), then we could now be > > + * holding onto a dead dentry. Be careful to check d_count and unlock > > + * before dropping RCU lock, otherwise we could corrupt freed memory. > > + */ > > + if (!d_count(dentry) && d_is_negative(dentry) && > > + !d_is_tail_negative(dentry)) { > > + dentry->d_flags |= DCACHE_TAIL_NEGATIVE; > > + list_move_tail(&dentry->d_child, &parent->d_subdirs); > > + } > > + > > + spin_unlock(&parent->d_lock); > > + spin_unlock(&dentry->d_lock); > > + rcu_read_unlock(); > > +} > > I'm not sure if it came up the last time you'd posted this series > (and I apologize if it had and I forgot the explanation), but... consider > the comment in dentry_unlist(). What's to prevent the race described there > making d_walk() skip a part of tree, by replacing the "lseek moving cursor > in just the wrong moment" with "dput moving the negative dentry right next > to the one being killed to the tail of the list"? > > The race in question: > d_walk() is leaving a subdirectory. We are here: > rcu_read_lock(); > ascend: > if (this_parent != parent) { > > It isn't - we are not back to the root of tree being walked. > At this point this_parent is the directory we'd just finished looking into. > > struct dentry *child = this_parent; > this_parent = child->d_parent; > > ... and now child points to it, and this_parent points to its parent. > > spin_unlock(&child->d_lock); > > No locks held. Another CPU gets through successful rmdir(). child gets > unhashed and dropped. It's off the ->d_subdirs of this_parent; its > ->d_child.next is still pointing where it used to, and whatever it points > to won't be physically freed until rcu_read_unlock(). > > Moreover, in the meanwhile this next sibling (negative, pinned) got dput(). > And had been moved to the tail of the this_parent->d_subdirs. Since > its ->d_child.prev does *NOT* point to child (which is off-list, about to > be freed shortly, etc.), child->d_dchild.next is not modified - it still > points to that (now moved) sibling. > > spin_lock(&this_parent->d_lock); > Got it. > > /* might go back up the wrong parent if we have had a rename. */ > if (need_seqretry(&rename_lock, seq)) > goto rename_retry; > > Nope, hadn't happened. > > /* go into the first sibling still alive */ > do { > next = child->d_child.next; > ... and this is the moved sibling, now in the end of the ->d_subdirs. > > if (next == &this_parent->d_subdirs) > goto ascend; > > No, it is not - it's the last element of the list, not its anchor. > > child = list_entry(next, struct dentry, d_child); > > Our moved negative dentry. > > } while (unlikely(child->d_flags & DCACHE_DENTRY_KILLED)); > > Not killed, that one. > rcu_read_unlock(); > goto resume; > > ... and since that sucker has no children, we proceed to look at it, > ascend and now we are at the end of this_parent->d_subdirs. And we > ascend out of it, having entirely skipped all branches that used to > be between the rmdir victim and the end of the parent's ->d_subdirs. > > What am I missing here? Unlike the trick we used with cursors (see > dentry_unlist()) we can't predict who won't get moved in this case... > > Note that treating "child is has DCACHE_DENTRY_KILLED" same as we do > for rename_lock mismatches would not work unless you grab the spinlock > component of rename_lock every time dentry becomes positive. Which > is obviously not feasible (it's a system-wide lock and cacheline > pingpong alone would hurt us very badly, not to mention the contention > issues due to the frequency of grabbing it going up by several orders > of magnitude). Hi Al, Stephen, How about adding a pointer in union d_u to record the original next of negative dentry when executing `sweep_negative`, as follows: @@ -125,6 +126,7 @@ struct dentry { struct hlist_node d_alias; /* inode alias list */ struct hlist_bl_node d_in_lookup_hash; /* only for in-lookup ones */ struct rcu_head d_rcu; + + /* valid between sweep_negative and recyle_negative */ + struct hlist_node *d_sib_backup; } d_u; }; Then, during `d_walk`, once we find a dentry with `DCACHE_DENTRY_KILLED`, and its next with `DCACHE_TAIL_NEGATIVE`, iterate to get the genuine next dentry in d_children. @@ -1357,8 +1405,22 @@ static void d_walk(struct dentry *parent, void *data, /* might go back up the wrong parent if we have had a rename. */ if (need_seqretry(&rename_lock, seq)) goto rename_retry; + + next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib); + do { + if (!(next && next->d_u.d_sib_backup) || + !(dentry->d_flags & DCACHE_DENTRY_KILLED) || + !(next->d_flags & DCACHE_TAIL_NEGATIVE)) { + dentry = next; + break; + } + dentry = hlist_entry_safe(next->d_u.d_sib_backup, struct dentry, d_sib); + next = hlist_entry_safe(dentry->d_sib.next, struct dentry, d_sib); + } while (true); + /* go into the first sibling still alive */ - hlist_for_each_entry_continue(dentry, d_sib) { + hlist_for_each_entry_from(dentry, d_sib) { And since `d_children` changed from `list_head` to `hlist_head`, we cannot move a negative dentry to the tail of d_children directly. Instead, add another `d_negative` list to cache negative dentries. For the majority of d_children traversal, we only care about positive dentries. @@ -118,6 +118,7 @@ struct dentry { }; struct hlist_node d_sib; /* child of parent list */ struct hlist_head d_children; /* our children */ + struct hlist_head d_negative; /* cached negative dentries */ The commit `41f49be2e51a71` ("fsnotify: clear PARENT_WATCHED flags lazily") has resolved the softlockup in `__fsnotify_parent`, but we are still seeing softlockup in `fsnotify_recalc_mask` when adding watches with millions of negative dentries. As noted in [1], we have encountered the same issue. In our case, the negative dentries are accumulated primarily by opening non-existent files. The `dentry-negative` sysctl introduced in commit `e6957c99dca5f` ("vfs: Add a sysctl for automated deletion of dentry") does not seem to have any effect in our scenario. Given this, we believe that splitting the `d_children` list into positive and negative dentries, and skipping the negatives in `fsnotify_set_children_dentry_flags`, is still a valuable approach. [1] https://lore.kernel.org/linux-fsdevel/CAJfpegs+czRD1=s+o5yNoOp13xH+utQ8jQkJ9ec5283MNT_xmg@mail.gmail.com/ Regards, Diangang