From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by smtp.lore.kernel.org (Postfix) with ESMTP id C3DEFC6FA8F for ; Wed, 30 Aug 2023 02:05:10 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232295AbjH3CEk (ORCPT ); Tue, 29 Aug 2023 22:04:40 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:38018 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S241706AbjH3CEf (ORCPT ); Tue, 29 Aug 2023 22:04:35 -0400 Received: from dfw.source.kernel.org (dfw.source.kernel.org [IPv6:2604:1380:4641:c500::1]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 1DDAC1A6 for ; Tue, 29 Aug 2023 19:04:32 -0700 (PDT) Received: from smtp.kernel.org (relay.kernel.org [52.25.139.140]) (using TLSv1.3 with cipher TLS_AES_256_GCM_SHA384 (256/256 bits) key-exchange X25519 server-signature RSA-PSS (2048 bits)) (No client certificate requested) by dfw.source.kernel.org (Postfix) with ESMTPS id 987B863495 for ; Wed, 30 Aug 2023 02:04:31 +0000 (UTC) Received: by smtp.kernel.org (Postfix) with ESMTPSA id EA401C433C7; Wed, 30 Aug 2023 02:04:30 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1693361071; bh=G22AhGwxrFKyzxlFyo4gfj4CmonWiKhZbeFY3166wtc=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=u0QKQTNd71Em2SmIm8wPXa4T01oHr2MC+mpDoWKlBOzZJbWHsQgdlAlqc+ORT+wbJ taEsZU/9pSCw+CmeKn0DBr/bU5+wSDgrO9TD/GG30BMsfKPEtmyKLFJnDKem7BnBvr xU1y9vmtxPsxixvoSQ80vkartvwLKvSAzgkQ2Ir8SZThIEdsoqL9Vjk/NJpZkjOW2+ y/ok6tij79wKxV9QekgGUJfVMJDJC5HKG5XEgE7ZJnjfSaTVItDVegY1lWvtITEkx1 HbUgUROWGfgs8SDKTf8A+C/UTOsthWzesvybYiYGeQUicPu9GOmoX6ZwKxlJG7q5b+ cRGgFbjHJ4iyQ== Date: Tue, 29 Aug 2023 19:04:30 -0700 From: "Darrick J. Wong" To: Dave Chinner Cc: Eric Sandeen , xfs , shrikanth hegde , Ritesh Harjani Subject: Re: [RFC PATCH] xfs: load uncached unlinked inodes into memory on demand Message-ID: <20230830020430.GH28186@frogsfrogsfrogs> References: <20230829232043.GE28186@frogsfrogsfrogs> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-xfs@vger.kernel.org On Wed, Aug 30, 2023 at 10:13:28AM +1000, Dave Chinner wrote: > On Tue, Aug 29, 2023 at 04:20:43PM -0700, Darrick J. Wong wrote: > > From: Darrick J. Wong > > > > shrikanth hegde reports that filesystems fail shortly after mount with > > the following failure: > > > > WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs] > > > > This of course is the WARN_ON_ONCE in xfs_iunlink_lookup: > > > > ip = radix_tree_lookup(&pag->pag_ici_root, agino); > > if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... } > > > > From diagnostic data collected by the bug reporters, it would appear > > that we cleanly mounted a filesystem that contained unlinked inodes. > > Unlinked inodes are only processed as a final step of log recovery, > > which means that clean mounts do not process the unlinked list at all. > > > > Prior to the introduction of the incore unlinked lists, this wasn't a > > problem because the unlink code would (very expensively) traverse the > > entire ondisk metadata iunlink chain to keep things up to date. > > However, the incore unlinked list code complains when it realizes that > > it is out of sync with the ondisk metadata and shuts down the fs, which > > is bad. > > > > Ritesh proposed to solve this problem by unconditionally parsing the > > unlinked lists at mount time, but this imposes a mount time cost for > > every filesystem to catch something that should be very infrequent. > > Instead, let's target the places where we can encounter a next_unlinked > > pointer that refers to an inode that is not in cache, and load it into > > cache. > > > > Note: This patch does not address the problem of iget loading an inode > > from the middle of the iunlink list and needing to set i_prev_unlinked > > correctly. > > > > Link: https://lore.kernel.org/linux-xfs/e5004868-4a03-93e5-5077-e7ed0e533996@linux.vnet.ibm.com/ > > Reported-by: shrikanth hegde > > Triaged-by: Ritesh Harjani > > Signed-off-by: Darrick J. Wong > > --- > > fs/xfs/xfs_inode.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++-- > > fs/xfs/xfs_trace.h | 25 +++++++++++++++++++ > > 2 files changed, 92 insertions(+), 3 deletions(-) > > > > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c > > index 6ee266be45d4..3ab140ec09bb 100644 > > --- a/fs/xfs/xfs_inode.c > > +++ b/fs/xfs/xfs_inode.c > > @@ -1829,12 +1829,17 @@ xfs_iunlink_lookup( > > > > rcu_read_lock(); > > ip = radix_tree_lookup(&pag->pag_ici_root, agino); > > + if (!ip) { > > + /* Caller can handle inode not being in memory. */ > > + rcu_read_unlock(); > > + return NULL; > > + } > > > > /* > > - * Inode not in memory or in RCU freeing limbo should not happen. > > - * Warn about this and let the caller handle the failure. > > + * Inode in RCU freeing limbo should not happen. Warn about this and > > + * let the caller handle the failure. > > */ > > - if (WARN_ON_ONCE(!ip || !ip->i_ino)) { > > + if (WARN_ON_ONCE(!ip->i_ino)) { > > rcu_read_unlock(); > > return NULL; > > } > > I think we should still log a message about this situation, as it implies > that we had an unrecovered unlinked list on the filesystem and that > should "never happen" in normal conditions. > > i.e. something like: > > XFS(dev): Found unrecovered unlinked inodes in AG X. Runtime recovery initiated. > > which uses a perag state flag to only issue the message once per AG > per mount. At least this way, if we get weird stuff happening > because of loading an inode in the middle of an unlinked list (the > unhandled prev_agino case) we know why weird stuff might be > happening... Ok, I'll make that explicit. > > > @@ -1902,6 +1907,60 @@ xfs_iunlink_update_bucket( > > return 0; > > } > > > > +/* > > + * Load the inode @next_agino into the cache and set its prev_unlinked pointer > > + * to @prev_agino. Caller must hold the AGI to synchronize with other changes > > + * to the unlinked list. > > + */ > > +STATIC int > > +xfs_iunlink_reload_next( > > + struct xfs_trans *tp, > > + struct xfs_buf *agibp, > > + xfs_agino_t prev_agino, > > + xfs_agino_t next_agino) > > +{ > > + struct xfs_perag *pag = agibp->b_pag; > > + struct xfs_mount *mp = pag->pag_mount; > > + struct xfs_inode *next_ip = NULL; > > + xfs_ino_t ino; > > + int error; > > + > > + ASSERT(next_agino != NULLAGINO); > > + > > +#ifdef DEBUG > > + rcu_read_lock(); > > + next_ip = radix_tree_lookup(&pag->pag_ici_root, next_agino); > > + ASSERT(next_ip == NULL); > > + rcu_read_unlock(); > > +#endif > > + > > + ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, next_agino); > > + error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &next_ip); > > + if (error) > > + return error; > > WHy are we using XFS_IGET_UNTRUSTED here? A comment explaining why > we don't trust the agino on th eunlinked list we are about to try to > recover (i.e. trust!) would be good. /* * Use an untrusted lookup just to be cautious in case the AGI * has been corrupted and now points at a free inode. That * shouldn't happen, but we'd rather shut down now since we're * already running in a weird situation. */ > > + /* If this is not an unlinked inode, something is very wrong. */ > > + if (VFS_I(next_ip)->i_nlink != 0) { > > + error = -EFSCORRUPTED; > > + goto rele; > > + } > > *nod* > > > + > > + next_ip->i_prev_unlinked = prev_agino; > > + trace_xfs_iunlink_reload_next(next_ip); > > +rele: > > + /* > > + * We're running in transaction context, so we cannot run any inode > > + * release code. Clear DONTCACHE on this inode to prevent the VFS from > > + * initiating writeback and to force the irele to push this inode to > > + * the LRU instead of dropping it immediately. > > + */ > > + spin_lock(&VFS_I(next_ip)->i_lock); > > + VFS_I(next_ip)->i_state &= ~I_DONTCACHE; > > + spin_unlock(&VFS_I(next_ip)->i_lock); > > + xfs_irele(next_ip); > > Huh. We just loaded the next_ip into memory - how is it dirty, > and what writeback will happen? Also, how would I_DONTCACHE get set > in the first place here? Ah, that's a historical accident -- originally when I thought the possibility of unrecovered unlinked inodes was vanishingly small, I wrote a whole bunch of code into online repair to deal with reloading the incore list, etc. When I first started prototyping it, xchk_irele didn't exist yet, so any time I had to release an inode within a scrub transaction, I had to manually clear I_DONTCACHE. That got copied around everywhere in the scrub code, and then it got copied over when I started working on the runtime version. That's been lurking beyond the depths of djwong-wtf for quite a long time now, and I never got back to it until the heat started going up after 6.1. I think here it's not necessary since (as you point out) nobody can actually dirty the inode, nor can they set DONTCACHE. > > > + return error; > > +} > > + > > static int > > xfs_iunlink_insert_inode( > > struct xfs_trans *tp, > > @@ -1933,6 +1992,8 @@ xfs_iunlink_insert_inode( > > * inode. > > */ > > error = xfs_iunlink_update_backref(pag, agino, next_agino); > > + if (error == -ENOLINK) > > + error = xfs_iunlink_reload_next(tp, agibp, agino, next_agino); > > if (error) > > return error; > > Where does this -ENOLINK error come from? > xfs_iunlink_update_backref() returns either -EFSCORRUPTED or 0. Is > the patch missing hunks or is it dependent on some other patch that > does this? I forgot to copy that when I backported this patch from my dev tree to TOT. Welllllp thanks for catching that, now I can go restart the test fleet. /* Update the prev pointer of the next agino. */ static int xfs_iunlink_update_backref( struct xfs_perag *pag, xfs_agino_t prev_agino, xfs_agino_t next_agino) { struct xfs_inode *ip; /* No update necessary if we are at the end of the list. */ if (next_agino == NULLAGINO) return 0; ip = xfs_iunlink_lookup(pag, next_agino); if (!ip) return -ENOLINK; ip->i_prev_unlinked = prev_agino; return 0; } --D > Cheers, > > Dave. > -- > Dave Chinner > david@fromorbit.com