From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: Eric Sandeen <sandeen@redhat.com>,
xfs <linux-xfs@vger.kernel.org>,
shrikanth hegde <sshegde@linux.vnet.ibm.com>,
Ritesh Harjani <ritesh.list@gmail.com>
Subject: Re: [RFC PATCH] xfs: load uncached unlinked inodes into memory on demand
Date: Tue, 29 Aug 2023 19:04:30 -0700
Message-ID: <20230830020430.GH28186@frogsfrogsfrogs>
In-Reply-To: <ZO6JqOBOOUCcS4ac@dread.disaster.area>
On Wed, Aug 30, 2023 at 10:13:28AM +1000, Dave Chinner wrote:
> On Tue, Aug 29, 2023 at 04:20:43PM -0700, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> >
> > shrikanth hegde reports that filesystems fail shortly after mount with
> > the following failure:
> >
> > WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
> >
> > This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:
> >
> > ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> > if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }
> >
> > From diagnostic data collected by the bug reporters, it would appear
> > that we cleanly mounted a filesystem that contained unlinked inodes.
> > Unlinked inodes are only processed as a final step of log recovery,
> > which means that clean mounts do not process the unlinked list at all.
> >
> > Prior to the introduction of the incore unlinked lists, this wasn't a
> > problem because the unlink code would (very expensively) traverse the
> > entire ondisk metadata iunlink chain to keep things up to date.
> > However, the incore unlinked list code complains when it realizes that
> > it is out of sync with the ondisk metadata and shuts down the fs, which
> > is bad.
> >
> > Ritesh proposed to solve this problem by unconditionally parsing the
> > unlinked lists at mount time, but this imposes a mount time cost for
> > every filesystem to catch something that should be very infrequent.
> > Instead, let's target the places where we can encounter a next_unlinked
> > pointer that refers to an inode that is not in cache, and load it into
> > cache.
> >
> > Note: This patch does not address the problem of iget loading an inode
> > from the middle of the iunlink list and needing to set i_prev_unlinked
> > correctly.
> >
> > Link: https://lore.kernel.org/linux-xfs/e5004868-4a03-93e5-5077-e7ed0e533996@linux.vnet.ibm.com/
> > Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
> > Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> > fs/xfs/xfs_inode.c | 70 ++++++++++++++++++++++++++++++++++++++++++++++++++--
> > fs/xfs/xfs_trace.h | 25 +++++++++++++++++++
> > 2 files changed, 92 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> > index 6ee266be45d4..3ab140ec09bb 100644
> > --- a/fs/xfs/xfs_inode.c
> > +++ b/fs/xfs/xfs_inode.c
> > @@ -1829,12 +1829,17 @@ xfs_iunlink_lookup(
> >
> > rcu_read_lock();
> > ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> > + if (!ip) {
> > + /* Caller can handle inode not being in memory. */
> > + rcu_read_unlock();
> > + return NULL;
> > + }
> >
> > /*
> > - * Inode not in memory or in RCU freeing limbo should not happen.
> > - * Warn about this and let the caller handle the failure.
> > + * Inode in RCU freeing limbo should not happen. Warn about this and
> > + * let the caller handle the failure.
> > */
> > - if (WARN_ON_ONCE(!ip || !ip->i_ino)) {
> > + if (WARN_ON_ONCE(!ip->i_ino)) {
> > rcu_read_unlock();
> > return NULL;
> > }
>
> I think we should still log a message about this situation, as it implies
> that we had an unrecovered unlinked list on the filesystem and that
> should "never happen" in normal conditions.
>
> i.e. something like:
>
> XFS(dev): Found unrecovered unlinked inodes in AG X. Runtime recovery initiated.
>
> which uses a perag state flag to only issue the message once per AG
> per mount. At least this way, if we get weird stuff happening
> because of loading an inode in the middle of an unlinked list (the
> unhandled prev_agino case) we know why weird stuff might be
> happening...
<nod> Ok, I'll make that explicit.
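
For illustration only, here is a minimal userspace sketch of the once-per-AG message described above. Everything named demo_* or DEMO_* is a hypothetical stand-in, not the kernel's struct xfs_perag or any existing XFS helper. It also assumes the caller already serializes access (as holding the AGI would), so a plain flag test is enough; kernel code would presumably want an atomic bit operation and the usual XFS logging helpers instead of printf().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical stand-in for per-AG state; not the kernel structure. */
struct demo_perag {
	unsigned int	agno;		/* AG number */
	unsigned long	opstate;	/* per-AG state flags */
};

/* Hypothetical flag: set once the warning has been issued for this AG. */
#define DEMO_AGSTATE_IUNLINK_WARNED	(1UL << 0)

/*
 * Emit the "unrecovered unlinked inodes" message at most once per AG.
 * Returns true if the message was printed by this call.
 * Assumes the caller serializes access to @pag.
 */
bool
demo_warn_unrecovered_iunlinks(
	struct demo_perag	*pag,
	const char		*dev)
{
	assert(pag != NULL);

	if (pag->opstate & DEMO_AGSTATE_IUNLINK_WARNED)
		return false;

	pag->opstate |= DEMO_AGSTATE_IUNLINK_WARNED;
	printf("XFS(%s): Found unrecovered unlinked inodes in AG %u. "
	       "Runtime recovery initiated.\n", dev, pag->agno);
	return true;
}
```

Calling this twice on the same demo_perag prints the message only the first time; a different AG gets its own message because the flag lives in the per-AG structure.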
>
> > @@ -1902,6 +1907,60 @@ xfs_iunlink_update_bucket(
> > return 0;
> > }
> >
> > +/*
> > + * Load the inode @next_agino into the cache and set its prev_unlinked pointer
> > + * to @prev_agino. Caller must hold the AGI to synchronize with other changes
> > + * to the unlinked list.
> > + */
> > +STATIC int
> > +xfs_iunlink_reload_next(
> > + struct xfs_trans *tp,
> > + struct xfs_buf *agibp,
> > + xfs_agino_t prev_agino,
> > + xfs_agino_t next_agino)
> > +{
> > + struct xfs_perag *pag = agibp->b_pag;
> > + struct xfs_mount *mp = pag->pag_mount;
> > + struct xfs_inode *next_ip = NULL;
> > + xfs_ino_t ino;
> > + int error;
> > +
> > + ASSERT(next_agino != NULLAGINO);
> > +
> > +#ifdef DEBUG
> > + rcu_read_lock();
> > + next_ip = radix_tree_lookup(&pag->pag_ici_root, next_agino);
> > + ASSERT(next_ip == NULL);
> > + rcu_read_unlock();
> > +#endif
> > +
> > + ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, next_agino);
> > + error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &next_ip);
> > + if (error)
> > + return error;
>
> Why are we using XFS_IGET_UNTRUSTED here? A comment explaining why
> we don't trust the agino on the unlinked list we are about to try to
> recover (i.e. trust!) would be good.
/*
* Use an untrusted lookup just to be cautious in case the AGI
* has been corrupted and now points at a free inode. That
* shouldn't happen, but we'd rather shut down now since we're
* already running in a weird situation.
*/
> > + /* If this is not an unlinked inode, something is very wrong. */
> > + if (VFS_I(next_ip)->i_nlink != 0) {
> > + error = -EFSCORRUPTED;
> > + goto rele;
> > + }
>
> *nod*
>
> > +
> > + next_ip->i_prev_unlinked = prev_agino;
> > + trace_xfs_iunlink_reload_next(next_ip);
> > +rele:
> > + /*
> > + * We're running in transaction context, so we cannot run any inode
> > + * release code. Clear DONTCACHE on this inode to prevent the VFS from
> > + * initiating writeback and to force the irele to push this inode to
> > + * the LRU instead of dropping it immediately.
> > + */
> > + spin_lock(&VFS_I(next_ip)->i_lock);
> > + VFS_I(next_ip)->i_state &= ~I_DONTCACHE;
> > + spin_unlock(&VFS_I(next_ip)->i_lock);
> > + xfs_irele(next_ip);
>
> Huh. We just loaded the next_ip into memory - how is it dirty,
> and what writeback will happen? Also, how would I_DONTCACHE get set
> in the first place here?
Ah, that's a historical accident -- originally when I thought the
possibility of unrecovered unlinked inodes was vanishingly small, I
wrote a whole bunch of code into online repair to deal with reloading
the incore list, etc.

When I first started prototyping it, xchk_irele didn't exist yet, so any
time I had to release an inode within a scrub transaction, I had to
manually clear I_DONTCACHE. That got copied around everywhere in the
scrub code, and then it got copied over when I started working on the
runtime version. That's been lurking beyond the depths of djwong-wtf
for quite a long time now, and I never got back to it until the heat
started going up after 6.1.

I think here it's not necessary since (as you point out) nobody can
actually dirty the inode, nor can they set DONTCACHE.
>
> > + return error;
> > +}
> > +
> > static int
> > xfs_iunlink_insert_inode(
> > struct xfs_trans *tp,
> > @@ -1933,6 +1992,8 @@ xfs_iunlink_insert_inode(
> > * inode.
> > */
> > error = xfs_iunlink_update_backref(pag, agino, next_agino);
> > + if (error == -ENOLINK)
> > + error = xfs_iunlink_reload_next(tp, agibp, agino, next_agino);
> > if (error)
> > return error;
>
> Where does this -ENOLINK error come from?
> xfs_iunlink_update_backref() returns either -EFSCORRUPTED or 0. Is
> the patch missing hunks or is it dependent on some other patch that
> does this?
<sigh> I forgot to copy that when I backported this patch from my dev
tree to TOT. Welllllp thanks for catching that, now I can go restart
the test fleet.

/* Update the prev pointer of the next agino. */
static int
xfs_iunlink_update_backref(
	struct xfs_perag	*pag,
	xfs_agino_t		prev_agino,
	xfs_agino_t		next_agino)
{
	struct xfs_inode	*ip;

	/* No update necessary if we are at the end of the list. */
	if (next_agino == NULLAGINO)
		return 0;

	ip = xfs_iunlink_lookup(pag, next_agino);
	if (!ip)
		return -ENOLINK;

	ip->i_prev_unlinked = prev_agino;
	return 0;
}

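
To make the control flow easier to follow, here is a toy userspace model of the -ENOLINK fallback. It is a sketch under stated assumptions, not kernel code: the inode cache is a small fixed array, "loading" an inode just sets a flag in place of xfs_iget(), and every demo_* or DEMO_* name is hypothetical. DEMO_EFSCORRUPTED is an arbitrary stand-in value for -EFSCORRUPTED.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

#define DEMO_NULLAGINO		((uint32_t)-1)	/* end-of-list marker */
#define DEMO_NR_INODES		8
#define DEMO_EFSCORRUPTED	EIO		/* stand-in for EFSCORRUPTED */

/* Toy inode: only the fields this sketch needs. */
struct demo_inode {
	int		cached;		/* nonzero if "in memory" */
	uint32_t	nlink;		/* zero for unlinked inodes */
	uint32_t	prev_unlinked;	/* incore back pointer */
};

/* Toy AG: the array index plays the role of the agino. */
struct demo_ag {
	struct demo_inode	inodes[DEMO_NR_INODES];
};

/* Set the back pointer of @next, or report that it is not cached. */
int
demo_update_backref(struct demo_ag *ag, uint32_t prev, uint32_t next)
{
	if (next == DEMO_NULLAGINO)
		return 0;
	assert(next < DEMO_NR_INODES);
	if (!ag->inodes[next].cached)
		return -ENOLINK;
	ag->inodes[next].prev_unlinked = prev;
	return 0;
}

/* "Load" @next into the cache and set its back pointer to @prev. */
int
demo_reload_next(struct demo_ag *ag, uint32_t prev, uint32_t next)
{
	struct demo_inode	*ip;

	assert(next < DEMO_NR_INODES);
	ip = &ag->inodes[next];
	ip->cached = 1;			/* stands in for xfs_iget() */

	/* An inode on the unlinked list must have a zero link count. */
	if (ip->nlink != 0)
		return -DEMO_EFSCORRUPTED;
	ip->prev_unlinked = prev;
	return 0;
}

/* Mirrors the shape of the hunk in xfs_iunlink_insert_inode(). */
int
demo_insert(struct demo_ag *ag, uint32_t agino, uint32_t next)
{
	int	error;

	error = demo_update_backref(ag, agino, next);
	if (error == -ENOLINK)
		error = demo_reload_next(ag, agino, next);
	return error;
}
```

The point of the distinct error code is that "not in cache" becomes a recoverable condition the caller can act on, while a nonzero link count on a reloaded inode still surfaces as corruption.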
--D
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com