From: "Darrick J. Wong" <djwong@kernel.org>
To: Chandan Babu R <chandanrlinux@gmail.com>
Cc: "Ritesh Harjani (IBM)" <ritesh.list@gmail.com>,
Dave Chinner <david@fromorbit.com>,
Eric Sandeen <sandeen@redhat.com>,
xfs <linux-xfs@vger.kernel.org>,
shrikanth hegde <sshegde@linux.vnet.ibm.com>,
Bill O'Donnell <bodonnel@redhat.com>,
Eric Sandeen <sandeen@sandeen.net>
Subject: Re: [PATCH v3] xfs: load uncached unlinked inodes into memory on demand
Date: Fri, 1 Sep 2023 08:57:49 -0700
Message-ID: <20230901155749.GS28186@frogsfrogsfrogs>
In-Reply-To: <20230901150311.GR28186@frogsfrogsfrogs>
On Fri, Sep 01, 2023 at 08:03:11AM -0700, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> shrikanth hegde reports that filesystems fail shortly after mount with
> the following failure:
>
> WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs]
>
> This of course is the WARN_ON_ONCE in xfs_iunlink_lookup:
>
> ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... }
>
> From diagnostic data collected by the bug reporters, it would appear
> that we cleanly mounted a filesystem that contained unlinked inodes.
> Unlinked inodes are only processed as a final step of log recovery,
> which means that clean mounts do not process the unlinked list at all.
>
> Prior to the introduction of the incore unlinked lists, this wasn't a
> problem because the unlink code would (very expensively) traverse the
> entire ondisk metadata iunlink chain to keep things up to date.
> However, the incore unlinked list code complains when it realizes that
> it is out of sync with the ondisk metadata and shuts down the fs, which
> is bad.
>
> Ritesh proposed to solve this problem by unconditionally parsing the
> unlinked lists at mount time, but this imposes a mount time cost for
> every filesystem to catch something that should be very infrequent.
> Instead, let's target the places where we can encounter a next_unlinked
> pointer that refers to an inode that is not in cache, and load it into
> cache.
>
> Note: This patch does not address the problem of iget loading an inode
> from the middle of the iunlink list and needing to set i_prev_unlinked
> correctly.
>
> Eric Sandeen adds:
>
> "One way to end up in this situation is to have at one point run a very
> old kernel which did not contain this commit, merged in kernel v4.14:
>
> commit 6f4a1eefdd0ad4561543270a7fceadabcca075dd
> Author: Eric Sandeen <sandeen@sandeen.net>
> Date: Tue Aug 8 18:21:49 2017 -0700
>
> xfs: toggle readonly state around xfs_log_mount_finish
>
> When we do log recovery on a readonly mount, unlinked inode
> processing does not happen due to the readonly checks in
> xfs_inactive(), which are trying to prevent any I/O on a
> readonly mount.
>
> This is misguided - we do I/O on readonly mounts all the time,
> for consistency; for example, log recovery. So do the same
> RDONLY flag twiddling around xfs_log_mount_finish() as we
> do around xfs_log_mount(), for the same reason.
>
> This all cries out for a big rework but for now this is a
> simple fix to an obvious problem.
>
> Signed-off-by: Eric Sandeen <sandeen@redhat.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>
> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
>
> "so if you:
>
> 1) Crash with unlinked inodes
> 2) mount -o ro <recovers log but skips unlinked inode recovery>
> 3) mount -o remount,rw
> 4) umount <writes clean log record>
>
> "You now have a filesystem with on-disk unlinked inodes and a clean log,
> and those inodes won't get cleaned up until log recovery runs again or
> xfs_repair is run.
>
> "And in testing an old OS (RHEL7) it does seem that the root filesystem
> goes through a mount -o ro, mount -o remount,rw transition at boot time.
> So this situation may be somewhat common."
>
> Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com>
> Triaged-by: Ritesh Harjani <ritesh.list@gmail.com>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
> ---
> v3: add RVB tags and historical context from sandeen
> v2: log that we're doing runtime recovery, don't mess with DONTCACHE,
> and actually return ENOLINK
> ---
> fs/xfs/xfs_inode.c | 75 +++++++++++++++++++++++++++++++++++++++++++++++++---
> fs/xfs/xfs_trace.h | 25 +++++++++++++++++
> 2 files changed, 96 insertions(+), 4 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 6ee266be45d4..2942002560b5 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1829,12 +1829,17 @@ xfs_iunlink_lookup(
<sigh> Somehow between the four copies of this patch that I have flying
around (djwong-dev, patchmail, linus TOT, old 6.6 merge branch) I once
again lost the comment change for this function, and apparently never
actually sent that to the list.

"A poor workman blames his tools", etc.

I bet that workman doesn't have to do this much manual paperwork
either...

--D
>
>  	rcu_read_lock();
>  	ip = radix_tree_lookup(&pag->pag_ici_root, agino);
> +	if (!ip) {
> +		/* Caller can handle inode not being in memory. */
> +		rcu_read_unlock();
> +		return NULL;
> +	}
>
>  	/*
> -	 * Inode not in memory or in RCU freeing limbo should not happen.
> -	 * Warn about this and let the caller handle the failure.
> +	 * Inode in RCU freeing limbo should not happen. Warn about this and
> +	 * let the caller handle the failure.
>  	 */
> -	if (WARN_ON_ONCE(!ip || !ip->i_ino)) {
> +	if (WARN_ON_ONCE(!ip->i_ino)) {
>  		rcu_read_unlock();
>  		return NULL;
>  	}
> @@ -1858,7 +1863,8 @@ xfs_iunlink_update_backref(
>
>  	ip = xfs_iunlink_lookup(pag, next_agino);
>  	if (!ip)
> -		return -EFSCORRUPTED;
> +		return -ENOLINK;
> +
>  	ip->i_prev_unlinked = prev_agino;
>  	return 0;
>  }
> @@ -1902,6 +1908,62 @@ xfs_iunlink_update_bucket(
>  	return 0;
>  }
>
> +/*
> + * Load the inode @next_agino into the cache and set its prev_unlinked pointer
> + * to @prev_agino. Caller must hold the AGI to synchronize with other changes
> + * to the unlinked list.
> + */
> +STATIC int
> +xfs_iunlink_reload_next(
> +	struct xfs_trans	*tp,
> +	struct xfs_buf		*agibp,
> +	xfs_agino_t		prev_agino,
> +	xfs_agino_t		next_agino)
> +{
> +	struct xfs_perag	*pag = agibp->b_pag;
> +	struct xfs_mount	*mp = pag->pag_mount;
> +	struct xfs_inode	*next_ip = NULL;
> +	xfs_ino_t		ino;
> +	int			error;
> +
> +	ASSERT(next_agino != NULLAGINO);
> +
> +#ifdef DEBUG
> +	rcu_read_lock();
> +	next_ip = radix_tree_lookup(&pag->pag_ici_root, next_agino);
> +	ASSERT(next_ip == NULL);
> +	rcu_read_unlock();
> +#endif
> +
> +	xfs_info_ratelimited(mp,
> + "Found unrecovered unlinked inode 0x%x in AG 0x%x. Initiating recovery.",
> +			next_agino, pag->pag_agno);
> +
> +	/*
> +	 * Use an untrusted lookup just to be cautious in case the AGI has been
> +	 * corrupted and now points at a free inode. That shouldn't happen,
> +	 * but we'd rather shut down now since we're already running in a weird
> +	 * situation.
> +	 */
> +	ino = XFS_AGINO_TO_INO(mp, pag->pag_agno, next_agino);
> +	error = xfs_iget(mp, tp, ino, XFS_IGET_UNTRUSTED, 0, &next_ip);
> +	if (error)
> +		return error;
> +
> +	/* If this is not an unlinked inode, something is very wrong. */
> +	if (VFS_I(next_ip)->i_nlink != 0) {
> +		error = -EFSCORRUPTED;
> +		goto rele;
> +	}
> +
> +	next_ip->i_prev_unlinked = prev_agino;
> +	trace_xfs_iunlink_reload_next(next_ip);
> +rele:
> +	ASSERT(!(VFS_I(next_ip)->i_state & I_DONTCACHE));
> +	xfs_irele(next_ip);
> +	return error;
> +}
> +
>  static int
>  xfs_iunlink_insert_inode(
>  	struct xfs_trans	*tp,
> @@ -1933,6 +1995,8 @@ xfs_iunlink_insert_inode(
>  	 * inode.
>  	 */
>  	error = xfs_iunlink_update_backref(pag, agino, next_agino);
> +	if (error == -ENOLINK)
> +		error = xfs_iunlink_reload_next(tp, agibp, agino, next_agino);
>  	if (error)
>  		return error;
>
> @@ -2027,6 +2091,9 @@ xfs_iunlink_remove_inode(
>  	 */
>  	error = xfs_iunlink_update_backref(pag, ip->i_prev_unlinked,
>  			ip->i_next_unlinked);
> +	if (error == -ENOLINK)
> +		error = xfs_iunlink_reload_next(tp, agibp, ip->i_prev_unlinked,
> +				ip->i_next_unlinked);
>  	if (error)
>  		return error;
>
> diff --git a/fs/xfs/xfs_trace.h b/fs/xfs/xfs_trace.h
> index 36bd42ed9ec8..f4e46bac9b91 100644
> --- a/fs/xfs/xfs_trace.h
> +++ b/fs/xfs/xfs_trace.h
> @@ -3832,6 +3832,31 @@ TRACE_EVENT(xfs_iunlink_update_dinode,
> __entry->new_ptr)
> );
>
> +TRACE_EVENT(xfs_iunlink_reload_next,
> +	TP_PROTO(struct xfs_inode *ip),
> +	TP_ARGS(ip),
> +	TP_STRUCT__entry(
> +		__field(dev_t, dev)
> +		__field(xfs_agnumber_t, agno)
> +		__field(xfs_agino_t, agino)
> +		__field(xfs_agino_t, prev_agino)
> +		__field(xfs_agino_t, next_agino)
> +	),
> +	TP_fast_assign(
> +		__entry->dev = ip->i_mount->m_super->s_dev;
> +		__entry->agno = XFS_INO_TO_AGNO(ip->i_mount, ip->i_ino);
> +		__entry->agino = XFS_INO_TO_AGINO(ip->i_mount, ip->i_ino);
> +		__entry->prev_agino = ip->i_prev_unlinked;
> +		__entry->next_agino = ip->i_next_unlinked;
> +	),
> +	TP_printk("dev %d:%d agno 0x%x agino 0x%x prev_unlinked 0x%x next_unlinked 0x%x",
> +		  MAJOR(__entry->dev), MINOR(__entry->dev),
> +		  __entry->agno,
> +		  __entry->agino,
> +		  __entry->prev_agino,
> +		  __entry->next_agino)
> +);
> +
>  DECLARE_EVENT_CLASS(xfs_ag_inode_class,
>  	TP_PROTO(struct xfs_inode *ip),
>  	TP_ARGS(ip),