Re: [PATCH] xfs: recheck appropriateness of map_shared lock

public inbox for linux-xfs@vger.kernel.org

From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: recheck appropriateness of map_shared lock
Date: Thu, 19 Jan 2023 16:14:11 +1100
Message-ID: <20230119051411.GJ360264@dread.disaster.area>
In-Reply-To: <Y8ib6ls32e/pJezE@magnolia>

On Wed, Jan 18, 2023 at 05:24:58PM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
> 
> While fuzzing the data fork extent count on a btree-format directory
> with xfs/375, I observed the following (excerpted) splat:
> 
> XFS: Assertion failed: xfs_isilocked(ip, XFS_ILOCK_EXCL), file: fs/xfs/libxfs/xfs_bmap.c, line: 1208
> ------------[ cut here ]------------
> WARNING: CPU: 0 PID: 43192 at fs/xfs/xfs_message.c:104 assfail+0x46/0x4a [xfs]
> Call Trace:
>  <TASK>
>  xfs_iread_extents+0x1af/0x210 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_dir_walk+0xb8/0x190 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent_count_parent_dentries+0x41/0x80 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent_validate+0x199/0x2e0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xchk_parent+0xdf/0x130 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_scrub_metadata+0x2b8/0x730 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_scrubv_metadata+0x38b/0x4d0 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_ioc_scrubv_metadata+0x111/0x160 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  xfs_file_ioctl+0x367/0xf50 [xfs 09f66509ece4938760fac7de64732a0cbd3e39cd]
>  __x64_sys_ioctl+0x82/0xa0
>  do_syscall_64+0x2b/0x80
>  entry_SYSCALL_64_after_hwframe+0x46/0xb0
> 
> The cause of this is a race condition in xfs_ilock_data_map_shared,
> which performs an unlocked access to the data fork to guess which lock
> mode it needs:
> 
> Thread 0                          Thread 1
> 
> xfs_need_iread_extents
> <observe no iext tree>
> xfs_ilock(..., ILOCK_EXCL)
> xfs_iread_extents
> <observe no iext tree>
> <check ILOCK_EXCL>
> <load bmbt extents into iext>
> <notice iext size doesn't
>  match nextents>
>                                   xfs_need_iread_extents
>                                   <observe iext tree>
>                                   xfs_ilock(..., ILOCK_SHARED)
> <tear down iext tree>
> xfs_iunlock(..., ILOCK_EXCL)
>                                   xfs_iread_extents
>                                   <observe no iext tree>
>                                   <check ILOCK_EXCL>
>                                   *BOOM*
> 
> Mitigate this race by having thread 1 recheck xfs_need_iread_extents
> after taking the shared ILOCK.  If the iext tree isn't present, then we
> need to upgrade to the exclusive ILOCK to try to load the bmbt.

Yup, I see the problem - this check is failing:

        if (XFS_IS_CORRUPT(mp, ir.loaded != ifp->if_nextents)) {
                error = -EFSCORRUPTED;
                goto out;
        }

and that results in calling xfs_iext_destroy() to tear down the
extent tree.

But we know the BMBT is corrupted and the extent list cannot be read
until the corruption is fixed. IOWs, we can't access any data in the
inode no matter how we lock it until the corruption is repaired.
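
The interleaving in the quoted diagram can be replayed deterministically
in userspace. This is only a minimal sketch: every model_* name and the
MODEL_EFSCORRUPTED value are hypothetical stand-ins, not kernel code. It
shows the count check tearing the temporary tree down after the second
thread has already picked the shared lock mode.

```c
/* Userspace model only: none of these types or helpers are the kernel's. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_EFSCORRUPTED 117	/* stand-in error value (assumption) */

enum model_lock { MODEL_SHARED, MODEL_EXCL };

struct model_fork {
	bool		iext_present;	/* in-core extent tree exists */
	unsigned	ondisk_recs;	/* records actually found in the bmbt */
	unsigned	nextents;	/* extent count the inode claims (fuzzed) */
};

/* Models the unlocked peek: true when the in-core tree must be loaded. */
static bool model_need_iread_extents(const struct model_fork *f)
{
	return !f->iext_present;
}

/* Pick a lock mode from the unlocked peek, as the map_shared helper does. */
static enum model_lock model_pick_lock(const struct model_fork *f)
{
	return model_need_iread_extents(f) ? MODEL_EXCL : MODEL_SHARED;
}

/* First half of a load: the in-core tree becomes visible to other threads. */
static void model_load_begin(struct model_fork *f)
{
	f->iext_present = true;
}

/* Second half: validate the record count; tear the tree down on mismatch. */
static int model_load_finish(struct model_fork *f)
{
	if (f->ondisk_recs != f->nextents) {
		f->iext_present = false;
		return -MODEL_EFSCORRUPTED;
	}
	return 0;
}

/*
 * Replay the diagram step by step.  Returns true if thread 1 ends up
 * needing to load extents while holding only the shared lock mode, which
 * is the state that trips the ILOCK_EXCL assertion.
 */
static bool model_replay_race(struct model_fork *f)
{
	/* Thread 0: sees no tree, so it picks exclusive and starts loading. */
	enum model_lock t0 = model_pick_lock(f);
	assert(t0 == MODEL_EXCL);
	(void)t0;
	model_load_begin(f);

	/* Thread 1: peeks while the temporary tree exists, picks shared. */
	enum model_lock t1 = model_pick_lock(f);

	/* Thread 0: validates the count, possibly tears down, then unlocks. */
	(void)model_load_finish(f);

	/* Thread 1: now holds its chosen mode and looks at the fork again. */
	return model_need_iread_extents(f) && t1 != MODEL_EXCL;
}
```

Calling model_replay_race() on a fork whose ondisk_recs and nextents
differ reports the bad state. Retrying the load afterwards fails the same
count check again, whichever lock mode is held.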

> 
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
>  fs/xfs/xfs_inode.c |   29 +++++++++++++++++++++++++++++
>  1 file changed, 29 insertions(+)
> 
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index d354ea2b74f9..6ce1e0e9f256 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -117,6 +117,20 @@ xfs_ilock_data_map_shared(
>  	if (xfs_need_iread_extents(&ip->i_df))
>  		lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(ip, lock_mode);
> +
> +	/*
> +	 * It's possible that the unlocked access of the data fork to determine
> +	 * the lock mode could have raced with another thread that was failing
> +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> +	 * the lock mode and upgrade to an exclusive lock if we need to.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    xfs_need_iread_extents(&ip->i_df)) {
> +		xfs_iunlock(ip, lock_mode);
> +		lock_mode = XFS_ILOCK_EXCL;
> +		xfs_ilock(ip, lock_mode);
> +	}

.... and this makes me cringe. :/

If we hit this race condition, re-reading the extent list from disk
isn't going to fix the corruption, so I don't see much point in
papering over the problem just by changing the locking and failing
to read in the extent list again and returning -EFSCORRUPTED to the
operation.

So.... shouldn't we mark the inode as sick when we detect the extent
list corruption issue? i.e. before destroying the iext tree, call
xfs_inode_mark_sick(ip, XFS_SICK_INO_BMBTD) (or BMBTA, depending on
the fork being read) so that there is a record of the BMBT being
corrupt?

That would mean that this path simply becomes:

	if (ip->i_sick & XFS_SICK_INO_BMBTD) {
		xfs_iunlock(ip, lock_mode);
		return -EFSCORRUPTED;
	}

Which now makes it pretty clear that there's no point continuing
because we can't read in the extent list, and in doing so we've
removed the race condition caused by temporarily filling the in-core
extent list.
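
As a rough userspace sketch of that idea (the model_* names, the flag
value and MODEL_EFSCORRUPTED are hypothetical stand-ins, and real kernel
code would need proper locking around the flag), the load path records
the corruption before tearing the tree down, and later callers bail out
on the flag instead of retrying the load:

```c
/* Userspace model only: none of these types or helpers are the kernel's. */
#include <assert.h>
#include <stdbool.h>

#define MODEL_EFSCORRUPTED	117		/* stand-in error value (assumption) */
#define MODEL_SICK_BMBTD	(1u << 0)	/* stand-in for the data fork sick bit */

struct model_inode {
	unsigned	sick;		/* sticky health flags */
	bool		iext_present;	/* in-core extent tree exists */
	unsigned	ondisk_recs;	/* records actually found in the bmbt */
	unsigned	nextents;	/* extent count the inode claims */
};

/*
 * Load the extent list.  On a count mismatch, mark the inode sick first
 * and only then destroy the temporary in-core tree, so the record of the
 * corruption outlives the tree.
 */
static int model_iread_extents(struct model_inode *ip)
{
	ip->iext_present = true;
	if (ip->ondisk_recs != ip->nextents) {
		ip->sick |= MODEL_SICK_BMBTD;
		ip->iext_present = false;
		return -MODEL_EFSCORRUPTED;
	}
	return 0;
}

/*
 * Check a caller would make after taking the lock: a sick data fork means
 * the extent list cannot be read, so fail without attempting another load.
 */
static int model_map_shared_check(const struct model_inode *ip)
{
	if (ip->sick & MODEL_SICK_BMBTD)
		return -MODEL_EFSCORRUPTED;
	return 0;
}
```

Once the flag is set, every later caller gets the same answer, whether
or not the in-core tree happened to be visible when it peeked.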

> +
>  	return lock_mode;
>  }
>  
> @@ -129,6 +143,21 @@ xfs_ilock_attr_map_shared(
>  	if (xfs_inode_has_attr_fork(ip) && xfs_need_iread_extents(&ip->i_af))
>  		lock_mode = XFS_ILOCK_EXCL;
>  	xfs_ilock(ip, lock_mode);
> +
> +	/*
> +	 * It's possible that the unlocked access of the attr fork to determine
> +	 * the lock mode could have raced with another thread that was failing
> +	 * to load the bmbt but hadn't yet torn down the iext tree.  Recheck
> +	 * the lock mode and upgrade to an exclusive lock if we need to.
> +	 */
> +	if (lock_mode == XFS_ILOCK_SHARED &&
> +	    xfs_inode_has_attr_fork(ip) &&
> +	    xfs_need_iread_extents(&ip->i_af)) {
> +		xfs_iunlock(ip, lock_mode);
> +		lock_mode = XFS_ILOCK_EXCL;
> +		xfs_ilock(ip, lock_mode);
> +	}

And this can just check for XFS_SICK_INO_BMBTA instead...

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


Thread overview: 7+ messages
2023-01-19  1:24 [PATCH] xfs: recheck appropriateness of map_shared lock Darrick J. Wong
2023-01-19  5:14 ` Dave Chinner [this message]
2023-01-19 18:39   ` Christoph Hellwig
2023-01-19 20:34     ` Dave Chinner
2023-02-28 20:08   ` Darrick J. Wong
2023-01-19 18:31 ` Christoph Hellwig
2023-04-11  1:05   ` Darrick J. Wong
