From: Dave Chinner <david@fromorbit.com>
To: "Darrick J. Wong" <djwong@kernel.org>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/3] xfs: invalidate block device page cache during unmount
Date: Tue, 29 Nov 2022 16:23:22 +1100 [thread overview]
Message-ID: <20221129052322.GA3600936@dread.disaster.area> (raw)
In-Reply-To: <166930916399.2061853.16165124824627761814.stgit@magnolia>
On Thu, Nov 24, 2022 at 08:59:24AM -0800, Darrick J. Wong wrote:
> From: Darrick J. Wong <djwong@kernel.org>
>
> Every now and then I see fstests failures on aarch64 (64k pages) that
> trigger on the following sequence:
>
> mkfs.xfs $dev
> mount $dev $mnt
> touch $mnt/a
> umount $mnt
> xfs_db -c 'path /a' -c 'print' $dev
>
> 99% of the time this succeeds, but every now and then xfs_db cannot find
> /a and fails. This turns out to be a race involving udev/blkid, the
> page cache for the block device, and the xfs_db process.
>
> udev is triggered whenever anyone closes a block device or unmounts it.
> The default udev rules invoke blkid to read the fs super and create
> symlinks to the bdev under /dev/disk. For this, it uses buffered reads
> through the page cache.
>
> xfs_db also uses buffered reads to examine metadata. There is no
> coordination between xfs_db and udev, which means that they can run
> concurrently. Note there is no coordination between the kernel and
> blkid either.
>
> On a system with 64k pages, the page cache can cache the superblock and
> the root inode (and hence the root dir) with the same 64k page. If
> udev spawns blkid after the mkfs and the system is busy enough that it
> is still running when xfs_db starts up, they'll both read from the same
> page in the pagecache.
>
> The unmount writes updated inode metadata to disk directly. The XFS
> buffer cache does not use the bdev pagecache, nor does it invalidate the
> pagecache on umount. If the above scenario occurs, the pagecache no
> longer reflects what's on disk, xfs_db reads the stale metadata, and
> fails to find /a. Most of the time this succeeds because closing a bdev
> invalidates the page cache, but when processes race, everyone loses.
>
> Fix the problem by invalidating the bdev pagecache after flushing the
> bdev, so that xfs_db will see up to date metadata.
>
> Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> ---
> fs/xfs/xfs_buf.c | 1 +
> 1 file changed, 1 insertion(+)
>
>
> diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> index dde346450952..54c774af6e1c 100644
> --- a/fs/xfs/xfs_buf.c
> +++ b/fs/xfs/xfs_buf.c
> @@ -1945,6 +1945,7 @@ xfs_free_buftarg(
>  	list_lru_destroy(&btp->bt_lru);
>  
>  	blkdev_issue_flush(btp->bt_bdev);
> +	invalidate_bdev(btp->bt_bdev);
>  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
>  
>  	kmem_free(btp);
Looks OK. Because XFS can have multiple block devices (data, log and
realtime), we have to do this invalidation for each bdev.

Reviewed-by: Dave Chinner <dchinner@redhat.com>
However: this does not look like an XFS-specific problem. If we
look at reconfigure_super(), we see that when it completes a
remount-ro operation it calls invalidate_bdev() because:
	/*
	 * Some filesystems modify their metadata via some other path than the
	 * bdev buffer cache (eg. use a private mapping, or directories in
	 * pagecache, etc). Also file data modifications go via their own
	 * mappings. So If we try to mount readonly then copy the filesystem
	 * from bdev, we could get stale data, so invalidate it to give a best
	 * effort at coherency.
	 */
	if (remount_ro && sb->s_bdev)
		invalidate_bdev(sb->s_bdev);
This is pretty much the same problem as this patch avoids for XFS in
the unmount path, yes? Shouldn't we be adding a call to
invalidate_bdev(sb->s_bdev) after the fs->kill_sb() call in
deactivate_locked_super() so that this problem goes away for all
filesystems?
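
Something like the following untested sketch, maybe (whether
s->s_bdev is still safe to touch after ->kill_sb() has dropped the
bdev reference would need checking):

```
static void deactivate_locked_super(struct super_block *s)
{
	...
		fs->kill_sb(s);
		/* sketch: invalidate stale bdev pagecache for everyone */
		if (s->s_bdev)
			invalidate_bdev(s->s_bdev);
	...
}
```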
Cheers,
Dave.
--
Dave Chinner
david@fromorbit.com