public inbox for linux-xfs@vger.kernel.org
From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH 1/3] xfs: invalidate block device page cache during unmount
Date: Mon, 28 Nov 2022 21:59:27 -0800	[thread overview]
Message-ID: <Y4Wfvzqd7YPkxb31@magnolia> (raw)
In-Reply-To: <20221129052322.GA3600936@dread.disaster.area>

On Tue, Nov 29, 2022 at 04:23:22PM +1100, Dave Chinner wrote:
> On Thu, Nov 24, 2022 at 08:59:24AM -0800, Darrick J. Wong wrote:
> > From: Darrick J. Wong <djwong@kernel.org>
> > 
> > Every now and then I see fstests failures on aarch64 (64k pages) that
> > trigger on the following sequence:
> > 
> > mkfs.xfs $dev
> > mount $dev $mnt
> > touch $mnt/a
> > umount $mnt
> > xfs_db -c 'path /a' -c 'print' $dev
> > 
> > 99% of the time this succeeds, but every now and then xfs_db cannot find
> > /a and fails.  This turns out to be a race involving udev/blkid, the
> > page cache for the block device, and the xfs_db process.
> > 
> > udev is triggered whenever anyone closes a block device or unmounts it.
> > The default udev rules invoke blkid to read the fs super and create
> > symlinks to the bdev under /dev/disk.  For this, it uses buffered reads
> > through the page cache.
> > 
> > xfs_db also uses buffered reads to examine metadata.  There is no
> > coordination between xfs_db and udev, which means that they can run
> > concurrently.  Note there is no coordination between the kernel and
> > blkid either.
> > 
> > On a system with 64k pages, the page cache can cache the superblock and
> > the root inode (and hence the root dir) with the same 64k page.  If
> > udev spawns blkid after the mkfs and the system is busy enough that it
> > is still running when xfs_db starts up, they'll both read from the same
> > page in the pagecache.
> > 
> > The unmount writes updated inode metadata to disk directly.  The XFS
> > buffer cache does not use the bdev pagecache, nor does it invalidate the
> > pagecache on umount.  If the above scenario occurs, the pagecache no
> > longer reflects what's on disk, xfs_db reads the stale metadata, and
> > fails to find /a.  Most of the time this succeeds because closing a bdev
> > invalidates the page cache, but when processes race, everyone loses.
> > 
> > Fix the problem by invalidating the bdev pagecache after flushing the
> > bdev, so that xfs_db will see up to date metadata.
> > 
> > Signed-off-by: Darrick J. Wong <djwong@kernel.org>
> > ---
> >  fs/xfs/xfs_buf.c |    1 +
> >  1 file changed, 1 insertion(+)
> > 
> > 
> > diff --git a/fs/xfs/xfs_buf.c b/fs/xfs/xfs_buf.c
> > index dde346450952..54c774af6e1c 100644
> > --- a/fs/xfs/xfs_buf.c
> > +++ b/fs/xfs/xfs_buf.c
> > @@ -1945,6 +1945,7 @@ xfs_free_buftarg(
> >  	list_lru_destroy(&btp->bt_lru);
> >  
> >  	blkdev_issue_flush(btp->bt_bdev);
> > +	invalidate_bdev(btp->bt_bdev);
> >  	fs_put_dax(btp->bt_daxdev, btp->bt_mount);
> >  
> >  	kmem_free(btp);
> 
> Looks OK, and because XFS can have multiple block devices (data,
> log, realtime), we have to do this invalidation for each bdev.
> 
> Reviewed-by: Dave Chinner <dchinner@redhat.com>
> 
> However: this does not look to be an XFS specific problem.  If we
> look at reconfigure_super(), when it completes a remount-ro
> operation it calls invalidate_bdev() because:
> 
>        /*
>          * Some filesystems modify their metadata via some other path than the
>          * bdev buffer cache (eg. use a private mapping, or directories in
>          * pagecache, etc). Also file data modifications go via their own
>          * mappings. So If we try to mount readonly then copy the filesystem
>          * from bdev, we could get stale data, so invalidate it to give a best
>          * effort at coherency.
>          */
>         if (remount_ro && sb->s_bdev)
>                 invalidate_bdev(sb->s_bdev);
> 
> This is pretty much the same problem as this patch avoids for XFS in
> the unmount path, yes? Shouldn't we be adding a call to
> invalidate_bdev(sb->s_bdev) after the fs->kill_sb() call in
> deactivate_locked_super() so that this problem goes away for all
> filesystems?

I'm not sure this applies to everyone -- AFAICT, ext2/4 still write
everything through the bdev page cache, which means that the
invalidation isn't necessary there, except for perhaps the MMP block.

Years ago I remember Andreas rolling his eyes at how the kernel would
usually drop the whole pagecache between umount and e2fsck starting.
But I guess that's *usually* what we get anyways, so adding an
invalidation everywhere for the long tail of simple bdev filesystems
wouldn't hurt much.  Hmm.  Ok, I'm more convinced now.

I'll ask on the ext4 concall this week, and in the meantime try to
figure out what's the deal with btrfs.

--D

> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

Thread overview: 18+ messages
2022-11-24 16:59 [PATCHSET 0/3] xfs: fixes for 6.2 Darrick J. Wong
2022-11-24 16:59 ` [PATCH 1/3] xfs: invalidate block device page cache during unmount Darrick J. Wong
2022-11-29  2:36   ` Gao Xiang
2022-11-29  5:23   ` Dave Chinner
2022-11-29  5:59     ` Darrick J. Wong [this message]
2022-11-24 16:59 ` [PATCH 2/3] xfs: use memcpy, not strncpy, to format the attr prefix during listxattr Darrick J. Wong
2022-11-29  2:37   ` Gao Xiang
2022-11-29  5:26   ` Dave Chinner
2022-11-24 16:59 ` [PATCH 3/3] xfs: shut up -Wuninitialized in xfsaild_push Darrick J. Wong
2022-11-29  3:00   ` Gao Xiang
2022-11-29  5:36   ` Dave Chinner
2022-11-27 18:36 ` [PATCH 4/3] xfs: attach dquots to inode before reading data/cow fork mappings Darrick J. Wong
2022-11-29  6:31   ` Dave Chinner
2022-11-29  6:50     ` Darrick J. Wong
2022-11-29  8:04       ` Dave Chinner
2022-11-29 21:03         ` Darrick J. Wong
2022-11-29 21:05   ` [PATCH v2 " Darrick J. Wong
2022-11-29 21:38     ` Dave Chinner
