Re: [PATCH 6/7] xfs: use iomap_valid method to detect stale cached iomaps

All of lore.kernel.org
 help / color / mirror / Atom feed

From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 6/7] xfs: use iomap_valid method to detect stale cached iomaps
Date: Mon, 7 Nov 2022 16:00:56 -0800	[thread overview]
Message-ID: <Y2mcOCpKiDb4nf1X@magnolia> (raw)
In-Reply-To: <20221102223605.GC3600936@dread.disaster.area>

On Thu, Nov 03, 2022 at 09:36:05AM +1100, Dave Chinner wrote:
> On Wed, Nov 02, 2022 at 10:19:33AM -0700, Darrick J. Wong wrote:
> > On Tue, Nov 01, 2022 at 11:34:11AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > > 
> > > Now that iomap supports a mechanism to validate cached iomaps for
> > > buffered write operations, hook it up to the XFS buffered write ops
> > > so that we can avoid data corruptions that result from stale cached
> > > iomaps. See:
> > > 
> > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> > > 
> > > or the ->iomap_valid() introduction commit for exact details of the
> > > corruption vector.
> > > 
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |  4 +--
> > >  fs/xfs/xfs_aops.c        |  2 +-
> > >  fs/xfs/xfs_iomap.c       | 69 ++++++++++++++++++++++++++++++----------
> > >  fs/xfs/xfs_iomap.h       |  4 +--
> > >  fs/xfs/xfs_pnfs.c        |  5 +--
> > >  5 files changed, 61 insertions(+), 23 deletions(-)
> > > 
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 49d0d4ea63fc..db225130618c 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -4551,8 +4551,8 @@ xfs_bmapi_convert_delalloc(
> > >  	 * the extent.  Just return the real extent at this offset.
> > >  	 */
> > >  	if (!isnullstartblock(bma.got.br_startblock)) {
> > > -		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
> > >  		*seq = READ_ONCE(ifp->if_seq);
> > > +		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
> > >  		goto out_trans_cancel;
> > >  	}
> > >  
> > > @@ -4599,8 +4599,8 @@ xfs_bmapi_convert_delalloc(
> > >  	XFS_STATS_INC(mp, xs_xstrat_quick);
> > >  
> > >  	ASSERT(!isnullstartblock(bma.got.br_startblock));
> > > -	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
> > >  	*seq = READ_ONCE(ifp->if_seq);
> > > +	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
> > >  
> > >  	if (whichfork == XFS_COW_FORK)
> > 
> > Hmm.  @ifp here could be the cow fork, which means *seq will be from the
> > cow fork.  This I think is going to cause problems in
> > xfs_buffered_write_iomap_valid...
> 
> We can't get here from the buffered write path. This can only be
> reached by the writeback path via:
> 
> iomap_do_writepage
>   iomap_writepage_map
>     xfs_map_blocks
>       xfs_convert_blocks
>         xfs_bmapi_convert_delalloc
> 
> Hence the sequence numbers stored in iomap->private here are
> completely ignored by the writeback code. They still get stored
> in the struct xfs_writepage_ctx and checked by xfs_imap_valid(), so
> the changes here were really just making sure all the
> xfs_bmbt_to_iomap() callers stored the sequence number consistently.
> 
> > > @@ -839,24 +848,25 @@ xfs_direct_write_iomap_begin(
> > >  	xfs_iunlock(ip, lockmode);
> > >  
> > >  	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
> > > -			flags, &imap);
> > > +			flags, &imap, &seq);
> > >  	if (error)
> > >  		return error;
> > >  
> > >  	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
> > >  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
> > > -				 iomap_flags | IOMAP_F_NEW);
> > > +				 iomap_flags | IOMAP_F_NEW, seq);
> > >  
> > >  out_found_cow:
> > > +	seq = READ_ONCE(ip->i_df.if_seq);
> > >  	xfs_iunlock(ip, lockmode);
> > >  	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
> > >  	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
> > >  	if (imap.br_startblock != HOLESTARTBLOCK) {
> > > -		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
> > > +		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
> > >  		if (error)
> > >  			return error;
> > >  	}
> > > -	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED);
> > > +	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
> > 
> > Here we've found a mapping from the COW fork and we're encoding it into
> > the struct iomap.  Why is the sequence number from the *data* fork and
> > not the COW fork?
> 
> Because my brain isn't big enough to hold all this code straight in
> my head.  I found it all too easy to get confused by what iomap is
> passed where and how to tell them apart and I could not tell if
> there was an easy way to tell if the iomap passed to
> xfs_iomap_valid() needed to check against the data or cow fork.
> 
> Hence I set it up such that the xfs_iomap_valid() callback only
> checks the data fork sequence so the only correct thing to set in
> the iomap is the data fork sequence and moved on to the next part of
> the problem...
> 
> But, again, this information probably should be encoded into the
> sequence cookie we store. Which begs the question - if we are doing
> a COW operation, we have to check that neither the data fork (the
> srcmap) nor the COW fork (the iter->iomap) have changed, right? This
> is what the writeback code does (i.e. checks both data and COW fork
> sequence numbers), so perhaps that's also what we should be doing
> here?
> 
> i.e. either the cookie for COW operations needs to contain both
> data+cow sequence numbers, and both need to match to proceed, or we
> have to validate both the srcmap and the iter->iomap with separate
> callouts once the folio is locked.
> 
> Thoughts?

As we sorta discussed last week on IRC, I guess you could start by
encoding both values (as needed) in something fugly like this in struct
iomap:

union {
	void *private;
	u64 cookie;
};

But the longer term solution is (I suspect) to make a new struct that
XFS can envelop:

struct iomap_pagecache_ctx {
	struct iomap_iter	iter;	/* do not touch */
	struct iomap_page_ops	*ops;	/* callers can attach here */
};

struct xfs_pagecache_ctx {
	struct iomap_pagecache_ctx	xpctx;
	u32		data_seq;
	u32		cow_seq;
};

and pass that into the iomap functions:

	struct xfs_pagecache_ctx	ctx = { };

	ret = iomap_file_buffered_write(iocb, from, &ctx,
			&xfs_buffered_write_iomap_ops);

so that xfs_iomap_revalidate can do the usual struct unwrapping:

	struct xfs_pagecache_ctx	*xpctx;

	xpctx = container_of(ipctx, struct xfs_pagecache_ctx, ipctx);
	if (xpctx->data_seq != ip->i_df.df_seq)
		return NOPE;

But that's a pretty aggressive refactoring, not suitable for a bugfix,
do it afterwards while you're waiting for the rest of us to check over
part 1, etc.

--D

> > > @@ -1328,9 +1342,26 @@ xfs_buffered_write_iomap_end(
> > >  	return 0;
> > >  }
> > >  
> > > +/*
> > > + * Check that the iomap passed to us is still valid for the given offset and
> > > + * length.
> > > + */
> > > +static bool
> > > +xfs_buffered_write_iomap_valid(
> > > +	struct inode		*inode,
> > > +	const struct iomap	*iomap)
> > > +{
> > > +	int			seq = *((int *)&iomap->private);
> > > +
> > > +	if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq))
> > 
> > Why is it ok to sample the data fork's sequence number even if the
> > mapping came from the COW fork?  That doesn't make sense to me,
> > conceptually.  I definitely think it's buggy that the revalidation
> > might not sample from the same sequence counter as the original
> > measurement.
> 
> Yup, see above.
> 
> > >  const struct iomap_ops xfs_read_iomap_ops = {
> > > @@ -1438,7 +1471,7 @@ xfs_seek_iomap_begin(
> > >  			end_fsb = min(end_fsb, data_fsb);
> > >  		xfs_trim_extent(&cmap, offset_fsb, end_fsb);
> > >  		error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
> > > -					  IOMAP_F_SHARED);
> > > +				IOMAP_F_SHARED, READ_ONCE(ip->i_cowfp->if_seq));
> > 
> > Here we explicitly sample from the cow fork but the comparison is done
> > against the data fork.  That /probably/ doesn't matter for reads since
> > you're not proposing that we revalidate on those.  Should we?
> 
> As I said elsewhere, I haven't even looked at the read path. The
> iomap code is now complex enough that I have trouble following it
> and keeping all the corner cases in my head. This is the down side
> of having people smarter than you write the code you need to debug
> when shit inevitably goes wrong.
> 
> > What
> > happens if userspace hands us a large read() from an unwritten extent
> > and the same dirty/writeback/reclaim sequence happens to the last folio
> > in that read() range before read_iter actually gets there?
> 
> No idea - shit may well go wrong there in the same way, but I've had
> my hands full just fixing the write path. I'm *really* tired of
> fighting with the iomap code right now...
> 
> > > @@ -189,7 +190,7 @@ xfs_fs_map_blocks(
> > >  		xfs_iunlock(ip, lock_flags);
> > >  
> > >  		error = xfs_iomap_write_direct(ip, offset_fsb,
> > > -				end_fsb - offset_fsb, 0, &imap);
> > > +				end_fsb - offset_fsb, 0, &imap, &seq);
> > 
> > I got a compiler warning about seq not being set in the case where we
> > don't call xfs_iomap_write_direct.
> 
> Yup, gcc 11.2 doesn't warn about it. Now to find out why my build
> system is using gcc 11.2 when I upgraded it months ago to use
> gcc-12 because of exactly these sorts of issues.... :(

--D

> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com

next prev parent reply	other threads:[~2022-11-08  0:01 UTC|newest]

Thread overview: 43+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-11-01  0:34 xfs, iomap: fix data corrupton due to stale cached iomaps Dave Chinner
2022-11-01  0:34 ` [PATCH 1/7] xfs: write page faults in iomap are not buffered writes Dave Chinner
2022-11-02  7:17   ` Christoph Hellwig
2022-11-02 16:12   ` Darrick J. Wong
2022-11-02 21:11     ` Dave Chinner
2022-11-01  0:34 ` [PATCH 2/7] xfs: punching delalloc extents on write failure is racy Dave Chinner
2022-11-02  7:18   ` Christoph Hellwig
2022-11-02 16:22   ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 3/7] xfs: use byte ranges for write cleanup ranges Dave Chinner
2022-11-02  7:20   ` Christoph Hellwig
2022-11-02 16:32   ` Darrick J. Wong
2022-11-04  5:40     ` Dave Chinner
2022-11-07 23:53       ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 4/7] xfs: buffered write failure should not truncate the page cache Dave Chinner
2022-11-01 11:57   ` kernel test robot
2022-11-02  7:24   ` Christoph Hellwig
2022-11-02 20:57     ` Dave Chinner
2022-11-02 16:41   ` Darrick J. Wong
2022-11-02 21:04     ` Dave Chinner
2022-11-02 22:26       ` Darrick J. Wong
2022-11-04  8:08   ` Christoph Hellwig
2022-11-04 23:10     ` Dave Chinner
2022-11-07 23:48       ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 5/7] iomap: write iomap validity checks Dave Chinner
2022-11-02  8:36   ` Christoph Hellwig
2022-11-02 16:43     ` Darrick J. Wong
2022-11-02 16:58       ` Darrick J. Wong
2022-11-03  0:35         ` Dave Chinner
2022-11-04  8:12           ` Christoph Hellwig
2022-11-02 16:57   ` Darrick J. Wong
2022-11-01  0:34 ` [PATCH 6/7] xfs: use iomap_valid method to detect stale cached iomaps Dave Chinner
2022-11-01  9:15   ` kernel test robot
2022-11-02  8:41   ` Christoph Hellwig
2022-11-02 21:39     ` Dave Chinner
2022-11-04  8:14       ` Christoph Hellwig
2022-11-02 17:19   ` Darrick J. Wong
2022-11-02 22:36     ` Dave Chinner
2022-11-08  0:00       ` Darrick J. Wong [this message]
2022-11-01  0:34 ` [PATCH 7/7] xfs: drop write error injection is unfixable, remove it Dave Chinner
2022-11-01  3:39 ` xfs, iomap: fix data corrupton due to stale cached iomaps Darrick J. Wong
2022-11-01  4:21   ` Dave Chinner
2022-11-02 17:23     ` Darrick J. Wong
  -- strict thread matches above, loose matches on Subject: below --
2022-11-01 12:17 [PATCH 6/7] xfs: use iomap_valid method to detect " kernel test robot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Y2mcOCpKiDb4nf1X@magnolia \
    --to=djwong@kernel.org \
    --cc=david@fromorbit.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.