From: "Darrick J. Wong" <djwong@kernel.org>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 6/7] xfs: use iomap_valid method to detect stale cached iomaps
Date: Mon, 7 Nov 2022 16:00:56 -0800 [thread overview]
Message-ID: <Y2mcOCpKiDb4nf1X@magnolia> (raw)
In-Reply-To: <20221102223605.GC3600936@dread.disaster.area>
On Thu, Nov 03, 2022 at 09:36:05AM +1100, Dave Chinner wrote:
> On Wed, Nov 02, 2022 at 10:19:33AM -0700, Darrick J. Wong wrote:
> > On Tue, Nov 01, 2022 at 11:34:11AM +1100, Dave Chinner wrote:
> > > From: Dave Chinner <dchinner@redhat.com>
> > >
> > > Now that iomap supports a mechanism to validate cached iomaps for
> > > buffered write operations, hook it up to the XFS buffered write ops
> > > so that we can avoid data corruptions that result from stale cached
> > > iomaps. See:
> > >
> > > https://lore.kernel.org/linux-xfs/20220817093627.GZ3600936@dread.disaster.area/
> > >
> > > or the ->iomap_valid() introduction commit for exact details of the
> > > corruption vector.
> > >
> > > Signed-off-by: Dave Chinner <dchinner@redhat.com>
> > > ---
> > >  fs/xfs/libxfs/xfs_bmap.c |  4 +--
> > >  fs/xfs/xfs_aops.c        |  2 +-
> > >  fs/xfs/xfs_iomap.c       | 69 ++++++++++++++++++++++++++++++----------
> > >  fs/xfs/xfs_iomap.h       |  4 +--
> > >  fs/xfs/xfs_pnfs.c        |  5 +--
> > >  5 files changed, 61 insertions(+), 23 deletions(-)
> > >
> > > diff --git a/fs/xfs/libxfs/xfs_bmap.c b/fs/xfs/libxfs/xfs_bmap.c
> > > index 49d0d4ea63fc..db225130618c 100644
> > > --- a/fs/xfs/libxfs/xfs_bmap.c
> > > +++ b/fs/xfs/libxfs/xfs_bmap.c
> > > @@ -4551,8 +4551,8 @@ xfs_bmapi_convert_delalloc(
> > >  	 * the extent. Just return the real extent at this offset.
> > >  	 */
> > >  	if (!isnullstartblock(bma.got.br_startblock)) {
> > > -		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
> > >  		*seq = READ_ONCE(ifp->if_seq);
> > > +		xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
> > >  		goto out_trans_cancel;
> > >  	}
> > >
> > > @@ -4599,8 +4599,8 @@ xfs_bmapi_convert_delalloc(
> > >  	XFS_STATS_INC(mp, xs_xstrat_quick);
> > >
> > >  	ASSERT(!isnullstartblock(bma.got.br_startblock));
> > > -	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags);
> > >  	*seq = READ_ONCE(ifp->if_seq);
> > > +	xfs_bmbt_to_iomap(ip, iomap, &bma.got, 0, flags, *seq);
> > >
> > >  	if (whichfork == XFS_COW_FORK)
> >
> > Hmm. @ifp here could be the cow fork, which means *seq will be from the
> > cow fork. This I think is going to cause problems in
> > xfs_buffered_write_iomap_valid...
>
> We can't get here from the buffered write path. This can only be
> reached by the writeback path via:
>
> iomap_do_writepage
> iomap_writepage_map
> xfs_map_blocks
> xfs_convert_blocks
> xfs_bmapi_convert_delalloc
>
> Hence the sequence numbers stored in iomap->private here are
> completely ignored by the writeback code. They still get stored
> in the struct xfs_writepage_ctx and checked by xfs_imap_valid(), so
> the changes here were really just making sure all the
> xfs_bmbt_to_iomap() callers stored the sequence number consistently.
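
For anyone following along, the shape of the writeback-side check being
described is roughly the following. These struct and field names are
simplified stand-ins for illustration, not the real xfs_writepage_ctx or
inode fork definitions:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Simplified stand-in for the writeback context: it remembers the
 * change counters of both forks as they were when the map was cached.
 */
struct wb_ctx {
	uint32_t	data_seq;	/* data fork seq at caching time */
	uint32_t	cow_seq;	/* cow fork seq at caching time */
};

/* Simplified stand-in for the inode's current fork change counters. */
struct inode_forks {
	uint32_t	data_seq;
	uint32_t	cow_seq;
};

/*
 * The writeback pattern: a cached mapping is valid only if *neither*
 * fork has changed since the mapping was cached.
 */
static bool imap_valid(const struct wb_ctx *wc, const struct inode_forks *ip)
{
	return wc->data_seq == ip->data_seq && wc->cow_seq == ip->cow_seq;
}
```

The point being that a bump of either counter invalidates the cached map.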
>
> > > @@ -839,24 +848,25 @@ xfs_direct_write_iomap_begin(
> > >  	xfs_iunlock(ip, lockmode);
> > >
> > >  	error = xfs_iomap_write_direct(ip, offset_fsb, end_fsb - offset_fsb,
> > > -			flags, &imap);
> > > +			flags, &imap, &seq);
> > >  	if (error)
> > >  		return error;
> > >
> > >  	trace_xfs_iomap_alloc(ip, offset, length, XFS_DATA_FORK, &imap);
> > >  	return xfs_bmbt_to_iomap(ip, iomap, &imap, flags,
> > > -			iomap_flags | IOMAP_F_NEW);
> > > +			iomap_flags | IOMAP_F_NEW, seq);
> > >
> > >  out_found_cow:
> > > +	seq = READ_ONCE(ip->i_df.if_seq);
> > >  	xfs_iunlock(ip, lockmode);
> > >  	length = XFS_FSB_TO_B(mp, cmap.br_startoff + cmap.br_blockcount);
> > >  	trace_xfs_iomap_found(ip, offset, length - offset, XFS_COW_FORK, &cmap);
> > >  	if (imap.br_startblock != HOLESTARTBLOCK) {
> > > -		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0);
> > > +		error = xfs_bmbt_to_iomap(ip, srcmap, &imap, flags, 0, seq);
> > >  		if (error)
> > >  			return error;
> > >  	}
> > > -	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED);
> > > +	return xfs_bmbt_to_iomap(ip, iomap, &cmap, flags, IOMAP_F_SHARED, seq);
> >
> > Here we've found a mapping from the COW fork and we're encoding it into
> > the struct iomap. Why is the sequence number from the *data* fork and
> > not the COW fork?
>
> Because my brain isn't big enough to hold all this code straight in
> my head. I found it all too easy to get confused by what iomap is
> passed where and how to tell them apart and I could not tell if
> there was an easy way to tell if the iomap passed to
> xfs_iomap_valid() needed to check against the data or cow fork.
>
> Hence I set it up such that the xfs_iomap_valid() callback only
> checks the data fork sequence so the only correct thing to set in
> the iomap is the data fork sequence and moved on to the next part of
> the problem...
>
> But, again, this information probably should be encoded into the
> sequence cookie we store. Which begs the question - if we are doing
> a COW operation, we have to check that neither the data fork (the
> srcmap) nor the COW fork (the iter->iomap) have changed, right? This
> is what the writeback code does (i.e. checks both data and COW fork
> sequence numbers), so perhaps that's also what we should be doing
> here?
>
> i.e. either the cookie for COW operations needs to contain both
> data+cow sequence numbers, and both need to match to proceed, or we
> have to validate both the srcmap and the iter->iomap with separate
> callouts once the folio is locked.
>
> Thoughts?

As we sorta discussed last week on IRC, I guess you could start by
encoding both values (as needed) in something fugly like this in struct
iomap:

	union {
		void		*private;
		u64		cookie;
	};
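
Packing both counters into that u64 could be as simple as the helpers
below; the names and the low/high bit split are invented here purely for
illustration, nothing in the patchset defines them:

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Hypothetical cookie helpers: data fork seq in the low 32 bits,
 * cow fork seq in the high 32 bits.
 */
static uint64_t seq_cookie(uint32_t data_seq, uint32_t cow_seq)
{
	return (uint64_t)cow_seq << 32 | data_seq;
}

/* The cached cookie is valid only if both counters still match. */
static bool seq_cookie_valid(uint64_t cookie, uint32_t data_seq,
			     uint32_t cow_seq)
{
	return cookie == seq_cookie(data_seq, cow_seq);
}
```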

But the longer term solution is (I suspect) to make a new struct that
XFS can envelop:

	struct iomap_pagecache_ctx {
		struct iomap_iter	iter;	/* do not touch */
		struct iomap_page_ops	*ops;	/* callers can attach here */
	};

	struct xfs_pagecache_ctx {
		struct iomap_pagecache_ctx	ipctx;
		u32				data_seq;
		u32				cow_seq;
	};

and pass that into the iomap functions:

	struct xfs_pagecache_ctx	ctx = { };

	ret = iomap_file_buffered_write(iocb, from, &ctx,
			&xfs_buffered_write_iomap_ops);

so that xfs_iomap_revalidate can do the usual struct unwrapping:

	struct xfs_pagecache_ctx	*xpctx;

	xpctx = container_of(ipctx, struct xfs_pagecache_ctx, ipctx);
	if (xpctx->data_seq != ip->i_df.if_seq)
		return NOPE;

But that's a pretty aggressive refactoring and not suitable for a
bugfix; do it afterwards while you're waiting for the rest of us to
check over part 1, etc.
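
(In userspace terms that unwrapping is just offsetof() arithmetic; a
quick compile-and-run sketch of the pattern, with the struct contents
reduced to stand-ins since the real iomap_iter obviously doesn't build
outside the kernel:

```c
#include <stddef.h>

/* Simplified stand-ins for the structs sketched above. */
struct iomap_pagecache_ctx {
	int	dummy_iter;	/* placeholder for struct iomap_iter */
};

struct xfs_pagecache_ctx {
	struct iomap_pagecache_ctx	ipctx;	/* must be embedded */
	unsigned int			data_seq;
	unsigned int			cow_seq;
};

/* Userspace equivalent of the kernel's container_of(). */
#define container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))
```

The nice part of this scheme is that the unwrapping is pure pointer
arithmetic, so handing iomap only the embedded ipctx costs nothing.)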
--D
> > > @@ -1328,9 +1342,26 @@ xfs_buffered_write_iomap_end(
> > >  	return 0;
> > >  }
> > >
> > > +/*
> > > + * Check that the iomap passed to us is still valid for the given offset and
> > > + * length.
> > > + */
> > > +static bool
> > > +xfs_buffered_write_iomap_valid(
> > > +	struct inode		*inode,
> > > +	const struct iomap	*iomap)
> > > +{
> > > +	int			seq = *((int *)&iomap->private);
> > > +
> > > +	if (seq != READ_ONCE(XFS_I(inode)->i_df.if_seq))
> >
> > Why is it ok to sample the data fork's sequence number even if the
> > mapping came from the COW fork? That doesn't make sense to me,
> > conceptually. I definitely think it's buggy that the revalidation
> > might not sample from the same sequence counter as the original
> > measurement.
>
> Yup, see above.
>
> > > const struct iomap_ops xfs_read_iomap_ops = {
> > > @@ -1438,7 +1471,7 @@ xfs_seek_iomap_begin(
> > >  		end_fsb = min(end_fsb, data_fsb);
> > >  		xfs_trim_extent(&cmap, offset_fsb, end_fsb);
> > >  		error = xfs_bmbt_to_iomap(ip, iomap, &cmap, flags,
> > > -				IOMAP_F_SHARED);
> > > +				IOMAP_F_SHARED, READ_ONCE(ip->i_cowfp->if_seq));
> >
> > Here we explicitly sample from the cow fork but the comparison is done
> > against the data fork. That /probably/ doesn't matter for reads since
> > you're not proposing that we revalidate on those. Should we?
>
> As I said elsewhere, I haven't even looked at the read path. The
> iomap code is now complex enough that I have trouble following it
> and keeping all the corner cases in my head. This is the down side
> of having people smarter than you write the code you need to debug
> when shit inevitably goes wrong.
>
> > What
> > happens if userspace hands us a large read() from an unwritten extent
> > and the same dirty/writeback/reclaim sequence happens to the last folio
> > in that read() range before read_iter actually gets there?
>
> No idea - shit may well go wrong there in the same way, but I've had
> my hands full just fixing the write path. I'm *really* tired of
> fighting with the iomap code right now...
>
> > > @@ -189,7 +190,7 @@ xfs_fs_map_blocks(
> > >  	xfs_iunlock(ip, lock_flags);
> > >
> > >  	error = xfs_iomap_write_direct(ip, offset_fsb,
> > > -			end_fsb - offset_fsb, 0, &imap);
> > > +			end_fsb - offset_fsb, 0, &imap, &seq);
> >
> > I got a compiler warning about seq not being set in the case where we
> > don't call xfs_iomap_write_direct.
>
> Yup, gcc 11.2 doesn't warn about it. Now to find out why my build
> system is using gcc 11.2 when I upgraded it months ago to use
> gcc-12 because of exactly these sorts of issues.... :(
--D
> -Dave.
> --
> Dave Chinner
> david@fromorbit.com