From: Dave Chinner <david@fromorbit.com>
To: Brian Foster <bfoster@redhat.com>
Cc: "Darrick J. Wong" <darrick.wong@oracle.com>,
xfs <linux-xfs@vger.kernel.org>
Subject: Re: [RFC PATCH] xfs: flush posteof zeroing before reflink truncation
Date: Mon, 19 Nov 2018 11:26:11 +1100 [thread overview]
Message-ID: <20181119002611.GH19305@dastard> (raw)
In-Reply-To: <20181118141205.GA40476@bfoster>
On Sun, Nov 18, 2018 at 09:12:06AM -0500, Brian Foster wrote:
> On Sat, Nov 17, 2018 at 07:37:56AM +1100, Dave Chinner wrote:
> > On Fri, Nov 16, 2018 at 11:07:24AM -0800, Darrick J. Wong wrote:
> > > From: Darrick J. Wong <darrick.wong@oracle.com>
> > >
> > > If we're remapping into a range that starts beyond EOF, we have to zero
> > > the memory between EOF and the start of the target range, as established
> > > in 410fdc72b05af. However, in 4918ef4ea008, we extended the pagecache
> > > truncation range downwards to a page boundary to guarantee that
> > > pagecache pages are removed and that there's no possibility that we end
> > > up zeroing subpage blocks within a page. Unfortunately, we never commit
> > > the posteof zeroing to disk, so on a filesystem where page size > block
> > > size the truncation partially undoes the zeroing and we end up with
> > > stale disk contents.
> > >
> > > Brian and I reproduced this problem by running generic/091 on a 1k block
> > > xfs filesystem, assuming fsx in fstests supports clone/dedupe/copyrange.
> > >
> > > Fixes: 410fdc72b05a ("xfs: zero posteof blocks when cloning above eof")
> > > Fixes: 4918ef4ea008 ("xfs: fix pagecache truncation prior to reflink")
> > > Simultaneously-diagnosed-by: Brian Foster <bfoster@redhat.com>
> > > Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
> >
> > Ok, I have a different fix for this again.
> >
> > > ---
> > > Note: I haven't tested this thoroughly but wanted to push this out for
> > > everyone to look at ASAP.
> > > ---
> > > fs/xfs/xfs_reflink.c | 8 +++++++-
> > > 1 file changed, 7 insertions(+), 1 deletion(-)
> > >
> > > diff --git a/fs/xfs/xfs_reflink.c b/fs/xfs/xfs_reflink.c
> > > index c56bdbfcf7ae..8ea09a7e550c 100644
> > > --- a/fs/xfs/xfs_reflink.c
> > > +++ b/fs/xfs/xfs_reflink.c
> > > @@ -1255,13 +1255,19 @@ xfs_reflink_zero_posteof(
> > > loff_t pos)
> > > {
> > > loff_t isize = i_size_read(VFS_I(ip));
> > > + int error;
> > >
> > > if (pos <= isize)
> > > return 0;
> > >
> > > trace_xfs_zero_eof(ip, isize, pos - isize);
> > > - return iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > + error = iomap_zero_range(VFS_I(ip), isize, pos - isize, NULL,
> > > &xfs_iomap_ops);
> > > + if (error)
> > > + return error;
> > > +
> > > + return filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> > > + isize, pos - 1);
> >
> > This doesn't work on block size > page size setups, unfortunately.
> >
> > Immediately after this we truncate the page cache, which also
> > doesn't do the right thing on block size > page cache setups.
> > So there's a couple of bugs here.
> >
> > IMO, the truncate needs fixing, not the zeroing. Flushing after
> > zeroing leaves a potential landmine of other dirty data not getting
> > flushed properly before the truncate, so we should fix the truncate
> > to do a flush first. And we should fix it in a way that doesn't mean
> > we need to fix it again in the very near future. i.e. the patch
> > below that uses xfs_flush_unmap_range().
> >
> > FWIW, I'm workng on cleaning up the ~10 patches I have for various
> > fsx and other corruption fixes so I can post them - it'll be monday
> > before I get that done - but if you're having fsx failures w/
> > copy/dedupe/clone on fsx I've probably already got a fix for it...
> >
>
> Ok, so FYI this doesn't actually address the writeback issue I
> reproduced because the added flush targets the start of the destination
> offset.
Oh, sorry, I didn't notice that difference. Fixed that now. That
might actually be one(*) of the fsx bugs I've been chasing for
several days.
> Note again that I think this is distinct from the issue both you
> and Darrick have documented in each commit log. Darrick's patch
> addresses it because the flush targets the range that has been zeroed
> (isize to dest offset) and thus potentially dirtied in cache. The
> zeroing is what leads to sharing a block with an active dirty page (and
> eventually leads to writeback of a page to a shared block).
Yes, we may be seeing *different symptoms* but the underlying
problem is the same - we are not writing back pages over the range
we are about to share, and so we don't trigger a COW on the range
before writeback occurs.
IMO, using xfs_flush_unmap_range() is still the right change to make
here, even if my initial patch didn't address this specific problem
but a different flush/inval problem with this code.
Cheers.
Dave.
(*) I've still got several different fsx variants that fail on either
default configs and/or 1k block size with different signatures.
Problem is they take between 370,000 ops and 5 million ops to
trigger, and so generate tens to hundreds of GB of trace data....
e.g. on a default 4k filesystem on pmem, this fails after 377k ops:
# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -Z -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0x31000, size = 0xa000, fname = /mnt/scratch/foo
OFFSET GOOD BAD RANGE
0x36000 0x9084 0x7940 0x00000
....
and this slight variant (buffered IO rather than direct IO) fails
after 2.17 million ops:
# ltp/fsx -q -p50000 -l 500000 -r 4096 -t 512 -w 512 -R -W /mnt/scratch/foo
....
READ BAD DATA: offset = 0xf000, size = 0xf318, fname = /mnt/scratch/foo
OFFSET GOOD BAD RANGE
0x15000 0x990b 0x6c0b 0x00000
....
I'm also seeing MAPREAD failures with data after EOF on several
different configs, and there's a couple of other failures that show
up every so often, too.
If I turn off copy/dedupe/clone file range, they all run for
billions of ops without failure....
--
Dave Chinner
david@fromorbit.com
next prev parent reply other threads:[~2018-11-19 10:48 UTC|newest]
Thread overview: 8+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-16 19:07 [RFC PATCH] xfs: flush posteof zeroing before reflink truncation Darrick J. Wong
2018-11-16 19:31 ` Brian Foster
2018-11-16 20:37 ` Dave Chinner
2018-11-18 14:12 ` Brian Foster
2018-11-19 0:26 ` Dave Chinner [this message]
2018-11-19 19:05 ` Brian Foster
2018-11-19 22:04 ` Dave Chinner
2018-11-19 22:07 ` Dave Chinner
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181119002611.GH19305@dastard \
--to=david@fromorbit.com \
--cc=bfoster@redhat.com \
--cc=darrick.wong@oracle.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).