From: Eryu Guan <eguan@redhat.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org
Subject: Re: [PATCH] xfs: flush the range before zero partial block range on truncate down
Date: Wed, 1 Nov 2017 11:46:39 +0800 [thread overview]
Message-ID: <20171101034639.GP17339@eguan.usersys.redhat.com> (raw)
In-Reply-To: <20171031225804.GF5858@dastard>
On Wed, Nov 01, 2017 at 09:58:04AM +1100, Dave Chinner wrote:
> On Fri, Oct 27, 2017 at 08:53:28PM +0800, Eryu Guan wrote:
> > On truncate down, if new size is not block size aligned, we zero the
> > rest of block via iomap_truncate_page() to avoid exposing stale data
> > to user, and iomap_truncate_page() skips zeroing if the range is
> > already in unwritten status or a hole.
>
> Unless the page is in the page cache already, and then it gets
> zeroed in memory as part of truncate_setsize() call.
>
> > But it's possible that a buffer write overwrites the unwritten
> > extent, which won't be converted to a normal extent until I/O
> > completion, and iomap_truncate_page() skips zeroing wrongly because
> > of the not-converted unwritten extent. This would cause a subsequent
> > mmap read sees non-zeros beyond EOF.
>
> Yes, it should skip the zeroing on disk. The page in the page cache
> over the unwritten extent will be zeroed on read.
>
> The real question is this: where are the zeros in the page that fsx
> is complaining about?
The partial block that iomap_truncate_page() skipped zeroing was latter
written back to disk, and the punch_hole before mmap read invalidated
the page cache so mmap read from disk and saw non-zeros. This is a
hard-to-hit sequence, it took me almost 2000 iterations of generic/112
runs to hit one failure. I'll provide more details below.
>
> > I occasionally see this in fsx runs in fstests generic/112, a
> > simplified fsx operation sequence is like (assuming 4k block size
> > xfs):
>
> What should have is:
>
> > fallocate 0x0 0x1000 0x0 keep_size
>
> Unwritten, no data.
Yes, assuming 4k block size and 4k page size, unwritten extent with 1
block allocated, i_size stays 0.
>
> > write 0x0 0x1000 0x0
>
> Unwritten, contains data in page cache.
Exactly, and in-core i_size is 4k now, but on-disk di_size is still 0.
>
> > truncate 0x0 0x800 0x1000
>
> Unwritten, page contains data 0-0x800, zeros 0x800-0x1000
Yes, the page cache after truncate is correct. But before we zero the
page cache (in truncate_setsize()), we skipped zeroing the partial block
range 0x800-0x1000 and then triggered a writeback on range
[di_size, newsize], which was 0-0x800, and 0x800-0x1000 was written back
to disk too, which contained non-zeros.
(newsize(2k) > di_size(0) && oldsize(4k) != di_size(0)) was true.
if (did_zeroing ||
(newsize > ip->i_d.di_size && oldsize != ip->i_d.di_size)) {
error = filemap_write_and_wait_range(mapping, ip->i_d.di_size,
newsize - 1);
if (error)
return error;
}
>
> > punch_hole 0x0 0x800 0x800
>
> Unwritten, page contains zeros 0x0-0x1000
i_mapping had no pages (nrpages == 0) after punch_hole.
>
> > mapread 0x0 0x800 0x800
pagefault read block from disk, 0-0x7ff was zero, but 0x800-0xfff was
non-zero.
>
> Should map a page full of zeros as it is either a read over an
> unwritten extent or a hole, or it finds a page cache page that is
> fully zeroed.
>
> The only wrinkle in this is if the write is direct IO, but
> then the truncate would see a written extent and this whole problem
> doesn't occur.
>
> So, more info required. :P
Above is what I observed in debugging, maybe I should have provided more
details in the first place. (I did add some comments to the fstests case
though..).
Thanks,
Eryu
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> david@fromorbit.com
next prev parent reply other threads:[~2017-11-01 3:46 UTC|newest]
Thread overview: 9+ messages / expand[flat|nested] mbox.gz Atom feed top
2017-10-27 12:53 [PATCH] xfs: flush the range before zero partial block range on truncate down Eryu Guan
2017-10-28 6:05 ` Christoph Hellwig
2017-10-31 10:09 ` Eryu Guan
2017-10-31 17:11 ` Darrick J. Wong
2017-10-31 23:03 ` Dave Chinner
2017-10-31 22:58 ` Dave Chinner
2017-11-01 3:46 ` Eryu Guan [this message]
2017-11-01 4:44 ` Dave Chinner
2017-11-01 12:06 ` Eryu Guan
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20171101034639.GP17339@eguan.usersys.redhat.com \
--to=eguan@redhat.com \
--cc=david@fromorbit.com \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.