From: "Darrick J. Wong" <darrick.wong@oracle.com>
To: Dave Chinner <david@fromorbit.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Subject: Re: [PATCH 6/5 V2] iomap: readpages doesn't zero page tail beyond EOF properly
Date: Wed, 21 Nov 2018 08:20:51 -0800 [thread overview]
Message-ID: <20181121162051.GQ6792@magnolia> (raw)
In-Reply-To: <20181121082736.GT19305@dastard>
On Wed, Nov 21, 2018 at 07:27:36PM +1100, Dave Chinner wrote:
>
> From: Dave Chinner <dchinner@redhat.com>
>
> When we read the EOF page of the file via readpages, we need
> to zero the region beyond EOF that we either do not read or
> should not contain data so that mmap does not expose stale data to
> user applications.
>
> However, iomap_adjust_read_range() fails to detect EOF correctly,
> and so fsx on 1k block size filesystems fails very quickly with
> mapreads exposing data beyond EOF. There are two problems here.
>
> Firstly, when calculating the end block of the EOF byte, we have
> to round the size by one to avoid a block aligned EOF from reporting
> a block too large. i.e. a size of 1024 bytes is 1 block, which in
> index terms is block 0. Therefore we have to calculate the end block
> from (isize - 1), not isize.
>
> The second bug is determining if the current page spans EOF, and so
> whether we need split it into two half, one for the IO, and the
> other for zeroing. Unfortunately, the code that checks whether
> we should split the block doesn't actually check if we span EOF, it
> just checks if the read spans the /offset in the page/ that EOF
> sits on. So it splits every read into two if EOF is not page
> aligned, regardless of whether we are reading the EOF block or not.
>
> Hence we need to restrict the "does the read span EOF" check to
> just the page that spans EOF, not every page we read.
>
> This patch results in correct EOF detection through readpages:
>
> xfs_vm_readpages: dev 259:0 ino 0x43 nr_pages 24
> xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x4f000 count 98304 type hole startoff 0x13c startblock 1368 blockcount 0x4
> iomap_readpage_actor: orig pos 323584 pos 323584, length 4096, poff 0 plen 4096, isize 420864
> xfs_iomap_found: dev 259:0 ino 0x43 size 0x66c00 offset 0x50000 count 94208 type hole startoff 0x140 startblock 1497 blockcount 0x5c
> iomap_readpage_actor: orig pos 327680 pos 327680, length 94208, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 331776 pos 331776, length 90112, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 335872 pos 335872, length 86016, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 339968 pos 339968, length 81920, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 344064 pos 344064, length 77824, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 348160 pos 348160, length 73728, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 352256 pos 352256, length 69632, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 356352 pos 356352, length 65536, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 360448 pos 360448, length 61440, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 364544 pos 364544, length 57344, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 368640 pos 368640, length 53248, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 372736 pos 372736, length 49152, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 376832 pos 376832, length 45056, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 380928 pos 380928, length 40960, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 385024 pos 385024, length 36864, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 389120 pos 389120, length 32768, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 393216 pos 393216, length 28672, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 397312 pos 397312, length 24576, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 401408 pos 401408, length 20480, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 405504 pos 405504, length 16384, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 409600 pos 409600, length 12288, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 413696 pos 413696, length 8192, poff 0 plen 4096, isize 420864
> iomap_readpage_actor: orig pos 417792 pos 417792, length 4096, poff 0 plen 3072, isize 420864
> iomap_readpage_actor: orig pos 420864 pos 420864, length 1024, poff 3072 plen 1024, isize 420864
>
> As you can see, it now does full page reads until the last one which
> is split correctly at the block aligned EOF, reading 3072 bytes and
> zeroing the last 1024 bytes. The original version of the patch got
> this right, but it got another case wrong.
>
> The EOF detection crossing really needs to the the original length
> as plen, while it starts at the end of the block, will be shortened
> as up-to-date blocks are found on the page. This means "orig_pos +
> plen" no longer points to the end of the page, and so will not
> correctly detect EOF crossing. Hence we have to use the length
> passed in to detect this partial page case:
>
> xfs_filemap_fault: dev 259:1 ino 0x43 write_fault 0
> xfs_vm_readpage: dev 259:1 ino 0x43 nr_pages 1
> xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2c000 count 4096 type hole startoff 0xb0 startblock 282 blockcount 0x4
> iomap_readpage_actor: orig pos 180224 pos 181248, length 4096, poff 1024 plen 2048, isize 183296
> xfs_iomap_found: dev 259:1 ino 0x43 size 0x2cc00 offset 0x2cc00 count 1024 type hole startoff 0xb3 startblock 285 blockcount 0x1
> iomap_readpage_actor: orig pos 183296 pos 183296, length 1024, poff 3072 plen 1024, isize 183296
>
> Heere we see a trace where the first block on the EOF page is up to
> date, hence poff = 1024 bytes. The offset into the page of EOF is
> 3072, so the range we want to read is 1024 - 3071, and the range we
> want to zero is 3072 - 4095. You can see this is split correctly
> now.
>
> This fixes the stale data beyond EOF problem that fsx quickly
> uncovers on 1k block size filesystems.
>
> Signed-off-by: Dave Chinner <dchinner@redhat.com>
Looks ok, will test...
Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com>
--D
> ---
>
> Testing the first version of this patch failed the direct IO fsx
> test case I've been using at ~16.5 million ops when it tripped over
> a partially up to date EOF page. I used the wrong length to detect
> EOF being crossed, this update fixes that and it no longer fails at
> 16.5m ops. We'll see how long it lasts this time....
>
> fs/iomap.c | 10 +++++++---
> 1 file changed, 7 insertions(+), 3 deletions(-)
>
> diff --git a/fs/iomap.c b/fs/iomap.c
> index d51e7a2ae641..b95110867eee 100644
> --- a/fs/iomap.c
> +++ b/fs/iomap.c
> @@ -142,13 +142,14 @@ static void
> iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> loff_t *pos, loff_t length, unsigned *offp, unsigned *lenp)
> {
> + loff_t orig_pos = *pos;
> + loff_t isize = i_size_read(inode);
> unsigned block_bits = inode->i_blkbits;
> unsigned block_size = (1 << block_bits);
> unsigned poff = offset_in_page(*pos);
> unsigned plen = min_t(loff_t, PAGE_SIZE - poff, length);
> unsigned first = poff >> block_bits;
> unsigned last = (poff + plen - 1) >> block_bits;
> - unsigned end = offset_in_page(i_size_read(inode)) >> block_bits;
>
> /*
> * If the block size is smaller than the page size we need to check the
> @@ -183,8 +184,11 @@ iomap_adjust_read_range(struct inode *inode, struct iomap_page *iop,
> * handle both halves separately so that we properly zero data in the
> * page cache for blocks that are entirely outside of i_size.
> */
> - if (first <= end && last > end)
> - plen -= (last - end) * block_size;
> + if (orig_pos <= isize && orig_pos + length > isize) {
> + unsigned end = offset_in_page(isize - 1) >> block_bits;
> + if (first <= end && last > end)
> + plen -= (last - end) * block_size;
> + }
>
> *offp = poff;
> *lenp = plen;
next prev parent reply other threads:[~2018-11-22 2:56 UTC|newest]
Thread overview: 16+ messages / expand[flat|nested] mbox.gz Atom feed top
2018-11-19 21:17 [PATCH 0/5] iomap: data corruption fixes and more Dave Chinner
2018-11-19 21:17 ` [PATCH 1/5] iomap: FUA is wrong for DIO O_DSYNC writes into unwritten extents Dave Chinner
2018-11-20 7:48 ` Christoph Hellwig
2018-11-20 23:00 ` Darrick J. Wong
2018-11-19 21:17 ` [PATCH 2/5] iomap: sub-block dio needs to zeroout beyond EOF Dave Chinner
2018-11-20 23:01 ` Darrick J. Wong
2018-11-19 21:17 ` [PATCH 3/5] iomap: dio data corruption and spurious errors when pipes fill Dave Chinner
2018-11-19 21:17 ` [PATCH 4/5] splice: increase pipe size in splice_direct_to_actor() Dave Chinner
2018-11-20 7:49 ` Christoph Hellwig
2018-11-20 23:02 ` Darrick J. Wong
2018-11-19 21:17 ` [PATCH 5/5] vfs: vfs_dedupe_file_range() doesn't return EOPNOTSUPP Dave Chinner
2018-11-20 7:49 ` Christoph Hellwig
2018-11-21 6:50 ` [PATCH 6/5] iomap: readpages doesn't zero page tail beyond EOF properly Dave Chinner
2018-11-21 8:27 ` [PATCH 6/5 V2] " Dave Chinner
2018-11-21 16:20 ` Darrick J. Wong [this message]
2018-11-21 16:34 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20181121162051.GQ6792@magnolia \
--to=darrick.wong@oracle.com \
--cc=david@fromorbit.com \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).