From: Brian Foster <bfoster@redhat.com>
To: Zhang Yi <yi.zhang@huaweicloud.com>
Cc: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, djwong@kernel.org,
hch@infradead.org, brauner@kernel.org, david@fromorbit.com,
chandanbabu@kernel.org, jack@suse.cz, willy@infradead.org,
yi.zhang@huawei.com, chengzhihao1@huawei.com, yukuai3@huawei.com
Subject: Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
Date: Sun, 2 Jun 2024 07:04:47 -0400 [thread overview]
Message-ID: <ZlxRz9LPNuoOZOtl@bfoster> (raw)
In-Reply-To: <20240529095206.2568162-2-yi.zhang@huaweicloud.com>
On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> From: Dave Chinner <dchinner@redhat.com>
>
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
>
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.
>
> Before:
>
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
>
> real 0m14.103s
> user 0m0.015s
> sys 0m0.020s
>
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 85.90 0.847616 16 50000 ftruncate
> 14.01 0.138229 2 50000 pwrite64
> ....
>
> After:
>
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
>
> real 0m0.144s
> user 0m0.021s
> sys 0m0.012s
>
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time seconds usecs/call calls errors syscall
> ------ ----------- ----------- --------- --------- ----------------
> 53.86 0.505964 10 50000 ftruncate
> 46.12 0.433251 8 50000 pwrite64
> ....
>
> Yup, we get back all the performance.
>
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
>
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
>
> With this command:
>
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
> -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
> -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
>
> Before:
>
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> XXXXXXXXXXXXXXXX
> 00000c10: 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
> ops/sec
>
> After:
>
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ................
> 00000c10: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
> ops/sec)
>
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
>
I've pointed this out in the past, but IIRC this implementation is racy
vs. reclaim. Specifically, relying on folio lookup after mapping lookup
doesn't take reclaim into account, so if we look up an unwritten mapping
and then a folio flushes and reclaims by the time the scan reaches that
offset, it incorrectly treats that subrange as already zero when it
actually isn't (because the extent is actually stale by that point, but
the stale extent check is skipped).
A simple example to demonstrate this is something like the following:
# looping truncate zeroing
while [ true ]; do
xfs_io -fc "truncate 0" -c "falloc 0 32K" -c "pwrite 0 4k" -c "truncate 2k" <file>
xfs_io -c "mmap 0 4k" -c "mread -v 2k 16" <file> | grep cd && break
done
vs.
# looping writeback and reclaim
while [ true ]; do
xfs_io -c "sync_range -a 0 0" -c "fadvise -d 0 0" <file>
done
If I ran that against this patch, the first loop will eventually detect
stale data exposed past eof.
Brian
next prev parent reply other threads:[~2024-06-02 11:04 UTC|newest]
Thread overview: 47+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-05-29 9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-05-29 9:51 ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Zhang Yi
2024-05-31 13:11 ` Christoph Hellwig
2024-05-31 14:03 ` Darrick J. Wong
2024-05-31 14:05 ` Christoph Hellwig
2024-05-31 15:44 ` Brian Foster
2024-05-31 15:43 ` Brian Foster
2024-06-02 22:22 ` Dave Chinner
2024-06-02 11:04 ` Brian Foster [this message]
2024-06-03 9:07 ` Zhang Yi
2024-06-03 14:37 ` Brian Foster
2024-06-04 23:38 ` Dave Chinner
2024-05-29 9:52 ` [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder Zhang Yi
2024-05-31 12:35 ` Christoph Hellwig
2024-05-31 14:04 ` Darrick J. Wong
2024-05-29 9:52 ` [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
2024-05-31 12:39 ` Christoph Hellwig
2024-06-02 11:16 ` Brian Foster
2024-06-03 13:23 ` Zhang Yi
2024-05-29 9:52 ` [RFC PATCH v4 4/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
2024-05-29 9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
2024-05-31 13:31 ` Christoph Hellwig
2024-05-31 15:27 ` Darrick J. Wong
2024-05-31 16:17 ` Christoph Hellwig
2024-06-03 13:51 ` Zhang Yi
2024-05-31 15:44 ` Darrick J. Wong
2024-06-03 14:15 ` Zhang Yi
2024-06-02 22:46 ` Dave Chinner
2024-06-03 14:18 ` Zhang Yi
2024-05-29 9:52 ` [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
2024-05-31 13:36 ` Christoph Hellwig
2024-06-03 14:35 ` Zhang Yi
2024-05-29 9:52 ` [RFC PATCH v4 7/8] xfs: reserve blocks for truncating " Zhang Yi
2024-05-31 12:42 ` Christoph Hellwig
2024-05-31 14:10 ` Darrick J. Wong
2024-05-31 14:13 ` Christoph Hellwig
2024-05-31 15:29 ` Darrick J. Wong
2024-05-31 16:17 ` Christoph Hellwig
2024-05-29 9:52 ` [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize Zhang Yi
2024-05-31 13:46 ` Christoph Hellwig
2024-05-31 14:12 ` Darrick J. Wong
2024-05-31 14:15 ` Christoph Hellwig
2024-05-31 15:00 ` Darrick J. Wong
2024-06-04 7:09 ` Zhang Yi
2024-05-31 12:26 ` [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Christoph Hellwig
2024-06-01 7:38 ` Zhang Yi
2024-06-01 7:40 ` Christoph Hellwig
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=ZlxRz9LPNuoOZOtl@bfoster \
--to=bfoster@redhat.com \
--cc=brauner@kernel.org \
--cc=chandanbabu@kernel.org \
--cc=chengzhihao1@huawei.com \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=linux-xfs@vger.kernel.org \
--cc=willy@infradead.org \
--cc=yi.zhang@huawei.com \
--cc=yi.zhang@huaweicloud.com \
--cc=yukuai3@huawei.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).