linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zhang Yi <yi.zhang@huaweicloud.com>
To: linux-xfs@vger.kernel.org, linux-fsdevel@vger.kernel.org
Cc: linux-kernel@vger.kernel.org, djwong@kernel.org,
	hch@infradead.org, brauner@kernel.org, david@fromorbit.com,
	chandanbabu@kernel.org, jack@suse.cz, willy@infradead.org,
	yi.zhang@huawei.com, yi.zhang@huaweicloud.com,
	chengzhihao1@huawei.com, yukuai3@huawei.com
Subject: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
Date: Wed, 29 May 2024 17:51:59 +0800	[thread overview]
Message-ID: <20240529095206.2568162-2-yi.zhang@huaweicloud.com> (raw)
In-Reply-To: <20240529095206.2568162-1-yi.zhang@huaweicloud.com>

From: Dave Chinner <dchinner@redhat.com>

Unwritten extents can have page cache data over the range being
zeroed so we can't just skip them entirely. Fix this by checking for
an existing dirty folio over the unwritten range we are zeroing
and only performing zeroing if the folio is already dirty.

XXX: how do we detect a iomap containing a cow mapping over a hole
in iomap_zero_iter()? The XFS code implies this case also needs to
zero the page cache if there is data present, so trigger for page
cache lookup only in iomap_zero_iter() needs to handle this case as
well.

Before:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m14.103s
user    0m0.015s
sys     0m0.020s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 85.90    0.847616          16     50000           ftruncate
 14.01    0.138229           2     50000           pwrite64
....

After:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m0.144s
user    0m0.021s
sys     0m0.012s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.86    0.505964          10     50000           ftruncate
 46.12    0.433251           8     50000           pwrite64
....

Yup, we get back all the performance.

As for the "mmap write beyond EOF" data exposure aspect
documented here:

https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/

With this command:

$ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
  -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
  -c fsync -c "pread -v 3k 32" /mnt/scratch/foo

Before:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
XXXXXXXXXXXXXXXX
00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
XXXXXXXXXXXXXXXX
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
   ops/sec

After:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
................
00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
................
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
   ops/sec)

We see that this post-eof unwritten extent dirty page zeroing is
working correctly.

This has passed through most of fstests on a couple of test VMs
without issues at the moment, so I think this approach to fixing the
issue is going to be solid once we've worked out how to detect the
COW-hole mapping case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_iops.c      | 12 +-----------
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 41c8f0c68ef5..a3a5d4eb7289 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -583,11 +583,23 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
+ *
+ * Note: when zeroing unwritten extents, we might have data in the page cache
+ * over an unwritten extent. In this case, we want to do a pure lookup on the
+ * page cache and not create a new folio as we don't need to perform zeroing on
+ * unwritten extents if there is no cached data over the given range.
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
+	if (iter->flags & IOMAP_ZERO) {
+		const struct iomap *srcmap = iomap_iter_srcmap(iter);
+
+		if (srcmap->type == IOMAP_UNWRITTEN)
+			fgp &= ~FGP_CREAT;
+	}
+
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 	fgp |= fgf_set_order(len);
@@ -1388,7 +1400,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
 
 	do {
@@ -1399,8 +1411,22 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		bool ret;
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status)
+		if (status) {
+			if (status == -ENOENT) {
+				/*
+				 * Unwritten extents need to have page cache
+				 * lookups done to determine if they have data
+				 * over them that needs zeroing. If there is no
+				 * data, we'll get -ENOENT returned here, so we
+				 * can just skip over this index.
+				 */
+				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
+				if (bytes > PAGE_SIZE - offset_in_page(pos))
+					bytes = PAGE_SIZE - offset_in_page(pos);
+				goto loop_continue;
+			}
 			return status;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -1408,6 +1434,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		/*
+		 * If the folio over an unwritten extent is clean (i.e. because
+		 * it has been read from), then it already contains zeros. Hence
+		 * we can just skip it.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    !folio_test_dirty(folio)) {
+			folio_unlock(folio);
+			goto loop_continue;
+		}
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1416,6 +1453,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
+loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..d44508930b67 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -861,17 +861,7 @@ xfs_setattr_size(
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
-	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+	} else if (newsize != oldsize) {
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-- 
2.39.2


  reply	other threads:[~2024-05-29  1:59 UTC|newest]

Thread overview: 47+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-05-29  9:51 ` Zhang Yi [this message]
2024-05-31 13:11   ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Christoph Hellwig
2024-05-31 14:03     ` Darrick J. Wong
2024-05-31 14:05       ` Christoph Hellwig
2024-05-31 15:44         ` Brian Foster
2024-05-31 15:43       ` Brian Foster
2024-06-02 22:22     ` Dave Chinner
2024-06-02 11:04   ` Brian Foster
2024-06-03  9:07     ` Zhang Yi
2024-06-03 14:37       ` Brian Foster
2024-06-04 23:38         ` Dave Chinner
2024-05-29  9:52 ` [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder Zhang Yi
2024-05-31 12:35   ` Christoph Hellwig
2024-05-31 14:04   ` Darrick J. Wong
2024-05-29  9:52 ` [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
2024-05-31 12:39   ` Christoph Hellwig
2024-06-02 11:16     ` Brian Foster
2024-06-03 13:23     ` Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 4/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
2024-05-31 13:31   ` Christoph Hellwig
2024-05-31 15:27     ` Darrick J. Wong
2024-05-31 16:17       ` Christoph Hellwig
2024-06-03 13:51       ` Zhang Yi
2024-05-31 15:44   ` Darrick J. Wong
2024-06-03 14:15     ` Zhang Yi
2024-06-02 22:46   ` Dave Chinner
2024-06-03 14:18     ` Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
2024-05-31 13:36   ` Christoph Hellwig
2024-06-03 14:35     ` Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 7/8] xfs: reserve blocks for truncating " Zhang Yi
2024-05-31 12:42   ` Christoph Hellwig
2024-05-31 14:10     ` Darrick J. Wong
2024-05-31 14:13       ` Christoph Hellwig
2024-05-31 15:29         ` Darrick J. Wong
2024-05-31 16:17           ` Christoph Hellwig
2024-05-29  9:52 ` [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize Zhang Yi
2024-05-31 13:46   ` Christoph Hellwig
2024-05-31 14:12     ` Darrick J. Wong
2024-05-31 14:15       ` Christoph Hellwig
2024-05-31 15:00         ` Darrick J. Wong
2024-06-04  7:09           ` Zhang Yi
2024-05-31 12:26 ` [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Christoph Hellwig
2024-06-01  7:38   ` Zhang Yi
2024-06-01  7:40     ` Christoph Hellwig

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20240529095206.2568162-2-yi.zhang@huaweicloud.com \
    --to=yi.zhang@huaweicloud.com \
    --cc=brauner@kernel.org \
    --cc=chandanbabu@kernel.org \
    --cc=chengzhihao1@huawei.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-xfs@vger.kernel.org \
    --cc=willy@infradead.org \
    --cc=yi.zhang@huawei.com \
    --cc=yukuai3@huawei.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).