linux-fsdevel.vger.kernel.org archive mirror
* [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes
@ 2024-05-29  9:51 Zhang Yi
  2024-05-29  9:51 ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Zhang Yi
                   ` (8 more replies)
  0 siblings, 9 replies; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:51 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Changes since v3:
 - Factor out a new helper to get the remainder in math64.h as Darrick
   suggested.
 - Adjust the truncating order to avoid excessive redundant blocking
   writes, as Dave suggested.
 - Convert the tail extent to unwritten when truncating down an inode
   with a large rtextsize, as Darrick and Dave suggested.

This series fixes a stale data exposure issue reported by Chandan when
running fstests generic/561 on xfs with a realtime device[1]. The root
cause is that xfs_setattr_size() doesn't zero out a large enough range
when truncating a realtime inode, please see patch 6 or [1] for
details.

Patch 1 is from Dave, it improves truncate down performance by making
iomap_zero_iter() aware of dirty pages over unwritten extents. The
case where the zeroing range contains a cow mapping over a hole still
needs to be handled.

Patches 2-4 add the rem_u64() helper and modify iomap_truncate_page()
and dax_truncate_page() to take a filesystem-specified block size,
dropping the i_blocksize() assumption as Dave suggested.

Patches 5-6 adjust the truncate down processing order to first zero out
the tail aligned blocks, then write back, update i_size and finally drop
the cache beyond the aligned EOF. They fix the data exposure issue by
zeroing out the entire EOF extent.

Patches 7-8 add an rtextsize threshold (64k) and improve truncate down
performance on realtime inodes with a large rtextsize (beyond this
threshold) by converting the tail unaligned extent to unwritten.

I've tested this series on fstests (1) with reflink=0, (2) with a 28K RT
device and (3) with a 96K RT device (beyond the rtextsize threshold); no
new failures were detected. This series still needs further testing with
reflink=1 once patch 1 covers the cow mapping over a hole case.

[1] https://lore.kernel.org/linux-xfs/87ttj8ircu.fsf@debian-BULLSEYE-live-builder-AMD64/

Thanks,
Yi.

Dave Chinner (1):
  iomap: zeroing needs to be pagecache aware

Zhang Yi (7):
  math64: add rem_u64() to just return the remainder
  iomap: pass blocksize to iomap_truncate_page()
  fsdax: pass blocksize to dax_truncate_page()
  xfs: refactor the truncating order
  xfs: correct the truncate blocksize of realtime inode
  xfs: reserve blocks for truncating realtime inode
  xfs: improve truncate on a realtime inode with huge extsize

 fs/dax.c               |   8 +--
 fs/ext2/inode.c        |   4 +-
 fs/iomap/buffered-io.c |  50 ++++++++++++++--
 fs/xfs/xfs_inode.c     |   3 +
 fs/xfs/xfs_inode.h     |  12 ++++
 fs/xfs/xfs_iomap.c     |   5 +-
 fs/xfs/xfs_iomap.h     |   3 +-
 fs/xfs/xfs_iops.c      | 133 +++++++++++++++++++++++++----------------
 include/linux/dax.h    |   4 +-
 include/linux/iomap.h  |   4 +-
 include/linux/math64.h |  24 ++++++++
 11 files changed, 179 insertions(+), 71 deletions(-)

-- 
2.39.2


^ permalink raw reply	[flat|nested] 47+ messages in thread

* [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
@ 2024-05-29  9:51 ` Zhang Yi
  2024-05-31 13:11   ` Christoph Hellwig
  2024-06-02 11:04   ` Brian Foster
  2024-05-29  9:52 ` [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder Zhang Yi
                   ` (7 subsequent siblings)
  8 siblings, 2 replies; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:51 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Dave Chinner <dchinner@redhat.com>

Unwritten extents can have page cache data over the range being
zeroed so we can't just skip them entirely. Fix this by checking for
an existing dirty folio over the unwritten range we are zeroing
and only performing zeroing if the folio is already dirty.

XXX: how do we detect an iomap containing a cow mapping over a hole
in iomap_zero_iter()? The XFS code implies this case also needs to
zero the page cache if there is data present, so the page cache
lookup trigger in iomap_zero_iter() needs to handle this case as
well.

Before:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m14.103s
user    0m0.015s
sys     0m0.020s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 85.90    0.847616          16     50000           ftruncate
 14.01    0.138229           2     50000           pwrite64
....

After:

$ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters

real    0m0.144s
user    0m0.021s
sys     0m0.012s

$ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
path /mnt/scratch/foo, 50000 iters
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 53.86    0.505964          10     50000           ftruncate
 46.12    0.433251           8     50000           pwrite64
....

Yup, we get back all the performance.

As for the "mmap write beyond EOF" data exposure aspect
documented here:

https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/

With this command:

$ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
  -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
  -c fsync -c "pread -v 3k 32" /mnt/scratch/foo

Before:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
XXXXXXXXXXXXXXXX
00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
XXXXXXXXXXXXXXXX
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
   ops/sec)

After:

wrote 1024/1024 bytes at offset 0
1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
wrote 4096/4096 bytes at offset 32768
4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
................
00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
................
read 32/32 bytes at offset 3072
32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
   ops/sec)

We see that this post-eof unwritten extent dirty page zeroing is
working correctly.

This has passed through most of fstests on a couple of test VMs
without issues at the moment, so I think this approach to fixing the
issue is going to be solid once we've worked out how to detect the
COW-hole mapping case.

Signed-off-by: Dave Chinner <dchinner@redhat.com>
---
 fs/iomap/buffered-io.c | 42 ++++++++++++++++++++++++++++++++++++++++--
 fs/xfs/xfs_iops.c      | 12 +-----------
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 41c8f0c68ef5..a3a5d4eb7289 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -583,11 +583,23 @@ EXPORT_SYMBOL_GPL(iomap_is_partially_uptodate);
  *
  * Returns a locked reference to the folio at @pos, or an error pointer if the
  * folio could not be obtained.
+ *
+ * Note: when zeroing unwritten extents, we might have data in the page cache
+ * over an unwritten extent. In this case, we want to do a pure lookup on the
+ * page cache and not create a new folio as we don't need to perform zeroing on
+ * unwritten extents if there is no cached data over the given range.
  */
 struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
 {
 	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
 
+	if (iter->flags & IOMAP_ZERO) {
+		const struct iomap *srcmap = iomap_iter_srcmap(iter);
+
+		if (srcmap->type == IOMAP_UNWRITTEN)
+			fgp &= ~FGP_CREAT;
+	}
+
 	if (iter->flags & IOMAP_NOWAIT)
 		fgp |= FGP_NOWAIT;
 	fgp |= fgf_set_order(len);
@@ -1388,7 +1400,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE)
 		return length;
 
 	do {
@@ -1399,8 +1411,22 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		bool ret;
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status)
+		if (status) {
+			if (status == -ENOENT) {
+				/*
+				 * Unwritten extents need to have page cache
+				 * lookups done to determine if they have data
+				 * over them that needs zeroing. If there is no
+				 * data, we'll get -ENOENT returned here, so we
+				 * can just skip over this index.
+				 */
+				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
+				if (bytes > PAGE_SIZE - offset_in_page(pos))
+					bytes = PAGE_SIZE - offset_in_page(pos);
+				goto loop_continue;
+			}
 			return status;
+		}
 		if (iter->iomap.flags & IOMAP_F_STALE)
 			break;
 
@@ -1408,6 +1434,17 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (bytes > folio_size(folio) - offset)
 			bytes = folio_size(folio) - offset;
 
+		/*
+		 * If the folio over an unwritten extent is clean (i.e. because
+		 * it has been read from), then it already contains zeros. Hence
+		 * we can just skip it.
+		 */
+		if (srcmap->type == IOMAP_UNWRITTEN &&
+		    !folio_test_dirty(folio)) {
+			folio_unlock(folio);
+			goto loop_continue;
+		}
+
 		folio_zero_range(folio, offset, bytes);
 		folio_mark_accessed(folio);
 
@@ -1416,6 +1453,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		if (WARN_ON_ONCE(!ret))
 			return -EIO;
 
+loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..d44508930b67 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -861,17 +861,7 @@ xfs_setattr_size(
 		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
 		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
 				&did_zeroing);
-	} else {
-		/*
-		 * iomap won't detect a dirty page over an unwritten block (or a
-		 * cow block over a hole) and subsequently skips zeroing the
-		 * newly post-EOF portion of the page. Flush the new EOF to
-		 * convert the block before the pagecache truncate.
-		 */
-		error = filemap_write_and_wait_range(inode->i_mapping, newsize,
-						     newsize);
-		if (error)
-			return error;
+	} else if (newsize != oldsize) {
 		error = xfs_truncate_page(ip, newsize, &did_zeroing);
 	}
 
-- 
2.39.2



* [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
  2024-05-29  9:51 ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Zhang Yi
@ 2024-05-29  9:52 ` Zhang Yi
  2024-05-31 12:35   ` Christoph Hellwig
  2024-05-31 14:04   ` Darrick J. Wong
  2024-05-29  9:52 ` [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
                   ` (6 subsequent siblings)
  8 siblings, 2 replies; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:52 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

Add a new helper, rem_u64(), that returns only the remainder of an
unsigned 64-bit division by a 32-bit divisor.
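
As a userspace sketch of the intended semantics (is_pow2() and
rem_u64_model() below are illustrative stand-ins for the kernel's
is_power_of_2() and the proposed helper, not the kernel code itself):

```c
#include <stdint.h>

/* Userspace model of rem_u64(): remainder of a 64-bit dividend
 * divided by a 32-bit divisor, with a mask fast path when the
 * divisor is a power of two. */
static int is_pow2(uint32_t x)
{
	return x && !(x & (x - 1));
}

static uint32_t rem_u64_model(uint64_t dividend, uint32_t divisor)
{
	if (is_pow2(divisor))
		return dividend & (divisor - 1);
	return dividend % divisor;
}
```

Both paths agree for power-of-two divisors; the mask fast path matters
because a full 64-bit division is expensive on 32-bit hosts, where the
generic fallback has to go through do_div().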

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 include/linux/math64.h | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/include/linux/math64.h b/include/linux/math64.h
index d34def7f9a8c..618df4862091 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -3,6 +3,7 @@
 #define _LINUX_MATH64_H
 
 #include <linux/types.h>
+#include <linux/log2.h>
 #include <linux/math.h>
 #include <asm/div64.h>
 #include <vdso/math64.h>
@@ -12,6 +13,20 @@
 #define div64_long(x, y) div64_s64((x), (y))
 #define div64_ul(x, y)   div64_u64((x), (y))
 
+/**
+ * rem_u64 - remainder of unsigned 64bit divide with 32bit divisor
+ * @dividend: unsigned 64bit dividend
+ * @divisor: unsigned 32bit divisor
+ *
+ * Return: dividend % divisor
+ */
+static inline u32 rem_u64(u64 dividend, u32 divisor)
+{
+	if (is_power_of_2(divisor))
+		return dividend & (divisor - 1);
+	return dividend % divisor;
+}
+
 /**
  * div_u64_rem - unsigned 64bit divide with 32bit divisor with remainder
  * @dividend: unsigned 64bit dividend
@@ -86,6 +101,15 @@ static inline s64 div64_s64(s64 dividend, s64 divisor)
 #define div64_long(x, y) div_s64((x), (y))
 #define div64_ul(x, y)   div_u64((x), (y))
 
+#ifndef rem_u64
+static inline u32 rem_u64(u64 dividend, u32 divisor)
+{
+	if (is_power_of_2(divisor))
+		return dividend & (divisor - 1);
+	return do_div(dividend, divisor);
+}
+#endif
+
 #ifndef div_u64_rem
 static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
 {
-- 
2.39.2



* [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page()
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
  2024-05-29  9:51 ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Zhang Yi
  2024-05-29  9:52 ` [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder Zhang Yi
@ 2024-05-29  9:52 ` Zhang Yi
  2024-05-31 12:39   ` Christoph Hellwig
  2024-05-29  9:52 ` [RFC PATCH v4 4/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
                   ` (5 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:52 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

iomap_truncate_page() always assumes that the block size of the inode
being truncated is i_blocksize(), which is not true for some
filesystems, e.g. XFS does extent size alignment for realtime inodes.
Drop this assumption and pass the block size for zeroing into
iomap_truncate_page(), allowing filesystems to specify the correct
block size.
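
To illustrate the calculation being changed: a userspace sketch
(off_by_mask() and off_by_rem() are hypothetical names, not from this
patch) showing that the old mask-based offset-in-block computation only
matches a true remainder when the block size is a power of two, which a
realtime extent size such as 28K is not:

```c
#include <stdint.h>

/* Offset of @pos within an allocation unit of size @bs.  The old
 * calculation, pos & (bs - 1), is only valid when @bs is a power
 * of two. */
static uint32_t off_by_mask(uint64_t pos, uint32_t bs)
{
	return pos & (bs - 1);
}

/* The rem_u64()-style calculation is correct for any @bs. */
static uint32_t off_by_rem(uint64_t pos, uint32_t bs)
{
	return pos % bs;
}
```

For a 4096-byte block both agree, but for a 28672-byte (28K) unit the
mask silently produces the wrong offset, so the caller must supply the
real allocation unit and use a genuine remainder.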

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/iomap/buffered-io.c | 8 ++++----
 fs/xfs/xfs_iomap.c     | 3 ++-
 include/linux/iomap.h  | 4 ++--
 3 files changed, 8 insertions(+), 7 deletions(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index a3a5d4eb7289..30b49d6e9d48 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -17,6 +17,7 @@
 #include <linux/bio.h>
 #include <linux/sched/signal.h>
 #include <linux/migrate.h>
+#include <linux/math64.h>
 #include "trace.h"
 
 #include "../internal.h"
@@ -1483,11 +1484,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 EXPORT_SYMBOL_GPL(iomap_zero_range);
 
 int
-iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops)
+iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops)
 {
-	unsigned int blocksize = i_blocksize(inode);
-	unsigned int off = pos & (blocksize - 1);
+	unsigned int off = rem_u64(pos, blocksize);
 
 	/* Block boundary? Nothing to do */
 	if (!off)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 378342673925..32306804b01b 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1471,10 +1471,11 @@ xfs_truncate_page(
 	bool			*did_zero)
 {
 	struct inode		*inode = VFS_I(ip);
+	unsigned int		blocksize = i_blocksize(inode);
 
 	if (IS_DAX(inode))
 		return dax_truncate_page(inode, pos, did_zero,
 					&xfs_dax_write_iomap_ops);
-	return iomap_truncate_page(inode, pos, did_zero,
+	return iomap_truncate_page(inode, pos, blocksize, did_zero,
 				   &xfs_buffered_write_iomap_ops);
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..d67bf86ec582 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -273,8 +273,8 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
 		bool *did_zero, const struct iomap_ops *ops);
-int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
+int iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops);
 vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
 			const struct iomap_ops *ops);
 int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
-- 
2.39.2



* [RFC PATCH v4 4/8] fsdax: pass blocksize to dax_truncate_page()
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
                   ` (2 preceding siblings ...)
  2024-05-29  9:52 ` [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
@ 2024-05-29  9:52 ` Zhang Yi
  2024-05-29  9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
                   ` (4 subsequent siblings)
  8 siblings, 0 replies; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:52 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

dax_truncate_page() always assumes that the block size of the inode
being truncated is i_blocksize(), which is not true for some
filesystems, e.g. XFS does extent size alignment for realtime inodes.
Drop this assumption and pass the block size for zeroing into
dax_truncate_page(), allowing filesystems to specify the correct block
size.

Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/dax.c            | 8 ++++----
 fs/ext2/inode.c     | 4 ++--
 fs/xfs/xfs_iomap.c  | 2 +-
 include/linux/dax.h | 4 ++--
 4 files changed, 9 insertions(+), 9 deletions(-)

diff --git a/fs/dax.c b/fs/dax.c
index becb4a6920c6..4cbd94fd96ed 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,6 +25,7 @@
 #include <linux/mmu_notifier.h>
 #include <linux/iomap.h>
 #include <linux/rmap.h>
+#include <linux/math64.h>
 #include <asm/pgalloc.h>
 
 #define CREATE_TRACE_POINTS
@@ -1403,11 +1404,10 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 }
 EXPORT_SYMBOL_GPL(dax_zero_range);
 
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops)
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops)
 {
-	unsigned int blocksize = i_blocksize(inode);
-	unsigned int off = pos & (blocksize - 1);
+	unsigned int off = rem_u64(pos, blocksize);
 
 	/* Block boundary? Nothing to do */
 	if (!off)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0caa1650cee8..337349c94adf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1276,8 +1276,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
 	inode_dio_wait(inode);
 
 	if (IS_DAX(inode))
-		error = dax_truncate_page(inode, newsize, NULL,
-					  &ext2_iomap_ops);
+		error = dax_truncate_page(inode, newsize, i_blocksize(inode),
+					  NULL, &ext2_iomap_ops);
 	else
 		error = block_truncate_page(inode->i_mapping,
 				newsize, ext2_get_block);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 32306804b01b..8cdfcbb5baa7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1474,7 +1474,7 @@ xfs_truncate_page(
 	unsigned int		blocksize = i_blocksize(inode);
 
 	if (IS_DAX(inode))
-		return dax_truncate_page(inode, pos, did_zero,
+		return dax_truncate_page(inode, pos, blocksize, did_zero,
 					&xfs_dax_write_iomap_ops);
 	return iomap_truncate_page(inode, pos, blocksize, did_zero,
 				   &xfs_buffered_write_iomap_ops);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d3e3327af4c..4aa8ef7c8fd4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -210,8 +210,8 @@ int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 		const struct iomap_ops *ops);
 int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 		const struct iomap_ops *ops);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
-		const struct iomap_ops *ops);
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+		bool *did_zero, const struct iomap_ops *ops);
 
 #if IS_ENABLED(CONFIG_DAX)
 int dax_read_lock(void);
-- 
2.39.2



* [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
                   ` (3 preceding siblings ...)
  2024-05-29  9:52 ` [RFC PATCH v4 4/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
@ 2024-05-29  9:52 ` Zhang Yi
  2024-05-31 13:31   ` Christoph Hellwig
                     ` (2 more replies)
  2024-05-29  9:52 ` [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
                   ` (3 subsequent siblings)
  8 siblings, 3 replies; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:52 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

When truncating down an inode, we call xfs_truncate_page() to zero out
the tail partial block beyond the new EOF, which prevents exposing
stale data. But xfs_truncate_page() always assumes the blocksize is
i_blocksize(inode), which is not true if the file has a large
allocation unit; in that case we should align to the unit size, e.g. a
realtime inode should be aligned to the rtextsize.

The current xfs_setattr_size() can't zero out a large alignment size
on truncate down because the processing order is wrong. We first zero
out through xfs_truncate_page(), and then immediately update the inode
size through truncate_setsize(). If the zeroed range is larger than a
folio, the writeback path will not write back the zeroed page cache
beyond the EOF folio, so it doesn't write zeroes to the entire tail
extent and could expose stale data after an appending write into the
next aligned extent.

Adjust the order to zero out the tail aligned blocks, write back the
zeroed or cached data, update i_size, and drop the cache beyond the
aligned EOF block, preparing for the fix for realtime inodes and the
upcoming forced alignment feature.
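
A userspace sketch of the range arithmetic used by the reordered
truncate-down path (roundup_64_model() and the wb_*() helpers are
illustrative stand-ins, not kernel code; the 28K unit mirrors the RT
device configuration used for testing):

```c
#include <stdint.h>

/* Userspace stand-in for roundup_64(): round @x up to a multiple
 * of @y. */
static uint64_t roundup_64_model(uint64_t x, uint32_t y)
{
	return ((x + y - 1) / y) * y;
}

/* Writeback starts at min(i_disk_size, newsize) ... */
static uint64_t wb_start(uint64_t disk_size, uint64_t newsize)
{
	return newsize < disk_size ? newsize : disk_size;
}

/* ... and ends at the last byte of the allocation unit containing
 * the new EOF, so the zeroed tail of the whole unit is written out
 * before i_size is updated. */
static uint64_t wb_end(uint64_t newsize, uint32_t blocksize)
{
	return roundup_64_model(newsize, blocksize) - 1;
}
```

After writeback, i_size is updated and the page cache is truncated
from the rounded-up boundary, i.e. roundup_64_model(newsize,
blocksize), rather than from newsize itself.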

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_iomap.c |   2 +-
 fs/xfs/xfs_iomap.h |   3 +-
 fs/xfs/xfs_iops.c  | 107 ++++++++++++++++++++++++++++-----------------
 3 files changed, 69 insertions(+), 43 deletions(-)

diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8cdfcbb5baa7..0369b64cc3f4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1468,10 +1468,10 @@ int
 xfs_truncate_page(
 	struct xfs_inode	*ip,
 	loff_t			pos,
+	unsigned int		blocksize,
 	bool			*did_zero)
 {
 	struct inode		*inode = VFS_I(ip);
-	unsigned int		blocksize = i_blocksize(inode);
 
 	if (IS_DAX(inode))
 		return dax_truncate_page(inode, pos, blocksize, did_zero,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 4da13440bae9..feb1610cb645 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -25,7 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
 
 int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
 		bool *did_zero);
-int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
+int xfs_truncate_page(struct xfs_inode *ip, loff_t pos,
+		unsigned int blocksize, bool *did_zero);
 
 static inline xfs_filblks_t
 xfs_aligned_fsb_count(
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d44508930b67..d24927075022 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -812,6 +812,7 @@ xfs_setattr_size(
 	int			error;
 	uint			lock_flags = 0;
 	bool			did_zeroing = false;
+	bool			write_back = false;
 
 	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
 	ASSERT(S_ISREG(inode->i_mode));
@@ -853,30 +854,7 @@ xfs_setattr_size(
 	 * the transaction because the inode cannot be unlocked once it is a
 	 * part of the transaction.
 	 *
-	 * Start with zeroing any data beyond EOF that we may expose on file
-	 * extension, or zeroing out the rest of the block on a downward
-	 * truncate.
-	 */
-	if (newsize > oldsize) {
-		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
-		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
-				&did_zeroing);
-	} else if (newsize != oldsize) {
-		error = xfs_truncate_page(ip, newsize, &did_zeroing);
-	}
-
-	if (error)
-		return error;
-
-	/*
-	 * We've already locked out new page faults, so now we can safely remove
-	 * pages from the page cache knowing they won't get refaulted until we
-	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
-	 * complete. The truncate_setsize() call also cleans partial EOF page
-	 * PTEs on extending truncates and hence ensures sub-page block size
-	 * filesystems are correctly handled, too.
-	 *
-	 * We have to do all the page cache truncate work outside the
+	 * And we have to do all the page cache truncate work outside the
 	 * transaction context as the "lock" order is page lock->log space
 	 * reservation as defined by extent allocation in the writeback path.
 	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
@@ -884,27 +862,74 @@ xfs_setattr_size(
 	 * user visible changes). There's not much we can do about this, except
 	 * to hope that the caller sees ENOMEM and retries the truncate
 	 * operation.
-	 *
-	 * And we update in-core i_size and truncate page cache beyond newsize
-	 * before writeback the [i_disk_size, newsize] range, so we're
-	 * guaranteed not to write stale data past the new EOF on truncate down.
 	 */
-	truncate_setsize(inode, newsize);
+	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
+	if (newsize < oldsize) {
+		unsigned int blocksize = i_blocksize(inode);
 
-	/*
-	 * We are going to log the inode size change in this transaction so
-	 * any previous writes that are beyond the on disk EOF and the new
-	 * EOF that have not been written out need to be written here.  If we
-	 * do not write the data out, we expose ourselves to the null files
-	 * problem. Note that this includes any block zeroing we did above;
-	 * otherwise those blocks may not be zeroed after a crash.
-	 */
-	if (did_zeroing ||
-	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
-		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
-						ip->i_disk_size, newsize - 1);
+		/*
+		 * Zeroing out the partial EOF block and the rest of the extra
+		 * aligned blocks on a downward truncate.
+		 */
+		error = xfs_truncate_page(ip, newsize, blocksize, &did_zeroing);
 		if (error)
 			return error;
+
+		/*
+		 * We are going to log the inode size change in this transaction
+		 * so any previous writes that are beyond the on disk EOF and
+		 * the new EOF that have not been written out need to be written
+		 * here.  If we do not write the data out, we expose ourselves
+		 * to the null files problem. Note that this includes any block
+		 * zeroing we did above; otherwise those blocks may not be
+		 * zeroed after a crash.
+		 */
+		if (did_zeroing || write_back) {
+			error = filemap_write_and_wait_range(inode->i_mapping,
+					min_t(loff_t, ip->i_disk_size, newsize),
+					roundup_64(newsize, blocksize) - 1);
+			if (error)
+				return error;
+		}
+
+		/*
+		 * Update i_size after writing back to make sure the zeroed
+		 * blocks have been written out, and drop all the page cache
+		 * beyond the blocksize-aligned new EOF block.
+		 *
+		 * We've already locked out new page faults, so now we can
+		 * safely remove pages from the page cache knowing they won't
+		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
+		 * extent manipulations are complete.
+		 */
+		i_size_write(inode, newsize);
+		truncate_pagecache(inode, roundup_64(newsize, blocksize));
+	} else {
+		/*
+		 * Start with zeroing any data beyond EOF that we may expose on
+		 * file extension.
+		 */
+		if (newsize > oldsize) {
+			trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
+			error = xfs_zero_range(ip, oldsize, newsize - oldsize,
+					       &did_zeroing);
+			if (error)
+				return error;
+		}
+
+		/*
+		 * The truncate_setsize() call also cleans partial EOF page
+		 * PTEs on extending truncates and hence ensures sub-page block
+		 * size filesystems are correctly handled, too.
+		 */
+		truncate_setsize(inode, newsize);
+
+		if (did_zeroing || write_back) {
+			error = filemap_write_and_wait_range(inode->i_mapping,
+					ip->i_disk_size, newsize - 1);
+			if (error)
+				return error;
+		}
 	}
 
 	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
-- 
2.39.2



* [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
                   ` (4 preceding siblings ...)
  2024-05-29  9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
@ 2024-05-29  9:52 ` Zhang Yi
  2024-05-31 13:36   ` Christoph Hellwig
  2024-05-29  9:52 ` [RFC PATCH v4 7/8] xfs: reserve blocks for truncating " Zhang Yi
                   ` (2 subsequent siblings)
  8 siblings, 1 reply; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:52 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

When doing an unaligned truncate down on a realtime file whose
sb_rextsize is bigger than one block, xfs_truncate_page() only zeroes
out the tail EOF block, which can expose stale data since commit
'943bc0882ceb ("iomap: don't increase i_size if it's not a write
operation")'.

If we truncate a file that contains a large enough written extent:

     |<    rtext    >|<    rtext    >|
  ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
        ^ (new EOF)      ^ old EOF

Since we only zero out the tail of the EOF block, and
xfs_itruncate_extents() unmaps whole aligned extents, we end up in
this state:

     |<    rtext    >|
  ...WWWzWWWWWWWWWWWWW
        ^ new EOF

Then, if we do an extending write like this, the blocks in the previous
tail extent become stale:

     |<    rtext    >|
  ...WWWzSSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW
        ^ old EOF               ^ append start  ^ new EOF

Fix this by zeroing out the tail allocation unit and also making sure
xfs_itruncate_extents() unmaps rtextsize-aligned extents.

Fixes: 943bc0882ceb ("iomap: don't increase i_size if it's not a write operation")
Reported-by: Chandan Babu R <chandanbabu@kernel.org>
Link: https://lore.kernel.org/linux-xfs/0b92a215-9d9b-3788-4504-a520778953c2@huaweicloud.com
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_inode.c | 3 +++
 fs/xfs/xfs_iops.c  | 2 +-
 2 files changed, 4 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 58fb7a5062e1..db35167acef6 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -35,6 +35,7 @@
 #include "xfs_trans_priv.h"
 #include "xfs_log.h"
 #include "xfs_bmap_btree.h"
+#include "xfs_rtbitmap.h"
 #include "xfs_reflink.h"
 #include "xfs_ag.h"
 #include "xfs_log_priv.h"
@@ -1512,6 +1513,8 @@ xfs_itruncate_extents_flags(
 	 * the page cache can't scale that far.
 	 */
 	first_unmap_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size);
+	if (xfs_inode_has_bigrtalloc(ip))
+		first_unmap_block = xfs_rtb_roundup_rtx(mp, first_unmap_block);
 	if (!xfs_verify_fileoff(mp, first_unmap_block)) {
 		WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
 		return 0;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index d24927075022..ec7b7bdf8825 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -865,7 +865,7 @@ xfs_setattr_size(
 	 */
 	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
 	if (newsize < oldsize) {
-		unsigned int blocksize = i_blocksize(inode);
+		unsigned int blocksize = xfs_inode_alloc_unitsize(ip);
 
 		/*
 		 * Zeroing out the partial EOF block and the rest of the extra
-- 
2.39.2



* [RFC PATCH v4 7/8] xfs: reserve blocks for truncating realtime inode
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
                   ` (5 preceding siblings ...)
  2024-05-29  9:52 ` [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
@ 2024-05-29  9:52 ` Zhang Yi
  2024-05-31 12:42   ` Christoph Hellwig
  2024-05-29  9:52 ` [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize Zhang Yi
  2024-05-31 12:26 ` [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Christoph Hellwig
  8 siblings, 1 reply; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:52 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

On a realtime inode, __xfs_bunmapi() can convert the unaligned extra
blocks to the unwritten state, but this doesn't work as expected on
truncate down since xfs_setattr_size() reserves no blocks for the
transaction. Fix this by reserving blocks when truncating a realtime
inode.

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_iops.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ec7b7bdf8825..c53de5e6ef66 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -17,6 +17,8 @@
 #include "xfs_da_btree.h"
 #include "xfs_attr.h"
 #include "xfs_trans.h"
+#include "xfs_trans_space.h"
+#include "xfs_bmap_btree.h"
 #include "xfs_trace.h"
 #include "xfs_icache.h"
 #include "xfs_symlink.h"
@@ -811,6 +813,7 @@ xfs_setattr_size(
 	struct xfs_trans	*tp;
 	int			error;
 	uint			lock_flags = 0;
+	uint			resblks;
 	bool			did_zeroing = false;
 	bool			write_back = false;
 
@@ -932,7 +935,9 @@ xfs_setattr_size(
 		}
 	}
 
-	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+	resblks = XFS_IS_REALTIME_INODE(ip) ? XFS_DIOSTRAT_SPACE_RES(mp, 0) : 0;
+	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+				0, 0, &tp);
 	if (error)
 		return error;
 
-- 
2.39.2



* [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
                   ` (6 preceding siblings ...)
  2024-05-29  9:52 ` [RFC PATCH v4 7/8] xfs: reserve blocks for truncating " Zhang Yi
@ 2024-05-29  9:52 ` Zhang Yi
  2024-05-31 13:46   ` Christoph Hellwig
  2024-05-31 12:26 ` [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Christoph Hellwig
  8 siblings, 1 reply; 47+ messages in thread
From: Zhang Yi @ 2024-05-29  9:52 UTC (permalink / raw)
  To: linux-xfs, linux-fsdevel
  Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
	willy, yi.zhang, yi.zhang, chengzhihao1, yukuai3

From: Zhang Yi <yi.zhang@huawei.com>

If we truncate down a realtime inode whose extsize is very large,
zeroing out the entire aligned EOF extent could be very slow.
Fortunately, __xfs_bunmapi() aligns the unmapped range to the
rtextsize, splitting and converting the extra blocks to the unwritten
state. So, if the rtextsize is large enough, adjust the blocksize to
the filesystem blocksize and let __xfs_bunmapi() convert the tail
blocks to unwritten; this improves the truncate performance
significantly.

 # mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
            -d rtinherit=1 -r extsize=1G /dev/pmem2s
 # for i in {1..5}; \
   do dd if=/dev/zero of=/mnt/scratch/$i bs=1M count=1024; done
 # sync
 # time for i in {1..5}; \
   do xfs_io -c "truncate 4k" /mnt/scratch/$i; done

Before:
 real    0m16.762s
 user    0m0.008s
 sys     0m16.750s

After:
 real    0m0.076s
 user    0m0.010s
 sys     0m0.069s

Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
 fs/xfs/xfs_inode.c |  2 +-
 fs/xfs/xfs_inode.h | 12 ++++++++++++
 fs/xfs/xfs_iops.c  |  9 +++++++++
 3 files changed, 22 insertions(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index db35167acef6..c0c1ab310aae 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1513,7 +1513,7 @@ xfs_itruncate_extents_flags(
 	 * the page cache can't scale that far.
 	 */
 	first_unmap_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size);
-	if (xfs_inode_has_bigrtalloc(ip))
+	if (xfs_inode_has_bigrtalloc(ip) && !xfs_inode_has_hugertalloc(ip))
 		first_unmap_block = xfs_rtb_roundup_rtx(mp, first_unmap_block);
 	if (!xfs_verify_fileoff(mp, first_unmap_block)) {
 		WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
diff --git a/fs/xfs/xfs_inode.h b/fs/xfs/xfs_inode.h
index 292b90b5f2ac..4eed5b0c57c0 100644
--- a/fs/xfs/xfs_inode.h
+++ b/fs/xfs/xfs_inode.h
@@ -320,6 +320,18 @@ static inline bool xfs_inode_has_bigrtalloc(struct xfs_inode *ip)
 	return XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1;
 }
 
+/*
+ * Decide if this file is a realtime file whose data allocation unit is larger
+ * than default.
+ */
+static inline bool xfs_inode_has_hugertalloc(struct xfs_inode *ip)
+{
+	struct xfs_mount *mp = ip->i_mount;
+
+	return XFS_IS_REALTIME_INODE(ip) &&
+	       mp->m_sb.sb_rextsize > XFS_B_TO_FSB(mp, XFS_DFL_RTEXTSIZE);
+}
+
 /*
  * Return the buftarg used for data allocations on a given inode.
  */
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index c53de5e6ef66..d5fc84e5a37c 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -870,6 +870,15 @@ xfs_setattr_size(
 	if (newsize < oldsize) {
 		unsigned int blocksize = xfs_inode_alloc_unitsize(ip);
 
+		/*
+		 * If the extsize is too large on a realtime inode, zeroing
+		 * out the entire aligned EOF extent could be slow, adjust the
+		 * blocksize to the filesystem blocksize, let __xfs_bunmapi()
+		 * to convert the tail blocks to unwritten.
+		 */
+		if (xfs_inode_has_hugertalloc(ip))
+			blocksize = i_blocksize(inode);
+
 		/*
 		 * Zeroing out the partial EOF block and the rest of the extra
 		 * aligned blocks on a downward truncate.
-- 
2.39.2



* Re: [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes
  2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
                   ` (7 preceding siblings ...)
  2024-05-29  9:52 ` [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize Zhang Yi
@ 2024-05-31 12:26 ` Christoph Hellwig
  2024-06-01  7:38   ` Zhang Yi
  8 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 12:26 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

Procedural question before I get into the actual review:  given we
are close to -rc3 and there is no user of the iomap change yet,
should we revert that for 6.10 and instead try again in 6.11 when
the XFS bits are sorted out?



* Re: [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder
  2024-05-29  9:52 ` [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder Zhang Yi
@ 2024-05-31 12:35   ` Christoph Hellwig
  2024-05-31 14:04   ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 12:35 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

> +#ifndef rem_u64
> +static inline u32 rem_u64(u64 dividend, u32 divisor)
> +{
> +	if (is_power_of_2(divisor))
> +		return dividend & (divisor - 1);
> +	return do_div(dividend, divisor);
> +}
> +#endif

This ifndef seems superfluous.

Otherwise looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>


* Re: [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page()
  2024-05-29  9:52 ` [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
@ 2024-05-31 12:39   ` Christoph Hellwig
  2024-06-02 11:16     ` Brian Foster
  2024-06-03 13:23     ` Zhang Yi
  0 siblings, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 12:39 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

> -		const struct iomap_ops *ops)
> +iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
> +		bool *did_zero, const struct iomap_ops *ops)
>  {
> -	unsigned int blocksize = i_blocksize(inode);
> -	unsigned int off = pos & (blocksize - 1);
> +	unsigned int off = rem_u64(pos, blocksize);
>  
>  	/* Block boundary? Nothing to do */
>  	if (!off)

Instead of passing yet another argument here, can we just kill
iomap_truncate_page?

I.e. just open code the rem_u64 and 0 offset check in the only caller
and call iomap_zero_range.  Same for the DAX variant and its two
callers.



* Re: [RFC PATCH v4 7/8] xfs: reserve blocks for truncating realtime inode
  2024-05-29  9:52 ` [RFC PATCH v4 7/8] xfs: reserve blocks for truncating " Zhang Yi
@ 2024-05-31 12:42   ` Christoph Hellwig
  2024-05-31 14:10     ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 12:42 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

> -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> +	resblks = XFS_IS_REALTIME_INODE(ip) ? XFS_DIOSTRAT_SPACE_RES(mp, 0) : 0;

This probably wants a comment explaining that we need the block
reservation for bmap btree block allocations / splits that can happen
because we can split a written extent into one written and one
unwritten, while for the data fork we'll always just shorten or
remove extents.

I'd also find this more readable if resblks was initialized to 0,
and this became a:

	if (XFS_IS_REALTIME_INODE(ip))
		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);



* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-29  9:51 ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Zhang Yi
@ 2024-05-31 13:11   ` Christoph Hellwig
  2024-05-31 14:03     ` Darrick J. Wong
  2024-06-02 22:22     ` Dave Chinner
  2024-06-02 11:04   ` Brian Foster
  1 sibling, 2 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 13:11 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.

If there is no data in the page cache and we have either a hole or an
unwritten extent, it really should not matter what is in the COW fork,
as there obviously isn't any data we could zero.

If there is data in the page cache for something that is marked as
a hole in the srcmap, but we have data in the COW fork due to
COW extsize preallocation we'd need to zero it, but as the
xfs iomap ops don't return a separate srcmap for that case we
should be fine.  Or am I missing something?

> + * Note: when zeroing unwritten extents, we might have data in the page cache
> + * over an unwritten extent. In this case, we want to do a pure lookup on the
> + * page cache and not create a new folio as we don't need to perform zeroing on
> + * unwritten extents if there is no cached data over the given range.
>   */
>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
>  {
>  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
>  
> +	if (iter->flags & IOMAP_ZERO) {
> +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> +
> +		if (srcmap->type == IOMAP_UNWRITTEN)
> +			fgp &= ~FGP_CREAT;
> +	}

Nit:  The comment would probably stand out a little better if it was
right next to the IOMAP_ZERO conditional instead of above the
function.

> +		if (status) {
> +			if (status == -ENOENT) {
> +				/*
> +				 * Unwritten extents need to have page cache
> +				 * lookups done to determine if they have data
> +				 * over them that needs zeroing. If there is no
> +				 * data, we'll get -ENOENT returned here, so we
> +				 * can just skip over this index.
> +				 */
> +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);

I'd return -EIO if the WARN_ON triggers.

> +loop_continue:

While I'm no stranger to gotos for loop control, something trips me
up about jumping to the end of the loop.  Here is what I could come
up with instead.  Not arguing it's objectively better, but I somehow
like it a little better:

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 700b22d6807783..81378f7cd8d7ff 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 		bool ret;
 
 		status = iomap_write_begin(iter, pos, bytes, &folio);
-		if (status) {
-			if (status == -ENOENT) {
-				/*
-				 * Unwritten extents need to have page cache
-				 * lookups done to determine if they have data
-				 * over them that needs zeroing. If there is no
-				 * data, we'll get -ENOENT returned here, so we
-				 * can just skip over this index.
-				 */
-				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
-				if (bytes > PAGE_SIZE - offset_in_page(pos))
-					bytes = PAGE_SIZE - offset_in_page(pos);
-				goto loop_continue;
-			}
+		if (status && status != -ENOENT)
 			return status;
-		}
-		if (iter->iomap.flags & IOMAP_F_STALE)
-			break;
 
-		offset = offset_in_folio(folio, pos);
-		if (bytes > folio_size(folio) - offset)
-			bytes = folio_size(folio) - offset;
+		if (status == -ENOENT) {
+			/*
+			 * If we end up here, we did not find a folio in the
+			 * page cache for an unwritten extent and thus can
+			 * skip over the range.
+			 */
+			if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN))
+				return -EIO;
 
-		/*
-		 * If the folio over an unwritten extent is clean (i.e. because
-		 * it has been read from), then it already contains zeros. Hence
-		 * we can just skip it.
-		 */
-		if (srcmap->type == IOMAP_UNWRITTEN &&
-		    !folio_test_dirty(folio)) {
-			folio_unlock(folio);
-			goto loop_continue;
+			/*
+			 * XXX: It would be nice if we could get the offset of
+			 * the next entry in the pagecache so that we don't have
+			 * to iterate one page at a time here.
+			 */
+			offset = offset_in_page(pos);
+			if (bytes > PAGE_SIZE - offset)
+				bytes = PAGE_SIZE - offset;
+		} else {
+			if (iter->iomap.flags & IOMAP_F_STALE)
+				break;
+
+			offset = offset_in_folio(folio, pos);
+			if (bytes > folio_size(folio) - offset)
+				bytes = folio_size(folio) - offset;
+		
+			/*
+			 * If the folio over an unwritten extent is clean (i.e.
+			 * because it has only been read from), then it already
+			 * contains zeros.  Hence we can just skip it.
+			 */
+			if (srcmap->type == IOMAP_UNWRITTEN &&
+			    !folio_test_dirty(folio)) {
+				folio_unlock(folio);
+				status = -ENOENT;
+			}
 		}
 
-		folio_zero_range(folio, offset, bytes);
-		folio_mark_accessed(folio);
+		if (status != -ENOENT) {
+			folio_zero_range(folio, offset, bytes);
+			folio_mark_accessed(folio);
 
-		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
-		__iomap_put_folio(iter, pos, bytes, folio);
-		if (WARN_ON_ONCE(!ret))
-			return -EIO;
+			ret = iomap_write_end(iter, pos, bytes, bytes, folio);
+			__iomap_put_folio(iter, pos, bytes, folio);
+			if (WARN_ON_ONCE(!ret))
+				return -EIO;
+		}
 
-loop_continue:
 		pos += bytes;
 		length -= bytes;
 		written += bytes;


* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-29  9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
@ 2024-05-31 13:31   ` Christoph Hellwig
  2024-05-31 15:27     ` Darrick J. Wong
  2024-05-31 15:44   ` Darrick J. Wong
  2024-06-02 22:46   ` Dave Chinner
  2 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 13:31 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;

Maybe need_writeback would be a better name for the variable?  Also no
need to initialize it to false at declaration time if it is
unconditionally set here.

> +		/*
> +		 * Updating i_size after writing back to make sure the zeroed
> +		 * blocks could been written out, and drop all the page cache
> +		 * range that beyond blocksize aligned new EOF block.
> +		 *
> +		 * We've already locked out new page faults, so now we can
> +		 * safely remove pages from the page cache knowing they won't
> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
> +		 * extent manipulations are complete.
> +		 */
> +		i_size_write(inode, newsize);
> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));

Any reason this open codes truncate_setsize()?



* Re: [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode
  2024-05-29  9:52 ` [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
@ 2024-05-31 13:36   ` Christoph Hellwig
  2024-06-03 14:35     ` Zhang Yi
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 13:36 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Wed, May 29, 2024 at 05:52:04PM +0800, Zhang Yi wrote:
> +	if (xfs_inode_has_bigrtalloc(ip))
> +		first_unmap_block = xfs_rtb_roundup_rtx(mp, first_unmap_block);

Given that first_unmap_block is an xfs_fileoff_t and not an
xfs_rtblock_t, this looks a bit confusing.  I'd suggest just open
coding the arithmetic in xfs_rtb_roundup_rtx.  For future proofing,
maybe also use xfs_inode_alloc_unitsize() as in the hunk below instead
of hard coding the rtextsize.  I.e.:

	first_unmap_block = XFS_B_TO_FSB(mp,
		roundup_64(new_size, xfs_inode_alloc_unitsize(ip)));



* Re: [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize
  2024-05-29  9:52 ` [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize Zhang Yi
@ 2024-05-31 13:46   ` Christoph Hellwig
  2024-05-31 14:12     ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 13:46 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

> +/*
> + * Decide if this file is a realtime file whose data allocation unit is larger
> + * than default.
> + */
> +static inline bool xfs_inode_has_hugertalloc(struct xfs_inode *ip)
> +{
> +	struct xfs_mount *mp = ip->i_mount;
> +
> +	return XFS_IS_REALTIME_INODE(ip) &&
> +	       mp->m_sb.sb_rextsize > XFS_B_TO_FSB(mp, XFS_DFL_RTEXTSIZE);
> +}

The default rtextsize is actually a single FSB unless we're on a striped
volume in which case it is increased.

I'll take care of removing the unused and confusing XFS_DFL_RTEXTSIZE,
but for this patch we'd need to know the trade-off of when to just
convert to unwritten.  For single-fsb rtextents we obviously don't need
any special action.  But do you see a slowdown when converting to
unwritten for small (> 1) rtextsizes?  Because if not we could just
always use that code path, which would significantly simplify things
and remove yet another code path that needs separate testing.



* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-31 13:11   ` Christoph Hellwig
@ 2024-05-31 14:03     ` Darrick J. Wong
  2024-05-31 14:05       ` Christoph Hellwig
  2024-05-31 15:43       ` Brian Foster
  2024-06-02 22:22     ` Dave Chinner
  1 sibling, 2 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 14:03 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote:
> On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > XXX: how do we detect a iomap containing a cow mapping over a hole
> > in iomap_zero_iter()? The XFS code implies this case also needs to
> > zero the page cache if there is data present, so trigger for page
> > cache lookup only in iomap_zero_iter() needs to handle this case as
> > well.
> 
> If there is no data in the page cache and either a whole or unwritten
> extent it really should not matter what is in the COW fork, a there
> obviously isn't any data we could zero.
> 
> If there is data in the page cache for something that is marked as
> a hole in the srcmap, but we have data in the COW fork due to
> COW extsize preallocation we'd need to zero it, but as the
> xfs iomap ops don't return a separate srcmap for that case we
> should be fine.  Or am I missing something?

It might be useful to skip the scan for dirty pagecache if both forks
have holes, since (in theory) that's never possible on xfs.

OTOH maybe there are filesystems that allow dirty pagecache over a hole?

> > + * Note: when zeroing unwritten extents, we might have data in the page cache
> > + * over an unwritten extent. In this case, we want to do a pure lookup on the
> > + * page cache and not create a new folio as we don't need to perform zeroing on
> > + * unwritten extents if there is no cached data over the given range.
> >   */
> >  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> >  {
> >  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> >  
> > +	if (iter->flags & IOMAP_ZERO) {
> > +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > +
> > +		if (srcmap->type == IOMAP_UNWRITTEN)
> > +			fgp &= ~FGP_CREAT;
> > +	}
> 
> Nit:  The comment would probably stand out a little better if it was
> right next to the IOMAP_ZERO conditional instead of above the
> function.

Agreed.

> > +		if (status) {
> > +			if (status == -ENOENT) {
> > +				/*
> > +				 * Unwritten extents need to have page cache
> > +				 * lookups done to determine if they have data
> > +				 * over them that needs zeroing. If there is no
> > +				 * data, we'll get -ENOENT returned here, so we
> > +				 * can just skip over this index.
> > +				 */
> > +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> 
> I'd return -EIO if the WARN_ON triggers.
> 
> > +loop_continue:
> 
> While I'm no stranger to gotos for loop control, something trips me
> up about jumping to the end of the loop.  Here is what I could come
> up with instead.  Not arguing it's objectively better, but I somehow
> like it a little better:
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index 700b22d6807783..81378f7cd8d7ff 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
>  		bool ret;
>  
>  		status = iomap_write_begin(iter, pos, bytes, &folio);
> -		if (status) {
> -			if (status == -ENOENT) {
> -				/*
> -				 * Unwritten extents need to have page cache
> -				 * lookups done to determine if they have data
> -				 * over them that needs zeroing. If there is no
> -				 * data, we'll get -ENOENT returned here, so we
> -				 * can just skip over this index.
> -				 */
> -				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> -				if (bytes > PAGE_SIZE - offset_in_page(pos))
> -					bytes = PAGE_SIZE - offset_in_page(pos);
> -				goto loop_continue;
> -			}
> +		if (status && status != -ENOENT)
>  			return status;
> -		}
> -		if (iter->iomap.flags & IOMAP_F_STALE)
> -			break;
>  
> -		offset = offset_in_folio(folio, pos);
> -		if (bytes > folio_size(folio) - offset)
> -			bytes = folio_size(folio) - offset;
> +		if (status == -ENOENT) {
> +			/*
> +			 * If we end up here, we did not find a folio in the
> +			 * page cache for an unwritten extent and thus can
> +			 * skip over the range.
> +			 */
> +			if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN))
> +				return -EIO;
>  
> -		/*
> -		 * If the folio over an unwritten extent is clean (i.e. because
> -		 * it has been read from), then it already contains zeros. Hence
> -		 * we can just skip it.
> -		 */
> -		if (srcmap->type == IOMAP_UNWRITTEN &&
> -		    !folio_test_dirty(folio)) {
> -			folio_unlock(folio);
> -			goto loop_continue;
> +			/*
> +			 * XXX: It would be nice if we could get the offset of
> +			 * the next entry in the pagecache so that we don't have
> +			 * to iterate one page at a time here.
> +			 */
> +			offset = offset_in_page(pos);
> +			if (bytes > PAGE_SIZE - offset)
> +				bytes = PAGE_SIZE - offset;

Why is it PAGE_SIZE here and not folio_size() like below?

(I know you're just copying the existing code; I'm merely wondering if
this is some minor bug.)

--D

> +		} else {
> +			if (iter->iomap.flags & IOMAP_F_STALE)
> +				break;
> +
> +			offset = offset_in_folio(folio, pos);
> +			if (bytes > folio_size(folio) - offset)
> +				bytes = folio_size(folio) - offset;
> +		
> +			/*
> +			 * If the folio over an unwritten extent is clean (i.e.
> +			 * because it has only been read from), then it already
> +			 * contains zeros.  Hence we can just skip it.
> +			 */
> +			if (srcmap->type == IOMAP_UNWRITTEN &&
> +			    !folio_test_dirty(folio)) {
> +				folio_unlock(folio);
> +				status = -ENOENT;
> +			}
>  		}
>  
> -		folio_zero_range(folio, offset, bytes);
> -		folio_mark_accessed(folio);
> +		if (status != -ENOENT) {
> +			folio_zero_range(folio, offset, bytes);
> +			folio_mark_accessed(folio);
>  
> -		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> -		__iomap_put_folio(iter, pos, bytes, folio);
> -		if (WARN_ON_ONCE(!ret))
> -			return -EIO;
> +			ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> +			__iomap_put_folio(iter, pos, bytes, folio);
> +			if (WARN_ON_ONCE(!ret))
> +				return -EIO;
> +		}
>  
> -loop_continue:
>  		pos += bytes;
>  		length -= bytes;
>  		written += bytes;
> 


* Re: [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder
  2024-05-29  9:52 ` [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder Zhang Yi
  2024-05-31 12:35   ` Christoph Hellwig
@ 2024-05-31 14:04   ` Darrick J. Wong
  1 sibling, 0 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 14:04 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Wed, May 29, 2024 at 05:52:00PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> Add a new helper rem_u64() to only get the remainder of unsigned 64bit
> divide with 32bit divisor.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>

Modulo hch's comments,
Reviewed-by: Darrick J. Wong <djwong@kernel.org>

--D

> ---
>  include/linux/math64.h | 24 ++++++++++++++++++++++++
>  1 file changed, 24 insertions(+)
> 
> diff --git a/include/linux/math64.h b/include/linux/math64.h
> index d34def7f9a8c..618df4862091 100644
> --- a/include/linux/math64.h
> +++ b/include/linux/math64.h
> @@ -3,6 +3,7 @@
>  #define _LINUX_MATH64_H
>  
>  #include <linux/types.h>
> +#include <linux/log2.h>
>  #include <linux/math.h>
>  #include <asm/div64.h>
>  #include <vdso/math64.h>
> @@ -12,6 +13,20 @@
>  #define div64_long(x, y) div64_s64((x), (y))
>  #define div64_ul(x, y)   div64_u64((x), (y))
>  
> +/**
> + * rem_u64 - remainder of unsigned 64bit divide with 32bit divisor
> + * @dividend: unsigned 64bit dividend
> + * @divisor: unsigned 32bit divisor
> + *
> + * Return: dividend % divisor
> + */
> +static inline u32 rem_u64(u64 dividend, u32 divisor)
> +{
> +	if (is_power_of_2(divisor))
> +		return dividend & (divisor - 1);
> +	return dividend % divisor;
> +}
> +
>  /**
>   * div_u64_rem - unsigned 64bit divide with 32bit divisor with remainder
>   * @dividend: unsigned 64bit dividend
> @@ -86,6 +101,15 @@ static inline s64 div64_s64(s64 dividend, s64 divisor)
>  #define div64_long(x, y) div_s64((x), (y))
>  #define div64_ul(x, y)   div_u64((x), (y))
>  
> +#ifndef rem_u64
> +static inline u32 rem_u64(u64 dividend, u32 divisor)
> +{
> +	if (is_power_of_2(divisor))
> +		return dividend & (divisor - 1);
> +	return do_div(dividend, divisor);
> +}
> +#endif
> +
>  #ifndef div_u64_rem
>  static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
>  {
> -- 
> 2.39.2
> 
> 


* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-31 14:03     ` Darrick J. Wong
@ 2024-05-31 14:05       ` Christoph Hellwig
  2024-05-31 15:44         ` Brian Foster
  2024-05-31 15:43       ` Brian Foster
  1 sibling, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 14:05 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Zhang Yi, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, jack, willy, yi.zhang,
	chengzhihao1, yukuai3

On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote:
> > +			/*
> > +			 * XXX: It would be nice if we could get the offset of
> > +			 * the next entry in the pagecache so that we don't have
> > +			 * to iterate one page at a time here.
> > +			 */
> > +			offset = offset_in_page(pos);
> > +			if (bytes > PAGE_SIZE - offset)
> > +				bytes = PAGE_SIZE - offset;
> 
> Why is it PAGE_SIZE here and not folio_size() like below?
> 
> (I know you're just copying the existing code; I'm merely wondering if
> this is some minor bug.)

See the comment just above :)


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 7/8] xfs: reserve blocks for truncating realtime inode
  2024-05-31 12:42   ` Christoph Hellwig
@ 2024-05-31 14:10     ` Darrick J. Wong
  2024-05-31 14:13       ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 14:10 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 05:42:37AM -0700, Christoph Hellwig wrote:
> > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > +	resblks = XFS_IS_REALTIME_INODE(ip) ? XFS_DIOSTRAT_SPACE_RES(mp, 0) : 0;
> 
> This probably wants a comment explaining that we need the block
> reservation for bmap btree block allocations / splits that can happen
> because we can split a written extent into one written and one
> unwritten, while for the data fork we'll always just shorten or
> remove extents.

"for the data fork"? <confused>

This always runs on the data fork.  Did you mean "for files with alloc
unit > 1 fsblock"?

> I'd also find this more readable if resblks was initialized to 0,
> and this became a:
> 
> 	if (XFS_IS_REALTIME_INODE(ip))
> 		resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);

Agreed.

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize
  2024-05-31 13:46   ` Christoph Hellwig
@ 2024-05-31 14:12     ` Darrick J. Wong
  2024-05-31 14:15       ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 14:12 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 06:46:10AM -0700, Christoph Hellwig wrote:
> > +/*
> > + * Decide if this file is a realtime file whose data allocation unit is larger
> > + * than default.
> > + */
> > +static inline bool xfs_inode_has_hugertalloc(struct xfs_inode *ip)
> > +{
> > +	struct xfs_mount *mp = ip->i_mount;
> > +
> > +	return XFS_IS_REALTIME_INODE(ip) &&
> > +	       mp->m_sb.sb_rextsize > XFS_B_TO_FSB(mp, XFS_DFL_RTEXTSIZE);
> > +}
> 
> The default rtextsize is actually a single FSB unless we're on a striped
> volume in which case it is increased.
> 
> I'll take care of removing the unused and confusing XFS_DFL_RTEXTSIZE,
> but for this patch we'd need to know the trade-off of when to just
> convert to unwritten.  For single-fsb rtextents we obviously don't need
> any special action.  But do you see a slowdown when converting to
> unwritten for small > 1 rtextsizes?  Because if not we could just
> always use that code path, which would significantly simplify things
> and remove yet another code path to test.

There are <cough> some users that want 1G extents.

For the rest of us who don't live in the stratosphere, it's convenient
for fsdax to have rt extents that match the PMD size, which could be
large on arm64 (e.g. 512M, or two smr sectors).

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 7/8] xfs: reserve blocks for truncating realtime inode
  2024-05-31 14:10     ` Darrick J. Wong
@ 2024-05-31 14:13       ` Christoph Hellwig
  2024-05-31 15:29         ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 14:13 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Zhang Yi, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, jack, willy, yi.zhang,
	chengzhihao1, yukuai3

On Fri, May 31, 2024 at 07:10:00AM -0700, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 05:42:37AM -0700, Christoph Hellwig wrote:
> > > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > > +	resblks = XFS_IS_REALTIME_INODE(ip) ? XFS_DIOSTRAT_SPACE_RES(mp, 0) : 0;
> > 
> > This probably wants a comment explaining that we need the block
> > reservation for bmap btree block allocations / splits that can happen
> > because we can split a written extent into one written and one
> > unwritten, while for the data fork we'll always just shorten or
> > remove extents.
> 
> "for the data fork"? <confused>
> 
> This always runs on the data fork.  Did you mean "for files with alloc
> unit > 1 fsblock"?

Sorry, it was meant to say for the data device.  My whole journey
to check if this could get called for the attr fork twisted my mind.
But you have a good point that even for the rt device we only need
the reservation for an rtextsize > 1.


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize
  2024-05-31 14:12     ` Darrick J. Wong
@ 2024-05-31 14:15       ` Christoph Hellwig
  2024-05-31 15:00         ` Darrick J. Wong
  0 siblings, 1 reply; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 14:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Zhang Yi, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, jack, willy, yi.zhang,
	chengzhihao1, yukuai3

On Fri, May 31, 2024 at 07:12:10AM -0700, Darrick J. Wong wrote:
> There are <cough> some users that want 1G extents.
> 
> For the rest of us who don't live in the stratosphere, it's convenient
> for fsdax to have rt extents that match the PMD size, which could be
> large on arm64 (e.g. 512M, or two smr sectors).

That's fine.  Let me rephrase my question.  With this series we
have 3 different truncate paths:

 1) unmap all blocks (!rt || rtextsize == 1)
 2) zero leftover blocks in an rtextent (small rtextsize, but > 1)
 3) convert leftover blocks in an rtextent to unwritten (large
    rtextsize)

What is the right threshold to switch between 2 and 3?  And do we
really need 2) at all?


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize
  2024-05-31 14:15       ` Christoph Hellwig
@ 2024-05-31 15:00         ` Darrick J. Wong
  2024-06-04  7:09           ` Zhang Yi
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 15:00 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 07:15:34AM -0700, Christoph Hellwig wrote:
> On Fri, May 31, 2024 at 07:12:10AM -0700, Darrick J. Wong wrote:
> > There are <cough> some users that want 1G extents.
> > 
> > For the rest of us who don't live in the stratosphere, it's convenient
> > for fsdax to have rt extents that match the PMD size, which could be
> > large on arm64 (e.g. 512M, or two smr sectors).
> 
> That's fine.  Let me rephrase my question.  With this series we
> have 3 different truncate paths:
> 
>  1) unmap all blocks (!rt || rtextsize == 1)
>  2) zero leftover blocks in an rtextent (small rtextsize, but > 1)
>  3) convert leftover blocks in an rtextent to unwritten (large
>     rtextsize)
> 
> What is the right threshold to switch between 2 and 3?  And do we
> really need 2) at all?

I don't think we need (2) at all.

There's likely some threshold below where it's a wash -- compare with
ext4 strategy of trying to write 64k chunks even if that requires
zeroing pagecache to cut down on fragmentation on hdds -- but I don't
know if we care anymore. ;)

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-31 13:31   ` Christoph Hellwig
@ 2024-05-31 15:27     ` Darrick J. Wong
  2024-05-31 16:17       ` Christoph Hellwig
  2024-06-03 13:51       ` Zhang Yi
  0 siblings, 2 replies; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 15:27 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 06:31:36AM -0700, Christoph Hellwig wrote:
> > +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
> 
> Maybe need_writeback would be a better name for the variable?  Also no
> need to initialize it to false at declaration time if it is
> unconditionally set here.

This variable captures whether or not we need to write dirty file tail
data because we're extending the ondisk EOF, right?

I don't really like long names like any good 1980s C programmer, but
maybe we should name this something like "extending_ondisk_eof"?

	if (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)
		extending_ondisk_eof = true;

	...

	if (did_zeroing || extending_ondisk_eof)
		filemap_write_and_wait_range(...);

Hm?

> > +		/*
> > +		 * Updating i_size after writing back to make sure the zeroed
> > +		 * blocks could been written out, and drop all the page cache
> > +		 * range that beyond blocksize aligned new EOF block.
> > +		 *
> > +		 * We've already locked out new page faults, so now we can
> > +		 * safely remove pages from the page cache knowing they won't
> > +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the

And can we correct the comment here too?

"...until we drop XFS_MMAPLOCK_EXCL after the extent manipulations..."

--D

> > +		 * extent manipulations are complete.
> > +		 */
> > +		i_size_write(inode, newsize);
> > +		truncate_pagecache(inode, roundup_64(newsize, blocksize));
> 
> Any reason this open codes truncate_setsize()?
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 7/8] xfs: reserve blocks for truncating realtime inode
  2024-05-31 14:13       ` Christoph Hellwig
@ 2024-05-31 15:29         ` Darrick J. Wong
  2024-05-31 16:17           ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 15:29 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 07:13:05AM -0700, Christoph Hellwig wrote:
> On Fri, May 31, 2024 at 07:10:00AM -0700, Darrick J. Wong wrote:
> > On Fri, May 31, 2024 at 05:42:37AM -0700, Christoph Hellwig wrote:
> > > > -	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> > > > +	resblks = XFS_IS_REALTIME_INODE(ip) ? XFS_DIOSTRAT_SPACE_RES(mp, 0) : 0;
> > > 
> > > This probably wants a comment explaining that we need the block
> > > reservation for bmap btree block allocations / splits that can happen
> > > because we can split a written extent into one written and one
> > > unwritten, while for the data fork we'll always just shorten or
> > > remove extents.
> > 
> > "for the data fork"? <confused>
> > 
> > This always runs on the data fork.  Did you mean "for files with alloc
> > unit > 1 fsblock"?
> 
> Sorry, it was meant to say for the data device.  My whole journey
> to check if this could get called for the attr fork twisted my mind.

I really hope not -- all writes to the attr fork have known sizes at
syscall time, and appending doesn't even make sense.

> But you have a good point that even for the rt device we only need
> the reservation for an rtextsize > 1.

<nod>

--D

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-31 14:03     ` Darrick J. Wong
  2024-05-31 14:05       ` Christoph Hellwig
@ 2024-05-31 15:43       ` Brian Foster
  1 sibling, 0 replies; 47+ messages in thread
From: Brian Foster @ 2024-05-31 15:43 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Zhang Yi, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, jack, willy, yi.zhang,
	chengzhihao1, yukuai3

On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote:
> > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > > XXX: how do we detect a iomap containing a cow mapping over a hole
> > > in iomap_zero_iter()? The XFS code implies this case also needs to
> > > zero the page cache if there is data present, so trigger for page
> > > cache lookup only in iomap_zero_iter() needs to handle this case as
> > > well.
> > 
> > If there is no data in the page cache and either a whole or unwritten
> > extent it really should not matter what is in the COW fork, a there
> > obviously isn't any data we could zero.
> > 
> > If there is data in the page cache for something that is marked as
> > a hole in the srcmap, but we have data in the COW fork due to
> > COW extsize preallocation we'd need to zero it, but as the
> > xfs iomap ops don't return a separate srcmap for that case we
> > should be fine.  Or am I missing something?
> 
> It might be useful to skip the scan for dirty pagecache if both forks
> have holes, since (in theory) that's never possible on xfs.
> 
> OTOH maybe there are filesystems that allow dirty pagecache over a hole?
> 

IIRC there was a case where dirty page cache can exist over what is
reported to zero range as a hole. I want to say it was something like a
COW prealloc over a data fork hole followed by a buffered write and then
a zero range, but I don't recall the details. That is all something that
should be fixed on the lookup side anyway.

Brian

> > > + * Note: when zeroing unwritten extents, we might have data in the page cache
> > > + * over an unwritten extent. In this case, we want to do a pure lookup on the
> > > + * page cache and not create a new folio as we don't need to perform zeroing on
> > > + * unwritten extents if there is no cached data over the given range.
> > >   */
> > >  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len)
> > >  {
> > >  	fgf_t fgp = FGP_WRITEBEGIN | FGP_NOFS;
> > >  
> > > +	if (iter->flags & IOMAP_ZERO) {
> > > +		const struct iomap *srcmap = iomap_iter_srcmap(iter);
> > > +
> > > +		if (srcmap->type == IOMAP_UNWRITTEN)
> > > +			fgp &= ~FGP_CREAT;
> > > +	}
> > 
> > Nit:  The comment would probably stand out a little better if it was
> > right next to the IOMAP_ZERO conditional instead of above the
> > function.
> 
> Agreed.
> 
> > > +		if (status) {
> > > +			if (status == -ENOENT) {
> > > +				/*
> > > +				 * Unwritten extents need to have page cache
> > > +				 * lookups done to determine if they have data
> > > +				 * over them that needs zeroing. If there is no
> > > +				 * data, we'll get -ENOENT returned here, so we
> > > +				 * can just skip over this index.
> > > +				 */
> > > +				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> > 
> > I'd return -EIO if the WARN_ON triggers.
> > 
> > > +loop_continue:
> > 
> > While I'm no stranger to gotos for loop control, something trips me
> > up about jumping to the end of the loop.  Here is what I could come
> > up with instead.  Not arguing it's objectively better, but I somehow
> > like it a little better:
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index 700b22d6807783..81378f7cd8d7ff 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1412,49 +1412,56 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
> >  		bool ret;
> >  
> >  		status = iomap_write_begin(iter, pos, bytes, &folio);
> > -		if (status) {
> > -			if (status == -ENOENT) {
> > -				/*
> > -				 * Unwritten extents need to have page cache
> > -				 * lookups done to determine if they have data
> > -				 * over them that needs zeroing. If there is no
> > -				 * data, we'll get -ENOENT returned here, so we
> > -				 * can just skip over this index.
> > -				 */
> > -				WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN);
> > -				if (bytes > PAGE_SIZE - offset_in_page(pos))
> > -					bytes = PAGE_SIZE - offset_in_page(pos);
> > -				goto loop_continue;
> > -			}
> > +		if (status && status != -ENOENT)
> >  			return status;
> > -		}
> > -		if (iter->iomap.flags & IOMAP_F_STALE)
> > -			break;
> >  
> > -		offset = offset_in_folio(folio, pos);
> > -		if (bytes > folio_size(folio) - offset)
> > -			bytes = folio_size(folio) - offset;
> > +		if (status == -ENOENT) {
> > +			/*
> > +			 * If we end up here, we did not find a folio in the
> > +			 * page cache for an unwritten extent and thus can
> > +			 * skip over the range.
> > +			 */
> > +			if (WARN_ON_ONCE(srcmap->type != IOMAP_UNWRITTEN))
> > +				return -EIO;
> >  
> > -		/*
> > -		 * If the folio over an unwritten extent is clean (i.e. because
> > -		 * it has been read from), then it already contains zeros. Hence
> > -		 * we can just skip it.
> > -		 */
> > -		if (srcmap->type == IOMAP_UNWRITTEN &&
> > -		    !folio_test_dirty(folio)) {
> > -			folio_unlock(folio);
> > -			goto loop_continue;
> > +			/*
> > +			 * XXX: It would be nice if we could get the offset of
> > +			 * the next entry in the pagecache so that we don't have
> > +			 * to iterate one page at a time here.
> > +			 */
> > +			offset = offset_in_page(pos);
> > +			if (bytes > PAGE_SIZE - offset)
> > +				bytes = PAGE_SIZE - offset;
> 
> Why is it PAGE_SIZE here and not folio_size() like below?
> 
> (I know you're just copying the existing code; I'm merely wondering if
> this is some minor bug.)
> 
> --D
> 
> > +		} else {
> > +			if (iter->iomap.flags & IOMAP_F_STALE)
> > +				break;
> > +
> > +			offset = offset_in_folio(folio, pos);
> > +			if (bytes > folio_size(folio) - offset)
> > +				bytes = folio_size(folio) - offset;
> > +		
> > +			/*
> > +			 * If the folio over an unwritten extent is clean (i.e.
> > +			 * because it has only been read from), then it already
> > +			 * contains zeros.  Hence we can just skip it.
> > +			 */
> > +			if (srcmap->type == IOMAP_UNWRITTEN &&
> > +			    !folio_test_dirty(folio)) {
> > +				folio_unlock(folio);
> > +				status = -ENOENT;
> > +			}
> >  		}
> >  
> > -		folio_zero_range(folio, offset, bytes);
> > -		folio_mark_accessed(folio);
> > +		if (status != -ENOENT) {
> > +			folio_zero_range(folio, offset, bytes);
> > +			folio_mark_accessed(folio);
> >  
> > -		ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> > -		__iomap_put_folio(iter, pos, bytes, folio);
> > -		if (WARN_ON_ONCE(!ret))
> > -			return -EIO;
> > +			ret = iomap_write_end(iter, pos, bytes, bytes, folio);
> > +			__iomap_put_folio(iter, pos, bytes, folio);
> > +			if (WARN_ON_ONCE(!ret))
> > +				return -EIO;
> > +		}
> >  
> > -loop_continue:
> >  		pos += bytes;
> >  		length -= bytes;
> >  		written += bytes;
> > 
> 


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-29  9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
  2024-05-31 13:31   ` Christoph Hellwig
@ 2024-05-31 15:44   ` Darrick J. Wong
  2024-06-03 14:15     ` Zhang Yi
  2024-06-02 22:46   ` Dave Chinner
  2 siblings, 1 reply; 47+ messages in thread
From: Darrick J. Wong @ 2024-05-31 15:44 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When truncating down an inode, we call xfs_truncate_page() to zero out
> the tail partial block that is beyond the new EOF, which prevents
> exposing stale data. But xfs_truncate_page() always assumes the
> blocksize is i_blocksize(inode), which is not always true: if a file
> has a large allocation unit, we should align to that unit size, e.g. a
> realtime inode should be aligned to the rtextsize.
> 
> The current xfs_setattr_size() can't support zeroing out a large
> alignment size on truncate down since the processing order is wrong. We
> first zero out through xfs_truncate_page(), and then update the inode
> size through truncate_setsize() immediately. If the zeroed range is
> larger than a folio, the writeback path would not write back zeroed
> pagecache beyond the EOF folio, so it doesn't write zeroes to the
> entire tail extent and could expose stale data after an appending write
> into the next aligned extent.
> 
> We need to adjust the order to zero out the tail aligned blocks, write
> back zeroed or cached data, update i_size and drop the page cache
> beyond the aligned EOF block, preparing for the realtime inode fix and
> the upcoming forced alignment feature.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
>  fs/xfs/xfs_iomap.c |   2 +-
>  fs/xfs/xfs_iomap.h |   3 +-
>  fs/xfs/xfs_iops.c  | 107 ++++++++++++++++++++++++++++-----------------
>  3 files changed, 69 insertions(+), 43 deletions(-)
> 
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 8cdfcbb5baa7..0369b64cc3f4 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1468,10 +1468,10 @@ int
>  xfs_truncate_page(
>  	struct xfs_inode	*ip,
>  	loff_t			pos,
> +	unsigned int		blocksize,
>  	bool			*did_zero)
>  {
>  	struct inode		*inode = VFS_I(ip);
> -	unsigned int		blocksize = i_blocksize(inode);
>  
>  	if (IS_DAX(inode))
>  		return dax_truncate_page(inode, pos, blocksize, did_zero,
> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
> index 4da13440bae9..feb1610cb645 100644
> --- a/fs/xfs/xfs_iomap.h
> +++ b/fs/xfs/xfs_iomap.h
> @@ -25,7 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
>  
>  int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
>  		bool *did_zero);
> -int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
> +int xfs_truncate_page(struct xfs_inode *ip, loff_t pos,
> +		unsigned int blocksize, bool *did_zero);
>  
>  static inline xfs_filblks_t
>  xfs_aligned_fsb_count(
> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
> index d44508930b67..d24927075022 100644
> --- a/fs/xfs/xfs_iops.c
> +++ b/fs/xfs/xfs_iops.c
> @@ -812,6 +812,7 @@ xfs_setattr_size(
>  	int			error;
>  	uint			lock_flags = 0;
>  	bool			did_zeroing = false;
> +	bool			write_back = false;
>  
>  	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
>  	ASSERT(S_ISREG(inode->i_mode));
> @@ -853,30 +854,7 @@ xfs_setattr_size(
>  	 * the transaction because the inode cannot be unlocked once it is a
>  	 * part of the transaction.
>  	 *
> -	 * Start with zeroing any data beyond EOF that we may expose on file
> -	 * extension, or zeroing out the rest of the block on a downward
> -	 * truncate.
> -	 */
> -	if (newsize > oldsize) {
> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> -				&did_zeroing);
> -	} else if (newsize != oldsize) {
> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
> -	}
> -
> -	if (error)
> -		return error;
> -
> -	/*
> -	 * We've already locked out new page faults, so now we can safely remove
> -	 * pages from the page cache knowing they won't get refaulted until we
> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
> -	 * complete. The truncate_setsize() call also cleans partial EOF page
> -	 * PTEs on extending truncates and hence ensures sub-page block size
> -	 * filesystems are correctly handled, too.
> -	 *
> -	 * We have to do all the page cache truncate work outside the
> +	 * And we have to do all the page cache truncate work outside the

Style nit: don't start a paragraph with "and".

>  	 * transaction context as the "lock" order is page lock->log space
>  	 * reservation as defined by extent allocation in the writeback path.
>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
> @@ -884,27 +862,74 @@ xfs_setattr_size(
>  	 * user visible changes). There's not much we can do about this, except
>  	 * to hope that the caller sees ENOMEM and retries the truncate
>  	 * operation.
> -	 *
> -	 * And we update in-core i_size and truncate page cache beyond newsize
> -	 * before writeback the [i_disk_size, newsize] range, so we're
> -	 * guaranteed not to write stale data past the new EOF on truncate down.
>  	 */
> -	truncate_setsize(inode, newsize);
> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
> +	if (newsize < oldsize) {
> +		unsigned int blocksize = i_blocksize(inode);
>  
> -	/*
> -	 * We are going to log the inode size change in this transaction so
> -	 * any previous writes that are beyond the on disk EOF and the new
> -	 * EOF that have not been written out need to be written here.  If we
> -	 * do not write the data out, we expose ourselves to the null files
> -	 * problem. Note that this includes any block zeroing we did above;
> -	 * otherwise those blocks may not be zeroed after a crash.
> -	 */
> -	if (did_zeroing ||
> -	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
> -		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
> -						ip->i_disk_size, newsize - 1);
> +		/*
> +		 * Zeroing out the partial EOF block and the rest of the extra
> +		 * aligned blocks on a downward truncate.
> +		 */
> +		error = xfs_truncate_page(ip, newsize, blocksize, &did_zeroing);
>  		if (error)
>  			return error;
> +
> +		/*
> +		 * We are going to log the inode size change in this transaction
> +		 * so any previous writes that are beyond the on disk EOF and
> +		 * the new EOF that have not been written out need to be written
> +		 * here.  If we do not write the data out, we expose ourselves
> +		 * to the null files problem. Note that this includes any block
> +		 * zeroing we did above; otherwise those blocks may not be
> +		 * zeroed after a crash.
> +		 */
> +		if (did_zeroing || write_back) {
> +			error = filemap_write_and_wait_range(inode->i_mapping,
> +					min_t(loff_t, ip->i_disk_size, newsize),
> +					roundup_64(newsize, blocksize) - 1);
> +			if (error)
> +				return error;
> +		}
> +
> +		/*
> +		 * Updating i_size after writing back to make sure the zeroed

"Update the incore i_size after flushing dirty tail pages to disk, and
drop all the pagecache beyond the allocation unit containing EOF." ?

> +		 * blocks could been written out, and drop all the page cache
> +		 * range that beyond blocksize aligned new EOF block.
> +		 *
> +		 * We've already locked out new page faults, so now we can
> +		 * safely remove pages from the page cache knowing they won't
> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
> +		 * extent manipulations are complete.
> +		 */
> +		i_size_write(inode, newsize);
> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));

I'm not sure why we need to preserve the pagecache beyond EOF after
having zeroed and written the post-eof blocks out to disk, but I'm
guessing this is why you open-code truncate_setsize?

> +	} else {
> +		/*
> +		 * Start with zeroing any data beyond EOF that we may expose on
> +		 * file extension.
> +		 */
> +		if (newsize > oldsize) {
> +			trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> +			error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> +					       &did_zeroing);
> +			if (error)
> +				return error;
> +		}
> +
> +		/*
> +		 * The truncate_setsize() call also cleans partial EOF page
> +		 * PTEs on extending truncates and hence ensures sub-page block
> +		 * size filesystems are correctly handled, too.
> +		 */
> +		truncate_setsize(inode, newsize);
> +
> +		if (did_zeroing || write_back) {
> +			error = filemap_write_and_wait_range(inode->i_mapping,
> +					ip->i_disk_size, newsize - 1);
> +			if (error)
> +				return error;
> +		}
>  	}

At this point I wonder if these three truncate cases (down, up, and
unchanged) should just be broken out into three helpers without so much
twisty logic.

xfs_setattr_truncate_down():
	xfs_truncate_page(..., &did_zeroing);

	if (did_zeroing || extending_ondisk_eof)
		filemap_write_and_wait_range(...);

	truncate_setsize(...); /* or your opencoded version */

xfs_setattr_truncate_up():
	xfs_zero_range(..., &did_zeroing);

	truncate_setsize(...);

	if (did_zeroing || extending_ondisk_eof)
		filemap_write_and_wait_range(...);

xfs_setattr_truncate_unchanged():
	truncate_setsize(...);

	if (extending_ondisk_eof)
		filemap_write_and_wait_range(...);

So then the callsite becomes:

	if (newsize > oldsize)
		xfs_settattr_truncate_up();
	else if (newsize < oldsize)
		xfs_setattr_truncate_down();
	else
		xfs_setattr_truncate_unchanged();

But, I dunno.  Most of the code is really just extensive commenting.

--D

>  
>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
> -- 
> 2.39.2
> 
> 

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-31 14:05       ` Christoph Hellwig
@ 2024-05-31 15:44         ` Brian Foster
  0 siblings, 0 replies; 47+ messages in thread
From: Brian Foster @ 2024-05-31 15:44 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Darrick J. Wong, Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel,
	brauner, david, chandanbabu, jack, willy, yi.zhang, chengzhihao1,
	yukuai3

On Fri, May 31, 2024 at 07:05:38AM -0700, Christoph Hellwig wrote:
> On Fri, May 31, 2024 at 07:03:58AM -0700, Darrick J. Wong wrote:
> > > +			/*
> > > +			 * XXX: It would be nice if we could get the offset of
> > > +			 * the next entry in the pagecache so that we don't have
> > > +			 * to iterate one page at a time here.
> > > +			 */
> > > +			offset = offset_in_page(pos);
> > > +			if (bytes > PAGE_SIZE - offset)
> > > +				bytes = PAGE_SIZE - offset;
> > 
> > Why is it PAGE_SIZE here and not folio_size() like below?
> > 
> > (I know you're just copying the existing code; I'm merely wondering if
> > this is some minor bug.)
> 
> See the comment just above :)
> 
> 

FWIW, something like the following is pretty slow with the current
implementation on a quick test:

  xfs_io -fc "falloc -k 0 1t" -c "pwrite 1000g 4k" <file>

... so I'd think you'd want some kind of data seek or something to more
efficiently process the range.

Brian


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-31 15:27     ` Darrick J. Wong
@ 2024-05-31 16:17       ` Christoph Hellwig
  2024-06-03 13:51       ` Zhang Yi
  1 sibling, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 16:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Zhang Yi, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, jack, willy, yi.zhang,
	chengzhihao1, yukuai3

On Fri, May 31, 2024 at 08:27:32AM -0700, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 06:31:36AM -0700, Christoph Hellwig wrote:
> > > +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
> > 
> > Maybe need_writeback would be a better name for the variable?  Also no
> > need to initialize it to false at declaration time if it is
> > unconditionally set here.
> 
> This variable captures whether or not we need to write dirty file tail
> data because we're extending the ondisk EOF, right?

Yes.

> I don't really like long names like any good 1980s C programmer, but
> maybe we should name this something like "extending_ondisk_eof"?

Sure.

^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 7/8] xfs: reserve blocks for truncating realtime inode
  2024-05-31 15:29         ` Darrick J. Wong
@ 2024-05-31 16:17           ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-05-31 16:17 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: Christoph Hellwig, Zhang Yi, linux-xfs, linux-fsdevel,
	linux-kernel, brauner, david, chandanbabu, jack, willy, yi.zhang,
	chengzhihao1, yukuai3

On Fri, May 31, 2024 at 08:29:22AM -0700, Darrick J. Wong wrote:
> > 
> > Sorry, it was meant to say for the data device.  My whole journey
> > to check if this could get called for the attr fork twisted my mind.
> 
> I really hope not -- all writes to the attr fork have known sizes at
> syscall time, and appending doesn't even make sense.

It's obviously not.  I just had to go out and actually read the code
before commenting.



* Re: [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes
  2024-05-31 12:26 ` [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Christoph Hellwig
@ 2024-06-01  7:38   ` Zhang Yi
  2024-06-01  7:40     ` Christoph Hellwig
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang Yi @ 2024-06-01  7:38 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/5/31 20:26, Christoph Hellwig wrote:
> Procedural question before I get into the actual review:  given we
> are close to -rc3 and there is no user of the iomap change yet,
> should we revert that for 6.10 and instead try again in 6.11 when
> the XFS bits are sorted out?
> 

Okay, it looks like we still need some time to fix this issue.  I
will send out a patch to revert commit 943bc0882ceb ("iomap: don't
increase i_size if it's not a write operation") soon.  The other
commits in my previous series look harmless, so I think we can
keep them.

Thanks,
Yi.



* Re: [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes
  2024-06-01  7:38   ` Zhang Yi
@ 2024-06-01  7:40     ` Christoph Hellwig
  0 siblings, 0 replies; 47+ messages in thread
From: Christoph Hellwig @ 2024-06-01  7:40 UTC (permalink / raw)
  To: Zhang Yi
  Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-kernel, djwong,
	brauner, david, chandanbabu, jack, willy, yi.zhang, chengzhihao1,
	yukuai3

On Sat, Jun 01, 2024 at 03:38:37PM +0800, Zhang Yi wrote:
> will send out a patch to revert commit 943bc0882ceb ("iomap: don't
> increase i_size if it's not a write operation") soon.  The other
> commits in my previous series look harmless, so I think we can
> keep them.

Agreed and thanks!  For 6.11 we'll also need to figure out how to
keep the related changes together.  I suspect we should just re-merge
that iomap change through the XFS tree, but we can talk about that
later.



* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-29  9:51 ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Zhang Yi
  2024-05-31 13:11   ` Christoph Hellwig
@ 2024-06-02 11:04   ` Brian Foster
  2024-06-03  9:07     ` Zhang Yi
  1 sibling, 1 reply; 47+ messages in thread
From: Brian Foster @ 2024-06-02 11:04 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> From: Dave Chinner <dchinner@redhat.com>
> 
> Unwritten extents can have page cache data over the range being
> zeroed so we can't just skip them entirely. Fix this by checking for
> an existing dirty folio over the unwritten range we are zeroing
> and only performing zeroing if the folio is already dirty.
> 
> XXX: how do we detect a iomap containing a cow mapping over a hole
> in iomap_zero_iter()? The XFS code implies this case also needs to
> zero the page cache if there is data present, so trigger for page
> cache lookup only in iomap_zero_iter() needs to handle this case as
> well.
> 
> Before:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m14.103s
> user    0m0.015s
> sys     0m0.020s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  85.90    0.847616          16     50000           ftruncate
>  14.01    0.138229           2     50000           pwrite64
> ....
> 
> After:
> 
> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> 
> real    0m0.144s
> user    0m0.021s
> sys     0m0.012s
> 
> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> path /mnt/scratch/foo, 50000 iters
> % time     seconds  usecs/call     calls    errors syscall
> ------ ----------- ----------- --------- --------- ----------------
>  53.86    0.505964          10     50000           ftruncate
>  46.12    0.433251           8     50000           pwrite64
> ....
> 
> Yup, we get back all the performance.
> 
> As for the "mmap write beyond EOF" data exposure aspect
> documented here:
> 
> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
> 
> With this command:
> 
> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> 
> Before:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> XXXXXXXXXXXXXXXX
> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> XXXXXXXXXXXXXXXX
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
>    ops/sec
> 
> After:
> 
> wrote 1024/1024 bytes at offset 0
> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> wrote 4096/4096 bytes at offset 32768
> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ................
> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> ................
> read 32/32 bytes at offset 3072
> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
>    ops/sec)
> 
> We see that this post-eof unwritten extent dirty page zeroing is
> working correctly.
> 

I've pointed this out in the past, but IIRC this implementation is racy
vs. reclaim. Specifically, relying on folio lookup after the mapping
lookup doesn't take reclaim into account: if we look up an unwritten
mapping and a folio is then flushed and reclaimed by the time the scan
reaches that offset, the scan incorrectly treats that subrange as
already zero when it actually isn't (the extent is actually stale by
that point, but the stale extent check is skipped).

A simple example to demonstrate this is something like the following:

# looping truncate zeroing
while [ true ]; do
	xfs_io -fc "truncate 0" -c "falloc 0 32K" -c "pwrite 0 4k" -c "truncate 2k" <file>
	xfs_io -c "mmap 0 4k" -c "mread -v 2k 16" <file> | grep cd && break
done

vs.

# looping writeback and reclaim
while [ true ]; do
	xfs_io -c "sync_range -a 0 0" -c "fadvise -d 0 0" <file>
done

If I run that against this patch, the first loop eventually detects
stale data exposed past EOF.

Brian



* Re: [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page()
  2024-05-31 12:39   ` Christoph Hellwig
@ 2024-06-02 11:16     ` Brian Foster
  2024-06-03 13:23     ` Zhang Yi
  1 sibling, 0 replies; 47+ messages in thread
From: Brian Foster @ 2024-06-02 11:16 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 05:39:10AM -0700, Christoph Hellwig wrote:
> > -		const struct iomap_ops *ops)
> > +iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
> > +		bool *did_zero, const struct iomap_ops *ops)
> >  {
> > -	unsigned int blocksize = i_blocksize(inode);
> > -	unsigned int off = pos & (blocksize - 1);
> > +	unsigned int off = rem_u64(pos, blocksize);
> >  
> >  	/* Block boundary? Nothing to do */
> >  	if (!off)
> 
> Instead of passing yet another argument here, can we just kill
> iomap_truncate_page?
> 
> I.e. just open code the rem_u64 and 0 offset check in the only caller
> and call iomap_zero_range.  Same for the DAX variant and its two
> callers.
> 
> 

Hey Christoph,

I've wondered the same about killing off iomap_truncate_page(), but
JFYI, one of several prototypes I have around for other potential ways
to address this problem splits truncate page off from being a straight
zero range wrapper. A quick diff [2] of that is inlined below for
reference (only lightly tested, may be busted, etc.).

The idea is that IIRC truncate_page() was really the only zero range
user that might actually encounter dirty cache over unwritten mappings,
so given that and the typically constrained range size of truncate page,
just let it be more aggressive about bypassing the unwritten mapping
optimization in iomap_zero_iter(). Just something else to consider, and
this is definitely not something you'd want to do for zero range proper.

Brian

P.S., I think the last patches I actually posted around this were here
[1], but I also have multiple versions of that selective flush approach
ported to iomap instead of being XFS specific as well.

[1] https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
    https://lore.kernel.org/linux-xfs/20221128173945.3953659-1-bfoster@redhat.com/
[2] truncate page POC:

--- 8< ---

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index c5802a459334..a261e732ea05 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1380,7 +1380,8 @@ iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
 }
 EXPORT_SYMBOL_GPL(iomap_file_unshare);
 
-static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
+static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero,
+			      bool dirty_cache)
 {
 	const struct iomap *srcmap = iomap_iter_srcmap(iter);
 	loff_t pos = iter->pos;
@@ -1388,7 +1389,8 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
 	loff_t written = 0;
 
 	/* already zeroed?  we're done. */
-	if (srcmap->type == IOMAP_HOLE || srcmap->type == IOMAP_UNWRITTEN)
+	if (srcmap->type == IOMAP_HOLE ||
+	    (!dirty_cache && srcmap->type == IOMAP_UNWRITTEN))
 		return length;
 
 	do {
@@ -1439,7 +1441,7 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
 	int ret;
 
 	while ((ret = iomap_iter(&iter, ops)) > 0)
-		iter.processed = iomap_zero_iter(&iter, did_zero);
+		iter.processed = iomap_zero_iter(&iter, did_zero, false);
 	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_zero_range);
@@ -1450,11 +1452,29 @@ iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
 {
 	unsigned int blocksize = i_blocksize(inode);
 	unsigned int off = pos & (blocksize - 1);
+	struct iomap_iter iter = {
+		.inode		= inode,
+		.pos		= pos,
+		.len		= blocksize - off,
+		.flags		= IOMAP_ZERO,
+	};
+	loff_t end;
+	int ret;
+	bool dirty_cache = false;
 
 	/* Block boundary? Nothing to do */
 	if (!off)
 		return 0;
-	return iomap_zero_range(inode, pos, blocksize - off, did_zero, ops);
+
+	/* overwrite unwritten ranges if any part of the range is dirty for
+	 * truncate page */
+	end = iter.pos + iter.len - 1;
+	if (filemap_range_needs_writeback(inode->i_mapping, iter.pos, end))
+		dirty_cache = true;
+
+	while ((ret = iomap_iter(&iter, ops)) > 0)
+		iter.processed = iomap_zero_iter(&iter, did_zero, dirty_cache);
+	return ret;
 }
 EXPORT_SYMBOL_GPL(iomap_truncate_page);
 



* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-05-31 13:11   ` Christoph Hellwig
  2024-05-31 14:03     ` Darrick J. Wong
@ 2024-06-02 22:22     ` Dave Chinner
  1 sibling, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-06-02 22:22 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Fri, May 31, 2024 at 06:11:25AM -0700, Christoph Hellwig wrote:
> On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > XXX: how do we detect a iomap containing a cow mapping over a hole
> > in iomap_zero_iter()? The XFS code implies this case also needs to
> > zero the page cache if there is data present, so trigger for page
> > cache lookup only in iomap_zero_iter() needs to handle this case as
> > well.
> 
> If there is no data in the page cache and either a hole or unwritten
> extent, it really should not matter what is in the COW fork, as there
> obviously isn't any data we could zero.
> 
> If there is data in the page cache for something that is marked as
> a hole in the srcmap, but we have data in the COW fork due to
> COW extsize preallocation we'd need to zero it, but as the
> xfs iomap ops don't return a separate srcmap for that case we
> should be fine.  Or am I missing something?

If the data extent is a hole, xfs_buffered_write_iomap_begin()
doesn't even check the cow fork for extents if IOMAP_ZERO is being
done. Hence if there is a pending COW extent that extends over a
data fork hole (cow fork preallocation can do that, right?), then we
may have data in the page cache over an unwritten extent in the COW
fork.

This code:

	/* We never need to allocate blocks for zeroing or unsharing a hole. */
        if ((flags & (IOMAP_UNSHARE | IOMAP_ZERO)) &&
            imap.br_startoff > offset_fsb) {
                xfs_hole_to_iomap(ip, iomap, offset_fsb, imap.br_startoff);
                goto out_unlock;
        }

The comment, IMO, indicates the issue here:  we're not going to
allocate blocks in IOMAP_ZERO, but we do need to map anything that
might contain page cache data for the IOMAP_ZERO case. If "data
hole, COW unwritten, page cache dirty" can exist as the comment in
xfs_setattr_size() implies, then this code is broken and needs
fixing.

I don't know what that fix looks like yet - I suspect that all we
need to do for IOMAP_ZERO is to return the COW extent in the srcmap,
and then the zeroing code should do the right thing if it's an
unwritten COW extent...

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-29  9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
  2024-05-31 13:31   ` Christoph Hellwig
  2024-05-31 15:44   ` Darrick J. Wong
@ 2024-06-02 22:46   ` Dave Chinner
  2024-06-03 14:18     ` Zhang Yi
  2 siblings, 1 reply; 47+ messages in thread
From: Dave Chinner @ 2024-06-02 22:46 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
> 
> When truncating down an inode, we call xfs_truncate_page() to zero out
> the tail partial block beyond the new EOF, which prevents exposing
> stale data. But xfs_truncate_page() always assumes the blocksize is
> i_blocksize(inode), which is not true if the file has a large
> allocation unit that zeroing must be aligned to, e.g. a realtime
> inode should be aligned to the rtextsize.
> 
> The current xfs_setattr_size() can't zero out a large alignment size
> on truncate down because the processing order is wrong. We first zero
> out through xfs_truncate_page(), and then immediately update the inode
> size through truncate_setsize(). If the zeroed range is larger than a
> folio, the writeback path will not write back zeroed pagecache beyond
> the EOF folio, so it doesn't write zeroes to the entire tail extent and
> could expose stale data after an appending write into the next aligned
> extent.
> 
> We need to adjust the order to zero out tail aligned blocks, write back
> zeroed or cached data, update i_size, and drop cache beyond the aligned
> EOF block, preparing to fix the realtime inode case and to support the
> upcoming forced alignment feature.
> 
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
.....
> @@ -853,30 +854,7 @@ xfs_setattr_size(
>  	 * the transaction because the inode cannot be unlocked once it is a
>  	 * part of the transaction.
>  	 *
> -	 * Start with zeroing any data beyond EOF that we may expose on file
> -	 * extension, or zeroing out the rest of the block on a downward
> -	 * truncate.
> -	 */
> -	if (newsize > oldsize) {
> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
> -				&did_zeroing);
> -	} else if (newsize != oldsize) {
> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
> -	}
> -
> -	if (error)
> -		return error;
> -
> -	/*
> -	 * We've already locked out new page faults, so now we can safely remove
> -	 * pages from the page cache knowing they won't get refaulted until we
> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
> -	 * complete. The truncate_setsize() call also cleans partial EOF page
> -	 * PTEs on extending truncates and hence ensures sub-page block size
> -	 * filesystems are correctly handled, too.
> -	 *
> -	 * We have to do all the page cache truncate work outside the
> +	 * And we have to do all the page cache truncate work outside the
>  	 * transaction context as the "lock" order is page lock->log space
>  	 * reservation as defined by extent allocation in the writeback path.
>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
......

Lots of new logic for zeroing here. That makes xfs_setattr_size()
even longer than it already is. Can you lift this EOF zeroing logic
into its own helper function so that it is clear that it is a
completely independent operation from the actual transaction that
changes the inode size? That would also allow the operations to be
broken up into:

	if (newsize >= oldsize) {
		/* do the simple stuff */
		....
		return error;
	}
	/* do the complex size reduction stuff without additional indenting */

-Dave.

-- 
Dave Chinner
david@fromorbit.com


* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-06-02 11:04   ` Brian Foster
@ 2024-06-03  9:07     ` Zhang Yi
  2024-06-03 14:37       ` Brian Foster
  0 siblings, 1 reply; 47+ messages in thread
From: Zhang Yi @ 2024-06-03  9:07 UTC (permalink / raw)
  To: Brian Foster
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/6/2 19:04, Brian Foster wrote:
> On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
>> From: Dave Chinner <dchinner@redhat.com>
>>
>> Unwritten extents can have page cache data over the range being
>> zeroed so we can't just skip them entirely. Fix this by checking for
>> an existing dirty folio over the unwritten range we are zeroing
>> and only performing zeroing if the folio is already dirty.
>>
>> XXX: how do we detect a iomap containing a cow mapping over a hole
>> in iomap_zero_iter()? The XFS code implies this case also needs to
>> zero the page cache if there is data present, so trigger for page
>> cache lookup only in iomap_zero_iter() needs to handle this case as
>> well.
>>
>> Before:
>>
>> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>>
>> real    0m14.103s
>> user    0m0.015s
>> sys     0m0.020s
>>
>> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  85.90    0.847616          16     50000           ftruncate
>>  14.01    0.138229           2     50000           pwrite64
>> ....
>>
>> After:
>>
>> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>>
>> real    0m0.144s
>> user    0m0.021s
>> sys     0m0.012s
>>
>> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
>> path /mnt/scratch/foo, 50000 iters
>> % time     seconds  usecs/call     calls    errors syscall
>> ------ ----------- ----------- --------- --------- ----------------
>>  53.86    0.505964          10     50000           ftruncate
>>  46.12    0.433251           8     50000           pwrite64
>> ....
>>
>> Yup, we get back all the performance.
>>
>> As for the "mmap write beyond EOF" data exposure aspect
>> documented here:
>>
>> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
>>
>> With this command:
>>
>> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
>>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
>>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
>>
>> Before:
>>
>> wrote 1024/1024 bytes at offset 0
>> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
>> wrote 4096/4096 bytes at offset 32768
>> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
>> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> XXXXXXXXXXXXXXXX
>> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
>> XXXXXXXXXXXXXXXX
>> read 32/32 bytes at offset 3072
>> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
>>    ops/sec
>>
>> After:
>>
>> wrote 1024/1024 bytes at offset 0
>> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
>> wrote 4096/4096 bytes at offset 32768
>> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
>> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ................
>> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
>> ................
>> read 32/32 bytes at offset 3072
>> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
>>    ops/sec)
>>
>> We see that this post-eof unwritten extent dirty page zeroing is
>> working correctly.
>>
> 
> I've pointed this out in the past, but IIRC this implementation is racy
> vs. reclaim. Specifically, relying on folio lookup after the mapping
> lookup doesn't take reclaim into account: if we look up an unwritten
> mapping and a folio is then flushed and reclaimed by the time the scan
> reaches that offset, the scan incorrectly treats that subrange as
> already zero when it actually isn't (the extent is actually stale by
> that point, but the stale extent check is skipped).
> 

Hello, Brian!

I'm confused, how could that happen? We do the stale check under the
folio lock: if the folio is flushed and reclaimed before we get and
lock it in iomap_zero_iter()->iomap_write_begin(), ->iomap_valid()
would catch the stale mapping and we would zero again in the next
iteration. Am I missing something?

Thanks,
Yi.

> A simple example to demonstrate this is something like the following:
> 
> # looping truncate zeroing
> while [ true ]; do
> 	xfs_io -fc "truncate 0" -c "falloc 0 32K" -c "pwrite 0 4k" -c "truncate 2k" <file>
> 	xfs_io -c "mmap 0 4k" -c "mread -v 2k 16" <file> | grep cd && break
> done
> 
> vs.
> 
> # looping writeback and reclaim
> while [ true ]; do
> 	xfs_io -c "sync_range -a 0 0" -c "fadvise -d 0 0" <file>
> done
> 
> If I run that against this patch, the first loop eventually detects
> stale data exposed past EOF.
> 
> Brian
> 



* Re: [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page()
  2024-05-31 12:39   ` Christoph Hellwig
  2024-06-02 11:16     ` Brian Foster
@ 2024-06-03 13:23     ` Zhang Yi
  1 sibling, 0 replies; 47+ messages in thread
From: Zhang Yi @ 2024-06-03 13:23 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/5/31 20:39, Christoph Hellwig wrote:
>> -		const struct iomap_ops *ops)
>> +iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
>> +		bool *did_zero, const struct iomap_ops *ops)
>>  {
>> -	unsigned int blocksize = i_blocksize(inode);
>> -	unsigned int off = pos & (blocksize - 1);
>> +	unsigned int off = rem_u64(pos, blocksize);
>>  
>>  	/* Block boundary? Nothing to do */
>>  	if (!off)
> 
> Instead of passing yet another argument here, can we just kill
> iomap_truncate_page?
> 
> I.e. just open code the rem_u64 and 0 offset check in the only caller
> and call iomap_zero_range.  Same for the DAX variant and its two
> callers.
> 

Yeah, we could drop iomap_truncate_page() and dax_truncate_page(), but
that means we'd have to open code the zeroing length calculation, or
add a fs-private helper to do it, in every filesystem. Right now there
are only two callers, xfs and ext2, so that looks fine, but if the
callers grow in the future this could become repetitive; if we keep
the helpers, filesystems don't need to worry about these details.

Thanks,
Yi.



* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-31 15:27     ` Darrick J. Wong
  2024-05-31 16:17       ` Christoph Hellwig
@ 2024-06-03 13:51       ` Zhang Yi
  1 sibling, 0 replies; 47+ messages in thread
From: Zhang Yi @ 2024-06-03 13:51 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/5/31 23:27, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 06:31:36AM -0700, Christoph Hellwig wrote:
>>> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
>>
>> Maybe need_writeback would be a better name for the variable?  Also no
>> need to initialize it to false at declaration time if it is
>> unconditionally set here.
> 
> This variable captures whether or not we need to write dirty file tail
> data because we're extending the ondisk EOF, right?
> 
> I don't really like long names like any good 1980s C programmer, but
> maybe we should name this something like "extending_ondisk_eof"?
> 
> 	if (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)
> 		extending_ondisk_eof = true;
> 
> 	...
> 
> 	if (did_zeroing || extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> Hm?

Sure, this name looks better.

> 
>>> +		/*
>>> +		 * Updating i_size after writing back to make sure the zeroed
>>> +		 * blocks could been written out, and drop all the page cache
>>> +		 * range that beyond blocksize aligned new EOF block.
>>> +		 *
>>> +		 * We've already locked out new page faults, so now we can
>>> +		 * safely remove pages from the page cache knowing they won't
>>> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
> 
> And can we correct the comment here too?
> 
> "...until we drop XFS_MMAPLOCK_EXCL after the extent manipulations..."
> 

Sure,

> --D
> 
>>> +		 * extent manipulations are complete.
>>> +		 */
>>> +		i_size_write(inode, newsize);
>>> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));
>>
>> Any reason this open codes truncate_setsize()?
>>

It's not equivalent to open coding truncate_setsize(): note that the
truncate start pos is aligned to the rtextsize for a realtime inode,
so we only drop the page cache beyond the new aligned EOF block.

Thanks,
Yi.



* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-05-31 15:44   ` Darrick J. Wong
@ 2024-06-03 14:15     ` Zhang Yi
  0 siblings, 0 replies; 47+ messages in thread
From: Zhang Yi @ 2024-06-03 14:15 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: linux-xfs, linux-fsdevel, linux-kernel, hch, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/5/31 23:44, Darrick J. Wong wrote:
> On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When truncating down an inode, we call xfs_truncate_page() to zero out
>> the tail partial block beyond the new EOF, which prevents exposing
>> stale data. But xfs_truncate_page() always assumes the blocksize is
>> i_blocksize(inode), which is not true if the file has a large
>> allocation unit that zeroing must be aligned to, e.g. a realtime
>> inode should be aligned to the rtextsize.
>>
>> The current xfs_setattr_size() can't zero out a large alignment size
>> on truncate down because the processing order is wrong. We first zero
>> out through xfs_truncate_page(), and then immediately update the inode
>> size through truncate_setsize(). If the zeroed range is larger than a
>> folio, the writeback path will not write back zeroed pagecache beyond
>> the EOF folio, so it doesn't write zeroes to the entire tail extent and
>> could expose stale data after an appending write into the next aligned
>> extent.
>>
>> We need to adjust the order to zero out tail aligned blocks, write back
>> zeroed or cached data, update i_size, and drop cache beyond the aligned
>> EOF block, preparing to fix the realtime inode case and to support the
>> upcoming forced alignment feature.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>>  fs/xfs/xfs_iomap.c |   2 +-
>>  fs/xfs/xfs_iomap.h |   3 +-
>>  fs/xfs/xfs_iops.c  | 107 ++++++++++++++++++++++++++++-----------------
>>  3 files changed, 69 insertions(+), 43 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
>> index 8cdfcbb5baa7..0369b64cc3f4 100644
>> --- a/fs/xfs/xfs_iomap.c
>> +++ b/fs/xfs/xfs_iomap.c
>> @@ -1468,10 +1468,10 @@ int
>>  xfs_truncate_page(
>>  	struct xfs_inode	*ip,
>>  	loff_t			pos,
>> +	unsigned int		blocksize,
>>  	bool			*did_zero)
>>  {
>>  	struct inode		*inode = VFS_I(ip);
>> -	unsigned int		blocksize = i_blocksize(inode);
>>  
>>  	if (IS_DAX(inode))
>>  		return dax_truncate_page(inode, pos, blocksize, did_zero,
>> diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
>> index 4da13440bae9..feb1610cb645 100644
>> --- a/fs/xfs/xfs_iomap.h
>> +++ b/fs/xfs/xfs_iomap.h
>> @@ -25,7 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
>>  
>>  int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
>>  		bool *did_zero);
>> -int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
>> +int xfs_truncate_page(struct xfs_inode *ip, loff_t pos,
>> +		unsigned int blocksize, bool *did_zero);
>>  
>>  static inline xfs_filblks_t
>>  xfs_aligned_fsb_count(
>> diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
>> index d44508930b67..d24927075022 100644
>> --- a/fs/xfs/xfs_iops.c
>> +++ b/fs/xfs/xfs_iops.c
>> @@ -812,6 +812,7 @@ xfs_setattr_size(
>>  	int			error;
>>  	uint			lock_flags = 0;
>>  	bool			did_zeroing = false;
>> +	bool			write_back = false;
>>  
>>  	xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
>>  	ASSERT(S_ISREG(inode->i_mode));
>> @@ -853,30 +854,7 @@ xfs_setattr_size(
>>  	 * the transaction because the inode cannot be unlocked once it is a
>>  	 * part of the transaction.
>>  	 *
>> -	 * Start with zeroing any data beyond EOF that we may expose on file
>> -	 * extension, or zeroing out the rest of the block on a downward
>> -	 * truncate.
>> -	 */
>> -	if (newsize > oldsize) {
>> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>> -				&did_zeroing);
>> -	} else if (newsize != oldsize) {
>> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>> -	}
>> -
>> -	if (error)
>> -		return error;
>> -
>> -	/*
>> -	 * We've already locked out new page faults, so now we can safely remove
>> -	 * pages from the page cache knowing they won't get refaulted until we
>> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
>> -	 * complete. The truncate_setsize() call also cleans partial EOF page
>> -	 * PTEs on extending truncates and hence ensures sub-page block size
>> -	 * filesystems are correctly handled, too.
>> -	 *
>> -	 * We have to do all the page cache truncate work outside the
>> +	 * And we have to do all the page cache truncate work outside the
> 
> Style nit: don't start a paragraph with "and".

Sure, thanks for pointing this out.

> 
>>  	 * transaction context as the "lock" order is page lock->log space
>>  	 * reservation as defined by extent allocation in the writeback path.
>>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
>> @@ -884,27 +862,74 @@ xfs_setattr_size(
>>  	 * user visible changes). There's not much we can do about this, except
>>  	 * to hope that the caller sees ENOMEM and retries the truncate
>>  	 * operation.
>> -	 *
>> -	 * And we update in-core i_size and truncate page cache beyond newsize
>> -	 * before writeback the [i_disk_size, newsize] range, so we're
>> -	 * guaranteed not to write stale data past the new EOF on truncate down.
>>  	 */
>> -	truncate_setsize(inode, newsize);
>> +	write_back = newsize > ip->i_disk_size && oldsize != ip->i_disk_size;
>> +	if (newsize < oldsize) {
>> +		unsigned int blocksize = i_blocksize(inode);
>>  
>> -	/*
>> -	 * We are going to log the inode size change in this transaction so
>> -	 * any previous writes that are beyond the on disk EOF and the new
>> -	 * EOF that have not been written out need to be written here.  If we
>> -	 * do not write the data out, we expose ourselves to the null files
>> -	 * problem. Note that this includes any block zeroing we did above;
>> -	 * otherwise those blocks may not be zeroed after a crash.
>> -	 */
>> -	if (did_zeroing ||
>> -	    (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
>> -		error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
>> -						ip->i_disk_size, newsize - 1);
>> +		/*
>> +		 * Zeroing out the partial EOF block and the rest of the extra
>> +		 * aligned blocks on a downward truncate.
>> +		 */
>> +		error = xfs_truncate_page(ip, newsize, blocksize, &did_zeroing);
>>  		if (error)
>>  			return error;
>> +
>> +		/*
>> +		 * We are going to log the inode size change in this transaction
>> +		 * so any previous writes that are beyond the on disk EOF and
>> +		 * the new EOF that have not been written out need to be written
>> +		 * here.  If we do not write the data out, we expose ourselves
>> +		 * to the null files problem. Note that this includes any block
>> +		 * zeroing we did above; otherwise those blocks may not be
>> +		 * zeroed after a crash.
>> +		 */
>> +		if (did_zeroing || write_back) {
>> +			error = filemap_write_and_wait_range(inode->i_mapping,
>> +					min_t(loff_t, ip->i_disk_size, newsize),
>> +					roundup_64(newsize, blocksize) - 1);
>> +			if (error)
>> +				return error;
>> +		}
>> +
>> +		/*
>> +		 * Updating i_size after writing back to make sure the zeroed
> 
> "Update the incore i_size after flushing dirty tail pages to disk, and
> drop all the pagecache beyond the allocation unit containing EOF." ?

Yep.

> 
>> +		 * blocks could been written out, and drop all the page cache
>> +		 * range that beyond blocksize aligned new EOF block.
>> +		 *
>> +		 * We've already locked out new page faults, so now we can
>> +		 * safely remove pages from the page cache knowing they won't
>> +		 * get refaulted until we drop the XFS_MMAP_EXCL lock after the
>> +		 * extent manipulations are complete.
>> +		 */
>> +		i_size_write(inode, newsize);
>> +		truncate_pagecache(inode, roundup_64(newsize, blocksize));
> 
> I'm not sure why we need to preserve the pagecache beyond eof having
> zeroed and then written the post-eof blocks out to disk, but I'm
> guessing this is why you open-code truncate_setsize?

Yeah, xfs_truncate_page() has already done the zeroing. If we kept passing
the newsize to truncate_pagecache() through truncate_setsize(), it would
zero out the partial folio that covers the already-zeroed blocks again.
What we should do at this point is just drop all the page cache beyond the
aligned EOF block, so I round up the newsize, just a small optimization.

> 
>> +	} else {
>> +		/*
>> +		 * Start with zeroing any data beyond EOF that we may expose on
>> +		 * file extension.
>> +		 */
>> +		if (newsize > oldsize) {
>> +			trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>> +			error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>> +					       &did_zeroing);
>> +			if (error)
>> +				return error;
>> +		}
>> +
>> +		/*
>> +		 * The truncate_setsize() call also cleans partial EOF page
>> +		 * PTEs on extending truncates and hence ensures sub-page block
>> +		 * size filesystems are correctly handled, too.
>> +		 */
>> +		truncate_setsize(inode, newsize);
>> +
>> +		if (did_zeroing || write_back) {
>> +			error = filemap_write_and_wait_range(inode->i_mapping,
>> +					ip->i_disk_size, newsize - 1);
>> +			if (error)
>> +				return error;
>> +		}
>>  	}
> 
> At this point I wonder if these three truncate cases (down, up, and
> unchanged) should just be broken out into three helpers without so much
> twisty logic.
> 
> xfs_setattr_truncate_down():
> 	xfs_truncate_page(..., &did_zeroing);
> 
> 	if (did_zeroing || extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> 	truncate_setsize(...); /* or your opencoded version */
> 
> xfs_setattr_truncate_up():
> 	xfs_zero_range(..., &did_zeroing);
> 
> 	truncate_setsize(...);
> 
> 	if (did_zeroing || extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> xfs_setattr_truncate_unchanged():
> 	truncate_setsize(...);
> 
> 	if (extending_ondisk_eof)
> 		filemap_write_and_wait_range(...);
> 
> So then the callsite becomes:
> 
> 	if (newsize > oldsize)
> 		xfs_settattr_truncate_up();
> 	else if (newsize < oldsize)
> 		xfs_setattr_truncate_down();
> 	else
> 		xfs_setattr_truncate_unchanged();

Sounds good.

> 
> But, I dunno.  Most of the code is really just extensive commenting.
> 

Yeah, the extensive comments bother me too. I will try to make them
clearer in the next iteration, I hope.

Thanks,
Yi.

> --D
> 
>>  
>>  	error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
>> -- 
>> 2.39.2
>>
>>


^ permalink raw reply	[flat|nested] 47+ messages in thread

* Re: [RFC PATCH v4 5/8] xfs: refactor the truncating order
  2024-06-02 22:46   ` Dave Chinner
@ 2024-06-03 14:18     ` Zhang Yi
  0 siblings, 0 replies; 47+ messages in thread
From: Zhang Yi @ 2024-06-03 14:18 UTC (permalink / raw)
  To: Dave Chinner
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/6/3 6:46, Dave Chinner wrote:
> On Wed, May 29, 2024 at 05:52:03PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When truncating down an inode, we call xfs_truncate_page() to zero out
>> the tail partial block beyond the new EOF, which prevents exposing
>> stale data. But xfs_truncate_page() always assumes the blocksize is
>> i_blocksize(inode), which is not true if the file has a large
>> allocation unit that we should align to, e.g. a realtime inode should
>> be aligned to the rtextsize.
>>
>> The current xfs_setattr_size() can't support zeroing out a large
>> alignment size on truncate down since the processing order is wrong. We
>> first zero out through xfs_truncate_page(), and then update the inode
>> size through truncate_setsize() immediately. If the zeroed range is
>> larger than a folio, the writeback path would not write back the zeroed
>> page cache beyond the EOF folio, so it doesn't write zeroes to the
>> entire tail extent and could expose stale data after an appending write
>> into the next aligned extent.
>>
>> We need to adjust the order: zero out the tail aligned blocks, write
>> back the zeroed or cached data, update i_size, and drop the cache
>> beyond the aligned EOF block, preparing for the fix of realtime inodes
>> and supporting the upcoming forced alignment feature.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
> .....
>> @@ -853,30 +854,7 @@ xfs_setattr_size(
>>  	 * the transaction because the inode cannot be unlocked once it is a
>>  	 * part of the transaction.
>>  	 *
>> -	 * Start with zeroing any data beyond EOF that we may expose on file
>> -	 * extension, or zeroing out the rest of the block on a downward
>> -	 * truncate.
>> -	 */
>> -	if (newsize > oldsize) {
>> -		trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
>> -		error = xfs_zero_range(ip, oldsize, newsize - oldsize,
>> -				&did_zeroing);
>> -	} else if (newsize != oldsize) {
>> -		error = xfs_truncate_page(ip, newsize, &did_zeroing);
>> -	}
>> -
>> -	if (error)
>> -		return error;
>> -
>> -	/*
>> -	 * We've already locked out new page faults, so now we can safely remove
>> -	 * pages from the page cache knowing they won't get refaulted until we
>> -	 * drop the XFS_MMAP_EXCL lock after the extent manipulations are
>> -	 * complete. The truncate_setsize() call also cleans partial EOF page
>> -	 * PTEs on extending truncates and hence ensures sub-page block size
>> -	 * filesystems are correctly handled, too.
>> -	 *
>> -	 * We have to do all the page cache truncate work outside the
>> +	 * And we have to do all the page cache truncate work outside the
>>  	 * transaction context as the "lock" order is page lock->log space
>>  	 * reservation as defined by extent allocation in the writeback path.
>>  	 * Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
> ......
> 
> Lots of new logic for zeroing here. That makes xfs_setattr_size()
> even longer than it already is. Can you lift this EOF zeroing logic
> into it's own helper function so that it is clear that it is a
> completely independent operation to the actual transaction that
> changes the inode size. That would also allow the operations to be
> broken up into:
> 
> 	if (newsize >= oldsize) {
> 		/* do the simple stuff */
> 		....
> 		return error;
> 	}
> 	/* do the complex size reduction stuff without additional indenting */
> 

Sure, I will try to factor them out.

Thanks,
Yi.



* Re: [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode
  2024-05-31 13:36   ` Christoph Hellwig
@ 2024-06-03 14:35     ` Zhang Yi
  0 siblings, 0 replies; 47+ messages in thread
From: Zhang Yi @ 2024-06-03 14:35 UTC (permalink / raw)
  To: Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/5/31 21:36, Christoph Hellwig wrote:
> On Wed, May 29, 2024 at 05:52:04PM +0800, Zhang Yi wrote:
>> +	if (xfs_inode_has_bigrtalloc(ip))
>> +		first_unmap_block = xfs_rtb_roundup_rtx(mp, first_unmap_block);
> 
> Given that first_unmap_block is a xfs_fileoff_t and not a xfs_rtblock_t,
> this looks a bit confusing.  I'd suggest to just open code the
> arithmetics in xfs_rtb_roundup_rtx.  For future proofing my also
> use xfs_inode_alloc_unitsize() as in the hunk below instead of hard
> coding the rtextsize.  I.e.:
> 
> 	first_unmap_block = XFS_B_TO_FSB(mp,
> 		roundup_64(new_size, xfs_inode_alloc_unitsize(ip)));
> 
Sure, makes sense to me.

Thanks,
Yi.



* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-06-03  9:07     ` Zhang Yi
@ 2024-06-03 14:37       ` Brian Foster
  2024-06-04 23:38         ` Dave Chinner
  0 siblings, 1 reply; 47+ messages in thread
From: Brian Foster @ 2024-06-03 14:37 UTC (permalink / raw)
  To: Zhang Yi
  Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
	david, chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On Mon, Jun 03, 2024 at 05:07:02PM +0800, Zhang Yi wrote:
> On 2024/6/2 19:04, Brian Foster wrote:
> > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> >> From: Dave Chinner <dchinner@redhat.com>
> >>
> >> Unwritten extents can have page cache data over the range being
> >> zeroed so we can't just skip them entirely. Fix this by checking for
> >> an existing dirty folio over the unwritten range we are zeroing
> >> and only performing zeroing if the folio is already dirty.
> >>
> >> XXX: how do we detect a iomap containing a cow mapping over a hole
> >> in iomap_zero_iter()? The XFS code implies this case also needs to
> >> zero the page cache if there is data present, so trigger for page
> >> cache lookup only in iomap_zero_iter() needs to handle this case as
> >> well.
> >>
> >> Before:
> >>
> >> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> >> path /mnt/scratch/foo, 50000 iters
> >>
> >> real    0m14.103s
> >> user    0m0.015s
> >> sys     0m0.020s
> >>
> >> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> >> path /mnt/scratch/foo, 50000 iters
> >> % time     seconds  usecs/call     calls    errors syscall
> >> ------ ----------- ----------- --------- --------- ----------------
> >>  85.90    0.847616          16     50000           ftruncate
> >>  14.01    0.138229           2     50000           pwrite64
> >> ....
> >>
> >> After:
> >>
> >> $ time sudo ./pwrite-trunc /mnt/scratch/foo 50000
> >> path /mnt/scratch/foo, 50000 iters
> >>
> >> real    0m0.144s
> >> user    0m0.021s
> >> sys     0m0.012s
> >>
> >> $ sudo strace -c ./pwrite-trunc /mnt/scratch/foo 50000
> >> path /mnt/scratch/foo, 50000 iters
> >> % time     seconds  usecs/call     calls    errors syscall
> >> ------ ----------- ----------- --------- --------- ----------------
> >>  53.86    0.505964          10     50000           ftruncate
> >>  46.12    0.433251           8     50000           pwrite64
> >> ....
> >>
> >> Yup, we get back all the performance.
> >>
> >> As for the "mmap write beyond EOF" data exposure aspect
> >> documented here:
> >>
> >> https://lore.kernel.org/linux-xfs/20221104182358.2007475-1-bfoster@redhat.com/
> >>
> >> With this command:
> >>
> >> $ sudo xfs_io -tfc "falloc 0 1k" -c "pwrite 0 1k" \
> >>   -c "mmap 0 4k" -c "mwrite 3k 1k" -c "pwrite 32k 4k" \
> >>   -c fsync -c "pread -v 3k 32" /mnt/scratch/foo
> >>
> >> Before:
> >>
> >> wrote 1024/1024 bytes at offset 0
> >> 1 KiB, 1 ops; 0.0000 sec (34.877 MiB/sec and 35714.2857 ops/sec)
> >> wrote 4096/4096 bytes at offset 32768
> >> 4 KiB, 1 ops; 0.0000 sec (229.779 MiB/sec and 58823.5294 ops/sec)
> >> 00000c00:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> >> XXXXXXXXXXXXXXXX
> >> 00000c10:  58 58 58 58 58 58 58 58 58 58 58 58 58 58 58 58
> >> XXXXXXXXXXXXXXXX
> >> read 32/32 bytes at offset 3072
> >> 32.000000 bytes, 1 ops; 0.0000 sec (568.182 KiB/sec and 18181.8182
> >>    ops/sec
> >>
> >> After:
> >>
> >> wrote 1024/1024 bytes at offset 0
> >> 1 KiB, 1 ops; 0.0000 sec (40.690 MiB/sec and 41666.6667 ops/sec)
> >> wrote 4096/4096 bytes at offset 32768
> >> 4 KiB, 1 ops; 0.0000 sec (150.240 MiB/sec and 38461.5385 ops/sec)
> >> 00000c00:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ................
> >> 00000c10:  00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
> >> ................
> >> read 32/32 bytes at offset 3072
> >> 32.000000 bytes, 1 ops; 0.0000 sec (558.036 KiB/sec and 17857.1429
> >>    ops/sec)
> >>
> >> We see that this post-eof unwritten extent dirty page zeroing is
> >> working correctly.
> >>
> > 
> > I've pointed this out in the past, but IIRC this implementation is racy
> > vs. reclaim. Specifically, relying on folio lookup after mapping lookup
> > doesn't take reclaim into account, so if we look up an unwritten mapping
> > and then a folio flushes and reclaims by the time the scan reaches that
> > offset, it incorrectly treats that subrange as already zero when it
> > actually isn't (because the extent is actually stale by that point, but
> > the stale extent check is skipped).
> > 
> 
> Hello, Brian!
> 
> I'm confused, how could that happen? We do the stale check under the
> folio lock; if the folio is flushed and reclaimed before we get and lock
> that folio in iomap_zero_iter()->iomap_write_begin(), the
> ->iomap_valid() check would catch this staleness and we would zero again
> in the next iteration. Am I missing something?
> 

Hi Yi,

Yep, that is my understanding of how the revalidation thing works in
general as well. The nuance in this particular case is that no folio
exists at the associated offset. Therefore, the reval is skipped in
iomap_write_begin(), iomap_zero_iter() skips over the range as well, and
the operation carries on as normal.

Have you tried the test sequence above? I just retried on latest master
plus this series and it still trips for me. I haven't redone the low
level analysis, but in general what this is trying to show is something
like the following...

Suppose we start with an unwritten block on disk with a dirty folio in
cache:

- iomap looks up the extent and finds the unwritten mapping.
- Reclaim kicks in and writes back the page and removes it from cache.
  The underlying block is no longer unwritten (current mapping is now
  stale).
- iomap_zero_iter() processes the associated offset. iomap_get_folio()
  clears FGP_CREAT, no folio is found.
- iomap_write_begin() returns -ENOENT (i.e. no folio lock ->
  iomap_valid()). iomap_zero_iter() consumes the error and skips to the
  next range.

At that point we have data on disk from a previous write that a
post-write zero range (i.e. truncate) failed to zero. Let me know if you
think this doesn't match up with the current series. It's been a while
since I've looked at this and it's certainly possible I've missed
something or something changed.

Otherwise, I think there are multiple potential ways around this. I.e.,
you could consider a fallback locking mode for reval, explicit (large?)
folio recreation and invalidation (if non-dirty) over unwritten mappings
for revalidation purposes, or something like the POC diff in reply to
hch on patch 3 around checking cache prior to iomap lookup (which also
has tradeoffs). I'm sure there are other ideas as well.

Brian

> Thanks,
> Yi.
> 
> > A simple example to demonstrate this is something like the following:
> > 
> > # looping truncate zeroing
> > while [ true ]; do
> > 	xfs_io -fc "truncate 0" -c "falloc 0 32K" -c "pwrite 0 4k" -c "truncate 2k" <file>
> > 	xfs_io -c "mmap 0 4k" -c "mread -v 2k 16" <file> | grep cd && break
> > done
> > 
> > vs.
> > 
> > # looping writeback and reclaim
> > while [ true ]; do
> > 	xfs_io -c "sync_range -a 0 0" -c "fadvise -d 0 0" <file>
> > done
> > 
> > If I ran that against this patch, the first loop will eventually detect
> > stale data exposed past eof.
> > 
> > Brian
> > 
> 



* Re: [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize
  2024-05-31 15:00         ` Darrick J. Wong
@ 2024-06-04  7:09           ` Zhang Yi
  0 siblings, 0 replies; 47+ messages in thread
From: Zhang Yi @ 2024-06-04  7:09 UTC (permalink / raw)
  To: Darrick J. Wong, Christoph Hellwig
  Cc: linux-xfs, linux-fsdevel, linux-kernel, brauner, david,
	chandanbabu, jack, willy, yi.zhang, chengzhihao1, yukuai3

On 2024/5/31 23:00, Darrick J. Wong wrote:
> On Fri, May 31, 2024 at 07:15:34AM -0700, Christoph Hellwig wrote:
>> On Fri, May 31, 2024 at 07:12:10AM -0700, Darrick J. Wong wrote:
>>> There are <cough> some users that want 1G extents.
>>>
>>> For the rest of us who don't live in the stratosphere, it's convenient
>>> for fsdax to have rt extents that match the PMD size, which could be
>>> large on arm64 (e.g. 512M, or two smr sectors).
>>
>> That's fine.  Maybe to rephrase my question.  With this series we
>> have 3 different truncate paths:
>>
>>  1) unmap all blocks (!rt || rtextsize == 1)
>>  2) zero leftover blocks in an rtextent (small rtextsize, but > 1)
>>  3) convert leftover blocks in an rtextent to unwritten (large
>>    rtextsize)
>>
>> What is the right threshold to switch between 2 and 3?  And do we
>> really need 2) at all?
> 
> I don't think we need (2) at all.
> 
> There's likely some threshold below where it's a wash -- compare with
> ext4 strategy of trying to write 64k chunks even if that requires
> zeroing pagecache to cut down on fragmentation on hdds -- but I don't
> know if we care anymore. ;)
> 

I ran some supplementary tests for small (but > 1) rtextsizes on my ramdisk:

  mkfs.xfs -f -m reflink=0,rmapbt=0, -d rtinherit=1 \
           -r rtdev=/dev/pmem1s,extsize=$rtextsize /dev/pmem2s
  mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
  for i in {1..1000}; \
  do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1; done
  sync
  time for i in {1..1000}; \
  do xfs_io -c "truncate 4k" /mnt/scratch/$i; done

rtextsize            8k      16k      32k      64k     256k     1024k
zero out:          9.601s  10.229s  11.153s  12.086s  12.259s  20.141s
convert unwritten: 9.710s   9.642s   9.958s   9.441s  10.021s  10.526s

The tests show that there is not much difference between (2) and (3)
with a small rtextsize, but as the size gets progressively larger, (3)
wins, so I agree with you that we could just drop (2) for the rt
device.

Thanks,
Yi.



* Re: [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware
  2024-06-03 14:37       ` Brian Foster
@ 2024-06-04 23:38         ` Dave Chinner
  0 siblings, 0 replies; 47+ messages in thread
From: Dave Chinner @ 2024-06-04 23:38 UTC (permalink / raw)
  To: Brian Foster
  Cc: Zhang Yi, linux-xfs, linux-fsdevel, linux-kernel, djwong, hch,
	brauner, chandanbabu, jack, willy, yi.zhang, chengzhihao1,
	yukuai3

On Mon, Jun 03, 2024 at 10:37:48AM -0400, Brian Foster wrote:
> On Mon, Jun 03, 2024 at 05:07:02PM +0800, Zhang Yi wrote:
> > On 2024/6/2 19:04, Brian Foster wrote:
> > > On Wed, May 29, 2024 at 05:51:59PM +0800, Zhang Yi wrote:
> > >> From: Dave Chinner <dchinner@redhat.com>
> > >>
> > >> Unwritten extents can have page cache data over the range being
> > >> zeroed so we can't just skip them entirely. Fix this by checking for
> > >> an existing dirty folio over the unwritten range we are zeroing
> > >> and only performing zeroing if the folio is already dirty.
> > >>
> > >> XXX: how do we detect a iomap containing a cow mapping over a hole
> > >> in iomap_zero_iter()? The XFS code implies this case also needs to
> > >> zero the page cache if there is data present, so trigger for page
> > >> cache lookup only in iomap_zero_iter() needs to handle this case as
> > >> well.
> > >>
> > >> [benchmark and data-exposure test output snipped; see the patch mail above]
> > > 
> > > I've pointed this out in the past, but IIRC this implementation is racy
> > > vs. reclaim. Specifically, relying on folio lookup after mapping lookup
> > > doesn't take reclaim into account, so if we look up an unwritten mapping
> > > and then a folio flushes and reclaims by the time the scan reaches that
> > > offset, it incorrectly treats that subrange as already zero when it
> > > actually isn't (because the extent is actually stale by that point, but
> > > the stale extent check is skipped).
> > > 
> > 
> > Hello, Brian!
> > 
> > I'm confused, how could that happen? We do the stale check under the
> > folio lock; if the folio is flushed and reclaimed before we get and
> > lock that folio in iomap_zero_iter()->iomap_write_begin(), the
> > ->iomap_valid() check would catch this staleness and we would zero
> > again in the next iteration. Am I missing something?
> > 
> 
> Hi Yi,
> 
> Yep, that is my understanding of how the revalidation thing works in
> general as well. The nuance in this particular case is that no folio
> exists at the associated offset. Therefore, the reval is skipped in
> iomap_write_begin(), iomap_zero_iter() skips over the range as well, and
> the operation carries on as normal.
> 
> Have you tried the test sequence above? I just retried on latest master
> plus this series and it still trips for me. I haven't redone the low
> level analysis, but in general what this is trying to show is something
> like the following...
> 
> Suppose we start with an unwritten block on disk with a dirty folio in
> cache:
> 
> - iomap looks up the extent and finds the unwritten mapping.
> - Reclaim kicks in and writes back the page and removes it from cache.

To be pedantic, reclaim doesn't write folios back - we haven't done
that for the best part of a decade on XFS. We don't even have a
->writepage method for reclaim to write back pages anymore.

Hence writeback has to come from the background flusher threads hitting
that specific folio, then IO completion running and converting the
unwritten extent, then reclaim hitting that folio, all while the current
iomap is being walked and zeroed.

That makes it an extremely rare and difficult condition to hit. Yes,
it's possible, but it's also something we can easily detect. So as
long as detection is low cost, the cost of resolution when such a
rare event is detected isn't going to be noticed by anyone.

>   The underlying block is no longer unwritten (current mapping is now
>   stale).
> - iomap_zero_iter() processes the associated offset. iomap_get_folio()
>   clears FGP_CREAT, no folio is found.

Actually, this is really easy to fix - we simply revalidate the
mapping at this point rather than just skipping the folio range. If
the mapping has changed because it's now written, the zeroing code
backs out and gets a new mapping that reflects the changed state of
this range.

However, with the above cost analysis in mind, a lower overhead
common case alternative might be to only revalidate the mapping at
->iomap_end() time. If the mapping has changed while zeroing, we
return -EBUSY/-ESTALE and that triggers the zeroing to restart from
the offset at the beginning of the "stale" iomap.  The runtime cost
is one extra mapping revalidation call per mapping, and the
resolution cost is refetching and zeroing the range of a single
unwritten iomap.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


end of thread, other threads:[~2024-06-04 23:38 UTC | newest]

Thread overview: 47+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-05-29  9:51 [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-05-29  9:51 ` [RFC PATCH v4 1/8] iomap: zeroing needs to be pagecache aware Zhang Yi
2024-05-31 13:11   ` Christoph Hellwig
2024-05-31 14:03     ` Darrick J. Wong
2024-05-31 14:05       ` Christoph Hellwig
2024-05-31 15:44         ` Brian Foster
2024-05-31 15:43       ` Brian Foster
2024-06-02 22:22     ` Dave Chinner
2024-06-02 11:04   ` Brian Foster
2024-06-03  9:07     ` Zhang Yi
2024-06-03 14:37       ` Brian Foster
2024-06-04 23:38         ` Dave Chinner
2024-05-29  9:52 ` [RFC PATCH v4 2/8] math64: add rem_u64() to just return the remainder Zhang Yi
2024-05-31 12:35   ` Christoph Hellwig
2024-05-31 14:04   ` Darrick J. Wong
2024-05-29  9:52 ` [RFC PATCH v4 3/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
2024-05-31 12:39   ` Christoph Hellwig
2024-06-02 11:16     ` Brian Foster
2024-06-03 13:23     ` Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 4/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 5/8] xfs: refactor the truncating order Zhang Yi
2024-05-31 13:31   ` Christoph Hellwig
2024-05-31 15:27     ` Darrick J. Wong
2024-05-31 16:17       ` Christoph Hellwig
2024-06-03 13:51       ` Zhang Yi
2024-05-31 15:44   ` Darrick J. Wong
2024-06-03 14:15     ` Zhang Yi
2024-06-02 22:46   ` Dave Chinner
2024-06-03 14:18     ` Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 6/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
2024-05-31 13:36   ` Christoph Hellwig
2024-06-03 14:35     ` Zhang Yi
2024-05-29  9:52 ` [RFC PATCH v4 7/8] xfs: reserve blocks for truncating " Zhang Yi
2024-05-31 12:42   ` Christoph Hellwig
2024-05-31 14:10     ` Darrick J. Wong
2024-05-31 14:13       ` Christoph Hellwig
2024-05-31 15:29         ` Darrick J. Wong
2024-05-31 16:17           ` Christoph Hellwig
2024-05-29  9:52 ` [RFC PATCH v4 8/8] xfs: improve truncate on a realtime inode with huge extsize Zhang Yi
2024-05-31 13:46   ` Christoph Hellwig
2024-05-31 14:12     ` Darrick J. Wong
2024-05-31 14:15       ` Christoph Hellwig
2024-05-31 15:00         ` Darrick J. Wong
2024-06-04  7:09           ` Zhang Yi
2024-05-31 12:26 ` [RFC PATCH v4 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Christoph Hellwig
2024-06-01  7:38   ` Zhang Yi
2024-06-01  7:40     ` Christoph Hellwig

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).