* [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes
@ 2024-06-13 9:00 Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 1/8] math64: add rem_u64() to just return the remainder Zhang Yi
` (7 more replies)
0 siblings, 8 replies; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
Changes since v4:
- Drop the first patch in v4, "iomap: zeroing needs to be pagecache
aware", since this series does not strongly depend on it; that patch
still needs further analysis and should also handle the case of
a pending COW extent that extends over a data fork hole. That is a
big job, so let's fix the stale data exposure issue and bring back
the changes in iomap_write_end() first, without blocking the ext4
buffered iomap conversion.
- In patch 1, drop the 'ifndef rem_u64'.
- In patch 4, factor out a helper xfs_setattr_truncate_data() to handle
the zero out, update i_size, write back and drop pagecache on
truncate.
- In patch 5, switch to use xfs_inode_alloc_unitsize() in
xfs_itruncate_extents_flags().
- In patch 6, change to reserving blocks for rtextsize > 1 realtime
inodes on truncate down.
- In patch 7, drop the unwritten conversion threshold and always
convert tail blocks to unwritten when truncating down realtime inodes.
- Add patch 8 to bring back 'commit 943bc0882ceb ("iomap: don't
increase i_size if it's not a write operation")'.
Changes since v3:
- Factor out a new helper to get the remainder in math64.h as Darrick
suggested.
- Adjust the truncating order to prevent too many redundant blocking
writes, as Dave suggested.
- Improve to convert the tail extent to unwritten when truncating down
an inode with large rtextsize as Darrick and Dave suggested.
Since commit 943bc0882ceb ("iomap: don't increase i_size if it's not a
write operation") was merged, Chandan reported a stale data exposure
issue when running fstests generic/561 on xfs with a realtime device
[1]. This issue was fixed in 6.10 by reverting that commit through
commit 0841ea4a3b41 ("iomap: keep on increasing i_size in
iomap_write_end()"), but the real problem is that xfs_setattr_size()
doesn't zero out a large enough range when truncating down a realtime
inode. This series fixes the problem by zeroing out the whole
allocation unit and converting the tail blocks to unwritten when
truncating down realtime inodes, so we can finally bring commit
943bc0882ceb back.
Patches 1-3 modify iomap_truncate_page() and dax_truncate_page() to
take a filesystem-identified blocksize and drop the i_blocksize()
assumption, as Dave suggested.
Patches 4-5 refactor and adjust the truncate-down processing order:
first zero out the tail aligned blocks, then write back and update
i_size, and finally drop the pagecache beyond the aligned EOF. This
fixes the data exposure issue by zeroing out the entire EOF extent.
Patches 6-7 improve truncate-down performance on realtime inodes with a
big rtextsize (> 1 fsblock) by converting the tail unaligned extent to
unwritten.
Patch 8 reverts commit 0841ea4a3b41 and brings commit 943bc0882ceb
back, so i_size is no longer increased on IOMAP_ZERO and IOMAP_UNSHARE.
I've tested this series with fstests (1) with reflink=0, (2) with
reflink=1, (3) with a 28K RT device and (4) with dax; no new failures
were detected, and it passed generic/561 on the RT device over 1000+
rounds. Please let me know if other tests are needed.
[1] https://lore.kernel.org/linux-xfs/87ttj8ircu.fsf@debian-BULLSEYE-live-builder-AMD64/
Thanks,
Yi.
Zhang Yi (8):
math64: add rem_u64() to just return the remainder
iomap: pass blocksize to iomap_truncate_page()
fsdax: pass blocksize to dax_truncate_page()
xfs: refactor the truncating order
xfs: correct the truncate blocksize of realtime inode
xfs: reserve blocks for truncating large realtime inode
xfs: speed up truncating down a big realtime inode
iomap: don't increase i_size in iomap_write_end()
fs/dax.c | 8 +-
fs/ext2/inode.c | 4 +-
fs/iomap/buffered-io.c | 61 +++++++-------
fs/xfs/xfs_inode.c | 9 ++-
fs/xfs/xfs_iomap.c | 5 +-
fs/xfs/xfs_iomap.h | 3 +-
fs/xfs/xfs_iops.c | 180 ++++++++++++++++++++++++++++-------------
include/linux/dax.h | 4 +-
include/linux/iomap.h | 4 +-
include/linux/math64.h | 22 +++++
10 files changed, 204 insertions(+), 96 deletions(-)
--
2.39.2
^ permalink raw reply [flat|nested] 17+ messages in thread
* [PATCH -next v5 1/8] math64: add rem_u64() to just return the remainder
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 2/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
` (6 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
Add a new helper rem_u64() that returns only the remainder of an
unsigned 64bit division by a 32bit divisor.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
---
include/linux/math64.h | 22 ++++++++++++++++++++++
1 file changed, 22 insertions(+)
diff --git a/include/linux/math64.h b/include/linux/math64.h
index d34def7f9a8c..efbe58c157e3 100644
--- a/include/linux/math64.h
+++ b/include/linux/math64.h
@@ -3,6 +3,7 @@
#define _LINUX_MATH64_H
#include <linux/types.h>
+#include <linux/log2.h>
#include <linux/math.h>
#include <asm/div64.h>
#include <vdso/math64.h>
@@ -12,6 +13,20 @@
#define div64_long(x, y) div64_s64((x), (y))
#define div64_ul(x, y) div64_u64((x), (y))
+/**
+ * rem_u64 - remainder of unsigned 64bit divide with 32bit divisor
+ * @dividend: unsigned 64bit dividend
+ * @divisor: unsigned 32bit divisor
+ *
+ * Return: dividend % divisor
+ */
+static inline u32 rem_u64(u64 dividend, u32 divisor)
+{
+ if (is_power_of_2(divisor))
+ return dividend & (divisor - 1);
+ return dividend % divisor;
+}
+
/**
* div_u64_rem - unsigned 64bit divide with 32bit divisor with remainder
* @dividend: unsigned 64bit dividend
@@ -86,6 +101,13 @@ static inline s64 div64_s64(s64 dividend, s64 divisor)
#define div64_long(x, y) div_s64((x), (y))
#define div64_ul(x, y) div_u64((x), (y))
+static inline u32 rem_u64(u64 dividend, u32 divisor)
+{
+ if (is_power_of_2(divisor))
+ return dividend & (divisor - 1);
+ return do_div(dividend, divisor);
+}
+
#ifndef div_u64_rem
static inline u64 div_u64_rem(u64 dividend, u32 divisor, u32 *remainder)
{
--
2.39.2
* [PATCH -next v5 2/8] iomap: pass blocksize to iomap_truncate_page()
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 1/8] math64: add rem_u64() to just return the remainder Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
2024-06-14 5:56 ` Christoph Hellwig
2024-06-13 9:00 ` [PATCH -next v5 3/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
` (5 subsequent siblings)
7 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
iomap_truncate_page() always assumes that the block size of the inode
being truncated is i_blocksize(), but this is not true for filesystems
such as XFS, which does extent size alignment for realtime inodes. Drop
this assumption and pass the block size for zeroing into
iomap_truncate_page(), allowing filesystems to indicate the correct
block size.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 8 ++++----
fs/xfs/xfs_iomap.c | 3 ++-
include/linux/iomap.h | 4 ++--
3 files changed, 8 insertions(+), 7 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 9952cc3a239b..4a23c3950a47 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -17,6 +17,7 @@
#include <linux/bio.h>
#include <linux/sched/signal.h>
#include <linux/migrate.h>
+#include <linux/math64.h>
#include "trace.h"
#include "../internal.h"
@@ -1453,11 +1454,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
EXPORT_SYMBOL_GPL(iomap_zero_range);
int
-iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
- const struct iomap_ops *ops)
+iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+ bool *did_zero, const struct iomap_ops *ops)
{
- unsigned int blocksize = i_blocksize(inode);
- unsigned int off = pos & (blocksize - 1);
+ unsigned int off = rem_u64(pos, blocksize);
/* Block boundary? Nothing to do */
if (!off)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 378342673925..32306804b01b 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1471,10 +1471,11 @@ xfs_truncate_page(
bool *did_zero)
{
struct inode *inode = VFS_I(ip);
+ unsigned int blocksize = i_blocksize(inode);
if (IS_DAX(inode))
return dax_truncate_page(inode, pos, did_zero,
&xfs_dax_write_iomap_ops);
- return iomap_truncate_page(inode, pos, did_zero,
+ return iomap_truncate_page(inode, pos, blocksize, did_zero,
&xfs_buffered_write_iomap_ops);
}
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 6fc1c858013d..d67bf86ec582 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -273,8 +273,8 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
const struct iomap_ops *ops);
int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
bool *did_zero, const struct iomap_ops *ops);
-int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
- const struct iomap_ops *ops);
+int iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+ bool *did_zero, const struct iomap_ops *ops);
vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
const struct iomap_ops *ops);
int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
--
2.39.2
* [PATCH -next v5 3/8] fsdax: pass blocksize to dax_truncate_page()
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 1/8] math64: add rem_u64() to just return the remainder Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 2/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 4/8] xfs: refactor the truncating order Zhang Yi
` (4 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
dax_truncate_page() always assumes that the block size of the inode
being truncated is i_blocksize(), but this is not true for filesystems
such as XFS, which does extent size alignment for realtime inodes. Drop
this assumption and pass the block size for zeroing into
dax_truncate_page(), allowing filesystems to indicate the correct block
size.
Suggested-by: Dave Chinner <david@fromorbit.com>
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/dax.c | 8 ++++----
fs/ext2/inode.c | 4 ++--
fs/xfs/xfs_iomap.c | 2 +-
include/linux/dax.h | 4 ++--
4 files changed, 9 insertions(+), 9 deletions(-)
diff --git a/fs/dax.c b/fs/dax.c
index becb4a6920c6..4cbd94fd96ed 100644
--- a/fs/dax.c
+++ b/fs/dax.c
@@ -25,6 +25,7 @@
#include <linux/mmu_notifier.h>
#include <linux/iomap.h>
#include <linux/rmap.h>
+#include <linux/math64.h>
#include <asm/pgalloc.h>
#define CREATE_TRACE_POINTS
@@ -1403,11 +1404,10 @@ int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
}
EXPORT_SYMBOL_GPL(dax_zero_range);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
- const struct iomap_ops *ops)
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+ bool *did_zero, const struct iomap_ops *ops)
{
- unsigned int blocksize = i_blocksize(inode);
- unsigned int off = pos & (blocksize - 1);
+ unsigned int off = rem_u64(pos, blocksize);
/* Block boundary? Nothing to do */
if (!off)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 0caa1650cee8..337349c94adf 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1276,8 +1276,8 @@ static int ext2_setsize(struct inode *inode, loff_t newsize)
inode_dio_wait(inode);
if (IS_DAX(inode))
- error = dax_truncate_page(inode, newsize, NULL,
- &ext2_iomap_ops);
+ error = dax_truncate_page(inode, newsize, i_blocksize(inode),
+ NULL, &ext2_iomap_ops);
else
error = block_truncate_page(inode->i_mapping,
newsize, ext2_get_block);
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 32306804b01b..8cdfcbb5baa7 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1474,7 +1474,7 @@ xfs_truncate_page(
unsigned int blocksize = i_blocksize(inode);
if (IS_DAX(inode))
- return dax_truncate_page(inode, pos, did_zero,
+ return dax_truncate_page(inode, pos, blocksize, did_zero,
&xfs_dax_write_iomap_ops);
return iomap_truncate_page(inode, pos, blocksize, did_zero,
&xfs_buffered_write_iomap_ops);
diff --git a/include/linux/dax.h b/include/linux/dax.h
index 9d3e3327af4c..4aa8ef7c8fd4 100644
--- a/include/linux/dax.h
+++ b/include/linux/dax.h
@@ -210,8 +210,8 @@ int dax_file_unshare(struct inode *inode, loff_t pos, loff_t len,
const struct iomap_ops *ops);
int dax_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
const struct iomap_ops *ops);
-int dax_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
- const struct iomap_ops *ops);
+int dax_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
+ bool *did_zero, const struct iomap_ops *ops);
#if IS_ENABLED(CONFIG_DAX)
int dax_read_lock(void);
--
2.39.2
* [PATCH -next v5 4/8] xfs: refactor the truncating order
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
` (2 preceding siblings ...)
2024-06-13 9:00 ` [PATCH -next v5 3/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 5/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
` (3 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
When truncating down an inode, we call xfs_truncate_page() to zero out
the tail partial block beyond the new EOF, which prevents exposing
stale data. But xfs_truncate_page() always assumes the blocksize is
i_blocksize(inode), which is not true if the file has a large
allocation unit that we should align to; e.g. a realtime inode should
be aligned to the rtextsize.
The current xfs_setattr_size() can't zero out a large alignment size on
truncate down because the processing order is wrong: we first zero
out through xfs_truncate_page(), and then immediately update the inode
size through truncate_setsize(). If the zeroed range is larger than a
folio, the writeback path will not write back zeroed pagecache beyond
the EOF folio, so it doesn't write zeroes to the entire tail extent and
could expose stale data after an appending write into the next aligned
extent.
Adjust the order to zero out the tail aligned blocks, write back the
zeroed or cached data, update i_size, and then drop all the pagecache
beyond the allocation unit containing EOF, preparing for the realtime
inode fix and for the upcoming forced alignment feature.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/xfs/xfs_iomap.c | 2 +-
fs/xfs/xfs_iomap.h | 3 +-
fs/xfs/xfs_iops.c | 162 +++++++++++++++++++++++++++++----------------
3 files changed, 109 insertions(+), 58 deletions(-)
diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
index 8cdfcbb5baa7..0369b64cc3f4 100644
--- a/fs/xfs/xfs_iomap.c
+++ b/fs/xfs/xfs_iomap.c
@@ -1468,10 +1468,10 @@ int
xfs_truncate_page(
struct xfs_inode *ip,
loff_t pos,
+ unsigned int blocksize,
bool *did_zero)
{
struct inode *inode = VFS_I(ip);
- unsigned int blocksize = i_blocksize(inode);
if (IS_DAX(inode))
return dax_truncate_page(inode, pos, blocksize, did_zero,
diff --git a/fs/xfs/xfs_iomap.h b/fs/xfs/xfs_iomap.h
index 4da13440bae9..feb1610cb645 100644
--- a/fs/xfs/xfs_iomap.h
+++ b/fs/xfs/xfs_iomap.h
@@ -25,7 +25,8 @@ int xfs_bmbt_to_iomap(struct xfs_inode *ip, struct iomap *iomap,
int xfs_zero_range(struct xfs_inode *ip, loff_t pos, loff_t len,
bool *did_zero);
-int xfs_truncate_page(struct xfs_inode *ip, loff_t pos, bool *did_zero);
+int xfs_truncate_page(struct xfs_inode *ip, loff_t pos,
+ unsigned int blocksize, bool *did_zero);
static inline xfs_filblks_t
xfs_aligned_fsb_count(
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index ff222827e550..0919a42cceb6 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -792,6 +792,108 @@ xfs_setattr_nonsize(
return error;
}
+/*
+ * Zero and flush data on truncate.
+ *
+ * Zero out any data beyond EOF on a size-changing truncate, write back
+ * all cached data if we need to extend the ondisk EOF, and drop all the
+ * pagecache beyond the new EOF block.
+ */
+STATIC int
+xfs_setattr_truncate_data(
+ struct xfs_inode *ip,
+ xfs_off_t oldsize,
+ xfs_off_t newsize)
+{
+ struct inode *inode = VFS_I(ip);
+ bool did_zeroing = false;
+ bool extending_ondisk_eof;
+ unsigned int blocksize;
+ int error;
+
+ extending_ondisk_eof = newsize > ip->i_disk_size &&
+ oldsize != ip->i_disk_size;
+
+ /*
+ * Start with zeroing any data beyond EOF that we may expose on file
+ * extension, or zeroing out the rest of the block on a downward
+ * truncate.
+ *
+ * We've already locked out new page faults, so now we can safely call
+ * truncate_setsize() or truncate_pagecache() to remove pages from the
+ * page cache knowing they won't get refaulted until we drop the
+ * XFS_MMAPLOCK_EXCL after the extent manipulations are complete. The
+ * truncate_setsize() call also cleans partial EOF page PTEs on
+ * extending truncates and hence ensures sub-page block size filesystems
+ * are correctly handled, too.
+ */
+ if (newsize >= oldsize) {
+ /* File extension */
+ if (newsize != oldsize) {
+ trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
+ error = xfs_zero_range(ip, oldsize, newsize - oldsize,
+ &did_zeroing);
+ if (error)
+ return error;
+ }
+
+ truncate_setsize(inode, newsize);
+
+ /*
+ * We are going to log the inode size change in this transaction
+ * so any previous writes that are beyond the on disk EOF and
+ * the new EOF that have not been written out need to be written
+ * here. If we do not write the data out, we expose ourselves
+ * to the null files problem. Note that this includes any block
+ * zeroing we did above; otherwise those blocks may not be
+ * zeroed after a crash.
+ */
+ if (did_zeroing || extending_ondisk_eof) {
+ error = filemap_write_and_wait_range(inode->i_mapping,
+ ip->i_disk_size, newsize - 1);
+ if (error)
+ return error;
+ }
+ return 0;
+ }
+
+ /* Truncate down */
+ blocksize = i_blocksize(inode);
+
+ /*
+ * iomap won't detect a dirty page over an unwritten block (or a cow
+ * block over a hole) and subsequently skips zeroing the newly post-EOF
+ * portion of the page. Flush the new EOF to convert the block before
+ * the pagecache truncate.
+ */
+ error = filemap_write_and_wait_range(inode->i_mapping, newsize,
+ roundup_64(newsize, blocksize) - 1);
+ if (error)
+ return error;
+
+ error = xfs_truncate_page(ip, newsize, blocksize, &did_zeroing);
+ if (error)
+ return error;
+
+ if (did_zeroing || extending_ondisk_eof) {
+ error = filemap_write_and_wait_range(inode->i_mapping,
+ min_t(loff_t, ip->i_disk_size, newsize),
+ roundup_64(newsize, blocksize) - 1);
+ if (error)
+ return error;
+ }
+
+ /*
+ * Open code truncate_setsize(): update the incore i_size after flushing
+ * dirty tail pages to disk, don't zero out the partial EOF folio (which
+ * may contain already-zeroed tail blocks) again, and just drop all the
+ * pagecache beyond the allocation unit containing EOF.
+ */
+ i_size_write(inode, newsize);
+ truncate_pagecache(inode, roundup_64(newsize, blocksize));
+ return 0;
+}
+
/*
* Truncate file. Must have write permission and not be a directory.
*
@@ -811,7 +913,6 @@ xfs_setattr_size(
struct xfs_trans *tp;
int error;
uint lock_flags = 0;
- bool did_zeroing = false;
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
ASSERT(S_ISREG(inode->i_mode));
@@ -853,40 +954,7 @@ xfs_setattr_size(
* the transaction because the inode cannot be unlocked once it is a
* part of the transaction.
*
- * Start with zeroing any data beyond EOF that we may expose on file
- * extension, or zeroing out the rest of the block on a downward
- * truncate.
- */
- if (newsize > oldsize) {
- trace_xfs_zero_eof(ip, oldsize, newsize - oldsize);
- error = xfs_zero_range(ip, oldsize, newsize - oldsize,
- &did_zeroing);
- } else {
- /*
- * iomap won't detect a dirty page over an unwritten block (or a
- * cow block over a hole) and subsequently skips zeroing the
- * newly post-EOF portion of the page. Flush the new EOF to
- * convert the block before the pagecache truncate.
- */
- error = filemap_write_and_wait_range(inode->i_mapping, newsize,
- newsize);
- if (error)
- return error;
- error = xfs_truncate_page(ip, newsize, &did_zeroing);
- }
-
- if (error)
- return error;
-
- /*
- * We've already locked out new page faults, so now we can safely remove
- * pages from the page cache knowing they won't get refaulted until we
- * drop the XFS_MMAP_EXCL lock after the extent manipulations are
- * complete. The truncate_setsize() call also cleans partial EOF page
- * PTEs on extending truncates and hence ensures sub-page block size
- * filesystems are correctly handled, too.
- *
- * We have to do all the page cache truncate work outside the
+ * We also have to do all the page cache truncate work outside the
* transaction context as the "lock" order is page lock->log space
* reservation as defined by extent allocation in the writeback path.
* Hence a truncate can fail with ENOMEM from xfs_trans_alloc(), but
@@ -894,28 +962,10 @@ xfs_setattr_size(
* user visible changes). There's not much we can do about this, except
* to hope that the caller sees ENOMEM and retries the truncate
* operation.
- *
- * And we update in-core i_size and truncate page cache beyond newsize
- * before writeback the [i_disk_size, newsize] range, so we're
- * guaranteed not to write stale data past the new EOF on truncate down.
*/
- truncate_setsize(inode, newsize);
-
- /*
- * We are going to log the inode size change in this transaction so
- * any previous writes that are beyond the on disk EOF and the new
- * EOF that have not been written out need to be written here. If we
- * do not write the data out, we expose ourselves to the null files
- * problem. Note that this includes any block zeroing we did above;
- * otherwise those blocks may not be zeroed after a crash.
- */
- if (did_zeroing ||
- (newsize > ip->i_disk_size && oldsize != ip->i_disk_size)) {
- error = filemap_write_and_wait_range(VFS_I(ip)->i_mapping,
- ip->i_disk_size, newsize - 1);
- if (error)
- return error;
- }
+ error = xfs_setattr_truncate_data(ip, oldsize, newsize);
+ if (error)
+ return error;
error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
if (error)
--
2.39.2
* [PATCH -next v5 5/8] xfs: correct the truncate blocksize of realtime inode
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
` (3 preceding siblings ...)
2024-06-13 9:00 ` [PATCH -next v5 4/8] xfs: refactor the truncating order Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 6/8] xfs: reserve blocks for truncating large " Zhang Yi
` (2 subsequent siblings)
7 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
When doing an unaligned truncate down on a realtime file whose
sb_rextsize is bigger than one block, xfs_truncate_page() only zeroes
out the tail EOF block, which could expose stale data since commit
943bc0882ceb ("iomap: don't increase i_size if it's not a write
operation").
If we truncate a file that contains a large enough written extent:
     |<    rtext    >|<    rtext    >|
  ...WWWWWWWWWWWWWWWWWWWWWWWWWWWWWWWW
        ^ (new EOF)                  ^ old EOF
Since we only zero out the tail of the EOF block, and
xfs_itruncate_extents() unmaps whole aligned extents, we end up in this
state:
     |<    rtext    >|
  ...WWWzWWWWWWWWWWWW
        ^ new EOF
Then, if we do an extending write like this, the blocks in the previous
tail extent become stale:
     |<    rtext    >|
  ...WWWzSSSSSSSSSSSS..........WWWWWWWWWWWWWWWWW
        ^ old EOF    ^ append start            ^ new EOF
Fix this by zeroing out the tail allocation unit and also making sure
xfs_itruncate_extents() unmaps allocation-unit-aligned extents.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/xfs/xfs_inode.c | 3 ++-
fs/xfs/xfs_iops.c | 2 +-
2 files changed, 3 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 58fb7a5062e1..92daa2279053 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1511,7 +1511,8 @@ xfs_itruncate_extents_flags(
* We have to free all the blocks to the bmbt maximum offset, even if
* the page cache can't scale that far.
*/
- first_unmap_block = XFS_B_TO_FSB(mp, (xfs_ufsize_t)new_size);
+ first_unmap_block = XFS_B_TO_FSB(mp,
+ roundup_64(new_size, xfs_inode_alloc_unitsize(ip)));
if (!xfs_verify_fileoff(mp, first_unmap_block)) {
WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
return 0;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 0919a42cceb6..8e7e6c435fb3 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -858,7 +858,7 @@ xfs_setattr_truncate_data(
}
/* Truncate down */
- blocksize = i_blocksize(inode);
+ blocksize = xfs_inode_alloc_unitsize(ip);
/*
* iomap won't detect a dirty page over an unwritten block (or a cow
--
2.39.2
* [PATCH -next v5 6/8] xfs: reserve blocks for truncating large realtime inode
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
` (4 preceding siblings ...)
2024-06-13 9:00 ` [PATCH -next v5 5/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
2024-06-14 5:58 ` Christoph Hellwig
2024-06-13 9:00 ` [PATCH -next v5 7/8] xfs: speed up truncating down a big " Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 8/8] iomap: don't increase i_size in iomap_write_end() Zhang Yi
7 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
For a large realtime inode, __xfs_bunmapi() could split the tail
written extent and convert the part beyond the EOF block to unwritten,
but this doesn't work as expected on truncate down right now since the
block reservation is zero in xfs_setattr_size(). Fix this by reserving
XFS_DIOSTRAT_SPACE_RES blocks for large realtime inodes.
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/xfs/xfs_iops.c | 15 ++++++++++++++-
1 file changed, 14 insertions(+), 1 deletion(-)
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8e7e6c435fb3..8af13fd37f1b 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -17,6 +17,8 @@
#include "xfs_da_btree.h"
#include "xfs_attr.h"
#include "xfs_trans.h"
+#include "xfs_trans_space.h"
+#include "xfs_bmap_btree.h"
#include "xfs_trace.h"
#include "xfs_icache.h"
#include "xfs_symlink.h"
@@ -913,6 +915,7 @@ xfs_setattr_size(
struct xfs_trans *tp;
int error;
uint lock_flags = 0;
+ uint resblks = 0;
xfs_assert_ilocked(ip, XFS_IOLOCK_EXCL | XFS_MMAPLOCK_EXCL);
ASSERT(S_ISREG(inode->i_mode));
@@ -967,7 +970,17 @@ xfs_setattr_size(
if (error)
return error;
- error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, 0, 0, 0, &tp);
+ /*
+ * For realtime inode with more than one block rtextsize, we need the
+ * block reservation for bmap btree block allocations/splits that can
+ * happen since it could split the tail written extent and convert the
+ * right beyond EOF one to unwritten.
+ */
+ if (xfs_inode_has_bigrtalloc(ip))
+ resblks = XFS_DIOSTRAT_SPACE_RES(mp, 0);
+
+ error = xfs_trans_alloc(mp, &M_RES(mp)->tr_itruncate, resblks,
+ 0, 0, &tp);
if (error)
return error;
--
2.39.2
* [PATCH -next v5 7/8] xfs: speed up truncating down a big realtime inode
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
` (5 preceding siblings ...)
2024-06-13 9:00 ` [PATCH -next v5 6/8] xfs: reserve blocks for truncating large " Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
2024-06-14 6:08 ` Christoph Hellwig
2024-06-13 9:00 ` [PATCH -next v5 8/8] iomap: don't increase i_size in iomap_write_end() Zhang Yi
7 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
When truncating down a big realtime inode, zeroing out the entire
aligned EOF extent gets slower as the rtextsize increases. Fortunately,
__xfs_bunmapi() aligns the unmapped range to the rtextsize, splitting
and converting the blocks beyond EOF to unwritten. So speed this up by
adjusting the unit size to the filesystem block size when truncating
down a large realtime inode and letting __xfs_bunmapi() convert the
tail blocks to unwritten; this improves performance significantly.
# mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
-d rtinherit=1 -r extsize=$rtextsize /dev/pmem2s
# mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
# for i in {1..1000}; \
do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1024; done
# sync
# time for i in {1..1000}; \
do xfs_io -c "truncate 4k" /mnt/scratch/$i; done
rtextsize      8k      16k      32k      64k     256k    1024k
before:    9.601s  10.229s  11.153s  12.086s  12.259s  20.141s
after:     9.710s   9.642s   9.958s   9.441s  10.021s  10.526s
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/xfs/xfs_inode.c | 10 ++++++++--
fs/xfs/xfs_iops.c | 9 +++++++++
2 files changed, 17 insertions(+), 2 deletions(-)
diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
index 92daa2279053..5e837ed093b0 100644
--- a/fs/xfs/xfs_inode.c
+++ b/fs/xfs/xfs_inode.c
@@ -1487,6 +1487,7 @@ xfs_itruncate_extents_flags(
struct xfs_trans *tp = *tpp;
xfs_fileoff_t first_unmap_block;
int error = 0;
+ unsigned int unitsize = xfs_inode_alloc_unitsize(ip);
xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
if (atomic_read(&VFS_I(ip)->i_count))
@@ -1510,9 +1511,14 @@ xfs_itruncate_extents_flags(
*
* We have to free all the blocks to the bmbt maximum offset, even if
* the page cache can't scale that far.
+ *
+ * For a big realtime inode, don't align to the allocation unitsize;
+ * __xfs_bunmapi() will split the extent and convert the tail blocks to unwritten.
*/
- first_unmap_block = XFS_B_TO_FSB(mp,
- roundup_64(new_size, xfs_inode_alloc_unitsize(ip)));
+ if (xfs_inode_has_bigrtalloc(ip))
+ unitsize = i_blocksize(VFS_I(ip));
+ first_unmap_block = XFS_B_TO_FSB(mp, roundup_64(new_size, unitsize));
+
if (!xfs_verify_fileoff(mp, first_unmap_block)) {
WARN_ON_ONCE(first_unmap_block > XFS_MAX_FILEOFF);
return 0;
diff --git a/fs/xfs/xfs_iops.c b/fs/xfs/xfs_iops.c
index 8af13fd37f1b..1903c06d39bc 100644
--- a/fs/xfs/xfs_iops.c
+++ b/fs/xfs/xfs_iops.c
@@ -862,6 +862,15 @@ xfs_setattr_truncate_data(
/* Truncate down */
blocksize = xfs_inode_alloc_unitsize(ip);
+ /*
+ * For a big realtime inode, zeroing out the entire EOF extent gets
+ * slower as the rtextsize increases; speed it up by adjusting the
+ * blocksize to the filesystem blocksize and letting __xfs_bunmapi()
+ * split the extent and convert the tail blocks to unwritten.
+ */
+ if (xfs_inode_has_bigrtalloc(ip))
+ blocksize = i_blocksize(inode);
+
/*
* iomap won't detect a dirty page over an unwritten block (or a cow
* block over a hole) and subsequently skips zeroing the newly post-EOF
--
2.39.2
^ permalink raw reply related [flat|nested] 17+ messages in thread
* [PATCH -next v5 8/8] iomap: don't increase i_size in iomap_write_end()
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
` (6 preceding siblings ...)
2024-06-13 9:00 ` [PATCH -next v5 7/8] xfs: speed up truncating down a big " Zhang Yi
@ 2024-06-13 9:00 ` Zhang Yi
7 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2024-06-13 9:00 UTC (permalink / raw)
To: linux-xfs, linux-fsdevel
Cc: linux-kernel, djwong, hch, brauner, david, chandanbabu, jack,
yi.zhang, yi.zhang, chengzhihao1, yukuai3
From: Zhang Yi <yi.zhang@huawei.com>
This reverts commit '0841ea4a3b41 ("iomap: keep on increasing i_size in
iomap_write_end()")'.
Now that xfs zeroes out the tail blocks aligned to the allocation
unitsize and converts the tail blocks to unwritten when truncating down
a realtime inode, an unaligned truncate down can no longer expose stale
data, so we can go back to not increasing i_size for IOMAP_UNSHARE and
IOMAP_ZERO in iomap_write_end().
Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
---
fs/iomap/buffered-io.c | 53 +++++++++++++++++++++++-------------------
1 file changed, 29 insertions(+), 24 deletions(-)
diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index 4a23c3950a47..75360128f1da 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -891,37 +891,22 @@ static bool iomap_write_end(struct iomap_iter *iter, loff_t pos, size_t len,
size_t copied, struct folio *folio)
{
const struct iomap *srcmap = iomap_iter_srcmap(iter);
- loff_t old_size = iter->inode->i_size;
- size_t written;
if (srcmap->type == IOMAP_INLINE) {
iomap_write_end_inline(iter, folio, pos, copied);
- written = copied;
- } else if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
- written = block_write_end(NULL, iter->inode->i_mapping, pos,
- len, copied, &folio->page, NULL);
- WARN_ON_ONCE(written != copied && written != 0);
- } else {
- written = __iomap_write_end(iter->inode, pos, len, copied,
- folio) ? copied : 0;
+ return true;
}
- /*
- * Update the in-memory inode size after copying the data into the page
- * cache. It's up to the file system to write the updated size to disk,
- * preferably after I/O completion so that no stale data is exposed.
- * Only once that's done can we unlock and release the folio.
- */
- if (pos + written > old_size) {
- i_size_write(iter->inode, pos + written);
- iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
- }
- __iomap_put_folio(iter, pos, written, folio);
+ if (srcmap->flags & IOMAP_F_BUFFER_HEAD) {
+ size_t bh_written;
- if (old_size < pos)
- pagecache_isize_extended(iter->inode, old_size, pos);
+ bh_written = block_write_end(NULL, iter->inode->i_mapping, pos,
+ len, copied, &folio->page, NULL);
+ WARN_ON_ONCE(bh_written != copied && bh_written != 0);
+ return bh_written == copied;
+ }
- return written == copied;
+ return __iomap_write_end(iter->inode, pos, len, copied, folio);
}
static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
@@ -936,6 +921,7 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
do {
struct folio *folio;
+ loff_t old_size;
size_t offset; /* Offset into folio */
size_t bytes; /* Bytes to write to folio */
size_t copied; /* Bytes copied from user */
@@ -987,6 +973,23 @@ static loff_t iomap_write_iter(struct iomap_iter *iter, struct iov_iter *i)
written = iomap_write_end(iter, pos, bytes, copied, folio) ?
copied : 0;
+ /*
+ * Update the in-memory inode size after copying the data into
+ * the page cache. It's up to the file system to write the
+ * updated size to disk, preferably after I/O completion so that
+ * no stale data is exposed. Only once that's done can we
+ * unlock and release the folio.
+ */
+ old_size = iter->inode->i_size;
+ if (pos + written > old_size) {
+ i_size_write(iter->inode, pos + written);
+ iter->iomap.flags |= IOMAP_F_SIZE_CHANGED;
+ }
+ __iomap_put_folio(iter, pos, written, folio);
+
+ if (old_size < pos)
+ pagecache_isize_extended(iter->inode, old_size, pos);
+
cond_resched();
if (unlikely(written == 0)) {
/*
@@ -1357,6 +1360,7 @@ static loff_t iomap_unshare_iter(struct iomap_iter *iter)
bytes = folio_size(folio) - offset;
ret = iomap_write_end(iter, pos, bytes, bytes, folio);
+ __iomap_put_folio(iter, pos, bytes, folio);
if (WARN_ON_ONCE(!ret))
return -EIO;
@@ -1422,6 +1426,7 @@ static loff_t iomap_zero_iter(struct iomap_iter *iter, bool *did_zero)
folio_mark_accessed(folio);
ret = iomap_write_end(iter, pos, bytes, bytes, folio);
+ __iomap_put_folio(iter, pos, bytes, folio);
if (WARN_ON_ONCE(!ret))
return -EIO;
--
2.39.2
* Re: [PATCH -next v5 2/8] iomap: pass blocksize to iomap_truncate_page()
2024-06-13 9:00 ` [PATCH -next v5 2/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
@ 2024-06-14 5:56 ` Christoph Hellwig
0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-06-14 5:56 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3
On Thu, Jun 13, 2024 at 05:00:27PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> iomap_truncate_page() always assumes that the block size of the inode
> being truncated is i_blocksize(). This is not true for some filesystems,
> e.g. XFS does extent size alignment for realtime inodes. Drop this
> assumption and pass the block size for zeroing into
> iomap_truncate_page(), allowing filesystems to indicate the correct
> block size.
FYI, I still prefer to just kill iomap_truncate_page in its current
form instead.
> #include "../internal.h"
> @@ -1453,11 +1454,10 @@ iomap_zero_range(struct inode *inode, loff_t pos, loff_t len, bool *did_zero,
> EXPORT_SYMBOL_GPL(iomap_zero_range);
>
> int
> -iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> - const struct iomap_ops *ops)
> +iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
> + bool *did_zero, const struct iomap_ops *ops)
> {
> - unsigned int blocksize = i_blocksize(inode);
> - unsigned int off = pos & (blocksize - 1);
> + unsigned int off = rem_u64(pos, blocksize);
>
> /* Block boundary? Nothing to do */
> if (!off)
> diff --git a/fs/xfs/xfs_iomap.c b/fs/xfs/xfs_iomap.c
> index 378342673925..32306804b01b 100644
> --- a/fs/xfs/xfs_iomap.c
> +++ b/fs/xfs/xfs_iomap.c
> @@ -1471,10 +1471,11 @@ xfs_truncate_page(
> bool *did_zero)
> {
> struct inode *inode = VFS_I(ip);
> + unsigned int blocksize = i_blocksize(inode);
>
> if (IS_DAX(inode))
> return dax_truncate_page(inode, pos, did_zero,
> &xfs_dax_write_iomap_ops);
> - return iomap_truncate_page(inode, pos, did_zero,
> + return iomap_truncate_page(inode, pos, blocksize, did_zero,
> &xfs_buffered_write_iomap_ops);
> }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 6fc1c858013d..d67bf86ec582 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -273,8 +273,8 @@ int iomap_file_unshare(struct inode *inode, loff_t pos, loff_t len,
> const struct iomap_ops *ops);
> int iomap_zero_range(struct inode *inode, loff_t pos, loff_t len,
> bool *did_zero, const struct iomap_ops *ops);
> -int iomap_truncate_page(struct inode *inode, loff_t pos, bool *did_zero,
> - const struct iomap_ops *ops);
> +int iomap_truncate_page(struct inode *inode, loff_t pos, unsigned int blocksize,
> + bool *did_zero, const struct iomap_ops *ops);
> vm_fault_t iomap_page_mkwrite(struct vm_fault *vmf,
> const struct iomap_ops *ops);
> int iomap_fiemap(struct inode *inode, struct fiemap_extent_info *fieinfo,
> --
> 2.39.2
>
>
---end quoted text---
* Re: [PATCH -next v5 6/8] xfs: reserve blocks for truncating large realtime inode
2024-06-13 9:00 ` [PATCH -next v5 6/8] xfs: reserve blocks for truncating large " Zhang Yi
@ 2024-06-14 5:58 ` Christoph Hellwig
0 siblings, 0 replies; 17+ messages in thread
From: Christoph Hellwig @ 2024-06-14 5:58 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3
On Thu, Jun 13, 2024 at 05:00:31PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> For a large realtime inode, __xfs_bunmapi() could split the tail written
> extent and convert the part beyond the EOF block to unwritten, but this
> doesn't work as expected on truncate down now because the block
> reservation in xfs_setattr_size() is zero. Fix this by reserving
> XFS_DIOSTRAT_SPACE_RES blocks for large realtime inodes.
Looks good:
Reviewed-by: Christoph Hellwig <hch@lst.de>
* Re: [PATCH -next v5 7/8] xfs: speed up truncating down a big realtime inode
2024-06-13 9:00 ` [PATCH -next v5 7/8] xfs: speed up truncating down a big " Zhang Yi
@ 2024-06-14 6:08 ` Christoph Hellwig
2024-06-14 7:18 ` Zhang Yi
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-06-14 6:08 UTC (permalink / raw)
To: Zhang Yi
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, hch, brauner,
david, chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3
On Thu, Jun 13, 2024 at 05:00:32PM +0800, Zhang Yi wrote:
> From: Zhang Yi <yi.zhang@huawei.com>
>
> If we truncate down a big realtime inode, zeroing out the entire aligned
> EOF extent gets slower as the rtextsize increases. Fortunately,
> __xfs_bunmapi() aligns the unmapped range to rtextsize, splitting and
> converting the blocks beyond EOF to unwritten. So speed this up by
> adjusting the unitsize to the filesystem blocksize when truncating down
> a large realtime inode, letting __xfs_bunmapi() convert the tail blocks
> to unwritten; this improves the performance significantly.
>
> # mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
> -d rtinherit=1 -r extsize=$rtextsize /dev/pmem2s
> # mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
> # for i in {1..1000}; \
> do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1024; done
> # sync
> # time for i in {1..1000}; \
> do xfs_io -c "truncate 4k" /mnt/scratch/$i; done
>
> rtextsize 8k 16k 32k 64k 256k 1024k
> before: 9.601s 10.229s 11.153s 12.086s 12.259s 20.141s
> after: 9.710s 9.642s 9.958s 9.441s 10.021s 10.526s
>
> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> ---
> fs/xfs/xfs_inode.c | 10 ++++++++--
> fs/xfs/xfs_iops.c | 9 +++++++++
> 2 files changed, 17 insertions(+), 2 deletions(-)
>
> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
> index 92daa2279053..5e837ed093b0 100644
> --- a/fs/xfs/xfs_inode.c
> +++ b/fs/xfs/xfs_inode.c
> @@ -1487,6 +1487,7 @@ xfs_itruncate_extents_flags(
> struct xfs_trans *tp = *tpp;
> xfs_fileoff_t first_unmap_block;
> int error = 0;
> + unsigned int unitsize = xfs_inode_alloc_unitsize(ip);
>
> xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
> if (atomic_read(&VFS_I(ip)->i_count))
> @@ -1510,9 +1511,14 @@ xfs_itruncate_extents_flags(
> *
> * We have to free all the blocks to the bmbt maximum offset, even if
> * the page cache can't scale that far.
> + *
> + * For a big realtime inode, don't align to the allocation unitsize,
> + * let __xfs_bunmapi() split the extent and convert the tail to unwritten.
> */
> + if (xfs_inode_has_bigrtalloc(ip))
> + unitsize = i_blocksize(VFS_I(ip));
> + first_unmap_block = XFS_B_TO_FSB(mp, roundup_64(new_size, unitsize));
If you expand what xfs_inode_alloc_unitsize and xfs_inode_has_bigrtalloc
do, this is looking a bit silly:
unsigned int blocks = 1;
if (XFS_IS_REALTIME_INODE(ip))
blocks = ip->i_mount->m_sb.sb_rextsize;
unitsize = XFS_FSB_TO_B(ip->i_mount, blocks);
if (XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1)
unitsize = i_blocksize(inode);
So I think we can simply drop this part now that the variant that zeroes
the entire rtextent is gone.
> @@ -862,6 +862,15 @@ xfs_setattr_truncate_data(
> /* Truncate down */
> blocksize = xfs_inode_alloc_unitsize(ip);
>
> + /*
> + * For a big realtime inode, zeroing out the entire EOF extent gets
> + * slower as the rtextsize increases; speed it up by adjusting the
> + * blocksize to the filesystem blocksize and letting __xfs_bunmapi()
> + * split the extent and convert the tail blocks to unwritten.
> + */
> + if (xfs_inode_has_bigrtalloc(ip))
> + blocksize = i_blocksize(inode);
Same here. And with that probably also the passing of the block size
to the truncate_page helpers.
* Re: [PATCH -next v5 7/8] xfs: speed up truncating down a big realtime inode
2024-06-14 6:08 ` Christoph Hellwig
@ 2024-06-14 7:18 ` Zhang Yi
2024-06-14 9:13 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2024-06-14 7:18 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner, david,
chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3, John Garry
On 2024/6/14 14:08, Christoph Hellwig wrote:
> On Thu, Jun 13, 2024 at 05:00:32PM +0800, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> If we truncate down a big realtime inode, zeroing out the entire aligned
>> EOF extent gets slower as the rtextsize increases. Fortunately,
>> __xfs_bunmapi() aligns the unmapped range to rtextsize, splitting and
>> converting the blocks beyond EOF to unwritten. So speed this up by
>> adjusting the unitsize to the filesystem blocksize when truncating down
>> a large realtime inode, letting __xfs_bunmapi() convert the tail blocks
>> to unwritten; this improves the performance significantly.
>>
>> # mkfs.xfs -f -rrtdev=/dev/pmem1s -f -m reflink=0,rmapbt=0, \
>> -d rtinherit=1 -r extsize=$rtextsize /dev/pmem2s
>> # mount -ortdev=/dev/pmem1s /dev/pmem2s /mnt/scratch
>> # for i in {1..1000}; \
>> do dd if=/dev/zero of=/mnt/scratch/$i bs=$rtextsize count=1024; done
>> # sync
>> # time for i in {1..1000}; \
>> do xfs_io -c "truncate 4k" /mnt/scratch/$i; done
>>
>> rtextsize 8k 16k 32k 64k 256k 1024k
>> before: 9.601s 10.229s 11.153s 12.086s 12.259s 20.141s
>> after: 9.710s 9.642s 9.958s 9.441s 10.021s 10.526s
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>> ---
>> fs/xfs/xfs_inode.c | 10 ++++++++--
>> fs/xfs/xfs_iops.c | 9 +++++++++
>> 2 files changed, 17 insertions(+), 2 deletions(-)
>>
>> diff --git a/fs/xfs/xfs_inode.c b/fs/xfs/xfs_inode.c
>> index 92daa2279053..5e837ed093b0 100644
>> --- a/fs/xfs/xfs_inode.c
>> +++ b/fs/xfs/xfs_inode.c
>> @@ -1487,6 +1487,7 @@ xfs_itruncate_extents_flags(
>> struct xfs_trans *tp = *tpp;
>> xfs_fileoff_t first_unmap_block;
>> int error = 0;
>> + unsigned int unitsize = xfs_inode_alloc_unitsize(ip);
>>
>> xfs_assert_ilocked(ip, XFS_ILOCK_EXCL);
>> if (atomic_read(&VFS_I(ip)->i_count))
>> @@ -1510,9 +1511,14 @@ xfs_itruncate_extents_flags(
>> *
>> * We have to free all the blocks to the bmbt maximum offset, even if
>> * the page cache can't scale that far.
>> + *
>> + * For a big realtime inode, don't align to the allocation unitsize,
>> + * let __xfs_bunmapi() split the extent and convert the tail to unwritten.
>> */
>> + if (xfs_inode_has_bigrtalloc(ip))
>> + unitsize = i_blocksize(VFS_I(ip));
>> + first_unmap_block = XFS_B_TO_FSB(mp, roundup_64(new_size, unitsize));
>
> If you expand what xfs_inode_alloc_unitsize and xfs_inode_has_bigrtalloc
> do, this is looking a bit silly:
>
> unsigned int blocks = 1;
>
> if (XFS_IS_REALTIME_INODE(ip))
> blocks = ip->i_mount->m_sb.sb_rextsize;
>
> unitsize = XFS_FSB_TO_B(ip->i_mount, blocks);
> if (XFS_IS_REALTIME_INODE(ip) && ip->i_mount->m_sb.sb_rextsize > 1)
> unitsize = i_blocksize(inode);
>
> So I think we can simply drop this part now that the variant that zeroes
> the entire rtextent is gone.
>
Thanks for your suggestion.
Yeah, we could fix the realtime inode problem by just dropping this part,
but for the upcoming forcealign and atomic write features from John, IIUC,
we can't split and convert the tail extent the way we do for RT inodes; we
should zero out the entire tail force-aligned extent, otherwise atomic
writes could be broken by submitting unaligned bios.
John has already expanded xfs_inode_alloc_unitsize() [1], so I think we
should keep this part for the forcealign feature and deal with realtime
inodes separately, is that right?
[1]
https://lore.kernel.org/linux-xfs/20240607143919.2622319-1-john.g.garry@oracle.com/T/#m73ccaa7b6fec9988f24b881e013fc367429405d6
https://lore.kernel.org/linux-xfs/20240607143919.2622319-1-john.g.garry@oracle.com/T/#m1a6312428e4addc4d0d260fbf33ad7bcffb98e0d
Thanks,
Yi.
>> @@ -862,6 +862,15 @@ xfs_setattr_truncate_data(
>> /* Truncate down */
>> blocksize = xfs_inode_alloc_unitsize(ip);
>>
>> + /*
>> + * For a big realtime inode, zeroing out the entire EOF extent gets
>> + * slower as the rtextsize increases; speed it up by adjusting the
>> + * blocksize to the filesystem blocksize and letting __xfs_bunmapi()
>> + * split the extent and convert the tail blocks to unwritten.
>> + */
>> + if (xfs_inode_has_bigrtalloc(ip))
>> + blocksize = i_blocksize(inode);
>
> Same here. And with that probably also the passing of the block size
> to the truncate_page helpers.
>
* Re: [PATCH -next v5 7/8] xfs: speed up truncating down a big realtime inode
2024-06-14 7:18 ` Zhang Yi
@ 2024-06-14 9:13 ` Christoph Hellwig
2024-06-15 11:44 ` Zhang Yi
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-06-14 9:13 UTC (permalink / raw)
To: Zhang Yi
Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-kernel, djwong,
brauner, david, chandanbabu, jack, yi.zhang, chengzhihao1,
yukuai3, John Garry
On Fri, Jun 14, 2024 at 03:18:07PM +0800, Zhang Yi wrote:
> Thanks for your suggestion.
>
> Yeah, we could fix the realtime inode problem by just dropping this part,
> but for the upcoming forcealign and atomic write features from John, IIUC,
> we can't split and convert the tail extent the way we do for RT inodes; we
> should zero out the entire tail force-aligned extent, otherwise atomic
> writes could be broken by submitting unaligned bios.
Let's worry about that if/when those actually land. I also see no
reason offhand why those couldn't just use the partial
convert-to-unwritten path (but I haven't looked into the details).
* Re: [PATCH -next v5 7/8] xfs: speed up truncating down a big realtime inode
2024-06-14 9:13 ` Christoph Hellwig
@ 2024-06-15 11:44 ` Zhang Yi
2024-06-17 6:59 ` Christoph Hellwig
0 siblings, 1 reply; 17+ messages in thread
From: Zhang Yi @ 2024-06-15 11:44 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner, david,
chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3, John Garry
On 2024/6/14 17:13, Christoph Hellwig wrote:
> On Fri, Jun 14, 2024 at 03:18:07PM +0800, Zhang Yi wrote:
>> Thanks for your suggestion.
>>
>> Yeah, we could fix the realtime inode problem by just dropping this part,
>> but for the upcoming forcealign and atomic write features from John, IIUC,
>> we can't split and convert the tail extent the way we do for RT inodes; we
>> should zero out the entire tail force-aligned extent, otherwise atomic
>> writes could be broken by submitting unaligned bios.
>
> Let's worry about that if/when those actually land.
OK, if we don't consider the upcoming forcealign and atomic write features,
I think only patch 6 is needed to fix the issue.
> I also see no
> reason offhand why those couldn't just use the partial
> convert-to-unwritten path (but I haven't looked into the details).
>
The reason the atomic write feature can't split and convert the tail extent
on truncate down is that the dio write iter loop will split an atomic dio
that covers the whole allocation unit (extsize), since there are two extents
in one allocation unit.
Please see this code:
__iomap_dio_rw()
{
...
while ((ret = iomap_iter(&iomi, ops)) > 0) {
iomi.processed = iomap_dio_iter(&iomi, dio);
...
}
The first loop iteration finds and submits the first extent and the second
iteration submits the second extent, which breaks the atomic property.
For example, if we have a file one extsize in length and we truncate down
and split the extent, the file becomes:
| forced extsize (one atomic IO unit) |
wwwwwwwwwwwwww+uuuuuuuuuuuuuuuuuuuuuuuuuuu
^new size A ^old size B
Then if we submit a DIO from 0 to B, xfs should submit it as one bio, but it
will submit two bios, since there are two extents. So, unless we find another
way to guarantee submitting one bio even when there are two extents in one
atomic write unit (which I guess may be complicated), we have to zero out the
whole unit when truncating down (I'd prefer this solution); we need to
consider this in the near future.
Thanks,
Yi.
* Re: [PATCH -next v5 7/8] xfs: speed up truncating down a big realtime inode
2024-06-15 11:44 ` Zhang Yi
@ 2024-06-17 6:59 ` Christoph Hellwig
2024-06-17 9:11 ` Zhang Yi
0 siblings, 1 reply; 17+ messages in thread
From: Christoph Hellwig @ 2024-06-17 6:59 UTC (permalink / raw)
To: Zhang Yi
Cc: Christoph Hellwig, linux-xfs, linux-fsdevel, linux-kernel, djwong,
brauner, david, chandanbabu, jack, yi.zhang, chengzhihao1,
yukuai3, John Garry
On Sat, Jun 15, 2024 at 07:44:21PM +0800, Zhang Yi wrote:
> The reason the atomic write feature can't split and convert the tail extent
> on truncate down is that the dio write iter loop will split an atomic dio
> that covers the whole allocation unit (extsize), since there are two
> extents in one allocation unit.
We could fix this by merging the two in iomap_begin, as the end_io
handler already deals with multiple ranges. But let's think of that
when the need actually arises.
* Re: [PATCH -next v5 7/8] xfs: speed up truncating down a big realtime inode
2024-06-17 6:59 ` Christoph Hellwig
@ 2024-06-17 9:11 ` Zhang Yi
0 siblings, 0 replies; 17+ messages in thread
From: Zhang Yi @ 2024-06-17 9:11 UTC (permalink / raw)
To: Christoph Hellwig
Cc: linux-xfs, linux-fsdevel, linux-kernel, djwong, brauner, david,
chandanbabu, jack, yi.zhang, chengzhihao1, yukuai3, John Garry
On 2024/6/17 14:59, Christoph Hellwig wrote:
> On Sat, Jun 15, 2024 at 07:44:21PM +0800, Zhang Yi wrote:
>> The reason the atomic write feature can't split and convert the tail extent
>> on truncate down is that the dio write iter loop will split an atomic dio
>> that covers the whole allocation unit (extsize), since there are two
>> extents in one allocation unit.
>
> We could fix this by merging the two in iomap_begin, as the end_io
> handler already deals with multiple ranges.
Yeah, that's one solution.
> But let's think of that when the need actually arises.
>
Sure, I will retest and submit only patches 6 & 8 to solve the current
issue in my next version.
Thanks,
Yi.
Thread overview: 17+ messages
2024-06-13 9:00 [PATCH -next v5 0/8] iomap/xfs: fix stale data exposure when truncating realtime inodes Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 1/8] math64: add rem_u64() to just return the remainder Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 2/8] iomap: pass blocksize to iomap_truncate_page() Zhang Yi
2024-06-14 5:56 ` Christoph Hellwig
2024-06-13 9:00 ` [PATCH -next v5 3/8] fsdax: pass blocksize to dax_truncate_page() Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 4/8] xfs: refactor the truncating order Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 5/8] xfs: correct the truncate blocksize of realtime inode Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 6/8] xfs: reserve blocks for truncating large " Zhang Yi
2024-06-14 5:58 ` Christoph Hellwig
2024-06-13 9:00 ` [PATCH -next v5 7/8] xfs: speed up truncating down a big " Zhang Yi
2024-06-14 6:08 ` Christoph Hellwig
2024-06-14 7:18 ` Zhang Yi
2024-06-14 9:13 ` Christoph Hellwig
2024-06-15 11:44 ` Zhang Yi
2024-06-17 6:59 ` Christoph Hellwig
2024-06-17 9:11 ` Zhang Yi
2024-06-13 9:00 ` [PATCH -next v5 8/8] iomap: don't increase i_size in iomap_write_end() Zhang Yi