public inbox for linux-xfs@vger.kernel.org
* [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
@ 2024-11-27  6:35 Long Li
  2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
                   ` (3 more replies)
  0 siblings, 4 replies; 18+ messages in thread
From: Long Li @ 2024-11-27  6:35 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, yi.zhang, houtao1, leo.lilong,
	yangerkun

During concurrent append writes to an XFS filesystem, zero padding data
may appear in the file after a power failure. This happens due to imprecise
disk size updates when handling write completion.

Consider this scenario with concurrent append writes to the same file:

  Thread 1:                  Thread 2:
  ------------               -----------
  write [A, A+B]
  update inode size to A+B
  submit I/O [A, A+BS]
                             write [A+B, A+B+C]
                             update inode size to A+B+C
  <I/O completes, updates disk size to min(A+B+C, A+BS)>
  <power failure>

After reboot:
  1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]

  |<         Block Size (BS)      >|
  |DDDDDDDDDDDDDDDD0000000000000000|
  ^               ^        ^
  A              A+B     A+B+C
                         (EOF)

  2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]

  |<         Block Size (BS)      >|<           Block Size (BS)    >|
  |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
  ^               ^                ^               ^
  A              A+B              A+BS           A+B+C
                                  (EOF)

  D = Valid Data
  0 = Zero Padding

The issue stems from disk size being set to min(io_offset + io_size,
inode->i_size) at I/O completion. Since io_offset+io_size is block-size
granular, it may exceed the actual valid file data size. In
the case of concurrent append writes, inode->i_size may be larger
than the actual range of valid file data written to disk, leading to
inaccurate disk size updates.

This patch modifies the meaning of io_size to represent the size of
valid data within EOF in an ioend. If the ioend spans beyond i_size,
io_size will be trimmed to provide the file with more accurate size
information. This is particularly useful for on-disk size updates
at completion time.

After this change, ioends that span i_size will not grow or merge with
other ioends in concurrent scenarios. However, the cases that would need
growth/merging rarely occur, and no noticeable performance impact has been
observed.
Although rounding up io_size could enable ioend growth/merging in these
scenarios, we decided to keep the code simple after discussion [1].

Another benefit is that it makes the xfs_ioend_is_append() check more
accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
in certain scenarios, such as repeated writes at the file tail without
extending the file size.

Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com
Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Brian Foster <bfoster@redhat.com>
---
v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
	ioend growth/merging to keep the code simple. 
 fs/iomap/buffered-io.c | 10 ++++++++++
 include/linux/iomap.h  |  2 +-
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
index d42f01e0fc1c..dc360c8e5641 100644
--- a/fs/iomap/buffered-io.c
+++ b/fs/iomap/buffered-io.c
@@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 {
 	struct iomap_folio_state *ifs = folio->private;
 	size_t poff = offset_in_folio(folio, pos);
+	loff_t isize = i_size_read(inode);
 	int error;
 
 	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
@@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
 
 	if (ifs)
 		atomic_add(len, &ifs->write_bytes_pending);
+
+	/*
+	 * If the ioend spans i_size, trim io_size to the former to provide
+	 * the fs with more accurate size information. This is useful for
+	 * completion time on-disk size updates.
+	 */
 	wpc->ioend->io_size += len;
+	if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
+		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
+
 	wbc_account_cgroup_owner(wbc, folio, len);
 	return 0;
 }
diff --git a/include/linux/iomap.h b/include/linux/iomap.h
index 5675af6b740c..75bf54e76f3b 100644
--- a/include/linux/iomap.h
+++ b/include/linux/iomap.h
@@ -335,7 +335,7 @@ struct iomap_ioend {
 	u16			io_type;
 	u16			io_flags;	/* IOMAP_F_* */
 	struct inode		*io_inode;	/* file being written to */
-	size_t			io_size;	/* size of the extent */
+	size_t			io_size;	/* size of data within eof */
 	loff_t			io_offset;	/* offset in the file */
 	sector_t		io_sector;	/* start sector of ioend */
 	struct bio		io_bio;		/* MUST BE LAST! */
-- 
2.39.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables
  2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
@ 2024-11-27  6:35 ` Long Li
  2024-11-27 16:07   ` Darrick J. Wong
  2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 18+ messages in thread
From: Long Li @ 2024-11-27  6:35 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, yi.zhang, houtao1, leo.lilong,
	yangerkun

Use already initialized local variables 'offset' and 'size' instead
of accessing ioend members directly in xfs_setfilesize() call.

This is just a code cleanup with no functional changes.

Signed-off-by: Long Li <leo.lilong@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
---
v4->v5: No changes
 fs/xfs/xfs_aops.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
index 559a3a577097..67877c36ed11 100644
--- a/fs/xfs/xfs_aops.c
+++ b/fs/xfs/xfs_aops.c
@@ -131,7 +131,7 @@ xfs_end_ioend(
 		error = xfs_iomap_write_unwritten(ip, offset, size, false);
 
 	if (!error && xfs_ioend_is_append(ioend))
-		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
+		error = xfs_setfilesize(ip, offset, size);
 done:
 	iomap_finish_ioends(ioend, error);
 	memalloc_nofs_restore(nofs_flag);
-- 
2.39.2



* Re: [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables
  2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
@ 2024-11-27 16:07   ` Darrick J. Wong
  0 siblings, 0 replies; 18+ messages in thread
From: Darrick J. Wong @ 2024-11-27 16:07 UTC (permalink / raw)
  To: Long Li
  Cc: brauner, cem, linux-xfs, linux-fsdevel, yi.zhang, houtao1,
	yangerkun

On Wed, Nov 27, 2024 at 02:35:03PM +0800, Long Li wrote:
> Use already initialized local variables 'offset' and 'size' instead
> of accessing ioend members directly in xfs_setfilesize() call.
> 
> This is just a code cleanup with no functional changes.
> 
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> Reviewed-by: Christoph Hellwig <hch@lst.de>

Pretty straightforward cleanup there,
Reviewed-by: "Darrick J. Wong" <djwong@kernel.org>

--D

> ---
> v4->v5: No changes
>  fs/xfs/xfs_aops.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/xfs/xfs_aops.c b/fs/xfs/xfs_aops.c
> index 559a3a577097..67877c36ed11 100644
> --- a/fs/xfs/xfs_aops.c
> +++ b/fs/xfs/xfs_aops.c
> @@ -131,7 +131,7 @@ xfs_end_ioend(
>  		error = xfs_iomap_write_unwritten(ip, offset, size, false);
>  
>  	if (!error && xfs_ioend_is_append(ioend))
> -		error = xfs_setfilesize(ip, ioend->io_offset, ioend->io_size);
> +		error = xfs_setfilesize(ip, offset, size);
>  done:
>  	iomap_finish_ioends(ioend, error);
>  	memalloc_nofs_restore(nofs_flag);
> -- 
> 2.39.2
> 


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
  2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
@ 2024-11-27 16:28 ` Darrick J. Wong
  2024-11-28  6:38   ` Long Li
  2024-11-28  3:33 ` Christoph Hellwig
  2024-11-30 13:39 ` Long Li
  3 siblings, 1 reply; 18+ messages in thread
From: Darrick J. Wong @ 2024-11-27 16:28 UTC (permalink / raw)
  To: Long Li
  Cc: brauner, cem, linux-xfs, linux-fsdevel, yi.zhang, houtao1,
	yangerkun

On Wed, Nov 27, 2024 at 02:35:02PM +0800, Long Li wrote:
> During concurrent append writes to an XFS filesystem, zero padding data
> may appear in the file after a power failure. This happens due to imprecise
> disk size updates when handling write completion.
> 
> Consider this scenario with concurrent append writes to the same file:
> 
>   Thread 1:                  Thread 2:
>   ------------               -----------
>   write [A, A+B]
>   update inode size to A+B
>   submit I/O [A, A+BS]
>                              write [A+B, A+B+C]
>                              update inode size to A+B+C
>   <I/O completes, updates disk size to min(A+B+C, A+BS)>
>   <power failure>
> 
> After reboot:
>   1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]
> 
>   |<         Block Size (BS)      >|
>   |DDDDDDDDDDDDDDDD0000000000000000|
>   ^               ^        ^
>   A              A+B     A+B+C
>                          (EOF)
> 
>   2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]
> 
>   |<         Block Size (BS)      >|<           Block Size (BS)    >|
>   |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
>   ^               ^                ^               ^
>   A              A+B              A+BS           A+B+C
>                                   (EOF)
> 
>   D = Valid Data
>   0 = Zero Padding
> 
> The issue stems from disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset+io_size is block
> size granularity, it may exceed the actual valid file data size. In
> the case of concurrent append writes, inode->i_size may be larger
> than the actual range of valid file data written to disk, leading to
> inaccurate disk size updates.
> 
> This patch modifies the meaning of io_size to represent the size of
> valid data within EOF in an ioend. If the ioend spans beyond i_size,
> io_size will be trimmed to provide the file with more accurate size
> information. This is particularly useful for on-disk size updates
> at completion time.
> 
> After this change, ioends that span i_size will not grow or merge with
> other ioends in concurrent scenarios. However, the cases that would need
> growth/merging rarely occur, and no noticeable performance impact has been
> observed.
> Although rounding up io_size could enable ioend growth/merging in these
> scenarios, we decided to keep the code simple after discussion [1].
> 
> Another benefit is that it makes the xfs_ioend_is_append() check more
> accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
> in certain scenarios, such as repeated writes at the file tail without
> extending the file size.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")

Er... filesystem iomap wasn't in 2.6.12, how did you come up with this?
At most this is a fix against a roughly 6.1 era kernel, right?

> Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
> v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
> 	ioend growth/merging to keep the code simple. 
>  fs/iomap/buffered-io.c | 10 ++++++++++
>  include/linux/iomap.h  |  2 +-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d42f01e0fc1c..dc360c8e5641 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_folio_state *ifs = folio->private;
>  	size_t poff = offset_in_folio(folio, pos);
> +	loff_t isize = i_size_read(inode);
>  	int error;
>  
>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  
>  	if (ifs)
>  		atomic_add(len, &ifs->write_bytes_pending);
> +
> +	/*
> +	 * If the ioend spans i_size, trim io_size to the former to provide
> +	 * the fs with more accurate size information. This is useful for
> +	 * completion time on-disk size updates.

I think it's useful to preserve the diagram showing exactly what problem
you're solving:

	/*
	 * Clamp io_offset and io_size to the incore EOF so that ondisk
	 * file size updates in the ioend completion are byte-accurate.
	 * This avoids recovering files with zeroed tail regions when
	 * writeback races with appending writes:
	 *
	 *    Thread 1:                  Thread 2:
	 *    ------------               -----------
	 *    write [A, A+B]
	 *    update inode size to A+B
	 *    submit I/O [A, A+BS]
	 *                               write [A+B, A+B+C]
	 *                               update inode size to A+B+C
	 *    <I/O completes, updates disk size to min(A+B+C, A+BS)>
	 *    <power failure>
	 *
	 *  After reboot:
	 *    1) with A+B+C < A+BS, the file has zero padding in range
	 *       [A+B, A+B+C]
	 *
	 *    |<     Block Size (BS)    >|
	 *    |DDDDDDDDDDDD00000000000000|
	 *    ^           ^        ^
	 *    A          A+B     A+B+C
	 *                       (EOF)
	 *
	 *    2) with A+B+C > A+BS, the file has zero padding in range
	 *       [A+B, A+BS]
	 *
	 *    |<     Block Size (BS)    >|<      Block Size (BS)    >|
	 *    |DDDDDDDDDDDD00000000000000|000000000000000000000000000|
	 *    ^           ^              ^           ^
	 *    A          A+B            A+BS       A+B+C
	 *                              (EOF)
	 *
	 *    D = Valid Data
	 *    0 = Zero Padding
	 *
	 * Note that this defeats the ability to chain the ioends of
	 * appending writes.
	 */

(I reduced the blocksize a bit for wrapping purposes)

The logic looks ok, but I'm curious about how you landed at 2.6.12-rc
for the fixes tag.

--D

> +	 */
>  	wpc->ioend->io_size += len;
> +	if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
> +		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
> +
>  	wbc_account_cgroup_owner(wbc, folio, len);
>  	return 0;
>  }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5675af6b740c..75bf54e76f3b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -335,7 +335,7 @@ struct iomap_ioend {
>  	u16			io_type;
>  	u16			io_flags;	/* IOMAP_F_* */
>  	struct inode		*io_inode;	/* file being written to */
> -	size_t			io_size;	/* size of the extent */
> +	size_t			io_size;	/* size of data within eof */
>  	loff_t			io_offset;	/* offset in the file */
>  	sector_t		io_sector;	/* start sector of ioend */
>  	struct bio		io_bio;		/* MUST BE LAST! */
> -- 
> 2.39.2
> 


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
  2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
  2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
@ 2024-11-28  3:33 ` Christoph Hellwig
  2024-11-30 13:39 ` Long Li
  3 siblings, 0 replies; 18+ messages in thread
From: Christoph Hellwig @ 2024-11-28  3:33 UTC (permalink / raw)
  To: Long Li
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang, houtao1,
	yangerkun

I agree with Darrick that the graphic comment would be nice to keep,
but otherwise this looks good:

Reviewed-by: Christoph Hellwig <hch@lst.de>



* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
@ 2024-11-28  6:38   ` Long Li
  0 siblings, 0 replies; 18+ messages in thread
From: Long Li @ 2024-11-28  6:38 UTC (permalink / raw)
  To: Darrick J. Wong
  Cc: brauner, cem, linux-xfs, linux-fsdevel, yi.zhang, houtao1,
	yangerkun

On Wed, Nov 27, 2024 at 08:28:29AM -0800, Darrick J. Wong wrote:
> > @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
> >  
> >  	if (ifs)
> >  		atomic_add(len, &ifs->write_bytes_pending);
> > +
> > +	/*
> > +	 * If the ioend spans i_size, trim io_size to the former to provide
> > +	 * the fs with more accurate size information. This is useful for
> > +	 * completion time on-disk size updates.
> 
> I think it's useful to preserve the diagram showing exactly what problem
> you're solving:
> 
> 	/*
> 	 * Clamp io_offset and io_size to the incore EOF so that ondisk
> 	 * file size updates in the ioend completion are byte-accurate.
> 	 * This avoids recovering files with zeroed tail regions when
> 	 * writeback races with appending writes:
> 	 *
> 	 *    Thread 1:                  Thread 2:
> 	 *    ------------               -----------
> 	 *    write [A, A+B]
> 	 *    update inode size to A+B
> 	 *    submit I/O [A, A+BS]
> 	 *                               write [A+B, A+B+C]
> 	 *                               update inode size to A+B+C
> 	 *    <I/O completes, updates disk size to min(A+B+C, A+BS)>
> 	 *    <power failure>
> 	 *
> 	 *  After reboot:
> 	 *    1) with A+B+C < A+BS, the file has zero padding in range
> 	 *       [A+B, A+B+C]
> 	 *
> 	 *    |<     Block Size (BS)    >|
> 	 *    |DDDDDDDDDDDD00000000000000|
> 	 *    ^           ^        ^
> 	 *    A          A+B     A+B+C
> 	 *                       (EOF)
> 	 *
> 	 *    2) with A+B+C > A+BS, the file has zero padding in range
> 	 *       [A+B, A+BS]
> 	 *
> 	 *    |<     Block Size (BS)    >|<      Block Size (BS)    >|
> 	 *    |DDDDDDDDDDDD00000000000000|000000000000000000000000000|
> 	 *    ^           ^              ^           ^
> 	 *    A          A+B            A+BS       A+B+C
> 	 *                              (EOF)
> 	 *
> 	 *    D = Valid Data
> 	 *    0 = Zero Padding
> 	 *
> 	 * Note that this defeats the ability to chain the ioends of
> 	 * appending writes.
> 	 */
> 
> (I reduced the blocksize a bit for wrapping purposes)

Ok, I will update it.

> 
> The logic looks ok, but I'm curious about how you landed at 2.6.12-rc
> for the fixes tag.
> 
> --D

I see that io_size was introduced back in version 2.6. It's quite difficult
to determine the exact version where the issue was introduced, but I can
confirm it was present before version 4.19, as I can reproduce the issue
there. It predates the introduction of the iomap infrastructure, so how
about using the following Fixes tag?

Fixes: ae259a9c8593 ("fs: introduce iomap infrastructure") # goes further back than this

Thanks, 
Long Li


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
                   ` (2 preceding siblings ...)
  2024-11-28  3:33 ` Christoph Hellwig
@ 2024-11-30 13:39 ` Long Li
  2024-12-02 15:26   ` Brian Foster
  3 siblings, 1 reply; 18+ messages in thread
From: Long Li @ 2024-11-30 13:39 UTC (permalink / raw)
  To: brauner, djwong, cem
  Cc: linux-xfs, linux-fsdevel, yi.zhang, houtao1, yangerkun

On Wed, Nov 27, 2024 at 02:35:02PM +0800, Long Li wrote:
> During concurrent append writes to an XFS filesystem, zero padding data
> may appear in the file after a power failure. This happens due to imprecise
> disk size updates when handling write completion.
> 
> Consider this scenario with concurrent append writes to the same file:
> 
>   Thread 1:                  Thread 2:
>   ------------               -----------
>   write [A, A+B]
>   update inode size to A+B
>   submit I/O [A, A+BS]
>                              write [A+B, A+B+C]
>                              update inode size to A+B+C
>   <I/O completes, updates disk size to min(A+B+C, A+BS)>
>   <power failure>
> 
> After reboot:
>   1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]
> 
>   |<         Block Size (BS)      >|
>   |DDDDDDDDDDDDDDDD0000000000000000|
>   ^               ^        ^
>   A              A+B     A+B+C
>                          (EOF)
> 
>   2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]
> 
>   |<         Block Size (BS)      >|<           Block Size (BS)    >|
>   |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
>   ^               ^                ^               ^
>   A              A+B              A+BS           A+B+C
>                                   (EOF)
> 
>   D = Valid Data
>   0 = Zero Padding
> 
> The issue stems from disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset+io_size is block
> size granularity, it may exceed the actual valid file data size. In
> the case of concurrent append writes, inode->i_size may be larger
> than the actual range of valid file data written to disk, leading to
> inaccurate disk size updates.
> 
> This patch modifies the meaning of io_size to represent the size of
> valid data within EOF in an ioend. If the ioend spans beyond i_size,
> io_size will be trimmed to provide the file with more accurate size
> information. This is particularly useful for on-disk size updates
> at completion time.
> 
> After this change, ioends that span i_size will not grow or merge with
> other ioends in concurrent scenarios. However, the cases that would need
> growth/merging rarely occur, and no noticeable performance impact has been
> observed.
> Although rounding up io_size could enable ioend growth/merging in these
> scenarios, we decided to keep the code simple after discussion [1].
> 
> Another benefit is that it makes the xfs_ioend_is_append() check more
> accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
> in certain scenarios, such as repeated writes at the file tail without
> extending the file size.
> 
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
> v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
> 	ioend growth/merging to keep the code simple. 
>  fs/iomap/buffered-io.c | 10 ++++++++++
>  include/linux/iomap.h  |  2 +-
>  2 files changed, 11 insertions(+), 1 deletion(-)
> 
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d42f01e0fc1c..dc360c8e5641 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  {
>  	struct iomap_folio_state *ifs = folio->private;
>  	size_t poff = offset_in_folio(folio, pos);
> +	loff_t isize = i_size_read(inode);
>  	int error;
>  
>  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>  
>  	if (ifs)
>  		atomic_add(len, &ifs->write_bytes_pending);
> +
> +	/*
> +	 * If the ioend spans i_size, trim io_size to the former to provide
> +	 * the fs with more accurate size information. This is useful for
> +	 * completion time on-disk size updates.
> +	 */
>  	wpc->ioend->io_size += len;
> +	if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
> +		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
> +
 
When running fsstress tests with this patch set, there is a very low probability
of encountering an issue where isize is less than ioend->io_offset in
iomap_add_to_ioend. After investigation, this was found to be caused by races
with concurrent truncate operations.
Consider a scenario with 4K block size and a file size of 12K.

//write back [8K, 12K]           //truncate file to 4K
----------------------          ----------------------
iomap_writepage_map             xfs_setattr_size
  iomap_writepage_handle_eof
                                  truncate_setsize
				    i_size_write(inode, newsize)  //update inode size to 4K
  iomap_writepage_map_blocks
    iomap_add_to_ioend
           <isize < ioend->io_offset>
	   <isize = 4K, ioend->io_offset = 8K>

It appears that in extreme cases, folios beyond EOF might be written back,
resulting in situations where isize is less than pos. In such cases,
maybe we should not trim the io_size further.

Long Li



* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-11-30 13:39 ` Long Li
@ 2024-12-02 15:26   ` Brian Foster
  2024-12-03  2:08     ` Dave Chinner
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2024-12-02 15:26 UTC (permalink / raw)
  To: Long Li
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang, houtao1,
	yangerkun

On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> On Wed, Nov 27, 2024 at 02:35:02PM +0800, Long Li wrote:
> > During concurrent append writes to an XFS filesystem, zero padding data
> > may appear in the file after a power failure. This happens due to imprecise
> > disk size updates when handling write completion.
> > 
> > Consider this scenario with concurrent append writes to the same file:
> > 
> >   Thread 1:                  Thread 2:
> >   ------------               -----------
> >   write [A, A+B]
> >   update inode size to A+B
> >   submit I/O [A, A+BS]
> >                              write [A+B, A+B+C]
> >                              update inode size to A+B+C
> >   <I/O completes, updates disk size to min(A+B+C, A+BS)>
> >   <power failure>
> > 
> > After reboot:
> >   1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]
> > 
> >   |<         Block Size (BS)      >|
> >   |DDDDDDDDDDDDDDDD0000000000000000|
> >   ^               ^        ^
> >   A              A+B     A+B+C
> >                          (EOF)
> > 
> >   2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]
> > 
> >   |<         Block Size (BS)      >|<           Block Size (BS)    >|
> >   |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
> >   ^               ^                ^               ^
> >   A              A+B              A+BS           A+B+C
> >                                   (EOF)
> > 
> >   D = Valid Data
> >   0 = Zero Padding
> > 
> > The issue stems from disk size being set to min(io_offset + io_size,
> > inode->i_size) at I/O completion. Since io_offset+io_size is block
> > size granularity, it may exceed the actual valid file data size. In
> > the case of concurrent append writes, inode->i_size may be larger
> > than the actual range of valid file data written to disk, leading to
> > inaccurate disk size updates.
> > 
> > This patch modifies the meaning of io_size to represent the size of
> > valid data within EOF in an ioend. If the ioend spans beyond i_size,
> > io_size will be trimmed to provide the file with more accurate size
> > information. This is particularly useful for on-disk size updates
> > at completion time.
> > 
> > After this change, ioends that span i_size will not grow or merge with
> > other ioends in concurrent scenarios. However, the cases that would need
> > growth/merging rarely occur, and no noticeable performance impact has been
> > observed.
> > Although rounding up io_size could enable ioend growth/merging in these
> > scenarios, we decided to keep the code simple after discussion [1].
> > 
> > Another benefit is that it makes the xfs_ioend_is_append() check more
> > accurate, which can reduce unnecessary end bio callbacks of xfs_end_bio()
> > in certain scenarios, such as repeated writes at the file tail without
> > extending the file size.
> > 
> > Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
> > Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com
> > Signed-off-by: Long Li <leo.lilong@huawei.com>
> > Reviewed-by: Brian Foster <bfoster@redhat.com>
> > ---
> > v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
> > 	ioend growth/merging to keep the code simple. 
> >  fs/iomap/buffered-io.c | 10 ++++++++++
> >  include/linux/iomap.h  |  2 +-
> >  2 files changed, 11 insertions(+), 1 deletion(-)
> > 
> > diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> > index d42f01e0fc1c..dc360c8e5641 100644
> > --- a/fs/iomap/buffered-io.c
> > +++ b/fs/iomap/buffered-io.c
> > @@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
> >  {
> >  	struct iomap_folio_state *ifs = folio->private;
> >  	size_t poff = offset_in_folio(folio, pos);
> > +	loff_t isize = i_size_read(inode);
> >  	int error;
> >  
> >  	if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> > @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
> >  
> >  	if (ifs)
> >  		atomic_add(len, &ifs->write_bytes_pending);
> > +
> > +	/*
> > +	 * If the ioend spans i_size, trim io_size to the former to provide
> > +	 * the fs with more accurate size information. This is useful for
> > +	 * completion time on-disk size updates.
> > +	 */
> >  	wpc->ioend->io_size += len;
> > +	if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
> > +		wpc->ioend->io_size = isize - wpc->ioend->io_offset;
> > +
>  
> When running fsstress tests with this patch set, there is a very low probability
> of encountering an issue where isize is less than ioend->io_offset in
> iomap_add_to_ioend. After investigation, this was found to be caused by races
> with concurrent truncate operations.
> Consider a scenario with 4K block size and a file size of 12K.
> 
> //write back [8K, 12K]           //truncate file to 4K
> ----------------------          ----------------------
> iomap_writepage_map             xfs_setattr_size
>   iomap_writepage_handle_eof
>                                   truncate_setsize
> 				    i_size_write(inode, newsize)  //update inode size to 4K
>   iomap_writepage_map_blocks
>     iomap_add_to_ioend
>            <isize < ioend->io_offset>
> 	   <isize = 4K, ioend->io_offset = 8K>
> 
> It appears that in extreme cases, folios beyond EOF might be written back,
> resulting in situations where isize is less than pos. In such cases,
> maybe we should not trim the io_size further.
> 

Hmm.. it might be wise to characterize this further to determine whether
there are potentially larger problems to address before committing to
anything. For example, assuming truncate acquires ilock and does
xfs_itruncate_extents() and whatnot before this ioend submits/completes,
does anything in that submission or completion path detect and handle
this scenario gracefully? What if the ioend happens to be unwritten
post-eof preallocation and completion wants to convert blocks that might
no longer exist in the file..?

I don't see anything obvious on a quick look other than unwritten
conversion doesn't look like it would bump up i_size, which sounds sane,
but I could have easily missed something. If nobody else can point at
something, a way to instrument this might be to do something like:

1. add post-eof preallocation to a file
2. buffered write beyond eof
3. inject a delay in the writeback path somewhere after the writeback
eof checks
4. when writeback sits on that delay, truncate the file and try to
remove those extents before the ioend submits

Maybe something similar could be done to isolate the append ioend
completion scenario as well. Hm?

Brian

> Long Li
> 
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-02 15:26   ` Brian Foster
@ 2024-12-03  2:08     ` Dave Chinner
  2024-12-03 14:54       ` Brian Foster
  2024-12-04  9:00       ` Long Li
  0 siblings, 2 replies; 18+ messages in thread
From: Dave Chinner @ 2024-12-03  2:08 UTC (permalink / raw)
  To: Brian Foster
  Cc: Long Li, brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang,
	houtao1, yangerkun

On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > When performing fsstress tests with this patch set, there is a very low probability of
> > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend.
> > After investigation, this was found to be caused by a race with concurrent truncate operations.
> > Consider a scenario with a 4K block size and a file size of 12K.
> > 
> > //write back [8K, 12K]           //truncate file to 4K
> > ----------------------          ----------------------
> > iomap_writepage_map             xfs_setattr_size

folio is locked here

> >   iomap_writepage_handle_eof
> >                                   truncate_setsize
> > 				    i_size_write(inode, newsize)  //update inode size to 4K

truncate_setsize() is supposed to invalidate whole pages beyond
EOF before completing, yes?

/**
 * truncate_setsize - update inode and pagecache for a new file size
 * @inode: inode
 * @newsize: new file size
 *
 * truncate_setsize updates i_size and performs pagecache truncation (if
 * necessary) to @newsize. It will be typically be called from the filesystem's
 * setattr function when ATTR_SIZE is passed in.
 *
 * Must be called with a lock serializing truncates and writes (generally
 * i_rwsem but e.g. xfs uses a different lock) and before all filesystem
 * specific block truncation has been performed.
 */
void truncate_setsize(struct inode *inode, loff_t newsize)
{
        loff_t oldsize = inode->i_size;

        i_size_write(inode, newsize);
        if (newsize > oldsize)
                pagecache_isize_extended(inode, oldsize, newsize);
        truncate_pagecache(inode, newsize);
}
EXPORT_SYMBOL(truncate_setsize);

Note that this says "serialising truncates and writes" - the
emphasis needs to be placed on "writes" here, not "writeback". The
comment about XFS is also stale - it uses the i_rwsem here like
all other filesystems now.

The issue demonstrated above is -write back- racing against
truncate_setsize(), not writes. And -write back- is only serialised
against truncate_pagecache() by folio locks and state, not inode
locks. Hence any change to the inode size in truncate can and will
race with writeback in progress.

Hence writeback needs to be able to handle folios that end up beyond
EOF at any time during writeback. i.e. once we have a folio locked
in writeback and we've checked it against i_size_read() for validity,
it needs to be considered a valid offset all the way through to
IO completion.


> >   iomap_writepage_map_blocks
> >     iomap_add_to_ioend
> >            <isize < ioend->io_offset>
> > 	   <isize = 4K, ioend->io_offset = 8K>

Ah, so the bug fix adds a new call to i_size_read() in the IO
submission path? I suspect that is the underlying problem leading
to the observed behaviour....

> > 
> > It appears that in extreme cases, folios beyond EOF might be written back,
> > resulting in situations where isize is less than pos. In such cases,
> > maybe we should not trim the io_size further.
> > 
> 
> Hmm.. it might be wise to characterize this further to determine whether
> there are potentially larger problems to address before committing to
> anything. For example, assuming truncate acquires ilock and does
> xfs_itruncate_extents() and whatnot before this ioend submits/completes,

I don't think xfs_itruncate_extents() is the concern here - it only
runs after the ILOCK has been taken and the page cache state has
already been sorted out. truncate_setsize() does that for us;
it guarantees that all writeback in the truncated-down range has
been completed and the page cache invalidated.

We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
can be instantiated over the range whilst we are running
xfs_itruncate_extents(). Hence once truncate_setsize() returns, we
are guaranteed that there is no IO in progress, and none can be
started, over the range we are removing.

Really, the issue is that writeback mappings have to be able to
handle the range being mapped suddenly appearing to be beyond EOF.
This behaviour is a longstanding writeback constraint, and is what
iomap_writepage_handle_eof() is attempting to handle.

We handle this by only sampling i_size_read() whilst we have the
folio locked and can determine the action we should take with that
folio (i.e. nothing, partial zeroing, or skip altogether). Once
we've made the decision that the folio is within EOF and taken
action on it (i.e. moved the folio to writeback state), we cannot
then resample the inode size because a truncate may have started
and changed the inode size.

We have to complete the mapping of the folio to disk blocks - the
disk block mapping is guaranteed to be valid for the life of the IO
because the folio is locked and under writeback - and submit the IO
so that truncate_pagecache() will unblock and invalidate the folio
when the IO completes.

Hence writeback vs truncate serialisation is really dependent on
only sampling the inode size -once- whilst the dirty folio we are
writing back is locked.

I suspect that we can store and pass the sampled inode size through
the block mapping and ioend management code so it is constant for
the entire folio IO submission process, but whether we can do that
and still fix the original issue that we are trying to fix is not
something I've considered at this point....

> does anything in that submission or completion path detect and handle
> this scenario gracefully? What if the ioend happens to be unwritten
> post-eof preallocation and completion wants to convert blocks that might
> no longer exist in the file..?

That can't happen because writeback must complete before
truncate_setsize() can remove the pages from the cache, which in
turn happens before xfs_itruncate_extents() can run.

-Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-03  2:08     ` Dave Chinner
@ 2024-12-03 14:54       ` Brian Foster
  2024-12-03 21:12         ` Dave Chinner
  2024-12-04  9:06         ` Long Li
  2024-12-04  9:00       ` Long Li
  1 sibling, 2 replies; 18+ messages in thread
From: Brian Foster @ 2024-12-03 14:54 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Long Li, brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang,
	houtao1, yangerkun

On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > When performing fsstress tests with this patch set, there is a very low probability of
> > > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend.
> > > After investigation, this was found to be caused by a race with concurrent truncate operations.
> > > Consider a scenario with a 4K block size and a file size of 12K.
> > > 
> > > //write back [8K, 12K]           //truncate file to 4K
> > > ----------------------          ----------------------
> > > iomap_writepage_map             xfs_setattr_size
> 
> folio is locked here
> 
> > >   iomap_writepage_handle_eof
> > >                                   truncate_setsize
> > > 				    i_size_write(inode, newsize)  //update inode size to 4K
> 
...
> > > 
> > > It appears that in extreme cases, folios beyond EOF might be written back,
> > > resulting in situations where isize is less than pos. In such cases,
> > > maybe we should not trim the io_size further.
> > > 
> > 
> > Hmm.. it might be wise to characterize this further to determine whether
> > there are potentially larger problems to address before committing to
> > anything. For example, assuming truncate acquires ilock and does
> > xfs_itruncate_extents() and whatnot before this ioend submits/completes,
> 
> I don't think xfs_itruncate_extents() is the concern here - that
> happens after the page cache and writeback has been sorted out and
> the ILOCK has been taken and the page cache state should
> have already been sorted out. truncate_setsize() does that for us;
> it guarantees that all writeback in the truncate down range has
> been completed and the page cache invalidated.
> 

Ah this is what I was missing on previous read through.
truncate_inode_pages_range() waits on writeback in its second pass. I
was initially confused by what would have prevented removing a writeback
folio in the first pass, but that is handled a level down in
find_lock_entries() where it skips locked||writeback folios and thus
leaves them for the second pass.

> We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> can be instantiated over the range whilst we are running
> xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> are guaranteed that there will be no IO in progress or can be
> started over the range we are removing.
> 
> Really, the issue is that writeback mappings have to be able to
> handle the range being mapped suddenly appear to be beyond EOF.
> This behaviour is a longstanding writeback constraint, and is what
> iomap_writepage_handle_eof() is attempting to handle.
> 
> We handle this by only sampling i_size_read() whilst we have the
> folio locked and can determine the action we should take with that
> folio (i.e. nothing, partial zeroing, or skip altogether). Once
> we've made the decision that the folio is within EOF and taken
> action on it (i.e. moved the folio to writeback state), we cannot
> then resample the inode size because a truncate may have started
> and changed the inode size.
> 
> We have to complete the mapping of the folio to disk blocks - the
> disk block mapping is guaranteed to be valid for the life of the IO
> because the folio is locked and under writeback - and submit the IO
> so that truncate_pagecache() will unblock and invalidate the folio
> when the IO completes.
> 
> Hence writeback vs truncate serialisation is really dependent on
> only sampling the inode size -once- whilst the dirty folio we are
> writing back is locked.
> 

Not sure I see how this is a serialization dependency given that
writeback completion also samples i_size. But no matter, it seems a
reasonable implementation to me to make the submission path consistent
in handling eof.

I wonder if this could just use end_pos returned from
iomap_writepage_handle_eof()?

Brian

> I suspect that we can store and pass the sampled inode size through
> the block mapping and ioend management code so it is constant for
> the entire folio IO submission process, but whether we can do that
> and still fix the original issue that we are trying to fix is not
> something I've considered at this point....
> 
> > does anything in that submission or completion path detect and handle
> > this scenario gracefully? What if the ioend happens to be unwritten
> > post-eof preallocation and completion wants to convert blocks that might
> > no longer exist in the file..?
> 
> That can't happen because writeback must complete before
> truncate_setsize() will be allowed to remove the pages from the
> cache before xfs_itruncate_extents() can run.
> 
> -Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-03 14:54       ` Brian Foster
@ 2024-12-03 21:12         ` Dave Chinner
  2024-12-04 12:05           ` Brian Foster
  2024-12-04  9:06         ` Long Li
  1 sibling, 1 reply; 18+ messages in thread
From: Dave Chinner @ 2024-12-03 21:12 UTC (permalink / raw)
  To: Brian Foster
  Cc: Long Li, brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang,
	houtao1, yangerkun

On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > can be instantiated over the range whilst we are running
> > xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> > are guaranteed that there will be no IO in progress or can be
> > started over the range we are removing.
> > 
> > Really, the issue is that writeback mappings have to be able to
> > handle the range being mapped suddenly appear to be beyond EOF.
> > This behaviour is a longstanding writeback constraint, and is what
> > iomap_writepage_handle_eof() is attempting to handle.
> > 
> > We handle this by only sampling i_size_read() whilst we have the
> > folio locked and can determine the action we should take with that
> > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > we've made the decision that the folio is within EOF and taken
> > action on it (i.e. moved the folio to writeback state), we cannot
> > then resample the inode size because a truncate may have started
> > and changed the inode size.
> > 
> > We have to complete the mapping of the folio to disk blocks - the
> > disk block mapping is guaranteed to be valid for the life of the IO
> > because the folio is locked and under writeback - and submit the IO
> > so that truncate_pagecache() will unblock and invalidate the folio
> > when the IO completes.
> > 
> > Hence writeback vs truncate serialisation is really dependent on
> > only sampling the inode size -once- whilst the dirty folio we are
> > writing back is locked.
> > 
> 
> Not sure I see how this is a serialization dependency given that
> writeback completion also samples i_size.

Ah, I didn't explain what I meant very clearly, did I?

What I meant was that we can't sample i_size in the IO path without
specific checking/serialisation against truncate operations. And
that means once we have partially zeroed the contents of an
EOF-straddling folio, we can't then sample the EOF again to determine
the length of valid data in the folio, as this can race with truncate
and result in a different size for the data in the folio than we
prepared it for.

> But no matter, it seems a
> reasonable implementation to me to make the submission path consistent
> in handling eof.

Yes, the IO completion path does sample it again via xfs_new_eof().
However, as per above, it has specific checking for truncate down
races and handles them:

/*
 * If this I/O goes past the on-disk inode size update it unless it would
 * be past the current in-core inode size.
 */
static inline xfs_fsize_t
xfs_new_eof(struct xfs_inode *ip, xfs_fsize_t new_size)
{
        xfs_fsize_t i_size = i_size_read(VFS_I(ip));

>>>>    if (new_size > i_size || new_size < 0)
>>>>            new_size = i_size;
        return new_size > ip->i_disk_size ? new_size : 0;
}

If we have a truncate_setsize() called for a truncate down whilst
this IO is in progress, then xfs_new_eof() will see the new, smaller
inode isize. The clamp on new_size handles this situation, and we
then only trigger an update if the on-disk size is still smaller
than the new truncated size (i.e. the IO being completed is still
partially within the new EOF from the truncate down).

So I don't think there's an issue here at all at IO completion;
it handles truncate down races cleanly...

> I wonder if this could just use end_pos returned from
> iomap_writepage_handle_eof()?

Yeah, that was what I was thinking, but I haven't looked at the code
for long enough to have any real idea of whether that is sufficient
or not.

Cheers,

Dave.
-- 
Dave Chinner
david@fromorbit.com


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-03  2:08     ` Dave Chinner
  2024-12-03 14:54       ` Brian Foster
@ 2024-12-04  9:00       ` Long Li
  2024-12-04 12:17         ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Long Li @ 2024-12-04  9:00 UTC (permalink / raw)
  To: Dave Chinner, Brian Foster
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang, houtao1,
	yangerkun

On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > When performing fsstress tests with this patch set, there is a very low probability of
> > > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend.
> > > After investigation, this was found to be caused by a race with concurrent truncate operations.
> > > Consider a scenario with a 4K block size and a file size of 12K.
> > > 
> > > //write back [8K, 12K]           //truncate file to 4K
> > > ----------------------          ----------------------
> > > iomap_writepage_map             xfs_setattr_size
> 
> folio is locked here
> 
> > >   iomap_writepage_handle_eof
> > >                                   truncate_setsize
> > > 				    i_size_write(inode, newsize)  //update inode size to 4K
> 
> truncate_setsize() is supposed to invalidate whole pages beyond
> EOF before completing, yes?
> 
> /**
>  * truncate_setsize - update inode and pagecache for a new file size
>  * @inode: inode
>  * @newsize: new file size
>  *
>  * truncate_setsize updates i_size and performs pagecache truncation (if
>  * necessary) to @newsize. It will be typically be called from the filesystem's
>  * setattr function when ATTR_SIZE is passed in.
>  *
>  * Must be called with a lock serializing truncates and writes (generally
>  * i_rwsem but e.g. xfs uses a different lock) and before all filesystem
>  * specific block truncation has been performed.
>  */
> void truncate_setsize(struct inode *inode, loff_t newsize)
> {
>         loff_t oldsize = inode->i_size;
> 
>         i_size_write(inode, newsize);
>         if (newsize > oldsize)
>                 pagecache_isize_extended(inode, oldsize, newsize);
>         truncate_pagecache(inode, newsize);
> }
> EXPORT_SYMBOL(truncate_setsize);
> 
> Note that this says "serialising truncates and writes" - the
> emphasis needs to be placed on "writes" here, not "writeback". The
> comment about XFS is also stale - it uses the i_rwsem here like
> all other filesystems now.
> 
> The issue demonstrated above is -write back- racing against
> truncate_setsize(), not writes. And -write back- is only serialised
> against truncate_pagecache() by folio locks and state, not inode
> locks. hence any change to the inode size in truncate can and will
> race with writeback in progress.
> 
> Hence writeback needs to be able to handle folios end up beyond
> EOF at any time during writeback. i.e. once we have a folio locked
> in writeback and we've checked against i_size_read() for validity,
> it needs to be considered a valid offset all the way through to
> IO completion.
> 
> 
> > >   iomap_writepage_map_blocks
> > >     iomap_add_to_ioend
> > >            <isize < ioend->io_offset>
> > > 	   <isize = 4K, ioend->io_offset = 8K>
> 
> Ah, so the bug fix adds a new call to i_size_read() in the IO
> submission path? I suspect that is the underlying problem leading
> to the observed behaviour....
> 
> > > 
> > > It appears that in extreme cases, folios beyond EOF might be written back,
> > > resulting in situations where isize is less than pos. In such cases,
> > > maybe we should not trim the io_size further.
> > > 
> > 
> > Hmm.. it might be wise to characterize this further to determine whether
> > there are potentially larger problems to address before committing to
> > anything. For example, assuming truncate acquires ilock and does
> > xfs_itruncate_extents() and whatnot before this ioend submits/completes,
> 
> I don't think xfs_itruncate_extents() is the concern here - that
> happens after the page cache and writeback has been sorted out and
> the ILOCK has been taken and the page cache state should
> have already been sorted out. truncate_setsize() does that for us;
> it guarantees that all writeback in the truncate down range has
> been completed and the page cache invalidated.
> 
> We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> can be instantiated over the range whilst we are running
> xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> are guaranteed that there will be no IO in progress or can be
> started over the range we are removing.
> 
> Really, the issue is that writeback mappings have to be able to
> handle the range being mapped suddenly appear to be beyond EOF.
> This behaviour is a longstanding writeback constraint, and is what
> iomap_writepage_handle_eof() is attempting to handle.
> 
> We handle this by only sampling i_size_read() whilst we have the
> folio locked and can determine the action we should take with that
> folio (i.e. nothing, partial zeroing, or skip altogether). Once
> we've made the decision that the folio is within EOF and taken
> action on it (i.e. moved the folio to writeback state), we cannot
> then resample the inode size because a truncate may have started
> and changed the inode size.
> 

My understanding is that the issue isn't that we can't sample the inode
size. The key point is that the writeback mapping must be able to handle
cases where the mapped range suddenly appears beyond EOF. If we can
handle such cases properly, wouldn't sampling still be possible?

Coming back to our current issue, during writeback mapping, we sample
the inode size to determine if the ioend is within EOF and attempt to
trim io_size. Concurrent truncate operations may update the inode size,
causing the writeback pos to end up beyond EOF. In such cases, we simply
don't trim io_size, which seems like a viable approach.

Thanks,
Long Li


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-03 14:54       ` Brian Foster
  2024-12-03 21:12         ` Dave Chinner
@ 2024-12-04  9:06         ` Long Li
  2024-12-04 12:05           ` Brian Foster
  1 sibling, 1 reply; 18+ messages in thread
From: Long Li @ 2024-12-04  9:06 UTC (permalink / raw)
  To: Brian Foster, Dave Chinner
  Cc: brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang, houtao1,
	yangerkun

On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> Not sure I see how this is a serialization dependency given that
> writeback completion also samples i_size. But no matter, it seems a
> reasonable implementation to me to make the submission path consistent
> in handling eof.
> 
> I wonder if this could just use end_pos returned from
> iomap_writepage_handle_eof()?
> 
> Brian
> 

It seems reasonable to me, but end_pos is block-size granular. We need
to pass in a more precise byte-granular end. 

Thanks,
Long Li


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-04  9:06         ` Long Li
@ 2024-12-04 12:05           ` Brian Foster
  2024-12-06  3:36             ` Long Li
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2024-12-04 12:05 UTC (permalink / raw)
  To: Long Li
  Cc: Dave Chinner, brauner, djwong, cem, linux-xfs, linux-fsdevel,
	yi.zhang, houtao1, yangerkun

On Wed, Dec 04, 2024 at 05:06:00PM +0800, Long Li wrote:
> On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> > Not sure I see how this is a serialization dependency given that
> > writeback completion also samples i_size. But no matter, it seems a
> > reasonable implementation to me to make the submission path consistent
> > in handling eof.
> > 
> > I wonder if this could just use end_pos returned from
> > iomap_writepage_handle_eof()?
> > 
> > Brian
> > 
> 
> It seems reasonable to me, but end_pos is block-size granular. We need
> to pass in a more precise byte-granular end. 

Well Ok, but _handle_eof() doesn't actually use the value itself so from
that standpoint I see no reason it couldn't at least return the
unaligned end pos. From there, it looks like we do still want the
rounded up value for the various ifs state management calls.

I can see a couple ways of doing that.. one is just align the value in
the caller and use the two variants appropriately. Since the ifs_
helpers all seem to be in byte units, another option could be to
sanitize the helpers to the appropriate start/end rounding internally.

Either of those probably warrant a separate prep patch or two rather
than being squashed in with the i_size fix, but otherwise I'm not seeing
much of a roadblock here. Am I missing something?

Brian

> 
> Thanks,
> Long Li
> 



* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-03 21:12         ` Dave Chinner
@ 2024-12-04 12:05           ` Brian Foster
  0 siblings, 0 replies; 18+ messages in thread
From: Brian Foster @ 2024-12-04 12:05 UTC (permalink / raw)
  To: Dave Chinner
  Cc: Long Li, brauner, djwong, cem, linux-xfs, linux-fsdevel, yi.zhang,
	houtao1, yangerkun

On Wed, Dec 04, 2024 at 08:12:05AM +1100, Dave Chinner wrote:
> On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> > On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > > can be instantiated over the range whilst we are running
> > > xfs_itruncate_extents(). hence once truncate_setsize() returns, we
> > > are guaranteed that there will be no IO in progress or can be
> > > started over the range we are removing.
> > > 
> > > Really, the issue is that writeback mappings have to be able to
> > > handle the range being mapped suddenly appear to be beyond EOF.
> > > This behaviour is a longstanding writeback constraint, and is what
> > > iomap_writepage_handle_eof() is attempting to handle.
> > > 
> > > We handle this by only sampling i_size_read() whilst we have the
> > > folio locked and can determine the action we should take with that
> > > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > > we've made the decision that the folio is within EOF and taken
> > > action on it (i.e. moved the folio to writeback state), we cannot
> > > then resample the inode size because a truncate may have started
> > > and changed the inode size.
> > > 
> > > We have to complete the mapping of the folio to disk blocks - the
> > > disk block mapping is guaranteed to be valid for the life of the IO
> > > because the folio is locked and under writeback - and submit the IO
> > > so that truncate_pagecache() will unblock and invalidate the folio
> > > when the IO completes.
> > > 
> > > Hence writeback vs truncate serialisation is really dependent on
> > > only sampling the inode size -once- whilst the dirty folio we are
> > > writing back is locked.
> > > 
> > 
> > Not sure I see how this is a serialization dependency given that
> > writeback completion also samples i_size.
> 
> Ah, I didn't explain what I meant very clearly, did I?
> 
> What I mean was we can't sample i_size in the IO path without
> specific checking/serialisation against truncate operations. And
> that means once we have partially zeroed the contents of a EOF
> straddling folio, we can't then sample the EOF again to determine
> the length of valid data in the folio as this can race with truncate
> and result in a different size for the data in the folio than we
> prepared it for.
> 

Ok, I think we're just saying the same thing using different words.

> > But no matter, it seems a
> > reasonable implementation to me to make the submission path consistent
> > in handling eof.
> 
> Yes, the IO completion path does sample it again via xfs_new_eof().
> However, as per above, it has specific checking for truncate down
> races and handles them:
> 
> /*
>  * If this I/O goes past the on-disk inode size update it unless it would
>  * be past the current in-core inode size.
>  */
> static inline xfs_fsize_t
> xfs_new_eof(struct xfs_inode *ip, xfs_fsize_t new_size)
> {
>         xfs_fsize_t i_size = i_size_read(VFS_I(ip));
> 
> >>>>    if (new_size > i_size || new_size < 0)
> >>>>            new_size = i_size;
>         return new_size > ip->i_disk_size ? new_size : 0;
> }
> 
> If we have a truncate_setsize() called for a truncate down whilst
> this IO is in progress, then xfs_new_eof() will see the new, smaller
> inode isize. The clamp on new_size handles this situation, and we
> then only triggers an update if the on-disk size is still smaller
> than the new truncated size (i.e. the IO being completed is still
> partially within the new EOF from the truncate down).
> 
> So I don't think there's an issue here at all at IO completion;
> it handles truncate down races cleanly...
> 

Agree.. this was kind of the point of the submit side trimming. I'm not
sure a second sample of i_size on submission for trimming purposes
affects this in any problematic way either.

Brian

> > I wonder if this could just use end_pos returned from
> > iomap_writepage_handle_eof()?
> 
> Yeah, that was what I was thinking, but I haven't looked at the code
> for long enough to have any real idea of whether that is sufficient
> or not.
> 
> Cheers,
> 
> Dave.
> -- 
> Dave Chinner
> david@fromorbit.com
> 



* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-04  9:00       ` Long Li
@ 2024-12-04 12:17         ` Brian Foster
  2024-12-05 12:47           ` Long Li
  0 siblings, 1 reply; 18+ messages in thread
From: Brian Foster @ 2024-12-04 12:17 UTC (permalink / raw)
  To: Long Li
  Cc: Dave Chinner, brauner, djwong, cem, linux-xfs, linux-fsdevel,
	yi.zhang, houtao1, yangerkun

On Wed, Dec 04, 2024 at 05:00:44PM +0800, Long Li wrote:
> On Tue, Dec 03, 2024 at 01:08:38PM +1100, Dave Chinner wrote:
> > On Mon, Dec 02, 2024 at 10:26:14AM -0500, Brian Foster wrote:
> > > On Sat, Nov 30, 2024 at 09:39:29PM +0800, Long Li wrote:
> > > > When performing fsstress tests with this patch set, there is a very low probability of
> > > > encountering an issue where isize is less than ioend->io_offset in iomap_add_to_ioend.
> > > > After investigation, this was found to be caused by a race with concurrent truncate operations.
> > > > Consider a scenario with a 4K block size and a file size of 12K.
> > > > 
> > > > //write back [8K, 12K]           //truncate file to 4K
> > > > ----------------------          ----------------------
> > > > iomap_writepage_map             xfs_setattr_size
> > 
> > folio is locked here
> > 
> > > >   iomap_writepage_handle_eof
> > > >                                   truncate_setsize
> > > > 				    i_size_write(inode, newsize)  //update inode size to 4K
> > 
> > truncate_setsize() is supposed to invalidate whole pages beyond
> > EOF before completing, yes?
> > 
> > /**
> >  * truncate_setsize - update inode and pagecache for a new file size
> >  * @inode: inode
> >  * @newsize: new file size
> >  *
> >  * truncate_setsize updates i_size and performs pagecache truncation (if
> >  * necessary) to @newsize. It will be typically be called from the filesystem's
> >  * setattr function when ATTR_SIZE is passed in.
> >  *
> >  * Must be called with a lock serializing truncates and writes (generally
> >  * i_rwsem but e.g. xfs uses a different lock) and before all filesystem
> >  * specific block truncation has been performed.
> >  */
> > void truncate_setsize(struct inode *inode, loff_t newsize)
> > {
> >         loff_t oldsize = inode->i_size;
> > 
> >         i_size_write(inode, newsize);
> >         if (newsize > oldsize)
> >                 pagecache_isize_extended(inode, oldsize, newsize);
> >         truncate_pagecache(inode, newsize);
> > }
> > EXPORT_SYMBOL(truncate_setsize);
> > 
> > Note that this says "serialising truncates and writes" - the
> > emphasis needs to be placed on "writes" here, not "writeback". The
> > comment about XFS is also stale - it uses the i_rwsem here like
> > all other filesystems now.
> > 
> > The issue demonstrated above is -write back- racing against
> > truncate_setsize(), not writes. And -write back- is only serialised
> > against truncate_pagecache() by folio locks and state, not inode
> > locks. Hence any change to the inode size in truncate can and will
> > race with writeback in progress.
> > 
> > Hence writeback needs to be able to handle folios that end up beyond
> > EOF at any time during writeback. i.e. once we have a folio locked
> > in writeback and we've checked against i_size_read() for validity,
> > it needs to be considered a valid offset all the way through to
> > IO completion.
> > 
> > 
> > > >   iomap_writepage_map_blocks
> > > >     iomap_add_to_ioend
> > > >            <isize < ioend->io_offset>
> > > > 	   <isize = 4K,  ioend->io_offset = 8K>
> > 
> > Ah, so the bug fix adds a new call to i_size_read() in the IO
> > submission path? I suspect that is the underlying problem leading
> > to the observed behaviour....
> > 
> > > > 
> > > > It appears that in extreme cases, folios beyond EOF might be written back,
> > > > resulting in situations where isize is less than pos. In such cases,
> > > > maybe we should not trim the io_size further.
> > > > 
> > > 
> > > Hmm.. it might be wise to characterize this further to determine whether
> > > there are potentially larger problems to address before committing to
> > > anything. For example, assuming truncate acquires ilock and does
> > > xfs_itruncate_extents() and whatnot before this ioend submits/completes,
> > 
> > I don't think xfs_itruncate_extents() is the concern here - that
> > happens after the ILOCK has been taken and the page cache and
> > writeback state have already been sorted out. truncate_setsize()
> > does that for us; it guarantees that all writeback in the
> > truncated-down range has been completed and the page cache
> > invalidated.
> > 
> > We hold the MMAP_LOCK (filemap_invalidate_lock()) so no new pages
> > can be instantiated over the range whilst we are running
> > xfs_itruncate_extents(). Hence once truncate_setsize() returns, we
> > are guaranteed that there will be no IO in progress, nor any that
> > can be started, over the range we are removing.
> > 
> > Really, the issue is that writeback mappings have to be able to
> > handle the range being mapped suddenly appear to be beyond EOF.
> > This behaviour is a longstanding writeback constraint, and is what
> > iomap_writepage_handle_eof() is attempting to handle.
> > 
> > We handle this by only sampling i_size_read() whilst we have the
> > folio locked and can determine the action we should take with that
> > folio (i.e. nothing, partial zeroing, or skip altogether). Once
> > we've made the decision that the folio is within EOF and taken
> > action on it (i.e. moved the folio to writeback state), we cannot
> > then resample the inode size because a truncate may have started
> > and changed the inode size.
> > 
> 
> My understanding is the issue isn't that we can't sample the inode size. 
> The key point is that writeback mapping must be able to handle cases where
> the mapped range suddenly appears beyond EOF. If we can handle such
> cases properly, wouldn't sampling still be possible?
> 

I think so. I wouldn't harp too much on this as I think we're tripping
over words. ISTM the critical thing is that the folio is handled
properly wrt the truncate operation, which should be facilitated by the
folio lock.

> Coming back to our current issue, during writeback mapping, we sample
> the inode size to determine if the ioend is within EOF and attempt to
> trim io_size. Concurrent truncate operations may update the inode size,
> causing the writeback pos to end up beyond EOF. In such cases, we simply
> don't trim io_size, which seems like a viable approach.
> 

Perhaps. I'm not claiming it isn't functional. But to Dave's (more
elaborated) point, and in light of the racy i_size issue you've
uncovered, what also bugs me about this is that it creates an internal
inconsistency in the submission codepath.

I.e., the top level code does one thing based on one value of i_size,
then the ioend construction does another, and the logic is not directly
correlated so there is no real guarantee changes in one area correlate
to the other. IME, this increases potential for future bugs and adds
maintenance burden.

A simple example to consider might be.. suppose sometime in the future
we determine there is a selective case where we do want to allow a
post-eof writeback. As of right now, all that really requires is
adjustment to the "handle_eof()" logic and the rest of the codepath does
the right thing agnostic to outside operations like truncate. I think
there's value if we can preserve that invariant going forward.

FWIW, I'm not objecting to the alternative if something in the above
reasoning is wrong. I'm just trying to prioritize keeping things simple
and maintainable, particularly since truncate is kind of a complicated
beast as it is.

Brian

> Thanks,
> Long Li
> 



* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-04 12:17         ` Brian Foster
@ 2024-12-05 12:47           ` Long Li
  0 siblings, 0 replies; 18+ messages in thread
From: Long Li @ 2024-12-05 12:47 UTC (permalink / raw)
  To: Brian Foster
  Cc: Dave Chinner, brauner, djwong, cem, linux-xfs, linux-fsdevel,
	yi.zhang, houtao1, yangerkun

On Wed, Dec 04, 2024 at 07:17:45AM -0500, Brian Foster wrote:
> > Coming back to our current issue, during writeback mapping, we sample
> > the inode size to determine if the ioend is within EOF and attempt to
> > trim io_size. Concurrent truncate operations may update the inode size,
> > causing the writeback pos to end up beyond EOF. In such cases, we simply
> > don't trim io_size, which seems like a viable approach.
> > 
> 
> > Perhaps. I'm not claiming it isn't functional. But to Dave's (more
> > elaborated) point, and in light of the racy i_size issue you've
> > uncovered, what also bugs me about this is that it creates an internal
> > inconsistency in the submission codepath.
> 
> I.e., the top level code does one thing based on one value of i_size,
> then the ioend construction does another, and the logic is not directly
> correlated so there is no real guarantee changes in one area correlate
> to the other. IME, this increases potential for future bugs and adds
> maintenance burden.
> 
> A simple example to consider might be.. suppose sometime in the future
> we determine there is a selective case where we do want to allow a
> post-eof writeback. As of right now, all that really requires is
> adjustment to the "handle_eof()" logic and the rest of the codepath does
> the right thing agnostic to outside operations like truncate. I think
> there's value if we can preserve that invariant going forward.
> 
> FWIW, I'm not objecting to the alternative if something in the above
> reasoning is wrong. I'm just trying to prioritize keeping things simple
> and maintainable, particularly since truncate is kind of a complicated
> beast as it is.
> 
> Brian
> 

Yes, I agree with you, thanks for the detailed explanation.

Thanks,
Long Li


* Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
  2024-12-04 12:05           ` Brian Foster
@ 2024-12-06  3:36             ` Long Li
  0 siblings, 0 replies; 18+ messages in thread
From: Long Li @ 2024-12-06  3:36 UTC (permalink / raw)
  To: Brian Foster
  Cc: Dave Chinner, brauner, djwong, cem, linux-xfs, linux-fsdevel,
	yi.zhang, houtao1, yangerkun

On Wed, Dec 04, 2024 at 07:05:01AM -0500, Brian Foster wrote:
> On Wed, Dec 04, 2024 at 05:06:00PM +0800, Long Li wrote:
> > On Tue, Dec 03, 2024 at 09:54:41AM -0500, Brian Foster wrote:
> > > Not sure I see how this is a serialization dependency given that
> > > writeback completion also samples i_size. But no matter, it seems a
> > > reasonable implementation to me to make the submission path consistent
> > > in handling eof.
> > > 
> > > I wonder if this could just use end_pos returned from
> > > iomap_writepage_handle_eof()?
> > > 
> > > Brian
> > > 
> > 
> > It seems reasonable to me, but end_pos is block-size granular. We need
> > to pass in a more precise byte-granular end. 
> 
> Well Ok, but _handle_eof() doesn't actually use the value itself so from
> that standpoint I see no reason it couldn't at least return the
> unaligned end pos. From there, it looks like we do still want the
> rounded up value for the various ifs state management calls.
> 
> I can see a couple ways of doing that.. one is just align the value in
> the caller and use the two variants appropriately. Since the ifs_
> helpers all seem to be in byte units, another option could be to
> sanitize the helpers to the appropriate start/end rounding internally.
> 
> Either of those probably warrant a separate prep patch or two rather
> than being squashed in with the i_size fix, but otherwise I'm not seeing
> much of a roadblock here. Am I missing something?
> 
> Brian
> 

Agreed. Looks like there are no other roadblocks. I'll update in the next
version. 

Best Regards
Long Li


end of thread, other threads:[~2024-12-06  3:39 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2024-11-27  6:35 [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Long Li
2024-11-27  6:35 ` [PATCH v5 2/2] xfs: clean up xfs_end_ioend() to reuse local variables Long Li
2024-11-27 16:07   ` Darrick J. Wong
2024-11-27 16:28 ` [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes Darrick J. Wong
2024-11-28  6:38   ` Long Li
2024-11-28  3:33 ` Christoph Hellwig
2024-11-30 13:39 ` Long Li
2024-12-02 15:26   ` Brian Foster
2024-12-03  2:08     ` Dave Chinner
2024-12-03 14:54       ` Brian Foster
2024-12-03 21:12         ` Dave Chinner
2024-12-04 12:05           ` Brian Foster
2024-12-04  9:06         ` Long Li
2024-12-04 12:05           ` Brian Foster
2024-12-06  3:36             ` Long Li
2024-12-04  9:00       ` Long Li
2024-12-04 12:17         ` Brian Foster
2024-12-05 12:47           ` Long Li
