From: "Darrick J. Wong" <djwong@kernel.org>
To: Long Li <leo.lilong@huawei.com>
Cc: brauner@kernel.org, cem@kernel.org, linux-xfs@vger.kernel.org,
linux-fsdevel@vger.kernel.org, yi.zhang@huawei.com,
houtao1@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH v5 1/2] iomap: fix zero padding data issue in concurrent append writes
Date: Wed, 27 Nov 2024 08:28:29 -0800 [thread overview]
Message-ID: <20241127162829.GY1926309@frogsfrogsfrogs> (raw)
In-Reply-To: <20241127063503.2200005-1-leo.lilong@huawei.com>
On Wed, Nov 27, 2024 at 02:35:02PM +0800, Long Li wrote:
> During concurrent append writes to XFS filesystem, zero padding data
> may appear in the file after power failure. This happens due to imprecise
> disk size updates when handling write completion.
>
> Consider this scenario with concurrent append writes to the same file:
>
> Thread 1: Thread 2:
> ------------ -----------
> write [A, A+B]
> update inode size to A+B
> submit I/O [A, A+BS]
> write [A+B, A+B+C]
> update inode size to A+B+C
> <I/O completes, updates disk size to min(A+B+C, A+BS)>
> <power failure>
>
> After reboot:
> 1) with A+B+C < A+BS, the file has zero padding in range [A+B, A+B+C]
>
> |< Block Size (BS) >|
> |DDDDDDDDDDDDDDDD0000000000000000|
> ^ ^ ^
> A A+B A+B+C
> (EOF)
>
> 2) with A+B+C > A+BS, the file has zero padding in range [A+B, A+BS]
>
> |< Block Size (BS) >|< Block Size (BS) >|
> |DDDDDDDDDDDDDDDD0000000000000000|00000000000000000000000000000000|
> ^ ^ ^ ^
> A A+B A+BS A+B+C
> (EOF)
>
> D = Valid Data
> 0 = Zero Padding
>
> The issue stems from the disk size being set to min(io_offset + io_size,
> inode->i_size) at I/O completion. Since io_offset + io_size is
> block-size granular, it may exceed the actual size of the valid file
> data. In the case of concurrent append writes, inode->i_size may also
> be larger than the range of valid file data actually written to disk,
> leading to inaccurate disk size updates.
>
> This patch modifies the meaning of io_size to represent the size of
> valid data within EOF in an ioend. If the ioend spans beyond i_size,
> io_size will be trimmed to provide the file with more accurate size
> information. This is particularly useful for on-disk size updates
> at completion time.
>
> After this change, ioends that span i_size will not grow or merge with
> other ioends in concurrent scenarios. However, the cases that need
> growth/merging rarely occur, and there seems to be no noticeable
> performance impact. Although rounding up io_size could enable ioend
> growth/merging in these scenarios, we decided to keep the code simple
> after discussion [1].
>
> Another benefit is that it makes the xfs_ioend_is_append() check more
> accurate, which can avoid unnecessary end-I/O callbacks into
> xfs_end_bio() in certain scenarios, such as repeated writes at the
> file tail that do not extend the file size.
>
> Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Er... filesystem iomap wasn't in 2.6.12, how did you come up with this?
At most this is a fix against a roughly 6.1 era kernel, right?
> Link[1]: https://patchwork.kernel.org/project/xfs/patch/20241113091907.56937-1-leo.lilong@huawei.com
> Signed-off-by: Long Li <leo.lilong@huawei.com>
> Reviewed-by: Brian Foster <bfoster@redhat.com>
> ---
> v4->v5: remove iomap_ioend_size_aligned() and don't round up io_size for
> ioend growth/merging to keep the code simple.
> fs/iomap/buffered-io.c | 10 ++++++++++
> include/linux/iomap.h | 2 +-
> 2 files changed, 11 insertions(+), 1 deletion(-)
>
> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
> index d42f01e0fc1c..dc360c8e5641 100644
> --- a/fs/iomap/buffered-io.c
> +++ b/fs/iomap/buffered-io.c
> @@ -1774,6 +1774,7 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
> {
> struct iomap_folio_state *ifs = folio->private;
> size_t poff = offset_in_folio(folio, pos);
> + loff_t isize = i_size_read(inode);
> int error;
>
> if (!wpc->ioend || !iomap_can_add_to_ioend(wpc, pos)) {
> @@ -1789,7 +1790,16 @@ static int iomap_add_to_ioend(struct iomap_writepage_ctx *wpc,
>
> if (ifs)
> atomic_add(len, &ifs->write_bytes_pending);
> +
> + /*
> + * If the ioend spans i_size, trim io_size to the former to provide
> + * the fs with more accurate size information. This is useful for
> + * completion time on-disk size updates.
I think it's useful to preserve the diagram showing exactly what problem
you're solving:
/*
* Clamp io_offset and io_size to the incore EOF so that ondisk
* file size updates in the ioend completion are byte-accurate.
* This avoids recovering files with zeroed tail regions when
* writeback races with appending writes:
*
* Thread 1: Thread 2:
* ------------ -----------
* write [A, A+B]
* update inode size to A+B
* submit I/O [A, A+BS]
* write [A+B, A+B+C]
* update inode size to A+B+C
* <I/O completes, updates disk size to min(A+B+C, A+BS)>
* <power failure>
*
* After reboot:
* 1) with A+B+C < A+BS, the file has zero padding in range
* [A+B, A+B+C]
*
* |< Block Size (BS) >|
* |DDDDDDDDDDDD00000000000000|
* ^ ^ ^
* A A+B A+B+C
* (EOF)
*
* 2) with A+B+C > A+BS, the file has zero padding in range
* [A+B, A+BS]
*
* |< Block Size (BS) >|< Block Size (BS) >|
* |DDDDDDDDDDDD00000000000000|000000000000000000000000000|
* ^ ^ ^ ^
* A A+B A+BS A+B+C
* (EOF)
*
* D = Valid Data
* 0 = Zero Padding
*
* Note that this defeats the ability to chain the ioends of
* appending writes.
*/
(I reduced the blocksize a bit for wrapping purposes)
The logic looks ok, but I'm curious about how you landed at 2.6.12-rc
for the fixes tag.
--D
> + */
> wpc->ioend->io_size += len;
> + if (wpc->ioend->io_offset + wpc->ioend->io_size > isize)
> + wpc->ioend->io_size = isize - wpc->ioend->io_offset;
> +
> wbc_account_cgroup_owner(wbc, folio, len);
> return 0;
> }
> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
> index 5675af6b740c..75bf54e76f3b 100644
> --- a/include/linux/iomap.h
> +++ b/include/linux/iomap.h
> @@ -335,7 +335,7 @@ struct iomap_ioend {
> u16 io_type;
> u16 io_flags; /* IOMAP_F_* */
> struct inode *io_inode; /* file being written to */
> - size_t io_size; /* size of the extent */
> + size_t io_size; /* size of data within eof */
> loff_t io_offset; /* offset in the file */
> sector_t io_sector; /* start sector of ioend */
> struct bio io_bio; /* MUST BE LAST! */
> --
> 2.39.2
>