Linux EXT4 FS development
 help / color / mirror / Atom feed
From: Zhang Yi <yizhang089@gmail.com>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Zhang Yi <yi.zhang@huaweicloud.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, libaokun@linux.alibaba.com,
	jack@suse.cz, ritesh.list@gmail.com, djwong@kernel.org,
	hch@infradead.org, yangerkun@huawei.com, yukuai@fnnas.com
Subject: Re: [PATCH v4 18/23] ext4: wait for ordered I/O in the iomap buffered I/O path
Date: Tue, 2 Jun 2026 11:22:12 +0800	[thread overview]
Message-ID: <ad963861-4ae3-4753-9415-793bbac00e06@gmail.com> (raw)
In-Reply-To: <ah3QlCt8V-3kVzW8@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 6/2/2026 2:33 AM, Ojaswin Mujoo wrote:
> On Sat, May 30, 2026 at 04:24:24PM +0800, Zhang Yi wrote:
>> On 5/30/2026 3:22 PM, Zhang Yi wrote:
>>> Hi, Ojaswin!
>>>
>>> On 5/27/2026 11:58 PM, Ojaswin Mujoo wrote:
>>>> On Mon, May 11, 2026 at 03:23:38PM +0800, Zhang Yi wrote:
>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>
>>>>> For append writes, wait for ordered I/O to complete before updating
>>>>> i_disksize. This ensures that zeroed data is flushed to disk before the
>>>>> metadata update, preventing stale data from being exposed during
>>>>> unaligned post-EOF append writes.
>>>>>
>>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>> ---
>>>>>   fs/ext4/ext4.h    | 11 +++++++
>>>>>   fs/ext4/inode.c   | 80 ++++++++++++++++++++++++++++++++++++++++++-----
>>>>>   fs/ext4/page-io.c | 60 +++++++++++++++++++++++++++++++++++
>>>>>   fs/ext4/super.c   | 23 ++++++++++----
>>>>>   4 files changed, 161 insertions(+), 13 deletions(-)
>>>>>
>>>>> diff --git a/fs/ext4/ext4.h b/fs/ext4/ext4.h
>>>>> index 078feda47e36..9ce2128eea3e 100644
>>>>> --- a/fs/ext4/ext4.h
>>>>> +++ b/fs/ext4/ext4.h
>>>>> @@ -1195,6 +1195,15 @@ struct ext4_inode_info {
>>>>>   #ifdef CONFIG_FS_ENCRYPTION
>>>>>   	struct fscrypt_inode_info *i_crypt_info;
>>>>>   #endif
>>>>> +
>>>>> +	/*
>>>>> +	 * Track ordered zeroed data during post-EOF append writes, fallocate,
>>>>> +	 * and truncate-up operations. These parameters are used only in the
>>>>> +	 * iomap buffered I/O path.
>>>>> +	 */
>>>>> +	ext4_lblk_t i_ordered_lblk;
>>>>> +	ext4_lblk_t i_ordered_len;
>>>>> +	wait_queue_head_t i_ordered_wq;
>>>>>   };
>>>>>   
>>>>>   /*
>>>>> @@ -3858,6 +3867,8 @@ extern int ext4_move_extents(struct file *o_filp, struct file *d_filp,
>>>>>   			     __u64 len, __u64 *moved_len);
>>>>>   
>>>>>   /* page-io.c */
>>>>> +#define EXT4_IOMAP_IOEND_ORDER_IO	1UL	/* This I/O is an ordered one */
>>>>> +
>>>>>   extern int __init ext4_init_pageio(void);
>>>>>   extern void ext4_exit_pageio(void);
>>>>>   extern ext4_io_end_t *ext4_init_io_end(struct inode *inode, gfp_t flags);
>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>> index e013aeb03d7b..11fb369efeb1 100644
>>>>> --- a/fs/ext4/inode.c
>>>>> +++ b/fs/ext4/inode.c
>>>>> @@ -4345,6 +4345,7 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>>   {
>>>>>   	struct iomap_ioend *ioend = wpc->wb_ctx;
>>>>>   	struct ext4_inode_info *ei = EXT4_I(ioend->io_inode);
>>>>> +	ext4_lblk_t start, end, order_lblk, order_len;
>>>>>   
>>>>>   	/*
>>>>>   	 * After I/O completion, a worker needs to be scheduled when:
>>>>> @@ -4357,6 +4358,30 @@ static int ext4_iomap_writeback_submit(struct iomap_writepage_ctx *wpc,
>>>>>   	    test_opt(ioend->io_inode->i_sb, DATA_ERR_ABORT))
>>>>>   		ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>>   
>>>>> +	/*
>>>>> +	 * Mark the I/O as ordered. Ordered I/O requires separate endio
>>>>> +	 * handling and must not be merged with regular I/O operations.
>>>>> +	 */
>>>>> +	order_len = READ_ONCE(ei->i_ordered_len);
>>>>> +	if (order_len) {
>>>>> +		/*
>>>>> +		 * Pair with smp_store_release() in ext4_block_zero_eof().
>>>>> +		 * Ensure we see the updated i_ordered_lblk that was written
>>>>> +		 * before the release store to i_ordered_len.
>>>>> +		 */
>>>>> +		smp_rmb();
>>>>> +		order_lblk = READ_ONCE(ei->i_ordered_lblk);
>>>>> +		start = ioend->io_offset >> ioend->io_inode->i_blkbits;
>>>>> +		end = EXT4_B_TO_LBLK(ioend->io_inode,
>>>>> +				     ioend->io_offset + ioend->io_size);
>>>>> +
>>>>> +		if (start <= order_lblk && end >= order_lblk + order_len) {
>>>>
>>>> Hi Zhang,
>>>>
>>>> I guess this check is enough cause ordered_lblk and ordered_len will
>>>> always be  contained in a single block.
>>>
>>> Yeah.
>>>
>>>>
>>>>> +			ioend->io_bio.bi_end_io = ext4_iomap_end_bio;
>>>>> +			ioend->io_private = (void *)EXT4_IOMAP_IOEND_ORDER_IO;
>>>>> +			ioend->io_flags |= IOMAP_IOEND_BOUNDARY;
>>>>
>>>> FWIU, we are wanting the ordered IO to not be merged and submitted asap
>>>> since we want to wake up the waiters. Is there any other reason?
>>>
>>> My original intention was to prevent the loss of the
>>> EXT4_IOMAP_IOEND_ORDER_IO flag during worker processing triggered by I/O
>>> completion, which could be caused by merging an ordered ioend with a
>>> normal ioend.  In patch 19, we need to determine the flag to update
>>> i_disksize to the correct position.
> 
> Ahh okay, we don't want the flag to be lost.
> 
>>>
>>>>
>>>> Adding the boundary in ->writeback_submit() only affects
>>>> iomap_ioend_can_merge() which happens after we have woken up the waiters
>>>> and deferred the IO to the wq. We ideally want it affect
>>>> iomap_can_add_to_ioend() ie we need to add IOMAP_F_BOUNDARY in
>>>> ->writeback_range().
>>>
>>> IIUC, merging into the same ioend during the submission stage doesn't
>>> seem to cause any problems.
> 
> Got it since the flag is set later. I was thinking we want to quickly
> issue the ordered IO to wake up the waiters and not waste time trying to
> merge it and hence we wanted to use that flag.
> 
>>>
>>>>
>>>> Secondly, I don't think boundary is the right flag here. It ensures
>>>> that everything before the ordered iomap gets submitted and the ordered
>>>> iomap starts a new ioend. This can still keep getting merged with the
>>>> newer ioends untils we decide to submit the IO, which can delay waking
>>>> up the waiters. If we really want the "no merge" behavior, we'll have to
>>>> do something like [1] (Check the 2 NOMERGE flag patches).
>>>
>>> Yeah, IOMAP_IOEND_BOUNDARY appears to be just a one-way barrier and
>>> still cannot prevent merging. I missed this, thank you for pointing this
>>> out. However, I think perhaps we should change iomap_ioend_can_merge()
>>> to check the iomap_ioend->io_private. Something like:
>>>
>>> 	if (ioend->io_private || next->io_private)
>>              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
>>              ioend->io_private != next->io_private
> 
> I guess if the purpose is to just not lose the flag, then boundary seems
> to work for because we only lose the flag if ordered ioend backward
> merges to a prev one. Flag is retained if we forward merge. Which
> boundary seems to take care of.
> 
Yes, IOMAP_IOEND_BOUNDARY is indeed worked currently as it prevents flag
loss. However, from the perspective of the iomap infrastructure, I
believe it is still necessary to add the ioend->io_private !=
next->io_private check. Because ioends with different io_private values
should not be merged, as this carries the risk of potentially losing
io_private or even memory leaks. With this check in iomap, we would no
longer need IOMAP_IOEND_BOUNDARY.

> However, if we want to avoid merges so we can quickly issue IO and wake
> up the waiters then the above change looks good. Also, if this is the
> reason we'd also want to have this during submission stage so the flag
> setting will probs have to move to ->wirteback_range()

Yes. Issuing ordered I/O as soon as possible is beneficial as it reduces
the latency of sync file range. Suppose when we are syncing data beyond
the ordered range, the background writeback process has already started
committing and bundled the ordered range into a large ioend (up to
IOEND_BATCH_SIZE folios), then this sync operation will indeed
experience significant latency. However, for other non-sync scenarios,
there should be little benefit.

But I'm not sure if this is strictly necessary, because in the existing
implementation, issuing ordered I/O via data=ordered mode works the same
way — it also doesn't issue ordered I/O as soon as possible, and still
has to wait when encountering concurrent background writeback. So I
think we can keep the current implementation for now and see user
feedback to decide whether further optimization is needed.

Cheers,
Yi.

  reply	other threads:[~2026-06-02  3:22 UTC|newest]

Thread overview: 85+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-05-11  7:23 [PATCH v4 00/23] ext4: use iomap for regular file's buffered I/O path Zhang Yi
2026-05-11  7:23 ` [PATCH v4 01/23] ext4: simplify size updating in ext4_setattr() Zhang Yi
2026-05-19  5:24   ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 02/23] ext4: factor out ext4_truncate_[up|down]() Zhang Yi
2026-05-19  6:05   ` Ojaswin Mujoo
2026-06-16  9:31   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 03/23] ext4: simplify error handling in ext4_setattr() Zhang Yi
2026-05-19  6:08   ` Ojaswin Mujoo
2026-06-16  9:36   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 04/23] ext4: add iomap address space operations for buffered I/O Zhang Yi
2026-05-19  9:28   ` Ojaswin Mujoo
2026-05-19 12:35     ` Zhang Yi
2026-05-19 16:53       ` Ojaswin Mujoo
2026-05-20  2:49         ` Zhang Yi
2026-05-26 17:11           ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 05/23] ext4: implement buffered read path using iomap Zhang Yi
2026-05-19 10:00   ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 06/23] ext4: pass out extent seq counter when mapping da blocks Zhang Yi
2026-05-19 10:02   ` Ojaswin Mujoo
2026-06-16 10:04   ` Jan Kara
2026-06-16 12:37     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 07/23] ext4: do not use data=ordered mode for inodes using buffered iomap path Zhang Yi
2026-05-19 10:41   ` Ojaswin Mujoo
2026-05-19 13:31     ` Ojaswin Mujoo
2026-05-20  8:18       ` Zhang Yi
2026-05-20 13:17         ` Ojaswin Mujoo
2026-06-16 10:01   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 08/23] ext4: implement buffered write path using iomap Zhang Yi
2026-05-26 17:10   ` Ojaswin Mujoo
2026-05-28 15:40     ` Darrick J. Wong
2026-06-02  7:02       ` Ojaswin Mujoo
2026-05-29  9:13     ` Zhang Yi
2026-06-02 10:05       ` Ojaswin Mujoo
2026-06-03  1:44         ` Zhang Yi
2026-06-02 10:26   ` Ojaswin Mujoo
2026-06-03  2:56     ` Zhang Yi
2026-06-03 11:08       ` Ojaswin Mujoo
2026-06-16 10:45   ` Jan Kara
2026-06-16 14:42     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 09/23] ext4: implement writeback " Zhang Yi
2026-05-27 12:49   ` Ojaswin Mujoo
2026-05-29 14:12     ` Zhang Yi
2026-05-29 15:32       ` Ojaswin Mujoo
2026-05-30  1:21         ` Zhang Yi
2026-06-01  6:26           ` Ojaswin Mujoo
2026-06-16 11:47   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 10/23] ext4: implement mmap " Zhang Yi
2026-05-27 12:56   ` Ojaswin Mujoo
2026-06-16 11:56   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 11/23] iomap: correct the range of a partial dirty clear Zhang Yi
2026-05-11  7:46   ` Christoph Hellwig
2026-05-11  8:57     ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 12/23] iomap: support invalidating partial folios Zhang Yi
2026-05-11  7:23 ` [PATCH v4 13/23] iomap: fix incorrect did_zero setting in iomap_zero_iter() Zhang Yi
2026-05-11  7:23 ` [PATCH v4 14/23] ext4: implement partial block zero range path using iomap Zhang Yi
2026-05-27 13:13   ` Ojaswin Mujoo
2026-06-16 12:28   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 15/23] ext4: add block mapping tracepoints for iomap buffered I/O path Zhang Yi
2026-05-27 13:14   ` Ojaswin Mujoo
2026-06-16 12:29   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 16/23] ext4: disable online defrag when inode using " Zhang Yi
2026-05-27 13:14   ` Ojaswin Mujoo
2026-06-16 12:30   ` Jan Kara
2026-05-11  7:23 ` [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the " Zhang Yi
2026-05-27 13:41   ` Ojaswin Mujoo
2026-05-30  2:53     ` Zhang Yi
2026-06-01  9:08       ` Ojaswin Mujoo
2026-06-01 12:22         ` Zhang Yi
2026-06-01 18:15           ` Ojaswin Mujoo
2026-06-02  3:36             ` Zhang Yi
2026-05-11  7:23 ` [PATCH v4 18/23] ext4: wait for ordered I/O " Zhang Yi
2026-05-27 15:58   ` Ojaswin Mujoo
2026-05-28 13:34     ` Ojaswin Mujoo
2026-05-30  9:32       ` Zhang Yi
2026-06-02  5:56         ` Ojaswin Mujoo
2026-05-30  7:22     ` Zhang Yi
2026-05-30  8:24       ` Zhang Yi
2026-06-01 18:33         ` Ojaswin Mujoo
2026-06-02  3:22           ` Zhang Yi [this message]
2026-06-02  5:35             ` Ojaswin Mujoo
2026-05-11  7:23 ` [PATCH v4 19/23] ext4: update i_disksize to i_size on ordered I/O completion Zhang Yi
2026-05-11  7:23 ` [PATCH v4 20/23] ext4: wait for ordered I/O to complete during insert and collapse range Zhang Yi
2026-05-11  7:23 ` [PATCH v4 21/23] ext4: add tracepoints for ordered I/O in the iomap buffered I/O path Zhang Yi
2026-05-11  7:23 ` [PATCH v4 22/23] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
2026-05-11  7:23 ` [PATCH v4 23/23] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ad963861-4ae3-4753-9415-793bbac00e06@gmail.com \
    --to=yizhang089@gmail.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=libaokun@linux.alibaba.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huaweicloud.com \
    --cc=yukuai@fnnas.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox