Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path

Linux EXT4 FS development

From: Zhang Yi <yizhang089@gmail.com>
To: Ojaswin Mujoo <ojaswin@linux.ibm.com>,
	Zhang Yi <yi.zhang@huaweicloud.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, libaokun@linux.alibaba.com,
	jack@suse.cz, ritesh.list@gmail.com, djwong@kernel.org,
	hch@infradead.org, yi.zhang@huawei.com, yangerkun@huawei.com,
	yukuai@fnnas.com
Subject: Re: [PATCH v4 17/23] ext4: submit zeroed post-EOF data immediately in the iomap buffered I/O path
Date: Mon, 1 Jun 2026 20:22:09 +0800
Message-ID: <cc458b0f-4637-4313-a2cd-45db4cdaa250@gmail.com>
In-Reply-To: <ah1MG1b89ELKM-Fl@li-dc0c254c-257c-11b2-a85c-98b6c1322444.ibm.com>

On 6/1/2026 5:08 PM, Ojaswin Mujoo wrote:
> On Sat, May 30, 2026 at 10:53:24AM +0800, Zhang Yi wrote:
>> On 5/27/2026 9:41 PM, Ojaswin Mujoo wrote:
>>> On Mon, May 11, 2026 at 03:23:37PM +0800, Zhang Yi wrote:
>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>
>>>> In the generic buffer_head I/O path, we rely on the data=ordered mode to
>>>> ensure that the zeroed EOF block data is written before updating
>>>> i_disksize, thus preventing stale data from being exposed.
>>>>
>>>> However, the iomap buffered I/O path cannot use this mechanism. Instead,
>>>> we issue the I/O immediately after performing the zero operation
>>>> (without synchronous waiting for performance). This can reduce the risk
>>>> of exposing stale data, but it does not guarantee that the zero data
>>>> will be flushed to disk before the metadata of i_disksize is updated.
>>>> The subsequent patches will wait for this I/O to complete before
>>>> updating i_disksize.
>>>>
>>>> Suggested-by: Jan Kara <jack@suse.cz>
>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>
>>> I think we discussed that we may not need to do this [1] but I guess
>>> you've decided to make the tradeoff of issuing the IO to avoid having to
>>> wait for bg flush to complete the tail page zeroing
>>>

I'm glad to hear that, thanks.

>>
>> Yes. For truncate up and append fallocate, originally i_disksize would
>> be updated immediately, and the change would be persisted via the
>> journal within default 5 seconds. But now, if the tail page is not
>> committed immediately, the update to i_disksize will be delayed by about
>> 30 seconds, and persistence will be postponed to around 35 seconds. I'm
>> not sure what impact this change might have — I just don't really want
>> to introduce it.
> 
>> For normal append writes, the impact is minimal, unless we call
>> sync_file_range() to sync the portion of data that extends beyond EOF.
> 
> Hmm while trying to retain the behavior for falloc and truncate up,
> we end up changing it for append writes :) But anyways, I understand
> your reasoning and don't have any strong opinions against it. I'll let
> Jan pitch in since he had some comments around this.
> 
>>
>> In addition, if the zeroed page is not issued here immediately, the
>> logic will become more complex because we need to be more careful about the
>> order of write-back IOs to prevent deadlock issues caused by mutual
>> waiting.
> 
> You mean an endio completion waiting for ordered IO to complete but
> ordered IO writeback is somehow waiting for this endio completion? Is that
> actually possible?

Well, after thinking it over more carefully, this seems impossible; I
cannot think of a scenario that could trigger this kind of issue. The
generic writeback process always writes back folios in index order, so
there would be no situation where a later data I/O depends on an earlier
ordered I/O. Even if both kinds of I/O are placed in the same ioend,
there should be no problem. I was confused and overthinking it.
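
To make the ordering argument concrete, here is a toy userspace model in
C. It is not kernel code. It assumes the zeroed post-EOF block sits at a
lower folio index than the appended data, and that a data I/O completion
only ever waits for ordered I/O at a lower index. The names
struct model_folio and count_unsubmitted_deps() are hypothetical and
exist only for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a dirty folio; not a kernel structure. */
struct model_folio {
	unsigned long index;	/* page cache index */
	bool ordered;		/* true: zeroed post-EOF block, false: appended data */
};

/*
 * 'seq' lists folios in the order they are submitted for writeback.
 * Count the data folios that are submitted while an ordered folio with
 * a lower index has not been submitted yet.
 */
size_t count_unsubmitted_deps(const struct model_folio *seq, size_t n)
{
	size_t bad = 0;

	for (size_t i = 0; i < n; i++) {
		if (seq[i].ordered)
			continue;
		/* Look for a lower-index ordered folio submitted later. */
		for (size_t j = i + 1; j < n; j++) {
			if (seq[j].ordered && seq[j].index < seq[i].index) {
				bad++;
				break;
			}
		}
	}
	return bad;
}
```

With folios submitted in index order (for example an ordered folio at
index 3 followed by data at 4 and 5) the count is zero. Only an
out-of-order submission would leave a data I/O waiting on an ordered
I/O that has not been issued yet.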

From this perspective, if we can accept that truncate up and fallocate
will take longer to persist by default (I guess this is acceptable), we
can avoid writing back the zeroed data immediately. To achieve this, we
only need to handle the sync_file_range() case. :-)
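
For reference, the sync_file_range() case could look roughly like the
following from userspace. This is a minimal sketch, not a test from the
series: the helper name append_and_sync_range() is hypothetical, and it
assumes 'off' lies beyond the block that contains the current EOF, so
the requested range covers only the new data and not the zeroed tail
block.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Write 'len' bytes at 'off' (assumed to be past the current EOF) and
 * start writeback for that range only. Returns 0 on success, -1 on error.
 */
int append_and_sync_range(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t ret = pwrite(fd, buf, len, off);

	if (ret < 0 || (size_t)ret != len)
		return -1;

	/*
	 * Start writeback for [off, off + len) without waiting. The block
	 * at the old EOF is outside this range, so nothing here asks for
	 * the zeroed tail to be written first.
	 */
	return sync_file_range(fd, off, (off_t)len, SYNC_FILE_RANGE_WRITE);
}
```

A caller would open a file that already has a partial final block, then
call the helper with an offset a few blocks past EOF. The point of
interest is whether i_disksize can move past the old EOF before the
zeroed tail block has reached disk.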

Regards,
Yi.
>>
>>> However, I think one side effect might be many threads calling the
>>> writeback mechanism to issue zero IOs which might not scale well. I
>>> don't know if it'll be a huge problem though, I guess it's a sort of
>>> thing we will have to deal with in case we see it in real world
>>> workloads.
>>>
>>
>> I agree with you. However, I suspect that unless we run some specific
>> benchmark tests, it should be difficult to encounter a large number of
>> post-EOF append writes and truncate up operations in real-world usage
>> scenarios — and I haven't come across such scenarios yet. For
>> simplicity, I'd like to proceed with this implementation for now. If we
>> do run into actual problems later, we can consider not issuing I/O
>> directly here, but instead: 1) find the ordered block in
>> ext4_sync_file() and perform writeback; 2) ensure writeback ordering
>> for normal background writeback as well — otherwise, there is a risk of
>> deadlock (mutual waiting). What do you think?
> 
> Yes sounds good Yi, we can deal with performance tuning later.
> 
> Regards,
> Ojaswin
> 
>>
>> Cheers,
>> Yi.
>>
>>> [1] https://lore.kernel.org/linux-ext4/yhy4cgc4fnk7tzfejuhy6m6ljo425ebpg6khss6vtvpidg6lyp@5xcyabxrl6zm/
>>>
>>>> ---
>>>>   fs/ext4/inode.c | 66 ++++++++++++++++++++++++++++++++++++++++---------
>>>>   1 file changed, 55 insertions(+), 11 deletions(-)
>>>>
>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>> index 239d387ffaf2..e013aeb03d7b 100644
>>>> --- a/fs/ext4/inode.c
>>>> +++ b/fs/ext4/inode.c
>>>> @@ -4742,6 +4742,32 @@ static int ext4_block_zero_range(struct inode *inode,
>>>>   					zero_written);
>>>>   }
>>>>   
>>>> +static int ext4_iomap_submit_zero_block(struct inode *inode,
>>>> +					loff_t from, loff_t end)
>>>> +{
>>>> +	struct address_space *mapping = inode->i_mapping;
>>>> +	struct folio *folio;
>>>> +	bool do_submit = false;
>>>> +
>>>> +	folio = filemap_lock_folio(mapping, from >> PAGE_SHIFT);
>>>> +	if (IS_ERR(folio))
>>>> +		/* Already writeback and clear? */
>>>> +		return PTR_ERR(folio) == -ENOENT ? 0 : PTR_ERR(folio);
>>>> +
>>>> +	folio_wait_writeback(folio);
>>>> +	WARN_ON_ONCE(folio_test_writeback(folio));
>>>> +
>>>> +	if (likely(folio_test_dirty(folio)))
>>>> +		do_submit = true;
>>>> +	folio_unlock(folio);
>>>> +	folio_put(folio);
>>>> +
>>>> +	/* Submit zeroed block. */
>>>> +	if (do_submit)
>>>> +		return filemap_fdatawrite_range(mapping, from, end - 1);
>>>> +	return 0;
>>>> +}
>>>> +
>>>>   /*
>>>>    * Zero out a mapping from file offset 'from' up to the end of the block
>>>>    * which corresponds to 'from' or to the given 'end' inside this block.
>>>> @@ -4765,8 +4791,10 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>   	if (IS_ENCRYPTED(inode) && !fscrypt_has_encryption_key(inode))
>>>>   		return 0;
>>>>   
>>>> -	if (length > blocksize - offset)
>>>> +	if (length > blocksize - offset) {
>>>>   		length = blocksize - offset;
>>>> +		end = from + length;
>>>> +	}
>>>>   
>>>>   	err = ext4_block_zero_range(inode, from, length,
>>>>   				    &did_zero, &zero_written);
>>>> @@ -4781,18 +4809,34 @@ int ext4_block_zero_eof(struct inode *inode, loff_t from, loff_t end)
>>>>   	 * TODO: In the iomap path, handle this by updating i_disksize to
>>>>   	 * i_size after the zeroed data has been written back.
>>>>   	 */
>>>> -	if (ext4_should_order_data(inode) &&
>>>> -	    did_zero && zero_written && !IS_DAX(inode)) {
>>>> -		handle_t *handle;
>>>> +	if (did_zero && zero_written && !IS_DAX(inode)) {
>>>> +		if (ext4_should_order_data(inode)) {
>>>> +			handle_t *handle;
>>>>   
>>>> -		handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>> -		if (IS_ERR(handle))
>>>> -			return PTR_ERR(handle);
>>>> +			handle = ext4_journal_start(inode, EXT4_HT_MISC, 1);
>>>> +			if (IS_ERR(handle))
>>>> +				return PTR_ERR(handle);
>>>>   
>>>> -		err = ext4_jbd2_inode_add_write(handle, inode, from, length);
>>>> -		ext4_journal_stop(handle);
>>>> -		if (err)
>>>> -			return err;
>>>> +			err = ext4_jbd2_inode_add_write(handle, inode, from,
>>>> +							length);
>>>> +			ext4_journal_stop(handle);
>>>> +			if (err)
>>>> +				return err;
>>>> +		/*
>>>> +		 * inodes using the iomap buffered I/O path do not use the
>>>> +		 * data=ordered mode. We submit zeroed range directly here.
>>>> +		 * Do not wait for I/O completion for performance.
>>>> +		 *
>>>> +		 * TODO: Any operation that extends i_disksize (including
>>>> +		 * append write end io past the zeroed boundary, truncate up,
>>>> +		 * and append fallocate) must wait for the relevant I/O to
>>>> +		 * complete before updating i_disksize.
>>>> +		 */
>>>> +		} else if (ext4_inode_buffered_iomap(inode)) {
>>>> +			err = ext4_iomap_submit_zero_block(inode, from, end);
>>>> +			if (err)
>>>> +				return err;
>>>> +		}
>>>>   	}
>>>>   
>>>>   	return 0;
>>>> -- 
>>>> 2.52.0
>>>>
>>
