linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, yi.zhang@huawei.com,
	yizhang089@gmail.com, libaokun1@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1
Date: Fri, 28 Nov 2025 11:45:51 +0800	[thread overview]
Message-ID: <2713db6e-ff43-4583-b328-412e38f3d7bf@huaweicloud.com> (raw)
In-Reply-To: <yro4hwpttmy6e2zspvwjfdbpej6qvhlqjvlr5kp3nwffqgcnfd@z6qual55zhfq>

On 11/27/2025 9:41 PM, Jan Kara wrote:
> On Fri 21-11-25 14:08:01, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> When allocating initialized blocks from a large unwritten extent, or
>> when splitting an unwritten extent during end I/O and converting it to
>> initialized, there is currently a potential issue of stale data if the
>> extent needs to be split in the middle.
>>
>>        0  A      B  N
>>        [UUUUUUUUUUUU]    U: unwritten extent
>>        [--DDDDDDDD--]    D: valid data
>>           |<-  ->| ----> this range needs to be initialized
>>
>> ext4_split_extent() first try to split this extent at B with
>> EXT4_EXT_DATA_ENTIRE_VALID1 and EXT4_EXT_MAY_ZEROOUT flag set, but
>> ext4_split_extent_at() failed to split this extent due to temporary lack
>> of space. It zeroout B to N and mark the entire extent from 0 to N
>> as written.
>>
>>        0  A      B  N
>>        [WWWWWWWWWWWW]    W: written extent
>>        [SSDDDDDDDDZZ]    Z: zeroed, S: stale data
>>
>> ext4_split_extent() then try to split this extent at A with
>> EXT4_EXT_DATA_VALID2 flag set. This time, it split successfully and left
>> a stale written extent from 0 to A.
>>
>>        0  A      B   N
>>        [WW|WWWWWWWWWW]
>>        [SS|DDDDDDDDZZ]
>>
>> Fix this by pass EXT4_EXT_DATA_PARTIAL_VALID1 to ext4_split_extent_at()
>> when splitting at B, don't convert the entire extent to written and left
>> it as unwritten after zeroing out B to N. The remaining work is just
>> like the standard two-part split. ext4_split_extent() will pass the
>> EXT4_EXT_DATA_VALID2 flag when it calls ext4_split_extent_at() for the
>> second time, allowing it to properly handle the split. If the split is
>> successful, it will keep extent from 0 to A as unwritten.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
> 
> Good catch on the data exposure issue! First I'd like to discuss whether
> there isn't a way to fix these problems in a way that doesn't make the
> already complex code even more complex. My observation is that
> EXT4_EXT_MAY_ZEROOUT is only set in ext4_ext_convert_to_initialized() and
> in ext4_split_convert_extents() which both call ext4_split_extent(). The
> actual extent zeroing happens in ext4_split_extent_at() and in
> ext4_ext_convert_to_initialized().

Yes.

> I think the code would be much clearer
> if we just centralized all the zeroing in ext4_split_extent(). At that
> place the situation is actually pretty simple:

Thank you for your suggestion!

> 
> 1) 'ex' is unwritten, 'map' describes part with already written data which
> we want to convert to initialized (generally IO completion situation) => we
> can zero out boundaries if they are smaller than max_zeroout or if extent
> split fails.
> 

Yes. Agree.

> 2) 'ex' is unwritten, 'map' describes part we are preparing for write (IO
> submission) => the split is opportunistic here, if we cannot split due to
> ENOSPC, just go on and deal with it at IO completion time. No zeroing
> needed.

Yes. At the same time, if we can indeed move the entire split unwritten
operation to be handled after I/O completion in the future, it would also be
more convenient to remove this segment of logic.

> 
> 3) 'ex' is written, 'map' describes part that should be converted to
> unwritten => we can zero out the 'map' part if smaller than max_zeroout or
> if extent split fails.

This makes sense to me! This case it originates from the fallocate with Zero
Range operation. Currently, the zero-out operation will not be performed if
the split operation fails, instead, it immediately returns a failure.

I agree with you that we can do zero out if the 'map' part smaller than
max_zeroout instead of split extents. However, if the 'map' part is bigger
than max_zeroout and if extent split fails, I don't think zero out is a good
idea, Because it might cause zero-range calls to take a long time to execute.
Although fallocate doesn't explicitly specify how ZERO_RANGE should be
implemented, users expect it to be very fast. Therefore, in this case, if the
split fails, it would be better to simply return an error, leave things as
they are. What do you think?

> 
> This should all result in a relatively straightforward code where we can
> distinguish the three cases based on 'ex' and passed flags, we should be
> able to drop the 'EXT4_EXT_DATA_VALID*' flags and logic (possibly we could
> drop the 'split_flag' argument of ext4_split_extent() altogether), and fix
> the data exposure issues at the same time. What do you think? Am I missing
> some case?
> 

Indeed, I think the overall solution is a nice cleanup idea. :-)
But this would involve a significant amount of refactoring and logical changes.
Could we first merge the current set of patches(it could be more easier to
backport to the early LTS version), and then I can start a new series to
address this optimization?

Cheers,
Yi.

> 								Honza
> 
>> ---
>>  fs/ext4/extents.c | 11 ++++++++++-
>>  1 file changed, 10 insertions(+), 1 deletion(-)
>>
>> diff --git a/fs/ext4/extents.c b/fs/ext4/extents.c
>> index f7aa497e5d6c..cafe66cb562f 100644
>> --- a/fs/ext4/extents.c
>> +++ b/fs/ext4/extents.c
>> @@ -3294,6 +3294,13 @@ static struct ext4_ext_path *ext4_split_extent_at(handle_t *handle,
>>  		err = ext4_ext_zeroout(inode, &zero_ex);
>>  		if (err)
>>  			goto fix_extent_len;
>> +		/*
>> +		 * The first half contains partially valid data, the splitting
>> +		 * of this extent has not been completed, fix extent length
>> +		 * and ext4_split_extent() split will the first half again.
>> +		 */
>> +		if (split_flag & EXT4_EXT_DATA_PARTIAL_VALID1)
>> +			goto fix_extent_len;
>>  
>>  		/* update the extent length and mark as initialized */
>>  		ex->ee_len = cpu_to_le16(ee_len);
>> @@ -3364,7 +3371,9 @@ static struct ext4_ext_path *ext4_split_extent(handle_t *handle,
>>  			split_flag1 |= EXT4_EXT_MARK_UNWRIT1 |
>>  				       EXT4_EXT_MARK_UNWRIT2;
>>  		if (split_flag & EXT4_EXT_DATA_VALID2)
>> -			split_flag1 |= EXT4_EXT_DATA_ENTIRE_VALID1;
>> +			split_flag1 |= map->m_lblk > ee_block ?
>> +				       EXT4_EXT_DATA_PARTIAL_VALID1 :
>> +				       EXT4_EXT_DATA_ENTIRE_VALID1;
>>  		path = ext4_split_extent_at(handle, inode, path,
>>  				map->m_lblk + map->m_len, split_flag1, flags1);
>>  		if (IS_ERR(path))
>> -- 
>> 2.46.1
>>


  reply	other threads:[~2025-11-28  3:45 UTC|newest]

Thread overview: 52+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-11-21  6:07 [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
2025-11-21  6:07 ` [PATCH v2 01/13] ext4: cleanup zeroout in ext4_split_extent_at() Zhang Yi
2025-11-26 11:26   ` Ojaswin Mujoo
2025-11-27 12:02   ` Jan Kara
2025-11-28  2:22     ` Zhang Yi
2025-11-28  1:54   ` Baokun Li
2025-11-21  6:08 ` [PATCH v2 02/13] ext4: subdivide EXT4_EXT_DATA_VALID1 Zhang Yi
2025-11-26 11:27   ` Ojaswin Mujoo
2025-11-26 11:55     ` Ojaswin Mujoo
2025-11-27  6:09       ` Zhang Yi
2025-11-21  6:08 ` [PATCH v2 03/13] ext4: don't zero the entire extent if EXT4_EXT_DATA_PARTIAL_VALID1 Zhang Yi
2025-11-26 11:29   ` Ojaswin Mujoo
2025-11-27  6:13     ` Zhang Yi
2025-11-27 13:41   ` Jan Kara
2025-11-28  3:45     ` Zhang Yi [this message]
2025-11-28 10:58       ` Jan Kara
2025-11-28  7:28     ` Ojaswin Mujoo
2025-11-28 11:14       ` Jan Kara
2025-11-28 14:20         ` Ojaswin Mujoo
2025-11-28 19:52         ` Andreas Dilger
2025-11-29 18:41           ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 04/13] ext4: don't set EXT4_GET_BLOCKS_CONVERT when splitting before submitting I/O Zhang Yi
2025-11-26 11:50   ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 05/13] ext4: correct the mapping status if the extent has been zeroed Zhang Yi
2025-11-26 11:56   ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 06/13] ext4: don't cache extent during splitting extent Zhang Yi
2025-11-26 12:04   ` Ojaswin Mujoo
2025-11-27  7:01     ` Zhang Yi
2025-11-28  8:18       ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 07/13] ext4: drop extent cache before " Zhang Yi
2025-11-26 12:24   ` Ojaswin Mujoo
2025-11-27  7:27     ` Zhang Yi
2025-11-28  8:16       ` Ojaswin Mujoo
2025-11-29  1:36         ` Zhang Yi
2025-11-21  6:08 ` [PATCH v2 08/13] ext4: cleanup useless out tag in __es_remove_extent() Zhang Yi
2025-11-27 12:43   ` Jan Kara
2025-11-28  8:20   ` Ojaswin Mujoo
2025-11-21  6:08 ` [PATCH v2 09/13] ext4: make __es_remove_extent() check extent status Zhang Yi
2025-11-21  6:08 ` [PATCH v2 10/13] ext4: make ext4_es_cache_extent() support overwrite existing extents Zhang Yi
2025-11-21  6:08 ` [PATCH v2 11/13] ext4: adjust the debug info in ext4_es_cache_extent() Zhang Yi
2025-11-21  6:08 ` [PATCH v2 12/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Zhang Yi
2025-11-21  6:08 ` [PATCH v2 13/13] ext4: drop the TODO comment in ext4_es_insert_extent() Zhang Yi
2025-11-23 10:55 ` [PATCH v2 00/13] ext4: replace ext4_es_insert_extent() when caching on-disk extents Ojaswin Mujoo
2025-11-24  5:04   ` Zhang Yi
2025-11-24 12:50     ` Ojaswin Mujoo
2025-11-24 14:05       ` Zhang Yi
2025-11-27 12:24     ` Jan Kara
2025-11-28  4:37       ` Zhang Yi
2025-11-28 16:49 ` Theodore Tso
2025-11-29  1:32   ` Zhang Yi
2025-11-29  3:52     ` Theodore Tso
2025-11-29  4:44       ` Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=2713db6e-ff43-4583-b328-412e38f3d7bf@huaweicloud.com \
    --to=yi.zhang@huaweicloud.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=jack@suse.cz \
    --cc=libaokun1@huawei.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=tytso@mit.edu \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    --cc=yizhang089@gmail.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).