From: Zhang Yi <yizhang089@gmail.com>
To: Jan Kara <jack@suse.cz>, Zhang Yi <yi.zhang@huaweicloud.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, tytso@mit.edu,
adilger.kernel@dilger.ca, libaokun@linux.alibaba.com,
ojaswin@linux.ibm.com, ritesh.list@gmail.com,
yi.zhang@huawei.com, chengzhihao1@huawei.com,
yangerkun@huawei.com
Subject: Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
Date: Sat, 4 Jul 2026 09:32:20 +0800 [thread overview]
Message-ID: <7e2c6b5f-6135-4772-9618-e064f8a59ee0@gmail.com> (raw)
In-Reply-To: <qbrrym4b6tsstu5lqdlxvwp67tvkfp4xufre6v7llan7xgiw3k@qieeaqvt75ns>
On 7/4/2026 12:01 AM, Jan Kara wrote:
> On Fri 03-07-26 11:39:03, Zhang Yi wrote:
>> On 7/3/2026 12:43 AM, Jan Kara wrote:
>>> On Thu 02-07-26 22:40:17, Zhang Yi wrote:
>>>> On 7/2/2026 5:53 PM, Jan Kara wrote:
>>>>> On Wed 01-07-26 22:20:05, yizhang089@gmail.com wrote:
>>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>>
>>>>>> ext4_load_tail_bh() returns NULL for both holes and clean unwritten
>>>>>> buffers, but the conditions that lead to this are not obvious from the
>>>>>> code alone. Document this behavior to clarify the return value, so that
>>>>>> readers do not mistakenly assume that only holes result in a NULL
>>>>>> return.
>>>>>>
>>>>>> Also update the inline comment following the ext4_get_block() call to
>>>>>> reflect this: both holes and clean unwritten buffers fall through to the
>>>>>> "nothing to do" path.
>>>>>>
>>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>>> ---
>>>>>> fs/ext4/inode.c | 6 +++++-
>>>>>> 1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index c2c2d6ac7f3d..0b31fa873743 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
>>>>>> * because it might have data in pagecache (eg, if called from ext4_zero_range,
>>>>>> * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
>>>>>> * racing writeback can come later and flush the stale pagecache to disk.
>>>>>> + *
>>>>>> + * Return the loaded bh if it actually needs zeroing - in written, dirty
>>>>>> + * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
>>>>>> + * a clean unwritten block).
>>>>>> */
>>>>>
>>>>> Great that you're adding this comment because when I was reading previous
>>>>> patch, I've spent like 10 minutes trying to figure that out :). But as far
>>>>> as I'm reading the code, ext4_load_tail_bh() will return the unwritten bh
>>>>> even if it is clean - map_bh() in _ext4_get_block() will set
>>>>> buffer_mapped() even for unwritten extent. What am I missing? BTW, it also
>>>>> seems to be ext4_load_tail_bh() could attempt to ext4_read_bh_lock() on
>>>>> unwritten bh if things align wrongly...
>>>>>
>>>>> Honza
>>>>
>>>> Thank you for the review!
>>>>
>>>> Please note the ext4_update_bh_state(bh, map.m_flags) call in
>>>> _ext4_get_block() — it restores the mapped flag back to unwritten. As a
>>>> result, the !buffer_mapped(bh) check will evaluate to true for a clean
>>>> unwritten block, the function will return NULL.
>>>
>>> Argh, right. But ext4_ext_map_blocks() sets both BH_Unwritten and BH_Mapped
>>> in the returned map->m_flags for unwritten extents. So I think my comment
>>> still applies.
>>
>> Because ext4_get_block() is called with create = 0, this is purely a
>> lookup operation. Therefore, in ext4_ext_map_blocks() ->
>> ext4_ext_handle_unwritten_extents(), only EXT4_MAP_UNWRITTEN is set,
>> and EXT4_MAP_MAPPED is not (please see the branch where
>> if (flags & EXT4_GET_BLOCKS_CREATE) == 0)). So at this point, whether
>> the lookup is served from the extent status cache or from the on-disk
>> query, it works correctly now.
>
> Aah, now I remember. Yes, we have a catch in the mapping code like that.
> Thanks for reminding me again. Now back to my original question: So
> ext4_load_tail_bh() returns unwritten bh when it is dirty because someone
> must have called ext4_get_block() on it with EXT4_GET_BLOCKS_CREATE and at
> that point we set the BH_Mapped flag on it?
Yes.
> Subtle... It would be worth to
> write this explanation to that comment as well.
OK, sure, I will add this in my next iteration.
>
>>> Plus there's a very strange inconsistency between this and
>>> the lookup in extent status tree in ext4_map_blocks() where (as you
>>> indicate) we only set BH_Unwritten but *not* BH_Mapped. AFAICT there's
>>> something buggy in here :) Note that when we don't set BH_Mapped flag e.g.
>>> from ext4_get_block_unwritten() (which gets used when delalloc is disabled),
>>> then writes to unwritten extent will get lost because
>>> ext4_bio_write_folio() will just treat them as holes...
>>
>> In fact, there is no real problem here. For the write path that gets an
>> unwritten block, EXT4_GET_BLOCKS_UNWRIT_EXT and
>> EXT4_GET_BLOCKS_CREATE are always passed together. This ensures that
>> even if we find the unwritten extent in the extent status cache, we
>> still call ext4_map_create_blocks() (due to EXT4_GET_BLOCKS_CREATE)
>> and go through ext4_ext_handle_unwritten_extents(). Inside that
>> function, because the flags contain EXT4_GET_BLOCKS_UNWRIT_EXT,
>> EXT4_MAP_MAPPED is set. So for the write path, ext4_bio_write_folio()
>> still writes back correctly.
>>
>> That said, the semantic inconsistency between the lookup-only and
>> allocation paths has been bothering me for a while. I had to be very
>> careful about this when developing the buffered I/O iomap path, because
>> this semantics also differs from the iomap infrastructure. In iomap, for
>> an unwritten extent, IOMAP_UNWRITTEN and IOMAP_MAPPED are not set
>> simultaneously; IOMAP_MAPPED always represents a written extent. The
>> good news is that this conversion is not complicated.
>>
>> I asked an agent to look into the historical reasons, and here is what
>> he told me:
>>
>> When ext4 introduced fallocate + delalloc in 2008-2009, it followed
>> the XFS convention that "unwritten is not mapped." However, ext4's
>> writeback path (mpage_da_submit_io) depended on BH_MAPPED to issue
>> I/O, which meant that unwritten buffers would never be written out.
>> Commit bf068ee266f9 was the first fix - it converted unwritten to
>> mapped during the writepages stage. Commit 29fa89d08894 was the
>> second fix - it made unwritten carry mapped at the write-begin stage.
>> Commit 2a8964d63d50 was a revert patch - it cleared the unwritten
>> flag in the create = 1 path to avoid duplicate get_block() calls.
>>
>> Then the extent status cache was introduced in 2013. Commits
>> a25a4e1a5d5d (Zheng Liu, 2013-02) and d100eef2440f (2013) added the
>> ES-cache lookup path. In that path, ext4_es_is_unwritten returns
>> EXT4_MAP_UNWRITTEN without EXT4_MAP_MAPPED - this was not a new
>> design decision, but rather a continuation of the old 2009 convention
>> that "unwritten is not mapped." Zheng Liu's commit message did not
>> discuss whether MAPPED should be set at the same time; it only said
>> "ext4_map_blocks needs to use this flag to determine the extent
>> status." He was simply implementing the ES-cache lookup, and the
>> status semantics followed the old implementation.
>>
>> The comment in ext4_map_blocks() - which says "if blocks have been
>> preallocated ext4_ext_map_blocks() returns with buffer head unmapped"
>> - was inherited from Aneesh's 2009 code. It was not a new decision
>> made in 2013.
>>
>> So my understanding is that this inconsistency in ext4_map_blocks() is
>> a historical legacy problem. There are many places where handling these
>> two flags is also semantically ambiguous. Some may treat those with the
>> BH_mapped flag set as written, such as ext4_load_tail_bh(). Therefore,
>> setting the mapped flag may cause many issues, and it can only be done
>> by setting the unwritten flag when performing a query. If I am wrong
>> about any of this, please correct me.
>>
>> Going forward, I think we should revisit and clearly redefine the
>> semantics of these two flags, and unify the behavior of ext4_get_block().
>> Otherwise, carrying this historical baggage will only hinder future
>> development. :-)
>
> Yeah, frankly I think it would be good to cleanup. But as you say it is
> currently working correctly.
>
Yes, this cleanup work requires checking many places, and there's no
rush to carry it out at the moment.
Thanks,
Yi.
next prev parent reply other threads:[~2026-07-04 1:32 UTC|newest]
Thread overview: 18+ messages / expand[flat|nested] mbox.gz Atom feed top
2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
2026-07-01 14:20 ` [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range() yizhang089
2026-07-02 9:40 ` Jan Kara
2026-07-01 14:20 ` [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh() yizhang089
2026-07-02 9:53 ` Jan Kara
2026-07-02 14:40 ` Zhang Yi
2026-07-02 16:43 ` Jan Kara
2026-07-03 3:39 ` Zhang Yi
2026-07-03 16:01 ` Jan Kara
2026-07-04 1:32 ` Zhang Yi [this message]
2026-07-01 14:20 ` [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks() yizhang089
2026-07-02 9:57 ` Jan Kara
2026-07-01 14:20 ` [PATCH 4/6] ext4: zero out whole block for clean edges in WRITE_ZEROES yizhang089
2026-07-01 14:20 ` [PATCH 5/6] ext4: write back partial-zeroed " yizhang089
2026-07-02 10:05 ` Jan Kara
2026-07-01 14:20 ` [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() " yizhang089
2026-07-02 10:19 ` Jan Kara
2026-07-02 14:51 ` Zhang Yi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=7e2c6b5f-6135-4772-9618-e064f8a59ee0@gmail.com \
--to=yizhang089@gmail.com \
--cc=adilger.kernel@dilger.ca \
--cc=chengzhihao1@huawei.com \
--cc=jack@suse.cz \
--cc=libaokun@linux.alibaba.com \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=ojaswin@linux.ibm.com \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
--cc=yangerkun@huawei.com \
--cc=yi.zhang@huawei.com \
--cc=yi.zhang@huaweicloud.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox