Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()

Linux filesystem development
 help / color / mirror / Atom feed

From: Zhang Yi <yizhang089@gmail.com>
To: Jan Kara <jack@suse.cz>, Zhang Yi <yi.zhang@huaweicloud.com>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, libaokun@linux.alibaba.com,
	ojaswin@linux.ibm.com, ritesh.list@gmail.com,
	yi.zhang@huawei.com, chengzhihao1@huawei.com,
	yangerkun@huawei.com
Subject: Re: [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh()
Date: Sat, 4 Jul 2026 09:32:20 +0800	[thread overview]
Message-ID: <7e2c6b5f-6135-4772-9618-e064f8a59ee0@gmail.com> (raw)
In-Reply-To: <qbrrym4b6tsstu5lqdlxvwp67tvkfp4xufre6v7llan7xgiw3k@qieeaqvt75ns>

On 7/4/2026 12:01 AM, Jan Kara wrote:
> On Fri 03-07-26 11:39:03, Zhang Yi wrote:
>> On 7/3/2026 12:43 AM, Jan Kara wrote:
>>> On Thu 02-07-26 22:40:17, Zhang Yi wrote:
>>>> On 7/2/2026 5:53 PM, Jan Kara wrote:
>>>>> On Wed 01-07-26 22:20:05, yizhang089@gmail.com wrote:
>>>>>> From: Zhang Yi <yi.zhang@huawei.com>
>>>>>>
>>>>>> ext4_load_tail_bh() returns NULL for both holes and clean unwritten
>>>>>> buffers, but the conditions that lead to this are not obvious from the
>>>>>> code alone. Document this behavior to clarify the return value, so that
>>>>>> readers do not mistakenly assume that only holes result in a NULL
>>>>>> return.
>>>>>>
>>>>>> Also update the inline comment following the ext4_get_block() call to
>>>>>> reflect this: both holes and clean unwritten buffers fall through to the
>>>>>> "nothing to do" path.
>>>>>>
>>>>>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>>>>>> ---
>>>>>>    fs/ext4/inode.c | 6 +++++-
>>>>>>    1 file changed, 5 insertions(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
>>>>>> index c2c2d6ac7f3d..0b31fa873743 100644
>>>>>> --- a/fs/ext4/inode.c
>>>>>> +++ b/fs/ext4/inode.c
>>>>>> @@ -4026,6 +4026,10 @@ void ext4_set_aops(struct inode *inode)
>>>>>>     * because it might have data in pagecache (eg, if called from ext4_zero_range,
>>>>>>     * ext4_punch_hole, etc) which needs to be properly zeroed out. Otherwise a
>>>>>>     * racing writeback can come later and flush the stale pagecache to disk.
>>>>>> + *
>>>>>> + * Return the loaded bh if it actually needs zeroing - in written, dirty
>>>>>> + * unwritten, or delalloc state. Return NULL if it's clean (i.e., a hole or
>>>>>> + * a clean unwritten block).
>>>>>>     */
>>>>>
>>>>> Great that you're adding this comment because when I was reading previous
>>>>> patch, I've spent like 10 minutes trying to figure that out :). But as far
>>>>> as I'm reading the code, ext4_load_tail_bh() will return the unwritten bh
>>>>> even if it is clean - map_bh() in _ext4_get_block() will set
>>>>> buffer_mapped() even for unwritten extent. What am I missing? BTW, it also
>>>>> seems to be ext4_load_tail_bh() could attempt to ext4_read_bh_lock() on
>>>>> unwritten bh if things align wrongly...
>>>>>
>>>>> 								Honza
>>>>
>>>> Thank you for the review!
>>>>
>>>> Please note the ext4_update_bh_state(bh, map.m_flags) call in
>>>> _ext4_get_block() — it restores the mapped flag back to unwritten. As a
>>>> result, the !buffer_mapped(bh) check will evaluate to true for a clean
>>>> unwritten block, the function will return NULL.
>>>
>>> Argh, right. But ext4_ext_map_blocks() sets both BH_Unwritten and BH_Mapped
>>> in the returned map->m_flags for unwritten extents. So I think my comment
>>> still applies.
>>
>> Because ext4_get_block() is called with create = 0, this is purely a
>> lookup operation. Therefore, in ext4_ext_map_blocks() ->
>> ext4_ext_handle_unwritten_extents(), only EXT4_MAP_UNWRITTEN is set,
>> and EXT4_MAP_MAPPED is not (please see the branch where
>> if (flags & EXT4_GET_BLOCKS_CREATE) == 0)). So at this point, whether
>> the lookup is served from the extent status cache or from the on-disk
>> query, it works correctly now.
> 
> Aah, now I remember. Yes, we have a catch in the mapping code like that.
> Thanks for reminding me again. Now back to my original question: So
> ext4_load_tail_bh() returns unwritten bh when it is dirty because someone
> must have called ext4_get_block() on it with EXT4_GET_BLOCKS_CREATE and at
> that point we set the BH_Mapped flag on it? 

Yes.

> Subtle... It would be worth to
> write this explanation to that comment as well.

OK, sure, I will add this in my next iteration.

> 
>>> Plus there's a very strange inconsistency between this and
>>> the lookup in extent status tree in ext4_map_blocks() where (as you
>>> indicate) we only set BH_Unwritten but *not* BH_Mapped. AFAICT there's
>>> something buggy in here :) Note that when we don't set BH_Mapped flag e.g.
>>> from ext4_get_block_unwritten() (which gets used when delalloc is disabled),
>>> then writes to unwritten extent will get lost because
>>> ext4_bio_write_folio() will just treat them as holes...
>>
>> In fact, there is no real problem here. For the write path that gets an
>> unwritten block, EXT4_GET_BLOCKS_UNWRIT_EXT and
>> EXT4_GET_BLOCKS_CREATE are always passed together. This ensures that
>> even if we find the unwritten extent in the extent status cache, we
>> still call ext4_map_create_blocks() (due to EXT4_GET_BLOCKS_CREATE)
>> and go through ext4_ext_handle_unwritten_extents(). Inside that
>> function, because the flags contain EXT4_GET_BLOCKS_UNWRIT_EXT,
>> EXT4_MAP_MAPPED is set. So for the write path, ext4_bio_write_folio()
>> still writes back correctly.
>>
>> That said, the semantic inconsistency between the lookup-only and
>> allocation paths has been bothering me for a while. I had to be very
>> careful about this when developing the buffered I/O iomap path, because
>> this semantics also differs from the iomap infrastructure. In iomap, for
>> an unwritten extent, IOMAP_UNWRITTEN and IOMAP_MAPPED are not set
>> simultaneously; IOMAP_MAPPED always represents a written extent. The
>> good news is that this conversion is not complicated.
>>
>> I asked an agent to look into the historical reasons, and here is what
>> he told me:
>>
>>    When ext4 introduced fallocate + delalloc in 2008-2009, it followed
>>    the XFS convention that "unwritten is not mapped." However, ext4's
>>    writeback path (mpage_da_submit_io) depended on BH_MAPPED to issue
>>    I/O, which meant that unwritten buffers would never be written out.
>>    Commit bf068ee266f9 was the first fix - it converted unwritten to
>>    mapped during the writepages stage. Commit 29fa89d08894 was the
>>    second fix - it made unwritten carry mapped at the write-begin stage.
>>    Commit 2a8964d63d50 was a revert patch - it cleared the unwritten
>>    flag in the create = 1 path to avoid duplicate get_block() calls.
>>
>>    Then the extent status cache was introduced in 2013. Commits
>>    a25a4e1a5d5d (Zheng Liu, 2013-02) and d100eef2440f (2013) added the
>>    ES-cache lookup path. In that path, ext4_es_is_unwritten returns
>>    EXT4_MAP_UNWRITTEN without EXT4_MAP_MAPPED - this was not a new
>>    design decision, but rather a continuation of the old 2009 convention
>>    that "unwritten is not mapped." Zheng Liu's commit message did not
>>    discuss whether MAPPED should be set at the same time; it only said
>>    "ext4_map_blocks needs to use this flag to determine the extent
>>    status." He was simply implementing the ES-cache lookup, and the
>>    status semantics followed the old implementation.
>>
>>    The comment in ext4_map_blocks() - which says "if blocks have been
>>    preallocated ext4_ext_map_blocks() returns with buffer head unmapped"
>>    - was inherited from Aneesh's 2009 code. It was not a new decision
>>    made in 2013.
>>
>> So my understanding is that this inconsistency in ext4_map_blocks() is
>> a historical legacy problem. There are many places where handling these
>> two flags is also semantically ambiguous. Some may treat those with the
>> BH_mapped flag set as written, such as ext4_load_tail_bh(). Therefore,
>> setting the mapped flag may cause many issues, and it can only be done
>> by setting the unwritten flag when performing a query. If I am wrong
>> about any of this, please correct me.
>>
>> Going forward, I think we should revisit and clearly redefine the
>> semantics of these two flags, and unify the behavior of ext4_get_block().
>> Otherwise, carrying this historical baggage will only hinder future
>> development. :-)
> 
> Yeah, frankly I think it would be good to cleanup. But as you say it is
> currently working correctly.
> 

Yes, this cleanup work requires checking many places, and there's no
rush to carry it out at the moment.

Thanks,
Yi.

next prev parent reply	other threads:[~2026-07-04  1:32 UTC|newest]

Thread overview: 18+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2026-07-01 14:20 [PATCH 0/6] ext4: fix unaligned edge handling in FALLOC_FL_WRITE_ZEROES yizhang089
2026-07-01 14:20 ` [PATCH 1/6] ext4: move partial block zeroing earlier in ext4_zero_range() yizhang089
2026-07-02  9:40   ` Jan Kara
2026-07-01 14:20 ` [PATCH 2/6] ext4: clarify return semantics of ext4_load_tail_bh() yizhang089
2026-07-02  9:53   ` Jan Kara
2026-07-02 14:40     ` Zhang Yi
2026-07-02 16:43       ` Jan Kara
2026-07-03  3:39         ` Zhang Yi
2026-07-03 16:01           ` Jan Kara
2026-07-04  1:32             ` Zhang Yi [this message]
2026-07-01 14:20 ` [PATCH 3/6] ext4: track partial-zero outcome per edge in ext4_zero_partial_blocks() yizhang089
2026-07-02  9:57   ` Jan Kara
2026-07-01 14:20 ` [PATCH 4/6] ext4: zero out whole block for clean edges in WRITE_ZEROES yizhang089
2026-07-01 14:20 ` [PATCH 5/6] ext4: write back partial-zeroed " yizhang089
2026-07-02 10:05   ` Jan Kara
2026-07-01 14:20 ` [PATCH 6/6] ext4: skip ext4_update_disksize_before_punch() " yizhang089
2026-07-02 10:19   ` Jan Kara
2026-07-02 14:51     ` Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=7e2c6b5f-6135-4772-9618-e064f8a59ee0@gmail.com \
    --to=yizhang089@gmail.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=chengzhihao1@huawei.com \
    --cc=jack@suse.cz \
    --cc=libaokun@linux.alibaba.com \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ojaswin@linux.ibm.com \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    --cc=yi.zhang@huaweicloud.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox