Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry

linux-fsdevel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org, tytso@mit.edu,
	adilger.kernel@dilger.ca, ritesh.list@gmail.com,
	hch@infradead.org, djwong@kernel.org, david@fromorbit.com,
	zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com,
	yukuai3@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
Date: Wed, 11 Dec 2024 15:59:51 +0800	[thread overview]
Message-ID: <95c631d7-84da-412b-b7dc-f4785739f41a@huaweicloud.com> (raw)
In-Reply-To: <20241210125726.gzcx6mpuecifqdwe@quack3>

On 2024/12/10 20:57, Jan Kara wrote:
> On Mon 09-12-24 16:32:41, Zhang Yi wrote:
>> On 2024/12/7 0:21, Jan Kara wrote:
>>>>> I think you'll need to use atomic_t and appropriate functions here.
>>>>>
>>>>>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>>>>>>  	BUG_ON(end < lblk);
>>>>>>  	WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>>>>>  
>>>>>> +	ext4_es_inc_seq(inode);
>>>>>
>>>>> I'm somewhat wondering: Are extent status tree modifications the right
>>>>> place to advance the sequence counter? The counter needs to advance
>>>>> whenever the mapping information changes. This means that we'd be
>>>>> needlessly advancing the counter (and thus possibly forcing retries) when
>>>>> we are just adding new information from ordinary extent tree into cache.
>>>>> Also someone can be doing extent tree manipulations without touching extent
>>>>> status tree (if the information was already pruned from there). 
>>>>
>>>> Sorry, I don't quite understand here. IIUC, we can't modify the extent
>>>> tree without also touching extent status tree; otherwise, the extent
>>>> status tree will become stale, potentially leading to undesirable and
>>>> unexpected outcomes later on, as the extent lookup paths rely on and
>>>> always trust the status tree. If this situation happens, would it be
>>>> considered a bug? Additionally, I have checked the code but didn't find
>>>> any concrete cases where this could happen. Was I overlooked something?
>>>
>>> What I'm worried about is that this seems a bit fragile because e.g. in
>>> ext4_collapse_range() we do:
>>>
>>> ext4_es_remove_extent(inode, start, EXT_MAX_BLOCKS - start)
>>> <now go and manipulate the extent tree>
>>>
>>> So if somebody managed to sneak in between ext4_es_remove_extent() and
>>> the extent tree manipulation, he could get a block mapping which is shortly
>>> after invalidated by the extent tree changes. And as I'm checking now,
>>> writeback code *can* sneak in there because during extent tree
>>> manipulations we call ext4_datasem_ensure_credits() which can drop
>>> i_data_sem to restart a transaction.
>>>
>>> Now we do writeout & invalidate page cache before we start to do these
>>> extent tree dances so I don't see how this could lead to *actual* use
>>> after free issues but it makes me somewhat nervous. So that's why I'd like
>>> to have some clear rules from which it is obvious that the counter makes
>>> sure we do not use stale mappings.
>>
>> Yes, I see. I think the rule should be as follows:
>>
>> First, when the iomap infrastructure is creating or querying file
>> mapping information, we must ensure that the mapping information
>> always passes through the extent status tree, which means
>> ext4_map_blocks(), ext4_map_query_blocks(), and
>> ext4_map_create_blocks() should cache the extent status entries that
>> we intend to use.
> 
> OK, this currently holds. There's just one snag that during fastcommit
> replay ext4_es_insert_extent() doesn't do anything. I don't think there's
> any race possible during that stage but it's another case to think about.

OK.

> 
>> Second, when updating the extent tree, we must hold the i_data_sem in
>> write mode and update the extent status tree atomically.
> 
> Fine.
> 
>> Additionally,
>> if we cannot update the extent tree while holding a single i_data_sem,
>> we should first remove all related extent status entries within the
>> specified range, then manipulate the extent tree, ensuring that the
>> extent status entries are always up-to-date if they exist (as
>> ext4_collapse_range() does).
> 
> In this case, I think we need to provide more details. In particular I
> would require that in all such cases we must:
> a) hold i_rwsem exclusively and hold invalidate_lock exclusively ->
>    provides exclusion against page faults, reads, writes

Yes.

> b) evict all page cache in the affected range -> should stop writeback -
>    *but* currently there's one case which could be problematic. Assume we
>    do punch hole 0..N and the page at N+1 is dirty. Punch hole does all of
>    the above and starts removing blocks, needs to restart transaction so it
>    drops i_data_sem. Writeback starts for page N+1, needs to load extent
>    block into memory, ext4_cache_extents() now loads back some extents
>    covering range 0..N into extent status tree. 

This completely confuses me. Do you mention the case below,

There are many extent entries in the page range 0..N+1, for example,

   0                                  N N+1
   |                                  |/
  [www][wwwwww][wwwwwwww]...[wwwww][wwww]...
                |      |
                N-a    N-b

Punch hole is removing each extent entries from N..0
(ext4_ext_remove_space() removes blocks from end to start), and could
drop i_data_sem just after manipulating(removing) the extent entry
[N-a,N-b], At the same time, a concurrent writeback start write back
page N+1 since the writeback only hold page lock, doesn't hold i_rwsem
and invalidate_lock. It may load back the extents 0..N-a into the
extent status tree again while finding extent that contains N+1?
Finally it may left some stale extent status entries after punch hole
is done?

If my understanding is correct, isn't that a problem that exists now?
I mean without this patch series.

>    So the only protection
>    against using freed blocks is that nobody should be mapping anything in
>    the range 0..N because we hold those locks & have evicted page cache.
> 
> So I think we need to also document, that anybody mapping blocks needs to
> hold i_rwsem or invalidate_lock or a page lock, ideally asserting that in
> ext4_map_blocks() to catch cases we missed. Asserting for page lock will
> not be really doable but luckily only page writeback needs that so that can
> get some extemption from the assert.

In the case above, it seems that merely holding a page lock is
insufficient?

> 
>> Finally, if we want to manipulate the extent tree without caching, we
>> should also remove the extent status entries first.
> 
> Based on the above, I don't think this is really needed. We only must make
> sure that after all extent tree updates are done and before we release
> invalidate_lock, all extents from extent status tree in the modified range
> must be evicted / replaced to match reality.
> 
Yeah, I agree with you.

Thanks,
Yi.

next prev parent reply	other threads:[~2024-12-11  7:59 UTC|newest]

Thread overview: 59+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
2024-10-22  6:59 ` Sedat Dilek
2024-10-22  9:22   ` Zhang Yi
2024-10-23 12:13     ` Sedat Dilek
2024-10-24  7:44       ` Zhang Yi
2024-10-22 11:10 ` [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache Zhang Yi
2024-12-04 11:13   ` Jan Kara
2024-12-06  7:59     ` Zhang Yi
2024-12-06 15:49       ` Jan Kara
2024-10-22 11:10 ` [PATCH 02/27] ext4: don't explicit update times in ext4_fallocate() Zhang Yi
2024-10-22 11:10 ` [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode Zhang Yi
2024-11-18 23:15   ` Darrick J. Wong
2024-11-20  2:56     ` Zhang Yi
2024-12-04 11:26       ` Jan Kara
2024-12-04 11:27   ` Jan Kara
2024-10-22 11:10 ` [PATCH 04/27] ext4: refactor ext4_punch_hole() Zhang Yi
2024-11-18 23:27   ` Darrick J. Wong
2024-11-20  3:18     ` Zhang Yi
2024-12-04 11:36   ` Jan Kara
2024-10-22 11:10 ` [PATCH 05/27] ext4: refactor ext4_zero_range() Zhang Yi
2024-12-04 11:52   ` Jan Kara
2024-12-06  8:09     ` Zhang Yi
2024-10-22 11:10 ` [PATCH 06/27] ext4: refactor ext4_collapse_range() Zhang Yi
2024-12-04 11:58   ` Jan Kara
2024-10-22 11:10 ` [PATCH 07/27] ext4: refactor ext4_insert_range() Zhang Yi
2024-12-04 12:02   ` Jan Kara
2024-10-22 11:10 ` [PATCH 08/27] ext4: factor out ext4_do_fallocate() Zhang Yi
2024-10-22 11:10 ` [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate() Zhang Yi
2024-12-04 12:05   ` Jan Kara
2024-12-06  8:13     ` Zhang Yi
2024-12-06 15:51       ` Jan Kara
2024-10-22 11:10 ` [PATCH 10/27] ext4: move out common parts " Zhang Yi
2024-12-04 12:10   ` Jan Kara
2024-10-22 11:10 ` [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio Zhang Yi
2024-12-04 12:16   ` Jan Kara
2024-10-22 11:10 ` [PATCH 12/27] ext4: introduce seq counter for the extent status entry Zhang Yi
2024-12-04 12:42   ` Jan Kara
2024-12-06  8:55     ` Zhang Yi
2024-12-06 16:21       ` Jan Kara
2024-12-09  8:32         ` Zhang Yi
2024-12-10 12:57           ` Jan Kara
2024-12-11  7:59             ` Zhang Yi [this message]
2024-12-11 16:00               ` Jan Kara
2024-12-12  2:32                 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 13/27] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2024-10-22 11:10 ` [PATCH 14/27] ext4: implement buffered read iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 15/27] ext4: implement buffered write " Zhang Yi
2024-10-22 11:10 ` [PATCH 16/27] ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP Zhang Yi
2024-10-22 11:10 ` [PATCH 17/27] ext4: implement writeback iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 18/27] ext4: implement mmap " Zhang Yi
2024-10-22 11:10 ` [PATCH 19/27] ext4: do not always order data when partial zeroing out a block Zhang Yi
2024-10-22 11:10 ` [PATCH 20/27] ext4: do not start handle if unnecessary while " Zhang Yi
2024-10-22 11:10 ` [PATCH 21/27] ext4: implement zero_range iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 22/27] ext4: disable online defrag when inode using iomap buffered I/O path Zhang Yi
2024-10-22 11:10 ` [PATCH 23/27] ext4: disable inode journal mode when " Zhang Yi
2024-10-22 11:10 ` [PATCH 24/27] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
2024-10-22 11:10 ` [PATCH 25/27] ext4: enable large folio for regular file with iomap buffered I/O path Zhang Yi
2024-10-22 11:10 ` [PATCH 26/27] ext4: change mount options code style Zhang Yi
2024-10-22 11:10 ` [PATCH 27/27] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=95c631d7-84da-412b-b7dc-f4785739f41a@huaweicloud.com \
    --to=yi.zhang@huaweicloud.com \
    --cc=adilger.kernel@dilger.ca \
    --cc=chengzhihao1@huawei.com \
    --cc=david@fromorbit.com \
    --cc=djwong@kernel.org \
    --cc=hch@infradead.org \
    --cc=jack@suse.cz \
    --cc=linux-ext4@vger.kernel.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=ritesh.list@gmail.com \
    --cc=tytso@mit.edu \
    --cc=yangerkun@huawei.com \
    --cc=yi.zhang@huawei.com \
    --cc=yukuai3@huawei.com \
    --cc=zokeefe@google.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).