From: Zhang Yi <yi.zhang@huaweicloud.com>
To: Jan Kara <jack@suse.cz>
Cc: linux-ext4@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org, tytso@mit.edu,
adilger.kernel@dilger.ca, ritesh.list@gmail.com,
hch@infradead.org, djwong@kernel.org, david@fromorbit.com,
zokeefe@google.com, yi.zhang@huawei.com, chengzhihao1@huawei.com,
yukuai3@huawei.com, yangerkun@huawei.com
Subject: Re: [PATCH 12/27] ext4: introduce seq counter for the extent status entry
Date: Fri, 6 Dec 2024 16:55:01 +0800 [thread overview]
Message-ID: <c831732e-38c5-4a82-ab30-de17cff29584@huaweicloud.com> (raw)
In-Reply-To: <20241204124221.aix7qxjl2n4ya3b7@quack3>
On 2024/12/4 20:42, Jan Kara wrote:
> On Tue 22-10-24 19:10:43, Zhang Yi wrote:
>> From: Zhang Yi <yi.zhang@huawei.com>
>>
>> In the iomap_write_iter(), the iomap buffered write frame does not hold
>> any locks between querying the inode extent mapping info and performing
>> page cache writes. As a result, the extent mapping can be changed due to
>> concurrent I/O in flight. Similarly, in the iomap_writepage_map(), the
>> write-back process faces a similar problem: concurrent changes can
>> invalidate the extent mapping before the I/O is submitted.
>>
>> Therefore, both of these processes must recheck the mapping info after
>> acquiring the folio lock. To address this, similar to XFS, we propose
>> introducing an extent sequence number to serve as a validity cookie for
>> the extent. We will increment this number whenever the extent status
>> tree changes, thereby preparing for the buffered write iomap conversion.
>> Besides, it also changes the trace code style to make checkpatch.pl
>> happy.
>>
>> Signed-off-by: Zhang Yi <yi.zhang@huawei.com>
>
> Overall using some sequence counter makes sense.
>
>> diff --git a/fs/ext4/extents_status.c b/fs/ext4/extents_status.c
>> index c786691dabd3..bea4f87db502 100644
>> --- a/fs/ext4/extents_status.c
>> +++ b/fs/ext4/extents_status.c
>> @@ -204,6 +204,13 @@ static inline ext4_lblk_t ext4_es_end(struct extent_status *es)
>> return es->es_lblk + es->es_len - 1;
>> }
>>
>> +static inline void ext4_es_inc_seq(struct inode *inode)
>> +{
>> + struct ext4_inode_info *ei = EXT4_I(inode);
>> +
>> + WRITE_ONCE(ei->i_es_seq, READ_ONCE(ei->i_es_seq) + 1);
>> +}
>
> This looks potentially dangerous because we can loose i_es_seq updates this
> way. Like
>
> CPU1 CPU2
> x = READ_ONCE(ei->i_es_seq)
> x = READ_ONCE(ei->i_es_seq)
> WRITE_ONCE(ei->i_es_seq, x + 1)
> ...
> potentially many times
> WRITE_ONCE(ei->i_es_seq, x + 1)
> -> the counter goes back leading to possibly false equality checks
>
In my current implementation, I don't think this race condition can
happen since all ext4_es_inc_seq() invocations are under
EXT4_I(inode)->i_es_lock. So I think it works fine now, or was I
missed something?
> I think you'll need to use atomic_t and appropriate functions here.
>
>> @@ -872,6 +879,7 @@ void ext4_es_insert_extent(struct inode *inode, ext4_lblk_t lblk,
>> BUG_ON(end < lblk);
>> WARN_ON_ONCE(status & EXTENT_STATUS_DELAYED);
>>
>> + ext4_es_inc_seq(inode);
>
> I'm somewhat wondering: Are extent status tree modifications the right
> place to advance the sequence counter? The counter needs to advance
> whenever the mapping information changes. This means that we'd be
> needlessly advancing the counter (and thus possibly forcing retries) when
> we are just adding new information from ordinary extent tree into cache.
> Also someone can be doing extent tree manipulations without touching extent
> status tree (if the information was already pruned from there).
Sorry, I don't quite understand here. IIUC, we can't modify the extent
tree without also touching extent status tree; otherwise, the extent
status tree will become stale, potentially leading to undesirable and
unexpected outcomes later on, as the extent lookup paths rely on and
always trust the status tree. If this situation happens, would it be
considered a bug? Additionally, I have checked the code but didn't find
any concrete cases where this could happen. Was I overlooked something?
> So I think
> needs some very good documentation what are the expectations from the
> sequence counter and explanations why they are satisfied so that we don't
> break this in the future.
>
Yeah, it's a good suggestion, where do you suggest putting this
documentation, how about in the front of extents_status.c?
Thanks,
Yi.
next prev parent reply other threads:[~2024-12-06 8:55 UTC|newest]
Thread overview: 59+ messages / expand[flat|nested] mbox.gz Atom feed top
2024-10-22 11:10 [PATCH 00/27] ext4: use iomap for regular file's buffered I/O path and enable large folio Zhang Yi
2024-10-22 6:59 ` Sedat Dilek
2024-10-22 9:22 ` Zhang Yi
2024-10-23 12:13 ` Sedat Dilek
2024-10-24 7:44 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 01/27] ext4: remove writable userspace mappings before truncating page cache Zhang Yi
2024-12-04 11:13 ` Jan Kara
2024-12-06 7:59 ` Zhang Yi
2024-12-06 15:49 ` Jan Kara
2024-10-22 11:10 ` [PATCH 02/27] ext4: don't explicit update times in ext4_fallocate() Zhang Yi
2024-10-22 11:10 ` [PATCH 03/27] ext4: don't write back data before punch hole in nojournal mode Zhang Yi
2024-11-18 23:15 ` Darrick J. Wong
2024-11-20 2:56 ` Zhang Yi
2024-12-04 11:26 ` Jan Kara
2024-12-04 11:27 ` Jan Kara
2024-10-22 11:10 ` [PATCH 04/27] ext4: refactor ext4_punch_hole() Zhang Yi
2024-11-18 23:27 ` Darrick J. Wong
2024-11-20 3:18 ` Zhang Yi
2024-12-04 11:36 ` Jan Kara
2024-10-22 11:10 ` [PATCH 05/27] ext4: refactor ext4_zero_range() Zhang Yi
2024-12-04 11:52 ` Jan Kara
2024-12-06 8:09 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 06/27] ext4: refactor ext4_collapse_range() Zhang Yi
2024-12-04 11:58 ` Jan Kara
2024-10-22 11:10 ` [PATCH 07/27] ext4: refactor ext4_insert_range() Zhang Yi
2024-12-04 12:02 ` Jan Kara
2024-10-22 11:10 ` [PATCH 08/27] ext4: factor out ext4_do_fallocate() Zhang Yi
2024-10-22 11:10 ` [PATCH 09/27] ext4: move out inode_lock into ext4_fallocate() Zhang Yi
2024-12-04 12:05 ` Jan Kara
2024-12-06 8:13 ` Zhang Yi
2024-12-06 15:51 ` Jan Kara
2024-10-22 11:10 ` [PATCH 10/27] ext4: move out common parts " Zhang Yi
2024-12-04 12:10 ` Jan Kara
2024-10-22 11:10 ` [PATCH 11/27] ext4: use reserved metadata blocks when splitting extent on endio Zhang Yi
2024-12-04 12:16 ` Jan Kara
2024-10-22 11:10 ` [PATCH 12/27] ext4: introduce seq counter for the extent status entry Zhang Yi
2024-12-04 12:42 ` Jan Kara
2024-12-06 8:55 ` Zhang Yi [this message]
2024-12-06 16:21 ` Jan Kara
2024-12-09 8:32 ` Zhang Yi
2024-12-10 12:57 ` Jan Kara
2024-12-11 7:59 ` Zhang Yi
2024-12-11 16:00 ` Jan Kara
2024-12-12 2:32 ` Zhang Yi
2024-10-22 11:10 ` [PATCH 13/27] ext4: add a new iomap aops for regular file's buffered IO path Zhang Yi
2024-10-22 11:10 ` [PATCH 14/27] ext4: implement buffered read iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 15/27] ext4: implement buffered write " Zhang Yi
2024-10-22 11:10 ` [PATCH 16/27] ext4: don't order data for inode with EXT4_STATE_BUFFERED_IOMAP Zhang Yi
2024-10-22 11:10 ` [PATCH 17/27] ext4: implement writeback iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 18/27] ext4: implement mmap " Zhang Yi
2024-10-22 11:10 ` [PATCH 19/27] ext4: do not always order data when partial zeroing out a block Zhang Yi
2024-10-22 11:10 ` [PATCH 20/27] ext4: do not start handle if unnecessary while " Zhang Yi
2024-10-22 11:10 ` [PATCH 21/27] ext4: implement zero_range iomap path Zhang Yi
2024-10-22 11:10 ` [PATCH 22/27] ext4: disable online defrag when inode using iomap buffered I/O path Zhang Yi
2024-10-22 11:10 ` [PATCH 23/27] ext4: disable inode journal mode when " Zhang Yi
2024-10-22 11:10 ` [PATCH 24/27] ext4: partially enable iomap for the buffered I/O path of regular files Zhang Yi
2024-10-22 11:10 ` [PATCH 25/27] ext4: enable large folio for regular file with iomap buffered I/O path Zhang Yi
2024-10-22 11:10 ` [PATCH 26/27] ext4: change mount options code style Zhang Yi
2024-10-22 11:10 ` [PATCH 27/27] ext4: introduce a mount option for iomap buffered I/O path Zhang Yi
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=c831732e-38c5-4a82-ab30-de17cff29584@huaweicloud.com \
--to=yi.zhang@huaweicloud.com \
--cc=adilger.kernel@dilger.ca \
--cc=chengzhihao1@huawei.com \
--cc=david@fromorbit.com \
--cc=djwong@kernel.org \
--cc=hch@infradead.org \
--cc=jack@suse.cz \
--cc=linux-ext4@vger.kernel.org \
--cc=linux-fsdevel@vger.kernel.org \
--cc=linux-kernel@vger.kernel.org \
--cc=ritesh.list@gmail.com \
--cc=tytso@mit.edu \
--cc=yangerkun@huawei.com \
--cc=yi.zhang@huawei.com \
--cc=yukuai3@huawei.com \
--cc=zokeefe@google.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).