From: Sasha Levin <sashal@kernel.org>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: willy@infradead.org, linux-fsdevel@vger.kernel.org,
linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
Date: Tue, 23 Dec 2025 20:31:57 -0500 [thread overview]
Message-ID: <aUtCjXbraDrq-Sxe@laps> (raw)
In-Reply-To: <CAJnrk1ZiJVNg-k+CSY_VqJ3sQOW1mo6C-9QT0bzgLT4sKGGCyg@mail.gmail.com>
On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
>On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
>>
>
>Hi Sasha,
>
>Thanks for your patch and for the detailed writeup.
Thanks for looking into this!
>> When iomap uses large folios, per-block uptodate tracking is managed via
>> iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
>> to become inconsistent with the folio's uptodate flag.
>>
>> The race occurs because folio_end_read() uses XOR semantics to atomically
>> set the uptodate bit and clear the locked bit:
>>
>> Thread A (read completion): Thread B (concurrent write):
>> -------------------------------- --------------------------------
>> iomap_finish_folio_read()
>> spin_lock(state_lock)
>> ifs_set_range_uptodate() -> true
>> spin_unlock(state_lock)
>> iomap_set_range_uptodate()
>> spin_lock(state_lock)
>> ifs_set_range_uptodate() -> true
>> spin_unlock(state_lock)
>> folio_mark_uptodate(folio)
>> folio_end_read(folio, true)
>> folio_xor_flags() // XOR CLEARS uptodate!
>
>The part I'm confused about here is how this can happen between a
>concurrent read and write. My understanding is that the folio is
>locked when the read occurs and locked when the write occurs and both
>locks get dropped only when the read or write finishes. Looking at
>iomap code, I see iomap_set_range_uptodate() getting called in
>__iomap_write_begin() and __iomap_write_end() for the writes, but in
>both those places the folio lock is held while this is called. I'm not
>seeing how the read and write race in the diagram can happen, but
>maybe I'm missing something here?
Hmm, you're right... the folio lock should prevent concurrent read/write
access. Looking at this again, I suspect the real problem is that FUSE was
calling folio_clear_uptodate() and folio_mark_uptodate() directly without
updating the ifs bits. For example, on write error fuse_send_write_pages()
calls folio_clear_uptodate(folio), which would clear the folio flag but leave
the ifs still showing all blocks uptodate.
>>
>> Result: folio is NOT uptodate, but ifs says all blocks ARE uptodate.
>
>Ah I see the WARN_ON_ONCE() in ifs_free:
> WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
> folio_test_uptodate(folio));
>
>Just to confirm, are you seeing that the folio is not marked uptodate
>but the ifs blocks are? Or are the ifs blocks not uptodate but the
>folio is?
The former: the folio is NOT uptodate, but the ifs shows all blocks ARE
uptodate (state=0xffff with 16 blocks).
>>
>> Fix by checking read_bytes_pending in iomap_set_range_uptodate() under the
>> lock. If a read is in progress, skip calling folio_mark_uptodate() - the
>> read completion path will handle it via folio_end_read().
>>
>> The warning was triggered during FUSE-based filesystem (e.g., NTFS-3G)
>> unmount when the LTP writev03 test was run:
>>
>> WARNING: fs/iomap/buffered-io.c at ifs_free
>> Call trace:
>> ifs_free
>> iomap_invalidate_folio
>> truncate_cleanup_folio
>> truncate_inode_pages_range
>> truncate_inode_pages_final
>> fuse_evict_inode
>> ...
>> fuse_kill_sb_blk
>>
>> Fixes: 7a4847e54cc1 ("iomap: use folio_end_read()")
>> Assisted-by: claude-opus-4-5-20251101
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>> ---
>> fs/fuse/dev.c | 3 +-
>> fs/fuse/file.c | 6 ++--
>> fs/iomap/buffered-io.c | 65 +++++++++++++++++++++++++++++++++++++++---
>> include/linux/iomap.h | 2 ++
>> 4 files changed, 68 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 6d59cbc877c6..50e84e913589 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -11,6 +11,7 @@
>> #include "fuse_dev_i.h"
>>
>> #include <linux/init.h>
>> +#include <linux/iomap.h>
>> #include <linux/module.h>
>> #include <linux/poll.h>
>> #include <linux/sched/signal.h>
>> @@ -1820,7 +1821,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
>> if (!folio_test_uptodate(folio) && !err && offset == 0 &&
>> (nr_bytes == folio_size(folio) || file_size == end)) {
>> folio_zero_segment(folio, nr_bytes, folio_size(folio));
>> - folio_mark_uptodate(folio);
>> + iomap_set_range_uptodate(folio, 0, folio_size(folio));
>> }
>> folio_unlock(folio);
>> folio_put(folio);
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 01bc894e9c2b..3abe38416199 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -1216,13 +1216,13 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>> struct folio *folio = ap->folios[i];
>>
>> if (err) {
>> - folio_clear_uptodate(folio);
>> + iomap_clear_folio_uptodate(folio);
>> } else {
>> if (count >= folio_size(folio) - offset)
>> count -= folio_size(folio) - offset;
>> else {
>> if (short_write)
>> - folio_clear_uptodate(folio);
>> + iomap_clear_folio_uptodate(folio);
>> count = 0;
>> }
>> offset = 0;
>> @@ -1305,7 +1305,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
>>
>> /* If we copied full folio, mark it uptodate */
>> if (tmp == folio_size(folio))
>> - folio_mark_uptodate(folio);
>> + iomap_set_range_uptodate(folio, 0, folio_size(folio));
>>
>> if (folio_test_uptodate(folio)) {
>> folio_unlock(folio);
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index e5c1ca440d93..7ceda24cf6a7 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -74,8 +74,7 @@ static bool ifs_set_range_uptodate(struct folio *folio,
>> return ifs_is_fully_uptodate(folio, ifs);
>> }
>>
>> -static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>> - size_t len)
>> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len)
>> {
>> struct iomap_folio_state *ifs = folio->private;
>> unsigned long flags;
>> @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>> if (ifs) {
>> spin_lock_irqsave(&ifs->state_lock, flags);
>> uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
>> + /*
>> + * If a read is in progress, we must NOT call folio_mark_uptodate
>> + * here. The read completion path (iomap_finish_folio_read or
>> + * iomap_read_end) will call folio_end_read() which uses XOR
>> + * semantics to set the uptodate bit. If we set it here, the XOR
>> + * in folio_end_read() will clear it, leaving the folio not
>> + * uptodate while the ifs says all blocks are uptodate.
>> + */
>> + if (uptodate && ifs->read_bytes_pending)
>> + uptodate = false;
>
>Does the warning you saw in ifs_free() still go away without the
>changes here to iomap_set_range_uptodate() or is this change here
>necessary? I'm asking mostly because I'm not seeing how
>iomap_set_range_uptodate() can be called while the read is in
>progress, as the logic should be already protected by the folio locks.
Yes, the warning goes away even without this part. I don't think this part is
necessary - I just kept it in while figuring out the race.
>> spin_unlock_irqrestore(&ifs->state_lock, flags);
>> }
>>
>> if (uptodate)
>> folio_mark_uptodate(folio);
>> }
>> +EXPORT_SYMBOL_GPL(iomap_set_range_uptodate);
>> +
>> +void iomap_clear_folio_uptodate(struct folio *folio)
>> +{
>> + struct iomap_folio_state *ifs = folio->private;
>> +
>> + if (ifs) {
>> + struct inode *inode = folio->mapping->host;
>> + unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>> + unsigned long flags;
>> +
>> + spin_lock_irqsave(&ifs->state_lock, flags);
>> + /*
>> + * If a read is in progress, don't clear the uptodate state.
>> + * The read completion path will handle the folio state, and
>> + * clearing here would race with iomap_finish_folio_read()
>> + * potentially causing ifs/folio uptodate state mismatch.
>> + */
>> + if (ifs->read_bytes_pending) {
>> + spin_unlock_irqrestore(&ifs->state_lock, flags);
>> + return;
>> + }
>> + bitmap_clear(ifs->state, 0, nr_blocks);
>> + spin_unlock_irqrestore(&ifs->state_lock, flags);
>> + }
>> + folio_clear_uptodate(folio);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_clear_folio_uptodate);
>>
>> /*
>> * Find the next dirty block in the folio. end_blk is inclusive.
>> @@ -399,8 +436,17 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
>> spin_unlock_irqrestore(&ifs->state_lock, flags);
>> }
>>
>> - if (finished)
>> + if (finished) {
>> + /*
>> + * If uptodate is true but the folio is already marked uptodate,
>> + * folio_end_read's XOR semantics would clear the uptodate bit.
>> + * This should never happen because iomap_set_range_uptodate()
>> + * skips calling folio_mark_uptodate() when read_bytes_pending
>> + * is non-zero, ensuring only the read completion path sets it.
>> + */
>> + WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
>
>Matthew pointed out in another thread [1] that folio_end_read() has
>already the warnings against double-unlocks or double-uptodates
>in-built:
>
> VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
> VM_BUG_ON_FOLIO(success && folio_test_uptodate(folio), folio);
>
>but imo the WARN_ON_ONCE() here is nice to have too, as I don't think
>most builds enable CONFIG_DEBUG_VM.
>
>[1] https://lore.kernel.org/linux-fsdevel/aPu1ilw6Tq6tKPrf@casper.infradead.org/
>
>Thanks,
>Joanne
>> folio_end_read(folio, uptodate);
>> + }
>> }
>> EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
>>
>> @@ -481,8 +527,19 @@ static void iomap_read_end(struct folio *folio, size_t bytes_submitted)
>> if (end_read)
>> uptodate = ifs_is_fully_uptodate(folio, ifs);
>> spin_unlock_irq(&ifs->state_lock);
>> - if (end_read)
>> + if (end_read) {
>> + /*
>> + * If uptodate is true but the folio is already marked
>> + * uptodate, folio_end_read's XOR semantics would clear
>> + * the uptodate bit. This should never happen because
>> + * iomap_set_range_uptodate() skips calling
>> + * folio_mark_uptodate() when read_bytes_pending is
>> + * non-zero, ensuring only the read completion path
>> + * sets it.
>> + */
>> + WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
>> folio_end_read(folio, uptodate);
>> + }
>> } else if (!bytes_submitted) {
>> /*
>> * If there were no bytes submitted, this means we are
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 520e967cb501..3c2ad88d16b6 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -345,6 +345,8 @@ void iomap_read_folio(const struct iomap_ops *ops,
>> void iomap_readahead(const struct iomap_ops *ops,
>> struct iomap_read_folio_ctx *ctx);
>> bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len);
>> +void iomap_clear_folio_uptodate(struct folio *folio);
>> struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
>> bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>> void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>> --
>> 2.51.0
>>
--
Thanks,
Sasha