All of lore.kernel.org
 help / color / mirror / Atom feed
From: Sasha Levin <sashal@kernel.org>
To: Joanne Koong <joannelkoong@gmail.com>
Cc: willy@infradead.org, linux-fsdevel@vger.kernel.org,
	linux-kernel@vger.kernel.org
Subject: Re: [RFC PATCH 1/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read
Date: Tue, 23 Dec 2025 20:31:57 -0500	[thread overview]
Message-ID: <aUtCjXbraDrq-Sxe@laps> (raw)
In-Reply-To: <CAJnrk1ZiJVNg-k+CSY_VqJ3sQOW1mo6C-9QT0bzgLT4sKGGCyg@mail.gmail.com>

On Tue, Dec 23, 2025 at 05:12:09PM -0800, Joanne Koong wrote:
>On Tue, Dec 23, 2025 at 2:30 PM Sasha Levin <sashal@kernel.org> wrote:
>>
>
>Hi Sasha,
>
>Thanks for your patch and for the detailed writeup.

Thanks for looking into this!

>> When iomap uses large folios, per-block uptodate tracking is managed via
>> iomap_folio_state (ifs). A race condition can cause the ifs uptodate bits
>> to become inconsistent with the folio's uptodate flag.
>>
>> The race occurs because folio_end_read() uses XOR semantics to atomically
>> set the uptodate bit and clear the locked bit:
>>
>>   Thread A (read completion):          Thread B (concurrent write):
>>   --------------------------------     --------------------------------
>>   iomap_finish_folio_read()
>>     spin_lock(state_lock)
>>     ifs_set_range_uptodate() -> true
>>     spin_unlock(state_lock)
>>                                        iomap_set_range_uptodate()
>>                                          spin_lock(state_lock)
>>                                          ifs_set_range_uptodate() -> true
>>                                          spin_unlock(state_lock)
>>                                          folio_mark_uptodate(folio)
>>     folio_end_read(folio, true)
>>       folio_xor_flags()  // XOR CLEARS uptodate!
>
>The part I'm confused about here is how this can happen between a
>concurrent read and write. My understanding is that the folio is
>locked when the read occurs and locked when the write occurs and both
>locks get dropped only when the read or write finishes. Looking at
>iomap code, I see iomap_set_range_uptodate() getting called in
>__iomap_write_begin() and __iomap_write_end() for the writes, but in
>both those places the folio lock is held while this is called. I'm not
>seeing how the read and write race in the diagram can happen, but
>maybe I'm missing something here?

Hmm, you're right... The folio lock should prevent concurrent read/write
access. Looking at this again, I suspect that FUSE was calling
folio_clear_uptodate() and folio_mark_uptodate() directly without updating the
ifs bits. For example, in fuse_send_write_pages() on write error, it calls
folio_clear_uptodate(folio) which clears the folio flag but leaves ifs still
showing all blocks uptodate?

>>
>> Result: folio is NOT uptodate, but ifs says all blocks ARE uptodate.
>
>Ah I see the WARN_ON_ONCE() in ifs_free:
>        WARN_ON_ONCE(ifs_is_fully_uptodate(folio, ifs) !=
>                        folio_test_uptodate(folio));
>
>Just to confirm, are you seeing that the folio is not marked uptodate
>but the ifs blocks are? Or are the ifs blocks not uptodate but the
>folio is?

The former: folio is NOT uptodate but ifs shows all blocks ARE uptodate
(state=0xffff with 16 blocks)

>>
>> Fix by checking read_bytes_pending in iomap_set_range_uptodate() under the
>> lock. If a read is in progress, skip calling folio_mark_uptodate() - the
>> read completion path will handle it via folio_end_read().
>>
>> The warning was triggered during FUSE-based filesystem (e.g., NTFS-3G)
>> unmount when the LTP writev03 test was run:
>>
>>   WARNING: fs/iomap/buffered-io.c at ifs_free
>>   Call trace:
>>    ifs_free
>>    iomap_invalidate_folio
>>    truncate_cleanup_folio
>>    truncate_inode_pages_range
>>    truncate_inode_pages_final
>>    fuse_evict_inode
>>    ...
>>    fuse_kill_sb_blk
>>
>> Fixes: 7a4847e54cc1 ("iomap: use folio_end_read()")
>> Assisted-by: claude-opus-4-5-20251101
>> Signed-off-by: Sasha Levin <sashal@kernel.org>
>> ---
>>  fs/fuse/dev.c          |  3 +-
>>  fs/fuse/file.c         |  6 ++--
>>  fs/iomap/buffered-io.c | 65 +++++++++++++++++++++++++++++++++++++++---
>>  include/linux/iomap.h  |  2 ++
>>  4 files changed, 68 insertions(+), 8 deletions(-)
>>
>> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
>> index 6d59cbc877c6..50e84e913589 100644
>> --- a/fs/fuse/dev.c
>> +++ b/fs/fuse/dev.c
>> @@ -11,6 +11,7 @@
>>  #include "fuse_dev_i.h"
>>
>>  #include <linux/init.h>
>> +#include <linux/iomap.h>
>>  #include <linux/module.h>
>>  #include <linux/poll.h>
>>  #include <linux/sched/signal.h>
>> @@ -1820,7 +1821,7 @@ static int fuse_notify_store(struct fuse_conn *fc, unsigned int size,
>>                 if (!folio_test_uptodate(folio) && !err && offset == 0 &&
>>                     (nr_bytes == folio_size(folio) || file_size == end)) {
>>                         folio_zero_segment(folio, nr_bytes, folio_size(folio));
>> -                       folio_mark_uptodate(folio);
>> +                       iomap_set_range_uptodate(folio, 0, folio_size(folio));
>>                 }
>>                 folio_unlock(folio);
>>                 folio_put(folio);
>> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
>> index 01bc894e9c2b..3abe38416199 100644
>> --- a/fs/fuse/file.c
>> +++ b/fs/fuse/file.c
>> @@ -1216,13 +1216,13 @@ static ssize_t fuse_send_write_pages(struct fuse_io_args *ia,
>>                 struct folio *folio = ap->folios[i];
>>
>>                 if (err) {
>> -                       folio_clear_uptodate(folio);
>> +                       iomap_clear_folio_uptodate(folio);
>>                 } else {
>>                         if (count >= folio_size(folio) - offset)
>>                                 count -= folio_size(folio) - offset;
>>                         else {
>>                                 if (short_write)
>> -                                       folio_clear_uptodate(folio);
>> +                                       iomap_clear_folio_uptodate(folio);
>>                                 count = 0;
>>                         }
>>                         offset = 0;
>> @@ -1305,7 +1305,7 @@ static ssize_t fuse_fill_write_pages(struct fuse_io_args *ia,
>>
>>                 /* If we copied full folio, mark it uptodate */
>>                 if (tmp == folio_size(folio))
>> -                       folio_mark_uptodate(folio);
>> +                       iomap_set_range_uptodate(folio, 0, folio_size(folio));
>>
>>                 if (folio_test_uptodate(folio)) {
>>                         folio_unlock(folio);
>> diff --git a/fs/iomap/buffered-io.c b/fs/iomap/buffered-io.c
>> index e5c1ca440d93..7ceda24cf6a7 100644
>> --- a/fs/iomap/buffered-io.c
>> +++ b/fs/iomap/buffered-io.c
>> @@ -74,8 +74,7 @@ static bool ifs_set_range_uptodate(struct folio *folio,
>>         return ifs_is_fully_uptodate(folio, ifs);
>>  }
>>
>> -static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>> -               size_t len)
>> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len)
>>  {
>>         struct iomap_folio_state *ifs = folio->private;
>>         unsigned long flags;
>> @@ -87,12 +86,50 @@ static void iomap_set_range_uptodate(struct folio *folio, size_t off,
>>         if (ifs) {
>>                 spin_lock_irqsave(&ifs->state_lock, flags);
>>                 uptodate = ifs_set_range_uptodate(folio, ifs, off, len);
>> +               /*
>> +                * If a read is in progress, we must NOT call folio_mark_uptodate
>> +                * here. The read completion path (iomap_finish_folio_read or
>> +                * iomap_read_end) will call folio_end_read() which uses XOR
>> +                * semantics to set the uptodate bit. If we set it here, the XOR
>> +                * in folio_end_read() will clear it, leaving the folio not
>> +                * uptodate while the ifs says all blocks are uptodate.
>> +                */
>> +               if (uptodate && ifs->read_bytes_pending)
>> +                       uptodate = false;
>
>Does the warning you saw in ifs_free() still go away without the
>changes here to iomap_set_range_uptodate() or is this change here
>necessary?  I'm asking mostly because I'm not seeing how
>iomap_set_range_uptodate() can be called while the read is in
>progress, as the logic should be already protected by the folio locks.

Yes, the warning goes away even without this part. I don't think that this is
necessary - I just kept it while figuring out the race.

>>                 spin_unlock_irqrestore(&ifs->state_lock, flags);
>>         }
>>
>>         if (uptodate)
>>                 folio_mark_uptodate(folio);
>>  }
>> +EXPORT_SYMBOL_GPL(iomap_set_range_uptodate);
>> +
>> +void iomap_clear_folio_uptodate(struct folio *folio)
>> +{
>> +       struct iomap_folio_state *ifs = folio->private;
>> +
>> +       if (ifs) {
>> +               struct inode *inode = folio->mapping->host;
>> +               unsigned int nr_blocks = i_blocks_per_folio(inode, folio);
>> +               unsigned long flags;
>> +
>> +               spin_lock_irqsave(&ifs->state_lock, flags);
>> +               /*
>> +                * If a read is in progress, don't clear the uptodate state.
>> +                * The read completion path will handle the folio state, and
>> +                * clearing here would race with iomap_finish_folio_read()
>> +                * potentially causing ifs/folio uptodate state mismatch.
>> +                */
>> +               if (ifs->read_bytes_pending) {
>> +                       spin_unlock_irqrestore(&ifs->state_lock, flags);
>> +                       return;
>> +               }
>> +               bitmap_clear(ifs->state, 0, nr_blocks);
>> +               spin_unlock_irqrestore(&ifs->state_lock, flags);
>> +       }
>> +       folio_clear_uptodate(folio);
>> +}
>> +EXPORT_SYMBOL_GPL(iomap_clear_folio_uptodate);
>>
>>  /*
>>   * Find the next dirty block in the folio. end_blk is inclusive.
>> @@ -399,8 +436,17 @@ void iomap_finish_folio_read(struct folio *folio, size_t off, size_t len,
>>                 spin_unlock_irqrestore(&ifs->state_lock, flags);
>>         }
>>
>> -       if (finished)
>> +       if (finished) {
>> +               /*
>> +                * If uptodate is true but the folio is already marked uptodate,
>> +                * folio_end_read's XOR semantics would clear the uptodate bit.
>> +                * This should never happen because iomap_set_range_uptodate()
>> +                * skips calling folio_mark_uptodate() when read_bytes_pending
>> +                * is non-zero, ensuring only the read completion path sets it.
>> +                */
>> +               WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
>
>Matthew pointed out in another thread [1] that folio_end_read() has
>already the warnings against double-unlocks or double-uptodates
>in-built:
>
>        VM_BUG_ON_FOLIO(!folio_test_locked(folio), folio);
>        VM_BUG_ON_FOLIO(success && folio_test_uptodate(folio), folio);
>
>but imo the WARN_ON_ONCE() here is nice to have too, as I don't think
>most builds enable CONFIG_DEBUG_VM.
>
>[1] https://lore.kernel.org/linux-fsdevel/aPu1ilw6Tq6tKPrf@casper.infradead.org/
>
>Thanks,
>Joanne
>>                 folio_end_read(folio, uptodate);
>> +       }
>>  }
>>  EXPORT_SYMBOL_GPL(iomap_finish_folio_read);
>>
>> @@ -481,8 +527,19 @@ static void iomap_read_end(struct folio *folio, size_t bytes_submitted)
>>                 if (end_read)
>>                         uptodate = ifs_is_fully_uptodate(folio, ifs);
>>                 spin_unlock_irq(&ifs->state_lock);
>> -               if (end_read)
>> +               if (end_read) {
>> +                       /*
>> +                        * If uptodate is true but the folio is already marked
>> +                        * uptodate, folio_end_read's XOR semantics would clear
>> +                        * the uptodate bit. This should never happen because
>> +                        * iomap_set_range_uptodate() skips calling
>> +                        * folio_mark_uptodate() when read_bytes_pending is
>> +                        * non-zero, ensuring only the read completion path
>> +                        * sets it.
>> +                        */
>> +                       WARN_ON_ONCE(uptodate && folio_test_uptodate(folio));
>>                         folio_end_read(folio, uptodate);
>> +               }
>>         } else if (!bytes_submitted) {
>>                 /*
>>                  * If there were no bytes submitted, this means we are
>> diff --git a/include/linux/iomap.h b/include/linux/iomap.h
>> index 520e967cb501..3c2ad88d16b6 100644
>> --- a/include/linux/iomap.h
>> +++ b/include/linux/iomap.h
>> @@ -345,6 +345,8 @@ void iomap_read_folio(const struct iomap_ops *ops,
>>  void iomap_readahead(const struct iomap_ops *ops,
>>                 struct iomap_read_folio_ctx *ctx);
>>  bool iomap_is_partially_uptodate(struct folio *, size_t from, size_t count);
>> +void iomap_set_range_uptodate(struct folio *folio, size_t off, size_t len);
>> +void iomap_clear_folio_uptodate(struct folio *folio);
>>  struct folio *iomap_get_folio(struct iomap_iter *iter, loff_t pos, size_t len);
>>  bool iomap_release_folio(struct folio *folio, gfp_t gfp_flags);
>>  void iomap_invalidate_folio(struct folio *folio, size_t offset, size_t len);
>> --
>> 2.51.0
>>

-- 
Thanks,
Sasha

  reply	other threads:[~2025-12-24  1:31 UTC|newest]

Thread overview: 50+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2025-09-26  0:25 [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Joanne Koong
2025-09-26  0:25 ` [PATCH v5 01/14] iomap: move bio read logic into helper function Joanne Koong
2025-09-26  0:25 ` [PATCH v5 02/14] iomap: move read/readahead bio submission " Joanne Koong
2025-09-26  0:25 ` [PATCH v5 03/14] iomap: store read/readahead bio generically Joanne Koong
2025-09-26  0:25 ` [PATCH v5 04/14] iomap: iterate over folio mapping in iomap_readpage_iter() Joanne Koong
2025-09-26  0:26 ` [PATCH v5 05/14] iomap: rename iomap_readpage_iter() to iomap_read_folio_iter() Joanne Koong
2025-09-26  0:26 ` [PATCH v5 06/14] iomap: rename iomap_readpage_ctx struct to iomap_read_folio_ctx Joanne Koong
2025-09-26  0:26 ` [PATCH v5 07/14] iomap: track pending read bytes more optimally Joanne Koong
2025-10-23 19:34   ` Brian Foster
2025-10-24  0:01     ` Joanne Koong
2025-10-24 16:25       ` Joanne Koong
2025-10-24 17:14         ` Brian Foster
2025-10-24 19:48           ` Joanne Koong
2025-10-24 21:55             ` Joanne Koong
2025-10-27 12:16               ` Brian Foster
2025-10-24 17:21         ` Matthew Wilcox
2025-10-24 19:22           ` Joanne Koong
2025-10-24 20:59             ` Matthew Wilcox
2025-10-24 21:37               ` Darrick J. Wong
2025-10-24 21:58               ` Joanne Koong
2025-09-26  0:26 ` [PATCH v5 08/14] iomap: set accurate iter->pos when reading folio ranges Joanne Koong
2025-09-26  0:26 ` [PATCH v5 09/14] iomap: add caller-provided callbacks for read and readahead Joanne Koong
2025-09-26  0:26 ` [PATCH v5 10/14] iomap: move buffered io bio logic into new file Joanne Koong
2025-09-26  0:26 ` [PATCH v5 11/14] iomap: make iomap_read_folio() a void return Joanne Koong
2025-09-26  0:26 ` [PATCH v5 12/14] fuse: use iomap for read_folio Joanne Koong
2025-12-23 22:30   ` [RFC PATCH 0/1] iomap: fix race between iomap_set_range_uptodate and folio_end_read Sasha Levin
2025-12-23 22:30     ` [RFC PATCH 1/1] " Sasha Levin
2025-12-24  1:12       ` Joanne Koong
2025-12-24  1:31         ` Sasha Levin [this message]
2026-02-07  7:16           ` Wei Gao
2026-02-09 19:08             ` Joanne Koong
2026-02-10  0:12               ` Wei Gao
2026-02-10  0:20                 ` Joanne Koong
2026-02-10  0:40                   ` Wei Gao
2026-02-10 22:18                     ` Joanne Koong
2026-02-11  0:00                       ` Sasha Levin
2026-02-11  3:11                       ` Matthew Wilcox
2026-02-11 19:33                         ` Joanne Koong
2026-02-11 21:03                           ` Matthew Wilcox
2026-02-11 23:13                             ` Joanne Koong
2026-02-12 19:31                               ` Matthew Wilcox
2026-02-13  0:53                                 ` Joanne Koong
2025-12-24  2:10         ` Matthew Wilcox
2025-12-24 15:43           ` Sasha Levin
2025-12-24 17:27             ` Matthew Wilcox
2025-12-24 21:21               ` Sasha Levin
2025-12-30  0:58                 ` Joanne Koong
2025-09-26  0:26 ` [PATCH v5 13/14] fuse: use iomap for readahead Joanne Koong
2025-09-26  0:26 ` [PATCH v5 14/14] fuse: remove fc->blkbits workaround for partial writes Joanne Koong
2025-09-29  9:38 ` [PATCH v5 00/14] fuse: use iomap for buffered reads + readahead Christian Brauner

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=aUtCjXbraDrq-Sxe@laps \
    --to=sashal@kernel.org \
    --cc=joannelkoong@gmail.com \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=willy@infradead.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.