From: Josef Bacik <josef@toxicpanda.com>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: fix race in subpage sync writeback handling
Date: Tue, 15 Apr 2025 13:22:16 -0400 [thread overview]
Message-ID: <20250415172216.GA2701859@perftesting> (raw)
In-Reply-To: <e13cf6fa-edd2-454f-8635-da8559b97ccc@gmx.com>
On Tue, Apr 15, 2025 at 07:15:42AM +0930, Qu Wenruo wrote:
>
>
> 在 2025/4/15 05:02, Josef Bacik 写道:
> > While debugging a fs corruption with subpage we noticed a potential race
> > in how we do tagging for writeback. Boris came up with the following
> > diagram to describe the race potential
> >
> > EB1 EB2
> > btree_write_cache_pages
> > tag_pages_for_writeback
> > filemap_get_folios_tag(TOWRITE)
> > submit_eb_page
> > submit_eb_subpage
> > for eb in folio:
> > write_one_eb
> > set_extent_buffer_dirty
> > btrfs_meta_folio_set_dirty
> > ...
> > filemap_fdatawrite_range
> > btree_write_cache_pages
> > tag_pages_for_writeback
> > folio_lock
> > btrfs_meta_folio_clear_dirty
> > btrfs_meta_folio_set_writeback // clear TOWRITE
> > folio_unlock
> > filemap_get_folios_tag(TOWRITE) //missed
> >
> > The problem here is that we'll call folio_start_writeback() the first
> > time we initiate writeback on one extent buffer, if we happened to dirty
> > the extent buffer behind the one we were currently writing in the first
> > task, and race in as described above, we would miss the TOWRITE tag as
> > it would have been cleared, and we will never initiate writeback on that
> > EB.
> >
> > This is kind of complicated for us, the best thing to do here is to
> > simply leave the TOWRITE tag in place, and only clear it if we aren't
> > dirty after we finish processing all the eb's in the folio.
> >
> > Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/btrfs/extent_io.c | 23 +++++++++++++++++++++++
> > fs/btrfs/subpage.c | 2 +-
> > 2 files changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 6cfd286b8bbc..5d09a47c1c28 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2063,6 +2063,29 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> > }
> > free_extent_buffer(eb);
> > }
> > + /*
> > + * Normally folio_start_writeback() will clear TAG_TOWRITE, but for
> > + * subpage we use __folio_start_writeback(folio, true), which keeps it
> > + * from clearing TOWRITE. This is because we walk the bitmap and
> > + * process each eb one at a time, and then locking the folio when we
> > + * process the eb. We could have somebody dirty behind us, and then
> > + * subsequently mark this range as TOWRITE. In that case we must not
> > + * clear TOWRITE or we will skip writing back the dirty folio.
> > + *
> > + * So here lock the folio, if it is clean we know we are done with it,
> > + * and we can clear TOWRITE.
> > + */
> > + folio_lock(folio);
> > + if (!folio_test_dirty(folio)) {
> > + XA_STATE(xas, &folio->mapping->i_pages, folio_index(folio));
> > + unsigned long flags;
> > +
> > + xas_lock_irqsave(&xas, flags);
> > + xas_load(&xas);
> > + xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
> > + xas_unlock_irqrestore(&xas, flags);
> > + }
> > + folio_unlock(folio);
> > return submitted;
> > }
> > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> > index d4f019233493..53140a9dad9d 100644
> > --- a/fs/btrfs/subpage.c
> > +++ b/fs/btrfs/subpage.c
> > @@ -454,7 +454,7 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
> > spin_lock_irqsave(&subpage->lock, flags);
> > bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
> > if (!folio_test_writeback(folio))
> > - folio_start_writeback(folio);
> > + __folio_start_writeback(folio, true);
>
> This looks a little dangerous to me, as for data writeback we rely on
> folio_start_writeback() to clear the TOWRITE tag before this change.
>
> Can we do a test on the dirty bitmap and only call clear TOWRITE tag when
> there is no dirty blocks?
This accomplishes the same thing, we only modify the dirty bitmap under the
folio lock, so this is safe.
>
> (This also means, we must clear the folio dirty before start writeback,
> there are some exceptions that needs to be addressed)
>
Which is what we're doing, and what we're supposed to do, even not in subpage,
because that's how the tag clearing works for the DIRTY tag, it checks to see if
the folio is dirty and only clears the tag if DIRTY is cleared, and it does this
in folio_start_writeback().
Keep in mind this is a quick patch to fix the bug in a way that we can backport
to all the kernels that are affected, which is basically anything that has
subpage support. The followup series I'm writing will make this all much more
sane.
Thanks,
Josef
prev parent reply other threads:[~2025-04-15 17:22 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-14 19:32 [PATCH] btrfs: fix race in subpage sync writeback handling Josef Bacik
2025-04-14 19:52 ` Boris Burkov
2025-04-14 21:45 ` Qu Wenruo
2025-04-15 0:40 ` Boris Burkov
2025-04-15 17:22 ` Josef Bacik [this message]
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250415172216.GA2701859@perftesting \
--to=josef@toxicpanda.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox