From: Boris Burkov <boris@bur.io>
To: Qu Wenruo <quwenruo.btrfs@gmx.com>
Cc: Josef Bacik <josef@toxicpanda.com>,
linux-btrfs@vger.kernel.org, kernel-team@fb.com
Subject: Re: [PATCH] btrfs: fix race in subpage sync writeback handling
Date: Mon, 14 Apr 2025 17:40:37 -0700 [thread overview]
Message-ID: <20250415004037.GC3218113@zen.localdomain> (raw)
In-Reply-To: <e13cf6fa-edd2-454f-8635-da8559b97ccc@gmx.com>
On Tue, Apr 15, 2025 at 07:15:42AM +0930, Qu Wenruo wrote:
>
>
> 在 2025/4/15 05:02, Josef Bacik 写道:
> > While debugging a fs corruption with subpage we noticed a potential race
> > in how we do tagging for writeback. Boris came up with the following
> > diagram to describe the race potential
> >
> > EB1 EB2
> > btree_write_cache_pages
> > tag_pages_for_writeback
> > filemap_get_folios_tag(TOWRITE)
> > submit_eb_page
> > submit_eb_subpage
> > for eb in folio:
> > write_one_eb
> > set_extent_buffer_dirty
> > btrfs_meta_folio_set_dirty
> > ...
> > filemap_fdatawrite_range
> > btree_write_cache_pages
> > tag_pages_for_writeback
> > folio_lock
> > btrfs_meta_folio_clear_dirty
> > btrfs_meta_folio_set_writeback // clear TOWRITE
> > folio_unlock
> > filemap_get_folios_tag(TOWRITE) //missed
> >
> > The problem here is that we'll call folio_start_writeback() the first
> > time we initiate writeback on one extent buffer, if we happened to dirty
> > the extent buffer behind the one we were currently writing in the first
> > task, and race in as described above, we would miss the TOWRITE tag as
> > it would have been cleared, and we will never initiate writeback on that
> > EB.
> >
> > This is kind of complicated for us, the best thing to do here is to
> > simply leave the TOWRITE tag in place, and only clear it if we aren't
> > dirty after we finish processing all the eb's in the folio.
> >
> > Fixes: c4aec299fa8f ("btrfs: introduce submit_eb_subpage() to submit a subpage metadata page")
> > Signed-off-by: Josef Bacik <josef@toxicpanda.com>
> > ---
> > fs/btrfs/extent_io.c | 23 +++++++++++++++++++++++
> > fs/btrfs/subpage.c | 2 +-
> > 2 files changed, 24 insertions(+), 1 deletion(-)
> >
> > diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
> > index 6cfd286b8bbc..5d09a47c1c28 100644
> > --- a/fs/btrfs/extent_io.c
> > +++ b/fs/btrfs/extent_io.c
> > @@ -2063,6 +2063,29 @@ static int submit_eb_subpage(struct folio *folio, struct writeback_control *wbc)
> > }
> > free_extent_buffer(eb);
> > }
> > + /*
> > + * Normally folio_start_writeback() will clear TAG_TOWRITE, but for
> > + * subpage we use __folio_start_writeback(folio, true), which keeps it
> > + * from clearing TOWRITE. This is because we walk the bitmap and
> > + * process each eb one at a time, and then locking the folio when we
> > + * process the eb. We could have somebody dirty behind us, and then
> > + * subsequently mark this range as TOWRITE. In that case we must not
> > + * clear TOWRITE or we will skip writing back the dirty folio.
> > + *
> > + * So here lock the folio, if it is clean we know we are done with it,
> > + * and we can clear TOWRITE.
> > + */
> > + folio_lock(folio);
> > + if (!folio_test_dirty(folio)) {
> > + XA_STATE(xas, &folio->mapping->i_pages, folio_index(folio));
> > + unsigned long flags;
> > +
> > + xas_lock_irqsave(&xas, flags);
> > + xas_load(&xas);
> > + xas_clear_mark(&xas, PAGECACHE_TAG_TOWRITE);
> > + xas_unlock_irqrestore(&xas, flags);
> > + }
> > + folio_unlock(folio);
> > return submitted;
> > }
> > diff --git a/fs/btrfs/subpage.c b/fs/btrfs/subpage.c
> > index d4f019233493..53140a9dad9d 100644
> > --- a/fs/btrfs/subpage.c
> > +++ b/fs/btrfs/subpage.c
> > @@ -454,7 +454,7 @@ void btrfs_subpage_set_writeback(const struct btrfs_fs_info *fs_info,
> > spin_lock_irqsave(&subpage->lock, flags);
> > bitmap_set(subpage->bitmaps, start_bit, len >> fs_info->sectorsize_bits);
> > if (!folio_test_writeback(folio))
> > - folio_start_writeback(folio);
> > + __folio_start_writeback(folio, true);
>
> This looks a little dangerous to me, as for data writeback we rely on
> folio_start_writeback() to clear the TOWRITE tag before this change.
During an extended test of a branch with both fixes, I did in fact see a
deadlock that looked like an extent_write_cache_pages going re-doing
writeback on a page that should have had the mark cleared. I'm not 100%
sure it is this, since it took 60+ runs of the logwrites crash test to hit
the deadlock during the instrumented workload
Anyway, I will take a look at and test your other proposed fix in
Josef's other email.
Thanks,
Boris
>
> Can we do a test on the dirty bitmap and only call clear TOWRITE tag when
> there is no dirty blocks?
>
> (This also means, we must clear the folio dirty before start writeback,
> there are some exceptions that needs to be addressed)
>
> By that, we do not need the manual tag handling in submit_eb_subpage().
>
> Thanks,
> Qu
>
> > spin_unlock_irqrestore(&subpage->lock, flags);
> > }
>
next prev parent reply other threads:[~2025-04-15 0:40 UTC|newest]
Thread overview: 5+ messages / expand[flat|nested] mbox.gz Atom feed top
2025-04-14 19:32 [PATCH] btrfs: fix race in subpage sync writeback handling Josef Bacik
2025-04-14 19:52 ` Boris Burkov
2025-04-14 21:45 ` Qu Wenruo
2025-04-15 0:40 ` Boris Burkov [this message]
2025-04-15 17:22 ` Josef Bacik
Reply instructions:
You may reply publicly to this message via plain-text email
using any one of the following methods:
* Save the following mbox file, import it into your mail client,
and reply-to-all from there: mbox
Avoid top-posting and favor interleaved quoting:
https://en.wikipedia.org/wiki/Posting_style#Interleaved_style
* Reply using the --to, --cc, and --in-reply-to
switches of git-send-email(1):
git send-email \
--in-reply-to=20250415004037.GC3218113@zen.localdomain \
--to=boris@bur.io \
--cc=josef@toxicpanda.com \
--cc=kernel-team@fb.com \
--cc=linux-btrfs@vger.kernel.org \
--cc=quwenruo.btrfs@gmx.com \
/path/to/YOUR_REPLY
https://kernel.org/pub/software/scm/git/docs/git-send-email.html
* If your mail client supports setting the In-Reply-To header
via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line
before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox