From: Brian Foster <bfoster@redhat.com>
To: linux-bcachefs@vger.kernel.org
Subject: [BUG] bcachefs (early?) bucket allocator raciness
Date: Thu, 19 Oct 2023 09:37:24 -0400
Message-ID: <ZTExFP9G1fUIUgVZ@bfoster>
Hi Kent, All,
I recently observed a data corruption problem related to the recently
discovered issue of mounted filesystems running with the early bucket
allocator instead of the freelist allocator. The immediate failure is
generic/113 producing various splats, the most common of which is a
duplicate backpointer report. generic/113 is primarily an aio/dio stress
test.
I eventually tracked this down to an actual duplicate bucket allocation
in the early bucket allocator code. The race generally looks as follows:
- Task 1 lands in bch2_bucket_alloc_early(), selects key K from the
alloc btree, and then schedules (perhaps due to freelist_lock).
- Task 2 runs through the same alloc path, selects the same key K,
and proceeds to open the associated bucket, allocate/write to it,
complete the I/O and release the bucket (removing it from the hash).
- Task 1 continues with alloc key K. bch2_bucket_is_open() returns false
because the previously opened bucket has been removed from the hash
list. Therefore task 1 opens a new bucket for what is now no longer free
space and uses it for its associated write operation.
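To make the interleaving concrete, here is a toy single-threaded model of
that race. All names (AllocState, try_open, etc.) are made up for
illustration and do not correspond to the real bcachefs structures; the
point is only that the open-bucket hash is the sole guard, and it goes
quiet again once the first user releases the bucket:

```python
class AllocState:
    def __init__(self):
        self.free = {1: True, 2: True}   # alloc-btree view: bucket -> free?
        self.open_buckets = set()        # stands in for the open-bucket hash

    def select_free_key(self):
        # Walk the "alloc btree" and pick the first apparently-free bucket.
        return next(b for b, f in self.free.items() if f)

    def try_open(self, b):
        # bch2_bucket_is_open()-style check: the only guard against a
        # concurrent user of the same key in this model.
        if b in self.open_buckets:
            return False
        self.open_buckets.add(b)
        return True

    def write_and_release(self, b):
        self.free[b] = False          # data landed in bucket b
        self.open_buckets.discard(b)  # removed from the hash on release

s = AllocState()

k = s.select_free_key()        # task 1 selects key K, then schedules out

k2 = s.select_free_key()       # task 2 selects the same key K...
assert k2 == k
assert s.try_open(k2)          # ...opens it,
s.write_and_release(k2)        # writes to it, and releases it

# Task 1 resumes: bucket K is no longer in the open-bucket hash, so its
# stale selection passes the is-open check and the now-used bucket is
# opened a second time.
assert s.try_open(k)
assert not s.free[k]           # duplicate allocation of non-free space
```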
The race eventually results in a splat related to duplicate backpointers
or multiple data types in a single bucket. The fundamental problem is
inconsistency between the key walk and bucket management. In theory, a
simple fix would be something like reader exclusion or revalidation
(i.e. seqlock-type checks to revalidate the current/prospective key)
once the allocation side is under lock, but that would require more
experimentation to confirm correctness, validate performance, etc.
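For what it's worth, here's a rough sketch of the revalidation idea as a
plain retry loop, again as an illustrative Python model rather than the
real freelist_lock/seqlock API — every name here is invented, and the
"sequence count" is just an integer bumped on bucket state changes:

```python
class State:
    def __init__(self):
        self.free = {1: True}         # bucket -> free?
        self.open_buckets = set()
        self.seq = 0                  # bumped on every bucket state change

    def select_free_key(self):
        return next((b for b, f in self.free.items()
                     if f and b not in self.open_buckets), None)

    def try_open(self, b):
        if b in self.open_buckets:
            return False
        self.open_buckets.add(b)
        self.seq += 1
        return True

    def write_and_release(self, b):
        self.free[b] = False
        self.open_buckets.discard(b)
        self.seq += 1

def alloc_revalidated(state, interleave=None):
    # Select a key optimistically, then recheck the sequence count before
    # committing; retry the walk if another task changed bucket state.
    while True:
        start = state.seq
        k = state.select_free_key()
        if k is None:
            return None
        if interleave:                # simulate being scheduled out here
            interleave(state)
            interleave = None
        if state.seq == start and state.try_open(k):
            return k                  # selection was still valid
        # seq moved: the selected key may be stale, walk again

def racing_task(state):
    # Task 2 runs the full path while task 1 is scheduled out.
    b = state.select_free_key()
    state.try_open(b)
    state.write_and_release(b)

st = State()
# Task 1's selection is invalidated by the seq bump, so it retries and,
# with the only bucket now used, correctly allocates nothing.
assert alloc_revalidated(st, interleave=racing_task) is None
```

Whether something like this is acceptable on the allocation fast path is
exactly the performance question mentioned above.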
Once it became apparent that this fs shouldn't be running the early
allocator in the first place, I worked around that problem to see
whether this sort of race could still be reproduced with the freelist
allocator. So far I've not been able to reproduce it.
Note that one factor wrt the early allocator is that it doesn't
effectively update the alloc cursor, which means multiple threads can
come through and process the same sets of keys repeatedly. Fixing that
cursor issue [1] actually helps mitigate the race as well, even if it's
not a proper fix. However, I'm still not able to reproduce with the
freelist allocator even if I remove the cursor updates to try and
simulate the same sort of problem there. So far nothing stands out to me
as obviously different between how the alloc and freespace btrees are
managed wrt serialization against foreground allocation, so I'm not
totally clear whether this is just a timing thing due to the relative
inefficiency of the early allocator or if I'm just missing something in
the code. Thoughts?
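As an aside, the cursor point above can be illustrated with a toy model
(nothing bcachefs-specific; select_key and the list-based "btree" are
purely hypothetical): without advancing a shared cursor past the
selected key, every walker restarts from the same position and keeps
reselecting the same keys, which widens the race window considerably.

```python
def select_key(keys, cursor, advance):
    """Return the first key at or after cursor[0]; optionally advance
    the cursor past it so the next walker starts beyond this key."""
    for k in sorted(keys):
        if k >= cursor[0]:
            if advance:
                cursor[0] = k + 1
            return k
    return None

keys = [10, 20, 30]

# Without cursor updates, two walkers pick the same key:
cur = [0]
assert select_key(keys, cur, advance=False) == 10
assert select_key(keys, cur, advance=False) == 10

# With cursor updates, they get distinct keys:
cur = [0]
assert select_key(keys, cur, advance=True) == 10
assert select_key(keys, cur, advance=True) == 20
```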
Brian
[1] https://lore.kernel.org/linux-bcachefs/20231019132746.279256-1-bfoster@redhat.com/
Thread overview: 3+ messages
2023-10-19 13:37 Brian Foster [this message]
2023-10-19 15:52 ` [BUG] bcachefs (early?) bucket allocator raciness Kent Overstreet
2023-10-19 17:25 ` Brian Foster