* [BUG] bcachefs (early?) bucket allocator raciness
@ 2023-10-19 13:37 Brian Foster
2023-10-19 15:52 ` Kent Overstreet
0 siblings, 1 reply; 3+ messages in thread
From: Brian Foster @ 2023-10-19 13:37 UTC (permalink / raw)
To: linux-bcachefs
Hi Kent, All,
I recently observed a data corruption problem that is related to the
recently discovered issue of mounted fs' running with the early bucket
allocator instead of the freelist allocator. The immediate failure is
generic/113 producing various splats, the most common of which is a
duplicate backpointer issue. generic/113 is primarily an aio/dio stress
test.
I eventually tracked this down to an actual duplicate bucket allocation
in the early bucket allocator code. The race generally looks as follows:
- Task 1 lands in bch2_bucket_alloc_early(), selects key K from the
alloc btree, and then schedules (perhaps due to freelist_lock).
- Task 2 runs through the same alloc path and selects the same key K,
but proceeds to open the associated bucket, alloc/write to it,
complete the I/O and release the bucket (removing it from the hash).
- Task 1 continues with alloc key K. bch2_bucket_is_open() returns false
because the previously opened bucket has been removed from the hash
list. Therefore task 1 opens a new bucket for what is now no longer free
space and uses it for its associated write operation.
This eventually results in a splat related to duplicate backpointers or
multiple data types in a single bucket. The fundamental problem is
inconsistency between the key walk and bucket management. In theory, a
simple fix would be something like reader exclusion or revalidation
(i.e. seqlock type checks to revalidate the current/prospective key)
once the allocation side is under lock, but that would require more
experimentation to confirm, validate performance, etc.
Once it became apparent that this fs shouldn't be running the early
allocator in the first place, I worked around that problem to try and
see whether this sort of race could still be reproduced with the
freelist allocator. So far I've not been able to reproduce.
Note that one factor wrt the early allocator is that it doesn't
effectively update the alloc cursor, which means multiple threads can
come through and process the same sets of keys repeatedly. Fixing that
cursor issue [1] actually helps mitigate the race as well, even if not a
proper fix. However, I'm still not able to reproduce with the freelist
allocator even if I remove the cursor updates to try and simulate the
same sort of problem there. So far nothing stands out to me as obviously
different between how the alloc and freespace btrees are managed wrt
serialization against foreground allocation, so I'm not totally clear if
this is just a timing thing due to relative inefficiency of the early
allocator or if I'm just missing something in the code. Thoughts?
Brian
[1] https://lore.kernel.org/linux-bcachefs/20231019132746.279256-1-bfoster@redhat.com/
* Re: [BUG] bcachefs (early?) bucket allocator raciness
2023-10-19 13:37 [BUG] bcachefs (early?) bucket allocator raciness Brian Foster
@ 2023-10-19 15:52 ` Kent Overstreet
2023-10-19 17:25 ` Brian Foster
From: Kent Overstreet @ 2023-10-19 15:52 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs
On Thu, Oct 19, 2023 at 09:37:24AM -0400, Brian Foster wrote:
> Hi Kent, All,
>
> I recently observed a data corruption problem that is related to the
> recently discovered issue of mounted fs' running with the early bucket
> allocator instead of the freelist allocator. The immediate failure is
> generic/113 producing various splats, the most common of which is a
> duplicate backpointer issue. generic/113 is primarily an aio/dio stress
> test.
>
> I eventually tracked this down to an actual duplicate bucket allocation
> in the early bucket allocator code. The race generally looks as follows:
>
> - Task 1 lands in bch2_bucket_alloc_early(), selects key K from the
> alloc btree, and then schedules (perhaps due to freelist_lock).
>
> - Task 2 runs through the same alloc path and selects the same key K,
> but proceeds to open the associated bucket, alloc/write to it,
> complete the I/O and release the bucket (removing it from the hash).
>
> - Task 1 continues with alloc key K. bch2_bucket_is_open() returns false
> because the previously opened bucket has been removed from the hash
> list. Therefore task 1 opens a new bucket for what is now no longer free
> space and uses it for its associated write operation.
This shouldn't be possible because task 1 is holding the alloc key
locked, and task 2 has to update that same alloc key before releasing
the open bucket.
Except perhaps not - perhaps this is a key cache coherency issue?
We're not using a BTREE_ITER_CACHED iterator, because we're scanning and
we can't scan with key cache iterators. It's still supposed to be
coherent with the key cache; bch2_btree_iter_peek_slot() ->
btree_trans_peek_key_cache() checks if a key exists in the key cache and
returns that key instead of the key in the btree if it exists.
But it doesn't return with that slot in the key cache locked if the key
didn't exist in the key cache. Oops.
So we're going to need to keep an eye out for this issue occurring
elsewhere, and maybe come up with a real fix in the btree iterator code:
looking up a key in a cached btree without BTREE_ITER_CACHED _does_
return the correct key at that particular point in time, but it does
_not_ necessarily keep it locked for the duration of the transaction.
For now, we can fix this locally in bch2_bucket_alloc_early() with a
second BTREE_ITER_CACHED iterator - run some tests with freespace
initialization disabled, confirm that that's the issue, then go from
there.
* Re: [BUG] bcachefs (early?) bucket allocator raciness
2023-10-19 15:52 ` Kent Overstreet
@ 2023-10-19 17:25 ` Brian Foster
From: Brian Foster @ 2023-10-19 17:25 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs
On Thu, Oct 19, 2023 at 11:52:11AM -0400, Kent Overstreet wrote:
> On Thu, Oct 19, 2023 at 09:37:24AM -0400, Brian Foster wrote:
> > Hi Kent, All,
> >
> > I recently observed a data corruption problem that is related to the
> > recently discovered issue of mounted fs' running with the early bucket
> > allocator instead of the freelist allocator. The immediate failure is
> > generic/113 producing various splats, the most common of which is a
> > duplicate backpointer issue. generic/113 is primarily an aio/dio stress
> > test.
> >
> > I eventually tracked this down to an actual duplicate bucket allocation
> > in the early bucket allocator code. The race generally looks as follows:
> >
> > - Task 1 lands in bch2_bucket_alloc_early(), selects key K from the
> > alloc btree, and then schedules (perhaps due to freelist_lock).
> >
> > - Task 2 runs through the same alloc path and selects the same key K,
> > but proceeds to open the associated bucket, alloc/write to it,
> > complete the I/O and release the bucket (removing it from the hash).
> >
> > - Task 1 continues with alloc key K. bch2_bucket_is_open() returns false
> > because the previously opened bucket has been removed from the hash
> > list. Therefore task 1 opens a new bucket for what is now no longer free
> > space and uses it for its associated write operation.
>
> This shouldn't be possible because task 1 is holding the alloc key
> locked, and task 2 has to update that same alloc key before releasing
> the open bucket.
>
> Except perhaps not - perhaps this is a key cache coherency issue?
>
> We're not using a BTREE_ITER_CACHED iterator, because we're scanning and
> we can't scan with key cache iterators. It's still supposed to be
> coherent with the key cache; bch2_btree_iter_peek_slot() ->
> btree_trans_peek_key_cache() checks if a key exists in the key cache and
> returns that key instead of the key in the btree if it exists.
>
> But it doesn't return with that slot in the key cache locked if the key
> didn't exist in the key cache. Oops.
>
Ah, interesting. I wasn't aware of the lower level locking involved
here. This sounds like a plausible theory wrt key cache, but I'll have
to dig more into it to grok the locking. Thanks for the additional
context.
Brian
> So we're going to need to keep an eye out for this issue occurring
> elsewhere, and maybe come up with a real fix in the btree iterator code:
> looking up a key in a cached btree without BTREE_ITER_CACHED _does_
> return the correct key at that particular point in time, but it does
> _not_ necessarily keep it locked for the duration of the transaction.
>
> For now, we can fix this locally in bch2_bucket_alloc_early() with a
> second BTREE_ITER_CACHED iterator - run some tests with freespace
> initialization disabled, confirm that that's the issue, then go from
> there.
>