Re: bcachefs backpointers errors in no write buf mode

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: bcachefs backpointers errors in no write buf mode
Date: Wed, 12 Jul 2023 12:52:14 -0400	[thread overview]
Message-ID: <ZK7aPhB9QV+56R/J@bfoster> (raw)
In-Reply-To: <20230712154047.hqw7rdwjxasjvsdn@moria.home.lan>

On Wed, Jul 12, 2023 at 11:40:47AM -0400, Kent Overstreet wrote:
> On Wed, Jul 12, 2023 at 11:01:06AM -0400, Brian Foster wrote:
> > Yep. I think I have a general idea of what's going on here. The first
> > thing I realized after looking through some tracing was that this seems
> > to correlate with the non-fast fallback paths for write buffer flushing.
> > For example, if I unconditionally redirect wb key flushes to the
> > trans_commit path in flush_one(), that 1. dramatically increases the
> > reproduction rate and 2. also reproduces a similar issue for lru
> > entries. I think the only reason we don't reproduce the lru variant of
> > the problem with generic/388 alone is that I don't see nearly as many
> > lru updates as backpointer updates, and so the workload just doesn't
> > happen to facilitate it.
> 
> Fantastic :)
> 
> I just pulled dynamic fault injection into the bcachefs tree - to make
> sure this path gets tested better in the future we can add a
> dynamic_fault("race") here.
> 
> (first, I have to dust off the fault injection test integration...)
> 
> > With that, I tweaked my kernel to unconditionally redirect backpointer
> > btree updates to the non-fast flush paths for debug purposes. From
> > there, I've seen a few different variants of the problem that I think
> > just happen to relate to the different extent modifying operations that
> > fsstress happens to be running. The most simple example that describes
> > the fundamental problem is as follows:
> > 
> > 1. Extent allocated and associated bp inserted to write buffer index 0.
> > 2. Flush begins on idx 0, active idx switches to 1.
> > 3. Previously alloc'd extent is deleted. Associated deleted bp key
> > inserted to buffer idx 1.
> > 4. Flush of idx 0 lands on the initial bp key for the extent. The
> > trans_commit path rejournals the key for the backpointer btree update.
> > 5. The fs shuts down. Recovery finds the key updates in the order they
> > were journaled by this sequence (update->delete->update), which doesn't
> > match the actual runtime changes of the extent btree (update->delete).
> > This puts an entry in the backpointers btree that doesn't reflect
> > reality.
> > 
> > So essentially the current rejournaling by the non-fast write buffer
> > flush paths creates an on-disk journal ordering inconsistency wrt
> > changes in associated btrees.
> 
> Yep, sounds like you nailed it.
> 
> > I tried to skip journaling in the write buffer flush commit path as a
> > quick experiment, but that doesn't work on its own because we can then
> > lose key updates in the journal held by the write buffer pin. I.e., the
> > tail of the journal can then move past the key before it's been updated
> > in the btree because the latter update hasn't journaled it (this results
> > in similar backpointer errors and I can see the key getting lost by
> > looking at list_journal vs. list_journal -a).
> 
> Correct
> 
> > OTOH, it looks like the fast flush path gets this right without
> > rejournaling because it adds a pin to the btree buffer based on the seq
> > of the already journaled key. IIUC, this essentially allows the btree
> > update to refer to the journal entry created when the key was inserted
> > into the write buffer. I'm wondering if the right way to fix this is to
> > allow the trans update/commit path to basically do the same thing. I.e.,
> > somehow or another allow a trans update of an already journaled key to
> > inherit the seq of the key for a btree pin rather than explicitly
> > rejournal it at the current seq of the transaction. Thoughts?
> 
> I think that could work.
> 
> What if the entry being flushed is already at the tail of the journal,
> and journal reclaim is already trying to flush everything at that
> sequence number? Since we'll often have multiple write buffer keys per
> journal entry, and per btree node, this might lead to btree nodes
> getting flushed multiple times redundantly.
> 

I think I follow what you mean, but isn't this what the fast path is
doing already? (Not to say that it isn't a potential issue to be aware
of.)

BTW, I haven't dug much into the btree code... If a btree node has a
bunch of pins of different sequence numbers for whatever key updates are
pending, is the whole set flushed out on the first pin flush (releasing
all pins)? Or is it an incremental flush to update keys just up to the
specified pin seq?

> I suspect this would only happen in artificial circumstances - it'll be
> based on the ratio of write buffer size:journal size - so if we add a
> slowpath tracepoint/counter, and make sure to to some testing where we
> provoke this condition that'll be sufficient.
> 
> Maybe write a small journal, large write buffer test?
> 

Yeah, I was wondering if/what sort of non-default configuration might
more reliably reproduce this issue. That sounds like something to try.

> I was also looking at adding the same "skip rejournalling" optimization
> to key cache flushing - this could take care of that too.
> 

Indeed. Even if that is an optimization for key cache vs. more of a
correctness issue for write buffer, it sounds like quite a bit of
functional overlap.

Anyways, I'll see if I can prototype something to verify whether this
addresses the backpointer issue before getting too far into the weeds.
;) Thanks for the feedback.

Brian

next prev parent reply	other threads:[~2023-07-12 16:50 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 17:32 bcachefs backpointers errors in no write buf mode Brian Foster
2023-07-07 17:58 ` Kent Overstreet
2023-07-10 13:37   ` Brian Foster
2023-07-10 14:56     ` Kent Overstreet
2023-07-12 15:01       ` Brian Foster
2023-07-12 15:40         ` Kent Overstreet
2023-07-12 16:52           ` Brian Foster [this message]
2023-07-12 17:10             ` Kent Overstreet
2023-07-07 20:37 ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZK7aPhB9QV+56R/J@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.