Re: bcachefs backpointers errors in no write buf mode

All of lore.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: bcachefs backpointers errors in no write buf mode
Date: Wed, 12 Jul 2023 11:01:06 -0400	[thread overview]
Message-ID: <ZK7AMu6MBLoLX8vJ@bfoster> (raw)
In-Reply-To: <20230710145646.3vzxwzvrbbmxc57h@moria.home.lan>

On Mon, Jul 10, 2023 at 10:56:46AM -0400, Kent Overstreet wrote:
> On Mon, Jul 10, 2023 at 09:37:23AM -0400, Brian Foster wrote:
> > This wording is a little confusing.. it's not clear to me why the lack
> > of ref counts directly means they have to run in reverse order. What
> > happens if such triggers are not run in reverse order? From the old (now
> > removed) ordering fix patch, I take it the issue is that a backpointer
> > (for example) key would be added and then immediately removed, right?
> > 
> > I wonder if this bit of docs would be more clear to do something like 1.
> > explain what specific key types have particular ordering rules (and why,
> > such as the reference count example given above), and then 2.  explain
> > why the high level trigger invocation code has the current ordering (if
> > it matters). I suspect the part below just goes away with that write buf
> > ordering patch, at least for the time being..
> 
> Yeah, that sounds like a better documentation approach.
> 
> I'll work up a new patch (and try to do some git spelunking to make sure
> I remember all the context).
> 
> > > Yeah I tend to think adding a bch2_trans_update_ordered() that matches
> > > the behaviour of bch2_trans_update_buffered() is the way to go - the
> > > whole point of the nowritebuffer path is to give us something simpler
> > > but with matching behaviour of the write buffer path, to aid in
> > > isolating bugs, so that only makes sense if they actually do behave the
> > > same way.
> > > 
> > 
> > I take it your follow up reply (re: the ordering patch) invalidates the
> > thought around bch2_trans_update_ordered(), but this is useful context
> > on !writebuffer nonetheless. I'll see if I can just bypass these
> > existing backpointer errors and narrow further in on the actual missing
> > bp issue...
> 
> Yeah, it turned out the recent trigger ordering patches were definitely
> to blame. With those patches yanked, I was able to run the
> merge_torture_flakey test for ~24 hours in
> backpointers_no_use_write_buffer mode - and with the write buffer on,
> we're not hitting the "backpointer not found" errors as often as we were
> before.
> 

Ok, I realized after the fact that those patches caused the runtime
bp errors. I've not seen them since those patches were removed.

> So we probably need to think a bit more about what happens differently
> when the write buffer is in use; we're doing exactly the same updates,
> but flushing them to the btree is a bit delayed and the ordering is
> different. I think we need to hone in on that.
> 

Yep. I think I have a general idea of what's going on here. The first
thing I realized after looking through some tracing was that this seems
to correlate with the non-fast fallback paths for write buffer flushing.
For example, if I unconditionally redirect wb key flushes to the
trans_commit path in flush_one(), that 1. dramatically increases the
reproduction rate and 2. also reproduces a similar issue for lru
entries. I think the only reason we don't reproduce the lru variant of
the problem with generic/388 alone is that I don't see nearly as many
lru updates as backpointer updates, and so the workload just doesn't
happen to facilitate it.

With that, I tweaked my kernel to unconditionally redirect backpointer
btree updates to the non-fast flush paths for debug purposes. From
there, I've seen a few different variants of the problem that I think
just happen to relate to the different extent modifying operations that
fsstress happens to be running. The most simple example that describes
the fundamental problem is as follows:

1. Extent allocated and associated bp inserted to write buffer index 0.
2. Flush begins on idx 0, active idx switches to 1.
3. Previously alloc'd extent is deleted. Associated deleted bp key
inserted to buffer idx 1.
4. Flush of idx 0 lands on the initial bp key for the extent. The
trans_commit path rejournals the key for the backpointer btree update.
5. The fs shuts down. Recovery finds the key updates in the order they
were journaled by this sequence (update->delete->update), which doesn't
match the actual runtime changes of the extent btree (update->delete).
This puts an entry in the backpointers btree that doesn't reflect
reality.

So essentially the current rejournaling by the non-fast write buffer
flush paths creates an on-disk journal ordering inconsistency wrt
changes in associated btrees.

I tried to skip journaling in the write buffer flush commit path as a
quick experiment, but that doesn't work on its own because we can then
lose key updates in the journal held by the write buffer pin. I.e., the
tail of the journal can then move past the key before it's been updated
in the btree because the latter update hasn't journaled it (this results
in similar backpointer errors and I can see the key getting lost by
looking at list_journal vs. list_journal -a).

OTOH, it looks like the fast flush path gets this right without
rejournaling because it adds a pin to the btree buffer based on the seq
of the already journaled key. IIUC, this essentially allows the btree
update to refer to the journal entry created when the key was inserted
into the write buffer. I'm wondering if the right way to fix this is to
allow the trans update/commit path to basically do the same thing. I.e.,
somehow or another allow a trans update of an already journaled key to
inherit the seq of the key for a btree pin rather than explicitly
rejournal it at the current seq of the transaction. Thoughts?

Brian

next prev parent reply	other threads:[~2023-07-12 14:59 UTC|newest]

Thread overview: 9+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-07 17:32 bcachefs backpointers errors in no write buf mode Brian Foster
2023-07-07 17:58 ` Kent Overstreet
2023-07-10 13:37   ` Brian Foster
2023-07-10 14:56     ` Kent Overstreet
2023-07-12 15:01       ` Brian Foster [this message]
2023-07-12 15:40         ` Kent Overstreet
2023-07-12 16:52           ` Brian Foster
2023-07-12 17:10             ` Kent Overstreet
2023-07-07 20:37 ` Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZK7AMu6MBLoLX8vJ@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.