Re: [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL

public inbox for linux-bcachefs@vger.kernel.org
 help / color / mirror / Atom feed

From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org
Subject: Re: [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL
Date: Tue, 14 Nov 2023 08:17:36 -0500	[thread overview]
Message-ID: <ZVNzcE9EopgWAepU@bfoster> (raw)
In-Reply-To: <20231113164910.bcfhm4pathhfsbxg@moria.home.lan>

On Mon, Nov 13, 2023 at 11:49:10AM -0500, Kent Overstreet wrote:
> On Mon, Nov 13, 2023 at 10:29:34AM -0500, Brian Foster wrote:
> > On Fri, Nov 10, 2023 at 11:31:42AM -0500, Kent Overstreet wrote:
> > > With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
> > > now switch the btree write buffer to use it for flushing.
> > > 
> > > This has the advantage that transaction commits don't need to take a
> > > journal reservation at all.
> > > 
> > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > ---
> > >  fs/bcachefs/bkey_methods.h       |  2 --
> > >  fs/bcachefs/btree_trans_commit.c |  7 +------
> > >  fs/bcachefs/btree_types.h        |  1 -
> > >  fs/bcachefs/btree_update.c       | 23 -----------------------
> > >  fs/bcachefs/btree_write_buffer.c | 14 ++++++++++----
> > >  5 files changed, 11 insertions(+), 36 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
> > > index ec90a06a6cf9..f231f01072c2 100644
> > > --- a/fs/bcachefs/btree_trans_commit.c
> > > +++ b/fs/bcachefs/btree_trans_commit.c
> > > @@ -779,12 +779,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
> > >  
> > >  	trans_for_each_update(trans, i) {
> > >  		if (!i->cached) {
> > > -			u64 seq = trans->journal_res.seq;
> > > -
> > > -			if (i->flags & BTREE_UPDATE_PREJOURNAL)
> > > -				seq = i->seq;
> > > -
> > > -			bch2_btree_insert_key_leaf(trans, i->path, i->k, seq);
> > > +			bch2_btree_insert_key_leaf(trans, i->path, i->k, trans->journal_res.seq);
> > 
> > Ok, so instead of passing the seq to the commit path via the insert
> > entry, we use a flag that enables a means to pass journal_res.seq
> > straight through to the commit. That seems reasonable to me.
> > 
> > One subtle thing that comes to mind is that the existing mechanism
> > tracks a seq per key update whereas this looks like it associates the
> > seq to the transaction and then to every key update. That's how it's
> > used today AFAICS so doesn't seem like a big deal, but what happens if
> > this is misused in the future? Does anything prevent having multiple
> > keys from different journal seqs in the same transaction leading to
> > pinning the wrong seq for some subset of keys? If not, it would be nice
> > to have some kind of check or something somewhere to fail an update for
> > a trans that might already have a pre journaled key.
> 
> Well, anything to do with taking a journal reservation or a journal pin
> really does go with the transaction commit, not an individual update -
> if we were doing multiple updates that went with different journal
> updates it wouldn't really be a transaction at that point.
> 

Yep that makes complete sense. I'm just lamenting that it seems a little
easy to misuse. I think this is partly due to the disconnected nature of
setting journal_res.seq and passing INSERT_JOURNAL_REPLAY at commit
time.

If the no_res mode were a fundamental transaction state, that could
potentially facilitate error checks that could help prevent adding keys
from multiple journal seqs to the same transaction. For example,
consider something like the original update helper that took the seq as
a param, but set some state on the transaction such that any further
updates could be validated to require the same seq (knowing that the
transaction is now "nores").

It's relatively minor in theory but in practice that could be the
difference between identifying a problem via runtime error vs. possibly
missing the bug and spending a bunch of time triaging an out of order
log recovery bug or whatever might occur as a result. Anyways, this is
mainly just a thought around possibly making things a bit more (noob)
developer friendly.

> > > -	return  bch2_trans_update_seq(trans, wb->journal_seq, iter, &wb->k,
> > > -				      BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
> > > +	trans->journal_res.seq = wb->journal_seq;
> > > +
> > > +	return  bch2_trans_update(trans, iter, &wb->k,
> > > +				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
> > >  		bch2_trans_commit(trans, NULL, NULL,
> > >  				  commit_flags|
> > >  				  BTREE_INSERT_NOCHECK_RW|
> > >  				  BTREE_INSERT_NOFAIL|
> > > +				  BTREE_INSERT_JOURNAL_REPLAY|
> > >  				  BTREE_INSERT_JOURNAL_RECLAIM);
> >
> > This is more of a nit for now, but I find the general use of a flag with
> > a contextual name unnecessarily confusing. I.e., the flag implies we're
> > doing journal replay, which we're not, and so makes the code confusing
> > to somebody who doesn't have the historical development context. Could
> > we rename or repurpose this to better reflect the functional purpose of
> > not acquiring a reservation (and let journal replay also use it)? I can
> > look into that as a followon change if you want to make suggestions or
> > share any thoughts..
> 
> Agreed - I didn't post this patch to the list because it was a simpler
> one, but:
> https://evilpiepirate.org/git/bcachefs.git/commit/?id=ed1e075104f3768375e1e8d3efcf87d9641029a7
> 
> (renaming all the BTREE_INSERT flags to BCH_TRANS_COMMIT, and with
> better anmes).
> 

Yeah, that is much better at a quick glance. Thanks.

> > But as a related example, do we care about how this flag modifies
> > invalid key checks (via __bch2_trans_commit()) for example?
> 
> There are key invalid checks that we want to run whenever doing new
> updates (because it's good to check inconsistencies as soon as
> possible), but we can't run on existing keys - possibly because the
> check didn't exist in the past, and so could cause us to throw away
> existing metadata.
> 

Ok, thanks for the context. My question as it relates to this patch is
consider the following hunk in the flag rename patch above:

@@ -1048,7 +1048,7 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 		struct printbuf buf = PRINTBUF;
 		enum bkey_invalid_flags invalid_flags = 0;
 
-		if (!(flags & BTREE_INSERT_JOURNAL_REPLAY))
+		if (!(flags & BCH_TRANS_COMMIT_no_journal_res))
 			invalid_flags |= BKEY_INVALID_WRITE|BKEY_INVALID_COMMIT;
 
 		if (unlikely(bch2_bkey_invalid(c, bkey_i_to_s_c(i->k),

We've effectively removed these invalid_flags for the write buffer key
flush path by virtue of turning on no_journal_res there. Should those
flags be elided because there is no journal res, or are they more
associated with journal replay context (which happens to not use journal
res)?

Brian

> The snapshot skiplist checks are a good example. There were (multiple)
> bugs where we'd delete a snapshot tree node and end up with snapshot
> keys with bad skiplist pointers. Now, we can detect and repair this
> during fsck, but it's much better if we can catch this right away, when
> writing the new snapshot key; i.e. if a skiplist node points to an id
> smaller than the parent, then it's obviously bad.
> 
> But, since that check didn't exist when skiplist nodes were first
> introduced we can't run it when checking existing snapshot nodes,
> otherwise a filesystem would with corrupt skiplist nodes would have
> those snapshot keys deleted entirely instead of letting fsck repair just
> the skiplist entries.
> 
> .key_invalid()s cause those key to be deleted because those checks run
> e.g. every time we run a btree node in from disk - we're very limited in
> what kinds of updates we can do then.
>

next prev parent reply	other threads:[~2023-11-14 13:17 UTC|newest]

Thread overview: 29+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
2023-11-10 16:31 ` [PATCH 01/17] bcachefs: Kill journal pre-reservations Kent Overstreet
2023-11-10 16:31 ` [PATCH 02/17] bcachefs: track_event_change() Kent Overstreet
2023-11-10 16:31 ` [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn Kent Overstreet
2023-11-13 15:22   ` Brian Foster
2023-11-13 16:36     ` Kent Overstreet
2023-11-13 17:08       ` Brian Foster
2023-11-10 16:31 ` [PATCH 04/17] bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res" Kent Overstreet
2023-11-10 16:31 ` [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL Kent Overstreet
2023-11-13 15:29   ` Brian Foster
2023-11-13 16:49     ` Kent Overstreet
2023-11-14 13:17       ` Brian Foster [this message]
2023-11-10 16:31 ` [PATCH 06/17] bcachefs: Go rw before journal replay Kent Overstreet
2023-11-10 16:31 ` [PATCH 07/17] bcachefs: Make journal replay more efficient Kent Overstreet
2023-11-14 13:19   ` Brian Foster
2023-11-15  1:50     ` Kent Overstreet
2023-11-10 16:31 ` [PATCH 08/17] bcachefs: Don't flush journal after replay Kent Overstreet
2023-11-10 16:31 ` [PATCH 09/17] bcachefs: Unwritten journal buffers are always dirty Kent Overstreet
2023-11-10 16:31 ` [PATCH 10/17] bcachefs: journal->buf_lock Kent Overstreet
2023-11-10 16:31 ` [PATCH 11/17] bcachefs: bch2_journal_block_reservations() Kent Overstreet
2023-11-10 16:31 ` [PATCH 12/17] bcachefs: Clean up btree write buffer write ref handling Kent Overstreet
2023-11-10 16:31 ` [PATCH 13/17] bcachefs: bch2_btree_write_buffer_flush_locked() Kent Overstreet
2023-11-10 16:31 ` [PATCH 14/17] bcachefs: bch2_btree_write_buffer_flush() -> bch2_btree_write_buffer_tryflush() Kent Overstreet
2023-11-10 16:31 ` [PATCH 15/17] bcachefs: Improve btree write buffer tracepoints Kent Overstreet
2023-11-10 16:31 ` [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal Kent Overstreet
2023-11-21 10:56   ` Geert Uytterhoeven
2023-11-21 16:52     ` Kent Overstreet
2023-11-10 16:31 ` [PATCH 17/17] bcachefs: Inline btree write buffer sort Kent Overstreet
2023-11-10 16:42 ` [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=ZVNzcE9EopgWAepU@bfoster \
    --to=bfoster@redhat.com \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

Be sure your reply has a Subject: header at the top and a blank line before the message body.

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox