From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org, linux-kernel@vger.kernel.org,
djwong@kernel.org
Subject: Re: [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys
Date: Thu, 29 Feb 2024 13:44:07 -0500 [thread overview]
Message-ID: <ZeDQd7l6PQF5IpBa@bfoster> (raw)
In-Reply-To: <ius5flpyc3nelyhflvf3cjok7fh6a6qceq43ruweqyu3giqg2k@n2eun6iwtsyj>
On Wed, Feb 28, 2024 at 05:42:39PM -0500, Kent Overstreet wrote:
> On Tue, Feb 27, 2024 at 10:50:23AM -0500, Brian Foster wrote:
> > On Sat, Feb 24, 2024 at 09:38:05PM -0500, Kent Overstreet wrote:
> > > + if (!*accounting_accumulated && wb->k.k.type == KEY_TYPE_accounting) {
> > > + struct bkey u;
> > > + struct bkey_s_c k = bch2_btree_path_peek_slot_exact(btree_iter_path(trans, iter), &u);
> > > +
> > > + if (k.k->type == KEY_TYPE_accounting)
> > > + bch2_accounting_accumulate(bkey_i_to_accounting(&wb->k),
> > > + bkey_s_c_to_accounting(k));
> >
> > So it looks like we're accumulating from the btree key into the write
> > buffer key. Is this so the following code will basically insert a new
> > btree key based on the value of the write buffer key?
>
> Correct, this is where we go from "the accounting key is a delta" to
> "the accounting key is the new version of the key".
>
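(For my own notes: the accumulate step is just element-wise addition of the
counters, so adding the existing btree value into the delta leaves the write
buffer key holding the new absolute value. A rough sketch with made-up types,
not the actual bch_accounting layout:)

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for an accounting value: a small
 * array of counters. This is illustrative only, not the real bcachefs
 * struct layout. */
struct acct_val {
	unsigned	nr;
	long long	d[4];
};

/* Accumulate the existing (absolute) btree value into the write buffer
 * key's delta: afterwards the write buffer key holds the new absolute
 * value, mirroring what bch2_accounting_accumulate() does in the
 * quoted hunk. */
static void acct_accumulate(struct acct_val *dst, const struct acct_val *src)
{
	assert(dst->nr == src->nr);
	for (unsigned i = 0; i < dst->nr; i++)
		dst->d[i] += src->d[i];
}
```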
> > > darray_for_each(wb->sorted, i) {
> > > struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
> > > + bool accounting_accumulated = false;
> >
> > Should this live within the interior flush loop?
>
> We can't define it within the loop because then we'd be setting it to
> false on every loop iteration... but it does belong _with_ the loop, so
> I'll move it to right before.
>
Ah, right.
> > > - bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > - bch2_btree_write_buffer_journal_flush);
> > > + if (!accounting_replay_done &&
> > > + i->k.k.type == KEY_TYPE_accounting) {
> > > + could_not_insert++;
> > > + continue;
> > > + }
> > > +
> > > + if (!could_not_insert)
> > > + bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
> > > + bch2_btree_write_buffer_journal_flush);
> >
> > Hmm.. so this is sane because the slowpath runs in journal sorted order,
> > right?
>
> yup, which means as soon as we hit a key we can't insert we can't
> release any more journal pins
>
> >
> > >
> > > bch2_trans_begin(trans);
> > >
> > > @@ -375,13 +409,27 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
> > > btree_write_buffered_insert(trans, i));
> > > if (ret)
> > > goto err;
> > > +
> > > + i->journal_seq = 0;
> > > + }
> > > +
> >
> > /*
> > * Condense the remaining keys <reasons reasons>...??
> > */
>
> yup, that's a good comment
>
> > > + if (could_not_insert) {
> > > + struct btree_write_buffered_key *dst = wb->flushing.keys.data;
> > > +
> > > + darray_for_each(wb->flushing.keys, i)
> > > + if (i->journal_seq)
> > > + *dst++ = *i;
> > > + wb->flushing.keys.nr = dst - wb->flushing.keys.data;
> > > }
> > > }
> > > err:
> > > + if (ret || !could_not_insert) {
> > > + bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > + wb->flushing.keys.nr = 0;
> > > + }
> > > +
> > > bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
> > > - trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
> > > - bch2_journal_pin_drop(j, &wb->flushing.pin);
> > > - wb->flushing.keys.nr = 0;
> > > + trace_write_buffer_flush(trans, wb->flushing.keys.nr, overwritten, fast, 0);
> >
> > I feel like the last time I looked at the write buffer stuff the flush
> > wasn't reentrant in this way. I.e., the flush switched out the active
> > buffer and so had to process all entries in the current buffer (or
> > something like that). Has something changed or do I misunderstand?
>
> Yeah, originally we were adding keys to the write buffer directly from
> the transaction commit path, so that necessitated the super fast
> lockless stuff where we'd toggle between buffers so one was always
> available.
>
> Now keys are pulled from the journal, so we can use (somewhat) simpler
> locking and buffering; now the complication is that we can't predict in
> advance how many keys are going to come out of the journal for the write
> buffer.
>
Ok.
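(For reference, the condense step in the quoted hunk above is the usual
in-place filter over an array: walk the darray and keep only the keys that
still have a journal_seq, i.e. the ones that couldn't be inserted and must be
retried. A minimal sketch with stand-in types, not the actual bcachefs
structs:)

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative stand-in for btree_write_buffered_key: only the field
 * the compaction tests is modeled here. */
struct wb_key {
	unsigned long long	journal_seq;
	int			payload;
};

/* In-place compaction preserving order: keep entries whose journal_seq
 * is nonzero (keys that could not be inserted yet). Returns the new
 * count, like the wb->flushing.keys.nr = dst - wb->flushing.keys.data
 * computation in the patch. */
static size_t wb_condense(struct wb_key *keys, size_t nr)
{
	struct wb_key *dst = keys;

	for (size_t i = 0; i < nr; i++)
		if (keys[i].journal_seq)
			*dst++ = keys[i];
	return dst - keys;
}
```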
> >
> > > return ret;
> > > }
> > >
> > > diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> > > index 6829d80bd181..b8289af66c8e 100644
> > > --- a/fs/bcachefs/recovery.c
> > > +++ b/fs/bcachefs/recovery.c
> > > @@ -228,6 +228,8 @@ static int bch2_journal_replay(struct bch_fs *c)
> > > goto err;
> > > }
> > >
> > > + set_bit(BCH_FS_accounting_replay_done, &c->flags);
> > > +
> >
> > I assume this ties into the question on the previous patch..
> >
> > Related question.. if the write buffer can't flush during journal
> > replay, is there concern/risk of overflowing it?
>
> Shouldn't be any actual risk. It's just new accounting updates that the
> write buffer can't flush, and those are only going to be generated by
> interior btree node updates as journal replay has to split/rewrite nodes
> to make room for its updates.
>
> And for those new accounting updates, updates to the same counters get
> accumulated as they're flushed from the journal to the write buffer -
> see the eytzinger tree accumulation patch. So we could only overflow if
> the number of distinct counters touched somehow was very large.
>
> And the number of distinct counters will be growing significantly, but
> the new counters will all be for user data, not metadata.
>
> (Except: that reminds me, we do want to add per-btree counters, so users
> can see "I have x amount of extents, x amount of dirents, etc.").
>
Heh, Ok. This all does sound a little open ended to me. Maybe the better
question is: suppose this hypothetically does happen after adding a
bunch of new counters, what would the expected side effect be in the
recovery scenario where the write buffer can't be flushed?

If write buffer updates now basically just journal a special entry,
would that mean we'd deadlock during recovery because a pinned write
buffer prevents us from inserting any more journal entries? If so, that
actually seems reasonable to me: in theory it at least doesn't break the
filesystem on-disk, though obviously it would require some kind of
enhancement in order to complete the recovery (even if what that is is
currently unknown). Hm?
Brian