public inbox for linux-bcachefs@vger.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: Kent Overstreet <kent.overstreet@linux.dev>
Cc: linux-bcachefs@vger.kernel.org, linux-kernel@vger.kernel.org,
	djwong@kernel.org
Subject: Re: [PATCH 02/21] bcachefs: Accumulate accounting keys in journal replay
Date: Tue, 27 Feb 2024 10:49:46 -0500	[thread overview]
Message-ID: <Zd4EmpK0zJp0mbml@bfoster> (raw)
In-Reply-To: <20240225023826.2413565-3-kent.overstreet@linux.dev>

On Sat, Feb 24, 2024 at 09:38:04PM -0500, Kent Overstreet wrote:
> Until accounting keys hit the btree, they are deltas, not new versions
> of the existing key; this means we have to teach journal replay to
> accumulate them.
> 
> Additionally, the journal doesn't track precisely which entries have
> been flushed to the btree; it only tracks a range of entries that may
> possibly still need to be flushed.
> 
> That means we need to compare accounting keys against the version in the
> btree and only flush updates that are newer.
> 
> There's another wrinkle with the write buffer: if the write buffer
> starts flushing accounting keys before journal replay has finished
> flushing accounting keys, journal replay will see the version number
> from the new updates and updates from the journal will be lost.
> 
> To avoid this, journal replay has to flush accounting keys first, and
> we'll be adding a flag so that write buffer flush knows to hold
> accounting keys until then.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  fs/bcachefs/btree_journal_iter.c | 23 +++-------
>  fs/bcachefs/btree_journal_iter.h | 15 +++++++
>  fs/bcachefs/btree_trans_commit.c |  9 +++-
>  fs/bcachefs/btree_update.h       | 14 +++++-
>  fs/bcachefs/recovery.c           | 76 +++++++++++++++++++++++++++++++-
>  5 files changed, 117 insertions(+), 20 deletions(-)
> 
...
> diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> index 96e7a1ec7091..6829d80bd181 100644
> --- a/fs/bcachefs/recovery.c
> +++ b/fs/bcachefs/recovery.c
> @@ -11,6 +11,7 @@
>  #include "btree_io.h"
>  #include "buckets.h"
>  #include "dirent.h"
> +#include "disk_accounting.h"
>  #include "ec.h"
>  #include "errcode.h"
>  #include "error.h"
> @@ -87,6 +88,56 @@ static void replay_now_at(struct journal *j, u64 seq)
>  		bch2_journal_pin_put(j, j->replay_journal_seq++);
>  }
>  
> +static int bch2_journal_replay_accounting_key(struct btree_trans *trans,
> +					      struct journal_key *k)
> +{
> +	struct journal_keys *keys = &trans->c->journal_keys;
> +
> +	struct btree_iter iter;
> +	bch2_trans_node_iter_init(trans, &iter, k->btree_id, k->k->k.p,
> +				  BTREE_MAX_DEPTH, k->level,
> +				  BTREE_ITER_INTENT);
> +	int ret = bch2_btree_iter_traverse(&iter);
> +	if (ret)
> +		goto out;
> +
> +	struct bkey u;
> +	struct bkey_s_c old = bch2_btree_path_peek_slot(btree_iter_path(trans, &iter), &u);
> +
> +	if (bversion_cmp(old.k->version, k->k->k.version) >= 0) {
> +		ret = 0;
> +		goto out;
> +	}

So I assume this is what correlates back to the need to not flush the
write buffer until replay completes, otherwise we could unintentionally
skip subsequent key updates. Is that the case?

If so, it would be nice to have some comments here that explain this.
I.e., I don't quite have a big enough picture to know where or how this
is prevented to ensure that the version updates down the key
accumulation helpers don't conflict with this particular check, so
something that helps connect the dots enough to somebody who doesn't
already know how this is all supposed to work would be useful.

Brian

> +
> +	if (k + 1 < &darray_top(*keys) &&
> +	    !journal_key_cmp(k, k + 1)) {
> +		BUG_ON(bversion_cmp(k[0].k->k.version, k[1].k->k.version) > 0);
> +
> +		bch2_accounting_accumulate(bkey_i_to_accounting(k[1].k),
> +					   bkey_i_to_s_c_accounting(k[0].k));
> +		ret = 0;
> +		goto out;
> +	}
> +
> +	struct bkey_i *new = k->k;
> +	if (old.k->type == KEY_TYPE_accounting) {
> +		new = bch2_bkey_make_mut_noupdate(trans, bkey_i_to_s_c(k->k));
> +		ret = PTR_ERR_OR_ZERO(new);
> +		if (ret)
> +			goto out;
> +
> +		bch2_accounting_accumulate(bkey_i_to_accounting(new),
> +					   bkey_s_c_to_accounting(old));
> +	}
> +
> +	trans->journal_res.seq = k->journal_seq;
> +
> +	ret = bch2_trans_update(trans, &iter, new, BTREE_TRIGGER_NORUN);
> +out:
> +	bch2_trans_iter_exit(trans, &iter);
> +	return ret;
> +}
> +
>  static int bch2_journal_replay_key(struct btree_trans *trans,
>  				   struct journal_key *k)
>  {
> @@ -159,12 +210,33 @@ static int bch2_journal_replay(struct bch_fs *c)
>  
>  	BUG_ON(!atomic_read(&keys->ref));
>  
> +	/*
> +	 * Replay accounting keys first: we can't allow the write buffer to
> +	 * flush accounting keys until we're done
> +	 */
> +	darray_for_each(*keys, k) {
> +		if (!(k->k->k.type == KEY_TYPE_accounting && !k->allocated))
> +			continue;
> +
> +		cond_resched();
> +
> +		ret = commit_do(trans, NULL, NULL,
> +				BCH_TRANS_COMMIT_no_enospc|
> +				BCH_TRANS_COMMIT_no_journal_res,
> +			     bch2_journal_replay_accounting_key(trans, k));
> +		if (bch2_fs_fatal_err_on(ret, c, "error replaying accounting; %s", bch2_err_str(ret)))
> +			goto err;
> +	}
> +
>  	/*
>  	 * First, attempt to replay keys in sorted order. This is more
>  	 * efficient - better locality of btree access -  but some might fail if
>  	 * that would cause a journal deadlock.
>  	 */
>  	darray_for_each(*keys, k) {
> +		if (k->k->k.type == KEY_TYPE_accounting && !k->allocated)
> +			continue;
> +
>  		cond_resched();
>  
>  		/* Skip fastpath if we're low on space in the journal */
> @@ -174,7 +246,7 @@ static int bch2_journal_replay(struct bch_fs *c)
>  				  BCH_TRANS_COMMIT_journal_reclaim|
>  				  (!k->allocated ? BCH_TRANS_COMMIT_no_journal_res : 0),
>  			     bch2_journal_replay_key(trans, k));
> -		BUG_ON(!ret && !k->overwritten);
> +		BUG_ON(!ret && !k->overwritten && k->k->k.type != KEY_TYPE_accounting);
>  		if (ret) {
>  			ret = darray_push(&keys_sorted, k);
>  			if (ret)
> @@ -208,7 +280,7 @@ static int bch2_journal_replay(struct bch_fs *c)
>  		if (ret)
>  			goto err;
>  
> -		BUG_ON(!k->overwritten);
> +		BUG_ON(k->btree_id != BTREE_ID_accounting && !k->overwritten);
>  	}
>  
>  	/*
> -- 
> 2.43.0
> 


  reply	other threads:[~2024-02-27 15:48 UTC|newest]

Thread overview: 39+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2024-02-25  2:38 [PATCH 00/21] bcachefs disk accounting rewrite Kent Overstreet
2024-02-25  2:38 ` [PATCH 01/21] bcachefs: KEY_TYPE_accounting Kent Overstreet
2024-02-27 15:49   ` Brian Foster
2024-02-28 19:39     ` Kent Overstreet
2024-02-29 18:43       ` Brian Foster
2024-02-29 21:24         ` Kent Overstreet
2024-03-01 15:03           ` Brian Foster
2024-03-01 19:30             ` Kent Overstreet
2024-02-25  2:38 ` [PATCH 02/21] bcachefs: Accumulate accounting keys in journal replay Kent Overstreet
2024-02-27 15:49   ` Brian Foster [this message]
2024-02-28 20:06     ` Kent Overstreet
2024-02-25  2:38 ` [PATCH 03/21] bcachefs: btree write buffer knows how to accumulate bch_accounting keys Kent Overstreet
2024-02-27 15:50   ` Brian Foster
2024-02-28 22:42     ` Kent Overstreet
2024-02-29 18:44       ` Brian Foster
2024-02-29 20:25         ` Kent Overstreet
2024-02-25  2:38 ` [PATCH 04/21] bcachefs: Disk space accounting rewrite Kent Overstreet
2024-02-27 15:55   ` Brian Foster
2024-02-29  4:10     ` Kent Overstreet
2024-02-29 18:44       ` Brian Foster
2024-02-29 21:16         ` Kent Overstreet
2024-03-01 15:03           ` Brian Foster
2024-02-25  2:38 ` [PATCH 05/21] bcachefs: dev_usage updated by new accounting Kent Overstreet
2024-02-25  2:38 ` [PATCH 06/21] bcachefs: Kill bch2_fs_usage_initialize() Kent Overstreet
2024-02-25  2:38 ` [PATCH 07/21] bcachefs: Convert bch2_ioctl_fs_usage() to new accounting Kent Overstreet
2024-02-25  2:38 ` [PATCH 08/21] bcachefs: kill bch2_fs_usage_read() Kent Overstreet
2024-02-25  2:38 ` [PATCH 09/21] bcachefs: Kill writing old accounting to journal Kent Overstreet
2024-02-25  2:38 ` [PATCH 10/21] bcachefs: Delete journal-buf-sharded old style accounting Kent Overstreet
2024-02-25  2:38 ` [PATCH 11/21] bcachefs: Kill bch2_fs_usage_to_text() Kent Overstreet
2024-02-25  2:38 ` [PATCH 12/21] bcachefs: Kill fs_usage_online Kent Overstreet
2024-02-25  2:38 ` [PATCH 13/21] bcachefs: Kill replicas_journal_res Kent Overstreet
2024-02-25  2:38 ` [PATCH 14/21] bcachefs: Convert gc to new accounting Kent Overstreet
2024-02-25  2:38 ` [PATCH 15/21] bcachefs: Convert bch2_replicas_gc2() " Kent Overstreet
2024-02-25  2:38 ` [PATCH 16/21] bcachefs: bch2_verify_accounting_clean() Kent Overstreet
2024-02-25  2:38 ` [PATCH 17/21] bcachefs: Eytzinger accumulation for accounting keys Kent Overstreet
2024-02-25  2:38 ` [PATCH 18/21] bcachefs: bch_acct_compression Kent Overstreet
2024-02-25  2:38 ` [PATCH 19/21] bcachefs: Convert bch2_compression_stats_to_text() to new accounting Kent Overstreet
2024-02-25  2:38 ` [PATCH 20/21] bcachefs: bch2_fs_accounting_to_text() Kent Overstreet
2024-02-25  2:38 ` [PATCH 21/21] bcachefs: bch2_fs_usage_base_to_text() Kent Overstreet

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=Zd4EmpK0zJp0mbml@bfoster \
    --to=bfoster@redhat.com \
    --cc=djwong@kernel.org \
    --cc=kent.overstreet@linux.dev \
    --cc=linux-bcachefs@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox