public inbox for linux-bcachefs@vger.kernel.org
* [PATCH 00/17] btree write buffer & journal optimizations
@ 2023-11-10 16:31 Kent Overstreet
  2023-11-10 16:31 ` [PATCH 01/17] bcachefs: Kill journal pre-reservations Kent Overstreet
                   ` (17 more replies)
  0 siblings, 18 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Large stack of changes - performance optimizations and improvements to
related code.

The first big one, killing journal pre-reservations, is already merged:
besides being a performance improvement, it turned out to fix a few tests
that were sporadically failing with "journal stuck" - nice. The idea
behind that patch is that instead of reserving space in the journal ahead
of time (for journal reservations that must succeed or risk deadlock),
we instead rely on watermarks and on tracking which operations are
necessary for reclaim: when we're low on space in the journal and
flushing things for reclaim, we either flush things in order, or track
when we're not flushing in order and allow the reservation to fail.

Most of the other patches are prep work leading up to the big btree
write buffer rewrite, which changes the btree write buffer to pull keys
from the journal, instead of having the transaction commit path directly
add keys to the btree write buffer.

btree write buffer performance is pretty important for multithreaded
update workloads - it's an Amdahl's law situation, since flushing it is
single threaded. More performance optimizations there would still be
worthwhile, and I've been looking at it because I'm sketching out a
rewrite of disk space accounting (which will allow for per-snapshot-ID
accounting) that will lean heavily on the btree write buffer.


Kent Overstreet (17):
  bcachefs: Kill journal pre-reservations
  bcachefs: track_event_change()
  bcachefs: Journal pins must always have a flush_fn
  bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init
    trans->journal_res"
  bcachefs: Kill BTREE_UPDATE_PREJOURNAL
  bcachefs: Go rw before journal replay
  bcachefs: Make journal replay more efficient
  bcachefs: Don't flush journal after replay
  bcachefs: Unwritten journal buffers are always dirty
  bcachefs: journal->buf_lock
  bcachefs: bch2_journal_block_reservations()
  bcachefs: Clean up btree write buffer write ref handling
  bcachefs: bch2_btree_write_buffer_flush_locked()
  bcachefs: bch2_btree_write_buffer_flush() ->
    bch2_btree_write_buffer_tryflush()
  bcachefs: Improve btree write buffer tracepoints
  bcachefs: btree write buffer now slurps keys from journal
  bcachefs: Inline btree write buffer sort

 fs/bcachefs/alloc_background.c         |   2 +-
 fs/bcachefs/bcachefs.h                 |  18 +-
 fs/bcachefs/bcachefs_format.h          |   3 +-
 fs/bcachefs/bkey_methods.h             |   2 -
 fs/bcachefs/btree_iter.c               |   2 -
 fs/bcachefs/btree_key_cache.c          |  14 -
 fs/bcachefs/btree_trans_commit.c       | 123 ++----
 fs/bcachefs/btree_types.h              |   4 -
 fs/bcachefs/btree_update.c             |  23 -
 fs/bcachefs/btree_update_interior.c    |  46 +-
 fs/bcachefs/btree_update_interior.h    |   1 -
 fs/bcachefs/btree_write_buffer.c       | 584 +++++++++++++++++--------
 fs/bcachefs/btree_write_buffer.h       |  48 +-
 fs/bcachefs/btree_write_buffer_types.h |  37 +-
 fs/bcachefs/ec.c                       |   2 +-
 fs/bcachefs/errcode.h                  |   2 -
 fs/bcachefs/inode.c                    |  10 +-
 fs/bcachefs/journal.c                  | 120 +++--
 fs/bcachefs/journal.h                  | 102 +----
 fs/bcachefs/journal_io.c               |  57 ++-
 fs/bcachefs/journal_reclaim.c          |  66 +--
 fs/bcachefs/journal_reclaim.h          |   1 +
 fs/bcachefs/journal_types.h            |  40 +-
 fs/bcachefs/move.c                     |   7 +-
 fs/bcachefs/movinggc.c                 |   4 +-
 fs/bcachefs/recovery.c                 | 106 +++--
 fs/bcachefs/super.c                    |   6 +-
 fs/bcachefs/trace.h                    |  42 +-
 fs/bcachefs/util.c                     | 140 +++---
 fs/bcachefs/util.h                     |  33 +-
 30 files changed, 925 insertions(+), 720 deletions(-)

-- 
2.42.0


^ permalink raw reply	[flat|nested] 29+ messages in thread

* [PATCH 01/17] bcachefs: Kill journal pre-reservations
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 02/17] bcachefs: track_event_change() Kent Overstreet
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

This deletes the complicated and somewhat expensive journal
pre-reservation machinery in favor of just using journal watermarks:
when the journal is more than half full, we run journal reclaim more
aggressively, and when the journal is more than three quarters full we
only allow journal reclaim to get new journal reservations.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/btree_iter.c            |  2 -
 fs/bcachefs/btree_key_cache.c       | 14 -----
 fs/bcachefs/btree_trans_commit.c    | 36 +----------
 fs/bcachefs/btree_types.h           |  3 -
 fs/bcachefs/btree_update_interior.c | 25 --------
 fs/bcachefs/btree_update_interior.h |  1 -
 fs/bcachefs/journal.c               | 31 ---------
 fs/bcachefs/journal.h               | 98 -----------------------------
 fs/bcachefs/journal_reclaim.c       | 42 +++++--------
 fs/bcachefs/journal_types.h         | 26 --------
 fs/bcachefs/trace.h                 | 11 +---
 11 files changed, 19 insertions(+), 270 deletions(-)

diff --git a/fs/bcachefs/btree_iter.c b/fs/bcachefs/btree_iter.c
index 50493b784316..96bdf0c6051c 100644
--- a/fs/bcachefs/btree_iter.c
+++ b/fs/bcachefs/btree_iter.c
@@ -3089,8 +3089,6 @@ void bch2_trans_put(struct btree_trans *trans)
 		srcu_read_unlock(&c->btree_trans_barrier, trans->srcu_idx);
 	}
 
-	bch2_journal_preres_put(&c->journal, &trans->journal_preres);
-
 	kfree(trans->extra_journal_entries.data);
 
 	if (trans->fs_usage_deltas) {
diff --git a/fs/bcachefs/btree_key_cache.c b/fs/bcachefs/btree_key_cache.c
index 2239c03b242c..375a997a0bdd 100644
--- a/fs/bcachefs/btree_key_cache.c
+++ b/fs/bcachefs/btree_key_cache.c
@@ -673,7 +673,6 @@ static int btree_key_cache_flush_pos(struct btree_trans *trans,
 		goto out;
 
 	bch2_journal_pin_drop(j, &ck->journal);
-	bch2_journal_preres_put(j, &ck->res);
 
 	BUG_ON(!btree_node_locked(c_iter.path, 0));
 
@@ -771,18 +770,6 @@ bool bch2_btree_insert_key_cached(struct btree_trans *trans,
 
 	BUG_ON(insert->k.u64s > ck->u64s);
 
-	if (likely(!(flags & BTREE_INSERT_JOURNAL_REPLAY))) {
-		int difference;
-
-		BUG_ON(jset_u64s(insert->k.u64s) > trans->journal_preres.u64s);
-
-		difference = jset_u64s(insert->k.u64s) - ck->res.u64s;
-		if (difference > 0) {
-			trans->journal_preres.u64s	-= difference;
-			ck->res.u64s			+= difference;
-		}
-	}
-
 	bkey_copy(ck->k, insert);
 	ck->valid = true;
 
@@ -1009,7 +996,6 @@ void bch2_fs_btree_key_cache_exit(struct btree_key_cache *bc)
 		cond_resched();
 
 		bch2_journal_pin_drop(&c->journal, &ck->journal);
-		bch2_journal_preres_put(&c->journal, &ck->res);
 
 		list_del(&ck->list);
 		kfree(ck->k);
diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index 02d806ff07bf..d0ead308deef 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -323,17 +323,6 @@ static inline void btree_insert_entry_checks(struct btree_trans *trans,
 		bch2_snapshot_is_internal_node(trans->c, i->k->k.p.snapshot));
 }
 
-static noinline int
-bch2_trans_journal_preres_get_cold(struct btree_trans *trans, unsigned flags,
-				   unsigned long trace_ip)
-{
-	return drop_locks_do(trans,
-		bch2_journal_preres_get(&trans->c->journal,
-			&trans->journal_preres,
-			trans->journal_preres_u64s,
-			(flags & BCH_WATERMARK_MASK)));
-}
-
 static __always_inline int bch2_trans_journal_res_get(struct btree_trans *trans,
 						      unsigned flags)
 {
@@ -882,14 +871,6 @@ static inline int do_bch2_trans_commit(struct btree_trans *trans, unsigned flags
 		}
 	}
 
-	ret = bch2_journal_preres_get(&c->journal,
-			&trans->journal_preres, trans->journal_preres_u64s,
-			(flags & BCH_WATERMARK_MASK)|JOURNAL_RES_GET_NONBLOCK);
-	if (unlikely(ret == -BCH_ERR_journal_preres_get_blocked))
-		ret = bch2_trans_journal_preres_get_cold(trans, flags, trace_ip);
-	if (unlikely(ret))
-		return ret;
-
 	ret = bch2_trans_lock_write(trans);
 	if (unlikely(ret))
 		return ret;
@@ -1052,7 +1033,6 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 	struct bch_fs *c = trans->c;
 	struct btree_insert_entry *i = NULL;
 	struct btree_write_buffered_key *wb;
-	unsigned u64s;
 	int ret = 0;
 
 	if (!trans->nr_updates &&
@@ -1112,13 +1092,8 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 
 	EBUG_ON(test_bit(BCH_FS_CLEAN_SHUTDOWN, &c->flags));
 
-	memset(&trans->journal_preres, 0, sizeof(trans->journal_preres));
-
 	trans->journal_u64s		= trans->extra_journal_entries.nr;
-	trans->journal_preres_u64s	= 0;
-
 	trans->journal_transaction_names = READ_ONCE(c->opts.journal_transaction_names);
-
 	if (trans->journal_transaction_names)
 		trans->journal_u64s += jset_u64s(JSET_ENTRY_LOG_U64s);
 
@@ -1134,16 +1109,11 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 		if (i->key_cache_already_flushed)
 			continue;
 
-		/* we're going to journal the key being updated: */
-		u64s = jset_u64s(i->k->k.u64s);
-		if (i->cached &&
-		    likely(!(flags & BTREE_INSERT_JOURNAL_REPLAY)))
-			trans->journal_preres_u64s += u64s;
-
 		if (i->flags & BTREE_UPDATE_NOJOURNAL)
 			continue;
 
-		trans->journal_u64s += u64s;
+		/* we're going to journal the key being updated: */
+		trans->journal_u64s += jset_u64s(i->k->k.u64s);
 
 		/* and we're also going to log the overwrite: */
 		if (trans->journal_transaction_names)
@@ -1175,8 +1145,6 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 
 	trace_and_count(c, transaction_commit, trans, _RET_IP_);
 out:
-	bch2_journal_preres_put(&c->journal, &trans->journal_preres);
-
 	if (likely(!(flags & BTREE_INSERT_NOCHECK_RW)))
 		bch2_write_ref_put(c, BCH_WRITE_REF_trans);
 out_reset:
diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index 35a1623a3c55..0c97b4ef78fb 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -337,7 +337,6 @@ struct bkey_cached {
 	struct rhash_head	hash;
 	struct list_head	list;
 
-	struct journal_preres	res;
 	struct journal_entry_pin journal;
 	u64			seq;
 
@@ -448,11 +447,9 @@ struct btree_trans {
 	struct journal_entry_pin *journal_pin;
 
 	struct journal_res	journal_res;
-	struct journal_preres	journal_preres;
 	u64			*journal_seq;
 	struct disk_reservation *disk_res;
 	unsigned		journal_u64s;
-	unsigned		journal_preres_u64s;
 	struct replicas_delta_list *fs_usage_deltas;
 };
 
diff --git a/fs/bcachefs/btree_update_interior.c b/fs/bcachefs/btree_update_interior.c
index 39c2db68123b..925837eca369 100644
--- a/fs/bcachefs/btree_update_interior.c
+++ b/fs/bcachefs/btree_update_interior.c
@@ -513,8 +513,6 @@ static void bch2_btree_update_free(struct btree_update *as, struct btree_trans *
 		up_read(&c->gc_lock);
 	as->took_gc_lock = false;
 
-	bch2_journal_preres_put(&c->journal, &as->journal_preres);
-
 	bch2_journal_pin_drop(&c->journal, &as->journal);
 	bch2_journal_pin_flush(&c->journal, &as->journal);
 	bch2_disk_reservation_put(c, &as->disk_res);
@@ -734,8 +732,6 @@ static void btree_update_nodes_written(struct btree_update *as)
 
 	bch2_journal_pin_drop(&c->journal, &as->journal);
 
-	bch2_journal_preres_put(&c->journal, &as->journal_preres);
-
 	mutex_lock(&c->btree_interior_update_lock);
 	for (i = 0; i < as->nr_new_nodes; i++) {
 		b = as->new_nodes[i];
@@ -1129,27 +1125,6 @@ bch2_btree_update_start(struct btree_trans *trans, struct btree_path *path,
 	if (ret)
 		goto err;
 
-	ret = bch2_journal_preres_get(&c->journal, &as->journal_preres,
-				      BTREE_UPDATE_JOURNAL_RES,
-				      journal_flags|JOURNAL_RES_GET_NONBLOCK);
-	if (ret) {
-		if (flags & BTREE_INSERT_JOURNAL_RECLAIM) {
-			ret = -BCH_ERR_journal_reclaim_would_deadlock;
-			goto err;
-		}
-
-		ret = drop_locks_do(trans,
-			bch2_journal_preres_get(&c->journal, &as->journal_preres,
-					      BTREE_UPDATE_JOURNAL_RES,
-					      journal_flags));
-		if (ret == -BCH_ERR_journal_preres_get_blocked) {
-			trace_and_count(c, trans_restart_journal_preres_get, trans, _RET_IP_, journal_flags);
-			ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_journal_preres_get);
-		}
-		if (ret)
-			goto err;
-	}
-
 	ret = bch2_disk_reservation_get(c, &as->disk_res,
 			(nr_nodes[0] + nr_nodes[1]) * btree_sectors(c),
 			c->opts.metadata_replicas,
diff --git a/fs/bcachefs/btree_update_interior.h b/fs/bcachefs/btree_update_interior.h
index 4df21512d640..031076e75fa1 100644
--- a/fs/bcachefs/btree_update_interior.h
+++ b/fs/bcachefs/btree_update_interior.h
@@ -55,7 +55,6 @@ struct btree_update {
 	unsigned			update_level;
 
 	struct disk_reservation		disk_res;
-	struct journal_preres		journal_preres;
 
 	/*
 	 * BTREE_INTERIOR_UPDATING_NODE:
diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index ced1983d675b..46eba1072940 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -526,36 +526,6 @@ int bch2_journal_res_get_slowpath(struct journal *j, struct journal_res *res,
 	return ret;
 }
 
-/* journal_preres: */
-
-static bool journal_preres_available(struct journal *j,
-				     struct journal_preres *res,
-				     unsigned new_u64s,
-				     unsigned flags)
-{
-	bool ret = bch2_journal_preres_get_fast(j, res, new_u64s, flags, true);
-
-	if (!ret && mutex_trylock(&j->reclaim_lock)) {
-		bch2_journal_reclaim(j);
-		mutex_unlock(&j->reclaim_lock);
-	}
-
-	return ret;
-}
-
-int __bch2_journal_preres_get(struct journal *j,
-			      struct journal_preres *res,
-			      unsigned new_u64s,
-			      unsigned flags)
-{
-	int ret;
-
-	closure_wait_event(&j->preres_wait,
-		   (ret = bch2_journal_error(j)) ||
-		   journal_preres_available(j, res, new_u64s, flags));
-	return ret;
-}
-
 /* journal_entry_res: */
 
 void bch2_journal_entry_res_resize(struct journal *j,
@@ -1307,7 +1277,6 @@ void __bch2_journal_debug_to_text(struct printbuf *out, struct journal *j)
 	prt_printf(out, "last_seq:\t\t%llu\n",			journal_last_seq(j));
 	prt_printf(out, "last_seq_ondisk:\t%llu\n",		j->last_seq_ondisk);
 	prt_printf(out, "flushed_seq_ondisk:\t%llu\n",		j->flushed_seq_ondisk);
-	prt_printf(out, "prereserved:\t\t%u/%u\n",		j->prereserved.reserved, j->prereserved.remaining);
 	prt_printf(out, "watermark:\t\t%s\n",			bch2_watermarks[j->watermark]);
 	prt_printf(out, "each entry reserved:\t%u\n",		j->entry_u64s_reserved);
 	prt_printf(out, "nr flush writes:\t%llu\n",		j->nr_flush_writes);
diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h
index 011711e99c8d..c85d01cf4948 100644
--- a/fs/bcachefs/journal.h
+++ b/fs/bcachefs/journal.h
@@ -395,104 +395,6 @@ static inline int bch2_journal_res_get(struct journal *j, struct journal_res *re
 	return 0;
 }
 
-/* journal_preres: */
-
-static inline void journal_set_watermark(struct journal *j)
-{
-	union journal_preres_state s = READ_ONCE(j->prereserved);
-	unsigned watermark = BCH_WATERMARK_stripe;
-
-	if (fifo_free(&j->pin) < j->pin.size / 4)
-		watermark = max_t(unsigned, watermark, BCH_WATERMARK_copygc);
-	if (fifo_free(&j->pin) < j->pin.size / 8)
-		watermark = max_t(unsigned, watermark, BCH_WATERMARK_reclaim);
-
-	if (s.reserved > s.remaining)
-		watermark = max_t(unsigned, watermark, BCH_WATERMARK_copygc);
-	if (!s.remaining)
-		watermark = max_t(unsigned, watermark, BCH_WATERMARK_reclaim);
-
-	if (watermark == j->watermark)
-		return;
-
-	swap(watermark, j->watermark);
-	if (watermark > j->watermark)
-		journal_wake(j);
-}
-
-static inline void bch2_journal_preres_put(struct journal *j,
-					   struct journal_preres *res)
-{
-	union journal_preres_state s = { .reserved = res->u64s };
-
-	if (!res->u64s)
-		return;
-
-	s.v = atomic64_sub_return(s.v, &j->prereserved.counter);
-	res->u64s = 0;
-
-	if (unlikely(s.waiting)) {
-		clear_bit(ilog2((((union journal_preres_state) { .waiting = 1 }).v)),
-			  (unsigned long *) &j->prereserved.v);
-		closure_wake_up(&j->preres_wait);
-	}
-
-	if (s.reserved <= s.remaining && j->watermark)
-		journal_set_watermark(j);
-}
-
-int __bch2_journal_preres_get(struct journal *,
-			struct journal_preres *, unsigned, unsigned);
-
-static inline int bch2_journal_preres_get_fast(struct journal *j,
-					       struct journal_preres *res,
-					       unsigned new_u64s,
-					       unsigned flags,
-					       bool set_waiting)
-{
-	int d = new_u64s - res->u64s;
-	union journal_preres_state old, new;
-	u64 v = atomic64_read(&j->prereserved.counter);
-	enum bch_watermark watermark = flags & BCH_WATERMARK_MASK;
-	int ret;
-
-	do {
-		old.v = new.v = v;
-		ret = 0;
-
-		if (watermark == BCH_WATERMARK_reclaim ||
-		    new.reserved + d < new.remaining) {
-			new.reserved += d;
-			ret = 1;
-		} else if (set_waiting && !new.waiting)
-			new.waiting = true;
-		else
-			return 0;
-	} while ((v = atomic64_cmpxchg(&j->prereserved.counter,
-				       old.v, new.v)) != old.v);
-
-	if (ret)
-		res->u64s += d;
-	return ret;
-}
-
-static inline int bch2_journal_preres_get(struct journal *j,
-					  struct journal_preres *res,
-					  unsigned new_u64s,
-					  unsigned flags)
-{
-	if (new_u64s <= res->u64s)
-		return 0;
-
-	if (bch2_journal_preres_get_fast(j, res, new_u64s, flags, false))
-		return 0;
-
-	if (flags & JOURNAL_RES_GET_NONBLOCK)
-		return -BCH_ERR_journal_preres_get_blocked;
-
-	return __bch2_journal_preres_get(j, res, new_u64s, flags);
-}
-
 /* journal_entry_res: */
 
 void bch2_journal_entry_res_resize(struct journal *,
diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
index 4a18f47c9ae7..b3b822170b2c 100644
--- a/fs/bcachefs/journal_reclaim.c
+++ b/fs/bcachefs/journal_reclaim.c
@@ -50,16 +50,21 @@ unsigned bch2_journal_dev_buckets_available(struct journal *j,
 	return available;
 }
 
-static void journal_set_remaining(struct journal *j, unsigned u64s_remaining)
+static inline void journal_set_watermark(struct journal *j, bool low_on_space)
 {
-	union journal_preres_state old, new;
-	u64 v = atomic64_read(&j->prereserved.counter);
+	unsigned watermark = BCH_WATERMARK_stripe;
 
-	do {
-		old.v = new.v = v;
-		new.remaining = u64s_remaining;
-	} while ((v = atomic64_cmpxchg(&j->prereserved.counter,
-				       old.v, new.v)) != old.v);
+	if (low_on_space)
+		watermark = max_t(unsigned, watermark, BCH_WATERMARK_reclaim);
+	if (fifo_free(&j->pin) < j->pin.size / 4)
+		watermark = max_t(unsigned, watermark, BCH_WATERMARK_reclaim);
+
+	if (watermark == j->watermark)
+		return;
+
+	swap(watermark, j->watermark);
+	if (watermark > j->watermark)
+		journal_wake(j);
 }
 
 static struct journal_space
@@ -162,7 +167,6 @@ void bch2_journal_space_available(struct journal *j)
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 	struct bch_dev *ca;
 	unsigned clean, clean_ondisk, total;
-	s64 u64s_remaining = 0;
 	unsigned max_entry_size	 = min(j->buf[0].buf_size >> 9,
 				       j->buf[1].buf_size >> 9);
 	unsigned i, nr_online = 0, nr_devs_want;
@@ -222,16 +226,10 @@ void bch2_journal_space_available(struct journal *j)
 	else
 		clear_bit(JOURNAL_MAY_SKIP_FLUSH, &j->flags);
 
-	u64s_remaining  = (u64) clean << 6;
-	u64s_remaining -= (u64) total << 3;
-	u64s_remaining = max(0LL, u64s_remaining);
-	u64s_remaining /= 4;
-	u64s_remaining = min_t(u64, u64s_remaining, U32_MAX);
+	journal_set_watermark(j, clean * 4 <= total);
 out:
 	j->cur_entry_sectors	= !ret ? j->space[journal_space_discarded].next_entry : 0;
 	j->cur_entry_error	= ret;
-	journal_set_remaining(j, u64s_remaining);
-	journal_set_watermark(j);
 
 	if (!ret)
 		journal_wake(j);
@@ -590,11 +588,6 @@ static u64 journal_seq_to_flush(struct journal *j)
 		/* Try to keep the journal at most half full: */
 		nr_buckets = ja->nr / 2;
 
-		/* And include pre-reservations: */
-		nr_buckets += DIV_ROUND_UP(j->prereserved.reserved,
-					   (ca->mi.bucket_size << 6) -
-					   journal_entry_overhead(j));
-
 		nr_buckets = min(nr_buckets, ja->nr);
 
 		bucket_to_flush = (ja->cur_idx + nr_buckets) % ja->nr;
@@ -673,10 +666,7 @@ static int __bch2_journal_reclaim(struct journal *j, bool direct, bool kicked)
 			       msecs_to_jiffies(c->opts.journal_reclaim_delay)))
 			min_nr = 1;
 
-		if (j->prereserved.reserved * 4 > j->prereserved.remaining)
-			min_nr = 1;
-
-		if (fifo_free(&j->pin) <= 32)
+		if (j->watermark != BCH_WATERMARK_stripe)
 			min_nr = 1;
 
 		if (atomic_read(&c->btree_cache.dirty) * 2 > c->btree_cache.used)
@@ -687,8 +677,6 @@ static int __bch2_journal_reclaim(struct journal *j, bool direct, bool kicked)
 		trace_and_count(c, journal_reclaim_start, c,
 				direct, kicked,
 				min_nr, min_key_cache,
-				j->prereserved.reserved,
-				j->prereserved.remaining,
 				atomic_read(&c->btree_cache.dirty),
 				c->btree_cache.used,
 				atomic_long_read(&c->btree_key_cache.nr_dirty),
diff --git a/fs/bcachefs/journal_types.h b/fs/bcachefs/journal_types.h
index 1a9a0293c36a..1acce6ecfca0 100644
--- a/fs/bcachefs/journal_types.h
+++ b/fs/bcachefs/journal_types.h
@@ -76,14 +76,6 @@ struct journal_res {
 	u64			seq;
 };
 
-/*
- * For reserving space in the journal prior to getting a reservation on a
- * particular journal entry:
- */
-struct journal_preres {
-	unsigned		u64s;
-};
-
 union journal_res_state {
 	struct {
 		atomic64_t	counter;
@@ -104,22 +96,6 @@ union journal_res_state {
 	};
 };
 
-union journal_preres_state {
-	struct {
-		atomic64_t	counter;
-	};
-
-	struct {
-		u64		v;
-	};
-
-	struct {
-		u64		waiting:1,
-				reserved:31,
-				remaining:32;
-	};
-};
-
 /* bytes: */
 #define JOURNAL_ENTRY_SIZE_MIN		(64U << 10) /* 64k */
 #define JOURNAL_ENTRY_SIZE_MAX		(4U  << 20) /* 4M */
@@ -180,8 +156,6 @@ struct journal {
 	union journal_res_state reservations;
 	enum bch_watermark	watermark;
 
-	union journal_preres_state prereserved;
-
 	} __aligned(SMP_CACHE_BYTES);
 
 	unsigned long		flags;
diff --git a/fs/bcachefs/trace.h b/fs/bcachefs/trace.h
index 893304a1f06e..7857671159b4 100644
--- a/fs/bcachefs/trace.h
+++ b/fs/bcachefs/trace.h
@@ -196,10 +196,9 @@ DEFINE_EVENT(bio, journal_write,
 TRACE_EVENT(journal_reclaim_start,
 	TP_PROTO(struct bch_fs *c, bool direct, bool kicked,
 		 u64 min_nr, u64 min_key_cache,
-		 u64 prereserved, u64 prereserved_total,
 		 u64 btree_cache_dirty, u64 btree_cache_total,
 		 u64 btree_key_cache_dirty, u64 btree_key_cache_total),
-	TP_ARGS(c, direct, kicked, min_nr, min_key_cache, prereserved, prereserved_total,
+	TP_ARGS(c, direct, kicked, min_nr, min_key_cache,
 		btree_cache_dirty, btree_cache_total,
 		btree_key_cache_dirty, btree_key_cache_total),
 
@@ -209,8 +208,6 @@ TRACE_EVENT(journal_reclaim_start,
 		__field(bool,		kicked			)
 		__field(u64,		min_nr			)
 		__field(u64,		min_key_cache		)
-		__field(u64,		prereserved		)
-		__field(u64,		prereserved_total	)
 		__field(u64,		btree_cache_dirty	)
 		__field(u64,		btree_cache_total	)
 		__field(u64,		btree_key_cache_dirty	)
@@ -223,22 +220,18 @@ TRACE_EVENT(journal_reclaim_start,
 		__entry->kicked			= kicked;
 		__entry->min_nr			= min_nr;
 		__entry->min_key_cache		= min_key_cache;
-		__entry->prereserved		= prereserved;
-		__entry->prereserved_total	= prereserved_total;
 		__entry->btree_cache_dirty	= btree_cache_dirty;
 		__entry->btree_cache_total	= btree_cache_total;
 		__entry->btree_key_cache_dirty	= btree_key_cache_dirty;
 		__entry->btree_key_cache_total	= btree_key_cache_total;
 	),
 
-	TP_printk("%d,%d direct %u kicked %u min %llu key cache %llu prereserved %llu/%llu btree cache %llu/%llu key cache %llu/%llu",
+	TP_printk("%d,%d direct %u kicked %u min %llu key cache %llu btree cache %llu/%llu key cache %llu/%llu",
 		  MAJOR(__entry->dev), MINOR(__entry->dev),
 		  __entry->direct,
 		  __entry->kicked,
 		  __entry->min_nr,
 		  __entry->min_key_cache,
-		  __entry->prereserved,
-		  __entry->prereserved_total,
 		  __entry->btree_cache_dirty,
 		  __entry->btree_cache_total,
 		  __entry->btree_key_cache_dirty,
-- 
2.42.0



* [PATCH 02/17] bcachefs: track_event_change()
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
  2023-11-10 16:31 ` [PATCH 01/17] bcachefs: Kill journal pre-reservations Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn Kent Overstreet
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

This introduces a new helper for connecting time_stats to state changes,
e.g. when taking journal reservations becomes blocked for some reason.

We use this to track separately the different reasons the journal might
be blocked - i.e. the journal itself being out of space, or the journal
pin fifo being full.

Also do some cleanup and improvements on the time stats code.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/bcachefs.h        |   4 +-
 fs/bcachefs/journal.c         |  16 +---
 fs/bcachefs/journal_io.c      |   3 +
 fs/bcachefs/journal_reclaim.c |  27 ++++---
 fs/bcachefs/journal_types.h   |   6 +-
 fs/bcachefs/super.c           |   3 +-
 fs/bcachefs/util.c            | 140 +++++++++++++++++++---------------
 fs/bcachefs/util.h            |  33 +++++++-
 8 files changed, 140 insertions(+), 92 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 403aa3389fcc..3117ab4426a7 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -401,7 +401,9 @@ BCH_DEBUG_PARAMS_DEBUG()
 	x(journal_flush_write)			\
 	x(journal_noflush_write)		\
 	x(journal_flush_seq)			\
-	x(blocked_journal)			\
+	x(blocked_journal_low_on_space)		\
+	x(blocked_journal_low_on_pin)		\
+	x(blocked_journal_max_in_flight)	\
 	x(blocked_allocate)			\
 	x(blocked_allocate_open_bucket)		\
 	x(nocow_lock_contended)
diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index 46eba1072940..7d448136434b 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -361,11 +361,6 @@ static int journal_entry_open(struct journal *j)
 	} while ((v = atomic64_cmpxchg(&j->reservations.counter,
 				       old.v, new.v)) != old.v);
 
-	if (j->res_get_blocked_start)
-		bch2_time_stats_update(j->blocked_time,
-				       j->res_get_blocked_start);
-	j->res_get_blocked_start = 0;
-
 	mod_delayed_work(c->io_complete_wq,
 			 &j->write_work,
 			 msecs_to_jiffies(c->opts.journal_flush_delay));
@@ -465,15 +460,12 @@ static int __journal_res_get(struct journal *j, struct journal_res *res,
 	__journal_entry_close(j, JOURNAL_ENTRY_CLOSED_VAL);
 	ret = journal_entry_open(j);
 
-	if (ret == JOURNAL_ERR_max_in_flight)
+	if (ret == JOURNAL_ERR_max_in_flight) {
+		track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight],
+				   &j->max_in_flight_start, true);
 		trace_and_count(c, journal_entry_full, c);
-unlock:
-	if ((ret && ret != JOURNAL_ERR_insufficient_devices) &&
-	    !j->res_get_blocked_start) {
-		j->res_get_blocked_start = local_clock() ?: 1;
-		trace_and_count(c, journal_full, c);
 	}
-
+unlock:
 	can_discard = j->can_discard;
 	spin_unlock(&j->lock);
 
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 25dbeb0396b2..52a59770d8d0 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1585,6 +1585,9 @@ static void journal_write_done(struct closure *cl)
 
 	bch2_journal_space_available(j);
 
+	track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight],
+			   &j->max_in_flight_start, false);
+
 	closure_wake_up(&w->wait);
 	journal_wake(j);
 
diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
index b3b822170b2c..79a6fdc6e6ef 100644
--- a/fs/bcachefs/journal_reclaim.c
+++ b/fs/bcachefs/journal_reclaim.c
@@ -50,17 +50,21 @@ unsigned bch2_journal_dev_buckets_available(struct journal *j,
 	return available;
 }
 
-static inline void journal_set_watermark(struct journal *j, bool low_on_space)
+static inline void journal_set_watermark(struct journal *j)
 {
-	unsigned watermark = BCH_WATERMARK_stripe;
-
-	if (low_on_space)
-		watermark = max_t(unsigned, watermark, BCH_WATERMARK_reclaim);
-	if (fifo_free(&j->pin) < j->pin.size / 4)
-		watermark = max_t(unsigned, watermark, BCH_WATERMARK_reclaim);
-
-	if (watermark == j->watermark)
-		return;
+	struct bch_fs *c = container_of(j, struct bch_fs, journal);
+	bool low_on_space = j->space[journal_space_clean].total * 4 <=
+		j->space[journal_space_total].total;
+	bool low_on_pin = fifo_free(&j->pin) < j->pin.size / 4;
+	unsigned watermark = low_on_space || low_on_pin
+		? BCH_WATERMARK_reclaim
+		: BCH_WATERMARK_stripe;
+
+	if (track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_space],
+			       &j->low_on_space_start, low_on_space) ||
+	    track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_pin],
+			       &j->low_on_pin_start, low_on_pin))
+		trace_and_count(c, journal_full, c);
 
 	swap(watermark, j->watermark);
 	if (watermark > j->watermark)
@@ -226,7 +230,7 @@ void bch2_journal_space_available(struct journal *j)
 	else
 		clear_bit(JOURNAL_MAY_SKIP_FLUSH, &j->flags);
 
-	journal_set_watermark(j, clean * 4 <= total);
+	journal_set_watermark(j);
 out:
 	j->cur_entry_sectors	= !ret ? j->space[journal_space_discarded].next_entry : 0;
 	j->cur_entry_error	= ret;
@@ -828,6 +832,7 @@ static int journal_flush_done(struct journal *j, u64 seq_to_flush,
 
 bool bch2_journal_flush_pins(struct journal *j, u64 seq_to_flush)
 {
+	/* time_stats this */
 	bool did_work = false;
 
 	if (!test_bit(JOURNAL_STARTED, &j->flags))
diff --git a/fs/bcachefs/journal_types.h b/fs/bcachefs/journal_types.h
index 1acce6ecfca0..2427cce64fed 100644
--- a/fs/bcachefs/journal_types.h
+++ b/fs/bcachefs/journal_types.h
@@ -262,16 +262,18 @@ struct journal {
 
 	unsigned long		last_flush_write;
 
-	u64			res_get_blocked_start;
 	u64			write_start_time;
 
 	u64			nr_flush_writes;
 	u64			nr_noflush_writes;
 	u64			entry_bytes_written;
 
+	u64			low_on_space_start;
+	u64			low_on_pin_start;
+	u64			max_in_flight_start;
+
 	struct bch2_time_stats	*flush_write_time;
 	struct bch2_time_stats	*noflush_write_time;
-	struct bch2_time_stats	*blocked_time;
 	struct bch2_time_stats	*flush_seq_time;
 
 #ifdef CONFIG_DEBUG_LOCK_ALLOC
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index 24672bb31cbe..bb9451082e87 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -641,7 +641,9 @@ static int bch2_fs_online(struct bch_fs *c)
 	ret = kobject_add(&c->kobj, NULL, "%pU", c->sb.user_uuid.b) ?:
 	    kobject_add(&c->internal, &c->kobj, "internal") ?:
 	    kobject_add(&c->opts_dir, &c->kobj, "options") ?:
+#ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
 	    kobject_add(&c->time_stats, &c->kobj, "time_stats") ?:
+#endif
 	    kobject_add(&c->counters_kobj, &c->kobj, "counters") ?:
 	    bch2_opts_create_sysfs_files(&c->opts_dir);
 	if (ret) {
@@ -750,7 +752,6 @@ static struct bch_fs *bch2_fs_alloc(struct bch_sb *sb, struct bch_opts opts)
 
 	c->journal.flush_write_time	= &c->times[BCH_TIME_journal_flush_write];
 	c->journal.noflush_write_time	= &c->times[BCH_TIME_journal_noflush_write];
-	c->journal.blocked_time		= &c->times[BCH_TIME_blocked_journal];
 	c->journal.flush_seq_time	= &c->times[BCH_TIME_journal_flush_seq];
 
 	bch2_fs_btree_cache_init_early(&c->btree_cache);
diff --git a/fs/bcachefs/util.c b/fs/bcachefs/util.c
index eefe7b42a20a..2ff9cdfb006c 100644
--- a/fs/bcachefs/util.c
+++ b/fs/bcachefs/util.c
@@ -315,6 +315,57 @@ int bch2_prt_task_backtrace(struct printbuf *out, struct task_struct *task)
 	return ret;
 }
 
+#ifndef __KERNEL__
+#include <time.h>
+void bch2_prt_datetime(struct printbuf *out, time64_t sec)
+{
+	time_t t = sec;
+	char buf[64];
+	ctime_r(&t, buf);
+	prt_str(out, buf);
+}
+#else
+void bch2_prt_datetime(struct printbuf *out, time64_t sec)
+{
+	char buf[64];
+	snprintf(buf, sizeof(buf), "%ptT", &sec);
+	prt_u64(out, sec);
+}
+#endif
+
+static const struct time_unit {
+	const char	*name;
+	u64		nsecs;
+} time_units[] = {
+	{ "ns",		1		 },
+	{ "us",		NSEC_PER_USEC	 },
+	{ "ms",		NSEC_PER_MSEC	 },
+	{ "s",		NSEC_PER_SEC	 },
+	{ "m",          (u64) NSEC_PER_SEC * 60},
+	{ "h",          (u64) NSEC_PER_SEC * 3600},
+	{ "eon",        U64_MAX          },
+};
+
+static const struct time_unit *pick_time_units(u64 ns)
+{
+	const struct time_unit *u;
+
+	for (u = time_units;
+	     u + 1 < time_units + ARRAY_SIZE(time_units) &&
+	     ns >= u[1].nsecs << 1;
+	     u++)
+		;
+
+	return u;
+}
+
+void bch2_pr_time_units(struct printbuf *out, u64 ns)
+{
+	const struct time_unit *u = pick_time_units(ns);
+
+	prt_printf(out, "%llu %s", div_u64(ns, u->nsecs), u->name);
+}
+
 /* time stats: */
 
 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
@@ -359,6 +410,7 @@ static inline void bch2_time_stats_update_one(struct bch2_time_stats *stats,
 		mean_and_variance_weighted_update(&stats->duration_stats_weighted, duration);
 		stats->max_duration = max(stats->max_duration, duration);
 		stats->min_duration = min(stats->min_duration, duration);
+		stats->total_duration += duration;
 		bch2_quantiles_update(&stats->quantiles, duration);
 	}
 
@@ -372,20 +424,24 @@ static inline void bch2_time_stats_update_one(struct bch2_time_stats *stats,
 	}
 }
 
+static void __bch2_time_stats_clear_buffer(struct bch2_time_stats *stats,
+					   struct bch2_time_stat_buffer *b)
+{
+	for (struct bch2_time_stat_buffer_entry *i = b->entries;
+	     i < b->entries + ARRAY_SIZE(b->entries);
+	     i++)
+		bch2_time_stats_update_one(stats, i->start, i->end);
+	b->nr = 0;
+}
+
 static noinline void bch2_time_stats_clear_buffer(struct bch2_time_stats *stats,
 						  struct bch2_time_stat_buffer *b)
 {
-	struct bch2_time_stat_buffer_entry *i;
 	unsigned long flags;
 
 	spin_lock_irqsave(&stats->lock, flags);
-	for (i = b->entries;
-	     i < b->entries + ARRAY_SIZE(b->entries);
-	     i++)
-		bch2_time_stats_update_one(stats, i->start, i->end);
+	__bch2_time_stats_clear_buffer(stats, b);
 	spin_unlock_irqrestore(&stats->lock, flags);
-
-	b->nr = 0;
 }
 
 void __bch2_time_stats_update(struct bch2_time_stats *stats, u64 start, u64 end)
@@ -423,40 +479,6 @@ void __bch2_time_stats_update(struct bch2_time_stats *stats, u64 start, u64 end)
 		preempt_enable();
 	}
 }
-#endif
-
-static const struct time_unit {
-	const char	*name;
-	u64		nsecs;
-} time_units[] = {
-	{ "ns",		1		 },
-	{ "us",		NSEC_PER_USEC	 },
-	{ "ms",		NSEC_PER_MSEC	 },
-	{ "s",		NSEC_PER_SEC	 },
-	{ "m",          (u64) NSEC_PER_SEC * 60},
-	{ "h",          (u64) NSEC_PER_SEC * 3600},
-	{ "eon",        U64_MAX          },
-};
-
-static const struct time_unit *pick_time_units(u64 ns)
-{
-	const struct time_unit *u;
-
-	for (u = time_units;
-	     u + 1 < time_units + ARRAY_SIZE(time_units) &&
-	     ns >= u[1].nsecs << 1;
-	     u++)
-		;
-
-	return u;
-}
-
-void bch2_pr_time_units(struct printbuf *out, u64 ns)
-{
-	const struct time_unit *u = pick_time_units(ns);
-
-	prt_printf(out, "%llu %s", div_u64(ns, u->nsecs), u->name);
-}
 
 static void bch2_pr_time_units_aligned(struct printbuf *out, u64 ns)
 {
@@ -467,26 +489,6 @@ static void bch2_pr_time_units_aligned(struct printbuf *out, u64 ns)
 	prt_printf(out, "%s", u->name);
 }
 
-#ifndef __KERNEL__
-#include <time.h>
-void bch2_prt_datetime(struct printbuf *out, time64_t sec)
-{
-	time_t t = sec;
-	char buf[64];
-	ctime_r(&t, buf);
-	prt_str(out, buf);
-}
-#else
-void bch2_prt_datetime(struct printbuf *out, time64_t sec)
-{
-	char buf[64];
-	snprintf(buf, sizeof(buf), "%ptT", &sec);
-	prt_u64(out, sec);
-}
-#endif
-
-#define TABSTOP_SIZE 12
-
 static inline void pr_name_and_units(struct printbuf *out, const char *name, u64 ns)
 {
 	prt_str(out, name);
@@ -495,12 +497,24 @@ static inline void pr_name_and_units(struct printbuf *out, const char *name, u64
 	prt_newline(out);
 }
 
+#define TABSTOP_SIZE 12
+
 void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats)
 {
 	const struct time_unit *u;
 	s64 f_mean = 0, d_mean = 0;
 	u64 q, last_q = 0, f_stddev = 0, d_stddev = 0;
 	int i;
+
+	if (stats->buffer) {
+		int cpu;
+
+		spin_lock_irq(&stats->lock);
+		for_each_possible_cpu(cpu)
+			__bch2_time_stats_clear_buffer(stats, per_cpu_ptr(stats->buffer, cpu));
+		spin_unlock_irq(&stats->lock);
+	}
+
 	/*
 	 * avoid divide by zero
 	 */
@@ -546,6 +560,7 @@ void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats
 
 	pr_name_and_units(out, "min:", stats->min_duration);
 	pr_name_and_units(out, "max:", stats->max_duration);
+	pr_name_and_units(out, "total:", stats->total_duration);
 
 	prt_printf(out, "mean:");
 	prt_tab(out);
@@ -603,6 +618,9 @@ void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats
 		last_q = q;
 	}
 }
+#else
+void bch2_time_stats_to_text(struct printbuf *out, struct bch2_time_stats *stats) {}
+#endif
 
 void bch2_time_stats_exit(struct bch2_time_stats *stats)
 {
diff --git a/fs/bcachefs/util.h b/fs/bcachefs/util.h
index 977e1d9fa453..54e309d94b9b 100644
--- a/fs/bcachefs/util.h
+++ b/fs/bcachefs/util.h
@@ -372,8 +372,9 @@ struct bch2_time_stat_buffer {
 struct bch2_time_stats {
 	spinlock_t	lock;
 	/* all fields are in nanoseconds */
-	u64		max_duration;
 	u64             min_duration;
+	u64		max_duration;
+	u64		total_duration;
 	u64             max_freq;
 	u64             min_freq;
 	u64		last_event;
@@ -388,15 +389,39 @@ struct bch2_time_stats {
 
 #ifndef CONFIG_BCACHEFS_NO_LATENCY_ACCT
 void __bch2_time_stats_update(struct bch2_time_stats *stats, u64, u64);
-#else
-static inline void __bch2_time_stats_update(struct bch2_time_stats *stats, u64 start, u64 end) {}
-#endif
 
 static inline void bch2_time_stats_update(struct bch2_time_stats *stats, u64 start)
 {
 	__bch2_time_stats_update(stats, start, local_clock());
 }
 
+static inline bool track_event_change(struct bch2_time_stats *stats,
+				      u64 *start, bool v)
+{
+	if (v != !!*start) {
+		if (!v) {
+			bch2_time_stats_update(stats, *start);
+			*start = 0;
+		} else {
+			*start = local_clock() ?: 1;
+			return true;
+		}
+	}
+
+	return false;
+}
+#else
+static inline void __bch2_time_stats_update(struct bch2_time_stats *stats, u64 start, u64 end) {}
+static inline void bch2_time_stats_update(struct bch2_time_stats *stats, u64 start) {}
+static inline bool track_event_change(struct bch2_time_stats *stats,
+				      u64 *start, bool v)
+{
+	bool ret = v && !*start;
+	*start = v;
+	return ret;
+}
+#endif
+
 void bch2_time_stats_to_text(struct printbuf *, struct bch2_time_stats *);
 
 void bch2_time_stats_exit(struct bch2_time_stats *);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
  2023-11-10 16:31 ` [PATCH 01/17] bcachefs: Kill journal pre-reservations Kent Overstreet
  2023-11-10 16:31 ` [PATCH 02/17] bcachefs: track_event_change() Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-13 15:22   ` Brian Foster
  2023-11-10 16:31 ` [PATCH 04/17] bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res" Kent Overstreet
                   ` (14 subsequent siblings)
  17 siblings, 1 reply; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

flush_fn is how we identify journal pins in debugfs - this is a
debugging aid.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/btree_trans_commit.c    |  9 ++++++++-
 fs/bcachefs/btree_update_interior.c | 21 ++++++++++++++++++---
 fs/bcachefs/btree_write_buffer.c    | 18 +++++++-----------
 fs/bcachefs/journal_reclaim.c       | 12 +++++++-----
 4 files changed, 40 insertions(+), 20 deletions(-)

diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index d0ead308deef..3152f6abf73f 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -841,6 +841,12 @@ static noinline int bch2_trans_commit_bkey_invalid(struct btree_trans *trans,
 	return -EINVAL;
 }
 
+static int bch2_trans_commit_journal_pin_flush(struct journal *j,
+				struct journal_entry_pin *_pin, u64 seq)
+{
+	return 0;
+}
+
 /*
  * Get journal reservation, take write locks, and attempt to do btree update(s):
  */
@@ -884,7 +890,8 @@ static inline int do_bch2_trans_commit(struct btree_trans *trans, unsigned flags
 
 	if (!ret && trans->journal_pin)
 		bch2_journal_pin_add(&c->journal, trans->journal_res.seq,
-				     trans->journal_pin, NULL);
+				     trans->journal_pin,
+				     bch2_trans_commit_journal_pin_flush);
 
 	/*
 	 * Drop journal reservation after dropping write locks, since dropping
diff --git a/fs/bcachefs/btree_update_interior.c b/fs/bcachefs/btree_update_interior.c
index 925837eca369..518e5b003d04 100644
--- a/fs/bcachefs/btree_update_interior.c
+++ b/fs/bcachefs/btree_update_interior.c
@@ -814,6 +814,12 @@ static void btree_update_updated_node(struct btree_update *as, struct btree *b)
 	mutex_unlock(&c->btree_interior_update_lock);
 }
 
+static int bch2_update_reparent_journal_pin_flush(struct journal *j,
+				struct journal_entry_pin *_pin, u64 seq)
+{
+	return 0;
+}
+
 static void btree_update_reparent(struct btree_update *as,
 				  struct btree_update *child)
 {
@@ -824,7 +830,8 @@ static void btree_update_reparent(struct btree_update *as,
 	child->b = NULL;
 	child->mode = BTREE_INTERIOR_UPDATING_AS;
 
-	bch2_journal_pin_copy(&c->journal, &as->journal, &child->journal, NULL);
+	bch2_journal_pin_copy(&c->journal, &as->journal, &child->journal,
+			      bch2_update_reparent_journal_pin_flush);
 }
 
 static void btree_update_updated_root(struct btree_update *as, struct btree *b)
@@ -933,6 +940,12 @@ static void bch2_btree_update_get_open_buckets(struct btree_update *as, struct b
 			b->ob.v[--b->ob.nr];
 }
 
+static int bch2_btree_update_will_free_node_journal_pin_flush(struct journal *j,
+				struct journal_entry_pin *_pin, u64 seq)
+{
+	return 0;
+}
+
 /*
  * @b is being split/rewritten: it may have pointers to not-yet-written btree
  * nodes and thus outstanding btree_updates - redirect @b's
@@ -984,11 +997,13 @@ static void bch2_btree_interior_update_will_free_node(struct btree_update *as,
 	 * when the new nodes are persistent and reachable on disk:
 	 */
 	w = btree_current_write(b);
-	bch2_journal_pin_copy(&c->journal, &as->journal, &w->journal, NULL);
+	bch2_journal_pin_copy(&c->journal, &as->journal, &w->journal,
+			      bch2_btree_update_will_free_node_journal_pin_flush);
 	bch2_journal_pin_drop(&c->journal, &w->journal);
 
 	w = btree_prev_write(b);
-	bch2_journal_pin_copy(&c->journal, &as->journal, &w->journal, NULL);
+	bch2_journal_pin_copy(&c->journal, &as->journal, &w->journal,
+			      bch2_btree_update_will_free_node_journal_pin_flush);
 	bch2_journal_pin_drop(&c->journal, &w->journal);
 
 	mutex_unlock(&c->btree_interior_update_lock);
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 5c398b05cb39..9e3107187e1d 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -11,6 +11,9 @@
 
 #include <linux/sort.h>
 
+static int bch2_btree_write_buffer_journal_flush(struct journal *,
+				struct journal_entry_pin *, u64);
+
 static int btree_write_buffered_key_cmp(const void *_l, const void *_r)
 {
 	const struct btree_write_buffered_key *l = _l;
@@ -148,7 +151,8 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 	if (!locked && !mutex_trylock(&wb->flush_lock))
 		return 0;
 
-	bch2_journal_pin_copy(j, &pin, &wb->journal_pin, NULL);
+	bch2_journal_pin_copy(j, &pin, &wb->journal_pin,
+			      bch2_btree_write_buffer_journal_flush);
 	bch2_journal_pin_drop(j, &wb->journal_pin);
 
 	s = btree_write_buffer_switch(wb);
@@ -250,16 +254,8 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 		if (!i->journal_seq)
 			continue;
 
-		if (i->journal_seq > pin.seq) {
-			struct journal_entry_pin pin2;
-
-			memset(&pin2, 0, sizeof(pin2));
-
-			bch2_journal_pin_add(j, i->journal_seq, &pin2, NULL);
-			bch2_journal_pin_drop(j, &pin);
-			bch2_journal_pin_copy(j, &pin, &pin2, NULL);
-			bch2_journal_pin_drop(j, &pin2);
-		}
+		bch2_journal_pin_update(j, i->journal_seq, &pin,
+			      bch2_btree_write_buffer_journal_flush);
 
 		ret = commit_do(trans, NULL, NULL,
 				commit_flags|
diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
index 79a6fdc6e6ef..8fa05bedb7df 100644
--- a/fs/bcachefs/journal_reclaim.c
+++ b/fs/bcachefs/journal_reclaim.c
@@ -378,14 +378,16 @@ static inline void bch2_journal_pin_set_locked(struct journal *j, u64 seq,
 {
 	struct journal_entry_pin_list *pin_list = journal_seq_pin(j, seq);
 
+	/*
+	 * flush_fn is how we identify journal pins in debugfs, so must always
+	 * exist, even if it doesn't do anything:
+	 */
+	BUG_ON(!flush_fn);
+
 	atomic_inc(&pin_list->count);
 	pin->seq	= seq;
 	pin->flush	= flush_fn;
-
-	if (flush_fn)
-		list_add(&pin->list, &pin_list->list[type]);
-	else
-		list_add(&pin->list, &pin_list->flushed);
+	list_add(&pin->list, &pin_list->list[type]);
 }
 
 void bch2_journal_pin_copy(struct journal *j,
-- 
2.42.0



* [PATCH 04/17] bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res"
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (2 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL Kent Overstreet
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

This slightly changes how trans->journal_res works, in preparation for
changing the btree write buffer flush path to use it.

Now, BTREE_INSERT_JOURNAL_REPLAY means "don't take a journal
reservation; trans->journal_res.seq already refers to the journal
sequence number to pin".
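The contract can be sketched with a toy model (hypothetical names, not
the real bcachefs commit path): without the flag, commit zeroes
journal_res and takes a fresh reservation; with it, the caller-supplied
seq is used as the pin:

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical toy model of the new BTREE_INSERT_JOURNAL_REPLAY
 * semantics; the real commit path lives in btree_trans_commit.c. */
#define TOY_INSERT_JOURNAL_REPLAY	(1U << 0)

struct toy_journal_res { uint64_t seq; };
struct toy_trans { struct toy_journal_res journal_res; };

static uint64_t toy_next_seq = 100;

/* returns the journal seq the committed keys end up pinned at */
static uint64_t toy_trans_commit(struct toy_trans *trans, unsigned flags)
{
	if (!(flags & TOY_INSERT_JOURNAL_REPLAY)) {
		/* normal path: take (and later drop) a journal reservation */
		memset(&trans->journal_res, 0, sizeof(trans->journal_res));
		trans->journal_res.seq = toy_next_seq++;
	}
	/* with the flag, trans->journal_res.seq was set by the caller */
	return trans->journal_res.seq;
}
```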

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/btree_trans_commit.c | 19 +++++++++++++++----
 fs/bcachefs/recovery.c           |  2 ++
 2 files changed, 17 insertions(+), 4 deletions(-)

diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index 3152f6abf73f..ec90a06a6cf9 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -677,8 +677,6 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
 
 		if (unlikely(trans->journal_transaction_names))
 			journal_transaction_name(trans);
-	} else {
-		trans->journal_res.seq = c->journal.replay_journal_seq;
 	}
 
 	/*
@@ -897,7 +895,8 @@ static inline int do_bch2_trans_commit(struct btree_trans *trans, unsigned flags
 	 * Drop journal reservation after dropping write locks, since dropping
 	 * the journal reservation may kick off a journal write:
 	 */
-	bch2_journal_res_put(&c->journal, &trans->journal_res);
+	if (likely(!(flags & BTREE_INSERT_JOURNAL_REPLAY)))
+		bch2_journal_res_put(&c->journal, &trans->journal_res);
 
 	return ret;
 }
@@ -1140,7 +1139,8 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 	}
 retry:
 	bch2_trans_verify_not_in_restart(trans);
-	memset(&trans->journal_res, 0, sizeof(trans->journal_res));
+	if (likely(!(flags & BTREE_INSERT_JOURNAL_REPLAY)))
+		memset(&trans->journal_res, 0, sizeof(trans->journal_res));
 
 	ret = do_bch2_trans_commit(trans, flags, &i, _RET_IP_);
 
@@ -1165,5 +1165,16 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 	if (ret)
 		goto out;
 
+	/*
+	 * We might have done another transaction commit in the error path -
+	 * i.e. btree write buffer flush - which will have made use of
+	 * trans->journal_res, but with BTREE_INSERT_JOURNAL_REPLAY that is how
+	 * the journal sequence number to pin is passed in - so we must restart:
+	 */
+	if (flags & BTREE_INSERT_JOURNAL_REPLAY) {
+		ret = -BCH_ERR_transaction_restart_nested;
+		goto out;
+	}
+
 	goto retry;
 }
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 9c30500ce920..8fe39a91beed 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -98,6 +98,8 @@ static int bch2_journal_replay_key(struct btree_trans *trans,
 	unsigned update_flags = BTREE_TRIGGER_NORUN;
 	int ret;
 
+	trans->journal_res.seq = k->journal_seq;
+
 	/*
 	 * BTREE_UPDATE_KEY_CACHE_RECLAIM disables key cache lookup/update to
 	 * keep the key cache coherent with the underlying btree. Nothing
-- 
2.42.0



* [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (3 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 04/17] bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res" Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-13 15:29   ` Brian Foster
  2023-11-10 16:31 ` [PATCH 06/17] bcachefs: Go rw before journal replay Kent Overstreet
                   ` (12 subsequent siblings)
  17 siblings, 1 reply; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
now switch the btree write buffer to use it for flushing.

This has the advantage that transaction commits don't need to take a
journal reservation at all.
(They still journal their keys at flush time via the write buffer entry's
original sequence number, but no new reservation is needed.)

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/bkey_methods.h       |  2 --
 fs/bcachefs/btree_trans_commit.c |  7 +------
 fs/bcachefs/btree_types.h        |  1 -
 fs/bcachefs/btree_update.c       | 23 -----------------------
 fs/bcachefs/btree_write_buffer.c | 14 ++++++++++----
 5 files changed, 11 insertions(+), 36 deletions(-)

diff --git a/fs/bcachefs/bkey_methods.h b/fs/bcachefs/bkey_methods.h
index 3a370b7087ac..912adadfb4dd 100644
--- a/fs/bcachefs/bkey_methods.h
+++ b/fs/bcachefs/bkey_methods.h
@@ -93,7 +93,6 @@ static inline int bch2_mark_key(struct btree_trans *trans,
 enum btree_update_flags {
 	__BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE = __BTREE_ITER_FLAGS_END,
 	__BTREE_UPDATE_NOJOURNAL,
-	__BTREE_UPDATE_PREJOURNAL,
 	__BTREE_UPDATE_KEY_CACHE_RECLAIM,
 
 	__BTREE_TRIGGER_NORUN,		/* Don't run triggers at all */
@@ -108,7 +107,6 @@ enum btree_update_flags {
 
 #define BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE (1U << __BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE)
 #define BTREE_UPDATE_NOJOURNAL		(1U << __BTREE_UPDATE_NOJOURNAL)
-#define BTREE_UPDATE_PREJOURNAL		(1U << __BTREE_UPDATE_PREJOURNAL)
 #define BTREE_UPDATE_KEY_CACHE_RECLAIM	(1U << __BTREE_UPDATE_KEY_CACHE_RECLAIM)
 
 #define BTREE_TRIGGER_NORUN		(1U << __BTREE_TRIGGER_NORUN)
diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index ec90a06a6cf9..f231f01072c2 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -779,12 +779,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
 
 	trans_for_each_update(trans, i) {
 		if (!i->cached) {
-			u64 seq = trans->journal_res.seq;
-
-			if (i->flags & BTREE_UPDATE_PREJOURNAL)
-				seq = i->seq;
-
-			bch2_btree_insert_key_leaf(trans, i->path, i->k, seq);
+			bch2_btree_insert_key_leaf(trans, i->path, i->k, trans->journal_res.seq);
 		} else if (!i->key_cache_already_flushed)
 			bch2_btree_insert_key_cached(trans, flags, i);
 		else {
diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index 0c97b4ef78fb..a3f24cd0043d 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -366,7 +366,6 @@ struct btree_insert_entry {
 	u8			old_btree_u64s;
 	struct bkey_i		*k;
 	struct btree_path	*path;
-	u64			seq;
 	/* key being overwritten: */
 	struct bkey		old_k;
 	const struct bch_val	*old_v;
diff --git a/fs/bcachefs/btree_update.c b/fs/bcachefs/btree_update.c
index 324767c0ddcc..001198f15730 100644
--- a/fs/bcachefs/btree_update.c
+++ b/fs/bcachefs/btree_update.c
@@ -380,21 +380,12 @@ bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
 {
 	struct bch_fs *c = trans->c;
 	struct btree_insert_entry *i, n;
-	u64 seq = 0;
 	int cmp;
 
 	EBUG_ON(!path->should_be_locked);
 	EBUG_ON(trans->nr_updates >= BTREE_ITER_MAX);
 	EBUG_ON(!bpos_eq(k->k.p, path->pos));
 
-	/*
-	 * The transaction journal res hasn't been allocated at this point.
-	 * That occurs at commit time. Reuse the seq field to pass in the seq
-	 * of a prejournaled key.
-	 */
-	if (flags & BTREE_UPDATE_PREJOURNAL)
-		seq = trans->journal_res.seq;
-
 	n = (struct btree_insert_entry) {
 		.flags		= flags,
 		.bkey_type	= __btree_node_type(path->level, path->btree_id),
@@ -403,7 +394,6 @@ bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
 		.cached		= path->cached,
 		.path		= path,
 		.k		= k,
-		.seq		= seq,
 		.ip_allocated	= ip,
 	};
 
@@ -431,7 +421,6 @@ bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
 		i->cached	= n.cached;
 		i->k		= n.k;
 		i->path		= n.path;
-		i->seq		= n.seq;
 		i->ip_allocated	= n.ip_allocated;
 	} else {
 		array_insert_item(trans->updates, trans->nr_updates,
@@ -542,18 +531,6 @@ int __must_check bch2_trans_update(struct btree_trans *trans, struct btree_iter
 	return bch2_trans_update_by_path(trans, path, k, flags, _RET_IP_);
 }
 
-/*
- * Add a transaction update for a key that has already been journaled.
- */
-int __must_check bch2_trans_update_seq(struct btree_trans *trans, u64 seq,
-				       struct btree_iter *iter, struct bkey_i *k,
-				       enum btree_update_flags flags)
-{
-	trans->journal_res.seq = seq;
-	return bch2_trans_update(trans, iter, k, flags|BTREE_UPDATE_NOJOURNAL|
-						 BTREE_UPDATE_PREJOURNAL);
-}
-
 int __must_check bch2_trans_update_buffered(struct btree_trans *trans,
 					    enum btree_id btree,
 					    struct bkey_i *k)
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 9e3107187e1d..f40ac365620f 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -76,12 +76,15 @@ static int bch2_btree_write_buffer_flush_one(struct btree_trans *trans,
 	(*fast)++;
 	return 0;
 trans_commit:
-	return  bch2_trans_update_seq(trans, wb->journal_seq, iter, &wb->k,
-				      BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
+	trans->journal_res.seq = wb->journal_seq;
+
+	return  bch2_trans_update(trans, iter, &wb->k,
+				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
 		bch2_trans_commit(trans, NULL, NULL,
 				  commit_flags|
 				  BTREE_INSERT_NOCHECK_RW|
 				  BTREE_INSERT_NOFAIL|
+				  BTREE_INSERT_JOURNAL_REPLAY|
 				  BTREE_INSERT_JOURNAL_RECLAIM);
 }
 
@@ -125,9 +128,11 @@ btree_write_buffered_insert(struct btree_trans *trans,
 	bch2_trans_iter_init(trans, &iter, wb->btree, bkey_start_pos(&wb->k.k),
 			     BTREE_ITER_CACHED|BTREE_ITER_INTENT);
 
+	trans->journal_res.seq = wb->journal_seq;
+
 	ret   = bch2_btree_iter_traverse(&iter) ?:
-		bch2_trans_update_seq(trans, wb->journal_seq, &iter, &wb->k,
-				      BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE);
+		bch2_trans_update(trans, &iter, &wb->k,
+				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE);
 	bch2_trans_iter_exit(trans, &iter);
 	return ret;
 }
@@ -260,6 +265,7 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 		ret = commit_do(trans, NULL, NULL,
 				commit_flags|
 				BTREE_INSERT_NOFAIL|
+				BTREE_INSERT_JOURNAL_REPLAY|
 				BTREE_INSERT_JOURNAL_RECLAIM,
 				btree_write_buffered_insert(trans, i));
 		if (bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret)))
-- 
2.42.0



* [PATCH 06/17] bcachefs: Go rw before journal replay
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (4 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 07/17] bcachefs: Make journal replay more efficient Kent Overstreet
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

This gets us slightly nicer log messages.

Also, this clarifies the synchronization of c->journal_keys: after we go
RW it's in use by multiple threads (so that the btree iterator code can
overlay keys from the journal), so it has to be prepped before that
point.
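The journal keys buffer is a gap buffer; the prep step moves the gap to
the end so readers see one contiguous array. A minimal sketch of that
operation on ints (a hypothetical simplification of the real move_gap()):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Toy gap buffer: entries [0, gap) live at the front of d[], entries
 * [gap, nr) at the back, with size - nr unused slots in between.
 * Moving the gap to nr makes entries 0..nr-1 contiguous, which is what
 * concurrent readers need. */
static void move_gap(int *d, size_t nr, size_t size,
		     size_t old_gap, size_t new_gap)
{
	size_t gap_size = size - nr;

	if (old_gap < new_gap)
		memmove(d + old_gap, d + old_gap + gap_size,
			(new_gap - old_gap) * sizeof(*d));
	else if (old_gap > new_gap)
		memmove(d + new_gap + gap_size, d + new_gap,
			(old_gap - new_gap) * sizeof(*d));
}
```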

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/recovery.c | 16 ++++++++++++----
 1 file changed, 12 insertions(+), 4 deletions(-)

diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 8fe39a91beed..6eafff2557a4 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -148,9 +148,6 @@ static int bch2_journal_replay(struct bch_fs *c)
 	size_t i;
 	int ret;
 
-	move_gap(keys->d, keys->nr, keys->size, keys->gap, keys->nr);
-	keys->gap = keys->nr;
-
 	keys_sorted = kvmalloc_array(keys->nr, sizeof(*keys_sorted), GFP_KERNEL);
 	if (!keys_sorted)
 		return -BCH_ERR_ENOMEM_journal_replay;
@@ -177,7 +174,6 @@ static int bch2_journal_replay(struct bch_fs *c)
 		replay_now_at(j, k->journal_seq);
 
 		ret = bch2_trans_do(c, NULL, NULL,
-				    BTREE_INSERT_LAZY_RW|
 				    BTREE_INSERT_NOFAIL|
 				    (!k->allocated
 				     ? BTREE_INSERT_JOURNAL_REPLAY|BCH_WATERMARK_reclaim
@@ -491,7 +487,19 @@ static int bch2_check_allocations(struct bch_fs *c)
 
 static int bch2_set_may_go_rw(struct bch_fs *c)
 {
+	struct journal_keys *keys = &c->journal_keys;
+
+	/*
+	 * After we go RW, the journal keys buffer can't be modified (except for
+	 * setting journal_key->overwritten): it will be accessed by multiple
+	 * threads
+	 */
+	move_gap(keys->d, keys->nr, keys->size, keys->gap, keys->nr);
+	keys->gap = keys->nr;
+
 	set_bit(BCH_FS_MAY_GO_RW, &c->flags);
+	if (keys->nr)
+		return bch2_fs_read_write_early(c);
 	return 0;
 }
 
-- 
2.42.0



* [PATCH 07/17] bcachefs: Make journal replay more efficient
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (5 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 06/17] bcachefs: Go rw before journal replay Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-14 13:19   ` Brian Foster
  2023-11-10 16:31 ` [PATCH 08/17] bcachefs: Don't flush journal after replay Kent Overstreet
                   ` (10 subsequent siblings)
  17 siblings, 1 reply; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Journal replay now first attempts to replay keys in sorted order,
similar to how the btree write buffer flush path works.

Any keys that cannot be replayed due to journal deadlock are then left
for later and replayed in journal order, unpinning journal entries as we
go.
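The two-pass structure can be sketched as follows (a hypothetical toy,
not the real replay code, which commits through the btree transaction
machinery):

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Toy two-pass replay: pass 1 walks keys in btree-sorted order and
 * applies whatever can go in without risking journal deadlock; the
 * leftovers are sorted by journal seq and replayed in pass 2, which is
 * where journal entries can be unpinned in order. */
struct toy_key {
	unsigned long	seq;
	bool		needs_journal_order;
	bool		replayed;
};

static int cmp_seq(const void *l, const void *r)
{
	const struct toy_key * const *a = l, * const *b = r;

	return ((*a)->seq > (*b)->seq) - ((*a)->seq < (*b)->seq);
}

/* returns how many keys needed the slow, journal-order pass */
static size_t toy_replay(struct toy_key *keys, size_t nr)
{
	struct toy_key *later[16];	/* toy bound; the real code uses a darray */
	size_t nr_later = 0;

	for (size_t i = 0; i < nr; i++) {
		if (!keys[i].needs_journal_order)
			keys[i].replayed = true;	/* fast, sorted pass */
		else
			later[nr_later++] = &keys[i];
	}

	qsort(later, nr_later, sizeof(later[0]), cmp_seq);

	for (size_t i = 0; i < nr_later; i++)
		later[i]->replayed = true;	/* journal order, unpinning as we go */

	return nr_later;
}
```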

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/errcode.h  |  1 -
 fs/bcachefs/recovery.c | 84 ++++++++++++++++++++++++++++--------------
 2 files changed, 56 insertions(+), 29 deletions(-)

diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h
index 68a1a96bb7ca..e5c3262cc303 100644
--- a/fs/bcachefs/errcode.h
+++ b/fs/bcachefs/errcode.h
@@ -73,7 +73,6 @@
 	x(ENOMEM,			ENOMEM_fsck_add_nlink)			\
 	x(ENOMEM,			ENOMEM_journal_key_insert)		\
 	x(ENOMEM,			ENOMEM_journal_keys_sort)		\
-	x(ENOMEM,			ENOMEM_journal_replay)			\
 	x(ENOMEM,			ENOMEM_read_superblock_clean)		\
 	x(ENOMEM,			ENOMEM_fs_alloc)			\
 	x(ENOMEM,			ENOMEM_fs_name_alloc)			\
diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 6eafff2557a4..3524d1e0003e 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -98,6 +98,9 @@ static int bch2_journal_replay_key(struct btree_trans *trans,
 	unsigned update_flags = BTREE_TRIGGER_NORUN;
 	int ret;
 
+	if (k->overwritten)
+		return 0;
+
 	trans->journal_res.seq = k->journal_seq;
 
 	/*
@@ -141,24 +144,14 @@ static int journal_sort_seq_cmp(const void *_l, const void *_r)
 static int bch2_journal_replay(struct bch_fs *c)
 {
 	struct journal_keys *keys = &c->journal_keys;
-	struct journal_key **keys_sorted, *k;
+	DARRAY(struct journal_key *) keys_sorted = { 0 };
+	struct journal_key **kp;
 	struct journal *j = &c->journal;
 	u64 start_seq	= c->journal_replay_seq_start;
 	u64 end_seq	= c->journal_replay_seq_start;
-	size_t i;
+	struct btree_trans *trans = bch2_trans_get(c);
 	int ret;
 
-	keys_sorted = kvmalloc_array(keys->nr, sizeof(*keys_sorted), GFP_KERNEL);
-	if (!keys_sorted)
-		return -BCH_ERR_ENOMEM_journal_replay;
-
-	for (i = 0; i < keys->nr; i++)
-		keys_sorted[i] = &keys->d[i];
-
-	sort(keys_sorted, keys->nr,
-	     sizeof(keys_sorted[0]),
-	     journal_sort_seq_cmp, NULL);
-
 	if (keys->nr) {
 		ret = bch2_journal_log_msg(c, "Starting journal replay (%zu keys in entries %llu-%llu)",
 					   keys->nr, start_seq, end_seq);
@@ -166,26 +159,61 @@ static int bch2_journal_replay(struct bch_fs *c)
 			goto err;
 	}
 
-	for (i = 0; i < keys->nr; i++) {
-		k = keys_sorted[i];
+	/*
+	 * First, attempt to replay keys in sorted order. This is more
+	 * efficient, but some might fail if that would cause a journal
+	 * deadlock.
+	 */
+	for (size_t i = 0; i < keys->nr; i++) {
+		cond_resched();
+
+		struct journal_key *k = keys->d + i;
+
+		ret = commit_do(trans, NULL, NULL,
+				BTREE_INSERT_NOFAIL|
+				BTREE_INSERT_JOURNAL_RECLAIM|
+				(!k->allocated ? BTREE_INSERT_JOURNAL_REPLAY : 0),
+			     bch2_journal_replay_key(trans, k));
+		BUG_ON(!ret && !k->overwritten);
+		if (ret) {
+			ret = darray_push(&keys_sorted, k);
+			if (ret)
+				goto err;
+		}
+	}
+
+	/*
+	 * Now, replay any remaining keys in the order in which they appear in
+	 * the journal, unpinning those journal entries as we go:
+	 */
+	sort(keys_sorted.data, keys_sorted.nr,
+	     sizeof(keys_sorted.data[0]),
+	     journal_sort_seq_cmp, NULL);
 
+	darray_for_each(keys_sorted, kp) {
 		cond_resched();
 
+		struct journal_key *k = *kp;
+
 		replay_now_at(j, k->journal_seq);
 
-		ret = bch2_trans_do(c, NULL, NULL,
-				    BTREE_INSERT_NOFAIL|
-				    (!k->allocated
-				     ? BTREE_INSERT_JOURNAL_REPLAY|BCH_WATERMARK_reclaim
-				     : 0),
+		ret = commit_do(trans, NULL, NULL,
+				BTREE_INSERT_NOFAIL|
+				(!k->allocated
+				 ? BTREE_INSERT_JOURNAL_REPLAY|BCH_WATERMARK_reclaim
+				 : 0),
 			     bch2_journal_replay_key(trans, k));
-		if (ret) {
-			bch_err(c, "journal replay: error while replaying key at btree %s level %u: %s",
-				bch2_btree_id_str(k->btree_id), k->level, bch2_err_str(ret));
+		bch_err_msg(c, ret, "while replaying key at btree %s level %u:",
+			    bch2_btree_id_str(k->btree_id), k->level);
+		if (ret)
 			goto err;
-		}
+
+		BUG_ON(!k->overwritten);
 	}
 
+	bch2_trans_put(trans);
+	trans = NULL;
+
 	replay_now_at(j, j->replay_journal_seq_end);
 	j->replay_journal_seq = 0;
 
@@ -196,10 +224,10 @@ static int bch2_journal_replay(struct bch_fs *c)
 	if (keys->nr && !ret)
 		bch2_journal_log_msg(c, "journal replay finished");
 err:
-	kvfree(keys_sorted);
-
-	if (ret)
-		bch_err_fn(c, ret);
+	if (trans)
+		bch2_trans_put(trans);
+	darray_exit(&keys_sorted);
+	bch_err_fn(c, ret);
 	return ret;
 }
 
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 08/17] bcachefs: Don't flush journal after replay
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (6 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 07/17] bcachefs: Make journal replay more efficient Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 09/17] bcachefs: Unwritten journal buffers are always dirty Kent Overstreet
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

The flush_all_pins() after journal replay was unnecessary, and trying to
completely flush the journal while RW is not a great idea - it's not
guaranteed to terminate if other threads keep adding things to the
journal.
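
The non-termination argument can be modeled in a few lines. This is a
hypothetical userspace sketch (struct jm, flush_to_seq() are invented
names, not bcachefs code): flushing up to a sequence number captured
once always terminates, while chasing a moving "journal is empty"
target does not when a concurrent writer keeps appending.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical model: j->front is the oldest unflushed entry, j->back
 * the newest. Each flush step also models one concurrent append.
 * Returns steps taken, or -1 if the target was never reached.
 */
struct jm { uint64_t front, back; };

static int flush_to_seq(struct jm *j, uint64_t seq, int max_steps)
{
	int steps = 0;

	while (j->front < seq && steps < max_steps) {
		j->front++;	/* flush one journal entry */
		j->back++;	/* a concurrent writer appends another */
		steps++;
	}
	return j->front >= seq ? steps : -1;
}
```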

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/recovery.c | 4 +---
 1 file changed, 1 insertion(+), 3 deletions(-)

diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
index 3524d1e0003e..86713b2a6c87 100644
--- a/fs/bcachefs/recovery.c
+++ b/fs/bcachefs/recovery.c
@@ -218,10 +218,8 @@ static int bch2_journal_replay(struct bch_fs *c)
 	j->replay_journal_seq = 0;
 
 	bch2_journal_set_replay_done(j);
-	bch2_journal_flush_all_pins(j);
-	ret = bch2_journal_error(j);
 
-	if (keys->nr && !ret)
+	if (keys->nr)
 		bch2_journal_log_msg(c, "journal replay finished");
 err:
 	if (trans)
-- 
2.42.0



* [PATCH 09/17] bcachefs: Unwritten journal buffers are always dirty
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (7 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 08/17] bcachefs: Don't flush journal after replay Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 10/17] bcachefs: journal->buf_lock Kent Overstreet
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Ensure that journal bufs that haven't been written can't be reclaimed
from the journal pin fifo, and can thus have new pins taken.

Prep work for changing the btree write buffer to pull keys from the
journal directly.
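
The invariant the diff below adds can be sketched as follows. This is a
simplified, hypothetical model (struct pin_fifo, reclaim_fast() are
invented stand-ins for the real struct journal and fifo_* helpers):
reclaim may pop a pin-fifo entry only when its pin count is zero *and*
the corresponding journal buf has actually been written, i.e.
front <= seq_ondisk.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of the journal pin fifo. */
struct pin_fifo {
	uint64_t front, back;	/* front..back-1 are live sequence numbers */
	int	 count[16];	/* pin count per entry, indexed by seq % 16 */
	uint64_t seq_ondisk;	/* highest journal sequence written to disk */
};

/* Returns true if any entries were popped. */
static bool reclaim_fast(struct pin_fifo *j)
{
	bool popped = false;

	while (j->front < j->back &&
	       j->front <= j->seq_ondisk &&	/* unwritten bufs stay dirty */
	       !j->count[j->front % 16]) {
		j->front++;
		popped = true;
	}
	return popped;
}
```

Without the seq_ondisk check, an unpinned but unwritten buf could be
reclaimed out from under the write path - exactly what the write buffer
rework later in this series must rule out.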

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/journal_io.c      | 1 +
 fs/bcachefs/journal_reclaim.c | 1 +
 2 files changed, 2 insertions(+)

diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 52a59770d8d0..51796454bba4 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1583,6 +1583,7 @@ static void journal_write_done(struct closure *cl)
 	} while ((v = atomic64_cmpxchg(&j->reservations.counter,
 				       old.v, new.v)) != old.v);
 
+	bch2_journal_reclaim_fast(j);
 	bch2_journal_space_available(j);
 
 	track_event_change(&c->times[BCH_TIME_blocked_journal_max_in_flight],
diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
index 8fa05bedb7df..67684cc5c1e5 100644
--- a/fs/bcachefs/journal_reclaim.c
+++ b/fs/bcachefs/journal_reclaim.c
@@ -303,6 +303,7 @@ void bch2_journal_reclaim_fast(struct journal *j)
 	 * all btree nodes got written out
 	 */
 	while (!fifo_empty(&j->pin) &&
+	       j->pin.front <= j->seq_ondisk &&
 	       !atomic_read(&fifo_peek_front(&j->pin).count)) {
 		j->pin.front++;
 		popped = true;
-- 
2.42.0



* [PATCH 10/17] bcachefs: journal->buf_lock
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (8 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 09/17] bcachefs: Unwritten journal buffers are always dirty Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 11/17] bcachefs: bch2_journal_block_reservations() Kent Overstreet
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Add a new lock for synchronizing between journal IO path and btree write
buffer flush.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/journal.c       | 1 +
 fs/bcachefs/journal_io.c    | 2 ++
 fs/bcachefs/journal_types.h | 6 ++++++
 3 files changed, 9 insertions(+)

diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index 7d448136434b..3a6b73d4ec24 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -1215,6 +1215,7 @@ int bch2_fs_journal_init(struct journal *j)
 	static struct lock_class_key res_key;
 	unsigned i;
 
+	mutex_init(&j->buf_lock);
 	spin_lock_init(&j->lock);
 	spin_lock_init(&j->err_lock);
 	init_waitqueue_head(&j->wait);
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 51796454bba4..2394d4b02fff 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -1869,9 +1869,11 @@ void bch2_journal_write(struct closure *cl)
 	if (ret)
 		goto err;
 
+	mutex_lock(&j->buf_lock);
 	journal_buf_realloc(j, w);
 
 	ret = bch2_journal_write_prep(j, w);
+	mutex_unlock(&j->buf_lock);
 	if (ret)
 		goto err;
 
diff --git a/fs/bcachefs/journal_types.h b/fs/bcachefs/journal_types.h
index 2427cce64fed..0fed32f94976 100644
--- a/fs/bcachefs/journal_types.h
+++ b/fs/bcachefs/journal_types.h
@@ -181,6 +181,12 @@ struct journal {
 	 */
 	darray_u64		early_journal_entries;
 
+	/*
+	 * Protects journal_buf->data, when accessing without a journal
+	 * reservation: for synchronization between the btree write buffer code
+	 * and the journal write path:
+	 */
+	struct mutex		buf_lock;
 	/*
 	 * Two journal entries -- one is currently open for new entries, the
 	 * other is possibly being written out.
-- 
2.42.0



* [PATCH 11/17] bcachefs: bch2_journal_block_reservations()
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (9 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 10/17] bcachefs: journal->buf_lock Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 12/17] bcachefs: Clean up btree write buffer write ref handling Kent Overstreet
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Add a new journal method for blocking new journal reservations and
waiting on outstanding reservations, to be used by the btree write
buffer flush path.
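
The block-and-drain pattern can be sketched in userspace. This is a
hypothetical single-threaded model (struct jrnl, res_get(), res_put(),
block_reservations() are invented names; the kernel version uses
j->lock, wait_event() and the packed atomic res_state): bump a blocked
counter so new reservations are refused, then wait for every journal
buf's outstanding reservation count to drain to zero.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of journal reservation blocking. */
struct jrnl {
	int blocked;
	int buf_count[4];	/* outstanding reservations per journal buf */
};

static bool res_get(struct jrnl *j, int buf)
{
	if (j->blocked)
		return false;	/* new reservations refused while blocked */
	j->buf_count[buf]++;
	return true;
}

static void res_put(struct jrnl *j, int buf)
{
	j->buf_count[buf]--;
}

static bool reservations_stopped(struct jrnl *j)
{
	for (int i = 0; i < 4; i++)
		if (j->buf_count[i])
			return false;
	return true;
}

static void block_reservations(struct jrnl *j)
{
	j->blocked++;
	/* real code: wait_event(j->wait, journal_reservations_stopped(j)) */
}
```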

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/journal.c | 29 +++++++++++++++++++++++++++++
 fs/bcachefs/journal.h |  3 ++-
 2 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index 3a6b73d4ec24..5b390cb91884 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -147,6 +147,7 @@ void bch2_journal_buf_put_final(struct journal *j, u64 seq, bool write)
 		bch2_journal_reclaim_fast(j);
 	if (write)
 		closure_call(&j->io, bch2_journal_write, c->io_complete_wq, NULL);
+	wake_up(&j->wait);
 }
 
 /*
@@ -764,6 +765,34 @@ void bch2_journal_block(struct journal *j)
 	journal_quiesce(j);
 }
 
+/*
+ * XXX: ideally this would not be closing the current journal entry, but
+ * otherwise we do not have a way to avoid racing with res_get() - j->blocked
+ * will race.
+ */
+static bool journal_reservations_stopped(struct journal *j)
+{
+	union journal_res_state s;
+
+	journal_entry_close(j);
+
+	s.v = atomic64_read_acquire(&j->reservations.counter);
+
+	return  s.buf0_count == 0 &&
+		s.buf1_count == 0 &&
+		s.buf2_count == 0 &&
+		s.buf3_count == 0;
+}
+
+void bch2_journal_block_reservations(struct journal *j)
+{
+	spin_lock(&j->lock);
+	j->blocked++;
+	spin_unlock(&j->lock);
+
+	wait_event(j->wait, journal_reservations_stopped(j));
+}
+
 /* allocate journal on a device: */
 
 static int __bch2_set_nr_journal_buckets(struct bch_dev *ca, unsigned nr,
diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h
index c85d01cf4948..310654bb74de 100644
--- a/fs/bcachefs/journal.h
+++ b/fs/bcachefs/journal.h
@@ -259,7 +259,7 @@ static inline union journal_res_state journal_state_buf_put(struct journal *j, u
 {
 	union journal_res_state s;
 
-	s.v = atomic64_sub_return(((union journal_res_state) {
+	s.v = atomic64_sub_return_release(((union journal_res_state) {
 				    .buf0_count = idx == 0,
 				    .buf1_count = idx == 1,
 				    .buf2_count = idx == 2,
@@ -427,6 +427,7 @@ static inline void bch2_journal_set_replay_done(struct journal *j)
 
 void bch2_journal_unblock(struct journal *);
 void bch2_journal_block(struct journal *);
+void bch2_journal_block_reservations(struct journal *);
 
 void __bch2_journal_debug_to_text(struct printbuf *, struct journal *);
 void bch2_journal_debug_to_text(struct printbuf *, struct journal *);
-- 
2.42.0



* [PATCH 12/17] bcachefs: Clean up btree write buffer write ref handling
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (10 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 11/17] bcachefs: bch2_journal_block_reservations() Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 13/17] bcachefs: bch2_btree_write_buffer_flush_locked() Kent Overstreet
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

__bch2_btree_write_buffer_flush() now assumes a write ref is already
held (as called by the transaction commit path); and the wrappers
bch2_write_buffer_flush() and flush_sync() take an explicit write ref.

This means internally the write buffer code can always use
BTREE_INSERT_NOCHECK_RW, instead of in the previous code passing flags
around and hoping the NOCHECK_RW flag was always carried around
correctly.
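
The wrapper pattern described here can be sketched as follows
(hypothetical names - struct fs, flush(), flush_locked() stand in for
the real bch2_write_ref_tryget()/put() and write buffer functions): the
inner function asserts a write ref is held, and only the public
wrappers take and drop it, so no flag needs to be threaded through.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the write-ref wrapper pattern. */
struct fs {
	bool going_ro;
	int  write_refs;
};

static bool write_ref_tryget(struct fs *c)
{
	if (c->going_ro)
		return false;
	c->write_refs++;
	return true;
}

static void write_ref_put(struct fs *c)
{
	c->write_refs--;
}

/* Inner helper: caller must already hold a write ref. */
static int flush_locked(struct fs *c)
{
	assert(c->write_refs > 0);
	return 0;
}

static int flush(struct fs *c)
{
	if (!write_ref_tryget(c))
		return -1;	/* stands in for -BCH_ERR_erofs_no_writes */
	int ret = flush_locked(c);
	write_ref_put(c);
	return ret;
}
```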

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/bcachefs.h           |  3 ++-
 fs/bcachefs/btree_trans_commit.c |  6 ++----
 fs/bcachefs/btree_write_buffer.c | 33 +++++++++++++++++++++-----------
 fs/bcachefs/btree_write_buffer.h |  2 +-
 fs/bcachefs/inode.c              | 10 +++++-----
 5 files changed, 32 insertions(+), 22 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 3117ab4426a7..748164c0e82b 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -664,7 +664,8 @@ struct btree_trans_buf {
 	x(invalidate)							\
 	x(delete_dead_snapshots)					\
 	x(snapshot_delete_pagecache)					\
-	x(sysfs)
+	x(sysfs)							\
+	x(btree_write_buffer)
 
 enum bch_write_ref {
 #define x(n) BCH_WRITE_REF_##n,
diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index f231f01072c2..929fe398f1db 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -962,8 +962,7 @@ int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags,
 
 			if (wb->state.nr > wb->size * 3 / 4) {
 				bch2_trans_begin(trans);
-				ret = __bch2_btree_write_buffer_flush(trans,
-						flags|BTREE_INSERT_NOCHECK_RW, true);
+				ret = __bch2_btree_write_buffer_flush(trans, true);
 				if (!ret) {
 					trace_and_count(c, trans_restart_write_buffer_flush, trans, _THIS_IP_);
 					ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_write_buffer_flush);
@@ -1082,8 +1081,7 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 		bch2_trans_begin(trans);
 		bch2_trans_unlock(trans);
 
-		ret = __bch2_btree_write_buffer_flush(trans,
-					flags|BTREE_INSERT_NOCHECK_RW, true);
+		ret = __bch2_btree_write_buffer_flush(trans, true);
 		if (!ret) {
 			trace_and_count(c, trans_restart_write_buffer_flush, trans, _THIS_IP_);
 			ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_write_buffer_flush);
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index f40ac365620f..7f358a3d0789 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -137,8 +137,7 @@ btree_write_buffered_insert(struct btree_trans *trans,
 	return ret;
 }
 
-int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_flags,
-				    bool locked)
+int __bch2_btree_write_buffer_flush(struct btree_trans *trans, bool locked)
 {
 	struct bch_fs *c = trans->c;
 	struct journal *j = &c->journal;
@@ -210,8 +209,8 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 		iter.path->preserve = false;
 
 		do {
-			ret = bch2_btree_write_buffer_flush_one(trans, &iter, i,
-						commit_flags, &write_locked, &fast);
+			ret = bch2_btree_write_buffer_flush_one(trans, &iter, i, 0,
+								&write_locked, &fast);
 			if (!write_locked)
 				bch2_trans_begin(trans);
 		} while (bch2_err_matches(ret, BCH_ERR_transaction_restart));
@@ -252,9 +251,6 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 	     btree_write_buffered_journal_cmp,
 	     NULL);
 
-	commit_flags &= ~BCH_WATERMARK_MASK;
-	commit_flags |= BCH_WATERMARK_reclaim;
-
 	for (i = keys; i < keys + nr; i++) {
 		if (!i->journal_seq)
 			continue;
@@ -263,7 +259,8 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 			      bch2_btree_write_buffer_journal_flush);
 
 		ret = commit_do(trans, NULL, NULL,
-				commit_flags|
+				BCH_WATERMARK_reclaim|
+				BTREE_INSERT_NOCHECK_RW|
 				BTREE_INSERT_NOFAIL|
 				BTREE_INSERT_JOURNAL_REPLAY|
 				BTREE_INSERT_JOURNAL_RECLAIM,
@@ -277,14 +274,28 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 
 int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
 {
+	struct bch_fs *c = trans->c;
+
+	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_write_buffer))
+		return -BCH_ERR_erofs_no_writes;
+
 	bch2_trans_unlock(trans);
 	mutex_lock(&trans->c->btree_write_buffer.flush_lock);
-	return __bch2_btree_write_buffer_flush(trans, 0, true);
+	int ret = __bch2_btree_write_buffer_flush(trans, true);
+	bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
+	return ret;
 }
 
 int bch2_btree_write_buffer_flush(struct btree_trans *trans)
 {
-	return __bch2_btree_write_buffer_flush(trans, 0, false);
+	struct bch_fs *c = trans->c;
+
+	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_write_buffer))
+		return -BCH_ERR_erofs_no_writes;
+
+	int ret = __bch2_btree_write_buffer_flush(trans, false);
+	bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
+	return ret;
 }
 
 static int bch2_btree_write_buffer_journal_flush(struct journal *j,
@@ -296,7 +307,7 @@ static int bch2_btree_write_buffer_journal_flush(struct journal *j,
 	mutex_lock(&wb->flush_lock);
 
 	return bch2_trans_run(c,
-			__bch2_btree_write_buffer_flush(trans, BTREE_INSERT_NOCHECK_RW, true));
+			__bch2_btree_write_buffer_flush(trans, true));
 }
 
 static inline u64 btree_write_buffer_ref(int idx)
diff --git a/fs/bcachefs/btree_write_buffer.h b/fs/bcachefs/btree_write_buffer.h
index 322df1c8304e..8639b4209544 100644
--- a/fs/bcachefs/btree_write_buffer.h
+++ b/fs/bcachefs/btree_write_buffer.h
@@ -2,7 +2,7 @@
 #ifndef _BCACHEFS_BTREE_WRITE_BUFFER_H
 #define _BCACHEFS_BTREE_WRITE_BUFFER_H
 
-int __bch2_btree_write_buffer_flush(struct btree_trans *, unsigned, bool);
+int __bch2_btree_write_buffer_flush(struct btree_trans *, bool);
 int bch2_btree_write_buffer_flush_sync(struct btree_trans *);
 int bch2_btree_write_buffer_flush(struct btree_trans *);
 
diff --git a/fs/bcachefs/inode.c b/fs/bcachefs/inode.c
index def77f2d8802..a8488237d0bb 100644
--- a/fs/bcachefs/inode.c
+++ b/fs/bcachefs/inode.c
@@ -1157,10 +1157,6 @@ int bch2_delete_dead_inodes(struct bch_fs *c)
 again:
 	need_another_pass = false;
 
-	ret = bch2_btree_write_buffer_flush_sync(trans);
-	if (ret)
-		goto err;
-
 	/*
 	 * Weird transaction restart handling here because on successful delete,
 	 * bch2_inode_rm_snapshot() will return a nested transaction restart,
@@ -1189,8 +1185,12 @@ int bch2_delete_dead_inodes(struct bch_fs *c)
 	}
 	bch2_trans_iter_exit(trans, &iter);
 
-	if (!ret && need_another_pass)
+	if (!ret && need_another_pass) {
+		ret = bch2_btree_write_buffer_flush_sync(trans);
+		if (ret)
+			goto err;
 		goto again;
+	}
 err:
 	bch2_trans_put(trans);
 
-- 
2.42.0



* [PATCH 13/17] bcachefs: bch2_btree_write_buffer_flush_locked()
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (11 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 12/17] bcachefs: Clean up btree write buffer write ref handling Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 14/17] bcachefs: bch2_btree_write_buffer_flush() -> bch2_btree_write_buffer_tryflush() Kent Overstreet
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Minor refactoring - improved naming, and move the responsibility for
flush_lock to the caller instead of having it be shared.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/btree_trans_commit.c |  6 ++++--
 fs/bcachefs/btree_write_buffer.c | 23 +++++++++++++----------
 fs/bcachefs/btree_write_buffer.h |  2 +-
 3 files changed, 18 insertions(+), 13 deletions(-)

diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index 929fe398f1db..722542316eb1 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -962,7 +962,8 @@ int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags,
 
 			if (wb->state.nr > wb->size * 3 / 4) {
 				bch2_trans_begin(trans);
-				ret = __bch2_btree_write_buffer_flush(trans, true);
+				ret = bch2_btree_write_buffer_flush_locked(trans);
+				mutex_unlock(&wb->flush_lock);
 				if (!ret) {
 					trace_and_count(c, trans_restart_write_buffer_flush, trans, _THIS_IP_);
 					ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_write_buffer_flush);
@@ -1081,7 +1082,8 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 		bch2_trans_begin(trans);
 		bch2_trans_unlock(trans);
 
-		ret = __bch2_btree_write_buffer_flush(trans, true);
+		ret = bch2_btree_write_buffer_flush_locked(trans);
+		mutex_unlock(&c->btree_write_buffer.flush_lock);
 		if (!ret) {
 			trace_and_count(c, trans_restart_write_buffer_flush, trans, _THIS_IP_);
 			ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_write_buffer_flush);
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 7f358a3d0789..4eced8b161bd 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -137,7 +137,7 @@ btree_write_buffered_insert(struct btree_trans *trans,
 	return ret;
 }
 
-int __bch2_btree_write_buffer_flush(struct btree_trans *trans, bool locked)
+int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
 {
 	struct bch_fs *c = trans->c;
 	struct journal *j = &c->journal;
@@ -152,9 +152,6 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, bool locked)
 
 	memset(&pin, 0, sizeof(pin));
 
-	if (!locked && !mutex_trylock(&wb->flush_lock))
-		return 0;
-
 	bch2_journal_pin_copy(j, &pin, &wb->journal_pin,
 			      bch2_btree_write_buffer_journal_flush);
 	bch2_journal_pin_drop(j, &wb->journal_pin);
@@ -237,7 +234,6 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, bool locked)
 	bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
 out:
 	bch2_journal_pin_drop(j, &pin);
-	mutex_unlock(&wb->flush_lock);
 	return ret;
 slowpath:
 	trace_write_buffer_flush_slowpath(trans, i - keys, nr);
@@ -280,8 +276,9 @@ int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
 		return -BCH_ERR_erofs_no_writes;
 
 	bch2_trans_unlock(trans);
-	mutex_lock(&trans->c->btree_write_buffer.flush_lock);
-	int ret = __bch2_btree_write_buffer_flush(trans, true);
+	mutex_lock(&c->btree_write_buffer.flush_lock);
+	int ret = bch2_btree_write_buffer_flush_locked(trans);
+	mutex_unlock(&c->btree_write_buffer.flush_lock);
 	bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
 	return ret;
 }
@@ -289,11 +286,16 @@ int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
 int bch2_btree_write_buffer_flush(struct btree_trans *trans)
 {
 	struct bch_fs *c = trans->c;
+	struct btree_write_buffer *wb = &c->btree_write_buffer;
+	int ret = 0;
 
 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_write_buffer))
 		return -BCH_ERR_erofs_no_writes;
 
-	int ret = __bch2_btree_write_buffer_flush(trans, false);
+	if (mutex_trylock(&wb->flush_lock)) {
+		ret = bch2_btree_write_buffer_flush_locked(trans);
+		mutex_unlock(&wb->flush_lock);
+	}
 	bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
 	return ret;
 }
@@ -305,9 +307,10 @@ static int bch2_btree_write_buffer_journal_flush(struct journal *j,
 	struct btree_write_buffer *wb = &c->btree_write_buffer;
 
 	mutex_lock(&wb->flush_lock);
+	int ret = bch2_trans_run(c, bch2_btree_write_buffer_flush_locked(trans));
+	mutex_unlock(&wb->flush_lock);
 
-	return bch2_trans_run(c,
-			__bch2_btree_write_buffer_flush(trans, true));
+	return ret;
 }
 
 static inline u64 btree_write_buffer_ref(int idx)
diff --git a/fs/bcachefs/btree_write_buffer.h b/fs/bcachefs/btree_write_buffer.h
index 8639b4209544..3735bdcf11f7 100644
--- a/fs/bcachefs/btree_write_buffer.h
+++ b/fs/bcachefs/btree_write_buffer.h
@@ -2,7 +2,7 @@
 #ifndef _BCACHEFS_BTREE_WRITE_BUFFER_H
 #define _BCACHEFS_BTREE_WRITE_BUFFER_H
 
-int __bch2_btree_write_buffer_flush(struct btree_trans *, bool);
+int bch2_btree_write_buffer_flush_locked(struct btree_trans *);
 int bch2_btree_write_buffer_flush_sync(struct btree_trans *);
 int bch2_btree_write_buffer_flush(struct btree_trans *);
 
-- 
2.42.0



* [PATCH 14/17] bcachefs: bch2_btree_write_buffer_flush() -> bch2_btree_write_buffer_tryflush()
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (12 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 13/17] bcachefs: bch2_btree_write_buffer_flush_locked() Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 15/17] bcachefs: Improve btree write buffer tracepoints Kent Overstreet
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

More accurate naming.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/alloc_background.c   | 2 +-
 fs/bcachefs/btree_write_buffer.c | 2 +-
 fs/bcachefs/btree_write_buffer.h | 2 +-
 fs/bcachefs/ec.c                 | 2 +-
 fs/bcachefs/move.c               | 7 +++----
 fs/bcachefs/movinggc.c           | 4 ++--
 6 files changed, 9 insertions(+), 10 deletions(-)

diff --git a/fs/bcachefs/alloc_background.c b/fs/bcachefs/alloc_background.c
index 1fec0e67891f..e5e597be84bb 100644
--- a/fs/bcachefs/alloc_background.c
+++ b/fs/bcachefs/alloc_background.c
@@ -1802,7 +1802,7 @@ static void bch2_do_invalidates_work(struct work_struct *work)
 	unsigned i;
 	int ret = 0;
 
-	ret = bch2_btree_write_buffer_flush(trans);
+	ret = bch2_btree_write_buffer_tryflush(trans);
 	if (ret)
 		goto err;
 
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 4eced8b161bd..be8308789477 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -283,7 +283,7 @@ int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
 	return ret;
 }
 
-int bch2_btree_write_buffer_flush(struct btree_trans *trans)
+int bch2_btree_write_buffer_tryflush(struct btree_trans *trans)
 {
 	struct bch_fs *c = trans->c;
 	struct btree_write_buffer *wb = &c->btree_write_buffer;
diff --git a/fs/bcachefs/btree_write_buffer.h b/fs/bcachefs/btree_write_buffer.h
index 3735bdcf11f7..89d290e306e2 100644
--- a/fs/bcachefs/btree_write_buffer.h
+++ b/fs/bcachefs/btree_write_buffer.h
@@ -4,7 +4,7 @@
 
 int bch2_btree_write_buffer_flush_locked(struct btree_trans *);
 int bch2_btree_write_buffer_flush_sync(struct btree_trans *);
-int bch2_btree_write_buffer_flush(struct btree_trans *);
+int bch2_btree_write_buffer_tryflush(struct btree_trans *);
 
 int bch2_btree_insert_keys_write_buffer(struct btree_trans *);
 
diff --git a/fs/bcachefs/ec.c b/fs/bcachefs/ec.c
index 2a77de18c004..423f2b04e0f3 100644
--- a/fs/bcachefs/ec.c
+++ b/fs/bcachefs/ec.c
@@ -1005,7 +1005,7 @@ static int ec_stripe_update_extents(struct bch_fs *c, struct ec_stripe_buf *s)
 	unsigned i, nr_data = v->nr_blocks - v->nr_redundant;
 	int ret = 0;
 
-	ret = bch2_btree_write_buffer_flush(trans);
+	ret = bch2_btree_write_buffer_tryflush(trans);
 	if (ret)
 		goto err;
 
diff --git a/fs/bcachefs/move.c b/fs/bcachefs/move.c
index ab749bf2fcbc..581cf7e34147 100644
--- a/fs/bcachefs/move.c
+++ b/fs/bcachefs/move.c
@@ -721,11 +721,10 @@ int __bch2_evacuate_bucket(struct moving_context *ctxt,
 	bucket_size = bch_dev_bkey_exists(c, bucket.inode)->mi.bucket_size;
 	fragmentation = a->fragmentation_lru;
 
-	ret = bch2_btree_write_buffer_flush(trans);
-	if (ret) {
-		bch_err_msg(c, ret, "flushing btree write buffer");
+	ret = bch2_btree_write_buffer_tryflush(trans);
+	bch_err_msg(c, ret, "flushing btree write buffer");
+	if (ret)
 		goto err;
-	}
 
 	while (!(ret = bch2_move_ratelimit(ctxt))) {
 		bch2_trans_begin(trans);
diff --git a/fs/bcachefs/movinggc.c b/fs/bcachefs/movinggc.c
index 0a0576326c5b..100525a38ee7 100644
--- a/fs/bcachefs/movinggc.c
+++ b/fs/bcachefs/movinggc.c
@@ -153,8 +153,8 @@ static int bch2_copygc_get_buckets(struct moving_context *ctxt,
 
 	move_buckets_wait(ctxt, buckets_in_flight, false);
 
-	ret = bch2_btree_write_buffer_flush(trans);
-	if (bch2_fs_fatal_err_on(ret, c, "%s: error %s from bch2_btree_write_buffer_flush()",
+	ret = bch2_btree_write_buffer_tryflush(trans);
+	if (bch2_fs_fatal_err_on(ret, c, "%s: error %s from bch2_btree_write_buffer_tryflush()",
 				 __func__, bch2_err_str(ret)))
 		return ret;
 
-- 
2.42.0



* [PATCH 15/17] bcachefs: Improve btree write buffer tracepoints
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (13 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 14/17] bcachefs: bch2_btree_write_buffer_flush() -> bch2_btree_write_buffer_tryflush() Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:31 ` [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal Kent Overstreet
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

 - add a tracepoint for write_buffer_flush_sync; this is expensive
 - fix the write_buffer_flush_slowpath tracepoint

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/btree_write_buffer.c |  4 +++-
 fs/bcachefs/trace.h              | 31 ++++++++++++++++++++++++-------
 2 files changed, 27 insertions(+), 8 deletions(-)

diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index be8308789477..74d864715b39 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -236,7 +236,7 @@ int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
 	bch2_journal_pin_drop(j, &pin);
 	return ret;
 slowpath:
-	trace_write_buffer_flush_slowpath(trans, i - keys, nr);
+	trace_write_buffer_flush_slowpath(trans, slowpath, nr);
 
 	/*
 	 * Now sort the rest by journal seq and bump the journal pin as we go.
@@ -275,6 +275,8 @@ int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_write_buffer))
 		return -BCH_ERR_erofs_no_writes;
 
+	trace_write_buffer_flush_sync(trans, _RET_IP_);
+
 	bch2_trans_unlock(trans);
 	mutex_lock(&c->btree_write_buffer.flush_lock);
 	int ret = bch2_btree_write_buffer_flush_locked(trans);
diff --git a/fs/bcachefs/trace.h b/fs/bcachefs/trace.h
index 7857671159b4..8fadf06c7dbf 100644
--- a/fs/bcachefs/trace.h
+++ b/fs/bcachefs/trace.h
@@ -1298,21 +1298,38 @@ TRACE_EVENT(write_buffer_flush,
 		  __entry->nr, __entry->size, __entry->skipped, __entry->fast)
 );
 
+TRACE_EVENT(write_buffer_flush_sync,
+	TP_PROTO(struct btree_trans *trans, unsigned long caller_ip),
+	TP_ARGS(trans, caller_ip),
+
+	TP_STRUCT__entry(
+		__array(char,			trans_fn, 32	)
+		__field(unsigned long,		caller_ip	)
+	),
+
+	TP_fast_assign(
+		strscpy(__entry->trans_fn, trans->fn, sizeof(__entry->trans_fn));
+		__entry->caller_ip		= caller_ip;
+	),
+
+	TP_printk("%s %pS", __entry->trans_fn, (void *) __entry->caller_ip)
+);
+
 TRACE_EVENT(write_buffer_flush_slowpath,
-	TP_PROTO(struct btree_trans *trans, size_t nr, size_t size),
-	TP_ARGS(trans, nr, size),
+	TP_PROTO(struct btree_trans *trans, size_t slowpath, size_t total),
+	TP_ARGS(trans, slowpath, total),
 
 	TP_STRUCT__entry(
-		__field(size_t,		nr		)
-		__field(size_t,		size		)
+		__field(size_t,		slowpath	)
+		__field(size_t,		total		)
 	),
 
 	TP_fast_assign(
-		__entry->nr	= nr;
-		__entry->size	= size;
+		__entry->slowpath	= slowpath;
+		__entry->total		= total;
 	),
 
-	TP_printk("%zu/%zu", __entry->nr, __entry->size)
+	TP_printk("%zu/%zu", __entry->slowpath, __entry->total)
 );
 
 #endif /* _TRACE_BCACHEFS_H */
-- 
2.42.0



* [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (14 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 15/17] bcachefs: Improve btree write buffer tracepoints Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-21 10:56   ` Geert Uytterhoeven
  2023-11-10 16:31 ` [PATCH 17/17] bcachefs: Inline btree write buffer sort Kent Overstreet
  2023-11-10 16:42 ` [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
  17 siblings, 1 reply; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

Previously, the transaction commit path would have to add keys to the
btree write buffer as a separate operation, requiring additional global
synchronization.

This patch introduces a new journal entry type, which indicates that the
keys need to be copied into the btree write buffer prior to being
written out. We switch the journal entry type back to
JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
change.

Flushing the btree write buffer may require pulling keys out of journal
entries yet to be written, and quiescing outstanding journal
reservations; we previously added journal->buf_lock for synchronization
with the journal write path.

We also can't put strict bounds on the number of keys in the journal
destined for the write buffer, which means we might overflow the size of
the preallocated buffer and have to reallocate - this introduces a
potentially fatal memory allocation failure. This is something we'll
have to watch for; if it becomes an issue in practice we can add
additional mitigation.

The transaction commit path no longer has to explicitly check if the
write buffer is full and wait on flushing; this is another performance
optimization. Instead, when the btree write buffer is close to full we
change the journal watermark, so that only reservations for journal
reclaim are allowed.
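
As a rough sketch of that backpressure policy - the 1/2 and 3/4
thresholds match bch2_btree_write_buffer_should_flush() and
bch2_btree_write_buffer_must_wait() in this series, but the wb_model
struct is a simplified stand-in for the real types:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Simplified stand-in for struct btree_write_buffer: */
struct wb_model {
	size_t inc_nr;		/* keys buffered, not yet being flushed */
	size_t flushing_nr;	/* keys currently being flushed */
	size_t size;		/* opts.btree_write_buffer_size */
};

/* Past half full: kick off a background flush. */
static bool wb_should_flush(const struct wb_model *wb)
{
	return wb->inc_nr + wb->flushing_nr > wb->size / 2;
}

/* Past 3/4 full: raise the journal watermark so that only journal
 * reclaim may take reservations until flushing catches up. */
static bool wb_must_wait(const struct wb_model *wb)
{
	return wb->inc_nr > wb->size * 3 / 4;
}
```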

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/bcachefs.h                 |  11 +
 fs/bcachefs/bcachefs_format.h          |   3 +-
 fs/bcachefs/btree_trans_commit.c       |  52 +--
 fs/bcachefs/btree_write_buffer.c       | 507 ++++++++++++++++---------
 fs/bcachefs/btree_write_buffer.h       |  46 ++-
 fs/bcachefs/btree_write_buffer_types.h |  37 +-
 fs/bcachefs/errcode.h                  |   1 -
 fs/bcachefs/journal.c                  |  43 +++
 fs/bcachefs/journal.h                  |   1 +
 fs/bcachefs/journal_io.c               |  51 ++-
 fs/bcachefs/journal_reclaim.c          |  12 +-
 fs/bcachefs/journal_reclaim.h          |   1 +
 fs/bcachefs/journal_types.h            |   2 +
 fs/bcachefs/super.c                    |   3 +-
 14 files changed, 512 insertions(+), 258 deletions(-)

diff --git a/fs/bcachefs/bcachefs.h b/fs/bcachefs/bcachefs.h
index 748164c0e82b..d40a6bb0e7ef 100644
--- a/fs/bcachefs/bcachefs.h
+++ b/fs/bcachefs/bcachefs.h
@@ -406,6 +406,7 @@ BCH_DEBUG_PARAMS_DEBUG()
 	x(blocked_journal_max_in_flight)	\
 	x(blocked_allocate)			\
 	x(blocked_allocate_open_bucket)		\
+	x(blocked_write_buffer_full)		\
 	x(nocow_lock_contended)
 
 enum bch_time_stats {
@@ -1065,6 +1066,16 @@ static inline void bch2_write_ref_get(struct bch_fs *c, enum bch_write_ref ref)
 #endif
 }
 
+static inline bool __bch2_write_ref_tryget(struct bch_fs *c, enum bch_write_ref ref)
+{
+#ifdef BCH_WRITE_REF_DEBUG
+	return !test_bit(BCH_FS_GOING_RO, &c->flags) &&
+		atomic_long_inc_not_zero(&c->writes[ref]);
+#else
+	return percpu_ref_tryget(&c->writes);
+#endif
+}
+
 static inline bool bch2_write_ref_tryget(struct bch_fs *c, enum bch_write_ref ref)
 {
 #ifdef BCH_WRITE_REF_DEBUG
diff --git a/fs/bcachefs/bcachefs_format.h b/fs/bcachefs/bcachefs_format.h
index 0a750953ff92..15083f9273d4 100644
--- a/fs/bcachefs/bcachefs_format.h
+++ b/fs/bcachefs/bcachefs_format.h
@@ -2124,7 +2124,8 @@ static inline __u64 __bset_magic(struct bch_sb *sb)
 	x(clock,		7)		\
 	x(dev_usage,		8)		\
 	x(log,			9)		\
-	x(overwrite,		10)
+	x(overwrite,		10)		\
+	x(write_buffer_keys,	11)
 
 enum {
 #define x(f, nr)	BCH_JSET_ENTRY_##f	= nr,
diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
index 722542316eb1..e358990f58f0 100644
--- a/fs/bcachefs/btree_trans_commit.c
+++ b/fs/bcachefs/btree_trans_commit.c
@@ -660,10 +660,6 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
 		i->k->k.needs_whiteout = false;
 	}
 
-	if (trans->nr_wb_updates &&
-	    trans->nr_wb_updates + c->btree_write_buffer.state.nr > c->btree_write_buffer.size)
-		return -BCH_ERR_btree_insert_need_flush_buffer;
-
 	/*
 	 * Don't get journal reservation until after we know insert will
 	 * succeed:
@@ -698,14 +694,6 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
 	    bch2_trans_fs_usage_apply(trans, trans->fs_usage_deltas))
 		return -BCH_ERR_btree_insert_need_mark_replicas;
 
-	if (trans->nr_wb_updates) {
-		EBUG_ON(flags & BTREE_INSERT_JOURNAL_REPLAY);
-
-		ret = bch2_btree_insert_keys_write_buffer(trans);
-		if (ret)
-			goto revert_fs_usage;
-	}
-
 	h = trans->hooks;
 	while (h) {
 		ret = h->fn(trans, h);
@@ -767,7 +755,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
 
 		trans_for_each_wb_update(trans, wb) {
 			entry = bch2_journal_add_entry(j, &trans->journal_res,
-					       BCH_JSET_ENTRY_btree_keys,
+					       BCH_JSET_ENTRY_write_buffer_keys,
 					       wb->btree, 0,
 					       wb->k.k.u64s);
 			bkey_copy((struct bkey_i *) entry->start, &wb->k);
@@ -951,30 +939,6 @@ int bch2_trans_commit_error(struct btree_trans *trans, unsigned flags,
 
 		ret = bch2_trans_relock(trans);
 		break;
-	case -BCH_ERR_btree_insert_need_flush_buffer: {
-		struct btree_write_buffer *wb = &c->btree_write_buffer;
-
-		ret = 0;
-
-		if (wb->state.nr > wb->size * 3 / 4) {
-			bch2_trans_unlock(trans);
-			mutex_lock(&wb->flush_lock);
-
-			if (wb->state.nr > wb->size * 3 / 4) {
-				bch2_trans_begin(trans);
-				ret = bch2_btree_write_buffer_flush_locked(trans);
-				mutex_unlock(&wb->flush_lock);
-				if (!ret) {
-					trace_and_count(c, trans_restart_write_buffer_flush, trans, _THIS_IP_);
-					ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_write_buffer_flush);
-				}
-			} else {
-				mutex_unlock(&wb->flush_lock);
-				ret = bch2_trans_relock(trans);
-			}
-		}
-		break;
-	}
 	default:
 		BUG_ON(ret >= 0);
 		break;
@@ -1077,20 +1041,6 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 			goto out_reset;
 	}
 
-	if (c->btree_write_buffer.state.nr > c->btree_write_buffer.size / 2 &&
-	    mutex_trylock(&c->btree_write_buffer.flush_lock)) {
-		bch2_trans_begin(trans);
-		bch2_trans_unlock(trans);
-
-		ret = bch2_btree_write_buffer_flush_locked(trans);
-		mutex_unlock(&c->btree_write_buffer.flush_lock);
-		if (!ret) {
-			trace_and_count(c, trans_restart_write_buffer_flush, trans, _THIS_IP_);
-			ret = btree_trans_restart(trans, BCH_ERR_transaction_restart_write_buffer_flush);
-		}
-		goto out;
-	}
-
 	EBUG_ON(test_bit(BCH_FS_CLEAN_SHUTDOWN, &c->flags));
 
 	trans->journal_u64s		= trans->extra_journal_entries.nr;
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 74d864715b39..e961cf33db1e 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -7,6 +7,7 @@
 #include "btree_write_buffer.h"
 #include "error.h"
 #include "journal.h"
+#include "journal_io.h"
 #include "journal_reclaim.h"
 
 #include <linux/sort.h>
@@ -14,36 +15,47 @@
 static int bch2_btree_write_buffer_journal_flush(struct journal *,
 				struct journal_entry_pin *, u64);
 
-static int btree_write_buffered_key_cmp(const void *_l, const void *_r)
+static int bch2_journal_keys_to_write_buffer(struct bch_fs *, struct journal_buf *);
+
+static inline int wb_key_cmp(const void *_l, const void *_r)
 {
-	const struct btree_write_buffered_key *l = _l;
-	const struct btree_write_buffered_key *r = _r;
+	const struct btree_write_buffered_key_ref *l = _l;
+	const struct btree_write_buffered_key_ref *r = _r;
 
 	return  cmp_int(l->btree, r->btree) ?:
-		bpos_cmp(l->k.k.p, r->k.k.p) ?:
-		cmp_int(l->journal_seq, r->journal_seq) ?:
-		cmp_int(l->journal_offset, r->journal_offset);
+		bpos_cmp(l->pos, r->pos) ?:
+		cmp_int(l->idx, r->idx);
 }
 
-static int btree_write_buffered_journal_cmp(const void *_l, const void *_r)
+static noinline int wb_flush_one_slowpath(struct btree_trans *trans,
+					  struct btree_iter *iter,
+					  struct btree_write_buffered_key *wb)
 {
-	const struct btree_write_buffered_key *l = _l;
-	const struct btree_write_buffered_key *r = _r;
+	bch2_btree_node_unlock_write(trans, iter->path, iter->path->l[0].b);
+
+	trans->journal_res.seq = wb->journal_seq;
 
-	return  cmp_int(l->journal_seq, r->journal_seq);
+	return bch2_trans_update(trans, iter, &wb->k,
+				 BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
+		bch2_trans_commit(trans, NULL, NULL,
+				  BTREE_INSERT_NOCHECK_RW|
+				  BTREE_INSERT_NOFAIL|
+				  BTREE_INSERT_JOURNAL_REPLAY|
+				  BTREE_INSERT_JOURNAL_RECLAIM);
 }
 
-static int bch2_btree_write_buffer_flush_one(struct btree_trans *trans,
-					     struct btree_iter *iter,
-					     struct btree_write_buffered_key *wb,
-					     unsigned commit_flags,
-					     bool *write_locked,
-					     size_t *fast)
+static inline int wb_flush_one(struct btree_trans *trans, struct btree_iter *iter,
+			       struct btree_write_buffered_key *wb,
+			       bool *write_locked, size_t *fast)
 {
 	struct bch_fs *c = trans->c;
 	struct btree_path *path;
 	int ret;
 
+	EBUG_ON(!wb->journal_seq);
+	EBUG_ON(!c->btree_write_buffer.flushing.pin.seq);
+	EBUG_ON(c->btree_write_buffer.flushing.pin.seq > wb->journal_seq);
+
 	ret = bch2_btree_iter_traverse(iter);
 	if (ret)
 		return ret;
@@ -66,46 +78,14 @@ static int bch2_btree_write_buffer_flush_one(struct btree_trans *trans,
 		*write_locked = true;
 	}
 
-	if (!bch2_btree_node_insert_fits(c, path->l[0].b, wb->k.k.u64s)) {
-		bch2_btree_node_unlock_write(trans, path, path->l[0].b);
+	if (unlikely(!bch2_btree_node_insert_fits(c, path->l[0].b, wb->k.k.u64s))) {
 		*write_locked = false;
-		goto trans_commit;
+		return wb_flush_one_slowpath(trans, iter, wb);
 	}
 
 	bch2_btree_insert_key_leaf(trans, path, &wb->k, wb->journal_seq);
 	(*fast)++;
 	return 0;
-trans_commit:
-	trans->journal_res.seq = wb->journal_seq;
-
-	return  bch2_trans_update(trans, iter, &wb->k,
-				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
-		bch2_trans_commit(trans, NULL, NULL,
-				  commit_flags|
-				  BTREE_INSERT_NOCHECK_RW|
-				  BTREE_INSERT_NOFAIL|
-				  BTREE_INSERT_JOURNAL_REPLAY|
-				  BTREE_INSERT_JOURNAL_RECLAIM);
-}
-
-static union btree_write_buffer_state btree_write_buffer_switch(struct btree_write_buffer *wb)
-{
-	union btree_write_buffer_state old, new;
-	u64 v = READ_ONCE(wb->state.v);
-
-	do {
-		old.v = new.v = v;
-
-		new.nr = 0;
-		new.idx++;
-	} while ((v = atomic64_cmpxchg_acquire(&wb->state.counter, old.v, new.v)) != old.v);
-
-	while (old.idx == 0 ? wb->state.ref0 : wb->state.ref1)
-		cpu_relax();
-
-	smp_mb();
-
-	return old;
 }
 
 /*
@@ -137,31 +117,76 @@ btree_write_buffered_insert(struct btree_trans *trans,
 	return ret;
 }
 
-int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
+static void move_keys_from_inc_to_flushing(struct btree_write_buffer *wb)
+{
+	struct bch_fs *c = container_of(wb, struct bch_fs, btree_write_buffer);
+	struct journal *j = &c->journal;
+
+	if (!wb->inc.keys.nr)
+		return;
+
+	bch2_journal_pin_add(j, wb->inc.keys.data[0].journal_seq, &wb->flushing.pin,
+			     bch2_btree_write_buffer_journal_flush);
+
+	darray_resize(&wb->flushing.keys, min_t(size_t, 1U << 20, wb->flushing.keys.nr + wb->inc.keys.nr));
+	darray_resize(&wb->sorted, wb->flushing.keys.size);
+
+	if (!wb->flushing.keys.nr && wb->sorted.size >= wb->inc.keys.nr) {
+		swap(wb->flushing.keys, wb->inc.keys);
+		goto out;
+	}
+
+	size_t nr = min(darray_room(wb->flushing.keys),
+			wb->sorted.size - wb->flushing.keys.nr);
+	nr = min(nr, wb->inc.keys.nr);
+
+	memcpy(&darray_top(wb->flushing.keys),
+	       wb->inc.keys.data,
+	       sizeof(wb->inc.keys.data[0]) * nr);
+
+	memmove(wb->inc.keys.data,
+		wb->inc.keys.data + nr,
+	       sizeof(wb->inc.keys.data[0]) * (wb->inc.keys.nr - nr));
+
+	wb->flushing.keys.nr	+= nr;
+	wb->inc.keys.nr		-= nr;
+out:
+	if (!wb->inc.keys.nr)
+		bch2_journal_pin_drop(j, &wb->inc.pin);
+
+	if (j->watermark) {
+		spin_lock(&j->lock);
+		bch2_journal_set_watermark(j);
+		spin_unlock(&j->lock);
+	}
+
+	BUG_ON(wb->sorted.size < wb->flushing.keys.nr);
+}
+
+static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
 {
 	struct bch_fs *c = trans->c;
 	struct journal *j = &c->journal;
 	struct btree_write_buffer *wb = &c->btree_write_buffer;
-	struct journal_entry_pin pin;
-	struct btree_write_buffered_key *i, *keys;
+	struct btree_write_buffered_key_ref *i;
 	struct btree_iter iter = { NULL };
-	size_t nr = 0, skipped = 0, fast = 0, slowpath = 0;
+	size_t skipped = 0, fast = 0, slowpath = 0;
 	bool write_locked = false;
-	union btree_write_buffer_state s;
 	int ret = 0;
 
-	memset(&pin, 0, sizeof(pin));
-
-	bch2_journal_pin_copy(j, &pin, &wb->journal_pin,
-			      bch2_btree_write_buffer_journal_flush);
-	bch2_journal_pin_drop(j, &wb->journal_pin);
+	bch2_trans_unlock(trans);
+	bch2_trans_begin(trans);
 
-	s = btree_write_buffer_switch(wb);
-	keys = wb->keys[s.idx];
-	nr = s.nr;
+	mutex_lock(&wb->inc.lock);
+	move_keys_from_inc_to_flushing(wb);
+	mutex_unlock(&wb->inc.lock);
 
-	if (race_fault())
-		goto slowpath;
+	for (size_t i = 0; i < wb->flushing.keys.nr; i++) {
+		wb->sorted.data[i].idx = i;
+		wb->sorted.data[i].btree = wb->flushing.keys.data[i].btree;
+		wb->sorted.data[i].pos = wb->flushing.keys.data[i].k.k.p;
+	}
+	wb->sorted.nr = wb->flushing.keys.nr;
 
 	/*
 	 * We first sort so that we can detect and skip redundant updates, and
@@ -177,110 +202,154 @@ int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
 	 * If that happens, simply skip the key so we can optimistically insert
 	 * as many keys as possible in the fast path.
 	 */
-	sort(keys, nr, sizeof(keys[0]),
-	     btree_write_buffered_key_cmp, NULL);
+	sort(wb->sorted.data, wb->sorted.nr,
+	     sizeof(wb->sorted.data[0]),
+	     wb_key_cmp, NULL);
+
+	darray_for_each(wb->sorted, i) {
+		struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
 
-	for (i = keys; i < keys + nr; i++) {
-		if (i + 1 < keys + nr &&
+		BUG_ON(!k->journal_seq);
+
+		if (i + 1 < &darray_top(wb->sorted) &&
 		    i[0].btree == i[1].btree &&
-		    bpos_eq(i[0].k.k.p, i[1].k.k.p)) {
+		    bpos_eq(i[0].pos, i[1].pos)) {
+			struct btree_write_buffered_key *n = &wb->flushing.keys.data[i[1].idx];
+
 			skipped++;
-			i->journal_seq = 0;
+			n->journal_seq = min(n->journal_seq, k->journal_seq);
+			k->journal_seq = 0;
 			continue;
 		}
 
 		if (write_locked &&
-		    (iter.path->btree_id != i->btree ||
-		     bpos_gt(i->k.k.p, iter.path->l[0].b->key.k.p))) {
+		    (iter.path->btree_id != k->btree ||
+		     bpos_gt(k->k.k.p, iter.path->l[0].b->key.k.p))) {
 			bch2_btree_node_unlock_write(trans, iter.path, iter.path->l[0].b);
 			write_locked = false;
 		}
 
-		if (!iter.path || iter.path->btree_id != i->btree) {
+		if (!iter.path || iter.path->btree_id != k->btree) {
 			bch2_trans_iter_exit(trans, &iter);
-			bch2_trans_iter_init(trans, &iter, i->btree, i->k.k.p,
+			bch2_trans_iter_init(trans, &iter, k->btree, k->k.k.p,
 					     BTREE_ITER_INTENT|BTREE_ITER_ALL_SNAPSHOTS);
 		}
 
-		bch2_btree_iter_set_pos(&iter, i->k.k.p);
+		bch2_btree_iter_set_pos(&iter, k->k.k.p);
 		iter.path->preserve = false;
 
 		do {
-			ret = bch2_btree_write_buffer_flush_one(trans, &iter, i, 0,
-								&write_locked, &fast);
+			if (race_fault()) {
+				ret = -BCH_ERR_journal_reclaim_would_deadlock;
+				break;
+			}
+
+			ret = wb_flush_one(trans, &iter, k, &write_locked, &fast);
 			if (!write_locked)
 				bch2_trans_begin(trans);
 		} while (bch2_err_matches(ret, BCH_ERR_transaction_restart));
 
-		if (ret == -BCH_ERR_journal_reclaim_would_deadlock) {
+		if (!ret) {
+			k->journal_seq = 0;
+		} else if (ret == -BCH_ERR_journal_reclaim_would_deadlock) {
 			slowpath++;
-			continue;
-		}
-		if (ret)
+			ret = 0;
+		} else
 			break;
-
-		i->journal_seq = 0;
 	}
 
 	if (write_locked)
 		bch2_btree_node_unlock_write(trans, iter.path, iter.path->l[0].b);
 	bch2_trans_iter_exit(trans, &iter);
 
-	trace_write_buffer_flush(trans, nr, skipped, fast, wb->size);
-
-	if (slowpath)
-		goto slowpath;
-
+	if (ret)
+		goto err;
+
+	if (slowpath) {
+		/*
+		 * Flush remaining keys in the order they appear in the journal,
+		 * so that we can release journal pins as we go:
+		 * the fastpath zapped the journal_seq of keys that were
+		 * successfully flushed, so those are skipped here.
+		 */
+		trace_write_buffer_flush_slowpath(trans, slowpath, wb->flushing.keys.nr);
+
+		struct btree_write_buffered_key *i;
+		darray_for_each(wb->flushing.keys, i) {
+			if (!i->journal_seq)
+				continue;
+
+			bch2_journal_pin_update(j, i->journal_seq, &wb->flushing.pin,
+						bch2_btree_write_buffer_journal_flush);
+
+			bch2_trans_begin(trans);
+
+			ret = commit_do(trans, NULL, NULL,
+					BCH_WATERMARK_reclaim|
+					BTREE_INSERT_NOCHECK_RW|
+					BTREE_INSERT_NOFAIL|
+					BTREE_INSERT_JOURNAL_REPLAY|
+					BTREE_INSERT_JOURNAL_RECLAIM,
+					btree_write_buffered_insert(trans, i));
+			if (ret)
+				goto err;
+		}
+	}
+err:
 	bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret));
-out:
-	bch2_journal_pin_drop(j, &pin);
+	trace_write_buffer_flush(trans, wb->flushing.keys.nr, skipped, fast, 0);
+	bch2_journal_pin_drop(j, &wb->flushing.pin);
+	wb->flushing.keys.nr = 0;
 	return ret;
-slowpath:
-	trace_write_buffer_flush_slowpath(trans, slowpath, nr);
-
-	/*
-	 * Now sort the rest by journal seq and bump the journal pin as we go.
-	 * The slowpath zapped the seq of keys that were successfully flushed so
-	 * we can skip those here.
-	 */
-	sort(keys, nr, sizeof(keys[0]),
-	     btree_write_buffered_journal_cmp,
-	     NULL);
+}
 
-	for (i = keys; i < keys + nr; i++) {
-		if (!i->journal_seq)
-			continue;
+static int fetch_wb_keys_from_journal(struct bch_fs *c, u64 seq)
+{
+	struct journal *j = &c->journal;
+	struct journal_buf *buf;
+	int ret = 0;
 
-		bch2_journal_pin_update(j, i->journal_seq, &pin,
-			      bch2_btree_write_buffer_journal_flush);
-
-		ret = commit_do(trans, NULL, NULL,
-				BCH_WATERMARK_reclaim|
-				BTREE_INSERT_NOCHECK_RW|
-				BTREE_INSERT_NOFAIL|
-				BTREE_INSERT_JOURNAL_REPLAY|
-				BTREE_INSERT_JOURNAL_RECLAIM,
-				btree_write_buffered_insert(trans, i));
-		if (bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret)))
+	mutex_lock(&j->buf_lock);
+	while ((buf = bch2_next_write_buffer_flush_journal_buf(j, seq)))
+		if (bch2_journal_keys_to_write_buffer(c, buf)) {
+			ret = -ENOMEM;
 			break;
-	}
+		}
+	mutex_unlock(&j->buf_lock);
 
-	goto out;
+	return ret;
 }
 
 int bch2_btree_write_buffer_flush_sync(struct btree_trans *trans)
 {
 	struct bch_fs *c = trans->c;
+	struct btree_write_buffer *wb = &c->btree_write_buffer;
+	int ret = 0, fetch_from_journal_err;
 
 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_write_buffer))
 		return -BCH_ERR_erofs_no_writes;
 
 	trace_write_buffer_flush_sync(trans, _RET_IP_);
-
+retry:
 	bch2_trans_unlock(trans);
-	mutex_lock(&c->btree_write_buffer.flush_lock);
-	int ret = bch2_btree_write_buffer_flush_locked(trans);
-	mutex_unlock(&c->btree_write_buffer.flush_lock);
+
+	bch2_journal_block_reservations(&c->journal);
+	fetch_from_journal_err = fetch_wb_keys_from_journal(c, U64_MAX);
+	bch2_journal_unblock(&c->journal);
+
+	/*
+	 * On memory allocation failure, bch2_btree_write_buffer_flush_locked()
+	 * is not guaranteed to empty wb->inc:
+	 */
+	mutex_lock(&wb->flushing.lock);
+	while (!ret &&
+	       (wb->flushing.keys.nr || wb->inc.keys.nr))
+		ret = bch2_btree_write_buffer_flush_locked(trans);
+	mutex_unlock(&wb->flushing.lock);
+
+	if (!ret && fetch_from_journal_err)
+		goto retry;
+
 	bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
 	return ret;
 }
@@ -294,74 +363,170 @@ int bch2_btree_write_buffer_tryflush(struct btree_trans *trans)
 	if (!bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_write_buffer))
 		return -BCH_ERR_erofs_no_writes;
 
-	if (mutex_trylock(&wb->flush_lock)) {
+	if (mutex_trylock(&wb->flushing.lock)) {
 		ret = bch2_btree_write_buffer_flush_locked(trans);
-		mutex_unlock(&wb->flush_lock);
+		mutex_unlock(&wb->flushing.lock);
 	}
 	bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
 	return ret;
 }
 
+static int bch2_btree_write_buffer_flush_all(struct bch_fs *c)
+{
+	struct btree_write_buffer *wb = &c->btree_write_buffer;
+	struct btree_trans *trans = bch2_trans_get(c);
+	int ret;
+
+	mutex_lock(&wb->flushing.lock);
+	do {
+		ret = bch2_btree_write_buffer_flush_locked(trans);
+	} while (!ret && wb->inc.keys.nr);
+	mutex_unlock(&wb->flushing.lock);
+
+	bch2_trans_put(trans);
+	return ret;
+}
+
 static int bch2_btree_write_buffer_journal_flush(struct journal *j,
 				struct journal_entry_pin *_pin, u64 seq)
 {
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
-	struct btree_write_buffer *wb = &c->btree_write_buffer;
+	int ret, fetch_from_journal_err;
 
-	mutex_lock(&wb->flush_lock);
-	int ret = bch2_trans_run(c, bch2_btree_write_buffer_flush_locked(trans));
-	mutex_unlock(&wb->flush_lock);
+	do {
+		fetch_from_journal_err = fetch_wb_keys_from_journal(c, seq);
+		ret = bch2_btree_write_buffer_flush_all(c);
+	} while (!ret && fetch_from_journal_err);
 
 	return ret;
 }
 
-static inline u64 btree_write_buffer_ref(int idx)
+static void bch2_btree_write_buffer_flush_work(struct work_struct *work)
 {
-	return ((union btree_write_buffer_state) {
-		.ref0 = idx == 0,
-		.ref1 = idx == 1,
-	}).v;
+	struct bch_fs *c = container_of(work, struct bch_fs, btree_write_buffer.flush_work);
+	struct btree_write_buffer *wb = &c->btree_write_buffer;
+	struct btree_trans *trans = bch2_trans_get(c);
+	int ret;
+
+	mutex_lock(&wb->flushing.lock);
+	do {
+		ret = bch2_btree_write_buffer_flush_locked(trans);
+	} while (!ret && bch2_btree_write_buffer_should_flush(c));
+	mutex_unlock(&wb->flushing.lock);
+
+	bch2_trans_put(trans);
+	bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
 }
 
-int bch2_btree_insert_keys_write_buffer(struct btree_trans *trans)
+int __bch2_journal_key_to_wb(struct bch_fs *c,
+			     struct journal_keys_to_wb *dst,
+			     u64 seq, enum btree_id btree, struct bkey_i *k)
 {
-	struct bch_fs *c = trans->c;
 	struct btree_write_buffer *wb = &c->btree_write_buffer;
-	struct btree_write_buffered_key *i;
-	union btree_write_buffer_state old, new;
-	int ret = 0;
-	u64 v;
-
-	trans_for_each_wb_update(trans, i) {
-		EBUG_ON(i->k.k.u64s > BTREE_WRITE_BUFERED_U64s_MAX);
+	int ret;
+retry:
+	ret = darray_make_room_gfp(&dst->wb->keys, 1, GFP_KERNEL);
+	if (!ret && dst->wb == &wb->flushing)
+		ret = darray_resize(&wb->sorted, wb->flushing.keys.size);
+
+	if (unlikely(ret)) {
+		if (dst->wb == &c->btree_write_buffer.flushing) {
+			mutex_unlock(&dst->wb->lock);
+			dst->wb = &c->btree_write_buffer.inc;
+			bch2_journal_pin_add(&c->journal, seq, &dst->wb->pin,
+					     bch2_btree_write_buffer_journal_flush);
+			goto retry;
+		}
 
-		i->journal_seq		= trans->journal_res.seq;
-		i->journal_offset	= trans->journal_res.offset;
+		return ret;
 	}
 
-	preempt_disable();
-	v = READ_ONCE(wb->state.v);
-	do {
-		old.v = new.v = v;
+	dst->room = darray_room(dst->wb->keys);
+	if (dst->wb == &wb->flushing)
+		dst->room = min(dst->room, wb->sorted.size - wb->flushing.keys.nr);
+	BUG_ON(!dst->room);
+
+	struct btree_write_buffered_key *wb_k = &darray_top(dst->wb->keys);
+	wb_k->journal_seq	= seq;
+	wb_k->btree		= btree;
+	bkey_copy(&wb_k->k, k);
+	dst->wb->keys.nr++;
+	dst->room--;
+	return 0;
+}
 
-		new.v += btree_write_buffer_ref(new.idx);
-		new.nr += trans->nr_wb_updates;
-		if (new.nr > wb->size) {
-			ret = -BCH_ERR_btree_insert_need_flush_buffer;
-			goto out;
+void bch2_journal_keys_to_write_buffer_start(struct bch_fs *c, struct journal_keys_to_wb *dst, u64 seq)
+{
+	struct btree_write_buffer *wb = &c->btree_write_buffer;
+
+	if (mutex_trylock(&wb->flushing.lock)) {
+		mutex_lock(&wb->inc.lock);
+		move_keys_from_inc_to_flushing(wb);
+
+		/*
+		 * Attempt to skip wb->inc, and add keys directly to
+		 * wb->flushing, saving us a copy later:
+		 */
+
+		if (!wb->inc.keys.nr) {
+			dst->wb = &wb->flushing;
+		} else {
+			mutex_unlock(&wb->flushing.lock);
+			dst->wb = &wb->inc;
 		}
-	} while ((v = atomic64_cmpxchg_acquire(&wb->state.counter, old.v, new.v)) != old.v);
+	} else {
+		mutex_lock(&wb->inc.lock);
+		dst->wb = &wb->inc;
+	}
 
-	memcpy(wb->keys[new.idx] + old.nr,
-	       trans->wb_updates,
-	       sizeof(trans->wb_updates[0]) * trans->nr_wb_updates);
+	dst->room = darray_room(dst->wb->keys);
+	if (dst->wb == &wb->flushing)
+		dst->room = min(dst->room, wb->sorted.size - wb->flushing.keys.nr);
 
-	bch2_journal_pin_add(&c->journal, trans->journal_res.seq, &wb->journal_pin,
+	bch2_journal_pin_add(&c->journal, seq, &dst->wb->pin,
 			     bch2_btree_write_buffer_journal_flush);
+}
+
+void bch2_journal_keys_to_write_buffer_end(struct bch_fs *c, struct journal_keys_to_wb *dst)
+{
+	struct btree_write_buffer *wb = &c->btree_write_buffer;
+
+	if (!dst->wb->keys.nr)
+		bch2_journal_pin_drop(&c->journal, &dst->wb->pin);
+
+	if (bch2_btree_write_buffer_should_flush(c) &&
+	    __bch2_write_ref_tryget(c, BCH_WRITE_REF_btree_write_buffer) &&
+	    !queue_work(system_unbound_wq, &c->btree_write_buffer.flush_work))
+		bch2_write_ref_put(c, BCH_WRITE_REF_btree_write_buffer);
+
+	if (dst->wb == &wb->flushing)
+		mutex_unlock(&wb->flushing.lock);
+	mutex_unlock(&wb->inc.lock);
+}
+
+static int bch2_journal_keys_to_write_buffer(struct bch_fs *c, struct journal_buf *buf)
+{
+	struct journal_keys_to_wb dst;
+	struct jset_entry *entry;
+	struct bkey_i *k;
+	u64 seq = le64_to_cpu(buf->data->seq);
+	int ret = 0;
+
+	bch2_journal_keys_to_write_buffer_start(c, &dst, seq);
+
+	for_each_jset_entry_type(entry, buf->data, BCH_JSET_ENTRY_write_buffer_keys) {
+		jset_entry_for_each_key(entry, k) {
+			ret = bch2_journal_key_to_wb(c, &dst, seq, entry->btree_id, k);
+			if (ret)
+				goto out;
+		}
 
-	atomic64_sub_return_release(btree_write_buffer_ref(new.idx), &wb->state.counter);
+		entry->type = BCH_JSET_ENTRY_btree_keys;
+	}
+
+	buf->need_flush_to_write_buffer = false;
 out:
-	preempt_enable();
+	bch2_journal_keys_to_write_buffer_end(c, &dst);
 	return ret;
 }
 
@@ -369,23 +534,23 @@ void bch2_fs_btree_write_buffer_exit(struct bch_fs *c)
 {
 	struct btree_write_buffer *wb = &c->btree_write_buffer;
 
-	BUG_ON(wb->state.nr && !bch2_journal_error(&c->journal));
+	BUG_ON((wb->inc.keys.nr || wb->flushing.keys.nr) &&
+	       !bch2_journal_error(&c->journal));
 
-	kvfree(wb->keys[1]);
-	kvfree(wb->keys[0]);
+	darray_exit(&wb->sorted);
+	darray_exit(&wb->flushing.keys);
+	darray_exit(&wb->inc.keys);
 }
 
 int bch2_fs_btree_write_buffer_init(struct bch_fs *c)
 {
 	struct btree_write_buffer *wb = &c->btree_write_buffer;
 
-	mutex_init(&wb->flush_lock);
-	wb->size = c->opts.btree_write_buffer_size;
+	mutex_init(&wb->inc.lock);
+	mutex_init(&wb->flushing.lock);
+	INIT_WORK(&wb->flush_work, bch2_btree_write_buffer_flush_work);
 
-	wb->keys[0] = kvmalloc_array(wb->size, sizeof(*wb->keys[0]), GFP_KERNEL);
-	wb->keys[1] = kvmalloc_array(wb->size, sizeof(*wb->keys[1]), GFP_KERNEL);
-	if (!wb->keys[0] || !wb->keys[1])
-		return -BCH_ERR_ENOMEM_fs_btree_write_buffer_init;
-
-	return 0;
+	return  darray_make_room(&wb->inc.keys, c->opts.btree_write_buffer_size) ?:
+		darray_make_room(&wb->flushing.keys, c->opts.btree_write_buffer_size) ?:
+		darray_make_room(&wb->sorted, c->opts.btree_write_buffer_size);
 }
diff --git a/fs/bcachefs/btree_write_buffer.h b/fs/bcachefs/btree_write_buffer.h
index 89d290e306e2..a28a37d75746 100644
--- a/fs/bcachefs/btree_write_buffer.h
+++ b/fs/bcachefs/btree_write_buffer.h
@@ -2,11 +2,53 @@
 #ifndef _BCACHEFS_BTREE_WRITE_BUFFER_H
 #define _BCACHEFS_BTREE_WRITE_BUFFER_H
 
-int bch2_btree_write_buffer_flush_locked(struct btree_trans *);
+#include "bkey.h"
+
+static inline bool bch2_btree_write_buffer_should_flush(struct bch_fs *c)
+{
+	struct btree_write_buffer *wb = &c->btree_write_buffer;
+
+	return wb->inc.keys.nr + wb->flushing.keys.nr > c->opts.btree_write_buffer_size / 2;
+}
+
+static inline bool bch2_btree_write_buffer_must_wait(struct bch_fs *c)
+{
+	return c->btree_write_buffer.inc.keys.nr > c->opts.btree_write_buffer_size * 3 / 4;
+}
+
+struct btree_trans;
 int bch2_btree_write_buffer_flush_sync(struct btree_trans *);
 int bch2_btree_write_buffer_tryflush(struct btree_trans *);
 
-int bch2_btree_insert_keys_write_buffer(struct btree_trans *);
+struct journal_keys_to_wb {
+	struct btree_write_buffer_keys *wb;
+	size_t room;
+};
+
+int __bch2_journal_key_to_wb(struct bch_fs *,
+			     struct journal_keys_to_wb *,
+			     u64, enum btree_id, struct bkey_i *);
+
+static inline int bch2_journal_key_to_wb(struct bch_fs *c,
+			     struct journal_keys_to_wb *dst,
+			     u64 seq, enum btree_id btree, struct bkey_i *k)
+{
+	BUG_ON(!seq);
+
+	if (unlikely(!dst->room))
+		return __bch2_journal_key_to_wb(c, dst, seq, btree, k);
+
+	struct btree_write_buffered_key *wb_k = &darray_top(dst->wb->keys);
+	wb_k->journal_seq	= seq;
+	wb_k->btree		= btree;
+	bkey_copy(&wb_k->k, k);
+	dst->wb->keys.nr++;
+	dst->room--;
+	return 0;
+}
+
+void bch2_journal_keys_to_write_buffer_start(struct bch_fs *, struct journal_keys_to_wb *, u64);
+void bch2_journal_keys_to_write_buffer_end(struct bch_fs *, struct journal_keys_to_wb *);
 
 void bch2_fs_btree_write_buffer_exit(struct bch_fs *);
 int bch2_fs_btree_write_buffer_init(struct bch_fs *);
diff --git a/fs/bcachefs/btree_write_buffer_types.h b/fs/bcachefs/btree_write_buffer_types.h
index 99993ba77aea..416790aca114 100644
--- a/fs/bcachefs/btree_write_buffer_types.h
+++ b/fs/bcachefs/btree_write_buffer_types.h
@@ -2,43 +2,36 @@
 #ifndef _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
 #define _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H
 
+#include "darray.h"
 #include "journal_types.h"
 
 #define BTREE_WRITE_BUFERED_VAL_U64s_MAX	4
 #define BTREE_WRITE_BUFERED_U64s_MAX	(BKEY_U64s + BTREE_WRITE_BUFERED_VAL_U64s_MAX)
 
+struct btree_write_buffered_key_ref {
+	unsigned		idx:20;
+	enum btree_id		btree:12;
+	struct bpos		pos;
+};
+
 struct btree_write_buffered_key {
 	u64			journal_seq;
-	unsigned		journal_offset;
 	enum btree_id		btree;
 	__BKEY_PADDED(k, BTREE_WRITE_BUFERED_VAL_U64s_MAX);
 };
 
-union btree_write_buffer_state {
-	struct {
-		atomic64_t	counter;
-	};
-
-	struct {
-		u64		v;
-	};
-
-	struct {
-		u64			nr:23;
-		u64			idx:1;
-		u64			ref0:20;
-		u64			ref1:20;
-	};
+struct btree_write_buffer_keys {
+	DARRAY(struct btree_write_buffered_key) keys;
+	struct journal_entry_pin	pin;
+	struct mutex			lock;
 };
 
 struct btree_write_buffer {
-	struct mutex			flush_lock;
-	struct journal_entry_pin	journal_pin;
-
-	union btree_write_buffer_state	state;
-	size_t				size;
+	DARRAY(struct btree_write_buffered_key_ref) sorted;
 
-	struct btree_write_buffered_key	*keys[2];
+	struct btree_write_buffer_keys	inc;
+	struct btree_write_buffer_keys	flushing;
+	struct work_struct		flush_work;
 };
 
 #endif /* _BCACHEFS_BTREE_WRITE_BUFFER_TYPES_H */
diff --git a/fs/bcachefs/errcode.h b/fs/bcachefs/errcode.h
index e5c3262cc303..e42b45293bbd 100644
--- a/fs/bcachefs/errcode.h
+++ b/fs/bcachefs/errcode.h
@@ -150,7 +150,6 @@
 	x(BCH_ERR_btree_insert_fail,	btree_insert_need_mark_replicas)	\
 	x(BCH_ERR_btree_insert_fail,	btree_insert_need_journal_res)		\
 	x(BCH_ERR_btree_insert_fail,	btree_insert_need_journal_reclaim)	\
-	x(BCH_ERR_btree_insert_fail,	btree_insert_need_flush_buffer)		\
 	x(0,				backpointer_to_overwritten_btree_node)	\
 	x(0,				lock_fail_root_changed)			\
 	x(0,				journal_reclaim_would_deadlock)		\
diff --git a/fs/bcachefs/journal.c b/fs/bcachefs/journal.c
index 5b390cb91884..8d53faa9a429 100644
--- a/fs/bcachefs/journal.c
+++ b/fs/bcachefs/journal.c
@@ -10,6 +10,7 @@
 #include "bkey_methods.h"
 #include "btree_gc.h"
 #include "btree_update.h"
+#include "btree_write_buffer.h"
 #include "buckets.h"
 #include "error.h"
 #include "journal.h"
@@ -329,6 +330,7 @@ static int journal_entry_open(struct journal *j)
 	buf->must_flush	= false;
 	buf->separate_flush = false;
 	buf->flush_time	= 0;
+	buf->need_flush_to_write_buffer = true;
 
 	memset(buf->data, 0, sizeof(*buf->data));
 	buf->data->seq	= cpu_to_le64(journal_cur_seq(j));
@@ -793,6 +795,47 @@ void bch2_journal_block_reservations(struct journal *j)
 	wait_event(j->wait, journal_reservations_stopped(j));
 }
 
+static struct journal_buf *__bch2_next_write_buffer_flush_journal_buf(struct journal *j, u64 max_seq)
+{
+	spin_lock(&j->lock);
+	max_seq = min(max_seq, journal_cur_seq(j));
+
+	for (u64 seq = journal_last_unwritten_seq(j);
+	     seq <= max_seq;
+	     seq++) {
+		unsigned idx = seq & JOURNAL_BUF_MASK;
+		struct journal_buf *buf = j->buf + idx;
+		union journal_res_state s;
+
+		if (!buf->need_flush_to_write_buffer)
+			continue;
+
+		if (seq == journal_cur_seq(j))
+			__journal_entry_close(j, JOURNAL_ENTRY_CLOSED_VAL);
+
+		s.v = atomic64_read_acquire(&j->reservations.counter);
+
+		if (journal_state_count(s, idx)) {
+			spin_unlock(&j->lock);
+			return ERR_PTR(-EAGAIN);
+		}
+
+		spin_unlock(&j->lock);
+		return buf;
+	}
+
+	spin_unlock(&j->lock);
+	return NULL;
+}
+
+struct journal_buf *bch2_next_write_buffer_flush_journal_buf(struct journal *j, u64 max_seq)
+{
+	struct journal_buf *ret;
+
+	wait_event(j->wait, (ret = __bch2_next_write_buffer_flush_journal_buf(j, max_seq)) != ERR_PTR(-EAGAIN));
+	return ret;
+}
+
 /* allocate journal on a device: */
 
 static int __bch2_set_nr_journal_buckets(struct bch_dev *ca, unsigned nr,
diff --git a/fs/bcachefs/journal.h b/fs/bcachefs/journal.h
index 310654bb74de..b5185f97af0f 100644
--- a/fs/bcachefs/journal.h
+++ b/fs/bcachefs/journal.h
@@ -428,6 +428,7 @@ static inline void bch2_journal_set_replay_done(struct journal *j)
 void bch2_journal_unblock(struct journal *);
 void bch2_journal_block(struct journal *);
 void bch2_journal_block_reservations(struct journal *);
+struct journal_buf *bch2_next_write_buffer_flush_journal_buf(struct journal *j, u64 max_seq);
 
 void __bch2_journal_debug_to_text(struct printbuf *, struct journal *);
 void bch2_journal_debug_to_text(struct printbuf *, struct journal *);
diff --git a/fs/bcachefs/journal_io.c b/fs/bcachefs/journal_io.c
index 2394d4b02fff..89852fd16d68 100644
--- a/fs/bcachefs/journal_io.c
+++ b/fs/bcachefs/journal_io.c
@@ -4,6 +4,7 @@
 #include "alloc_foreground.h"
 #include "btree_io.h"
 #include "btree_update_interior.h"
+#include "btree_write_buffer.h"
 #include "buckets.h"
 #include "checksum.h"
 #include "disk_groups.h"
@@ -713,6 +714,22 @@ static void journal_entry_overwrite_to_text(struct printbuf *out, struct bch_fs
 	journal_entry_btree_keys_to_text(out, c, entry);
 }
 
+static int journal_entry_write_buffer_keys_validate(struct bch_fs *c,
+				struct jset *jset,
+				struct jset_entry *entry,
+				unsigned version, int big_endian,
+				enum bkey_invalid_flags flags)
+{
+	return journal_entry_btree_keys_validate(c, jset, entry,
+				version, big_endian, READ);
+}
+
+static void journal_entry_write_buffer_keys_to_text(struct printbuf *out, struct bch_fs *c,
+					    struct jset_entry *entry)
+{
+	journal_entry_btree_keys_to_text(out, c, entry);
+}
+
 struct jset_entry_ops {
 	int (*validate)(struct bch_fs *, struct jset *,
 			struct jset_entry *, unsigned, int,
@@ -1687,9 +1704,11 @@ static int bch2_journal_write_prep(struct journal *j, struct journal_buf *w)
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 	struct jset_entry *start, *end, *i, *next, *prev = NULL;
 	struct jset *jset = w->data;
+	struct journal_keys_to_wb wb = { NULL };
 	unsigned sectors, bytes, u64s;
-	bool validate_before_checksum = false;
 	unsigned long btree_roots_have = 0;
+	bool validate_before_checksum = false;
+	u64 seq = le64_to_cpu(jset->seq);
 	int ret;
 
 	/*
@@ -1717,9 +1736,28 @@ static int bch2_journal_write_prep(struct journal *j, struct journal_buf *w)
 		 * to c->btree_roots we have to get any missing btree roots and
 		 * add them to this journal entry:
 		 */
-		if (i->type == BCH_JSET_ENTRY_btree_root) {
+		switch (i->type) {
+		case BCH_JSET_ENTRY_btree_root:
 			bch2_journal_entry_to_btree_root(c, i);
 			__set_bit(i->btree_id, &btree_roots_have);
+			break;
+		case BCH_JSET_ENTRY_write_buffer_keys:
+			EBUG_ON(!w->need_flush_to_write_buffer);
+
+			if (!wb.wb)
+				bch2_journal_keys_to_write_buffer_start(c, &wb, seq);
+
+			struct bkey_i *k;
+			jset_entry_for_each_key(i, k) {
+				ret = bch2_journal_key_to_wb(c, &wb, seq, i->btree_id, k);
+				if (ret) {
+					bch2_fs_fatal_error(c, "-ENOMEM flushing journal keys to btree write buffer");
+					bch2_journal_keys_to_write_buffer_end(c, &wb);
+					return ret;
+				}
+			}
+			i->type = BCH_JSET_ENTRY_btree_keys;
+			break;
 		}
 
 		/* Can we merge with previous entry? */
@@ -1742,6 +1780,10 @@ static int bch2_journal_write_prep(struct journal *j, struct journal_buf *w)
 			memmove_u64s_down(prev, i, jset_u64s(u64s));
 	}
 
+	if (wb.wb)
+		bch2_journal_keys_to_write_buffer_end(c, &wb);
+	w->need_flush_to_write_buffer = false;
+
 	prev = prev ? vstruct_next(prev) : jset->start;
 	jset->u64s = cpu_to_le32((u64 *) prev - jset->_data);
 
@@ -1749,8 +1791,7 @@ static int bch2_journal_write_prep(struct journal *j, struct journal_buf *w)
 
 	end	= bch2_btree_roots_to_journal_entries(c, end, btree_roots_have);
 
-	bch2_journal_super_entries_add_common(c, &end,
-				le64_to_cpu(jset->seq));
+	bch2_journal_super_entries_add_common(c, &end, seq);
 	u64s	= (u64 *) end - (u64 *) start;
 	BUG_ON(u64s > j->entry_u64s_reserved);
 
@@ -1773,7 +1814,7 @@ static int bch2_journal_write_prep(struct journal *j, struct journal_buf *w)
 	SET_JSET_CSUM_TYPE(jset, bch2_meta_checksum_type(c));
 
 	if (!JSET_NO_FLUSH(jset) && journal_entry_empty(jset))
-		j->last_empty_seq = le64_to_cpu(jset->seq);
+		j->last_empty_seq = seq;
 
 	if (bch2_csum_type_is_encryption(JSET_CSUM_TYPE(jset)))
 		validate_before_checksum = true;
diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
index 67684cc5c1e5..32d1255936ed 100644
--- a/fs/bcachefs/journal_reclaim.c
+++ b/fs/bcachefs/journal_reclaim.c
@@ -3,6 +3,7 @@
 #include "bcachefs.h"
 #include "btree_key_cache.h"
 #include "btree_update.h"
+#include "btree_write_buffer.h"
 #include "buckets.h"
 #include "errcode.h"
 #include "error.h"
@@ -50,20 +51,23 @@ unsigned bch2_journal_dev_buckets_available(struct journal *j,
 	return available;
 }
 
-static inline void journal_set_watermark(struct journal *j)
+void bch2_journal_set_watermark(struct journal *j)
 {
 	struct bch_fs *c = container_of(j, struct bch_fs, journal);
 	bool low_on_space = j->space[journal_space_clean].total * 4 <=
 		j->space[journal_space_total].total;
 	bool low_on_pin = fifo_free(&j->pin) < j->pin.size / 4;
-	unsigned watermark = low_on_space || low_on_pin
+	bool low_on_wb = bch2_btree_write_buffer_must_wait(c);
+	unsigned watermark = low_on_space || low_on_pin || low_on_wb
 		? BCH_WATERMARK_reclaim
 		: BCH_WATERMARK_stripe;
 
 	if (track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_space],
 			       &j->low_on_space_start, low_on_space) ||
 	    track_event_change(&c->times[BCH_TIME_blocked_journal_low_on_pin],
-			       &j->low_on_pin_start, low_on_pin))
+			       &j->low_on_pin_start, low_on_pin) ||
+	    track_event_change(&c->times[BCH_TIME_blocked_write_buffer_full],
+			       &j->write_buffer_full_start, low_on_wb))
 		trace_and_count(c, journal_full, c);
 
 	swap(watermark, j->watermark);
@@ -230,7 +234,7 @@ void bch2_journal_space_available(struct journal *j)
 	else
 		clear_bit(JOURNAL_MAY_SKIP_FLUSH, &j->flags);
 
-	journal_set_watermark(j);
+	bch2_journal_set_watermark(j);
 out:
 	j->cur_entry_sectors	= !ret ? j->space[journal_space_discarded].next_entry : 0;
 	j->cur_entry_error	= ret;
diff --git a/fs/bcachefs/journal_reclaim.h b/fs/bcachefs/journal_reclaim.h
index 7b15d682a0f5..ec84c3345281 100644
--- a/fs/bcachefs/journal_reclaim.h
+++ b/fs/bcachefs/journal_reclaim.h
@@ -16,6 +16,7 @@ static inline void journal_reclaim_kick(struct journal *j)
 unsigned bch2_journal_dev_buckets_available(struct journal *,
 					    struct journal_device *,
 					    enum journal_space_from);
+void bch2_journal_set_watermark(struct journal *);
 void bch2_journal_space_available(struct journal *);
 
 static inline bool journal_pin_active(struct journal_entry_pin *pin)
diff --git a/fs/bcachefs/journal_types.h b/fs/bcachefs/journal_types.h
index 0fed32f94976..85c543af60e5 100644
--- a/fs/bcachefs/journal_types.h
+++ b/fs/bcachefs/journal_types.h
@@ -36,6 +36,7 @@ struct journal_buf {
 	bool			noflush;	/* write has already been kicked off, and was noflush */
 	bool			must_flush;	/* something wants a flush */
 	bool			separate_flush;
+	bool			need_flush_to_write_buffer;
 };
 
 /*
@@ -277,6 +278,7 @@ struct journal {
 	u64			low_on_space_start;
 	u64			low_on_pin_start;
 	u64			max_in_flight_start;
+	u64			write_buffer_full_start;
 
 	struct bch2_time_stats	*flush_write_time;
 	struct bch2_time_stats	*noflush_write_time;
diff --git a/fs/bcachefs/super.c b/fs/bcachefs/super.c
index bb9451082e87..9af723466aab 100644
--- a/fs/bcachefs/super.c
+++ b/fs/bcachefs/super.c
@@ -314,7 +314,8 @@ void bch2_fs_read_only(struct bch_fs *c)
 		BUG_ON(c->journal.last_empty_seq != journal_cur_seq(&c->journal));
 		BUG_ON(atomic_read(&c->btree_cache.dirty));
 		BUG_ON(atomic_long_read(&c->btree_key_cache.nr_dirty));
-		BUG_ON(c->btree_write_buffer.state.nr);
+		BUG_ON(c->btree_write_buffer.inc.keys.nr);
+		BUG_ON(c->btree_write_buffer.flushing.keys.nr);
 
 		bch_verbose(c, "marking filesystem clean");
 		bch2_fs_mark_clean(c);
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* [PATCH 17/17] bcachefs: Inline btree write buffer sort
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (15 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal Kent Overstreet
@ 2023-11-10 16:31 ` Kent Overstreet
  2023-11-10 16:42 ` [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:31 UTC (permalink / raw)
  To: linux-bcachefs; +Cc: Kent Overstreet

The sort in the btree write buffer flush path is a very hot path, and
it's particularly performance sensitive since it's single threaded and
can block every other thread on a multithreaded write workload.

It's well worth doing a sort with inlined cmp and swap functions.

Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
---
 fs/bcachefs/btree_write_buffer.c | 57 +++++++++++++++++++++++++++-----
 1 file changed, 48 insertions(+), 9 deletions(-)

diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index e961cf33db1e..e482bbb767e6 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -10,23 +10,64 @@
 #include "journal_io.h"
 #include "journal_reclaim.h"
 
-#include <linux/sort.h>
-
 static int bch2_btree_write_buffer_journal_flush(struct journal *,
 				struct journal_entry_pin *, u64);
 
 static int bch2_journal_keys_to_write_buffer(struct bch_fs *, struct journal_buf *);
 
-static inline int wb_key_cmp(const void *_l, const void *_r)
+static inline int wb_key_cmp(struct btree_write_buffered_key_ref *l,
+			     struct btree_write_buffered_key_ref *r)
 {
-	const struct btree_write_buffered_key_ref *l = _l;
-	const struct btree_write_buffered_key_ref *r = _r;
-
 	return  cmp_int(l->btree, r->btree) ?:
 		bpos_cmp(l->pos, r->pos) ?:
 		cmp_int(l->idx, r->idx);
 }
 
+static void wb_sort(struct btree_write_buffered_key_ref *base, size_t num)
+{
+	size_t n = num, a = num / 2;
+
+	if (!a)		/* num < 2 || size == 0 */
+		return;
+
+	for (;;) {
+		size_t b, c, d;
+
+		if (a)			/* Building heap: sift down --a */
+			--a;
+		else if (--n)		/* Sorting: Extract root to --n */
+			swap(base[0], base[n]);
+		else			/* Sort complete */
+			break;
+
+		/*
+		 * Sift element at "a" down into heap.  This is the
+		 * "bottom-up" variant, which significantly reduces
+		 * calls to cmp_func(): we find the sift-down path all
+		 * the way to the leaves (one compare per level), then
+		 * backtrack to find where to insert the target element.
+		 *
+		 * Because elements tend to sift down close to the leaves,
+		 * this uses fewer compares than doing two per level
+		 * on the way down.  (A bit more than half as many on
+		 * average, 3/4 worst-case.)
+		 */
+		for (b = a; c = 2*b + 1, (d = c + 1) < n;)
+			b = wb_key_cmp(base + c, base + d) >= 0 ? c : d;
+		if (d == n)		/* Special case last leaf with no sibling */
+			b = c;
+
+		/* Now backtrack from "b" to the correct location for "a" */
+		while (b != a && wb_key_cmp(base + a, base + b) >= 0)
+			b = (b - 1) / 2;
+		c = b;			/* Where "a" belongs */
+		while (b != a) {	/* Shift it into place */
+			b = (b - 1) / 2;
+			swap(base[b], base[c]);
+		}
+	}
+}
+
 static noinline int wb_flush_one_slowpath(struct btree_trans *trans,
 					  struct btree_iter *iter,
 					  struct btree_write_buffered_key *wb)
@@ -202,9 +243,7 @@ static int bch2_btree_write_buffer_flush_locked(struct btree_trans *trans)
 	 * If that happens, simply skip the key so we can optimistically insert
 	 * as many keys as possible in the fast path.
 	 */
-	sort(wb->sorted.data, wb->sorted.nr,
-	     sizeof(wb->sorted.data[0]),
-	     wb_key_cmp, NULL);
+	wb_sort(wb->sorted.data, wb->sorted.nr);
 
 	darray_for_each(wb->sorted, i) {
 		struct btree_write_buffered_key *k = &wb->flushing.keys.data[i->idx];
-- 
2.42.0


^ permalink raw reply related	[flat|nested] 29+ messages in thread

* Re: [PATCH 00/17] btree write buffer & journal optimizations
  2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
                   ` (16 preceding siblings ...)
  2023-11-10 16:31 ` [PATCH 17/17] bcachefs: Inline btree write buffer sort Kent Overstreet
@ 2023-11-10 16:42 ` Kent Overstreet
  17 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-10 16:42 UTC (permalink / raw)
  To: linux-bcachefs

On Fri, Nov 10, 2023 at 11:31:37AM -0500, Kent Overstreet wrote:
> Large stack of changes - performance optimizations and improvements to
> related code.
> 
> The first big one, killing journal pre-reservations, is already merged:
> besides being a performance improvement it turned out to fix a few tests
> that were sporadically failing with "journal stuck" - nice. The idea
> with that patch is that instead of reserving space in the journal ahead
> of time (for journal reservations that must succeed or risk deadlock),
> we instead rely on watermarks and tracking which operations are
> necessary for reclaim - when we're low on space in the journal and
> flushing things for reclaim, we're able to flush things in order, or
> track when we're not flushing things in order and allow it to fail.
> 
> Most of the other patches are prep work leading up to the big btree
> write buffer rewrite, which changes the btree write buffer to pull keys
> from the journal, instead of having the transaction commit path directly
> add keys to the btree write buffer.
> 
> btree write buffer performance is pretty important for multithreaded
> update workloads - it's an Amdahl's law situation: flushing it is single
> threaded. More performance optimizations there would still be
> worthwhile, and I've been looking at it because I'm sketching out a
> rewrite of disk space accounting (which will allow for per-snapshot-ID
> accounting) that will lean heavily on the btree write buffer.

Forgot to mention performance numbers -

fio, 4k random write iops, no data IO mode, checksums off (this is my
benchmark setup for just testing core bcachefs code):

		1 job		8 jobs
before		169k		645k
after		207k		798k

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn
  2023-11-10 16:31 ` [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn Kent Overstreet
@ 2023-11-13 15:22   ` Brian Foster
  2023-11-13 16:36     ` Kent Overstreet
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2023-11-13 15:22 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Fri, Nov 10, 2023 at 11:31:40AM -0500, Kent Overstreet wrote:
> flush_fn is how we identify journal pins in debugfs - this is a
> debugging aid.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  fs/bcachefs/btree_trans_commit.c    |  9 ++++++++-
>  fs/bcachefs/btree_update_interior.c | 21 ++++++++++++++++++---
>  fs/bcachefs/btree_write_buffer.c    | 18 +++++++-----------
>  fs/bcachefs/journal_reclaim.c       | 12 +++++++-----
>  4 files changed, 40 insertions(+), 20 deletions(-)
> 
...
> diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
> index 5c398b05cb39..9e3107187e1d 100644
> --- a/fs/bcachefs/btree_write_buffer.c
> +++ b/fs/bcachefs/btree_write_buffer.c
...
> @@ -148,7 +151,8 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
>  	if (!locked && !mutex_trylock(&wb->flush_lock))
>  		return 0;
>  
> -	bch2_journal_pin_copy(j, &pin, &wb->journal_pin, NULL);
> +	bch2_journal_pin_copy(j, &pin, &wb->journal_pin,
> +			      bch2_btree_write_buffer_journal_flush);
>  	bch2_journal_pin_drop(j, &wb->journal_pin);
>  
>  	s = btree_write_buffer_switch(wb);
> @@ -250,16 +254,8 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
>  		if (!i->journal_seq)
>  			continue;
>  
> -		if (i->journal_seq > pin.seq) {
> -			struct journal_entry_pin pin2;
> -
> -			memset(&pin2, 0, sizeof(pin2));
> -
> -			bch2_journal_pin_add(j, i->journal_seq, &pin2, NULL);
> -			bch2_journal_pin_drop(j, &pin);
> -			bch2_journal_pin_copy(j, &pin, &pin2, NULL);
> -			bch2_journal_pin_drop(j, &pin2);
> -		}
> +		bch2_journal_pin_update(j, i->journal_seq, &pin,
> +			      bch2_btree_write_buffer_journal_flush);

Hmm.. I recall looking at this on the previous improvements to this
path, but I don't quite remember why I didn't make this sort of change.
The existing code implies a race (i.e., using pin2 to ensure the pin is
never fully absent from the pin list) as opposed to what _pin_update()
does (remove then add). Any idea why the existing code does what it does
and/or can you explain why this change is safe? Thanks.

Brian

>  
>  		ret = commit_do(trans, NULL, NULL,
>  				commit_flags|
> diff --git a/fs/bcachefs/journal_reclaim.c b/fs/bcachefs/journal_reclaim.c
> index 79a6fdc6e6ef..8fa05bedb7df 100644
> --- a/fs/bcachefs/journal_reclaim.c
> +++ b/fs/bcachefs/journal_reclaim.c
> @@ -378,14 +378,16 @@ static inline void bch2_journal_pin_set_locked(struct journal *j, u64 seq,
>  {
>  	struct journal_entry_pin_list *pin_list = journal_seq_pin(j, seq);
>  
> +	/*
> +	 * flush_fn is how we identify journal pins in debugfs, so must always
> +	 * exist, even if it doesn't do anything:
> +	 */
> +	BUG_ON(!flush_fn);
> +
>  	atomic_inc(&pin_list->count);
>  	pin->seq	= seq;
>  	pin->flush	= flush_fn;
> -
> -	if (flush_fn)
> -		list_add(&pin->list, &pin_list->list[type]);
> -	else
> -		list_add(&pin->list, &pin_list->flushed);
> +	list_add(&pin->list, &pin_list->list[type]);
>  }
>  
>  void bch2_journal_pin_copy(struct journal *j,
> -- 
> 2.42.0
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL
  2023-11-10 16:31 ` [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL Kent Overstreet
@ 2023-11-13 15:29   ` Brian Foster
  2023-11-13 16:49     ` Kent Overstreet
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2023-11-13 15:29 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Fri, Nov 10, 2023 at 11:31:42AM -0500, Kent Overstreet wrote:
> With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
> now switch the btree write buffer to use it for flushing.
> 
> This has the advantage that transaction commits don't need to take a
> journal reservation at all.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---
>  fs/bcachefs/bkey_methods.h       |  2 --
>  fs/bcachefs/btree_trans_commit.c |  7 +------
>  fs/bcachefs/btree_types.h        |  1 -
>  fs/bcachefs/btree_update.c       | 23 -----------------------
>  fs/bcachefs/btree_write_buffer.c | 14 ++++++++++----
>  5 files changed, 11 insertions(+), 36 deletions(-)
> 
...
> diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
> index ec90a06a6cf9..f231f01072c2 100644
> --- a/fs/bcachefs/btree_trans_commit.c
> +++ b/fs/bcachefs/btree_trans_commit.c
> @@ -779,12 +779,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
>  
>  	trans_for_each_update(trans, i) {
>  		if (!i->cached) {
> -			u64 seq = trans->journal_res.seq;
> -
> -			if (i->flags & BTREE_UPDATE_PREJOURNAL)
> -				seq = i->seq;
> -
> -			bch2_btree_insert_key_leaf(trans, i->path, i->k, seq);
> +			bch2_btree_insert_key_leaf(trans, i->path, i->k, trans->journal_res.seq);

Ok, so instead of passing the seq to the commit path via the insert
entry, we use a flag that enables a means to pass journal_res.seq
straight through to the commit. That seems reasonable to me.

One subtle thing that comes to mind is that the existing mechanism
tracks a seq per key update whereas this looks like it associates the
seq to the transaction and then to every key update. That's how it's
used today AFAICS so doesn't seem like a big deal, but what happens if
this is misused in the future? Does anything prevent having multiple
keys from different journal seqs in the same transaction leading to
pinning the wrong seq for some subset of keys? If not, it would be nice
to have some kind of check or something somewhere to fail an update for
a trans that might already have a pre journaled key.

>  		} else if (!i->key_cache_already_flushed)
>  			bch2_btree_insert_key_cached(trans, flags, i);
>  		else {
...
> diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
> index 9e3107187e1d..f40ac365620f 100644
> --- a/fs/bcachefs/btree_write_buffer.c
> +++ b/fs/bcachefs/btree_write_buffer.c
> @@ -76,12 +76,15 @@ static int bch2_btree_write_buffer_flush_one(struct btree_trans *trans,
>  	(*fast)++;
>  	return 0;
>  trans_commit:
> -	return  bch2_trans_update_seq(trans, wb->journal_seq, iter, &wb->k,
> -				      BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
> +	trans->journal_res.seq = wb->journal_seq;
> +
> +	return  bch2_trans_update(trans, iter, &wb->k,
> +				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
>  		bch2_trans_commit(trans, NULL, NULL,
>  				  commit_flags|
>  				  BTREE_INSERT_NOCHECK_RW|
>  				  BTREE_INSERT_NOFAIL|
> +				  BTREE_INSERT_JOURNAL_REPLAY|
>  				  BTREE_INSERT_JOURNAL_RECLAIM);

This is more of a nit for now, but I find the general use of a flag with
a contextual name unnecessarily confusing. I.e., the flag implies we're
doing journal replay, which we're not, and so makes the code confusing
to somebody who doesn't have the historical development context. Could
we rename or repurpose this to better reflect the functional purpose of
not acquiring a reservation (and let journal replay also use it)? I can
look into that as a followon change if you want to make suggestions or
share any thoughts..

But as a related example, do we care about how this flag modifies
invalid key checks (via __bch2_trans_commit()) for example?

Brian

>  }
>  
> @@ -125,9 +128,11 @@ btree_write_buffered_insert(struct btree_trans *trans,
>  	bch2_trans_iter_init(trans, &iter, wb->btree, bkey_start_pos(&wb->k.k),
>  			     BTREE_ITER_CACHED|BTREE_ITER_INTENT);
>  
> +	trans->journal_res.seq = wb->journal_seq;
> +
>  	ret   = bch2_btree_iter_traverse(&iter) ?:
> -		bch2_trans_update_seq(trans, wb->journal_seq, &iter, &wb->k,
> -				      BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE);
> +		bch2_trans_update(trans, &iter, &wb->k,
> +				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE);
>  	bch2_trans_iter_exit(trans, &iter);
>  	return ret;
>  }
> @@ -260,6 +265,7 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
>  		ret = commit_do(trans, NULL, NULL,
>  				commit_flags|
>  				BTREE_INSERT_NOFAIL|
> +				BTREE_INSERT_JOURNAL_REPLAY|
>  				BTREE_INSERT_JOURNAL_RECLAIM,
>  				btree_write_buffered_insert(trans, i));
>  		if (bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret)))
> -- 
> 2.42.0
> 
> 


^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn
  2023-11-13 15:22   ` Brian Foster
@ 2023-11-13 16:36     ` Kent Overstreet
  2023-11-13 17:08       ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Kent Overstreet @ 2023-11-13 16:36 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Nov 13, 2023 at 10:22:13AM -0500, Brian Foster wrote:
> On Fri, Nov 10, 2023 at 11:31:40AM -0500, Kent Overstreet wrote:
> > -		if (i->journal_seq > pin.seq) {
> > -			struct journal_entry_pin pin2;
> > -
> > -			memset(&pin2, 0, sizeof(pin2));
> > -
> > -			bch2_journal_pin_add(j, i->journal_seq, &pin2, NULL);
> > -			bch2_journal_pin_drop(j, &pin);
> > -			bch2_journal_pin_copy(j, &pin, &pin2, NULL);
> > -			bch2_journal_pin_drop(j, &pin2);
> > -		}
> > +		bch2_journal_pin_update(j, i->journal_seq, &pin,
> > +			      bch2_btree_write_buffer_journal_flush);
> 
> Hmm.. I recall looking at this on the previous improvements to this
> path, but I don't quite remember why I didn't make this sort of change.
> The existing code implies a race (i.e., using pin2 to ensure the pin is
> never fully absent from the pin list) as opposed to what _pin_update()
> does (remove then add). Any idea why the existing code does what it does
> and/or can you explain why this change is safe? Thanks.

Perhaps you missed it because journal_pin_update() could be named
better? journal_pin_add() and journal_pin_update() have opposite
behaviour when asked to overwrite an existing pin; I would like names
that make that more explicit.

But I don't see any races possible here: journal_pin_update() first
checks pin->seq, and that's entirely under the control of this thread so
we're fine, and if it overwrites an existing pin the drop and the set
are done together under the journal lock.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL
  2023-11-13 15:29   ` Brian Foster
@ 2023-11-13 16:49     ` Kent Overstreet
  2023-11-14 13:17       ` Brian Foster
  0 siblings, 1 reply; 29+ messages in thread
From: Kent Overstreet @ 2023-11-13 16:49 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Mon, Nov 13, 2023 at 10:29:34AM -0500, Brian Foster wrote:
> On Fri, Nov 10, 2023 at 11:31:42AM -0500, Kent Overstreet wrote:
> > With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
> > now switch the btree write buffer to use it for flushing.
> > 
> > This has the advantage that transaction commits don't need to take a
> > journal reservation at all.
> > 
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > ---
> >  fs/bcachefs/bkey_methods.h       |  2 --
> >  fs/bcachefs/btree_trans_commit.c |  7 +------
> >  fs/bcachefs/btree_types.h        |  1 -
> >  fs/bcachefs/btree_update.c       | 23 -----------------------
> >  fs/bcachefs/btree_write_buffer.c | 14 ++++++++++----
> >  5 files changed, 11 insertions(+), 36 deletions(-)
> > 
> ...
> > diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
> > index ec90a06a6cf9..f231f01072c2 100644
> > --- a/fs/bcachefs/btree_trans_commit.c
> > +++ b/fs/bcachefs/btree_trans_commit.c
> > @@ -779,12 +779,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
> >  
> >  	trans_for_each_update(trans, i) {
> >  		if (!i->cached) {
> > -			u64 seq = trans->journal_res.seq;
> > -
> > -			if (i->flags & BTREE_UPDATE_PREJOURNAL)
> > -				seq = i->seq;
> > -
> > -			bch2_btree_insert_key_leaf(trans, i->path, i->k, seq);
> > +			bch2_btree_insert_key_leaf(trans, i->path, i->k, trans->journal_res.seq);
> 
> Ok, so instead of passing the seq to the commit path via the insert
> entry, we use a flag that enables a means to pass journal_res.seq
> straight through to the commit. That seems reasonable to me.
> 
> One subtle thing that comes to mind is that the existing mechanism
> tracks a seq per key update whereas this looks like it associates the
> seq to the transaction and then to every key update. That's how it's
> used today AFAICS so doesn't seem like a big deal, but what happens if
> this is misused in the future? Does anything prevent having multiple
> keys from different journal seqs in the same transaction leading to
> pinning the wrong seq for some subset of keys? If not, it would be nice
> to have some kind of check or something somewhere to fail an update for
> a trans that might already have a pre journaled key.

Well, anything to do with taking a journal reservation or a journal pin
really does go with the transaction commit, not an individual update -
if we were doing multiple updates that went with different journal
entries, it wouldn't really be a transaction at that point.

> > -	return  bch2_trans_update_seq(trans, wb->journal_seq, iter, &wb->k,
> > -				      BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
> > +	trans->journal_res.seq = wb->journal_seq;
> > +
> > +	return  bch2_trans_update(trans, iter, &wb->k,
> > +				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
> >  		bch2_trans_commit(trans, NULL, NULL,
> >  				  commit_flags|
> >  				  BTREE_INSERT_NOCHECK_RW|
> >  				  BTREE_INSERT_NOFAIL|
> > +				  BTREE_INSERT_JOURNAL_REPLAY|
> >  				  BTREE_INSERT_JOURNAL_RECLAIM);
>
> This is more of a nit for now, but I find the general use of a flag with
> a contextual name unnecessarily confusing. I.e., the flag implies we're
> doing journal replay, which we're not, and so makes the code confusing
> to somebody who doesn't have the historical development context. Could
> we rename or repurpose this to better reflect the functional purpose of
> not acquiring a reservation (and let journal replay also use it)? I can
> look into that as a followon change if you want to make suggestions or
> share any thoughts..

Agreed - I didn't post this patch to the list because it was a simpler
one, but:
https://evilpiepirate.org/git/bcachefs.git/commit/?id=ed1e075104f3768375e1e8d3efcf87d9641029a7

(renaming all the BTREE_INSERT flags to BCH_TRANS_COMMIT, and with
better names).

> But as a related example, do we care about how this flag modifies
> invalid key checks (via __bch2_trans_commit()) for example?

There are key invalid checks that we want to run whenever doing new
updates (because it's good to catch inconsistencies as soon as
possible), but that we can't run on existing keys - possibly because the
check didn't exist in the past, and so could cause us to throw away
existing metadata.

The snapshot skiplist checks are a good example. There were (multiple)
bugs where we'd delete a snapshot tree node and end up with snapshot
keys with bad skiplist pointers. Now, we can detect and repair this
during fsck, but it's much better if we can catch this right away, when
writing the new snapshot key; i.e. if a skiplist node points to an id
smaller than the parent, then it's obviously bad.

But, since that check didn't exist when skiplist nodes were first
introduced we can't run it when checking existing snapshot nodes,
otherwise a filesystem with corrupt skiplist nodes would have
those snapshot keys deleted entirely instead of letting fsck repair just
the skiplist entries.

.key_invalid()s cause those keys to be deleted because those checks run
e.g. every time we read a btree node in from disk - we're very limited in
what kinds of updates we can do then.

^ permalink raw reply	[flat|nested] 29+ messages in thread

* Re: [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn
  2023-11-13 16:36     ` Kent Overstreet
@ 2023-11-13 17:08       ` Brian Foster
  0 siblings, 0 replies; 29+ messages in thread
From: Brian Foster @ 2023-11-13 17:08 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Nov 13, 2023 at 11:36:23AM -0500, Kent Overstreet wrote:
> On Mon, Nov 13, 2023 at 10:22:13AM -0500, Brian Foster wrote:
> > On Fri, Nov 10, 2023 at 11:31:40AM -0500, Kent Overstreet wrote:
> > > -		if (i->journal_seq > pin.seq) {
> > > -			struct journal_entry_pin pin2;
> > > -
> > > -			memset(&pin2, 0, sizeof(pin2));
> > > -
> > > -			bch2_journal_pin_add(j, i->journal_seq, &pin2, NULL);
> > > -			bch2_journal_pin_drop(j, &pin);
> > > -			bch2_journal_pin_copy(j, &pin, &pin2, NULL);
> > > -			bch2_journal_pin_drop(j, &pin2);
> > > -		}
> > > +		bch2_journal_pin_update(j, i->journal_seq, &pin,
> > > +			      bch2_btree_write_buffer_journal_flush);
> > 
> > Hmm.. I recall looking at this on the previous improvements to this
> > path, but I don't quite remember why I didn't make this sort of change.
> > The existing code implies a race (i.e., using pin2 to ensure the pin is
> > never fully absent from the pin list) as opposed to what _pin_update()
> > does (remove then add). Any idea why the existing code does what it does
> > and/or can you explain why this change is safe? Thanks.
> 
> Perhaps you missed it because journal_pin_update() could be named
> better? journal_pin_add() and journal_pin_update() have opposite
> behaviour when asked to overwrite an existing pin; I would like names
> that make that more explicit.
> 

Quite possible..

> But I don't see any races possible here: journal_pin_update() first
> checks pin->seq, and that's entirely under the control of this thread so
> we're fine, and if it overwrites an existing pin the drop and the set
> are done together under the journal lock.
> 

Yeah, I don't mean to imply there actually is a race. I just mean that
it looks like the old code took care to add the new pin before removing
the old for some reason, and I'm just curious what that reason is. ;)

Brian



* Re: [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL
  2023-11-13 16:49     ` Kent Overstreet
@ 2023-11-14 13:17       ` Brian Foster
  0 siblings, 0 replies; 29+ messages in thread
From: Brian Foster @ 2023-11-14 13:17 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Mon, Nov 13, 2023 at 11:49:10AM -0500, Kent Overstreet wrote:
> On Mon, Nov 13, 2023 at 10:29:34AM -0500, Brian Foster wrote:
> > On Fri, Nov 10, 2023 at 11:31:42AM -0500, Kent Overstreet wrote:
> > > With the previous patch that reworks BTREE_INSERT_JOURNAL_REPLAY, we can
> > > now switch the btree write buffer to use it for flushing.
> > > 
> > > This has the advantage that transaction commits don't need to take a
> > > journal reservation at all.
> > > 
> > > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> > > ---
> > >  fs/bcachefs/bkey_methods.h       |  2 --
> > >  fs/bcachefs/btree_trans_commit.c |  7 +------
> > >  fs/bcachefs/btree_types.h        |  1 -
> > >  fs/bcachefs/btree_update.c       | 23 -----------------------
> > >  fs/bcachefs/btree_write_buffer.c | 14 ++++++++++----
> > >  5 files changed, 11 insertions(+), 36 deletions(-)
> > > 
> > ...
> > > diff --git a/fs/bcachefs/btree_trans_commit.c b/fs/bcachefs/btree_trans_commit.c
> > > index ec90a06a6cf9..f231f01072c2 100644
> > > --- a/fs/bcachefs/btree_trans_commit.c
> > > +++ b/fs/bcachefs/btree_trans_commit.c
> > > @@ -779,12 +779,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
> > >  
> > >  	trans_for_each_update(trans, i) {
> > >  		if (!i->cached) {
> > > -			u64 seq = trans->journal_res.seq;
> > > -
> > > -			if (i->flags & BTREE_UPDATE_PREJOURNAL)
> > > -				seq = i->seq;
> > > -
> > > -			bch2_btree_insert_key_leaf(trans, i->path, i->k, seq);
> > > +			bch2_btree_insert_key_leaf(trans, i->path, i->k, trans->journal_res.seq);
> > 
> > Ok, so instead of passing the seq to the commit path via the insert
> > entry, we use a flag that enables a means to pass journal_res.seq
> > straight through to the commit. That seems reasonable to me.
> > 
> > One subtle thing that comes to mind is that the existing mechanism
> > tracks a seq per key update whereas this looks like it associates the
> > seq to the transaction and then to every key update. That's how it's
> > used today AFAICS so doesn't seem like a big deal, but what happens if
> > this is misused in the future? Does anything prevent having multiple
> > keys from different journal seqs in the same transaction leading to
> > pinning the wrong seq for some subset of keys? If not, it would be nice
> > to have some kind of check or something somewhere to fail an update for
> > a trans that might already have a pre journaled key.
> 
> Well, anything to do with taking a journal reservation or a journal pin
> really does go with the transaction commit, not an individual update -
> if we were doing multiple updates that went with different journal
> updates it wouldn't really be a transaction at that point.
> 

Yep that makes complete sense. I'm just lamenting that it seems a little
easy to misuse. I think this is partly due to the disconnected nature of
setting journal_res.seq and passing INSERT_JOURNAL_REPLAY at commit
time.

If the no_res mode were a fundamental transaction state, that could
potentially facilitate error checks that could help prevent adding keys
from multiple journal seqs to the same transaction. For example,
consider something like the original update helper that took the seq as
a param, but set some state on the transaction such that any further
updates could be validated to require the same seq (knowing that the
transaction is now "nores").

It's relatively minor in theory but in practice that could be the
difference between identifying a problem via runtime error vs. possibly
missing the bug and spending a bunch of time triaging an out of order
log recovery bug or whatever might occur as a result. Anyways, this is
mainly just a thought around possibly making things a bit more (noob)
developer friendly.

> > > -	return  bch2_trans_update_seq(trans, wb->journal_seq, iter, &wb->k,
> > > -				      BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
> > > +	trans->journal_res.seq = wb->journal_seq;
> > > +
> > > +	return  bch2_trans_update(trans, iter, &wb->k,
> > > +				  BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE) ?:
> > >  		bch2_trans_commit(trans, NULL, NULL,
> > >  				  commit_flags|
> > >  				  BTREE_INSERT_NOCHECK_RW|
> > >  				  BTREE_INSERT_NOFAIL|
> > > +				  BTREE_INSERT_JOURNAL_REPLAY|
> > >  				  BTREE_INSERT_JOURNAL_RECLAIM);
> >
> > This is more of a nit for now, but I find the general use of a flag with
> > a contextual name unnecessarily confusing. I.e., the flag implies we're
> > doing journal replay, which we're not, and so makes the code confusing
> > to somebody who doesn't have the historical development context. Could
> > we rename or repurpose this to better reflect the functional purpose of
> > not acquiring a reservation (and let journal replay also use it)? I can
> > look into that as a followon change if you want to make suggestions or
> > share any thoughts..
> 
> Agreed - I didn't post this patch to the list because it was a simpler
> one, but:
> https://evilpiepirate.org/git/bcachefs.git/commit/?id=ed1e075104f3768375e1e8d3efcf87d9641029a7
> 
> (renaming all the BTREE_INSERT flags to BCH_TRANS_COMMIT, and with
> better names).
> 

Yeah, that is much better at a quick glance. Thanks.

> > But as a related example, do we care about how this flag modifies
> > invalid key checks (via __bch2_trans_commit()) for example?
> 
> There are key invalid checks that we want to run whenever doing new
> updates (because it's good to check inconsistencies as soon as
> possible), but that we can't run on existing keys - possibly because the
> check didn't exist in the past, and so could cause us to throw away
> existing metadata.
> 

Ok, thanks for the context. My question as it relates to this patch is
consider the following hunk in the flag rename patch above:

@@ -1048,7 +1048,7 @@ int __bch2_trans_commit(struct btree_trans *trans, unsigned flags)
 		struct printbuf buf = PRINTBUF;
 		enum bkey_invalid_flags invalid_flags = 0;
 
-		if (!(flags & BTREE_INSERT_JOURNAL_REPLAY))
+		if (!(flags & BCH_TRANS_COMMIT_no_journal_res))
 			invalid_flags |= BKEY_INVALID_WRITE|BKEY_INVALID_COMMIT;
 
 		if (unlikely(bch2_bkey_invalid(c, bkey_i_to_s_c(i->k),

We've effectively removed these invalid_flags for the write buffer key
flush path by virtue of turning on no_journal_res there. Should those
flags be elided because there is no journal res, or are they more
associated with journal replay context (which happens to not use journal
res)?

Brian

> The snapshot skiplist checks are a good example. There were (multiple)
> bugs where we'd delete a snapshot tree node and end up with snapshot
> keys with bad skiplist pointers. Now, we can detect and repair this
> during fsck, but it's much better if we can catch this right away, when
> writing the new snapshot key; i.e. if a skiplist node points to an id
> smaller than the parent, then it's obviously bad.
> 
> But, since that check didn't exist when skiplist nodes were first
> introduced, we can't run it when checking existing snapshot nodes;
> otherwise a filesystem with corrupt skiplist nodes would have those
> snapshot keys deleted entirely instead of letting fsck repair just
> the skiplist entries.
> 
> .key_invalid()s cause those keys to be deleted because those checks run
> e.g. every time we read a btree node in from disk - we're very limited in
> what kinds of updates we can do then.
> 



* Re: [PATCH 07/17] bcachefs: Make journal replay more efficient
  2023-11-10 16:31 ` [PATCH 07/17] bcachefs: Make journal replay more efficient Kent Overstreet
@ 2023-11-14 13:19   ` Brian Foster
  2023-11-15  1:50     ` Kent Overstreet
  0 siblings, 1 reply; 29+ messages in thread
From: Brian Foster @ 2023-11-14 13:19 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

On Fri, Nov 10, 2023 at 11:31:44AM -0500, Kent Overstreet wrote:
> Journal replay now first attempts to replay keys in sorted order,
> similar to how the btree write buffer flush path works.
> 
> Any keys that can not be replayed due to journal deadlock are then left
> for later and replayed in journal order, unpinning journal entries as we
> go.
> 
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> ---

Just a question and a nit..

>  fs/bcachefs/errcode.h  |  1 -
>  fs/bcachefs/recovery.c | 84 ++++++++++++++++++++++++++++--------------
>  2 files changed, 56 insertions(+), 29 deletions(-)
> 
...
> diff --git a/fs/bcachefs/recovery.c b/fs/bcachefs/recovery.c
> index 6eafff2557a4..3524d1e0003e 100644
> --- a/fs/bcachefs/recovery.c
> +++ b/fs/bcachefs/recovery.c
...
> @@ -166,26 +159,61 @@ static int bch2_journal_replay(struct bch_fs *c)
>  			goto err;
>  	}
>  
> -	for (i = 0; i < keys->nr; i++) {
> -		k = keys_sorted[i];
> +	/*
> +	 * First, attempt to replay keys in sorted order. This is more
> +	 * efficient, but some might fail if that would cause a journal
> +	 * deadlock.
> +	 */

What does "sorted order" mean here vs. when we sort by journal seq
below?

> +	for (size_t i = 0; i < keys->nr; i++) {
> +		cond_resched();
> +
> +		struct journal_key *k = keys->d + i;
> +
> +		ret = commit_do(trans, NULL, NULL,
> +				BTREE_INSERT_NOFAIL|
> +				BTREE_INSERT_JOURNAL_RECLAIM|
> +				(!k->allocated ? BTREE_INSERT_JOURNAL_REPLAY : 0),
> +			     bch2_journal_replay_key(trans, k));
> +		BUG_ON(!ret && !k->overwritten);
> +		if (ret) {
> +			ret = darray_push(&keys_sorted, k);
> +			if (ret)
> +				goto err;
> +		}
> +	}
> +
> +	/*
> +	 * Now, replay any remaining keys in the order in which they appear in
> +	 * the journal, unpinning those journal entries as we go:
> +	 */
> +	sort(keys_sorted.data, keys_sorted.nr,
> +	     sizeof(keys_sorted.data[0]),
> +	     journal_sort_seq_cmp, NULL);
>  
> +	darray_for_each(keys_sorted, kp) {
>  		cond_resched();
>  
> +		struct journal_key *k = *kp;
> +
>  		replay_now_at(j, k->journal_seq);
>  
> -		ret = bch2_trans_do(c, NULL, NULL,
> -				    BTREE_INSERT_NOFAIL|
> -				    (!k->allocated
> -				     ? BTREE_INSERT_JOURNAL_REPLAY|BCH_WATERMARK_reclaim
> -				     : 0),
> +		ret = commit_do(trans, NULL, NULL,
> +				BTREE_INSERT_NOFAIL|
> +				(!k->allocated
> +				 ? BTREE_INSERT_JOURNAL_REPLAY|BCH_WATERMARK_reclaim
> +				 : 0),
>  			     bch2_journal_replay_key(trans, k));
> -		if (ret) {
> -			bch_err(c, "journal replay: error while replaying key at btree %s level %u: %s",
> -				bch2_btree_id_str(k->btree_id), k->level, bch2_err_str(ret));
> +		bch_err_msg(c, ret, "while replaying key at btree %s level %u:",
> +			    bch2_btree_id_str(k->btree_id), k->level);
> +		if (ret)
>  			goto err;
> -		}
> +
> +		BUG_ON(!k->overwritten);
>  	}
>  
> +	bch2_trans_put(trans);
> +	trans = NULL;
> +

Nit: This looks spurious w/ the exit path, just FWIW.

Brian

>  	replay_now_at(j, j->replay_journal_seq_end);
>  	j->replay_journal_seq = 0;
>  
> @@ -196,10 +224,10 @@ static int bch2_journal_replay(struct bch_fs *c)
>  	if (keys->nr && !ret)
>  		bch2_journal_log_msg(c, "journal replay finished");
>  err:
> -	kvfree(keys_sorted);
> -
> -	if (ret)
> -		bch_err_fn(c, ret);
> +	if (trans)
> +		bch2_trans_put(trans);
> +	darray_exit(&keys_sorted);
> +	bch_err_fn(c, ret);
>  	return ret;
>  }
>  
> -- 
> 2.42.0
> 
> 



* Re: [PATCH 07/17] bcachefs: Make journal replay more efficient
  2023-11-14 13:19   ` Brian Foster
@ 2023-11-15  1:50     ` Kent Overstreet
  0 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-15  1:50 UTC (permalink / raw)
  To: Brian Foster; +Cc: linux-bcachefs

On Tue, Nov 14, 2023 at 08:19:04AM -0500, Brian Foster wrote:
> On Fri, Nov 10, 2023 at 11:31:44AM -0500, Kent Overstreet wrote:
> > +	/*
> > +	 * First, attempt to replay keys in sorted order. This is more
> > +	 * efficient, but some might fail if that would cause a journal
> > +	 * deadlock.
> > +	 */
> 
> What does "sorted order" mean here vs. when we sort by journal seq
> below?

Btree key order - so better locality in the btree.

> > +	bch2_trans_put(trans);
> > +	trans = NULL;
> > +
> 
> Nit: This looks spurious w/ the exit path, just FWIW.

It's because flush_all_pins() is going to internally use a btree_trans,
and we can't have two btree_trans in the same thread - I'll add a
comment.


* Re: [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal
  2023-11-10 16:31 ` [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal Kent Overstreet
@ 2023-11-21 10:56   ` Geert Uytterhoeven
  2023-11-21 16:52     ` Kent Overstreet
  0 siblings, 1 reply; 29+ messages in thread
From: Geert Uytterhoeven @ 2023-11-21 10:56 UTC (permalink / raw)
  To: Kent Overstreet; +Cc: linux-bcachefs

 	Hi Kent,

On Fri, 10 Nov 2023, Kent Overstreet wrote:
> Previously, the transaction commit path would have to add keys to the
> btree write buffer as a separate operation, requiring additional global
> synchronization.
>
> This patch introduces a new journal entry type, which indicates that the
> keys need to be copied into the btree write buffer prior to being
> written out. We switch the journal entry type back to
> JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
> change.
>
> Flushing the btree write buffer may require pulling keys out of journal
> entries yet to be written, and quiescing outstanding journal
> reservations; we previously added journal->buf_lock for synchronization
> with the journal write path.
>
> We also can't put strict bounds on the number of keys in the journal
> destined for the write buffer, which means we might overflow the size of
> the preallocated buffer and have to reallocate - this introduces a
> potentially fatal memory allocation failure. This is something we'll
> have to watch for, if it becomes an issue in practice we can do
> additional mitigation.
>
> The transaction commit path no longer has to explicitly check if the
> write buffer is full and wait on flushing; this is another performance
> optimization. Instead, when the btree write buffer is close to full we
> change the journal watermark, so that only reservations for journal
> reclaim are allowed.
>
> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>

Thanks for your patch, which is the closest match to commit
9e5a6c7797b240f1 ("bcachefs: btree write buffer now slurps keys from
journal") in next-20231121.

> fs/bcachefs/btree_write_buffer.c       | 507 ++++++++++++++++---------

The committed version has lots of changes, including:

++              for (struct wb_key_ref *n = i + 1; n < min(i + 4, &darray_top(wb->sorted)); n++)
++                      prefetch(&wb->flushing.keys.data[n->idx]);

which requires the inclusion of <linux/prefetch.h>, as reported by
kernel test robot <lkp@intel.com> for arc[1], and by
noreply@ellerman.id.au for m68k.

[1]
https://lore.kernel.org/all/202311211650.4gRNvqfL-lkp@intel.com

Gr{oetje,eeting}s,

 						Geert

--
Geert Uytterhoeven -- There's lots of Linux beyond ia32 -- geert@linux-m68k.org

In personal conversations with technical people, I call myself a hacker. But
when I'm talking to journalists I just say "programmer" or something like that.
 							    -- Linus Torvalds


* Re: [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal
  2023-11-21 10:56   ` Geert Uytterhoeven
@ 2023-11-21 16:52     ` Kent Overstreet
  0 siblings, 0 replies; 29+ messages in thread
From: Kent Overstreet @ 2023-11-21 16:52 UTC (permalink / raw)
  To: Geert Uytterhoeven; +Cc: linux-bcachefs

On Tue, Nov 21, 2023 at 11:56:49AM +0100, Geert Uytterhoeven wrote:
> 	Hi Kent,
> 
> On Fri, 10 Nov 2023, Kent Overstreet wrote:
> > Previously, the transaction commit path would have to add keys to the
> > btree write buffer as a separate operation, requiring additional global
> > synchronization.
> > 
> > This patch introduces a new journal entry type, which indicates that the
> > keys need to be copied into the btree write buffer prior to being
> > written out. We switch the journal entry type back to
> > JSET_ENTRY_btree_keys prior to write, so this is not an on disk format
> > change.
> > 
> > Flushing the btree write buffer may require pulling keys out of journal
> > entries yet to be written, and quiescing outstanding journal
> > reservations; we previously added journal->buf_lock for synchronization
> > with the journal write path.
> > 
> > We also can't put strict bounds on the number of keys in the journal
> > destined for the write buffer, which means we might overflow the size of
> > the preallocated buffer and have to reallocate - this introduces a
> > potentially fatal memory allocation failure. This is something we'll
> > have to watch for, if it becomes an issue in practice we can do
> > additional mitigation.
> > 
> > The transaction commit path no longer has to explicitly check if the
> > write buffer is full and wait on flushing; this is another performance
> > optimization. Instead, when the btree write buffer is close to full we
> > change the journal watermark, so that only reservations for journal
> > reclaim are allowed.
> > 
> > Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
> 
> Thanks for your patch, which is the closest match to commit
> 9e5a6c7797b240f1 ("bcachefs: btree write buffer now slurps keys from
> journal") in next-20231121.
> 
> > fs/bcachefs/btree_write_buffer.c       | 507 ++++++++++++++++---------
> 
> The committed version has lots of changes, including:
> 
> ++              for (struct wb_key_ref *n = i + 1; n < min(i + 4, &darray_top(wb->sorted)); n++)
> ++                      prefetch(&wb->flushing.keys.data[n->idx]);
> 
> which requires the inclusion of <linux/prefetch.h>, as reported by
> kernel test robot <lkp@intel.com> for arc[1], and by
> noreply@ellerman.id.au for m68k.

It's fixed in the current version of my for-next branch (noticed when I
imported the code into -tools).


end of thread, other threads:[~2023-11-21 16:52 UTC | newest]

Thread overview: 29+ messages
2023-11-10 16:31 [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
2023-11-10 16:31 ` [PATCH 01/17] bcachefs: Kill journal pre-reservations Kent Overstreet
2023-11-10 16:31 ` [PATCH 02/17] bcachefs: track_event_change() Kent Overstreet
2023-11-10 16:31 ` [PATCH 03/17] bcachefs: Journal pins must always have a flush_fn Kent Overstreet
2023-11-13 15:22   ` Brian Foster
2023-11-13 16:36     ` Kent Overstreet
2023-11-13 17:08       ` Brian Foster
2023-11-10 16:31 ` [PATCH 04/17] bcachefs: BTREE_INSERT_JOURNAL_REPLAY now "don't init trans->journal_res" Kent Overstreet
2023-11-10 16:31 ` [PATCH 05/17] bcachefs: Kill BTREE_UPDATE_PREJOURNAL Kent Overstreet
2023-11-13 15:29   ` Brian Foster
2023-11-13 16:49     ` Kent Overstreet
2023-11-14 13:17       ` Brian Foster
2023-11-10 16:31 ` [PATCH 06/17] bcachefs: Go rw before journal replay Kent Overstreet
2023-11-10 16:31 ` [PATCH 07/17] bcachefs: Make journal replay more efficient Kent Overstreet
2023-11-14 13:19   ` Brian Foster
2023-11-15  1:50     ` Kent Overstreet
2023-11-10 16:31 ` [PATCH 08/17] bcachefs: Don't flush journal after replay Kent Overstreet
2023-11-10 16:31 ` [PATCH 09/17] bcachefs: Unwritten journal buffers are always dirty Kent Overstreet
2023-11-10 16:31 ` [PATCH 10/17] bcachefs: journal->buf_lock Kent Overstreet
2023-11-10 16:31 ` [PATCH 11/17] bcachefs: bch2_journal_block_reservations() Kent Overstreet
2023-11-10 16:31 ` [PATCH 12/17] bcachefs: Clean up btree write buffer write ref handling Kent Overstreet
2023-11-10 16:31 ` [PATCH 13/17] bcachefs: bch2_btree_write_buffer_flush_locked() Kent Overstreet
2023-11-10 16:31 ` [PATCH 14/17] bcachefs: bch2_btree_write_buffer_flush() -> bch2_btree_write_buffer_tryflush() Kent Overstreet
2023-11-10 16:31 ` [PATCH 15/17] bcachefs: Improve btree write buffer tracepoints Kent Overstreet
2023-11-10 16:31 ` [PATCH 16/17] bcachefs: btree write buffer now slurps keys from journal Kent Overstreet
2023-11-21 10:56   ` Geert Uytterhoeven
2023-11-21 16:52     ` Kent Overstreet
2023-11-10 16:31 ` [PATCH 17/17] bcachefs: Inline btree write buffer sort Kent Overstreet
2023-11-10 16:42 ` [PATCH 00/17] btree write buffer & journal optimizations Kent Overstreet
