* [PATCH 1/5] bcachefs: remove duplicate code between backpointer update paths
2023-07-19 12:53 [PATCH 0/5] bcachefs: write buffer journaling fixes Brian Foster
From: Brian Foster @ 2023-07-19 12:53 UTC (permalink / raw)
To: linux-bcachefs
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/bcachefs/backpointers.c | 18 +-----------------
fs/bcachefs/backpointers.h | 8 ++++----
2 files changed, 5 insertions(+), 21 deletions(-)
diff --git a/fs/bcachefs/backpointers.c b/fs/bcachefs/backpointers.c
index 7c1e6546d054..8747c5e19f99 100644
--- a/fs/bcachefs/backpointers.c
+++ b/fs/bcachefs/backpointers.c
@@ -134,31 +134,15 @@ static noinline int backpointer_mod_err(struct btree_trans *trans,
}
int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *trans,
- struct bpos bucket,
+ struct bkey_i_backpointer *bp_k,
struct bch_backpointer bp,
struct bkey_s_c orig_k,
bool insert)
{
- struct bch_fs *c = trans->c;
- struct bkey_i_backpointer *bp_k;
struct btree_iter bp_iter;
struct bkey_s_c k;
int ret;
- bp_k = bch2_trans_kmalloc_nomemzero(trans, sizeof(struct bkey_i_backpointer));
- ret = PTR_ERR_OR_ZERO(bp_k);
- if (ret)
- return ret;
-
- bkey_backpointer_init(&bp_k->k_i);
- bp_k->k.p = bucket_pos_to_bp(c, bucket, bp.bucket_offset);
- bp_k->v = bp;
-
- if (!insert) {
- bp_k->k.type = KEY_TYPE_deleted;
- set_bkey_val_u64s(&bp_k->k, 0);
- }
-
k = bch2_bkey_get_iter(trans, &bp_iter, BTREE_ID_backpointers,
bp_k->k.p,
BTREE_ITER_INTENT|
diff --git a/fs/bcachefs/backpointers.h b/fs/bcachefs/backpointers.h
index 87e31aa1975c..547e0617602a 100644
--- a/fs/bcachefs/backpointers.h
+++ b/fs/bcachefs/backpointers.h
@@ -54,7 +54,7 @@ static inline struct bpos bucket_pos_to_bp(const struct bch_fs *c,
return ret;
}
-int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *, struct bpos,
+int bch2_bucket_backpointer_mod_nowritebuffer(struct btree_trans *, struct bkey_i_backpointer *,
struct bch_backpointer, struct bkey_s_c, bool);
static inline int bch2_bucket_backpointer_mod(struct btree_trans *trans,
@@ -67,9 +67,6 @@ static inline int bch2_bucket_backpointer_mod(struct btree_trans *trans,
struct bkey_i_backpointer *bp_k;
int ret;
- if (unlikely(bch2_backpointers_no_use_write_buffer))
- return bch2_bucket_backpointer_mod_nowritebuffer(trans, bucket, bp, orig_k, insert);
-
bp_k = bch2_trans_kmalloc_nomemzero(trans, sizeof(struct bkey_i_backpointer));
ret = PTR_ERR_OR_ZERO(bp_k);
if (ret)
@@ -84,6 +81,9 @@ static inline int bch2_bucket_backpointer_mod(struct btree_trans *trans,
set_bkey_val_u64s(&bp_k->k, 0);
}
+ if (unlikely(bch2_backpointers_no_use_write_buffer))
+ return bch2_bucket_backpointer_mod_nowritebuffer(trans, bp_k, bp, orig_k, insert);
+
return bch2_trans_update_buffered(trans, BTREE_ID_backpointers, &bp_k->k_i);
}
--
2.40.1
^ permalink raw reply related [flat|nested] 8+ messages in thread
* [PATCH 2/5] bcachefs: remove unnecessary btree_insert_key_leaf() wrapper
From: Brian Foster @ 2023-07-19 12:53 UTC (permalink / raw)
To: linux-bcachefs
This is in preparation for supporting prejournaled keys. We want the
ability to optionally pass a seq stored in the btree update rather
than using the seq of the committing transaction.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/bcachefs/btree_update_leaf.c | 8 +-------
1 file changed, 1 insertion(+), 7 deletions(-)
diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
index 3638cef211b2..11b992282032 100644
--- a/fs/bcachefs/btree_update_leaf.c
+++ b/fs/bcachefs/btree_update_leaf.c
@@ -290,12 +290,6 @@ inline void bch2_btree_insert_key_leaf(struct btree_trans *trans,
bch2_trans_node_reinit_iter(trans, b);
}
-static void btree_insert_key_leaf(struct btree_trans *trans,
- struct btree_insert_entry *insert)
-{
- bch2_btree_insert_key_leaf(trans, insert->path, insert->k, trans->journal_res.seq);
-}
-
/* Cached btree updates: */
/* Normal update interface: */
@@ -753,7 +747,7 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
i->k->k.needs_whiteout = false;
if (!i->cached)
- btree_insert_key_leaf(trans, i);
+ bch2_btree_insert_key_leaf(trans, i->path, i->k, trans->journal_res.seq);
else if (!i->key_cache_already_flushed)
bch2_btree_insert_key_cached(trans, flags, i);
else {
--
2.40.1
* [PATCH 3/5] bcachefs: fold bch2_trans_update_by_path_trace() into callers
From: Brian Foster @ 2023-07-19 12:53 UTC (permalink / raw)
To: linux-bcachefs
There is only one other caller, so eliminate some boilerplate.
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/bcachefs/btree_update_leaf.c | 27 ++++++++-------------------
1 file changed, 8 insertions(+), 19 deletions(-)
diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
index 11b992282032..319286294d6a 100644
--- a/fs/bcachefs/btree_update_leaf.c
+++ b/fs/bcachefs/btree_update_leaf.c
@@ -66,7 +66,8 @@ static void verify_update_old_key(struct btree_trans *trans, struct btree_insert
static int __must_check
bch2_trans_update_by_path(struct btree_trans *, struct btree_path *,
- struct bkey_i *, enum btree_update_flags);
+ struct bkey_i *, enum btree_update_flags,
+ unsigned long ip);
static inline int btree_insert_entry_cmp(const struct btree_insert_entry *l,
const struct btree_insert_entry *r)
@@ -1489,7 +1490,7 @@ int bch2_trans_update_extent(struct btree_trans *trans,
ret = bch2_trans_update_by_path(trans, iter.path, update,
BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE|
- flags);
+ flags, _RET_IP_);
if (ret)
goto err;
goto out;
@@ -1527,11 +1528,6 @@ int bch2_trans_update_extent(struct btree_trans *trans,
return ret;
}
-static int __must_check
-bch2_trans_update_by_path_trace(struct btree_trans *trans, struct btree_path *path,
- struct bkey_i *k, enum btree_update_flags flags,
- unsigned long ip);
-
static noinline int flush_new_cached_update(struct btree_trans *trans,
struct btree_path *path,
struct btree_insert_entry *i,
@@ -1562,16 +1558,16 @@ static noinline int flush_new_cached_update(struct btree_trans *trans,
i->flags |= BTREE_TRIGGER_NORUN;
btree_path_set_should_be_locked(btree_path);
- ret = bch2_trans_update_by_path_trace(trans, btree_path, i->k, flags, ip);
+ ret = bch2_trans_update_by_path(trans, btree_path, i->k, flags, ip);
out:
bch2_path_put(trans, btree_path, true);
return ret;
}
static int __must_check
-bch2_trans_update_by_path_trace(struct btree_trans *trans, struct btree_path *path,
- struct bkey_i *k, enum btree_update_flags flags,
- unsigned long ip)
+bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
+ struct bkey_i *k, enum btree_update_flags flags,
+ unsigned long ip)
{
struct bch_fs *c = trans->c;
struct btree_insert_entry *i, n;
@@ -1650,13 +1646,6 @@ bch2_trans_update_by_path_trace(struct btree_trans *trans, struct btree_path *pa
return 0;
}
-static inline int __must_check
-bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
- struct bkey_i *k, enum btree_update_flags flags)
-{
- return bch2_trans_update_by_path_trace(trans, path, k, flags, _RET_IP_);
-}
-
int __must_check bch2_trans_update(struct btree_trans *trans, struct btree_iter *iter,
struct bkey_i *k, enum btree_update_flags flags)
{
@@ -1717,7 +1706,7 @@ int __must_check bch2_trans_update(struct btree_trans *trans, struct btree_iter
path = iter->key_cache_path;
}
- return bch2_trans_update_by_path(trans, path, k, flags);
+ return bch2_trans_update_by_path(trans, path, k, flags, _RET_IP_);
}
int __must_check bch2_trans_update_buffered(struct btree_trans *trans,
--
2.40.1
* [PATCH 4/5] bcachefs: support btree updates of prejournaled keys
From: Brian Foster @ 2023-07-19 12:53 UTC (permalink / raw)
To: linux-bcachefs
Introduce support for prejournaled key updates. This allows a
transaction to commit an update for a key that already exists (and
is pinned) in the journal. This is required for btree write buffer
updates as the current scheme of journaling both on write buffer
insertion and write buffer (slow path) flush is unsafe in certain
crash recovery scenarios.
Create a small trans update wrapper to pass along the seq where the
key resides into the btree_insert_entry. From there, trans commit
passes the seq into the btree insert path where it is used to manage
the journal pin for the associated btree leaf.
Note that this patch only introduces the underlying mechanism and
otherwise includes no functional changes.
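The seq selection described above can be sketched as follows. This is
illustrative only; the flag value and struct layout here are hypothetical
stand-ins, not the actual bcachefs definitions:

```c
#include <assert.h>

/* Hypothetical stand-ins for the real bcachefs definitions. */
#define BTREE_UPDATE_PREJOURNAL	(1U << 0)

struct insert_entry {
	unsigned		flags;
	unsigned long long	seq;	/* seq of a prejournaled key */
};

/*
 * At commit time, a prejournaled entry keeps the seq recorded when the
 * update was queued; any other entry uses the seq allocated for the
 * committing transaction's journal reservation.
 */
static unsigned long long
pick_insert_seq(const struct insert_entry *i, unsigned long long trans_seq)
{
	return (i->flags & BTREE_UPDATE_PREJOURNAL) ? i->seq : trans_seq;
}
```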
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/bcachefs/bkey_methods.h | 2 ++
fs/bcachefs/btree_types.h | 1 +
fs/bcachefs/btree_update.h | 2 ++
fs/bcachefs/btree_update_leaf.c | 34 ++++++++++++++++++++++++++++++---
4 files changed, 36 insertions(+), 3 deletions(-)
diff --git a/fs/bcachefs/bkey_methods.h b/fs/bcachefs/bkey_methods.h
index 0f3dc156adf6..f4e60d2e6677 100644
--- a/fs/bcachefs/bkey_methods.h
+++ b/fs/bcachefs/bkey_methods.h
@@ -97,6 +97,7 @@ static inline int bch2_mark_key(struct btree_trans *trans,
enum btree_update_flags {
__BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE = __BTREE_ITER_FLAGS_END,
__BTREE_UPDATE_NOJOURNAL,
+ __BTREE_UPDATE_PREJOURNAL,
__BTREE_UPDATE_KEY_CACHE_RECLAIM,
__BTREE_TRIGGER_NORUN, /* Don't run triggers at all */
@@ -111,6 +112,7 @@ enum btree_update_flags {
#define BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE (1U << __BTREE_UPDATE_INTERNAL_SNAPSHOT_NODE)
#define BTREE_UPDATE_NOJOURNAL (1U << __BTREE_UPDATE_NOJOURNAL)
+#define BTREE_UPDATE_PREJOURNAL (1U << __BTREE_UPDATE_PREJOURNAL)
#define BTREE_UPDATE_KEY_CACHE_RECLAIM (1U << __BTREE_UPDATE_KEY_CACHE_RECLAIM)
#define BTREE_TRIGGER_NORUN (1U << __BTREE_TRIGGER_NORUN)
diff --git a/fs/bcachefs/btree_types.h b/fs/bcachefs/btree_types.h
index 4efc69492612..d953601608d5 100644
--- a/fs/bcachefs/btree_types.h
+++ b/fs/bcachefs/btree_types.h
@@ -390,6 +390,7 @@ struct btree_insert_entry {
u8 old_btree_u64s;
struct bkey_i *k;
struct btree_path *path;
+ u64 seq;
/* key being overwritten: */
struct bkey old_k;
const struct bch_val *old_v;
diff --git a/fs/bcachefs/btree_update.h b/fs/bcachefs/btree_update.h
index f794c9d108b8..256da97f721c 100644
--- a/fs/bcachefs/btree_update.h
+++ b/fs/bcachefs/btree_update.h
@@ -111,6 +111,8 @@ int bch2_bkey_get_empty_slot(struct btree_trans *, struct btree_iter *,
int __must_check bch2_trans_update(struct btree_trans *, struct btree_iter *,
struct bkey_i *, enum btree_update_flags);
+int __must_check bch2_trans_update_seq(struct btree_trans *, u64, struct btree_iter *,
+ struct bkey_i *, enum btree_update_flags);
int __must_check bch2_trans_update_buffered(struct btree_trans *,
enum btree_id, struct bkey_i *);
diff --git a/fs/bcachefs/btree_update_leaf.c b/fs/bcachefs/btree_update_leaf.c
index 319286294d6a..609780f0ce8e 100644
--- a/fs/bcachefs/btree_update_leaf.c
+++ b/fs/bcachefs/btree_update_leaf.c
@@ -747,9 +747,14 @@ bch2_trans_commit_write_locked(struct btree_trans *trans, unsigned flags,
trans_for_each_update(trans, i) {
i->k->k.needs_whiteout = false;
- if (!i->cached)
- bch2_btree_insert_key_leaf(trans, i->path, i->k, trans->journal_res.seq);
- else if (!i->key_cache_already_flushed)
+ if (!i->cached) {
+ u64 seq = trans->journal_res.seq;
+
+ if (i->flags & BTREE_UPDATE_PREJOURNAL)
+ seq = i->seq;
+
+ bch2_btree_insert_key_leaf(trans, i->path, i->k, seq);
+ } else if (!i->key_cache_already_flushed)
bch2_btree_insert_key_cached(trans, flags, i);
else {
bch2_btree_key_cache_drop(trans, i->path);
@@ -1571,12 +1576,21 @@ bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
{
struct bch_fs *c = trans->c;
struct btree_insert_entry *i, n;
+ u64 seq = 0;
int cmp;
EBUG_ON(!path->should_be_locked);
EBUG_ON(trans->nr_updates >= BTREE_ITER_MAX);
EBUG_ON(!bpos_eq(k->k.p, path->pos));
+ /*
+ * The transaction journal res hasn't been allocated at this point.
+ * That occurs at commit time. Reuse the seq field to pass in the seq
+ * of a prejournaled key.
+ */
+ if (flags & BTREE_UPDATE_PREJOURNAL)
+ seq = trans->journal_res.seq;
+
n = (struct btree_insert_entry) {
.flags = flags,
.bkey_type = __btree_node_type(path->level, path->btree_id),
@@ -1585,6 +1599,7 @@ bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
.cached = path->cached,
.path = path,
.k = k,
+ .seq = seq,
.ip_allocated = ip,
};
@@ -1612,6 +1627,7 @@ bch2_trans_update_by_path(struct btree_trans *trans, struct btree_path *path,
i->cached = n.cached;
i->k = n.k;
i->path = n.path;
+ i->seq = n.seq;
i->ip_allocated = n.ip_allocated;
} else {
array_insert_item(trans->updates, trans->nr_updates,
@@ -1709,6 +1725,18 @@ int __must_check bch2_trans_update(struct btree_trans *trans, struct btree_iter
return bch2_trans_update_by_path(trans, path, k, flags, _RET_IP_);
}
+/*
+ * Add a transaction update for a key that has already been journaled.
+ */
+int __must_check bch2_trans_update_seq(struct btree_trans *trans, u64 seq,
+ struct btree_iter *iter, struct bkey_i *k,
+ enum btree_update_flags flags)
+{
+ trans->journal_res.seq = seq;
+ return bch2_trans_update(trans, iter, k, flags|BTREE_UPDATE_NOJOURNAL|
+ BTREE_UPDATE_PREJOURNAL);
+}
+
int __must_check bch2_trans_update_buffered(struct btree_trans *trans,
enum btree_id btree,
struct bkey_i *k)
--
2.40.1
* [PATCH 5/5] bcachefs: use prejournaled key updates for write buffer flushes
From: Brian Foster @ 2023-07-19 12:53 UTC (permalink / raw)
To: linux-bcachefs
The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has been shown to cause
journal recovery ordering problems in the event of an untimely
crash.
For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of the key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.
To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.
Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.
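The ordering hazard above can be illustrated with a toy replay
simulation (illustrative only, not bcachefs code; it models the journal
as a seq-ordered list where the last entry for a key wins):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy journal entry: a key is either (re)inserted or deleted at a seq. */
struct jentry {
	unsigned long long	seq;
	int			key;
	bool			deleted;
};

/* seq 10: key 1 inserted via write buffer index 0
 * seq 11: key 1 deleted via write buffer index 1
 * seq 12: index 0 flush rejournals the original insert (the bug) */
static const struct jentry rejournaled[] = {
	{ 10, 1, false }, { 11, 1, true }, { 12, 1, false },
};

/* With prejournaled updates the flush reuses seq 10, so nothing is
 * appended to the journal after the delete. */
static const struct jentry prejournaled[] = {
	{ 10, 1, false }, { 11, 1, true },
};

/* Replay in seq order (entries assumed sorted); the last entry for a
 * key determines whether it exists after recovery. */
static bool replay_key_exists(const struct jentry *j, int n, int key)
{
	bool exists = false;

	for (int i = 0; i < n; i++)
		if (j[i].key == key)
			exists = !j[i].deleted;
	return exists;
}
```

Replaying `rejournaled` resurrects the deleted key, while `prejournaled`
leaves it deleted, matching the inconsistency described above.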
Signed-off-by: Brian Foster <bfoster@redhat.com>
---
fs/bcachefs/btree_write_buffer.c | 30 ++++++++++++++++++++++++++++--
1 file changed, 28 insertions(+), 2 deletions(-)
diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 6c30a72e6eee..5f96db539fd7 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -75,7 +75,7 @@ static int bch2_btree_write_buffer_flush_one(struct btree_trans *trans,
}
return 0;
trans_commit:
- return bch2_trans_update(trans, iter, &wb->k, 0) ?:
+ return bch2_trans_update_seq(trans, wb->journal_seq, iter, &wb->k, 0) ?:
bch2_trans_commit(trans, NULL, NULL,
commit_flags|
BTREE_INSERT_NOCHECK_RW|
@@ -103,6 +103,32 @@ static union btree_write_buffer_state btree_write_buffer_switch(struct btree_wri
return old;
}
+/*
+ * Update a btree with a write buffered key using the journal seq of the
+ * original write buffer insert.
+ *
+ * It is not safe to rejournal the key once it has been inserted into the write
+ * buffer because that may break recovery ordering. For example, the key may
+ * have already been modified in the active write buffer in a seq that comes
+ * before the current transaction. If we were to journal this key again and
+ * crash, recovery would process updates in the wrong order.
+ */
+static int
+btree_write_buffered_insert(struct btree_trans *trans,
+ struct btree_write_buffered_key *wb)
+{
+ struct btree_iter iter;
+ int ret;
+
+ bch2_trans_iter_init(trans, &iter, wb->btree, bkey_start_pos(&wb->k.k),
+ BTREE_ITER_CACHED|BTREE_ITER_INTENT);
+
+ ret = bch2_btree_iter_traverse(&iter) ?:
+ bch2_trans_update_seq(trans, wb->journal_seq, &iter, &wb->k, 0);
+ bch2_trans_iter_exit(trans, &iter);
+ return ret;
+}
+
int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_flags,
bool locked)
{
@@ -238,7 +264,7 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
commit_flags|
BTREE_INSERT_NOFAIL|
BTREE_INSERT_JOURNAL_RECLAIM,
- __bch2_btree_insert(trans, i->btree, &i->k, 0));
+ btree_write_buffered_insert(trans, i));
if (bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret)))
break;
}
--
2.40.1
* Re: [PATCH 5/5] bcachefs: use prejournaled key updates for write buffer flushes
From: Kent Overstreet @ 2023-07-20 21:26 UTC (permalink / raw)
To: Brian Foster; +Cc: linux-bcachefs
On Wed, Jul 19, 2023 at 08:53:06AM -0400, Brian Foster wrote:
> The write buffer mechanism journals keys twice in certain
> situations. A key is always journaled on write buffer insertion, and
> is potentially journaled again if a write buffer flush falls into
> either of the slow btree insert paths. This has been shown to cause
> journal recovery ordering problems in the event of an untimely
> crash.
>
> For example, consider if a key is inserted into index 0 of a write
> buffer, the active write buffer switches to index 1, the key is
> deleted in index 1, and then index 0 is flushed. If the original key
> is rejournaled in the btree update from the index 0 flush, the (now
> deleted) key is journaled in a seq buffer ahead of the latest
> version of the key (which was journaled when the key was deleted in
> index 1). If the fs crashes while this is still observable in the
> log, recovery sees the key from the btree update after the delete
> key from the write buffer insert, which is the incorrect order. This
> problem is occasionally reproduced by generic/388 and generally
> manifests as one or more backpointer entry inconsistencies.
>
> To avoid this problem, never rejournal write buffered key updates to
> the associated btree. Instead, use prejournaled key updates to pass
> the journal seq of the write buffer insert down to the btree insert,
> which updates the btree leaf pin to reflect the seq of the key.
>
> Note that tracking the seq is required instead of just using
> NOJOURNAL here because otherwise we lose protection of the write
> buffer pin when the buffer is flushed, which means the key can fall
> off the tail of the on-disk journal before the btree leaf is flushed
> and lead to similar recovery inconsistencies.
Fairly simple, and it looks like this fixes all the backpointer errors
- nice, applied.
So journal replay does something similar, where we're doing a btree
update for a key that already lives in a particular journal entry and
the dependency needs to be recorded. Could you have a quick look to see
if there's any cleanups/unification worth doing there? Or if not, just
leave a note.
Also, did you look at adding the tracepoint we talked about, for when
we're doing btree updates that are already at the end of the journal and
immediately need to be flushed?
* Re: [PATCH 5/5] bcachefs: use prejournaled key updates for write buffer flushes
From: Brian Foster @ 2023-07-21 13:09 UTC (permalink / raw)
To: Kent Overstreet; +Cc: linux-bcachefs
On Thu, Jul 20, 2023 at 05:26:57PM -0400, Kent Overstreet wrote:
> On Wed, Jul 19, 2023 at 08:53:06AM -0400, Brian Foster wrote:
> > The write buffer mechanism journals keys twice in certain
> > situations. A key is always journaled on write buffer insertion, and
> > is potentially journaled again if a write buffer flush falls into
> > either of the slow btree insert paths. This has been shown to cause
> > journal recovery ordering problems in the event of an untimely
> > crash.
> >
> > For example, consider if a key is inserted into index 0 of a write
> > buffer, the active write buffer switches to index 1, the key is
> > deleted in index 1, and then index 0 is flushed. If the original key
> > is rejournaled in the btree update from the index 0 flush, the (now
> > deleted) key is journaled in a seq buffer ahead of the latest
> > version of the key (which was journaled when the key was deleted in
> > index 1). If the fs crashes while this is still observable in the
> > log, recovery sees the key from the btree update after the delete
> > key from the write buffer insert, which is the incorrect order. This
> > problem is occasionally reproduced by generic/388 and generally
> > manifests as one or more backpointer entry inconsistencies.
> >
> > To avoid this problem, never rejournal write buffered key updates to
> > the associated btree. Instead, use prejournaled key updates to pass
> > the journal seq of the write buffer insert down to the btree insert,
> > which updates the btree leaf pin to reflect the seq of the key.
> >
> > Note that tracking the seq is required instead of just using
> > NOJOURNAL here because otherwise we lose protection of the write
> > buffer pin when the buffer is flushed, which means the key can fall
> > off the tail of the on-disk journal before the btree leaf is flushed
> > and lead to similar recovery inconsistencies.
>
> Fairly simple, and it looks like this fixes all the backpointers errors
> - nice, applied.
>
Thanks!
> So journal replay does something similar, where we're doing a btree
> update for a key that already lives in a particular journal entry and
> the dependency needs to be recorded. Could you have a quick look to see
> if there's any cleanups/unification worth doing there? Or if not, just
> leave a note.
>
Interesting. IIRC, one of the purposes of generic/388 was to test
crashes during recovery itself, so we should have at least some coverage
in that area.
Can you point to an example or area of code where this occurs with
recovery? (Or do you mean recovery in general has this rejournaling
behavior?)
> Also, did you look at adding the tracepoint we talked about, for when
> we're doing btree updates that are already at the end of the journal and
> immediately need to be flushed?
>
Not yet.. I wanted to get this worked out first and then ran into some
sort of contention issue during some fstests that I had to dig into
before I got a chance to think about it. I'll send a separate mail on
that one and will probably have to pick up both once I'm back..
Brian