public inbox for linux-bcachefs@vger.kernel.org
 help / color / mirror / Atom feed
From: Brian Foster <bfoster@redhat.com>
To: linux-bcachefs@vger.kernel.org
Subject: [PATCH 5/5] bcachefs: use prejournaled key updates for write buffer flushes
Date: Wed, 19 Jul 2023 08:53:06 -0400	[thread overview]
Message-ID: <20230719125306.109342-6-bfoster@redhat.com> (raw)
In-Reply-To: <20230719125306.109342-1-bfoster@redhat.com>

The write buffer mechanism journals keys twice in certain
situations. A key is always journaled on write buffer insertion, and
is potentially journaled again if a write buffer flush falls into
either of the slow btree insert paths. This has shown to cause
journal recovery ordering problems in the event of an untimely
crash.

For example, consider if a key is inserted into index 0 of a write
buffer, the active write buffer switches to index 1, the key is
deleted in index 1, and then index 0 is flushed. If the original key
is rejournaled in the btree update from the index 0 flush, the (now
deleted) key is journaled in a seq buffer ahead of the latest
version of key (which was journaled when the key was deleted in
index 1). If the fs crashes while this is still observable in the
log, recovery sees the key from the btree update after the delete
key from the write buffer insert, which is the incorrect order. This
problem is occasionally reproduced by generic/388 and generally
manifests as one or more backpointer entry inconsistencies.

To avoid this problem, never rejournal write buffered key updates to
the associated btree. Instead, use prejournaled key updates to pass
the journal seq of the write buffer insert down to the btree insert,
which updates the btree leaf pin to reflect the seq of the key.

Note that tracking the seq is required instead of just using
NOJOURNAL here because otherwise we lose protection of the write
buffer pin when the buffer is flushed, which means the key can fall
off the tail of the on-disk journal before the btree leaf is flushed
and lead to similar recovery inconsistencies.

Signed-off-by: Brian Foster <bfoster@redhat.com>
---
 fs/bcachefs/btree_write_buffer.c | 30 ++++++++++++++++++++++++++++--
 1 file changed, 28 insertions(+), 2 deletions(-)

diff --git a/fs/bcachefs/btree_write_buffer.c b/fs/bcachefs/btree_write_buffer.c
index 6c30a72e6eee..5f96db539fd7 100644
--- a/fs/bcachefs/btree_write_buffer.c
+++ b/fs/bcachefs/btree_write_buffer.c
@@ -75,7 +75,7 @@ static int bch2_btree_write_buffer_flush_one(struct btree_trans *trans,
 	}
 	return 0;
 trans_commit:
-	return  bch2_trans_update(trans, iter, &wb->k, 0) ?:
+	return  bch2_trans_update_seq(trans, wb->journal_seq, iter, &wb->k, 0) ?:
 		bch2_trans_commit(trans, NULL, NULL,
 				  commit_flags|
 				  BTREE_INSERT_NOCHECK_RW|
@@ -103,6 +103,32 @@ static union btree_write_buffer_state btree_write_buffer_switch(struct btree_wri
 	return old;
 }
 
+/*
+ * Update a btree with a write buffered key using the journal seq of the
+ * original write buffer insert.
+ *
+ * It is not safe to rejournal the key once it has been inserted into the write
+ * buffer because that may break recovery ordering. For example, the key may
+ * have already been modified in the active write buffer in a seq that comes
+ * before the current transaction. If we were to journal this key again and
+ * crash, recovery would process updates in the wrong order.
+ */
+static int
+btree_write_buffered_insert(struct btree_trans *trans,
+			  struct btree_write_buffered_key *wb)
+{
+	struct btree_iter iter;
+	int ret;
+
+	bch2_trans_iter_init(trans, &iter, wb->btree, bkey_start_pos(&wb->k.k),
+			     BTREE_ITER_CACHED|BTREE_ITER_INTENT);
+
+	ret   = bch2_btree_iter_traverse(&iter) ?:
+		bch2_trans_update_seq(trans, wb->journal_seq, &iter, &wb->k, 0);
+	bch2_trans_iter_exit(trans, &iter);
+	return ret;
+}
+
 int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_flags,
 				    bool locked)
 {
@@ -238,7 +264,7 @@ int __bch2_btree_write_buffer_flush(struct btree_trans *trans, unsigned commit_f
 				commit_flags|
 				BTREE_INSERT_NOFAIL|
 				BTREE_INSERT_JOURNAL_RECLAIM,
-				__bch2_btree_insert(trans, i->btree, &i->k, 0));
+				btree_write_buffered_insert(trans, i));
 		if (bch2_fs_fatal_err_on(ret, c, "%s: insert error %s", __func__, bch2_err_str(ret)))
 			break;
 	}
-- 
2.40.1


  parent reply	other threads:[~2023-07-19 12:51 UTC|newest]

Thread overview: 8+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-19 12:53 [PATCH 0/5] bcachefs: write buffer journaling fixes Brian Foster
2023-07-19 12:53 ` [PATCH 1/5] bcachefs: remove duplicate code between backpointer update paths Brian Foster
2023-07-19 12:53 ` [PATCH 2/5] bcachefs: remove unnecessary btree_insert_key_leaf() wrapper Brian Foster
2023-07-19 12:53 ` [PATCH 3/5] bcachefs: fold bch2_trans_update_by_path_trace() into callers Brian Foster
2023-07-19 12:53 ` [PATCH 4/5] bcachefs: support btree updates of prejournaled keys Brian Foster
2023-07-19 12:53 ` Brian Foster [this message]
2023-07-20 21:26   ` [PATCH 5/5] bcachefs: use prejournaled key updates for write buffer flushes Kent Overstreet
2023-07-21 13:09     ` Brian Foster

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230719125306.109342-6-bfoster@redhat.com \
    --to=bfoster@redhat.com \
    --cc=linux-bcachefs@vger.kernel.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox